# Using Keras and spaCy for NLP tasks

## Extracting data from the corpus

We will have a look at the subtitle corpus that will be used in the assignment.

The dataset is structured with one line per utterance, except for special lines starting with `###`, which mark the start of a new subtitle and hold its metadata as JSON:

In [53]:
!head -n 20 /nr/samba/user/plison/code/grounding/outputs/no-all.txt

### {"file": "OpenSubtitles/raw/no/0/1115475/4558788.xml", "genre": "Documentary", "duration2": 1399.78, "tokens": "2365", "sentences": "255", "language": "Norwegian", "rating": "4.0", "blocks": "257"}
For å forstå hvordan en storby virker må man løfte på huden og blottlegge den skjulte livsnerven.
Et ufattelig komplisert system som er nødvendig for alle, men begripelig for få.
Her begynner vår oppdagelsesferd under overflaten i verdens storbyer.
London, en gang imperiets hovedstad og fortsatt et av verdens knutepunkt.
London ble bygget ved en elv.
Byen overlever ved hjelp av tre transportårer:
Langs vann, på land og i luften.
Alle må holdes åpne for at London skal klare seg.
For å unngå katastrofer overvåkes London dag og natt.
Og dette er vaktene:
Kameraer.
Sensorer.
Radarer.
London er jordens mest overvåkede by.
Storebror følger med døgnet rundt:
På land, langs elva og i lufta.
Heathrow - porten til London.
Ingen annen flyplass gjør så mye med så lite.
En halv million flygninger per
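The header lines can be told apart from utterances and decoded with the standard `json` module. The following is a minimal sketch, assuming that everything after the `###` prefix is valid JSON, as in the line shown above; the helper name `parse_header` is hypothetical and the header is shortened for readability.

```python
import json

def parse_header(line):
    """Return the metadata dict of a '###' header line, or None for an utterance line."""
    if not line.startswith("###"):
        return None
    return json.loads(line[3:].strip())

# Shortened version of the header line shown above
header = ('### {"file": "OpenSubtitles/raw/no/0/1115475/4558788.xml", '
          '"genre": "Documentary", "language": "Norwegian", "sentences": "255"}')
meta = parse_header(header)
print(meta["genre"], meta["language"], int(meta["sentences"]))
```

Note that counts such as `sentences` and `tokens` are stored as strings in the metadata, so they need an explicit `int(...)` conversion before use.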

We start by extracting 1000 subtitles from the text data:

In [54]:
nb_subtitles = 1000
dialogues = []
with open("/nr/samba/user/plison/code/grounding/outputs/no-all.txt") as fd:
    for line in fd:
        if line.startswith("###"):
            # A header line opens a new subtitle
            dialogues.append([])
            # Stop after nb_subtitles headers (the list for the last header stays empty)
            if len(dialogues) >= nb_subtitles:
                break
        else:
            dialogues[-1].append(line.rstrip("\n"))
        

Each dialogue is a list of utterances:

In [55]:
dialogues[10][:20]

['Folk frykter at Grace plutselig skal gjøre noe som gjør stor skade.',
 'Hvorfor vil du ikke snakke om det?',
 'Du har ikke snakket med meg!',
 'Lensmann Jansen og onkelen min, de tjenestegjorde sammen.',
 '-Hei, er det Johansson, journalisten?',
 '-Ja, det er meg.',
 '-Jeg heter Elise og...',
 '-Beklager, jeg kan ikke.',
 'Bjørn sier han kan hjelpe, men...',
 'Ja det nytter iallfall ikke å gjøre avtaler med det onde.',
 'Faen!',
 '-Hvem var han?',
 '-De aner ikke.',
 'Vi mistenker at han kom over på russisk side og ledet oppdrag derfra.',
 'Mia Holt og Thomas Lønnhøiden, de skal stanses.',
 '--==DBRETAiL==-- Released on Danishbits.org',
 '-Dette er risikabelt.',
 '-Slapp av nå.',
 '-Ikke så lenge vi ikke har en plan.',
 '-Tror du ikke jeg har det?']

We now have to tokenise and lemmatise the dialogues. The easiest way is to use spaCy.

In [56]:
import spacy
nlp = spacy.load("nb_core_news_sm")  # This is the standard Spacy model for Norwegian Bokmål

In [57]:
doc = nlp("Pierre ga boken til Jan Tore mens de var på Universitetet.")
for tok in doc:
    print(tok, "with POS tag:", tok.tag_, "and dependency relation:", tok.dep_, "with", tok.head, "as head")
for ent in doc.ents:
    print(ent, ent.label_)

Pierre with POS tag: PROPN___ and dependency relation: nsubj with ga as head
ga with POS tag: VERB__Mood=Ind|Tense=Past|VerbForm=Fin and dependency relation: ROOT with ga as head
boken with POS tag: NOUN__Definite=Def|Gender=Masc|Number=Sing and dependency relation: dobj with ga as head
til with POS tag: ADP___ and dependency relation: case with Jan as head
Jan with POS tag: PROPN__Gender=Masc and dependency relation: nmod with ga as head
Tore with POS tag: PROPN__Gender=Masc and dependency relation: name with Jan as head
mens with POS tag: SCONJ___ and dependency relation: mark with Universitetet as head
de with POS tag: PRON__Case=Nom|Number=Plur|Person=3|PronType=Prs and dependency relation: nsubj with Universitetet as head
var with POS tag: VERB__Mood=Ind|Tense=Past|VerbForm=Fin and dependency relation: cop with Universitetet as head
på with POS tag: ADP___ and dependency relation: case with Universitetet as head
Universitetet with POS tag: PROPN___ and dependency relation: advcl w

We run the tokenisation on the texts:

In [58]:
nlp = spacy.load("nb_core_news_sm", disable=["tagger", "parser", "ner"])  # Only the tokeniser is needed here, so the other components are disabled
for i, dialogue in enumerate(dialogues):
    for j, utterance in enumerate(nlp.pipe(dialogue)):
        dialogues[i][j] = [tok.lower_ for tok in utterance]
    if i % 100 == 0:
        print("Number of tokenised subtitles:", i)


Number of tokenised subtitles: 0
Number of tokenised subtitles: 100
Number of tokenised subtitles: 200
Number of tokenised subtitles: 300
Number of tokenised subtitles: 400
Number of tokenised subtitles: 500
Number of tokenised subtitles: 600
Number of tokenised subtitles: 700
Number of tokenised subtitles: 800
Number of tokenised subtitles: 900


In [59]:
dialogues[0][0]

['for',
 'å',
 'forstå',
 'hvordan',
 'en',
 'storby',
 'virker',
 'må',
 'man',
 'løfte',
 'på',
 'huden',
 'og',
 'blottlegge',
 'den',
 'skjulte',
 'livsnerven',
 '.']

## Building a simple neural network for NLP

Let's start with a simple toy example: we wish to predict whether the next utterance will be a _clarification ellipsis_, such as "Mark killed everyone." --> "Mark?".

In [60]:
import numpy as np
import keras

max_utterance_length = 32
utterance_input = keras.layers.Input((max_utterance_length,), dtype=np.int32)

vocab_size = 10000
embedding = keras.layers.Embedding(input_dim=vocab_size, output_dim=100)
utterance_word_embeddings = embedding(utterance_input)
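Conceptually, an embedding layer is a lookup table with one trainable vector per vocabulary index. The sketch below illustrates this in plain Python with hypothetical toy sizes (5 words, 3 dimensions); it is only meant to show the idea and is not how Keras implements the layer.

```python
import random

random.seed(0)
toy_vocab_size, toy_dim = 5, 3   # hypothetical toy sizes, much smaller than above

# One randomly initialised vector per vocabulary index
table = [[random.uniform(-0.05, 0.05) for _ in range(toy_dim)]
         for _ in range(toy_vocab_size)]

def embed(token_indices):
    """Replace every token index by its row in the embedding table."""
    return [table[idx] for idx in token_indices]

vectors = embed([2, 4, 0, 0])   # an utterance of two tokens, padded to length 4
print(len(vectors), len(vectors[0]))   # 4 3
```

An utterance of `max_utterance_length` indices thus becomes a sequence of `max_utterance_length` vectors, which matches the `(None, 32, 100)` output shape in the model summaries.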

A simple approach is then to perform a max pooling of all the embeddings, followed by a dense layer for the final prediction:

In [61]:
pooling = keras.layers.GlobalMaxPooling1D()
utterance_embedding = pooling(utterance_word_embeddings)

output = keras.layers.Dense(1, activation="sigmoid")
prediction = output(utterance_embedding)

model = keras.models.Model(utterance_input, prediction)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()

Model: "model_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_2 (InputLayer)         (None, 32)                0         
_________________________________________________________________
embedding_2 (Embedding)      (None, 32, 100)           1000000   
_________________________________________________________________
global_max_pooling1d_2 (Glob (None, 100)               0         
_________________________________________________________________
dense_3 (Dense)              (None, 1)                 101       
Total params: 1,000,101
Trainable params: 1,000,101
Non-trainable params: 0
_________________________________________________________________
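To make the pooling step concrete, here is a small sketch of what a global max pooling over word embeddings computes: for each embedding dimension, it keeps the largest value found across all words of the utterance. The function name and the toy vectors are made up for illustration.

```python
def global_max_pool(word_vectors):
    """Element-wise maximum over the time (word) dimension."""
    return [max(dim_values) for dim_values in zip(*word_vectors)]

# Three words, each with a 3-dimensional embedding (toy values)
toy_utterance = [
    [0.1, -0.3, 0.7],
    [0.5, 0.2, -0.1],
    [-0.4, 0.9, 0.0],
]
pooled = global_max_pool(toy_utterance)
print(pooled)   # [0.5, 0.9, 0.7]
```

The operation has no trainable weights, which is why the pooling layer shows 0 parameters in the summary: the 1,000,000 parameters come from the embedding table (10000 x 100) and the remaining 101 from the dense layer (100 weights plus one bias).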


## Preparing the data

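We build a vocabulary from the most frequent words. Index 0 is reserved for padding and index 1 for out-of-vocabulary tokens, so real tokens start at index 2: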

In [62]:
counts = {}
for dialogue in dialogues:
    for utterance in dialogue:
        for tok in utterance:
            counts[tok] = counts.get(tok, 0) + 1

sorted_toks = sorted(counts.keys(), key=lambda x: counts[x], reverse=True)
vocab_mapping = {tok:(i+2) for i, tok in enumerate(sorted_toks) if i < vocab_size-2}
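The same idea can be tried out on a toy corpus. This sketch uses `collections.Counter` instead of a plain dictionary, which is an alternative to the code above rather than what the notebook does; the names prefixed with `toy_` and the tiny vocabulary size are assumptions chosen for illustration.

```python
from collections import Counter

toy_utterances = [["hei", "du", "."], ["hei", "."], ["."]]
toy_vocab_size = 4   # hypothetical: 2 reserved indices + 2 real tokens

# Count token frequencies over all utterances
toy_counts = Counter(tok for utt in toy_utterances for tok in utt)

# Most frequent tokens get indices 2, 3, ... (0 and 1 stay reserved)
toy_mapping = {tok: i + 2
               for i, (tok, _) in enumerate(toy_counts.most_common(toy_vocab_size - 2))}

print(toy_mapping)                                               # {'.': 2, 'hei': 3}
print([toy_mapping.get(tok, 1) for tok in ["hei", "verden", "."]])   # [3, 1, 2]
```

Tokens that fall outside the vocabulary, such as "du" and "verden" here, are all mapped to index 1.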


We can then map tokens to indices:

In [63]:
input_data = []
target_data = []
for i, dialogue in enumerate(dialogues):
    for j, utterance in enumerate(dialogue):
        token_indices = [vocab_mapping.get(tok, 1) for tok in utterance]

        input_data.append(token_indices)
        
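        # Heuristic label: the next utterance ends with "?" and only reuses tokens of the current one
        ce_next_utt = (j < len(dialogues[i])-1 and dialogues[i][j+1][-1]=="?" and
                       set(dialogues[i][j+1][:-1]) <= set(dialogues[i][j]))
        # Uncomment to inspect the positive examples:
        # if ce_next_utt:
        #     print(dialogues[i][j], dialogues[i][j+1])
        target_data.append(ce_next_utt)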
    if i % 100 == 0:
        print("Number of indexed dialogues:", i)


Number of indexed dialogues: 0
Number of indexed dialogues: 100
Number of indexed dialogues: 200
Number of indexed dialogues: 300
Number of indexed dialogues: 400
Number of indexed dialogues: 500
Number of indexed dialogues: 600
Number of indexed dialogues: 700
Number of indexed dialogues: 800
Number of indexed dialogues: 900


In [64]:
print("%i/%i (%.2f %%) utterances are followed by a clarification ellipsis"%(sum(target_data), len(target_data), 
                                                                             100*sum(target_data)/len(target_data)))

5182/1162770 (0.45 %) utterances are followed by a clarification ellipsis
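The labelling heuristic used above can be isolated in a small function and tested on the example from the introduction. This is a sketch with a hypothetical function name; unlike the loop above, it also guards against empty utterances.

```python
def is_clarification_ellipsis(utterance, next_utterance):
    """True if next_utterance ends with '?' and only repeats tokens of utterance."""
    return (len(next_utterance) > 0
            and next_utterance[-1] == "?"
            and set(next_utterance[:-1]) <= set(utterance))

previous = ["mark", "killed", "everyone", "."]
print(is_clarification_ellipsis(previous, ["mark", "?"]))   # True
print(is_clarification_ellipsis(previous, ["who", "?"]))    # False
print(is_clarification_ellipsis(previous, ["?"]))           # True
```

The last case shows a quirk of the heuristic: an utterance consisting only of "?" is counted as a clarification ellipsis, since the empty set is a subset of any set.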


But the input data is not yet in the proper format for Keras: we need to "pad" the utterances to the maximum utterance length (and truncate the longer ones), in order to have a single X matrix as input.

In [65]:
input_data2 = np.zeros((len(input_data), max_utterance_length), dtype=np.int32)
for i, utterance in enumerate(input_data):
    if len(utterance) <= max_utterance_length:
        input_data2[i,:len(utterance)] = utterance
    else:
        input_data2[i,:] = utterance[:max_utterance_length]
input_data = input_data2
target_data = np.array(target_data, dtype=np.float32)
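The effect of padding and truncation on a single utterance can be illustrated without NumPy. The helper below is a hypothetical pure-Python equivalent of one iteration of the loop above, padding on the right with index 0.

```python
def pad_or_truncate(token_indices, length, pad_value=0):
    """Return a list of exactly `length` indices, padded on the right with pad_value."""
    clipped = token_indices[:length]
    return clipped + [pad_value] * (length - len(clipped))

print(pad_or_truncate([1, 815, 39], 5))             # [1, 815, 39, 0, 0]
print(pad_or_truncate([7, 8, 9, 10, 11, 12], 5))    # [7, 8, 9, 10, 11]
```

Since index 0 is never assigned to a real token in the vocabulary, the padded positions cannot be confused with actual words.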

Here is an example of a data point + target:

In [66]:
print(input_data[100], "-->", target_data[100])

[   1  815   39 1178   34  941   27   19  106   15  121  113    2    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0] --> 0.0


Finally, we need to split the data into training, development and test sets (we use 1000 utterances for development, 1000 for test, and the rest for training):

In [67]:
X_train, y_train = input_data[:-2000], target_data[:-2000]
X_dev, y_dev = input_data[-2000:-1000], target_data[-2000:-1000]
X_test, y_test = input_data[-1000:], target_data[-1000:]
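The negative slicing can be checked on a toy list. The helper name `split_last` is hypothetical, and the sketch assumes that both held-out sizes are greater than zero (a size of 0 would make the slices behave differently).

```python
def split_last(items, n_dev, n_test):
    """Split off the last n_dev + n_test items as development and test sets."""
    held_out = n_dev + n_test
    return items[:-held_out], items[-held_out:-n_test], items[-n_test:]

train_part, dev_part, test_part = split_last(list(range(10)), n_dev=2, n_test=2)
print(train_part, dev_part, test_part)   # [0, 1, 2, 3, 4, 5] [6, 7] [8, 9]
```

Because the data is not shuffled, the development and test utterances come from the end of the corpus, that is, from the last subtitles that were read.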


And we can now fit the model:


In [68]:
model.fit(X_train, y_train, validation_data=(X_dev, y_dev))

Train on 1160770 samples, validate on 1000 samples
Epoch 1/1


<keras.callbacks.callbacks.History at 0x7fc744226890>

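As we can see, the model does not seem to improve upon a majority baseline in this case. With only 0.45 % positive examples, always predicting "no clarification ellipsis" already yields about 99.5 % accuracy.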
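The majority baseline can be computed directly from the label counts reported earlier. Note that these counts cover the whole dataset, so the figure for the 1000 development utterances alone may differ somewhat.

```python
n_positive, n_total = 5182, 1162770   # counts reported above for the whole dataset

# Accuracy obtained by always predicting the majority class (no clarification ellipsis)
majority_accuracy = (n_total - n_positive) / n_total
print("Majority-baseline accuracy: %.2f %%" % (100 * majority_accuracy))   # 99.55 %
```

With such a skewed class distribution, accuracy alone says little about how well the rare positive class is detected.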

## Model with a recurrent layer

Finally, we can try using a recurrent layer instead of a max pooling operation:

In [69]:
gru = keras.layers.GRU(100)
utterance_embedding = gru(utterance_word_embeddings)

output = keras.layers.Dense(1, activation="sigmoid")
prediction = output(utterance_embedding)

model = keras.models.Model(utterance_input, prediction)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()

Model: "model_4"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_2 (InputLayer)         (None, 32)                0         
_________________________________________________________________
embedding_2 (Embedding)      (None, 32, 100)           1000000   
_________________________________________________________________
gru_2 (GRU)                  (None, 100)               60300     
_________________________________________________________________
dense_4 (Dense)              (None, 1)                 101       
Total params: 1,060,401
Trainable params: 1,060,401
Non-trainable params: 0
_________________________________________________________________
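The parameter counts in the summary can be verified by hand. The sketch below assumes the GRU formulation with three gates and a single bias vector per gate, which is consistent with the 60,300 parameters reported above; some Keras versions use a variant with a second bias vector per gate, which would give a slightly larger count. The function names are hypothetical.

```python
def gru_param_count(input_dim, units):
    """Parameters of a GRU layer with 3 gates and one bias vector per gate."""
    return 3 * (input_dim * units + units * units + units)

def dense_param_count(input_dim, units):
    """Weights plus biases of a fully connected layer."""
    return input_dim * units + units

embedding_params = 10000 * 100   # vocab_size x embedding dimension
total = embedding_params + gru_param_count(100, 100) + dense_param_count(100, 1)
print(gru_param_count(100, 100), dense_param_count(100, 1), total)   # 60300 101 1060401
```

Compared with the max pooling model, the recurrent layer thus adds 60,300 trainable parameters on top of the shared embedding table.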


In [None]:
model.fit(X_train, y_train, validation_data=(X_dev, y_dev))

  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


Train on 1160770 samples, validate on 1000 samples
Epoch 1/1