Or just a convenient plot device? She kinda seems like a Ravenclaw to me.

I was playing around with the Harry Potter books (for a post that is yet to be released) when the realization hit me: I could use the computational power of semantic models to figure out whether Hermione really belongs in Gryffindor, or if she's actually a Ravenclaw. Here's a rough outline of the process.
In [5]:
import nltk
import numpy as np
import matplotlib.pyplot as plt
from itertools import compress
import pickle
from scipy import sparse
from scipy.sparse.linalg import svds as svd
In [ ]:
# load one or more books into a single string
def read_file(nums):
    text = ''
    for num in nums:
        with open('../../../corpus/HarryPotter/text/hp' + str(num) + '.txt', 'rt', encoding='utf-8') as file:
            for line in file:
                text = text + line
    return text

# read in all seven books
book_content = read_file(range(1, 8))

# load character names and houses
with open('character_words.p', 'rb') as file:
    character_words = pickle.load(file)
In [ ]:
# tokenize into words
tokenized = nltk.word_tokenize(book_content)
# lowercase each word
tokenized_lower = [i.lower() for i in tokenized]
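As a quick illustration (on a made-up sentence, not a line from the corpus), nltk's tokenizer splits punctuation into its own tokens, which is why lowercasing is done as a separate step:
In [ ]:
# toy example: word_tokenize separates punctuation from words
print(nltk.word_tokenize("Hermione raised her hand again."))
# ['Hermione', 'raised', 'her', 'hand', 'again', '.']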
In [ ]:
frequencies = nltk.FreqDist(tokenized_lower)
Reducing the corpus

Here's a plot showing the 5000 most frequent words used in Harry Potter. The shape of the curve follows a power law, commonly referred to as a Zipfian distribution. Simply put, most words don't occur that often, but the frequencies of a few words vastly outweigh those of the rest. The most frequent of these words (referred to as stop words) are removed from a corpus: their overwhelming frequency often prevents semantic models from picking up on any meaningful associations between words.
In [226]:
y = list(zip(*frequencies.most_common()))[1][:5000]
x = np.arange(1, len(y) + 1)
plt.plot(x, y)
plt.ylim([0, 4000])
plt.axvline(200, color='red')  # cutoff for the stop list below
plt.show()
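If you'd rather see the Zipfian shape directly, the same data on log-log axes should come out as a roughly straight descending line. A quick sketch, reusing the x and y from above:
In [ ]:
# a power law shows up as a straight line on log-log axes
plt.loglog(x, y)
plt.xlabel('rank')
plt.ylabel('frequency')
plt.show()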
In [ ]:
# find the 200 most common words in the corpus
most_common_words = list(zip(*frequencies.most_common(200)))[0]
# but keep character names out of the stop list
most_common_words = list(set(most_common_words) - character_words)
In [ ]:
# find the words that occur five times or fewer
words, freqs = list(zip(*frequencies.most_common()))
least_common_words = list(compress(words, [freq <= 5 for freq in freqs]))
In [ ]:
# remove the most and least common words from the corpus
# (a set makes the membership test fast over ~1.4M tokens)
stop_words = set(most_common_words) | set(least_common_words)
stop_listed_text = [i for i in tokenized_lower if i not in stop_words]
In [220]:
print('Length of original corpus: ' + str(len(tokenized_lower)))
print('Length of reduced corpus: ' + str(len(stop_listed_text)))
print(str(np.round((1 - len(stop_listed_text) / len(tokenized_lower)) * 100, 2)) + '% reduction')
Length of original corpus: 1392955
Length of reduced corpus: 447443
67.88% reduction
In [ ]:
# save the reduced corpus for later sessions
with open('stop_listed_text.p', 'wb') as file:
    pickle.dump(stop_listed_text, file)
In [6]:
# reload the reduced corpus
with open('stop_listed_text.p', 'rb') as file:
    stop_listed_text = pickle.load(file)
In [7]:
# slide a window over the corpus, length_window words wide
# (each word's contribution will be tapered by position below)
windowed_text = []
length_window = 10
for i in np.arange(len(stop_listed_text) - length_window):
    windowed_text.extend(stop_listed_text[i:i + length_window])
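To make the windowing concrete, here's what that loop does to a made-up five-word text with a window of length 3 (note that, like the loop above, it stops one position short of the last possible window):
In [ ]:
# toy illustration of the sliding window
toy = ['a', 'b', 'c', 'd', 'e']
toy_windows = [toy[i:i + 3] for i in range(len(toy) - 3)]
# toy_windows == [['a', 'b', 'c'], ['b', 'c', 'd']]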
In [8]:
# map each unique word to an integer index
words = sorted(set(stop_listed_text))
dictionary = dict(zip(words, np.arange(len(words))))
In [9]:
# build a sparse window-by-word matrix with tapered weights
# (each window acts as a miniature "document")
data = 1 - np.arange(0, 1, 1 / length_window)
data = np.tile(data, len(stop_listed_text) - length_window)
rows = np.arange(len(stop_listed_text) - length_window)
rows = np.repeat(rows, length_window)
cols = [dictionary[word] for word in windowed_text]
wd = sparse.csr_matrix((data, (rows, cols)), shape=(len(stop_listed_text) - length_window, len(dictionary)))
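The data vector is what makes the window tapered: the first word in each window gets a weight of 1.0, and each later position counts a little less. For length_window = 10 the pattern looks like this:
In [ ]:
# per-window weights: 1.0 for the first word down to 0.1 for the last
print(1 - np.arange(0, 1, 1 / length_window))
# [1.  0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1]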
In [25]:
# collapse the window dimension into a word-by-word matrix:
# row i is word i's co-occurrence profile, normalized by
# word i's total weight across the corpus
ww = np.dot(wd.transpose(), wd).tolil()
ww = np.array(ww.transpose() / wd.sum(0)).transpose()
# save this if you want to skip the step above next time
# with open('ww_hp.p', 'wb') as file:
#     pickle.dump(ww, file)
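(The svds import at the top hints at a dimensionality-reduction step that never made it into this post. Purely as a sketch of what that might look like, and not something used anywhere below, you could compress the word-by-window matrix into LSA-style word vectors; the rank k = 300 here is an arbitrary choice.)
In [ ]:
# hypothetical alternative (not used below): truncated SVD of the
# word-by-window matrix, LSA-style; k = 300 is an arbitrary rank
u, s, vt = svd(wd.transpose(), k=300)
word_vectors = u * s  # each row is a 300-dimensional word embedding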
In [26]:
# zero out direct co-occurrences between house and character names, so
# similarities aren't driven by names simply appearing near each other
with open('character_words.p', 'rb') as file:
    character_words = pickle.load(file)
house_character_names = list(set(words).intersection(character_words))
for i in house_character_names:
    for j in house_character_names:
        ww[dictionary[i], dictionary[j]] = 0
In [27]:
def cosineTable(vects):
    # all pairwise cosine similarities: dot products divided by
    # the outer product of the vector norms
    norms = np.sqrt(np.sum(vects * vects, 1))
    return np.dot(vects, vects.transpose()) / np.outer(norms, norms)
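cosineTable computes every pairwise cosine similarity in one shot: the numerator holds all the dot products, and the outer product in the denominator pairs up every combination of vector norms. A quick sanity check on toy vectors:
In [ ]:
# identical vectors -> 1.0, orthogonal vectors -> 0.0
toy_vects = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print(cosineTable(toy_vects))
# diagonal is 1.0; [1, 0] vs [0, 1] is 0.0; [1, 0] vs [1, 1] is ~0.71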
What did we find out?
In [30]:
names = ['ravenclaw', 'gryffindor', 'hufflepuff', 'slytherin',
         'harry', 'cho', 'cedric', 'draco', 'justin']
rows = [dictionary[i] for i in names]
test = ww[rows]
# similarity of each house (rows) to each character (columns)
values = cosineTable(test)[:4, 4:]
In [31]:
labels = names[4:]
ravenclaw = values[0]
gryffindor = values[1]
hufflepuff = values[2]
slytherin = values[3]

x = np.arange(len(labels))  # the label locations
width = 0.2  # the width of each bar

fig, ax = plt.subplots()
ax.bar(x - 1.5 * width, gryffindor, width, label='Gryffindor', color='#ae0001')
ax.bar(x - 0.5 * width, ravenclaw, width, label='Ravenclaw', color='#033e8c')
ax.bar(x + 0.5 * width, hufflepuff, width, label='Hufflepuff', color='#ffdb00')
ax.bar(x + 1.5 * width, slytherin, width, label='Slytherin', color='#2c8309')

# labels, title, and custom x-axis tick labels
ax.set_ylabel('Cosine similarity')
ax.set_title('Similarity between House and Character')
ax.set_xticks(x)
ax.set_xticklabels(labels)
ax.legend()

fig.tight_layout()
plt.show()
The figure above plots the similarities between the vectors representing each character and each house. The x-axis is the character, the y-axis is the cosine similarity (you'd better be impressed by the colors; I scoured the internet for the best hex codes). I wanted to make sure that the model could correctly place a character in each house: Harry (obviously) has to be placed in Gryffindor, and I chose Cho Chang to represent Ravenclaw, Draco to represent Slytherin, and Cedric Diggory to represent Hufflepuff. I was a little annoyed that the model placed Cedric in Gryffindor (I guess he wasn't that great of a finder after all), so I had to dig deeper to find another Hufflepuff to validate my model. I settled on Justin Finch-Fletchley as the representative (oh come on, you know who he is... he's the one who did the one thing that one time...). Anyway, the model correctly identifies the house of each remaining representative. With the model modestly validated, let's see how it sorts Hermione:
In [32]:
names = ['ravenclaw', 'gryffindor', 'hufflepuff', 'slytherin', 'hermione']
rows = [dictionary[i] for i in names]
test = ww[rows]
values = cosineTable(test)[:4, 4:]
In [33]:
labels = names[4:]
ravenclaw = values[0]
gryffindor = values[1]
hufflepuff = values[2]
slytherin = values[3]

x = np.arange(len(labels))  # the label locations
width = 0.2  # the width of each bar

fig, ax = plt.subplots()
ax.bar(x - 1.5 * width, gryffindor, width, label='Gryffindor', color='#ae0001')
ax.bar(x - 0.5 * width, ravenclaw, width, label='Ravenclaw', color='#033e8c')
ax.bar(x + 0.5 * width, hufflepuff, width, label='Hufflepuff', color='#ffdb00')
ax.bar(x + 1.5 * width, slytherin, width, label='Slytherin', color='#2c8309')

# labels, title, and custom x-axis tick labels
ax.set_ylabel('Cosine similarity')
ax.set_title('Similarity between House and Character')
ax.set_xticks(x)
ax.set_xticklabels(labels)
ax.legend()

fig.tight_layout()
plt.show()
And there we have it. Despite her bookish nature and the cleverness of her wand hand, Hermione is clearly a Gryffindor (though, if she were anything else, it would obviously be a Ravenclaw). Also, let's be amused by Draco for a moment. It's not that he's more Slytherin than anyone else; it's just that he's less of anything else than he is Slytherin.

To Do:

These posts are mostly just fun things for me to do. They're works in progress (even after they're posted). If you think there's something I should look at, leave a comment and I'll add it to my to-do list.