Or just a convenient plot device? She kinda seems like a Ravenclaw to me.

I was playing around with the Harry Potter books (for a post that is yet to be released) when the realization hit me: I could use the computational power of semantic models to figure out whether Hermione really belongs in Gryffindor, or if she's actually a Ravenclaw. Here's a rough outline of the process.
In [5]:
import nltk
import numpy as np
import matplotlib.pyplot as plt
from itertools import compress
import pickle
from scipy import sparse
from scipy.sparse.linalg import svds as svd
In [ ]:
# load one or more books into a single string
def read_file(nums):
    text = ''
    for num in nums:
        with open('../../../corpus/HarryPotter/text/hp' + str(num) + '.txt', 'rt', encoding='utf-8') as file:
            for line in file:
                text = text + line
    return text

# read in all seven books
book_content = read_file(range(1, 8))

# load character names and houses
with open('character_words.p', 'rb') as file:
    character_words = pickle.load(file)
In [ ]:
# tokenize into words
tokenized = nltk.word_tokenize(book_content)
# lowercase each word
tokenized_lower = [i.lower() for i in tokenized]
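As a quick illustration (on a made-up sentence, not a line from the corpus), nltk's tokenizer splits punctuation into its own tokens, which is why lowercasing is done as a separate step:
In [ ]:
# toy example: word_tokenize separates punctuation from words
print(nltk.word_tokenize("Hermione raised her hand again."))
# ['Hermione', 'raised', 'her', 'hand', 'again', '.']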
In [ ]:
frequencies = nltk.FreqDist(tokenized_lower)
Reducing the corpus

Here's a plot showing the 5000 most frequent words used in Harry Potter. The shape of the curve follows a power law, commonly referred to as a Zipfian distribution. Simply put, most words don't occur that often, but the frequencies of a few words vastly outweigh those of the rest. The most frequent of these words (referred to as stop words) are removed from a corpus: their overwhelming frequency often prevents semantic models from picking up on any meaningful associations between words.
In [226]:
y = list(zip(*frequencies.most_common()))[1][:5000]
x = np.arange(1, len(y) + 1)
plt.plot(x, y)
plt.ylim([0, 4000])
plt.axvline(200, color='red')  # cutoff for the stop list below
plt.show()
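If you'd rather see the Zipfian shape directly, the same data on log-log axes should come out as a roughly straight descending line. A quick sketch, reusing the x and y from above:
In [ ]:
# a power law shows up as a straight line on log-log axes
plt.loglog(x, y)
plt.xlabel('rank')
plt.ylabel('frequency')
plt.show()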
In [ ]:
# find the 200 most common words in the corpus
most_common_words = list(zip(*frequencies.most_common(200)))[0]
# but keep character names out of the stop list
most_common_words = list(set(most_common_words) - character_words)
In [ ]:
# find the words that occur five times or fewer
words, freqs = list(zip(*frequencies.most_common()))
least_common_words = list(compress(words, [freq <= 5 for freq in freqs]))
In [ ]:
# remove the most and least common words from the corpus
# (a set makes the membership test fast over ~1.4M tokens)
stop_words = set(most_common_words) | set(least_common_words)
stop_listed_text = [i for i in tokenized_lower if i not in stop_words]
In [220]:
print('Length of original corpus: ' + str(len(tokenized_lower)))
print('Length of reduced corpus: ' + str(len(stop_listed_text)))
print(str(np.round((1 - len(stop_listed_text) / len(tokenized_lower)) * 100, 2)) + '% reduction')
Length of original corpus: 1392955
Length of reduced corpus: 447443
67.88% reduction
In [ ]:
# save the reduced corpus for later sessions
with open('stop_listed_text.p', 'wb') as file:
    pickle.dump(stop_listed_text, file)
In [6]:
# reload the reduced corpus
with open('stop_listed_text.p', 'rb') as file:
    stop_listed_text = pickle.load(file)
In [7]:
# slide a window over the corpus, length_window words wide
# (each word's contribution will be tapered by position below)
windowed_text = []
length_window = 10
for i in np.arange(len(stop_listed_text) - length_window):
    windowed_text.extend(stop_listed_text[i:i + length_window])
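To make the windowing concrete, here's what that loop does to a made-up five-word text with a window of length 3 (note that, like the loop above, it stops one position short of the last possible window):
In [ ]:
# toy illustration of the sliding window
toy = ['a', 'b', 'c', 'd', 'e']
toy_windows = [toy[i:i + 3] for i in range(len(toy) - 3)]
# toy_windows == [['a', 'b', 'c'], ['b', 'c', 'd']]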
In [8]:
# map each unique word to an integer index
words = sorted(set(stop_listed_text))
dictionary = dict(zip(words, np.arange(len(words))))
In [9]:
# build a sparse window-by-word matrix with tapered weights
# (each window acts as a miniature "document")
data = 1 - np.arange(0, 1, 1 / length_window)
data = np.tile(data, len(stop_listed_text) - length_window)
rows = np.arange(len(stop_listed_text) - length_window)
rows = np.repeat(rows, length_window)
cols = [dictionary[word] for word in windowed_text]
wd = sparse.csr_matrix((data, (rows, cols)), shape=(len(stop_listed_text) - length_window, len(dictionary)))
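The data vector is what makes the window tapered: the first word in each window gets a weight of 1.0, and each later position counts a little less. For length_window = 10 the pattern looks like this:
In [ ]:
# per-window weights: 1.0 for the first word down to 0.1 for the last
print(1 - np.arange(0, 1, 1 / length_window))
# [1.  0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1]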
In [25]:
# collapse the window dimension into a word-by-word matrix:
# row i is word i's co-occurrence profile, normalized by
# word i's total weight across the corpus
ww = np.dot(wd.transpose(), wd).tolil()
ww = np.array(ww.transpose() / wd.sum(0)).transpose()
# save this if you want to skip the step above next time
# with open('ww_hp.p', 'wb') as file:
#     pickle.dump(ww, file)
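(The svds import at the top hints at a dimensionality-reduction step that never made it into this post. Purely as a sketch of what that might look like, and not something used anywhere below, you could compress the word-by-window matrix into LSA-style word vectors; the rank k = 300 here is an arbitrary choice.)
In [ ]:
# hypothetical alternative (not used below): truncated SVD of the
# word-by-window matrix, LSA-style; k = 300 is an arbitrary rank
u, s, vt = svd(wd.transpose(), k=300)
word_vectors = u * s  # each row is a 300-dimensional word embedding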
In [26]:
# zero out direct co-occurrences between house and character names, so
# similarities aren't driven by names simply appearing near each other
with open('character_words.p', 'rb') as file:
    character_words = pickle.load(file)
house_character_names = list(set(words).intersection(character_words))
for i in house_character_names:
    for j in house_character_names:
        ww[dictionary[i], dictionary[j]] = 0
In [27]:
def cosineTable(vects):
    # all pairwise cosine similarities: dot products divided by
    # the outer product of the vector norms
    norms = np.sqrt(np.sum(vects * vects, 1))
    return np.dot(vects, vects.transpose()) / np.outer(norms, norms)
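cosineTable computes every pairwise cosine similarity in one shot: the numerator holds all the dot products, and the outer product in the denominator pairs up every combination of vector norms. A quick sanity check on toy vectors:
In [ ]:
# identical vectors -> 1.0, orthogonal vectors -> 0.0
toy_vects = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print(cosineTable(toy_vects))
# diagonal is 1.0; [1, 0] vs [0, 1] is 0.0; [1, 0] vs [1, 1] is ~0.71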
What did we find out?
In [30]:
names = ['ravenclaw', 'gryffindor', 'hufflepuff', 'slytherin',
         'harry', 'cho', 'cedric', 'draco', 'justin']
rows = [dictionary[i] for i in names]
test = ww[rows]
# similarity of each house (rows) to each character (columns)
values = cosineTable(test)[:4, 4:]
In [31]:
labels = names[4:]
ravenclaw = values[0]
gryffindor = values[1]
hufflepuff = values[2]
slytherin = values[3]

x = np.arange(len(labels))  # the label locations
width = 0.2  # the width of each bar

fig, ax = plt.subplots()
ax.bar(x - 1.5 * width, gryffindor, width, label='Gryffindor', color='#ae0001')
ax.bar(x - 0.5 * width, ravenclaw, width, label='Ravenclaw', color='#033e8c')
ax.bar(x + 0.5 * width, hufflepuff, width, label='Hufflepuff', color='#ffdb00')
ax.bar(x + 1.5 * width, slytherin, width, label='Slytherin', color='#2c8309')

# labels, title, and custom x-axis tick labels
ax.set_ylabel('Cosine similarity')
ax.set_title('Similarity between House and Character')
ax.set_xticks(x)
ax.set_xticklabels(labels)
ax.legend()

fig.tight_layout()
plt.show()
The figure above plots the similarities between the vectors representing each character and each house. The x-axis is the character, the y-axis is the cosine similarity (you'd better be impressed by the colors; I scoured the internet for the best hex codes). I wanted to make sure that the model could correctly place a character in each house: Harry (obviously) has to be placed in Gryffindor, and I chose Cho Chang to represent Ravenclaw, Draco to represent Slytherin, and Cedric Diggory to represent Hufflepuff. I was a little annoyed that the model placed Cedric in Gryffindor (I guess he wasn't that great of a finder after all), so I had to dig deeper to find another Hufflepuff to validate my model. I settled on Justin Finch-Fletchley as the representative (oh come on, you know who he is... he's the one who did the one thing that one time...). Anyway, the model correctly identifies the house of each remaining representative. With the model modestly validated, let's see how it sorts Hermione:
In [32]:
names = ['ravenclaw', 'gryffindor', 'hufflepuff', 'slytherin', 'hermione']
rows = [dictionary[i] for i in names]
test = ww[rows]
values = cosineTable(test)[:4, 4:]
In [33]:
labels = names[4:]
ravenclaw = values[0]
gryffindor = values[1]
hufflepuff = values[2]
slytherin = values[3]

x = np.arange(len(labels))  # the label locations
width = 0.2  # the width of each bar

fig, ax = plt.subplots()
ax.bar(x - 1.5 * width, gryffindor, width, label='Gryffindor', color='#ae0001')
ax.bar(x - 0.5 * width, ravenclaw, width, label='Ravenclaw', color='#033e8c')
ax.bar(x + 0.5 * width, hufflepuff, width, label='Hufflepuff', color='#ffdb00')
ax.bar(x + 1.5 * width, slytherin, width, label='Slytherin', color='#2c8309')

# labels, title, and custom x-axis tick labels
ax.set_ylabel('Cosine similarity')
ax.set_title('Similarity between House and Character')
ax.set_xticks(x)
ax.set_xticklabels(labels)
ax.legend()

fig.tight_layout()
plt.show()
And there we have it. Despite her bookish nature and the cleverness of her wand hand, Hermione is clearly a Gryffindor (though, if she were anything else, it would obviously be a Ravenclaw). Also, let's be amused by Draco for a moment. It's not that he's more Slytherin than anyone else; it's just that he's less of anything else than he is Slytherin.

To Do:

These posts are mostly just fun things for me to do. They're works in progress (even after they're posted). If you think there's something I should look at, leave a comment and I'll add it to my to-do list.