I trained a Word2Vec model on Fox News broadcasts
This is what it thinks about the world.
I first started experimenting with Word2Vec a few months ago while working on a project for a text-analysis course I took at NYU. The project was a proof-of-concept of sorts that involved collecting and analyzing millions of Spanish-language YouTube comments in an attempt to detect and measure politically oriented speech.
While Word2Vec wasn’t actually one of the models we learned in this course, I was really impressed by its ability to pick up on subtle relationships between words. One of the major issues with analyzing user-generated text in Spanish is spelling: even though Spanish has a largely phonemic orthography, I found spelling and grammar mistakes everywhere.
Unfortunately, the stopword dictionaries currently available only include formal spellings of words, which are actually way less common than incorrect spellings for certain terms. To make matters worse, the misspellings vary so much that they can’t be removed by tweaking term frequency or downsampling.
Word2Vec to the rescue! By taking one common misspelling and querying the model for its 50 most similar words, I was able to build a comprehensive stopword dictionary that also filtered out the misspelled variants.
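A rough sketch of that lookup (not the exact code used for the project) might look like the following. It assumes the word vectors expose a `most_similar(positive=..., topn=...)` method that returns (word, similarity) pairs, as gensim's `KeyedVectors` does. The function name and the `min_similarity` cutoff are hypothetical and only there for illustration.

```python
def expand_stopwords(word_vectors, seed_words, topn=50, min_similarity=0.0):
    """Collect each seed word plus its nearest neighbours in the embedding space.

    `word_vectors` is assumed to behave like gensim's KeyedVectors.
    """
    stopwords = set(seed_words)
    for seed in seed_words:
        # most_similar returns a list of (word, cosine_similarity) tuples.
        for word, similarity in word_vectors.most_similar(positive=[seed], topn=topn):
            if similarity >= min_similarity:
                stopwords.add(word)
    return stopwords
```

With a trained gensim model this would be called as `expand_stopwords(model.wv, ["q"])`, where `"q"` is just an example seed (a common shorthand for "que"). The result is worth reviewing by hand, since nearest neighbours can include legitimate content words.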
So, why Fox News?
The above experiment really illustrated the true power of Word2Vec to uncover the “personality” of language. I wondered: what if I trained a Word2Vec model on language that, in a very subtle way, only represented one vision of reality? The only English-language candidate I could think of was Fox News.
Getting the text
While Fox News actually produces written copy on its website, I wanted a corpus that would take into account the entire Fox experience: guest commentary, off-the-cuff remarks, banter between anchors, etc.
I don’t have cable at home, so I built a web scraper and extracted the audio for all the videos available on the Fox News website at the time — about 1150 clips ranging in length between 1 minute and 20 minutes. While some of the videos date back to 2015, the vast majority were published during the last six months.
To convert the audio, I used Google’s Speech Recognition API, as the results it produced were much better than any other service (plus they give you $300 free credit). I explain how I did this here.
Oh, punctuation…
One of the unfortunate things about speech recognition models is that the text they return doesn’t actually have any punctuation. This is particularly annoying when using Word2Vec, as you need to feed it tokenized sentences, and finding sentence boundaries requires punctuation.
Luckily, Ottokar Tilk already had me covered. He trained a bidirectional recurrent neural network model that restores punctuation in English text. Best of all, this amazing human being also created an API that you can easily query from Python.
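Once punctuation has been restored, turning a transcript into the list-of-token-lists shape that Word2Vec implementations such as gensim expect can be sketched with the standard library alone. This is a simplified illustration, not the tokenizer used for the project: the function name is hypothetical, and the regex-based splitting will mishandle abbreviations such as "Mr." or "U.S.".

```python
import re


def to_tokenized_sentences(text):
    """Split punctuated text into sentences, then each sentence into lowercase tokens."""
    # Break after ., ! or ? when followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    tokenized = []
    for sentence in sentences:
        # Keep runs of letters, digits and apostrophes; drop everything else.
        tokens = re.findall(r"[a-z0-9']+", sentence.lower())
        if tokens:
            tokenized.append(tokens)
    return tokenized
```

Each inner list is one sentence, so the output can be passed directly as the training corpus. A dedicated sentence tokenizer would be a more robust choice for real transcripts.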
Donald Truck is in the house!
Yep. Google is not flawless. Sometimes it misinterprets certain words and phrases, especially when you have people talking over each other. For example, one of the most-common terms associated with Michael Flynn was “attorney,” but the word “tourney” also appeared in the top 20.
My initial strategy was to try to spot these mistakes using metaphone encoding and fuzzy string matching. However, this proved to be a bit more time-consuming than I had originally anticipated, so I shelved the idea. Ultimately, I was able to tweak the parameters of the Word2Vec model to minimize the effect of the incorrect terms.
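To give a feel for the fuzzy-matching half of that shelved idea, here is a minimal sketch using `difflib` from the Python standard library. It compares spellings only and does not implement metaphone encoding, which would need a third-party library. The function name and the 0.75 threshold are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def similar_pairs(vocabulary, threshold=0.75):
    """Return (word_a, word_b, ratio) for every pair of distinct words that look alike."""
    words = sorted(set(vocabulary))
    pairs = []
    for i, first in enumerate(words):
        for second in words[i + 1:]:
            # ratio() is 2 * matched_chars / total_chars, between 0.0 and 1.0.
            ratio = SequenceMatcher(None, first, second).ratio()
            if ratio >= threshold:
                pairs.append((first, second, round(ratio, 2)))
    return pairs
```

Run over the terms near "Flynn", a check like this would flag "attorney" and "tourney" as a candidate pair for manual review. Comparing every pair is quadratic in vocabulary size, which is one reason this kind of cleanup gets slow on a full corpus.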
Results
The model was trained on about 500,000 terms. That is not a huge corpus for Word2Vec, but the results were still quite interesting. Have a gander at the above graph and, as always, I would love to hear your feedback or suggestions.
Also, if you’re curious about how I did any of the above stuff, don’t hesitate to reach out!