Pres Nichols

Data Nerd/Writer

Background

I first started experimenting with Word2Vec a few months ago while working on a project for a text-analysis course I took at NYU. The project was a proof-of-concept of sorts that involved collecting and analyzing millions of Spanish-language YouTube comments in an attempt to detect and measure politically oriented speech.

While Word2Vec wasn’t one of the models covered in the course, I was really impressed by its ability to pick up on subtle relationships between words. One of the major issues with analyzing user-generated text in Spanish is spelling: despite the language’s largely phonemic orthography, I found spelling and grammar mistakes everywhere.

Unfortunately, the stopword dictionaries currently available include only the formal spellings of words, which for certain terms are far less common than the misspellings. To make matters worse, the misspellings vary so much that they can’t be removed by tweaking term-frequency thresholds or downsampling.

Word2Vec to the rescue! By taking one common misspelling and querying the model for the 50-most-similar words, I was able to build a comprehensive stopword dictionary to filter them out.

Fifty common but not super-common variations on the word “haha”? Not funny!

So, why Fox News?

That experiment illustrated Word2Vec’s power to uncover the “personality” of a language. I wondered: what if I trained a Word2Vec model on language that, in a very subtle way, represented only one vision of reality? The only English-language candidate I could think of was Fox News.

Getting the text

While Fox News actually produces written copy on its website, I wanted a corpus that would take into account the entire Fox experience: guest commentary, off-the-cuff remarks, banter between anchors, etc.

I don’t have cable at home, so I built a web scraper and extracted the audio from all the videos available on the Fox News website at the time: about 1,150 clips ranging from one to twenty minutes in length. While some of the videos date back to 2015, the vast majority were published during the last six months.

To convert the audio to text, I used Google’s Speech Recognition API, as its results were much better than those of the other services I tried (plus Google gives you $300 in free credit). I explain how I did this here.

Oh, punctuation…

One of the unfortunate things about speech recognition models is that the text they return doesn’t have any punctuation. This is particularly annoying when using Word2Vec, which expects its input as tokenized sentences, and you can’t split text into sentences without knowing where they end.
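Once punctuation is restored, turning a transcript into the list-of-token-lists format Word2Vec expects is a small step. Here is a minimal, dependency-free sketch; a real pipeline might use nltk’s sent_tokenize instead of this naive regex split:

```python
import re

def to_sentences(text):
    """Split punctuated text into the token-list format Word2Vec expects."""
    # Naive split on sentence-final punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    # Lowercase and keep only word characters (and apostrophes) as tokens.
    return [re.findall(r"[a-z']+", s.lower()) for s in sentences if s]
```

For example, `to_sentences("The guest spoke. Anchors laughed!")` yields `[["the", "guest", "spoke"], ["anchors", "laughed"]]`, ready to feed to the model.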

Luckily, Ottokar Tilk already had me covered. He trained a bidirectional recurrent neural network model that restores punctuation in English text. Best of all, this amazing human being also created an API that you can easily query from Python.

Donald Truck is in the house!

Yep. Google is not flawless. Sometimes it misinterprets certain words and phrases, especially when you have people talking over each other. For example, one of the most-common terms associated with Michael Flynn was “attorney,” but the word “tourney” also appeared in the top 20.

My initial strategy was to try to spot these mistakes using metaphone encoding and fuzzy string matching. However, this proved to be a bit more time-consuming than I had originally anticipated, so I shelved the idea. Ultimately, I was able to tweak the parameters of the Word2Vec model to minimize the effect of the incorrect terms.
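The shelved idea looked roughly like this. The sketch below covers only the fuzzy-matching half, using the standard library’s difflib; the metaphone step would need a third-party library such as jellyfish, and the similarity threshold here is an arbitrary stand-in:

```python
from difflib import SequenceMatcher

def transcription_suspects(target, vocab, threshold=0.75):
    """Flag vocabulary words that look like misrecognitions of `target`.

    A word is suspect if its character-level similarity to the target is
    high but it isn't the target itself, e.g. "tourney" next to
    "attorney". The original plan also compared metaphone encodings so
    that sound-alike errors would be caught when spellings diverge.
    """
    suspects = []
    for word in vocab:
        if word == target:
            continue
        ratio = SequenceMatcher(None, target, word).ratio()
        if ratio >= threshold:
            suspects.append((word, round(ratio, 2)))
    return sorted(suspects, key=lambda pair: -pair[1])
```

Running `transcription_suspects("attorney", ["tourney", "lawyer", "journey"])` flags only “tourney,” but doing this for every frequent term in the vocabulary is exactly the time sink that led me to shelve the approach.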

Results

The model was trained on about 500,000 terms — not a huge corpus for Word2Vec, but the results were still quite interesting. Have a gander at the above graph and, as always, I would love to hear your feedback or suggestions.

Also, if you’re curious about how I did any of the above stuff, don’t hesitate to reach out!
