Topic Modeling

By Clint Hammerberg

What:
Topic modeling is a powerful technique that looks for recurring clumps of words, based on which words tend to appear together (co-location), and tracks them through a larger body of text, which can be a single large work or a collection of texts. This allows a researcher to identify potential topics that are prevalent in that body of text. The number of topics is not discovered automatically; it is set by the person running the model.

However, a set of stop words needs to be determined as well. As the word frequency post shows, without one the first ten words of any clump would be words like "and," "of," and "it," which are not helpful in determining topics. Stock sets of stop words can be easily found, but depending on the texts being used and the focus of the research, you may want to make your own list of stop words.

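
To make the stop word step concrete, here is a minimal sketch. It is written in Python rather than R only so that it stays self-contained, and it is not the code we used in class. The stop list, the `extra` argument, and the sample sentence are all made up for illustration; a real stock list runs to hundreds of words.

```python
import re

# Hypothetical, deliberately tiny stop list (assumption, not a real stock list).
STOP_WORDS = {"and", "of", "it", "the", "a", "to", "in", "was", "on"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stop_words(tokens, stop_words=STOP_WORDS, extra=()):
    """Drop stock stop words plus any project-specific additions."""
    blocked = set(stop_words) | set(extra)
    return [t for t in tokens if t not in blocked]


sample = "The cat sat on the wall and it was a night of terror"
# "sat" stands in for a word a researcher decides to add to their own list.
kept = remove_stop_words(tokenize(sample), extra={"sat"})
print(kept)
```

The `extra` argument is where a custom list would go: words that are ordinary in general English but uninformative for your particular texts.
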
It is up to the individual running the program to determine what the topic is, if there is one, for each list of co-located words. This is one of the downsides of topic modeling, as the theme assigned to each list of words is entirely subjective. It is also important to note that R ignores word order, so you can’t assume any relationship between the nouns and adjectives within a list. R also cannot differentiate between homonyms, and neither can the researcher, since the lists of words are presented out of context. The topic can also appear to vary depending on how many words of the list are examined. The list of words that R will spit out will look like this:

words      weights

heart      0.008924123
animal     0.005856917
street     0.005243476
night      0.004323315
beast      0.004016594
eyes       0.003709874
grew       0.003709874
cat        0.003709874
terror     0.003403153
intense    0.003403153

This is a relatively cohesive list, as the words generally make sense together. You can also get a set of words that is much harder to assign a single topic to, like the list below.

words      weights

de         0.022600055
jupiter    0.016725517
legrand    0.014561214
bug        0.011469352
massa      0.010541793
skull      0.008995862
parchment  0.008377490
tree       0.007759117
beetle     0.006831559
dat        0.006831559

R and topic modeling are just tools and sometimes they aren’t going to give results that make sense.
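
The earlier point about word order and homonyms can be shown with a small sketch. This is Python, not the R code from class, and the two sentences are invented. It assumes the simplest possible "bag of words" representation, where a text is reduced to a count of each word.

```python
from collections import Counter


def bag_of_words(text):
    """Reduce a text to word counts, discarding the order of the words."""
    return Counter(text.lower().split())


# Two invented sentences with different meanings but identical words.
first = bag_of_words("the black cat saw the old man")
second = bag_of_words("the old cat saw the black man")

# Once order is thrown away, the two sentences cannot be told apart.
print(first == second)

# A homonym is also just one entry: "bug" the insect and "bug" the
# listening device would be added to the same count.
print(bag_of_words("the bug found the bug")["bug"])
```

Which adjective belonged to which noun is lost as soon as the counts are taken, which is why the lists can't tell you that either.
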

Sometimes the words that R groups together are connected by something other than subject matter. I got two topics that were just authors’ attempts at capturing the sound of Southern dialect. It’s interesting to see that R can identify all of those words as occurring together even if the clump of words doesn’t suggest a topic.

After creating a set of topics that occur within the body, you can go back and track where these topics occur more or less frequently.

For example:

In class, we used a body of 41 Southern literature texts from Documenting the American South. One of the lists that was produced, when looking at only the first 5 words, was:

 time, horse, made, moment, men

To me this set of words is not particularly suggestive of any topic. But that changes when the list is expanded to 15 words:

 time, horse, made, moment, men, captain, man, nigh, hand, head, back, party, side, enemy, ground

The list now seems to suggest a topic of Military, at least to me. Other people came up with the topic of War, which is similar. But there are small differences, which may matter a great deal depending on what you are using this tool for. It is important to show your work so readers can see the data that you are using to draw your conclusions. Again, what a topic is called is entirely up to the individual reading the list. You may decide that the 15-word list above suggests an entirely different topic.

We can then go back and see which of the texts that make up the larger body of Southern literature have the highest occurrence of this particular topic.
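
As a rough illustration of that last step, here is a Python sketch (not our class code) that ranks texts by how much of each one is made up of the topic's words. This is a cruder stand-in for what the topic modeling package does, which estimates topic proportions as part of fitting the model. The two "texts" and the function name `topic_share` are hypothetical; only the 15-word list comes from the example above.

```python
# The 15-word list from the example above.
TOPIC = {
    "time", "horse", "made", "moment", "men", "captain", "man", "nigh",
    "hand", "head", "back", "party", "side", "enemy", "ground",
}


def topic_share(tokens, topic_words):
    """Fraction of a text's tokens that appear in the topic's word list."""
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t in topic_words)
    return hits / len(tokens)


# Two invented one-line "texts" standing in for full documents.
docs = {
    "text_a": "the captain led his men against the enemy on horse".split(),
    "text_b": "she wrote a long letter home about the garden".split(),
}

# Rank the texts from most to least saturated with the topic.
ranked = sorted(docs, key=lambda name: topic_share(docs[name], TOPIC), reverse=True)
for name in ranked:
    print(name, round(topic_share(docs[name], TOPIC), 2))
```
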

Naturally, the longer the list is, the more specific the topic can be. But because the words are weighted by frequency of co-location and listed from the highest weight down, each additional word is less relevant than the one before it. The standard for more thorough research projects seems to be lists of 100 words.
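
The trade-off between list length and relevance can be sketched in a few lines of Python (again, not the R code we used). The weights are copied from the first example list earlier in this post; the function name `top_words` is made up for illustration.

```python
# Weights copied from the first example list in this post.
WEIGHTS = {
    "heart": 0.008924123, "animal": 0.005856917, "street": 0.005243476,
    "night": 0.004323315, "beast": 0.004016594, "eyes": 0.003709874,
    "grew": 0.003709874, "cat": 0.003709874, "terror": 0.003403153,
    "intense": 0.003403153,
}


def top_words(weights, n):
    """Return the n highest-weighted words, heaviest first."""
    return sorted(weights, key=weights.get, reverse=True)[:n]


short_list = top_words(WEIGHTS, 3)
print(short_list)

# How much lighter is the last word of the 10-word list than the first?
full_list = top_words(WEIGHTS, 10)
ratio = WEIGHTS[full_list[-1]] / WEIGHTS[full_list[0]]
print(round(ratio, 2))
```

Even within ten words, the last word carries well under half the weight of the first, so a 100-word list will include many words with only a faint connection to the clump.
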

If you want, the collection of words can be displayed as a word cloud instead. The size of each word corresponds to the relative weight or significance of that word within the collection.

[Image: topic model word cloud. Image courtesy of Megan Valentine.]
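
To show what "size corresponds to weight" means, here is a small Python sketch that maps weights to font sizes on a straight-line scale. This is only an assumption about how such a mapping could work; the wordcloud package has its own scaling options, and this is not its code. The function name `font_sizes` and the 10 to 48 point range are made up. The three weights are taken from the first example list in this post.

```python
def font_sizes(weights, min_size=10, max_size=48):
    """Map each word's weight to a font size on a linear scale."""
    lo, hi = min(weights.values()), max(weights.values())
    span = hi - lo
    sizes = {}
    for word, weight in weights.items():
        if span == 0:
            # All weights equal: draw every word at the largest size.
            sizes[word] = float(max_size)
        else:
            sizes[word] = min_size + (weight - lo) / span * (max_size - min_size)
    return sizes


# Three words and weights from the first example list.
sizes = font_sizes({"heart": 0.008924123, "animal": 0.005856917, "terror": 0.003403153})
for word, size in sizes.items():
    print(word, round(size, 1))
```
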

How:

Additional R packages are required for topic modeling. We used Mallet and wordcloud. You could use the packages topicmodels or lda if you wanted, although I am not familiar with them, and they might work differently and need different code. The code that you use also depends on the type of file that you are using; XML files that have been encoded in TEI and .txt files are both viable options.

The code gets fairly long, and I don’t want to bury you with details. But, broadly speaking, these packages and the corresponding code track the occurrence of words and their relative frequency, both with regard to each other and within the body as a whole. Then the words that occur together most frequently are assembled into a series of lists or clouds.
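
For a feel of what "occurring together" means, here is a deliberately simplified Python sketch that counts how many documents each pair of words shares. To be clear, this is not what the packages do internally: they fit a statistical model rather than tallying pairs. The three tiny "documents" and the function name `cooccurrence_counts` are invented for illustration.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(documents):
    """Count, for each pair of words, how many documents contain both."""
    pairs = Counter()
    for tokens in documents:
        # Sort the unique words so each pair has one consistent spelling.
        for a, b in combinations(sorted(set(tokens)), 2):
            pairs[(a, b)] += 1
    return pairs


# Three invented documents, already stripped of stop words.
docs = [
    ["captain", "enemy", "horse"],
    ["captain", "enemy", "ground"],
    ["cat", "night", "terror"],
]

counts = cooccurrence_counts(docs)
print(counts.most_common(1))
```

Words that keep turning up in the same documents, like "captain" and "enemy" here, are the ones that end up in the same clump.
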

Click here to see a step by step breakdown of the code.

Why:

Topic modeling is a tool that functions best for sorting through large individual documents or large collections of documents. It allows an individual to process data incredibly quickly. Projects like Laurel Ulrich’s study of Martha Ballard’s diary, which spans nearly 30 years of Ballard’s life, took historians years of work. With digital tools, much of the tedious work of tracking the occurrence of a particular set of words pertaining to a topic can be done in a day. That assumes the document in question has already been transcribed into the appropriate digital format and the researcher is familiar with these tools, which is of course not always the case.

Topic modeling could also be used to reconsider traditional genre divisions and literary classifications, as these categories are quite broad and are at times determined by rather arbitrary factors, such as when the book was written.

Topic modeling also has applications in other fields where someone might encounter massive numbers of documents. The discovery process of a legal case can, depending on the nature of the case, produce thousands and thousands of pages of text. Some of these pages will contain important information that could affect the outcome of the case, but many of the documents may contain nothing relevant. Winnowing through this many documents would be a daunting proposition. With topic modeling, you could determine where to start looking more closely by seeing whether a topic relevant to the case appears more frequently in certain documents or in certain parts of a large document.

Journalists might find topic modeling useful for dealing with the massive data leaks that have become more common in the modern world. The Panama Papers and the various WikiLeaks document dumps provide massive amounts of data that can be difficult to make sense of in the short period of time demanded by the news cycle. Additionally, topic modeling might be useful for prosecutors who want to sort through these massive sets of documents to see if they can build a case. Because there is a statute of limitations on many crimes, time is a factor.

I am actually in the middle of reading a book, Fighting over the Founders, about how politicians use references to the American Revolution on the campaign trail. The author of the book used a statistical tool to process the thousands and thousands of pages of speeches given in any year. They probably used R or a similar program and did something similar to topic modeling.
