By Clint Hammerberg
R and topic modeling are just tools, and sometimes they aren't going to give results that make sense.
Sometimes the list of words that R gives you reflects a connection other than a shared topic. I got two topics that were just authors' attempts at capturing the sound of Southern dialect. It's interesting to see that R can identify all of those words as occurring together even if the clump of words doesn't suggest a topic.
After creating a set of topics that occur within the body, you can go back and track where these topics occur more or less frequently.
In class, we used a body composed of 41 texts of Southern literature from Documenting the American South. One of the lists that was produced, looking at only the first 5 words, was:
time, horse, made, moment, men
To me this set of words is not particularly suggestive of any topic. But the picture changes when the list is expanded to 15 words:
time, horse, made, moment, men, captain, man, nigh, hand, head, back, party, side, enemy, ground
The list now seems to suggest a topic of Military, at least to me. Other people came up with the topic of War, which is similar. But there are small differences, which, depending on what you are using this tool for, may make a big difference. It is important to show your work so readers can see the data you are using to draw your conclusions. Again, what a topic is called is entirely up to the individual reading the list. You may determine that the 15-word list above suggests an entirely different topic.
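For what it's worth, pulling the top words of a single topic at different cutoffs looks something like this with the mallet package. This is only a sketch: it assumes topic.model is an already-trained model (a training sketch appears further down), and the topic index 7 is made up for illustration.

```r
library(mallet)

# Word weights for every topic; rows are topics, columns are vocabulary terms.
topic.words <- mallet.topic.words(topic.model, smoothed = TRUE, normalized = TRUE)

# The same (hypothetical) topic at two cutoffs: 5 words may look like noise,
# while 15 can start to suggest a theme.
mallet.top.words(topic.model, topic.words[7, ], num.top.words = 5)
mallet.top.words(topic.model, topic.words[7, ], num.top.words = 15)
```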
We can then go back and see which of the texts that make up the larger body of Southern literature have the highest occurrence of this particular topic.
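Here is a sketch of that step, again assuming a trained topic.model from the mallet package and treating topic 7 as the hypothetical Military topic from above; files stands for the vector of file names created when the texts were imported (see the workflow sketch below).

```r
# Per-document topic proportions; rows are documents, columns are topics.
doc.topics <- mallet.doc.topics(topic.model, smoothed = TRUE, normalized = TRUE)

# Rank the texts by their share of topic 7 and show the ten highest.
ranked <- order(doc.topics[, 7], decreasing = TRUE)
head(data.frame(text = files[ranked], proportion = doc.topics[ranked, 7]), 10)
```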
Naturally, the longer the list is, the more specific the topic can be. But because the words are weighted by frequency of co-occurrence, at a certain length the words become less relevant. The standard for more thorough research projects seems to be lists of 100 words.
If you want, the collection of words can be displayed as a word cloud instead. The size of each word corresponds to its relative weight, or significance, within the collection.
(Image Courtesy of Megan Valentine)
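Something like the following produces that display with the wordcloud package. As before, this is a sketch that assumes a trained topic.model and the made-up topic index 7; recent versions of the mallet package may name the returned columns differently.

```r
library(mallet)
library(wordcloud)

# Take the 100 heaviest words of the topic and size them by weight.
topic.words <- mallet.topic.words(topic.model, smoothed = TRUE, normalized = TRUE)
top <- mallet.top.words(topic.model, topic.words[7, ], num.top.words = 100)

# min.freq = 0 keeps all 100 words, since normalized weights are small numbers.
wordcloud(top$words, top$weights,
          scale = c(4, 0.5), min.freq = 0, random.order = FALSE)
```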
Additional packages are required to do topic modeling in R. We used Mallet and wordcloud. You could use the topicmodels or lda packages if you wanted, although I am not familiar with them; they might work differently and require different code. The code you use also depends on the type of file you are working with; XML files that have been marked up for TEI and plain .txt files are both viable options.
The code gets fairly long, and I don't want to bury you in details. But, broadly speaking, these packages and the corresponding code track the occurrence of words and their frequency relative to each other and to the body as a whole. The words that occur together most frequently are then assembled into a series of lists or clouds.
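To give a sense of the shape of that code, here is a minimal sketch of the workflow using the mallet package with plain .txt files. The folder name, the stopword file, the number of topics, and the number of training iterations are all placeholder values, not the exact settings we used in class.

```r
library(mallet)

# Read every .txt file in a folder into one string per document.
# "southern_lit" and "en_stopwords.txt" are placeholder names.
files <- list.files("southern_lit", pattern = "\\.txt$", full.names = TRUE)
texts <- sapply(files, function(f) paste(readLines(f, warn = FALSE), collapse = " "))

# Import the documents; Mallet needs a stopword list and a tokenizing pattern.
instances <- mallet.import(files, texts, "en_stopwords.txt",
                           token.regexp = "[\\p{L}']+")

# Build and train the model; 20 topics and 200 sampling iterations are
# arbitrary choices here.
topic.model <- MalletLDA(num.topics = 20)
topic.model$loadDocuments(instances)
topic.model$train(200)
```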
Click here to see a step-by-step breakdown of the code.
Topic modeling is a tool that functions best for sorting through large individual documents or large collections of documents. It allows an individual to process data incredibly quickly. Projects like Laurel Ulrich's study of Martha Ballard's diary, which spans nearly 30 years of Ballard's life, took historians years of work. But with digital tools, much of the tedious work of tracking the occurrence of a particular set of words pertaining to a topic can be done in a day, provided the document in question has already been transcribed into an appropriate digital format and the researcher is familiar with these tools, which is of course not always the case.
Topic modeling could also be used to reconsider traditional genre divisions and literary classifications, as these categories are quite broad and at times determined by rather arbitrary factors, such as when a book was written.
Topic modeling also has applications in other fields where someone might encounter massive numbers of documents. The discovery process of a legal case can, depending on the nature of the case, produce thousands and thousands of pages of text. Some of those pages will contain important information that could affect the outcome of the case, but many may contain nothing. Winnowing through that many documents would be a daunting proposition, but with topic modeling you could determine where to start looking more closely, since a topic relevant to the case may appear more frequently in certain documents or in certain parts of a large document.
Journalists might find topic modeling useful for dealing with the massive data leaks that have become more common in the modern world. The Panama Papers and various WikiLeaks document dumps provide massive amounts of data that can be difficult to make sense of in the short period of time demanded by the news cycle. Topic modeling might also be useful for prosecutors who want to sort through these massive sets of documents to see if they can build a case; since there is a statute of limitations on many crimes, time is a factor.
I am actually in the middle of reading a book, Fighting over the Founders, about how politicians use references to the American Revolution on the campaign trail. The author used a statistical tool to process the thousands and thousands of pages of speeches delivered in any given year; they probably used R or a similar program and did something similar to topic modeling.