By Megan Valentine
In this post you will find several sample images of word frequencies and token distributions that my fellow DH classmates and I put together. Below are graphs displaying the ten most frequent words in Call of the Wild by Jack London and Untamed by Ashlyn Keil. These graphs are based on how often each word is used across the entire text. Under the graph for Call of the Wild is the code that was used to produce it.
Along with these top ten graphs are dispersion plots, which show where specific terms occur throughout a novel, measured in novel time (word count). The code used for Ashlyn’s dispersion plot is attached to the graph itself and continues from the code for the top ten words graph above. Note that while a variety of texts were analyzed, the same code was used for each analysis.
To create these dispersion plots and top ten words graphs, we used RStudio, a development environment for the statistical programming language R. After deciding which text we wished to analyze, we loaded it into RStudio and began coding. Here is the code we used if you wish to recreate this project yourself. Note that you will need to supply your own text and insert the document name into the designated spot in the code. Happy coding!
You may be thinking, “Who cares about the top ten words in a text when they are all articles anyway?” or “Why does this even matter? It doesn’t tell me anything.” Those are fair objections, but there is something to be taken away from all of this seemingly ‘unimportant’ data. Word frequency and token distribution are great starting points, an exploratory phase of sorts, that guide us toward deeper questions and analyses of texts.
Looking at the dispersion plots for the names Hamlet and Ophelia in Hamlet raises questions about what is happening in the play wherever references to either character are particularly dense. It is also interesting to note that where Hamlet is mentioned often, Ophelia is almost absent, and vice versa.
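The idea behind a dispersion plot can be sketched the same way: record the word-count position of every mention of a term, then see which stretch of the text those positions fall in. Again, this is a rough Python stand-in rather than our RStudio code, and the function names and the tiny sample text are hypothetical.

```python
import re


def tokenize(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def term_positions(tokens, term):
    """Return the word-count positions at which term appears."""
    term = term.lower()
    return [i for i, tok in enumerate(tokens) if tok == term]


def counts_per_segment(positions, total_words, segments):
    """Count how many positions fall into each equal-sized slice of the text."""
    counts = [0] * segments
    for pos in positions:
        # Integer arithmetic maps position 0..total_words-1 to slice 0..segments-1.
        counts[pos * segments // total_words] += 1
    return counts


# Made-up sample, not a quotation from the play.
sample = "Hamlet speaks. Hamlet waits. Ophelia enters. Ophelia sings."
tokens = tokenize(sample)
hamlet = term_positions(tokens, "Hamlet")
ophelia = term_positions(tokens, "Ophelia")

# Split the text into two halves and count the mentions in each.
print(counts_per_segment(hamlet, len(tokens), 2))   # [2, 0]
print(counts_per_segment(ophelia, len(tokens), 2))  # [0, 2]
```

In this toy example the two names never share a half of the text, which is the same kind of pattern the Hamlet plots suggest. Drawing one tick mark per position along a line gives the dispersion plot itself.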
In the top ten words graphs for Call of the Wild and Untamed, it is easy to spot the point of view or narration style being used without even reading the text. With this in mind, the data is also useful to authors. As the graphs and plots above show, an author can analyze their own text to notice things it is doing that they had not considered or realized before. For instance, they may realize they are using one word over and over and want to change it, or that a word or theme they were trying to emphasize is not as prominent as they thought. Either way, this tool gives authors and readers a broad, surface-level view of what the text is doing, which lets them go down any path of interest or significance. One such path might lead to word correlation analysis.