Clustering

By Ashle’ Tate

What is clustering?

If you’re anything like me, a teacher candidate and a huge foodie, when you hear the word ‘cluster’ the first thing you think of is those delicious chocolates with caramel and nuts. The second thing you think of is the grouping and instructional practice that supports gifted and talented students… Well, I’m here to tell you that clustering in the world of DH is neither of those things.

Me after learning that:

If you’re reading or have read chapter 11 of Jockers’s Text Analysis with R for Students of Literature, you may have felt like this:

Jockers does a great job of explaining the code behind the process, but between that and all of the talk about the Euclidean metric and unsupervised clustering, it can feel a little overwhelming.

Here is clustering plain and simple:

Document clustering is used to generate hierarchical clusters of documents. It relies on descriptors and descriptor extraction: descriptors are sets of words that describe the contents of a cluster. Using the Euclidean metric, clustering calculates each book or section’s distance from every other book or section in a corpus. Books that are closer together are more similar in their feature usage (for example, how often they use particular words), and books that are farther apart are less similar.
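To make the distance idea concrete, here is a minimal sketch in plain Python (our class worked in R, so this is only an illustration, not the code we ran). The three texts and their word frequencies are made up, and the function names euclidean and distance_matrix are my own.

```python
import math

# Hypothetical relative word frequencies (per 100 words) for three made-up texts.
# Each position in a list stands for the same feature word in every text.
freqs = {
    "text_a": [6.0, 3.0, 2.0],
    "text_b": [6.0, 3.0, 4.0],
    "text_c": [3.0, 7.0, 2.0],
}

def euclidean(u, v):
    """Straight-line distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

def distance_matrix(table):
    """Distance from every text to every other text, keyed by (name, name)."""
    names = sorted(table)
    return {(a, b): euclidean(table[a], table[b]) for a in names for b in names}

dist = distance_matrix(freqs)
for (a, b), d in dist.items():
    if a < b:  # print each pair only once
        print(f"{a} - {b}: {d:.2f}")
```

In this toy data, text_a and text_b differ on only one feature, so they end up much closer to each other than either is to text_c.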

How:

We used RStudio.

We used dist (a function that uses the Euclidean metric to create a distance matrix) and hclust, or hierarchical clustering (which clusters the data in a distance matrix).
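If you are curious what hierarchical clustering does with those distances, the sketch below walks through the idea in plain Python. It is not R's hclust itself: it is a small hand-written version that repeatedly merges the two closest clusters, using complete linkage (the farthest pair of members decides how far apart two clusters are), which is the default method in hclust. The four texts, their numbers, and the function names are all hypothetical.

```python
import math

# Hypothetical two-feature frequency vectors for four made-up texts.
points = {
    "play_1": [6.0, 3.0],
    "play_2": [6.0, 4.0],
    "novel_1": [2.0, 8.0],
    "novel_2": [2.0, 10.0],
}

def euclidean(u, v):
    """Straight-line distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

def complete_linkage(c1, c2, data):
    """Cluster distance = distance between the farthest pair of members."""
    return max(euclidean(data[a], data[b]) for a in c1 for b in c2)

def agglomerate(data):
    """Merge the two closest clusters until one is left; return the merge history."""
    clusters = [(name,) for name in sorted(data)]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = complete_linkage(clusters[i], clusters[j], data)
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

merges = agglomerate(points)
for left, right, height in merges:
    print(f"merge {left} + {right} at height {height:.2f}")
```

Each merge corresponds to one join in a dendrogram, and the height is where the two branches meet. Here the two plays join first, then the two novels, and the two pairs join last.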

[Image: Screen Shot 2017-04-24 at 1.31.12 AM.png]

You can use the following code:

[Screenshot of the code: Screen Shot 2017-04-24 at 1.12.05 AM.png]

In our exploration, we created a dendrogram from a TEI XML corpus. We looked at the frequencies of the top (most common) words. The results of the clustering allowed us to see the relationships between texts from different time periods and genres. We wanted to explore this collection of texts to find out who the anonymous author was.
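For anyone wondering what "top word frequency" looks like as data, here is a rough Python sketch. It assumes plain-text strings rather than our TEI XML files, and the way it picks the top words (highest summed relative frequency, ties broken alphabetically) is my own simplification, not necessarily the rule we used in class. The corpus and function names are hypothetical.

```python
import re
from collections import Counter

# Two tiny made-up "texts" standing in for whole books.
corpus = {
    "text_a": "the cat and the dog",
    "text_b": "the bird and the fish and the frog",
}

def relative_freqs(text):
    """Percentage of all word tokens in the text taken up by each word."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    return {w: 100.0 * c / len(words) for w, c in counts.items()}

def feature_table(texts, top_n):
    """Return the top_n words overall and one frequency row per text."""
    per_text = {name: relative_freqs(text) for name, text in texts.items()}
    totals = Counter()
    for freqs in per_text.values():
        for word, value in freqs.items():
            totals[word] += value
    # Highest total frequency first; alphabetical order breaks ties.
    top_words = sorted(totals, key=lambda w: (-totals[w], w))[:top_n]
    rows = {name: [freqs.get(w, 0.0) for w in top_words]
            for name, freqs in per_text.items()}
    return top_words, rows

top_words, rows = feature_table(corpus, top_n=2)
print(top_words)
for name, row in rows.items():
    print(name, row)
```

Each row is the kind of feature vector that a distance function can compare, one number per top word.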

Find our dendrogram attached below:

Cluster Dendrogram (3)

When: If you ran your code and now have your cluster dendrogram, you might feel a little like this:

 

And you might be asking yourself, when would I ever really need to use this? Dr. Jaime Jordan gave me a great example: “A grocer might use clustering to determine 5 main groups of shoppers to target with their rewards program.” There are many relevant uses for clustering in business, and as DH continues to grow, we will discover more ways to use clustering in the humanities.

So what?: So what did we learn from our cluster dendrogram? We learned that the anonymous author was William Shakespeare. This data allows us to make several predictions about these anonymous texts. #ghostwriter? In our next adventure, we explore classification, where we will dig into this a little more!

We were asked to assess whether or not our results were of “high quality.” “High quality” in text mining refers to some combination of “relevance, novelty, and interest.” The following are some of my classmates’ responses to the cluster dendrogram above.

 

It’s interesting to see Homer’s Odyssey and Iliad and Milton, part of the Classics period, right next to Jack London’s Call of the Wild and White Fang, which are both in the modern period. The fact that the Canterbury Tales is in a category all its own is interesting — is it because it has a very different writing style and frequency of words? The Bible is also right next to two modern texts by Wharton and Conrad, but still seems to be offset by itself as well — I’m not sure what to make of that. (Ashlyn Keil)

 

I believe that we do have “High quality” in our text mining as each tree branch is clearly separated by related/ individual authors. Additionally, the works are also grouped together by time period which makes sense as authors of the same period would be writing in similar styles and language. It is interesting to also note how some authors/texts of a single period are grouped together with different time periods. In particular, the majority of the modern texts are not really grouped together, but dispersed randomly. Such as Kipling who shows up next to Victorian era and Enlightenment texts. (Megan Valentine)

 

We explored the possibility of applying this to other content. I think that it might be interesting to look at Ethnic Literature or even Women’s Literature in comparison to some of the texts written by authors in the canon. My classmates had really interesting ideas as well:

 

Kyle Chavez writes, “I also would like to take a look at how this can be used to determine who wrote what differentiating on pseudonyms and such. I’d love to take a look at how this process was used to determine how they determined that Robert Galbraith was actually J.K. Rowling.”

 

Jesse Sanders writes, “I would be interested to set the bible as the Anon and try to see what author through history writes the most like it. Though this might be riddled with issues but I’d still find it interesting ¯\_(ツ)_/¯”

Conclusion: 

Like many of the tools featured on our site, this tool raised more questions for us. We also discovered some of its challenges. One challenge we thought of is that the tool couldn’t be applied to translations of texts. Another challenge, as you can see, is how difficult the dendrogram is to read. While these challenges exist, this exploratory tool is used for a variety of reasons across a wide range of disciplines. Some of these include:

  • Storing, organizing, and integrating huge amounts of unstructured data
  • Processing and analyzing the data
  • Extracting knowledge, insights, and predicting the future from the data
  • Examples:
    • Science:
      • Classification of plants and animals given their features
      • Clustering observed earthquake epicenters to identify dangerous zones
    • Business:
      • A telephone company can use clustering to determine the best locations for new towers, so that all its users receive optimum signal strength
      • A hospital care chain may use clustering to determine the most accident-prone areas in order to open a series of emergency care wards
      • Identify groups of motor insurance policy holders with a high average claim cost; identify fraud
      • Libraries use clustering for book ordering
    • Law Enforcement:
      • The DEA could use clustering to determine higher crime rate areas and then station their patrol vans accordingly
    • News:
      • Summarize news
    • Humanities: