Working With Messy Text

Heyo! I am doing my best to procrastinate here on a blustery Tuesday afternoon. So, I decided to share some code I’ve put together that solves problems in R that I used to solve in perl. HTML or C++ was probably my first real language, but I love the heck out of perl. It’s never done me wrong (unlike you, PHP).

Anyways! The context of this project is that we are developing a dictionary of words to complement the work done by Jonathan Haidt and Jesse Graham (learn more). I had a student who was interested in Moral Foundations Theory and its relationship to language, and we had tested some of the dictionary and found it to be frustratingly obtuse. Meaning that a lot of the words in it are great, but not things that people, like college freshmen or even me, were likely to say. She’s moved on to working with the founder of the LIWC - and even worked on the newest version of it :small brag:.

Now I have a second student who’s helping finish up some work on the dictionary, to see if what we were doing is worthwhile (spoiler alert: I don’t know). However, I thought I might share some of the code we were using and its context for people who are also trying to get into doing some of this text mining/cleaning/editing in R. You can find all the materials for this project, including the code in the context of our messy paper, on GitHub.

Here’s a view of what the data looks like (this isn’t even the messiest part, and part 2 of our study uses full written paragraphs):

> head(noout1$Q27)
[1] "doctors, babysitting"                    
[2] "criminals, doctors, shootings, medicine "
[3] "Health"                                  
[4] "physical healthiness, mental healthiness"
[5] "hurt, effect, love, protect"             
[6] "hurt, depression, pain"

So, a couple of things we have to deal with:

  • Mixed case
  • Punctuation
  • Stemming (affixes)
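
If you just wanted to knock out the first two by hand, base R will do it in a couple of lines. Here’s a quick sketch on a made-up string (the steps below end up handling both for me anyway, so this is just for reference):

> ##lowercase and strip punctuation on a toy string
> x = tolower("Doctors, Babysitting")
> gsub("[[:punct:]]", "", x)
[1] "doctors babysitting"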

Now, don’t hate on me, folks, but I love a good loop. I could probably do this with the apply family, but I didn’t (though there’s a sketch of that version a bit further down):

> ##stem the data; library(corpus) was loaded earlier
> for (i in 1:nrow(noout1)) {
+   noout1$Q27[i] = paste(unlist(
+     text_tokens(noout1$Q27[i], stemmer = "en")), collapse = " ")
+ }

Unpacking what this does:

  • Loops over each participant’s answers in Q27. I did this because text_tokens returns a list of lists, which I personally find troublesome to deal with, and I wanted to retain each person’s answers in one cell.
  • Uses text_tokens to “tokenize” the data (break it into individual words); stemmer = "en" is an argument that stems, or de-affixes, the words using English rules.
  • Unlists the list returned by text_tokens.
  • Pastes the updated data back to one cell. Be sure to use collapse here and not sep, as we want 1 item returned, and sep would just stick spaces between items if there were more than one.
> ##one example
> paste(unlist(
+     text_tokens(noout1$Q27[4], stemmer = "en")), collapse = " ")
[1] "physic healthi , mental healthi" ##one string
> paste(unlist(
+     text_tokens(noout1$Q27[4], stemmer = "en")), sep = " ")
[1] "physic"  "healthi" ","       "mental"  "healthi" ##five strings

Let’s look at the data now:

> head(noout1$Q27)
[1] "doctor , babysit"                 
[2] "crimin , doctor , shoot , medicin"
[3] "health"                           
[4] "physic healthi , mental healthi"  
[5] "hurt , effect , love , protect"   
[6] "hurt , depress , pain"    

You can see that the words have been stemmed and are now in lower case. We haven’t removed punctuation yet. There are lots of ways to do that, but since one of the next steps does it for me, I won’t cover those here. The next step requires the tm library; I bet the corpus library can do similar things, but I’m just more familiar with tm. We will create a corpus out of the vector of participant answers I have:

> ##create a corpus
> harm_corpus = Corpus(VectorSource(noout1$Q27))
> harm_TDM = as.matrix(TermDocumentMatrix(harm_corpus,
+                               control = list(removePunctuation = TRUE,
+                                              stopwords = TRUE)))

The Corpus step simply creates a big list of all the “documents” (here, each participant is treated as a separate document, which is what I want) from a vector, rather than reading separate documents in from files. The TermDocumentMatrix function creates a giant matrix wherein:

  • Terms (words) are rows
  • Documents (participants) are columns
  • Each row-column combination stores the number of times that term appeared in that document.

These can get real big, real fast, fyi. The nice thing about the TermDocumentMatrix function is that it handled the punctuation for me via removePunctuation = TRUE and also dealt with the stop words. Stop words are things like the, an, a, of that are traditionally removed from these types of analyses, which focus on content words over helper words.
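
Two quick checks I like at this point: dim(harm_TDM) will tell you just how big the matrix ended up, and tm will show you its built-in English stop word list if you want to see exactly what gets dropped. A small sketch:

> ##peek at the stop words tm removes by default
> head(stopwords("english"))
[1] "i"      "me"     "my"     "myself" "we"     "our"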

> harm_TDM[1:6, 1:6]
         Docs
Terms     1 2 3 4 5 6
  babysit 1 0 0 0 0 0
  doctor  1 1 0 0 0 0
  crimin  0 1 0 0 0 0
  medicin 0 1 0 0 0 0
  shoot   0 1 0 0 0 0
  health  0 0 1 0 0 0

Great, now what can I do with that? Everything! Here’s what we did: found the most frequent words by creating a data.frame that is a frequency table (thanks, StackOverflow!):

> ##view the most frequent words
> harm_freq = data.frame(Word = rownames(harm_TDM),
+                        Freq = rowSums(harm_TDM),
+                        row.names = NULL)
> harm_freq$Word = as.character(harm_freq$Word)
> harm_freq$percent = harm_freq$Freq/nrow(noout1) *100
> head(harm_freq)
     Word Freq    percent
1 babysit    1  0.2298851
2  doctor   52 11.9540230
3  crimin    6  1.3793103
4 medicin    5  1.1494253
5   shoot    1  0.2298851
6  health   16  3.6781609
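
To see which words actually top the list (rather than just eyeballing head()), you can sort the frequency table by Freq. Something like this pulls the ten most common words (I’ll spare you the full output):

> ##ten most frequent words
> head(harm_freq[order(harm_freq$Freq, decreasing = TRUE), ], 10)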

Doctor is in the top 5; other big words included hurt, love, pain, and hospit(al). In this prompt, participants were free associating with the harm/care foundation. Now the tricky part was to combine this data back with my other data frame, which included participant information such as their Moral Foundations Questionnaire scores:

> harm_words = harm_freq$Word[harm_freq$percent >=1]
> head(harm_words)
[1] "doctor"  "crimin"  "medicin" "health"  "mental"  "physic" 

First, I created a list of harm words that were mentioned at least 1% of the time. Then I used the transpose function t() to flip the dataset from words as rows to words as columns, to maintain “tidy-ish” data (i.e., each participant is their own row). Finally, I subset the dataset down to only my top words:

> harm_TDM = as.data.frame(t(harm_TDM))
> harm_TDM = harm_TDM[ , harm_words]
> harm_TDM[1:6, 1:6]
  doctor crimin medicin health mental physic
1      1      0       0      0      0      0
2      1      1       1      0      0      0
3      0      0       0      1      0      0
4      0      0       0      0      1      1
5      0      0       0      0      0      0
6      0      0       0      0      0      0

Now, we can cbind our harm word matrix back onto the other columns relevant for harm:

> harm_final = cbind(noout1[ , c("ResponseId", "Q15_1", "Q23", "harmMFQ")],
+                    harm_TDM)
> harm_final[1:6, 1:6]
         ResponseId Q15_1         Q23 harmMFQ doctor crimin
1 R_2BkYH8gEtZMEQnG     8    Democrat      18      1      0
2 R_qCTluTnJCgGFqXT     6    Democrat      18      1      1
3 R_11hglRVpaSclG0K     5  Republican      13      0      0
4 R_3kMsBrEjwDtu5iJ     6 Independent      16      0      0
5 R_swkbG8889YEOxoZ     3  Republican      14      0      0
6 R_s682tzsz2YIkwJX    10    Democrat      17      0      0

So, now you too can create participant term-document matrices! In later posts, I’ll show you how we are going to use this information to create an updated dictionary and examine whether that dictionary relates to the Moral Foundations Questionnaire. That task will involve some correlations, but also a multi-trait multi-method analysis using lavaan, so stay tuned if you are interested in structural equation modeling.
