Recently, a colleague of mine was working on an the analysis of the results of a standarized test given to a few hundred students. Part of the work consisted of an analysis of the word difficulty, cognate status, and thematic categorization. In order to do this, she had to combine information from the test result with information from the categorisation description. This would have been a no brainer had the two tables been of an equal length. However, the results table had one entry per item(word) per student. Since we had 225 items and 400 students the results table had 90000 rows, whereas the categorization table had just one row per item, resulting in 225 rows.
Using MS Excel this task could be solved with a fairly complex IF statement. However, jumping between Excel and R disturbs my workflow, so I’d like to stay in R. Also, the readability of even simple IF statements in Excel is very low, making it vulnerable to typing mistakes that in the end mess up the analysis. So how is it done using R? There are actually several good options, but let’s first get some data to play with:
score <- data.frame(StudentID=1:500, grade=1:2, intervention=1:2, word1=0:1, word2=0:1, word3=0:1, word4=0:1, word5=0:1, word6=0:1, word7=0:1, word8=0:1, word9=0:1, word10=0:1) df <- data.frame(Word=1:10, WordCat=1:2) df$Word <- paste0("word",df$Word)
This code gives us a dataframe with test results (okay – not really, but let’s just pretend that these numbers aren’t random) for each student (score) and a dataframe with categoriations of each item (df). However, what we wanted was a dataframe in which each row represents one item per student. We can do that by using the function ‘melt’ from the package ‘reshape’:
library(reshape) scoreWords <- melt(score, id=c(1:3))
If ‘reshape’ is not installed on your system yet you can install it by typing ‘install.packages(reshape)’. With that in place we can now start combining our data.
Take one: Merge
As always, I quickly found some advice on Stack Overflow. A common suggestion is to use the function ‘merge’ to combine the two dataframes, so let’s see how this works. Conceptually, this is how ‘merge’ works:
DfWithMergedData <- merge(df1[, c("variable1", "variable2")], df2, by="variable1")
Essentially, this tells R to create a new dataframe called DfWithMergedData and to merge variables 1 and 2 from df1 with variable1 from df2. That is, we tell R variable 1 in df1 corresponds to variable 1 in df2 and that the new dataframe should be constructed accordingly.
Let’s also see how this works with data. First we need to rename the variable ‘variable’ in scoreWords to ‘Word’ so that R knows how to merge the two dataframes:
names(scoreWords)[names(scoreWords)=="variable"] <- "Word"
Finally, we can get on to the actual merging:
mergedf <- merge(scoreWords[, c(1:5)], df, by="Word")
I must admit that I like this method, it’s a fast way to combine data dataframes and the syntax is way more simple than any Excel IF-statement ever would be. Plus, if you need to do this operation again on another set of data it’s much just to copy paste the code and change the appropriate variable names rather than copying or rewriting and IF statement in Excel and hope for the best.
Take two: Lookup
An alternative to merge is the function ‘lookup’ from the package qdaptools. This method requires a bit more prepatory work, but supposedly it’s a lot faster than merge. Of course you’d need millions of rows to notice the difference rather than the meager ninety thousand we’re juggling with here.
The first thing you need is a package called ‘devtools’, which you install in R like any other package:
Then you install qdapTools:
Remember also to load the packages once they’re installed:
Now, we’re ready again to get some data to play with. We’ll load the same data as we did in take one, and transform the data using reshape:
score <- data.frame(StudentID=1:500, grade=1:2, intervention=1:2, word1=0:1, word2=0:1, word3=0:1, word4=0:1, word5=0:1, word6=0:1, word7=0:1, word8=0:1, word9=0:1, word10=0:1) df <- data.frame(Word=1:10, WordCat=1:2) df$Word <- paste0("word",df$Word) scoreWords <- melt(score, id=c(1:3))
Now instead of merge we write:
scoreWords$cat <- lookup(scoreWords$variable, df[, 1:2])
Let’s break down the syntax. First we’re telling R to create a new variable in the dataFrame ‘scoreWords’ called ‘cat’. Note, that rather than creating a new merged dataframe we’re instead appending an existing dataframe. Next, we’re telling R that the variable ‘variable in the dataframe ‘scoreWords, is the common denominator between the two dataframes. Finally, df[, 1:2] tells R that column number 1 in the dataframe ‘df’ corresponds to the variable ‘variable’ in the dataframe ‘scoreWords’, and to paste the contents coloumn 2 from ‘df’ in the ‘cat’ variable. In this way R checks every row in the ‘scoreWords’ dataframe and assigns it a category, based on the definitions in the dataframe ‘df’.
Once the two dataframes are combined we can run a regression analysis (or any other type of analysis) on word level and include category (cat) as one the predictor variables in the model.
Stack Overflow is always a great place to check out, and my first go-to place when I need help with just about anything.