When applying a linear model to a dataset you often want to see which effect an independent (or predictor) variable has on an a dependent (or outcome) variable. This is quite easy to accomplish in R, SPSS or any other statistical software package. However, this task proves quite menial when you have many dependent variables. In this post we’ll loop through severeal variables with just a few lines of code.

#### Basic linear model

Building a linear model in R is quite easy:

modelName <- lm(dv~iv, data=datasetName)

Where dv denotes your dependent variable and iv denotes your independent variable.

To display the results you can use `summary`

summary(modelName)

Now imagine that you have 10 variables you want to test, or 50, or 100. That’s quite a number of code and lots of copy/paste commands.

#### Loop your way out of trouble

Rather than copy/pasting yourself to death you can create a loop, where each iteration of the loop corresponds to a variable in your dataset. The first thing you need to do is to create a list of the names of variables you want to run you model on:

dvList <- names(datasetName)[3:10]

The above code creates a new variable called `dvList`

and stores the variable names from coloumn 3 to 10 from the dataframe `datasetName`

in it.

Next, we create our model within a `lapply`

loop like this:

model <- lapply(dvList, function(x) { lm(substitute(i~iv, list(i = as.name(x))), data = datasetName)})

You can also run different type of models, for example the general linear model (glm) rather than the linear model (lm) we’re using here.

You can also add more independent variables and interaction terms just as you normally would using + and * after the tilde (~) symbol:

model <- lapply(dvList, function(x) { lm(substitute(i~iv+iv, list(i = as.name(x))), data = datasetName)})

The final step is to summarize the model for each of the variables we selected earlier. Again, this is done with the help of `lapply`

:

lapply(model, summary)

You should see the summary of eight linear models, one for each of the variables we defined in dvList. Unfortunately, R won’t label each of the summaries with their variable name. Instead the model summaries are listed in the same order as the variable list in `dvList`

. Again, you can replace `summary`

with for example Anova (from the car package) if you want to run an ANOVA instead.

Here’s a full working example using the mtcars dataset:

mtcars$cyl <- factor(mtcars$cyl,levels=c(4,6,8), labels=c("4cyl","6cyl","8cyl")) dvList <- names(mtcars)[3:10] model <- lapply(dvList, function(x) { lm(substitute(i~cyl, list(i = as.name(x))), data = mtcars)}) lapply(model, summary)

#### Last tweaks and tips

Often times an independent variable will have more than two levels, which means that you would have to run a post-hoc analysis (e.g. Tukey) on each of the models that yielded significant results. However, this too can be automated:

lapply(model, function(model) TukeyHSD(aov(model)))

Again, R will number each of the post-hoc tests according to the order of the variables in `dvList`

.

This blog post makes the assumption that you want to loop a list of dependent variables through a certain model. However, you might also want to loop through a list of different independent variables. This is also possible, you simply have to replace “iv” with “i” and “i” with dv.

#### Acknowledgements

Most of the code in this post is not my own, I merely “pasted” it together in this post. I originally found the code for the lapply here. Eric Lecoutre from Stackoverflow.com provided me with the code for the tukey loop.

This is very helpful, thank you. However, I was hoping you could explain why it is necessary to use:

substitute(i~iv, list(i = as.name(x)))

instead of simply putting in x directly:

x~iv

or

as.name(x)~iv

I tried all options in my code and obviously your solution is the only one that worked, but could you explain why? What sort of object is x, and how does feeding it through list(i=as.name(x)) modify it?

Thanks so much!

Hi Anon,

I don’t know if I can give a thorough enough answer, but I’ll try. The list created in the first step (dvList) functions as a point of reference between the function (which defines the linear model) and the dependent variables on which you want to run the model on. Without looping through this list your model won’t be fed any dependent variables. In other words, you’re asking the function to substitute i with the first item on your dvList, which again is matched against that same item in your dataframe. It does this until it has matched all the items in the dvList. I hope this helps!