Hadley Wickham: Managing many models with R

By: Psychology at the University of Edinburgh


Uploaded on 05/11/2016

Hadley Wickham is Chief Scientist at RStudio, and an Adjunct Professor of Statistics at the University of Auckland. This talk has been organised by EdinbR (The Edinburgh R User Group, http://www.edinbR.org, represented at the event by Caterina Constantinescu, Psychology PhD candidate at the University of Edinburgh), and was kindly supported by The Data Lab, MBN Solutions and the School of Philosophy, Psychology and Language Sciences at the University of Edinburgh. The talk summary is presented below:
---
Visualisation alone is not enough to solve most data analysis challenges. The data may be too big or too messy to show in a single plot. In this talk, Hadley outlines his current thinking about how the synthesis of visualisation, modelling, and data manipulation allows you to effectively explore and understand large and complex datasets. There are three key ideas:
1. Using tidyr to make a nested data frame, where one column is a list of data frames.
2. Using purrr to apply functional programming tools instead of writing for loops.
3. Visualising models by converting them to tidy data with broom, by David Robinson.
This work is embedded in R so Hadley not only talks about the ideas, but shows concrete code for working with large sets of models. You'll see how you can combine the dplyr and purrr packages to fit many models, then use tidyr and broom to convert to tidy data which can be visualised with ggplot2.
---
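As a rough illustration of the three ideas in the summary, here is a minimal sketch on the built-in mtcars data. The dataset, the grouping by cyl and the formula mpg ~ wt are assumptions chosen for the example; they are not taken from the talk.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(broom)
library(ggplot2)

# 1. Nested data frame: one row per cylinder count, with a list column `data`
by_cyl <- mtcars %>%
  group_by(cyl) %>%
  nest() %>%
  ungroup()

# 2. Functional programming: fit one linear model per group with map()
by_cyl <- by_cyl %>%
  mutate(model = map(data, ~ lm(mpg ~ wt, data = .x)))

# 3. Tidy data: turn each model into a data frame of coefficients and unnest
coefs <- by_cyl %>%
  mutate(tidied = map(model, tidy)) %>%
  select(cyl, tidied) %>%
  unnest(tidied)

# Build a plot of the estimated slope for each group (print(p) to display it)
p <- coefs %>%
  filter(term == "wt") %>%
  ggplot(aes(factor(cyl), estimate)) +
  geom_point()
```

The result coefs is an ordinary data frame with one row per group and coefficient, which is what makes the ggplot2 step straightforward.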

Comments (7):

By anonymous    2017-09-20

i would recommend watching this video from hadley wickham when you have the time. it relates very much to your challenge.

this also seems like a classic split-apply-combine problem, so my first thought is to consider the tidyverse. here is some code that might help you:

library(tidyverse)
library(randomForest)

# y is a placeholder for your response column, which the question does not name
df2 <- df %>%
  group_by(cl) %>%
  nest() %>%
  mutate(rfcol = map(data, ~ randomForest(y ~ Work + Age, data = .x)))

basically a new column has been created that contains the random forest fitted to the rows that share a given value of cl. you can explore the details of each model by looking at df2$rfcol[[2]]

to summarize what's going on, group_by and nest split the data into one data frame per cl value and store them in a list column called data. map inside mutate then fits randomForest to each of those data frames, with .x referring to the data frame for the current group. (a bare . inside mutate would refer to the whole data frame coming through the pipe, not to each group.)

hope this helps. but as noted, try watching that video from hadley wickham if you have the time. it will really explain how to think about these types of problems in detail.
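For a version that can be run as is, here is a sketch of the per-group pattern on the built-in mtcars data. The dataset, the grouping by cyl, the formula mpg ~ wt + hp and ntree = 100 are all assumptions made for illustration. Note that a bare . inside a plain mutate() refers to the whole data frame coming through the pipe, so nesting first is what makes each group's data available to the model call.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(randomForest)

set.seed(1)  # random forests are stochastic, so fix the seed for repeatability

fits <- mtcars %>%
  group_by(cyl) %>%
  nest() %>%                # one row per cyl, group rows stored in `data`
  ungroup() %>%
  mutate(rf = map(data, ~ randomForest(mpg ~ wt + hp, data = .x, ntree = 100)))

# Inspect the forest fitted to the second group
fits$rf[[2]]
```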


By anonymous    2017-09-20

This is a really good application of the tidyr::nest() function in conjunction with purrr and broom. What you do is:
- Group the data frame and nest it
- Apply a model with mutate(mod = map(data, model))
- Summarize the model using broom::tidy()
- Extract the relevant statistics.

For more on this here's a great talk by Hadley on the subject: https://www.youtube.com/watch?v=rz3_FDVt9eg

In your case I think you can do something like this:

library(tidyverse)
library(broom)
diamonds %>%
  group_by(cut) %>%
  nest() %>%
  mutate(
    model1 = map(data, ~lm(price~carat, data=.)),
    model2 = map(data, ~lm(price~carat+depth, data=.))
  ) %>%
  mutate(anova = map2(model1, model2, ~anova(.x,.y))) %>%
  mutate(tidy_anova = map(anova, broom::tidy)) %>%
  mutate(p_val = map_dbl(tidy_anova, ~.$p.value[2])) %>%
  select(p_val)
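To see what each mapped step produces, the same comparison can be run by hand for a single cut. This is only a sketch: picking the "Ideal" cut and the names ideal, m1, m2 and cmp are choices made for illustration.

```r
library(ggplot2)  # provides the diamonds data set
library(broom)

# The rows that nest() would store in the `data` cell for one cut
ideal <- subset(diamonds, cut == "Ideal")

# The two nested models compared in the pipeline
m1 <- lm(price ~ carat, data = ideal)
m2 <- lm(price ~ carat + depth, data = ideal)

# tidy() turns the anova table into a two-row data frame; the first row has no
# test (p.value is NA) and the second row holds the F-test p-value
cmp <- tidy(anova(m1, m2))
cmp$p.value[2]
```

This is why the pipeline indexes p.value with [2] before collecting one number per group with map_dbl.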


By anonymous    2018-01-01

It helped me to look at this post http://omaymas.github.io/Climate_Change_ExpAnalysis/, and this video https://www.youtube.com/watch?v=rz3_FDVt9eg to understand how to best use purrr and broom together. As G. Grothendieck points out, I can add a column with models to a data frame (where each cell is a full model). The way to do this with the map function is

 duneJ %>% group_by(Species) %>% nest %>% mutate(Mod = map(data, my_lm0)) -> test

Here, nest is the key function: it creates a list column of data frames, one per species, and stores it under the default name "data". I run map inside the mutate function to save the models to yet another column, where each cell is a fitted model.

If I want to look at model results, I can combine them into a list of data frames with map and broom, select the relevant data, and then unnest them, like so:

 test %>% mutate(Glance = map(Mod, glance)) %>% select(Species, Glance) %>% unnest(Glance)

This gets me a new data frame that has model results for each species, which was what I was ultimately aiming for, even if I didn't fully explain that in the question.
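Neither duneJ nor my_lm0 is shown in the thread, so here is a self-contained sketch of the same two steps on the built-in iris data. The my_lm0 defined below is a hypothetical stand-in, and the formula Sepal.Length ~ Sepal.Width is an assumption made for illustration.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(broom)

# Hypothetical stand-in for my_lm0, whose definition is not given in the thread
my_lm0 <- function(df) lm(Sepal.Length ~ Sepal.Width, data = df)

# One row per species, with the fitted model stored in the list column `Mod`
test <- iris %>%
  group_by(Species) %>%
  nest() %>%
  ungroup() %>%
  mutate(Mod = map(data, my_lm0))

# One row of model-level statistics per species
results <- test %>%
  mutate(Glance = map(Mod, glance)) %>%
  select(Species, Glance) %>%
  unnest(Glance)

results
```

Naming the column in unnest(Glance) keeps the call working in recent tidyr versions, which ask for the columns to unnest explicitly.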


By anonymous    2018-02-18

Relevant https://www.youtube.com/watch?v=rz3_FDVt9eg


By anonymous    2018-03-26

The thing that you're missing in your dplyr pipe is `purrr::map`. Can I suggest that you check out this video of Hadley explaining how to solve exactly this problem while you wait for potential answers to be posted below: https://www.youtube.com/watch?v=rz3_FDVt9eg Slide deck here: https://speakerdeck.com/hadley/managing-many-models

