Functional Programming with Purrr

In this post I’m going to show some cool feature of Purrr. Purrr is an R package for functional programming. I have always been fascinated by functional programming. I first heard about it while I was reading Hadley R for Data Science book. With this approach, not only it makes our code more succinct and clear, but more expressive. Without further ado lets dive straight in. For this exercise, I’m using mtcars data set.

Lets load some libraries like tidyverse, broom and tidytext.

Suppose if we want to run regression model on 3 sets of data sets grouped by certain feature (cyl in our case), we need to regress data set separately.

library(tidyverse)
library(tidytext)
library(broom)

Regression for dataset with cyl == 8

cyl8 <- mtcars |> 
  filter(cyl == 8)

summary(lm(mpg~wt, data=cyl8))
## 
## Call:
## lm(formula = mpg ~ wt, data = cyl8)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.1491 -1.4664 -0.8458  1.5711  3.7619 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  23.8680     3.0055   7.942 4.05e-06 ***
## wt           -2.1924     0.7392  -2.966   0.0118 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.024 on 12 degrees of freedom
## Multiple R-squared:  0.423,  Adjusted R-squared:  0.3749 
## F-statistic: 8.796 on 1 and 12 DF,  p-value: 0.01179

So we have to do like this for each cyl. One thing you might have noticed is that we have separate results. Imagine doing some analysis on bigger data set and having to keep track of each result separately. It would be a nightmare. You might say we can achieve by using loops but they have their own disadvantages. We can achieve this easily by use of Purrr. Another thing which is crucial is use of ‘tibble’. A tibble is similar to traditional data frame but much more efficient. I like mostly of the fact that it can store ‘list columns’. Let me show this.

nested<-mtcars |> 
  nest(-cyl)
## Warning: All elements of `...` must be named.
## Did you want `data = -cyl`?
nested
## # A tibble: 3 × 2
##     cyl data              
##   <dbl> <list>            
## 1     6 <tibble [7 × 10]> 
## 2     4 <tibble [11 × 10]>
## 3     8 <tibble [14 × 10]>

Here if you look at column data, it’s a list column. Each entry is a separate data frame. It’s like an entire Excel spreadsheet stored into that tiny cell. And this is made possible by using nest function from tidyr package. Lets see what’s inside one of them.

nested$data[[1]]
## # A tibble: 7 × 10
##     mpg  disp    hp  drat    wt  qsec    vs    am  gear  carb
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1  21    160    110  3.9   2.62  16.5     0     1     4     4
## 2  21    160    110  3.9   2.88  17.0     0     1     4     4
## 3  21.4  258    110  3.08  3.22  19.4     1     0     3     1
## 4  18.1  225    105  2.76  3.46  20.2     1     0     3     1
## 5  19.2  168.   123  3.92  3.44  18.3     1     0     4     4
## 6  17.8  168.   123  3.92  3.44  18.9     1     0     4     4
## 7  19.7  145    175  3.62  2.77  15.5     0     1     5     6

Ok! as mentioned each entry of that list is a separate data frame.

Now we can run regression on each entry of that list and store each model into another list column. For this we have to use map() function, which helps to iterate on each entry of that list and run regression.

model_nested<-nested |> 
  mutate(model = map(.x=data,.f =~lm(mpg~wt, data=.)))

model_nested
## # A tibble: 3 × 3
##     cyl data               model 
##   <dbl> <list>             <list>
## 1     6 <tibble [7 × 10]>  <lm>  
## 2     4 <tibble [11 × 10]> <lm>  
## 3     8 <tibble [14 × 10]> <lm>

Here, mutate is generic command to create a column called model. The meat of the operation starts with map function.

Purrr have a different variant of map function. We have map(), map_int(), map_dbl(), map_chr(), map_lgl(). As you may have guessed, each one returns certain kind of data like map_int() returns Integers, map_dbl() returns Doubles, map() always returns list.

So in our above code, we can be certain that our result will be a list. You may be wondering what is (~)? It denotes an anonymous function, a function which is defined on a fly. so, lm(mpg~wt) denotes a linear regression being run with ‘mpg’ against wt. The (.) denotes the current data frame in that context. So what map has done is run 3 regression models and stored the respective results under model column. We can see what’s the first entry

model_nested$model[[1]]
## 
## Call:
## lm(formula = mpg ~ wt, data = .)
## 
## Coefficients:
## (Intercept)           wt  
##       28.41        -2.78

As we see, it’s one of the models. We can get more info into it by running summary function.

summary(model_nested$model[[1]])
## 
## Call:
## lm(formula = mpg ~ wt, data = .)
## 
## Residuals:
##       1       2       3       4       5       6       7 
## -0.1250  0.5840  1.9292 -0.6897  0.3547 -1.0453 -1.0080 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   28.409      4.184   6.789  0.00105 **
## wt            -2.780      1.335  -2.083  0.09176 . 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.165 on 5 degrees of freedom
## Multiple R-squared:  0.4645, Adjusted R-squared:  0.3574 
## F-statistic: 4.337 on 1 and 5 DF,  p-value: 0.09176

But why do summary separately, if we know how to Purrr

model_nested_summarised<-model_nested |>  
  mutate(model_summary = map(.x = model, .f = ~summary(.)))

model_nested_summarised
## # A tibble: 3 × 4
##     cyl data               model  model_summary
##   <dbl> <list>             <list> <list>       
## 1     6 <tibble [7 × 10]>  <lm>   <smmry.lm>   
## 2     4 <tibble [11 × 10]> <lm>   <smmry.lm>   
## 3     8 <tibble [14 × 10]> <lm>   <smmry.lm>

Again, here we can see ‘model_summary’ is a list column which stores summaries of each model. The first entry should be the same as above result.

model_nested_summarised$model_summary[[1]]
## 
## Call:
## lm(formula = mpg ~ wt, data = .)
## 
## Residuals:
##       1       2       3       4       5       6       7 
## -0.1250  0.5840  1.9292 -0.6897  0.3547 -1.0453 -1.0080 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   28.409      4.184   6.789  0.00105 **
## wt            -2.780      1.335  -2.083  0.09176 . 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.165 on 5 degrees of freedom
## Multiple R-squared:  0.4645, Adjusted R-squared:  0.3574 
## F-statistic: 4.337 on 1 and 5 DF,  p-value: 0.09176

We can extract the necessary information from those list columns separately.

model_nested_summarised$model_summary[[1]][["r.squared"]]
## [1] 0.4645102

But lets go one step and get all r.squared values of each model and store them under separate column with the help of Purrr.

model_nested_summarised |>  
  mutate(rsqr = map_dbl(.x=model_summary,.f = ~.[["r.squared"]]))
## # A tibble: 3 × 5
##     cyl data               model  model_summary  rsqr
##   <dbl> <list>             <list> <list>        <dbl>
## 1     6 <tibble [7 × 10]>  <lm>   <smmry.lm>    0.465
## 2     4 <tibble [11 × 10]> <lm>   <smmry.lm>    0.509
## 3     8 <tibble [14 × 10]> <lm>   <smmry.lm>    0.423

Here we used map_dbl() as we want our result as Doubles not a list. Thus,we managed to get all the requested values under one data frame. This is neat.

Now let’s try all of this using a package called broom. It presents all the result into a tidy format. I will compact all into one single block this time.

all_in_one<- mtcars |> 
  nest(-cyl) |> 
  mutate(model = map(.x = data, .f = ~lm(mpg~wt, data = .)),
         tidied = map(model,glance)) |> 
  unnest(tidied)
## Warning: All elements of `...` must be named.
## Did you want `data = -cyl`?
all_in_one
## # A tibble: 3 × 15
##     cyl data     model  r.squared adj.r.squared sigma statistic p.value    df
##   <dbl> <list>   <list>     <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>
## 1     6 <tibble> <lm>       0.465         0.357  1.17      4.34  0.0918     1
## 2     4 <tibble> <lm>       0.509         0.454  3.33      9.32  0.0137     1
## 3     8 <tibble> <lm>       0.423         0.375  2.02      8.80  0.0118     1
## # … with 6 more variables: logLik <dbl>, AIC <dbl>, BIC <dbl>, deviance <dbl>,
## #   df.residual <int>, nobs <int>

We use glance function to return all the output on each row-wise basis for each data frame. We can now use this output straight to plot some nice figures

Oluwafemi Oyedele
Oluwafemi Oyedele
MSc in Agrometeorology

My research interests include Agrometeorology, Soil Fertility and Machine Learning with tidymodel.