Functional Programming with Purrr
In this post I’m going to show some cool feature of Purrr. Purrr is an R package for functional programming. I have always been fascinated by functional programming. I first heard about it while I was reading Hadley R for Data Science book. With this approach, not only it makes our code more succinct and clear, but more expressive. Without further ado lets dive straight in. For this exercise, I’m using mtcars
data set.
Lets load some libraries like tidyverse, broom and tidytext.
Suppose if we want to run regression model on 3 sets of data sets grouped by certain feature (cyl in our case), we need to regress data set separately.
library(tidyverse)
library(tidytext)
library(broom)
Regression for dataset with cyl == 8
cyl8 <- mtcars |>
filter(cyl == 8)
summary(lm(mpg~wt, data=cyl8))
##
## Call:
## lm(formula = mpg ~ wt, data = cyl8)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.1491 -1.4664 -0.8458 1.5711 3.7619
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 23.8680 3.0055 7.942 4.05e-06 ***
## wt -2.1924 0.7392 -2.966 0.0118 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.024 on 12 degrees of freedom
## Multiple R-squared: 0.423, Adjusted R-squared: 0.3749
## F-statistic: 8.796 on 1 and 12 DF, p-value: 0.01179
So we have to do like this for each cyl
. One thing you might have noticed is that we have separate results. Imagine doing some analysis on bigger data set and having to keep track of each result separately. It would be a nightmare. You might say we can achieve by using loops but they have their own disadvantages. We can achieve this easily by use of Purrr. Another thing which is crucial is use of ‘tibble’. A tibble is similar to traditional data frame but much more efficient. I like mostly of the fact that it can store ‘list columns’. Let me show this.
nested<-mtcars |>
nest(-cyl)
## Warning: All elements of `...` must be named.
## Did you want `data = -cyl`?
nested
## # A tibble: 3 × 2
## cyl data
## <dbl> <list>
## 1 6 <tibble [7 × 10]>
## 2 4 <tibble [11 × 10]>
## 3 8 <tibble [14 × 10]>
Here if you look at column data
, it’s a list column. Each entry is a separate data frame. It’s like an entire Excel spreadsheet stored into that tiny cell. And this is made possible by using nest
function from tidyr package. Lets see what’s inside one of them.
nested$data[[1]]
## # A tibble: 7 × 10
## mpg disp hp drat wt qsec vs am gear carb
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 21 160 110 3.9 2.62 16.5 0 1 4 4
## 2 21 160 110 3.9 2.88 17.0 0 1 4 4
## 3 21.4 258 110 3.08 3.22 19.4 1 0 3 1
## 4 18.1 225 105 2.76 3.46 20.2 1 0 3 1
## 5 19.2 168. 123 3.92 3.44 18.3 1 0 4 4
## 6 17.8 168. 123 3.92 3.44 18.9 1 0 4 4
## 7 19.7 145 175 3.62 2.77 15.5 0 1 5 6
Ok! as mentioned each entry of that list is a separate data frame.
Now we can run regression on each entry of that list and store each model into another list column. For this we have to use map() function, which helps to iterate on each entry of that list and run regression.
model_nested<-nested |>
mutate(model = map(.x=data,.f =~lm(mpg~wt, data=.)))
model_nested
## # A tibble: 3 × 3
## cyl data model
## <dbl> <list> <list>
## 1 6 <tibble [7 × 10]> <lm>
## 2 4 <tibble [11 × 10]> <lm>
## 3 8 <tibble [14 × 10]> <lm>
Here, mutate is generic command to create a column called model
. The meat of the operation starts with map
function.
Purrr have a different variant of map function. We have map(), map_int(), map_dbl(), map_chr(), map_lgl(). As you may have guessed, each one returns certain kind of data like map_int() returns Integers, map_dbl() returns Doubles, map() always returns list.
So in our above code, we can be certain that our result will be a list. You may be wondering what is (~)
? It denotes an anonymous function, a function which is defined on a fly. so, lm(mpg~wt)
denotes a linear regression being run with ‘mpg’ against wt
. The (.) denotes the current data frame in that context. So what map has done is run 3 regression models and stored the respective results under model
column. We can see what’s the first entry
model_nested$model[[1]]
##
## Call:
## lm(formula = mpg ~ wt, data = .)
##
## Coefficients:
## (Intercept) wt
## 28.41 -2.78
As we see, it’s one of the models. We can get more info into it by running summary function.
summary(model_nested$model[[1]])
##
## Call:
## lm(formula = mpg ~ wt, data = .)
##
## Residuals:
## 1 2 3 4 5 6 7
## -0.1250 0.5840 1.9292 -0.6897 0.3547 -1.0453 -1.0080
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 28.409 4.184 6.789 0.00105 **
## wt -2.780 1.335 -2.083 0.09176 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.165 on 5 degrees of freedom
## Multiple R-squared: 0.4645, Adjusted R-squared: 0.3574
## F-statistic: 4.337 on 1 and 5 DF, p-value: 0.09176
But why do summary separately, if we know how to Purrr
model_nested_summarised<-model_nested |>
mutate(model_summary = map(.x = model, .f = ~summary(.)))
model_nested_summarised
## # A tibble: 3 × 4
## cyl data model model_summary
## <dbl> <list> <list> <list>
## 1 6 <tibble [7 × 10]> <lm> <smmry.lm>
## 2 4 <tibble [11 × 10]> <lm> <smmry.lm>
## 3 8 <tibble [14 × 10]> <lm> <smmry.lm>
Again, here we can see ‘model_summary’ is a list column which stores summaries of each model. The first entry should be the same as above result.
model_nested_summarised$model_summary[[1]]
##
## Call:
## lm(formula = mpg ~ wt, data = .)
##
## Residuals:
## 1 2 3 4 5 6 7
## -0.1250 0.5840 1.9292 -0.6897 0.3547 -1.0453 -1.0080
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 28.409 4.184 6.789 0.00105 **
## wt -2.780 1.335 -2.083 0.09176 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.165 on 5 degrees of freedom
## Multiple R-squared: 0.4645, Adjusted R-squared: 0.3574
## F-statistic: 4.337 on 1 and 5 DF, p-value: 0.09176
We can extract the necessary information from those list columns separately.
model_nested_summarised$model_summary[[1]][["r.squared"]]
## [1] 0.4645102
But lets go one step and get all r.squared
values of each model and store them under separate column with the help of Purrr
.
model_nested_summarised |>
mutate(rsqr = map_dbl(.x=model_summary,.f = ~.[["r.squared"]]))
## # A tibble: 3 × 5
## cyl data model model_summary rsqr
## <dbl> <list> <list> <list> <dbl>
## 1 6 <tibble [7 × 10]> <lm> <smmry.lm> 0.465
## 2 4 <tibble [11 × 10]> <lm> <smmry.lm> 0.509
## 3 8 <tibble [14 × 10]> <lm> <smmry.lm> 0.423
Here we used map_dbl()
as we want our result as Doubles not a list. Thus,we managed to get all the requested values under one data frame. This is neat.
Now let’s try all of this using a package called broom
. It presents all the result into a tidy format. I will compact all into one single block this time.
all_in_one<- mtcars |>
nest(-cyl) |>
mutate(model = map(.x = data, .f = ~lm(mpg~wt, data = .)),
tidied = map(model,glance)) |>
unnest(tidied)
## Warning: All elements of `...` must be named.
## Did you want `data = -cyl`?
all_in_one
## # A tibble: 3 × 15
## cyl data model r.squared adj.r.squared sigma statistic p.value df
## <dbl> <list> <list> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 6 <tibble> <lm> 0.465 0.357 1.17 4.34 0.0918 1
## 2 4 <tibble> <lm> 0.509 0.454 3.33 9.32 0.0137 1
## 3 8 <tibble> <lm> 0.423 0.375 2.02 8.80 0.0118 1
## # … with 6 more variables: logLik <dbl>, AIC <dbl>, BIC <dbl>, deviance <dbl>,
## # df.residual <int>, nobs <int>
We use glance
function to return all the output on each row-wise basis for each data frame. We can now use this output straight to plot some nice figures