Bayesian robust correlations with brms (and why you should love Student’s $t$)

[edited June 18, 2019]

In this post, we’ll show how Student’s \(t\)-distribution can produce better correlation estimates when your data have outliers. As is often the case, we’ll do so as Bayesians.

This post is a direct consequence of Adrian Baez-Ortega’s great blog, “Bayesian robust correlation with Stan in R (and why you should use Bayesian methods)”. Baez-Ortega worked out the approach and code for direct use with Stan computational environment. That solution is great because Stan is free, open source, and very flexible. However, Stan’s interface might be prohibitively technical for non-statistician users. Happily, the brms package allows users to access the computational power of Stan through a simpler interface. In this post, we show how to extend Baez-Ortega’s method to brms. To pay respects where they’re due, the synthetic data, priors, and other model settings are largely the same as those Baez-Ortega used in his blog.

I make assumptions

For this post, I’m presuming you are vaguely familiar with linear regression, know about the basic differences between frequentist and Bayesian approaches to fitting models, and have a sense that the issue of outlier values is a pickle worth contending with. All code in is R, with a heavy use of the tidyverse–which you might learn a lot about here, especially chapter 5–, and, of course, Bürkner’s brms.

If you’d like a warmup, consider checking out my related post, Robust Linear Regression with Student’s \(t\)-Distribution.

What’s the deal?

Pearson’s correlations are designed to quantify the linear relationship between two normally distributed variables. The normal distribution and its multivariate generalization, the multivariate normal distribution, are sensitive to outliers. When you have well-behaved synthetic data, this isn’t an issue. But if you work real-world data, this can be a problem. One can have data for which the vast majority of cases are well-characterized by a nice liner relationship, but have a few odd cases for which that relationship does not hold. And if those odd cases happen to be overly influential–sometimes called leverage points–the resulting Pearson’s correlation coefficient might look off.

Recall that the normal distribution is a special case of Student’s \(t\)-distribution with the \(\nu\) parameter (i.e., nu, degree of freedom) set to infinity. As it turns out, when \(\nu\) is small, Student’s \(t\)-distribution is more robust to multivariate outliers. It’s less influenced by them. I’m not going to cover why in any detail. For that you’ve got Baez-Ortega’s blog, an even earlier blog from Rasmus Bååth, and textbook treatments on the topic by Gelman & Hill (2007, chapter 6) and Kruschke (2014, chapter 16). Here we’ll get a quick sense of how vulnerable Pearson’s correlations–with their reliance on the Gaussian–are to outliers, we’ll demonstrate how fitting correlations within the Bayesian paradigm using the conventional Gaussian likelihood is similarly vulnerable to distortion, and then see how Student’s \(t\)-distribution can save the day. And importantly, we’ll do the bulk of this with the brms package.

We need data

To start off, we’ll make a multivariate normal simulated data set using the same steps Baez-Ortega’s used.

library(mvtnorm)
library(tidyverse)

sigma <- c(20, 40)  # the variances
rho   <- -.95       # the desired correlation

# here's the variance/covariance matrix
cov.mat <- 
  matrix(c(sigma[1] ^ 2,
           sigma[1] * sigma[2] * rho,
           sigma[1] * sigma[2] * rho,
           sigma[2] ^ 2),
         nrow = 2, byrow = T)

# after setting our seed, we're ready to simulate with `rmvnorm()`
set.seed(210191)
x.clean <- 
  rmvnorm(n = 40, sigma = cov.mat) %>% 
  as_tibble() %>% 
  rename(x = V1,
         y = V2)

Here we make our second data set, x.noisy, which is identical to our well-behaved x.clean data, but with the first three cases transformed to outlier values.

x.noisy <- x.clean

x.noisy[1:3,] <-
  matrix(c(-40, -60,
           20, 100,
           40, 40),
         nrow = 3, byrow = T)

Finally, we’ll add an outlier index to the data sets, which will help us with plotting.

x.clean <-
  x.clean %>% 
  mutate(outlier = factor(0))

x.noisy <- 
  x.noisy %>% 
  mutate(outlier = c(rep(1, 3), rep(0, 37)) %>% as.factor(.))

The plot below shows what the x.clean data look like. I’m a fan of FiveThirtyEight, so we’ll use a few convenience functions from the handy ggthemes package to give our plots a FiveThirtyEight-like feel.

library(ggthemes)

x.clean %>% 
  ggplot(aes(x = x, y = y, color = outlier, fill = outlier)) +
  geom_point() +
  stat_ellipse(geom = "polygon", alpha = .15, size = .15, level = .5) +
  stat_ellipse(geom = "polygon", alpha = .15, size = .15, level = .95) +
  scale_color_fivethirtyeight() +
  scale_fill_fivethirtyeight() +
  coord_cartesian(xlim = -50:50,
                  ylim = -100:100) +
  theme_fivethirtyeight() +
  theme(legend.position = "none")

And here are the x.noisy data.

x.noisy %>% 
  ggplot(aes(x = x, y = y, color = outlier, fill = outlier)) +
  geom_point() +
  stat_ellipse(geom = "polygon", alpha = .15, size = .15, level = .5) +
  stat_ellipse(geom = "polygon", alpha = .15, size = .15, level = .95) +
  scale_color_fivethirtyeight() +
  scale_fill_fivethirtyeight() +
  coord_cartesian(xlim = -50:50,
                  ylim = -100:100) +
  theme_fivethirtyeight() +
  theme(legend.position = "none")

The three outliers are in red. Even in their presence, the old interocular trauma test suggests there is a pronounced overall trend in the data. I would like a correlation procedure that’s capable of capturing that overall trend. Let’s examine some candidates.

How does old Pearson hold up?

A quick way to get a Pearson’s correlation coefficient in R is with the cor() function, which does a nice job recovering the correlation we simulated the x.clean data with:

cor(x.clean$x, x.clean$y)
## [1] -0.959702

However, things fall apart if you use cor() on the x.noisy data.

cor(x.noisy$x, x.noisy$y)
## [1] -0.6365649

So even though most of the x.noisy data continue to show a clear strong relation, three outlier values reduced the Pearson’s correlation a third of the way toward zero. Let’s see what happens when we go Bayesian.

Bayesian correlations in brms

Bürkner’s brms is a general purpose interface for fitting all manner of Bayesian regression models with Stan as the engine under the hood. It has popular lme4-like syntax and offers a variety of convenience functions for post processing. Let’s load it up.

library(brms)
## Warning: package 'Rcpp' was built under R version 3.5.2

First with the Gaussian likelihood.

I’m not going to spend a lot of time walking through the syntax in the main brms function, brm(). You can learn all about that here or with my project Statistical Rethinking with brms, ggplot2, and the tidyverse. But our particular use of brm() requires we make a few fine points.

One doesn’t always think about bivariate correlations within the regression paradigm. But they work just fine. Within brms, you would typically specify the conventional Gaussian likelihood (i.e., family = gaussian), use the mvbind() syntax to set up a multivariate model, and fit that model without predictors. For each variable specified in cbind(), you’ll estimate an intercept (i.e., mean, \(\mu\)) and sigma (i.e., \(\sigma\), often called a residual variance). Since there are no predictors in the model, the residual variance is just the variance and the brms default for multivariate models is to allow the residual variances to covary. But since variances are parameterized in the standard deviation metric in brms, the residual variances and their covariance are SDs and their correlation, respectively.

Here’s what it looks like in practice.

f0 <- 
  brm(data = x.clean, 
      family = gaussian,
      mvbind(x, y) ~ 1,
      prior = c(prior(normal(0, 100), class = Intercept),
                prior(normal(0, 100), class = sigma, resp = x),
                prior(normal(0, 100), class = sigma, resp = y),
                prior(lkj(1), class = rescor)),
      iter = 2000, warmup = 500, chains = 4, cores = 4, 
      seed = 210191)

In a typical Bayesian workflow, you’d examine the quality of the chains with trace plots. The easy way to do that in brms is with plot(). E.g., to get the trace plots for our first model, you’d code plot(f0). Happily, the trace plots look fine for all models in this post. For the sake of space, I’ll leave their inspection as exercises for interested readers.

Our priors and such mirror those in Baez-Ortega’s blog. Here are the results.

print(f0)
##  Family: MV(gaussian, gaussian) 
##   Links: mu = identity; sigma = identity
##          mu = identity; sigma = identity 
## Formula: x ~ 1 
##          y ~ 1 
##    Data: x.clean (Number of observations: 40) 
## Samples: 4 chains, each with iter = 2000; warmup = 500; thin = 1;
##          total post-warmup samples = 6000
## 
## Population-Level Effects: 
##             Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## x_Intercept    -2.83      3.33    -9.33     3.69       2995 1.00
## y_Intercept     3.55      6.65    -9.45    16.60       2972 1.00
## 
## Family Specific Parameters: 
##         Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## sigma_x    21.47      2.47    17.29    26.99       2453 1.00
## sigma_y    42.93      4.86    34.55    53.51       2418 1.00
## 
## Residual Correlations: 
##             Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## rescor(x,y)    -0.95      0.02    -0.98    -0.92       2636 1.00
## 
## Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample 
## is a crude measure of effective sample size, and Rhat is the potential 
## scale reduction factor on split chains (at convergence, Rhat = 1).

Way down there in the last line in the ‘Family Specific Parameters’ section we have rescor(x,y), which is our correlation. And indeed, our Gaussian intercept-only multivariate model did a great job recovering the correlation we used to simulate the x.clean data with. Look at what happens when we try this approach with x.noisy.

f1 <-
  update(f0,
         newdata = x.noisy,
         iter = 2000, warmup = 500, chains = 4, cores = 4, seed = 210191)
print(f1)
##  Family: MV(gaussian, gaussian) 
##   Links: mu = identity; sigma = identity
##          mu = identity; sigma = identity 
## Formula: x ~ 1 
##          y ~ 1 
##    Data: x.noisy (Number of observations: 40) 
## Samples: 4 chains, each with iter = 2000; warmup = 500; thin = 1;
##          total post-warmup samples = 6000
## 
## Population-Level Effects: 
##             Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## x_Intercept    -2.95      3.75   -10.39     4.57       4477 1.00
## y_Intercept     6.52      7.45    -8.31    20.98       4692 1.00
## 
## Family Specific Parameters: 
##         Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## sigma_x    23.65      2.76    18.97    29.83       4312 1.00
## sigma_y    47.20      5.42    37.94    59.03       4332 1.00
## 
## Residual Correlations: 
##             Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## rescor(x,y)    -0.61      0.10    -0.78    -0.39       4480 1.00
## 
## Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample 
## is a crude measure of effective sample size, and Rhat is the potential 
## scale reduction factor on split chains (at convergence, Rhat = 1).

And the correlation estimate is -.61. As it turns out, data = x.noisy + family = gaussian in brm() failed us just like Pearson’s correlation failed us. Time to leave failure behind.

Now with Student’s \(t\)-distribution.

Before we jump into using family = student, we should talk a bit about \(\nu\). This is our new parameter which is silently fixed to infinity when we use the Gaussian likelihood. The \(\nu\) parameter is bound at zero but, as discussed in Baez-Ortega’s blog, is somewhat nonsensical for values below 1. As it turns out, \(\nu\) is constrained to be equal to or greater than 1 in brms. So nothing for us to worry about, there. The Stan team currently recommends the gamma(2, 0.1) prior for \(\nu\), which is also the current brms default. This is what that distribution looks like.

tibble(x = seq(from = 1, to = 120, by = .5)) %>% 
  ggplot(aes(x = x, fill = factor(0))) +
  geom_ribbon(aes(ymin = 0, 
                  ymax = dgamma(x, 2, 0.1))) +
  scale_y_continuous(NULL, breaks = NULL) +
  scale_fill_fivethirtyeight() +
  coord_cartesian(xlim = 0:100) +
  ggtitle("gamma(2, 0.1)") +
  theme_fivethirtyeight() +
  theme(legend.position = "none")

So gamma(2, 0.1) should gently push the \(\nu\) posterior toward low values, but it’s slowly-sloping right tail will allow higher values to emerge.

Following the Stan team’s recommendation, the brms default and Baez-Ortega’s blog, here’s our robust Student’s \(t\) model for the x.noisy data.

f2 <- 
  brm(data = x.noisy, 
      family = student,
      mvbind(x, y) ~ 1,
      prior = c(prior(gamma(2, .1), class = nu),
                prior(normal(0, 100), class = Intercept),
                prior(normal(0, 100), class = sigma, resp = x),
                prior(normal(0, 100), class = sigma, resp = y),
                prior(lkj(1), class = rescor)),
      iter = 2000, warmup = 500, chains = 4, cores = 4, 
      seed = 210191)
print(f2)
##  Family: MV(student, student) 
##   Links: mu = identity; sigma = identity; nu = identity
##          mu = identity; sigma = identity; nu = identity 
## Formula: x ~ 1 
##          y ~ 1 
##    Data: x.noisy (Number of observations: 40) 
## Samples: 4 chains, each with iter = 2000; warmup = 500; thin = 1;
##          total post-warmup samples = 6000
## 
## Population-Level Effects: 
##             Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## x_Intercept    -2.11      3.61    -9.22     4.96       2936 1.00
## y_Intercept     1.93      7.12   -11.74    16.03       2949 1.00
## 
## Family Specific Parameters: 
##         Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## sigma_x    18.26      2.92    13.06    24.48       3188 1.00
## sigma_y    36.31      5.79    26.08    48.60       3206 1.00
## nu          2.65      1.00     1.36     5.13       3905 1.00
## 
## Residual Correlations: 
##             Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## rescor(x,y)    -0.93      0.03    -0.97    -0.84       3484 1.00
## 
## Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample 
## is a crude measure of effective sample size, and Rhat is the potential 
## scale reduction factor on split chains (at convergence, Rhat = 1).

Whoa, look at that correlation, rescore(x,y)! It’s right about what we’d hope for. Sure, it’s not a perfect -.95, but that’s way better than -.61.

While we’re at it, we may as well see what happens when we fit a Student’s \(t\) model when we have perfectly multivariate normal data. Here it is with the x.clean data.

f3 <- 
  update(f2,
         newdata = x.clean, 
         iter = 2000, warmup = 500, chains = 4, cores = 4, seed = 210191)
print(f3)
##  Family: MV(student, student) 
##   Links: mu = identity; sigma = identity; nu = identity
##          mu = identity; sigma = identity; nu = identity 
## Formula: x ~ 1 
##          y ~ 1 
##    Data: x.clean (Number of observations: 40) 
## Samples: 4 chains, each with iter = 2000; warmup = 500; thin = 1;
##          total post-warmup samples = 6000
## 
## Population-Level Effects: 
##             Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## x_Intercept    -2.37      3.50    -9.20     4.39       2909 1.00
## y_Intercept     2.71      6.98   -10.90    16.48       3032 1.00
## 
## Family Specific Parameters: 
##         Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## sigma_x    20.79      2.60    16.28    26.60       2388 1.00
## sigma_y    41.34      5.17    32.33    52.62       2417 1.00
## nu         22.42     13.78     5.70    57.53       4384 1.00
## 
## Residual Correlations: 
##             Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## rescor(x,y)    -0.96      0.01    -0.98    -0.92       3045 1.00
## 
## Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample 
## is a crude measure of effective sample size, and Rhat is the potential 
## scale reduction factor on split chains (at convergence, Rhat = 1).

So when you don’t need Student’s \(t\), it yields the right answer anyways. That’s a nice feature.

We should probably compare the posteriors of the correlations across the four models. First we’ll collect the posterior samples into a tibble.

posts <-
  tibble(model = str_c("f", 0:3)) %>% 
  mutate(fit  = map(model, get)) %>% 
  mutate(post = map(fit, posterior_samples)) %>% 
  unnest(post)

head(posts)
## # A tibble: 6 x 8
##   model b_x_Intercept b_y_Intercept sigma_x sigma_y rescor__x__y  lp__
##   <chr>         <dbl>         <dbl>   <dbl>   <dbl>        <dbl> <dbl>
## 1 f0             1.31        -5.60     18.2    37.8       -0.947 -353.
## 2 f0            -7.41        10.6      25.2    50.5       -0.941 -357.
## 3 f0            -4.51         5.65     23.3    49.4       -0.975 -354.
## 4 f0            -2.65        -0.597    18.3    37.3       -0.929 -354.
## 5 f0            -2.76        -1.50     18.4    37.5       -0.923 -355.
## 6 f0            -9.84        15.2      26.3    45.1       -0.953 -358.
## # … with 1 more variable: nu <dbl>

With the posterior draws in hand, we just need to wrangle a bit before showing the correlation posteriors in a coefficient plot. To make things easier, we’ll do so with a couple convenience functions from the tidybayes package.

library(tidybayes)

# wrangle
posts %>% 
  group_by(model) %>% 
  median_qi(rescor__x__y, .width = c(.5, .95)) %>% 
  mutate(key = recode(model, 
                      f0 = "Gaussian likelihood with clean data",
                      f1 = "Gaussian likelihood with noisy data",
                      f2 = "Student likelihood with noisy data",
                      f3 = "Student likelihood with clean data"),
         clean = ifelse(model %in% c("f0", "f3"), "0", "1")) %>%
  
  # plot
  ggplot(aes(x = rescor__x__y, y = key, color = clean)) +
  geom_pointintervalh() +
  scale_color_fivethirtyeight() +
  coord_cartesian(xlim = -1:0) +
  labs(subtitle = expression(paste("The posterior for ", rho, " depends on the likelihood. Why not go robust and use Student's ", italic(t), "?"))) +
  theme_fivethirtyeight() +
  theme(axis.text.y     = element_text(hjust = 0),
        legend.position = "none")

From our tidybayes::median_qi() code, the dots are the posterior medians, the thick inner lines the 50% intervals, and the thinner outer lines the 95% intervals. The posteriors for the x.noisy data are in red and those for the x.clean data are in blue. If the data are clean multivariate normal Gaussian or if they’re dirty but fit with robust Student’s \(t\), everything is pretty much alright. But whoa, if you fit a correlation with a combination of family = gaussian and noisy outlier-laden data, man that’s just a mess.

Don’t let a few overly-influential outliers make a mess of your analyses. Try the robust Student’s \(t\).

sessionInfo()
## R version 3.5.1 (2018-07-02)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS High Sierra 10.13.6
## 
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] tidybayes_1.0.4 bindrcpp_0.2.2  brms_2.8.8      Rcpp_1.0.1     
##  [5] ggthemes_4.0.1  forcats_0.3.0   stringr_1.4.0   dplyr_0.8.0.1  
##  [9] purrr_0.2.5     readr_1.1.1     tidyr_0.8.1     tibble_2.1.1   
## [13] ggplot2_3.1.1   tidyverse_1.2.1 mvtnorm_1.0-10 
## 
## loaded via a namespace (and not attached):
##   [1] readxl_1.1.0              backports_1.1.4          
##   [3] Hmisc_4.1-1               plyr_1.8.4               
##   [5] igraph_1.2.1              lazyeval_0.2.2           
##   [7] splines_3.5.1             svUnit_0.7-12            
##   [9] crosstalk_1.0.0           rstantools_1.5.1         
##  [11] inline_0.3.15             digest_0.6.18            
##  [13] htmltools_0.3.6           rsconnect_0.8.8          
##  [15] fansi_0.4.0               gdata_2.18.0             
##  [17] magrittr_1.5              checkmate_1.8.5          
##  [19] cluster_2.0.7-1           modelr_0.1.2             
##  [21] matrixStats_0.54.0        xts_0.10-2               
##  [23] prettyunits_1.0.2         colorspace_1.3-2         
##  [25] rvest_0.3.2               pan_1.6                  
##  [27] haven_1.1.2               xfun_0.3                 
##  [29] callr_3.1.0               crayon_1.3.4             
##  [31] jsonlite_1.5              lme4_1.1-17              
##  [33] bindr_0.1.1               survival_2.42-3          
##  [35] zoo_1.8-2                 glue_1.3.1.9000          
##  [37] gtable_0.3.0              pkgbuild_1.0.2           
##  [39] weights_1.0               rstan_2.18.2             
##  [41] jomo_2.6-2                abind_1.4-5              
##  [43] scales_1.0.0              miniUI_0.1.1.1           
##  [45] xtable_1.8-2              htmlTable_1.12           
##  [47] ggstance_0.3              foreign_0.8-70           
##  [49] Formula_1.2-3             stats4_3.5.1             
##  [51] StanHeaders_2.18.0-1      DT_0.4                   
##  [53] htmlwidgets_1.2           httr_1.3.1               
##  [55] threejs_0.3.1             arrayhelpers_1.0-20160527
##  [57] RColorBrewer_1.1-2        acepack_1.4.1            
##  [59] mice_3.1.0                pkgconfig_2.0.2          
##  [61] loo_2.1.0                 nnet_7.3-12              
##  [63] utf8_1.1.4                tidyselect_0.2.5         
##  [65] labeling_0.3              rlang_0.3.4              
##  [67] reshape2_1.4.3            later_0.7.3              
##  [69] munsell_0.5.0             cellranger_1.1.0         
##  [71] tools_3.5.1               cli_1.0.1                
##  [73] generics_0.0.2            broom_0.5.1              
##  [75] ggridges_0.5.0            evaluate_0.10.1          
##  [77] yaml_2.1.19               processx_3.2.1           
##  [79] knitr_1.20                mitml_0.3-6              
##  [81] nlme_3.1-137              mime_0.5                 
##  [83] xml2_1.2.0                compiler_3.5.1           
##  [85] bayesplot_1.7.0           shinythemes_1.1.1        
##  [87] rstudioapi_0.7            stringi_1.4.3            
##  [89] ps_1.2.1                  blogdown_0.8             
##  [91] Brobdingnag_1.2-6         lattice_0.20-35          
##  [93] Matrix_1.2-14             nloptr_1.0.4             
##  [95] markdown_0.8              shinyjs_1.0              
##  [97] pillar_1.3.1              bridgesampling_0.6-0     
##  [99] data.table_1.11.4         httpuv_1.4.4.2           
## [101] R6_2.3.0                  latticeExtra_0.6-28      
## [103] bookdown_0.9              promises_1.0.1           
## [105] gridExtra_2.3             codetools_0.2-15         
## [107] colourpicker_1.0          MASS_7.3-50              
## [109] gtools_3.8.1              assertthat_0.2.0         
## [111] rprojroot_1.3-2           withr_2.1.2              
## [113] shinystan_2.5.0           parallel_3.5.1           
## [115] hms_0.4.2                 grid_3.5.1               
## [117] rpart_4.1-13              coda_0.19-2              
## [119] minqa_1.2.4               rmarkdown_1.10           
## [121] shiny_1.1.0               lubridate_1.7.4          
## [123] base64enc_0.1-3           dygraphs_1.1.1.5

Related