[edited June 18, 2019]

In this post, we’ll show how Student’s \(t\)-distribution can produce better correlation estimates when your data have outliers. As is often the case, we’ll do so as Bayesians.

This post is a direct consequence of Adrian Baez-Ortega’s great blog, “Bayesian robust correlation with Stan in R (and why you should use Bayesian methods)”. Baez-Ortega worked out the approach and code for direct use with Stan computational environment. That solution is great because Stan is free, open source, and very flexible. However, Stan’s interface might be prohibitively technical for non-statistician users. Happily, the brms package allows users to access the computational power of Stan through a simpler interface. In this post, we show how to extend Baez-Ortega’s method to brms. To pay respects where they’re due, the synthetic data, priors, and other model settings are largely the same as those Baez-Ortega used in his blog.

## I make assumptions

For this post, I’m presuming you are vaguely familiar with linear regression, know about the basic differences between frequentist and Bayesian approaches to fitting models, and have a sense that the issue of outlier values is a pickle worth contending with. All code in is R, with a heavy use of the tidyverse–which you might learn a lot about here, especially chapter 5–, and, of course, Bürkner’s brms.

If you’d like a warmup, consider checking out my related post, Robust Linear Regression with Student’s \(t\)-Distribution.

## What’s the deal?

Pearson’s correlations are designed to quantify the linear relationship between two normally distributed variables. The normal distribution and its multivariate generalization, the multivariate normal distribution, are sensitive to outliers. When you have well-behaved synthetic data, this isn’t an issue. But if you work real-world data, this can be a problem. One can have data for which the vast majority of cases are well-characterized by a nice liner relationship, but have a few odd cases for which that relationship does not hold. And if those odd cases happen to be overly influential–sometimes called leverage points–the resulting Pearson’s correlation coefficient might look off.

Recall that the normal distribution is a special case of Student’s \(t\)-distribution with the \(\nu\) parameter (i.e., *nu*, degree of freedom) set to infinity. As it turns out, when \(\nu\) is small, Student’s \(t\)-distribution is more robust to multivariate outliers. It’s less influenced by them. I’m not going to cover why in any detail. For that you’ve got Baez-Ortega’s blog, an even earlier blog from Rasmus Bååth, and textbook treatments on the topic by Gelman & Hill (2007, chapter 6) and Kruschke (2014, chapter 16). Here we’ll get a quick sense of how vulnerable Pearson’s correlations–with their reliance on the Gaussian–are to outliers, we’ll demonstrate how fitting correlations within the Bayesian paradigm using the conventional Gaussian likelihood is similarly vulnerable to distortion, and then see how Student’s \(t\)-distribution can save the day. And importantly, we’ll do the bulk of this with the brms package.

## We need data

To start off, we’ll make a multivariate normal simulated data set using the same steps Baez-Ortega’s used.

```
library(mvtnorm)
library(tidyverse)
sigma <- c(20, 40) # the variances
rho <- -.95 # the desired correlation
# here's the variance/covariance matrix
cov.mat <-
matrix(c(sigma[1] ^ 2,
sigma[1] * sigma[2] * rho,
sigma[1] * sigma[2] * rho,
sigma[2] ^ 2),
nrow = 2, byrow = T)
# after setting our seed, we're ready to simulate with `rmvnorm()`
set.seed(210191)
x.clean <-
rmvnorm(n = 40, sigma = cov.mat) %>%
as_tibble() %>%
rename(x = V1,
y = V2)
```

Here we make our second data set, `x.noisy`

, which is identical to our well-behaved `x.clean`

data, but with the first three cases transformed to outlier values.

```
x.noisy <- x.clean
x.noisy[1:3,] <-
matrix(c(-40, -60,
20, 100,
40, 40),
nrow = 3, byrow = T)
```

Finally, we’ll add an `outlier`

index to the data sets, which will help us with plotting.

```
x.clean <-
x.clean %>%
mutate(outlier = factor(0))
x.noisy <-
x.noisy %>%
mutate(outlier = c(rep(1, 3), rep(0, 37)) %>% as.factor(.))
```

The plot below shows what the `x.clean`

data look like. I’m a fan of FiveThirtyEight, so we’ll use a few convenience functions from the handy ggthemes package to give our plots a FiveThirtyEight-like feel.

```
library(ggthemes)
x.clean %>%
ggplot(aes(x = x, y = y, color = outlier, fill = outlier)) +
geom_point() +
stat_ellipse(geom = "polygon", alpha = .15, size = .15, level = .5) +
stat_ellipse(geom = "polygon", alpha = .15, size = .15, level = .95) +
scale_color_fivethirtyeight() +
scale_fill_fivethirtyeight() +
coord_cartesian(xlim = -50:50,
ylim = -100:100) +
theme_fivethirtyeight() +
theme(legend.position = "none")
```

And here are the `x.noisy`

data.

```
x.noisy %>%
ggplot(aes(x = x, y = y, color = outlier, fill = outlier)) +
geom_point() +
stat_ellipse(geom = "polygon", alpha = .15, size = .15, level = .5) +
stat_ellipse(geom = "polygon", alpha = .15, size = .15, level = .95) +
scale_color_fivethirtyeight() +
scale_fill_fivethirtyeight() +
coord_cartesian(xlim = -50:50,
ylim = -100:100) +
theme_fivethirtyeight() +
theme(legend.position = "none")
```

The three outliers are in red. Even in their presence, the old interocular trauma test suggests there is a pronounced overall trend in the data. I would like a correlation procedure that’s capable of capturing that overall trend. Let’s examine some candidates.

## How does old Pearson hold up?

A quick way to get a Pearson’s correlation coefficient in R is with the `cor()`

function, which does a nice job recovering the correlation we simulated the `x.clean`

data with:

`cor(x.clean$x, x.clean$y)`

`## [1] -0.959702`

However, things fall apart if you use `cor()`

on the `x.noisy`

data.

`cor(x.noisy$x, x.noisy$y)`

`## [1] -0.6365649`

So even though most of the `x.noisy`

data continue to show a clear strong relation, three outlier values reduced the Pearson’s correlation a third of the way toward zero. Let’s see what happens when we go Bayesian.

## Bayesian correlations in brms

Bürkner’s brms is a general purpose interface for fitting all manner of Bayesian regression models with Stan as the engine under the hood. It has popular lme4-like syntax and offers a variety of convenience functions for post processing. Let’s load it up.

`library(brms)`

### First with the Gaussian likelihood.

I’m not going to spend a lot of time walking through the syntax in the main brms function, `brm()`

. You can learn all about that here or with my project *Statistical Rethinking with brms, ggplot2, and the tidyverse*. But our particular use of `brm()`

requires we make a few fine points.

One doesn’t always think about bivariate correlations within the regression paradigm. But they work just fine. Within brms, you would typically specify the conventional Gaussian likelihood (i.e., `family = gaussian`

), use the `mvbind()`

syntax to set up a multivariate model, and fit that model without predictors. For each variable specified in `cbind()`

, you’ll estimate an intercept (i.e., mean, \(\mu\)) and sigma (i.e., \(\sigma\), often called a residual variance). Since there are no predictors in the model, the residual variance is just the variance and the brms default for multivariate models is to allow the residual variances to covary. But since variances are parameterized in the standard deviation metric in brms, the residual variances and their covariance are *SD*s and their correlation, respectively.

Here’s what it looks like in practice.

```
f0 <-
brm(data = x.clean,
family = gaussian,
mvbind(x, y) ~ 1,
prior = c(prior(normal(0, 100), class = Intercept),
prior(normal(0, 100), class = sigma, resp = x),
prior(normal(0, 100), class = sigma, resp = y),
prior(lkj(1), class = rescor)),
iter = 2000, warmup = 500, chains = 4, cores = 4,
seed = 210191)
```

In a typical Bayesian workflow, you’d examine the quality of the chains with trace plots. The easy way to do that in brms is with `plot()`

. E.g., to get the trace plots for our first model, you’d code `plot(f0)`

. Happily, the trace plots look fine for all models in this post. For the sake of space, I’ll leave their inspection as exercises for interested readers.

Our priors and such mirror those in Baez-Ortega’s blog. Here are the results.

`print(f0)`

```
## Family: MV(gaussian, gaussian)
## Links: mu = identity; sigma = identity
## mu = identity; sigma = identity
## Formula: x ~ 1
## y ~ 1
## Data: x.clean (Number of observations: 40)
## Samples: 4 chains, each with iter = 2000; warmup = 500; thin = 1;
## total post-warmup samples = 6000
##
## Population-Level Effects:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## x_Intercept -2.83 3.33 -9.33 3.69 1.00 3000 3111
## y_Intercept 3.55 6.65 -9.45 16.60 1.00 2978 2928
##
## Family Specific Parameters:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma_x 21.47 2.47 17.29 26.99 1.00 2514 3094
## sigma_y 42.93 4.86 34.55 53.51 1.00 2477 3144
##
## Residual Correlations:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## rescor(x,y) -0.95 0.02 -0.98 -0.92 1.00 2686 3218
##
## Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample
## is a crude measure of effective sample size, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
```

Way down there in the last line in the ‘Family Specific Parameters’ section we have `rescor(x,y)`

, which is our correlation. And indeed, our Gaussian intercept-only multivariate model did a great job recovering the correlation we used to simulate the `x.clean`

data with. Look at what happens when we try this approach with `x.noisy`

.

```
f1 <-
update(f0,
newdata = x.noisy,
iter = 2000, warmup = 500, chains = 4, cores = 4, seed = 210191)
```

`print(f1)`

```
## Family: MV(gaussian, gaussian)
## Links: mu = identity; sigma = identity
## mu = identity; sigma = identity
## Formula: x ~ 1
## y ~ 1
## Data: x.noisy (Number of observations: 40)
## Samples: 4 chains, each with iter = 2000; warmup = 500; thin = 1;
## total post-warmup samples = 6000
##
## Population-Level Effects:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## x_Intercept -2.95 3.75 -10.39 4.57 1.00 4515 3963
## y_Intercept 6.52 7.45 -8.31 20.98 1.00 4727 4057
##
## Family Specific Parameters:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma_x 23.65 2.76 18.97 29.83 1.00 4536 4351
## sigma_y 47.20 5.42 37.94 59.03 1.00 4619 4222
##
## Residual Correlations:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## rescor(x,y) -0.61 0.10 -0.78 -0.39 1.00 4344 4033
##
## Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample
## is a crude measure of effective sample size, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
```

And the correlation estimate is -.61. As it turns out, `data = x.noisy`

+ `family = gaussian`

in `brm()`

failed us just like Pearson’s correlation failed us. Time to leave failure behind.

### Now with Student’s \(t\)-distribution.

Before we jump into using `family = student`

, we should talk a bit about \(\nu\). This is our new parameter which is silently fixed to infinity when we use the Gaussian likelihood. The \(\nu\) parameter is bound at zero but, as discussed in Baez-Ortega’s blog, is somewhat nonsensical for values below 1. As it turns out, \(\nu\) is constrained to be equal to or greater than 1 in brms. So nothing for us to worry about, there. The Stan team currently recommends the gamma(2, 0.1) prior for \(\nu\), which is also the current brms default. This is what that distribution looks like.

```
tibble(x = seq(from = 1, to = 120, by = .5)) %>%
ggplot(aes(x = x, fill = factor(0))) +
geom_ribbon(aes(ymin = 0,
ymax = dgamma(x, 2, 0.1))) +
scale_y_continuous(NULL, breaks = NULL) +
scale_fill_fivethirtyeight() +
coord_cartesian(xlim = 0:100) +
ggtitle("gamma(2, 0.1)") +
theme_fivethirtyeight() +
theme(legend.position = "none")
```

So gamma(2, 0.1) should gently push the \(\nu\) posterior toward low values, but it’s slowly-sloping right tail will allow higher values to emerge.

Following the Stan team’s recommendation, the brms default and Baez-Ortega’s blog, here’s our robust Student’s \(t\) model for the `x.noisy`

data.

```
f2 <-
brm(data = x.noisy,
family = student,
mvbind(x, y) ~ 1,
prior = c(prior(gamma(2, .1), class = nu),
prior(normal(0, 100), class = Intercept),
prior(normal(0, 100), class = sigma, resp = x),
prior(normal(0, 100), class = sigma, resp = y),
prior(lkj(1), class = rescor)),
iter = 2000, warmup = 500, chains = 4, cores = 4,
seed = 210191)
```

`print(f2)`

```
## Family: MV(student, student)
## Links: mu = identity; sigma = identity; nu = identity
## mu = identity; sigma = identity; nu = identity
## Formula: x ~ 1
## y ~ 1
## Data: x.noisy (Number of observations: 40)
## Samples: 4 chains, each with iter = 2000; warmup = 500; thin = 1;
## total post-warmup samples = 6000
##
## Population-Level Effects:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## x_Intercept -2.11 3.61 -9.22 4.96 1.00 2964 3421
## y_Intercept 1.93 7.12 -11.74 16.03 1.00 2973 3497
##
## Family Specific Parameters:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma_x 18.26 2.92 13.06 24.48 1.00 3231 3281
## sigma_y 36.31 5.79 26.08 48.60 1.00 3214 3625
## nu 2.65 1.00 1.36 5.13 1.00 3967 2851
##
## Residual Correlations:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## rescor(x,y) -0.93 0.03 -0.97 -0.84 1.00 3896 3625
##
## Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample
## is a crude measure of effective sample size, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
```

Whoa, look at that correlation, `rescore(x,y)`

! It’s right about what we’d hope for. Sure, it’s not a perfect -.95, but that’s way better than -.61.

While we’re at it, we may as well see what happens when we fit a Student’s \(t\) model when we have perfectly multivariate normal data. Here it is with the `x.clean`

data.

```
f3 <-
update(f2,
newdata = x.clean,
iter = 2000, warmup = 500, chains = 4, cores = 4, seed = 210191)
```

`print(f3)`

```
## Family: MV(student, student)
## Links: mu = identity; sigma = identity; nu = identity
## mu = identity; sigma = identity; nu = identity
## Formula: x ~ 1
## y ~ 1
## Data: x.clean (Number of observations: 40)
## Samples: 4 chains, each with iter = 2000; warmup = 500; thin = 1;
## total post-warmup samples = 6000
##
## Population-Level Effects:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## x_Intercept -2.37 3.50 -9.20 4.39 1.00 2931 3188
## y_Intercept 2.71 6.98 -10.90 16.48 1.00 3067 3363
##
## Family Specific Parameters:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma_x 20.79 2.60 16.28 26.60 1.00 2427 2778
## sigma_y 41.34 5.17 32.33 52.62 1.00 2459 2863
## nu 22.42 13.78 5.70 57.53 1.00 4146 3881
##
## Residual Correlations:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## rescor(x,y) -0.96 0.01 -0.98 -0.92 1.00 3084 3577
##
## Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample
## is a crude measure of effective sample size, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
```

So when you don’t need Student’s \(t\), it yields the right answer anyways. That’s a nice feature.

We should probably compare the posteriors of the correlations across the four models. First we’ll collect the posterior samples into a tibble.

```
posts <-
tibble(model = str_c("f", 0:3)) %>%
mutate(fit = map(model, get)) %>%
mutate(post = map(fit, posterior_samples)) %>%
unnest(post)
head(posts)
```

```
## # A tibble: 6 x 9
## model fit b_x_Intercept b_y_Intercept sigma_x sigma_y rescor__x__y
## <chr> <lis> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 f0 <brm… 1.31 -5.60 18.2 37.8 -0.947
## 2 f0 <brm… -7.41 10.6 25.2 50.5 -0.941
## 3 f0 <brm… -4.51 5.65 23.3 49.4 -0.975
## 4 f0 <brm… -2.65 -0.597 18.3 37.3 -0.929
## 5 f0 <brm… -2.76 -1.50 18.4 37.5 -0.923
## 6 f0 <brm… -9.84 15.2 26.3 45.1 -0.953
## # … with 2 more variables: lp__ <dbl>, nu <dbl>
```

With the posterior draws in hand, we just need to wrangle a bit before showing the correlation posteriors in a coefficient plot. To make things easier, we’ll do so with a couple convenience functions from the tidybayes package.

```
library(tidybayes)
# wrangle
posts %>%
group_by(model) %>%
median_qi(rescor__x__y, .width = c(.5, .95)) %>%
mutate(key = recode(model,
f0 = "Gaussian likelihood with clean data",
f1 = "Gaussian likelihood with noisy data",
f2 = "Student likelihood with noisy data",
f3 = "Student likelihood with clean data"),
clean = ifelse(model %in% c("f0", "f3"), "0", "1")) %>%
# plot
ggplot(aes(x = rescor__x__y, y = key, color = clean)) +
geom_pointintervalh() +
scale_color_fivethirtyeight() +
coord_cartesian(xlim = -1:0) +
labs(subtitle = expression(paste("The posterior for ", rho, " depends on the likelihood. Why not go robust and use Student's ", italic(t), "?"))) +
theme_fivethirtyeight() +
theme(axis.text.y = element_text(hjust = 0),
legend.position = "none")
```

From our `tidybayes::median_qi()`

code, the dots are the posterior medians, the thick inner lines the 50% intervals, and the thinner outer lines the 95% intervals. The posteriors for the `x.noisy`

data are in red and those for the `x.clean`

data are in blue. If the data are clean multivariate normal Gaussian or if they’re dirty but fit with robust Student’s \(t\), everything is pretty much alright. But whoa, if you fit a correlation with a combination of `family = gaussian`

and noisy outlier-laden data, man that’s just a mess.

Don’t let a few overly-influential outliers make a mess of your analyses. Try the robust Student’s \(t\).

`sessionInfo()`

```
## R version 3.6.0 (2019-04-26)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS High Sierra 10.13.6
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
##
## Random number generation:
## RNG: Mersenne-Twister
## Normal: Inversion
## Sample: Rounding
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] tidybayes_1.1.0 brms_2.10.0 Rcpp_1.0.2 ggthemes_4.2.0
## [5] forcats_0.4.0 stringr_1.4.0 dplyr_0.8.3 purrr_0.3.2
## [9] readr_1.3.1 tidyr_1.0.0 tibble_2.1.3 ggplot2_3.2.1
## [13] tidyverse_1.2.1 mvtnorm_1.0-11
##
## loaded via a namespace (and not attached):
## [1] readxl_1.3.1 backports_1.1.4
## [3] Hmisc_4.2-0 plyr_1.8.4
## [5] igraph_1.2.4.1 lazyeval_0.2.2
## [7] svUnit_0.7-12 splines_3.6.0
## [9] crosstalk_1.0.0 rstantools_1.5.1
## [11] inline_0.3.15 digest_0.6.20
## [13] htmltools_0.3.6 rsconnect_0.8.13
## [15] gdata_2.18.0 fansi_0.4.0
## [17] magrittr_1.5 checkmate_1.9.3
## [19] cluster_2.0.8 modelr_0.1.4
## [21] matrixStats_0.54.0 xts_0.11-2
## [23] prettyunits_1.0.2 colorspace_1.4-1
## [25] rvest_0.3.4 haven_2.1.0
## [27] pan_1.6 xfun_0.8
## [29] callr_3.2.0 crayon_1.3.4
## [31] jsonlite_1.6 lme4_1.1-21
## [33] zeallot_0.1.0 survival_2.44-1.1
## [35] zoo_1.8-6 glue_1.3.1
## [37] gtable_0.3.0 pkgbuild_1.0.3
## [39] weights_1.0 rstan_2.19.2
## [41] jomo_2.6-8 abind_1.4-5
## [43] scales_1.0.0 miniUI_0.1.1.1
## [45] xtable_1.8-4 htmlTable_1.13.1
## [47] ggstance_0.3.2 foreign_0.8-71
## [49] Formula_1.2-3 stats4_3.6.0
## [51] StanHeaders_2.18.1-10 DT_0.7
## [53] htmlwidgets_1.3 httr_1.4.0
## [55] threejs_0.3.1 arrayhelpers_1.0-20160527
## [57] RColorBrewer_1.1-2 acepack_1.4.1
## [59] mice_3.5.0 pkgconfig_2.0.2
## [61] loo_2.1.0 nnet_7.3-12
## [63] utf8_1.1.4 tidyselect_0.2.5
## [65] labeling_0.3 rlang_0.4.0
## [67] reshape2_1.4.3 later_0.8.0
## [69] munsell_0.5.0 cellranger_1.1.0
## [71] tools_3.6.0 cli_1.1.0
## [73] generics_0.0.2 broom_0.5.2
## [75] ggridges_0.5.1 evaluate_0.14
## [77] yaml_2.2.0 processx_3.3.1
## [79] knitr_1.23 mitml_0.3-7
## [81] nlme_3.1-139 mime_0.7
## [83] xml2_1.2.0 compiler_3.6.0
## [85] bayesplot_1.7.0 shinythemes_1.1.2
## [87] rstudioapi_0.10 stringi_1.4.3
## [89] ps_1.3.0 blogdown_0.14
## [91] Brobdingnag_1.2-6 lattice_0.20-38
## [93] Matrix_1.2-17 nloptr_1.2.1
## [95] markdown_1.0 shinyjs_1.0
## [97] vctrs_0.2.0 pillar_1.4.2
## [99] lifecycle_0.1.0 bridgesampling_0.6-0
## [101] data.table_1.12.2 httpuv_1.5.1
## [103] R6_2.4.0 latticeExtra_0.6-28
## [105] bookdown_0.12 promises_1.0.1
## [107] gridExtra_2.3 boot_1.3-22
## [109] colourpicker_1.0 MASS_7.3-51.4
## [111] gtools_3.8.1 assertthat_0.2.1
## [113] withr_2.1.2 shinystan_2.5.0
## [115] parallel_3.6.0 hms_0.4.2
## [117] grid_3.6.0 rpart_4.1-15
## [119] coda_0.19-2 minqa_1.2.4
## [121] rmarkdown_1.13 shiny_1.3.2
## [123] lubridate_1.7.4 base64enc_0.1-3
## [125] dygraphs_1.1.1.6
```