Regression sandbox
Sep 16 2017 stats R
library(ggplot2)
library(broom)
library(magrittr)
library(dplyr)
library(knitr)
Logistic regression
One way or another
If we have two binary variables and we want to see if they are associated, we could use a logistic regression. But how do we decide which variable should be the predictor and which the response?
In theory it shouldn’t make any difference, but let’s check with a dummy example:
df = data.frame(x = sample(c(FALSE, TRUE), 100, TRUE))
df$y = df$x
df$y[1:70] = sample(c(FALSE, TRUE), 70, TRUE)
glm(y ~ x, data = df, family = binomial()) %>% tidy %>%
kable
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | -0.2336149 | 0.3070802 | -0.7607617 | 0.4467994 |
| xTRUE | 1.1745982 | 0.4256623 | 2.7594604 | 0.0057897 |
glm(x ~ y, data = df, family = binomial()) %>% tidy %>%
kable
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | -0.4054651 | 0.3227486 | -1.256288 | 0.2090117 |
| yTRUE | 1.1745982 | 0.4256624 | 2.759460 | 0.0057897 |
df$z = runif(100)
glm(y ~ x + z, data = df, family = binomial()) %>%
tidy %>% kable
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | -0.5930811 | 0.5436028 | -1.0910191 | 0.2752645 |
| xTRUE | 1.2012521 | 0.4293265 | 2.7979921 | 0.0051421 |
| z | 0.6601875 | 0.8199692 | 0.8051369 | 0.4207407 |
glm(x ~ y + z, data = df, family = binomial()) %>%
tidy %>% kable
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | -0.1246648 | 0.5141828 | -0.2424523 | 0.8084297 |
| yTRUE | 1.1996170 | 0.4291206 | 2.7955243 | 0.0051816 |
| z | -0.5586854 | 0.8023449 | -0.6963157 | 0.4862311 |
Adding another predictor doesn’t change this either: the estimate for the x/y association stays (almost) the same whichever variable is used as the response. That makes sense, since the odds ratio is symmetric in the two variables.
Interpretation
Just to make sure I understand the estimates correctly: a coefficient represents the change in log odds for each “unit” of the predictor. In the case of a binary predictor, it is the log odds ratio between the two groups.
glm(y ~ x, data = df, family = binomial()) %>% tidy %>%
kable
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | -0.2336149 | 0.3070802 | -0.7607617 | 0.4467994 |
| xTRUE | 1.1745982 | 0.4256623 | 2.7594604 | 0.0057897 |
## odds of y among the x=TRUE observations
odds.y.ifx = mean(subset(df, x)$y)/mean(!subset(df, x)$y)
## odds of y among the x=FALSE observations
odds.y.ifnotx = mean(subset(df, !x)$y)/mean(!subset(df, !x)$y)
log(odds.y.ifx/odds.y.ifnotx)
## [1] 1.174598
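Going back from the log scale, exponentiating the coefficient gives the odds ratio itself. As a quick extra check (not part of the original run; broom’s tidy also has an exponentiate= argument):
## odds ratio between the two groups, from the model coefficient
glm(y ~ x, data = df, family = binomial()) %>% coef %>%
    exp
## or directly through broom
glm(y ~ x, data = df, family = binomial()) %>% tidy(exponentiate = TRUE) %>%
    kable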
Extreme cases
How efficient is logistic regression when there is an imbalance between the different types of observations? For example, if just a few genomic regions overlap an interesting annotation and I want to test whether the overlap is significant.
Let’s look at the worst case, when there is only 1 observation for a particular class.
df = data.frame(y = sample(c(FALSE, TRUE), 100, TRUE))
df$x = 1:nrow(df) %in% sample.int(nrow(df), 1)
glm(y ~ x, data = df, family = binomial()) %>% tidy %>%
kable
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 0.1010961 | 0.2012644 | 0.5023050 | 0.6154530 |
| xTRUE | 15.4649721 | 1455.3975462 | 0.0106259 | 0.9915219 |
Although the significance is low, the estimate seems quite high. I’ll repeat this process a bunch of times and with different numbers of supporting observations to get an idea of the distribution.
ext.df = lapply(1:500, function(ii) {
res = lapply(1:10, function(ssi) {
df$x = 1:nrow(df) %in% sample.int(nrow(df),
ssi)
glm(y ~ x, data = df, family = binomial()) %>%
tidy %>% mutate(rep = ii, ss = ssi)
})
do.call(rbind, res)
})
ext.df = do.call(rbind, ext.df)
ext.df %>% filter(term == "xTRUE") %>% ggplot(aes(x = estimate)) +
geom_density(fill = "grey50") + facet_grid(ss ~
., scales = "free") + theme_bw()

It seems like the estimate “inflation” is problematic mostly when there are only 1 or 2 supporting observations. With more than 5 supporting observations the estimate is correctly centered at 0.
This problem is in fact called the problem of separation (a minimal example of it is sketched right after the list below). There are two approaches to deal with it:
- Firth logistic regression.
- Exact logistic regression.
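To see what separation looks like concretely, here is a minimal made-up example (not part of the simulations above): the only x = TRUE observation has y = TRUE, so the likelihood keeps improving as the coefficient grows, and glm typically warns that fitted probabilities of 0 or 1 occurred.
## quasi-complete separation: the single x=TRUE observation has y=TRUE,
## so the maximum likelihood estimate of the x coefficient is infinite
df.sep = data.frame(x = c(TRUE, rep(FALSE, 9)), y = c(TRUE,
    sample(c(FALSE, TRUE), 9, TRUE)))
glm(y ~ x, data = df.sep, family = binomial()) %>% tidy %>%
    kable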
The rms package from Frank Harrell. It implements a penalized maximum likelihood estimation of the model coefficients through the lrm function, which has a penalty= parameter.
library(rms)
extrms.df = lapply(1:200, function(ii) {
res = lapply(1:10, function(ssi) {
res = lapply(c(1, 3, 5), function(pen) {
df$x = 1:nrow(df) %in% sample.int(nrow(df),
ssi)
cc = lrm(y ~ x, data = df, penalty = pen)$coefficient
data.frame(term = names(cc), estimate = cc,
rep = ii, ss = ssi, penalty = pen,
stringsAsFactors = FALSE)
})
do.call(rbind, res)
})
do.call(rbind, res)
})
extrms.df = do.call(rbind, extrms.df)
extrms.df %>% filter(term == "x") %>% ggplot(aes(x = estimate)) +
geom_density(fill = "grey50") + facet_grid(ss ~
penalty, scales = "free") + theme_bw()

It definitely helps: the estimates are now much closer to 0. I don’t see much difference between penalties of 1, 3 and 5.
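If I read the rms documentation correctly, it can also help choose the penalty: the pentrace function evaluates a grid of penalties (AIC/BIC-based) for a model fitted with x = TRUE and y = TRUE. A rough sketch on a fresh, non-extreme dataset (df.pen, fit and the grid values are just illustrative):
## hypothetical example: explore a grid of penalties for one fit
df.pen = data.frame(x = sample(c(FALSE, TRUE), 100, TRUE))
df.pen$y = ifelse(runif(100) < 0.3, !df.pen$x, df.pen$x)
fit = lrm(y ~ x, data = df.pen, x = TRUE, y = TRUE)
pentrace(fit, seq(0, 5, by = 0.5))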
The logistf package. It implements Firth’s bias reduction method with its logistf function.
library(logistf)
extstf.df = lapply(1:200, function(ii) {
res = lapply(1:10, function(ssi) {
df$x = 1:nrow(df) %in% sample.int(nrow(df),
ssi)
cc = logistf(y ~ x, data = df)$coefficient
data.frame(term = names(cc), estimate = cc,
rep = ii, ss = ssi, stringsAsFactors = FALSE)
})
do.call(rbind, res)
})
extstf.df = do.call(rbind, extstf.df)
extstf.df %>% filter(term == "xTRUE") %>% ggplot(aes(x = estimate)) +
geom_density(fill = "grey50") + facet_grid(ss ~
., scales = "free") + theme_bw()

This works well too.
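As a final side-by-side check on the worst case from before (at this point df still has a single x = TRUE observation), comparing the plain glm fit with the Firth fit shows the shrinkage directly; depending on the random draw, the glm estimate is a huge positive or negative number while logistf stays reasonable:
## standard logistic regression vs Firth's bias-reduced version
glm(y ~ x, data = df, family = binomial()) %>% tidy %>%
    kable
logistf(y ~ x, data = df)$coefficients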
More advanced models
A dummy example with some code for Generalized Additive Models, LOESS and SVM.
nb.samp = 1000
df = data.frame(x = runif(nb.samp, 0, 100))
df$y = rnorm(nb.samp, 0, 5) + abs(df$x - 25)
df$y = ifelse(df$x > 40, rnorm(nb.samp, 0, 5) - df$x *
df$x/300 + 20, df$y)
ggplot(df, aes(x = x, y = y)) + geom_point(alpha = 0.5) +
theme_bw()

glm.o = glm(y ~ x, data = df)
loess.o = loess(y ~ x, data = df)
library(mgcv)
gam.o = gam(y ~ s(x, bs = "cs"), data = df)
library(e1071)
svm.o = svm(y ~ x, data = df)
pred.df = rbind(df %>% mutate(y = predict(glm.o), model = "glm"),
df %>% mutate(y = predict(gam.o), model = "gam"),
df %>% mutate(y = predict(loess.o), model = "LOESS"),
df %>% mutate(y = predict(svm.o), model = "SVM"))
ggplot(df, aes(x = x, y = y)) + geom_point(alpha = 0.2) +
geom_line(aes(colour = model), size = 2, alpha = 0.9,
data = pred.df) + theme_bw() + scale_colour_brewer(palette = "Set1")

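To complement the visual comparison, here is a quick in-sample error summary (a rough sketch I’m adding; error on the training data mostly reflects model flexibility, a proper comparison would use held-out data):
## in-sample root mean squared error for each model
pred.df %>% mutate(obs = rep(df$y, 4)) %>% group_by(model) %>%
    summarize(rmse = sqrt(mean((obs - y)^2))) %>% kable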