---
title: "Statistics"
author: "Introduction to R for Public Health Researchers"
output:
  ioslides_presentation:
    css: ../styles.css
    widescreen: yes
---
```{r knit-setup, include=FALSE}
library(knitr)
opts_chunk$set(echo = TRUE,
               message = FALSE,
               warning = FALSE,
               fig.height = 4,
               fig.width = 7,
               comment = "")
library(dplyr)
```
## Statistics
Now we are going to cover how to perform a variety of basic statistical tests in R:

* Correlation
* T-tests/Rank-sum tests
* Linear Regression
* Logistic Regression
* Proportion tests
* Chi-squared
* Fisher's Exact Test

Note: We will be glossing over the statistical theory and "formulas" for these tests. There are plenty of resources online for learning more about these tests, as well as a dedicated Biostatistics series at the School of Public Health.
## Correlation
`cor()` performs correlation in R:
```
cor(x, y = NULL, use = "everything",
    method = c("pearson", "kendall", "spearman"))
```
Like other functions, if there are `NA`s in the data, you get `NA` as the result. But if you set `use = "complete.obs"`, the correlation is computed using only the complete (non-missing) observations.
```{r cor1, comment="", message = FALSE}
library(readr)
circ = read_csv("http://johnmuschelli.com/intro_to_r/data/Charm_City_Circulator_Ridership.csv")
cor(circ$orangeAverage, circ$purpleAverage, use="complete.obs")
```
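Since these columns contain missing values, the default call returns `NA`:
```{r cor1_na, comment=""}
# default use = "everything": any NA in either vector propagates to the result
cor(circ$orangeAverage, circ$purpleAverage)
```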
## Correlation with `corrr`
The `corrr` package allows you to do correlations easily:
```{r cor2, comment="", message = TRUE}
library(corrr); library(dplyr)
circ %>%
  select(orangeAverage, purpleAverage) %>%
  correlate()
```
## Correlation
`correlate` is usually better with more than 2 columns:
```{r cor3, comment=""}
avgs = circ %>% select(ends_with("Average"))
cobj = avgs %>% correlate(use = "complete.obs", diagonal = 1)
cobj %>% fashion(decimals = 3)
```
## Correlation
```{r corplot, comment=""}
cobj %>% rplot()
```
## Correlation
You can also use `cor.test()` to test whether a correlation is significant (i.e., non-zero). Note that linear regression may be better, especially if you want to regress out other confounders.
```{r cor4, comment=""}
ct = cor.test(circ$orangeAverage, circ$purpleAverage,
              use = "complete.obs")
ct
```
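As a minimal sketch of that regression approach (no confounders added here; they would go on the right-hand side with `+`):
```{r cor4_lm, comment=""}
# regression analog of the correlation test; confounders would be added with `+`
lm(purpleAverage ~ orangeAverage, data = circ)
```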
## Correlation
For many of these test result objects, you can extract specific slots/results as numbers, since the `ct` object is just a list:
```{r cor5, comment=""}
# str(ct)
names(ct)
ct$statistic
ct$p.value
```
## Broom package
The `broom` package has a `tidy` function that puts most objects into `data.frame`s so that they are easily manipulated:
```{r broom, comment=""}
library(broom)
tidy_ct = tidy(ct)
tidy_ct
```
## Correlation
Note that you can add the correlation to a plot via the `annotate` function:
```{r cor_ggplot, comment="", fig.height=4,fig.width=4}
library(ggplot2)
txt = paste0("r=", signif(ct$estimate,3))
q = qplot(data = circ, x = orangeAverage, y = purpleAverage)
q + annotate("text", x = 4000, y = 8000, label = txt, size = 5)
```
## T-tests
The T-test is performed using the `t.test()` function, which essentially tests for a difference in means of a variable between two groups.
In this syntax, `x` and `y` are the columns of data for each group:
```{r tt1, comment=""}
tt = t.test(circ$orangeAverage, circ$purpleAverage)
tt
```
## T-tests
By default, `t.test` treats the data as independent samples. Realistically, these data are paired, so a paired t-test is more appropriate. Use the `paired = TRUE` argument to do a paired test:
```{r tt1_paired, comment=""}
t.test(circ$orangeAverage, circ$purpleAverage, paired = TRUE)
```
## T-tests
`t.test` saves a lot of information: the group means (`estimate`), the confidence interval for the difference (`conf.int`), the p-value (`p.value`), etc.
```{r tt1_1, comment=""}
names(tt)
```
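For example, you can pull these out directly (note that for this unpaired test, `estimate` holds the two group means):
```{r tt1_2, comment=""}
tt$estimate  # the two group means
tt$conf.int  # confidence interval for the difference
tt$p.value
```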
## T-tests
```{r tt1_broom, comment=""}
tidy(tt)
```
## T-tests
You can also use the 'formula' notation. In this syntax, it is `y ~ x`, where `x` is a factor with 2 levels or a binary variable and `y` is a vector of the same length.
```{r long_tt}
library(tidyr)
long = circ %>%
  select(date, orangeAverage, purpleAverage) %>%
  gather(key = line, value = avg, -date)
tt = t.test(avg ~ line, data = long)
tidy(tt)
```
## Wilcoxon Rank-Sum Tests
The Wilcoxon rank-sum test is a nonparametric analog to the t-test (comparing medians):
```{r wt, comment=""}
tidy(wilcox.test(avg ~ line, data = long))
```
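As with `t.test`, passing `paired = TRUE` gives the paired analog (the Wilcoxon signed-rank test):
```{r wt_paired, comment=""}
tidy(wilcox.test(circ$orangeAverage, circ$purpleAverage, paired = TRUE))
```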
## Lab Part 1
[Website](http://johnmuschelli.com/intro_to_r/index.html)
## Analysis of Variance
The `aov` function exists for ANOVA, but we'd recommend `lm` (Linear models):
```{r aov, comment=""}
long3 = circ %>%
  select(date, orangeAverage, purpleAverage, bannerAverage) %>%
  gather(key = line, value = avg, -date)
anova_res = aov(avg ~ line, data = long3)
anova_res
```
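The same ANOVA table can be obtained from an `lm` fit using `anova()`:
```{r aov_lm, comment=""}
# one-way ANOVA via a linear model fit
anova(lm(avg ~ line, data = long3))
```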
## Linear Regression
Now we will briefly cover linear regression. I will use a little notation here so some of the commands are easier to put in the proper context.
$$
y_i = \alpha + \beta x_i + \varepsilon_i
$$
where:

* $y_i$ is the outcome for person $i$
* $\alpha$ is the intercept
* $\beta$ is the slope
* $x_i$ is the predictor for person $i$
* $\varepsilon_i$ is the residual variation for person $i$
## Linear Regression
The `R` version of the regression model is:
```
y ~ x
```
where:

* `y` is your outcome
* `x` is/are your predictor(s)
## Linear Regression
For a linear regression, when the predictor is binary, this is the same as a t-test:
```{r regress1, comment=""}
fit = lm(avg ~ line, data = long)
fit
```

* `(Intercept)` is $\alpha$
* `linepurpleAverage` is $\beta$
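You can check this equivalence yourself; since `lm` assumes equal variances, use `var.equal = TRUE` in the t-test so the p-values match:
```{r regress1_tt, comment=""}
# the equal-variance t-test p-value matches the lm slope p-value
t.test(avg ~ line, data = long, var.equal = TRUE)$p.value
tidy(fit)$p.value[2]
```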
## Linear Regression
```{r regress1000, comment=""}
fit = lm(avg ~ line, data = long3)
fit
```
## Linear Regression
The `summary` command gets all the additional information (p-values, t-statistics, R-squared) that you usually want from a regression:
```{r regress2, comment=""}
sfit = summary(fit)
print(sfit)
```
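You can also pull pieces out of the summary object, e.g. the coefficient table and R-squared:
```{r regress2_coef, comment=""}
coef(sfit)      # estimates, standard errors, t-statistics, p-values
sfit$r.squared  # R-squared
```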
## Linear Regression
We can `tidy` linear models as well, which gives us all of this in a tibble:
```{r tidy_lm, comment=""}
tidy(fit)
```
## Linear Regression
The `conf.int` argument adds confidence intervals:
```{r tidy_lm_onf, comment=""}
tidy(fit, conf.int = TRUE)
```
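Base R's `confint()` gives matching intervals:
```{r confint_base, comment=""}
confint(fit)
```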
## Using Cars Data
```{r tt2, comment="", message = FALSE}
http_data_dir = "http://johnmuschelli.com/intro_to_r/data/"
cars = read_csv(
  paste0(http_data_dir, "kaggleCarAuction.csv"),
  col_types = cols(VehBCost = col_double()))
head(cars)
```
## Linear Regression
We'll look at vehicle odometer value by vehicle age:
```{r}
fit = lm(VehOdo ~ VehicleAge, data = cars)
print(fit)
```
## Linear Regression
Note that you can have more than 1 predictor in regression models. The interpretation of each slope is the change in the outcome corresponding to a one-unit change in that predictor, holding all other predictors constant.
```{r regress5, comment="", fig.height=4,fig.width=8}
fit2 = lm(VehOdo ~ IsBadBuy + VehicleAge, data = cars)
tidy(fit2)
```
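As a quick sketch of using such a fit for prediction (assuming `IsBadBuy` is coded 0/1, as in the model above):
```{r regress5_pred, comment=""}
# predicted odometer reading for a 5-year-old vehicle that is not a bad buy
predict(fit2, newdata = data.frame(IsBadBuy = 0, VehicleAge = 5))
```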
## Linear Regression: Interactions
The `*` operator adds interactions (along with the main effects):
```{r regress9, comment="", fig.height=4,fig.width=8}
fit3 = lm(VehOdo ~ IsBadBuy * VehicleAge, data = cars)
tidy(fit3)
```
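In formula notation, `a * b` expands to `a + b + a:b` (where `:` is the interaction term alone), so this fits the same model:
```{r regress9_expand, comment=""}
# written out: main effects plus the interaction term
fit3b = lm(VehOdo ~ IsBadBuy + VehicleAge + IsBadBuy:VehicleAge, data = cars)
coef(fit3b)
```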
## Linear Regression: Interactions
You can remove main effects with a minus sign:
```{r regress10, comment="", fig.height=4,fig.width=8}
fit4 = lm(VehOdo ~ IsBadBuy * VehicleAge - IsBadBuy, data = cars)
tidy(fit4)
```
## Linear Regression {.smaller}
Factors get special treatment in regression models: the lowest level of the factor is the reference (comparison) group, and the coefficients for all other levels are relative to it.
```{r regress6, comment="", fig.height=4,fig.width=8}
fit3 = lm(VehOdo ~ factor(TopThreeAmericanName), data = cars)
tidy(fit3)
```
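You can change the comparison group with `relevel()`; as a sketch, this assumes `"FORD"` is one of the levels of `TopThreeAmericanName`:
```{r regress6_relevel, comment=""}
# set a different reference level (assumes "FORD" is a level of the factor)
make = relevel(factor(cars$TopThreeAmericanName), ref = "FORD")
tidy(lm(VehOdo ~ make, data = cars))
```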
## Logistic Regression and GLMs {.smaller}
Generalized Linear Models (GLMs) allow for fitting regressions for non-continuous/non-normal outcomes. The `glm` function has similar syntax to the `lm` command; logistic regression is one example. See `?family` for the other available families.
```{r regress7, comment="", fig.height=4,fig.width=8}
glmfit = glm(IsBadBuy ~ VehOdo + VehicleAge, data = cars, family = binomial())
tidy(glmfit)
```
## Tidying GLMs
```{r tidy_glmfit, comment="", fig.height=4,fig.width=8}
tidy(glmfit, conf.int = TRUE)
```
## Tidying GLMs
```{r tidy_glm_exp, comment="", fig.height=4,fig.width=8}
tidy(glmfit, conf.int = TRUE, exponentiate = TRUE)
```
## Logistic Regression
Note that the coefficients are on the log-odds (logit) scale; we must exponentiate them to get odds ratios:
```{r regress8, comment="", fig.height=4,fig.width=8}
exp(coef(glmfit))
```
## Chi-squared tests
`chisq.test()` performs chi-squared contingency table tests and goodness-of-fit tests.
```
chisq.test(x, y = NULL, correct = TRUE,
           p = rep(1/length(x), length(x)), rescale.p = FALSE,
           simulate.p.value = FALSE, B = 2000)
```
```{r chisq1, comment=""}
tab = table(cars$IsBadBuy, cars$IsOnlineSale)
tab
```
## Chi-squared tests
You can also pass in a table object (such as `tab` here):
```{r chisq2, comment=""}
cq = chisq.test(tab)
cq
names(cq)
cq$p.value
```
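Equivalently, you can pass the two columns directly and `chisq.test` will tabulate them for you:
```{r chisq2b, comment=""}
chisq.test(cars$IsBadBuy, cars$IsOnlineSale)
```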
## Chi-squared tests
Note that this does the same test as `prop.test` for a 2x2 table (`prop.test` is not applicable for tables larger than 2x2).
```{r chisq3, comment=""}
chisq.test(tab)
prop.test(tab)
```
## Fisher's Exact test
`fisher.test()` performs an exact contingency table test using the hypergeometric distribution (useful for small sample sizes).
```
fisher.test(x, y = NULL, workspace = 200000, hybrid = FALSE,
            control = list(), or = 1, alternative = "two.sided",
            conf.int = TRUE, conf.level = 0.95,
            simulate.p.value = FALSE, B = 2000)
```
```{r fish.test, comment=""}
fisher.test(tab)
```
## Sampling
Also, if you want to plot only a subset of the data (for speed/time, or to reduce overplotting), you can sample rows with `sample_n` (a fixed number of rows) or `sample_frac` (a fraction of rows):
```{r samp_plot, comment="", fig.height=4,fig.width=7}
samp.cars = sample_n(cars, size = 10000)   # a random 10000 rows
samp.cars = sample_frac(cars, size = 0.2)  # a random 20% of rows (overwrites the above)
ggplot(samp.cars, aes(x = VehBCost, y = VehOdo)) +
  geom_point()
```
## Lab Part 2
[Website](http://johnmuschelli.com/intro_to_r/index.html)