STATISTICS AND ANALYSES
4.6
Hypothesis tests with R
In this article we show you the commands that you need to carry out various statistical tests within R.
We presuppose you know which test(s) you want to run. If you are unsure which test you need for your hypothesis and data, check the decision tree we encountered in a previous chapter.
In the example below we start by reading two data sets which we will use to illustrate. Download the ‘diet’ data and drag it in the folder where you have saved your R script.
Note: a line beginning with a hashtag (#) indicates a comment in R. This means that it is not carried out as code.
# Load packages and read in the data
library(MASS)
# If this line throws an error, run the following command once before the library() command: install.packages(“MASS”)
# Download the diet data from the link in the text above.
# Read data
Diet <- read.csv(file.choose())
# Alternative way to read the data
# Set the working directory to the folder where the data lies with setwd(“path/to/your/folder”)
Diet <- read.csv("diet_data.csv") # read “Diet” data
# Look at the data
head(Diet)
# Person gender Age Height pre.weight Diet weight6weeks HealthStatus
# 1 25 NA 41 171 60 2 60.0 unaffected
# 2 26 NA 32 174 103 2 103.0 unaffected
# 3 1 0 22 159 58 1 54.2 affected
# ...
# Load another test data set
data("survey") # load data (note: this data is part of R, therefore we do not read it from a csv file)
head(survey) # look at the first rows and all variables in “survey”
# Sex Wr.Hnd NW.Hnd W.Hnd Fold Pulse Clap Exer Smoke Height M.I Age
# 1 Female 18.5 18.0 Right R on L 92 Left Some Never 173.00 Metric 18.250
# 2 Male 19.5 20.5 Left R on L 104 Left None Regul 177.80 Imperial 17.583
# 3 Male 18.0 13.3 Right L on R 87 Neither None Occas NA <NA> 16.917
# …
The ‘diet’ data contains data about health and dieting (the data originates from this website). The ‘survey’ data is a survey about exercise and smoking.
Chi-squared test
This example uses the ‘survey’ data to test an association of the two categorical variables ‘smoke’ (smoking measured from never to heavy) and ‘exer’ (exercise frequency measured as none, some or frequently).
This is a test against the independence of the variables1:
# Look at the data
tbl <- table(survey$Smoke, survey$Exer) # contingency table
tbl # view the table
chisq.test(tbl) # Conduct the Chi-squared test
# output: if significant we reject the null hypothesis that exercise level is independent of smoking habit.
Correlation analysis
The next example uses the ‘diet’ data set, to check if the continuous variables ‘age’ and ‘pre.weight’ (weight before the diet) are correlated.
cor(Diet$Age, Diet$pre.weight, # specify data set and variables
use="complete.obs", # pairwise exclusion of NA values
method="pearson") # specify correlation method
# Additionally compute a test of the value being zero, testing for significance
cor.test(Diet$Age, Diet$pre.weight, # specify data and variables
method = "pearson", # specify correlation method
alternative = "two.sided") # directionality
Note: other types of correlation are possible (see ?cor
).
Repeated measures ANOVA
This example of R code uses the ‘diet’ dataset to check the effect of ‘time’ (pre diet vs. post diet) on the continuous variable ‘weight’ while controlling for the covariates ‘age’ and ‘gender’. Weight is a repeated (within-subject) continuous dependent variable, because every participant reported their weight twice.
# The data contains two weight columns ‘pre.weight’ and ‘weight6weeks’
# Person gender Age Height pre.weight Diet weight6weeks HealthStatus
# 25 NA 41 171 60 2 60.0 unaffected
# 26 NA 32 174 103 2 103.0 unaffected
# We want it to contain one longer ‘Weight’ column and a ‘Time’ column, like this:
# Person gender Age Height Diet HealthStatus Time Weight
# 25 NA 41 171 2 unaffected pre.weight 60
# 26 NA 32 174 2 unaffected pre.weight 103
# - … -
# 77 1 40 167 3 unaffected weight6weeks 77.8
# 78 1 51 175 3 unaffected weight6weeks 81.9
library(tidyr) # If you do not have the package, run install.packages(‘tidyr’)
Diet_long <- gather(data = Diet, # specify which data to use
Time, Weight, # specify names of NEW variables
Pre.weight, weight6weeks) # specify variables combine
head(Diet_long) # Check result
Run the repeated measures ANOVA
library(nlme) # load the package, if it is missing run install.packages(‘nlme’)
model <- lme(Weight ~ Age + gender + Time, # specify DV and fixed effects
random = ~1|Person, # specify random effects
data = Diet_long, # specify dataset
control = lmeControl(opt="optim"),
na.action = "na.omit") # what to do in case of NA values
summary(model) # View full results
anova(model, type = “marginal”) # Just print the fixed effects
Between-subjects ANOVA
This example uses the ‘survey’ data set to regress the continuous dependent variable ‘pulse’ (pulse of participants) on the categorical independent variable ‘exer’ (exercise level).
model <- aov(Pulse ~ Exer, # specify dependent and indep. variables
data = survey) # specify data
summary(model) # show the results
Now, we add age and gender as covariates
model <- aov(Pulse ~ Exer + Age + Sex, # specify variables and covariates
data = survey) # specify data
summary(model) # show the results
Single-sample t-test
This example uses the ‘survey’ data set to test if the continuous variable ‘pulse’ (the pulse of participants) is, on average, equal to 79 with a one-sample t-test.
mean(survey$Pulse, na.rm = T) # view the mean (74.15) ignoring NA values
t.test(survey$Pulse, # your data and the variable
mu = 79, # the value against which to test
alternative = "two.sided") # directionality
Paired samples t-test
The next example uses the ‘diet’ data set to test if the continuous within-subject (repeated measures) variable ‘weight’ changed before and after a diet.
First, we need to reshape the data a bit to contain a new variable ‘time’ and only one variable ‘weight’ (currently there is not ‘time’ variable and there are two ‘weight’ variables). If your data contains a ‘time’ (or ‘block‘, ‘repetition’, ‘trial’, …) variable and only one independent variable, skip this lines of code and compute the t-test directly.
This is called reshaping from wide to long format:
# Currently the data has TWO weight-variables (‘pre.weight’ and ‘weight6weeks’)
# Person gender Age Height pre.weight Diet weight6weeks HealthStatus
# 25 NA 41 171 60 2 60.0 unaffected
# 26 NA 32 174 103 2 103.0 unaffected
# We want the data to have ONE weight variable and one new TIME variable
# Person gender Age Height Diet HealthStatus Time Weight
# 25 NA 41 171 2 unaffected pre.weight 60
# 26 NA 32 174 2 unaffected pre.weight 103
# - … -
# 155 77 1 40 167 3 unaffected weight6weeks 77.8
# 156 78 1 51 175 3 unaffected weight6weeks 81.9
library(tidyr) # If you do not have the package, run install.packages(‘tidyr’)
# Generate a new dataset ‘Diet_long’
Diet_long <- gather(data = Diet, # specify which data to use
Time, Weight, # specify names of NEW variables
pre.weight, weight6weeks) # specify the two variables combine
head(Diet_long) # Check if this was successful
Now we can perform the t-test:
t.test(Weight ~ Time, # specify dependent var. and time variable
data = Diet_long, # specify data to use
paired = TRUE, # specify that it is a paired test
alternative = "two.sided") # directionality
Independent samples t-test
This example uses the ‘diet’ data set to test if the continuous variable ‘height’ differs between the categorical variable ‘gender’.
t.test(Height ~ gender, # specify the variables
data = Diet, # specify data
alternative = "two.sided") # directionality
Simple regression
The next code uses the ‘diet’ data to regress ‘weight6weeks’ on ‘age’.
model <- lm(weight6weeks ~ Age, # specify dependent and independent var.
data = Diet) # specify data
summary(model) # view results
# Other results
coefficients(model) # parameter estimates
confint(model, level = 0.95) # confidence interval
fitted(model) # predicted values
residuals(model) # residuals
anova(model) # effect of age on weight at week 6
vcov(model) # covariance matrix for model parameters
influence(model) # regression diagnostics
Multiple Regression
The below example uses the ‘diet’ data to regress three measured variables (age, gender, and height) onto a continuous dependent variable (‘weight6weeks’).
model <- lm(weight6weeks ~ Age + gender + Height, # specify variables
data = Diet) # specify data
# After running the line(s) above, display at the results
summary(model) # Pr(|z|) is the significance of coefficients
coefficients(model) # View the regression coefficients
confint(model, level = 0.95) # view the confidence intervals
# Predict the values with the model
fitted(model) # predict the dependent variable values
# Further analytics of the quality of the regression
residuals(model) # residuals
anova(model) # ANOVA table, good for tables in a paper
vcov(model) # variance covariance matrix
Binomial regression, such as a logistic or probit regression
This code uses the ‘diet’ dataset to regress the categorical variable ‘healthstatus’ on age, gender, weights before and after the diet and whether participants participated in a diet. If you are interested in more information on logistic regression, you can find it here.
model <- glm(HealthStatus ~ Age + gender + weight6weeks + pre.weight + Diet, # specify the independent an ddependent variables
Data = Diet, # specify the data
Family = binomial(link='logit')) # specify the logistic regression
summary(model) # The column Pr(>|z|) # view result
# shows the p-value of the beta weights
# Compare your model performance vs. a null model (model with only an intercept)
anova(model, test="Chisq")
# assess the how good the model fit is, using McFadden R
library(pscl) # if you do not have the package, run install.packages(“pscl”)
pR2(model) 2
fitted.results <- predict(model, newdata = DF_test, type = "response")
misClassificationError <- mean(fitted.results != DF_test$Group)
print(paste('Accuracy',1-misClassificationError))
# Further steps: use ROC (see the above link)
1 This is an example from R Tutorial, accessed December 6th, 2017.
Downloads