STATISTICS AND ANALYSES

4.6

Hypothesis tests with R

In this article we show you the commands that you need to carry out various statistical tests within R.

We presuppose you know which test(s) you want to run. If you are unsure which test you need for your hypothesis and data, check the decision tree we encountered in a previous chapter.

In the example below we start by reading two data sets which we will use to illustrate. Download the ‘diet’ data and drag it in the folder where you have saved your R script.

Note: a line beginning with a hashtag (#) indicates a comment in R. This means that it is not carried out as code.

# Load packages and read in the data
library(MASS)
# If this line throws an error, run the following command once before the library() command: install.packages(“MASS”)

# Download the diet data from the link in the text above.
# Read data
Diet <- read.csv(file.choose())

# Alternative way to read the data
# Set the working directory to the folder where the data lies with setwd(“path/to/your/folder”)
Diet <- read.csv("diet_data.csv")       # read “Diet” data

# Look at the data
head(Diet)                          
#   Person  gender  Age Height  pre.weight  Diet    weight6weeks    HealthStatus
# 1 25      NA      41  171     60          2       60.0            unaffected
# 2 26      NA      32  174     103         2       103.0           unaffected
# 3 1       0       22  159     58          1       54.2            affected
# ...

# Load another test data set
data("survey")          # load data (note: this data is part of R, therefore we do not read it from a csv file)
head(survey)          # look at the first rows and all variables in “survey”
#   Sex     Wr.Hnd      NW.Hnd      W.Hnd   Fold    Pulse   Clap        Exer    Smoke   Height  M.I         Age
# 1 Female  18.5        18.0        Right   R on L  92      Left        Some    Never   173.00  Metric      18.250
# 2 Male    19.5        20.5        Left    R on L  104     Left        None    Regul   177.80  Imperial    17.583
# 3 Male    18.0        13.3        Right   L on R  87      Neither     None    Occas   NA      <NA>        16.917
# …

The ‘diet’ data contains data about health and dieting (the data originates from this website). The ‘survey’ data is a survey about exercise and smoking.

Chi-squared test
This example uses the ‘survey’ data to test an association of the two categorical variables ‘smoke’ (smoking measured from never to heavy) and ‘exer’ (exercise frequency measured as none, some or frequently).

This is a test against the independence of the variables1:

# Look at the data
tbl <- table(survey$Smoke, survey$Exer)       # contingency table
tbl                                         # view the table

chisq.test(tbl) # Conduct the Chi-squared test
# output: if significant we reject the null hypothesis that exercise level is independent of smoking habit.

Correlation analysis
The next example uses the ‘diet’ data set, to check if the continuous variables ‘age’ and ‘pre.weight’ (weight before the diet) are correlated.

cor(Diet$Age, Diet$pre.weight,   # specify data set and variables
    use="complete.obs",         # pairwise exclusion of NA values
    method="pearson")           # specify correlation method

# Additionally compute a test of the value being zero,  testing for significance
cor.test(Diet$Age, Diet$pre.weight,    # specify data and variables
        method = "pearson",             # specify correlation method
        alternative = "two.sided")      # directionality

Note: other types of correlation are possible (see ?cor).

Repeated measures ANOVA
This example of R code uses the ‘diet’ dataset to check the effect of ‘time’ (pre diet vs. post diet) on the continuous variable ‘weight’ while controlling for the covariates ‘age’ and ‘gender’. Weight is a repeated (within-subject) continuous dependent variable, because every participant reported their weight twice.

# The data contains two weight columns ‘pre.weight’ and ‘weight6weeks’
# Person    gender  Age Height  pre.weight  Diet    weight6weeks    HealthStatus
# 25        NA      41  171     60          2       60.0            unaffected
# 26        NA      32  174     103         2       103.0           unaffected

# We want it to contain one longer ‘Weight’ column and a ‘Time’ column, like this: 
# Person    gender  Age Height  Diet    HealthStatus    Time            Weight
# 25        NA      41  171     2       unaffected      pre.weight      60
# 26        NA      32  174     2       unaffected      pre.weight      103
# - … -
# 77        1       40  167     3       unaffected      weight6weeks    77.8
# 78        1       51  175     3       unaffected      weight6weeks    81.9

library(tidyr) # If you do not have the package, run install.packages(‘tidyr’)

Diet_long <- gather(data = Diet,              # specify which data to use
                    Time, Weight,                 # specify names of NEW variables
                    Pre.weight, weight6weeks)   # specify variables combine

head(Diet_long)                   # Check result 

Run the repeated measures ANOVA
library(nlme)  # load the package, if it is missing run install.packages(‘nlme’)
model <- lme(Weight ~  Age + gender + Time,      # specify DV and fixed effects
            random = ~1|Person,                  # specify random effects
            data = Diet_long,                    # specify dataset 
            control = lmeControl(opt="optim"),  
            na.action = "na.omit")              # what to do in case of NA values

summary(model)              # View full results
anova(model, type = marginal)     # Just print the fixed effects

Between-subjects ANOVA
This example uses the ‘survey’ data set to regress the continuous dependent variable ‘pulse’ (pulse of participants) on the categorical independent variable ‘exer’ (exercise level).

model <-  aov(Pulse ~ Exer,   # specify dependent and indep. variables
              data = survey)      # specify data
summary(model)            # show the results

Now, we add age and gender as covariates

model <-  aov(Pulse ~ Exer + Age + Sex,     # specify variables and covariates
              data = survey)            # specify data
summary(model)                  # show the results

Single-sample t-test
This example uses the ‘survey’ data set to test if the continuous variable ‘pulse’ (the pulse of participants) is, on average, equal to 79 with a one-sample t-test.

mean(survey$Pulse, na.rm = T)       # view the mean (74.15) ignoring NA values

t.test(survey$Pulse,            # your data and the variable
       mu = 79,                         # the value against which to test
       alternative = "two.sided")   # directionality

Paired samples t-test
The next example uses the ‘diet’ data set to test if the continuous within-subject (repeated measures) variable ‘weight’ changed before and after a diet.

First, we need to reshape the data a bit to contain a new variable ‘time’ and only one variable ‘weight’ (currently there is not ‘time’ variable and there are two ‘weight’ variables). If your data contains a ‘time’ (or ‘block‘, ‘repetition’, ‘trial’, …) variable and only one independent variable, skip this lines of code and compute the t-test directly.

This is called reshaping from wide to long format:

# Currently the data has TWO weight-variables (‘pre.weight’ and ‘weight6weeks’)
# Person    gender  Age Height  pre.weight  Diet    weight6weeks    HealthStatus
# 25        NA      41  171     60          2       60.0            unaffected
# 26        NA      32  174     103         2       103.0           unaffected

# We want the data to have ONE weight variable and one new TIME variable
# Person    gender  Age Height  Diet    HealthStatus    Time            Weight
# 25        NA      41  171     2       unaffected      pre.weight      60
# 26        NA      32  174     2       unaffected      pre.weight      103
# - … -
# 155 77    1       40  167     3       unaffected      weight6weeks    77.8
# 156 78    1       51  175     3       unaffected      weight6weeks    81.9

library(tidyr) # If you do not have the package, run install.packages(‘tidyr’)
# Generate a new dataset ‘Diet_long’
Diet_long <- gather(data = Diet,            # specify which data to use
                    Time, Weight,           # specify names of NEW variables
                    pre.weight, weight6weeks)   # specify the two variables combine
head(Diet_long)                     # Check if this was successful

Now we can perform the t-test:

t.test(Weight ~ Time,           # specify dependent var. and time variable
      data = Diet_long,             # specify data to use
      paired = TRUE,            # specify that it is a paired test
alternative = "two.sided")  # directionality

Independent samples t-test
This example uses the ‘diet’ data set to test if the continuous variable ‘height’ differs between the categorical variable ‘gender’.

t.test(Height ~ gender,             # specify the variables
      data = Diet,                  # specify data
      alternative = "two.sided")        # directionality

Simple regression
The next code uses the ‘diet’ data to regress ‘weight6weeks’ on ‘age’.

model <- lm(weight6weeks ~ Age,     # specify dependent and independent var.
           data = Diet)     # specify data
summary(model)                          # view results

# Other results
coefficients(model)                         # parameter estimates
confint(model, level = 0.95)                # confidence interval
fitted(model)                               # predicted values
residuals(model)                            # residuals
anova(model)                                # effect of age on weight at week 6
vcov(model)                                 # covariance matrix for model parameters
influence(model)                            # regression diagnostics

Multiple Regression
The below example uses the ‘diet’ data to regress three measured variables (age, gender, and height) onto a continuous dependent variable (‘weight6weeks’).

model <- lm(weight6weeks ~ Age + gender + Height,   # specify variables
            data = Diet)                    # specify data

# After running the line(s) above, display at the results
summary(model)              # Pr(|z|) is the significance of coefficients
coefficients(model)         # View the regression coefficients
confint(model, level = 0.95)    # view the confidence intervals

# Predict the values with the model
fitted(model)       # predict the dependent variable values

# Further analytics of the quality of the regression
residuals(model)    # residuals
anova(model)        # ANOVA table, good for tables in a paper
vcov(model)     # variance covariance matrix

Binomial regression, such as a logistic or probit regression
This code uses the ‘diet’ dataset to regress the categorical variable ‘healthstatus’ on age, gender, weights before and after the diet and whether participants participated in a diet. If you are interested in more information on logistic regression, you can find it here.

model <- glm(HealthStatus ~ Age + gender + weight6weeks + pre.weight + Diet, # specify the independent an ddependent variables
            Data = Diet,                       # specify the data 
            Family = binomial(link='logit'))   # specify the logistic regression

summary(model) # The column Pr(>|z|) # view result
# shows the p-value of the beta weights

# Compare your model performance vs. a null model (model with only an intercept)
anova(model, test="Chisq")     

# assess the how good the model fit is, using McFadden R                
library(pscl)  # if you do not have the package, run install.packages(“pscl”)
pR2(model)                                      2
fitted.results <- predict(model, newdata = DF_test, type = "response")
misClassificationError <- mean(fitted.results != DF_test$Group)
print(paste('Accuracy',1-misClassificationError))
# Further steps: use ROC (see the above link)

1 This is an example from R Tutorial, accessed December 6th, 2017.

Downloads