Clean data with R

In the last step you learned how to read data into R. But what if you need to clean your data in R? This video shows you how to check your data for invalid responses (think of typos like age = 125) and how to rename variables.

Below the video, you will find the R code we used in the video. In addition, we added some code that could be useful for you as well. Once again, you can find the example data in the ‘downloads’ section below. Note: it is the same data as in the previous step.

Here is the code:

# Clean data with R

# Set working directory
setwd("~/Desktop/Test R") # Change the content between quotation marks. It needs to be the path to the folder with your data

mydata <- read.csv2("clean_data.csv")   # read the data in
head(mydata)                    # look at the data

# Clean this dataset
# For example, we want no value > 100
which(mydata > 100)
toohigh <- which(mydata > 100)

# For example, we want no value < 0
which(mydata < 0)
toolow <- which(mydata < 0)

# Second version of my data
mydatav2 <- mydata # copy data
mydatav2[ mydatav2 > 100] <- NA # substitute by "NA" (missing value)

which(mydatav2 > 100)
which(mydata > 100)

mydatav2[ mydatav2 < 0 ] <- NA
which(mydatav2 < 0)
which(mydata < 0)

# the variable 'privileges' should not be 50
mydata$privileges
# replace 50 with NA
mydatav3 <- mydata # copy data set
mydatav3$privileges[ mydatav3$privileges == 50 ] # two cases
# substitute these cases with NA
mydatav3$privileges[ mydatav3$privileges == 50 ] <- NA

which( mydata$privileges == 50 ) # two cases of 50
which( mydatav3$privileges == 50 ) # no case

# column names
head(mydata)
names(mydata)
# rename "rating" to "score"
names(mydata)[1] # return the first entry
names(mydata)[names(mydata) == "rating"] # alternative way to select the first entry (by name instead of position)
# rename the first entry
names(mydata)[1] <- "score"
names(mydata) # with new names
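
The range checks above return positions as single index numbers, which can be hard to map back to a row and a column. As a minimal sketch, assuming a small made-up data frame named toy (its values are invented and are not the course data), the arr.ind argument of which() reports the row and column of each invalid value:

```r
# Toy data frame; the values are made up for illustration only
toy <- data.frame(
  rating     = c(43, 125, 71, -5),
  complaints = c(51, 64, 70, 63),
  privileges = c(30, 50, 68, 45)
)

# Row and column of every value outside the assumed valid range 0 to 100
bad <- which(toy > 100 | toy < 0, arr.ind = TRUE)
print(bad)

# Replace the out-of-range values with NA in a copy; the original stays intact
toy_clean <- toy
toy_clean[toy_clean > 100 | toy_clean < 0] <- NA

# Count the missing values per column after cleaning
na_per_column <- colSums(is.na(toy_clean))
print(na_per_column)
```

Working on a copy, as in the video, keeps the raw values available in case a replacement turns out to be wrong.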



# ADDITIONAL CODE THAT COULD BE USEFUL TO YOU (not shown in the video)

# Mean of different variables
# Calculate the mean of several variables in your dataset
# For example the mean of 'complaints' and 'privileges'
variable_names <- c("complaints", "privileges") # store names of variables
MW <- rowMeans(mydata[, variable_names]) # compute mean of these two variables for each row and store it in MW
mydata$Mean <- MW # Store the computed mean in mydata as variable ‘Mean’

head(mydata) # check the result


# Recode reverse-coded variables
# Say the variable 'complaints' was reverse coded with the highest value of 111
# Recoding works by subtracting the variable from the highest possible value and adding 1
# (this assumes that the lowest possible value is 1)
# highest value - variable + 1

complaints_rev <- 111 - mydata$complaints + 1 # Recode complaints and store it in complaints_rev
mydata$complaints_rev <- complaints_rev # store complaints_rev in the data mydata

head(mydata) # check the result
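
If some values have already been set to NA during cleaning, the row mean and the reverse-coded variable need a little care. The following is a minimal sketch, assuming a made-up data frame named toy and an assumed scale from 1 to 111 (scale_min and scale_max are illustrative names, not part of the course code). It uses rowMeans() with na.rm = TRUE and the general reverse-coding rule highest value + lowest value - variable:

```r
# Toy data frame with one missing value; the values are made up for illustration
toy <- data.frame(
  complaints = c(51, 64, NA, 63),
  privileges = c(30, 51, 68, 45)
)

variable_names <- c("complaints", "privileges")

# Row-wise mean; na.rm = TRUE averages over the values that are present
toy$Mean <- rowMeans(toy[, variable_names], na.rm = TRUE)

# Reverse coding on an assumed scale from 1 to 111
scale_min <- 1
scale_max <- 111
toy$complaints_rev <- scale_max + scale_min - toy$complaints

print(toy)
```

Without na.rm = TRUE, any row that contains an NA would get NA as its mean. Whether averaging over the remaining values is appropriate depends on your analysis.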

A good resource for learning R is the textbook ‘YaRrr! The Pirate’s Guide to R’ by Nathaniel Phillips.

Copyright

University of Basel