A Few Reminders to Begin

  • R is case sensitive
  • You can always use ? to get help with a function, e.g. ?mean
  • Use the console for quick and dirty stuff, but keep a clean R script with working code
  • R Projects are super cool!
  • The 'Environment' tab contains useful info
  • R functions take arguments inside parentheses, separated by commas
  • You can download packages that contain even more functions

Week 1: Data Cleaning (etc.) Topics

This week I will cover the following topics related to data management:

  • Importing and formatting data
  • Missing data
  • Outliers
  • Data reduction decisions
  • Using summary statistics and data visualizations to inspect data in R

Section 1: Importing and Formatting the Data

Read in the Datafile

  • First find out where R will try to look for a file - type getwd() into the console and press Enter. The result will be a path to a directory on your computer, called the 'working directory.'
  • If you want to change the working directory, you can use setwd()
setwd("/Users/Melissa/Documents/PSYC302")

Hint: It is always helpful to look at the result of getwd() for a reminder on how to format the path for setwd.

Importing Data

A simple approach is to save as .csv first, then read that file into R:

data <- read.csv(file = "filename.csv",
                 header = TRUE,
                 sep = ",",
                 na.strings = ".")
  • file - file name, including path
  • header - TRUE if 1st row contains variable names
  • sep - explicitly tell R what the separator is
  • na.strings - tell R what values should be read as missing

Assigning Objects in R

Notice that I assigned the imported file to an object using <-. read.csv returns a 'dataframe' by default, and I named it "data"

When something is assigned using the <- operator, it becomes part of your R Workspace and will appear in your Environment tab.

  • Don't forget to assign the data to an object
  • If your workspace is getting cluttered, you can use the rm function
rm(var1, var2, var3)
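
To see what is currently cluttering the workspace before you remove anything, you can list the objects first:

ls()   # lists every object in your current workspace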

Evaluate the Structure of the Datafile

Most of the "clean" datasets we are used to seeing have:

  • one header row with variable names
  • one additional row per record of data
  • columns that traditionally represent variables

Some datafiles (e.g. files downloaded from some major survey distribution sites, or from data collection programs like E-Prime) have funky formats that stray from this structure.

Example of a Data Structure Issue

Example: Academic integrity study using the survey platform Prolific - the downloaded datafile includes a row of variable names, plus an additional row of variable labels

Solution: I created a copy of the raw datafile, with the only difference being that the second row was excluded from the new version using bracket subsetting

Bracket subsetting

You can subset data by using brackets in the format [row, column], such as

datanew <- dataold[r,c]

And you can use a - to remove particular rows and columns. For this example, I used

prodat <- prodat1[-2,]

You Solve It

Challenge 1: If I also wanted to remove the 4th and 9th columns, for some reason, how would I modify the above syntax?

Challenge 2: How would I modify the syntax to remove the first column, but retain all rows?

YSI Answers

Answer 1: prodat <- prodat1[-2,-c(4,9)]

Answer 2: prodat <- prodat1[,-1]

Variables and Variable Names

  • A dataframe in R consists of rows (records) and columns (variables) and has dimensions [rows, columns]
  • Most often, the first row of the dataframe contains variable names. When you want to refer to, summarize, or transform a single variable from your dataframe, you will use the format:
data$variable

Changing Variable Names

Now is a good time for me to introduce the pipe operator %>%. You can go ahead and install the tidyverse package using the code below:

install.packages("tidyverse")

After installation, load the package during each new session of R using the library function
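
For example:

library(tidyverse)   # run this at the start of every new R session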

Changing Variable Names, cont.

The %>% operator, called a pipe operator, is a very useful function included in the tidyverse package. You can use it to rename columns in your dataframe:

data <- data %>% 
  rename(
    PeerPer1 = vB612,   # rename() uses the form new_name = old_name
    PeerPer2 = vB613,
    PeerPer3 = vB614
  )

'Take [object on left of pipe] and then do to that object [transformation described by the content after the pipe].'

Changing Variable Names, cont.

Using only base R functions:

names(data)[12:14] <- c("PeerPer1", "PeerPer2", "PeerPer3")

where names(data)[12:14] refers to the 12th through the 14th items in the name vector of the data object.

Check Variable Types

Use the str function or just peek in the RStudio 'Environment' tab to see what variable type has been assigned to each column.
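
For example:

str(data)   # lists each variable's name, type, and first few values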

To change a variable type to factor, you can write something like this, assuming the variables exist in a dataframe titled 'data':

data$var1 <- factor(data$var1)

Notice that this code changes the variable directly

You Solve It

How would I modify this code (shown again below) to assign the categorical version of var1 to a new variable (object) called var_new?

data$var1 <- factor(data$var1)

YSI Answers

data$var_new <- factor(data$var1)

Section 2: Checking the Data Values

Range Checks

Ensure that the values in the dataset are within the expected ranges for each variable. When variables fail range checks, it is most likely because one of the following occurred:

  • The data was read into R incorrectly
  • There was a researcher error during data entry
  • The participant made an error when responding to a survey item


Range Checks, cont.

There are several functions in R that will be helpful for range checks. You can use the summary function on an entire dataframe or for a single variable at a time.

summary(data)

will return basic summary statistics for every numeric variable in the data frame, including the mean, median, min, max, and 1st and 3rd quartiles.

You Solve It

What if I wanted to look at a summary of a variable called AgeAtStart in my dataframe called Dat2014? Write the R code.

YSI Answer

summary(Dat2014$AgeAtStart)

Did you make sure everything was in the correct case?

Range Checks, cont.

You may also directly apply min, max, or range

min(data$variable, na.rm=TRUE)

The na.rm argument allows you to tell R to ignore the missing values.
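
range returns the minimum and maximum together, which is handy for a quick check:

range(data$variable, na.rm = TRUE)   # returns the min and max, ignoring NAs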

Dealing with Out of Range Values

If you find a truly impossible value that cannot be explained by an error in the data structure, I recommend you delete it.

However, if you recognize patterns in invalid values (e.g. one participant has many out-of-bounds values, or many of the invalid values were entered by a single data entry assistant), you may want to investigate further.
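
When you do decide to delete a single impossible value, in R that usually means replacing it with NA. A minimal sketch (the variable name and cut-off here are hypothetical):

# Hypothetical: treat ages above 120 as impossible and set them to missing
data$Age[which(data$Age > 120)] <- NA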

Editing Categorical Level Labels

This will mostly be necessary for open-response survey variables

Example: Gender Identity (open response). Because R is case-sensitive, it will assume that "transgender" and "Transgender" are different gender identities.

grep and grepl (as well as sub and gsub) functions allow you to search for and edit strings in R

# Change all values to lowercase
data$genderID <- tolower(data$genderID)

# Change "trans" to "transgender"
data$genderID <- gsub('^trans$', 'transgender', data$genderID)

I placed the string "trans" between a ^ and a $. These mean "beginning of string" and "end of string", respectively.
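
After the recoding above, a quick frequency table lets you confirm the cleanup worked:

table(data$genderID)   # counts of each unique response after cleaning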

You Solve It

What do you think would happen if I didn't include these operators, and simply typed the code below?

data$genderID <- gsub('trans', 'transgender', data$genderID)

YSI Answer

Every occurrence of 'trans', even inside a longer string, would be replaced with 'transgender', so a response that already read 'transgender', for example, would become 'transgendergender'

Reverse Coding in R

If a variable's range is 1:max, we use the rule new = (max + 1) - old to reverse-score. Apply this rule directly to the columns of your dataframe that need to be reverse-coded:

If the negatively worded items are scored 1-5 (so max + 1 = 6) and are represented in columns 4, 5, 6, and 11 of my dataframe:

data[c(4:6,11)] <- 6 - data[c(4:6,11)] 

Or if you only have one or two items to reverse code, simply:

data$v1 <- 6 - data$v1
data$v2 <- 6 - data$v2

Section 3: Data Quality

Data Quality

Social and behavioral sciences commonly collect data via surveys or questionnaires. This data is notoriously challenging to deal with for many reasons, including the mysteries around data quality.

There is a whole body of literature dedicated to the decisions researchers need to make around survey data quality and how to go about making those decisions.

We will just get into a bit of it here. In any case, you probably have at least some data cleaning you need to take care of.

Removing Records

There are a few reasons why you might want to remove records from your datafile.

Example: Academic Integrity online survey - the population of interest is current college students, but we had individuals answer "no" to the survey question asking "Are you currently a college student?" I want to remove these records.

There are several approaches you may take to remove records. It is worth noting that as you remove records, you are not actually removing them from the raw datafile on your hard drive.

Removing Records, cont

One way to remove records is to subset based on selection criteria that you determine. I want a subset of the dataframe that only includes the rows for which the column titled "enrolled" contains a value of "Yes".

datanew <- subset(data, data$enrolled=="Yes")

Check out the dimensions of datanew compared to data. There should be fewer rows in datanew, and the number of rows should be exactly equal to the number of "Yes" responses in the "enrolled" variable of data.
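
For example, a quick check using the object names above:

dim(data)              # rows and columns of the original dataframe
dim(datanew)           # should have fewer rows
table(data$enrolled)   # the "Yes" count should match the number of rows in datanew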

Removing Records, cont

I have mentioned before that you can subset using brackets, too.

datanew <- data[data$enrolled == "Yes", ]

Why else might I want to subset to exclude participants?

You Solve It

You may also want to exclude participants who opt out of your study at the end of the survey. Imagine you have a variable called 'DataUse' in a dataframe called 'SurveyData'. A value of "Yes" means the participant consents to their data being used. How would you ensure you only use their data?

YSI Answer

datanew <- subset(SurveyData, SurveyData$DataUse == "Yes")

or

datanew <- SurveyData[SurveyData$DataUse == "Yes", ]

Identifying "Bad" Data

Next you can start to think about removing other rows with just plain bad data, because

Garbage in, Garbage out

Identifying "Bad" Data, cont.

One recommended approach for survey data is to use time to completion.

Approach: Create a flag variable (aka dummy or indicator variable), then use it to filter

Questions: How short is “too short”? How long is “too long”, if we impose an upper limit at all?

We can make this decision based on reason, or by inspecting the data, or a combination of approaches

Identifying "Bad" Data, cont.

Inspecting the data for proper cut-off points may involve looking at a histogram of the time to completion variable

hist(data$CompTm_seconds)

If there are large breaks in the tails of the distribution, you may use these as clues
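
You can mark candidate cut-offs on that histogram with reference lines, for example using the 2-minute and 1-hour cut-offs adopted below:

hist(data$CompTm_seconds)
abline(v = c(120, 3600), col = "red", lty = 2)   # candidate lower and upper cut-offs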

Identifying "Bad" Data, cont.

After you decide on cut-off values, you can create your flag. If I decide that surveys completed in fewer than 2 minutes or more than 1 hour are likely to be of poor quality, I could do:

data$tmFlag <- ifelse(
                  data$CompTm_seconds < 120 | 
                  data$CompTm_seconds > 3600, 
                  1, 
                  0)

The | is code for "or"
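
Once the flag exists, you can use it to filter. A minimal sketch (the name 'data_clean' is just illustrative):

data_clean <- data[data$tmFlag == 0, ]   # keep only the rows that were not flagged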

You Solve It

What do you think would have happened if I just wrote:

tmFlag <- ...

?

YSI Answer

The variable (object) 'tmFlag' would have been created the same way, but it would be floating on its own, instead of being a column inside my dataframe

Basic Visualizations

Another approach for detecting bad data is to look at the data.

In addition to the hist function, the plot and boxplot functions are good options for taking a quick peek at numeric data.

Basic Histogram

hist(data$Age)

Basic Scatter Plot

plot(data$Age)

Basic Boxplot

boxplot(data$Age)

Basic Plot with Two Variables

plot(data$Age, data$Grays)

Plot a Factor

You can plot a factor variable using the table function to get counts, and then plotting that table:

barplot(table(data$GenderID))

Detecting Bad Data with Internal Consistency Measures

You may also be able to identify careless/suspicious responses (beyond completion time) by looking into internal consistency.

One approach: within-person even-odd (split-half) correlation within each survey or subscale. Specifically:

  • review internal consistency of all scales (for full sample)
  • identify 1 or 2 surveys that have high internal consistency and have >20 items

Detecting Bad Data with Internal Consistency Measures, cont

For each survey that meets the criteria, create a subscale score for all even items and a subscale score for all odd items. Finally, calculate the within-person correlation between these two subscale scores (a rough sketch in code follows the list below).

  • Very low correlations may suggest erratic response patterns
  • Very high values may suggest monotonous response patterns
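
A minimal sketch of one way to operationalize the even-odd idea for a single long scale, assuming (hypothetically) that the scale's items are numeric and sit in columns 5 through 28 of 'data', and pairing items by their position:

# Hypothetical column positions for one high-consistency scale
items <- as.matrix(data[, 5:28])
odd_items  <- items[, seq(1, ncol(items), by = 2)]
even_items <- items[, seq(2, ncol(items), by = 2)]

# For each respondent, correlate their odd-position and even-position responses
data$evenOddR <- sapply(seq_len(nrow(items)), function(i) {
  cor(odd_items[i, ], even_items[i, ], use = "pairwise.complete.obs")
})

Very low (or negative) values of evenOddR would then flag respondents for a closer look.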

Data Cleaning Transparency

So many decisions!! Remember:

  • Be conservative when it comes to removing values and records
  • Never alter data (e.g. assume that a "400" should have been a "40" and manually change it).
  • Keep a careful record of the decisions you make and why
  • Be transparent about your decision-making in your presentations and publications

Section 4: Evaluating Missingness

Get to know your data

Data can be missing for many reasons, and those reasons should affect how you respond to the missingness. Here is a non-exhaustive list of missing data causes:

  • Participant chose not to respond because they are ashamed/embarrassed/afraid
  • Participant skipped the question by accident
  • Participant was skipped for this item because of their response to a previous question
  • Participant stopped responding at the end of a longitudinal study
  • Data collector forgot to ask the question

Missing Data Mechanisms

There are more technical terms we can use as we talk about missing data mechanisms, and those are:

  • Missing Not at Random (MNAR),
  • Missing Completely at Random (MCAR), and
  • Missing at Random (MAR)

MNAR

Data that are MNAR are systematically missing, meaning that the probability of a value being missing is related to the value itself (i.e. the value that would have been reported had it not been missing). This kind of missingness is often called non-ignorable.

What might cause MNAR data?

MCAR

If data are MCAR, the missing values are NOT related to the values that are missing, nor are they related to any other variables in the dataset. In other words, the missingness is essentially random.

What might cause MCAR data?

MAR

If data are MAR, the missing values are NOT related to the values that are missing, but ARE related to at least one other variable in the dataset.

What might cause MAR data?

How to Respond to Missingness

There are many ways that researchers handle missing values

  • List-wise or case-wise deletion
  • Mean substitution
  • Regression substitution
  • Imputation & Multiple Imputation
  • Maximum Likelihood Estimation

How to Respond to Missingness, cont

I am not going to go into detail about each of these approaches. However, there are a few important things to understand:

  • Listwise deletion and its relatives can lead to biased inferences (especially if data are MNAR) and decrease power
  • Mean substitution and its relatives reduce the variance in a variable, which may affect inferences
  • Regression substitution leads to perfectly predicted values, thereby artificially reducing standard errors in tests

How to Respond to Missingness, cont

  • In almost any case, the safest and most recommended approach is to use maximum likelihood estimation (MLE) methods when possible.
  • Multiple imputation is the preferred method if MLE approaches aren't available.

Exploring Missingness in R

One of the cooler packages I have seen for exploring missing data is the naniar package
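
For example, after installing and loading it, a couple of its functions give a quick overview of missingness (a brief sketch; the package has many more tools):

install.packages("naniar")
library(naniar)

miss_var_summary(data)   # number and percent of missing values for each variable
gg_miss_var(data)        # plot of missingness by variable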