Process Description


Below is the Analytics Learning Lab’s approach to cleaning data. This process was developed to provide a systematic, reproducible strategy for validating and processing data before it is used in later analysis.




Process Steps


Checking your dataset

  1. Check for sensitive data
  2. Check for missing columns
  3. Check variable names
  4. Check missing observations
  5. Check variable classification
  6. Check misspellings/extra spaces
  7. Check numeric data distribution
  8. Check duplicate rows
  9. Check statistical assumptions




Step One: Check for sensitive data

If there are columns you feel you should not be seeing (e.g. names, unrelated diagnoses, protected information), check with your manager and then remove the offending columns. The code chunk below illustrates how to remove a column from a dataframe.

The data comes from Kaggle (https://www.kaggle.com/mylesoneill/overwatch-game-records/data). It can also be found on Andrew’s OneDrive under training materials.


# Loading Packages
library(data.table)
library(readr)

# Loading Overwatch data from Kaggle
overwatch_diary <- read_csv("C:/Users/AK055125/Documents/overwatch-diary.csv")

# Removing a column
# Let's say we want to create a new dataset without the X50 or X49 columns
drops <- c("X50","X49")
newdataset <- overwatch_diary[ , !(names(overwatch_diary) %in% drops)]




Step Two: Check for Missing Columns

Ask yourself which variables you will need to conduct your analysis. If all of those variables are present in your dataset, you are good to go. Otherwise, you may need to contact third-party groups to get the desired information. No code is needed to check your variables; however, how to list the variables in your dataset and how to merge datasets is covered below.


# Prints a list of variables in your dataset (overwatch_diary)
names(overwatch_diary)
##  [1] "date"                    "time"                   
##  [3] "season"                  "game_id"                
##  [5] "sr_start"                "sr_finish"              
##  [7] "sr_delta"                "streak_number"          
##  [9] "result"                  "my_team_sr"             
## [11] "enemy_team_sr"           "team_sr_delta"          
## [13] "my_team_1"               "my_team_2"              
## [15] "my_team_3"               "my_team_4"              
## [17] "my_team_5"               "my_team_6"              
## [19] "enemy_team_1"            "enemy_team_2"           
## [21] "enemy_team_3"            "enemy_team_4"           
## [23] "enemy_team_5"            "enemy_team_6"           
## [25] "map"                     "round"                  
## [27] "capscore"                "score_distance"         
## [29] "time_left"               "team_role"              
## [31] "match_length"            "charcter_1"             
## [33] "character_2"             "character_3"            
## [35] "communication"           "note"                   
## [37] "psychological_condition" "eliminations"           
## [39] "objective_kills"         "damage"                 
## [41] "healing"                 "deaths"                 
## [43] "weapon_accuracy"         "offensive_assists"      
## [45] "defensive_assists"       "scoped"                 
## [47] "X47"                     "X48"                    
## [49] "X49"                     "X50"
# Merging the dataset we created above with the original (you wouldn't want to do this in practice)
# Need a unique identifier for each row
# Attempt one: concatenating game and round number (paste() adds a separator)
overwatch_diary$rowid <- paste(overwatch_diary$game_id, "-", overwatch_diary$round)
length(unique(overwatch_diary$rowid))  #number of unique rowids
## [1] 1417
length(overwatch_diary$rowid)    #number of observations in rowid
## [1] 3299
# Oh no, our row IDs are not unique
# Attempt two: if each observation corresponds 1:1 to the other dataset, the row number can serve as the ID
overwatch_diary$rowid2 <- seq.int(nrow(overwatch_diary))
length(unique(overwatch_diary$rowid2))  #number of unique rowids
## [1] 3299
length(overwatch_diary$rowid2)    #number of observations in rowid
## [1] 3299
# Since this works, we'll apply the same procedure to the other dataset
newdataset$rowid2 <- seq.int(nrow(newdataset))

# Now merge the datasets (we have duplicate columns because they are essentially the same dataset)
mergeddata <- merge(overwatch_diary, newdataset, by="rowid2")




Step Three: Check Variable Names

This is the step where you change your variable names. Some properties of a bad name include:

  • Name is too long
  • Name does not sufficiently describe the data
  • Name is inconsistent with other datasets you are using
  • Name contains spaces

“In principle, variable names should only include the underscore, letters a-z and A-Z and numeric characters 0-9.” -Conclusion of Statalist Forum

It is better to do this step early so the way you reference a column is consistent throughout your code.


# Loading packages
library(dplyr)

# How to rename a column
# Let's say I want to rename "time" to "timeofday" and "time_left" to "time_remaining"
overwatch_diary <- rename(overwatch_diary,
                          timeofday = time,    
                          time_remaining = time_left)
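
If many variable names need fixing at once (e.g. stray spaces or mixed case), a bulk cleanup in base R is often faster than renaming columns one by one. This is a minimal sketch; adjust the patterns to your data.


# Bulk cleanup sketch: trim stray whitespace, lowercase everything,
# and replace spaces with underscores in all column names
names(overwatch_diary) <- trimws(names(overwatch_diary))
names(overwatch_diary) <- gsub(" ", "_", tolower(names(overwatch_diary)))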




Step Four: Check Missing/Null Observations

Time to find out how complete your dataset is. For each column, the Analytics Learning Lab rule of thumb on missing observations is:

  • Less than 5% of a column’s data is missing: document the removal; little justification needed
  • 5-15% of a column’s data is missing: document the removal; moderate justification needed
  • Greater than 15% of a column’s data is missing: the column’s validity is in question; investigate the cause of missingness or drop the column

When more than 15% of a column’s data is missing, it is important to work with the data owner to determine why such a high volume of data is missing. If there is any indication that the data is not missing completely at random, investigate whether removing the missing observations skews the data and whether imputation would be more appropriate.
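
If imputation looks more appropriate than removal, filling missing values with the column median is a simple starting point. This is a minimal sketch: the impute_median helper is our own illustration (not a base R function), and more principled approaches such as multiple imputation exist.


# A minimal sketch of median imputation (impute_median is our own helper)
impute_median <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)  # replace NAs with the column median
  x
}

# Example: fill in missing match lengths instead of removing those rows
overwatch_diary$match_length_imp <- impute_median(overwatch_diary$match_length)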


# How to check missingness
# This shows high levels of missingness for many variables such as deaths, eliminations, and scoped
colMeans(is.na(overwatch_diary))
##                    date               timeofday                  season 
##             0.000000000             0.544104274             0.000000000 
##                 game_id                sr_start               sr_finish 
##             0.221885420             0.503485905             0.503485905 
##                sr_delta           streak_number                  result 
##             0.493179751             0.316762655             0.000000000 
##              my_team_sr           enemy_team_sr           team_sr_delta 
##             0.013337375             0.013337375             0.014549864 
##               my_team_1               my_team_2               my_team_3 
##             0.001212489             0.003031222             0.003637466 
##               my_team_4               my_team_5               my_team_6 
##             0.003637466             0.004849955             0.004243710 
##            enemy_team_1            enemy_team_2            enemy_team_3 
##             0.008790543             0.007881176             0.007578054 
##            enemy_team_4            enemy_team_5            enemy_team_6 
##             0.007578054             0.007578054             0.007578054 
##                     map                   round                capscore 
##             0.000000000             0.001818733             0.006971810 
##          score_distance          time_remaining               team_role 
##             0.340103062             0.157017278             0.001515611 
##            match_length              charcter_1             character_2 
##             0.032434071             0.018187329             0.843892089 
##             character_3           communication                    note 
##             0.980903304             0.605941194             0.849045165 
## psychological_condition            eliminations         objective_kills 
##             0.613519248             0.917550773             0.917853895 
##                  damage                 healing                  deaths 
##             0.917853895             0.918460139             0.918460139 
##         weapon_accuracy       offensive_assists       defensive_assists 
##             0.919672628             0.954834798             0.966959685 
##                  scoped                     X47                     X48 
##             0.952712943             0.942103668             0.987571992 
##                     X49                     X50                   rowid 
##             0.999696878             0.999696878             0.000000000 
##                  rowid2 
##             0.000000000
# How to drop missing rows
# Let's overwrite the "newdataset" we made earlier to keep only rows where match_length is present
newdataset <- overwatch_diary[complete.cases(overwatch_diary[ , "match_length"]),]
summary(is.na(newdataset$match_length)) # confirms no missing match_length values remain
##    Mode   FALSE 
## logical    3192
# How to drop missing column
# We also see that X47 and X48 are each missing over 90% of their data, so we drop them from our new dataset too
drops <- c("X47","X48")
newdataset <- newdataset[ , !(names(newdataset) %in% drops)]




Step Five: Check Variable Classification

This step is used to make sure variables have the correct classification. The most common data classes in R are numeric (numbers), character (text), and logical (TRUE/FALSE). The code below shows how to identify a column’s data class and change it if that class is undesirable. NOTE: If a numeric variable is stored as a factor, convert it to character first and then to numeric; calling as.numeric() directly on a factor returns the underlying level codes, not the original values.

Read more about data structures here: http://www.diegobarneche.com/2014-12-11-ufsc/lessons/01-intro_r/data-structures.html. You can check the attributes of an entire data frame by calling str() on it.

This is also the step where you convert any dates/times into a usable format, for example by pulling month or year information into a separate column or by measuring the length of time between two dates (see the sketch after the chunk below).


# Checking the data type classification
# We can see the season variable is classified as an integer 
typeof(overwatch_diary$season)
## [1] "integer"
# Changing numeric to character class
# Let's change the season variable to a character class
overwatch_diary$season <- as.character(overwatch_diary$season)
typeof(overwatch_diary$season)
## [1] "character"
# Changing character to factor structure
# Sometimes we want a character variable stored as a factor (a categorical variable)
overwatch_diary$season <- as.factor(overwatch_diary$season)
str(overwatch_diary$season)
##  Factor w/ 5 levels "3","4","5","6",..: 1 1 1 1 1 1 1 1 1 1 ...
# Changing factor to numeric
# Now let's change it back to an integer (as.integer only suits whole numbers; use as.numeric if unsure what values it takes)
overwatch_diary$season <- as.integer(as.character(overwatch_diary$season))
typeof(overwatch_diary$season)
## [1] "integer"
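
Building on the dates/times note above, here is a minimal sketch of date handling in base R. The format string "%m/%d/%Y" is an assumption; inspect your date column and adjust it to match before using this.


# Parsing a date column (format string is an assumption; adjust to your data)
overwatch_diary$date_parsed <- as.Date(overwatch_diary$date, format = "%m/%d/%Y")

# Pulling month and year information into separate columns
overwatch_diary$month <- format(overwatch_diary$date_parsed, "%m")
overwatch_diary$year <- format(overwatch_diary$date_parsed, "%Y")

# Measuring the length of time between two dates (in days)
as.numeric(max(overwatch_diary$date_parsed, na.rm = TRUE) -
           min(overwatch_diary$date_parsed, na.rm = TRUE))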




Step Six: Check Misspellings/Extra Spaces

You need to take out misspellings and extra spaces. In addition, you should standardize names. For example, KU might be written “The University of Kansas”, “University of Kansas”, “Kansas University”, “KU”, “K.U.”, or “K U”; each of those spellings would be counted as a different school. It doesn’t really matter which spelling we use, we just want it to be consistent throughout the dataset.

There is no good way to handle alternate spellings without a dictionary mapping all the possible variants. Thus, the code below focuses on removing extra spaces and on some spell-check functions; a sketch of the dictionary approach follows the chunk.


# Loading packages
library(qdapRegex)
library(qdap)

# Removing extra spaces
# Let's say we have a column with strings like this:
string <- "  Hi friend  ,   what's going   on ? "
string
## [1] "  Hi friend  ,   what's going   on ? "
# We can take extra spaces out using:
string <- rm_white(string)
string
## [1] "Hi friend, what's going on?"
# Spell check function
x <- "Robots are evl creatres and deserv exterimanitation."
which_misspelled(x, suggest=FALSE)   # Will give a list of misspelled words
##                  3                  4                  6 
##              "evl"         "creatres"           "deserv" 
##                  7 
## "exterimanitation"
which_misspelled(x, suggest=TRUE)   # Will give a list of misspelled words with suggestions
## 
##   word.no        not.found     suggestion                                                                                                     more.suggestions
## 1       3              evl           evil                                                                ev, evils, elva, e, eel, ell, erl, esl, eva, eve, evy
## 2       4         creatres      creatures                                                      creates, create, creators, creature, cremates, cerates, creaser
## 3       6           deserv        deserve    deserved, deserver, deserves, denser, desert, deer, debaser, deposer, deserts, dessert, deviser, diester, dresser
## 4       7 exterimanitation exterminations experimentation, experimentations, emancipatation, emancipatations, extracontinental, extemporaneous, extravasations

check_spelling_interactive(x) # Allows you to interactively change words in console
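
For the dictionary approach mentioned above, a named lookup vector is a minimal sketch. The school column and its variant spellings here are hypothetical and not part of the Overwatch data.


# A minimal dictionary-mapping sketch (hypothetical "school" column)
school <- c("KU", "K.U.", "Kansas University", "University of Kansas")

# Named lookup vector: names are the variants, values are the standard form
standard <- c("KU" = "University of Kansas",
              "K.U." = "University of Kansas",
              "Kansas University" = "University of Kansas",
              "University of Kansas" = "University of Kansas")

school_clean <- unname(standard[school])  # every variant maps to one spelling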




Step Seven: Check Numeric Data Distribution

In many types of statistical analyses, like regressions and difference-in-means tests, assumptions are made about the distribution of numeric variables. Variables with a right-tail skew should be log transformed before using these methods (natural log and log10 both work; the code below uses a signed log10). It is usually best to keep the original variable intact, e.g. if you have an income variable, keep it and add a log-income variable to your dataset. This section shows how to check a quantitative variable’s distribution and transform columns as needed.

In this step you will also look at the distribution of numeric variables to check for implausible outliers, such as a negative age or an age of 200. If you find even one implausible outlier, follow up with the data owner to determine the cause; a single weird value can indicate miscalculations in the other values.
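
A quick way to scan for implausible values is to look at the range of each numeric column. This is a minimal sketch; columns that are entirely missing will return Inf with a warning.


# Min/max of every numeric column (a quick scan for implausible values)
num_cols <- sapply(overwatch_diary, is.numeric)
sapply(overwatch_diary[ , num_cols], range, na.rm = TRUE)

# summary() adds quartiles, which helps spot isolated extremes
summary(overwatch_diary$match_length)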

Sometimes the data you receive will be a subset of a larger dataset, and some variables may carry no information in the subset. For example, if you were looking at US Cerner Associates, a “country” column would be “USA” for every observation. Since there is no variance in this column, you can remove it; it will play no part in your analysis, and removing it saves memory (but always remember to document!). A sketch for dropping zero-variance columns follows the chunk below.


# Loading packages
library(Hmisc)

# Checking the distribution of a column
# No apparent problem in the distribution of season, noting a downward trend in records as time increases though
hist(overwatch_diary$season)

# You should make a log version of a variable if your histogram/density plot looks something like this:
value = rlnorm(500,1,.6)
hist(value)

# This function log transforms data while keeping the sign (values with |x| <= 1 map to 0)
signedlog10 <- function(x) {
  ifelse(abs(x) <= 1, 0, sign(x) * log10(abs(x)))
}

# Let's log transform our data
logvalue <- signedlog10(value)

# This is closer to normal distribution
hist(logvalue)

# Checking variance of a variable
var(overwatch_diary$season)
## [1] 1.9719
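
As noted above, here is a minimal sketch for finding and dropping zero-variance columns; the logic is our own, not a packaged function. Remember to document any removal.


# Identify numeric columns whose variance is 0 (every value identical)
num_cols <- names(overwatch_diary)[sapply(overwatch_diary, is.numeric)]
zero_var <- num_cols[sapply(overwatch_diary[num_cols],
                            function(x) isTRUE(var(x, na.rm = TRUE) == 0))]
zero_var  # columns to consider dropping (document before removing!)
overwatch_diary <- overwatch_diary[ , !(names(overwatch_diary) %in% zero_var)]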




Step Eight: Check Duplicate Rows

Often, data you receive will be hand entered or generated in ways that can duplicate rows. It is important to check whether any rows are duplicated and remove them. Some columns should also be unique: if you have an associate ID and each row is supposed to correspond to one associate, check that every cell in that column is unique (a uniqueness check is sketched after the chunk below).

One way to check for duplication is to test whether any row is identical to another. If a large proportion of your data appears to be duplicated, ask the data owner how the data was generated to learn whether the rows are true duplicates or distinct records that happen to contain the same values.


# Loading packages
library(dplyr)

# Checking if one row is identical to another
distinctdata <- distinct(overwatch_diary)
nrow(overwatch_diary)
## [1] 3299
nrow(distinctdata)
## [1] 3299
# The dataset of distinct values has the same number of rows as the original, meaning there are no duplicates!
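
For the column-uniqueness check mentioned above, duplicated() also works on a single column; rowid2 stands in for an ID column such as associate ID here.


# Checking whether a single ID column is unique
sum(duplicated(overwatch_diary$rowid2))  # 0 means every value is unique
anyDuplicated(overwatch_diary$rowid2)    # 0 if unique, else index of first duplicate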




Step Nine: Additional Tests for Statistical Models


Gauss-Markov Assumptions (for Regressions)

  • Linearity in parameters
  • Conditional expected value of the error term is 0
  • No heteroskedasticity (a quick visual check is sketched below)
  • No perfect collinearity
  • Data is randomly drawn from the population
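
As a quick visual check for the heteroskedasticity assumption, plot residuals against fitted values. This is a minimal sketch: the model below (predicting sr_delta from the two team SR columns) is an arbitrary illustration, not a recommended analysis.


# Fit a simple illustrative regression and eyeball the residuals
fit <- lm(sr_delta ~ my_team_sr + enemy_team_sr, data = overwatch_diary)

# A funnel shape in this plot suggests heteroskedasticity
plot(fitted(fit), resid(fit), xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)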