The first thing you want to do is download R and RStudio.

R for PC: https://cran.r-project.org/bin/windows/base/
R for Mac: https://cran.r-project.org/bin/macosx/
RStudio (choose the PC or Mac version): https://www.rstudio.com/products/rstudio/download/#download

Here is the location of the data set you will want to use: https://drive.google.com/open?id=1F4QVSPgvs20nnX6LU5FuIH4wC6Xnn4mR

Here we are going to load the data into R. I placed my data on the desktop, so I can use the code below. The easiest way to do this is to set the working directory to the desktop, either with setwd as shown below or through RStudio's Session > Set Working Directory menu and then selecting the location.

You can read in a data set with the read.csv function. You need to give it the name of the file, including the .csv extension (this function only reads csv files). header = TRUE means that the first row of the file contains the variable names.

Then if you want to write the data set back out to a csv, you can use the write.csv function. You will also likely want to set row.names to FALSE, because if it is TRUE then R adds a new column containing a number for each row (i.e., row one gets a 1, row two gets a 2).

setwd("~/Desktop")
dat = read.csv("dat.csv", header = TRUE)
write.csv(dat, "dat.csv", row.names = FALSE)
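
If you would rather not change the working directory, you can give read.csv the full path to the file instead. A minimal sketch, assuming the file sits on the desktop as above:

dat = read.csv("~/Desktop/dat.csv", header = TRUE)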

Here I am creating some missing data. I am using the $ operator to select the variable I want to access. Then, I am using [] to indicate the rows within that variable that I want to access. For the Ethnicity variable, I am accessing rows 100 through 200. Then I am using the = sign to tell R that the selected rows of the selected variable should equal NA.

dat$Ethnicity[100:200] = NA
dat$postScore[120:220] = NA
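
If you want to confirm how many missing values this created, is.na returns TRUE for each missing cell and sum counts the TRUEs; a quick check:

sum(is.na(dat$Ethnicity))
sum(is.na(dat$postScore))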

A lot of packages in R freak out if there is missing data, so it is often easiest just to get rid of the missing values. To drop every row that has a missing value on any variable, we can use the na.omit function in R.

We can use the dim function to show how many rows (data points) and columns (variables) the data set has. Therefore, by running dim before and after na.omit, we can see how many rows were dropped because of missing data.

dim(dat)
## [1] 10000     6
dat = na.omit(dat)
dim(dat)
## [1] 9879    6
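
As a side note, complete.cases is a base R alternative that should do the same thing as na.omit here; a sketch of the equivalent line:

# Keep only the rows with no missing values on any variable
dat = dat[complete.cases(dat), ]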

Creating more data to merge (ignore this)

set.seed(12345)
income = c("high", "low")
dat1 = data.frame(id = 1:10000, income = sample(income, 10000, replace= TRUE))

Sometimes data are spread across multiple data sets rather than all together, and we want to merge them by some unique identifier, usually an id. In this case we have a second data set called dat1, which contains the unique id variable and a new variable called income.

We can use the merge function in R to merge data sets. There are three common kinds of merges: left (all.x = TRUE), right (all.y = TRUE), and full (all = TRUE). A left merge keeps every id in the left data set (the first one you list) and matches in whatever it can find from the right data set; any id that appears only in the right data set is dropped. A right merge is the mirror image: it keeps every id in the right data set and drops any id that appears only in the left. A full merge (all = TRUE) keeps ids from both data sets even when they have no match, and the unmatched cells are filled with NA. That is why in the all = TRUE example below you see missing data for Ethnicity and the other dat variables: those ids were removed from dat by na.omit, but they are still present in dat1.

datLeft = merge(dat, dat1, all.x = TRUE)
head(datLeft)
##   id         Gender             Ethnicity    SES preScore postScore income
## 1  1 Other_Identity              Hispanic Middle       56        62    low
## 2  2         Female      African_American    Low       57        65    low
## 3  3 Other_Identity                 White    Low       49        47    low
## 4  4 Other_Identity              Hispanic    Low       45        72    low
## 5  5 Other_Identity Other_Ethnic_Identity    Low       56        50   high
## 6  6           Male Other_Ethnic_Identity   High       32        57   high
datRight = merge(dat, dat1, all.y = TRUE)
head(datRight)
##   id         Gender             Ethnicity    SES preScore postScore income
## 1  1 Other_Identity              Hispanic Middle       56        62    low
## 2  2         Female      African_American    Low       57        65    low
## 3  3 Other_Identity                 White    Low       49        47    low
## 4  4 Other_Identity              Hispanic    Low       45        72    low
## 5  5 Other_Identity Other_Ethnic_Identity    Low       56        50   high
## 6  6           Male Other_Ethnic_Identity   High       32        57   high
datAll = merge(dat, dat1, all = TRUE)
datAll[200:250,]
##      id         Gender             Ethnicity      SES preScore postScore
## 200 200           <NA>                  <NA>     <NA>       NA        NA
## 201 201           <NA>                  <NA>     <NA>       NA        NA
## 202 202           <NA>                  <NA>     <NA>       NA        NA
## 203 203           <NA>                  <NA>     <NA>       NA        NA
## 204 204           <NA>                  <NA>     <NA>       NA        NA
## 205 205           <NA>                  <NA>     <NA>       NA        NA
## 206 206           <NA>                  <NA>     <NA>       NA        NA
## 207 207           <NA>                  <NA>     <NA>       NA        NA
## 208 208           <NA>                  <NA>     <NA>       NA        NA
## 209 209           <NA>                  <NA>     <NA>       NA        NA
## 210 210           <NA>                  <NA>     <NA>       NA        NA
## 211 211           <NA>                  <NA>     <NA>       NA        NA
## 212 212           <NA>                  <NA>     <NA>       NA        NA
## 213 213           <NA>                  <NA>     <NA>       NA        NA
## 214 214           <NA>                  <NA>     <NA>       NA        NA
## 215 215           <NA>                  <NA>     <NA>       NA        NA
## 216 216           <NA>                  <NA>     <NA>       NA        NA
## 217 217           <NA>                  <NA>     <NA>       NA        NA
## 218 218           <NA>                  <NA>     <NA>       NA        NA
## 219 219           <NA>                  <NA>     <NA>       NA        NA
## 220 220           <NA>                  <NA>     <NA>       NA        NA
## 221 221 Other_Identity                 White   Middle       59        66
## 222 222           Male                 White Very_Low       27        68
## 223 223           Male              Hispanic Very_Low       48        69
## 224 224         Female Other_Ethnic_Identity     High       47        57
## 225 225           Male                 Asian Very_Low       68        55
## 226 226           Male                 Asian      Low       57        72
## 227 227 Other_Identity Other_Ethnic_Identity      Low       48        85
## 228 228           Male              Hispanic Very_Low       50        70
## 229 229 Other_Identity                 White Very_Low       65        55
## 230 230 Other_Identity                 White   Middle       51        62
## 231 231         Female              Hispanic      Low       51        73
## 232 232           Male                 White      Low       42        75
## 233 233         Female                 Asian     High       52        67
## 234 234           Male                 Asian   Middle       53        53
## 235 235           Male                 Asian Very_Low       59        67
## 236 236 Other_Identity                 White      Low       45        77
## 237 237 Other_Identity      African_American     High       58        62
## 238 238         Female      African_American     High       44        66
## 239 239 Other_Identity      African_American      Low       56        55
## 240 240           Male Other_Ethnic_Identity      Low       52        56
## 241 241           Male Other_Ethnic_Identity     High       43        42
## 242 242         Female              Hispanic      Low       44        41
## 243 243           Male              Hispanic      Low       47        59
## 244 244           Male              Hispanic     High       39        51
## 245 245           Male                 White     High       49        71
## 246 246 Other_Identity                 White Very_Low       58        54
## 247 247           Male      African_American Very_Low       77        59
## 248 248           Male              Hispanic   Middle       49        38
## 249 249 Other_Identity              Hispanic     High       55        50
## 250 250         Female                 White Very_Low       58        84
##     income
## 200   high
## 201    low
## 202    low
## 203   high
## 204    low
## 205    low
## 206    low
## 207   high
## 208   high
## 209    low
## 210    low
## 211   high
## 212    low
## 213   high
## 214   high
## 215    low
## 216   high
## 217   high
## 218    low
## 219    low
## 220   high
## 221   high
## 222   high
## 223   high
## 224    low
## 225    low
## 226    low
## 227    low
## 228   high
## 229   high
## 230    low
## 231   high
## 232    low
## 233   high
## 234    low
## 235    low
## 236    low
## 237    low
## 238    low
## 239   high
## 240   high
## 241    low
## 242    low
## 243    low
## 244    low
## 245    low
## 246    low
## 247    low
## 248    low
## 249   high
## 250    low
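
One note on merge: by default it joins on every column name the two data sets share, which here is just id. If you want to be explicit about the key (or the key columns are named differently in the two data sets), you can pass the by argument; a sketch:

# Same full merge as above, but with the key named explicitly
datAll = merge(dat, dat1, by = "id", all = TRUE)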

Sometimes you want to get a summary of the data set and each of its variables. For simplicity, let us just use the original dat data set and get the descriptives. One way to get the descriptives is through the summary function. For continuous variables, summary provides the mean, median, quartiles, and range (min and max). For dichotomous, nominal, and ordinal variables, it provides the counts for each of the levels.

summary(dat)
##        id                   Gender                     Ethnicity   
##  Min.   :    1   Female        :2994   African_American     :1900  
##  1st Qu.: 2592   Male          :3932   Asian                :1924  
##  Median : 5061   Other_Identity:2953   Hispanic             :2045  
##  Mean   : 5060                         Other_Ethnic_Identity:1996  
##  3rd Qu.: 7530                         White                :2014  
##  Max.   :10000                                                     
##        SES          preScore       postScore    
##  High    :2497   Min.   :11.00   Min.   :23.00  
##  Low     :2462   1st Qu.:43.00   1st Qu.:53.00  
##  Middle  :2456   Median :50.00   Median :60.00  
##  Very_Low:2464   Mean   :49.99   Mean   :59.97  
##                  3rd Qu.:57.00   3rd Qu.:67.00  
##                  Max.   :84.00   Max.   :99.00
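
One thing summary does not report is the standard deviation. If you want standard deviations for the continuous variables, one option is to sapply the sd function over those columns; a minimal sketch using the score variables above:

sapply(dat[, c("preScore", "postScore")], sd)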

One common statistic that we want to evaluate is the percentage change over time. We can create a new variable in R that has the percentage change from pre to post.

We create this by using $ on the dat data set to create a new variable called PercentChange. Then we say that PercentChange equals the change formula, (postScore - preScore)/preScore, which returns a proportion (multiply by 100 if you want an actual percentage). Then we use the round function to round the new variable to two decimal places.

dat$PercentChange = round((dat$postScore-dat$preScore)/dat$preScore,2)
head(dat)
##   id         Gender             Ethnicity    SES preScore postScore
## 1  1 Other_Identity              Hispanic Middle       56        62
## 2  2         Female      African_American    Low       57        65
## 3  3 Other_Identity                 White    Low       49        47
## 4  4 Other_Identity              Hispanic    Low       45        72
## 5  5 Other_Identity Other_Ethnic_Identity    Low       56        50
## 6  6           Male Other_Ethnic_Identity   High       32        57
##   PercentChange
## 1          0.11
## 2          0.14
## 3         -0.04
## 4          0.60
## 5         -0.11
## 6          0.78
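
If you want the change expressed on a 0 to 100 percent scale rather than as a proportion, multiply by 100. A sketch (PercentChange100 is just an example name):

# Same formula scaled to percent; e.g., 0.11 becomes 11
dat$PercentChange100 = round(100 * (dat$postScore - dat$preScore) / dat$preScore, 2)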

Sometimes we want to evaluate whether or not there was a statistically significant change between two time points. To do that we would use a paired t-test to account for the correlation between the two time points.

The results demonstrate that the p-value is well below .05, providing evidence that the average post score is statistically significantly larger than the average pre score.

t.test(dat$postScore, dat$preScore, paired = TRUE)
## 
##  Paired t-test
## 
## data:  dat$postScore and dat$preScore
## t = 71.163, df = 9878, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   9.707418 10.257356
## sample estimates:
## mean of the differences 
##                9.982387
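
One way to see how the paired t-test accounts for the pairing: it is equivalent to a one-sample t-test on the within-person differences. The call below should reproduce the result above:

t.test(dat$postScore - dat$preScore)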

Although with a large sample like ours the normality assumption matters less (the sampling distribution of the mean difference will be approximately normal), data often are not normally distributed, and sometimes we want a test that does not make that assumption. In that case we can use a nonparametric test like the Wilcoxon signed rank test to evaluate whether the post scores are larger than the pre scores.

wilcox.test(dat$postScore, dat$preScore, paired = TRUE)
## 
##  Wilcoxon signed rank test with continuity correction
## 
## data:  dat$postScore and dat$preScore
## V = 39593000, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0

Sometimes the package that we need isn't already installed. For example, the prettyR package can provide counts and percentages for nominal, ordinal, and binary data. First, we need to install prettyR with install.packages, and then load it with the library function, which tells R that we want to use the package. You only need to install a package once, but you need to library it each time you start a new R session and want to use it.

#install.packages("prettyR")
library(prettyR)

Now we can use the describe.factor function to get the counts and percentages for non-continuous variables such as gender.

describe.factor(dat$Gender)
##           
## dat$Gender      Male     Female Other_Identity
##    Count   3932.0000 2994.00000     2953.00000
##    Percent   39.8016   30.30671       29.89169
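
If you would rather stick with base R, table and prop.table can produce similar counts and percentages; a sketch:

table(dat$Gender)                               # counts per level
round(100 * prop.table(table(dat$Gender)), 2)   # percentages per level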