The first thing you want to do is download R and RStudio:
R PC: https://cran.r-project.org/bin/windows/base/
R Mac: https://cran.r-project.org/bin/macosx/
RStudio (choose PC or Mac version): https://www.rstudio.com/products/rstudio/download/#download
Here is the location of the data set you will want to use: https://drive.google.com/open?id=1F4QVSPgvs20nnX6LU5FuIH4wC6Xnn4mR
Here we are going to load the data into R. I placed my data on the desktop, so I can use the code below. The easiest way to do this is to set the working directory to the desktop; in RStudio you can go to Session > Set Working Directory > Choose Directory and select the location, or call the setwd function directly as shown below.
You can read a data set by using the read.csv function. You need to give it the name of the file, including the .csv extension (this function only reads CSV files). header = TRUE means that the first row in the file contains the variable names.
Then if you want to write the data set out to a CSV, you can use the write.csv function. You will also likely want to set row.names to FALSE, because if it is TRUE then write.csv adds an extra column containing the row numbers (i.e., row one gets a 1, row two gets a 2).
# Point R at the folder that contains the data file
setwd("~/Desktop")
# Read the CSV; header = TRUE treats the first row as variable names
dat = read.csv("dat.csv", header = TRUE)
# Write it back out without adding an extra column of row numbers
write.csv(dat, "dat.csv", row.names = FALSE)
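As a side note, if your file is saved somewhere other than the desktop, you can skip setwd and hand read.csv the full path instead. The path below is just a placeholder; swap in wherever you actually saved the file.
# Hypothetical path; replace with the actual location of your file
dat = read.csv("C:/Users/yourname/Documents/dat.csv", header = TRUE)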
Dealing with missing data (creating some missing data for demonstration; you can ignore this code)
dat$Ethnicity[100:200] = NA
dat$postScore[120:220] = NA
A lot of packages in R freak out if there is missing data, so it is often easiest just to get rid of the missing values. To drop every data point (row) that has a missing value on any variable, we can use the na.omit function in R.
We can use the dim function to show how many data points (rows) and variables (columns) the data set has. Therefore, by calling dim before and after using the na.omit function, we can see how much data was dropped because of missing values.
dim(dat)
## [1] 10000 6
dat = na.omit(dat)
dim(dat)
## [1] 9879 6
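If you also want to know which variables the missing values came from, you can count the NAs in each column. Run this before na.omit, since afterwards every count will be zero.
# Count missing values per column (run before na.omit)
colSums(is.na(dat))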
Creating more data to merge (you can ignore this code)
set.seed(12345)
income = c("high", "low")
dat1 = data.frame(id = 1:10000, income = sample(income, 10000, replace = TRUE))
Sometimes data are spread across several data sets rather than stored all together, and we want to merge them by some unique identifier, usually an id. In this case we have a data set called dat1, which contains the unique id variable and a new variable called income.
We can use the merge function in R to merge data sets. There are three common kinds of merges: left (all.x, keep every row of the first data set you list), right (all.y, keep every row of the second data set you list), and all (keep rows from both). For a right merge, every id in the right data set is matched with the corresponding id in the left data set, and any id that is in the left data set but not in the right data set is dropped. For all, we include ids from both data sets even if they have no match, and the unmatched variables are filled in with NA. So in the all example below you will see missing values for Ethnicity and the other variables from dat, because those rows were removed from dat by na.omit earlier, but their ids are still present in dat1.
# Left merge: keep every id in dat (x); drop dat1 ids with no match
datLeft = merge(dat, dat1, all.x = TRUE)
# Right merge: keep every id in dat1 (y); drop dat ids with no match
datRight = merge(dat, dat1, all.y = TRUE)
# Full merge: keep every id from both; unmatched variables are filled with NA
datAll = merge(dat, dat1, all = TRUE)
datAll[200:250,]
## id Gender Ethnicity SES preScore postScore
## 200 200 <NA> <NA> <NA> NA NA
## 201 201 <NA> <NA> <NA> NA NA
## 202 202 <NA> <NA> <NA> NA NA
## 203 203 <NA> <NA> <NA> NA NA
## 204 204 <NA> <NA> <NA> NA NA
## 205 205 <NA> <NA> <NA> NA NA
## 206 206 <NA> <NA> <NA> NA NA
## 207 207 <NA> <NA> <NA> NA NA
## 208 208 <NA> <NA> <NA> NA NA
## 209 209 <NA> <NA> <NA> NA NA
## 210 210 <NA> <NA> <NA> NA NA
## 211 211 <NA> <NA> <NA> NA NA
## 212 212 <NA> <NA> <NA> NA NA
## 213 213 <NA> <NA> <NA> NA NA
## 214 214 <NA> <NA> <NA> NA NA
## 215 215 <NA> <NA> <NA> NA NA
## 216 216 <NA> <NA> <NA> NA NA
## 217 217 <NA> <NA> <NA> NA NA
## 218 218 <NA> <NA> <NA> NA NA
## 219 219 <NA> <NA> <NA> NA NA
## 220 220 <NA> <NA> <NA> NA NA
## 221 221 Other_Identity White Middle 59 66
## 222 222 Male White Very_Low 27 68
## 223 223 Male Hispanic Very_Low 48 69
## 224 224 Female Other_Ethnic_Identity High 47 57
## 225 225 Male Asian Very_Low 68 55
## 226 226 Male Asian Low 57 72
## 227 227 Other_Identity Other_Ethnic_Identity Low 48 85
## 228 228 Male Hispanic Very_Low 50 70
## 229 229 Other_Identity White Very_Low 65 55
## 230 230 Other_Identity White Middle 51 62
## 231 231 Female Hispanic Low 51 73
## 232 232 Male White Low 42 75
## 233 233 Female Asian High 52 67
## 234 234 Male Asian Middle 53 53
## 235 235 Male Asian Very_Low 59 67
## 236 236 Other_Identity White Low 45 77
## 237 237 Other_Identity African_American High 58 62
## 238 238 Female African_American High 44 66
## 239 239 Other_Identity African_American Low 56 55
## 240 240 Male Other_Ethnic_Identity Low 52 56
## 241 241 Male Other_Ethnic_Identity High 43 42
## 242 242 Female Hispanic Low 44 41
## 243 243 Male Hispanic Low 47 59
## 244 244 Male Hispanic High 39 51
## 245 245 Male White High 49 71
## 246 246 Other_Identity White Very_Low 58 54
## 247 247 Male African_American Very_Low 77 59
## 248 248 Male Hispanic Middle 49 38
## 249 249 Other_Identity Hispanic High 55 50
## 250 250 Female White Very_Low 58 84
## income
## 200 high
## 201 low
## 202 low
## 203 high
## 204 low
## 205 low
## 206 low
## 207 high
## 208 high
## 209 low
## 210 low
## 211 high
## 212 low
## 213 high
## 214 high
## 215 low
## 216 high
## 217 high
## 218 low
## 219 low
## 220 high
## 221 high
## 222 high
## 223 high
## 224 low
## 225 low
## 226 low
## 227 low
## 228 high
## 229 high
## 230 low
## 231 high
## 232 low
## 233 high
## 234 low
## 235 low
## 236 low
## 237 low
## 238 low
## 239 high
## 240 high
## 241 low
## 242 low
## 243 low
## 244 low
## 245 low
## 246 low
## 247 low
## 248 low
## 249 high
## 250 low
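To see how the three merge types differ in practice, a quick check is to compare how many rows each merged data set ends up with.
# Compare row counts across the three merge types
dim(datLeft)
dim(datRight)
dim(datAll)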
Sometimes you want to get a summary of the data set and each of the variables. For simplicity, let us just use the dat data set and get the descriptives. One way to get the descriptives is through the summary function. For continuous variables, summary provides the mean, median, quartiles, and range (min and max). For dichotomous, nominal, and ordinal variables, it provides the counts for each of the levels.
summary(dat)
## id Gender Ethnicity
## Min. : 1 Female :2994 African_American :1900
## 1st Qu.: 2592 Male :3932 Asian :1924
## Median : 5061 Other_Identity:2953 Hispanic :2045
## Mean : 5060 Other_Ethnic_Identity:1996
## 3rd Qu.: 7530 White :2014
## Max. :10000
## SES preScore postScore
## High :2497 Min. :11.00 Min. :23.00
## Low :2462 1st Qu.:43.00 1st Qu.:53.00
## Middle :2456 Median :50.00 Median :60.00
## Very_Low:2464 Mean :49.99 Mean :59.97
## 3rd Qu.:57.00 3rd Qu.:67.00
## Max. :84.00 Max. :99.00
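One thing summary does not report is the standard deviation. If you want that for the two score variables, a short sketch using the column names shown above:
# Standard deviations for the continuous score variables
sapply(dat[, c("preScore", "postScore")], sd)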
One common statistic that we want to evaluate is the percentage change over time. We can create a new variable in R that has the percentage change from pre to post.
We create it by using $ on the dat data set to make a new variable called PercentChange. Then we set PercentChange equal to the percentage change formula, (postScore - preScore)/preScore. Finally, we use the round function to round the percentage change variable to two decimal places.
dat$PercentChange = round((dat$postScore-dat$preScore)/dat$preScore,2)
head(dat)
## id Gender Ethnicity SES preScore postScore
## 1 1 Other_Identity Hispanic Middle 56 62
## 2 2 Female African_American Low 57 65
## 3 3 Other_Identity White Low 49 47
## 4 4 Other_Identity Hispanic Low 45 72
## 5 5 Other_Identity Other_Ethnic_Identity Low 56 50
## 6 6 Male Other_Ethnic_Identity High 32 57
## PercentChange
## 1 0.11
## 2 0.14
## 3 -0.04
## 4 0.60
## 5 -0.11
## 6 0.78
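Since PercentChange is now just another numeric variable in dat, you can pass it to summary to get a quick feel for its distribution:
# Inspect the distribution of the new percentage change variable
summary(dat$PercentChange)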
Sometimes we want to evaluate whether or not there was a statistically significant change between two time points. To do that we would use a paired t-test to account for the correlation between the two time points.
The results show that the p-value is well below .05, providing evidence that the average post score is statistically significantly larger than the average pre score.
t.test(dat$postScore, dat$preScore, paired = TRUE)
##
## Paired t-test
##
## data: dat$postScore and dat$preScore
## t = 71.163, df = 9878, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 9.707418 10.257356
## sample estimates:
## mean of the differences
## 9.982387
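The p-value tells you the difference is unlikely to be zero, but not how large it is. One common way to gauge the size of a paired difference is Cohen's d for paired data (the mean of the differences divided by their standard deviation); this is an extra step beyond the output above:
# Cohen's d for paired data: mean difference divided by SD of the differences
diffs = dat$postScore - dat$preScore
mean(diffs) / sd(diffs)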
Although we have a large sample here, so normality is less of a concern, oftentimes data are not normally distributed and we need to use a test that does not make the assumption that data are normally distributed. In that case we can use a nonparametric test like the Wilcoxon signed-rank test to evaluate whether the post scores differ from the pre scores.
wilcox.test(dat$postScore, dat$preScore, paired = TRUE)
##
## Wilcoxon signed rank test with continuity correction
##
## data: dat$postScore and dat$preScore
## V = 39593000, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
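Note that, like the t-test above, this ran as a two-sided test. If you specifically want to test whether the post scores are larger than the pre scores (not just different), you can set the alternative argument:
# One-sided version: test whether post scores tend to be larger
wilcox.test(dat$postScore, dat$preScore, paired = TRUE, alternative = "greater")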