The first step is to load the data. I will be using data from a math placement exam at a liberal arts college.
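# stringsAsFactors=TRUE keeps text columns as factors (R >= 4.0 defaults to FALSE)
mathP <- read.csv(url("https://raw.githubusercontent.com/lysanthus/CUNYDSBridge/master/MathPlacement.csv"),header=TRUE,sep=',',quote="\"",stringsAsFactors=TRUE)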
The data loads into a data frame. We can look at the type and first few values of each variable in the frame with str():
str(mathP)
## 'data.frame': 2696 obs. of 17 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Student : int 625 628 629 630 634 636 638 641 643 645 ...
## $ Gender : int 0 0 1 0 1 0 0 0 0 0 ...
## $ PSATM : int 56 57 NA 53 NA 63 42 52 51 60 ...
## $ SATM : int 56 NA 62 NA 64 68 NA NA 58 NA ...
## $ ACTM : int 25 23 27 27 31 NA 23 24 NA 26 ...
## $ Rank : int 1 1 42 6 72 96 38 72 51 215 ...
## $ Size : int 420 85 421 75 462 518 382 480 703 524 ...
## $ GPAadj : int 40 40 38 38 35 34 37 37 34 32 ...
## $ PlcmtScore : int 23 21 20 20 19 18 18 17 17 16 ...
## $ Recommends : Factor w/ 9 levels "R0","R01","R1",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Course : int 210 117 117 117 114 117 117 117 117 117 ...
## $ Grade : Factor w/ 17 levels "","A","A-","A+",..: 2 2 3 5 2 7 3 7 3 5 ...
## $ RecTaken : int 1 1 1 1 1 1 1 1 1 1 ...
## $ TooHigh : int 0 0 0 0 0 0 0 0 0 0 ...
## $ TooLow : int 0 0 0 0 0 0 0 0 0 0 ...
## $ CourseSuccess: int 1 1 1 1 1 1 1 1 1 1 ...
…and then summarize the data by:
summary(mathP)
## X Student Gender PSATM
## Min. : 1.0 Min. : 2.0 Min. :0.0000 Min. : 0.00
## 1st Qu.: 674.8 1st Qu.: 925.8 1st Qu.:0.0000 1st Qu.:54.00
## Median :1348.5 Median :1953.0 Median :0.0000 Median :59.00
## Mean :1348.5 Mean :1942.7 Mean :0.4586 Mean :58.14
## 3rd Qu.:2022.2 3rd Qu.:2968.2 3rd Qu.:1.0000 3rd Qu.:65.00
## Max. :2696.0 Max. :4067.0 Max. :1.0000 Max. :80.00
## NA's :2116 NA's :1560
## SATM ACTM Rank Size
## Min. :35.0 Min. :13.00 Min. : 0.00 Min. : 0.0
## 1st Qu.:58.0 1st Qu.:25.00 1st Qu.: 7.00 1st Qu.:177.0
## Median :63.0 Median :27.00 Median : 28.00 Median :322.0
## Mean :62.6 Mean :26.98 Mean : 51.01 Mean :323.5
## 3rd Qu.:68.0 3rd Qu.:30.00 3rd Qu.: 73.00 3rd Qu.:455.0
## Max. :80.0 Max. :36.00 Max. :530.00 Max. :888.0
## NA's :1460 NA's :322 NA's :196 NA's :179
## GPAadj PlcmtScore Recommends Course
## Min. : 0.00 Min. :-18.00 R1 :1132 Min. :109.0
## 1st Qu.:33.00 1st Qu.: 26.00 R2 : 487 1st Qu.:120.0
## Median :37.00 Median : 33.00 R4 : 308 Median :120.0
## Mean :35.73 Mean : 32.44 R01 : 240 Mean :123.4
## 3rd Qu.:39.00 3rd Qu.: 39.00 R8 : 215 3rd Qu.:122.0
## Max. :40.00 Max. : 59.00 R0 : 177 Max. :398.0
## NA's :20 NA's :35 (Other): 137
## Grade RecTaken TooHigh TooLow
## :562 Min. :0.0000 Min. :0.000 Min. :0.00000
## A :439 1st Qu.:0.0000 1st Qu.:0.000 1st Qu.:0.00000
## B :380 Median :1.0000 Median :1.000 Median :0.00000
## B+ :301 Mean :0.6855 Mean :0.569 Mean :0.02003
## A- :293 3rd Qu.:1.0000 3rd Qu.:1.000 3rd Qu.:0.00000
## B- :184 Max. :1.0000 Max. :1.000 Max. :1.00000
## (Other):537
## CourseSuccess
## Min. :0.0000
## 1st Qu.:0.0000
## Median :1.0000
## Mean :0.6768
## 3rd Qu.:1.0000
## Max. :1.0000
## NA's :567
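The summary shows that several columns contain many missing values. As a minimal sketch of how to count them directly, the block below uses a small made-up data frame (`toy`, not the real data) whose column names mirror those in `mathP`; the same two calls should work on `mathP` itself.

```r
# Toy data frame with made-up values; only the column names mirror mathP.
toy <- data.frame(
  PSATM = c(56, 57, NA, 53, NA),
  SATM  = c(56, NA, 62, NA, 64),
  ACTM  = c(25, 23, 27, 27, 31)
)

# is.na() returns a logical matrix; colSums() counts the TRUEs per column.
na_counts <- colSums(is.na(toy))
na_counts

# colMeans() on the same matrix gives the share of missing values per column.
na_share <- colMeans(is.na(toy))
na_share
```

Knowing how many values are missing is why the later calls to mean() and median() pass na.rm=TRUE.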
Let’s compare the mean and median of SAT and ACT math scores:
meanSAT <- round(mean(mathP$SATM, na.rm=TRUE),2)
medianSAT <- round(median(mathP$SATM, na.rm=TRUE),2)
meanACT <- round(mean(mathP$ACTM, na.rm=TRUE),2)
medianACT <- round(median(mathP$ACTM, na.rm=TRUE),2)
We see that the SAT scores have a mean of 62.6 and a median of 63, while the ACT scores have a mean of 26.98 and a median of 27.
Now, let’s split the data by the size of the class students graduated in: the smaller half (Size at or below the median) and the larger half (Size above the median):
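The four assignments above repeat the same pattern for each column. One possible way to shorten this, sketched here on made-up values, is to wrap the pattern in a helper and apply it to several columns with sapply(). The helper name `center` and the data frame `toy` are hypothetical and only serve the illustration.

```r
# Toy data with made-up scores; only the column names mirror mathP.
toy <- data.frame(
  SATM = c(56, NA, 62, NA, 64, 68),
  ACTM = c(25, 23, 27, 27, 31, NA)
)

# Hypothetical helper: rounded mean and median, ignoring missing values.
center <- function(x) {
  c(mean   = round(mean(x, na.rm = TRUE), 2),
    median = round(median(x, na.rm = TRUE), 2))
}

# sapply() runs the helper on each column and binds the results into a matrix
# with one column per variable and one row per statistic.
stats <- sapply(toy[c("SATM", "ACTM")], center)
stats
```

Individual values can then be read off by name, for example stats["mean", "SATM"].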
mathP.small <- mathP[which(mathP$Size<=quantile(mathP$Size,0.50,na.rm=TRUE)),]
mathP.large <- mathP[which(mathP$Size>quantile(mathP$Size,0.50,na.rm=TRUE)),]
…and see if the means and medians differ much.
meanSAT.small <- round(mean(mathP.small$SATM,na.rm=TRUE),2)
medianSAT.small <- round(median(mathP.small$SATM,na.rm=TRUE),2)
meanSAT.large <- round(mean(mathP.large$SATM,na.rm=TRUE),2)
medianSAT.large <- round(median(mathP.large$SATM,na.rm=TRUE),2)
meanACT.small <- round(mean(mathP.small$ACTM,na.rm=TRUE),2)
medianACT.small <- round(median(mathP.small$ACTM,na.rm=TRUE),2)
meanACT.large <- round(mean(mathP.large$ACTM,na.rm=TRUE),2)
medianACT.large <- round(median(mathP.large$ACTM,na.rm=TRUE),2)
We see that the SAT scores from students in the smaller classes have a mean of 62.66 and a median of 63, while their ACT scores have a mean of 26.69 and a median of 27.
We can compare the smaller classes to the larger ones:
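Because which() drops missing comparisons, rows with a missing Size end up in neither subset. As an alternative sketch, the split can be recorded as a grouping column on a single data frame, which makes those rows visible. The column name `SizeGroup` and the values below are made up for illustration.

```r
# Toy data with made-up values; Size contains one missing entry.
toy <- data.frame(
  Size = c(75, 85, 420, NA, 518, 382),
  SATM = c(58, 61, 56, 62, 68, NA)
)

# median() is the same cutoff as quantile(x, 0.50) with R's default settings.
cutoff <- median(toy$Size, na.rm = TRUE)

# ifelse() returns NA where Size is NA, so those rows stay unassigned.
toy$SizeGroup <- ifelse(toy$Size <= cutoff, "small", "large")

# useNA = "ifany" adds a count for the unassigned rows.
table(toy$SizeGroup, useNA = "ifany")
```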
Group | SAT Mean | SAT Median | ACT Mean | ACT Median |
---|---|---|---|---|
Large Classes | 62.56 | 63 | 27.26 | 27 |
Small Classes | 62.66 | 63 | 26.69 | 27 |
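
A comparison table like this one can also be computed in a single call. The sketch below assumes a grouping column (here a hypothetical `SizeGroup`) and uses made-up scores; it is one possible approach with aggregate(), not necessarily how the table above was produced.

```r
# Toy data with made-up values and a hypothetical grouping column.
toy <- data.frame(
  SizeGroup = c("small", "small", "small", "large", "large", "large"),
  SATM = c(58, NA, 62, 56, 68, 64),
  ACTM = c(25, 23, NA, 27, 31, 26)
)

# The data-frame method of aggregate() applies FUN to each column within each
# group; missing scores are handled inside FUN via na.rm = TRUE.
group_stats <- aggregate(
  toy[c("SATM", "ACTM")],
  by  = list(Group = toy$SizeGroup),
  FUN = function(x) round(mean(x, na.rm = TRUE), 2)
)
group_stats
```

Swapping mean() for median() inside FUN gives the corresponding medians.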
Looking at the variable “Grade”,
levels(mathP$Grade)
## [1] "" "A" "A-" "A+" "B" "B-" "B+" "C" "C-" "C+" "D" "D-" "D+" "F"
## [15] "I" "S" "W"
…we see that most are the typical letter grades given to students. Some, such as “W”, “I”, or “S”, though, represent other situations. Perhaps we prefer to make them a bit more descriptive:
# Add the new levels
levels(mathP.small$Grade) <- c(levels(mathP.small$Grade),c("Incomplete","Withdrawal","Satisfactory"))
levels(mathP.small$Grade)
## [1] "" "A" "A-" "A+"
## [5] "B" "B-" "B+" "C"
## [9] "C-" "C+" "D" "D-"
## [13] "D+" "F" "I" "S"
## [17] "W" "Incomplete" "Withdrawal" "Satisfactory"
# Change the values
mathP.small$Grade[which(mathP.small$Grade == "I")] <- "Incomplete"
mathP.small$Grade[which(mathP.small$Grade == "W")] <- "Withdrawal"
mathP.small$Grade[which(mathP.small$Grade == "S")] <- "Satisfactory"
mathP.small[which(mathP.small$Grade %in% c("Incomplete","Withdrawal","Satisfactory")),]
## X Student Gender PSATM SATM ACTM Rank Size GPAadj PlcmtScore
## 334 334 2282 NA NA NA 25 0 0 29 21
## 836 836 1390 NA 58 NA 29 12 123 38 36
## 854 854 1528 NA 55 NA 25 92 320 31 22
## 968 968 1933 NA 64 NA 27 50 262 37 30
## Recommends Course Grade RecTaken TooHigh TooLow CourseSuccess
## 334 R01 120 Incomplete 0 1 0 0
## 836 R1 120 Satisfactory 1 1 0 NA
## 854 R1 117 Satisfactory 0 0 0 NA
## 968 R1 120 Satisfactory 1 1 0 NA
We could also do the same for the larger class size subset if we wanted.
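
Note that the approach above leaves the old codes "I", "S" and "W" behind as unused levels. As an alternative sketch, the levels themselves can be renamed, which relabels every matching value in one step and leaves no unused levels. The factor `grade` below is a made-up example, not the real Grade column.

```r
# Made-up factor containing a few letter grades and special codes.
grade <- factor(c("A", "I", "S", "W", "B", "S"))

# Copy the level names, replace the codes, and assign the names back.
lv <- levels(grade)
lv[lv == "I"] <- "Incomplete"
lv[lv == "S"] <- "Satisfactory"
lv[lv == "W"] <- "Withdrawal"
levels(grade) <- lv

levels(grade)
as.character(grade)
```

If the add-then-reassign approach is kept instead, droplevels() can remove the unused levels afterwards.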