The chosen sample data comes from: link
Description: link
There are 2696 results from a Math placement exam at a liberal arts college. Some of the fields included are gender, standardized test scores, high school rank, GPA, which course was ultimately taken, and grade.
These data can be analyzed to determine whether the placement exam effectively identifies the right course level for a student at enrollment. We can compare the performance of students who enrolled in the recommended course with that of students who enrolled in a more difficult one.
Did the students who enrolled in a more difficult course get lower grades?
Also, is there a linear relationship between high school GPA and placement exam score?
theURL <- "https://raw.githubusercontent.com/HildaRamirez/HW2/master/MathPlacement.csv"
MathTest <- read.table(file=theURL, header = TRUE, sep = ",")
str(MathTest)
## 'data.frame': 2696 obs. of 17 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Student : int 625 628 629 630 634 636 638 641 643 645 ...
## $ Gender : int 0 0 1 0 1 0 0 0 0 0 ...
## $ PSATM : int 56 57 NA 53 NA 63 42 52 51 60 ...
## $ SATM : int 56 NA 62 NA 64 68 NA NA 58 NA ...
## $ ACTM : int 25 23 27 27 31 NA 23 24 NA 26 ...
## $ Rank : int 1 1 42 6 72 96 38 72 51 215 ...
## $ Size : int 420 85 421 75 462 518 382 480 703 524 ...
## $ GPAadj : int 40 40 38 38 35 34 37 37 34 32 ...
## $ PlcmtScore : int 23 21 20 20 19 18 18 17 17 16 ...
## $ Recommends : chr "R0" "R0" "R0" "R0" ...
## $ Course : int 210 117 117 117 114 117 117 117 117 117 ...
## $ Grade : chr "A" "A" "A-" "B" ...
## $ RecTaken : int 1 1 1 1 1 1 1 1 1 1 ...
## $ TooHigh : int 0 0 0 0 0 0 0 0 0 0 ...
## $ TooLow : int 0 0 0 0 0 0 0 0 0 0 ...
## $ CourseSuccess: int 1 1 1 1 1 1 1 1 1 1 ...
With the exception of Recommends and Grade, all fields are integers. RecTaken, TooHigh, TooLow, and CourseSuccess would make more sense as the logical data type, since they record boolean concepts.
summary(MathTest)
## X Student Gender PSATM
## Min. : 1.0 Min. : 2.0 Min. :0.0000 Min. : 0.00
## 1st Qu.: 674.8 1st Qu.: 925.8 1st Qu.:0.0000 1st Qu.:54.00
## Median :1348.5 Median :1953.0 Median :0.0000 Median :59.00
## Mean :1348.5 Mean :1942.7 Mean :0.4586 Mean :58.14
## 3rd Qu.:2022.2 3rd Qu.:2968.2 3rd Qu.:1.0000 3rd Qu.:65.00
## Max. :2696.0 Max. :4067.0 Max. :1.0000 Max. :80.00
## NA's :2116 NA's :1560
## SATM ACTM Rank Size
## Min. :35.0 Min. :13.00 Min. : 0.00 Min. : 0.0
## 1st Qu.:58.0 1st Qu.:25.00 1st Qu.: 7.00 1st Qu.:177.0
## Median :63.0 Median :27.00 Median : 28.00 Median :322.0
## Mean :62.6 Mean :26.98 Mean : 51.01 Mean :323.5
## 3rd Qu.:68.0 3rd Qu.:30.00 3rd Qu.: 73.00 3rd Qu.:455.0
## Max. :80.0 Max. :36.00 Max. :530.00 Max. :888.0
## NA's :1460 NA's :322 NA's :196 NA's :179
## GPAadj PlcmtScore Recommends Course
## Min. : 0.00 Min. :-18.00 Length:2696 Min. :109.0
## 1st Qu.:33.00 1st Qu.: 26.00 Class :character 1st Qu.:120.0
## Median :37.00 Median : 33.00 Mode :character Median :120.0
## Mean :35.73 Mean : 32.44 Mean :123.4
## 3rd Qu.:39.00 3rd Qu.: 39.00 3rd Qu.:122.0
## Max. :40.00 Max. : 59.00 Max. :398.0
## NA's :20 NA's :35
## Grade RecTaken TooHigh TooLow
## Length:2696 Min. :0.0000 Min. :0.000 Min. :0.00000
## Class :character 1st Qu.:0.0000 1st Qu.:0.000 1st Qu.:0.00000
## Mode :character Median :1.0000 Median :1.000 Median :0.00000
## Mean :0.6855 Mean :0.569 Mean :0.02003
## 3rd Qu.:1.0000 3rd Qu.:1.000 3rd Qu.:0.00000
## Max. :1.0000 Max. :1.000 Max. :1.00000
##
## CourseSuccess
## Min. :0.0000
## 1st Qu.:0.0000
## Median :1.0000
## Mean :0.6768
## 3rd Qu.:1.0000
## Max. :1.0000
## NA's :567
Some observations from the summary:
1. PSAT and SAT scores are of limited use because more than half of the rows have NA for them.
2. The median and mean class rank, compared with class size, suggest that a large portion of the students were in the top quartile of their class. A more uniform population helps rule out other variables as the cause of poor performance in the Math course.
3. Most summary statistics are not meaningful for the columns that are really boolean in nature; only the mean, read as the proportion of 1s, is informative.
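The share of missing values per column can be computed directly rather than read off the summary. A minimal sketch, using a small made-up data frame (`toy` and its values are illustrative and not taken from MathPlacement.csv):

```r
# Toy data standing in for MathTest; the values are invented for illustration.
toy <- data.frame(
  PSATM = c(56, NA, NA, 53),
  SATM  = c(NA, NA, 62, NA),
  ACTM  = c(25, 23, 27, NA)
)

# is.na() returns a logical matrix; its column means are the fraction of NAs.
na_share <- colMeans(is.na(toy))
na_share
```

Applied to the real data, the same call would be `colMeans(is.na(MathTest))`.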
# First, remove rows with no Grade, since they cannot help answer our questions.
# Drop the PSAT and SAT score columns because of their high share of NAs.
# Drop the Gender column, as we are not making observations by gender.
# Drop the Course column, as there is no legend to give the course numbers meaning.
MathTestSub <- subset(MathTest, Grade != "")
MathTestSub <- MathTestSub[, c(1,2,6:11,13:17)]
# Divide high school GPA by 10 to put it on the usual 4-point scale (the division already yields a double)
MathTestSub$GPAadj <- MathTestSub$GPAadj / 10
# Retyping the last columns as booleans
MathTestSub$RecTaken <- as.logical(MathTestSub$RecTaken)
MathTestSub$TooHigh <- as.logical(MathTestSub$TooHigh)
MathTestSub$TooLow <- as.logical(MathTestSub$TooLow)
MathTestSub$CourseSuccess <- as.logical(MathTestSub$CourseSuccess)
# Should have 2134 rows once blanks removed (verified in Excel)
# Take a peek at the data after the changes using head()
nrow(MathTestSub)
## [1] 2134
head(MathTestSub)
## X Student ACTM Rank Size GPAadj PlcmtScore Recommends Grade RecTaken TooHigh
## 1 1 625 25 1 420 4.0 23 R0 A TRUE FALSE
## 2 2 628 23 1 85 4.0 21 R0 A TRUE FALSE
## 3 3 629 27 42 421 3.8 20 R0 A- TRUE FALSE
## 4 4 630 27 6 75 3.8 20 R0 B TRUE FALSE
## 5 5 634 31 72 462 3.5 19 R0 A TRUE FALSE
## 6 6 636 NA 96 518 3.4 18 R0 B+ TRUE FALSE
## TooLow CourseSuccess
## 1 FALSE TRUE
## 2 FALSE TRUE
## 3 FALSE TRUE
## 4 FALSE TRUE
## 5 FALSE TRUE
## 6 FALSE TRUE
library(ggplot2)
ggplot(MathTestSub, aes(x=GPAadj, y=PlcmtScore)) + geom_point()
## Warning: Removed 35 rows containing missing values (geom_point).
The plot shows a moderate positive association between high school GPA and placement test score. This adds some confidence in the test as a way to measure Math performance.
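The strength of this association can be put into numbers with a correlation coefficient and a simple linear fit. A minimal sketch on invented values (the `toy` data frame is illustrative only; with the real data the columns would come from `MathTestSub`):

```r
# Toy values, invented for illustration only.
toy <- data.frame(
  GPAadj     = c(4.0, 3.8, 3.5, 3.4, 3.0, NA),
  PlcmtScore = c(23, 20, 19, 18, 12, 15)
)

# Pearson correlation, using only rows where both values are present.
r <- cor(toy$GPAadj, toy$PlcmtScore, use = "complete.obs")

# Simple linear regression; lm() drops rows with NA by default.
fit <- lm(PlcmtScore ~ GPAadj, data = toy)
slope <- unname(coef(fit)["GPAadj"])

r
slope
```

A value of `r` near 0 would indicate little linear association, while values closer to 1 would indicate a stronger positive one.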
# Additional wrangling: letter grades need to be converted to numerical ones before they can be plotted. A new column is added for this.
# Note: I, S, and W are scored as 0, the same as F.
convert_grades <- function(x) {
  grade <- factor(x, levels = c("A+", "A", "A-", "B+", "B", "B-",
                                "C+", "C", "C-", "D+", "D", "D-", "F", "I", "S", "W"))
  values <- c(4.3, 4, 3.7, 3.3, 3, 2.7,
              2.3, 2, 1.7, 1.3, 1, 0.7, 0, 0, 0, 0)
  # Indexing by a factor uses its integer level codes
  values[grade]
}
MathTestSub$NumGrade <- convert_grades(MathTestSub$Grade)
# Verify new column
head(MathTestSub)
## X Student ACTM Rank Size GPAadj PlcmtScore Recommends Grade RecTaken TooHigh
## 1 1 625 25 1 420 4.0 23 R0 A TRUE FALSE
## 2 2 628 23 1 85 4.0 21 R0 A TRUE FALSE
## 3 3 629 27 42 421 3.8 20 R0 A- TRUE FALSE
## 4 4 630 27 6 75 3.8 20 R0 B TRUE FALSE
## 5 5 634 31 72 462 3.5 19 R0 A TRUE FALSE
## 6 6 636 NA 96 518 3.4 18 R0 B+ TRUE FALSE
## TooLow CourseSuccess NumGrade
## 1 FALSE TRUE 4.0
## 2 FALSE TRUE 4.0
## 3 FALSE TRUE 3.7
## 4 FALSE TRUE 3.0
## 5 FALSE TRUE 4.0
## 6 FALSE TRUE 3.3
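`convert_grades` scores I, S, and W as 0, which treats them like an F when means are computed. One alternative, assuming these codes stand for non-letter outcomes (the source gives no legend for them), is to map them to `NA` so that `na.rm = TRUE` excludes them. `convert_grades_na` is a hypothetical name and is not part of the original analysis:

```r
# Hypothetical variant of convert_grades: non-letter codes become NA.
convert_grades_na <- function(x) {
  points <- c("A+" = 4.3, "A" = 4.0, "A-" = 3.7,
              "B+" = 3.3, "B" = 3.0, "B-" = 2.7,
              "C+" = 2.3, "C" = 2.0, "C-" = 1.7,
              "D+" = 1.3, "D" = 1.0, "D-" = 0.7,
              "F" = 0)
  # Lookup by name; codes missing from the table (I, S, W) return NA.
  unname(points[as.character(x)])
}

grades <- c("A", "B+", "W", "F", "I")
num <- convert_grades_na(grades)
num
mean(num, na.rm = TRUE)  # averages only the letter grades
```

Whether excluding these rows is appropriate depends on what the codes mean at this college, so this is a sensitivity check rather than a correction.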
# Even more wrangling: create subsets for students who took the recommended course and for those who took a harder course. Row counts are verified against Excel.
MathTestSubRec <- MathTestSub[MathTestSub$RecTaken, ]
nrow(MathTestSubRec)
## [1] 1488
MathTestSubHigh <- MathTestSub[MathTestSub$TooHigh, ]
nrow(MathTestSubHigh)
## [1] 1185
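The two counts sum to 2673, which exceeds the 2134 rows in `MathTestSub`, so the two subsets appear to overlap: some rows are flagged as both `RecTaken` and `TooHigh`. A cross-tabulation would show how large that overlap is. A minimal sketch on made-up flags (the `toy` values are illustrative only):

```r
# Toy flags, invented for illustration only.
toy <- data.frame(
  RecTaken = c(TRUE, TRUE, FALSE, TRUE, FALSE),
  TooHigh  = c(FALSE, TRUE, TRUE, TRUE, FALSE)
)

# Counts for every combination of the two flags.
overlap <- table(RecTaken = toy$RecTaken, TooHigh = toy$TooHigh)
overlap

# Number of rows that carry both flags at once.
n_both <- sum(toy$RecTaken & toy$TooHigh)
n_both
```

With the real data, `table(MathTestSub$RecTaken, MathTestSub$TooHigh)` would give the same breakdown and indicate whether the comparison groups should be redefined to be disjoint.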
boxplot(MathTestSubRec$NumGrade)
Compared with the recommended-course group, the harder-course group shows a wider, lower range and a lower median:
boxplot(MathTestSubHigh$NumGrade)
These distributions are easier to compare as histograms.
ggplot(data = MathTestSubRec) + geom_histogram(aes(x=NumGrade))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# Added ylim parameter so that the graphs would have the same scale. Also added some color for a slightly different visual.
ggplot(data = MathTestSubHigh, aes(x=NumGrade)) + geom_histogram(color = "red") + ylim(0,300)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Noticeably fewer students in the harder-course group earned a B or above.
The mean and median for both groups confirm the visual observations:
# Descriptive names avoid shadowing the base function c()
rec_median  <- median(MathTestSubRec$NumGrade, na.rm = TRUE)
high_median <- median(MathTestSubHigh$NumGrade, na.rm = TRUE)
rec_mean    <- mean(MathTestSubRec$NumGrade, na.rm = TRUE)
high_mean   <- mean(MathTestSubHigh$NumGrade, na.rm = TRUE)
chart <- data.frame(result = c("Rec Median", "TooHard Median", "Rec Mean", "TooHard Mean"),
                    allstudents = c(rec_median, high_median, rec_mean, high_mean))
chart
## result allstudents
## 1 Rec Median 3.300000
## 2 TooHard Median 3.000000
## 3 Rec Mean 3.102554
## 4 TooHard Mean 2.868101
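The table compares point estimates only. Whether the gap is larger than chance variation could be checked with a two-sample test. A sketch using Welch's t-test on invented grades, grouping by a single flag so that the two groups are disjoint (the `toy` values are illustrative, and treating grade points as numeric is itself an assumption):

```r
# Toy grades, invented for illustration only.
toy <- data.frame(
  NumGrade = c(4.0, 3.7, 3.3, 3.0, 3.3, 2.0, 2.7, 1.7, 3.0, 2.3),
  RecTaken = rep(c(TRUE, FALSE), each = 5)
)

# Group means and medians in one call.
aggregate(NumGrade ~ RecTaken, data = toy,
          FUN = function(g) c(mean = mean(g), median = median(g)))

# Welch two-sample t-test (the default for t.test, var.equal = FALSE).
res <- t.test(NumGrade ~ RecTaken, data = toy)
res$p.value
```

Because grade points are ordinal, a rank-based test such as `wilcox.test()` is a reasonable alternative to the t-test here.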
We set out to assess whether this college’s math placement test is an effective way to gauge Math performance and recommend the best course at enrollment. We first showed some association between placement test performance and high school GPA, the latter of which is assumed to be an accurate long-term measure of performance. This lends credibility to the placement test. We then showed that students who took the recommended course fared better, as demonstrated by their higher median and mean grades. More students earned a B or above when taking the recommended course. Overall, the school should feel confident in using this Math test for placement.