Final Project

Sample Data and Questions

The chosen sample data comes from: link

Description: link

There are 2696 results from a Math placement exam at a liberal arts college. Some of the fields included are gender, standardized test scores, high school rank, GPA, which course was ultimately taken, and grade.

These data can be analyzed to determine whether the placement exam is an effective tool for identifying the right course level for a student upon enrollment. We could compare the performance of students who enrolled in the suggested class with that of students who enrolled in a more difficult course.

Did the students who enrolled in a more difficult course get lower grades?

Also, is there a linear relationship between students' high school GPA and their placement exam score?

Data Exploration

theURL <- "https://raw.githubusercontent.com/HildaRamirez/HW2/master/MathPlacement.csv"
MathTest <- read.table(file=theURL, header = TRUE, sep = ",")

str(MathTest)

## 'data.frame':    2696 obs. of  17 variables:
##  $ X            : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Student      : int  625 628 629 630 634 636 638 641 643 645 ...
##  $ Gender       : int  0 0 1 0 1 0 0 0 0 0 ...
##  $ PSATM        : int  56 57 NA 53 NA 63 42 52 51 60 ...
##  $ SATM         : int  56 NA 62 NA 64 68 NA NA 58 NA ...
##  $ ACTM         : int  25 23 27 27 31 NA 23 24 NA 26 ...
##  $ Rank         : int  1 1 42 6 72 96 38 72 51 215 ...
##  $ Size         : int  420 85 421 75 462 518 382 480 703 524 ...
##  $ GPAadj       : int  40 40 38 38 35 34 37 37 34 32 ...
##  $ PlcmtScore   : int  23 21 20 20 19 18 18 17 17 16 ...
##  $ Recommends   : chr  "R0" "R0" "R0" "R0" ...
##  $ Course       : int  210 117 117 117 114 117 117 117 117 117 ...
##  $ Grade        : chr  "A" "A" "A-" "B" ...
##  $ RecTaken     : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ TooHigh      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ TooLow       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ CourseSuccess: int  1 1 1 1 1 1 1 1 1 1 ...

With the exception of Recommends and Grade, the fields are all integers. RecTaken, TooHigh, TooLow, and CourseSuccess would make more sense as logical values, since each one records a yes/no flag.

summary(MathTest)

##        X             Student           Gender           PSATM      
##  Min.   :   1.0   Min.   :   2.0   Min.   :0.0000   Min.   : 0.00  
##  1st Qu.: 674.8   1st Qu.: 925.8   1st Qu.:0.0000   1st Qu.:54.00  
##  Median :1348.5   Median :1953.0   Median :0.0000   Median :59.00  
##  Mean   :1348.5   Mean   :1942.7   Mean   :0.4586   Mean   :58.14  
##  3rd Qu.:2022.2   3rd Qu.:2968.2   3rd Qu.:1.0000   3rd Qu.:65.00  
##  Max.   :2696.0   Max.   :4067.0   Max.   :1.0000   Max.   :80.00  
##                                    NA's   :2116     NA's   :1560   
##       SATM           ACTM            Rank             Size      
##  Min.   :35.0   Min.   :13.00   Min.   :  0.00   Min.   :  0.0  
##  1st Qu.:58.0   1st Qu.:25.00   1st Qu.:  7.00   1st Qu.:177.0  
##  Median :63.0   Median :27.00   Median : 28.00   Median :322.0  
##  Mean   :62.6   Mean   :26.98   Mean   : 51.01   Mean   :323.5  
##  3rd Qu.:68.0   3rd Qu.:30.00   3rd Qu.: 73.00   3rd Qu.:455.0  
##  Max.   :80.0   Max.   :36.00   Max.   :530.00   Max.   :888.0  
##  NA's   :1460   NA's   :322     NA's   :196      NA's   :179    
##      GPAadj        PlcmtScore      Recommends            Course     
##  Min.   : 0.00   Min.   :-18.00   Length:2696        Min.   :109.0  
##  1st Qu.:33.00   1st Qu.: 26.00   Class :character   1st Qu.:120.0  
##  Median :37.00   Median : 33.00   Mode  :character   Median :120.0  
##  Mean   :35.73   Mean   : 32.44                      Mean   :123.4  
##  3rd Qu.:39.00   3rd Qu.: 39.00                      3rd Qu.:122.0  
##  Max.   :40.00   Max.   : 59.00                      Max.   :398.0  
##  NA's   :20      NA's   :35                                         
##     Grade              RecTaken         TooHigh          TooLow       
##  Length:2696        Min.   :0.0000   Min.   :0.000   Min.   :0.00000  
##  Class :character   1st Qu.:0.0000   1st Qu.:0.000   1st Qu.:0.00000  
##  Mode  :character   Median :1.0000   Median :1.000   Median :0.00000  
##                     Mean   :0.6855   Mean   :0.569   Mean   :0.02003  
##                     3rd Qu.:1.0000   3rd Qu.:1.000   3rd Qu.:0.00000  
##                     Max.   :1.0000   Max.   :1.000   Max.   :1.00000  
##                                                                       
##  CourseSuccess   
##  Min.   :0.0000  
##  1st Qu.:0.0000  
##  Median :1.0000  
##  Mean   :0.6768  
##  3rd Qu.:1.0000  
##  Max.   :1.0000  
##  NA's   :567

Some observations from the summary:

1. PSAT and SAT scores are of limited use because more than half of their values are NA.
2. The median and mean class rank, compared with class size, suggest that a large portion of the students were in the top quartile of their high school class. A more uniform population could help rule out other variables as the cause of poor performance in the Math course.
3. The quartiles are not informative for the columns that are really boolean in nature; for those, only the mean (the proportion of 1s) is useful.
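The first two observations can be checked numerically rather than read off the summary. The following is a minimal sketch on a small made-up data frame (`toy` is a hypothetical stand-in, not the real data); with the real data, `MathTest` would take its place. It computes the share of missing values per column and each student's rank as a fraction of class size.

```r
# Toy stand-in for MathTest; the values are invented for illustration only.
toy <- data.frame(
  PSATM = c(56, 57, NA, 53, NA),
  SATM  = c(56, NA, 62, NA, 64),
  Rank  = c(1, 1, 42, 6, 72),
  Size  = c(420, 85, 421, 75, 462)
)

# Share of missing values in each column (0 = none missing, 1 = all missing)
na_share <- colMeans(is.na(toy))
print(round(na_share, 2))

# Rank relative to class size: smaller means closer to the top of the class
rank_pct <- toy$Rank / toy$Size
print(summary(rank_pct))

# Fraction of students in the top quartile of their high school class
top_quartile <- mean(rank_pct <= 0.25, na.rm = TRUE)
print(top_quartile)
```

In the real data both Rank and Size have a minimum of 0, so some ratios would be undefined; `na.rm = TRUE` leaves those rows out of the final proportion.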

Data Wrangling

# First, remove rows with no Grade, since they cannot help answer the questions above.
# Drop the PSAT and SAT score columns because more than half of their values are NA.
# Drop the Gender column, since this analysis does not compare results by gender.
# Drop the Course number column, since there is no legend to explain the codes.

MathTestSub <- subset(MathTest, Grade != "")
MathTestSub <- MathTestSub[, c(1,2,6:11,13:17)]

# Dividing High School GPA by 10 and setting to double so that it can be in the usual scale

MathTestSub$GPAadj <- as.double(MathTestSub$GPAadj/10)

# Retyping the last columns as booleans
MathTestSub$RecTaken <- as.logical(MathTestSub$RecTaken)
MathTestSub$TooHigh <- as.logical(MathTestSub$TooHigh)
MathTestSub$TooLow <- as.logical(MathTestSub$TooLow)
MathTestSub$CourseSuccess <- as.logical(MathTestSub$CourseSuccess)

# Should have 2134 rows once blanks removed (verified in Excel)
# Take a peek at the data after the changes using head()
nrow(MathTestSub)

## [1] 2134

head(MathTestSub)

##   X Student ACTM Rank Size GPAadj PlcmtScore Recommends Grade RecTaken TooHigh
## 1 1     625   25    1  420    4.0         23         R0     A     TRUE   FALSE
## 2 2     628   23    1   85    4.0         21         R0     A     TRUE   FALSE
## 3 3     629   27   42  421    3.8         20         R0    A-     TRUE   FALSE
## 4 4     630   27    6   75    3.8         20         R0     B     TRUE   FALSE
## 5 5     634   31   72  462    3.5         19         R0     A     TRUE   FALSE
## 6 6     636   NA   96  518    3.4         18         R0    B+     TRUE   FALSE
##   TooLow CourseSuccess
## 1  FALSE          TRUE
## 2  FALSE          TRUE
## 3  FALSE          TRUE
## 4  FALSE          TRUE
## 5  FALSE          TRUE
## 6  FALSE          TRUE

Graphics & Discussion

Scatterplot of high school GPA vs placement test score

library(ggplot2)

ggplot(MathTestSub, aes(x=GPAadj, y=PlcmtScore)) + geom_point()

## Warning: Removed 35 rows containing missing values (geom_point).

The plot suggests a moderate positive association between high school GPA and placement test score. This adds some confidence that the test is a reasonable measure of Math performance.
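The strength of the association can also be put into numbers. The sketch below uses a small made-up data frame (`toy` is hypothetical, not the real data) to show a Pearson correlation and a simple linear fit that both skip rows with missing values; with the real data, `MathTestSub` would take its place.

```r
# Toy stand-in for MathTestSub; the values are invented for illustration only.
toy <- data.frame(
  GPAadj     = c(4.0, 4.0, 3.8, 3.8, 3.5, 3.4, NA, 3.0),
  PlcmtScore = c(23, 21, 20, 20, 19, 18, 17, NA)
)

# Pearson correlation, skipping rows where either value is missing
r <- cor(toy$GPAadj, toy$PlcmtScore, use = "complete.obs")
print(r)

# Simple linear regression of placement score on GPA;
# lm() drops incomplete rows by default
fit <- lm(PlcmtScore ~ GPAadj, data = toy)
print(coef(fit))
```

A correlation near 0 would argue against a linear relationship, while the slope from `lm()` estimates how many placement points go with one GPA point. Adding `geom_smooth(method = "lm")` to the scatterplot would draw the fitted line over the points.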

# Additional wrangling: before plotting grades, the letter grades need to be converted to numerical ones. A new column is added for this:

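convert_grades <- function(x) {
  # Order the letter grades so that each level lines up with its grade points
  A <- factor(x, levels=c("A+", "A", "A-", "B+", "B", "B-",
                     "C+", "C", "C-", "D+", "D", "D-", "F", "I", "S", "W"))
  # Grade points by level; note that I, S, and W are all counted as 0
  values <- c(4.3, 4, 3.7, 3.3, 3, 2.7,
              2.3, 2, 1.7, 1.3, 1, 0.7, 0, 0, 0, 0)
  # Indexing by a factor uses its integer codes, so this picks the matching value
  values[A]
}

MathTestSub$NumGrade <- convert_grades(MathTestSub$Grade)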

# Verify new column
head(MathTestSub)

##   X Student ACTM Rank Size GPAadj PlcmtScore Recommends Grade RecTaken TooHigh
## 1 1     625   25    1  420    4.0         23         R0     A     TRUE   FALSE
## 2 2     628   23    1   85    4.0         21         R0     A     TRUE   FALSE
## 3 3     629   27   42  421    3.8         20         R0    A-     TRUE   FALSE
## 4 4     630   27    6   75    3.8         20         R0     B     TRUE   FALSE
## 5 5     634   31   72  462    3.5         19         R0     A     TRUE   FALSE
## 6 6     636   NA   96  518    3.4         18         R0    B+     TRUE   FALSE
##   TooLow CourseSuccess NumGrade
## 1  FALSE          TRUE      4.0
## 2  FALSE          TRUE      4.0
## 3  FALSE          TRUE      3.7
## 4  FALSE          TRUE      3.0
## 5  FALSE          TRUE      4.0
## 6  FALSE          TRUE      3.3
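The conversion above counts I, S, and W as 0, the same as an F. Whether that is appropriate is a judgment call. As an alternative sketch (not the approach used in this report, and it would change the means and medians reported later), a named lookup vector can leave those grades as NA so that they drop out of the statistics. The names `grade_points` and `convert_grades_na` are hypothetical.

```r
# Grade points keyed by letter grade. I, S, and W are deliberately left out,
# so they come back as NA instead of 0 (an assumption, not the original choice).
grade_points <- c("A+" = 4.3, "A" = 4.0, "A-" = 3.7,
                  "B+" = 3.3, "B" = 3.0, "B-" = 2.7,
                  "C+" = 2.3, "C" = 2.0, "C-" = 1.7,
                  "D+" = 1.3, "D" = 1.0, "D-" = 0.7,
                  "F"  = 0.0)

convert_grades_na <- function(x) {
  # Look each grade up by name; labels not in the table yield NA
  unname(grade_points[as.character(x)])
}

example <- c("A", "B+", "W", "F", "S")
print(convert_grades_na(example))
```

With this version, calls such as `mean(..., na.rm = TRUE)` would ignore incompletes, satisfactory marks, and withdrawals instead of treating them as failing grades.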

# Even more wrangling.  Creating subsets for students who took recommended course vs a harder course.  Verifying row counts against Excel.

MathTestSubRec <- MathTestSub[MathTestSub$RecTaken == TRUE,]
nrow(MathTestSubRec)

## [1] 1488

MathTestSubHigh <- MathTestSub[MathTestSub$TooHigh == TRUE,]
nrow(MathTestSubHigh)

## [1] 1185
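The two row counts above add up to 2673 (1488 + 1185), which is more than the 2134 rows in MathTestSub, so some rows appear to have both RecTaken and TooHigh set to TRUE. A cross-tabulation is a quick way to see how much the two groups overlap. The sketch below uses made-up flags (`toy` is hypothetical); with the real data, `MathTestSub` would take its place.

```r
# Toy flags; invented for illustration only.
toy <- data.frame(
  RecTaken = c(TRUE, TRUE, FALSE, TRUE, FALSE),
  TooHigh  = c(FALSE, TRUE, TRUE, FALSE, FALSE)
)

# Cross-tabulate the two flags; the TRUE/TRUE cell counts rows in both groups
overlap_table <- table(RecTaken = toy$RecTaken, TooHigh = toy$TooHigh)
print(overlap_table)

# Number of rows flagged as both "recommended" and "too high"
n_both <- sum(toy$RecTaken & toy$TooHigh, na.rm = TRUE)
print(n_both)
```

If the overlap in the real data is large, the two subsets are not independent groups, and the comparisons that follow should be read with that in mind.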

Boxplot for students that took recommended course

boxplot(MathTestSubRec$NumGrade)

Boxplot for students that took harder course

A wider, lower range and a lower median are observed.

boxplot(MathTestSubHigh$NumGrade)

Histogram for students that took recommended course

These data will be easier to observe via histograms.

ggplot(data = MathTestSubRec) + geom_histogram(aes(x=NumGrade))

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Histogram for students that took a harder course.

# Fixed the y-axis with ylim() so the counts are easier to compare with the previous histogram. Also added a red outline for a slightly different visual.

ggplot(data = MathTestSubHigh, aes(x=NumGrade)) + geom_histogram(color = "red") + ylim(0,300)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
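Another way to get both histograms on the same scale is to stack the two groups in one data frame and facet by group. The sketch below assumes ggplot2 is installed and uses made-up grades (`toy` and its `Group` column are hypothetical); it also sets `binwidth` explicitly, which avoids the `stat_bin()` message above.

```r
library(ggplot2)

# Toy numeric grades for two groups; invented for illustration only.
toy <- data.frame(
  NumGrade = c(4.0, 3.7, 3.3, 3.0, 4.0, 2.0, 3.0, 2.7, 1.0, 0.0),
  Group    = rep(c("Recommended", "Harder"), each = 5)
)

# One histogram per group, stacked vertically. facet_wrap() keeps the same
# x and y scales across panels by default, and binwidth replaces the
# default of 30 bins.
p <- ggplot(toy, aes(x = NumGrade)) +
  geom_histogram(binwidth = 0.5, color = "red") +
  facet_wrap(~ Group, ncol = 1)

print(p)
```

With the real data, the combined frame could be built by adding a `Group` column to MathTestSubRec and MathTestSubHigh and binding them with `rbind()`.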

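We observe noticeably fewer students earning a B or above in the harder-course group. The two groups differ in size (1488 vs 1185 rows), so the raw counts should be compared with that in mind.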

The mean and median for both populations are consistent with the visual observations:

# Descriptive names avoid masking the built-in c() function
rec_median  <- median(MathTestSubRec$NumGrade, na.rm=TRUE)
high_median <- median(MathTestSubHigh$NumGrade, na.rm=TRUE)
rec_mean    <- mean(MathTestSubRec$NumGrade, na.rm=TRUE)
high_mean   <- mean(MathTestSubHigh$NumGrade, na.rm=TRUE)

chart <- data.frame(result=c("Rec Median", "TooHard Median", "Rec Mean", "TooHard Mean"),
                    allstudents=c(rec_median, high_median, rec_mean, high_mean))

chart

##           result allstudents
## 1     Rec Median    3.300000
## 2 TooHard Median    3.000000
## 3       Rec Mean    3.102554
## 4   TooHard Mean    2.868101
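Because the two groups differ in size, the share of students earning a B or above is easier to compare than the raw counts, and a rank-based test gives a rough check on whether the difference could be due to chance. The sketch below uses made-up grades (`rec_grades`, `high_grades`, and `share_b_or_above` are hypothetical names); with the real data, the NumGrade columns of MathTestSubRec and MathTestSubHigh would take their place. The test assumes the two groups are independent samples.

```r
# Toy numeric grades; invented for illustration only.
rec_grades  <- c(4.0, 3.7, 3.3, 3.0, 4.0, 2.7, 3.3)
high_grades <- c(3.0, 2.0, 3.3, 1.0, 0.0, 2.7, 3.7)

# Share of each group earning a B (3.0) or better; proportions are
# comparable even when the groups differ in size
share_b_or_above <- function(g) mean(g >= 3.0, na.rm = TRUE)
print(share_b_or_above(rec_grades))
print(share_b_or_above(high_grades))

# Wilcoxon rank-sum test: a nonparametric check of whether one group
# tends to have higher grades than the other. exact = FALSE requests the
# normal approximation, since letter grades produce many ties.
test <- wilcox.test(rec_grades, high_grades, exact = FALSE)
print(test$p.value)
```

A small p-value would support the claim that grades differ between the groups; a large one would mean the gap in medians and means could plausibly be noise.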

Additional Discussion

We set out to show whether this college’s math placement test is an effective way to measure Math performance and recommend the best course for enrollment. We first showed some association between test performance and high school GPA, the latter of which is assumed to be an accurate long-term measure of performance. This gives credibility to the placement test. We then showed that students who took the recommended course fared better, as demonstrated by a higher median and mean grade. More students performed well (B or above) when taking the recommended course. Overall, these results suggest the school can feel confident in using this Math test for placement.