The purpose of this project is to explore the Algebra Common Core Regents. A replication of the type of output schools receive after the the test’s administration has been created. This output is a .csv file with student information, their answer to each multiple choice question, the amount of points they earned for each free response question, their raw score on the test and their scaled score. To begin our exploration, we must first remove any extraneous information and perform data clean up.
Here is a look at the original data:
library(stringr)
mxrc <- read.csv("sample_alg_scores.csv",stringsAsFactors = FALSE)
head(mxrc)
## X X.5 X.6 X.7 X.8 X.9
## 1 1
## 2 2
## 3 3 Raw Score Final Score Scan Date Page Num Scan Date 2
## 4 4
## 5 5 ABS 01/30/15 08:44 AM 315460046933-1 Not Scanned
## 6 6 ABS 01/30/15 08:44 AM 315460046934-1 Not Scanned
## X.10 X.11 X.12 X.13 X.14 X.15 X.16 X.17 X.18
## 1
## 2
## 3 Page Num 2 IEP ELL Errors Alt Lang Multi Scan Questions: 1 2
## 4 Answers: 2 2
## 5 315460046933-2
## 6 315460046934-2
## X.19 X.20 X.21 X.22 X.23 X.24 X.25 X.26 X.27 X.28 X.29 X.30 X.31 X.32
## 1
## 2
## 3 3 4 5 6 7 8 9 10 11 12 13 14 15 16
## 4 4 1 3 2 1 1 4 2 4 2 3 4 3 1
## 5
## 6
## X.33 X.34 X.35 X.36 X.37 X.38 X.39 X.40 X.41 X.42 X.43 X.44 X.45 X.46
## 1
## 2
## 3 17 18 19 20 21 22 23 24 25 26 27 28 29 30
## 4 4 3 4 3 1 3 4 1 * * * * * *
## 5
## 6
## X.47 X.48 X.49 X.50 X.51 X.52 X.53 X.54 X.55
## 1
## 2
## 3 31 32 33 34 35 36 37 Omitted Multiple
## 4 * * * * * * *
## 5
## 6
suppressMessages(library(dplyr))
#get rid of useless columns and rows
mxrc <- mxrc[-c(1:4),-c(4:13,51:55)]
#remove abs students (detect a numeric score 0-100 if they are present)
present <- str_detect(mxrc$X.6, "\\d")
mxrc <- filter(mxrc, present)
#rename columns
colnames(mxrc) <- c("id","raw","final",c(1:37))
#move questions to front so that question# = column#
mxrc <- mxrc[,c(4:40,1:3)]
mxrc$mc <- rowSums(sapply(mxrc[,1:24], str_count, "-"))
mxrc$final <- as.numeric(mxrc$final)
mxrc$raw <- as.numeric(mxrc$raw)
Here is the cleaned up data:
head(mxrc)
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27
## 1 3 1 - - - 3 2 - 1 1 - 3 - 3 - 2 1 2 3 - - 1 3 2 1 2 0
## 2 - - - 3 1 1 2 - 2 - 3 3 4 - 2 - 1 - - - 3 4 3 3 1 0 0
## 3 - 1 2 3 - - 4 3 - 3 2 3 - - 4 3 2 4 3 2 2 1 1 2 1 0 2
## 4 - - 2 4 4 - 2 4 - - 3 4 - 2 2 3 - - 3 - 2 4 2 2 1 0 1
## 5 - 1 3 3 4 - 4 3 2 1 3 - 2 2 2 4 1 1 1 4 3 1 2 2 0 0 1
## 6 - 1 - 3 1 3 4 2 - 3 1 - 2 2 - - 3 4 2 4 - - 2 - 0 0 0
## 28 29 30 31 32 33 34 35 36 37 id raw final mc
## 1 0 0 0 0 0 1 0 0 0 0 7 22 58 9
## 2 1 0 0 0 0 2 3 0 0 0 9 27 63 10
## 3 2 0 0 0 0 2 1 0 0 0 10 20 55 6
## 4 1 0 0 0 0 0 0 0 0 0 11 21 57 9
## 5 2 0 0 0 0 0 0 0 0 0 13 9 33 3
## 6 0 0 0 0 0 0 0 0 0 0 15 18 52 9
The Common Core Algebra Curriculum is broken up into separate concepts. The above scores have been generated for the January 2015 Regents. Let us take a look at the concept breakdown for this specific Regents exam.
#use the file I created to make the shiny ap
#this is a df of all questions and their attributes
source('algebra_df.R', local=TRUE)
#create a new data frame that combines the questions by concept
concept_df <- algebra_df %>%
filter(test_name == "January_2015") %>%
arrange(question) %>%
group_by(concept) %>%
#add number of questions and possible points for each concept
mutate(questions = length(question), total_points = sum(points)) %>%
#focus in on the multiple choice questions
filter(question < 25) %>%
mutate(mc_questions = length(question),
mc_points = sum(points)) %>%
#free response
mutate(fr_questions = questions-mc_questions,
fr_points = total_points-mc_points) %>%
#keep concept and new data
select(concept, 13:18) %>%
distinct(concept)
#not long so we can see data as table
concept_df
## Source: local data frame [4 x 7]
## Groups: concept [4]
##
## concept questions total_points mc_questions mc_points
## (chr) (int) (int) (int) (int)
## 1 Algebra 19 44 14 28
## 2 Number & Quantity 2 4 1 2
## 3 Functions 12 26 8 16
## 4 Statistics and Probability 4 10 1 2
## Variables not shown: fr_questions (int), fr_points (int)
library(ggplot2)
#create long data
concept_df <- concept_df %>% gather(type, total, 2:7)
p <- ggplot(data=concept_df, aes(type, total, fill=concept))
p <- p + geom_bar(stat="identity", position="dodge")
p <- p + theme(axis.text.x = element_text(angle=75, hjust=1, vjust=1))
p <- p + xlab("") + ylab("Total")
p <- p + ggtitle("January 2015 Common Core Algebra Regents")
p <- p + scale_x_discrete(labels=c("Questions","Total\nPoints",
"Multiple Choice\nQuestions",
"Multiple Choice\nPoints",
"Free Response\nQuestions",
"Free Response\nPoints"))
p
Now that we know the test breakdown, it would be helpful for schools to see which concepts need to receive more focus while preparing for the next administration.
#this function takes a concept and returns the proportion
#of mc right and prop of possible fr points earned
concept_breakdown <- function(c){
concept <- algebra_df %>% arrange(question) %>%
filter(month=="January",
year==15,
concept==c) %>%
select(question)
mc_q <- concept$question[concept$question < 25]
concept_mc <- sum(str_count(mxrc[,mc_q],"-"))/(nrow(mxrc) * length(mc_q))
fr_q <- concept$question[concept$question > 24]
fr_points <- sum(algebra_df %>% filter(month=="January",
year==15,
question %in% fr_q) %>%
select(points))
concept_fr <- sum(sapply(mxrc[,fr_q], as.numeric))/(nrow(mxrc) * fr_points)
return(c(c, concept_mc, concept_fr))
}
#creata a data frame to displaythe information
concept_bd_df <- data_frame("Concept"=NA, "MC_Right"=NA, "FR_Points"=NA)
for(i in unique(concept_df$concept)){
concept_bd_df <- rbind(concept_bd_df, concept_breakdown(i))
}
concept_bd_df <- concept_bd_df[-1,]
library(scales)
#create long data
concept_bd_df <- concept_bd_df %>% gather(type, total, 2:3)
#remove factor
concept_bd_df$total <- as.numeric(as.character(concept_bd_df$total))
p1 <- ggplot(data=concept_bd_df, aes(Concept, total, fill=type))
p1 <- p1 + geom_bar(stat="identity", position="dodge")
p1 <- p1 + theme(axis.text.x = element_text(angle=75, hjust=1, vjust=1))
p1 <- p1 + xlab("") + ylab("Percent")
p1 <- p1 + scale_y_continuous(limits=c(0, .5), labels=percent)
p1 <- p1 + ggtitle("January 2015 Concept Breakdown")
p1
A few years ago the NYC DOE changed the way the Regents exams are scored. Previously they were scored in house at the school. Due to a large amount of 65s (the passing score) showing up the city decided to use a centralized scoring system. This means that after each test, the multiple choice answers are scanned and the free response answers are shipped to a central scoring site.
Teachers are then paid overtime to go to the central sites and score the exams on nights and weekends. It always seem to me that this is all a colossal waste of money. I understand not scoring in house, but there should be a better system. For example, the first day of scoring is always spent waiting for the boxes of tests to arrive at the central scoring site.
Could the test just be multiple choice? Is the free response section essential to finding out which students really know the material? The SAT free response section is still bubbled, so couldn’t the Regents use the same process?
q <- ggplot(mxrc, aes(mxrc$mc, as.numeric(mxrc$final)))
q <- q + geom_point()
q <- q + geom_abline(aes(slope=0, intercept=64.5, color="red"))
q <- q + xlab("Multiple Choice Correct") + ylab("Final Score")
q <- q + ggtitle("January 2014 Common Core Algebra Regents")
q <- q + ylim(0,100)
q
mc <- data.frame(table(mxrc$mc > 12, mxrc$final >= 65))
mc <- arrange(mc, desc(Var1), desc(Var2))
colnames(mc) <- c("Pass the Regents","More than Half MC Correct","Total")
mc
## Pass the Regents More than Half MC Correct Total
## 1 TRUE TRUE 18
## 2 TRUE FALSE 2
## 3 FALSE TRUE 3
## 4 FALSE FALSE 196
There were only 3 students who got more than half the multiple choice right and did not pass the exam. That is only 1.37% of all the students who took the test. Now, let us raise the bar and focus on students who got at least 62.5% of the multiple choice questions right.
mc <- data.frame(table(mxrc$mc >= 15, mxrc$final >= 65))
mc <- arrange(mc, desc(Var1), desc(Var2))
colnames(mc) <- c("Pass the Regents","At Least 62.5% MC Correct","Total")
mc
## Pass the Regents At Least 62.5% MC Correct Total
## 1 TRUE TRUE 7
## 2 TRUE FALSE 0
## 3 FALSE TRUE 14
## 4 FALSE FALSE 198
If a student gets at least 15 multiple choice questions correct, they will pass the Regents.
Now I will use my rudimentary statistical analysis skills to find the relationship between the number of multiple choice questions right and a student’s final test score.
x <- lm(final ~ log(mc), data=mxrc)
summary(x)
##
## Call:
## lm(formula = final ~ log(mc), data = mxrc)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.1233 -2.8795 -0.7741 1.9740 10.8767
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.0972 1.1848 -4.302 2.56e-05 ***
## log(mc) 27.4704 0.5715 48.067 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.231 on 217 degrees of freedom
## Multiple R-squared: 0.9141, Adjusted R-squared: 0.9137
## F-statistic: 2310 on 1 and 217 DF, p-value: < 2.2e-16
q <- q + geom_smooth(method="lm", formula=y~log(x))
q
There is a statistically significant relationship between a student’s final score and the number of multiple choice questions they answered correctly. We can now use this model to assess the probability of passing the test by guessing.
As a geometry teacher, I always felt that my student’s algebra skills were lacking. Since a student only needs to get half of the multiple choice questions correct to pass the Algebra Regents, I shouldn’t have been suprised. Does passing this test mean anything? Do students actually need any algebraic knowledge to pass the test? What if they just guessed on the multiple choice (leaving the free response blank), could they still pass?
I will generate 100,000 random multiple choice test responses and use the model to see if the student would pass the test.
ans_key <- c(algebra_df %>% filter(month=="January", year==15, question < 25) %>%
arrange(question) %>% select(answer))
r_mc <- c()
runs <- 100000
for(i in 1:runs){
comp <- data_frame(random_answers = sample(1:4, 24, replace = T),
key = unlist(ans_key))
comp$check <- comp$random_answers == comp$key
r_mc <- c(r_mc, sum(comp$check))
}
pred_values <- data_frame(r_mc=c(r_mc))
pred_values <- mutate(pred_values, final=-5.0972+27.4704*log(r_mc))
pred_values$final <- replace(pred_values$final, pred_values$final < 0, 0)
summary(pred_values)
## r_mc final
## Min. : 0 Min. : 0.00
## 1st Qu.: 5 1st Qu.:39.11
## Median : 6 Median :44.12
## Mean : 6 Mean :42.18
## 3rd Qu.: 7 3rd Qu.:48.36
## Max. :17 Max. :72.73
pv <- distinct(pred_values)
r <- ggplot(pv, aes(r_mc, final))
r <- r + geom_point()
r <- r + geom_abline(aes(slope=0, intercept=64.5, color="red"))
r <- r + xlab("Random Multiple Choice Correct") + ylab("Predicted Final Score")
r <- r + ggtitle("Random Algebra Regents Scores")
r <- r + ylim(0,100)
r
0.189%. So students should do a little bit of studying prior to the test if they expect to be successful!