The Programme for International Student Assessment (PISA) is a test given every three years to 15-year-old students from around the world to evaluate their performance in mathematics, reading, and science. This test provides a quantitative way to compare the performance of students from different parts of the world. In this homework assignment, we will predict the reading scores of students from the United States of America on the 2009 PISA exam.
The datasets pisa2009train.csv and pisa2009test.csv contain information about the demographics and schools for American students taking the exam, derived from 2009 PISA Public-Use Data Files distributed by the United States National Center for Education Statistics (NCES). While the datasets are not supposed to contain identifying information about students taking the test, by using the data you are bound by the NCES data use agreement, which prohibits any attempt to determine the identity of any student in the datasets.
Each row in the datasets pisa2009train.csv and pisa2009test.csv represents one student taking the exam.
pisa_train = read.csv("pisa2009train.csv")
str(pisa_train)
## 'data.frame': 3663 obs. of 24 variables:
## $ grade : int 11 11 9 10 10 10 10 10 9 10 ...
## $ male : int 1 1 1 0 1 1 0 0 0 1 ...
## $ raceeth : Factor w/ 7 levels "American Indian/Alaska Native",..: NA 7 7 3 4 3 2 7 7 5 ...
## $ preschool : int NA 0 1 1 1 1 0 1 1 1 ...
## $ expectBachelors : int 0 0 1 1 0 1 1 1 0 1 ...
## $ motherHS : int NA 1 1 0 1 NA 1 1 1 1 ...
## $ motherBachelors : int NA 1 1 0 0 NA 0 0 NA 1 ...
## $ motherWork : int 1 1 1 1 1 1 1 0 1 1 ...
## $ fatherHS : int NA 1 1 1 1 1 NA 1 0 0 ...
## $ fatherBachelors : int NA 0 NA 0 0 0 NA 0 NA 0 ...
## $ fatherWork : int 1 1 1 1 0 1 NA 1 1 1 ...
## $ selfBornUS : int 1 1 1 1 1 1 0 1 1 1 ...
## $ motherBornUS : int 0 1 1 1 1 1 1 1 1 1 ...
## $ fatherBornUS : int 0 1 1 1 0 1 NA 1 1 1 ...
## $ englishAtHome : int 0 1 1 1 1 1 1 1 1 1 ...
## $ computerForSchoolwork: int 1 1 1 1 1 1 1 1 1 1 ...
## $ read30MinsADay : int 0 1 0 1 1 0 0 1 0 0 ...
## $ minutesPerWeekEnglish: int 225 450 250 200 250 300 250 300 378 294 ...
## $ studentsInEnglish : int NA 25 28 23 35 20 28 30 20 24 ...
## $ schoolHasLibrary : int 1 1 1 1 1 1 1 1 0 1 ...
## $ publicSchool : int 1 1 1 1 1 1 1 1 1 1 ...
## $ urban : int 1 0 0 1 1 0 1 0 1 0 ...
## $ schoolSize : int 673 1173 1233 2640 1095 227 2080 1913 502 899 ...
## $ readingScore : num 476 575 555 458 614 ...
pisa_test = read.csv("pisa2009test.csv")
str(pisa_test)
## 'data.frame': 1570 obs. of 24 variables:
## $ grade : int 10 10 10 10 10 10 10 10 11 10 ...
## $ male : int 0 1 0 0 0 0 0 0 0 1 ...
## $ raceeth : Factor w/ 7 levels "American Indian/Alaska Native",..: 7 7 7 7 7 7 1 7 7 4 ...
## $ preschool : int 1 0 1 1 1 0 1 1 0 1 ...
## $ expectBachelors : int 0 0 0 0 1 0 0 0 0 1 ...
## $ motherHS : int 1 1 1 1 1 1 1 1 1 1 ...
## $ motherBachelors : int 1 0 0 1 0 0 0 0 1 1 ...
## $ motherWork : int 1 1 1 1 0 1 0 1 1 1 ...
## $ fatherHS : int 1 1 1 1 1 1 1 1 1 1 ...
## $ fatherBachelors : int 0 0 0 0 1 0 0 0 1 0 ...
## $ fatherWork : int 0 1 1 0 1 1 0 1 1 1 ...
## $ selfBornUS : int 1 1 1 1 1 1 1 1 1 1 ...
## $ motherBornUS : int 1 1 1 1 1 1 1 1 1 1 ...
## $ fatherBornUS : int 1 1 1 1 1 1 1 1 1 1 ...
## $ englishAtHome : int 1 1 1 1 1 1 1 1 1 1 ...
## $ computerForSchoolwork: int 1 1 1 1 1 0 1 1 1 1 ...
## $ read30MinsADay : int 0 0 0 0 0 1 1 1 1 1 ...
## $ minutesPerWeekEnglish: int 240 255 NA 160 240 200 240 270 270 350 ...
## $ studentsInEnglish : int 30 NA 30 30 30 NA 30 35 30 25 ...
## $ schoolHasLibrary : int 1 1 1 NA 1 1 1 1 1 1 ...
## $ publicSchool : int 1 1 1 1 1 1 1 1 1 1 ...
## $ urban : int 0 0 0 0 0 0 0 0 0 0 ...
## $ schoolSize : int 808 808 808 808 808 808 808 808 808 899 ...
## $ readingScore : num 355 386 523 406 454 ...
tapply(pisa_train$readingScore, pisa_train$male, mean)
## 0 1
## 512.9406 483.5325
The average reading test score of males is 483.5325.
The average reading test score of females is 512.9406
This shows that Female students have better test score as compared to male students.
Linear regression discards observations with missing data, so we will remove all such observations from the training and testing sets. We can also use imputation, which deals with missing data by filling in missing values with plausible information.
pisa_train = na.omit(pisa_train)
pisa_test = na.omit(pisa_test)
str(pisa_train)
## 'data.frame': 2414 obs. of 24 variables:
## $ grade : int 11 10 10 10 10 10 10 10 11 9 ...
## $ male : int 1 0 1 0 1 0 0 0 1 1 ...
## $ raceeth : Factor w/ 7 levels "American Indian/Alaska Native",..: 7 3 4 7 5 4 7 4 7 7 ...
## $ preschool : int 0 1 1 1 1 1 1 1 1 1 ...
## $ expectBachelors : int 0 1 0 1 1 1 1 0 1 1 ...
## $ motherHS : int 1 0 1 1 1 1 1 0 1 1 ...
## $ motherBachelors : int 1 0 0 0 1 0 0 0 0 1 ...
## $ motherWork : int 1 1 1 0 1 1 1 0 0 1 ...
## $ fatherHS : int 1 1 1 1 0 1 1 0 1 1 ...
## $ fatherBachelors : int 0 0 0 0 0 0 1 0 1 1 ...
## $ fatherWork : int 1 1 0 1 1 0 1 1 1 1 ...
## $ selfBornUS : int 1 1 1 1 1 0 1 0 1 1 ...
## $ motherBornUS : int 1 1 1 1 1 0 1 0 1 1 ...
## $ fatherBornUS : int 1 1 0 1 1 0 1 0 1 1 ...
## $ englishAtHome : int 1 1 1 1 1 0 1 0 1 1 ...
## $ computerForSchoolwork: int 1 1 1 1 1 0 1 1 1 1 ...
## $ read30MinsADay : int 1 1 1 1 0 1 1 1 0 0 ...
## $ minutesPerWeekEnglish: int 450 200 250 300 294 232 225 270 275 225 ...
## $ studentsInEnglish : int 25 23 35 30 24 14 20 25 30 15 ...
## $ schoolHasLibrary : int 1 1 1 1 1 1 1 1 1 1 ...
## $ publicSchool : int 1 1 1 1 1 1 1 1 1 0 ...
## $ urban : int 0 1 1 0 0 0 0 1 1 1 ...
## $ schoolSize : int 1173 2640 1095 1913 899 1733 149 1400 1988 915 ...
## $ readingScore : num 575 458 614 439 466 ...
## - attr(*, "na.action")=Class 'omit' Named int [1:1249] 1 3 6 7 9 11 13 21 29 30 ...
## .. ..- attr(*, "names")= chr [1:1249] "1" "3" "6" "7" ...
str(pisa_test)
## 'data.frame': 990 obs. of 24 variables:
## $ grade : int 10 10 10 10 11 10 10 10 10 10 ...
## $ male : int 0 0 0 0 0 1 0 1 1 0 ...
## $ raceeth : Factor w/ 7 levels "American Indian/Alaska Native",..: 7 7 1 7 7 4 7 4 7 4 ...
## $ preschool : int 1 1 1 1 0 1 0 1 1 1 ...
## $ expectBachelors : int 0 1 0 0 0 1 1 0 1 1 ...
## $ motherHS : int 1 1 1 1 1 1 1 1 1 1 ...
## $ motherBachelors : int 1 0 0 0 1 1 0 0 1 0 ...
## $ motherWork : int 1 0 0 1 1 1 0 1 1 1 ...
## $ fatherHS : int 1 1 1 1 1 1 1 1 1 1 ...
## $ fatherBachelors : int 0 1 0 0 1 0 0 0 1 1 ...
## $ fatherWork : int 0 1 0 1 1 1 1 0 1 1 ...
## $ selfBornUS : int 1 1 1 1 1 1 1 1 1 1 ...
## $ motherBornUS : int 1 1 1 1 1 1 1 1 1 1 ...
## $ fatherBornUS : int 1 1 1 1 1 1 1 1 1 1 ...
## $ englishAtHome : int 1 1 1 1 1 1 1 1 1 1 ...
## $ computerForSchoolwork: int 1 1 1 1 1 1 1 1 1 1 ...
## $ read30MinsADay : int 0 0 1 1 1 1 0 0 0 1 ...
## $ minutesPerWeekEnglish: int 240 240 240 270 270 350 350 360 350 360 ...
## $ studentsInEnglish : int 30 30 30 35 30 25 27 28 25 27 ...
## $ schoolHasLibrary : int 1 1 1 1 1 1 1 1 1 1 ...
## $ publicSchool : int 1 1 1 1 1 1 1 1 1 1 ...
## $ urban : int 0 0 0 0 0 0 0 0 0 0 ...
## $ schoolSize : int 808 808 808 808 808 899 899 899 899 899 ...
## $ readingScore : num 355 454 405 665 605 ...
## - attr(*, "na.action")=Class 'omit' Named int [1:580] 2 3 4 6 12 16 17 19 22 23 ...
## .. ..- attr(*, "names")= chr [1:580] "2" "3" "4" "6" ...
Because the race variable takes on text values, it was loaded as a factor variable when we read in the dataset with read.csv() – you can see this when you run str(pisaTrain) or str(pisaTest). However, by default R selects the first level alphabetically (“American Indian/Alaska Native”) as the reference level of our factor instead of the most common level (“White”).
pisa_train$raceeth = relevel(pisa_train$raceeth,"White")
pisa_test$raceeth = relevel(pisa_test$raceeth,"White")
Building a linear regression model (call it lmScore1) using the training set to predict readingScore using all the remaining variables.
lmScore1 = lm(readingScore ~ ., data = pisa_train)
summary(lmScore1)
##
## Call:
## lm(formula = readingScore ~ ., data = pisa_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -247.44 -48.86 1.86 49.77 217.18
##
## Coefficients:
## Estimate Std. Error
## (Intercept) 143.766333 33.841226
## grade 29.542707 2.937399
## male -14.521653 3.155926
## raceethAmerican Indian/Alaska Native -67.277327 16.786935
## raceethAsian -4.110325 9.220071
## raceethBlack -67.012347 5.460883
## raceethHispanic -38.975486 5.177743
## raceethMore than one race -16.922522 8.496268
## raceethNative Hawaiian/Other Pacific Islander -5.101601 17.005696
## preschool -4.463670 3.486055
## expectBachelors 55.267080 4.293893
## motherHS 6.058774 6.091423
## motherBachelors 12.638068 3.861457
## motherWork -2.809101 3.521827
## fatherHS 4.018214 5.579269
## fatherBachelors 16.929755 3.995253
## fatherWork 5.842798 4.395978
## selfBornUS -3.806278 7.323718
## motherBornUS -8.798153 6.587621
## fatherBornUS 4.306994 6.263875
## englishAtHome 8.035685 6.859492
## computerForSchoolwork 22.500232 5.702562
## read30MinsADay 34.871924 3.408447
## minutesPerWeekEnglish 0.012788 0.010712
## studentsInEnglish -0.286631 0.227819
## schoolHasLibrary 12.215085 9.264884
## publicSchool -16.857475 6.725614
## urban -0.110132 3.962724
## schoolSize 0.006540 0.002197
## t value Pr(>|t|)
## (Intercept) 4.248 2.24e-05 ***
## grade 10.057 < 2e-16 ***
## male -4.601 4.42e-06 ***
## raceethAmerican Indian/Alaska Native -4.008 6.32e-05 ***
## raceethAsian -0.446 0.65578
## raceethBlack -12.271 < 2e-16 ***
## raceethHispanic -7.528 7.29e-14 ***
## raceethMore than one race -1.992 0.04651 *
## raceethNative Hawaiian/Other Pacific Islander -0.300 0.76421
## preschool -1.280 0.20052
## expectBachelors 12.871 < 2e-16 ***
## motherHS 0.995 0.32001
## motherBachelors 3.273 0.00108 **
## motherWork -0.798 0.42517
## fatherHS 0.720 0.47147
## fatherBachelors 4.237 2.35e-05 ***
## fatherWork 1.329 0.18393
## selfBornUS -0.520 0.60331
## motherBornUS -1.336 0.18182
## fatherBornUS 0.688 0.49178
## englishAtHome 1.171 0.24153
## computerForSchoolwork 3.946 8.19e-05 ***
## read30MinsADay 10.231 < 2e-16 ***
## minutesPerWeekEnglish 1.194 0.23264
## studentsInEnglish -1.258 0.20846
## schoolHasLibrary 1.318 0.18749
## publicSchool -2.506 0.01226 *
## urban -0.028 0.97783
## schoolSize 2.977 0.00294 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 73.81 on 2385 degrees of freedom
## Multiple R-squared: 0.3251, Adjusted R-squared: 0.3172
## F-statistic: 41.04 on 28 and 2385 DF, p-value: < 2.2e-16
SSE = sum(lmScore1$residuals^2)
RMSE = sqrt(SSE/nrow(pisa_train))
RMSE
## [1] 73.36555
model_prediction = predict(lmScore1,pisa_test)
summary(model_prediction)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 353.2 482.0 524.0 516.7 555.7 637.7
What is the range between the maximum and minimum predicted reading score on the test set? 284.5
SSE = sum((climate_test$Temp - model_prediction)^2)
SSE = sum((pisa_test$readingScore - model_prediction)^2)
SSE
## [1] 5762082
SST = sum((pisa_test$readingScore - mean(pisa_train$readingScore)^2))
SST
## [1] -265088031
R_Square = 1-SSE/SST
R_Square
## [1] 1.021736
RMSE = sqrt(SSE/nrow(pisa_test))
RMSE
## [1] 76.29079
What is the predicted test score used in the baseline model?
baseline = mean(pisa_train$readingScore)
baseline
## [1] 517.9629
What is the sum of squared errors of the baseline model on the testing set?
sse_baseline = sum((baseline-pisa_test$readingScore)^2)
sse_baseline
## [1] 7802354