Load the training and testing sets with the names pisaTrain and pisaTest.
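The loading commands themselves are not shown here. A minimal sketch of what they might look like follows; the file names and the `loadPisa` helper are assumptions for illustration, not taken from this document.

```r
# Hypothetical file names; replace with the actual paths to the PISA CSV files.
trainPath = "pisa2009train.csv"
testPath = "pisa2009test.csv"

# Hypothetical helper. stringsAsFactors = TRUE makes text columns such as
# raceeth load as factors (the default became FALSE in R 4.0.0).
loadPisa = function(path) read.csv(path, stringsAsFactors = TRUE)

# Only read the files if they are present in the working directory.
if (file.exists(trainPath) && file.exists(testPath)) {
  pisaTrain = loadPisa(trainPath)
  pisaTest = loadPisa(testPath)
  str(pisaTrain)
  str(pisaTest)
}
```

The `str` calls produce structure listings of the kind shown next.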
## 'data.frame': 3663 obs. of 24 variables:
## $ grade : int 11 11 9 10 10 10 10 10 9 10 ...
## $ male : int 1 1 1 0 1 1 0 0 0 1 ...
## $ raceeth : Factor w/ 7 levels "American Indian/Alaska Native",..: NA 7 7 3 4 3 2 7 7 5 ...
## $ preschool : int NA 0 1 1 1 1 0 1 1 1 ...
## $ expectBachelors : int 0 0 1 1 0 1 1 1 0 1 ...
## $ motherHS : int NA 1 1 0 1 NA 1 1 1 1 ...
## $ motherBachelors : int NA 1 1 0 0 NA 0 0 NA 1 ...
## $ motherWork : int 1 1 1 1 1 1 1 0 1 1 ...
## $ fatherHS : int NA 1 1 1 1 1 NA 1 0 0 ...
## $ fatherBachelors : int NA 0 NA 0 0 0 NA 0 NA 0 ...
## $ fatherWork : int 1 1 1 1 0 1 NA 1 1 1 ...
## $ selfBornUS : int 1 1 1 1 1 1 0 1 1 1 ...
## $ motherBornUS : int 0 1 1 1 1 1 1 1 1 1 ...
## $ fatherBornUS : int 0 1 1 1 0 1 NA 1 1 1 ...
## $ englishAtHome : int 0 1 1 1 1 1 1 1 1 1 ...
## $ computerForSchoolwork: int 1 1 1 1 1 1 1 1 1 1 ...
## $ read30MinsADay : int 0 1 0 1 1 0 0 1 0 0 ...
## $ minutesPerWeekEnglish: int 225 450 250 200 250 300 250 300 378 294 ...
## $ studentsInEnglish : int NA 25 28 23 35 20 28 30 20 24 ...
## $ schoolHasLibrary : int 1 1 1 1 1 1 1 1 0 1 ...
## $ publicSchool : int 1 1 1 1 1 1 1 1 1 1 ...
## $ urban : int 1 0 0 1 1 0 1 0 1 0 ...
## $ schoolSize : int 673 1173 1233 2640 1095 227 2080 1913 502 899 ...
## $ readingScore : num 476 575 555 458 614 ...
## 'data.frame': 1570 obs. of 24 variables:
## $ grade : int 10 10 10 10 10 10 10 10 11 10 ...
## $ male : int 0 1 0 0 0 0 0 0 0 1 ...
## $ raceeth : Factor w/ 7 levels "American Indian/Alaska Native",..: 7 7 7 7 7 7 1 7 7 4 ...
## $ preschool : int 1 0 1 1 1 0 1 1 0 1 ...
## $ expectBachelors : int 0 0 0 0 1 0 0 0 0 1 ...
## $ motherHS : int 1 1 1 1 1 1 1 1 1 1 ...
## $ motherBachelors : int 1 0 0 1 0 0 0 0 1 1 ...
## $ motherWork : int 1 1 1 1 0 1 0 1 1 1 ...
## $ fatherHS : int 1 1 1 1 1 1 1 1 1 1 ...
## $ fatherBachelors : int 0 0 0 0 1 0 0 0 1 0 ...
## $ fatherWork : int 0 1 1 0 1 1 0 1 1 1 ...
## $ selfBornUS : int 1 1 1 1 1 1 1 1 1 1 ...
## $ motherBornUS : int 1 1 1 1 1 1 1 1 1 1 ...
## $ fatherBornUS : int 1 1 1 1 1 1 1 1 1 1 ...
## $ englishAtHome : int 1 1 1 1 1 1 1 1 1 1 ...
## $ computerForSchoolwork: int 1 1 1 1 1 0 1 1 1 1 ...
## $ read30MinsADay : int 0 0 0 0 0 1 1 1 1 1 ...
## $ minutesPerWeekEnglish: int 240 255 NA 160 240 200 240 270 270 350 ...
## $ studentsInEnglish : int 30 NA 30 30 30 NA 30 35 30 25 ...
## $ schoolHasLibrary : int 1 1 1 NA 1 1 1 1 1 1 ...
## $ publicSchool : int 1 1 1 1 1 1 1 1 1 1 ...
## $ urban : int 0 0 0 0 0 0 0 0 0 0 ...
## $ schoolSize : int 808 808 808 808 808 808 808 808 808 899 ...
## $ readingScore : num 355 386 523 406 454 ...
What is the average reading test score of males and females?
tapply(pisaTrain$readingScore, pisaTrain$male, mean)
## 0 1
## 512.9406 483.5325
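To make the grouping step concrete, here is a small sketch of what `tapply` does, using made-up scores rather than PISA data. The vectors `scores` and `male` are hypothetical.

```r
# Toy data (invented for illustration, not taken from the PISA files).
scores = c(500, 480, 520, 470)
male = c(0, 1, 0, 1)

# tapply splits scores by the values of male and applies mean to each group.
groupMeans = tapply(scores, male, mean)
groupMeans

# The same result by explicit subsetting.
manualMeans = c(mean(scores[male == 0]), mean(scores[male == 1]))
manualMeans
```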
Check which variables have missing data (NAs).
summary(pisaTrain)
## grade male raceeth
## Min. : 8.00 Min. :0.0000 White :2015
## 1st Qu.:10.00 1st Qu.:0.0000 Hispanic : 834
## Median :10.00 Median :1.0000 Black : 444
## Mean :10.09 Mean :0.5111 Asian : 143
## 3rd Qu.:10.00 3rd Qu.:1.0000 More than one race: 124
## Max. :12.00 Max. :1.0000 (Other) : 68
## NA's : 35
## preschool expectBachelors motherHS motherBachelors
## Min. :0.0000 Min. :0.0000 Min. :0.00 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:1.0000 1st Qu.:1.00 1st Qu.:0.0000
## Median :1.0000 Median :1.0000 Median :1.00 Median :0.0000
## Mean :0.7228 Mean :0.7859 Mean :0.88 Mean :0.3481
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.00 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.0000 Max. :1.00 Max. :1.0000
## NA's :56 NA's :62 NA's :97 NA's :397
## motherWork fatherHS fatherBachelors fatherWork
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:1.0000 1st Qu.:0.0000 1st Qu.:1.0000
## Median :1.0000 Median :1.0000 Median :0.0000 Median :1.0000
## Mean :0.7345 Mean :0.8593 Mean :0.3319 Mean :0.8531
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## NA's :93 NA's :245 NA's :569 NA's :233
## selfBornUS motherBornUS fatherBornUS englishAtHome
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:1.0000 1st Qu.:1.0000 1st Qu.:1.0000 1st Qu.:1.0000
## Median :1.0000 Median :1.0000 Median :1.0000 Median :1.0000
## Mean :0.9313 Mean :0.7725 Mean :0.7668 Mean :0.8717
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## NA's :69 NA's :71 NA's :113 NA's :71
## computerForSchoolwork read30MinsADay minutesPerWeekEnglish
## Min. :0.0000 Min. :0.0000 Min. : 0.0
## 1st Qu.:1.0000 1st Qu.:0.0000 1st Qu.: 225.0
## Median :1.0000 Median :0.0000 Median : 250.0
## Mean :0.8994 Mean :0.2899 Mean : 266.2
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.: 300.0
## Max. :1.0000 Max. :1.0000 Max. :2400.0
## NA's :65 NA's :34 NA's :186
## studentsInEnglish schoolHasLibrary publicSchool urban
## Min. : 1.0 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:20.0 1st Qu.:1.0000 1st Qu.:1.0000 1st Qu.:0.0000
## Median :25.0 Median :1.0000 Median :1.0000 Median :0.0000
## Mean :24.5 Mean :0.9676 Mean :0.9339 Mean :0.3849
## 3rd Qu.:30.0 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :75.0 Max. :1.0000 Max. :1.0000 Max. :1.0000
## NA's :249 NA's :143
## schoolSize readingScore
## Min. : 100 Min. :168.6
## 1st Qu.: 712 1st Qu.:431.7
## Median :1212 Median :499.7
## Mean :1369 Mean :497.9
## 3rd Qu.:1900 3rd Qu.:566.2
## Max. :6694 Max. :746.0
## NA's :162
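Reading the NA counts off a long `summary` listing is error-prone. One possible shortcut is to count them directly, sketched here on a toy data frame with invented values.

```r
# Toy data frame with a few missing values (invented for illustration).
toy = data.frame(
  grade = c(10, NA, 11, 9),
  preschool = c(1, 0, NA, NA),
  readingScore = c(476, 575, 555, 458)
)

# is.na() gives a logical matrix; colSums() counts the TRUEs in each column.
naCounts = colSums(is.na(toy))
naCounts

# Names of the variables that have at least one missing value.
names(naCounts)[naCounts > 0]
```

The same two calls should work unchanged on `pisaTrain`.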
Linear regression discards observations with missing data, so we will remove all such observations from the training and testing sets.
pisaTrain = na.omit(pisaTrain)
pisaTest = na.omit(pisaTest)
nrow(pisaTrain)
## [1] 2414
nrow(pisaTest)
## [1] 990
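The training set shrinks from 3663 to 2414 rows because `na.omit` drops every row that has a missing value in any column. A short sketch on invented data shows the behavior.

```r
# Toy data frame (invented for illustration): rows 2 and 3 each contain an NA.
toy = data.frame(
  grade = c(10, NA, 11, 9),
  preschool = c(1, 0, NA, 1),
  readingScore = c(476, 575, 555, 458)
)

# na.omit() keeps only the rows with no missing value in any column.
toyComplete = na.omit(toy)
nrow(toyComplete)

# complete.cases() flags those same rows, so this count matches.
sum(complete.cases(toy))
```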
To include unordered factors in a linear regression model, we define one level as the “reference level” and add a binary variable for each of the remaining levels. In this way, a factor with n levels is replaced by n-1 binary variables. The reference level is typically selected to be the most frequently occurring level in the dataset.
As an example, consider the unordered factor variable “color”, with levels “red”, “green”, and “blue”. If “green” were the reference level, then we would add binary variables “colorred” and “colorblue” to a linear regression problem. All red examples would have colorred=1 and colorblue=0. All blue examples would have colorred=0 and colorblue=1. All green examples would have colorred=0 and colorblue=0.
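The color example can be reproduced with `model.matrix`, which builds the same binary columns that `lm` builds internally. This is a sketch with an invented four-element factor.

```r
# Toy factor (invented for illustration).
color = factor(c("red", "green", "blue", "green"))

# Make "green" the reference level, as in the example above.
color = relevel(color, ref = "green")

# model.matrix() expands the factor into an intercept plus n-1 binary columns.
X = model.matrix(~ color)
X
```

The columns are `(Intercept)`, `colorblue`, and `colorred`. The green rows have zeros in both binary columns.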
Now, consider the variable “raceeth” in our problem, which has levels “American Indian/Alaska Native”, “Asian”, “Black”, “Hispanic”, “More than one race”, “Native Hawaiian/Other Pacific Islander”, and “White”. Because it is the most common in our population, we will select White as the reference level. An Asian student will have raceethAsian set to 1 and all other raceeth binary variables set to 0. Because “White” is the reference level, a white student will have all raceeth binary variables set to 0.
Set the reference level as the most common level (“White”).
pisaTrain$raceeth = relevel(pisaTrain$raceeth, "White")
pisaTest$raceeth = relevel(pisaTest$raceeth, "White")
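Instead of hard-coding "White", the most common level could be looked up from the data. The sketch below assumes a toy factor with invented values; `mostCommon` is a hypothetical name.

```r
# Toy factor (invented for illustration); "White" is the most frequent level.
raceeth = factor(c("White", "Asian", "White", "Black", "White", "Asian"))

# table() counts each level; which.max() finds the largest count.
mostCommon = names(which.max(table(raceeth)))
mostCommon

# relevel() moves the chosen level to the front, making it the reference.
raceeth = relevel(raceeth, ref = mostCommon)
levels(raceeth)
```

If this approach were used here, the level should be chosen from the training set and then applied to both sets, so that the two factors share the same reference.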
Build a linear regression model (call it lmScore) on the training set to predict readingScore from all the remaining variables.
lmScore = lm(readingScore ~ ., data = pisaTrain)
summary(lmScore)
##
## Call:
## lm(formula = readingScore ~ ., data = pisaTrain)
##
## Residuals:
## Min 1Q Median 3Q Max
## -247.44 -48.86 1.86 49.77 217.18
##
## Coefficients:
## Estimate Std. Error
## (Intercept) 143.766333 33.841226
## grade 29.542707 2.937399
## male -14.521653 3.155926
## raceethAmerican Indian/Alaska Native -67.277327 16.786935
## raceethAsian -4.110325 9.220071
## raceethBlack -67.012347 5.460883
## raceethHispanic -38.975486 5.177743
## raceethMore than one race -16.922522 8.496268
## raceethNative Hawaiian/Other Pacific Islander -5.101601 17.005696
## preschool -4.463670 3.486055
## expectBachelors 55.267080 4.293893
## motherHS 6.058774 6.091423
## motherBachelors 12.638068 3.861457
## motherWork -2.809101 3.521827
## fatherHS 4.018214 5.579269
## fatherBachelors 16.929755 3.995253
## fatherWork 5.842798 4.395978
## selfBornUS -3.806278 7.323718
## motherBornUS -8.798153 6.587621
## fatherBornUS 4.306994 6.263875
## englishAtHome 8.035685 6.859492
## computerForSchoolwork 22.500232 5.702562
## read30MinsADay 34.871924 3.408447
## minutesPerWeekEnglish 0.012788 0.010712
## studentsInEnglish -0.286631 0.227819
## schoolHasLibrary 12.215085 9.264884
## publicSchool -16.857475 6.725614
## urban -0.110132 3.962724
## schoolSize 0.006540 0.002197
## t value Pr(>|t|)
## (Intercept) 4.248 2.24e-05 ***
## grade 10.057 < 2e-16 ***
## male -4.601 4.42e-06 ***
## raceethAmerican Indian/Alaska Native -4.008 6.32e-05 ***
## raceethAsian -0.446 0.65578
## raceethBlack -12.271 < 2e-16 ***
## raceethHispanic -7.528 7.29e-14 ***
## raceethMore than one race -1.992 0.04651 *
## raceethNative Hawaiian/Other Pacific Islander -0.300 0.76421
## preschool -1.280 0.20052
## expectBachelors 12.871 < 2e-16 ***
## motherHS 0.995 0.32001
## motherBachelors 3.273 0.00108 **
## motherWork -0.798 0.42517
## fatherHS 0.720 0.47147
## fatherBachelors 4.237 2.35e-05 ***
## fatherWork 1.329 0.18393
## selfBornUS -0.520 0.60331
## motherBornUS -1.336 0.18182
## fatherBornUS 0.688 0.49178
## englishAtHome 1.171 0.24153
## computerForSchoolwork 3.946 8.19e-05 ***
## read30MinsADay 10.231 < 2e-16 ***
## minutesPerWeekEnglish 1.194 0.23264
## studentsInEnglish -1.258 0.20846
## schoolHasLibrary 1.318 0.18749
## publicSchool -2.506 0.01226 *
## urban -0.028 0.97783
## schoolSize 2.977 0.00294 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 73.81 on 2385 degrees of freedom
## Multiple R-squared: 0.3251, Adjusted R-squared: 0.3172
## F-statistic: 41.04 on 28 and 2385 DF, p-value: < 2.2e-16
- The training-set R-squared is 0.3251.
- This R-squared is low, but that does not necessarily imply that the model is of poor quality. More often than not, it simply means that the prediction problem at hand (predicting a student’s test score based on demographic and school-related variables) is more difficult than other prediction problems (like predicting a team’s number of wins from their runs scored and allowed, or predicting the quality of wine from weather conditions).
- The coefficient 29.54 on grade is the predicted difference in reading score between two students who are identical except for a one-grade difference. If student A is two grades above student B and they are otherwise identical, the model predicts a reading score for student A that is 2*29.54 = 59.08 points higher.
- The coefficient on raceethAsian is the predicted difference in reading score between an Asian student and a white student who is otherwise identical.
- The variables significant at the 0.05 level are grade, male, raceeth (the American Indian/Alaska Native, Black, Hispanic, and More than one race levels), expectBachelors, motherBachelors, fatherBachelors, computerForSchoolwork, read30MinsADay, publicSchool, and schoolSize.
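The significant terms can also be pulled out of the coefficient table programmatically. The sketch below is one way to do it, demonstrated on R's built-in mtcars data so that it runs on its own; `significantTerms` and `toyModel` are hypothetical names. The last lines repeat the grade arithmetic using the coefficient from the summary above.

```r
# Fit a small model on R's built-in mtcars data, purely for illustration.
toyModel = lm(mpg ~ wt + hp, data = mtcars)

# Hypothetical helper: names of terms whose p-value is below alpha.
significantTerms = function(model, alpha = 0.05) {
  coefTable = summary(model)$coefficients
  rownames(coefTable)[coefTable[, "Pr(>|t|)"] < alpha]
}
significantTerms(toyModel)

# Worked arithmetic for the grade coefficient reported in the summary:
# a two-grade difference corresponds to twice the one-grade coefficient.
gradeCoef = 29.542707
2 * gradeCoef
```

With the model in this document, the call would be `significantTerms(lmScore)`. The result includes the intercept whenever its p-value is below the threshold.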
What is the training-set root-mean-squared error (RMSE) of lmScore?
SSE = sum(lmScore$residuals^2)
SSE
## [1] 12993365
- The training-set RMSE can be computed by first computing the SSE
RMSE = sqrt(SSE / nrow(pisaTrain))
RMSE
## [1] 73.36555
- and then dividing the SSE by the number of observations and taking the square root
sqrt(mean(lmScore$residuals^2))
## [1] 73.36555
- An alternative way of getting this answer is the single command above, which takes the square root of the mean squared residual.
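The two routes to the RMSE can be checked against each other on a small example. This is a sketch with invented numbers; `rmse` is a hypothetical helper, not a base R function.

```r
# Hypothetical helper: root-mean-squared error of predictions.
rmse = function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# Toy values (invented for illustration).
actual = c(1, 2, 3)
predicted = c(1, 2, 5)

# Route 1: SSE first, then divide by n and take the square root.
sse = sum((actual - predicted)^2)
viaSSE = sqrt(sse / length(actual))

# Route 2: square root of the mean squared error in one step.
direct = rmse(actual, predicted)

# Both routes give the same value.
all.equal(viaSSE, direct)
```

Note that this RMSE divides by the number of observations (2414), whereas the residual standard error of 73.81 in the model summary divides by the residual degrees of freedom (2385), which is why the two figures differ slightly.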
Use the lmScore model to predict the reading scores of students in pisaTest
predTest = predict(lmScore, newdata=pisaTest)
summary(predTest)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 353.2 482.0 524.0 516.7 555.7 637.7
- The maximum predicted reading score is 637.7 and the minimum is 353.2, so the range is 637.7 - 353.2 = 284.5.
sum((predTest-pisaTest$readingScore)^2)
## [1] 5762082
- The sum of squared errors (SSE) of lmScore on the testing set is 5762082.
sqrt(mean((predTest-pisaTest$readingScore)^2))
## [1] 76.29079
- The root-mean-squared error (RMSE) of lmScore on the testing set is 76.29079.
baseline = mean(pisaTrain$readingScore)
baseline
## [1] 517.9629
- The baseline model predicts the same score for every student, namely the mean reading score of the training set: 517.9629.
sum((baseline-pisaTest$readingScore)^2)
## [1] 7802354
- The sum of squared errors of the baseline model on the testing set (SST) is 7802354.
- The test-set R^2 is defined as 1 - SSE/SST, where SSE is the sum of squared errors of the model on the test set and SST is the sum of squared errors of the baseline model (the training-set mean) on the test set. For this model, R^2 = 1 - 5762082/7802354 = 0.2614944.
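The test-set evaluation steps can be wrapped in one function. The sketch below uses invented numbers; `testR2` is a hypothetical helper. The final line reproduces the arithmetic from the SSE and SST figures reported above.

```r
# Hypothetical helper: out-of-sample R^2 against a training-mean baseline.
testR2 = function(actual, predicted, trainMean) {
  sse = sum((predicted - actual)^2)   # model errors on the test set
  sst = sum((trainMean - actual)^2)   # baseline errors on the test set
  1 - sse / sst
}

# Toy values (invented for illustration).
toyR2 = testR2(actual = c(1, 2, 3, 4),
               predicted = c(1.5, 2, 2.5, 4),
               trainMean = 2)
toyR2

# Reproducing the arithmetic from the figures reported above.
1 - 5762082 / 7802354
```

With the objects in this document, the call would be `testR2(pisaTest$readingScore, predTest, mean(pisaTrain$readingScore))`. The baseline uses the training-set mean because the test-set mean would not be known at prediction time.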