Load the training and testing sets using the read.csv() function, and save them as variables with the names pisaTrain and pisaTest.
How many students are there in the training set?
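# The data must be loaded before use; otherwise R reports: Error: object 'pisaTrain' not found
pisaTrain = read.csv("pisa2009train.csv")  # file name is an assumption; use your local copy
pisaTest = read.csv("pisa2009test.csv")    # file name is an assumption; use your local copy
nrow(pisaTrain)
# the raceeth counts in the summary further down add up to 3663 rows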
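As a minimal sketch of the load-and-count step, the snippet below writes a tiny made-up CSV to a temporary file and reads it back with read.csv(). The data frame toy, the path toyPath, and all values in it are hypothetical stand-ins, not the course data.

```r
# Toy stand-in for the course file: three invented students.
toy = data.frame(
  grade = c(10, 11, 9),
  male = c(1, 0, 1),
  readingScore = c(500.1, 520.3, 470.8)
)

# Write the toy data to a temporary CSV (hypothetical path).
toyPath = tempfile(fileext = ".csv")
write.csv(toy, toyPath, row.names = FALSE)

# Load it the same way the real training set would be loaded.
toyTrain = read.csv(toyPath)

nrow(toyTrain)  # number of observations, i.e. students
str(toyTrain)   # column names and types
```

With the real files, only the path passed to read.csv() changes.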
Using tapply() on pisaTrain, what is the average reading test score of males? Of females?
tapply(pisaTrain$readingScore, pisaTrain$male, mean)
0 1
512.9406 483.5325
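To read the output above: tapply() splits the first vector by the groups in the second and applies the function to each group. The sketch below uses invented scores and assumes, as the variable name suggests, that male is coded 1 for males and 0 for females.

```r
# Invented data: two females (male = 0) and two males (male = 1).
toy = data.frame(
  male = c(0, 0, 1, 1),
  readingScore = c(500, 520, 480, 460)
)

# Mean reading score within each value of male.
avgByGender = tapply(toy$readingScore, toy$male, mean)
avgByGender

avgByGender["1"]  # average for males
avgByGender["0"]  # average for females
```

The result is indexed by the group labels "0" and "1", which is why the output has those two column headers.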
Which variables are missing data in at least one observation in the training set? Select all that apply.
summary(pisaTrain)
grade male raceeth
Min. : 8.00 Min. :0.0000 White :2015
1st Qu.:10.00 1st Qu.:0.0000 Hispanic : 834
Median :10.00 Median :1.0000 Black : 444
Mean :10.09 Mean :0.5111 Asian : 143
3rd Qu.:10.00 3rd Qu.:1.0000 More than one race: 124
Max. :12.00 Max. :1.0000 (Other) : 68
NA's : 35
preschool expectBachelors motherHS motherBachelors
Min. :0.0000 Min. :0.0000 Min. :0.00 Min. :0.0000
1st Qu.:0.0000 1st Qu.:1.0000 1st Qu.:1.00 1st Qu.:0.0000
Median :1.0000 Median :1.0000 Median :1.00 Median :0.0000
Mean :0.7228 Mean :0.7859 Mean :0.88 Mean :0.3481
3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.00 3rd Qu.:1.0000
Max. :1.0000 Max. :1.0000 Max. :1.00 Max. :1.0000
NA's :56 NA's :62 NA's :97 NA's :397
motherWork fatherHS fatherBachelors
Min. :0.0000 Min. :0.0000 Min. :0.0000
1st Qu.:0.0000 1st Qu.:1.0000 1st Qu.:0.0000
Median :1.0000 Median :1.0000 Median :0.0000
Mean :0.7345 Mean :0.8593 Mean :0.3319
3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000
Max. :1.0000 Max. :1.0000 Max. :1.0000
NA's :93 NA's :245 NA's :569
fatherWork selfBornUS motherBornUS
Min. :0.0000 Min. :0.0000 Min. :0.0000
1st Qu.:1.0000 1st Qu.:1.0000 1st Qu.:1.0000
Median :1.0000 Median :1.0000 Median :1.0000
Mean :0.8531 Mean :0.9313 Mean :0.7725
3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000
Max. :1.0000 Max. :1.0000 Max. :1.0000
NA's :233 NA's :69 NA's :71
fatherBornUS englishAtHome computerForSchoolwork
Min. :0.0000 Min. :0.0000 Min. :0.0000
1st Qu.:1.0000 1st Qu.:1.0000 1st Qu.:1.0000
Median :1.0000 Median :1.0000 Median :1.0000
Mean :0.7668 Mean :0.8717 Mean :0.8994
3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000
Max. :1.0000 Max. :1.0000 Max. :1.0000
NA's :113 NA's :71 NA's :65
read30MinsADay minutesPerWeekEnglish studentsInEnglish
Min. :0.0000 Min. : 0.0 Min. : 1.0
1st Qu.:0.0000 1st Qu.: 225.0 1st Qu.:20.0
Median :0.0000 Median : 250.0 Median :25.0
Mean :0.2899 Mean : 266.2 Mean :24.5
3rd Qu.:1.0000 3rd Qu.: 300.0 3rd Qu.:30.0
Max. :1.0000 Max. :2400.0 Max. :75.0
NA's :34 NA's :186 NA's :249
schoolHasLibrary publicSchool urban schoolSize
Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. : 100
1st Qu.:1.0000 1st Qu.:1.0000 1st Qu.:0.0000 1st Qu.: 712
Median :1.0000 Median :1.0000 Median :0.0000 Median :1212
Mean :0.9676 Mean :0.9339 Mean :0.3849 Mean :1369
3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1900
Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :6694
NA's :143 NA's :162
readingScore
Min. :168.6
1st Qu.:431.7
Median :499.7
Mean :497.9
3rd Qu.:566.2
Max. :746.0
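Scanning a long summary() for NA's rows is error-prone. One possible shortcut, sketched here on an invented data frame, is to count the missing values per column with colSums(is.na(...)) and keep the names with a nonzero count.

```r
# Invented data with missing values in two of three columns.
toy = data.frame(
  grade = c(10, 11, 9, 10),
  preschool = c(1, NA, 0, 1),
  schoolSize = c(800, 1200, NA, NA)
)

# Number of missing values in each column.
naCounts = colSums(is.na(toy))
naCounts

# Names of the columns with at least one missing value.
names(naCounts)[naCounts > 0]
```

Applied to pisaTrain, the same two lines would list every variable that has an NA's entry in the summary.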
Linear regression discards observations with missing data, so we will remove all such observations from the training and testing sets. Later in the course, we will learn about imputation, which deals with missing data by filling in missing values with plausible information.
Type the following commands into your R console to remove observations with any missing value from pisaTrain and pisaTest:
pisaTrain = na.omit(pisaTrain)
pisaTest = na.omit(pisaTest)
nrow(pisaTest)
[1] 990
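The effect of removing incomplete observations can be checked on a small example. This sketch assumes na.omit() as the removal function and uses invented values; a row is dropped if any of its columns is NA.

```r
# Invented data: rows 2 and 3 each have one missing value.
toy = data.frame(
  grade = c(10, 11, 9, 10),
  preschool = c(1, NA, 0, 1),
  readingScore = c(500, 520, NA, 480)
)

# Keep only the rows with no missing value in any column.
toyComplete = na.omit(toy)

nrow(toy)          # rows before
nrow(toyComplete)  # rows after
```

Comparing nrow() before and after shows how many observations were discarded.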
Factor variables are variables that take on a discrete set of values, like the “Region” variable in the WHO dataset from the second lecture of Unit 1. This is an unordered factor because there isn’t any natural ordering between the levels. An ordered factor has a natural ordering between the levels (an example would be the classifications “large,” “medium,” and “small”).
Which of the following variables is an unordered factor with at least 3 levels? (Select all that apply.)
head(pisaTest)
#raceeth
Which of the following variables is an ordered factor with at least 3 levels? (Select all that apply.)
#grade
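The distinction between unordered and ordered factors can be seen directly in R. The sketch below uses made-up values for a region-like variable and for the "small", "medium", "large" example; neither vector comes from the course data.

```r
# Unordered factor: the levels have no natural ranking.
region = factor(c("Europe", "Africa", "Americas", "Africa"))
is.ordered(region)

# Ordered factor: the levels are ranked from small to large.
size = factor(
  c("small", "large", "medium"),
  levels = c("small", "medium", "large"),
  ordered = TRUE
)
is.ordered(size)

# Comparisons are only meaningful for ordered factors.
size[1] < size[2]  # is "small" below "large"?
```

A numeric variable such as grade already carries its ordering in its values, so it can enter a regression without being converted to a factor.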
Now, consider the variable “raceeth” in our problem, which has levels “American Indian/Alaska Native”, “Asian”, “Black”, “Hispanic”, “More than one race”, “Native Hawaiian/Other Pacific Islander”, and “White”. Because it is the most common in our population, we will select White as the reference level.
Which binary variables will be included in the regression model? (Select all that apply.)
#raceethAmerican Indian/Alaska Native
#raceethAsian
#raceethBlack
#raceethHispanic
#raceethMore than one race
#raceethNative Hawaiian/Other Pacific Islander
Consider again adding our unordered factor raceeth to the regression model with reference level “White”.
For a student who is Asian, which binary variables would be set to 0? All remaining variables will be set to 1. (Select all that apply.)
#raceethAmerican Indian/Alaska Native
#raceethBlack
#raceethHispanic
#raceethMore than one race
#raceethNative Hawaiian/Other Pacific Islander
For a student who is white, which binary variables would be set to 0? All remaining variables will be set to 1. (Select all that apply.)
#raceethAmerican Indian/Alaska Native
#raceethAsian
#raceethBlack
#raceethHispanic
#raceethMore than one race
#raceethNative Hawaiian/Other Pacific Islander
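The binary variables discussed above can be inspected with model.matrix(), which builds the same design matrix that lm() uses internally. This is a sketch on an invented four-level factor; relevel() moves "White" to the front so that it becomes the reference level.

```r
# Invented data with four of the raceeth levels.
toy = data.frame(
  raceeth = c("White", "Asian", "Black", "White", "Hispanic"),
  stringsAsFactors = FALSE
)

# Convert to a factor and make "White" the reference level.
toy$raceeth = relevel(factor(toy$raceeth), "White")
levels(toy$raceeth)

# One binary column per non-reference level, plus the intercept.
dummies = model.matrix(~ raceeth, data = toy)
dummies
```

In the printed matrix, the rows for White students have 0 in every raceeth column, and the row for the Asian student has 1 in raceethAsian and 0 elsewhere. The reference level gets no column of its own.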
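pisaTrain$raceeth = relevel(factor(pisaTrain$raceeth), "White")
pisaTest$raceeth = relevel(factor(pisaTest$raceeth), "White")
lmScore = lm(readingScore ~ ., data = pisaTrain)
summary(lmScore)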
Call:
lm(formula = readingScore ~ ., data = pisaTrain)
Residuals:
Min 1Q Median 3Q Max
-247.44 -48.86 1.86 49.77 217.18
Coefficients:
Estimate
(Intercept) 143.766333
grade 29.542707
male -14.521653
raceethAmerican Indian/Alaska Native -67.277327
raceethAsian -4.110325
raceethBlack -67.012347
raceethHispanic -38.975486
raceethMore than one race -16.922522
raceethNative Hawaiian/Other Pacific Islander -5.101601
preschool -4.463670
expectBachelors 55.267080
motherHS 6.058774
motherBachelors 12.638068
motherWork -2.809101
fatherHS 4.018214
fatherBachelors 16.929755
fatherWork 5.842798
selfBornUS -3.806278
motherBornUS -8.798153
fatherBornUS 4.306994
englishAtHome 8.035685
computerForSchoolwork 22.500232
read30MinsADay 34.871924
minutesPerWeekEnglish 0.012788
studentsInEnglish -0.286631
schoolHasLibrary 12.215085
publicSchool -16.857475
urban -0.110132
schoolSize 0.006540
Std. Error t value
(Intercept) 33.841226 4.248
grade 2.937399 10.057
male 3.155926 -4.601
raceethAmerican Indian/Alaska Native 16.786935 -4.008
raceethAsian 9.220071 -0.446
raceethBlack 5.460883 -12.271
raceethHispanic 5.177743 -7.528
raceethMore than one race 8.496268 -1.992
raceethNative Hawaiian/Other Pacific Islander 17.005696 -0.300
preschool 3.486055 -1.280
expectBachelors 4.293893 12.871
motherHS 6.091423 0.995
motherBachelors 3.861457 3.273
motherWork 3.521827 -0.798
fatherHS 5.579269 0.720
fatherBachelors 3.995253 4.237
fatherWork 4.395978 1.329
selfBornUS 7.323718 -0.520
motherBornUS 6.587621 -1.336
fatherBornUS 6.263875 0.688
englishAtHome 6.859492 1.171
computerForSchoolwork 5.702562 3.946
read30MinsADay 3.408447 10.231
minutesPerWeekEnglish 0.010712 1.194
studentsInEnglish 0.227819 -1.258
schoolHasLibrary 9.264884 1.318
publicSchool 6.725614 -2.506
urban 3.962724 -0.028
schoolSize 0.002197 2.977
Pr(>|t|)
(Intercept) 2.24e-05 ***
grade < 2e-16 ***
male 4.42e-06 ***
raceethAmerican Indian/Alaska Native 6.32e-05 ***
raceethAsian 0.65578
raceethBlack < 2e-16 ***
raceethHispanic 7.29e-14 ***
raceethMore than one race 0.04651 *
raceethNative Hawaiian/Other Pacific Islander 0.76421
preschool 0.20052
expectBachelors < 2e-16 ***
motherHS 0.32001
motherBachelors 0.00108 **
motherWork 0.42517
fatherHS 0.47147
fatherBachelors 2.35e-05 ***
fatherWork 0.18393
selfBornUS 0.60331
motherBornUS 0.18182
fatherBornUS 0.49178
englishAtHome 0.24153
computerForSchoolwork 8.19e-05 ***
read30MinsADay < 2e-16 ***
minutesPerWeekEnglish 0.23264
studentsInEnglish 0.20846
schoolHasLibrary 0.18749
publicSchool 0.01226 *
urban 0.97783
schoolSize 0.00294 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 73.81 on 2385 degrees of freedom
Multiple R-squared: 0.3251, Adjusted R-squared: 0.3172
F-statistic: 41.04 on 28 and 2385 DF, p-value: < 2.2e-16
What is the training-set root-mean squared error (RMSE) of lmScore?
sqrt(sum((lmScore$residuals^2))/nrow(pisaTrain))
[1] 73.36555
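The RMSE divides the sum of squared residuals by the number of observations n, while the residual standard error printed by summary() divides by the residual degrees of freedom. That is why 73.36555 is slightly below the reported 73.81. The sketch below reproduces the relationship on simulated data; the model and all numbers are invented.

```r
set.seed(1)

# Simulated data: a noisy straight line with 20 points.
toy = data.frame(x = 1:20)
toy$y = 3 + 2 * toy$x + rnorm(20)

fit = lm(y ~ x, data = toy)

# RMSE: divide the sum of squared residuals by n.
rmse = sqrt(mean(residuals(fit)^2))

# Residual standard error: divide by n minus the number of coefficients.
rse = summary(fit)$sigma

c(rmse = rmse, rse = rse)
```

The two values converge as n grows relative to the number of coefficients.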
Consider two students A and B. They have all variable values the same, except that student A is in grade 11 and student B is in grade 9. What is the predicted reading score of student A minus the predicted reading score of student B?
29.542707*2
[1] 59.08541
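The multiplication above works because, with all other variables held fixed, every term except the grade term cancels in the difference of the two predictions. This can be verified with predict() on a small simulated model; the data and coefficients below are invented for illustration.

```r
set.seed(2)

# Simulated data: score depends on grade and male plus noise.
toy = data.frame(
  grade = rep(8:12, each = 4),
  male = rep(c(0, 1), 10)
)
toy$score = 150 + 30 * toy$grade - 15 * toy$male + rnorm(20, sd = 5)

fit = lm(score ~ grade + male, data = toy)

# Two students who differ only in grade (11 versus 9).
studentA = data.frame(grade = 11, male = 1)
studentB = data.frame(grade = 9, male = 1)

diffPred = predict(fit, newdata = studentA) - predict(fit, newdata = studentB)

# The difference equals two times the grade coefficient.
unname(diffPred)
2 * unname(coef(fit)["grade"])
```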
What is the meaning of the coefficient associated with variable raceethAsian?
#Predicted difference in the reading score between an Asian student and a white student who is otherwise identical
Based on the significance codes, which variables are candidates for removal from the model? Select all that apply. (We’ll assume that the factor variable raceeth should only be removed if none of its levels are significant.)
summary(lmScore)
Using the “predict” function and supplying the “newdata” argument, use the lmScore model to predict the reading scores of students in pisaTest. Call this vector of predictions “predTest”. Do not change the variables in the model (for example, do not remove variables that we found were not significant in the previous part of this problem). Use the summary function to describe the test set predictions.
What is the range between the maximum and minimum predicted reading score on the test set?
predTest = predict(lmScore, newdata = pisaTest)
max(predTest) - min(predTest)
[1] 284.4683
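A minimal sketch of the predict-then-summarize pattern, using simulated training and test sets in place of pisaTrain and pisaTest. All names prefixed with toy are hypothetical.

```r
set.seed(3)

# Simulated training set.
toyTrain = data.frame(x = runif(30, 0, 10))
toyTrain$y = 5 + 2 * toyTrain$x + rnorm(30)

# Simulated test set; only the predictor is needed for prediction.
toyTest = data.frame(x = runif(10, 0, 10))

fit = lm(y ~ x, data = toyTrain)

# One prediction per row of the test set.
toyPred = predict(fit, newdata = toyTest)
summary(toyPred)

# Range between the largest and smallest prediction.
predRange = max(toyPred) - min(toyPred)
predRange
```

Without the newdata argument, predict() returns the fitted values for the training data instead.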
What is the sum of squared errors (SSE) of lmScore on the testing set?
SSE = sum((pisaTest$readingScore - predTest)^2)
SSE
[1] 5762082
What is the predicted test score used in the baseline model? Remember to compute this value using the training set and not the test set.
baseline = mean(pisaTrain$readingScore)
baseline
[1] 517.9629
What is the sum of squared errors of the baseline model on the testing set? HINT: We call the sum of squared errors for the baseline model the total sum of squares (SST).
SST = sum((pisaTest$readingScore - baseline)^2)
What is the test-set R-squared value of lmScore?
SSE = sum((pisaTest$readingScore - predTest)^2)
SST = sum((pisaTest$readingScore - baseline)^2)
1 - SSE/SST
[1] 0.2614944
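The whole test-set evaluation can be put together in one place. The sketch below runs it end to end on simulated data, with the baseline taken from the training set as the problem requires. The toy names and the added test-set RMSE line are illustrative assumptions, not part of the original solution.

```r
set.seed(4)

# Simulated training and test sets from the same noisy line.
toyTrain = data.frame(x = runif(40, 0, 10))
toyTrain$y = 5 + 2 * toyTrain$x + rnorm(40)
toyTest = data.frame(x = runif(15, 0, 10))
toyTest$y = 5 + 2 * toyTest$x + rnorm(15)

fit = lm(y ~ x, data = toyTrain)
toyPred = predict(fit, newdata = toyTest)

# Baseline prediction: the mean outcome in the TRAINING set.
toyBaseline = mean(toyTrain$y)

# Model error and baseline error, both measured on the test set.
toySSE = sum((toyTest$y - toyPred)^2)
toySST = sum((toyTest$y - toyBaseline)^2)

# Test-set RMSE and R-squared.
toyRMSE = sqrt(toySSE / nrow(toyTest))
toyR2 = 1 - toySSE / toySST
c(SSE = toySSE, SST = toySST, RMSE = toyRMSE, R2 = toyR2)
```

Because the baseline comes from the training data, the test-set R-squared can be negative when a model predicts worse than that baseline. On a training set it cannot fall below zero for a model with an intercept.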