§ 1.1 How many students are there in the training set?
nrow(pisaTrain)
[1] 3663
§ 1.2 What is the average reading test score of males and females?
tapply(pisaTrain$readingScore, pisaTrain$male, mean)
       0        1 
512.9406 483.5325 
print(c("males = 483.5325", "females = 512.9406"))
[1] "males = 483.5325" "females = 512.9406"
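The group means above come from `tapply`, which splits the first vector by the second and applies a function to each piece. A minimal sketch on a made-up data frame (the `toy` data and its values are hypothetical, not PISA data):

```r
# Hypothetical toy data: four students, two groups
toy <- data.frame(score = c(500, 520, 480, 460),
                  male  = c(0, 0, 1, 1))

# Mean score within each level of `male`
groupMeans <- tapply(toy$score, toy$male, mean)
groupMeans

# Looking a group up by name avoids mixing up which column is which
groupMeans[["1"]]  # mean for male == 1
groupMeans[["0"]]  # mean for male == 0
```

Indexing by the level name rather than by position is one way to keep the male and female labels from being swapped when the result is reported.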
§ 1.3 Which variables are missing data in at least one observation in the training set? Select all that apply.
pisaTrain %>% is.na %>% colSums
                grade                  male               raceeth             preschool
                    0                     0                    35                    56
      expectBachelors              motherHS       motherBachelors            motherWork
                   62                    97                   397                    93
             fatherHS       fatherBachelors            fatherWork            selfBornUS
                  245                   569                   233                    69
         motherBornUS          fatherBornUS         englishAtHome computerForSchoolwork
                   71                   113                    71                    65
       read30MinsADay minutesPerWeekEnglish     studentsInEnglish      schoolHasLibrary
                   34                   186                   249                   143
         publicSchool                 urban            schoolSize          readingScore
                    0                     0                   162                     0
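To answer the question directly, the counts can be filtered down to just the columns with at least one missing value. A minimal sketch on a made-up data frame (`toy` and `naCounts` are hypothetical names):

```r
# Hypothetical toy data with some missing values
toy <- data.frame(a = c(1, NA, 3),
                  b = c("x", "y", "z"),
                  c = c(NA, NA, 1))

naCounts <- colSums(is.na(toy))   # number of NAs in each column
naCounts

# Names of the columns that have at least one NA
names(naCounts)[naCounts > 0]
```

Applied to `pisaTrain`, the same filter would presumably list every variable except grade, male, publicSchool, urban, and readingScore, which are the only columns with a count of 0 above.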
§ 1.4 How many observations are now in the training set?
pisaTrain = na.omit(pisaTrain)  # drop rows with any missing value
pisaTest = na.omit(pisaTest)
nrow(pisaTrain)
[1] 2414
How many observations are now in the testing set?
nrow(pisaTest)
[1] 990
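Dropping incomplete rows is why the training set shrinks from 3663 to 2414 observations. A minimal sketch of listwise deletion on a made-up data frame, assuming `na.omit` is the tool used (the `toy` data is hypothetical):

```r
# Hypothetical toy data: rows 2 and 3 each have one missing value
toy <- data.frame(x = c(1, 2, NA, 4),
                  y = c("a", NA, "c", "d"))

complete <- na.omit(toy)    # keeps only rows with no NA in any column

nrow(toy)                   # rows before
nrow(complete)              # rows after
sum(complete.cases(toy))    # same count without building a new data frame
```

A row is dropped if any single column is missing, so the loss can be much larger than the NA count of any one variable.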
§ 2.1
Which of the following variables is an unordered factor with at least 3 levels? (Select all that apply.)
print("raceeth")
[1] "raceeth"
Which of the following variables is an ordered factor with at least 3 levels? (Select all that apply.)
print("grade")
[1] "grade"
§ 2.2
Now, consider the variable “raceeth” in our problem, which has levels “American Indian/Alaska Native”, “Asian”, “Black”, “Hispanic”, “More than one race”, “Native Hawaiian/Other Pacific Islander”, and “White”. Because it is the most common in our population, we will select White as the reference level.
Which binary variables will be included in the regression model? (Select all that apply.)
print("we would create all these variables except for raceethWhite")
[1] "we would create all these variables except for raceethWhite"
§ 2.3 Consider again adding our unordered factor race to the regression model with reference level “White”.
For a student who is Asian, which binary variables would be set to 0? All remaining variables will be set to 1. (Select all that apply.)
print("1 : raceethAsian")
[1] "1 : raceethAsian"
print("0 : raceethAmerican Indian/Alaska Native, raceethBlack, raceethHispanic, raceethMore than one race, raceethNative Hawaiian/Other Pacific Islander")
[1] "0 : raceethAmerican Indian/Alaska Native, raceethBlack, raceethHispanic, raceethMore than one race, raceethNative Hawaiian/Other Pacific Islander"
For a student who is White, which binary variables would be set to 0? All remaining variables will be set to 1. (Select all that apply.)
print("raceethAmerican Indian/Alaska Native, raceethAsian, raceethBlack, raceethHispanic, raceethMore than one race, raceethNative Hawaiian/Other Pacific Islander")
[1] "raceethAmerican Indian/Alaska Native, raceethAsian, raceethBlack, raceethHispanic, raceethMore than one race, raceethNative Hawaiian/Other Pacific Islander"
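The dummy coding described in 2.2 and 2.3 can be inspected with `model.matrix`, which builds the 0/1 columns that `lm` uses internally. A minimal sketch with a made-up factor that holds only three of the levels (the `race` vector is hypothetical):

```r
# Hypothetical toy factor with three of the levels named above
race <- factor(c("Asian", "White", "Black", "White"))
race <- relevel(race, ref = "White")   # make White the reference level

# One 0/1 column per non-reference level, plus the intercept
dummies <- model.matrix(~ race)
dummies
```

The Asian row has `raceAsian` equal to 1 and `raceBlack` equal to 0, while the White rows have 0 in every dummy column, since the reference level gets no column of its own.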
§ 3.1 What is the Multiple R-squared value of lmScore on the training set?
lmScore = lm(readingScore ~ ., data = pisaTrain)
summary(lmScore)
Call:
lm(formula = readingScore ~ ., data = pisaTrain)
Residuals:
Min 1Q Median 3Q Max
-247.44 -48.86 1.86 49.77 217.18
Coefficients:
                                                 Estimate Std. Error t value Pr(>|t|)
(Intercept)                                    143.766333  33.841226   4.248 2.24e-05 ***
grade                                           29.542707   2.937399  10.057  < 2e-16 ***
male                                           -14.521653   3.155926  -4.601 4.42e-06 ***
raceethAmerican Indian/Alaska Native           -67.277327  16.786935  -4.008 6.32e-05 ***
raceethAsian                                    -4.110325   9.220071  -0.446  0.65578
raceethBlack                                   -67.012347   5.460883 -12.271  < 2e-16 ***
raceethHispanic                                -38.975486   5.177743  -7.528 7.29e-14 ***
raceethMore than one race                      -16.922522   8.496268  -1.992  0.04651 *
raceethNative Hawaiian/Other Pacific Islander   -5.101601  17.005696  -0.300  0.76421
preschool                                       -4.463670   3.486055  -1.280  0.20052
expectBachelors                                 55.267080   4.293893  12.871  < 2e-16 ***
motherHS                                         6.058774   6.091423   0.995  0.32001
motherBachelors                                 12.638068   3.861457   3.273  0.00108 **
motherWork                                      -2.809101   3.521827  -0.798  0.42517
fatherHS                                         4.018214   5.579269   0.720  0.47147
fatherBachelors                                 16.929755   3.995253   4.237 2.35e-05 ***
fatherWork                                       5.842798   4.395978   1.329  0.18393
selfBornUS                                      -3.806278   7.323718  -0.520  0.60331
motherBornUS                                    -8.798153   6.587621  -1.336  0.18182
fatherBornUS                                     4.306994   6.263875   0.688  0.49178
englishAtHome                                    8.035685   6.859492   1.171  0.24153
computerForSchoolwork                           22.500232   5.702562   3.946 8.19e-05 ***
read30MinsADay                                  34.871924   3.408447  10.231  < 2e-16 ***
minutesPerWeekEnglish                            0.012788   0.010712   1.194  0.23264
studentsInEnglish                               -0.286631   0.227819  -1.258  0.20846
schoolHasLibrary                                12.215085   9.264884   1.318  0.18749
publicSchool                                   -16.857475   6.725614  -2.506  0.01226 *
urban                                           -0.110132   3.962724  -0.028  0.97783
schoolSize                                       0.006540   0.002197   2.977  0.00294 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 73.81 on 2385 degrees of freedom
Multiple R-squared: 0.3251, Adjusted R-squared: 0.3172
F-statistic: 41.04 on 28 and 2385 DF, p-value: < 2.2e-16
print(0.3251)
[1] 0.3251
§ 3.2 What is the training-set root-mean squared error (RMSE) of lmScore?
RMSE = sqrt(mean(lmScore$residuals^2))
RMSE
[1] 73.36555
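The RMSE is slightly smaller than the residual standard error of 73.81 reported by `summary(lmScore)` because the two divide the same sum of squared errors by different counts: the residual degrees of freedom (2385) versus the number of training rows (2414). A quick arithmetic check using only figures from the output above (the variable names are hypothetical):

```r
# Values copied from the summary(lmScore) output
rse <- 73.81   # residual standard error
df  <- 2385    # residual degrees of freedom
n   <- 2414    # training rows

# RSE = sqrt(SSE / df), RMSE = sqrt(SSE / n)
sse  <- rse^2 * df
rmse <- sqrt(sse / n)
rmse   # close to the reported 73.36555, up to rounding of rse
```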
§ 3.3 Consider two students A and B. They have all variable values the same, except that student A is in grade 11 and student B is in grade 9. What is the predicted reading score of student A minus the predicted reading score of student B?
29.54*(11-9)
[1] 59.08
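In a linear model, the gap between two predictions that differ in a single variable is that variable's coefficient times the difference in its value, which is why 29.54 is multiplied by 2 here. A minimal sketch confirming this on simulated data (the `toy` data, the `hours` variable, and the fitted model are all hypothetical):

```r
set.seed(1)

# Hypothetical simulated data: score depends on grade and hours
toy <- data.frame(grade = sample(8:12, 50, replace = TRUE),
                  hours = runif(50))
toy$score <- 200 + 30 * toy$grade + 10 * toy$hours + rnorm(50)

fit <- lm(score ~ grade + hours, data = toy)

# Two students identical except for grade
A <- data.frame(grade = 11, hours = 0.5)
B <- data.frame(grade = 9,  hours = 0.5)

gap <- predict(fit, A) - predict(fit, B)

# The gap equals the grade coefficient times (11 - 9)
all.equal(unname(gap), 2 * unname(coef(fit)["grade"]))
```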
§ 3.4 What is the meaning of the coefficient associated with variable raceethAsian?
print("Predicted difference in the reading score between an Asian student and a white student who is otherwise identical")
[1] "Predicted difference in the reading score between an Asian student and a white student who is otherwise identical"
§ 3.5 Based on the significance codes, which variables are candidates for removal from the model? Select all that apply. (We’ll assume that the factor variable raceeth should only be removed if none of its levels are significant.)
print("preschool, motherHS, motherWork, fatherHS, fatherWork, selfBornUS, motherBornUS, fatherBornUS, englishAtHome, minutesPerWeekEnglish, studentsInEnglish, schoolHasLibrary, urban")
[1] "preschool, motherHS, motherWork, fatherHS, fatherWork, selfBornUS, motherBornUS, fatherBornUS, englishAtHome, minutesPerWeekEnglish, studentsInEnglish, schoolHasLibrary, urban"
§ 4.1 What is the range between the maximum and minimum predicted reading score on the test set?
predictscore = predict(lmScore, newdata = pisaTest)
summary(predictscore)
Min. 1st Qu. Median Mean 3rd Qu. Max.
353.2 482.0 524.0 516.7 555.7 637.7
637.7 - 353.2
[1] 284.5
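Because `summary` prints only about four significant digits by default, subtracting its rounded values can lose a little precision. A minimal sketch of computing the range directly, using a made-up prediction vector (`preds` is hypothetical):

```r
# Hypothetical predictions standing in for predictscore
preds <- c(353.21, 482.04, 524.00, 637.68)

range(preds)         # minimum and maximum
diff(range(preds))   # maximum minus minimum
```

On the real data, `diff(range(predictscore))` would give the same quantity without retyping numbers from the summary.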
§ 4.2 What is the sum of squared errors (SSE) of lmScore on the testing set?
SSE1 = sum((predictscore - pisaTest$readingScore)^2)
SSE1
[1] 5762082
What is the root-mean squared error (RMSE) of lmScore on the testing set?
RMSE1 = sqrt(SSE1 / nrow(pisaTest))
RMSE1
[1] 76.29079
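The two test-set figures are linked: RMSE is the square root of SSE divided by the number of test observations. A quick check using only the numbers printed above (the variable names are hypothetical):

```r
# Values copied from the output above
sse   <- 5762082   # test-set sum of squared errors
nTest <- 990       # rows in the testing set

rmse <- sqrt(sse / nTest)
rmse   # close to the reported 76.29079
```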
§ 4.3
What is the predicted test score used in the baseline model? Remember to compute this value using the training set and not the test set.
baseline = mean(pisaTrain$readingScore)
baseline
[1] 517.9629
What is the sum of squared errors of the baseline model on the testing set? HINT: We call the sum of squared errors for the baseline model the total sum of squares (SST).
SST1 = sum((baseline - pisaTest$readingScore)^2)
SST1
[1] 7802354
§ 4.4
What is the test-set R-squared value of lmScore?
1 - SSE1/SST1
[1] 0.2614944
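The test-set R-squared compares the model's error with the error of always predicting the training-set mean. A minimal sketch that wraps the calculation in a helper and checks it against the totals printed above (`testR2` and its arguments are hypothetical names, and the toy numbers are made up):

```r
# Hypothetical helper: out-of-sample R-squared with a training-set baseline
testR2 <- function(actual, predicted, trainMean) {
  sse <- sum((actual - predicted)^2)   # model error on the test set
  sst <- sum((actual - trainMean)^2)   # baseline error on the test set
  1 - sse / sst
}

# Check against the SSE and SST reported above
1 - 5762082 / 7802354

# Toy usage with made-up numbers
testR2(actual = c(10, 20, 30), predicted = c(12, 18, 33), trainMean = 15)
```

Unlike the training-set value, this quantity can be negative when the model predicts worse than the baseline on new data.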