【1.1】 How many students are there in the training set?
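The data frames `ptr` (training) and `pte` (testing) used throughout are never defined in these notes. A minimal sketch of loading them and counting students, assuming CSV files named `pisa2009train.csv` and `pisa2009test.csv` in the working directory (the file names and the helper `load_pisa` are assumptions, not taken from the notes):

```r
# Hypothetical helper: read one PISA CSV, or return NULL if the file is absent.
load_pisa <- function(path) {
  if (!file.exists(path)) {
    return(NULL)
  }
  # stringsAsFactors = TRUE reads raceeth as a factor, which relevel() needs later.
  read.csv(path, stringsAsFactors = TRUE)
}

ptr <- load_pisa("pisa2009train.csv")  # assumed file name
pte <- load_pisa("pisa2009test.csv")   # assumed file name

# The number of students is the number of rows in the training set.
if (!is.null(ptr)) print(nrow(ptr))
```

The guard on `file.exists()` only keeps the sketch runnable when the files are missing; with the real data in place, `nrow(ptr)` answers the question directly.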
【1.2】 Using tapply() on pisaTrain, what is the average reading test score of males? Of females?
tapply(ptr$readingScore, ptr$male, mean)
       0        1
529.4637 506.5191
【1.3】 Which variables are missing data in at least one observation in the training set? Select all that apply.
raceeth, preschool, expectBachelors
motherHS, motherBachelors, motherWork
fatherHS, fatherBachelors, fatherWork
selfBornUS, motherBornUS, fatherBornUS
englishAtHome, computerForSchoolwork, read30MinsADay
minutesPerWeekEnglish, studentsInEnglish, schoolHasLibrary
schoolSize
summary(ptr)
grade male
Min. : 8.00 Min. :0.0000
1st Qu.:10.00 1st Qu.:0.0000
Median :10.00 Median :1.0000
Mean :10.13 Mean :0.5012
3rd Qu.:10.00 3rd Qu.:1.0000
Max. :12.00 Max. :1.0000
raceeth preschool
American Indian/Alaska Native : 20 Min. :0.0000
Asian : 95 1st Qu.:0.0000
Black : 228 Median :1.0000
Hispanic : 500 Mean :0.7274
More than one race : 81 3rd Qu.:1.0000
Native Hawaiian/Other Pacific Islander: 20 Max. :1.0000
White :1470
expectBachelors motherHS motherBachelors motherWork
Min. :0.0000 Min. :0.000 Min. :0.0000 Min. :0.0000
1st Qu.:1.0000 1st Qu.:1.000 1st Qu.:0.0000 1st Qu.:0.0000
Median :1.0000 Median :1.000 Median :0.0000 Median :1.0000
Mean :0.8343 Mean :0.896 Mean :0.3637 Mean :0.7357
3rd Qu.:1.0000 3rd Qu.:1.000 3rd Qu.:1.0000 3rd Qu.:1.0000
Max. :1.0000 Max. :1.000 Max. :1.0000 Max. :1.0000
fatherHS fatherBachelors fatherWork selfBornUS
Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
1st Qu.:1.0000 1st Qu.:0.0000 1st Qu.:1.0000 1st Qu.:1.0000
Median :1.0000 Median :0.0000 Median :1.0000 Median :1.0000
Mean :0.8741 Mean :0.3484 Mean :0.8571 Mean :0.9362
3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000
Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
motherBornUS fatherBornUS englishAtHome computerForSchoolwork
Min. :0.00 Min. :0.0000 Min. :0.0000 Min. :0.0000
1st Qu.:1.00 1st Qu.:1.0000 1st Qu.:1.0000 1st Qu.:1.0000
Median :1.00 Median :1.0000 Median :1.0000 Median :1.0000
Mean :0.79 Mean :0.7854 Mean :0.8815 Mean :0.9155
3rd Qu.:1.00 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000
Max. :1.00 Max. :1.0000 Max. :1.0000 Max. :1.0000
read30MinsADay minutesPerWeekEnglish studentsInEnglish schoolHasLibrary
Min. :0.0000 Min. : 0.0 Min. : 1.00 Min. :0.0000
1st Qu.:0.0000 1st Qu.: 225.0 1st Qu.:20.00 1st Qu.:1.0000
Median :0.0000 Median : 250.0 Median :25.00 Median :1.0000
Mean :0.3016 Mean : 269.8 Mean :24.56 Mean :0.9714
3rd Qu.:1.0000 3rd Qu.: 300.0 3rd Qu.:30.00 3rd Qu.:1.0000
Max. :1.0000 Max. :1680.0 Max. :75.00 Max. :1.0000
publicSchool urban schoolSize readingScore
Min. :0.0000 Min. :0.0000 Min. : 100 Min. :244.5
1st Qu.:1.0000 1st Qu.:0.0000 1st Qu.: 712 1st Qu.:455.8
Median :1.0000 Median :0.0000 Median :1233 Median :520.2
Mean :0.9176 Mean :0.3629 Mean :1372 Mean :518.0
3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1900 3rd Qu.:581.4
Max. :1.0000 Max. :1.0000 Max. :6694 Max. :746.0
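The summary printed above contains no `NA's` rows, and its raceeth counts total 2414, so it appears to have been produced after incomplete rows were dropped. To list the variables with missing values directly, one option is to count `NA`s per column on the raw data. A minimal sketch on a made-up data frame (`toy` and its values are illustrative only):

```r
# Toy data frame standing in for the raw training set (values are made up).
toy <- data.frame(
  grade      = c(10, 10, 11, 9),
  male       = c(1, 0, 1, 0),
  preschool  = c(1, NA, 0, 1),
  schoolSize = c(800, 1200, NA, NA)
)

# Count missing values per column, then keep the columns with at least one.
na_counts <- colSums(is.na(toy))
names(na_counts)[na_counts > 0]
```

Applied to the raw training set, the same two lines would return the variable names that answer this question.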
【1.4】 How many observations are now in the training set? In the testing set?
ptr = na.omit(ptr)
pte = na.omit(pte)
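The two calls above drop every row that has at least one missing value; `nrow()` then gives the remaining counts. A small sketch on made-up data (`toy` is illustrative only):

```r
# Toy data: rows 2 and 3 each contain a missing value.
toy <- data.frame(
  readingScore = c(510, 480, NA, 530, 495),
  preschool    = c(1, NA, 0, 1, 1)
)

nrow(toy)                      # rows before cleaning
toy_complete <- na.omit(toy)   # keep complete cases only
nrow(toy_complete)             # rows after cleaning
```

For `ptr` itself, the post-cleaning count agrees with figures elsewhere in these notes: 1204 + 1210 = 2414 in `table(ptr$male)`, and 2385 residual degrees of freedom plus 29 estimated coefficients in the regression summary.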
【2.1】 Which of the following variables is an unordered factor with at least 3 levels? Which is an ordered factor?
table(ptr$grade)
   8    9   10   11   12
   2  188 1730  491    3
table(ptr$male)
   0    1
1204 1210
table(ptr$raceeth)
White                                   1470
American Indian/Alaska Native             20
Asian                                     95
Black                                    228
Hispanic                                 500
More than one race                        81
Native Hawaiian/Other Pacific Islander    20
【2.2】 Because it is the most common in our population, we will select White as the reference level. For the variable “raceeth” in our problem, which binary variables will be included in the regression model?
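Setting the reference level is done with `relevel()`, which moves the chosen level to the front of the factor. A minimal sketch on a made-up factor; in the actual analysis the same call would presumably be applied to `raceeth` in both `ptr` and `pte`:

```r
# Made-up factor; by default the levels are sorted alphabetically.
raceeth <- factor(c("Asian", "White", "Black", "White", "Hispanic"))
levels(raceeth)   # "Asian" comes first

# Make "White" the reference (first) level.
raceeth <- relevel(raceeth, ref = "White")
levels(raceeth)   # "White" comes first
```

`lm()` then creates one binary variable for every level except the reference.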
【2.3】 For a student who is Asian, which binary variables would be set to 0? All remaining variables will be set to 1. What about a student who is White?
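The dummy coding can be inspected with `model.matrix()`, which shows the binary columns a regression would receive. A sketch with three made-up students and only three of the seven levels, for brevity:

```r
# Three made-up students; the first level listed is the reference level.
raceeth <- factor(
  c("White", "Asian", "Black"),
  levels = c("White", "Asian", "Black")
)

# One column per non-reference level, plus the intercept.
dummies <- model.matrix(~ raceeth)
dummies
```

The Asian student's row has a 1 only in the `raceethAsian` column, and the White student's row has 0 in every `raceeth` column because White is the reference level.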
【3.1】 What is the Multiple R-squared value of lmScore on the training set?
model1 = lm(readingScore ~ ., data = ptr)
summary(model1)
Call:
lm(formula = readingScore ~ ., data = ptr)
Residuals:
Min 1Q Median 3Q Max
-247.44 -48.86 1.86 49.77 217.18
Coefficients:
Estimate Std. Error
(Intercept) 143.766333 33.841226
grade 29.542707 2.937399
male -14.521653 3.155926
raceethAmerican Indian/Alaska Native -67.277327 16.786935
raceethAsian -4.110325 9.220071
raceethBlack -67.012347 5.460883
raceethHispanic -38.975486 5.177743
raceethMore than one race -16.922522 8.496268
raceethNative Hawaiian/Other Pacific Islander -5.101601 17.005696
preschool -4.463670 3.486055
expectBachelors 55.267080 4.293893
motherHS 6.058774 6.091423
motherBachelors 12.638068 3.861457
motherWork -2.809101 3.521827
fatherHS 4.018214 5.579269
fatherBachelors 16.929755 3.995253
fatherWork 5.842798 4.395978
selfBornUS -3.806278 7.323718
motherBornUS -8.798153 6.587621
fatherBornUS 4.306994 6.263875
englishAtHome 8.035685 6.859492
computerForSchoolwork 22.500232 5.702562
read30MinsADay 34.871924 3.408447
minutesPerWeekEnglish 0.012788 0.010712
studentsInEnglish -0.286631 0.227819
schoolHasLibrary 12.215085 9.264884
publicSchool -16.857475 6.725614
urban -0.110132 3.962724
schoolSize 0.006540 0.002197
t value Pr(>|t|)
(Intercept) 4.248 2.24e-05 ***
grade 10.057 < 2e-16 ***
male -4.601 4.42e-06 ***
raceethAmerican Indian/Alaska Native -4.008 6.32e-05 ***
raceethAsian -0.446 0.65578
raceethBlack -12.271 < 2e-16 ***
raceethHispanic -7.528 7.29e-14 ***
raceethMore than one race -1.992 0.04651 *
raceethNative Hawaiian/Other Pacific Islander -0.300 0.76421
preschool -1.280 0.20052
expectBachelors 12.871 < 2e-16 ***
motherHS 0.995 0.32001
motherBachelors 3.273 0.00108 **
motherWork -0.798 0.42517
fatherHS 0.720 0.47147
fatherBachelors 4.237 2.35e-05 ***
fatherWork 1.329 0.18393
selfBornUS -0.520 0.60331
motherBornUS -1.336 0.18182
fatherBornUS 0.688 0.49178
englishAtHome 1.171 0.24153
computerForSchoolwork 3.946 8.19e-05 ***
read30MinsADay 10.231 < 2e-16 ***
minutesPerWeekEnglish 1.194 0.23264
studentsInEnglish -1.258 0.20846
schoolHasLibrary 1.318 0.18749
publicSchool -2.506 0.01226 *
urban -0.028 0.97783
schoolSize 2.977 0.00294 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 73.81 on 2385 degrees of freedom
Multiple R-squared: 0.3251, Adjusted R-squared: 0.3172
F-statistic: 41.04 on 28 and 2385 DF, p-value: < 2.2e-16
【3.2】 What is the training-set root-mean squared error (RMSE) of lmScore?
pred = predict(model1, ptr)
SSE = sum((ptr$readingScore - pred)^2)
RMSE = sqrt(SSE/nrow(ptr))
print(RMSE)
[1] 73.36555
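The RMSE above (73.37) differs slightly from the residual standard error in the model summary (73.81) because the former divides SSE by n while the latter divides by the residual degrees of freedom, n - p. A short arithmetic check using only numbers printed in these notes:

```r
rmse <- 73.36555  # training RMSE printed above
n <- 2414         # training rows: 1204 + 1210 from table(ptr$male)
p <- 29           # estimated coefficients, including the intercept

# RSE = sqrt(SSE / (n - p)) and RMSE = sqrt(SSE / n), so RSE = RMSE * sqrt(n / (n - p)).
rse <- rmse * sqrt(n / (n - p))
round(rse, 2)     # 73.81, matching "Residual standard error" in the summary
```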
【3.3】 Consider two students A and B. They have all variable values the same, except that student A is in grade 11 and student B is in grade 9. What is the predicted reading score of student A minus the predicted reading score of student B?
print(29.542707*2)
[1] 59.08541
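The answer is twice the grade coefficient because every other term cancels when the two predictions are subtracted. A sketch that confirms this with `predict()` on simulated data (the data frame `toy`, its coefficients, and the seed are made up for illustration):

```r
set.seed(1)

# Simulated students: score depends linearly on grade and male, plus noise.
toy <- data.frame(
  grade = sample(9:11, 50, replace = TRUE),
  male  = rbinom(50, 1, 0.5)
)
toy$score <- 200 + 30 * toy$grade - 15 * toy$male + rnorm(50, sd = 20)

fit <- lm(score ~ grade + male, data = toy)

# Two students identical except for grade (11 versus 9).
student_a <- data.frame(grade = 11, male = 1)
student_b <- data.frame(grade = 9, male = 1)
gap <- predict(fit, student_a) - predict(fit, student_b)

# The gap equals 2 times the fitted grade coefficient.
all.equal(unname(gap), unname(2 * coef(fit)["grade"]))
```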
【3.4】 What is the meaning of the coefficient associated with variable raceethAsian?
【3.5】 Based on the significance codes, which variables are candidates for removal from the model? Select all that apply. (We’ll assume that the factor variable raceeth should only be removed if none of its levels are significant.)
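One way to read off the candidates is to filter the p-values from the coefficient table at the 0.05 level. The sketch below copies the `Pr(>|t|)` column for the non-raceeth variables from the summary above (entries printed as `< 2e-16` are entered as 2e-16, an upper bound); the vector name `p_values` is illustrative:

```r
# p-values copied from the summary(model1) output; raceeth levels are left out
# because the factor is kept when any of its levels is significant.
p_values <- c(
  grade = 2e-16, male = 4.42e-06, preschool = 0.20052,
  expectBachelors = 2e-16, motherHS = 0.32001, motherBachelors = 0.00108,
  motherWork = 0.42517, fatherHS = 0.47147, fatherBachelors = 2.35e-05,
  fatherWork = 0.18393, selfBornUS = 0.60331, motherBornUS = 0.18182,
  fatherBornUS = 0.49178, englishAtHome = 0.24153,
  computerForSchoolwork = 8.19e-05, read30MinsADay = 2e-16,
  minutesPerWeekEnglish = 0.23264, studentsInEnglish = 0.20846,
  schoolHasLibrary = 0.18749, publicSchool = 0.01226,
  urban = 0.97783, schoolSize = 0.00294
)

# Variables with no significance star at the 0.05 level.
candidates <- names(p_values)[p_values > 0.05]
candidates
```

With a fitted model in hand, the same column is available as `summary(model1)$coefficients[, "Pr(>|t|)"]`, so the values need not be typed by hand.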
【4.1】 What is the range between the maximum and minimum predicted reading score on the test set?
pred = predict(model1, pte)
diff(range(pred))
[1] 284.4683
【4.2】 What is the sum of squared errors (SSE) of lmScore on the testing set? What is the root-mean squared error (RMSE)?
SSE = sum((pred - pte$readingScore)^2)
RMSE = sqrt(SSE/nrow(pte))
print(SSE)
[1] 5762082
print(RMSE)
[1] 76.29079
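Since RMSE = sqrt(SSE / n), the two printed values also imply the number of rows in the cleaned testing set. A quick consistency check using only the numbers above:

```r
sse  <- 5762082    # test-set SSE printed above
rmse <- 76.29079   # test-set RMSE printed above

# Rearranging RMSE = sqrt(SSE / n) gives n = SSE / RMSE^2.
n_test <- sse / rmse^2
round(n_test)      # 990
```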
【4.3】 What is the predicted test score used in the baseline model? Remember to compute this value using the training set and not the test set. What is the sum of squared errors of the baseline model on the testing set? HINT: We call the sum of squared errors for the baseline model the total sum of squares (SST).
mean(ptr$readingScore)
[1] 517.9629
sum((pte$readingScore - mean(ptr$readingScore))^2)
[1] 7802354
【4.4】 What is the test-set R-squared value of lmScore?
SST = sum((pte$readingScore - mean(ptr$readingScore))^2)
R2 = 1-SSE/SST
print(R2)
[1] 0.2614944
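The test-set R-squared compares the model's SSE with the SSE of the baseline that always predicts the training-set mean. A small sketch wrapping this in a function (the name `test_r_squared` and the toy vectors are illustrative), followed by a check against the SSE and SST values printed in these notes:

```r
# R-squared on held-out data, relative to a fixed baseline prediction.
test_r_squared <- function(actual, predicted, baseline) {
  sse <- sum((actual - predicted)^2)  # model error
  sst <- sum((actual - baseline)^2)   # baseline error
  1 - sse / sst
}

# Toy usage with made-up scores and a made-up baseline of 510.
test_r_squared(c(500, 520, 480), c(505, 510, 490), baseline = 510)

# Check with the values from these notes: SSE = 5762082, SST = 7802354.
r2_notes <- 1 - 5762082 / 7802354
round(r2_notes, 4)  # 0.2615
```

Using the training mean as the baseline means the test-set R-squared can be negative for a poor model, unlike the training-set value.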