Load the training and testing sets using the read.csv() function, and save them as variables with the names pisaTrain and pisaTest.
How many students are there in the training set?
test = read.csv("Unit2/pisa2009test.csv")
train = read.csv("Unit2/pisa2009train.csv")
nrow(train)
#3663
Using tapply() on pisaTrain, what is the average reading test score of males?
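tapply(train$readingScore, train$male, mean)
#The call recorded in these notes was tapply(train$grade, train$male, mean), which averages grade rather than readingScore. Its output was:
## 0 1
## 10.14517 10.03686
#Those are mean grades (female 10.14517, male 10.03686), not reading scores. Re-run the corrected call above; the male average is the value under 1.
Of females?
#The value under 0 from the corrected call.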
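As a minimal sketch of how tapply() splits one column by another and applies a function to each group, the example below uses a small made-up data frame. The toy values are invented for illustration and are not PISA data.

```r
# Toy data, invented for illustration only (not the PISA values).
toy <- data.frame(
  male = c(0, 0, 1, 1, 1),
  readingScore = c(500, 520, 480, 470, 490)
)

# Mean of readingScore within each value of male (0 = female, 1 = male).
by_gender <- tapply(toy$readingScore, toy$male, mean)
by_gender
```

The result is indexed by the group labels, so by_gender["1"] picks out the male average and by_gender["0"] the female average.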
Which variables are missing data in at least one observation in the training set? Select all that apply.
summary(train)
## grade male raceeth
## Min. : 8.00 Min. :0.0000 White :2015
## 1st Qu.:10.00 1st Qu.:0.0000 Hispanic : 834
## Median :10.00 Median :1.0000 Black : 444
## Mean :10.09 Mean :0.5111 Asian : 143
## 3rd Qu.:10.00 3rd Qu.:1.0000 More than one race: 124
## Max. :12.00 Max. :1.0000 (Other) : 68
## NA's : 35
## preschool expectBachelors motherHS motherBachelors
## Min. :0.0000 Min. :0.0000 Min. :0.00 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:1.0000 1st Qu.:1.00 1st Qu.:0.0000
## Median :1.0000 Median :1.0000 Median :1.00 Median :0.0000
## Mean :0.7228 Mean :0.7859 Mean :0.88 Mean :0.3481
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.00 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.0000 Max. :1.00 Max. :1.0000
## NA's :56 NA's :62 NA's :97 NA's :397
## motherWork fatherHS fatherBachelors fatherWork
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:1.0000 1st Qu.:0.0000 1st Qu.:1.0000
## Median :1.0000 Median :1.0000 Median :0.0000 Median :1.0000
## Mean :0.7345 Mean :0.8593 Mean :0.3319 Mean :0.8531
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## NA's :93 NA's :245 NA's :569 NA's :233
## selfBornUS motherBornUS fatherBornUS englishAtHome
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:1.0000 1st Qu.:1.0000 1st Qu.:1.0000 1st Qu.:1.0000
## Median :1.0000 Median :1.0000 Median :1.0000 Median :1.0000
## Mean :0.9313 Mean :0.7725 Mean :0.7668 Mean :0.8717
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## NA's :69 NA's :71 NA's :113 NA's :71
## computerForSchoolwork read30MinsADay minutesPerWeekEnglish
## Min. :0.0000 Min. :0.0000 Min. : 0.0
## 1st Qu.:1.0000 1st Qu.:0.0000 1st Qu.: 225.0
## Median :1.0000 Median :0.0000 Median : 250.0
## Mean :0.8994 Mean :0.2899 Mean : 266.2
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.: 300.0
## Max. :1.0000 Max. :1.0000 Max. :2400.0
## NA's :65 NA's :34 NA's :186
## studentsInEnglish schoolHasLibrary publicSchool urban
## Min. : 1.0 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:20.0 1st Qu.:1.0000 1st Qu.:1.0000 1st Qu.:0.0000
## Median :25.0 Median :1.0000 Median :1.0000 Median :0.0000
## Mean :24.5 Mean :0.9676 Mean :0.9339 Mean :0.3849
## 3rd Qu.:30.0 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :75.0 Max. :1.0000 Max. :1.0000 Max. :1.0000
## NA's :249 NA's :143
## schoolSize readingScore
## Min. : 100 Min. :168.6
## 1st Qu.: 712 1st Qu.:431.7
## Median :1212 Median :499.7
## Mean :1369 Mean :497.9
## 3rd Qu.:1900 3rd Qu.:566.2
## Max. :6694 Max. :746.0
## NA's :162
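#All variables except grade, male, publicSchool, urban and readingScore. (In the studentsInEnglish / schoolHasLibrary / publicSchool / urban block, the two NA counts belong to studentsInEnglish, 249, and schoolHasLibrary, 143.)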
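Reading NA counts off a long summary() printout is error-prone. A hedged alternative, sketched here on an invented three-column data frame, is to count missing values per column directly with colSums(is.na()).

```r
# Toy data, invented for illustration only.
toy <- data.frame(
  grade = c(10, 10, 11, 9),
  preschool = c(1, NA, 0, 1),
  schoolSize = c(NA, 1200, NA, 800)
)

# is.na() gives a logical matrix; colSums() counts the TRUEs per column.
na_counts <- colSums(is.na(toy))
na_counts

# Names of the columns that have at least one missing value.
cols_with_na <- names(na_counts)[na_counts > 0]
cols_with_na
```

The same two lines applied to train would list the variables with missing data without scanning the printout by eye.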
How many observations are now in the training set?
pisaTrain = na.omit(train) # na.omit drops every row that has missing data
pisaTest = na.omit(test)
#2414
How many observations are now in the testing set?
#990
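The row counts after na.omit() can be checked with nrow(). The sketch below, on an invented data frame, shows that na.omit() keeps only rows with no missing value in any column, which is the same set of rows that complete.cases() flags.

```r
# Toy data, invented for illustration only.
toy <- data.frame(
  grade = c(10, 10, 11, 9),
  preschool = c(1, NA, 0, 1),
  schoolSize = c(NA, 1200, NA, 800)
)

# Drop every row containing at least one NA.
toy_complete <- na.omit(toy)
rows_kept <- nrow(toy_complete)
rows_kept

# complete.cases() marks the rows that na.omit() would keep.
n_complete <- sum(complete.cases(toy))
n_complete
```

Only the last toy row is complete, so a single row survives; on the real data the analogous calls are nrow(pisaTrain) and nrow(pisaTest).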
Which of the following variables is an unordered factor with at least 3 levels? (Select all that apply.)
#raceeth
Which of the following variables is an ordered factor with at least 3 levels? (Select all that apply.)
#grade
Which binary variables will be included in the regression model? (Select all that apply.)
#Every raceeth level except the reference level White gets its own binary variable.
Consider again adding our unordered factor race to the regression model with reference level “White”.
For a student who is Asian, which binary variables would be set to 0? All remaining variables will be set to 1. (Select all that apply.)
#Every raceeth binary variable except raceethAsian is set to 0; raceethAsian is set to 1.
For a student who is white, which binary variables would be set to 0? All remaining variables will be set to 1. (Select all that apply.)
#All of the raceeth binary variables are 0, because White is the reference level.
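The dummy coding described in the last three questions can be inspected directly. This is a minimal sketch on an invented four-level factor, assuming the reference level is set with relevel(); model.matrix() then shows the binary columns a regression would use.

```r
# Toy factor, invented for illustration; the real raceeth has more levels.
toy <- data.frame(
  raceeth = factor(c("Asian", "Black", "White", "Hispanic", "White"))
)

# Make "White" the reference level, so it gets no binary variable of its own.
toy$raceeth <- relevel(toy$raceeth, ref = "White")

# One column per non-reference level, plus the intercept.
dummies <- model.matrix(~ raceeth, data = toy)
dummies
```

Row 1 (an Asian student) has raceethAsian equal to 1 and the other raceeth columns equal to 0, while row 3 (a White student) has 0 in every raceeth column.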
What is the Multiple R-squared value of lmScore on the training set?
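#The summary below contains a raceethWhite row, so raceeth was not releveled before this fit and the reference level is the omitted one, not White.
#To use White as the reference, first run: pisaTrain$raceeth = relevel(factor(pisaTrain$raceeth), "White")
#The fitted values and R-squared are the same either way; only the raceeth coefficients and the intercept change.
lmScore = lm(readingScore ~ ., data = pisaTrain)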
summary(lmScore)
##
## Call:
## lm(formula = readingScore ~ ., data = pisaTrain)
##
## Residuals:
## Min 1Q Median 3Q Max
## -247.44 -48.86 1.86 49.77 217.18
##
## Coefficients:
## Estimate Std. Error
## (Intercept) 76.489006 37.302678
## grade 29.542707 2.937399
## male -14.521653 3.155926
## raceethAsian 63.167002 18.972648
## raceethBlack 0.264980 17.369507
## raceethHispanic 28.301842 17.258860
## raceethMore than one race 50.354805 18.570123
## raceethNative Hawaiian/Other Pacific Islander 62.175726 23.782766
## raceethWhite 67.277327 16.786935
## preschool -4.463670 3.486055
## expectBachelors 55.267080 4.293893
## motherHS 6.058774 6.091423
## motherBachelors 12.638068 3.861457
## motherWork -2.809101 3.521827
## fatherHS 4.018214 5.579269
## fatherBachelors 16.929755 3.995253
## fatherWork 5.842798 4.395978
## selfBornUS -3.806278 7.323718
## motherBornUS -8.798153 6.587621
## fatherBornUS 4.306994 6.263875
## englishAtHome 8.035685 6.859492
## computerForSchoolwork 22.500232 5.702562
## read30MinsADay 34.871924 3.408447
## minutesPerWeekEnglish 0.012788 0.010712
## studentsInEnglish -0.286631 0.227819
## schoolHasLibrary 12.215085 9.264884
## publicSchool -16.857475 6.725614
## urban -0.110132 3.962724
## schoolSize 0.006540 0.002197
## t value Pr(>|t|)
## (Intercept) 2.050 0.040425 *
## grade 10.057 < 2e-16 ***
## male -4.601 4.42e-06 ***
## raceethAsian 3.329 0.000884 ***
## raceethBlack 0.015 0.987830
## raceethHispanic 1.640 0.101169
## raceethMore than one race 2.712 0.006744 **
## raceethNative Hawaiian/Other Pacific Islander 2.614 0.008997 **
## raceethWhite 4.008 6.32e-05 ***
## preschool -1.280 0.200516
## expectBachelors 12.871 < 2e-16 ***
## motherHS 0.995 0.320012
## motherBachelors 3.273 0.001080 **
## motherWork -0.798 0.425167
## fatherHS 0.720 0.471470
## fatherBachelors 4.237 2.35e-05 ***
## fatherWork 1.329 0.183934
## selfBornUS -0.520 0.603307
## motherBornUS -1.336 0.181821
## fatherBornUS 0.688 0.491776
## englishAtHome 1.171 0.241527
## computerForSchoolwork 3.946 8.19e-05 ***
## read30MinsADay 10.231 < 2e-16 ***
## minutesPerWeekEnglish 1.194 0.232644
## studentsInEnglish -1.258 0.208460
## schoolHasLibrary 1.318 0.187487
## publicSchool -2.506 0.012261 *
## urban -0.028 0.977830
## schoolSize 2.977 0.002942 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 73.81 on 2385 degrees of freedom
## Multiple R-squared: 0.3251, Adjusted R-squared: 0.3172
## F-statistic: 41.04 on 28 and 2385 DF, p-value: < 2.2e-16
#0.3251
What is the training-set root-mean squared error (RMSE) of lmScore?
SSE= sum(lmScore$residuals^2)
RMSE= sqrt(SSE/nrow(pisaTrain))
RMSE
## [1] 73.36555
#Equivalent one-liner: sqrt(mean(lmScore$residuals^2))
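The training RMSE above divides SSE by the number of rows, whereas the residual standard error printed by summary() divides by the residual degrees of freedom, which is why 73.37 and 73.81 differ slightly. The sketch below checks both relationships on made-up numbers (x and y are invented for illustration).

```r
# Toy data, invented for illustration only.
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)
fit <- lm(y ~ x)

sse <- sum(residuals(fit)^2)

# RMSE: divide SSE by the number of observations.
rmse_from_sse <- sqrt(sse / length(y))
rmse_from_mean <- sqrt(mean(residuals(fit)^2))

# Residual standard error: divide SSE by the residual degrees of freedom.
rse <- sqrt(sse / df.residual(fit))

c(rmse = rmse_from_sse, rse = rse, sigma = summary(fit)$sigma)
```

The two RMSE forms agree, and rse matches summary(fit)$sigma; because the degrees of freedom are smaller than the row count, the residual standard error is always the larger of the two.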
Consider two students A and B. They have all variable values the same, except that student A is in grade 11 and student B is in grade 9. What is the predicted reading score of student A minus the predicted reading score of student B?
29.542707*2
## [1] 59.08541
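Multiplying the grade coefficient by 2 works because a linear model's prediction changes by the coefficient for each one-unit change in that variable, with everything else held fixed. As a hedged check on invented data (the toy data frame and the two hypothetical students are assumptions for illustration), the difference of two predict() calls gives the same number.

```r
# Toy data, invented for illustration only.
toy <- data.frame(
  grade = c(9, 10, 11, 10, 9, 11, 10, 12),
  male = c(0, 1, 0, 1, 1, 0, 0, 1),
  readingScore = c(450, 480, 540, 470, 430, 555, 500, 560)
)
fit <- lm(readingScore ~ grade + male, data = toy)

# Two hypothetical students who differ only in grade (11 versus 9).
students <- data.frame(grade = c(11, 9), male = c(1, 1))
pred <- predict(fit, newdata = students)

# Difference in predictions versus twice the grade coefficient.
diff_pred <- pred[[1]] - pred[[2]]
two_coef <- 2 * coef(fit)[["grade"]]
c(diff_pred = diff_pred, two_coef = two_coef)
```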
What is the meaning of the coefficient associated with variable raceethAsian?
#The predicted difference in reading score between an Asian student and an otherwise identical student from the reference level of raceeth (White, when raceeth is releveled as the question assumes).
Based on the significance codes, which variables are candidates for removal from the model? Select all that apply. (We’ll assume that the factor variable raceeth should only be removed if none of its levels are significant.)
#preschool, motherHS, motherWork, fatherHS, fatherWork, selfBornUS, motherBornUS, fatherBornUS, englishAtHome, minutesPerWeekEnglish, studentsInEnglish, schoolHasLibrary, urban
#read30MinsADay is not a candidate: it is marked *** (p < 2e-16) in the summary above.
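Rather than reading stars off the printout, the p-values can be pulled from the coefficient matrix of the model summary. The sketch below uses R's built-in mtcars data, not the PISA model, and the 0.05 cutoff is an assumption standing in for "no star".

```r
# Illustrative model on the built-in mtcars data (not the PISA data).
fit <- lm(mpg ~ wt + qsec + drat, data = mtcars)

# summary()$coefficients is a matrix with one row per term.
coefs <- summary(fit)$coefficients
p_values <- coefs[, "Pr(>|t|)"]

# Terms whose p-value is at or above the assumed 0.05 cutoff.
not_significant <- names(p_values)[p_values >= 0.05]
not_significant
```

For a factor such as raceeth, the per-level rows still need to be judged together, as the question notes.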
What is the range between the maximum and minimum predicted reading score on the test set?
predTest = predict(lmScore, newdata = pisaTest)
summary(predTest)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 353.2 482.0 524.0 516.7 555.7 637.7
637.7-353.2
## [1] 284.5
What is the sum of squared errors (SSE) of lmScore on the testing set?
SSE= sum((predTest-pisaTest$readingScore)^2)
SSE
## [1] 5762082
What is the root-mean squared error (RMSE) of lmScore on the testing set?
RMSE= sqrt(SSE/nrow(pisaTest))
RMSE
## [1] 76.29079
What is the predicted test score used in the baseline model? Remember to compute this value using the training set and not the test set.
B=mean(pisaTrain$readingScore)
B
## [1] 517.9629
What is the sum of squared errors of the baseline model on the testing set? HINT: We call the sum of squared errors for the baseline model the total sum of squares (SST).
SST = sum((B - pisaTest$readingScore)^2)
SST
## [1] 7802354
What is the test-set R-squared value of lmScore?
R2= 1-SSE/SST
R2
## [1] 0.2614944
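The test-set R-squared compares the model's squared error with that of a baseline that always predicts the training-set mean. A minimal sketch, with a hypothetical helper r2_out_of_sample() and invented toy scores, followed by the same arithmetic on the SSE and SST reported above:

```r
# Hypothetical helper, written for illustration only.
r2_out_of_sample <- function(actual, predicted, baseline) {
  sse <- sum((predicted - actual)^2)  # model error on the test set
  sst <- sum((baseline - actual)^2)   # baseline error on the test set
  1 - sse / sst
}

# Toy scores, invented for illustration only.
train_scores <- c(450, 480, 510)
test_scores <- c(400, 500, 600)
test_pred <- c(420, 510, 570)

# The baseline is the mean of the training scores, not the test scores.
toy_r2 <- r2_out_of_sample(test_scores, test_pred, mean(train_scores))
toy_r2

# Same formula with the SSE and SST values reported above.
doc_r2 <- 1 - 5762082 / 7802354
doc_r2
```

Because the baseline comes from the training set, this quantity can be negative on a test set when the model predicts worse than the training mean.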