Reading Test Scores

1.1 Dataset size

Load the training and testing sets using the read.csv() function, and save them as variables with the names pisaTrain and pisaTest.

How many students are there in the training set?

test= read.csv("Unit2/pisa2009test.csv")
train= read.csv("Unit2/pisa2009train.csv")
#3663
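
The count above is the number of rows in the data frame. A minimal sketch of how it can be read off, assuming a small made-up data frame (`toy` is hypothetical and only stands in for `train`):

```r
# Hypothetical stand-in for the data frame returned by read.csv()
toy <- data.frame(grade = c(10, 11, 9), male = c(1, 0, 1))

n_students <- nrow(toy)  # one row per student
n_students

str(toy)                 # also reports the number of observations and variables
```

Applied to the real data, `nrow(train)` would be the call to use.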

1.2 Summarizing the dataset

Using tapply() on pisaTrain, what is the average reading test score of males?

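tapply(train$readingScore, train$male, mean)
#Note: the output originally recorded here (0: 10.14517, 1: 10.03686) came from tapply(train$grade, train$male, mean), which is the average grade, not the average reading score.
#The answer is the value printed under 1 (male) by the corrected call above.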

1.2-2

Of females?

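#The value printed under 0 (female) by tapply(train$readingScore, train$male, mean). The figure 10.14517 recorded originally is the average grade of females, not their average reading score.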
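
A minimal sketch of grouping a score by sex with tapply(), using made-up numbers (the data frame `toy` and its values are hypothetical, not PISA data):

```r
# Toy data (made up for illustration; not the PISA values)
toy <- data.frame(
  male         = c(1, 0, 1, 0),
  readingScore = c(480, 520, 500, 540)
)

# Mean of readingScore within each level of male (0 = female, 1 = male)
avg_by_sex <- tapply(toy$readingScore, toy$male, mean)
avg_by_sex

avg_by_sex["1"]  # males
avg_by_sex["0"]  # females
```

The first argument is the variable being averaged, so passing `grade` there instead of `readingScore` returns average grades.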

1.3 Locating missing values

Which variables are missing data in at least one observation in the training set? Select all that apply.

summary(train)
##      grade            male                      raceeth    
##  Min.   : 8.00   Min.   :0.0000   White             :2015  
##  1st Qu.:10.00   1st Qu.:0.0000   Hispanic          : 834  
##  Median :10.00   Median :1.0000   Black             : 444  
##  Mean   :10.09   Mean   :0.5111   Asian             : 143  
##  3rd Qu.:10.00   3rd Qu.:1.0000   More than one race: 124  
##  Max.   :12.00   Max.   :1.0000   (Other)           :  68  
##                                   NA's              :  35  
##    preschool      expectBachelors     motherHS    motherBachelors 
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.00   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:1.0000   1st Qu.:1.00   1st Qu.:0.0000  
##  Median :1.0000   Median :1.0000   Median :1.00   Median :0.0000  
##  Mean   :0.7228   Mean   :0.7859   Mean   :0.88   Mean   :0.3481  
##  3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.00   3rd Qu.:1.0000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.00   Max.   :1.0000  
##  NA's   :56       NA's   :62       NA's   :97     NA's   :397     
##    motherWork        fatherHS      fatherBachelors    fatherWork    
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:1.0000   1st Qu.:0.0000   1st Qu.:1.0000  
##  Median :1.0000   Median :1.0000   Median :0.0000   Median :1.0000  
##  Mean   :0.7345   Mean   :0.8593   Mean   :0.3319   Mean   :0.8531  
##  3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##  NA's   :93       NA's   :245      NA's   :569      NA's   :233     
##    selfBornUS      motherBornUS     fatherBornUS    englishAtHome   
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:1.0000   1st Qu.:1.0000   1st Qu.:1.0000   1st Qu.:1.0000  
##  Median :1.0000   Median :1.0000   Median :1.0000   Median :1.0000  
##  Mean   :0.9313   Mean   :0.7725   Mean   :0.7668   Mean   :0.8717  
##  3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##  NA's   :69       NA's   :71       NA's   :113      NA's   :71      
##  computerForSchoolwork read30MinsADay   minutesPerWeekEnglish
##  Min.   :0.0000        Min.   :0.0000   Min.   :   0.0       
##  1st Qu.:1.0000        1st Qu.:0.0000   1st Qu.: 225.0       
##  Median :1.0000        Median :0.0000   Median : 250.0       
##  Mean   :0.8994        Mean   :0.2899   Mean   : 266.2       
##  3rd Qu.:1.0000        3rd Qu.:1.0000   3rd Qu.: 300.0       
##  Max.   :1.0000        Max.   :1.0000   Max.   :2400.0       
##  NA's   :65            NA's   :34       NA's   :186          
##  studentsInEnglish schoolHasLibrary  publicSchool        urban       
##  Min.   : 1.0      Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:20.0      1st Qu.:1.0000   1st Qu.:1.0000   1st Qu.:0.0000  
##  Median :25.0      Median :1.0000   Median :1.0000   Median :0.0000  
##  Mean   :24.5      Mean   :0.9676   Mean   :0.9339   Mean   :0.3849  
##  3rd Qu.:30.0      3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000  
##  Max.   :75.0      Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##  NA's   :249       NA's   :143                                       
##    schoolSize    readingScore  
##  Min.   : 100   Min.   :168.6  
##  1st Qu.: 712   1st Qu.:431.7  
##  Median :1212   Median :499.7  
##  Mean   :1369   Mean   :497.9  
##  3rd Qu.:1900   3rd Qu.:566.2  
##  Max.   :6694   Max.   :746.0  
##  NA's   :162
#all, except grade, male, publicSchool, urban and readingScore
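
Reading the NA's rows out of a long summary() printout is error-prone. One possible shortcut, sketched on a made-up data frame (`toy` is hypothetical), is to count the missing values per column directly:

```r
# Toy data with some missing entries (hypothetical)
toy <- data.frame(
  grade     = c(10, 11, 9, 10),
  preschool = c(1, NA, 0, 1),
  raceeth   = c("White", "Asian", NA, NA)
)

# Count NA values per column, then keep the names of columns with any
na_counts <- colSums(is.na(toy))
na_counts

cols_with_na <- names(na_counts)[na_counts > 0]
cols_with_na
```

On the real data the same two lines with `train` in place of `toy` should list the variables that have at least one missing value.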

1.4 Removing missing values

How many observations are now in the training set?

pisaTrain = na.omit(train) #na.omit drops every row that has missing data
pisaTest = na.omit(test)
#2414

How many observations are now in the testing set?

#990
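
The two counts are the row counts after na.omit(). A small sketch of the effect, assuming a made-up data frame (`toy` is hypothetical):

```r
# Toy data: rows 2 and 3 each have one missing value (hypothetical)
toy <- data.frame(
  grade      = c(10, 11, 9, 10),
  preschool  = c(1, NA, 0, 1),
  schoolSize = c(500, 800, NA, 1200)
)

# na.omit() keeps only rows with no NA in any column
toy_complete <- na.omit(toy)

nrow(toy)           # rows before
nrow(toy_complete)  # rows after
```

A row is dropped if any single column is missing, which is why the training set shrinks so much.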

2.1 Factor variables

Which of the following variables is an unordered factor with at least 3 levels? (Select all that apply.)

#raceeth

Which of the following variables is an ordered factor with at least 3 levels? (Select all that apply.)

#grade

2.2

Which binary variables will be included in the regression model? (Select all that apply.)

#One binary variable for every level of raceeth except the reference level, White.

2.3 - Example unordered factors

Consider again adding our unordered factor race to the regression model with reference level “White”.

For a student who is Asian, which binary variables would be set to 0? All remaining variables will be set to 1. (Select all that apply.)

#All raceeth binary variables except raceethAsian are set to 0; raceethAsian is set to 1.

For a student who is white, which binary variables would be set to 0? All remaining variables will be set to 1. (Select all that apply.)

#all are 0.
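
The answers in 2.2 and 2.3 assume that "White" is the reference level. A minimal sketch of how that could be set up and inspected, using a made-up factor (`toy` is hypothetical). Note that the summary in 3.1 lists a raceethWhite coefficient, which suggests the model there was fitted with the default reference level instead.

```r
# Toy factor (hypothetical values). In R 4.0 and later read.csv() keeps
# strings as character by default, so factor() may be needed first.
toy <- data.frame(raceeth = factor(c("Asian", "White", "Black", "White")))

# Make "White" the reference level, as the question assumes
toy$raceeth <- relevel(toy$raceeth, ref = "White")
levels(toy$raceeth)

# model.matrix() shows the 0/1 columns that lm() would build
dummies <- model.matrix(~ raceeth, data = toy)
dummies
```

The first row (an Asian student) has raceethAsian equal to 1 and every other raceeth column equal to 0; the second row (a White student) has all of them equal to 0.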

3.1 - Building a model

What is the Multiple R-squared value of lmScore on the training set?

lmScore= lm(readingScore ~ ., pisaTrain)
summary(lmScore)
## 
## Call:
## lm(formula = readingScore ~ ., data = pisaTrain)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -247.44  -48.86    1.86   49.77  217.18 
## 
## Coefficients:
##                                                 Estimate Std. Error
## (Intercept)                                    76.489006  37.302678
## grade                                          29.542707   2.937399
## male                                          -14.521653   3.155926
## raceethAsian                                   63.167002  18.972648
## raceethBlack                                    0.264980  17.369507
## raceethHispanic                                28.301842  17.258860
## raceethMore than one race                      50.354805  18.570123
## raceethNative Hawaiian/Other Pacific Islander  62.175726  23.782766
## raceethWhite                                   67.277327  16.786935
## preschool                                      -4.463670   3.486055
## expectBachelors                                55.267080   4.293893
## motherHS                                        6.058774   6.091423
## motherBachelors                                12.638068   3.861457
## motherWork                                     -2.809101   3.521827
## fatherHS                                        4.018214   5.579269
## fatherBachelors                                16.929755   3.995253
## fatherWork                                      5.842798   4.395978
## selfBornUS                                     -3.806278   7.323718
## motherBornUS                                   -8.798153   6.587621
## fatherBornUS                                    4.306994   6.263875
## englishAtHome                                   8.035685   6.859492
## computerForSchoolwork                          22.500232   5.702562
## read30MinsADay                                 34.871924   3.408447
## minutesPerWeekEnglish                           0.012788   0.010712
## studentsInEnglish                              -0.286631   0.227819
## schoolHasLibrary                               12.215085   9.264884
## publicSchool                                  -16.857475   6.725614
## urban                                          -0.110132   3.962724
## schoolSize                                      0.006540   0.002197
##                                               t value Pr(>|t|)    
## (Intercept)                                     2.050 0.040425 *  
## grade                                          10.057  < 2e-16 ***
## male                                           -4.601 4.42e-06 ***
## raceethAsian                                    3.329 0.000884 ***
## raceethBlack                                    0.015 0.987830    
## raceethHispanic                                 1.640 0.101169    
## raceethMore than one race                       2.712 0.006744 ** 
## raceethNative Hawaiian/Other Pacific Islander   2.614 0.008997 ** 
## raceethWhite                                    4.008 6.32e-05 ***
## preschool                                      -1.280 0.200516    
## expectBachelors                                12.871  < 2e-16 ***
## motherHS                                        0.995 0.320012    
## motherBachelors                                 3.273 0.001080 ** 
## motherWork                                     -0.798 0.425167    
## fatherHS                                        0.720 0.471470    
## fatherBachelors                                 4.237 2.35e-05 ***
## fatherWork                                      1.329 0.183934    
## selfBornUS                                     -0.520 0.603307    
## motherBornUS                                   -1.336 0.181821    
## fatherBornUS                                    0.688 0.491776    
## englishAtHome                                   1.171 0.241527    
## computerForSchoolwork                           3.946 8.19e-05 ***
## read30MinsADay                                 10.231  < 2e-16 ***
## minutesPerWeekEnglish                           1.194 0.232644    
## studentsInEnglish                              -1.258 0.208460    
## schoolHasLibrary                                1.318 0.187487    
## publicSchool                                   -2.506 0.012261 *  
## urban                                          -0.028 0.977830    
## schoolSize                                      2.977 0.002942 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 73.81 on 2385 degrees of freedom
## Multiple R-squared:  0.3251, Adjusted R-squared:  0.3172 
## F-statistic: 41.04 on 28 and 2385 DF,  p-value: < 2.2e-16
#0.3251

3.2

What is the training-set root-mean squared error (RMSE) of lmScore?

SSE= sum(lmScore$residuals^2)
RMSE= sqrt(SSE/nrow(pisaTrain))
RMSE
## [1] 73.36555
#Equivalent one-liner: sqrt(mean(lmScore$residuals^2))
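
The RMSE of 73.37 differs slightly from the residual standard error of 73.81 in the summary because the two divide by different quantities. A sketch on made-up data (`toy` and `fit` are hypothetical):

```r
# Toy data (made up); y depends roughly linearly on x
toy <- data.frame(x = c(1, 2, 3, 4, 5), y = c(2.1, 3.9, 6.2, 7.8, 10.1))
fit <- lm(y ~ x, data = toy)

sse      <- sum(residuals(fit)^2)
rmse     <- sqrt(sse / nrow(toy))          # divides by n, as in the text
rmse_alt <- sqrt(mean(residuals(fit)^2))   # same quantity in one step
all.equal(rmse, rmse_alt)

# summary(fit)$sigma divides by the residual degrees of freedom instead,
# so it is slightly larger than rmse
summary(fit)$sigma
```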

3.3

Consider two students A and B. They have all variable values the same, except that student A is in grade 11 and student B is in grade 9. What is the predicted reading score of student A minus the predicted reading score of student B?

29.542707*2
## [1] 59.08541
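
The calculation works because, in a linear model with no interaction terms, two observations that differ only in grade have predictions that differ by the grade coefficient times the grade gap. A sketch that checks this with predict() on made-up data (`toy`, `fit`, `a` and `b` are hypothetical):

```r
# Toy data (made up for illustration)
toy <- data.frame(
  grade = c(9, 10, 11, 9, 10, 11),
  male  = c(0, 1, 0, 1, 0, 1),
  score = c(450, 470, 520, 440, 500, 505)
)
fit <- lm(score ~ grade + male, data = toy)

# Two students who differ only in grade
a <- data.frame(grade = 11, male = 1)
b <- data.frame(grade = 9,  male = 1)

diff_pred <- predict(fit, newdata = a) - predict(fit, newdata = b)
diff_pred

# Same value as two times the grade coefficient
all.equal(unname(diff_pred), unname(2 * coef(fit)["grade"]))
```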

3.4

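What is the meaning of the coefficient associated with variable raceethAsian?

#The predicted difference in reading score between an Asian student and a student in the reference level of raceeth who is otherwise identical.
#Note: in the output above raceethWhite has its own coefficient, so the reference level there is the one raceeth level with no row in the table, not White.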

3.5

Based on the significance codes, which variables are candidates for removal from the model? Select all that apply. (We’ll assume that the factor variable raceeth should only be removed if none of its levels are significant.)

#preschool, motherHS, motherWork, fatherHS, fatherWork, selfBornUS, motherBornUS, fatherBornUS, englishAtHome, minutesPerWeekEnglish, studentsInEnglish, schoolHasLibrary, urban
#read30MinsADay is not a candidate: it is significant at the *** level in the table above.

4.1

What is the range between the maximum and minimum predicted reading score on the test set?

predTest = predict(lmScore, newdata = pisaTest)
summary(predTest)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   353.2   482.0   524.0   516.7   555.7   637.7
637.7-353.2
## [1] 284.5
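
summary() prints rounded values, so subtracting them by hand can lose a little precision. One alternative, sketched on a made-up vector (`pred` is hypothetical), is to take the difference of range() directly:

```r
# Hypothetical predictions standing in for predTest
pred <- c(400, 450, 520, 610)

# range() returns c(min, max); diff() subtracts the first from the second
spread <- diff(range(pred))
spread
```

With the real predictions this would be `diff(range(predTest))`.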

4.2

What is the sum of squared errors (SSE) of lmScore on the testing set?

SSE= sum((predTest-pisaTest$readingScore)^2)
SSE
## [1] 5762082

What is the root-mean squared error (RMSE) of lmScore on the testing set?

RMSE= sqrt(SSE/nrow(pisaTest))
RMSE
## [1] 76.29079

4.3

What is the predicted test score used in the baseline model? Remember to compute this value using the training set and not the test set.

B=mean(pisaTrain$readingScore)
B
## [1] 517.9629

What is the sum of squared errors of the baseline model on the testing set? HINT: We call the sum of squared errors for the baseline model the total sum of squares (SST).

SST = sum((B - pisaTest$readingScore)^2)
SST
## [1] 7802354

4.4

What is the test-set R-squared value of lmScore?

R2= 1-SSE/SST
R2
## [1] 0.2614944
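
Steps 4.2 to 4.4 can be collected into one short sequence. The sketch below uses made-up training and test sets (`train_toy`, `test_toy` and `fit` are hypothetical); the point to note is that the baseline mean comes from the training set while both sums of squares are taken over the test set.

```r
# Toy training and test sets (made up for illustration)
train_toy <- data.frame(x = c(1, 2, 3, 4, 5, 6),
                        y = c(1.2, 1.9, 3.2, 3.8, 5.1, 6.1))
test_toy  <- data.frame(x = c(1.5, 3.5, 5.5),
                        y = c(1.4, 3.7, 5.4))

fit  <- lm(y ~ x, data = train_toy)
pred <- predict(fit, newdata = test_toy)

sse      <- sum((pred - test_toy$y)^2)       # model error on the test set
rmse     <- sqrt(sse / nrow(test_toy))
baseline <- mean(train_toy$y)                # baseline from the training set
sst      <- sum((baseline - test_toy$y)^2)   # baseline error on the test set
r2       <- 1 - sse / sst

c(SSE = sse, RMSE = rmse, SST = sst, R2 = r2)
```

Unlike the training-set R-squared, this test-set value can be negative when the model predicts worse than the baseline.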