R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

When you click the Knit button, a document is generated that includes both the content and the output of any embedded R code chunks. You can embed an R code chunk like this:

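# Load the training and test sets. stringsAsFactors = TRUE keeps raceeth a
# factor (the default changed to FALSE in R 4.0.0), which relevel() requires.
pisaTrain = read.csv("pisa2009train.csv", stringsAsFactors = TRUE)
pisaTest = read.csv("pisa2009test.csv", stringsAsFactors = TRUE)
# Mean reading score by gender (TRUE = male, FALSE = female)
tapply(pisaTrain$readingScore, pisaTrain$male == 1, mean)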
##    FALSE     TRUE 
## 512.9406 483.5325
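The grouped mean above can be cross-checked by subsetting each group by hand. The following is a minimal sketch on a small made-up data frame (the values in `scores` are hypothetical, not taken from the PISA files):

```r
# Hypothetical toy data, not the PISA data
scores <- data.frame(
  readingScore = c(500, 520, 480, 470, 510),
  male         = c(0, 0, 1, 1, 0)
)

# Group means via tapply: names are "FALSE" (male == 0) and "TRUE" (male == 1)
byGender <- tapply(scores$readingScore, scores$male == 1, mean)

# The same means computed by explicit subsetting
femaleMean <- mean(scores$readingScore[scores$male == 0])
maleMean   <- mean(scores$readingScore[scores$male == 1])

print(byGender)
```

Because the grouping vector is the logical expression `male == 1`, the result is labelled `FALSE` and `TRUE` rather than with the original 0/1 codes.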
# Drop every row that has a missing value in any column
pisaTrain = na.omit(pisaTrain)
pisaTest = na.omit(pisaTest)
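`na.omit()` removes a row if any of its columns is missing, and it records which rows were dropped in an `na.action` attribute (this is the attribute that appears at the end of the `str()` output later in this document). A minimal sketch on hypothetical toy data:

```r
# Hypothetical toy data with missing values, not the PISA data
toy <- data.frame(
  grade        = c(10, NA, 11, 10),
  readingScore = c(500, 480, NA, 520)
)

clean <- na.omit(toy)

nrow(toy)                   # rows before
nrow(clean)                 # rows after: only complete rows remain
attr(clean, "na.action")    # indices of the dropped rows
sum(!complete.cases(toy))   # number of rows with at least one NA
```

Comparing `nrow()` before and after is a quick way to see how much data the filter costs; whether dropping incomplete rows is appropriate depends on why the values are missing.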
summary(pisaTrain)
##      grade            male       
##  Min.   : 8.00   Min.   :0.0000  
##  1st Qu.:10.00   1st Qu.:0.0000  
##  Median :10.00   Median :1.0000  
##  Mean   :10.13   Mean   :0.5012  
##  3rd Qu.:10.00   3rd Qu.:1.0000  
##  Max.   :12.00   Max.   :1.0000  
##                                  
##                                    raceeth       preschool     
##  American Indian/Alaska Native         :  20   Min.   :0.0000  
##  Asian                                 :  95   1st Qu.:0.0000  
##  Black                                 : 228   Median :1.0000  
##  Hispanic                              : 500   Mean   :0.7274  
##  More than one race                    :  81   3rd Qu.:1.0000  
##  Native Hawaiian/Other Pacific Islander:  20   Max.   :1.0000  
##  White                                 :1470                   
##  expectBachelors     motherHS     motherBachelors    motherWork    
##  Min.   :0.0000   Min.   :0.000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:1.0000   1st Qu.:1.000   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :1.0000   Median :1.000   Median :0.0000   Median :1.0000  
##  Mean   :0.8343   Mean   :0.896   Mean   :0.3637   Mean   :0.7357  
##  3rd Qu.:1.0000   3rd Qu.:1.000   3rd Qu.:1.0000   3rd Qu.:1.0000  
##  Max.   :1.0000   Max.   :1.000   Max.   :1.0000   Max.   :1.0000  
##                                                                    
##     fatherHS      fatherBachelors    fatherWork       selfBornUS    
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:1.0000   1st Qu.:0.0000   1st Qu.:1.0000   1st Qu.:1.0000  
##  Median :1.0000   Median :0.0000   Median :1.0000   Median :1.0000  
##  Mean   :0.8741   Mean   :0.3484   Mean   :0.8571   Mean   :0.9362  
##  3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##                                                                     
##   motherBornUS   fatherBornUS    englishAtHome    computerForSchoolwork
##  Min.   :0.00   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000       
##  1st Qu.:1.00   1st Qu.:1.0000   1st Qu.:1.0000   1st Qu.:1.0000       
##  Median :1.00   Median :1.0000   Median :1.0000   Median :1.0000       
##  Mean   :0.79   Mean   :0.7854   Mean   :0.8815   Mean   :0.9155       
##  3rd Qu.:1.00   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000       
##  Max.   :1.00   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000       
##                                                                        
##  read30MinsADay   minutesPerWeekEnglish studentsInEnglish schoolHasLibrary
##  Min.   :0.0000   Min.   :   0.0        Min.   : 1.00     Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.: 225.0        1st Qu.:20.00     1st Qu.:1.0000  
##  Median :0.0000   Median : 250.0        Median :25.00     Median :1.0000  
##  Mean   :0.3016   Mean   : 269.8        Mean   :24.56     Mean   :0.9714  
##  3rd Qu.:1.0000   3rd Qu.: 300.0        3rd Qu.:30.00     3rd Qu.:1.0000  
##  Max.   :1.0000   Max.   :1680.0        Max.   :75.00     Max.   :1.0000  
##                                                                           
##   publicSchool        urban          schoolSize    readingScore  
##  Min.   :0.0000   Min.   :0.0000   Min.   : 100   Min.   :244.5  
##  1st Qu.:1.0000   1st Qu.:0.0000   1st Qu.: 712   1st Qu.:455.8  
##  Median :1.0000   Median :0.0000   Median :1233   Median :520.2  
##  Mean   :0.9176   Mean   :0.3629   Mean   :1372   Mean   :518.0  
##  3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1900   3rd Qu.:581.4  
##  Max.   :1.0000   Max.   :1.0000   Max.   :6694   Max.   :746.0  
## 
str(pisaTrain)
## 'data.frame':    2414 obs. of  24 variables:
##  $ grade                : int  11 10 10 10 10 10 10 10 11 9 ...
##  $ male                 : int  1 0 1 0 1 0 0 0 1 1 ...
##  $ raceeth              : Factor w/ 7 levels "American Indian/Alaska Native",..: 7 3 4 7 5 4 7 4 7 7 ...
##  $ preschool            : int  0 1 1 1 1 1 1 1 1 1 ...
##  $ expectBachelors      : int  0 1 0 1 1 1 1 0 1 1 ...
##  $ motherHS             : int  1 0 1 1 1 1 1 0 1 1 ...
##  $ motherBachelors      : int  1 0 0 0 1 0 0 0 0 1 ...
##  $ motherWork           : int  1 1 1 0 1 1 1 0 0 1 ...
##  $ fatherHS             : int  1 1 1 1 0 1 1 0 1 1 ...
##  $ fatherBachelors      : int  0 0 0 0 0 0 1 0 1 1 ...
##  $ fatherWork           : int  1 1 0 1 1 0 1 1 1 1 ...
##  $ selfBornUS           : int  1 1 1 1 1 0 1 0 1 1 ...
##  $ motherBornUS         : int  1 1 1 1 1 0 1 0 1 1 ...
##  $ fatherBornUS         : int  1 1 0 1 1 0 1 0 1 1 ...
##  $ englishAtHome        : int  1 1 1 1 1 0 1 0 1 1 ...
##  $ computerForSchoolwork: int  1 1 1 1 1 0 1 1 1 1 ...
##  $ read30MinsADay       : int  1 1 1 1 0 1 1 1 0 0 ...
##  $ minutesPerWeekEnglish: int  450 200 250 300 294 232 225 270 275 225 ...
##  $ studentsInEnglish    : int  25 23 35 30 24 14 20 25 30 15 ...
##  $ schoolHasLibrary     : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ publicSchool         : int  1 1 1 1 1 1 1 1 1 0 ...
##  $ urban                : int  0 1 1 0 0 0 0 1 1 1 ...
##  $ schoolSize           : int  1173 2640 1095 1913 899 1733 149 1400 1988 915 ...
##  $ readingScore         : num  575 458 614 439 466 ...
##  - attr(*, "na.action")=Class 'omit'  Named int [1:1249] 1 3 6 7 9 11 13 21 29 30 ...
##   .. ..- attr(*, "names")= chr [1:1249] "1" "3" "6" "7" ...
str(pisaTest)
## 'data.frame':    990 obs. of  24 variables:
##  $ grade                : int  10 10 10 10 11 10 10 10 10 10 ...
##  $ male                 : int  0 0 0 0 0 1 0 1 1 0 ...
##  $ raceeth              : Factor w/ 7 levels "American Indian/Alaska Native",..: 7 7 1 7 7 4 7 4 7 4 ...
##  $ preschool            : int  1 1 1 1 0 1 0 1 1 1 ...
##  $ expectBachelors      : int  0 1 0 0 0 1 1 0 1 1 ...
##  $ motherHS             : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ motherBachelors      : int  1 0 0 0 1 1 0 0 1 0 ...
##  $ motherWork           : int  1 0 0 1 1 1 0 1 1 1 ...
##  $ fatherHS             : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ fatherBachelors      : int  0 1 0 0 1 0 0 0 1 1 ...
##  $ fatherWork           : int  0 1 0 1 1 1 1 0 1 1 ...
##  $ selfBornUS           : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ motherBornUS         : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ fatherBornUS         : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ englishAtHome        : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ computerForSchoolwork: int  1 1 1 1 1 1 1 1 1 1 ...
##  $ read30MinsADay       : int  0 0 1 1 1 1 0 0 0 1 ...
##  $ minutesPerWeekEnglish: int  240 240 240 270 270 350 350 360 350 360 ...
##  $ studentsInEnglish    : int  30 30 30 35 30 25 27 28 25 27 ...
##  $ schoolHasLibrary     : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ publicSchool         : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ urban                : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ schoolSize           : int  808 808 808 808 808 899 899 899 899 899 ...
##  $ readingScore         : num  355 454 405 665 605 ...
##  - attr(*, "na.action")=Class 'omit'  Named int [1:580] 2 3 4 6 12 16 17 19 22 23 ...
##   .. ..- attr(*, "names")= chr [1:580] "2" "3" "4" "6" ...
# Make "White" (the most common level) the reference category in both sets
pisaTrain$raceeth = relevel(pisaTrain$raceeth, "White")
pisaTest$raceeth = relevel(pisaTest$raceeth, "White")
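`relevel()` only changes which level comes first; the data values stay the same. The first level is the one a linear model treats as the reference, so it gets no dummy column of its own. A small sketch with a hypothetical factor `race`:

```r
# Hypothetical toy factor, not the PISA data
race <- factor(c("Asian", "White", "Black", "White", "Hispanic"))
levels(race)             # alphabetical by default, so "Asian" comes first

race2 <- relevel(race, "White")
levels(race2)            # "White" is now first, i.e. the reference level

# Dummy columns a model would use: one per non-reference level
colnames(model.matrix(~ race2))
```

This is why the regression output lists a `raceeth...` coefficient for every group except White: each coefficient is a difference relative to the reference level.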
# Regress reading score on every other column of the training set
lmScore = lm(readingScore ~ ., data = pisaTrain)
summary(lmScore)
## 
## Call:
## lm(formula = readingScore ~ ., data = pisaTrain)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -247.44  -48.86    1.86   49.77  217.18 
## 
## Coefficients:
##                                                 Estimate Std. Error
## (Intercept)                                   143.766333  33.841226
## grade                                          29.542707   2.937399
## male                                          -14.521653   3.155926
## raceethAmerican Indian/Alaska Native          -67.277327  16.786935
## raceethAsian                                   -4.110325   9.220071
## raceethBlack                                  -67.012347   5.460883
## raceethHispanic                               -38.975486   5.177743
## raceethMore than one race                     -16.922522   8.496268
## raceethNative Hawaiian/Other Pacific Islander  -5.101601  17.005696
## preschool                                      -4.463670   3.486055
## expectBachelors                                55.267080   4.293893
## motherHS                                        6.058774   6.091423
## motherBachelors                                12.638068   3.861457
## motherWork                                     -2.809101   3.521827
## fatherHS                                        4.018214   5.579269
## fatherBachelors                                16.929755   3.995253
## fatherWork                                      5.842798   4.395978
## selfBornUS                                     -3.806278   7.323718
## motherBornUS                                   -8.798153   6.587621
## fatherBornUS                                    4.306994   6.263875
## englishAtHome                                   8.035685   6.859492
## computerForSchoolwork                          22.500232   5.702562
## read30MinsADay                                 34.871924   3.408447
## minutesPerWeekEnglish                           0.012788   0.010712
## studentsInEnglish                              -0.286631   0.227819
## schoolHasLibrary                               12.215085   9.264884
## publicSchool                                  -16.857475   6.725614
## urban                                          -0.110132   3.962724
## schoolSize                                      0.006540   0.002197
##                                               t value Pr(>|t|)    
## (Intercept)                                     4.248 2.24e-05 ***
## grade                                          10.057  < 2e-16 ***
## male                                           -4.601 4.42e-06 ***
## raceethAmerican Indian/Alaska Native           -4.008 6.32e-05 ***
## raceethAsian                                   -0.446  0.65578    
## raceethBlack                                  -12.271  < 2e-16 ***
## raceethHispanic                                -7.528 7.29e-14 ***
## raceethMore than one race                      -1.992  0.04651 *  
## raceethNative Hawaiian/Other Pacific Islander  -0.300  0.76421    
## preschool                                      -1.280  0.20052    
## expectBachelors                                12.871  < 2e-16 ***
## motherHS                                        0.995  0.32001    
## motherBachelors                                 3.273  0.00108 ** 
## motherWork                                     -0.798  0.42517    
## fatherHS                                        0.720  0.47147    
## fatherBachelors                                 4.237 2.35e-05 ***
## fatherWork                                      1.329  0.18393    
## selfBornUS                                     -0.520  0.60331    
## motherBornUS                                   -1.336  0.18182    
## fatherBornUS                                    0.688  0.49178    
## englishAtHome                                   1.171  0.24153    
## computerForSchoolwork                           3.946 8.19e-05 ***
## read30MinsADay                                 10.231  < 2e-16 ***
## minutesPerWeekEnglish                           1.194  0.23264    
## studentsInEnglish                              -1.258  0.20846    
## schoolHasLibrary                                1.318  0.18749    
## publicSchool                                   -2.506  0.01226 *  
## urban                                          -0.028  0.97783    
## schoolSize                                      2.977  0.00294 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 73.81 on 2385 degrees of freedom
## Multiple R-squared:  0.3251, Adjusted R-squared:  0.3172 
## F-statistic: 41.04 on 28 and 2385 DF,  p-value: < 2.2e-16
# Training-set RMSE: square root of the mean squared residual
SSE = sum(lmScore$residuals^2)
sqrt(SSE / nrow(pisaTrain))
## [1] 73.36555
# Same quantity in one step
sqrt(mean(lmScore$residuals^2))
## [1] 73.36555
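The training RMSE (73.37) is slightly smaller than the residual standard error that `summary()` reports (73.81). The two use the same sum of squared residuals but divide by different numbers: RMSE divides by the number of observations (2414), while the residual standard error divides by the residual degrees of freedom (2385). A minimal sketch of that relationship, using R's built-in `mtcars` data and an arbitrary model chosen only for illustration:

```r
# Illustration on the built-in mtcars data (not the PISA data)
fit <- lm(mpg ~ wt + hp, data = mtcars)

n   <- nrow(mtcars)
sse <- sum(residuals(fit)^2)

rmse <- sqrt(sse / n)                  # divides by the number of observations
rse  <- sqrt(sse / df.residual(fit))   # divides by n minus the number of coefficients

c(RMSE = rmse, RSE = rse, fromSummary = summary(fit)$sigma)
```

The `RSE` and `fromSummary` entries should agree, and `RMSE` is always the smaller of the two whenever the model has at least one coefficient.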
# Predict reading scores for the test set
predTest = predict(lmScore, newdata = pisaTest)
summary(predTest)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   353.2   482.0   524.0   516.7   555.7   637.7
# Range of the test-set predictions (max minus min from the summary above)
637.7 - 353.2
## [1] 284.5
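# Test-set sum of squared errors and RMSE
SSE = sum((predTest - pisaTest$readingScore)^2)
RMSE = sqrt(mean((predTest - pisaTest$readingScore)^2))
# Baseline: predict the training-set mean for every test student
SST = sum((mean(pisaTrain$readingScore) - pisaTest$readingScore)^2)
# Test-set R-squared
1 - SSE/SST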
## [1] 0.2614944
mean(pisaTrain$readingScore)
## [1] 517.9629
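The test-set R-squared (0.2615) is lower than the training R-squared (0.3251), and its baseline is the training-set mean (517.96), since that is the only mean available when predictions are made. The calculation can be wrapped in a small helper; `testR2` and the simulated data below are hypothetical and serve only as a sketch:

```r
set.seed(1)  # hypothetical simulated data, not the PISA data
n <- 200
x <- rnorm(n)
y <- 2 * x + rnorm(n)
dat <- data.frame(x = x, y = y)

train <- dat[1:150, ]
test  <- dat[151:200, ]

fit  <- lm(y ~ x, data = train)
pred <- predict(fit, newdata = test)

# Out-of-sample R-squared against a baseline that always predicts trainMean
testR2 <- function(pred, actual, trainMean) {
  sse <- sum((pred - actual)^2)
  sst <- sum((trainMean - actual)^2)
  1 - sse / sst
}

testR2(pred, test$y, mean(train$y))
```

Unlike the in-sample R-squared of a model with an intercept, this quantity can be negative: that happens when the model predicts the test set worse than the constant baseline does.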
