This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.
When you click the Knit button, a document is generated that includes both the content and the output of any embedded R code chunks. You can embed an R code chunk like this:
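# stringsAsFactors = TRUE keeps raceeth a factor on R >= 4.0, where read.csv no longer converts strings by default
pisaTrain = read.csv("pisa2009train.csv", stringsAsFactors = TRUE)
pisaTest = read.csv("pisa2009test.csv", stringsAsFactors = TRUE)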
tapply(pisaTrain$readingScore,pisaTrain$male==1, mean)
## FALSE TRUE
## 512.9406 483.5325
pisaTrain = na.omit(pisaTrain)
pisaTest = na.omit(pisaTest)
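As a minimal sketch of what `na.omit` does here, the toy example below drops every row that contains at least one `NA` and records the dropped row indices in an `na.action` attribute. The data frame `toy` and its values are hypothetical and only serve as an illustration. The `str` output later in this document reports the same attribute for the real data (1249 rows omitted from the training set and 580 from the test set).

```r
# Hypothetical toy data frame; column names and values are illustrative only.
toy <- data.frame(score = c(500, NA, 450, 610),
                  male  = c(1, 0, NA, 1))

# na.omit removes rows 2 and 3, each of which has a missing value.
complete <- na.omit(toy)

nrow(toy) - nrow(complete)            # number of rows dropped
length(attr(complete, "na.action"))   # same count, read from the attribute
```

Dropping incomplete rows is only one way to handle missing data; it is simple, but it can discard a large share of the observations.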
summary(pisaTrain)
## grade male
## Min. : 8.00 Min. :0.0000
## 1st Qu.:10.00 1st Qu.:0.0000
## Median :10.00 Median :1.0000
## Mean :10.13 Mean :0.5012
## 3rd Qu.:10.00 3rd Qu.:1.0000
## Max. :12.00 Max. :1.0000
##
## raceeth preschool
## American Indian/Alaska Native : 20 Min. :0.0000
## Asian : 95 1st Qu.:0.0000
## Black : 228 Median :1.0000
## Hispanic : 500 Mean :0.7274
## More than one race : 81 3rd Qu.:1.0000
## Native Hawaiian/Other Pacific Islander: 20 Max. :1.0000
## White :1470
## expectBachelors motherHS motherBachelors motherWork
## Min. :0.0000 Min. :0.000 Min. :0.0000 Min. :0.0000
## 1st Qu.:1.0000 1st Qu.:1.000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :1.0000 Median :1.000 Median :0.0000 Median :1.0000
## Mean :0.8343 Mean :0.896 Mean :0.3637 Mean :0.7357
## 3rd Qu.:1.0000 3rd Qu.:1.000 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.000 Max. :1.0000 Max. :1.0000
##
## fatherHS fatherBachelors fatherWork selfBornUS
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:1.0000 1st Qu.:0.0000 1st Qu.:1.0000 1st Qu.:1.0000
## Median :1.0000 Median :0.0000 Median :1.0000 Median :1.0000
## Mean :0.8741 Mean :0.3484 Mean :0.8571 Mean :0.9362
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
##
## motherBornUS fatherBornUS englishAtHome computerForSchoolwork
## Min. :0.00 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:1.00 1st Qu.:1.0000 1st Qu.:1.0000 1st Qu.:1.0000
## Median :1.00 Median :1.0000 Median :1.0000 Median :1.0000
## Mean :0.79 Mean :0.7854 Mean :0.8815 Mean :0.9155
## 3rd Qu.:1.00 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :1.00 Max. :1.0000 Max. :1.0000 Max. :1.0000
##
## read30MinsADay minutesPerWeekEnglish studentsInEnglish schoolHasLibrary
## Min. :0.0000 Min. : 0.0 Min. : 1.00 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.: 225.0 1st Qu.:20.00 1st Qu.:1.0000
## Median :0.0000 Median : 250.0 Median :25.00 Median :1.0000
## Mean :0.3016 Mean : 269.8 Mean :24.56 Mean :0.9714
## 3rd Qu.:1.0000 3rd Qu.: 300.0 3rd Qu.:30.00 3rd Qu.:1.0000
## Max. :1.0000 Max. :1680.0 Max. :75.00 Max. :1.0000
##
## publicSchool urban schoolSize readingScore
## Min. :0.0000 Min. :0.0000 Min. : 100 Min. :244.5
## 1st Qu.:1.0000 1st Qu.:0.0000 1st Qu.: 712 1st Qu.:455.8
## Median :1.0000 Median :0.0000 Median :1233 Median :520.2
## Mean :0.9176 Mean :0.3629 Mean :1372 Mean :518.0
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1900 3rd Qu.:581.4
## Max. :1.0000 Max. :1.0000 Max. :6694 Max. :746.0
##
str(pisaTrain)
## 'data.frame': 2414 obs. of 24 variables:
## $ grade : int 11 10 10 10 10 10 10 10 11 9 ...
## $ male : int 1 0 1 0 1 0 0 0 1 1 ...
## $ raceeth : Factor w/ 7 levels "American Indian/Alaska Native",..: 7 3 4 7 5 4 7 4 7 7 ...
## $ preschool : int 0 1 1 1 1 1 1 1 1 1 ...
## $ expectBachelors : int 0 1 0 1 1 1 1 0 1 1 ...
## $ motherHS : int 1 0 1 1 1 1 1 0 1 1 ...
## $ motherBachelors : int 1 0 0 0 1 0 0 0 0 1 ...
## $ motherWork : int 1 1 1 0 1 1 1 0 0 1 ...
## $ fatherHS : int 1 1 1 1 0 1 1 0 1 1 ...
## $ fatherBachelors : int 0 0 0 0 0 0 1 0 1 1 ...
## $ fatherWork : int 1 1 0 1 1 0 1 1 1 1 ...
## $ selfBornUS : int 1 1 1 1 1 0 1 0 1 1 ...
## $ motherBornUS : int 1 1 1 1 1 0 1 0 1 1 ...
## $ fatherBornUS : int 1 1 0 1 1 0 1 0 1 1 ...
## $ englishAtHome : int 1 1 1 1 1 0 1 0 1 1 ...
## $ computerForSchoolwork: int 1 1 1 1 1 0 1 1 1 1 ...
## $ read30MinsADay : int 1 1 1 1 0 1 1 1 0 0 ...
## $ minutesPerWeekEnglish: int 450 200 250 300 294 232 225 270 275 225 ...
## $ studentsInEnglish : int 25 23 35 30 24 14 20 25 30 15 ...
## $ schoolHasLibrary : int 1 1 1 1 1 1 1 1 1 1 ...
## $ publicSchool : int 1 1 1 1 1 1 1 1 1 0 ...
## $ urban : int 0 1 1 0 0 0 0 1 1 1 ...
## $ schoolSize : int 1173 2640 1095 1913 899 1733 149 1400 1988 915 ...
## $ readingScore : num 575 458 614 439 466 ...
## - attr(*, "na.action")=Class 'omit' Named int [1:1249] 1 3 6 7 9 11 13 21 29 30 ...
## .. ..- attr(*, "names")= chr [1:1249] "1" "3" "6" "7" ...
str(pisaTest)
## 'data.frame': 990 obs. of 24 variables:
## $ grade : int 10 10 10 10 11 10 10 10 10 10 ...
## $ male : int 0 0 0 0 0 1 0 1 1 0 ...
## $ raceeth : Factor w/ 7 levels "American Indian/Alaska Native",..: 7 7 1 7 7 4 7 4 7 4 ...
## $ preschool : int 1 1 1 1 0 1 0 1 1 1 ...
## $ expectBachelors : int 0 1 0 0 0 1 1 0 1 1 ...
## $ motherHS : int 1 1 1 1 1 1 1 1 1 1 ...
## $ motherBachelors : int 1 0 0 0 1 1 0 0 1 0 ...
## $ motherWork : int 1 0 0 1 1 1 0 1 1 1 ...
## $ fatherHS : int 1 1 1 1 1 1 1 1 1 1 ...
## $ fatherBachelors : int 0 1 0 0 1 0 0 0 1 1 ...
## $ fatherWork : int 0 1 0 1 1 1 1 0 1 1 ...
## $ selfBornUS : int 1 1 1 1 1 1 1 1 1 1 ...
## $ motherBornUS : int 1 1 1 1 1 1 1 1 1 1 ...
## $ fatherBornUS : int 1 1 1 1 1 1 1 1 1 1 ...
## $ englishAtHome : int 1 1 1 1 1 1 1 1 1 1 ...
## $ computerForSchoolwork: int 1 1 1 1 1 1 1 1 1 1 ...
## $ read30MinsADay : int 0 0 1 1 1 1 0 0 0 1 ...
## $ minutesPerWeekEnglish: int 240 240 240 270 270 350 350 360 350 360 ...
## $ studentsInEnglish : int 30 30 30 35 30 25 27 28 25 27 ...
## $ schoolHasLibrary : int 1 1 1 1 1 1 1 1 1 1 ...
## $ publicSchool : int 1 1 1 1 1 1 1 1 1 1 ...
## $ urban : int 0 0 0 0 0 0 0 0 0 0 ...
## $ schoolSize : int 808 808 808 808 808 899 899 899 899 899 ...
## $ readingScore : num 355 454 405 665 605 ...
## - attr(*, "na.action")=Class 'omit' Named int [1:580] 2 3 4 6 12 16 17 19 22 23 ...
## .. ..- attr(*, "names")= chr [1:580] "2" "3" "4" "6" ...
pisaTrain$raceeth = relevel(pisaTrain$raceeth, "White")
pisaTest$raceeth = relevel(pisaTest$raceeth, "White")
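The `relevel` calls above change which factor level `lm` treats as the reference category, so each `raceeth` coefficient is read relative to "White". A small sketch with a hypothetical factor `race` shows the effect on the level order:

```r
# Hypothetical factor for illustration; levels default to alphabetical order.
race <- factor(c("Asian", "White", "Black", "White"))
levels(race)

# relevel moves the chosen level to the front, making it the reference level.
race2 <- relevel(race, ref = "White")
levels(race2)
```

Only the level order changes; the underlying values stay the same.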
lmScore = lm(readingScore ~ ., data=pisaTrain)
summary(lmScore)
##
## Call:
## lm(formula = readingScore ~ ., data = pisaTrain)
##
## Residuals:
## Min 1Q Median 3Q Max
## -247.44 -48.86 1.86 49.77 217.18
##
## Coefficients:
## Estimate Std. Error
## (Intercept) 143.766333 33.841226
## grade 29.542707 2.937399
## male -14.521653 3.155926
## raceethAmerican Indian/Alaska Native -67.277327 16.786935
## raceethAsian -4.110325 9.220071
## raceethBlack -67.012347 5.460883
## raceethHispanic -38.975486 5.177743
## raceethMore than one race -16.922522 8.496268
## raceethNative Hawaiian/Other Pacific Islander -5.101601 17.005696
## preschool -4.463670 3.486055
## expectBachelors 55.267080 4.293893
## motherHS 6.058774 6.091423
## motherBachelors 12.638068 3.861457
## motherWork -2.809101 3.521827
## fatherHS 4.018214 5.579269
## fatherBachelors 16.929755 3.995253
## fatherWork 5.842798 4.395978
## selfBornUS -3.806278 7.323718
## motherBornUS -8.798153 6.587621
## fatherBornUS 4.306994 6.263875
## englishAtHome 8.035685 6.859492
## computerForSchoolwork 22.500232 5.702562
## read30MinsADay 34.871924 3.408447
## minutesPerWeekEnglish 0.012788 0.010712
## studentsInEnglish -0.286631 0.227819
## schoolHasLibrary 12.215085 9.264884
## publicSchool -16.857475 6.725614
## urban -0.110132 3.962724
## schoolSize 0.006540 0.002197
## t value Pr(>|t|)
## (Intercept) 4.248 2.24e-05 ***
## grade 10.057 < 2e-16 ***
## male -4.601 4.42e-06 ***
## raceethAmerican Indian/Alaska Native -4.008 6.32e-05 ***
## raceethAsian -0.446 0.65578
## raceethBlack -12.271 < 2e-16 ***
## raceethHispanic -7.528 7.29e-14 ***
## raceethMore than one race -1.992 0.04651 *
## raceethNative Hawaiian/Other Pacific Islander -0.300 0.76421
## preschool -1.280 0.20052
## expectBachelors 12.871 < 2e-16 ***
## motherHS 0.995 0.32001
## motherBachelors 3.273 0.00108 **
## motherWork -0.798 0.42517
## fatherHS 0.720 0.47147
## fatherBachelors 4.237 2.35e-05 ***
## fatherWork 1.329 0.18393
## selfBornUS -0.520 0.60331
## motherBornUS -1.336 0.18182
## fatherBornUS 0.688 0.49178
## englishAtHome 1.171 0.24153
## computerForSchoolwork 3.946 8.19e-05 ***
## read30MinsADay 10.231 < 2e-16 ***
## minutesPerWeekEnglish 1.194 0.23264
## studentsInEnglish -1.258 0.20846
## schoolHasLibrary 1.318 0.18749
## publicSchool -2.506 0.01226 *
## urban -0.028 0.97783
## schoolSize 2.977 0.00294 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 73.81 on 2385 degrees of freedom
## Multiple R-squared: 0.3251, Adjusted R-squared: 0.3172
## F-statistic: 41.04 on 28 and 2385 DF, p-value: < 2.2e-16
# Training-set RMSE: square root of SSE divided by the number of observations
SSE = sum(lmScore$residuals^2)
sqrt(SSE/nrow(pisaTrain))
## [1] 73.36555
sqrt(mean(lmScore$residuals^2))
## [1] 73.36555
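The training RMSE (73.37) is slightly smaller than the residual standard error reported by `summary(lmScore)` (73.81) because the RMSE divides SSE by the number of observations, while the residual standard error divides by the residual degrees of freedom (2385 here). The sketch below illustrates the relationship on the built-in `mtcars` data; the model `mpg ~ wt + hp` is an arbitrary choice for illustration, not part of this analysis.

```r
# Illustrative model on a built-in dataset (an assumption, not the PISA model).
fit <- lm(mpg ~ wt + hp, data = mtcars)
sse <- sum(residuals(fit)^2)

rmse <- sqrt(sse / nrow(mtcars))       # divides by n
rse  <- sqrt(sse / df.residual(fit))   # divides by n minus number of coefficients

# The second quantity matches what summary() calls the residual standard error.
all.equal(rse, summary(fit)$sigma)
```

Applied to the numbers in this document, 73.36555 * sqrt(2414 / 2385) is approximately 73.81, which agrees with the summary output.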
predTest = predict(lmScore,newdata = pisaTest)
summary(predTest)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 353.2 482.0 524.0 516.7 555.7 637.7
637.7 - 353.2  # range of the test-set predictions (max minus min from the summary above)
## [1] 284.5
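# Test-set sum of squared errors and root mean squared error
SSE = sum((predTest - pisaTest$readingScore)^2)
RMSE = sqrt(mean((predTest - pisaTest$readingScore)^2))
# Baseline: predict the training-set mean score for every test student
SST = sum((mean(pisaTrain$readingScore) - pisaTest$readingScore)^2)
# Test-set R-squared relative to that baseline
1 - SSE/SST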
## [1] 0.2614944
mean(pisaTrain$readingScore)
## [1] 517.9629
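The test-set R-squared above compares the model's squared error with that of a baseline that always predicts the training-set mean (517.9629). One way to package that calculation is sketched below. The function name `outOfSampleR2` and the `mtcars` split are hypothetical and used only for illustration.

```r
# Hypothetical helper (not part of the original analysis).
outOfSampleR2 <- function(actual, predicted, baseline) {
  sse <- sum((actual - predicted)^2)   # model error on held-out data
  sst <- sum((actual - baseline)^2)    # error of always predicting `baseline`
  1 - sse / sst
}

# Toy illustration on the built-in mtcars data, with an arbitrary split.
train <- mtcars[1:22, ]
test  <- mtcars[23:32, ]
fit   <- lm(mpg ~ wt, data = train)
pred  <- predict(fit, newdata = test)

r2 <- outOfSampleR2(test$mpg, pred, baseline = mean(train$mpg))
r2
```

With the objects in this document, `outOfSampleR2(pisaTest$readingScore, predTest, mean(pisaTrain$readingScore))` would correspond to the `1 - SSE/SST` computation. Because the baseline comes from the training data, this value can be negative when a model predicts worse than the training mean.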