quick TIP for R Markdown: echo = FALSE will suppress the code but print the plot
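For instance, a minimal chunk sketch (the plot call is just a hypothetical illustration using the built-in cars dataset):

```{r, echo=FALSE}
plot(cars)  # the plot is rendered, the code chunk itself is hidden
```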
This R Markdown style notebook contains notes from the course ‘The Analytics Edge’ offered by MIT online. This particular chapter focuses on ‘Linear Regression’.
Motivation: to understand all the nuances of using R as taught in this course, and to get better at note-keeping so that a few years down the line, if I need to revise this topic, I can do it quickly (~30 minutes or less) and be completely comfortable.
This notebook will contain notes from the lectures and solutions to only the more interesting questions from the assignments.
========================= LECTURE =========================
# setting up working directory
setwd("~/Dropbox/R_exercises/r4ds/r4ds/01 Linear Regression")
quick definitions:
========== WINE LECTURE ==========
for the wine price example, the instructor showed that adding variables to the model increased R^2 but with diminishing returns, meaning as we kept adding more variables to the model, R^2 increased but only slightly.
(side note on the above: this is true only for the model (i.e. training dataset) R^2. When the model is applied to a test dataset, the resulting R^2 may decrease or even become negative as we add/remove independent variables from the model.)
# loading required package
library(tidyverse)
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag(): dplyr, stats
# reading the wine.csv file
(wine <- read_csv("wine.csv"))
## Parsed with column specification:
## cols(
## Year = col_integer(),
## Price = col_double(),
## WinterRain = col_integer(),
## AGST = col_double(),
## HarvestRain = col_integer(),
## Age = col_integer(),
## FrancePop = col_double()
## )
## # A tibble: 25 × 7
## Year Price WinterRain AGST HarvestRain Age FrancePop
## <int> <dbl> <int> <dbl> <int> <int> <dbl>
## 1 1952 7.4950 600 17.1167 160 31 43183.57
## 2 1953 8.0393 690 16.7333 80 30 43495.03
## 3 1955 7.6858 502 17.1500 130 28 44217.86
## 4 1957 6.9845 420 16.1333 110 26 45152.25
## 5 1958 6.7772 582 16.4167 187 25 45653.81
## 6 1959 8.0757 485 17.4833 187 24 46128.64
## 7 1960 6.5188 763 16.4167 290 23 46584.00
## 8 1961 8.4937 830 17.3333 38 22 47128.00
## 9 1962 7.3880 697 16.3000 52 21 48088.67
## 10 1963 6.7127 608 15.7167 155 20 48798.99
## # ... with 15 more rows
# building the model to predict price of wine dependent on AGST
(price_wine <- lm(Price ~ AGST, data = wine))
##
## Call:
## lm(formula = Price ~ AGST, data = wine)
##
## Coefficients:
## (Intercept) AGST
## -3.4178 0.6351
summary(price_wine)
##
## Call:
## lm(formula = Price ~ AGST, data = wine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.78450 -0.23882 -0.03727 0.38992 0.90318
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.4178 2.4935 -1.371 0.183710
## AGST 0.6351 0.1509 4.208 0.000335 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4993 on 23 degrees of freedom
## Multiple R-squared: 0.435, Adjusted R-squared: 0.4105
## F-statistic: 17.71 on 1 and 23 DF, p-value: 0.000335
To calculate SSE, first we need to calculate the ‘residuals’ (actual value minus predicted value), which can be accessed as follows
# to access residuals in a model
(price_wine$residuals)
## 1 2 3 4 5 6
## 0.04204258 0.82983774 0.21169394 0.15609432 -0.23119140 0.38991701
## 7 8 9 10 11 12
## -0.48959140 0.90318115 0.45372410 0.14887461 -0.23882157 -0.08974238
## 13 14 15 16 17 18
## 0.66185660 -0.05211511 -0.62726647 -0.74714947 0.42113502 -0.03727441
## 19 20 21 22 23 24
## 0.10685278 -0.78450270 -0.64017590 -0.05508720 -0.67055321 -0.22040381
## 25
## 0.55866518
# to calculate SSE
(SSE = sum(price_wine$residuals^2))
## [1] 5.734875
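The course also reports RMSE (root mean squared error), defined as sqrt(SSE/N); a quick sketch building on the SSE above:

# RMSE: SSE averaged over the N = 25 observations, then square-rooted
(RMSE = sqrt(SSE / nrow(wine)))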
How do we use ‘adjusted R^2’ to determine which variables to keep in the model? Keep an eye on the ‘correlation’ among all the given variables. We found that the population of France and the age of wine have a nearly perfect negative correlation.
This means that age of wine and population of France have a multicollinearity issue. Using these two variables together muddles the model’s coefficients, but if we drop both, R^2 decreases, which is not good either. Intuitively we know that age of wine should be a better predictor of price than the population of France, so we keep the former and drop the latter from the model.
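A quick sanity check of that correlation (using the wine tibble loaded above; the value comes out strongly negative, very close to -1):

# correlation between age of wine and population of France
cor(wine$Age, wine$FrancePop)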
Predicting wine prices on unseen data using the trained model:
# to build a model using wine dataset
(price_wine_train <- lm(Price ~ WinterRain + AGST + HarvestRain + Age, data = wine))
##
## Call:
## lm(formula = Price ~ WinterRain + AGST + HarvestRain + Age, data = wine)
##
## Coefficients:
## (Intercept) WinterRain AGST HarvestRain Age
## -3.429980 0.001076 0.607209 -0.003972 0.023931
# reading test data
(wine_test <- read_csv("wine_test.csv"))
## Parsed with column specification:
## cols(
## Year = col_integer(),
## Price = col_double(),
## WinterRain = col_integer(),
## AGST = col_double(),
## HarvestRain = col_integer(),
## Age = col_integer(),
## FrancePop = col_double()
## )
## # A tibble: 2 × 7
## Year Price WinterRain AGST HarvestRain Age FrancePop
## <int> <dbl> <int> <dbl> <int> <int> <dbl>
## 1 1979 6.9541 717 16.1667 122 4 54835.83
## 2 1980 6.4979 578 16.0000 74 3 55110.24
# predicting 'price' for 'test data' using model derived from 'train data'
# Note the use of NEWDATA
# test_model <- predict(training_model, newdata = test_data)
(predict_price <- predict(price_wine_train, newdata = wine_test))
## 1 2
## 6.768925 6.684910
# to calculate SSE, SST and R^2
# SSE = sum( (actual_price_test - predict_price_test)^2 )
(SSE = sum((wine_test$Price - predict_price)^2))
## [1] 0.06926281
# SST = sum( (actual_price_test - mean_wine_price_train)^2 )
# NOTE: we use the average price from the training dataset as the baseline for calculating SST on the test dataset. If we had to calculate SST for the training dataset, we would still use the average price of the training dataset
(SST = sum((wine_test$Price - mean(wine$Price))^2))
## [1] 0.3369269
(r_square <- 1 - SSE/SST)
## [1] 0.7944278
Model R^2 and test R^2:
The mechanism to calculate R^2 is the same in both cases: it is derived from SSE, which looks at the difference between actual and predicted values. On the training dataset, the predicted value is what the regression equation itself produces (i.e. for a given value of ‘x’, the value of ‘y’ from y = mx + c, where we know ‘m’ and ‘c’).
For the test R^2, predicted values are calculated using the ‘predict’ function as shown above.
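As a sanity check, the training R^2 can be reproduced by hand from the multi-variable model above; a short sketch:

# both SSE and SST use the training data and its own mean as the baseline
SSE_train <- sum(price_wine_train$residuals^2)
SST_train <- sum((wine$Price - mean(wine$Price))^2)
1 - SSE_train / SST_train  # should match Multiple R-squared from summary(price_wine_train)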
========================= ASSIGNMENT =========================
======================== CLIMATE CHANGE ========================
# reading input data
(climate <- read_csv("climate_change.csv"))
## Parsed with column specification:
## cols(
## Year = col_integer(),
## Month = col_integer(),
## MEI = col_double(),
## CO2 = col_double(),
## CH4 = col_double(),
## N2O = col_double(),
## `CFC-11` = col_double(),
## `CFC-12` = col_double(),
## TSI = col_double(),
## Aerosols = col_double(),
## Temp = col_double()
## )
## # A tibble: 308 × 11
## Year Month MEI CO2 CH4 N2O `CFC-11` `CFC-12` TSI
## <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1983 5 2.556 345.96 1638.59 303.677 191.324 350.113 1366.102
## 2 1983 6 2.167 345.52 1633.71 303.746 192.057 351.848 1366.121
## 3 1983 7 1.741 344.15 1633.22 303.795 192.818 353.725 1366.285
## 4 1983 8 1.130 342.25 1631.35 303.839 193.602 355.633 1366.420
## 5 1983 9 0.428 340.17 1648.40 303.901 194.392 357.465 1366.234
## 6 1983 10 0.002 340.30 1663.79 303.970 195.171 359.174 1366.059
## 7 1983 11 -0.176 341.53 1658.23 304.032 195.921 360.758 1366.107
## 8 1983 12 -0.176 343.07 1654.31 304.082 196.609 362.174 1366.061
## 9 1984 1 -0.339 344.05 1658.98 304.130 197.219 363.359 1365.426
## 10 1984 2 -0.565 344.77 1656.48 304.194 197.759 364.296 1365.662
## # ... with 298 more rows, and 2 more variables: Aerosols <dbl>, Temp <dbl>
# Problem 1.1
# splitting data into training (<=2006) and testing set (>2006)
# Hint: to use subset
# new_dataset_name <- subset(original_dataset, logical_condition)
(climate_train <- subset(climate, Year <= 2006))
## # A tibble: 284 × 11
## Year Month MEI CO2 CH4 N2O `CFC-11` `CFC-12` TSI
## <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1983 5 2.556 345.96 1638.59 303.677 191.324 350.113 1366.102
## 2 1983 6 2.167 345.52 1633.71 303.746 192.057 351.848 1366.121
## 3 1983 7 1.741 344.15 1633.22 303.795 192.818 353.725 1366.285
## 4 1983 8 1.130 342.25 1631.35 303.839 193.602 355.633 1366.420
## 5 1983 9 0.428 340.17 1648.40 303.901 194.392 357.465 1366.234
## 6 1983 10 0.002 340.30 1663.79 303.970 195.171 359.174 1366.059
## 7 1983 11 -0.176 341.53 1658.23 304.032 195.921 360.758 1366.107
## 8 1983 12 -0.176 343.07 1654.31 304.082 196.609 362.174 1366.061
## 9 1984 1 -0.339 344.05 1658.98 304.130 197.219 363.359 1365.426
## 10 1984 2 -0.565 344.77 1656.48 304.194 197.759 364.296 1365.662
## # ... with 274 more rows, and 2 more variables: Aerosols <dbl>, Temp <dbl>
(climate_test <- subset(climate, Year > 2006))
## # A tibble: 24 × 11
## Year Month MEI CO2 CH4 N2O `CFC-11` `CFC-12` TSI
## <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2007 1 0.974 382.93 1799.66 320.561 248.372 539.206 1365.717
## 2 2007 2 0.510 383.81 1803.08 320.571 248.264 538.973 1365.715
## 3 2007 3 0.074 384.56 1803.10 320.548 247.997 538.811 1365.754
## 4 2007 4 -0.049 386.40 1802.11 320.518 247.574 538.586 1365.723
## 5 2007 5 0.183 386.58 1795.65 320.445 247.224 538.130 1365.693
## 6 2007 6 -0.358 386.05 1781.81 320.332 246.881 537.376 1365.762
## 7 2007 7 -0.290 384.49 1771.89 320.349 246.497 537.113 1365.751
## 8 2007 8 -0.440 382.00 1779.38 320.471 246.307 537.125 1365.757
## 9 2007 9 -1.162 380.90 1794.21 320.618 246.214 537.281 1365.716
## 10 2007 10 -1.142 381.14 1802.38 320.855 246.189 537.380 1365.739
## # ... with 14 more rows, and 2 more variables: Aerosols <dbl>, Temp <dbl>
# building a model using training data set
(temp_model <- lm(Temp ~ MEI + CO2 + CH4 + N2O + `CFC-11` + `CFC-12` + TSI + Aerosols, data = climate_train))
##
## Call:
## lm(formula = Temp ~ MEI + CO2 + CH4 + N2O + `CFC-11` + `CFC-12` +
## TSI + Aerosols, data = climate_train)
##
## Coefficients:
## (Intercept) MEI CO2 CH4 N2O
## -1.246e+02 6.421e-02 6.457e-03 1.240e-04 -1.653e-02
## `CFC-11` `CFC-12` TSI Aerosols
## -6.630e-03 3.808e-03 9.314e-02 -1.538e+00
# question about calculating r_squared value for temp_model - we should be able to get that from summary(model)
summary(temp_model)
##
## Call:
## lm(formula = Temp ~ MEI + CO2 + CH4 + N2O + `CFC-11` + `CFC-12` +
## TSI + Aerosols, data = climate_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.25888 -0.05913 -0.00082 0.05649 0.32433
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.246e+02 1.989e+01 -6.265 1.43e-09 ***
## MEI 6.421e-02 6.470e-03 9.923 < 2e-16 ***
## CO2 6.457e-03 2.285e-03 2.826 0.00505 **
## CH4 1.240e-04 5.158e-04 0.240 0.81015
## N2O -1.653e-02 8.565e-03 -1.930 0.05467 .
## `CFC-11` -6.631e-03 1.626e-03 -4.078 5.96e-05 ***
## `CFC-12` 3.808e-03 1.014e-03 3.757 0.00021 ***
## TSI 9.314e-02 1.475e-02 6.313 1.10e-09 ***
## Aerosols -1.538e+00 2.133e-01 -7.210 5.41e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.09171 on 275 degrees of freedom
## Multiple R-squared: 0.7509, Adjusted R-squared: 0.7436
## F-statistic: 103.6 on 8 and 275 DF, p-value: < 2.2e-16
Now we look at the correlation between all variables in the training dataset. The matrix below shows that some of the variables are highly correlated (>= 0.7), so the issue of multicollinearity creeps in.
(cor(climate_train))
## Year Month MEI CO2 CH4
## Year 1.00000000 -0.0279419602 -0.0369876842 0.98274939 0.91565945
## Month -0.02794196 1.0000000000 0.0008846905 -0.10673246 0.01856866
## MEI -0.03698768 0.0008846905 1.0000000000 -0.04114717 -0.03341930
## CO2 0.98274939 -0.1067324607 -0.0411471651 1.00000000 0.87727963
## CH4 0.91565945 0.0185686624 -0.0334193014 0.87727963 1.00000000
## N2O 0.99384523 0.0136315303 -0.0508197755 0.97671982 0.89983864
## CFC-11 0.56910643 -0.0131112236 0.0690004387 0.51405975 0.77990402
## CFC-12 0.89701166 0.0006751102 0.0082855443 0.85268963 0.96361625
## TSI 0.17030201 -0.0346061935 -0.1544919227 0.17742893 0.24552844
## Aerosols -0.34524670 0.0148895406 0.3402377871 -0.35615480 -0.26780919
## Temp 0.78679714 -0.0998567411 0.1724707512 0.78852921 0.70325502
## N2O CFC-11 CFC-12 TSI Aerosols
## Year 0.99384523 0.56910643 0.8970116635 0.17030201 -0.34524670
## Month 0.01363153 -0.01311122 0.0006751102 -0.03460619 0.01488954
## MEI -0.05081978 0.06900044 0.0082855443 -0.15449192 0.34023779
## CO2 0.97671982 0.51405975 0.8526896272 0.17742893 -0.35615480
## CH4 0.89983864 0.77990402 0.9636162478 0.24552844 -0.26780919
## N2O 1.00000000 0.52247732 0.8679307757 0.19975668 -0.33705457
## CFC-11 0.52247732 1.00000000 0.8689851828 0.27204596 -0.04392120
## CFC-12 0.86793078 0.86898518 1.0000000000 0.25530281 -0.22513124
## TSI 0.19975668 0.27204596 0.2553028138 1.00000000 0.05211651
## Aerosols -0.33705457 -0.04392120 -0.2251312440 0.05211651 1.00000000
## Temp 0.77863893 0.40771029 0.6875575483 0.24338269 -0.38491375
## Temp
## Year 0.78679714
## Month -0.09985674
## MEI 0.17247075
## CO2 0.78852921
## CH4 0.70325502
## N2O 0.77863893
## CFC-11 0.40771029
## CFC-12 0.68755755
## TSI 0.24338269
## Aerosols -0.38491375
## Temp 1.00000000
To mitigate the multicollinearity issue, we will drop a few variables from the model and see how it works.
Now using only N2O, MEI, TSI, and Aerosols as independent variables, i.e. dropping all variables which have a high correlation (>= 0.7) with N2O.
(temp_model_N2O <- lm(Temp ~ N2O + MEI + TSI + Aerosols, data = climate_train))
##
## Call:
## lm(formula = Temp ~ N2O + MEI + TSI + Aerosols, data = climate_train)
##
## Coefficients:
## (Intercept) N2O MEI TSI Aerosols
## -116.22686 0.02532 0.06419 0.07949 -1.70174
Understanding the STEP function. It’s crucial to know which variables are relevant and which should be dropped, but doing this by selecting or dropping variables one by one is very tedious. R provides the function “step” to do exactly that by calculating AIC (although ‘step’ is not particularly smart, as it focuses solely on AIC).
(best_temp_model <- step(temp_model))
## Start: AIC=-1348.16
## Temp ~ MEI + CO2 + CH4 + N2O + `CFC-11` + `CFC-12` + TSI + Aerosols
##
## Df Sum of Sq RSS AIC
## - CH4 1 0.00049 2.3135 -1350.1
## <none> 2.3130 -1348.2
## - N2O 1 0.03132 2.3443 -1346.3
## - CO2 1 0.06719 2.3802 -1342.0
## - `CFC-12` 1 0.11874 2.4318 -1335.9
## - `CFC-11` 1 0.13986 2.4529 -1333.5
## - TSI 1 0.33516 2.6482 -1311.7
## - Aerosols 1 0.43727 2.7503 -1301.0
## - MEI 1 0.82823 3.1412 -1263.2
##
## Step: AIC=-1350.1
## Temp ~ MEI + CO2 + N2O + `CFC-11` + `CFC-12` + TSI + Aerosols
##
## Df Sum of Sq RSS AIC
## <none> 2.3135 -1350.1
## - N2O 1 0.03133 2.3448 -1348.3
## - CO2 1 0.06672 2.3802 -1344.0
## - `CFC-12` 1 0.13023 2.4437 -1336.5
## - `CFC-11` 1 0.13938 2.4529 -1335.5
## - TSI 1 0.33500 2.6485 -1313.7
## - Aerosols 1 0.43987 2.7534 -1302.7
## - MEI 1 0.83118 3.1447 -1264.9
##
## Call:
## lm(formula = Temp ~ MEI + CO2 + N2O + `CFC-11` + `CFC-12` + TSI +
## Aerosols, data = climate_train)
##
## Coefficients:
## (Intercept) MEI CO2 N2O `CFC-11`
## -1.245e+02 6.407e-02 6.401e-03 -1.602e-02 -6.609e-03
## `CFC-12` TSI Aerosols
## 3.868e-03 9.312e-02 -1.540e+00
summary(best_temp_model)
##
## Call:
## lm(formula = Temp ~ MEI + CO2 + N2O + `CFC-11` + `CFC-12` + TSI +
## Aerosols, data = climate_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.25770 -0.05994 -0.00104 0.05588 0.32203
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.245e+02 1.985e+01 -6.273 1.37e-09 ***
## MEI 6.407e-02 6.434e-03 9.958 < 2e-16 ***
## CO2 6.402e-03 2.269e-03 2.821 0.005129 **
## N2O -1.602e-02 8.287e-03 -1.933 0.054234 .
## `CFC-11` -6.609e-03 1.621e-03 -4.078 5.95e-05 ***
## `CFC-12` 3.868e-03 9.812e-04 3.942 0.000103 ***
## TSI 9.312e-02 1.473e-02 6.322 1.04e-09 ***
## Aerosols -1.540e+00 2.126e-01 -7.244 4.36e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.09155 on 276 degrees of freedom
## Multiple R-squared: 0.7508, Adjusted R-squared: 0.7445
## F-statistic: 118.8 on 7 and 276 DF, p-value: < 2.2e-16
It is interesting to note that the step function does not address the collinearity of the variables, except that adding highly correlated variables will not improve the R^2 significantly. The consequence is that the step function will not necessarily produce a very interpretable model - just a model that balances quality and simplicity for a particular weighting of the two (AIC).
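For reference, step() searches in both directions by default; the search can be restricted through its direction argument, e.g.:

# backward elimination only, starting from the full model
step(temp_model, direction = "backward")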
Now predicting temperatures for the testing dataset using the model derived above from the step function. We will also calculate the R^2 for the test data.
(predicting_temp <- predict(best_temp_model, newdata = climate_test))
## 1 2 3 4 5 6 7
## 0.4677808 0.4435404 0.4265541 0.4299162 0.4455113 0.4151422 0.4097367
## 8 9 10 11 12 13 14
## 0.3839390 0.3255595 0.3274147 0.3231401 0.3316704 0.3522134 0.3313129
## 15 16 17 18 19 20 21
## 0.3142112 0.3703410 0.4162213 0.4391458 0.4237965 0.3913679 0.3587615
## 22 23 24
## 0.3451991 0.3607087 0.3638076
(SSE <- sum((climate_test$Temp - predicting_temp)^2))
## [1] 0.2176444
(SST <- sum((climate_test$Temp - mean(climate_train$Temp))^2))
## [1] 0.5860189
(r_square <- 1 - SSE/SST)
## [1] 0.6286051
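As a follow-up, the same test-set R^2 could be computed for the smaller temp_model_N2O built earlier; a sketch (output not shown here):

predict_n2o <- predict(temp_model_N2O, newdata = climate_test)
(SSE_n2o <- sum((climate_test$Temp - predict_n2o)^2))
(r_square_n2o <- 1 - SSE_n2o / SST)  # SST is the same baseline computed above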
========================= READING TEST SCORES =================================
reading the train and test datasets
pisa_train <- read_csv("pisa2009train.csv")
## Parsed with column specification:
## cols(
## .default = col_integer(),
## raceeth = col_character(),
## readingScore = col_double()
## )
## See spec(...) for full column specifications.
pisa_test <- read_csv("pisa2009test.csv")
## Parsed with column specification:
## cols(
## .default = col_integer(),
## raceeth = col_character(),
## readingScore = col_double()
## )
## See spec(...) for full column specifications.
problem 1.2 - to calculate the average reading test score in the training dataset for males and females. I don’t want to use tapply(), so I’ll be using dplyr functions instead
# reading score for males
pisa_train %>%
filter(male == 1) %>%
summarise(reading_score = mean(readingScore, na.rm=TRUE))
## # A tibble: 1 × 1
## reading_score
## <dbl>
## 1 483.5325
# reading score for females
pisa_train %>%
filter(male == 0) %>%
summarise(reading_score = mean(readingScore, na.rm=TRUE))
## # A tibble: 1 × 1
## reading_score
## <dbl>
## 1 512.9406
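A more concise alternative computes both means in one pass with group_by():

# one row per value of 'male' (0 = female, 1 = male)
pisa_train %>%
  group_by(male) %>%
  summarise(reading_score = mean(readingScore, na.rm = TRUE))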
problem 1.3 - locating missing values
# using summary() function we can look at which variables have missing values
summary(pisa_train)
## grade male raceeth preschool
## Min. : 8.00 Min. :0.0000 Length:3663 Min. :0.0000
## 1st Qu.:10.00 1st Qu.:0.0000 Class :character 1st Qu.:0.0000
## Median :10.00 Median :1.0000 Mode :character Median :1.0000
## Mean :10.09 Mean :0.5111 Mean :0.7228
## 3rd Qu.:10.00 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :12.00 Max. :1.0000 Max. :1.0000
## NA's :56
## expectBachelors motherHS motherBachelors motherWork
## Min. :0.0000 Min. :0.00 Min. :0.0000 Min. :0.0000
## 1st Qu.:1.0000 1st Qu.:1.00 1st Qu.:0.0000 1st Qu.:0.0000
## Median :1.0000 Median :1.00 Median :0.0000 Median :1.0000
## Mean :0.7859 Mean :0.88 Mean :0.3481 Mean :0.7345
## 3rd Qu.:1.0000 3rd Qu.:1.00 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.00 Max. :1.0000 Max. :1.0000
## NA's :62 NA's :97 NA's :397 NA's :93
## fatherHS fatherBachelors fatherWork selfBornUS
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:1.0000 1st Qu.:0.0000 1st Qu.:1.0000 1st Qu.:1.0000
## Median :1.0000 Median :0.0000 Median :1.0000 Median :1.0000
## Mean :0.8593 Mean :0.3319 Mean :0.8531 Mean :0.9313
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## NA's :245 NA's :569 NA's :233 NA's :69
## motherBornUS fatherBornUS englishAtHome computerForSchoolwork
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:1.0000 1st Qu.:1.0000 1st Qu.:1.0000 1st Qu.:1.0000
## Median :1.0000 Median :1.0000 Median :1.0000 Median :1.0000
## Mean :0.7725 Mean :0.7668 Mean :0.8717 Mean :0.8994
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## NA's :71 NA's :113 NA's :71 NA's :65
## read30MinsADay minutesPerWeekEnglish studentsInEnglish schoolHasLibrary
## Min. :0.0000 Min. : 0.0 Min. : 1.0 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.: 225.0 1st Qu.:20.0 1st Qu.:1.0000
## Median :0.0000 Median : 250.0 Median :25.0 Median :1.0000
## Mean :0.2899 Mean : 266.2 Mean :24.5 Mean :0.9676
## 3rd Qu.:1.0000 3rd Qu.: 300.0 3rd Qu.:30.0 3rd Qu.:1.0000
## Max. :1.0000 Max. :2400.0 Max. :75.0 Max. :1.0000
## NA's :34 NA's :186 NA's :249 NA's :143
## publicSchool urban schoolSize readingScore
## Min. :0.0000 Min. :0.0000 Min. : 100 Min. :168.6
## 1st Qu.:1.0000 1st Qu.:0.0000 1st Qu.: 712 1st Qu.:431.7
## Median :1.0000 Median :0.0000 Median :1212 Median :499.7
## Mean :0.9339 Mean :0.3849 Mean :1369 Mean :497.9
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1900 3rd Qu.:566.2
## Max. :1.0000 Max. :1.0000 Max. :6694 Max. :746.0
## NA's :162
however, we can automate this process by looping over all the variables to see which ones have NA’s in them
# this will show the location of variables which have NA's and total of how many variables have NA in them
output <- vector("double", ncol(pisa_train)) # 1. output
for (i in seq_along(pisa_train)) { # 2. sequence
if (sum(is.na(pisa_train[[i]])) > 0) { # 3. body
output[[i]] = 1
} else {
output[[i]] = 0
}
}
output
## [1] 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 1 0
sum(output)
## [1] 19
# this will show name of all variables which have NA's in them
output <- vector("double", ncol(pisa_train)) # 1. output
for (i in seq_along(pisa_train)) { # 2. sequence
if (sum(is.na(pisa_train[[i]])) > 0) { # 3. body
output[[i]] = colnames(pisa_train)[i]
} else {
output[[i]] = 0
}
}
output[output != 0]
## [1] "raceeth" "preschool"
## [3] "expectBachelors" "motherHS"
## [5] "motherBachelors" "motherWork"
## [7] "fatherHS" "fatherBachelors"
## [9] "fatherWork" "selfBornUS"
## [11] "motherBornUS" "fatherBornUS"
## [13] "englishAtHome" "computerForSchoolwork"
## [15] "read30MinsADay" "minutesPerWeekEnglish"
## [17] "studentsInEnglish" "schoolHasLibrary"
## [19] "schoolSize"
# NOTE: read_csv() returns tibbles, but the loops above use base-R subsetting, so they work on plain data frames too
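A vectorized base-R alternative avoids the explicit loops entirely; a short sketch:

# count NAs per column, then report which columns have any
na_counts <- colSums(is.na(pisa_train))
names(na_counts[na_counts > 0])  # variables containing NAs
sum(na_counts > 0)               # how many such variables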
problem 1.4 - Removing missing values
# na.omit() function will drop all observations which have at least one NA
pisa_train <- na.omit(pisa_train)
pisa_test <- na.omit(pisa_test)
# another method (more tedious, or better, depending on the problem at hand) would be
# pisa_train %>% filter(!is.na(preschool), !is.na(expectBachelors))
# and so on. It's more useful when you're focusing on NAs in a few variables instead of all of them
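The tidyverse equivalent of na.omit() is tidyr::drop_na(), which can also be restricted to specific columns; a sketch (kept commented out, like the alternative above):

# pisa_train <- drop_na(pisa_train)                   # drop rows with NA anywhere
# pisa_train %>% drop_na(preschool, expectBachelors)  # only consider these columns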
problem 3.1 - building a model
also releveling the ‘raceeth’ factor variable so that ‘White’ becomes the reference level
# Note: since we used the read_csv function to read the files, and read_csv doesn't automatically convert character variables into factors, we should first convert the 'raceeth' variable into a factor
pisa_train$raceeth <- as.factor(pisa_train$raceeth)
pisa_test$raceeth <- as.factor(pisa_test$raceeth)
pisa_train$raceeth <- relevel(pisa_train$raceeth, "White")
pisa_test$raceeth <- relevel(pisa_test$raceeth, "White")
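A quick check that the releveling worked (‘White’ should now be the first, i.e. reference, level):

levels(pisa_train$raceeth)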
now getting back to the model building
lm_score <- lm(readingScore ~ ., data = pisa_train)
# to get R^2 value
summary(lm_score)
##
## Call:
## lm(formula = readingScore ~ ., data = pisa_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -247.44 -48.86 1.86 49.77 217.18
##
## Coefficients:
## Estimate Std. Error
## (Intercept) 143.766333 33.841226
## grade 29.542707 2.937399
## male -14.521653 3.155926
## raceethAmerican Indian/Alaska Native -67.277327 16.786935
## raceethAsian -4.110325 9.220071
## raceethBlack -67.012347 5.460883
## raceethHispanic -38.975486 5.177743
## raceethMore than one race -16.922522 8.496268
## raceethNative Hawaiian/Other Pacific Islander -5.101601 17.005696
## preschool -4.463670 3.486055
## expectBachelors 55.267080 4.293893
## motherHS 6.058774 6.091423
## motherBachelors 12.638068 3.861457
## motherWork -2.809101 3.521827
## fatherHS 4.018214 5.579269
## fatherBachelors 16.929755 3.995253
## fatherWork 5.842798 4.395978
## selfBornUS -3.806278 7.323718
## motherBornUS -8.798153 6.587621
## fatherBornUS 4.306994 6.263875
## englishAtHome 8.035685 6.859492
## computerForSchoolwork 22.500232 5.702562
## read30MinsADay 34.871924 3.408447
## minutesPerWeekEnglish 0.012788 0.010712
## studentsInEnglish -0.286631 0.227819
## schoolHasLibrary 12.215085 9.264884
## publicSchool -16.857475 6.725614
## urban -0.110132 3.962724
## schoolSize 0.006540 0.002197
## t value Pr(>|t|)
## (Intercept) 4.248 2.24e-05 ***
## grade 10.057 < 2e-16 ***
## male -4.601 4.42e-06 ***
## raceethAmerican Indian/Alaska Native -4.008 6.32e-05 ***
## raceethAsian -0.446 0.65578
## raceethBlack -12.271 < 2e-16 ***
## raceethHispanic -7.528 7.29e-14 ***
## raceethMore than one race -1.992 0.04651 *
## raceethNative Hawaiian/Other Pacific Islander -0.300 0.76421
## preschool -1.280 0.20052
## expectBachelors 12.871 < 2e-16 ***
## motherHS 0.995 0.32001
## motherBachelors 3.273 0.00108 **
## motherWork -0.798 0.42517
## fatherHS 0.720 0.47147
## fatherBachelors 4.237 2.35e-05 ***
## fatherWork 1.329 0.18393
## selfBornUS -0.520 0.60331
## motherBornUS -1.336 0.18182
## fatherBornUS 0.688 0.49178
## englishAtHome 1.171 0.24153
## computerForSchoolwork 3.946 8.19e-05 ***
## read30MinsADay 10.231 < 2e-16 ***
## minutesPerWeekEnglish 1.194 0.23264
## studentsInEnglish -1.258 0.20846
## schoolHasLibrary 1.318 0.18749
## publicSchool -2.506 0.01226 *
## urban -0.028 0.97783
## schoolSize 2.977 0.00294 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 73.81 on 2385 degrees of freedom
## Multiple R-squared: 0.3251, Adjusted R-squared: 0.3172
## F-statistic: 41.04 on 28 and 2385 DF, p-value: < 2.2e-16
To compute RMSE on the training set:
sqrt(mean(lm_score$residuals^2))
## [1] 73.36555
problem 3.3 - when we have two otherwise identical students, one in grade 9 and one in grade 11, what will be the predicted score difference?
I think this is a GREAT question because it’s easy to visualize when you just have two variables in a y = mx + c type setting, but here too the concept remains the same: look at the coefficient of the ‘grade’ variable. It’s about 29.54, so the predicted score difference would be twice that, roughly 59.09 (see the sketch below).
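In code (using the fitted lm_score model from above):

# the grade coefficient times the two-grade gap (11 - 9)
2 * coef(lm_score)["grade"]  # roughly 59.09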
problem 3.4 - What is the meaning of the coefficient associated with variable raceethAsian?
GREAT question. Remember this is a particular level of a factor variable, so it’s quite important to understand what this really means.
Answer is: Predicted difference in the reading score between an Asian student and a white student who is otherwise identical.
Explanation
The only difference between an Asian student and a white student with otherwise identical variables is that the former has raceethAsian = 1 and the latter has raceethAsian = 0. The predicted reading scores for these two students will differ by the coefficient on the variable raceethAsian.
problem 4.1 - 4.4 - prediction on test dataset
lm_score_pred <- predict(lm_score, newdata = pisa_test)
# to look at the maximum and minimum predicted scores
range(lm_score_pred, na.rm = TRUE)
## [1] 353.2231 637.6914
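Since the problem asks for the difference between the maximum and minimum predictions, diff() on the range gives it directly:

# max minus min of the predicted scores: 637.6914 - 353.2231
diff(range(lm_score_pred, na.rm = TRUE))  # ~284.47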
# SSE
(SSE <- sum((pisa_test$readingScore - lm_score_pred)^2))
## [1] 5762082
# RMSE
(RMSE <- sqrt(mean((lm_score_pred - pisa_test$readingScore)^2)))
## [1] 76.29079
# predicted test score in baseline model
baseline_pred_score <- mean(pisa_train$readingScore)
# SST of testing set
(SST <- sum((baseline_pred_score - pisa_test$readingScore)^2))
## [1] 7802354
# Test R-square
(test_r_square <- 1 - SSE/SST)
## [1] 0.2614944