quick TIP for R Markdown: echo = FALSE will supress the code abut print the plot

This R Markdown style Notebook contains notes from the course ‘Analytical Edge’ offered by MIT online. This particular chapter focuses on ‘Linear Regression’.

Motivation: To understand all the nuances of using R, as offered by this course and to get better at note-keeping so few years down the line, if I need to revise this topic, I can do it rather quickly (~30 minutes or less) and be completely comfortable.

This notebook will contain notes from the lecture and solution to only interesting questions from the assignment.

=============================================================== ========================= LECTURE============================== ===============================================================

# setting up working directory
setwd("~/Dropbox/R_exercises/r4ds/r4ds/01 Linear Regression")

quick definitions:

==========WINE LECTURE==========

for the wine price example, tutor showed that adding variables to the model increased R^2 but with diminishig returns meaning as we kept adding more variables to the model, the R^2 increased but only slightly.

(side note on above: this is true for model (or read train dataset) R^2 only. When model is applied to a test data set, the resultant R^2 may decrease or even become negative as we add/remove independent variables from the model.)

# loading required package 
library(tidyverse)
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag():    dplyr, stats
# reading the wine.csv file 
(wine <- read_csv("wine.csv"))
## Parsed with column specification:
## cols(
##   Year = col_integer(),
##   Price = col_double(),
##   WinterRain = col_integer(),
##   AGST = col_double(),
##   HarvestRain = col_integer(),
##   Age = col_integer(),
##   FrancePop = col_double()
## )
## # A tibble: 25 × 7
##     Year  Price WinterRain    AGST HarvestRain   Age FrancePop
##    <int>  <dbl>      <int>   <dbl>       <int> <int>     <dbl>
## 1   1952 7.4950        600 17.1167         160    31  43183.57
## 2   1953 8.0393        690 16.7333          80    30  43495.03
## 3   1955 7.6858        502 17.1500         130    28  44217.86
## 4   1957 6.9845        420 16.1333         110    26  45152.25
## 5   1958 6.7772        582 16.4167         187    25  45653.81
## 6   1959 8.0757        485 17.4833         187    24  46128.64
## 7   1960 6.5188        763 16.4167         290    23  46584.00
## 8   1961 8.4937        830 17.3333          38    22  47128.00
## 9   1962 7.3880        697 16.3000          52    21  48088.67
## 10  1963 6.7127        608 15.7167         155    20  48798.99
## # ... with 15 more rows
# building the model to predict price of wine dependent on AGST
(price_wine <- lm(Price ~ AGST, data = wine))
## 
## Call:
## lm(formula = Price ~ AGST, data = wine)
## 
## Coefficients:
## (Intercept)         AGST  
##     -3.4178       0.6351
summary(price_wine)
## 
## Call:
## lm(formula = Price ~ AGST, data = wine)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.78450 -0.23882 -0.03727  0.38992  0.90318 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -3.4178     2.4935  -1.371 0.183710    
## AGST          0.6351     0.1509   4.208 0.000335 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4993 on 23 degrees of freedom
## Multiple R-squared:  0.435,  Adjusted R-squared:  0.4105 
## F-statistic: 17.71 on 1 and 23 DF,  p-value: 0.000335

To calculate SSE, first we need to calcuate ‘residuals’, which can be calculated as

# to access residulas in a model
(price_wine$residuals)
##           1           2           3           4           5           6 
##  0.04204258  0.82983774  0.21169394  0.15609432 -0.23119140  0.38991701 
##           7           8           9          10          11          12 
## -0.48959140  0.90318115  0.45372410  0.14887461 -0.23882157 -0.08974238 
##          13          14          15          16          17          18 
##  0.66185660 -0.05211511 -0.62726647 -0.74714947  0.42113502 -0.03727441 
##          19          20          21          22          23          24 
##  0.10685278 -0.78450270 -0.64017590 -0.05508720 -0.67055321 -0.22040381 
##          25 
##  0.55866518
# to calculate SSE
(SSE = sum(price_wine$residuals^2))
## [1] 5.734875

How to use ‘adjusted R^2’ to determine which variables could be used in the model. Keep an eye on ‘correlation’ among all variables given. We found that population of france and age of wine have a perfect negative correlation.

This means that Age of Wine and Population of France have multicollinearity issue. Using these two variables together will confuse the accuracy of the model, if we drop both, the R^2 will decrease which is not good as well. Intuitively we know that age of wine should be a better predictor of price than the population of france. So we keep the former while drop the later from the model.

Predicting wine prices using available model:

# to build a model using wine dataset
(price_wine_train <- lm(Price ~ WinterRain + AGST + HarvestRain + Age, data = wine))
## 
## Call:
## lm(formula = Price ~ WinterRain + AGST + HarvestRain + Age, data = wine)
## 
## Coefficients:
## (Intercept)   WinterRain         AGST  HarvestRain          Age  
##   -3.429980     0.001076     0.607209    -0.003972     0.023931
# reading test data 
(wine_test <- read_csv("wine_test.csv"))
## Parsed with column specification:
## cols(
##   Year = col_integer(),
##   Price = col_double(),
##   WinterRain = col_integer(),
##   AGST = col_double(),
##   HarvestRain = col_integer(),
##   Age = col_integer(),
##   FrancePop = col_double()
## )
## # A tibble: 2 × 7
##    Year  Price WinterRain    AGST HarvestRain   Age FrancePop
##   <int>  <dbl>      <int>   <dbl>       <int> <int>     <dbl>
## 1  1979 6.9541        717 16.1667         122     4  54835.83
## 2  1980 6.4979        578 16.0000          74     3  55110.24
# predicting 'price' for 'test data' using model derived from 'train data'
# Note the use of NEWDATA 

# test_model <- predict(training_model, newdata = test_data)
(predict_price <- predict(price_wine_train, newdata = wine_test))
##        1        2 
## 6.768925 6.684910
# to calculate SSE, SST and R^2

# SSE = sum(  (actual_price_test - predict_price_test)^2  )
(SSE = sum((wine_test$Price - predict_price)^2))
## [1] 0.06926281
# SST = sum(  (actual_price_test - mean_wine_price_train)^2 )
# NOTE: we're using average price from training dataset as our baseline for calculating SST for test dataset. If we had to calculate SST for training dataset, we still will be using average price of training dataset
(SST = sum((wine_test$Price - mean(wine$Price))^2))
## [1] 0.3369269
(r_square <- 1 - SSE/SST)
## [1] 0.7944278

Model R^2 and Test R^2:

Basically the mechanism to calculate R^2 is same in both cases, which in turn is derived from SSE. SSE looks at difference between actual value and predicted value. In training dataset, the predicted value is the what you calculate using the regression equation itself (meaning for certain value of ‘x’ what should be value of ‘y’ given, y = mx + c and we know ‘m’ and ‘c’).

In test R^2, predicted values are calculated using ‘predict’ function as shown above

========================= ASSIGNMENT =========================

======================== CLIMATE CHANGE ========================

# reading input data
(climate <- read_csv("climate_change.csv"))
## Parsed with column specification:
## cols(
##   Year = col_integer(),
##   Month = col_integer(),
##   MEI = col_double(),
##   CO2 = col_double(),
##   CH4 = col_double(),
##   N2O = col_double(),
##   `CFC-11` = col_double(),
##   `CFC-12` = col_double(),
##   TSI = col_double(),
##   Aerosols = col_double(),
##   Temp = col_double()
## )
## # A tibble: 308 × 11
##     Year Month    MEI    CO2     CH4     N2O `CFC-11` `CFC-12`      TSI
##    <int> <int>  <dbl>  <dbl>   <dbl>   <dbl>    <dbl>    <dbl>    <dbl>
## 1   1983     5  2.556 345.96 1638.59 303.677  191.324  350.113 1366.102
## 2   1983     6  2.167 345.52 1633.71 303.746  192.057  351.848 1366.121
## 3   1983     7  1.741 344.15 1633.22 303.795  192.818  353.725 1366.285
## 4   1983     8  1.130 342.25 1631.35 303.839  193.602  355.633 1366.420
## 5   1983     9  0.428 340.17 1648.40 303.901  194.392  357.465 1366.234
## 6   1983    10  0.002 340.30 1663.79 303.970  195.171  359.174 1366.059
## 7   1983    11 -0.176 341.53 1658.23 304.032  195.921  360.758 1366.107
## 8   1983    12 -0.176 343.07 1654.31 304.082  196.609  362.174 1366.061
## 9   1984     1 -0.339 344.05 1658.98 304.130  197.219  363.359 1365.426
## 10  1984     2 -0.565 344.77 1656.48 304.194  197.759  364.296 1365.662
## # ... with 298 more rows, and 2 more variables: Aerosols <dbl>, Temp <dbl>
# Problem 1.1

# splitting data into training (<=2006) and testing set (>2006)
# Hint: to use subset
# new_dataset_name <- subset(original_dataset, logical_condition)

(climate_train <- subset(climate, Year <= 2006))
## # A tibble: 284 × 11
##     Year Month    MEI    CO2     CH4     N2O `CFC-11` `CFC-12`      TSI
##    <int> <int>  <dbl>  <dbl>   <dbl>   <dbl>    <dbl>    <dbl>    <dbl>
## 1   1983     5  2.556 345.96 1638.59 303.677  191.324  350.113 1366.102
## 2   1983     6  2.167 345.52 1633.71 303.746  192.057  351.848 1366.121
## 3   1983     7  1.741 344.15 1633.22 303.795  192.818  353.725 1366.285
## 4   1983     8  1.130 342.25 1631.35 303.839  193.602  355.633 1366.420
## 5   1983     9  0.428 340.17 1648.40 303.901  194.392  357.465 1366.234
## 6   1983    10  0.002 340.30 1663.79 303.970  195.171  359.174 1366.059
## 7   1983    11 -0.176 341.53 1658.23 304.032  195.921  360.758 1366.107
## 8   1983    12 -0.176 343.07 1654.31 304.082  196.609  362.174 1366.061
## 9   1984     1 -0.339 344.05 1658.98 304.130  197.219  363.359 1365.426
## 10  1984     2 -0.565 344.77 1656.48 304.194  197.759  364.296 1365.662
## # ... with 274 more rows, and 2 more variables: Aerosols <dbl>, Temp <dbl>
(climate_test <- subset(climate, Year > 2006))
## # A tibble: 24 × 11
##     Year Month    MEI    CO2     CH4     N2O `CFC-11` `CFC-12`      TSI
##    <int> <int>  <dbl>  <dbl>   <dbl>   <dbl>    <dbl>    <dbl>    <dbl>
## 1   2007     1  0.974 382.93 1799.66 320.561  248.372  539.206 1365.717
## 2   2007     2  0.510 383.81 1803.08 320.571  248.264  538.973 1365.715
## 3   2007     3  0.074 384.56 1803.10 320.548  247.997  538.811 1365.754
## 4   2007     4 -0.049 386.40 1802.11 320.518  247.574  538.586 1365.723
## 5   2007     5  0.183 386.58 1795.65 320.445  247.224  538.130 1365.693
## 6   2007     6 -0.358 386.05 1781.81 320.332  246.881  537.376 1365.762
## 7   2007     7 -0.290 384.49 1771.89 320.349  246.497  537.113 1365.751
## 8   2007     8 -0.440 382.00 1779.38 320.471  246.307  537.125 1365.757
## 9   2007     9 -1.162 380.90 1794.21 320.618  246.214  537.281 1365.716
## 10  2007    10 -1.142 381.14 1802.38 320.855  246.189  537.380 1365.739
## # ... with 14 more rows, and 2 more variables: Aerosols <dbl>, Temp <dbl>
# building a model using training data set
(temp_model <- lm(Temp ~ MEI + CO2 + CH4 + N2O + `CFC-11` + `CFC-12` + TSI + Aerosols, data = climate_train))
## 
## Call:
## lm(formula = Temp ~ MEI + CO2 + CH4 + N2O + `CFC-11` + `CFC-12` + 
##     TSI + Aerosols, data = climate_train)
## 
## Coefficients:
## (Intercept)          MEI          CO2          CH4          N2O  
##  -1.246e+02    6.421e-02    6.457e-03    1.240e-04   -1.653e-02  
##    `CFC-11`     `CFC-12`          TSI     Aerosols  
##  -6.630e-03    3.808e-03    9.314e-02   -1.538e+00
# question about calculating r_squared value for temp_model - we should be able to get that from summary(model)
summary(temp_model)
## 
## Call:
## lm(formula = Temp ~ MEI + CO2 + CH4 + N2O + `CFC-11` + `CFC-12` + 
##     TSI + Aerosols, data = climate_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.25888 -0.05913 -0.00082  0.05649  0.32433 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.246e+02  1.989e+01  -6.265 1.43e-09 ***
## MEI          6.421e-02  6.470e-03   9.923  < 2e-16 ***
## CO2          6.457e-03  2.285e-03   2.826  0.00505 ** 
## CH4          1.240e-04  5.158e-04   0.240  0.81015    
## N2O         -1.653e-02  8.565e-03  -1.930  0.05467 .  
## `CFC-11`    -6.631e-03  1.626e-03  -4.078 5.96e-05 ***
## `CFC-12`     3.808e-03  1.014e-03   3.757  0.00021 ***
## TSI          9.314e-02  1.475e-02   6.313 1.10e-09 ***
## Aerosols    -1.538e+00  2.133e-01  -7.210 5.41e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.09171 on 275 degrees of freedom
## Multiple R-squared:  0.7509, Adjusted R-squared:  0.7436 
## F-statistic: 103.6 on 8 and 275 DF,  p-value: < 2.2e-16

Now we look at correlation between all variables in training dataset.The matrix below shows that some of the variables have high correlation (>=0.7). So issue of multicollinearity creeps in

(cor(climate_train))
##                 Year         Month           MEI         CO2         CH4
## Year      1.00000000 -0.0279419602 -0.0369876842  0.98274939  0.91565945
## Month    -0.02794196  1.0000000000  0.0008846905 -0.10673246  0.01856866
## MEI      -0.03698768  0.0008846905  1.0000000000 -0.04114717 -0.03341930
## CO2       0.98274939 -0.1067324607 -0.0411471651  1.00000000  0.87727963
## CH4       0.91565945  0.0185686624 -0.0334193014  0.87727963  1.00000000
## N2O       0.99384523  0.0136315303 -0.0508197755  0.97671982  0.89983864
## CFC-11    0.56910643 -0.0131112236  0.0690004387  0.51405975  0.77990402
## CFC-12    0.89701166  0.0006751102  0.0082855443  0.85268963  0.96361625
## TSI       0.17030201 -0.0346061935 -0.1544919227  0.17742893  0.24552844
## Aerosols -0.34524670  0.0148895406  0.3402377871 -0.35615480 -0.26780919
## Temp      0.78679714 -0.0998567411  0.1724707512  0.78852921  0.70325502
##                  N2O      CFC-11        CFC-12         TSI    Aerosols
## Year      0.99384523  0.56910643  0.8970116635  0.17030201 -0.34524670
## Month     0.01363153 -0.01311122  0.0006751102 -0.03460619  0.01488954
## MEI      -0.05081978  0.06900044  0.0082855443 -0.15449192  0.34023779
## CO2       0.97671982  0.51405975  0.8526896272  0.17742893 -0.35615480
## CH4       0.89983864  0.77990402  0.9636162478  0.24552844 -0.26780919
## N2O       1.00000000  0.52247732  0.8679307757  0.19975668 -0.33705457
## CFC-11    0.52247732  1.00000000  0.8689851828  0.27204596 -0.04392120
## CFC-12    0.86793078  0.86898518  1.0000000000  0.25530281 -0.22513124
## TSI       0.19975668  0.27204596  0.2553028138  1.00000000  0.05211651
## Aerosols -0.33705457 -0.04392120 -0.2251312440  0.05211651  1.00000000
## Temp      0.77863893  0.40771029  0.6875575483  0.24338269 -0.38491375
##                 Temp
## Year      0.78679714
## Month    -0.09985674
## MEI       0.17247075
## CO2       0.78852921
## CH4       0.70325502
## N2O       0.77863893
## CFC-11    0.40771029
## CFC-12    0.68755755
## TSI       0.24338269
## Aerosols -0.38491375
## Temp      1.00000000

To mitigate multicollinearity issue, we will drop few variables from the model and see how it works.

Now using only N2O, MEI, TSI, Aerosols as independent variable i.e. dropping all variables which have high correlation (>=0.7) with N2O.

(temp_model_N2O <- lm(Temp ~ N2O + MEI + TSI + Aerosols, data = climate_train))
## 
## Call:
## lm(formula = Temp ~ N2O + MEI + TSI + Aerosols, data = climate_train)
## 
## Coefficients:
## (Intercept)          N2O          MEI          TSI     Aerosols  
##  -116.22686      0.02532      0.06419      0.07949     -1.70174

Understanding STEP function. Its crucial to know which variables are relevant and which should be dropped. Doing this exercise by selecting or dropping one by one is a very tedious exercise. R provides a function “step” just to do that by caclculating AIC (although ‘step’ is not smart but little dumb as it focuses on AIC solely).

(best_temp_model <- step(temp_model))
## Start:  AIC=-1348.16
## Temp ~ MEI + CO2 + CH4 + N2O + `CFC-11` + `CFC-12` + TSI + Aerosols
## 
##            Df Sum of Sq    RSS     AIC
## - CH4       1   0.00049 2.3135 -1350.1
## <none>                  2.3130 -1348.2
## - N2O       1   0.03132 2.3443 -1346.3
## - CO2       1   0.06719 2.3802 -1342.0
## - `CFC-12`  1   0.11874 2.4318 -1335.9
## - `CFC-11`  1   0.13986 2.4529 -1333.5
## - TSI       1   0.33516 2.6482 -1311.7
## - Aerosols  1   0.43727 2.7503 -1301.0
## - MEI       1   0.82823 3.1412 -1263.2
## 
## Step:  AIC=-1350.1
## Temp ~ MEI + CO2 + N2O + `CFC-11` + `CFC-12` + TSI + Aerosols
## 
##            Df Sum of Sq    RSS     AIC
## <none>                  2.3135 -1350.1
## - N2O       1   0.03133 2.3448 -1348.3
## - CO2       1   0.06672 2.3802 -1344.0
## - `CFC-12`  1   0.13023 2.4437 -1336.5
## - `CFC-11`  1   0.13938 2.4529 -1335.5
## - TSI       1   0.33500 2.6485 -1313.7
## - Aerosols  1   0.43987 2.7534 -1302.7
## - MEI       1   0.83118 3.1447 -1264.9
## 
## Call:
## lm(formula = Temp ~ MEI + CO2 + N2O + `CFC-11` + `CFC-12` + TSI + 
##     Aerosols, data = climate_train)
## 
## Coefficients:
## (Intercept)          MEI          CO2          N2O     `CFC-11`  
##  -1.245e+02    6.407e-02    6.401e-03   -1.602e-02   -6.609e-03  
##    `CFC-12`          TSI     Aerosols  
##   3.868e-03    9.312e-02   -1.540e+00
summary(best_temp_model)
## 
## Call:
## lm(formula = Temp ~ MEI + CO2 + N2O + `CFC-11` + `CFC-12` + TSI + 
##     Aerosols, data = climate_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.25770 -0.05994 -0.00104  0.05588  0.32203 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.245e+02  1.985e+01  -6.273 1.37e-09 ***
## MEI          6.407e-02  6.434e-03   9.958  < 2e-16 ***
## CO2          6.402e-03  2.269e-03   2.821 0.005129 ** 
## N2O         -1.602e-02  8.287e-03  -1.933 0.054234 .  
## `CFC-11`    -6.609e-03  1.621e-03  -4.078 5.95e-05 ***
## `CFC-12`     3.868e-03  9.812e-04   3.942 0.000103 ***
## TSI          9.312e-02  1.473e-02   6.322 1.04e-09 ***
## Aerosols    -1.540e+00  2.126e-01  -7.244 4.36e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.09155 on 276 degrees of freedom
## Multiple R-squared:  0.7508, Adjusted R-squared:  0.7445 
## F-statistic: 118.8 on 7 and 276 DF,  p-value: < 2.2e-16

It is interesting to note that the step function does not address the collinearity of the variables, except that adding highly correlated variables will not improve the R2 significantly. The consequence of this is that the step function will not necessarily produce a very interpretable model - just a model that has balanced quality and simplicity for a particular weighting of quality and simplicity (AIC).

Now predicting temperatures for the ‘testing_dataset’ using the model derived above from step function. we will also calculate the r_square for the test data

(predicting_temp <- predict(best_temp_model, newdata = climate_test))
##         1         2         3         4         5         6         7 
## 0.4677808 0.4435404 0.4265541 0.4299162 0.4455113 0.4151422 0.4097367 
##         8         9        10        11        12        13        14 
## 0.3839390 0.3255595 0.3274147 0.3231401 0.3316704 0.3522134 0.3313129 
##        15        16        17        18        19        20        21 
## 0.3142112 0.3703410 0.4162213 0.4391458 0.4237965 0.3913679 0.3587615 
##        22        23        24 
## 0.3451991 0.3607087 0.3638076
(SSE <- sum((climate_test$Temp - predicting_temp)^2))
## [1] 0.2176444
(SST <- sum((climate_test$Temp - mean(climate_train$Temp))^2))
## [1] 0.5860189
(r_square <- 1 - SSE/SST)
## [1] 0.6286051

========================= READING TEST SCORES =================================

reading the train and test dataset

pisa_train <- read_csv("pisa2009train.csv")
## Parsed with column specification:
## cols(
##   .default = col_integer(),
##   raceeth = col_character(),
##   readingScore = col_double()
## )
## See spec(...) for full column specifications.
pisa_test <- read_csv("pisa2009test.csv")
## Parsed with column specification:
## cols(
##   .default = col_integer(),
##   raceeth = col_character(),
##   readingScore = col_double()
## )
## See spec(...) for full column specifications.

problem 1.2, to calculate average reading test score in training dataset for males and females. I dont want to use tapply() so I’ll be using tidyr functions

# reading score for males
pisa_train %>%
  filter(male == 1) %>%
  summarise(reading_score = mean(readingScore, na.rm=TRUE))
## # A tibble: 1 × 1
##   reading_score
##           <dbl>
## 1      483.5325
# reading score for females
pisa_train %>%
  filter(male == 0) %>%
  summarise(reading_score = mean(readingScore, na.rm=TRUE))
## # A tibble: 1 × 1
##   reading_score
##           <dbl>
## 1      512.9406

problem 1.3 Locating missing values

# using summary() function we can look at which variables have missing values
summary(pisa_train)
##      grade            male          raceeth            preschool     
##  Min.   : 8.00   Min.   :0.0000   Length:3663        Min.   :0.0000  
##  1st Qu.:10.00   1st Qu.:0.0000   Class :character   1st Qu.:0.0000  
##  Median :10.00   Median :1.0000   Mode  :character   Median :1.0000  
##  Mean   :10.09   Mean   :0.5111                      Mean   :0.7228  
##  3rd Qu.:10.00   3rd Qu.:1.0000                      3rd Qu.:1.0000  
##  Max.   :12.00   Max.   :1.0000                      Max.   :1.0000  
##                                                      NA's   :56      
##  expectBachelors     motherHS    motherBachelors    motherWork    
##  Min.   :0.0000   Min.   :0.00   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:1.0000   1st Qu.:1.00   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :1.0000   Median :1.00   Median :0.0000   Median :1.0000  
##  Mean   :0.7859   Mean   :0.88   Mean   :0.3481   Mean   :0.7345  
##  3rd Qu.:1.0000   3rd Qu.:1.00   3rd Qu.:1.0000   3rd Qu.:1.0000  
##  Max.   :1.0000   Max.   :1.00   Max.   :1.0000   Max.   :1.0000  
##  NA's   :62       NA's   :97     NA's   :397      NA's   :93      
##     fatherHS      fatherBachelors    fatherWork       selfBornUS    
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:1.0000   1st Qu.:0.0000   1st Qu.:1.0000   1st Qu.:1.0000  
##  Median :1.0000   Median :0.0000   Median :1.0000   Median :1.0000  
##  Mean   :0.8593   Mean   :0.3319   Mean   :0.8531   Mean   :0.9313  
##  3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##  NA's   :245      NA's   :569      NA's   :233      NA's   :69      
##   motherBornUS     fatherBornUS    englishAtHome    computerForSchoolwork
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000       
##  1st Qu.:1.0000   1st Qu.:1.0000   1st Qu.:1.0000   1st Qu.:1.0000       
##  Median :1.0000   Median :1.0000   Median :1.0000   Median :1.0000       
##  Mean   :0.7725   Mean   :0.7668   Mean   :0.8717   Mean   :0.8994       
##  3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000       
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000       
##  NA's   :71       NA's   :113      NA's   :71       NA's   :65           
##  read30MinsADay   minutesPerWeekEnglish studentsInEnglish schoolHasLibrary
##  Min.   :0.0000   Min.   :   0.0        Min.   : 1.0      Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.: 225.0        1st Qu.:20.0      1st Qu.:1.0000  
##  Median :0.0000   Median : 250.0        Median :25.0      Median :1.0000  
##  Mean   :0.2899   Mean   : 266.2        Mean   :24.5      Mean   :0.9676  
##  3rd Qu.:1.0000   3rd Qu.: 300.0        3rd Qu.:30.0      3rd Qu.:1.0000  
##  Max.   :1.0000   Max.   :2400.0        Max.   :75.0      Max.   :1.0000  
##  NA's   :34       NA's   :186           NA's   :249       NA's   :143     
##   publicSchool        urban          schoolSize    readingScore  
##  Min.   :0.0000   Min.   :0.0000   Min.   : 100   Min.   :168.6  
##  1st Qu.:1.0000   1st Qu.:0.0000   1st Qu.: 712   1st Qu.:431.7  
##  Median :1.0000   Median :0.0000   Median :1212   Median :499.7  
##  Mean   :0.9339   Mean   :0.3849   Mean   :1369   Mean   :497.9  
##  3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1900   3rd Qu.:566.2  
##  Max.   :1.0000   Max.   :1.0000   Max.   :6694   Max.   :746.0  
##                                    NA's   :162

however, using tidyverse we can automate this process and run a look over all variables to see which ones have NA’s in them

# this will show the location of variables which have NA's and total of how many variables have NA in them
output <- vector("double", ncol(pisa_train))  # 1. output
for (i in seq_along(pisa_train)) {            # 2. sequence
  if (sum(is.na(pisa_train[[i]])) > 0) {      # 3. body
    output[[i]] = 1
  } else {
    output[[i]] = 0
  }
  
}
output
##  [1] 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 1 0
sum(output)
## [1] 19
# this will show name of all variables which have NA's in them

output <- vector("double", ncol(pisa_train))  # 1. output
for (i in seq_along(pisa_train)) {            # 2. sequence
  if (sum(is.na(pisa_train[[i]])) > 0) {      # 3. body
    output[[i]] = colnames(pisa_train)[i]
  } else {
    output[[i]] = 0
  }
  
}
output[output != 0]
##  [1] "raceeth"               "preschool"            
##  [3] "expectBachelors"       "motherHS"             
##  [5] "motherBachelors"       "motherWork"           
##  [7] "fatherHS"              "fatherBachelors"      
##  [9] "fatherWork"            "selfBornUS"           
## [11] "motherBornUS"          "fatherBornUS"         
## [13] "englishAtHome"         "computerForSchoolwork"
## [15] "read30MinsADay"        "minutesPerWeekEnglish"
## [17] "studentsInEnglish"     "schoolHasLibrary"     
## [19] "schoolSize"
# CAUTION : these are tidyverse functions. So make sure data frames are converted to tibbbles.

problem 1.4 - Removing missing values

# na.omit() function will drop all observations which have at least one NA
pisa_train <- na.omit(pisa_train)
pisa_test <- na.omit(pisa_test)

# another tedious (or better depending on problem at hand) method would be
# pisa_train %>% filter( !is.na(preschool), !is.na(expectBachelors)) 
# and so on. It's more useful when you're focusing on NAs in few variables instead of all

problem 3.1 - building a model

also releveling the ‘race’ factor variable

# Note since we used read_csv function to read files, read_csv doesn't automatically converts variable into factors. So we should first convert 'raceeth' variable into factor

pisa_train$raceeth <- as.factor(pisa_train$raceeth)
pisa_test$raceeth <- as.factor(pisa_test$raceeth)

pisa_train$raceeth <- relevel(pisa_train$raceeth, "White")
pisa_test$raceeth <- relevel(pisa_test$raceeth, "White")

now getting back to the model building

lm_score <- lm(readingScore ~ ., data = pisa_train)

# to get R^2 value
summary(lm_score)
## 
## Call:
## lm(formula = readingScore ~ ., data = pisa_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -247.44  -48.86    1.86   49.77  217.18 
## 
## Coefficients:
##                                                 Estimate Std. Error
## (Intercept)                                   143.766333  33.841226
## grade                                          29.542707   2.937399
## male                                          -14.521653   3.155926
## raceethAmerican Indian/Alaska Native          -67.277327  16.786935
## raceethAsian                                   -4.110325   9.220071
## raceethBlack                                  -67.012347   5.460883
## raceethHispanic                               -38.975486   5.177743
## raceethMore than one race                     -16.922522   8.496268
## raceethNative Hawaiian/Other Pacific Islander  -5.101601  17.005696
## preschool                                      -4.463670   3.486055
## expectBachelors                                55.267080   4.293893
## motherHS                                        6.058774   6.091423
## motherBachelors                                12.638068   3.861457
## motherWork                                     -2.809101   3.521827
## fatherHS                                        4.018214   5.579269
## fatherBachelors                                16.929755   3.995253
## fatherWork                                      5.842798   4.395978
## selfBornUS                                     -3.806278   7.323718
## motherBornUS                                   -8.798153   6.587621
## fatherBornUS                                    4.306994   6.263875
## englishAtHome                                   8.035685   6.859492
## computerForSchoolwork                          22.500232   5.702562
## read30MinsADay                                 34.871924   3.408447
## minutesPerWeekEnglish                           0.012788   0.010712
## studentsInEnglish                              -0.286631   0.227819
## schoolHasLibrary                               12.215085   9.264884
## publicSchool                                  -16.857475   6.725614
## urban                                          -0.110132   3.962724
## schoolSize                                      0.006540   0.002197
##                                               t value Pr(>|t|)    
## (Intercept)                                     4.248 2.24e-05 ***
## grade                                          10.057  < 2e-16 ***
## male                                           -4.601 4.42e-06 ***
## raceethAmerican Indian/Alaska Native           -4.008 6.32e-05 ***
## raceethAsian                                   -0.446  0.65578    
## raceethBlack                                  -12.271  < 2e-16 ***
## raceethHispanic                                -7.528 7.29e-14 ***
## raceethMore than one race                      -1.992  0.04651 *  
## raceethNative Hawaiian/Other Pacific Islander  -0.300  0.76421    
## preschool                                      -1.280  0.20052    
## expectBachelors                                12.871  < 2e-16 ***
## motherHS                                        0.995  0.32001    
## motherBachelors                                 3.273  0.00108 ** 
## motherWork                                     -0.798  0.42517    
## fatherHS                                        0.720  0.47147    
## fatherBachelors                                 4.237 2.35e-05 ***
## fatherWork                                      1.329  0.18393    
## selfBornUS                                     -0.520  0.60331    
## motherBornUS                                   -1.336  0.18182    
## fatherBornUS                                    0.688  0.49178    
## englishAtHome                                   1.171  0.24153    
## computerForSchoolwork                           3.946 8.19e-05 ***
## read30MinsADay                                 10.231  < 2e-16 ***
## minutesPerWeekEnglish                           1.194  0.23264    
## studentsInEnglish                              -1.258  0.20846    
## schoolHasLibrary                                1.318  0.18749    
## publicSchool                                   -2.506  0.01226 *  
## urban                                          -0.028  0.97783    
## schoolSize                                      2.977  0.00294 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 73.81 on 2385 degrees of freedom
## Multiple R-squared:  0.3251, Adjusted R-squared:  0.3172 
## F-statistic: 41.04 on 28 and 2385 DF,  p-value: < 2.2e-16

To compute RMSE

sqrt(mean(lm_score$residuals^2))
## [1] 73.36555

problem 3.3 - when we have two identicals students in grade 9 vs grade 11, what will be the predicted score difference?

I think this is a GREAT question because its easy to visualize when you just have two variables in y = mx + c type settings but here too, the concept remains the same i.e. look at coefficients of the ‘grade’ variable. It’s 29 something so predicted score difference would be twice of that.

problem 3.4 - What is the meaning of the coefficient associated with variable raceethAsian?

GREAT question. Rememeber this is a particular level of a factor variable. So its quite important to understand what this really means.

Answer is: Predicted difference in the reading score between an Asian student and a white student who is otherwise identical.

Explanation

The only difference between an Asian student and white student with otherwise identical variables is that the former has raceethAsian=1 and the latter has raceethAsian=0. The predicted reading score for these two students will differ by the coefficient on the variable raceethAsian.

problem 4.1 - 4.4 - prediction on test dataset

lm_score_pred <- predict(lm_score, newdata = pisa_test)

# to calculate range between maximum and minimum predicted score
range(lm_score_pred, na.rm = TRUE)
## [1] 353.2231 637.6914
# SSE
(SSE <- sum((pisa_test$readingScore - lm_score_pred)^2))
## [1] 5762082
# RMSE
(RMSE <- sqrt(mean((lm_score_pred - pisa_test$readingScore)^2)))
## [1] 76.29079
# predicted test score in baseline model
baseline_pred_score <- mean(pisa_train$readingScore)

# SST of testing set
(SST <- sum((baseline_pred_score - pisa_test$readingScore)^2))
## [1] 7802354
# Test R-square

(test_r_square <- 1 - SSE/SST)
## [1] 0.2614944