Student Performance

In this WPA, you will analyze data from a study on student performance in two classes: math and Portuguese. These data come from the UCI Machine Learning Repository at http://archive.ics.uci.edu/ml/datasets/Student+Performance#

Here is the data description (taken directly from the original website):

This data approach student achievement in secondary education of two Portuguese schools. The data attributes include student grades, demographic, social and school related features) and it was collected by using school reports and questionnaires. Two datasets are provided regarding the performance in two distinct subjects: Mathematics (mat) and Portuguese language (por). In [Cortez and Silva, 2008], the two datasets were modeled under binary/five-level classification and regression tasks. Important note: the target attribute G3 has a strong correlation with attributes G2 and G1. This occurs because G3 is the final year grade (issued at the 3rd period), while G1 and G2 correspond to the 1st and 2nd period grades. It is more difficult to predict G3 without G2 and G1, but such prediction is much more useful (see paper source for more details).

The data are located in two semi-colon (;) separated text files at http://nathanieldphillips.com/wp-content/uploads/2016/04/student-mat.csv (the math data), and http://nathanieldphillips.com/wp-content/uploads/2016/04/student-por.csv (the Portuguese data).

Here is how the first few rows of the math data should look:

head(student.math)
##   school sex age address famsize Pstatus Medu Fedu     Mjob     Fjob
## 1     GP   F  18       U     GT3       A    4    4  at_home  teacher
## 2     GP   F  17       U     GT3       T    1    1  at_home    other
## 3     GP   F  15       U     LE3       T    1    1  at_home    other
## 4     GP   F  15       U     GT3       T    4    2   health services
## 5     GP   F  16       U     GT3       T    3    3    other    other
## 6     GP   M  16       U     LE3       T    4    3 services    other
##       reason guardian traveltime studytime failures schoolsup famsup paid
## 1     course   mother          2         2        0       yes     no   no
## 2     course   father          1         2        0        no    yes   no
## 3      other   mother          1         2        3       yes     no  yes
## 4       home   mother          1         3        0        no    yes  yes
## 5       home   father          1         2        0        no    yes  yes
## 6 reputation   mother          1         2        0        no    yes  yes
##   activities nursery higher internet romantic famrel freetime goout Dalc
## 1         no     yes    yes       no       no      4        3     4    1
## 2         no      no    yes      yes       no      5        3     3    1
## 3         no     yes    yes      yes       no      4        3     2    2
## 4        yes     yes    yes      yes      yes      3        2     2    1
## 5         no     yes    yes       no       no      4        3     2    1
## 6        yes     yes    yes      yes       no      5        4     2    1
##   Walc health absences G1 G2 G3
## 1    1      3        6  5  6  6
## 2    1      3        4  5  5  6
## 3    3      3       10  7  8 10
## 4    1      5        2 15 14 15
## 5    2      5        4  6 10 10
## 6    2      5       10 15 15 15

Datafile description

Both datafiles have 33 columns. Here they are:

1 school - student’s school (binary: ‘GP’ - Gabriel Pereira or ‘MS’ - Mousinho da Silveira)

2 sex - student’s sex (binary: ‘F’ - female or ‘M’ - male)

3 age - student’s age (numeric: from 15 to 22)

4 address - student’s home address type (binary: ‘U’ - urban or ‘R’ - rural)

5 famsize - family size (binary: ‘LE3’ - less or equal to 3 or ‘GT3’ - greater than 3)

6 Pstatus - parent’s cohabitation status (binary: ‘T’ - living together or ‘A’ - apart)

7 Medu - mother’s education (numeric: 0 - none, 1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)

8 Fedu - father’s education (numeric: 0 - none, 1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)

9 Mjob - mother’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’)

10 Fjob - father’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’)

11 reason - reason to choose this school (nominal: close to ‘home’, school ‘reputation’, ‘course’ preference or ‘other’)

12 guardian - student’s guardian (nominal: ‘mother’, ‘father’ or ‘other’)

13 traveltime - home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour)

14 studytime - weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)

15 failures - number of past class failures (numeric: n if 1<=n<3, else 4)

16 schoolsup - extra educational support (binary: yes or no)

17 famsup - family educational support (binary: yes or no)

18 paid - extra paid classes within the course subject (Math or Portuguese) (binary: yes or no)

19 activities - extra-curricular activities (binary: yes or no)

20 nursery - attended nursery school (binary: yes or no)

21 higher - wants to take higher education (binary: yes or no)

22 internet - Internet access at home (binary: yes or no)

23 romantic - with a romantic relationship (binary: yes or no)

24 famrel - quality of family relationships (numeric: from 1 - very bad to 5 - excellent)

25 freetime - free time after school (numeric: from 1 - very low to 5 - very high)

26 goout - going out with friends (numeric: from 1 - very low to 5 - very high)

27 Dalc - workday alcohol consumption (numeric: from 1 - very low to 5 - very high)

28 Walc - weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)

29 health - current health status (numeric: from 1 - very bad to 5 - very good)

30 absences - number of school absences (numeric: from 0 to 93)

31 G1 - first period grade (numeric: from 0 to 20)

32 G2 - second period grade (numeric: from 0 to 20)

33 G3 - final grade (numeric: from 0 to 20, output target)

Data loading and preparation

A. Open your WPA.RProject and open a new script. Save the script with the name WPA8.R.

B. Using read.table(), load the two semi-colon (;) delimited text files containing the data into R and assign them to new objects called student.math and student.por respectively.

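# stringsAsFactors = TRUE keeps text columns as factors (the default before R 4.0),
#  which is what the summary() output later in this document assumes
student.math <- read.table("http://nathanieldphillips.com/wp-content/uploads/2016/04/student-mat.csv",
                      sep = ";",
                      header = TRUE,
                      stringsAsFactors = TRUE
                      )

student.por <- read.table("http://nathanieldphillips.com/wp-content/uploads/2016/04/student-por.csv",
                      sep = ";",
                      header = TRUE,
                      stringsAsFactors = TRUE
                      )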
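
If you want to check how read.table() handles the semicolon separator without downloading anything, here is a minimal sketch. It assumes a tiny made-up sample (demo.text and demo are hypothetical names, and the values are not rows from the real files) and uses the text argument of read.table() to parse a string.

```r
# A tiny made-up sample in the same semicolon-separated layout
# (hypothetical values, not rows from the real files)
demo.text <- "school;sex;age;G3
GP;F;18;6
GP;M;16;15
MS;F;17;10"

# text = lets read.table() parse a string instead of a file or URL
demo <- read.table(text = demo.text,
                   sep = ";",
                   header = TRUE
                   )

dim(demo)    # number of rows and columns
str(demo)    # column types
```

Running dim() on the real objects is a quick way to confirm that each one has the expected 33 columns.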

Understand the data

D. Look at the first few rows of the dataframes with the head() function to make sure they were imported correctly.

head(student.math) # Looks ok
##   school sex age address famsize Pstatus Medu Fedu     Mjob     Fjob
## 1     GP   F  18       U     GT3       A    4    4  at_home  teacher
## 2     GP   F  17       U     GT3       T    1    1  at_home    other
## 3     GP   F  15       U     LE3       T    1    1  at_home    other
## 4     GP   F  15       U     GT3       T    4    2   health services
## 5     GP   F  16       U     GT3       T    3    3    other    other
## 6     GP   M  16       U     LE3       T    4    3 services    other
##       reason guardian traveltime studytime failures schoolsup famsup paid
## 1     course   mother          2         2        0       yes     no   no
## 2     course   father          1         2        0        no    yes   no
## 3      other   mother          1         2        3       yes     no  yes
## 4       home   mother          1         3        0        no    yes  yes
## 5       home   father          1         2        0        no    yes  yes
## 6 reputation   mother          1         2        0        no    yes  yes
##   activities nursery higher internet romantic famrel freetime goout Dalc
## 1         no     yes    yes       no       no      4        3     4    1
## 2         no      no    yes      yes       no      5        3     3    1
## 3         no     yes    yes      yes       no      4        3     2    2
## 4        yes     yes    yes      yes      yes      3        2     2    1
## 5         no     yes    yes       no       no      4        3     2    1
## 6        yes     yes    yes      yes       no      5        4     2    1
##   Walc health absences G1 G2 G3
## 1    1      3        6  5  6  6
## 2    1      3        4  5  5  6
## 3    3      3       10  7  8 10
## 4    1      5        2 15 14 15
## 5    2      5        4  6 10 10
## 6    2      5       10 15 15 15
head(student.por) # Looks ok
##   school sex age address famsize Pstatus Medu Fedu     Mjob     Fjob
## 1     GP   F  18       U     GT3       A    4    4  at_home  teacher
## 2     GP   F  17       U     GT3       T    1    1  at_home    other
## 3     GP   F  15       U     LE3       T    1    1  at_home    other
## 4     GP   F  15       U     GT3       T    4    2   health services
## 5     GP   F  16       U     GT3       T    3    3    other    other
## 6     GP   M  16       U     LE3       T    4    3 services    other
##       reason guardian traveltime studytime failures schoolsup famsup paid
## 1     course   mother          2         2        0       yes     no   no
## 2     course   father          1         2        0        no    yes   no
## 3      other   mother          1         2        0       yes     no   no
## 4       home   mother          1         3        0        no    yes   no
## 5       home   father          1         2        0        no    yes   no
## 6 reputation   mother          1         2        0        no    yes   no
##   activities nursery higher internet romantic famrel freetime goout Dalc
## 1         no     yes    yes       no       no      4        3     4    1
## 2         no      no    yes      yes       no      5        3     3    1
## 3         no     yes    yes      yes       no      4        3     2    2
## 4        yes     yes    yes      yes      yes      3        2     2    1
## 5         no     yes    yes       no       no      4        3     2    1
## 6        yes     yes    yes      yes       no      5        4     2    1
##   Walc health absences G1 G2 G3
## 1    1      3        4  0 11 11
## 2    1      3        2  9 11 11
## 3    3      3        6 12 13 12
## 4    1      5        0 14 14 14
## 5    2      5        0 11 13 13
## 6    2      5        6 12 12 13

E. Using the summary() function, look at summary statistics for each column in the dataframe. There should be 33 columns in each dataset. Make sure everything looks ok.

summary(student.math) # Looks ok
##  school   sex          age       address famsize   Pstatus      Medu      
##  GP:349   F:208   Min.   :15.0   R: 88   GT3:281   A: 41   Min.   :0.000  
##  MS: 46   M:187   1st Qu.:16.0   U:307   LE3:114   T:354   1st Qu.:2.000  
##                   Median :17.0                             Median :3.000  
##                   Mean   :16.7                             Mean   :2.749  
##                   3rd Qu.:18.0                             3rd Qu.:4.000  
##                   Max.   :22.0                             Max.   :4.000  
##       Fedu             Mjob           Fjob            reason   
##  Min.   :0.000   at_home : 59   at_home : 20   course    :145  
##  1st Qu.:2.000   health  : 34   health  : 18   home      :109  
##  Median :2.000   other   :141   other   :217   other     : 36  
##  Mean   :2.522   services:103   services:111   reputation:105  
##  3rd Qu.:3.000   teacher : 58   teacher : 29                   
##  Max.   :4.000                                                 
##    guardian     traveltime      studytime        failures      schoolsup
##  father: 90   Min.   :1.000   Min.   :1.000   Min.   :0.0000   no :344  
##  mother:273   1st Qu.:1.000   1st Qu.:1.000   1st Qu.:0.0000   yes: 51  
##  other : 32   Median :1.000   Median :2.000   Median :0.0000            
##               Mean   :1.448   Mean   :2.035   Mean   :0.3342            
##               3rd Qu.:2.000   3rd Qu.:2.000   3rd Qu.:0.0000            
##               Max.   :4.000   Max.   :4.000   Max.   :3.0000            
##  famsup     paid     activities nursery   higher    internet  romantic 
##  no :153   no :214   no :194    no : 81   no : 20   no : 66   no :263  
##  yes:242   yes:181   yes:201    yes:314   yes:375   yes:329   yes:132  
##                                                                        
##                                                                        
##                                                                        
##                                                                        
##      famrel         freetime         goout            Dalc      
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:4.000   1st Qu.:3.000   1st Qu.:2.000   1st Qu.:1.000  
##  Median :4.000   Median :3.000   Median :3.000   Median :1.000  
##  Mean   :3.944   Mean   :3.235   Mean   :3.109   Mean   :1.481  
##  3rd Qu.:5.000   3rd Qu.:4.000   3rd Qu.:4.000   3rd Qu.:2.000  
##  Max.   :5.000   Max.   :5.000   Max.   :5.000   Max.   :5.000  
##       Walc           health         absences            G1       
##  Min.   :1.000   Min.   :1.000   Min.   : 0.000   Min.   : 3.00  
##  1st Qu.:1.000   1st Qu.:3.000   1st Qu.: 0.000   1st Qu.: 8.00  
##  Median :2.000   Median :4.000   Median : 4.000   Median :11.00  
##  Mean   :2.291   Mean   :3.554   Mean   : 5.709   Mean   :10.91  
##  3rd Qu.:3.000   3rd Qu.:5.000   3rd Qu.: 8.000   3rd Qu.:13.00  
##  Max.   :5.000   Max.   :5.000   Max.   :75.000   Max.   :19.00  
##        G2              G3       
##  Min.   : 0.00   Min.   : 0.00  
##  1st Qu.: 9.00   1st Qu.: 8.00  
##  Median :11.00   Median :11.00  
##  Mean   :10.71   Mean   :10.42  
##  3rd Qu.:13.00   3rd Qu.:14.00  
##  Max.   :19.00   Max.   :20.00
summary(student.por) # Looks ok
##  school   sex          age        address famsize   Pstatus
##  GP:423   F:383   Min.   :15.00   R:197   GT3:457   A: 80  
##  MS:226   M:266   1st Qu.:16.00   U:452   LE3:192   T:569  
##                   Median :17.00                            
##                   Mean   :16.74                            
##                   3rd Qu.:18.00                            
##                   Max.   :22.00                            
##       Medu            Fedu             Mjob           Fjob    
##  Min.   :0.000   Min.   :0.000   at_home :135   at_home : 42  
##  1st Qu.:2.000   1st Qu.:1.000   health  : 48   health  : 23  
##  Median :2.000   Median :2.000   other   :258   other   :367  
##  Mean   :2.515   Mean   :2.307   services:136   services:181  
##  3rd Qu.:4.000   3rd Qu.:3.000   teacher : 72   teacher : 36  
##  Max.   :4.000   Max.   :4.000                                
##         reason      guardian     traveltime      studytime    
##  course    :285   father:153   Min.   :1.000   Min.   :1.000  
##  home      :149   mother:455   1st Qu.:1.000   1st Qu.:1.000  
##  other     : 72   other : 41   Median :1.000   Median :2.000  
##  reputation:143                Mean   :1.569   Mean   :1.931  
##                                3rd Qu.:2.000   3rd Qu.:2.000  
##                                Max.   :4.000   Max.   :4.000  
##     failures      schoolsup famsup     paid     activities nursery  
##  Min.   :0.0000   no :581   no :251   no :610   no :334    no :128  
##  1st Qu.:0.0000   yes: 68   yes:398   yes: 39   yes:315    yes:521  
##  Median :0.0000                                                     
##  Mean   :0.2219                                                     
##  3rd Qu.:0.0000                                                     
##  Max.   :3.0000                                                     
##  higher    internet  romantic      famrel         freetime   
##  no : 69   no :151   no :410   Min.   :1.000   Min.   :1.00  
##  yes:580   yes:498   yes:239   1st Qu.:4.000   1st Qu.:3.00  
##                                Median :4.000   Median :3.00  
##                                Mean   :3.931   Mean   :3.18  
##                                3rd Qu.:5.000   3rd Qu.:4.00  
##                                Max.   :5.000   Max.   :5.00  
##      goout            Dalc            Walc          health     
##  Min.   :1.000   Min.   :1.000   Min.   :1.00   Min.   :1.000  
##  1st Qu.:2.000   1st Qu.:1.000   1st Qu.:1.00   1st Qu.:2.000  
##  Median :3.000   Median :1.000   Median :2.00   Median :4.000  
##  Mean   :3.185   Mean   :1.502   Mean   :2.28   Mean   :3.536  
##  3rd Qu.:4.000   3rd Qu.:2.000   3rd Qu.:3.00   3rd Qu.:5.000  
##  Max.   :5.000   Max.   :5.000   Max.   :5.00   Max.   :5.000  
##     absences            G1             G2              G3       
##  Min.   : 0.000   Min.   : 0.0   Min.   : 0.00   Min.   : 0.00  
##  1st Qu.: 0.000   1st Qu.:10.0   1st Qu.:10.00   1st Qu.:10.00  
##  Median : 2.000   Median :11.0   Median :11.00   Median :12.00  
##  Mean   : 3.659   Mean   :11.4   Mean   :11.57   Mean   :11.91  
##  3rd Qu.: 6.000   3rd Qu.:13.0   3rd Qu.:13.00   3rd Qu.:14.00  
##  Max.   :32.000   Max.   :19.0   Max.   :19.00   Max.   :19.00

Standard Regression with lm()

One IV

  1. For the math data, create a regression object predicting first period grade (G1) based on age.
obj.1 <- lm(G1 ~ age, data = student.math)
    1. Is there a significant relationship between age and G1?
summary(obj.1)
## 
## Call:
## lm(formula = G1 ~ age, data = student.math)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.6915 -2.7749 -0.1916  2.3085  8.3085 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  13.6919     2.1926   6.245  1.1e-09 ***
## age          -0.1667     0.1309  -1.273    0.204    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.317 on 393 degrees of freedom
## Multiple R-squared:  0.004106,   Adjusted R-squared:  0.001572 
## F-statistic:  1.62 on 1 and 393 DF,  p-value: 0.2038

Answer: No, there is not a significant relationship between age and G1, b = -0.17, t(393) = -1.27, p = 0.204.

    1. What is the estimate of the coefficient for age? How do you interpret this value?

Answer: The coefficient for age is -0.17. This means that for every increase of one year in age, we expect a decrease of 0.17 points in G1 scores (although this estimate is not significantly different from zero).
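
To see what the slope means in practice, here is a small sketch that plugs the rounded estimates from the summary(obj.1) output into the regression equation for a 15-year-old and a 16-year-old. The names b0, b1, and pred.g1 are just illustrative and are not part of the assignment.

```r
# Rounded estimates copied from summary(obj.1) above
b0 <- 13.6919   # intercept
b1 <- -0.1667   # slope for age

# Predicted G1 for a 15-year-old and a 16-year-old
pred.g1 <- b0 + b1 * c(15, 16)
pred.g1

# The difference between the two predictions is the slope itself
diff(pred.g1)
```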

  1. For the Portuguese data, create a regression object predicting first period grade (G1) based on age.
obj.2 <- lm(G1 ~ age, data = student.por)
    1. Is there a significant relationship between age and G1?
summary(obj.2)
## 
## Call:
## lm(formula = G1 ~ age, data = student.por)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.9057  -1.9057  -0.0843   1.7014   8.0943 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 17.97725    1.46468  12.274  < 2e-16 ***
## age         -0.39286    0.08724  -4.503 7.95e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.705 on 647 degrees of freedom
## Multiple R-squared:  0.03039,    Adjusted R-squared:  0.02889 
## F-statistic: 20.28 on 1 and 647 DF,  p-value: 7.946e-06

Answer: Yes, there is a significant negative relationship between age and G1 scores in the Portuguese data, b = -0.39, t(647) = -4.50, p < .001.

    1. What is the estimate of the coefficient for age? How do you interpret this value?

Answer: The coefficient for age is -0.39. This means that for every increase of one year in age, we expect a decrease of 0.39 points in G1 scores.

  1. For the math data, create a regression object called g1g3.mod predicting each student’s period 3 grade (G3) based on their period 1 grade (G1)
g1g3.mod <- lm(G3 ~ G1, data = student.math)
    1. Is there a significant relationship between G1 and G3?
summary(g1g3.mod)
## 
## Call:
## lm(formula = G3 ~ G1, data = student.math)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.6223  -0.8348   0.3777   1.6965   5.0153 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.65280    0.47475  -3.481 0.000555 ***
## G1           1.10626    0.04164  26.568  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.743 on 393 degrees of freedom
## Multiple R-squared:  0.6424, Adjusted R-squared:  0.6414 
## F-statistic: 705.8 on 1 and 393 DF,  p-value: < 2.2e-16

Answer: Yes, there is a significant positive relationship between G1 and G3 in math scores, b = 1.11, t(393) = 26.57, p < .01.

    1. Create a scatterplot showing the relationship between G1 and G3.
plot(x = student.math$G1,
     y = student.math$G3,
     pch = 16,
     col = gray(.1, .1),
     xlab = "Period 1 Scores",
     ylab = "Period 3 Scores",
     main = "Student Math Data"
     )

    1. Add a regression line to the scatterplot from your regression object.
plot(x = student.math$G1,
     y = student.math$G3,
     pch = 16,
     col = gray(.1, .1),
     xlab = "Period 1 Scores",
     ylab = "Period 3 Scores",
     main = "Student Math Data"
     )

abline(g1g3.mod, lty = 2)

Multiple IVs

  1. For the math data, create a regression object called math.mod1 predicting third period grade (G3) based on sex, age, internet, and failures
math.mod1 <- lm(G3 ~ sex + age + internet + failures,
                data = student.math
                )
    1. Which variables are significantly related to third period grade?
summary(math.mod1)
## 
## Call:
## lm(formula = G3 ~ sex + age + internet + failures, data = student.math)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.2156  -1.9523   0.0965   3.0252   9.4370 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  13.9962     2.9808   4.695 3.69e-06 ***
## sexM          1.0451     0.4282   2.441   0.0151 *  
## age          -0.2407     0.1735  -1.388   0.1660    
## internetyes   0.7855     0.5761   1.364   0.1735    
## failures     -2.1260     0.2966  -7.167 3.86e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.237 on 390 degrees of freedom
## Multiple R-squared:  0.1533, Adjusted R-squared:  0.1446 
## F-statistic: 17.65 on 4 and 390 DF,  p-value: 2.488e-13

Both sex (b = 1.05, t(390) = 2.44, p = 0.015) and failures (b = -2.13, t(390) = -7.17, p < .01) are significantly related to period 3 scores.

    1. What does the estimate for internet mean?

Holding sex, age, and failures constant, students with internet access are predicted to score 0.79 points higher in period 3 than students without internet access (however, the effect is not significant).
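
The coefficient is labelled internetyes because lm() turns the two-level factor into a 0/1 dummy variable, with "no" as the reference level. Here is a minimal sketch on a made-up toy dataframe (toy and mm are hypothetical names) showing the dummy column that model.matrix() builds:

```r
# Toy data (made up for illustration, not the student data)
toy <- data.frame(internet = factor(c("no", "yes", "yes", "no")))

# model.matrix() shows the design matrix lm() would use:
#  an intercept column plus a 0/1 column for internet == "yes"
mm <- model.matrix(~ internet, data = toy)
mm
```

The estimate for internetyes is therefore the predicted difference between a "yes" student and an otherwise identical "no" student.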

    1. Create a scatterplot showing the relationship between the true values of G3 and the model fits.
student.math$fitted.values <- math.mod1$fitted.values

plot(x = student.math$G3,
     y = student.math$fitted.values,
     pch = 16,
     col = gray(.1, .2),
     xlab = "True period 3 scores",
     ylab = "Model Fits",
     main = "Period 3 math score model fits"
     )

Note: The model does a pretty terrible job of predicting scores. It explains only about 15% of the variance in G3 (R-squared = 0.15).
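
One way to put a number on how closely model fits track the true values is the squared correlation between them, which for a linear model with an intercept equals the R-squared reported by summary(). Here is a minimal sketch on made-up toy data (toy, toy.mod, r2.by.hand, and r2.summary are hypothetical names, not part of the assignment):

```r
# Toy data (made up for illustration, not the student data)
set.seed(1)
toy <- data.frame(x = rnorm(50))
toy$y <- 2 + 0.5 * toy$x + rnorm(50)

toy.mod <- lm(y ~ x, data = toy)

# Squared correlation between observed values and model fits
r2.by.hand <- cor(toy$y, toy.mod$fitted.values)^2

# R-squared reported by summary()
r2.summary <- summary(toy.mod)$r.squared

# Check that the two agree (up to floating point error)
all.equal(r2.by.hand, r2.summary)
```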

    1. By hand, calculate the model estimated math grade for a Male student of age 15 with internet access and 3 previous class failures.
13.9962 + 1.0451 + 15 * (-0.2407) + 0.7855 + 3 * (-2.126)
## [1] 5.8383

A 15 year old, male student with internet access and 3 class failures is predicted to have a period 3 math grade of 5.84

    1. Test your by-hand prediction by creating a new dataframe of test data and using the predict() function.
newdata <- data.frame(sex = "M",
                      age = 15,
                      internet = "yes",
                      failures = 3
                      )

predict(object = math.mod1,
        newdata = newdata)
##        1 
## 5.837739

Yep! The predict() result (5.84) matches the by-hand calculation.
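
The by-hand calculation can also be written as a sum of coefficient-times-value products, which is easier to extend when a model has many predictors. This is just a sketch using the rounded estimates from the summary(math.mod1) output; b and x are illustrative names.

```r
# Rounded estimates copied from summary(math.mod1) above
b <- c(intercept = 13.9962, sexM = 1.0451, age = -0.2407,
       internetyes = 0.7855, failures = -2.1260)

# Predictor values for the student, in the same order.
#  The leading 1 multiplies the intercept, and the two dummy
#  variables are 1 because the student is male with internet access
x <- c(1, 1, 15, 1, 3)

# Multiply each coefficient by its value and add everything up
sum(b * x)
```

Because the coefficients are rounded to four decimals, the result differs slightly from what predict() returns with the full-precision estimates.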

Checkpoint!!!

Using models from one dataset to predict data from another

  1. Create a new regression object called por.mod1 using the same predictors as math.mod1 (sex, age, internet, and failures); however, this time use the Portuguese dataset to fit the model.
por.mod1 <- lm(G1 ~ sex + age + internet + failures,
                data = student.por
                )
summary(por.mod1)
## 
## Call:
## lm(formula = G1 ~ sex + age + internet + failures, data = student.por)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.1928  -1.5912  -0.0887   1.7547   7.0679 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 14.01091    1.43685   9.751  < 2e-16 ***
## sexM        -0.49747    0.20194  -2.464  0.01402 *  
## age         -0.15656    0.08574  -1.826  0.06833 .  
## internetyes  0.73930    0.23513   3.144  0.00174 ** 
## failures    -1.59438    0.17726  -8.995  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.509 on 644 degrees of freedom
## Multiple R-squared:  0.1697, Adjusted R-squared:  0.1645 
## F-statistic: 32.91 on 4 and 644 DF,  p-value: < 2.2e-16

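In the Portuguese dataset, the effect of sex is reversed: now males perform significantly worse than females. The age coefficient is negative in both datasets and is not significant at the .05 level in either. The effects of internet and failures point in the same direction in both datasets, although internet is only significant in the Portuguese data. Note that por.mod1 as fitted above predicts G1 rather than G3, so its coefficients describe first period grades.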

student.por$fitted.values <- por.mod1$fitted.values

mean(abs(student.por$fitted.values - student.por$G3))
## [1] 2.196165

On average, the model fits for the Portuguese data were 2.20 points away from the true G3 scores.

# Predict G3 in Math using the Portuguese model
student.math$por.fitted.values <- predict(por.mod1, student.math)

# Calculate mean, absolute differences between predictions
#  and actual scores:
mean(abs(student.math$por.fitted.values - student.math$G3))
## [1] 3.278772

On average, the Portuguese model fits for the Math data were 3.28 points away from the true data.

# Predict G3 in Math using math model
student.math$math.fitted.values <- predict(math.mod1, student.math)

# Calculate mean, absolute differences between predictions
#  and actual scores:
mean(abs(student.math$math.fitted.values - student.math$G3))
## [1] 3.216281

On average, the Math model fits for the Math data were 3.22 points away from the true data (surprisingly, not much better than when we used the Portuguese model).
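
The same mean-absolute-difference calculation is repeated three times above, so it can be convenient to wrap it in a small helper. Here is a minimal sketch; mae() is a hypothetical helper name, not a built-in R function, and the numbers in the example call are made up.

```r
# Mean absolute error between predictions and observed values
mae <- function(predicted, observed) {
  mean(abs(predicted - observed))
}

# Made-up numbers, just to show the calculation:
#  the absolute differences are 1, 0, and 2
mae(predicted = c(10, 12, 15), observed = c(11, 12, 13))
```

With the objects from this section, mae(predict(por.mod1, student.math), student.math$G3) would be the equivalent of the cross-dataset comparison.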

Logistic regression

  1. For the Portuguese data, create a logistic regression model using glm() predicting whether a student attended nursery school based on his/her sex, family size, mother’s education, father’s education, and parents’ cohabitation status. (Hint: To do this, you’ll need to recode the nursery variable into a binary variable of 0s and 1s. Note: the original assignment mistakenly referred to the internet variable here.)
# Create a new variable called nursery.bin that converts
#  the original "no" values to 0, and "yes" values to 1

student.por$nursery.bin <- NA
student.por$nursery.bin[student.por$nursery == "no"] <- 0
student.por$nursery.bin[student.por$nursery == "yes"] <- 1


nursery.mod <- glm(nursery.bin ~ sex + famsize + Medu + Fedu + Pstatus, 
         family = binomial, data = student.por)
summary(nursery.mod)
## 
## Call:
## glm(formula = nursery.bin ~ sex + famsize + Medu + Fedu + Pstatus, 
##     family = binomial, data = student.por)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1816   0.4542   0.6098   0.7089   0.9683  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  0.59232    0.41529   1.426  0.15379   
## sexM        -0.37040    0.20625  -1.796  0.07251 . 
## famsizeLE3   0.67958    0.25002   2.718  0.00657 **
## Medu         0.32209    0.11943   2.697  0.00700 **
## Fedu        -0.01493    0.12260  -0.122  0.90306   
## PstatusT     0.05938    0.33680   0.176  0.86006   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 644.50  on 648  degrees of freedom
## Residual deviance: 623.54  on 643  degrees of freedom
## AIC: 635.54
## 
## Number of Fisher Scoring iterations: 4

Family size (b = 0.68, z = 2.72, p < .01) and mother’s education (b = 0.32, z = 2.70, p < .01) both significantly predict whether a student attended nursery school.
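
Logistic regression coefficients are on the log-odds scale, so they are often easier to read after exponentiating them into odds ratios. Here is a small sketch using the rounded estimates from the summary(nursery.mod) output; the object names are illustrative.

```r
# Rounded estimates copied from summary(nursery.mod) above
b.famsize <- 0.67958   # famsizeLE3
b.medu    <- 0.32209   # Medu

# exp() turns a log-odds coefficient into an odds ratio
or.famsize <- exp(b.famsize)
or.medu    <- exp(b.medu)

round(c(famsizeLE3 = or.famsize, Medu = or.medu), 2)
```

An odds ratio above 1 means higher odds of having attended nursery school. With the fitted model itself, exp(coef(nursery.mod)) gives the same quantity for every predictor.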

# Add model fits to data
student.por$nursery.pred <- nursery.mod$fitted.values

# Convert model fits to binary values
student.por$nursery.pred[student.por$nursery.pred > 0.5] <- 1
student.por$nursery.pred[student.por$nursery.pred <= .5] <- 0
table(student.por$nursery.bin,
      student.por$nursery.pred
      )
##    
##       1
##   0 128
##   1 521

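Looks strange, doesn’t it? In fact, the model is ALWAYS predicting values greater than 0.5, which means for each student it always predicts that they have gone to nursery school. This is less surprising once you notice that most students (521 of 649, about 80%) did attend nursery school.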
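
You can see why the predictions never drop below 0.5 by computing the predicted probability for the least favourable profile the coefficients allow. This is a sketch using the rounded estimates from the summary(nursery.mod) output and assuming the coded ranges from the datafile description (Medu and Fedu from 0 to 4); b, x.min, and p.min are illustrative names, and no such student necessarily exists in the data.

```r
# Rounded estimates copied from summary(nursery.mod) above
b <- c(intercept = 0.59232, sexM = -0.37040, famsizeLE3 = 0.67958,
       Medu = 0.32209, Fedu = -0.01493, PstatusT = 0.05938)

# Least favourable profile within the coded ranges:
#  male, family size GT3, Medu = 0, Fedu = 4, parents apart
x.min <- c(1, 1, 0, 0, 4, 0)

# plogis() is the inverse logit: 1 / (1 + exp(-x))
p.min <- plogis(sum(b * x.min))
p.min
```

Even this profile gets a predicted probability above 0.5, so a 0.5 cutoff classifies every student as having attended.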

Poisson regression

  1. Create a new regression object called travel.por.mod predicting a student’s travel time to school as a function of every other variable in the Portuguese dataset. Do a standard regression with lm().
travel.por.mod <- lm(traveltime ~ ., data = student.por)
summary(travel.por.mod)
## 
## Call:
## lm(formula = traveltime ~ ., data = student.por)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.3786 -0.4456 -0.1499  0.4315  2.3615 
## 
## Coefficients: (3 not defined because of singularities)
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       2.377955   0.514045   4.626 4.56e-06 ***
## schoolMS          0.158953   0.069197   2.297  0.02195 *  
## sexM              0.137813   0.063719   2.163  0.03095 *  
## age              -0.026384   0.026221  -1.006  0.31471    
## addressU         -0.396930   0.064684  -6.136 1.52e-09 ***
## famsizeLE3        0.023153   0.062418   0.371  0.71081    
## PstatusT          0.034305   0.088218   0.389  0.69751    
## Medu             -0.084029   0.038408  -2.188  0.02906 *  
## Fedu             -0.042117   0.035029  -1.202  0.22970    
## Mjobhealth       -0.121490   0.136904  -0.887  0.37521    
## Mjobother        -0.092069   0.077029  -1.195  0.23245    
## Mjobservices     -0.051192   0.095061  -0.539  0.59042    
## Mjobteacher      -0.035233   0.127908  -0.275  0.78306    
## Fjobhealth        0.142938   0.191457   0.747  0.45561    
## Fjobother         0.309403   0.115620   2.676  0.00765 ** 
## Fjobservices      0.194154   0.122141   1.590  0.11245    
## Fjobteacher       0.372420   0.171266   2.175  0.03005 *  
## reasonhome       -0.176384   0.072188  -2.443  0.01483 *  
## reasonother      -0.036431   0.093847  -0.388  0.69801    
## reasonreputation -0.119472   0.075846  -1.575  0.11573    
## guardianmother   -0.058761   0.067569  -0.870  0.38484    
## guardianother     0.191378   0.135082   1.417  0.15707    
## studytime         0.017430   0.035928   0.485  0.62774    
## failures          0.029722   0.053997   0.550  0.58222    
## schoolsupyes     -0.046797   0.094036  -0.498  0.61891    
## famsupyes         0.009733   0.058096   0.168  0.86701    
## paidyes          -0.112197   0.117532  -0.955  0.34016    
## activitiesyes    -0.008821   0.056871  -0.155  0.87679    
## nurseryyes        0.060430   0.069013   0.876  0.38158    
## higheryes         0.091211   0.099115   0.920  0.35781    
## internetyes      -0.146883   0.070061  -2.097  0.03645 *  
## romanticyes      -0.023287   0.058525  -0.398  0.69085    
## famrel            0.001778   0.029686   0.060  0.95226    
## freetime         -0.038153   0.028559  -1.336  0.18207    
## goout             0.048328   0.027282   1.771  0.07699 .  
## Dalc              0.046112   0.038955   1.184  0.23698    
## Walc             -0.007837   0.030130  -0.260  0.79486    
## health           -0.030229   0.019711  -1.534  0.12565    
## absences          0.003092   0.006373   0.485  0.62773    
## G1               -0.007324   0.020606  -0.355  0.72240    
## G2               -0.040002   0.026913  -1.486  0.13770    
## G3                0.040800   0.021960   1.858  0.06367 .  
## fitted.values           NA         NA      NA       NA    
## nursery.bin             NA         NA      NA       NA    
## nursery.pred            NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6776 on 607 degrees of freedom
## Multiple R-squared:  0.2326, Adjusted R-squared:  0.1807 
## F-statistic: 4.486 on 41 and 607 DF,  p-value: < 2.2e-16

A student’s school, sex, address, mother’s education, father’s job, reason for choosing the school, and internet access all significantly predict travel time (p < .05).
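Rather than reading the significant predictors off the printed table by eye, they can be pulled out of the coefficient matrix returned by summary(). The following is a minimal sketch on simulated stand-in data; the data frame toy and its columns x1, x2 and y are hypothetical and are not part of the student data.

```r
# Simulated stand-in data (hypothetical; not the student data)
set.seed(1)
n <- 200
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- 1 + 2 * toy$x1 + rnorm(n)   # y depends on x1 only

toy.mod <- lm(y ~ ., data = toy)

# Coefficient matrix with columns Estimate, Std. Error, t value, Pr(>|t|)
coef.table <- summary(toy.mod)$coefficients

# Keep rows with p < 0.05, then drop the intercept
sig <- coef.table[coef.table[, "Pr(>|t|)"] < 0.05, , drop = FALSE]
sig.names <- setdiff(rownames(sig), "(Intercept)")
print(sig.names)
```

For a glm() fit the p-value column is named "Pr(>|z|)" instead, so the column name in the filter would need to change.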

# Store the linear model's fitted values, then compute the mean absolute error
student.por$travel.fv <- travel.por.mod$fitted.values
mean(abs(student.por$travel.fv - student.por$traveltime))
## [1] 0.5186681

On average, the fitted values of the linear travel model were 0.519 points away from the observed travel times.
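The same calculation is needed again for the Poisson model, so it can help to wrap it in a small function. This is only a sketch: the name mae and the toy vectors are made up for illustration.

```r
# Hypothetical helper: mean absolute error between observed and fitted values
mae <- function(observed, fits) {
  stopifnot(length(observed) == length(fits))
  mean(abs(observed - fits))
}

# Toy values (made up for illustration)
observed <- c(1, 2, 1, 3)
fits     <- c(1.5, 1.5, 1.0, 2.0)
mae(observed, fits)   # (0.5 + 0.5 + 0 + 1) / 4 = 0.5
```

With the data from this WPA, the call would look like mae(student.por$traveltime, student.por$travel.fv).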

hist(student.por$traveltime)

# Poisson regression predicting traveltime from all other columns
travel.por.pmod <- glm(traveltime ~ ., data = student.por, family = poisson)
summary(travel.por.pmod)
## 
## Call:
## glm(formula = traveltime ~ ., family = poisson, data = student.por)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.0629  -0.3745  -0.1639   0.3293   1.5629  
## 
## Coefficients: (4 not defined because of singularities)
##                    Estimate Std. Error z value Pr(>|z|)   
## (Intercept)       9.276e-01  6.074e-01   1.527  0.12671   
## schoolMS          9.617e-02  7.983e-02   1.205  0.22829   
## sexM              8.682e-02  7.552e-02   1.150  0.25027   
## age              -1.460e-02  3.107e-02  -0.470  0.63834   
## addressU         -2.338e-01  7.277e-02  -3.213  0.00131 **
## famsizeLE3        1.664e-02  7.358e-02   0.226  0.82112   
## PstatusT          1.775e-02  1.066e-01   0.166  0.86778   
## Medu             -4.966e-02  4.469e-02  -1.111  0.26645   
## Fedu             -2.893e-02  4.158e-02  -0.696  0.48647   
## Mjobhealth       -9.372e-02  1.673e-01  -0.560  0.57524   
## Mjobother        -5.288e-02  8.589e-02  -0.616  0.53811   
## Mjobservices     -2.723e-02  1.103e-01  -0.247  0.80493   
## Mjobteacher      -2.334e-02  1.538e-01  -0.152  0.87934   
## Fjobhealth        7.638e-02  2.465e-01   0.310  0.75669   
## Fjobother         1.979e-01  1.402e-01   1.411  0.15811   
## Fjobservices      1.253e-01  1.483e-01   0.845  0.39828   
## Fjobteacher       2.487e-01  2.101e-01   1.184  0.23661   
## reasonhome       -1.215e-01  8.761e-02  -1.387  0.16545   
## reasonother      -2.342e-02  1.070e-01  -0.219  0.82674   
## reasonreputation -7.492e-02  9.077e-02  -0.825  0.40915   
## guardianmother   -3.861e-02  7.944e-02  -0.486  0.62692   
## guardianother     1.070e-01  1.531e-01   0.699  0.48460   
## studytime         1.231e-02  4.236e-02   0.290  0.77144   
## failures          1.167e-02  6.070e-02   0.192  0.84752   
## schoolsupyes     -3.388e-02  1.138e-01  -0.298  0.76592   
## famsupyes         6.087e-05  6.830e-02   0.001  0.99929   
## paidyes          -7.513e-02  1.440e-01  -0.522  0.60181   
## activitiesyes    -5.093e-03  6.723e-02  -0.076  0.93961   
## nurseryyes        2.900e-02  8.066e-02   0.360  0.71917   
## higheryes         5.876e-02  1.121e-01   0.524  0.60011   
## internetyes      -8.682e-02  7.915e-02  -1.097  0.27273   
## romanticyes      -1.671e-02  6.894e-02  -0.242  0.80846   
## famrel           -8.044e-04  3.439e-02  -0.023  0.98134   
## freetime         -2.389e-02  3.345e-02  -0.714  0.47515   
## goout             3.040e-02  3.237e-02   0.939  0.34764   
## Dalc              2.593e-02  4.432e-02   0.585  0.55847   
## Walc             -4.554e-03  3.533e-02  -0.129  0.89743   
## health           -1.942e-02  2.320e-02  -0.837  0.40266   
## absences          1.941e-03  7.647e-03   0.254  0.79966   
## G1               -5.541e-03  2.381e-02  -0.233  0.81599   
## G2               -2.501e-02  3.142e-02  -0.796  0.42603   
## G3                2.486e-02  2.632e-02   0.944  0.34499   
## fitted.values            NA         NA      NA       NA   
## nursery.bin              NA         NA      NA       NA   
## nursery.pred             NA         NA      NA       NA   
## travel.fv                NA         NA      NA       NA   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 207.43  on 648  degrees of freedom
## Residual deviance: 155.02  on 607  degrees of freedom
## AIC: 1741.5
## 
## Number of Fisher Scoring iterations: 4

Using Poisson regression, only a student’s address appears to significantly predict travel time. The variables that were previously significant (school, sex, mother’s education, father’s job, reason for choosing the school, and internet access) no longer appear to be significantly related.
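Poisson coefficients are on the log scale (glm() uses a log link for family = poisson by default), so they are easier to interpret after exponentiating. The sketch below does this for addressU using the estimate and standard error copied from the summary output; the interval is an approximate Wald interval, which is an assumption of this example rather than something computed in the original analysis.

```r
# Estimate and standard error for addressU, copied from the Poisson summary
b  <- -2.338e-01
se <-  7.277e-02

# Multiplicative effect on the expected traveltime value
rate.ratio <- exp(b)

# Approximate 95% Wald interval on the ratio scale
ci <- exp(b + c(-1, 1) * qnorm(0.975) * se)

round(rate.ratio, 2)   # about 0.79
round(ci, 2)
```

A ratio of about 0.79 means the model's expected traveltime value for students with address U is roughly 21% lower than for otherwise similar students in the reference category.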

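# Store the Poisson model's fitted values (on the response scale)
student.por$travel.poisson.fv <- travel.por.pmod$fitted.values

# Mean absolute error of the Poisson model
mean(abs(student.por$traveltime - student.por$travel.poisson.fv))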
## [1] 0.5206617

On average, the fitted values of the Poisson travel model were 0.521 points away from the observed travel times. This is actually slightly worse than the linear regression model (0.519)!
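Both summaries list NA coefficients for fitted.values, nursery.bin and nursery.pred (plus travel.fv in the Poisson model). This happens because traveltime ~ . uses every column in student.por, including columns that earlier steps added, and a column of fitted values is an exact linear combination of the predictors that produced it. The sketch below reproduces the effect on simulated data and shows one way to exclude such columns; toy, fv and derived are hypothetical names used only for this example.

```r
# Simulated stand-in data (hypothetical; not the student data)
set.seed(2)
n <- 100
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- 1 + toy$x1 - toy$x2 + rnorm(n)

# Store fitted values in the data frame, as this WPA does with travel.fv
toy$fv <- lm(y ~ x1 + x2, data = toy)$fitted.values

# `y ~ .` now also uses fv, an exact linear combination of x1 and x2
mod.all <- lm(y ~ ., data = toy)
coef(mod.all)["fv"]   # NA: dropped because of the singularity

# Exclude derived columns before fitting
derived <- c("fv")
mod.clean <- lm(y ~ ., data = toy[, !(names(toy) %in% derived)])
coef(mod.clean)
```

Applied to this WPA, derived would hold the helper columns named in the NA rows, and also travel.poisson.fv for any model fitted after this point.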