Student Performance

In this WPA, you will analyze data from a study on student performance in two classes: math and Portuguese. These data come from the UCI Machine Learning Repository at http://archive.ics.uci.edu/ml/datasets/Student+Performance#

Here is the data description (taken directly from the original website):

This data approach student achievement in secondary education of two Portuguese schools. The data attributes include student grades, demographic, social and school related features) and it was collected by using school reports and questionnaires. Two datasets are provided regarding the performance in two distinct subjects: Mathematics (mat) and Portuguese language (por). In [Cortez and Silva, 2008], the two datasets were modeled under binary/five-level classification and regression tasks. Important note: the target attribute G3 has a strong correlation with attributes G2 and G1. This occurs because G3 is the final year grade (issued at the 3rd period), while G1 and G2 correspond to the 1st and 2nd period grades. It is more difficult to predict G3 without G2 and G1, but such prediction is much more useful (see paper source for more details).

The data are located in two tab-delimited text files at http://nathanieldphillips.com/wp-content/uploads/2016/11/studentmath.txt (the math data), and http://nathanieldphillips.com/wp-content/uploads/2016/11/studentpor.txt (the Portuguese data).

Datafile description

Both datafiles have 33 columns. Here they are:

1 school - student’s school (binary: ‘GP’ - Gabriel Pereira or ‘MS’ - Mousinho da Silveira)

2 sex - student’s sex (binary: ‘F’ - female or ‘M’ - male)

3 age - student’s age (numeric: from 15 to 22)

4 address - student’s home address type (binary: ‘U’ - urban or ‘R’ - rural)

5 famsize - family size (binary: ‘LE3’ - less or equal to 3 or ‘GT3’ - greater than 3)

6 Pstatus - parent’s cohabitation status (binary: ‘T’ - living together or ‘A’ - apart)

7 Medu - mother’s education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education)

8 Fedu - father’s education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education)

9 Mjob - mother’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’)

10 Fjob - father’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’)

11 reason - reason to choose this school (nominal: close to ‘home’, school ‘reputation’, ‘course’ preference or ‘other’)

12 guardian - student’s guardian (nominal: ‘mother’, ‘father’ or ‘other’)

13 traveltime - home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour)

14 studytime - weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)

15 failures - number of past class failures (numeric: n if 1<=n<3, else 4)

16 schoolsup - extra educational support (binary: yes or no)

17 famsup - family educational support (binary: yes or no)

18 paid - extra paid classes within the course subject (Math or Portuguese) (binary: yes or no)

19 activities - extra-curricular activities (binary: yes or no)

20 nursery - attended nursery school (binary: yes or no)

21 higher - wants to take higher education (binary: yes or no)

22 internet - Internet access at home (binary: yes or no)

23 romantic - with a romantic relationship (binary: yes or no)

24 famrel - quality of family relationships (numeric: from 1 - very bad to 5 - excellent)

25 freetime - free time after school (numeric: from 1 - very low to 5 - very high)

26 goout - going out with friends (numeric: from 1 - very low to 5 - very high)

27 Dalc - workday alcohol consumption (numeric: from 1 - very low to 5 - very high)

28 Walc - weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)

29 health - current health status (numeric: from 1 - very bad to 5 - very good)

30 absences - number of school absences (numeric: from 0 to 93)

31 G1 - first period grade (numeric: from 0 to 20)

32 G2 - second period grade (numeric: from 0 to 20)

33 G3 - final grade (numeric: from 0 to 20, output target)

Data loading and preparation

Open an R project and open a new script. Save the script with the name wpa_8_LastFirst.R.
Using read.table(), load the two tab-delimited text files containing the data into R and assign them to new objects called student.math and student.por, respectively.

student.math <- read.table("http://nathanieldphillips.com/wp-content/uploads/2016/11/studentmath.txt",
                      sep = "\t",
                      header = TRUE)

student.por <- read.table("http://nathanieldphillips.com/wp-content/uploads/2016/11/studentpor.txt",
                      sep = "\t",
                      header = TRUE)
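
The str() output further down shows text columns as factors. Whether you see factors or character columns can depend on your R version: since R 4.0.0, read.table() keeps text as character unless stringsAsFactors = TRUE is set. Here is a minimal sketch of the difference, assuming a tiny made-up file (the file contents and the names toy.file, toy.chr and toy.fac are hypothetical, not part of the student data):

```r
# Write a tiny, made-up tab-delimited file to a temporary location
toy.file <- tempfile(fileext = ".txt")
writeLines(c("school\tsex\tage",
             "GP\tF\t18",
             "GP\tM\t17",
             "MS\tF\t15"),
           con = toy.file)

# Default import: in R >= 4.0.0 text columns stay character
toy.chr <- read.table(toy.file, sep = "\t", header = TRUE)

# Explicitly request factors, as in the str() output shown in this WPA
toy.fac <- read.table(toy.file, sep = "\t", header = TRUE,
                      stringsAsFactors = TRUE)

class(toy.chr$school)   # depends on the R version
class(toy.fac$school)   # factor
levels(toy.fac$sex)     # levels are sorted alphabetically
```

If your own str() output shows chr instead of Factor, adding stringsAsFactors = TRUE to the read.table() calls above should reproduce the factor columns.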

Understand the data

Look at the first few rows of the dataframes with the head() function to make sure they were imported correctly.

head(student.math)

head(student.por)

Using the str() function, look at the structure of each dataframe: the type and first few values of each column. There should be 33 columns in each dataset. Make sure everything looks ok.

str(student.math)

## 'data.frame':    395 obs. of  33 variables:
##  $ school    : Factor w/ 2 levels "GP","MS": 1 1 1 1 1 1 1 1 1 1 ...
##  $ sex       : Factor w/ 2 levels "F","M": 1 1 1 1 1 2 2 1 2 2 ...
##  $ age       : int  18 17 15 15 16 16 16 17 15 15 ...
##  $ address   : Factor w/ 2 levels "R","U": 2 2 2 2 2 2 2 2 2 2 ...
##  $ famsize   : Factor w/ 2 levels "GT3","LE3": 1 1 2 1 1 2 2 1 2 1 ...
##  $ Pstatus   : Factor w/ 2 levels "A","T": 1 2 2 2 2 2 2 1 1 2 ...
##  $ Medu      : int  4 1 1 4 3 4 2 4 3 3 ...
##  $ Fedu      : int  4 1 1 2 3 3 2 4 2 4 ...
##  $ Mjob      : Factor w/ 5 levels "at_home","health",..: 1 1 1 2 3 4 3 3 4 3 ...
##  $ Fjob      : Factor w/ 5 levels "at_home","health",..: 5 3 3 4 3 3 3 5 3 3 ...
##  $ reason    : Factor w/ 4 levels "course","home",..: 1 1 3 2 2 4 2 2 2 2 ...
##  $ guardian  : Factor w/ 3 levels "father","mother",..: 2 1 2 2 1 2 2 2 2 2 ...
##  $ traveltime: int  2 1 1 1 1 1 1 2 1 1 ...
##  $ studytime : int  2 2 2 3 2 2 2 2 2 2 ...
##  $ failures  : int  0 0 3 0 0 0 0 0 0 0 ...
##  $ schoolsup : Factor w/ 2 levels "no","yes": 2 1 2 1 1 1 1 2 1 1 ...
##  $ famsup    : Factor w/ 2 levels "no","yes": 1 2 1 2 2 2 1 2 2 2 ...
##  $ paid      : Factor w/ 2 levels "no","yes": 1 1 2 2 2 2 1 1 2 2 ...
##  $ activities: Factor w/ 2 levels "no","yes": 1 1 1 2 1 2 1 1 1 2 ...
##  $ nursery   : Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 2 2 2 2 ...
##  $ higher    : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ internet  : Factor w/ 2 levels "no","yes": 1 2 2 2 1 2 2 1 2 2 ...
##  $ romantic  : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
##  $ famrel    : int  4 5 4 3 4 5 4 4 4 5 ...
##  $ freetime  : int  3 3 3 2 3 4 4 1 2 5 ...
##  $ goout     : int  4 3 2 2 2 2 4 4 2 1 ...
##  $ Dalc      : int  1 1 2 1 1 1 1 1 1 1 ...
##  $ Walc      : int  1 1 3 1 2 2 1 1 1 1 ...
##  $ health    : int  3 3 3 5 5 5 3 1 1 5 ...
##  $ absences  : int  6 4 10 2 4 10 0 6 0 0 ...
##  $ G1        : int  5 5 7 15 6 15 12 6 16 14 ...
##  $ G2        : int  6 5 8 14 10 15 12 5 18 15 ...
##  $ G3        : int  6 6 10 15 10 15 11 6 19 15 ...

str(student.por)

## 'data.frame':    649 obs. of  33 variables:
##  $ school    : Factor w/ 2 levels "GP","MS": 1 1 1 1 1 1 1 1 1 1 ...
##  $ sex       : Factor w/ 2 levels "F","M": 1 1 1 1 1 2 2 1 2 2 ...
##  $ age       : int  18 17 15 15 16 16 16 17 15 15 ...
##  $ address   : Factor w/ 2 levels "R","U": 2 2 2 2 2 2 2 2 2 2 ...
##  $ famsize   : Factor w/ 2 levels "GT3","LE3": 1 1 2 1 1 2 2 1 2 1 ...
##  $ Pstatus   : Factor w/ 2 levels "A","T": 1 2 2 2 2 2 2 1 1 2 ...
##  $ Medu      : int  4 1 1 4 3 4 2 4 3 3 ...
##  $ Fedu      : int  4 1 1 2 3 3 2 4 2 4 ...
##  $ Mjob      : Factor w/ 5 levels "at_home","health",..: 1 1 1 2 3 4 3 3 4 3 ...
##  $ Fjob      : Factor w/ 5 levels "at_home","health",..: 5 3 3 4 3 3 3 5 3 3 ...
##  $ reason    : Factor w/ 4 levels "course","home",..: 1 1 3 2 2 4 2 2 2 2 ...
##  $ guardian  : Factor w/ 3 levels "father","mother",..: 2 1 2 2 1 2 2 2 2 2 ...
##  $ traveltime: int  2 1 1 1 1 1 1 2 1 1 ...
##  $ studytime : int  2 2 2 3 2 2 2 2 2 2 ...
##  $ failures  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ schoolsup : Factor w/ 2 levels "no","yes": 2 1 2 1 1 1 1 2 1 1 ...
##  $ famsup    : Factor w/ 2 levels "no","yes": 1 2 1 2 2 2 1 2 2 2 ...
##  $ paid      : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ activities: Factor w/ 2 levels "no","yes": 1 1 1 2 1 2 1 1 1 2 ...
##  $ nursery   : Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 2 2 2 2 ...
##  $ higher    : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ internet  : Factor w/ 2 levels "no","yes": 1 2 2 2 1 2 2 1 2 2 ...
##  $ romantic  : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
##  $ famrel    : int  4 5 4 3 4 5 4 4 4 5 ...
##  $ freetime  : int  3 3 3 2 3 4 4 1 2 5 ...
##  $ goout     : int  4 3 2 2 2 2 4 4 2 1 ...
##  $ Dalc      : int  1 1 2 1 1 1 1 1 1 1 ...
##  $ Walc      : int  1 1 3 1 2 2 1 1 1 1 ...
##  $ health    : int  3 3 3 5 5 5 3 1 1 5 ...
##  $ absences  : int  4 2 6 0 0 6 0 2 0 0 ...
##  $ G1        : int  0 9 12 14 11 12 13 10 15 12 ...
##  $ G2        : int  11 11 13 14 13 12 12 13 16 12 ...
##  $ G3        : int  11 11 12 14 13 13 13 13 17 13 ...

Standard Regression with lm()

One IV

For the math data, create a regression object called lm.5 predicting first period grade (G1) based on age.

lm.5 <- lm(G1 ~ age, data = student.math)

summary(lm.5)

## 
## Call:
## lm(formula = G1 ~ age, data = student.math)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.6915 -2.7749 -0.1916  2.3085  8.3085 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  13.6919     2.1926   6.245  1.1e-09 ***
## age          -0.1667     0.1309  -1.273    0.204    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.317 on 393 degrees of freedom
## Multiple R-squared:  0.004106,   Adjusted R-squared:  0.001572 
## F-statistic:  1.62 on 1 and 393 DF,  p-value: 0.2038

5b. Run names() and summary() on lm.5 to see additional information from your regression object. Now, return a vector of the coefficients by running lm.5$coefficients

How do you interpret the relationship between age and first period grade?

# There is a slight negative relationship between age and first period grade (b = -0.17); however, the relationship is not significant (p = 0.20)
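
The numbers used in an answer like this can be pulled out of the regression object rather than read off the printout. This is a rough sketch on made-up data (the objects toy, toy.lm and toy.table are hypothetical and only stand in for student.math and lm.5):

```r
set.seed(1)  # arbitrary seed so the toy example is reproducible

# Made-up data, not the student data
toy <- data.frame(x = rnorm(50))
toy$y <- 3 + 0.5 * toy$x + rnorm(50)

toy.lm <- lm(y ~ x, data = toy)

names(toy.lm)          # components stored in the lm object
toy.lm$coefficients    # named vector: (Intercept) and x
coef(toy.lm)           # extractor function returning the same vector

# The coefficient table behind the summary() printout is a matrix
toy.table <- summary(toy.lm)$coefficients
toy.table["x", "Estimate"]   # slope estimate
toy.table["x", "Pr(>|t|)"]   # p-value for the slope
```

Indexing the table by row and column name keeps the code readable and avoids copying values by hand.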

For the math data, create a regression object called lm.7 predicting first period grade (G1) based on absences

lm.7 <- lm(G1 ~ absences, data = student.math)

summary(lm.7)

## 
## Call:
## lm(formula = G1 ~ absences, data = student.math)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.8794 -2.9115  0.0177  2.2363  8.0692 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 10.98227    0.20539  53.470   <2e-16 ***
## absences    -0.01286    0.02091  -0.615    0.539    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.322 on 393 degrees of freedom
## Multiple R-squared:  0.0009612,  Adjusted R-squared:  -0.001581 
## F-statistic: 0.3781 on 1 and 393 DF,  p-value: 0.539

How do you interpret the relationship between absences and G1?

# There is a slight negative relationship between absences and first period grade (b = -0.01); however, the relationship is not significant (p = 0.54)

For the math data, create a regression object called lm.9 predicting each student’s period 3 grade (G3) based on their period 1 grade (G1). Look at the results of the regression analysis with summary().

lm.9 <- lm(G3 ~ G1, data = student.math)

summary(lm.9)

## 
## Call:
## lm(formula = G3 ~ G1, data = student.math)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.6223  -0.8348   0.3777   1.6965   5.0153 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.65280    0.47475  -3.481 0.000555 ***
## G1           1.10626    0.04164  26.568  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.743 on 393 degrees of freedom
## Multiple R-squared:  0.6424, Adjusted R-squared:  0.6414 
## F-statistic: 705.8 on 1 and 393 DF,  p-value: < 2.2e-16

What is the relationship between G1 and G3?

# There is a strong positive relationship between first period grade and third period grade (b = 1.1, p < .01)

Regression vs. Correlation

Conduct a correlation test between G1 and G3 (Hint: use cor.test()). Compare the t-value for this test to the regression analysis you did in question 9. What do you see?
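
As a sketch of what to expect, here is the same comparison on made-up data (toy, g1 and g3 are hypothetical stand-ins, not the student data). With a single predictor, the t-value of the Pearson correlation test and the t-value of the regression slope are the same quantity, and the squared correlation equals R-squared:

```r
set.seed(2)  # arbitrary seed so the toy example is reproducible

# Made-up grades, loosely imitating the G1/G3 relationship
toy <- data.frame(g1 = rnorm(80, mean = 11, sd = 3))
toy$g3 <- 1.1 * toy$g1 - 1.5 + rnorm(80, sd = 2.5)

toy.cor <- cor.test(toy$g1, toy$g3)
toy.lm <- lm(g3 ~ g1, data = toy)

# t-value from the correlation test and from the regression slope
toy.cor$statistic
summary(toy.lm)$coefficients["g1", "t value"]

# Squared correlation and R-squared
toy.cor$estimate^2
summary(toy.lm)$r.squared
```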

Adding a regression line to a scatterplot

Create a scatterplot showing the relationship between G1 and G3 for the math data.

plot(x = student.math$G1,
     y = student.math$G3)

Add a regression line to the scatterplot from your regression object lm.9 (hint: use abline()).

plot(x = student.math$G1,
     y = student.math$G3)

abline(lm.9)

Multiple IVs

For the math data, create a regression object called lm.14 predicting third period grade (G3) based on sex, age, internet, and failures

lm.14 <- lm(G3 ~ sex + age + internet + failures, data = student.math)

summary(lm.14)

## 
## Call:
## lm(formula = G3 ~ sex + age + internet + failures, data = student.math)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.2156  -1.9523   0.0965   3.0252   9.4370 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  13.9962     2.9808   4.695 3.69e-06 ***
## sexM          1.0451     0.4282   2.441   0.0151 *  
## age          -0.2407     0.1735  -1.388   0.1660    
## internetyes   0.7855     0.5761   1.364   0.1735    
## failures     -2.1260     0.2966  -7.167 3.86e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.237 on 390 degrees of freedom
## Multiple R-squared:  0.1533, Adjusted R-squared:  0.1446 
## F-statistic: 17.65 on 4 and 390 DF,  p-value: 2.488e-13

How do you interpret the regression output? Which variables are significantly related to third period grade?

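# Sex and failures are significantly related to third period grade. Holding the other predictors constant, male students score about 1 point higher than female students (b = 1.05, p = .015), and each additional past failure goes with a grade about 2.1 points lower (b = -2.13, p < .001).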
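
The row labelled sexM in the output comes from how R codes a factor by default: the first level (F) is the reference, and the coefficient is the difference of the other level from it. Below is a minimal sketch on made-up data with sex as the only predictor (toy and grade are hypothetical names). In that special case the coefficient is exactly the difference in group means; in a model with several predictors, such as lm.14, it is instead the difference after adjusting for the other variables.

```r
set.seed(3)  # arbitrary seed so the toy example is reproducible

# Made-up data: 30 female and 30 male students
toy <- data.frame(sex = factor(rep(c("F", "M"), each = 30)))
toy$grade <- ifelse(toy$sex == "M", 11, 10) + rnorm(60, sd = 3)

toy.lm <- lm(grade ~ sex, data = toy)
coef(toy.lm)   # (Intercept) and sexM

# Compare with the raw group means
group.means <- tapply(toy$grade, toy$sex, mean)
group.means["F"]                      # equals the intercept
group.means["M"] - group.means["F"]   # equals the sexM coefficient
```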

Checkpoint!!!

Create a new regression object called lm.16 using the same variables as question 14 (the model was lm.14, where you predicted third period grade (G3) based on sex, age, internet, and failures); however, this time use the Portuguese dataset.

lm.16 <- lm(G3 ~ sex + age + internet + failures, 
            data = student.por)

summary(lm.16)

## 
## Call:
## lm(formula = G3 ~ sex + age + internet + failures, data = student.por)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.8941  -1.8345   0.0522   1.8807   7.8041 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.61020    1.68101   6.907 1.19e-11 ***
## sexM        -0.71515    0.23625  -3.027 0.002568 ** 
## age          0.01986    0.10031   0.198 0.843134    
## internetyes  0.92639    0.27508   3.368 0.000803 ***
## failures    -2.04819    0.20738  -9.877  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.936 on 644 degrees of freedom
## Multiple R-squared:  0.1794, Adjusted R-squared:  0.1743 
## F-statistic: 35.19 on 4 and 644 DF,  p-value: < 2.2e-16

What are the key differences between the beta values for the Portuguese dataset (lm.16) and the math dataset (lm.14)?

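# In the Portuguese dataset, male students do worse than female students (b = -0.72), the opposite of the math data, and internet access is significantly related to higher grades (b = 0.93, p < .001)!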
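
Comparing two models is easier when their coefficients sit side by side. One possible way, sketched on made-up data (make.toy, toy.a, toy.b, fit.a, fit.b and both are hypothetical names; the same cbind() pattern should work for lm.14 and lm.16 because they share a formula):

```r
set.seed(4)  # arbitrary seed so the toy example is reproducible

# Helper that builds a made-up dataset with a chosen true slope
make.toy <- function(n, slope) {
  x <- rnorm(n)
  data.frame(x = x, y = 1 + slope * x + rnorm(n))
}

toy.a <- make.toy(100, slope = 2)
toy.b <- make.toy(100, slope = -1)

# Same formula, two datasets
fit.a <- lm(y ~ x, data = toy.a)
fit.b <- lm(y ~ x, data = toy.b)

# One column per model, one row per coefficient
both <- round(cbind(a = coef(fit.a), b = coef(fit.b)), 2)
both
```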

Predicting values

For the math dataset, create a regression object called lm.18 predicting a student’s first period grade (G1) based on all variables in the dataset (Hint: use the notation formula = y ~ . to include all variables!)

lm.18 <- lm(G1 ~ ., 
            data = student.math)

Save the fitted values from the lm.18 object as a vector called lm.18.fitted (Hint: model$fitted.values)

lm.18.fitted <- lm.18$fitted.values

For the math dataset, create a scatterplot showing the relationship between a student’s first period grade (G1) and the fitted values from the model. Does the model appear to correctly fit a student’s first period grade?

plot(x = student.math$G1, 
     y = lm.18.fitted)

# The model does seem to do a pretty good job
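
"A pretty good job" can be backed up with a number. For a linear model with an intercept, the squared correlation between the observed and fitted values equals the model's R-squared. Here is a small sketch on made-up data (toy, x1, x2 and toy.fitted are hypothetical; they only stand in for student.math and lm.18.fitted):

```r
set.seed(5)  # arbitrary seed so the toy example is reproducible

# Made-up data with two predictors
toy <- data.frame(x1 = rnorm(60), x2 = rnorm(60))
toy$y <- 2 + toy$x1 - 0.5 * toy$x2 + rnorm(60)

# y ~ . uses every other column in the data frame as a predictor
toy.lm <- lm(y ~ ., data = toy)
toy.fitted <- toy.lm$fitted.values

# predict() without new data returns the same fitted values
all.equal(unname(toy.fitted), unname(predict(toy.lm)))

# Squared correlation of observed and fitted values vs. R-squared
cor(toy$y, toy.fitted)^2
summary(toy.lm)$r.squared
```

On a scatterplot of observed against fitted values, abline(a = 0, b = 1) adds the diagonal where the two would be equal, which makes the spread around a perfect fit easier to judge.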

Simulating regression analyses

Let’s do some simulations. Run the following code to create some random data:

a <- rnorm(100, mean = 10, sd = 5)
b <- rnorm(100, mean = 30, sd = 2)
c <- rnorm(100, mean = 20, sd = 1)
d <- rnorm(100, mean = 5, sd = 2)

y <- 50 + 2 * a - 5 * b + .3 * d + rnorm(100, mean = 0, sd = 10)

Based on this code, what do you expect the estimated regression coefficients to be for the independent variables a, b, c, and d?

# They should be close to 2, -5, 0 and 0.3

Test your prediction by running the appropriate regression analysis.

mod.21 <- lm(y ~ a + b + c + d)
summary(mod.21)

## 
## Call:
## lm(formula = y ~ a + b + c + d)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18.3390  -6.7130  -0.1597   5.5222  28.7035 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  50.9452    25.0859   2.031   0.0451 *  
## a             1.7741     0.2058   8.619 1.47e-13 ***
## b            -3.8854     0.4985  -7.794 8.16e-12 ***
## c            -1.4224     0.9995  -1.423   0.1580    
## d            -0.3855     0.5499  -0.701   0.4850    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.89 on 95 degrees of freedom
## Multiple R-squared:  0.5908, Adjusted R-squared:  0.5736 
## F-statistic:  34.3 on 4 and 95 DF,  p-value: < 2.2e-16

# My predictions were pretty close to the estimated coefficients
# Note that the results might change when you run it due to random variation in the data!
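
Two things may help when checking a simulation like this: fixing the random seed so a run can be repeated exactly, and looking at confidence intervals rather than point estimates alone. This is a sketch, assuming an arbitrary seed value and a hypothetical wrapper function sim.fit() that repeats the simulation code from above:

```r
# Hypothetical wrapper: run the simulation above with a fixed seed
sim.fit <- function(seed) {
  set.seed(seed)
  a <- rnorm(100, mean = 10, sd = 5)
  b <- rnorm(100, mean = 30, sd = 2)
  c <- rnorm(100, mean = 20, sd = 1)
  d <- rnorm(100, mean = 5, sd = 2)
  y <- 50 + 2 * a - 5 * b + .3 * d + rnorm(100, mean = 0, sd = 10)
  lm(y ~ a + b + c + d)
}

# The same seed gives exactly the same data and coefficients
fit.1 <- sim.fit(seed = 100)
fit.2 <- sim.fit(seed = 100)
identical(coef(fit.1), coef(fit.2))

# 95% confidence intervals for each coefficient
confint(fit.1)
```

If the true values (2, -5, 0 and 0.3) fall inside most of the intervals, the estimates are consistent with the code that generated the data, even when individual point estimates look off.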

Now, adjust the code so that the regression coefficients will be 3, 7, 2, and 0.

y <- a * 3 + 7 * b + 2 * c + 0 * d + rnorm(100, mean = 0, sd = 10)

Test your adjusted code to see if it worked!

mod.22 <- lm(y ~ a + b + c + d)
summary(mod.22)

## 
## Call:
## lm(formula = y ~ a + b + c + d)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -23.2903  -5.9644   0.3388   6.4065  26.0735 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -9.1728    26.8174  -0.342   0.7331    
## a             2.9402     0.2201  13.361   <2e-16 ***
## b             7.3640     0.5329  13.818   <2e-16 ***
## c             1.9603     1.0685   1.835   0.0697 .  
## d             0.1151     0.5879   0.196   0.8453    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.57 on 95 degrees of freedom
## Multiple R-squared:  0.8053, Adjusted R-squared:  0.7971 
## F-statistic: 98.23 on 4 and 95 DF,  p-value: < 2.2e-16

# Yes, the estimated coefficients are close to what I wanted them to be!
# But again, they were slightly different due to randomness in the data.
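
To see how much of the difference is just random variation, the whole simulation can be repeated many times. The sketch below wraps the code above in a hypothetical function one.run() and uses replicate(); the seed and the choice of 200 repetitions are arbitrary assumptions:

```r
set.seed(6)  # arbitrary seed so the toy example is reproducible

# Hypothetical helper: simulate one dataset and return the coefficients
one.run <- function() {
  a <- rnorm(100, mean = 10, sd = 5)
  b <- rnorm(100, mean = 30, sd = 2)
  c <- rnorm(100, mean = 20, sd = 1)
  d <- rnorm(100, mean = 5, sd = 2)
  y <- a * 3 + 7 * b + 2 * c + 0 * d + rnorm(100, mean = 0, sd = 10)
  coef(lm(y ~ a + b + c + d))
}

# Matrix with one row per coefficient and one column per simulation
estimates <- replicate(200, one.run())

round(rowMeans(estimates), 2)       # averages should sit near 3, 7, 2 and 0
round(apply(estimates, 1, sd), 2)   # run-to-run spread of each estimate
```

The averages settle near the true values, while the spread shows how far a single run, like the one above, can drift from them.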

Submit!

Save and email your wpa_8_LastFirst.R file to me at nathaniel.phillips@unibas.ch. Then, go to https://goo.gl/forms/UblvQ6dvA76veEWu1 to complete the WPA submission form.

WPA #8 – Regression