Student Performance

In this WPA, you will analyze data from a study on student performance in two classes: math and Portuguese. These data come from the UCI Machine Learning database at http://archive.ics.uci.edu/ml/datasets/Student+Performance#

Here is the data description (taken directly from the original website):

This data approach student achievement in secondary education of two Portuguese schools. The data attributes include student grades, demographic, social and school related features, and it was collected by using school reports and questionnaires. Two datasets are provided regarding the performance in two distinct subjects: Mathematics (mat) and Portuguese language (por). In [Cortez and Silva, 2008], the two datasets were modeled under binary/five-level classification and regression tasks. Important note: the target attribute G3 has a strong correlation with attributes G2 and G1. This occurs because G3 is the final year grade (issued at the 3rd period), while G1 and G2 correspond to the 1st and 2nd period grades. It is more difficult to predict G3 without G2 and G1, but such prediction is much more useful (see paper source for more details).

The data are located in two semicolon (;) separated text files at http://nathanieldphillips.com/wp-content/uploads/2016/04/student-mat.csv (the math data), and http://nathanieldphillips.com/wp-content/uploads/2016/04/student-por.csv (the Portuguese data).

Here is how the first few rows of the math data should look:

head(student.math)
##   school sex age address famsize Pstatus Medu Fedu     Mjob     Fjob
## 1     GP   F  18       U     GT3       A    4    4  at_home  teacher
## 2     GP   F  17       U     GT3       T    1    1  at_home    other
## 3     GP   F  15       U     LE3       T    1    1  at_home    other
## 4     GP   F  15       U     GT3       T    4    2   health services
## 5     GP   F  16       U     GT3       T    3    3    other    other
## 6     GP   M  16       U     LE3       T    4    3 services    other
##       reason guardian traveltime studytime failures schoolsup famsup paid
## 1     course   mother          2         2        0       yes     no   no
## 2     course   father          1         2        0        no    yes   no
## 3      other   mother          1         2        3       yes     no  yes
## 4       home   mother          1         3        0        no    yes  yes
## 5       home   father          1         2        0        no    yes  yes
## 6 reputation   mother          1         2        0        no    yes  yes
##   activities nursery higher internet romantic famrel freetime goout Dalc
## 1         no     yes    yes       no       no      4        3     4    1
## 2         no      no    yes      yes       no      5        3     3    1
## 3         no     yes    yes      yes       no      4        3     2    2
## 4        yes     yes    yes      yes      yes      3        2     2    1
## 5         no     yes    yes       no       no      4        3     2    1
## 6        yes     yes    yes      yes       no      5        4     2    1
##   Walc health absences G1 G2 G3
## 1    1      3        6  5  6  6
## 2    1      3        4  5  5  6
## 3    3      3       10  7  8 10
## 4    1      5        2 15 14 15
## 5    2      5        4  6 10 10
## 6    2      5       10 15 15 15

Datafile description

Both datafiles have 33 columns. Here they are:

1 school - student’s school (binary: ‘GP’ - Gabriel Pereira or ‘MS’ - Mousinho da Silveira)

2 sex - student’s sex (binary: ‘F’ - female or ‘M’ - male)

3 age - student’s age (numeric: from 15 to 22)

4 address - student’s home address type (binary: ‘U’ - urban or ‘R’ - rural)

5 famsize - family size (binary: ‘LE3’ - less or equal to 3 or ‘GT3’ - greater than 3)

6 Pstatus - parent’s cohabitation status (binary: ‘T’ - living together or ‘A’ - apart)

7 Medu - mother’s education (numeric: 0 - none, 1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)

8 Fedu - father’s education (numeric: 0 - none, 1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)

9 Mjob - mother’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’)

10 Fjob - father’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’)

11 reason - reason to choose this school (nominal: close to ‘home’, school ‘reputation’, ‘course’ preference or ‘other’)

12 guardian - student’s guardian (nominal: ‘mother’, ‘father’ or ‘other’)

13 traveltime - home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour)

14 studytime - weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)

15 failures - number of past class failures (numeric: n if 1<=n<3, else 4)

16 schoolsup - extra educational support (binary: yes or no)

17 famsup - family educational support (binary: yes or no)

18 paid - extra paid classes within the course subject (Math or Portuguese) (binary: yes or no)

19 activities - extra-curricular activities (binary: yes or no)

20 nursery - attended nursery school (binary: yes or no)

21 higher - wants to take higher education (binary: yes or no)

22 internet - Internet access at home (binary: yes or no)

23 romantic - with a romantic relationship (binary: yes or no)

24 famrel - quality of family relationships (numeric: from 1 - very bad to 5 - excellent)

25 freetime - free time after school (numeric: from 1 - very low to 5 - very high)

26 goout - going out with friends (numeric: from 1 - very low to 5 - very high)

27 Dalc - workday alcohol consumption (numeric: from 1 - very low to 5 - very high)

28 Walc - weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)

29 health - current health status (numeric: from 1 - very bad to 5 - very good)

30 absences - number of school absences (numeric: from 0 to 93)

31 G1 - first period grade (numeric: from 0 to 20)

32 G2 - second period grade (numeric: from 0 to 20)

33 G3 - final grade (numeric: from 0 to 20, output target)

Data loading and preparation

A. Open your WPA.RProject and open a new script. Save the script with the name WPA8.R.

B. Using read.table(), load the two semicolon (;) delimited text files containing the data into R and assign them to new objects called student.math and student.por respectively.
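
As a minimal sketch of step B, the example below assumes the files have a header row. To stay runnable offline, it writes a tiny invented semicolon-separated file to a temporary path and reads it back; the names toy and toy.path and all values are illustrative only. For the real data, the same read.table() call would take the file URL or a local path instead.

```r
# Toy stand-in for student-mat.csv: three invented rows, semicolon separated
toy.path <- tempfile(fileext = ".csv")
writeLines(c("school;sex;age;G1",
             "GP;F;18;5",
             "GP;M;17;12",
             "MS;F;15;9"),
           con = toy.path)

# sep = ";" sets the delimiter; header = TRUE treats the first row as column names
toy <- read.table(file = toy.path, sep = ";", header = TRUE)

# For the real files the call has the same shape (assumption: header row present):
# student.math <- read.table(file = "student-mat.csv", sep = ";", header = TRUE)

str(toy)
```

Without sep = ";" each row would be read as a single column, which is the most common import mistake with these files.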

Understand the data

D. Look at the first few rows of the dataframes with the head() function to make sure they were imported correctly.

E. Using the summary() function, look at summary statistics for each column in the dataframe. There should be 33 columns in each dataset. Make sure everything looks ok.
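
The checks in steps D and E might look like the sketch below. It uses a small invented dataframe called toy in place of student.math, so the column count here is 3 rather than 33.

```r
# Invented stand-in for student.math with only a few of the 33 columns
toy <- data.frame(sex = c("F", "M", "F", "M"),
                  age = c(18, 17, 15, 16),
                  G1  = c(5, 12, 9, 15))

head(toy, n = 2)     # first two rows
summary(toy)         # per-column summaries (quartiles for numeric columns)

n.cols <- ncol(toy)  # the real datasets should give 33 here
n.cols
```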

Standard Regression with lm()

One IV

  1. For the math data, create a regression object predicting first period grade (G1) based on age.
    1. Is there a significant relationship between age and G1?
    1. What is the estimate of the coefficient for age? How do you interpret this value?
  1. For the Portuguese data, create a regression object predicting first period grade (G1) based on age.
    1. Is there a significant relationship between age and G1?
    1. What is the estimate of the coefficient for age? How do you interpret this value?
  1. For the math data, create a regression object called g1g3.mod predicting each student’s period 3 grade (G3) based on their period 1 grade (G1).
    1. Is there a significant relationship between G1 and G3?
    1. Create a scatterplot showing the relationship between G1 and G3.
    1. Add a regression line to the scatterplot from your regression object.
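
The one-predictor workflow above can be sketched as follows. The data are simulated (not the real student data) and the object names toy and g1g3.toy are illustrative; the calls to lm(), summary(), plot() and abline() are the same ones you would apply to student.math.

```r
set.seed(1)
# Simulated stand-in for student.math (values are invented)
toy <- data.frame(G1 = sample(0:20, size = 50, replace = TRUE))
toy$G3 <- 1 + 0.9 * toy$G1 + rnorm(50, mean = 0, sd = 2)

# Regress G3 on G1
g1g3.toy <- lm(formula = G3 ~ G1, data = toy)

# Coefficient table: estimate, standard error, t value and p-value per term
coef.table <- summary(g1g3.toy)$coefficients
coef.table

# Scatterplot with the fitted line added from the regression object
plot(x = toy$G1, y = toy$G3, xlab = "G1", ylab = "G3")
abline(g1g3.toy, col = "red")
```

The estimate in the G1 row is the expected change in G3 for a one-point increase in G1, and the last column of the table holds the p-value used to judge significance.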

Multiple IVs

  1. For the math data, create a regression object called math.mod1 predicting third period grade (G3) based on sex, age, internet, and failures.
    1. Which variables are significantly related to third period grade?
    1. What does the estimate for internet mean?
    1. Create a scatterplot showing the relationship between the true values of G3 and the model fits.
    1. By hand, calculate the model's estimated math grade for a male student of age 15 with internet access and 3 previous class failures.
    1. Test your by-hand prediction from the previous step by creating a new dataframe of test data and using the predict() function.
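
One way to compare a by-hand prediction with predict() is sketched below on simulated data. The dataframe toy, the model toy.mod and all coefficient values are invented. The sketch assumes that sex and internet are stored as text, in which case lm() creates the dummy coefficients sexM and internetyes (the alphabetically first level is the baseline).

```r
set.seed(2)
n <- 100
# Simulated stand-in for student.math (values are invented)
toy <- data.frame(sex      = sample(c("F", "M"), n, replace = TRUE),
                  age      = sample(15:22, n, replace = TRUE),
                  internet = sample(c("no", "yes"), n, replace = TRUE),
                  failures = sample(0:3, n, replace = TRUE))
toy$G3 <- 20 - 0.4 * toy$age - 1.5 * toy$failures + rnorm(n, sd = 2)

toy.mod <- lm(G3 ~ sex + age + internet + failures, data = toy)
b <- coef(toy.mod)  # names: (Intercept), sexM, age, internetyes, failures

# Observed values against model fits
plot(x = toy$G3, y = toy.mod$fitted.values, xlab = "True G3", ylab = "Fitted G3")

# By hand: male (sexM = 1), age 15, internet access (internetyes = 1), 3 failures
by.hand <- b["(Intercept)"] + b["sexM"] * 1 + b["age"] * 15 +
           b["internetyes"] * 1 + b["failures"] * 3

# Same case through predict() with a one-row dataframe of test data
new.student <- data.frame(sex = "M", age = 15, internet = "yes", failures = 3)
from.predict <- predict(toy.mod, newdata = new.student)

# TRUE when the two routes agree
all.equal(unname(by.hand), unname(from.predict))
```

The column names in the test dataframe must match the predictor names used in the model formula, otherwise predict() cannot find them.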

Checkpoint!!!

Using models from one dataset to predict data from another

  1. Create a new regression object called por.mod1 using the same variables as question 4 (the math.mod1 model); however, this time use the Portuguese dataset to fit the model.
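
The idea named in this section's heading, fitting on one dataset and predicting another, can be sketched as below. Both datasets are simulated by a hypothetical helper, make.toy(), and stand in for student.por and student.math; the mean absolute error at the end is just one possible way to summarise how well the predictions transfer.

```r
set.seed(3)
# Hypothetical helper that simulates the four predictors and G3 (invented values)
make.toy <- function(n) {
  d <- data.frame(sex      = sample(c("F", "M"), n, replace = TRUE),
                  age      = sample(15:22, n, replace = TRUE),
                  internet = sample(c("no", "yes"), n, replace = TRUE),
                  failures = sample(0:3, n, replace = TRUE))
  d$G3 <- 18 - 0.3 * d$age - 1.5 * d$failures + rnorm(n, sd = 2)
  d
}
toy.por  <- make.toy(120)  # stands in for student.por
toy.math <- make.toy(80)   # stands in for student.math

# Fit on one dataset ...
toy.por.mod <- lm(G3 ~ sex + age + internet + failures, data = toy.por)

# ... and predict the other by passing it as newdata
math.pred <- predict(toy.por.mod, newdata = toy.math)

# Mean absolute prediction error on data the model never saw
mae <- mean(abs(toy.math$G3 - math.pred))
mae
```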

Logistic regression

  1. For the Portuguese data, create a logistic regression model using glm() predicting whether a student attended nursery school based on his/her sex, family size, mother’s education, father’s education, and parent’s cohabitation status. (Hint: To do this, you’ll need to recode the nursery variable into a binary variable of 0s and 1s)
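
A minimal sketch of the recoding and the glm() call, on simulated data with only three of the five predictors. The names toy, nursery.bin and nursery.toy.mod are illustrative, and the simulated outcome is unrelated to the predictors, so the coefficients themselves carry no meaning.

```r
set.seed(4)
n <- 200
# Simulated stand-in for student.por (values are invented)
toy <- data.frame(sex     = sample(c("F", "M"), n, replace = TRUE),
                  famsize = sample(c("GT3", "LE3"), n, replace = TRUE),
                  Medu    = sample(0:4, n, replace = TRUE),
                  nursery = sample(c("no", "yes"), n, replace = TRUE,
                                   prob = c(0.3, 0.7)))

# Recode the yes/no outcome as 0/1
toy$nursery.bin <- ifelse(toy$nursery == "yes", 1, 0)

# family = "binomial" requests logistic regression
nursery.toy.mod <- glm(nursery.bin ~ sex + famsize + Medu,
                       data = toy, family = "binomial")
summary(nursery.toy.mod)$coefficients

# Fitted probabilities of attending nursery school
fitted.probs <- predict(nursery.toy.mod, type = "response")
range(fitted.probs)
```

The coefficients are on the log-odds scale; type = "response" converts predictions back to probabilities.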

Poisson regression

  1. Create a new regression object called travel.por.mod predicting a student’s travel time to school as a function of every other variable in the Portuguese dataset. Use a standard regression with lm().
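
The sketch below shows the "." formula shorthand, which uses every other column in the dataframe as a predictor. The data are simulated and the object names are illustrative. Because the heading mentions Poisson regression while the task asks for lm(), the second call shows, purely as an aside, how the same formula could be passed to glm() with family = "poisson".

```r
set.seed(5)
n <- 150
# Simulated stand-in for student.por with three predictors (values are invented)
toy <- data.frame(age       = sample(15:22, n, replace = TRUE),
                  address   = sample(c("R", "U"), n, replace = TRUE),
                  studytime = sample(1:4, n, replace = TRUE))
# Travel time codes between 1 and 4, longer on average for rural addresses
toy$traveltime <- ifelse(toy$address == "R", 2, 1) + rbinom(n, size = 2, prob = 0.2)

# "." on the right-hand side means "all other columns in data"
travel.toy.mod <- lm(traveltime ~ ., data = toy)
names(coef(travel.toy.mod))

# Poisson alternative with the same formula, for comparison only
travel.toy.pois <- glm(traveltime ~ ., data = toy, family = "poisson")
coef(travel.toy.pois)
```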