Student Performance

In this WPA, you will analyze data from a study on student performance in two classes: math and Portugese. These data come from the UCI Machine Learning database at http://archive.ics.uci.edu/ml/datasets/Student+Performance#

Here is the data description (taken directly from the original website

This data approach student achievement in secondary education of two Portuguese schools. The data attributes include student grades, demographic, social and school related features) and it was collected by using school reports and questionnaires. Two datasets are provided regarding the performance in two distinct subjects: Mathematics (mat) and Portuguese language (por). In [Cortez and Silva, 2008], the two datasets were modeled under binary/five-level classification and regression tasks. Important note: the target attribute G3 has a strong correlation with attributes G2 and G1. This occurs because G3 is the final year grade (issued at the 3rd period), while G1 and G2 correspond to the 1st and 2nd period grades. It is more difficult to predict G3 without G2 and G1, but such prediction is much more useful (see paper source for more details).

The data are located in two tab-delimited text files at http://nathanieldphillips.com/wp-content/uploads/2016/11/studentmath.txt (the math data), and http://nathanieldphillips.com/wp-content/uploads/2016/11/studentpor.txt (the portugese data).

Datafile description

Both datafiles have 33 columns. Here they are:

1 school - student’s school (binary: ‘GP’ - Gabriel Pereira or ‘MS’ - Mousinho da Silveira)

2 sex - student’s sex (binary: ‘F’ - female or ‘M’ - male)

3 age - student’s age (numeric: from 15 to 22)

4 address - student’s home address type (binary: ‘U’ - urban or ‘R’ - rural)

5 famsize - family size (binary: ‘LE3’ - less or equal to 3 or ‘GT3’ - greater than 3)

6 Pstatus - parent’s cohabitation status (binary: ‘T’ - living together or ‘A’ - apart)

7 Medu - mother’s education (numeric: 0 - none, 1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)

8 Fedu - father’s education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education)

9 Mjob - mother’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’)

10 Fjob - father’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’)

11 reason - reason to choose this school (nominal: close to ‘home’, school ‘reputation’, ‘course’ preference or ‘other’)

12 guardian - student’s guardian (nominal: ‘mother’, ‘father’ or ‘other’)

13 traveltime - home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour)

14 studytime - weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)

15 failures - number of past class failures (numeric: n if 1<=n<3, else 4)

16 schoolsup - extra educational support (binary: yes or no)

17 famsup - family educational support (binary: yes or no)

18 paid - extra paid classes within the course subject (Math or Portuguese) (binary: yes or no)

19 activities - extra-curricular activities (binary: yes or no)

20 nursery - attended nursery school (binary: yes or no)

21 higher - wants to take higher education (binary: yes or no)

22 internet - Internet access at home (binary: yes or no)

23 romantic - with a romantic relationship (binary: yes or no)

24 famrel - quality of family relationships (numeric: from 1 - very bad to 5 - excellent)

25 freetime - free time after school (numeric: from 1 - very low to 5 - very high)

26 goout - going out with friends (numeric: from 1 - very low to 5 - very high)

27 Dalc - workday alcohol consumption (numeric: from 1 - very low to 5 - very high)

28 Walc - weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)

29 health - current health status (numeric: from 1 - very bad to 5 - very good)

30 absences - number of school absences (numeric: from 0 to 93)

31 G1 - first period grade (numeric: from 0 to 20)

31 G2 - second period grade (numeric: from 0 to 20)

32 G3 - final grade (numeric: from 0 to 20, output target)

Data loading and preparation

  1. Open an R poject and open a new script. Save the script with the name wpa_8_LastFirst.R.

  2. Using read.table(), load the tab-delimited text file containing the data into R and assign them to new objects called student.math and student.por respectively.

Understand the data

  1. Look at the first few rows of the dataframes with the head() function to make sure they were imported correctly.

  2. Using the str() function, look at summary statistics for each column in the dataframe. There should be 33 columsn in each dataset. Make sure everything looks ok.

Standard Regression with lm()

One IV

  1. For the math data, create a regression object called lm.5 predicting first period grade (G1) based on age.

  2. How do you interpret the relationship between age and first period grade?

  3. For the math data, create a regression object called lm.7 predicting first period grade (G1) based on absences

  4. How do you interpret the relationship between absences and G1?

  5. For the math data, create a regression object called lm.9 predicting each student’s period 3 grade (G3) based on their period 1 grade (G1)

  6. How do you determine the relationship between G1 and G3?

Adding a regression line to a scatterplot

  1. Create a scatterplot showing the relationship between G1 and G3 for the math data.

  2. Add a regression line to the scatterplot from your regression object lm.9 (hint: use abline()).

Multiple IVs

  1. For the math data, create a regression object called lm.13 predicting third period grade (G3) based on sex, age, internet, and failures

  2. How do you interpret the regression output? Which variables are significantly related to third period grade?

Checkpoint!!!

  1. Create a new regression object called lm.15 using the same variables as question 13 (the model was lm.13 where you predicted third period grade (G3) based on sex, age, internet, and failures): however, this time use the portugese dataset.

  2. What are the key differences between the beta values for the portugese dataset (lm.15) and the math dataset (lm.13)?

Predicting values

  1. For the math dataset, create a regression object called lm.17 predicting a student’s first period grade (G1) based on all variables in the dataset (Hint: use the notation formula = y ~ . to include all variables!

  2. Save the fitted values values from the lm.17 object as a vector called lm.17.fitted (Hint: model$fitted.values)

  3. For the math dataset, create a scatterplot showing the relationship between a student’s first period grade (G1) and the fitted values from the model. Does the model appear to correctly fit a student’s first period grade?

Submit!

Save and email your wpa_8_LastFirst.R file to me at nathaniel.phillips@unibas.ch. Then, go to https://goo.gl/forms/UblvQ6dvA76veEWu1 to complete the WPA submission form.