Student Performance

In this WPA, you will analyze data from a study on student performance in two classes: math and Portuguese. These data come from the UCI Machine Learning database at http://archive.ics.uci.edu/ml/datasets/Student+Performance#

The data were collected for this paper:
P. Cortez and A. Silva. Using Data Mining to Predict Secondary School Student Performance. In A. Brito and J. Teixeira Eds., Proceedings of 5th FUture BUsiness TEChnology Conference (FUBUTEC 2008) pp. 5-12, Porto, Portugal, April, 2008, EUROSIS, ISBN 978-9077381-39-7.

Here is the data description (taken directly from the original website):

This data approach student achievement in secondary education of two Portuguese schools. The data attributes include student grades, demographic, social and school related features) and it was collected by using school reports and questionnaires. Two datasets are provided regarding the performance in two distinct subjects: Mathematics (mat) and Portuguese language (por). In [Cortez and Silva, 2008], the two datasets were modeled under binary/five-level classification and regression tasks. Important note: the target attribute G3 has a strong correlation with attributes G2 and G1. This occurs because G3 is the final year grade (issued at the 3rd period), while G1 and G2 correspond to the 1st and 2nd period grades. It is more difficult to predict G3 without G2 and G1, but such prediction is much more useful (see paper source for more details).

The data are located in two tab-delimited text files at https://www.dropbox.com/s/7qq4yxh6s2qek8r/student-mat.txt?dl=1 (the math data), and https://www.dropbox.com/s/36zxhaem59s25bm/student-por.txt?dl=1 (the Portuguese data).

Datafile description

Both datafiles have 33 columns. Here they are:

1 school - student’s school (binary: ‘GP’ - Gabriel Pereira or ‘MS’ - Mousinho da Silveira)

2 sex - student’s sex (binary: ‘F’ - female or ‘M’ - male)

3 age - student’s age (numeric: from 15 to 22)

4 address - student’s home address type (binary: ‘U’ - urban or ‘R’ - rural)

5 famsize - family size (binary: ‘LE3’ - less or equal to 3 or ‘GT3’ - greater than 3)

6 Pstatus - parent’s cohabitation status (binary: ‘T’ - living together or ‘A’ - apart)

7 Medu - mother’s education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education)

8 Fedu - father’s education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education)

9 Mjob - mother’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’)

10 Fjob - father’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’)

11 reason - reason to choose this school (nominal: close to ‘home’, school ‘reputation’, ‘course’ preference or ‘other’)

12 guardian - student’s guardian (nominal: ‘mother’, ‘father’ or ‘other’)

13 traveltime - home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour)

14 studytime - weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)

15 failures - number of past class failures (numeric: n if 1<=n<3, else 4)

16 schoolsup - extra educational support (binary: yes or no)

17 famsup - family educational support (binary: yes or no)

18 paid - extra paid classes within the course subject (Math or Portuguese) (binary: yes or no)

19 activities - extra-curricular activities (binary: yes or no)

20 nursery - attended nursery school (binary: yes or no)

21 higher - wants to take higher education (binary: yes or no)

22 internet - Internet access at home (binary: yes or no)

23 romantic - with a romantic relationship (binary: yes or no)

24 famrel - quality of family relationships (numeric: from 1 - very bad to 5 - excellent)

25 freetime - free time after school (numeric: from 1 - very low to 5 - very high)

26 goout - going out with friends (numeric: from 1 - very low to 5 - very high)

27 Dalc - workday alcohol consumption (numeric: from 1 - very low to 5 - very high)

28 Walc - weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)

29 health - current health status (numeric: from 1 - very bad to 5 - very good)

30 absences - number of school absences (numeric: from 0 to 93)

31 G1 - first period grade (numeric: from 0 to 20)

32 G2 - second period grade (numeric: from 0 to 20)

33 G3 - final grade (numeric: from 0 to 20, output target)

Data loading and preparation

Open a new script and save it with the name wpa_8_LastFirst.R. You should either add your working directory command to the script (setwd()) or open your existing R project (.Rproj) before creating the new script.
Using read.table(), load the two tab-delimited text files containing the data into R and assign them to new objects called student.math and student.por respectively.
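
A minimal sketch of the loading step, assuming the files have a header row with the column names listed above. So that it runs offline, it first writes a tiny made-up tab-delimited file to a temporary location; toy.file and toy.data are hypothetical names, and for the assignment you would pass each data file's URL as the file argument instead.

```r
# Write a tiny, made-up tab-delimited file so the example runs offline
toy.file <- tempfile(fileext = ".txt")
writeLines(c("school\tsex\tage\tG1",
             "GP\tF\t18\t5",
             "GP\tM\t17\t12",
             "MS\tF\t16\t14"),
           con = toy.file)

# sep = "\t" marks tabs as the delimiter; header = TRUE assumes that
# the first row holds the column names
toy.data <- read.table(file = toy.file,
                       sep = "\t",
                       header = TRUE,
                       stringsAsFactors = TRUE)

toy.data
```

From R 4.0 onwards read.table() keeps text columns as character vectors unless you set stringsAsFactors = TRUE. lm() treats both the same way, but str() and levels() only show factor levels for factor columns.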

Understand the data

Look at the first few rows of the dataframes with the head() function to make sure they were imported correctly.
Using the str() function, look at the structure (type and first few values) of each column in the dataframes. There should be 33 columns in each dataset. Make sure everything looks ok.
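
As a rough illustration of these checks, here is a sketch on a made-up three-column data frame (toy is a hypothetical stand-in, not the real data). It also shows summary(), which is the function to use if you want per-column summary statistics.

```r
# Made-up stand-in for the real data (three columns only)
toy <- data.frame(age = c(15, 16, 18, 17),
                  schoolsup = factor(c("no", "yes", "no", "no")),
                  G1 = c(10, 8, 14, 12))

head(toy, n = 3)   # first three rows
str(toy)           # type of each column plus its first few values
ncol(toy)          # number of columns (should be 33 for the real data)
summary(toy)       # per-column summary statistics
```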

Standard Regression with lm()

Before you do anything else, look at the help file for lm().

The exercises are set up slightly differently this week. After some questions you’ll notice a button labelled code. This button mainly appears after questions that ask you to interpret the output of a regression. Pressing it shows the answer/explanation for how to do the interpretation. Ideally, you should try the interpretation yourself before looking at the answer/explanation. I am including these explanations this week because I think they will help you notice gaps in your knowledge/understanding, so that you can ask me questions during class rather than realising afterwards.

One IV

For the math data, create a regression object called lm.5 predicting first period grade (G1) based on age.
How do you interpret the relationship between age and first period grade?

# There is a slight negative relationship between age and first period grade (b = -0.17), however the relationship is not significant.

# The direction and size of the relationship is given by b, labelled "Estimate" in the "age" line of the output. b is always interpreted as the change in the dependent variable predicted by a 1 unit change in the predictor variable. So in this case a 1 year increase in age predicts that the first period grade will decrease by 0.17. This means that the units of both the predictor (age) and the predicted variable (first period grade) matter when interpreting b. For instance, if age were recorded in months instead of years, b would be much smaller (as it would be the predicted change in grade for a 1 month increase in age). b (the "Estimate" for age) is also the slope of your regression line.

# The "Estimate" in the intercept line of the output gives you the intercept of your regression line (i.e. the point at which your regression line cuts the y-axis). In general you won't be interested in interpreting this parameter, so its significance doesn't matter, but it should always be included in the model.
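
The point about units can be checked directly. The sketch below uses simulated stand-in data (not the real student data, so the numbers will differ from the answer above) and fits the same model twice, once with age in years and once in months. All object names here are hypothetical.

```r
set.seed(1)  # simulated stand-in data, not the real student data
toy <- data.frame(age = sample(15:22, size = 100, replace = TRUE))
toy$G1 <- 12 - 0.2 * toy$age + rnorm(100, sd = 3)
toy$age.months <- toy$age * 12

lm.years  <- lm(formula = G1 ~ age, data = toy)
lm.months <- lm(formula = G1 ~ age.months, data = toy)

# Columns: Estimate (b), Std. Error, t value, Pr(>|t|) (the p-value)
summary(lm.years)$coefficients

b.years  <- unname(coef(lm.years)["age"])
b.months <- unname(coef(lm.months)["age.months"])

# Same relationship, different units: the slope per month is
# one twelfth of the slope per year
all.equal(b.months, b.years / 12)
```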

For the math data, create a regression object called lm.7 predicting first period grade (G1) based on absences
How do you interpret the relationship between absences and G1?

# There is a slight negative relationship between absences and first period grade (b = -0.01), however the relationship is not significant.

For the math data, create a regression object called lm.9 predicting first period grade (G1) based on school support (schoolsup)
How do you interpret the relationship between school support and G1? Given that school support is a nominal variable with 2 levels, how can you tell from the output (use summary) which direction the effect is? How does this relate to the way the data.frame has stored the levels of the school support factor (use str(student.math$schoolsup))?

# There is a negative relationship between school support and first period grade (b = -2.1). Students with extra support from the school have worse Period 1 grades on average.

# Because "schoolsup" is a nominal variable, the regression summary output has no predictor line labelled "schoolsup"; it shows "schoolsupyes" instead. This tells us that b is the predicted change in grade when going from "no" school support to "yes" school support. In essence R has dummy coded the variable so that students with "no" school support are given a value of 0 and those with "yes" school support a value of 1. If we look at the factor levels for the school support column we see that "no" is level 1 and "yes" is level 2. For a two-level factor, lm() always codes the predictor as the change from level 1 to level 2 (i.e. level 1 is coded as 0, and level 2 as 1).
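
A small sketch of the dummy coding described above, using a made-up six-row data frame (toy and the model names are hypothetical). It also uses relevel(), which is not part of the exercise, to show that swapping the reference level only flips the sign of b.

```r
# Made-up data: four students without school support, two with
toy <- data.frame(schoolsup = factor(c("no", "yes", "no", "yes", "no", "no")),
                  G1 = c(12, 9, 11, 8, 13, 10))

# "no" is the first level, so lm() uses it as the reference (coded 0)
levels(toy$schoolsup)

toy.lm <- lm(formula = G1 ~ schoolsup, data = toy)
names(coef(toy.lm))   # the predictor appears as "schoolsupyes"

# Making "yes" the reference level reverses the direction of the comparison
toy$schoolsup2 <- relevel(toy$schoolsup, ref = "yes")
toy.lm2 <- lm(formula = G1 ~ schoolsup2, data = toy)
names(coef(toy.lm2))  # the predictor now appears as "schoolsup2no"

coef(toy.lm)["schoolsupyes"]   # change from "no" to "yes"
coef(toy.lm2)["schoolsup2no"]  # change from "yes" to "no" (same size, opposite sign)
```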

From the regression what would be your best guess for the first period grade for a student with no school support? What about for a student with school support?

#The "Intercept" estimate tells us the best estimate for first period grade when all predictor variables have a value of 0. In this case we only have 1 predictor (school support) and it is dummy coded so that those without school support have a value of 0. Therefore the predicted first period grade for those without school support is 11.18 (the intercept estimate). We can also use our "schoolsupyes" estimate to calculate our prediction for those with school support. Our estimate tells us that those with school support are predicted to have a first period grade that is 2.1 lower than those without school support, so their predicted grade is 11.18 - 2.1 = 9.08.

#Because we have a single categorical predictor, these predicted scores should also be the mean of each of the groups. Check this yourself.
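
One way to run this check is sketched below on made-up data (toy, pred.no and pred.yes are hypothetical names). With the real data you would swap in student.math and your own regression object.

```r
# Made-up data: four students without school support, two with
toy <- data.frame(schoolsup = factor(c("no", "yes", "no", "yes", "no", "no")),
                  G1 = c(12, 9, 11, 8, 13, 10))

toy.lm <- lm(formula = G1 ~ schoolsup, data = toy)
b <- coef(toy.lm)

# Predictions built from the two estimates
pred.no  <- unname(b["(Intercept)"])
pred.yes <- unname(b["(Intercept)"] + b["schoolsupyes"])

# Mean G1 within each level of schoolsup
group.means <- tapply(X = toy$G1, INDEX = toy$schoolsup, FUN = mean)

c(no = pred.no, yes = pred.yes)
group.means
```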

For the math data, create a regression object called lm.12 predicting each student’s period 3 grade (G3) based on their period 1 grade (G1)
How do you interpret the relationship between G1 and G3?

# There is a strong positive relationship between first period grade and third period grade (b = 1.1, p < .01). In other words, for a 1 unit increase in first period grade, third period grade increases by 1.1 on average.

Adding a regression line to a scatterplot

Create a scatterplot showing the relationship between G1 and G3 for the math data. Make sure that both the x and y-axes run from 0 to 20 (the minimum and maximum possible marks).
Add a regression line to the scatterplot from your regression object lm.12 (hint: use abline()).
Create a new figure in which you switch the variables so that your x-axis is now your y-axis and vice versa. Keep the model and abline the same. What happens?

#Comparing the two figures we can see that the data has changed (because we have flipped the way we are plotting the variables), but the regression line is the same for the two plots.

#This is because when you pass a regression object (lm.12 in this case) to abline it plots the regression line so that the predictor variable (G1) is on the x-axis and the predicted variable (G3) is on the y-axis. If this doesn't match the way you have plotted the variables in the scatterplot you won't get an error (because R doesn't know anything is wrong), you will just get a line that doesn't actually match your figure.

#If you wanted, you could even plot a regression line for a model that is unrelated to the variables on your scatterplot. For instance you could add abline(lm.5), the model predicting G1 from age, to your plot from question 14, and R would still produce a plot (albeit one that doesn't make sense).

#Therefore whenever you want to add a regression line to a scatterplot you need to make sure that 1) your regression model matches the scatterplot, and 2) the predictor variable (in this case G1) is on the x-axis and your predicted variable (G3) is on the y-axis.
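
A sketch of both plots done consistently, on simulated stand-in data (toy and the model names are hypothetical). It assumes that, when the axes are flipped, you also want a model with the roles of the two variables flipped.

```r
set.seed(2)  # simulated stand-in data, not the real student data
toy <- data.frame(G1 = sample(3:19, size = 80, replace = TRUE))
toy$G3 <- pmin(20, pmax(0, round(toy$G1 + rnorm(80, sd = 2))))

# When run as a script, send plots to a null device so no file is written
if (!interactive()) pdf(NULL)

# Predictor (G1) on the x-axis, predicted variable (G3) on the y-axis
toy.lm <- lm(formula = G3 ~ G1, data = toy)
plot(x = toy$G1, y = toy$G3,
     xlim = c(0, 20), ylim = c(0, 20),
     xlab = "G1", ylab = "G3")
abline(toy.lm)

# Flipped axes need a model with the flipped roles
toy.lm.flipped <- lm(formula = G1 ~ G3, data = toy)
plot(x = toy$G3, y = toy$G1,
     xlim = c(0, 20), ylim = c(0, 20),
     xlab = "G3", ylab = "G1")
abline(toy.lm.flipped)
```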

Multiple IVs

For the math data, create a regression object called lm.16 predicting third period grade (G3) based on sex, age, internet, and failures
How do you interpret the regression output? Which variables are significantly related to third period grade?

# sex and failures predict third period grade. Men perform better than women (b = 1.04, p = 0.015), and the more failures a person has the lower their grade (b = -2.13, p<.01).
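
For reading off which predictors are significant, the coefficient table can be pulled out of summary() as a matrix. The sketch below uses simulated stand-in data, so the estimates and p-values will not match the answer above; toy and coefs are hypothetical names.

```r
set.seed(3)  # simulated stand-in data, not the real student data
n <- 200
toy <- data.frame(sex = factor(sample(c("F", "M"), size = n, replace = TRUE)),
                  age = sample(15:22, size = n, replace = TRUE),
                  internet = factor(sample(c("no", "yes"), size = n, replace = TRUE)),
                  failures = sample(0:3, size = n, replace = TRUE,
                                    prob = c(0.70, 0.15, 0.10, 0.05)))
toy$G3 <- 11 + 1 * (toy$sex == "M") - 2 * toy$failures + rnorm(n, sd = 3)

# Several predictors are joined with +
toy.lm <- lm(formula = G3 ~ sex + age + internet + failures, data = toy)

# One row per term; columns are Estimate, Std. Error, t value, Pr(>|t|)
coefs <- summary(toy.lm)$coefficients
round(coefs, 3)

# Names of the terms whose p-value is below .05
rownames(coefs)[coefs[, "Pr(>|t|)"] < 0.05]
```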

Checkpoint!!!

Create a new regression object called lm.18 using the same variables as question 16 (the model was lm.16, where you predicted third period grade (G3) based on sex, age, internet, and failures); however, this time use the Portuguese dataset.
What are the key differences between the beta values for the Portuguese dataset (lm.18) and the math dataset (lm.16)?

# In the Portuguese dataset, men do worse than women (b = -0.72, p < .01), and internet actually helps performance (b = 0.93, p < .01)! Failures still lower grades (b = -2.05, p < .01).

Nominal Variable with 3 levels

For the math data, create a regression object called lm.20 predicting first period grade (G1) based on guardian. Guardian is a nominal variable with 3 levels.
Use summary to look at the output. You should see 2 predictors listed (“guardianmother” and “guardianother”), rather than the expected 1 (“guardian”). lm() has dummy coded your variable with “father” set as the reference group. Look at the levels of the guardian factor to see why “father” is the reference group. How would you interpret the results?

# There is a slight negative effect on first period grades of having your mother as opposed to your father as a guardian, but it is not significant. There is similarly a slight, non-significant, negative effect of having "other" as your guardian compared to having your father.

What is the predicted grade for those with a father as their guardian? Those with a mother? Those with other? Compare these to the means of each group again.

#We can use our three estimates to calculate these predictions like we did in Question 11. "father" is the reference group (coded as 0), so its predicted grade is the "Intercept" estimate (11.11). For the other two groups we just need to add their estimate to the intercept. Our predicted grade is therefore 10.88 for those with a mother as guardian and 10.56 for those with other.
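
Instead of adding the estimates by hand, predict() can do the arithmetic. The sketch below is an illustration on made-up data (toy, new.students and predicted are hypothetical names) and compares the predictions with the group means.

```r
# Made-up data with a three-level guardian factor
toy <- data.frame(guardian = factor(c("father", "mother", "other", "mother",
                                      "father", "mother", "other", "father")),
                  G1 = c(12, 10, 9, 11, 13, 12, 8, 11))

toy.lm <- lm(formula = G1 ~ guardian, data = toy)
names(coef(toy.lm))  # "father" is the reference level, so it has no line of its own

# One hypothetical new student per guardian level
new.students <- data.frame(guardian = factor(levels(toy$guardian),
                                             levels = levels(toy$guardian)))

# predict() adds the relevant estimate to the intercept for each row
predicted <- predict(toy.lm, newdata = new.students)
names(predicted) <- levels(toy$guardian)

# Mean G1 within each guardian group
group.means <- tapply(X = toy$G1, INDEX = toy$guardian, FUN = mean)

predicted
group.means
```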

Predicting values

For the math dataset, create a regression object called lm.23 predicting a student’s first period grade (G1) based on all variables in the dataset (Hint: use the notation formula = y ~ . to include all variables!)
Save the fitted values from the lm.23 object as a vector called lm.23.fitted (Hint: model$fitted.values)
For the math dataset, create a scatterplot showing the relationship between a student’s first period grade (G1) and the fitted values from the model. Does the model appear to correctly fit a student’s first period grade?

# The fit should look pretty good. This is probably because we are including both G2 and G3 as predictors.
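
A sketch of these three steps on simulated stand-in data with only five columns (toy, toy.lm and toy.fitted are hypothetical names). The dashed reference line is an extra that is not asked for in the exercise.

```r
set.seed(4)  # simulated stand-in data, not the real student data
n <- 150
toy <- data.frame(age = sample(15:22, size = n, replace = TRUE),
                  absences = rpois(n, lambda = 4),
                  G1 = sample(3:19, size = n, replace = TRUE))
toy$G2 <- toy$G1 + round(rnorm(n, sd = 1.5))
toy$G3 <- toy$G2 + round(rnorm(n, sd = 1.5))

# "." stands for every column in the data other than the response (G1)
toy.lm <- lm(formula = G1 ~ ., data = toy)
names(coef(toy.lm))

# One fitted value per row of the data
toy.fitted <- toy.lm$fitted.values

# When run as a script, send plots to a null device so no file is written
if (!interactive()) pdf(NULL)

plot(x = toy$G1, y = toy.fitted,
     xlab = "Observed G1", ylab = "Fitted G1")
abline(a = 0, b = 1, lty = 2)  # points on this line are fitted exactly
```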

Create a new regression object, called lm.26 which doesn’t include G2 or G3 as predictors, but still includes all other variables. (Hint: Don’t enter all the predictor names by hand. Try modifying the data.frame instead to exclude the columns you don’t need, i.e. use indexing )
There is also a better/easier way to do this. Just like + adds predictors, - can be used to remove predictors. Try it and compare the results.
Create the same scatterplot for this object. How well does the new model perform?

# It performs a lot worse, although it still explains roughly 30% of the variability, so it is not that bad.
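
The two ways of leaving out G2 and G3 can be compared as in the sketch below, which uses simulated stand-in data (all object names are hypothetical). The r.squared value gives the proportion of variability explained; on simulated data it will not match the figure quoted above.

```r
set.seed(5)  # simulated stand-in data, not the real student data
n <- 150
toy <- data.frame(age = sample(15:22, size = n, replace = TRUE),
                  absences = rpois(n, lambda = 4),
                  G1 = sample(3:19, size = n, replace = TRUE))
toy$G2 <- toy$G1 + round(rnorm(n, sd = 1.5))
toy$G3 <- toy$G2 + round(rnorm(n, sd = 1.5))

# Option 1: drop the columns by indexing, then use "." as before
toy.small <- toy[, !(names(toy) %in% c("G2", "G3"))]
lm.index <- lm(formula = G1 ~ ., data = toy.small)

# Option 2: keep the full data and remove predictors with - in the formula
lm.minus <- lm(formula = G1 ~ . - G2 - G3, data = toy)

# Both approaches give the same model
all.equal(coef(lm.index), coef(lm.minus))

# Proportion of variability in G1 explained by the model
summary(lm.minus)$r.squared
```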

Submit!

Save and email your ‘.R’ file wpa_8_LastFirst.R to me at ashleyjames.luckman@unibas.ch. Put the subject as WPA8-23496-02.

WPA #8 - Regression - Chapter 15