Student Performance
In this WPA, you will analyze data from a study on student performance in two classes: math and Portuguese. These data come from the UCI Machine Learning database at http://archive.ics.uci.edu/ml/datasets/Student+Performance#
Here is the data description (taken directly from the original website):
This data approach student achievement in secondary education of two Portuguese schools. The data attributes include student grades, demographic, social and school related features and it was collected by using school reports and questionnaires. Two datasets are provided regarding the performance in two distinct subjects: Mathematics (mat) and Portuguese language (por). In [Cortez and Silva, 2008], the two datasets were modeled under binary/five-level classification and regression tasks. Important note: the target attribute G3 has a strong correlation with attributes G2 and G1. This occurs because G3 is the final year grade (issued at the 3rd period), while G1 and G2 correspond to the 1st and 2nd period grades. It is more difficult to predict G3 without G2 and G1, but such prediction is much more useful (see paper source for more details).
The data are located in two tab-delimited text files at http://nathanieldphillips.com/wp-content/uploads/2016/11/studentmath.txt (the math data), and http://nathanieldphillips.com/wp-content/uploads/2016/11/studentpor.txt (the Portuguese data).
Datafile description
Both datafiles have 33 columns. Here they are:
1 school - student’s school (binary: ‘GP’ - Gabriel Pereira or ‘MS’ - Mousinho da Silveira)
2 sex - student’s sex (binary: ‘F’ - female or ‘M’ - male)
3 age - student’s age (numeric: from 15 to 22)
4 address - student’s home address type (binary: ‘U’ - urban or ‘R’ - rural)
5 famsize - family size (binary: ‘LE3’ - less or equal to 3 or ‘GT3’ - greater than 3)
6 Pstatus - parent’s cohabitation status (binary: ‘T’ - living together or ‘A’ - apart)
7 Medu - mother’s education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education)
8 Fedu - father’s education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education)
9 Mjob - mother’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’)
10 Fjob - father’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’)
11 reason - reason to choose this school (nominal: close to ‘home’, school ‘reputation’, ‘course’ preference or ‘other’)
12 guardian - student’s guardian (nominal: ‘mother’, ‘father’ or ‘other’)
13 traveltime - home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour)
14 studytime - weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)
15 failures - number of past class failures (numeric: n if 1<=n<3, else 4)
16 schoolsup - extra educational support (binary: yes or no)
17 famsup - family educational support (binary: yes or no)
18 paid - extra paid classes within the course subject (Math or Portuguese) (binary: yes or no)
19 activities - extra-curricular activities (binary: yes or no)
20 nursery - attended nursery school (binary: yes or no)
21 higher - wants to take higher education (binary: yes or no)
22 internet - Internet access at home (binary: yes or no)
23 romantic - with a romantic relationship (binary: yes or no)
24 famrel - quality of family relationships (numeric: from 1 - very bad to 5 - excellent)
25 freetime - free time after school (numeric: from 1 - very low to 5 - very high)
26 goout - going out with friends (numeric: from 1 - very low to 5 - very high)
27 Dalc - workday alcohol consumption (numeric: from 1 - very low to 5 - very high)
28 Walc - weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)
29 health - current health status (numeric: from 1 - very bad to 5 - very good)
30 absences - number of school absences (numeric: from 0 to 93)
31 G1 - first period grade (numeric: from 0 to 20)
32 G2 - second period grade (numeric: from 0 to 20)
33 G3 - final grade (numeric: from 0 to 20, output target)
Data loading and preparation
Open an R project and open a new script. Save the script with the name wpa_8_LastFirst.R.
Using read.table(), load the two tab-delimited text files containing the data into R and assign them to new objects called student.math and student.por, respectively.
student.math <- read.table("http://nathanieldphillips.com/wp-content/uploads/2016/11/studentmath.txt",
                           sep = "\t",
                           header = TRUE)
student.por <- read.table("http://nathanieldphillips.com/wp-content/uploads/2016/11/studentpor.txt",
                          sep = "\t",
                          header = TRUE)
Understand the data
Look at the first few rows of the dataframes with the head() function to make sure they were imported correctly.
Using the str() function, look at the structure (type and first few values) of each column in the dataframes. There should be 33 columns in each dataset. Make sure everything looks ok.
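As a minimal sketch of the loading and checking steps above, the example below writes a tiny made-up tab-delimited file to a temporary location, reads it back with read.table(), and inspects it. The file contents and the names demo.file and demo.df are hypothetical and are not part of the assignment data.

```r
# Hypothetical toy data, written to a temporary tab-delimited file
demo.file <- tempfile(fileext = ".txt")
writeLines(c("school\tage\tG1",
             "GP\t15\t12",
             "MS\t17\t9",
             "GP\t16\t14"),
           con = demo.file)

# Read the file back: sep = "\t" marks tabs as the delimiter,
# header = TRUE treats the first row as column names
demo.df <- read.table(demo.file, sep = "\t", header = TRUE)

head(demo.df)   # first rows of the dataframe
str(demo.df)    # column types and a preview of the values
ncol(demo.df)   # number of columns
```

The same three checks can be run on student.math and student.por once they are loaded.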
Standard Regression with lm()
One IV
For the math data, create a regression object called lm.5 predicting first period grade (G1) based on age.
5b. Run names() and summary() on lm.5 to see additional information from your regression object. Now, return a vector of the coefficients by running lm.5$coefficients
How do you interpret the relationship between age and first period grade?
For the math data, create a regression object called lm.7 predicting first period grade (G1) based on absences.
How do you interpret the relationship between absences and G1?
For the math data, create a regression object called lm.9 predicting each student’s period 3 grade (G3) based on their period 1 grade (G1). Look at the results of the regression analysis with summary().
What is the relationship between G1 and G3?
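The one-predictor workflow in the questions above can be sketched on simulated data. Everything in this example (the variables hours and score, the objects demo.df and demo.lm, and the numbers used to generate the data) is hypothetical and chosen only for illustration.

```r
set.seed(1)                                    # make the simulated data reproducible
hours <- runif(50, min = 0, max = 10)          # hypothetical predictor
score <- 5 + 1.5 * hours + rnorm(50, sd = 2)   # hypothetical outcome
demo.df <- data.frame(hours = hours, score = score)

# Fit a regression predicting score from hours
demo.lm <- lm(formula = score ~ hours, data = demo.df)

names(demo.lm)          # elements stored in the regression object
summary(demo.lm)        # coefficient table, R-squared, and F-test
demo.lm$coefficients    # named vector: (Intercept) and the slope for hours
```

In a model like this, the slope is the estimated change in the outcome for a one-unit increase in the predictor, and the intercept is the predicted outcome when the predictor equals 0.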
Regression vs. Correlation
Conduct a correlation test between G1 and G3 (Hint: use cor.test()). Compare the t-value for this test to the regression analysis you did in question 9. What do you see?
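One way to make the comparison is to pull the t-statistic out of each object and print the two side by side. The sketch below does this on simulated data; the variables x and y and the object names are hypothetical.

```r
set.seed(2)
x <- rnorm(40)                    # hypothetical variable
y <- 0.5 * x + rnorm(40)          # hypothetical variable

demo.cor <- cor.test(x, y)        # Pearson correlation test
demo.lm  <- lm(y ~ x)             # regression of y on x

# t-statistic stored in the correlation test object
cor.t <- unname(demo.cor$statistic)

# t-statistic for the slope, taken from the regression coefficient table
lm.t <- summary(demo.lm)$coefficients["x", "t value"]

c(correlation = cor.t, regression = lm.t)
```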
Adding a regression line to a scatterplot
Create a scatterplot showing the relationship between G1 and G3 for the math data.
Add a regression line to the scatterplot from your regression object lm.9 (hint: use abline()).
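As a rough illustration of the two plotting steps above, the following sketch draws a scatterplot of simulated data and overlays the fitted line. The variables x and y, the object demo.lm, and the plot styling are hypothetical choices.

```r
set.seed(3)
x <- rnorm(60, mean = 10, sd = 3)        # hypothetical predictor
y <- 2 + 0.8 * x + rnorm(60, sd = 2)     # hypothetical outcome
demo.lm <- lm(y ~ x)

# Scatterplot of the raw data
plot(x = x, y = y,
     xlab = "x", ylab = "y",
     main = "Simulated data with regression line")

# abline() accepts a one-predictor lm object and draws its fitted line
abline(demo.lm, col = "red", lwd = 2)
```

Note that abline() adds to an existing plot, so plot() has to be called first.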
Multiple IVs
For the math data, create a regression object called lm.14 predicting third period grade (G3) based on sex, age, internet, and failures.
How do you interpret the regression output? Which variables are significantly related to third period grade?
Checkpoint!!!
Create a new regression object called lm.16 using the same variables as question 14 (the model was lm.14, where you predicted third period grade (G3) based on sex, age, internet, and failures). However, this time use the Portuguese dataset.
What are the key differences between the beta values for the Portuguese dataset (lm.16) and the math dataset (lm.14)?
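A regression with several predictors, including a categorical one, can be sketched as follows. The data are simulated, and the column names group, hours, misses, and score, as well as the generating coefficients, are hypothetical.

```r
set.seed(4)
n <- 80
demo.df <- data.frame(
  group  = sample(c("A", "B"), size = n, replace = TRUE),  # hypothetical binary predictor
  hours  = runif(n, min = 0, max = 10),                    # hypothetical numeric predictor
  misses = rpois(n, lambda = 2)                            # hypothetical count predictor
)
demo.df$score <- 10 + 2 * (demo.df$group == "B") + 0.5 * demo.df$hours -
  1 * demo.df$misses + rnorm(n, sd = 2)

# Several predictors are joined with + on the right-hand side of the formula
demo.lm <- lm(formula = score ~ group + hours + misses, data = demo.df)

# Coefficient table: estimate, standard error, t value, and p value per term
summary(demo.lm)$coefficients
```

For a two-level categorical predictor, R reports one coefficient (here groupB), which is the estimated difference from the reference level (A) with the other predictors held constant.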
Predicting values
For the math dataset, create a regression object called lm.18 predicting a student’s first period grade (G1) based on all variables in the dataset (Hint: use the notation formula = y ~ . to include all variables!).
Save the fitted values from the lm.18 object as a vector called lm.18.fitted (Hint: model$fitted.values).
For the math dataset, create a scatterplot showing the relationship between a student’s first period grade (G1) and the fitted values from the model. Does the model appear to correctly fit a student’s first period grade?
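The dot notation and the fitted-values plot described above can be sketched on simulated data. The columns x1, x2, x3, and y and the objects demo.df, demo.lm, and demo.fitted are hypothetical.

```r
set.seed(5)
n <- 100
demo.df <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
demo.df$y <- 1 + 2 * demo.df$x1 - 1 * demo.df$x2 + rnorm(n)

# y ~ . means "predict y from every other column in data"
demo.lm <- lm(formula = y ~ ., data = demo.df)

# Fitted values: the model's prediction for each row used in the fit
demo.fitted <- demo.lm$fitted.values

# Observed versus fitted; points near the diagonal are well predicted
plot(x = demo.df$y, y = demo.fitted,
     xlab = "Observed y", ylab = "Fitted y")
abline(a = 0, b = 1, lty = 2)   # reference line where fitted equals observed
```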
Simulating regression analyses
Let’s do some simulations. Run the following code to create some random data:
a <- rnorm(100, mean = 10, sd = 5)
b <- rnorm(100, mean = 30, sd = 2)
c <- rnorm(100, mean = 20, sd = 1)
d <- rnorm(100, mean = 5, sd = 2)
y <- 50 + 2 * a - 5 * b + .3 * d + rnorm(100, mean = 0, sd = 10)
Based on this code, what do you expect the estimated regression coefficients to be for the independent variables a, b, c, and d?
Test your prediction by running the appropriate regression analysis.
Now, adjust the code so that the regression coefficients will be 3, 7, 2, and 0.
Test your adjusted code to see if it worked!
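The general pattern behind these simulation questions is to generate an outcome from known coefficients and then check whether lm() recovers them. The sketch below uses a different, hypothetical setup (predictors p and q, outcome z, and true values 4, 1.5, and -2), so it illustrates the approach without answering the questions above.

```r
set.seed(6)
n <- 1000                          # a large sample keeps estimates close to the truth
p <- rnorm(n, mean = 0, sd = 1)    # hypothetical predictor
q <- rnorm(n, mean = 5, sd = 2)    # hypothetical predictor

# True model: intercept 4, slope 1.5 for p, slope -2 for q, plus random noise
z <- 4 + 1.5 * p - 2 * q + rnorm(n, mean = 0, sd = 1)

sim.lm <- lm(z ~ p + q)

# Estimates should land near the true values, with some sampling error
round(sim.lm$coefficients, 2)
```

Increasing the noise sd or shrinking n makes the estimates wander further from the true values.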
Submit!
Save and email your wpa_8_LastFirst.R file to me at nathaniel.phillips@unibas.ch. Then, go to https://goo.gl/forms/UblvQ6dvA76veEWu1 to complete the WPA submission form.