Student Performance
In this WPA, you will analyze data from a study on student performance in two classes: math and Portuguese. These data come from the UCI Machine Learning database at http://archive.ics.uci.edu/ml/datasets/Student+Performance#
Here is the data description (taken directly from the original website):
This data approach student achievement in secondary education of two Portuguese schools. The data attributes include student grades, demographic, social and school related features and it was collected by using school reports and questionnaires. Two datasets are provided regarding the performance in two distinct subjects: Mathematics (mat) and Portuguese language (por). In [Cortez and Silva, 2008], the two datasets were modeled under binary/five-level classification and regression tasks. Important note: the target attribute G3 has a strong correlation with attributes G2 and G1. This occurs because G3 is the final year grade (issued at the 3rd period), while G1 and G2 correspond to the 1st and 2nd period grades. It is more difficult to predict G3 without G2 and G1, but such prediction is much more useful (see paper source for more details).
The data are located in two tab-delimited text files at http://nathanieldphillips.com/wp-content/uploads/2016/11/studentmath.txt (the math data), and http://nathanieldphillips.com/wp-content/uploads/2016/11/studentpor.txt (the Portuguese data).
Datafile description
Both datafiles have 33 columns. Here they are:
1 school - student’s school (binary: ‘GP’ - Gabriel Pereira or ‘MS’ - Mousinho da Silveira)
2 sex - student’s sex (binary: ‘F’ - female or ‘M’ - male)
3 age - student’s age (numeric: from 15 to 22)
4 address - student’s home address type (binary: ‘U’ - urban or ‘R’ - rural)
5 famsize - family size (binary: ‘LE3’ - less or equal to 3 or ‘GT3’ - greater than 3)
6 Pstatus - parent’s cohabitation status (binary: ‘T’ - living together or ‘A’ - apart)
7 Medu - mother’s education (numeric: 0 - none, 1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)
8 Fedu - father’s education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education)
9 Mjob - mother’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’)
10 Fjob - father’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’)
11 reason - reason to choose this school (nominal: close to ‘home’, school ‘reputation’, ‘course’ preference or ‘other’)
12 guardian - student’s guardian (nominal: ‘mother’, ‘father’ or ‘other’)
13 traveltime - home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour)
14 studytime - weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)
15 failures - number of past class failures (numeric: n if 1<=n<3, else 4)
16 schoolsup - extra educational support (binary: yes or no)
17 famsup - family educational support (binary: yes or no)
18 paid - extra paid classes within the course subject (Math or Portuguese) (binary: yes or no)
19 activities - extra-curricular activities (binary: yes or no)
20 nursery - attended nursery school (binary: yes or no)
21 higher - wants to take higher education (binary: yes or no)
22 internet - Internet access at home (binary: yes or no)
23 romantic - with a romantic relationship (binary: yes or no)
24 famrel - quality of family relationships (numeric: from 1 - very bad to 5 - excellent)
25 freetime - free time after school (numeric: from 1 - very low to 5 - very high)
26 goout - going out with friends (numeric: from 1 - very low to 5 - very high)
27 Dalc - workday alcohol consumption (numeric: from 1 - very low to 5 - very high)
28 Walc - weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)
29 health - current health status (numeric: from 1 - very bad to 5 - very good)
30 absences - number of school absences (numeric: from 0 to 93)
31 G1 - first period grade (numeric: from 0 to 20)
32 G2 - second period grade (numeric: from 0 to 20)
33 G3 - final grade (numeric: from 0 to 20, output target)
Data loading and preparation
Open an R project and open a new script. Save the script with the name wpa_8_LastFirst.R.
Using read.table(), load the tab-delimited text files containing the data into R and assign them to new objects called student.math and student.por, respectively.
student.math <- read.table("http://nathanieldphillips.com/wp-content/uploads/2016/11/studentmath.txt",
sep = "\t",
header = TRUE)
student.por <- read.table("http://nathanieldphillips.com/wp-content/uploads/2016/11/studentpor.txt",
sep = "\t",
header = TRUE)
Understand the data
- Look at the first few rows of the dataframes with the head() function to make sure they were imported correctly.
head(student.math)
head(student.por)
- Using the str() function, look at the structure (type and first few values) of each column in the dataframe. There should be 33 columns in each dataset. Make sure everything looks ok.
str(student.math)
## 'data.frame': 395 obs. of 33 variables:
## $ school : Factor w/ 2 levels "GP","MS": 1 1 1 1 1 1 1 1 1 1 ...
## $ sex : Factor w/ 2 levels "F","M": 1 1 1 1 1 2 2 1 2 2 ...
## $ age : int 18 17 15 15 16 16 16 17 15 15 ...
## $ address : Factor w/ 2 levels "R","U": 2 2 2 2 2 2 2 2 2 2 ...
## $ famsize : Factor w/ 2 levels "GT3","LE3": 1 1 2 1 1 2 2 1 2 1 ...
## $ Pstatus : Factor w/ 2 levels "A","T": 1 2 2 2 2 2 2 1 1 2 ...
## $ Medu : int 4 1 1 4 3 4 2 4 3 3 ...
## $ Fedu : int 4 1 1 2 3 3 2 4 2 4 ...
## $ Mjob : Factor w/ 5 levels "at_home","health",..: 1 1 1 2 3 4 3 3 4 3 ...
## $ Fjob : Factor w/ 5 levels "at_home","health",..: 5 3 3 4 3 3 3 5 3 3 ...
## $ reason : Factor w/ 4 levels "course","home",..: 1 1 3 2 2 4 2 2 2 2 ...
## $ guardian : Factor w/ 3 levels "father","mother",..: 2 1 2 2 1 2 2 2 2 2 ...
## $ traveltime: int 2 1 1 1 1 1 1 2 1 1 ...
## $ studytime : int 2 2 2 3 2 2 2 2 2 2 ...
## $ failures : int 0 0 3 0 0 0 0 0 0 0 ...
## $ schoolsup : Factor w/ 2 levels "no","yes": 2 1 2 1 1 1 1 2 1 1 ...
## $ famsup : Factor w/ 2 levels "no","yes": 1 2 1 2 2 2 1 2 2 2 ...
## $ paid : Factor w/ 2 levels "no","yes": 1 1 2 2 2 2 1 1 2 2 ...
## $ activities: Factor w/ 2 levels "no","yes": 1 1 1 2 1 2 1 1 1 2 ...
## $ nursery : Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 2 2 2 2 ...
## $ higher : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ internet : Factor w/ 2 levels "no","yes": 1 2 2 2 1 2 2 1 2 2 ...
## $ romantic : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
## $ famrel : int 4 5 4 3 4 5 4 4 4 5 ...
## $ freetime : int 3 3 3 2 3 4 4 1 2 5 ...
## $ goout : int 4 3 2 2 2 2 4 4 2 1 ...
## $ Dalc : int 1 1 2 1 1 1 1 1 1 1 ...
## $ Walc : int 1 1 3 1 2 2 1 1 1 1 ...
## $ health : int 3 3 3 5 5 5 3 1 1 5 ...
## $ absences : int 6 4 10 2 4 10 0 6 0 0 ...
## $ G1 : int 5 5 7 15 6 15 12 6 16 14 ...
## $ G2 : int 6 5 8 14 10 15 12 5 18 15 ...
## $ G3 : int 6 6 10 15 10 15 11 6 19 15 ...
str(student.por)
## 'data.frame': 649 obs. of 33 variables:
## $ school : Factor w/ 2 levels "GP","MS": 1 1 1 1 1 1 1 1 1 1 ...
## $ sex : Factor w/ 2 levels "F","M": 1 1 1 1 1 2 2 1 2 2 ...
## $ age : int 18 17 15 15 16 16 16 17 15 15 ...
## $ address : Factor w/ 2 levels "R","U": 2 2 2 2 2 2 2 2 2 2 ...
## $ famsize : Factor w/ 2 levels "GT3","LE3": 1 1 2 1 1 2 2 1 2 1 ...
## $ Pstatus : Factor w/ 2 levels "A","T": 1 2 2 2 2 2 2 1 1 2 ...
## $ Medu : int 4 1 1 4 3 4 2 4 3 3 ...
## $ Fedu : int 4 1 1 2 3 3 2 4 2 4 ...
## $ Mjob : Factor w/ 5 levels "at_home","health",..: 1 1 1 2 3 4 3 3 4 3 ...
## $ Fjob : Factor w/ 5 levels "at_home","health",..: 5 3 3 4 3 3 3 5 3 3 ...
## $ reason : Factor w/ 4 levels "course","home",..: 1 1 3 2 2 4 2 2 2 2 ...
## $ guardian : Factor w/ 3 levels "father","mother",..: 2 1 2 2 1 2 2 2 2 2 ...
## $ traveltime: int 2 1 1 1 1 1 1 2 1 1 ...
## $ studytime : int 2 2 2 3 2 2 2 2 2 2 ...
## $ failures : int 0 0 0 0 0 0 0 0 0 0 ...
## $ schoolsup : Factor w/ 2 levels "no","yes": 2 1 2 1 1 1 1 2 1 1 ...
## $ famsup : Factor w/ 2 levels "no","yes": 1 2 1 2 2 2 1 2 2 2 ...
## $ paid : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ activities: Factor w/ 2 levels "no","yes": 1 1 1 2 1 2 1 1 1 2 ...
## $ nursery : Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 2 2 2 2 ...
## $ higher : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ internet : Factor w/ 2 levels "no","yes": 1 2 2 2 1 2 2 1 2 2 ...
## $ romantic : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
## $ famrel : int 4 5 4 3 4 5 4 4 4 5 ...
## $ freetime : int 3 3 3 2 3 4 4 1 2 5 ...
## $ goout : int 4 3 2 2 2 2 4 4 2 1 ...
## $ Dalc : int 1 1 2 1 1 1 1 1 1 1 ...
## $ Walc : int 1 1 3 1 2 2 1 1 1 1 ...
## $ health : int 3 3 3 5 5 5 3 1 1 5 ...
## $ absences : int 4 2 6 0 0 6 0 2 0 0 ...
## $ G1 : int 0 9 12 14 11 12 13 10 15 12 ...
## $ G2 : int 11 11 13 14 13 12 12 13 16 12 ...
## $ G3 : int 11 11 12 14 13 13 13 13 17 13 ...
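A note on the str() output above: the Factor columns reflect how read.table() handled text columns when this output was produced. Since R 4.0.0 the default is stringsAsFactors = FALSE, so the same call returns character (chr) columns unless the argument is set explicitly. The following is a minimal sketch of the difference; the tab-delimited string `raw` and its values are made up and only stand in for the real files.

```r
# Tiny tab-delimited stand-in for the real data files (hypothetical values)
raw <- "school\tsex\tage\nGP\tF\t18\nMS\tM\t17\nGP\tM\t15"

# Text columns stay as character vectors
as.chr <- read.table(text = raw, sep = "\t", header = TRUE,
                     stringsAsFactors = FALSE)

# Text columns are converted to factors, as in the output shown above
as.fac <- read.table(text = raw, sep = "\t", header = TRUE,
                     stringsAsFactors = TRUE)

str(as.chr)
str(as.fac)
```

Either form works with lm(), which treats character predictors as factors when fitting.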
Standard Regression with lm()
One IV
- For the math data, create a regression object called lm.5 predicting first period grade (G1) based on age.
lm.5 <- lm(G1 ~ age, data = student.math)
summary(lm.5)
##
## Call:
## lm(formula = G1 ~ age, data = student.math)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.6915 -2.7749 -0.1916 2.3085 8.3085
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.6919 2.1926 6.245 1.1e-09 ***
## age -0.1667 0.1309 -1.273 0.204
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.317 on 393 degrees of freedom
## Multiple R-squared: 0.004106, Adjusted R-squared: 0.001572
## F-statistic: 1.62 on 1 and 393 DF, p-value: 0.2038
5b. Run names() and summary() on lm.5 to see additional information from your regression object. Now, return a vector of the coefficients by running lm.5$coefficients
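As a rough illustration of what these calls return, here is a minimal sketch on the built-in mtcars data, which stands in for student.math so that it runs without downloading anything. The object name `fit` is a placeholder, not part of the assignment.

```r
# mtcars ships with R; it is only a stand-in for student.math here
fit <- lm(mpg ~ wt, data = mtcars)

names(fit)                 # components stored in the lm object
fit$coefficients           # named vector: intercept and slope
coef(fit)                  # accessor function returning the same vector
summary(fit)$coefficients  # matrix of estimates, std. errors, t and p values
```

The same pattern applies to lm.5: replace `fit` with `lm.5`.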
- How do you interpret the relationship between age and first period grade?
# There is a slight negative relationship between age and first period grade (b = -0.17); however, the relationship is not significant (p = 0.20)
- For the math data, create a regression object called lm.7 predicting first period grade (G1) based on absences.
lm.7 <- lm(G1 ~ absences, data = student.math)
summary(lm.7)
##
## Call:
## lm(formula = G1 ~ absences, data = student.math)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.8794 -2.9115 0.0177 2.2363 8.0692
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.98227 0.20539 53.470 <2e-16 ***
## absences -0.01286 0.02091 -0.615 0.539
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.322 on 393 degrees of freedom
## Multiple R-squared: 0.0009612, Adjusted R-squared: -0.001581
## F-statistic: 0.3781 on 1 and 393 DF, p-value: 0.539
- How do you interpret the relationship between absences and G1?
# There is a slight negative relationship between absences and first period grade (b = -0.01); however, the relationship is not significant (p = 0.54)
- For the math data, create a regression object called lm.9 predicting each student’s period 3 grade (G3) based on their period 1 grade (G1). Look at the results of the regression analysis with summary().
lm.9 <- lm(G3 ~ G1, data = student.math)
summary(lm.9)
##
## Call:
## lm(formula = G3 ~ G1, data = student.math)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.6223 -0.8348 0.3777 1.6965 5.0153
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.65280 0.47475 -3.481 0.000555 ***
## G1 1.10626 0.04164 26.568 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.743 on 393 degrees of freedom
## Multiple R-squared: 0.6424, Adjusted R-squared: 0.6414
## F-statistic: 705.8 on 1 and 393 DF, p-value: < 2.2e-16
- What is the relationship between G1 and G3?
# There is a strong positive relationship between first period grade and third period grade (b = 1.11, p < .001)
Regression vs. Correlation
- Conduct a correlation test between G1 and G3 (Hint: use cor.test()). Compare the t-value for this test to the regression analysis you did in question 9 (lm.9). What do you see?
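For a single predictor, the t-statistic from a Pearson correlation test is the same as the t-statistic for the slope in a simple regression. A minimal sketch on simulated data (the seed, the sample size and the variable names x and y are arbitrary choices, not part of the assignment):

```r
set.seed(1)                      # arbitrary seed so the sketch is repeatable
x <- rnorm(50)
y <- 2 * x + rnorm(50)

ct  <- cor.test(x, y)            # Pearson correlation test
fit <- lm(y ~ x)                 # simple regression with one predictor

t.cor <- unname(ct$statistic)                        # t from cor.test()
t.lm  <- summary(fit)$coefficients["x", "t value"]   # t for the slope

c(cor.test = t.cor, lm = t.lm)   # the two t-values agree
```

The p-values agree as well, and the squared correlation equals the regression's Multiple R-squared.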
Adding a regression line to a scatterplot
- Create a scatterplot showing the relationship between G1 and G3 for the math data.
plot(x = student.math$G1,
y = student.math$G3)
- Add a regression line to the scatterplot from your regression object lm.9 (hint: use abline()).
plot(x = student.math$G1,
y = student.math$G3)
abline(lm.9)
Multiple IVs
- For the math data, create a regression object called lm.14 predicting third period grade (G3) based on sex, age, internet, and failures.
lm.14 <- lm(G3 ~ sex + age + internet + failures, data = student.math)
summary(lm.14)
##
## Call:
## lm(formula = G3 ~ sex + age + internet + failures, data = student.math)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.2156 -1.9523 0.0965 3.0252 9.4370
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.9962 2.9808 4.695 3.69e-06 ***
## sexM 1.0451 0.4282 2.441 0.0151 *
## age -0.2407 0.1735 -1.388 0.1660
## internetyes 0.7855 0.5761 1.364 0.1735
## failures -2.1260 0.2966 -7.167 3.86e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.237 on 390 degrees of freedom
## Multiple R-squared: 0.1533, Adjusted R-squared: 0.1446
## F-statistic: 17.65 on 4 and 390 DF, p-value: 2.488e-13
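The coefficient names sexM and internetyes in output like this come from R's default dummy (treatment) coding of factors: the first level is the reference category, and each remaining level gets a coefficient measuring its difference from the reference. A minimal sketch with made-up data (the `toy` data frame and its scores are hypothetical):

```r
# Hypothetical toy data: three female and three male students
toy <- data.frame(sex   = factor(c("F", "F", "F", "M", "M", "M")),
                  score = c(10, 12, 14, 13, 15, 17))

toy.fit <- lm(score ~ sex, data = toy)

levels(toy$sex)[1]   # "F" is the reference level
coef(toy.fit)        # intercept = mean of F; sexM = mean of M minus mean of F

# Check against the group means directly
group.means <- tapply(toy$score, toy$sex, mean)
group.means["M"] - group.means["F"]
```

With additional predictors in the model, sexM is the difference between male and female students holding the other predictors constant, rather than the raw difference in means.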
- How do you interpret the regression output? Which variables are significantly related to third period grade?
# Sex and failures significantly predict third period grade. Male students score higher than female students (b = 1.05), and the more past failures a student has, the lower their grade (b = -2.13).
Checkpoint!!!
- Create a new regression object called lm.16 using the same variables as lm.14 (where you predicted third period grade (G3) based on sex, age, internet, and failures); however, this time use the Portuguese dataset.
lm.16 <- lm(G3 ~ sex + age + internet + failures,
data = student.por)
summary(lm.16)
##
## Call:
## lm(formula = G3 ~ sex + age + internet + failures, data = student.por)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.8941 -1.8345 0.0522 1.8807 7.8041
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.61020 1.68101 6.907 1.19e-11 ***
## sexM -0.71515 0.23625 -3.027 0.002568 **
## age 0.01986 0.10031 0.198 0.843134
## internetyes 0.92639 0.27508 3.368 0.000803 ***
## failures -2.04819 0.20738 -9.877 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.936 on 644 degrees of freedom
## Multiple R-squared: 0.1794, Adjusted R-squared: 0.1743
## F-statistic: 35.19 on 4 and 644 DF, p-value: < 2.2e-16
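When the same formula is fitted to two datasets, placing the coefficient vectors side by side makes differences in sign and size easier to spot. A minimal sketch on simulated data; the helper `make.data`, the seed and the slopes are invented for illustration:

```r
set.seed(2)  # arbitrary seed

# Hypothetical helper: simulate a dataset with a given true slope
make.data <- function(n, slope) {
  x <- rnorm(n)
  data.frame(x = x, y = 1 + slope * x + rnorm(n))
}

data.a <- make.data(200, slope = 1)
data.b <- make.data(200, slope = -1)

# Same formula, different data
fit.a <- lm(y ~ x, data = data.a)
fit.b <- lm(y ~ x, data = data.b)

# One column per model, one row per coefficient
coef.table <- cbind(a = coef(fit.a), b = coef(fit.b))
round(coef.table, 2)
```

Applied to this assignment, the equivalent call would be cbind(math = coef(lm.14), por = coef(lm.16)), since both models have the same terms.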
- What are the key differences between the beta values for the Portuguese dataset (lm.16) and the math dataset (lm.14)?
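# In the Portuguese dataset, men do worse than women (b = -0.72), whereas in the math dataset they do better (b = 1.05). Internet access is also a significant positive predictor in the Portuguese data (b = 0.93), but not in the math data.
Predicting values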
- For the math dataset, create a regression object called lm.18 predicting a student’s first period grade (G1) based on all variables in the dataset (Hint: use the notation formula = y ~ . to include all variables!).
lm.18 <- lm(G1 ~ .,
data = student.math)
- Save the fitted values from the lm.18 object as a vector called lm.18.fitted (Hint: model$fitted.values).
lm.18.fitted <- lm.18$fitted.values
- For the math dataset, create a scatterplot showing the relationship between a student’s first period grade (G1) and the fitted values from the model. Does the model appear to correctly fit a student’s first period grade?
plot(x = student.math$G1,
y = lm.18.fitted)
# The model does seem to do a pretty good job
# (note that the predictors include G2 and G3, which are strongly correlated with G1)
Simulating regression analyses
- Let’s do some simulations. Run the following code to create some random data:
a <- rnorm(100, mean = 10, sd = 5)
b <- rnorm(100, mean = 30, sd = 2)
c <- rnorm(100, mean = 20, sd = 1)
d <- rnorm(100, mean = 5, sd = 2)
y <- 50 + 2 * a - 5 * b + .3 * d + rnorm(100, mean = 0, sd = 10)
Based on this code, what do you expect the estimated regression coefficients to be for the independent variables a, b, c, and d?
# They should be close to 2, -5, 0 and 0.3
- Test your prediction by running the appropriate regression analysis.
mod.21 <- lm(y ~ a + b + c + d)
summary(mod.21)
##
## Call:
## lm(formula = y ~ a + b + c + d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.3390 -6.7130 -0.1597 5.5222 28.7035
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 50.9452 25.0859 2.031 0.0451 *
## a 1.7741 0.2058 8.619 1.47e-13 ***
## b -3.8854 0.4985 -7.794 8.16e-12 ***
## c -1.4224 0.9995 -1.423 0.1580
## d -0.3855 0.5499 -0.701 0.4850
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.89 on 95 degrees of freedom
## Multiple R-squared: 0.5908, Adjusted R-squared: 0.5736
## F-statistic: 34.3 on 4 and 95 DF, p-value: < 2.2e-16
# My predictions were pretty close to the estimated coefficients
# Note that the results might change when you run it due to random variation in the data!
- Now, adjust the code so that the regression coefficients will be 3, 7, 2, and 0.
y <- a * 3 + 7 * b + 2 * c + 0 * d + rnorm(100, mean = 0, sd = 10)
- Test your adjusted code to see if it worked!
mod.22 <- lm(y ~ a + b + c + d)
summary(mod.22)
##
## Call:
## lm(formula = y ~ a + b + c + d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23.2903 -5.9644 0.3388 6.4065 26.0735
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -9.1728 26.8174 -0.342 0.7331
## a 2.9402 0.2201 13.361 <2e-16 ***
## b 7.3640 0.5329 13.818 <2e-16 ***
## c 1.9603 1.0685 1.835 0.0697 .
## d 0.1151 0.5879 0.196 0.8453
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.57 on 95 degrees of freedom
## Multiple R-squared: 0.8053, Adjusted R-squared: 0.7971
## F-statistic: 98.23 on 4 and 95 DF, p-value: < 2.2e-16
# yes the estimated coefficients are close to what I wanted them to be!
# But again, they were slightly different due to randomness in the data.
Submit!
Save and email your wpa_8_LastFirst.R file to me at nathaniel.phillips@unibas.ch. Then, go to https://goo.gl/forms/UblvQ6dvA76veEWu1 to complete the WPA submission form.