In this WPA, you will analyze data from a study on student performance in two classes: math and Portuguese. These data come from the UCI Machine Learning database at http://archive.ics.uci.edu/ml/datasets/Student+Performance#
Here is the data description (taken directly from the original website):
This data approach student achievement in secondary education of two Portuguese schools. The data attributes include student grades, demographic, social and school related features) and it was collected by using school reports and questionnaires. Two datasets are provided regarding the performance in two distinct subjects: Mathematics (mat) and Portuguese language (por). In [Cortez and Silva, 2008], the two datasets were modeled under binary/five-level classification and regression tasks. Important note: the target attribute G3 has a strong correlation with attributes G2 and G1. This occurs because G3 is the final year grade (issued at the 3rd period), while G1 and G2 correspond to the 1st and 2nd period grades. It is more difficult to predict G3 without G2 and G1, but such prediction is much more useful (see paper source for more details).
The data are located in two semi-colon (;) separated text files at http://nathanieldphillips.com/wp-content/uploads/2016/04/student-mat.csv (the math data), and http://nathanieldphillips.com/wp-content/uploads/2016/04/student-por.csv (the Portuguese data).
Here is how the first few rows of the math data should look:
head(student.math)
## school sex age address famsize Pstatus Medu Fedu Mjob Fjob
## 1 GP F 18 U GT3 A 4 4 at_home teacher
## 2 GP F 17 U GT3 T 1 1 at_home other
## 3 GP F 15 U LE3 T 1 1 at_home other
## 4 GP F 15 U GT3 T 4 2 health services
## 5 GP F 16 U GT3 T 3 3 other other
## 6 GP M 16 U LE3 T 4 3 services other
## reason guardian traveltime studytime failures schoolsup famsup paid
## 1 course mother 2 2 0 yes no no
## 2 course father 1 2 0 no yes no
## 3 other mother 1 2 3 yes no yes
## 4 home mother 1 3 0 no yes yes
## 5 home father 1 2 0 no yes yes
## 6 reputation mother 1 2 0 no yes yes
## activities nursery higher internet romantic famrel freetime goout Dalc
## 1 no yes yes no no 4 3 4 1
## 2 no no yes yes no 5 3 3 1
## 3 no yes yes yes no 4 3 2 2
## 4 yes yes yes yes yes 3 2 2 1
## 5 no yes yes no no 4 3 2 1
## 6 yes yes yes yes no 5 4 2 1
## Walc health absences G1 G2 G3
## 1 1 3 6 5 6 6
## 2 1 3 4 5 5 6
## 3 3 3 10 7 8 10
## 4 1 5 2 15 14 15
## 5 2 5 4 6 10 10
## 6 2 5 10 15 15 15
Both datafiles have 33 columns. Here they are:
1 school - student’s school (binary: ‘GP’ - Gabriel Pereira or ‘MS’ - Mousinho da Silveira)
2 sex - student’s sex (binary: ‘F’ - female or ‘M’ - male)
3 age - student’s age (numeric: from 15 to 22)
4 address - student’s home address type (binary: ‘U’ - urban or ‘R’ - rural)
5 famsize - family size (binary: ‘LE3’ - less or equal to 3 or ‘GT3’ - greater than 3)
6 Pstatus - parent’s cohabitation status (binary: ‘T’ - living together or ‘A’ - apart)
7 Medu - mother’s education (numeric: 0 - none, 1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)
8 Fedu - father’s education (numeric: 0 - none, 1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)
9 Mjob - mother’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’)
10 Fjob - father’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’)
11 reason - reason to choose this school (nominal: close to ‘home’, school ‘reputation’, ‘course’ preference or ‘other’)
12 guardian - student’s guardian (nominal: ‘mother’, ‘father’ or ‘other’)
13 traveltime - home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour)
14 studytime - weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)
15 failures - number of past class failures (numeric: n if 1<=n<3, else 4)
16 schoolsup - extra educational support (binary: yes or no)
17 famsup - family educational support (binary: yes or no)
18 paid - extra paid classes within the course subject (Math or Portuguese) (binary: yes or no)
19 activities - extra-curricular activities (binary: yes or no)
20 nursery - attended nursery school (binary: yes or no)
21 higher - wants to take higher education (binary: yes or no)
22 internet - Internet access at home (binary: yes or no)
23 romantic - with a romantic relationship (binary: yes or no)
24 famrel - quality of family relationships (numeric: from 1 - very bad to 5 - excellent)
25 freetime - free time after school (numeric: from 1 - very low to 5 - very high)
26 goout - going out with friends (numeric: from 1 - very low to 5 - very high)
27 Dalc - workday alcohol consumption (numeric: from 1 - very low to 5 - very high)
28 Walc - weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)
29 health - current health status (numeric: from 1 - very bad to 5 - very good)
30 absences - number of school absences (numeric: from 0 to 93)
31 G1 - first period grade (numeric: from 0 to 20)
32 G2 - second period grade (numeric: from 0 to 20)
33 G3 - final grade (numeric: from 0 to 20, output target)
A. Open your WPA.RProject and open a new script. Save the script with the name WPA8.R.
B. Using read.table(), load the semi-colon (;) delimited text file containing the data into R and assign them to new objects called student.math and student.por respectively.
student.math <- read.table("http://nathanieldphillips.com/wp-content/uploads/2016/04/student-mat.csv",
  sep = ";",
  header = TRUE
)
student.por <- read.table("http://nathanieldphillips.com/wp-content/uploads/2016/04/student-por.csv",
  sep = ";",
  header = TRUE
)
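The import step can be tried without the network. The following is a minimal sketch, assuming a tiny made-up three-column file held in a string (demo.txt and demo are hypothetical names, not part of the assignment). It also passes stringsAsFactors = TRUE explicitly: since R 4.0.0, read.table() keeps text columns as character by default, and summary() only prints level counts such as GP:349 when those columns are factors.

```r
# Toy semicolon-delimited text standing in for the real files (made-up rows)
demo.txt <- "school;sex;age
GP;F;18
GP;M;17
MS;F;16"

# text = lets read.table() parse a string instead of a file or URL
demo <- read.table(text = demo.txt,
                   sep = ";",
                   header = TRUE,
                   stringsAsFactors = TRUE)

dim(demo)      # number of rows, then number of columns
summary(demo)  # factor columns are summarised as level counts
```

On the real data, the same dim() call is a quick way to confirm that each file has 33 columns.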
D. Look at the first few rows of the dataframes with the head() function to make sure they were imported correctly.
head(student.math) # Looks ok
## school sex age address famsize Pstatus Medu Fedu Mjob Fjob
## 1 GP F 18 U GT3 A 4 4 at_home teacher
## 2 GP F 17 U GT3 T 1 1 at_home other
## 3 GP F 15 U LE3 T 1 1 at_home other
## 4 GP F 15 U GT3 T 4 2 health services
## 5 GP F 16 U GT3 T 3 3 other other
## 6 GP M 16 U LE3 T 4 3 services other
## reason guardian traveltime studytime failures schoolsup famsup paid
## 1 course mother 2 2 0 yes no no
## 2 course father 1 2 0 no yes no
## 3 other mother 1 2 3 yes no yes
## 4 home mother 1 3 0 no yes yes
## 5 home father 1 2 0 no yes yes
## 6 reputation mother 1 2 0 no yes yes
## activities nursery higher internet romantic famrel freetime goout Dalc
## 1 no yes yes no no 4 3 4 1
## 2 no no yes yes no 5 3 3 1
## 3 no yes yes yes no 4 3 2 2
## 4 yes yes yes yes yes 3 2 2 1
## 5 no yes yes no no 4 3 2 1
## 6 yes yes yes yes no 5 4 2 1
## Walc health absences G1 G2 G3
## 1 1 3 6 5 6 6
## 2 1 3 4 5 5 6
## 3 3 3 10 7 8 10
## 4 1 5 2 15 14 15
## 5 2 5 4 6 10 10
## 6 2 5 10 15 15 15
head(student.por) # Looks ok
## school sex age address famsize Pstatus Medu Fedu Mjob Fjob
## 1 GP F 18 U GT3 A 4 4 at_home teacher
## 2 GP F 17 U GT3 T 1 1 at_home other
## 3 GP F 15 U LE3 T 1 1 at_home other
## 4 GP F 15 U GT3 T 4 2 health services
## 5 GP F 16 U GT3 T 3 3 other other
## 6 GP M 16 U LE3 T 4 3 services other
## reason guardian traveltime studytime failures schoolsup famsup paid
## 1 course mother 2 2 0 yes no no
## 2 course father 1 2 0 no yes no
## 3 other mother 1 2 0 yes no no
## 4 home mother 1 3 0 no yes no
## 5 home father 1 2 0 no yes no
## 6 reputation mother 1 2 0 no yes no
## activities nursery higher internet romantic famrel freetime goout Dalc
## 1 no yes yes no no 4 3 4 1
## 2 no no yes yes no 5 3 3 1
## 3 no yes yes yes no 4 3 2 2
## 4 yes yes yes yes yes 3 2 2 1
## 5 no yes yes no no 4 3 2 1
## 6 yes yes yes yes no 5 4 2 1
## Walc health absences G1 G2 G3
## 1 1 3 4 0 11 11
## 2 1 3 2 9 11 11
## 3 3 3 6 12 13 12
## 4 1 5 0 14 14 14
## 5 2 5 0 11 13 13
## 6 2 5 6 12 12 13
E. Using the summary() function, look at summary statistics for each column in the dataframe. There should be 33 columns in each dataset. Make sure everything looks ok.
summary(student.math) # Looks ok
## school sex age address famsize Pstatus Medu
## GP:349 F:208 Min. :15.0 R: 88 GT3:281 A: 41 Min. :0.000
## MS: 46 M:187 1st Qu.:16.0 U:307 LE3:114 T:354 1st Qu.:2.000
## Median :17.0 Median :3.000
## Mean :16.7 Mean :2.749
## 3rd Qu.:18.0 3rd Qu.:4.000
## Max. :22.0 Max. :4.000
## Fedu Mjob Fjob reason
## Min. :0.000 at_home : 59 at_home : 20 course :145
## 1st Qu.:2.000 health : 34 health : 18 home :109
## Median :2.000 other :141 other :217 other : 36
## Mean :2.522 services:103 services:111 reputation:105
## 3rd Qu.:3.000 teacher : 58 teacher : 29
## Max. :4.000
## guardian traveltime studytime failures schoolsup
## father: 90 Min. :1.000 Min. :1.000 Min. :0.0000 no :344
## mother:273 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:0.0000 yes: 51
## other : 32 Median :1.000 Median :2.000 Median :0.0000
## Mean :1.448 Mean :2.035 Mean :0.3342
## 3rd Qu.:2.000 3rd Qu.:2.000 3rd Qu.:0.0000
## Max. :4.000 Max. :4.000 Max. :3.0000
## famsup paid activities nursery higher internet romantic
## no :153 no :214 no :194 no : 81 no : 20 no : 66 no :263
## yes:242 yes:181 yes:201 yes:314 yes:375 yes:329 yes:132
##
##
##
##
## famrel freetime goout Dalc
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:4.000 1st Qu.:3.000 1st Qu.:2.000 1st Qu.:1.000
## Median :4.000 Median :3.000 Median :3.000 Median :1.000
## Mean :3.944 Mean :3.235 Mean :3.109 Mean :1.481
## 3rd Qu.:5.000 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:2.000
## Max. :5.000 Max. :5.000 Max. :5.000 Max. :5.000
## Walc health absences G1
## Min. :1.000 Min. :1.000 Min. : 0.000 Min. : 3.00
## 1st Qu.:1.000 1st Qu.:3.000 1st Qu.: 0.000 1st Qu.: 8.00
## Median :2.000 Median :4.000 Median : 4.000 Median :11.00
## Mean :2.291 Mean :3.554 Mean : 5.709 Mean :10.91
## 3rd Qu.:3.000 3rd Qu.:5.000 3rd Qu.: 8.000 3rd Qu.:13.00
## Max. :5.000 Max. :5.000 Max. :75.000 Max. :19.00
## G2 G3
## Min. : 0.00 Min. : 0.00
## 1st Qu.: 9.00 1st Qu.: 8.00
## Median :11.00 Median :11.00
## Mean :10.71 Mean :10.42
## 3rd Qu.:13.00 3rd Qu.:14.00
## Max. :19.00 Max. :20.00
summary(student.por) # Looks ok
## school sex age address famsize Pstatus
## GP:423 F:383 Min. :15.00 R:197 GT3:457 A: 80
## MS:226 M:266 1st Qu.:16.00 U:452 LE3:192 T:569
## Median :17.00
## Mean :16.74
## 3rd Qu.:18.00
## Max. :22.00
## Medu Fedu Mjob Fjob
## Min. :0.000 Min. :0.000 at_home :135 at_home : 42
## 1st Qu.:2.000 1st Qu.:1.000 health : 48 health : 23
## Median :2.000 Median :2.000 other :258 other :367
## Mean :2.515 Mean :2.307 services:136 services:181
## 3rd Qu.:4.000 3rd Qu.:3.000 teacher : 72 teacher : 36
## Max. :4.000 Max. :4.000
## reason guardian traveltime studytime
## course :285 father:153 Min. :1.000 Min. :1.000
## home :149 mother:455 1st Qu.:1.000 1st Qu.:1.000
## other : 72 other : 41 Median :1.000 Median :2.000
## reputation:143 Mean :1.569 Mean :1.931
## 3rd Qu.:2.000 3rd Qu.:2.000
## Max. :4.000 Max. :4.000
## failures schoolsup famsup paid activities nursery
## Min. :0.0000 no :581 no :251 no :610 no :334 no :128
## 1st Qu.:0.0000 yes: 68 yes:398 yes: 39 yes:315 yes:521
## Median :0.0000
## Mean :0.2219
## 3rd Qu.:0.0000
## Max. :3.0000
## higher internet romantic famrel freetime
## no : 69 no :151 no :410 Min. :1.000 Min. :1.00
## yes:580 yes:498 yes:239 1st Qu.:4.000 1st Qu.:3.00
## Median :4.000 Median :3.00
## Mean :3.931 Mean :3.18
## 3rd Qu.:5.000 3rd Qu.:4.00
## Max. :5.000 Max. :5.00
## goout Dalc Walc health
## Min. :1.000 Min. :1.000 Min. :1.00 Min. :1.000
## 1st Qu.:2.000 1st Qu.:1.000 1st Qu.:1.00 1st Qu.:2.000
## Median :3.000 Median :1.000 Median :2.00 Median :4.000
## Mean :3.185 Mean :1.502 Mean :2.28 Mean :3.536
## 3rd Qu.:4.000 3rd Qu.:2.000 3rd Qu.:3.00 3rd Qu.:5.000
## Max. :5.000 Max. :5.000 Max. :5.00 Max. :5.000
## absences G1 G2 G3
## Min. : 0.000 Min. : 0.0 Min. : 0.00 Min. : 0.00
## 1st Qu.: 0.000 1st Qu.:10.0 1st Qu.:10.00 1st Qu.:10.00
## Median : 2.000 Median :11.0 Median :11.00 Median :12.00
## Mean : 3.659 Mean :11.4 Mean :11.57 Mean :11.91
## 3rd Qu.: 6.000 3rd Qu.:13.0 3rd Qu.:13.00 3rd Qu.:14.00
## Max. :32.000 Max. :19.0 Max. :19.00 Max. :19.00
obj.1 <- lm(G1 ~ age, data = student.math)
summary(obj.1)
##
## Call:
## lm(formula = G1 ~ age, data = student.math)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.6915 -2.7749 -0.1916 2.3085 8.3085
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.6919 2.1926 6.245 1.1e-09 ***
## age -0.1667 0.1309 -1.273 0.204
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.317 on 393 degrees of freedom
## Multiple R-squared: 0.004106, Adjusted R-squared: 0.001572
## F-statistic: 1.62 on 1 and 393 DF, p-value: 0.2038
Answer: No, there is not a significant negative relationship between age and G1 in the math data, b = -0.17, t(393) = -1.27, p = 0.204.
Answer: The coefficient for age is -0.17. This means that for every increase of one year in age, we expect a decrease of 0.17 in G1 scores.
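The "one more year of age changes the prediction by b" reading can be checked directly. Here is a small sketch on simulated data (toy, toy.mod, and the simulated effect size are assumptions for illustration, not the student data): the difference between the predictions at two adjacent ages equals the age coefficient.

```r
set.seed(1)

# Simulated stand-in data: 100 students aged 15 to 19
toy <- data.frame(age = rep(15:19, each = 20))
toy$score <- 14 - 0.2 * toy$age + rnorm(nrow(toy), sd = 3)

toy.mod <- lm(score ~ age, data = toy)

# Slope for age, and the values used in a write-up: b, t(df), p
b.age <- coef(toy.mod)["age"]
coef(summary(toy.mod))["age", ]  # estimate, std. error, t value, p value
toy.mod$df.residual              # degrees of freedom for the t test

# Predictions one year apart differ by exactly the slope
preds <- predict(toy.mod, newdata = data.frame(age = c(16, 17)))
diff(preds)
```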
obj.2 <- lm(G1 ~ age, data = student.por)
summary(obj.2)
##
## Call:
## lm(formula = G1 ~ age, data = student.por)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.9057 -1.9057 -0.0843 1.7014 8.0943
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.97725 1.46468 12.274 < 2e-16 ***
## age -0.39286 0.08724 -4.503 7.95e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.705 on 647 degrees of freedom
## Multiple R-squared: 0.03039, Adjusted R-squared: 0.02889
## F-statistic: 20.28 on 1 and 647 DF, p-value: 7.946e-06
Answer: Yes, there is a significant negative relationship between age and G1 scores in the Portuguese data, b = -0.39, t(647) = -4.50, p < .01.
Answer: The coefficient for age is -0.39. This means that for every increase of one year in age, we expect a decrease of 0.39 in G1 scores.
g1g3.mod <- lm(G3 ~ G1, data = student.math)
summary(g1g3.mod)
##
## Call:
## lm(formula = G3 ~ G1, data = student.math)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.6223 -0.8348 0.3777 1.6965 5.0153
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.65280 0.47475 -3.481 0.000555 ***
## G1 1.10626 0.04164 26.568 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.743 on 393 degrees of freedom
## Multiple R-squared: 0.6424, Adjusted R-squared: 0.6414
## F-statistic: 705.8 on 1 and 393 DF, p-value: < 2.2e-16
Yes, there is a significant relationship between G1 and G3 in math scores, b = 1.11, t(393) = 26.57, p < .01
plot(x = student.math$G1,
y = student.math$G3,
pch = 16,
col = gray(.1, .1),
xlab = "Period 1 Scores",
ylab = "Period 3 Scores",
main = "Student Math Data"
)
plot(x = student.math$G1,
y = student.math$G3,
pch = 16,
col = gray(.1, .1),
xlab = "Period 1 Scores",
ylab = "Period 3 Scores",
main = "Student Math Data"
)
abline(g1g3.mod, lty = 2)
math.mod1 <- lm(G3 ~ sex + age + internet + failures,
data = student.math
)
summary(math.mod1)
##
## Call:
## lm(formula = G3 ~ sex + age + internet + failures, data = student.math)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.2156 -1.9523 0.0965 3.0252 9.4370
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.9962 2.9808 4.695 3.69e-06 ***
## sexM 1.0451 0.4282 2.441 0.0151 *
## age -0.2407 0.1735 -1.388 0.1660
## internetyes 0.7855 0.5761 1.364 0.1735
## failures -2.1260 0.2966 -7.167 3.86e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.237 on 390 degrees of freedom
## Multiple R-squared: 0.1533, Adjusted R-squared: 0.1446
## F-statistic: 17.65 on 4 and 390 DF, p-value: 2.488e-13
Both sex (b = 1.05, t(390) = 2.44, p = 0.02) and failures (b = -2.13, t(390) = -7.17, p < .01) are significantly related to period 3 scores.
Students with internet access score 0.79 points higher on period 3 scores compared to students without internet access (however, the effect is not significant).
student.math$fitted.values <- math.mod1$fitted.values
plot(x = student.math$G3,
y = student.math$fitted.values,
pch = 16,
col = gray(.1, .2),
xlab = "True period 3 scores",
ylab = "Model Fits",
main = "Period 3 math score model fits"
)
Note: The model does a pretty terrible job of predicting scores.
13.9962 + 1.0451 + 15 * (-0.2407) + 0.7855 + 3 * (-2.126)
## [1] 5.8383
A 15-year-old male student with internet access and 3 class failures is predicted to have a period 3 math grade of 5.84.
newdata <- data.frame(sex = "M",
age = 15,
internet = "yes",
failures = 3
)
predict(object = math.mod1,
newdata = newdata)
## 1
## 5.837739
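Yep! The value from predict() matches the manual calculation, up to rounding of the printed coefficients.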
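The manual calculation is a dot product between the coefficient vector and the student's predictor values, with factor levels coded as 0/1 dummies. A short sketch using the rounded coefficients printed by summary(math.mod1) (the vector names b and x are illustrative):

```r
# Rounded coefficients copied from the summary(math.mod1) output
b <- c("(Intercept)" = 13.9962,
       sexM          = 1.0451,
       age           = -0.2407,
       internetyes   = 0.7855,
       failures      = -2.1260)

# Predictor values for the student: 1 for the intercept,
# sexM = 1 (male), age = 15, internetyes = 1 (has internet), failures = 3
x <- c("(Intercept)" = 1,
       sexM          = 1,
       age           = 15,
       internetyes   = 1,
       failures      = 3)

manual.pred <- sum(b * x)
manual.pred
```

The result differs from the predict() output only in the fourth decimal place, because predict() uses the unrounded coefficients stored in the model object.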
por.mod1 <- lm(G1 ~ sex + age + internet + failures,
data = student.por
)
summary(por.mod1)
##
## Call:
## lm(formula = G1 ~ sex + age + internet + failures, data = student.por)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.1928 -1.5912 -0.0887 1.7547 7.0679
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 14.01091 1.43685 9.751 < 2e-16 ***
## sexM -0.49747 0.20194 -2.464 0.01402 *
## age -0.15656 0.08574 -1.826 0.06833 .
## internetyes 0.73930 0.23513 3.144 0.00174 **
## failures -1.59438 0.17726 -8.995 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.509 on 644 degrees of freedom
## Multiple R-squared: 0.1697, Adjusted R-squared: 0.1645
## F-statistic: 32.91 on 4 and 644 DF, p-value: < 2.2e-16
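In the Portuguese dataset, the effect of sex is reversed: now males perform significantly worse than females (b = -0.50, p = 0.01). The direction of the age effect is not reversed: age has a negative, non-significant coefficient in both datasets (b = -0.24 for math, b = -0.16 for Portuguese). The effects of internet and failures point in the same direction in both datasets, although internet is significant only in the Portuguese data. Note that por.mod1 as fitted above uses G1 as the outcome, whereas math.mod1 uses G3, so the two sets of coefficients are not strictly comparable.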
student.por$fitted.values <- por.mod1$fitted.values
mean(abs(student.por$fitted.values - student.por$G3))
## [1] 2.196165
On average, the model fits for the Portuguese data were 2.196 away from the true data.
# Predict G3 in Math using Portuguese model
student.math$por.fitted.values <- predict(por.mod1, student.math)
# Calculate mean, absolute differences between predictions
# and actual scores:
mean(abs(student.math$por.fitted.values - student.math$G3))
## [1] 3.278772
On average, the Portuguese model fits for the Math data were 3.28 away from the true data.
# Predict G3 in Math using math model
student.math$math.fitted.values <- predict(math.mod1, student.math)
# Calculate mean, absolute differences between predictions
# and actual scores:
mean(abs(student.math$math.fitted.values - student.math$G3))
## [1] 3.216281
On average, the Math model fits for the Math data were 3.22 away from the true data (surprisingly, not much better than when we used the Portuguese model).
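The same "mean absolute difference" expression is repeated for each model, so it can be wrapped in a small helper. This is only a sketch: mae() is a hypothetical function name, and the toy vectors are made up to keep the arithmetic easy to follow.

```r
# Mean absolute error between predictions and observed values
mae <- function(predicted, observed) {
  mean(abs(predicted - observed), na.rm = TRUE)
}

# Toy check: absolute errors are 1, 0, 2, 1, so the mean is 1
toy.obs  <- c(10, 12, 15, 8)
toy.pred <- c(11, 12, 13, 9)
mae(toy.pred, toy.obs)
```

With the objects from this WPA, a call such as mae(predict(por.mod1, student.math), student.math$G3) would reproduce the comparison above.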
# Create a new variable called nursery.bin that converts
# the original "no" values to 0, and "yes" values to 1
student.por$nursery.bin <- NA
student.por$nursery.bin[student.por$nursery == "no"] <- 0
student.por$nursery.bin[student.por$nursery == "yes"] <- 1
nursery.mod <- glm(nursery.bin ~ sex + famsize + Medu + Fedu + Pstatus,
family = binomial, data = student.por)
summary(nursery.mod)
##
## Call:
## glm(formula = nursery.bin ~ sex + famsize + Medu + Fedu + Pstatus,
## family = binomial, data = student.por)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1816 0.4542 0.6098 0.7089 0.9683
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.59232 0.41529 1.426 0.15379
## sexM -0.37040 0.20625 -1.796 0.07251 .
## famsizeLE3 0.67958 0.25002 2.718 0.00657 **
## Medu 0.32209 0.11943 2.697 0.00700 **
## Fedu -0.01493 0.12260 -0.122 0.90306
## PstatusT 0.05938 0.33680 0.176 0.86006
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 644.50 on 648 degrees of freedom
## Residual deviance: 623.54 on 643 degrees of freedom
## AIC: 635.54
##
## Number of Fisher Scoring iterations: 4
Family size (b = 0.68, z = 2.72, p < .01) and mother’s education (b = 0.32, z = 2.70, p < .01) both significantly predict whether a student attended nursery school.
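Logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios, which are often easier to read. A minimal sketch using the two significant estimates copied from the output above (nursery.coefs and odds.ratios are illustrative names):

```r
# Estimates copied from summary(nursery.mod), on the log-odds scale
nursery.coefs <- c(famsizeLE3 = 0.67958,
                   Medu       = 0.32209)

# exp() converts log-odds coefficients to odds ratios
odds.ratios <- exp(nursery.coefs)
round(odds.ratios, 2)
```

Under this model, the odds of having attended nursery school are roughly twice as high for students from families of three or fewer, and rise by a factor of about 1.4 for each one-step increase in the mother's education code, holding the other predictors fixed.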
# Add model fits to data
student.por$nursery.pred <- nursery.mod$fitted.values
# Convert model fits to binary values
student.por$nursery.pred[student.por$nursery.pred > 0.5] <- 1
student.por$nursery.pred[student.por$nursery.pred <= .5] <- 0
table(student.por$nursery.bin,
student.por$nursery.pred
)
##
## 1
## 0 128
## 1 521
Looks strange, doesn’t it? In fact, the model is ALWAYS predicting values greater than 0.5, which means for each student it always predicts that they have gone to nursery school.
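The printed coefficients show why no prediction falls below 0.5. The sketch below takes the least favourable profile within the coded ranges of the predictors (male, family size GT3, Medu = 0, Fedu = 4, parents apart) and converts its log-odds to a probability with plogis(). It then rebuilds the table from the counts shown above, using a factor with both levels so the empty "predicted 0" column is visible. The object names are illustrative.

```r
# Coefficients copied from summary(nursery.mod)
b <- c(intercept  = 0.59232,
       sexM       = -0.37040,
       famsizeLE3 = 0.67958,
       Medu       = 0.32209,
       Fedu       = -0.01493,
       PstatusT   = 0.05938)

# Lowest possible log-odds: only the negative terms are switched on
lowest.logit <- b["intercept"] + b["sexM"] * 1 + b["Fedu"] * 4
lowest.prob  <- plogis(lowest.logit)  # inverse logit
lowest.prob                           # still above 0.5

# Rebuild the table from the printed counts (128 "no", 521 "yes")
observed  <- rep(c(0, 1), times = c(128, 521))
predicted <- factor(rep(1, 649), levels = c(0, 1))
conf.tab  <- table(observed, predicted)
conf.tab

# Accuracy of always predicting "yes" equals the base rate of "yes"
accuracy <- mean(observed == 1)
accuracy
```

Because about 80% of students attended nursery school, always predicting "yes" is already right about 80% of the time, so the 0.5 cutoff says little about how well the predictors separate the two groups.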
travel.por.mod <- lm(traveltime ~ ., data = student.por)
summary(travel.por.mod)
##
## Call:
## lm(formula = traveltime ~ ., data = student.por)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.3786 -0.4456 -0.1499 0.4315 2.3615
##
## Coefficients: (3 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.377955 0.514045 4.626 4.56e-06 ***
## schoolMS 0.158953 0.069197 2.297 0.02195 *
## sexM 0.137813 0.063719 2.163 0.03095 *
## age -0.026384 0.026221 -1.006 0.31471
## addressU -0.396930 0.064684 -6.136 1.52e-09 ***
## famsizeLE3 0.023153 0.062418 0.371 0.71081
## PstatusT 0.034305 0.088218 0.389 0.69751
## Medu -0.084029 0.038408 -2.188 0.02906 *
## Fedu -0.042117 0.035029 -1.202 0.22970
## Mjobhealth -0.121490 0.136904 -0.887 0.37521
## Mjobother -0.092069 0.077029 -1.195 0.23245
## Mjobservices -0.051192 0.095061 -0.539 0.59042
## Mjobteacher -0.035233 0.127908 -0.275 0.78306
## Fjobhealth 0.142938 0.191457 0.747 0.45561
## Fjobother 0.309403 0.115620 2.676 0.00765 **
## Fjobservices 0.194154 0.122141 1.590 0.11245
## Fjobteacher 0.372420 0.171266 2.175 0.03005 *
## reasonhome -0.176384 0.072188 -2.443 0.01483 *
## reasonother -0.036431 0.093847 -0.388 0.69801
## reasonreputation -0.119472 0.075846 -1.575 0.11573
## guardianmother -0.058761 0.067569 -0.870 0.38484
## guardianother 0.191378 0.135082 1.417 0.15707
## studytime 0.017430 0.035928 0.485 0.62774
## failures 0.029722 0.053997 0.550 0.58222
## schoolsupyes -0.046797 0.094036 -0.498 0.61891
## famsupyes 0.009733 0.058096 0.168 0.86701
## paidyes -0.112197 0.117532 -0.955 0.34016
## activitiesyes -0.008821 0.056871 -0.155 0.87679
## nurseryyes 0.060430 0.069013 0.876 0.38158
## higheryes 0.091211 0.099115 0.920 0.35781
## internetyes -0.146883 0.070061 -2.097 0.03645 *
## romanticyes -0.023287 0.058525 -0.398 0.69085
## famrel 0.001778 0.029686 0.060 0.95226
## freetime -0.038153 0.028559 -1.336 0.18207
## goout 0.048328 0.027282 1.771 0.07699 .
## Dalc 0.046112 0.038955 1.184 0.23698
## Walc -0.007837 0.030130 -0.260 0.79486
## health -0.030229 0.019711 -1.534 0.12565
## absences 0.003092 0.006373 0.485 0.62773
## G1 -0.007324 0.020606 -0.355 0.72240
## G2 -0.040002 0.026913 -1.486 0.13770
## G3 0.040800 0.021960 1.858 0.06367 .
## fitted.values NA NA NA NA
## nursery.bin NA NA NA NA
## nursery.pred NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6776 on 607 degrees of freedom
## Multiple R-squared: 0.2326, Adjusted R-squared: 0.1807
## F-statistic: 4.486 on 41 and 607 DF, p-value: < 2.2e-16
A student’s school, sex, address, mother’s education, father’s job, reason to choose this school, and internet access all significantly predict a student’s travel time.
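The NA rows for fitted.values, nursery.bin, and nursery.pred appear because the formula traveltime ~ . also picks up the helper columns added to student.por earlier, and each of them is an exact linear function of predictors already in the model (nursery.pred, for instance, is 1 for every student). A sketch of the same effect on simulated data, with one way to leave such columns out of a "." formula (toy, helper, full.mod, and clean.mod are hypothetical names):

```r
set.seed(42)

# Simulated data with two genuine predictors
toy <- data.frame(y  = rnorm(50),
                  x1 = rnorm(50),
                  x2 = rnorm(50))

# A helper column that is an exact linear function of x1
toy$helper <- 2 * toy$x1 + 1

# "." includes helper, whose coefficient cannot be estimated (NA)
full.mod <- lm(y ~ ., data = toy)
coef(full.mod)

# Subtracting the term from the formula removes it from the model
clean.mod <- lm(y ~ . - helper, data = toy)
coef(clean.mod)
```

The estimable coefficients are the same either way, so the NA rows do not invalidate the results; removing the helper columns only keeps the output tidy.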
student.por$travel.fv <- travel.por.mod$fitted.values
mean(abs(student.por$travel.fv - student.por$traveltime))
## [1] 0.5186681
On average, the travel model had model fits 0.519 points away from the true data.
hist(student.por$traveltime)
travel.por.pmod <- glm(traveltime ~ ., data = student.por, family = poisson)
summary(travel.por.pmod)
##
## Call:
## glm(formula = traveltime ~ ., family = poisson, data = student.por)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.0629 -0.3745 -0.1639 0.3293 1.5629
##
## Coefficients: (4 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 9.276e-01 6.074e-01 1.527 0.12671
## schoolMS 9.617e-02 7.983e-02 1.205 0.22829
## sexM 8.682e-02 7.552e-02 1.150 0.25027
## age -1.460e-02 3.107e-02 -0.470 0.63834
## addressU -2.338e-01 7.277e-02 -3.213 0.00131 **
## famsizeLE3 1.664e-02 7.358e-02 0.226 0.82112
## PstatusT 1.775e-02 1.066e-01 0.166 0.86778
## Medu -4.966e-02 4.469e-02 -1.111 0.26645
## Fedu -2.893e-02 4.158e-02 -0.696 0.48647
## Mjobhealth -9.372e-02 1.673e-01 -0.560 0.57524
## Mjobother -5.288e-02 8.589e-02 -0.616 0.53811
## Mjobservices -2.723e-02 1.103e-01 -0.247 0.80493
## Mjobteacher -2.334e-02 1.538e-01 -0.152 0.87934
## Fjobhealth 7.638e-02 2.465e-01 0.310 0.75669
## Fjobother 1.979e-01 1.402e-01 1.411 0.15811
## Fjobservices 1.253e-01 1.483e-01 0.845 0.39828
## Fjobteacher 2.487e-01 2.101e-01 1.184 0.23661
## reasonhome -1.215e-01 8.761e-02 -1.387 0.16545
## reasonother -2.342e-02 1.070e-01 -0.219 0.82674
## reasonreputation -7.492e-02 9.077e-02 -0.825 0.40915
## guardianmother -3.861e-02 7.944e-02 -0.486 0.62692
## guardianother 1.070e-01 1.531e-01 0.699 0.48460
## studytime 1.231e-02 4.236e-02 0.290 0.77144
## failures 1.167e-02 6.070e-02 0.192 0.84752
## schoolsupyes -3.388e-02 1.138e-01 -0.298 0.76592
## famsupyes 6.087e-05 6.830e-02 0.001 0.99929
## paidyes -7.513e-02 1.440e-01 -0.522 0.60181
## activitiesyes -5.093e-03 6.723e-02 -0.076 0.93961
## nurseryyes 2.900e-02 8.066e-02 0.360 0.71917
## higheryes 5.876e-02 1.121e-01 0.524 0.60011
## internetyes -8.682e-02 7.915e-02 -1.097 0.27273
## romanticyes -1.671e-02 6.894e-02 -0.242 0.80846
## famrel -8.044e-04 3.439e-02 -0.023 0.98134
## freetime -2.389e-02 3.345e-02 -0.714 0.47515
## goout 3.040e-02 3.237e-02 0.939 0.34764
## Dalc 2.593e-02 4.432e-02 0.585 0.55847
## Walc -4.554e-03 3.533e-02 -0.129 0.89743
## health -1.942e-02 2.320e-02 -0.837 0.40266
## absences 1.941e-03 7.647e-03 0.254 0.79966
## G1 -5.541e-03 2.381e-02 -0.233 0.81599
## G2 -2.501e-02 3.142e-02 -0.796 0.42603
## G3 2.486e-02 2.632e-02 0.944 0.34499
## fitted.values NA NA NA NA
## nursery.bin NA NA NA NA
## nursery.pred NA NA NA NA
## travel.fv NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 207.43 on 648 degrees of freedom
## Residual deviance: 155.02 on 607 degrees of freedom
## AIC: 1741.5
##
## Number of Fisher Scoring iterations: 4
Using Poisson regression, only a family’s address appears to significantly predict the student’s travel time. The other variables that were previously significant (school, sex, mother’s education, father’s job, reason to choose the school, and internet access) no longer appear to be significantly related.
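Poisson regression models the log of the expected outcome, so a coefficient is read as a multiplicative change after exponentiating. A minimal sketch using the address estimate copied from the output above (b.addressU and rate.ratio are illustrative names):

```r
# Estimate for addressU copied from summary(travel.por.pmod),
# printed there as -2.338e-01 (log scale)
b.addressU <- -0.2338

# exp() gives the multiplicative effect on the expected traveltime value
rate.ratio <- exp(b.addressU)
round(rate.ratio, 2)
```

Under this model, urban students have an expected traveltime value roughly 21% lower than rural students, holding the other predictors fixed. Keep in mind that traveltime is a four-level category code (1 to 4) rather than a true count, so this interpretation is approximate at best.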
student.por$travel.poisson.fv <- travel.por.pmod$fitted.values
mean(abs(student.por$traveltime - student.por$travel.poisson.fv))
## [1] 0.5206617
On average, the Poisson travel model had model fits 0.52 points away from the true data. This is actually slightly worse than the linear regression model!