library(readr)
hw3alt <- read_csv("/Users/sophia.halkitis/Desktop/R/Datasets/kagglemath.csv", col_names = TRUE)
Parsed with column specification:
cols(
.default = col_character(),
age = col_integer(),
Medu = col_integer(),
Fedu = col_integer(),
traveltime = col_integer(),
studytime = col_integer(),
failures = col_integer(),
famrel = col_integer(),
freetime = col_integer(),
goout = col_integer(),
Dalc = col_integer(),
Walc = col_integer(),
health = col_integer(),
absences = col_integer(),
G1 = col_integer(),
G2 = col_integer(),
G3 = col_integer()
)
See spec(...) for full column specifications.
head(hw3alt)
library(dplyr)
library(pander)
library(visreg)
highered <- subset(hw3alt, select = c(sex, Medu, Fedu, studytime, age, activities, higher))
head(highered)
highered$higher[highered$higher=="yes"] <- "1"
highered$higher[highered$higher=="no"] <- "0"
highered$higher <- as.integer(highered$higher)
head(highered)
highered1 <- glm(higher ~ sex, family = "binomial", data = highered)
summary(highered1)
Call:
glm(formula = higher ~ sex, family = "binomial", data = highered)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.8111 0.1971 0.1971 0.4229 0.4229
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.9318 0.5048 7.788 6.79e-15 ***
sexM -1.5628 0.5685 -2.749 0.00598 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 158.3 on 394 degrees of freedom
Residual deviance: 148.8 on 393 degrees of freedom
AIC: 152.8
Number of Fisher Scoring iterations: 6
coef(highered1)
(Intercept) sexM
3.931826 -1.562751
exp(-1.562751)
[1] 0.2095588
#exp of the log odds ratio (in coef) gives me the odds ratio: the odds of wanting higher education for males relative to the odds for females
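To make the odds ratio concrete, here is a minimal sketch that rebuilds the odds and probabilities for each sex from the two coefficients printed above. The coefficient values are copied from the coef(highered1) output; the variable names are my own and are not part of the original analysis.

```r
# Coefficients copied from the coef(highered1) output above
b0     <- 3.931826    # intercept: log odds for females (the reference level)
b_male <- -1.562751   # change in log odds for males

# Odds of wanting higher education for each sex
odds_female <- exp(b0)
odds_male   <- exp(b0 + b_male)

# The ratio of the two odds is the same thing as exp(b_male)
odds_ratio <- odds_male / odds_female

# Probabilities implied by the same coefficients
prob_female <- plogis(b0)
prob_male   <- plogis(b0 + b_male)

round(c(odds_female = odds_female, odds_male = odds_male,
        odds_ratio = odds_ratio), 3)
round(c(prob_female = prob_female, prob_male = prob_male), 3)
```

The odds ratio of about 0.21 compares the two odds; it is not itself the odds for males, and it is not a ratio of probabilities.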
highered <- mutate(highered,
Pedu = ifelse(Medu == 4, "1",
ifelse(Fedu == 4, "1", "0")))
#creates a variable that gives the value 1 if mother or father are college educated, 0 if neither are.
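The same recode can probably be written with a single logical OR instead of nested ifelse calls. The sketch below uses a small made-up data frame (toy is hypothetical, not the real dataset) and treats the value 4 as college educated, as in the code above.

```r
# Hypothetical toy data standing in for the Medu and Fedu columns
toy <- data.frame(Medu = c(4, 2, 1, 4),
                  Fedu = c(3, 4, 1, 4))

# Nested ifelse, as used in the analysis above
nested <- ifelse(toy$Medu == 4, "1",
                 ifelse(toy$Fedu == 4, "1", "0"))

# Single logical OR, stored as a factor with levels "0" and "1"
toy$Pedu <- factor(as.integer(toy$Medu == 4 | toy$Fedu == 4),
                   levels = c(0, 1))

# The two versions agree when there are no missing values
identical(nested, as.character(toy$Pedu))
```

Storing Pedu as a factor makes it explicit that glm should treat it as categorical, which is what already happens with the character version.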
head(highered)
highered2 <- glm(higher ~ sex + age + Pedu + studytime, family = "binomial", data = highered)
summary(highered2)
Call:
glm(formula = higher ~ sex + age + Pedu + studytime, family = "binomial",
data = highered)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.95220 0.07616 0.16056 0.29203 1.54148
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 11.1606 3.4208 3.263 0.00110 **
sexM -0.9694 0.6384 -1.519 0.12887
age -0.5759 0.1923 -2.995 0.00274 **
Pedu1 2.2898 1.0465 2.188 0.02866 *
studytime 1.0785 0.4565 2.363 0.01814 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 158.30 on 394 degrees of freedom
Residual deviance: 115.91 on 390 degrees of freedom
AIC: 125.91
Number of Fisher Scoring iterations: 8
highered3 <- glm(higher ~ sex*age + studytime + Pedu, family = "binomial", data = highered)
summary(highered3)
Call:
glm(formula = higher ~ sex * age + studytime + Pedu, family = "binomial",
data = highered)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.85141 0.03434 0.10004 0.30782 1.28931
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 34.7428 13.3356 2.605 0.00918 **
sexM -28.1122 13.9097 -2.021 0.04327 *
age -1.9142 0.7336 -2.609 0.00907 **
studytime 1.3284 0.5078 2.616 0.00890 **
Pedu1 2.3505 1.0501 2.238 0.02520 *
sexM:age 1.5228 0.7610 2.001 0.04539 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 158.30 on 394 degrees of freedom
Residual deviance: 110.13 on 389 degrees of freedom
AIC: 122.13
Number of Fisher Scoring iterations: 8
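With an interaction in the model, the age coefficient alone is the age slope for females (the reference level of sex), and the slope for males is the sum of the age and sexM:age coefficients. A minimal sketch of that arithmetic, using values copied from the summary(highered3) output above (variable names are my own):

```r
# Coefficients copied from summary(highered3) above
b_age     <- -1.9142   # age slope for females (reference level of sex)
b_sex_age <-  1.5228   # additional age slope for males (sexM:age)

# Age slope on the log odds scale for each sex
slope_female <- b_age
slope_male   <- b_age + b_sex_age

# Multiplicative change in the odds per additional year of age
or_per_year <- exp(c(female = slope_female, male = slope_male))

c(slope_female = slope_female, slope_male = slope_male)
round(or_per_year, 3)
```

Both slopes are negative, so the odds fall with age for both sexes; the positive interaction only makes the decline shallower for males.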
library(texreg)
screenreg(list(highered1, highered2, highered3))
================================================
                Model 1     Model 2     Model 3
------------------------------------------------
(Intercept)      3.93 ***   11.16 **    34.74 **
                (0.50)      (3.42)     (13.34)
sexM            -1.56 **    -0.97      -28.11 *
                (0.57)      (0.64)     (13.91)
age                         -0.58 **    -1.91 **
                            (0.19)      (0.73)
Pedu1                        2.29 *      2.35 *
                            (1.05)      (1.05)
studytime                    1.08 *      1.33 **
                            (0.46)      (0.51)
sexM:age                                 1.52 *
                                        (0.76)
------------------------------------------------
AIC             152.80      125.91      122.13
BIC             160.75      145.80      146.00
Log Likelihood  -74.40      -57.95      -55.06
Deviance        148.80      115.91      110.13
Num. obs.       395         395         395
================================================
*** p < 0.001, ** p < 0.01, * p < 0.05
anova(highered1, highered2, highered3, test = "Chisq")
Analysis of Deviance Table
Model 1: higher ~ sex
Model 2: higher ~ sex + age + Pedu + studytime
Model 3: higher ~ sex * age + studytime + Pedu
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1 393 148.79
2 390 115.91 3 32.889 3.399e-07 ***
3 389 110.13 1 5.779 0.01622 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
lmtest::lrtest(highered1, highered2, highered3)
Likelihood ratio test
Model 1: higher ~ sex
Model 2: higher ~ sex + age + Pedu + studytime
Model 3: higher ~ sex * age + studytime + Pedu
#Df LogLik Df Chisq Pr(>Chisq)
1 2 -74.398
2 5 -57.953 3 32.8889 3.399e-07 ***
3 6 -55.064 1 5.7793 0.01622 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
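Both anova and lmtest::lrtest give the same likelihood ratio tests. Model two fits significantly better than model one (p < 0.001), and model three fits significantly better than model two (p = 0.016), so the sequential tests favor model three.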
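The likelihood ratio test can be reproduced by hand, which may help in reading the table. The sketch below uses the log-likelihoods and parameter counts copied from the lrtest output above; each row of the result compares a model with the one before it.

```r
# Log-likelihoods and parameter counts copied from the lrtest() output above
loglik <- c(m1 = -74.398, m2 = -57.953, m3 = -55.064)
n_par  <- c(m1 = 2, m2 = 5, m3 = 6)

# LRT statistic: twice the gain in log-likelihood between consecutive nested models
chisq <- 2 * diff(loglik)

# Degrees of freedom: number of parameters added at each step
df <- diff(n_par)

# Upper-tail chi-squared p-value for each comparison
p_val <- pchisq(chisq, df = df, lower.tail = FALSE)

data.frame(comparison = c("1 vs 2", "2 vs 3"), chisq, df, p_val)
```

A small p-value in a row means the larger model in that comparison fits significantly better than the smaller one.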
AIC(highered1, highered2, highered3)
BIC(highered1, highered2, highered3)
With the AIC and BIC, lower values indicate better fit. By AIC the best model is the third (122.13 versus 125.91 for the second); by BIC it is the second, though only narrowly (145.80 versus 146.00 for the third).
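AIC and BIC disagree here because BIC charges more for each extra parameter. A minimal sketch of the two formulas, using the log-likelihoods, parameter counts, and number of observations reported in the output above (variable names are my own):

```r
# Values copied from the lrtest() and screenreg() output above
loglik <- c(m1 = -74.398, m2 = -57.953, m3 = -55.064)
k      <- c(m1 = 2, m2 = 5, m3 = 6)   # number of estimated parameters
n      <- 395                         # number of observations

# AIC penalizes each parameter by 2; BIC penalizes each by log(n)
aic <- 2 * k - 2 * loglik
bic <- log(n) * k - 2 * loglik

round(rbind(AIC = aic, BIC = bic), 2)

# Model with the lowest value under each criterion
names(which.min(aic))
names(which.min(bic))
```

With 395 observations, log(n) is close to 6, so the one extra parameter in model three costs about three times as much under BIC as under AIC.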
visreg(highered2, "sex", scale = "response")
Females are more likely than males to want to pursue higher education.
visreg(highered2, "sex", by = "age", scale = "response")
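Willingness to pursue higher education decreases as age increases. On the probability scale the drop is larger for males than for females. Because highered2 has no sex by age interaction, this difference comes from the shape of the logistic curve rather than from different age slopes.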
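The sketch below illustrates how one shared age slope on the log odds scale can still give different drops in predicted probability for males and females. The coefficients are copied from summary(highered2); the helper prob_higher, the ages 15 and 19, and the fixed values Pedu = 0 and studytime = 2 are my own assumptions for illustration and may not match the values visreg holds constant.

```r
# Coefficients copied from summary(highered2) above
b <- c(intercept = 11.1606, sexM = -0.9694, age = -0.5759,
       Pedu1 = 2.2898, studytime = 1.0785)

# Hypothetical helper: predicted probability of wanting higher education
prob_higher <- function(male, age, pedu = 0, studytime = 2) {
  eta <- b[["intercept"]] + b[["sexM"]] * male + b[["age"]] * age +
    b[["Pedu1"]] * pedu + b[["studytime"]] * studytime
  plogis(eta)  # inverse logit turns log odds into a probability
}

ages     <- c(15, 19)  # illustrative ages
p_female <- prob_higher(male = 0, age = ages)
p_male   <- prob_higher(male = 1, age = ages)

# Drop in probability between the two ages, for each sex
drop_female <- p_female[1] - p_female[2]
drop_male   <- p_male[1] - p_male[2]

round(rbind(female = p_female, male = p_male), 3)
round(c(drop_female = drop_female, drop_male = drop_male), 3)
```

Females start closer to a probability of 1, where the logistic curve is flatter, so the same change in log odds moves their probability less.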
visreg(highered2, "sex", by = "Pedu", scale = "response")
Students who had a college educated parent were more likely to want to pursue higher education than those who did not have a college educated parent.
In the present assignment we were asked to find a dataset, run several logistic regressions, and determine which model was the best fit. I chose a dataset from Kaggle that surveyed students from two schools in Portugal on their alcohol consumption and average grades. Although I did not use either of these variables in my analysis, the dataset contains several other variables of interest, so I decided to focus on how certain factors (age, sex, parental education, and study time) contribute to a student’s intention to pursue higher education.
First, I created a subset of the data containing only the variables that I was interested in. Then I recoded the yes/no variable asking students whether they wanted to pursue higher education into a binary integer to use as the dependent variable. A simple logistic regression of intention to pursue higher education on sex indicated that males were significantly less likely than females to want to pursue higher education. Exponentiating the log odds ratio gives an odds ratio of 0.21: the odds that a male student wants to pursue higher education are about 0.21 times the odds for a female student, or roughly 79% lower. Because nearly all students in the sample want higher education, this large gap in odds corresponds to a modest gap in predicted probabilities (about 0.98 for females versus 0.91 for males).
Then, I considered other variables that might contribute to this relationship and decided that age, parental education, and time spent studying would probably have an effect on intention to pursue higher education. I created a variable that consolidated mother’s and father’s education: students with at least one college-educated parent were given the value “1”, and students with no college-educated parent the value “0”. This made it easier to view the effect of higher parental education in my next regression with four IVs.
The results from the second regression indicate that age, parental education, and time spent studying are all significant predictors of a student’s intention to pursue higher education. Once these variables are taken into account, the effect of sex is no longer statistically significant (p = 0.13), suggesting that part of the difference between males and females in the first model is accounted for by the other predictors. Age was a significant contributor: older students were less likely to want to pursue higher education. Both parental education and study time were facilitators. Having a college-educated parent made students more likely to want to pursue higher education themselves and had the largest coefficient of the predictors tested, and more time spent studying also increased the likelihood of wanting higher education.
For my third regression, I used the same four independent variables but added an interaction term between age and sex. The interaction was significant (p = 0.045), indicating that age relates differently to intention to pursue higher education for males and females. For females, the log odds fall by 1.91 for each additional year of age. For males, the slope is -1.91 + 1.52 = -0.39, so males also become less likely to want higher education as they get older, but the decline is much shallower. Parental education, however, remains the strongest facilitator of a student’s intention to pursue higher education.
After I ran all three regressions, I conducted likelihood ratio tests to determine which model was the best fit. I ran the tests using both the anova and lrtest commands, and they gave me the same results: the second model (highered2) fit significantly better than the first (p < 0.001), and the third model (highered3) fit significantly better than the second (p = 0.016). Model three also had the lowest deviance and the lowest AIC, although the BIC marginally favored model two.
Lastly, I created three plots using the visreg package to visualize the results that I just interpreted, using the second model. The first plot confirms the initial finding that females are more likely than males to want to pursue higher education. The second plot shows the relationship between age and sex: willingness to pursue higher education decreases as age increases, and on the probability scale this decrease is more pronounced for males than for females. The last plot shows the relationship between parental education and sex: students who had a college-educated parent were more likely to want to pursue higher education themselves than their counterparts whose parents did not have a college education.