DOES FAMILY SIZE AFFECT HOW STUDY TIME RELATES TO ACADEMIC PERFORMANCE? IMPLICATIONS FOR GENDER IN MATHEMATICS
Background research shows that family heavily influences a student’s study habits and academic performance. This research has mainly focused on the quality of family relationships, but family structure might also play a role. For instance, a large family might distract students from studying, making their efforts less effective. Additionally, many school and family experiences vary by gender. For example, girls are often socialized to be more family-oriented than boys, and so might spend more time on family than on academics. The present study examines how the relationship between studying and math performance changes with the family dynamics in a student’s life, first by testing for family structure interactions and then by testing for gender interactions.
DATA AND APPLICATION
The present study uses the math performance file (student-mat.csv) from the Student Performance data set in the UC Irvine Machine Learning Repository. This data set contains details on the academic lives of adolescent students from two Portuguese schools (N = 395, median age = 17 years old, 52.7% female). The variables tested give information about time spent studying (studytime; ordinal scale of 1-4 based on study hours), final GPAmath (G3; scale of 0-20 GPA points in math classes), family size (famsize; binary: greater than three, or less than or equal to three), and gender (sex; binary: female or male) of each participant.
A few trends were found during a preliminary analysis, where each variable was tested individually with GPA as the outcome. Family size was not related to GPAmath, and on average boys earned higher GPAmath scores than girls, even though girls studied more. Additionally, it is worth noting that family size became marginally related to GPAmath once study time was also added (please see the Appendix: Preliminary Analysis). These associations brought up questions of how studying might relate to math performance in different situations. This study will first look at the relationship between GPAmath and study time, expecting a moderate, positive relationship. Then, this study will look at how family size might change this relationship. It is expected that bigger families will provide more distraction, so studying will be less effective. Finally, this study will test whether one gender is affected more by family size than the other, because it is hypothesized that girls will be more influenced due to family role socialization.
METHODS AND ANALYSIS
Linear regression will be used to test these questions because the base model, Model One, relates two numeric variables whose relationship is adequately described by a straight line (a cubic polynomial fit in the Appendix added no significant terms). Model One will use simple linear regression and a t-test to observe whether study time can predict GPAmath. Model Two will use multiple linear regression and F-tests to detect whether family size changes this relationship. Model Three will also use multiple linear regression to test whether gender will interact with family size.
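For reference, the three models can be written out explicitly. This is a sketch based on the formulas passed to lm() in the code (mod1, mod2a, mod3c); the symbols are notation introduced here, with LE3 = 1 for families of three or fewer and M = 1 for male students, matching the dummy coding R reports in the Appendix.

```latex
\begin{align*}
\text{Model One:}\quad   & \mathrm{G3}_i = \beta_0 + \beta_1\,\mathrm{studytime}_i + \varepsilon_i \\
\text{Model Two:}\quad   & \mathrm{G3}_i = \beta_0 + \beta_1\,\mathrm{studytime}_i + \beta_2\,\mathrm{LE3}_i + \varepsilon_i \\
\text{Model Three:}\quad & \mathrm{G3}_i = \beta_0 + \beta_1\,\mathrm{studytime}_i + \beta_2\,\mathrm{LE3}_i
                           + \beta_3\,\mathrm{M}_i + \beta_4\,\mathrm{LE3}_i\,\mathrm{M}_i + \varepsilon_i
\end{align*}
```

The interaction variant of Model Two (mod2b in the Appendix) adds a term of the form studytime × LE3 to the second equation; that coefficient is what tests whether family size changes the slope of study time.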
These relationships can be seen graphically:
library(tidyverse)
## -- Attaching packages --------------------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.2 v purrr 0.3.4
## v tibble 3.0.3 v dplyr 1.0.2
## v tidyr 1.1.1 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.5.0
## -- Conflicts ------------------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
d1=read.table("student-mat.csv",sep=";",header=TRUE)
#Model One
ggplot(d1, aes(studytime, G3))+
geom_point()+
geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula 'y ~ x'
mod1 <- lm(G3~studytime, d1)
summary(mod1)
##
## Call:
## lm(formula = G3 ~ studytime, data = d1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.4643 -1.8623 0.5357 3.0697 9.1377
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.3283 0.6033 15.463 <2e-16 ***
## studytime 0.5340 0.2741 1.949 0.0521 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.565 on 393 degrees of freedom
## Multiple R-squared: 0.009569, Adjusted R-squared: 0.007049
## F-statistic: 3.797 on 1 and 393 DF, p-value: 0.05206
#Model Two
ggplot(d1, aes(studytime, G3, color = famsize))+
geom_point()+
geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula 'y ~ x'
mod2a <- lm(G3~studytime+famsize, d1)
summary(mod2a)
##
## Call:
## lm(formula = G3 ~ studytime + famsize, data = d1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.1746 -2.0350 0.2949 3.2949 8.7251
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.9958 0.6301 14.276 <2e-16 ***
## studytime 0.5698 0.2740 2.079 0.0383 *
## famsizeLE3 0.8996 0.5069 1.775 0.0767 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.553 on 392 degrees of freedom
## Multiple R-squared: 0.01746, Adjusted R-squared: 0.01245
## F-statistic: 3.483 on 2 and 392 DF, p-value: 0.03165
#Model Three
ggplot(d1, aes(studytime, G3, color = sex))+
geom_point()+
facet_wrap(~famsize)+
geom_smooth(method = 'lm', se = FALSE)
## `geom_smooth()` using formula 'y ~ x'
mod3c <- lm(G3~studytime+famsize*sex, d1)
summary(mod3c)
##
## Call:
## lm(formula = G3 ~ studytime + famsize * sex, data = d1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.6042 -2.2225 0.2177 3.0123 9.3947
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.8114 0.7460 10.471 < 2e-16 ***
## studytime 0.7939 0.2852 2.784 0.00563 **
## famsizeLE3 1.3831 0.7231 1.913 0.05651 .
## sexM 1.6172 0.5591 2.893 0.00403 **
## famsizeLE3:sexM -1.1273 1.0077 -1.119 0.26394
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.515 on 390 degrees of freedom
## Multiple R-squared: 0.03855, Adjusted R-squared: 0.02869
## F-statistic: 3.91 on 4 and 390 DF, p-value: 0.003987
RESULTS
The outputs for these models show that while gender and family size might affect study time and GPAmath on their own, there are no significant interactions between them. Model One shows a marginally significant positive relationship between study time and GPAmath (p = 0.052). Model Two shows a significant relationship between studying and GPAmath (p = 0.038) and a marginally significant main effect of family size (p = 0.077): bigger families predict lower GPAs. Finally, Model Three shows that study time (p = 0.006) and gender (p = 0.004) are significant predictors and that family size is marginally significant (p = 0.057): smaller families and boys independently tended to yield a higher GPAmath. Overall, the model chosen as the best fit for this data was Model Three, which retained the most significant or marginally significant main effects and balanced fit (MSE = 20.13, close to the lowest among the candidate models in the Appendix) against the ability to generalize.
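One way to check a model-selection claim like this is to tabulate fit statistics for the candidate models side by side. The sketch below is not part of the original analysis: compare_models is a hypothetical helper, and AIC is included here as an additional complexity-penalized criterion alongside the MSE and adjusted R-squared already reported in the Appendix. In-sample MSE cannot increase when terms are added to a nested model, so it is worth reading it next to a criterion that penalizes extra coefficients.

```r
# Hypothetical helper (not part of the original analysis): tabulate fit
# statistics for a named list of lm() fits.
compare_models <- function(models) {
  data.frame(
    model  = names(models),
    n_coef = vapply(models, function(m) length(coef(m)), integer(1)),
    mse    = vapply(models, function(m) mean(residuals(m)^2), numeric(1)),
    adj_r2 = vapply(models, function(m) summary(m)$adj.r.squared, numeric(1)),
    aic    = vapply(models, AIC, numeric(1)),
    row.names = NULL
  )
}

# Usage with the Model Three candidates from the Appendix.
# Assumes student-mat.csv is in the working directory, as in the report.
if (file.exists("student-mat.csv")) {
  d1 <- read.table("student-mat.csv", sep = ";", header = TRUE)
  fits <- list(
    mod3a = lm(G3 ~ studytime + famsize + sex, d1),
    mod3b = lm(G3 ~ studytime * famsize + sex, d1),
    mod3c = lm(G3 ~ studytime + famsize * sex, d1),
    mod3d = lm(G3 ~ studytime * sex + famsize, d1),
    mod3f = lm(G3 ~ studytime * famsize * sex, d1)
  )
  print(compare_models(fits))
}
```

A lower AIC or higher adjusted R-squared for one candidate would only be suggestive here, since the candidates differ by a single non-significant interaction term.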
CONCLUSION
As expected, there was a positive relationship between GPAmath and study hours. However, gender and family size did not interact significantly in any model. This is surprising given that the graphics produced during early analysis seem to show that the relationship differs between girls in small families and girls in big families. The lack of significant results might be because of how family structure was measured: as a binary between three or fewer people and more than three people. This is an issue since a family of four might have a different effect than a family of seven, but the variable does not allow for this distinction. Though the hypotheses were unsupported, some of the relationships seen in these models are well replicated throughout research on education. The most relevant trend found in this data is that girls tended to perform worse than boys in math classes, even though they studied more. This is thought to be due to stereotype threat and socialization, both of which discourage women from pursuing mathematics programs and careers. These results support the importance of making efforts to increase female representation in STEM fields.
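To make the size of the non-significant interaction concrete, the printed Model Three (mod3c) estimates can be turned into fitted values. The following is a worked illustration at the median study time (studytime = 2); the hat notation and subscripts are introduced here, and the figures inherit the rounding of the printed coefficients.

```latex
\begin{align*}
\hat{y}_{F,\,GT3} &= 7.8114 + 0.7939(2) = 9.3992 \\
\hat{y}_{F,\,LE3} &= 9.3992 + 1.3831 = 10.7823 \\
\hat{y}_{M,\,GT3} &= 9.3992 + 1.6172 = 11.0164 \\
\hat{y}_{M,\,LE3} &= 9.3992 + 1.3831 + 1.6172 - 1.1273 = 11.2722
\end{align*}
```

Under these estimates the small-family advantage is about 1.38 points for girls but only about 0.26 points for boys (1.3831 - 1.1273). The interaction estimate of -1.1273 has a standard error of 1.0077 (p = 0.264), so this apparent difference between genders cannot be distinguished from zero in this sample.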
#Appendix
library(tidyverse)
d1=read.table("student-mat.csv",sep=";",header=TRUE)
head(d1) #Lists all the variables in the set
## school sex age address famsize Pstatus Medu Fedu Mjob Fjob reason
## 1 GP F 18 U GT3 A 4 4 at_home teacher course
## 2 GP F 17 U GT3 T 1 1 at_home other course
## 3 GP F 15 U LE3 T 1 1 at_home other other
## 4 GP F 15 U GT3 T 4 2 health services home
## 5 GP F 16 U GT3 T 3 3 other other home
## 6 GP M 16 U LE3 T 4 3 services other reputation
## guardian traveltime studytime failures schoolsup famsup paid activities
## 1 mother 2 2 0 yes no no no
## 2 father 1 2 0 no yes no no
## 3 mother 1 2 3 yes no yes no
## 4 mother 1 3 0 no yes yes yes
## 5 father 1 2 0 no yes yes no
## 6 mother 1 2 0 no yes yes yes
## nursery higher internet romantic famrel freetime goout Dalc Walc health
## 1 yes yes no no 4 3 4 1 1 3
## 2 no yes yes no 5 3 3 1 1 3
## 3 yes yes yes no 4 3 2 2 3 3
## 4 yes yes yes yes 3 2 2 1 1 5
## 5 yes yes no no 4 3 2 1 2 5
## 6 yes yes yes no 5 4 2 1 2 5
## absences G1 G2 G3
## 1 6 5 6 6
## 2 4 5 5 6
## 3 10 7 8 10
## 4 2 15 14 15
## 5 4 6 10 10
## 6 10 15 15 15
#Descriptives
str(d1) #395 participants
## 'data.frame': 395 obs. of 33 variables:
## $ school : chr "GP" "GP" "GP" "GP" ...
## $ sex : chr "F" "F" "F" "F" ...
## $ age : int 18 17 15 15 16 16 16 17 15 15 ...
## $ address : chr "U" "U" "U" "U" ...
## $ famsize : chr "GT3" "GT3" "LE3" "GT3" ...
## $ Pstatus : chr "A" "T" "T" "T" ...
## $ Medu : int 4 1 1 4 3 4 2 4 3 3 ...
## $ Fedu : int 4 1 1 2 3 3 2 4 2 4 ...
## $ Mjob : chr "at_home" "at_home" "at_home" "health" ...
## $ Fjob : chr "teacher" "other" "other" "services" ...
## $ reason : chr "course" "course" "other" "home" ...
## $ guardian : chr "mother" "father" "mother" "mother" ...
## $ traveltime: int 2 1 1 1 1 1 1 2 1 1 ...
## $ studytime : int 2 2 2 3 2 2 2 2 2 2 ...
## $ failures : int 0 0 3 0 0 0 0 0 0 0 ...
## $ schoolsup : chr "yes" "no" "yes" "no" ...
## $ famsup : chr "no" "yes" "no" "yes" ...
## $ paid : chr "no" "no" "yes" "yes" ...
## $ activities: chr "no" "no" "no" "yes" ...
## $ nursery : chr "yes" "no" "yes" "yes" ...
## $ higher : chr "yes" "yes" "yes" "yes" ...
## $ internet : chr "no" "yes" "yes" "yes" ...
## $ romantic : chr "no" "no" "no" "yes" ...
## $ famrel : int 4 5 4 3 4 5 4 4 4 5 ...
## $ freetime : int 3 3 3 2 3 4 4 1 2 5 ...
## $ goout : int 4 3 2 2 2 2 4 4 2 1 ...
## $ Dalc : int 1 1 2 1 1 1 1 1 1 1 ...
## $ Walc : int 1 1 3 1 2 2 1 1 1 1 ...
## $ health : int 3 3 3 5 5 5 3 1 1 5 ...
## $ absences : int 6 4 10 2 4 10 0 6 0 0 ...
## $ G1 : int 5 5 7 15 6 15 12 6 16 14 ...
## $ G2 : int 6 5 8 14 10 15 12 5 18 15 ...
## $ G3 : int 6 6 10 15 10 15 11 6 19 15 ...
table1<- table(d1$sex)
table1
##
## F M
## 208 187
prop.table(table1)
##
## F M
## 0.5265823 0.4734177
table2 <- table(d1$famsize)
table2
##
## GT3 LE3
## 281 114
prop.table(table2)
##
## GT3 LE3
## 0.7113924 0.2886076
summary(d1$G3)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 8.00 11.00 10.42 14.00 20.00
summary(d1$studytime)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.000 2.000 2.035 2.000 4.000
summary(d1$age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 15.0 16.0 17.0 16.7 18.0 22.0
hist(d1$G3) #looks normal so use mean=10.42
hist(d1$studytime, breaks="FD") #Use median=2, a little skewed
hist(d1$age) #mean=16.7 years old
#Preliminary Analysis
#How does family size relate to GPAmath?
d1%>%
group_by(famsize)%>%
summarize(G3)
## `summarise()` regrouping output by 'famsize' (override with `.groups` argument)
## # A tibble: 395 x 2
## # Groups: famsize [2]
## famsize G3
## <chr> <int>
## 1 GT3 6
## 2 GT3 6
## 3 GT3 15
## 4 GT3 10
## 5 GT3 6
## 6 GT3 15
## 7 GT3 9
## 8 GT3 12
## 9 GT3 11
## 10 GT3 16
## # ... with 385 more rows
aggregate(d1$G3, list(d1$famsize), mean) #avg GPAmath per famsize
## Group.1 x
## 1 GT3 10.17794
## 2 LE3 11.00000
#T-test: Is there a difference?
t.test(d1$G3~d1$famsize, mu=0, alternative="two.sided", conf.level=0.95, var.equal=FALSE) #Welch test; groups are unpaired by default
##
## Welch Two Sample t-test
##
## data: d1$G3 by d1$famsize
## t = -1.6943, df = 231.57, p-value = 0.09155
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.7780287 0.1339006
## sample estimates:
## mean in group GT3 mean in group LE3
## 10.17794 11.00000
#Plot
ggplot(d1, aes(famsize, G3))+
geom_boxplot()
#How does gender relate to GPAmath?
d1%>%
group_by(sex)%>%
summarize(G3)
## `summarise()` regrouping output by 'sex' (override with `.groups` argument)
## # A tibble: 395 x 2
## # Groups: sex [2]
## sex G3
## <chr> <int>
## 1 F 6
## 2 F 6
## 3 F 10
## 4 F 15
## 5 F 10
## 6 F 6
## 7 F 9
## 8 F 12
## 9 F 14
## 10 F 14
## # ... with 385 more rows
aggregate(d1$G3, list(d1$sex), mean) #avg GPAmath per gender
## Group.1 x
## 1 F 9.966346
## 2 M 10.914439
#T-test: Is there a difference?
t.test(d1$G3~d1$sex, mu=0, alternative="two.sided", conf.level=0.95, var.equal=FALSE) #Welch test; groups are unpaired by default
##
## Welch Two Sample t-test
##
## data: d1$G3 by d1$sex
## t = -2.0651, df = 390.57, p-value = 0.03958
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.85073226 -0.04545244
## sample estimates:
## mean in group F mean in group M
## 9.966346 10.914439
#Plot
ggplot(d1, aes(sex, G3))+
geom_boxplot()
#How does gender relate to study time?
d1%>%
group_by(sex)%>%
summarize(studytime)
## `summarise()` regrouping output by 'sex' (override with `.groups` argument)
## # A tibble: 395 x 2
## # Groups: sex [2]
## sex studytime
## <chr> <int>
## 1 F 2
## 2 F 2
## 3 F 2
## 4 F 3
## 5 F 2
## 6 F 2
## 7 F 2
## 8 F 3
## 9 F 1
## 10 F 3
## # ... with 385 more rows
aggregate(d1$studytime, list(d1$sex), mean) #avg study time per gender
## Group.1 x
## 1 F 2.278846
## 2 M 1.764706
#T-test: Is there a difference?
t.test(d1$studytime~d1$sex, mu=0, alternative="two.sided", conf.level=0.95, var.equal=FALSE) #Welch test; groups are unpaired by default
##
## Welch Two Sample t-test
##
## data: d1$studytime by d1$sex
## t = 6.3709, df = 386.7, p-value = 5.345e-10
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.3554718 0.6728087
## sample estimates:
## mean in group F mean in group M
## 2.278846 1.764706
#Plots
ggplot(d1, aes(sex, studytime))+
geom_boxplot()
#How does family size relate to study time?
d1%>%
group_by(famsize)%>%
summarize(studytime)
## `summarise()` regrouping output by 'famsize' (override with `.groups` argument)
## # A tibble: 395 x 2
## # Groups: famsize [2]
## famsize studytime
## <chr> <int>
## 1 GT3 2
## 2 GT3 2
## 3 GT3 3
## 4 GT3 2
## 5 GT3 2
## 6 GT3 2
## 7 GT3 2
## 8 GT3 3
## 9 GT3 2
## 10 GT3 3
## # ... with 385 more rows
aggregate(d1$studytime, list(d1$famsize), mean) #avg study time per famsize
## Group.1 x
## 1 GT3 2.074733
## 2 LE3 1.938596
#T-test: No difference in amount of study time
t.test(d1$studytime~d1$famsize, mu=0, alternative="two.sided", conf.level=0.95, var.equal=FALSE) #Welch test; groups are unpaired by default
##
## Welch Two Sample t-test
##
## data: d1$studytime by d1$famsize
## t = 1.5136, df = 225.69, p-value = 0.1315
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.04109249 0.31336570
## sample estimates:
## mean in group GT3 mean in group LE3
## 2.074733 1.938596
#Plots
ggplot(d1, aes(famsize, studytime))+
geom_boxplot()
#Results
#RQ1: How does study time relate to GPAmath (Model One)?
ggplot(d1, aes(studytime, G3))+
geom_point()+
geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula 'y ~ x'
#linear or polynomial?
ggplot(d1, aes(studytime, G3))+
geom_point()+
geom_smooth(se=FALSE)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : pseudoinverse used at 0.985
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : neighborhood radius 1.015
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : reciprocal condition number 0
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : There are other near singularities as well. 1
mod1a <- lm(G3~poly(studytime, 3), d1)
summary(mod1a) #Linear
##
## Call:
## lm(formula = G3 ~ poly(studytime, 3), data = d1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.400 -2.048 0.600 2.828 8.952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.4152 0.2299 45.307 <2e-16 ***
## poly(studytime, 3)1 8.8956 4.5688 1.947 0.0522 .
## poly(studytime, 3)2 1.5901 4.5688 0.348 0.7280
## poly(studytime, 3)3 -5.1517 4.5688 -1.128 0.2602
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.569 on 391 degrees of freedom
## Multiple R-squared: 0.01308, Adjusted R-squared: 0.005511
## F-statistic: 1.728 on 3 and 391 DF, p-value: 0.1607
#Model One
mod1 <- lm(G3~studytime, d1)
mod1_sum <- summary(mod1)
mod1_sum
##
## Call:
## lm(formula = G3 ~ studytime, data = d1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.4643 -1.8623 0.5357 3.0697 9.1377
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.3283 0.6033 15.463 <2e-16 ***
## studytime 0.5340 0.2741 1.949 0.0521 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.565 on 393 degrees of freedom
## Multiple R-squared: 0.009569, Adjusted R-squared: 0.007049
## F-statistic: 3.797 on 1 and 393 DF, p-value: 0.05206
mean(mod1_sum$residuals^2)#MSE=20.73614
## [1] 20.73614
anova(mod1)
## Analysis of Variance Table
##
## Response: G3
## Df Sum Sq Mean Sq F value Pr(>F)
## studytime 1 79.1 79.132 3.7968 0.05206 .
## Residuals 393 8190.8 20.842
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#Assumption check
ggplot(d1, aes(studytime, mod1_sum$residuals))+
geom_point()+
geom_hline(yintercept = 0) #Looks okay, but not ideal
#RQ2: Does family size change this relationship (Model Two)?
ggplot(d1, aes(studytime, G3))+
geom_point()+
geom_smooth(method="lm", se=FALSE)+
geom_abline(slope=0.5698, intercept=9.8954, color = "firebrick1")+ #LE3 line from Model Two: 8.9958 + 0.8996
annotate("text", x=3, y=5, label="red = LE3 (Model Two), blue = pooled fit")
## `geom_smooth()` using formula 'y ~ x'
ggplot(d1, aes(studytime, G3, color = famsize))+
geom_point()+
geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula 'y ~ x'
#Model Two: W/out interaction (BEST)
d1$famsize1 <- factor(d1$famsize, levels = c("GT3", "LE3"))
contrasts(d1$famsize1)
## LE3
## GT3 0
## LE3 1
mod2a <- lm(G3~studytime+famsize, d1)
mod2a_sum <- summary(mod2a)
mod2a_sum
##
## Call:
## lm(formula = G3 ~ studytime + famsize, data = d1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.1746 -2.0350 0.2949 3.2949 8.7251
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.9958 0.6301 14.276 <2e-16 ***
## studytime 0.5698 0.2740 2.079 0.0383 *
## famsizeLE3 0.8996 0.5069 1.775 0.0767 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.553 on 392 degrees of freedom
## Multiple R-squared: 0.01746, Adjusted R-squared: 0.01245
## F-statistic: 3.483 on 2 and 392 DF, p-value: 0.03165
mean(mod2a_sum$residuals^2)#MSE = 20.57088, significance/marginal significance
## [1] 20.57088
anova(mod2a)
## Analysis of Variance Table
##
## Response: G3
## Df Sum Sq Mean Sq F value Pr(>F)
## studytime 1 79.1 79.132 3.8176 0.05143 .
## famsize 1 65.3 65.281 3.1494 0.07673 .
## Residuals 392 8125.5 20.728
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#Model Two: W/ Interaction
mod2b <- lm(G3~studytime*famsize, d1)
mod2b_sum <- summary(mod2b)
mod2b_sum
##
## Call:
## lm(formula = G3 ~ studytime * famsize, data = d1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.3369 -2.0028 0.2263 3.1621 9.0980
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.3976 0.7124 13.192 <2e-16 ***
## studytime 0.3761 0.3175 1.185 0.237
## famsizeLE3 -0.5953 1.3385 -0.445 0.657
## studytime:famsizeLE3 0.7575 0.6278 1.207 0.228
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.55 on 391 degrees of freedom
## Multiple R-squared: 0.02111, Adjusted R-squared: 0.0136
## F-statistic: 2.81 on 3 and 391 DF, p-value: 0.03927
mean(mod2b_sum$residuals^2) #MSE= 20.49457, no significance
## [1] 20.49457
anova(mod2b)
## Analysis of Variance Table
##
## Response: G3
## Df Sum Sq Mean Sq F value Pr(>F)
## studytime 1 79.1 79.132 3.8220 0.05130 .
## famsize 1 65.3 65.281 3.1530 0.07656 .
## studytime:famsize 1 30.1 30.141 1.4558 0.22833
## Residuals 391 8095.4 20.704
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#RQ3: Does this affect one gender more than the other (Model Three)?
ggplot(d1, aes(studytime, G3, color = sex))+
geom_point()+
facet_wrap(~famsize)+
geom_smooth(method = 'lm', se = FALSE)
## `geom_smooth()` using formula 'y ~ x'
ggplot(d1, aes(sex, G3))+
geom_boxplot()+
geom_smooth(method="lm", se=FALSE)+
facet_wrap(~famsize)
## `geom_smooth()` using formula 'y ~ x'
#There are five possible models here:
d1$sex1 <- factor(d1$sex, levels = c("F", "M"))
contrasts(d1$sex1)
## M
## F 0
## M 1
#1) Model Three: No interactions
mod3a <- lm(G3~studytime+famsize+sex, d1)
mod3a_sum <- summary(mod3a)
mod3a_sum
##
## Call:
## lm(formula = G3 ~ studytime + famsize + sex, data = d1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.4414 -2.0347 0.4573 3.1631 9.2603
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.9374 0.7377 10.760 < 2e-16 ***
## studytime 0.8022 0.2852 2.813 0.00515 **
## famsizeLE3 0.8030 0.5042 1.593 0.11202
## sexM 1.2951 0.4793 2.702 0.00720 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.517 on 391 degrees of freedom
## Multiple R-squared: 0.03547, Adjusted R-squared: 0.02807
## F-statistic: 4.793 on 3 and 391 DF, p-value: 0.002723
mean(mod3a_sum$residuals^2) #MSE=20.19389, all but famsize are significant
## [1] 20.19389
anova(mod3a)
## Analysis of Variance Table
##
## Response: G3
## Df Sum Sq Mean Sq F value Pr(>F)
## studytime 1 79.1 79.132 3.8789 0.049601 *
## famsize 1 65.3 65.281 3.2000 0.074413 .
## sex 1 148.9 148.909 7.2993 0.007198 **
## Residuals 391 7976.6 20.400
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#2) Model Three: Interaction between family size and study hours
mod3b <- lm(G3~studytime*famsize+sex, d1)
mod3b_sum <- summary(mod3b)
mod3b_sum
##
## Call:
## lm(formula = G3 ~ studytime * famsize + sex, data = d1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.2427 -2.1952 0.3056 3.1286 9.0491
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.3591 0.7996 10.454 < 2e-16 ***
## studytime 0.5918 0.3243 1.825 0.06878 .
## famsizeLE3 -0.8705 1.3310 -0.654 0.51350
## sexM 1.3287 0.4795 2.771 0.00585 **
## studytime:famsizeLE3 0.8468 0.6234 1.358 0.17514
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.512 on 390 degrees of freedom
## Multiple R-squared: 0.04001, Adjusted R-squared: 0.03016
## F-statistic: 4.064 on 4 and 390 DF, p-value: 0.003066
mean(mod3b_sum$residuals^2) #MSE= 20.0988, only gender is still significant; studytime is marginal (p = 0.069)
## [1] 20.0988
anova(mod3b)
## Analysis of Variance Table
##
## Response: G3
## Df Sum Sq Mean Sq F value Pr(>F)
## studytime 1 79.1 79.132 3.8873 0.049358 *
## famsize 1 65.3 65.281 3.2069 0.074104 .
## sex 1 148.9 148.909 7.3151 0.007137 **
## studytime:famsize 1 37.6 37.560 1.8451 0.175137
## Residuals 390 7939.0 20.356
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#3) Model Three: Interaction between family size and gender(BEST)
mod3c <- lm(G3~studytime+famsize*sex, d1)
mod3c_sum <- summary(mod3c)
mod3c_sum
##
## Call:
## lm(formula = G3 ~ studytime + famsize * sex, data = d1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.6042 -2.2225 0.2177 3.0123 9.3947
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.8114 0.7460 10.471 < 2e-16 ***
## studytime 0.7939 0.2852 2.784 0.00563 **
## famsizeLE3 1.3831 0.7231 1.913 0.05651 .
## sexM 1.6172 0.5591 2.893 0.00403 **
## famsizeLE3:sexM -1.1273 1.0077 -1.119 0.26394
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.515 on 390 degrees of freedom
## Multiple R-squared: 0.03855, Adjusted R-squared: 0.02869
## F-statistic: 3.91 on 4 and 390 DF, p-value: 0.003987
mean(mod3c_sum$residuals^2) #MSE = 20.12929, interaction is not sig; studytime and sex are sig, famsize is marginal (p = 0.057)
## [1] 20.12929
anova(mod3c)
## Analysis of Variance Table
##
## Response: G3
## Df Sum Sq Mean Sq F value Pr(>F)
## studytime 1 79.1 79.132 3.8814 0.049530 *
## famsize 1 65.3 65.281 3.2020 0.074323 .
## sex 1 148.9 148.909 7.3040 0.007181 **
## famsize:sex 1 25.5 25.516 1.2516 0.263941
## Residuals 390 7951.1 20.387
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#4) Model Three: Interaction between study time and gender
mod3d <- lm(G3~studytime*sex+famsize, d1)
mod3d_sum <- summary(mod3d)
mod3d_sum
##
## Call:
## lm(formula = G3 ~ studytime * sex + famsize, data = d1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.9165 -1.8920 0.3025 3.1326 9.0071
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.3908 0.9668 8.679 <2e-16 ***
## studytime 0.6021 0.3967 1.518 0.130
## sexM 0.4603 1.2457 0.369 0.712
## famsizeLE3 0.8137 0.5047 1.612 0.108
## studytime:sexM 0.4142 0.5705 0.726 0.468
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.519 on 390 degrees of freedom
## Multiple R-squared: 0.03677, Adjusted R-squared: 0.02689
## F-statistic: 3.722 on 4 and 390 DF, p-value: 0.005489
mean(mod3d_sum$residuals^2) #MSE= 20.16663, nothing significant
## [1] 20.16663
anova(mod3d)
## Analysis of Variance Table
##
## Response: G3
## Df Sum Sq Mean Sq F value Pr(>F)
## studytime 1 79.1 79.132 3.8742 0.049740 *
## sex 1 162.4 162.437 7.9528 0.005047 **
## famsize 1 51.8 51.753 2.5338 0.112241
## studytime:sex 1 10.8 10.768 0.5272 0.468233
## Residuals 390 7965.8 20.425
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#5) Model Three: All interactions (expecting all to be insignificant)
mod3f <- lm(G3~studytime*famsize*sex, d1)
mod3f_sum <- summary(mod3f)
mod3f_sum
##
## Call:
## lm(formula = G3 ~ studytime * famsize * sex, data = d1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.2308 -2.1074 0.3407 3.0685 9.0820
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.69628 1.09663 7.930 2.37e-14 ***
## studytime 0.40723 0.45231 0.900 0.369
## famsizeLE3 -0.60836 2.25347 -0.270 0.787
## sexM 0.59547 1.45415 0.409 0.682
## studytime:famsizeLE3 0.87848 0.94467 0.930 0.353
## studytime:sexM 0.46238 0.65840 0.702 0.483
## famsizeLE3:sexM 0.08246 2.85397 0.029 0.977
## studytime:famsizeLE3:sexM -0.40658 1.33038 -0.306 0.760
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.524 on 387 degrees of freedom
## Multiple R-squared: 0.04244, Adjusted R-squared: 0.02512
## F-statistic: 2.45 on 7 and 387 DF, p-value: 0.01812
mean(mod3f_sum$residuals^2) #MSE=20.04797, no significance
## [1] 20.04797
anova(mod3f)
## Analysis of Variance Table
##
## Response: G3
## Df Sum Sq Mean Sq F value Pr(>F)
## studytime 1 79.1 79.132 3.8672 0.049953 *
## famsize 1 65.3 65.281 3.1903 0.074860 .
## sex 1 148.9 148.909 7.2772 0.007289 **
## studytime:famsize 1 37.6 37.560 1.8356 0.176261
## studytime:sex 1 8.8 8.833 0.4317 0.511552
## famsize:sex 1 9.3 9.335 0.4562 0.499811
## studytime:famsize:sex 1 1.9 1.911 0.0934 0.760064
## Residuals 387 7918.9 20.462
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#How gender might change how study time relates to GPA without family size
ggplot(d1, aes(studytime, G3, color=sex))+
geom_point()+
geom_smooth(method="lm", se=FALSE) #expecting no interaction
## `geom_smooth()` using formula 'y ~ x'
#W/out interaction
mod4a <- lm(G3 ~ studytime+sex, d1)
summary(mod4a)
##
## Call:
## lm(formula = G3 ~ studytime + sex, data = d1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.6583 -2.0980 0.2512 3.2512 9.0313
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.1885 0.7221 11.340 < 2e-16 ***
## studytime 0.7801 0.2854 2.734 0.00655 **
## sexM 1.3492 0.4791 2.816 0.00510 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.526 on 392 degrees of freedom
## Multiple R-squared: 0.02921, Adjusted R-squared: 0.02426
## F-statistic: 5.898 on 2 and 392 DF, p-value: 0.002996
mod4a_sum <- summary(mod4a)
mean(mod4a_sum$residuals^2) #MSE=20.32491, all significant
## [1] 20.32491
anova(mod4a)
## Analysis of Variance Table
##
## Response: G3
## Df Sum Sq Mean Sq F value Pr(>F)
## studytime 1 79.1 79.132 3.8638 0.050044 .
## sex 1 162.4 162.437 7.9313 0.005104 **
## Residuals 392 8028.3 20.480
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#W/ interaction
mod4b <- lm(G3~studytime*sex, d1)
mod4b_sum <- summary(mod4b)
mod4b_sum
##
## Call:
## lm(formula = G3 ~ studytime * sex, data = d1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.1054 -2.1451 0.1989 3.1989 8.8351
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.6156 0.9586 8.987 <2e-16 ***
## studytime 0.5927 0.3975 1.491 0.137
## sexM 0.5691 1.2465 0.457 0.648
## studytime:sexM 0.3874 0.5715 0.678 0.498
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.529 on 391 degrees of freedom
## Multiple R-squared: 0.03035, Adjusted R-squared: 0.02291
## F-statistic: 4.08 on 3 and 391 DF, p-value: 0.00716
mean(mod4b_sum$residuals^2) #MSE=20.30104, not significant
## [1] 20.30104
anova(mod4b)
## Analysis of Variance Table
##
## Response: G3
## Df Sum Sq Mean Sq F value Pr(>F)
## studytime 1 79.1 79.132 3.8585 0.050204 .
## sex 1 162.4 162.437 7.9204 0.005135 **
## studytime:sex 1 9.4 9.428 0.4597 0.498173
## Residuals 391 8018.9 20.509
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1