Math 239 Final Project

DOES FAMILY SIZE AFFECT HOW STUDY TIME RELATES TO ACADEMIC PERFORMANCE? IMPLICATIONS FOR GENDER IN MATHEMATICS

Background research shows that family heavily influences a student’s study habits and academic performance. This research mainly focuses on the quality of family relationships, but family structure might also play a role. For instance, a big family might distract students from studying, making their efforts less effective. Additionally, many school and familial experiences vary by gender. For example, girls are often socialized to be more family-oriented than boys, and so might spend more time on family than on academics. The present study aims to observe how the relationship between studying and math performance changes with the family dynamics in a student’s life, first by looking for family structure interactions and then by looking for gender interactions.

DATA AND APPLICATION

The present study uses the math performance data set from the Student Performance collection in the UC Irvine Machine Learning Repository. This data set contains details on the academic lives of adolescent students from two Portuguese schools (N = 395, median age = 17 years old, 52.7% female). The variables tested give information about time spent studying (ordinal scale 1-4 based on weekly study hours), final GPAmath (the final math grade, G3, on a scale of 0-20 points), family size (binary; greater than three, or less than or equal to three), and gender (binary; female or male) of each participant.
A few trends were found during a preliminary analysis, where each variable was tested individually. Family size was not significantly related to GPAmath (p = 0.09), and on average boys earned higher GPAmath scores than girls (p = 0.04), even though girls studied more (p < 0.001). Additionally, it is worth noting that family size became marginally related to GPAmath once study time was also added (please see the Appendix: Preliminary Analysis). These associations brought up questions of how studying might relate to math performance in different situations. This study will first look at the relationship between GPAmath and study time, expecting a moderate, positive relationship. Then, this study will look at how family size might change this relationship. It is expected that bigger families will provide more distraction, so studying will be less effective. Finally, this study will test whether one gender is affected more by family size than the other, because it is hypothesized that girls will be more influenced due to family role socialization.

METHODS AND ANALYSIS

Linear regression will be used to test these questions because the base model, Model One, consists of two numeric variables best fit linearly. Model One will use simple linear regression and a t-test to observe whether study time can predict GPAmath. Model Two will use multiple linear regression and F-tests to detect whether family size changes this relationship. Model Three will also use multiple linear regression to test whether gender interacts with family size.
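To illustrate the F-test logic described above, here is a minimal sketch that compares an additive model against an interaction model. It runs on simulated stand-in data, not on student-mat.csv, and the names sim, fit_additive, fit_interact, and f_test are hypothetical, chosen only for this illustration.

```r
# Simulated stand-in data (NOT the student-mat.csv data used in this report)
set.seed(239)
n <- 395
sim <- data.frame(
  studytime = sample(1:4, n, replace = TRUE),
  famsize   = factor(sample(c("GT3", "LE3"), n, replace = TRUE))
)
# Outcome built with a study-time slope, no true interaction, plus noise
sim$G3 <- 9 + 0.5 * sim$studytime + rnorm(n, sd = 4.5)

# Reduced model: parallel lines for the two family sizes
fit_additive <- lm(G3 ~ studytime + famsize, data = sim)
# Full model: the study-time slope is allowed to differ by family size
fit_interact <- lm(G3 ~ studytime * famsize, data = sim)

# Partial F-test: does the interaction term reduce residual error enough
# to justify the extra parameter?
f_test <- anova(fit_additive, fit_interact)
print(f_test)
```

A large p-value in the second row of this table would suggest that the slopes do not differ detectably between family sizes, which is the question Model Two asks of the real data.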

These relationships can be seen graphically:

library(tidyverse)

## -- Attaching packages --------------------------------------------------- tidyverse 1.3.0 --

## v ggplot2 3.3.2     v purrr   0.3.4
## v tibble  3.0.3     v dplyr   1.0.2
## v tidyr   1.1.1     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.5.0

## -- Conflicts ------------------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

d1=read.table("student-mat.csv",sep=";",header=TRUE)
#Model One 
ggplot(d1, aes(studytime, G3))+
      geom_point()+
      geom_smooth(method = "lm", se = FALSE)

## `geom_smooth()` using formula 'y ~ x'

 mod1 <- lm(G3~studytime, d1)
 summary(mod1)

## 
## Call:
## lm(formula = G3 ~ studytime, data = d1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.4643  -1.8623   0.5357   3.0697   9.1377 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   9.3283     0.6033  15.463   <2e-16 ***
## studytime     0.5340     0.2741   1.949   0.0521 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.565 on 393 degrees of freedom
## Multiple R-squared:  0.009569,   Adjusted R-squared:  0.007049 
## F-statistic: 3.797 on 1 and 393 DF,  p-value: 0.05206

#Model Two 
ggplot(d1, aes(studytime, G3, color = famsize))+
        geom_point()+
        geom_smooth(method = "lm", se = FALSE)

## `geom_smooth()` using formula 'y ~ x'

 mod2a <- lm(G3~studytime+famsize, d1)
 summary(mod2a)

## 
## Call:
## lm(formula = G3 ~ studytime + famsize, data = d1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.1746  -2.0350   0.2949   3.2949   8.7251 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   8.9958     0.6301  14.276   <2e-16 ***
## studytime     0.5698     0.2740   2.079   0.0383 *  
## famsizeLE3    0.8996     0.5069   1.775   0.0767 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.553 on 392 degrees of freedom
## Multiple R-squared:  0.01746,    Adjusted R-squared:  0.01245 
## F-statistic: 3.483 on 2 and 392 DF,  p-value: 0.03165

 #Model Three
 ggplot(d1, aes(studytime, G3, color = sex))+
        geom_point()+
        facet_wrap(~famsize)+
        geom_smooth(method = 'lm', se = FALSE)

## `geom_smooth()` using formula 'y ~ x'

mod3c <- lm(G3~studytime+famsize*sex, d1)
summary(mod3c)

## 
## Call:
## lm(formula = G3 ~ studytime + famsize * sex, data = d1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.6042  -2.2225   0.2177   3.0123   9.3947 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       7.8114     0.7460  10.471  < 2e-16 ***
## studytime         0.7939     0.2852   2.784  0.00563 ** 
## famsizeLE3        1.3831     0.7231   1.913  0.05651 .  
## sexM              1.6172     0.5591   2.893  0.00403 ** 
## famsizeLE3:sexM  -1.1273     1.0077  -1.119  0.26394    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.515 on 390 degrees of freedom
## Multiple R-squared:  0.03855,    Adjusted R-squared:  0.02869 
## F-statistic:  3.91 on 4 and 390 DF,  p-value: 0.003987

RESULTS

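The outputs for these models show that while gender and family size might relate to study time and GPAmath on their own, there are no significant interactions between them. Model One shows a marginally significant positive relationship between study time and GPAmath (p = 0.052). Model Two shows a significant relationship between studying and GPAmath (p = 0.038) and a marginally significant main effect of family size (p = 0.077): bigger families predict lower GPAmath. Finally, Model Three shows significant main effects of study time (p = 0.006) and gender (p = 0.004) and a marginally significant main effect of family size (p = 0.057), while the interaction between family size and gender is not significant (p = 0.264). Smaller families and boys independently tended to yield a higher GPAmath. Overall, Model Three was chosen as the model that best fit this data: it had the most significant or marginally significant main effects, and its MSE (20.13) was lower than those of Model One (20.74) and Model Two (20.57). The only models with a lower MSE (20.10 with a study time by family size interaction, 20.05 with all interactions) added no significant terms, so Model Three offered the best balance between fit and the ability to generalize.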
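As a rough check on these comparisons, the sketch below recomputes the MSE of each model and the F-test for the family size by gender interaction from numbers transcribed from the Appendix ANOVA tables. The variable names (rss, mse, ss_interaction, ms_residual) are hypothetical, and the inputs are rounded as printed in the Appendix, so the results should agree with the reported values only up to rounding.

```r
# Values transcribed from the Appendix ANOVA tables in this report
n <- 395
rss <- c(mod1 = 8190.8, mod2a = 8125.5, mod3a = 7976.6, mod3c = 7951.1)

# MSE as computed in the Appendix: mean of squared residuals = RSS / n
mse <- rss / n
print(round(mse, 2))

# Partial F-test for the famsize:sex interaction in Model Three
ss_interaction <- 25.516   # Sum Sq for famsize:sex (1 df)
ms_residual    <- 20.387   # residual Mean Sq (390 df)
f_stat  <- ss_interaction / ms_residual
p_value <- pf(f_stat, df1 = 1, df2 = 390, lower.tail = FALSE)
print(c(F = f_stat, p = p_value))
```

Because this MSE divides by n rather than by the residual degrees of freedom, it can only go down as terms are added, so it is worth reading alongside the adjusted R-squared values in the Appendix when judging which model generalizes best.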

CONCLUSION

As expected, there was a positive relationship between GPAmath and study hours. However, gender and family size did not interact significantly in any model. This is surprising given that the graphics produced during early analysis seem to show that the relationship changes between girls in small families and girls in big families. The lack of significant results might be because of how family structure was measured: as a binary between three or fewer people and more than three people. This is an issue since a family of four might have a different effect than a family of seven, but the variable does not allow for this distinction. Though the hypotheses were unsupported, some of the relationships seen in these models are well replicated throughout research on education. The most relevant trend found in this data is that girls tended to have poorer performance than boys in math classes, even though they studied more. This is thought to be due to stereotype threat and socialization, both of which discourage women from pursuing mathematics programs and careers. These results support the importance of making efforts to increase female representation in STEM fields.

#Appendix
library(tidyverse)
d1=read.table("student-mat.csv",sep=";",header=TRUE)
head(d1) #Shows the first six rows with all the variables in the set

##   school sex age address famsize Pstatus Medu Fedu     Mjob     Fjob     reason
## 1     GP   F  18       U     GT3       A    4    4  at_home  teacher     course
## 2     GP   F  17       U     GT3       T    1    1  at_home    other     course
## 3     GP   F  15       U     LE3       T    1    1  at_home    other      other
## 4     GP   F  15       U     GT3       T    4    2   health services       home
## 5     GP   F  16       U     GT3       T    3    3    other    other       home
## 6     GP   M  16       U     LE3       T    4    3 services    other reputation
##   guardian traveltime studytime failures schoolsup famsup paid activities
## 1   mother          2         2        0       yes     no   no         no
## 2   father          1         2        0        no    yes   no         no
## 3   mother          1         2        3       yes     no  yes         no
## 4   mother          1         3        0        no    yes  yes        yes
## 5   father          1         2        0        no    yes  yes         no
## 6   mother          1         2        0        no    yes  yes        yes
##   nursery higher internet romantic famrel freetime goout Dalc Walc health
## 1     yes    yes       no       no      4        3     4    1    1      3
## 2      no    yes      yes       no      5        3     3    1    1      3
## 3     yes    yes      yes       no      4        3     2    2    3      3
## 4     yes    yes      yes      yes      3        2     2    1    1      5
## 5     yes    yes       no       no      4        3     2    1    2      5
## 6     yes    yes      yes       no      5        4     2    1    2      5
##   absences G1 G2 G3
## 1        6  5  6  6
## 2        4  5  5  6
## 3       10  7  8 10
## 4        2 15 14 15
## 5        4  6 10 10
## 6       10 15 15 15

#Descriptives
str(d1) #395 participants

## 'data.frame':    395 obs. of  33 variables:
##  $ school    : chr  "GP" "GP" "GP" "GP" ...
##  $ sex       : chr  "F" "F" "F" "F" ...
##  $ age       : int  18 17 15 15 16 16 16 17 15 15 ...
##  $ address   : chr  "U" "U" "U" "U" ...
##  $ famsize   : chr  "GT3" "GT3" "LE3" "GT3" ...
##  $ Pstatus   : chr  "A" "T" "T" "T" ...
##  $ Medu      : int  4 1 1 4 3 4 2 4 3 3 ...
##  $ Fedu      : int  4 1 1 2 3 3 2 4 2 4 ...
##  $ Mjob      : chr  "at_home" "at_home" "at_home" "health" ...
##  $ Fjob      : chr  "teacher" "other" "other" "services" ...
##  $ reason    : chr  "course" "course" "other" "home" ...
##  $ guardian  : chr  "mother" "father" "mother" "mother" ...
##  $ traveltime: int  2 1 1 1 1 1 1 2 1 1 ...
##  $ studytime : int  2 2 2 3 2 2 2 2 2 2 ...
##  $ failures  : int  0 0 3 0 0 0 0 0 0 0 ...
##  $ schoolsup : chr  "yes" "no" "yes" "no" ...
##  $ famsup    : chr  "no" "yes" "no" "yes" ...
##  $ paid      : chr  "no" "no" "yes" "yes" ...
##  $ activities: chr  "no" "no" "no" "yes" ...
##  $ nursery   : chr  "yes" "no" "yes" "yes" ...
##  $ higher    : chr  "yes" "yes" "yes" "yes" ...
##  $ internet  : chr  "no" "yes" "yes" "yes" ...
##  $ romantic  : chr  "no" "no" "no" "yes" ...
##  $ famrel    : int  4 5 4 3 4 5 4 4 4 5 ...
##  $ freetime  : int  3 3 3 2 3 4 4 1 2 5 ...
##  $ goout     : int  4 3 2 2 2 2 4 4 2 1 ...
##  $ Dalc      : int  1 1 2 1 1 1 1 1 1 1 ...
##  $ Walc      : int  1 1 3 1 2 2 1 1 1 1 ...
##  $ health    : int  3 3 3 5 5 5 3 1 1 5 ...
##  $ absences  : int  6 4 10 2 4 10 0 6 0 0 ...
##  $ G1        : int  5 5 7 15 6 15 12 6 16 14 ...
##  $ G2        : int  6 5 8 14 10 15 12 5 18 15 ...
##  $ G3        : int  6 6 10 15 10 15 11 6 19 15 ...

table1<- table(d1$sex)
  table1

## 
##   F   M 
## 208 187

    prop.table(table1)

## 
##         F         M 
## 0.5265823 0.4734177

table2 <- table(d1$famsize)
  table2

## 
## GT3 LE3 
## 281 114

    prop.table(table2)

## 
##       GT3       LE3 
## 0.7113924 0.2886076

summary(d1$G3)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    8.00   11.00   10.42   14.00   20.00

summary(d1$studytime)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.000   2.000   2.035   2.000   4.000

summary(d1$age)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    15.0    16.0    17.0    16.7    18.0    22.0

hist(d1$G3) #looks normal so use mean=10.42

hist(d1$studytime, breaks="FD") #Use median=2, a little skewed

hist(d1$age) #mean=16.7 years old

#Preliminary Analysis
  #How does family size relate to GPAmath?
  d1%>%
    group_by(famsize)%>%
    summarize(G3)

## `summarise()` regrouping output by 'famsize' (override with `.groups` argument)

## # A tibble: 395 x 2
## # Groups:   famsize [2]
##    famsize    G3
##    <chr>   <int>
##  1 GT3         6
##  2 GT3         6
##  3 GT3        15
##  4 GT3        10
##  5 GT3         6
##  6 GT3        15
##  7 GT3         9
##  8 GT3        12
##  9 GT3        11
## 10 GT3        16
## # ... with 385 more rows

    aggregate(d1$G3, list(d1$famsize), mean) #avg GPAmath per famsize

##   Group.1        x
## 1     GT3 10.17794
## 2     LE3 11.00000

    #T-test: Is there a difference?
      t.test(d1$G3~d1$famsize, mu=0, alternative="two.sided", conf.level=0.95, var.equal=FALSE)

## 
##  Welch Two Sample t-test
## 
## data:  d1$G3 by d1$famsize
## t = -1.6943, df = 231.57, p-value = 0.09155
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.7780287  0.1339006
## sample estimates:
## mean in group GT3 mean in group LE3 
##          10.17794          11.00000

    #Plot
      ggplot(d1, aes(famsize, G3))+
        geom_boxplot()

  #How does gender relate to GPAmath? 
  d1%>%
    group_by(sex)%>%
    summarize(G3)

## `summarise()` regrouping output by 'sex' (override with `.groups` argument)

## # A tibble: 395 x 2
## # Groups:   sex [2]
##    sex      G3
##    <chr> <int>
##  1 F         6
##  2 F         6
##  3 F        10
##  4 F        15
##  5 F        10
##  6 F         6
##  7 F         9
##  8 F        12
##  9 F        14
## 10 F        14
## # ... with 385 more rows

    aggregate(d1$G3, list(d1$sex), mean) #avg GPAmath per gender

##   Group.1         x
## 1       F  9.966346
## 2       M 10.914439

     #T-test: Is there a difference? 
       t.test(d1$G3~d1$sex, mu=0, alternative="two.sided", conf.level=0.95, var.equal=FALSE)

## 
##  Welch Two Sample t-test
## 
## data:  d1$G3 by d1$sex
## t = -2.0651, df = 390.57, p-value = 0.03958
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.85073226 -0.04545244
## sample estimates:
## mean in group F mean in group M 
##        9.966346       10.914439

     #Plot
       ggplot(d1, aes(sex, G3))+
         geom_boxplot()

  #How does gender relate to study time? 
  d1%>%
    group_by(sex)%>%
    summarize(studytime)

## `summarise()` regrouping output by 'sex' (override with `.groups` argument)

## # A tibble: 395 x 2
## # Groups:   sex [2]
##    sex   studytime
##    <chr>     <int>
##  1 F             2
##  2 F             2
##  3 F             2
##  4 F             3
##  5 F             2
##  6 F             2
##  7 F             2
##  8 F             3
##  9 F             1
## 10 F             3
## # ... with 385 more rows

    aggregate(d1$studytime, list(d1$sex), mean) #avg study time per gender

##   Group.1        x
## 1       F 2.278846
## 2       M 1.764706

    #T-test: Is there a difference?
       t.test(d1$studytime~d1$sex, mu=0, alternative="two.sided", conf.level=0.95, var.equal=FALSE)

## 
##  Welch Two Sample t-test
## 
## data:  d1$studytime by d1$sex
## t = 6.3709, df = 386.7, p-value = 5.345e-10
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.3554718 0.6728087
## sample estimates:
## mean in group F mean in group M 
##        2.278846        1.764706

    #Plots
      ggplot(d1, aes(sex, studytime))+
        geom_boxplot()

  #How does family size relate to studytime?
  d1%>%
    group_by(famsize)%>%
    summarize(studytime)

## `summarise()` regrouping output by 'famsize' (override with `.groups` argument)

## # A tibble: 395 x 2
## # Groups:   famsize [2]
##    famsize studytime
##    <chr>       <int>
##  1 GT3             2
##  2 GT3             2
##  3 GT3             3
##  4 GT3             2
##  5 GT3             2
##  6 GT3             2
##  7 GT3             2
##  8 GT3             3
##  9 GT3             2
## 10 GT3             3
## # ... with 385 more rows

    aggregate(d1$studytime, list(d1$famsize), mean) #avg study time per famsize

##   Group.1        x
## 1     GT3 2.074733
## 2     LE3 1.938596

    #T-test: No difference in amount of study time
       t.test(d1$studytime~d1$famsize, mu=0, alternative="two.sided", conf.level=0.95, var.equal=FALSE)

## 
##  Welch Two Sample t-test
## 
## data:  d1$studytime by d1$famsize
## t = 1.5136, df = 225.69, p-value = 0.1315
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.04109249  0.31336570
## sample estimates:
## mean in group GT3 mean in group LE3 
##          2.074733          1.938596

    #Plots
      ggplot(d1, aes(famsize, studytime))+
        geom_boxplot()

#Results
  #RQ1: How does study time relate to GPAmath (Model One)?
    ggplot(d1, aes(studytime, G3))+
      geom_point()+
      geom_smooth(method = "lm", se = FALSE)

## `geom_smooth()` using formula 'y ~ x'

    #linear or polynomial?
      ggplot(d1, aes(studytime, G3))+
        geom_point()+
        geom_smooth(se=FALSE)

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : pseudoinverse used at 0.985

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : neighborhood radius 1.015

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : reciprocal condition number 0

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : There are other near singularities as well. 1

      mod1a <- lm(G3~poly(studytime, 3), d1)
      summary(mod1a) #Linear

## 
## Call:
## lm(formula = G3 ~ poly(studytime, 3), data = d1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -11.400  -2.048   0.600   2.828   8.952 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          10.4152     0.2299  45.307   <2e-16 ***
## poly(studytime, 3)1   8.8956     4.5688   1.947   0.0522 .  
## poly(studytime, 3)2   1.5901     4.5688   0.348   0.7280    
## poly(studytime, 3)3  -5.1517     4.5688  -1.128   0.2602    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.569 on 391 degrees of freedom
## Multiple R-squared:  0.01308,    Adjusted R-squared:  0.005511 
## F-statistic: 1.728 on 3 and 391 DF,  p-value: 0.1607

    #Model One
    mod1 <- lm(G3~studytime, d1)
    mod1_sum <- summary(mod1)
    mod1_sum

## 
## Call:
## lm(formula = G3 ~ studytime, data = d1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.4643  -1.8623   0.5357   3.0697   9.1377 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   9.3283     0.6033  15.463   <2e-16 ***
## studytime     0.5340     0.2741   1.949   0.0521 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.565 on 393 degrees of freedom
## Multiple R-squared:  0.009569,   Adjusted R-squared:  0.007049 
## F-statistic: 3.797 on 1 and 393 DF,  p-value: 0.05206

    mean(mod1_sum$residuals^2)#MSE=20.73614

## [1] 20.73614

    anova(mod1)

## Analysis of Variance Table
## 
## Response: G3
##            Df Sum Sq Mean Sq F value  Pr(>F)  
## studytime   1   79.1  79.132  3.7968 0.05206 .
## Residuals 393 8190.8  20.842                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

      #Assumption check 
        ggplot(d1, aes(studytime, mod1_sum$residuals))+
          geom_point()+
          geom_hline(yintercept = 0) #Looks okay, but not ideal

  #RQ2: Does family size change this relationship (Model Two)? 
      ggplot(d1, aes(studytime, G3))+
        geom_point()+
        geom_smooth(method="lm", se=FALSE)+
        geom_abline(slope=0.5698, intercept=9.8954, color = "firebrick1")+ #Model Two line for LE3: intercept = 8.9958 + 0.8996
        annotate("text", x=3, y=5, label="red = LE3 (Model Two), blue = pooled fit")

## `geom_smooth()` using formula 'y ~ x'

      ggplot(d1, aes(studytime, G3, color = famsize))+
        geom_point()+
        geom_smooth(method = "lm", se = FALSE)

## `geom_smooth()` using formula 'y ~ x'

    #Model Two: W/out interaction (BEST)
    d1$famsize1 <- factor(d1$famsize, levels = c("GT3", "LE3"))
    contrasts(d1$famsize1)

##     LE3
## GT3   0
## LE3   1

    mod2a <- lm(G3~studytime+famsize, d1)
    mod2a_sum <- summary(mod2a)
    mod2a_sum

## 
## Call:
## lm(formula = G3 ~ studytime + famsize, data = d1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.1746  -2.0350   0.2949   3.2949   8.7251 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   8.9958     0.6301  14.276   <2e-16 ***
## studytime     0.5698     0.2740   2.079   0.0383 *  
## famsizeLE3    0.8996     0.5069   1.775   0.0767 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.553 on 392 degrees of freedom
## Multiple R-squared:  0.01746,    Adjusted R-squared:  0.01245 
## F-statistic: 3.483 on 2 and 392 DF,  p-value: 0.03165

    mean(mod2a_sum$residuals^2)#MSE = 20.57088, significance/marginal significance

## [1] 20.57088

    anova(mod2a)

## Analysis of Variance Table
## 
## Response: G3
##            Df Sum Sq Mean Sq F value  Pr(>F)  
## studytime   1   79.1  79.132  3.8176 0.05143 .
## famsize     1   65.3  65.281  3.1494 0.07673 .
## Residuals 392 8125.5  20.728                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    #Model Two: W/ Interaction
    mod2b <- lm(G3~studytime*famsize, d1)
    mod2b_sum <- summary(mod2b)
    mod2b_sum

## 
## Call:
## lm(formula = G3 ~ studytime * famsize, data = d1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.3369  -2.0028   0.2263   3.1621   9.0980 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            9.3976     0.7124  13.192   <2e-16 ***
## studytime              0.3761     0.3175   1.185    0.237    
## famsizeLE3            -0.5953     1.3385  -0.445    0.657    
## studytime:famsizeLE3   0.7575     0.6278   1.207    0.228    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.55 on 391 degrees of freedom
## Multiple R-squared:  0.02111,    Adjusted R-squared:  0.0136 
## F-statistic:  2.81 on 3 and 391 DF,  p-value: 0.03927

    mean(mod2b_sum$residuals^2) #MSE= 20.49457, no significance

## [1] 20.49457

    anova(mod2b)

## Analysis of Variance Table
## 
## Response: G3
##                    Df Sum Sq Mean Sq F value  Pr(>F)  
## studytime           1   79.1  79.132  3.8220 0.05130 .
## famsize             1   65.3  65.281  3.1530 0.07656 .
## studytime:famsize   1   30.1  30.141  1.4558 0.22833  
## Residuals         391 8095.4  20.704                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

  #RQ3: Does this affect one gender more than the other (Model Three)? 
      ggplot(d1, aes(studytime, G3, color = sex))+
        geom_point()+
        facet_wrap(~famsize)+
        geom_smooth(method = 'lm', se = FALSE)

## `geom_smooth()` using formula 'y ~ x'

      ggplot(d1, aes(sex, G3))+
        geom_boxplot()+
        geom_smooth(method="lm", se=FALSE)+
        facet_wrap(~famsize)

## `geom_smooth()` using formula 'y ~ x'

    #There are five possible models here: 
      d1$sex1 <- factor(d1$sex, levels = c("F", "M"))
      contrasts(d1$sex1)

##   M
## F 0
## M 1

      #1) Model Three: No interactions
        mod3a <- lm(G3~studytime+famsize+sex, d1)
        mod3a_sum <- summary(mod3a)
        mod3a_sum

## 
## Call:
## lm(formula = G3 ~ studytime + famsize + sex, data = d1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.4414  -2.0347   0.4573   3.1631   9.2603 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   7.9374     0.7377  10.760  < 2e-16 ***
## studytime     0.8022     0.2852   2.813  0.00515 ** 
## famsizeLE3    0.8030     0.5042   1.593  0.11202    
## sexM          1.2951     0.4793   2.702  0.00720 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.517 on 391 degrees of freedom
## Multiple R-squared:  0.03547,    Adjusted R-squared:  0.02807 
## F-statistic: 4.793 on 3 and 391 DF,  p-value: 0.002723

        mean(mod3a_sum$residuals^2) #MSE=20.19389, all but famsize are significant

## [1] 20.19389

        anova(mod3a)

## Analysis of Variance Table
## 
## Response: G3
##            Df Sum Sq Mean Sq F value   Pr(>F)   
## studytime   1   79.1  79.132  3.8789 0.049601 * 
## famsize     1   65.3  65.281  3.2000 0.074413 . 
## sex         1  148.9 148.909  7.2993 0.007198 **
## Residuals 391 7976.6  20.400                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

      #2) Model Three: Interaction between family size and study hours
        mod3b <- lm(G3~studytime*famsize+sex, d1)
        mod3b_sum <- summary(mod3b)
        mod3b_sum

## 
## Call:
## lm(formula = G3 ~ studytime * famsize + sex, data = d1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.2427  -2.1952   0.3056   3.1286   9.0491 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            8.3591     0.7996  10.454  < 2e-16 ***
## studytime              0.5918     0.3243   1.825  0.06878 .  
## famsizeLE3            -0.8705     1.3310  -0.654  0.51350    
## sexM                   1.3287     0.4795   2.771  0.00585 ** 
## studytime:famsizeLE3   0.8468     0.6234   1.358  0.17514    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.512 on 390 degrees of freedom
## Multiple R-squared:  0.04001,    Adjusted R-squared:  0.03016 
## F-statistic: 4.064 on 4 and 390 DF,  p-value: 0.003066

        mean(mod3b_sum$residuals^2) #MSE= 20.0988, only gender is still significant; studytime is marginal

## [1] 20.0988

        anova(mod3b)

## Analysis of Variance Table
## 
## Response: G3
##                    Df Sum Sq Mean Sq F value   Pr(>F)   
## studytime           1   79.1  79.132  3.8873 0.049358 * 
## famsize             1   65.3  65.281  3.2069 0.074104 . 
## sex                 1  148.9 148.909  7.3151 0.007137 **
## studytime:famsize   1   37.6  37.560  1.8451 0.175137   
## Residuals         390 7939.0  20.356                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

      #3) Model Three: Interaction between family size and gender(BEST) 
        mod3c <- lm(G3~studytime+famsize*sex, d1)
        mod3c_sum <- summary(mod3c)
        mod3c_sum

## 
## Call:
## lm(formula = G3 ~ studytime + famsize * sex, data = d1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.6042  -2.2225   0.2177   3.0123   9.3947 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       7.8114     0.7460  10.471  < 2e-16 ***
## studytime         0.7939     0.2852   2.784  0.00563 ** 
## famsizeLE3        1.3831     0.7231   1.913  0.05651 .  
## sexM              1.6172     0.5591   2.893  0.00403 ** 
## famsizeLE3:sexM  -1.1273     1.0077  -1.119  0.26394    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.515 on 390 degrees of freedom
## Multiple R-squared:  0.03855,    Adjusted R-squared:  0.02869 
## F-statistic:  3.91 on 4 and 390 DF,  p-value: 0.003987

        mean(mod3c_sum$residuals^2) #MSE = 20.12929, interaction is not sig; studytime and sex are sig, famsize is marginal

## [1] 20.12929

        anova(mod3c)

## Analysis of Variance Table
## 
## Response: G3
##              Df Sum Sq Mean Sq F value   Pr(>F)   
## studytime     1   79.1  79.132  3.8814 0.049530 * 
## famsize       1   65.3  65.281  3.2020 0.074323 . 
## sex           1  148.9 148.909  7.3040 0.007181 **
## famsize:sex   1   25.5  25.516  1.2516 0.263941   
## Residuals   390 7951.1  20.387                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

      #4) Model Three: Interaction between study time and gender 
        mod3d <- lm(G3~studytime*sex+famsize, d1)
        mod3d_sum <- summary(mod3d)
        mod3d_sum

## 
## Call:
## lm(formula = G3 ~ studytime * sex + famsize, data = d1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.9165  -1.8920   0.3025   3.1326   9.0071 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      8.3908     0.9668   8.679   <2e-16 ***
## studytime        0.6021     0.3967   1.518    0.130    
## sexM             0.4603     1.2457   0.369    0.712    
## famsizeLE3       0.8137     0.5047   1.612    0.108    
## studytime:sexM   0.4142     0.5705   0.726    0.468    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.519 on 390 degrees of freedom
## Multiple R-squared:  0.03677,    Adjusted R-squared:  0.02689 
## F-statistic: 3.722 on 4 and 390 DF,  p-value: 0.005489

        mean(mod3d_sum$residuals^2) #MSE= 20.16663, nothing significant

## [1] 20.16663

        anova(mod3d)

## Analysis of Variance Table
## 
## Response: G3
##                Df Sum Sq Mean Sq F value   Pr(>F)   
## studytime       1   79.1  79.132  3.8742 0.049740 * 
## sex             1  162.4 162.437  7.9528 0.005047 **
## famsize         1   51.8  51.753  2.5338 0.112241   
## studytime:sex   1   10.8  10.768  0.5272 0.468233   
## Residuals     390 7965.8  20.425                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

      #5) Model Three: All interactions (expecting all to be insignificant)
        mod3f <- lm(G3~studytime*famsize*sex, d1)
        mod3f_sum <- summary(mod3f)
        mod3f_sum

## 
## Call:
## lm(formula = G3 ~ studytime * famsize * sex, data = d1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.2308  -2.1074   0.3407   3.0685   9.0820 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                8.69628    1.09663   7.930 2.37e-14 ***
## studytime                  0.40723    0.45231   0.900    0.369    
## famsizeLE3                -0.60836    2.25347  -0.270    0.787    
## sexM                       0.59547    1.45415   0.409    0.682    
## studytime:famsizeLE3       0.87848    0.94467   0.930    0.353    
## studytime:sexM             0.46238    0.65840   0.702    0.483    
## famsizeLE3:sexM            0.08246    2.85397   0.029    0.977    
## studytime:famsizeLE3:sexM -0.40658    1.33038  -0.306    0.760    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.524 on 387 degrees of freedom
## Multiple R-squared:  0.04244,    Adjusted R-squared:  0.02512 
## F-statistic:  2.45 on 7 and 387 DF,  p-value: 0.01812

        mean(mod3f_sum$residuals^2) #MSE=20.04797, no significance

## [1] 20.04797

        anova(mod3f)

## Analysis of Variance Table
## 
## Response: G3
##                        Df Sum Sq Mean Sq F value   Pr(>F)   
## studytime               1   79.1  79.132  3.8672 0.049953 * 
## famsize                 1   65.3  65.281  3.1903 0.074860 . 
## sex                     1  148.9 148.909  7.2772 0.007289 **
## studytime:famsize       1   37.6  37.560  1.8356 0.176261   
## studytime:sex           1    8.8   8.833  0.4317 0.511552   
## famsize:sex             1    9.3   9.335  0.4562 0.499811   
## studytime:famsize:sex   1    1.9   1.911  0.0934 0.760064   
## Residuals             387 7918.9  20.462                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

          #How gender might change how study time relates to GPA without family size
          ggplot(d1, aes(studytime, G3, color=sex))+
            geom_point()+
            geom_smooth(method="lm", se=FALSE) #expecting no interaction

## `geom_smooth()` using formula 'y ~ x'

            #W/out interaction
            mod4a <- lm(G3 ~ studytime+sex, d1)
            summary(mod4a)

## 
## Call:
## lm(formula = G3 ~ studytime + sex, data = d1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.6583  -2.0980   0.2512   3.2512   9.0313 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   8.1885     0.7221  11.340  < 2e-16 ***
## studytime     0.7801     0.2854   2.734  0.00655 ** 
## sexM          1.3492     0.4791   2.816  0.00510 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.526 on 392 degrees of freedom
## Multiple R-squared:  0.02921,    Adjusted R-squared:  0.02426 
## F-statistic: 5.898 on 2 and 392 DF,  p-value: 0.002996

            mod4a_sum <- summary(mod4a)
            mean(mod4a_sum$residuals^2) #MSE=20.32491, all significant

## [1] 20.32491

            anova(mod4a)

## Analysis of Variance Table
## 
## Response: G3
##            Df Sum Sq Mean Sq F value   Pr(>F)   
## studytime   1   79.1  79.132  3.8638 0.050044 . 
## sex         1  162.4 162.437  7.9313 0.005104 **
## Residuals 392 8028.3  20.480                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

            #W/ interaction
            mod4b <- lm(G3~studytime*sex, d1)
            mod4b_sum <- summary(mod4b)
            mod4b_sum

## 
## Call:
## lm(formula = G3 ~ studytime * sex, data = d1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.1054  -2.1451   0.1989   3.1989   8.8351 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      8.6156     0.9586   8.987   <2e-16 ***
## studytime        0.5927     0.3975   1.491    0.137    
## sexM             0.5691     1.2465   0.457    0.648    
## studytime:sexM   0.3874     0.5715   0.678    0.498    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.529 on 391 degrees of freedom
## Multiple R-squared:  0.03035,    Adjusted R-squared:  0.02291 
## F-statistic:  4.08 on 3 and 391 DF,  p-value: 0.00716

            mean(mod4b_sum$residuals^2) #MSE=20.30104, not significant

## [1] 20.30104

            anova(mod4b)

## Analysis of Variance Table
## 
## Response: G3
##                Df Sum Sq Mean Sq F value   Pr(>F)   
## studytime       1   79.1  79.132  3.8585 0.050204 . 
## sex             1  162.4 162.437  7.9204 0.005135 **
## studytime:sex   1    9.4   9.428  0.4597 0.498173   
## Residuals     391 8018.9  20.509                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Maya_Hansen-Tilkens

12/6/2020