Select a continuous response variable

For my continuous response variable, I will chose term 3 student grades (G3), which is on a scale from 0 to 20. The higher the number, the better the score. The reason that I chose this is because a grade, especially a final term one, is a natural fit as a response variable. In an academic context, grades are affected by many factors which makes it a perfect fit as a response here.

Select a categorical predictor variable

For my categorical predictor variable, I will chose Father’s education (Fedu) which has 5 levels. A value of 0 means no education, 1 means primary education (4th grade), 2 means 5th to 9th grade, 3 means secondary education and 4 means higher education.

Devise a null hypothesis, test it using ANOVA and summarize.

# Visualize the difference between groups
ggplot(sp, aes(x = Fedu, y = G3, group = Fedu)) + 
  geom_boxplot() +
  labs(title = "G3 grades for each father's education level")

First, we can visualize the difference between groups by creating boxplots. We can recognize that there is a difference, but we need to do a bit more to tell if the difference is notable. It’s important to note that Fedu level 0 has just 7 rows which means that we need to be careful going forward regarding that group. (When I spoke to Leon in class, he mentioned that since the boxplot Fedu level 0 isn’t too different from the other boxplots even with the small amount of data, we don’t need to immediately throw it out).

We can define our null hypothesis as:

$ H_0 : $

Before we perform an ANOVA, we need to make sure that we meet the assumptions:

First is independence. Each row in the data set represents a different student and we don’t really expect students to come from the same family. We can generally say the independence assumption holds.

Second is normality. According to the plot below, we can roughly say that the distribution within each level of father’s education is normal. The assumption of normality holds.

ggplot(sp, aes(x = G3)) +
  geom_histogram(bins = 10, color = "black", fill = "lightblue") +
  facet_wrap(~ Fedu) +
  theme_minimal() +
  labs(title = "Distribution of G3 in each group")

Finally, we need to look at homoscedasticity. According to the output below, we can see that the standard deviation within each group is fairly similar. No standard deviation is multiple orders of magnitude larger than the others. The homoscedasticity assumption holds.

sp |>
  group_by(Fedu) |>
  summarize(sd = sd(G3),
            count = n())
## # A tibble: 5 × 3
##    Fedu    sd count
##   <dbl> <dbl> <int>
## 1     0  2.79     7
## 2     1  3.42   174
## 3     2  3.45   209
## 4     3  2.49   131
## 5     4  2.92   128

Overall our independence, normality and homoscedasticity assumptions hold which means we can proceed with an ANOVA.

We can test our null hypothesis defined above using an ANOVA.

anova = aov(G3 ~ Fedu, data = sp)
summary(anova)
##              Df Sum Sq Mean Sq F value   Pr(>F)    
## Fedu          1    303  303.39   30.39 5.12e-08 ***
## Residuals   647   6460    9.98                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The resulting p-value of Fedu is very small indicating that it is unlikely that the mean of student’s G3 grades for each father’s education group are the exact same. There is at least one group that is different from the rest. To find out which groups are different, we can perform a pairwise t-test using a bonferroni correction:

pairwise.t.test(sp$G3, sp$Fedu, 
                p.adjust.method = "bonferroni")
## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  sp$G3 and sp$Fedu 
## 
##   0       1       2       3      
## 1 1.00000 -       -       -      
## 2 1.00000 0.09162 -       -      
## 3 1.00000 0.00086 0.90597 -      
## 4 1.00000 9.7e-07 0.01416 1.00000
## 
## P value adjustment method: bonferroni
sp_filter = filter(sp, Fedu != 0)
dim(sp)
## [1] 649  33
dim(sp_filter)
## [1] 642  33
pairwise.t.test(sp_filter$G3, sp_filter$Fedu, 
                p.adjust.method = "bonferroni")
## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  sp_filter$G3 and sp_filter$Fedu 
## 
##   1       2       3      
## 2 0.05541 -       -      
## 3 0.00053 0.54559 -      
## 4 6e-07   0.00859 1.00000
## 
## P value adjustment method: bonferroni

As we can see, there are a few combinations of groups that are statistically significant with several that are not. All of the pairs that include Fedu level zero aren’t significant but with so little data in level zero, it’s hard to make conclusions about this part of the data. The 3&4 pair also results in a p-value of 1.00 but those groups have much more data. Additionally, the 2&3 pair has a high p-value of 0.906. This shows that these pairs do not have statistically different mean G3 grades. The 1&2 pair at 0.09162 isn’t quite significant but it’s approaching that point. On the other hand, the 3&1, 4&1 and 4&2 pairs do appear to have statistically different mean G3 grades. Overall, the way the pairwise t-test results are shown indicates some differences among groups but also reinforces the fact that there is a general increase in mean G3 grades as father’s education level increases. This is what we saw in our boxplots from earlier. As an example, the second column for Fedu level 1 shows that, as our row value increases (from Fedu value 2 to 3 to 4), the results become more statistically significant. This means that the difference between the means gets larger, which confirms the increasing trend.

As an additional exercise, I dropped Fedu level zero to see how that might change these p-values. I performed another pairwise t-test and found that the p-values do decrease but their significance doesn’t for the most part. The only exception is the 1&2 pair which shrinks enough that it becomes a debate whether or not it’s significant.

Explain what this might mean for people who might be interested.

There isn’t enough evidence to conclude that all levels of father’s education have different average G3 grades. It’s safe to assume that some Fedu levels that are next to each other(like 2 and 3) are associated with similar student grade performance. There is sufficient evidence to conclude that when student’s fathers have a higher education level, their average G3 grades differ from some of the lower education groups (especially level 4 vs level 3 or 2).

Find a single continuous or ordered integer column of data that may influence the response

I will choose the ‘studytime’ variable which is an ordinal variable recoded as an integer type. When it’s value is 1, it means a student studies for <2 hours per week. 2 is 2-5 hours per week, 3 is 5-10 hours per week and 4 is >10 hours per week.

Before we create our linear model, we need to check the nature of the relationship between the two variables.

First, let’s check the linearity assumption between the predictor and response variables:

ggplot(sp, aes(x = studytime, y = G3)) +
  geom_jitter() +
  geom_smooth(method = 'lm', se = F, lwd = 2) +
  geom_smooth()
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Above, I plotted ‘G3’ against ‘studytime’ using jittered scatters. I also overlayed an ‘lm’ and a ‘loess’ smoother to check whether the relationship looked linear. Overall, I would say that the relationship is monotonically increasing and meets the linearity assumption.

Next, we’ll create our linear model using these two variables:

model1 = lm(G3 ~ studytime, data = sp)

Above, I created the linear model with ‘studytime’ as the lone predictor and ‘G3’ as the response. I will revisit the interpretation of the model results after I check the assumptions.

I will test the normality of residuals assumption:

qqnorm(residuals(model1))
qqline(residuals(model1), col = "red")

hist(residuals(model1))

The normal qq plot of the residuals actually doesn’t perfectly match the red line. This means that the residuals aren’t perfectly normal. At the same time, our sample size is very large which helps out here. When I spoke with Leon in class, he said that the residuals that caused this to happen are important information in the data. There are several students with G3 grades of 0 for close to 0 which is what causes the skew and the residuals to become not normal. Leon mentioned that we can still say the normality of residuals assumption can be considered to roughly hold here for those reasons.

Next, let’s look at the standard deviation of the residuals:

plot(model1, which = 3)

Using the scale-location plot of the square root of the standardized residuals, we can see that the constant variance (homoscedasticity) assumption seems to be met. This is because the red line, which indicates the average magnitude of the residual’s variance, is pretty flat as the fitted values increase. There isn’t a clear trend of increasing or decreasing variance.

Finally, we know that the residuals are independent because we found that each row (student) in the data set is independent from one another based on the ANOVA we did previously. This means that the independent residuals assumption holds. We are also considering just one predictor variable so we don’t need to worry about the multicollinearity assumption

Interpret coefficients and explain how they relate to the context of the data.

summary(model1)
## 
## Call:
## lm(formula = G3 ~ studytime, data = sp)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.9735  -1.9463   0.0265   2.0265   7.0265 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  10.0278     0.3115  32.191  < 2e-16 ***
## studytime     0.9728     0.1483   6.562 1.09e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.131 on 647 degrees of freedom
## Multiple R-squared:  0.06239,    Adjusted R-squared:  0.06095 
## F-statistic: 43.06 on 1 and 647 DF,  p-value: 1.091e-10

Now, we can interpret our model output. As seen in the summary, our beta coefficient estimates are 10.0278 (beta 0) for the intercept and 0.9728 (beta 1) for ‘studytime’. The intercept coefficient means that, if ‘studytime’ was in the zero group, the expected G3 grade would be 10.0278. The ‘studytime’ coefficient means that, assuming our intercept is 10.0278, every unit increase in ‘studytime’, on average, will increase G3 grades by 0.9728. In other words, each time a student increases their ‘studytime’ 1 level the model expects their grade to, on average, increase by 0.9728 points. Obviously, this is isn’t a perfect model so predictions won’t be perfect and it’s possible that a student may have a different change than a 0.9728 increase. We can also see that ‘studytime’ is statistically significant but the Adjusted and Multiple R-squared values are around 0.06. This means that ‘studytime’ explains a small amount the variation in the data. Overall, the model indicates that studying more could help improve your grade.