My first test uses term 3 student grades, ‘G3’ (which is continuous), and students’ intention to pursue higher education, ‘higher’ (which is binary yes/no). My null hypothesis (H0) is that there is no difference in mean G3 grades between students who want to pursue higher education and those who do not.
Here, there is more than enough data to perform a hypothesis test using the Neyman-Pearson framework. The data set has over 600 rows, which is generally considered a large sample. For the sample size calculation, I am using alpha = 0.1 because this analysis isn’t high-stakes (we aren’t in the medical field, where lives are potentially at stake). A larger alpha also decreases the chance that we miss a meaningful association, at the cost of more false positives. For power, I will choose beta = 0.2 so that power = 1 - 0.2 = 0.8. This is the conventional minimum in research and is consistent with the reasoning behind my choice of alpha: the results aren’t critical.
# Check the standard deviations in each group.
sp |>
  group_by(higher) |>
  summarize(sd = sd(G3),
            mean = mean(G3),
            size = n())
## # A tibble: 2 × 4
##   higher    sd  mean  size
##   <chr>  <dbl> <dbl> <int>
## 1 no      2.97  8.80    69
## 2 yes     3.06 12.3    580
Above, I checked whether the standard deviations in each group were similar. Both are about 3, so I use the standard deviation of G3 across the whole data set in the sample size calculation. For the remaining parameters of the pwrss.t.2means function, I chose 2 points as a meaningful difference between the group means, since that is about 10% of the total grade. This value is passed as mu1; because mu2 is left at its default of 0, mu1 = 2 specifies a 2-point difference. I set sd1 to the standard deviation of G3 grades and kappa to 580/69 (the ratio of the ‘yes’ group size to the ‘no’ group size) because the groups are very imbalanced.
test1 = pwrss.t.2means(
  mu1 = 2,
  sd1 = sd(sp$G3),
  alpha = 0.1,
  power = 0.8,
  kappa = 580/69,
  alternative = 'not equal'
)
## +--------------------------------------------------+
## | SAMPLE SIZE CALCULATION |
## +--------------------------------------------------+
##
## Welch's T-Test (Independent Samples)
##
## ---------------------------------------------------
## Hypotheses
## ---------------------------------------------------
## H0 (Null Claim) : d - null.d = 0
## H1 (Alt. Claim) : d - null.d != 0
##
## ---------------------------------------------------
## Results
## ---------------------------------------------------
## Sample Size = 162 and 20 <<
## Type 1 Error (alpha) = 0.100
## Type 2 Error (beta) = 0.185
## Statistical Power = 0.815
These results mean that, to detect a 2-point grade difference with alpha = 0.1, power = 0.8, and kappa = 580/69, I need at least 162 students in the ‘yes’ group and 20 students in the ‘no’ group. Since my data set actually contains 580 ‘yes’ and 69 ‘no’ students, both groups exceed the required sizes and the study is sufficiently powered to detect a 2-point difference.
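As a rough cross-check on the required group sizes, they can be approximated by hand with the usual normal-approximation formula for comparing two means under unequal allocation. This is only a sketch, not what pwrss.t.2means does internally (that function works with the t distribution). The helper name approx_n_2means and the value sd = 3.2 are my own illustrative assumptions; the latter stands in for sd(sp$G3).

```r
# Normal-approximation sample size for a two-sided, two-sample comparison of means.
# kappa = n1 / n2, delta = difference to detect, sd = assumed common standard deviation.
approx_n_2means <- function(delta, sd, alpha, power, kappa) {
  z_alpha <- qnorm(1 - alpha / 2)   # critical value for a two-sided test
  z_beta  <- qnorm(power)           # quantile matching the target power
  # Var(difference) = sd^2 * (1/n1 + 1/n2) = sd^2 * (1 + 1/kappa) / n2
  n2 <- (z_alpha + z_beta)^2 * sd^2 * (1 + 1 / kappa) / delta^2
  c(n1 = ceiling(kappa * n2), n2 = ceiling(n2))
}

# sd = 3.2 is an assumed placeholder; substitute sd(sp$G3) when the data are loaded.
approx_n <- approx_n_2means(delta = 2, sd = 3.2, alpha = 0.1,
                            power = 0.8, kappa = 580 / 69)
approx_n
```

With these inputs the approximation comes out somewhat below the 162 and 20 reported by pwrss.t.2means. That is expected, since the normal approximation ignores the extra uncertainty from estimating the variance and the placeholder sd may differ from sd(sp$G3).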
t.test(G3 ~ higher, data = sp)
##
## Welch Two Sample t-test
##
## data: G3 by higher
## t = -9.1593, df = 86.036, p-value = 2.323e-14
## alternative hypothesis: true difference in means between group no and group yes is not equal to 0
## 95 percent confidence interval:
## -4.233783 -2.723738
## sample estimates:
## mean in group no mean in group yes
## 8.797101 12.275862
Above, I chose a Welch’s two sample t-test to compare the means of the two groups. Welch’s t-test doesn’t assume equal variances, which is helpful because, while the group variances are similar as seen above, they aren’t exactly the same. The sample mean is 8.797 in group ‘no’ and 12.276 in group ‘yes’. The resulting p-value of 2.323e-14 is well below our specified alpha threshold of 0.1. The 95% confidence interval for the difference (‘no’ minus ‘yes’) is -4.23 to -2.72, which excludes 0 and lies entirely beyond the 2-point difference I treated as meaningful. There is therefore ample evidence to reject our null hypothesis that there is no difference in mean G3 grades between the two groups, and we accept the alternative hypothesis that the group means differ.
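To make the Welch calculation concrete, the t statistic and degrees of freedom can be approximately reproduced from the group summaries alone. The sketch below uses the rounded means, standard deviations, and group sizes printed earlier, so its results should land close to, but not exactly on, the t.test output (about -9.2 and 86). The variable names are my own.

```r
# Summary statistics copied (rounded) from the output above
m_no  <- 8.797;  s_no  <- 2.97; n_no  <- 69
m_yes <- 12.276; s_yes <- 3.06; n_yes <- 580

# Squared standard error contributed by each group
v_no  <- s_no^2 / n_no
v_yes <- s_yes^2 / n_yes

# Welch t statistic and Welch-Satterthwaite degrees of freedom
t_stat <- (m_no - m_yes) / sqrt(v_no + v_yes)
df     <- (v_no + v_yes)^2 / (v_no^2 / (n_no - 1) + v_yes^2 / (n_yes - 1))

# Two-sided p-value from the t distribution
p_val <- 2 * pt(-abs(t_stat), df)
c(t = t_stat, df = df, p = p_val)
```

The degrees of freedom sit close to the smaller group’s size because that group contributes most of the standard error.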
ggplot(sp, aes(x = G3, fill = higher)) +
  geom_density(alpha = 0.4) +
  labs(title = "Distribution of G3 grades by desire for higher education",
       x = "G3 Grades",
       y = "Density",
       fill = "Higher Education") +
  scale_fill_manual(values = c("no" = "blue", "yes" = "orange"))
The visualization above shows the density of G3 grades for each level of the higher education variable. The two distributions are clearly separated, with the center of the ‘no’ group sitting below the center of the ‘yes’ group. This agrees with our Welch’s test, which gave strong evidence of a difference between the groups’ means.
My second hypothesis test again uses G3 grades, now re-coded as a binary variable so that a score of >= 10 is a ‘pass’ and a score of < 10 is a ‘fail’. The other variable is home internet access, ‘internet’ (which is binary yes/no). The null hypothesis is that the proportion of students who ‘pass’ is the same for students with and without home internet access. I will use the same alpha value of 0.1 from hypothesis one for consistency.
# Create the binary pass/fail variable
sp$G3_pass = ifelse(sp$G3 >= 10, 1, 0)
For this hypothesis, I will use a two-sample normal test of equal proportions. Under the null hypothesis, the difference between the two sample proportions is approximately normally distributed around zero, and the test measures how far the observed difference falls from zero. I also set the Yates continuity correction to FALSE because every cell in the contingency table contains enough observations that the correction isn’t needed.
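# Rows: G3_pass (0 = fail, 1 = pass); columns: internet (no, yes)
pass_table = table(sp$G3_pass, sp$internet)
x = pass_table[2, ]        # number of passes in each internet group
n = colSums(pass_table)    # total students in each internet group
prop_result = prop.test(x = x,
                        n = n,
                        alternative = 'two.sided',
                        correct = FALSE)
prop_result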
##
## 2-sample test for equality of proportions without continuity correction
##
## data: x out of n
## X-squared = 5.0504, df = 1, p-value = 0.02462
## alternative hypothesis: two.sided
## 95 percent confidence interval:
## -0.147195128 -0.003553563
## sample estimates:
## prop 1 prop 2
## 0.7880795 0.8634538
The two-sample normal test for equal proportions gives a sample pass proportion of 0.788 for the ‘no’ internet group and 0.863 for the ‘yes’ internet group. The resulting p-value of 0.02462 is below our predefined alpha value of 0.1. If the true proportions were the same, a difference at least this large would occur only about 2.5% of the time. This gives us enough evidence to reject the null hypothesis that the proportion of students with passing G3 grades is the same with and without home internet access.
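The connection between the normal test and the X-squared value in the output can be checked by hand. The sketch below computes the pooled two-proportion z statistic from the pass counts and group sizes in this analysis (119 of 151 students without internet, 430 of 498 with internet). The variable names are my own. Squaring z should reproduce the X-squared statistic, and the two-sided normal p-value should match the one reported.

```r
# Pass counts and group sizes from this analysis
passes <- c(no = 119, yes = 430)
totals <- c(no = 151, yes = 498)

p_hat  <- passes / totals               # sample pass proportions
p_pool <- sum(passes) / sum(totals)     # pooled proportion under H0

# Standard error of the difference under H0, then the z statistic
se <- sqrt(p_pool * (1 - p_pool) * sum(1 / totals))
z  <- (p_hat[["no"]] - p_hat[["yes"]]) / se

# z^2 corresponds to the X-squared statistic without continuity correction
p_val <- 2 * pnorm(-abs(z))
c(z = z, chi_sq = z^2, p = p_val)
```

The negative sign of z reflects the direction of the difference: the pass rate is lower in the group without internet.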
# Counts are filled column by column: each column is one internet group,
# listed as (fail, pass). No Internet: 32 fail, 119 pass; Internet: 68 fail, 430 pass.
pass_matrix = matrix(c(32, 119, 68, 430), nrow = 2)
colnames(pass_matrix) = c("No Internet", "Internet")
rownames(pass_matrix) = c("Fail", "Pass")
mosaic(t(pass_matrix),
       direction = "v",
       highlighting = 2,
       highlighting_fill = c("#234990", "#FF994E"),
       main = "Mosaic Plot of Proportions",
       labeling = labeling_border(
         rot_labels = c(0, 0, 0, 0),
         just_labels = c("center", "center"),
         offset_labels = c(0.5, 0.5, 0.5, 0.5)
       ))
Above, I produced a mosaic plot to visualize the proportions in each group. First, the widths of the boxes show that the group sizes are very different: there are many more students with internet access than without it. The heights of the boxes show that a larger share of students fail in the ‘no internet’ group, while a smaller share fail in the ‘internet’ group. This matches the results of our normal test for equal proportions above, which showed that the difference in passing proportion between students with and without internet is statistically significant at an alpha threshold of 0.1.
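The widths and heights described above can also be read off numerically. As a small sketch, using the same counts as this analysis (32 fail and 119 pass without internet, 68 fail and 430 pass with internet) and variable names of my own choosing, the group shares and within-group proportions are:

```r
# Counts from the analysis: columns are internet groups, rows are outcomes
counts <- matrix(c(32, 119, 68, 430), nrow = 2,
                 dimnames = list(c("Fail", "Pass"),
                                 c("No Internet", "Internet")))

# Share of all students in each internet group (the box widths)
group_share <- colSums(counts) / sum(counts)

# Fail/pass proportions within each internet group (the box heights)
within_group <- prop.table(counts, margin = 2)

round(group_share, 3)
round(within_group, 3)
```

Roughly 21% of students without internet fail, compared with about 14% of students with internet, which is the gap the mosaic plot displays.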