knitr::opts_chunk$set(include = TRUE)
The first numeric column I chose is Mother’s Education Level (Medu, which is an ordinal variable).
The second numeric column I chose is Home to School Travel Time (traveltime, which is also an ordinal variable).
The calculated column I built is the sum of term 1, 2 and 3 grades for each student.
# Calculate the sum of all term grades for each student.
sp$total_grade = sp$G1 + sp$G2 + sp$G3
ggplot(sp, aes(x = as.factor(Medu), y = total_grade)) +
geom_boxplot() +
scale_x_discrete(labels = c("0" = "None", "1" = "Up to 4th grade", "2" = "5th-9th grade", "3" = "Secondary", "4" = "Higher")) +
labs(title = "Total Grade (out of 60) for each Mother's Education Group",
x = "Mother's Education",
y = "Total Grade")
# Check group size to see if any of the boxplots are lacking data.
sp |>
group_by(Medu) |>
summarise(count = n())
## # A tibble: 5 × 2
## Medu count
## <dbl> <int>
## 1 0 6
## 2 1 143
## 3 2 186
## 4 3 139
## 5 4 175
The plot above shows a boxplot of each student’s three-term total grade, grouped by mother’s education level. The medians and overall distributions generally increase with mother’s education. I checked the group sizes to make sure no boxplot represents only a handful of points. Sure enough, the ‘None’ group contains just 6 students, compared to at least 139 in every other group, so the left-most boxplot is too small a sample to draw conclusions from. A few students are flagged as outliers by the 1.5 x IQR rule, and most of them have total grades well below the rest of their group. We know from the data set that some students received term grades of zero, which probably explains these low ‘outliers’. However, the data set doesn’t tell us why those students received zeros, so although the plot flags them as outliers, they may not reflect academic performance. Some students may have received a zero because of special circumstances, such as being unable to complete the term or a family emergency.
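As a rough illustration of the 1.5 x IQR rule mentioned above, here is a minimal base R sketch. The `grades` vector is hypothetical and not taken from the data set, and the quartile convention used by `quantile()` may differ slightly from the one a given boxplot function uses.

```r
# Hypothetical total grades (out of 60); NOT values from the data set.
grades <- c(0, 12, 30, 31, 33, 34, 35, 36, 36, 38, 40, 41, 44, 47)

# First and third quartiles, and the interquartile range.
q <- quantile(grades, probs = c(0.25, 0.75), names = FALSE)
iqr <- q[2] - q[1]

# Fences sit 1.5 x IQR beyond the quartiles.
lower_fence <- q[1] - 1.5 * iqr
upper_fence <- q[2] + 1.5 * iqr

# Values outside the fences are the points a boxplot draws individually.
outliers <- grades[grades < lower_fence | grades > upper_fence]
outliers
```

In this toy vector only the two very low totals fall outside the fences, which mirrors how term grades of zero can pull a student’s total far below the rest of the group.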
ggplot(sp, aes(x = as.factor(traveltime), y = total_grade)) +
geom_boxplot() +
scale_x_discrete(labels = c("1" = "<15 mins", "2" = "15 to 30 mins", "3" = "30 to 60 mins", "4" = ">60 mins")) +
labs(title = "Total Grade (out of 60) for each Home to School Travel Time Group",
x = "Travel Time",
y = "Total Grade")
# Another group size check to see if any are lacking data.
sp |>
group_by(traveltime) |>
summarise(count = n())
## # A tibble: 4 × 2
## traveltime count
## <dbl> <int>
## 1 1 366
## 2 2 213
## 3 3 54
## 4 4 16
The plot above contains boxplots of total student grades for each home-to-school travel time group. The medians and distributions show a decreasing trend in total grades as travel time increases. Notably, the median total grade for students who travel less than 15 minutes is about 36 out of 60, while the median for students who travel 15 to 30 minutes is about 33. That is a gap of about 3 points, or 5% of the 60-point maximum. While we can’t say whether these differences are due solely to the time a student needs to travel to school, the plot does highlight differences between the groups. The same note about outliers applies here. There are some outliers (according to the 1.5 x IQR rule), but we don’t know why those students received these scores. It is certainly possible that a student repeatedly scored poorly because the scores reflect their abilities. On the other hand, there could be extenuating circumstances that prevented that student from earning a higher grade.
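The gap between the two medians can be expressed in more than one way, and the choice changes the percentage. A small sketch of the arithmetic, using the approximate medians quoted above (the variable names are only for illustration):

```r
# Approximate medians read from the boxplots (values quoted in the text).
median_under_15 <- 36   # travel time < 15 mins
median_15_to_30 <- 33   # travel time 15 to 30 mins
max_total <- 60         # maximum possible total grade

# Difference in raw points.
diff_points <- median_under_15 - median_15_to_30

# Difference as a share of the 60-point maximum.
diff_pct_of_max <- 100 * diff_points / max_total

# Difference relative to the higher median.
diff_pct_relative <- 100 * diff_points / median_under_15

c(points = diff_points,
  pct_of_max = diff_pct_of_max,
  pct_relative = round(diff_pct_relative, 1))
```

The 5% figure corresponds to the share of the maximum total. Relative to the higher median, the same 3-point gap is a little over 8%.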
cor1 <- cor(x = as.numeric(sp$Medu),
y = sp$total_grade,
method = "spearman")
cor1
## [1] 0.2877664
Above, I calculated Spearman’s rank correlation coefficient between mother’s education and student’s total grade. My x variable (Medu) is ordinal, while my y variable (total_grade) is numeric and treated as continuous. Pearson’s correlation coefficient is not a good fit here because it measures linear association and treats the gaps between Medu levels as equal, which an ordinal variable does not guarantee. The resulting Spearman’s rank correlation is 0.2877664. It makes sense that the value is positive because we would generally expect a higher mother’s education level to go along with a higher total grade. The relationship, however, isn’t particularly strong. This is real data where many factors (both included in and missing from this data set) influence a child’s grade. While mother’s education level probably plays some role, it might not be a driving influence. Whether a correlation is notable also depends on context, so 0.29 may be notable in one field and not in another.
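One way to see why Spearman’s coefficient suits an ordinal variable is that it is Pearson’s correlation computed on ranks, so only the ordering of the levels matters. A minimal sketch with hypothetical toy vectors (`medu_toy` and `total_toy` are not from the data set):

```r
# Hypothetical toy data; NOT values from the data set.
medu_toy  <- c(0, 1, 1, 2, 2, 3, 3, 4, 4, 4)
total_toy <- c(20, 25, 31, 28, 36, 33, 40, 38, 45, 41)

# Spearman's rank correlation as computed by cor().
spearman <- cor(medu_toy, total_toy, method = "spearman")

# Pearson's correlation applied to the ranks (ties get average ranks).
pearson_on_ranks <- cor(rank(medu_toy), rank(total_toy), method = "pearson")

# The two values agree up to floating point error.
all.equal(spearman, pearson_on_ranks)
```

Because only ranks enter the calculation, recoding the education levels with any other increasing set of numbers would leave the coefficient unchanged.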
cor2 <- cor(x = as.numeric(sp$traveltime),
y = sp$total_grade,
method = "spearman")
cor2
## [1] -0.1634047
I followed the same procedure as above and calculated Spearman’s rank correlation coefficient between student’s home-to-school travel time and student’s total grade. The resulting value is -0.1634047. The negative sign agrees with the travel time boxplots above, where total grades tend to fall as travel time increases. We expect that living closer to school generally makes it easier for students to earn higher grades. The relationship, however, is even weaker than the previous correlation. Again, many variables affect student grades, and travel time may simply have a weaker association with grades than others do.
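library(boot)  # provides boot() and boot.ci()
# Percentile bootstrap confidence interval for a statistic of the vector v.
boot_ci = function(v, func = mean, conf = 0.95, n_iter = 1000) {
  # boot() passes the data and a vector of resampled indices to the statistic.
  boot_func = \(x, i) func(x[i], na.rm = TRUE)
  b = boot(v, boot_func, R = n_iter)
  boot.ci(b, conf = conf, type = "perc")
}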
boot_ci(sp$total_grade, mean, 0.95)
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 1000 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = b, conf = conf, type = "perc")
##
## Intervals :
## Level Percentile
## 95% (34.28, 35.52 )
## Calculations and Intervals on Original Scale
The question does not specify the confidence level, the method, or the parameter for the interval. The only response variable in this situation is total_grade, so I produced a 95% confidence interval for its mean using the bootstrap. I used the boot_ci function from the lab with 1000 bootstrap resamples. The resulting 95% percentile bootstrap confidence interval for the mean of total_grade is 34.28 to 35.52. If we repeatedly drew new samples from the entire population of Portuguese students (not just the ones in this data set) and built an interval from each sample, we would expect approximately 95% of those intervals to contain the true population mean. In that sense, we can be 95% confident that the true population mean of total_grade is between 34.28 and 35.52.
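To make the percentile bootstrap less of a black box, here is a minimal base R sketch of the same idea without the boot package. The `grades` vector and the seed are hypothetical, and `quantile()` interpolates slightly differently from `boot.ci(type = "perc")`, so the endpoints would not match that function exactly.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

# Hypothetical total grades; NOT values from the data set.
grades <- c(22, 28, 30, 31, 33, 34, 35, 36, 36, 38, 40, 41, 44, 47)

n_iter <- 1000

# Resample the data with replacement and record the mean of each resample.
boot_means <- replicate(n_iter, mean(sample(grades, replace = TRUE)))

# Percentile interval: the middle 95% of the bootstrap means.
ci <- quantile(boot_means, probs = c(0.025, 0.975), names = FALSE)
ci
```

Each resample has the same size as the original sample, so the spread of `boot_means` approximates how much the sample mean would vary from one sample to the next.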