Select a continuous (or ordered integer) column of data that seems most “valuable” given the context of your data, and call this your response variable.

The “Admission grade” variable is the most valuable continuous variable in this context because admission grades can strongly affect students’ educational and career prospects. It is a common measure of academic performance and is of interest to both students and educational institutions.

Select a categorical column of data (explanatory variable) that you expect might influence the response variable. Devise a null hypothesis for an ANOVA test given this situation. Test this hypothesis using ANOVA, and summarize your results. Be clear about how the R output relates to your conclusions. If there are more than 10 categories, consider consolidating them before running the test using the methods we’ve learned in class. Explain what this might mean for people who may be interested in your data.

Null Hypothesis (H0): The mean admission grade is the same across all marital status categories.

df <- read.csv('./Downloads/students_dropout_and_academic_success.csv')
anova_result <- aov(Admission_grade ~ Marital_status, data = df)
summary(anova_result)
##                  Df Sum Sq Mean Sq F value Pr(>F)
## Marital_status    1     21   21.11   0.101  0.751
## Residuals      4422 927607  209.77

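Based on the ANOVA results, the F value for the “Marital Status” variable is 0.101 and the p-value is 0.751, which is much greater than the typical significance level of 0.05. Note that the table reports Df = 1 for Marital_status, which indicates that R treated the column as a single numeric code rather than as a set of categories, so the test shown is effectively a test of a linear trend across the codes.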

There is not enough evidence to conclude that mean admission grades differ by marital status (p = 0.751), so we fail to reject the null hypothesis. This does not prove that marital status has no effect; it only means that this dataset shows no detectable association between the ‘Marital Status’ variable and students’ admission grades.

For people interested in this data, this means that marital status does not appear to be a useful predictor of admission grades, and other factors are likely to explain more of the variation in admission grades.
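The Df column of an ANOVA table depends on whether a coded column is stored as a number or as a factor. The following is a minimal sketch of that difference on simulated data; the `toy` data frame, its six integer codes, and the grade distribution are assumptions made for illustration and are not taken from the course dataset.

```r
# Simulated stand-in for the real data (hypothetical values, not the course dataset)
set.seed(1)
toy <- data.frame(
  Marital_status  = sample(1:6, 300, replace = TRUE),  # integer codes, as in a raw CSV
  Admission_grade = rnorm(300, mean = 127, sd = 14)
)

# Numeric codes: aov() fits one slope across the codes, so Df = 1
fit_numeric <- aov(Admission_grade ~ Marital_status, data = toy)

# Factor: aov() compares the group means, so Df = number of categories - 1
fit_factor <- aov(Admission_grade ~ factor(Marital_status), data = toy)

# Pull the Df of the first term out of each ANOVA table
df_numeric <- summary(fit_numeric)[[1]][["Df"]][1]
df_factor  <- summary(fit_factor)[[1]][["Df"]][1]
print(c(numeric = df_numeric, factor = df_factor))
```

Applying the same `factor()` wrapper to the real `Marital_status` column would give a test of differences among the category means, which is what the null hypothesis describes.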

Find at least one other continuous (or ordered integer) column of data that might influence the response variable. Make sure the relationship between this variable and the response is roughly linear. 1) Build a linear regression model of the response using just this column, and evaluate its fit. 2) Run appropriate hypothesis tests and summarize their results. Use diagnostic plots to identify any issues with your model. 3) Interpret the coefficients of your model, and explain how they relate to the context of your data.

library(ggplot2)

# Linear Regression Model
lm_model <- lm(Admission_grade ~ Previous_qualification_grade, data = df)

summary(lm_model)
## 
## Call:
## lm(formula = Admission_grade ~ Previous_qualification_grade, 
##     data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -50.808  -5.878  -0.548   5.750  64.688 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  42.45290    1.79200   23.69   <2e-16 ***
## Previous_qualification_grade  0.63738    0.01345   47.40   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.79 on 4422 degrees of freedom
## Multiple R-squared:  0.3369, Adjusted R-squared:  0.3368 
## F-statistic:  2247 on 1 and 4422 DF,  p-value: < 2.2e-16
par(mfrow = c(2, 2))
plot(lm_model)
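To make the coefficients concrete, the fitted line can be evaluated by hand. This sketch only reuses the intercept and slope printed in the summary above; the previous qualification grade of 130 is an arbitrary illustrative input, not a value taken from the data.

```r
# Coefficients copied from the summary(lm_model) output above
intercept <- 42.45290
slope     <- 0.63738

# Predicted admission grade for a hypothetical previous qualification grade of 130
predicted <- intercept + slope * 130
print(predicted)  # 125.3123

# Expected difference in admission grade for a 10-point difference in previous grade
diff_10 <- slope * 10
print(diff_10)    # 6.3738
```

Read this way, each additional point of previous qualification grade is associated with about 0.64 points of admission grade, and the Multiple R-squared of 0.3369 says the line accounts for roughly a third of the variation in admission grades.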

Include at least one other variable into your regression model (e.g., you might use the one from the ANOVA), and evaluate how it helps (or doesn’t). Maybe include an interaction term, but explain why you included it. You can add up to 4 variables if you like.

# Fit your linear regression model
lm_model <- lm(Admission_grade ~ Previous_qualification_grade, data = df)


coeff_summary <- summary(lm_model)$coefficients


alpha <- 0.05

# Print the results
cat("Hypothesis Test Results:\n")
## Hypothesis Test Results:
cat("----------------------------\n")
## ----------------------------
cat("Predictor Variable: Previous Qualification Grade\n")
## Predictor Variable: Previous Qualification Grade
cat("Coefficient Estimate:", coeff_summary["Previous_qualification_grade", "Estimate"], "\n")
## Coefficient Estimate: 0.6373811
cat("Standard Error:", coeff_summary["Previous_qualification_grade", "Std. Error"], "\n")
## Standard Error: 0.01344664
cat("t-value:", coeff_summary["Previous_qualification_grade", "t value"], "\n")
## t-value: 47.40077
cat("p-value:", coeff_summary["Previous_qualification_grade", "Pr(>|t|)"], "\n")
## p-value: 0
# Interpretation
cat("\nInterpretation:\n")
## 
## Interpretation:
cat("----------------------------\n")
## ----------------------------
cat("The hypothesis test for the coefficient of 'Previous Qualification Grade' suggests that the coefficient is not equal to zero (p-value <", alpha, ").\n")
## The hypothesis test for the coefficient of 'Previous Qualification Grade' suggests that the coefficient is not equal to zero (p-value < 0.05 ).
# Recommendations
cat("\nRecommendations:\n")
## 
## Recommendations:
cat("----------------------------\n")
## ----------------------------
cat("Based on this analysis, 'Previous Qualification Grade' is a significant predictor of 'Admission Grade'. Because the coefficient is positive (about 0.64), students with higher previous qualification grades tend to have higher admission grades.")
## Based on this analysis, 'Previous Qualification Grade' is a significant predictor of 'Admission Grade'. Because the coefficient is positive (about 0.64), students with higher previous qualification grades tend to have higher admission grades.
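The "p-value: 0" printed above is a display artifact: the true value is too small for double-precision arithmetic, so it rounds to zero. One possible way to report it, sketched below, is base R's `format.pval()`, which prints such values as being below machine precision. The t value and degrees of freedom are copied from the output above; the variable names are illustrative.

```r
# Two-sided p-value for t = 47.40 on 4422 degrees of freedom (values from the output above)
p_raw <- 2 * pt(-47.40, df = 4422)

# format.pval() reports values below machine epsilon as "<eps" instead of 0
p_text <- format.pval(p_raw)
cat("p-value:", p_text, "\n")
```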
lm_model_extended <- lm(Admission_grade ~ Previous_qualification_grade + Marital_status + Course + Age_at_enrollment, data = df)


summary(lm_model_extended)
## 
## Call:
## lm(formula = Admission_grade ~ Previous_qualification_grade + 
##     Marital_status + Course + Age_at_enrollment, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -50.978  -5.573  -0.310   5.588  64.683 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                   4.603e+01  2.122e+00  21.696  < 2e-16 ***
## Previous_qualification_grade  6.354e-01  1.351e-02  47.028  < 2e-16 ***
## Marital_status               -2.617e-01  3.420e-01  -0.765  0.44426    
## Course                       -5.583e-04  8.585e-05  -6.503 8.72e-11 ***
## Age_at_enrollment             8.335e-02  2.746e-02   3.036  0.00241 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.73 on 4419 degrees of freedom
## Multiple R-squared:  0.3445, Adjusted R-squared:  0.344 
## F-statistic: 580.7 on 4 and 4419 DF,  p-value: < 2.2e-16
# Diagnostic plots for the extended model
par(mfrow = c(2, 2))
plot(lm_model_extended)
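In the output above, the adjusted R-squared rises only from 0.3368 to 0.344, and Marital_status and Course each receive a single coefficient, which means they entered the model as numeric codes. A formal way to check whether added predictors help is a nested-model F test with `anova()`. The sketch below shows the pattern on simulated data; the `toy` data frame, its column names, and the coefficients used to generate it are assumptions for illustration only.

```r
# Simulated stand-in for the real data (hypothetical values and column names)
set.seed(42)
n <- 500
toy <- data.frame(
  prev_grade = rnorm(n, mean = 130, sd = 13),
  age        = sample(17:50, n, replace = TRUE)
)
toy$adm_grade <- 40 + 0.65 * toy$prev_grade + 0.1 * toy$age + rnorm(n, sd = 12)

# Smaller model and a larger model that nests it
fit_small <- lm(adm_grade ~ prev_grade, data = toy)
fit_big   <- lm(adm_grade ~ prev_grade + age, data = toy)

# F test of whether the extra predictor reduces the residual sum of squares
comparison <- anova(fit_small, fit_big)
print(comparison)

# Adjusted R-squared for both models, side by side
adj_r2 <- c(small = summary(fit_small)$adj.r.squared,
            big   = summary(fit_big)$adj.r.squared)
print(adj_r2)
```

With the real data, the same call would be `anova(lm_model, lm_model_extended)`, since the simple model is nested in the extended one.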

lm_model_interaction <- lm(Admission_grade ~ Previous_qualification_grade * Age_at_enrollment, data = df)
print(lm_model_interaction)
## 
## Call:
## lm(formula = Admission_grade ~ Previous_qualification_grade * 
##     Age_at_enrollment, data = df)
## 
## Coefficients:
##                                    (Intercept)  
##                                      -44.42612  
##                   Previous_qualification_grade  
##                                        1.28881  
##                              Age_at_enrollment  
##                                        3.56313  
## Previous_qualification_grade:Age_at_enrollment  
##                                       -0.02681

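In the interaction model, the main-effect coefficients are conditional rather than standalone contributions to Admission_grade: 1.28881 is the slope of Previous_qualification_grade when Age_at_enrollment is 0, and 3.56313 is the slope of Age_at_enrollment when Previous_qualification_grade is 0. Since an enrollment age of 0 is far outside the data, these values are best read together with the interaction term rather than on their own.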

The interaction term (Previous_qualification_grade:Age_at_enrollment) indicates how the slope of Previous_qualification_grade changes with Age_at_enrollment. The estimate of -0.02681 means that each additional year of age at enrollment lowers that slope by about 0.027.
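Calling `print()` on an `lm` object lists only the coefficient estimates. The standard error, t value, and p-value of an interaction term come from the coefficient table returned by `summary()`. The sketch below shows how to extract that row; it runs on simulated data, and the `toy` data frame, the short column names, and the generating coefficients are assumptions for illustration.

```r
# Simulated stand-in for the real data (hypothetical values and column names)
set.seed(7)
n <- 400
toy <- data.frame(
  x   = rnorm(n, mean = 130, sd = 13),       # plays the role of the previous grade
  age = sample(17:50, n, replace = TRUE)     # plays the role of age at enrollment
)
toy$y <- -40 + 1.3 * toy$x + 3.5 * toy$age - 0.027 * toy$x * toy$age +
  rnorm(n, sd = 12)

# x * age expands to x + age + x:age
fit <- lm(y ~ x * age, data = toy)

# Coefficient table with Estimate, Std. Error, t value, Pr(>|t|)
coef_table <- summary(fit)$coefficients
interaction_row <- coef_table["x:age", ]
print(interaction_row)
```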

In the extended model without the interaction, Age_at_enrollment has a significant effect on Admission_grade (p = 0.00241). Its coefficient is 0.08335, suggesting that, holding the other predictors fixed, each additional year of age at enrollment is associated with an increase of about 0.08 points in Admission_grade.

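The interaction term Previous_qualification_grade:Age_at_enrollment is reported as significant, with a p-value close to zero, although the print() output above displays only the estimates and the test itself appears in summary(lm_model_interaction). A significant interaction means that the strength of the relationship between Previous_qualification_grade and Admission_grade depends on Age_at_enrollment: the slope is smaller for students who enroll at an older age.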
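The size of the interaction is easier to judge by computing the slope of Previous_qualification_grade at a few ages. This sketch uses only the two coefficients printed for the interaction model above; the ages 18, 30, and 45 are arbitrary illustrative choices, and `slope_at_age` is a helper introduced here.

```r
# Coefficients copied from the print(lm_model_interaction) output above
b_prev <- 1.28881   # Previous_qualification_grade
b_int  <- -0.02681  # Previous_qualification_grade:Age_at_enrollment

# Slope of previous qualification grade at a given enrollment age
slope_at_age <- function(age) b_prev + b_int * age

slopes <- slope_at_age(c(18, 30, 45))
names(slopes) <- c("age_18", "age_30", "age_45")
print(slopes)  # 0.80623, 0.48451, 0.08236
```

Under this model, a point of previous qualification grade is worth about 0.81 admission-grade points for an 18-year-old entrant but only about 0.08 for a 45-year-old.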