Select a continuous (or ordered integer) column of data that seems most “valuable” given the context of your data, and call this your response variable. For example, in the Ames housing data, the price of the house is likely of the most value to both buyers and sellers. This is the thing most people will ask about when it comes to houses.

Select a categorical column of data (explanatory variable) that you expect might influence the response variable. Devise a null hypothesis for an ANOVA test given this situation. Test this hypothesis using ANOVA, and summarize your results. Be clear about how the R output relates to your conclusions. If there are more than 10 categories, consider consolidating them before running the test using the methods we’ve learned in class. Explain what this might mean for people who may be interested in your data. E.g., “there is not enough evidence to conclude [...], so it would be safe to assume that we can [...]”.

Find at least one other continuous (or ordered integer) column of data that might influence the response variable. Make sure the relationship between this variable and the response is roughly linear. Build a linear regression model of the response using just this column, and evaluate its fit. Run appropriate hypothesis tests and summarize their results. Use diagnostic plots to identify any issues with your model. Interpret the coefficients of your model, and explain how they relate to the context of your data. For example, can you make any recommendations about an optimal way of doing something?

Include at least one other variable into your regression model (e.g., you might use the one from the ANOVA), and evaluate how it helps (or doesn’t). Maybe include an interaction term, but explain why you included it. You can add up to 4 variables if you like.
For each of the above tasks, you must explain to the reader what insight was gathered, its significance, and any further questions you have which might need to be further investigated.
Importing all the libraries
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## Rows: 4424 Columns: 37
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ";"
## chr (1): Target
## dbl (36): Marital status, Application mode, Application order, Course, Dayti...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 5 × 37
## `Marital status` `Application mode` `Application order` Course
## <dbl> <dbl> <dbl> <dbl>
## 1 1 17 5 171
## 2 1 15 1 9254
## 3 1 1 5 9070
## 4 1 17 2 9773
## 5 2 39 1 8014
## # ℹ 33 more variables: `Daytime/evening attendance\t` <dbl>,
## # `Previous qualification` <dbl>, `Previous qualification (grade)` <dbl>,
## # Nacionality <dbl>, `Mother's qualification` <dbl>,
## # `Father's qualification` <dbl>, `Mother's occupation` <dbl>,
## # `Father's occupation` <dbl>, `Admission grade` <dbl>, Displaced <dbl>,
## # `Educational special needs` <dbl>, Debtor <dbl>,
## # `Tuition fees up to date` <dbl>, Gender <dbl>, …
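The import messages above report a semicolon-delimited file with 4424 rows and 37 columns. The loading step might look like the following sketch. It assumes base R rather than the readr call the report used, and the inline text is only a stand-in for the real file (whose name is not given here): it reproduces the first four columns of the five rows printed above.

```r
# Stand-in for the real file: first four columns of the five rows shown above.
sample_text <- "Marital status;Application mode;Application order;Course
1;17;5;171
1;15;1;9254
1;1;5;9070
1;17;2;9773
2;39;1;8014"

# sep = ";" matches the reported delimiter; check.names = FALSE keeps the
# original column names (with spaces) instead of converting them to dots.
data <- read.csv(text = sample_text, sep = ";", check.names = FALSE)

# Some column names in the real file carry stray whitespace (for example a
# trailing tab in the attendance column), so trim them defensively.
names(data) <- trimws(names(data))

str(data)
```

For the real data, `text = sample_text` would be replaced by the path to the file. Column names that contain spaces must be wrapped in backticks in formulas, as in the model calls printed below.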
Response Variable
We choose “Admission grade” as the response variable, since it is the grade a student holds at the time of admission.
ANOVA Hypothesis:
\(H_0\): There is no difference in the mean admission grade across the different courses.
\(H_a\): The mean admission grade differs for at least one course.
We will use “Admission grade” as the dependent variable and “Course” as the independent variable to test the hypothesis.
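A one-way ANOVA of this kind can be sketched as follows. The data frame is synthetic (three course codes and simulated grades, not the student data), and the `aov()` call is inferred from the `as.factor(Course)` term in the output further down, so treat it as an illustration of the approach rather than the exact code that was run.

```r
# Synthetic stand-in data: NOT the real student dataset.
set.seed(1)
data <- data.frame(
  Course = rep(c(33, 171, 9500), each = 40),
  check.names = FALSE
)
data$`Admission grade` <- 120 + 5 * (data$Course == 171) +
  rnorm(nrow(data), sd = 14)

# Course is stored as a number, so it must be converted to a factor;
# otherwise R would treat the course code as a numeric predictor.
anova_fit <- aov(`Admission grade` ~ as.factor(Course), data = data)
anova_table <- summary(anova_fit)[[1]]
print(anova_table)

# Extract the F statistic and p-value for the Course term.
f_value <- anova_table[["F value"]][1]
p_value <- anova_table[["Pr(>F)"]][1]
```

A small p-value leads to rejecting \(H_0\), but it does not say which courses differ from which.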
Continuous Explanatory Variable:
A continuous variable that might influence the “Admission grade” is the “Age at enrollment”. We can assume that age might play a role in academic performance.
Linear Regression:
We fit a linear regression model with “Admission grade” as the dependent variable and “Age at enrollment” as the independent variable. This will help us understand the relationship between age at the time of enrollment and the admission grade.
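The simple regression and its diagnostic plots can be sketched as below. The data are simulated stand-ins, not the student data, and the simulation parameters are arbitrary assumptions. The formula matches the `lm()` call shown in the output.

```r
# Synthetic stand-in data: NOT the real student dataset.
set.seed(2)
n <- 200
data <- data.frame(
  `Age at enrollment` = round(runif(n, 17, 60)),
  check.names = FALSE
)
data$`Admission grade` <- 128 - 0.06 * data$`Age at enrollment` +
  rnorm(n, sd = 14)

# Simple linear regression of admission grade on age.
model1 <- lm(`Admission grade` ~ `Age at enrollment`, data = data)
print(summary(model1))

# The four standard diagnostic plots: residuals vs fitted, normal Q-Q,
# scale-location, and residuals vs leverage.
par(mfrow = c(2, 2))
plot(model1)
```

In the residuals-vs-fitted panel, a curved trend would suggest non-linearity and a funnel shape would suggest non-constant variance. The Q-Q panel shows departures from normal residuals.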
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(Course) 16 71958 4497 23.16 <2e-16 ***
## Residuals 4407 855671 194
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Call:
## lm(formula = `Admission grade` ~ `Age at enrollment`, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -32.222 -9.222 -0.979 7.735 64.149
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 128.30647 0.70204 182.76 <2e-16 ***
## `Age at enrollment` -0.05710 0.02869 -1.99 0.0466 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.48 on 4422 degrees of freedom
## Multiple R-squared: 0.0008949, Adjusted R-squared: 0.000669
## F-statistic: 3.961 on 1 and 4422 DF, p-value: 0.04663
##
## Call:
## lm(formula = `Admission grade` ~ `Age at enrollment` + as.factor(Course),
## data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -36.292 -8.775 -0.935 7.516 67.972
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 120.31928 4.13196 29.119 < 2e-16 ***
## `Age at enrollment` -0.05952 0.03210 -1.854 0.063767 .
## as.factor(Course)171 17.04439 4.14281 4.114 3.96e-05 ***
## as.factor(Course)8014 3.87763 4.13440 0.938 0.348351
## as.factor(Course)9003 14.08443 4.13509 3.406 0.000665 ***
## as.factor(Course)9070 9.82645 4.13569 2.376 0.017543 *
## as.factor(Course)9085 12.14502 4.10018 2.962 0.003072 **
## as.factor(Course)9119 7.47407 4.16556 1.794 0.072841 .
## as.factor(Course)9130 13.45984 4.19422 3.209 0.001341 **
## as.factor(Course)9147 3.88567 4.08975 0.950 0.342113
## as.factor(Course)9238 6.51777 4.09770 1.591 0.111773
## as.factor(Course)9254 2.58493 4.12247 0.627 0.530669
## as.factor(Course)9500 8.18639 4.06220 2.015 0.043938 *
## as.factor(Course)9556 4.26829 4.29506 0.994 0.320391
## as.factor(Course)9670 2.44252 4.11793 0.593 0.553116
## as.factor(Course)9773 8.87689 4.10332 2.163 0.030568 *
## as.factor(Course)9853 4.13851 4.15271 0.997 0.319023
## as.factor(Course)9991 11.86650 4.11166 2.886 0.003920 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.93 on 4406 degrees of freedom
## Multiple R-squared: 0.07829, Adjusted R-squared: 0.07473
## F-statistic: 22.01 on 17 and 4406 DF, p-value: < 2.2e-16
ANOVA Result for Different Courses on Admission Grade:
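The ANOVA gives F = 23.16 on 16 and 4407 degrees of freedom with p < 2e-16, so we reject \(H_0\): there is strong evidence that the mean “Admission grade” differs across courses. The test does not indicate which courses differ from one another.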
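As a check on how the ANOVA table fits together, the F value can be recomputed from the sums of squares and degrees of freedom in the output above. The share of total variation attributable to course (often written \(\eta^2\), a label not used in the output itself) follows from the same numbers:

\[
F = \frac{\text{Sum Sq}_{\text{Course}} / \text{Df}_{\text{Course}}}{\text{Sum Sq}_{\text{Residuals}} / \text{Df}_{\text{Residuals}}}
  = \frac{71958 / 16}{855671 / 4407}
  \approx \frac{4497.4}{194.2}
  \approx 23.16
\]

\[
\eta^2 = \frac{\text{Sum Sq}_{\text{Course}}}{\text{Sum Sq}_{\text{Course}} + \text{Sum Sq}_{\text{Residuals}}}
       = \frac{71958}{71958 + 855671}
       \approx 0.078
\]

So the differences between courses are clearly detectable, but course accounts for only about 8% of the variation in admission grade.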
Linear Regression with Age at Enrollment:
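The estimated slope is -0.05710: each additional year of age at enrollment is associated with an admission grade about 0.057 units lower. The slope is only marginally significant (p = 0.0466), and the model explains less than 0.1% of the variance in admission grade (R-squared = 0.0008949), so age alone has almost no predictive value.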
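To put the size of the slope in context, the fitted line can be evaluated by hand. This sketch uses only the coefficient estimates and standard error printed in the output above; the ages 20 and 30 are arbitrary illustrative choices.

```r
# Coefficients copied from the printed summary of the age-only model.
b0 <- 128.30647      # intercept
b1 <- -0.05710       # slope for age at enrollment
se_b1 <- 0.02869     # standard error of the slope
df_resid <- 4422     # residual degrees of freedom

# Fitted admission grade for a given age.
predict_grade <- function(age) b0 + b1 * age

grades <- predict_grade(c(20, 30))
print(grades)

# Expected difference between a 30-year-old and a 20-year-old applicant.
diff_10y <- predict_grade(30) - predict_grade(20)
print(diff_10y)

# 95% confidence interval for the slope.
ci_slope <- b1 + c(-1, 1) * qt(0.975, df_resid) * se_b1
print(ci_slope)
```

A ten-year age gap corresponds to roughly half a grade point, which is small next to the residual standard error of 14.48.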
Enhanced Regression Model with Age at Enrollment and Course:
The age coefficient is essentially unchanged (-0.05952 versus -0.05710), but it is no longer significant at the 5% level (p = 0.0638). Each course coefficient is the difference in mean admission grade from the reference course, “Biofuel Production Technologies” (course code 33, which does not appear among the coefficients), for students of the same age. Students in “Animation and Multimedia Design” (171) have, on average, an admission grade about 17.04 units higher. Students in “Agronomy” (9003) have an average admission grade about 14.08 units higher. Students in “Communication Design” (9070) and “Veterinary Nursing” (9085) have higher average admission grades by about 9.83 and 12.15 units, respectively. “Nursing”, “Journalism and Communication”, and “Management (evening attendance)” also differ significantly from the reference course at the 5% level, as does course 9130 (p = 0.0013). The remaining courses show no significant difference from the reference course. Adding course raises the adjusted R-squared from 0.000669 to 0.07473, so course explains far more of the variation than age does, although more than 90% of the variation remains unexplained.
Summary: what do these results mean for people?
Age has at most a weak negative association with the admission grade: about 0.06 units per year, which is no longer significant once course is accounted for. Courses like “Animation and Multimedia Design”, “Agronomy”, and “Veterinary Nursing” have notably higher average admission grades than “Biofuel Production Technologies”. These findings suggest that certain courses either attract higher-performing applicants or have different admission criteria; the data alone cannot distinguish between the two explanations.
Interaction:
The effect of age on the admission grade may differ across courses. For some courses, older students might have higher grades due to more life experience, while in other courses, younger students might perform better due to more recent training.
##
## Call:
## lm(formula = `Admission grade` ~ `Age at enrollment` * as.factor(Course),
## data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -38.804 -8.759 -0.863 7.638 64.487
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 103.46709 22.31484 4.637 3.64e-06
## `Age at enrollment` 0.51013 0.74208 0.687 0.4918
## as.factor(Course)171 56.20797 23.04482 2.439 0.0148
## as.factor(Course)8014 15.68849 22.57307 0.695 0.4871
## as.factor(Course)9003 41.08147 22.54286 1.822 0.0685
## as.factor(Course)9070 32.67899 22.61710 1.445 0.1486
## as.factor(Course)9085 24.27753 22.59014 1.075 0.2826
## as.factor(Course)9119 18.53855 22.64771 0.819 0.4131
## as.factor(Course)9130 33.47023 22.69551 1.475 0.1404
## as.factor(Course)9147 21.61324 22.42232 0.964 0.3351
## as.factor(Course)9238 24.96582 22.48954 1.110 0.2670
## as.factor(Course)9254 18.93650 22.50107 0.842 0.4001
## as.factor(Course)9500 25.91865 22.41034 1.157 0.2475
## as.factor(Course)9556 5.33596 22.84650 0.234 0.8153
## as.factor(Course)9670 15.52882 22.55797 0.688 0.4912
## as.factor(Course)9773 21.66922 22.45704 0.965 0.3346
## as.factor(Course)9853 29.46271 22.63718 1.302 0.1931
## as.factor(Course)9991 25.33547 22.62235 1.120 0.2628
## `Age at enrollment`:as.factor(Course)171 -1.66962 0.79309 -2.105 0.0353
## `Age at enrollment`:as.factor(Course)8014 -0.42107 0.74831 -0.563 0.5737
## `Age at enrollment`:as.factor(Course)9003 -0.93608 0.75022 -1.248 0.2122
## `Age at enrollment`:as.factor(Course)9070 -0.85394 0.76109 -1.122 0.2619
## `Age at enrollment`:as.factor(Course)9085 -0.35189 0.75880 -0.464 0.6429
## `Age at enrollment`:as.factor(Course)9119 -0.32250 0.75889 -0.425 0.6709
## `Age at enrollment`:as.factor(Course)9130 -0.70667 0.76182 -0.928 0.3537
## `Age at enrollment`:as.factor(Course)9147 -0.60765 0.74752 -0.813 0.4163
## `Age at enrollment`:as.factor(Course)9238 -0.64512 0.75298 -0.857 0.3916
## `Age at enrollment`:as.factor(Course)9254 -0.54728 0.75221 -0.728 0.4669
## `Age at enrollment`:as.factor(Course)9500 -0.61168 0.74823 -0.817 0.4137
## `Age at enrollment`:as.factor(Course)9556 0.05707 0.76486 0.075 0.9405
## `Age at enrollment`:as.factor(Course)9670 -0.39710 0.75636 -0.525 0.5996
## `Age at enrollment`:as.factor(Course)9773 -0.37433 0.75105 -0.498 0.6182
## `Age at enrollment`:as.factor(Course)9853 -0.95898 0.76103 -1.260 0.2077
## `Age at enrollment`:as.factor(Course)9991 -0.46639 0.75026 -0.622 0.5342
##
## (Intercept) ***
## `Age at enrollment`
## as.factor(Course)171 *
## as.factor(Course)8014
## as.factor(Course)9003 .
## as.factor(Course)9070
## as.factor(Course)9085
## as.factor(Course)9119
## as.factor(Course)9130
## as.factor(Course)9147
## as.factor(Course)9238
## as.factor(Course)9254
## as.factor(Course)9500
## as.factor(Course)9556
## as.factor(Course)9670
## as.factor(Course)9773
## as.factor(Course)9853
## as.factor(Course)9991
## `Age at enrollment`:as.factor(Course)171 *
## `Age at enrollment`:as.factor(Course)8014
## `Age at enrollment`:as.factor(Course)9003
## `Age at enrollment`:as.factor(Course)9070
## `Age at enrollment`:as.factor(Course)9085
## `Age at enrollment`:as.factor(Course)9119
## `Age at enrollment`:as.factor(Course)9130
## `Age at enrollment`:as.factor(Course)9147
## `Age at enrollment`:as.factor(Course)9238
## `Age at enrollment`:as.factor(Course)9254
## `Age at enrollment`:as.factor(Course)9500
## `Age at enrollment`:as.factor(Course)9556
## `Age at enrollment`:as.factor(Course)9670
## `Age at enrollment`:as.factor(Course)9773
## `Age at enrollment`:as.factor(Course)9853
## `Age at enrollment`:as.factor(Course)9991
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.86 on 4390 degrees of freedom
## Multiple R-squared: 0.09069, Adjusted R-squared: 0.08386
## F-statistic: 13.27 on 33 and 4390 DF, p-value: < 2.2e-16
Regression with Interaction between Age at Enrollment and Course:
A significant interaction suggests that the effect of age on the admission grade is different for that course compared to the reference course.
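The interaction coefficient for “Animation and Multimedia Design” and “Age at enrollment” is -1.66962, which is statistically significant (p-value = 0.0353). This coefficient is the difference between the age slope in that course and the age slope in the reference course. The estimated slope is 0.51013 in the reference course and 0.51013 - 1.66962 = -1.15949 in “Animation and Multimedia Design”, so each additional year of age is associated with an admission grade about 1.16 units lower in that course. Neither the reference-course slope (p-value = 0.4918) nor any other interaction term is significant at the 5% level.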
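Written out with the estimates from the output above, the fitted lines for the reference course (code 33) and for “Animation and Multimedia Design” (code 171) are:

\[
\widehat{\text{grade}}_{33} = 103.46709 + 0.51013 \cdot \text{age}
\]

\[
\widehat{\text{grade}}_{171} = (103.46709 + 56.20797) + (0.51013 - 1.66962) \cdot \text{age}
                             = 159.67506 - 1.15949 \cdot \text{age}
\]

The intercepts describe a hypothetical student of age 0, far outside the observed ages, which is one reason the main-effect standard errors in this model are so large. Only the slopes have a direct interpretation here.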
Model Fit:
The adjusted R-squared is 0.08386, so the model with the interaction terms explains approximately 8.4% of the variability in the admission grade. This is a slight improvement over the 0.07473 of the previous model without interaction terms. The overall F-statistic is significant (p-value < 2.2e-16).
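The overall F-statistic compares the model with an intercept-only model, so it does not show whether the interaction terms themselves add anything. One way to check is a partial F-test between the additive and interaction models. The sketch below computes it from the R-squared values and degrees of freedom printed in the two summaries; `nested_f` is a hypothetical helper written for this illustration, and the result is approximate because the printed R-squared values are rounded.

```r
# Partial F-test for nested linear models, computed from R-squared values.
# r2_small / r2_big: R-squared of the smaller and larger model
# df_extra: number of additional coefficients in the larger model
# df_resid_big: residual degrees of freedom of the larger model
nested_f <- function(r2_small, r2_big, df_extra, df_resid_big) {
  f <- ((r2_big - r2_small) / df_extra) / ((1 - r2_big) / df_resid_big)
  p <- pf(f, df_extra, df_resid_big, lower.tail = FALSE)
  c(F = f, p = p)
}

# Additive model:    R-squared = 0.07829
# Interaction model: R-squared = 0.09069, 4390 residual df, 16 extra terms
interaction_test <- nested_f(0.07829, 0.09069, 16, 4390)
print(interaction_test)
```

With both fitted model objects available, `anova()` applied to the additive and interaction models would give the same comparison without rounding. The test here indicates that the interaction terms jointly improve on the additive model, even though only one of them is individually significant.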
Summary:
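For one course, “Animation and Multimedia Design”, the effect of age on admission grade differs significantly from that in the reference course (“Biofuel Production Technologies”). For the other courses, the interaction with age is not individually significant, so there is no evidence that their age effect differs from that of the reference course. A question for further investigation is whether the age effect differs across courses as a whole, which is better judged by a joint test of all interaction terms than by the individual coefficients.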