Data Dive 8

Select a continuous (or ordered integer) column of data that seems most “valuable” given the context of your data, and call this your response variable. For example, in the Ames housing data, the price of the house is likely of the most value to both buyers and sellers. This is the thing most people will ask about when it comes to houses.

Select a categorical column of data (explanatory variable) that you expect might influence the response variable.

Devise a null hypothesis for an ANOVA test given this situation. Test this hypothesis using ANOVA, and summarize your results. Be clear about how the R output relates to your conclusions. If there are more than 10 categories, consider consolidating them before running the test using the methods we’ve learned in class. Explain what this might mean for people who may be interested in your data. E.g., “there is not enough evidence to conclude [...], so it would be safe to assume that we can [...]”.

Find at least one other continuous (or ordered integer) column of data that might influence the response variable. Make sure the relationship between this variable and the response is roughly linear.

Build a linear regression model of the response using just this column, and evaluate its fit. Run appropriate hypothesis tests and summarize their results. Use diagnostic plots to identify any issues with your model. Interpret the coefficients of your model, and explain how they relate to the context of your data. For example, can you make any recommendations about an optimal way of doing something?

Include at least one other variable into your regression model (e.g., you might use the one from the ANOVA), and evaluate how it helps (or doesn’t). Maybe include an interaction term, but explain why you included it. You can add up to 4 variables if you like.
For each of the above tasks, you must explain to the reader what insight was gathered, its significance, and any further questions you have which might need to be further investigated.

Importing the libraries and loading the data

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## Rows: 4424 Columns: 37
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ";"
## chr  (1): Target
## dbl (36): Marital status, Application mode, Application order, Course, Dayti...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 5 × 37
##   `Marital status` `Application mode` `Application order` Course
##              <dbl>              <dbl>               <dbl>  <dbl>
## 1                1                 17                   5    171
## 2                1                 15                   1   9254
## 3                1                  1                   5   9070
## 4                1                 17                   2   9773
## 5                2                 39                   1   8014
## # ℹ 33 more variables: `Daytime/evening attendance\t` <dbl>,
## #   `Previous qualification` <dbl>, `Previous qualification (grade)` <dbl>,
## #   Nacionality <dbl>, `Mother's qualification` <dbl>,
## #   `Father's qualification` <dbl>, `Mother's occupation` <dbl>,
## #   `Father's occupation` <dbl>, `Admission grade` <dbl>, Displaced <dbl>,
## #   `Educational special needs` <dbl>, Debtor <dbl>,
## #   `Tuition fees up to date` <dbl>, Gender <dbl>, …

Response Variable

We choose “Admission grade” as the response variable because it is the grade on which a student is admitted, which makes it the value students and institutions are most likely to ask about.

ANOVA Hypothesis:

\(H_0\): There is no difference in the mean admission grade across the different courses.

\(H_a\): There is a difference in the mean admission grade across the different courses.

We will use “Admission grade” as the dependent variable and “Course” (17 categories, treated as a factor) as the independent variable to test the hypothesis.
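The test described above can be sketched as follows. This is a minimal illustration, not the code behind the report: the data frame `toy` is synthetic stand-in data with three made-up groups, and the object names (`toy`, `fit_aov`, `aov_table`) are assumptions. Only the formula mirrors the `as.factor(Course)` term that appears in the R output.

```r
# Synthetic stand-in data (NOT the real data set): three course codes, 30 rows each.
set.seed(1)
toy <- data.frame(Course = rep(c(33, 171, 9003), each = 30))
toy$`Admission grade` <- rnorm(90, mean = rep(c(120, 137, 134), each = 30), sd = 14)

# One-way ANOVA: Course is stored as a number, so convert it to a factor.
fit_aov <- aov(`Admission grade` ~ as.factor(Course), data = toy)

# summary() returns a list; its first element is the ANOVA table.
aov_table <- summary(fit_aov)[[1]]
print(aov_table)

# The F statistic and its p-value sit in the first row (the Course row).
f_value <- aov_table[["F value"]][1]
p_value <- aov_table[["Pr(>F)"]][1]
```

Converting `Course` with `as.factor()` matters: left as a number, R would fit a single slope for the course code instead of comparing group means.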

Continuous Explanatory Variable:

A continuous variable that might influence “Admission grade” is “Age at enrollment”, since age might plausibly play a role in academic performance.

Linear Regression:

We fit a linear regression model with “Admission grade” as the dependent variable and “Age at enrollment” as the independent variable. This will help us understand the relationship between age at the time of enrollment and the admission grade.
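A simple regression of this form can be sketched as below. The data are synthetic and generated only for illustration, and the names `toy`, `fit_age`, `coef_table`, `slope_ci` and `r_squared` are assumptions; the formula is the one shown in the `Call:` line of the R output.

```r
# Synthetic stand-in data (NOT the real data set).
set.seed(2)
toy <- data.frame(age = sample(17:60, 200, replace = TRUE))
names(toy) <- "Age at enrollment"
toy$`Admission grade` <- 128 - 0.06 * toy$`Age at enrollment` + rnorm(200, sd = 14)

# Simple linear regression of admission grade on age.
fit_age <- lm(`Admission grade` ~ `Age at enrollment`, data = toy)

# Coefficient table: estimate, standard error, t value and p-value per term.
coef_table <- coef(summary(fit_age))
print(coef_table)

# 95% confidence interval for the slope (second row) and the model's R-squared.
slope_ci <- confint(fit_age)[2, ]
r_squared <- summary(fit_age)$r.squared
```

Reading the confidence interval next to the p-value shows both whether the slope differs from zero and how large it could plausibly be.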

##                     Df Sum Sq Mean Sq F value Pr(>F)    
## as.factor(Course)   16  71958    4497   23.16 <2e-16 ***
## Residuals         4407 855671     194                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Call:
## lm(formula = `Admission grade` ~ `Age at enrollment`, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -32.222  -9.222  -0.979   7.735  64.149 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         128.30647    0.70204  182.76   <2e-16 ***
## `Age at enrollment`  -0.05710    0.02869   -1.99   0.0466 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.48 on 4422 degrees of freedom
## Multiple R-squared:  0.0008949,  Adjusted R-squared:  0.000669 
## F-statistic: 3.961 on 1 and 4422 DF,  p-value: 0.04663
## 
## Call:
## lm(formula = `Admission grade` ~ `Age at enrollment` + as.factor(Course), 
##     data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -36.292  -8.775  -0.935   7.516  67.972 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           120.31928    4.13196  29.119  < 2e-16 ***
## `Age at enrollment`    -0.05952    0.03210  -1.854 0.063767 .  
## as.factor(Course)171   17.04439    4.14281   4.114 3.96e-05 ***
## as.factor(Course)8014   3.87763    4.13440   0.938 0.348351    
## as.factor(Course)9003  14.08443    4.13509   3.406 0.000665 ***
## as.factor(Course)9070   9.82645    4.13569   2.376 0.017543 *  
## as.factor(Course)9085  12.14502    4.10018   2.962 0.003072 ** 
## as.factor(Course)9119   7.47407    4.16556   1.794 0.072841 .  
## as.factor(Course)9130  13.45984    4.19422   3.209 0.001341 ** 
## as.factor(Course)9147   3.88567    4.08975   0.950 0.342113    
## as.factor(Course)9238   6.51777    4.09770   1.591 0.111773    
## as.factor(Course)9254   2.58493    4.12247   0.627 0.530669    
## as.factor(Course)9500   8.18639    4.06220   2.015 0.043938 *  
## as.factor(Course)9556   4.26829    4.29506   0.994 0.320391    
## as.factor(Course)9670   2.44252    4.11793   0.593 0.553116    
## as.factor(Course)9773   8.87689    4.10332   2.163 0.030568 *  
## as.factor(Course)9853   4.13851    4.15271   0.997 0.319023    
## as.factor(Course)9991  11.86650    4.11166   2.886 0.003920 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.93 on 4406 degrees of freedom
## Multiple R-squared:  0.07829,    Adjusted R-squared:  0.07473 
## F-statistic: 22.01 on 17 and 4406 DF,  p-value: < 2.2e-16

ANOVA Result for Different Courses on Admission Grade:

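The ANOVA gives F = 23.16 on 16 and 4407 degrees of freedom with p < 2e-16, so we reject the null hypothesis at the 5% level: the mean “Admission grade” differs across the courses. The test shows only that at least one course mean differs from the others; it does not identify which ones, which needs further investigation.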
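To connect the conclusion to the R output, the F value can be reproduced from the ANOVA table itself. Each mean square is a sum of squares divided by its degrees of freedom (17 - 1 = 16 between courses and 4424 - 17 = 4407 within courses), and F is the ratio of the two:

\[
F = \frac{SS_{\text{Course}} / df_{\text{Course}}}{SS_{\text{Residuals}} / df_{\text{Residuals}}}
  = \frac{71958 / 16}{855671 / 4407}
  \approx \frac{4497.4}{194.2}
  \approx 23.16
\]

The variation between course means is therefore about 23 times larger than the variation expected from within-course noise alone.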

Linear Regression with Age at Enrollment:

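An increase in age by one year is associated with a decrease in the admission grade of approximately 0.0571 units. The slope is only just significant at the 5% level (p = 0.0466), and the R-squared of 0.0009 shows that age explains less than 0.1% of the variation in admission grade, so the relationship is statistically detectable but practically negligible.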
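As a worked illustration of these coefficients, the fitted line from the output is

\[
\widehat{\text{Admission grade}} = 128.306 - 0.0571 \times \text{Age at enrollment}
\]

The ages 20 and 40 below are chosen only as examples:

\[
128.306 - 0.0571 \times 20 \approx 127.16, \qquad 128.306 - 0.0571 \times 40 \approx 126.02
\]

A 20-year age gap changes the predicted grade by about 1.14 units, which is small next to the residual standard error of 14.48.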

Enhanced Regression Model with Age at Enrollment and Course:

The age coefficient is similar to the previous model (-0.0595 versus -0.0571), but it is no longer statistically significant at the 5% level (p = 0.064). Adding Course raises the adjusted R-squared from 0.0007 to 0.0747, so course explains far more of the variation in admission grade than age does, although more than 90% of the variation remains unexplained.

Relative to the reference course (Biofuel Production Technologies - 33, which isn’t listed among the coefficients) and holding age constant: Students in “Animation and Multimedia Design” have, on average, an admission grade that’s approximately 17.04 units higher. Students in “Agronomy” have an average admission grade that’s approximately 14.08 units higher. Students in “Communication Design” and “Veterinary Nursing” have higher average admission grades by about 9.83 and 12.15 units, respectively. Other courses, such as “Nursing”, “Journalism and Communication”, and “Management (evening attendance)”, also show statistically significant differences in admission grades relative to the reference course. Several courses (for example, codes 8014, 9147 and 9254) do not differ significantly from the reference course.

Summary: what do these results mean for people?

Age has at most a slight negative association with the admission grade: roughly 0.06 units per year, which is negligible in practice and no longer significant once course is accounted for. Courses like “Animation and Multimedia Design”, “Agronomy”, and “Veterinary Nursing” have notably higher average admission grades compared to “Biofuel Production Technologies”. These findings suggest that certain courses either attract higher-performing students or have different admission criteria; the data alone cannot separate the two explanations.

Interaction

The effect of age on the admission grade may differ across courses. For some courses, older students might have higher grades due to more life experience, while in other courses younger students might perform better due to more recent training.
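One way to examine this idea is to fit the model with and without the interaction and compare the two. The sketch below uses synthetic data with three made-up course groups; the names `toy`, `fit_add`, `fit_int` and `cmp` are assumptions, and the formulas follow the `Call:` lines in the R output.

```r
# Synthetic stand-in data (NOT the real data set).
set.seed(3)
n <- 300
toy <- data.frame(Course = sample(c(33, 171, 9003), n, replace = TRUE))
toy$`Age at enrollment` <- sample(17:60, n, replace = TRUE)
toy$`Admission grade` <- 125 + rnorm(n, sd = 14)

# Additive model: one common age slope, separate intercept shift per course.
fit_add <- lm(`Admission grade` ~ `Age at enrollment` + as.factor(Course), data = toy)

# Interaction model: `*` expands to both main effects plus age-by-course terms,
# so every course gets its own age slope.
fit_int <- lm(`Admission grade` ~ `Age at enrollment` * as.factor(Course), data = toy)

# F-test of the nested models: do the interaction terms jointly improve the fit?
cmp <- anova(fit_add, fit_int)
print(cmp)
```

The nested comparison tests all interaction terms at once, which complements reading the individual interaction p-values one by one.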

## 
## Call:
## lm(formula = `Admission grade` ~ `Age at enrollment` * as.factor(Course), 
##     data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -38.804  -8.759  -0.863   7.638  64.487 
## 
## Coefficients:
##                                            Estimate Std. Error t value Pr(>|t|)
## (Intercept)                               103.46709   22.31484   4.637 3.64e-06
## `Age at enrollment`                         0.51013    0.74208   0.687   0.4918
## as.factor(Course)171                       56.20797   23.04482   2.439   0.0148
## as.factor(Course)8014                      15.68849   22.57307   0.695   0.4871
## as.factor(Course)9003                      41.08147   22.54286   1.822   0.0685
## as.factor(Course)9070                      32.67899   22.61710   1.445   0.1486
## as.factor(Course)9085                      24.27753   22.59014   1.075   0.2826
## as.factor(Course)9119                      18.53855   22.64771   0.819   0.4131
## as.factor(Course)9130                      33.47023   22.69551   1.475   0.1404
## as.factor(Course)9147                      21.61324   22.42232   0.964   0.3351
## as.factor(Course)9238                      24.96582   22.48954   1.110   0.2670
## as.factor(Course)9254                      18.93650   22.50107   0.842   0.4001
## as.factor(Course)9500                      25.91865   22.41034   1.157   0.2475
## as.factor(Course)9556                       5.33596   22.84650   0.234   0.8153
## as.factor(Course)9670                      15.52882   22.55797   0.688   0.4912
## as.factor(Course)9773                      21.66922   22.45704   0.965   0.3346
## as.factor(Course)9853                      29.46271   22.63718   1.302   0.1931
## as.factor(Course)9991                      25.33547   22.62235   1.120   0.2628
## `Age at enrollment`:as.factor(Course)171   -1.66962    0.79309  -2.105   0.0353
## `Age at enrollment`:as.factor(Course)8014  -0.42107    0.74831  -0.563   0.5737
## `Age at enrollment`:as.factor(Course)9003  -0.93608    0.75022  -1.248   0.2122
## `Age at enrollment`:as.factor(Course)9070  -0.85394    0.76109  -1.122   0.2619
## `Age at enrollment`:as.factor(Course)9085  -0.35189    0.75880  -0.464   0.6429
## `Age at enrollment`:as.factor(Course)9119  -0.32250    0.75889  -0.425   0.6709
## `Age at enrollment`:as.factor(Course)9130  -0.70667    0.76182  -0.928   0.3537
## `Age at enrollment`:as.factor(Course)9147  -0.60765    0.74752  -0.813   0.4163
## `Age at enrollment`:as.factor(Course)9238  -0.64512    0.75298  -0.857   0.3916
## `Age at enrollment`:as.factor(Course)9254  -0.54728    0.75221  -0.728   0.4669
## `Age at enrollment`:as.factor(Course)9500  -0.61168    0.74823  -0.817   0.4137
## `Age at enrollment`:as.factor(Course)9556   0.05707    0.76486   0.075   0.9405
## `Age at enrollment`:as.factor(Course)9670  -0.39710    0.75636  -0.525   0.5996
## `Age at enrollment`:as.factor(Course)9773  -0.37433    0.75105  -0.498   0.6182
## `Age at enrollment`:as.factor(Course)9853  -0.95898    0.76103  -1.260   0.2077
## `Age at enrollment`:as.factor(Course)9991  -0.46639    0.75026  -0.622   0.5342
##                                              
## (Intercept)                               ***
## `Age at enrollment`                          
## as.factor(Course)171                      *  
## as.factor(Course)8014                        
## as.factor(Course)9003                     .  
## as.factor(Course)9070                        
## as.factor(Course)9085                        
## as.factor(Course)9119                        
## as.factor(Course)9130                        
## as.factor(Course)9147                        
## as.factor(Course)9238                        
## as.factor(Course)9254                        
## as.factor(Course)9500                        
## as.factor(Course)9556                        
## as.factor(Course)9670                        
## as.factor(Course)9773                        
## as.factor(Course)9853                        
## as.factor(Course)9991                        
## `Age at enrollment`:as.factor(Course)171  *  
## `Age at enrollment`:as.factor(Course)8014    
## `Age at enrollment`:as.factor(Course)9003    
## `Age at enrollment`:as.factor(Course)9070    
## `Age at enrollment`:as.factor(Course)9085    
## `Age at enrollment`:as.factor(Course)9119    
## `Age at enrollment`:as.factor(Course)9130    
## `Age at enrollment`:as.factor(Course)9147    
## `Age at enrollment`:as.factor(Course)9238    
## `Age at enrollment`:as.factor(Course)9254    
## `Age at enrollment`:as.factor(Course)9500    
## `Age at enrollment`:as.factor(Course)9556    
## `Age at enrollment`:as.factor(Course)9670    
## `Age at enrollment`:as.factor(Course)9773    
## `Age at enrollment`:as.factor(Course)9853    
## `Age at enrollment`:as.factor(Course)9991    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.86 on 4390 degrees of freedom
## Multiple R-squared:  0.09069,    Adjusted R-squared:  0.08386 
## F-statistic: 13.27 on 33 and 4390 DF,  p-value: < 2.2e-16

Regression with Interaction between Age at Enrollment and Course:

A significant interaction suggests that the effect of age on the admission grade is different for that course compared to the reference course.

The interaction term for “Animation and Multimedia Design” and “Age at enrollment” is -1.66962, which is statistically significant (p-value = 0.0353). This means that the age slope for students in “Animation and Multimedia Design” is about 1.67 units per year lower than the age slope in the reference course. The age slope in the reference course itself (0.51) is not significantly different from zero (p = 0.49).
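In an interaction model, the age slope for a given course is the main age coefficient plus that course's interaction coefficient. Using the estimates from the output:

\[
\text{slope}_{\text{reference}} = 0.51013, \qquad
\text{slope}_{171} = 0.51013 + (-1.66962) = -1.15949 \approx -1.16
\]

So within “Animation and Multimedia Design”, each additional year of age is associated with an admission grade about 1.16 units lower, whereas the interaction coefficient of -1.67 is the difference between the two slopes, not a slope in its own right.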

Model Fit:

The adjusted R-squared value is 0.08386, indicating that the model with the interaction terms explains approximately 8.4% of the variability in the admission grade after adjusting for the number of predictors. This is a slight improvement over the previous model without interaction terms (0.07473). The overall F-statistic is significant (p-value < 2.2e-16), but more than 90% of the variability remains unexplained.
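Whether the 16 interaction terms improve the fit as a group can be estimated with a partial F-test built from the two reported multiple R-squared values. Because those values are rounded in the output, the result is approximate:

\[
F = \frac{(R^2_{\text{full}} - R^2_{\text{reduced}}) / q}{(1 - R^2_{\text{full}}) / df_{\text{full}}}
  = \frac{(0.09069 - 0.07829) / 16}{(1 - 0.09069) / 4390}
  \approx \frac{0.000775}{0.000207}
  \approx 3.74
\]

Here \(q = 16\) is the number of added interaction terms and \(df_{\text{full}} = 4390\) is the residual degrees of freedom of the interaction model. The 5% critical value for an F distribution with 16 and 4390 degrees of freedom is roughly 1.65, so taken together the interaction terms improve the fit by more than chance alone would suggest, even though the gain in explained variance is only about 1.2 percentage points.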

Summary:

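Only for “Animation and Multimedia Design” does the effect of age on admission grade differ significantly from the reference course (“Biofuel Production Technologies”). For the other 15 courses, the interaction with age is not statistically significant, so there is no evidence that the effect of age differs from that in the reference course. Because 16 interaction terms are tested, one individual p-value of 0.035 is weak evidence on its own, and this result needs further investigation before drawing a firm conclusion.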