NikhilBharadwaj

Build a linear (or generalized linear) model as you like

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

df <- read.csv('./Downloads/students_dropout_and_academic_success.csv')

# Build a linear regression model
lm_model <- lm(Admission_grade ~ Previous_qualification_grade + Age_at_enrollment + Gender, data = df)

summary(lm_model)

## 
## Call:
## lm(formula = Admission_grade ~ Previous_qualification_grade + 
##     Age_at_enrollment + Gender, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -50.480  -5.722  -0.376   5.928  64.823 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  40.04020    1.94174  20.621   <2e-16 ***
## Previous_qualification_grade  0.64279    0.01352  47.550   <2e-16 ***
## Age_at_enrollment             0.05831    0.02373   2.457   0.0141 *  
## Gender                        0.96408    0.37517   2.570   0.0102 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.78 on 4420 degrees of freedom
## Multiple R-squared:  0.3391, Adjusted R-squared:  0.3387 
## F-statistic:   756 on 3 and 4420 DF,  p-value: < 2.2e-16

Significance:

All three predictors have p-values below 0.05, so each coefficient is statistically distinguishable from zero. Previous Qualification Grade is by far the strongest (p < 2e-16), while Age at Enrollment (p = 0.0141) and Gender (p = 0.0102) are significant only at the 5% level. Statistical significance indicates an association with Admission Grade, not necessarily a large or causal effect.

R-squared: The R-squared value is 0.3391, so approximately 33.91% of the variance in Admission Grade is explained by the combination of Previous Qualification Grade, Age at Enrollment, and Gender. The model therefore explains a moderate portion of the variability in Admission Grade.
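The significance statements above can be complemented with approximate 95% confidence intervals. The sketch below rebuilds them from the estimates and standard errors printed in the summary. The object name coef_table is an assumption made for illustration. On the fitted object itself, confint(lm_model) should give the same intervals up to rounding.

```r
# Estimates and standard errors copied from the summary() output above
coef_table <- data.frame(
  term     = c("Previous_qualification_grade", "Age_at_enrollment", "Gender"),
  estimate = c(0.64279, 0.05831, 0.96408),
  std_err  = c(0.01352, 0.02373, 0.37517)
)

# Two-sided 95% critical value of the t distribution with 4420 residual df
t_crit <- qt(0.975, df = 4420)

# Interval = estimate +/- critical value * standard error
coef_table$lower <- coef_table$estimate - t_crit * coef_table$std_err
coef_table$upper <- coef_table$estimate + t_crit * coef_table$std_err

print(coef_table)
```

An interval that excludes zero corresponds to a two-sided p-value below 0.05 for that coefficient.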

Adjusted R-squared: The adjusted R-squared value is similar to the R-squared value but accounts for the number of predictors in the model. In this case, it’s 0.3387.

Residual Standard Error: This is an estimate of the standard deviation of the model’s errors (residuals), and it is approximately 11.78. It represents the typical difference between the predicted Admission Grade and the actual Admission Grade.

F-statistic: The F-statistic tests whether the overall model is statistically significant. The high F-statistic (756) and very low p-value (p < 2.2e-16) indicate that the model as a whole is statistically significant.
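As a rough cross-check, the F-statistic can be recovered from the reported R-squared and degrees of freedom. This is a sketch that uses the rounded values printed above, so the result should only match the summary approximately. The variable names are illustrative.

```r
# Values copied from the summary() output above
r_squared   <- 0.3391  # multiple R-squared (rounded)
k           <- 3       # number of predictors
df_residual <- 4420    # residual degrees of freedom

# F = (explained share per predictor) / (unexplained share per residual df)
f_stat <- (r_squared / k) / ((1 - r_squared) / df_residual)

# Upper-tail probability of the F distribution with (k, df_residual) df
p_value <- pf(f_stat, df1 = k, df2 = df_residual, lower.tail = FALSE)

print(round(f_stat, 1))
print(p_value)
```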

Use the tools from previous weeks to diagnose the model

Residual Analysis

# Residual vs. Fitted Plot
plot(lm_model, which = 1)

# Normal Q-Q Plot
plot(lm_model, which = 2)

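# All four default diagnostic plots (Residuals vs Fitted, Normal Q-Q,
# Scale-Location, Residuals vs Leverage) in a 2 x 2 grid
par(mfrow = c(2, 2))
plot(lm_model)
par(mfrow = c(1, 1))  # restore the single-plot layout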
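The plot(lm_model) call above produces the standard lm diagnostics. None of them plots residuals against an individual predictor. A minimal sketch of that extra check follows. It uses the built-in mtcars data as a stand-in so that it runs on its own. The names demo_model and predictors are illustrative, and for the model in this report the same loop would presumably use lm_model, df, and its three predictors.

```r
# Stand-in model on a built-in dataset (not the student data used in this report)
demo_model <- lm(mpg ~ wt + hp, data = mtcars)
predictors <- c("wt", "hp")

# One panel per predictor; keep the old layout so it can be restored
old_par <- par(mfrow = c(1, length(predictors)))
for (p in predictors) {
  plot(mtcars[[p]], resid(demo_model),
       xlab = p, ylab = "Residuals",
       main = paste("Residuals vs", p))
  abline(h = 0, lty = 2)  # reference line at zero
}
par(old_par)
```

A band of points scattered evenly around zero, with no curve or funnel shape, suggests that linearity and constant variance are reasonable for that predictor.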

Outliers and Influential Observations:

library(car)

## Loading required package: carData

## 
## Attaching package: 'car'

## The following object is masked from 'package:dplyr':
## 
##     recode

# Influence plot: Studentized residuals vs. hat values, with circle area
# proportional to Cook's distance. The original call passed id.n = 4, which
# influencePlot() does not recognize; it was ignored apart from triggering
# graphical-parameter warnings, so it is dropped here.
influencePlot(lm_model, main = "Influence Plot")

##        StudRes          Hat       CookD
## 690  -1.214736 0.0089723565 0.003339472
## 703   2.076352 0.0069753481 0.007565240
## 1674  5.523747 0.0004389853 0.003327807
## 2270  4.539374 0.0061915392 0.031952569
## 2485  5.297292 0.0033154914 0.023194617

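The table lists the observations that influencePlot() flags as noteworthy: those with the largest Studentized residuals (StudRes), hat values (Hat), or Cook’s distances (CookD). Observations 1674, 2270, and 2485 have Studentized residuals above 4.5, so their admission grades are far higher than the model predicts. Observations 690, 703, and 2270 have the largest hat values (roughly 0.006 to 0.009, against an average of about 0.0009 for 4,424 observations and 4 coefficients), meaning their predictor values are unusual. The largest Cook’s distances are 0.032 (observation 2270) and 0.023 (observation 2485), far below the common 0.5 cutoff, so no single observation changes the fit much. Observations 2270 and 2485 combine a large residual with elevated leverage and are the ones most worth inspecting.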
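The flagged observations can be compared against common rule-of-thumb cutoffs. The sketch below copies the table values by hand. The cutoffs (|StudRes| > 3, Hat > 2p/n, CookD > 0.5) are conventions assumed here rather than formal tests, and the object name infl is illustrative.

```r
# Values copied from the influencePlot() table above
infl <- data.frame(
  obs     = c(690, 703, 1674, 2270, 2485),
  StudRes = c(-1.214736, 2.076352, 5.523747, 4.539374, 5.297292),
  Hat     = c(0.0089723565, 0.0069753481, 0.0004389853,
              0.0061915392, 0.0033154914),
  CookD   = c(0.003339472, 0.007565240, 0.003327807,
              0.031952569, 0.023194617)
)

n <- 4424  # observations: 4420 residual df + 4 estimated coefficients
p <- 4     # estimated coefficients, including the intercept

# Rule-of-thumb flags (assumed conventions, not formal tests)
infl$outlier        <- abs(infl$StudRes) > 3
infl$high_leverage  <- infl$Hat > 2 * p / n
infl$cook_over_half <- infl$CookD > 0.5

print(infl)
```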

Multicollinearity:

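# car is already loaded above; vif() computes variance inflation factors.
# Store the result under a name that does not shadow the vif() function.
vif_values <- vif(lm_model)
print(vif_values)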

## Previous_qualification_grade            Age_at_enrollment 
##                     1.013589                     1.034175 
##                       Gender 
##                     1.023692

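The variance inflation factors (VIFs) are all close to 1, the value expected when a predictor is uncorrelated with the others, and far below the commonly used warning levels of 5 or 10. Multicollinearity is therefore not a concern for this model, and the coefficient standard errors are not inflated by overlap between predictors.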
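To make the VIFs more concrete, each one can be converted back into the R-squared obtained when that predictor is regressed on the other two, using VIF = 1 / (1 - R²). This is a sketch based on the printed values. The object names are illustrative.

```r
# VIFs copied from the vif() output above
vif_reported <- c(
  Previous_qualification_grade = 1.013589,
  Age_at_enrollment            = 1.034175,
  Gender                       = 1.023692
)

# VIF_j = 1 / (1 - R_j^2), so R_j^2 = 1 - 1 / VIF_j
implied_r2 <- 1 - 1 / vif_reported

print(round(implied_r2, 4))
```

Each implied R-squared is the share of a predictor's variance that the other two predictors explain, so small values mean little overlap.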

Hypothesis Tests:

coeff_summary <- summary(lm_model)$coefficients
print(coeff_summary)

##                                 Estimate Std. Error   t value     Pr(>|t|)
## (Intercept)                  40.04019713 1.94173511 20.620834 2.799995e-90
## Previous_qualification_grade  0.64278864 0.01351818 47.549927 0.000000e+00
## Age_at_enrollment             0.05830637 0.02373326  2.456736 1.405873e-02
## Gender                        0.96408010 0.37517478  2.569683 1.021166e-02

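All three predictor variables (Previous Qualification Grade, Age at Enrollment, and Gender) have statistically significant coefficients at the 5% level, meaning each is associated with Admission Grade after adjusting for the other two. The evidence is overwhelming for Previous Qualification Grade, whose p-value prints as 0 because it is smaller than the smallest positive number R can represent. It is more modest for Age at Enrollment (p ≈ 0.014) and Gender (p ≈ 0.010), which would not be significant at the 1% level.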
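The t values and p-values in the table can be reproduced by hand from the estimates and standard errors, which shows what each test actually computes. The sketch below assumes the usual two-sided test of a zero coefficient with 4420 residual degrees of freedom. The variable names are illustrative.

```r
# Estimates and standard errors copied from the coefficient table above
estimate <- c(Intercept                    = 40.04019713,
              Previous_qualification_grade = 0.64278864,
              Age_at_enrollment            = 0.05830637,
              Gender                       = 0.96408010)
std_err  <- c(1.94173511, 0.01351818, 0.02373326, 0.37517478)

# t statistic for H0: coefficient = 0
t_value <- estimate / std_err

# Two-sided p-value from the t distribution with 4420 residual df
p_value <- 2 * pt(-abs(t_value), df = 4420)

print(cbind(t_value, p_value))
```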

Model Fit:

rsquared <- summary(lm_model)$r.squared
adjusted_rsquared <- summary(lm_model)$adj.r.squared
print(paste("R-squared:", round(rsquared, 4)))

## [1] "R-squared: 0.3391"

print(paste("Adjusted R-squared:", round(adjusted_rsquared, 4)))

## [1] "Adjusted R-squared: 0.3387"

The R-squared value is 0.3391, indicating that approximately 33.91% of the variance in the response variable (Admission Grade) is explained by the model. The adjusted R-squared (0.3387) is almost identical because, with 4,424 observations and only three predictors, the penalty for model size is negligible. The small gap gives no sign that the model is padded with uninformative predictors.
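The relationship between the two fit measures can be checked directly. The sketch below applies the standard adjusted R-squared formula to the rounded R-squared reported above. The variable names are illustrative, and n is inferred from the 4420 residual degrees of freedom plus 4 estimated coefficients.

```r
r_squared <- 0.3391  # multiple R-squared (rounded) from the output above
n         <- 4424    # observations: 4420 residual df + 4 coefficients
k         <- 3       # number of predictors

# Adjusted R-squared rescales the unexplained share by (n - 1) / (n - k - 1)
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / (n - k - 1)

print(round(adj_r_squared, 4))
```

Because (n - 1) / (n - k - 1) is very close to 1 here, the adjustment barely moves the value.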

Interpret at least one of the coefficients

The coefficient for “Previous Qualification Grade” is approximately 0.6428. This means that for a one-unit increase in the “Previous Qualification Grade” of a student, we would expect the “Admission Grade” to increase by approximately 0.6428 units, holding all other variables constant.

In practical terms, this suggests that students with higher previous qualification grades tend to have higher admission grades. For example, if two students have all other factors (such as age and gender) equal but differ in their previous qualification grades by one unit, the student with the higher qualification grade is expected to have an admission grade that is approximately 0.6428 units higher.
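The comparison of two otherwise identical students can be worked through with the fitted equation. This is a sketch only: the helper predict_admission, the student values, and the 0/1 coding of Gender are assumptions made for illustration. With the fitted object available, predict(lm_model, newdata = ...) would be the usual route.

```r
# Coefficients copied from the summary() output above
b <- c(intercept = 40.04020, prev_grade = 0.64279,
       age = 0.05831, gender = 0.96408)

# Hypothetical helper that evaluates the fitted equation by hand
predict_admission <- function(prev_grade, age, gender) {
  b[["intercept"]] + b[["prev_grade"]] * prev_grade +
    b[["age"]] * age + b[["gender"]] * gender
}

# Two hypothetical students who differ only in previous qualification grade
student_a <- predict_admission(prev_grade = 130, age = 20, gender = 0)
student_b <- predict_admission(prev_grade = 140, age = 20, gender = 0)

# A 10-unit gap in previous grade maps to 10 * 0.64279 units of admission grade
print(student_b - student_a)
```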

NikhilBharadwaj_DD11

2023-11-05