# Load the necessary library
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggthemes)
library(ggrepel)
library(effsize)
library(pwrss)
##
## Attaching package: 'pwrss'
##
## The following object is masked from 'package:stats':
##
## power.t.test
library(boot)
library(broom)
library(lindia)
# dplyr and ggplot2 are already attached by tidyverse; reloading them is harmless
library(dplyr)
library(ggplot2)
# Read the semicolon-delimited student dataset
# (note: the name `mpg` masks ggplot2's built-in mpg dataset)
mpg <- read_delim("C:/Users/kondo/OneDrive/Desktop/INTRO to Statistics and R/Data Set and work/data.csv",
                  delim = ";", show_col_types = FALSE)
glimpse(mpg)
## Rows: 4,424
## Columns: 37
## $ `Marital status` <dbl> 1, 1, 1, 1, 2, 2, 1, …
## $ `Application mode` <dbl> 17, 15, 1, 17, 39, 39…
## $ `Application order` <dbl> 5, 1, 5, 2, 1, 1, 1, …
## $ Course <dbl> 171, 9254, 9070, 9773…
## $ `Daytime/evening attendance\t` <dbl> 1, 1, 1, 1, 0, 0, 1, …
## $ `Previous qualification` <dbl> 1, 1, 1, 1, 1, 19, 1,…
## $ `Previous qualification (grade)` <dbl> 122.0, 160.0, 122.0, …
## $ Nacionality <dbl> 1, 1, 1, 1, 1, 1, 1, …
## $ `Mother's qualification` <dbl> 19, 1, 37, 38, 37, 37…
## $ `Father's qualification` <dbl> 12, 3, 37, 37, 38, 37…
## $ `Mother's occupation` <dbl> 5, 3, 9, 5, 9, 9, 7, …
## $ `Father's occupation` <dbl> 9, 3, 9, 3, 9, 7, 10,…
## $ `Admission grade` <dbl> 127.3, 142.5, 124.8, …
## $ Displaced <dbl> 1, 1, 1, 1, 0, 0, 1, …
## $ `Educational special needs` <dbl> 0, 0, 0, 0, 0, 0, 0, …
## $ Debtor <dbl> 0, 0, 0, 0, 0, 1, 0, …
## $ `Tuition fees up to date` <dbl> 1, 0, 0, 1, 1, 1, 1, …
## $ Gender <dbl> 1, 1, 1, 0, 0, 1, 0, …
## $ `Scholarship holder` <dbl> 0, 0, 0, 0, 0, 0, 1, …
## $ `Age at enrollment` <dbl> 20, 19, 19, 20, 45, 5…
## $ International <dbl> 0, 0, 0, 0, 0, 0, 0, …
## $ `Curricular units 1st sem (credited)` <dbl> 0, 0, 0, 0, 0, 0, 0, …
## $ `Curricular units 1st sem (enrolled)` <dbl> 0, 6, 6, 6, 6, 5, 7, …
## $ `Curricular units 1st sem (evaluations)` <dbl> 0, 6, 0, 8, 9, 10, 9,…
## $ `Curricular units 1st sem (approved)` <dbl> 0, 6, 0, 6, 5, 5, 7, …
## $ `Curricular units 1st sem (grade)` <dbl> 0.00000, 14.00000, 0.…
## $ `Curricular units 1st sem (without evaluations)` <dbl> 0, 0, 0, 0, 0, 0, 0, …
## $ `Curricular units 2nd sem (credited)` <dbl> 0, 0, 0, 0, 0, 0, 0, …
## $ `Curricular units 2nd sem (enrolled)` <dbl> 0, 6, 6, 6, 6, 5, 8, …
## $ `Curricular units 2nd sem (evaluations)` <dbl> 0, 6, 0, 10, 6, 17, 8…
## $ `Curricular units 2nd sem (approved)` <dbl> 0, 6, 0, 5, 6, 5, 8, …
## $ `Curricular units 2nd sem (grade)` <dbl> 0.00000, 13.66667, 0.…
## $ `Curricular units 2nd sem (without evaluations)` <dbl> 0, 0, 0, 0, 0, 5, 0, …
## $ `Unemployment rate` <dbl> 10.8, 13.9, 10.8, 9.4…
## $ `Inflation rate` <dbl> 1.4, -0.3, 1.4, -0.8,…
## $ GDP <dbl> 1.74, 0.79, 1.74, -3.…
## $ Target <chr> "Dropout", "Graduate"…
For this analysis, I’ll select “Curricular units 1st sem (grade)” as the response variable. It represents the average grade in the 1st semester and is of high interest to both students and educators.
I will use “Gender” as the categorical explanatory variable and test whether there is a significant difference in 1st-semester grades between male and female students.
The null hypothesis (H0) is that there is no difference in mean 1st-semester grades between genders. The alternative hypothesis (Ha) is that there is a difference.
# Create a subset of the dataset for ANOVA
data_anova <- mpg %>%
  select(Gender, `Curricular units 1st sem (grade)`)
# Rename the columns for clarity
colnames(data_anova) <- c("Gender", "Grade")
# Perform a one-way ANOVA (Gender is coded 0/1; with only two levels,
# treating it as numeric or as a factor yields the same 1-df test)
anova_result <- aov(Grade ~ Gender, data = data_anova)
summary(anova_result)
## Df Sum Sq Mean Sq F value Pr(>F)
## Gender 1 3724 3724 164.6 <2e-16 ***
## Residuals 4422 100044 23
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The ANOVA results show a significant difference in 1st-semester grades based on gender. The p-value (Pr(>F)) is reported as <2e-16, far below the conventional 0.05 threshold. We can therefore reject the null hypothesis and conclude that gender has a significant influence on 1st-semester grades.
This means there is enough evidence to suggest a difference in academic performance (measured by 1st-semester grades) between male and female students.
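Because “Gender” has only two levels, this one-way ANOVA is equivalent to an equal-variance two-sample t-test, and the F-test alone says nothing about how large the gap is. Below is a minimal sketch of an effect-size check using the effsize package loaded earlier; it assumes Gender is coded 0/1 as shown in the glimpse() output.
# Standardized mean difference (Cohen's d) between the two gender codes
# cohen.d() comes from the effsize package attached above
cohen.d(data_anova$Grade, factor(data_anova$Gender))
Given the large sample (n = 4,424), even a modest standardized difference can produce a very small p-value, so the effect size is worth reporting alongside the ANOVA result.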
Next, I will select “Age at enrollment” as the continuous explanatory variable and build a linear regression model to assess the relationship between age at enrollment and 1st-semester grades.
# Create a subset of the dataset for regression (Gender is kept so it can be
# added to the model later)
data_regression <- mpg %>%
  select(`Curricular units 1st sem (grade)`, `Age at enrollment`, Gender)
# Rename the columns for clarity
colnames(data_regression) <- c("Grade", "Age", "Gender")
# Build a linear regression model
lm_model <- lm(Grade ~ Age, data = data_regression)
summary(lm_model)
##
## Call:
## lm(formula = Grade ~ Age, data = data_regression)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.2672 0.2994 1.5471 2.6471 9.1074
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.966762 0.232010 55.89 <2e-16 ***
## Age -0.099975 0.009481 -10.54 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.784 on 4422 degrees of freedom
## Multiple R-squared: 0.02453, Adjusted R-squared: 0.02431
## F-statistic: 111.2 on 1 and 4422 DF, p-value: < 2.2e-16
The above output indicates the following:
The residual standard error is approximately 4.784; this is the standard deviation of the residuals, i.e. the typical size of the prediction errors on the grade scale.
The multiple R-squared is approximately 0.02453, which is quite low. This suggests that Age explains only a small proportion of the variance in Grade.
The adjusted R-squared, which accounts for the number of predictors, is also low at approximately 0.02431. The F-statistic tests the overall significance of the model and has a very low p-value (< 2.2e-16), indicating that the model as a whole is significant.
The low R-squared values suggest that the linear relationship between Age and Grade is not very strong. Additionally, the p-value for the Age coefficient is very low (< 2.2e-16), indicating that Age is a significant predictor of Grade, even though the effect size is small.
In practical terms, this means that there is a statistically significant, but weak, negative relationship between a student’s age and their 1st-semester grade. Older students tend to have slightly lower grades on average, but the age variable doesn’t explain much of the overall variation in grades.
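Before extending the model, it is worth checking the usual regression assumptions. The sketch below uses the broom and ggplot2 packages (already loaded) to plot residuals against fitted values, and then asks the lindia package for its one-call diagnostic panel; the gg_diagnose() helper is assumed to be available from the lindia version loaded above.
# Residuals vs. fitted values for the Age-only model (broom + ggplot2)
augment(lm_model) %>%
  ggplot(aes(x = .fitted, y = .resid)) +
  geom_point(alpha = 0.3) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(title = "Residuals vs. Fitted: Grade ~ Age",
       x = "Fitted values", y = "Residuals") +
  theme_minimal()
# Full panel of standard diagnostics in one call (lindia)
gg_diagnose(lm_model)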
I will now include “Gender” as a second predictor and check whether it improves the model.
lm_model_combined <- lm(Grade ~ Age + Gender, data = data_regression)
summary(lm_model_combined)
##
## Call:
## lm(formula = Grade ~ Age + Gender, data = data_regression)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.6883 0.0622 1.4622 2.6564 10.0836
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.197305 0.229535 57.496 <2e-16 ***
## Age -0.083833 0.009449 -8.872 <2e-16 ***
## Gender -1.723233 0.150135 -11.478 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.715 on 4421 degrees of freedom
## Multiple R-squared: 0.05276, Adjusted R-squared: 0.05233
## F-statistic: 123.1 on 2 and 4421 DF, p-value: < 2.2e-16
# Scatter plot of Grade vs. Age, colored by Gender
ggplot(data_regression, aes(x = Age, y = Grade, color = factor(Gender))) +
  geom_point() +
  labs(title = "Scatter Plot of Grade vs. Age",
       x = "Age",
       y = "Grade",
       color = "Gender") +
  theme_minimal()
# Box plot of Grade by Gender
ggplot(data_regression, aes(x = factor(Gender), y = Grade, fill = factor(Gender))) +
  geom_boxplot() +
  labs(title = "Box Plot of Grade by Gender",
       x = "Gender",
       y = "Grade",
       fill = "Gender") +
  theme_minimal()
The output above is from the linear regression model that includes both “Age” and “Gender” as predictors of the response variable “Grade.” Here’s an interpretation of the results:
Intercept:
The estimated intercept is approximately 13.20. It represents the expected “Grade” when both “Age” and “Gender” are zero (an extrapolation, since no enrolled student has an age of zero, so the intercept is not directly interpretable on its own).
Age:
The coefficient for “Age” is approximately -0.0838. On average, each additional year of age at enrollment is associated with a decrease of approximately 0.0838 in the expected “Grade,” holding “Gender” constant. The p-value for “Age” is <2e-16, indicating that “Age” is statistically significant in predicting “Grade.”
Gender:
The coefficient for “Gender” is approximately -1.7232. On average, students in the category coded 1 (assuming 1 represents male) have an expected “Grade” approximately 1.7232 points lower than students in the category coded 0 (assuming 0 represents female), holding “Age” constant. The p-value for “Gender” is <2e-16, indicating that “Gender” is statistically significant in predicting “Grade.”
Residuals:
The residuals are the differences between the observed “Grade” values and the values predicted by the model. They range from approximately -11.69 to 10.08; their mean is zero by construction, and the median of about 1.46 indicates a left-skewed distribution.
Model fit:
The adjusted R-squared value is 0.05233, indicating that about 5.23% of the variability in “Grade” is explained by the model with “Age” and “Gender” as predictors.
F-statistic:
The F-statistic tests the overall significance of the model. Here it is 123.1 with a very small p-value, indicating that the model as a whole is statistically significant.
In summary, “Age” and “Gender” are statistically significant predictors of “Grade,” but the low adjusted R-squared shows that they explain only a small portion of the variability in “Grade.” The negative coefficient for “Gender” suggests that being male (assuming 1 represents male) is associated with a lower expected “Grade” than being female (assuming 0 represents female), holding “Age” constant.
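Because the Age-only and Age + Gender models are nested and fitted to the same rows, they can be compared directly. The sketch below uses base R’s anova() F-test and the broom package loaded earlier to place the fit statistics side by side.
# Formal test of whether adding Gender improves on the Age-only model
anova(lm_model, lm_model_combined)
# Side-by-side fit statistics (R-squared, AIC, ...) for the two models
bind_rows(glance(lm_model), glance(lm_model_combined))
A significant F-test here would agree with the rise in adjusted R-squared from roughly 0.024 to 0.052 noted above.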
In the visualizations above:
The scatter plot shows the relationship between “Grade” and “Age” and uses different colors to distinguish between male (1) and female (0) students. You can observe how “Grade” varies with “Age.”
The box plot displays the distribution of “Grade” for male and female students. It allows you to see the spread of grades for each gender and detect any potential differences.
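To attach an uncertainty interval to the gender gap visible in the box plot, the boot package loaded at the top can resample the data. This is a minimal sketch, assuming Gender is coded 0/1 with 1 taken as male; the statistic is the difference in mean 1st-semester grade between the two codes.
# Bootstrap 95% percentile interval for the difference in mean Grade
# (code 1 minus code 0), resampling whole rows of data_anova
set.seed(123)
diff_means <- function(d, i) {
  d <- d[i, ]
  mean(d$Grade[d$Gender == 1]) - mean(d$Grade[d$Gender == 0])
}
boot_out <- boot(data = data_anova, statistic = diff_means, R = 2000)
boot.ci(boot_out, type = "perc")
If the resulting interval excludes zero, it agrees with the ANOVA and regression results while expressing the gap directly on the grade scale.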