Week 10 | Data Dive

# Load the necessary library

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(ggthemes)
library(ggrepel)
library(effsize)
library(pwrss)

## 
## Attaching package: 'pwrss'
## 
## The following object is masked from 'package:stats':
## 
##     power.t.test

library(boot)
library(broom)
library(lindia)
library(dplyr)
library(ggplot2)

mpg <- read_delim("C:/Users/kondo/OneDrive/Desktop/INTRO to Statistics and R/Data Set and work/data.csv", delim = ";", show_col_types = FALSE)

glimpse(mpg)

## Rows: 4,424
## Columns: 37
## $ `Marital status`                                 <dbl> 1, 1, 1, 1, 2, 2, 1, …
## $ `Application mode`                               <dbl> 17, 15, 1, 17, 39, 39…
## $ `Application order`                              <dbl> 5, 1, 5, 2, 1, 1, 1, …
## $ Course                                           <dbl> 171, 9254, 9070, 9773…
## $ `Daytime/evening attendance\t`                   <dbl> 1, 1, 1, 1, 0, 0, 1, …
## $ `Previous qualification`                         <dbl> 1, 1, 1, 1, 1, 19, 1,…
## $ `Previous qualification (grade)`                 <dbl> 122.0, 160.0, 122.0, …
## $ Nacionality                                      <dbl> 1, 1, 1, 1, 1, 1, 1, …
## $ `Mother's qualification`                         <dbl> 19, 1, 37, 38, 37, 37…
## $ `Father's qualification`                         <dbl> 12, 3, 37, 37, 38, 37…
## $ `Mother's occupation`                            <dbl> 5, 3, 9, 5, 9, 9, 7, …
## $ `Father's occupation`                            <dbl> 9, 3, 9, 3, 9, 7, 10,…
## $ `Admission grade`                                <dbl> 127.3, 142.5, 124.8, …
## $ Displaced                                        <dbl> 1, 1, 1, 1, 0, 0, 1, …
## $ `Educational special needs`                      <dbl> 0, 0, 0, 0, 0, 0, 0, …
## $ Debtor                                           <dbl> 0, 0, 0, 0, 0, 1, 0, …
## $ `Tuition fees up to date`                        <dbl> 1, 0, 0, 1, 1, 1, 1, …
## $ Gender                                           <dbl> 1, 1, 1, 0, 0, 1, 0, …
## $ `Scholarship holder`                             <dbl> 0, 0, 0, 0, 0, 0, 1, …
## $ `Age at enrollment`                              <dbl> 20, 19, 19, 20, 45, 5…
## $ International                                    <dbl> 0, 0, 0, 0, 0, 0, 0, …
## $ `Curricular units 1st sem (credited)`            <dbl> 0, 0, 0, 0, 0, 0, 0, …
## $ `Curricular units 1st sem (enrolled)`            <dbl> 0, 6, 6, 6, 6, 5, 7, …
## $ `Curricular units 1st sem (evaluations)`         <dbl> 0, 6, 0, 8, 9, 10, 9,…
## $ `Curricular units 1st sem (approved)`            <dbl> 0, 6, 0, 6, 5, 5, 7, …
## $ `Curricular units 1st sem (grade)`               <dbl> 0.00000, 14.00000, 0.…
## $ `Curricular units 1st sem (without evaluations)` <dbl> 0, 0, 0, 0, 0, 0, 0, …
## $ `Curricular units 2nd sem (credited)`            <dbl> 0, 0, 0, 0, 0, 0, 0, …
## $ `Curricular units 2nd sem (enrolled)`            <dbl> 0, 6, 6, 6, 6, 5, 8, …
## $ `Curricular units 2nd sem (evaluations)`         <dbl> 0, 6, 0, 10, 6, 17, 8…
## $ `Curricular units 2nd sem (approved)`            <dbl> 0, 6, 0, 5, 6, 5, 8, …
## $ `Curricular units 2nd sem (grade)`               <dbl> 0.00000, 13.66667, 0.…
## $ `Curricular units 2nd sem (without evaluations)` <dbl> 0, 0, 0, 0, 0, 5, 0, …
## $ `Unemployment rate`                              <dbl> 10.8, 13.9, 10.8, 9.4…
## $ `Inflation rate`                                 <dbl> 1.4, -0.3, 1.4, -0.8,…
## $ GDP                                              <dbl> 1.74, 0.79, 1.74, -3.…
## $ Target                                           <chr> "Dropout", "Graduate"…

# Select the target variable 'Gender' and explanatory variables
target_var <- "Gender"
explanatory_vars <- c("Previous qualification", "Age at enrollment")  # Add variables of your choice

# Data preparation
# all_of() is the supported way to select columns named in a character vector
data <- mpg %>% select(all_of(c(target_var, explanatory_vars)))  # Subset the data

# Create scatter plots
scatter_plot <- ggplot(data, aes(x = `Age at enrollment`, y = Gender)) +
  geom_point(aes(color = Gender)) +
  labs(title = "Scatter Plot of Age at Enrollment by Gender", x = "Age at Enrollment", y = "Gender") +
  theme_minimal()

print(scatter_plot)

# Build a logistic regression model
model <- glm(as.factor(Gender) ~ ., data = data, family = "binomial")

# Model summary
summary(model)

## 
## Call:
## glm(formula = as.factor(Gender) ~ ., family = "binomial", data = data)
## 
## Coefficients:
##                           Estimate Std. Error z value Pr(>|z|)    
## (Intercept)              -1.540929   0.101893 -15.123  < 2e-16 ***
## `Previous qualification`  0.011380   0.003024   3.763 0.000168 ***
## `Age at enrollment`       0.037229   0.004134   9.006  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 5738.0  on 4423  degrees of freedom
## Residual deviance: 5629.4  on 4421  degrees of freedom
## AIC: 5635.4
## 
## Number of Fisher Scoring iterations: 4

# Confidence intervals for the coefficients (log-odds scale).
# Note: confint() on a glm returns profile-likelihood intervals, not Wald intervals.
profile_intervals <- confint(model)

## Waiting for profiling to be done...

print(profile_intervals)

##                                 2.5 %      97.5 %
## (Intercept)              -1.741471205 -1.34195088
## `Previous qualification`  0.005429088  0.01729336
## `Age at enrollment`       0.029152017  0.04536461

The logistic regression model above uses two explanatory variables, “Previous qualification” and “Age at enrollment”, to predict the binary outcome variable “Gender” (coded as 1 for male and 0 for female). The model summary reports the coefficient estimate, standard error, z-value, and p-value for each term.

Here’s an interpretation of the results:

Intercept (-1.540929): Because the outcome is a factor with levels 0 and 1, glm() models the log-odds of the second level, Gender = 1 (male). The intercept is therefore the estimated log-odds of being male when both explanatory variables equal 0, approximately -1.541. Exponentiating this value gives odds of being male (versus female) of about 0.214. An age at enrollment of 0 lies far outside the observed data, so the intercept is an extrapolation and has little practical meaning on its own.
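As a rough check on the intercept, the sketch below converts it to odds and to a probability. It assumes glm() is modeling the second factor level (Gender = 1) and copies the estimate by hand from the summary output rather than reading it from the fitted `model` object, so it runs on its own.

```r
# Intercept copied from the summary() output (log-odds scale)
intercept <- -1.540929

# Odds of Gender = 1 when both explanatory variables are 0
odds_at_zero <- exp(intercept)

# The same quantity on the probability scale (inverse logit)
prob_at_zero <- plogis(intercept)

round(c(odds = odds_at_zero, probability = prob_at_zero), 3)
```

With the fitted model in the session, `exp(coef(model)[1])` should give the same odds.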

Previous qualification (0.011380): The coefficient for “Previous qualification” is approximately 0.011. For each one-unit increase in “Previous qualification”, the log-odds of being male (compared to female) increase by 0.011, holding “Age at enrollment” constant. On the odds scale this is a multiplicative change of exp(0.011380) ≈ 1.011 per unit. One caution: the values shown in the glimpse() output (for example 1 and 19) appear to be category codes rather than a measured quantity, so a “one-unit increase” may not have a natural interpretation.

Age at enrollment (0.037229): The coefficient for “Age at enrollment” is approximately 0.037. For each additional year of age at enrollment, the log-odds of being male (compared to female) increase by 0.037, holding “Previous qualification” constant. This corresponds to an odds ratio of exp(0.037229) ≈ 1.038, that is, roughly 3.8% higher odds of being male per additional year.
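Log-odds coefficients are often easier to read as odds ratios. The sketch below exponentiates the two slope estimates, copied by hand from the summary output. The names `slopes`, `odds_ratios`, and `or_age_10` are illustrative and not part of the original analysis.

```r
# Slope estimates copied from the summary() output (log-odds per unit)
slopes <- c(prev_qualification = 0.011380, age_at_enrollment = 0.037229)

# Odds ratio for a one-unit increase in each variable
odds_ratios <- exp(slopes)

# Odds ratio for a 10-year difference in age at enrollment:
# effects add on the log-odds scale, so they multiply on the odds scale
or_age_10 <- exp(10 * slopes[["age_at_enrollment"]])

round(odds_ratios, 3)
round(or_age_10, 2)
```

With the fitted model available, `exp(coef(model))` produces the per-unit odds ratios directly.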

The statistical significance of each coefficient is assessed with its p-value. Both “Previous qualification” (p = 0.000168) and “Age at enrollment” (p < 2e-16) have p-values well below 0.05, indicating that each is significantly associated with gender in this model.
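The z-values and p-values in the summary table can be reproduced from the estimates and standard errors. The following is a minimal sketch using numbers copied from the output. It assumes the usual Wald test, in which z is the estimate divided by its standard error and the p-value is two-sided under a standard normal distribution.

```r
# Estimates and standard errors copied from the summary() output
estimate  <- c(prev_qualification = 0.011380, age_at_enrollment = 0.037229)
std_error <- c(prev_qualification = 0.003024, age_at_enrollment = 0.004134)

# Wald z statistic: estimate divided by its standard error
z_value <- estimate / std_error

# Two-sided p-value from the standard normal distribution
p_value <- 2 * pnorm(-abs(z_value))

round(z_value, 3)
signif(p_value, 3)
```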

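The 95% confidence intervals for the coefficients are also provided, on the log-odds scale. The interval for “Previous qualification” runs from 0.005 to 0.017, and the interval for “Age at enrollment” runs from 0.029 to 0.045. These are profile-likelihood intervals, which is why confint() prints “Waiting for profiling to be done...”. Each interval is the range of values within which we can be 95% confident that the true population coefficient lies. Neither interval contains 0, which agrees with the small p-values.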
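For comparison, Wald intervals can be computed by hand as estimate ± z × standard error. The sketch below uses values copied from the summary output. With a sample this large, the results should sit very close to the profile-likelihood intervals that confint() printed, though they will not match exactly.

```r
# Estimates and standard errors copied from the summary() output
estimate  <- c(prev_qualification = 0.011380, age_at_enrollment = 0.037229)
std_error <- c(prev_qualification = 0.003024, age_at_enrollment = 0.004134)

# Critical value for a two-sided 95% interval (about 1.96)
z_crit <- qnorm(0.975)

# Wald interval on the log-odds scale
wald_ci <- cbind(
  lower = estimate - z_crit * std_error,
  upper = estimate + z_crit * std_error
)

round(wald_ci, 4)

# Exponentiate to get intervals for the odds ratios
round(exp(wald_ci), 3)
```

With the fitted model in the session, `confint.default(model)` returns Wald intervals without the manual arithmetic.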

In summary, based on this logistic regression model, both “Previous qualification” and “Age at enrollment” are statistically significantly associated with gender. An increase in either variable is associated with higher odds of being male compared to female, holding the other variable constant. The effects are modest, however: the residual deviance (5629.4) is only slightly below the null deviance (5738.0).
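To see what these coefficients imply on the probability scale, the sketch below plugs the estimates into the inverse logit for two example students. The helper `predict_prob_male()` and the chosen input values (previous qualification code 1, ages 20 and 40) are hypothetical illustrations, not part of the original analysis.

```r
# Coefficients copied from the summary() output
coefs <- c(intercept = -1.540929,
           prev_qualification = 0.011380,
           age_at_enrollment = 0.037229)

# Hypothetical helper: predicted probability that Gender = 1
predict_prob_male <- function(prev_qualification, age) {
  # Linear predictor on the log-odds scale
  eta <- coefs[["intercept"]] +
    coefs[["prev_qualification"]] * prev_qualification +
    coefs[["age_at_enrollment"]] * age
  # Inverse logit maps log-odds to a probability
  plogis(eta)
}

# Same previous qualification code, two different ages
probs <- predict_prob_male(prev_qualification = 1, age = c(20, 40))
round(probs, 2)
```

With the fitted model available, `predict(model, newdata = ..., type = "response")` does the same thing, provided the columns in `newdata` carry the original column names.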

Week 10 | Data Dive — GLMs

Vaishali Kondoju

2023-10-31