Week 6 | Data Dive — Confidence Intervals

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(ggplot2)
library(stats)

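# Note: the name `mpg` shadows ggplot2's built-in mpg dataset; it is kept because the rest of the analysis uses it.
mpg <- read_delim("C:/Users/kondo/OneDrive/Desktop/INTRO to Statistics and R/Data Set and work/data.csv", delim = ";", show_col_types = FALSE)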

glimpse(mpg)

## Rows: 4,424
## Columns: 37
## $ `Marital status`                                 <dbl> 1, 1, 1, 1, 2, 2, 1, …
## $ `Application mode`                               <dbl> 17, 15, 1, 17, 39, 39…
## $ `Application order`                              <dbl> 5, 1, 5, 2, 1, 1, 1, …
## $ Course                                           <dbl> 171, 9254, 9070, 9773…
## $ `Daytime/evening attendance\t`                   <dbl> 1, 1, 1, 1, 0, 0, 1, …
## $ `Previous qualification`                         <dbl> 1, 1, 1, 1, 1, 19, 1,…
## $ `Previous qualification (grade)`                 <dbl> 122.0, 160.0, 122.0, …
## $ Nacionality                                      <dbl> 1, 1, 1, 1, 1, 1, 1, …
## $ `Mother's qualification`                         <dbl> 19, 1, 37, 38, 37, 37…
## $ `Father's qualification`                         <dbl> 12, 3, 37, 37, 38, 37…
## $ `Mother's occupation`                            <dbl> 5, 3, 9, 5, 9, 9, 7, …
## $ `Father's occupation`                            <dbl> 9, 3, 9, 3, 9, 7, 10,…
## $ `Admission grade`                                <dbl> 127.3, 142.5, 124.8, …
## $ Displaced                                        <dbl> 1, 1, 1, 1, 0, 0, 1, …
## $ `Educational special needs`                      <dbl> 0, 0, 0, 0, 0, 0, 0, …
## $ Debtor                                           <dbl> 0, 0, 0, 0, 0, 1, 0, …
## $ `Tuition fees up to date`                        <dbl> 1, 0, 0, 1, 1, 1, 1, …
## $ Gender                                           <dbl> 1, 1, 1, 0, 0, 1, 0, …
## $ `Scholarship holder`                             <dbl> 0, 0, 0, 0, 0, 0, 1, …
## $ `Age at enrollment`                              <dbl> 20, 19, 19, 20, 45, 5…
## $ International                                    <dbl> 0, 0, 0, 0, 0, 0, 0, …
## $ `Curricular units 1st sem (credited)`            <dbl> 0, 0, 0, 0, 0, 0, 0, …
## $ `Curricular units 1st sem (enrolled)`            <dbl> 0, 6, 6, 6, 6, 5, 7, …
## $ `Curricular units 1st sem (evaluations)`         <dbl> 0, 6, 0, 8, 9, 10, 9,…
## $ `Curricular units 1st sem (approved)`            <dbl> 0, 6, 0, 6, 5, 5, 7, …
## $ `Curricular units 1st sem (grade)`               <dbl> 0.00000, 14.00000, 0.…
## $ `Curricular units 1st sem (without evaluations)` <dbl> 0, 0, 0, 0, 0, 0, 0, …
## $ `Curricular units 2nd sem (credited)`            <dbl> 0, 0, 0, 0, 0, 0, 0, …
## $ `Curricular units 2nd sem (enrolled)`            <dbl> 0, 6, 6, 6, 6, 5, 8, …
## $ `Curricular units 2nd sem (evaluations)`         <dbl> 0, 6, 0, 10, 6, 17, 8…
## $ `Curricular units 2nd sem (approved)`            <dbl> 0, 6, 0, 5, 6, 5, 8, …
## $ `Curricular units 2nd sem (grade)`               <dbl> 0.00000, 13.66667, 0.…
## $ `Curricular units 2nd sem (without evaluations)` <dbl> 0, 0, 0, 0, 0, 5, 0, …
## $ `Unemployment rate`                              <dbl> 10.8, 13.9, 10.8, 9.4…
## $ `Inflation rate`                                 <dbl> 1.4, -0.3, 1.4, -0.8,…
## $ GDP                                              <dbl> 1.74, 0.79, 1.74, -3.…
## $ Target                                           <chr> "Dropout", "Graduate"…

Task 1 & 2 :

Build at least three sets of variable combinations.

For each set of variables, include at least one column that you created (i.e., calculated based on others).

All variables for this data dive should be either continuous (i.e., numeric) or ordered (e.g., [‘small’, ‘medium’, ‘large’] is okay, but [“apples”, “oranges”, “bananas”] is not).

For each set, there should be one response variable, with the others as explanatory variables.

Plot a visualization for each response-explanatory relationship, and draw some conclusions based on the plot
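The task asks for at least one created column per set. The sketch below shows one way such a column could be derived with mutate(). The name avg_sem_grade and the four stand-in rows are assumptions made for illustration; they are not part of the original analysis.

```r
library(dplyr)

# Hypothetical stand-in rows; the real analysis would use the full data frame.
students <- tibble(
  `Curricular units 1st sem (grade)` = c(0, 14, 0, 13.4),
  `Curricular units 2nd sem (grade)` = c(0, 13.67, 0, 12.4)
)

# Derived column (an assumption, not in the original report):
# the mean of the two semester grades for each student.
students <- students %>%
  mutate(avg_sem_grade = (`Curricular units 1st sem (grade)` +
                            `Curricular units 2nd sem (grade)`) / 2)

students
```

A column like this is continuous, so it could serve as a response or explanatory variable in any of the sets.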

Set 1: Response Variable: “Curricular units 2nd sem (grade)” Explanatory Variables: “Admission grade,” “Age at enrollment”

# Create a scatterplot for Set 1
ggplot(mpg, aes(x = `Admission grade`, y = `Age at enrollment`, color = `Curricular units 2nd sem (grade)`)) +
  geom_point() +
  labs(title = "Scatterplot of Curricular units 2nd sem (grade)",
       x = "Admission grade",
       y = "Age at enrollment",
       color = "Curricular units 2nd sem (grade)") +
  theme_minimal()

Conclusions for above visualization :

The scatterplot shows the relationship between “Curricular units 2nd sem (grade),” “Admission grade,” and “Age at enrollment.”
There doesn’t appear to be a strong linear relationship between “Curricular units 2nd sem (grade)” and the other two variables.
“Curricular units 2nd sem (grade)” varies across a wide range of values for both “Admission grade” and “Age at enrollment.”
The plot also shows an outlier on the age axis: one student with an age at enrollment of about 70.

Set 2: Response Variable: “Target” (Classification) Explanatory Variables: “Admission grade,” “Age at enrollment,” “Displaced”

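# Create a grouped bar chart for Set 2
# Note: with stat = "identity", every row draws its own bar, so bars in the same group overlap
# and the visible height is the largest admission grade in that group.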
ggplot(mpg, aes(x = Target, y = `Admission grade`, fill = factor(`Age at enrollment`))) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Grouped Bar Chart of Admission Grade by Target and Age at Enrollment",
       x = "Target",
       y = "Admission Grade") +
  scale_fill_discrete(name = "Age at Enrollment")

Conclusions for above visualization:

Admission Grade and Target:

The “Admission grade” tends to be higher for students who ultimately “Graduate” compared to those who “Dropout” or are “Enrolled.” Students who “Graduate” generally have a wider range of admission grades, while students who “Dropout” tend to have lower admission grades.

Age at Enrollment and Target:

The distribution of “Age at enrollment” varies across the “Target” categories. The “Dropout” and “Enrolled” groups both have a high concentration of younger students, while the “Graduate” group covers a broader age range. The “Dropout” group includes both younger and older students. These patterns suggest that admission grade and age at enrollment are both related to a student’s academic outcome, which could help target interventions in an educational context.
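Because the conclusions above compare the typical level and the spread of admission grades across outcomes, a boxplot may show this more directly than bars. This is a minimal sketch on simulated data; the data frame students, the column admission_grade, and all values are hypothetical stand-ins, not the real file.

```r
library(ggplot2)

# Hypothetical stand-in data; values are simulated, not taken from the real file.
set.seed(1)
students <- data.frame(
  Target = rep(c("Dropout", "Enrolled", "Graduate"), each = 50),
  admission_grade = c(rnorm(50, 125, 14), rnorm(50, 126, 14), rnorm(50, 128, 14))
)

# One box per outcome: median, quartiles, and outliers of the admission grade.
p <- ggplot(students, aes(x = Target, y = admission_grade)) +
  geom_boxplot() +
  labs(title = "Admission grade by Target",
       x = "Target",
       y = "Admission grade") +
  theme_minimal()

print(p)
```

With the real data, the same call would use the backtick-quoted column `Admission grade` in place of admission_grade.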

Set 3: Response Variable: “Unemployment rate” Explanatory Variables: “Inflation rate,” “GDP”

# Create a scatterplot for Set 3
ggplot(mpg, aes(x = `Inflation rate`, y = GDP, color = `Unemployment rate`)) +
  geom_point() +
  labs(title = "Scatterplot of Unemployment rate",
       x = "Inflation rate",
       y = "GDP",
       color = "Unemployment rate") +
  theme_minimal()

Conclusions for above visualization :

The scatterplot shows the relationship between “Unemployment rate,” “Inflation rate,” and “GDP.”
There doesn’t appear to be a strong linear relationship between “Unemployment rate” and the other two economic variables.
The “Unemployment rate” varies across different levels of “Inflation rate” and “GDP.”

Task 3 & 4 :

Calculate the appropriate correlation coefficient for each of these combinations. Explain why the value makes sense (or doesn’t) based on the visualization(s).

Build a confidence interval for each of the response variables. Provide a detailed conclusion of the response variable (i.e., the population) based on your confidence interval.

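# Combination 1:
# Response Variable: "Curricular units 1st sem (grade)" (Set 1 above plotted the 2nd-semester grade)
# Explanatory Variables: "Age at enrollment," "Admission grade" ("International" is listed for this set but not included in the correlation below)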

# Correlation matrix
correlation_matrix <- cor(mpg[, c("Curricular units 1st sem (grade)", "Age at enrollment", "Admission grade")])

# Extract correlation coefficients
correlation_coefficients <- correlation_matrix["Curricular units 1st sem (grade)", c("Age at enrollment", "Admission grade")]

# Display correlation coefficients
correlation_coefficients

## Age at enrollment   Admission grade 
##       -0.15661585        0.07386842

The correlation coefficients for “Curricular units 1st sem (grade)” with “Age at enrollment” and “Admission grade” help us understand the relationships (“International” was not included in the calculation):

“Age at enrollment” is weakly negatively correlated with “Curricular units 1st sem (grade)” (r ≈ -0.157). This means that, on average, as the age at enrollment increases, the grade in the first semester tends to decrease slightly.

“Admission grade” is very weakly positively correlated with “Curricular units 1st sem (grade)” (r ≈ 0.074). Students with higher admission grades tend to have marginally higher first-semester grades, but the association is close to zero. Both values are consistent with the Set 1 scatterplot, which showed no strong linear pattern.
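cor() returns only the point estimate. As a sketch, cor.test() also reports a confidence interval for a Pearson correlation, and method = "spearman" gives a rank-based alternative for ordered or skewed variables. The vectors age and grade below are simulated stand-ins (hypothetical values), not the real columns.

```r
# Simulated stand-in vectors (hypothetical values, not the real data).
set.seed(42)
age   <- round(runif(200, 17, 60))
grade <- 12 - 0.03 * age + rnorm(200, sd = 3)

# Pearson correlation with a 95% confidence interval.
pearson <- cor.test(age, grade, method = "pearson")
pearson$estimate
pearson$conf.int

# Spearman rank correlation; exact = FALSE avoids the exact test,
# which is not available when the data contain ties.
spearman <- cor.test(age, grade, method = "spearman", exact = FALSE)
spearman$estimate
```

If the interval for a correlation is narrow and close to zero, the relationship is weak even when it is statistically distinguishable from zero.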

Now, let’s build confidence intervals for “Curricular units 1st sem (grade)” and interpret the results:

# Confidence interval for "Curricular units 1st sem (grade)"
confidence_interval <- t.test(mpg$`Curricular units 1st sem (grade)`)

# Display the confidence interval
confidence_interval

## 
##  One Sample t-test
## 
## data:  mpg$`Curricular units 1st sem (grade)`
## t = 146.12, df = 4423, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  10.49805 10.78359
## sample estimates:
## mean of x 
##  10.64082

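The 95% confidence interval for the mean of “Curricular units 1st sem (grade)” runs from 10.50 to 10.78, around a sample mean of 10.64. If this sample is representative of the wider student population, we can be 95% confident that the population mean first-semester grade lies in that range. The interval is narrow because the sample is large (n = 4,424). The column also contains grades of 0 (visible in the glimpse output), which appear to belong to students with no evaluations and pull the mean down, so the interval describes all students in the data rather than only those who were graded.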
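The interval reported by t.test() is the sample mean plus or minus a t critical value times the standard error. The sketch below reproduces that calculation by hand on simulated data; the vector grades holds hypothetical values, not the real column.

```r
# Simulated stand-in grades (hypothetical values, not the real column).
set.seed(6)
grades <- rnorm(500, mean = 10.6, sd = 4.8)

n       <- length(grades)
x_bar   <- mean(grades)               # sample mean
std_err <- sd(grades) / sqrt(n)       # standard error of the mean
t_crit  <- qt(0.975, df = n - 1)      # two-sided 95% critical value

# Manual 95% confidence interval for the population mean.
manual_ci <- c(x_bar - t_crit * std_err, x_bar + t_crit * std_err)

# The same interval as reported by t.test().
builtin_ci <- as.numeric(t.test(grades)$conf.int)

manual_ci
builtin_ci
```

The two intervals agree, which shows that t.test() applies exactly this formula for a one-sample mean.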

# Combination 2:
# Response Variable: "Target"
# Explanatory Variables: "Admission grade," "Age at enrollment" ("Displaced" is listed for this set but not included in the model below)
# Recode Target to binary for logistic regression: "Dropout" -> 0; "Enrolled" and "Graduate" -> 1
mpg$Target <- ifelse(mpg$Target == "Dropout", 0, 1)

logistic_model <- glm(Target ~ `Admission grade` + `Age at enrollment`, data = mpg, family = binomial)

# Summary of the logistic regression model
summary(logistic_model)

## 
## Call:
## glm(formula = Target ~ `Admission grade` + `Age at enrollment`, 
##     family = binomial, data = mpg)
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)          0.599896   0.312229   1.921   0.0547 .  
## `Admission grade`    0.014037   0.002336   6.009 1.87e-09 ***
## `Age at enrollment` -0.068731   0.004381 -15.690  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 5554.5  on 4423  degrees of freedom
## Residual deviance: 5246.2  on 4421  degrees of freedom
## AIC: 5252.2
## 
## Number of Fisher Scoring iterations: 4

The coefficient for “Admission grade” is positive and significant, which indicates that higher admission grades are associated with higher odds of not dropping out (Target = 1, i.e., “Enrolled” or “Graduate”). Each additional admission-grade point multiplies those odds by about exp(0.014) ≈ 1.014. The coefficient for “Age at enrollment” is negative and significant, which suggests that older students have lower odds of not dropping out: each additional year of age multiplies the odds by about exp(-0.069) ≈ 0.934.
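Logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios. The sketch below does this for the estimates and standard errors copied from the summary() output above and adds approximate 95% Wald intervals. The short names admission_grade and age_at_enrollment are labels introduced here for illustration.

```r
# Estimates and standard errors copied from the summary() output above.
est <- c(admission_grade = 0.014037, age_at_enrollment = -0.068731)
se  <- c(admission_grade = 0.002336, age_at_enrollment = 0.004381)

z <- qnorm(0.975)  # about 1.96 for a 95% interval

# Odds ratios with approximate 95% Wald confidence intervals.
odds_ratios <- data.frame(
  odds_ratio = exp(est),
  lower_95   = exp(est - z * se),
  upper_95   = exp(est + z * se)
)

odds_ratios
```

Neither interval contains 1, which matches the small p-values in the model summary. With the fitted model available, exp(coef(logistic_model)) would give the same odds ratios directly.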

# Combination 3:
# Relationship between the response variable "Unemployment rate" and the explanatory variables "Inflation rate" and "GDP"
# Fit a linear regression model on the mpg data frame
model <- lm(`Unemployment rate` ~ `Inflation rate` + `GDP`, data = mpg)

# Summarize the regression model
summary(model)

## 
## Call:
## lm(formula = `Unemployment rate` ~ `Inflation rate` + GDP, data = mpg)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.685 -1.233  0.264  2.453  4.143 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      11.72633    0.05050 232.219  < 2e-16 ***
## `Inflation rate` -0.12980    0.02740  -4.737 2.24e-06 ***
## GDP              -0.40222    0.01669 -24.096  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.504 on 4421 degrees of freedom
## Multiple R-squared:  0.1168, Adjusted R-squared:  0.1164 
## F-statistic: 292.4 on 2 and 4421 DF,  p-value: < 2.2e-16

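The linear regression analysis shows that both the inflation rate and GDP have statistically significant negative relationships with the unemployment rate. Holding the other predictor fixed, a one-point increase in the inflation rate is associated with an unemployment rate about 0.13 points lower, and a one-point increase in GDP with an unemployment rate about 0.40 points lower. One caution: these economic values repeat across students (for example, rows 1 and 3 in the glimpse output share 10.8, 1.4, and 1.74), so the 4,424 rows are not independent observations of the economy and the standard errors are likely too small.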

However, the model explains only about 11.68% of the variance in unemployment rates, suggesting that other unaccounted factors also play a significant role. These findings can be useful for understanding economic trends but should be considered alongside other economic indicators for a comprehensive analysis.
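The R-squared value can be turned into a multiple correlation coefficient and checked against the reported F-statistic. This sketch uses only numbers copied from the regression output above; the variable names are introduced here for illustration.

```r
# Values copied from the summary(model) output above.
r_squared <- 0.1168   # Multiple R-squared
k         <- 2        # number of explanatory variables
df_resid  <- 4421     # residual degrees of freedom

# Multiple correlation between the observed and fitted unemployment rates.
multiple_r <- sqrt(r_squared)

# F-statistic implied by R-squared; it should be close to the reported 292.4.
f_stat <- (r_squared / k) / ((1 - r_squared) / df_resid)

multiple_r
f_stat
```

A multiple correlation of roughly 0.34 is moderate at best, which fits the Set 3 scatterplot showing no strong linear pattern.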


Vaishali Kondoju

2023-10-02