library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(stats)
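# Read the semicolon-delimited student data set.
# Note: the object is named `mpg`, which masks ggplot2's built-in `mpg` data; it holds student records.
mpg <- read_delim("C:/Users/kondo/OneDrive/Desktop/INTRO to Statistics and R/Data Set and work/data.csv", delim = ";", show_col_types = FALSE)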
glimpse(mpg)
## Rows: 4,424
## Columns: 37
## $ `Marital status` <dbl> 1, 1, 1, 1, 2, 2, 1, …
## $ `Application mode` <dbl> 17, 15, 1, 17, 39, 39…
## $ `Application order` <dbl> 5, 1, 5, 2, 1, 1, 1, …
## $ Course <dbl> 171, 9254, 9070, 9773…
## $ `Daytime/evening attendance\t` <dbl> 1, 1, 1, 1, 0, 0, 1, …
## $ `Previous qualification` <dbl> 1, 1, 1, 1, 1, 19, 1,…
## $ `Previous qualification (grade)` <dbl> 122.0, 160.0, 122.0, …
## $ Nacionality <dbl> 1, 1, 1, 1, 1, 1, 1, …
## $ `Mother's qualification` <dbl> 19, 1, 37, 38, 37, 37…
## $ `Father's qualification` <dbl> 12, 3, 37, 37, 38, 37…
## $ `Mother's occupation` <dbl> 5, 3, 9, 5, 9, 9, 7, …
## $ `Father's occupation` <dbl> 9, 3, 9, 3, 9, 7, 10,…
## $ `Admission grade` <dbl> 127.3, 142.5, 124.8, …
## $ Displaced <dbl> 1, 1, 1, 1, 0, 0, 1, …
## $ `Educational special needs` <dbl> 0, 0, 0, 0, 0, 0, 0, …
## $ Debtor <dbl> 0, 0, 0, 0, 0, 1, 0, …
## $ `Tuition fees up to date` <dbl> 1, 0, 0, 1, 1, 1, 1, …
## $ Gender <dbl> 1, 1, 1, 0, 0, 1, 0, …
## $ `Scholarship holder` <dbl> 0, 0, 0, 0, 0, 0, 1, …
## $ `Age at enrollment` <dbl> 20, 19, 19, 20, 45, 5…
## $ International <dbl> 0, 0, 0, 0, 0, 0, 0, …
## $ `Curricular units 1st sem (credited)` <dbl> 0, 0, 0, 0, 0, 0, 0, …
## $ `Curricular units 1st sem (enrolled)` <dbl> 0, 6, 6, 6, 6, 5, 7, …
## $ `Curricular units 1st sem (evaluations)` <dbl> 0, 6, 0, 8, 9, 10, 9,…
## $ `Curricular units 1st sem (approved)` <dbl> 0, 6, 0, 6, 5, 5, 7, …
## $ `Curricular units 1st sem (grade)` <dbl> 0.00000, 14.00000, 0.…
## $ `Curricular units 1st sem (without evaluations)` <dbl> 0, 0, 0, 0, 0, 0, 0, …
## $ `Curricular units 2nd sem (credited)` <dbl> 0, 0, 0, 0, 0, 0, 0, …
## $ `Curricular units 2nd sem (enrolled)` <dbl> 0, 6, 6, 6, 6, 5, 8, …
## $ `Curricular units 2nd sem (evaluations)` <dbl> 0, 6, 0, 10, 6, 17, 8…
## $ `Curricular units 2nd sem (approved)` <dbl> 0, 6, 0, 5, 6, 5, 8, …
## $ `Curricular units 2nd sem (grade)` <dbl> 0.00000, 13.66667, 0.…
## $ `Curricular units 2nd sem (without evaluations)` <dbl> 0, 0, 0, 0, 0, 5, 0, …
## $ `Unemployment rate` <dbl> 10.8, 13.9, 10.8, 9.4…
## $ `Inflation rate` <dbl> 1.4, -0.3, 1.4, -0.8,…
## $ GDP <dbl> 1.74, 0.79, 1.74, -3.…
## $ Target <chr> "Dropout", "Graduate"…
Task 1 & 2:
Build at least three sets of variable combinations.
For each set of variables, include at least one column that you created (i.e., calculated based on others).
All variables for this data dive should be either continuous (i.e., numeric) or ordered (e.g., [‘small’, ‘medium’, ‘large’] is okay, but [“apples”, “oranges”, “bananas”] is not).
For each set, there should be one response variable with the others as explanatory variables.
Plot a visualization for each response-explanatory relationship, and draw some conclusions based on the plot.
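The task asks for at least one calculated column per set. A minimal sketch of how such a column could be built with `mutate()` follows. The toy rows and the column name `approval_rate_1st` are assumptions for illustration only; with the real data the same `mutate()` call would be applied to `mpg`.

```r
library(dplyr)

# Toy rows standing in for the real data (values are illustrative only)
students <- tibble::tibble(
  `Curricular units 1st sem (enrolled)` = c(0, 6, 6, 5),
  `Curricular units 1st sem (approved)` = c(0, 6, 0, 5)
)

# Hypothetical derived column: share of enrolled units that were approved.
# Students with zero enrolled units get NA instead of a division by zero.
students <- students %>%
  mutate(
    approval_rate_1st = if_else(
      `Curricular units 1st sem (enrolled)` > 0,
      `Curricular units 1st sem (approved)` / `Curricular units 1st sem (enrolled)`,
      NA_real_
    )
  )

students$approval_rate_1st
```

Because the result is continuous, a column like this would satisfy the requirement that every variable be numeric or ordered.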
Set 1: Response Variable: “Curricular units 2nd sem (grade)” Explanatory Variables: “Admission grade,” “Age at enrollment”
# Create a scatterplot for Set 1
ggplot(mpg, aes(x = `Admission grade`, y = `Age at enrollment`, color = `Curricular units 2nd sem (grade)`)) +
  geom_point() +
  labs(title = "Scatterplot of Curricular units 2nd sem (grade)",
       x = "Admission grade",
       y = "Age at enrollment",
       color = "Curricular units 2nd sem (grade)") +
  theme_minimal()
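An alternative view, sketched here under assumptions, places the response variable on the y-axis so that its relationship with an explanatory variable can be read directly from the plot. The toy data frame and its simplified column names (`admission_grade`, `age_at_enrollment`, `grade_2nd_sem`) are hypothetical stand-ins; with the real data the backtick-quoted column names of `mpg` would be used instead.

```r
library(ggplot2)

# Toy rows (illustrative values only, not taken from the data set)
toy <- data.frame(
  admission_grade   = c(127.3, 142.5, 124.8, 119.6, 141.5, 133.1),
  age_at_enrollment = c(20, 19, 19, 20, 45, 21),
  grade_2nd_sem     = c(0, 13.7, 0, 12.4, 13.0, 11.5)
)

# Response on the y-axis, one explanatory variable on the x-axis,
# and the second explanatory variable mapped to colour
p <- ggplot(toy, aes(x = admission_grade, y = grade_2nd_sem, color = age_at_enrollment)) +
  geom_point(alpha = 0.6) +
  labs(title = "2nd semester grade vs. admission grade",
       x = "Admission grade",
       y = "Curricular units 2nd sem (grade)",
       color = "Age at enrollment") +
  theme_minimal()

p
```

With several thousand points, a transparency setting such as `alpha = 0.6` can make overlapping observations easier to see.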
Conclusions for above visualization :
Set 2: Response Variable: “Target” (Classification) Explanatory Variables: “Admission grade,” “Age at enrollment,” “Displaced”
# Create a grouped bar chart for Set 2
# Note: the data has one row per student, so bars for the same Target and age are drawn on top of each other
ggplot(mpg, aes(x = Target, y = `Admission grade`, fill = factor(`Age at enrollment`))) +
  geom_col(position = "dodge") +
  labs(title = "Grouped Bar Chart of Admission Grade by Target and Age at Enrollment",
       x = "Target",
       y = "Admission Grade") +
  scale_fill_discrete(name = "Age at Enrollment")
Conclusions for above visualization:
In this chart, the “Admission grade” appears higher for students who ultimately “Graduate” than for those who “Dropout” or are “Enrolled.” Students who “Graduate” also appear to span a wider range of admission grades, while students who “Dropout” tend to have lower admission grades. Because the bars are drawn from individual student rows, these differences are best confirmed with group summaries before drawing firm conclusions.
The distribution of “Age at enrollment” varies across the “Target” categories. The “Enrolled” category is concentrated among younger students, the “Dropout” category includes both younger and older students, and the “Graduate” category covers a broader age range. These patterns suggest that admission grade and age at enrollment are both related to a student’s academic outcome, which could help target interventions in an educational context.
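Statements about which outcome group has higher admission grades or older students can be checked numerically with a grouped summary. The following is a minimal sketch on toy rows; the values and the simplified column names are assumptions, not results from the data set.

```r
library(dplyr)

# Toy rows (illustrative values only)
toy <- tibble::tibble(
  Target            = c("Dropout", "Dropout", "Graduate", "Graduate", "Enrolled", "Enrolled"),
  admission_grade   = c(120, 124, 140, 150, 128, 132),
  age_at_enrollment = c(19, 45, 19, 20, 18, 21)
)

# One row per outcome group: group size, mean admission grade, median age
by_target <- toy %>%
  group_by(Target) %>%
  summarise(
    n              = n(),
    mean_admission = mean(admission_grade),
    median_age     = median(age_at_enrollment),
    .groups        = "drop"
  )

by_target
```

Applied to the real data, a table like this would show directly whether the group differences read off the bar chart hold up.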
Set 3: Response Variable: “Unemployment rate” Explanatory Variables: “Inflation rate,” “GDP”
# Create a scatterplot for Set 3
ggplot(mpg, aes(x = `Inflation rate`, y = GDP, color = `Unemployment rate`)) +
  geom_point() +
  labs(title = "Scatterplot of Unemployment rate",
       x = "Inflation rate",
       y = "GDP",
       color = "Unemployment rate") +
  theme_minimal()
Conclusions for above visualization :
Task 3 & 4:
Calculate the appropriate correlation coefficient for each of these combinations. Explain why the value makes sense (or doesn’t) based on the visualization(s).
Build a confidence interval for each of the response variables. Provide a detailed conclusion of the response variable (i.e., the population) based on your confidence interval.
# Combination 1:
# Response Variable: "Curricular units 1st sem (grade)"
# Explanatory Variables: "Age at enrollment," "Admission grade," "International" ("International" is not included in the correlation below)
# Correlation matrix
correlation_matrix <- cor(mpg[, c("Curricular units 1st sem (grade)", "Age at enrollment", "Admission grade")])
# Extract correlation coefficients
correlation_coefficients <- correlation_matrix["Curricular units 1st sem (grade)", c("Age at enrollment", "Admission grade")]
# Display correlation coefficients
correlation_coefficients
## Age at enrollment Admission grade
## -0.15661585 0.07386842
The correlation coefficients for “Curricular units 1st sem (grade)” with “Age at enrollment” and “Admission grade” describe the direction and strength of each linear relationship (“International” is not part of the computed matrix):
“Age at enrollment” is weakly negatively correlated with “Curricular units 1st sem (grade)” (r ≈ -0.16). On average, as the age at enrollment increases, the grade in the first semester tends to decrease slightly.
“Admission grade” is very weakly positively correlated with “Curricular units 1st sem (grade)” (r ≈ 0.07). Students with higher admission grades tend to have slightly higher grades in the first semester, but the linear association is close to zero.
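To judge whether a weak correlation is distinguishable from zero, `cor.test()` reports a confidence interval and p-value alongside the Pearson coefficient. A rank-based Spearman coefficient is a common alternative when the relationship may be monotonic but not linear. The sketch below uses toy vectors whose values are illustrative assumptions only.

```r
# Toy vectors (illustrative values only, not taken from the data set)
age       <- c(20, 19, 19, 20, 45, 50, 18, 22, 21, 18)
grade_1st <- c(0, 14, 0, 13.4, 12.3, 11.9, 13.3, 0, 13.9, 12.1)

# Pearson correlation with a 95% confidence interval and p-value
pearson <- cor.test(age, grade_1st, method = "pearson")
pearson$estimate
pearson$conf.int

# Rank-based alternative that does not assume a linear relationship
spearman <- cor(age, grade_1st, method = "spearman")
spearman
```

If the confidence interval for a coefficient includes zero, the data do not clearly indicate a linear association in either direction.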
Now, let’s build confidence intervals for “Curricular units 1st sem (grade)” and interpret the results:
# Confidence interval for "Curricular units 1st sem (grade)"
confidence_interval <- t.test(mpg$`Curricular units 1st sem (grade)`)
# Display the confidence interval
confidence_interval
##
## One Sample t-test
##
## data: mpg$`Curricular units 1st sem (grade)`
## t = 146.12, df = 4423, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 10.49805 10.78359
## sample estimates:
## mean of x
## 10.64082
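The 95% confidence interval for “Curricular units 1st sem (grade)” runs from about 10.50 to 10.78, around a sample mean of 10.64 (n = 4,424). If the sampling were repeated many times, about 95% of intervals built this way would contain the population mean grade in the first semester, so the population mean plausibly lies in this range. Note that the calculation includes students with a recorded grade of 0, which lowers the mean.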
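The interval reported by `t.test()` is the sample mean plus or minus a t critical value times the standard error. The sketch below rebuilds it by hand on a toy vector (illustrative values, not the real grades) and compares the result with `t.test()`.

```r
# Toy grades (illustrative values only, not taken from the data set)
grades <- c(0, 14, 0, 13.4, 12.3, 11.9, 13.3, 0, 13.9, 12.1)

n          <- length(grades)
mean_grade <- mean(grades)
std_error  <- sd(grades) / sqrt(n)      # standard error of the mean
t_crit     <- qt(0.975, df = n - 1)     # two-sided 95% critical value

# Manual interval: mean +/- t * SE
manual_ci <- c(mean_grade - t_crit * std_error,
               mean_grade + t_crit * std_error)

# Interval as reported by t.test()
builtin_ci <- as.numeric(t.test(grades)$conf.int)

manual_ci
builtin_ci
```

The two intervals agree, which shows that the width of the interval shrinks as the sample size grows or the spread of the grades decreases.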
# Combination 2:
# Response Variable: "Target"
# Explanatory Variables: "Admission grade," "Age at enrollment," "Displaced" ("Displaced" is not included in the model below)
# Perform logistic regression
# Recode "Dropout" to 0, and "Enrolled" and "Graduate" to 1 (this overwrites the original Target column)
mpg$Target <- ifelse(mpg$Target == "Dropout", 0, 1)
logistic_model <- glm(Target ~ `Admission grade` + `Age at enrollment`, data = mpg, family = binomial)
# Summary of the logistic regression model
summary(logistic_model)
##
## Call:
## glm(formula = Target ~ `Admission grade` + `Age at enrollment`,
## family = binomial, data = mpg)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.599896 0.312229 1.921 0.0547 .
## `Admission grade` 0.014037 0.002336 6.009 1.87e-09 ***
## `Age at enrollment` -0.068731 0.004381 -15.690 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 5554.5 on 4423 degrees of freedom
## Residual deviance: 5246.2 on 4421 degrees of freedom
## AIC: 5252.2
##
## Number of Fisher Scoring iterations: 4
The coefficient for “Admission grade” is positive and statistically significant (0.014, p < 0.001), indicating that higher admission grades are associated with higher odds of the outcome coded 1 (“Enrolled” or “Graduate”) rather than 0 (“Dropout”). Similarly, the coefficient for “Age at enrollment” is negative and significant (-0.069, p < 0.001), suggesting that older students have lower odds of being in the “Enrolled” or “Graduate” group.
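Logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios that are easier to interpret. The sketch below uses the estimates and standard errors from the summary output above. The vector names are hypothetical, and the intervals are approximate Wald intervals, which is an assumption about the method rather than what `confint()` would report for the fitted model.

```r
# Estimates and standard errors copied from the summary output above
estimates  <- c(admission_grade = 0.014037, age_at_enrollment = -0.068731)
std_errors <- c(admission_grade = 0.002336, age_at_enrollment = 0.004381)

# Odds ratio per one-unit increase in each explanatory variable
odds_ratios <- exp(estimates)

# Approximate 95% Wald intervals, built on the log-odds scale
z          <- qnorm(0.975)
wald_lower <- exp(estimates - z * std_errors)
wald_upper <- exp(estimates + z * std_errors)

round(cbind(odds_ratios, wald_lower, wald_upper), 3)
```

An odds ratio above 1 means the odds of not dropping out rise with the variable, and a ratio below 1 means they fall.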
# Combination 3:
# Response Variable: "Unemployment rate"
# Explanatory Variables: "Inflation rate," "GDP"
# Fit a linear regression model on the mpg data frame
model <- lm(`Unemployment rate` ~ `Inflation rate` + `GDP`, data = mpg)
# Summarize the regression model
summary(model)
##
## Call:
## lm(formula = `Unemployment rate` ~ `Inflation rate` + GDP, data = mpg)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.685 -1.233 0.264 2.453 4.143
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.72633 0.05050 232.219 < 2e-16 ***
## `Inflation rate` -0.12980 0.02740 -4.737 2.24e-06 ***
## GDP -0.40222 0.01669 -24.096 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.504 on 4421 degrees of freedom
## Multiple R-squared: 0.1168, Adjusted R-squared: 0.1164
## F-statistic: 292.4 on 2 and 4421 DF, p-value: < 2.2e-16
The linear regression analysis shows that both the inflation rate and GDP have statistically significant negative relationships with the unemployment rate. Holding the other variable fixed, a one-unit increase in the inflation rate is associated with a decrease of about 0.13 in the unemployment rate, and a one-unit increase in GDP with a decrease of about 0.40.
However, the model explains only about 11.68% of the variance in unemployment rates, suggesting that other unaccounted factors also play a significant role. These findings can be useful for understanding economic trends but should be considered alongside other economic indicators for a comprehensive analysis.
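The summary statistics in the regression output are linked by simple formulas, which gives a quick consistency check. The sketch below recomputes the F-statistic and the adjusted R-squared from the reported R-squared and degrees of freedom. The variable names are hypothetical, and small differences from the printed values come from rounding in the output.

```r
# Values taken from the regression summary above
r_squared <- 0.1168
n_obs     <- 4424   # 4421 residual degrees of freedom + 3 estimated coefficients
k         <- 2      # number of explanatory variables

# F-statistic for the overall model, derived from R-squared
f_stat <- (r_squared / k) / ((1 - r_squared) / (n_obs - k - 1))

# Adjusted R-squared penalises the number of explanatory variables
adj_r_squared <- 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - k - 1)

f_stat
adj_r_squared
```

With this many observations the F-statistic is large even though R-squared is small, which is why a highly significant model can still explain little of the variance.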