Introduction

  • Understanding what influences student success and dropout rates is crucial for educational institutions. This project analyzes a dataset from the UCI Machine Learning Repository, recognized for its comprehensive educational data. The dataset includes variables on student demographics, academic history, and socio-economic factors, providing a well-rounded perspective on the student experience.

  • Our analysis aims to uncover patterns that shed light on student outcomes, equipping educators and policymakers with data-driven insights to support student achievement and reduce dropout rates. Employing statistical methods in R, this report not only forecasts academic outcomes but also identifies key factors that contribute to student performance.

  • We present our findings with the intention to inform and guide educational strategies, focusing on enhancing student experiences and success rates.

Dataset Source

Exploratory Data Analysis

  • Having loaded our dataset into a dataframe, our first step is to understand what we’re working with. This initial exploration is crucial to ensure our data is accurate and complete, forming a reliable foundation for our analysis. Below is our dataset:
## Rows: 4,424
## Columns: 37
## $ `Marital status`                                 <dbl> 1, 1, 1, 1, 2, 2, 1, …
## $ `Application mode`                               <dbl> 17, 15, 1, 17, 39, 39…
## $ `Application order`                              <dbl> 5, 1, 5, 2, 1, 1, 1, …
## $ Course                                           <dbl> 171, 9254, 9070, 9773…
## $ `Daytime/evening attendance\t`                   <dbl> 1, 1, 1, 1, 0, 0, 1, …
## $ `Previous qualification`                         <dbl> 1, 1, 1, 1, 1, 19, 1,…
## $ `Previous qualification (grade)`                 <dbl> 122.0, 160.0, 122.0, …
## $ Nacionality                                      <dbl> 1, 1, 1, 1, 1, 1, 1, …
## $ `Mother's qualification`                         <dbl> 19, 1, 37, 38, 37, 37…
## $ `Father's qualification`                         <dbl> 12, 3, 37, 37, 38, 37…
## $ `Mother's occupation`                            <dbl> 5, 3, 9, 5, 9, 9, 7, …
## $ `Father's occupation`                            <dbl> 9, 3, 9, 3, 9, 7, 10,…
## $ `Admission grade`                                <dbl> 127.3, 142.5, 124.8, …
## $ Displaced                                        <dbl> 1, 1, 1, 1, 0, 0, 1, …
## $ `Educational special needs`                      <dbl> 0, 0, 0, 0, 0, 0, 0, …
## $ Debtor                                           <dbl> 0, 0, 0, 0, 0, 1, 0, …
## $ `Tuition fees up to date`                        <dbl> 1, 0, 0, 1, 1, 1, 1, …
## $ Gender                                           <dbl> 1, 1, 1, 0, 0, 1, 0, …
## $ `Scholarship holder`                             <dbl> 0, 0, 0, 0, 0, 0, 1, …
## $ `Age at enrollment`                              <dbl> 20, 19, 19, 20, 45, 5…
## $ International                                    <dbl> 0, 0, 0, 0, 0, 0, 0, …
## $ `Curricular units 1st sem (credited)`            <dbl> 0, 0, 0, 0, 0, 0, 0, …
## $ `Curricular units 1st sem (enrolled)`            <dbl> 0, 6, 6, 6, 6, 5, 7, …
## $ `Curricular units 1st sem (evaluations)`         <dbl> 0, 6, 0, 8, 9, 10, 9,…
## $ `Curricular units 1st sem (approved)`            <dbl> 0, 6, 0, 6, 5, 5, 7, …
## $ `Curricular units 1st sem (grade)`               <dbl> 0.00000, 14.00000, 0.…
## $ `Curricular units 1st sem (without evaluations)` <dbl> 0, 0, 0, 0, 0, 0, 0, …
## $ `Curricular units 2nd sem (credited)`            <dbl> 0, 0, 0, 0, 0, 0, 0, …
## $ `Curricular units 2nd sem (enrolled)`            <dbl> 0, 6, 6, 6, 6, 5, 8, …
## $ `Curricular units 2nd sem (evaluations)`         <dbl> 0, 6, 0, 10, 6, 17, 8…
## $ `Curricular units 2nd sem (approved)`            <dbl> 0, 6, 0, 5, 6, 5, 8, …
## $ `Curricular units 2nd sem (grade)`               <dbl> 0.00000, 13.66667, 0.…
## $ `Curricular units 2nd sem (without evaluations)` <dbl> 0, 0, 0, 0, 0, 5, 0, …
## $ `Unemployment rate`                              <dbl> 10.8, 13.9, 10.8, 9.4…
## $ `Inflation rate`                                 <dbl> 1.4, -0.3, 1.4, -0.8,…
## $ GDP                                              <dbl> 1.74, 0.79, 1.74, -3.…
## $ Target                                           <chr> "Dropout", "Graduate"…

Our dataset comprises data on 4,424 students, spanning 37 diverse variables. These range from personal factors like marital status and parental background to academic aspects such as course details, grades, and broader economic indicators like unemployment and inflation rates.

In this exploration, our focus will be on identifying the variables that most significantly impact student success. We’ll look for patterns and trends within this data, which will guide our subsequent in-depth analysis. The aim is to provide you with clear, actionable insights that can inform strategies to enhance student success and reduce dropout rates.

Now that we have a general overview of our dataset, let’s dive deeper into the specifics. We’ll examine the distribution of key variables such as previous qualification grades, admission grades, age at enrollment, and gender. Understanding these distributions will help us identify patterns and potential areas of focus for enhancing student success.

library(corrplot)  # provides corrplot(), used for the correlation matrix below

par(bg = "#E7E5F1")
# Plotting histogram for a numerical variable
hist(student_data$`Previous qualification (grade)`, main = "Distribution of Previous Qualification Grades", xlab = "Grades", breaks = 30)

# Correlation analysis for selected numerical variables
correlation_matrix <- cor(student_data[, c("Age at enrollment", "Previous qualification (grade)", "Admission grade")])
corrplot(correlation_matrix, method = "circle", col = c("#6D9EC1","green", "#E46726"), 
         order = "hclust", addCoef.col = "black", 
         tl.col = "black", tl.srt = 45, 
         diag = FALSE)

# Comparing groups using boxplot
boxplot(student_data$`Previous qualification (grade)` ~ student_data$Gender,
        main = "Previous Qualification Grade by Gender",
        xlab = "Gender",
        ylab = "Grade")

par(bg = "white")
# Checking for missing values
sum(is.na(student_data))
## [1] 0

Insights from Exploratory Data Analysis

  • Grades Distribution: The histogram of previous qualification grades forms a roughly bell-shaped curve centered between about 130 and 140, indicating an approximately normal distribution of academic performance prior to college.

  • Correlation Matrix: The correlation matrix suggests that older students tend to have higher admission grades, so maturity may correlate with academic readiness.

  • Gender-Based Grade Comparison: The median performance by gender is consistent, yet the data reveals considerable variability, with some students markedly outperforming or underperforming their peers.

Inferential Statistical Analysis

In this section, we apply inferential statistical techniques to understand the relationships and differences within our data. Our aim is to determine whether observed patterns are statistically significant.

Relationship Between Marital Status and Student Success - Chi-Square Test of Independence

  • Our inquiry begins with the chi-square test of independence, which assesses the association between students’ marital status and their academic success. This analysis is critical, as it may reveal socio-demographic factors that influence educational trajectories.

We ran a statistical test to see whether a student’s marital status is connected to whether they graduate or drop out. The results were very clear: yes, there’s a significant connection.

The numbers from our test (called a Chi-Square test) show that this connection is unlikely to be due to chance alone. It’s strong enough that we should pay attention to it.

This means that whether a student is married, single, or in another marital status is associated with their chances of graduating or dropping out, although the test alone does not show that marital status causes the outcome. It’s an important clue for schools. Knowing this, schools can think about ways to support students better, considering their different life situations.
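The report does not show the code behind this test, so the following is only a minimal sketch of how a chi-square test of independence can be run in R. It uses a small synthetic data frame, not the real dataset; the names `toy`, `marital_status`, and `target`, and all probabilities, are made up for illustration.

```r
# Synthetic stand-in data (NOT the real dataset); labels and probabilities are invented
set.seed(42)
n <- 500
toy <- data.frame(
  marital_status = sample(c("single", "married", "other"), n,
                          replace = TRUE, prob = c(0.7, 0.2, 0.1)),
  stringsAsFactors = FALSE
)

# Give one group a higher dropout probability so the toy data carry some signal
p_dropout  <- ifelse(toy$marital_status == "married", 0.5, 0.3)
toy$target <- ifelse(runif(n) < p_dropout, "Dropout", "Graduate")

# Contingency table of marital status by outcome, then the test itself
tab  <- table(toy$marital_status, toy$target)
test <- chisq.test(tab)

test$expected  # expected counts; cells below about 5 make the approximation less reliable
test$p.value
```

With real data, sparse marital-status categories could produce small expected counts, which is worth checking before relying on the p-value.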

Comparison of Pre-College Grades and Student Outcomes

In our journey to decode the patterns of student success and dropout, we now turn to an insightful aspect of our analysis: comparing the pre-college academic performance of students who eventually dropped out versus those who graduated. This exploration involves a Welch Two Sample t-test, a robust statistical technique that helps us understand if the average grades before college differ significantly between these two groups.

## 
##  Welch Two Sample t-test
## 
## data:  dropouts and graduates
## t = -6.6849, df = 3106.5, p-value = 2.729e-11
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -3.839358 -2.097907
## sample estimates:
## mean of x mean of y 
##  131.1141  134.0827

The t-test results were revealing. We found a statistically significant difference in the pre-college grades between students who dropped out and those who graduated. Specifically:

  • t-value: The t-value of -6.6849 indicates that the difference between the group means is large relative to its standard error.
  • p-value: With a p-value of 2.729e-11, a difference this large would be highly unlikely to arise by chance if the two groups had the same mean.
  • Confidence Interval: The 95% confidence interval for the difference ranges from -3.839358 to -2.097907 and excludes zero, so the dropouts’ mean grade is estimated to be about 2 to 4 points lower.
  • Mean Grades: On average, students who dropped out had a lower grade (mean = 131.1) than those who graduated (mean = 134.1).

These results underscore that academic performance before entering college is related to later outcomes. Students with lower grades prior to college tend to be at a somewhat higher risk of dropping out, although the average gap of about three points is modest. This finding can help educational institutions to proactively identify and support at-risk students, potentially altering their academic trajectories for the better.
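To put the size of this difference in perspective, the figures in the t-test output can be cross-checked against each other. The sketch below uses only the numbers reported above; the variable names are ours. It recovers the mean difference, the standard error implied by the t-value, and the half-width of the 95% confidence interval.

```r
# Values copied from the Welch t-test output above
mean_dropout  <- 131.1141
mean_graduate <- 134.0827
t_stat        <- -6.6849
df_welch      <- 3106.5
ci_reported   <- c(-3.839358, -2.097907)

mean_diff  <- mean_dropout - mean_graduate            # difference in means (about -3 points)
se_diff    <- mean_diff / t_stat                      # standard error implied by t = diff / SE
half_width <- qt(0.975, df = df_welch) * se_diff      # half-width of the 95% interval

# The reconstructed interval should agree with the reported one
ci_rebuilt <- mean_diff + c(-1, 1) * half_width
rbind(reported = ci_reported, rebuilt = ci_rebuilt)
```

The rebuilt interval matching the reported one confirms that the output is internally consistent, and that the underlying gap is roughly three grade points.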

Our analysis highlights the need for targeted interventions right from the onset of college. By identifying students who may need additional academic support based on their pre-college performance, educators and administrators can tailor their resources effectively, offering mentoring, tutoring, or remedial classes to ensure these students do not fall behind. This phase of our analysis sheds light on the significant role of early academic performance in determining student success. It provides a compelling case for educational institutions to consider pre-college grades as a critical factor in their student support strategies. By doing so, they can make strides in reducing dropout rates and enhancing overall student achievement.

Research Questions and Hypothesis Tests

This section explores the potential differences in unemployment rates between students who drop out and those who graduate.

Question 1: Is there a significant difference in the unemployment rate between students who drop out and those who graduate?

Null Hypothesis (H0): There is no significant difference in the unemployment rate between the two groups.

Alternative Hypothesis (H1): There is a significant difference in the unemployment rate between the two groups.

  • Statistical Test: T-Test for Independent Samples

We will perform a T-Test to compare the unemployment rates between the two groups of students.

library(dplyr)  # needed for %>%, filter(), and select()

# Keep only the outcome groups and columns needed for the t-test
unemployment_data <- student_data %>%
  filter(Target %in% c("Dropout", "Graduate")) %>%
  select(Target, `Unemployment rate`)

# Ensure 'Unemployment rate' is numeric
unemployment_data$`Unemployment rate` <- as.numeric(unemployment_data$`Unemployment rate`)

# Perform the T-Test
dropouts_unemployment_rate <- unemployment_data[unemployment_data$Target == "Dropout", ]$`Unemployment rate`
graduates_unemployment_rate <- unemployment_data[unemployment_data$Target == "Graduate", ]$`Unemployment rate`

t_test_unemployment <- t.test(dropouts_unemployment_rate, graduates_unemployment_rate)
t_test_unemployment
## 
##  Welch Two Sample t-test
## 
## data:  dropouts_unemployment_rate and graduates_unemployment_rate
## t = -0.24948, df = 2891.5, p-value = 0.803
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.2032549  0.1573706
## sample estimates:
## mean of x mean of y 
##  11.61640  11.63934

The boxplot visualizing the unemployment rates among students who drop out versus those who graduate portrays a similar distribution for both categories. The median unemployment rate for both dropouts and graduates lies around the same value, with no significant outliers indicating exceptional cases.

The Welch Two Sample t-test supports this visual observation quantitatively, revealing no significant difference in the unemployment rates between the two groups (t = -0.24948, p-value = 0.803). The means of both groups are nearly identical, with dropouts at 11.616 and graduates at 11.639, and the confidence interval (-0.203 to 0.157) straddling zero further reinforces the lack of a noteworthy difference.

In summary, the visual and statistical analyses converge on a key insight: the unemployment rate recorded for students who dropped out does not differ meaningfully from that recorded for students who graduated. Because the unemployment rate in this dataset is a broad economic indicator rather than a measure of each student’s own employment, the result suggests that this aspect of the economic climate does not distinguish dropouts from graduates. For stakeholders, this implies that factors other than the unemployment rate are more likely to explain differences in student outcomes.

Question 2: Are certain application modes more associated with student dropout?

Null Hypothesis (H0): Application mode is independent of student dropout.

Alternative Hypothesis (H1): Application mode is associated with student dropout.

Statistical Test: Chi-Square Test of Independence.

library(dplyr)
library(ggplot2)
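# Placeholder labels only: the data contain 18 application-mode codes
# (the test below has 17 degrees of freedom), so these four labels do not
# map onto the real categories and are not used in the test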
app_mode_labels <- c("Online", "Mail-In", "In-Person", "Other")

# Filtering the dataset for 'Dropout' and 'Graduate' as target outcomes
app_mode_analysis <- student_data %>%
  filter(Target %in% c("Dropout", "Graduate")) %>%
  select(`Application mode`, Target)

# Handling missing values
app_mode_analysis <- app_mode_analysis %>%
  filter(!is.na(`Application mode`) & !is.na(Target))

# Perform the Chi-Square Test of Independence
chi_test_app_mode <- chisq.test(table(app_mode_analysis$`Application mode`, app_mode_analysis$Target))
chi_test_app_mode
## 
##  Pearson's Chi-squared test
## 
## data:  table(app_mode_analysis$`Application mode`, app_mode_analysis$Target)
## X-squared = 392.07, df = 17, p-value < 2.2e-16

Test Result:

  • X-squared: 392.07
  • Degrees of Freedom: 17
  • p-value: Less than 2.2e-16

We used a statistical test called Chi-Squared to see if the way students applied to the program (the ‘Application mode’) is related to whether they ended up graduating or dropping out. The test gave us a very clear answer: yes, there’s a significant relationship.

The test statistic (X-squared = 392.07 on 17 degrees of freedom) and the very small p-value tell us that the connection we’re seeing is unlikely to be a coincidence. Different application modes are linked with different outcomes for students: certain modes are more strongly associated with either dropping out or graduating. Because this is an association, it does not by itself show that the application route causes the outcome.

For instance, modes like ‘Change of course’ and ‘Technological specialization diploma holders’ show a higher proportion of graduates, suggesting that students applying through these modes may have a greater likelihood of completing their studies. Conversely, modes like ‘International student (bachelor)’ and ‘1st phase - special contingent (Madeira Island)’ display a more balanced or higher proportion of dropouts, which may identify areas where additional support could be instrumental in improving student retention and success.

Understanding the specific characteristics of these application modes can inform targeted support strategies to help at-risk students succeed.
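A significant chi-square result says that some modes differ, but not which ones. One common way to see that is to compare row proportions and standardized residuals. The sketch below illustrates the idea on synthetic data, not the real dataset; the mode names (`mode_A` and so on) and dropout probabilities are invented for illustration.

```r
# Synthetic stand-in data (NOT the real dataset); mode names and rates are invented
set.seed(1)
n <- 600
toy <- data.frame(
  mode = sample(c("mode_A", "mode_B", "mode_C"), n, replace = TRUE),
  stringsAsFactors = FALSE
)
dropout_prob <- c(mode_A = 0.2, mode_B = 0.4, mode_C = 0.6)[toy$mode]
toy$target   <- ifelse(runif(n) < dropout_prob, "Dropout", "Graduate")

tab <- table(toy$mode, toy$target)

# Share of dropouts and graduates within each application mode (rows sum to 1)
row_props <- prop.table(tab, margin = 1)
round(row_props, 2)

# Standardized residuals: cells far from 0 (roughly beyond +/-2) drive the chi-square result
std_res <- chisq.test(tab)$stdres
round(std_res, 1)
```

Applied to the real table, the same two summaries would show which application modes have more dropouts or graduates than independence would predict.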

Question 3: Is there a correlation between ‘Previous qualification (grade)’ and ‘Admission grade’?

Null Hypothesis (H0): There is no correlation between the two variables.

Alternative Hypothesis (H1): There is a correlation between the two variables.

  • Statistical Test: Pearson Correlation.
## 
##  Pearson's product-moment correlation
## 
## data:  student_data$`Previous qualification (grade)` and student_data$`Admission grade`
## t = 47.401, df = 4422, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.5605639 0.5996559
## sample estimates:
##       cor 
## 0.5804442
## `geom_smooth()` using formula = 'y ~ x'

The results from our Pearson’s product-moment correlation test provide important insights about the relationship between students’ ‘Previous qualification (grade)’ and their ‘Admission grade’. Let’s break down the results:

  • Correlation Coefficient (cor): 0.5804442
  • t-value: 47.401
  • Degrees of Freedom (df): 4422
  • p-value: Less than 2.2e-16 (extremely small)
  • 95% Confidence Interval: 0.5605639 to 0.5996559

Interpretation of Results:

  • Strength of the Relationship: The correlation coefficient of approximately 0.580 suggests a moderate to strong positive correlation between ‘Previous qualification (grade)’ and ‘Admission grade’. This means that, generally, students with higher grades in their previous qualifications tend to have higher admission grades.
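As a quick consistency check, the correlation coefficient can be recovered from the t-value and degrees of freedom in the output, using the standard relationship r = t / sqrt(t² + df). The sketch below uses only the reported numbers; the variable names are ours.

```r
# Values copied from the Pearson correlation output above
t_stat <- 47.401
df     <- 4422

# Recover r from the test statistic: r = t / sqrt(t^2 + df)
r <- t_stat / sqrt(t_stat^2 + df)

# r squared: proportion of variance the two grades share in a linear sense
r_squared <- r^2

round(c(r = r, r_squared = r_squared), 4)
```

The recovered r agrees with the reported 0.5804442, and r squared is about 0.34, meaning roughly a third of the variation in admission grades is shared with previous qualification grades.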

  • This graph shows us that students who did well in previous studies tend to start college with higher grades too. The upward trend of the line tells us there’s a consistent link between past and starting college grades. It’s like saying, the better you do before college, the better you’re likely to start off there.

  • Our analysis shows a noticeable link between students’ grades before college and their admission grades. Students who had higher grades before generally also had higher admission grades. This connection is unlikely to be a coincidence; it’s a moderately strong, statistically significant pattern.

Knowing this, schools might use previous academic performance as one indicator of how students are likely to start college. It could help in identifying students who might need extra support right from the start of their college journey.


Advanced Overview

2. Socioeconomic Influence on Academic Trajectories

In our quest to unravel the complexities of student success, we turn our attention to the socioeconomic tapestry that underlies academic trajectories. It is often posited that a student’s background could significantly shape their educational journey. This section examines the potential influence of parental qualifications, occupation, and other socioeconomic indicators on students’ academic performance and their eventual outcomes.

We seek to determine whether a correlation exists between these factors and the number of curricular units completed, grades achieved, and the likelihood of graduation or dropout. By shedding light on these correlations, we aim to provide actionable insights that can help tailor support systems to the nuanced needs of diverse student populations.

Education does not exist in a vacuum, and it is often influenced by the socioeconomic background of students. In this section, we examine how the educational achievements of parents might impact the academic success of their children. By categorizing parental qualifications into distinct levels, from basic education to higher degrees, we aim to uncover patterns that may inform the support strategies educational institutions employ.

We anticipate that students whose parents have attained higher education degrees might have access to more educational resources or support systems, potentially contributing to higher rates of graduation. Conversely, students with parents who have not completed higher education may face more challenges in their academic journey. This analysis will explore these relationships and highlight areas where targeted support could make a significant difference in student retention and success.

The chart shows a clear trend where students with parents who have a ‘Higher Education - Bachelor’s Degree’ or higher are more likely to graduate. The blue bars, representing graduates, are notably higher in these categories, suggesting a positive correlation between parental education level and student success.

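Conversely, in the ‘Secondary Education - 12th Year’ category, there is a large number of students who have dropped out (indicated by the red bar), the highest count of dropouts across all parental qualification levels. Because the chart shows raw counts, this partly reflects how many students fall into this category; comparing dropout proportions within each level would be a fairer comparison.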

For the ‘Higher Education - Master (2nd cycle)’ and ‘Professional higher technical course’, the number of enrolled students (green bars) is relatively high, which may indicate that these students are currently continuing their education.

The lowest counts across the board are seen in the highest qualification levels such as ‘Doctorate’ and ‘Master’s’. This mainly reflects how few students have parents with these qualifications, so the small bars on their own do not demonstrate lower dropout rates. One possible explanation for the graduate-heavy pattern at higher parental education levels is a stronger educational support system at home, but the chart alone cannot confirm this.

The data implies that parental education levels could be a significant factor in student outcomes. Schools and policymakers might consider this when developing targeted interventions to support students at risk of dropping out, particularly those whose parents have lower levels of educational attainment.

3. The Influence of External Economic Factors

We now turn our attention to external factors that extend beyond the campus: the economic indicators of unemployment and inflation rates. How might these macroeconomic variables interplay with the microcosm of academic success?

## 
##  Pearson's product-moment correlation
## 
## data:  student_data$`Unemployment rate` and student_data$`Previous qualification (grade)`
## t = 3.0103, df = 4422, p-value = 0.002625
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.01577452 0.07459164
## sample estimates:
##        cor 
## 0.04522227

The Pearson product-moment correlation test was performed to evaluate the relationship between the unemployment rate and the grades from students’ previous qualifications within the dataset.

  • Correlation Coefficient (cor): The test resulted in a correlation coefficient of approximately 0.045. This indicates a very weak positive correlation between the unemployment rate and students’ previous qualification grades.

  • Statistical Significance (p-value): With a p-value of approximately 0.0026, which is less than the conventional alpha level of 0.05, the test suggests that there is a statistically significant correlation between the two variables. However, given the very low correlation coefficient, the practical significance of this relationship is likely to be minimal.

  • Confidence Interval: The 95% confidence interval ranges from about 0.016 to 0.075, which does not include zero. This supports the finding that there is a statistically significant correlation, but the fact that the whole interval lies so close to zero highlights the weak nature of the relationship.
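The gap between statistical and practical significance can be made concrete with a short calculation. The sketch below uses only the reported correlation and the dataset size; it computes r squared and rebuilds the 95% confidence interval with the Fisher z-transform, the method `cor.test()` uses for Pearson correlations. The variable names are ours.

```r
# Values taken from the output above and from the dataset dimensions
r <- 0.04522227   # reported correlation
n <- 4424         # number of students (df = n - 2 = 4422)

# Proportion of variance shared by the two variables
r_squared <- r^2

# 95% confidence interval via the Fisher z-transform
z      <- atanh(r)
se_z   <- 1 / sqrt(n - 3)
z_crit <- qnorm(0.975)
ci     <- tanh(z + c(-1, 1) * z_crit * se_z)

round(r_squared, 4)
round(ci, 4)
```

The rebuilt interval matches the reported one, and r squared comes out near 0.002: the unemployment rate accounts for about 0.2% of the variation in previous qualification grades, which is why a significant p-value here carries little practical weight.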

The hexbin plot visually represents the density of data points where the unemployment rate is plotted against previous qualification grades. Denser areas with more data points are colored more intensely. The spread of data points across various unemployment rates doesn’t show a distinct pattern, which corresponds with the weak positive correlation indicated by the Pearson correlation coefficient of 0.045. This visualization helps stakeholders to quickly grasp the distribution and density of the data points, reinforcing the conclusion that while there is a statistically significant relationship between unemployment rates and previous qualification grades, it is practically weak and may not have substantial implications on student academic performance.

Data Density: Darker hexagons indicate a higher concentration of students with similar unemployment rates and previous qualification grades. Lighter or absent hexagons suggest fewer students in those areas.

Overall Pattern: The plot does not show a clear trend or direction, which is consistent with the very weak positive correlation. There’s no distinct area where hexagons consistently shift from one color to another indicating a strong relationship.

Grade Range: Most data points seem to cluster around the middle range of previous qualification grades, regardless of the unemployment rate.

  • Our examination into how the broader economy affects students reveals a slight connection between unemployment rates and student grades before college. Although this link is statistically significant, it’s very weak, meaning that unemployment rates barely relate to how well students performed in their previous schooling. While economic conditions undoubtedly play a role in educational contexts, this particular aspect—unemployment rates—doesn’t seem to strongly predict students’ prior academic success.

4. Gender Dynamics in Education

Gender has long been a focal point of educational studies. We dissect this further by examining the influence of gender on academic outcomes within our dataset.

## 
##  Pearson's Chi-squared test
## 
## data:  table(student_data$Gender, student_data$Target)
## X-squared = 233.27, df = 2, p-value < 2.2e-16

The Pearson Chi-squared test was used to examine the relationship between gender and academic outcomes (such as Dropout, Enrolled, Graduate) within the dataset.

  • Chi-squared (X-squared): The value of the chi-squared statistic is 233.27.

  • Degrees of Freedom (df): There are 2 degrees of freedom in this test, calculated as (rows - 1) × (columns - 1) = (2 - 1) × (3 - 1) for two gender categories and three outcomes (Dropout, Enrolled, Graduate).

  • Statistical Significance (p-value): The p-value is less than 2.2e-16, which is exceedingly small and far below the conventional threshold of 0.05.

  • Interpretation of Results: The extremely small p-value indicates that there is a statistically significant association between gender and academic outcomes. The chi-squared statistic grows with sample size, however, so it does not by itself measure how strong the association is.

Our statistical analysis reveals a clear and statistically significant connection between students’ gender and their educational results. Gender is associated with whether students drop out, stay enrolled, or graduate. This indicates that gender is not just a background detail but a factor that is linked to academic outcomes, although the test does not tell us why.
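Since the chi-squared statistic depends on sample size, an effect-size measure such as Cramér's V is a common companion to it. The report does not compute one, so the following is our own sketch using only the reported statistic, the dataset size, and the table dimensions (two gender categories by three outcomes).

```r
# Values taken from the chi-squared output and the dataset dimensions
chi_sq <- 233.27
n      <- 4424   # students in the table
n_rows <- 2      # gender categories
n_cols <- 3      # outcomes: Dropout, Enrolled, Graduate

# Degrees of freedom for a test of independence
df <- (n_rows - 1) * (n_cols - 1)

# Cramér's V: sqrt(chi-squared / (n * (min(rows, cols) - 1))), ranging from 0 to 1
cramers_v <- sqrt(chi_sq / (n * (min(n_rows, n_cols) - 1)))

round(c(df = df, cramers_v = cramers_v), 3)
```

V comes out at about 0.23, which under commonly used rules of thumb would be read as a small-to-moderate association rather than a very strong one.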

5. The Age Factor in Academic Success

In the quest to understand the factors influencing academic trajectories, we now turn our focus to the age at which students embark on their higher education journey. Does the maturity that comes with age translate into academic success?

## # weights:  9 (4 variable)
## initial  value 4860.260765 
## iter  10 value 4373.368771
## final  value 4373.368706 
## converged

The model estimates the effects of Age at enrollment on the likelihood of being in the “Enrolled” and “Graduate” categories, compared to the baseline category, which in this case is “Dropout”.

Enrolled vs Dropout:

  • Intercept: The intercept for “Enrolled” is 0.7995769. This represents the log odds of being enrolled (versus dropping out) when the Age at enrollment is zero. Since age cannot be zero, this is a theoretical value and should be interpreted with caution.

  • Age at enrollment: The coefficient for Age at enrollment is -0.05763941. This negative sign indicates that as age at enrollment increases, the log odds of being enrolled (as opposed to dropping out) decrease.

Graduate vs Dropout:

  • Intercept: The intercept for “Graduate” is 2.1723561, representing the log odds of graduating (versus dropping out) when the Age at enrollment is zero.

  • Age at enrollment: The coefficient for Age at enrollment is -0.07348313, suggesting that older students at the time of enrollment are less likely to graduate compared to dropping out, relative to younger students.

  • Standard Errors: The standard errors for the coefficients measure the standard deviation of the estimated coefficients across multiple samples. They are used for hypothesis testing and constructing confidence intervals.

  • Residual Deviance and AIC: The Residual Deviance is a measure of how well the model fits the data. A lower residual deviance indicates a better fit. The AIC (Akaike Information Criterion) is used for model comparison; lower AIC values indicate a model that better balances fit and complexity.

The coefficients for ‘Age at enrollment’ are negative for both ‘Enrolled’ and ‘Graduate’ categories, suggesting that with every additional year, the odds of being in either of these categories compared to ‘Dropout’ decrease slightly. The plot supports this analysis by visualizing the coefficient estimates and their confidence intervals for ‘Enrolled’ and ‘Graduate’ outcomes. The error bars indicate the range of estimates within which we can be confident the true values lie, based on the data.

Our analysis suggests that age is related to students’ academic paths. Specifically, as students get older, they are slightly less likely to remain enrolled or graduate compared to dropping out. This trend is statistically significant and consistent across both categories when compared to the baseline of dropping out.
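Log-odds coefficients are easier to read once converted to odds ratios with `exp()`. The sketch below uses only the two age coefficients reported above; the variable names and the 10-year comparison are our own illustration.

```r
# Age coefficients (log odds per year of age) reported above, baseline = Dropout
age_coef <- c(Enrolled = -0.05763941, Graduate = -0.07348313)

# Odds ratio per additional year of age at enrollment
or_per_year <- exp(age_coef)

# Percentage change in the odds (relative to Dropout) per additional year
pct_change <- (or_per_year - 1) * 100

# Cumulative odds ratio for a 10-year age difference (illustrative choice)
or_per_decade <- exp(10 * age_coef)

round(rbind(or_per_year, pct_change, or_per_decade), 3)
```

Each additional year of age multiplies the odds of being enrolled rather than dropping out by about 0.944 (roughly 5.6% lower) and the odds of graduating rather than dropping out by about 0.929 (roughly 7.1% lower). The per-year effect is small, but it compounds over larger age gaps.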

6. The Scholarship Effect on Student Success

In this part of our study, we examine the role of financial aid, specifically scholarships, in shaping student outcomes. We aim to uncover whether the provision of scholarships is a significant factor in determining whether students are more likely to graduate, remain enrolled, or drop out.

## 
## Call:
## glm(formula = Target ~ Displaced + `Educational special needs` + 
##     Debtor + `Tuition fees up to date` + Gender + `Scholarship holder`, 
##     family = "binomial", data = data)
## 
## Coefficients:
##                             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                 -1.61026    0.14953 -10.769  < 2e-16 ***
## Displaced                    0.21771    0.07459   2.919  0.00352 ** 
## `Educational special needs` -0.36816    0.33352  -1.104  0.26965    
## Debtor                      -0.51404    0.12544  -4.098 4.17e-05 ***
## `Tuition fees up to date`    2.66271    0.14249  18.687  < 2e-16 ***
## Gender                      -0.70508    0.07578  -9.305  < 2e-16 ***
## `Scholarship holder`         1.26200    0.10724  11.768  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 5554.5  on 4423  degrees of freedom
## Residual deviance: 4444.6  on 4417  degrees of freedom
## AIC: 4458.6
## 
## Number of Fisher Scoring iterations: 4

Tuition Fees Up to Date: This has the largest positive effect on the target outcome, suggesting that students who are up to date with their tuition fees are significantly more likely to graduate.

Scholarship Holder: Similarly, having a scholarship is a strong positive predictor of student success, implying that financial support is crucial.

Debtor: Being in debt has a negative effect on student outcomes, indicating that financial burdens may lead to a higher likelihood of dropping out.

Gender: There is a negative coefficient for gender, which could mean that the gender coded as ‘1’ (male) is less likely to graduate compared to ‘0’ (female).

Displaced: There is a positive association with being displaced. The model does not explain why; one possible reading is that displaced students put more effort into their education.

Educational Special Needs: The estimated effect is negative but not statistically significant (p ≈ 0.27), so this model provides no clear evidence that educational special needs affect whether a student graduates or drops out.

For stakeholders, these insights are crucial. They highlight the importance of financial stability and support in educational success. These findings could lead to recommendations such as providing more scholarships, financial counseling, and support for students who are struggling financially. It also opens a discussion about the support systems in place for students with educational special needs and how they might be improved.

It’s important to note that the significance levels (denoted by stars) also give us a good idea of which factors are most reliable in predicting outcomes. Factors like ‘Tuition fees up to date’ and ‘Scholarship holder’ are highly significant, whereas ‘Educational special needs’ is not, suggesting that the latter might not be a decisive factor in student success within this dataset.
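Because the coefficients above are on the log-odds scale, exponentiating them gives odds ratios, which are often easier to explain to stakeholders. The sketch below re-enters the estimates and standard errors from the printed summary by hand (the names `est`, `se` and `odds_ratios` are illustrative, not from the original analysis) and adds approximate 95% Wald intervals.

```r
# Estimates and standard errors copied from the model summary above
est <- c(
  "Displaced"                 =  0.21771,
  "Educational special needs" = -0.36816,
  "Debtor"                    = -0.51404,
  "Tuition fees up to date"   =  2.66271,
  "Gender"                    = -0.70508,
  "Scholarship holder"        =  1.26200
)
se <- c(0.07459, 0.33352, 0.12544, 0.14249, 0.07578, 0.10724)

# Odds ratio = exp(coefficient); Wald interval = exp(estimate +/- 1.96 * SE)
odds_ratios <- data.frame(
  odds_ratio = exp(est),
  lower_95   = exp(est - 1.96 * se),
  upper_95   = exp(est + 1.96 * se)
)
print(round(odds_ratios, 2))
```

Read this way, being up to date with tuition fees multiplies the odds of the positive outcome by roughly 14, holding the other predictors fixed. An interval that contains 1, as for educational special needs, corresponds to a coefficient that is not statistically significant.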

Final Conclusion for Stakeholders and Readers of this Report

In concluding this comprehensive analysis, we’ve delved into a myriad of factors affecting student success and dropout rates. From socio-demographic variables to economic indicators and personal academic history, our exploration has provided valuable insights. The statistically significant patterns unearthed—such as the correlation between pre-college grades and admission rates, the impact of application modes on student outcomes, and the influence of age on academic trajectories—offer actionable information for educational strategies.

Financial stability, underscored by the scholarship effect, emerges as a crucial element in student achievement. Gender dynamics and parental education levels also play significant roles. These findings emphasize the need for a holistic approach to student support, one that considers individual backgrounds, financial circumstances, and educational readiness.

Educational institutions and policymakers are encouraged to leverage these insights to create targeted interventions, ensuring that all students have the support and resources they need to thrive academically. By doing so, we can work towards an educational environment where every student has the opportunity to reach their full potential, regardless of their starting point.

To wrap up, our deep look into what helps students succeed or causes them to leave school has shown us a lot. Simple things like being older, how you apply, and your grades before college can all affect whether you stay in school or graduate. Money matters too; students with scholarships or who keep up with fees do better.

What we’ve learned says a lot about what schools can do. They can help by spotting troubles early, giving more scholarships, and supporting students who have a harder time. It’s clear that helping students means thinking about everything, from their age and grades to their money situation. With this info, schools can really help every student do their best.

This comprehensive analysis underscores the multifaceted nature of academic success and the factors influencing student dropout rates. From the impact of socio-economic backgrounds to the importance of early academic performance, our findings provide a nuanced understanding of the student experience.

Key Takeaways:

  • Early identification of at-risk students through academic performance indicators is crucial.
  • Socioeconomic factors, such as parental education and financial stability, play a significant role in student outcomes.
  • Gender dynamics and age at enrollment are significant factors that educational strategies must address.
  • Scholarships and financial aid emerge as vital supports for student success.

Recommendations:

  • Implement targeted support for students with lower pre-college grades and those from less advantaged socio-economic backgrounds.
  • Foster inclusive educational practices that address the specific needs of diverse student demographics.
  • Enhance financial aid programs to alleviate the impact of economic challenges on academic continuity.

Moving Forward:

  • Continuous monitoring and adaptation of educational strategies based on data-driven insights are essential.
  • Collaboration between educational institutions and policymakers can lead to more effective support systems.
  • Future research should explore longitudinal outcomes to refine and personalize educational interventions.

By embracing a data-informed approach, we can create an educational environment that not only recognizes diversity in student backgrounds but also actively works to support every student’s academic journey.

Appendix

Advanced Visualization: Interaction Effects

Examining the interaction between age at enrollment, admission grades, and first semester grades.

The plot is a scatter plot showing the interaction effect of Age at enrollment and Admission Grade on First Semester Performance. The color gradient indicates the age at enrollment, with the red end of the spectrum representing older students. The linear regression line suggests a trend in the data. The color gradient suggests that age may play a role in academic performance, although the trend is subtle. The plot indicates that while there is a general positive trend between admission grades and first-semester performance, the variation is large, implying that other factors may also play an important role.

Filtering

Encoding

  • Encoding refers to the process of converting data from one format or representation to another. In the context of data analysis and machine learning, encoding is often used to transform categorical or text data into numerical form, which can be more easily processed and utilized by algorithms.

  • Label Encoding: In label encoding, each unique category in a categorical variable is assigned an integer label. For example, if we have categories like “Red,” “Green,” and “Blue,” they could be encoded as 0, 1, and 2, respectively. However, caution should be exercised when using label encoding for nominal (unordered) data, as the numerical representation may introduce an unintended ordinal relationship. With only two classes, as in the encoded target below, this is not a concern.

## 
##    0    1 
## 1421 2209
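As a minimal sketch of the filtering and encoding steps, the snippet below label-encodes a toy outcome vector. The original code is not shown, so everything here is an assumption: that the raw `Target` column holds the strings "Dropout", "Enrolled" and "Graduate", that "Enrolled" rows are filtered out, and that Dropout maps to 0 and Graduate to 1.

```r
# Toy stand-in for the Target column (hypothetical values, not the real data)
target <- c("Dropout", "Graduate", "Enrolled", "Graduate", "Dropout", "Graduate")

# Filtering: keep only the two outcomes that will be modelled
kept <- target[target != "Enrolled"]

# Label encoding: fix the level order explicitly, then shift to start at 0
encoded <- as.integer(factor(kept, levels = c("Dropout", "Graduate"))) - 1L

# Class counts after encoding, analogous to the table printed above
table(encoded)
```

Setting `levels` explicitly avoids relying on alphabetical ordering, so the meaning of 0 and 1 stays stable if the labels change.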

Splitting Dataset :

Splitting a dataset refers to the process of dividing a given dataset into two or more subsets for training and evaluation purposes. The most common type of split is between the training set and the testing (or validation) set. This division allows us to assess the performance of a machine learning model on unseen data and evaluate its generalization capabilities.

Train-Test Split: This is the most basic type of split, where the dataset is divided into a training set and a testing set. The training set is used to train the machine learning model, while the testing set is used to evaluate its performance. The split is typically done using a fixed ratio, such as 80% for training and 20% for testing.

## # A tibble: 6 × 36
##   `Marital status` `Application mode` `Application order` Course
##              <dbl>              <dbl>               <dbl>  <dbl>
## 1                1                 17                   5    171
## 2                1                 15                   1   9254
## 3                1                  1                   5   9070
## 4                1                 17                   2   9773
## 5                1                  1                   1   9500
## 6                1                 18                   4   9254
## # ℹ 32 more variables: `Daytime/evening attendance\t` <dbl>,
## #   `Previous qualification` <dbl>, `Previous qualification (grade)` <dbl>,
## #   Nacionality <dbl>, `Mother's qualification` <dbl>,
## #   `Father's qualification` <dbl>, `Mother's occupation` <dbl>,
## #   `Father's occupation` <dbl>, `Admission grade` <dbl>, Displaced <dbl>,
## #   `Educational special needs` <dbl>, Debtor <dbl>,
## #   `Tuition fees up to date` <dbl>, Gender <dbl>, …
## [1] 0 1 0 1 1 0
## [1] 2541   36
## [1] 1089   36
## [1] 2541
## [1] 1089

The data is now split into training and test sets: the training set has 2,541 rows and the test set has 1,089 rows, each with 36 columns.
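A 70/30 split of 3,630 rows reproduces the set sizes printed above. The sketch below shows one way to do this in base R on a synthetic data frame; the seed, the `toy` data and the use of `sample()` are assumptions, and the original report may have used a different (for example stratified) helper.

```r
set.seed(123)  # arbitrary seed, only so this sketch is reproducible

# Synthetic stand-in with the same number of rows as the encoded data
n_rows <- 3630
toy <- data.frame(id = seq_len(n_rows), Target = rbinom(n_rows, 1, 0.6))

# Draw 70% of the row indices for training; the remaining rows form the test set
train_idx <- sample(n_rows, size = round(0.7 * n_rows))
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]

dim(train)
dim(test)
```

Sampling indices without replacement guarantees that no row appears in both sets, which is what keeps the test set unseen during training.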

Model Selection And Training :

Model Selection:

  • Model selection involves choosing the best algorithm or model architecture for the given problem and dataset. This step requires careful consideration of various factors, such as the nature of the data (e.g., numerical or categorical), the problem type (e.g., regression, classification, clustering), the amount of available data, and the desired model performance. It is essential to select a model that can effectively capture the underlying patterns in the data and make accurate predictions.

Model Training:

  • Once the appropriate model has been selected, the next step is to train it on the dataset. Model training involves adjusting the model’s parameters using the training data to make accurate predictions on unseen data. The goal is to minimize the difference between the model’s predictions and the actual target values during training.
## [1] 0.9100092
  • Implementation Using RandomForest:
  • Training Steps:
    • Employing the RandomForest algorithm for its robustness and efficacy.
    • Training the model with a subset (70%) of the data to predict ‘Target’.
  • Performance:
    • Achieved about 91% accuracy on the held-out test set (0.9100).
    • Reflects the model’s strong capability in classifying data accurately.
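The training steps above can be sketched as follows. This is an illustration on synthetic data, not the report's code: the data frame `toy`, its predictors, the seed and `ntree = 200` are all assumptions, and the model-fitting part only runs if the `randomForest` package is installed.

```r
set.seed(1)  # arbitrary seed for a reproducible sketch

# Synthetic two-class data standing in for the student dataset
n <- 300
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
toy$Target <- factor(ifelse(toy$x1 + toy$x2 + rnorm(n, sd = 0.5) > 0, 1, 0))

# 70% of rows for training, the rest for testing
train_idx <- sample(n, size = round(0.7 * n))
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]

if (requireNamespace("randomForest", quietly = TRUE)) {
  # Fit the forest on the training rows only
  rf_fit <- randomForest::randomForest(Target ~ ., data = train, ntree = 200)

  # Accuracy = share of test rows whose predicted class matches the true class
  pred <- predict(rf_fit, newdata = test)
  accuracy <- mean(pred == test$Target)
  print(accuracy)
}
```

Keeping `Target` as a factor is what makes `randomForest` fit a classifier rather than a regression forest.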

Hyper-parameter Tuning :

  • Hyperparameter tuning is a critical process in machine learning that involves finding the optimal set of hyperparameters for a given model. Hyperparameters are configuration settings that are not learned from the data during model training but are set before the training process begins. They significantly impact the model’s performance and generalization ability.

  • The goal of hyperparameter tuning is to systematically search through different combinations of hyperparameters to identify the configuration that yields the best model performance. The process ensures that the model is well-optimized and capable of making accurate predictions on new, unseen data.

Best Hyperparameter Value:

  • The optimal value for mtry is identified as 5. This is determined based on the performance metrics obtained from the grid search.

Performance Metrics Across Different mtry Values:

  • The output table lists the results for mtry values ranging from 2 to 5.
  • For each mtry value, the model’s accuracy and kappa scores, along with their standard deviations, are provided.

Accuracy and Kappa Scores:

  • The accuracy and kappa scores generally increase as the mtry value increases.
  • The highest accuracy (0.9000407) and kappa (0.7864352) scores are achieved with mtry = 5.
  • The standard deviations for accuracy and kappa also increase with higher mtry values, indicating greater variability in model performance.

Implications:

  • The results suggest that increasing the mtry parameter improves the model’s performance in terms of accuracy and kappa scores. However, the increase in standard deviation also points to a potential increase in model variability.
  • Choosing mtry = 5 appears to be the best configuration for this specific random forest model based on the current dataset and the metrics considered.
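A grid search over `mtry` of the kind described above could look like the sketch below. It assumes the `caret` and `randomForest` packages and 5-fold cross-validation; the synthetic data, the fold count and `ntree = 50` are illustrative choices, not settings taken from the report.

```r
set.seed(42)  # arbitrary seed for a reproducible sketch

# Synthetic data: six numeric predictors and a two-class outcome
n <- 200
x <- as.data.frame(matrix(rnorm(n * 6), ncol = 6))
y <- factor(ifelse(x$V1 + x$V2 + rnorm(n, sd = 0.5) > 0, 1, 0))

# Candidate values for mtry, matching the range discussed above
grid <- expand.grid(mtry = 2:5)

if (requireNamespace("caret", quietly = TRUE) &&
    requireNamespace("randomForest", quietly = TRUE)) {
  library(caret)

  # 5-fold cross-validation; each mtry value is evaluated on every fold
  ctrl <- trainControl(method = "cv", number = 5)
  fit <- train(x = x, y = y, method = "rf",
               trControl = ctrl, tuneGrid = grid, ntree = 50)

  print(fit$results)   # Accuracy, Kappa and their standard deviations per mtry
  print(fit$bestTune)  # the mtry value with the best cross-validated accuracy
}
```

When differences between `mtry` values are small relative to their standard deviations, the "best" value may change with a different seed, so it is worth checking the full results table rather than only `bestTune`.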

Prediction

## [1] 0.9100092

Interpreting Model Performance

Good Accuracy: An accuracy of around 91.0% on the test set is generally considered high in many contexts.

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 348  19
##          1  79 643
##                                           
##                Accuracy : 0.91            
##                  95% CI : (0.8914, 0.9263)
##     No Information Rate : 0.6079          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.8064          
##                                           
##  Mcnemar's Test P-Value : 2.524e-09       
##                                           
##             Sensitivity : 0.8150          
##             Specificity : 0.9713          
##          Pos Pred Value : 0.9482          
##          Neg Pred Value : 0.8906          
##              Prevalence : 0.3921          
##          Detection Rate : 0.3196          
##    Detection Prevalence : 0.3370          
##       Balanced Accuracy : 0.8931          
##                                           
##        'Positive' Class : 0               
## 

Confusion Matrix:

Matrix Overview: The confusion matrix compares the predicted values (Prediction, rows) against the actual values (Reference, columns). There are two classes, 0 and 1, and the output treats class 0 as the positive class.

Results:

Class 0 (Positive Class): 348 true positives (correctly predicted as 0) and 79 false negatives (incorrectly predicted as 1). Class 1 (Negative Class): 643 true negatives (correctly predicted as 1) and 19 false positives (incorrectly predicted as 0).

Statistical Measures:

Accuracy: Approximately 91.00% (0.9100). This indicates a high overall rate of correct predictions by the model.

95% Confidence Interval for Accuracy: Ranges from 89.14% to 92.63%, so the accuracy estimate is fairly precise.

No Information Rate: 60.79%. This is a baseline metric for comparison, indicating the accuracy that would be achieved by always predicting the most frequent class.

P-Value [Accuracy > No Information Rate]: Less than 2.2e-16, indicating the model’s accuracy is significantly better than the no information rate.

Kappa: Approximately 0.8064. The Kappa statistic measures agreement between predicted and actual classifications after correcting for chance; a value close to 1 indicates strong agreement.

Mcnemar’s Test P-Value: 2.524e-09, indicating that the two types of error are not balanced: the model misclassifies actual class 0 cases as class 1 (79) significantly more often than the reverse (19).

Sensitivity (True Positive Rate): 81.50%, indicating the proportion of actual positives correctly identified.

Specificity (True Negative Rate): 97.13%, showing the proportion of actual negatives correctly identified.

Positive Predictive Value (Precision): 94.82%, reflecting the proportion of positive identifications that were actually correct.

Negative Predictive Value: 89.06%, indicating the proportion of negative identifications that were actually correct.

Prevalence of the Positive Class: 39.21%.

Detection Rate: 31.96%, the share of all cases that are correctly identified as the positive class.

Detection Prevalence: 33.70%, showing how often the model predicts the positive class.

Balanced Accuracy: 89.31%, the average of sensitivity and specificity.

Implications: The model demonstrates high accuracy and specificity, with somewhat lower sensitivity: it identifies about 81.5% of actual class 0 cases and rarely assigns class 0 to a class 1 case. The high positive and negative predictive values suggest the model is reliable in its predictions, and the kappa value shows agreement well beyond chance. Mcnemar’s Test, however, indicates that the errors are not symmetric, with missed class 0 cases outnumbering false alarms.

Conclusion: Overall, the model performs well in classifying the two classes, as evidenced by the high accuracy, specificity, and predictive values, with sensitivity as the main area for improvement.
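The statistics in the printed output can be recomputed directly from the four cell counts, which is a useful check when reporting them. The sketch below uses only base R; the variable names are illustrative, and class 0 is taken as the positive class, as stated in the output.

```r
# Counts from the printed confusion matrix (rows = Prediction, columns = Reference)
cm <- matrix(c(348, 79, 19, 643), nrow = 2,
             dimnames = list(Prediction = c("0", "1"), Reference = c("0", "1")))

# Class 0 is the positive class
tp <- cm["0", "0"]; fn <- cm["1", "0"]
fp <- cm["0", "1"]; tn <- cm["1", "1"]
n  <- sum(cm)

accuracy    <- (tp + tn) / n
sensitivity <- tp / (tp + fn)
specificity <- tn / (tn + fp)
ppv         <- tp / (tp + fp)
npv         <- tn / (tn + fn)
balanced    <- (sensitivity + specificity) / 2

# Cohen's kappa: observed agreement versus agreement expected by chance
p_chance    <- sum(rowSums(cm) * colSums(cm)) / n^2
cohen_kappa <- (accuracy - p_chance) / (1 - p_chance)

# McNemar's test compares the two off-diagonal error counts (79 vs 19)
mcnemar_p <- mcnemar.test(cm)$p.value

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, ppv = ppv, npv = npv,
        balanced = balanced, kappa = cohen_kappa), 4)
```

Working from the counts makes it clear which class each rate refers to: sensitivity here is the share of actual class 0 cases that the model catches.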

Precision

## [1] 0.829559

Balanced Metric: The F1 score is particularly valuable because it provides a single metric that balances the trade-off between precision and recall. An F1 score close to 1 indicates high precision and high recall.
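For reference, the F1 score is the harmonic mean of precision and recall. The helper below is a minimal sketch with illustrative inputs; the name `f1_score` and the example values are not taken from the report.

```r
# F1 = harmonic mean of precision and recall
f1_score <- function(precision, recall) {
  2 * precision * recall / (precision + recall)
}

# Illustrative values only
f1_score(0.9, 0.6)
```

Because it is a harmonic mean, F1 is pulled toward the lower of the two values: a precision of 0.9 with a recall of 0.6 gives 0.72 rather than the arithmetic mean of 0.75.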

Overview of the Model:

This model has been developed to predict student success and dropout rates. Using a sophisticated machine learning approach, it aims to provide actionable insights that can assist in enhancing educational strategies and student support mechanisms.

Key Performance Metrics:

High Accuracy (91.00%): The model correctly predicts student outcomes with an accuracy of approximately 91.00% on the test set. This high accuracy rate is indicative of its reliability in identifying students at risk of dropping out or succeeding.

Sensitivity and Specificity: The model demonstrates a sensitivity of 81.50% and a specificity of 97.13%. This means it correctly identifies most students who are likely to drop out (sensitivity) while very accurately recognizing those who are likely to succeed (specificity).

Predictive Values: With a positive predictive value of 94.82% and a negative predictive value of 89.06%, a large majority of the model’s predictions for either class are correct.

Balanced Accuracy (89.31%): The balanced accuracy, the average of sensitivity and specificity, stands at 89.31%, underscoring the model’s balanced performance across both classes.

Statistical Significance: The model’s performance is statistically significant, as evidenced by a very low p-value in comparison to the No Information Rate. This suggests that the model’s predictions are not due to chance.

Implications for Stakeholders:

  • Decision-Making Tool: The model can serve as a valuable tool for educational administrators and policymakers, providing data-driven insights to inform decisions about student support and intervention strategies.
  • Early Intervention: The ability to accurately predict student outcomes can facilitate early intervention for at-risk students, potentially improving retention rates and academic success.
  • Resource Allocation: By identifying trends and patterns in student performance, the model can assist in the efficient allocation of educational resources and support services.
  • Continual Improvement: The model can be an integral part of a continual improvement process in educational settings, aiding in the assessment of current strategies and the development of new approaches based on data-driven evidence.

Conclusion: The model is a robust, accurate, and reliable tool that can play a critical role in enhancing educational outcomes. Its ability to predict student success and dropout with high precision makes it an invaluable asset for stakeholders in the educational sector.

The end