Student Success Predictor

2024-03-25

Introduction

Understanding and predicting the factors that contribute to student success is important for educators, policymakers, and parents alike.

Project: “Student Success Predictor”

The project analyzes a dataset containing various attributes of students and their corresponding academic scores.
The dataset encompasses information such as gender, race/ethnicity, parental level of education, lunch type, test preparation course completion, and scores in math, reading, and writing assessments.

Approach:

Through exploratory data analysis, delve into the relationships between variables and academic achievement.
Uncover trends and disparities within the student population.

Goal:

Contribute towards enhancing student support systems and promoting inclusive educational practices.
Empower educators and institutions in nurturing the potential of every student for success in academia and beyond.

Importing and Reading the Dataset

Initiate analysis:
- Import the necessary package, readr, facilitating the handling of delimited files in R.
- Utilize the read_csv() function to import the dataset named “StudentsPerformance.csv”.
  - This function reads the CSV file and stores its contents in a data frame called student_data.
- Preview the first five rows to gain an initial understanding of the columns and their types:

# A tibble: 5 × 8
  gender `race/ethnicity` parental level of educa…¹ lunch test preparation cou…²
  <chr>  <chr>            <chr>                     <chr> <chr>                 
1 female group B          bachelor's degree         stan… none                  
2 female group C          some college              stan… completed             
3 female group B          master's degree           stan… none                  
4 male   group A          associate's degree        free… none                  
5 male   group C          some college              stan… none                  
# ℹ abbreviated names: ¹`parental level of education`,
#   ²`test preparation course`
# ℹ 3 more variables: `math score` <dbl>, `reading score` <dbl>,
#   `writing score` <dbl>
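
The import and preview steps described above can be sketched as follows. This is a minimal sketch, not the original script: the temporary file and its three rows are a toy stand-in (illustrative values only) so the example runs without the real file, and `csv_path` is a hypothetical name. In the actual analysis the path would be "StudentsPerformance.csv".

```r
library(readr)

# Toy stand-in for StudentsPerformance.csv so the sketch runs anywhere.
# Column names follow the preview above; the rows are illustrative only.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "gender,race/ethnicity,parental level of education,lunch,test preparation course,math score,reading score,writing score",
  "female,group B,bachelor's degree,standard,none,72,72,74",
  "female,group C,some college,standard,completed,69,90,88",
  "male,group A,associate's degree,free/reduced,none,47,57,44"
), csv_path)

# read_csv() returns a tibble; text columns are read as <chr>, scores as <dbl>.
student_data <- read_csv(csv_path, show_col_types = FALSE)

# Preview the first rows (at most five).
print(head(student_data, 5))
```

Column names that contain spaces or a slash, such as `math score` and `race/ethnicity`, must be wrapped in backticks when accessed with `$`.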

Analyzing the Dataset

Calculate summary statistics for numerical variables:
- Focus on “math score,” “reading score,” and “writing score.”
- Gain insights into central tendency and score spread across subjects.
Generate frequency tables for categorical variables:
- Include “gender,” “race/ethnicity,” “parental level of education,” “lunch,” and “test preparation course.”
- Provide a clear overview of category distribution within each variable.
- Highlight any disparities or patterns in the data.
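
One way to produce these summaries is sketched below, assuming base R's `summary()` and `table()`. The small data frame is a toy stand-in with illustrative values, and `score_cols`, `cat_cols`, and `freq_tables` are hypothetical names. With the real data, the column vectors would list all three scores and all five categorical variables.

```r
# Toy stand-in for student_data (illustrative values only).
student_data <- data.frame(
  gender = c("female", "female", "male", "male", "female"),
  lunch = c("standard", "standard", "free/reduced", "standard", "free/reduced"),
  `math score` = c(72, 69, 47, 76, 71),
  `reading score` = c(72, 90, 57, 78, 83),
  check.names = FALSE
)

# Min, quartiles, median, mean, and max for each score column.
score_cols <- c("math score", "reading score")
print(summary(student_data[, score_cols]))

# One frequency table per categorical column, largest category first,
# so the ranking of categories can be read off directly.
cat_cols <- c("gender", "lunch")
freq_tables <- lapply(
  student_data[cat_cols],
  function(x) sort(table(x), decreasing = TRUE)
)
print(freq_tables)
```

Sorting each table in decreasing order makes it harder to misstate which category is largest when writing up the counts.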

Summary of Numerical Variables:

   math score     reading score    writing score   
 Min.   :  0.00   Min.   : 17.00   Min.   : 10.00  
 1st Qu.: 57.00   1st Qu.: 59.00   1st Qu.: 57.75  
 Median : 66.00   Median : 70.00   Median : 69.00  
 Mean   : 66.09   Mean   : 69.17   Mean   : 68.05  
 3rd Qu.: 77.00   3rd Qu.: 79.00   3rd Qu.: 79.00  
 Max.   :100.00   Max.   :100.00   Max.   :100.00

Insights from Data Summary

Numerical Variables:
Math Score: Scores range from 0 to 100, with a mean of approximately 66.09. The middle 50% of students score between 57 and 77.
Reading Score: Scores range from 17 to 100, with a mean of around 69.17. The middle 50% score between 59 and 79.
Writing Score: Scores range from 10 to 100, with a mean of about 68.05. The middle 50% score between 57.75 and 79.
Categorical Variables:
Gender: Slightly more females (518) than males (482).
Race/Ethnicity: Group C is the largest (319), followed by D (262) and B (190). E and A have the fewest students (140 and 89).
Parental Level of Education: The most frequent levels are “some college” (226) and “associate’s degree” (222), followed by “high school” (196) and “some high school” (179). The least represented are “bachelor’s degree” (118) and “master’s degree” (59).
Lunch Type: Most students (645) have standard lunch, while fewer (355) receive free/reduced lunch.
Test Preparation Course: Majority (642) have not completed the course, while 358 have.

These findings provide a preliminary understanding of the dataset, revealing distributions, trends, and potential areas of interest for further analysis.

Data Wrangling, Munging, and Cleaning

Address inconsistencies and missing values in the dataset:
- Identify columns with missing data.
- Impute missing values with the mean value for numerical variables.
Transform categorical variables into factors to prepare the data for analysis.
Detect and treat outliers using z-score:
- Remove rows with outliers to enhance the robustness of our analysis.

Handling Missing Values

Check for missing values in the dataset and identify columns with missing data.
For this dataset the check returns character(0): no column contains missing values, so the imputation step leaves the data unchanged.
The step is kept so that the same pipeline would still produce a complete dataset if numerical values were missing, by replacing them with the column mean.

# Check for missing values
missing_values <- colSums(is.na(student_data))

# Display columns with missing values
cols_with_missing <- names(missing_values[missing_values > 0])
print(cols_with_missing)

character(0)

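# Impute missing values
# Only numeric columns are imputed: mean() is not defined for character columns
numeric_missing <- cols_with_missing[vapply(student_data[cols_with_missing], is.numeric, logical(1))]
# Replace each NA with the column mean (a no-op here, because no column has missing values)
student_data[numeric_missing] <- lapply(student_data[numeric_missing], function(x) ifelse(is.na(x), mean(x, na.rm = TRUE), x))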

Data Transformation

Convert categorical variables into factors.
This transformation allows appropriate treatment of categorical variables in statistical analysis and machine learning models.
Converting them into factors ensures that categories are represented as distinct levels, facilitating further analysis and interpretation of the data.

# Convert categorical variables to factors
student_data$gender <- factor(student_data$gender)
student_data$`race/ethnicity` <- factor(student_data$`race/ethnicity`)
student_data$`parental level of education` <- factor(student_data$`parental level of education`)
student_data$lunch <- factor(student_data$lunch)
student_data$`test preparation course` <- factor(student_data$`test preparation course`)

Calculating Total Score

Calculate the total score for each student by summing up their math score, reading score, and writing score.
This total score provides a comprehensive measure of academic performance across these subjects.
The total ranges from 0 to 300 and is used as the overall outcome in the analysis that follows.

# Calculate total score for each row
student_data$total_score <- student_data$`math score` + student_data$`reading score` + student_data$`writing score`

Outlier Detection and Treatment

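Detect and treat outliers using z-scores.
Standardize the three score columns and flag any value that lies more than 3 standard deviations from its column mean (|z| > 3).
Remove every row that contains at least one flagged value.
This step reduces the influence of extreme scores on the statistics that follow; for example, the minimum math score rises from 0 to 22 after removal.
Because the removed rows are the lowest-scoring students, the cleaned results describe the bulk of the distribution rather than every student.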

# Detect and treat outliers using z-score
z_scores <- scale(student_data[, c("math score", "reading score", "writing score")])
outliers <- rowSums(abs(z_scores) > 3)

# Remove rows with outliers
student_data <- student_data[outliers == 0, ]
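
For reference, the z-score of a value x in a column is (x - mean) / sd, where sd is the sample standard deviation. The sketch below uses a toy column (19 typical scores plus one extreme value, all illustrative) to show that `scale()` computes exactly this and how the |z| > 3 rule flags rows. The names `scores`, `manual_z`, `is_outlier`, and `cleaned` are hypothetical.

```r
# Toy scores: 19 typical values plus one extreme low score (illustrative only).
scores <- data.frame(
  `math score` = c(rep(c(65, 70, 75), length.out = 19), 0),
  check.names = FALSE
)

# scale() centers each column on its mean and divides by its sample sd.
z_scores <- scale(scores)
manual_z <- (scores$`math score` - mean(scores$`math score`)) /
  sd(scores$`math score`)
stopifnot(isTRUE(all.equal(as.numeric(z_scores), manual_z)))

# Flag rows where any column has |z| > 3, and report how many would be dropped.
is_outlier <- rowSums(abs(z_scores) > 3) > 0
cat("Rows flagged:", sum(is_outlier), "of", nrow(scores), "\n")

# Keep only the rows without a flagged value.
cleaned <- scores[!is_outlier, , drop = FALSE]
```

Reporting the number of flagged rows before dropping them makes it clear how much data the cleaning step removes.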

Exploratory Data Analysis (EDA)

Now that we have cleaned the data, let’s proceed with exploratory data analysis to gain further insights.

Summary Statistics after Cleaning

# Summary statistics for numerical variables
summary_stats_clean <- summary(student_data[, c("total_score", "math score", "reading score", "writing score")])
print(summary_stats_clean)

  total_score      math score     reading score    writing score   
 Min.   : 88.0   Min.   : 22.00   Min.   : 28.00   Min.   : 27.00  
 1st Qu.:175.0   1st Qu.: 57.00   1st Qu.: 60.00   1st Qu.: 58.00  
 Median :206.0   Median : 66.00   Median : 70.00   Median : 69.00  
 Mean   :204.3   Mean   : 66.42   Mean   : 69.47   Mean   : 68.38  
 3rd Qu.:234.0   3rd Qu.: 77.00   3rd Qu.: 80.00   3rd Qu.: 79.00  
 Max.   :300.0   Max.   :100.00   Max.   :100.00   Max.   :100.00

Boxplot: Total Score by Parental Level of Education

Students with parents holding a master’s degree tend to have higher total scores, while those with parents having some high school education have lower scores.
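
The plotting code is not shown in this deck. A boxplot of this kind could be drawn as sketched below, assuming ggplot2 (an assumption, since the deck only confirms readr). The data frame `toy` holds illustrative values and only three of the six education levels. With the real data, `student_data` would take its place.

```r
library(ggplot2)

# Toy data (illustrative values only).
toy <- data.frame(
  `parental level of education` = rep(
    c("some high school", "some college", "master's degree"), each = 4
  ),
  total_score = c(150, 170, 185, 200, 180, 200, 210, 230, 200, 220, 235, 260),
  check.names = FALSE
)

# Order the levels from lowest to highest attainment so the x-axis reads naturally;
# the default alphabetical order would scramble them.
toy$`parental level of education` <- factor(
  toy$`parental level of education`,
  levels = c("some high school", "some college", "master's degree")
)

p <- ggplot(toy, aes(x = `parental level of education`, y = total_score)) +
  geom_boxplot() +
  labs(x = "Parental level of education", y = "Total score")
print(p)

# Group medians give the numbers behind the boxes.
group_medians <- tapply(toy$total_score, toy$`parental level of education`, median)
print(group_medians)
```

Printing the group medians alongside the plot lets a statement such as "tend to have higher total scores" be backed by specific values.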

Boxplot: Total Score by Lunch Type

Students receiving standard lunch tend to have higher total scores compared to those receiving free/reduced lunch.

Scatter Plot: Math Score vs. Reading Score

There is a positive correlation between math and reading scores, indicating that students who perform well in math also tend to perform well in reading.
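
The strength of this relationship can be quantified with a Pearson correlation coefficient. The sketch below assumes ggplot2 for the plot and uses a toy data frame with illustrative values. The fitted line is an optional addition that the original plot may not have included.

```r
library(ggplot2)

# Toy data (illustrative values only).
toy <- data.frame(
  `math score` = c(47, 58, 66, 72, 76, 90),
  `reading score` = c(57, 54, 70, 72, 78, 95),
  check.names = FALSE
)

# Pearson correlation: +1 is a perfect increasing line, 0 is no linear association.
r <- cor(toy$`math score`, toy$`reading score`)
cat("Pearson r:", round(r, 2), "\n")

# Scatter plot with a least-squares line to show the trend.
p <- ggplot(toy, aes(x = `math score`, y = `reading score`)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  labs(x = "Math score", y = "Reading score")
print(p)
```

Reporting r next to the plot turns "positive correlation" into a measurable claim.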

Grouped Bar Plot: Parental Level of Education vs. Mean Math Score

Students with parents holding a master’s degree tend to have higher mean math scores, while those with parents having some high school education have lower scores.
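
The group means behind a plot like this can be computed with `tapply()` and drawn with `geom_col()`. This is a simplified sketch under assumptions: ggplot2 is assumed, the toy values are illustrative, and it draws one bar per education level because the deck does not say which second variable the original grouped by.

```r
library(ggplot2)

# Toy data (illustrative values only).
toy <- data.frame(
  `parental level of education` = rep(
    c("some high school", "master's degree"), each = 3
  ),
  `math score` = c(55, 60, 65, 68, 72, 76),
  check.names = FALSE
)

# Mean math score per education level.
mean_math <- tapply(toy$`math score`, toy$`parental level of education`, mean)
print(mean_math)

# Reshape the named vector into a data frame for plotting.
means_df <- data.frame(
  education = names(mean_math),
  mean_math_score = as.numeric(mean_math)
)

p <- ggplot(means_df, aes(x = education, y = mean_math_score)) +
  geom_col() +
  labs(x = "Parental level of education", y = "Mean math score")
print(p)
```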

Stacked Bar Plot: Lunch Type vs. Test Preparation Course Completion

Students receiving standard lunch are more likely to complete a test preparation course compared to those receiving free/reduced lunch.
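
Because the two lunch groups differ in size, completion is easier to compare as a within-group proportion than as a raw count. The sketch below assumes ggplot2 and uses toy data with illustrative values. `completion_share` is a hypothetical name.

```r
library(ggplot2)

# Toy data (illustrative values only): 5 students per lunch type.
toy <- data.frame(
  lunch = c(rep("standard", 5), rep("free/reduced", 5)),
  `test preparation course` = c(
    "completed", "completed", "none", "none", "none",
    "completed", "none", "none", "none", "none"
  ),
  check.names = FALSE
)

# Cross-tabulate, then convert each row (lunch type) to proportions.
counts <- table(toy$lunch, toy$`test preparation course`)
completion_share <- prop.table(counts, margin = 1)
print(round(completion_share, 2))

# Stacked bars of counts; position = "fill" would show proportions instead.
p <- ggplot(toy, aes(x = lunch, fill = `test preparation course`)) +
  geom_bar(position = "stack") +
  labs(x = "Lunch type", y = "Number of students", fill = "Test preparation course")
print(p)
```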

Violin Plot: Total Score Distribution by Race/Ethnicity

There are variations in total scores among different race/ethnicity groups, with some groups showing higher median scores than others.
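
A violin plot of this kind, together with the group medians it is meant to convey, could be produced as sketched below. ggplot2 is assumed, and the toy values are invented for illustration and say nothing about the real groups. The narrow boxplot overlay is an optional addition that marks each median.

```r
library(ggplot2)

# Toy data (invented values, for illustration only).
toy <- data.frame(
  `race/ethnicity` = rep(c("group A", "group C", "group E"), each = 5),
  total_score = c(
    160, 180, 195, 210, 230,
    170, 190, 205, 220, 240,
    180, 200, 215, 235, 255
  ),
  check.names = FALSE
)

# Violin shows the shape of each distribution; the slim boxplot marks the median.
p <- ggplot(toy, aes(x = `race/ethnicity`, y = total_score)) +
  geom_violin() +
  geom_boxplot(width = 0.1) +
  labs(x = "Race/ethnicity", y = "Total score")
print(p)

# Median total score per group, to state which groups are higher and by how much.
group_medians <- tapply(toy$total_score, toy$`race/ethnicity`, median)
print(group_medians)
```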

Conclusion

Student performance is associated with background:
- Higher parental education goes with better academic scores, including higher mean math scores.
- Students on standard lunch outperform those on free/reduced lunch, which suggests a socio-economic effect.
Math and reading scores are positively correlated:
- Proficiency in one subject often aligns with proficiency in the other.
Test preparation course completion differs by lunch type:
- Students on standard lunch are more likely to complete the course.
- A significant portion of students in both lunch groups have not completed it, which suggests potential resource disparities.
Academic performance varies by race/ethnicity:
- This highlights the need to address equity issues in education.
- All students should have equal opportunities for success.
These findings come from exploratory analysis and describe associations in this dataset, not causal effects.