Student Performance

The purpose of this analysis is to uncover the key factors associated with student performance, as measured by exam scores. The dataset includes study-related variables such as hours studied, attendance, sleep hours, and previous scores, alongside background factors such as gender, parental involvement, and family income. We explore the relationship between these factors and exam scores to provide actionable insights that could improve academic outcomes for students.


1. Loading Libraries and Dataset

The first step in any analysis is to load the necessary libraries and the dataset. In this case, we use libraries for data manipulation (dplyr), visualization (ggplot2, plotly), correlation plotting (corrplot), and enhanced visualizations (GGally).

Purpose: To prepare the environment for data manipulation, analysis, and visualization.

library(dplyr)
library(ggplot2)
library(corrplot)
library(GGally)
library(plotly)

# Read the dataset into a data frame (adjust the path to your local copy)
studentdb <- read.csv("U:/dataset/StudentPerformanceFactors.csv")
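Before computing anything, it can help to check how many values are missing in each column. The sketch below is a minimal illustration on a toy data frame with invented values; the helper name `count_missing` is hypothetical, and whether the real file contains blanks is an assumption to verify. It treats empty strings as missing, because `read.csv()` keeps them as `""` unless `na.strings = c("", "NA")` is passed.

```r
# Toy stand-in for studentdb; the values are invented for illustration.
toy <- data.frame(
  Exam_Score = c(67, 70, NA, 65),
  Parental_Education_Level = c("College", "", "High School", "Postgraduate"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: count NA values and empty strings in every column.
count_missing <- function(df) {
  sapply(df, function(col) sum(is.na(col) | as.character(col) == ""))
}

missing_counts <- count_missing(toy)
print(missing_counts)
```

The same helper could be applied to `studentdb` directly; columns with a non-zero count are the ones to handle before plotting or modeling.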

2. Summary Statistics for Continuous Variables

Next, we calculate summary statistics (mean, median, and standard deviation) for the continuous variables in the dataset. These include Exam_Score, Hours_Studied, Attendance, Sleep_Hours, and Previous_Scores.

Purpose: To understand the central tendency (mean, median) and variability (standard deviation) of key variables. A mean that sits far from the median is a first hint of skewness, and an unusually large standard deviation flags variables worth checking for outliers.

summary_stats <- studentdb %>%
  select(Exam_Score, Hours_Studied, Attendance, Sleep_Hours, Previous_Scores) %>%
  summarise_all(list(mean = mean, median = median, sd = sd), na.rm = TRUE)
print(summary_stats)
##   Exam_Score_mean Hours_Studied_mean Attendance_mean Sleep_Hours_mean
## 1        67.23566           19.97533        79.97745          7.02906
##   Previous_Scores_mean Exam_Score_median Hours_Studied_median Attendance_median
## 1             75.07053                67                   20                80
##   Sleep_Hours_median Previous_Scores_median Exam_Score_sd Hours_Studied_sd
## 1                  7                     75      3.890456         5.990594
##   Attendance_sd Sleep_Hours_sd Previous_Scores_sd
## 1      11.54747        1.46812           14.39978
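The one-row output above is wide and wraps across several lines. As an alternative layout, the following sketch builds a table with one row per variable and one column per statistic, using base R only. The toy data and the helper name `summarise_column` are illustrative assumptions, not part of the original analysis.

```r
# Toy stand-in for the continuous columns; values are invented.
toy <- data.frame(
  Exam_Score    = c(60, 65, 70, 75, NA),
  Hours_Studied = c(10, 20, 20, 30, 20)
)

# Hypothetical helper: the three statistics for one numeric vector.
summarise_column <- function(x) {
  c(mean   = mean(x, na.rm = TRUE),
    median = median(x, na.rm = TRUE),
    sd     = sd(x, na.rm = TRUE))
}

# sapply() puts statistics in rows and variables in columns; t() flips that
# so each variable becomes one row.
summary_table <- t(sapply(toy, summarise_column))
print(round(summary_table, 2))
```

Replacing `toy` with the selected columns of `studentdb` should give the same numbers as the output above, only arranged as a compact table.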

3. Distribution of Exam Scores

A histogram of Exam_Score is plotted to understand the distribution of exam scores across all students.

Purpose: To visually check how exam scores are distributed. This helps identify if scores are skewed (e.g., a concentration of low or high scores) or if they follow a normal distribution, which could inform subsequent analysis or modeling.

ggplot(studentdb, aes(x = Exam_Score)) +
  geom_histogram(binwidth = 5, fill = "blue", color = "black") +
  labs(title = "Distribution of Exam Scores", x = "Exam Score", y = "Frequency")
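A histogram shows skewness visually; a numeric check can complement it. The sketch below computes the moment-based sample skewness in base R. The function name `sample_skewness` and the example vectors are assumptions made for illustration; values near zero suggest a symmetric distribution, while positive values indicate a longer right tail.

```r
# Hypothetical helper: moment-based skewness, m3 / m2^(3/2).
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

# Invented example vectors.
symmetric_scores    <- c(60, 65, 70, 75, 80)
right_tailed_scores <- c(60, 62, 63, 65, 95)

print(sample_skewness(symmetric_scores))
print(sample_skewness(right_tailed_scores))
```

Applied to the real data, this would be called as `sample_skewness(studentdb$Exam_Score)`.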

4. Correlation Matrix for Continuous Variables

The correlation matrix is computed for continuous variables such as Hours_Studied, Attendance, Sleep_Hours, Previous_Scores, and Exam_Score. This matrix is visualized using the corrplot function.

Purpose: To identify linear relationships between variables. Correlation shows whether study habits (such as hours studied or attendance) move together with exam scores, which helps us decide which factors to examine more closely. A strong correlation does not by itself show that a factor causes higher scores.

cor_matrix <- cor(studentdb %>%
                    select(Hours_Studied, Attendance, Sleep_Hours, Previous_Scores, Exam_Score), 
                  use = "complete.obs")
corrplot(cor_matrix, method = "circle")
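To read the matrix as a ranking rather than a picture, one option is to pull out the `Exam_Score` column and sort the remaining variables by absolute correlation. This is a sketch on invented toy data; with the real `cor_matrix` only the last three statements would be needed.

```r
# Toy stand-in with invented values.
toy <- data.frame(
  Hours_Studied = c(10, 15, 20, 25, 30, 35),
  Sleep_Hours   = c(7, 6, 8, 7, 6, 8),
  Exam_Score    = c(60, 64, 66, 70, 73, 78)
)

cor_matrix <- cor(toy, use = "complete.obs")

# Keep each variable's correlation with Exam_Score, dropping the self-correlation.
exam_cor <- cor_matrix[, "Exam_Score"]
exam_cor <- exam_cor[names(exam_cor) != "Exam_Score"]

# Sort by strength of association regardless of sign.
ranked <- exam_cor[order(abs(exam_cor), decreasing = TRUE)]
print(round(ranked, 3))
```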

5. Boxplot Comparison by Gender

Boxplots are used to compare exam scores across gender.

Purpose: To explore if there are differences in exam scores between male and female students. This can help us assess if gender plays a role in academic performance.

ggplot(studentdb, aes(x = Gender, y = Exam_Score, fill = Gender)) +
  geom_boxplot() +
  labs(title = "Exam Scores by Gender", x = "Gender", y = "Exam Score")
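The boxplot can be backed up with numbers. The sketch below computes group medians and runs a Welch two-sample t-test, which is one common choice for comparing two group means and is not something the original analysis performs. The toy data is invented for illustration.

```r
# Toy stand-in with invented values.
toy <- data.frame(
  Gender     = c("Male", "Female", "Male", "Female", "Male", "Female"),
  Exam_Score = c(66, 68, 70, 67, 65, 72)
)

# Median exam score per gender, matching the centre line of each box.
group_medians <- tapply(toy$Exam_Score, toy$Gender, median)
print(group_medians)

# Welch t-test (t.test() does not assume equal variances by default).
welch <- t.test(Exam_Score ~ Gender, data = toy)
print(welch$p.value)
```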

6. Boxplot for Parental Involvement

This boxplot compares exam scores based on the level of parental involvement.

Purpose: To evaluate if students with higher levels of parental involvement perform better on exams. This is a critical factor, as parental support is often correlated with academic success.

ggplot(studentdb, aes(x = Parental_Involvement, y = Exam_Score, fill = Parental_Involvement)) +
  geom_boxplot() +
  labs(title = "Exam Scores by Parental Involvement", x = "Parental Involvement", y = "Exam Score")
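When a category column is stored as text, ggplot2 orders the boxes alphabetically, which would place High before Low. The sketch below sets an explicit level order with `factor()`. The level names "Low", "Medium", and "High" are an assumption about how the column is coded, and the toy data is invented.

```r
# Toy stand-in with invented values.
toy <- data.frame(
  Parental_Involvement = c("Medium", "High", "Low", "Medium", "High", "Low"),
  Exam_Score           = c(67, 70, 64, 68, 69, 65),
  stringsAsFactors = FALSE
)

# Default factor levels are sorted alphabetically.
default_levels <- levels(factor(toy$Parental_Involvement))
print(default_levels)

# Assumed natural order of the categories.
toy$Parental_Involvement <- factor(
  toy$Parental_Involvement,
  levels = c("Low", "Medium", "High")
)
ordered_levels <- levels(toy$Parental_Involvement)
print(ordered_levels)
```

Applying the same `factor()` call to `studentdb$Parental_Involvement` before plotting would make the boxes appear from lowest to highest involvement.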

7. Boxplot for Family Income

A boxplot is used to compare exam scores across different family income levels.

Purpose: To determine if socioeconomic factors such as family income influence academic performance. Higher family income may allow access to additional learning resources, impacting exam scores.

ggplot(studentdb, aes(x = Family_Income, y = Exam_Score, fill = Family_Income)) +
  geom_boxplot() +
  labs(title = "Exam Scores by Family Income", x = "Family Income", y = "Exam Score")
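With three or more groups, a rank-based test can indicate whether the differences seen in the boxplot are larger than chance would explain. The following sketch uses the Kruskal-Wallis test from base R as one possible choice; it is an addition for illustration, and the income labels and scores are invented.

```r
# Toy stand-in with invented values; the level names are assumed.
toy <- data.frame(
  Family_Income = factor(rep(c("Low", "Medium", "High"), each = 3),
                         levels = c("Low", "Medium", "High")),
  Exam_Score    = c(62, 64, 63, 66, 67, 68, 70, 71, 72)
)

# Kruskal-Wallis rank sum test: do the groups share the same distribution?
kw <- kruskal.test(Exam_Score ~ Family_Income, data = toy)
print(kw$statistic)
print(kw$p.value)
```

A small p-value suggests at least one income group differs from the others; it does not say which one, and it does not establish that income is the cause.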

8. Interactive Scatterplot: Hours Studied vs Exam Score by Gender

We create an interactive scatterplot using plotly to examine the relationship between Hours_Studied and Exam_Score, with a color distinction for Gender.

Purpose: To explore the relationship between hours studied and exam scores in an interactive format. This allows for deeper engagement with the data, and the gender breakdown can help assess whether the relationship differs by gender.

plot1 <- plot_ly(studentdb, x = ~Hours_Studied, y = ~Exam_Score, color = ~Gender, type = 'scatter', mode = 'markers',
                 colors = c("purple", "orange")) %>%
  layout(title = "Interactive Scatterplot: Hours Studied vs Exam Score by Gender",
         xaxis = list(title = "Hours Studied"),
         yaxis = list(title = "Exam Score"),
         showlegend = TRUE)
plot1
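To put a number on the trend visible in the scatterplot, a simple linear regression can estimate how many points the exam score changes per additional hour studied. This is a sketch on invented toy data, assuming a roughly linear relationship; it is not part of the original analysis.

```r
# Toy stand-in with invented values.
toy <- data.frame(
  Hours_Studied = c(10, 15, 20, 25, 30),
  Exam_Score    = c(62, 65, 66, 69, 70)
)

# Ordinary least squares fit of exam score on hours studied.
fit <- lm(Exam_Score ~ Hours_Studied, data = toy)

# Estimated change in exam score for one extra hour of study.
slope <- coef(fit)[["Hours_Studied"]]
print(slope)
```

On the real data the call would be `lm(Exam_Score ~ Hours_Studied, data = studentdb)`, and `summary(fit)` would report the standard error and R-squared alongside the slope.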

9. Boxplot Comparison by Internet Access

This boxplot compares exam scores based on internet access.

Purpose: To determine if access to the internet impacts academic performance, especially in the current age of online learning and digital resources.

ggplot(studentdb, aes(x = Internet_Access, y = Exam_Score, fill = Internet_Access)) +
  geom_boxplot() +
  labs(title = "Impact of Internet Access on Exam Scores", x = "Internet Access", y = "Exam Score")
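A boxplot does not show how many students fall into each group, and a comparison can be misleading if one group is very small. The sketch below counts group sizes and shares on invented toy data; whether the real dataset is imbalanced is an assumption to check.

```r
# Toy stand-in with invented values.
toy <- data.frame(
  Internet_Access = c("Yes", "Yes", "Yes", "Yes", "No"),
  Exam_Score      = c(68, 70, 66, 69, 65)
)

# Number of students per group and each group's share of the total.
group_sizes <- table(toy$Internet_Access)
group_share <- prop.table(group_sizes)

print(group_sizes)
print(round(group_share, 2))
```

If the groups turn out to be uneven, passing `varwidth = TRUE` to `geom_boxplot()` draws box widths in proportion to the square root of the group sizes, which makes the imbalance visible in the plot itself.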

10. Boxplot for Parental Education Level Impact

This boxplot compares exam scores based on the Parental Education Level.

Purpose: To assess the impact of parental education on students’ exam scores. Parental education is often correlated with student achievement, as parents with higher education may be more involved in their children’s schooling.

ggplot(studentdb, aes(x = Parental_Education_Level, y = Exam_Score, fill = Parental_Education_Level)) +
  geom_boxplot() +
  labs(title = "Impact of Parental Education on Exam Scores", x = "Parental Education Level", y = "Exam Score")
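If the education column contains blank entries, they would show up in the boxplot as an unlabeled extra category. The sketch below shows one way to recode blanks as `NA` and drop those rows before plotting. The presence of blanks in the real file is an assumption, and the toy values are invented.

```r
# Toy stand-in with invented values, including two blank entries.
toy <- data.frame(
  Parental_Education_Level = c("College", "", "High School", "Postgraduate", ""),
  Exam_Score               = c(68, 66, 65, 71, 67),
  stringsAsFactors = FALSE
)

# Recode empty strings as missing values.
toy$Parental_Education_Level[toy$Parental_Education_Level == ""] <- NA

# Keep only rows with a known education level for the plot.
plot_data <- toy[!is.na(toy$Parental_Education_Level), ]
print(plot_data)
```

Dropping rows changes the sample, so it is worth reporting how many students were excluded.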

11. Interaction Between Gender and Parental Involvement

Finally, we examine the interaction between Gender and Parental Involvement using a boxplot.

Purpose: To understand whether the relationship between parental involvement and exam scores differs by gender. This helps uncover potential disparities in how male and female students are affected by parental involvement.

ggplot(studentdb, aes(x = Parental_Involvement, y = Exam_Score, fill = Gender)) +
  geom_boxplot() +
  labs(title = "Interaction Between Gender and Parental Involvement", x = "Parental Involvement", y = "Exam Score")
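An interaction means that the effect of parental involvement differs between genders. The sketch below tabulates the mean score for every combination and fits a linear model with an interaction term, as one possible way to quantify what the grouped boxplot shows. The data and level names are invented for illustration.

```r
# Toy stand-in with invented values: two students per gender/involvement cell.
toy <- data.frame(
  Gender               = rep(c("Male", "Female"), each = 6),
  Parental_Involvement = rep(c("Low", "Medium", "High"), times = 4),
  Exam_Score           = c(64, 66, 69, 65, 67, 70,
                           63, 67, 71, 64, 68, 72)
)

# Mean exam score for each involvement level (rows) and gender (columns).
cell_means <- tapply(toy$Exam_Score,
                     list(toy$Parental_Involvement, toy$Gender),
                     mean)
print(cell_means)

# Two-way model; the `*` expands to both main effects plus their interaction.
fit <- lm(Exam_Score ~ Parental_Involvement * Gender, data = toy)
anova_table <- anova(fit)
print(anova_table)
```

The row labeled `Parental_Involvement:Gender` in the ANOVA table tests whether the involvement effect differs by gender.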

Conclusion

The analysis aims to uncover insights that can help educational institutions, parents, and students improve performance outcomes. By identifying factors associated with student success, such as study habits, sleep, and parental involvement, the findings can guide interventions aimed at addressing disparities and promoting better academic achievement for all students.

Through summary statistics, correlations, and static and interactive visualizations, this study gives a descriptive picture of the variables related to student performance, which can be used to inform policy, teaching methods, and support systems in education.