Part 1 - Introduction
Abstract: This project analyzes the Students Exam Scores: Extended Dataset, which can be found here. The dataset is fictional and was generated for training purposes only. It comprises scores from three tests administered to students at a hypothetical public school, alongside a range of personal and socio-economic factors that may interact with the test scores.
We will use the dataset to investigate the influence of parental education and marital status on students’ academic performance. The research question we are aiming to answer using the data is whether parental education and marital status serve as significant predictors of students’ academic performance.
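Statistical analyses, including multiple regression, were conducted to explore the relationship between parental education, marital status, and students’ academic performance. The results indicate that parental education is a statistically significant predictor of average exam scores, although it explains only about 5% of their variance, while parental marital status is not a significant predictor.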
Part 2 - Data
The dataset was originally generated by Mr. Royce Kimmons. We obtained the data from kaggle.com, which can be found here. There are 30,641 observations of different students in the dataset before clearing the NA values. After tidying the data we are left with 22,058 observations where each row has a value. Because the dataset is generated and fictional, the results of this study should not be generalized to the whole population of US students.
For our research purposes, the dependent variables are the student exam scores, while the independent variables are the parents’ education level (their highest level of education) and the parents’ marital status. These can be further inspected in the data dictionary below.
library(tidyverse)
library(knitr)

data <- read.csv("https://raw.githubusercontent.com/sokkarbishoy/DATA607/main/Expanded_data_with_more_features.csv")

# Create a data dictionary of the relevant columns
Data_Dictionary <- data.frame(
  Variables = c("ParentEduc", "ParentMaritalStatus", "MathScore", "ReadingScore", "WritingScore", "Total_avg_scores"),
  Description = c("Parent(s) education background (from some_highschool to master's degree)", "Parent(s) marital status (married/single/widowed/divorced)", "Math test score(0-100)", "Reading test score(0-100)", "Writing test score(0-100)", "Average of all three exam scores")
)
# Print the data dictionary using the kable function from knitr
kable(Data_Dictionary,
      caption = "Data Dictionary",
      align = c("l", "l"),
      col.names = c("Variables", "Description"))

| Variables | Description |
|---|---|
| ParentEduc | Parent(s) education background (from some_highschool to master’s degree) |
| ParentMaritalStatus | Parent(s) marital status (married/single/widowed/divorced) |
| MathScore | Math test score(0-100) |
| ReadingScore | Reading test score(0-100) |
| WritingScore | Writing test score(0-100) |
| Total_avg_scores | Average of all three exam scores |
Note that ‘Total_avg_scores’ is a variable we created in the tidying step below. Additional variables in the dataset include gender, ethnic group, school lunch type, test preparation course, sports practice, whether the child is first born, and more. The data is a great place to start thinking about which factors contribute most to student exam success.
Tidying the data
In the following code we removed the rows with missing (NA) values and created a new variable that measures the average score of all three exams.
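The tidying step described above could look roughly like the following sketch. It runs on a small made-up data frame (`toy`, with invented values) rather than the real file, and it assumes that blank strings mark missing categorical values; the report's own tidying code is not shown, so the exact calls here are an assumption.

```r
# Hypothetical stand-in for the real dataset; all values are invented
toy <- data.frame(
  ParentEduc = c("high school", "", "master's degree", "some college"),
  ParentMaritalStatus = c("married", "single", "", "divorced"),
  MathScore = c(70, 80, 90, 60),
  ReadingScore = c(75, 85, 95, 66),
  WritingScore = c(80, 90, 100, 69),
  stringsAsFactors = FALSE
)

# Assumption: blank strings stand for missing answers, so convert them to NA
toy[toy == ""] <- NA

# Keep only the rows where every column has a value
tidy <- na.omit(toy)

# New variable: the mean of the three exam scores for each student
tidy$Total_avg_scores <- rowMeans(tidy[, c("MathScore", "ReadingScore", "WritingScore")])

print(tidy)
```

Rows 2 and 3 of the toy data are dropped because each has one blank categorical value, which mirrors how the row count shrinks after tidying.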
Part 3 - Exploratory data analysis
Box plots and ggplots
Let’s examine our independent variables, both of which are categorical. First, ‘ParentEduc’: the bar plot shows that the average exam score rises as the parents’ education level rises. The scatter plot also shows some outliers at several education levels.
##
## associate's degree bachelor's degree high school master's degree
## 5351 3251 5475 1939
## some college some high school
## 6369 5295
# To find the average rather than the frequency, first group by parent education level
average_scores <- data %>%
  group_by(ParentEduc) %>%
  summarise(Average_Score = mean(Total_avg_scores)) %>%
  arrange(Average_Score)
# Order the factor levels by average score so the bars appear sorted
average_scores$ParentEduc <- factor(average_scores$ParentEduc, levels = average_scores$ParentEduc)

ggplot(average_scores, aes(x = ParentEduc, y = Average_Score)) +
  geom_bar(stat = "identity", fill = "skyblue", color = "black") +
  geom_text(aes(label = round(Average_Score, 2)), vjust = -0.5, color = "black", size = 3) + # Add text labels for values
  labs(title = "Average Score by Parental Education Level",
       x = "Education Level",
       y = "Average Score") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 30, hjust = 1))

ggplot(data, aes(x = ParentEduc, y = Total_avg_scores, color = ParentEduc)) +
  geom_point() +
  labs(title = "Correlation between Parental Education Level and Total Average Scores",
       x = "Education Level",
       y = "Total Average Scores") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 30, hjust = 1))

##
## Call:
## lm(formula = Total_avg_scores ~ ParentEduc, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -56.054 -9.562 0.279 10.268 36.105
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 69.9469 0.1926 363.216 < 2e-16 ***
## ParentEducbachelor's degree 2.2678 0.3133 7.240 4.62e-13 ***
## ParentEduchigh school -4.2147 0.2708 -15.564 < 2e-16 ***
## ParentEducmaster's degree 4.9424 0.3734 13.236 < 2e-16 ***
## ParentEducsome college -1.8924 0.2612 -7.244 4.46e-13 ***
## ParentEducsome high school -6.0519 0.2731 -22.163 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.09 on 27674 degrees of freedom
## Multiple R-squared: 0.04966, Adjusted R-squared: 0.04949
## F-statistic: 289.2 on 5 and 27674 DF, p-value: < 2.2e-16
Second, we will look at ‘ParentMaritalStatus’. The responses are unevenly distributed across the levels of this variable: a much larger proportion of the students in our dataset have parents who are married than fall into any other category. Of the responses, 15,802 were married, 6,664 were single, 4,655 were divorced, and only 559 were widowed.
##
## divorced married single widowed
## 4655 15802 6664 559
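To make the imbalance easier to read, the counts in the table above can be converted to percentage shares. The sketch below types the four counts in by hand instead of calling `table()` on the data; the variable names `counts` and `shares` are only illustrative.

```r
# Counts copied from the frequency table above
counts <- c(divorced = 4655, married = 15802, single = 6664, widowed = 559)

# Share of each category as a percentage of all responses
shares <- round(100 * counts / sum(counts), 1)
print(shares)
```

On the full data frame, `prop.table(table(data$ParentMaritalStatus))` would give the same proportions directly.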
average_scores_marital <- data %>%
  group_by(ParentMaritalStatus) %>%
  summarise(Average_Score_marital = mean(Total_avg_scores))

ggplot(average_scores_marital, aes(x = ParentMaritalStatus, y = Average_Score_marital)) +
  geom_bar(stat = "identity", fill = "skyblue", color = "black") +
  geom_text(aes(label = round(Average_Score_marital, 2)), vjust = -0.5, color = "black", size = 3) +
  labs(title = "Average Score by Parental Marital Status",
       x = "Parental Marital Status",
       y = "Average Score") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 30, hjust = 1))

ggplot(data, aes(x = ParentMaritalStatus)) +
  geom_bar(fill = "skyblue", color = "black") +
  labs(title = "Distribution of Parent Marital Status")

ggplot(data, aes(x = ParentMaritalStatus, y = Total_avg_scores)) +
  geom_boxplot(fill = "skyblue", color = "darkblue", alpha = 0.7) +
  geom_jitter(aes(color = ParentMaritalStatus), width = 0.1, alpha = 0.5) +
  # Red cross marks the group mean
  stat_summary(fun = mean, geom = "point",
               shape = 3, size = 3, color = "red",
               position = position_dodge(width = 0.75)) +
  # Green x marks the group median
  stat_summary(fun = median, geom = "point",
               shape = 4, size = 3, color = "green",
               position = position_dodge(width = 0.75)) +
  labs(title = "Relationship Between Parental Marital Status and Total Average Scores",
       x = "Parental Marital Status",
       y = "Total Average Scores") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Part 4 - Inference
We will first fit a linear regression using each of the variables on its own. Because each categorical predictor adds a separate coefficient for every level beyond the reference level, overfitting is a problem we want to avoid.
##
## Call:
## lm(formula = data$Total_avg_scores ~ data$ParentEduc, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -56.054 -9.562 0.279 10.268 36.105
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 69.9469 0.1926 363.216 < 2e-16 ***
## data$ParentEducbachelor's degree 2.2678 0.3133 7.240 4.62e-13 ***
## data$ParentEduchigh school -4.2147 0.2708 -15.564 < 2e-16 ***
## data$ParentEducmaster's degree 4.9424 0.3734 13.236 < 2e-16 ***
## data$ParentEducsome college -1.8924 0.2612 -7.244 4.46e-13 ***
## data$ParentEducsome high school -6.0519 0.2731 -22.163 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.09 on 27674 degrees of freedom
## Multiple R-squared: 0.04966, Adjusted R-squared: 0.04949
## F-statistic: 289.2 on 5 and 27674 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = data$Total_avg_scores ~ data$ParentMaritalStatus,
## data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -58.863 -9.838 0.162 10.495 32.137
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 68.3359 0.2118 322.681 <2e-16 ***
## data$ParentMaritalStatusmarried -0.1647 0.2410 -0.683 0.4944
## data$ParentMaritalStatussingle -0.4725 0.2760 -1.712 0.0869 .
## data$ParentMaritalStatuswidowed 0.2234 0.6468 0.345 0.7298
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.45 on 27676 degrees of freedom
## Multiple R-squared: 0.0001385, Adjusted R-squared: 3.016e-05
## F-statistic: 1.278 on 3 and 27676 DF, p-value: 0.2799
Overall Model Fit: The model appears to have statistical significance, as indicated by the very low p-value (< 2.2e-16) for the F-statistic. This suggests that the model as a whole provides a better fit to the data than a model with no predictors.
Predictor Variables: Every level of ParentEduc differs significantly from the reference level, associate’s degree (all p < 0.001). Relative to that baseline, a bachelor’s degree and a master’s degree are associated with higher Total_avg_scores (about +2.3 and +4.9 points), while some college, high school, and some high school are associated with lower scores (about -1.9, -4.2, and -6.1 points). Ordered by attainment, the average total scores of the students rise as the level of parental education increases.
The size of the difference varies with the level of parental education. For example, the model predicts that students whose parents hold a master’s degree score about 9.2 points higher than students whose parents have a high school education (4.93 - (-4.22)).
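The sign of each ParentEduc coefficient depends on which level R uses as the reference. The sketch below, on a small invented data frame, shows how the default alphabetical reference ("associate's degree") produces a mix of positive and negative coefficients, and how `relevel()` with the lowest education level as reference turns them all into increases. The toy values are hypothetical and are not taken from the dataset.

```r
# Hypothetical toy data; the scores are invented for illustration only
toy <- data.frame(
  ParentEduc = c("associate's degree", "some high school", "master's degree",
                 "some high school", "associate's degree", "master's degree"),
  Total_avg_scores = c(70, 63, 76, 65, 69, 74)
)

# factor() sorts levels alphabetically, so "associate's degree" becomes the
# reference level that every other coefficient is compared against
toy$ParentEduc <- factor(toy$ParentEduc)
fit_default <- lm(Total_avg_scores ~ ParentEduc, data = toy)

# Use the lowest education level as the reference instead
toy$ParentEduc <- relevel(toy$ParentEduc, ref = "some high school")
fit_releveled <- lm(Total_avg_scores ~ ParentEduc, data = toy)

print(coef(fit_default))    # "some high school" has a negative coefficient
print(coef(fit_releveled))  # all non-intercept coefficients are positive
```

Releveling changes only how the coefficients are expressed; the fitted values and the overall model fit stay the same.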
Regarding ParentMaritalStatus, the coefficients for the married, single, and widowed categories (each compared with the reference category, divorced) are not statistically significant at the conventional level (alpha = 0.05). This suggests that there is no strong evidence that marital status has a significant impact on Total_avg_scores after controlling for parental education.
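One way to test marital status as a whole, rather than one level at a time, is a nested-model F-test that compares the education-only model with the model containing both predictors. The sketch below uses simulated data in which scores depend on education only; the data frame `toy`, the effect sizes, and the seed are all assumptions made for illustration, and this comparison is not part of the original analysis.

```r
set.seed(1)
n <- 200

# Simulated predictors; the level names mirror the report, the data do not
toy <- data.frame(
  ParentEduc = factor(sample(c("high school", "bachelor's degree", "master's degree"),
                             n, replace = TRUE)),
  ParentMaritalStatus = factor(sample(c("married", "single", "divorced"),
                                      n, replace = TRUE))
)

# Invented effects: scores depend on education but not on marital status
effect <- c("high school" = -4, "bachelor's degree" = 3, "master's degree" = 6)
toy$Total_avg_scores <- 68 + unname(effect[as.character(toy$ParentEduc)]) +
  rnorm(n, sd = 10)

# Smaller model nested inside the larger one
fit_educ <- lm(Total_avg_scores ~ ParentEduc, data = toy)
fit_both <- lm(Total_avg_scores ~ ParentEduc + ParentMaritalStatus, data = toy)

# F-test of whether adding marital status reduces the residual sum of squares
comparison <- anova(fit_educ, fit_both)
print(comparison)
```

A large p-value in the second row of the comparison table would indicate that marital status adds little once education is in the model.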
Adjusted R-squared: The adjusted R-squared value of approximately 0.0495 suggests that about 5% of the variability in Total_avg_scores can be explained by the predictor variables included in the model.
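As a check on how the adjusted value relates to the plain R-squared, the standard formula 1 - (1 - R²)(n - 1)/(n - p - 1) can be applied to the figures printed in the summary of the model with both predictors. The helper name `adjusted_r2` is ours, and the inputs are the rounded values from that output, so the result is only accurate to a few decimal places.

```r
# Adjusted R-squared from R-squared, sample size n, and number of predictors p
adjusted_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

# Figures read off the summary of the two-predictor model
p <- 8                 # coefficients other than the intercept
n <- 27671 + p + 1     # residual degrees of freedom + p + 1
adj <- adjusted_r2(r2 = 0.04979, n = n, p = p)
print(adj)
```

With this many observations the penalty for eight coefficients is tiny, which is why the adjusted and unadjusted values are nearly identical.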
Residuals: The residuals (the differences between the observed values and the predicted values) appear to have a mean close to zero and are relatively symmetrically distributed around zero. However, further diagnostic checks may be needed to assess the assumptions of normality and constant variance of residuals.
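The diagnostic checks mentioned above could be sketched as follows. The example fits a model to simulated data (the data frame `toy`, its group labels, and the seed are hypothetical), then draws a residuals-versus-fitted plot to look for non-constant variance and a normal Q-Q plot to look for departures from normality.

```r
set.seed(42)

# Simulated stand-in: three groups with different means and normal noise
toy <- data.frame(group = factor(rep(c("a", "b", "c"), each = 50)))
shift <- c(a = 0, b = 3, c = 6)
toy$score <- 65 + unname(shift[as.character(toy$group)]) + rnorm(150, sd = 10)

fit <- lm(score ~ group, data = toy)
res <- residuals(fit)

# With an intercept in the model, least-squares residuals average to zero
print(mean(res))

# Constant variance: the vertical spread should look similar at each fitted value
plot(fitted(fit), res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Normality: the points should fall close to the reference line
qqnorm(res)
qqline(res)
```

Calling `plot(fit)` on a fitted `lm` object produces a similar set of diagnostic plots in one step.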
In summary, while the model indicates a statistically significant relationship between parental education and students’ average total scores, the impact of parental marital status on average total scores appears to be inconclusive based on the coefficients’ statistical significance. Further investigation, possibly with additional variables or interactions, may be needed to better understand the factors influencing students’ performance.
##
## Call:
## lm(formula = Total_avg_scores ~ ParentEduc + ParentMaritalStatus,
## data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -56.146 -9.606 0.252 10.222 36.055
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 70.14626 0.27038 259.441 < 2e-16 ***
## ParentEducbachelor's degree 2.26240 0.31328 7.222 5.27e-13 ***
## ParentEduchigh school -4.22151 0.27082 -15.588 < 2e-16 ***
## ParentEducmaster's degree 4.93496 0.37345 13.215 < 2e-16 ***
## ParentEducsome college -1.89537 0.26124 -7.255 4.11e-13 ***
## ParentEducsome high school -6.05471 0.27307 -22.173 < 2e-16 ***
## ParentMaritalStatusmarried -0.14677 0.23496 -0.625 0.5322
## ParentMaritalStatussingle -0.46951 0.26913 -1.745 0.0811 .
## ParentMaritalStatuswidowed 0.05709 0.63064 0.091 0.9279
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.09 on 27671 degrees of freedom
## Multiple R-squared: 0.04979, Adjusted R-squared: 0.04952
## F-statistic: 181.2 on 8 and 27671 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = data$Total_avg_scores ~ data$PracticeSport, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -60.096 -9.874 0.237 10.459 32.521
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 68.2975 0.6107 111.842 < 2e-16 ***
## data$PracticeSportnever -1.8188 0.6563 -2.771 0.00559 **
## data$PracticeSportregularly 0.7987 0.6279 1.272 0.20332
## data$PracticeSportsometimes -0.4233 0.6229 -0.679 0.49684
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.43 on 27676 degrees of freedom
## Multiple R-squared: 0.003435, Adjusted R-squared: 0.003327
## F-statistic: 31.8 on 3 and 27676 DF, p-value: < 2.2e-16
Part 5 - Conclusion
The results of our analysis provide valuable insights into the relationship between parental education, marital status, and students’ academic performance.
Parental Education: Our analysis reveals a statistically significant positive association between parental education level and students’ average exam scores: children of parents with higher education levels tend to perform better academically. The association is modest in size, however, since parental education explains only about 5% of the variance in average scores.
Marital Status: Parental marital status, by contrast, does not appear to influence students’ academic performance in this data. None of the marital status coefficients was statistically significant at the 0.05 level, and the model using marital status alone explained almost none of the variance in average scores (R-squared of about 0.0001), indicating that other factors play a more significant role in determining academic outcomes.
The findings suggest that parents’ educational background is a significant predictor of their children’s success in school, while marital status is not. These insights emphasize the importance of supporting and empowering parents, particularly in terms of education, to enhance students’ educational outcomes and overall success. Because the dataset is fictional, these conclusions should not be generalized to real student populations.
References
Kimmons, Royce. “Exam Scores: Exam Scores for Students at a Public School.” Royce Kimmons: Understanding Digital Participation Divides, 2012, roycekimmons.com/tools/generated_data/exams.
“IBISWorld - Industry Market Research, Reports, and Statistics.” IBISWorld Industry Reports, www.ibisworld.com/us/bed/number-of-k-12-students/4251/. Accessed 4 May 2024.