This dataset was posted by Jered Ataky on the week 5 discussion board in DATA 607. The full description of the dataset can be found at: https://www.kaggle.com/spscientist/students-performance-in-exams
The proposed analyses were:
- Whether scores can be predicted from the other variables, such as test preparation, parental level of education, and lunch type. We can check whether these variables affect the scores and build a model.
- Whether some scores are correlated with each other, such as reading and writing.
## -- Attaching packages --------------------------------------------------------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.2 v purrr 0.3.4
## v tibble 3.0.3 v dplyr 1.0.2
## v tidyr 1.1.2 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.5.0
## -- Conflicts ------------------------------------------------------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
##
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
##
## group_rows
# Get the data
data <- read.csv("https://raw.githubusercontent.com/jnataky/DATA-607/master/A2_Various_dataset_transformation/students_performance.csv")
## [1] 1000 8
## gender race.ethnicity parental.level.of.education lunch
## 1 female group B bachelor's degree standard
## 2 female group C some college standard
## 3 female group B master's degree standard
## 4 male group A associate's degree free/reduced
## 5 male group C some college standard
## 6 female group B associate's degree standard
## test.preparation.course math.score reading.score writing.score
## 1 none 72 72 74
## 2 completed 69 90 88
## 3 none 90 95 93
## 4 none 47 57 44
## 5 none 76 78 75
## 6 none 71 83 78
## [1] "gender" "race.ethnicity"
## [3] "parental.level.of.education" "lunch"
## [5] "test.preparation.course" "math.score"
## [7] "reading.score" "writing.score"
## [1] 0
# Rename columns
data_copy <- data_copy %>%
  rename(parent_level = parental.level.of.education,
         prep_test = test.preparation.course,
         math = math.score,
         reading = reading.score,
         writing = writing.score)
head(data_copy)
## gender race.ethnicity parent_level lunch prep_test math reading
## 1 female group B bachelor's degree standard none 72 72
## 2 female group C some college standard completed 69 90
## 3 female group B master's degree standard none 90 95
## 4 male group A associate's degree free/reduced none 47 57
## 5 male group C some college standard none 76 78
## 6 female group B associate's degree standard none 71 83
## writing
## 1 74
## 2 88
## 3 93
## 4 44
## 5 75
## 6 78
## parent_level math reading writing
## 1 bachelor's degree 72 72 74
## 2 some college 69 90 88
## 3 master's degree 90 95 93
## 4 associate's degree 47 57 44
## 5 some college 76 78 75
## 6 associate's degree 71 83 78
## [1] "bachelor's degree" "some college" "master's degree"
## [4] "associate's degree" "high school" "some high school"
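The chunk that builds `data_1` from these columns is not shown, but the summary step that follows groups by a `test` column and averages a `score` column, which suggests the three score columns were first stacked into long format. A minimal sketch of that reshape, assuming `tidyr::pivot_longer()` was used and working only with the six rows printed above (the names `scores_wide` and `scores_long` are hypothetical):

```r
library(dplyr)
library(tidyr)

# The six rows printed above, retyped by hand for illustration
scores_wide <- data.frame(
  parent_level = c("bachelor's degree", "some college", "master's degree",
                   "associate's degree", "some college", "associate's degree"),
  math    = c(72, 69, 90, 47, 76, 71),
  reading = c(72, 90, 95, 57, 78, 83),
  writing = c(74, 88, 93, 44, 75, 78)
)

# Stack the three score columns: one row per student and test
scores_long <- scores_wide %>%
  pivot_longer(cols = c(math, reading, writing),
               names_to = "test", values_to = "score")

head(scores_long)
```

The long layout is what lets a single `group_by(parent_level, test)` produce one average per education level and test.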
# Average score per parent education level and test
data_1 <- data_1 %>%
  group_by(parent_level, test) %>%
  summarise(average_score = round(mean(score), 0))
## `summarise()` regrouping output by 'parent_level' (override with `.groups` argument)
# Kable for tidy table
data_1 %>%
  kbl(caption = "Test score mean with parent level of education", align = 'c') %>%
  kable_material(c("striped", "hover")) %>%
  row_spec(0, color = "indigo")

| parent_level | test | average_score |
|---|---|---|
| associate’s degree | math | 68 |
| associate’s degree | reading | 71 |
| associate’s degree | writing | 70 |
| bachelor’s degree | math | 69 |
| bachelor’s degree | reading | 73 |
| bachelor’s degree | writing | 73 |
| high school | math | 62 |
| high school | reading | 65 |
| high school | writing | 62 |
| master’s degree | math | 70 |
| master’s degree | reading | 75 |
| master’s degree | writing | 76 |
| some college | math | 67 |
| some college | reading | 69 |
| some college | writing | 69 |
| some high school | math | 63 |
| some high school | reading | 67 |
| some high school | writing | 65 |
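
For a side-by-side comparison, the same averages can be spread into one row per education level. A minimal sketch, assuming `tidyr::pivot_wider()` and retyping the values from the table above (the names `avg_long` and `avg_wide` are hypothetical and do not appear in the original analysis):

```r
library(tidyr)

levels_seen <- c("associate's degree", "bachelor's degree", "high school",
                 "master's degree", "some college", "some high school")

# Averages copied from the table above, three tests per education level
avg_long <- data.frame(
  parent_level  = rep(levels_seen, each = 3),
  test          = rep(c("math", "reading", "writing"), times = 6),
  average_score = c(68, 71, 70,  69, 73, 73,  62, 65, 62,
                    70, 75, 76,  67, 69, 69,  63, 67, 65)
)

# One row per education level, one column per test
avg_wide <- pivot_wider(avg_long, names_from = test, values_from = average_score)

avg_wide
```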
## [1] "standard" "free/reduced"
# Average score per lunch type and test
data_2 <- data_2 %>%
  group_by(lunch, test) %>%
  summarise(average_score = round(mean(score), 0))
## `summarise()` regrouping output by 'lunch' (override with `.groups` argument)
# Kable for tidy table
data_2 %>%
  kbl(caption = "Test score mean with type of lunch offered to students", align = 'c') %>%
  kable_material(c("striped", "hover")) %>%
  row_spec(0, color = "indigo")

| lunch | test | average_score |
|---|---|---|
| free/reduced | math | 59 |
| free/reduced | reading | 65 |
| free/reduced | writing | 63 |
| standard | math | 70 |
| standard | reading | 72 |
| standard | writing | 71 |
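
The size of the gap is easier to read as a difference per test. A small sketch in base R using the rounded averages from the table above (the name `lunch_gap` is hypothetical):

```r
# Averages copied from the table above
tests        <- c("math", "reading", "writing")
free_reduced <- c(59, 65, 63)
standard     <- c(70, 72, 71)

# Difference in average score: standard minus free/reduced
lunch_gap <- setNames(standard - free_reduced, tests)
lunch_gap
```

On these rounded averages the gap is 11 points in math, 7 in reading, and 8 in writing.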
# Average score per test preparation status and test
data_3 <- data_3 %>%
  group_by(prep_test, test) %>%
  summarise(average_score = round(mean(score), 0))
## `summarise()` regrouping output by 'prep_test' (override with `.groups` argument)
# Kable for tidy table
data_3 %>%
  kbl(caption = "Test score mean with test preparation course status", align = 'c') %>%
  kable_material(c("striped", "hover")) %>%
  row_spec(0, color = "indigo")

| prep_test | test | average_score |
|---|---|---|
| completed | math | 70 |
| completed | reading | 74 |
| completed | writing | 74 |
| none | math | 64 |
| none | reading | 67 |
| none | writing | 65 |
ggplot(data = data_1) +
  geom_bar(mapping = aes(x = reorder(parent_level, average_score), y = average_score, fill = test), position = "dodge", stat = "identity") +
  facet_wrap(~ test, nrow = 3)

Students' average scores tend to rise with their parents' level of education.
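
Because `reorder(parent_level, average_score)` sorts the bars by score, an upward trend is partly built into the plot. One alternative is to order the levels by attainment before plotting. A sketch in base R, where the ranking of the levels is an assumption that the dataset itself does not define, and `education_order` and `math_avg` are hypothetical names:

```r
# Assumed ranking from least to most schooling (not defined by the dataset)
education_order <- c("some high school", "high school", "some college",
                     "associate's degree", "bachelor's degree", "master's degree")

# Math averages copied from the parent_level summary table
math_avg <- data.frame(
  parent_level  = c("associate's degree", "bachelor's degree", "high school",
                    "master's degree", "some college", "some high school"),
  average_score = c(68, 69, 62, 70, 67, 63)
)

# Ordered factor so plots and sorts follow attainment, not score
math_avg$parent_level <- factor(math_avg$parent_level,
                                levels = education_order, ordered = TRUE)
math_avg[order(math_avg$parent_level), ]
```

With the factor in place, `x = parent_level` can stand in for the `reorder()` call inside `aes()`.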
ggplot(data = data_2) +
  geom_bar(mapping = aes(x = reorder(lunch, average_score), y = average_score, fill = test), position = "dodge", stat = "identity", width = 0.5) +
  facet_wrap(~ test, nrow = 3)

On average, students with standard lunch perform better than students with free/reduced lunch on all three tests.
ggplot(data = data_3) +
  geom_bar(mapping = aes(x = reorder(prep_test, average_score), y = average_score, fill = test), position = "dodge", stat = "identity", width = 0.5) +
  facet_wrap(~ test, nrow = 3)

On average, students who completed the test preparation course score higher than those who did not on all three tests.
As the analyses above show, students' performance is associated with each of these three factors: parents' level of education, lunch type, and test preparation.
For parents' level of education, students whose parents have more education generally scored higher on all of the tests. One possible explanation is that some students challenge themselves to match their parents. Interestingly, students whose parents' level is “some high school” scored slightly higher on average than students whose parents completed high school.
Regarding lunch, it seems that parental income affects students' performance on the tests, if free/reduced lunch is taken as an indicator of lower household income.
For test preparation, it is expected that students who prepare for tests perform better, although individual exceptions exist.
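
The second proposed analysis, whether the scores are correlated with each other, can be checked with a correlation matrix. A minimal sketch in base R using only the six rows printed by `head()` earlier, so the values are illustrative and are not results for the full dataset (`scores` and `score_cor` are hypothetical names):

```r
# First six rows of the score columns, retyped from the head() output
scores <- data.frame(
  math    = c(72, 69, 90, 47, 76, 71),
  reading = c(72, 90, 95, 57, 78, 83),
  writing = c(74, 88, 93, 44, 75, 78)
)

# Pairwise Pearson correlations between the three scores
score_cor <- cor(scores)
round(score_cor, 2)
```

On the full data the same call would be `cor(data_copy[, c("math", "reading", "writing")])`, assuming the renamed columns shown earlier.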