The purpose of this assignment is to tidy and transform data. The dataset of interest describes the relationship between student exam performance and parental education levels and our task is to:
The original data set was downloaded from Kaggle and is available here. Citation at bottom of page.
The data consists of marks secured by students in various subjects. The author seems to have a Finnish name so the data may have been extracted from the Finnish education system (be from Finland (one of the best in the world) …
A wide format .csv was created per assignment description and uploaded to github.
Our dataset was pulled from the larger Kaggle dataset (the 1st 20 students), transposed in Excel (rows became columns), and then saved as a .csv.
With the .csv available on github, we shift to reading from this file prior to tidying and transforming the data therein.
We read the .csv (in its raw form) from github and store corresponding data in a variable named “data” (I know … very creative).
#Get URL, read .csv (in raw form) from github, and put into tabular form
url <- getURL("https://raw.githubusercontent.com/Magnus-PS/CUNY-SPS-DATA-607/Project-2/student_performance.csv")
data <- read.csv(text = url)
data <- tbl_df(data)
## Warning: `tbl_df()` is deprecated as of dplyr 1.0.0.
## Please use `tibble::as_tibble()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
## # A tibble: 4 x 21
## Student.Num X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Parental L~ bach~ some~ mast~ asso~ some~ asso~ some~ some~ high~ high~ asso~
## 2 Math Score 72 69 90 47 76 71 88 40 64 38 58
## 3 Reading Sc~ 72 90 95 57 78 83 95 43 64 60 54
## 4 Writing Sc~ 74 88 93 44 75 78 92 39 67 50 52
## # ... with 9 more variables: X12 <chr>, X13 <chr>, X14 <chr>, X15 <chr>,
## # X16 <chr>, X17 <chr>, X18 <chr>, X19 <chr>, X20 <chr>
Once we’ve read from .csv and stored our data, we shift to tidying and transforming the data therein.
To start, we tidy:
##Make observations from variables using gather()
##This is also when we convert from a 'wide' to 'long' format
data_long <- gather(data, "Student", "Score", 2:21, factor_key=TRUE)
##Make variables from observations using spread()
data_long <- spread(data_long, "Student.Num", "Score")
##Sort columns: Student | Parental Level | Math | Reading | Writing
data_long <- select(data_long, Student, `Parental Level of Education`, `Math Score`, `Reading Score`, `Writing Score`)
data_long
## # A tibble: 20 x 5
## Student `Parental Level of Educ~ `Math Score` `Reading Score` `Writing Score`
## <fct> <chr> <chr> <chr> <chr>
## 1 X1 bachelor's degree 72 72 74
## 2 X2 some college 69 90 88
## 3 X3 master's degree 90 95 93
## 4 X4 associate's degree 47 57 44
## 5 X5 some college 76 78 75
## 6 X6 associate's degree 71 83 78
## 7 X7 some college 88 95 92
## 8 X8 some college 40 43 39
## 9 X9 high school 64 64 67
## 10 X10 high school 38 60 50
## 11 X11 associate's degree 58 54 52
## 12 X12 associate's degree 40 52 43
## 13 X13 high school 65 81 73
## 14 X14 some college 78 72 70
## 15 X15 master's degree 50 53 58
## 16 X16 some high school 69 75 78
## 17 X17 high school 88 89 86
## 18 X18 some high school 18 32 28
## 19 X19 master's degree 46 42 46
## 20 X20 associate's degree 54 58 61
What started as a very unclear 4x21 ‘wide’ table (“data”) is now a tidy 20x5 ‘long’ table (“data_long”).
With tidying complete, we move on to transforming our data. We’ll make use of dplyr() to transform data and perform mathematical calculations to solve for our variables. After this point, we will plot our tidied and transformed data, and should have all we need to clearly observe the correlation between student performance and parental education level.
Being that we’re only dealing with 20 observations and 5 categories with the parend education level variable, we’ll re-categorize parent education level as “low_ed” - high school, “mid_ed” - some college, associates, or “upper_ed” - bachelor’s, master’s.
##Categorize parent education level
upper_ed <- filter(data_long, (`Parental Level of Education` == "master's degree" | `Parental Level of Education` == "bachelor's degree"))
mid_ed <- filter(data_long, (`Parental Level of Education` == "some college" | `Parental Level of Education` == "associate's degree"))
low_ed <- filter(data_long, (`Parental Level of Education` == "high school"))
##Calculate average score per parent education level for each exam
###Those with an upper level: bachelor's or master's
u_avg_math <- sum(as.numeric(upper_ed$`Math Score`)) / nrow(upper_ed)
u_avg_read <- sum(as.numeric(upper_ed$`Reading Score`)) / nrow(upper_ed)
u_avg_write <- sum(as.numeric(upper_ed$`Writing Score`)) / nrow(upper_ed)
###Those with an mid level: some college or associate's
m_avg_math <- sum(as.numeric(mid_ed$`Math Score`)) / nrow(mid_ed)
m_avg_read <- sum(as.numeric(mid_ed$`Reading Score`)) / nrow(mid_ed)
m_avg_write <- sum(as.numeric(mid_ed$`Writing Score`)) / nrow(mid_ed)
###Those with a low level: high school
l_avg_math <- sum(as.numeric(low_ed$`Math Score`)) / nrow(low_ed)
l_avg_read <- sum(as.numeric(low_ed$`Reading Score`)) / nrow(low_ed)
l_avg_write <- sum(as.numeric(low_ed$`Writing Score`)) / nrow(low_ed)
At this point we’ve re-categorized parent education level and calculated the average score for all students per their categorization. Average test scores were calculated for math, reading, and writing.
We’ll make use of barplots to observe the difference in average test scores for each type of test and each parental education categorization.
##Plot the average score v parent education level for Math
barplot(c(u_avg_math, m_avg_math, l_avg_math), beside=TRUE, main = "Avg Math Score v Parent Education Level", xlab = "Parent Education Level", ylab = "Avg Math Score", ylim = c(0, 70), names.arg = c("Upper", "Mid", "Lower"), col=c("#6699FF", "#6699FF", "#6699FF"), border="white")
text(0.75, 50, as.character(u_avg_math))
text(1.90, 50, as.character(m_avg_math))
text(3.10, 50, as.character(l_avg_math))
##Plot the average score v parent education level for Reading
barplot(c(u_avg_read, m_avg_read, l_avg_read), beside=TRUE, main = "Avg Reading Score v Parent Education Level", xlab = "Parent Education Level", ylab = "Avg Reading Score", ylim = c(0, 75), names.arg = c("Upper", "Mid", "Lower"), col=c("#FF9933", "#FF9933", "#FF9933"), border="white")
text(0.75, 50, as.character(u_avg_read))
text(1.90, 50, as.character(m_avg_read))
text(3.10, 50, as.character(l_avg_read))
##Plot the average score v parent education level for Writing
barplot(c(u_avg_write, m_avg_write, l_avg_write), beside=TRUE, main = "Avg Writing Score v Parent Education Level", xlab = "Parent Education Level", ylab = "Avg Writing Score", ylim = c(0, 75), names.arg = c("Upper", "Mid", "Lower"), col=c("#66CC99", "#66CC99", "#66CC99"), border="white")
text(0.75, 50, as.character(u_avg_write))
text(1.90, 50, as.character(m_avg_write))
text(3.10, 50, as.character(l_avg_write))
Math test scores: Upper > Lower > Mid. The student’s whose parents fell into the “Upper” education category had the highest average score. It was close across the board though.
Reading test scores: Lower > Mid > Upper. The student’s whose parents fell into the “Lower” education category had the highest average score by a fairly significant margin.
Writing test scores: Lower > Upper > Mid. The student’s whose parents fell into the “Lower” education category had the highest average score by a slight margin over the “Upper” education category.
Rather interesting results …
Based on these results it would seem that the “Lower” parental education category, those whose parents merely went to High School had the highest average test scores.
The runner up would have been the “Upper” group and then the “Mid” group.
It appears, our assumptions could be flipped on their heads. Average test scores were rather close across the board (aside from Reading), but where we may have thought the “Upper” group would have outperformed all others, it was the dark horse “Lower” group that ended up scoring the highest average test scores.
With this said, it should be taken with a grain of salt. The sample size was rather small: 20 students considered across 3 categorizations. Thus any outlier (whether high or low) would have had an extraordinary affect on the group’s average score.
A larger dataset should be considered to draw a more decisive conclusion.
Data citation Jakki Seshapanpu. (2018). Students Performance in Exams [Data file]. Retrieved from https://www.kaggle.com/spscientist/students-performance-in-exams