Theme for this Project is Data Transformation.
For this project: 1. Create a .CSV file (or optionally, a MySQL database!) that includes all of the information included in the dataset. You’re encouraged to use a “wide” structure similar to how the information appears in the discussion item, so that you can practice tidying and transformations as described below. 2. Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data. [Most of your grade will be based on this step!] 3. Perform the analysis requested in the discussion item.
Collaborators: Magnus Skonberg
The dataset was pulled from the larger Kaggle dataset (the 1st 20 students), transposed in Excel (rows became columns), and then saved as a .csv.
The Student Performance in Exams dataset, containing column names (representing Student numbers), is read from GitHub.
#
theURL <- c("https://raw.githubusercontent.com/CUNYSPS-RickRN/DATA607/master/student_performance.csv")
student_performance_df <- read_csv(theURL)
## Parsed with column specification:
## cols(
## .default = col_character()
## )
## See spec(...) for full column specifications.
## function (..., list = character(), package = NULL, lib.loc = NULL, verbose = getOption("verbose"),
## envir = .GlobalEnv, overwrite = TRUE)
This phase will tidy and transform the Student Performance in Exams dataset. It’s readily apparent that each column represents observations for each student. Therefore, pivot_longer used to tidy data to represent observations comprised of Parental Level of Education and scores for Math, Reading, and Writing.
# Pivot table
student_pivot_df <- gather(student_performance_df, "Student", "Score", 2:21, factor_key=TRUE)
#
student_pivot_df <- spread(student_pivot_df, "Student Num", "Score")
# resequence columns
student_pivot_df <- select(student_pivot_df, Student, `Parental Level of Education`, `Math Score`, `Reading Score`, `Writing Score`)
student_pivot_df
## # A tibble: 20 x 5
## Student `Parental Level of Educ~ `Math Score` `Reading Score` `Writing Score`
## <fct> <chr> <chr> <chr> <chr>
## 1 1 bachelor's degree 72 72 74
## 2 2 some college 69 90 88
## 3 3 master's degree 90 95 93
## 4 4 associate's degree 47 57 44
## 5 5 some college 76 78 75
## 6 6 associate's degree 71 83 78
## 7 7 some college 88 95 92
## 8 8 some college 40 43 39
## 9 9 high school 64 64 67
## 10 10 high school 38 60 50
## 11 11 associate's degree 58 54 52
## 12 12 associate's degree 40 52 43
## 13 13 high school 65 81 73
## 14 14 some college 78 72 70
## 15 15 master's degree 50 53 58
## 16 16 some high school 69 75 78
## 17 17 high school 88 89 86
## 18 18 some high school 18 32 28
## 19 19 master's degree 46 42 46
## 20 20 associate's degree 54 58 61
Parental_LOE_factors <- factor(student_pivot_df$`Parental Level of Education`)
levels(Parental_LOE_factors)
## [1] "associate's degree" "bachelor's degree" "high school"
## [4] "master's degree" "some college" "some high school"
# Categorize parent education level
upper_ed <- filter(student_pivot_df, (`Parental Level of Education` == "master's degree" | `Parental Level of Education` == "bachelor's degree"))
mid_ed <- filter(student_pivot_df, (`Parental Level of Education` == "some college" | `Parental Level of Education` == "associate's degree"))
low_ed <- filter(student_pivot_df, (`Parental Level of Education` == "high school" | `Parental Level of Education` == "some high school"))
# Calculate average score per parent education level for each exam [future-do as func]
# Those with an upper level: bachelor's or master's
u_avg_math <- sum(as.numeric(upper_ed$`Math Score`)) / nrow(upper_ed)
u_avg_read <- sum(as.numeric(upper_ed$`Reading Score`)) / nrow(upper_ed)
u_avg_write <- sum(as.numeric(upper_ed$`Writing Score`)) / nrow(upper_ed)
# Those with an mid level: some college or associate's
m_avg_math <- sum(as.numeric(mid_ed$`Math Score`)) / nrow(mid_ed)
m_avg_read <- sum(as.numeric(mid_ed$`Reading Score`)) / nrow(mid_ed)
m_avg_write <- sum(as.numeric(mid_ed$`Writing Score`)) / nrow(mid_ed)
# Those with a low level: high school
l_avg_math <- sum(as.numeric(low_ed$`Math Score`)) / nrow(low_ed)
l_avg_read <- sum(as.numeric(low_ed$`Reading Score`)) / nrow(low_ed)
l_avg_write <- sum(as.numeric(low_ed$`Writing Score`)) / nrow(low_ed)
c("Upper level avgs - Math: ",u_avg_math, " Reading: ",u_avg_read, " Writing: ", u_avg_write)
## [1] "Upper level avgs - Math: " "64.5"
## [3] " Reading: " "65.5"
## [5] " Writing: " "67.75"
## [1] "Mid level avgs - Math: " "62.1"
## [3] " Reading: " "68.2"
## [5] " Writing: " "64.2"
## [1] "Lower level avgs - Math: " "57"
## [3] " Reading: " "66.8333333333333"
## [5] " Writing: " "63.6666666666667"
Three sets of graphs show for each level of parental education level the students’ average scores in Math, Reading, and Writing.
##Plot the average score v parent education level for Math
barplot(c(u_avg_math, m_avg_math, l_avg_math), beside=TRUE, main = "Avg Math Score v Parent Education Level", xlab = "Parent Education Level", ylab = "Avg Math Score", ylim = c(0, 70), names.arg = c("Upper", "Mid", "Lower"), col=c("#6699FF", "#6699FF", "#6699FF"), border="white")
text(0.75, 50, as.character(u_avg_math))
text(1.90, 50, as.character(m_avg_math))
text(3.10, 50, as.character(l_avg_math))
##Plot the average score v parent education level for Reading
barplot(c(u_avg_read, m_avg_read, l_avg_read), beside=TRUE, main = "Avg Reading Score v Parent Education Level", xlab = "Parent Education Level", ylab = "Avg Reading Score", ylim = c(0, 75), names.arg = c("Upper", "Mid", "Lower"), col=c("#FF9933", "#FF9933", "#FF9933"), border="white")
text(0.75, 50, as.character(u_avg_read))
text(1.90, 50, as.character(m_avg_read))
text(3.10, 50, as.character(l_avg_read))
##Plot the average score v parent education level for Writing
barplot(c(u_avg_write, m_avg_write, l_avg_write), beside=TRUE, main = "Avg Writing Score v Parent Education Level", xlab = "Parent Education Level", ylab = "Avg Writing Score", ylim = c(0, 75), names.arg = c("Upper", "Mid", "Lower"), col=c("#66CC99", "#66CC99", "#66CC99"), border="white")
text(0.75, 50, as.character(u_avg_write))
text(1.90, 50, as.character(m_avg_write))
text(3.10, 50, as.character(l_avg_write))
In Math, the average student scores were better when parents had Master or Bachelor degrees and lower average student scores followed parents with high school education.
In Writing, students whose parents have attained higher degrees score higher than students whose parents have an Associate’s degree or lower.
Interestingly, for Reading, students whose parents have higher levels of education with Masters or Bachelors degrees scored the lower than students whose parents only have a high school education.