Background

The purpose of this assignment is to tidy and transform data. The dataset of interest describes the relationship between student exam performance and parental education levels and our task is to:

  1. Create a .csv file in a “wide” structure.
  2. Read this .csv and use tidyr and dplyr to tidy and transform.
  3. Analyze the correlation between student performance and parental eucation level.
  4. Provide the .rmd file, rpubs link, and descriptions of steps taken.

(1) Create a .csv

The original data set was downloaded from Kaggle and is available here. Citation at bottom of page.

The data consists of marks secured by students in various subjects. The author seems to have a Finnish name so the data may have been extracted from the Finnish education system (be from Finland (one of the best in the world) …

A wide format .csv was created per assignment description and uploaded to github.

Our dataset was pulled from the larger Kaggle dataset (the 1st 20 students), transposed in Excel (rows became columns), and then saved as a .csv.

With the .csv available on github, we shift to reading from this file prior to tidying and transforming the data therein.

(2) Read and tidy

We read the .csv (in its raw form) from github and store corresponding data in a variable named “data” (I know … very creative).

#Get URL, read .csv (in raw form) from github, and put into tabular form
url <- getURL("https://raw.githubusercontent.com/Magnus-PS/CUNY-SPS-DATA-607/Project-2/student_performance.csv")
data <- read.csv(text = url)
data <- tbl_df(data)
## Warning: `tbl_df()` is deprecated as of dplyr 1.0.0.
## Please use `tibble::as_tibble()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
#Show what we're working with:
data
## # A tibble: 4 x 21
##   Student.Num X1    X2    X3    X4    X5    X6    X7    X8    X9    X10   X11  
##   <chr>       <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Parental L~ bach~ some~ mast~ asso~ some~ asso~ some~ some~ high~ high~ asso~
## 2 Math Score  72    69    90    47    76    71    88    40    64    38    58   
## 3 Reading Sc~ 72    90    95    57    78    83    95    43    64    60    54   
## 4 Writing Sc~ 74    88    93    44    75    78    92    39    67    50    52   
## # ... with 9 more variables: X12 <chr>, X13 <chr>, X14 <chr>, X15 <chr>,
## #   X16 <chr>, X17 <chr>, X18 <chr>, X19 <chr>, X20 <chr>

Once we’ve read from .csv and stored our data, we shift to tidying and transforming the data therein.

To start, we tidy:

##Make observations from variables using gather()
##This is also when we convert from a 'wide' to 'long' format
data_long <- gather(data, "Student", "Score", 2:21, factor_key=TRUE)

##Make variables from observations using spread()
data_long <- spread(data_long, "Student.Num", "Score")

##Sort columns: Student | Parental Level | Math | Reading | Writing
data_long <- select(data_long, Student, `Parental Level of Education`, `Math Score`, `Reading Score`, `Writing Score`)

data_long
## # A tibble: 20 x 5
##    Student `Parental Level of Educ~ `Math Score` `Reading Score` `Writing Score`
##    <fct>   <chr>                    <chr>        <chr>           <chr>          
##  1 X1      bachelor's degree        72           72              74             
##  2 X2      some college             69           90              88             
##  3 X3      master's degree          90           95              93             
##  4 X4      associate's degree       47           57              44             
##  5 X5      some college             76           78              75             
##  6 X6      associate's degree       71           83              78             
##  7 X7      some college             88           95              92             
##  8 X8      some college             40           43              39             
##  9 X9      high school              64           64              67             
## 10 X10     high school              38           60              50             
## 11 X11     associate's degree       58           54              52             
## 12 X12     associate's degree       40           52              43             
## 13 X13     high school              65           81              73             
## 14 X14     some college             78           72              70             
## 15 X15     master's degree          50           53              58             
## 16 X16     some high school         69           75              78             
## 17 X17     high school              88           89              86             
## 18 X18     some high school         18           32              28             
## 19 X19     master's degree          46           42              46             
## 20 X20     associate's degree       54           58              61

What started as a very unclear 4x21 ‘wide’ table (“data”) is now a tidy 20x5 ‘long’ table (“data_long”).

(3) Transform and analyze

With tidying complete, we move on to transforming our data. We’ll make use of dplyr() to transform data and perform mathematical calculations to solve for our variables. After this point, we will plot our tidied and transformed data, and should have all we need to clearly observe the correlation between student performance and parental education level.

Being that we’re only dealing with 20 observations and 5 categories with the parend education level variable, we’ll re-categorize parent education level as “low_ed” - high school, “mid_ed” - some college, associates, or “upper_ed” - bachelor’s, master’s.

##Categorize parent education level

upper_ed <- filter(data_long, (`Parental Level of Education` == "master's degree" | `Parental Level of Education` == "bachelor's degree"))

mid_ed <- filter(data_long, (`Parental Level of Education` == "some college" | `Parental Level of Education` == "associate's degree"))

low_ed <- filter(data_long, (`Parental Level of Education` == "high school"))


##Calculate average score per parent education level for each exam

###Those with an upper level: bachelor's or master's
u_avg_math <- sum(as.numeric(upper_ed$`Math Score`)) / nrow(upper_ed)
u_avg_read <- sum(as.numeric(upper_ed$`Reading Score`)) / nrow(upper_ed)
u_avg_write <- sum(as.numeric(upper_ed$`Writing Score`)) / nrow(upper_ed)

###Those with an mid level: some college or associate's

m_avg_math <- sum(as.numeric(mid_ed$`Math Score`)) / nrow(mid_ed)
m_avg_read <- sum(as.numeric(mid_ed$`Reading Score`)) / nrow(mid_ed)
m_avg_write <- sum(as.numeric(mid_ed$`Writing Score`)) / nrow(mid_ed)

###Those with a low level: high school

l_avg_math <- sum(as.numeric(low_ed$`Math Score`)) / nrow(low_ed)
l_avg_read <- sum(as.numeric(low_ed$`Reading Score`)) / nrow(low_ed)
l_avg_write <- sum(as.numeric(low_ed$`Writing Score`)) / nrow(low_ed)

At this point we’ve re-categorized parent education level and calculated the average score for all students per their categorization. Average test scores were calculated for math, reading, and writing.

We’ll make use of barplots to observe the difference in average test scores for each type of test and each parental education categorization.

##Plot the average score v parent education level for Math
barplot(c(u_avg_math, m_avg_math, l_avg_math), beside=TRUE, main = "Avg Math Score v Parent Education Level", xlab = "Parent Education Level", ylab = "Avg Math Score", ylim = c(0, 70), names.arg = c("Upper", "Mid", "Lower"), col=c("#6699FF", "#6699FF", "#6699FF"), border="white")

text(0.75, 50, as.character(u_avg_math))
text(1.90, 50, as.character(m_avg_math))
text(3.10, 50, as.character(l_avg_math))

##Plot the average score v parent education level for Reading
barplot(c(u_avg_read, m_avg_read, l_avg_read), beside=TRUE, main = "Avg Reading Score v Parent Education Level", xlab = "Parent Education Level", ylab = "Avg Reading Score", ylim = c(0, 75), names.arg = c("Upper", "Mid", "Lower"), col=c("#FF9933", "#FF9933", "#FF9933"), border="white")

text(0.75, 50, as.character(u_avg_read))
text(1.90, 50, as.character(m_avg_read))
text(3.10, 50, as.character(l_avg_read))

##Plot the average score v parent education level for Writing
barplot(c(u_avg_write, m_avg_write, l_avg_write), beside=TRUE, main = "Avg Writing Score v Parent Education Level", xlab = "Parent Education Level", ylab = "Avg Writing Score", ylim = c(0, 75), names.arg = c("Upper", "Mid", "Lower"), col=c("#66CC99", "#66CC99", "#66CC99"), border="white")

text(0.75, 50, as.character(u_avg_write))
text(1.90, 50, as.character(m_avg_write))
text(3.10, 50, as.character(l_avg_write))

Math test scores: Upper > Lower > Mid. The student’s whose parents fell into the “Upper” education category had the highest average score. It was close across the board though.

Reading test scores: Lower > Mid > Upper. The student’s whose parents fell into the “Lower” education category had the highest average score by a fairly significant margin.

Writing test scores: Lower > Upper > Mid. The student’s whose parents fell into the “Lower” education category had the highest average score by a slight margin over the “Upper” education category.

Rather interesting results …

Based on these results it would seem that the “Lower” parental education category, those whose parents merely went to High School had the highest average test scores.

The runner up would have been the “Upper” group and then the “Mid” group.

(4) Conclude

It appears, our assumptions could be flipped on their heads. Average test scores were rather close across the board (aside from Reading), but where we may have thought the “Upper” group would have outperformed all others, it was the dark horse “Lower” group that ended up scoring the highest average test scores.

With this said, it should be taken with a grain of salt. The sample size was rather small: 20 students considered across 3 categorizations. Thus any outlier (whether high or low) would have had an extraordinary affect on the group’s average score.

A larger dataset should be considered to draw a more decisive conclusion.


Data citation Jakki Seshapanpu. (2018). Students Performance in Exams [Data file]. Retrieved from https://www.kaggle.com/spscientist/students-performance-in-exams