Packages

Load required packages.

library(knitr)
suppressMessages(library(tidyverse))

Data

Description:

To find some interesting data for this week’s assignment, the Organization for Economic Co-operation and Development Open Data site (https://data.oecd.org/) referenced in our class notes was utilized. Leveraging the site’s Browse By Topic feature, I honed in on the various open data sets around education. My selection included open data sets around international (a) teacher salaries in primary education and (b) student performance in math, reading and science. I wanted to see if there is any correlation between teacher salaries and student learning performance.

All downloaded open data sets (csv format) included multi-year data. Since all data sets included 2018 (the most recent year), analysis was done for this year.

Load Data Sets:

Read in each raw data set.

teacherSalariesRaw <- read_csv("OECD-Teacher-Salaries-Primary.csv")
studentMathPerformanceRaw <- read_csv("OECD-Student-Math-Performance.csv")
studentReadingPerformanceRaw <- read_csv("OECD-Student-Reading-Performance.csv")
studentSciencePerformanceRaw <- read_csv("OECD-Student-Science-Performance.csv")

Cleanup Data Sets:

Pull out columns of interest from each raw table. Rename columns of interest (Location, Salary (USD), Math Performance, Reading Performance, Science Performance) to more friendly and meaningful names. Only include data for year 2018 and average teacher salaries with 15+ years of experience.

teacherSalaries <- teacherSalariesRaw %>% 
  filter(TIME == 2018, SUBJECT == 'PRY15YREXP') %>%
    select(LOCATION, Value) %>%
      rename('Location' = LOCATION, 'Salary (USD)' = Value)

studentMathPerformance <- studentMathPerformanceRaw %>% 
  filter(TIME == 2018) %>%
    select(LOCATION, Value) %>%
      rename('Location' = LOCATION, 'Math Performance' = Value)

studentReadingPerformance <- studentReadingPerformanceRaw %>% 
  filter(TIME == 2018) %>%
    select(LOCATION, Value) %>%
      rename('Location' = LOCATION, 'Reading Performance' = Value)

studentSciencePerformance <- studentSciencePerformanceRaw %>% 
  filter(TIME == 2018) %>%
    select(LOCATION, Value) %>%
      rename('Location' = LOCATION, 'Science Performance' = Value)

Join Data Sets:

Combine the 4 data sets into one.

combinedDataSet <- teacherSalaries %>% 
  inner_join(studentMathPerformance, by = c("Location" = "Location")) %>%
    inner_join(studentReadingPerformance, by = c("Location" = "Location")) %>%
      inner_join(studentSciencePerformance, by = c("Location" = "Location"))

Mutate Combined Data Set:

Derive a new column that combines math, reading and science performance into combined performance and sort from highest to lowest.

combinedDataSet <- combinedDataSet %>%
  mutate('Combined Performance' = combinedDataSet$`Math Performance` + combinedDataSet$`Reading Performance` + combinedDataSet$`Science Performance`) %>%
    arrange(desc(`Combined Performance`))

Results

Display data of interest using kable. Limit to top 10 results.

kable(head(combinedDataSet, n = 10), caption = "Top 10 Locations and Student Performance Metrics By Combined Performance")
Top 10 Locations and Student Performance Metrics By Combined Performance
Location Salary (USD) Math Performance Reading Performance Science Performance Combined Performance
JPN 51339.33 527 504 529 1560
KOR 57179.33 526 514 519 1559
CAN 67301.23 512 520 518 1550
FIN 42180.39 507 520 522 1549
POL 26427.77 516 512 511 1539
IRL 61533.97 500 518 496 1514
SVN 42110.67 509 495 507 1511
NZL 47311.14 494 506 508 1508
NLD 63412.91 519 485 503 1507
SWE 45635.68 502 506 499 1507

Display summary data of interest.

minTeacherInfo <- combinedDataSet %>% select(Location, `Salary (USD)`) %>% filter(`Salary (USD)` == min(`Salary (USD)`))
maxTeacherInfo <- combinedDataSet %>% select(Location, `Salary (USD)`) %>% filter(`Salary (USD)` == max(`Salary (USD)`))
minCombinedPerformanceInfo <- combinedDataSet %>% select(Location, `Combined Performance`) %>% filter(`Combined Performance` == min(`Combined Performance`))
maxCombinedPerformanceInfo <- combinedDataSet %>% select(Location, `Combined Performance`) %>% filter(`Combined Performance` == max(`Combined Performance`))

kable(combinedDataSet %>% summarize('Location' = minTeacherInfo$Location, 'Min. Teacher Salary' = minTeacherInfo$`Salary (USD)`), 
      caption = "Location by Min / Max of Teacher Salary and Student Performance")
Location by Min / Max of Teacher Salary and Student Performance
Location Min. Teacher Salary
LTU 21083.88
kable(combinedDataSet %>% summarize('Location' = maxTeacherInfo$Location, 'Max. Teacher Salary' = maxTeacherInfo$`Salary (USD)`))
Location Max. Teacher Salary
LUX 108624.2
kable(combinedDataSet %>% summarize('Location' = minCombinedPerformanceInfo$Location, 'Min. Combined Student Performance' = minCombinedPerformanceInfo$`Combined Performance`))
Location Min. Combined Student Performance
COL 1216
kable(combinedDataSet %>% summarize('Location' = maxCombinedPerformanceInfo$Location, 'Max. Combined Student Performance' = maxCombinedPerformanceInfo$`Combined Performance`))
Location Max. Combined Student Performance
JPN 1560

Chart #1 - Scatter Plot: Teacher Salary vs. Student Combined Performance

Display a scatter plot of all locations’ teacher salaries vs. student combined performance.

ggplot(data = combinedDataSet, aes(x = `Combined Performance`, y = `Salary (USD)` / 1000)) +
  geom_point() +
  xlab("Student Combined Performance") +
  ylab("Teacher Salary (USD in thousands)")

From the scatter plot, there does not appear to be any correlation between higher paid teachers enabling their students to perform better in math, reading and science (even when pulling out the $108K outlier).

Chart #2 - Bar Chart: Performance By Subject vs. Location

To ensure the combining of student performance did not influence the previous results, a reassessment was done by looking at performance per subject for each location. Given the number of countries, only the top 3 highest and 3 lowest paid teacher locations are displayed.

combinedDataSetLong <- combinedDataSet %>% select(-c(`Combined Performance`))
names(combinedDataSetLong) = c("Location", "Salary (USD)", "Math", "Reading", "Science")

combinedDataSetLong <- combinedDataSetLong %>% 
  gather(Subject, Performance, -c(Location, `Salary (USD)`)) %>% 
    arrange(desc(`Salary (USD)`))

combinedDataSetLongSample <- bind_rows(combinedDataSetLong %>% slice_head(n = 9), combinedDataSetLong %>% slice_tail(n = 9))
combinedDataSetLongSample$Location <- factor(combinedDataSetLongSample$Location, levels = unique(combinedDataSetLongSample$Location))

ggplot(data = combinedDataSetLongSample, aes(x = Performance, y = Location, fill = Subject)) +
  geom_col(position = "dodge") +
    ylab("Location [Highest Paid Teacher Salary (LUX) ... Lowest Paid (LTU)]") +
      coord_flip()

From the bar chart, there does not appear to be any correlation between higher paid teachers enabling their students to perform better in math, reading and science - even when comparing the top 3 highest paid locations to the 3 lowest paid locations.

Final Thoughts

Given more time, it would have been good to look at other educational variables to see what (if any) factors may influence student performance. I originally wanted to include total educational spend in addition to teacher salaries, but would have had to expand over multiple years due to missing information (i.e., not all locations had a measured value in the same year). This probably could have extrapolated over some time span.

The one thing I would have liked to have added to Chart #2 was a Salary. Specifically, I would have liked to included the salary at the top of each bar char or at least close to the location label (e.g., LUX ($108K), …). This would have prevented the less than ideal X label axis.