BUA 681 - Assignment 5

Packages

Load required packages.

library(knitr)
suppressMessages(library(tidyverse))

Data

Description:

To find some interesting data for this week’s assignment, the Organization for Economic Co-operation and Development Open Data site (https://data.oecd.org/) referenced in our class notes was utilized. Leveraging the site’s Browse By Topic feature, I honed in on the various open data sets around education. My selection included open data sets around international (a) teacher salaries in primary education and (b) student performance in math, reading and science. I wanted to see if there is any correlation between teacher salaries and student learning performance.

All downloaded open data sets (csv format) included multi-year data. Since all data sets included 2018 (the most recent year), analysis was done for this year.

Load Data Sets:

Read in each raw data set.

teacherSalariesRaw <- read_csv("OECD-Teacher-Salaries-Primary.csv")
studentMathPerformanceRaw <- read_csv("OECD-Student-Math-Performance.csv")
studentReadingPerformanceRaw <- read_csv("OECD-Student-Reading-Performance.csv")
studentSciencePerformanceRaw <- read_csv("OECD-Student-Science-Performance.csv")

Cleanup Data Sets:

Pull out columns of interest from each raw table. Rename columns of interest (Location, Salary (USD), Math Performance, Reading Performance, Science Performance) to more friendly and meaningful names. Only include data for year 2018 and average teacher salaries with 15+ years of experience.

teacherSalaries <- teacherSalariesRaw %>% 
  filter(TIME == 2018, SUBJECT == 'PRY15YREXP') %>%
    select(LOCATION, Value) %>%
      rename('Location' = LOCATION, 'Salary (USD)' = Value)

studentMathPerformance <- studentMathPerformanceRaw %>% 
  filter(TIME == 2018) %>%
    select(LOCATION, Value) %>%
      rename('Location' = LOCATION, 'Math Performance' = Value)

studentReadingPerformance <- studentReadingPerformanceRaw %>% 
  filter(TIME == 2018) %>%
    select(LOCATION, Value) %>%
      rename('Location' = LOCATION, 'Reading Performance' = Value)

studentSciencePerformance <- studentSciencePerformanceRaw %>% 
  filter(TIME == 2018) %>%
    select(LOCATION, Value) %>%
      rename('Location' = LOCATION, 'Science Performance' = Value)

Join Data Sets:

Combine the 4 data sets into one.

combinedDataSet <- teacherSalaries %>% 
  inner_join(studentMathPerformance, by = c("Location" = "Location")) %>%
    inner_join(studentReadingPerformance, by = c("Location" = "Location")) %>%
      inner_join(studentSciencePerformance, by = c("Location" = "Location"))

Mutate Combined Data Set:

Derive a new column that combines math, reading and science performance into combined performance and sort from highest to lowest.

combinedDataSet <- combinedDataSet %>%
  mutate('Combined Performance' = combinedDataSet$`Math Performance` + combinedDataSet$`Reading Performance` + combinedDataSet$`Science Performance`) %>%
    arrange(desc(`Combined Performance`))

Results

Display data of interest using kable. Limit to top 10 results.

kable(head(combinedDataSet, n = 10), caption = "Top 10 Locations and Student Performance Metrics By Combined Performance")

Top 10 Locations and Student Performance Metrics By Combined Performance
Location	Salary (USD)	Math Performance	Reading Performance	Science Performance	Combined Performance
JPN	51339.33	527	504	529	1560
KOR	57179.33	526	514	519	1559
CAN	67301.23	512	520	518	1550
FIN	42180.39	507	520	522	1549
POL	26427.77	516	512	511	1539
IRL	61533.97	500	518	496	1514
SVN	42110.67	509	495	507	1511
NZL	47311.14	494	506	508	1508
NLD	63412.91	519	485	503	1507
SWE	45635.68	502	506	499	1507

Display summary data of interest.

minTeacherInfo <- combinedDataSet %>% select(Location, `Salary (USD)`) %>% filter(`Salary (USD)` == min(`Salary (USD)`))
maxTeacherInfo <- combinedDataSet %>% select(Location, `Salary (USD)`) %>% filter(`Salary (USD)` == max(`Salary (USD)`))
minCombinedPerformanceInfo <- combinedDataSet %>% select(Location, `Combined Performance`) %>% filter(`Combined Performance` == min(`Combined Performance`))
maxCombinedPerformanceInfo <- combinedDataSet %>% select(Location, `Combined Performance`) %>% filter(`Combined Performance` == max(`Combined Performance`))

kable(combinedDataSet %>% summarize('Location' = minTeacherInfo$Location, 'Min. Teacher Salary' = minTeacherInfo$`Salary (USD)`), 
      caption = "Location by Min / Max of Teacher Salary and Student Performance")

Location by Min / Max of Teacher Salary and Student Performance
Location	Min. Teacher Salary
LTU	21083.88

kable(combinedDataSet %>% summarize('Location' = maxTeacherInfo$Location, 'Max. Teacher Salary' = maxTeacherInfo$`Salary (USD)`))

Location	Max. Teacher Salary
LUX	108624.2

kable(combinedDataSet %>% summarize('Location' = minCombinedPerformanceInfo$Location, 'Min. Combined Student Performance' = minCombinedPerformanceInfo$`Combined Performance`))

Location	Min. Combined Student Performance
COL	1216

kable(combinedDataSet %>% summarize('Location' = maxCombinedPerformanceInfo$Location, 'Max. Combined Student Performance' = maxCombinedPerformanceInfo$`Combined Performance`))

Location	Max. Combined Student Performance
JPN	1560

Chart #1 - Scatter Plot: Teacher Salary vs. Student Combined Performance

Display a scatter plot of all locations’ teacher salaries vs. student combined performance.

ggplot(data = combinedDataSet, aes(x = `Combined Performance`, y = `Salary (USD)` / 1000)) +
  geom_point() +
  xlab("Student Combined Performance") +
  ylab("Teacher Salary (USD in thousands)")

From the scatter plot, there does not appear to be any correlation between higher paid teachers enabling their students to perform better in math, reading and science (even when pulling out the $108K outlier).

Chart #2 - Bar Chart: Performance By Subject vs. Location

To ensure the combining of student performance did not influence the previous results, a reassessment was done by looking at performance per subject for each location. Given the number of countries, only the top 3 highest and 3 lowest paid teacher locations are displayed.

combinedDataSetLong <- combinedDataSet %>% select(-c(`Combined Performance`))
names(combinedDataSetLong) = c("Location", "Salary (USD)", "Math", "Reading", "Science")

combinedDataSetLong <- combinedDataSetLong %>% 
  gather(Subject, Performance, -c(Location, `Salary (USD)`)) %>% 
    arrange(desc(`Salary (USD)`))

combinedDataSetLongSample <- bind_rows(combinedDataSetLong %>% slice_head(n = 9), combinedDataSetLong %>% slice_tail(n = 9))
combinedDataSetLongSample$Location <- factor(combinedDataSetLongSample$Location, levels = unique(combinedDataSetLongSample$Location))

ggplot(data = combinedDataSetLongSample, aes(x = Performance, y = Location, fill = Subject)) +
  geom_col(position = "dodge") +
    ylab("Location [Highest Paid Teacher Salary (LUX) ... Lowest Paid (LTU)]") +
      coord_flip()

From the bar chart, there does not appear to be any correlation between higher paid teachers enabling their students to perform better in math, reading and science - even when comparing the top 3 highest paid locations to the 3 lowest paid locations.

Final Thoughts

Given more time, it would have been good to look at other educational variables to see what (if any) factors may influence student performance. I originally wanted to include total educational spend in addition to teacher salaries, but would have had to expand over multiple years due to missing information (i.e., not all locations had a measured value in the same year). This probably could have extrapolated over some time span.