Load required packages.
library(knitr)
suppressMessages(library(tidyverse))
Description:
To find some interesting data for this week’s assignment, the Organization for Economic Co-operation and Development Open Data site (https://data.oecd.org/) referenced in our class notes was utilized. Leveraging the site’s Browse By Topic feature, I honed in on the various open data sets around education. My selection included open data sets around international (a) teacher salaries in primary education and (b) student performance in math, reading and science. I wanted to see if there is any correlation between teacher salaries and student learning performance.
All downloaded open data sets (csv format) included multi-year data. Since all data sets included 2018 (the most recent year), analysis was done for this year.
Load Data Sets:
Read in each raw data set.
teacherSalariesRaw <- read_csv("OECD-Teacher-Salaries-Primary.csv")
studentMathPerformanceRaw <- read_csv("OECD-Student-Math-Performance.csv")
studentReadingPerformanceRaw <- read_csv("OECD-Student-Reading-Performance.csv")
studentSciencePerformanceRaw <- read_csv("OECD-Student-Science-Performance.csv")
Cleanup Data Sets:
Pull out columns of interest from each raw table. Rename columns of interest (Location, Salary (USD), Math Performance, Reading Performance, Science Performance) to more friendly and meaningful names. Only include data for year 2018 and average teacher salaries with 15+ years of experience.
teacherSalaries <- teacherSalariesRaw %>%
filter(TIME == 2018, SUBJECT == 'PRY15YREXP') %>%
select(LOCATION, Value) %>%
rename('Location' = LOCATION, 'Salary (USD)' = Value)
studentMathPerformance <- studentMathPerformanceRaw %>%
filter(TIME == 2018) %>%
select(LOCATION, Value) %>%
rename('Location' = LOCATION, 'Math Performance' = Value)
studentReadingPerformance <- studentReadingPerformanceRaw %>%
filter(TIME == 2018) %>%
select(LOCATION, Value) %>%
rename('Location' = LOCATION, 'Reading Performance' = Value)
studentSciencePerformance <- studentSciencePerformanceRaw %>%
filter(TIME == 2018) %>%
select(LOCATION, Value) %>%
rename('Location' = LOCATION, 'Science Performance' = Value)
Join Data Sets:
Combine the 4 data sets into one.
combinedDataSet <- teacherSalaries %>%
inner_join(studentMathPerformance, by = c("Location" = "Location")) %>%
inner_join(studentReadingPerformance, by = c("Location" = "Location")) %>%
inner_join(studentSciencePerformance, by = c("Location" = "Location"))
Mutate Combined Data Set:
Derive a new column that combines math, reading and science performance into combined performance and sort from highest to lowest.
combinedDataSet <- combinedDataSet %>%
mutate('Combined Performance' = combinedDataSet$`Math Performance` + combinedDataSet$`Reading Performance` + combinedDataSet$`Science Performance`) %>%
arrange(desc(`Combined Performance`))
Results
Display data of interest using kable. Limit to top 10 results.
kable(head(combinedDataSet, n = 10), caption = "Top 10 Locations and Student Performance Metrics By Combined Performance")
| Location | Salary (USD) | Math Performance | Reading Performance | Science Performance | Combined Performance |
|---|---|---|---|---|---|
| JPN | 51339.33 | 527 | 504 | 529 | 1560 |
| KOR | 57179.33 | 526 | 514 | 519 | 1559 |
| CAN | 67301.23 | 512 | 520 | 518 | 1550 |
| FIN | 42180.39 | 507 | 520 | 522 | 1549 |
| POL | 26427.77 | 516 | 512 | 511 | 1539 |
| IRL | 61533.97 | 500 | 518 | 496 | 1514 |
| SVN | 42110.67 | 509 | 495 | 507 | 1511 |
| NZL | 47311.14 | 494 | 506 | 508 | 1508 |
| NLD | 63412.91 | 519 | 485 | 503 | 1507 |
| SWE | 45635.68 | 502 | 506 | 499 | 1507 |
Display summary data of interest.
minTeacherInfo <- combinedDataSet %>% select(Location, `Salary (USD)`) %>% filter(`Salary (USD)` == min(`Salary (USD)`))
maxTeacherInfo <- combinedDataSet %>% select(Location, `Salary (USD)`) %>% filter(`Salary (USD)` == max(`Salary (USD)`))
minCombinedPerformanceInfo <- combinedDataSet %>% select(Location, `Combined Performance`) %>% filter(`Combined Performance` == min(`Combined Performance`))
maxCombinedPerformanceInfo <- combinedDataSet %>% select(Location, `Combined Performance`) %>% filter(`Combined Performance` == max(`Combined Performance`))
kable(combinedDataSet %>% summarize('Location' = minTeacherInfo$Location, 'Min. Teacher Salary' = minTeacherInfo$`Salary (USD)`),
caption = "Location by Min / Max of Teacher Salary and Student Performance")
| Location | Min. Teacher Salary |
|---|---|
| LTU | 21083.88 |
kable(combinedDataSet %>% summarize('Location' = maxTeacherInfo$Location, 'Max. Teacher Salary' = maxTeacherInfo$`Salary (USD)`))
| Location | Max. Teacher Salary |
|---|---|
| LUX | 108624.2 |
kable(combinedDataSet %>% summarize('Location' = minCombinedPerformanceInfo$Location, 'Min. Combined Student Performance' = minCombinedPerformanceInfo$`Combined Performance`))
| Location | Min. Combined Student Performance |
|---|---|
| COL | 1216 |
kable(combinedDataSet %>% summarize('Location' = maxCombinedPerformanceInfo$Location, 'Max. Combined Student Performance' = maxCombinedPerformanceInfo$`Combined Performance`))
| Location | Max. Combined Student Performance |
|---|---|
| JPN | 1560 |
Display a scatter plot of all locations’ teacher salaries vs. student combined performance.
ggplot(data = combinedDataSet, aes(x = `Combined Performance`, y = `Salary (USD)` / 1000)) +
geom_point() +
xlab("Student Combined Performance") +
ylab("Teacher Salary (USD in thousands)")
From the scatter plot, there does not appear to be any correlation between higher paid teachers enabling their students to perform better in math, reading and science (even when pulling out the $108K outlier).
To ensure the combining of student performance did not influence the previous results, a reassessment was done by looking at performance per subject for each location. Given the number of countries, only the top 3 highest and 3 lowest paid teacher locations are displayed.
combinedDataSetLong <- combinedDataSet %>% select(-c(`Combined Performance`))
names(combinedDataSetLong) = c("Location", "Salary (USD)", "Math", "Reading", "Science")
combinedDataSetLong <- combinedDataSetLong %>%
gather(Subject, Performance, -c(Location, `Salary (USD)`)) %>%
arrange(desc(`Salary (USD)`))
combinedDataSetLongSample <- bind_rows(combinedDataSetLong %>% slice_head(n = 9), combinedDataSetLong %>% slice_tail(n = 9))
combinedDataSetLongSample$Location <- factor(combinedDataSetLongSample$Location, levels = unique(combinedDataSetLongSample$Location))
ggplot(data = combinedDataSetLongSample, aes(x = Performance, y = Location, fill = Subject)) +
geom_col(position = "dodge") +
ylab("Location [Highest Paid Teacher Salary (LUX) ... Lowest Paid (LTU)]") +
coord_flip()
From the bar chart, there does not appear to be any correlation between higher paid teachers enabling their students to perform better in math, reading and science - even when comparing the top 3 highest paid locations to the 3 lowest paid locations.
Given more time, it would have been good to look at other educational variables to see what (if any) factors may influence student performance. I originally wanted to include total educational spend in addition to teacher salaries, but would have had to expand over multiple years due to missing information (i.e., not all locations had a measured value in the same year). This probably could have extrapolated over some time span.
The one thing I would have liked to have added to Chart #2 was a Salary. Specifically, I would have liked to included the salary at the top of each bar char or at least close to the location label (e.g., LUX ($108K), …). This would have prevented the less than ideal X label axis.