# Loading libraries
suppressWarnings({
library(tidyverse)
library(sf)
library(leaflet)
library(mapview)
library(tigris)
library(tidycensus)
library(dplyr)
library(readxl)
library(ggpubr)
library(ggthemes)
library(ggplot2)
library(broom)
library(patchwork)
library(MASS)
})
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## Linking to GEOS 3.11.0, GDAL 3.5.3, PROJ 9.1.0; sf_use_s2() is TRUE
##
## The legacy packages maptools, rgdal, and rgeos, underpinning the sp package,
## which was just loaded, will retire in October 2023.
## Please refer to R-spatial evolution reports for details, especially
## https://r-spatial.org/r/2023/05/15/evolution4.html.
## It may be desirable to make the sf package available;
## package maintainers should consider adding sf to Suggests:.
## The sp package is now running under evolution status 2
## (status 2 uses the sf package in place of rgdal)
##
## To enable caching of data, set `options(tigris_use_cache = TRUE)`
## in your R script or .Rprofile.
##
##
## Attaching package: 'MASS'
##
##
## The following object is masked from 'package:patchwork':
##
## area
##
##
## The following object is masked from 'package:dplyr':
##
## select
The dataset Union Strength Variables from CBAs.xlsx is
scraped, collected, cleaned by San Deigo County Taxpayers Association
Policy Analyst Intern Jack Andolina. The data is further cleaned by
matching and merging district code in accordance to the official
district code published by California Department of Education. The data
is then arranged by calendar year.
The following code explores variables of the table and special patterns to distribution across years and school district.
suppressWarnings({
#Loading Data
data <- "/Users/natania.w/Desktop/Cal things/Career Related/Internship/SDCTA/Union Strength Variables from CBAs.xlsx"
union<- read_excel(data, sheet=3)
## Data Cleaning
union$`calendar year`<-as.factor(union$`calendar year`)
union[,-c(2,3,4,5,12,17,18)]=lapply(union[,-c(2,3,4,5,12,17,18)], as.numeric)
head(union, 5)
})
## # A tibble: 5 × 18
## `district code` `school district` `union name` `Union represents`
## <dbl> <chr> <chr> <chr>
## 1 3768080 Encinitas Union School Distri… Teachers of… All
## 2 3768080 Encinitas Union School Distri… Teachers of… All
## 3 3768080 Encinitas Union School Distri… Teachers of… All
## 4 3773791 San Marcos Unified School Dis… San Marcos … All
## 5 3773791 San Marcos Unified School Dis… San Marcos … All
## # ℹ 14 more variables: `calendar year` <fct>, `contract start date` <dbl>,
## # `contract end date` <dbl>,
## # `yearly % increase in teacher salary across all groups` <dbl>,
## # `minimum # days paid leave per month` <dbl>,
## # `maximum # days paid leave per month` <dbl>,
## # `maximum allowable class size (not including PE/Dance/extra curriculars) averaged across grades` <dbl>,
## # `Hard max?` <chr>, …
# all variables of dataset
str(union)
## tibble [104 × 18] (S3: tbl_df/tbl/data.frame)
## $ district code : num [1:104] 3768080 3768080 3768080 3773791 3773791 ...
## $ school district : chr [1:104] "Encinitas Union School District" "Encinitas Union School District" "Encinitas Union School District" "San Marcos Unified School District" ...
## $ union name : chr [1:104] "Teachers of Encinitas" "Teachers of Encinitas" "Teachers of Encinitas" "San Marcos Educators Association" ...
## $ Union represents : chr [1:104] "All" "All" "All" "All" ...
## $ calendar year : Factor w/ 7 levels "2018","2019",..: 5 6 7 5 6 5 5 6 5 5 ...
## $ contract start date : num [1:104] 44743 44743 44743 44743 44743 ...
## $ contract end date : num [1:104] 45838 45838 45838 45473 45473 ...
## $ yearly % increase in teacher salary across all groups : num [1:104] 3.5 3.5 3.5 4.5 4.5 4 NA NA 7.5 0 ...
## $ minimum # days paid leave per month : num [1:104] 0.83 0.83 0.83 0.83 0.83 0.83 0.83 0.83 0.83 0.83 ...
## $ maximum # days paid leave per month : num [1:104] 8.33 8.33 8.33 8.33 8.33 ...
## $ maximum allowable class size (not including PE/Dance/extra curriculars) averaged across grades: num [1:104] 31 31 31 34 34 37 35 35 30 32 ...
## $ Hard max? : chr [1:104] "No" "No" "No" "Yes" ...
## $ step one % increase for bachelors with credential : num [1:104] 8 8 8 0 0 6 0.92 0.92 6.4 0 ...
## $ salary incentive for having a masters degree : num [1:104] 1000 1000 1000 0 0 ...
## $ minimum salary for instructors : num [1:104] 49124 49124 49124 NA NA ...
## $ maximum salary for instructors : num [1:104] 129687 129687 129687 NA NA ...
## $ paid leave increase through seniority? : chr [1:104] "Yes" "Yes" "Yes" "Yes" ...
## $ Longevity Perks? : chr [1:104] "Yes" "Yes" "Yes" "Yes" ...
# Summary Statistics of only numeric columns
summary(union[,-c(1,2,3,4,6,7,12,17,18)])
## calendar year yearly % increase in teacher salary across all groups
## 2018: 4 Min. :0.000
## 2019: 6 1st Qu.:0.000
## 2020:16 Median :3.000
## 2021:25 Mean :2.654
## 2022:33 3rd Qu.:3.875
## 2023:16 Max. :7.500
## 2024: 4 NA's :2
## minimum # days paid leave per month maximum # days paid leave per month
## Min. :0.83 Min. :2.000
## 1st Qu.:0.83 1st Qu.:8.330
## Median :0.83 Median :8.330
## Mean :0.83 Mean :7.539
## 3rd Qu.:0.83 3rd Qu.:8.330
## Max. :0.83 Max. :8.333
## NA's :4 NA's :80
## maximum allowable class size (not including PE/Dance/extra curriculars) averaged across grades
## Min. :27.00
## 1st Qu.:30.00
## Median :31.00
## Mean :33.29
## 3rd Qu.:35.00
## Max. :58.00
## NA's :45
## step one % increase for bachelors with credential
## Min. :0.000
## 1st Qu.:0.000
## Median :1.370
## Mean :2.421
## 3rd Qu.:5.000
## Max. :8.000
## NA's :8
## salary incentive for having a masters degree minimum salary for instructors
## Min. : 0 Min. :32825
## 1st Qu.: 0 1st Qu.:44946
## Median :1311 Median :51655
## Mean :1655 Mean :50974
## 3rd Qu.:2383 3rd Qu.:54626
## Max. :9321 Max. :69961
## NA's :8 NA's :29
## maximum salary for instructors
## Min. : 61318
## 1st Qu.: 99492
## Median :111028
## Mean :110949
## 3rd Qu.:121411
## Max. :173409
## NA's :31
# Pairsplot of all numeric columns
pairs(union[,-c(1,2,3,4,6,7,12,17,18)]) + title("Fig 1: Pairs plot of all variables")
## integer(0)
From fig 1, we plot a pairsplot of all numeric variables against all
numeric variables. This would give us a clearer view of potential
relationship among variables. The variables that show potential
relationship and would have notable characteristics for further
exploration includes
yearly % increase in teacher salary across all groups,
step one % increase for bachelors with credential,
salary incentive for having a masters degree,
minimum salary for instructors, and
maximum salary for instructors.
# Define function
data_summary <- function(x) {
m <- mean(x)
ymin <- m-sd(x)
ymax <- m+sd(x)
return(c(y=m,ymin=ymin,ymax=ymax))
}
# Distribution (Not dropping NAs)
union%>%
ggplot(aes(x=`calendar year`,
y=`yearly % increase in teacher salary across all groups`,
fill=`calendar year`)) +
geom_violin(color = "black", alpha = 0.4) +
labs(title = "Fig 2: Yearly % Change in Teacher Salary By Year",
x=" Calendar Year",
y = "Pct Change (%)") +
stat_summary(fun.data=data_summary,
color="darkred", alpha=0.7)+
theme_bw()
## Warning: Removed 2 rows containing non-finite values (`stat_ydensity()`).
## Warning: Removed 2 rows containing non-finite values (`stat_summary()`).
# Distribution (Dropping NAs)
na.omit(union)%>%
ggplot(aes(x=`calendar year`,
y=`yearly % increase in teacher salary across all groups`,
fill=`calendar year`)) +
geom_violin(color = "black", alpha = 0.4) +
labs(title = "Fig 3: Yearly % Change in Teacher Salary By Year (NA removed)",
x=" Calendar Year",
y = "Pct Change (%)") +
stat_summary(fun.data=data_summary,
color="darkred", alpha=0.7)+
theme_bw()
## Warning: Groups with fewer than two data points have been dropped.
## Warning: Removed 1 rows containing missing values (`geom_segment()`).
In both fig 1 and 2, the distribution of
Yearly % increase in teacher salary across all groups is
plotted against calendar year as a violin plot. The red bar
on top of each violin plot represents the range of values, whereas the
middle red dot represents the mean of the distribution. The distribution
of fig 1 seemingly is wider than fig 2, where NAs are removed. Looking
closer into fig 1, year 2021, 2022, and 2023 have shown the highest
variation in their yearly percentage change of teachers’ salary. Fig 2
does not show any apparent patterns. Overall, this variable presents
interesting characteristics for further exploration.
# Distribution (Not dropping NAs)
union%>%
ggplot(aes(x=`calendar year`,
y=`minimum # days paid leave per month`,
fill=`calendar year`)) +
geom_boxplot(color = "black", alpha = 0.4) +
ylim(0.829, 0.833) +
labs(title = "Fig 4: Minimum # Days Paid Leave Per Month",
x=" Calendar Year",
y = "# Days") +
stat_summary(fun.data=data_summary,
color="darkred", alpha=0.7)+
theme_bw()
## Warning: Removed 4 rows containing non-finite values (`stat_boxplot()`).
## Warning: Removed 4 rows containing non-finite values (`stat_summary()`).
# Exploring values
sum(na.omit(union$`minimum # days paid leave per month`!=0.83))
## [1] 0
From fig 3, Minimum # Days Paid Leave Per Month is
plotted against calendar year as a boxplot in order to
explore the potential variability of values in this variable. However,
it looks like the distribution is not apparent and that there might be
just one value in the entire column. Therefore, I utilized a logical
argument to pass through the entire column to explore whether if there
are any other values besides 0.83. The result shows that the entire
column only has one input. Overall, this variable alone does not present
notable characteristics for further exploration.
# Distribution (Not dropping NAs)
union%>%
ggplot(aes(x=`calendar year`,
y=`maximum # days paid leave per month`,
fill=`calendar year`)) +
geom_boxplot(color = "black", alpha = 0.4) +
ylim(0, 10) +
labs(title = "Fig 5: Maximum # Days Paid Leave Per Month",
x=" Calendar Year",
y = "# Days") +
stat_summary(fun.data=data_summary,
color="darkred", alpha=0.7)+
theme_bw()
## Warning: Removed 80 rows containing non-finite values (`stat_boxplot()`).
## Warning: Removed 80 rows containing non-finite values (`stat_summary()`).
## Warning: Removed 1 rows containing missing values (`geom_segment()`).
From fig 4,
Maximum # Days Paid Leave Per Month is plotted
against calendar year as a boxplot. Fig 4 shows that there
is no data avaliable in 2018 and 2019, with just one single value input
for both 2023 and 2024. Observable variability is present in 2020, 2021,
and 2022. The most notable variability occurs in 2020 where the range is
shown the widest, with a relatively lower mean compared to 2021 and
2022. Overall, this variable alone does not present notable
characteristics for further exploration.
# Distribution (Not dropping NAs)
union%>%
ggplot(aes(x=`calendar year`,
y=`maximum allowable class size (not including PE/Dance/extra curriculars) averaged across grades`,
fill=`calendar year`)) +
geom_boxplot(color = "black", alpha = 0.4) +
ylim(25, 40) +
labs(title = "Fig 6: Maximum # Days Paid Leave Per Month",
x=" Calendar Year",
y = "# Days") +
stat_summary(fun.data=data_summary,
color="darkred", alpha=0.7)+
theme_bw()
## Warning: Removed 48 rows containing non-finite values (`stat_boxplot()`).
## Warning: Removed 48 rows containing non-finite values (`stat_summary()`).
## Warning: Removed 1 rows containing missing values (`geom_segment()`).
Fig 6 shows a boxplot of
maximum allowable class size (not including PE/Dance/extra curriculars) averaged across grades
against calendar year. The maximum allowable class size
varies between around 27 and 35. The range increases with time from 2018
to 2023, and decreases in 2024. The median is approximately the same
over the years. It might be interesting to look into why the range
increases drastically from 2019 to 2020 and whether that contributes to
subsequent increase of
maximum allowable class size (not including PE/Dance/extra curriculars) averaged across grades
through 2023, hence, looking into potential policy change that
contributes to the decrease in 2024.
# Distribution (Not dropping NAs)
union%>%
ggplot(aes(x=`Hard max?`)) +
geom_bar(fill = c("lightpink", "lightblue1", "grey"), color = "black", alpha = 0.7) +
geom_text(stat='count', aes(label=after_stat(count), vjust=1.5)) +
ylim(0, 80) +
labs(title = "Fig 7: Hard Max?",
x="Presence of hard max") +
theme_bw()
Fig 7 shows the distribution of
hard max? against
calendar year. It shows that the mode for this variable is
“No”, where the count is greater than the count for the rest of the
variables for more than twice.
# Distribution (Not dropping NAs)
union%>%
ggplot(aes(x=`calendar year`,
y=`step one % increase for bachelors with credential`,
fill=`calendar year`)) +
geom_boxplot(color = "black", alpha = 0.4) +
ylim(0, 10) +
labs(title = "Fig 8: Step One % Increase For Bachelors With Credential",
x=" Calendar Year",
y = "Percentage (%)") +
stat_summary(fun.data=data_summary,
color="darkred", alpha=0.7)+
theme_bw()
## Warning: Removed 8 rows containing non-finite values (`stat_boxplot()`).
## Warning: Removed 8 rows containing non-finite values (`stat_summary()`).
## Warning: Removed 5 rows containing missing values (`geom_segment()`).
From fig 8,
step one % increase for bachelors with credential is being
plotted against calendar year as boxplots. The bars length
and dots represent the range of the values and the median respectively.
ALl step one % increase for bachelors with credential start
with 0 and have similiar IQR. However, the mean represented by the red
dot and the black line representing the median vary across years and
that might be notable patterns to look into.
# Distribution (Not dropping NAs)
union%>%
ggplot(aes(x=`calendar year`,
y=`salary incentive for having a masters degree`,
fill=`calendar year`)) +
geom_boxplot(color = "black", alpha = 0.4) +
ylim(500, 3500) +
labs(title = "Fig 9: Salary Incentive For Having A Masters Degree",
x=" Calendar Year",
y = "Dollars ($)") +
stat_summary(fun.data=data_summary,
color="darkred", alpha=0.7)+
theme_bw()
## Warning: Removed 50 rows containing non-finite values (`stat_boxplot()`).
## Warning: Removed 50 rows containing non-finite values (`stat_summary()`).
## Warning: Removed 1 rows containing missing values (`geom_segment()`).
In fig 9,
salary incentive for having a masters degree is
plotted against calendar year as boxplots. it shows that
the salary incentive for having a master degree does not vary a lot
since 2022, and that the range of the incentive has increased
drastically from 2020 to 2023, where we see a sharp decrease in
2024.
# Distribution (Not dropping NAs)
union%>%
ggplot(aes(x=`calendar year`,
y=`minimum salary for instructors`,
fill=`calendar year`)) +
geom_boxplot(color = "black", alpha = 0.4) +
ylim(30000, 65000) +
labs(title = "Fig 10: Minimum Salary for Instructors`",
x=" Calendar Year",
y = "Dollars ($)") +
stat_summary(fun.data=data_summary,
color="darkred", alpha=0.7)+
theme_bw()
## Warning: Removed 32 rows containing non-finite values (`stat_boxplot()`).
## Warning: Removed 32 rows containing non-finite values (`stat_summary()`).
In fig 10,
minimum salary for instructors is plotted
against calendar year as boxplots. Year 2020, 2021, and
2022 have noted the greatest range for values, where the median and IQR
for these years also show similar patterns. Overall, the median across
years does not show great variation and is around $51000.
# Distribution (Not dropping NAs)
union%>%
ggplot(aes(x=`calendar year`,
y=`maximum salary for instructors`,
fill=`calendar year`)) +
geom_violin(color = "black", alpha = 0.4) +
ylim(60000, 140000) +
labs(title = "Fig 11: Minimum Salary for Instructors",
x=" Calendar Year",
y = "Dollars ($)") +
stat_summary(fun.data=data_summary,
color="darkred", alpha=0.7)+
theme_bw()
## Warning: Removed 37 rows containing non-finite values (`stat_ydensity()`).
## Warning: Removed 37 rows containing non-finite values (`stat_summary()`).
## Warning: Removed 2 rows containing missing values (`geom_segment()`).
From fig 11,
maximum salary for instructors is plotted
against calendar year as violin plots, where the
distribution of the variables across calendar year is shown. It shows
that the distribution of maximum salary for instructors
across 2018 to 2020 has shown the greatest variation, and has shown a
continuous increase in the median
maximum salary for instructors over these years.
Subsequently during 2021 through 2024, the distribution has kept
approximately the same, with an increase in median to around $110000. It
might be interesting to look into the cause of this increase from 2020
to 2021 and why it does not see a continuous increase since 2021.
# Distribution (Not dropping NAs)
union%>%
ggplot(aes(x=`paid leave increase through seniority?`)) +
geom_bar(fill = c("lightpink", "lightblue1", "grey"), color = "black", alpha = 0.7) +
geom_text(stat='count', aes(label=after_stat(count), vjust=1.5)) +
ylim(0, 100) +
labs(title = "Fig 12: Paid leave increase through seniority?",
x="Presence of hard max") +
theme_bw()
In fig 12, a bar chart is plotted to show the distribution of the
answers. The mode of fig 12 is “No”, showing that the number of teachers
who do not see an increase in paid leave through seniority is
significantly higher than those who see an increase.
From the EDA above, variables that might have an interesting pattern
to look into include
yearly % increase in teacher salary across all groups,
step one % increase for bachelors with credential,
salary incentive for having a masters degree,
minimum salary for instructors, and
maximum salary for instructors. EDA results also align with
the pairsplot where it has also shown notable variations in the
distribution of variables above. Additonal variables that might be
interesting to look into include
maximum allowable class size (not including PE/Dance/extra curriculars) averaged across grades.
However, it is important to note that the EDA is briefly conducted on
the basis of each calendar year, which implies that there might be other
notable patterns across school districts that might have been missed. It
is recommended to explore patterns within and across school district
with additional study into this dataset.