EDA: Exploration of San Diego’s Union Strength

Loading libraries

# Loading libraries
suppressWarnings({
library(tidyverse)
library(sf)
library(leaflet)
library(mapview)
library(tigris)
library(tidycensus)
library(dplyr)
library(readxl)
library(ggpubr)
library(ggthemes)
library(ggplot2)
library(broom)
library(patchwork)
library(MASS)
})

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## Linking to GEOS 3.11.0, GDAL 3.5.3, PROJ 9.1.0; sf_use_s2() is TRUE
## 
## The legacy packages maptools, rgdal, and rgeos, underpinning the sp package,
## which was just loaded, will retire in October 2023.
## Please refer to R-spatial evolution reports for details, especially
## https://r-spatial.org/r/2023/05/15/evolution4.html.
## It may be desirable to make the sf package available;
## package maintainers should consider adding sf to Suggests:.
## The sp package is now running under evolution status 2
##      (status 2 uses the sf package in place of rgdal)
## 
## To enable caching of data, set `options(tigris_use_cache = TRUE)`
## in your R script or .Rprofile.
## 
## 
## Attaching package: 'MASS'
## 
## 
## The following object is masked from 'package:patchwork':
## 
##     area
## 
## 
## The following object is masked from 'package:dplyr':
## 
##     select

Overview

The dataset Union Strength Variables from CBAs.xlsx is scraped, collected, cleaned by San Deigo County Taxpayers Association Policy Analyst Intern Jack Andolina. The data is further cleaned by matching and merging district code in accordance to the official district code published by California Department of Education. The data is then arranged by calendar year.

The following code explores variables of the table and special patterns to distribution across years and school district.

suppressWarnings({
  
#Loading Data
data <- "/Users/natania.w/Desktop/Cal things/Career Related/Internship/SDCTA/Union Strength Variables from CBAs.xlsx"
union<- read_excel(data, sheet=3)

## Data Cleaning
union$`calendar year`<-as.factor(union$`calendar year`)
union[,-c(2,3,4,5,12,17,18)]=lapply(union[,-c(2,3,4,5,12,17,18)], as.numeric)
head(union, 5)

})

## # A tibble: 5 × 18
##   `district code` `school district`              `union name` `Union represents`
##             <dbl> <chr>                          <chr>        <chr>             
## 1         3768080 Encinitas Union School Distri… Teachers of… All               
## 2         3768080 Encinitas Union School Distri… Teachers of… All               
## 3         3768080 Encinitas Union School Distri… Teachers of… All               
## 4         3773791 San Marcos Unified School Dis… San Marcos … All               
## 5         3773791 San Marcos Unified School Dis… San Marcos … All               
## # ℹ 14 more variables: `calendar year` <fct>, `contract start date` <dbl>,
## #   `contract end date` <dbl>,
## #   `yearly % increase in teacher salary across all groups` <dbl>,
## #   `minimum # days paid leave per month` <dbl>,
## #   `maximum # days paid leave per month` <dbl>,
## #   `maximum allowable class size (not including PE/Dance/extra curriculars) averaged across grades` <dbl>,
## #   `Hard max?` <chr>, …

Exploring variables

# all variables of dataset
str(union)

## tibble [104 × 18] (S3: tbl_df/tbl/data.frame)
##  $ district code                                                                                 : num [1:104] 3768080 3768080 3768080 3773791 3773791 ...
##  $ school district                                                                               : chr [1:104] "Encinitas Union School District" "Encinitas Union School District" "Encinitas Union School District" "San Marcos Unified School District" ...
##  $ union name                                                                                    : chr [1:104] "Teachers of Encinitas" "Teachers of Encinitas" "Teachers of Encinitas" "San Marcos Educators Association" ...
##  $ Union represents                                                                              : chr [1:104] "All" "All" "All" "All" ...
##  $ calendar year                                                                                 : Factor w/ 7 levels "2018","2019",..: 5 6 7 5 6 5 5 6 5 5 ...
##  $ contract start date                                                                           : num [1:104] 44743 44743 44743 44743 44743 ...
##  $ contract end date                                                                             : num [1:104] 45838 45838 45838 45473 45473 ...
##  $ yearly % increase in teacher salary across all groups                                         : num [1:104] 3.5 3.5 3.5 4.5 4.5 4 NA NA 7.5 0 ...
##  $ minimum # days paid leave per month                                                           : num [1:104] 0.83 0.83 0.83 0.83 0.83 0.83 0.83 0.83 0.83 0.83 ...
##  $ maximum # days paid leave per month                                                           : num [1:104] 8.33 8.33 8.33 8.33 8.33 ...
##  $ maximum allowable class size (not including PE/Dance/extra curriculars) averaged across grades: num [1:104] 31 31 31 34 34 37 35 35 30 32 ...
##  $ Hard max?                                                                                     : chr [1:104] "No" "No" "No" "Yes" ...
##  $ step one % increase for bachelors with credential                                             : num [1:104] 8 8 8 0 0 6 0.92 0.92 6.4 0 ...
##  $ salary incentive for having a masters degree                                                  : num [1:104] 1000 1000 1000 0 0 ...
##  $ minimum salary for instructors                                                                : num [1:104] 49124 49124 49124 NA NA ...
##  $ maximum salary for instructors                                                                : num [1:104] 129687 129687 129687 NA NA ...
##  $ paid leave increase through seniority?                                                        : chr [1:104] "Yes" "Yes" "Yes" "Yes" ...
##  $ Longevity Perks?                                                                              : chr [1:104] "Yes" "Yes" "Yes" "Yes" ...

# Summary Statistics of only numeric columns
summary(union[,-c(1,2,3,4,6,7,12,17,18)])

##  calendar year yearly % increase in teacher salary across all groups
##  2018: 4       Min.   :0.000                                        
##  2019: 6       1st Qu.:0.000                                        
##  2020:16       Median :3.000                                        
##  2021:25       Mean   :2.654                                        
##  2022:33       3rd Qu.:3.875                                        
##  2023:16       Max.   :7.500                                        
##  2024: 4       NA's   :2                                            
##  minimum # days paid leave per month maximum # days paid leave per month
##  Min.   :0.83                        Min.   :2.000                      
##  1st Qu.:0.83                        1st Qu.:8.330                      
##  Median :0.83                        Median :8.330                      
##  Mean   :0.83                        Mean   :7.539                      
##  3rd Qu.:0.83                        3rd Qu.:8.330                      
##  Max.   :0.83                        Max.   :8.333                      
##  NA's   :4                           NA's   :80                         
##  maximum allowable class size (not including PE/Dance/extra curriculars) averaged across grades
##  Min.   :27.00                                                                                 
##  1st Qu.:30.00                                                                                 
##  Median :31.00                                                                                 
##  Mean   :33.29                                                                                 
##  3rd Qu.:35.00                                                                                 
##  Max.   :58.00                                                                                 
##  NA's   :45                                                                                    
##  step one % increase for bachelors with credential
##  Min.   :0.000                                    
##  1st Qu.:0.000                                    
##  Median :1.370                                    
##  Mean   :2.421                                    
##  3rd Qu.:5.000                                    
##  Max.   :8.000                                    
##  NA's   :8                                        
##  salary incentive for having a masters degree minimum salary for instructors
##  Min.   :   0                                 Min.   :32825                 
##  1st Qu.:   0                                 1st Qu.:44946                 
##  Median :1311                                 Median :51655                 
##  Mean   :1655                                 Mean   :50974                 
##  3rd Qu.:2383                                 3rd Qu.:54626                 
##  Max.   :9321                                 Max.   :69961                 
##  NA's   :8                                    NA's   :29                    
##  maximum salary for instructors
##  Min.   : 61318                
##  1st Qu.: 99492                
##  Median :111028                
##  Mean   :110949                
##  3rd Qu.:121411                
##  Max.   :173409                
##  NA's   :31

# Pairsplot of all numeric columns 
pairs(union[,-c(1,2,3,4,6,7,12,17,18)]) + title("Fig 1: Pairs plot of all variables")

## integer(0)

From fig 1, we plot a pairsplot of all numeric variables against all numeric variables. This would give us a clearer view of potential relationship among variables. The variables that show potential relationship and would have notable characteristics for further exploration includes yearly % increase in teacher salary across all groups, step one % increase for bachelors with credential, salary incentive for having a masters degree, minimum salary for instructors, and maximum salary for instructors.

Exploratory Data Analysis (EDA)

Yearly % increase in teacher salary across all groups

# Define function
data_summary <- function(x) {
   m <- mean(x)
   ymin <- m-sd(x)
   ymax <- m+sd(x)
   return(c(y=m,ymin=ymin,ymax=ymax))
}

# Distribution (Not dropping NAs)
union%>%
  ggplot(aes(x=`calendar year`,
             y=`yearly % increase in teacher salary across all groups`,
             fill=`calendar year`)) + 
  geom_violin(color = "black", alpha = 0.4) +
  labs(title = "Fig 2: Yearly % Change in Teacher Salary By Year",
       x=" Calendar Year",
       y = "Pct Change (%)") +
  stat_summary(fun.data=data_summary,
               color="darkred", alpha=0.7)+
  theme_bw()

## Warning: Removed 2 rows containing non-finite values (`stat_ydensity()`).

## Warning: Removed 2 rows containing non-finite values (`stat_summary()`).

# Distribution (Dropping NAs)
na.omit(union)%>%
  ggplot(aes(x=`calendar year`,
             y=`yearly % increase in teacher salary across all groups`,
             fill=`calendar year`)) + 
  geom_violin(color = "black", alpha = 0.4) +
  labs(title = "Fig 3: Yearly % Change in Teacher Salary By Year (NA removed)",
       x=" Calendar Year",
       y = "Pct Change (%)") +
  stat_summary(fun.data=data_summary,
               color="darkred", alpha=0.7)+
  theme_bw()

## Warning: Groups with fewer than two data points have been dropped.

## Warning: Removed 1 rows containing missing values (`geom_segment()`).

In both fig 1 and 2, the distribution of Yearly % increase in teacher salary across all groups is plotted against calendar year as a violin plot. The red bar on top of each violin plot represents the range of values, whereas the middle red dot represents the mean of the distribution. The distribution of fig 1 seemingly is wider than fig 2, where NAs are removed. Looking closer into fig 1, year 2021, 2022, and 2023 have shown the highest variation in their yearly percentage change of teachers’ salary. Fig 2 does not show any apparent patterns. Overall, this variable presents interesting characteristics for further exploration.

Minimum # Days Paid Leave Per Month

# Distribution (Not dropping NAs)
union%>%
  ggplot(aes(x=`calendar year`,
             y=`minimum # days paid leave per month`,
             fill=`calendar year`)) + 
  geom_boxplot(color = "black", alpha = 0.4) +
  ylim(0.829, 0.833) +
  labs(title = "Fig 4: Minimum # Days Paid Leave Per Month",
       x=" Calendar Year",
       y = "# Days") +
  stat_summary(fun.data=data_summary,
               color="darkred", alpha=0.7)+
  theme_bw()

## Warning: Removed 4 rows containing non-finite values (`stat_boxplot()`).

## Warning: Removed 4 rows containing non-finite values (`stat_summary()`).

# Exploring values
sum(na.omit(union$`minimum # days paid leave per month`!=0.83))

## [1] 0

From fig 3, Minimum # Days Paid Leave Per Month is plotted against calendar year as a boxplot in order to explore the potential variability of values in this variable. However, it looks like the distribution is not apparent and that there might be just one value in the entire column. Therefore, I utilized a logical argument to pass through the entire column to explore whether if there are any other values besides 0.83. The result shows that the entire column only has one input. Overall, this variable alone does not present notable characteristics for further exploration.

Maximum # Days Paid Leave Per Month`

# Distribution (Not dropping NAs)
union%>%
  ggplot(aes(x=`calendar year`,
             y=`maximum # days paid leave per month`,
             fill=`calendar year`)) + 
  geom_boxplot(color = "black", alpha = 0.4) +
  ylim(0, 10) +
  labs(title = "Fig 5: Maximum # Days Paid Leave Per Month",
       x=" Calendar Year",
       y = "# Days") +
  stat_summary(fun.data=data_summary,
               color="darkred", alpha=0.7)+
  theme_bw()

## Warning: Removed 80 rows containing non-finite values (`stat_boxplot()`).

## Warning: Removed 80 rows containing non-finite values (`stat_summary()`).

## Warning: Removed 1 rows containing missing values (`geom_segment()`).

From fig 4, Maximum # Days Paid Leave Per Month is plotted against calendar year as a boxplot. Fig 4 shows that there is no data avaliable in 2018 and 2019, with just one single value input for both 2023 and 2024. Observable variability is present in 2020, 2021, and 2022. The most notable variability occurs in 2020 where the range is shown the widest, with a relatively lower mean compared to 2021 and 2022. Overall, this variable alone does not present notable characteristics for further exploration.

Maximum Allowable Class Size (not including PE/Dance/extra curriculars) Averaged Across Grades

# Distribution (Not dropping NAs)
union%>%
  ggplot(aes(x=`calendar year`,
             y=`maximum allowable class size (not including PE/Dance/extra curriculars) averaged across grades`,
             fill=`calendar year`)) + 
  geom_boxplot(color = "black", alpha = 0.4) +
  ylim(25, 40) +
  labs(title = "Fig 6: Maximum # Days Paid Leave Per Month",
       x=" Calendar Year",
       y = "# Days") +
  stat_summary(fun.data=data_summary,
               color="darkred", alpha=0.7)+
  theme_bw()

## Warning: Removed 48 rows containing non-finite values (`stat_boxplot()`).

## Warning: Removed 48 rows containing non-finite values (`stat_summary()`).

## Warning: Removed 1 rows containing missing values (`geom_segment()`).

Fig 6 shows a boxplot of maximum allowable class size (not including PE/Dance/extra curriculars) averaged across grades against calendar year. The maximum allowable class size varies between around 27 and 35. The range increases with time from 2018 to 2023, and decreases in 2024. The median is approximately the same over the years. It might be interesting to look into why the range increases drastically from 2019 to 2020 and whether that contributes to subsequent increase of maximum allowable class size (not including PE/Dance/extra curriculars) averaged across grades through 2023, hence, looking into potential policy change that contributes to the decrease in 2024.

Hard max?

# Distribution (Not dropping NAs)
union%>%
  ggplot(aes(x=`Hard max?`)) + 
  geom_bar(fill = c("lightpink", "lightblue1", "grey"), color = "black", alpha = 0.7) +
  geom_text(stat='count', aes(label=after_stat(count), vjust=1.5)) +
  ylim(0, 80) +
  labs(title = "Fig 7: Hard Max?",
       x="Presence of hard max") +
  theme_bw()

Fig 7 shows the distribution of hard max? against calendar year. It shows that the mode for this variable is “No”, where the count is greater than the count for the rest of the variables for more than twice.

Step one % increase for bachelors with credential

# Distribution (Not dropping NAs)
union%>%
  ggplot(aes(x=`calendar year`,
             y=`step one % increase for bachelors with credential`,
             fill=`calendar year`)) + 
  geom_boxplot(color = "black", alpha = 0.4) +
  ylim(0, 10) +
  labs(title = "Fig 8: Step One % Increase For Bachelors With Credential",
       x=" Calendar Year",
       y = "Percentage (%)") +
  stat_summary(fun.data=data_summary,
               color="darkred", alpha=0.7)+
  theme_bw()

## Warning: Removed 8 rows containing non-finite values (`stat_boxplot()`).

## Warning: Removed 8 rows containing non-finite values (`stat_summary()`).

## Warning: Removed 5 rows containing missing values (`geom_segment()`).

From fig 8, step one % increase for bachelors with credential is being plotted against calendar year as boxplots. The bars length and dots represent the range of the values and the median respectively. ALl step one % increase for bachelors with credential start with 0 and have similiar IQR. However, the mean represented by the red dot and the black line representing the median vary across years and that might be notable patterns to look into.

Salary Incentive For Having a Masters Degree

# Distribution (Not dropping NAs)
union%>%
  ggplot(aes(x=`calendar year`,
             y=`salary incentive for having a masters degree`,
             fill=`calendar year`)) + 
  geom_boxplot(color = "black", alpha = 0.4) +
  ylim(500, 3500) +
  labs(title = "Fig 9: Salary Incentive For Having A Masters Degree",
       x=" Calendar Year",
       y = "Dollars ($)") +
  stat_summary(fun.data=data_summary,
               color="darkred", alpha=0.7)+
  theme_bw()

## Warning: Removed 50 rows containing non-finite values (`stat_boxplot()`).

## Warning: Removed 50 rows containing non-finite values (`stat_summary()`).

## Warning: Removed 1 rows containing missing values (`geom_segment()`).

In fig 9, salary incentive for having a masters degree is plotted against calendar year as boxplots. it shows that the salary incentive for having a master degree does not vary a lot since 2022, and that the range of the incentive has increased drastically from 2020 to 2023, where we see a sharp decrease in 2024.

Minimum Salary For Instructors

# Distribution (Not dropping NAs)
union%>%
  ggplot(aes(x=`calendar year`,
             y=`minimum salary for instructors`,
             fill=`calendar year`)) + 
  geom_boxplot(color = "black", alpha = 0.4) +
  ylim(30000, 65000) +
  labs(title = "Fig 10: Minimum Salary for Instructors`",
       x=" Calendar Year",
       y = "Dollars ($)") +
  stat_summary(fun.data=data_summary,
               color="darkred", alpha=0.7)+
  theme_bw()

## Warning: Removed 32 rows containing non-finite values (`stat_boxplot()`).

## Warning: Removed 32 rows containing non-finite values (`stat_summary()`).

In fig 10, minimum salary for instructors is plotted against calendar year as boxplots. Year 2020, 2021, and 2022 have noted the greatest range for values, where the median and IQR for these years also show similar patterns. Overall, the median across years does not show great variation and is around $51000.

Maximum salary for instructors

# Distribution (Not dropping NAs)
union%>%
  ggplot(aes(x=`calendar year`,
             y=`maximum salary for instructors`,
             fill=`calendar year`)) + 
  geom_violin(color = "black", alpha = 0.4) +
  ylim(60000, 140000) +
  labs(title = "Fig 11: Minimum Salary for Instructors",
       x=" Calendar Year",
       y = "Dollars ($)") +
  stat_summary(fun.data=data_summary,
               color="darkred", alpha=0.7)+
  theme_bw()

## Warning: Removed 37 rows containing non-finite values (`stat_ydensity()`).

## Warning: Removed 37 rows containing non-finite values (`stat_summary()`).

## Warning: Removed 2 rows containing missing values (`geom_segment()`).

From fig 11, maximum salary for instructors is plotted against calendar year as violin plots, where the distribution of the variables across calendar year is shown. It shows that the distribution of maximum salary for instructors across 2018 to 2020 has shown the greatest variation, and has shown a continuous increase in the median maximum salary for instructors over these years. Subsequently during 2021 through 2024, the distribution has kept approximately the same, with an increase in median to around $110000. It might be interesting to look into the cause of this increase from 2020 to 2021 and why it does not see a continuous increase since 2021.

Paid leave increase through seniority?

# Distribution (Not dropping NAs)
union%>%
  ggplot(aes(x=`paid leave increase through seniority?`)) + 
  geom_bar(fill = c("lightpink", "lightblue1", "grey"), color = "black", alpha = 0.7) +
  geom_text(stat='count', aes(label=after_stat(count), vjust=1.5)) +
  ylim(0, 100) +
  labs(title = "Fig 12: Paid leave increase through seniority?",
       x="Presence of hard max") +
  theme_bw()

In fig 12, a bar chart is plotted to show the distribution of the answers. The mode of fig 12 is “No”, showing that the number of teachers who do not see an increase in paid leave through seniority is significantly higher than those who see an increase.

Conclusion

From the EDA above, variables that might have an interesting pattern to look into include yearly % increase in teacher salary across all groups, step one % increase for bachelors with credential, salary incentive for having a masters degree, minimum salary for instructors, and maximum salary for instructors. EDA results also align with the pairsplot where it has also shown notable variations in the distribution of variables above. Additonal variables that might be interesting to look into include maximum allowable class size (not including PE/Dance/extra curriculars) averaged across grades. However, it is important to note that the EDA is briefly conducted on the basis of each calendar year, which implies that there might be other notable patterns across school districts that might have been missed. It is recommended to explore patterns within and across school district with additional study into this dataset.