Your task is to:
Create a .CSV file (or optionally, a MySQL database!) that includes all of the information included in the dataset. You’re encouraged to use a “wide” structure similar to how the information appears in the discussion item, so that you can practice tidying and transformations as described below.
Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data. [Most of your grade will be based on this step!]
Perform the analysis requested in the discussion item.
Your code should be in an R Markdown file, posted to rpubs.com, and should include narrative descriptions of your data cleanup work, analysis, and conclusions.
The data is taken from here: http://www.pro-football-reference.com/boxscores/201609250sea.htm.
I created a .CSV file by copying the data from the above webpage into an Excel file and saving the file as a .CSV. This is itself a challenge because Excel changes some of the values into dates automatically. The solution is to select the area in which the paste will occur and format the cells to ‘Text’. Then paste the data using the “Match Destination Formatting” option. Then the data won’t change into date values.
Now I load the data and display it below.
data<-read.csv("C:/Users/Andy/Desktop/Personal/Learning/CUNY/DATA607/Project2/football_data.csv")
data## X SFO SEA
## 1 First Downs 12 18
## 2 Rush-Yds-TDs 31-135-2 31-127-2
## 3 Cmp-Att-Yd-TD-INT 14-25-119-0-1 22-32-308-2-1
## 4 Sacked-Yards 0-0 2-17
## 5 Net Pass Yards 119 291
## 6 Total Yards 254 418
## 7 Fumbles-Lost 1-0 1-1
## 8 Turnovers 1 2
## 9 Penalties-Yards 4-35 6-50
## 10 Third Down Conv. 4-15 9-14
## 11 Fourth Down Conv. 1-1 0-0
## 12 Time of Possession 24:03 35:57
## 13
A lot of cleanup and transformation will have to occur. I first load the libraries.
library(stringr)
library(tidyr)
library(dplyr)
library(DT)Now for the tidying.
#gather, select, and spread the data
data_tidy<-data %>%
gather(Team,Value,2:3) %>%
select(Team,X,Value) %>%
spread(X,Value)
data_tidy## Team Cmp-Att-Yd-TD-INT First Downs Fourth Down Conv. Fumbles-Lost
## 1 SEA 22-32-308-2-1 18 0-0 1-1
## 2 SFO 14-25-119-0-1 12 1-1 1-0
## Net Pass Yards Penalties-Yards Rush-Yds-TDs Sacked-Yards
## 1 291 6-50 31-127-2 2-17
## 2 119 4-35 31-135-2 0-0
## Third Down Conv. Time of Possession Total Yards Turnovers
## 1 9-14 35:57 418 2
## 2 4-15 24:03 254 1
Several of the columns have multiple values in them that are separated by a “-”. We need to break these up into individual columns. We can use the separate function to do so.
#remove empty column
data_tidy[,2]<-NULL
#separate the columns with multiple values
data_separate<-data_tidy %>%
separate("Cmp-Att-Yd-TD-INT",c("PassCompletions","PassAttempts","PassYards","PassTD","PassINT"), sep = "-", remove = TRUE) %>%
separate("Fourth Down Conv.",c("FourthDownAtt","FourthDownConv"), sep = "-", remove = TRUE) %>%
separate("Fumbles-Lost",c("FumblesMade","FumblesLost"), sep = "-", remove = TRUE) %>%
separate("Penalties-Yards",c("PenaltyCount","PenaltyYards"), sep = "-", remove = TRUE) %>%
separate("Rush-Yds-TDs",c("RushingAttempts","RushYards","RushTD"), sep = "-", remove = TRUE) %>%
separate("Sacked-Yards",c("SackedCounts","SackedYards"), sep = "-", remove = TRUE) %>%
separate("Third Down Conv.",c("ThirdDownAtt","ThirdDownConv"), sep = "-", remove = TRUE)
#convert time of possession to minutes only
for (i in 1:length(data_separate$`Time of Possession`)) {
temp<-as.numeric(unlist(str_split(data_separate$"Time of Possession"[i],":")))
data_separate$"Time of Possession"[i]<-temp[1]+temp[2]/60
}
data_separate## Team PassCompletions PassAttempts PassYards PassTD PassINT First Downs
## 1 SEA 22 32 308 2 1 18
## 2 SFO 14 25 119 0 1 12
## FourthDownAtt FourthDownConv FumblesMade FumblesLost Net Pass Yards
## 1 0 0 1 1 291
## 2 1 1 1 0 119
## PenaltyCount PenaltyYards RushingAttempts RushYards RushTD SackedCounts
## 1 6 50 31 127 2 2
## 2 4 35 31 135 2 0
## SackedYards ThirdDownAtt ThirdDownConv Time of Possession Total Yards
## 1 17 9 14 35.95 418
## 2 0 4 15 24.05 254
## Turnovers
## 1 2
## 2 1
Now that we have the data tidied and transformed, we can perform the analysis: Compare the yards per touchdown (both rushing and passing) for both teams. What does this mean?
First, let’s compare the passing yards to the passing touchdowns for both teams. Since these variables are still characters due to the splitting, we need to cast them as numeric before dividing the yards by the TDs. We can use the mutate function to do this calculation and add it to our data. We can then use select to look at the result for each team.
data_analysis<-data_separate %>%
mutate(PassYdsPerPassTD = as.numeric(PassYards) / as.numeric(PassTD))
select(data_analysis,Team, PassYdsPerPassTD)## Team PassYdsPerPassTD
## 1 SEA 154
## 2 SFO Inf
We see that the Seahawks averaged 154 passing yards per passing touchdown. So it took, on average, 154 yards of passing to get a touchdown. The lower this number is, the better, since it would mean that a team had to pass for fewer yards to get a touchdown as a result. The 49ers averaged Inf. The Inf value is the result of having 0 passing touchdowns, so we were dividing by 0. We can interpret this as “it would take an infinite number of passing yards to get a passing touchdown”. So on this analysis, the Seahawks did better in passing in that their passing yielded touchdowns more quickly than the 49ers.
What about rushing? We can do the same analysis with the rushing yards compared to the rushing touchdowns.
data_analysis<-data_analysis %>%
mutate(RushYdsPerRushTD = as.numeric(RushYards) / as.numeric(RushTD))
select(data_analysis,Team, RushYdsPerRushTD)## Team RushYdsPerRushTD
## 1 SEA 63.5
## 2 SFO 67.5
We see that, again, the Seahawks have a lower number than the 49ers. In other words, the Seahawks have to rush for fewer yards than the 49ers do before the rushing yields a rushing touchdown. But it is not much lower. In fact, it is almost the same.
We can combine these results to look at total yards compared to total touchdowns. We use the Total Yards column which accounts for lost yards due to sacks.
data_analysis<-data_analysis %>%
mutate(YdsPerTD = as.numeric(data_analysis$"Total Yards") / (as.numeric(RushTD) + as.numeric(PassTD)))
select(data_analysis,Team, YdsPerTD )## Team YdsPerTD
## 1 SEA 104.5
## 2 SFO 127.0
It took the Seahawks, on average, 104.5 yards of passing and rushing to get a touchdown, while it took the 49ers 127 yards on average. While the Seahawks did better in this regard, it is not a whole lot better. The trouble for the 49ers was that they had very little yardage in comparison to the Seahawks and never got a passing touchdown. So even if the 49ers had a yards per touchdown average of 90, if they only went 90 yards in total, they would only have one touchdown, which may not be enough to win the game.
In conclusion, we see that teams may be very similar in certain statistics (e.g., yards/touchdown) and very different in others (e.g., total touchdowns). It is important then to know which statistics matter in assessing whether a team will win or lose and which are not so important. Obviously, total touchdowns are important. But suppose we only knew a team’s average yards per touchdown in comparison to another team. That information alone may not be enough to accurately predict who will win the game, since we don’t know how many yards on average that team may get during a game.
The data on San Francisco public employee salaries and benefits comes from this page: http://transparentcalifornia.com/salaries/2015/san-francisco/. It has links to data from 2011-2015. I download each file as a .csv. I then import the files into R one at a time.
#import each file
temp_data_1<-read.csv("C:/Users/Andy/Desktop/Personal/Learning/CUNY/DATA607/Project2/san-francisco-2011.csv",stringsAsFactors = FALSE)
temp_data_2<-read.csv("C:/Users/Andy/Desktop/Personal/Learning/CUNY/DATA607/Project2/san-francisco-2012.csv",stringsAsFactors = FALSE)
temp_data_3<-read.csv("C:/Users/Andy/Desktop/Personal/Learning/CUNY/DATA607/Project2/san-francisco-2013.csv",stringsAsFactors = FALSE)
temp_data_4<-read.csv("C:/Users/Andy/Desktop/Personal/Learning/CUNY/DATA607/Project2/san-francisco-2014.csv",stringsAsFactors = FALSE)
temp_data_5<-read.csv("C:/Users/Andy/Desktop/Personal/Learning/CUNY/DATA607/Project2/san-francisco-2015.csv",stringsAsFactors = FALSE)
#check names
#names(temp_data_1)
#names(temp_data_2)
#names(temp_data_3)
#names(temp_data_4)
#names(temp_data_5)
#add missing status to 2011-2013
temp_data_1$Status<-NA
temp_data_2$Status<-NA
temp_data_3$Status<-NA
#adjust the names
names(temp_data_1)<-names(temp_data_5)
names(temp_data_2)<-names(temp_data_5)
names(temp_data_3)<-names(temp_data_5)
names(temp_data_4)<-names(temp_data_5)
#rbind the data into one dataframe
data<-rbind(temp_data_1,temp_data_2,temp_data_3,temp_data_4,temp_data_5)Now that I have the data imported, I can look at it. It is interesting to see that some of the pay values were negative.
#see summary output
summary(data)## Employee.Name Job.Title Base.Pay
## Length:188037 Length:188037 Min. : -292.4
## Class :character Class :character 1st Qu.: 33559.8
## Mode :character Mode :character Median : 65343.4
## Mean : 66840.5
## 3rd Qu.: 95229.1
## Max. :507831.6
## NA's :605
## Overtime.Pay Other.Pay Benefits
## Min. : -0.01 Min. : -7058.6 Length:188037
## 1st Qu.: 0.00 1st Qu.: 0.0 Class :character
## Median : 0.00 Median : 765.5 Mode :character
## Mean : 5179.35 Mean : 3539.1
## 3rd Qu.: 4765.89 3rd Qu.: 4095.8
## Max. :245131.88 Max. :400184.2
##
## Total.Pay Total.Pay...Benefits Year Notes
## Min. : -618.1 Min. : -618.1 Min. :2011 Mode:logical
## 1st Qu.: 36277.3 1st Qu.: 44901.2 1st Qu.:2012 NA's:188037
## Median : 71905.0 Median : 94166.5 Median :2013
## Mean : 75343.8 Mean : 95377.2 Mean :2013
## 3rd Qu.:106609.4 3rd Qu.:134996.1 3rd Qu.:2014
## Max. :567595.4 Max. :633723.3 Max. :2015
##
## Agency Status
## Length:188037 Length:188037
## Class :character Class :character
## Mode :character Mode :character
##
##
##
##
#see structure
str(data)## 'data.frame': 188037 obs. of 12 variables:
## $ Employee.Name : chr "NATHANIEL FORD" "GARY JIMENEZ" "ALBERT PARDINI" "CHRISTOPHER CHONG" ...
## $ Job.Title : chr "GENERAL MANAGER-METROPOLITAN TRANSIT AUTHORITY" "CAPTAIN III (POLICE DEPARTMENT)" "CAPTAIN III (POLICE DEPARTMENT)" "WIRE ROPE CABLE MAINTENANCE MECHANIC" ...
## $ Base.Pay : num 167411 155966 212739 77916 134402 ...
## $ Overtime.Pay : num 0 245132 106088 56121 9737 ...
## $ Other.Pay : num 400184 137811 16453 198307 182235 ...
## $ Benefits : chr "Not Provided" "Not Provided" "Not Provided" "Not Provided" ...
## $ Total.Pay : num 567595 538909 335280 332344 326373 ...
## $ Total.Pay...Benefits: num 567595 538909 335280 332344 326373 ...
## $ Year : int 2011 2011 2011 2011 2011 2011 2011 2011 2011 2011 ...
## $ Notes : logi NA NA NA NA NA NA ...
## $ Agency : chr "San Francisco" "San Francisco" "San Francisco" "San Francisco" ...
## $ Status : chr NA NA NA NA ...
The analysis called for:
1. Categorize the data via job title and profile.
2. Salary changes over time between different groups
3. Base pay, overtime pay, and benefits allocated between different groups.
The profile does not exist in the data download. Instead, it is already split into Year and Agency. So 1 has been done.
I move on to 2. To get salary changes over time between the different groups (i.e., job titles), I need to spread the data after grouping by job title. For salary, perhaps base pay is the closest in meaning. But I’d like to see the total benefits plus pay instead, as this will give us a better picture of the total compensation a person receives. So I group the data using the average Total.Pay...Benefits.
salary_over_time<-data %>%
select(Job.Title,Total.Pay...Benefits,Year) %>%
mutate(Job.Title = str_to_upper(Job.Title)) %>%
group_by(Job.Title,Year) %>%
summarise(Average_TotalPayBenefits = mean(Total.Pay...Benefits)) %>%
spread(Year,Average_TotalPayBenefits)Which job title has the biggest total increase in pay from 2011 to 2015? It looks like the largest increase from 2011 to 2015 was $229,702. This belongs to the Mayor, whose total benefits and pay increased from $134,205 to $363,908.
#calculate increase
salary_over_time$NetPayIncrease<-salary_over_time$`2015` - salary_over_time$`2011`
#find the max increase
max(salary_over_time$NetPayIncrease, na.rm = TRUE)## [1] 229702.4
#return the row with the max increase
subset(salary_over_time, NetPayIncrease == max(salary_over_time$NetPayIncrease, na.rm = TRUE))## Source: local data frame [1 x 7]
## Groups: Job.Title [1]
##
## Job.Title `2011` `2012` `2013` `2014` `2015` NetPayIncrease
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 MAYOR 134205.7 335272.5 362551.7 364814.5 363908.1 229702.4
What about proportional increase? The maximum proportional increase belongs to the audiometrist, whose total pay and benefits increased by 892% from $10,595 to $105,185.
#calculate proportional increase
salary_over_time$PropPayIncrease<-(salary_over_time$`2015` - salary_over_time$`2011`) / salary_over_time$`2011`
#find the max increase
max(salary_over_time$PropPayIncrease, na.rm = TRUE)## [1] 8.927674
#return the row with the max increase
subset(salary_over_time, PropPayIncrease == max(salary_over_time$PropPayIncrease, na.rm = TRUE))## Source: local data frame [1 x 8]
## Groups: Job.Title [1]
##
## Job.Title `2011` `2012` `2013` `2014` `2015` NetPayIncrease
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 AUDIOMETRIST 10595.14 39228.16 NA 51222.73 105185.1 94589.96
## # ... with 1 more variables: PropPayIncrease <dbl>
I also find the average net increase and the average proportional increase. The net increase average is $37980 and the proportional increase average is 60%. I look at the other summary statistics as well to get a sense of the distribution.
#calculate net increase
summary(salary_over_time$NetPayIncrease)## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -105300 25130 35740 37980 48820 229700 1214
#calculate proportional increase
summary(salary_over_time$PropPayIncrease)## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -0.9698 0.3693 0.5185 0.6062 0.6995 8.9280 1214
I move on to 3. Focusing on the 2015 data, I compare base pay, overtime pay, other pay, total pay, and total pay plus benefits among the job titles. I ignore benefits since the column is all NA.
#select 2015 data, capitalize job titles, group by job titles, rearrange the columns, and summarize each using the mean
data_2015<-data %>%
subset(Year == 2015) %>%
mutate(Job.Title = str_to_upper(Job.Title)) %>%
group_by(Job.Title) %>%
select(Job.Title, Base.Pay, Overtime.Pay, Other.Pay, Total.Pay, Total.Pay...Benefits) %>%
summarise_each(funs(mean))Now I can see which jobs have the highest and lowest in each category:
#base pay
head(arrange(data_2015,Base.Pay)) #lowest base pay## # A tibble: 6 × 6
## Job.Title Base.Pay Overtime.Pay Other.Pay Total.Pay
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 BATTLION CHIEF, FIRE SUPPRESSI 0.00 0 87095.500 87095.500
## 2 RECREATION DIRECTOR 0.00 0 3298.315 3298.315
## 3 TRANSIT PAINT SHOP SPRV1 0.00 0 961.180 961.180
## 4 VICTIM & WITNESS TECHNICIAN 0.00 0 3623.910 3623.910
## 5 X-RAY LABORATORY AIDE 0.00 0 4173.620 4173.620
## 6 BDCOMM MBR, GRP2,M=$25/MTG 259.62 0 0.000 259.620
## # ... with 1 more variables: Total.Pay...Benefits <dbl>
head(arrange(data_2015,desc(Base.Pay))) #highest base pay ## # A tibble: 6 × 6
## Job.Title Base.Pay Overtime.Pay Other.Pay Total.Pay
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 CHIEF INVESTMENT OFFICER 507831.6 0 0.00 507831.6
## 2 CHIEF OF POLICE 308901.4 0 19354.12 328255.6
## 3 GEN MGR, PUBLIC TRNSP DEPT 304000.4 0 0.00 304000.4
## 4 CHIEF, FIRE DEPARTMENT 303494.8 0 24279.58 327774.4
## 5 ADMINISTRATOR, DPH 299121.5 0 0.00 299121.5
## 6 MAYOR 288963.5 0 0.00 288963.5
## # ... with 1 more variables: Total.Pay...Benefits <dbl>
#Overtime.Pay
head(arrange(data_2015,Overtime.Pay)) #lowest Overtime.Pay## # A tibble: 6 × 6
## Job.Title Base.Pay Overtime.Pay Other.Pay Total.Pay
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 ACCOUNTANT I 36625.00 0 91.87500 36716.88
## 2 ACCOUNTANT II 68350.85 0 11.65117 68362.50
## 3 ACCOUNTANT II (OCII) 65710.03 0 0.00000 65710.03
## 4 ACCOUNTANT III 81890.74 0 641.21479 82531.96
## 5 ACCOUNTANT III (OCII) 79423.01 0 0.00000 79423.01
## 6 ACCOUNTANT IV 101112.39 0 477.53828 101589.93
## # ... with 1 more variables: Total.Pay...Benefits <dbl>
head(arrange(data_2015,desc(Overtime.Pay))) #highest Overtime.Pay## # A tibble: 6 × 6
## Job.Title Base.Pay Overtime.Pay Other.Pay
## <chr> <dbl> <dbl> <dbl>
## 1 MECH SHOP & EQUIP SUPT 108578.01 70332.42 10797.800
## 2 WIRE ROPE CABLE MAINT SPRV 166228.75 58417.13 9856.940
## 3 TRACK MAINT WRK SPRV 1 83177.59 56787.11 4653.800
## 4 TRANSIT POWER LINE SPRV1 94882.79 54852.07 5447.534
## 5 HEATING/VENTILATING INSPECTOR 113427.51 54705.21 4633.605
## 6 CAPT,FIRE PREV OR FIRE INVSGTN 171948.48 49997.61 11917.736
## # ... with 2 more variables: Total.Pay <dbl>, Total.Pay...Benefits <dbl>
#Other.Pay
head(arrange(data_2015,Other.Pay)) #lowest Other.Pay## # A tibble: 6 × 6
## Job.Title Base.Pay Overtime.Pay Other.Pay Total.Pay
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 ACCOUNTANT II (OCII) 65710.03 0 0 65710.03
## 2 ACCOUNTANT III (OCII) 79423.01 0 0 79423.01
## 3 ACCOUNTING SUPERVISOR (OCII) 111016.03 0 0 111016.03
## 4 ACPO,JUVP, JUV PROB (SFERS) 153727.42 0 0 153727.42
## 5 ADMINISTRATIVE HEARING SUP 108913.01 0 0 108913.01
## 6 ADMINISTRATIVE SERVICES MGR 85323.62 0 0 85323.62
## # ... with 1 more variables: Total.Pay...Benefits <dbl>
head(arrange(data_2015,desc(Other.Pay))) #highest Other.Pay## # A tibble: 6 × 6
## Job.Title Base.Pay Overtime.Pay Other.Pay Total.Pay
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 BATTLION CHIEF, FIRE SUPPRESSI 0.0 0.000 87095.50 87095.5
## 2 ADM, SFGH MEDICAL CENTER 256098.0 0.000 82292.31 338390.3
## 3 ASST MED EXAMINER 216311.8 4770.727 69023.80 290106.3
## 4 DEPUTY CHIEF 3 205583.9 0.000 53862.39 259446.3
## 5 EXECUTIVE CONTRACT EMPLOYEE 214812.0 0.000 45970.67 260782.7
## 6 NURSING SUPERVISOR PSYCHIATRIC 174195.0 0.000 43701.97 217897.0
## # ... with 1 more variables: Total.Pay...Benefits <dbl>
#Total.Pay
head(arrange(data_2015,Total.Pay)) #lowest Total.Pay## # A tibble: 6 × 6
## Job.Title Base.Pay Overtime.Pay Other.Pay Total.Pay
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 BDCOMM MBR, GRP2,M=$25/MTG 259.6200 0 0.0000 259.6200
## 2 BDCOMM MBR, GRP3 M=$50/MTG 727.4324 0 0.0000 727.4324
## 3 ASSISTANT RECREATION DIRECTOR 595.4000 0 190.3633 785.7633
## 4 TRANSIT PAINT SHOP SPRV1 0.0000 0 961.1800 961.1800
## 5 BDCOMM MBR, GRP5,M$100/MO 1224.6283 0 0.0000 1224.6283
## 6 CHIEF NURSERY SPECIALIST 1647.7800 0 122.3600 1770.1400
## # ... with 1 more variables: Total.Pay...Benefits <dbl>
head(arrange(data_2015,desc(Total.Pay))) #highest Total.Pay## # A tibble: 6 × 6
## Job.Title Base.Pay Overtime.Pay Other.Pay Total.Pay
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 CHIEF INVESTMENT OFFICER 507831.6 0 0.00 507831.6
## 2 ADM, SFGH MEDICAL CENTER 256098.0 0 82292.31 338390.3
## 3 CHIEF OF POLICE 308901.4 0 19354.12 328255.6
## 4 CHIEF, FIRE DEPARTMENT 303494.8 0 24279.58 327774.4
## 5 GEN MGR, PUBLIC TRNSP DEPT 304000.4 0 0.00 304000.4
## 6 ADMINISTRATOR, DPH 299121.5 0 0.00 299121.5
## # ... with 1 more variables: Total.Pay...Benefits <dbl>
#Total.Pay...Benefits
head(arrange(data_2015,Total.Pay...Benefits)) #lowest Total.Pay...Benefits## # A tibble: 6 × 6
## Job.Title Base.Pay Overtime.Pay Other.Pay Total.Pay
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 ASSISTANT RECREATION DIRECTOR 595.400 0.0000000 190.3633 785.7633
## 2 TRANSIT PAINT SHOP SPRV1 0.000 0.0000000 961.1800 961.1800
## 3 SPECIAL EXAMINER 1817.452 0.9086364 0.0000 1818.3609
## 4 BDCOMM MBR, GRP2,M=$25/MTG 259.620 0.0000000 0.0000 259.6200
## 5 COMMISSIONER NO BENEFITS 2148.739 0.0000000 0.0000 2148.7393
## 6 PUBLIC SERVICE AIDE-TECHNICAL 2238.000 0.0000000 30.1520 2268.1520
## # ... with 1 more variables: Total.Pay...Benefits <dbl>
head(arrange(data_2015,desc(Total.Pay...Benefits))) #highest Total.Pay...Benefits## # A tibble: 6 × 6
## Job.Title Base.Pay Overtime.Pay Other.Pay Total.Pay
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 CHIEF INVESTMENT OFFICER 507831.6 0 0.00 507831.6
## 2 CHIEF OF POLICE 308901.4 0 19354.12 328255.6
## 3 CHIEF, FIRE DEPARTMENT 303494.8 0 24279.58 327774.4
## 4 ADM, SFGH MEDICAL CENTER 256098.0 0 82292.31 338390.3
## 5 GEN MGR, PUBLIC TRNSP DEPT 304000.4 0 0.00 304000.4
## 6 ADMINISTRATOR, DPH 299121.5 0 0.00 299121.5
## # ... with 1 more variables: Total.Pay...Benefits <dbl>
The highest paid and compensated employees are the chief investment officer, chief of police, manager of public transportation, fire chief, public health administrator, administrator of San Francisco General Hospital, and the mayor. These are all high profile public positions, except for the CIO and the hospital administrator, which must receive competetive compensations with what the private sector could pay for each employee.
The highest overtime pay is for mechanical, track, and transit supervisors. These supervisors are adding about a third to one-half to their base pay in overtime pay. This seems a little suspicious to me. It probably deserves looking into, or at least further investigation and an explanation.
Those that are paid very little look like they are occupying entry level positions (e.g., accountant) and volunteer/part-time/contract positions (e.g., recreation director).
In conclusion, this is a great dataset. I like that San Francisco is being transparent about how much people are getting paid. This is a great way to keep track of pay increases over time to make sure that such increases are reasonable and make sense according to the type of work being done.
The data comes from here: http://www.census.gov/population/age/data/files/2012/2012gender_table17.csv.
data<-read.csv("http://www.census.gov/population/age/data/files/2012/2012gender_table17.csv", stringsAsFactors = FALSE)The data comes in very messy and requires significant cleanup.
#remove top 4 rows
data<-data[5:52,]
#remove bottom 8 rows
data<-data[1:40,]
#make row 1 the header
names(data)<-data[1,]
#remove rows 1 and 2
data<-data[3:40,]
#remove percent columns
data[,seq(3,23,2)]<-NULL
#replace "-" with 0, according to legend
for(i in 1:ncol(data)) {
data[which(data[,i]=="-"),i]<-0
}
#break into 3 tables
both<-data[1:12,]
male<-data[14:25,]
female<-data[27:38,]
#add Gender columns to male and female tables. Ignore both as it is no longer needed.
male$Gender<-"Male"
female$Gender<-"Female"
#remove row 1 from both male and female tables
male<-male[2:12,]
female<-female[2:12,]
#combine male and female tables
combined<-bind_rows(male,female)
#give column one the name "Age"
names(combined)[1]<-"Age"
#reorder the columns
combined<-combined[,c(1,13,3:12)]
#view data
head(combined)## Age Gender Under $5,000 $5,000 to $9,999
## 1 .15 to 17 years Male 10 3
## 2 .18 to 24 years Male 66 122
## 3 .25 to 29 years Male 23 77
## 4 .30 to 34 years Male 55 55
## 5 .35 to 39 years Male 34 30
## 6 .40 to 44 years Male 42 28
## $10,000 to $14,999 $15,000 to $19,999 $20,000 to $24,999
## 1 20 4 7
## 2 394 617 707
## 3 193 471 671
## 4 162 332 509
## 5 155 263 467
## 6 126 244 388
## $25,000 to $34,999 $35,000 to $49,999 $50,000 to $74,999
## 1 9 8 3
## 2 892 572 180
## 3 1,350 1,526 1,222
## 4 1,080 1,605 1,655
## 5 873 1,453 1,638
## 6 998 1,506 1,743
## $75,000 to $99,999 $100,000 and over
## 1 0 0
## 2 48 52
## 3 407 283
## 4 688 664
## 5 810 1,080
## 6 879 1,340
The analysis asks us to compare income between same age group male and female. So we need to do some tidying by spreading andgathering the data.
combined_spread<-combined %>%
gather(Income, Counts,3:12) %>%
select(Age,Gender,Income,Counts) %>%
spread(Gender,Counts)
head(combined_spread)## Age Income Female Male
## 1 .15 to 17 years $10,000 to $14,999 12 20
## 2 .15 to 17 years $15,000 to $19,999 10 4
## 3 .15 to 17 years $20,000 to $24,999 6 7
## 4 .15 to 17 years $25,000 to $34,999 12 9
## 5 .15 to 17 years $35,000 to $49,999 5 8
## 6 .15 to 17 years $5,000 to $9,999 6 3
There are several routes we can now take to compare income based on age and gender. I will first start by turning the counts into percentages of the age and gender.
#remove commas in Male and Female
combined_spread$Male<-str_replace(combined_spread$Male,",","")
combined_spread$Female<-str_replace(combined_spread$Female,",","")
#cast as numeric
combined_spread$Female<-as.numeric(combined_spread$Female)
combined_spread$Male<-as.numeric(combined_spread$Male)
#get sum of each age category by gender
female_sum<-combined_spread %>%
group_by(Age) %>%
summarise(Female_Sum=sum(as.numeric(Female, na.rm=TRUE)))
male_sum<-combined_spread %>%
group_by(Age) %>%
summarise(Male_Sum=sum(as.numeric(Male, na.rm=TRUE)))
#join each to the combined spread
combined_spread<-left_join(combined_spread,female_sum, by="Age")
combined_spread<-left_join(combined_spread,male_sum, by="Age")
#get percentages
combined_spread <- combined_spread %>%
mutate(Female_Perc = Female / Female_Sum) %>%
mutate(Male_Perc = Male / Male_Sum)As a first stab at the problem, if there is no difference between males and females when it comes to income, then in each age and income combination, there shouldn’t be any significant difference between the percentages of males and females in that age and income combination. Let’s subtract the female percentage from the male percentage for each age and income combination and see if there are any large differences.
#subtract female percentage from male percentage
combined_spread$Perc_Diff<-combined_spread$Male_Perc - combined_spread$Female_Per
#order the outcome by Perc_Diff
combined_spread<-arrange(combined_spread,Perc_Diff)
#see head
head(select(combined_spread,Age,Income,Perc_Diff))## Age Income Perc_Diff
## 1 .15 to 17 years $15,000 to $19,999 -0.11607143
## 2 .50 to 54 years $25,000 to $34,999 -0.07594152
## 3 .15 to 17 years $25,000 to $34,999 -0.07366071
## 4 .65 years and over $35,000 to $49,999 -0.06666104
## 5 .35 to 39 years $25,000 to $34,999 -0.06629518
## 6 .60 to 64 years $25,000 to $34,999 -0.06582208
#see tail
tail(select(combined_spread,Age,Income,Perc_Diff))## Age Income Perc_Diff
## 105 .55 to 59 years $100,000 and over 0.1176377
## 106 .45 to 49 years $100,000 and over 0.1222060
## 107 .60 to 64 years $100,000 and over 0.1338594
## 108 .15 to 17 years Under $5,000 0.1383929
## 109 .50 to 54 years $100,000 and over 0.1387611
## 110 .65 years and over $100,000 and over 0.1495640
If we look at the incomes in which females are more likely than males as a percentage, the incomes are in the lower to middle brackets: $15,000 to $19,999, $25,000 to $34,999, and $35,000 to $49,999. If we look at the incomes in which males are more likely than females as a percentage, the incomes are all $100,000 and over (excepting the under $5,000 category for 15 - 17 year olds). So it does seem that there is correlation between higher wages and being male, or conversely, between lower wages and being female.
We really want to treat Age as a numeric value, not as a category. Similarly with income. As another approach to comparing males and females by age and income, let’s convert the age and income categories to numeric values by using the middle value in each. We thereby explicitly make the assumption that the true mean of each category is the middle value, even though this is almost certainly not true. However, we can’t figure out what the true mean is for each category given the data we have, and this approach should be true enough for our purposes.
#get middle age value
age_column<-c()
for(i in 1: length(combined_spread$Age)){
age_vector<-as.numeric(str_extract_all(combined_spread$Age,"[0-9]{2}")[[i]])
age_mean<-mean(age_vector, na.rm = TRUE)
age_column<-c(age_column,age_mean)
}
#add to combined_spread
combined_spread$Age_Mean<-age_column
#get middle income value
income_column<-c()
for(i in 1: length(combined_spread$Income)){
income_vector<-as.numeric(str_extract_all(str_replace_all(combined_spread$Income,",",""),"[0-9]{2,}")[[i]])
income_mean<-mean(income_vector, na.rm = TRUE)
income_column<-c(income_column,income_mean)
}
#add to combined_spread
combined_spread$Income_Mean<-income_columnNow we can multiply the Income_Mean by the percentages of male and female for each Age_Mean, and then we can group by the Age_Mean and sum up the Income_Mean to get the average income for males and females for each Age_Mean.
#get the proportional income for the age bracket for males and females and join together
combined_avg<-combined_spread %>%
mutate(Prop_Male_Income = Male_Perc * Income_Mean) %>%
mutate(Prop_Female_Income = Female_Perc * Income_Mean) %>%
select(Age_Mean,Prop_Male_Income,Prop_Female_Income)
male_combined_avg<-combined_avg %>%
group_by(Age_Mean) %>%
summarise(Male_AVG_Income = sum(Prop_Male_Income))
female_combined_avg <-combined_avg %>%
group_by(Age_Mean) %>%
summarise(Female_AVG_Income = sum(Prop_Female_Income))
combined_avg_both<-left_join(male_combined_avg,female_combined_avg, by="Age_Mean")#calculate difference
combined_avg_both<-mutate(combined_avg_both,Difference = Male_AVG_Income - Female_AVG_Income)
#display
combined_avg_both## # A tibble: 11 × 4
## Age_Mean Male_AVG_Income Female_AVG_Income Difference
## <dbl> <dbl> <dbl> <dbl>
## 1 16 21054.27 25133.46 -4079.190
## 2 21 28655.68 25618.36 3037.324
## 3 27 43722.41 38858.25 4864.153
## 4 32 51524.17 43881.20 7642.966
## 5 37 56832.60 46791.15 10041.443
## 6 42 58786.27 46840.97 11945.294
## 7 47 59496.62 46298.17 13198.445
## 8 52 61907.67 48138.32 13769.347
## 9 57 61221.53 48847.76 12373.765
## 10 62 60319.71 46683.59 13636.123
## 11 65 56513.49 42728.39 13785.101
We can see that apart from the first Age_Mean of 16, males always have higher incomes than females. Furthermore, apart from one Age_Mean, the difference is always increasing.
We can plot this information to see it visually.
plot(combined_avg_both$Age_Mean,combined_avg_both$Male_AVG_Income, col="blue", type="l", lwd=3, ylim=c(-5000,70000), main="Male Income vs. Female Income by Age", xlab = "Age", ylab="Income")
lines(combined_avg_both$Age_Mean,combined_avg_both$Female_AVG_Income, col="pink", type="l", lwd=3)
lines(combined_avg_both$Age_Mean,combined_avg_both$Difference, col="red", type="l", lwd=3)
legend("topleft",
inset=.05,
cex = .5,
title="Legend",
c("Male","Female","Difference"),
horiz=FALSE,
lty=c(1,1),
lwd=c(1,1),
col=c("blue","pink","red"),
bg="grey96")Using this, it appears that males and females are roughly the same until the late 20s. At that point, males increase in income at a faster rate than females do. This continues over the 30s, but slows down and mostly stops in the 40s. By the mid 40s, male and female incomes move in parallel, maintaining roughly the same difference in wage through the 50s and 60s.
In conclusion, while much more could be said here, the two analyses suggest that males do have a higher income than females. And in the highest income brackets, the disproportion between males and females is the highest. Why is that?
It appears that much of the difference can be explained by whatever is happening from the late 20s through the 30s. As a hypothesis, these are years when women typically have children and many of these women could have taken a break from the career to have and raise children before returning to work full time. As these women would be competing against male colleagues that didn’t take a break from working, it may not surprise us that these colleagues continued to receive increases in income while their female counterparts did not.
If we are NOT assuming that women are paid less due to sex discrimination, the above explanation seems like a reasonable possibility. More data and analysis would be necessary to determine whether it is true or not. We could, for example, compare the incomes of women that took time off to raise children to those that didn’t to see if there is a difference, and if those that didn’t take time off kept pace with males of the same age. Unfortunately, the data here is not detailed enough to permit such an analysis.