Project 2: Tidying, Transforming, and Analyzing

Solution:

1: Football Statistics

Load and Clean the Data

The data is taken from here: http://www.pro-football-reference.com/boxscores/201609250sea.htm.

I created a .CSV file by copying the data from the above webpage into an Excel file and saving the file as a .CSV. This is itself a challenge because Excel changes some of the values into dates automatically. The solution is to select the area in which the paste will occur and format the cells to ‘Text’. Then paste the data using the “Match Destination Formatting” option. Then the data won’t change into date values.

Now I load the data and display it below.

  data<-read.csv("C:/Users/Andy/Desktop/Personal/Learning/CUNY/DATA607/Project2/football_data.csv")
  data

##                     X           SFO           SEA
## 1         First Downs            12            18
## 2        Rush-Yds-TDs      31-135-2      31-127-2
## 3   Cmp-Att-Yd-TD-INT 14-25-119-0-1 22-32-308-2-1
## 4        Sacked-Yards           0-0          2-17
## 5      Net Pass Yards           119           291
## 6         Total Yards           254           418
## 7        Fumbles-Lost           1-0           1-1
## 8           Turnovers             1             2
## 9     Penalties-Yards          4-35          6-50
## 10   Third Down Conv.          4-15          9-14
## 11  Fourth Down Conv.           1-1           0-0
## 12 Time of Possession         24:03         35:57
## 13

A lot of cleanup and transformation will have to occur. I first load the libraries.

library(stringr)
library(tidyr)
library(dplyr)
library(DT)

Now for the tidying.

#gather, select, and spread the data
data_tidy<-data %>%
  gather(Team,Value,2:3) %>%
  select(Team,X,Value) %>%
  spread(X,Value)

data_tidy

##   Team  Cmp-Att-Yd-TD-INT First Downs Fourth Down Conv. Fumbles-Lost
## 1  SEA      22-32-308-2-1          18               0-0          1-1
## 2  SFO      14-25-119-0-1          12               1-1          1-0
##   Net Pass Yards Penalties-Yards Rush-Yds-TDs Sacked-Yards
## 1            291            6-50     31-127-2         2-17
## 2            119            4-35     31-135-2          0-0
##   Third Down Conv. Time of Possession Total Yards Turnovers
## 1             9-14              35:57         418         2
## 2             4-15              24:03         254         1

Several of the columns have multiple values in them that are separated by a “-”. We need to break these up into individual columns. We can use the separate function to do so.

#remove empty column
data_tidy[,2]<-NULL

#separate the columns with multiple values
data_separate<-data_tidy %>%
  separate("Cmp-Att-Yd-TD-INT",c("PassCompletions","PassAttempts","PassYards","PassTD","PassINT"), sep = "-", remove = TRUE) %>%
  separate("Fourth Down Conv.",c("FourthDownAtt","FourthDownConv"), sep = "-", remove = TRUE) %>%
  separate("Fumbles-Lost",c("FumblesMade","FumblesLost"), sep = "-", remove = TRUE) %>%
  separate("Penalties-Yards",c("PenaltyCount","PenaltyYards"), sep = "-", remove = TRUE) %>%
  separate("Rush-Yds-TDs",c("RushingAttempts","RushYards","RushTD"), sep = "-", remove = TRUE) %>%
  separate("Sacked-Yards",c("SackedCounts","SackedYards"), sep = "-", remove = TRUE) %>%
  separate("Third Down Conv.",c("ThirdDownAtt","ThirdDownConv"), sep = "-", remove = TRUE)

#convert time of possession to minutes only
for (i in 1:length(data_separate$`Time of Possession`)) {
  temp<-as.numeric(unlist(str_split(data_separate$"Time of Possession"[i],":")))
  data_separate$"Time of Possession"[i]<-temp[1]+temp[2]/60
}


data_separate

##   Team PassCompletions PassAttempts PassYards PassTD PassINT First Downs
## 1  SEA              22           32       308      2       1          18
## 2  SFO              14           25       119      0       1          12
##   FourthDownAtt FourthDownConv FumblesMade FumblesLost Net Pass Yards
## 1             0              0           1           1            291
## 2             1              1           1           0            119
##   PenaltyCount PenaltyYards RushingAttempts RushYards RushTD SackedCounts
## 1            6           50              31       127      2            2
## 2            4           35              31       135      2            0
##   SackedYards ThirdDownAtt ThirdDownConv Time of Possession Total Yards
## 1          17            9            14              35.95         418
## 2           0            4            15              24.05         254
##   Turnovers
## 1         2
## 2         1

Analysis

Now that we have the data tidied and transformed, we can perform the analysis: Compare the yards per touchdown (both rushing and passing) for both teams. What does this mean?

First, let’s compare the passing yards to the passing touchdowns for both teams. Since these variables are still characters due to the splitting, we need to cast them as numeric before dividing the yards by the TDs. We can use the mutate function to do this calculation and add it to our data. We can then use select to look at the result for each team.

data_analysis<-data_separate %>%
  mutate(PassYdsPerPassTD = as.numeric(PassYards) / as.numeric(PassTD))

select(data_analysis,Team, PassYdsPerPassTD)

##   Team PassYdsPerPassTD
## 1  SEA              154
## 2  SFO              Inf

We see that the Seahawks averaged 154 passing yards per passing touchdown. So it took, on average, 154 yards of passing to get a touchdown. The lower this number is, the better, since it would mean that a team had to pass for fewer yards to get a touchdown as a result. The 49ers averaged Inf. The Inf value is the result of having 0 passing touchdowns, so we were dividing by 0. We can interpret this as “it would take an infinite number of passing yards to get a passing touchdown”. So on this analysis, the Seahawks did better in passing in that their passing yielded touchdowns more quickly than the 49ers.

What about rushing? We can do the same analysis with the rushing yards compared to the rushing touchdowns.

data_analysis<-data_analysis %>%
  mutate(RushYdsPerRushTD = as.numeric(RushYards) / as.numeric(RushTD))

select(data_analysis,Team, RushYdsPerRushTD)

##   Team RushYdsPerRushTD
## 1  SEA             63.5
## 2  SFO             67.5

We see that, again, the Seahawks have a lower number than the 49ers. In other words, the Seahawks have to rush for fewer yards than the 49ers do before the rushing yields a rushing touchdown. But it is not much lower. In fact, it is almost the same.

We can combine these results to look at total yards compared to total touchdowns. We use the Total Yards column which accounts for lost yards due to sacks.

data_analysis<-data_analysis %>%
  mutate(YdsPerTD = as.numeric(data_analysis$"Total Yards") / (as.numeric(RushTD) + as.numeric(PassTD)))

select(data_analysis,Team, YdsPerTD )

##   Team YdsPerTD
## 1  SEA    104.5
## 2  SFO    127.0

It took the Seahawks, on average, 104.5 yards of passing and rushing to get a touchdown, while it took the 49ers 127 yards on average. While the Seahawks did better in this regard, it is not a whole lot better. The trouble for the 49ers was that they had very little yardage in comparison to the Seahawks and never got a passing touchdown. So even if the 49ers had a yards per touchdown average of 90, if they only went 90 yards in total, they would only have one touchdown, which may not be enough to win the game.

Conclusion

In conclusion, we see that teams may be very similar in certain statistics (e.g., yards/touchdown) and very different in others (e.g., total touchdowns). It is important then to know which statistics matter in assessing whether a team will win or lose and which are not so important. Obviously, total touchdowns are important. But suppose we only knew a team’s average yards per touchdown in comparison to another team. That information alone may not be enough to accurately predict who will win the game, since we don’t know how many yards on average that team may get during a game.

2: San Francisco Salaries Analysis

Load and Clean the Data

The data on San Francisco public employee salaries and benefits comes from this page: http://transparentcalifornia.com/salaries/2015/san-francisco/. It has links to data from 2011-2015. I download each file as a .csv. I then import the files into R one at a time.

#import each file
temp_data_1<-read.csv("C:/Users/Andy/Desktop/Personal/Learning/CUNY/DATA607/Project2/san-francisco-2011.csv",stringsAsFactors = FALSE)
temp_data_2<-read.csv("C:/Users/Andy/Desktop/Personal/Learning/CUNY/DATA607/Project2/san-francisco-2012.csv",stringsAsFactors = FALSE)
temp_data_3<-read.csv("C:/Users/Andy/Desktop/Personal/Learning/CUNY/DATA607/Project2/san-francisco-2013.csv",stringsAsFactors = FALSE)
temp_data_4<-read.csv("C:/Users/Andy/Desktop/Personal/Learning/CUNY/DATA607/Project2/san-francisco-2014.csv",stringsAsFactors = FALSE)
temp_data_5<-read.csv("C:/Users/Andy/Desktop/Personal/Learning/CUNY/DATA607/Project2/san-francisco-2015.csv",stringsAsFactors = FALSE)
  
#check names
#names(temp_data_1)
#names(temp_data_2)
#names(temp_data_3)
#names(temp_data_4)
#names(temp_data_5)

#add missing status to 2011-2013
temp_data_1$Status<-NA
temp_data_2$Status<-NA
temp_data_3$Status<-NA


#adjust the names
names(temp_data_1)<-names(temp_data_5)
names(temp_data_2)<-names(temp_data_5)
names(temp_data_3)<-names(temp_data_5)
names(temp_data_4)<-names(temp_data_5)

#rbind the data into one dataframe
data<-rbind(temp_data_1,temp_data_2,temp_data_3,temp_data_4,temp_data_5)

Now that I have the data imported, I can look at it. It is interesting to see that some of the pay values were negative.

#see summary output
summary(data)

##  Employee.Name       Job.Title            Base.Pay       
##  Length:188037      Length:188037      Min.   :  -292.4  
##  Class :character   Class :character   1st Qu.: 33559.8  
##  Mode  :character   Mode  :character   Median : 65343.4  
##                                        Mean   : 66840.5  
##                                        3rd Qu.: 95229.1  
##                                        Max.   :507831.6  
##                                        NA's   :605       
##   Overtime.Pay         Other.Pay          Benefits        
##  Min.   :    -0.01   Min.   : -7058.6   Length:188037     
##  1st Qu.:     0.00   1st Qu.:     0.0   Class :character  
##  Median :     0.00   Median :   765.5   Mode  :character  
##  Mean   :  5179.35   Mean   :  3539.1                     
##  3rd Qu.:  4765.89   3rd Qu.:  4095.8                     
##  Max.   :245131.88   Max.   :400184.2                     
##                                                           
##    Total.Pay        Total.Pay...Benefits      Year       Notes        
##  Min.   :  -618.1   Min.   :  -618.1     Min.   :2011   Mode:logical  
##  1st Qu.: 36277.3   1st Qu.: 44901.2     1st Qu.:2012   NA's:188037   
##  Median : 71905.0   Median : 94166.5     Median :2013                 
##  Mean   : 75343.8   Mean   : 95377.2     Mean   :2013                 
##  3rd Qu.:106609.4   3rd Qu.:134996.1     3rd Qu.:2014                 
##  Max.   :567595.4   Max.   :633723.3     Max.   :2015                 
##                                                                       
##     Agency             Status         
##  Length:188037      Length:188037     
##  Class :character   Class :character  
##  Mode  :character   Mode  :character  
##                                       
##                                       
##                                       
##

#see structure
str(data)

## 'data.frame':    188037 obs. of  12 variables:
##  $ Employee.Name       : chr  "NATHANIEL FORD" "GARY JIMENEZ" "ALBERT PARDINI" "CHRISTOPHER CHONG" ...
##  $ Job.Title           : chr  "GENERAL MANAGER-METROPOLITAN TRANSIT AUTHORITY" "CAPTAIN III (POLICE DEPARTMENT)" "CAPTAIN III (POLICE DEPARTMENT)" "WIRE ROPE CABLE MAINTENANCE MECHANIC" ...
##  $ Base.Pay            : num  167411 155966 212739 77916 134402 ...
##  $ Overtime.Pay        : num  0 245132 106088 56121 9737 ...
##  $ Other.Pay           : num  400184 137811 16453 198307 182235 ...
##  $ Benefits            : chr  "Not Provided" "Not Provided" "Not Provided" "Not Provided" ...
##  $ Total.Pay           : num  567595 538909 335280 332344 326373 ...
##  $ Total.Pay...Benefits: num  567595 538909 335280 332344 326373 ...
##  $ Year                : int  2011 2011 2011 2011 2011 2011 2011 2011 2011 2011 ...
##  $ Notes               : logi  NA NA NA NA NA NA ...
##  $ Agency              : chr  "San Francisco" "San Francisco" "San Francisco" "San Francisco" ...
##  $ Status              : chr  NA NA NA NA ...

Analysis

The analysis called for:

1. Categorize the data via job title and profile.

2. Salary changes over time between different groups

3. Base pay, overtime pay, and benefits allocated between different groups.

The profile does not exist in the data download. Instead, it is already split into Year and Agency. So 1 has been done.

I move on to 2. To get salary changes over time between the different groups (i.e., job titles), I need to spread the data after grouping by job title. For salary, perhaps base pay is the closest in meaning. But I’d like to see the total benefits plus pay instead, as this will give us a better picture of the total compensation a person receives. So I group the data using the average Total.Pay...Benefits.

salary_over_time<-data %>%
  select(Job.Title,Total.Pay...Benefits,Year) %>%
  mutate(Job.Title = str_to_upper(Job.Title)) %>%
  group_by(Job.Title,Year) %>%
  summarise(Average_TotalPayBenefits = mean(Total.Pay...Benefits)) %>%
  spread(Year,Average_TotalPayBenefits)

Which job title has the biggest total increase in pay from 2011 to 2015? It looks like the largest increase from 2011 to 2015 was $229,702. This belongs to the Mayor, whose total benefits and pay increased from $134,205 to $363,908.

#calculate increase
salary_over_time$NetPayIncrease<-salary_over_time$`2015` - salary_over_time$`2011`

#find the max increase
max(salary_over_time$NetPayIncrease, na.rm = TRUE)

## [1] 229702.4

#return the row with the max increase
subset(salary_over_time, NetPayIncrease == max(salary_over_time$NetPayIncrease, na.rm = TRUE))

## Source: local data frame [1 x 7]
## Groups: Job.Title [1]
## 
##   Job.Title   `2011`   `2012`   `2013`   `2014`   `2015` NetPayIncrease
##       <chr>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>          <dbl>
## 1     MAYOR 134205.7 335272.5 362551.7 364814.5 363908.1       229702.4

What about proportional increase? The maximum proportional increase belongs to the audiometrist, whose total pay and benefits increased by 892% from $10,595 to $105,185.

#calculate proportional increase
salary_over_time$PropPayIncrease<-(salary_over_time$`2015` - salary_over_time$`2011`) / salary_over_time$`2011`

#find the max increase
max(salary_over_time$PropPayIncrease, na.rm = TRUE)

## [1] 8.927674

#return the row with the max increase
subset(salary_over_time, PropPayIncrease == max(salary_over_time$PropPayIncrease, na.rm = TRUE))

## Source: local data frame [1 x 8]
## Groups: Job.Title [1]
## 
##      Job.Title   `2011`   `2012` `2013`   `2014`   `2015` NetPayIncrease
##          <chr>    <dbl>    <dbl>  <dbl>    <dbl>    <dbl>          <dbl>
## 1 AUDIOMETRIST 10595.14 39228.16     NA 51222.73 105185.1       94589.96
## # ... with 1 more variables: PropPayIncrease <dbl>

I also find the average net increase and the average proportional increase. The net increase average is $37980 and the proportional increase average is 60%. I look at the other summary statistics as well to get a sense of the distribution.

#calculate net increase
summary(salary_over_time$NetPayIncrease)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
## -105300   25130   35740   37980   48820  229700    1214

#calculate proportional increase
summary(salary_over_time$PropPayIncrease)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
## -0.9698  0.3693  0.5185  0.6062  0.6995  8.9280    1214

I move on to 3. Focusing on the 2015 data, I compare base pay, overtime pay, other pay, total pay, and total pay plus benefits among the job titles. I ignore benefits since the column is all NA.

#select 2015 data, capitalize job titles, group by job titles, rearrange the columns, and summarize each using the mean
data_2015<-data %>%
  subset(Year == 2015) %>%
  mutate(Job.Title = str_to_upper(Job.Title)) %>%
  group_by(Job.Title) %>%
  select(Job.Title, Base.Pay, Overtime.Pay, Other.Pay, Total.Pay, Total.Pay...Benefits) %>%
  summarise_each(funs(mean))

Now I can see which jobs have the highest and lowest in each category:

#base pay
head(arrange(data_2015,Base.Pay)) #lowest base pay

## # A tibble: 6 × 6
##                        Job.Title Base.Pay Overtime.Pay Other.Pay Total.Pay
##                            <chr>    <dbl>        <dbl>     <dbl>     <dbl>
## 1 BATTLION CHIEF, FIRE SUPPRESSI     0.00            0 87095.500 87095.500
## 2            RECREATION DIRECTOR     0.00            0  3298.315  3298.315
## 3       TRANSIT PAINT SHOP SPRV1     0.00            0   961.180   961.180
## 4    VICTIM & WITNESS TECHNICIAN     0.00            0  3623.910  3623.910
## 5          X-RAY LABORATORY AIDE     0.00            0  4173.620  4173.620
## 6     BDCOMM MBR, GRP2,M=$25/MTG   259.62            0     0.000   259.620
## # ... with 1 more variables: Total.Pay...Benefits <dbl>

head(arrange(data_2015,desc(Base.Pay))) #highest base pay

## # A tibble: 6 × 6
##                    Job.Title Base.Pay Overtime.Pay Other.Pay Total.Pay
##                        <chr>    <dbl>        <dbl>     <dbl>     <dbl>
## 1   CHIEF INVESTMENT OFFICER 507831.6            0      0.00  507831.6
## 2            CHIEF OF POLICE 308901.4            0  19354.12  328255.6
## 3 GEN MGR, PUBLIC TRNSP DEPT 304000.4            0      0.00  304000.4
## 4     CHIEF, FIRE DEPARTMENT 303494.8            0  24279.58  327774.4
## 5         ADMINISTRATOR, DPH 299121.5            0      0.00  299121.5
## 6                      MAYOR 288963.5            0      0.00  288963.5
## # ... with 1 more variables: Total.Pay...Benefits <dbl>

#Overtime.Pay
head(arrange(data_2015,Overtime.Pay)) #lowest Overtime.Pay

## # A tibble: 6 × 6
##               Job.Title  Base.Pay Overtime.Pay Other.Pay Total.Pay
##                   <chr>     <dbl>        <dbl>     <dbl>     <dbl>
## 1          ACCOUNTANT I  36625.00            0  91.87500  36716.88
## 2         ACCOUNTANT II  68350.85            0  11.65117  68362.50
## 3  ACCOUNTANT II (OCII)  65710.03            0   0.00000  65710.03
## 4        ACCOUNTANT III  81890.74            0 641.21479  82531.96
## 5 ACCOUNTANT III (OCII)  79423.01            0   0.00000  79423.01
## 6         ACCOUNTANT IV 101112.39            0 477.53828 101589.93
## # ... with 1 more variables: Total.Pay...Benefits <dbl>

head(arrange(data_2015,desc(Overtime.Pay))) #highest Overtime.Pay

## # A tibble: 6 × 6
##                        Job.Title  Base.Pay Overtime.Pay Other.Pay
##                            <chr>     <dbl>        <dbl>     <dbl>
## 1         MECH SHOP & EQUIP SUPT 108578.01     70332.42 10797.800
## 2     WIRE ROPE CABLE MAINT SPRV 166228.75     58417.13  9856.940
## 3         TRACK MAINT WRK SPRV 1  83177.59     56787.11  4653.800
## 4       TRANSIT POWER LINE SPRV1  94882.79     54852.07  5447.534
## 5  HEATING/VENTILATING INSPECTOR 113427.51     54705.21  4633.605
## 6 CAPT,FIRE PREV OR FIRE INVSGTN 171948.48     49997.61 11917.736
## # ... with 2 more variables: Total.Pay <dbl>, Total.Pay...Benefits <dbl>

#Other.Pay
head(arrange(data_2015,Other.Pay)) #lowest Other.Pay

## # A tibble: 6 × 6
##                      Job.Title  Base.Pay Overtime.Pay Other.Pay Total.Pay
##                          <chr>     <dbl>        <dbl>     <dbl>     <dbl>
## 1         ACCOUNTANT II (OCII)  65710.03            0         0  65710.03
## 2        ACCOUNTANT III (OCII)  79423.01            0         0  79423.01
## 3 ACCOUNTING SUPERVISOR (OCII) 111016.03            0         0 111016.03
## 4  ACPO,JUVP, JUV PROB (SFERS) 153727.42            0         0 153727.42
## 5   ADMINISTRATIVE HEARING SUP 108913.01            0         0 108913.01
## 6  ADMINISTRATIVE SERVICES MGR  85323.62            0         0  85323.62
## # ... with 1 more variables: Total.Pay...Benefits <dbl>

head(arrange(data_2015,desc(Other.Pay))) #highest Other.Pay

## # A tibble: 6 × 6
##                        Job.Title Base.Pay Overtime.Pay Other.Pay Total.Pay
##                            <chr>    <dbl>        <dbl>     <dbl>     <dbl>
## 1 BATTLION CHIEF, FIRE SUPPRESSI      0.0        0.000  87095.50   87095.5
## 2       ADM, SFGH MEDICAL CENTER 256098.0        0.000  82292.31  338390.3
## 3              ASST MED EXAMINER 216311.8     4770.727  69023.80  290106.3
## 4                 DEPUTY CHIEF 3 205583.9        0.000  53862.39  259446.3
## 5    EXECUTIVE CONTRACT EMPLOYEE 214812.0        0.000  45970.67  260782.7
## 6 NURSING SUPERVISOR PSYCHIATRIC 174195.0        0.000  43701.97  217897.0
## # ... with 1 more variables: Total.Pay...Benefits <dbl>

#Total.Pay
head(arrange(data_2015,Total.Pay)) #lowest Total.Pay

## # A tibble: 6 × 6
##                       Job.Title  Base.Pay Overtime.Pay Other.Pay Total.Pay
##                           <chr>     <dbl>        <dbl>     <dbl>     <dbl>
## 1    BDCOMM MBR, GRP2,M=$25/MTG  259.6200            0    0.0000  259.6200
## 2    BDCOMM MBR, GRP3 M=$50/MTG  727.4324            0    0.0000  727.4324
## 3 ASSISTANT RECREATION DIRECTOR  595.4000            0  190.3633  785.7633
## 4      TRANSIT PAINT SHOP SPRV1    0.0000            0  961.1800  961.1800
## 5     BDCOMM MBR, GRP5,M$100/MO 1224.6283            0    0.0000 1224.6283
## 6      CHIEF NURSERY SPECIALIST 1647.7800            0  122.3600 1770.1400
## # ... with 1 more variables: Total.Pay...Benefits <dbl>

head(arrange(data_2015,desc(Total.Pay))) #highest Total.Pay

## # A tibble: 6 × 6
##                    Job.Title Base.Pay Overtime.Pay Other.Pay Total.Pay
##                        <chr>    <dbl>        <dbl>     <dbl>     <dbl>
## 1   CHIEF INVESTMENT OFFICER 507831.6            0      0.00  507831.6
## 2   ADM, SFGH MEDICAL CENTER 256098.0            0  82292.31  338390.3
## 3            CHIEF OF POLICE 308901.4            0  19354.12  328255.6
## 4     CHIEF, FIRE DEPARTMENT 303494.8            0  24279.58  327774.4
## 5 GEN MGR, PUBLIC TRNSP DEPT 304000.4            0      0.00  304000.4
## 6         ADMINISTRATOR, DPH 299121.5            0      0.00  299121.5
## # ... with 1 more variables: Total.Pay...Benefits <dbl>

#Total.Pay...Benefits
head(arrange(data_2015,Total.Pay...Benefits)) #lowest Total.Pay...Benefits

## # A tibble: 6 × 6
##                       Job.Title Base.Pay Overtime.Pay Other.Pay Total.Pay
##                           <chr>    <dbl>        <dbl>     <dbl>     <dbl>
## 1 ASSISTANT RECREATION DIRECTOR  595.400    0.0000000  190.3633  785.7633
## 2      TRANSIT PAINT SHOP SPRV1    0.000    0.0000000  961.1800  961.1800
## 3              SPECIAL EXAMINER 1817.452    0.9086364    0.0000 1818.3609
## 4    BDCOMM MBR, GRP2,M=$25/MTG  259.620    0.0000000    0.0000  259.6200
## 5      COMMISSIONER NO BENEFITS 2148.739    0.0000000    0.0000 2148.7393
## 6 PUBLIC SERVICE AIDE-TECHNICAL 2238.000    0.0000000   30.1520 2268.1520
## # ... with 1 more variables: Total.Pay...Benefits <dbl>

head(arrange(data_2015,desc(Total.Pay...Benefits))) #highest Total.Pay...Benefits

## # A tibble: 6 × 6
##                    Job.Title Base.Pay Overtime.Pay Other.Pay Total.Pay
##                        <chr>    <dbl>        <dbl>     <dbl>     <dbl>
## 1   CHIEF INVESTMENT OFFICER 507831.6            0      0.00  507831.6
## 2            CHIEF OF POLICE 308901.4            0  19354.12  328255.6
## 3     CHIEF, FIRE DEPARTMENT 303494.8            0  24279.58  327774.4
## 4   ADM, SFGH MEDICAL CENTER 256098.0            0  82292.31  338390.3
## 5 GEN MGR, PUBLIC TRNSP DEPT 304000.4            0      0.00  304000.4
## 6         ADMINISTRATOR, DPH 299121.5            0      0.00  299121.5
## # ... with 1 more variables: Total.Pay...Benefits <dbl>

The highest paid and compensated employees are the chief investment officer, chief of police, manager of public transportation, fire chief, public health administrator, administrator of San Francisco General Hospital, and the mayor. These are all high profile public positions, except for the CIO and the hospital administrator, which must receive competetive compensations with what the private sector could pay for each employee.

The highest overtime pay is for mechanical, track, and transit supervisors. These supervisors are adding about a third to one-half to their base pay in overtime pay. This seems a little suspicious to me. It probably deserves looking into, or at least further investigation and an explanation.

Those that are paid very little look like they are occupying entry level positions (e.g., accountant) and volunteer/part-time/contract positions (e.g., recreation director).

Conclusion

In conclusion, this is a great dataset. I like that San Francisco is being transparent about how much people are getting paid. This is a great way to keep track of pay increases over time to make sure that such increases are reasonable and make sense according to the type of work being done.

3: Age and Gender and Income

Load and Clean Data

The data comes from here: http://www.census.gov/population/age/data/files/2012/2012gender_table17.csv.

data<-read.csv("http://www.census.gov/population/age/data/files/2012/2012gender_table17.csv", stringsAsFactors = FALSE)

The data comes in very messy and requires significant cleanup.

#remove top 4 rows
data<-data[5:52,]

#remove bottom 8 rows
data<-data[1:40,]

#make row 1 the header
names(data)<-data[1,]

#remove rows 1 and 2
data<-data[3:40,]

#remove percent columns
data[,seq(3,23,2)]<-NULL

#replace "-" with 0, according to legend
for(i in 1:ncol(data)) {
  data[which(data[,i]=="-"),i]<-0
}

#break into 3 tables
both<-data[1:12,]
male<-data[14:25,]
female<-data[27:38,]

#add Gender columns to male and female tables.  Ignore both as it is no longer needed.
male$Gender<-"Male"
female$Gender<-"Female"

#remove row 1 from both male and female tables
male<-male[2:12,]
female<-female[2:12,]

#combine male and female tables
combined<-bind_rows(male,female)

#give column one the name "Age"
names(combined)[1]<-"Age"

#reorder the columns
combined<-combined[,c(1,13,3:12)]

#view data
head(combined)

##                  Age Gender Under $5,000   $5,000 to $9,999
## 1    .15 to 17 years   Male           10                  3
## 2    .18 to 24 years   Male           66                122
## 3    .25 to 29 years   Male           23                 77
## 4    .30 to 34 years   Male           55                 55
## 5    .35 to 39 years   Male           34                 30
## 6    .40 to 44 years   Male           42                 28
##     $10,000 to $14,999   $15,000 to $19,999   $20,000 to $24,999
## 1                   20                    4                    7
## 2                  394                  617                  707
## 3                  193                  471                  671
## 4                  162                  332                  509
## 5                  155                  263                  467
## 6                  126                  244                  388
##     $25,000 to $34,999   $35,000 to $49,999   $50,000 to $74,999
## 1                    9                    8                    3
## 2                  892                  572                  180
## 3                1,350                1,526                1,222
## 4                1,080                1,605                1,655
## 5                  873                1,453                1,638
## 6                  998                1,506                1,743
##     $75,000 to $99,999 $100,000 and over
## 1                    0                 0
## 2                   48                52
## 3                  407               283
## 4                  688               664
## 5                  810             1,080
## 6                  879             1,340

Analysis

The analysis asks us to compare income between same age group male and female. So we need to do some tidying by spreading andgathering the data.

combined_spread<-combined %>%
  gather(Income, Counts,3:12) %>%
  select(Age,Gender,Income,Counts) %>%
  spread(Gender,Counts)

head(combined_spread)

##                  Age               Income Female Male
## 1    .15 to 17 years   $10,000 to $14,999     12   20
## 2    .15 to 17 years   $15,000 to $19,999     10    4
## 3    .15 to 17 years   $20,000 to $24,999      6    7
## 4    .15 to 17 years   $25,000 to $34,999     12    9
## 5    .15 to 17 years   $35,000 to $49,999      5    8
## 6    .15 to 17 years     $5,000 to $9,999      6    3

There are several routes we can now take to compare income based on age and gender. I will first start by turning the counts into percentages of the age and gender.

#remove commas in Male and Female
combined_spread$Male<-str_replace(combined_spread$Male,",","")
combined_spread$Female<-str_replace(combined_spread$Female,",","")

#cast as numeric
combined_spread$Female<-as.numeric(combined_spread$Female)
combined_spread$Male<-as.numeric(combined_spread$Male)

#get sum of each age category by gender
female_sum<-combined_spread %>%
  group_by(Age) %>%
  summarise(Female_Sum=sum(as.numeric(Female, na.rm=TRUE)))

male_sum<-combined_spread %>%
  group_by(Age) %>%
  summarise(Male_Sum=sum(as.numeric(Male, na.rm=TRUE)))

#join each to the combined spread
combined_spread<-left_join(combined_spread,female_sum, by="Age")
combined_spread<-left_join(combined_spread,male_sum, by="Age")

#get percentages
combined_spread <- combined_spread %>%
  mutate(Female_Perc = Female / Female_Sum) %>%
  mutate(Male_Perc = Male / Male_Sum)

As a first stab at the problem, if there is no difference between males and females when it comes to income, then in each age and income combination, there shouldn’t be any significant difference between the percentages of males and females in that age and income combination. Let’s subtract the female percentage from the male percentage for each age and income combination and see if there are any large differences.

#subtract female percentage from male percentage
combined_spread$Perc_Diff<-combined_spread$Male_Perc - combined_spread$Female_Per

#order the outcome by Perc_Diff
combined_spread<-arrange(combined_spread,Perc_Diff)

#see head
head(select(combined_spread,Age,Income,Perc_Diff))

##                     Age               Income   Perc_Diff
## 1       .15 to 17 years   $15,000 to $19,999 -0.11607143
## 2       .50 to 54 years   $25,000 to $34,999 -0.07594152
## 3       .15 to 17 years   $25,000 to $34,999 -0.07366071
## 4    .65 years and over   $35,000 to $49,999 -0.06666104
## 5       .35 to 39 years   $25,000 to $34,999 -0.06629518
## 6       .60 to 64 years   $25,000 to $34,999 -0.06582208

#see tail
tail(select(combined_spread,Age,Income,Perc_Diff))

##                       Age            Income Perc_Diff
## 105       .55 to 59 years $100,000 and over 0.1176377
## 106       .45 to 49 years $100,000 and over 0.1222060
## 107       .60 to 64 years $100,000 and over 0.1338594
## 108       .15 to 17 years      Under $5,000 0.1383929
## 109       .50 to 54 years $100,000 and over 0.1387611
## 110    .65 years and over $100,000 and over 0.1495640

If we look at the incomes in which females are more likely than males as a percentage, the incomes are in the lower to middle brackets: $15,000 to $19,999, $25,000 to $34,999, and $35,000 to $49,999. If we look at the incomes in which males are more likely than females as a percentage, the incomes are all $100,000 and over (excepting the under $5,000 category for 15 - 17 year olds). So it does seem that there is correlation between higher wages and being male, or conversely, between lower wages and being female.

We really want to treat Age as a numeric value, not as a category. Similarly with income. As another approach to comparing males and females by age and income, let’s convert the age and income categories to numeric values by using the middle value in each. We thereby explicitly make the assumption that the true mean of each category is the middle value, even though this is almost certainly not true. However, we can’t figure out what the true mean is for each category given the data we have, and this approach should be true enough for our purposes.

#get middle age value
age_column<-c()
for(i in 1: length(combined_spread$Age)){
  age_vector<-as.numeric(str_extract_all(combined_spread$Age,"[0-9]{2}")[[i]])
  age_mean<-mean(age_vector, na.rm = TRUE)
  age_column<-c(age_column,age_mean)
}

#add to combined_spread
combined_spread$Age_Mean<-age_column


#get middle income value
income_column<-c()
for(i in 1: length(combined_spread$Income)){
  income_vector<-as.numeric(str_extract_all(str_replace_all(combined_spread$Income,",",""),"[0-9]{2,}")[[i]])
  income_mean<-mean(income_vector, na.rm = TRUE)
  income_column<-c(income_column,income_mean)
}

#add to combined_spread
combined_spread$Income_Mean<-income_column

Now we can multiply the Income_Mean by the percentages of male and female for each Age_Mean, and then we can group by the Age_Mean and sum up the Income_Mean to get the average income for males and females for each Age_Mean.

#get the proportional income for the age bracket for males and females and join together
combined_avg<-combined_spread %>%
  mutate(Prop_Male_Income = Male_Perc * Income_Mean) %>%
  mutate(Prop_Female_Income = Female_Perc * Income_Mean) %>%
  select(Age_Mean,Prop_Male_Income,Prop_Female_Income)
  
male_combined_avg<-combined_avg %>%
  group_by(Age_Mean) %>%
  summarise(Male_AVG_Income = sum(Prop_Male_Income))
  
female_combined_avg <-combined_avg %>%
  group_by(Age_Mean) %>%
  summarise(Female_AVG_Income = sum(Prop_Female_Income))
  
combined_avg_both<-left_join(male_combined_avg,female_combined_avg, by="Age_Mean")

#calculate difference
combined_avg_both<-mutate(combined_avg_both,Difference = Male_AVG_Income - Female_AVG_Income)

#display
combined_avg_both

## # A tibble: 11 × 4
##    Age_Mean Male_AVG_Income Female_AVG_Income Difference
##       <dbl>           <dbl>             <dbl>      <dbl>
## 1        16        21054.27          25133.46  -4079.190
## 2        21        28655.68          25618.36   3037.324
## 3        27        43722.41          38858.25   4864.153
## 4        32        51524.17          43881.20   7642.966
## 5        37        56832.60          46791.15  10041.443
## 6        42        58786.27          46840.97  11945.294
## 7        47        59496.62          46298.17  13198.445
## 8        52        61907.67          48138.32  13769.347
## 9        57        61221.53          48847.76  12373.765
## 10       62        60319.71          46683.59  13636.123
## 11       65        56513.49          42728.39  13785.101

We can see that apart from the first Age_Mean of 16, males always have higher incomes than females. Furthermore, apart from one Age_Mean, the difference is always increasing.

We can plot this information to see it visually.

plot(combined_avg_both$Age_Mean,combined_avg_both$Male_AVG_Income, col="blue", type="l", lwd=3, ylim=c(-5000,70000), main="Male Income vs. Female Income by Age", xlab = "Age", ylab="Income")
lines(combined_avg_both$Age_Mean,combined_avg_both$Female_AVG_Income, col="pink", type="l", lwd=3)
lines(combined_avg_both$Age_Mean,combined_avg_both$Difference, col="red", type="l", lwd=3)
legend("topleft",
       inset=.05,
       cex = .5,
       title="Legend",
       c("Male","Female","Difference"),
       horiz=FALSE,
       lty=c(1,1),
       lwd=c(1,1),
       col=c("blue","pink","red"),
       bg="grey96")

Using this, it appears that males and females are roughly the same until the late 20s. At that point, males increase in income at a faster rate than females do. This continues over the 30s, but slows down and mostly stops in the 40s. By the mid 40s, male and female incomes move in parallel, maintaining roughly the same difference in wage through the 50s and 60s.

Conclusion

In conclusion, while much more could be said here, the two analyses suggest that males do have a higher income than females. And in the highest income brackets, the disproportion between males and females is the highest. Why is that?

It appears that much of the difference can be explained by whatever is happening from the late 20s through the 30s. As a hypothesis, these are years when women typically have children and many of these women could have taken a break from the career to have and raise children before returning to work full time. As these women would be competing against male colleagues that didn’t take a break from working, it may not surprise us that these colleagues continued to receive increases in income while their female counterparts did not.

If we are NOT assuming that women are paid less due to sex discrimination, the above explanation seems like a reasonable possibility. More data and analysis would be necessary to determine whether it is true or not. We could, for example, compare the incomes of women that took time off to raise children to those that didn’t to see if there is a difference, and if those that didn’t take time off kept pace with males of the same age. Unfortunately, the data here is not detailed enough to permit such an analysis.

Project 2: Tidying, Transforming, and Analyzing

Andrew Carson

October 3, 2016

Task:

Solution:

1: Football Statistics

Load and Clean the Data

Analysis

Conclusion

2: San Francisco Salaries Analysis

Load and Clean the Data

Analysis

Conclusion

3: Age and Gender and Income

Load and Clean Data

Analysis

Conclusion