In this workshop, we will do three mini cases studies, which will read different dataformats and tidy them and then visualize them.
NYS Quarterly Census of Employment and Wages.
The Quarterly Census of Employment and Wages (QCEW) program (also known as ES-202) collects employment and wage data from employers covered by New York State’s Unemployment Insurance (UI) Law. This program is a cooperative program with the U.S. Bureau of Labor Statistics. QCEW data encompass approximately 97 percent of New York’s nonfarm employment, providing a virtual census of employees and their wages as well as the most complete universe of employment and wage data, by industry, at the State, regional and county levels.
QCEW data are derived from quarterly tax reports submitted by all employers subject to UI laws. All covered employers are required to submit monthly employment figures representing the number of people either working during or receiving pay for the payroll period including the 12th of the month, and the total wages paid during the quarter. The QCEW program conducts ongoing surveys to verify the location and type of economic activity occurring at each of the more than 500,000 reporting units (establishments) in the state. An employer may operate in a number of different locations.
The original dataset is in csv
format, we will reat it into R firstly
library(tidyverse) # library the main package wee need, which is always tidyverse
path = '/Users/Michael/Library/Mobile Documents/com~apple~CloudDocs/CMSE/Data Analysis/Lectures/L2/nys.csv'
nys <- read_csv(file = path)
# you need change the path based on your file location
nys # print it to take a look
## # A tibble: 184,747 x 9
## `Area Type` Area NAICS `NAICS Title` Year Establishments
## <chr> <chr> <int> <chr> <int> <int>
## 1 State New York… 11 Agriculture, Forestry… 2016 2600
## 2 State New York… 111 Crop Production 2016 1128
## 3 State New York… 112 Animal Production 2016 833
## 4 State New York… 113 Forestry and Logging 2016 196
## 5 State New York… 114 Fishing, Hunting and … 2016 42
## 6 State New York… 115 Agriculture & Forestr… 2016 402
## 7 State New York… 21 Mining 2016 366
## 8 State New York… 211 Oil and Gas Extraction 2016 32
## 9 State New York… 212 Mining (except Oil an… 2016 262
## 10 State New York… 213 Support Activities fo… 2016 72
## # ... with 184,737 more rows, and 3 more variables: `Average
## # Employment` <int>, `Total Wage` <dbl>, `Annual Average Salary` <dbl>
unique(nys$Year) # unique function gives us the unique values
## [1] 2016 2015 2014 2013 2012 2011 2010 2009 2008 2007 2006 2005 2004 2003
## [15] 2002 2001 2000
We are more interested on the distribution of average salary over industries and time. So, we will focus on these two aspects. To simplfy the datasets, let’s rename the variables and select main vairables we are interested on.
names(nys) <- c('Area_type', 'Area', 'Naics', 'N_title',
'Year', 'Estab', 'Average_emply', 'Total_wage', 'An_avg_sal')
nys_sb <- select(nys, Area_type, Area, N_title, Year, Average_emply, Total_wage, An_avg_sal)
nys_sb
## # A tibble: 184,747 x 7
## Area_type Area N_title Year Average_emply Total_wage An_avg_sal
## <chr> <chr> <chr> <int> <int> <dbl> <dbl>
## 1 State New Yo… Agricultur… 2016 25849 872749893 33763
## 2 State New Yo… Crop Produ… 2016 12379 377915334 30529
## 3 State New Yo… Animal Pro… 2016 10204 366444605 35912
## 4 State New Yo… Forestry a… 2016 644 24957438 38754
## 5 State New Yo… Fishing, H… 2016 103 3515772 34134
## 6 State New Yo… Agricultur… 2016 2519 99916744 39665
## 7 State New Yo… Mining 2016 4409 281253726 63791
## 8 State New Yo… Oil and Ga… 2016 326 22490945 68991
## 9 State New Yo… Mining (ex… 2016 3749 233708577 62339
## 10 State New Yo… Support Ac… 2016 333 25054204 75238
## # ... with 184,737 more rows
Okay, now, let’s focus on 2016, and find the top 10 industries which have the highest annual average salary and then dig into the different areas.
nys_sb %>% #this is pipe function
filter(Year == 2016) %>%
arrange(desc(An_avg_sal)) %>%
top_n(10)
## # A tibble: 11 x 7
## Area_type Area N_title Year Average_emply Total_wage An_avg_sal
## <chr> <chr> <chr> <int> <int> <dbl> <dbl>
## 1 County Suffol… Financia… 2016 2989 1.72e 9 574559
## 2 Workforce … Suffolk Financia… 2016 2989 1.72e 9 574559
## 3 County New Yo… Financia… 2016 169768 6.45e10 380045
## 4 Labor Mark… New Yo… Financia… 2016 172923 6.49e10 375279
## 5 Workforce … New Yo… Financia… 2016 172923 6.49e10 375279
## 6 Metropolit… New Yo… Financia… 2016 172923 6.49e10 375279
## 7 State New Yo… Financia… 2016 194682 6.99e10 359113
## 8 Labor Mark… Long I… Financia… 2016 6468 2.29e 9 354420
## 9 Metropolit… Nassau… Financia… 2016 6468 2.29e 9 354420
## 10 County Monroe… Funds, T… 2016 35 1.23e 7 352557
## 11 Workforce … Monroe Funds, T… 2016 35 1.23e 7 352557
To give a better present, we use package DT
, which will gives us interactive tables in html.
library(DT)
nys_sb %>% #this is pipe function
filter(Year == 2016) %>%
arrange(desc(An_avg_sal)) %>%
top_n(10) %>%
datatable()
You see, it looks much better. And it should not be surprised that the Financial Investment & Related Activity has the highest salary. Now, if you want to rank average employment, you can just click it.
Group Data and Visulize it
As we have 184,747 observations, and the dataset also includes so many dimensions, it’s better we cluster the data into one dimension, such as the trend in the time.
nys_sb %>%
filter(N_title == 'Financial Investment & Related Activity') %>%
group_by(Year)
## # A tibble: 1,717 x 7
## # Groups: Year [17]
## Area_type Area N_title Year Average_emply Total_wage An_avg_sal
## <chr> <chr> <chr> <int> <int> <dbl> <dbl>
## 1 State New Yo… Financial … 2016 194682 6.99e10 359113
## 2 County Albany… Financial … 2016 843 1.21e 8 143620
## 3 County Bronx … Financial … 2016 37 3.46e 6 93450
## 4 County Broome… Financial … 2016 164 1.75e 7 106820
## 5 County Cattar… Financial … 2016 20 3.94e 6 197028
## 6 County Chauta… Financial … 2016 69 8.31e 6 120468
## 7 County Chemun… Financial … 2016 153 3.11e 7 203253
## 8 County Chenan… Financial … 2016 21 2.76e 6 131638
## 9 County Clinto… Financial … 2016 37 2.81e 6 75898
## 10 County Columb… Financial … 2016 6 5.68e 5 94680
## # ... with 1,707 more rows
You can see that we now only have 1717 observations as we filter the data by focusing on the industry - Financial Investment & Related Activity. You can see there are many observations in the same year. So we will summarize the data through the year.
nys_sb %>%
filter(N_title == 'Financial Investment & Related Activity') %>%
group_by(Year) %>%
summarise(avg = mean(An_avg_sal))
## # A tibble: 17 x 2
## Year avg
## <int> <dbl>
## 1 2000 92264.
## 2 2001 82260.
## 3 2002 89120.
## 4 2003 86334.
## 5 2004 94269.
## 6 2005 92782.
## 7 2006 112025.
## 8 2007 123215.
## 9 2008 127490.
## 10 2009 114514.
## 11 2010 121183.
## 12 2011 130502.
## 13 2012 136947.
## 14 2013 142100
## 15 2014 150483.
## 16 2015 152613.
## 17 2016 151939.
Next step, we will visualize the trend:
nys_sb %>%
filter(N_title == 'Financial Investment & Related Activity') %>%
group_by(Year) %>%
summarise(avg = mean(An_avg_sal)) %>%
ggplot(aes(Year, avg)) + geom_bar(stat='identity')
You can change the color by giving more options
nys_sb %>%
filter(N_title == 'Financial Investment & Related Activity') %>%
group_by(Year) %>%
summarise(avg = round(mean(An_avg_sal))) %>%
ggplot(aes(Year, avg)) + geom_bar(stat='identity', fill= '#75A8D7') +
theme_classic() + geom_text(aes(label=avg),position=position_dodge(width=0.9),
vjust=-0.25) -> f1
f1
We can also make this graph become interactve by using package plotly
.
library(plotly)
nys_sb %>%
filter(N_title == 'Financial Investment & Related Activity') %>%
group_by(Year) %>%
summarise(avg = round(mean(An_avg_sal))) %>%
ggplot(aes(Year, avg)) + geom_bar(stat='identity', fill= '#75A8D7') +
theme_classic() -> f2
f2 <- ggplotly(f2)
f2
Baltic Dry Index (BDI) and Oil Price
The Baltic Dry Index (BDI), is issued daily by the London-based Baltic Exchange. The BDI is a composite of the Capesize, Panamax and Supramax Timecharter Averages. It is reported around the world as a proxy for dry bulk shipping stocks as well as a general shipping market bellwether. The reason the index is watched so closely by many economists and policymakers is that it is an excellent leading indicator for economic activity. The BDI tracks international carriers hauling raw materials (e.g., wood for Finnish paper mills) as opposed to container ships carrying finished goods (e.g., paper for the U.S. Treasury’s busy printing presses). Thus, the precursors to production, rather than the results of production, are accounted for in the index, giving it a predictive quality. Additionally, the market for these cargo ships tends to be tight and inelastic.
Now, we want to invesigate the relationship between BDI and Oil price. However, we have to collect the datasets firstly. Most time, we collect datasets seperately from different websites. In this case, we have two datasets: BDI.csv
and hkstock.csv
. We want to merge them together. The problem is that the date variable in BDI.csv
is string, so we need conver it into Dates format.
Even though we can use code gen Date = date(date, "YMD")
to covert string into the date format, it’s difficult to merger datasets by matching the date. However, R can do it easily.
bdi <- read_csv(file = "/Users/Michael/Library/Mobile Documents/com~apple~CloudDocs/CMSE/Data Analysis/Lectures/L2/BDI.csv")
# make sure your path or file location is right in your laptop.
bdi # R automatically conver the string into date format
## # A tibble: 1,236 x 2
## Date Index
## <date> <dbl>
## 1 2014-04-01 1316
## 2 2014-03-31 1362
## 3 2014-03-28 1373
## 4 2014-03-27 1412
## 5 2014-03-26 1496
## 6 2014-03-24 1602
## 7 2014-03-21 1599
## 8 2014-03-20 1621
## 9 2014-03-19 1570
## 10 2014-03-18 1518
## # ... with 1,226 more rows
tail(bdi)
## # A tibble: 6 x 2
## Date Index
## <date> <dbl>
## 1 2009-04-08 1463
## 2 2009-04-07 1466
## 3 2009-04-06 1486
## 4 2009-04-03 1506
## 5 2009-04-02 1538
## 6 2009-04-01 1574
hk <- read_csv(file = "/Users/Michael/Library/Mobile Documents/com~apple~CloudDocs/CMSE/Data Analysis/Lectures/L2/hkstock.csv")
hk
## # A tibble: 1,237 x 7
## Date Open High Low Close `Adj Close` Volume
## <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2009-04-01 4.81 5.21 4.81 5.1 4.79 51195944
## 2 2009-04-02 4.95 5.72 4.95 5.67 5.32 128183352
## 3 2009-04-03 5.61 6.05 5.61 5.81 5.45 155467992
## 4 2009-04-06 6.08 6.36 6.02 6.19 5.81 110922868
## 5 2009-04-07 6.43 6.43 5.87 5.92 5.56 77399528
## 6 2009-04-08 5.74 5.74 5.39 5.54 5.20 93684174
## 7 2009-04-09 5.66 5.85 5.66 5.84 5.48 56597038
## 8 2009-04-14 5.84 6.87 5.84 6.79 6.37 148947340
## 9 2009-04-15 6.5 7.46 6.35 7.4 6.94 196294748
## 10 2009-04-16 7.63 7.69 6.97 7.11 6.67 182892512
## # ... with 1,227 more rows
tail(hk)
## # A tibble: 6 x 7
## Date Open High Low Close `Adj Close` Volume
## <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2014-03-25 3.26 3.3 3.24 3.28 3.28 9257970
## 2 2014-03-26 3.3 3.3 3.24 3.29 3.29 7554650
## 3 2014-03-27 3.29 3.3 3.21 3.24 3.24 9542895
## 4 2014-03-28 3.25 3.28 3.05 3.16 3.16 30373690
## 5 2014-03-31 3.19 3.24 3.16 3.21 3.21 10002000
## 6 2014-04-01 3.23 3.26 3.18 3.25 3.25 8993074
Now, we will just merge the datasets by mathcing colomuns Date
. When you merger the datasets in R, you have to make sure the mathed columns should have same variable name.
hd_bdi <- inner_join(hk, bdi, by = "Date")
hd_bdi
## # A tibble: 1,190 x 8
## Date Open High Low Close `Adj Close` Volume Index
## <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2009-04-01 4.81 5.21 4.81 5.1 4.79 51195944 1574
## 2 2009-04-02 4.95 5.72 4.95 5.67 5.32 128183352 1538
## 3 2009-04-03 5.61 6.05 5.61 5.81 5.45 155467992 1506
## 4 2009-04-06 6.08 6.36 6.02 6.19 5.81 110922868 1486
## 5 2009-04-07 6.43 6.43 5.87 5.92 5.56 77399528 1466
## 6 2009-04-08 5.74 5.74 5.39 5.54 5.20 93684174 1463
## 7 2009-04-09 5.66 5.85 5.66 5.84 5.48 56597038 1478
## 8 2009-04-14 5.84 6.87 5.84 6.79 6.37 148947340 1492
## 9 2009-04-15 6.5 7.46 6.35 7.4 6.94 196294748 1534
## 10 2009-04-16 7.63 7.69 6.97 7.11 6.67 182892512 1604
## # ... with 1,180 more rows
tail(hd_bdi)
## # A tibble: 6 x 8
## Date Open High Low Close `Adj Close` Volume Index
## <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2014-03-24 3.26 3.33 3.26 3.27 3.27 5962558 1602
## 2 2014-03-26 3.3 3.3 3.24 3.29 3.29 7554650 1496
## 3 2014-03-27 3.29 3.3 3.21 3.24 3.24 9542895 1412
## 4 2014-03-28 3.25 3.28 3.05 3.16 3.16 30373690 1373
## 5 2014-03-31 3.19 3.24 3.16 3.21 3.21 10002000 1362
## 6 2014-04-01 3.23 3.26 3.18 3.25 3.25 8993074 1316
Done, now we have the merged data by matching the date. You can write it as Stata format or csv format by using function write_
.
Wage discrimination in UK
Imagine your boss ask you to analze the current wage discrimination in terms of gender in UK as she/he heard that the official statical of UK just published a dataset of survey in national level. The dataset is bhps_assignment.dta
, which is 14.3MB. You don’t want to open this dataset and explain to your boss by saying ‘look at here,…’. You just want to give her/him a clean dataset to check or present one graph which explains almost ‘everything’. This is we will do in next part.
Now, we will do another datasets for examing the trend of wage discrimination in UK. Let’s just read the datasets to take a look. The main variables we are intrested are: paygu
, which is the usual gross pay per month by the current job; qfedhi
, which is the highest educational qualification; jbft
, which is a dummy variable of full-time employed.
library(readstata13)
wd <- read.dta13(file = '/Users/Michael/Library/Mobile Documents/com~apple~CloudDocs/CMSE/Data Analysis/Lectures/L2/bhps_assignment.dta')
wd <- tbl_df(wd)
wd
## # A tibble: 159,040 x 63
## pid year sex age mlstat plbornc race yr2uk4 paygu cjsten jbhas
## * <int> <dbl> <int> <int> <int> <int> <fct> <int> <dbl> <int> <fct>
## 1 1.00e7 1991 2 91 5 -8 "whi… -8 -8 11208 "no …
## 2 1.00e7 1991 1 28 5 51 "bla… 1991 -8 251 "no …
## 3 1.00e7 1992 1 29 5 -8 "bla… -8 -8 591 "no …
## 4 1.00e7 1991 1 26 5 51 "bla… 1991 -8 244 "no …
## 5 1.00e7 1992 1 27 5 -8 "bla… -8 -8 591 "no …
## 6 1.00e7 1993 1 28 5 -8 "bla… -8 -8 690 "no …
## 7 1.00e7 1991 2 57 4 27 "oth… 1956 780 530 "yes…
## 8 1.00e7 1992 2 59 4 -8 "oth… -8 612. 926 "yes…
## 9 1.00e7 1993 2 59 4 -8 "oth… -8 608. 1277 "yes…
## 10 1.00e7 1998 2 64 -9 -8 "oth… -8 -8 1633 "no …
## # ... with 159,030 more rows, and 52 more variables: jboff <fct>,
## # jboffy <int>, jbsoc <int>, jbsic <int>, jbsemp <fct>, jbsize <fct>,
## # jbhrs <int>, jbed <fct>, jbbgd <int>, jbbgm <int>, jbstat <fct>,
## # jbft <fct>, jbbgy4 <int>, jbsic92 <int>, tujbpl <fct>, tuin1 <fct>,
## # school <int>, scend <int>, sctype <int>, scnow <int>, fetype <int>,
## # fenow <int>, feend <int>, qfedhi <fct>, qfvoc <fct>, qfachi <fct>,
## # doid <int>, doim <int>, paju <int>, pasoc <int>, pasemp <fct>,
## # paboss <fct>, pamngr <fct>, maju <int>, masoc <int>, masemp <fct>,
## # maboss <fct>, mamngr <fct>, smoker <int>, ncigs <int>, rach16 <fct>,
## # hunurs <int>, region <int>, nchild <int>, ivlyr <fct>, doiy4 <int>,
## # ivievr <fct>, ivstat2 <int>, pasoc00 <int>, masoc00 <int>,
## # smcigs <fct>, racel <fct>
We have been told that \(-9\) means the data is missing value for that observation in this dataset. From the printed table, and summary of vairables, we see many variables have factors. Factors are the categories which have different levels.
levels(wd$jbft) # it gives all the levels for employment
## [1] "missing or wild" "inapplicable "
## [3] "proxy respondent " "full time: 30 hrs + "
## [5] "part time: lt 30 hrs "
levels(wd$jbft)[4] # we will focus on the full time one
## [1] "full time: 30 hrs + "
wage_ft <- wd %>%
filter(jbft == levels(wd$jbft)[4]) %>% # filter the full-time ones
group_by(year, sex) %>% # group by the year and sex
summarise(
no_ft = n(), # count how many observations they have
wage_median = median(paygu, na.rm = TRUE) # median wage
) %>%
filter(sex == 1 | sex == 2) # add this one as I foud there is -7 there
wage_ft
## # A tibble: 34 x 4
## # Groups: year [17]
## year sex no_ft wage_median
## <dbl> <int> <int> <dbl>
## 1 1991 1 2847 997.
## 2 1991 2 1672 729.
## 3 1992 1 2725 1005.
## 4 1992 2 1621 786.
## 5 1993 1 2597 1001.
## 6 1993 2 1629 813.
## 7 1994 1 2622 1043.
## 8 1994 2 1631 841.
## 9 1995 1 2586 1083.
## 10 1995 2 1604 896.
## # ... with 24 more rows
wage_pt <- wd %>%
filter(jbft == levels(wd$jbft)[5]) %>% # filter the part-time ones
group_by(year, sex) %>%
summarise(
no_pt = n(),
wage_median = median(paygu, na.rm = TRUE)
) %>%
filter(sex == 1 | sex == 2)
wage_pt
## # A tibble: 34 x 4
## # Groups: year [17]
## year sex no_pt wage_median
## <dbl> <int> <int> <dbl>
## 1 1991 1 225 117
## 2 1991 2 1081 217.
## 3 1992 1 218 173.
## 4 1992 2 1047 225.
## 5 1993 1 227 152.
## 6 1993 2 1028 230.
## 7 1994 1 231 152.
## 8 1994 2 1024 234
## 9 1995 1 218 195.
## 10 1995 2 1026 256.
## # ... with 24 more rows
count_jbps <- wd %>% group_by(year, sex) %>% summarise(no = n()) %>%
filter(sex == 1 | sex == 2) # just count the
#observations to calcuate the percentage later.
count_jbps
## # A tibble: 34 x 3
## # Groups: year [17]
## year sex no
## <dbl> <int> <int>
## 1 1991 1 4833
## 2 1991 2 5431
## 3 1992 1 4630
## 4 1992 2 5215
## 5 1993 1 4476
## 6 1993 2 5124
## 7 1994 1 4440
## 8 1994 2 5041
## 9 1995 1 4320
## 10 1995 2 4929
## # ... with 24 more rows
wd_jbps <- bind_cols(count_jbps, wage_ft, wage_pt) # combine them together and
#assign it to a new dataset
wd_jbps
## # A tibble: 34 x 11
## # Groups: year [17]
## year sex no year1 sex1 no_ft wage_median year2 sex2 no_pt
## <dbl> <int> <int> <dbl> <int> <int> <dbl> <dbl> <int> <int>
## 1 1991 1 4833 1991 1 2847 997. 1991 1 225
## 2 1991 2 5431 1991 2 1672 729. 1991 2 1081
## 3 1992 1 4630 1992 1 2725 1005. 1992 1 218
## 4 1992 2 5215 1992 2 1621 786. 1992 2 1047
## 5 1993 1 4476 1993 1 2597 1001. 1993 1 227
## 6 1993 2 5124 1993 2 1629 813. 1993 2 1028
## 7 1994 1 4440 1994 1 2622 1043. 1994 1 231
## 8 1994 2 5041 1994 2 1631 841. 1994 2 1024
## 9 1995 1 4320 1995 1 2586 1083. 1995 1 218
## 10 1995 2 4929 1995 2 1604 896. 1995 2 1026
## # ... with 24 more rows, and 1 more variable: wage_median1 <dbl>
wd_jbps_sb <- wd_jbps %>%
select(year, sex, no, no_ft, no_pt, wage_median, wage_median1) %>% # select
# main variables we are interested
mutate(perct_ft = round(no_ft/no, 3)) # add a new column
wd_jbps_sb # this one gives you a clean data which focus on the employment and
## # A tibble: 34 x 8
## # Groups: year [17]
## year sex no no_ft no_pt wage_median wage_median1 perct_ft
## <dbl> <int> <int> <int> <int> <dbl> <dbl> <dbl>
## 1 1991 1 4833 2847 225 997. 117 0.589
## 2 1991 2 5431 1672 1081 729. 217. 0.308
## 3 1992 1 4630 2725 218 1005. 173. 0.589
## 4 1992 2 5215 1621 1047 786. 225. 0.311
## 5 1993 1 4476 2597 227 1001. 152. 0.580
## 6 1993 2 5124 1629 1028 813. 230. 0.318
## 7 1994 1 4440 2622 231 1043. 152. 0.591
## 8 1994 2 5041 1631 1024 841. 234 0.324
## 9 1995 1 4320 2586 218 1083. 195. 0.599
## 10 1995 2 4929 1604 1026 896. 256. 0.325
## # ... with 24 more rows
# median wage level
The above data gives us the comparison of median wage betweeb male and female. Now, let’s do data visualization to get the same results.
wd %>%
filter(jbft == levels(wd$jbft)[4]) %>% # again filter the datasets
ggplot(aes(x = reorder(year, paygu), y = paygu)) +
geom_boxplot(aes(fill = sex), outlier.shape = NA) +
scale_y_continuous(limits = c(0, 5000)) +
labs(x = "Year", y = "Gross pay per month",
title = "Boxplot of Gross payment per month for male and female (Fulltime), 1991-2007")
wd %>%
filter(jbft == levels(wd$jbft)[5]) %>%
ggplot(aes(x = reorder(year, paygu), y = paygu)) +
geom_boxplot(aes(fill = sex), outlier.shape = NA) +
scale_y_continuous(limits = c(0, 2000)) +
labs(x = "Year", y = "Gross pay per month",
title = "Boxplot of Gross payment per month for male and female (Parttime), 1991-2007")
Right now, the function aes(fill= )
does not work well as the type of variable sex
is interger. We need generate a new variable, in which the type is factor. Also, the size of title is too big for the presentation here.
wd %>%
mutate(sex1 = if_else(sex == 1, 'Male', 'Female')) %>%
filter(jbft == levels(wd$jbft)[4]) %>% # again filter the datasets
ggplot(aes(x = reorder(year, paygu), y = paygu)) +
geom_boxplot(aes(fill = sex1), outlier.shape = NA) +
scale_y_continuous(limits = c(0, 5000)) +
labs(x = "Year", y = "Gross pay per month",
title = "Boxplot of Gross payment per month for male and female (Fulltime), 1991-2007")
wd %>%
mutate(sex1 = if_else(sex == 1, 'Male', 'Female')) %>%
filter(jbft == levels(wd$jbft)[5]) %>%
ggplot(aes(x = reorder(year, paygu), y = paygu)) +
geom_boxplot(aes(fill = sex1), outlier.shape = NA) +
scale_y_continuous(limits = c(0, 2000)) +
labs(x = "Year", y = "Gross pay per month",
title = "Boxplot of Gross payment per month for male and female (Part-time), 1991-2007") +
theme(plot.title = element_text(size = 12))
Now, we want to check whether this boxplot will change or not if we focus on people who have the degrees.
levels(wd$qfedhi)
## [1] "missing" "proxy "
## [3] "proxy respondent " "higher degree "
## [5] "first degree " "teaching qf "
## [7] "other higher qf" "nursing qf "
## [9] "gce a levels " "gce o levels or equiv "
## [11] "commercial qf, no o levels " "cse grade 2-5,scot grade 4-5 "
## [13] "apprenticeship " "other qf "
## [15] "no qf " "still at school no qf "
levels(wd$qfachi)
## [1] "missing" "proxy "
## [3] "proxy respondent " "higher degree "
## [5] "1st degree " "hnd,hnc,teaching "
## [7] "a level" "o level"
## [9] "cse " "none of these "
levels(wd$qfachi)[4] # let's use qfachi one to filter
## [1] "higher degree "
wd %>%
filter(qfachi == levels(wd$qfachi)[4]) %>%
filter(sex == 1 | sex == 2) %>%
mutate(sex1 = if_else(sex == 1, 'Male', 'Female')) %>%
filter(jbft == levels(wd$jbft)[4]) %>%
filter(paygu >=0 ) %>%
ggplot(aes(x = reorder(year, paygu), y = paygu)) +
geom_boxplot(aes(fill = sex1), outlier.shape = NA) +
scale_y_continuous(limits = c(0, 2000)) +
labs(x = "Year", y = "Gross pay per month",
title = "Boxplot of Gross payment per month for male and female with higher degree (Fulltime), 1991-2007") +
theme(plot.title = element_text(size = 10))
It’s very interesting, right? It seems there is no wage discrimination for people who having the higher degree. Now, we want to know which education categories have the highest wage gap for different genders.
wd %>%
filter(jbft == levels(wd$jbft)[4]) %>%
group_by(year, sex, qfachi) %>%
summarise(wage_median = median(paygu, na.rm = TRUE)) %>%
spread(sex, wage_median) %>%
select(-'-7') -> wd_qfachi # delete the -7
wd_qfachi
## # A tibble: 152 x 4
## # Groups: year [17]
## year qfachi `1` `2`
## <dbl> <fct> <dbl> <dbl>
## 1 1991 missing 606. 152.
## 2 1991 "higher degree " 1872. 1601.
## 3 1991 "1st degree " 1501. 1273.
## 4 1991 "hnd,hnc,teaching " 1354. 1181.
## 5 1991 a level 1075. 793.
## 6 1991 o level 953. 701.
## 7 1991 "cse " 816. 567.
## 8 1991 "none of these " 849. 604.
## 9 1992 missing 1105 1495.
## 10 1992 "proxy respondent " -7 -7
## # ... with 142 more rows
names(wd_qfachi) <- c('year', 'qualification', 'male', 'female')
wd_qfachi %>%
mutate(diff = male - female) %>%
ggplot(aes(x = year, y = diff)) +
geom_bar(aes(fill = qualification), stat = 'identity', position="dodge")
# let's just print out the table to take a look
wd_qfachi %>%
mutate(diff = male - female) %>%
datatable()
It’s quite strage as we have contradicted resutls, let’s remove the NA
. We need check the data again. Now, let’s do the double check. It should be -7
issue.
wd %>%
filter(qfachi == levels(wd$qfachi)[4]) %>%
filter(sex == 1 | sex == 2) %>%
mutate(sex1 = if_else(sex == 1, 'Male', 'Female')) %>%
filter(jbft == levels(wd$jbft)[4]) %>%
filter(paygu >=0) %>%
group_by(year, sex) %>%
summarise(wage_median = median(paygu, na.rm = TRUE)) %>%
spread(sex, wage_median) -> wd_qfachi2
names(wd_qfachi2) <- c('year', 'male','female')
wd_qfachi2 %>%
mutate(diff = male - female) %>%
datatable()
Okay, the problem is about the limit
wd %>%
filter(qfachi == levels(wd$qfachi)[4]) %>%
filter(sex == 1 | sex == 2) %>%
mutate(sex1 = if_else(sex == 1, 'Male', 'Female')) %>%
filter(jbft == levels(wd$jbft)[4]) %>%
filter(paygu >=0 ) %>%
ggplot(aes(x = reorder(year, paygu), y = paygu)) +
geom_boxplot(aes(fill = sex1), outlier.shape = NA) +
scale_y_continuous(limits = c(0, 7000)) +
labs(x = "Year", y = "Gross pay per month",
title = "Boxplot of Gross payment per month for male and female with higher degree (Fulltime), 1991-2007") +
theme(plot.title = element_text(size = 10))
Be careful when you copy the code, especially for the limiting setting.