In this workshop, we will work through three mini case studies; in each one we will read a different data format, tidy the data, and then visualize it.

NYS Quarterly Census of Employment and Wages.

The Quarterly Census of Employment and Wages (QCEW) program (also known as ES-202) collects employment and wage data from employers covered by New York State’s Unemployment Insurance (UI) Law. This program is a cooperative program with the U.S. Bureau of Labor Statistics. QCEW data encompass approximately 97 percent of New York’s nonfarm employment, providing a virtual census of employees and their wages as well as the most complete universe of employment and wage data, by industry, at the State, regional and county levels.

QCEW data are derived from quarterly tax reports submitted by all employers subject to UI laws. All covered employers are required to submit monthly employment figures representing the number of people either working during or receiving pay for the payroll period including the 12th of the month, and the total wages paid during the quarter. The QCEW program conducts ongoing surveys to verify the location and type of economic activity occurring at each of the more than 500,000 reporting units (establishments) in the state. An employer may operate in a number of different locations.

The original dataset is in CSV format, so we will first read it into R.

library(tidyverse) # load the main package we need, which is always the tidyverse
path = '/Users/Michael/Library/Mobile Documents/com~apple~CloudDocs/CMSE/Data Analysis/Lectures/L2/nys.csv'
nys <- read_csv(file = path) 
# you need to change the path based on your file location
nys # print it to take a look 
## # A tibble: 184,747 x 9
##    `Area Type` Area      NAICS `NAICS Title`           Year Establishments
##    <chr>       <chr>     <int> <chr>                  <int>          <int>
##  1 State       New York…    11 Agriculture, Forestry…  2016           2600
##  2 State       New York…   111 Crop Production         2016           1128
##  3 State       New York…   112 Animal Production       2016            833
##  4 State       New York…   113 Forestry and Logging    2016            196
##  5 State       New York…   114 Fishing, Hunting and …  2016             42
##  6 State       New York…   115 Agriculture & Forestr…  2016            402
##  7 State       New York…    21 Mining                  2016            366
##  8 State       New York…   211 Oil and Gas Extraction  2016             32
##  9 State       New York…   212 Mining (except Oil an…  2016            262
## 10 State       New York…   213 Support Activities fo…  2016             72
## # ... with 184,737 more rows, and 3 more variables: `Average
## #   Employment` <int>, `Total Wage` <dbl>, `Annual Average Salary` <dbl>
unique(nys$Year) # the unique() function gives us the distinct values of a column
##  [1] 2016 2015 2014 2013 2012 2011 2010 2009 2008 2007 2006 2005 2004 2003
## [15] 2002 2001 2000
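
A small aside before we move on: read_csv() guessed the column types for us (they are shown under the column names in the printout above). If you ever want to inspect or override those guesses, readr provides spec() and the col_types argument. A minimal sketch, where forcing Year to integer is purely illustrative (it was already guessed as integer):

spec(nys) # show the column specification that read_csv() guessed
nys <- read_csv(file = path, col_types = cols(Year = col_integer())) # re-read with an explicit type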

We are more interested in the distribution of average salary across industries and over time, so we will focus on these two aspects. To simplify the dataset, let’s rename the variables and select the main variables we are interested in.

names(nys) <- c('Area_type', 'Area', 'Naics', 'N_title', 
        'Year', 'Estab', 'Average_emply', 'Total_wage', 'An_avg_sal')
nys_sb <- select(nys, Area_type, Area, N_title, Year, Average_emply, Total_wage, An_avg_sal)
nys_sb
## # A tibble: 184,747 x 7
##    Area_type Area    N_title      Year Average_emply Total_wage An_avg_sal
##    <chr>     <chr>   <chr>       <int>         <int>      <dbl>      <dbl>
##  1 State     New Yo… Agricultur…  2016         25849  872749893      33763
##  2 State     New Yo… Crop Produ…  2016         12379  377915334      30529
##  3 State     New Yo… Animal Pro…  2016         10204  366444605      35912
##  4 State     New Yo… Forestry a…  2016           644   24957438      38754
##  5 State     New Yo… Fishing, H…  2016           103    3515772      34134
##  6 State     New Yo… Agricultur…  2016          2519   99916744      39665
##  7 State     New Yo… Mining       2016          4409  281253726      63791
##  8 State     New Yo… Oil and Ga…  2016           326   22490945      68991
##  9 State     New Yo… Mining (ex…  2016          3749  233708577      62339
## 10 State     New Yo… Support Ac…  2016           333   25054204      75238
## # ... with 184,737 more rows
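
As a side note, instead of overwriting names() by position, you could rename columns by name with dplyr’s rename(); back-ticks are needed because the original names contain spaces. A sketch of the same idea for two of the columns (an alternative to the names() assignment above, not run here because the columns are already renamed):

# rename(nys, Area_type = `Area Type`, N_title = `NAICS Title`)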

Okay, now let’s focus on 2016, find the top 10 industries with the highest annual average salary, and then dig into the different areas.

nys_sb %>% # %>% is the pipe operator
        filter(Year == 2016) %>%
        arrange(desc(An_avg_sal)) %>%
        top_n(10)
## # A tibble: 11 x 7
##    Area_type   Area    N_title    Year Average_emply Total_wage An_avg_sal
##    <chr>       <chr>   <chr>     <int>         <int>      <dbl>      <dbl>
##  1 County      Suffol… Financia…  2016          2989    1.72e 9     574559
##  2 Workforce … Suffolk Financia…  2016          2989    1.72e 9     574559
##  3 County      New Yo… Financia…  2016        169768    6.45e10     380045
##  4 Labor Mark… New Yo… Financia…  2016        172923    6.49e10     375279
##  5 Workforce … New Yo… Financia…  2016        172923    6.49e10     375279
##  6 Metropolit… New Yo… Financia…  2016        172923    6.49e10     375279
##  7 State       New Yo… Financia…  2016        194682    6.99e10     359113
##  8 Labor Mark… Long I… Financia…  2016          6468    2.29e 9     354420
##  9 Metropolit… Nassau… Financia…  2016          6468    2.29e 9     354420
## 10 County      Monroe… Funds, T…  2016            35    1.23e 7     352557
## 11 Workforce … Monroe  Funds, T…  2016            35    1.23e 7     352557
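
Notice that top_n(10) actually returned 11 rows: by default it keeps ties, and the 10th-highest salary value appears twice (once per area type). It also silently used An_avg_sal as the ranking variable because it is the last column, so it is safer to name the ranking variable yourself. A minimal sketch of both points (slice_max() requires dplyr >= 1.0.0):

nys_sb %>%
        filter(Year == 2016) %>%
        arrange(desc(An_avg_sal)) %>%
        top_n(10, An_avg_sal) # rank explicitly by salary; ties are still kept, so 11 rows again
nys_sb %>%
        filter(Year == 2016) %>%
        slice_max(An_avg_sal, n = 10, with_ties = FALSE) # exactly 10 rows (dplyr >= 1.0.0)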

To give a better presentation, we use the package DT, which gives us interactive tables in HTML.

library(DT)
nys_sb %>% # the same pipeline as before
        filter(Year == 2016) %>%
        arrange(desc(An_avg_sal)) %>%
        top_n(10) %>%
        datatable()

You see, it looks much better. And it should not be surprising that Financial Investment & Related Activity has the highest salary. Now, if you want to rank by average employment, you can just click that column header.

Group Data and Visualize It

As we have 184,747 observations and the dataset includes many dimensions, it is better to collapse the data along one dimension, such as the trend over time.

nys_sb %>% 
        filter(N_title == 'Financial Investment & Related Activity') %>%
        group_by(Year)
## # A tibble: 1,717 x 7
## # Groups:   Year [17]
##    Area_type Area    N_title      Year Average_emply Total_wage An_avg_sal
##    <chr>     <chr>   <chr>       <int>         <int>      <dbl>      <dbl>
##  1 State     New Yo… Financial …  2016        194682    6.99e10     359113
##  2 County    Albany… Financial …  2016           843    1.21e 8     143620
##  3 County    Bronx … Financial …  2016            37    3.46e 6      93450
##  4 County    Broome… Financial …  2016           164    1.75e 7     106820
##  5 County    Cattar… Financial …  2016            20    3.94e 6     197028
##  6 County    Chauta… Financial …  2016            69    8.31e 6     120468
##  7 County    Chemun… Financial …  2016           153    3.11e 7     203253
##  8 County    Chenan… Financial …  2016            21    2.76e 6     131638
##  9 County    Clinto… Financial …  2016            37    2.81e 6      75898
## 10 County    Columb… Financial …  2016             6    5.68e 5      94680
## # ... with 1,707 more rows

You can see that we now have only 1,717 observations, because we filtered the data down to a single industry - Financial Investment & Related Activity. There are still many observations within the same year, so we will summarize the data by year.

nys_sb %>% 
        filter(N_title == 'Financial Investment & Related Activity') %>%
        group_by(Year) %>% 
        summarise(avg = mean(An_avg_sal))
## # A tibble: 17 x 2
##     Year     avg
##    <int>   <dbl>
##  1  2000  92264.
##  2  2001  82260.
##  3  2002  89120.
##  4  2003  86334.
##  5  2004  94269.
##  6  2005  92782.
##  7  2006 112025.
##  8  2007 123215.
##  9  2008 127490.
## 10  2009 114514.
## 11  2010 121183.
## 12  2011 130502.
## 13  2012 136947.
## 14  2013 142100 
## 15  2014 150483.
## 16  2015 152613.
## 17  2016 151939.

Next, we will visualize the trend:

nys_sb %>% 
        filter(N_title == 'Financial Investment & Related Activity') %>%
        group_by(Year) %>% 
        summarise(avg = mean(An_avg_sal)) %>%
        ggplot(aes(Year, avg)) + geom_bar(stat='identity')

You can change the color, use a cleaner theme, and add value labels by giving more options:

nys_sb %>% 
        filter(N_title == 'Financial Investment & Related Activity') %>%
        group_by(Year) %>% 
        summarise(avg = round(mean(An_avg_sal))) %>%
        ggplot(aes(Year, avg)) + geom_bar(stat='identity', fill= '#75A8D7') +
        theme_classic() + geom_text(aes(label=avg),position=position_dodge(width=0.9),
                vjust=-0.25) -> f1 
f1
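
If you want to keep this figure for a report, ggplot2’s ggsave() writes a plot object to disk; the file name and size here are just examples:

ggsave('avg_salary_finance.png', plot = f1, width = 8, height = 5) # save the bar chart as a PNG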

We can also make this graph interactive by using the package plotly.

library(plotly)
nys_sb %>% 
        filter(N_title == 'Financial Investment & Related Activity') %>%
        group_by(Year) %>% 
        summarise(avg = round(mean(An_avg_sal))) %>%
        ggplot(aes(Year, avg)) + geom_bar(stat='identity', fill= '#75A8D7') +
        theme_classic()  -> f2
f2 <- ggplotly(f2)
f2

Baltic Dry Index (BDI) and Oil Price

The Baltic Dry Index (BDI) is issued daily by the London-based Baltic Exchange. The BDI is a composite of the Capesize, Panamax and Supramax Timecharter Averages. It is reported around the world as a proxy for dry bulk shipping stocks as well as a general shipping market bellwether. The reason the index is watched so closely by many economists and policymakers is that it is an excellent leading indicator for economic activity. The BDI tracks international carriers hauling raw materials (e.g., wood for Finnish paper mills) as opposed to container ships carrying finished goods (e.g., paper for the U.S. Treasury’s busy printing presses). Thus, the precursors to production, rather than the results of production, are accounted for in the index, giving it a predictive quality. Additionally, the market for these cargo ships tends to be tight and inelastic.

Now, we want to investigate the relationship between the BDI and the oil price. First, however, we have to collect the data. Most of the time we collect datasets separately from different websites. In this case, we have two datasets, BDI.csv and hkstock.csv, and we want to merge them together. The problem is that the date variable in BDI.csv is a string, so we need to convert it into the Date format.

Even though in Stata we could use gen Date = date(date, "YMD") to convert the string into a date, merging datasets by matching dates there is still fiddly. R, however, can do it easily.

bdi <- read_csv(file = "/Users/Michael/Library/Mobile Documents/com~apple~CloudDocs/CMSE/Data Analysis/Lectures/L2/BDI.csv")
# make sure the path matches the file location on your laptop
bdi # R automatically converts the string into the date format 
## # A tibble: 1,236 x 2
##    Date       Index
##    <date>     <dbl>
##  1 2014-04-01  1316
##  2 2014-03-31  1362
##  3 2014-03-28  1373
##  4 2014-03-27  1412
##  5 2014-03-26  1496
##  6 2014-03-24  1602
##  7 2014-03-21  1599
##  8 2014-03-20  1621
##  9 2014-03-19  1570
## 10 2014-03-18  1518
## # ... with 1,226 more rows
tail(bdi)
## # A tibble: 6 x 2
##   Date       Index
##   <date>     <dbl>
## 1 2009-04-08  1463
## 2 2009-04-07  1466
## 3 2009-04-06  1486
## 4 2009-04-03  1506
## 5 2009-04-02  1538
## 6 2009-04-01  1574
hk <- read_csv(file = "/Users/Michael/Library/Mobile Documents/com~apple~CloudDocs/CMSE/Data Analysis/Lectures/L2/hkstock.csv")
hk
## # A tibble: 1,237 x 7
##    Date        Open  High   Low Close `Adj Close`    Volume
##    <date>     <dbl> <dbl> <dbl> <dbl>       <dbl>     <dbl>
##  1 2009-04-01  4.81  5.21  4.81  5.1         4.79  51195944
##  2 2009-04-02  4.95  5.72  4.95  5.67        5.32 128183352
##  3 2009-04-03  5.61  6.05  5.61  5.81        5.45 155467992
##  4 2009-04-06  6.08  6.36  6.02  6.19        5.81 110922868
##  5 2009-04-07  6.43  6.43  5.87  5.92        5.56  77399528
##  6 2009-04-08  5.74  5.74  5.39  5.54        5.20  93684174
##  7 2009-04-09  5.66  5.85  5.66  5.84        5.48  56597038
##  8 2009-04-14  5.84  6.87  5.84  6.79        6.37 148947340
##  9 2009-04-15  6.5   7.46  6.35  7.4         6.94 196294748
## 10 2009-04-16  7.63  7.69  6.97  7.11        6.67 182892512
## # ... with 1,227 more rows
tail(hk)
## # A tibble: 6 x 7
##   Date        Open  High   Low Close `Adj Close`   Volume
##   <date>     <dbl> <dbl> <dbl> <dbl>       <dbl>    <dbl>
## 1 2014-03-25  3.26  3.3   3.24  3.28        3.28  9257970
## 2 2014-03-26  3.3   3.3   3.24  3.29        3.29  7554650
## 3 2014-03-27  3.29  3.3   3.21  3.24        3.24  9542895
## 4 2014-03-28  3.25  3.28  3.05  3.16        3.16 30373690
## 5 2014-03-31  3.19  3.24  3.16  3.21        3.21 10002000
## 6 2014-04-01  3.23  3.26  3.18  3.25        3.25  8993074
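
Notice that read_csv() parsed Date into the <date> type automatically in both files. If the dates had come in as plain character strings instead, the lubridate package (installed with the tidyverse) could convert them after the fact; a hedged sketch, assuming day-month-year strings such as "01/04/2014":

# only needed if Date had been read as a character column
# bdi <- bdi %>% mutate(Date = lubridate::dmy(Date))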

Now, we will merge the two datasets by matching the column Date. When you merge datasets in R, you have to make sure the matched key columns have the same variable name (or tell the join how they correspond).
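
If the key columns had different names in the two files (say a lower-case date in one of them), you could either rename() one of them first or pass a named vector to by; a hypothetical sketch:

# inner_join(hk, bdi, by = c("Date" = "date")) # if bdi's key were called 'date'

In our case both files already use Date, so we can join directly: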

hd_bdi <- inner_join(hk, bdi, by = "Date")
hd_bdi
## # A tibble: 1,190 x 8
##    Date        Open  High   Low Close `Adj Close`    Volume Index
##    <date>     <dbl> <dbl> <dbl> <dbl>       <dbl>     <dbl> <dbl>
##  1 2009-04-01  4.81  5.21  4.81  5.1         4.79  51195944  1574
##  2 2009-04-02  4.95  5.72  4.95  5.67        5.32 128183352  1538
##  3 2009-04-03  5.61  6.05  5.61  5.81        5.45 155467992  1506
##  4 2009-04-06  6.08  6.36  6.02  6.19        5.81 110922868  1486
##  5 2009-04-07  6.43  6.43  5.87  5.92        5.56  77399528  1466
##  6 2009-04-08  5.74  5.74  5.39  5.54        5.20  93684174  1463
##  7 2009-04-09  5.66  5.85  5.66  5.84        5.48  56597038  1478
##  8 2009-04-14  5.84  6.87  5.84  6.79        6.37 148947340  1492
##  9 2009-04-15  6.5   7.46  6.35  7.4         6.94 196294748  1534
## 10 2009-04-16  7.63  7.69  6.97  7.11        6.67 182892512  1604
## # ... with 1,180 more rows
tail(hd_bdi)
## # A tibble: 6 x 8
##   Date        Open  High   Low Close `Adj Close`   Volume Index
##   <date>     <dbl> <dbl> <dbl> <dbl>       <dbl>    <dbl> <dbl>
## 1 2014-03-24  3.26  3.33  3.26  3.27        3.27  5962558  1602
## 2 2014-03-26  3.3   3.3   3.24  3.29        3.29  7554650  1496
## 3 2014-03-27  3.29  3.3   3.21  3.24        3.24  9542895  1412
## 4 2014-03-28  3.25  3.28  3.05  3.16        3.16 30373690  1373
## 5 2014-03-31  3.19  3.24  3.16  3.21        3.21 10002000  1362
## 6 2014-04-01  3.23  3.26  3.18  3.25        3.25  8993074  1316

Done; we now have the merged data, matched by date. You can write it out in Stata or CSV format using the write_ family of functions.
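
A minimal sketch of both (the file names are just examples; write_dta() comes from the haven package, which is installed with the tidyverse but not attached by default):

write_csv(hd_bdi, 'hd_bdi.csv')         # CSV, via readr
haven::write_dta(hd_bdi, 'hd_bdi.dta')  # Stata format, via haven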

Wage discrimination in the UK

Imagine your boss asks you to analyze current gender wage discrimination in the UK, because she/he heard that the UK’s official statistics office has just published a national-level survey dataset. The dataset is bhps_assignment.dta, which is 14.3 MB. You don’t want to open this dataset and explain it to your boss by saying ‘look here, …’. You just want to give her/him a clean dataset to check, or present one graph that explains almost ‘everything’. That is what we will do in the next part.

Now, we will work on another dataset to examine the trend of wage discrimination in the UK. Let’s read the dataset and take a look. The main variables we are interested in are: paygu, the usual gross pay per month for the current job; qfedhi, the highest educational qualification; and jbft, an indicator for whether the person is employed full-time.

library(readstata13)
wd <- read.dta13(file = '/Users/Michael/Library/Mobile Documents/com~apple~CloudDocs/CMSE/Data Analysis/Lectures/L2/bhps_assignment.dta')
wd <- tbl_df(wd) # convert the data frame into a tibble for nicer printing
wd
## # A tibble: 159,040 x 63
##       pid  year   sex   age mlstat plbornc race  yr2uk4 paygu cjsten jbhas
##  *  <int> <dbl> <int> <int>  <int>   <int> <fct>  <int> <dbl>  <int> <fct>
##  1 1.00e7  1991     2    91      5      -8 "whi…     -8   -8   11208 "no …
##  2 1.00e7  1991     1    28      5      51 "bla…   1991   -8     251 "no …
##  3 1.00e7  1992     1    29      5      -8 "bla…     -8   -8     591 "no …
##  4 1.00e7  1991     1    26      5      51 "bla…   1991   -8     244 "no …
##  5 1.00e7  1992     1    27      5      -8 "bla…     -8   -8     591 "no …
##  6 1.00e7  1993     1    28      5      -8 "bla…     -8   -8     690 "no …
##  7 1.00e7  1991     2    57      4      27 "oth…   1956  780     530 "yes…
##  8 1.00e7  1992     2    59      4      -8 "oth…     -8  612.    926 "yes…
##  9 1.00e7  1993     2    59      4      -8 "oth…     -8  608.   1277 "yes…
## 10 1.00e7  1998     2    64     -9      -8 "oth…     -8   -8    1633 "no …
## # ... with 159,030 more rows, and 52 more variables: jboff <fct>,
## #   jboffy <int>, jbsoc <int>, jbsic <int>, jbsemp <fct>, jbsize <fct>,
## #   jbhrs <int>, jbed <fct>, jbbgd <int>, jbbgm <int>, jbstat <fct>,
## #   jbft <fct>, jbbgy4 <int>, jbsic92 <int>, tujbpl <fct>, tuin1 <fct>,
## #   school <int>, scend <int>, sctype <int>, scnow <int>, fetype <int>,
## #   fenow <int>, feend <int>, qfedhi <fct>, qfvoc <fct>, qfachi <fct>,
## #   doid <int>, doim <int>, paju <int>, pasoc <int>, pasemp <fct>,
## #   paboss <fct>, pamngr <fct>, maju <int>, masoc <int>, masemp <fct>,
## #   maboss <fct>, mamngr <fct>, smoker <int>, ncigs <int>, rach16 <fct>,
## #   hunurs <int>, region <int>, nchild <int>, ivlyr <fct>, doiy4 <int>,
## #   ivievr <fct>, ivstat2 <int>, pasoc00 <int>, masoc00 <int>,
## #   smcigs <fct>, racel <fct>

We have been told that \(-9\) means the value is missing for that observation in this dataset (you will also notice other negative codes, such as \(-7\) and \(-8\), in the printout above, which flag similar problems). From the printed table and the summary of the variables, we see that many variables are factors. Factors are categorical variables with different levels.
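
In this workshop we leave those negative codes in the data (and will stumble over them deliberately below), but in a real analysis you would normally recode them to NA early on. A minimal sketch for paygu (wd_clean is just an illustrative name, not used later):

# not run here: recode the negative 'missing' codes in paygu to NA
wd_clean <- wd %>% mutate(paygu = if_else(paygu < 0, NA_real_, paygu))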

levels(wd$jbft) # it gives all the levels for employment 
## [1] "missing or wild"         "inapplicable   "        
## [3] "proxy respondent       " "full time: 30 hrs +    "
## [5] "part time: lt 30 hrs   "
levels(wd$jbft)[4] # we will focus on the full time one
## [1] "full time: 30 hrs +    "
wage_ft <- wd %>%
        filter(jbft == levels(wd$jbft)[4]) %>% # filter the full-time ones 
        group_by(year, sex) %>% # group by the year and sex 
        summarise(
                no_ft = n(), # count how many observations they have 
                wage_median = median(paygu, na.rm = TRUE) # median wage 
        )  %>%
        filter(sex == 1 | sex == 2) # keep only codes 1 and 2, as I found a stray -7 in sex
wage_ft
## # A tibble: 34 x 4
## # Groups:   year [17]
##     year   sex no_ft wage_median
##    <dbl> <int> <int>       <dbl>
##  1  1991     1  2847        997.
##  2  1991     2  1672        729.
##  3  1992     1  2725       1005.
##  4  1992     2  1621        786.
##  5  1993     1  2597       1001.
##  6  1993     2  1629        813.
##  7  1994     1  2622       1043.
##  8  1994     2  1631        841.
##  9  1995     1  2586       1083.
## 10  1995     2  1604        896.
## # ... with 24 more rows
wage_pt <- wd %>%
        filter(jbft == levels(wd$jbft)[5]) %>% # filter the part-time ones 
        group_by(year, sex) %>%
        summarise(
                no_pt = n(),
                wage_median = median(paygu, na.rm = TRUE)
        ) %>%
        filter(sex == 1 | sex == 2)
wage_pt 
## # A tibble: 34 x 4
## # Groups:   year [17]
##     year   sex no_pt wage_median
##    <dbl> <int> <int>       <dbl>
##  1  1991     1   225        117 
##  2  1991     2  1081        217.
##  3  1992     1   218        173.
##  4  1992     2  1047        225.
##  5  1993     1   227        152.
##  6  1993     2  1028        230.
##  7  1994     1   231        152.
##  8  1994     2  1024        234 
##  9  1995     1   218        195.
## 10  1995     2  1026        256.
## # ... with 24 more rows
count_jbps <- wd %>% group_by(year, sex) %>% summarise(no = n()) %>%
        filter(sex == 1 | sex == 2) # just count the observations
# to calculate the percentage later

count_jbps 
## # A tibble: 34 x 3
## # Groups:   year [17]
##     year   sex    no
##    <dbl> <int> <int>
##  1  1991     1  4833
##  2  1991     2  5431
##  3  1992     1  4630
##  4  1992     2  5215
##  5  1993     1  4476
##  6  1993     2  5124
##  7  1994     1  4440
##  8  1994     2  5041
##  9  1995     1  4320
## 10  1995     2  4929
## # ... with 24 more rows
wd_jbps <- bind_cols(count_jbps, wage_ft, wage_pt) # combine them together and 
# assign it to a new dataset 
wd_jbps
## # A tibble: 34 x 11
## # Groups:   year [17]
##     year   sex    no year1  sex1 no_ft wage_median year2  sex2 no_pt
##    <dbl> <int> <int> <dbl> <int> <int>       <dbl> <dbl> <int> <int>
##  1  1991     1  4833  1991     1  2847        997.  1991     1   225
##  2  1991     2  5431  1991     2  1672        729.  1991     2  1081
##  3  1992     1  4630  1992     1  2725       1005.  1992     1   218
##  4  1992     2  5215  1992     2  1621        786.  1992     2  1047
##  5  1993     1  4476  1993     1  2597       1001.  1993     1   227
##  6  1993     2  5124  1993     2  1629        813.  1993     2  1028
##  7  1994     1  4440  1994     1  2622       1043.  1994     1   231
##  8  1994     2  5041  1994     2  1631        841.  1994     2  1024
##  9  1995     1  4320  1995     1  2586       1083.  1995     1   218
## 10  1995     2  4929  1995     2  1604        896.  1995     2  1026
## # ... with 24 more rows, and 1 more variable: wage_median1 <dbl>
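
Note that bind_cols() simply pastes the three tables side by side, which is why we end up with duplicated year1/sex1 and year2/sex2 columns; it also relies on the rows being in exactly the same order in every table. A safer alternative is to join explicitly on the grouping keys; a quick sketch (column names will differ slightly from the bind_cols() version):

wd_jbps_alt <- count_jbps %>%
        left_join(wage_ft, by = c("year", "sex")) %>%
        left_join(wage_pt, by = c("year", "sex"), suffix = c("_ft", "_pt"))

Below we carry on with the bind_cols() result and keep only the columns we need: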
wd_jbps_sb <- wd_jbps %>%
        select(year, sex, no, no_ft, no_pt, wage_median, wage_median1) %>% # select the
        # main variables we are interested in
        mutate(perct_ft = round(no_ft/no, 3)) # add a new column 
wd_jbps_sb # this gives you a clean dataset which focuses on the employment and 
## # A tibble: 34 x 8
## # Groups:   year [17]
##     year   sex    no no_ft no_pt wage_median wage_median1 perct_ft
##    <dbl> <int> <int> <int> <int>       <dbl>        <dbl>    <dbl>
##  1  1991     1  4833  2847   225        997.         117     0.589
##  2  1991     2  5431  1672  1081        729.         217.    0.308
##  3  1992     1  4630  2725   218       1005.         173.    0.589
##  4  1992     2  5215  1621  1047        786.         225.    0.311
##  5  1993     1  4476  2597   227       1001.         152.    0.580
##  6  1993     2  5124  1629  1028        813.         230.    0.318
##  7  1994     1  4440  2622   231       1043.         152.    0.591
##  8  1994     2  5041  1631  1024        841.         234     0.324
##  9  1995     1  4320  2586   218       1083.         195.    0.599
## 10  1995     2  4929  1604  1026        896.         256.    0.325
## # ... with 24 more rows
# median wage level 

The table above gives us the comparison of median wages between males and females. Now, let’s visualize the data to show the same result.

wd %>%
        filter(jbft == levels(wd$jbft)[4]) %>% # again filter the datasets 
        ggplot(aes(x = reorder(year, paygu), y = paygu)) + 
        geom_boxplot(aes(fill = sex), outlier.shape = NA) +
        scale_y_continuous(limits = c(0, 5000)) + 
        labs(x = "Year", y = "Gross pay per month",
                title = "Boxplot of Gross payment per month for male and female (Fulltime), 1991-2007")

wd %>%
        filter(jbft == levels(wd$jbft)[5]) %>%
        ggplot(aes(x = reorder(year, paygu), y = paygu)) + 
        geom_boxplot(aes(fill = sex), outlier.shape = NA) +
        scale_y_continuous(limits = c(0, 2000)) + 
        labs(x = "Year", y = "Gross pay per month",
                title = "Boxplot of Gross payment per month for male and female (Parttime), 1991-2007")

Right now, aes(fill = ) does not work well because the variable sex is an integer, so ggplot treats it as continuous. We need to map a discrete (character or factor) variable instead. Also, the title is too big for the presentation here.
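
As a quick sketch, you could also convert on the fly with aes(fill = factor(sex)); the drawback is that the legend then shows the raw codes (1, 2, and any stray values such as -7) rather than readable labels:

wd %>%
        filter(jbft == levels(wd$jbft)[4]) %>%
        ggplot(aes(x = reorder(year, paygu), y = paygu)) + 
        geom_boxplot(aes(fill = factor(sex)), outlier.shape = NA) + # convert on the fly
        scale_y_continuous(limits = c(0, 5000))

So below we generate a labelled variable with if_else() instead, and also shrink the title: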

wd %>%
        mutate(sex1 = if_else(sex == 1, 'Male', 'Female')) %>%
        filter(jbft == levels(wd$jbft)[4]) %>% # again filter the datasets 
        ggplot(aes(x = reorder(year, paygu), y = paygu)) + 
        geom_boxplot(aes(fill = sex1), outlier.shape = NA) +
        scale_y_continuous(limits = c(0, 5000)) + 
        labs(x = "Year", y = "Gross pay per month",
                title = "Boxplot of Gross payment per month for male and female (Fulltime), 1991-2007")

wd %>%
        mutate(sex1 = if_else(sex == 1, 'Male', 'Female')) %>%
        filter(jbft == levels(wd$jbft)[5]) %>%
        ggplot(aes(x = reorder(year, paygu), y = paygu)) + 
        geom_boxplot(aes(fill = sex1), outlier.shape = NA) +
        scale_y_continuous(limits = c(0, 2000)) + 
        labs(x = "Year", y = "Gross pay per month",
                title = "Boxplot of Gross payment per month for male and female (Part-time), 1991-2007") +
        theme(plot.title = element_text(size = 12))

Now, we want to check whether this boxplot changes if we focus on people who hold a higher degree.

levels(wd$qfedhi)
##  [1] "missing"                         "proxy  "                        
##  [3] "proxy respondent       "         "higher degree  "                
##  [5] "first degree   "                 "teaching qf    "                
##  [7] "other higher qf"                 "nursing qf     "                
##  [9] "gce a levels   "                 "gce o levels or equiv  "        
## [11] "commercial qf, no o levels     " "cse grade 2-5,scot grade 4-5   "
## [13] "apprenticeship "                 "other qf       "                
## [15] "no qf  "                         "still at school no qf  "
levels(wd$qfachi)
##  [1] "missing"                 "proxy  "                
##  [3] "proxy respondent       " "higher degree  "        
##  [5] "1st degree     "         "hnd,hnc,teaching       "
##  [7] "a level"                 "o level"                
##  [9] "cse    "                 "none of these  "
levels(wd$qfachi)[4] # let's use the qfachi variable to filter 
## [1] "higher degree  "
wd %>%
        filter(qfachi == levels(wd$qfachi)[4]) %>%
        filter(sex == 1 | sex == 2) %>% 
        mutate(sex1 = if_else(sex == 1, 'Male', 'Female')) %>%
        filter(jbft == levels(wd$jbft)[4]) %>%
        filter(paygu >=0 ) %>%
        ggplot(aes(x = reorder(year, paygu), y = paygu)) + 
        geom_boxplot(aes(fill = sex1), outlier.shape = NA) +
        scale_y_continuous(limits = c(0, 2000)) + 
        labs(x = "Year", y = "Gross pay per month",
                title = "Boxplot of Gross payment per month for male and female with higher degree (Fulltime), 1991-2007") +
        theme(plot.title = element_text(size = 10))

It’s very interesting, right? It seems there is no wage discrimination among people who hold a higher degree. Now, we want to know which education categories have the largest wage gap between the genders.

wd %>%
        filter(jbft == levels(wd$jbft)[4]) %>%
        group_by(year, sex, qfachi) %>%
        summarise(wage_median = median(paygu, na.rm = TRUE)) %>%
        spread(sex, wage_median) %>%
        select(-'-7') -> wd_qfachi # delete the -7 
wd_qfachi
## # A tibble: 152 x 4
## # Groups:   year [17]
##     year qfachi                       `1`    `2`
##    <dbl> <fct>                      <dbl>  <dbl>
##  1  1991 missing                     606.   152.
##  2  1991 "higher degree  "          1872.  1601.
##  3  1991 "1st degree     "          1501.  1273.
##  4  1991 "hnd,hnc,teaching       "  1354.  1181.
##  5  1991 a level                    1075.   793.
##  6  1991 o level                     953.   701.
##  7  1991 "cse    "                   816.   567.
##  8  1991 "none of these  "           849.   604.
##  9  1992 missing                    1105   1495.
## 10  1992 "proxy respondent       "    -7     -7 
## # ... with 142 more rows
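# (aside: spread() is the older tidyr verb for this reshaping step; with
# tidyr >= 1.0.0 the equivalent call would be
# pivot_wider(names_from = sex, values_from = wage_median))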
names(wd_qfachi) <- c('year', 'qualification', 'male', 'female')
wd_qfachi %>%
        mutate(diff = male - female) %>%
        ggplot(aes(x = year, y = diff)) + 
        geom_bar(aes(fill = qualification), stat = 'identity', position="dodge")

# let's just print out the table to take a look 
wd_qfachi %>%
        mutate(diff = male - female) %>%
        datatable()

It’s quite strange, as we have contradictory results. Let’s remove the NAs and check the data again. Now, let’s do a double check; it should be the -7 issue.

wd %>%
        filter(qfachi == levels(wd$qfachi)[4]) %>%
        filter(sex == 1 | sex == 2) %>% 
        mutate(sex1 = if_else(sex == 1, 'Male', 'Female')) %>%
        filter(jbft == levels(wd$jbft)[4]) %>%
        filter(paygu >=0) %>%
        group_by(year, sex) %>%
        summarise(wage_median = median(paygu, na.rm = TRUE)) %>%
        spread(sex, wage_median) -> wd_qfachi2 
names(wd_qfachi2) <- c('year', 'male','female')

wd_qfachi2 %>%
        mutate(diff = male - female) %>%
        datatable()

Okay, so the problem is actually the y-axis limit:

wd %>%
        filter(qfachi == levels(wd$qfachi)[4]) %>%
        filter(sex == 1 | sex == 2) %>% 
        mutate(sex1 = if_else(sex == 1, 'Male', 'Female')) %>%
        filter(jbft == levels(wd$jbft)[4]) %>%
        filter(paygu >=0 ) %>%
        ggplot(aes(x = reorder(year, paygu), y = paygu)) + 
        geom_boxplot(aes(fill = sex1), outlier.shape = NA) +
        scale_y_continuous(limits = c(0, 7000)) + 
        labs(x = "Year", y = "Gross pay per month",
                title = "Boxplot of Gross payment per month for male and female with higher degree (Fulltime), 1991-2007") +
        theme(plot.title = element_text(size = 10))

Be careful when you copy code, especially with the axis limit settings.
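
The reason the limit matters so much here is that scale_y_continuous(limits = ...) removes observations outside the range before the boxplot statistics are computed, so the earlier limit of 2000 silently dropped the better-paid full-timers and distorted the comparison. If you only want to zoom in without discarding data, coord_cartesian() is the safer choice; a sketch of the same plot with a zoomed axis:

wd %>%
        filter(qfachi == levels(wd$qfachi)[4]) %>%
        filter(sex == 1 | sex == 2) %>% 
        mutate(sex1 = if_else(sex == 1, 'Male', 'Female')) %>%
        filter(jbft == levels(wd$jbft)[4]) %>%
        filter(paygu >= 0) %>%
        ggplot(aes(x = reorder(year, paygu), y = paygu)) + 
        geom_boxplot(aes(fill = sex1), outlier.shape = NA) +
        coord_cartesian(ylim = c(0, 7000)) + # zoom only; no observations are dropped
        labs(x = "Year", y = "Gross pay per month")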