Homework 5

By Jonathan Zhang

Labour Force Survey (LFS) Data

In this assignment I will attempt to use a new dataset, specifically the Canadian Labour Force Survey (LFS). The Labour Force Survey is published each month by Statistics Canada and provides estimates of unemployment, labour force participation, and other indicators. It also includes hourly wages, tenure, occupation, industry, demographics, etc. for each employee.

The data can be downloaded from ABACUS at the UBC Data Library. The data are originally in SPSS format with a syntax file and must be converted. There are 78 original variables, all but 3 of which are categorical. The LFS is a stratified multi-stage survey in which the subjects surveyed remain in the sample for 6 months, so economists often look at data that are at least 6 months apart. For this reason I took data from March and November only, and only every 5 years: 1997, 2002, 2007, and 2012. The main purpose of looking at labour data is to analyze relationships between wages and other variables, so I kept only the variables relevant to that. Many of the variables I dropped were just different representations of essentially the same thing. I also dropped observations for unemployed respondents. This removes all of the missing data for the variables we are interested in.

Much of my time was spent cleaning the data. The cleaning had to be done in MS Excel in CSV format; R would freeze trying to load the raw data, which had over 3 million rows.

Out of the 78 variables, I selected the relevant ones.
The categorical variables are:
YEAR: Survey year. Again, I only use data that are 5 years apart; otherwise the data cannot even be saved/loaded in CSV format!
PROV: Province. I only took data for the 4 major provinces: BC, AL, ON, QU
SEX: Male or Female
MARRIED: An indicator of whether they are married; I am not sure how relevant this variable is
AGE: Working age, in 10-year intervals from 15 to 54
EDUCATION: There are 4 education levels: 1 (high school or elementary dropouts), 2 (high school graduates), 3 (some post-secondary education but no Bachelor's degree, which includes university dropouts, diplomas, certificates, etc.), 4 (Bachelor's degree or higher). The levels are not yet named; however, changing this is trivial
FULLTIME: An indicator of whether their main job is full-time
UNION: An indicator of whether their main job is unionized

The quantitative variables are:
HOURS: Hours worked on average per week
TENURE: Tenure at their main job (in months)
HRLYEARN: Hourly earnings at their main job

Survey weight:
FWEIGHT: Frequency weight of that particular observation
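
Since the EDUCATION levels are stored as the integers 1 to 4, a minimal sketch of naming them could look like the following. The toy vector stands in for the real column, and the label strings are my own shorthand for the four levels described above, not official LFS labels.

```r
# Toy stand-in for the EDUCATION column (integer codes 1-4, as in the data).
education_codes <- c(3L, 2L, 2L, 3L, 2L, 1L, 4L)

# Illustrative label names (an assumption, not official LFS wording).
education_labels <- c("Less than high school", "High school graduate",
                      "Some post-secondary", "Bachelor's or higher")

# Convert the codes to an ordered factor so the levels keep their ranking.
education <- factor(education_codes, levels = 1:4,
                    labels = education_labels, ordered = TRUE)

table(education)
```

On the real data the same call would presumably be applied to the EDUCATION column of the data frame.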

Note/Help: I am not sure how to deal with survey weights. I did some research, and apparently R does not handle weights well. Apparently other statistical software packages have some sort of weight parameter for essentially every command. This is definitely not the case with R. For this assignment I just want to learn how to use ggplot2, and I will not worry about the frequency weights.
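
For reference, here is a minimal sketch of how frequency weights could enter simple summaries in base R. It assumes FWEIGHT can be read as "this row represents this many people"; the toy values are made up, and `weighted_median` is a hypothetical helper written for illustration, not a built-in function.

```r
# Toy data standing in for a few LFS rows (values are made up).
toy <- data.frame(
  HRLYEARN = c(10, 20, 30, 40),
  FWEIGHT  = c(100, 100, 100, 500)
)

# Weighted mean: base R's weighted.mean() takes the weights directly.
wmean <- weighted.mean(toy$HRLYEARN, w = toy$FWEIGHT)

# Weighted median (hypothetical helper): the smallest value at which the
# cumulative share of the total weight reaches one half.
weighted_median <- function(x, w) {
  ord <- order(x)
  x <- x[ord]
  w <- w[ord]
  cum_share <- cumsum(w) / sum(w)
  x[which(cum_share >= 0.5)[1]]
}

wmedian <- weighted_median(toy$HRLYEARN, toy$FWEIGHT)

c(unweighted_mean = mean(toy$HRLYEARN), weighted_mean = wmean,
  unweighted_median = median(toy$HRLYEARN), weighted_median = wmedian)
```

Some modelling functions, such as `lm()`, also accept a `weights` argument, and the `survey` package on CRAN is designed for this kind of data, though neither is used in this assignment.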

Also, I have given this cleaned dataset to Jack Ni to also work with. We will probably work together in some way on the final project, however, there was no collaboration on this assignment.

Analysis

LFS <- read.csv("~/stat545/Stat 545A/LFS.csv")
str(LFS)
## 'data.frame':    136407 obs. of  12 variables:
##  $ YEAR     : int  1997 1997 1997 1997 1997 1997 1997 1997 1997 1997 ...
##  $ PROV     : Factor w/ 4 levels "AL","BC","ON",..: 3 1 2 3 3 3 3 4 3 3 ...
##  $ SEX      : Factor w/ 2 levels "Female","Male": 2 1 1 1 1 1 2 2 2 1 ...
##  $ MARRIED  : Factor w/ 2 levels "No","Yes": 2 2 1 2 2 1 2 1 2 2 ...
##  $ AGE      : Factor w/ 5 levels "15-24","25-34",..: 3 3 1 3 4 4 3 1 1 1 ...
##  $ EDUCATION: int  3 2 2 3 2 1 3 3 2 3 ...
##  $ FULLTIME : Factor w/ 2 levels "FT","PT": 1 1 2 1 1 1 1 1 1 1 ...
##  $ HOURS    : num  16 47 24 4 0 37.5 40 35 64 30 ...
##  $ TENURE   : int  140 61 22 80 240 204 74 8 32 33 ...
##  $ HRLYEARN : num  15 21.1 7 21 21.6 ...
##  $ UNION    : Factor w/ 2 levels "No","Yes": 2 1 2 2 2 2 1 2 1 1 ...
##  $ FWEIGHT  : int  66 310 257 541 86 267 726 680 335 164 ...
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.0.2

Let's start by looking at a straightforward scatterplot.

ggplot(LFS, aes(x = TENURE, y = HRLYEARN)) + geom_point()

[Figure: scatterplot of HRLYEARN against TENURE, full dataset]

Notice that we have far too much data to see any sort of trend. We will create a random subset of 5000 observations and try again.

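set.seed(545)  # arbitrary seed so the random subset is reproducible
random <- sample(nrow(LFS), 5000)  # 5000 row indices drawn without replacement
smallset <- LFS[random, ]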
ggplot(smallset, aes(x = TENURE, y = HRLYEARN)) + geom_point()

[Figure: scatterplot of HRLYEARN against TENURE, random subset of 5000 observations]

We don't really see much here, except that many people have low tenure and low earnings. I also notice a vertical line at TENURE = 240 months. There seems to be a maximum of 20 years of tenure in this survey.
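
A quick way to check whether that vertical line reflects a cap on recorded tenure is to look at the maximum value and the share of observations sitting exactly at it. The sketch below uses a made-up toy vector in place of the TENURE column.

```r
# Toy tenure values in months (made up); 240 plays the role of the apparent cap.
tenure <- c(8, 22, 61, 140, 240, 240, 240, 33)

cap <- max(tenure)                   # largest recorded tenure
share_at_cap <- mean(tenure == cap)  # fraction of observations at that value

c(cap = cap, share_at_cap = share_at_cap)
```

A noticeably large share at the maximum would suggest the variable is capped rather than that many people happen to have exactly 20 years of tenure.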

Here are some boxplots of hourly earnings vs province over different years.

ggplot(LFS, aes(x = PROV, y = HRLYEARN, fill = PROV)) + geom_boxplot(alpha = 0.2) + 
    facet_wrap(~YEAR)

[Figure: boxplots of HRLYEARN by PROV, faceted by YEAR]

Notice that in 1997 Alberta has the lowest median hourly wage and the smallest interquartile range. Fifteen years later, Alberta has the highest median wage, possibly the largest interquartile range, and the highest earnings among the outliers! Now let's look at a similar figure, but this time we facet by province and put the year on the x-axis.

ggplot(smallset, aes(x = YEAR, y = HRLYEARN)) + geom_jitter() + facet_wrap(~PROV) + 
    geom_line(stat = "summary", fun.y = "median", col = "red", lwd = 1)

[Figure: jittered HRLYEARN by YEAR with the median line in red, faceted by PROV]

It appears that median hourly earnings increase slightly in all four provinces, with the most pronounced increase in Alberta. This is not surprising given the quick development of Alberta's economy. Keep in mind that these are nominal wages (meaning they are not adjusted for inflation), so of course wages in 2012 are much higher than wages in 1997; our expenses have also increased dramatically. It also appears that income inequality has risen. Of course, we are analyzing all this at a very superficial level.

Next we will look at how education and wages are related. We again use a jittered stripplot and plot the median line.

ggplot(smallset, aes(x = EDUCATION, y = HRLYEARN)) + geom_jitter() + facet_wrap(~YEAR) + 
    geom_line(stat = "summary", fun.y = "median", col = "red", lwd = 1)

[Figure: jittered HRLYEARN by EDUCATION with the median line in red, faceted by YEAR]

The effects are as expected: higher earnings with higher education. Recall that 1, 2, 3, and 4 correspond to high school or elementary dropout, high school graduate, some post-secondary education, and Bachelor's degree or higher, respectively. We see that the returns to education (the differences in wages between adjacent education groups) are largest in 2012. This can be interpreted as follows: a student in 2012 benefits more from going to school than a student in 1997. There also appear to be more people in the "some post-secondary education" group; however, recall that this is just a random sample of our data and, more importantly, I have disregarded the survey weights.
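
To put numbers on the returns to education instead of reading them off the plot, one option is to compute the median wage in each YEAR by EDUCATION cell and then take the differences between adjacent education levels. The sketch below does this on made-up toy data that only borrows the column names; none of the values come from the LFS.

```r
# Toy data (made up): two years, four education levels, two people per cell.
toy <- data.frame(
  YEAR      = rep(c(1997, 2012), each = 8),
  EDUCATION = rep(rep(1:4, each = 2), times = 2),
  HRLYEARN  = c(8, 10, 11, 13, 13, 15, 18, 22,    # 1997
                10, 12, 14, 16, 18, 20, 28, 34)   # 2012
)

# Median hourly earnings for each YEAR x EDUCATION cell (rows are years).
med <- tapply(toy$HRLYEARN, list(toy$YEAR, toy$EDUCATION), median)

# Wage gap between adjacent education levels within each year.
gaps <- t(apply(med, 1, diff))

med
gaps
```

Comparing the rows of `gaps` across years is one rough way to see whether the gaps have widened; like the plot, it ignores the survey weights.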

I am also interested in seeing the distribution of hours worked over time, especially the differences between males and females. I would expect males to work more hours, because it is probably still the case that the man is the primary income earner of the family.

ggplot(LFS, aes(x = HOURS, color = SEX)) + geom_density(lwd = 1) + facet_wrap(~YEAR)

[Figure: density of HOURS coloured by SEX, faceted by YEAR]

It appears that, in general, men work more hours, and more men than women work full time. The distribution has not changed much over the 15 years covered here. I suspect that if we went back even further, to the 1950s and 1960s, the density plots would show much larger changes over time.

The last figure looks at hourly earnings vs average hours worked per week, faceted by EDUCATION.

ggplot(LFS, aes(x = HOURS, y = HRLYEARN, color = UNION)) + geom_point() + facet_wrap(~EDUCATION)

[Figure: scatterplot of HRLYEARN against HOURS coloured by UNION, faceted by EDUCATION]

First, we see the obvious: people with more education tend to earn more. This is especially true if we restrict our attention to people who work full-time "regular" hours (30-50 hours). We also see that unionized workers tend to earn more. This is consistent with economic analysis. One problem is that with so much data, it is difficult to see any structure when the points are so heavily clustered.
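
Two common ways to reduce this kind of overplotting in ggplot2 are semi-transparent points and 2D binning. The sketch below shows both on randomly generated toy data that only borrows the column names; the alpha value and the number of bins are arbitrary choices, not tuned for the LFS.

```r
library(ggplot2)

# Toy data (randomly generated, not LFS values) with the same column names.
set.seed(1)
n <- 2000
toy <- data.frame(
  HOURS     = pmin(pmax(rnorm(n, mean = 37, sd = 10), 0), 99),
  HRLYEARN  = rlnorm(n, meanlog = 3, sdlog = 0.4),
  UNION     = factor(sample(c("No", "Yes"), n, replace = TRUE)),
  EDUCATION = sample(1:4, n, replace = TRUE)
)

# Option 1: semi-transparent points, so dense regions appear darker.
p_alpha <- ggplot(toy, aes(x = HOURS, y = HRLYEARN, color = UNION)) +
  geom_point(alpha = 0.1) +
  facet_wrap(~EDUCATION)

# Option 2: bin into rectangles and map the count in each bin to fill colour.
p_bins <- ggplot(toy, aes(x = HOURS, y = HRLYEARN)) +
  geom_bin2d(bins = 40) +
  facet_grid(UNION ~ EDUCATION)
```

Printing `p_alpha` or `p_bins` draws the plot. Binning gives up the colour-by-UNION encoding, which is why UNION moves into the facets in the second version.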

Conclusion

I have a few conclusions to make. First, ignoring the frequency weights does not give us wild results. Everything makes intuitive sense to me. This is good news; however, I would still like to figure out how to deal with the weights. Maybe it would be okay to ignore them?
Also, I am not sure this dataset is that great. The fact that there are only 3 quantitative variables really restricts the analyses. It is also inconvenient that none of the quantitative variables seem to be strongly related to one another. I will need to play with this LFS data some more. At first glance, it is rather boring.