Assignment 6 Labor Force In Canada

Mohammadreza Bolandnazar

Introduction

In this Homework, I tried to deal with a new dataset. After my first experience with a dataset about Canadians labor economics in the last homework: Work on CANSIM tables I decided to find a more detailed and complete dataset about what I did in that assignment. I figured out that some of my colleagues have used an interesting similar data: Canadian Labor Force (LFS). I took the raw data from Jonathan Zhang, and started cleaning it. This data is mostly about labor force measures such as wages, tenure, occupation, unemployment status, etc.

This large data set covers years 1997-2013 and each year about 100000 distinct entries. Each entry is a representative of some individuals with the same statistics, where its frequency has been reported in the “Frequency Weight” assign to that entry (at the very last column of data). All in all, each year has a summary of labor force information of about 25 million people. Doubtlessly, we should pick a sample from this large data. The challenging part is the frequency weights of each entry. The way we deal with it was to sample among those entries with a significant level of frequency. For example we randomly selected entries among those with frequency weight of at least 500. Even though this method of sampling is a little biased, but it reduces the impact of rare data on final observations. We also picked some of the variables of the data.

Obviously, I spent a lot of time cleaning the data and anyone wants to use the raw data would be aware that because of the volume of the data, it may take some time to run the data cleaner files.

Finally, we have picked these variables: Year The survey year, from 1997 to 2013 Province The raw data has 10 provinces (PROV) but after sampling the data we had 5 provinces: Alberta, British Columbia, Manitoba, Ontario and Quebec. Age The age group that the individual is assigned to Sex Female or Male Marital Status In three categories: Single, Married and Separated Educations Highest educational attainment JobType The type of work the individual does such as Agriculture, Construction,Education, etc. JobAbsence for an employed person: reason absent full week JobStatus Declaring Full time/Part time Work Hours Usual hours per week at main job Tenure Job tenure in months, for currently employed only HourlyEarn for employees: Usual hourly wages Union Union membership status, employees only Frequency Final individual or family weight.

Data Cleaning Process

The raw data for each year, was saved in a separate [csv] file. So our first task was to pick, randomly, a sample of 1000 entries, in each year, among those entries with frequency weight (last column) of at least 500. Then we choose the relevant variables and form a summarized data for each year. At last we combine all these files together to form a single data.

For now, we need only two libraries

library(ggplot2)
library(plyr)

Now read the raw data from separate files.

# READING DATA AND CLEANIN
path <- "LFSDATA/"
filenames <- list.files(path)  # this reads all the data files titled: 1997..2013

We want to build a function for reading the files of 1997 to 2013

jFun <- function(x) {
    BIGDAT <- read.csv(x, na.strings = "")  #Resolves the empty cells with NA
    BIGDAT <- subset(BIGDAT, BIGDAT$FWEIGHT > 500)  #ignores the less frequent entities 
    Edat <- BIGDAT[sample(nrow(BIGDAT), 1000, replace = FALSE, prob = NULL), 
        ]  # Sample 1000 rows out of current file
    fDat <- subset(Edat, select = c(SURVYEAR, PROV, AGE_12, SEX, MARSTAT, EDUC90, 
        NAICS_18, YABSENT, FTPTMAIN, UHRSMAIN, TENURE, HRLYEARN, UNION, FWEIGHT))  ## Selects some of the columns
    return(fDat)
}



## Apply the function above on all the data files
setwd("LFSDATA/")  ## goes in the input directory to read and combine
lfsRAW <- do.call("rbind", lapply(filenames, jFun))
setwd("..")  ## come back to the main directory

Let check that the data has loaded correctly. It must have 17000 observations.

dim(lfsRAW)  ## Sanity Check

## [1] 17000    14

Now let make the input tidier and cleaner, by changing the names and etc.

lfsdat <- lfsRAW

## Change the names a little bit
levels(lfsdat$MARSTAT) <- list(Married = "Married or", Married = "Married", 
    Single = "Single, ne", Single = "Widowed", Separated = "Separated/", Separated = "Separated", 
    Separated = "Divorced", Married = "Living in")
levels(lfsdat$PROV) <- list(Alberta = "Alberta", BritishColumbia = "British Co", 
    Manitoba = "Manitoba", NewBrunswick = "New Brunsw", Newfoundland = "Newfoundla", 
    NovaScotia = "Nova Scoti", Ontario = "Ontario", PrinceEdward = "Prince Edw", 
    Quebec = "Qu\xe9bec", Saskatchewan = "Saskatchew")
levels(lfsdat$EDUC90) <- list(A.EightYears = "0 to 8 yea", B.Secondary = "Some secon", 
    C.Grade11to13 = "Grade 11 t", D.PostSecondary = "Post secon", D.PostSecondary = "Some post", 
    E.University = "University")
levels(lfsdat$YABSENT) <- list(Vacation = "Vacation", FamilyResponsibility = "Personal o", 
    OwnIllness = "Own illnes", Other = "Other")
c <- c("Year", "Province", "Age", "Sex", "MaritalStatus", "Educations", "JobType", 
    "JobAbsence", "JobStatus", "WorkHours", "Tenure", "HourlyEarn", "Union", 
    "Frequency")
names(lfsdat) <- c

Next, we rearrange our data to be nicely ordered


lfsdat = arrange(lfsdat, Year, Province, Sex)

Let's take a look at the data, via a figure.

ggplot(lfsdat[!is.na(lfsdat$HourlyEarn), ], aes(x = Province, y = HourlyEarn)) + 
    geom_jitter(color = "brown") + ggtitle("Hourly Wages in different Provinces")

plot of chunk unnamed-chunk-7

As we see it is better to drop Manitoba, since it has very few samples. Also the initial number of Provinces in the raw data was 10 and this says that in the process of making the data some of them have automaically omitted.

lfsdat <- subset(lfsdat, Province != "Manitoba")
write.table(lfsdat, "lfsdat.csv", quote = FALSE, sep = ",", row.names = FALSE)

Analysis


lfsdat <- read.csv("lfsdat.csv")  # Load the data
print(str(lfsdat))  # Structure of the data

## 'data.frame':    16999 obs. of  14 variables:
##  $ Year         : int  1997 1997 1997 1997 1997 1997 1997 1997 1997 1997 ...
##  $ Province     : Factor w/ 4 levels "Alberta","BritishColumbia",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Age          : Factor w/ 12 levels "15 to 19","20 to 24",..: 6 12 6 5 5 9 6 4 9 7 ...
##  $ Sex          : Factor w/ 2 levels "Female","Male": 1 1 1 1 1 1 2 2 2 2 ...
##  $ MaritalStatus: Factor w/ 3 levels "Married","Separated",..: 1 3 1 1 1 1 2 3 1 1 ...
##  $ Educations   : Factor w/ 5 levels "A.EightYears",..: 4 2 2 3 5 4 3 5 4 4 ...
##  $ JobType      : Factor w/ 18 levels "Accomm/Foo","Agricultur",..: 6 NA 7 4 7 7 16 3 15 6 ...
##  $ JobAbsence   : Factor w/ 4 levels "FamilyResponsibility",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ JobStatus    : Factor w/ 2 levels "Full-time","Part-time": 2 NA NA 2 1 1 1 1 NA 1 ...
##  $ WorkHours    : num  8 NA NA 7 40 40 40 40 NA 40 ...
##  $ Tenure       : int  67 NA NA 8 168 120 154 34 NA 116 ...
##  $ HourlyEarn   : num  19.2 NA NA NA 18.3 ...
##  $ Union        : Factor w/ 2 levels "Not member","Union memb": 1 NA NA NA 1 2 1 1 NA NA ...
##  $ Frequency    : int  598 690 769 640 508 552 559 502 730 915 ...
## NULL

First we loaded the data. Now let's run the required packages.

library(ggplot2)
library(xtable)
library(lattice)
library(plyr)
library(mgcv)

## Loading required package: nlme
## This is mgcv 1.7-26. For overview type 'help("mgcv-package")'.

Let see how people are distributed among different jobs.

lftemp <- within(lfsdat[!is.na(lfsdat$JobType), ], JobType <- reorder(JobType, 
    Sex, length, order = TRUE))  # Reorder on increasing count of each JobType
ggplot(lftemp[!is.na(lftemp$JobType), ], aes(x = JobType, fill = Sex)) + geom_bar(position = "dodge") + 
    theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1)) + ggtitle("Workers in different Jobs")

plot of chunk unnamed-chunk-11

As we see people are mostly involved in Retail-Trade jobs and a very little in Agriculture jobs. Healthcare has attracted women the most, among different job types.On the other hand, Agriculture has been among least atractive jobs for both men and women.

ggplot(lfsdat[!is.na(lfsdat$JobAbsence), ], aes(x = JobAbsence, fill = Sex)) + 
    geom_bar(position = "dodge") + theme(axis.text.x = element_text(angle = 45, 
    hjust = 1, vjust = 1)) + ggtitle("Reason of Being Absent from the Main Job")

plot of chunk unnamed-chunk-12

People are absent at their jobs mostly to take a vacation. The intersting point here is Family Responsibility, which prevents women way more than men from going to their jobs.

Let's take a look at distribution of wages across provinces. As we observe from the data, wages in Quebec is generally more concentrated at smaller values compared to the other provinces.

lf2 <- within(lfsdat[!is.na(lfsdat$HourlyEarn), ], Province <- reorder(Province, 
    HourlyEarn, min, order = TRUE))
ggplot(lf2[!is.na(lf2$HourlyEarn), ], aes(x = HourlyEarn, color = Province)) + 
    geom_density(lwd = 1) + ggtitle("Distribution of Wage in Different Provinces")

plot of chunk unnamed-chunk-13

In order to get a sense of growth in wages let try to look at boxplots.

ggplot(lfsdat[!is.na(lfsdat$HourlyEarn), ], aes(x = as.factor(Year), y = HourlyEarn)) + 
    geom_boxplot(lwd = 1, trim = TRUE) + theme(axis.text.x = element_text(angle = 45, 
    hjust = 1, vjust = 1)) + ggtitle("Hourly Wage over Time") + xlab("Year")

plot of chunk unnamed-chunk-14

As we see overally the median wages has been increasing. Let's find this growth in wage more formally. To this end, first we form the median wages over time. Next we will find the regression paramaters for the trend in median hourly wages (over years starting 1997) for each province and summarize the results in the following table.

MedWage <- ddply(lfsdat[!is.na(lfsdat$HourlyEarn), ], ~Province + Year, summarize, 
    MedianWage = median(HourlyEarn))
names(MedWage) <- c("Province", "Year", "MedWage")
yearMin <- 1997
RegWageYear <- function(x) {
    ## This function finds the regression parameters
    RegCoefs <- coef(lm(MedWage ~ I(Year - yearMin), x))
    names(RegCoefs) <- c("intercept", "slope")
    return(RegCoefs)
}
WageTrend <- ddply(MedWage, ~Province, RegWageYear)
WageTrendTable <- xtable(WageTrend)
print(WageTrendTable, type = "html", include.rownames = FALSE)

Province	intercept	slope
Alberta	12.49	0.70
BritishColumbia	15.53	0.34
Ontario	14.26	0.45
Quebec	13.30	0.42

As we see, Alberta has the fastest growth in median wages accross the time. British Columbia (unfortunately!) has experenced the slowest growth in wages.

Finally the overal trend in wages for different provinces can be shown as the next figure.

ggplot(lfsdat[!is.na(lfsdat$HourlyEarn), ], aes(x = Year, y = HourlyEarn, color = Sex)) + 
    geom_jitter(position = position_jitter(width = 0.1)) + facet_wrap(~Province) + 
    geom_line(stat = "summary", fun.y = "median", col = "black", lwd = 1) + 
    ggtitle("Hourly Wage over Time") + xlab("Year")

plot of chunk unnamed-chunk-16

To get a better insight into how the median wage in different provinces has changed over time we can have the following figure.

lfstemp <- lfsdat[!is.na(lfsdat$HourlyEarn), ]
MedianWage <- ddply(lfstemp, .(Year, Province), summarize, MedWage = median(HourlyEarn))
ggplot(MedianWage, aes(x = Year, y = MedWage)) + facet_wrap(~Province) + geom_point(cex = 1) + 
    geom_line(lwd = 1, color = "blue") + geom_smooth(method = "lm", colour = "red", 
    aes(group = 1)) + ggtitle("Growth in Median Wage over Time for different Provinces")

plot of chunk unnamed-chunk-17

Now take a look at the hourly wage of different levels of educations. The x-axis in the following figure represents the highest educational level an individual attained.

lfstemp <- lfsdat[!is.na(lfsdat$Educations), ]  ## to get rid of NA parts
lfstemp <- lfstemp[!is.na(lfstemp$HourlyEarn), ]  ## to get rid of NA parts
lfstemp <- within(lfstemp, Province <- reorder(Province, HourlyEarn, min, order = TRUE))
ggplot(lfstemp, aes(x = (Educations), y = HourlyEarn, fill = Province, order = -as.numeric(Province))) + 
    geom_boxplot(alpha = 0.2) + theme(legend.position = "bottom") + ggtitle("Wage levels among Educational levels")

plot of chunk unnamed-chunk-18

As we expected, genrally, higher level of education leads to higher wages. There is an interesting fact about Alberta which has relatively high wages among its Eight-Year-Educated people (compared to other levels of educational attainment also compared to other provinces.)

The fact that people with higher educational level enjoy higher hourly wages can be simply seen in the following plot, which depicts the median wage for each group of educational attainment. We have sepparated the individuals by their sex. And unfortunately, hourly wages of women are lower than that of men for any level of educations.

lfstemp <- lfsdat[!is.na(lfsdat$HourlyEarn), ]
MedianWage <- ddply(lfstemp, .(Educations, Sex), summarize, MedWage = median(HourlyEarn), 
    order = TRUE)
ggplot(MedianWage, aes(x = Educations, y = MedWage, col = Sex)) + geom_point(cex = 4) + 
    geom_line(aes(group = Sex)) + theme(axis.text.x = element_text(angle = 45, 
    hjust = 1, vjust = 1)) + ggtitle("Median Wages for different Educational levels")

plot of chunk unnamed-chunk-19

Let see how hourly wages changes across different age groups.

lfstemp <- lfsdat[!is.na(lfsdat$Age), ]
lfstemp <- lfstemp[!is.na(lfstemp$HourlyEarn), ]
lfstemp <- within(lfstemp, JobStatus <- reorder(JobStatus, HourlyEarn, min, 
    order = TRUE))
ggplot(lfstemp, aes(x = Age, y = HourlyEarn, color = JobStatus, order = as.numeric(JobStatus))) + 
    geom_jitter(position = position_jitter(width = 0.1)) + ylab("Hourly Wage") + 
    ggtitle(" Wage for Different Ages") + theme(axis.text.x = element_text(angle = 45)) + 
    scale_fill_brewer("JobStatus", type = "qual", palette = 3) + geom_smooth(method = "gam", 
    formula = y ~ s(x), colour = "purple", aes(group = 1))

plot of chunk unnamed-chunk-20

As we expected, people start with lower wages and then take more as they get older until their 60's, where their wages start to decline.

In this part we want to rank different types of jobs based on their median wages.

lfstemp <- lfsdat[!is.na(lfsdat$HourlyEarn), ]
lfstemp <- within(lfstemp, JobType <- reorder(JobType, HourlyEarn, median, order = TRUE))
ggplot(lfstemp[!is.na(lfstemp$HourlyEarn), ], aes(x = as.factor(JobType), y = HourlyEarn)) + 
    geom_boxplot(lwd = 1, trim = TRUE, color = "brown") + theme(axis.text.x = element_text(angle = 45, 
    hjust = 1, vjust = 1)) + ggtitle("Hourly Wage for different Jobs") + xlab("Job Type")

As we see jobs in Forestry have the most median wage. Furthermore those who work in Accommodation and Food Services and also Agriculture, have the least median wages.

Now let see how union members are distributed among employees for different Provinces over time.

lfstemp <- lfsdat[!is.na(lfsdat$Union), ]
lftemp <- within(lfstemp, Province <- reorder(Province, Union, length, order = TRUE))
ggplot(lfstemp, aes(x = Province, fill = Union)) + geom_bar(position = "dodge") + 
    facet_wrap(~Year) + theme(axis.text.x = element_text(angle = 45, hjust = 1, 
    vjust = 1)) + ggtitle(" Union Members in different Provinces") + coord_flip()

plot of chunk unnamed-chunk-22

The point is that employees mostly are not interested in being union member. This fact can be observed from Ontario which seems to have the highest number of employees.

In the next part we will see how many hours a median employee works in a week among different age groups.

lfstemp <- lfsdat[!is.na(lfsdat$WorkHours), ]
lfstemp <- lfstemp[!is.na(lfstemp$Age), ]
ggplot(lfstemp, aes(x = Age, y = WorkHours)) + facet_wrap(~Sex, ncol = 2) + 
    geom_smooth(method = "gam", formula = y ~ s(x), colour = "purple", aes(group = 1)) + 
    theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1)) + ggtitle(" Hours of Work in a week across age groups")

plot of chunk unnamed-chunk-23

Naturally people that are too old (55+) or too young (25-) work less than when they are in other ages. Morover at their working ages (25-55) people (in median sense) have relatively trendless amount of hours work in a week.

In the following graph we see the distribution of workers tenure (weeks) for all provinces.

lf2 <- within(lfsdat[!is.na(lfsdat$Tenure), ], Province <- reorder(Province, 
    Tenure, min, order = TRUE))
ggplot(lf2[!is.na(lf2$Tenure), ], aes(x = Tenure, color = Province)) + geom_density(lwd = 1) + 
    ggtitle("Distribution of Employee Tenures")

plot of chunk unnamed-chunk-24

The interesting fact is that this distribution seems to be identical for these provinces. The other interesting point is that there is a relatively high concentration of people withat high tenure times (almost 4 years). I have no idea why this happens.

Conclusion

In this assginment, I worked on Canadian LFS data from 1997 to 2013. I focused on some of the variables describing the labor force among different provinces, age groups, etc. There were interesting results, such as gender gap in hourly wages and the trend of wage increase among different Provinces. However, the data was too much detailed in categorical aspects of observations, and limited in quantitative measures of the labor force. It was also reflecting a short period of time which makes our suggestions less reliable.

Another difficulty to deal with data was the frequency weights and the way we used it made some biasedness in the data. However, the results made sense to a good extent which means that we have not lost too much.

All in all this sort of exercise to clean up a huge data was a good experiencec for me, eventhough it took a lot of time.