Mohammadreza Bolandnazar
In this Homework, I tried to deal with a new dataset. After my first experience with a dataset about Canadians labor economics in the last homework: Work on CANSIM tables I decided to find a more detailed and complete dataset about what I did in that assignment. I figured out that some of my colleagues have used an interesting similar data: Canadian Labor Force (LFS). I took the raw data from Jonathan Zhang, and started cleaning it. This data is mostly about labor force measures such as wages, tenure, occupation, unemployment status, etc.
This large data set covers years 1997-2013 and each year about 100000 distinct entries. Each entry is a representative of some individuals with the same statistics, where its frequency has been reported in the “Frequency Weight” assign to that entry (at the very last column of data). All in all, each year has a summary of labor force information of about 25 million people. Doubtlessly, we should pick a sample from this large data. The challenging part is the frequency weights of each entry. The way we deal with it was to sample among those entries with a significant level of frequency. For example we randomly selected entries among those with frequency weight of at least 500. Even though this method of sampling is a little biased, but it reduces the impact of rare data on final observations. We also picked some of the variables of the data.
Obviously, I spent a lot of time cleaning the data and anyone wants to use the raw data would be aware that because of the volume of the data, it may take some time to run the data cleaner files.
Finally, we have picked these variables:
Year The survey year, from 1997 to 2013
Province The raw data has 10 provinces (PROV) but after sampling the data we had 5 provinces: Alberta, British Columbia, Manitoba, Ontario and Quebec.
Age The age group that the individual is assigned to
Sex Female or Male
Marital Status In three categories: Single, Married and Separated
Educations Highest educational attainment
JobType The type of work the individual does such as Agriculture, Construction,Education, etc.
JobAbsence for an employed person: reason absent full week
JobStatus Declaring Full time/Part time
Work Hours Usual hours per week at main job
Tenure Job tenure in months, for currently employed only
HourlyEarn for employees: Usual hourly wages
Union Union membership status, employees only
Frequency Final individual or family weight.
The raw data for each year, was saved in a separate [csv] file. So our first task was to pick, randomly, a sample of 1000 entries, in each year, among those entries with frequency weight (last column) of at least 500. Then we choose the relevant variables and form a summarized data for each year. At last we combine all these files together to form a single data.
For now, we need only two libraries
library(ggplot2)
library(plyr)
Now read the raw data from separate files.
# READING DATA AND CLEANIN
path <- "LFSDATA/"
filenames <- list.files(path) # this reads all the data files titled: 1997..2013
We want to build a function for reading the files of 1997 to 2013
jFun <- function(x) {
BIGDAT <- read.csv(x, na.strings = "") #Resolves the empty cells with NA
BIGDAT <- subset(BIGDAT, BIGDAT$FWEIGHT > 500) #ignores the less frequent entities
Edat <- BIGDAT[sample(nrow(BIGDAT), 1000, replace = FALSE, prob = NULL),
] # Sample 1000 rows out of current file
fDat <- subset(Edat, select = c(SURVYEAR, PROV, AGE_12, SEX, MARSTAT, EDUC90,
NAICS_18, YABSENT, FTPTMAIN, UHRSMAIN, TENURE, HRLYEARN, UNION, FWEIGHT)) ## Selects some of the columns
return(fDat)
}
## Apply the function above on all the data files
setwd("LFSDATA/") ## goes in the input directory to read and combine
lfsRAW <- do.call("rbind", lapply(filenames, jFun))
setwd("..") ## come back to the main directory
Let check that the data has loaded correctly. It must have 17000 observations.
dim(lfsRAW) ## Sanity Check
## [1] 17000 14
Now let make the input tidier and cleaner, by changing the names and etc.
lfsdat <- lfsRAW
## Change the names a little bit
levels(lfsdat$MARSTAT) <- list(Married = "Married or", Married = "Married",
Single = "Single, ne", Single = "Widowed", Separated = "Separated/", Separated = "Separated",
Separated = "Divorced", Married = "Living in")
levels(lfsdat$PROV) <- list(Alberta = "Alberta", BritishColumbia = "British Co",
Manitoba = "Manitoba", NewBrunswick = "New Brunsw", Newfoundland = "Newfoundla",
NovaScotia = "Nova Scoti", Ontario = "Ontario", PrinceEdward = "Prince Edw",
Quebec = "Qu\xe9bec", Saskatchewan = "Saskatchew")
levels(lfsdat$EDUC90) <- list(A.EightYears = "0 to 8 yea", B.Secondary = "Some secon",
C.Grade11to13 = "Grade 11 t", D.PostSecondary = "Post secon", D.PostSecondary = "Some post",
E.University = "University")
levels(lfsdat$YABSENT) <- list(Vacation = "Vacation", FamilyResponsibility = "Personal o",
OwnIllness = "Own illnes", Other = "Other")
c <- c("Year", "Province", "Age", "Sex", "MaritalStatus", "Educations", "JobType",
"JobAbsence", "JobStatus", "WorkHours", "Tenure", "HourlyEarn", "Union",
"Frequency")
names(lfsdat) <- c
Next, we rearrange our data to be nicely ordered
lfsdat = arrange(lfsdat, Year, Province, Sex)
Let's take a look at the data, via a figure.
ggplot(lfsdat[!is.na(lfsdat$HourlyEarn), ], aes(x = Province, y = HourlyEarn)) +
geom_jitter(color = "brown") + ggtitle("Hourly Wages in different Provinces")
As we see it is better to drop Manitoba, since it has very few samples. Also the initial number of Provinces in the raw data was 10 and this says that in the process of making the data some of them have automaically omitted.
lfsdat <- subset(lfsdat, Province != "Manitoba")
write.table(lfsdat, "lfsdat.csv", quote = FALSE, sep = ",", row.names = FALSE)
lfsdat <- read.csv("lfsdat.csv") # Load the data
print(str(lfsdat)) # Structure of the data
## 'data.frame': 16999 obs. of 14 variables:
## $ Year : int 1997 1997 1997 1997 1997 1997 1997 1997 1997 1997 ...
## $ Province : Factor w/ 4 levels "Alberta","BritishColumbia",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Age : Factor w/ 12 levels "15 to 19","20 to 24",..: 6 12 6 5 5 9 6 4 9 7 ...
## $ Sex : Factor w/ 2 levels "Female","Male": 1 1 1 1 1 1 2 2 2 2 ...
## $ MaritalStatus: Factor w/ 3 levels "Married","Separated",..: 1 3 1 1 1 1 2 3 1 1 ...
## $ Educations : Factor w/ 5 levels "A.EightYears",..: 4 2 2 3 5 4 3 5 4 4 ...
## $ JobType : Factor w/ 18 levels "Accomm/Foo","Agricultur",..: 6 NA 7 4 7 7 16 3 15 6 ...
## $ JobAbsence : Factor w/ 4 levels "FamilyResponsibility",..: NA NA NA NA NA NA NA NA NA NA ...
## $ JobStatus : Factor w/ 2 levels "Full-time","Part-time": 2 NA NA 2 1 1 1 1 NA 1 ...
## $ WorkHours : num 8 NA NA 7 40 40 40 40 NA 40 ...
## $ Tenure : int 67 NA NA 8 168 120 154 34 NA 116 ...
## $ HourlyEarn : num 19.2 NA NA NA 18.3 ...
## $ Union : Factor w/ 2 levels "Not member","Union memb": 1 NA NA NA 1 2 1 1 NA NA ...
## $ Frequency : int 598 690 769 640 508 552 559 502 730 915 ...
## NULL
First we loaded the data. Now let's run the required packages.
library(ggplot2)
library(xtable)
library(lattice)
library(plyr)
library(mgcv)
## Loading required package: nlme
## This is mgcv 1.7-26. For overview type 'help("mgcv-package")'.
Let see how people are distributed among different jobs.
lftemp <- within(lfsdat[!is.na(lfsdat$JobType), ], JobType <- reorder(JobType,
Sex, length, order = TRUE)) # Reorder on increasing count of each JobType
ggplot(lftemp[!is.na(lftemp$JobType), ], aes(x = JobType, fill = Sex)) + geom_bar(position = "dodge") +
theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1)) + ggtitle("Workers in different Jobs")
As we see people are mostly involved in Retail-Trade jobs and a very little in Agriculture jobs. Healthcare has attracted women the most, among different job types.On the other hand, Agriculture has been among least atractive jobs for both men and women.
ggplot(lfsdat[!is.na(lfsdat$JobAbsence), ], aes(x = JobAbsence, fill = Sex)) +
geom_bar(position = "dodge") + theme(axis.text.x = element_text(angle = 45,
hjust = 1, vjust = 1)) + ggtitle("Reason of Being Absent from the Main Job")
People are absent at their jobs mostly to take a vacation. The intersting point here is Family Responsibility, which prevents women way more than men from going to their jobs.
Let's take a look at distribution of wages across provinces. As we observe from the data, wages in Quebec is generally more concentrated at smaller values compared to the other provinces.
lf2 <- within(lfsdat[!is.na(lfsdat$HourlyEarn), ], Province <- reorder(Province,
HourlyEarn, min, order = TRUE))
ggplot(lf2[!is.na(lf2$HourlyEarn), ], aes(x = HourlyEarn, color = Province)) +
geom_density(lwd = 1) + ggtitle("Distribution of Wage in Different Provinces")
In order to get a sense of growth in wages let try to look at boxplots.
ggplot(lfsdat[!is.na(lfsdat$HourlyEarn), ], aes(x = as.factor(Year), y = HourlyEarn)) +
geom_boxplot(lwd = 1, trim = TRUE) + theme(axis.text.x = element_text(angle = 45,
hjust = 1, vjust = 1)) + ggtitle("Hourly Wage over Time") + xlab("Year")
As we see overally the median wages has been increasing. Let's find this growth in wage more formally. To this end, first we form the median wages over time. Next we will find the regression paramaters for the trend in median hourly wages (over years starting 1997) for each province and summarize the results in the following table.
MedWage <- ddply(lfsdat[!is.na(lfsdat$HourlyEarn), ], ~Province + Year, summarize,
MedianWage = median(HourlyEarn))
names(MedWage) <- c("Province", "Year", "MedWage")
yearMin <- 1997
RegWageYear <- function(x) {
## This function finds the regression parameters
RegCoefs <- coef(lm(MedWage ~ I(Year - yearMin), x))
names(RegCoefs) <- c("intercept", "slope")
return(RegCoefs)
}
WageTrend <- ddply(MedWage, ~Province, RegWageYear)
WageTrendTable <- xtable(WageTrend)
print(WageTrendTable, type = "html", include.rownames = FALSE)
| Province | intercept | slope |
|---|---|---|
| Alberta | 12.49 | 0.70 |
| BritishColumbia | 15.53 | 0.34 |
| Ontario | 14.26 | 0.45 |
| Quebec | 13.30 | 0.42 |
As we see, Alberta has the fastest growth in median wages accross the time. British Columbia (unfortunately!) has experenced the slowest growth in wages.
Finally the overal trend in wages for different provinces can be shown as the next figure.
ggplot(lfsdat[!is.na(lfsdat$HourlyEarn), ], aes(x = Year, y = HourlyEarn, color = Sex)) +
geom_jitter(position = position_jitter(width = 0.1)) + facet_wrap(~Province) +
geom_line(stat = "summary", fun.y = "median", col = "black", lwd = 1) +
ggtitle("Hourly Wage over Time") + xlab("Year")
To get a better insight into how the median wage in different provinces has changed over time we can have the following figure.
lfstemp <- lfsdat[!is.na(lfsdat$HourlyEarn), ]
MedianWage <- ddply(lfstemp, .(Year, Province), summarize, MedWage = median(HourlyEarn))
ggplot(MedianWage, aes(x = Year, y = MedWage)) + facet_wrap(~Province) + geom_point(cex = 1) +
geom_line(lwd = 1, color = "blue") + geom_smooth(method = "lm", colour = "red",
aes(group = 1)) + ggtitle("Growth in Median Wage over Time for different Provinces")
Now take a look at the hourly wage of different levels of educations. The x-axis in the following figure represents the highest educational level an individual attained.
lfstemp <- lfsdat[!is.na(lfsdat$Educations), ] ## to get rid of NA parts
lfstemp <- lfstemp[!is.na(lfstemp$HourlyEarn), ] ## to get rid of NA parts
lfstemp <- within(lfstemp, Province <- reorder(Province, HourlyEarn, min, order = TRUE))
ggplot(lfstemp, aes(x = (Educations), y = HourlyEarn, fill = Province, order = -as.numeric(Province))) +
geom_boxplot(alpha = 0.2) + theme(legend.position = "bottom") + ggtitle("Wage levels among Educational levels")
As we expected, genrally, higher level of education leads to higher wages. There is an interesting fact about Alberta which has relatively high wages among its Eight-Year-Educated people (compared to other levels of educational attainment also compared to other provinces.)
The fact that people with higher educational level enjoy higher hourly wages can be simply seen in the following plot, which depicts the median wage for each group of educational attainment. We have sepparated the individuals by their sex. And unfortunately, hourly wages of women are lower than that of men for any level of educations.
lfstemp <- lfsdat[!is.na(lfsdat$HourlyEarn), ]
MedianWage <- ddply(lfstemp, .(Educations, Sex), summarize, MedWage = median(HourlyEarn),
order = TRUE)
ggplot(MedianWage, aes(x = Educations, y = MedWage, col = Sex)) + geom_point(cex = 4) +
geom_line(aes(group = Sex)) + theme(axis.text.x = element_text(angle = 45,
hjust = 1, vjust = 1)) + ggtitle("Median Wages for different Educational levels")
Let see how hourly wages changes across different age groups.
lfstemp <- lfsdat[!is.na(lfsdat$Age), ]
lfstemp <- lfstemp[!is.na(lfstemp$HourlyEarn), ]
lfstemp <- within(lfstemp, JobStatus <- reorder(JobStatus, HourlyEarn, min,
order = TRUE))
ggplot(lfstemp, aes(x = Age, y = HourlyEarn, color = JobStatus, order = as.numeric(JobStatus))) +
geom_jitter(position = position_jitter(width = 0.1)) + ylab("Hourly Wage") +
ggtitle(" Wage for Different Ages") + theme(axis.text.x = element_text(angle = 45)) +
scale_fill_brewer("JobStatus", type = "qual", palette = 3) + geom_smooth(method = "gam",
formula = y ~ s(x), colour = "purple", aes(group = 1))
As we expected, people start with lower wages and then take more as they get older until their 60's, where their wages start to decline.
In this part we want to rank different types of jobs based on their median wages.
lfstemp <- lfsdat[!is.na(lfsdat$HourlyEarn), ]
lfstemp <- within(lfstemp, JobType <- reorder(JobType, HourlyEarn, median, order = TRUE))
ggplot(lfstemp[!is.na(lfstemp$HourlyEarn), ], aes(x = as.factor(JobType), y = HourlyEarn)) +
geom_boxplot(lwd = 1, trim = TRUE, color = "brown") + theme(axis.text.x = element_text(angle = 45,
hjust = 1, vjust = 1)) + ggtitle("Hourly Wage for different Jobs") + xlab("Job Type")
As we see jobs in Forestry have the most median wage. Furthermore those who work in Accommodation and Food Services and also Agriculture, have the least median wages.
Now let see how union members are distributed among employees for different Provinces over time.
lfstemp <- lfsdat[!is.na(lfsdat$Union), ]
lftemp <- within(lfstemp, Province <- reorder(Province, Union, length, order = TRUE))
ggplot(lfstemp, aes(x = Province, fill = Union)) + geom_bar(position = "dodge") +
facet_wrap(~Year) + theme(axis.text.x = element_text(angle = 45, hjust = 1,
vjust = 1)) + ggtitle(" Union Members in different Provinces") + coord_flip()
The point is that employees mostly are not interested in being union member. This fact can be observed from Ontario which seems to have the highest number of employees.
In the next part we will see how many hours a median employee works in a week among different age groups.
lfstemp <- lfsdat[!is.na(lfsdat$WorkHours), ]
lfstemp <- lfstemp[!is.na(lfstemp$Age), ]
ggplot(lfstemp, aes(x = Age, y = WorkHours)) + facet_wrap(~Sex, ncol = 2) +
geom_smooth(method = "gam", formula = y ~ s(x), colour = "purple", aes(group = 1)) +
theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1)) + ggtitle(" Hours of Work in a week across age groups")
Naturally people that are too old (55+) or too young (25-) work less than when they are in other ages. Morover at their working ages (25-55) people (in median sense) have relatively trendless amount of hours work in a week.
In the following graph we see the distribution of workers tenure (weeks) for all provinces.
lf2 <- within(lfsdat[!is.na(lfsdat$Tenure), ], Province <- reorder(Province,
Tenure, min, order = TRUE))
ggplot(lf2[!is.na(lf2$Tenure), ], aes(x = Tenure, color = Province)) + geom_density(lwd = 1) +
ggtitle("Distribution of Employee Tenures")
The interesting fact is that this distribution seems to be identical for these provinces. The other interesting point is that there is a relatively high concentration of people withat high tenure times (almost 4 years). I have no idea why this happens.
In this assginment, I worked on Canadian LFS data from 1997 to 2013. I focused on some of the variables describing the labor force among different provinces, age groups, etc. There were interesting results, such as gender gap in hourly wages and the trend of wage increase among different Provinces. However, the data was too much detailed in categorical aspects of observations, and limited in quantitative measures of the labor force. It was also reflecting a short period of time which makes our suggestions less reliable.
Another difficulty to deal with data was the frequency weights and the way we used it made some biasedness in the data. However, the results made sense to a good extent which means that we have not lost too much.
All in all this sort of exercise to clean up a huge data was a good experiencec for me, eventhough it took a lot of time.