The dataset I will be analyzing is the Labour Force Survey (LFS) July 2017–2021 Subset, collected by Statistics Canada. This dataset includes detailed information on employment, hours worked, wages, education, and demographics for workers in Canada.
The dataset is secondary data obtained through the Data Liberation Initiative (DLI) and accessed via the course website. It contains 24,086 observations on individual workers across 22 variables, drawn from the July surveys of five consecutive years (2017 to 2021).
Variables include:
- Demographics (age, marital status, children)
- Employment details (job tenure, hours worked, earnings)
- Unique respondent IDs and weights
I will explore relationships between family structure, job tenure, and working hours among employed workers.
# Load the tidyverse (readr for read_csv(); dplyr for mutate(), select(), summarise())
library(tidyverse)
# Read the CSV file with read_csv()
lfs_data <- read_csv("LFS_July17_21.csv")
## Rows: 24086 Columns: 22
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (22): REC_NUM, FINALWT, AGYOWNK, LFSSTAT, PROV, CMA, AGE_12, SEX, MARSTA...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Exploring structure
dim(lfs_data) # Get number of rows and columns
## [1] 24086 22
names(lfs_data) # List variable names
## [1] "REC_NUM" "FINALWT" "AGYOWNK" "LFSSTAT" "PROV" "CMA"
## [7] "AGE_12" "SEX" "MARSTAT" "EDUC" "IMMIG" "EFAMTYPE"
## [13] "HRLYEARN" "COWMAIN" "PERMTEMP" "FIRMSIZE" "SURVYEAR" "UNION"
## [19] "NOC_10" "UTOTHRS" "TENURE" "HRSTOT"
head(lfs_data) # View first six rows
## # A tibble: 6 × 22
## REC_NUM FINALWT AGYOWNK LFSSTAT PROV CMA AGE_12 SEX MARSTAT EDUC IMMIG
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 92118 86 1 1 47 0 2 2 2 4 3
## 2 14339 139 0 1 47 0 2 1 6 3 3
## 3 81765 105 0 1 13 0 5 2 6 4 3
## 4 51258 167 0 1 46 6 2 2 6 5 3
## 5 98910 78 1 1 47 0 3 2 4 5 3
## 6 37879 141 0 1 24 0 2 1 6 4 3
## # ℹ 11 more variables: EFAMTYPE <dbl>, HRLYEARN <dbl>, COWMAIN <dbl>,
## # PERMTEMP <dbl>, FIRMSIZE <dbl>, SURVYEAR <dbl>, UNION <dbl>, NOC_10 <dbl>,
## # UTOTHRS <dbl>, TENURE <dbl>, HRSTOT <dbl>
summary(lfs_data) # Summary of each column (many columns are numeric category codes, so their means are not meaningful)
## REC_NUM FINALWT AGYOWNK LFSSTAT
## Min. : 4 Min. : 8 Min. :0.0000 Min. :1.000
## 1st Qu.: 24327 1st Qu.: 117 1st Qu.:0.0000 1st Qu.:1.000
## Median : 48484 Median : 188 Median :0.0000 Median :1.000
## Mean : 48568 Mean : 332 Mean :0.7852 Mean :1.139
## 3rd Qu.: 72378 3rd Qu.: 422 3rd Qu.:1.0000 3rd Qu.:1.000
## Max. :102898 Max. :3158 Max. :4.0000 Max. :2.000
## PROV CMA AGE_12 SEX MARSTAT
## Min. :10.00 Min. :0.000 Min. : 1.000 Min. :1.000 Min. :1
## 1st Qu.:24.00 1st Qu.:0.000 1st Qu.: 3.000 1st Qu.:1.000 1st Qu.:1
## Median :35.00 Median :0.000 Median : 6.000 Median :1.000 Median :2
## Mean :35.31 Mean :1.558 Mean : 5.782 Mean :1.496 Mean :3
## 3rd Qu.:47.00 3rd Qu.:2.000 3rd Qu.: 8.000 3rd Qu.:2.000 3rd Qu.:6
## Max. :59.00 Max. :9.000 Max. :12.000 Max. :2.000 Max. :6
## EDUC IMMIG EFAMTYPE HRLYEARN
## Min. :0.000 Min. :1.000 Min. : 1.000 Min. : 3.136
## 1st Qu.:2.000 1st Qu.:3.000 1st Qu.: 2.000 1st Qu.: 17.198
## Median :4.000 Median :3.000 Median : 3.000 Median : 24.387
## Mean :3.555 Mean :2.771 Mean : 5.063 Mean : 28.144
## 3rd Qu.:5.000 3rd Qu.:3.000 3rd Qu.: 6.000 3rd Qu.: 35.613
## Max. :6.000 Max. :3.000 Max. :18.000 Max. :125.490
## COWMAIN PERMTEMP FIRMSIZE SURVYEAR UNION
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :2017 Min. :1.000
## 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:2.000 1st Qu.:2018 1st Qu.:1.000
## Median :2.000 Median :1.000 Median :3.000 Median :2019 Median :2.000
## Mean :1.737 Mean :1.291 Mean :2.911 Mean :2019 Mean :1.675
## 3rd Qu.:2.000 3rd Qu.:1.000 3rd Qu.:4.000 3rd Qu.:2020 3rd Qu.:2.000
## Max. :2.000 Max. :4.000 Max. :4.000 Max. :2021 Max. :2.000
## NOC_10 UTOTHRS TENURE HRSTOT
## Min. : 1.000 Min. : 0.30 Min. : 1.00 Min. : 0
## 1st Qu.: 3.000 1st Qu.:35.00 1st Qu.: 15.00 1st Qu.:1715
## Median : 6.000 Median :40.00 Median : 54.00 Median :2000
## Mean : 5.448 Mean :36.83 Mean : 86.58 Mean :1839
## 3rd Qu.: 7.000 3rd Qu.:40.00 3rd Qu.:146.00 3rd Qu.:2040
## Max. :10.000 Max. :99.00 Max. :240.00 Max. :5049
# Recode AGYOWNK into a binary ANYKIDS variable (code 0 is treated as "No Children")
lfs_data <- lfs_data %>%
  mutate(ANYKIDS = if_else(AGYOWNK == 0, "No Children", "Has Children"))
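A quick cross-tabulation of the original codes against the new variable is one way to confirm that a recode like this behaved as intended. The sketch below uses a small made-up tibble named `toy` in place of `lfs_data` (the values are illustrative assumptions, not LFS records), and it also shows that `if_else()` leaves a missing input as `NA` rather than assigning it to either group.

```r
library(dplyr)

# Hypothetical toy data standing in for lfs_data (illustrative values only)
toy <- tibble(AGYOWNK = c(0, 1, 2, 0, 4, NA))

# Same recode as in the report
toy <- toy %>%
  mutate(ANYKIDS = if_else(AGYOWNK == 0, "No Children", "Has Children"))

# Cross-tabulate the original codes against the new variable, keeping NAs visible
check <- table(toy$AGYOWNK, toy$ANYKIDS, useNA = "ifany")
print(check)
```

Every row with code 0 should fall in the "No Children" column and every other non-missing code in "Has Children".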
# Recode AGE_12 into six broader age groups (AGE_NEW) using case_when()
lfs_data <- lfs_data %>%
  mutate(AGE_NEW = case_when(
    AGE_12 %in% c(1, 2) ~ "15-24",
    AGE_12 %in% c(3, 4) ~ "25-34",
    AGE_12 %in% c(5, 6) ~ "35-44",
    AGE_12 %in% c(7, 8) ~ "45-54",
    AGE_12 %in% c(9, 10) ~ "55-64",
    AGE_12 %in% c(11, 12) ~ "65+",
    TRUE ~ NA_character_
  ))
# Recode TENURE (in months) into four categories (TENURE2) using case_when()
lfs_data <- lfs_data %>%
  mutate(TENURE2 = case_when(
    TENURE < 12 ~ "Less than 12 months",
    TENURE >= 12 & TENURE < 60 ~ "12–59 months",
    TENURE >= 60 & TENURE < 120 ~ "60–119 months",
    TENURE >= 120 ~ "120+ months",
    TRUE ~ NA_character_
  ))
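Because TENURE2 is stored as text, `table()` sorts its categories alphabetically rather than by duration. One possible remedy, sketched here on a made-up character vector (the values and the names `tenure2`, `tenure_levels`, and `tenure2_f` are illustrative assumptions), is to convert the variable to a factor with explicitly ordered levels.

```r
# Hypothetical tenure categories (illustrative only, not from the LFS file)
tenure2 <- c("12–59 months", "Less than 12 months", "120+ months",
             "60–119 months", "12–59 months")

# Levels listed from shortest to longest tenure
tenure_levels <- c("Less than 12 months", "12–59 months",
                   "60–119 months", "120+ months")

# factor() with explicit levels fixes the order used by table() and by plots
tenure2_f <- factor(tenure2, levels = tenure_levels)

table(tenure2_f)
```

In the report itself, the same idea could be applied as `mutate(TENURE2 = factor(TENURE2, levels = tenure_levels))` after the recode.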
# Keep only the variables needed for the analysis
lfs_subset <- lfs_data %>%
  select(REC_NUM, FINALWT, AGE_NEW, ANYKIDS, TENURE2, HRLYEARN, MARSTAT, UTOTHRS)
# Calculate summary statistics for hourly earnings and weekly hours using summarise()
summary_stats <- lfs_subset %>%
  summarise(
    Mean_Earnings = mean(HRLYEARN, na.rm = TRUE),
    Median_Earnings = median(HRLYEARN, na.rm = TRUE),
    SD_Earnings = sd(HRLYEARN, na.rm = TRUE),
    Mean_Hours = mean(UTOTHRS, na.rm = TRUE),
    Median_Hours = median(UTOTHRS, na.rm = TRUE),
    SD_Hours = sd(UTOTHRS, na.rm = TRUE)
  )
# Frequency tables using table()
age_table <- table(lfs_subset$AGE_NEW)
child_table <- table(lfs_subset$ANYKIDS)
tenure_table <- table(lfs_subset$TENURE2)
marital_table <- table(lfs_subset$MARSTAT)
# Show output
summary_stats
## # A tibble: 1 × 6
## Mean_Earnings Median_Earnings SD_Earnings Mean_Hours Median_Hours SD_Hours
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 28.1 24.4 14.3 36.8 40 10.5
age_table
##
## 15-24 25-34 35-44 45-54 55-64 65+
## 3832 4902 5185 5090 4228 849
child_table
##
## Has Children No Children
## 8684 15402
tenure_table
##
## 12–59 months 120+ months 60–119 months Less than 12 months
## 7596 7514 4057 4919
marital_table
##
## 1 2 3 4 5 6
## 10790 3800 250 640 1074 7532
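Raw counts are easier to interpret as percentages. As a minimal sketch, the counts printed by `child_table` above can be passed to `prop.table()`; the vector name `child_counts` is introduced here only for illustration.

```r
# Counts copied from the child_table output above
child_counts <- c("Has Children" = 8684, "No Children" = 15402)

# prop.table() divides each count by the total; convert to percentages
child_share <- round(100 * prop.table(child_counts), 1)
child_share
```

The same call works directly on a table object, for example `prop.table(child_table)`.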
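The average hourly wage was approximately $28.14, with a median of $24.39. The standard deviation was $14.28, roughly half the mean, indicating substantial variation in wages; the mean exceeding the median also points to a right-skewed wage distribution. Workers reported working an average of 36.83 hours per week, with a median of 40 hours and a standard deviation of about 10.5 hours, suggesting many are employed full-time. About 81% of respondents are aged 25 to 64, and fewer than 4% are 65 or older. Job tenure is concentrated in the 12–59 month (about 32%) and 120+ month (about 31%) categories, and the most common marital status code is 1 (about 45% of the sample). Approximately 36% of the sample had children. All of these figures are unweighted, since FINALWT was not applied.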
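The statistics above treat every respondent equally. If FINALWT is the survey weight (an assumption here, based on the variable list describing "weights"), a weighted mean can be computed with base R's `weighted.mean()`. The sketch below uses three made-up rows, not LFS records, to show how the two estimates can differ.

```r
# Hypothetical toy rows (illustrative values only, not from the LFS file)
toy <- data.frame(
  HRLYEARN = c(20, 30, 40),
  FINALWT  = c(100, 100, 200)
)

# Every row counts equally
unweighted_mean <- mean(toy$HRLYEARN)

# Each row counts in proportion to its weight (assumed to be the survey weight)
weighted_mean <- weighted.mean(toy$HRLYEARN, w = toy$FINALWT)

c(unweighted = unweighted_mean, weighted = weighted_mean)
```

Here the third row carries twice the weight of the others, so the weighted mean is pulled toward its higher wage.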
# Split the data into two subsets by child status using filter()
with_kids <- lfs_subset %>% filter(ANYKIDS == "Has Children")
without_kids <- lfs_subset %>% filter(ANYKIDS == "No Children")
# Summary stats by child status
summary(with_kids$HRLYEARN)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.136 21.000 29.007 32.245 41.177 119.571
summary(without_kids$HRLYEARN)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.48 15.88 21.79 25.83 31.75 125.49
summary(with_kids$UTOTHRS)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 35.00 40.00 38.06 40.00 99.00
summary(without_kids$UTOTHRS)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.30 33.00 40.00 36.14 40.00 99.00
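Approximately 36% of respondents had children, while 64% did not. In this sample, workers with children earned more per hour (mean $32.25, median $29.01) than workers without children (mean $25.83, median $21.79) and worked slightly longer weeks on average (38.06 versus 36.14 hours), although the median was 40 hours in both groups. These are unadjusted comparisons: the gaps may partly reflect differences in age and job tenure between the two groups, as well as household responsibilities, labor flexibility, or structural workplace conditions. Further analysis would be needed to draw definitive conclusions.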
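The group comparison can also be produced in a single table rather than from two filtered subsets. The following is a sketch using `group_by()` and `summarise()` on a small made-up tibble (the values and the names `toy` and `by_kids` are illustrative assumptions), not a rerun of the analysis above.

```r
library(dplyr)

# Hypothetical toy data standing in for lfs_subset (illustrative values only)
toy <- tibble(
  ANYKIDS  = c("Has Children", "Has Children", "No Children", "No Children"),
  HRLYEARN = c(30, 34, 22, 26),
  UTOTHRS  = c(40, 38, 36, 40)
)

# One row per child-status group, with the group size and key statistics
by_kids <- toy %>%
  group_by(ANYKIDS) %>%
  summarise(
    N = n(),
    Mean_Earnings = mean(HRLYEARN, na.rm = TRUE),
    Median_Earnings = median(HRLYEARN, na.rm = TRUE),
    Mean_Hours = mean(UTOTHRS, na.rm = TRUE)
  )

by_kids
```

Keeping both groups in one tibble makes the side-by-side comparison easier to read and to extend to other grouping variables such as AGE_NEW or TENURE2.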