R Homework Assignment #1: Working with Data and Descriptive Statistics

Question 1: Data Overview

The dataset I will be analyzing is the Labour Force Survey (LFS) July 2017–2021 Subset, collected by Statistics Canada. This dataset includes detailed information on employment, hours worked, wages, education, and demographics for workers in Canada.

The dataset represents secondary data from the Data Liberation Initiative (DLI) and was accessed through the course website. The data includes observations on individual workers from July surveys over five years.

Variables include:

- Demographics (age, marital status, children)
- Employment details (job tenure, hours worked, earnings)
- Unique respondent IDs and weights

I will explore relationships between family structure, job tenure, and working hours among employed workers.

Question 2: Load and Explore the Dataset

# Load the tidyverse (provides read_csv() and the dplyr verbs used below)
library(tidyverse)

# Read the CSV file with read_csv()
lfs_data <- read_csv("LFS_July17_21.csv")

## Rows: 24086 Columns: 22
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (22): REC_NUM, FINALWT, AGYOWNK, LFSSTAT, PROV, CMA, AGE_12, SEX, MARSTA...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# Exploring structure 
dim(lfs_data)       # Get number of rows and columns

## [1] 24086    22

names(lfs_data)     # List variable names

##  [1] "REC_NUM"  "FINALWT"  "AGYOWNK"  "LFSSTAT"  "PROV"     "CMA"     
##  [7] "AGE_12"   "SEX"      "MARSTAT"  "EDUC"     "IMMIG"    "EFAMTYPE"
## [13] "HRLYEARN" "COWMAIN"  "PERMTEMP" "FIRMSIZE" "SURVYEAR" "UNION"   
## [19] "NOC_10"   "UTOTHRS"  "TENURE"   "HRSTOT"

head(lfs_data)      # View first six rows

## # A tibble: 6 × 22
##   REC_NUM FINALWT AGYOWNK LFSSTAT  PROV   CMA AGE_12   SEX MARSTAT  EDUC IMMIG
##     <dbl>   <dbl>   <dbl>   <dbl> <dbl> <dbl>  <dbl> <dbl>   <dbl> <dbl> <dbl>
## 1   92118      86       1       1    47     0      2     2       2     4     3
## 2   14339     139       0       1    47     0      2     1       6     3     3
## 3   81765     105       0       1    13     0      5     2       6     4     3
## 4   51258     167       0       1    46     6      2     2       6     5     3
## 5   98910      78       1       1    47     0      3     2       4     5     3
## 6   37879     141       0       1    24     0      2     1       6     4     3
## # ℹ 11 more variables: EFAMTYPE <dbl>, HRLYEARN <dbl>, COWMAIN <dbl>,
## #   PERMTEMP <dbl>, FIRMSIZE <dbl>, SURVYEAR <dbl>, UNION <dbl>, NOC_10 <dbl>,
## #   UTOTHRS <dbl>, TENURE <dbl>, HRSTOT <dbl>

summary(lfs_data)   # Summary statistics (every column is stored as numeric, including coded categories)

##     REC_NUM          FINALWT        AGYOWNK          LFSSTAT     
##  Min.   :     4   Min.   :   8   Min.   :0.0000   Min.   :1.000  
##  1st Qu.: 24327   1st Qu.: 117   1st Qu.:0.0000   1st Qu.:1.000  
##  Median : 48484   Median : 188   Median :0.0000   Median :1.000  
##  Mean   : 48568   Mean   : 332   Mean   :0.7852   Mean   :1.139  
##  3rd Qu.: 72378   3rd Qu.: 422   3rd Qu.:1.0000   3rd Qu.:1.000  
##  Max.   :102898   Max.   :3158   Max.   :4.0000   Max.   :2.000  
##       PROV            CMA            AGE_12            SEX           MARSTAT 
##  Min.   :10.00   Min.   :0.000   Min.   : 1.000   Min.   :1.000   Min.   :1  
##  1st Qu.:24.00   1st Qu.:0.000   1st Qu.: 3.000   1st Qu.:1.000   1st Qu.:1  
##  Median :35.00   Median :0.000   Median : 6.000   Median :1.000   Median :2  
##  Mean   :35.31   Mean   :1.558   Mean   : 5.782   Mean   :1.496   Mean   :3  
##  3rd Qu.:47.00   3rd Qu.:2.000   3rd Qu.: 8.000   3rd Qu.:2.000   3rd Qu.:6  
##  Max.   :59.00   Max.   :9.000   Max.   :12.000   Max.   :2.000   Max.   :6  
##       EDUC           IMMIG          EFAMTYPE         HRLYEARN      
##  Min.   :0.000   Min.   :1.000   Min.   : 1.000   Min.   :  3.136  
##  1st Qu.:2.000   1st Qu.:3.000   1st Qu.: 2.000   1st Qu.: 17.198  
##  Median :4.000   Median :3.000   Median : 3.000   Median : 24.387  
##  Mean   :3.555   Mean   :2.771   Mean   : 5.063   Mean   : 28.144  
##  3rd Qu.:5.000   3rd Qu.:3.000   3rd Qu.: 6.000   3rd Qu.: 35.613  
##  Max.   :6.000   Max.   :3.000   Max.   :18.000   Max.   :125.490  
##     COWMAIN         PERMTEMP        FIRMSIZE        SURVYEAR        UNION      
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :2017   Min.   :1.000  
##  1st Qu.:1.000   1st Qu.:1.000   1st Qu.:2.000   1st Qu.:2018   1st Qu.:1.000  
##  Median :2.000   Median :1.000   Median :3.000   Median :2019   Median :2.000  
##  Mean   :1.737   Mean   :1.291   Mean   :2.911   Mean   :2019   Mean   :1.675  
##  3rd Qu.:2.000   3rd Qu.:1.000   3rd Qu.:4.000   3rd Qu.:2020   3rd Qu.:2.000  
##  Max.   :2.000   Max.   :4.000   Max.   :4.000   Max.   :2021   Max.   :2.000  
##      NOC_10          UTOTHRS          TENURE           HRSTOT    
##  Min.   : 1.000   Min.   : 0.30   Min.   :  1.00   Min.   :   0  
##  1st Qu.: 3.000   1st Qu.:35.00   1st Qu.: 15.00   1st Qu.:1715  
##  Median : 6.000   Median :40.00   Median : 54.00   Median :2000  
##  Mean   : 5.448   Mean   :36.83   Mean   : 86.58   Mean   :1839  
##  3rd Qu.: 7.000   3rd Qu.:40.00   3rd Qu.:146.00   3rd Qu.:2040  
##  Max.   :10.000   Max.   :99.00   Max.   :240.00   Max.   :5049
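Because every column is read as a number, summary() reports means for coded categories such as PROV or MARSTAT, and those means have no substantive meaning. The following is a minimal sketch of two checks that could be run at this stage: counting missing values per column and converting coded variables to factors so that summary() reports counts instead. The `toy` data frame and its values are made up for illustration and are not taken from the LFS file.

```r
# Toy stand-in for lfs_data (hypothetical values, for illustration only)
toy <- data.frame(
  SEX      = c(1, 2, 2, 1, 2),
  MARSTAT  = c(1, 6, 2, 1, 6),
  HRLYEARN = c(24.5, 18.0, NA, 31.2, 22.75)
)

# Count missing values in each column before computing statistics
na_counts <- colSums(is.na(toy))
print(na_counts)

# Convert coded categories to factors so summary() shows counts, not means
toy$SEX     <- factor(toy$SEX)
toy$MARSTAT <- factor(toy$MARSTAT)
summary(toy)
```

The same two steps would apply to lfs_data, with the list of coded columns chosen from the codebook.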

Question 3: Recode Variables

(a) Recode AGYOWNK to ANYKIDS

# Recode AGYOWNK into a binary ANYKIDS variable (0 = no children, any other code = has children)
lfs_data <- lfs_data %>%
  mutate(ANYKIDS = if_else(AGYOWNK == 0, "No Children", "Has Children"))
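A recode like this can be checked by cross-tabulating the original codes against the new labels. The sketch below repeats the same if_else() recode on a small made-up vector of AGYOWNK codes (the `toy` data frame is hypothetical) and confirms that every 0 lands in "No Children" and every other code in "Has Children".

```r
library(dplyr)

# Hypothetical AGYOWNK codes, for illustration only
toy <- data.frame(AGYOWNK = c(0, 1, 0, 3, 4, 0))

# Same recode as above, applied to the toy data
toy <- toy %>%
  mutate(ANYKIDS = if_else(AGYOWNK == 0, "No Children", "Has Children"))

# Cross-tabulate old codes (rows) against new labels (columns)
check <- table(toy$AGYOWNK, toy$ANYKIDS)
print(check)
```

Each row of the cross-tabulation should have a non-zero count in exactly one column. Note that if_else() returns NA for any row where AGYOWNK is missing, so such rows would not appear in either category.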

(b) Recode AGE_12 to AGE_NEW (10-year groups)

# Recode AGE_12 into AGE_NEW – using case_when()
lfs_data <- lfs_data %>%
  mutate(AGE_NEW = case_when(
    AGE_12 %in% c(1, 2) ~ "15-24",
    AGE_12 %in% c(3, 4) ~ "25-34",
    AGE_12 %in% c(5, 6) ~ "35-44",
    AGE_12 %in% c(7, 8) ~ "45-54",
    AGE_12 %in% c(9, 10) ~ "55-64",
    AGE_12 %in% c(11, 12) ~ "65+",
    TRUE ~ NA_character_
  ))

(c) Recode TENURE to TENURE2

# Recode TENURE (in months) into four categories using mutate() and case_when()
lfs_data <- lfs_data %>%
  mutate(TENURE2 = case_when(
    TENURE < 12 ~ "Less than 12 months",
    TENURE >= 12 & TENURE < 60 ~ "12–59 months",
    TENURE >= 60 & TENURE < 120 ~ "60–119 months",
    TENURE >= 120 ~ "120+ months",
    TRUE ~ NA_character_
  ))
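Since TENURE2 is stored as character, table() lists its categories alphabetically rather than from shortest to longest tenure. One possible remedy, sketched here on a made-up vector of labels, is to convert the variable to a factor with explicitly ordered levels. The names `tenure_levels`, `toy_labels`, and `tenure2` are introduced only for this illustration.

```r
# Category labels in their natural order (shortest to longest tenure)
tenure_levels <- c("Less than 12 months", "12–59 months",
                   "60–119 months", "120+ months")

# Hypothetical TENURE2 values, for illustration only
toy_labels <- c("12–59 months", "120+ months", "Less than 12 months",
                "60–119 months", "12–59 months")

# A factor with explicit levels makes table() follow the level order
tenure2 <- factor(toy_labels, levels = tenure_levels)
table(tenure2)
```

In the assignment data, the equivalent step would be wrapping TENURE2 in factor(..., levels = tenure_levels) inside a mutate() call.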

Question 4: Subset Key Variables

# Select only the variables needed for the analysis
lfs_subset <- lfs_data %>%
  select(REC_NUM, FINALWT, AGE_NEW, ANYKIDS, TENURE2, HRLYEARN, MARSTAT, UTOTHRS)

Question 5: Descriptive Statistics

# Calculate summary statistics using summarise() 
summary_stats <- lfs_subset %>%
  summarise(
    Mean_Earnings = mean(HRLYEARN, na.rm = TRUE),
    Median_Earnings = median(HRLYEARN, na.rm = TRUE),
    SD_Earnings = sd(HRLYEARN, na.rm = TRUE),
    Mean_Hours = mean(UTOTHRS, na.rm = TRUE),
    Median_Hours = median(UTOTHRS, na.rm = TRUE),
    SD_Hours = sd(UTOTHRS, na.rm = TRUE)
  )

# Frequency tables using table() 
age_table <- table(lfs_subset$AGE_NEW)
child_table <- table(lfs_subset$ANYKIDS)
tenure_table <- table(lfs_subset$TENURE2)
marital_table <- table(lfs_subset$MARSTAT)

# Show output
summary_stats

## # A tibble: 1 × 6
##   Mean_Earnings Median_Earnings SD_Earnings Mean_Hours Median_Hours SD_Hours
##           <dbl>           <dbl>       <dbl>      <dbl>        <dbl>    <dbl>
## 1          28.1            24.4        14.3       36.8           40     10.5

age_table

## 
## 15-24 25-34 35-44 45-54 55-64   65+ 
##  3832  4902  5185  5090  4228   849

child_table

## 
## Has Children  No Children 
##         8684        15402

tenure_table

## 
##        12–59 months         120+ months       60–119 months Less than 12 months 
##                7596                7514                4057                4919

marital_table

## 
##     1     2     3     4     5     6 
## 10790  3800   250   640  1074  7532
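The statistics above are unweighted, although the subset keeps the survey weight FINALWT. As a minimal sketch, assuming FINALWT is meant to be used as a frequency-style weight, base R's weighted.mean() shows how a weighted average can differ from the simple one. The three rows below are made up for illustration.

```r
# Hypothetical wages and survey weights, for illustration only
toy <- data.frame(
  HRLYEARN = c(20, 30, 40),
  FINALWT  = c(100, 100, 200)
)

# Simple mean treats every respondent equally
unweighted_mean <- mean(toy$HRLYEARN, na.rm = TRUE)

# Weighted mean lets each respondent count in proportion to FINALWT
weighted_mean <- weighted.mean(toy$HRLYEARN, w = toy$FINALWT, na.rm = TRUE)

c(unweighted = unweighted_mean, weighted = weighted_mean)
```

Here the highest earner carries the largest weight, so the weighted mean is pulled above the unweighted one. Whether weighting changes the LFS results materially would have to be checked on the actual data.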

Question 6: Interpretation of Key Variables

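The average hourly wage was approximately $28.14, with a median of $24.39. The mean exceeds the median, which suggests a right-skewed wage distribution, and the standard deviation of $14.28 (about half the mean) indicates substantial variation in wages. Workers reported working an average of 36.83 hours per week, with a median of 40 hours and a standard deviation of about 10.5 hours, suggesting that many are employed full-time. About 63% of respondents are aged 25 to 54, and only about 3.5% are 65 or older. Job tenure is spread across the categories: about 48% have held their job for 60 months or more, while about 20% have less than 12 months of tenure. The most common marital status code is 1 (about 45% of respondents), followed by code 6 (about 31%). Approximately 36% of the sample had children. All of these figures are unweighted sample statistics; FINALWT was not applied.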
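The 36% share quoted above can be reproduced from the child_table counts with prop.table(), which divides each count by the total. The sketch below enters the two counts reported in the output by hand rather than reading the data file; `child_counts` and `child_share` are names introduced for this illustration.

```r
# Counts copied from the child_table output above
child_counts <- c("Has Children" = 8684, "No Children" = 15402)

# Convert counts to percentages of the total, rounded to one decimal place
child_share <- round(100 * prop.table(child_counts), 1)
child_share
```

With the real objects, the same result would come from prop.table(child_table), and the approach carries over to age_table, tenure_table, and marital_table.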

Question 7: Compare With vs Without Children

# Subset by ANYKIDS, filtering each group into its own data frame
with_kids <- lfs_subset %>% filter(ANYKIDS == "Has Children")
without_kids <- lfs_subset %>% filter(ANYKIDS == "No Children")

# Summary stats by child status
summary(with_kids$HRLYEARN)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.136  21.000  29.007  32.245  41.177 119.571

summary(without_kids$HRLYEARN)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3.48   15.88   21.79   25.83   31.75  125.49

summary(with_kids$UTOTHRS)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   35.00   40.00   38.06   40.00   99.00

summary(without_kids$UTOTHRS)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.30   33.00   40.00   36.14   40.00   99.00

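Approximately 36% of respondents had children, while 64% did not. Workers with children earned more on average: their mean hourly wage was about $32.25 compared with $25.83 for workers without children, and their median was $29.01 compared with $21.79. They also reported slightly longer working weeks on average (38.06 versus 36.14 hours), although the median was 40 hours in both groups. These gaps could reflect household responsibilities, labour flexibility, or structural workplace conditions, and they may also reflect differences in age or job tenure between the two groups, which this comparison does not control for. Further analysis would be needed to draw definitive conclusions.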
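The four separate summary() calls could also be condensed into one side-by-side table with group_by() and summarise(). This is a sketch on a made-up four-row data frame (the `toy` values are hypothetical), not a rerun of the assignment data.

```r
library(dplyr)

# Hypothetical rows, for illustration only
toy <- data.frame(
  ANYKIDS  = c("Has Children", "Has Children", "No Children", "No Children"),
  HRLYEARN = c(30, 34, 20, 26),
  UTOTHRS  = c(40, 36, 40, 30)
)

# One row per group, with counts and key statistics in columns
by_kids <- toy %>%
  group_by(ANYKIDS) %>%
  summarise(
    n               = n(),
    Mean_Earnings   = mean(HRLYEARN, na.rm = TRUE),
    Median_Earnings = median(HRLYEARN, na.rm = TRUE),
    Mean_Hours      = mean(UTOTHRS, na.rm = TRUE)
  )

by_kids
```

Applied to lfs_subset, the same pipeline would place the group means and medians next to each other, which makes the comparison easier to read than four separate outputs.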