Sectoral Analysis of Employment in Metro Manila Using PSA’s LFS Data

Author

Benedict Bautista

Data cleaning

In this phase of the study, the Labor Force Survey (LFS) dataset provided by the Philippine Statistics Authority (PSA) was utilized as the primary data source. The initial step involved identifying and selecting the key variables necessary for the analysis. Following this, the dataset underwent a data cleaning process, which included handling and removing any missing or incomplete values to ensure the accuracy and reliability of the results.

The following libraries will be used: tidyverse ,pander for APA tables, nortest for the test of normality, and here for easier working directory.

Loading of the necessary libraries

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(pander)
library(here)
here() starts at C:/Users/ben/Desktop/portfolio RPUB
library(nortest)

Loading the dataset

The dataset was acquired from psada.gov.ph under Labor Force Survey tab and then the month of March 2024.

data <- read.csv(here("LFS PUF March 2024.csv"))

Filtering variables

In the next step, we wish to filter variables needed for the analysis. The variables needed are PUFC15_PKB coded as Major industry group and Sub industry group, PUFREG coded as Region, PUFC09 coded as Work indicator (1 = employed, 2 = unemployed), and PUFC07 coded as Educational attainment and PUFWGTPRV for the final weight based on projection. Please refer to lfs_metadata (dictionary) uploaded or you can visit the site :https://psada.psa.gov.ph/catalog/LFS/about

# filtering values
data1 <- data %>% filter(PUFREG == 13) %>% # 13 for NCR
  mutate(sectors = case_when(
    PUFC15_PKB %in% 1:3 ~ "Agriculture",
    PUFC15_PKB %in% 5:43 ~ "Industry",
    PUFC15_PKB >= 45 ~ "Services",
    T ~ NA_character_
  ),
    sub.industry = case_when(
      PUFC15_PKB %in% 1:2 ~ "Agriculture and Forestry",
      PUFC15_PKB == 3 ~ "Fishing and Aquaculture",
      PUFC15_PKB %in% 5:9 ~ "Mining and Quarying",
      PUFC15_PKB %in% 10:33 ~ "Manufacturing",
      PUFC15_PKB >= 35 ~ "Electricity, Gas, Steam, AC",
      PUFC15_PKB %in% 36:39 ~ "Water Supply, Sewerage, Waste Management",
      PUFC15_PKB %in% 41:43 ~ "Construction",
      PUFC15_PKB %in% 45:47 ~ "Wholesale and Retail Trade",
      PUFC15_PKB %in% 49:53 ~ "Transportation and Storage",
      PUFC15_PKB %in% 55:56 ~ "Accommodation and Food Service Activities",
      PUFC15_PKB %in% 58:63 ~ "Information and Communication",
      PUFC15_PKB %in% 64:66 ~ "Financial and Insurance Activities",
      PUFC15_PKB == 68 ~ "Real Estate Activities",
      PUFC15_PKB %in% 69:75 ~ "Professional, Scientific, and Technical Activities",
      PUFC15_PKB %in% 77:82 ~ "Administrative and Support Service Activities",
      PUFC15_PKB == 84 ~ "Public Administration and Defense",
      PUFC15_PKB == 85 ~ "Education",
      PUFC15_PKB %in% 86:88 ~ "Human Health and Social Work",
      PUFC15_PKB %in% 90:93 ~ "Arts, Entertainment and Recreation",
      PUFC15_PKB %in% 94:96 ~ "Other Service Activities",
      PUFC15_PKB %in% 97:98 ~ "Activities of Households as Employers",
      PUFC15_PKB == 99 ~ "Activities of Extraterritorial Orgs and Bodies",
      T ~ NA_character_
    ),
    employed = case_when(
      PUFC09_WORK == 1 ~ "Employed",
      PUFC09_WORK == 2 ~ "Unemployed",
      T ~ NA_character_
    )
  ) %>% filter(!is.na(sectors), !is.na(sub.industry), !is.na(employed))

# select variables
dataset <- data1 %>% select(PUFC15_PKB, PUFREG, PUFC09_WORK, PUFC07_GRADE, PUFPWGTPRV)

# renaming column names
colnames(dataset) <- c("Industries", "Region", "Employment Status", "Educational Attainment", "Final weight")

head(dataset, n = 10)
   Industries Region Employment Status Educational Attainment Final weight
1          84     13                 1                  60413     3999.574
2          82     13                 1                  60003     3347.185
3          56     13                 1                  60003     3240.956
4          47     13                 1                  24015     3240.956
5          46     13                 1                  24015     2238.744
6          47     13                 1                  24015     3347.185
7          47     13                 1                  24015     3999.574
8          47     13                 1                  10018     3503.780
9          82     13                 1                  60612     2713.258
10         47     13                 1                  60002     3448.945

Relabeling values

For simplicity we can rename the values from its coded ones.

For industries, (take note that Agriculture, Industry, and Services are the major industries) following coded values are:

\[\begin{array}{|l|c|c|} \hline \textbf{Industry} & \textbf{Minimum Code} & \textbf{Maximum Code} \\ \hline \text{AGRICULTURE} & 1 & 3 \\ \text{Agriculture and Forestry} & 1 & 2 \\ \text{Fishing and Aquaculture} & 3 & \\ \text{INDUSTRY} & 5 & 43 \\ \text{Mining and Quarrying} & 5 & 9 \\ \text{Manufacturing} & 10 & 33 \\ \text{Electricity, Gas, Steam and Airconditioning Supply} & 35 & \\ \text{Water Supply, Sewerage, Waste Management} & 36 & 39 \\ \text{Construction} & 41 & 43 \\ \text{SERVICES} & 45 & 99 \\ \text{Wholesale and Retail Trade} & 45 & 47 \\ \text{Transportation and Storage} & 49 & 53 \\ \text{Accommodation and Food Service Activities} & 55 & 56 \\ \text{Information and Communication} & 58 & 63 \\ \text{Financial and Insurance Activities} & 64 & 66 \\ \text{Real Estate Activities} & 68 & \\ \text{Professional, Scientific, and Technical Activities} & 69 & 75 \\ \text{Administrative and Support Service Activities} & 77 & 82 \\ \text{Public Administration and Defense} & 84 & \\ \text{Education} & 85 & \\ \text{Human Health and Social Work} & 86 & 88 \\ \text{Arts, Entertainment and Recreation} & 90 & 93 \\ \text{Other Service Activities} & 94 & 96 \\ \text{Activities of Households as Employers} & 97 & 98 \\ \text{Activities of Extraterritorial Orgs and Bodies} & 99 & \\ \hline \end{array}\]

For Educational Attainment the following coded values are:

\[\begin{array}{|l|c|c|} \hline \textbf{Educational Attainment} & \textbf{Minimum Code} & \textbf{Maximum Code} \\ \hline \text{No grade completed} & 0 & 1000 \\ \text{Elementary Undergraduate} & 10011 & 10015 \\ \text{Elementary Graduate} & 10016 & 10018 \\ \text{Junior High School Undergraduate} & 24011 & 24013 \\ \text{Junior High School Completer} & 24015 & 24015 \\ \text{Senior High School Undergraduate} & 34011 & 35011 \\ \text{Senior High School Graduate} & 34013 & 35013 \\ \text{Post Secondary Undergraduate} & 40001 & 50003 \\ \text{Post Secondary Graduate} & 40011 & 49999 \\ \text{College Undergraduate} & 60001 & 69999 \\ \text{College Graduate} & 60011 & 89999 \\ \hline \end{array}\]

EDA

RQ #1

To gain an initial understanding of the dataset and address the first two research questions, this study begins with an exploratory data analysis. Specifically, it aims to describe how employed Filipino workers are distributed across the country’s major and sub industry sectors, providing foundational insights for further statistical evaluation.

dist.1 <- dataset %>% filter(`Employment Status` == 1) %>% 
  mutate(major = case_when(
  Industries %in% 1:3 ~ "Agriculture",
  Industries %in% 5:43 ~ "Industry",
  Industries %in% 45:99 ~ "Services",
  T ~ NA_character_
)) 

# aggregate the dist.1
major.dist <- dist.1 %>%
  group_by(major) %>%
  summarise(`total weight` = sum(`Final weight`, na.rm = TRUE), .groups = "drop")

major.dist <- major.dist %>%
  bind_rows(
    summarise(
      major.dist,
      major = "TOTAL",
      `total weight` = sum(`total weight`)))

colnames(major.dist) <- c("Major Industry", "Population")

pander(major.dist,
       caption = "Total Population per Major Industry")
Total Population per Major Industry
Major Industry Population
Agriculture 5298
Industry 920484
Services 5276199
TOTAL 6201981

We cannot assume that the population is normally distributed without formally testing it. This is where the Shapiro–Francia test comes in. It is a variant of the Shapiro–Wilk test, designed to perform better with larger sample sizes. In our case, we are working with 2,580 samples drawn from a dataset containing 44,603 observations.

# normality test of employed NCR citizen per major sector
employed <- dist.1 %>% filter(`Employment Status` == 1)

employed %>%
  group_by(major) %>%
  summarise(
    p_value = sf.test(`Final weight`)$p.value
  )
# A tibble: 3 × 2
  major        p_value
  <chr>          <dbl>
1 Agriculture 5.63e- 2
2 Industry    1.28e-21
3 Services    9.31e-41

Plots of Weighted Frequency Distribution in Major Sectors

ggplot(dist.1, aes(x = `Final weight`)) +
  geom_histogram(fill = "steelblue", bins = 28) +  
  facet_wrap(~ major, scales = "free") +
  labs(
    title = "Distribution of employed by Major Sector (Weighted)",
    x = "Major Industry",
    y = "Population"
  ) +
  theme_minimal()

Analysis of Major Sector Distribution

A test of normality was conducted to assess the distribution of employed individuals across the major industry sectors. Results indicated that the Agriculture sector was the only one with a distribution that did not significantly deviate from normality (p = .056), whereas the Industry and Services sectors showed strong evidence against normality. Furthermore, we can notice from the graphs that Industry sector and Services sector is skewed to the right which indicate that we can utilize logarithmic transformation to the weights of Industry and Services sector to achieve normality.

RQ #2

From the code earlier we can tweak it to remove the weights (which represent the number of people in the population that each sampled respondent stands for) to get the actual frequency of each sector.

dist.2 <- dataset %>% filter(`Employment Status` == 1) %>% 
  mutate(major = case_when(
  Industries %in% 1:3 ~ "Agriculture",
  Industries %in% 5:43 ~ "Industry",
  Industries %in% 45:99 ~ "Services",
  T ~ NA_character_
)) 

# aggregate the dist.2
major.dist1 <- dist.2 %>%
  group_by(major) %>%
  summarise(frequency = n(), .groups = "drop")


major.dist1 <- major.dist1 %>%
  bind_rows(
    summarise(
      major.dist1,
      major = "TOTAL",
      frequency = sum(frequency)))

colnames(major.dist1) <- c("Major Industry", "Population")

pander(major.dist1,
       caption = "Total Frequency per Major Industry")
Total Frequency per Major Industry
Major Industry Population
Agriculture 9
Industry 427
Services 2133
TOTAL 2569

Plots of Frequency Distribution in Major Sectors

We can graph it to visualize which sector contributes more in the employment in NCR.

ggplot(dist.2, aes(x = fct_rev(fct_infreq(major)))) +
  geom_bar(fill = "steelblue") +
  coord_flip() +
  geom_text(stat = "count", aes(label = after_stat(count)),
            color = "black",
            hjust = 1.1,
            size = 3.5) +
  labs(
    title = "Distribution of Employed by Major Sector",
    x = "Major Sector",
    y = "Count"
  ) +
  theme_minimal()

Analysis of the population of employed personnel each sector

Upon observing the graph we can notice that the distribution of the frequency of employed personnel across all the major sectors is not normally distributed, because it does not resemble a bell-shaped curve, which also suggest that formal test is not needed. Furthermore, most of the employed personnel in NCR falls under the Services sector placing as the most populated sector in the NCR, Industry sector placing second and Agriculture sector placing last, which indicate that Agriculture sector is the least populated among the three major sectors in NCR alone.

RQ #3

Identifying the Leading Employment Sub-Sector in Services Sector in NCR

Previously, we identified the total number of employed individuals across major sectors. Based on the aggregated and visualized data, it was observed that the majority of the workforce is employed in the Services sector, while the Agriculture sector had the lowest share.

In this section, our goal is to determine which sub-sector within the Services sector accounts for the highest number of employed individuals. To achieve this, we will aggregate relevant data and present visualizations to uncover further insights.

sub <- dist.2 %>% filter(major == "Services") %>%
  mutate(sector2 = case_when(
    Industries %in% 45:47 ~ "Wholesale and Retail Trade",
    Industries %in% 49:53 ~ "Transportation and Storage",
    Industries %in% 55:56 ~ "Accommodation and Food Service Activities",
    Industries %in% 58:63 ~ "Information and Communication",
    Industries %in% 64:66 ~ "Financial and Insurance Activities",
    Industries == 68 ~ "Real Estate Activities",
    Industries %in% 69:75 ~ "Professional, Scientific, and Technical Activities",
    Industries %in% 77:82 ~ "Administrative and Support Service Activities",
    Industries == 84 ~ "Public Administration and Defense",
    Industries == 85 ~ "Education",
    Industries %in% 86:88 ~ "Human Health and Social Work",
    Industries %in% 90:93 ~ "Arts, Entertainment and Recreation",
    Industries %in% 94:96 ~ "Other Service Activities",
    Industries %in% 97:98 ~ "Activies of Households as Employers",
    Industries %in% 99 ~ "Activities of Extraterritorial Orgs and Bodies",
    T ~ NA_character_
  ))

sub.dist1 <- sub %>%
  group_by(sector2) %>%
  summarise(frequency = n(), .groups = "drop")

sub.dist1 <- sub.dist1 %>%
  bind_rows(
    summarise(
      sub.dist1,
      sector2 = "TOTAL",
      frequency = sum(frequency)))

colnames(sub.dist1) <- c("Sub Industry", "Frequency")

pander(sub.dist1,
       caption = "Total Frequency per Sub Industry")
Total Frequency per Sub Industry
Sub Industry Frequency
Accommodation and Food Service Activities 213
Administrative and Support Service Activities 344
Arts, Entertainment and Recreation 25
Education 60
Financial and Insurance Activities 70
Human Health and Social Work 60
Information and Communication 73
Other Service Activities 196
Professional, Scientific, and Technical Activities 49
Public Administration and Defense 154
Real Estate Activities 32
Transportation and Storage 273
Wholesale and Retail Trade 584
TOTAL 2133

We can graph it to observe the distribution and proportion of each Sub Industry to the overall frequency of the Services Sector.

ggplot(sub, aes(x = fct_rev(fct_infreq(sector2)))) +
  geom_bar(fill = "steelblue") +
   geom_text(stat = "count", aes(label = after_stat(count)),
            color = "white",
            hjust = 1.1,
            size = 3.5) +
  coord_flip() +
  labs(
  title = "Number of Employed Personnel per Sub-sector",
    x = "Sub-sector",
    y = "Count"
  ) +
  theme_minimal()

pie_data <- sub %>%
  filter(`Employment Status` == 1) %>%
  count(sector2) %>%
  mutate(percentage = n / sum(n) * 100,
         label = paste0(round(percentage, 1), "%"))

colnames(pie_data) <- c("Sub-Industry", "Count", "Percentage", "Rounded")

ggplot(pie_data, aes(x = fct_reorder(`Sub-Industry`, Percentage), y = Percentage)) +
  geom_col(fill = "steelblue") +
   geom_text(aes(label = paste0(Rounded)), 
            color = "white", 
            hjust = 1.1, 
            size = 3.5) + 
  coord_flip() +
  labs(
    title = "Percentage Distribution of Employed Personnel by Sub-sector",
    x = "Sub-Industry",
    y = "Percentage"
  ) +
  theme_minimal()

From the graphs presented we can observe that Wholesale and Retail Trade placed first with 584 or 27.4% employed personnel in the most populated sub-industry under Services Sector. Consequently, Administrative and Support Service Activities placed second with 344 or 16.1% employed personnel, while Transportation and Storage placed third with 273 or 12.8% employed personnel, indicating that the top three sub industries under Service sector are Wholesale and Retail Trade, Administrative and Support Service Activities, and Transportation and Storage adding up their population can add up to 1203 employed personnel making about 56.3% over the whole population of employed personnel in Services alone. This can indicate that these 3 sub industries are the most contributors to the overall population of Service sector.

Identifying the Leading Employment Sub-sector Across all Major Sectors.

# using dataset "dist.2" from earlier
# Agriculture
agri <- dist.2 %>%
  filter(major == "Agriculture") %>%
  mutate(sub_sector = case_when(
    Industries %in% 1:2 ~ "Agriculture and Forestry",
    Industries == 3 ~ "Fishing and Aquaculture",
    TRUE ~ NA_character_
  ))

# Industry
industry <- dist.2 %>%
  filter(major == "Industry") %>%
  mutate(sub_sector = case_when(
    Industries %in% 5:9 ~ "Mining and Quarrying",
    Industries %in% 10:33 ~ "Manufacturing",
    Industries == 35 ~ "Electricity, Gas, Steam, AC Supply",
    Industries %in% 36:39 ~ "Water Supply, Sewerage, Waste Management",
    Industries %in% 41:43 ~ "Construction",
    TRUE ~ NA_character_
  ))

# Services
services <- dist.2 %>%
  filter(major == "Services") %>%
  mutate(sub_sector = case_when(
    Industries %in% 45:47 ~ "Wholesale and Retail Trade",
    Industries %in% 49:53 ~ "Transportation and Storage",
    Industries %in% 55:56 ~ "Accommodation and Food Services",
    Industries %in% 58:63 ~ "Information and Communication",
    Industries %in% 64:66 ~ "Financial and Insurance Activities",
    Industries == 68 ~ "Real Estate Activities",
    Industries %in% 69:75 ~ "Professional, Scientific, and Technical Activities",
    Industries %in% 77:82 ~ "Administrative and Support Services",
    Industries == 84 ~ "Public Administration and Defense",
    Industries == 85 ~ "Education",
    Industries %in% 86:88 ~ "Human Health and Social Work",
    Industries %in% 90:93 ~ "Arts, Entertainment and Recreation",
    Industries %in% 94:96 ~ "Other Service Activities",
    Industries %in% 97:98 ~ "Household Employers",
    Industries == 99 ~ "Extraterritorial Orgs and Bodies",
    TRUE ~ NA_character_
  ))

all <- bind_rows(agri, industry, services)

freq_table <- all %>%
  filter(!is.na(sub_sector)) %>%
  count(major, sub_sector)


total_row <- data.frame(
  major = "Total",
  sub_sector = "-",
  n = sum(freq_table$n)
)

freq_table <- bind_rows(freq_table, total_row)
pander(freq_table)
major sub_sector n
Agriculture Agriculture and Forestry 2
Agriculture Fishing and Aquaculture 7
Industry Construction 197
Industry Electricity, Gas, Steam, AC Supply 3
Industry Manufacturing 220
Industry Mining and Quarrying 1
Industry Water Supply, Sewerage, Waste Management 6
Services Accommodation and Food Services 213
Services Administrative and Support Services 344
Services Arts, Entertainment and Recreation 25
Services Education 60
Services Financial and Insurance Activities 70
Services Human Health and Social Work 60
Services Information and Communication 73
Services Other Service Activities 196
Services Professional, Scientific, and Technical Activities 49
Services Public Administration and Defense 154
Services Real Estate Activities 32
Services Transportation and Storage 273
Services Wholesale and Retail Trade 584
Total - 2569

Now we graph it to visualize the Frequency of Employed Persons by Sub-sector and Major Sector

freq_table2 <- freq_table %>%
  filter(major != "Total")  # removes the row where 'major' is "Total"

library(forcats)

ggplot(
  freq_table2, 
  aes(
    x = fct_rev(fct_reorder(sub_sector, -n)),  
    y = n, 
    fill = sub_sector
  )
) +
  geom_col(position = position_dodge(width = 1)) +
  facet_wrap(~ major, scales = "free") +
  coord_flip() +
  labs(
    title = "Frequency of Employed Persons by Sub-sector and Major Sector",
    x = "Sub-sector",   
    y = "Number of Persons",
    fill = "Sub-sector"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 38, face = "bold"),
    axis.text.x = element_text(angle = 45, 
                               hjust = 1, 
                               size = 25),
    axis.text.y = element_text(angle = 70, 
                               hjust = 1, 
                               size = 25),
    legend.position = "none"
  )

Faceted bar graph shows the breakdown of employed individuals by different sub-sectors, classified by main sectors. Different facets depict various major sectors, enabling comparison of sub-sectoral employment both within and between sectors to be seen more clearly.

Under the Agriculture, Forestry, and Fishing sector, the highest employment is in Fishing and Aquaculture. Agriculture and Forestry also participate but to a lesser degree.

The Industry sector indicates high labor in Manufacturing and Construction, with overwhelming dominance of the group’s workforce. Other sub-sectors such as Water Supply and Electricity, Gas, Steam, and AC Supply have comparatively low employment.

In the Services sector, there is much diversification of employment. The Wholesale and Retail Trade sub-sector employs the largest number of people, followed by Administrative and Support Services, Transportation and Storage, and Accommodation and Food Services. Some other smaller sub-sectors also make contributions, including Financial and Insurance Activities, Education, and Public Administration.

In general, the Services sector also seems to be the most populous by employment, signaling a robust trend towards service job employment in the labor market. The distinction between color-coded bars and independent facets facilitates interpretability, allowing for easier observation of which sub-sectors predominate within each of the main economic sectors.

Notes

A logistic regression model was tested to assess the predictive power of educational attainment on employment status. However, due to high insignificance across levels, likely caused by category imbalance and sample constraints, the results were excluded from the main analysis.