The dataset I am working with focuses on the labour market in Ethiopia. The dataset includes 12 variables, consisting of several categorical variables such as gender and area, one quantitative variable representing labour force values, and a time variable. For this analysis, I selected the most relevant variables and renamed them to simpler and more readable terms. I also filtered the dataset to include only observations for Ethiopia and removed aggregated categories such as total values to ensure a more accurate comparison across groups. Since the available data for this analysis focuses on the year 2021, the study examines differences in labour force participation across groups rather than changes over time. In addition, I was interested in exploring whether the COVID-19 pandemic may have had an impact on the labour market during this period. This topic is personally meaningful to me, as I plan to return to Ethiopia after completing my studies, and I would like to better understand the labour market conditions that may influence my future career opportunities.
library(tidyverse)
Warning: package 'tidyverse' was built under R version 4.5.3
Warning: package 'readr' was built under R version 4.5.3
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.2.0 ✔ readr 2.2.0
✔ forcats 1.0.1 ✔ stringr 1.6.0
✔ ggplot2 4.0.2 ✔ tibble 3.3.1
✔ lubridate 1.9.5 ✔ tidyr 1.3.2
✔ purrr 1.2.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(readr)
library(shiny)
Warning: package 'shiny' was built under R version 4.5.3
library(plotly)
Attaching package: 'plotly'
The following object is masked from 'package:ggplot2':
last_plot
The following object is masked from 'package:stats':
filter
The following object is masked from 'package:graphics':
layout
Rows: 22036 Columns: 12
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (10): ref_area.label, source.label, indicator.label, sex.label, classif1...
dbl (2): time, obs_value
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(ethiopia)
# A tibble: 6 × 12
ref_area.label source.label indicator.label sex.label classif1.label
<chr> <chr> <chr> <chr> <chr>
1 Afghanistan LFS - Labour Force Su… Labour force b… Total Age (Youth, a…
2 Afghanistan LFS - Labour Force Su… Labour force b… Total Age (Youth, a…
3 Afghanistan LFS - Labour Force Su… Labour force b… Total Age (Youth, a…
4 Afghanistan LFS - Labour Force Su… Labour force b… Total Age (Youth, a…
5 Afghanistan LFS - Labour Force Su… Labour force b… Total Age (Youth, a…
6 Afghanistan LFS - Labour Force Su… Labour force b… Total Age (Youth, a…
# ℹ 7 more variables: classif2.label <chr>, time <dbl>, obs_value <dbl>,
# obs_status.label <chr>, note_classif.label <chr>,
# note_indicator.label <chr>, note_source.label <chr>
We can see here that the dataset contains labour-related information for many countries, not just Ethiopia, so it needs to be filtered.
This project compares labour force values in Ethiopia in 2021 across different groups, such as gender and area. It helps show how labour force participation differs between males and females or between urban and rural areas. The dataset contained multiple countries, so I filtered it to include only Ethiopia using the filter() function. I then renamed the variables to simpler names so they are easier to use in the analysis.
# A tibble: 6 × 4
year labour_force gender area
<dbl> <dbl> <chr> <chr>
1 2021 22064. Male Ethiopia
2 2021 17137. Male Ethiopia
3 2021 4927. Male Ethiopia
4 2021 20729. Male Ethiopia
5 2021 15934. Male Ethiopia
6 2021 4795. Male Ethiopia
summary(ethiopia_filtered$labour_force)
Min. 1st Qu. Median Mean 3rd Qu. Max.
47.93 1095.48 3665.20 5585.09 6478.01 22064.31
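The filtering and renaming steps described above can be sketched roughly as follows. This is a minimal illustration, not the exact code used in the project: the toy data frame `raw` and its values are made up, the result name `ethiopia_sketch` is hypothetical, and mapping `area` to `ref_area.label` is an assumption based on the printed output, where `area` shows "Ethiopia".

```r
library(dplyr)

# Toy stand-in for the raw file. Column names follow the printed output above;
# the rows and values are invented for illustration only.
raw <- data.frame(
  ref_area.label = c("Afghanistan", "Ethiopia", "Ethiopia", "Ethiopia"),
  sex.label      = c("Total", "Total", "Male", "Female"),
  time           = c(2021, 2021, 2021, 2021),
  obs_value      = c(100, 50, 30, 20)
)

ethiopia_sketch <- raw %>%
  # keep Ethiopia only and drop the aggregated "Total" rows
  filter(ref_area.label == "Ethiopia", sex.label != "Total") %>%
  # select the relevant columns and rename them in one step (new = old)
  select(
    year         = time,
    labour_force = obs_value,
    gender       = sex.label,
    area         = ref_area.label  # assumption: which column became `area`
  )

ethiopia_sketch
```

Renaming inside select() keeps the selection and the new names in one place, which makes it easy to see which original column each short name came from.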
Exploration
Labour force by area
ggplot(ethiopia_filtered, aes(x = area, y = labour_force, fill = area)) +
  geom_boxplot() +
  labs(
    title = "Labour Force Distribution by Area",
    x = "Area",
    y = "Labour Force"
  )
Main plot
ggplot(ethiopia_filtered, aes(x = gender, y = labour_force, fill = gender)) +
  geom_boxplot() +
  labs(
    title = "Labour Force Distribution by Gender in Ethiopia",
    x = "Gender",
    y = "Labour Force",
    fill = "Gender",
    caption = "Source: Ethiopia Labour Dataset"
  ) +
  scale_fill_brewer(palette = "Set1") +
  theme_minimal()
The boxplot shows the distribution of labour force values for males and females in Ethiopia for the year 2021. Each box represents the middle 50% of the data, while the line inside the box shows the median value. The points beyond the whiskers are individual observations that lie unusually far from the rest of their group (outliers). From the graph, we can see that the labour force values for both genders have a wide spread, meaning there is a lot of variation within each group. While the median for males appears slightly higher than for females, the two boxes overlap considerably. This overlap suggests that the difference in labour force between males and females is not very strong. Overall, the boxplot shows both the variation within each group and the comparison between genders in a clear way.
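To make the parts of a boxplot concrete, the sketch below computes them by hand for a small set of made-up numbers (not values from the Ethiopia dataset). It uses the usual 1.5 × IQR rule, which is the default in geom_boxplot(), to decide which points are drawn separately.

```r
# Made-up values for illustration only.
values <- c(2, 4, 5, 7, 8, 9, 11, 40)

# Quartiles: the box runs from the 25th to the 75th percentile,
# and the line inside the box is the median (50th percentile).
q <- quantile(values, c(0.25, 0.5, 0.75))
iqr <- q[[3]] - q[[1]]

# Points further than 1.5 * IQR from the box are drawn as separate dots.
lower_fence <- q[[1]] - 1.5 * iqr
upper_fence <- q[[3]] + 1.5 * iqr
outliers <- values[values < lower_fence | values > upper_fence]

q         # 4.75, 7.5, 9.5
iqr       # 4.75
outliers  # 40
```

Here only the value 40 falls beyond the upper fence (9.5 + 1.5 × 4.75 = 16.625), so it would appear as a single point above the whisker.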
model <- lm(labour_force ~ gender, data = ethiopia_filtered)
summary(model)
Call:
lm(formula = labour_force ~ gender, data = ethiopia_filtered)
Residuals:
Min 1Q Median 3Q Max
-6247.4 -4325.8 -1892.7 948.7 15684.5
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4790.4 873.8 5.482 3.52e-07 ***
genderMale 1589.4 1235.8 1.286 0.202
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6054 on 94 degrees of freedom
Multiple R-squared: 0.01729, Adjusted R-squared: 0.006838
F-statistic: 1.654 on 1 and 94 DF, p-value: 0.2016
plot(model)
The regression model examines whether labour force values differ by gender in Ethiopia for the year 2021. The intercept (about 4790) is the average labour force value for females, the baseline group. The coefficient for males is about 1589, which means that males have a higher labour force value than females on average by that amount. However, the p-value for gender is 0.202, which is greater than 0.05, meaning that gender is not statistically significant in this model. In other words, a difference of this size could plausibly be due to chance rather than a real relationship. The R² value is very low (about 0.017), which shows that the model explains only about 1.7% of the variation in labour force, so it is a weak model. Overall, this suggests that in 2021, gender alone does not strongly explain differences in labour force in Ethiopia.
R Shiny was not working, so I used the exists() function to check that ethiopia_filtered exists in memory.
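As a check on the interpretation above, the main quantities in the regression table can be recomputed from the reported estimates. This is only a sketch using the rounded numbers printed in the output, so the results match the table up to rounding; the variable names are my own.

```r
# Values copied from the summary(model) output above.
intercept <- 4790.4  # average for the baseline group (Female)
coef_male <- 1589.4  # estimated Male minus Female difference
se_male   <- 1235.8  # standard error of that difference
df_resid  <- 94      # residual degrees of freedom

# Average for males = baseline average + male coefficient
male_mean <- intercept + coef_male

# t value = estimate / standard error
t_value <- coef_male / se_male

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_value), df = df_resid)

# With a single predictor, F = t^2 and R^2 = F / (F + residual df)
f_stat    <- t_value^2
r_squared <- f_stat / (f_stat + df_resid)

round(c(male_mean = male_mean, t = t_value, p = p_value,
        F = f_stat, R2 = r_squared), 4)
```

The recomputed t value, p-value, F-statistic, and R² agree with the printed 1.286, 0.2016, 1.654, and 0.01729, which confirms that the gender coefficient and the overall F-test are telling the same story in this one-predictor model.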
exists("ethiopia_filtered")
[1] TRUE
The diagnostic plots help evaluate whether the regression model is appropriate. Because gender is the only predictor, the fitted values take just two values, so the points in the Residuals vs Fitted plot fall in two vertical strips. Within those strips the residuals show no clear pattern, but there is still noticeable spread, suggesting the model does not capture much of the variation in the data. The Q-Q plot shows that the residuals do not follow the straight line closely, which suggests they are not normally distributed. The Scale-Location plot indicates that the spread of residuals is not constant, suggesting unequal variance. Finally, the Residuals vs Factor Levels plot shows the variation within each gender group. Overall, these plots suggest that the model is not a very strong fit and that gender alone does not fully explain labour force differences in Ethiopia.
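The same checks can also be looked at numerically. The sketch below uses simulated, right-skewed data (96 made-up observations, not the Ethiopia dataset) and hypothetical object names `toy` and `toy_model`, purely to show which numbers correspond to what the diagnostic plots display.

```r
set.seed(1)

# Simulated right-skewed data, for illustration only.
toy <- data.frame(
  gender = rep(c("Female", "Male"), each = 48),
  labour_force = c(rlnorm(48, meanlog = 8.0, sdlog = 1),
                   rlnorm(48, meanlog = 8.2, sdlog = 1))
)

toy_model <- lm(labour_force ~ gender, data = toy)
res <- residuals(toy_model)

# One two-level factor gives only two distinct fitted values,
# which is why Residuals vs Fitted shows two vertical strips.
n_fitted <- length(unique(round(fitted(toy_model), 6)))
n_fitted

# Skewness check: residuals always average to zero, so a median
# far from zero points to a skewed (non-normal) distribution.
summary(res)

# Equal-variance check: compare the residual spread in each group.
tapply(res, toy$gender, sd)
```

To see all four diagnostic plots on one page, par(mfrow = c(2, 2)) can be called before plot(model).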
library(shiny)

# UI
ui <- fluidPage(
  titlePanel("Ethiopia Labour Force"),
  sidebarLayout(
    sidebarPanel(
      selectInput(
        inputId = "gender",
        label = "Select Gender:",
        choices = unique(ethiopia_filtered$gender)
      )
    ),
    mainPanel(
      plotOutput(outputId = "distPlot")
    )
  )
)

# Server
server <- function(input, output) {
  output$distPlot <- renderPlot({
    selected_data <- ethiopia_filtered %>%
      filter(gender == input$gender)

    ggplot(selected_data, aes(x = gender, y = labour_force, fill = gender)) +
      geom_boxplot() +
      labs(
        title = "Labour Force Distribution by Gender in Ethiopia",
        x = "Gender",
        y = "Labour Force"
      )
  })
}

# Run app
shinyApp(ui = ui, server = server)
Shiny applications not supported in static R Markdown documents
R Shiny is used in this project to make the analysis interactive instead of static. In this case, the Shiny app allows the user to select a gender and view the corresponding labour force distribution for Ethiopia. This makes it easier to focus on one group at a time and better understand the differences in labour force participation. Instead of just looking at one fixed graph, the user can interact with the data, which makes the results clearer and more engaging.
Warning in cor(ethiopia_clean$year, ethiopia_clean$labour_force): the standard
deviation is zero
[1] NA
I attempted to use correlation to examine the relationship between variables in the dataset, specifically between year and labour force. However, since my dataset only includes data from one year (2021), there is no variation in the time variable. As a result, the standard deviation is zero, and the correlation cannot be calculated, which is why it returns an undefined value (NA). This showed me that correlation is not meaningful in this case, so I focused my analysis on comparing labour force differences across groups such as gender instead.
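The reason for the NA can be reproduced with a few lines. This is a small sketch with a handful of rounded values taken from the printed rows above; the guard at the end is just one possible way to avoid the warning.

```r
# Every observation has the same year, as in the 2021-only data.
year <- rep(2021, 5)
labour_force <- c(22064, 17137, 4927, 20729, 15934)

# A constant variable has zero standard deviation ...
sd(year)

# ... so the correlation formula divides by zero and R returns NA
# (with the "standard deviation is zero" warning, silenced here).
r_raw <- suppressWarnings(cor(year, labour_force))
r_raw

# Checking the number of distinct years first avoids the problem.
n_years <- length(unique(year))
r <- if (n_years > 1) cor(year, labour_force) else NA_real_
r
```

A quick check like length(unique(year)) before running a time-based analysis shows immediately whether there is any variation over time to analyse.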
Problems I faced
During this project, I faced some challenges while working with the data and visualizations. Initially, I attempted to create a line graph to show trends over time, but the graph appeared as a straight vertical line. After investigating the issue, I realized that the dataset only contained data for one year (2021), which made it impossible to analyze changes over time. Because of this limitation, I had to adjust my approach and focus on comparing labour force values across categories instead of trends. I changed my visualization to a boxplot to compare differences between genders at a single point in time (2021), which gave a more accurate and meaningful representation of the data.
Reflection
I cleaned the data by first loading the dataset using read_csv(). Then, I selected only the relevant variables such as time, labour force values, gender, and area. I renamed these variables to simpler names to make them easier to work with in R. After that, I filtered the dataset to include only observations for Ethiopia and removed the “Total” category so that I could focus on comparing groups more clearly. This made the dataset cleaner and ready for analysis.
For the visualization, I used a boxplot where each box represents the distribution of labour force values for each gender. The colors represent different gender groups, making it easier to compare them. Since the dataset only included data for the year 2021, I was not able to analyze trends over time, so I focused on comparing differences between groups instead.
The regression results examine the relationship between gender and labour force. The model shows that males have a higher average labour force value than females; however, the p-value is greater than 0.05, which means this difference is not statistically significant. The adjusted R-squared value is very low, indicating that gender explains only a small portion of the variation in labour force. Overall, this suggests that gender alone does not strongly explain differences in labour force in Ethiopia for 2021.
One limitation of this analysis is that the dataset only includes one year (2021), which prevents analyzing trends over time. This limits the ability to fully understand how the labour market has changed. I wish I could include more detailed variables, such as age groups or employment sectors, to better understand how different factors affect labour force participation. This would provide deeper insight into the labour market and help explain variations more clearly.
I used AI as a learning tool to better understand key concepts such as p-values, R-squared, and regression, which helped me interpret my results more clearly and confidently. It provided simple explanations and examples that made complex statistical ideas easier to understand. I was able to connect these concepts directly to my own dataset and results. This helped me explain what the regression output means instead of just reporting numbers. Overall, it improved my understanding of the analysis and made my interpretations more accurate.