Yong-Quan Su

Import data and display the first 6 rows of the dataframe

## importing data using "read.table"
df <- read.table("unemployment_data_us.csv", header = TRUE, sep = ",", dec = ".")
head(df)

##   Year Month Primary_School     Date High_School Associates_Degree
## 1 2010   Jan           15.3 Jan-2010        10.2               8.6
## 2 2011   Jan           14.3 Jan-2011         9.5               8.1
## 3 2012   Jan           13.0 Jan-2012         8.5               7.1
## 4 2013   Jan           12.0 Jan-2013         8.1               6.9
## 5 2014   Jan            9.4 Jan-2014         6.5               5.9
## 6 2015   Jan            8.3 Jan-2015         5.4               5.2
##   Professional_Degree White Black Asian Hispanic  Men Women
## 1                 4.9   8.8  16.5   8.3     12.9 10.2   7.9
## 2                 4.3   8.1  15.8   6.8     12.3  9.0   7.9
## 3                 4.3   7.4  13.6   6.7     10.7  7.7   7.6
## 4                 3.8   7.1  13.7   6.4      9.7  7.5   7.2
## 5                 3.3   5.7  12.1   4.7      8.3  6.2   5.8
## 6                 2.8   4.9  10.3   4.0      6.7  5.3   5.0

Details about dataset

Unit of observation: each observation represents a month’s data

Sample size: 132 observations

Definition of variables:

Date:

Year (the year the sample was taken), Month (the month the sample was taken)

Unemployment rate based on educational qualification:

Primary_School, High_School, Associates_Degree, Professional_Degree

Unemployment rate based on race:

White, Black, Asian, Hispanic

Unemployment rate based on gender:

Men, Women
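As a quick sanity check on the sample size, 132 observations is what one row per month over the eleven years 2010 to 2020 would give. The sketch below rebuilds that year-month grid in base R; the data frame `grid` is a hypothetical stand-in and not the Kaggle file itself.

```r
# Hypothetical year-month grid: one row per month from 2010 to 2020
grid <- expand.grid(Month = month.abb, Year = 2010:2020,
                    stringsAsFactors = FALSE)

nrow(grid)        # 11 years x 12 months = 132 rows
table(grid$Year)  # 12 rows in every year
```

On the imported data, `nrow(df)` and `table(df$Year)` could be compared against this grid before any rows are dropped.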

Source of data

https://www.kaggle.com/datasets/aniruddhasshirahatti/us-unemployment-dataset-2010-2020 (Kaggle, 2020)

Data manipulation

The data are cleaned and rearranged to make them more suitable for graphical representation. Two libraries are required for this task: dplyr for “relocate” and tidyr for “drop_na”.

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(tidyr)

## sort by year (ties keep their original row order, so months stay in file order)
df <- df[order(df$Year), ]

## move "Primary_School" after "Date" so that the education columns sit next to each other
df <- df %>% 
  relocate(Primary_School, 
           .after = Date)

## dropping rows that contain missing values (NA)
df <- drop_na(df)

## creating a numeric month index (Jan = 1, ..., Dec = 12) from column "Month"
df$Month_reformatted <- as.numeric(factor(df$Month, 
                                          levels = month.abb)) 

## subsetting for descriptive statistics: rows 85-120 are 2017-2019 (assuming 12 rows per year from 2010), columns 1-7 are the date and education variables
df_sub <- df[85:120, 1:7]  

## factorisation of the column "Year"
df_sub$Year <- factor(df_sub$Year)
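The positional subset `df[85:120, 1:7]` only selects 2017-2019 if every earlier year contributes exactly twelve rows. A more defensive variant, sketched here on a small hypothetical data frame `toy` (made-up values, not the original data), sorts by year and month index and then filters on the value of `Year`.

```r
# Toy data in shuffled order; the values are made up for illustration
toy <- data.frame(
  Year  = c(2018, 2016, 2017, 2017, 2019, 2020),
  Month = c("Feb", "Dec", "Mar", "Jan", "Jan", "Jan"),
  Primary_School = c(5.7, 7.9, 6.8, 7.7, 5.7, 5.5),
  stringsAsFactors = FALSE
)

# Numeric month index, as in the code above (Jan = 1, ..., Dec = 12)
toy$Month_reformatted <- as.numeric(factor(toy$Month, levels = month.abb))

# Sort by year first, then by month within each year
toy <- toy[order(toy$Year, toy$Month_reformatted), ]

# Keep 2017-2019 by value rather than by row position
toy_sub <- toy[toy$Year %in% 2017:2019, ]
toy_sub$Year <- factor(toy_sub$Year)
toy_sub
```

Filtering by value keeps the subset correct even if `drop_na` removes rows from earlier years.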

Descriptive statistics

In this section, I focus on the unemployment rate by education level. The other categories, race and gender, are excluded so that the relationship between education level and unemployment can be examined on its own.

The range is the difference between the maximum and the minimum of the observed values. It gives readers a first idea of how dispersed the data may be. However, because it depends only on the two most extreme values, it is a crude measure of dispersion, and other metrics such as the standard deviation describe the spread better. In both 2017 and 2019, the monthly unemployment rate for people with primary school education had a larger range (2.5 and 1.0 percentage points) than that of any other education level.

This is also reflected in the standard deviation. Unlike the range, which only captures the difference between the maximum and the minimum, the standard deviation measures how far the values typically deviate from the arithmetic mean, making it a clearer measure of dispersion. As with the range, the standard deviation of the unemployment rate for people with primary school education is the highest of all education levels in 2017 and 2019 (0.67 and 0.25), indicating greater variation in the data.

Lastly, skewness describes the asymmetry of the distribution. If the value is negative, the data are skewed to the left, meaning that the longer tail of the distribution lies on the left; a positive value indicates the opposite. A skewness of zero or close to zero suggests that the distribution is roughly symmetric, although this alone does not establish that the data are normally distributed. For example, the skewness of the unemployment rate for people with a professional degree in 2019 is 0.93, meaning that the data are positively skewed (right-skewed).
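To make the three statistics concrete, here is a minimal base-R sketch on a made-up vector `x` (not taken from the dataset). The skewness formula used is the mean cubed deviation divided by the cubed sample standard deviation; I believe this matches the default variant reported by psych (type = 3), but that is an assumption worth checking against the package documentation.

```r
# Made-up monthly rates with one unusually high value
x <- c(2, 3, 3, 4, 10)

# Range: distance between the largest and the smallest value
x_range <- diff(range(x))          # 10 - 2 = 8

# Standard deviation: typical distance from the mean (n - 1 denominator)
x_sd <- sd(x)

# Skewness: mean cubed deviation divided by the cubed standard deviation
x_skew <- mean((x - mean(x))^3) / x_sd^3

c(range = x_range, sd = x_sd, skew = x_skew)
```

The skewness comes out positive (roughly 0.95) because the single high value of 10 stretches the right tail, which is the same pattern as a right-skewed group in the table below.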

library(psych)

## descriptive statistics by year (columns 1-3, i.e. Year, Month and Date, are dropped)
describeBy(df_sub[, -c(1, 2, 3)], 
           df_sub$Year)

## 
##  Descriptive statistics by group 
## group: 2017
##                     vars  n mean   sd median trimmed  mad min max range  skew
## Primary_School         1 12 6.51 0.67   6.45    6.52 0.44 5.2 7.7   2.5  0.01
## High_School            2 12 4.65 0.33   4.60    4.64 0.37 4.2 5.2   1.0  0.17
## Associates_Degree      3 12 3.75 0.16   3.70    3.73 0.15 3.6 4.1   0.5  1.04
## Professional_Degree    4 12 2.32 0.13   2.30    2.32 0.15 2.1 2.5   0.4 -0.27
##                     kurtosis   se
## Primary_School         -0.64 0.19
## High_School            -1.41 0.10
## Associates_Degree      -0.17 0.05
## Professional_Degree    -1.18 0.04
## ------------------------------------------------------------ 
## group: 2018
##                     vars  n mean   sd median trimmed  mad min max range  skew
## Primary_School         1 12 5.61 0.20   5.65    5.62 0.22 5.2 5.9   0.7 -0.43
## High_School            2 12 4.04 0.28   4.05    4.06 0.37 3.5 4.4   0.9 -0.26
## Associates_Degree      3 12 3.32 0.15   3.30    3.32 0.15 3.1 3.5   0.4 -0.10
## Professional_Degree    4 12 2.12 0.11   2.15    2.12 0.07 2.0 2.3   0.3 -0.03
##                     kurtosis   se
## Primary_School         -0.80 0.06
## High_School            -1.10 0.08
## Associates_Degree      -1.51 0.04
## Professional_Degree    -1.61 0.03
## ------------------------------------------------------------ 
## group: 2019
##                     vars  n mean   sd median trimmed  mad min max range  skew
## Primary_School         1 12 5.35 0.25    5.3    5.36 0.15 4.8 5.8   1.0 -0.18
## High_School            2 12 3.66 0.12    3.7    3.66 0.07 3.4 3.9   0.5 -0.18
## Associates_Degree      3 12 3.02 0.23    3.0    3.02 0.22 2.7 3.4   0.7  0.36
## Professional_Degree    4 12 2.09 0.12    2.1    2.08 0.07 1.9 2.4   0.5  0.93
##                     kurtosis   se
## Primary_School         -0.02 0.07
## High_School             0.59 0.03
## Associates_Degree      -1.17 0.07
## Professional_Degree     0.78 0.04

Graphical representation

Firstly, I visualise the data as boxplots faceted by education level, showing how the distribution of the unemployment rate evolved for each education level between 2017 and 2019. The data are first converted into long format, with the education level attached to each value.

Looking at the boxplots within each category, we can see that the median decreases from year to year. This means that, in general, the unemployment rate fell each year. Comparing across the categories, it is also clear that there are stark differences: the unemployment rate is lower at each higher education level.

Interestingly, the change in the unemployment rate for people with a professional degree is not very drastic, as their medians stay close to 2% across the years (2.3, 2.15 and 2.1). In contrast, for people whose highest education level is primary school, the median has visibly decreased (from 6.45 to 5.3).
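The medians that the boxplots display can also be computed directly. The following is a minimal sketch on a hypothetical long-format data frame `toy_long` with invented values; only the column names mirror the ones used in this report.

```r
# Toy long-format data; values are invented, not taken from the dataset
toy_long <- data.frame(
  Year = factor(rep(rep(c("2017", "2018"), each = 3), times = 2)),
  education_level = rep(c("Primary_School", "Professional_Degree"), each = 6),
  unemployment_rate = c(6.0, 6.5, 7.0, 5.5, 5.6, 5.9,
                        2.3, 2.4, 2.2, 2.1, 2.2, 2.0)
)

# Median unemployment rate for each year and education level
medians <- aggregate(unemployment_rate ~ Year + education_level,
                     data = toy_long, FUN = median)
medians
```

The same `aggregate` call could be run on `sub_long` once it has been created, which would give the centre line of each box as a number.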

library(ggplot2)

## Warning: package 'ggplot2' was built under R version 4.2.3

## 
## Attaching package: 'ggplot2'

## The following objects are masked from 'package:psych':
## 
##     %+%, alpha

library(tidyr)

# data preparation for visualisation

## reshape df for visualisation
sub_long <- df_sub %>% 
  pivot_longer(
    cols = Primary_School:Professional_Degree, 
    names_to = "education_level", 
    values_to = "unemployment_rate"
  )

## factorisation of labels 
sub_long$education_level <- factor(sub_long$education_level, 
                                   levels = c("Primary_School", 
                                              "High_School", 
                                              "Associates_Degree", 
                                              "Professional_Degree"))

# data visualisation

## plotting boxplot
ggplot(sub_long, aes(x = Year, 
                     y = unemployment_rate, 
                     fill = Year)) +
  
  geom_boxplot() +
  
  facet_wrap(~education_level, 
             ncol = 4) +
  
  labs(title = "Unemployment rate by education level between 2017 and 2019", 
       y = "Unemployment rate") +
  
  scale_fill_manual(values = c("2017" = "bisque4", 
                               "2018" = "bisque3", 
                               "2019" = "bisque")) +
  
  theme(plot.title = element_text(hjust = 0.5, 
                                  face = "bold"))

Furthermore, although slightly more difficult to interpret, the histogram also gives some insight into the distribution of the data. Regarding the parameters, since the unemployment rate mostly changes in the first decimal place, the bin width is set to 0.2 to keep the plot representative without losing interpretability.

Visually, the dispersion of the unemployment rate for people with primary school education in 2017 is quite high: the bars are spread visibly wider. Numerically, this is confirmed by its standard deviation of 0.67.

Examining the skewness, the distribution of the unemployment rate for associate degree holders in 2017 appears right-skewed, as two bars lie far above the median. In the descriptive statistics, this group has a skewness of 1.04, confirming the skew to the right.
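To illustrate what a bin width of 0.2 does, the sketch below counts made-up values in 0.2-wide intervals using base R. The vector `rates` and the break points are assumptions chosen for illustration; ggplot2 places its bin edges according to its own defaults, so the bars in the actual plot may be cut at different positions.

```r
# Made-up rates with most values low and a short tail to the right
rates <- c(3.6, 3.6, 3.7, 3.7, 3.7, 3.8, 3.8, 3.9, 4.0, 4.1)

# Break points 0.2 apart, offset so that no value falls on an edge
breaks <- seq(3.55, 4.15, by = 0.2)

# Number of observations in each 0.2-wide bin
counts <- table(cut(rates, breaks = breaks))
counts
```

The counts fall from 5 to 3 to 2 as the rate increases, which is the shape a right-skewed histogram takes: most observations on the left and a thinner tail on the right.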

# data visualisation
ggplot(sub_long, aes(x = unemployment_rate, 
                     fill = Year)) +
  
  geom_histogram(position = "dodge", 
                 binwidth = 0.2, 
                 colour = "black") +
  
  facet_wrap(~education_level, 
             ncol = 2) + 
  
  labs(title = "Distribution of unemployment rate by year", 
       x = "Unemployment Rate", 
       y = "Frequency") +
  
  scale_fill_manual(values = c("2017" = "bisque4", 
                               "2018" = "bisque3", 
                               "2019" = "bisque")) +
  
  theme(plot.title = element_text(hjust = 0.5, 
                                  face = "bold"))

US Unemployment Data (2010-2020)

2025-03-19