## importing data using "read.table"
df <- read.table("unemployment_data_us.csv", header = TRUE, sep = ",", dec = ".")
head(df)
## Year Month Primary_School Date High_School Associates_Degree
## 1 2010 Jan 15.3 Jan-2010 10.2 8.6
## 2 2011 Jan 14.3 Jan-2011 9.5 8.1
## 3 2012 Jan 13.0 Jan-2012 8.5 7.1
## 4 2013 Jan 12.0 Jan-2013 8.1 6.9
## 5 2014 Jan 9.4 Jan-2014 6.5 5.9
## 6 2015 Jan 8.3 Jan-2015 5.4 5.2
## Professional_Degree White Black Asian Hispanic Men Women
## 1 4.9 8.8 16.5 8.3 12.9 10.2 7.9
## 2 4.3 8.1 15.8 6.8 12.3 9.0 7.9
## 3 4.3 7.4 13.6 6.7 10.7 7.7 7.6
## 4 3.8 7.1 13.7 6.4 9.7 7.5 7.2
## 5 3.3 5.7 12.1 4.7 8.3 6.2 5.8
## 6 2.8 4.9 10.3 4.0 6.7 5.3 5.0
Year (the year the sample was taken), Month (the month the sample was taken)
Primary_School, High_School, Associates_Degree, Professional_Degree
White, Black, Asian, Hispanic
Men, Women
https://www.kaggle.com/datasets/aniruddhasshirahatti/us-unemployment-dataset-2010-2020 (Kaggle, 2020)
To make the data cleaner and more suitable for graphical representation, two libraries are required: dplyr for “relocate” and tidyr for “drop_na”.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
## sort it in chronological order
df <- df[order(df$Year), ]
## move column "Primary_School" next to the other education columns in preparation for plotting
df <- df %>%
  relocate(Primary_School,
           .after = Date)
## drop rows that contain missing values (NA)
df <- drop_na(df)
## convert the month abbreviations in column "Month" to month numbers (1-12)
df$Month_reformatted <- as.numeric(factor(df$Month,
                                          levels = month.abb))
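## subsetting for descriptive statistics focusing on data from 2017-2019
## (rows 85-120 are the 36 months of 2017-2019; columns 1-7 are Year, Month, Date and the four education levels)
df_sub <- df[85:120, 1:7]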
## factorisation of the column "Year"
df_sub$Year <- factor(df_sub$Year)
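The positional subset above relies on every earlier year contributing exactly twelve rows and on the column order produced by `relocate`. A minimal sketch of an alternative that selects rows by year value and columns by name is shown here on a small made-up data frame (`toy`, `keep_cols`, `toy_sub` and all values are hypothetical, not the Kaggle data):

```r
# Hypothetical toy data standing in for df; all values are made up
toy <- data.frame(
  Year  = rep(2016:2020, each = 2),
  Month = rep(c("Jan", "Feb"), times = 5),
  Date  = paste(rep(c("Jan", "Feb"), times = 5),
                rep(2016:2020, each = 2), sep = "-"),
  Primary_School = c(7.5, 7.3, 7.7, 7.9, 5.4, 5.7, 5.7, 5.3, 5.5, 5.7),
  White = c(4.3, 4.3, 4.3, 4.1, 3.5, 3.7, 3.5, 3.3, 3.1, 3.1)
)

# Select rows by year value and columns by name rather than by position
keep_cols <- c("Year", "Month", "Date", "Primary_School")
toy_sub <- toy[toy$Year %in% 2017:2019, keep_cols]

# Treat Year as a grouping variable, as in the code above
toy_sub$Year <- factor(toy_sub$Year)

# Number of rows kept per year
table(toy_sub$Year)
```

Selecting by value keeps the subset correct even if `drop_na` removes rows from earlier years, at the cost of slightly longer code.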
For the following section, I have chosen to focus on the unemployment rate by education level. The other categories, namely race and gender, are excluded to give a clearer view of the effect education level has on unemployment.
The range is the difference between the maximum and the minimum of the observed values. It gives the reader a first idea of how dispersed the data may be. However, because it depends only on the two most extreme values, it says little about how the remaining observations are spread; other measures, such as the standard deviation, describe the dispersion better. In both 2017 and 2019, the monthly unemployment rate of people with a primary school education had a larger range (2.5 and 1.0 percentage points respectively) than that of any other education level.
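As a small illustration of why the range alone can mislead, the sketch below compares two made-up series (`a` and `b` are hypothetical values, not taken from the data set) that share the same minimum and maximum but differ in how the values in between are spread:

```r
# Two made-up series with the same minimum and maximum
a <- c(5.0, 6.0, 6.0, 6.0, 7.0)   # values concentrated in the middle
b <- c(5.0, 5.0, 6.0, 7.0, 7.0)   # values pushed towards the extremes

# Range as a single number: maximum minus minimum
range_a <- diff(range(a))
range_b <- diff(range(b))

# Sample standard deviation (denominator n - 1)
sd_a <- sd(a)
sd_b <- sd(b)

c(range_a = range_a, range_b = range_b)   # identical ranges
c(sd_a = sd_a, sd_b = sd_b)               # sd_b is larger than sd_a
```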
This is also reflected in the standard deviation. Unlike the range, the standard deviation is a clearer measure of the dispersion of the data set. Whilst the range only captures the difference between the maximum and the minimum, the standard deviation summarises how far the values typically deviate from the arithmetic mean. As with the range, the standard deviation of the unemployment rate of people with a primary school education is the highest in 2017 and 2019 (0.67 and 0.25), indicating a higher level of variation in the data.
Lastly, skewness measures the asymmetry of the distribution. If the value is negative, the data are skewed to the left, meaning that the longer tail of the distribution lies on the left; a positive value means the longer tail lies on the right. A skewness of zero or close to zero indicates that the distribution is roughly symmetric, which is consistent with, but does not by itself establish, a normal distribution. For example, the skewness of the unemployment rate of people with a professional degree in 2019 is 0.93, meaning that the data are positively skewed (right-skewed).
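To make the sign convention concrete, here is a minimal base R sketch of a moment-based skewness. The helper `skewness_moment` and the three vectors are hypothetical and made up for illustration; statistical packages apply slightly different sample corrections, so this is not guaranteed to reproduce the values reported by `describeBy`.

```r
# Hypothetical helper: moment-based skewness g1 = m3 / m2^(3/2)
skewness_moment <- function(x) {
  x  <- x[!is.na(x)]
  d  <- x - mean(x)
  m2 <- mean(d^2)   # second central moment
  m3 <- mean(d^3)   # third central moment
  m3 / m2^(3 / 2)
}

right_tail <- c(2.0, 2.0, 2.1, 2.1, 2.2, 2.6)  # made-up, long right tail
left_tail  <- c(1.6, 2.0, 2.1, 2.1, 2.2, 2.2)  # made-up, long left tail
symmetric  <- c(1, 1, 1, 5, 5, 5)              # symmetric, but not bell-shaped

skewness_moment(right_tail)  # positive
skewness_moment(left_tail)   # negative
skewness_moment(symmetric)   # zero, although the data are clearly not normal
```

The third vector shows why a skewness near zero should be read as a sign of symmetry rather than as evidence of normality.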
library(psych)
#descriptive statistics
describeBy(df_sub[, -c(1, 2, 3)],
           df_sub$Year)
##
## Descriptive statistics by group
## group: 2017
## vars n mean sd median trimmed mad min max range skew
## Primary_School 1 12 6.51 0.67 6.45 6.52 0.44 5.2 7.7 2.5 0.01
## High_School 2 12 4.65 0.33 4.60 4.64 0.37 4.2 5.2 1.0 0.17
## Associates_Degree 3 12 3.75 0.16 3.70 3.73 0.15 3.6 4.1 0.5 1.04
## Professional_Degree 4 12 2.32 0.13 2.30 2.32 0.15 2.1 2.5 0.4 -0.27
## kurtosis se
## Primary_School -0.64 0.19
## High_School -1.41 0.10
## Associates_Degree -0.17 0.05
## Professional_Degree -1.18 0.04
## ------------------------------------------------------------
## group: 2018
## vars n mean sd median trimmed mad min max range skew
## Primary_School 1 12 5.61 0.20 5.65 5.62 0.22 5.2 5.9 0.7 -0.43
## High_School 2 12 4.04 0.28 4.05 4.06 0.37 3.5 4.4 0.9 -0.26
## Associates_Degree 3 12 3.32 0.15 3.30 3.32 0.15 3.1 3.5 0.4 -0.10
## Professional_Degree 4 12 2.12 0.11 2.15 2.12 0.07 2.0 2.3 0.3 -0.03
## kurtosis se
## Primary_School -0.80 0.06
## High_School -1.10 0.08
## Associates_Degree -1.51 0.04
## Professional_Degree -1.61 0.03
## ------------------------------------------------------------
## group: 2019
## vars n mean sd median trimmed mad min max range skew
## Primary_School 1 12 5.35 0.25 5.3 5.36 0.15 4.8 5.8 1.0 -0.18
## High_School 2 12 3.66 0.12 3.7 3.66 0.07 3.4 3.9 0.5 -0.18
## Associates_Degree 3 12 3.02 0.23 3.0 3.02 0.22 2.7 3.4 0.7 0.36
## Professional_Degree 4 12 2.09 0.12 2.1 2.08 0.07 1.9 2.4 0.5 0.93
## kurtosis se
## Primary_School -0.02 0.07
## High_School 0.59 0.03
## Associates_Degree -1.17 0.07
## Professional_Degree 0.78 0.04
Firstly, I have chosen a boxplot faceted by education level, showing how the distribution of the unemployment rate evolved for people with different levels of education between 2017 and 2019. The data are first converted into long form, with the education level attached to each value.
Within each category, the median decreases from year to year, meaning that the unemployment rate generally fell each year. Comparing across the categories, there are also stark differences: the higher the education level, the lower the unemployment rate.
Interestingly, the change in the unemployment rate for people with a professional degree is not very drastic: their medians hover just above 2% across the years (2.30, 2.15 and 2.10). On the other hand, for people whose highest education level is primary school, the median has visibly decreased (from 6.45 to 5.30).
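The medians read off a boxplot can also be checked numerically. The following is a minimal sketch using base R's `aggregate` on a made-up long-format data frame (`toy_long` and its values are hypothetical; only the column names mirror those used in this report):

```r
# Made-up long-format data using the same column names as the reshaped data
toy_long <- data.frame(
  Year = factor(rep(c("2017", "2018", "2019"), each = 3, times = 2)),
  education_level = rep(c("Primary_School", "Professional_Degree"), each = 9),
  unemployment_rate = c(6.4, 6.5, 6.8,  5.5, 5.6, 5.8,  5.2, 5.3, 5.5,
                        2.2, 2.3, 2.4,  2.1, 2.1, 2.2,  2.0, 2.1, 2.1)
)

# Median unemployment rate for each education level and year
medians <- aggregate(unemployment_rate ~ education_level + Year,
                     data = toy_long, FUN = median)
medians
```

Each row of `medians` corresponds to the thick horizontal line of one box in the plot.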
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.2.3
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
library(tidyr)
# data preparation for visualisation
## reshape df for visualisation
sub_long <- df_sub %>%
  pivot_longer(
    cols = Primary_School:Professional_Degree,
    names_to = "education_level",
    values_to = "unemployment_rate"
  )
## factorisation of labels
sub_long$education_level <- factor(sub_long$education_level,
                                   levels = c("Primary_School",
                                              "High_School",
                                              "Associates_Degree",
                                              "Professional_Degree"))
# data visualisation
## plotting boxplot
ggplot(sub_long, aes(x = Year,
                     y = unemployment_rate,
                     fill = Year)) +
  geom_boxplot() +
  facet_wrap(~education_level,
             ncol = 4) +
  labs(title = "Unemployment rate by education level between 2017 and 2019",
       y = "Unemployment rate") +
  scale_fill_manual(values = c("2017" = "bisque4",
                               "2018" = "bisque3",
                               "2019" = "bisque")) +
  theme(plot.title = element_text(hjust = 0.5,
                                  face = "bold"))
Furthermore, although slightly more difficult to interpret, the histogram nevertheless offers some insight into the distribution of the data. Regarding the parameters, since the unemployment rate mostly changes by a few tenths of a percentage point, the bin width is set to 0.2 to keep the plot detailed enough without losing interpretability.
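The trade-off behind the bin width can be sketched in base R by counting observations per bin for two different widths. The helper `count_bins` and the vector `rates` are hypothetical, and the bin edges are chosen by hand here, so they will not necessarily coincide with the edges ggplot2 picks for the same bin width.

```r
# Made-up monthly rates for a single group
rates <- c(5.2, 6.0, 6.2, 6.3, 6.4, 6.5, 6.5, 6.6, 6.8, 7.0, 7.1, 7.7)

# Hypothetical helper: count observations per bin for a given bin width
count_bins <- function(x, width) {
  lo <- floor(min(x) / width) * width
  hi <- ceiling(max(x) / width) * width
  # Pad by one bin on each side so that every value falls inside the breaks
  breaks <- seq(lo - width, hi + width, by = width)
  table(cut(x, breaks = breaks, right = FALSE))
}

narrow <- count_bins(rates, 0.2)  # many bins, few observations in each
wide   <- count_bins(rates, 1.0)  # few bins, detail within each bin is lost

length(narrow)
length(wide)
```

With only twelve observations per year and group, a much smaller width would leave most bins empty, while a much larger one would hide the shape of the distribution.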
Visually, the dispersion of the unemployment rate for people with a primary school education in 2017 is quite high: the bars are spread over a visibly wider interval. This is confirmed numerically by the standard deviation of 0.67.
Turning to skewness, the distribution of the unemployment rate of associate degree holders in 2017 appears right-skewed, as there are two bars lying well above the median. The descriptive statistics confirm this: the skewness for associate degree holders in 2017 is 1.04.
# data visualisation
ggplot(sub_long, aes(x = unemployment_rate,
                     fill = Year)) +
  geom_histogram(position = "dodge",
                 binwidth = 0.2,
                 colour = "black") +
  facet_wrap(~education_level,
             ncol = 2) +
  labs(title = "Distribution of unemployment rate by year",
       x = "Unemployment Rate",
       y = "Frequency") +
  scale_fill_manual(values = c("2017" = "bisque4",
                               "2018" = "bisque3",
                               "2019" = "bisque")) +
  theme(plot.title = element_text(hjust = 0.5,
                                  face = "bold"))