This dataset was taken from the World Bank Global Economic Monitor. This dataset contains unemployment rates from 88 countries from year 1990 through 2017.
Source: https://datacatalog.worldbank.org/dataset/global-economic-monitor
This analysis will investigate annual unemployment rates from 2011 to 2015 of 71 countries.
For the five year period from 2011 to 2015, what’s the average annual unemployment rate of each country?
For the five year period from 2011 to 2015, what’s the distribution of the average annual unemployment rate?
For the five year period from 2011 to 2015, what’s the overall trend of the world’s annual unemployment rate?
The bar graph entitled 2011 - 2015: Average Annual Unemployment Rates
presented on Step 6 show the average unemployment rate rankings of the 71 countries. Belarus ranks lowest at 0.65%, and Macedonia ranks highest at 29.09%.
The histogram entitled 2011 - 2015: Average Annual Unemployment Rate of 71 Countries
presented on Step 6 shows a distribution that is multimodal and is skewed to the right. Most countries have an average annual unemployment below 14%. The distribution mean is 8.31%. The median is 7.07%. The standard deviation is 5.44%, and the inter quartile range is 4.83%.
The bar graph entitled 2011-2015: Overall Annual Unemployment Rate
presented under Step 6 shows that there was an increasing trend from 2011 to 2013 and a decreasing trend from 2013 to 2015. 2013 ranked highest at 8.44%, and 2015 ranked lowest at 7.97%.
#rm(list=ls())
library(dplyr)
library(tidyr)
library(knitr)
library(ggplot2)
library(stringr)
library(DT)
#fileSource = "C:/Users/stina/Documents/CUNY SPS Data Science/Spring 2018 Classes/DATA 607 - Data Acquisition and Management/Project 2/Unemployment Rate.csv"
fileSource <- "https://raw.githubusercontent.com/Shetura36/Data-607-Assignments/master/Project2/Unemployment%20Rate.csv"
data = read.table(fileSource, header=TRUE, sep=",", na.strings = c("", "NA"))
This dataset has 28 observations and 89 variables. The first row is blank. This blank row is going to be removed. The 28 observations excludes this blank row.
Each row represents data for a specific year for all 88 countries. Each country has its own column name.
The code below will also rename the columns so that “..” or “.” characters are replaced with the underscore character (_). R automatically adds dots (.) when there is a space in column names. Also, the year column was automatically assigned the column name “X” since the source file does not have a column name for this. The “X” column name is going to be renamed as year
.
# 29 observations and 89 variables
dim(data)
## [1] 29 89
# remove first row
data <- data[-1,]
# rename rownames
rownames(data) <- c(1:28)
# rename first column to "year"
colnames(data)[1] <- "year"
# get column names for countries
column_names <- colnames(data)
# clean up column names for each country
column_names <-
#replace the ".." with underscore character
sapply(column_names, stringr::str_replace_all, pattern="[.]{2,2}",replacement= "_") %>%
#replace the period at the very end of the string with a blank character
sapply(stringr::str_replace_all, pattern="[.]{1,1}$",replacement= "") %>%
#replace "." with underscore character
sapply(stringr::str_replace_all, pattern="[.]{1,1}",replacement= "_") %>%
sapply(stringr::str_replace_all, pattern="[_]{2,2}",replacement= "_")
#update country column names with cleaner names
colnames(data) <- column_names
#preview data in wide format
#datatable(data) --> does not display on github
kable(head(data[,1:10],10), format="markdown")
year | Argentina | Developing_Asia | Australia | Austria | Belgium | Bulgaria | Bahrain | Belarus | Brazil |
---|---|---|---|---|---|---|---|---|---|
1990 | NA | NA | 6.943297 | 5.373002 | 6.550265 | NA | NA | NA | NA |
1991 | NA | NA | 9.614137 | 5.823096 | 6.439811 | NA | NA | NA | NA |
1992 | NA | NA | 10.750080 | 5.941711 | 7.088083 | 13.23500 | NA | NA | NA |
1993 | NA | NA | 10.866170 | 6.811381 | 8.619116 | 15.85583 | NA | NA | NA |
1994 | NA | NA | 9.705695 | 6.545480 | 9.753488 | 14.06583 | NA | NA | NA |
1995 | NA | NA | 8.471058 | 6.589767 | 9.673860 | 11.38583 | NA | NA | NA |
1996 | NA | NA | 8.516425 | 7.033851 | 9.543726 | 11.06167 | NA | NA | NA |
1997 | NA | NA | 8.363718 | 7.103283 | 9.212427 | 14.04583 | NA | NA | NA |
1998 | NA | NA | 7.651377 | 7.184796 | 9.338412 | 12.20333 | NA | NA | NA |
1999 | NA | NA | 6.869885 | 6.645249 | 8.484549 | 13.78250 | NA | NA | NA |
This original table has a year column and followed by columns representing each country.
I am going to use tidyr to transform all the country columns into a single country_name
column so that every observation will have the columns year
, country_name
, and annual_unemployment
.
The tidyr::gather
function will accomplish this.
Below you will see a preview of the first few rows of the transformed dataset.
data <- tidyr::gather(data, "country_name", "annual_unemployment", 2:89)
#preview of tidy dataset (long format)
#datatable(data) --> does not display on Github
kable(head(data, 10), format="markdown")
year | country_name | annual_unemployment |
---|---|---|
1990 | Argentina | NA |
1991 | Argentina | NA |
1992 | Argentina | NA |
1993 | Argentina | NA |
1994 | Argentina | NA |
1995 | Argentina | NA |
1996 | Argentina | NA |
1997 | Argentina | NA |
1998 | Argentina | NA |
1999 | Argentina | NA |
This analysis will only investigate annual unemployment rates from 2012 to 2015. The code below will generate a subset of the data from 2011 to 2015. This five year period was selected because it’s the most recent five year period I identified that would exclude the least number of countries with incomplete cases.
This subset will only include countries with complete cases within the five year period.
In addition, the following country_name
entries are going to be excluded since these are groupings of countries and not actual countries.
#exclude entries that are not countries
exclude_countries = c(
"Developing_Asia",
"Developing_Countries",
"East_Asia_Pacific_developing",
"Europe_Central_Asia_developing",
"High_Income_Countries",
"High_income_OECD",
"Latin_America_Caribbean_developing",
"Low_Income",
"Middle_Income_Countries",
"Middle_East_N_Africa_developing",
"High_Income_Non_OECD",
"South_Asia_developing",
"Sub_Saharan_Africa_developing",
"World_WBG_members"
)
#only include years from 2011 to 2015 and exclude countries that are not actual countries
part1 <-
data %>% dplyr::filter(year >= 2011 & year <= 2015) %>%
dplyr::filter(!(country_name %in% exclude_countries))
dplyr::filter(data, country_name %in% exclude_countries) %>% dplyr::distinct(country_name)
## country_name
## 1 Developing_Asia
## 2 Developing_Countries
## 3 East_Asia_Pacific_developing
## 4 Europe_Central_Asia_developing
## 5 High_Income_Countries
## 6 High_income_OECD
## 7 Latin_America_Caribbean_developing
## 8 Low_Income
## 9 Middle_Income_Countries
## 10 Middle_East_N_Africa_developing
## 11 High_Income_Non_OECD
## 12 South_Asia_developing
## 13 Sub_Saharan_Africa_developing
## 14 World_WBG_members
#identify countries with missing data
#the code below will count the number of NA values for unemployment_rate for each country
part2 <-
data.frame(part1 %>% dplyr::filter(year >= 2011 & year <= 2015) %>%
dplyr::group_by(country_name) %>% dplyr::summarise(length(which(is.na(annual_unemployment)))))
#rename column to "NA_count" for readability
colnames(part2) <- c("country_name", "NA_count")
#display countries to be excluded
kable(head(dplyr::filter(part2, NA_count > 0), 10), format="markdown")
country_name | NA_count |
---|---|
Argentina | 1 |
Israel | 1 |
Tunisia | 1 |
#identify countries with complete data for 5 year period
part3 <- dplyr::filter(part2, NA_count == 0)
#only include country names with NA_count = 0.
#countries with NA_count of 0 are ones with complete data for years 2011 to 2015.
target <-
part1 %>% dplyr::filter(country_name %in% part3$country_name)
#check number of countries make sense
#total distinct country =
#excluded coutnries + countries with missing data for 5 year period + countries in target subset
nrow(dplyr::distinct(data,country_name)) ==
length(exclude_countries) +
nrow(dplyr::filter(part2, NA_count > 0)) +
nrow(dplyr::distinct(target,country_name))
## [1] TRUE
#number of distinct countries in target subset
nrow(dplyr::distinct(target,country_name))
## [1] 71
#preview target subset
#datatable(target) --> does not display on Github
kable(head(target, 10), format="markdown")
year | country_name | annual_unemployment |
---|---|---|
2011 | Australia | 5.084754 |
2012 | Australia | 5.223256 |
2013 | Australia | 5.661276 |
2014 | Australia | 6.076739 |
2015 | Australia | 6.062296 |
2011 | Austria | 6.724885 |
2012 | Austria | 6.971840 |
2013 | Austria | 7.616157 |
2014 | Austria | 8.368532 |
2015 | Austria | 9.107860 |
In this step, we calculate the 5-year average annual_unemployment rate for each country in the target subset.
target_mean <-
data.frame(target %>% dplyr::group_by(country_name) %>%
summarise(mean = mean(annual_unemployment)))
#datatable(target_mean) --> does not display on Github
kable(head(target_mean, 10), format="markdown")
country_name | mean |
---|---|
Algeria | 10.5200000 |
Australia | 5.6216642 |
Austria | 7.7578548 |
Bahrain | 3.8600000 |
Belarus | 0.6492506 |
Belgium | 8.0434458 |
Brazil | 7.5178434 |
Bulgaria | 10.6468346 |
Canada | 7.1483334 |
Chile | 6.4481292 |
Below is a bar plot of the average annual unemployment rate of 71 countries ordered from lowest to highest. Belarus ranks lowest at about 0.65%. Macedonia ranks highest at about 29.09%.
target_mean_ordered <- dplyr::arrange(target_mean, mean)
ggplot(data = target_mean_ordered[1:35,], aes(x=reorder(country_name,-mean), y=mean)) +
geom_bar(stat="identity", width=0.5, color="#1F3552", fill="steelblue",
position=position_dodge()) +
geom_text(aes(label=round(mean, digits=2)), hjust=1.3, size=3.0, color="white") +
coord_flip() +
ggtitle("2011 - 2015: Average Annual Unemployment Rates") +
xlab("") + ylab("Unemployment Rate") +
theme_minimal()
ggplot(data = target_mean_ordered[36:71,], aes(x=reorder(country_name,-mean), y=mean)) +
geom_bar(stat="identity", width=0.5, color="#1F3552", fill="steelblue",
position=position_dodge()) +
geom_text(aes(label=round(mean, digits=2)), hjust=1.3, size=3.0, color="white") +
coord_flip() +
ggtitle("2011 - 2015: Average Annual Unemployment Rates (continued)") +
xlab("") + ylab("Unemployment Rate") +
theme_minimal()
In the histogram below, you will see that the distribution looks multimodal and is skewed to the right. Most countries have an average annual unemployment below 14%.
The distribution mean is 8.31%. The median is 7.07%. The standard deviation is 5.44%, and the inter quartile range is 4.83%.
ggplot(target_mean, aes(x = mean)) +
geom_histogram(aes(y = ..count..), binwidth = 1,colour = "#1F3552", fill = "steelblue") +
scale_x_continuous(name = "Unemployment Rate",breaks = seq(0, 30, 2), limits=c(0, 30)) +
scale_y_continuous(name = "Count") +
ggtitle("2011 - 2015: Average Annual Unemployment Rate of 71 Countries") +
theme_minimal()
target_mean %>% dplyr::summarise(dist_mean = mean(mean),
dist_median = median(mean),
dist_sd = sd(mean),
dist_IQR = IQR(mean))
## dist_mean dist_median dist_sd dist_IQR
## 1 8.31395 7.071833 5.439919 4.823871
To visualize the overall trend of the world’s unemployment rate (as represented by the 71 countries) for each year, a simple average of all of the countries’ annual unemployment rates was calculated for each year.
Below is a bar graph that shows the overall annual unemployment rate for each year. As you can see, 2013 ranked as the highest in overall average unemployment rate at 8.44%, and 2015 ranked lowest at 7.97%.
Looking at the bar graph below, there was an increasing trend from 2011 to 2013 and a decreasing trend from 2013 to 2015.
year_mean <-
data.frame(target %>% dplyr::group_by(year) %>% summarise(year_mean = mean(annual_unemployment)))
ggplot(data = year_mean, aes(x=year, y=year_mean)) +
geom_bar(stat="identity", width=0.5, color="#1F3552", fill="steelblue",
position=position_dodge()) +
geom_text(aes(label=round(year_mean, digits=2)), vjust=1.3, size=3.0, color="white") +
ggtitle("2011 - 2015: Overall Annual Unemployment Rate") +
ylab("") + xlab("Unemployment Rate") +
theme_minimal()