Introduction

The goal of Project 2 is to choose 3 wide datasets and transform the datasets - Create a .CSV file (or optionally, a MySQL database!) that includes all of the information included in the dataset. You’re encouraged to use a “wide” structure similar to how the information appears in the discussion item, so that you can practice tidying and transformations as described below. - Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data. [Most of your grade will be based on this step!] - Perform the analysis requested in the discussion item. - Your codeshould be in an R Markdown file, posted to rpubs.com, and should include narrative descriptions of your data cleanup work, analysis,and conclusions.

Dataset 1 : Canada Unemployment Rate

The dataset that is from Stat Canada which is a government organization which compiles data for the public. The dataset that we will be looking at is the Canada unemployment data (“Labour Force Characteristics, Monthly, Seasonally Adjusted and Trend-Cycle, Last 5 Months,” n.d.). We can see that this is a wide dataset with all of the months being in the columns. The CSV file also includes a lot of unnecessary information that needs to be cleaned before any analysis can be done on the dataset.

Importing CSV File and Cleaning Data

We first need to import the CSV file from github and then clean then dataset for analysis. Just looking at the csv file in a raw format we can see that there is a lot of data that is unnecessary that needs to be ignored.

unemployment = read.csv('https://raw.githubusercontent.com/xvicxpx/Data607/main/Project%202/Canada%20Unemployment%20Data.csv', skip=9)

unemployment = unemployment[-c(1, 8:30),]

unemployment = unemployment[-c(2)]

unemployment[1, 1] = 'Population'
unemployment[2, 1] = 'Labour Force'
unemployment[3, 1] = 'Employed'
unemployment[4, 1] = 'Full Time Employment'
unemployment[5, 1] = 'Part TIme Employment'
unemployment[6, 1] = 'Unemployed'

Unpivoting the dataset

Now that the data has been cleaned we are going unpivot the dataset in order to make it easier to do analysis. In Tidyr the function is called pivot_wider() which has replaced the old gather function.

library(tidyr)

unemployment = unemployment %>%
  pivot_longer(cols = c(2:26), names_to = "Month_Year", values_to = "Num_of_People") %>%
  pivot_wider(names_from = "Labour.force.characteristics", values_from = "Num_of_People")

It seems like that the dataset thinks that all the numbers are characters so we need to fix that first

unemployment[, c(2:7)] = lapply(unemployment[, c(2:7)], function(x){as.numeric(gsub(",", "", x))})

As you can see we unpivot the data and then repivot the data. The reason for doing this is because it will be easier to calculate the unemployment percentage by month in this format.

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

unemployment = unemployment %>%
  mutate(unemployment_percent = Unemployed /`Labour Force`)

I also need to fix the date as R is unable to understand month names

library(lubridate)

## 
## Attaching package: 'lubridate'

## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

unemployment$Month_Year = parse_date_time(unemployment$Month_Year, orders = "bY")

Analysis

library(ggplot2)

unemployment %>%
  ggplot(aes(x=Month_Year, y=unemployment_percent)) + geom_line()

We can see that there is a huge spike in unemployment rates and that is due to COVID. The surprising part is that unemployment rates are going back up and that is most likely due to a resurgence of COVID in Canada. Without more data that has a COVID like event it is hard to see how the unemployment rate is going to go. One interesting thing is to see if there is any correlation between COVID cases and unemployment rates.

Dataset 2 : COVID

The dataset is from a Kaggle competition by the Roche Data Science Coalition (“UNCOVER COVID 19 Challenge,” n.d.). The dataset is a curation of publicly available COVID datasets from different sources.

Loading CSV and Cleaning Data

As what we did with the last dataset we will load the dataset and see if there needs to be any cleaning involved

covid = read.csv('https://raw.githubusercontent.com/xvicxpx/Data607/main/Project%202/Covid%20Dataset.csv')
covid = covid[-c(1:8, 224:231),]

We can see that there are a couple of columns that are not necessary. We will also be removing some unnecessary rows that does not help us with the analysis

covid = covid %>%
  select(country, total_cases, total_deaths, total_recovered, serious_critical_cases, total_tests, population)

Since this data is quite clean there isnt a need to do a lot of transformation

Analysis

One thing that is interesting to look at is which country has the most test and cases. The reason why this is interesting is because countries that test more will have more cases and I want to see if there a trend.

covid %>%
  ggplot(aes(x=total_tests, y=total_cases)) + geom_point() + geom_text(aes(label=country), hjust=0,vjust=2)

## Warning: Removed 28 rows containing missing values (geom_point).

## Warning: Removed 28 rows containing missing values (geom_text).

It seems like that there is no real correlation between cases of covid and testing. One thing that is interesting is how USA and Brazil is testing the same amount as country like Mexico, France, and UK yet USA and Brazil still have a lot more total cases.

Using library called GGally I am going to create a scatterplot matrix to look for correlations

library(GGally)

## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

covid %>%
  select(total_cases, total_deaths, total_tests) %>%
  ggpairs()

## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removed 26 rows containing missing values

## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removed 28 rows containing missing values

## Warning: Removed 26 rows containing missing values (geom_point).

## Warning: Removed 26 rows containing non-finite values (stat_density).

## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removed 28 rows containing missing values

## Warning: Removed 28 rows containing missing values (geom_point).

## Warning: Removed 28 rows containing missing values (geom_point).

## Warning: Removed 28 rows containing non-finite values (stat_density).

We can see that there isnt that much correlations in this dataset there is a correlation between cases and deaths. This makes sense as with more cases there is also going to be more covid deaths. One interesting thing to look at is certain countries like China which does not openly provide data. It would have been interesting to see if they made a mistake in their data and showed more deaths than cases which shows that they are underreporting the covid cases.

Dataset 3 : World Bank GDP

This dataset is from the World Bank (“GDP Growth (Annual %),” n.d.) and shows annual GDP growth percentage for each country every year

Loading and Cleaning Data

We can see that this is a very wide dataset that needs to be cleaned. We are going to be needing to unpivot the data and fix the years. This is a sparse dataset as there are some countries that do not have a lot of GDP information. We can also see that there is still no data in 2020 so that will also be removed.

gdp = read.csv('https://raw.githubusercontent.com/xvicxpx/Data607/main/Project%202/World%20Bank%20Annual%20GDP%20Growth.csv', skip = 4)

gdp = gdp[-c(3, 4, 65, 66)]

gdp = gdp %>%
  pivot_longer(cols = c(3:62), names_to = 'Year', values_to = 'GDP_Growth_Percent')

gdp = gdp %>%
  mutate(Year = parse_date_time(gsub("X", "", Year), orders = "Y"))

Analysis

With the clean dataset now we can easily plot any countries GDP on a time series scale. To improve this dataset there should be a way to group the countries into regions, zones, or categories as currently one would need to manually select which country they would like to see.

gdp %>%
  filter(Country.Code == 'USA' | Country.Code == 'CAN' | Country.Code == 'MEX') %>%
  ggplot(aes(x=Year, y=GDP_Growth_Percent)) + geom_line(aes(color=Country.Name))

## Warning: Removed 3 row(s) containing missing values (geom_path).

We can see that Mexico, Canada, and the United States have very similar GDP which is due to them being neighbors and its largest trading partners in North America. But we can see that there is a slightly downward trend which shows that the countries economies are no longer growing as fast as it use to be.

References

“GDP Growth (Annual %).” n.d. The World Bank. https://data.worldbank.org/indicator/NY.GDP.MKTP.KD.ZG?end=2019&start=1961&view=chart.

“Labour Force Characteristics, Monthly, Seasonally Adjusted and Trend-Cycle, Last 5 Months.” n.d. Statistics Canada. https://doi.org/https://doi.org/10.25318/1410028701-eng.

“UNCOVER COVID 19 Challenge.” n.d. Kaggle. https://www.kaggle.com/roche-data-science-coalition/uncover.

CUNY SPS Data 607 Project 2

Introduction

Dataset 1 : Canada Unemployment Rate

Importing CSV File and Cleaning Data

Unpivoting the dataset

Analysis

Dataset 2 : COVID

Loading CSV and Cleaning Data

Analysis

Dataset 3 : World Bank GDP

Loading and Cleaning Data

Analysis

References