DATA 607 PROJECT 2

The goal of this assignment is to give you practice in preparing different datasets for downstream analysis work.

Your task is to:

(1) Choose any three of the “wide” datasets identified in the Week 6 Discussion items. (You may use your own dataset; please don’t use my Sample Post dataset, since that was used in your Week 6 assignment!)

For each of the three chosen datasets:

 Create a .CSV file (or optionally, a MySQL database!) that includes all of the information included in the dataset. You’re encouraged to use a “wide” structure similar to how the information appears in the discussion item, so that you can practice tidying and transformations as described below.

 Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data. [Most of your grade will be based on this step!]

 Perform the analysis requested in the discussion item.

 Your code should be in an R Markdown file, posted to rpubs.com, and should include narrative descriptions of your data cleanup work, analysis, and conclusions.

Wide Data Set # 1

Choose a wide data set from the Week 5/6 discussions to convert into long data set and generate analysis from the long data set. Screenshot of discussion data set below. I chose this dataset because it was a wide dataset.

CSV file has been loaded to github

https://github.com/johnsuh23/DATA-607/blob/master/Unemployment%20Rate.csv

Original file was excel file which was saved as a csv file. The data from the csv file will be transformed from a wide set to a long set along with other cleanup where neccessary in order to do further analysis on the data set.

df_wide<- read.csv("C://Users//jkks9//Documents//DATA 607//Unemployment Rate.csv", header=TRUE, sep=",")
head(df_wide[1:20])

##      X Advanced.Economies Argentina Australia  Austria  Belgium Bulgaria
## 1   NA                 NA        NA        NA       NA       NA       NA
## 2 1990           5.800582        NA  6.943297 5.373002 6.550260       NA
## 3 1991           6.728688        NA  9.614137 5.823096 6.439812       NA
## 4 1992           7.511064        NA 10.750080 5.941711 7.088092 13.23500
## 5 1993           7.936175        NA 10.866170 6.811381 8.619130 15.85583
## 6 1994           7.715897        NA  9.705695 6.545480 9.753554 14.06583
##   Bahrain Belarus Brazil   Canada Switzerland Chile China Colombia Cyprus
## 1      NA      NA     NA       NA          NA    NA    NA       NA     NA
## 2      NA      NA     NA  8.15000    0.501328    NA    NA       NA     NA
## 3      NA      NA     NA 10.31667    1.090451    NA    NA       NA     NA
## 4      NA      NA     NA 11.21667    2.563105    NA    NA       NA     NA
## 5      NA      NA     NA 11.37500    4.516116    NA    NA       NA     NA
## 6      NA      NA     NA 10.39167    4.718465    NA    NA       NA     NA
##   Czech.Republic  Germany Denmark Dominican.Republic
## 1             NA       NA      NA                 NA
## 2             NA       NA      NA                 NA
## 3             NA 4.864885      NA                 NA
## 4             NA 5.764563      NA                 NA
## 5       4.333333 6.931370      NA                 NA
## 6       4.283333 7.340639      NA                 NA

colnames(df_wide[1:20])

##  [1] "X"                  "Advanced.Economies" "Argentina"         
##  [4] "Australia"          "Austria"            "Belgium"           
##  [7] "Bulgaria"           "Bahrain"            "Belarus"           
## [10] "Brazil"             "Canada"             "Switzerland"       
## [13] "Chile"              "China"              "Colombia"          
## [16] "Cyprus"             "Czech.Republic"     "Germany"           
## [19] "Denmark"            "Dominican.Republic"

There are over 80 variable country columns. 10 of the variable country name columns will be chosen to be renamed as a abbreviated country name along with the year. NA values will be removed. And then the abbreviated variable country names except the year column will be turned into rows. This will turn the wide data set to a long data set.

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(tidyr)
df2_wide<-rename(df_wide,Year=X,ARG=Argentina,SWE=Sweden,US=United.States,KOR=Korea..Rep.,JPN=Japan,BRA=Brazil,CAN=Canada,IND=India,MEX=Mexico,IRE=Ireland)
df3_wide<-select(df2_wide,Year,ARG,SWE,US,KOR,JPN,BRA,CAN,IND,MEX,IRE)
df4_wide<-na.omit(df3_wide)
head(df4_wide)

##    Year      ARG      SWE       US      KOR      JPN       BRA      CAN
## 25 2013 7.076472 8.004120 7.358333 3.108333 4.021680  7.191545 7.100000
## 26 2014 7.270868 7.931666 6.158333 3.500000 3.587407  6.784583 6.925000
## 27 2015 6.611389 7.378905 5.275000 3.600000 3.371420  8.304733 6.900000
## 28 2016 8.467953 6.940763 4.875000 3.675000 3.112862 11.281130 6.991667
## 29 2017 8.344703 6.674000 4.350000 3.683333 2.806622 12.764030 6.333333
##    IND      MEX       IRE
## 25 8.2 4.903333 13.775000
## 26 9.3 4.824167 11.900000
## 27 8.5 4.347500  9.950000
## 28 8.0 3.882500  8.391667
## 29 8.8 3.420833  6.733333

df_long<-gather(df4_wide,Country,UnemploymentRate,-Year)
head(df_long)

##   Year Country UnemploymentRate
## 1 2013     ARG         7.076472
## 2 2014     ARG         7.270868
## 3 2015     ARG         6.611389
## 4 2016     ARG         8.467953
## 5 2017     ARG         8.344703
## 6 2013     SWE         8.004120

Analysis of data will be performed by creating visualization of unemployment rates of a single country for time period of 5 years. The country chosen is ARG which is Argentina. Based on the chart, Argentina has had a unemployment rate hovering between 6% and 8% between 2013 and 2017. It is trending downward as 2017 was lower than the previous year.

library(ggplot2)
df2_long<-subset(df_long,Country=="ARG")
ggplot(df2_long, aes(x=df2_long$Year, y=df2_long$UnemploymentRate, fill=df2_long$Country)) + geom_bar(stat='identity', position='dodge')