The goal of this assignment is to give you practice in preparing different datasets for downstream analysis work.
Your task is to:
(1) Choose any three of the “wide” datasets identified in the Week 6 Discussion items. (You may use your own dataset; please don’t use my Sample Post dataset, since that was used in your Week 6 assignment!)
For each of the three chosen datasets:
Create a .CSV file (or optionally, a MySQL database!) that includes all of the information included in the dataset. You’re encouraged to use a “wide” structure similar to how the information appears in the discussion item, so that you can practice tidying and transformations as described below.
Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data. [Most of your grade will be based on this step!]
Perform the analysis requested in the discussion item.
Your code should be in an R Markdown file, posted to rpubs.com, and should include narrative descriptions of your data cleanup work, analysis, and conclusions.
Wide Data Set # 2
Choose a wide data set from the Week 5/6 discussions to convert into long data set and generate analysis from the the long data set. Screenshot of discussion data set below. I chose this dataset because it was a wide dataset similiar to the dataset 1 used. I wanted to keep hammering in and honing my data tidying skills of converting wide datasets to long datasets using the gather function.

CSV file has been loaded to github
Original file was excel file which was saved as a csv file. The data from the csv file will be transformed from a wide set to a long set along with other cleanup where neccessary in order to do further analysis on the data set.
Reference to website that linked the dataset and explained methods to tidy the dataset. The methods contained in the website was mostly used to understand the data, the only one directly used was selecting the columns with the select function.
df_wide<- read.csv("C://Users//jkks9//Documents//DATA 607//UN_MigrantStock.csv", header=FALSE, sep=",")
df2_wide<-df_wide[-c(1:15),]
head(df2_wide[1:10])
## V1 V2 V3 V4 V5 V6 V7
## 16 Total
## 17 1990 1990001 WORLD 900 153,011,473
## 18 1990 1990002 UN development groups ..
## 19 1990 1990003 More developed regions b 901 82,767,216
## 20 1990 1990004 Less developed regions c 902 70,244,257
## 21 1990 1990005 Least developed countries d 941 11,060,221
## V8 V9 V10
## 16 Other South Other North Afghanistan
## 17 6,548,526 2,366,800 6,823,350
## 18 .. .. ..
## 19 3,385,103 1,077,179 119,386
## 20 3,163,423 1,289,621 6,703,964
## 21 482,753 239,756 0
There are over 200 variable country columns. Destination countries will be picked to turn into rows. Subset will be applied on Origin Country, with the origin country being United States of America.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
df3_wide<-select(df2_wide,-V2,-V4,-V5,-V7)
df4_wide<-rename(df3_wide,Year=V1,Origin=V3,group=V6,Afghanistan=V10,Albania=V11,Algeria=V12,Angola=V15)
df5_wide<- subset(df4_wide,Origin=="United States of America")
df6_wide<-select(df5_wide,Year,Origin,Afghanistan,Albania,Algeria,Angola)
df2_long<-gather(df6_wide,Destination,Persons,-Origin,-Year)
## Warning: attributes are not identical across measure variables;
## they will be dropped
df2_long
## Year Origin Destination Persons
## 1 1990 United States of America Afghanistan 28,444
## 2 1995 United States of America Afghanistan 37,312
## 3 2000 United States of America Afghanistan 46,342
## 4 2005 United States of America Afghanistan 51,965
## 5 2010 United States of America Afghanistan 59,330
## 6 2015 United States of America Afghanistan 74,106
## 7 2019 United States of America Afghanistan 80,026
## 8 1990 United States of America Albania 5,627
## 9 1995 United States of America Albania 22,440
## 10 2000 United States of America Albania 39,644
## 11 2005 United States of America Albania 54,825
## 12 2010 United States of America Albania 72,085
## 13 2015 United States of America Albania 91,865
## 14 2019 United States of America Albania 99,205
## 15 1990 United States of America Algeria 4,629
## 16 1995 United States of America Algeria 7,859
## 17 2000 United States of America Algeria 11,158
## 18 2005 United States of America Algeria 12,680
## 19 2010 United States of America Algeria 14,633
## 20 2015 United States of America Algeria 19,633
## 21 2019 United States of America Algeria 21,201
## 22 1990 United States of America Angola 2,252
## 23 1995 United States of America Angola 3,352
## 24 2000 United States of America Angola 4,475
## 25 2005 United States of America Angola 7,294
## 26 2010 United States of America Angola 10,411
## 27 2015 United States of America Angola 14,068
## 28 2019 United States of America Angola 15,192
Chart created to show relationship between United Stated migration to Afghanistan. Chart is showing number of person migrating to a specific country by year. In this case, it is showing the number of persons migrating to Afghanistan from the US has been increasing every 5 yrs from about 28,000 in 1990 to 80,000 in 2019.
library(ggplot2)
df3_long<-subset(df2_long,Destination=="Afghanistan")
ggplot(df3_long, aes(x=df3_long$Year, y=df3_long$Persons, fill=df3_long$Destination)) + geom_bar(stat='identity', position='dodge')
