DATA607_Project2

The goal of this assignment is to give you practice in preparing different datasets for downstream analysis work. Your task is to: (1) Choose any three of the “wide” datasets identified in the Week 6 Discussion items. (You may use your own dataset; please don’t use my Sample Post dataset, since that was used in your Week 6 assignment!) For each of the three chosen datasets:  Create a .CSV file (or optionally, a MySQL database!) that includes all of the information included in the dataset. You’re encouraged to use a “wide” structure similar to how the information appears in the discussion item, so that you can practice tidying and transformations as described below.  Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data. [Most of your grade will be based on this step!]  Perform the analysis requested in the discussion item.  Your code should be in an R Markdown file, posted to rpubs.com, and should include narrative descriptions of your data cleanup work, analysis, and conclusions. (2) Please include in your homework submission, for each of the three chosen datasets:  The URL to the .Rmd file in your GitHub repository, and  The URL for your rpubs.com web page.

library( dplyr )

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library( tidyr )
getURL <- "https://raw.githubusercontent.com/IvanGrozny88/DATA607_Project2/main/Age5-14_deaths_2018.csv"

Children_death_estimates_df <- read.csv(getURL, header = TRUE, sep = ",", skip = 12)

head(Children_death_estimates_df)

##               Region.Name Uncertainty.bounds. X1990.5 X2000.5 X2010.5 X2015.5
## 1      Sub-Saharan Africa               Lower 575,856 582,903 519,030 480,283
## 2      Sub-Saharan Africa              Median 604,397 599,931 543,210 516,208
## 3      Sub-Saharan Africa               Upper 651,966 623,935 579,074 575,013
## 4 West and Central Africa               Lower 262,531 291,554 291,672 283,610
## 5 West and Central Africa              Median 285,507 305,245 314,220 317,710
## 6 West and Central Africa               Upper 323,545 322,440 344,649 370,029
##   X2017.5  X X.1 X.2 X.3 X.4 X.5 X.6 X.7 X.8 X.9 X.10 X.11 X.12 X.13 X.14 X.15
## 1 469,876 NA  NA  NA  NA  NA  NA  NA  NA  NA  NA   NA   NA   NA   NA   NA   NA
## 2 508,174 NA  NA  NA  NA  NA  NA  NA  NA  NA  NA   NA   NA   NA   NA   NA   NA
## 3 578,156 NA  NA  NA  NA  NA  NA  NA  NA  NA  NA   NA   NA   NA   NA   NA   NA
## 4 279,679 NA  NA  NA  NA  NA  NA  NA  NA  NA  NA   NA   NA   NA   NA   NA   NA
## 5 317,039 NA  NA  NA  NA  NA  NA  NA  NA  NA  NA   NA   NA   NA   NA   NA   NA
## 6 378,544 NA  NA  NA  NA  NA  NA  NA  NA  NA  NA   NA   NA   NA   NA   NA   NA
##   X.16 X.17 X.18 X.19 X.20 X.21 X.22 X.23 X.24 X.25 X.26 X.27 X.28 X.29 X.30
## 1   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA
## 2   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA
## 3   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA
## 4   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA
## 5   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA
## 6   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA
##   X.31 X.32 X.33 X.34 X.35 X.36 X.37
## 1   NA   NA   NA   NA   NA   NA   NA
## 2   NA   NA   NA   NA   NA   NA   NA
## 3   NA   NA   NA   NA   NA   NA   NA
## 4   NA   NA   NA   NA   NA   NA   NA
## 5   NA   NA   NA   NA   NA   NA   NA
## 6   NA   NA   NA   NA   NA   NA   NA

Now we are melting the data set, that means all the columns which are data are being converted to separate columns.

Children_death_estimates_df_molten <- Children_death_estimates_df %>% gather("year", "Number.of.deaths", 3:43)
head(Children_death_estimates_df_molten)

##               Region.Name Uncertainty.bounds. X.36 X.37    year
## 1      Sub-Saharan Africa               Lower   NA   NA X1990.5
## 2      Sub-Saharan Africa              Median   NA   NA X1990.5
## 3      Sub-Saharan Africa               Upper   NA   NA X1990.5
## 4 West and Central Africa               Lower   NA   NA X1990.5
## 5 West and Central Africa              Median   NA   NA X1990.5
## 6 West and Central Africa               Upper   NA   NA X1990.5
##   Number.of.deaths
## 1          575,856
## 2          604,397
## 3          651,966
## 4          262,531
## 5          285,507
## 6          323,545

Children_death_estimates_df_molten$year <- substr(Children_death_estimates_df_molten$year, 2, 5)
Children_death_estimates_df_molten$Number.of.deaths <- gsub(",", "", Children_death_estimates_df_molten$Number.of.deaths)

head(Children_death_estimates_df_molten)

##               Region.Name Uncertainty.bounds. X.36 X.37 year Number.of.deaths
## 1      Sub-Saharan Africa               Lower   NA   NA 1990           575856
## 2      Sub-Saharan Africa              Median   NA   NA 1990           604397
## 3      Sub-Saharan Africa               Upper   NA   NA 1990           651966
## 4 West and Central Africa               Lower   NA   NA 1990           262531
## 5 West and Central Africa              Median   NA   NA 1990           285507
## 6 West and Central Africa               Upper   NA   NA 1990           323545

Children_death_estimates_df_molten_median <- Children_death_estimates_df_molten %>% filter(Children_death_estimates_df_molten$Uncertainty.bounds. == "Median", Children_death_estimates_df_molten$Region.Name != "World")

head(Children_death_estimates_df_molten_median)

##                    Region.Name Uncertainty.bounds. X.36 X.37 year
## 1           Sub-Saharan Africa              Median   NA   NA 1990
## 2      West and Central Africa              Median   NA   NA 1990
## 3  Eastern and Southern Africa              Median   NA   NA 1990
## 4 Middle East and North Africa              Median   NA   NA 1990
## 5                   South Asia              Median   NA   NA 1990
## 6        East Asia and Pacific              Median   NA   NA 1990
##   Number.of.deaths
## 1           604397
## 2           285507
## 3           318890
## 4            66409
## 5           584906
## 6           319190

Children_death_estimates_df_molten_median <- Children_death_estimates_df_molten_median %>% arrange(Region.Name, year)

DATASET-2 The secod data set is for the country wise data for Balance of payment as a percent of that country’s GDP Step-1 - To read the data set from Github location into R https://raw.githubusercontent.com/IvanGrozny88/DATA607_Project2/main/UN_MigrantStockByOriginAndDestination_2019.csv Original file was excel file which was saved as a csv file. The data from the csv file will be transformed from a wide set to a long set along with other cleanup where neccessary in order to do further analysis on the data set. Reference to website that linked the dataset and explained methods to tidy the dataset. The methods contained in the website was mostly used to understand the data, the only one directly used was selecting the columns with the select function. https://www.un.org/en/development/desa/population/migration/data/estimates2/data/UN_MigrantStockByOriginAndDestination_2019.xlsx

df_wide<- read.csv("C:/Users/Ivan/OneDrive/Desktop/UN_MigrantStockByOriginAndDestination_2019.csv", header=FALSE, sep=",")
df2_wide<-df_wide[-c(1:15),]
head(df2_wide[1:10])

##      V1      V2                        V3 V4  V5 V6          V7          V8
## 16                                                        Total Other South
## 17 1990 1990001                     WORLD    900    153,011,473   6,548,526
## 18 1990 1990002     UN development groups                    ..          ..
## 19 1990 1990003    More developed regions  b 901     82,767,216   3,385,103
## 20 1990 1990004    Less developed regions  c 902     70,244,257   3,163,423
## 21 1990 1990005 Least developed countries  d 941     11,060,221     482,753
##             V9         V10
## 16 Other North Afghanistan
## 17   2,366,800   6,823,350
## 18          ..          ..
## 19   1,077,179     119,386
## 20   1,289,621   6,703,964
## 21     239,756           0

There are over 200 variable country columns. Destination countries will be picked to turn into rows. Subset will be applied on Origin Country, with the origin country being United States of America.

library(tidyr)
df3_wide<-select(df2_wide,-V2,-V4,-V5,-V7)
df4_wide<-rename(df3_wide,Year=V1,Origin=V3,group=V6,Afghanistan=V10,Albania=V11,Algeria=V12,Angola=V15)
df5_wide<- subset(df4_wide,Origin=="United States of America")
df6_wide<-select(df5_wide,Year,Origin,Afghanistan,Albania,Algeria,Angola)
df2_long<-gather(df6_wide,Destination,Persons,-Origin,-Year)
df2_long

##    Year                   Origin Destination Persons
## 1  1990 United States of America Afghanistan  28,444
## 2  1995 United States of America Afghanistan  37,312
## 3  2000 United States of America Afghanistan  46,342
## 4  2005 United States of America Afghanistan  51,965
## 5  2010 United States of America Afghanistan  59,330
## 6  2015 United States of America Afghanistan  74,106
## 7  2019 United States of America Afghanistan  80,026
## 8  1990 United States of America     Albania   5,627
## 9  1995 United States of America     Albania  22,440
## 10 2000 United States of America     Albania  39,644
## 11 2005 United States of America     Albania  54,825
## 12 2010 United States of America     Albania  72,085
## 13 2015 United States of America     Albania  91,865
## 14 2019 United States of America     Albania  99,205
## 15 1990 United States of America     Algeria   4,629
## 16 1995 United States of America     Algeria   7,859
## 17 2000 United States of America     Algeria  11,158
## 18 2005 United States of America     Algeria  12,680
## 19 2010 United States of America     Algeria  14,633
## 20 2015 United States of America     Algeria  19,633
## 21 2019 United States of America     Algeria  21,201
## 22 1990 United States of America      Angola   2,252
## 23 1995 United States of America      Angola   3,352
## 24 2000 United States of America      Angola   4,475
## 25 2005 United States of America      Angola   7,294
## 26 2010 United States of America      Angola  10,411
## 27 2015 United States of America      Angola  14,068
## 28 2019 United States of America      Angola  15,192

Chart created to show relationship between United Stated migration to Afghanistan. Chart is showing number of person migrating to a specific country by year. In this case, it is showing the number of persons migrating to Afghanistan from the US has been increasing every 5 yrs from about 28,000 in 1990 to 80,000 in 2019.

library(ggplot2)
df3_long<-subset(df2_long,Destination=="Afghanistan")
ggplot(df3_long, aes(x=df3_long$Year, y=df3_long$Persons, fill=df3_long$Destination)) + geom_bar(stat='identity', position='dodge')

## Warning: Use of `df3_long$Year` is discouraged. Use `Year` instead.

## Warning: Use of `df3_long$Persons` is discouraged. Use `Persons` instead.

## Warning: Use of `df3_long$Destination` is discouraged. Use `Destination`
## instead.

DATA607_Project2

IvanTikhonov

2022-10-08