——————————————————————————–

DATA 607 Project 2.

The goal of this assignment is to give you practice in preparing different datasets for downstream analysis work. Your task is to: (1) Choose any three of the “wide” datasets identified in the Week 5 Discussion items. (You may use your own dataset; please don’t use my Sample Post dataset, since that was used in your Week 6 assignment!) For each of the three chosen datasets: ??? Create a .CSV file (or optionally, a MySQL database!) that includes all of the information included in the dataset. You’re encouraged to use a “wide” structure similar to how the information appears in the discussion item, so that you can practice tidying and transformations as described below. ??? Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data. [Most of your grade will be based on this step!] ??? Perform the analysis requested in the discussion item. ??? Your code should be in an R Markdown file, posted to rpubs.com, and should include narrative descriptions of your data cleanup work, analysis, and conclusions. (2) Please include in your homework submission, for each of the three chosen datasets: ??? The URL to the .Rmd file in your GitHub repository, and ??? The URL for your rpubs.com web page.

set working directory and Install all the relevant packages and load their respective libraries into R.

Migrants by destimation country

Load the following libraries

library(stringr)

library(tidyr)

library(dplyr)

library(tidyverse)

library(tibble)

library(caret)

library(readr)

Upload the data into Github

This will ensure that everyone with access to the github repository can easily audit or retest the data. This ensures ease of accessibility and testing by a wide audience. Follow this link to see uploaded Male migrants .csv file (https://raw.githubusercontent.com/igukusamuel/DATA-607-Project-2/master/UN_MigrantStockBySexByDestination_2019.csv)

migrants <- read_csv("https://raw.githubusercontent.com/igukusamuel/DATA-607-Project-2/master/UN_MigrantStockBySexByDestination_2019.csv")
head(migrants)

#view(head(male_migrants, 20)) # vIew data frame structure and see how many rows to skip.

Skip first 15 rows

As part of data cleanup, skip the first 15 rows that include source information not relevant to out analysis.

migrants <- read_csv("https://raw.githubusercontent.com/igukusamuel/DATA-607-Project-2/master/UN_MigrantStockBySexByDestination_2019.csv", skip = 15)

head(migrants) #Print out first few rows to confirm that the data have been loaded correctly.

Filter for N/As in column X5

Careful review of the data shows that column named X5 only includes data for rows related to countries and N/A’s for rows relating to regions and regional totals. Thus filtering out all N/As in column X5 will leave us with country data only, which is the basis of out analysis. We first view all the N/As under column X5 to confirm none of them relate to country information.

colX5 <- filter(migrants, is.na(X5))

x <- length(colX5)
x

## [1] 26

head(colX5)

Exclude N/As in column X5

We then exclude all N/A’s in column X6 and print out the first 6 rows using the head() function.

migrants_by_country <- filter(migrants, !is.na(X5))

head(migrants_by_country)

Rename column X2

From the above print out, there is need to rename column X2 dest_country.

migrants_by_country <- migrants_by_country %>% 
        rename(
                dest_country = X2
        )
head(migrants_by_country)

View all columns

The above printout shows a number of irrelevant columns that are not necessary for our analysis. Lets print out the entire column names and delete the unnecessary ones to have a cleaner data set.

column_names <- colnames(migrants_by_country)
#column_names # umcomment to view entire list of column names
head(column_names)

## [1] "X1"           "dest_country" "X3"           "X4"          
## [5] "X5"           "1990"

Exclude irrelevant columns

The above print out reveals that we do not need all column names that start with “X”. We delete these columns using the srtarts_with function.

migrants_by_country <- migrants_by_country %>% 
        select(-starts_with("X"))


migrants_by_country <- migrants_by_country %>% 
        select(-c(2:8))

migrants_by_country

View dimentions of resulting data frame

We use dim() function to have an idea of how many rows and columns we have for our analysis.

dim(migrants_by_country)

## [1] 232  15

Confrim column names.

This is what we need for our analysis.

column_names_clean <- colnames(migrants_by_country)
#column_names_clean # uncomment to view entire list of cleaned up column names
head(column_names_clean)

## [1] "dest_country" "1990_1"       "1995_1"       "2000_1"      
## [5] "2005_1"       "2010_1"

View number of columns

Get the length of the column names to be used in the next line of code.

y <- length(colnames(migrants_by_country))

y

## [1] 15

clean up data

Let us use gather() function to gather all columns with years into a single columns and exclude any and all N/As to obtain clean data. Spread the resulting data by year column and rename “1” as male and “2” as female.

no_of_migrants_per_country <- mutate(gather(migrants_by_country, "year", "no_of_migrants", 2:y, na.rm = TRUE))

head(no_of_migrants_per_country)

no_of_migrants_per_country <- no_of_migrants_per_country %>%
        separate(year, c("year", "sex"), sep = "_")

no_of_migrants_per_country

Convert the years column to number format

no_of_migrants_per_country$year <- parse_number(no_of_migrants_per_country$year)

no_of_migrants_per_country

no_of_migrants_per_country <- no_of_migrants_per_country %>%
        spread(sex, no_of_migrants)
        

names(no_of_migrants_per_country)

## [1] "dest_country" "year"         "1"            "2"

no_of_migrants_per_country <- no_of_migrants_per_country %>% 
        rename(
                male = "1",
                female = "2"
        )
head(no_of_migrants_per_country)

Conversion of chr to dbl

convert the no_of_migrants data column from characters to doubles for statistical analysis. This we will do using the parse_number() function. Print out using head() function the first 6 rows and confirm this conversion.

no_of_migrants_per_country$male <- parse_number(no_of_migrants_per_country$male)
no_of_migrants_per_country$female <- parse_number(no_of_migrants_per_country$female)

clean_migrants_data <- no_of_migrants_per_country

head(clean_migrants_data)

Down stream analysis

Ordering of data

Ordering data by country with largest inflow of migrants

by_country <- clean_migrants_data %>% 
        group_by(year, dest_country) %>% 
        summarise(total_migrants = male + female) %>% 
        arrange(desc(total_migrants))
head(by_country)

Ordering the data of male migrants by the destination countries by year

male_by_country <- clean_migrants_data %>% 
        group_by(dest_country, year) %>% 
        summarise(male = male) %>% 
        arrange(desc(male))
head(male_by_country)

Ordering the data of female migrants by the destination countries by year

female_by_country <- clean_migrants_data %>% 
        group_by(dest_country, year) %>% 
        summarise(female = female) %>% 
        arrange(desc(female))
head(female_by_country)

Ordering the data by % of male migrants by the destination countries by year

Perc_male_by_country <- clean_migrants_data %>% 
        group_by(dest_country, year) %>% 
        summarise(perc_male = male/(male + female)) %>% 
        arrange(desc(perc_male))
head(Perc_male_by_country)

Ordering the data by % female migrants by the destination countries by year

Perc_female_by_country <- clean_migrants_data %>% 
        group_by(dest_country, year) %>% 
        summarise(perc_female = female/(male + female)) %>% 
        arrange(desc(perc_female))
head(Perc_female_by_country)

Conclusion

Maldives received the highest % of male migrants while nepal received the highest % of female migrants

DATA 607 Project 2

Samuel I Kigamba

October 06, 2019