DATA 607 Project 2.
The goal of this assignment is to give you practice in preparing different datasets for downstream analysis work. Your task is to: (1) Choose any three of the “wide” datasets identified in the Week 5 Discussion items. (You may use your own dataset; please don’t use my Sample Post dataset, since that was used in your Week 6 assignment!) For each of the three chosen datasets: ??? Create a .CSV file (or optionally, a MySQL database!) that includes all of the information included in the dataset. You’re encouraged to use a “wide” structure similar to how the information appears in the discussion item, so that you can practice tidying and transformations as described below. ??? Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data. [Most of your grade will be based on this step!] ??? Perform the analysis requested in the discussion item. ??? Your code should be in an R Markdown file, posted to rpubs.com, and should include narrative descriptions of your data cleanup work, analysis, and conclusions. (2) Please include in your homework submission, for each of the three chosen datasets: ??? The URL to the .Rmd file in your GitHub repository, and ??? The URL for your rpubs.com web page.
set working directory and Install all the relevant packages and load their respective libraries into R.
This will ensure that everyone with access to the github repository can easily audit or retest the data. This ensures ease of accessibility and testing by a wide audience. Follow this link to see uploaded Male migrants .csv file (https://raw.githubusercontent.com/igukusamuel/DATA-607-Project-2/master/UN_MigrantStockBySexByDestination_2019.csv)
migrants <- read_csv("https://raw.githubusercontent.com/igukusamuel/DATA-607-Project-2/master/UN_MigrantStockBySexByDestination_2019.csv")
head(migrants)#view(head(male_migrants, 20)) # vIew data frame structure and see how many rows to skip.As part of data cleanup, skip the first 15 rows that include source information not relevant to out analysis.
migrants <- read_csv("https://raw.githubusercontent.com/igukusamuel/DATA-607-Project-2/master/UN_MigrantStockBySexByDestination_2019.csv", skip = 15)
head(migrants) #Print out first few rows to confirm that the data have been loaded correctly.Careful review of the data shows that column named X5 only includes data for rows related to countries and N/A’s for rows relating to regions and regional totals. Thus filtering out all N/As in column X5 will leave us with country data only, which is the basis of out analysis. We first view all the N/As under column X5 to confirm none of them relate to country information.
colX5 <- filter(migrants, is.na(X5))
x <- length(colX5)
x## [1] 26
head(colX5)We then exclude all N/A’s in column X6 and print out the first 6 rows using the head() function.
migrants_by_country <- filter(migrants, !is.na(X5))
head(migrants_by_country)From the above print out, there is need to rename column X2 dest_country.
migrants_by_country <- migrants_by_country %>%
rename(
dest_country = X2
)
head(migrants_by_country)The above printout shows a number of irrelevant columns that are not necessary for our analysis. Lets print out the entire column names and delete the unnecessary ones to have a cleaner data set.
column_names <- colnames(migrants_by_country)
#column_names # umcomment to view entire list of column names
head(column_names)## [1] "X1" "dest_country" "X3" "X4"
## [5] "X5" "1990"
The above print out reveals that we do not need all column names that start with “X”. We delete these columns using the srtarts_with function.
migrants_by_country <- migrants_by_country %>%
select(-starts_with("X"))
migrants_by_country <- migrants_by_country %>%
select(-c(2:8))
migrants_by_countryWe use dim() function to have an idea of how many rows and columns we have for our analysis.
dim(migrants_by_country)## [1] 232 15
This is what we need for our analysis.
column_names_clean <- colnames(migrants_by_country)
#column_names_clean # uncomment to view entire list of cleaned up column names
head(column_names_clean)## [1] "dest_country" "1990_1" "1995_1" "2000_1"
## [5] "2005_1" "2010_1"
Get the length of the column names to be used in the next line of code.
y <- length(colnames(migrants_by_country))
y ## [1] 15
Let us use gather() function to gather all columns with years into a single columns and exclude any and all N/As to obtain clean data. Spread the resulting data by year column and rename “1” as male and “2” as female.
no_of_migrants_per_country <- mutate(gather(migrants_by_country, "year", "no_of_migrants", 2:y, na.rm = TRUE))
head(no_of_migrants_per_country)no_of_migrants_per_country <- no_of_migrants_per_country %>%
separate(year, c("year", "sex"), sep = "_")
no_of_migrants_per_countryConvert the years column to number format
no_of_migrants_per_country$year <- parse_number(no_of_migrants_per_country$year)
no_of_migrants_per_countryno_of_migrants_per_country <- no_of_migrants_per_country %>%
spread(sex, no_of_migrants)
names(no_of_migrants_per_country)## [1] "dest_country" "year" "1" "2"
no_of_migrants_per_country <- no_of_migrants_per_country %>%
rename(
male = "1",
female = "2"
)
head(no_of_migrants_per_country)convert the no_of_migrants data column from characters to doubles for statistical analysis. This we will do using the parse_number() function. Print out using head() function the first 6 rows and confirm this conversion.
no_of_migrants_per_country$male <- parse_number(no_of_migrants_per_country$male)
no_of_migrants_per_country$female <- parse_number(no_of_migrants_per_country$female)
clean_migrants_data <- no_of_migrants_per_country
head(clean_migrants_data)Ordering data by country with largest inflow of migrants
by_country <- clean_migrants_data %>%
group_by(year, dest_country) %>%
summarise(total_migrants = male + female) %>%
arrange(desc(total_migrants))
head(by_country)Ordering the data of male migrants by the destination countries by year
male_by_country <- clean_migrants_data %>%
group_by(dest_country, year) %>%
summarise(male = male) %>%
arrange(desc(male))
head(male_by_country)Ordering the data of female migrants by the destination countries by year
female_by_country <- clean_migrants_data %>%
group_by(dest_country, year) %>%
summarise(female = female) %>%
arrange(desc(female))
head(female_by_country)Ordering the data by % of male migrants by the destination countries by year
Perc_male_by_country <- clean_migrants_data %>%
group_by(dest_country, year) %>%
summarise(perc_male = male/(male + female)) %>%
arrange(desc(perc_male))
head(Perc_male_by_country)Ordering the data by % female migrants by the destination countries by year
Perc_female_by_country <- clean_migrants_data %>%
group_by(dest_country, year) %>%
summarise(perc_female = female/(male + female)) %>%
arrange(desc(perc_female))
head(Perc_female_by_country)Maldives received the highest % of male migrants while nepal received the highest % of female migrants