*Loading required packages.
library(readxl)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
library(readr)
library(forecast)
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
*The data sets are csv file which contains Birth rate and Death rate of different countries from year 2009-2018. To make this data set ready for analysis, I did following steps of pre-processing:
*Imported data from each sheet respectively.
*Tidyed each data set and merged them all together.
*Checked data set structure and each variable’s data type.
*Converted data type if the variable was not suitable for data type.
*Created new variable “difference” represents the difference in birth rate and death rate for a country.
*Checked missing values, special values and obvious errors for all variables and with proper methodology.
*Checked outliers for all numeric variables, and fixed with capping methodology.
*Transformed the “Birth_rate” variable with logarithmic method to decrease the skewness and convert the distribution into a normal distribution.
*The data set is taken from open source Wolrd Bank open Data.
*data set Source:-https://data.worldbank.org/.
*Following are two data sets and their variable description:-
1.Birth rate:-it contains birth rate of different countires in years from 2009-2018.This data set has 14 variables:-
*Series Name:-it contains information that birth rate is taken per 1000 people.
*Series Code:-it contains unique code for series Name.
*Country Name:-it contains name of the Country.
*Country Code:-it contains unique code for Country Name.
*2009 [YR2009]- 2018[YR2018]:-this 10 coloumns contains birth rate from year 2009-2018.
2.Death rate:-it contains death rate of different countires in years from 2009-2018.This data set has 14 variables:-
*Series Name:-it contains information that death rate is taken per 1000 people.
*Series Code:-it contains unique code for series Name.
*Country Name:-it contains name of the Country.
*Country Code:-it contains unique code for Country Name.
*2009 [YR2009]- 2018[YR2018]:-this 10 coloumns contains death rate from year 2009-2018.
*Steps taken in this chunk are:-
1.Setting working directory to ~DATA Wrangling/Assignment-2 where both data sets are saved.
2.Importing Birth rate data set in Birth object.
3.Importing Death rate set in Death object.
4.Showing variables with 6 observations of both the objects with the help of head function.
setwd("C:/Users/61450/Desktop/DATA Wrangling/Assignment-2")
Birth<-read_csv("birth-rate.csv")
## Parsed with column specification:
## cols(
## `Series Name` = col_character(),
## `Series Code` = col_character(),
## `Country Name` = col_character(),
## `Country Code` = col_character(),
## `2009 [YR2009]` = col_character(),
## `2010 [YR2010]` = col_character(),
## `2011 [YR2011]` = col_character(),
## `2012 [YR2012]` = col_character(),
## `2013 [YR2013]` = col_character(),
## `2014 [YR2014]` = col_character(),
## `2015 [YR2015]` = col_character(),
## `2016 [YR2016]` = col_character(),
## `2017 [YR2017]` = col_character(),
## `2018 [YR2018]` = col_character()
## )
Death<-read_csv("death-rate.csv")
## Parsed with column specification:
## cols(
## `Series Name` = col_character(),
## `Series Code` = col_character(),
## `Country Name` = col_character(),
## `Country Code` = col_character(),
## `2009 [YR2009]` = col_character(),
## `2010 [YR2010]` = col_character(),
## `2011 [YR2011]` = col_character(),
## `2012 [YR2012]` = col_character(),
## `2013 [YR2013]` = col_character(),
## `2014 [YR2014]` = col_character(),
## `2015 [YR2015]` = col_character(),
## `2016 [YR2016]` = col_character(),
## `2017 [YR2017]` = col_character(),
## `2018 [YR2018]` = col_character()
## )
head(Birth)
head(Death)
*The data sets Birth and Death are untid as they does not meet the “Tidy data principles”-“Each variable must have its own coloumn”.
*The columns 2009 [YR2009]-2018[YR2018] in both the data sets are actually represeting same variable “year” and values in this coloumns are representing another same variable “birth rate” and “death rate”.
*To tidy the data sets i have gathered all the year variables in one coloumn“year” and their values i.e birth rate and death rate in another coloumns named “Birth_rate” and “death_rate”.
Birth<-Birth%>%gather(`2009 [YR2009]`,`2010 [YR2010]`,`2011 [YR2011]`,`2012 [YR2012]`,`2013 [YR2013]`,`2014 [YR2014]`,`2015 [YR2015]`,`2016 [YR2016]`,`2017 [YR2017]`,`2018 [YR2018]`,key="year",value="Birth_rate")
Death<-Death%>%gather(`2009 [YR2009]`,`2010 [YR2010]`,`2011 [YR2011]`,`2012 [YR2012]`,`2013 [YR2013]`,`2014 [YR2014]`,`2015 [YR2015]`,`2016 [YR2016]`,`2017 [YR2017]`,`2018 [YR2018]`,key="year",value="Death_rate")
##Merging Data sets
*Merging data sets birth and Death by left join and joining by common variables Country Name,Country code and Year.
*Showing merged data set variables using head function.
Birth_death<-Birth%>%left_join(Death,by=c("Country Name","Country Code","year"))
head(Birth_death)
*Checking the structure of data frame using str function and finding incorrect data type.
*Changing Birth_rate and Death_rate variable data type from character to numeric since it contains numeric values instead of character.
*Data types of variables Coutnry Name, Country code,year should be factor instead of character and hence changing their data types from character to factor.
*Showing levels of the converted factor variables to ensure the conversion took properly.
str(Birth_death)
## Classes 'tbl_df', 'tbl' and 'data.frame': 2420 obs. of 9 variables:
## $ Series Name.x: chr "Birth rate, crude (per 1,000 people)" "Birth rate, crude (per 1,000 people)" "Birth rate, crude (per 1,000 people)" "Birth rate, crude (per 1,000 people)" ...
## $ Series Code.x: chr "SP.DYN.CBRT.IN" "SP.DYN.CBRT.IN" "SP.DYN.CBRT.IN" "SP.DYN.CBRT.IN" ...
## $ Country Name : chr "Afghanistan" "Albania" "Algeria" "American Samoa" ...
## $ Country Code : chr "AFG" "ALB" "DZA" "ASM" ...
## $ year : chr "2009 [YR2009]" "2009 [YR2009]" "2009 [YR2009]" "2009 [YR2009]" ...
## $ Birth_rate : chr "40.903" "11.945" "23.932" ".." ...
## $ Series Name.y: chr "Death rate, crude (per 1,000 people)" "Death rate, crude (per 1,000 people)" "Death rate, crude (per 1,000 people)" "Death rate, crude (per 1,000 people)" ...
## $ Series Code.y: chr "SP.DYN.CDRT.IN" "SP.DYN.CDRT.IN" "SP.DYN.CDRT.IN" "SP.DYN.CDRT.IN" ...
## $ Death_rate : chr "8.584" "6.756" "4.643" ".." ...
Birth_death$Birth_rate<-Birth_death$Birth_rate%>%as.numeric()
## Warning in function_list[[k]](value): NAs introduced by coercion
Birth_death$Death_rate<-Birth_death$Death_rate%>%as.numeric()
## Warning in function_list[[k]](value): NAs introduced by coercion
Birth_death$year <- factor(Birth_death$year,ordered = TRUE)
levels(Birth_death$year)
## [1] "2009 [YR2009]" "2010 [YR2010]" "2011 [YR2011]" "2012 [YR2012]"
## [5] "2013 [YR2013]" "2014 [YR2014]" "2015 [YR2015]" "2016 [YR2016]"
## [9] "2017 [YR2017]" "2018 [YR2018]"
Birth_death$`Country Name`<- factor(Birth_death$`Country Name`, ordered = TRUE)
levels(Birth_death$`Country Name`)
## [1] "Afghanistan" "Albania"
## [3] "Algeria" "American Samoa"
## [5] "Andorra" "Angola"
## [7] "Antigua and Barbuda" "Argentina"
## [9] "Armenia" "Aruba"
## [11] "Australia" "Austria"
## [13] "Azerbaijan" "Bahamas, The"
## [15] "Bahrain" "Bangladesh"
## [17] "Barbados" "Belarus"
## [19] "Belgium" "Belize"
## [21] "Benin" "Bermuda"
## [23] "Bhutan" "Bolivia"
## [25] "Bosnia and Herzegovina" "Botswana"
## [27] "Brazil" "British Virgin Islands"
## [29] "Brunei Darussalam" "Bulgaria"
## [31] "Burkina Faso" "Burundi"
## [33] "Cabo Verde" "Cambodia"
## [35] "Cameroon" "Canada"
## [37] "Cayman Islands" "Central African Republic"
## [39] "Chad" "Channel Islands"
## [41] "Chile" "China"
## [43] "Colombia" "Comoros"
## [45] "Congo, Dem. Rep." "Congo, Rep."
## [47] "Costa Rica" "Cote d'Ivoire"
## [49] "Croatia" "Cuba"
## [51] "Curacao" "Cyprus"
## [53] "Czech Republic" "Denmark"
## [55] "Djibouti" "Dominica"
## [57] "Dominican Republic" "Ecuador"
## [59] "Egypt, Arab Rep." "El Salvador"
## [61] "Equatorial Guinea" "Eritrea"
## [63] "Estonia" "Eswatini"
## [65] "Ethiopia" "Faroe Islands"
## [67] "Fiji" "Finland"
## [69] "France" "French Polynesia"
## [71] "Gabon" "Gambia, The"
## [73] "Georgia" "Germany"
## [75] "Ghana" "Gibraltar"
## [77] "Greece" "Greenland"
## [79] "Grenada" "Guam"
## [81] "Guatemala" "Guinea"
## [83] "Guinea-Bissau" "Guyana"
## [85] "Haiti" "Honduras"
## [87] "Hong Kong SAR, China" "Hungary"
## [89] "Iceland" "India"
## [91] "Indonesia" "Iran, Islamic Rep."
## [93] "Iraq" "Ireland"
## [95] "Isle of Man" "Israel"
## [97] "Italy" "Jamaica"
## [99] "Japan" "Jordan"
## [101] "Kazakhstan" "Kenya"
## [103] "Kiribati" "Korea, Dem. People’s Rep."
## [105] "Korea, Rep." "Kosovo"
## [107] "Kuwait" "Kyrgyz Republic"
## [109] "Lao PDR" "Latvia"
## [111] "Lebanon" "Lesotho"
## [113] "Liberia" "Libya"
## [115] "Liechtenstein" "Lithuania"
## [117] "Luxembourg" "Macao SAR, China"
## [119] "Madagascar" "Malawi"
## [121] "Malaysia" "Maldives"
## [123] "Mali" "Malta"
## [125] "Marshall Islands" "Mauritania"
## [127] "Mauritius" "Mexico"
## [129] "Micronesia, Fed. Sts." "Moldova"
## [131] "Monaco" "Mongolia"
## [133] "Montenegro" "Morocco"
## [135] "Mozambique" "Myanmar"
## [137] "Namibia" "Nauru"
## [139] "Nepal" "Netherlands"
## [141] "New Caledonia" "New Zealand"
## [143] "Nicaragua" "Niger"
## [145] "Nigeria" "North Macedonia"
## [147] "Northern Mariana Islands" "Norway"
## [149] "Oman" "Pakistan"
## [151] "Palau" "Panama"
## [153] "Papua New Guinea" "Paraguay"
## [155] "Peru" "Philippines"
## [157] "Poland" "Portugal"
## [159] "Puerto Rico" "Qatar"
## [161] "Romania" "Russian Federation"
## [163] "Rwanda" "Samoa"
## [165] "San Marino" "Sao Tome and Principe"
## [167] "Saudi Arabia" "Senegal"
## [169] "Serbia" "Seychelles"
## [171] "Sierra Leone" "Singapore"
## [173] "Sint Maarten (Dutch part)" "Slovak Republic"
## [175] "Slovenia" "Solomon Islands"
## [177] "Somalia" "South Africa"
## [179] "South Sudan" "Spain"
## [181] "Sri Lanka" "St. Kitts and Nevis"
## [183] "St. Lucia" "St. Martin (French part)"
## [185] "St. Vincent and the Grenadines" "Sudan"
## [187] "Suriname" "Sweden"
## [189] "Switzerland" "Syrian Arab Republic"
## [191] "Tajikistan" "Tanzania"
## [193] "Thailand" "Timor-Leste"
## [195] "Togo" "Tonga"
## [197] "Trinidad and Tobago" "Tunisia"
## [199] "Turkey" "Turkmenistan"
## [201] "Turks and Caicos Islands" "Tuvalu"
## [203] "Uganda" "Ukraine"
## [205] "United Arab Emirates" "United Kingdom"
## [207] "United States" "Uruguay"
## [209] "Uzbekistan" "Vanuatu"
## [211] "Venezuela, RB" "Vietnam"
## [213] "Virgin Islands (U.S.)" "West Bank and Gaza"
## [215] "Yemen, Rep." "Zambia"
## [217] "Zimbabwe"
Birth_death$`Country Code`<- factor(Birth_death$`Country Code`, ordered = TRUE)
levels(Birth_death$`Country Code`)
## [1] "ABW" "AFG" "AGO" "ALB" "AND" "ARE" "ARG" "ARM" "ASM" "ATG" "AUS" "AUT"
## [13] "AZE" "BDI" "BEL" "BEN" "BFA" "BGD" "BGR" "BHR" "BHS" "BIH" "BLR" "BLZ"
## [25] "BMU" "BOL" "BRA" "BRB" "BRN" "BTN" "BWA" "CAF" "CAN" "CHE" "CHI" "CHL"
## [37] "CHN" "CIV" "CMR" "COD" "COG" "COL" "COM" "CPV" "CRI" "CUB" "CUW" "CYM"
## [49] "CYP" "CZE" "DEU" "DJI" "DMA" "DNK" "DOM" "DZA" "ECU" "EGY" "ERI" "ESP"
## [61] "EST" "ETH" "FIN" "FJI" "FRA" "FRO" "FSM" "GAB" "GBR" "GEO" "GHA" "GIB"
## [73] "GIN" "GMB" "GNB" "GNQ" "GRC" "GRD" "GRL" "GTM" "GUM" "GUY" "HKG" "HND"
## [85] "HRV" "HTI" "HUN" "IDN" "IMN" "IND" "IRL" "IRN" "IRQ" "ISL" "ISR" "ITA"
## [97] "JAM" "JOR" "JPN" "KAZ" "KEN" "KGZ" "KHM" "KIR" "KNA" "KOR" "KWT" "LAO"
## [109] "LBN" "LBR" "LBY" "LCA" "LIE" "LKA" "LSO" "LTU" "LUX" "LVA" "MAC" "MAF"
## [121] "MAR" "MCO" "MDA" "MDG" "MDV" "MEX" "MHL" "MKD" "MLI" "MLT" "MMR" "MNE"
## [133] "MNG" "MNP" "MOZ" "MRT" "MUS" "MWI" "MYS" "NAM" "NCL" "NER" "NGA" "NIC"
## [145] "NLD" "NOR" "NPL" "NRU" "NZL" "OMN" "PAK" "PAN" "PER" "PHL" "PLW" "PNG"
## [157] "POL" "PRI" "PRK" "PRT" "PRY" "PSE" "PYF" "QAT" "ROU" "RUS" "RWA" "SAU"
## [169] "SDN" "SEN" "SGP" "SLB" "SLE" "SLV" "SMR" "SOM" "SRB" "SSD" "STP" "SUR"
## [181] "SVK" "SVN" "SWE" "SWZ" "SXM" "SYC" "SYR" "TCA" "TCD" "TGO" "THA" "TJK"
## [193] "TKM" "TLS" "TON" "TTO" "TUN" "TUR" "TUV" "TZA" "UGA" "UKR" "URY" "USA"
## [205] "UZB" "VCT" "VEN" "VGB" "VIR" "VNM" "VUT" "WSM" "XKX" "YEM" "ZAF" "ZMB"
## [217] "ZWE"
*Creating a new variable “difference” which is a difference of birth rate and death rate for a given year and country with the help of mutate function.
*Rearranging sequence of the variables.
Birth_death<-mutate(Birth_death,difference=Birth_rate-Death_rate)
Birth_death%>%select(`Country Name`,`Country Code`,`Series Name.x`,`Series Code.x`,Birth_rate,`Series Name.y`,`Series Code.y`,Death_rate)
*Checking total number of NAs in all the variables of Birth_death data set using colsums function.
*Filterting observations which has series Name.x “Birth rate, crude (per 1,000 people)”.I have used this method as the data set contains values other than Birth rate information which is not required and irrelevant.
*Replacing NAs of Birth_rate and Death_rate with 0.I have used this method as NA in this variable means the information of Birth rate and Death rate is not available for that particular country and year, also Birth rate and Death rate are numeric variables and hence replacing it with 0 number.
*Since difference variable is difference of Birth_rate and Death_rate to eliminate NAs of “difference” variable again storing difference of Birth_rate and Death_rate in difference coloumn.
*To confirm that all NAs of data frame is handled checking NAs of all variables again using colSums function and found that all NAs are handled.
colSums(is.na(Birth_death))
## Series Name.x Series Code.x Country Name Country Code year
## 150 250 250 250 0
## Birth_rate Series Name.y Series Code.y Death_rate difference
## 368 150 250 367 369
Birth_death<-Birth_death%>%filter(`Series Name.x`== "Birth rate, crude (per 1,000 people)")
colSums(is.na(Birth_death))
## Series Name.x Series Code.x Country Name Country Code year
## 0 0 0 0 0
## Birth_rate Series Name.y Series Code.y Death_rate difference
## 118 0 0 117 119
Birth_death$Birth_rate[is.na(Birth_death$Birth_rate)] = 0
Birth_death$Death_rate[is.na(Birth_death$Death_rate)] = 0
Birth_death$difference<-Birth_death$Birth_rate-Birth_death$Death_rate
colSums(is.na(Birth_death))
## Series Name.x Series Code.x Country Name Country Code year
## 0 0 0 0 0
## Birth_rate Series Name.y Series Code.y Death_rate difference
## 0 0 0 0 0
*Using boxplot visualisation to detect outliers of all numeric variable i.e,Birth_rate,Death_rate and difference.
*It is found that Death_rate has outliers present.
*Using capping method to deal with outliers of Death_rate variable, the reason behind choosing capping method is:- Every value of each variable is unique, if using MVN to detect and delete the outlier would influence the further analysis.
*Checking the results by comparing descriptive statistics summary.
boxplot(Birth_death$Birth_rate~Birth_death$year, main = "Birth rate by year", ylab = "Birth rate", xlab = "Year")
boxplot(Birth_death$Death_rate~Birth_death$year, main = "Death rate by year", ylab = "Death rate", xlab = "Year")
boxplot(Birth_death$difference~Birth_death$year, main = "difference in rate by year", ylab = "Difference in rate", xlab = "Year")
cap <- function(x){
quantiles <- quantile( x, c(.05, 0.25, 0.75, .95 ) )
x[ x < quantiles[2] - 1.5*IQR(x) ] <- quantiles[1]
x[ x > quantiles[3] + 1.5*IQR(x) ] <- quantiles[4]
x
}
Birth_death$Death_rate<-Birth_death$Death_rate%>%cap
summary(Birth_death)
## Series Name.x Series Code.x Country Name Country Code
## Length:2170 Length:2170 Afghanistan : 10 ABW : 10
## Class :character Class :character Albania : 10 AFG : 10
## Mode :character Mode :character Algeria : 10 AGO : 10
## American Samoa: 10 ALB : 10
## Andorra : 10 AND : 10
## Angola : 10 ARE : 10
## (Other) :2110 (Other):2110
## year Birth_rate Series Name.y Series Code.y
## 2009 [YR2009]:217 Min. : 0.00 Length:2170 Length:2170
## 2010 [YR2010]:217 1st Qu.:11.16 Class :character Class :character
## 2011 [YR2011]:217 Median :17.40 Mode :character Mode :character
## 2012 [YR2012]:217 Mean :19.84
## 2013 [YR2013]:217 3rd Qu.:27.92
## 2014 [YR2014]:217 Max. :50.22
## (Other) :868
## Death_rate difference
## Min. : 0.000 Min. :-6.500
## 1st Qu.: 5.492 1st Qu.: 2.889
## Median : 7.200 Median :11.601
## Mean : 7.318 Mean :12.483
## 3rd Qu.: 9.265 3rd Qu.:20.919
## Max. :14.900 Max. :38.570
##
*I have selected Birth_rate numeric variable to transform.
*Plotting histogram of Birth_rate and found that the distribution is right skewed.
*Trying to reduce right skewness with logarithms, reciprocals, roots and Box-Cox methods.
*After comparing, logarithems method would be taken for the proper method.
*Applying logarithems method to the variable of data frame.
hist(Birth_death$Birth_rate,breaks=20)
log_Birth_death <- log10(Birth_death$Birth_rate)
hist(log_Birth_death, breaks = 20)
Birth_death_recip <- 1/Birth_death$Birth_rate
hist(Birth_death_recip, breaks = 20)
sqrt_Birth_death <- sqrt(Birth_death$Birth_rate)
hist(sqrt_Birth_death, breaks = 20)
boxcox_Birth_death <- BoxCox(Birth_death$Birth_rate, lambda = "auto")
hist(boxcox_Birth_death, breaks = 20)
Birth_death$Birth_rate <- log10(Birth_death$Birth_rate)
##References
*Death rate data set:-https://databank.worldbank.org/source/world-development-indicators#
*CANVAS-> DATA WRANGLING-> Modules https://rmit.instructure.com/courses/67186/modules.