library(readr)
library(dplyr)
library(tidyr)
library(Hmisc)
In this section, I described the datasets and their variables.
The first dataset contained birth and death rates for each country, between the years of 1960 and 2019.
It had 63 variables:
Country Code
Indicator Name
61 more variables assigned to the years 1960 to 2019
The second dataset contained information about location and income for each country.
It had 4 variables:
Country Code
Long Name
Region
Income Group
In this step, I explored the structures and attributes of the datasets and their variables’ datatypes.
Only the second dataset needed datatype conversions:
Region variable was converted from ‘character’ to ‘factor’
Income Group variable was converted from ‘character’ to ‘ordered factor’
In this step, I tidied the two datasets.
Spread and gather functions were used to tidy the first dataset.
The second dataset also had issues. Observations were a mixture of countries and continents, so I filtered the data to contain only observations for individual countries.
In this step, I created a new variable in the first dataset.
I mutated the birth and death rate columns to create a growth rate variable.
In this step, I merged the two datasets together.
I used a left-join, joining by the Country Code variable.
In this step, I checked for and dealt with missing and special values, as well as obvious errors.
3 numeric variables had missing values. The mean was imputed for each.
There were no special values or obvious errors.
In this step, I scanned the numeric variables for outliers using boxplots.
I used capping (AKA Winsorising) to deal with the outliers.
In this last step, I added a new variable by transforming the growth rate variable into z-scores.
Both datasets used for this assignment are taken from the World Databank here:
https://datacatalog.worldbank.org/dataset/health-nutrition-and-population-statistics
The two datasets were downloaded in the same .zip file provided on the page above.
The names of the two datasets are:
dataset1 is a subset from the original dataset called HNP_StatsData.
Below, I’ve provided the code used to load and subset the original data.
dataset1 <- read_csv("HNP_StatsData.csv")
cols1 <- "Death rate, crude (per 1,000 people)"
cols2 <- "Birth rate, crude (per 1,000 people)"
dataset1 <- dataset1 %>% filter(`Indicator Name` == c(cols1, cols2)) %>%
select(-`Indicator Code`, -`Country Name`)%>%
arrange(`Country Code`)
dataset1 contains birth and death rates for each country, between the years of 1960 and 2019.
Below, I’ve provided the head of dataset1.
head(dataset1)
There are 63 total variables in dataset1.
Country Code contains 3-letter abreviated codes used to easily refer to each country.
Indicator Name is a variable which contains the type of indicator being measured (Birth or Death rate). These are measured by taking the percentage of people affected per 1,000 people.
The remaining 61 variables are assigned to the years 1960 to 2019. These contain each country’s measured values for each indicator.
dataset2 is a subset from the original dataset called HNP_StatsCountry.
Below, I’ve provided the code used to load and subset the original data.
dataset2 <- read_csv("HNP_StatsCountry.csv")
dataset2 <- dataset2 %>%
select(`Country Code`, `Long Name`, `Region`,`Income Group`) %>%
arrange(`Country Code`)
dataset2 contains information about location and income for each country.
Below, I’ve provided the head of dataset2.
head(dataset2)
dataset2 contains 4 total variables.
Country Code contains 3-letter abreviated codes used to easily refer to each country.
Long Name contains the full names of each country.
Region contains the continental location of each country.
Income Group contains information about the size of each country’s income.
From the structure and attributes of dataset1, we can gather that:
dataset1 is a dataframe.
The Country Code variable is ‘character’ type, which I am happy with.
The Indicator Name variable is ‘character’ type, which I am happy with. This variable will be spread later in the project, so I won’t worry too much about it’s type now.
The 61 variables for the years between 1960 and 2019 are all ‘numeric’ types, which I’m happy with.
dataset1 did not need any datatype conversions.
The structure and attributes of dataset1 are displayed below.
str(dataset1)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 258 obs. of 63 variables:
## $ Country Code : chr "ABW" "ABW" "AGO" "AGO" ...
## $ Indicator Name: chr "Birth rate, crude (per 1,000 people)" "Death rate, crude (per 1,000 people)" "Birth rate, crude (per 1,000 people)" "Death rate, crude (per 1,000 people)" ...
## $ 1960 : num 35.68 6.39 49.08 27.1 40.92 ...
## $ 1961 : num 34.53 6.24 48.78 26.86 40.37 ...
## $ 1962 : num 33.32 6.12 48.55 26.63 39.63 ...
## $ 1963 : num 32.05 6.01 48.43 26.41 38.72 ...
## $ 1964 : num 30.74 5.92 48.45 26.19 37.7 ...
## $ 1965 : num 29.41 5.84 48.62 25.97 36.6 ...
## $ 1966 : num 28.12 5.77 48.94 25.69 35.5 ...
## $ 1967 : num 26.91 5.72 49.34 25.34 34.44 ...
## $ 1968 : num 25.82 5.68 49.79 24.92 33.46 ...
## $ 1969 : num 24.87 5.67 50.23 24.42 32.59 ...
## $ 1970 : num 24.1 5.67 50.62 23.87 31.84 ...
## $ 1971 : num 23.5 5.7 50.9 23.3 31.2 ...
## $ 1972 : num 23.07 5.75 51.06 22.78 30.59 ...
## $ 1973 : num 22.76 5.81 51.09 22.3 30.02 ...
## $ 1974 : num 22.56 5.89 51.01 21.89 29.47 ...
## $ 1975 : num 22.45 5.98 50.83 21.55 28.95 ...
## $ 1976 : num 22.41 6.07 50.6 21.28 28.45 ...
## $ 1977 : num 22.42 6.16 50.39 21.05 28 ...
## $ 1978 : num 22.45 6.24 50.23 20.84 27.61 ...
## $ 1979 : num 22.48 6.31 50.14 20.65 27.26 ...
## $ 1980 : num 22.47 6.38 50.13 20.47 26.98 ...
## $ 1981 : num 22.42 6.44 50.21 20.3 26.77 ...
## $ 1982 : num 22.33 6.52 50.32 20.14 26.63 ...
## $ 1983 : num 22.2 6.6 50.4 20 26.5 ...
## $ 1984 : num 21.99 6.69 50.57 19.89 26.45 ...
## $ 1985 : num 21.73 6.79 50.66 19.79 26.37 ...
## $ 1986 : num 21.4 6.87 50.71 19.71 26.24 ...
## $ 1987 : num 21.01 6.95 50.71 19.65 26.05 ...
## $ 1988 : num 20.6 7 50.7 19.6 25.8 ...
## $ 1989 : num 20.09 7.04 50.55 19.58 25.37 ...
## $ 1990 : num 19.57 7.06 50.38 19.55 24.87 ...
## $ 1991 : num 19.02 7.05 50.17 19.53 24.25 ...
## $ 1992 : num 18.45 7.03 49.92 19.51 23.53 ...
## $ 1993 : num 17.9 7 49.7 19.5 22.7 ...
## $ 1994 : num 17.27 6.97 49.38 19.4 21.9 ...
## $ 1995 : num 16.69 6.94 49.11 19.29 21.02 ...
## $ 1996 : num 16.13 6.92 48.87 19.15 20.11 ...
## $ 1997 : num 15.6 6.91 48.65 18.95 19.17 ...
## $ 1998 : num 15.09 6.92 48.46 18.71 18.24 ...
## $ 1999 : num 14.62 6.93 48.29 18.4 17.32 ...
## $ 2000 : num 14.17 6.97 48.15 18.04 16.44 ...
## $ 2001 : num 13.76 7.02 48.03 17.6 15.59 ...
## $ 2002 : num 13.38 7.08 47.91 17.09 14.79 ...
## $ 2003 : num 13.01 7.15 47.79 16.52 14.05 ...
## $ 2004 : num 12.67 7.23 47.64 15.9 13.38 ...
## $ 2005 : num 12.35 7.32 47.45 15.24 12.82 ...
## $ 2006 : num 12.05 7.42 47.22 14.54 12.4 ...
## $ 2007 : num 11.79 7.53 46.92 13.81 12.12 ...
## $ 2008 : num 11.56 7.65 46.56 13.09 11.97 ...
## $ 2009 : num 11.36 7.78 46.14 12.37 11.95 ...
## $ 2010 : num 11.21 7.92 45.66 11.68 12 ...
## $ 2011 : num 11.12 8.06 45.1 11.04 12.1 ...
## $ 2012 : num 11.09 8.21 44.49 10.45 12.2 ...
## $ 2013 : num 11.11 8.35 43.85 9.92 12.26 ...
## $ 2014 : num 11.18 8.49 43.18 9.45 12.26 ...
## $ 2015 : num 11.28 8.63 42.52 9.05 12.2 ...
## $ 2016 : num 11.4 8.77 41.88 8.72 12.08 ...
## $ 2017 : num 11.53 8.91 41.28 8.43 11.93 ...
## $ 2018 : num 11.65 9.05 40.73 8.19 11.78 ...
## $ 2019 : num NA NA NA NA NA NA NA NA NA NA ...
## $ X65 : logi NA NA NA NA NA NA ...
## - attr(*, "spec")=
## .. cols(
## .. `Country Name` = col_character(),
## .. `Country Code` = col_character(),
## .. `Indicator Name` = col_character(),
## .. `Indicator Code` = col_character(),
## .. `1960` = col_double(),
## .. `1961` = col_double(),
## .. `1962` = col_double(),
## .. `1963` = col_double(),
## .. `1964` = col_double(),
## .. `1965` = col_double(),
## .. `1966` = col_double(),
## .. `1967` = col_double(),
## .. `1968` = col_double(),
## .. `1969` = col_double(),
## .. `1970` = col_double(),
## .. `1971` = col_double(),
## .. `1972` = col_double(),
## .. `1973` = col_double(),
## .. `1974` = col_double(),
## .. `1975` = col_double(),
## .. `1976` = col_double(),
## .. `1977` = col_double(),
## .. `1978` = col_double(),
## .. `1979` = col_double(),
## .. `1980` = col_double(),
## .. `1981` = col_double(),
## .. `1982` = col_double(),
## .. `1983` = col_double(),
## .. `1984` = col_double(),
## .. `1985` = col_double(),
## .. `1986` = col_double(),
## .. `1987` = col_double(),
## .. `1988` = col_double(),
## .. `1989` = col_double(),
## .. `1990` = col_double(),
## .. `1991` = col_double(),
## .. `1992` = col_double(),
## .. `1993` = col_double(),
## .. `1994` = col_double(),
## .. `1995` = col_double(),
## .. `1996` = col_double(),
## .. `1997` = col_double(),
## .. `1998` = col_double(),
## .. `1999` = col_double(),
## .. `2000` = col_double(),
## .. `2001` = col_double(),
## .. `2002` = col_double(),
## .. `2003` = col_double(),
## .. `2004` = col_double(),
## .. `2005` = col_double(),
## .. `2006` = col_double(),
## .. `2007` = col_double(),
## .. `2008` = col_double(),
## .. `2009` = col_double(),
## .. `2010` = col_double(),
## .. `2011` = col_double(),
## .. `2012` = col_double(),
## .. `2013` = col_double(),
## .. `2014` = col_double(),
## .. `2015` = col_double(),
## .. `2016` = col_double(),
## .. `2017` = col_double(),
## .. `2018` = col_double(),
## .. `2019` = col_double(),
## .. X65 = col_logical()
## .. )
attributes(dataset1)
## $class
## [1] "spec_tbl_df" "tbl_df" "tbl" "data.frame"
##
## $row.names
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
## [19] 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
## [37] 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54
## [55] 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72
## [73] 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90
## [91] 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108
## [109] 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126
## [127] 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144
## [145] 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162
## [163] 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180
## [181] 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198
## [199] 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216
## [217] 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234
## [235] 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252
## [253] 253 254 255 256 257 258
##
## $spec
## cols(
## `Country Name` = col_character(),
## `Country Code` = col_character(),
## `Indicator Name` = col_character(),
## `Indicator Code` = col_character(),
## `1960` = col_double(),
## `1961` = col_double(),
## `1962` = col_double(),
## `1963` = col_double(),
## `1964` = col_double(),
## `1965` = col_double(),
## `1966` = col_double(),
## `1967` = col_double(),
## `1968` = col_double(),
## `1969` = col_double(),
## `1970` = col_double(),
## `1971` = col_double(),
## `1972` = col_double(),
## `1973` = col_double(),
## `1974` = col_double(),
## `1975` = col_double(),
## `1976` = col_double(),
## `1977` = col_double(),
## `1978` = col_double(),
## `1979` = col_double(),
## `1980` = col_double(),
## `1981` = col_double(),
## `1982` = col_double(),
## `1983` = col_double(),
## `1984` = col_double(),
## `1985` = col_double(),
## `1986` = col_double(),
## `1987` = col_double(),
## `1988` = col_double(),
## `1989` = col_double(),
## `1990` = col_double(),
## `1991` = col_double(),
## `1992` = col_double(),
## `1993` = col_double(),
## `1994` = col_double(),
## `1995` = col_double(),
## `1996` = col_double(),
## `1997` = col_double(),
## `1998` = col_double(),
## `1999` = col_double(),
## `2000` = col_double(),
## `2001` = col_double(),
## `2002` = col_double(),
## `2003` = col_double(),
## `2004` = col_double(),
## `2005` = col_double(),
## `2006` = col_double(),
## `2007` = col_double(),
## `2008` = col_double(),
## `2009` = col_double(),
## `2010` = col_double(),
## `2011` = col_double(),
## `2012` = col_double(),
## `2013` = col_double(),
## `2014` = col_double(),
## `2015` = col_double(),
## `2016` = col_double(),
## `2017` = col_double(),
## `2018` = col_double(),
## `2019` = col_double(),
## X65 = col_logical()
## )
##
## $names
## [1] "Country Code" "Indicator Name" "1960" "1961"
## [5] "1962" "1963" "1964" "1965"
## [9] "1966" "1967" "1968" "1969"
## [13] "1970" "1971" "1972" "1973"
## [17] "1974" "1975" "1976" "1977"
## [21] "1978" "1979" "1980" "1981"
## [25] "1982" "1983" "1984" "1985"
## [29] "1986" "1987" "1988" "1989"
## [33] "1990" "1991" "1992" "1993"
## [37] "1994" "1995" "1996" "1997"
## [41] "1998" "1999" "2000" "2001"
## [45] "2002" "2003" "2004" "2005"
## [49] "2006" "2007" "2008" "2009"
## [53] "2010" "2011" "2012" "2013"
## [57] "2014" "2015" "2016" "2017"
## [61] "2018" "2019" "X65"
From the structure and attributes of dataset2, we can gather that:
dataset2 is a dataframe.
The Country Code variable is ‘character’ type, which I am happy with.
The Long Name variable is of ‘character’ type, which I am happy with.
The Region variable is of ‘character’ type, which I am NOT happy with. This variable should be converted to a factor variable.
The Income Group variable is of ‘character’ type, which I am NOT happy with. This variable should be converted to an ordered factor variable.
The structure and attributes of dataset2 are displayed below.
Below those, is the code used to convert variables into their correct datatypes.
str(dataset2)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 258 obs. of 4 variables:
## $ Country Code: chr "ABW" "AFG" "AGO" "ALB" ...
## $ Long Name : chr "Aruba" "Islamic State of Afghanistan" "People's Republic of Angola" "Republic of Albania" ...
## $ Region : chr "Latin America & Caribbean" "South Asia" "Sub-Saharan Africa" "Europe & Central Asia" ...
## $ Income Group: chr "High income" "Low income" "Lower middle income" "Upper middle income" ...
## - attr(*, "spec")=
## .. cols(
## .. `Country Code` = col_character(),
## .. `Short Name` = col_character(),
## .. `Table Name` = col_character(),
## .. `Long Name` = col_character(),
## .. `2-alpha code` = col_character(),
## .. `Currency Unit` = col_character(),
## .. `Special Notes` = col_character(),
## .. Region = col_character(),
## .. `Income Group` = col_character(),
## .. `WB-2 code` = col_character(),
## .. `National accounts base year` = col_character(),
## .. `National accounts reference year` = col_character(),
## .. `SNA price valuation` = col_character(),
## .. `Lending category` = col_character(),
## .. `Other groups` = col_character(),
## .. `System of National Accounts` = col_character(),
## .. `Alternative conversion factor` = col_character(),
## .. `PPP survey year` = col_logical(),
## .. `Balance of Payments Manual in use` = col_character(),
## .. `External debt Reporting status` = col_character(),
## .. `System of trade` = col_character(),
## .. `Government Accounting concept` = col_character(),
## .. `IMF data dissemination standard` = col_character(),
## .. `Latest population census` = col_character(),
## .. `Latest household survey` = col_character(),
## .. `Source of most recent Income and expenditure data` = col_character(),
## .. `Vital registration complete` = col_character(),
## .. `Latest agricultural census` = col_character(),
## .. `Latest industrial data` = col_double(),
## .. `Latest trade data` = col_double(),
## .. X31 = col_logical()
## .. )
attributes(dataset2)
## $class
## [1] "spec_tbl_df" "tbl_df" "tbl" "data.frame"
##
## $row.names
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
## [19] 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
## [37] 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54
## [55] 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72
## [73] 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90
## [91] 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108
## [109] 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126
## [127] 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144
## [145] 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162
## [163] 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180
## [181] 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198
## [199] 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216
## [217] 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234
## [235] 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252
## [253] 253 254 255 256 257 258
##
## $spec
## cols(
## `Country Code` = col_character(),
## `Short Name` = col_character(),
## `Table Name` = col_character(),
## `Long Name` = col_character(),
## `2-alpha code` = col_character(),
## `Currency Unit` = col_character(),
## `Special Notes` = col_character(),
## Region = col_character(),
## `Income Group` = col_character(),
## `WB-2 code` = col_character(),
## `National accounts base year` = col_character(),
## `National accounts reference year` = col_character(),
## `SNA price valuation` = col_character(),
## `Lending category` = col_character(),
## `Other groups` = col_character(),
## `System of National Accounts` = col_character(),
## `Alternative conversion factor` = col_character(),
## `PPP survey year` = col_logical(),
## `Balance of Payments Manual in use` = col_character(),
## `External debt Reporting status` = col_character(),
## `System of trade` = col_character(),
## `Government Accounting concept` = col_character(),
## `IMF data dissemination standard` = col_character(),
## `Latest population census` = col_character(),
## `Latest household survey` = col_character(),
## `Source of most recent Income and expenditure data` = col_character(),
## `Vital registration complete` = col_character(),
## `Latest agricultural census` = col_character(),
## `Latest industrial data` = col_double(),
## `Latest trade data` = col_double(),
## X31 = col_logical()
## )
##
## $names
## [1] "Country Code" "Long Name" "Region" "Income Group"
To convert the Region variable to a factor variable, I used the following code.
dataset2$Region <- as.factor(dataset2$Region)
class(dataset2$Region)
## [1] "factor"
levels(dataset2$Region)
## [1] "East Asia & Pacific" "Europe & Central Asia"
## [3] "Latin America & Caribbean" "Middle East & North Africa"
## [5] "North America" "South Asia"
## [7] "Sub-Saharan Africa"
To convert the Income Group variable to an ordered factor variable, I used the following code.
dataset2$`Income Group` <-
factor(dataset2$`Income Group`,
levels = c("High income", "Low income",
"Lower middle income", "Upper middle income"),
labels = c("Low income", "Lower middle income",
"Upper middle income", "High income"),
ordered = TRUE)
class(dataset2$`Income Group`)
## [1] "ordered" "factor"
levels(dataset2$`Income Group`)
## [1] "Low income" "Lower middle income" "Upper middle income"
## [4] "High income"
Reflecting on the tidy data principles:
dataset1 DOES NOT fulfill this criteria and can NOT be considered tidy for the following reasons:
There are two variables within the Indicator Name column. These are birth and death rates. Since they are different variables, they need their own columns. They will need to be spread across the dataframe.
The years between 1960 and 2019 should be held within a single column but instead, are spread across 61 columns. This is considered wide data and will need to be gathered into long data.
First, to gather the year columns into a single column, I used the following code (I also got rid of an unneccessary column, ‘X65’).
dataset1 <- dataset1 %>%
gather(`1960`:`2019`, key = "Year", value = "Rate")
dataset1 <- dataset1 %>% select(-`X65`)
Secondly, to spread the Indicator Name column, I used the following code.
dataset1 <- dataset1 %>% spread(key = `Indicator Name`, value = `Rate`)
Below, I’ve provided the head of the tidy dataset1.
head(dataset1)
Reflecting on the tidy data principles:
dataset2 fulfills this criteria, although there is still an issue that needs fixing.
Observations are a mixture between countries and continents, which is inconsistant. I realised that obersvations for continents had no values for the Region variable.
To filter observations belonging to countries only, I used the following code.
dataset2 <- dataset2 %>% filter(!is.na(`Region`))
Below, I’ve provided the head of the tidy dataset2.
head(dataset2)
I left dataset2 untouched for this section of the assignment.
I decided to mutate a new variable for dataset1.
Using the birth rate and death rate variables, I created a new variable for growth rate.
dataset1 <- dataset1 %>% mutate(`Growth rate, crude (per 1,000 people)` =
`Birth rate, crude (per 1,000 people)` -
`Death rate, crude (per 1,000 people)`)
The head of the updated dataset1 is displayed below.
head(dataset1)
Now is the point of the assignment in which merged the two datasets.
I used a left join to join dataset1 to dataset2 by the Country Code variable.
I named this combined dataset, dataset.
The code used to execute this join is displayed below.
dataset <- left_join(dataset2, dataset1, by = "Country Code")
The head of the new dataset is shown below.
head(dataset)
Firsty, I realised that there were no values for 2019 across the entire dataset.
To filter the dataset to exclude the year 2019, I used the following code.
dataset <- dataset %>% filter(!`Year` == 2019)
To check for missing values across the dataset, I used the following code.
colSums(is.na(dataset))
## Country Code Long Name
## 0 0
## Region Income Group
## 0 0
## Year Birth rate, crude (per 1,000 people)
## 0 491
## Death rate, crude (per 1,000 people) Growth rate, crude (per 1,000 people)
## 495 500
Columns with missing values were birth, death and growth rates.
Dropping these rows would result in a large loss of data. Instead, I decided to replace missing values with the mean of their containing variable.
I used the following code to replace these missing values.
x1 <- as.numeric(impute(dataset$`Birth rate, crude (per 1,000 people)`, fun = mean))
x2 <- as.numeric(impute(dataset$`Death rate, crude (per 1,000 people)`, fun = mean))
x3 <- as.numeric(impute(dataset$`Growth rate, crude (per 1,000 people)`, fun = mean))
dataset$`Birth rate, crude (per 1,000 people)` <- x1
dataset$`Death rate, crude (per 1,000 people)` <- x2
dataset$`Growth rate, crude (per 1,000 people)` <- x3
Using the following code, I confirmed that all missing values had been dealt with.
colSums(is.na(dataset))
## Country Code Long Name
## 0 0
## Region Income Group
## 0 0
## Year Birth rate, crude (per 1,000 people)
## 0 0
## Death rate, crude (per 1,000 people) Growth rate, crude (per 1,000 people)
## 0 0
To check for special values I used the following code. There were no special values.
is.specialorNA <- function(x){
if (is.numeric(x)) (is.infinite(x) | is.nan(x) | is.na(x))}
sapply(dataset, function(x) sum( is.specialorNA(x) ))
## Country Code Long Name
## 0 0
## Region Income Group
## 0 0
## Year Birth rate, crude (per 1,000 people)
## 0 0
## Death rate, crude (per 1,000 people) Growth rate, crude (per 1,000 people)
## 0 0
To check for obvious errors in the Country Code variable, I used the following code. Every code looked fine, so I moved on.
unique(dataset$`Country Code`)
## [1] "ABW" "AGO" "ALB" "ARG" "ASM" "AUT" "BDI" "BGD" "BGR" "BHS" "BLR" "BLZ"
## [13] "BMU" "BOL" "BWA" "CAF" "CAN" "CHI" "CHN" "CIV" "COG" "COM" "CUB" "CYP"
## [25] "DEU" "DMA" "DNK" "ECU" "ERI" "ESP" "FIN" "FRO" "GBR" "GIB" "GIN" "GMB"
## [37] "GRL" "GUM" "GUY" "HND" "HUN" "IND" "IRL" "IRN" "ISR" "JAM" "JOR" "KEN"
## [49] "KGZ" "KHM" "KNA" "LBY" "LSO" "LTU" "LVA" "MAC" "MAF" "MAR" "MDA" "MDV"
## [61] "MEX" "MKD" "MLT" "MMR" "MNG" "MRT" "MWI" "NER" "NLD" "NOR" "NRU" "NZL"
## [73] "PAK" "PAN" "PHL" "PRK" "PRT" "PRY" "PSE" "PYF" "QAT" "RUS" "SDN" "SEN"
## [85] "SGP" "SLB" "SLV" "STP" "SVK" "SWE" "SWZ" "SYC" "SYR" "TKM" "TLS" "TON"
## [97] "TUN" "TUV" "TZA" "UKR" "URY" "VGB" "VNM" "VUT" "WSM" "XKX" "ZAF" "ZMB"
To check for obvious errors in the Long Name variable, I used the following code. Every country name looked good, so I moved on.
unique(dataset$`Long Name`)
## [1] "Aruba"
## [2] "People's Republic of Angola"
## [3] "Republic of Albania"
## [4] "Argentine Republic"
## [5] "American Samoa"
## [6] "Republic of Austria"
## [7] "Republic of Burundi"
## [8] "People's Republic of Bangladesh"
## [9] "Republic of Bulgaria"
## [10] "Commonwealth of The Bahamas"
## [11] "Republic of Belarus"
## [12] "Belize"
## [13] "The Bermudas"
## [14] "Plurinational State of Bolivia"
## [15] "Republic of Botswana"
## [16] "Central African Republic"
## [17] "Canada"
## [18] "Channel Islands"
## [19] "People's Republic of China"
## [20] "Republic of Côte d'Ivoire"
## [21] "Republic of Congo"
## [22] "Union of the Comoros"
## [23] "Republic of Cuba"
## [24] "Republic of Cyprus"
## [25] "Federal Republic of Germany"
## [26] "Commonwealth of Dominica"
## [27] "Kingdom of Denmark"
## [28] "Republic of Ecuador"
## [29] "State of Eritrea"
## [30] "Kingdom of Spain"
## [31] "Republic of Finland"
## [32] "Faroe Islands"
## [33] "United Kingdom of Great Britain and Northern Ireland"
## [34] "Gibraltar"
## [35] "Republic of Guinea"
## [36] "Republic of The Gambia"
## [37] "Greenland"
## [38] "Guam"
## [39] "Co-operative Republic of Guyana"
## [40] "Republic of Honduras"
## [41] "Hungary"
## [42] "Republic of India"
## [43] "Ireland"
## [44] "Islamic Republic of Iran"
## [45] "State of Israel"
## [46] "Jamaica"
## [47] "Hashemite Kingdom of Jordan"
## [48] "Republic of Kenya"
## [49] "Kyrgyz Republic"
## [50] "Kingdom of Cambodia"
## [51] "St. Kitts and Nevis"
## [52] "Socialist People's Libyan Arab Jamahiriya"
## [53] "Kingdom of Lesotho"
## [54] "Republic of Lithuania"
## [55] "Republic of Latvia"
## [56] "Macao Special Administrative Region of the People's Republic of China"
## [57] "St. Martin (French part)"
## [58] "Kingdom of Morocco"
## [59] "Republic of Moldova"
## [60] "Republic of Maldives"
## [61] "United Mexican States"
## [62] "Republic of North Macedonia"
## [63] "Republic of Malta"
## [64] "Republic of the Union of Myanmar"
## [65] "Mongolia"
## [66] "Islamic Republic of Mauritania"
## [67] "Republic of Malawi"
## [68] "Republic of Niger"
## [69] "Kingdom of the Netherlands"
## [70] "Kingdom of Norway"
## [71] "Republic of Nauru"
## [72] "New Zealand"
## [73] "Islamic Republic of Pakistan"
## [74] "Republic of Panama"
## [75] "Republic of the Philippines"
## [76] "Democratic People's Republic of Korea"
## [77] "Portuguese Republic"
## [78] "Republic of Paraguay"
## [79] "West Bank and Gaza"
## [80] "French Polynesia"
## [81] "State of Qatar"
## [82] "Russian Federation"
## [83] "Republic of the Sudan"
## [84] "Republic of Senegal"
## [85] "Republic of Singapore"
## [86] "Solomon Islands"
## [87] "Republic of El Salvador"
## [88] "Democratic Republic of São Tomé and Principe"
## [89] "Slovak Republic"
## [90] "Kingdom of Sweden"
## [91] "Kingdom of Eswatini"
## [92] "Republic of Seychelles"
## [93] "Syrian Arab Republic"
## [94] "Turkmenistan"
## [95] "Democratic Republic of Timor-Leste"
## [96] "Kingdom of Tonga"
## [97] "Republic of Tunisia"
## [98] "Tuvalu"
## [99] "United Republic of Tanzania"
## [100] "Ukraine"
## [101] "Oriental Republic of Uruguay"
## [102] "British Virgin Islands"
## [103] "Socialist Republic of Vietnam"
## [104] "Republic of Vanuatu"
## [105] "Samoa"
## [106] "Republic of Kosovo"
## [107] "Republic of South Africa"
## [108] "Republic of Zambia"
To check for obvious errors in the Region variable, I used the following code. All values belonged to the 7 regions, so I moved on.
unique(dataset$Region)
## [1] Latin America & Caribbean Sub-Saharan Africa
## [3] Europe & Central Asia East Asia & Pacific
## [5] South Asia North America
## [7] Middle East & North Africa
## 7 Levels: East Asia & Pacific ... Sub-Saharan Africa
To check for obvious errors in the Income Group variable, I used the following code. Every value belonged to the 4 income levels, so I moved on.
unique(dataset$`Income Group`)
## [1] Low income Upper middle income High income
## [4] Lower middle income
## 4 Levels: Low income < Lower middle income < ... < High income
To check for obvious errors in the Year variable, I used the following code. There were no values from years outside the range of 1960 to 2019, so I moved on.
dim(filter(dataset, `Year` < 1960 | `Year` > 2019))[1]
## [1] 0
To check for obvious errors in the birth rate variable, I used the following code. There were no negative birth rates, so I moved on.
dim(filter(dataset, dataset$`Birth rate, crude (per 1,000 people)` < 0))[1]
## [1] 0
To check for obvious errors in the death rate variable, I used the following code. There were no death rates above 100%, so I moved on.
dim(filter(dataset, dataset$`Death rate, crude (per 1,000 people)` > 100))[1]
## [1] 0
To check for obvious errors in the growth rate variable, I used the following code. There were no values under -100%, so I moved on.
dim(filter(dataset, dataset$`Growth rate, crude (per 1,000 people)` < -100))[1]
## [1] 0
To scan for outliers across the 3 numeric variables, I used box plots.
boxplot(dataset$`Birth rate, crude (per 1,000 people)`,
dataset$`Death rate, crude (per 1,000 people)`,
dataset$`Growth rate, crude (per 1,000 people)`,
at = c(1,2,3),
names = c("Birth Rate", "Death Rate", "Growth Rate"))
From the visualisation above, the birth and growth rate variables had no outliers, but the death rate variable had quite a few.
To deal with these, I used capping (AKA Winsorising). I didn’t delete these outliers as they may still hold useful information.
I dealt with outliers in the death rate column using the following code.
cap <- function(x){
quantiles <- quantile( x, c(.05, 0.25, 0.75, .95 ) )
x[ x < quantiles[2] - 1.5*IQR(x) ] <- quantiles[1]
x[ x > quantiles[3] + 1.5*IQR(x) ] <- quantiles[4]
x
}
dataset$`Death rate, crude (per 1,000 people)` <-
dataset$`Death rate, crude (per 1,000 people)` %>%
cap
Using the visualisation below, I confirmed that outliers had been dealt with.
boxplot(dataset$`Death rate, crude (per 1,000 people)`,
xlab ="Death Rate")
My numeric variables were already in the form of percentages, so they were already quite clean and didn’t need tranformations.
However, I added a new variable onto the dataset by transforming the growth rate variable into z-scores to demonstate how this could be done.
The code for this is below.
dataset$`Growth rate, crude (per 1,000 people) z scores` <-
scale(dataset$`Growth rate, crude (per 1,000 people)`, center = TRUE, scale = TRUE)
The head of this new column is shown below.
head(dataset$`Growth rate, crude (per 1,000 people) z scores`)
## [,1]
## [1,] 1.0809325
## [2,] 0.9878304
## [3,] 0.8870238
## [4,] 0.7789771
## [5,] 0.6656394
## [6,] 0.5502595
Here is the head of my final dataframe after preprocessing.
head(dataset)