# Install once if needed:
# install.packages(c("dplyr", "tidyr", "outliers", "Hmisc", "forecast", "ggplot2"))
library(dplyr)
library(tidyr)
library(outliers)
library(Hmisc)
library(forecast)
library(ggplot2)  # required by the box plot code later in this report
The datasets used in this report contain world life expectancy from 1960 to 2018 and the classification of countries by income group. First, we inspect the variables in each dataset and the structure of the data frames. We then merge the two datasets on their common variable to create a new data frame, and re-examine the attributes and structure of the merged data. Next, we convert selected character variables to factors, exclude irrelevant variables, rename a few variables, and reshape the data into tidy (long) format. We then identify the missing values in each column and handle the NAs by recoding and/or removing them. The z-score method is used to identify outliers in the dataset. Lastly, a histogram of the numerical variable ‘Life_Expectancy’ shows a left-skewed distribution, so a Box-Cox transformation is applied to make the distribution more symmetric for later analysis.
Data 1: Life Expectancy
life_expectancy.csv is a dataset containing the average number of years a newborn baby is expected to live, for 264 countries from 1960 to 2017.
In the original file, the column headers are on the 5th row. We therefore read the dataset with two read.csv() calls: the first reads the header row and the second reads the data.
Variables in the dataset: Country Name: List of countries; Country Code: The ISO 3166-1 alpha-3 codes which represent countries; Indicator Name: Total life expectancy at birth (annually); Indicator Code: SP.DYN.LE00.IN; 1960 - 2017: Average Life Expectancy from the year 1960 to 2017
headers=read.csv("life_expectancy.csv",skip = 4,header = F,nrows=1,as.is = T)
life_expectancy = read.csv("life_expectancy.csv",skip = 5, header = F)
colnames(life_expectancy)= headers
head(life_expectancy)
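The two-step read can be tried on a small stand-in file. The sketch below writes a hypothetical CSV (four metadata lines, then a header on line 5) to a temporary file. The file contents and the names `tmp`, `hdr`, and `dat` are illustrative assumptions, not the World Bank file itself.

```r
# Toy file: 4 metadata lines, header on line 5, data from line 6 (made-up values)
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "Data Source,Example",
  "Last Updated,Example",
  "Note,Example",
  "Note,Example",
  "Country Name,Country Code,1960,1961",
  "Aland,ALD,50.1,50.4",
  "Borduria,BRD,61.2,61.5"
), tmp)

# Step 1: read only the header row, keeping strings as character
hdr <- read.csv(tmp, skip = 4, header = FALSE, nrows = 1, as.is = TRUE)
# Step 2: read the data rows, then attach the header as column names
dat <- read.csv(tmp, skip = 5, header = FALSE, stringsAsFactors = FALSE)
colnames(dat) <- unlist(hdr, use.names = FALSE)
print(dat)
```

Reading the header separately keeps the year labels as literal column names such as "1960", rather than letting read.csv() rewrite them into syntactically valid names.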
Data 2: Income Groupings
income_groupings.csv is a dataset containing the classification of countries by income group.
Variables in the dataset: Country Code: The ISO 3166-1 alpha-3 codes which represent countries; Region: The region the country belongs to; Income Group: The income classification of the country (low income, lower middle income, upper middle income, or high income); Special Notes: Notes about the country; Table Name: The country name
income_groupings<-read.csv("income_groupings.csv",sep=",", header=T, check.names = FALSE)
head(income_groupings)
Data Source: Both datasets were obtained from The World Bank, retrieved from https://data.worldbank.org/indicator/SP.DYN.LE00.IN?locations=AU&view=chart.
The common variable ‘Country Code’ is used to merge the two datasets into a new data frame, which is used for the rest of this report.
combined_life_expectancy<- life_expectancy %>% left_join(income_groupings,by="Country Code")
Column `Country Code` joining factors with different levels, coercing to character vector
head(combined_life_expectancy)
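To see what a left join does to rows without a match, the following minimal sketch uses made-up tables. Base R merge() is used here in place of left_join() so the example runs without dplyr; unlike left_join(), merge() sorts the result by the key. The column name `code` and all values are hypothetical.

```r
# Made-up stand-ins for the two tables
life <- data.frame(code = c("AAA", "BBB", "CCC"),
                   y1960 = c(50.1, 61.2, 45.0),
                   stringsAsFactors = FALSE)
groups <- data.frame(code = c("AAA", "BBB"),
                     IncomeGroup = c("High income", "Low income"),
                     stringsAsFactors = FALSE)

# all.x = TRUE keeps every row of the left table, like a left join
joined <- merge(life, groups, by = "code", all.x = TRUE)
print(joined)

# Left-table rows with no match get NA in the right-table columns
unmatched <- joined$code[is.na(joined$IncomeGroup)]
print(unmatched)
```

Checking for unmatched keys right after a join is a cheap way to spot rows, such as aggregates, that have no income classification.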
We can display the structure of the dataset using the str() function as follows:
str(combined_life_expectancy)
'data.frame': 264 obs. of 67 variables:
$ Country Name : Factor w/ 264 levels "Afghanistan",..: 11 1 6 2 5 8 250 9 10 4 ...
$ Country Code : chr "ABW" "AFG" "AGO" "ALB" ...
$ Indicator Name: Factor w/ 1 level "Life expectancy at birth, total (years)": 1 1 1 1 1 1 1 1 1 1 ...
$ Indicator Code: Factor w/ 1 level "SP.DYN.LE00.IN": 1 1 1 1 1 1 1 1 1 1 ...
$ 1960 : num 65.7 32.3 33.3 62.3 NA ...
$ 1961 : num 66.1 32.7 33.6 63.3 NA ...
$ 1962 : num 66.4 33.2 33.9 64.2 NA ...
$ 1963 : num 66.8 33.6 34.3 64.9 NA ...
$ 1964 : num 67.1 34.1 34.6 65.5 NA ...
$ 1965 : num 67.4 34.5 35 65.8 NA ...
$ 1966 : num 67.8 34.9 35.4 66.1 NA ...
$ 1967 : num 68.1 35.4 35.8 66.3 NA ...
$ 1968 : num 68.4 35.8 36.2 66.5 NA ...
$ 1969 : num 68.8 36.2 36.6 66.7 NA ...
$ 1970 : num 69.1 36.7 37 66.9 NA ...
$ 1971 : num 69.5 37.1 37.5 67.2 NA ...
$ 1972 : num 69.9 37.6 37.9 67.6 NA ...
$ 1973 : num 70.2 38.1 38.3 68 NA ...
$ 1974 : num 70.5 38.5 38.7 68.3 NA ...
$ 1975 : num 70.8 39 39.1 68.7 NA ...
$ 1976 : num 71.1 39.6 39.5 69.1 NA ...
$ 1977 : num 71.4 40.1 39.8 69.4 NA ...
$ 1978 : num 71.7 40.6 40.1 69.7 NA ...
$ 1979 : num 72 41.2 40.3 70 NA ...
$ 1980 : num 72.3 41.9 40.5 70.2 NA ...
$ 1981 : num 72.5 42.5 40.7 70.4 NA ...
$ 1982 : num 72.8 43.2 40.8 70.6 NA ...
$ 1983 : num 72.9 44 41 70.9 NA ...
$ 1984 : num 73.1 44.7 41.1 71.1 NA ...
$ 1985 : num 73.2 45.6 41.2 71.4 NA ...
$ 1986 : num 73.3 46.4 41.3 71.6 NA ...
$ 1987 : num 73.3 47.3 41.4 71.8 NA ...
$ 1988 : num 73.4 48.2 41.5 71.8 NA ...
$ 1989 : num 73.4 49 41.6 71.9 NA ...
$ 1990 : num 73.5 49.9 41.7 71.8 NA ...
$ 1991 : num 73.5 50.6 41.9 71.8 NA ...
$ 1992 : num 73.5 51.3 42.1 71.8 NA ...
$ 1993 : num 73.6 52 42.3 71.9 NA ...
$ 1994 : num 73.6 52.5 42.7 72 NA ...
$ 1995 : num 73.6 53.1 43.1 72.2 NA ...
$ 1996 : num 73.6 53.5 43.7 72.5 NA ...
$ 1997 : num 73.7 54 44.4 72.8 NA ...
$ 1998 : num 73.7 54.5 45.2 73.2 NA ...
$ 1999 : num 73.7 55 46.1 73.6 NA ...
$ 2000 : num 73.8 55.5 47.1 74 NA ...
$ 2001 : num 73.9 56 48.2 74.3 NA ...
$ 2002 : num 73.9 56.6 49.3 74.6 NA ...
$ 2003 : num 74 57.2 50.5 74.8 NA ...
$ 2004 : num 74.2 57.9 51.7 75 NA ...
$ 2005 : num 74.3 58.5 52.8 75.2 NA ...
$ 2006 : num 74.4 59.1 54 75.4 NA ...
$ 2007 : num 74.6 59.7 55.1 75.7 NA ...
$ 2008 : num 74.7 60.2 56.2 75.9 NA ...
$ 2009 : num 74.9 60.8 57.2 76.3 NA ...
$ 2010 : num 75 61.2 58.2 76.7 NA ...
$ 2011 : num 75.2 61.7 59 77 NA ...
$ 2012 : num 75.3 62.1 59.8 77.4 NA ...
$ 2013 : num 75.4 62.5 60.4 77.7 NA ...
$ 2014 : num 75.6 62.9 60.9 78 NA ...
$ 2015 : num 75.7 63.3 61.2 78.2 NA ...
$ 2016 : num 75.9 63.7 61.5 78.3 NA ...
$ 2017 : num 76 64 61.8 78.5 NA ...
$ 2018 : logi NA NA NA NA NA NA ...
$ Region : Factor w/ 8 levels "","East Asia & Pacific",..: 4 7 8 3 3 1 5 4 3 2 ...
$ IncomeGroup : Factor w/ 5 levels "","High income",..: 2 3 4 5 2 1 2 2 5 5 ...
$ SpecialNotes : Factor w/ 232 levels "","(see also: https://www.imf.org/en/News/Articles/2018/07/06/pr18283-somalia-2nd-and-final-review-under-the-staff"| __truncated__,..: 41 49 92 8 79 10 123 131 143 16 ...
$ TableName : Factor w/ 263 levels "Afghanistan",..: 11 1 6 2 5 8 249 9 10 4 ...
In R versions before 4.0.0, the stringsAsFactors argument of read.csv() defaults to TRUE. Setting it to FALSE reads string columns in as character variables rather than factors.
headers=read.csv("life_expectancy.csv",skip = 4,header = F,nrows=1,as.is = T,stringsAsFactors = FALSE)
life_expectancy = read.csv("life_expectancy.csv",skip = 5, header = F,stringsAsFactors = FALSE)
colnames(life_expectancy)= headers
income_groupings<-read.csv("income_groupings.csv",sep=",", header=T, check.names = FALSE,stringsAsFactors = FALSE)
combined_life_expectancy<- life_expectancy %>% left_join(income_groupings,by="Country Code")
str(combined_life_expectancy)
'data.frame': 264 obs. of 67 variables:
$ Country Name : chr "Aruba" "Afghanistan" "Angola" "Albania" ...
$ Country Code : chr "ABW" "AFG" "AGO" "ALB" ...
$ Indicator Name: chr "Life expectancy at birth, total (years)" "Life expectancy at birth, total (years)" "Life expectancy at birth, total (years)" "Life expectancy at birth, total (years)" ...
$ Indicator Code: chr "SP.DYN.LE00.IN" "SP.DYN.LE00.IN" "SP.DYN.LE00.IN" "SP.DYN.LE00.IN" ...
$ 1960 : num 65.7 32.3 33.3 62.3 NA ...
$ 1961 : num 66.1 32.7 33.6 63.3 NA ...
$ 1962 : num 66.4 33.2 33.9 64.2 NA ...
$ 1963 : num 66.8 33.6 34.3 64.9 NA ...
$ 1964 : num 67.1 34.1 34.6 65.5 NA ...
$ 1965 : num 67.4 34.5 35 65.8 NA ...
$ 1966 : num 67.8 34.9 35.4 66.1 NA ...
$ 1967 : num 68.1 35.4 35.8 66.3 NA ...
$ 1968 : num 68.4 35.8 36.2 66.5 NA ...
$ 1969 : num 68.8 36.2 36.6 66.7 NA ...
$ 1970 : num 69.1 36.7 37 66.9 NA ...
$ 1971 : num 69.5 37.1 37.5 67.2 NA ...
$ 1972 : num 69.9 37.6 37.9 67.6 NA ...
$ 1973 : num 70.2 38.1 38.3 68 NA ...
$ 1974 : num 70.5 38.5 38.7 68.3 NA ...
$ 1975 : num 70.8 39 39.1 68.7 NA ...
$ 1976 : num 71.1 39.6 39.5 69.1 NA ...
$ 1977 : num 71.4 40.1 39.8 69.4 NA ...
$ 1978 : num 71.7 40.6 40.1 69.7 NA ...
$ 1979 : num 72 41.2 40.3 70 NA ...
$ 1980 : num 72.3 41.9 40.5 70.2 NA ...
$ 1981 : num 72.5 42.5 40.7 70.4 NA ...
$ 1982 : num 72.8 43.2 40.8 70.6 NA ...
$ 1983 : num 72.9 44 41 70.9 NA ...
$ 1984 : num 73.1 44.7 41.1 71.1 NA ...
$ 1985 : num 73.2 45.6 41.2 71.4 NA ...
$ 1986 : num 73.3 46.4 41.3 71.6 NA ...
$ 1987 : num 73.3 47.3 41.4 71.8 NA ...
$ 1988 : num 73.4 48.2 41.5 71.8 NA ...
$ 1989 : num 73.4 49 41.6 71.9 NA ...
$ 1990 : num 73.5 49.9 41.7 71.8 NA ...
$ 1991 : num 73.5 50.6 41.9 71.8 NA ...
$ 1992 : num 73.5 51.3 42.1 71.8 NA ...
$ 1993 : num 73.6 52 42.3 71.9 NA ...
$ 1994 : num 73.6 52.5 42.7 72 NA ...
$ 1995 : num 73.6 53.1 43.1 72.2 NA ...
$ 1996 : num 73.6 53.5 43.7 72.5 NA ...
$ 1997 : num 73.7 54 44.4 72.8 NA ...
$ 1998 : num 73.7 54.5 45.2 73.2 NA ...
$ 1999 : num 73.7 55 46.1 73.6 NA ...
$ 2000 : num 73.8 55.5 47.1 74 NA ...
$ 2001 : num 73.9 56 48.2 74.3 NA ...
$ 2002 : num 73.9 56.6 49.3 74.6 NA ...
$ 2003 : num 74 57.2 50.5 74.8 NA ...
$ 2004 : num 74.2 57.9 51.7 75 NA ...
$ 2005 : num 74.3 58.5 52.8 75.2 NA ...
$ 2006 : num 74.4 59.1 54 75.4 NA ...
$ 2007 : num 74.6 59.7 55.1 75.7 NA ...
$ 2008 : num 74.7 60.2 56.2 75.9 NA ...
$ 2009 : num 74.9 60.8 57.2 76.3 NA ...
$ 2010 : num 75 61.2 58.2 76.7 NA ...
$ 2011 : num 75.2 61.7 59 77 NA ...
$ 2012 : num 75.3 62.1 59.8 77.4 NA ...
$ 2013 : num 75.4 62.5 60.4 77.7 NA ...
$ 2014 : num 75.6 62.9 60.9 78 NA ...
$ 2015 : num 75.7 63.3 61.2 78.2 NA ...
$ 2016 : num 75.9 63.7 61.5 78.3 NA ...
$ 2017 : num 76 64 61.8 78.5 NA ...
$ 2018 : logi NA NA NA NA NA NA ...
$ Region : chr "Latin America & Caribbean" "South Asia" "Sub-Saharan Africa" "Europe & Central Asia" ...
$ IncomeGroup : chr "High income" "Low income" "Lower middle income" "Upper middle income" ...
$ SpecialNotes : chr "Central Bureau of Statistics and Central Bank of Aruba ; Source of population estimates: UN Population Division"| __truncated__ "Central Statistics Organization; World Bank staff estimates ; Source of population estimates: UN Population Div"| __truncated__ "IMF ; Source of population estimates: UN Population Division's World Population Prospects 2019 PROVISIONAL esti"| __truncated__ "Albanian Institute of Statistics ; Source of population estimates: Institute of Statistics, Eurostat" ...
$ TableName : chr "Aruba" "Afghanistan" "Angola" "Albania" ...
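The effect of stringsAsFactors can be checked in isolation. This is a small sketch on made-up inline CSV text (the name `csv_text` and its values are assumptions); TRUE is passed explicitly so the behaviour does not depend on the R version.

```r
csv_text <- "Region,IncomeGroup\nSouth Asia,Low income\nEurope,High income\nSouth Asia,Low income"

# Explicit TRUE reproduces the pre-4.0.0 default: strings become factors
as_factor <- read.csv(text = csv_text, stringsAsFactors = TRUE)
# FALSE keeps them as plain character vectors
as_char <- read.csv(text = csv_text, stringsAsFactors = FALSE)

print(class(as_factor$Region))
print(class(as_char$Region))
print(levels(as_factor$Region))
```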
Attributes of the dataset can be accessed using the attributes() function:
attributes(combined_life_expectancy)
$row.names
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
[33] 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64
[65] 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96
[97] 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128
[129] 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160
[161] 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192
[193] 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224
[225] 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256
[257] 257 258 259 260 261 262 263 264
$class
[1] "data.frame"
$names
[1] "Country Name" "Country Code" "Indicator Name" "Indicator Code" "1960" "1961" "1962"
[8] "1963" "1964" "1965" "1966" "1967" "1968" "1969"
[15] "1970" "1971" "1972" "1973" "1974" "1975" "1976"
[22] "1977" "1978" "1979" "1980" "1981" "1982" "1983"
[29] "1984" "1985" "1986" "1987" "1988" "1989" "1990"
[36] "1991" "1992" "1993" "1994" "1995" "1996" "1997"
[43] "1998" "1999" "2000" "2001" "2002" "2003" "2004"
[50] "2005" "2006" "2007" "2008" "2009" "2010" "2011"
[57] "2012" "2013" "2014" "2015" "2016" "2017" "2018"
[64] "Region" "IncomeGroup" "SpecialNotes" "TableName"
We can also use the dim() function to get the dimensions of the data frame. As shown below, there are 264 rows and 67 columns in the data frame.
dim(combined_life_expectancy)
[1] 264 67
IncomeGroup is converted to an ordered factor, with levels arranged from low to high income.
combined_life_expectancy$IncomeGroup<-factor(combined_life_expectancy$IncomeGroup,levels = c('Low income','Lower middle income','Upper middle income','High income'),labels = c('Low','Lower middle','Upper middle','High'),ordered = TRUE)
Region is converted from a character variable to a factor.
combined_life_expectancy$Region=as.factor(combined_life_expectancy$Region)
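The behaviour of an ordered factor can be illustrated on a short made-up vector. The sketch below reuses the same levels and labels as the conversion above; the input vector `income` is an assumption.

```r
income <- c("High income", "Low income", "", "Upper middle income")

income_f <- factor(income,
                   levels = c("Low income", "Lower middle income",
                              "Upper middle income", "High income"),
                   labels = c("Low", "Lower middle", "Upper middle", "High"),
                   ordered = TRUE)
print(income_f)

# Ordered factors support comparisons that follow the level order
above_lower_middle <- income_f > "Lower middle"
print(above_lower_middle)

# Values not listed in `levels` (here the empty string) become NA
n_missing <- sum(is.na(income_f))
print(n_missing)
```

Any value that is not one of the listed levels, including a blank string, silently becomes NA, so the NA count is worth checking after the conversion.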
Re-inspect the data frame to get a sense of the structure of the data after conversions.
str(combined_life_expectancy)
'data.frame': 264 obs. of 67 variables:
$ Country Name : chr "Aruba" "Afghanistan" "Angola" "Albania" ...
$ Country Code : chr "ABW" "AFG" "AGO" "ALB" ...
$ Indicator Name: chr "Life expectancy at birth, total (years)" "Life expectancy at birth, total (years)" "Life expectancy at birth, total (years)" "Life expectancy at birth, total (years)" ...
$ Indicator Code: chr "SP.DYN.LE00.IN" "SP.DYN.LE00.IN" "SP.DYN.LE00.IN" "SP.DYN.LE00.IN" ...
$ 1960 : num 65.7 32.3 33.3 62.3 NA ...
$ 1961 : num 66.1 32.7 33.6 63.3 NA ...
$ 1962 : num 66.4 33.2 33.9 64.2 NA ...
$ 1963 : num 66.8 33.6 34.3 64.9 NA ...
$ 1964 : num 67.1 34.1 34.6 65.5 NA ...
$ 1965 : num 67.4 34.5 35 65.8 NA ...
$ 1966 : num 67.8 34.9 35.4 66.1 NA ...
$ 1967 : num 68.1 35.4 35.8 66.3 NA ...
$ 1968 : num 68.4 35.8 36.2 66.5 NA ...
$ 1969 : num 68.8 36.2 36.6 66.7 NA ...
$ 1970 : num 69.1 36.7 37 66.9 NA ...
$ 1971 : num 69.5 37.1 37.5 67.2 NA ...
$ 1972 : num 69.9 37.6 37.9 67.6 NA ...
$ 1973 : num 70.2 38.1 38.3 68 NA ...
$ 1974 : num 70.5 38.5 38.7 68.3 NA ...
$ 1975 : num 70.8 39 39.1 68.7 NA ...
$ 1976 : num 71.1 39.6 39.5 69.1 NA ...
$ 1977 : num 71.4 40.1 39.8 69.4 NA ...
$ 1978 : num 71.7 40.6 40.1 69.7 NA ...
$ 1979 : num 72 41.2 40.3 70 NA ...
$ 1980 : num 72.3 41.9 40.5 70.2 NA ...
$ 1981 : num 72.5 42.5 40.7 70.4 NA ...
$ 1982 : num 72.8 43.2 40.8 70.6 NA ...
$ 1983 : num 72.9 44 41 70.9 NA ...
$ 1984 : num 73.1 44.7 41.1 71.1 NA ...
$ 1985 : num 73.2 45.6 41.2 71.4 NA ...
$ 1986 : num 73.3 46.4 41.3 71.6 NA ...
$ 1987 : num 73.3 47.3 41.4 71.8 NA ...
$ 1988 : num 73.4 48.2 41.5 71.8 NA ...
$ 1989 : num 73.4 49 41.6 71.9 NA ...
$ 1990 : num 73.5 49.9 41.7 71.8 NA ...
$ 1991 : num 73.5 50.6 41.9 71.8 NA ...
$ 1992 : num 73.5 51.3 42.1 71.8 NA ...
$ 1993 : num 73.6 52 42.3 71.9 NA ...
$ 1994 : num 73.6 52.5 42.7 72 NA ...
$ 1995 : num 73.6 53.1 43.1 72.2 NA ...
$ 1996 : num 73.6 53.5 43.7 72.5 NA ...
$ 1997 : num 73.7 54 44.4 72.8 NA ...
$ 1998 : num 73.7 54.5 45.2 73.2 NA ...
$ 1999 : num 73.7 55 46.1 73.6 NA ...
$ 2000 : num 73.8 55.5 47.1 74 NA ...
$ 2001 : num 73.9 56 48.2 74.3 NA ...
$ 2002 : num 73.9 56.6 49.3 74.6 NA ...
$ 2003 : num 74 57.2 50.5 74.8 NA ...
$ 2004 : num 74.2 57.9 51.7 75 NA ...
$ 2005 : num 74.3 58.5 52.8 75.2 NA ...
$ 2006 : num 74.4 59.1 54 75.4 NA ...
$ 2007 : num 74.6 59.7 55.1 75.7 NA ...
$ 2008 : num 74.7 60.2 56.2 75.9 NA ...
$ 2009 : num 74.9 60.8 57.2 76.3 NA ...
$ 2010 : num 75 61.2 58.2 76.7 NA ...
$ 2011 : num 75.2 61.7 59 77 NA ...
$ 2012 : num 75.3 62.1 59.8 77.4 NA ...
$ 2013 : num 75.4 62.5 60.4 77.7 NA ...
$ 2014 : num 75.6 62.9 60.9 78 NA ...
$ 2015 : num 75.7 63.3 61.2 78.2 NA ...
$ 2016 : num 75.9 63.7 61.5 78.3 NA ...
$ 2017 : num 76 64 61.8 78.5 NA ...
$ 2018 : logi NA NA NA NA NA NA ...
$ Region : Factor w/ 8 levels "","East Asia & Pacific",..: 4 7 8 3 3 1 5 4 3 2 ...
$ IncomeGroup : Ord.factor w/ 4 levels "Low"<"Lower middle"<..: 4 1 2 3 4 NA 4 4 3 3 ...
$ SpecialNotes : chr "Central Bureau of Statistics and Central Bank of Aruba ; Source of population estimates: UN Population Division"| __truncated__ "Central Statistics Organization; World Bank staff estimates ; Source of population estimates: UN Population Div"| __truncated__ "IMF ; Source of population estimates: UN Population Division's World Population Prospects 2019 PROVISIONAL esti"| __truncated__ "Albanian Institute of Statistics ; Source of population estimates: Institute of Statistics, Eurostat" ...
$ TableName : chr "Aruba" "Afghanistan" "Angola" "Albania" ...
head(combined_life_expectancy)
Subsetting data: Indicator Name, Indicator Code, SpecialNotes, and TableName are irrelevant to this analysis, so the following code excludes them from the dataset.
sub_comb_life_expectancy<-combined_life_expectancy%>%select(-'Indicator Name',-'Indicator Code',-'SpecialNotes',-'TableName')
head(sub_comb_life_expectancy)
Renaming variables: checking the existing column names and relabelling Country Name as Country_Name, Country Code as Country_Code, and IncomeGroup as Income_Class.
colnames(sub_comb_life_expectancy)
[1] "Country Name" "Country Code" "1960" "1961" "1962" "1963" "1964" "1965"
[9] "1966" "1967" "1968" "1969" "1970" "1971" "1972" "1973"
[17] "1974" "1975" "1976" "1977" "1978" "1979" "1980" "1981"
[25] "1982" "1983" "1984" "1985" "1986" "1987" "1988" "1989"
[33] "1990" "1991" "1992" "1993" "1994" "1995" "1996" "1997"
[41] "1998" "1999" "2000" "2001" "2002" "2003" "2004" "2005"
[49] "2006" "2007" "2008" "2009" "2010" "2011" "2012" "2013"
[57] "2014" "2015" "2016" "2017" "2018" "Region" "IncomeGroup"
new_comb_life_expectancy<-rename(sub_comb_life_expectancy,Country_Name='Country Name',Country_Code='Country Code',Income_Class='IncomeGroup')
colnames(new_comb_life_expectancy)
[1] "Country_Name" "Country_Code" "1960" "1961" "1962" "1963" "1964" "1965"
[9] "1966" "1967" "1968" "1969" "1970" "1971" "1972" "1973"
[17] "1974" "1975" "1976" "1977" "1978" "1979" "1980" "1981"
[25] "1982" "1983" "1984" "1985" "1986" "1987" "1988" "1989"
[33] "1990" "1991" "1992" "1993" "1994" "1995" "1996" "1997"
[41] "1998" "1999" "2000" "2001" "2002" "2003" "2004" "2005"
[49] "2006" "2007" "2008" "2009" "2010" "2011" "2012" "2013"
[57] "2014" "2015" "2016" "2017" "2018" "Region" "Income_Class"
Tidying data: Transforming the data frame from wide format to long format.
tidy_comb_life_expectancy<-new_comb_life_expectancy%>%gather(key="Year",value = "Life_Expectancy",3:61)
head(tidy_comb_life_expectancy)
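What the wide-to-long reshape does can be shown by hand on a tiny table. The following is a base R sketch with made-up values, not the tidyr implementation; it stacks the year columns in the same order (all rows for the first year, then the next). Note that current tidyr documentation marks gather() as superseded by pivot_longer().

```r
wide <- data.frame(Country_Name = c("Aland", "Borduria"),
                   `1960` = c(50.1, 61.2),
                   `1961` = c(50.4, 61.5),
                   check.names = FALSE, stringsAsFactors = FALSE)

year_cols <- c("1960", "1961")

# Stack the year columns: one row per country-year pair
long <- data.frame(
  Country_Name = rep(wide$Country_Name, times = length(year_cols)),
  Year = rep(year_cols, each = nrow(wide)),
  Life_Expectancy = unlist(wide[year_cols], use.names = FALSE),
  stringsAsFactors = FALSE
)
print(long)
```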
Creating a new variable: we first rename 1960 as Year_1960 and 2017 as Year_2017 and keep only the key variables (Country_Name, Country_Code, Region, Income_Class, Year_1960, and Year_2017). We then use the mutate() function to compute Difference, the change in average life expectancy between 1960 and 2017, and drop the two year columns.
filter_life_expectancy<-rename(new_comb_life_expectancy,Year_1960='1960',Year_2017='2017')%>%select(Country_Name,Country_Code,Region,Income_Class,Year_1960,Year_2017)
new_life_expectancy<-mutate(filter_life_expectancy,Difference=Year_2017-Year_1960)%>%select(-"Year_1960",-"Year_2017")
head(new_life_expectancy)
We first identify the missing values in the column ‘Life_Expectancy’:
scan_comb_life_expectancy<-tidy_comb_life_expectancy%>%subset(is.na(Life_Expectancy))
head(scan_comb_life_expectancy)
We then count the missing values in the column ‘Life_Expectancy’ for each country:
na_life_expectancy<-scan_comb_life_expectancy%>%group_by(Country_Name)%>%summarise(NA_Count=sum(is.na(Life_Expectancy)))
na_life_expectancy
We repeat similar steps to identify missing values in the other variables (Country_Name, Country_Code, Year, and Region):
scan_country_name<-tidy_comb_life_expectancy%>%subset(is.na(Country_Name))
head(scan_country_name)
scan_country_code<-tidy_comb_life_expectancy%>%subset(is.na(Country_Code))
head(scan_country_code)
scan_year<-tidy_comb_life_expectancy%>%subset(is.na(Year))
head(scan_year)
scan_region<-tidy_comb_life_expectancy%>%subset(is.na(Region))
head(scan_region)
na_region<-scan_region%>%group_by(Country_Name)%>%summarise(NA_Count=sum(is.na(Region)))
na_region
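The per-column scans can be condensed into a single call. The sketch below uses a made-up data frame `df`. The structure output earlier lists an empty-string level for Region, so a blank check may be worth adding; whether blanks should count as missing is an assumption here.

```r
df <- data.frame(Country_Name = c("Aland", "Borduria", "World"),
                 Region = c("Europe", NA, ""),
                 Life_Expectancy = c(70.2, NA, 68.9),
                 stringsAsFactors = FALSE)

# NA count for every column in one call
na_counts <- colSums(is.na(df))
print(na_counts)

# Empty strings are not NA, so is.na() misses them; recode first if they mean "missing"
df$Region[df$Region %in% ""] <- NA
na_counts_recoded <- colSums(is.na(df))
print(na_counts_recoded)
```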
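We drop the ‘Not classified’ observations, as they are irrelevant, and replace the remaining NAs in ‘Life_Expectancy’ with the value from the adjacent year within the same country (filling down, then up). Rows that still contain missing values are then removed, and the dataset is re-checked.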
clean_life_expectancy<-tidy_comb_life_expectancy[!tidy_comb_life_expectancy$Country_Name=="Not classified",]
clean_life_expectancy<-clean_life_expectancy%>%group_by(Country_Name)%>%fill(Life_Expectancy,.direction = "down")
clean_life_expectancy<-clean_life_expectancy%>%group_by(Country_Name)%>%fill(Life_Expectancy,.direction = "up")
total_clean_life_expectancy<-clean_life_expectancy[complete.cases(clean_life_expectancy),]
total_clean_life_expectancy
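The fill-down-then-up logic can be written out in base R to make the behaviour explicit. This is a sketch of the idea on made-up values, not the tidyr implementation; `fill_down`, `fill_up`, and `toy` are hypothetical names.

```r
# Last observation carried forward for one vector
fill_down <- function(x) {
  idx <- cumsum(!is.na(x))          # index of the most recent non-NA value
  c(NA, x[!is.na(x)])[idx + 1]      # idx 0 (leading NAs) maps to NA
}
# Next observation carried backward: reverse, fill down, reverse again
fill_up <- function(x) rev(fill_down(rev(x)))

toy <- data.frame(Country_Name = rep(c("Aland", "Borduria"), each = 4),
                  Year = rep(1960:1963, times = 2),
                  Life_Expectancy = c(NA, 50.4, NA, 51.0, 61.2, NA, NA, NA),
                  stringsAsFactors = FALSE)

# ave() applies the function separately within each country
toy$Filled <- ave(toy$Life_Expectancy, toy$Country_Name,
                  FUN = function(x) fill_up(fill_down(x)))
print(toy)
```

Filling within each country prevents a value from one country leaking into another. A country with no observed values at all stays NA and is only removed by the later complete-case filter.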
Checking special values in the column ‘Life_Expectancy’:
# TRUE for non-finite numbers (NA, NaN, Inf, -Inf); for other types, TRUE for NA
is.special <- function(x){
  if (is.numeric(x)) !is.finite(x) else is.na(x)
}
a <- is.special(total_clean_life_expectancy$Life_Expectancy)
# Number of special values found
sum(a)
[1] 0
Scanning outliers: as ‘Life_Expectancy’ is the only numerical variable in the dataset, we use z-scores to identify possible outliers.
z.scores <- total_clean_life_expectancy$Life_Expectancy %>% scores(type = "z")
z.scores %>% summary()
Min. 1st Qu. Median Mean 3rd Qu. Max.
-4.0200 -0.6640 0.2839 0.0000 0.7412 1.8148
which( abs(z.scores) >3 )
[1] 1785 1786 1787 1788 1789 1790 1791 6963 6964 6965 6966 6967 6968 6969 6970 9177 9178 9179 9180 9618
total_outliers<- total_clean_life_expectancy[which( abs(z.scores) >3 ),] %>% count(Country_Name)
total_outliers
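The z-score rule itself is short enough to compute by hand, which should match what scores(type = "z") returns. Below is a minimal sketch on made-up values with one deliberately extreme entry; the vector `x` is an assumption.

```r
# Made-up values: 30 typical observations plus one extreme low value
x <- c(rep(c(60, 65, 70), times = 10), 20)

# z-score: distance from the mean in units of the sample standard deviation
z <- (x - mean(x)) / sd(x)

# Flag observations more than 3 standard deviations from the mean
outlier_idx <- which(abs(z) > 3)
print(outlier_idx)
print(round(z[outlier_idx], 2))
```

Because the mean and standard deviation are themselves pulled towards extreme values, the 3-sigma cut-off is a screening rule rather than a definitive test.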
Cambodia, Mali, Rwanda and Sierra Leone appear as outliers in the dataset, which this report attributes to their small populations.
A separate analysis of these countries would therefore be needed to get more accurate results. Cambodia, Mali, Rwanda and Sierra Leone are excluded from the data frame:
final_life_expectancy <- total_clean_life_expectancy %>% filter(!Country_Name %in% c("Cambodia", "Mali", "Rwanda", "Sierra Leone"))
final_life_expectancy
In addition, outliers in the ‘Life_Expectancy’ variable were checked for each country separately. The box plot below is for a sample country, Afghanistan.
Similar steps were applied to identify outliers in the other countries.
clean_afghanistan<- final_life_expectancy %>% filter(Country_Name=="Afghanistan")
plot <- ggplot(data = clean_afghanistan,
               aes(x = Country_Name, y = Life_Expectancy))
plot <- plot +
  geom_boxplot(outlier.colour = "red", outlier.shape = 8, outlier.size = 4, fill = "darkseagreen3") +
  theme(legend.position = "none",
        plot.title = element_text(lineheight = 1, face = "bold", size = 10),
        axis.text.x = element_text(hjust = 0.5, vjust = 0.5, size = 8),
        axis.text.y = element_text(hjust = 0.5, vjust = 0.5, angle = 90, size = 8),
        axis.title = element_text(hjust = 0.5, vjust = 0.5, size = 10),
        legend.title = element_blank()) +
  # y maps to Life_Expectancy, so it is labelled accordingly; labs() sets the title via title =
  labs(x = "Country Name", y = "Life expectancy (years)",
       title = "Boxplot of Life Expectancy in Afghanistan Over the Years") +
  coord_flip()
plot
Based on the above result, we conclude that no outliers are identified for Afghanistan.
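The rule a box plot uses to mark outliers can also be checked numerically. The sketch below applies the 1.5 × IQR convention to made-up values using base R; geom_boxplot() follows the same convention, although its quartiles can differ slightly from the base R hinges. The vector `values` is an assumption.

```r
# Made-up yearly values for one country, with one deliberately extreme entry
values <- c(48, 50, 51, 52, 53, 54, 55, 56, 57, 20)

# boxplot.stats() returns the points a base R box plot would draw as outliers
stats <- boxplot.stats(values)
print(stats$out)

# The same fences computed by hand from the lower and upper hinges
hinges <- fivenum(values)[c(2, 4)]
fences <- c(hinges[1] - 1.5 * diff(hinges), hinges[2] + 1.5 * diff(hinges))
flagged <- values[values < fences[1] | values > fences[2]]
print(flagged)
```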
A data transformation is applied to the ‘Life_Expectancy’ variable. We first check its distribution with a histogram:
hist(final_life_expectancy$Life_Expectancy,
border="black",col="lightcyan",cex.main=0.75,cex.axis=0.6,cex.lab=0.75)
As the distribution is left-skewed, we apply a Box-Cox transformation to make the skewed distribution more symmetric.
boxcox_life_expectancy<-BoxCox(final_life_expectancy$Life_Expectancy,lambda="auto")
hist(boxcox_life_expectancy,border="black",col="lightskyblue1",cex.main=0.75,cex.axis=0.6,cex.lab=0.75)
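For reference, the Box-Cox transform for a fixed lambda is (x^lambda - 1) / lambda, or log(x) when lambda is 0. The following sketch applies it to made-up left-skewed values and compares skewness before and after. The report lets BoxCox() choose lambda automatically; the fixed lambda = 2 here is purely illustrative, and `box_cox` and `skewness` are hypothetical helper names.

```r
# Box-Cox transform for a fixed lambda (positive inputs only)
box_cox <- function(x, lambda) {
  stopifnot(all(x > 0))
  if (abs(lambda) < 1e-8) log(x) else (x^lambda - 1) / lambda
}

# Sample skewness: third central moment over the 1.5 power of the second
skewness <- function(x) mean((x - mean(x))^3) / (mean((x - mean(x))^2))^1.5

# Made-up left-skewed values
x <- c(30, 45, 55, 60, 64, 67, 69, 71, 72, 73, 74, 75)

skew_raw <- skewness(x)
skew_bc <- skewness(box_cox(x, lambda = 2))   # lambda > 1 stretches the upper tail
print(c(raw = skew_raw, transformed = skew_bc))
```

A skewness closer to zero after the transform indicates a more symmetric distribution. It does not by itself show that the result is normally distributed, so a histogram or Q-Q plot of the transformed values is still worth inspecting.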