MATH2349 Semester 1, 2020

Required packages

Package readr - Used to read and load csv files on our RStudio.
Package dplyr - Used to manipulate data faster. Provides simple shortcuts to summarise, rearrange data, etc
Package tidyr - Used to convert your dataset into a tidy fromat. Helps in changing dataset layout, etc.
Package Hmisc - Used to handle na and missing values
Package forecast - Used for transformations
Package outliers - Use to detect and handle outliers

library(readr)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(tidyr)
library(Hmisc)

## Loading required package: lattice

## Loading required package: survival

## Loading required package: Formula

## Loading required package: ggplot2

## 
## Attaching package: 'Hmisc'

## The following objects are masked from 'package:dplyr':
## 
##     src, summarize

## The following objects are masked from 'package:base':
## 
##     format.pval, units

library(forecast)

## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo

library(outliers)

Executive Summary

We have collected two datasets. One is a tidy dataset and the other one is an untidy dataset. In our project, we would be merging the two datatsets using left join. Before we perform left join, we have to perform a few steps for tiding the data. We first select only those columns from both the datasets that we’ve to work on. Once we have done that, we need to check if the columns whith which we will connect two datasets have similar names. If not, then replace it with a similar name. The untidy data was in a wide format, so we converted it into a long format using the gather function and giving key and values. Once we have both the datasets tidy, we are all ready to merge them. youcan merge them using join functions. We have used left join to merge the data. As there was no data from 1995-1999 and 2016-2019, we discarded them using filter function. NAs in the country name column were dropped as if there cannot be data for no country name. There were a few columns with incorrect datatypes that needed to be corrected and a column which had to be converted to factor. On doing all the above things, we have performed data preprocessing properly and now we can head to the other parts of the project.

Data

For our study, we searched for data on the internet and found a few websites from where datasets can be created and downloaded. We create our untidy dataset from world bank website and select a tidy dataset from kaggle. we downloaded a GDP(current US$) dataset of all the countries from 1995-2019. The data downloaded here is in an untidy format. We then found a Life expectancy dataset from kaggle which consists of country wise data of health status of people and many other factors related to life expectancy. The data recorded is between 2000-2015. We want to check the relation between the world GDP and Life expectancy of people in those countries in those particular years.

The variables in the GDP(current US$) are as follows:

Series Name = Factor (GDP current US$) selected by us while creating the dataset.
Series Code = code for that particular series.
Country Name = Name of the country.
Country Code = Country Code
1995 [YR1995] - 2019 [YR2019] = Year of occurance.

All the columns in the GDP dataset were of datatype character.

The variables in Life expectancy table are as follows:

Country = Namr of the country
Year = Year of occurance.
Status = Country is a Developed or Developing.
Life expectancy` = Life expectancy average age.
Adult Mortality` = Adult Mortality Rate per 1000 people
infant deaths` = no of infant deaths per 1000 people
Alcohol = Alcohol consumption(liters)
percentage expenditure` = Spending on Health.
Hepatitis B` = Immunization against Hepatitis B for children of 1 year
Measles = reported cases per 1000 people
BMI = Average Body Mass Index
Polio = No. of polio cases reported per 1000 people
Total expenditure` = total expenditure in that year
Diphtheria = no. of diptheria cases recorded per 1000 people
HIV/AIDS` = no. of HIV/AIDS cases recorded per 1000 people
GDP = GDP of that country
Population = Population count.

Only Country and Status were of character datatype rest all the other variables were of datatype double in the Life Expectancy dataset.

First we imported the tow datasets onto our RStudio and performed head and view operation on it. Head will give you first 6 rows of the data and view will help yo view the entire dataset in a new tab. In the GDP dataset, we need all the rows and only contry Name, Year and Number columns, hence we select only three columns. Similarly in the Life expectancy we need all the rows and only Country Name, Year, Life Expectancy, Infant Deaths and Population, hence we select only those coulmns and discard the rest. We even had to rename the columns in GDP dataset as we have to merge the two datasets and need a key value to do it.

gdp <- read_csv("E:/r programs/lectures/assignment 2/Data_Extract_From_World_Development_Indicators (2)/gdp.csv")

## Parsed with column specification:
## cols(
##   .default = col_character()
## )

## See spec(...) for full column specifications.

head(gdp)

Life_Expectancy_Data <- read_csv("E:/r programs/lectures/assignment 2/Data_Extract_From_World_Development_Indicators (2)/Life Expectancy Data.csv")

## Parsed with column specification:
## cols(
##   .default = col_double(),
##   Country = col_character(),
##   Status = col_character()
## )
## See spec(...) for full column specifications.

head(Life_Expectancy_Data)

Life_Expectancy_Data1 <- Life_Expectancy_Data[,c(1:4,6,18)]
head(Life_Expectancy_Data1)

gdp1 <- gdp[,c(3,5:28)]
head(gdp1)

gdp1 <- rename(gdp1,"Country Name"="Country Name")
gdp1 <- rename(gdp1,"1995"="1995 [YR1995]")
gdp1 <- rename(gdp1,"1996"="1996 [YR1996]")
gdp1 <- rename(gdp1,"1997"="1997 [YR1997]")
gdp1 <- rename(gdp1,"1998"="1998 [YR1998]")
gdp1 <- rename(gdp1,"1999"="1999 [YR1999]")
gdp1 <- rename(gdp1,"2000"="2000 [YR2000]")
gdp1 <- rename(gdp1,"2001"="2001 [YR2001]")
gdp1 <- rename(gdp1,"2002"="2002 [YR2002]")
gdp1 <- rename(gdp1,"2003"="2003 [YR2003]")
gdp1 <- rename(gdp1,"2004"="2004 [YR2004]")
gdp1 <- rename(gdp1,"2005"="2005 [YR2005]")
gdp1 <- rename(gdp1,"2006"="2006 [YR2006]")
gdp1 <- rename(gdp1,"2007"="2007 [YR2007]")
gdp1 <- rename(gdp1,"2008"="2008 [YR2008]")
gdp1 <- rename(gdp1,"2009"="2009 [YR2009]")
gdp1 <- rename(gdp1,"2010"="2010 [YR2010]")
gdp1 <- rename(gdp1,"2011"="2011 [YR2011]")
gdp1 <- rename(gdp1,"2012"="2012 [YR2012]")
gdp1 <- rename(gdp1,"2013"="2013 [YR2013]")
gdp1 <- rename(gdp1,"2014"="2014 [YR2014]")
gdp1 <- rename(gdp1,"2015"="2015 [YR2015]")
gdp1 <- rename(gdp1,"2016"="2016 [YR2016]")
gdp1 <- rename(gdp1,"2017"="2017 [YR2017]")
gdp1 <- rename(gdp1,"2018"="2018 [YR2018]")

head(gdp1)

Tidy & Manipulate Data I

As our GDP data is in an untidy format, we need to tidy it first before we perform any operations. The GDP dataset is untidy because it has column values(1995-2019) as column names and is in a wide format. We need to transform it into a long format by using the gather(). This will help the data look more presentable and easier to read and understand. To use gather function, we set the key value to Year and value as GDP. Once we do this, our data will be tidy and data will be displayed ina long format.

We then use left join to merge both the dataset with the help of Country Name an Year. We selected Country name and Year because there were two factors that were common between the two datasets.

We then perform filter function to collect data oly from 2000-2015 as the Life expectancy dataset is between that time span. The other Values from 1995-1999 and 2016-2019 would all be NAs if we did not filter them out. Hence we decided to work only on the timespan between 2000-2015. We enen dropped NAs from the country column because iF the country Name isnt present, it doesn’t make any sence to have data for a nameless country.

gdp2 <- gdp1 %>% gather(key="Year", value="GDP", c(`1995`:`2018`))
head(gdp2)

typeof(Life_Expectancy_Data1$Year)

## [1] "double"

Life_Expectancy_Data1$Year <- as.character(Life_Expectancy_Data1$Year)
typeof(Life_Expectancy_Data1$Year)

## [1] "character"

working_data <- left_join(gdp2,Life_Expectancy_Data1, by = c("Country Name"="Country","Year"="Year"))
head(working_data)

working_data <- working_data %>% filter(Year=="2000"|Year=="2001"|Year=="2002"|Year=="2003"|Year=="2004"|Year=="2005"|Year=="2006"|Year=="2007"|Year=="2008"|Year=="2009"|Year=="2010"|Year=="2011"|Year=="2012"|Year=="2013"|Year=="2014"|Year=="2015")
head(working_data)

working_data <- working_data %>% drop_na(`Country Name`)
head(working_data)

Understand

We use the summary function to check the summary statistics of the datasets. Summary statistics include values such as min, max,1st and 3rd quantile, mean and median. We then check the structure of all the datasets. We perform the str() to check if the datatypes of variables in the datasets correct or no. We notice that we have some columns with wrong datatypes and correct them. The column GDP was of datatype character, we changed it to numeric using the as.numeric(). We changed this to numeric as all the data present in the column are numeric. In the status column, it was of datatype character, as it had only two values, we decided to convert it into a factor. These change in datatypes will be userfor further data manipluion.

summary(gdp2)

##  Country Name           Year               GDP           
##  Length:5328        Length:5328        Length:5328       
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character

summary(Life_Expectancy_Data1)

##    Country              Year              Status          Life expectancy
##  Length:2938        Length:2938        Length:2938        Min.   :36.30  
##  Class :character   Class :character   Class :character   1st Qu.:63.10  
##  Mode  :character   Mode  :character   Mode  :character   Median :72.10  
##                                                           Mean   :69.22  
##                                                           3rd Qu.:75.70  
##                                                           Max.   :89.00  
##                                                           NA's   :10     
##  infant deaths      Population       
##  Min.   :   0.0   Min.   :3.400e+01  
##  1st Qu.:   0.0   1st Qu.:1.958e+05  
##  Median :   3.0   Median :1.387e+06  
##  Mean   :  30.3   Mean   :1.275e+07  
##  3rd Qu.:  22.0   3rd Qu.:7.420e+06  
##  Max.   :1800.0   Max.   :1.294e+09  
##                   NA's   :652

summary(working_data)

##  Country Name           Year               GDP               Status         
##  Length:3472        Length:3472        Length:3472        Length:3472       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  Life expectancy infant deaths       Population       
##  Min.   :36.30   Min.   :   0.00   Min.   :3.400e+01  
##  1st Qu.:63.10   1st Qu.:   0.00   1st Qu.:1.998e+05  
##  Median :72.20   Median :   3.00   Median :1.420e+06  
##  Mean   :69.34   Mean   :  31.33   Mean   :1.284e+07  
##  3rd Qu.:75.90   3rd Qu.:  21.00   3rd Qu.:7.463e+06  
##  Max.   :89.00   Max.   :1800.00   Max.   :1.294e+09  
##  NA's   :960     NA's   :953       NA's   :1202

str(gdp2)

## Classes 'tbl_df', 'tbl' and 'data.frame':    5328 obs. of  3 variables:
##  $ Country Name: chr  "Afghanistan" "Albania" "Algeria" "American Samoa" ...
##  $ Year        : chr  "1995" "1995" "1995" "1995" ...
##  $ GDP         : chr  ".." "2392764853.42107" "41764052457.8814" ".." ...

str(Life_Expectancy_Data1)

## Classes 'tbl_df', 'tbl' and 'data.frame':    2938 obs. of  6 variables:
##  $ Country        : chr  "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
##  $ Year           : chr  "2015" "2014" "2013" "2012" ...
##  $ Status         : chr  "Developing" "Developing" "Developing" "Developing" ...
##  $ Life expectancy: num  65 59.9 59.9 59.5 59.2 58.8 58.6 58.1 57.5 57.3 ...
##  $ infant deaths  : num  62 64 66 69 71 74 77 80 82 84 ...
##  $ Population     : num  33736494 327582 31731688 3696958 2978599 ...

str(working_data)

## Classes 'tbl_df', 'tbl' and 'data.frame':    3472 obs. of  7 variables:
##  $ Country Name   : chr  "Afghanistan" "Albania" "Algeria" "American Samoa" ...
##  $ Year           : chr  "2000" "2000" "2000" "2000" ...
##  $ GDP            : chr  ".." "3480355188.60063" "54786074940.2073" ".." ...
##  $ Status         : chr  "Developing" "Developing" "Developing" NA ...
##  $ Life expectancy: num  54.8 72.6 71.3 NA NA 45.3 73.6 74.1 72 NA ...
##  $ infant deaths  : num  88 1 21 NA NA 97 0 12 1 NA ...
##  $ Population     : num  293756 38927 3118366 NA NA ...

typeof(working_data$GDP)

## [1] "character"

working_data$GDP <- as.numeric(working_data$GDP)

## Warning: NAs introduced by coercion

typeof(working_data$GDP)

## [1] "double"

typeof(working_data$Status)

## [1] "character"

working_data$Status <- as.factor(working_data$Status)
levels(working_data$Status)

## [1] "Developed"  "Developing"

Tidy & Manipulate Data II

We decided to create a new variable from the existing variables. For that we selected infant deaths and population columns. We selected these two columns as we want to find the percentage of the infact deaths. We use the mutate function to mutate two columns and create the third column where all teh data is saved. Here we mutate Infant deaths with population.We create a new column and name it perc_infant_deaths and save all the calculations in that column. This column will be attached after the last column of the table.

From the summary statistics, we can see that average percentage for infant deaths is 0.02%. The maxinum percentage being 32.866 and minimum number being 0.

working_data <- mutate(working_data,perc_Infant_Deaths = (working_data$`infant deaths`/working_data$Population)*100)
head(working_data)

summary(working_data$perc_Infant_Deaths)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.0000  0.0000  0.0002  0.0292  0.0013 32.8660    1202

Scan I

Scan the data for missing values, special values and obvious errors (i.e. inconsistencies). In this step, you should fulfil the minimum requirement #7. In addition to the R codes and outputs, explain your methodology (i.e. explain why you have chosen that methodology and the actions that you have taken to handle these values) and communicate your results clearly.

We check if our data has missing values or no by using the is.na(). Doing this fo every column will be a long process, hence we used colSums function on is.na function where all the NAs present in each column will be displayed in a tabular format. This view will be easier to understand as we will know how many NAs are present in each column, all under one table.

We decided to use impute function and replace all the NAs with the mean value in the column. As all the values in these columns were numeric, mean or median would be the optimalsolution to replace NAs with.We selected and went ahead with replacing all the missing values with mean valueof that column. There were missing values in the GDP, Status, Life Expectancy, infant deaths and population columns. We replaced all the NAs with mean value of that particular columns. For the status column, we replace all the NA values with mode values because it is a factor.

We notice that after we execute an impute function on any column, the class of that column changes to “impute”. We need to convert this back into numeric datatype using as.numeric fucntion. The reason weconvert it back into numeric format is because we need tohave numeric values to perform further opetation like calculating boxplot,histogram, etc. After doing this, all the missing values are handled.

Now we check for special values in the dataset. We do it by using is.infinite function. In out dataset, there are no special values.

colSums(is.na(working_data))

##       Country Name               Year                GDP             Status 
##                  0                  0                193                953 
##    Life expectancy      infant deaths         Population perc_Infant_Deaths 
##                960                953               1202               1202

working_data$GDP <- impute(working_data$GDP,fun = mean)
class(working_data$GDP)

## [1] "impute"

working_data$GDP <- as.numeric(as.character(working_data$GDP))
class(working_data$GDP)

## [1] "numeric"

working_data$Status <- impute(working_data$Status,fun = mode)

working_data$`Life expectancy` <- impute(working_data$`Life expectancy`,fun = mean)
class(working_data$`Life expectancy`)

## [1] "impute"

working_data$`Life expectancy` <- as.numeric(as.character(working_data$`Life expectancy`))
class(working_data$`Life expectancy`)

## [1] "numeric"

working_data$`infant deaths` <- impute(working_data$`infant deaths`,fun = mean)
class(working_data$`infant deaths`)

## [1] "impute"

working_data$`infant deaths` <- as.numeric(as.character(working_data$`infant deaths`))
class(working_data$`infant deaths`)

## [1] "numeric"

working_data$Population <- impute(working_data$Population,fun = mean)
class(working_data$Population)

## [1] "impute"

working_data$Population <- as.numeric(as.character(working_data$Population))
class(working_data$Population)

## [1] "numeric"

sum(is.infinite(working_data$`Country Name`))

## [1] 0

sum(is.infinite(working_data$Year))

## [1] 0

sum(is.infinite(working_data$GDP))

## [1] 0

sum(is.infinite(working_data$`Life expectancy`))

## [1] 0

sum(is.infinite(working_data$`infant deaths`))

## [1] 0

sum(is.infinite(working_data$Population))

## [1] 0

Scan II

There are many ways to find out wether outliers are present in your dataset or no. We first used boxplot technique where we calculate the boxplot of that particular row. Using boxplot, the values outisde the fences are considered to be as an outlier. For example, if you have a look at the boxplot of Life expectancy, youll see that Values above 85 and below 55 are considered as outliers as they are outisde the fence boundaries.

We can even use zscore to detect the outliers. If the zscore is greater than 3, it is considered as an outlier. For example when we calculate zscore for Life Expectancy, we get 26 outliers. By using the which function, we can come to know the locations of the outliers. Similarly we calculate zscore for all the columns.

Once the outliers are detected, We need to handle them. We have used imputing function to handle the outliers. Imputng the outliers will replace all the outliers with mean, median or mode values. In out case, we will impute the outliers with mean values. Once we do that, we will check the zscore again. We notice that the number of outliers have increased. This happens because the outliers have been replaced with the mean value, so there will be new mean, median and other statistical values for it and hence the outliers will increase or decrease. If we want a perfect value, this loop needs to be repeated many times. We did it once only. Hence our outliers have been handled.

boxplot(working_data$GDP,main = "GDP",col = "grey")

z_numbers <- working_data$GDP %>% scores(type = "z")
which(abs(z_numbers)>3)

##  [1]   99  207  316  424  533  641  750  858  967 1075 1184 1292 1401 1509 1618
## [16] 1726 1778 1835 1943 1995 2052 2160 2212 2269 2377 2429 2486 2594 2646 2703
## [31] 2811 2863 2920 3028 3080 3112 3137 3245 3297 3354 3462

boxplot(working_data$`Life expectancy`,main = "Life expectancy",col = "grey")

z_LE <- working_data$`Life expectancy` %>% scores(type = "z")
which(abs(z_LE)>3)

##  [1]  120  171  216  337  388  433  554  651  771  822  868  980 1039 1085 1197
## [16] 1256 1302 1473 2255

boxplot(working_data$`infant deaths`,main = "Infant deaths",col = "grey")

z_infant <- working_data$`infant deaths` %>% scores(type = "z")
which(abs(z_infant)>3)

##  [1]   42   90  145  150  259  307  362  367  476  524  579  584  693  741  796
## [16]  801  910  958 1013 1018 1175 1230 1235 1392 1447 1452 1609 1664 1669 1826
## [31] 1881 1886 2043 2098 2103 2260 2315 2320 2477 2532 2537 2694 2749 2754 2911
## [46] 2966 2971 3128 3183 3188 3345 3400

boxplot(working_data$Population,main = "Population",col = "grey")

z_pop <- working_data$Population %>% scores(type = "z")
which(abs(z_pop)>3)

##  [1]   27  307  678  895  958  959 1112 1175 1392 1609 1610 1827 1980 2197 2261
## [16] 2414 2537 2695 2749 2754 2971 3128 3129 3188 3346 3400

working_data$GDP[ which( abs(z_numbers) >3 )] <- mean(working_data$GDP)
z_numbers <- working_data$GDP %>% scores(type = "z")
which(abs(z_numbers)>3)

##  [1]   74  206  291  423  476  503  508  640  693  720  725  748  857  910  937
## [16]  942  965 1074 1127 1154 1159 1182 1291 1344 1371 1376 1399 1508 1555 1561
## [31] 1588 1593 1616 1699 1725 1763 1772 1805 1810 1833 1898 1916 1942 1980 2022
## [46] 2027 2050 2133 2159 2197 2206 2239 2244 2260 2267 2332 2376 2414 2423 2456
## [61] 2461 2477 2484 2549 2567 2593 2615 2631 2640 2673 2678 2694 2701 2766 2810
## [76] 2832 2848 2857 2890 2895 2911 2918 2983 3027 3049 3065 3074 3107 3128 3135
## [91] 3200 3244 3282 3291 3324 3329 3345 3352 3461

working_data$`Life expectancy`[ which( abs(z_LE) >3 )] <- mean(working_data$`Life expectancy`)
z_LE <- working_data$`Life expectancy` %>% scores(type = "z")
which(abs(z_LE)>3)

##  [1]    6   62  255  434  472  650  763  988 1414 1519 1690

working_data$`infant deaths`[ which( abs(z_infant) >3 )] <- mean(working_data$`infant deaths`)
z_infant <- working_data$`infant deaths` %>% scores(type = "z")
which(abs(z_infant)>3)

##  [1]   16   27   65   91  203  233  282  308  420  450  499  525  637  667  716
## [16]  742  854  884  933  959 1101 1127 1150 1176 1318 1344 1367 1393 1535 1561
## [31] 1584 1610 1752 1778 1801 1827 1969 1995 2018 2044 2186 2212 2235 2261 2403
## [46] 2429 2452 2478 2620 2646 2669 2695 2863 2886 2912 3080 3103 3129 3297 3320
## [61] 3346 3405

working_data$Population[ which( abs(z_pop) >3 )] <- mean(working_data$Population)
z_pop <- working_data$Population %>% scores(type = "z")
which(abs(z_pop)>3)

##  [1]   16   65  150  156  162  282  286  291  362  373  410  416  508  584  716
## [16]  720  725  796  813  844  965 1013 1024 1154 1159 1182 1230 1241 1247 1371
## [31] 1376 1399 1495 1535 1593 1647 1712 1718 1801 1826 1833 2018 2043 2115 2146
## [46] 2152 2267 2298 2315 2326 2332 2369 2403 2456 2477 2484 2669 2694 2701 2760
## [61] 2797 2803 2837 2886 2911 2949 3020 3103 3107 3200 3231 3329 3454

Transform

In transformation, we decided to decrease the skewness and convert it into normal distibution. First we plot a histogram to check the skweness of the graph. We see that the graph is heavily skewed to the right. We use BoxCox function to correct the skewness and makes it more understandable. Lambda is set to auto so that it can automatically find and correct the skewness. We then plot the boxcox histogram and see that the histogram has been automatically corrected.

hist(working_data$Population)

boxcox_population <- BoxCox(working_data$Population,lambda = "auto")
hist(boxcox_population)

##References 1. Life Expectancy dataset - https://www.kaggle.com/kumarajarshi/life-expectancy-who 2. GDP(current US$) dataset - https://databank.worldbank.org/source/world-development-indicators#