library(readr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
library(Hmisc)
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## Loading required package: ggplot2
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
##
## src, summarize
## The following objects are masked from 'package:base':
##
## format.pval, units
library(forecast)
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
library(outliers)
We have collected two datasets. One is a tidy dataset and the other one is an untidy dataset. In our project, we would be merging the two datatsets using left join. Before we perform left join, we have to perform a few steps for tiding the data. We first select only those columns from both the datasets that we’ve to work on. Once we have done that, we need to check if the columns whith which we will connect two datasets have similar names. If not, then replace it with a similar name. The untidy data was in a wide format, so we converted it into a long format using the gather function and giving key and values. Once we have both the datasets tidy, we are all ready to merge them. youcan merge them using join functions. We have used left join to merge the data. As there was no data from 1995-1999 and 2016-2019, we discarded them using filter function. NAs in the country name column were dropped as if there cannot be data for no country name. There were a few columns with incorrect datatypes that needed to be corrected and a column which had to be converted to factor. On doing all the above things, we have performed data preprocessing properly and now we can head to the other parts of the project.
For our study, we searched for data on the internet and found a few websites from where datasets can be created and downloaded. We create our untidy dataset from world bank website and select a tidy dataset from kaggle. we downloaded a GDP(current US$) dataset of all the countries from 1995-2019. The data downloaded here is in an untidy format. We then found a Life expectancy dataset from kaggle which consists of country wise data of health status of people and many other factors related to life expectancy. The data recorded is between 2000-2015. We want to check the relation between the world GDP and Life expectancy of people in those countries in those particular years.
The variables in the GDP(current US$) are as follows:
All the columns in the GDP dataset were of datatype character.
The variables in Life expectancy table are as follows:
Only Country and Status were of character datatype rest all the other variables were of datatype double in the Life Expectancy dataset.
First we imported the tow datasets onto our RStudio and performed head and view operation on it. Head will give you first 6 rows of the data and view will help yo view the entire dataset in a new tab. In the GDP dataset, we need all the rows and only contry Name, Year and Number columns, hence we select only three columns. Similarly in the Life expectancy we need all the rows and only Country Name, Year, Life Expectancy, Infant Deaths and Population, hence we select only those coulmns and discard the rest. We even had to rename the columns in GDP dataset as we have to merge the two datasets and need a key value to do it.
gdp <- read_csv("E:/r programs/lectures/assignment 2/Data_Extract_From_World_Development_Indicators (2)/gdp.csv")
## Parsed with column specification:
## cols(
## .default = col_character()
## )
## See spec(...) for full column specifications.
head(gdp)
Life_Expectancy_Data <- read_csv("E:/r programs/lectures/assignment 2/Data_Extract_From_World_Development_Indicators (2)/Life Expectancy Data.csv")
## Parsed with column specification:
## cols(
## .default = col_double(),
## Country = col_character(),
## Status = col_character()
## )
## See spec(...) for full column specifications.
head(Life_Expectancy_Data)
Life_Expectancy_Data1 <- Life_Expectancy_Data[,c(1:4,6,18)]
head(Life_Expectancy_Data1)
gdp1 <- gdp[,c(3,5:28)]
head(gdp1)
gdp1 <- rename(gdp1,"Country Name"="Country Name")
gdp1 <- rename(gdp1,"1995"="1995 [YR1995]")
gdp1 <- rename(gdp1,"1996"="1996 [YR1996]")
gdp1 <- rename(gdp1,"1997"="1997 [YR1997]")
gdp1 <- rename(gdp1,"1998"="1998 [YR1998]")
gdp1 <- rename(gdp1,"1999"="1999 [YR1999]")
gdp1 <- rename(gdp1,"2000"="2000 [YR2000]")
gdp1 <- rename(gdp1,"2001"="2001 [YR2001]")
gdp1 <- rename(gdp1,"2002"="2002 [YR2002]")
gdp1 <- rename(gdp1,"2003"="2003 [YR2003]")
gdp1 <- rename(gdp1,"2004"="2004 [YR2004]")
gdp1 <- rename(gdp1,"2005"="2005 [YR2005]")
gdp1 <- rename(gdp1,"2006"="2006 [YR2006]")
gdp1 <- rename(gdp1,"2007"="2007 [YR2007]")
gdp1 <- rename(gdp1,"2008"="2008 [YR2008]")
gdp1 <- rename(gdp1,"2009"="2009 [YR2009]")
gdp1 <- rename(gdp1,"2010"="2010 [YR2010]")
gdp1 <- rename(gdp1,"2011"="2011 [YR2011]")
gdp1 <- rename(gdp1,"2012"="2012 [YR2012]")
gdp1 <- rename(gdp1,"2013"="2013 [YR2013]")
gdp1 <- rename(gdp1,"2014"="2014 [YR2014]")
gdp1 <- rename(gdp1,"2015"="2015 [YR2015]")
gdp1 <- rename(gdp1,"2016"="2016 [YR2016]")
gdp1 <- rename(gdp1,"2017"="2017 [YR2017]")
gdp1 <- rename(gdp1,"2018"="2018 [YR2018]")
head(gdp1)
As our GDP data is in an untidy format, we need to tidy it first before we perform any operations. The GDP dataset is untidy because it has column values(1995-2019) as column names and is in a wide format. We need to transform it into a long format by using the gather(). This will help the data look more presentable and easier to read and understand. To use gather function, we set the key value to Year and value as GDP. Once we do this, our data will be tidy and data will be displayed ina long format.
We then use left join to merge both the dataset with the help of Country Name an Year. We selected Country name and Year because there were two factors that were common between the two datasets.
We then perform filter function to collect data oly from 2000-2015 as the Life expectancy dataset is between that time span. The other Values from 1995-1999 and 2016-2019 would all be NAs if we did not filter them out. Hence we decided to work only on the timespan between 2000-2015. We enen dropped NAs from the country column because iF the country Name isnt present, it doesn’t make any sence to have data for a nameless country.
gdp2 <- gdp1 %>% gather(key="Year", value="GDP", c(`1995`:`2018`))
head(gdp2)
typeof(Life_Expectancy_Data1$Year)
## [1] "double"
Life_Expectancy_Data1$Year <- as.character(Life_Expectancy_Data1$Year)
typeof(Life_Expectancy_Data1$Year)
## [1] "character"
working_data <- left_join(gdp2,Life_Expectancy_Data1, by = c("Country Name"="Country","Year"="Year"))
head(working_data)
working_data <- working_data %>% filter(Year=="2000"|Year=="2001"|Year=="2002"|Year=="2003"|Year=="2004"|Year=="2005"|Year=="2006"|Year=="2007"|Year=="2008"|Year=="2009"|Year=="2010"|Year=="2011"|Year=="2012"|Year=="2013"|Year=="2014"|Year=="2015")
head(working_data)
working_data <- working_data %>% drop_na(`Country Name`)
head(working_data)
We use the summary function to check the summary statistics of the datasets. Summary statistics include values such as min, max,1st and 3rd quantile, mean and median. We then check the structure of all the datasets. We perform the str() to check if the datatypes of variables in the datasets correct or no. We notice that we have some columns with wrong datatypes and correct them. The column GDP was of datatype character, we changed it to numeric using the as.numeric(). We changed this to numeric as all the data present in the column are numeric. In the status column, it was of datatype character, as it had only two values, we decided to convert it into a factor. These change in datatypes will be userfor further data manipluion.
summary(gdp2)
## Country Name Year GDP
## Length:5328 Length:5328 Length:5328
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
summary(Life_Expectancy_Data1)
## Country Year Status Life expectancy
## Length:2938 Length:2938 Length:2938 Min. :36.30
## Class :character Class :character Class :character 1st Qu.:63.10
## Mode :character Mode :character Mode :character Median :72.10
## Mean :69.22
## 3rd Qu.:75.70
## Max. :89.00
## NA's :10
## infant deaths Population
## Min. : 0.0 Min. :3.400e+01
## 1st Qu.: 0.0 1st Qu.:1.958e+05
## Median : 3.0 Median :1.387e+06
## Mean : 30.3 Mean :1.275e+07
## 3rd Qu.: 22.0 3rd Qu.:7.420e+06
## Max. :1800.0 Max. :1.294e+09
## NA's :652
summary(working_data)
## Country Name Year GDP Status
## Length:3472 Length:3472 Length:3472 Length:3472
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## Life expectancy infant deaths Population
## Min. :36.30 Min. : 0.00 Min. :3.400e+01
## 1st Qu.:63.10 1st Qu.: 0.00 1st Qu.:1.998e+05
## Median :72.20 Median : 3.00 Median :1.420e+06
## Mean :69.34 Mean : 31.33 Mean :1.284e+07
## 3rd Qu.:75.90 3rd Qu.: 21.00 3rd Qu.:7.463e+06
## Max. :89.00 Max. :1800.00 Max. :1.294e+09
## NA's :960 NA's :953 NA's :1202
str(gdp2)
## Classes 'tbl_df', 'tbl' and 'data.frame': 5328 obs. of 3 variables:
## $ Country Name: chr "Afghanistan" "Albania" "Algeria" "American Samoa" ...
## $ Year : chr "1995" "1995" "1995" "1995" ...
## $ GDP : chr ".." "2392764853.42107" "41764052457.8814" ".." ...
str(Life_Expectancy_Data1)
## Classes 'tbl_df', 'tbl' and 'data.frame': 2938 obs. of 6 variables:
## $ Country : chr "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
## $ Year : chr "2015" "2014" "2013" "2012" ...
## $ Status : chr "Developing" "Developing" "Developing" "Developing" ...
## $ Life expectancy: num 65 59.9 59.9 59.5 59.2 58.8 58.6 58.1 57.5 57.3 ...
## $ infant deaths : num 62 64 66 69 71 74 77 80 82 84 ...
## $ Population : num 33736494 327582 31731688 3696958 2978599 ...
str(working_data)
## Classes 'tbl_df', 'tbl' and 'data.frame': 3472 obs. of 7 variables:
## $ Country Name : chr "Afghanistan" "Albania" "Algeria" "American Samoa" ...
## $ Year : chr "2000" "2000" "2000" "2000" ...
## $ GDP : chr ".." "3480355188.60063" "54786074940.2073" ".." ...
## $ Status : chr "Developing" "Developing" "Developing" NA ...
## $ Life expectancy: num 54.8 72.6 71.3 NA NA 45.3 73.6 74.1 72 NA ...
## $ infant deaths : num 88 1 21 NA NA 97 0 12 1 NA ...
## $ Population : num 293756 38927 3118366 NA NA ...
typeof(working_data$GDP)
## [1] "character"
working_data$GDP <- as.numeric(working_data$GDP)
## Warning: NAs introduced by coercion
typeof(working_data$GDP)
## [1] "double"
typeof(working_data$Status)
## [1] "character"
working_data$Status <- as.factor(working_data$Status)
levels(working_data$Status)
## [1] "Developed" "Developing"
We decided to create a new variable from the existing variables. For that we selected infant deaths and population columns. We selected these two columns as we want to find the percentage of the infact deaths. We use the mutate function to mutate two columns and create the third column where all teh data is saved. Here we mutate Infant deaths with population.We create a new column and name it perc_infant_deaths and save all the calculations in that column. This column will be attached after the last column of the table.
From the summary statistics, we can see that average percentage for infant deaths is 0.02%. The maxinum percentage being 32.866 and minimum number being 0.
working_data <- mutate(working_data,perc_Infant_Deaths = (working_data$`infant deaths`/working_data$Population)*100)
head(working_data)
summary(working_data$perc_Infant_Deaths)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0000 0.0000 0.0002 0.0292 0.0013 32.8660 1202
Scan the data for missing values, special values and obvious errors (i.e. inconsistencies). In this step, you should fulfil the minimum requirement #7. In addition to the R codes and outputs, explain your methodology (i.e. explain why you have chosen that methodology and the actions that you have taken to handle these values) and communicate your results clearly.
We check if our data has missing values or no by using the is.na(). Doing this fo every column will be a long process, hence we used colSums function on is.na function where all the NAs present in each column will be displayed in a tabular format. This view will be easier to understand as we will know how many NAs are present in each column, all under one table.
We decided to use impute function and replace all the NAs with the mean value in the column. As all the values in these columns were numeric, mean or median would be the optimalsolution to replace NAs with.We selected and went ahead with replacing all the missing values with mean valueof that column. There were missing values in the GDP, Status, Life Expectancy, infant deaths and population columns. We replaced all the NAs with mean value of that particular columns. For the status column, we replace all the NA values with mode values because it is a factor.
We notice that after we execute an impute function on any column, the class of that column changes to “impute”. We need to convert this back into numeric datatype using as.numeric fucntion. The reason weconvert it back into numeric format is because we need tohave numeric values to perform further opetation like calculating boxplot,histogram, etc. After doing this, all the missing values are handled.
Now we check for special values in the dataset. We do it by using is.infinite function. In out dataset, there are no special values.
colSums(is.na(working_data))
## Country Name Year GDP Status
## 0 0 193 953
## Life expectancy infant deaths Population perc_Infant_Deaths
## 960 953 1202 1202
working_data$GDP <- impute(working_data$GDP,fun = mean)
class(working_data$GDP)
## [1] "impute"
working_data$GDP <- as.numeric(as.character(working_data$GDP))
class(working_data$GDP)
## [1] "numeric"
working_data$Status <- impute(working_data$Status,fun = mode)
working_data$`Life expectancy` <- impute(working_data$`Life expectancy`,fun = mean)
class(working_data$`Life expectancy`)
## [1] "impute"
working_data$`Life expectancy` <- as.numeric(as.character(working_data$`Life expectancy`))
class(working_data$`Life expectancy`)
## [1] "numeric"
working_data$`infant deaths` <- impute(working_data$`infant deaths`,fun = mean)
class(working_data$`infant deaths`)
## [1] "impute"
working_data$`infant deaths` <- as.numeric(as.character(working_data$`infant deaths`))
class(working_data$`infant deaths`)
## [1] "numeric"
working_data$Population <- impute(working_data$Population,fun = mean)
class(working_data$Population)
## [1] "impute"
working_data$Population <- as.numeric(as.character(working_data$Population))
class(working_data$Population)
## [1] "numeric"
sum(is.infinite(working_data$`Country Name`))
## [1] 0
sum(is.infinite(working_data$Year))
## [1] 0
sum(is.infinite(working_data$GDP))
## [1] 0
sum(is.infinite(working_data$`Life expectancy`))
## [1] 0
sum(is.infinite(working_data$`infant deaths`))
## [1] 0
sum(is.infinite(working_data$Population))
## [1] 0
There are many ways to find out wether outliers are present in your dataset or no. We first used boxplot technique where we calculate the boxplot of that particular row. Using boxplot, the values outisde the fences are considered to be as an outlier. For example, if you have a look at the boxplot of Life expectancy, youll see that Values above 85 and below 55 are considered as outliers as they are outisde the fence boundaries.
We can even use zscore to detect the outliers. If the zscore is greater than 3, it is considered as an outlier. For example when we calculate zscore for Life Expectancy, we get 26 outliers. By using the which function, we can come to know the locations of the outliers. Similarly we calculate zscore for all the columns.
Once the outliers are detected, We need to handle them. We have used imputing function to handle the outliers. Imputng the outliers will replace all the outliers with mean, median or mode values. In out case, we will impute the outliers with mean values. Once we do that, we will check the zscore again. We notice that the number of outliers have increased. This happens because the outliers have been replaced with the mean value, so there will be new mean, median and other statistical values for it and hence the outliers will increase or decrease. If we want a perfect value, this loop needs to be repeated many times. We did it once only. Hence our outliers have been handled.
boxplot(working_data$GDP,main = "GDP",col = "grey")
z_numbers <- working_data$GDP %>% scores(type = "z")
which(abs(z_numbers)>3)
## [1] 99 207 316 424 533 641 750 858 967 1075 1184 1292 1401 1509 1618
## [16] 1726 1778 1835 1943 1995 2052 2160 2212 2269 2377 2429 2486 2594 2646 2703
## [31] 2811 2863 2920 3028 3080 3112 3137 3245 3297 3354 3462
boxplot(working_data$`Life expectancy`,main = "Life expectancy",col = "grey")
z_LE <- working_data$`Life expectancy` %>% scores(type = "z")
which(abs(z_LE)>3)
## [1] 120 171 216 337 388 433 554 651 771 822 868 980 1039 1085 1197
## [16] 1256 1302 1473 2255
boxplot(working_data$`infant deaths`,main = "Infant deaths",col = "grey")
z_infant <- working_data$`infant deaths` %>% scores(type = "z")
which(abs(z_infant)>3)
## [1] 42 90 145 150 259 307 362 367 476 524 579 584 693 741 796
## [16] 801 910 958 1013 1018 1175 1230 1235 1392 1447 1452 1609 1664 1669 1826
## [31] 1881 1886 2043 2098 2103 2260 2315 2320 2477 2532 2537 2694 2749 2754 2911
## [46] 2966 2971 3128 3183 3188 3345 3400
boxplot(working_data$Population,main = "Population",col = "grey")
z_pop <- working_data$Population %>% scores(type = "z")
which(abs(z_pop)>3)
## [1] 27 307 678 895 958 959 1112 1175 1392 1609 1610 1827 1980 2197 2261
## [16] 2414 2537 2695 2749 2754 2971 3128 3129 3188 3346 3400
working_data$GDP[ which( abs(z_numbers) >3 )] <- mean(working_data$GDP)
z_numbers <- working_data$GDP %>% scores(type = "z")
which(abs(z_numbers)>3)
## [1] 74 206 291 423 476 503 508 640 693 720 725 748 857 910 937
## [16] 942 965 1074 1127 1154 1159 1182 1291 1344 1371 1376 1399 1508 1555 1561
## [31] 1588 1593 1616 1699 1725 1763 1772 1805 1810 1833 1898 1916 1942 1980 2022
## [46] 2027 2050 2133 2159 2197 2206 2239 2244 2260 2267 2332 2376 2414 2423 2456
## [61] 2461 2477 2484 2549 2567 2593 2615 2631 2640 2673 2678 2694 2701 2766 2810
## [76] 2832 2848 2857 2890 2895 2911 2918 2983 3027 3049 3065 3074 3107 3128 3135
## [91] 3200 3244 3282 3291 3324 3329 3345 3352 3461
working_data$`Life expectancy`[ which( abs(z_LE) >3 )] <- mean(working_data$`Life expectancy`)
z_LE <- working_data$`Life expectancy` %>% scores(type = "z")
which(abs(z_LE)>3)
## [1] 6 62 255 434 472 650 763 988 1414 1519 1690
working_data$`infant deaths`[ which( abs(z_infant) >3 )] <- mean(working_data$`infant deaths`)
z_infant <- working_data$`infant deaths` %>% scores(type = "z")
which(abs(z_infant)>3)
## [1] 16 27 65 91 203 233 282 308 420 450 499 525 637 667 716
## [16] 742 854 884 933 959 1101 1127 1150 1176 1318 1344 1367 1393 1535 1561
## [31] 1584 1610 1752 1778 1801 1827 1969 1995 2018 2044 2186 2212 2235 2261 2403
## [46] 2429 2452 2478 2620 2646 2669 2695 2863 2886 2912 3080 3103 3129 3297 3320
## [61] 3346 3405
working_data$Population[ which( abs(z_pop) >3 )] <- mean(working_data$Population)
z_pop <- working_data$Population %>% scores(type = "z")
which(abs(z_pop)>3)
## [1] 16 65 150 156 162 282 286 291 362 373 410 416 508 584 716
## [16] 720 725 796 813 844 965 1013 1024 1154 1159 1182 1230 1241 1247 1371
## [31] 1376 1399 1495 1535 1593 1647 1712 1718 1801 1826 1833 2018 2043 2115 2146
## [46] 2152 2267 2298 2315 2326 2332 2369 2403 2456 2477 2484 2669 2694 2701 2760
## [61] 2797 2803 2837 2886 2911 2949 3020 3103 3107 3200 3231 3329 3454
In transformation, we decided to decrease the skewness and convert it into normal distibution. First we plot a histogram to check the skweness of the graph. We see that the graph is heavily skewed to the right. We use BoxCox function to correct the skewness and makes it more understandable. Lambda is set to auto so that it can automatically find and correct the skewness. We then plot the boxcox histogram and see that the histogram has been automatically corrected.
hist(working_data$Population)
boxcox_population <- BoxCox(working_data$Population,lambda = "auto")
hist(boxcox_population)
##References 1. Life Expectancy dataset - https://www.kaggle.com/kumarajarshi/life-expectancy-who 2. GDP(current US$) dataset - https://databank.worldbank.org/source/world-development-indicators#