Packages Used: 1. readr :- This package is used to read the data from csv file to RStudio. 2. dplyr :- This package is used for fast data manipulation. It provides functions to summarise, rearrange data, etc. 3. tidyr :- This package is used to tidy the dataset. It can be used to change the layout of the dataset. 4. hmisc :- This is used to handle the na and missing values. 5. forecast :- This package is for data transformation. 6. outliners :- This package is used to detect and handle outliers.
library(readr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
#install.packages('forecast')
library(forecast)
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
#install.packages('Hmisc')
library(Hmisc)
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## Loading required package: ggplot2
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
##
## src, summarize
## The following objects are masked from 'package:base':
##
## format.pval, units
#install.packages('outliers')
library(outliers)
We have selected two datasets. One of them informs us about the life expectancy. The life expectancy dataset is a tidy dataset. The dataset follows the principles of tidy data. The second dataset is about the percentage of employers in the job markets of various countries. The second dataset is untidy. The data is untidy as the data over the years is populated in different columns. We will merge these two datasets using the join function. We have to perform some steps to tidy the data. The two datasets have a column specifying the country name. We can join the two tables by using “country” column. We keep the name of the country column same. The untidy data was in the wide format. The gather function was used to convert the untidy data to tidy format. Then the datasets were merged. We have selected some columns using the filter function. The NA values were removed. The columns with incorrect datatypes were corrected. The pre-processing tasks were performed and now we can move ahead with the assignment.
We have selected two datasets for this project. One is tidy and the other is untidy. The tidy data set is about life expectancy. The attributes are as follows: - 1. Country - Name of the country 2. Year - Year of occurance. 3. Status - Developed or Developing country. 4. Life expectancy - Life expectancy average age. 6. Adult Mortality - Adult Mortality Rate / 1000 people 7. infant deaths - Number of infant deaths / 1000 people 8. Alcohol - Alcohol consumption (liters) 9. percentage expenditure - Spending on the health sector 10. Hepatitis B - Immunity against Hepatitis B in children under 1 year of age 11. Measles - Measels cases / 1000 people 12. BMI = Average Body Mass Index 13. Polio = Number of polio cases / 1000 people 14. Total expenditure = Total yearly expenditure 15. Diphtheria = Number of diptheria cases / 1000 people 16. HIV/AIDS = Number of HIV/AIDS cases / 1000 people 17. GDP = GDP of the country 18. Population = Population of the country
Only Country and Status had character as their datatype. The rest columns are of datatype double.
The second dataset indicates the percent employers in the global job market. It is an untidy dataset. This dataset was taken from the www.worlddatabank.org The attributes are as follows: - 1. Series Name - Factor (Employers, total (% of total employment) (modeled ILO estimate)) 2. Series Code - Code for that particular series. 3. Country Name - Name of the country. 4. Country Code - Country Code 5. 2007 [YR2007] - 2016 [YR2016] = Year of occurance
library(readxl)
jobs <- read_excel("~/Master of Data Science (RMIT)/Semester 2/Data Wrangling/Assignments/Assignment 2/Assignment 2_2/Data_Extract_From_Jobs.xlsx")
#View(jobs)
head(jobs)
life_expectancy <- read_csv("~/Master of Data Science (RMIT)/Semester 2/Data Wrangling/Assignments/Assignment 2/Assignment 2_2/Life Expectancy Data.csv")
## Parsed with column specification:
## cols(
## .default = col_double(),
## Country = col_character(),
## Status = col_character()
## )
## See spec(...) for full column specifications.
#View(life_expectancy)
head(life_expectancy)
jobs_filtered <- jobs[,c(3,5:14)]
life_expectancy_filtered <- life_expectancy[,c(1:4,6,18)]
jobs_filtered <- rename(jobs_filtered,"country"="Country Name")
jobs_filtered <- rename(jobs_filtered,"2007"="2007 [YR2007]")
jobs_filtered <- rename(jobs_filtered,"2008"="2008 [YR2008]")
jobs_filtered <- rename(jobs_filtered,"2009"="2009 [YR2009]")
jobs_filtered <- rename(jobs_filtered,"2010"="2010 [YR2010]")
jobs_filtered <- rename(jobs_filtered,"2011"="2011 [YR2011]")
jobs_filtered <- rename(jobs_filtered,"2012"="2012 [YR2012]")
jobs_filtered <- rename(jobs_filtered,"2013"="2013 [YR2013]")
jobs_filtered <- rename(jobs_filtered,"2014"="2014 [YR2014]")
jobs_filtered <- rename(jobs_filtered,"2015"="2015 [YR2015]")
jobs_filtered <- rename(jobs_filtered,"2016"="2016 [YR2016]")
head(jobs_filtered)
life_expectancy_filtered <- rename(life_expectancy_filtered,"country"="Country")
life_expectancy_filtered <- rename(life_expectancy_filtered,"year"="Year")
life_expectancy_filtered <- rename(life_expectancy_filtered,"status"="Status")
life_expectancy_filtered <- rename(life_expectancy_filtered,"life_expectancy"="Life expectancy")
life_expectancy_filtered <- rename(life_expectancy_filtered,"infant_deaths"="infant deaths")
life_expectancy_filtered <- rename(life_expectancy_filtered,"population"="Population")
head(life_expectancy_filtered)
jobs_filtered$`2007`[jobs_filtered$`2007` == ".." ] <- NA
jobs_filtered$`2008`[jobs_filtered$`2008` == ".." ] <- NA
jobs_filtered$`2009`[jobs_filtered$`2009` == ".." ] <- NA
jobs_filtered$`2010`[jobs_filtered$`2010` == ".." ] <- NA
jobs_filtered$`2011`[jobs_filtered$`2011` == ".." ] <- NA
jobs_filtered$`2012`[jobs_filtered$`2012` == ".." ] <- NA
jobs_filtered$`2013`[jobs_filtered$`2013` == ".." ] <- NA
jobs_filtered$`2014`[jobs_filtered$`2014` == ".." ] <- NA
jobs_filtered$`2015`[jobs_filtered$`2015` == ".." ] <- NA
jobs_filtered$`2016`[jobs_filtered$`2016` == ".." ] <- NA
We have used the summary function to understand the central tendencies of all the columns in the data which have numeric datatypes. We check the datatypes of all the columns and convert some datatypes from the jobs dataset that need converting. The datatypes of the years are converted from character to numeric in the jobs dataset.
summary(jobs_filtered)
## country 2007 2008 2009
## Length:242 Length:242 Length:242 Length:242
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
## 2010 2011 2012 2013
## Length:242 Length:242 Length:242 Length:242
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
## 2014 2015 2016
## Length:242 Length:242 Length:242
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
summary(life_expectancy_filtered)
## country year status life_expectancy
## Length:2938 Min. :2000 Length:2938 Min. :36.30
## Class :character 1st Qu.:2004 Class :character 1st Qu.:63.10
## Mode :character Median :2008 Mode :character Median :72.10
## Mean :2008 Mean :69.22
## 3rd Qu.:2012 3rd Qu.:75.70
## Max. :2015 Max. :89.00
## NA's :10
## infant_deaths population
## Min. : 0.0 Min. :3.400e+01
## 1st Qu.: 0.0 1st Qu.:1.958e+05
## Median : 3.0 Median :1.387e+06
## Mean : 30.3 Mean :1.275e+07
## 3rd Qu.: 22.0 3rd Qu.:7.420e+06
## Max. :1800.0 Max. :1.294e+09
## NA's :652
typeof(life_expectancy_filtered$country)
## [1] "character"
typeof(life_expectancy_filtered$year)
## [1] "double"
typeof(life_expectancy_filtered$status)
## [1] "character"
typeof(life_expectancy_filtered$life_expectancy)
## [1] "double"
typeof(life_expectancy_filtered$infant_deaths)
## [1] "double"
typeof(life_expectancy_filtered$population)
## [1] "double"
typeof(jobs_filtered$country)
## [1] "character"
typeof(jobs_filtered$`2007`)
## [1] "character"
typeof(jobs_filtered$`2008`)
## [1] "character"
typeof(jobs_filtered$`2009`)
## [1] "character"
typeof(jobs_filtered$`2010`)
## [1] "character"
typeof(jobs_filtered$`2011`)
## [1] "character"
typeof(jobs_filtered$`2012`)
## [1] "character"
typeof(jobs_filtered$`2014`)
## [1] "character"
typeof(jobs_filtered$`2015`)
## [1] "character"
typeof(jobs_filtered$`2016`)
## [1] "character"
jobs_filtered$`2007`<-as.double(jobs_filtered$`2007`)
jobs_filtered$`2008`<-as.double(jobs_filtered$`2008`)
jobs_filtered$`2009`<-as.double(jobs_filtered$`2009`)
jobs_filtered$`2010`<-as.double(jobs_filtered$`2010`)
jobs_filtered$`2011`<-as.double(jobs_filtered$`2011`)
jobs_filtered$`2012`<-as.double(jobs_filtered$`2012`)
jobs_filtered$`2013`<-as.double(jobs_filtered$`2013`)
jobs_filtered$`2014`<-as.double(jobs_filtered$`2014`)
jobs_filtered$`2015`<-as.double(jobs_filtered$`2015`)
jobs_filtered$`2016`<-as.double(jobs_filtered$`2016`)
life_expectancy_filtered$year <- as.factor(life_expectancy_filtered$year)
typeof(jobs_filtered$`2007`)
## [1] "double"
typeof(jobs_filtered$`2008`)
## [1] "double"
typeof(jobs_filtered$`2009`)
## [1] "double"
typeof(jobs_filtered$`2010`)
## [1] "double"
typeof(jobs_filtered$`2011`)
## [1] "double"
typeof(jobs_filtered$`2012`)
## [1] "double"
typeof(jobs_filtered$`2014`)
## [1] "double"
typeof(jobs_filtered$`2015`)
## [1] "double"
typeof(jobs_filtered$`2016`)
## [1] "double"
class(life_expectancy_filtered$year)
## [1] "factor"
The jobs data is in untidy format. We need to tidy it before performing any operations. The jobs data has distinct columns for each year. Hence it is in the wide format. It needs to be converted to a tidy format by using the gather format. We have set the key as “year” and value as “percent_employers”. Our data will be tidy after this operation. We have used left join on both the datasets to merge them with the help of “country” and “year” as both of these columns are common for all the datasets.
jobs_filtered2 <- jobs_filtered %>% gather(key="year", value="percent_employers", c(`2007`:`2016`))
head(jobs_filtered2)
typeof(life_expectancy_filtered)
## [1] "list"
integrated_data <- left_join(jobs_filtered2,life_expectancy_filtered, by = c("country"="country","year"=
"year"))
head(integrated_data)
integrated_data <- integrated_data %>% drop_na(`country`)
We found the rate of employers in the data by using the mutate function. We used the mutate function to calculate the rate by dividing the percent employers in the country by the population of the country. The data is saved in new column “rate_percent_employers”. The summary statistics for the “rate_percent_employers” was also found out.
integrated_data <- mutate(integrated_data,rate_employer_percentage = (integrated_data$percent_employers/integrated_data$population)*100)
summary(integrated_data$rate_employer_percentage)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0000 0.0000 0.0002 0.0297 0.0011 10.0415 1164
We checked our data for missing values by using the is.na() function. colSums function was used to find the missing values in all the columns of the dataset in one line of code. The impute() function was used to replace the missing values. The missing values were replaced with the mean for “percent_employers”, “life_expectancy”, “infant_deaths”, “population” and “rate_employer_percentage” with missing values. This was done because the values of these columns are numeric. The missing values in the “status” column were replaced with mode as value is categorical. The class of the imputed column changes to impute. We changed the classes back to their original types to plot graphs in the further sections.
colSums(is.na(integrated_data))
## country year percent_employers
## 0 0 306
## status life_expectancy infant_deaths
## 1000 1007 1000
## population rate_employer_percentage
## 1144 1164
integrated_data$percent_employers <- impute(integrated_data$percent_employers,fun = mean)
class(integrated_data$percent_employers)
## [1] "impute"
integrated_data$percent_employers <- impute(integrated_data$percent_employers,fun = mean)
integrated_data$life_expectancy <- impute(integrated_data$life_expectancy,fun = mean)
integrated_data$infant_deaths <- impute(integrated_data$infant_deaths,fun = mean)
integrated_data$population <- impute(integrated_data$population,fun = mean)
integrated_data$rate_employer_percentage <- impute(integrated_data$rate_employer_percentage,fun = mean)
integrated_data$status <- impute(integrated_data$status,fun = mode)
integrated_data$status <- as.character(integrated_data$status)
integrated_data$percent_employers <- as.numeric(as.character(integrated_data$percent_employers))
integrated_data$life_expectancy <- as.numeric(as.character(integrated_data$life_expectancy))
integrated_data$infant_deaths <- as.numeric(as.character(integrated_data$infant_deaths))
integrated_data$population <- as.numeric(as.character(integrated_data$population))
integrated_data$rate_employer_percentage <- as.numeric(as.character(integrated_data$rate_employer_percentage))
class(integrated_data$status)
## [1] "character"
class(integrated_data$percent_employers)
## [1] "numeric"
class(integrated_data$life_expectancy)
## [1] "numeric"
class(integrated_data$infant_deaths)
## [1] "numeric"
class(integrated_data$population)
## [1] "numeric"
class(integrated_data$rate_employer_percentage)
## [1] "numeric"
sum(is.infinite(integrated_data$country))
## [1] 0
sum(is.infinite(integrated_data$year))
## [1] 0
sum(is.infinite(integrated_data$percent_employers))
## [1] 0
sum(is.infinite(integrated_data$status))
## [1] 0
sum(is.infinite(integrated_data$life_expectancy))
## [1] 0
sum(is.infinite(integrated_data$infant_deaths))
## [1] 0
sum(is.infinite(integrated_data$population))
## [1] 0
sum(is.infinite(integrated_data$rate_employer_percentage))
## [1] 0
head(integrated_data)
The point which lies beyond 1.5 times the interquartile range is called as outlier. A box plot is used to detect outliers. In a statistical research, one can choose to remove or keep outliers. It depends on the research being carried out. We have chosen to remove the outliers in this instance. This can be done using the capping method (also called as Winsorizing). In capping method we replace the outliers with the nearest neighbours that are not outliers. We can detect the difference between the box plots with and without the outliers. There is a stark difference in both the cases.
#Due to page constraints, only one plot with outliers is shown. Code for plots of other columns is present but commented
boxplot(integrated_data$percent_employers,main = "Percent Employers",col = "green",horizontal = TRUE)
#boxplot(integrated_data$life_expectancy,main = "Life Expectancy",col = "blue",horizontal = TRUE)
#boxplot(integrated_data$infant_deaths,main = "Infant Deaths",col = "red",horizontal = TRUE)
#boxplot(integrated_data$population,main = "Population",col = "pink",horizontal = TRUE)
#boxplot(integrated_data$rate_employer_percentage,main = "Rate Employers",col = "yellow",horizontal = TRUE)
cap <- function(x){
quantiles <- quantile( x, c(.05, 0.25, 0.75, .95 ))
x[ x < quantiles[2] - 1.5*IQR(x) ] <- quantiles[1]
x[ x > quantiles[3] + 1.5*IQR(x) ] <- quantiles[4]
x
}
integrated_data$percent_employers %>% cap() %>% boxplot(main="Percent Employers (cap)",col="green",horizontal = TRUE)
integrated_data$life_expectancy %>% cap() %>% boxplot(main="Life Expectancy (cap)",col="blue",horizontal = TRUE)
integrated_data$infant_deaths %>% cap() %>% boxplot(main="Infant Deaths (cap)",col="red",horizontal = TRUE)
integrated_data$population %>% cap() %>% boxplot(main="Population (cap)",col="pink",horizontal = TRUE)
integrated_data$rate_employer_percentage %>% cap() %>% boxplot(main="Rate of Employers (cap)",col="yellow",horizontal = TRUE)
Data transformation is required to improve normality of variance i.e. skewness of variances. BoxCox transformation is used to transform non-normal data to a normalized form. The skewness can be reduced by using log, root, square, power transforms. Normalization is generally used for scaling of data before using machine learning algorithms. We have performed various transformations on the data.
class(integrated_data$percent_employers)
## [1] "numeric"
integrated_data$percent_employers %>% hist(main="Employers")
bc <- BoxCox(integrated_data$percent_employers,lambda = "auto")
bc %>% hist(main="BOX-COX")
#for reducing right skewness
log_tf<-log(integrated_data$percent_employers)
hist(log_tf)
#for left skewness
square<- function(x) {x*x}
sq_tf<-square(integrated_data$percent_employers)
hist(sq_tf)
#normalizing without centering
norm_tf<-scale(integrated_data$percent_employers,center = FALSE,scale = sd(integrated_data$percent_employers))
norm_centre_tf<-scale(integrated_data$percent_employers,center = TRUE,scale = sd(integrated_data$percent_employers))
hist(norm_tf)
hist(norm_centre_tf)