The following R packages are used in this report.

Required packages

library(readr)
library(readxl)
library(foreign)
library(gdata)
library(rvest)
library(dplyr)
library(tidyr)
library(deductive)
library(validate)
library(Hmisc)
library(stringr)
library(lubridate)
library(outliers)
library(MVN)
library(infotheo)
library(MASS)
library(caret)
library(knitr)
library(ggplot2)

Executive Summary

The data used in this report come from the World Bank and cover the population aged 65 and above (as a percentage of total population) between 1960 and 2018, together with the income classification of countries, for 264 observations. Preprocessing proceeded as follows. First, the two data sets were imported into R and merged into one data set with left_join(), using the country code as the common variable. We then checked the structure and attributes of the merged data and converted the data types of some variables. To bring the data into a tidy format, we dropped unnecessary columns, reshaped the data set from wide to long format, and converted the year to an integer. Next, we handled missing values and checked for special values. Outliers were identified with the z-score method and treated by capping. Finally, the distribution of Total_population was examined and a natural logarithm (ln) transformation was applied to reduce its right skew.

Data

The data are taken from the World Bank website and consist of two tables. The first table (population) contains 264 observations of 63 variables: Country Name, Country Code, Indicator Name, Indicator Code and one column per year from 1960 to 2018. The second table (income) contains 263 observations of 5 variables: Country Code, Region, IncomeGroup, SpecialNotes and TableName. After joining the tables we drop the columns we do not need and keep 62 variables for 264 observations: Country Name, the name of the country; Country Code, a short code identifying the country; the years 1960 to 2018, each holding the population aged 65 and above (% of total population) for that year; and IncomeGroup, the World Bank income classification of the country.

setwd("~/Desktop/API_SP.POP.65UP.TO.ZS_DS2_en_csv_v2_319149 ")
The working directory was changed to /Users/a222/Desktop/API_SP.POP.65UP.TO.ZS_DS2_en_csv_v2_319149  inside a notebook chunk. The working directory will be reset when the chunk is finished running. Use the knitr root.dir option in the setup chunk to change the working directory for notebook chunks.
population <- read_csv("API_SP.POP.65UP.TO.ZS_DS2_en_csv_v2_319149.csv") 
Parsed with column specification:
cols(
  .default = col_double(),
  `Country Name` = col_character(),
  `Country Code` = col_character(),
  `Indicator Name` = col_character(),
  `Indicator Code` = col_character()
)
See spec(...) for full column specifications.
head(population)

income <- read_csv("Metadata_Country_API_SP.POP.65UP.TO.ZS_DS2_en_csv_v2_319149.csv")
Parsed with column specification:
cols(
  `Country Code` = col_character(),
  Region = col_character(),
  IncomeGroup = col_character(),
  SpecialNotes = col_character(),
  TableName = col_character()
)
head(income)

Country Code is the column common to the population and income data sets, so we use it to merge them into one data set with left_join(), then call head() on total_population to check the result:

total_population <- population %>% left_join(income,by="Country Code")
head(total_population)

Understand

The dim() and str() functions are used to check the dimensions and structure of the data set.

## Check the dimensions of total_population
dim(total_population)
[1] 264  67
## Check the structure of the data set
str(total_population)
Classes ‘spec_tbl_df’, ‘tbl_df’, ‘tbl’ and 'data.frame':    264 obs. of  67 variables:
 $ Country Name  : chr  "Aruba" "Afghanistan" "Angola" "Albania" ...
 $ Country Code  : chr  "ABW" "AFG" "AGO" "ALB" ...
 $ Indicator Name: chr  "Population ages 65 and above (% of total population)" "Population ages 65 and above (% of total population)" "Population ages 65 and above (% of total population)" "Population ages 65 and above (% of total population)" ...
 $ Indicator Code: chr  "SP.POP.65UP.TO.ZS" "SP.POP.65UP.TO.ZS" "SP.POP.65UP.TO.ZS" "SP.POP.65UP.TO.ZS" ...
 $ 1960          : num  2.48 2.8 3.01 5.41 NA ...
 $ 1961          : num  2.58 2.81 3.05 5.39 NA ...
 $ 1962          : num  2.69 2.8 3.06 5.41 NA ...
 $ 1963          : num  2.8 2.79 3.07 5.44 NA ...
 $ 1964          : num  2.93 2.75 3.07 5.46 NA ...
 $ 1965          : num  3.08 2.71 3.06 5.46 NA ...
 $ 1966          : num  3.23 2.72 3.03 5.4 NA ...
 $ 1967          : num  3.4 2.72 2.98 5.34 NA ...
 $ 1968          : num  3.57 2.7 2.9 5.27 NA ...
 $ 1969          : num  3.78 2.67 2.77 5.22 NA ...
 $ 1970          : num  4 2.63 2.61 5.19 NA ...
 $ 1971          : num  4.36 2.64 2.64 5.18 NA ...
 $ 1972          : num  4.74 2.63 2.67 5.18 NA ...
 $ 1973          : num  5.12 2.61 2.68 5.18 NA ...
 $ 1974          : num  5.49 2.58 2.68 5.19 NA ...
 $ 1975          : num  5.86 2.55 2.67 5.2 NA ...
 $ 1976          : num  6.04 2.56 2.65 5.22 NA ...
 $ 1977          : num  6.24 2.55 2.62 5.24 NA ...
 $ 1978          : num  6.44 2.52 2.6 5.26 NA ...
 $ 1979          : num  6.65 2.49 2.58 5.28 NA ...
 $ 1980          : num  6.85 2.43 2.56 5.3 NA ...
 $ 1981          : num  6.99 2.45 2.57 5.33 NA ...
 $ 1982          : num  7.11 2.45 2.58 5.35 NA ...
 $ 1983          : num  7.19 2.41 2.58 5.37 NA ...
 $ 1984          : num  7.27 2.35 2.58 5.39 NA ...
 $ 1985          : num  7.32 2.24 2.56 5.39 NA ...
 $ 1986          : num  7.38 2.27 2.55 5.41 NA ...
 $ 1987          : num  7.46 2.29 2.52 5.41 NA ...
 $ 1988          : num  7.53 2.3 2.49 5.42 NA ...
 $ 1989          : num  7.61 2.29 2.48 5.44 NA ...
 $ 1990          : num  7.66 2.23 2.47 5.49 NA ...
 $ 1991          : num  7.39 2.24 2.48 5.65 NA ...
 $ 1992          : num  7.17 2.26 2.5 5.83 NA ...
 $ 1993          : num  7.02 2.28 2.52 6.03 NA ...
 $ 1994          : num  6.93 2.32 2.53 6.23 NA ...
 $ 1995          : num  6.91 2.37 2.54 6.43 NA ...
 $ 1996          : num  6.99 2.37 2.56 6.56 NA ...
 $ 1997          : num  7.09 2.36 2.57 6.68 NA ...
 $ 1998          : num  7.23 2.34 2.57 6.8 NA ...
 $ 1999          : num  7.39 2.32 2.57 6.92 NA ...
 $ 2000          : num  7.58 2.29 2.57 7.06 NA ...
 $ 2001          : num  7.76 2.27 2.57 7.31 NA ...
 $ 2002          : num  7.96 2.26 2.58 7.59 NA ...
 $ 2003          : num  8.17 2.24 2.57 7.89 NA ...
 $ 2004          : num  8.38 2.23 2.55 8.19 NA ...
 $ 2005          : num  8.59 2.23 2.53 8.5 NA ...
 $ 2006          : num  8.92 2.25 2.5 8.91 NA ...
 $ 2007          : num  9.23 2.27 2.47 9.32 NA ...
 $ 2008          : num  9.55 2.29 2.43 9.75 NA ...
 $ 2009          : num  9.89 2.31 2.39 10.19 NA ...
 $ 2010          : num  10.29 2.33 2.36 10.65 NA ...
 $ 2011          : num  10.58 2.35 2.35 11.01 NA ...
 $ 2012          : num  10.92 2.39 2.34 11.39 NA ...
 $ 2013          : num  11.3 2.42 2.32 11.77 NA ...
 $ 2014          : num  11.71 2.45 2.3 12.18 NA ...
 $ 2015          : num  12.14 2.48 2.28 12.63 NA ...
 $ 2016          : num  12.6 2.52 2.26 12.96 NA ...
 $ 2017          : num  13.07 2.55 2.24 13.33 NA ...
 $ 2018          : num  13.55 2.58 2.22 13.74 NA ...
 $ Region        : chr  "Latin America & Caribbean" "South Asia" "Sub-Saharan Africa" "Europe & Central Asia" ...
 $ IncomeGroup   : chr  "High income" "Low income" "Lower middle income" "Upper middle income" ...
 $ SpecialNotes  : chr  NA NA NA NA ...
 $ TableName     : chr  "Aruba" "Afghanistan" "Angola" "Albania" ...

As shown above, there are 264 observations of 67 variables, which are a mix of numeric and character variables.

In this step we convert several character variables. IncomeGroup becomes an ordered factor, with shorter level names supplied through the labels argument of factor() and ordered = TRUE. Region, Country Name and Country Code become unordered factors; for Country Name and Country Code the levels are sorted alphabetically by default.


total_population$`Country Name`<- factor(total_population$`Country Name`)

total_population$`Country Code`<- factor(total_population$`Country Code`)

total_population$Region <- factor(total_population$Region,
                                  levels=c('Latin America & Caribbean',
                                           'South Asia',
                                           'Sub-Saharan Africa',
                                           'Europe & Central Asia',
                                           'Middle East & North Africa',
                                           'East Asia & Pacific',
                                           'North America'))

total_population$IncomeGroup <- factor(total_population$IncomeGroup,
                                        levels=c('Low income',
                                                 'Lower middle income',
                                                 'Upper middle income',
                                                 'High income'),
                                        labels = c('Low',
                                                   'Lower middle',
                                                   'Upper middle',
                                                   'High'),
                                        ordered=TRUE)

To check the conversion we apply is.factor() to each converted column and then display those columns to inspect their types and level order.

is.factor(total_population$`Country Name`)
[1] TRUE
is.factor(total_population$`Country Code`)
[1] TRUE
is.factor(total_population$IncomeGroup)
[1] TRUE
is.factor(total_population$Region)
[1] TRUE
total_population[, c(1:2,64:65)]

Tidy & Manipulate Data I

To bring the data into a tidy format, we first drop the columns we do not need:

total_pop <- total_population %>% dplyr::select(-(`Indicator Name`:`Indicator Code`),-(`Region`),-(`SpecialNotes`:`TableName`))
head(total_pop)
dim(total_pop)
[1] 264  62

The new data set has 264 observations of 62 variables with mixed data types: Country Name, the name of the country; Country Code, a short code identifying the country; the years 1960 to 2018, each holding the population aged 65 and above (% of total population) for that year; and IncomeGroup, the World Bank income classification of the country.

Data are tidy when three interrelated rules are met: every variable has its own column, every observation has its own row, and every value has its own cell. Here the names of columns 3 to 61 are values of a year variable rather than variable names, so gather() is used to reshape the data from wide to long format. gather() stores the former column names as character strings, so Year is then converted to integer with as.integer().

total_pop<- total_pop %>% 
  gather(key="Year", value ="Total_population", 3:61)


#Converting the data type of column "Year"

total_pop$Year <- as.integer(total_pop$Year)
head(total_pop)

Next we rename some columns:

total_pop <- rename(total_pop,
                    Country_Name=`Country Name`,
                    Country_Code=`Country Code`,
                    Income_Group=IncomeGroup) 

head(total_pop)

Tidy & Manipulate Data II

To create a new variable, Growth_2018, the percentage growth in the share of the population aged 65 and above between 2017 and 2018, we first use filter() to keep the 2017 and 2018 observations. Within each country, Growth_2018 is then calculated as (value in 2018 / value in 2017 - 1) * 100, and only the 2018 rows are kept.

# Keep only the 2017 and 2018 rows
total_pop_2018 <- total_pop %>% filter(Year == 2018 | Year == 2017)
# Calculate growth from 2017 to 2018 within each country
total_pop_growth_2018 <- total_pop_2018 %>%
  group_by(Country_Code) %>%
  arrange(Year, .by_group = TRUE) %>%   # make sure 2017 precedes 2018 for lag()
  mutate(Growth_2018 = (Total_population / dplyr::lag(Total_population) - 1) * 100) %>%
  ungroup() %>%
  filter(Year == 2018) %>% dplyr::select(-Year)
head(total_pop_growth_2018)

Scan I

In this section we scan the data for missing values, obvious errors and inconsistencies.

## Total number of missing values
sum(is.na(total_pop))
[1] 4256
## Number of missing values in each column
colSums(is.na(total_pop))
    Country_Name     Country_Code     Income_Group             Year Total_population 
               0                0             2773                0             1483 

Two variables have a large number of missing values, which we handle in two steps. First, missing values of the categorical variable Income_Group are replaced with its mode, using the helper function below. Second, missing values of the numeric variable Total_population are replaced with the mean of its observed values.

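# Return the most frequent value of x as a character string.
# table() drops NA by default, so missing values never count as the mode;
# na.rm is accepted only so that callers can pass it.
Mode <- function(x, na.rm = TRUE) {
    xtab <- table(x)
    xmode <- names(which(xtab == max(xtab)))
    if (length(xmode) > 1) xmode <- ">1 mode"  # flag ties rather than pick one
    return(xmode)
}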

total_pop$Income_Group[is.na(total_pop$Income_Group)]<- Mode(total_pop$Income_Group, na.rm = TRUE)


total_pop$Total_population[is.na(total_pop$Total_population)] <- mean(total_pop$Total_population, na.rm = TRUE)
# Check that no missing values remain
colSums(is.na(total_pop))
    Country_Name     Country_Code     Income_Group             Year Total_population 
               0                0                0                0                0 
sum(is.na(total_pop))
[1] 0

The code below checks for special values (Inf, -Inf, NaN or NA) using a helper function, is.special().

# Flag special values: non-finite entries (Inf, -Inf, NaN, NA) in numeric
# vectors, and NA in all other vectors
is.special <- function(x){
  if(is.numeric(x)) !is.finite(x) else is.na(x)
}

# A data frame is not itself numeric, so apply the check column by column
sum(sapply(total_pop, function(col) sum(is.special(col))))
[1] 0

There are no special values to deal with.

To confirm that no missing values remain, we list any incomplete rows and then print the data set:

total_pop[!complete.cases(total_pop),]
total_pop

Scan II

Total_population is the only numeric variable that needs to be scanned for outliers. We first use a box plot to visualise them and then investigate them with the z-score method.


total_pop$Total_population %>%  boxplot(main="Box Plot of total_pop Total_population", ylab="Total_population", col = "grey")

The box plot shows that Total_population has many outliers. In the next step, z-scores computed with the scores() function are used to locate them.

z.scores <- total_pop$Total_population%>%  scores(type = "z")
# Summarise the z-scores
z.scores %>% summary()
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-1.3821 -0.7359 -0.3562  0.0000  0.3128  5.2833 
# Locate the outliers (|z| > 3) in Total_population
which( abs(z.scores) >3 )
  [1] 10939 11203 11467 11470 11670 11731 11734 11934 11995 11998 12198 12259 12262 12462 12496 12523 12526 12726 12760 12787
 [21] 12790 12990 13024 13051 13054 13254 13288 13315 13318 13393 13484 13518 13531 13552 13579 13582 13608 13657 13686 13748
 [41] 13782 13795 13816 13843 13846 13872 13921 13950 14005 14012 14046 14059 14064 14066 14080 14090 14107 14110 14136 14185
 [61] 14214 14269 14276 14310 14313 14323 14326 14328 14330 14332 14344 14354 14371 14374 14398 14400 14449 14478 14533 14540
 [81] 14574 14577 14587 14589 14590 14592 14594 14596 14608 14618 14635 14638 14662 14664 14678 14713 14742 14797 14800 14804
[101] 14837 14838 14841 14851 14853 14854 14856 14858 14860 14872 14882 14899 14902 14926 14928 14942 14959 14977 15005 15006
[121] 15061 15064 15068 15084 15101 15102 15105 15115 15117 15118 15120 15122 15124 15136 15146 15148 15163 15166 15190 15192
[141] 15206 15223 15241 15245 15269 15270 15303 15325 15328 15332 15347 15348 15365 15366 15369 15379 15381 15382 15384 15386
[161] 15388 15392 15400 15410 15412 15427 15430 15454 15456 15470 15487 15503 15505 15509 15533 15534 15567
# Count the outliers according to the z-score
length (which( abs(z.scores) >3 ))
[1] 177

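To treat the outliers we use capping: values below Q1 - 1.5*IQR are replaced with the 5th percentile and values above Q3 + 1.5*IQR with the 95th percentile, as follows. Note that this rule uses the box plot fences rather than the z-score threshold above, so the two methods need not flag the same points.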

# Define a function to cap values outside the box plot fences:
# below Q1 - 1.5*IQR -> 5th percentile, above Q3 + 1.5*IQR -> 95th percentile

cap <- function(x){
    quantiles <- quantile( x, c(.05, 0.25, 0.75, .95 ) )
    x[ x < quantiles[2] - 1.5*IQR(x) ] <- quantiles[1]
    x[ x > quantiles[3] + 1.5*IQR(x) ] <- quantiles[4]
    x
}
# Cap the vector directly, and take the numeric column as a one-column subset

Total_population_capped <- total_pop$Total_population %>% cap()
total_pop_sub <- total_pop %>%  dplyr::select(Total_population)

# Descriptive statistics before capping

summary(total_pop_sub)
 Total_population 
 Min.   : 0.6856  
 1st Qu.: 3.2927  
 Median : 4.8248  
 Mean   : 6.2617  
 3rd Qu.: 7.5238  
 Max.   :27.5764  
# Apply cap() to each column of the subset

total_pop_capped <- sapply(total_pop_sub, FUN = cap)

# Check summary statistics again

summary(total_pop_capped)
 Total_population 
 Min.   : 0.6856  
 1st Qu.: 3.2927  
 Median : 4.8248  
 Mean   : 6.1667  
 3rd Qu.: 7.5238  
 Max.   :15.0069  
# Box plot after capping, to check that the outliers have been treated
total_pop_capped %>%  boxplot(main="Box Plot of total_pop Total_population", ylab="Total_population", col = "grey")

Transform

We used a histogram to illustrate the distribution of Total_population. It is skewed to the right, as the first figure below shows. We therefore tried several transformations: sqrt(), BoxCox() and the natural logarithm (ln). The natural logarithm was the most effective at reducing the right skew, as shown in the second figure.

#Histogram of Total_population
hist(total_pop$Total_population , border="black",col="red",cex.main=0.95,cex.axis=0.7,cex.lab=0.95)

#Histogram of ln_Total_population
PT<-log(total_pop$Total_population )
hist(PT,border="black",col="red",cex.main=0.95,cex.axis=0.9,cex.lab=0.95)

---
title: "MATH2349 Semester 2, 2019"
author: "Norah Alshammari (3692568), Rasha Asharari (3692549), Assayel Alsubaie (37402850)"
subtitle: Assignment 3
output:
  html_notebook: default
  html_document:
    df_print: paged
---
The following R packages are used in this report.

## Required packages 

```{r}
library(readr)
library(readxl)
library(foreign)
library(gdata)
library(rvest)
library(dplyr)
library(tidyr)
library(deductive)
library(validate)
library(Hmisc)
library(stringr)
library(lubridate)
library(outliers)
library(MVN)
library(infotheo)
library(MASS)
library(caret)
library(knitr)
library(ggplot2)
```

## Executive Summary 

The data used in this report come from the World Bank and cover the population aged 65 and above (as a percentage of total population) between 1960 and 2018, together with the income classification of countries, for 264 observations. Preprocessing proceeded as follows. First, the two data sets were imported into R and merged into one data set with left_join(), using the country code as the common variable. We then checked the structure and attributes of the merged data and converted the data types of some variables. To bring the data into a tidy format, we dropped unnecessary columns, reshaped the data set from wide to long format, and converted the year to an integer. Next, we handled missing values and checked for special values. Outliers were identified with the z-score method and treated by capping. Finally, the distribution of Total_population was examined and a natural logarithm (ln) transformation was applied to reduce its right skew.


## Data 

The data are taken from the World Bank website and consist of two tables.
The first table (population) contains 264 observations of 63 variables: Country Name, Country Code, Indicator Name, Indicator Code and one column per year from 1960 to 2018. The second table (income) contains 263 observations of 5 variables: Country Code, Region, IncomeGroup, SpecialNotes and TableName. After joining the tables we drop the columns we do not need and keep 62 variables for 264 observations:
Country Name, the name of the country.
Country Code, a short code identifying the country.
The years 1960 to 2018, each holding the population aged 65 and above (% of total population) for that year.
IncomeGroup, the World Bank income classification of the country.


```{r}
setwd("~/Desktop/API_SP.POP.65UP.TO.ZS_DS2_en_csv_v2_319149 ")
population <- read_csv("API_SP.POP.65UP.TO.ZS_DS2_en_csv_v2_319149.csv") 
head(population)

income <- read_csv("Metadata_Country_API_SP.POP.65UP.TO.ZS_DS2_en_csv_v2_319149.csv")
head(income)
```

Country Code is the column common to the population and income data sets, so we use it to merge them into one data set with left_join(), then call head() on total_population to check the result:

```{r}
total_population <- population %>% left_join(income,by="Country Code")
head(total_population)

```
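
The income table has one row fewer than the population table, so it is worth checking which country codes found no partner in the join. The chunk below is a minimal sketch on made-up data (the `_toy` objects and the code "XXX" are hypothetical, not taken from the World Bank files); it shows that left_join() keeps every left-hand row and that anti_join() lists the unmatched ones.

```{r}
library(dplyr)

# Toy stand-ins for the two tables (hypothetical values)
pop_toy <- data.frame(`Country Code` = c("ABW", "AFG", "XXX"),
                      `2018` = c(13.55, 2.58, 8.90),
                      check.names = FALSE, stringsAsFactors = FALSE)
inc_toy <- data.frame(`Country Code` = c("ABW", "AFG"),
                      IncomeGroup = c("High income", "Low income"),
                      check.names = FALSE, stringsAsFactors = FALSE)

# left_join() keeps every row of the left table; unmatched keys get NA
joined_toy <- left_join(pop_toy, inc_toy, by = "Country Code")

# anti_join() returns the left-hand rows that have no match on the right
unmatched_toy <- anti_join(pop_toy, inc_toy, by = "Country Code")

nrow(joined_toy)               # 3, the same as nrow(pop_toy)
unmatched_toy$`Country Code`   # "XXX"
```

Applied to the real tables, the same anti_join() call would show which rows end up with a missing Region and IncomeGroup.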

## Understand 

The dim() and str() functions are used to check the dimensions and structure of the data set.

```{r}
## Check the dimensions of total_population
dim(total_population)
## Check the structure of the data set
str(total_population)
```

As shown above, there are 264 observations of 67 variables, which are a mix of numeric and character variables.




In this step we convert several character variables. IncomeGroup becomes an ordered factor, with shorter level names supplied through the labels argument of factor() and ordered = TRUE. Region, Country Name and Country Code become unordered factors; for Country Name and Country Code the levels are sorted alphabetically by default.
```{r}

total_population$`Country Name`<- factor(total_population$`Country Name`)

total_population$`Country Code`<- factor(total_population$`Country Code`)

total_population$Region <- factor(total_population$Region,
                                  levels=c('Latin America & Caribbean',
                                           'South Asia',
                                           'Sub-Saharan Africa',
                                           'Europe & Central Asia',
                                           'Middle East & North Africa',
                                           'East Asia & Pacific',
                                           'North America'))

total_population$IncomeGroup <- factor(total_population$IncomeGroup,
                                        levels=c('Low income',
                                                 'Lower middle income',
                                                 'Upper middle income',
                                                 'High income'),
                                        labels = c('Low',
                                                   'Lower middle',
                                                   'Upper middle',
                                                   'High'),
                                        ordered=TRUE)



```
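
To illustrate what the ordered conversion buys us, here is a small sketch on a hypothetical vector (`income_toy` is made up for this example). It shows that the level order follows the `levels` argument, that `labels` renames the levels, and that ordered factors support comparisons.

```{r}
income_toy <- factor(c("High income", "Low income", "Upper middle income", NA),
                     levels = c("Low income", "Lower middle income",
                                "Upper middle income", "High income"),
                     labels = c("Low", "Lower middle", "Upper middle", "High"),
                     ordered = TRUE)

levels(income_toy)            # order follows `levels`, names come from `labels`
income_toy > "Lower middle"   # TRUE FALSE TRUE NA
table(income_toy, useNA = "ifany")
```

Any value that is not listed in `levels` becomes NA, which is one reason to inspect the result after converting.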


To check the conversion we apply is.factor() to each converted column and then display those columns to inspect their types and level order.
```{r}
is.factor(total_population$`Country Name`)
is.factor(total_population$`Country Code`)
is.factor(total_population$IncomeGroup)
is.factor(total_population$Region)
total_population[, c(1:2,64:65)]

```

##	Tidy & Manipulate Data I 


To bring the data into a tidy format, we first drop the columns we do not need:

```{r}
total_pop <- total_population %>% dplyr::select(-(`Indicator Name`:`Indicator Code`),-(`Region`),-(`SpecialNotes`:`TableName`))
head(total_pop)
dim(total_pop)
```

The new data set has 264 observations of 62 variables with mixed data types:
Country Name, the name of the country.
Country Code, a short code identifying the country.
The years 1960 to 2018, each holding the population aged 65 and above (% of total population) for that year.
IncomeGroup, the World Bank income classification of the country.


Data are tidy when three interrelated rules are met:
Every variable has its own column.
Every observation has its own row.
Every value has its own cell.
Here the names of columns 3 to 61 are values of a year variable rather than variable names, so gather() is used to reshape the data from wide to long format.
gather() stores the former column names as character strings, so Year is then converted to integer with as.integer().

```{r}
total_pop<- total_pop %>% 
  gather(key="Year", value ="Total_population", 3:61)


#Converting the data type of column "Year"

total_pop$Year <- as.integer(total_pop$Year)
head(total_pop)
```
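
For comparison, the same reshaping can be sketched with pivot_longer(), which newer versions of tidyr (1.0.0 and later) offer as the successor to gather(). The data frame `wide_toy` below is a hypothetical two-country extract, not the full data set.

```{r}
library(tidyr)

wide_toy <- data.frame(Country_Code = c("ABW", "AFG"),
                       `2017` = c(13.07, 2.55),
                       `2018` = c(13.55, 2.58),
                       check.names = FALSE)

long_toy <- pivot_longer(wide_toy,
                         cols = c(`2017`, `2018`),
                         names_to = "Year",
                         values_to = "Total_population")

# The former column names arrive as character strings
long_toy$Year <- as.integer(long_toy$Year)
long_toy
```

One visible difference is row order: pivot_longer() keeps the rows of each country together, whereas gather() stacks the data one year column at a time.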

Next we rename some columns:
```{r}
total_pop <- rename(total_pop,
                    Country_Name=`Country Name`,
                    Country_Code=`Country Code`,
                    Income_Group=IncomeGroup) 

head(total_pop)
```

##	Tidy & Manipulate Data II 

To create a new variable, Growth_2018, the percentage growth in the share of the population aged 65 and above between 2017 and 2018, we first use filter() to keep the 2017 and 2018 observations. Within each country, Growth_2018 is then calculated as (value in 2018 / value in 2017 - 1) * 100, and only the 2018 rows are kept.

```{r}
# Keep only the 2017 and 2018 rows
total_pop_2018 <- total_pop %>% filter(Year == 2018 | Year == 2017)
# Calculate growth from 2017 to 2018 within each country
total_pop_growth_2018 <- total_pop_2018 %>%
  group_by(Country_Code) %>%
  arrange(Year, .by_group = TRUE) %>%   # make sure 2017 precedes 2018 for lag()
  mutate(Growth_2018 = (Total_population / dplyr::lag(Total_population) - 1) * 100) %>%
  ungroup() %>%
  filter(Year == 2018) %>% dplyr::select(-Year)
head(total_pop_growth_2018)
```
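
As a worked check of the growth formula, the sketch below applies it to two countries, using the 2017 and 2018 values shown in the str() output earlier. The object `growth_toy` is introduced only for this illustration.

```{r}
library(dplyr)

growth_toy <- data.frame(Country_Code = c("ABW", "AFG", "ABW", "AFG"),
                         Year = c(2017L, 2017L, 2018L, 2018L),
                         Total_population = c(13.07, 2.55, 13.55, 2.58)) %>%
  group_by(Country_Code) %>%
  arrange(Year, .by_group = TRUE) %>%   # 2017 must come before 2018 for lag()
  mutate(Growth_2018 = (Total_population / dplyr::lag(Total_population) - 1) * 100) %>%
  ungroup() %>%
  filter(Year == 2018)

growth_toy   # ABW grows by about 3.67 %, AFG by about 1.18 %
```

Because lag() simply takes the previous row within each group, sorting by Year first guards against a wrong sign if the rows ever arrive in a different order.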


##	Scan I 

In this section we scan the data for missing values, obvious errors and inconsistencies.

```{r}
## Total number of missing values
sum(is.na(total_pop))
```

```{r}
## Number of missing values in each column
colSums(is.na(total_pop))

```

Two variables have a large number of missing values, which we handle in two steps.
First, missing values of the categorical variable Income_Group are replaced with its mode, using the helper function below.
Second, missing values of the numeric variable Total_population are replaced with the mean of its observed values.

```{r}
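# Return the most frequent value of x as a character string.
# table() drops NA by default, so missing values never count as the mode;
# na.rm is accepted only so that callers can pass it.
Mode <- function(x, na.rm = TRUE) {
    xtab <- table(x)
    xmode <- names(which(xtab == max(xtab)))
    if (length(xmode) > 1) xmode <- ">1 mode"  # flag ties rather than pick one
    return(xmode)
}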


```

```{r}

total_pop$Income_Group[is.na(total_pop$Income_Group)]<- Mode(total_pop$Income_Group, na.rm = TRUE)


total_pop$Total_population[is.na(total_pop$Total_population)] <- mean(total_pop$Total_population, na.rm = TRUE)

```
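
Mean imputation has a side effect that is easy to see on a small example. The sketch below uses a made-up vector (`x_toy`) and shows that filling the gaps with the mean leaves the mean unchanged but shrinks the standard deviation, because every imputed point sits exactly at the centre.

```{r}
x_toy <- c(2, 4, NA, 6, NA, 8)

x_imputed <- x_toy
x_imputed[is.na(x_imputed)] <- mean(x_toy, na.rm = TRUE)

mean(x_toy, na.rm = TRUE)   # 5
mean(x_imputed)             # 5, the mean is unchanged
sd(x_toy, na.rm = TRUE)     # spread of the observed values
sd(x_imputed)               # smaller, since the imputed points add no spread
```

This is worth keeping in mind when reading the outlier and distribution checks later in the report, since they are run on the imputed column.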

```{r}
# Check that no missing values remain
colSums(is.na(total_pop))
sum(is.na(total_pop))
```


The code below checks for special values (Inf, -Inf, NaN or NA) using a helper function, is.special().
```{r}
# Flag special values: non-finite entries (Inf, -Inf, NaN, NA) in numeric
# vectors, and NA in all other vectors
is.special <- function(x){
  if(is.numeric(x)) !is.finite(x) else is.na(x)
}

# A data frame is not itself numeric, so apply the check column by column
sum(sapply(total_pop, function(col) sum(is.special(col))))


```
There are no special values to deal with.
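
A check for special values has to run on each column, because a data frame as a whole is not a numeric vector. The following sketch uses a hypothetical data frame (`df_toy`) and a helper named `is_special_toy()`, both invented for this example, to show the column-wise pattern and what it counts.

```{r}
is_special_toy <- function(x) {
  if (is.numeric(x)) !is.finite(x) else is.na(x)
}

df_toy <- data.frame(a = c(1, Inf, NaN, 4),
                     b = c("x", NA, "y", "z"),
                     stringsAsFactors = FALSE)

is.numeric(df_toy)   # FALSE, so the numeric branch is never taken for the whole frame

# Count special values column by column
special_counts <- sapply(df_toy, function(col) sum(is_special_toy(col)))
special_counts       # a = 2 (Inf and NaN), b = 1 (NA)
```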



To confirm that no missing values remain, we list any incomplete rows and then print the data set:
```{r}
total_pop[!complete.cases(total_pop),]
total_pop
```

##	Scan II

Total_population is the only numeric variable that needs to be scanned for outliers. We first use a box plot to visualise them and then investigate them with the z-score method.
```{r}

total_pop$Total_population %>%  boxplot(main="Box Plot of total_pop Total_population", ylab="Total_population", col = "grey")

```

The box plot shows that Total_population has many outliers. In the next step, z-scores computed with the scores() function are used to locate them.
```{r}
z.scores <- total_pop$Total_population%>%  scores(type = "z")
# Summarise the z-scores
z.scores %>% summary()
# Locate the outliers (|z| > 3) in Total_population
which( abs(z.scores) >3 )
# Count the outliers according to the z-score
length (which( abs(z.scores) >3 ))
```
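
The z-score itself is simple to compute by hand: subtract the mean and divide by the standard deviation. The sketch below does this in base R on a made-up vector (`x_toy`), on the assumption that it matches what scores(type = "z") returns for a plain numeric vector.

```{r}
# Thirty ordinary values and one extreme value (hypothetical data)
x_toy <- c(rep(c(4, 5, 6), times = 10), 30)

z_toy <- (x_toy - mean(x_toy)) / sd(x_toy)

which(abs(z_toy) > 3)   # 31, the position of the extreme value
```

Note that the threshold of 3 only makes sense for reasonably large samples: with n observations the largest possible absolute z-score is (n - 1) / sqrt(n), which stays below 3 when n is 10 or fewer.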

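To treat the outliers we use capping: values below Q1 - 1.5*IQR are replaced with the 5th percentile and values above Q3 + 1.5*IQR with the 95th percentile, as follows. Note that this rule uses the box plot fences rather than the z-score threshold above, so the two methods need not flag the same points.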
```{r}
# Define a function to cap values outside the box plot fences:
# below Q1 - 1.5*IQR -> 5th percentile, above Q3 + 1.5*IQR -> 95th percentile

cap <- function(x){
    quantiles <- quantile( x, c(.05, 0.25, 0.75, .95 ) )
    x[ x < quantiles[2] - 1.5*IQR(x) ] <- quantiles[1]
    x[ x > quantiles[3] + 1.5*IQR(x) ] <- quantiles[4]
    x
}
# Cap the vector directly, and take the numeric column as a one-column subset

Total_population_capped <- total_pop$Total_population %>% cap()
total_pop_sub <- total_pop %>%  dplyr::select(Total_population)

# Descriptive statistics before capping

summary(total_pop_sub)

# Apply cap() to each column of the subset

total_pop_capped <- sapply(total_pop_sub, FUN = cap)

# Check summary statistics again

summary(total_pop_capped)
# Box plot after capping, to check that the outliers have been treated
total_pop_capped %>%  boxplot(main="Box Plot of total_pop Total_population", ylab="Total_population", col = "grey")
```
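
To make the capping rule concrete, here is a small sketch on a made-up vector. The function `cap_toy()` follows the same rule as cap() above but computes the fence once up front; it and `x_toy` are introduced only for this illustration.

```{r}
cap_toy <- function(x) {
  q <- quantile(x, c(0.05, 0.25, 0.75, 0.95))
  fence <- 1.5 * IQR(x)
  x[x < q[2] - fence] <- q[1]   # below the lower fence -> 5th percentile
  x[x > q[3] + fence] <- q[4]   # above the upper fence -> 95th percentile
  x
}

x_toy <- c(1:19, 100)
capped_toy <- cap_toy(x_toy)

range(x_toy)        # 1 100
range(capped_toy)   # 1 23.05, since 100 is replaced by the 95th percentile
```

Only the value beyond the upper fence (29.5 for this vector) is changed; the other nineteen values pass through untouched, which is why the quartiles and median in the summaries above are identical before and after capping.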

##	Transform 
We used a histogram to illustrate the distribution of Total_population. It is skewed to the right, as the first figure below shows. We therefore tried several transformations: sqrt(), BoxCox() and the natural logarithm (ln). The natural logarithm was the most effective at reducing the right skew, as shown in the second figure.

```{r}
#Histogram of Total_population
hist(total_pop$Total_population , border="black",col="red",cex.main=0.95,cex.axis=0.7,cex.lab=0.95)
```


 
```{r}
#Histogram of ln_Total_population
PT<-log(total_pop$Total_population )
hist(PT,border="black",col="red",cex.main=0.95,cex.axis=0.9,cex.lab=0.95)
```
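
Besides comparing histograms by eye, the effect of a transformation can be summarised with a skewness coefficient. The sketch below is an illustration on simulated right-skewed data, not on the report's data; the helper `skew()` is a simple moment-based estimate written for this example, and `x_toy` is drawn from a log-normal distribution with arbitrary parameters.

```{r}
set.seed(1)
x_toy <- rlnorm(1000, meanlog = 1.5, sdlog = 0.6)   # simulated right-skewed data

# Moment-based sample skewness: third central moment over variance^(3/2)
skew <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

skew(x_toy)        # clearly positive: long right tail
skew(log(x_toy))   # close to zero after the log transformation
```

Running the same helper on Total_population and on its logarithm would give a numeric counterpart to the two histograms.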


