This assignment has been submitted to RPubs.Link for following is http://rpubs.com/ArchitaD_26/503312
library(readr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
library(Hmisc)
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## Loading required package: ggplot2
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
##
## src, summarize
## The following objects are masked from 'package:base':
##
## format.pval, units
library(outliers)
It is suggested to preprocess the data initially before initiating the investigation.I have collected two datasets from the web which deals about the GDP value for a country. I have imported, processed and filtered them in R, and tidy the data in the first part. After that, I combined the 2 datasets in one dataframe. I observed the data types and structures and did the required conversions in order to have a different kind of variables in the data frame. I created a new variable from existing variables. I verified if there is any missing or null values, errors and dealt with them. I reviewed for any possible outliers in data frames and found there are some outliers in GDP (current USD) in 2007, GDP (current USD) in 2017 and GDP growth, I handled them using capping method. I altered the GDP growth rate distribution from right skew into a symmetric one.
GDP (current US$)
Description: The first data set represent the GDP of a country in a year. Gross Domestic Product (GDP) is a overall measurement of a nation’s total economic activity. GDP is the financial worth of all the finished products and services produced within a country’s borders in a stipulated time period. GDP has been classified by countries from 1960 to 2017.
Source: https://data.worldbank.org/indicator/NY.GDP.MKTP.CD (API_NY.GDP.MKTP.CD_DS2_en_csv_v2_10576830.csv)
Variable descriptions:
Country.Name: The name of a country with an abecedarian order.
Country.Code: Represent three digits alphabetic code of the country.
Indicator.Name: Shows the indicator name for this data set which is the GDP (current US$).
Indicator.Code: dipicts the indicator code of GDP.
1960-2017: Shows the GDP (current is US$).
Income
Description: This data set categorized each country into clusters according to their income levels.
Source: https://data.worldbank.org/indicator/NY.GDP.MKTP.CD (Metadata_Country_API_NY.GDP.MKTP.CD_DS2_en_csv_v2_10576830.csv)
Variable descriptions:
Country Code: represent three digits alphabetic code of the country.
Region: classified each country by region.
IncomeGroup: each country has been categorized by their income level.
Special Notes: notation for each country by an international organization.
Table Name: The name of a country with an abecedarian order.
Steps
Imported the csv data files by using reading data packages. For dataset which refers to Income.
Tidy up the data by using gather function, so there would be one column for year instead of several columns, and one column represent GDP values.I have selected only the required variables and reduced Year because I wanted to concentrate on the year 2007 and 2017 from this dataset.
To combine all two data sets in single dataframe, I have connected them by using merge function. GDP and Income share 1 common variables, Country Name, so I have coupled them by usimng Country Name. I have rearrange the order of the columns. Dataframe is the generated. It has 4 variables: Country Name, Income_Group, GDP (current USD) in 2007, GDP (current USD) in 2017.
#read GDP dataset
a<-read.csv("C:/Users/aad19/Desktop/API_NY.GDP.MKTP.CD_DS2_en_csv_v2_10576830.csv", skip=4)
colnames(a)[colnames(a)=="Country.Name"] <- "Country Name"
colnames(a)[colnames(a)=="Country.Code"] <- "Country Code"
colnames(a)[colnames(a)=="Indicator.Name"] <- "Indicator Name"
colnames(a)[colnames(a)=="Indicator.Code"] <- "Indicator Code"
head(a)
#read Income dataset
b<-read.csv("C:/Users/aad19/Desktop/Metadata_Country_API_NY.GDP.MKTP.CD_DS2_en_csv_v2_10576830.csv")
head(b)
#rename IncomeGroup & TableName column
colnames(b)[colnames(b)=="TableName"] <- "Country Name"
colnames(b)[colnames(b)=="IncomeGroup"] <- "Income_Group"
head(b)
# Tidy up the data for 2007
a_GDP_2007<- a %>%gather(`X2007`, key= "Year", value = "GDP (current US$)")
a_GDP_2007<-a_GDP_2007 %>%separate(Year, into = c("Var","Year"), sep = 1)
colnames(a_GDP_2007)[colnames(a_GDP_2007)=="GDP (current US$)"] <- "GDP (current US$) in 2007"
a_GDP_2007<- a_GDP_2007%>% select("Country Name", "GDP (current US$) in 2007")
head(a_GDP_2007)
# Tidy up the data for 2017
a_GDP_2017<- a %>%gather(`X2017`, key= "Year", value = "GDP (current US$)")
a_GDP_2017<-a_GDP_2017 %>%separate(Year, into = c("Var","Year"), sep = 1)
colnames(a_GDP_2017)[colnames(a_GDP_2017)=="GDP (current US$)"] <- "GDP (current US$) in 2017"
a_GDP_2017<- a_GDP_2017%>% select("Country Name", "GDP (current US$) in 2017")
head(a_GDP_2017)
income<- b%>% select("Country Name" ,"Income_Group")
head(income)
#To merge two data sets
GDP<-merge(x=a_GDP_2007,y=a_GDP_2017,by="Country Name")
GDP_dp_3<-merge(x=GDP,y=income,by="Country Name")
GDP_dp_3
GDP_dp_3<- GDP_dp_3[,c(1,4,2,3)] #Re-order the columns
head(GDP_dp_3)
#Check the structure and the attributes of data.
str(GDP_dp_3)
## 'data.frame': 254 obs. of 4 variables:
## $ Country Name : Factor w/ 264 levels "Afghanistan",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ Income_Group : Factor w/ 5 levels "","High income",..: 3 5 5 5 2 4 2 1 2 5 ...
## $ GDP (current US$) in 2007: num 9.75e+09 1.07e+10 1.35e+11 5.20e+08 4.02e+09 ...
## $ GDP (current US$) in 2017: num 1.95e+10 1.30e+10 1.68e+11 6.34e+08 3.01e+09 ...
#Convert Country Name to character variables.
GDP_dp_3$`Country Name`<- as.character(GDP_dp_3$`Country Name`)
class(GDP_dp_3$`Country Name`) #check the class of numeric variable
## [1] "character"
#Convert Income_Group to character variables.
GDP_dp_3$Income_Group<- as.character(GDP_dp_3$Income_Group)
class(GDP_dp_3$Income_Group)
## [1] "character"
# Now we have factor variable and we want to order the labels.
GDP_dp_3$Income_Group<- factor(GDP_dp_3$Income_Group,levels=c("High income","Upper middle income","Lower middle income","Low income"),labels=c("High income","Upper middle income","Lower middle income","Low income"),ordered=TRUE)
class(GDP_dp_3$Income_Group)
## [1] "ordered" "factor"
#Final structure of data set
str(GDP_dp_3)
## 'data.frame': 254 obs. of 4 variables:
## $ Country Name : chr "Afghanistan" "Albania" "Algeria" "American Samoa" ...
## $ Income_Group : Ord.factor w/ 4 levels "High income"<..: 4 2 2 2 1 3 1 NA 1 2 ...
## $ GDP (current US$) in 2007: num 9.75e+09 1.07e+10 1.35e+11 5.20e+08 4.02e+09 ...
## $ GDP (current US$) in 2017: num 1.95e+10 1.30e+10 1.68e+11 6.34e+08 3.01e+09 ...
head(GDP_dp_3)
After verifying the structure and the variables of the dataframe, it has 254 observations with 4 variables.we want to convert Country Name and Income_Group to Charactors variables, and then convert Income_Group from character to factor.In dataframe we have 1 character variable (Country Name), 2 numeric Variables (GDP in 2007, GDP in 2017), 1 factor ordered variable (Income_Group).
For data type alteration, I have used as.numeric function to convert character to numeric & as.charactor function to convert factor to character and as.factor to convert character to factor.
The variable Income_Group has been converted to factor variable with 4 levels, this factor variable is labelled and ordered
The GDP_dp_3 is in tidy format because of the following reasons:
1)Each variable have its own column.
2)Each observation have its own row.
3)Each value have its own cell.
# create new variable (GDP_Growth)
GDP_dp_3<-mutate(GDP_dp_3,GDP_growth=((GDP_dp_3$`GDP (current US$) in 2017`- GDP_dp_3$`GDP (current US$) in 2007`) / GDP_dp_3$`GDP (current US$) in 2007`)*100)
head(GDP_dp_3)
Discussion
We have two numeric variables GDP (current USD) in 2007 and GDP (current USD) in 2017 , we can compute the GDP growth in last 10 years. The equation should be as the following: Population Growth = ((GDP (current USD) in 2017 - GDP (current USD) in 2007)/GDP (current US) in 2007)*100. Therefore, if the GDP Growth has positive value we indicate that there is increase in GDP of the country. However, if the value is negative then there is decrease in GDP of the country.
Mutate function is used to create new variable from existing variables. The name of the new variable is GDP_Growth.
#check for missing values by computing the total missing values in each column
colSums(is.na(GDP_dp_3))
## Country Name Income_Group
## 0 41
## GDP (current US$) in 2007 GDP (current US$) in 2017
## 10 21
## GDP_growth
## 22
#check for inconsistencies/errors for each variable.
sum(is.nan(GDP_dp_3$`Country Name`))
## [1] 0
sum(is.nan(GDP_dp_3$Income_Group))
## [1] 0
sum(is.nan(GDP_dp_3$`GDP (current US$) in 2007`))
## [1] 0
sum(is.nan(GDP_dp_3$`GDP (current US$) in 2017`))
## [1] 0
sum(is.nan(GDP_dp_3$GDP_growth))
## [1] 0
#To deal with missing values, using imputation method.
missing_data_1<- impute(GDP_dp_3$Income_Group, fun = mode)
head(missing_data_1)
## 1 2 3
## Low income Upper middle income Upper middle income
## 4 5 6
## Upper middle income High income Lower middle income
## 4 Levels: High income < Upper middle income < ... < Low income
missing_data_2 <- impute(GDP_dp_3$`GDP (current US$) in 2007`, fun = mean)
head(missing_data_2)
## 1 2 3 4 5
## 9747879532 10677324144 134977087734 520000000 4016972351
## 6
## 65266452081
missing_data_3 <- impute(GDP_dp_3$`GDP (current US$) in 2017`, fun = mean)
head(missing_data_3)
## 1 2 3 4 5
## 19543976895 13038538300 167555280113 634000000 3012914131
## 6
## 122123822334
missing_data_4 <- impute(GDP_dp_3$GDP_growth, fun = mean)
head(missing_data_4)
## 1 2 3 4 5 6
## 100.49465 22.11429 24.13609 21.92308 -24.99540 87.11577
Missing values: ColSums function is used to find out the total number of NAs in each columns. From the output, we have 41 missing value in Income_Group, and 10 missing values in GDP in 2007, 21 in GDP in 2017, and 22 in GDP growth.
Inconsistencies:sum(is.nan()) is used to give the total number of errors. The output is zero which means there is no error.
Handling the missing values: Imputation method is used to handle the missing values.
# Detect the outliers in all numeric variables using box-plot
boxplot(GDP_dp_3$`GDP (current US$) in 2007`, main = "Box plot of GDP in 2007")
boxplot(GDP_dp_3$`GDP (current US$) in 2017`, main = "Box plot of GDP in 2017")
boxplot(GDP_dp_3$GDP_growth, main = "Box plot of GDP growth in 2007 to 2017 for the country")
# Detect the outliers using z-scores
outliers_GDP_2007<-na.omit(GDP_dp_3$`GDP (current US$) in 2007`) #Exclude missing values
z.scores_a <- outliers_GDP_2007 %>% scores(type = "z")
summary(z.scores_a)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.2716 -0.2707 -0.2668 0.0000 -0.2189 8.4305
outliers_GDP_2017<-na.omit(GDP_dp_3$`GDP (current US$) in 2017`)
z.scores_b <- outliers_GDP_2017 %>% scores(type = "z")
summary(z.scores_b)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.2993 -0.2982 -0.2932 0.0000 -0.2411 8.5029
outliers_GDP_growth<-na.omit(GDP_dp_3$GDP_growth)
z.scores_c <- outliers_GDP_growth %>% scores(type = "z")
summary(z.scores_c)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.6587 -0.7129 -0.2283 0.0000 0.5634 5.8823
#to locate the outliers in GDP growth rate variable
head(which(abs(z.scores_c)>3))
## [1] 64 114 151 232
#total number of outliers according to the z-score
length(which(abs(z.scores_c)>3))
## [1] 4
cap <- function(x){
quantiles <- quantile( x, c(.05, 0.25, 0.75, .95))
x[ x < quantiles[2] - 1.5*IQR(x) ] <- quantiles[1]
x[ x > quantiles[3] + 1.5*IQR(x) ] <- quantiles[4]
x
}
outliers_GDP_growth_Cap<- outliers_GDP_growth%>% cap()
summary(outliers_GDP_growth_Cap)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -43.56 19.27 51.45 63.26 104.05 213.68
#Mathematical Operations transformation
hist(GDP_dp_3$GDP_growth, main = "Histogram of GDP growth rate in percentge in last 10 years",col="lightblue")
hist(sqrt(GDP_dp_3$GDP_growth), main = "GDP growth rate in percentge in last 10 years by sqrt",col="pink")
## Warning in sqrt(GDP_dp_3$GDP_growth): NaNs produced
I have performed the data alteration via mathematical methods to reduce the skewness and modify the distribution into a normal distribution. The distribution of the numeric variable (GDP growth) as shown in the histogram is a quite right-skewed. One of the advised methods to lower the right skewness in the spread by taking the square root.