These libraries allow us to use the various functions within this document
#install.packages("tidyverse")
library(tidyverse)
library(readr)
library(dplyr)
library(tidyr)
library(ggplot2)
Loading the original data into the analysis environment. The original data will not be altered. Versions of the data are created in this environment which can be freely manipulated. If an alteration, such as a removal of a column, was a mistake - one can always reload the original data.
Indicators <- read_csv("Data/Indicators.csv")
## Parsed with column specification:
## cols(
## .default = col_double(),
## dimension = col_character(),
## indicator_name = col_character(),
## iso3 = col_character(),
## country_name = col_character()
## )
## See spec(...) for full column specifications.
This function will show us the top 6 rows of the data frame in order to give us an idea of the structure of table.
head(Indicators)
#Count of rows
nrow(Indicators)
## [1] 25636
#Count of columns
ncol(Indicators)
## [1] 34
names(Indicators)
## [1] "dimension" "indicator_id" "indicator_name" "iso3"
## [5] "country_name" "1990" "1991" "1992"
## [9] "1993" "1994" "1995" "1996"
## [13] "1997" "1998" "1999" "2000"
## [17] "2001" "2002" "2003" "2004"
## [21] "2005" "2006" "2007" "2008"
## [25] "2009" "2010" "2011" "2012"
## [29] "2013" "2014" "2015" "2016"
## [33] "2017" "9999"
Tidying the data is an important part of preparing your data for analysis. Making your data tidy can be time consuming but it makes the format of your data consistent and easier to manage when you’re analyzing the data.
Change variable ‘dimension’ to a categorical variable, meaning there are only a certain number of values it could possibly be. This will help with analysis later.
Indicators$dimension <-as.factor(Indicators$dimension)
#Display number of levels in 'dimension' variable
nlevels(Indicators$dimension)
## [1] 14
Change variable ‘indicator_name’ to a categorical variable
Indicators$indicator_name <-as.factor(Indicators$indicator_name)
#Display levels in 'indicator_name' variable
nlevels(Indicators$indicator_name)
## [1] 157
Change variable ‘country_name’ to a categorical variable
Indicators$country_name <-as.factor(Indicators$country_name)
#Display levels in 'country_name' variable
nlevels(Indicators$country_name)
## [1] 195
Change variable ‘indicator_id’ to a categorical variable
Indicators$indicator_id <-as.factor(Indicators$indicator_id)
Change variable ‘iso3’ to a categorical variable
Indicators$iso3 <-as.factor(Indicators$iso3)
An important feature of tidy data is that “each variable must have its own column” As in the variable years should be in a single column. This is done because it makes transforming the data a much smoother process.
Indicators <-gather(Indicators, key = Years, "1990", "1991", "1992", "1993", "1994", "1995", "1996", "1997", "1998","1999", "2000", "2001", "2002","2003","2004","2005","2006", "2007","2008","2009","2010","2011","2012","2013","2014","2015", "2016", "2017","9999")
#rename last column to rating
colnames(Indicators)[colnames(Indicators) == "1990" ] <- "HDI_value"
To break up the data by ‘dimensions’, we must take subsets of the original table. It is good practice to have each table be one thing, rather than a collection of everything. Dimensions in this data set are the category under which the Human Development Index is scored. For example dimensions include education, health, demographics, and so on.
Composite_indices <- subset(Indicators, dimension == "Composite indices" )
Demography <- subset(Indicators, dimension == "Demography" )
Education <- subset(Indicators, dimension == "Education" )
Environmental_sustainability <- subset(Indicators, dimension == "Environmental sustainability" )
Gender <- subset(Indicators, dimension == "Gender" )
Health <- subset(Indicators, dimension == "Health" )
Human_Security <- subset(Indicators, dimension == "Human Security" )
Income_resources <- subset(Indicators, dimension == "Income/composition of resources" )
Inequality <- subset(Indicators, dimension == "Inequality" )
Mobility_communication <- subset(Indicators, dimension == "Mobility and communication" )
Poverty <- subset(Indicators, dimension == "Poverty" )
Sustainability <- subset(Indicators, dimension == "Socio-economic sustainability" )
Financial_flows <- subset(Indicators, dimension == "Trade and financial flows" )
Work_vulnerability <- subset(Indicators, dimension == "Work, employment and vulnerability" )
A tibble is a subset table from a larger table (Rows and columns of data). Creating a tibble for each ’dimension, or category for which the Human Development Index was calculated. Here we are creating a tibble from each of the subset tables.
#Composite Indices Tibble
Tib_CompositeIndices <- tibble(indicator = Composite_indices$indicator_name, country = Composite_indices$country_name, year = Composite_indices$Years, HDI_value = Composite_indices$HDI_value)
#Demography Tibble
Tib_Demography <- tibble(indicator = Demography$indicator_name, country = Demography$country_name, year = Demography$Years, HDI_value = Demography$HDI_value)
#Education Tibble
Tib_Education <- tibble(indicator = Education$indicator_name, country = Education$country_name, year = Education$Years, HDI_value = Education$HDI_value)
#Environmental Sustainability Tibble
Tib_EnvironmentalSustainability <- tibble(indicator = Environmental_sustainability$indicator_name, country = Environmental_sustainability$country_name, year = Environmental_sustainability$Years, HDI_value = Environmental_sustainability$HDI_value)
#Gender Tibble
Tib_Gender <- tibble(indicator = Gender$indicator_name, country = Gender$country_name, year = Gender$Years, HDI_value = Gender$HDI_value)
#Health Tibble
Tib_Health <- tibble(indicator = Health$indicator_name, country = Health$country_name, year = Health$Years, HDI_value = Health$HDI_value)
#Human Security Tibble
Tib_HumanSecurity <- tibble(indicator = Human_Security$indicator_name, country = Human_Security$country_name, year = Human_Security$Years, HDI_value = Human_Security$HDI_value)
#Income Resources Tibble
Tib_IncomeResources <- tibble(indicator = Income_resources$indicator_name, country = Income_resources$country_name, year = Income_resources$Years, HDI_value = Income_resources$HDI_value)
#Inequality Tibble
Tib_Inequality <- tibble(indicator = Inequality$indicator_name, country = Inequality$country_name, year = Inequality$Years, HDI_value = Inequality$HDI_value)
#Mobility Communication Tibble
Tib_MobilityCommunication <- tibble(indicator = Mobility_communication$indicator_name, country = Mobility_communication$country_name, year = Mobility_communication$Years, HDI_value = Mobility_communication$HDI_value)
#Poverty Tibble
Tib_Poverty <- tibble(indicator = Poverty$indicator_name, country = Poverty$country_name, year = Poverty$Years, HDI_value = Poverty$HDI_value)
#Sustainability Tibble
Tib_Sustainability <- tibble(indicator = Sustainability$indicator_name, country = Sustainability$country_name, year = Sustainability$Years, HDI_value = Sustainability$HDI_value)
#Financial Flows Tibble
Tib_FinancialFlows <- tibble(indicator = Financial_flows$indicator_name, country = Financial_flows$country_name, year = Financial_flows$Years, HDI_value = Financial_flows$HDI_value)
#Work Vulnerability Tibble
Tib_WorkVulnerability <- tibble(indicator = Work_vulnerability$indicator_name, country = Work_vulnerability$country_name, year = Work_vulnerability$Years, HDI_value = Work_vulnerability$HDI_value)
There are still a lot of missing values (NA in a cell). For now we will leave them in, but an important tool to use when analyzing the data is the function to remove NA values from the calculations:
na.rm = TRUE
Nothing gives you a better perspective on your data than plotting your data.
At first glance the two tall bars in the center look as though there is something weird going on here. In looking into the tibble, it is evident that the HDI_value is scoring each indicator with different units. Most scores are in context of percentages, whereas the two tall bars are in terms of headcounts of certain population groups. This leads to the conclusion that these tibbles need to be sliced again based on the indicators in order for the HDI_values in a tibble to be in the same units.
ggplot(Tib_Poverty, aes(indicator, HDI_value) ) + geom_col() + geom_path(colour = "red", na.rm = TRUE)
## Warning: Removed 36655 rows containing missing values (position_stack).
Looking at distributions of a single variable The assumption would be that there is an even spread of the amount of indicators across the data set but here we see there a subtle dip then a rise at the far right of the bar graph.
Is this due to the missing data being removed? Or possibly some other reason for the uneven distribution?
names(Tib_Poverty)
## [1] "indicator" "country" "year" "HDI_value"
chart<- ggplot(Tib_Poverty, aes(indicator), na.rm = TRUE)
chart + geom_bar()
ggplot(Tib_Poverty , aes(x = country, y = HDI_value)) + geom_col()
## Warning: Removed 36655 rows containing missing values (position_stack).