This R Markdown report explores the gapminder data, which records life expectancy, GDP per capita and population by country and year, with countries grouped by continent.
To create this report we will work with the unfiltered version of the gapminder data, which ships as part of the gapminder package for R. More information is available at Gapminder.
library(knitr) # Provides read_chunk() for pulling code chunks from an external script
library(gapminder) # Base Data
library(ggplot2) # To create visualizations
library(dplyr) # Package for data manipulation
read_chunk("Week_4/Week_4.R") # Use the script for the Week 4 assignment
The data contains 3313 rows and 6 variables, described below:
#Initialize the gapminder dataset
data("gapminder_unfiltered")
#Data Description
str(gapminder_unfiltered)
## Classes 'tbl_df', 'tbl' and 'data.frame': 3313 obs. of 6 variables:
## $ country : Factor w/ 187 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ continent: Factor w/ 6 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
## $ lifeExp : num 28.8 30.3 32 34 36.1 ...
## $ pop : int 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
## $ gdpPercap: num 779 821 853 836 740 ...
#Check for NAs
sapply(gapminder_unfiltered, function(x)sum(is.na(x))) #None of the columns contain any NA values
## country continent year lifeExp pop gdpPercap
## 0 0 0 0 0 0
#Quick peek at the first 6 rows of the data
head(gapminder_unfiltered)
## # A tibble: 6 × 6
## country continent year lifeExp pop gdpPercap
## <fctr> <fctr> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1952 28.801 8425333 779.4453
## 2 Afghanistan Asia 1957 30.332 9240934 820.8530
## 3 Afghanistan Asia 1962 31.997 10267083 853.1007
## 4 Afghanistan Asia 1967 34.020 11537966 836.1971
## 5 Afghanistan Asia 1972 36.088 13079460 739.9811
## 6 Afghanistan Asia 1977 38.438 14880372 786.1134
#Summary Statistics
summary(gapminder_unfiltered)
## country continent year lifeExp
## Czech Republic: 58 Africa : 637 Min. :1950 Min. :23.60
## Denmark : 58 Americas: 470 1st Qu.:1967 1st Qu.:58.33
## Finland : 58 Asia : 578 Median :1982 Median :69.61
## Iceland : 58 Europe :1302 Mean :1980 Mean :65.24
## Japan : 58 FSU : 139 3rd Qu.:1996 3rd Qu.:73.66
## Netherlands : 58 Oceania : 187 Max. :2007 Max. :82.67
## (Other) :2965
## pop gdpPercap
## Min. :5.941e+04 Min. : 241.2
## 1st Qu.:2.680e+06 1st Qu.: 2505.3
## Median :7.560e+06 Median : 7825.8
## Mean :3.177e+07 Mean : 11313.8
## 3rd Qu.:1.961e+07 3rd Qu.: 17355.8
## Max. :1.319e+09 Max. :113523.1
##
The distribution of GDP per capita by country for 2007 is highly right-skewed, with roughly 75% of countries having a GDP per capita below 19,000.
gdp_2007 <- gapminder_unfiltered %>% group_by(continent, country) %>% filter(year==2007) %>%
summarise(Avg_GDP = round(mean(gdpPercap),2))
p1 <- ggplot(data = gdp_2007, aes(Avg_GDP)) + geom_histogram(binwidth = 1000, fill = "#56B4E9", boundary = 0)
p1 + ylab("Count") + xlab("GDP per Capita") + ggtitle("Distribution of GDP by Country for 2007")
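The "roughly 75%" figure can be checked directly rather than read off the histogram. The following is a minimal sketch, not part of the original script: `share_below()` is a hypothetical helper, the toy vector is made up for illustration, and the gapminder call only runs if the package is installed.

```r
# Hypothetical helper: fraction of values strictly below a threshold.
share_below <- function(x, threshold) {
  mean(x < threshold, na.rm = TRUE)
}

# Toy values for illustration only (not gapminder data).
toy_gdp <- c(500, 1200, 3000, 8000, 15000, 22000, 40000, 47000)
toy_share <- share_below(toy_gdp, 19000)  # 5 of the 8 toy values are below 19000

# The same check on the 2007 slice of the report's data, if available.
if (requireNamespace("gapminder", quietly = TRUE)) {
  gdp_2007_values <- subset(gapminder::gapminder_unfiltered, year == 2007)$gdpPercap
  print(share_below(gdp_2007_values, 19000))  # share of countries below 19000
  print(quantile(gdp_2007_values, 0.75))      # 75th percentile of GDP per capita
}
```

If the printed share is close to 0.75, or the 75th percentile is close to 19,000, the statement above holds for the 2007 data.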
Europe has the highest median GDP per capita and Africa the lowest. Asia is home to the richest country by GDP per capita, Qatar.
gdp_continent <- gapminder_unfiltered %>% filter(year==2007) %>% select(continent, year, gdpPercap)
p2 <- ggplot(data = gdp_continent , aes(continent, gdpPercap)) +
geom_boxplot(fill = "#56B4E9", outlier.colour = "#009E73", outlier.size = 3) +
stat_boxplot(geom = "errorbar")
p2 + ggtitle("GDP by Continent") + xlab("Continent") + ylab("GDP per Capita")
ggplot(data = gdp_continent, aes(gdpPercap)) + geom_histogram(binwidth = 1000, fill = "#56B4E9") +
facet_wrap(~ continent, nrow = 2) + ggtitle("GDP for Continents in small multiples") + xlab("GDP per Capita") +
ylab("Count")
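The ranking of continents by median can also be computed as numbers instead of being read from the boxplot. This is a small base R sketch under stated assumptions: `median_by_group()` is a hypothetical helper, and the toy data frame is invented purely to show the mechanics.

```r
# Hypothetical helper: median of `values` within each group, largest first.
median_by_group <- function(values, groups) {
  sort(tapply(values, groups, median), decreasing = TRUE)
}

# Toy data for illustration only (not gapminder data).
toy <- data.frame(
  continent = c("A", "A", "A", "B", "B", "B"),
  gdpPercap = c(1, 2, 9, 4, 5, 6),
  stringsAsFactors = FALSE
)
toy_medians <- median_by_group(toy$gdpPercap, toy$continent)
print(toy_medians)  # group B (median 5) sorts ahead of group A (median 2)

# The same summary on the 2007 slice of the report's data, if available.
if (requireNamespace("gapminder", quietly = TRUE)) {
  d <- subset(gapminder::gapminder_unfiltered, year == 2007)
  print(median_by_group(d$gdpPercap, as.character(d$continent)))
}
```

Note that group A contains the single largest toy value yet has the lower median, which is the same distinction the text draws between the continent with the highest median and the continent holding the richest country.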
Asia dominates the top 10 countries by GDP per capita, accounting for 6 of the 10.
top_gdp_2007 <- gdp_2007 %>% ungroup() %>% top_n(n=10)
## Selecting by Avg_GDP
p3 <- ggplot(data = top_gdp_2007, aes(x=reorder(country, Avg_GDP), y = Avg_GDP ,fill=continent)) +
geom_bar(stat = "identity", width = 0.5)
p3 + ggtitle("Top 10 countries by GDP") + xlab("Country") +
ylab("GDP per Capita") + guides(fill = guide_legend("Continent")) + coord_flip()
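The call to `top_n(n = 10)` above relies on dplyr picking the last column as the ranking variable, which is why it prints "Selecting by Avg_GDP". A sketch of the same idea with the ranking column named explicitly, in base R, might look like the following. `top_n_by()` is a hypothetical helper and the toy data is invented; unlike `top_n()`, this version does not keep extra rows when values tie at the cutoff.

```r
# Hypothetical helper: the n rows of `df` with the largest values in `column`.
top_n_by <- function(df, column, n = 10) {
  ranked <- df[order(df[[column]], decreasing = TRUE), , drop = FALSE]
  head(ranked, n)
}

# Toy data for illustration only (not gapminder data).
toy <- data.frame(
  country   = c("P", "Q", "R", "S"),
  continent = c("X", "Y", "X", "Y"),
  Avg_GDP   = c(10, 40, 30, 20),
  stringsAsFactors = FALSE
)
top2 <- top_n_by(toy, "Avg_GDP", n = 2)

# Count how many of the selected rows fall in each continent.
continent_counts <- table(top2$continent)
print(top2)
print(continent_counts)
```

Applied to the report's data, something like `top_n_by(as.data.frame(gdp_2007), "Avg_GDP")` followed by `table()` on the continent column would give the per-continent count behind the "6 of the 10" statement.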
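Let's look at India's GDP per capita growth over the years. The growth figures below are percentage changes between consecutive observations, which are five years apart.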
gdp_india <- gapminder_unfiltered %>% filter(country=="India")
p4 <- ggplot(data = gdp_india, aes(year,gdpPercap)) + geom_line()
p4 + ggtitle("GDP growth for India") + xlab("Year") + ylab("GDP per Capita")
gdp_growth <- gdp_india %>% mutate(GDP_Change = ((gdpPercap-lag(gdpPercap))/lag(gdpPercap))*100) %>%
select(year,gdpPercap,GDP_Change)
colnames(gdp_growth) <- c("Year", "Avg_GDP", "GDP_Growth")
print(gdp_growth)
## # A tibble: 12 × 3
## Year Avg_GDP GDP_Growth
## <int> <dbl> <dbl>
## 1 1952 546.5657 NA
## 2 1957 590.0620 7.958100
## 3 1962 658.3472 11.572539
## 4 1967 700.7706 6.443935
## 5 1972 724.0325 3.319477
## 6 1977 813.3373 12.334362
## 7 1982 855.7235 5.211394
## 8 1987 976.5127 14.115439
## 9 1992 1164.4068 19.241341
## 10 1997 1458.8174 25.284173
## 11 2002 1746.7695 19.738728
## 12 2007 2452.2104 40.385464
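Because the observations are five years apart, each GDP_Growth value is a five-year change, not an annual rate. The sketch below rebuilds the column from the values printed in the table above and adds an annualized (compound) rate alongside it. The `annual_growth` column is an addition for illustration and is not part of the original script.

```r
# India's GDP per capita, copied from the table printed above.
india <- data.frame(
  year = seq(1952, 2007, by = 5),
  gdpPercap = c(546.5657, 590.0620, 658.3472, 700.7706, 724.0325, 813.3373,
                855.7235, 976.5127, 1164.4068, 1458.8174, 1746.7695, 2452.2104)
)

n <- nrow(india)
ratio <- india$gdpPercap[-1] / india$gdpPercap[-n]  # value / previous value
gap_years <- diff(india$year)                       # 5 for every interval here

# Percentage change over the whole interval (same formula as GDP_Change).
period_growth <- c(NA, (ratio - 1) * 100)

# Equivalent constant yearly rate: (ratio)^(1 / years) - 1, as a percentage.
annual_growth <- c(NA, (ratio^(1 / gap_years) - 1) * 100)

print(data.frame(year = india$year, period_growth, annual_growth))
```

The first row is `NA` in both columns because 1952 has no earlier observation to compare against, matching the `NA` that `lag()` produces in the original table.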