This R Markdown Notebook is my report for the Data Wrangling with R class assignment for Week 4.
The following report uses dplyr and ggplot2 to transform and visualize data for exploratory data analysis.
For this report we look at the gapminder_unfiltered data set, which is included in the gapminder package.
The package describes its data as “an excerpt of the data available at Gapminder.org. For each of 142 countries, the package provides values for life expectancy, GDP per capita, and population, every five years, from 1952 to 2007.” That description applies to the filtered gapminder table; the unfiltered version used here covers more countries and more years, as the summary below shows.
library(tidyverse)
library(gapminder)
library(dplyr)
library(ggplot2)
library(scales)
The data is not filtered on year or complete cases and has 3313 rows.
A brief summary of the variables:
# number of rows
nrow(gapminder_unfiltered)
## [1] 3313
# number of variables
ncol(gapminder_unfiltered)
## [1] 6
# structure of data
str(gapminder_unfiltered, give.attr = FALSE)
## Classes 'tbl_df', 'tbl' and 'data.frame': 3313 obs. of 6 variables:
## $ country : Factor w/ 187 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ continent: Factor w/ 6 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
## $ lifeExp : num 28.8 30.3 32 34 36.1 ...
## $ pop : int 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
## $ gdpPercap: num 779 821 853 836 740 ...
# sample rows
head(gapminder_unfiltered)
# summary stats
summary(gapminder_unfiltered)
## country continent year lifeExp
## Czech Republic: 58 Africa : 637 Min. :1950 Min. :23.60
## Denmark : 58 Americas: 470 1st Qu.:1967 1st Qu.:58.33
## Finland : 58 Asia : 578 Median :1982 Median :69.61
## Iceland : 58 Europe :1302 Mean :1980 Mean :65.24
## Japan : 58 FSU : 139 3rd Qu.:1996 3rd Qu.:73.66
## Netherlands : 58 Oceania : 187 Max. :2007 Max. :82.67
## (Other) :2965
## pop gdpPercap
## Min. :5.941e+04 Min. : 241.2
## 1st Qu.:2.680e+06 1st Qu.: 2505.3
## Median :7.560e+06 Median : 7825.8
## Mean :3.177e+07 Mean : 11313.8
## 3rd Qu.:1.961e+07 3rd Qu.: 17355.8
## Max. :1.319e+09 Max. :113523.1
##
The data covers 6 continents and 187 countries, with observations spanning the 58 years from 1950 to 2007.
The summary shows no NA values, but let us explore further before concluding that no data is missing.
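One way to confirm that there are no explicit NA values is to count them per column. This is a minimal sketch; the names `na_counts` and `no_explicit_na` are illustrative. Note that it only detects NA cells, not country-year rows that are absent from the table altogether.

```r
library(gapminder)

# Count explicit missing values (NA) in each column.
na_counts <- colSums(is.na(gapminder_unfiltered))
na_counts

# TRUE when no column contains an NA cell.
no_explicit_na <- sum(na_counts) == 0
no_explicit_na
```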
ggplot(data = gapminder_unfiltered) +
geom_bar(mapping = aes(x = year)) +
labs(x = "Year", y = "Number of observations") +
ggtitle("Number of observations per year")
table(gapminder_unfiltered$year)
##
## 1950 1951 1952 1953 1954 1955 1956 1957 1958 1959 1960 1961 1962 1963 1964
## 39 24 144 24 24 24 24 144 25 25 26 26 151 26 26
## 1965 1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979
## 27 27 156 27 27 27 27 168 32 27 27 27 171 27 27
## 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994
## 27 27 171 27 27 27 27 171 27 27 32 33 183 33 33
## 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007
## 33 33 184 33 33 33 33 187 33 32 30 18 183
The number of entries varies from year to year, which means that data is missing implicitly: not all countries have an observation for every year.
The plot and the table also show that the data is far more complete for every fifth year from 1952 to 2007.
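The contrast between the five-year grid and the remaining years can also be summarized numerically. The sketch below assumes the grid is exactly 1952, 1957, ..., 2007; the names `obs_per_year`, `five_year_grid`, and `grid_summary` are illustrative.

```r
library(gapminder)
library(dplyr)

# Number of countries observed in each year.
obs_per_year <- gapminder_unfiltered %>%
  count(year, name = "n_countries")

# Assumed five-year grid: 1952, 1957, ..., 2007.
five_year_grid <- seq(1952, 2007, by = 5)

# Range of country counts on the grid versus off the grid.
grid_summary <- obs_per_year %>%
  mutate(on_grid = year %in% five_year_grid) %>%
  group_by(on_grid) %>%
  summarize(min_countries = min(n_countries),
            max_countries = max(n_countries))
grid_summary
```

If the analysis should use only the well-covered years, `filter(year %in% five_year_grid)` restricts the data accordingly.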
gapminder_unfiltered %>%
filter(year == 2007) %>%
ggplot() +
geom_histogram(mapping = aes(gdpPercap),
binwidth = 5000,
boundary = 0,
fill = "#9ad3de",
color = "#006BA6") +
scale_x_continuous(breaks = pretty_breaks(n = 10)) +
labs(x = "GDP Per Capita", y = "Number of Countries") +
ggtitle("Distribution of GDP per capita across Countries")
As the histogram shows, most countries had a GDP per capita of less than $10,000 in 2007.
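The reading of the histogram can be checked by computing the share of countries below the $10,000 mark directly. This is a sketch; `gdp_2007_summary` and its column names are illustrative.

```r
library(gapminder)
library(dplyr)

# Share of countries with GDP per capita below 10000 in 2007.
gdp_2007_summary <- gapminder_unfiltered %>%
  filter(year == 2007) %>%
  summarize(n_countries = n(),
            n_below_10k = sum(gdpPercap < 10000),
            share_below_10k = mean(gdpPercap < 10000),
            median_gdpPercap = median(gdpPercap))
gdp_2007_summary
```

A share above 0.5, or equivalently a median below 10000, would support the word "most".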
gapminder_unfiltered %>%
filter(year == 2007) %>%
ggplot() +
geom_boxplot(mapping = aes(x = continent,
y = gdpPercap),
fill = "#00A896") +
scale_y_continuous(breaks = pretty_breaks(n = 10)) +
labs(y = "GDP Per Capita", x = "Continent") +
ggtitle("Distribution of GDP per capita across continents") +
coord_flip()
The box plots show the distribution of GDP per capita on each continent. To get a clearer picture, let us sort the continents by median and drop the extreme values, i.e. observations with gdpPercap of 55000 or more.
# continents ordered by median GDP per capita in 2007, highest first
lvls <- gapminder_unfiltered %>%
filter(year == 2007) %>%
group_by(continent) %>%
summarize(medianGdpPercap = median(gdpPercap)) %>%
arrange(desc(medianGdpPercap)) %>%
select(continent)
gapminder_unfiltered %>%
filter(year == 2007, gdpPercap < 55000) %>%
ggplot() +
geom_boxplot(mapping = aes(x = factor(continent, levels = lvls$continent),
y = gdpPercap),
fill = "#00A896") +
scale_y_continuous(breaks = pretty_breaks(n = 10)) +
labs(y = "GDP Per Capita", x = "Continent") +
ggtitle("Distribution of GDP per capita across continents") +
coord_flip()
With this we can see the distributions more clearly. Africa has the lowest median GDP per capita while Europe has the highest.
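To check the ranking numerically rather than reading it off the plot, the medians can be tabulated, along with the observations that the 55000 cutoff excludes. This is a sketch; `continent_medians` and `excluded_2007` are illustrative names.

```r
library(gapminder)
library(dplyr)

# Median GDP per capita by continent in 2007, highest first.
continent_medians <- gapminder_unfiltered %>%
  filter(year == 2007) %>%
  group_by(continent) %>%
  summarize(n_countries = n(),
            median_gdpPercap = median(gdpPercap)) %>%
  arrange(desc(median_gdpPercap))
continent_medians

# Observations dropped from the second box plot by the cutoff.
excluded_2007 <- gapminder_unfiltered %>%
  filter(year == 2007, gdpPercap >= 55000) %>%
  select(country, continent, gdpPercap)
excluded_2007
```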
gapminder_unfiltered %>%
filter(year == 2007) %>%
filter(rank(desc(gdpPercap)) <= 10) %>%
arrange(desc(gdpPercap)) %>%
ggplot() +
geom_bar(mapping = aes(x = factor(country, levels = rev(unique(country))),
y = gdpPercap,
fill = continent),
stat = "identity") +
labs(x = "Country", y = "GDP Per Capita") +
ggtitle("Top 10 countries with largest GDP per capita in 2007") +
coord_flip()
gapminder_unfiltered %>%
filter(country == 'India') %>%
ggplot() +
geom_line(mapping = aes(x = year, y = gdpPercap)) +
labs(x = "Year", y = "GDP Per Capita") +
ggtitle("India's GDP per capita over the years")
percent_growth_2007 <- gapminder_unfiltered %>%
filter(country == 'India') %>%
arrange(year) %>%
mutate(change = (gdpPercap - lag(gdpPercap))/lag(gdpPercap) * 100) %>%
filter(year == 2007) %>%
select(year, change)
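India's GDP per capita grew by 40.39% between its previous observation (2002) and 2007. Because lag() compares each row with the preceding row, this is growth over that five-year interval, not over a single year.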
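Since `lag()` refers to the previous available row rather than the previous calendar year, it helps to display how many years separate consecutive observations next to the percent change. This is a minimal sketch; `india_growth`, `prev_year`, `years_elapsed`, and `pct_change` are illustrative names.

```r
library(gapminder)
library(dplyr)

# Percent change in GDP per capita between consecutive observations,
# together with the number of years each change spans.
india_growth <- gapminder_unfiltered %>%
  filter(country == "India") %>%
  arrange(year) %>%
  mutate(prev_year = lag(year),
         years_elapsed = year - prev_year,
         pct_change = (gdpPercap - lag(gdpPercap)) / lag(gdpPercap) * 100) %>%
  select(year, prev_year, years_elapsed, gdpPercap, pct_change)

# The most recent observations.
tail(india_growth, 3)
```

The first row has no predecessor, so its `pct_change` is NA.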
India_gdpPercap <- gapminder_unfiltered %>%
filter(country == 'India') %>%
arrange(year) %>%
select(year, gdpPercap)
India_gdpPercap %>%
summarize(change = last(gdpPercap) - first(gdpPercap))
India’s GDP per capita was $546.57 in 1952 and $2452.21 in 2007.
India’s GDP per capita therefore grew by $1905.64 over these years.
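The first and last values and their difference can be produced in a single summary, which keeps the reported figures tied to the code. This is a sketch; `india_summary` and its column names are illustrative.

```r
library(gapminder)
library(dplyr)

# First and last GDP per capita for India, with absolute and relative change.
india_summary <- gapminder_unfiltered %>%
  filter(country == "India") %>%
  arrange(year) %>%
  summarize(first_year = first(year),
            last_year = last(year),
            gdp_first = first(gdpPercap),
            gdp_last = last(gdpPercap),
            abs_change = gdp_last - gdp_first,
            pct_change = abs_change / gdp_first * 100)
india_summary
```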