In this code-through we will show how data wrangling and
visualization can be used to reveal patterns within data sets. We will
be working with the gapminder data set! Along the way, we will use
several packages: ‘dplyr’, ‘ggplot2’, and ‘plotly’.
Before we start digging into the code, always make sure you have
the required packages installed and ready to go. For this code-through
we need the packages ‘dplyr’, ‘ggplot2’, and ‘plotly’. We also need to
install the ‘gapminder’ package, which contains our data set.
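# Install packages (uncomment and run once if they are not installed yet)
# install.packages("dplyr")
# install.packages("ggplot2")
# install.packages("plotly")
# install.packages("gapminder")
# Load packages
library(dplyr)
library(ggplot2)
library(plotly)
library(gapminder)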
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
The first step when working with any data set should be to clean
and inspect the data. No data set is ever perfect, and the package
‘dplyr’ makes data manipulation much easier.
Here we can get a quick look at the data. We see that the data set
lists countries and their continents over the years, together with
life expectancy, population, and GDP per capita for each observation!
# Filtering the data set for the years 2002 and 2007 and then selecting important columns
yr.2002_2007 <- gapminder %>%
  filter(year %in% c(2002, 2007)) %>%
  select(year, country, continent, lifeExp, gdpPercap, pop)
head(yr.2002_2007)
To get comfortable with the package ‘dplyr’ we demonstrated some
basic functions in the code above (“filter” and “select”). After
filtering, we only see each country's data points for 2002 and 2007.
We could exclude irrelevant fields using the “select” function;
however, for now we want to keep all of the columns in the data set!
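To make the “filter” and “select” steps concrete, here is a minimal sketch on a small made-up table. The `toy` data frame and its values are hypothetical and only stand in for a few gapminder rows. It keeps a single year and drops one column with a minus sign inside “select”.

```r
library(dplyr)

# Hypothetical toy data standing in for a few gapminder rows
toy <- tibble(
  country = c("A", "A", "B", "B"),
  year    = c(1997, 2007, 1997, 2007),
  lifeExp = c(60.1, 63.4, 75.2, 77.9),
  pop     = c(1e6, 1.2e6, 5e6, 5.1e6)
)

toy_2007 <- toy %>%
  filter(year == 2007) %>%  # keep only the rows for 2007
  select(-pop)              # drop a column we do not need

toy_2007
```

A column name with a minus sign in front removes that column, while listing names without a minus keeps only those columns.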
Now we will use various tools to calculate the change in life
expectancy, GDP per capita, and population for each country between
the two years.
# 2007 and 2002 differences for each country
yr_differences <- yr.2002_2007 %>%
  group_by(country) %>%
  arrange(year) %>%
  mutate(lifeExp_diff = lifeExp - lag(lifeExp),
         gdpPercap_diff = gdpPercap - lag(gdpPercap),
         pop_diff = pop - lag(pop)) %>%
  filter(year == 2007)
head(yr_differences)
Using the “group_by”, “arrange” and “mutate” functions found in the
‘dplyr’ package, we calculated the differences in each country's data
points from the year 2002 to 2007. The new columns lifeExp_diff,
gdpPercap_diff, and pop_diff show those calculations. We can now take
a look at each country's differences between 2002 and 2007!
More specifically, the “group_by” function groups the data by
country, and the “arrange” function sorts the rows by year so that,
within each group, 2002 comes before 2007! The “mutate” function is
used to create the new columns for the differences between 2002 and
2007.
The “lag()” function is used in R to shift values by a certain
number of positions/observations. In this case, we have two rows of
data per country, one for the year 2002 and another for 2007. When
calculating the difference in life expectancy, we take the column
lifeExp and subtract the lag() of that same column from it. For the
2007 row, lag() returns the 2002 value, so this gives us the 2007
value minus the 2002 value for each country, hence the difference!
The 2002 row has no earlier value, so its difference is NA, which is
why we keep only the 2007 rows at the end.
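The grouped lag() calculation can be sketched on a tiny made-up table. The `toy` data and its values are hypothetical; the extra `prev_lifeExp` column is added here only to show what lag() returns for each row.

```r
library(dplyr)

# Hypothetical toy data: two countries, two years each, deliberately unsorted
toy <- tibble(
  country = c("A", "B", "A", "B"),
  year    = c(2007, 2007, 2002, 2002),
  lifeExp = c(72.5, 80.0, 70.0, 79.0)
)

toy_diff <- toy %>%
  group_by(country) %>%                             # lag() works within each country
  arrange(year) %>%                                 # 2002 rows come before 2007 rows
  mutate(prev_lifeExp = lag(lifeExp),               # value from the previous row in the group
         lifeExp_diff = lifeExp - prev_lifeExp) %>% # 2007 value minus 2002 value
  ungroup()

toy_diff
```

The 2002 rows get NA in both new columns because they have no earlier row in their group, while the 2007 rows hold the change since 2002 (2.5 for country A and 1.0 for country B).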
Next we can use R's “summary()” function to get descriptive
statistics for our new data set!
##       year         country          continent     lifeExp
##  Min.   :2007   Afghanistan:  1   Africa  :52   Min.   :39.61
##  1st Qu.:2007   Albania    :  1   Americas:25   1st Qu.:57.16
##  Median :2007   Algeria    :  1   Asia    :33   Median :71.94
##  Mean   :2007   Angola     :  1   Europe  :30   Mean   :67.01
##  3rd Qu.:2007   Argentina  :  1   Oceania : 2   3rd Qu.:76.41
##  Max.   :2007   Australia  :  1                 Max.   :82.60
##                 (Other)    :136
##    gdpPercap            pop             lifeExp_diff     gdpPercap_diff
##  Min.   :  277.6   Min.   :1.996e+05   Min.   :-4.2560   Min.   :-1490.1
##  1st Qu.: 1624.8   1st Qu.:4.508e+06   1st Qu.: 0.8742   1st Qu.:  194.5
##  Median : 6124.4   Median :1.052e+07   Median : 1.2110   Median :  935.0
##  Mean   :11680.1   Mean   :4.402e+07   Mean   : 1.3125   Mean   : 1762.2
##  3rd Qu.:18008.8   3rd Qu.:3.121e+07   3rd Qu.: 1.7595   3rd Qu.: 2866.0
##  Max.   :49357.2   Max.   :1.319e+09   Max.   : 4.0940   Max.   :12196.9
##
##     pop_diff
##  Min.   : -435794
##  1st Qu.:  121500
##  Median :  697036
##  Mean   : 2563631
##  3rd Qu.: 1954435
##  Max.   :76223784
##
## [1] 0
With this output, we get an overview of every variable in our data
set. For each numeric variable we can see the minimum and maximum
values along with the quartiles, median, and mean. We can also see how
many times each continent appears within our data set.
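As a small companion sketch, the same kind of overview can be built by hand on a made-up table. The `toy` data below is hypothetical, and the missing-value count is shown only as one common sanity check, not necessarily the one used in this code-through.

```r
library(dplyr)

# Hypothetical toy data with one missing life expectancy value
toy <- tibble(
  continent = c("Africa", "Africa", "Asia", "Europe"),
  lifeExp   = c(52.3, NA, 70.1, 78.4)
)

# How many rows belong to each continent
continent_counts <- count(toy, continent)
continent_counts

# Total number of missing values in the whole table
n_missing <- sum(is.na(toy))
n_missing

# Minimum, quartiles, mean, and maximum for one column (also reports the NA)
summary(toy$lifeExp)
```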
Moving on, we will use a basic linear regression to analyze our
data set further.
# Running a linear regression
regression.model <- lm(lifeExp_diff ~ gdpPercap_diff, data = yr_differences)
summary(regression.model)
##
## Call:
## lm(formula = lifeExp_diff ~ gdpPercap_diff, data = yr_differences)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -5.7054 -0.3241 -0.0241  0.4402  2.7620
##
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)
## (Intercept)     1.488e+00  1.134e-01   13.12   <2e-16 ***
## gdpPercap_diff -9.943e-05  4.059e-05   -2.45   0.0155 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.048 on 140 degrees of freedom
## Multiple R-squared: 0.04111, Adjusted R-squared: 0.03426
## F-statistic: 6.002 on 1 and 140 DF, p-value: 0.01553
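Using the function “lm()” we ran a linear regression to model how
changes in GDP per capita relate to changes in life expectancy. In our
model above, lifeExp_diff is the dependent variable and gdpPercap_diff
is the independent variable (our predictor). We use the summary
function to read our regression model results. Looking at the p-values
and significance codes, we can see that gdpPercap_diff is significant
at the 5% level (p = 0.0155)! Note, however, that the estimated
coefficient is negative and very small, and the R-squared of about
0.04 means the model explains only around 4% of the variation in
lifeExp_diff, so the relationship is weak.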
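To get a feel for the size of the estimated effect, the reported coefficients can be plugged into the regression equation by hand. This is only an illustrative sketch: the two coefficients are copied from the summary output above, and the GDP change of 1,000 is a hypothetical value chosen for the example.

```r
# Coefficients copied from the regression summary above
intercept <- 1.488
slope     <- -9.943e-05

# Hypothetical country whose GDP per capita rose by 1,000 between 2002 and 2007
gdp_change <- 1000

# Part of the prediction that comes from the GDP term alone (in years)
gdp_effect <- slope * gdp_change
gdp_effect

# Predicted change in life expectancy (in years) for that country
predicted_diff <- intercept + gdp_effect
predicted_diff
```

The GDP term contributes about -0.1 years here, so the prediction of roughly 1.39 years is driven almost entirely by the intercept. With the fitted model object at hand, “coef()” and “predict()” return the same kind of values without copying numbers by hand.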
Our next step is to use visualizations to help us understand how
the values in our data changed between 2002 and 2007.
# A box and whisker plot for life expectancy differences by continent
ggplot(yr_differences, aes(x = continent, y = lifeExp_diff, fill = continent)) +
  geom_boxplot() +
  labs(title = "Life Expectancy Differences by Continent (2002 - 2007)",
       x = "Continent",
       y = "Life Expectancy Difference") +
  theme_classic() +
  # Center the title; element_text() belongs in theme(), not labs()
  theme(plot.title = element_text(hjust = 0.5))
In our above code we used ‘ggplot’ to show the distribution of
life expectancy differences by continent. To show this efficiently we
used a box and whisker plot, which is created by the function
“geom_boxplot()”. A few notes about the above code: the function
“labs()” is used to give the plot its title and axis labels, while the
function “theme_classic()” is used to apply a clean aesthetic to the
chart!
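The numbers behind each box can be approximated with a grouped summary. The sketch below uses a small made-up table (the `toy` data is hypothetical) and computes the quartiles and median per continent, which correspond roughly to the bottom edge, middle line, and top edge of each box.

```r
library(dplyr)

# Hypothetical toy data: five differences for each of two continents
toy <- tibble(
  continent    = rep(c("Africa", "Europe"), each = 5),
  lifeExp_diff = c(-1, 0.5, 1, 2, 3, 0.2, 0.4, 0.6, 0.8, 1.0)
)

box_stats <- toy %>%
  group_by(continent) %>%
  summarise(
    q1     = unname(quantile(lifeExp_diff, 0.25)),  # lower edge of the box
    median = median(lifeExp_diff),                  # line inside the box
    q3     = unname(quantile(lifeExp_diff, 0.75))   # upper edge of the box
  )

box_stats
```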
With the next visualization we are going to step it up a notch by
incorporating the package ‘plotly’!
# Adding columns for high life expectancy and high GDP per capita
threshold_lifeExp <- 69.84 # Average life expectancy in the world in 2007
threshold_gdpPercap <- 8701 # Average GDP per capita in the world in 2007
yr_differences <- yr_differences %>%
  mutate(
    high_lifeExp = lifeExp > threshold_lifeExp,
    high_gdpPercap = gdpPercap > threshold_gdpPercap,
    both_high = high_lifeExp & high_gdpPercap
  )
# Scatter Plot
plotly.graph <- ggplot(yr_differences,
                       aes(x = gdpPercap, y = lifeExp,
                           color = factor(both_high),
                           text = paste("Country:", country))) +
  geom_point(size = 2.5, alpha = 0.7) +
  labs(title = "Life Expectancy vs GDP per Capita (2007)",
       x = "GDP per Capita",
       y = "Life Expectancy",
       color = "Both High") +
  scale_color_manual(values = c("firebrick", "slateblue"), labels = c("No", "Yes")) +
  theme_classic() +
  theme(legend.position = "top")
ggplotly(plotly.graph, tooltip = c("text", "gdpPercap", "lifeExp"))
In this visualization we wanted to chart which countries had
both high life expectancy and high GDP per capita. In order to do so we
created a threshold for each variable, set to the world average for
that variable in 2007. Next we used the “mutate” function to create
new variables called “high_lifeExp”, “high_gdpPercap”, and
“both_high”. Using ‘ggplot’ we then construct our scatter plot. A few
key notes: “color = factor(both_high)” maps the color of the points to
the value of that variable, and “text = paste("Country:", country)” is
the code that allows the country names to appear when hovering over
the data points. Lastly, “ggplotly” converts the original ggplot into
an interactive plotly graph! Now users can zoom in on parts of the
graph and more. Additionally, “tooltip = c("text", "gdpPercap",
"lifeExp")” allows the values of “gdpPercap” and “lifeExp” to be seen
when hovering over each data point.
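The thresholds in this code-through are hard-coded. As an alternative sketch, averages could be computed from the data itself. The example below assumes a population-weighted mean, which is only one possible definition of a "world average" and may not be how the values 69.84 and 8701 were obtained. The `toy_2007` table and its values are hypothetical.

```r
library(dplyr)

# Hypothetical toy data for a single year
toy_2007 <- tibble(
  country   = c("A", "B", "C"),
  lifeExp   = c(60, 71, 80),
  gdpPercap = c(1000, 5000, 30000),
  pop       = c(1e6, 3e6, 1e6)
)

# Assumption: the "average" is weighted by each country's population
thresholds <- toy_2007 %>%
  summarise(
    lifeExp   = weighted.mean(lifeExp, pop),
    gdpPercap = weighted.mean(gdpPercap, pop)
  )

# Flag countries that are above both thresholds
flagged <- toy_2007 %>%
  mutate(
    high_lifeExp   = lifeExp > thresholds$lifeExp,
    high_gdpPercap = gdpPercap > thresholds$gdpPercap,
    both_high      = high_lifeExp & high_gdpPercap
  )

flagged
```

In this toy table country B is above the life expectancy threshold but below the GDP per capita threshold, so only country C is flagged as “both_high”.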
Hopefully you all learned some new tips and tricks while
following this code-through! We covered general topics in data
wrangling and visualization. For those who have not used the gapminder
data set before, I hope this was a useful tour as we analyzed changes
in life expectancy and GDP per capita between 2002 and 2007. This
analysis offers both statistical insights and visual tools for
exploring the gapminder data set!