Introduction

In this code-through we will show how data wrangling and visualization can reveal patterns within a data set. We will use the gapminder data set throughout! Along the way, we will work with the packages ‘dplyr’, ‘ggplot2’, and ‘plotly’.

Set-Up (Before you start)

Before we start digging into the code, always make sure you have the required packages installed and loaded. For this code-through we need the packages ‘dplyr’, ‘ggplot2’, and ‘plotly’. We also need to install the ‘gapminder’ package, which provides our data set.

# Install packages (uncomment and run once if they are not installed yet)
# install.packages("dplyr")
# install.packages("ggplot2")
# install.packages("plotly")
# install.packages("gapminder")

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)
library(plotly)

## 
## Attaching package: 'plotly'

## The following object is masked from 'package:ggplot2':
## 
##     last_plot

## The following object is masked from 'package:stats':
## 
##     filter

## The following object is masked from 'package:graphics':
## 
##     layout

library(gapminder)

data("gapminder")

Data Wrangling using the Package ‘dplyr’

The first step when working with any data set should be to inspect and clean the data. No data set is ever perfect, and the package ‘dplyr’ makes data manipulation easier.

head(gapminder)

Here we get a quick look at the first six rows of the data. The data set lists countries and their continents across the years, along with each country’s life expectancy (lifeExp), population (pop), and GDP per capita (gdpPercap) for that year!

# Filtering the data set for the years 2007 and 2002 and then selecting important columns
yr.2002_2007 <- gapminder %>%
  filter(year %in% c(2002, 2007)) %>%
  select(year, country, continent, lifeExp, gdpPercap, pop)

head(yr.2002_2007)

To get comfortable with the package ‘dplyr’, we demonstrated two basic functions in the code above: “filter” and “select”. With “filter” we now see only each country’s data points for 2002 and 2007. The “select” function can exclude non-relevant columns; for now, however, we keep every column in the data set, so the call above only reorders them!
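As a quick sanity check on the filtering step, the sketch below counts the rows per year and shows how “select” can drop a column with a minus sign. The object names two_years, rows_per_year, and no_pop are illustrative only and are not part of the code-through itself.

```r
library(dplyr)
library(gapminder)

# Keep only the 2002 and 2007 observations
two_years <- gapminder %>%
  filter(year %in% c(2002, 2007))

# There should be one row per country per year
rows_per_year <- two_years %>%
  count(year)
print(rows_per_year)

# select() with a minus sign drops a column instead of keeping it
no_pop <- two_years %>%
  select(-pop)
print(names(no_pop))
```

Each year should report the same number of rows, one for every country in the data set.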

More Data Wrangling (Finding and comparing the differences in the data)

Now we will use various tools to help calculate the change in life expectancy, GDP, and population for each country between the two years.

# 2007 and 2002 differences for each country 
yr_differences <- yr.2002_2007 %>%
  group_by(country) %>%
  arrange(year) %>%
  mutate(lifeExp_diff = lifeExp - lag(lifeExp),
         gdpPercap_diff = gdpPercap - lag(gdpPercap),
         pop_diff = pop - lag(pop)) %>%
  filter(year == 2007)  

head(yr_differences)

Using the “group_by”, “arrange”, and “mutate” functions found in the ‘dplyr’ package, we calculated the difference in each country’s data points between 2002 and 2007. The new columns lifeExp_diff, gdpPercap_diff, and pop_diff hold those calculations. We can now take a look at each country’s change from 2002 to 2007!

More specifically, the “group_by” function groups the data by country, and the “arrange” function sorts the rows by year, so that within each country the 2002 row comes before the 2007 row! The “mutate” function creates the new columns for the differences between 2002 and 2007.

The “lag()” function shifts the values of a column by a given number of positions/observations (one by default), so each row sees the value from the row before it. In this case, we have two rows of data per country, one for 2002 and another for 2007. To calculate the difference in life expectancy, we take the column lifeExp and subtract the lag() of that same column from it. On the 2007 row this gives the 2007 value minus the 2002 value for each country, hence giving us the difference! The 2002 row has no earlier row to look back to, so its difference is NA, which is why the final “filter” keeps only the 2007 rows.
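To make the behaviour of “lag()” concrete, here is a minimal sketch on made-up numbers. The tibble toy and its values are invented for illustration, and “arrange” is called with .by_group = TRUE so that the sorting explicitly happens within each country.

```r
library(dplyr)

# Toy data (made-up values): two countries, two years each
toy <- tibble(
  country = c("A", "A", "B", "B"),
  year    = c(2002, 2007, 2002, 2007),
  lifeExp = c(70.0, 72.5, 60.0, 59.0)
)

toy_diff <- toy %>%
  group_by(country) %>%
  arrange(year, .by_group = TRUE) %>%
  mutate(
    previous     = lag(lifeExp),       # value from the row above, within country
    lifeExp_diff = lifeExp - previous  # 2007 minus 2002 on the 2007 row
  ) %>%
  ungroup()

print(toy_diff)
```

On the 2002 rows both new columns are NA; on the 2007 rows lifeExp_diff is 2.5 for country A and -1 for country B.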

Data Analysis within R

Next we can take a deeper look into R’s commands to help analyze our new data set!

# Summary statistics of the data set
summary(yr_differences)

##       year             country       continent     lifeExp     
##  Min.   :2007   Afghanistan:  1   Africa  :52   Min.   :39.61  
##  1st Qu.:2007   Albania    :  1   Americas:25   1st Qu.:57.16  
##  Median :2007   Algeria    :  1   Asia    :33   Median :71.94  
##  Mean   :2007   Angola     :  1   Europe  :30   Mean   :67.01  
##  3rd Qu.:2007   Argentina  :  1   Oceania : 2   3rd Qu.:76.41  
##  Max.   :2007   Australia  :  1                 Max.   :82.60  
##                 (Other)    :136                                
##    gdpPercap            pop             lifeExp_diff     gdpPercap_diff   
##  Min.   :  277.6   Min.   :1.996e+05   Min.   :-4.2560   Min.   :-1490.1  
##  1st Qu.: 1624.8   1st Qu.:4.508e+06   1st Qu.: 0.8742   1st Qu.:  194.5  
##  Median : 6124.4   Median :1.052e+07   Median : 1.2110   Median :  935.0  
##  Mean   :11680.1   Mean   :4.402e+07   Mean   : 1.3125   Mean   : 1762.2  
##  3rd Qu.:18008.8   3rd Qu.:3.121e+07   3rd Qu.: 1.7595   3rd Qu.: 2866.0  
##  Max.   :49357.2   Max.   :1.319e+09   Max.   : 4.0940   Max.   :12196.9  
##                                                                           
##     pop_diff       
##  Min.   : -435794  
##  1st Qu.:  121500  
##  Median :  697036  
##  Mean   : 2563631  
##  3rd Qu.: 1954435  
##  Max.   :76223784  
##

# Check for missing values
sum(is.na(yr_differences))

## [1] 0

With the “summary” command, we get an overview of every variable in our data set. For the numeric variables we can see the minimum and maximum values, the quartiles, and the median and mean. We can also see how many countries fall within each continent in our data set, and the “is.na” check confirms that there are no missing values.
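The overall summary can also be broken down by group. As a sketch, assuming we want the change in life expectancy per continent, the code below rebuilds the differences and then uses “group_by” with “summarise”. The names life_change and by_continent are illustrative only.

```r
library(dplyr)
library(gapminder)

# Rebuild the 2002 -> 2007 life expectancy change for each country
life_change <- gapminder %>%
  filter(year %in% c(2002, 2007)) %>%
  group_by(country) %>%
  arrange(year, .by_group = TRUE) %>%
  mutate(lifeExp_diff = lifeExp - lag(lifeExp)) %>%
  filter(year == 2007) %>%
  ungroup()

# Summarise the change within each continent
by_continent <- life_change %>%
  group_by(continent) %>%
  summarise(
    n_countries   = n(),
    mean_change   = mean(lifeExp_diff),
    median_change = median(lifeExp_diff)
  )

print(by_continent)
```

The n_countries column should match the continent counts that “summary” printed above.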

More Data Analysis (Regressions)

Moving on, we will use a basic linear regression to analyze our data set further.

# Running a linear regression
regression.model <- lm(lifeExp_diff ~ gdpPercap_diff, data = yr_differences)

summary(regression.model)

## 
## Call:
## lm(formula = lifeExp_diff ~ gdpPercap_diff, data = yr_differences)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.7054 -0.3241 -0.0241  0.4402  2.7620 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     1.488e+00  1.134e-01   13.12   <2e-16 ***
## gdpPercap_diff -9.943e-05  4.059e-05   -2.45   0.0155 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.048 on 140 degrees of freedom
## Multiple R-squared:  0.04111,    Adjusted R-squared:  0.03426 
## F-statistic: 6.002 on 1 and 140 DF,  p-value: 0.01553

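Using the function “lm()”, we ran a linear regression to model how changes in GDP per capita relate to changes in life expectancy. In our model above, lifeExp_diff is our dependent variable and gdpPercap_diff is our independent variable (the predictor). We use the “summary” function to get a reading on our regression results. Looking at the p-values and significance codes, we can see that gdpPercap_diff is significant at the 5% level (p = 0.0155)! Note, however, that the estimated coefficient is negative and very small (about -0.0001), and the R-squared of 0.041 means the model explains only about 4% of the variation in lifeExp_diff, so the relationship is weak.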
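The printed regression summary can also be read programmatically. The sketch below refits the same model and pulls out the slope, its p-value, and the R-squared; the names diffs, fit, and slope_per_1000 are illustrative, and rescaling the slope per 1,000 units of GDP per capita is just one convenient way to make the tiny coefficient easier to read.

```r
library(dplyr)
library(gapminder)

# Rebuild the 2002 -> 2007 differences used in the regression
diffs <- gapminder %>%
  filter(year %in% c(2002, 2007)) %>%
  group_by(country) %>%
  arrange(year, .by_group = TRUE) %>%
  mutate(lifeExp_diff   = lifeExp - lag(lifeExp),
         gdpPercap_diff = gdpPercap - lag(gdpPercap)) %>%
  filter(year == 2007) %>%
  ungroup()

fit <- lm(lifeExp_diff ~ gdpPercap_diff, data = diffs)

# Coefficient table as a matrix: estimate, std. error, t value, p-value
coef_table <- coef(summary(fit))
slope      <- coef_table["gdpPercap_diff", "Estimate"]
p_value    <- coef_table["gdpPercap_diff", "Pr(>|t|)"]

# Rescale the slope to a change per 1,000 units of GDP per capita
slope_per_1000 <- slope * 1000
r_squared      <- summary(fit)$r.squared

print(c(slope_per_1000 = slope_per_1000,
        p_value        = p_value,
        r_squared      = r_squared))
```

A significant p-value alongside a low R-squared is a reminder that statistical significance alone does not imply a strong or causal relationship.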

Visualizing the Differences using the package ‘ggplot2’

Our next step is to use visualizations to help us understand how our various data points in our data have changed between 2002 and 2007.

# A box-and-whisker plot of life expectancy differences by continent
ggplot(yr_differences, aes(x = continent, y = lifeExp_diff, fill = continent)) +
  geom_boxplot() +
  labs(title = "Life Expectancy Differences by Continent (2002 - 2007)",
       x = "Continent", 
       y = "Life Expectancy Difference") +
  theme_classic() +
  # Center the title; element_text() belongs in theme(), not labs()
  theme(plot.title = element_text(hjust = 0.5))

In the code above we used ‘ggplot2’ to show the distribution of life expectancy differences by continent. To demonstrate this efficiently we used a box-and-whisker plot, which is drawn by the function “geom_boxplot()”. A few notes about the code: the function “labs()” labels the plot with a title and axis names, while the function “theme_classic()” applies a clean aesthetic to the chart!

More Visualizations using the package ‘plotly’

With this visualization we are going to step it up a notch by incorporating the package ‘plotly’!

# Adding a column for high life expectancy and GDP per capita

threshold_lifeExp <- 69.84  # Average life expectancy in the world in 2007 
threshold_gdpPercap <- 8701  # Average GDP per capita in the world in 2007  

yr_differences <- yr_differences %>%
  mutate(
    high_lifeExp = lifeExp > threshold_lifeExp,
    high_gdpPercap = gdpPercap > threshold_gdpPercap,
    both_high = high_lifeExp & high_gdpPercap 
  )

# Scatter Plot
plotly.graph <- ggplot(yr_differences, aes(x = gdpPercap, y = lifeExp, color = factor(both_high), text = paste("Country:", country))) +
  geom_point(size = 2.5, alpha = 0.7) +
  labs(title = "Life Expectancy vs GDP per Capita (2007)",
       x = "GDP per Capita",
       y = "Life Expectancy",
       color = "Both High") +
  scale_color_manual(values = c("firebrick", "slateblue"), labels = c("No", "Yes")) +
  theme_classic() +
  theme(legend.position = "top")

ggplotly(plotly.graph, tooltip = c("text", "gdpPercap", "lifeExp"))

In this visualization we wanted to chart which countries had both high life expectancy and high GDP per capita. To do so we set a threshold for each variable, using the world average for each in 2007. Next we used the “mutate” function to create new variables called “high_lifeExp”, “high_gdpPercap”, and “both_high”. Using ‘ggplot2’ we then construct our scatter plot. Some key notes: color = factor(both_high) maps the color of the points to the value of that variable, and text = paste("Country:", country) is the code that allows the country names to appear when hovering over the data points. Lastly, “ggplotly” converts the original ggplot into an interactive plotly graph! Now users can zoom in throughout the graph and more. Additionally, tooltip = c("text", "gdpPercap", "lifeExp") allows the values of “gdpPercap” and “lifeExp” to be seen when hovering over each data point.
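The two thresholds above are hard-coded, and they differ from the unweighted country means that “summary” reported earlier (67.01 for lifeExp and 11680.1 for gdpPercap). As a sketch, assuming we wanted to derive comparable averages from the data itself, the code below computes both an unweighted and a population-weighted mean for 2007. Which definition, if either, matches the source of the hard-coded values is not known, and the names gap_2007, thresholds, and flagged are illustrative only.

```r
library(dplyr)
library(gapminder)

gap_2007 <- gapminder %>%
  filter(year == 2007)

# Two candidate "averages" for 2007 (an assumption about how to define them).
# pop is converted to numeric so that summing the weights cannot overflow.
thresholds <- gap_2007 %>%
  summarise(
    lifeExp_unweighted   = mean(lifeExp),
    lifeExp_pop_weighted = weighted.mean(lifeExp, as.numeric(pop)),
    gdp_unweighted       = mean(gdpPercap),
    gdp_pop_weighted     = weighted.mean(gdpPercap, as.numeric(pop))
  )
print(thresholds)

# Flag countries above both unweighted means and count them
flagged <- gap_2007 %>%
  mutate(both_high = lifeExp > thresholds$lifeExp_unweighted &
                     gdpPercap > thresholds$gdp_unweighted)
print(count(flagged, both_high))
```

Changing the definition of the threshold changes which countries are flagged, so it is worth stating which average a plot like this relies on.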

To Conclude

Hopefully you all learned some new tips and tricks while following this code-through! We covered general topics in data wrangling and visualization. For those who have not used the gapminder data set before, I hope this was a helpful tour as we analyzed changes in life expectancy and GDP per capita throughout this demonstration. This analysis offers both statistical insights and visual tools for exploring the gapminder data set!

PAF 514

Tyler Thompson

2024-10-10