library(tidyverse)
## -- Attaching packages --------------------------------------------------------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 3.2.0     v purrr   0.3.2
## v tibble  2.1.3     v dplyr   0.8.3
## v tidyr   0.8.3     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.4.0
## -- Conflicts ------------------------------------------------------------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout

Check working directory

The data for this project came from the online publication “Our World in Data”, an organization committed to presenting open-access data to empower people to make social change. The URL for the data is: https://ourworldindata.org/grapher/life-expectation-at-birth-by-sex?time=1790..2014

getwd()
## [1] "C:/Users/chat5/OneDrive/Desktop/DATA 110/Project 1/Cleaned data"

Confirm data

The data came from a blog post on Our World in Data titled “Why do women live longer than men?”. I pulled the data set for a chart titled “Life expectancy at birth”, in which the life expectancies of men and women in France, the United Kingdom, Sweden, and the United States are compared. Before cleaning, the data set had five variables: entity (the country the data came from), code (a designation given to each country), year, females, and males. I cleaned it by dropping the variable code, renaming the variable entity to country, and renaming the whole file to life for simplicity. I chose this data set because it has relatively few variables, which makes it easy to examine the difference in life expectancy between men and women across countries at a comparable level of development.
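The cleaning described above could be reproduced in R roughly as follows. This is only a sketch: the raw column names (`entity`, `code`, `year`, `females`, `males`) are taken from the description, the object names `raw` and `life_clean` are hypothetical, and the two rows are invented placeholders rather than real values.

```r
library(dplyr)

# Placeholder rows standing in for the raw download.
# The numbers are invented for illustration, not taken from the real data.
raw <- data.frame(
  entity  = c("France", "Sweden"),
  code    = c("FRA", "SWE"),
  year    = c(1900, 1900),
  females = c(50, 55),
  males   = c(45, 52),
  stringsAsFactors = FALSE
)

# Drop the country code and rename entity to country.
life_clean <- raw %>%
  select(-code) %>%
  rename(country = entity)

names(life_clean)
```

Doing the cleaning in code rather than by hand keeps the steps repeatable if the source file is downloaded again.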

Load data

life <- read.csv("life.csv")

Convert data from wide to long format

After many attempts, I realized that my code wasn’t generating the graph as expected because my data was in wide format. This was a problem because the data set had two Y values: each country in a given year had one life expectancy for men and one for women, stored in separate columns. The solution was to convert the data to long format, creating a new variable called “gender” and a value column called “le” that holds the life expectancy. The new data frame therefore has twice as many rows for each country, with each row representing one data point instead of two.

life_long <- life %>% 
              gather(key="gender", value="le", 3:4)
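To make the reshaping concrete, here is a minimal sketch of the same `gather()` call on a two-row toy table. The countries "A" and "B" and all numbers are invented placeholders; the column names are assumed to match the cleaned file.

```r
library(tidyr)

# Toy wide table: one row per country and year, two value columns.
# Values are invented for illustration only.
toy <- data.frame(
  country = c("A", "B"),
  year    = c(1900, 1900),
  females = c(50, 55),
  males   = c(45, 52),
  stringsAsFactors = FALSE
)

# Naming the columns, instead of using positions 3:4, keeps the call
# correct even if the column order changes.
toy_long <- gather(toy, key = "gender", value = "le", females, males)

toy_long
```

The two wide rows become four long rows, one per country, year, and gender, which is the shape `aes(color = gender)` needs.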

Create sub data frame for each country

I decided to subset the data frame because plotting all of the data in one graph made it unreadable. In the future I would prefer to show all the data on one graph, with the axes adjusted so that the data remains comprehensible.

France <- filter(life_long, country == "France")
UK <- filter(life_long, country == "United Kingdom")
US <- filter(life_long, country == "United States")
Sweden <- filter(life_long, country == "Sweden")

Set values on x and y axis

Afterwards it was simply a matter of building the four plots, coloring the lines by gender for readability (color=gender), and adding the X and Y labels and the title.

F1 <- ggplot(France, aes(x = year, y = le, color=gender)) +
  xlab("Year") + 
  ylab("Life expectancy")+ geom_point()+ geom_line() +
  ggtitle("Life Expectancy Between Men and Women in France")
  
UK1 <- ggplot(UK, aes(x = year, y = le, color=gender)) +
  xlab("Year") + 
  ylab("Life expectancy")+ geom_point()+ geom_line() +
  ggtitle("Life Expectancy Between Men and Women in UK")

US1 <- ggplot(US, aes(x = year, y = le, color=gender)) +
  xlab("Year") + 
  ylab("Life expectancy")+ geom_point()+ geom_line() +
  ggtitle("Life Expectancy Between Men and Women in US")

S1 <- ggplot(Sweden, aes(x = year, y = le, color=gender)) +
  xlab("Year") + 
  ylab("Life expectancy")+ geom_point()+ geom_line() +
  ggtitle("Life Expectancy Between Men and Women in Sweden")
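Since the four plot definitions differ only in the data frame and the title, they could also be generated by one small function. The sketch below is an alternative, not the code used above: `plot_country()` is a hypothetical helper name, and the toy data frame contains invented values.

```r
library(ggplot2)

# Hypothetical helper: builds the gender-colored chart for one country.
plot_country <- function(df, label) {
  ggplot(df, aes(x = year, y = le, color = gender)) +
    geom_point() +
    geom_line() +
    xlab("Year") +
    ylab("Life expectancy") +
    ggtitle(paste("Life Expectancy Between Men and Women in", label))
}

# Toy long-format data with invented values, standing in for one country.
toy_long <- data.frame(
  year   = c(1900, 1910, 1900, 1910),
  gender = c("females", "females", "males", "males"),
  le     = c(50, 52, 45, 47),
  stringsAsFactors = FALSE
)

p <- plot_country(toy_long, "A")
```

With the real data, a call such as `plot_country(France, "France")` would take the place of the `F1` definition.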

Arrange all plots on same page

This was done so all four charts can be seen on one page, instead of being called up individually.

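# ggarrange() comes from the ggpubr package, which is not loaded above,
# so the call is left commented out.
#ggarrange(F1, UK1, US1, S1, 
          #labels = c("A", "B", "C", "D"),
          #ncol = 2, nrow = 2)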
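One way to put all countries on a single page without an extra package is ggplot2's `facet_wrap()`, applied directly to the long data frame. This is a minimal sketch with invented placeholder rows; with the real data, `life_long` would take the place of `toy_long`, and the name `all_countries` is hypothetical.

```r
library(ggplot2)

# Toy long-format data for two placeholder countries (invented values).
toy_long <- data.frame(
  country = rep(c("A", "B"), each = 4),
  year    = rep(c(1900, 1910), times = 4),
  gender  = rep(rep(c("females", "males"), each = 2), times = 2),
  le      = c(50, 52, 45, 47, 55, 57, 52, 53),
  stringsAsFactors = FALSE
)

# One panel per country, sharing the same x and y scales.
all_countries <- ggplot(toy_long, aes(x = year, y = le, color = gender)) +
  geom_point() +
  geom_line() +
  facet_wrap(~ country, ncol = 2) +
  xlab("Year") +
  ylab("Life expectancy")
```

Because the panels share their axes by default, the countries can be compared directly, which separate plots with independent scales do not guarantee.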

Interactive graphs for each country

ggplotly(F1)
ggplotly(S1)
ggplotly(UK1)
ggplotly(US1)

Analyzing the data

I found it interesting that not all the nations had data as comprehensive as I expected. For example, Sweden only had data for every 10 years, while the US had data for every year starting in 1900. Yearly data allows for a more thorough examination of the trends, since life expectancy can fluctuate within a 10-year span. A great example of this is the dip in life expectancy in the US for both men and women in 1918. This could be due to World War I, which ended in November of that year, and the hardships the population endured because of the conflict. It could also be an anomaly within the data, and I would like to compare the dip with other available data (employment, communicable disease) to pinpoint the cause. Another interesting trend is the widening gap in life expectancy between men and women, which is evident in all four countries. In the UK and US, this divergence is pronounced around 1975. Again, I would be interested in exploring this further to find the root cause or causes (economic downturn?).
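The widening gap could be checked numerically rather than only by eye. The sketch below computes the female-minus-male difference per country and year from the wide table. The column names are assumed to match the cleaned file, the object names are hypothetical, and the values are invented placeholders.

```r
library(dplyr)

# Toy wide-format data (invented values) in the shape of the cleaned file.
toy <- data.frame(
  country = c("A", "A", "B", "B"),
  year    = c(1900, 1910, 1900, 1910),
  females = c(50, 52, 55, 57),
  males   = c(45, 46, 52, 53),
  stringsAsFactors = FALSE
)

# Gap in years of life expectancy: positive when women live longer.
toy_gap <- toy %>%
  mutate(gap = females - males) %>%
  arrange(country, year)

toy_gap
```

Plotting `gap` against `year` for each country would show directly when the difference starts to widen.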