Merging two data sets

Often you need to merge two datasets in order to plot them. For example, you might have one dataset for the predictor variable and another for the response variable. If they’re currently in different files, we’ll need to combine them with the merge() function.

Let’s create two datasets and then merge them together. First, we’ll install and call the package gapminder

#install.packages("gapminder")
library(gapminder)
library(tidyverse) #call the tidyverse package (you should already have this installed. If not, install it.)

Next, let’s create two datasets from the gapminder package

#This is a dataset of life expectancy that does not contain the year 1957, but contains all other years
dataset_1 <- gapminder %>%
  filter(year!=1957) %>%
  select(country,year,lifeExp) 

#This is a dataset of gdp that contains all years, including 1957
dataset_2 <- gapminder %>%
  select(country,year,gdpPercap)%>%
  mutate(Country = country)%>%
  select(-country)

Look at the datasets…

head(dataset_1) #shows the first few rows
## # A tibble: 6 x 3
##   country      year lifeExp
##   <fct>       <int>   <dbl>
## 1 Afghanistan  1952    28.8
## 2 Afghanistan  1962    32.0
## 3 Afghanistan  1967    34.0
## 4 Afghanistan  1972    36.1
## 5 Afghanistan  1977    38.4
## 6 Afghanistan  1982    39.9
head(dataset_2)
## # A tibble: 6 x 3
##    year gdpPercap Country    
##   <int>     <dbl> <fct>      
## 1  1952      779. Afghanistan
## 2  1957      821. Afghanistan
## 3  1962      853. Afghanistan
## 4  1967      836. Afghanistan
## 5  1972      740. Afghanistan
## 6  1977      786. Afghanistan

Merge the datasets together and notice the mistake.

Country is spelled with a capital and with lower case. For that reason, the merge function created a huge dataset becuase it does not know that Country is the same as country. Capitalization or spelling mistakes are the #1 problem in most data manipulation! Capitalization or spelling mistakes are the #1 problem in most data manipulation! Capitalization or spelling mistakes are the #1 problem in most data manipulation! Capitalization or spelling mistakes are the #1 problem in most data manipulation! Capitalization or spelling mistakes are the #1 problem in most data manipulation! Capitalization or spelling mistakes are the #1 problem in most data manipulation! Capitalization or spelling mistakes are the #1 problem in most data manipulation! Capitalization or spelling mistakes are the #1 problem in most data manipulation! Capitalization or spelling mistakes are the #1 problem in most data manipulation! check, check, chekc, check, check

merged_data <- merge(dataset_1,dataset_2) # this tells R to merge dataset 1 and dataset 2
head(merged_data)
##   year     country lifeExp gdpPercap                  Country
## 1 1952 Afghanistan  28.801 1967.9557                Mauritius
## 2 1952 Afghanistan  28.801 2144.1151                 Colombia
## 3 1952 Afghanistan  28.801  400.4486                    China
## 4 1952 Afghanistan  28.801 1178.6659                     Chad
## 5 1952 Afghanistan  28.801 1071.3107 Central African Republic
## 6 1952 Afghanistan  28.801  786.5669                 Mongolia

Let’s fix that. I’ll do it here in R, but you can also just fix it in excel before you bring the data in.

#here's one way to do it
dataset_2_new <- gapminder %>%
  select(country,year,gdpPercap)

Now we’ll merge again

merged_data <- merge(dataset_2_new,dataset_1)
head(merged_data) #looks better. Now there's just one column for country, and you have both gdp and life expectancy for each country in each year. Also note that the year 1957 is excluded, since there is no data from that year for life expectancy.
##       country year gdpPercap lifeExp
## 1 Afghanistan 1952  779.4453  28.801
## 2 Afghanistan 1962  853.1007  31.997
## 3 Afghanistan 1967  836.1971  34.020
## 4 Afghanistan 1972  739.9811  36.088
## 5 Afghanistan 1977  786.1134  38.438
## 6 Afghanistan 1982  978.0114  39.854

Plotting

Now let’s try plotting the dataset. Our x-axis is gdpPercap and our y-axis is lifeExp. We’ll name the plot as “plot_1”

plot_1 <- ggplot(data=merged_data, aes(x=gdpPercap, y=lifeExp))+
  geom_point()
plot_1

Looks fine, but notice that we have other information that might be useful to see, such as the year in which the data were collected. Let’s add a new color for each year.

plot_1_color <- ggplot(data=merged_data, aes(x=gdpPercap, y=lifeExp, color=year))+
  geom_point()
plot_1_color

Now you can see that life expectancy increases with year (lighter blue is generally at the top). But since they all overlap, it’s hard to see the change. Let’s make a bunch of subplots with each year as a separate plot

plot_1_byyear <- ggplot(data=merged_data, aes(x=gdpPercap, y=lifeExp, color=year))+
  geom_point()+
  facet_wrap(~year)
plot_1_byyear

Now there is a lot to interpret. YOu can see how GDP increases over time, and also how life expectancy increases over time. The verbal interpretation of these patterns is up to you. Let’s try one more minor change to how the plot looks by choosing a new theme.

plot_1_byyear_theme <- ggplot(data=merged_data, aes(x=gdpPercap, y=lifeExp, color=year))+
  geom_point()+
  facet_wrap(~year)
plot_1_byyear_theme<-ggplot(data=merged_data, aes(x=gdpPercap, y=lifeExp, color=year))+
  geom_point()+
  facet_wrap(~year)+
  theme_classic()
plot_1_byyear_theme

There are an endless number of ways to adjust this plot to create what you want. I’ve only scratched the surface here. Use some cheatsheets and your google wu to find the right things to add to your code. For example, here’s a cheatsheet for ggplot: https://www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf