Often you need to merge two datasets in order to plot them. For example, you might have one dataset for the predictor variable and another for the response variable. If they’re currently in different files, we’ll need to combine them with the merge() function.
Let’s create two datasets and then merge them together. First, we’ll install and call the package gapminder
#install.packages("gapminder")
library(gapminder)
library(tidyverse) #call the tidyverse package (you should already have this installed. If not, install it.)
Next, let’s create two datasets from the gapminder package
#This is a dataset of life expectancy that does not contain the year 1957, but contains all other years
dataset_1 <- gapminder %>%
filter(year!=1957) %>%
select(country,year,lifeExp)
#This is a dataset of gdp that contains all years, including 1957
dataset_2 <- gapminder %>%
select(country,year,gdpPercap)%>%
mutate(Country = country)%>%
select(-country)
Look at the datasets…
head(dataset_1) #shows the first few rows
## # A tibble: 6 x 3
## country year lifeExp
## <fct> <int> <dbl>
## 1 Afghanistan 1952 28.8
## 2 Afghanistan 1962 32.0
## 3 Afghanistan 1967 34.0
## 4 Afghanistan 1972 36.1
## 5 Afghanistan 1977 38.4
## 6 Afghanistan 1982 39.9
head(dataset_2)
## # A tibble: 6 x 3
## year gdpPercap Country
## <int> <dbl> <fct>
## 1 1952 779. Afghanistan
## 2 1957 821. Afghanistan
## 3 1962 853. Afghanistan
## 4 1967 836. Afghanistan
## 5 1972 740. Afghanistan
## 6 1977 786. Afghanistan
Country is spelled with a capital and with lower case. For that reason, the merge function created a huge dataset becuase it does not know that Country is the same as country. Capitalization or spelling mistakes are the #1 problem in most data manipulation! Capitalization or spelling mistakes are the #1 problem in most data manipulation! Capitalization or spelling mistakes are the #1 problem in most data manipulation! Capitalization or spelling mistakes are the #1 problem in most data manipulation! Capitalization or spelling mistakes are the #1 problem in most data manipulation! Capitalization or spelling mistakes are the #1 problem in most data manipulation! Capitalization or spelling mistakes are the #1 problem in most data manipulation! Capitalization or spelling mistakes are the #1 problem in most data manipulation! Capitalization or spelling mistakes are the #1 problem in most data manipulation! check, check, chekc, check, check
merged_data <- merge(dataset_1,dataset_2) # this tells R to merge dataset 1 and dataset 2
head(merged_data)
## year country lifeExp gdpPercap Country
## 1 1952 Afghanistan 28.801 1967.9557 Mauritius
## 2 1952 Afghanistan 28.801 2144.1151 Colombia
## 3 1952 Afghanistan 28.801 400.4486 China
## 4 1952 Afghanistan 28.801 1178.6659 Chad
## 5 1952 Afghanistan 28.801 1071.3107 Central African Republic
## 6 1952 Afghanistan 28.801 786.5669 Mongolia
Let’s fix that. I’ll do it here in R, but you can also just fix it in excel before you bring the data in.
#here's one way to do it
dataset_2_new <- gapminder %>%
select(country,year,gdpPercap)
Now we’ll merge again
merged_data <- merge(dataset_2_new,dataset_1)
head(merged_data) #looks better. Now there's just one column for country, and you have both gdp and life expectancy for each country in each year. Also note that the year 1957 is excluded, since there is no data from that year for life expectancy.
## country year gdpPercap lifeExp
## 1 Afghanistan 1952 779.4453 28.801
## 2 Afghanistan 1962 853.1007 31.997
## 3 Afghanistan 1967 836.1971 34.020
## 4 Afghanistan 1972 739.9811 36.088
## 5 Afghanistan 1977 786.1134 38.438
## 6 Afghanistan 1982 978.0114 39.854
Now let’s try plotting the dataset. Our x-axis is gdpPercap and our y-axis is lifeExp. We’ll name the plot as “plot_1”
plot_1 <- ggplot(data=merged_data, aes(x=gdpPercap, y=lifeExp))+
geom_point()
plot_1
Looks fine, but notice that we have other information that might be useful to see, such as the year in which the data were collected. Let’s add a new color for each year.
plot_1_color <- ggplot(data=merged_data, aes(x=gdpPercap, y=lifeExp, color=year))+
geom_point()
plot_1_color
Now you can see that life expectancy increases with year (lighter blue is generally at the top). But since they all overlap, it’s hard to see the change. Let’s make a bunch of subplots with each year as a separate plot
plot_1_byyear <- ggplot(data=merged_data, aes(x=gdpPercap, y=lifeExp, color=year))+
geom_point()+
facet_wrap(~year)
plot_1_byyear
Now there is a lot to interpret. YOu can see how GDP increases over time, and also how life expectancy increases over time. The verbal interpretation of these patterns is up to you. Let’s try one more minor change to how the plot looks by choosing a new theme.
plot_1_byyear_theme <- ggplot(data=merged_data, aes(x=gdpPercap, y=lifeExp, color=year))+
geom_point()+
facet_wrap(~year)
plot_1_byyear_theme<-ggplot(data=merged_data, aes(x=gdpPercap, y=lifeExp, color=year))+
geom_point()+
facet_wrap(~year)+
theme_classic()
plot_1_byyear_theme
There are an endless number of ways to adjust this plot to create what you want. I’ve only scratched the surface here. Use some cheatsheets and your google wu to find the right things to add to your code. For example, here’s a cheatsheet for ggplot: https://www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf