This example will go through some commonly used graphical methods for describing data. We will focus on learning how to use the various geometry types used in ggplot(). I urge you to consult the first chapter of the Wickham and Grolemund text, and Wickham’s ggplot2 text in the Use R! series.
We then examine the 2008 PRB Data sheet and the 2015 American Community Survey microdata.
The first thing we need to use ggplot is to install the library, which you have already done. The second thing we need is some data. I’m going to use the PRB data for now because it’s much prettier than the ACS data for plotting.
library(ggplot2)
library(readr)
prb<-read_csv(file = "https://raw.githubusercontent.com/coreysparks/data/master/PRB2008_All.csv")
## Parsed with column specification:
## cols(
## .default = col_integer(),
## Country = col_character(),
## Continent = col_character(),
## Region = col_character(),
## Population. = col_double(),
## Rate.of.natural.increase = col_double(),
## ProjectedPopMid2025 = col_double(),
## ProjectedPopMid2050 = col_double(),
## IMR = col_double(),
## TFR = col_double(),
## PercPop1549HIVAIDS2001 = col_double(),
## PercPop1549HIVAIDS2007 = col_double(),
## PercPpUnderNourished0204 = col_double(),
## PopDensPerSqMile = col_double()
## )
## See spec(...) for full column specifications.
To create a ggplot, we basically tell the function what data we want to use, then tell it the kind of geometry we want to use to summarize the data.
Histograms are a special case of a bar chart. They represent the distribution of data, and should ALWAYS be one of the first things you look at when beginning the examination of a data set/variable.
To construct the frequency histogram, we count the number of observations that fall into each class interval. The table() function is really good at counting stuff. - To construct the relative frequency histogram, we take the frequency histogram and divide the counts in each interval by the total number of observations. This gives the percentage of the data that fall into the interval.
Modality - Typically a variable will have a single mode (unimodal), meaning that there is 1 major peak in the data. - We can also have multi-modal data, where multiple peaks occur in a variable. This is often indicative of some unmeasured group structure in the data. The most common form of this is bi-modality, where we have two peaks in the histogram. Likewise the data may lack a single mode, and be more uniformly distributed, and the histogram will be flat.
Distributional shape - The data may be symmetric, where equal proportions occur on both the right and left of the peak - Similarly, the data may be skewed, or have a long tail - Skewed data are common in demographic work (income) - Right skewed data will have a long tail to the right (extreme values to the high end of the histogram) - Left skewed data will have a long tail to the left (extreme values to the low end of the histogram) - Skewness will have major implications for the kinds of statistical tests we can consider later. Typically most tests assume the data are symmetrically distributed
#fake data
xnorm<-rnorm(100, 25, 5)
xflat<-runif(100,min=20, max=40)
The hist() function in R will give basic histograms, but is pretty limited.
hist(xnorm, main="My plot title", xlab = "x axis label", ylab="y axis label")
hist(xflat)
Pretty basic.
ggplot allows us a lot more bells and whistles
ggplot can’t handle a vector, so we need to make a dataframe.
xnorm<-data.frame(xnorm=xnorm)
ggplot(data=xnorm)+geom_histogram(aes(xnorm), binwidth = 2.5)
Again, pretty basic. Let’s add some text.
ggplot(data=xnorm)+
geom_histogram(mapping= aes(x=xnorm), binwidth = 2.5 )+
ggtitle(label = "my histogram", subtitle = "my sub title")+
xlab(label = "my x label")+
ylab(label="my y label")
You can also get this using qplot
qplot(x =xnorm$xnorm,
geom = "histogram",
binwidth=2.5,
main = "big title",
xlab = "x label",
ylab="y label")
Pretty much the same thing, but fewer details are possible.
There’s also a geom_freqpoly geometry that draws lines instead of bars. It’s pretty neat, here it is doing the basic function of a histogram, and again doing histograms by continent.
ggplot(data=prb)+
geom_freqpoly(mapping = aes(TFR), bins=10)+
ggtitle(label = "Distribution of the Total Fertility Rate ", subtitle = "2008 Estimates")+
xlab(label = "TFR")+
ylab(label="Frequency")
## Warning: Removed 1 rows containing non-finite values (stat_bin).
ggplot(data=prb)+
geom_freqpoly(mapping = aes(TFR, colour=Continent), bins=10, lwd=2)+
ggtitle(label = "Distribution of the Total Fertility Rate by Continent", subtitle = "2008 Estimates")+
xlab(label = "TFR")+
ylab(label="Frequency")
## Warning: Removed 1 rows containing non-finite values (stat_bin).
Also, we can plot the density, instead of the count by including the ..density.. argument in the aesthetic aes().
ggplot(data=prb)+
geom_histogram(mapping = aes(TFR, colour=Continent), bins=10)+
ggtitle(label = "Distribution of the Total Fertility Rate by Continent", subtitle = "2008 Estimates")+
xlab(label = "TFR")+
ylab(label="Frequency")
## Warning: Removed 1 rows containing non-finite values (stat_bin).
So, you can see that we can quickly add detail to our plots, you can’t do this with hist().
Another visualization method is the stem and leaf plot, or box and whisker plot. This basically displays Tukey’s 5 number summary of data.
ggplot(prb)+
geom_boxplot(aes(x= Continent, y =TFR))+
ggtitle(label = "Distribution of the Total Fertility Rate by Continent", subtitle = "2008 Estimates")
## Warning: Removed 1 rows containing non-finite values (stat_boxplot).
You can flip the axes, by adding
coord_flip()
ggplot(prb)+
geom_boxplot(aes(x= Continent, y =TFR))+
ggtitle(label = "Distribution of the Total Fertility Rate by Continent", subtitle = "2008 Estimates")+coord_flip()
## Warning: Removed 1 rows containing non-finite values (stat_boxplot).
You can also color the boxes by a variable:
prb$newname<-paste(prb$Continent, prb$Region, sep="-")
ggplot(prb)+
geom_boxplot(aes(x= newname, y =TFR,fill=Continent))+coord_flip()+
ggtitle(label = "Distribution of the Total Fertility Rate by Continent", subtitle = "2008 Estimates")
## Warning: Removed 1 rows containing non-finite values (stat_boxplot).
These are useful for finding relationships among two or more variables, typically continuous. ggplot() can really make these pretty.
plot(TFR~IMR, data=prb)
Here are a few riffs using the PRB data:
ggplot(data=prb)+
geom_point(mapping= aes(x=TFR, y=IMR))+
ggtitle(label = "Relationship between Total Fertility and Infant Mortality", subtitle = "2008 Estimates")+
xlab(label = "TFR")+
ylab(label="IMR")
## Warning: Removed 2 rows containing missing values (geom_point).
#color varies by continent
ggplot(data=prb)+
geom_point(mapping= aes(x=TFR, y=IMR, color=Continent))+
ggtitle(label = "Relationship between Total Fertility and Infant Mortality", subtitle = "2008 Estimates")+
xlab(label = "TFR")+
ylab(label="IMR")
## Warning: Removed 2 rows containing missing values (geom_point).
#shape varies by continent
ggplot(data=prb)+
geom_point(mapping= aes(x=TFR, y=IMR, shape=Continent))+
ggtitle(label = "Relationship between Total Fertility and Infant Mortality", subtitle = "2008 Estimates")+
xlab(label = "TFR")+
ylab(label="IMR")
## Warning: Removed 2 rows containing missing values (geom_point).
Facet plots are nice, if you want to create a plot separately for a series of groups. This allows you to visualize if the relationship is constant across those groups, well at least graphically.
ggplot(data=prb)+
geom_point(mapping= aes(x=TFR, y=IMR))+
facet_wrap(~Continent)+
ggtitle(label = "Relationship between Total Fertility and Infant Mortality", subtitle = "2008 Estimates")+
xlab(label = "TFR")+
ylab(label="IMR")
## Warning: Removed 2 rows containing missing values (geom_point).
If you have lots of variables you want to plot, you can make a scatter plot matrix. ggplot() won’t do this out of the box, but the GGally library can do it, you need to install it first before using it.
GGally::ggpairs(data= prb[complete.cases(prb[,c(16, 18, 19)]),],
columns = c(16, 18, 19),
ggplot2::aes(color=Continent))
ggplot allows you to make some very nice line-fit plots for scatter plots. While the math behind these lines is not what we are talking about, they do produce a nice graphical summary of the relationships.
ggplot(data=prb)+
geom_point(mapping= aes(x=TFR, y=IMR))+
geom_smooth(mapping= aes(x=TFR, y=IMR), method = "lm")+
ggtitle(label = "Relationship between Total Fertility and Infant Mortality", subtitle = "2008 Estimates-linear fit")+
xlab(label = "TFR")+
ylab(label="IMR")
## Warning: Removed 2 rows containing non-finite values (stat_smooth).
## Warning: Removed 2 rows containing missing values (geom_point).
ggplot(data=prb)+
geom_point(mapping= aes(x=TFR, y=IMR))+
geom_smooth(mapping= aes(x=TFR, y=IMR) , method = "loess")+
ggtitle(label = "Relationship between Total Fertility and Infant Mortality", subtitle = "2008 Estimates")+
xlab(label = "TFR")+
ylab(label="IMR")
## Warning: Removed 2 rows containing non-finite values (stat_smooth).
## Warning: Removed 2 rows containing missing values (geom_point).
#another example, bad linear plot!
ggplot(data=prb)+
geom_point(mapping= aes(x=TFR, y=PercPopLT15))+
geom_smooth(mapping= aes(x=TFR, y=PercPopLT15) , method = "lm")+
ggtitle(label = "Relationship between Total Fertility and Percent under age 15", subtitle = "2008 Estimates-linear fit")+
xlab(label = "Percent under age 15")+
ylab(label="IMR")
## Warning: Removed 1 rows containing non-finite values (stat_smooth).
## Warning: Removed 1 rows containing missing values (geom_point).
ggplot(data=prb)+
geom_point(mapping= aes(x=TFR, y=PercPopLT15))+
geom_smooth(mapping= aes(x=TFR, y=PercPopLT15) , method = "loess")+
ggtitle(label = "Relationship between Total Fertility and Percent under age 15", subtitle = "2008 Estimates- loess fit")+
xlab(label = "Percent under age 15")+
ylab(label="IMR")
## Warning: Removed 1 rows containing non-finite values (stat_smooth).
## Warning: Removed 1 rows containing missing values (geom_point).
Now let’s open a ‘really real’ data file.
library(haven)
ipums<-read_dta("https://github.com/coreysparks/data/blob/master/usa_00045.dta?raw=true")
names(ipums) #print the column names
## [1] "year" "datanum" "serial" "hhwt" "statefip"
## [6] "met2013" "puma" "gq" "pernum" "perwt"
## [11] "famsize" "nchild" "nchlt5" "eldch" "nsibs"
## [16] "relate" "related" "sex" "age" "marst"
## [21] "birthyr" "fertyr" "race" "raced" "hispan"
## [26] "hispand" "bpl" "bpld" "citizen" "yrsusa1"
## [31] "language" "languaged" "speakeng" "educ" "educd"
## [36] "empstat" "empstatd" "labforce" "occ" "ind"
## [41] "inctot" "incwage" "poverty" "hwsei" "migrate1"
## [46] "migrate1d" "carpool" "trantime"
We are basically going to recreate the plots from above but using these data instead.
Here is a basic histogram of our income variable, by sex.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
ipums%>%
mutate(mywage= ifelse(incwage%in%c(999998,999999), NA, incwage))%>%
filter(labforce==2, age>18) %>%
mutate(sexrecode=ifelse(sex==1, "male", "female")) %>%
ggplot()+
geom_histogram(aes(mywage))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ipums%>%
mutate(mywage= ifelse(incwage%in%c(999998,999999), NA, incwage))%>%
filter(labforce==2, age>18) %>%
mutate(sexrecode=ifelse(sex==1, "male", "female")) %>%
ggplot()+
geom_histogram(aes(mywage))+
facet_wrap(~sexrecode)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Next, we will do the income distribution using box plots across the metro areas in Texas. I looked on the ipums website to get the city codes.
png(filename = "C:/Users/ozd504/Documents/incomecityplot.png", width = 800, height =600)
ipums%>%
mutate(mywage= ifelse(incwage%in%c(999998,999999), NA, incwage))%>%
filter(labforce==2, age>18, met2013%in%c(11100, 12420, 15180, 17780, 18580, 21340, 26420, 29700,32580, 47380,41700, 19100) ) %>%
mutate(sexrecode=ifelse(sex==1, "male", "female"),
cityrec = case_when(.$met2013==11100~"Amarillo",
.$met2013 == 12420~"Austin",
.$met2013==15180 ~"Brownsville",
.$met2013== 17780 ~ "College Station",
.$met2013== 18580 ~ "Corpus Christi",
.$met2013== 21340 ~ "El Paso",
.$met2013== 26420 ~ "Houston",
.$met2013== 29700~ "Laredo",
.$met2013== 32580~ "McAllen",
.$met2013== 47380~ "Waco",
.$met2013== 41700 ~ "San Antonio",
.$met2013== 19100 ~ "Dallas")) %>%
ggplot()+
geom_boxplot(aes(x=cityrec, y=mywage))
dev.off()
## png
## 2
Or a scatter plot of income by age
ipums%>%
mutate(mywage= ifelse(incwage%in%c(999998,999999), NA, incwage))%>%
filter(labforce==2, age>18, met2013!=0) %>%
mutate(sexrecode=ifelse(sex==1, "male", "female")) %>%
ggplot()+
geom_point(aes(x=age, y=mywage, color=sexrecode),size=.15)+
geom_smooth(aes(x=age, y=mywage, colour=sexrecode))+
ylim(c(0,2.5e+05))+
ylab("Income")+
xlab("Age")+
ggtitle(label = "Distribution of Income by age for men and women", subtitle = "2015 American Community Survey")
## Don't know how to automatically pick scale for object of type labelled. Defaulting to continuous.
## `geom_smooth()` using method = 'gam'
## Warning: Removed 1625 rows containing non-finite values (stat_smooth).
## Warning: Removed 1625 rows containing missing values (geom_point).
Or we could create a line plot of the average income by age and sex:
ipums%>%
mutate(mywage= ifelse(incwage%in%c(999998,999999), NA, incwage))%>%
filter(labforce==2, age>18, met2013!=0) %>%
mutate(sexrecode=ifelse(sex==1, "male", "female")) %>%
group_by(sexrecode, age)%>%
summarise(medincome=median(mywage, na.rm=T))%>%
ggplot()+
geom_point(aes(x=age, y=medincome, color=sexrecode),size=.1)+
geom_smooth(aes(x=age, y=medincome, colour=sexrecode))+
ylab("Age")+
xlab("Income")+
ggtitle(label = "Median Income by age for men and women", subtitle = "2015 American Community Survey")
## Don't know how to automatically pick scale for object of type labelled. Defaulting to continuous.
## `geom_smooth()` using method = 'loess'