This is the second in a series of courses in a Professional Certificate in Data Science program. The courses in the Professional Certificate program are designed to prepare you to do data analysis in R, from simple computations to machine learning. If you need a refresher on basic R, check out Data Science: R Basics, the first course in this series.
The textbook for the Data Science course series is freely available online.
Section 1: Introduction to Data Visualization and Distributions
You will get started with data visualization and distributions in R.
Section 2: Introduction to ggplot2
You will learn how to use ggplot2 to create plots.
Section 3: Summarizing with dplyr
You will learn how to summarize data using dplyr.
Section 4: Gapminder
You will see examples of ggplot2 and dplyr in action with the Gapminder dataset.
Section 5: Data Visualization Principles
You will learn general principles to guide you in developing effective data visualizations.
Section 1 introduces you to Data Visualization and Distributions.
After completing Section 1, you will:
The textbook for this section is available here
## [1] "sex" "height"
What data type is the sex variable?
A. Continuous
B. Categorical
C. Ordinal
D. None of the above
## [1] 139
## [1] 63
A. It is more effective to consider heights to be numerical given the number of unique values we observe and the fact that if we keep collecting data even more will be observed.
B. It is actually preferable to consider heights ordinal since on a computer there are only a finite number of possibilities.
C. This is actually a categorical variable: tall, medium or short.
D. This is a numerical variable because numbers are used to represent it.
To the closest 5%, what proportion of the states are in the North Central region?
A. 75%
B. 50%
C. 20%
D. 5%
A. The graph above is a histogram.
B. The graph above shows only four numbers with a bar plot.
C. Categories are not numbers, so it does not make sense to graph the distribution.
D. The colors, not the height of the bars, describe the distribution.
Based on the plot, what percentage of males are shorter than 75 inches?
A. 100%
B. 95%
C. 80%
D. 72 inches
A. 61 inches
B. 64 inches
C. 69 inches
D. 74 inches
Knowing that there are 51 states (counting DC) and based on this plot, how many states have murder rates larger than 10 per 100,000 people?
A. 1
B. 5
C. 10
D. 50
A. About half the states have murder rates above 7 per 100,000 and the other half below.
B. Most states have murder rates below 2 per 100,000.
C. All the states have murder rates above 2 per 100,000.
D. With the exception of 4 states, the murder rates are below 5 per 100,000.
Based on this plot, how many males are between 62.5 and 65.5 inches tall?
A. 5
B. 24
C. 44
D. 100
A. 1%
B. 10%
C. 25%
D. 50%
A. 0.02
B. 0.15
C. 0.50
D. 0.55
Which of the following statements is true:
A. It is impossible that they are from the same dataset.
B. They are from the same dataset, but the plots are different due to code errors.
C. They are the same dataset, but the first and second plot undersmooth and the third oversmooths.
D. They are the same dataset, but the first is not in the log scale, the second undersmooths and the third oversmooths.
The normal distribution, also known as the bell curve and as the Gaussian distribution, is one of the most famous mathematical concepts in history. A reason for this is that approximately normal distributions occur in many situations. Examples include gambling winnings, heights, weights, blood pressure, standardized test scores, and experimental measurement errors. Often data visualization is needed to confirm that our data follows a normal distribution.
Here we focus on how the normal distribution helps us summarize data and can be useful in practice.
One way the normal distribution is useful is that it can be used to approximate the distribution of a list of numbers without having access to the entire list. We will demonstrate this with the heights dataset.
Load the height data set and create a vector x with just the male heights:
What proportion of the data is between 69 and 72 inches (taller than 69 but shorter than or equal to 72)?
## [1] 0.3337438
Suppose you only have avg and stdev below, but no access to x. Can you approximate the proportion of the data that is between 69 and 72 inches?
Use the normal approximation to estimate the proportion of the data that is between 69 and 72 inches.
Note that you can’t use x in your code, only avg and stdev. Also note that R has a function that may prove very helpful here: check out the pnorm function (and remember that you can get help by using ?pnorm).
x <- heights$height[heights$sex=="Male"]
avg <- mean(x)
stdev <- sd(x)
pnorm(72, avg, stdev) - pnorm(69, avg, stdev)
## [1] 0.3061779
However, the approximation is not always useful. An example is for the more extreme values, often called the “tails” of the distribution. Let’s look at an example. We can compute the proportion of heights between 79 and 81.
## [1] 0.004926108
Use normal approximation to estimate the proportion of heights between 79 and 81 inches and save it in an object called approx.
Report how many times bigger the actual proportion is compared to the approximation.
x <- heights$height[heights$sex == "Male"]
exact <- mean(x > 79 & x <= 81)
avg <- mean(x)
stdev <- sd(x)
approx <- pnorm(81, avg, stdev) - pnorm(79, avg, stdev)
exact/approx
## [1] 1.614261
First, we will estimate the proportion of adult men that are 7 feet tall or taller.
Assume that the heights of adult men in the world are normally distributed with an average of 69 inches and a standard deviation of 3 inches.
Using this approximation, estimate the proportion of adult men that are 7 feet tall or taller, referred to as seven footers. Print out your estimate; don’t store it in an object.
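One way to carry out this calculation, as a minimal sketch: seven feet is 84 inches, and pnorm returns the proportion at or below a value, so the proportion at or above it is one minus that. The mean of 69 and standard deviation of 3 are the values assumed above.

```r
# Proportion of adult men at least 7 feet (84 inches) tall under the
# assumed normal distribution with mean 69 inches and SD 3 inches
1 - pnorm(7 * 12, mean = 69, sd = 3)
```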
## [1] 2.866516e-07
We know that there are about 1 billion men between the ages of 18 and 40 in the world, the age range for the NBA.
Can we use the normal distribution to estimate how many of these 1 billion men are at least seven feet tall? Use your answer to the previous exercise to estimate the proportion of men that are seven feet tall or taller in the world and store that value as p.
Then round the number of 18-40 year old men who are seven feet tall or taller to the nearest integer. (Do not store this value in an object.)
## [1] 287
Use your answer to exercise 4 to estimate the proportion of men that are seven feet tall or taller in the world and store that value as p.
Use your answer to the previous exercise (exercise 5) to round the number of 18-40 year old men who are seven feet tall or taller to the nearest integer and store that value as N.
Then calculate the proportion of the world’s 18 to 40 year old seven footers that are in the NBA. (Do not store this value in an object.)
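The three steps described here can be chained as in the sketch below. It assumes the same normal model as before (mean 69, SD 3), 1 billion men aged 18 to 40, and about 10 NBA players who are seven feet or taller.

```r
# Proportion of seven footers under the normal approximation
p <- 1 - pnorm(7 * 12, mean = 69, sd = 3)

# Estimated number of seven footers among 1 billion men aged 18 to 40
N <- round(p * 10^9)

# Proportion of those seven footers who play in the NBA,
# given about 10 NBA players of that height
10 / N
```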
## [1] 0.03484321
## [1] 0.03484321
Repeat the calculations performed in the previous question for Lebron James’ height: 6 feet 8 inches. There are about 150 players, instead of 10, that are at least that tall in the NBA.
Report the estimated proportion of people at least Lebron’s height that are in the NBA.
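The same calculation for a height of 6 feet 8 inches (80 inches) might look like the following sketch. It again assumes the normal model with mean 69 and SD 3 and a population of 1 billion men aged 18 to 40.

```r
# Proportion of men at least 6 feet 8 inches (80 inches) tall
p <- 1 - pnorm(6 * 12 + 8, mean = 69, sd = 3)

# Estimated number of such men among 1 billion men aged 18 to 40
N <- round(p * 10^9)

# Proportion of them in the NBA, given about 150 NBA players of that height
150 / N
```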
## [1] 0.001220842
What would be a fair critique of our calculations?
A. Practice and talent are what make a great basketball player, not height.
B. The normal approximation is not appropriate for heights.
C. As seen in exercise 3, the normal approximation tends to underestimate the extreme values. It's possible that there are more seven footers than we predicted.
D. As seen in exercise 3, the normal approximation tends to overestimate the extreme values. It’s possible that there are fewer seven footers than we predicted.
male <- heights$height[heights$sex=="Male"]
female <- heights$height[heights$sex=="Female"]
length(male)
length(female)
## [1] 812
## [1] 238
male <- heights$height[heights$sex=="Male"]
female <- heights$height[heights$sex=="Female"]
male_percentiles <- quantile(male, seq(0.1, 0.9,0.2))
female_percentiles <- quantile(female, seq(0.1, 0.9,0.2))
df <- data.frame(female = female_percentiles, male = male_percentiles)
df
Which continent has the country with the largest population size?
A. Africa
B. Americas
C. Asia
D. Europe
E. Oceania
Which continent has the median country with the largest population?
A. Africa
B. Americas
C. Asia
D. Europe
E. Oceania
A. 100 million
B. 25 million
C. 10 million
D. 5 million
E. 1 million
A. 0.75
B. 0.50
C. 0.25
D. 0.01
A. Africa
B. Americas
C. Asia
D. Europe
E. Oceania
Compute the average and median of these data. Note: do not assign them to a variable.
## [1] 68.08847
## [1] 68.2
## [1] 2.517941
## [1] 2.9652
Now suppose that Galton made a mistake when entering the first value, forgetting to use the decimal point. You can imitate this error by typing:
library(HistData)
data(Galton)
x <- Galton$child
x_with_error <- x
x_with_error[1] <- x_with_error[1]*10
The data now has an outlier that the normal approximation does not account for. Let’s see how this affects the average.
Report how many inches the average grows after this mistake. Specifically, report the difference between the average of the data with the mistake x_with_error and the data without the mistake x.
## [1] 0.5983836
Report how many inches the SD grows after this mistake. Specifically, report the difference between the SD of the data with the mistake x_with_error and the data without the mistake x.
## [1] 15.6746
Now we are going to see how the median and MAD are much more resistant to outliers. For this reason we say that they are robust summaries.
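To illustrate what robust means here, the sketch below applies the same kind of decimal-point error to a short made-up vector. The ten values are hypothetical and are not Galton's data; only the pattern of the results carries over.

```r
# Hypothetical heights in inches (illustrative values, not the Galton data)
heights_in <- c(61.7, 64.0, 66.5, 67.2, 68.0, 68.5, 69.1, 70.3, 71.0, 72.4)

# Imitate a forgotten decimal point in the first entry
with_error <- heights_in
with_error[1] <- with_error[1] * 10

# Change in each summary caused by the single bad entry
shifts <- c(
  mean   = mean(with_error) - mean(heights_in),
  sd     = sd(with_error) - sd(heights_in),
  median = median(with_error) - median(heights_in),
  mad    = mad(with_error) - mad(heights_in)
)
shifts
```

In this toy example the average and SD move by many inches, while the median and MAD change by less than an inch.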
Report how many inches the median grows after the mistake. Specifically, report the difference between the median of the data with the mistake x_with_error and the data without the mistake x.
## [1] 0
Report how many inches the MAD grows after the mistake. Specifically, report the difference between the MAD of the data with the mistake x_with_error and the data without the mistake x.
## [1] 0
A. Since it is only one value out of many, we will not be able to detect this.
B. We would see an obvious shift in the distribution.
C. A boxplot, histogram, or qq-plot would reveal a clear outlier.
D. A scatter plot would show high levels of measurement error.
To see how outliers can affect the average of a dataset, let’s write a simple function that takes the size of the outlier as input and returns the average.
Write a function called error_avg that takes a value k and returns the average of the vector x after the first entry is changed to k. Show the results for k=10000 and k=-10000.
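A possible implementation is sketched below. It assumes x is the Galton$child vector defined earlier, and relies on the fact that assigning to x inside a function changes only a local copy.

```r
library(HistData)  # provides the Galton data used above
data(Galton)
x <- Galton$child

# Replace the first entry of x with k and return the resulting average.
# Only the function's local copy of x is modified; the global x is unchanged.
error_avg <- function(k) {
  x[1] <- k
  mean(x)
}

error_avg(10000)
error_avg(-10000)
```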
## [1] 78.79784
## [1] 57.24612
In Section 2, you will learn how to create data visualizations in R using ggplot2.
After completing Section 2, you will:
The textbook for this section is available here
Start by loading the dplyr and ggplot2 libraries as well as the murders and heights data.
Because data is the first argument we don’t need to spell it out
or, if we load dplyr, we can also use the pipe:
Remember the pipe sends the object on the left of %>% to be the first argument for the function on the right of %>%.
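The three ways of creating a ggplot object described above can be sketched as follows. This assumes the murders data comes from the dslabs package; the object names p1, p2, and p3 are chosen only for illustration.

```r
library(dplyr)    # provides the %>% pipe
library(ggplot2)
library(dslabs)   # assumed source of the murders dataset
data(murders)

# Three equivalent ways to associate the murders data with a ggplot object
p1 <- ggplot(data = murders)  # data argument spelled out
p2 <- ggplot(murders)         # data is the first argument, so its name can be dropped
p3 <- murders %>% ggplot()    # the pipe passes murders as the first argument

class(p3)
```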
What is the class of the object p?
## [1] "gg" "ggplot"
## [1] 2
## [1] 2
Print the object p defined in exercise one and describe what you see.
A. Nothing happens.
B. A blank slate plot.
C. A scatter plot.
D. A histogram.
# define ggplot object called p like in the previous exercise but using a pipe
p <- heights %>% ggplot()
#What is the class of the object p you have just created?
class(p)
## [1] "gg" "ggplot"
A. state and abb.
B. total_murders and population_size.
C. total and population.
D. murders and size.
murders %>% ggplot(aes(x = , y = )) + geom_point()
except we have to define the two variables x and y. Fill this out with the correct variable names.
Remake the plot but now with total in the x-axis and population in the y-axis.
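A sketch of both versions of the scatter plot, assuming the murders data from dslabs. The object names are hypothetical; typing either name at the console renders the plot.

```r
library(dplyr)
library(ggplot2)
library(dslabs)   # assumed source of the murders dataset
data(murders)

# Population on the x-axis and total murders on the y-axis
pop_vs_total <- murders %>%
  ggplot(aes(x = population, y = total)) +
  geom_point()

# The same plot with the axes swapped: total on x, population on y
total_vs_pop <- murders %>%
  ggplot(aes(total, population)) +
  geom_point()
```

Because x and y are the first two arguments of aes, their names can be omitted as in the second plot.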
murders %>% ggplot(aes(population, total)) + geom_label()
will give us the error message: Error: geom_label requires the following missing aesthetics: label
Why is this?
A. We need to map a character to each point through the label argument in aes.
B. We need to let geom_label know what character to use in the plot.
C. The geom_label geometry does not require x-axis and y-axis values.
D. geom_label is not a ggplot2 command.
## edit the next line to add the label
murders %>% ggplot(aes(population, total, label=abb)) +
geom_point()+
geom_label()
A. Adding a column called blue to murders
B. Because each label needs a different color we map the colors through aes
C. Use the color argument in ggplot
D. Because we want all colors to be blue, we do not need to map colors, just use the color argument in geom_label
A. Adding a column called color to murders with the color we want to use
B. Mapping the colors through the color argument of aes because each label needs a different color
C. Using the color argument in ggplot
D. Using the color argument in geom_label because we want all colors to be blue so we do not need to map colors
## edit this code
murders %>% ggplot(aes(population, total, label = abb,color=region)) +
geom_label()
To change the x-axis to a log scale we learned about the scale_x_log10() function. Add this layer to the object p to change the scale and render the plot.
p <- murders %>% ggplot(aes(population, total, label = abb, color = region)) + geom_label()
p + scale_x_log10()
#Repeat the previous exercise but now change both axes to be in the log scale.
p + scale_x_log10() + scale_y_log10()
p <- murders %>% ggplot(aes(population, total, label = abb, color = region)) +
geom_label()
# add a layer to add title to the next line
p + scale_x_log10() +
scale_y_log10() + ggtitle("Gun murder data")
We use the geom_histogram function to make a histogram of the heights in the heights data frame. When reading the documentation for this function we see that it requires just one mapping, the values to be used for the histogram.
What is the variable containing the heights in inches in the heights data frame?
A. sex
B. heights
C. height
D. heights$height
Create a ggplot object called p using the pipe to assign the heights data to a ggplot object. Assign height to the x values through the aes function.
Add a layer to the object p (created in the previous exercise) using the geom_histogram function to make the histogram.
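These two steps might be written as in the sketch below, assuming the heights data comes from dslabs. The name hist_plot is hypothetical.

```r
library(dplyr)
library(ggplot2)
library(dslabs)   # assumed source of the heights dataset
data(heights)

# Map height to the x aesthetic; no geometry has been added yet
p <- heights %>% ggplot(aes(height))

# Add the histogram layer; printing hist_plot renders it with the default bins
hist_plot <- p + geom_histogram()
```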
18. Histogram binwidth
Note that when we run the code from the previous exercise we get the following warning:
stat_bin() using bins = 30. Pick better value with binwidth.
Use the binwidth argument to change the histogram made in the previous exercise to use bins of size 1 inch.
p <- heights %>%
ggplot(aes(height))
## add the geom_histogram layer but with the requested argument
p + geom_histogram(binwidth=1)
Now instead of geom_histogram we will use geom_density to create a smooth density plot.
Add the appropriate layer to create a smooth density plot of heights.
## add the group argument then a layer with +
heights %>%
ggplot(aes(height,group = sex)) + geom_density()
We can also assign groups through the color or fill argument. For example, if you type color = sex, ggplot knows you want a different color for each sex, so two densities must be drawn. You can therefore skip the group = sex mapping. Using color has the added benefit that it uses color to distinguish the groups. Change the density plots from the previous exercise to add color.
## edit the next line to use color instead of group then add a density layer
heights %>%
ggplot(aes(height, color = sex))+
geom_density()
We can see what this looks like by running the following code:
However, here the second density is drawn over the other. We can change this by using something called alpha blending. Set the alpha parameter to 0.2 in the geom_density function to make this change.
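A minimal sketch of alpha blending, assuming the heights data from dslabs and assuming the plot in question maps sex to the fill aesthetic. The name density_plot is hypothetical.

```r
library(dplyr)
library(ggplot2)
library(dslabs)   # assumed source of the heights dataset
data(heights)

# Filled densities by sex; alpha = 0.2 makes each fill mostly transparent,
# so the density drawn first stays visible under the second
density_plot <- heights %>%
  ggplot(aes(height, fill = sex)) +
  geom_density(alpha = 0.2)
```

Here alpha is set outside aes because it is a fixed value for all groups rather than a mapping from a variable.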
Section 3 introduces you to summarizing with dplyr.
After completing Section 3, you will:
The textbook for this section is available here
Practice Exercise. National Center for Health Statistics
To practice our dplyr skills we will be working with data from the survey collected by the United States National Center for Health Statistics (NCHS). This center has conducted a series of health and nutrition surveys since the 1960s.
Starting in 1999, about 5,000 individuals of all ages have been interviewed every year and then they complete the health examination component of the survey. Part of this dataset is made available via the NHANES package which can be loaded this way:
The NHANES data has many missing values. Remember that the main summarization functions in R will return NA if any of the entries of the input vector is an NA. Here is an example:
## [1] NA
## [1] NA
To ignore the NAs, we can use the na.rm argument:
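The behavior can be sketched with the na_example vector from dslabs. That this is the vector behind the outputs shown here is an assumption.

```r
library(dslabs)    # assumed source of the na_example vector
data(na_example)

# With NAs present, the summaries are NA
mean(na_example)
sd(na_example)

# na.rm = TRUE drops the NAs before computing the summary
mean(na_example, na.rm = TRUE)
sd(na_example, na.rm = TRUE)
```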
## [1] 2.301754
## [1] 1.22338
First let’s select a group to set the standard. We will use 20-29 year old females. Note that the category is coded with 20-29, with a space in front of the 20! The AgeDecade is a categorical variable with these ages.
To know if someone is female, you can look at the Gender variable.
Filter the NHANES dataset so that only 20-29 year old females are included and assign this new data frame to the object tab.
Use the pipe to apply the function filter, with the appropriate logicals, to NHANES.
Remember that this age group is coded with 20-29, which includes a space. You can use head to explore the NHANES table to construct the correct call to filter.
library(dplyr)
library(NHANES)
data(NHANES)
## fill in what is needed
tab <- NHANES %>% filter(AgeDecade==" 20-29" & Gender=="female")
You will determine the average and standard deviation of systolic blood pressure, which are stored in the BPSysAve variable in the NHANES dataset.
Complete the line of code to save the average and standard deviation of systolic blood pressure as average and standard_deviation to a variable called ref.
Use the summarize function after filtering for 20-29 year old females and connect the results using the pipe %>%. When doing this remember there are NAs in the data!
## complete this line of code.
ref <- NHANES %>%
  filter(AgeDecade == " 20-29" & Gender == "female") %>%
  summarize(average = mean(BPSysAve, na.rm = TRUE),
            standard_deviation = sd(BPSysAve, na.rm = TRUE))
Modify the line of sample code to assign the average to a numeric variable called ref_avg.
## modify the code we wrote for previous exercise.
ref_avg <- NHANES %>%
filter(AgeDecade == " 20-29" & Gender == "female") %>%
summarize(average = mean(BPSysAve, na.rm = TRUE),
standard_deviation = sd(BPSysAve, na.rm=TRUE)) %>% .$average
Again we will do it for the BPSysAve variable and the group of 20-29 year old females.
Report the min and max values for the same group as in the previous exercises.
Use filter and summarize connected by the pipe %>% again. The functions min and max can be used to get the values you want.
Within summarize, save the min and max of systolic blood pressure as min and max.
## complete the line
NHANES %>%
  filter(AgeDecade == " 20-29" & Gender == "female") %>%
  summarize(min = min(BPSysAve, na.rm = TRUE),
            max = max(BPSysAve, na.rm = TRUE))
What we are about to do is a very common operation in data science: you will split a data table into groups and then compute summary statistics for each group.
We will compute the average and standard deviation of systolic blood pressure for females for each age group separately. Remember that the age groups are contained in AgeDecade.
Use the functions filter, group_by, summarize, and the pipe %>% to compute the average and standard deviation of systolic blood pressure for females for each age group separately.
Within summarize, save the average and standard deviation of systolic blood pressure (BPSysAve) as average and standard_deviation.
##complete the line with group_by and summarize
NHANES %>%
  filter(Gender == "female") %>%
  group_by(AgeDecade) %>%
  summarize(average = mean(BPSysAve, na.rm = TRUE),
            standard_deviation = sd(BPSysAve, na.rm = TRUE))
This time we will not provide much sample code. You are on your own!
Calculate the average and standard deviation of systolic blood pressure for males for each age group separately using the same methods as in the previous exercise.
NHANES %>%
  filter(Gender == "male") %>%
  group_by(AgeDecade) %>%
  summarize(average = mean(BPSysAve, na.rm = TRUE),
            standard_deviation = sd(BPSysAve, na.rm = TRUE))
We can use group_by(AgeDecade, Gender) to group by both age decades and gender.
Create a single summary table for the average and standard deviation of systolic blood pressure using group_by(AgeDecade, Gender).
Note that we no longer have to filter!
Your code within summarize should remain the same as in the previous exercises.
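Grouping by both variables at once might look like the sketch below. The name by_age_gender is hypothetical.

```r
library(dplyr)
library(NHANES)
data(NHANES)

# One row per combination of age decade and gender, with no filter step
by_age_gender <- NHANES %>%
  group_by(AgeDecade, Gender) %>%
  summarize(average = mean(BPSysAve, na.rm = TRUE),
            standard_deviation = sd(BPSysAve, na.rm = TRUE))
by_age_gender
```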
We will learn to use the arrange function to order the outcome according to one variable.
Note that this function can be used to order any table by a given outcome. Here is an example that arranges by systolic blood pressure.
If we want it in descending order we can use the desc function like this:
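As a sketch of both orderings, the code below arranges a summary table of average systolic blood pressure for females by age decade. The choice of table and the names female_bp, ascending, and descending are assumptions made for illustration.

```r
library(dplyr)
library(NHANES)
data(NHANES)

# Average systolic blood pressure for females in each age decade
female_bp <- NHANES %>%
  filter(Gender == "female") %>%
  group_by(AgeDecade) %>%
  summarize(average = mean(BPSysAve, na.rm = TRUE))

# Lowest to highest average
ascending <- female_bp %>% arrange(average)

# Highest to lowest average, using desc
descending <- female_bp %>% arrange(desc(average))
```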
In this example, we will compare systolic blood pressure across values of the Race1 variable for males between the ages of 40-49.
Compute the average and standard deviation for each value of Race1 for males in the age decade 40-49.
Order the resulting table from lowest to highest average systolic blood pressure.
Use the functions filter, group_by, summarize, arrange, and the pipe %>% to do this in one line of code.
Within summarize, save the average and standard deviation of systolic blood pressure as average and standard_deviation.
NHANES %>%
  filter(AgeDecade == " 40-49" & Gender == "male") %>%
  group_by(Race1) %>%
  summarize(average = mean(BPSysAve, na.rm = TRUE),
            standard_deviation = sd(BPSysAve, na.rm = TRUE)) %>%
  arrange(average)
In Section 4, you will look at a case study involving data from the Gapminder Foundation about trends in world health and economics.
After completing Section 4, you will:
The textbook for this section is available here
## fill out the missing parts in filter and aes
gapminder %>% filter(continent == "Africa" & year == 2012) %>%
ggplot(aes(fertility,life_expectancy)) +
geom_point()
Remake the plot from the previous exercises but this time use color to distinguish the different regions of Africa to see if this explains the clusters. Remember that you can explore the gapminder data to see how the regions of Africa are labeled in the dataframe!
data(gapminder)
gapminder %>% filter(continent == "Africa" & year == 2012) %>%
ggplot(aes(fertility,life_expectancy, color=region)) +
geom_point()
df <- gapminder %>% filter(continent == "Africa" & year == 2012 & fertility <= 3 & life_expectancy >= 70) %>%
select(country,region)
Use a single line of code to create a time series plot from 1960 to 2010 of life expectancy vs year for Cambodia.
data(gapminder)
gapminder %>% filter(year>=1960 & year <= 2010 & country=="Cambodia") %>% ggplot(aes(year,life_expectancy)) + geom_line()
In the first part of this analysis, we will create the dollars per day variable.
data(gapminder)
daydollars <- gapminder %>% mutate(dollars_per_day=gdp/population/365)%>% filter(year==2010 & continent=="Africa" & !is.na(dollars_per_day))
In the second part of this analysis, we will plot the smooth density plot using a log (base 2) x axis.
data(gapminder)
daydollars <- gapminder %>% mutate(dollars_per_day=gdp/population/365)%>% filter(year %in% c(1970,2010) & continent=="Africa" & !is.na(dollars_per_day))
daydollars %>% ggplot(aes(dollars_per_day)) + geom_density() + scale_x_continuous(trans='log2') + facet_grid(.~year)
Much of the code will be the same as in Exercise 9:
data(gapminder)
daydollars <- gapminder %>% mutate(dollars_per_day=gdp/population/365)%>% filter(year %in% c(1970,2010) & continent=="Africa" & !is.na(dollars_per_day))
daydollars %>% ggplot(aes(dollars_per_day,fill = region)) + geom_density(bw=0.5,position='stack') + scale_x_continuous(trans='log2') + facet_grid(.~year)
data(gapminder)
gapminder_Africa_2010 <- daydollars <- gapminder %>% mutate(dollars_per_day=gdp/population/365)%>% filter(year %in% c(2010) & continent=="Africa" & !is.na(dollars_per_day))
# now make the scatter plot
gapminder_Africa_2010 %>% ggplot(aes(dollars_per_day,infant_mortality,color = region)) + geom_point()
As an example, one country has infant mortality rates of less than 20 per 1000 and dollars per day of 16, while another country has infant mortality rates over 10% and dollars per day of about 1.
In this exercise, we will remake the plot from Exercise 12 with country names instead of points so we can identify which countries are which.
gapminder_Africa_2010 %>% ggplot(aes(dollars_per_day,infant_mortality, color=region,label = country)) + geom_point() + scale_x_continuous(trans='log2') +
geom_text()
data(gapminder)
gapminder %>%
filter(continent == "Africa" & year %in% c(1970,2010) & !is.na(gdp) & !is.na(year) & !is.na(infant_mortality)) %>%
mutate(dollars_per_day = gdp/population/365) %>%
ggplot(aes(dollars_per_day, infant_mortality, color = region,label = country)) +
geom_point() +
scale_x_continuous(trans = "log2") +
geom_text() +
facet_grid(year~.)
Section 5 covers some general principles that can serve as guides for effective data visualization.
After completing Section 5, you will:
The textbook for this section is available here
1: Customizing plots - Pie charts
Pie charts are appropriate:
A. When we want to display percentages.
B. When ggplot2 is not available.
C. When I am in a bakery.
D. Never. Barplots and tables are always better.
A. The values are wrong. The final vote was 306 to 232.
B. The axis does not start at 0. Judging by the length, it appears Trump received 3 times as many votes when in fact it was about 30% more.
C. The colors should be the same.
D. Percentages should be shown as a pie chart.
3: Customizing plots - What’s wrong 2?
Take a look at the following two plots. They show the same information: rates of measles by state in the United States for 1928.
A. Both plots provide the same information, so they are equally good.
B. The plot on the left is better because it orders the states alphabetically.
C. The plot on the right is better because it orders the states by disease rate so we can quickly see the states with highest and lowest rates.
D. Both plots should be pie charts instead.
1: Customizing plots - watch and learn
To make the plot on the right in the exercise from the last set of assessments, we had to reorder the levels of the states’ variables.
dat <- us_contagious_diseases %>%
filter(year == 1967 & disease=="Measles" & !is.na(population)) %>% mutate(rate = count / population * 10000 * 52 / weeks_reporting)
state <- dat$state
rate <- dat$count/(dat$population/10000)*(52/dat$weeks_reporting)
state <- reorder(state,rate)
print(state)
## [1] Alabama Alaska Arizona
## [4] Arkansas California Colorado
## [7] Connecticut Delaware District Of Columbia
## [10] Florida Georgia Hawaii
## [13] Idaho Illinois Indiana
## [16] Iowa Kansas Kentucky
## [19] Louisiana Maine Maryland
## [22] Massachusetts Michigan Minnesota
## [25] Mississippi Missouri Montana
## [28] Nebraska Nevada New Hampshire
## [31] New Jersey New Mexico New York
## [34] North Carolina North Dakota Ohio
## [37] Oklahoma Oregon Pennsylvania
## [40] Rhode Island South Carolina South Dakota
## [43] Tennessee Texas Utah
## [46] Vermont Virginia Washington
## [49] West Virginia Wisconsin Wyoming
## attr(,"scores")
## Alabama Alaska Arizona
## 4.16107582 5.46389893 6.32695891
## Arkansas California Colorado
## 6.87899954 2.79313560 7.96331905
## Connecticut Delaware District Of Columbia
## 0.36986840 1.13098183 0.35873614
## Florida Georgia Hawaii
## 2.89358806 0.09987991 2.50173748
## Idaho Illinois Indiana
## 6.03115170 1.20115480 1.34027323
## Iowa Kansas Kentucky
## 2.94948911 0.66386422 4.74576011
## Louisiana Maine Maryland
## 0.46088071 2.57520433 0.49922233
## Massachusetts Michigan Minnesota
## 0.74762338 1.33466700 0.37722410
## Mississippi Missouri Montana
## 3.11366532 0.75696354 5.00433320
## Nebraska Nevada New Hampshire
## 3.64389801 6.43683882 0.47181511
## New Jersey New Mexico New York
## 0.88414264 6.15969926 0.66849058
## North Carolina North Dakota Ohio
## 1.92529764 14.48024642 1.16382241
## Oklahoma Oregon Pennsylvania
## 3.27496900 8.75036439 0.67687303
## Rhode Island South Carolina South Dakota
## 0.68207448 2.10412531 0.90289534
## Tennessee Texas Utah
## 5.47344506 12.49773953 4.03005836
## Vermont Virginia Washington
## 1.00970314 5.28270939 17.65180349
## West Virginia Wisconsin Wyoming
## 8.59456463 4.96246019 6.97303449
## 51 Levels: Georgia District Of Columbia Connecticut ... Washington
## [1] "Georgia" "District Of Columbia" "Connecticut"
## [4] "Minnesota" "Louisiana" "New Hampshire"
## [7] "Maryland" "Kansas" "New York"
## [10] "Pennsylvania" "Rhode Island" "Massachusetts"
## [13] "Missouri" "New Jersey" "South Dakota"
## [16] "Vermont" "Delaware" "Ohio"
## [19] "Illinois" "Michigan" "Indiana"
## [22] "North Carolina" "South Carolina" "Hawaii"
## [25] "Maine" "California" "Florida"
## [28] "Iowa" "Mississippi" "Oklahoma"
## [31] "Nebraska" "Utah" "Alabama"
## [34] "Kentucky" "Wisconsin" "Montana"
## [37] "Virginia" "Alaska" "Tennessee"
## [40] "Idaho" "New Mexico" "Arizona"
## [43] "Nevada" "Arkansas" "Wyoming"
## [46] "Colorado" "West Virginia" "Oregon"
## [49] "Texas" "North Dakota" "Washington"
2: Customizing plots - redefining
Now we are going to customize this plot a little more by creating a rate variable and reordering by that variable instead.
data(us_contagious_diseases)
dat <- us_contagious_diseases %>% filter(year == 1967 & disease=="Measles" & count>0 & !is.na(population)) %>%
mutate(rate = count / population * 10000 * 52 / weeks_reporting) %>% mutate(state = reorder(state, rate))
dat %>% ggplot(aes(state, rate)) +
geom_bar(stat="identity") +
coord_flip()
3: Showing the data and customizing plots
Say we are interested in comparing gun homicide rates across regions of the US. We see this plot:
data("murders")
murders %>% mutate(rate = total/population*100000) %>%
group_by(region) %>%
summarize(avg = mean(rate)) %>%
mutate(region = factor(region)) %>%
ggplot(aes(region, avg)) +
geom_bar(stat="identity") +
ylab("Murder Rate Average")
and decide to move to a state in the western region. What is the main problem with this interpretation?
A. The categories are ordered alphabetically.
B. The graph does not show standard errors.
C. It does not show all the data. We do not see the variability within a region and it's possible that the safest states are not in the West.
D. The Northeast has the lowest average.
4: Making a box plot
To further investigate whether moving to the western region is a wise decision, let’s make a box plot of murder rates by region, showing all points.
data("murders")
murders %>% mutate(rate = total/population*100000) %>%
mutate(region=reorder(region, rate, FUN=median)) %>%
ggplot(aes(region, rate)) +
geom_boxplot() +
geom_point()
the_disease = "Measles"
dat <- us_contagious_diseases %>%
filter(!state%in%c("Hawaii","Alaska") & disease == the_disease) %>%
mutate(rate = count / population * 10000) %>%
mutate(state = reorder(state, rate))
dat %>% ggplot(aes(year, state, fill = rate)) +
geom_tile(color = "grey50") +
scale_x_continuous(expand=c(0,0)) +
scale_fill_gradientn(colors = brewer.pal(9, "Reds"), trans = "sqrt") +
theme_minimal() +
theme(panel.grid = element_blank()) +
ggtitle(the_disease) +
ylab("") +
xlab("")
the_disease = "Smallpox"
dat <- us_contagious_diseases %>%
filter(!state%in%c("Hawaii","Alaska") & disease == the_disease & !weeks_reporting<10) %>%
mutate(rate = count / population * 10000) %>%
mutate(state = reorder(state, rate))
dat %>% ggplot(aes(year, state, fill = rate)) +
geom_tile(color = "grey50") +
scale_x_continuous(expand=c(0,0)) +
scale_fill_gradientn(colors = brewer.pal(9, "Reds"), trans = "sqrt") +
theme_minimal() +
theme(panel.grid = element_blank()) +
ggtitle(the_disease) +
ylab("") +
xlab("")
data(us_contagious_diseases)
the_disease = "Measles"
dat <- us_contagious_diseases %>%
filter(!state%in%c("Hawaii","Alaska") & disease == the_disease) %>%
mutate(rate = count / population * 10000) %>%
mutate(state = reorder(state, rate))
str(dat)
## 'data.frame': 3724 obs. of 7 variables:
## $ disease : Factor w/ 7 levels "Hepatitis A",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ state : Factor w/ 51 levels "Mississippi",..: 9 9 9 9 9 9 9 9 9 9 ...
## ..- attr(*, "scores")= num [1:51(1d)] 9.27 NA 24.15 9.37 19.16 ...
## .. ..- attr(*, "dimnames")=List of 1
## .. .. ..$ : chr "Alabama" "Alaska" "Arizona" "Arkansas" ...
## $ year : num 1928 1929 1930 1931 1932 ...
## $ weeks_reporting: int 52 49 52 49 41 51 52 49 40 49 ...
## $ count : num 8843 2959 4156 8934 270 ...
## $ population : num 2589923 2619131 2646248 2670818 2693027 ...
## $ rate : num 34.1 11.3 15.7 33.5 1 ...
avg <- us_contagious_diseases %>%
filter(disease==the_disease) %>% group_by(year) %>%
summarize(us_rate = sum(count, na.rm=TRUE)/sum(population, na.rm=TRUE)*10000)
dat %>% ggplot() +
geom_line(aes(year, rate, group = state), color = "grey50",
show.legend = FALSE, alpha = 0.2, size = 1) +
geom_line(mapping = aes(year, us_rate), data = avg, size = 1, color = "black") +
scale_y_continuous(trans = "sqrt", breaks = c(5,25,125,300)) +
ggtitle("Cases per 10,000 by state") +
xlab("") +
ylab("") +
geom_text(data = data.frame(x=1955, y=50), mapping = aes(x, y, label="US average"), color="black") +
geom_vline(xintercept=1963, col = "blue")
data(us_contagious_diseases)
the_disease = "Smallpox"
dat <- us_contagious_diseases %>%
filter(!state%in%c("Hawaii","Alaska") & disease == the_disease & !weeks_reporting<10) %>%
mutate(rate = count / population * 10000) %>%
mutate(state = reorder(state, rate))
str(dat)
## 'data.frame': 1014 obs. of 7 variables:
## $ disease : Factor w/ 7 levels "Hepatitis A",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ state : Factor w/ 51 levels "Rhode Island",..: 17 17 17 17 17 17 17 17 17 17 ...
## ..- attr(*, "scores")= num [1:51(1d)] 0.382 NA 2.011 0.805 0.924 ...
## .. ..- attr(*, "dimnames")=List of 1
## .. .. ..$ : chr "Alabama" "Alaska" "Arizona" "Arkansas" ...
## $ year : num 1928 1929 1930 1931 1932 ...
## $ weeks_reporting: int 51 52 52 52 52 52 52 52 51 52 ...
## $ count : num 341 378 192 295 467 82 23 42 12 54 ...
## $ population : num 2589923 2619131 2646248 2670818 2693027 ...
## $ rate : num 1.317 1.443 0.726 1.105 1.734 ...
avg <- us_contagious_diseases %>%
filter(disease==the_disease) %>% group_by(year) %>%
summarize(us_rate = sum(count, na.rm=TRUE)/sum(population, na.rm=TRUE)*10000)
dat %>% ggplot() +
geom_line(aes(year, rate, group = state), color = "grey50",
show.legend = FALSE, alpha = 0.2, size = 1) +
geom_line(mapping = aes(year, us_rate), data = avg, size = 1, color = "black") +
scale_y_continuous(trans = "sqrt", breaks = c(5,25,125,300)) +
ggtitle("Cases per 10,000 by state") +
xlab("") +
ylab("") +
geom_text(data = data.frame(x=1955, y=50), mapping = aes(x, y, label="US average"), color="black") +
geom_vline(xintercept=1963, col = "blue")
data(us_contagious_diseases)
us_contagious_diseases %>% filter(state=="California" & !weeks_reporting<10) %>%
group_by(year, disease) %>%
summarize(rate = sum(count)/sum(population)*10000) %>%
ggplot(aes(year, rate,color=disease)) +
geom_line()
data(us_contagious_diseases)
us_contagious_diseases %>% filter(!is.na(population)) %>%
group_by(year, disease) %>%
summarize(rate=sum(count)/sum(population)*10000) %>%
ggplot(aes(year, rate,color=disease)) + geom_line()