Section 11.13
For these exercises, we will be using the vaccines data in the dslabs package:
library(dslabs)
library(ggplot2)
library(dplyr)
library(tidyverse)
data(us_contagious_diseases)
1. Pie charts are appropriate:
D. Never. Barplots and tables are always better.
2. What is the problem with the plot below:
B. The axis does not start at 0. Judging by the length, it appears Trump received 3 times as many votes when, in fact, it was about 30% more.
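The size of the distortion can be checked with a little arithmetic. The sketch below assumes the 2016 electoral vote totals of 306 and 232; these figures are not stated in the exercise text and are supplied here as an assumption. The `baseline` argument is a hypothetical axis starting point chosen for illustration.

```r
# Assumed electoral vote totals; not given in the exercise text above
votes <- c(Trump = 306, Clinton = 232)

# True ratio of the two counts (roughly 30% more)
true_ratio <- votes[["Trump"]] / votes[["Clinton"]]

# Ratio of the drawn bar lengths when the axis starts at `baseline` instead of 0
bar_ratio <- function(baseline) {
  (votes[["Trump"]] - baseline) / (votes[["Clinton"]] - baseline)
}

bar_ratio(0)    # identical to true_ratio
bar_ratio(195)  # hypothetical truncated axis: the longer bar is 3 times the shorter one
```

Under these assumptions, an axis starting near 195 is enough to turn a difference of about 30% into bars that differ by a factor of 3.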
3. Take a look at the following two plots. They show the same information: 1928 rates of measles across the 50 states.
4. To make the plot on the left, we have to reorder the levels of the states’ variables.
dat <- us_contagious_diseases |>
  filter(year == 1967 & disease == "Measles" & !is.na(population)) |>
  mutate(rate = count / population * 10000 * 52 / weeks_reporting)
Note what happens when we make a barplot:
dat |> ggplot(aes(state, rate)) + geom_bar(stat="identity") + coord_flip()
Define these objects. Redefine the state object so that the levels are re-ordered. Print the new object state and its levels so you can see that the vector is not re-ordered by the levels.
state <- dat$state
rate <- dat$count / dat$population * 10000 * 52 / dat$weeks_reporting
state <- reorder(state, rate, FUN = mean)
state
## [1] Alabama Alaska Arizona
## [4] Arkansas California Colorado
## [7] Connecticut Delaware District Of Columbia
## [10] Florida Georgia Hawaii
## [13] Idaho Illinois Indiana
## [16] Iowa Kansas Kentucky
## [19] Louisiana Maine Maryland
## [22] Massachusetts Michigan Minnesota
## [25] Mississippi Missouri Montana
## [28] Nebraska Nevada New Hampshire
## [31] New Jersey New Mexico New York
## [34] North Carolina North Dakota Ohio
## [37] Oklahoma Oregon Pennsylvania
## [40] Rhode Island South Carolina South Dakota
## [43] Tennessee Texas Utah
## [46] Vermont Virginia Washington
## [49] West Virginia Wisconsin Wyoming
## attr(,"scores")
## Alabama Alaska Arizona
## 4.16107582 5.46389893 6.32695891
## Arkansas California Colorado
## 6.87899954 2.79313560 7.96331905
## Connecticut Delaware District Of Columbia
## 0.36986840 1.13098183 0.35873614
## Florida Georgia Hawaii
## 2.89358806 0.09987991 2.50173748
## Idaho Illinois Indiana
## 6.03115170 1.20115480 1.34027323
## Iowa Kansas Kentucky
## 2.94948911 0.66386422 4.74576011
## Louisiana Maine Maryland
## 0.46088071 2.57520433 0.49922233
## Massachusetts Michigan Minnesota
## 0.74762338 1.33466700 0.37722410
## Mississippi Missouri Montana
## 3.11366532 0.75696354 5.00433320
## Nebraska Nevada New Hampshire
## 3.64389801 6.43683882 0.47181511
## New Jersey New Mexico New York
## 0.88414264 6.15969926 0.66849058
## North Carolina North Dakota Ohio
## 1.92529764 14.48024642 1.16382241
## Oklahoma Oregon Pennsylvania
## 3.27496900 8.75036439 0.67687303
## Rhode Island South Carolina South Dakota
## 0.68207448 2.10412531 0.90289534
## Tennessee Texas Utah
## 5.47344506 12.49773953 4.03005836
## Vermont Virginia Washington
## 1.00970314 5.28270939 17.65180349
## West Virginia Wisconsin Wyoming
## 8.59456463 4.96246019 6.97303449
## 51 Levels: Georgia District Of Columbia Connecticut Minnesota ... Washington
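The same behaviour is easier to see on a small example. The sketch below uses a made-up three-element factor (the names `toy_state` and `toy_rate` are hypothetical, not part of the exercise) to show that `reorder` changes the order of the levels while leaving the elements of the vector in place.

```r
# Toy data: three labels with made-up rates
toy_state <- factor(c("A", "B", "C"))
toy_rate <- c(3, 1, 2)

# FUN = mean is the default summary used by reorder()
reordered <- reorder(toy_state, toy_rate)

as.character(reordered)  # elements stay in their original order: "A" "B" "C"
levels(reordered)        # levels are sorted by increasing rate: "B" "C" "A"
```

Plotting functions such as `ggplot` use the level order, not the element order, to place categories along an axis, which is why reordering the levels is enough to sort the bars.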
5. Now with one line of code, define the dat table as done above, but change mutate to create a rate variable and re-order the state variable so that the levels are re-ordered by this variable. Then make a barplot using the code above, but for this new dat.
dat <- us_contagious_diseases |>
  filter(year == 1967 & disease == "Measles" & !is.na(population)) |>
  mutate(rate = count / population * 10000 * 52 / weeks_reporting,
         state = reorder(state, rate, FUN = mean))
dat |> ggplot(aes(state, rate)) + geom_bar(stat="identity") + coord_flip()
6. Say we are interested in comparing gun homicide rates across regions of the US. We see the plot produced by the code below and decide to move to a state in the western region. What is the main problem with this interpretation?
library(dslabs)
data("murders")
murders |> mutate(rate = total/population*100000) |> group_by(region) |>
summarize(avg = mean(rate)) |> mutate(region = factor(region)) |>
ggplot(aes(region, avg)) + geom_bar(stat="identity") + ylab("Murder Rate Average")
C. It does not show all the data. We do not see the variability within a region and it’s possible that the safest states are not in the West.
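One way to see what the barplot hides is to summarize the spread of state rates within each region. This is a minimal sketch using the same `murders` data; the object name `region_spread` and the choice of minimum, median, and maximum as summaries are illustrative assumptions.

```r
library(dslabs)
library(dplyr)
data("murders")

# Per-region spread of state-level murder rates (per 100,000 people)
region_spread <- murders |>
  mutate(rate = total / population * 100000) |>
  group_by(region) |>
  summarize(min_rate = min(rate),
            median_rate = median(rate),
            max_rate = max(rate),
            n_states = n())

region_spread
```

If the ranges overlap across regions, the regional average alone cannot tell us where the safest states are.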
7. Make a boxplot of the murder rates, defined as total/population*100000, by region, showing all the points and ordering the regions by their median rate.
data("murders")
murders |>
  mutate(rate = total / population * 100000) |>
  mutate(region = reorder(region, rate, FUN = median)) |>
  ggplot(aes(region, rate)) +
  geom_boxplot() +
  geom_point()
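With `geom_point()` alone, states with similar rates can be drawn on top of each other. A possible variation, offered as a sketch rather than the required answer, replaces it with `geom_jitter()` and hides the boxplot's own outlier markers so that no state is drawn twice. The jitter width of 0.1 is an arbitrary choice.

```r
library(dslabs)
library(dplyr)
library(ggplot2)
data("murders")

p <- murders |>
  mutate(rate = total / population * 100000,
         region = reorder(region, rate, FUN = median)) |>
  ggplot(aes(region, rate)) +
  geom_boxplot(outlier.shape = NA) +     # outliers are already shown by the jittered points
  geom_jitter(width = 0.1, height = 0)   # horizontal jitter only, so rates are not altered

p
```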
8. The plots below show three continuous variables.
The line x=2 appears to separate the points. But this is actually not the case, which we can see by plotting the data in a couple of two-dimensional plots.
![](https://rafalab.dfci.harvard.edu/dsbook/book_files/figure-html/pseud-3d-exercise-2-1.png)
Why is this happening?
A. Humans are not good at reading pseudo-3D plots.