Complete Questions 1-3, 5-8 for exercises 11.13 in Irizarry text.
data-visualization-principles-exercises-20
11.13 Exercises For these exercises, we will be using the vaccines data in the dslabs package:
library(dslabs)
data(us_contagious_diseases)
a. When we want to display percentages.
b. When ggplot2 is not available.
c. When I am in a bakery.
d. Never. Barplots and tables are always better.
…humans are not good at precisely quantifying angles and are even worse when area is the only available visual cue.
If for some reason you need to make a pie chart, label each pie slice with its respective percentage
Question 2 Plot
a. The values are wrong. The final vote was 306 to
232.
b. The axis does not start at 0. Judging by the length, it appears Trump received 3 times as many votes when, in fact, it was about 30% more. It makes you think then differences are larger than they actually are.
When using position rather than length, it is then not necessary to include 0. Here they are comparing the differences between groups relative to the within-group variability.
c. The colors should be the same.
d. Percentages should be shown as a pie chart.
Question 3 plots
Which plot is easier to read if you are interested in determining which are the best and worst states in terms of rates, and why?
a. They provide the same information, so they are both equally
as good.
b. The plot on the right is better because it orders the states
alphabetically.
c. The plot on the left is better because alphabetical order has nothing to do with the disease and by ordering according to actual rate, we quickly see the states with most and least rates. Ordering the plots by value (or meaningful category) is easier to look at and interpret.
d. Both plots should be a pie chart.
colnames(us_contagious_diseases)
## [1] "disease" "state" "year" "weeks_reporting"
## [5] "count" "population"
dat <- us_contagious_diseases |>
filter(year == 1967 & disease=="Measles" & !is.na(population)) |>
mutate(rate = count / population * 10000 * 52 / weeks_reporting)
colnames(dat)
## [1] "disease" "state" "year" "weeks_reporting"
## [5] "count" "population" "rate"
Note what happens when we make a barplot:
dat |> ggplot(aes(state, rate)) +
geom_bar(stat="identity") +
coord_flip()
Define these objects:
state <- dat$state
rate <- dat$count/dat$population*10000*52/dat$weeks_reporting
Redefine the
stateobject so that thelevelsarereordered.stateand itslevelsso you can see that the vector is not re-ordered by the levels.
state <- reorder(dat$state, dat$rate)
print(state)
## [1] Alabama Alaska Arizona
## [4] Arkansas California Colorado
## [7] Connecticut Delaware District Of Columbia
## [10] Florida Georgia Hawaii
## [13] Idaho Illinois Indiana
## [16] Iowa Kansas Kentucky
## [19] Louisiana Maine Maryland
## [22] Massachusetts Michigan Minnesota
## [25] Mississippi Missouri Montana
## [28] Nebraska Nevada New Hampshire
## [31] New Jersey New Mexico New York
## [34] North Carolina North Dakota Ohio
## [37] Oklahoma Oregon Pennsylvania
## [40] Rhode Island South Carolina South Dakota
## [43] Tennessee Texas Utah
## [46] Vermont Virginia Washington
## [49] West Virginia Wisconsin Wyoming
## attr(,"scores")
## Alabama Alaska Arizona
## 4.16107582 5.46389893 6.32695891
## Arkansas California Colorado
## 6.87899954 2.79313560 7.96331905
## Connecticut Delaware District Of Columbia
## 0.36986840 1.13098183 0.35873614
## Florida Georgia Hawaii
## 2.89358806 0.09987991 2.50173748
## Idaho Illinois Indiana
## 6.03115170 1.20115480 1.34027323
## Iowa Kansas Kentucky
## 2.94948911 0.66386422 4.74576011
## Louisiana Maine Maryland
## 0.46088071 2.57520433 0.49922233
## Massachusetts Michigan Minnesota
## 0.74762338 1.33466700 0.37722410
## Mississippi Missouri Montana
## 3.11366532 0.75696354 5.00433320
## Nebraska Nevada New Hampshire
## 3.64389801 6.43683882 0.47181511
## New Jersey New Mexico New York
## 0.88414264 6.15969926 0.66849058
## North Carolina North Dakota Ohio
## 1.92529764 14.48024642 1.16382241
## Oklahoma Oregon Pennsylvania
## 3.27496900 8.75036439 0.67687303
## Rhode Island South Carolina South Dakota
## 0.68207448 2.10412531 0.90289534
## Tennessee Texas Utah
## 5.47344506 12.49773953 4.03005836
## Vermont Virginia Washington
## 1.00970314 5.28270939 17.65180349
## West Virginia Wisconsin Wyoming
## 8.59456463 4.96246019 6.97303449
## 51 Levels: Georgia District Of Columbia Connecticut Minnesota ... Washington
print(levels(state))
## [1] "Georgia" "District Of Columbia" "Connecticut"
## [4] "Minnesota" "Louisiana" "New Hampshire"
## [7] "Maryland" "Kansas" "New York"
## [10] "Pennsylvania" "Rhode Island" "Massachusetts"
## [13] "Missouri" "New Jersey" "South Dakota"
## [16] "Vermont" "Delaware" "Ohio"
## [19] "Illinois" "Michigan" "Indiana"
## [22] "North Carolina" "South Carolina" "Hawaii"
## [25] "Maine" "California" "Florida"
## [28] "Iowa" "Mississippi" "Oklahoma"
## [31] "Nebraska" "Utah" "Alabama"
## [34] "Kentucky" "Wisconsin" "Montana"
## [37] "Virginia" "Alaska" "Tennessee"
## [40] "Idaho" "New Mexico" "Arizona"
## [43] "Nevada" "Arkansas" "Wyoming"
## [46] "Colorado" "West Virginia" "Oregon"
## [49] "Texas" "North Dakota" "Washington"
dat table as done
above, but change the use mutate to create a
rate variable and reorder the
state variable so that the levels are re-ordered by this
variable.dat <- dat |>
mutate(rate=rate) |>
mutate(state=state)
Then make a barplot using the code above, but for this new
dat.
dat |>
ggplot(aes(state, rate)) +
geom_bar(stat="identity")
dat |>
ggplot(aes(state, rate)) +
geom_bar(stat="identity") +
coord_flip()
This didn’t work. But that’s OK because I need to do it using one line of code
dat <- us_contagious_diseases |>
filter(year == 1967 & disease=="Measles" & count>0 & !is.na(population)) |>
mutate(rate = count / population * 10000 * 52 / weeks_reporting) |>
mutate(state = reorder(state, rate, FUN = mean)) |>
ggplot(aes(state, rate)) +
geom_bar(stat="identity") +
coord_flip()
dat
library(dslabs)
data("murders")
murders |> mutate(rate = total/population*100000) |>
group_by(region) |>
summarize(avg = mean(rate)) |>
mutate(region = factor(region)) |>
ggplot(aes(region, avg)) +
geom_bar(stat="identity") +
ylab("Murder Rate Average")
and decide to move to a state in the western region. What is the main problem with this interpretation?
a. The categories are ordered alphabetically.
b. The graph does not show standerd errors.
c. It does not show all the data. We do not see the variability within a region and it’s possible that the safest states are not in the West.
d. The Northeast has the lowest average.
rates defined asdata("murders")
murders |> mutate(rate = total/population*100000)
## state abb region population total rate
## 1 Alabama AL South 4779736 135 2.8244238
## 2 Alaska AK West 710231 19 2.6751860
## 3 Arizona AZ West 6392017 232 3.6295273
## 4 Arkansas AR South 2915918 93 3.1893901
## 5 California CA West 37253956 1257 3.3741383
## 6 Colorado CO West 5029196 65 1.2924531
## 7 Connecticut CT Northeast 3574097 97 2.7139722
## 8 Delaware DE South 897934 38 4.2319369
## 9 District of Columbia DC South 601723 99 16.4527532
## 10 Florida FL South 19687653 669 3.3980688
## 11 Georgia GA South 9920000 376 3.7903226
## 12 Hawaii HI West 1360301 7 0.5145920
## 13 Idaho ID West 1567582 12 0.7655102
## 14 Illinois IL North Central 12830632 364 2.8369608
## 15 Indiana IN North Central 6483802 142 2.1900730
## 16 Iowa IA North Central 3046355 21 0.6893484
## 17 Kansas KS North Central 2853118 63 2.2081106
## 18 Kentucky KY South 4339367 116 2.6732010
## 19 Louisiana LA South 4533372 351 7.7425810
## 20 Maine ME Northeast 1328361 11 0.8280881
## 21 Maryland MD South 5773552 293 5.0748655
## 22 Massachusetts MA Northeast 6547629 118 1.8021791
## 23 Michigan MI North Central 9883640 413 4.1786225
## 24 Minnesota MN North Central 5303925 53 0.9992600
## 25 Mississippi MS South 2967297 120 4.0440846
## 26 Missouri MO North Central 5988927 321 5.3598917
## 27 Montana MT West 989415 12 1.2128379
## 28 Nebraska NE North Central 1826341 32 1.7521372
## 29 Nevada NV West 2700551 84 3.1104763
## 30 New Hampshire NH Northeast 1316470 5 0.3798036
## 31 New Jersey NJ Northeast 8791894 246 2.7980319
## 32 New Mexico NM West 2059179 67 3.2537239
## 33 New York NY Northeast 19378102 517 2.6679599
## 34 North Carolina NC South 9535483 286 2.9993237
## 35 North Dakota ND North Central 672591 4 0.5947151
## 36 Ohio OH North Central 11536504 310 2.6871225
## 37 Oklahoma OK South 3751351 111 2.9589340
## 38 Oregon OR West 3831074 36 0.9396843
## 39 Pennsylvania PA Northeast 12702379 457 3.5977513
## 40 Rhode Island RI Northeast 1052567 16 1.5200933
## 41 South Carolina SC South 4625364 207 4.4753235
## 42 South Dakota SD North Central 814180 8 0.9825837
## 43 Tennessee TN South 6346105 219 3.4509357
## 44 Texas TX South 25145561 805 3.2013603
## 45 Utah UT West 2763885 22 0.7959810
## 46 Vermont VT Northeast 625741 2 0.3196211
## 47 Virginia VA South 8001024 250 3.1246001
## 48 Washington WA West 6724540 93 1.3829942
## 49 West Virginia WV South 1852994 27 1.4571013
## 50 Wisconsin WI North Central 5686986 97 1.7056487
## 51 Wyoming WY West 563626 5 0.8871131
by region, showing all the points and
-reordering the regions by their median
rate.
murders |>
mutate(region = reorder(region, rate, FUN = median)) |>
ggplot(aes(region, rate)) +
geom_boxplot() +
geom_jitter(width = 0.1, alpha = 0.2)
Question 8a Plots
The line
x=2appears to separate the points. But it is actually not the case, which we can see by plotting the data in a couple of two-dimensional points.
Question 8b Plots
Why is this happening?
a. Humans are not good at reading pseudo-3D plots We are bad at seeing in three dimensions – particularly on a 2-D screen.Refer to pie chart of frequency of use.
b. There must be an error in the code.
c. The colors confuse us.
d. Scatterplots should not be used to compare two variables when
we have access to 3.
Read Section 11.14 in the Irizarry text (case study of vaccines). After you read through this case study, create a time series plot of rate of smallpox per year in California using the us_contagious_diseases dataframe.
11.14 Case study: vaccines and infectious diseases Vaccines have helped save millions of lives. In the 19th century, before herd immunization was achieved through vaccination programs, deaths from infectious diseases, such as smallpox and polio, were common. However, today vaccination programs have become somewhat controversial despite all the scientific evidence for their importance.
The controversy started with a paper published in 1988 and led by Andrew Wakefield claiming there was a link between the administration of the measles, mumps, and rubella (MMR) vaccine and the appearance of autism and bowel disease. Despite much scientific evidence contradicting this finding, sensationalist media reports and fear-mongering from conspiracy theorists led parts of the public into believing that vaccines were harmful. As a result, many parents ceased to vaccinate their children. This dangerous practice can be potentially disastrous given that the Centers for Disease Control (CDC) estimates that vaccinations will prevent more than 21 million hospitalizations and 732,000 deaths among children born in the last 20 years (see Benefits from Immunization during the Vaccines for Children Program Era — United States, 1994-2013, MMWR4. The 1988 paper has since been retracted and Andrew Wakefield was eventually “struck off the UK medical register, with a statement identifying deliberate falsification in the research published in The Lancet, and was thereby barred from practicing medicine in the UK.” (source: Wikipedia). Yet misconceptions persist, in part due to self-proclaimed activists who continue to disseminate misinformation about vaccines.
Effective communication of data is a strong antidote to misinformation and fear-mongering. Earlier we used an example provided by a Wall Street Journal article42 showing data related to the impact of vaccines on battling infectious diseases. Here we reconstruct that example.
The data used for these plots were collected, organized, and distributed by the Tycho Project43. They include weekly reported counts for seven diseases from 1928 to 2011, from all fifty states. We include the yearly totals in the
dslabspackage:
library(RColorBrewer)
data(us_contagious_diseases)
names(us_contagious_diseases)
## [1] "disease" "state" "year" "weeks_reporting"
## [5] "count" "population"
We create a temporary object dat that stores only the measles data, includes a per 100,000 rate, orders states by average value of disease and removes Alaska and Hawaii since they only became states in the late 1950s. Note that there is a weeks_reporting column that tells us for how many weeks of the year data was reported. We have to adjust for that value when computing the rate.
the_disease <- "Measles"
dat <- us_contagious_diseases |>
filter(!state%in%c("Hawaii","Alaska") & disease == the_disease) |>
mutate(rate = count / population * 10000 * 52 / weeks_reporting) |>
mutate(state = reorder(state, ifelse(year<=1963, rate, NA),
median, na.rm = TRUE))
We can now easily plot disease rates per year. Here are the measles data from California:
dat |> filter(state == "California" & !is.na(rate)) |>
ggplot(aes(year, rate)) +
geom_line() +
ylab("Cases per 10,000") +
geom_vline(xintercept=1963, col = "blue")
We add a vertical line at 1963 since this is when the vaccine was introduced [Control, Centers for Disease; Prevention (2014). CDC health information for international travel 2014 (the yellow book). p. 250. ISBN 9780199948505].
We add a vertical line at 1963 since this is when the vaccine was introduced [Control, Centers for Disease; Prevention (2014). CDC health information for international travel 2014 (the yellow book). p. 250. ISBN 9780199948505].
In our example, we want to use a sequential palette since there is no meaningful center, just low and high rates.
We use the geometry
geom_tileto tile the region with colors representing disease rates. We use a square root transformation to avoid having the really high counts dominate the plot. Notice that missing values are shown in grey. Note that once a disease was pretty much eradicated, some states stopped reporting cases all together. This is why we see so much grey after 1980.
dat |> ggplot(aes(year, state, fill = rate)) +
geom_tile(color = "grey50") +
scale_x_continuous(expand=c(0,0)) +
scale_fill_gradientn(colors = brewer.pal(9, "Reds"), trans = "sqrt") +
geom_vline(xintercept=1963, col = "blue") +
theme_minimal() +
theme(panel.grid = element_blank(),
legend.position="bottom",
text = element_text(size = 8)) +
labs(title = the_disease, x = "", y = "")
This plot makes a very striking argument for the contribution of vaccines. However, one limitation of this plot is that it uses color to represent quantity, which we earlier explained makes it harder to know exactly how high values are going. Position and lengths are better cues. If we are willing to lose state information, we can make a version of the plot that shows the values with position. We can also show the average for the US, which we compute like this:
avg <- us_contagious_diseases |>
filter(disease==the_disease) |> group_by(year) |>
summarize(us_rate = sum(count, na.rm = TRUE) /
sum(population, na.rm = TRUE) * 10000)
Now to make the plot we simply use the geom_line geometry:
dat |>
filter(!is.na(rate)) |>
ggplot() +
geom_line(aes(year, rate, group = state), color = "grey50",
show.legend = FALSE, alpha = 0.2, size = 1) +
geom_line(mapping = aes(year, us_rate), data = avg, linewidth = 1) +
scale_y_continuous(trans = "sqrt", breaks = c(5, 25, 125, 300)) +
ggtitle("Cases per 10,000 by state") +
xlab("") + ylab("") +
geom_text(data = data.frame(x = 1955, y = 50),
mapping = aes(x, y, label="US average"),
color="black") +
geom_vline(xintercept=1963, col = "blue")
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
In theory, we could use color to represent the categorical value state, but it is hard to pick 50 distinct colors.
Creating a time series plot of rate
of smallpox per year in
California using the
us_contagious_diseases dataframe.
First thing I do is to become familiar with the data I am working with.I need to make sure I am spelling and using capitalization identical to the data set.
colnames(us_contagious_diseases)
## [1] "disease" "state" "year" "weeks_reporting"
## [5] "count" "population"
class(us_contagious_diseases$disease)
## [1] "factor"
class(us_contagious_diseases$state)
## [1] "factor"
levels(us_contagious_diseases$disease)
## [1] "Hepatitis A" "Measles" "Mumps" "Pertussis" "Polio"
## [6] "Rubella" "Smallpox"
levels((us_contagious_diseases$state))
## [1] "Alabama" "Alaska" "Arizona"
## [4] "Arkansas" "California" "Colorado"
## [7] "Connecticut" "Delaware" "District Of Columbia"
## [10] "Florida" "Georgia" "Hawaii"
## [13] "Idaho" "Illinois" "Indiana"
## [16] "Iowa" "Kansas" "Kentucky"
## [19] "Louisiana" "Maine" "Maryland"
## [22] "Massachusetts" "Michigan" "Minnesota"
## [25] "Mississippi" "Missouri" "Montana"
## [28] "Nebraska" "Nevada" "New Hampshire"
## [31] "New Jersey" "New Mexico" "New York"
## [34] "North Carolina" "North Dakota" "Ohio"
## [37] "Oklahoma" "Oregon" "Pennsylvania"
## [40] "Rhode Island" "South Carolina" "South Dakota"
## [43] "Tennessee" "Texas" "Utah"
## [46] "Vermont" "Virginia" "Washington"
## [49] "West Virginia" "Wisconsin" "Wyoming"
smallpox_year_ca <- us_contagious_diseases |>
filter(disease == "Smallpox", state == "California") |>
group_by(year) |>
summarize(rate = sum(count)/sum(population)*10000) |>
ggplot(aes(year, rate)) +
geom_line() +
labs(title = "Rate of smallpox per year in California")
smallpox_year_ca
I used
summarize(rate = sum(count)/sum(population)*10000) to make
the variable “rate”. Rate is the number of occurrences of smallpox per
people in the population. But this rate needs to be grouped by year
group_by(year). Before we can do that, we need to make sure
R knows we only want Smallpox and Californiaso
I used filter() which selects only the rows with these
names in the designated columns disease and
state, respectively. After all this, it was simply just
creating a time series plot using geom_line(). I added a
descriptive title to help the reader.
case_when() and ifelse() functionsThis week we learned about case_when() and
ifelse() functions. Using the gapminder
dataframe, create a variable high_fertility that takes the
value of 1 if fertility is greater than or equal
to , 3 and 0 if fertility is less
than 3. Then create a variable, decade,
which indicates the decade of data (1960-1969,
1970-1979, etc.) Then make a table that
indicates the number of high_fertility.
I begin by loading the data and reminding myself the names of the columns.
data(gapminder)
colnames(gapminder)
## [1] "country" "year" "infant_mortality" "life_expectancy"
## [5] "fertility" "population" "gdp" "continent"
## [9] "region"
# Create the high_fertility variable
gapminder$high_fertility <- ifelse(gapminder$fertility >= 3, 1, 0)
# Create the decade variable
gapminder$decade <- case_when(
gapminder$year >= 1960 & gapminder$year < 1970 ~ "1960-1969",
gapminder$year >= 1970 & gapminder$year < 1980 ~ "1970-1979",
gapminder$year >= 1980 & gapminder$year < 1990 ~ "1980-1989",
gapminder$year >= 1990 & gapminder$year < 2000 ~ "1990-1999",
gapminder$year >= 2000 & gapminder$year < 2010 ~ "2000-2009",
gapminder$year >= 2010 & gapminder$year < 2020 ~ "2010-2019",
TRUE ~ NA_character_
)
# Create the table
table(gapminder$decade, gapminder$high_fertility)
##
## 0 1
## 1960-1969 357 1493
## 1970-1979 535 1315
## 1980-1989 652 1198
## 1990-1999 890 960
## 2000-2009 1141 709
## 2010-2019 746 362
The ifelse function takes three arguments: a test
expression, a value to return if the test is true, and a value to return
if the test is false. The test expression is evaluated for each element
of a vector or column, and the corresponding true or false value is
returned based on the result of the test. For example, in the code
gapminder$high_fertility <- ifelse(gapminder$fertility >= 3, 1, 0),
the ifelse function is used to create a new variable
high_fertility in the gapminder data frame.
The test expression gapminder$fertility >= 3 is
evaluated for each row in the gapminder data frame, and if
the value of fertility for that row is greater than or
equal to 3, the value 1 is returned, otherwise
the value 0 is returned.
The case_when function is similar to
ifelse, but allows for multiple conditions to be tested in
a more readable and concise way. The case_when function
takes multiple case arguments, each of which consists of a
test expression and a value to return if the test is true. The case_when
function returns the value associated with the first true test. For
example, in the code
gapminder$decade <- case_when(gapminder$year >= 1960
& gapminder$year < 1970 ~ "1960-1969", ...), the
case_when function is used to create a new variable decade
in the gapminder data frame. The function tests multiple conditions to
see which decade a given year belongs to, and returns the corresponding
decade as a string. If none of the conditions are true, the
TRUE condition at the end returns the value
NA_character_ to represent missing values for character
data.