Homework: Week 4

Visualization, Confidence Intervals, & Hypothesis Testing

ANTH 504

1.Irizarry exercises 11.13

Complete Questions 1-3, 5-8 for exercises 11.13 in Irizarry text.

data-visualization-principles-exercises-20

11.13 Exercises For these exercises, we will be using the vaccines data in the dslabs package:

library(dslabs)
data(us_contagious_diseases)
  1. Pie charts are appropriate:

a. When we want to display percentages.

b. When ggplot2 is not available.

c. When I am in a bakery.

d. Never. Barplots and tables are always better.

…humans are not good at precisely quantifying angles and are even worse when area is the only available visual cue.

If for some reason you need to make a pie chart, label each pie slice with its respective percentage

  1. What is the problem with the plot below:

Question 2 Plot

a. The values are wrong. The final vote was 306 to 232.

b. The axis does not start at 0. Judging by the length, it appears Trump received 3 times as many votes when, in fact, it was about 30% more. It makes you think then differences are larger than they actually are.

When using position rather than length, it is then not necessary to include 0. Here they are comparing the differences between groups relative to the within-group variability.

c. The colors should be the same.

d. Percentages should be shown as a pie chart.

  1. Take a look at the following two plots. They show the same information: 1928 rates of measles across the 50 states.

Question 3 plots

Which plot is easier to read if you are interested in determining which are the best and worst states in terms of rates, and why?

a. They provide the same information, so they are both equally as good.

b. The plot on the right is better because it orders the states alphabetically.

c. The plot on the left is better because alphabetical order has nothing to do with the disease and by ordering according to actual rate, we quickly see the states with most and least rates. Ordering the plots by value (or meaningful category) is easier to look at and interpret.

d. Both plots should be a pie chart.

  1. To make the plot on the left, we have to reorder the levels of the states’ variables.
colnames(us_contagious_diseases)
## [1] "disease"         "state"           "year"            "weeks_reporting"
## [5] "count"           "population"
dat <- us_contagious_diseases |>  
  filter(year == 1967 & disease=="Measles" & !is.na(population)) |>
  mutate(rate = count / population * 10000 * 52 / weeks_reporting)
colnames(dat)
## [1] "disease"         "state"           "year"            "weeks_reporting"
## [5] "count"           "population"      "rate"

Note what happens when we make a barplot:

dat |> ggplot(aes(state, rate)) +
  geom_bar(stat="identity") +
  coord_flip() 

Define these objects:

state <- dat$state
rate <- dat$count/dat$population*10000*52/dat$weeks_reporting

Redefine the state object so that the levels are reordered. Print the new object state and its levels so you can see that the vector is not re-ordered by the levels.

state <- reorder(dat$state, dat$rate)
print(state)
##  [1] Alabama              Alaska               Arizona             
##  [4] Arkansas             California           Colorado            
##  [7] Connecticut          Delaware             District Of Columbia
## [10] Florida              Georgia              Hawaii              
## [13] Idaho                Illinois             Indiana             
## [16] Iowa                 Kansas               Kentucky            
## [19] Louisiana            Maine                Maryland            
## [22] Massachusetts        Michigan             Minnesota           
## [25] Mississippi          Missouri             Montana             
## [28] Nebraska             Nevada               New Hampshire       
## [31] New Jersey           New Mexico           New York            
## [34] North Carolina       North Dakota         Ohio                
## [37] Oklahoma             Oregon               Pennsylvania        
## [40] Rhode Island         South Carolina       South Dakota        
## [43] Tennessee            Texas                Utah                
## [46] Vermont              Virginia             Washington          
## [49] West Virginia        Wisconsin            Wyoming             
## attr(,"scores")
##              Alabama               Alaska              Arizona 
##           4.16107582           5.46389893           6.32695891 
##             Arkansas           California             Colorado 
##           6.87899954           2.79313560           7.96331905 
##          Connecticut             Delaware District Of Columbia 
##           0.36986840           1.13098183           0.35873614 
##              Florida              Georgia               Hawaii 
##           2.89358806           0.09987991           2.50173748 
##                Idaho             Illinois              Indiana 
##           6.03115170           1.20115480           1.34027323 
##                 Iowa               Kansas             Kentucky 
##           2.94948911           0.66386422           4.74576011 
##            Louisiana                Maine             Maryland 
##           0.46088071           2.57520433           0.49922233 
##        Massachusetts             Michigan            Minnesota 
##           0.74762338           1.33466700           0.37722410 
##          Mississippi             Missouri              Montana 
##           3.11366532           0.75696354           5.00433320 
##             Nebraska               Nevada        New Hampshire 
##           3.64389801           6.43683882           0.47181511 
##           New Jersey           New Mexico             New York 
##           0.88414264           6.15969926           0.66849058 
##       North Carolina         North Dakota                 Ohio 
##           1.92529764          14.48024642           1.16382241 
##             Oklahoma               Oregon         Pennsylvania 
##           3.27496900           8.75036439           0.67687303 
##         Rhode Island       South Carolina         South Dakota 
##           0.68207448           2.10412531           0.90289534 
##            Tennessee                Texas                 Utah 
##           5.47344506          12.49773953           4.03005836 
##              Vermont             Virginia           Washington 
##           1.00970314           5.28270939          17.65180349 
##        West Virginia            Wisconsin              Wyoming 
##           8.59456463           4.96246019           6.97303449 
## 51 Levels: Georgia District Of Columbia Connecticut Minnesota ... Washington
print(levels(state))
##  [1] "Georgia"              "District Of Columbia" "Connecticut"         
##  [4] "Minnesota"            "Louisiana"            "New Hampshire"       
##  [7] "Maryland"             "Kansas"               "New York"            
## [10] "Pennsylvania"         "Rhode Island"         "Massachusetts"       
## [13] "Missouri"             "New Jersey"           "South Dakota"        
## [16] "Vermont"              "Delaware"             "Ohio"                
## [19] "Illinois"             "Michigan"             "Indiana"             
## [22] "North Carolina"       "South Carolina"       "Hawaii"              
## [25] "Maine"                "California"           "Florida"             
## [28] "Iowa"                 "Mississippi"          "Oklahoma"            
## [31] "Nebraska"             "Utah"                 "Alabama"             
## [34] "Kentucky"             "Wisconsin"            "Montana"             
## [37] "Virginia"             "Alaska"               "Tennessee"           
## [40] "Idaho"                "New Mexico"           "Arizona"             
## [43] "Nevada"               "Arkansas"             "Wyoming"             
## [46] "Colorado"             "West Virginia"        "Oregon"              
## [49] "Texas"                "North Dakota"         "Washington"
  1. Now with one line of code, define the dat table as done above, but change the use mutate to create a rate variable and reorder the state variable so that the levels are re-ordered by this variable.
dat <- dat |>
  mutate(rate=rate) |>  
  mutate(state=state)

Then make a barplot using the code above, but for this new dat.

dat |>
  ggplot(aes(state, rate)) +
  geom_bar(stat="identity")

dat |>
  ggplot(aes(state, rate)) +
  geom_bar(stat="identity") + 
  coord_flip()

This didn’t work. But that’s OK because I need to do it using one line of code

dat <- us_contagious_diseases |> 
  filter(year == 1967 & disease=="Measles" & count>0 & !is.na(population)) |>
  mutate(rate = count / population * 10000 * 52 / weeks_reporting) |> 
  mutate(state = reorder(state, rate, FUN = mean)) |>
  ggplot(aes(state, rate)) +
    geom_bar(stat="identity") +
    coord_flip()
dat

  1. Say we are interested in comparing gun homicide rates across regions of the US. We see this plot:
library(dslabs)
data("murders")
murders |> mutate(rate = total/population*100000) |>
group_by(region) |>
summarize(avg = mean(rate)) |>
mutate(region = factor(region)) |>
ggplot(aes(region, avg)) +
geom_bar(stat="identity") +
ylab("Murder Rate Average")

and decide to move to a state in the western region. What is the main problem with this interpretation?

a. The categories are ordered alphabetically.

b. The graph does not show standerd errors.

c. It does not show all the data. We do not see the variability within a region and it’s possible that the safest states are not in the West.

d. The Northeast has the lowest average.

  1. Make a boxplot of the murder rates defined as
data("murders")
murders |> mutate(rate = total/population*100000)
##                   state abb        region population total       rate
## 1               Alabama  AL         South    4779736   135  2.8244238
## 2                Alaska  AK          West     710231    19  2.6751860
## 3               Arizona  AZ          West    6392017   232  3.6295273
## 4              Arkansas  AR         South    2915918    93  3.1893901
## 5            California  CA          West   37253956  1257  3.3741383
## 6              Colorado  CO          West    5029196    65  1.2924531
## 7           Connecticut  CT     Northeast    3574097    97  2.7139722
## 8              Delaware  DE         South     897934    38  4.2319369
## 9  District of Columbia  DC         South     601723    99 16.4527532
## 10              Florida  FL         South   19687653   669  3.3980688
## 11              Georgia  GA         South    9920000   376  3.7903226
## 12               Hawaii  HI          West    1360301     7  0.5145920
## 13                Idaho  ID          West    1567582    12  0.7655102
## 14             Illinois  IL North Central   12830632   364  2.8369608
## 15              Indiana  IN North Central    6483802   142  2.1900730
## 16                 Iowa  IA North Central    3046355    21  0.6893484
## 17               Kansas  KS North Central    2853118    63  2.2081106
## 18             Kentucky  KY         South    4339367   116  2.6732010
## 19            Louisiana  LA         South    4533372   351  7.7425810
## 20                Maine  ME     Northeast    1328361    11  0.8280881
## 21             Maryland  MD         South    5773552   293  5.0748655
## 22        Massachusetts  MA     Northeast    6547629   118  1.8021791
## 23             Michigan  MI North Central    9883640   413  4.1786225
## 24            Minnesota  MN North Central    5303925    53  0.9992600
## 25          Mississippi  MS         South    2967297   120  4.0440846
## 26             Missouri  MO North Central    5988927   321  5.3598917
## 27              Montana  MT          West     989415    12  1.2128379
## 28             Nebraska  NE North Central    1826341    32  1.7521372
## 29               Nevada  NV          West    2700551    84  3.1104763
## 30        New Hampshire  NH     Northeast    1316470     5  0.3798036
## 31           New Jersey  NJ     Northeast    8791894   246  2.7980319
## 32           New Mexico  NM          West    2059179    67  3.2537239
## 33             New York  NY     Northeast   19378102   517  2.6679599
## 34       North Carolina  NC         South    9535483   286  2.9993237
## 35         North Dakota  ND North Central     672591     4  0.5947151
## 36                 Ohio  OH North Central   11536504   310  2.6871225
## 37             Oklahoma  OK         South    3751351   111  2.9589340
## 38               Oregon  OR          West    3831074    36  0.9396843
## 39         Pennsylvania  PA     Northeast   12702379   457  3.5977513
## 40         Rhode Island  RI     Northeast    1052567    16  1.5200933
## 41       South Carolina  SC         South    4625364   207  4.4753235
## 42         South Dakota  SD North Central     814180     8  0.9825837
## 43            Tennessee  TN         South    6346105   219  3.4509357
## 44                Texas  TX         South   25145561   805  3.2013603
## 45                 Utah  UT          West    2763885    22  0.7959810
## 46              Vermont  VT     Northeast     625741     2  0.3196211
## 47             Virginia  VA         South    8001024   250  3.1246001
## 48           Washington  WA          West    6724540    93  1.3829942
## 49        West Virginia  WV         South    1852994    27  1.4571013
## 50            Wisconsin  WI North Central    5686986    97  1.7056487
## 51              Wyoming  WY          West     563626     5  0.8871131

by region, showing all the points and -reordering the regions by their median rate.

murders |> 
  mutate(region = reorder(region, rate, FUN = median)) |>
  ggplot(aes(region, rate)) +
    geom_boxplot() +
    geom_jitter(width = 0.1, alpha = 0.2)

  1. The plots below show three continuous variables.

Question 8a Plots

The line x=2 appears to separate the points. But it is actually not the case, which we can see by plotting the data in a couple of two-dimensional points.

Question 8b Plots

Why is this happening?

a. Humans are not good at reading pseudo-3D plots We are bad at seeing in three dimensions – particularly on a 2-D screen.Refer to pie chart of frequency of use.

b. There must be an error in the code.

c. The colors confuse us.

d. Scatterplots should not be used to compare two variables when we have access to 3.

2.Irizarry Section 11.14

Read Section 11.14 in the Irizarry text (case study of vaccines). After you read through this case study, create a time series plot of rate of smallpox per year in California using the us_contagious_diseases dataframe.

11.14 Case study: vaccines and infectious diseases Vaccines have helped save millions of lives. In the 19th century, before herd immunization was achieved through vaccination programs, deaths from infectious diseases, such as smallpox and polio, were common. However, today vaccination programs have become somewhat controversial despite all the scientific evidence for their importance.

The controversy started with a paper published in 1988 and led by Andrew Wakefield claiming there was a link between the administration of the measles, mumps, and rubella (MMR) vaccine and the appearance of autism and bowel disease. Despite much scientific evidence contradicting this finding, sensationalist media reports and fear-mongering from conspiracy theorists led parts of the public into believing that vaccines were harmful. As a result, many parents ceased to vaccinate their children. This dangerous practice can be potentially disastrous given that the Centers for Disease Control (CDC) estimates that vaccinations will prevent more than 21 million hospitalizations and 732,000 deaths among children born in the last 20 years (see Benefits from Immunization during the Vaccines for Children Program Era — United States, 1994-2013, MMWR4. The 1988 paper has since been retracted and Andrew Wakefield was eventually “struck off the UK medical register, with a statement identifying deliberate falsification in the research published in The Lancet, and was thereby barred from practicing medicine in the UK.” (source: Wikipedia). Yet misconceptions persist, in part due to self-proclaimed activists who continue to disseminate misinformation about vaccines.

Effective communication of data is a strong antidote to misinformation and fear-mongering. Earlier we used an example provided by a Wall Street Journal article42 showing data related to the impact of vaccines on battling infectious diseases. Here we reconstruct that example.

The data used for these plots were collected, organized, and distributed by the Tycho Project43. They include weekly reported counts for seven diseases from 1928 to 2011, from all fifty states. We include the yearly totals in the dslabs package:

library(RColorBrewer)
data(us_contagious_diseases)
names(us_contagious_diseases)
## [1] "disease"         "state"           "year"            "weeks_reporting"
## [5] "count"           "population"

We create a temporary object dat that stores only the measles data, includes a per 100,000 rate, orders states by average value of disease and removes Alaska and Hawaii since they only became states in the late 1950s. Note that there is a weeks_reporting column that tells us for how many weeks of the year data was reported. We have to adjust for that value when computing the rate.

the_disease <- "Measles"
dat <- us_contagious_diseases |>
  filter(!state%in%c("Hawaii","Alaska") & disease == the_disease) |>
  mutate(rate = count / population * 10000 * 52 / weeks_reporting) |> 
  mutate(state = reorder(state, ifelse(year<=1963, rate, NA), 
                         median, na.rm = TRUE)) 

We can now easily plot disease rates per year. Here are the measles data from California:

dat |> filter(state == "California" & !is.na(rate)) |>
  ggplot(aes(year, rate)) +
  geom_line() + 
  ylab("Cases per 10,000")  + 
  geom_vline(xintercept=1963, col = "blue")

We add a vertical line at 1963 since this is when the vaccine was introduced [Control, Centers for Disease; Prevention (2014). CDC health information for international travel 2014 (the yellow book). p. 250. ISBN 9780199948505].

We add a vertical line at 1963 since this is when the vaccine was introduced [Control, Centers for Disease; Prevention (2014). CDC health information for international travel 2014 (the yellow book). p. 250. ISBN 9780199948505].

In our example, we want to use a sequential palette since there is no meaningful center, just low and high rates.

We use the geometry geom_tile to tile the region with colors representing disease rates. We use a square root transformation to avoid having the really high counts dominate the plot. Notice that missing values are shown in grey. Note that once a disease was pretty much eradicated, some states stopped reporting cases all together. This is why we see so much grey after 1980.

dat |> ggplot(aes(year, state, fill = rate)) +
  geom_tile(color = "grey50") +
  scale_x_continuous(expand=c(0,0)) +
  scale_fill_gradientn(colors = brewer.pal(9, "Reds"), trans = "sqrt") +
  geom_vline(xintercept=1963, col = "blue") +
  theme_minimal() +  
  theme(panel.grid = element_blank(), 
        legend.position="bottom", 
        text = element_text(size = 8)) +
  labs(title = the_disease, x = "", y = "")

This plot makes a very striking argument for the contribution of vaccines. However, one limitation of this plot is that it uses color to represent quantity, which we earlier explained makes it harder to know exactly how high values are going. Position and lengths are better cues. If we are willing to lose state information, we can make a version of the plot that shows the values with position. We can also show the average for the US, which we compute like this:

avg <- us_contagious_diseases |>
  filter(disease==the_disease) |> group_by(year) |>
  summarize(us_rate = sum(count, na.rm = TRUE) / 
              sum(population, na.rm = TRUE) * 10000)

Now to make the plot we simply use the geom_line geometry:

dat |> 
  filter(!is.na(rate)) |>
    ggplot() +
  geom_line(aes(year, rate, group = state),  color = "grey50", 
            show.legend = FALSE, alpha = 0.2, size = 1) +
  geom_line(mapping = aes(year, us_rate),  data = avg, linewidth = 1) +
  scale_y_continuous(trans = "sqrt", breaks = c(5, 25, 125, 300)) + 
  ggtitle("Cases per 10,000 by state") + 
  xlab("") + ylab("") +
  geom_text(data = data.frame(x = 1955, y = 50), 
            mapping = aes(x, y, label="US average"), 
            color="black") + 
  geom_vline(xintercept=1963, col = "blue")
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.

In theory, we could use color to represent the categorical value state, but it is hard to pick 50 distinct colors.

Creating a time series plot of rate of smallpox per year in California using the us_contagious_diseases dataframe.

First thing I do is to become familiar with the data I am working with.I need to make sure I am spelling and using capitalization identical to the data set.

colnames(us_contagious_diseases)
## [1] "disease"         "state"           "year"            "weeks_reporting"
## [5] "count"           "population"
class(us_contagious_diseases$disease)
## [1] "factor"
class(us_contagious_diseases$state)
## [1] "factor"
levels(us_contagious_diseases$disease)
## [1] "Hepatitis A" "Measles"     "Mumps"       "Pertussis"   "Polio"      
## [6] "Rubella"     "Smallpox"
levels((us_contagious_diseases$state))
##  [1] "Alabama"              "Alaska"               "Arizona"             
##  [4] "Arkansas"             "California"           "Colorado"            
##  [7] "Connecticut"          "Delaware"             "District Of Columbia"
## [10] "Florida"              "Georgia"              "Hawaii"              
## [13] "Idaho"                "Illinois"             "Indiana"             
## [16] "Iowa"                 "Kansas"               "Kentucky"            
## [19] "Louisiana"            "Maine"                "Maryland"            
## [22] "Massachusetts"        "Michigan"             "Minnesota"           
## [25] "Mississippi"          "Missouri"             "Montana"             
## [28] "Nebraska"             "Nevada"               "New Hampshire"       
## [31] "New Jersey"           "New Mexico"           "New York"            
## [34] "North Carolina"       "North Dakota"         "Ohio"                
## [37] "Oklahoma"             "Oregon"               "Pennsylvania"        
## [40] "Rhode Island"         "South Carolina"       "South Dakota"        
## [43] "Tennessee"            "Texas"                "Utah"                
## [46] "Vermont"              "Virginia"             "Washington"          
## [49] "West Virginia"        "Wisconsin"            "Wyoming"
smallpox_year_ca <- us_contagious_diseases  |> 
  filter(disease == "Smallpox", state == "California") |>
  group_by(year) |>
  summarize(rate = sum(count)/sum(population)*10000) |>
  ggplot(aes(year, rate)) + 
  geom_line() +
  labs(title = "Rate of smallpox per year in California")
smallpox_year_ca

I used summarize(rate = sum(count)/sum(population)*10000) to make the variable “rate”. Rate is the number of occurrences of smallpox per people in the population. But this rate needs to be grouped by year group_by(year). Before we can do that, we need to make sure R knows we only want Smallpox and Californiaso I used filter() which selects only the rows with these names in the designated columns disease and state, respectively. After all this, it was simply just creating a time series plot using geom_line(). I added a descriptive title to help the reader.

3.case_when() and ifelse() functions

This week we learned about case_when() and ifelse() functions. Using the gapminder dataframe, create a variable high_fertility that takes the value of 1 if fertility is greater than or equal to , 3 and 0 if fertility is less than 3. Then create a variable, decade, which indicates the decade of data (1960-1969, 1970-1979, etc.) Then make a table that indicates the number of high_fertility.

I begin by loading the data and reminding myself the names of the columns.

data(gapminder)
colnames(gapminder)
## [1] "country"          "year"             "infant_mortality" "life_expectancy" 
## [5] "fertility"        "population"       "gdp"              "continent"       
## [9] "region"
# Create the high_fertility variable
gapminder$high_fertility <- ifelse(gapminder$fertility >= 3, 1, 0)

# Create the decade variable
gapminder$decade <- case_when(
  gapminder$year >= 1960 & gapminder$year < 1970 ~ "1960-1969",
  gapminder$year >= 1970 & gapminder$year < 1980 ~ "1970-1979",
  gapminder$year >= 1980 & gapminder$year < 1990 ~ "1980-1989",
  gapminder$year >= 1990 & gapminder$year < 2000 ~ "1990-1999",
  gapminder$year >= 2000 & gapminder$year < 2010 ~ "2000-2009",
  gapminder$year >= 2010 & gapminder$year < 2020 ~ "2010-2019",
  TRUE ~ NA_character_
)

# Create the table
table(gapminder$decade, gapminder$high_fertility)
##            
##                0    1
##   1960-1969  357 1493
##   1970-1979  535 1315
##   1980-1989  652 1198
##   1990-1999  890  960
##   2000-2009 1141  709
##   2010-2019  746  362

The ifelse function takes three arguments: a test expression, a value to return if the test is true, and a value to return if the test is false. The test expression is evaluated for each element of a vector or column, and the corresponding true or false value is returned based on the result of the test. For example, in the code gapminder$high_fertility <- ifelse(gapminder$fertility >= 3, 1, 0), the ifelse function is used to create a new variable high_fertility in the gapminder data frame. The test expression gapminder$fertility >= 3 is evaluated for each row in the gapminder data frame, and if the value of fertility for that row is greater than or equal to 3, the value 1 is returned, otherwise the value 0 is returned.

The case_when function is similar to ifelse, but allows for multiple conditions to be tested in a more readable and concise way. The case_when function takes multiple case arguments, each of which consists of a test expression and a value to return if the test is true. The case_when function returns the value associated with the first true test. For example, in the code gapminder$decade <- case_when(gapminder$year >= 1960 & gapminder$year < 1970 ~ "1960-1969", ...), the case_when function is used to create a new variable decade in the gapminder data frame. The function tests multiple conditions to see which decade a given year belongs to, and returns the corresponding decade as a string. If none of the conditions are true, the TRUE condition at the end returns the value NA_character_ to represent missing values for character data.