DataViz Midterm II

General instructions for Midterms:

Create a new R Markdown file
Change the heading to include your author name
Save the R Markdown file (named as [MikeID]-[MidtermII].Rmd, e.g. “mlopez-MidtermII”) somewhere you’ll be able to access it later (zip drive, My Documents, Dropbox, etc.)
Your file should contain the code/commands to answer each question in its own code block, which will also produce plots that will be automatically embedded in the output file
Each answer must be supported by written statements (unless otherwise specified) as well as any code used
A printed HTML or PDF copy of this midterm is due in class on Monday at 5:00 PM (November 21). I will not answer any questions on this exam after 5:00 PM on Sunday. I am available over email through that point, and also after class on Friday.
Each student must abide by Skidmore’s honor code. You may use the internet for coding tips only: no soliciting answers or communicating with other students. However, note that class notes and labs are sufficient for finishing this exam.

Part 0 (10 points)

The first 10 points are awarded based on the format, spelling, grammar, and presentation of your HTML/PDF document. This includes eliminating unneeded code and writing (including the preamble above), warnings & messages, and any other text or output that is not part of your answer. Refer to the RMarkdown cheatsheet (given in the first lab) for formatting tips.

Note: You may include the questions themselves in your response, but you don’t need to.

Part I (60 points)

We start with the okcupiddata package, which stores cleaned profile data from 59,946 OkCupid users who were living within 25 miles of San Francisco, had active profiles on June 26, 2012, were online in the previous year, and had at least one picture in their profile. More information on this data, as well as a description of the variables, can be found here

A researcher is interested in the link between drinking and income among OkCupid users. Unfortunately, the most straightforward way to make a chart involves dropping subjects who did not respond to these questions.

Please use the following code to drop these subjects.

Also, note that income is converted into thousands of dollars, and that a categorical variable for income is created using the ifelse command. This splits users into one of three categories based on their incomes.

library(okcupiddata)
library(dplyr)
data("profiles")

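# Keep only users who reported both income and drinking habits
profiles1 <- profiles %>%
  filter(!is.na(income), !is.na(drinks)) %>%
  # Convert income to thousands of dollars, then bin it into three categories
  mutate(income.thous = income / 10^3,
         income.cat = ifelse(income.thous <= 100, "100k or Less",
                      ifelse(income.thous <= 500, "100k to 500k",
                      "More than 500k")))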
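To see how the nested ifelse command assigns categories at the cut-offs, the following minimal sketch applies the same logic to a handful of made-up incomes. The values in income.thous here are hypothetical and are not taken from the profiles data.

```r
# Hypothetical incomes in thousands of dollars (NOT from the profiles data)
income.thous <- c(20, 100, 250, 500, 1000)

# Same nested ifelse logic as above: the first condition that is TRUE wins
income.cat <- ifelse(income.thous <= 100, "100k or Less",
              ifelse(income.thous <= 500, "100k to 500k",
                     "More than 500k"))

# Show each income next to its assigned category
data.frame(income.thous, income.cat)
```

Note that the boundary values land in the lower category: an income of exactly 100 is labelled “100k or Less”, and an income of exactly 500 is labelled “100k to 500k”.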

Q1 (15 points) As one approach for identifying the link between drinking and income, a researcher presents the following plot. Make it, keeping in mind that the error bars shown reflect the margins of error.
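As a reminder of what a margin of error for an average looks like, here is a minimal sketch on made-up numbers. It assumes one common convention (a 95 percent normal-based margin, roughly 1.96 standard errors); if class notes use a different multiplier, follow those instead. The vector x is hypothetical.

```r
# Hypothetical sample of incomes in thousands (NOT from the profiles data)
x <- c(40, 60, 80, 100, 120)

n   <- length(x)
se  <- sd(x) / sqrt(n)       # standard error of the mean
moe <- qnorm(0.975) * se     # assumed 95% normal-based margin of error

# Sample mean with the lower and upper ends of the error bar
c(mean = mean(x), lower = mean(x) - moe, upper = mean(x) + moe)
```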

Q2 (5 points) In 2-3 sentences summarise the link between income and drinking level presented in the chart.

Q3 (5 points) Which of the following two headlines corresponds to this chart?

“OkCupid users reporting higher incomes like to drink” or “OkCupid users that like to drink report higher incomes”

Q4 (5 points) The chart above shows average incomes. Why is this possibly misleading? Can you think of a better way to show the distribution of incomes?

Q5 (15 points) The researcher comes back with a second chart, shown below. Make it.

Q6 (5 points) In two sentences, explain what is shown in this figure that is not shown in the previous one.

Q7 (5 points) The chart above shows percentages. Why are we still unable to know for sure if drinking levels are significantly higher among OkCupid users with higher incomes? What information is missing?

Q8 (5 points) While interesting, and possibly a reason to dig deeper into the data, both charts above have two major limitations as far as learning about OkCupid users as a whole. Identify one of these limitations.

Part II (5 points each)

The Washington Post ran a set of several articles following up on the election. One of them included a chart - shown here via Twitter - that summarises the link between city type and county preference (red or blue).

Q1 The chart shows the percentage of counties leaning red or blue. For example, among urban cores, blue was at 86 percent. Estimate the margins of error for these six percentages.
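For reference, a margin of error for a single percentage can be sketched as below. The 86 percent figure comes from the question; the number of counties n is a placeholder assumption, since the counts are not stated here, and the 95 percent normal approximation is one common convention rather than the only option.

```r
p_hat <- 0.86   # share of urban-core counties leaning blue, from the question
n     <- 100    # HYPOTHETICAL number of counties; replace with your own estimate

# Assumed 95% normal-approximation margin of error for a proportion
moe <- qnorm(0.975) * sqrt(p_hat * (1 - p_hat) / n)

# Margin of error expressed in percentage points
round(100 * moe, 1)
```

Because the margin shrinks with the square root of n, the value chosen for n drives the result, so it is worth stating how you arrived at it.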

Q2 One initial reaction to this chart is that there’s a major difference in the support for each candidate between urban and rural areas. Comment on why this may be an inaccurate conclusion.

Part III (10 points)

“Biggest spike in 50 years? Blame apps,” writes the New York Times in an article linked here

The article presents one chart showing traffic fatalities over the last 40 years. In one paragraph, agree or disagree with the article’s headline, making reference to the chart, the article, and the approach used by the author.

Part IV (10 points)

Using Lab 09 as a guideline, make the following map. Note that this makes use of the 538 polling data set, filtered so that only rows of type polls-plus appear.

Bonus (5 points)

Make the same plot as in Part IV, only varying the colours from blue (low Trump support) to grey (middle support) to red (high Trump support). A chart filled with purple does not count.