This homework is due by Tuesday, January 15th, 8:00pm.

Upload a zipped folder to Canvas called 1-visualization_homework.zip which contains three files:
• 1-visualization_homework.Rmd
• 1-visualization_homework.html
• 1-visualization_homework.pdf

Instructions

In this homework, you’ll write a short blog post about a data set. Your goal is to tell us something interesting using a well-crafted, thoughtfully-prepared data graphic. One data graphic should suffice, but you may include more if you choose (not more than 3 though). Feel free to make plots with multiple panels by using the patchwork package we’ve discussed (or one of the alternatives such as cowplot).
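As a rough illustration of the multi-panel option mentioned above, here is a minimal sketch that combines two plots with patchwork. The choice of the built-in mtcars data set and the object names (p.scatter, p.hist, p.combined) are assumptions made purely for this example; your own plots will differ.

```r
library(ggplot2)
library(patchwork)

# two simple plots built from the mtcars data set that ships with R
p.scatter <- ggplot(data = mtcars, mapping = aes(x = wt, y = mpg)) +
  geom_point() +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon")

p.hist <- ggplot(data = mtcars, mapping = aes(x = mpg)) +
  geom_histogram(bins = 10) +
  labs(x = "Miles per gallon", y = "Count")

# patchwork overloads `+` to combine ggplot objects into one figure;
# plot_annotation() adds panel tags (A, B) so the text can refer to each panel
p.combined <- p.scatter + p.hist +
  plot_annotation(tag_levels = "A")

p.combined
```

A combined figure like this still counts as a single data graphic, so the caption should explain each tagged panel.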

Your blog post should be short (between 100 and 300 words). We envision an introductory paragraph that explains your findings and provides some context to your data, the data graphic(s), and then a caption-like paragraph providing more detail about what to look for in the data graphic and how to interpret it. That is it. You will not earn more points by including more words or data graphics. What we are looking for is something that is insightful and well-crafted.

Here are some examples of articles that are similar in spirit to yours. Most of these are much longer than yours will be, but the idea is similar: use a good data graphic to tell us something we don’t already know.

Data

You are free to use whatever data you want. However, the purpose of this exercise is to learn how to make good plots – not to wrangle data (we’ll do that next). So we don’t want you to spend much time wrangling data. There are perfectly good data sets available through R packages that are already well-curated. Here is a list of packages with data sets.

  • fivethirtyeight: provides access to data sets that drive many articles on FiveThirtyEight
  • nycflights13: data about flights leaving from the three major NYC airports in 2013
  • NHANES: data from the US National Health and Nutrition Examination Survey
  • Lahman: comprehensive historical archive of major league baseball data
  • fueleconomy: fuel economy data from the EPA, 1985–2015
  • datasets: package that contains a large number of data sets

For example, to take a look at the datasets in the fivethirtyeight package, you can do the following:

# install the package 
install.packages("fivethirtyeight") 

# load the package 
library("fivethirtyeight")

# take a look at the data sets that come with the package
data(package = "fivethirtyeight")

# take a look at the help file to get more information about the different data sets (not 
# all packages have help files)
help("fivethirtyeight")

# the "fivethirtyeight" vignette gives a detailed overview of the different data sets; 
# open it with this command
vignette("fivethirtyeight", package = "fivethirtyeight")

# to load a particular data set (e.g. US_births_2000_2014, replace with the name of the 
# data set you'd like to load) into your environment, run the following 
df.data = US_births_2000_2014

Note that I’ve set the code chunk option for the code block above to eval=FALSE. This way, the code is shown but not evaluated. You can find out more about the different chunk options here.
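Chunk options can also be set once for the whole document rather than chunk by chunk. The following is a minimal sketch of one common pattern, assuming it is placed in a setup chunk near the top of the .Rmd file; the particular options chosen here (message and warning) are only an example, not a requirement of the template.

```r
library(knitr)

# set default options for all subsequent chunks in the document;
# an individual chunk can still override them in its own header (e.g. eval=FALSE)
opts_chunk$set(
  message = FALSE,  # hide messages such as package start-up notes
  warning = FALSE   # hide warnings in the knitted output
)
```

With defaults like these, the messages printed when loading packages no longer appear in the knitted HTML.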

Are taller players better or worse shooters? Well, they’re both…

Load packages

library("knitr")
library("tidyverse")

Load the data set

# load the data set here

df <- read.csv("~/Uni/Psych 252/Homework/Week 1/week1homework/players_stats.csv")

Description

In general, taller basketball players are worse shooters, at least according to many basketball fans. Shorter players, so the story goes, are more likely to have developed what they have in the way of skill to compensate for what they lack in the way of height. But while widely believed, statistics supporting this claim are rarely cited. What, then, should we make of it? To test it, we can analyze some data from the NBA 2014-2015 season. The data set features information about key statistics for 490 NBA players, including statistics on minutes played, rebounds made, and the like.

We might try to test the claim by examining the percentage of “free-throws” which players successfully make, that is, the percentage of unopposed shots that are made from the free-throw line, typically when “fouls” are called in-game. This statistic would seem to be a valid test. After all, unlike any other kind of shot on the court, one free-throw occurs under virtually the same circumstances as any other: they all occur from the same distance, with no opposition, and so forth. Hence, we will use scatter plots to view the distribution of successful shooting percentages organised by height.

Figure

# Load a nicer theme

theme_set(
  theme_classic() + #set the theme 
    theme(text = element_text(size = 20))) 


#Data preparation: removing statistics for players who have attempted few or no free throw shots or whose height is not recorded

df.filtered <- df %>% 
  filter(FTA > 3, !is.na(Height))

ggplot(data = df.filtered,
       mapping = aes(x = Height,
                     y = FT.
                     )) +
  geom_point(alpha = 0.2) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Percentage of Successful Free-Throws", x = "Height (centimetres)", y = "%")

Caption: Each point is one player, and the line is a linear fit. Taller players are slightly less likely to be accurate shooters, at least from the free-throw line.

However, a different story emerges when we consider the percentage of non-free-throw shots that are successfully made in game. This is important, since most points are scored in game when facing opposition on the court, not from the free-throw line.

#Data preparation: computing in-game shooting statistics

df <- df %>% 
  mutate(
    A. = 100 * (FGM + X3PM) / (FGA + X3PA), #percentage of in-game shots made
    AA = FGA + X3PA #number of in-game shots attempted
  )

#removing statistics for players who have attempted few or no in-game shots or whose height is not recorded


df.filtered <- df %>% 
  filter(AA > 3, !is.na(Height))

ggplot(data = df.filtered,
       mapping = aes(x = Height,
                     y = A.
                     )) +
  geom_point(alpha = 0.2) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Percentage of Successful In-Game Shots", x = "Height (centimetres)", y = "%")

Caption: The above shows that taller players are more likely to make their shots successfully, perhaps because they choose to take shots in circumstances where they have the advantage or because the opposition is generally less effective in defending them.

In summary, the conventional wisdom that taller players are worse shooters is only a half-truth: it is partly true insofar as taller players are slightly less likely to make their shots when unopposed, but it is partly false insofar as taller players are more likely to make their shots when it matters most, that is, when playing in-game.
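The direction of each trend line can also be read off as a number. Below is a minimal sketch, assuming the same column names as in the code above (Height, FT., A.). The data frame df.toy contains made-up values chosen only so that the slopes are easy to verify by hand; it is not the NBA data, and the helper slope_per_cm is a hypothetical name introduced for this example.

```r
# TOY DATA ONLY: made-up values, not taken from players_stats.csv
df.toy <- data.frame(
  Height = c(180, 190, 200, 210),  # height in centimetres
  FT. = c(80, 78, 76, 74),         # free-throw percentage
  A. = c(40, 42, 44, 46)           # in-game shooting percentage
)

# slope of the least-squares line, i.e. the line that
# geom_smooth(method = "lm") draws through the points
slope_per_cm <- function(data, outcome) {
  fit <- lm(reformulate("Height", response = outcome), data = data)
  unname(coef(fit)["Height"])
}

slope_per_cm(df.toy, "FT.")  # change in free-throw % per extra centimetre
slope_per_cm(df.toy, "A.")   # change in in-game % per extra centimetre
```

Applied to the filtered player data instead of df.toy, a negative slope for FT. and a positive slope for A. would correspond to the two patterns described in the captions.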

Session info

Information about this R session including which version of R was used, and what packages were loaded.

sessionInfo()
## R version 3.5.2 (2018-12-20)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 17134)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] bindrcpp_0.2.2  forcats_0.3.0   stringr_1.3.1   dplyr_0.7.7    
##  [5] purrr_0.2.5     readr_1.1.1     tidyr_0.8.2     tibble_1.4.2   
##  [9] ggplot2_3.1.0   tidyverse_1.2.1 knitr_1.20     
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.18     cellranger_1.1.0 pillar_1.3.0     compiler_3.5.2  
##  [5] plyr_1.8.4       bindr_0.1.1      tools_3.5.2      digest_0.6.17   
##  [9] lubridate_1.7.4  jsonlite_1.5     evaluate_0.11    nlme_3.1-137    
## [13] gtable_0.2.0     lattice_0.20-38  pkgconfig_2.0.2  rlang_0.2.2     
## [17] cli_1.0.1        rstudioapi_0.7   yaml_2.2.0       haven_1.1.2     
## [21] withr_2.1.2      xml2_1.2.0       httr_1.3.1       hms_0.4.2       
## [25] grid_3.5.2       tidyselect_0.2.5 glue_1.3.0       R6_2.2.2        
## [29] readxl_1.1.0     rmarkdown_1.11   modelr_0.1.2     magrittr_1.5    
## [33] backports_1.1.3  scales_1.0.0     htmltools_0.3.6  rvest_0.3.2     
## [37] assertthat_0.2.0 colorspace_1.3-2 labeling_0.3     stringi_1.1.7   
## [41] lazyeval_0.2.1   munsell_0.5.0    broom_0.5.0      crayon_1.3.4

Grading Rubric

There are 15 possible points for this homework.

Baseline
  • +1 for an .Rmd file that compiles without errors
  • +1 for describing the dataset
  • +1 for having a plot
  • +1 for including the code that generated the plot
  • +1 for describing the visual mapping (i.e. a key)
Average
  • +1 unnecessary messages from R are hidden from being displayed in the HTML
  • +1 for including a catchy and/or engaging title
  • +1 for having at least 100 words and no more than 500 words
  • +1 for explaining in a single coherent sentence what we can learn from this graphic
  • +1 for explaining the choice of geometric mapping
Advanced
  • +1 blog post text provides context or background useful in interpreting the graphic
  • +0-4 WOW factor: awarded at the grader’s discretion for submissions that are exceptionally compelling