set.seed(31412718)

library(ggplot2)
library(tidyverse)

Datasets

We’ll be using data from the blue_jays.rda dataset, which is already in the /data subdirectory of our data_vis_labs project. Below is a description of the variables contained in the dataset.

  • BirdID - ID tag for bird
  • KnownSex - Sex coded as F or M
  • BillDepth - Thickness of the bill measured at the nostril (in mm)
  • BillWidth - Width of the bill (in mm)
  • BillLength - Length of the bill (in mm)
  • Head - Distance from tip of bill to back of head (in mm)
  • Mass - Body mass (in grams)
  • Skull - Distance from base of bill to back of skull (in mm)
  • Sex - Sex coded as 0 = female or 1 = male

We’ll also be using a subset of the BRFSS (Behavioral Risk Factor Surveillance System) survey collected annually by the Centers for Disease Control and Prevention (CDC). The data can be found in the provided cdc.txt file — place this file in your /data subdirectory. The dataset contains 20,000 complete observations/records of 9 variables/fields, described below.

  • genhlth - How would you rate your general health? (excellent, very good, good, fair, poor)
  • exerany - Have you exercised in the past month? (1 = yes, 0 = no)
  • hlthplan - Do you have some form of health coverage? (1 = yes, 0 = no)
  • smoke100 - Have you smoked at least 100 cigarettes in your lifetime? (1 = yes, 0 = no)
  • height - height in inches
  • weight - weight in pounds
  • wtdesire - weight desired in pounds
  • age - in years
  • gender - m for males and f for females

Exercise 1

Using the blue_jays dataset, construct a few different scatterplots of Head by Mass:

load(file = "data/blue_jays.rda")

Here, I have loaded the blue_jays dataset for use in the scatterplots below.

ggplot(data= blue_jays,
       aes(y=Head,x=Mass))+
       geom_point(color="#4E2A84",size=2,shape=17)+
       labs(x="Mass(grams)",
            y="Head(mm)")

Above, I have created a scatterplot of Head by Mass using the blue_jays dataset. The color is set to Northwestern purple, the shape to a solid triangle, and the size to 2. Based on the scatterplot, it seems that as the mass of a bird increases, its head size increases as well. The purple triangles do a good job of showing this relationship clearly.

ggplot(data= blue_jays,
       aes(y=Head,x=Mass, colour=Sex))+
       geom_point(size=2)+
       labs(x="Mass(grams)",
            y="Head(mm)")

Here, I have created a scatterplot of Head by Mass by mapping the Sex variable to the color aesthetic, with size 2. Although the scatterplot is readable, mapping Sex to color does not give a clear picture, because Sex is coded as 0 and 1 and ggplot2 therefore treats it as continuous. Specifically, the shades of blue are not easy to tell apart at a glance. As noted in the textbook, the color aesthetic does not work well with a variable that is treated as continuous, as Sex is here.
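The shades of blue appear because Sex is stored as the numbers 0 and 1. One possible workaround, sketched here with a small made-up data frame (toy_jays and its values are hypothetical and not taken from blue_jays.rda), is to wrap the variable in factor() so that ggplot2 gives it a discrete color scale:

```r
library(ggplot2)

# Hypothetical toy data standing in for blue_jays (values are made up)
toy_jays <- data.frame(
  Mass = c(65.2, 70.1, 72.4, 74.8, 76.3, 79.0),
  Head = c(54.1, 55.0, 55.9, 57.2, 57.8, 58.6),
  Sex  = c(0, 0, 0, 1, 1, 1)
)

# factor() turns the numeric 0/1 coding into a categorical variable;
# the labels follow the coding given in the variable list (0 = female, 1 = male)
toy_jays$SexFactor <- factor(toy_jays$Sex,
                             levels = c(0, 1),
                             labels = c("female", "male"))

# With a factor mapped to colour, each level gets its own distinct colour
p_factor <- ggplot(toy_jays, aes(x = Mass, y = Head, colour = SexFactor)) +
  geom_point(size = 2) +
  labs(x = "Mass (grams)", y = "Head (mm)", colour = "Sex")

p_factor
```

With the real data, the same idea would be aes(colour = factor(Sex)) inside the existing ggplot() call.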

ggplot(data= blue_jays,
       aes(y=Head,x=Mass, colour=KnownSex))+
       geom_point(size=2)+
       labs(x="Mass(grams)",
            y="Head(mm)")

Here, I have created a scatterplot of Head by Mass by mapping the KnownSex variable to the color aesthetic, with size 2. As a categorical variable, KnownSex is a better match for the color aesthetic than Sex, because it produces clear, distinct colors for male and female. I believe this makes the plot easier for a general reader to interpret.

In a nutshell, both scatterplots of Head by Mass, one with Sex and one with KnownSex mapped to the color aesthetic, show that the mass and head size of a bird are positively associated, and that male birds generally have larger heads and greater mass than female birds. However, for a clear, readable plot, I believe, as the book notes, that the categorical variable KnownSex has an advantage over the numerically coded variable Sex when it is mapped to color. The main reason is that KnownSex produces clearly distinguishable colors for female and male, while Sex produces two similar shades of dark and light blue, which are harder to tell apart.

Exercise 2

Using a subsample of size 100 from the cdc dataset (code provided below), construct a scatterplot of weight by height. Construct 5 more scatterplots of weight by height that make use of the aesthetic attributes color and shape (maybe size too). You can define both aesthetics at the same time in each plot or one at a time. Just experiment. There should be six plots in total.

# Read in the cdc dataset
cdc <- read_delim(file = "data/cdc.txt", delim = "|") %>%
  mutate(genhlth = factor(genhlth,
    levels = c("excellent", "very good", "good", "fair", "poor")
  ))
## Parsed with column specification:
## cols(
##   genhlth = col_character(),
##   exerany = col_double(),
##   hlthplan = col_double(),
##   smoke100 = col_double(),
##   height = col_double(),
##   weight = col_double(),
##   wtdesire = col_double(),
##   age = col_double(),
##   gender = col_character()
## )
# Selecting a random subset of size 100
cdc_small <- cdc %>% sample_n(100)

The above code loads the ‘cdc’ dataset into the global environment and selects a random subset of size 100 from the ‘cdc’ dataset.
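Because the rows are drawn at random, the subset is only reproducible when a seed is set first, as at the top of this lab. Here is a minimal sketch with a made-up tibble (toy_cdc and the seed value 123 are hypothetical). It uses slice_sample(), which the dplyr documentation lists as the successor to sample_n():

```r
library(dplyr)

# Hypothetical stand-in for the cdc data: 1000 rows with made-up heights
toy_cdc <- tibble(id = 1:1000, height = rep(60:79, times = 50))

# Setting the same seed before each draw gives the same random subset
set.seed(123)
draw_1 <- toy_cdc %>% slice_sample(n = 100)

set.seed(123)
draw_2 <- toy_cdc %>% slice_sample(n = 100)

# Compare the two draws row by row
identical(draw_1, draw_2)
```

Without the second set.seed() call, the two draws would almost certainly contain different rows.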

ggplot(data = cdc_small,
       aes(y=height,x=weight))+
      geom_point()+
      labs(x="weight(pounds)",
           y="height(inches)")

Above is the first scatterplot of height by weight, without any additional aesthetics. It seems that weight and height are positively associated: as weight increases, height tends to increase as well.

ggplot(data = cdc_small,
       aes(y=height,x=weight))+
       geom_point(color="red")+
       labs(x="weight(pounds)",
            y="height(inches)")

Above is the second scatterplot of height by weight, with the point color set to red. The red points stand out more than the default black ones.
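In the plot above, the color is set to a fixed value outside aes() rather than mapped to a variable. The following sketch contrasts the two, using a small made-up data frame (toy_people and its values are hypothetical):

```r
library(ggplot2)

# Hypothetical toy data (values are made up)
toy_people <- data.frame(weight = c(130, 160, 190),
                         height = c(62, 67, 72))

# Setting: colour is a fixed value outside aes(), so every point is drawn red
p_set <- ggplot(toy_people, aes(x = weight, y = height)) +
  geom_point(colour = "red")

# Mapping: inside aes(), "red" is treated as a data value (a one-level
# category), so ggplot2 picks a colour from its default palette and adds a legend
p_mapped <- ggplot(toy_people, aes(x = weight, y = height, colour = "red")) +
  geom_point()

# layer_data() shows the colours that were actually used for the points
layer_data(p_set)$colour
layer_data(p_mapped)$colour
```

This is why fixed colors such as "red" go outside aes(), while variables such as gender go inside it.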

ggplot(data = cdc_small,
       aes(y=height,x=weight))+
       geom_point(size=5, color="green")+
       labs(x="weight(pounds)",
            y="height(inches)")

Above is the third scatterplot of height by weight, with the point size set to 5 and the color set to green. The points are clearly bigger, and neighboring points merge into one another. Bigger points are not always more readable.
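One common way to keep large, overlapping points readable is to make them partly transparent with the alpha argument of geom_point(). This is a sketch with a small made-up data frame (toy_people and its values are hypothetical, and alpha = 0.4 is an arbitrary choice):

```r
library(ggplot2)

# Hypothetical toy data with deliberately close points (values are made up)
toy_people <- data.frame(
  weight = c(150, 152, 151, 180, 182, 181, 200),
  height = c(66, 66, 67, 70, 70, 71, 73)
)

# alpha = 0.4 makes each point 40% opaque, so overlapping points look darker
p_alpha <- ggplot(toy_people, aes(x = weight, y = height)) +
  geom_point(size = 5, colour = "green", alpha = 0.4) +
  labs(x = "weight (pounds)", y = "height (inches)")

p_alpha
```

Areas where several points overlap then appear darker, which shows where the data are concentrated.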

ggplot(data = cdc_small,
       aes(y=height,x=weight))+
       geom_point(color="yellow",size=3,shape=11)+
       labs(x="weight(pounds)",
            y="height(inches)")

In this scatterplot, I have set the color to yellow, the size to 3, and the shape to 11. The shape resembles a star, and it is interesting that there are so many ways to draw the points of a scatterplot. The stars look nice, but they are hard to read because the yellow is too bright.

ggplot(data = cdc_small,
       aes(y=height,x=weight,size=age))+
       geom_point(color="blue",shape=13)+
       labs(x="weight(pounds)",
            y="height(inches)")

In this scatterplot, I have set the color to blue and the shape to 13 (from the R shape index), and mapped the size of the points to the age of the respondents. I believe this scatterplot shows how age can be brought in as a third variable. The ages in this sample appear to be fairly evenly distributed.

ggplot(data = cdc_small,
       aes(y=height,x=weight,color=gender))+
       geom_point(shape=19, size=2)+
       labs(x="weight(pounds)",
            y="height(inches)")

In this scatterplot, I have mapped the color aesthetic to the gender of the respondents. It seems that males generally weigh more and are taller than females.
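A numeric summary can back up a visual impression like this one. The sketch below uses a made-up tibble (toy_cdc and its values are hypothetical and say nothing about the real survey). With the real data, the same pipeline could be applied to cdc_small:

```r
library(dplyr)

# Hypothetical toy data (values are made up, not from cdc.txt)
toy_cdc <- tibble(
  gender = c("m", "m", "m", "f", "f", "f"),
  height = c(70, 72, 68, 64, 66, 62),
  weight = c(180, 200, 175, 130, 150, 125)
)

# Mean height and weight within each gender group
by_gender <- toy_cdc %>%
  group_by(gender) %>%
  summarise(mean_height = mean(height),
            mean_weight = mean(weight))

by_gender
```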