Week 4 Lab

Part 1: Learn about the dataset

Question 1

The tidyverse library and the palmerpenguins library are loaded in the setup chunk of the Rmd file, which should not display in the HTML file.

Question 2

glimpse(penguins)
## Rows: 344
## Columns: 8
## $ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
## $ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
## $ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
## $ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
## $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
## $ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
## $ sex               <fct> male, female, female, NA, female, male, female, male…
## $ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…

Question 3

There are 344 rows and 8 columns in the penguins dataset.

Question 4

The island variable contains data about which island each penguin was documented on (Biscoe, Dream or Torgersen).

Question 5

glimpse(penguins)
## Rows: 344
## Columns: 8
## $ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
## $ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
## $ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
## $ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
## $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
## $ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
## $ sex               <fct> male, female, female, NA, female, male, female, male…
## $ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…

The categorical variables are species, island, and sex. The numeric variables are bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g, and year. The glimpse() output shows the type of each variable: “dbl” and “int” variables are numeric, and “fct” variables are categorical. The str() command can also be used; in its output, “Factor” variables are categorical and “num” or “int” variables are numeric. The type of data in each variable can also be inferred from its description in the documentation.
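
The same classification can be checked programmatically. This is a minimal sketch, assuming the palmerpenguins package is installed; it relies only on base R's is.factor() and is.numeric(), and the object names is_categorical and is_numeric are chosen here for illustration.

```r
library(palmerpenguins)

# TRUE for each column stored as a factor (categorical)
is_categorical <- sapply(penguins, is.factor)

# TRUE for each column stored as an integer or double (numeric)
is_numeric <- sapply(penguins, is.numeric)

# Column names in each group
names(penguins)[is_categorical]
names(penguins)[is_numeric]
```

Note that year is stored as an integer, so it is counted as numeric here even though it only takes a few distinct values.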

Part 2: Create a graph and modify its appearance

Question 6

The code uses the penguins dataset. The x-axis is flipper_length_mm, and the y-axis uses bill_length_mm. A is the correct image. It shows a scatterplot, which is what geom_point creates.

Question 7

ggplot(data = penguins) + 
  geom_point(mapping = aes(x = flipper_length_mm, y = bill_length_mm), color="blue")
## Warning: Removed 2 rows containing missing values (`geom_point()`).

Question 8

ggplot(data = penguins) + 
  geom_point(mapping = aes(x = flipper_length_mm, y = bill_length_mm, color = species))
## Warning: Removed 2 rows containing missing values (`geom_point()`).

Question 9

The three species of penguins appear to be different sizes. The Adelie are the smallest, the Chinstrap are in the middle, and the Gentoo penguins are the biggest in the dataset.
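
One way to check this visual impression is to compute a per-species average for each size measurement. The sketch below assumes dplyr and palmerpenguins are installed; the summary column names (mean_flipper_mm and so on) are made up for this example. The ordering of the species may differ from one measurement to another, so it is worth looking at each column separately.

```r
library(dplyr)
library(palmerpenguins)

# Mean of each size measurement per species, ignoring missing values
size_by_species <- penguins %>%
  group_by(species) %>%
  summarise(
    mean_flipper_mm = mean(flipper_length_mm, na.rm = TRUE),
    mean_bill_mm    = mean(bill_length_mm, na.rm = TRUE),
    mean_mass_g     = mean(body_mass_g, na.rm = TRUE)
  )

size_by_species
```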

Part 3: Create another graph and modify its appearance

Question 10

The code uses the penguins dataset. The x-axis is species, and the y-axis is the number of penguins at each level of species. C is the correct graph.

Question 11

ggplot(data = penguins) + 
  geom_bar(mapping = aes(x = species), fill = "blue")

Question 12

ggplot(data = penguins) + 
  geom_bar(mapping = aes(x = species, fill = island))

Question 13

ggplot(data = penguins) + 
  geom_bar(mapping = aes(x = species, fill = island), position="dodge")

The only penguin that lives on multiple islands is the Adelie penguin.
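
The observation from the bar chart can also be confirmed with a count table. This is a sketch assuming dplyr and palmerpenguins are installed; species_by_island and islands_per_species are illustrative names.

```r
library(dplyr)
library(palmerpenguins)

# Number of penguins recorded for each species/island pair
# (pairs with no penguins are dropped by default)
species_by_island <- penguins %>%
  count(species, island)

species_by_island

# Number of distinct islands on which each species was recorded
islands_per_species <- species_by_island %>%
  count(species, name = "n_islands")

islands_per_species
```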

Part 4: Try out making different kinds of graphs

Question 14: Scatterplot

ggplot(data = penguins) + 
  geom_point(mapping = aes(x = species, y = sex))

Question 15: Barchart

ggplot(data = penguins) + 
  geom_bar(mapping = aes(x = species, fill = sex), position="dodge")

Question 16

The species and sex variables are both categorical, so making a scatterplot just layers the points on top of each other at each of the different combinations of categories. It isn’t possible to see how many penguins fit into each combination of the categories. The barchart allows us to see more easily how many data points (penguins) are at each combination of the categories.
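
The counts that the bar chart draws can be produced directly as a table. A minimal sketch, assuming dplyr and palmerpenguins are installed (sex_counts is an illustrative name); penguins with a missing sex appear in their own NA rows.

```r
library(dplyr)
library(palmerpenguins)

# One row per species/sex combination, with the number of penguins in n
sex_counts <- penguins %>%
  count(species, sex)

sex_counts
```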

Question 17

ggplot(data = penguins) + 
  geom_point(mapping = aes(x = species, y = sex), position = "jitter")

Question 18

Jitter spreads out the points at the different combinations of the categorical variables so it is easier to see how many data points there are with that combination.
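
Because jitter is random, the plot looks slightly different every time it is drawn. The sketch below uses ggplot2's position_jitter() to fix the random seed and control how far the points spread; the width, height, and seed values are arbitrary choices for illustration.

```r
library(ggplot2)
library(palmerpenguins)

# Same plot as Question 17, but with a reproducible, narrower jitter
jitter_plot <- ggplot(data = penguins) +
  geom_point(
    mapping = aes(x = species, y = sex),
    position = position_jitter(width = 0.2, height = 0.2, seed = 1)
  )

jitter_plot
```

Smaller width and height values keep each cloud of points closer to its category, which makes neighboring categories easier to tell apart.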

Part 5: Tibbles

Question 19

jobs <- read_csv("Lab3_jobs.csv")
## Rows: 5584 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): sex, occupation
## dbl (4): year, people, n.by.year, proportion
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Question 20

is_tibble(jobs)
## [1] TRUE
jobs
## # A tibble: 5,584 × 6
##     year sex   occupation           people n.by.year proportion
##    <dbl> <chr> <chr>                 <dbl>     <dbl>      <dbl>
##  1  2000 M     Accountant / Auditor 866460 168122353   0.00515 
##  2  2000 M     Actor                 29975 168122353   0.000178
##  3  2000 M     Advertising Agent    114250 168122353   0.000680
##  4  2000 M     Architect            171025 168122353   0.00102 
##  5  2000 M     Artist / Art Teacher 147525 168122353   0.000877
##  6  2000 M     Author                89748 168122353   0.000534
##  7  2000 M     Baker                109810 168122353   0.000653
##  8  2000 M     Bank Teller           53683 168122353   0.000319
##  9  2000 M     Barber / Beautician  180813 168122353   0.00108 
## 10  2000 M     Bartender            185678 168122353   0.00110 
## # … with 5,574 more rows

The above command returned TRUE, indicating that the jobs object is a tibble. I can also tell by how it displays in the console.

Question 21

is.factor(jobs$occupation)
## [1] FALSE

The command returned FALSE, indicating that the occupation variable is NOT a factor.

Question 22

One difference between a data frame and a tibble in R is that some statistical functions in R won’t work on tibbles. Another is that they print to the screen differently: tibbles are much nicer for printing, because they don’t try to print ALL the data, just the first 10 rows (by default). Data frames are part of “base R”, whereas tibbles are part of the tidyverse. Finally, when a tibble prints to the screen, it displays other useful information about the dataset, such as the number of rows and columns and the type of each column.
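
One concrete behavioral difference, and a plausible reason some functions written for data frames stumble on tibbles, is single-column subsetting. This is a small sketch with made-up data, assuming the tibble package is installed.

```r
library(tibble)

# The same toy data as a base data frame and as a tibble
df <- data.frame(x = 1:3, y = c("a", "b", "c"))
tb <- as_tibble(df)

# Selecting one column with [ , ] simplifies a data frame to a plain vector...
class(df[, "x"])

# ...but a tibble stays a tibble
class(tb[, "x"])

# A tibble can always be converted back when a function needs a data frame
df_again <- as.data.frame(tb)
class(df_again)
```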

Extra credit

analytics_raw <- read_csv("analytics_raw.csv")
## New names:
## Rows: 244398 Columns: 21
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (11): device, os, browser, user_type, channel, gender, hour_of_day, land...
## dbl  (9): ...1, frequency, recency, page_depth, age, duration, quantity, rev...
## lgl  (1): purchase_flag
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`
# Draw 1000 rows at random, without replacement
analytics_sample <- analytics_raw[sample(1:nrow(analytics_raw), 1000, replace = FALSE), ]

Question 23

ggplot(data = analytics_sample) + 
  geom_point(mapping = aes(x = hour_of_day, y = frequency))

Question 24

numeric variable: duration

ggplot(data = analytics_sample) + 
  geom_point(mapping = aes(x = hour_of_day, y = frequency, color = duration))

ggplot(data = analytics_sample) + 
  geom_point(mapping = aes(x = hour_of_day, y = frequency, size = duration))

ggplot(data = analytics_sample) + 
  geom_point(mapping = aes(x = hour_of_day, y = frequency, shape = duration))
## Error in `geom_point()`:
## ! Problem while computing aesthetics.
## ℹ Error occurred in the 1st layer.
## Caused by error in `scale_f()`:
## ! A continuous variable cannot be mapped to the shape aesthetic
## ℹ choose a different aesthetic or use `scale_shape_binned()`

Question 25

categorical variable: os

ggplot(data = analytics_sample) + 
  geom_point(mapping = aes(x = hour_of_day, y = frequency, color = os))

ggplot(data = analytics_sample) + 
  geom_point(mapping = aes(x = hour_of_day, y = frequency, size = os))
## Warning: Using size for a discrete variable is not advised.

ggplot(data = analytics_sample) + 
  geom_point(mapping = aes(x = hour_of_day, y = frequency, shape = os))

Question 26

Mapping a continuous variable to shape gave an error, and mapping a categorical variable to size gave a warning. Shapes are discrete, so it makes sense that a continuous variable can’t be mapped to shape. Also, if a variable has more categories than there are available shapes, not all of its levels can be visually encoded. Sizes are continuous, so although a categorical variable can be mapped to size, it might not make conceptual sense to do so unless the categories reflect ordered levels of size or magnitude in some way.
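
If a continuous variable does need to be shown with shape, one option is to bin it into a few categories first. The sketch below uses the penguins data as a stand-in for the analytics data and base R's cut(); penguins_complete, mass_bin, and the choice of 3 bins are assumptions made for illustration.

```r
library(ggplot2)
library(palmerpenguins)

# Drop rows with missing body mass so cut() does not produce NA bins
penguins_complete <- penguins[!is.na(penguins$body_mass_g), ]

# cut() converts the continuous variable into a factor with 3 equal-width bins
penguins_complete$mass_bin <- cut(penguins_complete$body_mass_g, breaks = 3)

# A factor can be mapped to shape without an error
binned_plot <- ggplot(data = penguins_complete) +
  geom_point(mapping = aes(x = flipper_length_mm, y = bill_length_mm, shape = mass_bin))

binned_plot
```

Binning discards detail within each bin, so color or size is usually a better fit when the exact values matter.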

Additional Graph

ggplot(data = penguins) +
  geom_point(mapping = aes(x = flipper_length_mm, y = bill_length_mm, color = species, shape = species)) +
  labs(title = "Penguin Flipper Length Generally Increases with Bill Length", caption = "Data Taken from Files Provided") +
  theme(plot.caption = element_text(hjust = 1), legend.position = "top")
## Warning: Removed 2 rows containing missing values (`geom_point()`).