The tidyverse and palmerpenguins packages are loaded in the setup chunk of the Rmd file, which should not display in the HTML file.
glimpse(penguins)
## Rows: 344
## Columns: 8
## $ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
## $ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
## $ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
## $ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
## $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
## $ body_mass_g <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
## $ sex <fct> male, female, female, NA, female, male, female, male…
## $ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
There are 344 rows and 8 columns in the penguins dataset.
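The row and column counts can also be pulled out directly instead of being read off the glimpse() output. This is a minimal sketch, assuming the palmerpenguins package is installed; the name penguin_dims is only illustrative.

```r
library(palmerpenguins)

# dim() returns c(number of rows, number of columns)
penguin_dims <- dim(penguins)
penguin_dims

# nrow() and ncol() return each count separately
nrow(penguins)
ncol(penguins)
```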
The island variable contains data about which island
each penguin was documented on (Biscoe, Dream or Torgersen).
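Because island is stored as a factor, its possible values can be listed with levels(). A short sketch, assuming palmerpenguins is installed; island_levels is an illustrative name.

```r
library(palmerpenguins)

# The levels of a factor are its distinct categories
island_levels <- levels(penguins$island)
island_levels

# table() counts how many penguins were recorded on each island
table(penguins$island)
```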
glimpse(penguins)
## Rows: 344
## Columns: 8
## $ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
## $ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
## $ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
## $ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
## $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
## $ body_mass_g <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
## $ sex <fct> male, female, female, NA, female, male, female, male…
## $ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
The categorical variables are species,
island, and sex. The numeric variables are
bill_length_mm, bill_depth_mm,
flipper_length_mm, body_mass_g, and
year. The glimpse() command shows the type of each
variable: “dbl” and “int” variables are numeric, and “fct” variables are
categorical. The str() command can also be used; there, “Factor” variables are
categorical and “num” or “int” variables are numeric. The type of data
in each variable can also be inferred from the description of each variable
in the documentation.
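The same classification can be done programmatically rather than by reading the type abbreviations. This sketch assumes palmerpenguins is installed and treats factors as categorical and anything passing is.numeric() as numeric; the variable names are illustrative.

```r
library(palmerpenguins)

# Storage class of every column
col_classes <- sapply(penguins, class)
col_classes

# Factor columns are the categorical variables
categorical_vars <- names(penguins)[sapply(penguins, is.factor)]

# is.numeric() is TRUE for both double and integer columns
numeric_vars <- names(penguins)[sapply(penguins, is.numeric)]

categorical_vars
numeric_vars
```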
The code uses the penguins dataset. The x-axis is
flipper_length_mm, and the y-axis uses
bill_length_mm. A is the correct image. It shows a
scatterplot, which is what geom_point creates.
ggplot(data = penguins) +
geom_point(mapping = aes(x = flipper_length_mm, y = bill_length_mm), color = "blue")
## Warning: Removed 2 rows containing missing values (`geom_point()`).
ggplot(data = penguins) +
geom_point(mapping = aes(x = flipper_length_mm, y = bill_length_mm, color = species))
## Warning: Removed 2 rows containing missing values (`geom_point()`).
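The three species of penguins appear to be different sizes. The Adelie penguins are the smallest, with the shortest flippers and bills, and the Gentoo penguins have the longest flippers in the dataset. The Chinstrap penguins are in the middle for flipper length, although their bills are about as long as the Gentoo penguins' bills.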
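The visual comparison between species can be checked numerically with a grouped summary. This is a sketch, assuming dplyr and palmerpenguins are installed; species_means and its column names are illustrative.

```r
library(dplyr)
library(palmerpenguins)

# Mean flipper and bill length per species, ignoring missing values
species_means <- penguins %>%
  group_by(species) %>%
  summarise(
    mean_flipper_mm = mean(flipper_length_mm, na.rm = TRUE),
    mean_bill_mm = mean(bill_length_mm, na.rm = TRUE)
  )

species_means
```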
The code uses the penguins dataset. The x-axis is
species, and the y-axis is the number of penguins at each
level of species. C is the correct graph.
ggplot(data = penguins) +
geom_bar(mapping = aes(x = species), fill = "blue")
ggplot(data = penguins) +
geom_bar(mapping = aes(x = species, fill = island))
ggplot(data = penguins) +
geom_bar(mapping = aes(x = species, fill = island), position = "dodge")
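The Adelie penguin is the only species found on more than one island: it appears on all three, while Chinstrap penguins appear only on Dream and Gentoo penguins only on Biscoe.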
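The counts behind the bar chart can be tabulated to confirm which species appear on which islands. A minimal sketch, assuming dplyr and palmerpenguins are installed; the object names are illustrative.

```r
library(dplyr)
library(palmerpenguins)

# One row per species/island combination that actually occurs
species_by_island <- count(penguins, species, island)
species_by_island

# Number of distinct islands on which each species was recorded
islands_per_species <- count(species_by_island, species, name = "n_islands")
islands_per_species
```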
ggplot(data = penguins) +
geom_point(mapping = aes(x = species, y = sex))
ggplot(data = penguins) +
geom_bar(mapping = aes(x = species, fill = sex), position = "dodge")
The species and sex variables are both categorical, so a scatterplot just stacks the points on top of each other at each combination of categories. It isn’t possible to see how many penguins fall into each combination. The bar chart makes it much easier to see how many data points (penguins) there are at each combination of the categories.
ggplot(data = penguins) +
geom_point(mapping = aes(x = species, y = sex), position = "jitter")
Jitter spreads out the points at the different combinations of the categorical variables so it is easier to see how many data points there are with that combination.
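The amount of jitter is random by default, so the plot changes slightly each time it is drawn. The sketch below, assuming ggplot2 and palmerpenguins are installed, uses position_jitter() to set the spread and a seed explicitly; the width, height, and seed values are arbitrary choices for illustration.

```r
library(ggplot2)
library(palmerpenguins)

# position_jitter() controls how far points move and fixes the random seed
jitter_plot <- ggplot(data = penguins) +
  geom_point(
    mapping = aes(x = species, y = sex),
    position = position_jitter(width = 0.2, height = 0.2, seed = 1)
  )

jitter_plot
```

Smaller width and height values keep the points closer to their category, which helps when neighboring categories would otherwise blur together.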
jobs <- read_csv("Lab3_jobs.csv")
## Rows: 5584 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): sex, occupation
## dbl (4): year, people, n.by.year, proportion
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
is_tibble(jobs)
## [1] TRUE
jobs
## # A tibble: 5,584 × 6
## year sex occupation people n.by.year proportion
## <dbl> <chr> <chr> <dbl> <dbl> <dbl>
## 1 2000 M Accountant / Auditor 866460 168122353 0.00515
## 2 2000 M Actor 29975 168122353 0.000178
## 3 2000 M Advertising Agent 114250 168122353 0.000680
## 4 2000 M Architect 171025 168122353 0.00102
## 5 2000 M Artist / Art Teacher 147525 168122353 0.000877
## 6 2000 M Author 89748 168122353 0.000534
## 7 2000 M Baker 109810 168122353 0.000653
## 8 2000 M Bank Teller 53683 168122353 0.000319
## 9 2000 M Barber / Beautician 180813 168122353 0.00108
## 10 2000 M Bartender 185678 168122353 0.00110
## # … with 5,574 more rows
The is_tibble() command returned TRUE, indicating that the jobs object is a tibble. I can also tell by how it displays in the console: the first line of the printout reads “A tibble: 5,584 × 6”.
is.factor(jobs$occupation)
## [1] FALSE
The command returned FALSE, indicating that the occupation variable is NOT a factor.
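read_csv() reads text columns as character vectors, which is why the check returns FALSE. If a factor is needed, the column can be converted with factor(). The sketch below uses a small made-up tibble (jobs_demo, with invented values) as a stand-in for the jobs data so that it runs without the CSV file.

```r
library(tibble)

# Hypothetical stand-in for the jobs data; the values are invented
jobs_demo <- tibble(
  occupation = c("Actor", "Baker", "Actor"),
  people = c(100, 200, 300)
)

# Character column: not a factor yet
before <- is.factor(jobs_demo$occupation)

# Convert the column to a factor
jobs_demo$occupation <- factor(jobs_demo$occupation)
after <- is.factor(jobs_demo$occupation)

before
after
levels(jobs_demo$occupation)
```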
One difference between a data frame and a tibble in R is that some statistical functions in R won’t work on tibbles. Another is how they print to the screen: tibbles are much nicer to read, because they don’t try to print out ALL the data, just the first 10 rows (by default). When a tibble prints, it also displays other useful information about the dataset, such as the number of rows and columns and the type of each column. Finally, data frames are part of “base R”, while tibbles are part of the tidyverse.
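One concrete behavioral difference, and a likely reason some functions written for data frames misbehave with tibbles, is single-column subsetting. The sketch below assumes the tibble package is installed and uses a small made-up data frame; as.data.frame() is the usual way to convert back when a function needs a plain data frame.

```r
library(tibble)

# Small made-up data for illustration
df <- data.frame(x = 1:3, y = c("a", "b", "c"))
tb <- as_tibble(df)

# A data frame drops to a plain vector when one column is selected...
df_col <- df[, "x"]
class(df_col)

# ...while a tibble always returns another tibble
tb_col <- tb[, "x"]
class(tb_col)

# Convert back to a base R data frame when a function requires one
df_again <- as.data.frame(tb)
class(df_again)
```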
analytics_raw <- read_csv("analytics_raw.csv")
## New names:
## Rows: 244398 Columns: 21
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (11): device, os, browser, user_type, channel, gender, hour_of_day, land... dbl
## (9): ...1, frequency, recency, page_depth, age, duration, quantity, rev... lgl
## (1): purchase_flag
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`
analytics_sample <- analytics_raw[sample(nrow(analytics_raw), 1000, replace = FALSE), ]
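The sample above is different every time the document is knit. One way to make it reproducible is to set a seed first; dplyr's slice_sample() is an alternative to indexing with sample(). This sketch uses a made-up tibble (demo_raw) in place of the analytics data so that it runs without the CSV file, and the seed value is arbitrary.

```r
library(dplyr)

# Hypothetical stand-in for analytics_raw
demo_raw <- tibble(id = 1:5000)

# Fixing the seed makes the random sample repeatable
set.seed(42)
demo_sample <- slice_sample(demo_raw, n = 1000)

nrow(demo_sample)
```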
ggplot(data = analytics_sample) +
geom_point(mapping = aes(x = hour_of_day, y = frequency))
numeric variable: duration
ggplot(data = analytics_sample) +
geom_point(mapping = aes(x = hour_of_day, y = frequency, color = duration))
ggplot(data = analytics_sample) +
geom_point(mapping = aes(x = hour_of_day, y = frequency, size = duration))
ggplot(data = analytics_sample) +
geom_point(mapping = aes(x = hour_of_day, y = frequency, shape = duration))
## Error in `geom_point()`:
## ! Problem while computing aesthetics.
## ℹ Error occurred in the 1st layer.
## Caused by error in `scale_f()`:
## ! A continuous variable cannot be mapped to the shape aesthetic
## ℹ choose a different aesthetic or use `scale_shape_binned()`
categorical variable: os
ggplot(data = analytics_sample) +
geom_point(mapping = aes(x = hour_of_day, y = frequency, color = os))
ggplot(data = analytics_sample) +
geom_point(mapping = aes(x = hour_of_day, y = frequency, size = os))
## Warning: Using size for a discrete variable is not advised.
ggplot(data = analytics_sample) +
geom_point(mapping = aes(x = hour_of_day, y = frequency, shape = os))
Mapping a continuous variable to shape gave an error, and mapping a categorical variable to size gave a warning. Shapes are discrete, so it makes sense that a continuous variable can’t be mapped to shape. Also, if a variable has more categories than there are available shapes (ggplot2’s default shape palette handles at most six), you can’t visually encode all of its levels. Sizes are continuous, so although a categorical variable can be mapped to size, it might not make conceptual sense to do so unless the categories reflect ordered levels of size or magnitude in some way.
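If a continuous variable really does need to be shown with shape, it has to be binned into categories first. The error message suggests scale_shape_binned(); the sketch below instead bins by hand with cut(), using the penguins data so that it runs without the analytics file. The break points and the mass_group name are arbitrary choices for illustration.

```r
library(dplyr)
library(ggplot2)
library(palmerpenguins)

# Bin body mass into three ordered groups; the breaks are arbitrary
penguins_binned <- penguins %>%
  mutate(mass_group = cut(body_mass_g, breaks = c(2500, 4000, 5000, 6500)))

# The binned variable is a factor, so it can be mapped to shape
binned_plot <- ggplot(data = penguins_binned) +
  geom_point(mapping = aes(x = flipper_length_mm, y = bill_length_mm, shape = mass_group))

binned_plot
```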
ggplot(data = penguins) +
geom_point(mapping = aes(x = flipper_length_mm, y = bill_length_mm, color = species, shape = species)) +
labs(title = "Penguin Flipper Length Generally Increases with Bill Length", caption = "Data Taken from Files Provided") +
theme(plot.caption = element_text(hjust = 1)) +
theme(legend.position = "top")
## Warning: Removed 2 rows containing missing values (`geom_point()`).