OPIM5352-Assignment 1-Yi Fang

1.2.5 1. How many rows are in penguins? How many columns?

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

## # A tibble: 344 × 8
##    species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
##  1 Adelie  Torgersen           39.1          18.7               181        3750
##  2 Adelie  Torgersen           39.5          17.4               186        3800
##  3 Adelie  Torgersen           40.3          18                 195        3250
##  4 Adelie  Torgersen           NA            NA                  NA          NA
##  5 Adelie  Torgersen           36.7          19.3               193        3450
##  6 Adelie  Torgersen           39.3          20.6               190        3650
##  7 Adelie  Torgersen           38.9          17.8               181        3625
##  8 Adelie  Torgersen           39.2          19.6               195        4675
##  9 Adelie  Torgersen           34.1          18.1               193        3475
## 10 Adelie  Torgersen           42            20.2               190        4250
## # ℹ 334 more rows
## # ℹ 2 more variables: sex <fct>, year <int>

We can also use glimpse() from dplyr to show the shape of the dataset:

## Rows: 344
## Columns: 8
## $ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
## $ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
## $ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
## $ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
## $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
## $ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
## $ sex               <fct> male, female, female, NA, female, male, female, male…
## $ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…

The penguins data frame has 344 rows and 8 columns.
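The code chunks behind the output above are not shown in this report. As a minimal sketch, assuming the data frame comes from the palmerpenguins package, the same dimensions could be obtained like this:

```r
library(dplyr)            # provides glimpse()
library(palmerpenguins)   # assumed source of the penguins data frame

# Row and column counts, taken directly from the data frame
n_rows <- nrow(penguins)
n_cols <- ncol(penguins)
dim(penguins)

# Compact column-by-column overview; the first two lines report the dimensions
glimpse(penguins)
```

Here nrow() and ncol() return the counts as plain numbers, which is handy when the values are reused later in the report.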

3. Make a scatterplot of bill_depth_mm vs. bill_length_mm. That is, make a scatterplot with bill_depth_mm on the y-axis and bill_length_mm on the x-axis.

Describe the relationship between these two variables:

These two variables are not strongly correlated overall: the points are widely scattered, with no clear linear trend.

When ‘bill_length_mm’ is above about 45 mm, there is a slight positive correlation between the two variables.
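The plot itself is not reproduced here. A sketch of how it might be drawn, again assuming the palmerpenguins package, together with a quick numerical check of the overall correlation (the names p_bill and r_all are illustrative):

```r
library(ggplot2)
library(palmerpenguins)   # assumed source of the penguins data frame

# bill_length_mm on the x-axis, bill_depth_mm on the y-axis
p_bill <- ggplot(penguins, aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_point(na.rm = TRUE)   # silently drop the rows with missing measurements
p_bill

# Pearson correlation over the rows where both measurements are present
r_all <- cor(penguins$bill_length_mm, penguins$bill_depth_mm,
             use = "complete.obs")
r_all
```

A correlation close to zero supports the reading that the two variables are only weakly related when all species are pooled together.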

4. What happens if you make a scatterplot of species vs. bill_depth_mm? What might be a better choice of geom?

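Scatter plot: Because ‘species’ is categorical, the points line up in three bands, one per species. This shows how ‘bill_depth_mm’ differs among the three species, but overlapping points hide the shape of each distribution.

Box plot: This shows the distribution better, since it marks the 25th, 50th (median), and 75th percentiles of the data.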

For ‘Adelie’ and ‘Chinstrap’, there is no obvious difference in the distribution of ‘bill_depth_mm’; both are concentrated around 17.5 mm to 19.5 mm. However, the ‘bill_depth_mm’ of ‘Gentoo’ is noticeably lower, at roughly 13 mm to 17 mm.
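The two plots compared above could be produced along these lines. This is a sketch under the same palmerpenguins assumption, with ‘species’ on the y-axis as the question asks; the object names are illustrative:

```r
library(ggplot2)
library(palmerpenguins)   # assumed source of the penguins data frame

# Scatterplot: the points fall into one horizontal band per species
p_scatter <- ggplot(penguins, aes(x = bill_depth_mm, y = species)) +
  geom_point(na.rm = TRUE)

# Boxplot: summarises each band by its quartiles
p_box <- ggplot(penguins, aes(x = bill_depth_mm, y = species)) +
  geom_boxplot(na.rm = TRUE)

p_scatter
p_box

# The quartiles the boxes are built from, per species
depth_quartiles <- tapply(
  penguins$bill_depth_mm, penguins$species,
  quantile, probs = c(0.25, 0.5, 0.75), na.rm = TRUE
)
depth_quartiles
```

Printing depth_quartiles gives the numbers behind each box, which makes the comparison between species easier to state precisely.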

1.4.3 1. Make a bar plot of species of penguins, where you assign species to the y aesthetic. How is this plot different?

With ‘species’ assigned to the y aesthetic, the bars run horizontally, with the species on the y-axis and the counts on the x-axis. This bar plot shows the number of penguins of each species in the dataset rather than a distribution. ‘Adelie’ penguins are the most numerous (around 150), followed by ‘Gentoo’ (around 125). ‘Chinstrap’ penguins (around 70) are far fewer than either.
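A sketch of the bar plot, with the exact counts computed alongside it. It assumes the palmerpenguins package; p_bar and species_counts are illustrative names:

```r
library(ggplot2)
library(palmerpenguins)   # assumed source of the penguins data frame

# Mapping species to y (instead of x) produces horizontal bars
p_bar <- ggplot(penguins, aes(y = species)) +
  geom_bar()
p_bar

# Exact counts behind the bar lengths
species_counts <- table(penguins$species)
species_counts
```

The table gives the precise values that the bar lengths only approximate by eye.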

1.5.5 1. Which variables in mpg are categorical? Which variables are numerical?

## # A tibble: 234 × 11
##    manufacturer model      displ  year   cyl trans drv     cty   hwy fl    class
##    <chr>        <chr>      <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
##  1 audi         a4           1.8  1999     4 auto… f        18    29 p     comp…
##  2 audi         a4           1.8  1999     4 manu… f        21    29 p     comp…
##  3 audi         a4           2    2008     4 manu… f        20    31 p     comp…
##  4 audi         a4           2    2008     4 auto… f        21    30 p     comp…
##  5 audi         a4           2.8  1999     6 auto… f        16    26 p     comp…
##  6 audi         a4           2.8  1999     6 manu… f        18    26 p     comp…
##  7 audi         a4           3.1  2008     6 auto… f        18    27 p     comp…
##  8 audi         a4 quattro   1.8  1999     4 manu… 4        18    26 p     comp…
##  9 audi         a4 quattro   1.8  1999     4 auto… 4        16    25 p     comp…
## 10 audi         a4 quattro   2    2008     4 manu… 4        20    28 p     comp…
## # ℹ 224 more rows

From the column types in the tibble above, the categorical variables (<chr>) are: manufacturer, model, trans, drv, fl, class

The numerical variables (<dbl> or <int>) are: displ, year, cyl, cty, hwy
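The split can also be checked programmatically instead of by reading the column headers. The sketch below classifies columns purely by storage type, which is an assumption: variables such as ‘year’ or ‘cyl’ are stored as integers but could reasonably be treated as categorical in some analyses.

```r
library(ggplot2)   # the mpg data frame ships with ggplot2

# TRUE for columns stored as <dbl> or <int>, FALSE for <chr>
is_num <- vapply(mpg, is.numeric, logical(1))

numerical_vars   <- names(mpg)[is_num]
categorical_vars <- names(mpg)[!is_num]

numerical_vars
categorical_vars
```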

2. (1) Make a scatterplot of hwy vs. displ using the mpg data frame.

2. (2) Next, map a third, numerical variable to color, then size, then both color and size, then shape

From the first scatter plot, as ‘displ’ increases, ‘hwy’ tends to decrease and ‘cyl’ tends to increase.

Both Color and Size

Color, Size and Shape:

From the error message, we can tell that shape is an aesthetic meant for categorical variables, but ‘cyl’ is numerical, so it cannot be mapped to shape directly. Numerical variables work well with color and size because these aesthetics can represent continuous gradients.
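The sequence of mappings described above might look like the following sketch, using ‘cyl’ as the third variable. The object names are illustrative, and converting ‘cyl’ with factor() is one possible workaround rather than part of the original answer.

```r
library(ggplot2)   # the mpg data frame ships with ggplot2

base <- ggplot(mpg, aes(x = displ, y = hwy))

p_color <- base + geom_point(aes(color = cyl))              # continuous color gradient
p_size  <- base + geom_point(aes(size = cyl))               # point area scales with cyl
p_both  <- base + geom_point(aes(color = cyl, size = cyl))  # both at once

# Mapping a numeric variable to shape fails when the plot is built,
# so the error is captured here instead of stopping the script
p_shape <- base + geom_point(aes(shape = cyl))
shape_result <- tryCatch(
  { ggplot_build(p_shape); "ok" },
  error = function(e) "error"
)
shape_result

# Possible workaround: treat cyl as a discrete variable
p_shape_ok <- base + geom_point(aes(shape = factor(cyl)))
p_shape_ok
```

Wrapping the build step in tryCatch() keeps the report knitting even though the shape mapping is invalid.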

3. In the scatterplot of hwy vs. displ, what happens if you map a third variable to linewidth?

Map ‘cty’ to linewidth

From the line graph, we can tell that as ‘displ’ goes up, both ‘hwy’ and ‘cty’ tend to decrease. This makes sense, since larger engines typically consume more fuel. The varying linewidths make the plot look chaotic, so linewidth is probably not the best way to represent a third variable here.
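Since the answer refers to a line graph, the plot was presumably drawn with geom_line(); that is an assumption, as the chunk is not shown. Points drawn with geom_point() have no linewidth aesthetic, so the mapping only has a visible effect on line-based geoms. A sketch with illustrative object names:

```r
library(ggplot2)   # the mpg data frame ships with ggplot2

# linewidth varies along the line according to cty
p_line <- ggplot(mpg, aes(x = displ, y = hwy, linewidth = cty)) +
  geom_line()
p_line

# Numerical check of the downward trends described above
r_hwy <- cor(mpg$displ, mpg$hwy)
r_cty <- cor(mpg$displ, mpg$cty)
c(hwy = r_hwy, cty = r_cty)
```

Negative correlations for both ‘hwy’ and ‘cty’ are consistent with the trend read off the plot.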

02/10/2025
