{r global_options, include=FALSE} #DO NOT EDIT THIS CHUNK OR ANYTHING ABOVE IT! knitr::opts_chunk$set(echo = TRUE, eval = TRUE, fig.align = "center", warning = F, message = F, tidy=T, tidy.opts=list(width.cutoff=50))

Henry Satel hjs2252

Please submit as a knitted HTML file on Canvas before the due date.

For all questions, include the R commands/functions that you used to find your answer. Answers without supporting code will not receive credit.

Review of how to submit this assignment

All homework assignments will be completed using R Markdown. These .Rmd files consist of >text/syntax (formatted using Markdown) alongside embedded R code. When you have completed the assignment (by adding R code inside codeblocks and supporting text outside of the codeblocks), create your document as follows:

Click the arrow next to the “Knit” button (above)

Choose “Knit to HTML”

Go to Files pane and put checkmark next to the correct HTML file

Click on the blue gear icon (“More”) and click Export

Download the file and then upload to Canvas

Question 1 (2 pts)

1.1 (1 pt)

Load the ggplot2 package, which contains the mpg dataset. It contains data from 38 popular car models in 1999 and in 2008. Using ggplot, make a histogram of `hwy` (EPA highway miles-per-gallon). Then, in a separate plot, use `stat_ecdf()` to make a cumulative density plot of the same variable. The bimodal distribution is most clear in the histogram, but how can you see it in the cumulative density plot? Eyeballing the cumulative density plot, approximately what proportion of cars have a highway mpg of 20 or less? (if you want, confirm your suspicions by using mean on a logical vector).

```{R} library(ggplot2)

ggplot(mpg, aes(x = hwy)) + geom_histogram(binwidth = 2, fill = “blue”, color = “black”) + labs(title = “Histogram of Highway MPG”, x = “Highway MPG”, y = “Count”)

ggplot(mpg, aes(x = hwy)) + stat_ecdf(geom = “step”, color = “red”) + labs(title = “Cumulative Density Plot of Highway MPG”, x = “Highway MPG”, y = “Cumulative Probability”)

mean(mpg$hwy <= 20)


*The cumulative density plot shows a bimodal distribution shown in the sections where the curve is steep and there are clusters of data points. From the visualization  around 38% of cars achieve a highway MPG of 20 or lower a finding that aligns with the calculated mean of the logical condition mpg\$hwy \<= 20.*

### 1.2 (1 pt)

##### Now then, make a density plot of `cty` (city miles-per-gallon) and fill it by `class` of car. Add an alpha value of .8 to increase transparency. Eyeballing the plot, which two classes have the most overlap in their distributions? Which class has the least variation in `cty` mpg and which class has the most?

```{R}
ggplot(mpg, aes(x = cty, fill = class)) +
  geom_density(alpha = 0.8) +
  labs(title = "Density Plot of City MPG by Car Class", x = "City MPG", y = "Density")

It seems that the SUVs and pickups are the most similar in their distributions. The two seater class shows the least variation in city MPG, and the subcompact class displays the greatest variability.

Question 2 (4 pts)

2.1 (1 pt)

Assume diastolic blood pressure is normally distributed in a certain population, with a mean of 67 mmHg and a standard deviation 13 mmHg. Use `pnorm()` to compute the probability of obtaining a value less than 60 mmHg from this distribution. Use `pnorm()` again to compute the probability of obtaining a value greater than 85 mmHg from this distribution. Finally, use `qnorm()` to find the 98th percentile of this distribution (the value which cuts of 98% of the distribution below it).

```{R} # probability of diastolic blood pressure < 60 mmHg pnorm(60, mean = 67, sd = 13)

probability of diastolic blood pressure > 85 mmHg

1 - pnorm(85, mean = 67, sd = 13)

98th percentile of the distribution

qnorm(0.98, mean = 67, sd = 13)


*There is around a 30% chance of seeing a diastolic blood pressure below 60 mmHg, while the probability of a value being greater than 85 mmHg is around 8.3%. The 98th percentile of this distribution is approximately 93.7 mmHg.*

### 2.2 (1 pts)

##### Below, we set the seed to 322 with `set.seed()` so our random draws match. With `rnorm()` take a sample of size 10000 (ten thousand) draws from a normal distribution with a mean of 67 and a standard deviation of 13. Using `mean()` on a logical vector, what proportion of the total draws are less than 60? What proportion are greater than 85? Using `quantile()`, What value in your sample represents the 98th percentile?

```{R}
set.seed(322) #leave this line alone

samp <- rnorm(n = 10000, mean = 67, sd = 13)

# proportion of draws less than 60
mean(samp < 60)

# proportion of draws greater than 85
mean(samp > 85)

# 98th percentile of the sample
quantile(samp, 0.98)

About 30% of the values are below 60, while around 8.6% of the values exceed 85, which matches the earlier pnorm() calculations. The 98th percentile in the sample is 93.9 mmHg the same as the qnorm(0.98, mean = 67, sd = 13) result.

2.3 (1 pts)

Hopefully you see the correspondence between question 2.1 and question 2.2 Why are your answers to question 2.2 close to those in question 2.1 but not exactly the same?

The answers in 2.2 are similar to those in 2.1 because both use the same normal distribution parameters (mean = 67, sd = 13), but 2.1 relies on exact theoretical probabilities from pnorm() and qnorm(). However 2.2 uses a finite random sample from rnorm(), introducing slight variation, and a larger sample would reduce this difference.

2.4 (1 pts)

Make your sample from question 2 into a dataframe and, using ggplot2, make a histogram of the sample by mapping the sample to the x-axis in the main `ggplot()` function and then adding the code `geom_histogram(aes(y=..density..))`. Then, overlay an density plot by adding `geom_density()`. Using `geom_vline(xintercept=)`, add solid vertical lines corresponding to the 2.5th and the 97.5th percentile of the sample (i.e., using `quantile`). Next overlay an actual normal distribution with a mean of 67 and a standard deviation of 13 using `geom_line()` with `dnorm` and `stat="function"`. Make it a different color to differentiate it from the empirical/sample density. Finally, using `geom_vline(xintercept=)`, add dashed vertical lines of the same new color corresponding to the 2.5th and the 97.5th percentile of the actual normal distribution (i.e., using `qnorm`).

```{R} library(ggplot2) #df <- data.frame(samp) #makes your sample a column of a dataframe called df df <- data.frame(samp)

empirical_2.5 <- quantile(df$samp, 0.025) empirical_97.5 <- quantile(df$samp, 0.975)

theoretical_2.5 <- qnorm(0.025, mean = 67, sd = 13) theoretical_97.5 <- qnorm(0.975, mean = 67, sd = 13)

ggplot(df, aes(x = samp)) + geom_histogram(aes(y = ..density..), bins = 30, fill = “lightblue”, color = “black”, alpha = 0.7) + geom_density(color = “red”, linewidth = 1) + geom_vline(xintercept = empirical_2.5, color = “red”, linetype = “solid”) + geom_vline(xintercept = empirical_97.5, color = “red”, linetype = “solid”) + stat_function(fun = dnorm, args = list(mean = 67, sd = 13), color = “blue”, linewidth = 1) + geom_vline(xintercept = theoretical_2.5, color = “blue”, linetype = “dashed”) + geom_vline(xintercept = theoretical_97.5, color = “blue”, linetype = “dashed”) + labs(title = “Histogram of Sample with Density and Normal Curve”, x = “Diastolic Blood Pressure (mmHg)”, y = “Density”) + theme_minimal()


## Question 3 (3 pts)

### 3.1 (1 pt)

##### Using `dplyr` functions (filter, summarize; do not use any [] or \$) and the `quakes` dataset, What is the mean of the variable `mag` when `depth` is *greater than* the median depth? What is the mean of the variable `mag` when `depth` is *less than* the median depth?

```{R}
install.packages("dplyr")

library(dplyr)
# compute the median depth
median_depth <- median(quakes$depth)

quakes %>%
  mutate(depth_category = ifelse(depth > median_depth, "Greater", "Less")) %>%
  group_by(depth_category) %>%
  summarize(mean_mag = mean(mag, na.rm = TRUE))

3.2 (1 pt)

Using `dplyr` and the `quakes` dataset, create a new variable called `depth_m` that gives `depth` in meters rather than kilometers. Use `mutate()` only once to achieve this! Do not use any [] or $.

```{R}

quakes <- quakes %>% mutate(depth_m = depth * 1000)


### 3.3 (1 pt)

##### Finally, using ggplot, take the quakes dataset and make a scatterplot of `long` (x-axis) and `lat` (y-axis) and color the points by `depth`. Add `coord_map()` to scale axes with mercator projection (if this doesn't work, use coord_fixed to make x- and y-axis scales the same). Optionally, add the extra code below to overlay world map data for this region using the code provided below.

```{R}
install.packages("tidyverse")
install.packages("maps")

library(tidyverse)
library(maps) #install.packages("maps") if not on the servers
world <- map_data("world")

ggplot(quakes) +
  geom_point(aes(long, lat, color = depth)) + 
  coord_fixed(ratio = 1) +  # Use coord_fixed to ensure equal scaling of axes (Mercator effect)
  geom_polygon(aes(long, lat, group = group), data = world, fill = "red") +  # Overlay world map
  xlim(150, 200) +  # Set x-axis limits
  ylim(-50, 0) +  # Set y-axis limits
  labs(title = "Scatterplot of Earthquake Locations",
       x = "Longitude",
       y = "Latitude") +
  theme_minimal()
#if you want to overlay a map, uncomment/add the following code to your plot:
#+geom_polygon(aes(long, lat, group=group), data = world, fill = "red") + xlim(150,200)+ylim(-50,0)

Question 4 (4 pts)

4.1 (1 pt)

All subsequent code will be done using the tidyverse (with a focus on `dplyr`), so we need to load this package. We also want to look at the `penguins` dataset which is inside the palmerpenguins package, so we can grab this as well

```{r message=FALSE} library(tidyverse)

install.packages(“palmerpenguins”) library(palmerpenguins)

penguins <- palmerpenguins::penguins


###### Using the function `filter()`, pick all the rows/observaions in the `penguins` dataset from the year 2007 and store the result as a new object called `penguins_2007`. Compare the number of observations/rows in the original `penguins` dataset with your new `penguins_2007` dataset in words.

```{R}
# filter for observations from 2007
penguins_2007 <- penguins %>% filter(year == 2007)

# compare the number of observations/rows
nrow(penguins)  # total rows in the original dataset
nrow(penguins_2007)  # rows from 2007

The original dataset consists of 344 observations while the 2007 data is a subset of this that contains 110 rows.

4.2 (1 pt)

Return all the rows in the `penguins_2007` dataset where `bill_length_mm` is between 45 and 55, (doesn’t matter if inclusive of 45 and 55 or not, since no observations have those exact values).

```{R}

penguins_2007_filtered <- penguins_2007 %>% filter(bill_length_mm > 45 & bill_length_mm < 55)

penguins_2007_filtered


### 4.3 (1 pt)

###### Are there any cases in the `penguins_2007` dataset for which the ratio of `bill_length_mm` to `bill_depth_mm` exceeds 3.5? For now, use only `filter()` to find out. If so, for which species of penguins is this true?

```{R}

penguins_2007_ratio <- penguins_2007 %>% filter(bill_length_mm / bill_depth_mm > 3.5)

penguins_2007_ratio

In the penguins_2007 dataset two observations have a bill_length_mm to bill_depth_mm ratio that is bigger than 3.5. Both belong to the Gentoo species and are recorded on Biscoe Island.

4.4 (1 pt)

Take your `penguins_2007` dataset and using `select()`, drop/delete the column `year`, since it isn’t necessary. Overwrite the `penguins_2007` dataset so it no longer contains that column.

```{R} penguins_2007 <- penguins_2007 %>% select(-year)

head(penguins_2007)


## Question 5 (2 pts)

### 5.1 (1 pt)

###### Using the function `mutate()`, take `penguins_2007` and create a new data column that contains the ratio of `bill_length_mm` to `bill_depth_mm` (call it `bill_ratio`). Overwrite `penguins_2007` so it contains this new column

```{R}
penguins_2007 <- penguins_2007 %>% mutate(bill_ratio = bill_length_mm / bill_depth_mm)

head(penguins_2007)

5.2 (1 pts)

Take the `penguins_2007` data and, using `group_by` along with either `arrange` and `slice` or `slice_min`, for each species find the three penguins with the shortest bill length. Of those 9 penguins, how many have the value female for the sex variable?

```{R} table(shortest_bill_penguins$sex, useNA = “always”)

shortest_bill_penguins <- penguins_2007 %>% group_by(species) %>% slice_min(order_by = bill_length_mm, n = 3)

female_count <- sum(shortest_bill_penguins$sex == “female”, na.rm = TRUE)

shortest_bill_penguins female_count


*From the nine penguins with the shortest bill lengths seven of them are female. This suggests that females are more likely to have shorter bills than males in the smallest individuals of each species.*

## Question 6 (2 pts)

### 6.1 (1 pts)

Using `penguins_2007`, calculate the mean and standard deviation of `bill_ratio` each species using `group_by` and `summarize`. Drop NAs from `bill_ratio` for these computations (e.g., using the argument na.rm=T) so you have values for each species. Which species has the greatest average `bill_ratio`?

```{R}
bill_ratio_stats <- penguins_2007 %>%
  group_by(species) %>%
  summarize(
    mean_bill_ratio = mean(bill_ratio, na.rm = TRUE),
    sd_bill_ratio = sd(bill_ratio, na.rm = TRUE)
  )

bill_ratio_stats

Gentoo penguins have the highest average bill_ratio with a mean of 3.20. This indicates that their bill length to depth ratio is greater than that of Adelie and Chinstrap penguins.

6.2 (1 pts)

With penguins_2007, using either summarize(n()) or using count, report the number of observations for each species-island combination (note that you’ll need to group by both variables!). Which species appears on all three islands?

```{R} species_island_counts <- penguins_2007 %>% group_by(species, island) %>% count()

species_island_counts


*Adelie penguins are the only species present on all three islands with obsevations  from Biscoe, Dream, and Torgersen. Chinstrap penguins are exclusive to Dream and Gentoo penguins are only found on Biscoe.*

## Question 7 (3 pts)

### 7.1 (1 pt)

Take the `penguins_2007` data set you have created and using `ggplot` create a single plot showing the distribution of `body_mass_g` for male and female penguins separately, faceted by species (use facet_grid to give each species its own row). Which species has the most NAs for `sex`? Which species shows the least sexual dimorphism (i.e., which shows the greatest overlap of male/female size distributions)?

```{R warning=F}

ggplot(penguins_2007, aes(x = body_mass_g, fill = sex)) +
  geom_density(alpha = 0.5) + 
  facet_grid(rows = vars(species)) + 
  theme_minimal() +
  labs(title = "Distribution of Body Mass by Sex and Species",
       x = "Body Mass (g)", y = "Density", fill = "Sex")

penguins_2007 %>%
  group_by(species) %>%
  summarize(missing_sex = sum(is.na(sex)))

Adelie has the highest number of missing values for sex, while Chinstrap exhibits the least sexual dimorphism. Adelie also shows minimal dimorphism due to the overlap.

7.2 (1 pt)

Now, take `penguins_2007` and, using `ggplot`, create a scatterplot of `flipper_length_mm` (x-axis) against the `bill_ratio` variable. Does it look like there is a relationship between length-to-depth ratio and the lengths? To see more clearly, add `geom_smooth(method="lm")` to the plot.

```{R} ggplot(penguins_2007, aes(x = flipper_length_mm, y = bill_ratio)) + geom_point(alpha = 0.6) + geom_smooth(method = “lm”, color = “blue”) + theme_minimal() + labs(title = “Flipper Length vs. Bill Ratio”, x = “Flipper Length (mm)”, y = “Bill Length-to-Depth Ratio”)


*A positive relationship shows that penguins with longer flippers tend to have a higher bill length-to-depth ratio.*

### 7.3 (1 pt)

###### Does your answer change when you consider each species individually instead of all together? To see more clearly, duplicate the plot from before but additionally, in the main `ggplot()` function, map `species` to color so each species gets its own color and regression trendline.

###### Compare this plot with the previous one (in 4.2) and discuss whether the relationship between flipper length and bill length-to-depth ratio changes when you look at it overall versus within each species.

```{R}
ggplot(penguins_2007, aes(x = flipper_length_mm, y = bill_ratio, color = species)) +
  geom_point(alpha = 0.6) + 
  geom_smooth(method = "lm", se = FALSE) + 
  theme_minimal() +
  labs(title = "Flipper Length vs. Bill Ratio by Species",
       x = "Flipper Length (mm)", 
       y = "Bill Length-to-Depth Ratio", 
       color = "Species")

When looking at all species together a positive linear trend is observed. However looking at each species individually reveals different patterns, Gentoo shows a positive linear relationship, while both Adelie and Chinstrap show a negative linear trends.

{R, echo=F} ## DO NOT DELETE OR MODIFY THIS CHUNK: IT MUST BE PRESENT TO RECEIVE CREDIT FOR THE ASSIGNMENT sessionInfo(); Sys.time(); Sys.info()

HW 2

SDS322E

`r format(Sys.Date(), '%B %d, %Y')`

Henry Satel hjs2252

Review of how to submit this assignment

Question 1 (2 pts)

1.1 (1 pt)

Question 2 (4 pts)

2.1 (1 pt)

probability of diastolic blood pressure > 85 mmHg

98th percentile of the distribution

2.3 (1 pts)

Hopefully you see the correspondence between question 2.1 and question 2.2 Why are your answers to question 2.2 close to those in question 2.1 but not exactly the same?

2.4 (1 pts)

3.2 (1 pt)

Using `dplyr` and the `quakes` dataset, create a new variable called `depth_m` that gives `depth` in meters rather than kilometers. Use `mutate()` only once to achieve this! Do not use any [] or $.

Question 4 (4 pts)

4.1 (1 pt)

All subsequent code will be done using the tidyverse (with a focus on `dplyr`), so we need to load this package. We also want to look at the `penguins` dataset which is inside the palmerpenguins package, so we can grab this as well

4.2 (1 pt)

Return all the rows in the `penguins_2007` dataset where `bill_length_mm` is between 45 and 55, (doesn’t matter if inclusive of 45 and 55 or not, since no observations have those exact values).

4.4 (1 pt)

Take your `penguins_2007` dataset and using `select()`, drop/delete the column `year`, since it isn’t necessary. Overwrite the `penguins_2007` dataset so it no longer contains that column.

5.2 (1 pts)

Take the `penguins_2007` data and, using `group_by` along with either `arrange` and `slice` or `slice_min`, for each species find the three penguins with the shortest bill length. Of those 9 penguins, how many have the value female for the sex variable?

6.2 (1 pts)

7.2 (1 pt)

Now, take `penguins_2007` and, using `ggplot`, create a scatterplot of `flipper_length_mm` (x-axis) against the `bill_ratio` variable. Does it look like there is a relationship between length-to-depth ratio and the lengths? To see more clearly, add `geom_smooth(method="lm")` to the plot.

HW 2

SDS322E

r format(Sys.Date(), '%B %d, %Y')

Henry Satel hjs2252

Review of how to submit this assignment

Question 1 (2 pts)

1.1 (1 pt)

Question 2 (4 pts)

2.1 (1 pt)

probability of diastolic blood pressure > 85 mmHg

98th percentile of the distribution

2.3 (1 pts)

Hopefully you see the correspondence between question 2.1 and question 2.2 Why are your answers to question 2.2 close to those in question 2.1 but not exactly the same?

2.4 (1 pts)

3.2 (1 pt)

Using dplyr and the quakes dataset, create a new variable called depth_m that gives depth in meters rather than kilometers. Use mutate() only once to achieve this! Do not use any [] or $.

Question 4 (4 pts)

4.1 (1 pt)

All subsequent code will be done using the tidyverse (with a focus on dplyr), so we need to load this package. We also want to look at the penguins dataset which is inside the palmerpenguins package, so we can grab this as well

4.2 (1 pt)

Return all the rows in the penguins_2007 dataset where bill_length_mm is between 45 and 55, (doesn’t matter if inclusive of 45 and 55 or not, since no observations have those exact values).

4.4 (1 pt)

Take your penguins_2007 dataset and using select(), drop/delete the column year, since it isn’t necessary. Overwrite the penguins_2007 dataset so it no longer contains that column.

5.2 (1 pts)

Take the penguins_2007 data and, using group_by along with either arrange and slice or slice_min, for each species find the three penguins with the shortest bill length. Of those 9 penguins, how many have the value female for the sex variable?

6.2 (1 pts)

7.2 (1 pt)

Now, take penguins_2007 and, using ggplot, create a scatterplot of flipper_length_mm (x-axis) against the bill_ratio variable. Does it look like there is a relationship between length-to-depth ratio and the lengths? To see more clearly, add geom_smooth(method="lm") to the plot.

`r format(Sys.Date(), '%B %d, %Y')`

Using `dplyr` and the `quakes` dataset, create a new variable called `depth_m` that gives `depth` in meters rather than kilometers. Use `mutate()` only once to achieve this! Do not use any [] or $.

All subsequent code will be done using the tidyverse (with a focus on `dplyr`), so we need to load this package. We also want to look at the `penguins` dataset which is inside the palmerpenguins package, so we can grab this as well

Return all the rows in the `penguins_2007` dataset where `bill_length_mm` is between 45 and 55, (doesn’t matter if inclusive of 45 and 55 or not, since no observations have those exact values).

Take your `penguins_2007` dataset and using `select()`, drop/delete the column `year`, since it isn’t necessary. Overwrite the `penguins_2007` dataset so it no longer contains that column.

Take the `penguins_2007` data and, using `group_by` along with either `arrange` and `slice` or `slice_min`, for each species find the three penguins with the shortest bill length. Of those 9 penguins, how many have the value female for the sex variable?

Now, take `penguins_2007` and, using `ggplot`, create a scatterplot of `flipper_length_mm` (x-axis) against the `bill_ratio` variable. Does it look like there is a relationship between length-to-depth ratio and the lengths? To see more clearly, add `geom_smooth(method="lm")` to the plot.