r format(Sys.Date(), '%B %d, %Y'){r global_options, include=FALSE} #DO NOT EDIT THIS CHUNK OR ANYTHING ABOVE IT! knitr::opts_chunk$set(echo = TRUE, eval = TRUE, fig.align = "center", warning = F, message = F, tidy=T, tidy.opts=list(width.cutoff=50))
Please submit as a knitted HTML file on Canvas before the due date.
For all questions, include the R commands/functions that you used to find your answer. Answers without supporting code will not receive credit.
Review of how to submit this assignment
All homework assignments will be completed using R Markdown. These
.Rmdfiles consist of >text/syntax (formatted using Markdown) alongside embedded R code. When you have completed the assignment (by adding R code inside codeblocks and supporting text outside of the codeblocks), create your document as follows:
- Click the arrow next to the “Knit” button (above)
- Choose “Knit to HTML”
- Go to Files pane and put checkmark next to the correct HTML file
- Click on the blue gear icon (“More”) and click Export
- Download the file and then upload to Canvas
hwy (EPA highway
miles-per-gallon). Then, in a separate plot, use
stat_ecdf() to make a cumulative density plot of the same
variable. The bimodal distribution is most clear in the histogram, but
how can you see it in the cumulative density plot? Eyeballing the
cumulative density plot, approximately what proportion of cars have a
highway mpg of 20 or less? (if you want, confirm your suspicions by
using mean on a logical vector).```{R} library(ggplot2)
ggplot(mpg, aes(x = hwy)) + geom_histogram(binwidth = 2, fill = “blue”, color = “black”) + labs(title = “Histogram of Highway MPG”, x = “Highway MPG”, y = “Count”)
ggplot(mpg, aes(x = hwy)) + stat_ecdf(geom = “step”, color = “red”) + labs(title = “Cumulative Density Plot of Highway MPG”, x = “Highway MPG”, y = “Cumulative Probability”)
mean(mpg$hwy <= 20)
*The cumulative density plot shows a bimodal distribution shown in the sections where the curve is steep and there are clusters of data points. From the visualization around 38% of cars achieve a highway MPG of 20 or lower a finding that aligns with the calculated mean of the logical condition mpg\$hwy \<= 20.*
### 1.2 (1 pt)
##### Now then, make a density plot of `cty` (city miles-per-gallon) and fill it by `class` of car. Add an alpha value of .8 to increase transparency. Eyeballing the plot, which two classes have the most overlap in their distributions? Which class has the least variation in `cty` mpg and which class has the most?
```{R}
ggplot(mpg, aes(x = cty, fill = class)) +
geom_density(alpha = 0.8) +
labs(title = "Density Plot of City MPG by Car Class", x = "City MPG", y = "Density")
It seems that the SUVs and pickups are the most similar in their distributions. The two seater class shows the least variation in city MPG, and the subcompact class displays the greatest variability.
pnorm() to compute the probability of obtaining a value
less than 60 mmHg from this distribution. Use pnorm() again
to compute the probability of obtaining a value greater than 85 mmHg
from this distribution. Finally, use qnorm() to find the
98th percentile of this distribution (the value which cuts of 98% of the
distribution below it).```{R} # probability of diastolic blood pressure < 60 mmHg pnorm(60, mean = 67, sd = 13)
1 - pnorm(85, mean = 67, sd = 13)
qnorm(0.98, mean = 67, sd = 13)
*There is around a 30% chance of seeing a diastolic blood pressure below 60 mmHg, while the probability of a value being greater than 85 mmHg is around 8.3%. The 98th percentile of this distribution is approximately 93.7 mmHg.*
### 2.2 (1 pts)
##### Below, we set the seed to 322 with `set.seed()` so our random draws match. With `rnorm()` take a sample of size 10000 (ten thousand) draws from a normal distribution with a mean of 67 and a standard deviation of 13. Using `mean()` on a logical vector, what proportion of the total draws are less than 60? What proportion are greater than 85? Using `quantile()`, What value in your sample represents the 98th percentile?
```{R}
set.seed(322) #leave this line alone
samp <- rnorm(n = 10000, mean = 67, sd = 13)
# proportion of draws less than 60
mean(samp < 60)
# proportion of draws greater than 85
mean(samp > 85)
# 98th percentile of the sample
quantile(samp, 0.98)
About 30% of the values are below 60, while around 8.6% of the values exceed 85, which matches the earlier pnorm() calculations. The 98th percentile in the sample is 93.9 mmHg the same as the qnorm(0.98, mean = 67, sd = 13) result.
The answers in 2.2 are similar to those in 2.1 because both use the same normal distribution parameters (mean = 67, sd = 13), but 2.1 relies on exact theoretical probabilities from pnorm() and qnorm(). However 2.2 uses a finite random sample from rnorm(), introducing slight variation, and a larger sample would reduce this difference.
ggplot() function and then adding the
code geom_histogram(aes(y=..density..)). Then, overlay an
density plot by adding geom_density(). Using
geom_vline(xintercept=), add solid vertical lines
corresponding to the 2.5th and the 97.5th percentile of the sample
(i.e., using quantile). Next overlay an actual normal
distribution with a mean of 67 and a standard deviation of 13 using
geom_line() with dnorm and
stat="function". Make it a different color to differentiate
it from the empirical/sample density. Finally, using
geom_vline(xintercept=), add dashed vertical lines of the
same new color corresponding to the 2.5th and the 97.5th percentile of
the actual normal distribution (i.e., using qnorm).```{R} library(ggplot2) #df <- data.frame(samp) #makes your sample a column of a dataframe called df df <- data.frame(samp)
empirical_2.5 <- quantile(df\(samp, 0.025) empirical_97.5 <- quantile(df\)samp, 0.975)
theoretical_2.5 <- qnorm(0.025, mean = 67, sd = 13) theoretical_97.5 <- qnorm(0.975, mean = 67, sd = 13)
ggplot(df, aes(x = samp)) + geom_histogram(aes(y = ..density..), bins = 30, fill = “lightblue”, color = “black”, alpha = 0.7) + geom_density(color = “red”, linewidth = 1) + geom_vline(xintercept = empirical_2.5, color = “red”, linetype = “solid”) + geom_vline(xintercept = empirical_97.5, color = “red”, linetype = “solid”) + stat_function(fun = dnorm, args = list(mean = 67, sd = 13), color = “blue”, linewidth = 1) + geom_vline(xintercept = theoretical_2.5, color = “blue”, linetype = “dashed”) + geom_vline(xintercept = theoretical_97.5, color = “blue”, linetype = “dashed”) + labs(title = “Histogram of Sample with Density and Normal Curve”, x = “Diastolic Blood Pressure (mmHg)”, y = “Density”) + theme_minimal()
## Question 3 (3 pts)
### 3.1 (1 pt)
##### Using `dplyr` functions (filter, summarize; do not use any [] or \$) and the `quakes` dataset, What is the mean of the variable `mag` when `depth` is *greater than* the median depth? What is the mean of the variable `mag` when `depth` is *less than* the median depth?
```{R}
install.packages("dplyr")
library(dplyr)
# compute the median depth
median_depth <- median(quakes$depth)
quakes %>%
mutate(depth_category = ifelse(depth > median_depth, "Greater", "Less")) %>%
group_by(depth_category) %>%
summarize(mean_mag = mean(mag, na.rm = TRUE))
dplyr and the quakes dataset, create
a new variable called depth_m that gives depth
in meters rather than kilometers. Use mutate() only
once to achieve this! Do not use any [] or $.```{R}
quakes <- quakes %>% mutate(depth_m = depth * 1000)
### 3.3 (1 pt)
##### Finally, using ggplot, take the quakes dataset and make a scatterplot of `long` (x-axis) and `lat` (y-axis) and color the points by `depth`. Add `coord_map()` to scale axes with mercator projection (if this doesn't work, use coord_fixed to make x- and y-axis scales the same). Optionally, add the extra code below to overlay world map data for this region using the code provided below.
```{R}
install.packages("tidyverse")
install.packages("maps")
library(tidyverse)
library(maps) #install.packages("maps") if not on the servers
world <- map_data("world")
ggplot(quakes) +
geom_point(aes(long, lat, color = depth)) +
coord_fixed(ratio = 1) + # Use coord_fixed to ensure equal scaling of axes (Mercator effect)
geom_polygon(aes(long, lat, group = group), data = world, fill = "red") + # Overlay world map
xlim(150, 200) + # Set x-axis limits
ylim(-50, 0) + # Set y-axis limits
labs(title = "Scatterplot of Earthquake Locations",
x = "Longitude",
y = "Latitude") +
theme_minimal()
#if you want to overlay a map, uncomment/add the following code to your plot:
#+geom_polygon(aes(long, lat, group=group), data = world, fill = "red") + xlim(150,200)+ylim(-50,0)
dplyr), so we need to load this package. We also want to
look at the penguins dataset which is inside the
palmerpenguins package, so we can grab this as well```{r message=FALSE} library(tidyverse)
install.packages(“palmerpenguins”) library(palmerpenguins)
penguins <- palmerpenguins::penguins
###### Using the function `filter()`, pick all the rows/observaions in the `penguins` dataset from the year 2007 and store the result as a new object called `penguins_2007`. Compare the number of observations/rows in the original `penguins` dataset with your new `penguins_2007` dataset in words.
```{R}
# filter for observations from 2007
penguins_2007 <- penguins %>% filter(year == 2007)
# compare the number of observations/rows
nrow(penguins) # total rows in the original dataset
nrow(penguins_2007) # rows from 2007
The original dataset consists of 344 observations while the 2007 data is a subset of this that contains 110 rows.
penguins_2007 dataset where
bill_length_mm is between 45 and 55, (doesn’t matter if
inclusive of 45 and 55 or not, since no observations have those exact
values).```{R}
penguins_2007_filtered <- penguins_2007 %>% filter(bill_length_mm > 45 & bill_length_mm < 55)
penguins_2007_filtered
### 4.3 (1 pt)
###### Are there any cases in the `penguins_2007` dataset for which the ratio of `bill_length_mm` to `bill_depth_mm` exceeds 3.5? For now, use only `filter()` to find out. If so, for which species of penguins is this true?
```{R}
penguins_2007_ratio <- penguins_2007 %>% filter(bill_length_mm / bill_depth_mm > 3.5)
penguins_2007_ratio
In the penguins_2007 dataset two observations have a bill_length_mm to bill_depth_mm ratio that is bigger than 3.5. Both belong to the Gentoo species and are recorded on Biscoe Island.
penguins_2007 dataset and using
select(), drop/delete the column year, since
it isn’t necessary. Overwrite the penguins_2007 dataset so
it no longer contains that column.```{R} penguins_2007 <- penguins_2007 %>% select(-year)
head(penguins_2007)
## Question 5 (2 pts)
### 5.1 (1 pt)
###### Using the function `mutate()`, take `penguins_2007` and create a new data column that contains the ratio of `bill_length_mm` to `bill_depth_mm` (call it `bill_ratio`). Overwrite `penguins_2007` so it contains this new column
```{R}
penguins_2007 <- penguins_2007 %>% mutate(bill_ratio = bill_length_mm / bill_depth_mm)
head(penguins_2007)
penguins_2007 data and, using
group_by along with either arrange and
slice or slice_min, for each species
find the three penguins with the shortest bill length. Of those 9
penguins, how many have the value female for the sex variable?```{R} table(shortest_bill_penguins$sex, useNA = “always”)
shortest_bill_penguins <- penguins_2007 %>% group_by(species) %>% slice_min(order_by = bill_length_mm, n = 3)
female_count <- sum(shortest_bill_penguins$sex == “female”, na.rm = TRUE)
shortest_bill_penguins female_count
*From the nine penguins with the shortest bill lengths seven of them are female. This suggests that females are more likely to have shorter bills than males in the smallest individuals of each species.*
## Question 6 (2 pts)
### 6.1 (1 pts)
Using `penguins_2007`, calculate the mean and standard deviation of `bill_ratio` each species using `group_by` and `summarize`. Drop NAs from `bill_ratio` for these computations (e.g., using the argument na.rm=T) so you have values for each species. Which species has the greatest average `bill_ratio`?
```{R}
bill_ratio_stats <- penguins_2007 %>%
group_by(species) %>%
summarize(
mean_bill_ratio = mean(bill_ratio, na.rm = TRUE),
sd_bill_ratio = sd(bill_ratio, na.rm = TRUE)
)
bill_ratio_stats
Gentoo penguins have the highest average bill_ratio with a mean of 3.20. This indicates that their bill length to depth ratio is greater than that of Adelie and Chinstrap penguins.
With penguins_2007, using either
summarize(n()) or using count, report the
number of observations for each species-island combination (note that
you’ll need to group by both variables!). Which species appears on all
three islands?
```{R} species_island_counts <- penguins_2007 %>% group_by(species, island) %>% count()
species_island_counts
*Adelie penguins are the only species present on all three islands with obsevations from Biscoe, Dream, and Torgersen. Chinstrap penguins are exclusive to Dream and Gentoo penguins are only found on Biscoe.*
## Question 7 (3 pts)
### 7.1 (1 pt)
Take the `penguins_2007` data set you have created and using `ggplot` create a single plot showing the distribution of `body_mass_g` for male and female penguins separately, faceted by species (use facet_grid to give each species its own row). Which species has the most NAs for `sex`? Which species shows the least sexual dimorphism (i.e., which shows the greatest overlap of male/female size distributions)?
```{R warning=F}
ggplot(penguins_2007, aes(x = body_mass_g, fill = sex)) +
geom_density(alpha = 0.5) +
facet_grid(rows = vars(species)) +
theme_minimal() +
labs(title = "Distribution of Body Mass by Sex and Species",
x = "Body Mass (g)", y = "Density", fill = "Sex")
penguins_2007 %>%
group_by(species) %>%
summarize(missing_sex = sum(is.na(sex)))
Adelie has the highest number of missing values for sex, while Chinstrap exhibits the least sexual dimorphism. Adelie also shows minimal dimorphism due to the overlap.
penguins_2007 and, using ggplot,
create a scatterplot of flipper_length_mm (x-axis) against
the bill_ratio variable. Does it look like there is a
relationship between length-to-depth ratio and the lengths? To see more
clearly, add geom_smooth(method="lm") to the plot.```{R} ggplot(penguins_2007, aes(x = flipper_length_mm, y = bill_ratio)) + geom_point(alpha = 0.6) + geom_smooth(method = “lm”, color = “blue”) + theme_minimal() + labs(title = “Flipper Length vs. Bill Ratio”, x = “Flipper Length (mm)”, y = “Bill Length-to-Depth Ratio”)
*A positive relationship shows that penguins with longer flippers tend to have a higher bill length-to-depth ratio.*
### 7.3 (1 pt)
###### Does your answer change when you consider each species individually instead of all together? To see more clearly, duplicate the plot from before but additionally, in the main `ggplot()` function, map `species` to color so each species gets its own color and regression trendline.
###### Compare this plot with the previous one (in 4.2) and discuss whether the relationship between flipper length and bill length-to-depth ratio changes when you look at it overall versus within each species.
```{R}
ggplot(penguins_2007, aes(x = flipper_length_mm, y = bill_ratio, color = species)) +
geom_point(alpha = 0.6) +
geom_smooth(method = "lm", se = FALSE) +
theme_minimal() +
labs(title = "Flipper Length vs. Bill Ratio by Species",
x = "Flipper Length (mm)",
y = "Bill Length-to-Depth Ratio",
color = "Species")
When looking at all species together a positive linear trend is observed. However looking at each species individually reveals different patterns, Gentoo shows a positive linear relationship, while both Adelie and Chinstrap show a negative linear trends.
{R, echo=F} ## DO NOT DELETE OR MODIFY THIS CHUNK: IT MUST BE PRESENT TO RECEIVE CREDIT FOR THE ASSIGNMENT sessionInfo(); Sys.time(); Sys.info()