Homework 1

Question 0

a)

Add your name and the date to the header of the rmd.

b)

Add a floating table of contents to the HTML document. Check out the YAML header from previous files for an example of how to do this.

Question 1

The US Measles dataset contains yearly reports of measles prevalence in the US. The variables are: year, state, and prevalence.

a)

Read in the US Measles dataset (us_measles.csv).

library(tidyverse)

us_measles <- read_csv("data/us_measles.csv")

b)

Create a histogram of measles prevalence and comment on the shape of the distribution.

ggplot(data = us_measles) +
  geom_histogram(aes(x = prevalence)) +
  labs(title = "Measles Prevalence in the United States")

The distribution of measles prevalence in the United States is skewed to the right: most observations fall at the low end, with a long tail toward higher values.
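A visual read of skewness can be backed up by comparing the mean and the median, since a long upper tail typically pulls the mean above the median. The sketch below uses a small made-up vector (`toy_prevalence` is hypothetical and does not contain values from us_measles.csv).

```r
# Hypothetical values chosen to have a long right tail; not real measles data.
toy_prevalence <- c(1, 1, 2, 2, 3, 10, 50)

toy_mean <- mean(toy_prevalence)      # pulled upward by the large values
toy_median <- median(toy_prevalence)  # resistant to the large values

# TRUE is consistent with right skew
toy_mean > toy_median
```

With the real data, the same comparison would be run on us_measles$prevalence.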

c)

Calculate the mean and standard deviation of measles prevalence for years 1940 and 1990. Interpret these values in context.

# Rows for 1940 (year compared as a number rather than a string)
y1940 <- subset(us_measles, year == 1940)
mean_prev40 <- mean(y1940$prevalence)
sd_prev40 <- sd(y1940$prevalence)

# Rows for 1990
y1990 <- subset(us_measles, year == 1990)
mean_prev90 <- mean(y1990$prevalence)
sd_prev90 <- sd(y1990$prevalence)

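In 1940, the mean measles prevalence across states was about 0.00265 (SD 0.00265), so the spread between states was roughly as large as the mean itself.

In 1990, the mean prevalence was about 0.0000462 (SD 0.0000603), roughly 57 times lower than in 1940, and the state-to-state variability was also much smaller in absolute terms.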
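The two per-year calculations can also be written as one grouped summary. This is a minimal sketch under stated assumptions: `toy_measles` is a hypothetical stand-in with the same column names as us_measles.csv, and its values are made up.

```r
library(dplyr)

# Hypothetical stand-in with the same column names as us_measles.csv;
# the prevalence values are made up for illustration.
toy_measles <- tibble(
  year = c(1940, 1940, 1940, 1990, 1990, 1990),
  state = rep(c("A", "B", "C"), times = 2),
  prevalence = c(0.001, 0.002, 0.003, 0.00001, 0.00002, 0.00003)
)

# Keep the two years of interest, then compute one mean and SD per year.
year_summary <- toy_measles %>%
  filter(year %in% c(1940, 1990)) %>%
  group_by(year) %>%
  summarise(
    mean_prev = mean(prevalence),
    sd_prev = sd(prevalence)
  )

year_summary
```

Replacing toy_measles with us_measles would give one row per year, which avoids creating a separate subset for each year.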

d)

Create a scatterplot for measles prevalence by year in New York, and set all the points to the color of your choice (black isn’t allowed). Report in a sentence one pattern you notice with measles prevalence over time in New York.

new_york <- subset(us_measles, state == "New York")

ggplot(data = new_york) +
  geom_point(aes(x=year, y=prevalence), color = "blue") +
  labs(title = "Measles Prevalence by Year in New York")

The prevalence of measles in New York has been very low since 1970.

e)

Create a scatterplot for measles prevalence by year in another state of your choice (let us know in writing which state you chose), and set the points to the color of your choice (black isn’t allowed). How does this graph compare to the graph of New York’s prevalence by year?

alabama <- subset(us_measles, state == "Alabama")

ggplot(data = alabama) +
  geom_point(aes(x=year, y=prevalence), color = "blue") +
  labs(title = "Measles Prevalence by Year in Alabama")

I chose Alabama. As in New York, the prevalence of measles in Alabama has been very low since 1970.
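Comparing two states is easier when both appear in one figure with a shared scale. The sketch below is one possible approach using facet_wrap(); `toy_measles` is a hypothetical stand-in for us_measles, and its values are made up.

```r
library(ggplot2)

# Hypothetical stand-in for us_measles; the prevalence values are made up.
toy_measles <- data.frame(
  year = rep(c(1940, 1960, 1980, 2000), times = 2),
  state = rep(c("New York", "Alabama"), each = 4),
  prevalence = c(0.004, 0.003, 0.0001, 0, 0.002, 0.002, 0.0001, 0)
)

two_states <- subset(toy_measles, state %in% c("New York", "Alabama"))

# One panel per state; both panels share the same x and y scales by default.
state_plot <- ggplot(data = two_states) +
  geom_point(aes(x = year, y = prevalence), color = "blue") +
  facet_wrap(~ state) +
  labs(title = "Measles Prevalence by Year in New York and Alabama")

state_plot
```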

Question 2

The Tooth Growth dataset contains the results of an experiment conducted on 60 Guinea Pigs to evaluate the effect of vitamin C supplements on tooth growth. The variables are: Length (tooth length in cm), Supplement (supplement type, either VC-ascorbic acid or OJ-orange juice), and Dose (in milligrams/day).

a)

Read in Tooth Growth data (ToothGrowth.csv). Check the data carefully…

tooth_growth <- read_csv("data/ToothGrowth.csv", skip = 2)
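The effect of skip = 2 can be seen on a small file whose first two lines are not data. This is a sketch only: the file is written to a temporary path and its contents are hypothetical, since the skipped lines of ToothGrowth.csv are not shown in this document.

```r
library(readr)

# Write a small hypothetical file whose first two lines are not data.
demo_path <- tempfile(fileext = ".csv")
writeLines(
  c(
    "Some descriptive note",
    "Another line before the header",
    "Length,Supplement,Dose",
    "4.2,VC,0.5",
    "11.5,VC,0.5"
  ),
  demo_path
)

# skip = 2 drops the first two lines, so the third line becomes the header.
demo <- read_csv(demo_path, skip = 2, show_col_types = FALSE)

names(demo)
nrow(demo)
```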

b)

How many variables and observations are in this dataset? What is each variable’s type?

Note: This can be determined without code by using the RStudio Viewer. The functions nrow() and ncol() may also be useful for this question.

There are 60 observations and 3 variables in this dataset. Length is a numerical variable, Supplement is a character variable, and Dose is a numerical variable.
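The same answer can be checked in code with nrow(), ncol(), and class(). The sketch below uses R's built-in ToothGrowth data frame as a stand-in, renamed to match the CSV's column names. It assumes the CSV holds the same 60 rows as the built-in dataset, and `tg` is a hypothetical name.

```r
# Stand-in: R's built-in ToothGrowth data frame, renamed to match the CSV.
# Assumption: the CSV holds the same 60 rows as the built-in dataset.
tg <- data.frame(
  Length = ToothGrowth$len,
  Supplement = as.character(ToothGrowth$supp),
  Dose = ToothGrowth$dose,
  stringsAsFactors = FALSE
)

nrow(tg)           # number of observations
ncol(tg)           # number of variables
sapply(tg, class)  # type of each column
```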

c)

Calculate the mean and standard deviation of tooth length for each dosage and report them using in-line R code.

## Pull the vector of tooth lengths for the 0.5 dose. The same pattern is repeated below for the 1.0 and 2.0 doses.

length_dose_05 <- tooth_growth %>% 
  filter(Dose == 0.5) %>%
  pull(Length)

meanlength05 <- mean(length_dose_05)
sdlength05 <- sd(length_dose_05)

length_dose_1 <- tooth_growth %>% 
  filter(Dose == 1.0) %>%
  pull(Length)

meanlength1 <- mean(length_dose_1)
sdlength1 <- sd(length_dose_1)

length_dose_2 <- tooth_growth %>% 
  filter(Dose == 2.0) %>%
  pull(Length)

meanlength2 <- mean(length_dose_2)
sdlength2 <- sd(length_dose_2)

The mean (SD) tooth length is 10.6 (4.5) for a dose of 0.5, 19.7 (4.4) for a dose of 1.0, and 26.1 (3.8) for a dose of 2.0.
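The three filter/pull blocks can be collapsed into a single grouped summary. The sketch below uses R's built-in ToothGrowth data frame as a stand-in for the CSV, assuming the CSV holds the same values. `tg` and `dose_summary` are hypothetical names.

```r
library(dplyr)

# Stand-in: built-in ToothGrowth, renamed to match the CSV's column names.
# Assumption: the CSV holds the same values as the built-in dataset.
tg <- tibble(Length = ToothGrowth$len, Dose = ToothGrowth$dose)

# One row per dose, with the mean and SD of tooth length.
dose_summary <- tg %>%
  group_by(Dose) %>%
  summarise(
    mean_length = mean(Length),
    sd_length = sd(Length)
  )

dose_summary

# In the R Markdown text, rounding can be applied inside in-line code, e.g.
#   `r round(meanlength05, 1)` (`r round(sdlength05, 1)`)
```

Rounding inside the in-line code keeps the knitted sentence readable without changing the stored values.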

d)

Make a boxplot for tooth length based on supplement. Comment on the distribution, and any observed differences you see between OJ and VC supplement groups.

ggplot(data = tooth_growth) +
  geom_boxplot(aes(x=Supplement, y=Length)) +
  labs(title = "Tooth Length by Supplement")

Tooth lengths tend to be greater in the OJ group than in the VC group. The distribution of tooth length in the OJ group is skewed to the left, whereas the distribution in the VC group is skewed to the right.

Question 3

The Murders dataset contains information on murder rates in the US in 2012. The variables are: state, region, population (number of residents in the region), and total_murders (number of murders in the region).

a)

The code below attempts to read in the murders dataset but requires additional options to read in the data correctly. Take a look at the data file and check the data carefully after you read it in…

library(readxl)  # read_excel() is not attached by library(tidyverse)
murders <- read_excel("data/murders.xlsx", sheet = 2, range = "E5:H56")

b)

How many variables and observations are in this dataset? What is each variable’s type?

There are 51 observations and 4 variables in this dataset. The variables state and region are character variables, and population and total_murders are numerical variables.

c)

The following code creates a histogram of the total murders, yet contains four errors. Identify and correct each error, and describe what was wrong below the graph. Once you have fixed all errors, be sure to remove eval = FALSE from the code chunk options so that the code will run.

ggplot(data = murders) + 
  geom_histogram(aes(x = total_murders)) +  
  labs(title = "Histogram of Murders")

The data frame is named murders, not murder. On line 203, %>% was changed to +. On line 204, aes() was added and y was changed to x; the x-axis title "cities" was also removed.

d)

The following code attempts to visualize total murders by population and region, with each region displayed in a different color – but there are four errors. Find and correct each, and describe what was wrong below the graph. Once you have fixed all errors, be sure to remove eval = FALSE from the code chunk options.

ggplot(data = murders) +
  geom_point(aes(x = total_murders, y = population, color = region)) +
  labs(title = "Murders by Population and Region", x = "total_murders", y = "population")

Line 219: geom_scatter was changed to geom_point.

Line 219: totalmurders was changed to total_murders.

Line 219: color = region was moved inside the aes() parentheses.

The order within aes() was switched so that x comes before y, and the axis titles were renamed to match the variable on each axis.
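The color fix comes down to the difference between mapping and setting an aesthetic in ggplot2. The sketch below contrasts the two on a hypothetical stand-in for the murders data (`toy_murders`, with made-up values).

```r
library(ggplot2)

# Hypothetical stand-in for the murders data; the values are made up.
toy_murders <- data.frame(
  total_murders = c(20, 50, 60, 120),
  population = c(1e6, 2e6, 3e6, 4e6),
  region = c("South", "West", "South", "West")
)

# Mapping: color = region inside aes() gives each region its own color
# and adds a legend.
mapped <- ggplot(data = toy_murders) +
  geom_point(aes(x = total_murders, y = population, color = region))

# Setting: color outside aes() applies one fixed color to every point.
fixed <- ggplot(data = toy_murders) +
  geom_point(aes(x = total_murders, y = population), color = "blue")

mapped
fixed
```

The first form is what a plot colored by region needs. The second form matches the single-color scatterplots in Question 1.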

Submission Instructions

When you are finished with this homework, Knit it to HTML and then submit both your .rmd and .html files to courseworks.