1 Directions

The problem set is worth 100 points.

Enter your answers in the empty code chunks. Replace “# your code here” with your code.

Make sure you run this chunk before attempting any of the problems:

library(tidyverse)

2 Basics

Calculate \(2+2\):

2+2
## [1] 4

Calculate \(2*3\):

2*3
## [1] 6

Calculate \(\frac{(2+2)\times (3^2 + 5)}{(6/4)}\):

((2+2)*((3^2)+5))/(6/4)
## [1] 37.33333
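
Working the expression from the inside out confirms the result (intermediate values rounded for display):

\[ \frac{(2+2)\times (3^2 + 5)}{(6/4)} = \frac{4 \times 14}{1.5} = \frac{56}{1.5} \approx 37.33 \]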

3 dplyr

Let’s work with the data set diamonds:

data(diamonds) # this will load a dataset called "diamonds"

Calculate the average price of a diamond. Use the %>% and summarise() syntax (hint: see lectures).

diamonds %>% 
  summarise(mean(price))
## # A tibble: 1 x 1
##   `mean(price)`
##           <dbl>
## 1         3933.

Calculate the mean, median, and standard deviation of diamond prices. Use the %>% and summarise() syntax.

diamonds %>% 
  summarise(mean(price), median(price), sd(price))
## # A tibble: 1 x 3
##   `mean(price)` `median(price)` `sd(price)`
##           <dbl>           <dbl>       <dbl>
## 1         3933.            2401       3989.

Use group_by() to group diamonds by color, then use summarise() to calculate the average price and the standard deviation in price by color:

diamonds %>% 
  group_by(color) %>% 
  summarise(mean(price), sd(price))
## # A tibble: 7 x 3
##   color `mean(price)` `sd(price)`
##   <ord>         <dbl>       <dbl>
## 1 D             3170.       3357.
## 2 E             3077.       3344.
## 3 F             3725.       3785.
## 4 G             3999.       4051.
## 5 H             4487.       4216.
## 6 I             5092.       4722.
## 7 J             5324.       4438.

Use filter() to remove observations with a depth greater than 62, then use group_by() to group diamonds by clarity, then use summarise() to find the maximum price of a diamond by clarity:

diamonds %>% 
  filter(depth <= 62) %>% 
  group_by(clarity) %>% 
  summarise(max(price))
## # A tibble: 8 x 2
##   clarity `max(price)`
##   <ord>          <int>
## 1 I1             15223
## 2 SI2            18784
## 3 SI1            18797
## 4 VS2            18823
## 5 VS1            18795
## 6 VVS2           18730
## 7 VVS1           18682
## 8 IF             18806

Use mutate() and log() to add a new variable called “log_price” to the data. Make sure you save the result back to the dataset diamonds.

diamonds = diamonds %>%
  mutate(log_price = log(price))

(Hint: if I wanted to add a variable called “max_price” that calculates the max price, the code would look like this:)

diamonds = diamonds %>% 
  mutate(max_price = max(price))

4 ggplot2

Continue using diamonds.

Use geom_histogram() to plot a histogram of prices:

diamonds %>% 
  ggplot(aes(x = price)) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Use geom_density() to plot the density of log prices (the variable you added to the data frame):

diamonds %>% 
  ggplot(aes(x = log_price)) +
  geom_density()

Use geom_point() to plot carats against log prices (i.e. carats on the x-axis, log prices on the y-axis):

diamonds %>% 
  ggplot(aes(x = carat, y = log_price)) +
  geom_point()

Same as above, but now add a regression line with geom_smooth():

diamonds %>% 
  ggplot(aes(x = carat, y = log_price)) +
  geom_point() +
  # method = "lm" fits a straight regression line; the default here is a GAM smoother
  geom_smooth(method = "lm", formula = y ~ x)

Use stat_summary() to make a bar plot of average log price by cut:

diamonds %>% 
  ggplot(aes(x = cut, y = log_price)) +
  # fun and geom are arguments of stat_summary(), not aesthetics
  # (ggplot2 versions before 3.3.0 name the argument fun.y)
  stat_summary(fun = mean, geom = "bar")

Same as above but change the theme to theme_classic():

diamonds %>% 
  ggplot(aes(x = cut, y = log_price)) +
  # fun and geom are arguments of stat_summary(), not aesthetics
  # (ggplot2 versions before 3.3.0 name the argument fun.y)
  stat_summary(fun = mean, geom = "bar") +
  theme_classic()

5 Inference

Use lm() to estimate the model

\[ \log(\text{price}) = \beta_0 + \beta_1 \text{carat} + \beta_2 \text{table} + \varepsilon \]

and store the output in an object called “m1”:

# Join predictors with +; after a comma, table would be passed as lm()'s subset argument
m1 = lm(log_price ~ carat + table, data = diamonds)

Use summary() to view the output of “m1”:

summary(m1) # the coefficient table has rows for (Intercept), carat and table

Use lm() to estimate the model

\[ \log(\text{price}) = \beta_0 + \beta_1 \text{carat} + \beta_2 \text{table} + \beta_3 \text{depth} + \varepsilon \]

and store the output in an object called “m2”:

# Join predictors with +; after commas, table and depth would be passed as subset and weights
m2 = lm(log_price ~ carat + table + depth, data = diamonds)

Use summary() to view the output of “m2”:

summary(m2) # the coefficient table has rows for (Intercept), carat, table and depth
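
A note on formula syntax: in lm(), predictors are joined with + inside the formula. Anything after a comma is matched to lm()'s other arguments instead (the next ones after formula and data are subset and weights). The sketch below is only an illustration and uses the built-in mtcars data, which is an assumption and not part of this problem set:

```r
# Illustration on the built-in mtcars data (an assumption; not the diamonds data).
# Two predictors, joined with + inside the formula.
fit <- lm(mpg ~ wt + hp, data = mtcars)

# The fitted model has one coefficient per predictor, plus the intercept.
coef_names <- names(coef(fit))
print(coef_names)
```

If a predictor is missing from the coefficient table, checking the Call line at the top of summary() shows how each argument was actually matched.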