1 Directions

The problem set is worth 100 points.

Enter your answers in the empty code chunks. Replace “# your code here” with your code.

Make sure you run this chunk before attempting any of the problems:

library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.0.5
## Warning: package 'ggplot2' was built under R version 4.0.4
## Warning: package 'tibble' was built under R version 4.0.4
## Warning: package 'tidyr' was built under R version 4.0.4
## Warning: package 'readr' was built under R version 4.0.5
## Warning: package 'purrr' was built under R version 4.0.4
## Warning: package 'dplyr' was built under R version 4.0.4
## Warning: package 'stringr' was built under R version 4.0.4
## Warning: package 'forcats' was built under R version 4.0.5

2 Basics

Calculate \(2+2\):

2+2
## [1] 4

Calculate \(2*3\):

2*3
## [1] 6

Calculate \(\frac{(2+2)\times (3^2 + 5)}{(6/4)}\):

(2+2)*(3^2+5)/(6/4)
## [1] 37.33333
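
As a check on the result, the expression can be evaluated by hand, following the usual operator precedence (exponentiation first, then multiplication and division):

\[ \frac{(2+2)\times (3^2 + 5)}{(6/4)} = \frac{4 \times 14}{1.5} = \frac{56}{1.5} = \frac{112}{3} \approx 37.33 \]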

3 dplyr

Let’s work with the data set diamonds:

data(diamonds) # this will load a dataset called "diamonds"

Calculate the average price of a diamond. Use the %>% and summarise() syntax (hint: see lectures).

diamonds %>% 
  summarise(mean(price))
## # A tibble: 1 x 1
##   `mean(price)`
##           <dbl>
## 1         3933.

Calculate the mean, median, and standard deviation of diamond prices. Use the %>% and summarise() syntax.

diamonds %>% 
  summarise(mean(price),median(price),sd(price))
## # A tibble: 1 x 3
##   `mean(price)` `median(price)` `sd(price)`
##           <dbl>           <dbl>       <dbl>
## 1         3933.            2401       3989.
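
Naming each summary inside summarise() gives cleaner column headers than the backticked defaults such as `mean(price)`. A minimal sketch, assuming dplyr and ggplot2 are installed; the column names mean_price, median_price, and sd_price and the object price_stats are illustrative choices, not part of the assignment:

```r
library(dplyr)
library(ggplot2)  # provides the diamonds data set

# Name each summary so the result has readable column headers
price_stats <- diamonds %>%
  summarise(
    mean_price   = mean(price),
    median_price = median(price),
    sd_price     = sd(price)
  )

price_stats
```

The same naming pattern works after group_by(), which keeps the grouping column alongside the named summaries.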

Use group_by() to group diamonds by color, then use summarise() to calculate the average price and the standard deviation in price by color:

diamonds %>% 
  group_by(color) %>% 
  summarise(mean(price),median(price),sd(price))
## # A tibble: 7 x 4
##   color `mean(price)` `median(price)` `sd(price)`
##   <ord>         <dbl>           <dbl>       <dbl>
## 1 D             3170.           1838        3357.
## 2 E             3077.           1739        3344.
## 3 F             3725.           2344.       3785.
## 4 G             3999.           2242        4051.
## 5 H             4487.           3460        4216.
## 6 I             5092.           3730        4722.
## 7 J             5324.           4234        4438.

Use filter() to remove observations with a depth greater than 62, then use group_by() to group diamonds by clarity, then use summarise() to find the maximum price of a diamond by clarity:

diamonds %>% 
  filter(depth <= 62) %>%   # keep depth <= 62, i.e. drop observations with depth > 62
  group_by(clarity) %>% 
  summarise(max(price))

Use mutate() and log() to add a new variable called “log_price” to the data. Make sure you save the result back to the dataset diamonds.

diamonds = diamonds %>% 
  mutate(log_price = log(price))

(Hint: if I wanted to add a variable called “max_price” that calculates the max price, the code would look like this:)

diamonds = diamonds %>% 
  mutate(max_price = max(price))

4 ggplot2

Continue using diamonds.

Use geom_histogram() to plot a histogram of prices:

diamonds %>% 
  ggplot(aes(x=price))+
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
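
The message above is a prompt to choose the bin width explicitly rather than rely on the default of 30 bins. A minimal sketch; the width of 500 (price units) and the object name p_hist are assumptions picked for illustration, and other widths may suit the data better:

```r
library(ggplot2)  # provides geom_histogram() and the diamonds data set

# Histogram of price with an explicit bin width instead of the default 30 bins
p_hist <- ggplot(diamonds, aes(x = price)) +
  geom_histogram(binwidth = 500)

p_hist
```

Setting binwidth also silences the `stat_bin()` message, since the default is no longer being used.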

Use geom_density() to plot the density of log prices (the variable you added to the data frame):

diamonds %>% 
  ggplot(aes(x=log_price))+
  geom_density()

Use geom_point() to plot carats against log prices (i.e. carats on the x-axis, log prices on the y-axis):

diamonds %>% 
  ggplot(aes(x=carat, y=log_price))+
  geom_point()

Same as above, but now add a regression line with geom_smooth():

diamonds %>% 
  ggplot(aes(x=carat, y=log_price))+
  geom_point()+
  geom_smooth()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
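
By default geom_smooth() fits a flexible smoother, and the message above reports that a GAM was chosen here. If a straight least-squares line is wanted instead, method = "lm" can be passed. A sketch under that assumption, with log(price) computed inline so the chunk stands alone; the alpha value and the object name p_lm are illustrative:

```r
library(ggplot2)  # provides the plotting functions and the diamonds data set

# Scatter plot of carat against log price with a straight regression line
p_lm <- ggplot(diamonds, aes(x = carat, y = log(price))) +
  geom_point(alpha = 0.1) +                    # faint points to reduce overplotting
  geom_smooth(method = "lm", formula = y ~ x)  # linear fit instead of the default smoother

p_lm
```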

Use stat_summary() to make a bar plot of average log price by cut:

diamonds %>% 
  ggplot(aes(x=cut, y=log_price))+
  stat_summary(fun = mean, geom = "bar")

Same as above but change the theme to theme_classic():

diamonds %>% 
  ggplot(aes(x=cut, y=log_price))+
  stat_summary(fun = mean, geom = "bar")+
  theme_classic()

5 Inference

Use lm() to estimate the model

\[ \log(\text{price}) = \beta_0 + \beta_1 \text{carat} + \beta_2 \text{table} + \varepsilon \]

and store the output in an object called “m1”:

m1 <- lm(log_price ~ carat + table, data = diamonds)

Use summary() to view the output of “m1”:

summary(m1)

Use lm() to estimate the model

\[ \log(\text{price}) = \beta_0 + \beta_1 \text{carat} + \beta_2 \text{table} + \beta_3 \text{depth} + \varepsilon \]

and store the output in an object called “m2”:

m2 <- lm(log_price ~ carat + table + depth, data = diamonds)

Use summary() to view the output of “m2”:

summary(m2)
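
Because the outcome is in logs, the carat coefficient reads as a proportional effect. As a sketch of the standard log-linear interpretation (holding table, depth, and \(\varepsilon\) fixed, and writing \(\text{price}'\) for the price after a one-unit increase in carat, a symbol introduced here only for illustration):

\[ \log(\text{price}') - \log(\text{price}) = \beta_1 \quad \Longrightarrow \quad \frac{\text{price}'}{\text{price}} = e^{\beta_1} \]

So price changes by \(100\,(e^{\beta_1} - 1)\) percent per additional carat; the familiar approximation of \(100\,\beta_1\) percent is accurate only when \(\beta_1\) is small.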