1 Directions

The problem set is worth 100 points.

Enter your answers in the empty code chunks. Replace “# your code here” with your code.

Make sure you run this chunk before attempting any of the problems:

library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.1.3

2 Basics

Calculate \(2+2\):

2+2
## [1] 4

Calculate \(2 \times 3\):

2*3
## [1] 6

Calculate \(\frac{(2+2)\times (3^2 + 5)}{(6/4)}\):

((2 + 2) * (3^2 + 5))/(6/4)
## [1] 37.33333
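
As a check on the result above, the same expression can be worked by hand, evaluating the parentheses first and dividing last:

\[ \frac{(2+2)\times (3^2 + 5)}{(6/4)} = \frac{4 \times 14}{1.5} = \frac{56}{1.5} \approx 37.33 \]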

3 dplyr

Let’s work with the data set diamonds:

data(diamonds) # this will load a dataset called "diamonds"

Calculate the average price of a diamond. Use the %>% and summarise() syntax (hint: see lectures).

diamonds %>% 
  summarise(avg_price = mean(price))
## # A tibble: 1 x 1
##   avg_price
##       <dbl>
## 1     3933.

Calculate the mean, median, and standard deviation of diamond prices. Use the %>% and summarise() syntax.

diamonds %>% 
  summarise(avg_price = mean(price), #average price of a diamond
            median_price = median(price), #median price of a diamond
            sd_price = sd(price)) #standard deviation of diamond price
## # A tibble: 1 x 3
##   avg_price median_price sd_price
##       <dbl>        <dbl>    <dbl>
## 1     3933.         2401    3989.

Use group_by() to group diamonds by color, then use summarise() to calculate the average price and the standard deviation in price by color:

diamonds %>% 
  group_by(color) %>% 
  summarise(avg_price = mean(price), #average price by color
            sd_price = sd(price)) #standard deviation by color
## # A tibble: 7 x 3
##   color avg_price sd_price
##   <ord>     <dbl>    <dbl>
## 1 D         3170.    3357.
## 2 E         3077.    3344.
## 3 F         3725.    3785.
## 4 G         3999.    4051.
## 5 H         4487.    4216.
## 6 I         5092.    4722.
## 7 J         5324.    4438.
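
To see what group_by() followed by summarise() does on a table small enough to check by hand, here is a minimal sketch. The tibble `toy` and its columns `group` and `value` are hypothetical and are not part of the problem set.

```r
library(dplyr)

# Hypothetical toy data, small enough to verify by hand
toy <- tibble(
  group = c("a", "a", "b", "b"),
  value = c(1, 3, 10, 20)
)

# One output row per group, one column per summary
by_group <- toy %>%
  group_by(group) %>%
  summarise(avg_value = mean(value),
            sd_value = sd(value))

print(by_group)
```

Each group collapses to a single row, which is why the diamonds result above has seven rows, one per color.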

Use filter() to remove observations with a depth greater than 62, then use group_by() to group diamonds by clarity, then use summarise() to find the maximum price of a diamond by clarity:

diamonds %>% 
  filter(depth > 62) %>% #Keeps only diamonds with depth > 62 (to remove them instead, as the prompt asks, use depth <= 62)
  group_by(clarity) %>% #Group by clarity
  summarise(max_price = max(price)) #max diamond price by clarity with depth >62
## # A tibble: 8 x 2
##   clarity max_price
##   <ord>       <int>
## 1 I1          18531
## 2 SI2         18804
## 3 SI1         18818
## 4 VS2         18791
## 5 VS1         18500
## 6 VVS2        18768
## 7 VVS1        18777
## 8 IF          18552
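
Note that filter() keeps the rows for which its condition is TRUE, so "removing" observations means writing the condition for the rows that should stay. The sketch below illustrates the difference on a hypothetical four-row tibble (`toy`, `id`, and the depth values are made up for illustration).

```r
library(dplyr)

# Hypothetical toy data: two shallow and two deep stones
toy <- tibble(id = 1:4, depth = c(60, 62, 63, 65))

# Keeps only rows with depth greater than 62 (ids 3 and 4)
kept_deep <- toy %>% filter(depth > 62)

# Removes rows with depth greater than 62 (ids 1 and 2 remain)
removed_deep <- toy %>% filter(depth <= 62)

print(kept_deep)
print(removed_deep)
```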

Use mutate() and log() to add a new variable called “log_price” to the data. Make sure you add the variable to the dataset diamonds.

diamonds = diamonds %>% 
  mutate(log_price = log(price)) #Added new variable that is the log of price

(Hint: if I wanted to add a variable called “max_price” that calculates the max price, the code would look like this:)

diamonds = diamonds %>% 
  mutate(max_price = max(price))
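
The two mutate() calls above behave slightly differently: log() works element by element, while max() returns a single number that mutate() repeats on every row. A small sketch on a hypothetical three-row tibble (`toy` is not part of the problem set) makes this visible:

```r
library(dplyr)

# Hypothetical toy prices
toy <- tibble(price = c(100, 200, 400))

toy <- toy %>%
  mutate(log_price = log(price),   # one value per row
         max_price = max(price))   # a single value, repeated on every row

print(toy)
```

Assigning the result back to `toy` is what makes the new columns persist, mirroring the `diamonds = diamonds %>% ...` pattern.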

4 ggplot2

Continue using diamonds.

Use geom_histogram() to plot a histogram of prices:

diamonds %>% 
  ggplot(aes(x = price)) + 
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Use geom_density() to plot the density of log prices (the variable you added to the data frame):

diamonds %>% 
  ggplot(aes(x = log_price)) + 
  geom_density() + 
  labs(x = "Log Prices")

Use geom_point() to plot carats against log prices (i.e. carats on the x-axis, log prices on the y-axis):

diamonds %>% 
  ggplot(aes(x = carat, y = log_price)) + 
  geom_point() + 
  labs(x = "Carats", y = "Log Price")

Same as above, but now add a regression line with geom_smooth():

diamonds %>% 
  ggplot(aes(x = carat, y = log_price)) + 
  geom_point() + 
  geom_smooth(method = "lm") +
  labs(x = "Carats", y = "Log Price")
## `geom_smooth()` using formula 'y ~ x'

Use stat_summary() to make a bar plot of average log price by cut:

diamonds %>% 
  ggplot(aes(x = cut, y = log_price)) +
  stat_summary(fun = "mean", geom = "bar") +
  labs(title = "Average Log Price by Cut")

Same as above but change the theme to theme_classic():

diamonds %>% 
  ggplot(aes(x = cut, y = log_price)) +
  stat_summary(fun = "mean", geom = "bar") +
  labs(title = "Average Log Price by Cut") +
  theme_classic() #change to classic theme
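
For intuition about what stat_summary(fun = "mean", geom = "bar") computes, one roughly equivalent approach is to calculate the group means first with group_by() and summarise(), then draw them with geom_col(). The sketch below assumes a hypothetical tibble `toy` with columns `grp` and `y`; it is an illustration, not the required answer.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data
toy <- tibble(grp = c("a", "a", "b", "b"), y = c(1, 3, 10, 20))

# Option 1: let ggplot2 compute the mean of y within each x value
p_stat <- ggplot(toy, aes(x = grp, y = y)) +
  stat_summary(fun = "mean", geom = "bar")

# Option 2: compute the means explicitly, then plot the summarised table
means <- toy %>%
  group_by(grp) %>%
  summarise(y = mean(y))

p_col <- ggplot(means, aes(x = grp, y = y)) +
  geom_col()
```

Both plots should show bars of the same heights; the second version simply makes the intermediate table of means available for inspection.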

5 Inference

Use lm() to estimate the model

\[ \log(\text{price}) = \beta_0 + \beta_1 \text{carat} + \beta_2 \text{table} + \varepsilon \]

and store the output in an object called “m1”:

m1 = lm(formula = log_price ~ carat + table, data = diamonds)
#log_price as a function of carat and table

Use summary() to view the output of “m1”:

summary(m1)
## 
## Call:
## lm(formula = log_price ~ carat + table, data = diamonds)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.2930 -0.2453  0.0338  0.2571  1.5573 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  6.4527654  0.0443008 145.658  < 2e-16 ***
## carat        1.9733423  0.0036678 538.015  < 2e-16 ***
## table       -0.0041876  0.0007781  -5.382  7.4e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3971 on 53937 degrees of freedom
## Multiple R-squared:  0.8469, Adjusted R-squared:  0.8469 
## F-statistic: 1.491e+05 on 2 and 53937 DF,  p-value: < 2.2e-16
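
One common way to read the carat coefficient, since the outcome is in logs: holding table fixed, a 0.1 increase in carat multiplies the exponentiated prediction by

\[ e^{0.1 \times \hat\beta_1} = e^{0.1 \times 1.9733} \approx 1.218 \]

that is, roughly a 22% higher predicted price per additional 0.1 carat. This is an interpretation sketch based on the estimates printed above, not a causal claim.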

Use lm() to estimate the model

\[ \log(\text{price}) = \beta_0 + \beta_1 \text{carat} + \beta_2 \text{table} + \beta_3 \text{depth} + \varepsilon \]

and store the output in an object called “m2”:

m2 = lm(formula = log_price ~ carat+table+depth, data = diamonds)
#log price as a function of carat, table, and depth

Use summary() to view the output of “m2”:

summary(m2)
## 
## Call:
## lm(formula = log_price ~ carat + table + depth, data = diamonds)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.2280 -0.2437  0.0328  0.2578  1.5453 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  8.045098   0.101427   79.32   <2e-16 ***
## carat        1.978928   0.003672  538.99   <2e-16 ***
## table       -0.008539   0.000815  -10.48   <2e-16 ***
## depth       -0.021810   0.001251  -17.44   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.396 on 53936 degrees of freedom
## Multiple R-squared:  0.8477, Adjusted R-squared:  0.8477 
## F-statistic: 1.001e+05 on 3 and 53936 DF,  p-value: < 2.2e-16
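
The R-squared values printed by summary() can also be extracted programmatically, which helps when comparing two nested models like the ones above. The sketch below uses the built-in mtcars data and hypothetical model names (`m_small`, `m_large`) so that it runs on its own; the same pattern should apply to the models fitted on diamonds.

```r
# Two nested models on the built-in mtcars data (illustrative only)
m_small <- lm(log(mpg) ~ wt, data = mtcars)
m_large <- lm(log(mpg) ~ wt + hp, data = mtcars)

# Pull R-squared and adjusted R-squared out of the summary objects
r2 <- c(small = summary(m_small)$r.squared,
        large = summary(m_large)$r.squared)
adj_r2 <- c(small = summary(m_small)$adj.r.squared,
            large = summary(m_large)$adj.r.squared)

print(r2)
print(adj_r2)
```

Adding a regressor never lowers the plain R-squared of a least-squares fit, so the adjusted version is usually the more informative of the two for this kind of comparison.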