The problem set is worth 100 points.
Enter your answers in the empty code chunks. Replace “# your code here” with your code.
Make sure you run this chunk before attempting any of the problems:
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.0.5
## Warning: package 'ggplot2' was built under R version 4.0.4
## Warning: package 'tibble' was built under R version 4.0.4
## Warning: package 'tidyr' was built under R version 4.0.4
## Warning: package 'readr' was built under R version 4.0.5
## Warning: package 'purrr' was built under R version 4.0.4
## Warning: package 'dplyr' was built under R version 4.0.4
## Warning: package 'stringr' was built under R version 4.0.4
## Warning: package 'forcats' was built under R version 4.0.5
Calculate \(2+2\):
2+2
## [1] 4
Calculate \(2*3\):
2*3
## [1] 6
Calculate \(\frac{(2+2)\times (3^2 + 5)}{(6/4)}\):
(((2+2)*(3^2+5))/(6/4))
## [1] 37.33333
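As a check on the result above, the same expression can be worked by hand, simplifying the numerator and denominator first:
\[ \frac{(2+2)\times (3^2 + 5)}{(6/4)} = \frac{4 \times 14}{3/2} = \frac{112}{3} \approx 37.33 \]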
dplyr
Let’s work with the data set diamonds:
data(diamonds) # this will load a dataset called "diamonds"
Calculate the average price of a diamond. Use the %>% and summarise() syntax (hint: see lectures).
diamonds %>%
  summarise(mean(price))
## # A tibble: 1 x 1
## `mean(price)`
## <dbl>
## 1 3933.
Calculate the mean, median, and standard deviation of diamond prices. Use the %>% and summarise() syntax.
diamonds %>%
  summarise(mean(price), median(price), sd(price))
## # A tibble: 1 x 3
## `mean(price)` `median(price)` `sd(price)`
## <dbl> <dbl> <dbl>
## 1 3933. 2401 3989.
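The column names in the output above are the summary expressions themselves, wrapped in backticks. A minimal sketch of one way to get plainer names, assuming you name each summary inside summarise(); the names avg_price, med_price, sd_price, and price_summary are illustrative and do not come from the problem set:

```r
library(dplyr)
library(ggplot2)  # provides the diamonds data set

# Name each summary so the result has plain column names
price_summary <- diamonds %>%
  summarise(
    avg_price = mean(price),
    med_price = median(price),
    sd_price  = sd(price)
  )

price_summary
```

Named columns are easier to refer to in later steps, for example price_summary$avg_price.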
Use group_by() to group diamonds by color, then use summarise() to calculate the average price and the standard deviation in price by color:
diamonds %>%
  group_by(color) %>%
  summarise(mean(price), median(price), sd(price))
## # A tibble: 7 x 4
## color `mean(price)` `median(price)` `sd(price)`
## <ord> <dbl> <dbl> <dbl>
## 1 D 3170. 1838 3357.
## 2 E 3077. 1739 3344.
## 3 F 3725. 2344. 3785.
## 4 G 3999. 2242 4051.
## 5 H 4487. 3460 4216.
## 6 I 5092. 3730 4722.
## 7 J 5324. 4234 4438.
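Use filter() to remove observations with a depth greater than 62, then use group_by() to group diamonds by clarity, then use summarise() to find the maximum price of a diamond by clarity:
diamonds %>%
  # filter() keeps rows where the condition is TRUE, so this line retains the
  # diamonds with depth > 62; filter(depth <= 62) would drop them instead
  filter(depth > 62) %>%
  group_by(clarity) %>%
  summarise(max(price))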
## # A tibble: 8 x 2
## clarity `max(price)`
## <ord> <int>
## 1 I1 18531
## 2 SI2 18804
## 3 SI1 18818
## 4 VS2 18791
## 5 VS1 18500
## 6 VVS2 18768
## 7 VVS1 18777
## 8 IF 18552
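Because filter() keeps the rows for which its condition is TRUE, removing observations with a depth greater than 62 means keeping those with depth <= 62. A minimal sketch on a hypothetical three-row tibble (the data and the names toy and kept are made up for illustration and are not taken from diamonds):

```r
library(dplyr)

# Hypothetical toy data, not taken from diamonds
toy <- tibble(
  depth = c(60, 62, 63.5),
  price = c(100, 200, 300)
)

# Keep rows with depth at or below 62, i.e. remove rows with depth > 62
kept <- toy %>%
  filter(depth <= 62)

kept
```

The same condition can be dropped into the diamonds pipeline in place of depth > 62.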
Use mutate() and log() to add a new variable called “log_price” to the data. Make sure you add the variable to the dataset diamonds.
diamonds = diamonds %>%
  mutate(log_price = log(price))
(Hint: if I wanted to add a variable called “max_price” that calculates the max price, the code would look like this:)
diamonds = diamonds %>%
  mutate(max_price = max(price))
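The two mutate() calls above behave differently: log() returns one value per row, while max() returns a single value that mutate() repeats in every row. A small sketch on a hypothetical three-row tibble (the data and the name toy are illustrative only):

```r
library(dplyr)

# Hypothetical toy data, not taken from diamonds
toy <- tibble(price = c(10, 100, 1000))

toy <- toy %>%
  mutate(
    log_price = log(price),  # element-wise: a different value in each row
    max_price = max(price)   # single summary value, recycled to every row
  )

toy
```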
ggplot2
Continue using diamonds.
Use geom_histogram() to plot a histogram of prices:
diamonds %>%
  ggplot(aes(x = price)) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
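The message above suggests choosing a bin width explicitly rather than relying on the default 30 bins. A sketch, assuming a bin width of 500 in price units; that value and the name p_hist are illustrative and not prescribed by the problem set:

```r
library(ggplot2)  # provides ggplot() and the diamonds data set

# Histogram of price with an explicit bin width instead of the default 30 bins
p_hist <- ggplot(diamonds, aes(x = price)) +
  geom_histogram(binwidth = 500)

p_hist
```

Setting binwidth also silences the `stat_bin()` message.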
Use geom_density() to plot the density of log prices (the variable you added to the data frame):
diamonds %>%
  ggplot(aes(x = log_price)) +  # plot the log price, not the raw price
  geom_density()
Use geom_point() to plot carats against log prices (i.e. carats on the x-axis, log prices on the y-axis):
diamonds %>%
  ggplot(aes(x = carat, y = log_price)) +
  geom_point()
Same as above, but now add a regression line with geom_smooth():
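diamonds %>%
  ggplot(aes(x = carat, y = log_price)) +
  geom_point() +   # keep the scatter plot from the previous problem
  geom_smooth()    # add the fitted curve on top of the points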
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
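As the message reports, the default smoother for data of this size is a GAM, which draws a flexible curve. For a straight least-squares regression line, one option is method = "lm". A sketch, computing log(price) inside aes() so that it does not depend on the log_price column; the alpha value and the name p_lm are illustrative choices:

```r
library(ggplot2)  # provides ggplot() and the diamonds data set

# Scatter plot of carat against log price with a linear regression line
p_lm <- ggplot(diamonds, aes(x = carat, y = log(price))) +
  geom_point(alpha = 0.1) +                      # faint points to reduce overplotting
  geom_smooth(method = "lm", formula = y ~ x)    # straight least-squares line

p_lm
```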
Use stat_summary() to make a bar plot of average log price by cut:
ggplot(data = diamonds, mapping = aes(x = cut, y = log_price)) +
  # summarise log_price within each cut by its mean and draw the result as bars
  stat_summary(fun = mean, geom = "bar")
Same as above, but change the theme to theme_classic():
ggplot(data = diamonds, mapping = aes(x = cut, y = log_price)) +
  # summarise log_price within each cut by its mean and draw the result as bars
  stat_summary(fun = mean, geom = "bar") +
  theme_classic()
Use lm() to estimate the model
\[ \log(\text{price}) = \beta_0 + \beta_1 \text{carat} + \beta_2 \text{table} + \varepsilon \]
and store the output in an object called “m1”:
# The formula is the first argument of lm(); the response is log(price) and the
# predictors are carat and table, matching the model above
m1 <- lm(log(price) ~ carat + table, data = diamonds)
Use summary() to view the output of “m1”:
summary(m1) # prints the coefficient table for (Intercept), carat, and table
Use lm() to estimate the model
\[ \log(\text{price}) = \beta_0 + \beta_1 \text{carat} + \beta_2 \text{table} + \beta_3 \text{depth} + \varepsilon \]
and store the output in an object called “m2”:
# Same specification as the first model, with depth added as a third predictor
m2 <- lm(log(price) ~ carat + table + depth, data = diamonds)
Use summary() to view the output of “m2”:
summary(m2) # prints the coefficient table for (Intercept), carat, table, and depth
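A quick sanity check on either fit is that the coefficient names match the right-hand side of the model equation. A sketch, assuming the formula is passed as the first argument of lm(); the object names m1_check and m2_check are hypothetical and chosen so they do not overwrite m1 and m2:

```r
library(ggplot2)  # provides the diamonds data set

# Fit both models with log(price) as the response
m1_check <- lm(log(price) ~ carat + table, data = diamonds)
m2_check <- lm(log(price) ~ carat + table + depth, data = diamonds)

# The coefficient names should list only the intercept and the chosen predictors
names(coef(m1_check))
names(coef(m2_check))
```

If other columns such as cut, color, or x appear among the coefficients, the formula was not picked up by lm().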