The problem set is worth 100 points.
Enter your answers in the empty code chunks. Replace “# your code here” with your code.
Make sure you run this chunk before attempting any of the problems:
library(tidyverse)
Calculate \(2+2\):
2+2
## [1] 4
Calculate \(2 \times 3\):
2*3
## [1] 6
Calculate \(\frac{(2+2)\times (3^2 + 5)}{(6/4)}\):
((2+2) * (3^2 + 5)) / (6/4)
## [1] 37.33333
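One way to check a nested expression like this is to compute the numerator and denominator separately before dividing. The sketch below does that; the object names numerator, denominator, and result are illustrative and not part of the problem set.

```r
# Break the expression into named pieces (names are illustrative only)
numerator   <- (2 + 2) * (3^2 + 5)  # 4 * 14
denominator <- 6 / 4                # 1.5

# Dividing the two pieces reproduces the one-line calculation above
result <- numerator / denominator
result
```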
dplyr
Let’s work with the data set diamonds:
data(diamonds) # this will load a dataset called "diamonds"
Calculate the average price of a diamond. Use the %>% and summarise() syntax (hint: see lectures).
diamonds %>%
summarize(avg = mean(price))
## # A tibble: 1 x 1
## avg
## <dbl>
## 1 3933.
# or
diamonds %>%
summarize(avg = sum(price)/nrow(diamonds))
## # A tibble: 1 x 1
## avg
## <dbl>
## 1 3933.
Calculate the average, median, and standard deviation of the price of a diamond. Use the %>% and summarise() syntax.
diamonds %>%
summarize(avg = mean(price),
med = median(price),
sd = sd(price))
## # A tibble: 1 x 3
## avg med sd
## <dbl> <dbl> <dbl>
## 1 3933. 2401 3989.
Use group_by() to group diamonds by color, then use summarise() to calculate the average price and the standard deviation in price by color:
diamonds %>%
group_by(color) %>%
summarize(avg = mean(price),
sd = sd(price))
## # A tibble: 7 x 3
## color avg sd
## <ord> <dbl> <dbl>
## 1 D 3170. 3357.
## 2 E 3077. 3344.
## 3 F 3725. 3785.
## 4 G 3999. 4051.
## 5 H 4487. 4216.
## 6 I 5092. 4722.
## 7 J 5324. 4438.
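To see what group_by() followed by summarise() is doing, it can help to try it on a data set small enough to check by hand. The sketch below uses a made-up tibble called toy (hypothetical data, not diamonds) and compares one grouped average with the same average computed manually.

```r
library(dplyr)

# Hypothetical toy data, not the diamonds data set
toy <- tibble(
  color = c("D", "D", "E", "E", "E"),
  price = c(100, 300, 200, 400, 600)
)

# One output row per color; n() counts the rows in each group
by_color <- toy %>%
  group_by(color) %>%
  summarise(avg = mean(price),
            sd  = sd(price),
            n   = n())

# The same average for color "D", computed without group_by()
manual_D <- mean(toy$price[toy$color == "D"])

by_color
manual_D
```

The grouped result has one row per distinct color, which is why the diamonds output above has seven rows.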
Use filter() to remove observations with a depth greater than 62, then use group_by() to group diamonds by clarity, then use summarise() to find the maximum price of a diamond by clarity:
diamonds %>%
filter(depth <= 62) %>%
group_by(clarity) %>%
summarize(maxP = max(price))
## # A tibble: 8 x 2
## clarity maxP
## <ord> <int>
## 1 I1 15223
## 2 SI2 18784
## 3 SI1 18797
## 4 VS2 18823
## 5 VS1 18795
## 6 VVS2 18730
## 7 VVS1 18682
## 8 IF 18806
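Note that filter() keeps the rows where the condition is TRUE, so removing depths greater than 62 is written as depth <= 62. A minimal sketch on made-up data (the tibble toy and its values are hypothetical) shows which rows survive, including what happens to a missing depth.

```r
library(dplyr)

# Hypothetical toy data with one missing depth
toy <- tibble(
  clarity = c("SI1", "SI1", "VS1", "VS1"),
  depth   = c(61.5, 63.0, 62.0, NA),
  price   = c(500, 900, 700, 1200)
)

# Rows are kept only where the condition is TRUE;
# the depth of 63 fails the test and the NA row is dropped as well
kept <- toy %>%
  filter(depth <= 62)

# Maximum price within each remaining clarity group
max_by_clarity <- kept %>%
  group_by(clarity) %>%
  summarise(maxP = max(price))

kept
max_by_clarity
```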
Use mutate() and log() to add a new variable called “log_price” to the data. Make sure you add the variable to the dataset diamonds.
diamonds <- diamonds %>%
mutate(log_price = log(price))
(Hint: if I wanted to add a variable called “max_price” that calculates the max price, the code would look like this:)
diamonds = diamonds %>%
mutate(max_price = max(price))
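The two mutate() calls above behave differently: log() returns one value per row, while max() returns a single value that is repeated in every row. The sketch below illustrates this on a hypothetical three-row tibble called toy.

```r
library(dplyr)

# Hypothetical toy data, not the diamonds data set
toy <- tibble(price = c(100, 1000, 10000))

toy <- toy %>%
  mutate(log_price = log(price),   # vectorised: a different value in each row
         max_price = max(price))   # a single summary value, repeated in each row

toy
```

Assigning the result back to toy is what makes the new columns persist, just as diamonds is overwritten above.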
ggplot2
Continue using diamonds.
Use geom_histogram() to plot a histogram of prices:
ggplot(data=diamonds, aes(x=price))+
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
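The message above appears because no bin size was specified. A minimal sketch of setting it explicitly follows; the value 500 is an arbitrary choice for illustration, not one prescribed by the problem set.

```r
library(ggplot2)

# binwidth is in the units of the x variable (here, price);
# 500 is an arbitrary illustrative choice
p <- ggplot(data = diamonds, aes(x = price)) +
  geom_histogram(binwidth = 500)

p
```

With binwidth (or bins) supplied, ggplot2 no longer prints the `stat_bin()` message.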
Use geom_density() to plot the density of log prices (the variable you added to the data frame):
ggplot(data=diamonds, aes(x=log_price))+
geom_density()
Use geom_point() to plot log prices against carats (i.e. carats on the x-axis, log prices on the y-axis):
ggplot(data=diamonds, aes(x=carat, y=log_price))+
geom_point()
Same as above, but now add a fitted line with geom_smooth() (with its default settings this is a smooth curve rather than a straight regression line):
ggplot(data=diamonds, aes(x=carat, y=log_price))+
geom_point()+
geom_smooth()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
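As the message shows, geom_smooth() chose a GAM smoother here. If a straight least-squares line is wanted instead, one option is to pass method = "lm". The sketch below is self-contained, so it recreates log_price in a copy called diamonds_log (a hypothetical name); the alpha value is likewise only an illustrative choice to reduce overplotting.

```r
library(dplyr)
library(ggplot2)

# Recreate the log_price variable in a separate copy of the data
diamonds_log <- diamonds %>%
  mutate(log_price = log(price))

p <- ggplot(data = diamonds_log, aes(x = carat, y = log_price)) +
  geom_point(alpha = 0.1) +      # transparency is an illustrative choice
  geom_smooth(method = "lm")     # straight regression line instead of the default smoother

p
```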
Use stat_summary() to make a bar plot of average log price by cut:
ggplot(data=diamonds, aes(x=cut))+
stat_summary(aes(y=log_price), fun="mean", geom="bar")
Same as above, but change the theme to theme_classic():
ggplot(data=diamonds, aes(x=cut))+
stat_summary(aes(y=log_price), fun="mean", geom="bar")+
theme_classic()
Use lm() to estimate the model
\[ \log(\text{price}) = \beta_0 + \beta_1 \text{carat} + \beta_2 \text{table} + \varepsilon \]
and store the output in an object called “m1”:
m1 = lm(log_price~ carat + table, data=diamonds)
Use summary() to view the output of “m1”:
summary(m1)
##
## Call:
## lm(formula = log_price ~ carat + table, data = diamonds)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.2930 -0.2453 0.0338 0.2571 1.5573
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.4527654 0.0443008 145.658 < 2e-16 ***
## carat 1.9733423 0.0036678 538.015 < 2e-16 ***
## table -0.0041876 0.0007781 -5.382 7.4e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3971 on 53937 degrees of freedom
## Multiple R-squared: 0.8469, Adjusted R-squared: 0.8469
## F-statistic: 1.491e+05 on 2 and 53937 DF, p-value: < 2.2e-16
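Because the outcome is the log of price, each slope is the change in log price for a one-unit change in that predictor, holding the other predictor fixed. A common way to read such a coefficient is to convert it to a percentage change with 100 * (exp(b) - 1). The sketch below refits the same model so that it runs on its own; the names b and pct_change are illustrative.

```r
# Load the data without attaching ggplot2, then rebuild log_price
data(diamonds, package = "ggplot2")
diamonds$log_price <- log(diamonds$price)

m1 <- lm(log_price ~ carat + table, data = diamonds)

# Extract the estimated coefficients as a named numeric vector
b <- coef(m1)

# Implied percentage change in price per one-unit increase in each predictor
pct_change <- 100 * (exp(b[c("carat", "table")]) - 1)
pct_change
```

For small coefficients such as the one on table, 100 * b is already a close approximation; for a coefficient as large as the one on carat, the exponentiated form differs substantially.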
Use lm() to estimate the model
\[ \log(\text{price}) = \beta_0 + \beta_1 \text{carat} + \beta_2 \text{table} + \beta_3 \text{depth} + \varepsilon \]
and store the output in an object called “m2”:
m2 = lm(log_price~ carat + table + depth, data=diamonds)
Use summary() to view the output of “m2”:
summary(m2)
##
## Call:
## lm(formula = log_price ~ carat + table + depth, data = diamonds)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.2280 -0.2437 0.0328 0.2578 1.5453
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.045098 0.101427 79.32 <2e-16 ***
## carat 1.978928 0.003672 538.99 <2e-16 ***
## table -0.008539 0.000815 -10.48 <2e-16 ***
## depth -0.021810 0.001251 -17.44 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.396 on 53936 degrees of freedom
## Multiple R-squared: 0.8477, Adjusted R-squared: 0.8477
## F-statistic: 1.001e+05 on 3 and 53936 DF, p-value: < 2.2e-16
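Since “m1” is nested in “m2” (the only difference is the added depth term), the two fits can be compared directly. One possible approach, not required by the problem set, is an F-test with anova() alongside the adjusted R-squared values. The sketch refits both models so that it runs on its own; comparison and adj_r2 are illustrative names.

```r
# Load the data without attaching ggplot2, then rebuild log_price
data(diamonds, package = "ggplot2")
diamonds$log_price <- log(diamonds$price)

m1 <- lm(log_price ~ carat + table, data = diamonds)
m2 <- lm(log_price ~ carat + table + depth, data = diamonds)

# F-test of the restricted model (m1) against the larger model (m2)
comparison <- anova(m1, m2)
comparison

# Adjusted R-squared for each model, side by side
adj_r2 <- c(m1 = summary(m1)$adj.r.squared,
            m2 = summary(m2)$adj.r.squared)
adj_r2
```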