::opts_chunk$set(echo = TRUE)
knitrlibrary(dplyr)
library(ggplot2)
library(mosaic)
library(stringr)
library(tidyr)
library(tidyverse)
Homework 1
Stat 253
Notes
- Please update the author name to your name!
- This document is for your answers only. Do NOT write out the homework questions in this document. Writing out the questions makes it very challenging to grade the homework. (You can create a separate document for your own records if you wish.)
- Besides adding your answers/code, please DO NOT make any deletions or changes to the structure of this template!
Setup
Put any code you need to load packages and load/prep data here.
Exercises
Data Context
install.packages(rattle)
## Error in eval(expr, envir, enclos): object 'rattle' not found
library(rattle)
data(weatherAUS)
dim(weatherAUS)
## [1] 208495 24
## [1] 208495 24
Exercise 1
Part a
<- weatherAUS %>%
weather filter(Location %in% c("Adelaide","Brisbane","Canberra","Hobart","Melbourne","Sydney")) %>%
mutate(Temp9am = Temp9am * 9/5 + 32,
Temp3pm = Temp3pm * 9/5 + 32) %>%
select(Temp3pm, Location, WindSpeed9am, Humidity9am, Pressure9am, Temp9am)
<- weatherAUS %>%
weather filter(Location %in% c("Adelaide", "Brisbane", "Canberra", "Hobart", "Melbourne", "Sydney")) %>%
mutate(Temp9am = Temp9am * 9/5 + 32,
Temp3pm = Temp3pm * 9/5 + 32) %>%
select(Temp3pm, Location, WindSpeed9am, Humidity9am, Pressure9am, Temp9am)
dim(weather)
## [1] 26828 6
head(weather, 2)
## # A tibble: 2 × 6
## Temp3pm Location WindSpeed9am Humidity9am Pressure9am Temp9am
## <dbl> <chr> <dbl> <int> <dbl> <dbl>
## 1 69.6 Sydney 17 92 1018. 69.3
## 2 76.6 Sydney 9 83 1018. 72.3
Part b
%>%
weather arrange(desc(Temp3pm)) %>%
head(3)
## # A tibble: 3 × 6
## Temp3pm Location WindSpeed9am Humidity9am Pressure9am Temp9am
## <dbl> <chr> <dbl> <int> <dbl> <dbl>
## 1 117. Adelaide 13 14 1008. 96.4
## 2 114. Melbourne 54 23 1003. 91.6
## 3 112. Sydney 13 62 1003. 84.0
Part c
%>%
weather group_by(Location) %>%
summarize(mean_temp3pm = mean(Temp3pm, na.rm = TRUE))
## # A tibble: 6 × 2
## Location mean_temp3pm
## <chr> <dbl>
## 1 Adelaide 70.8
## 2 Brisbane 76.7
## 3 Canberra 67.0
## 4 Hobart 61.2
## 5 Melbourne 66.6
## 6 Sydney 70.9
Part d
ggplot(weather, aes(x = Temp3pm, fill = Location)) +
geom_density(alpha = 0.4) +
labs(x = "3pm Temperature (°F)", y = "Density") +
theme_minimal()
ggplot(weather, aes(x = Temp9am, y = Temp3pm, color = Humidity9am)) +
geom_point(alpha = 0.6) +
facet_wrap(~ Location) +
scale_color_gradient(low = "lightblue", high = "darkblue") +
labs(x = "9am Temperature (°F)", y = "3pm Temperature (°F)", color = "Humidity9am") +
theme_minimal()
Part e
From the density plot, the distribution of 3 pm temperatures varies by location: Brisbane and Sydney tend to have higher average 3 pm temperatures, while Hobart has noticeably cooler temperatures. The overlap between several locations makes it a bit difficult to distinguish their distributions.
From the scattered plot, across all six locations, 3 pm temperature increases nearly linearly with 9 am temperature, with points colored by humidity showing that higher morning humidity tends to occur on cooler days. The relationship appears strongest and most consistent in Adelaide, Melbourne, and Canberra. The influence of humidity is less obvious from color.
Exercise 2
Part a
library(tidymodels)
<- linear_reg() %>%
lm_spec set_mode("regression") %>%
set_engine("lm")
<- recipe(Temp3pm ~ Temp9am + Location + Pressure9am, data = weather)
weather_recipe
<- workflow() %>%
weather_workflow add_model(lm_spec) %>%
add_recipe(weather_recipe)
<- fit(weather_workflow, data = weather)
weather_model_1
%>% tidy()
weather_model_1 ## # A tibble: 8 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -146. 5.07 -28.8 9.39e-180
## 2 Temp9am 0.951 0.00404 235. 0
## 3 LocationBrisbane -2.39 0.124 -19.3 2.47e- 82
## 4 LocationCanberra 3.53 0.123 28.6 2.37e-177
## 5 LocationHobart -1.29 0.128 -10.1 3.70e- 24
## 6 LocationMelbourne -0.0935 0.123 -0.759 4.48e- 1
## 7 LocationSydney -1.12 0.118 -9.43 4.57e- 21
## 8 Pressure9am 0.154 0.00487 31.8 4.20e-217
Part b
<- augment(weather_model_1, new_data = weather)
aug
<- aug %>%
res_plot mutate(.resid = Temp3pm - .pred) %>%
ggplot(aes(x = .pred, y = .resid, color = Location)) +
geom_point(alpha = 0.4) +
geom_hline(yintercept = 0, linetype = "dashed") +
labs(x = "Fitted 3pm temperature", y = "Residual",
title = "Residuals vs Fitted: weather_model_1") +
theme_minimal()
res_plot
This model does not appear to be wrong. The points are roughly centered around zero across the full range of values. There is no clear pattern. There is no strong funnel shape, so homoskedasticity seems okay. There are a few outliers, but does not dominate the pattern.
Part c
<- augment(weather_model_1, new_data = weather) %>%
weather_aug mutate(.resid = Temp3pm - .pred)
rsq(weather_aug, truth = Temp3pm, estimate = .pred)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rsq standard 0.757
The rsq means that about 86 % of the variation in 3 pm temperatures is explained by 9 am temperature, location, and pressure. This is a very high proportion, suggesting that the model is pretty strong in reflecting this relationship.
Part d
<- augment(weather_model_1, new_data = weather) %>%
weather_aug mutate(.resid = Temp3pm - .pred)
rmse(weather_aug, truth = Temp3pm, estimate = .pred)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 5.44
mae(weather_aug, truth = Temp3pm, estimate = .pred)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 mae standard 4.24
On average, the model’s predictions are about 5.4°F away from the true 3 pm temperature. On average, the absolute error is about 4.2°F. The model produces reasonably accurate predictions, although there is still some error.
Part e
The model is only looking at weather- and location-based data. There is no involvement of any personally identifiable or sensitive information, which is little risk of direct harms to any individuals or groups. Accurate temperature would benefit people who work and spent long times in the outdoors. Potential risk can happen if stakeholder relies too much on this model’s prediction for professional work; the model’s result might also be less convincing in terms of irregular and extreme temperature patterns.
Part f
Holding location and pressure constant, for every 1 °F increase in 9 am temperature, the model predicts the 3 pm temperature will rise by about 0.95 °F on average, which shows a very strong positive relationship between 9 am and 3 pm temperatures.
Holding 9 am temperature and pressure constant, Hobart’s average 3 pm temperature is about 1.3 °F lower than Adelaide’s.
Exercise 3
Part a
set.seed(253)
<- vfold_cv(weather, v = 10)
weather_folds
<- linear_reg() %>%
lm_spec set_mode("regression") %>%
set_engine("lm")
<- recipe(Temp3pm ~ Temp9am + Location + Pressure9am, data = weather)
weather_recipe_1
<- workflow() %>%
weather_workflow_1 add_model(lm_spec) %>%
add_recipe(weather_recipe_1)
<- recipe(Temp3pm ~ ., data = weather)
weather_recipe_2
<- workflow() %>%
weather_workflow_2 add_model(lm_spec) %>%
add_recipe(weather_recipe_2)
<- fit_resamples(weather_workflow_1, resamples = weather_folds,
results_1 metrics = metric_set(rmse, mae, rsq))
<- fit_resamples(weather_workflow_2, resamples = weather_folds,
results_2 metrics = metric_set(rmse, mae, rsq))
collect_metrics(results_1)
## # A tibble: 3 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 mae standard 4.25 10 0.0216 pre0_mod0_post0
## 2 rmse standard 5.44 10 0.0243 pre0_mod0_post0
## 3 rsq standard 0.757 10 0.00165 pre0_mod0_post0
collect_metrics(results_2)
## # A tibble: 3 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 mae standard 4.12 10 0.0242 pre0_mod0_post0
## 2 rmse standard 5.31 10 0.0254 pre0_mod0_post0
## 3 rsq standard 0.769 10 0.00149 pre0_mod0_post0
Model 2 is the better predictive model because it has smaller mae and rmse than model 1, but has higher rsq than model 1.
Part b
Cross-validation evaluates the performance of each model on data that was not used for fitting. This method is more realistic and helpful when we want to generalize to new data. In-sample metrics can hide overfitting because they measure how well the model fits data we already know.
Part c
Setting the random seed ensures that the folds created for cross-validation are the same every time we run the code. Otherwise, the data would be split into folds randomly each time, which will cause changes in mae, rmse, and rsq between models, making them harder to compare with each other.
Exercise 4
<- read.csv('https://mac-stat.github.io/data/weather_melbourne.csv')
melbourne
# Estimate the model. Assuming you named your model specification "lm_spec"...
<- lm_spec %>%
temp_model_1 fit(Humidity3pm ~ Temp9am + Humidity9am, data = melbourne)
%>%
temp_model_1 tidy()
## # A tibble: 3 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 18.5 3.69 5.02 6.28e- 7
## 2 Temp9am 0.108 0.116 0.928 3.54e- 1
## 3 Humidity9am 0.573 0.0366 15.7 2.57e-49
## # A tibble: 3 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 18.5 3.69 5.02 6.28e- 7
## 2 Temp9am 0.108 0.116 0.928 3.54e- 1
## 3 Humidity9am 0.573 0.0366 15.7 2.57e-49
# Predict the 3pm humidity when at 9am it is 15 degrees (C) with 75% humidity
%>%
temp_model_1 predict(new_data = data.frame(Temp9am = 15, Humidity9am = 75))
## # A tibble: 1 × 1
## .pred
## <dbl>
## 1 63.1
## # A tibble: 1 × 1
## .pred
## <dbl>
## 1 63.1
Part a
- Coefficient: Yes, the slope might change due to different scales of data used.
- P-value: No, how variable related to the general picture of the data should not be changing.
- Humidity9am coefficient: No, different scales should not change relationships.
- Prediction: Yes, because they contain the same information, although in different scales. The predictive result should be the same.
Part b
<- lm_spec %>%
temp_model_2 fit(Humidity3pm ~ Temp9amF + Humidity9am, data = melbourne)
%>% tidy()
temp_model_1 ## # A tibble: 3 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 18.5 3.69 5.02 6.28e- 7
## 2 Temp9am 0.108 0.116 0.928 3.54e- 1
## 3 Humidity9am 0.573 0.0366 15.7 2.57e-49
%>% tidy()
temp_model_2 ## # A tibble: 3 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 16.6 5.45 3.05 2.38e- 3
## 2 Temp9amF 0.0598 0.0645 0.928 3.54e- 1
## 3 Humidity9am 0.573 0.0366 15.7 2.57e-49
predict(temp_model_1, new_data = data.frame(Temp9am = 15, Humidity9am = 75))
## # A tibble: 1 × 1
## .pred
## <dbl>
## 1 63.1
predict(temp_model_2, new_data = data.frame(Temp9amF = 59, Humidity9am = 75))
## # A tibble: 1 × 1
## .pred
## <dbl>
## 1 63.1
The intuitions are correct!
Part c
No.
Part d
install.packages("mice")
## Error in contrib.url(repos, "source"): trying to use CRAN without setting a mirror
library(mice)
data(mammalsleep)
<- mammalsleep %>%
mammalsleep filter(brw > 1) %>%
mutate(log_brw = log(brw))
<- lm(log_brw ~ gt, data = mammalsleep)
lm_log
<- augment(lm_log)
aug_mammalsleep
ggplot(mammalsleep, aes(x = log_brw)) +
geom_density() +
labs(title = "Density of brw", x = "brw", y = "density") +
theme_minimal()
ggplot(mammalsleep, aes(x = gt, y = log_brw)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE, color = "blue") +
labs(title = "log(brw) vs gt", x = "gt", y = "brw") +
theme_minimal()
ggplot(aug_mammalsleep, aes(x = .fitted, y = .resid)) +
geom_point() +
geom_hline(yintercept = 0, linetype = "dashed") +
labs(
x = ".pred",
y = ".resd",
title = "Residuals vs Fitted (log model)"
+
) theme_minimal()
The graph does not look similar to the example graphs in the link. I am not sure where it went wrong.
Part e
Exercise 5
I used a couple of times Grok (AI) to help me de-bug. Most codes were pasted from prior assignments/activities, or projects that I have completed before this course.
Textbook is helpful too, although I reviewed that more for interpretation and concepts instead of coding.
Exercise 6
I found it very challenging to apply the methods to specific contexts of the questions. When I am making graphs, I still need to pull out past examples to copy and paste.
I think I am more confident in using tidymodel and tidyverse codes and filters.
I started a couple of days before the deadline; I had my notebook open when I was doing the assignment. I worked by myself; I did not struggle much, but rather coded and answered what I knew and left the rest. I revived Slack for questions posted by other students.
Improving efficiency and understanding in coding are things that I want to help me better manage future homework.
Done!
- Render your QMD to HTML.
- Check that HTML to make sure there are no errors or formatting issues.
- Submit the HTML on Moodle.