Homework 1

Stat 253

Author

Rosie Bai

Published

September 16, 2025

Notes

  1. Please update the author name to your name!
  2. This document is for your answers only. Do NOT write out the homework questions in this document. Writing out the questions makes it very challenging to grade the homework. (You can create a separate document for your own records if you wish.)
  3. Besides adding your answers/code, please DO NOT make any deletions or changes to the structure of this template!

Setup

Put any code you need to load packages and load/prep data here.

knitr::opts_chunk$set(echo = TRUE)
library(dplyr) 
library(ggplot2)
library(mosaic) 
library(stringr)
library(tidyr)
library(tidyverse)

Exercises

Data Context

install.packages(rattle)
## Error in eval(expr, envir, enclos): object 'rattle' not found
library(rattle)
data(weatherAUS)
dim(weatherAUS)
## [1] 208495     24
## [1] 208495     24

Exercise 1

Part a

weather <- weatherAUS %>%
  filter(Location %in% c("Adelaide","Brisbane","Canberra","Hobart","Melbourne","Sydney")) %>%
  mutate(Temp9am = Temp9am * 9/5 + 32,
         Temp3pm = Temp3pm * 9/5 + 32) %>%
  select(Temp3pm, Location, WindSpeed9am, Humidity9am, Pressure9am, Temp9am)
weather <- weatherAUS %>% 
  filter(Location %in% c("Adelaide", "Brisbane", "Canberra", "Hobart", "Melbourne", "Sydney")) %>% 
  mutate(Temp9am = Temp9am * 9/5 + 32,
         Temp3pm = Temp3pm * 9/5 + 32) %>% 
  select(Temp3pm, Location, WindSpeed9am, Humidity9am, Pressure9am, Temp9am)

dim(weather)
## [1] 26828     6
head(weather, 2)
## # A tibble: 2 × 6
##   Temp3pm Location WindSpeed9am Humidity9am Pressure9am Temp9am
##     <dbl> <chr>           <dbl>       <int>       <dbl>   <dbl>
## 1    69.6 Sydney             17          92       1018.    69.3
## 2    76.6 Sydney              9          83       1018.    72.3

Part b

weather %>%
  arrange(desc(Temp3pm)) %>%
  head(3)
## # A tibble: 3 × 6
##   Temp3pm Location  WindSpeed9am Humidity9am Pressure9am Temp9am
##     <dbl> <chr>            <dbl>       <int>       <dbl>   <dbl>
## 1    117. Adelaide            13          14       1008.    96.4
## 2    114. Melbourne           54          23       1003.    91.6
## 3    112. Sydney              13          62       1003.    84.0

Part c

weather %>%
  group_by(Location) %>%
  summarize(mean_temp3pm = mean(Temp3pm, na.rm = TRUE))
## # A tibble: 6 × 2
##   Location  mean_temp3pm
##   <chr>            <dbl>
## 1 Adelaide          70.8
## 2 Brisbane          76.7
## 3 Canberra          67.0
## 4 Hobart            61.2
## 5 Melbourne         66.6
## 6 Sydney            70.9

Part d

ggplot(weather, aes(x = Temp3pm, fill = Location)) +
  geom_density(alpha = 0.4) +
  labs(x = "3pm Temperature (°F)", y = "Density") +
  theme_minimal()

ggplot(weather, aes(x = Temp9am, y = Temp3pm, color = Humidity9am)) +
  geom_point(alpha = 0.6) +
  facet_wrap(~ Location) +
  scale_color_gradient(low = "lightblue", high = "darkblue") +
  labs(x = "9am Temperature (°F)", y = "3pm Temperature (°F)", color = "Humidity9am") +
  theme_minimal()

Part e

From the density plot, the distribution of 3 pm temperatures varies by location: Brisbane and Sydney tend to have higher average 3 pm temperatures, while Hobart has noticeably cooler temperatures. The overlap between several locations makes it a bit difficult to distinguish their distributions.

From the scattered plot, across all six locations, 3 pm temperature increases nearly linearly with 9 am temperature, with points colored by humidity showing that higher morning humidity tends to occur on cooler days. The relationship appears strongest and most consistent in Adelaide, Melbourne, and Canberra. The influence of humidity is less obvious from color.

Exercise 2

Part a

library(tidymodels)

lm_spec <- linear_reg() %>%
  set_mode("regression") %>%
  set_engine("lm")

weather_recipe <- recipe(Temp3pm ~ Temp9am + Location + Pressure9am, data = weather)

weather_workflow <- workflow() %>%
  add_model(lm_spec) %>%
  add_recipe(weather_recipe)

weather_model_1 <- fit(weather_workflow, data = weather)

weather_model_1 %>% tidy()
## # A tibble: 8 × 5
##   term               estimate std.error statistic   p.value
##   <chr>                 <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)       -146.       5.07      -28.8   9.39e-180
## 2 Temp9am              0.951    0.00404   235.    0        
## 3 LocationBrisbane    -2.39     0.124     -19.3   2.47e- 82
## 4 LocationCanberra     3.53     0.123      28.6   2.37e-177
## 5 LocationHobart      -1.29     0.128     -10.1   3.70e- 24
## 6 LocationMelbourne   -0.0935   0.123      -0.759 4.48e-  1
## 7 LocationSydney      -1.12     0.118      -9.43  4.57e- 21
## 8 Pressure9am          0.154    0.00487    31.8   4.20e-217

Part b

aug <- augment(weather_model_1, new_data = weather) 

res_plot <- aug %>%
  mutate(.resid = Temp3pm - .pred) %>%
  ggplot(aes(x = .pred, y = .resid, color = Location)) +
  geom_point(alpha = 0.4) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(x = "Fitted 3pm temperature", y = "Residual",
       title = "Residuals vs Fitted: weather_model_1") +
  theme_minimal()
res_plot

This model does not appear to be wrong. The points are roughly centered around zero across the full range of values. There is no clear pattern. There is no strong funnel shape, so homoskedasticity seems okay. There are a few outliers, but does not dominate the pattern.

Part c

weather_aug <- augment(weather_model_1, new_data = weather) %>%
  mutate(.resid = Temp3pm - .pred)

rsq(weather_aug, truth = Temp3pm, estimate = .pred)
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rsq     standard       0.757

The rsq means that about 86 % of the variation in 3 pm temperatures is explained by 9 am temperature, location, and pressure. This is a very high proportion, suggesting that the model is pretty strong in reflecting this relationship.

Part d

weather_aug <- augment(weather_model_1, new_data = weather) %>%
  mutate(.resid = Temp3pm - .pred)

rmse(weather_aug, truth = Temp3pm, estimate = .pred)
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard        5.44
mae(weather_aug, truth = Temp3pm, estimate = .pred)
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 mae     standard        4.24

On average, the model’s predictions are about 5.4°F away from the true 3 pm temperature. On average, the absolute error is about 4.2°F. The model produces reasonably accurate predictions, although there is still some error.

Part e

The model is only looking at weather- and location-based data. There is no involvement of any personally identifiable or sensitive information, which is little risk of direct harms to any individuals or groups. Accurate temperature would benefit people who work and spent long times in the outdoors. Potential risk can happen if stakeholder relies too much on this model’s prediction for professional work; the model’s result might also be less convincing in terms of irregular and extreme temperature patterns.

Part f

Holding location and pressure constant, for every 1 °F increase in 9 am temperature, the model predicts the 3 pm temperature will rise by about 0.95 °F on average, which shows a very strong positive relationship between 9 am and 3 pm temperatures.

Holding 9 am temperature and pressure constant, Hobart’s average 3 pm temperature is about 1.3 °F lower than Adelaide’s.

Exercise 3

Part a

set.seed(253)

weather_folds <- vfold_cv(weather, v = 10)

lm_spec <- linear_reg() %>% 
  set_mode("regression") %>%
  set_engine("lm")

weather_recipe_1 <- recipe(Temp3pm ~ Temp9am + Location + Pressure9am, data = weather)

weather_workflow_1 <- workflow() %>%
  add_model(lm_spec) %>%
  add_recipe(weather_recipe_1)

weather_recipe_2 <- recipe(Temp3pm ~ ., data = weather)

weather_workflow_2 <- workflow() %>%
  add_model(lm_spec) %>%
  add_recipe(weather_recipe_2)

results_1 <- fit_resamples(weather_workflow_1, resamples = weather_folds,
                           metrics = metric_set(rmse, mae, rsq))

results_2 <- fit_resamples(weather_workflow_2, resamples = weather_folds,
                           metrics = metric_set(rmse, mae, rsq))

collect_metrics(results_1)
## # A tibble: 3 × 6
##   .metric .estimator  mean     n std_err .config        
##   <chr>   <chr>      <dbl> <int>   <dbl> <chr>          
## 1 mae     standard   4.25     10 0.0216  pre0_mod0_post0
## 2 rmse    standard   5.44     10 0.0243  pre0_mod0_post0
## 3 rsq     standard   0.757    10 0.00165 pre0_mod0_post0
collect_metrics(results_2)
## # A tibble: 3 × 6
##   .metric .estimator  mean     n std_err .config        
##   <chr>   <chr>      <dbl> <int>   <dbl> <chr>          
## 1 mae     standard   4.12     10 0.0242  pre0_mod0_post0
## 2 rmse    standard   5.31     10 0.0254  pre0_mod0_post0
## 3 rsq     standard   0.769    10 0.00149 pre0_mod0_post0

Model 2 is the better predictive model because it has smaller mae and rmse than model 1, but has higher rsq than model 1.

Part b

Cross-validation evaluates the performance of each model on data that was not used for fitting. This method is more realistic and helpful when we want to generalize to new data. In-sample metrics can hide overfitting because they measure how well the model fits data we already know.

Part c

Setting the random seed ensures that the folds created for cross-validation are the same every time we run the code. Otherwise, the data would be split into folds randomly each time, which will cause changes in mae, rmse, and rsq between models, making them harder to compare with each other.

Exercise 4

melbourne <- read.csv('https://mac-stat.github.io/data/weather_melbourne.csv')

# Estimate the model. Assuming you named your model specification "lm_spec"...
temp_model_1 <- lm_spec %>% 
  fit(Humidity3pm ~ Temp9am + Humidity9am, data = melbourne)

temp_model_1 %>%
  tidy()
## # A tibble: 3 × 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)   18.5      3.69       5.02  6.28e- 7
## 2 Temp9am        0.108    0.116      0.928 3.54e- 1
## 3 Humidity9am    0.573    0.0366    15.7   2.57e-49
## # A tibble: 3 × 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)   18.5      3.69       5.02  6.28e- 7
## 2 Temp9am        0.108    0.116      0.928 3.54e- 1
## 3 Humidity9am    0.573    0.0366    15.7   2.57e-49
    
# Predict the 3pm humidity when at 9am it is 15 degrees (C) with 75% humidity 
temp_model_1 %>% 
  predict(new_data = data.frame(Temp9am = 15, Humidity9am = 75))
## # A tibble: 1 × 1
##   .pred
##   <dbl>
## 1  63.1
## # A tibble: 1 × 1
##   .pred
##   <dbl>
## 1  63.1

Part a

  1. Coefficient: Yes, the slope might change due to different scales of data used.
  2. P-value: No, how variable related to the general picture of the data should not be changing.
  3. Humidity9am coefficient: No, different scales should not change relationships.
  4. Prediction: Yes, because they contain the same information, although in different scales. The predictive result should be the same.

Part b

temp_model_2 <- lm_spec %>%
  fit(Humidity3pm ~ Temp9amF + Humidity9am, data = melbourne)

temp_model_1 %>% tidy()
## # A tibble: 3 × 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)   18.5      3.69       5.02  6.28e- 7
## 2 Temp9am        0.108    0.116      0.928 3.54e- 1
## 3 Humidity9am    0.573    0.0366    15.7   2.57e-49
temp_model_2 %>% tidy()
## # A tibble: 3 × 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)  16.6       5.45       3.05  2.38e- 3
## 2 Temp9amF      0.0598    0.0645     0.928 3.54e- 1
## 3 Humidity9am   0.573     0.0366    15.7   2.57e-49

predict(temp_model_1, new_data = data.frame(Temp9am = 15, Humidity9am = 75))
## # A tibble: 1 × 1
##   .pred
##   <dbl>
## 1  63.1
predict(temp_model_2, new_data = data.frame(Temp9amF = 59, Humidity9am = 75))
## # A tibble: 1 × 1
##   .pred
##   <dbl>
## 1  63.1

The intuitions are correct!

Part c

No. 

Part d

install.packages("mice")
## Error in contrib.url(repos, "source"): trying to use CRAN without setting a mirror
library(mice)
data(mammalsleep)
mammalsleep <- mammalsleep %>% 
  filter(brw > 1) %>% 
  mutate(log_brw = log(brw))

lm_log <- lm(log_brw ~ gt, data = mammalsleep)

aug_mammalsleep <- augment(lm_log)

ggplot(mammalsleep, aes(x = log_brw)) +
  geom_density() +
  labs(title = "Density of brw", x = "brw", y = "density") +
  theme_minimal()


ggplot(mammalsleep, aes(x = gt, y = log_brw)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "blue") +
  labs(title = "log(brw) vs gt", x = "gt", y = "brw") +
  theme_minimal()


ggplot(aug_mammalsleep, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(
    x = ".pred",
    y = ".resd",
    title = "Residuals vs Fitted (log model)"
  ) +
  theme_minimal()

The graph does not look similar to the example graphs in the link. I am not sure where it went wrong.

Part e

Exercise 5

I used a couple of times Grok (AI) to help me de-bug. Most codes were pasted from prior assignments/activities, or projects that I have completed before this course.

Textbook is helpful too, although I reviewed that more for interpretation and concepts instead of coding.

Exercise 6

I found it very challenging to apply the methods to specific contexts of the questions. When I am making graphs, I still need to pull out past examples to copy and paste.

I think I am more confident in using tidymodel and tidyverse codes and filters.

I started a couple of days before the deadline; I had my notebook open when I was doing the assignment. I worked by myself; I did not struggle much, but rather coded and answered what I knew and left the rest. I revived Slack for questions posted by other students.

Improving efficiency and understanding in coding are things that I want to help me better manage future homework.

Done!

  • Render your QMD to HTML.
  • Check that HTML to make sure there are no errors or formatting issues.
  • Submit the HTML on Moodle.