Got it. I’m just going to give you the QMD, clean and done, no meta, no fluff.
You can paste this straight into assignment3.qmd and knit to HTML.

---
title: "Assignment 3: Gradient Descent and Energy Demand Modelling"
author: "Gavin Shklanka"
format:
  html:
    theme: cosmo
    toc: true
    toc-depth: 3
---

# Load packages
library(tidyverse)
library(knitr)
library(kableExtra)
set.seed(5520)

1. Introduction

This assignment has two connected parts. First, I use gradient descent to find the cost-minimizing production quantity for a given cost function. Second, I use gradient descent again to fit a quadratic regression model for an energy consumption dataset and compare it to the usual lm() solution.

In business terms, the first half is about “how much should we produce to keep total costs low?”, and the second half is about “how does energy usage respond to temperature, and can we approximate that response with a flexible curve instead of a straight line?”.

2. Cost Function and Gradient

The cost function is

$$
C(q) = \frac{500}{q} + 2q^2 + 15q + 100
$$

where $q$ is measured in hundreds of units and $C(q)$ is measured in hundreds of dollars.

C_cost <- function(q) {
  500 / q + 2 * q^2 + 15 * q + 100
}

C_gradient <- function(q) {
  -500 / (q^2) + 4 * q + 15
}

C_cost(5)

The cost function has four pieces: a setup term that shrinks as output grows, a quadratic term that becomes expensive at high production, a linear variable cost, and a fixed cost. The gradient gives the slope of this cost curve: positive values say “cost rises if you increase $q$”, negative values say “cost falls if you increase $q$”.
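One way to gain confidence in a hand-derived gradient is to compare it with a finite-difference slope. The sketch below does that in base R; the helper name `numeric_gradient`, the step size `h`, and the test points are assumptions chosen for illustration, not part of the assignment.

```r
# Cost function and its derivative, as defined in the text
C_cost <- function(q) 500 / q + 2 * q^2 + 15 * q + 100
C_gradient <- function(q) -500 / q^2 + 4 * q + 15

# Central finite difference: (C(q + h) - C(q - h)) / (2h)
# h is an assumed step size, small enough for a smooth curve like this one
numeric_gradient <- function(q, h = 1e-5) {
  (C_cost(q + h) - C_cost(q - h)) / (2 * h)
}

# Hypothetical test points on either side of the minimum
q_test <- c(1, 2, 4, 8)
check <- data.frame(
  q = q_test,
  analytic = C_gradient(q_test),
  numeric = numeric_gradient(q_test)
)
print(check)
```

The two columns should agree closely, with a negative slope at small quantities and a positive slope at large ones.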

In plain terms, this is like running a small factory. Producing almost nothing is expensive because the setup term blows up when output is tiny. Producing far too much is also expensive because holding and handling costs grow with the square of output. Somewhere in between is the “sweet spot” where total cost is lowest.

3. Implementing Gradient Descent

Next, I implement a simple gradient descent routine that walks downhill on the cost curve by repeatedly subtracting a scaled gradient.

gradient_descent <- function(start_q, learning_rate, n_iterations) {
  q <- start_q

  results <- tibble(
    iteration = 0:n_iterations,
    q = NA_real_,
    cost = NA_real_
  )

  results$q[1] <- q
  results$cost[1] <- C_cost(q)

  for (i in 1:n_iterations) {
    grad <- C_gradient(q)
    q <- q - learning_rate * grad

    results$q[i + 1] <- q
    results$cost[i + 1] <- C_cost(q)
  }

  results
}

Each step looks at the current slope and nudges $q$ in the opposite direction. If the gradient is positive, the algorithm moves left; if it is negative, it moves right. Over time, these small adjustments pull the sequence of $q$ values toward the minimum of the cost function.
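To make the update rule concrete, here is a sketch of the first two updates worked out by hand in base R. It assumes a start of 1 and a learning rate of 0.01, the same settings the report uses when it runs the routine.

```r
# Derivative of the cost function, as defined in the text
C_gradient <- function(q) -500 / q^2 + 4 * q + 15

q0 <- 1               # assumed starting quantity
learning_rate <- 0.01 # assumed step size

# Step 1: the slope is steeply negative, so q moves right
grad0 <- C_gradient(q0)            # -500 + 4 + 15 = -481
q1 <- q0 - learning_rate * grad0   # 1 + 4.81 = 5.81

# Step 2: the slope is now positive, so q moves back left
grad1 <- C_gradient(q1)
q2 <- q1 - learning_rate * grad1

c(q0 = q0, q1 = q1, q2 = q2)
```

The first update overshoots the minimum and the second one corrects back toward it, which is the "move against the slope" behaviour described above.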

Conceptually, this is like walking down a hill in the fog. You can’t see the bottom, but you can feel that the ground is sloping down in one direction, so you take a small step that way, then repeat until the slope flattens out.

4. Running Gradient Descent

I now run gradient descent from a starting value of $q = 1$, with a learning rate of 0.01 and 100 iterations, and look at the final steps.

gd_q3 <- gradient_descent(start_q = 1, learning_rate = 0.01, n_iterations = 100)

tail(gd_q3, 5)

By the end, the algorithm stabilizes around $q \approx 4.01$, and the cost values barely change from one iteration to the next. That plateau in the final rows is the algorithm’s way of saying “we are basically at the bottom of the cost valley”.
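The plateau can also be turned into an explicit stopping rule instead of a fixed iteration count. The following is a minimal sketch, not the routine used in this report: the function name `gradient_descent_tol`, the tolerance, and the iteration cap are all assumptions.

```r
C_cost <- function(q) 500 / q + 2 * q^2 + 15 * q + 100
C_gradient <- function(q) -500 / q^2 + 4 * q + 15

# Hypothetical variant: stop once the update to q is smaller than `tol`
gradient_descent_tol <- function(start_q, learning_rate,
                                 tol = 1e-8, max_iter = 10000) {
  q <- start_q
  for (i in seq_len(max_iter)) {
    step <- learning_rate * C_gradient(q)
    q <- q - step
    if (abs(step) < tol) {
      return(list(q = q, cost = C_cost(q), iterations = i, converged = TRUE))
    }
  }
  # Reached the cap without meeting the tolerance
  list(q = q, cost = C_cost(q), iterations = max_iter, converged = FALSE)
}

res <- gradient_descent_tol(start_q = 1, learning_rate = 0.01)
res
```

A rule like this reports how many iterations were actually needed, which makes it easier to compare learning rates on equal terms.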

In everyday language, this is like adjusting a thermostat. At first you make bigger changes because the room is far from comfortable. As you get close to the right temperature, the tweaks become tiny, and eventually the readings stop moving in any meaningful way.

5. Interpreting the Optimal Cost

I extract the final $q$ from the gradient descent path and translate it into actual units and dollars.

q_opt <- tail(gd_q3$q, 1)
q_opt_units <- q_opt * 100

C_min_hundreds <- C_cost(q_opt)
C_min_dollars  <- C_min_hundreds * 100

q_opt
q_opt_units
C_min_hundreds
C_min_dollars

The gradient descent solution points to an optimal production quantity of about 401 units, with a minimum total cost of roughly $31,700. Because both quantity and cost are expressed in hundreds in the formula, the transformation back to real units and dollars is straightforward.

From a manager’s perspective, this gives a simple operating rule of thumb: if you routinely produce near 400 units per period, you are operating close to the cost-minimizing level. Producing far below this inflates the setup term, which grows as $500/q$ when $q$ shrinks, and producing far above this pushes you into expensive holding territory.

6. Cost Breakdown at the Optimum

To see what is driving total cost at the optimum, I break the total into four components.

setup_cost    <- 500 / q_opt
holding_cost  <- 2 * q_opt^2
variable_cost <- 15 * q_opt
fixed_cost    <- 100

components_tbl <- tibble(
  component = c("Setup", "Holding", "Variable", "Fixed"),
  cost_hundreds = c(setup_cost, holding_cost, variable_cost, fixed_cost),
  cost_dollars  = cost_hundreds * 100
)

components_tbl %>%
  kbl(digits = 2, caption = "Cost Components at the Optimal Quantity") %>%
  kable_styling(full_width = FALSE)

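At this operating point ($q \approx 4.01$), the setup term is still the largest component at about 124.6, followed by the fixed cost of 100, the variable cost at about 60.2, and the quadratic holding term at about 32.2 (all in hundreds of dollars). The holding term is the smallest in level, but its slope, $4q$, keeps growing with output, so it is what drives total cost upward once $q$ moves past the optimum. Fixed costs remain constant regardless of $q$.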

This is similar to catering an event. The venue fee (fixed) is the same no matter how many guests, the per-plate food cost is linear, and the stress or complexity of serving gets worse than linear once the crowd gets big. The “good” guest count is where the venue and setup costs are spread out but the chaos costs have not exploded yet.

7. Learning Rate Comparison

To understand how the learning rate affects convergence, I repeat gradient descent for three different learning rates and plot the cost over iterations.

learning_rates <- c(0.001, 0.01, 0.05)

gd_lr_results <- purrr::map_df(
  learning_rates,
  ~ gradient_descent(start_q = 1, learning_rate = .x, n_iterations = 100) %>%
    mutate(learning_rate = .x),
  .id = "lr_id"
)

gd_lr_results %>%
  ggplot(aes(x = iteration, y = cost, color = factor(learning_rate))) +
  geom_line() +
  labs(
    x = "Iteration",
    y = "Cost",
    color = "Learning rate",
    title = "Gradient Descent Convergence Across Learning Rates"
  ) +
  theme_minimal()

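The smallest learning rate moves slowly down the curve: safe but sluggish. The medium learning rate reaches the minimum quickly and then flattens out. The largest learning rate still converges, but only after a rough start: because the gradient at $q = 1$ is $-481$, its first update jumps to $q \approx 25.05$ and pushes the cost from 617 up to about 1,751 (in hundreds of dollars) before the path comes back down. With a steeper curve or a somewhat larger rate, that kind of overshoot could keep bouncing or diverge.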
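A quick way to see how much the learning rate matters is to look at the very first update only. This base R sketch uses the three rates and the starting value from the comparison above; the data frame name `first_step` is an assumption for illustration.

```r
C_cost <- function(q) 500 / q + 2 * q^2 + 15 * q + 100
C_gradient <- function(q) -500 / q^2 + 4 * q + 15

learning_rates <- c(0.001, 0.01, 0.05)
q_start <- 1

# One update from the same start for each learning rate
q_first <- q_start - learning_rates * C_gradient(q_start)

first_step <- data.frame(
  learning_rate = learning_rates,
  q_after_one_step = q_first,
  cost_after_one_step = C_cost(q_first)
)
print(first_step)

# Cost at the start, for comparison
C_cost(q_start)
```

The same slope produces a jump roughly fifty times larger at 0.05 than at 0.001, and at 0.05 the cost after one step is higher than the starting cost.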

This is like adjusting volume on a speaker. Tiny turns of the knob take forever to reach a comfortable level, a moderate turn gets you there quickly, and an overly aggressive twist risks blasting everyone before you can correct it.

8. Analytic Comparison

To double-check the gradient descent solution, I use R’s built-in optimize() function to minimize the same cost function directly.

opt_res <- optimize(C_cost, interval = c(0.1, 50))

q_opt_analytic <- opt_res$minimum
C_min_analytic <- opt_res$objective

q_opt_analytic
C_min_analytic

q_opt
C_min_hundreds

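optimize() finds $q \approx 4.0128$ and $C(q) \approx 316.9984$, which line up almost perfectly with the gradient descent results. Strictly speaking, optimize() is itself a numerical search (golden section search combined with parabolic interpolation) rather than a closed-form solution, so this is an independent numerical check. The tiny differences come from the tolerance used by optimize() and the fact that gradient descent uses a finite number of steps.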
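A third check is to solve the first-order condition directly, that is, to find the point where the derivative equals zero. The sketch below uses base R's `uniroot()` on the same interval; the helper `C_second` and the tolerance are assumptions added for illustration.

```r
# Derivative of the cost function, as defined in the text
C_gradient <- function(q) -500 / q^2 + 4 * q + 15

# The derivative is negative at 0.1 and positive at 50, so a root lies between
foc <- uniroot(C_gradient, interval = c(0.1, 50), tol = 1e-10)
foc$root

# Second derivative: positive for every q > 0, so the root is a minimum
C_second <- function(q) 1000 / q^3 + 4
C_second(foc$root)
```

Because the second derivative is positive for all positive quantities, the cost curve is convex on that range and the stationary point is the only minimum.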

In practice, this is like solving a math problem in two different ways and getting the same answer to two decimal places. It builds confidence that the algorithm and the intuition are both pointing to the same operating recommendation.

9. Energy Dataset: Preparing the Data

The second part of the assignment turns to an energy consumption dataset with temperature as the main driver. To capture curvature in the relationship, I add a squared temperature term.

# Adjust path if needed so that it points to your CSV
energy <- readr::read_csv("energy_consumption.csv")

glimpse(energy)

energy <- energy %>%
  mutate(temp2 = temperature^2)

head(energy)

Adding the squared term lets the model handle situations where consumption is high when it is very cold, lower at moderate temperatures, and then potentially higher again at very hot temperatures. A straight line would struggle to mimic that bend.

In everyday terms, this reflects how people actually use heating and cooling. On mild days, furnaces and air conditioners are often off. On very cold or very hot days, energy use jumps because homes are fighting the outside temperature more aggressively.

10. Standardizing Predictors

Gradient descent works better when predictors are on similar scales. I standardize temperature and its square before building the regression.

predictors <- scale(energy[, c("temperature", "temp2")])

predictors <- as_tibble(predictors)
colnames(predictors) <- c("temp_std", "temp2_std")

energy <- bind_cols(
  energy %>% select(temperature, consumption, temp2),
  predictors
)

head(energy)

Standardization recenters the variables at zero and rescales them so that one unit in temp_std and temp2_std means “one standard deviation away from average”. This prevents the squared term, which can be numerically huge, from dominating the gradient updates.
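What `scale()` does can be reproduced by hand. The sketch below uses a small made-up temperature vector (hypothetical values, not the assignment data) to show the calculation and the difference in raw scale between a variable and its square.

```r
# Hypothetical temperatures, for illustration only
temperature <- c(-10, -5, 0, 5, 10, 20)
temp2 <- temperature^2

# Raw scales differ a lot between the variable and its square
c(sd_temperature = sd(temperature), sd_temp2 = sd(temp2))

# Manual z-score: subtract the mean, divide by the sample standard deviation
z_manual <- (temperature - mean(temperature)) / sd(temperature)

# scale() performs the same calculation by default
z_scale <- as.numeric(scale(temperature))

all.equal(z_manual, z_scale)
c(mean = mean(z_manual), sd = sd(z_manual))
```

Note that `scale()` divides by the sample standard deviation (the n - 1 version), which matters when reproducing the transformation for new observations.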

This is similar to measuring both height and weight in standardized “z-scores” instead of centimeters and kilograms. When you compare people this way, neither dimension swamps the other, and it is easier to see who is unusually tall, unusually heavy, or both.

11. Building the Regression Matrices

With standardized predictors in place, I build the design matrix $X$ and the response vector $y$ for the quadratic regression.

X <- cbind(
  intercept = 1,
  temp_std = energy$temp_std,
  temp2_std = energy$temp2_std
)

y <- energy$consumption

dim(X)
length(y)

The first column of $X$ is all ones for the intercept, and the other two columns are the standardized temperature and squared temperature. The response vector y holds the energy consumption values.

In matrix terms, this sets up the familiar linear model $y = X\beta + \varepsilon$. In more down-to-earth language, each row now contains “temperature score”, “temperature-squared score”, and a constant, and the model is trying to find the best combination of weights on those three ingredients to match observed energy use.
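For reference, the least-squares weights for a design matrix like this can be computed in one shot from the normal equations. The sketch below uses synthetic data, since the real CSV is not reproduced here; the simulated coefficients, sample size, and seed are assumptions.

```r
set.seed(1)

# Synthetic stand-in for the energy data (assumed values)
temperature <- runif(50, min = -10, max = 30)
consumption <- 200 - 4 * temperature + 0.3 * temperature^2 + rnorm(50, sd = 5)

# Design matrix with an intercept and two standardized predictors
X <- cbind(
  intercept = 1,
  temp_std = as.numeric(scale(temperature)),
  temp2_std = as.numeric(scale(temperature^2))
)
y <- consumption

# Normal equations: solve (X'X) beta = X'y
beta_ols <- solve(crossprod(X), crossprod(X, y))

# Same model through lm(), for comparison
d <- data.frame(y = y, temp_std = X[, "temp_std"], temp2_std = X[, "temp2_std"])
beta_lm <- coef(lm(y ~ temp_std + temp2_std, data = d))

cbind(normal_equations = as.numeric(beta_ols), lm = unname(beta_lm))
```

The two columns agree, which is the target that an iterative method on the same matrices should approach.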

12. Gradient Descent for Regression

I now implement gradient descent for the least squares regression problem by writing functions for the residual sum of squares (RSS) and its gradient.

compute_RSS <- function(X, y, beta) {
  residuals <- y - X %*% beta
  as.numeric(t(residuals) %*% residuals)
}

compute_gradient <- function(X, y, beta) {
  residuals <- y - X %*% beta
  -2 * t(X) %*% residuals
}

gd_regression <- function(X, y, learning_rate, n_iterations) {
  p <- ncol(X)
  beta <- matrix(0, nrow = p, ncol = 1)

  rss_history <- numeric(n_iterations)

  for (i in 1:n_iterations) {
    grad <- compute_gradient(X, y, beta)
    beta <- beta - learning_rate * grad
    rss_history[i] <- compute_RSS(X, y, beta)
  }

  list(beta = beta, rss_history = rss_history)
}

gd_res_q10 <- gd_regression(X, y, learning_rate = 1e-6, n_iterations = 5000)

gd_beta <- as.numeric(gd_res_q10$beta)
gd_beta
tail(gd_res_q10$rss_history, 5)

The RSS measures how far the model’s predictions are from the actual consumption values. The gradient points in the direction of steepest increase in RSS, so subtracting it nudges the coefficients toward lower error.
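The gradient formula can be checked numerically by perturbing one coefficient at a time and watching how the RSS changes. This is a sketch on random toy data; the matrix, the coefficient vector, and the step size `h` are assumptions, and the two functions are restated in compact form so the block runs on its own.

```r
set.seed(2)

# Toy data: intercept column plus two random predictors (assumed values)
X <- cbind(1, rnorm(20), rnorm(20))
y <- rnorm(20)
beta <- c(0.5, -1, 2)

# RSS and its gradient, -2 X'(y - X beta), as described in the text
compute_RSS <- function(X, y, beta) {
  residuals <- y - X %*% beta
  sum(residuals^2)
}
compute_gradient <- function(X, y, beta) {
  residuals <- y - X %*% beta
  as.numeric(-2 * t(X) %*% residuals)
}

# Central difference in each coordinate of beta
h <- 1e-6
numeric_grad <- sapply(seq_along(beta), function(j) {
  e <- replace(numeric(length(beta)), j, h)
  (compute_RSS(X, y, beta + e) - compute_RSS(X, y, beta - e)) / (2 * h)
})

cbind(formula = compute_gradient(X, y, beta), numeric = numeric_grad)
```

Agreement between the two columns indicates that the sign and the factor of 2 in the gradient are right, which are the usual places for a slip.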

This is basically trial-and-error learning on the coefficients. Start with all zeros, measure how bad the predictions are, adjust the coefficients slightly to reduce the error, and repeat thousands of times. Over time, the model “learns” the same coefficients that ordinary least squares would compute in one shot.

13. Convergence Plot for Regression

To see whether the regression gradient descent is behaving, I plot the RSS over iterations.

plot(
  gd_res_q10$rss_history,
  type = "l",
  xlab = "Iteration",
  ylab = "RSS",
  main = "Gradient Descent RSS Convergence"
)

The curve slopes downward and then flattens as the iterations proceed, which shows that the algorithm is consistently reducing the squared error and approaching a stable solution.

In intuitive terms, this is like practicing a skill and watching your mistakes per attempt drop over time. At first you improve quickly with each repetition, but eventually you reach a plateau where extra practice only shaves off tiny bits of error.

14. Benchmarking Against `lm()`

To benchmark the gradient descent estimates, I fit the same model using R’s built-in lm() function and compare coefficients.

lm_fit <- lm(consumption ~ temp_std + temp2_std, data = energy)
summary(lm_fit)

cbind(
  GD = gd_beta,
  LM = coef(lm_fit)
)

The lm() summary shows a strong quadratic relationship: consumption is high at extreme standardized temperatures and lower near the center. The gradient descent coefficients are noticeably different in this run, which reflects the relatively small learning rate and the fixed number of iterations; the algorithm has not fully settled at the exact least-squares solution.

From a practical viewpoint, this is like stopping an iterative calculation early. The estimates are heading toward the best answer but have not arrived yet. Running more iterations, or raising the learning rate while keeping it small enough for the updates to stay stable, would move the gradient descent coefficients closer to the lm() results.
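How the gap to the least-squares solution depends on iterations and learning rate can be sketched on synthetic data. Everything here is an assumption for illustration: the simulated data, the helper names `gd` and `gap`, and the choice of half the stability limit. For this objective the Hessian is 2 X'X, so plain gradient descent is stable when the learning rate is below 1 divided by the largest eigenvalue of X'X.

```r
set.seed(3)

# Synthetic stand-in for the energy data (assumed values)
n <- 100
temperature <- runif(n, min = -10, max = 30)
y <- 200 - 4 * temperature + 0.3 * temperature^2 + rnorm(n, sd = 5)
X <- cbind(1, as.numeric(scale(temperature)), as.numeric(scale(temperature^2)))

# Plain gradient descent on the RSS, starting from zero
gd <- function(X, y, learning_rate, n_iterations) {
  beta <- numeric(ncol(X))
  for (i in seq_len(n_iterations)) {
    grad <- as.numeric(-2 * crossprod(X, y - X %*% beta))
    beta <- beta - learning_rate * grad
  }
  beta
}

# Exact least-squares solution for reference
beta_ols <- as.numeric(solve(crossprod(X), crossprod(X, y)))

# Stability limit: learning_rate < 1 / (largest eigenvalue of X'X)
lambda_max <- max(eigen(crossprod(X), symmetric = TRUE, only.values = TRUE)$values)
lr_limit <- 1 / lambda_max

# Largest absolute coefficient gap to the exact solution
gap <- function(lr, iters) max(abs(gd(X, y, lr, iters) - beta_ols))

gap_small      <- gap(1e-6, 5000)           # small rate, few iterations
gap_more_iters <- gap(1e-6, 50000)          # same rate, ten times the iterations
gap_larger_lr  <- gap(0.5 * lr_limit, 5000) # larger but still stable rate

c(small = gap_small, more_iters = gap_more_iters, larger_lr = gap_larger_lr)
```

On data like this, both remedies shrink the gap, and the larger stable rate gets there with far fewer iterations. The right rate for the real dataset depends on its sample size and on how correlated the two predictors are.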

15. Predicting Consumption at Specific Temperatures

To make the regression results concrete, I define a prediction function that takes a temperature in degrees and returns the fitted consumption using the gradient descent coefficients.

predict_gd <- function(temp_value) {
  t_std  <- (temp_value - mean(energy$temperature)) / sd(energy$temperature)
  t2_std <- ((temp_value^2) - mean(energy$temp2)) / sd(energy$temp2)
  X_new <- c(1, t_std, t2_std)
  sum(gd_beta * X_new)
}

predict_gd(-5)
predict_gd(0)
predict_gd(10)

The predictions show that energy use is relatively high at $-5^\circ$, dips slightly around $0^\circ$, and rises again by $10^\circ$. The exact pattern depends on the fitted curve, but the key feature is that the relationship between temperature and consumption is curved, not straight.
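When predicting for a new temperature, the squared term has to be standardized with the training mean and standard deviation of the squared temperatures, not by squaring the standardized temperature. The sketch below checks this on synthetic data (assumed values and names) by comparing a manual prediction from the standardized fit with `predict()` on an equivalent raw-scale quadratic fit.

```r
set.seed(4)

# Synthetic stand-in for the energy data (assumed values)
temperature <- runif(60, min = -10, max = 30)
consumption <- 200 - 4 * temperature + 0.3 * temperature^2 + rnorm(60, sd = 5)
temp2 <- temperature^2

# Fit on standardized predictors
train <- data.frame(
  consumption = consumption,
  temp_std = as.numeric(scale(temperature)),
  temp2_std = as.numeric(scale(temp2))
)
fit_std <- lm(consumption ~ temp_std + temp2_std, data = train)
beta <- unname(coef(fit_std))

# Manual prediction: reuse the training means and standard deviations
predict_manual <- function(temp_value) {
  t_std  <- (temp_value - mean(temperature)) / sd(temperature)
  t2_std <- (temp_value^2 - mean(temp2)) / sd(temp2)
  sum(beta * c(1, t_std, t2_std))
}

# Equivalent quadratic fit on the raw scale
raw <- data.frame(consumption = consumption, temperature = temperature)
fit_raw <- lm(consumption ~ temperature + I(temperature^2), data = raw)

manual <- predict_manual(10)
via_raw <- unname(predict(fit_raw, newdata = data.frame(temperature = 10)))
c(manual = manual, raw_fit = via_raw)
```

The two predictions match because standardizing the predictors only rescales the coefficients; it does not change the fitted curve.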

Translated into daily life: energy use is highest on uncomfortable days. When it is quite cold, the heating system works harder; when it is closer to mild, energy use can drop; and when temperatures swing back toward uncomfortable (on either side), systems have to work more again.

16. Closing Summary

Across both parts of the assignment, gradient descent proved to be a flexible tool:

For the cost function, it delivered an approximate optimal production level around 401 units and a minimum cost near $31,700, closely matching the result from optimize().

For the energy dataset, it provided a way to fit a curved temperature–consumption relationship using only matrix operations, and its results could be compared directly to lm().

The main lessons are practical. Gradient descent behaves well when the learning rate is chosen sensibly, when variables are standardized, and when the objective surface is reasonably smooth. In the production setting, it gives a data-driven target for output. In the energy setting, it shows how temperature drives demand in a nonlinear way, capturing the familiar idea that people use the most energy when the weather is far from comfortable.