Replace the fields in lines 4-6 with your information.
run the chunk below with the library() commands, to
load the required packages. At this point in the semester you will
start to be expected to understand to do this in markdown files without
needing to be told.
For each question, please place your code in the existing code chunks or create other code chunks as needed.
For each question that asks you to calculate a value, after the code chunk write (not in the code chunk, but below so that it will be considered text): “The value of BLANK is X”, replacing BLANK and X as needed.
To add text answers to a question, simply type the text below the code chunk.
For graphs, it is expected that you will change and add the appropriate title and labels.
When you upload your pdf to Gradescope, please indicate in Gradescope the page(s) for the answers to each exercise
The work and words that you submit must be your own.
library("tidyverse")
library("knitr")
library("dplyr")
library("pander")
library("readr")
Case Study Background
To operate efficiently, electric power companies (such as Duquesne Light in the Pittsburgh region) must be able to predict the peak power load at their various stations. Peak power load is the maximum amount of power that must be generated each day to meet demand.
A power company wants to use daily high temperature, X, to model daily peak power load, Y, during the summer months when demand for power is greatest due to air conditioner use. Although the company expects peak load to increase as the temperature increases, the rate of increase in Y might not remain constant as X increases. For example, a 1-degree increase in high temperature from 100 F to 101 F might result in a larger increase in power demand than would a 1-degree increase from 80 F to 81 F.
A random sample of 25 summer days is selected and the peak load (measured in megawatts) and high temperature (in degrees F) is recorded for each day.
For this lab, the data is available in your Rstudio Files space as a
csv (comma separated values) file. The following command loads the data
and names it colvard; run the following chunk:
powerload <- readr::read_csv("powerload.csv")
When you load the data, R indicates that it interprets both the variables as “dbl” (meaning double precision, i.e., quantitative).
In the data, the variables are:
temp: The daily high temp (in degrees Fahrenheit);
load: The daily peak power load (in megawatts)
In a new code chunk, create a scatterplot to visually explore the relationship between temp and load.
Hints:
Recall the plot() command with the construction of the form Y~X and also specifying the data in the command;
Note which variable we are interested in being Y and which one is X, based on the story above;
Note the variable names are lowercase in the data.
plot(load ~ temp,
data = powerload,
xlab = "daily high temp (F)",
ylab = "daily peak power load",
main = "Power demand by temp")
Describe the relationship (e.g. direction, relative strength by eye, and whether it looks linear). Based on the graphical exploration, does the linearity assumption of the simple linear regression model seem valid? [It may not be so obvious just on the scatterplot; later in the lab you will see for sure whether a linear model is appropriate or not.]
The relationship is positive and relatively strong, but possibly non linear
Now, let’s see why a linear fit may not be best for these data.
In a new code chunk, produce a linear model for predicting
load from temp, and name the resulting model
powerload_linmod (for powerload linear model).
Hints:
Recall the lm() command for making linear models, with the construction of the form Y~X and also specifying the data in the command;
Remember the back-arrow for assignment.
powerload_linmod <- lm(load ~ temp,
data = powerload)
Now, in a new code chunk, add the regression line from
powerload_linmod to the scatterplot, by executing abline()
on powerload_linmod [without the tickmarks] immediately
after and run simultaneous with the plot() command making the
scatterplot [like you did on #2c in lab 3].
plot(load ~ temp,
data = powerload,
xlab = "daily high temp ",
ylab = "daily peak power load",
main = "Power demand by temp")
abline(powerload_linmod)
The code chunk below makes four possible residual plot shapes.
Run the chunk, then read and answer the questions that follow.
###########################
# DO NOT EDIT THIS CHUNK
###########################
set.seed(203)
## make it random by adding a line, then adding errors with a pattern
x.base.a <- seq(5, 27, by=0.3)
## part A
error.a <- runif(n = 74, min = -.5, max = .5)
y.a <- runif(n = 74, min = -0.75, max = 0.75)
## part C
## to make the happy face do x^2 then lower the value
x.base.b <- seq(0, 6, length.out = 74)
error.b <- runif(n = 74, min = -1, max = 1)
y.b <- abs(x.base.b - 3)^2 - 3.75 + error.b
## part B --> just part C reversed
## Part D --> use the y.a values for linear
## add more error
y.d <- 2*x.base.a + - 28 + runif(n = 74, min = -5, max = 5)
par(mfrow=c(2, 2))
plot(x.base.a, y.a,
xlab = "Fitted Values", ylab = "Residuals", main = "A",
xaxt="n", yaxt="n", ylim = c(-2, 2))
abline(a = 0, b = 0)
plot(x.base.b, -y.b,
xlab = "Fitted Values", ylab = "Residuals", main = "B",
xaxt="n", yaxt="n", ylim = c(-7, 7))
abline(a = 0, b = 0)
plot(x.base.a, y.d,
xlab = "Fitted Values", ylab = "Residuals", main = "C",
xaxt="n", yaxt="n")
abline(a = 0, b = 0)
plot(x.base.b, y.b,
xlab = "Fitted Values", ylab = "Residuals", main = "D",
xaxt="n", yaxt="n", ylim = c(-7, 7))
abline(a = 0, b = 0)
Look back at the scatterplot of the data in 1a. If we made the residual plot from the linear model that you fit in 2b, which of the above possible residual plot shapes (A,B,C, or D) would it resemble? [Hint, consider the concavity of the scatterplot in 1a.]
What feature of this residual plot shape would indicate that the simple linear model in 2b was not justified?
It would resemble D the most because of the convexity of the curve
Now, verify your hunch in 2c by obtaining the residual plot from the linear model that you had fit in 2b.
[Recall, you can obtain the residual plot from a linear model by
executing the plot() command on the name of the model and adding the
argument which=1 (without the tickmarks), after a comma in
the plot command.]
plot(powerload_linmod, which=1)
Since the relationship seems to not be best modeled as linear, a standard remedy is to try some transformations on the variable(s).
To get an indication of what sorts of transformations we might want to try, let’s look at the distributions of the separate variables. In a new code chunk, produce histograms of temp and of load.
[Hints:
Recall the hist() function;
recal the syntax of the histogram function works on the
“extracted” variable by calling the variable from the datraframe using
the dollarsign operator in a form like
dataset$variable;
be sure to give your histograms some meaningful titles and x-axis labels.
hist(powerload$temp,
xlab="daily high temperature",
main="histogram of temp")
hist(powerload$load,
xlab="power load daily peak",
main="histogram of power load")
Which variable has stronger skewness? Which way is it skewed?
The power load, and it is skewed right
Possible “linearizing” transformations to try might be a transformation that makes one or both variables less skewed (although this isn’t always the best or only choice for linearizing transformations).
For now, let’s try a log tranformation on load – i.e.,
let’s try fitting a linear model between log(load) and temp, as
follows:
Before taking a log of a variable, we need to make sure the minimum value of the variable is greater than 1, because otherwise taking a log could give some negatives (if taking log of values between 0 and 1) or it could give negative infinity (if taking log of 0) or it could give undefined (if attempting to take log of values less than zero).
In a new code chunk, check the minimum of load (hint,
use the min() function on the variable load extracted from
the powerload dataframe with the dollarsign).
min(powerload$load)
## [1] 92.5
Since the min value of load is greater than 1 we can
safely proceed to try a linear model between log(load) and temp.
To do that, we first need to produce the log of the load variable.
The proper way to transform a variable in order to use it in later functions and keep track of it is to add a new column (i.e., a new variable) to the dataframe.
We can name the new variable anything we like; for the purpose of the
lab, let’s call the column log.load.
To create this new column and attach it to the dataframe and fill it with the log of the load values, run the following code chunk [explanation follows]:
powerload$log.load <- log(powerload$load)
Explanation of the previous chunk:
On the right side of that command, the dollar sign operator is being
used to extract the variable load from the (pre existing)
powerload dataframe (and we are then taking the log of
load); on the left of the command, the dollar sign operator
is being used to append a new variable (i.e., a new column vector)
called log.load to the powerload dataframe, and the
assignment operator <- is then assigning the log(load)
values to that new variable name.
[Essentially, the left side creates a new (blank) column in the dataframe, the right side creates values, and the assignment operator fills the new column with the values.]
The dollar sign operator can be used like this to create a new variable in an existing dataframe. When creating the name of the new field, make sure that the name does not conflict with any existing object name. Choose reasonably descriptive object names to avoid accidentally using the name of another existing variable or function in R. Usual naming conventions in R apply (no spaces; don’t use just numbers; avoid using single letters; check that you’re not using the name of a pre-existing object or R function.)
After running the chunk, a new variable was added to your
powerload object in your environment. If you now click on
powerload in your environment pane, you will see the new
column.
It is also good to know how to remove a variable from a dataframe. To do that, we assign the value NULL to the variable. For example, the following [but without the tickmarks] is how you would remove the column you just created (don’t actually run it):
powerload$log.load <- NULL
Having now created a transformed variable log.load, let’s check the
linearity assumption. Make a scatterplot (plot()) of
log.load (on Y) vs temp (on X).
[Hints for the plot command, recall the twiddle notation, and recall you need to specify the dataframe with “data=”].
plot(log.load ~ temp,
data = powerload)
You should see that although the plot of log.load vs temp looks more linear than load vs. temp, it’s still a bit concave.
Let’s try fitting the linear model of log.load predicted from temp
and then we’ll check the other diagnostics. We will call the model
logload_mod [without the tickmarks]. Copy the following
into a new code chunk and run it:
logload_mod <- lm(log.load ~ temp,
data = powerload)
Then, produce the two residual diagnostic plots [residual plot, and
qq plot of the residuals] from the logload_mod model
[hints, recall you can execute plot() on the model name, and then add
the extra plot argument “which=1” for the residual plot, or “which=2”
for the qq plot of the residuals]
plot(logload_mod, which=1)
plot(logload_mod, which=2)
Comment on the residual diagnostics from the linear model between log(load) and temp. [Are there any concerns? Which diagnostics are concerning, and why?]
There are concerns, as the U shaped plot does not represent randomness
In a new code chunk, produce the summary() of the linear model between log(load) and temp.
summary(logload_mod)
##
## Call:
## lm(formula = log.load ~ temp, data = powerload)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.17145 -0.04350 -0.02121 0.05912 0.17739
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.396990 0.134174 25.32 < 2e-16 ***
## temp 0.015869 0.001521 10.43 3.42e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.08841 on 23 degrees of freedom
## Multiple R-squared: 0.8256, Adjusted R-squared: 0.818
## F-statistic: 108.8 on 1 and 23 DF, p-value: 3.416e-10
Say (in text) what percent of the variability in
log(load) is explained by the linear relationship with
temp. [Hint, this is a value in the summary() you made in
3e(i).]
82.56%
Give a contextual interpretation (in terms of the variables in the story) of the value of the beta1 estimate (i.e., of the estimate of the slope of the linear relationship between log(load) and temp). [Hint, be careful to interpret in terms of the transformed variable.]
the beta1 estimate is 0.015869
for each degree increase, the logload increases by 0.0159
To produce an improved model, we could consider also transforming
temp. This is an option. On the other hand, more
transformations make interpretation more complicated, and also can make
the estimates increasingly untrustworthy (“unstable”) if they shrink the
spread of the variable (as transformations like a root or a log do).
You are seeing some of the issues involved in model building: Model diagnostics are rarely perfect, and there is a balancing act between improving the diagnostics on the one hand, versus keeping the model relatively simple and stable on the other hand.
Nevertheless, linearizing transformations are generally the recommendation as opposed to ‘higher order’ models (like quadratics, cubics, etc). The higher the order of the model, the more risk there is of overfitting your sample. [This is partly reflected in the fact that linear models also only have two beta parameters whereas higher order models have more than two beta parameters, and needing to estimate more parameters can make the estimates less trustworthy / more unstable from sample to sample.]
But for the time being, let’s at least see how to fit a higher order model.
One higher order model is the quadratic model, which proposes that the population values are given by the following relationship (to see the math font, click or mouse over, or view the knitted version):
\[y_{i} = \beta_{0} + \beta_{1}x + \beta_{2}x^{2} + \epsilon_{i}\]
In the quadratic model, we propose that the population trend is a general quadratic function in X, rather than a linear function in X.
[Note that the general quadratic in X expands to three terms, namely a constant, an X term, and an X squared term.]
The quadratic regression model has the same error assumptions as the linear regression model.
To fit a model that is quadratic in temp, we will need
not only the temp values but also their square.
Therefore, we will add a new variable to the dataset called
temp_sq [without the tickmarks], which will contain the
squared temp values, (i.e., it will contain temp^2).
To do this, run the following chunk: [explanation follows]
powerload$temp_sq <- (powerload$temp)^2
On the right of that command, the dollar sign operator is being used
to extract the variable temp from the (pre existing)
powerload dataframe (and we are then squaring temp); on the
left, the dollar sign operator is being used to append a new variable
(vector) name called temp_sq to the powerload dataframe,
and the assignment operator <- is then assigning the
temp^2 values to that new variable (vector) name.
[Essentially, the left side creates a new (blank) column in the dataframe, the right side creates values, and the assignment operator fills the new column with the values.]
REMARK
In the above code on the right side, note we used parentheses before
squaring. This makes sense – we want to extract the temp
variable, then square it. Unfortunately, R would do the same
thing (and this is bad because it looks wrong mathematically) if we had
executed:
powerload$temp^2
That looks wrong, because by order of operations it looks like it
should first square temp and then try to extract the
squared values from the powerload dataframe. Unfortunately in this
context, R wouldn’t follow apparent mathematical order of operations!
So, always put the parentheses to avoid confusion.
Now, to fit the quadratic model for predicting load from
temp quadratically, we use the same lm()
command but we add another term corresponding to the
temp_sq values. (R still uses the “linear model”
lm() command even though we’re going to now use it make a
non-linear quadratic model, because internally it’s a simple
generalization to add another term).
The way we do this in the lm() command is with syntax that has the general form:
Y twiddle X plus X^2
(i.e., we will specify that the Y be modeled in terms of an equation that has a sum of X and X^2 terms).
Let’s call the resulting model load_quadmod (since it’s
a quadratic model for the load).
To fit this model, copy the following into a new code chunk and run it:
load_quadmod <- lm(load ~ temp + temp_sq,
data = powerload)
Notice that in order for the above lm() syntax to work correctly, we needed to have first appended the new vector temp_sq to the dataframe.
[You might wonder why we need to go to all the trouble of creating
the new variable temp_sq to then use in the lm(), and why
we can’t simply add a term “temp^2” inside the lm() instead. The reason
is that in an lm(), power notation does not have its
mathematical meaning (it means something about what are called
interactions, which you will learn about in lecture).]
After you run the quadratic model, we will inspect the diagnostics for the quadratic model.
Let’s now inspect the scatterplot to validate the quadratic assumption for the model.
Adding the quadratic curve to the scatterplot is tricky, so we’ve
done it for you below. After having fit the load_quadmod
model in the previous question, copy the following into a code chunk and
run it (note, you have to run the lines() command
simultaneously with the plot() command):
plot(powerload$temp, powerload$load,
xlab = "Temp",
ylab = "Load",
main = "Temp vs. Load with Quad fitted line")
order_temp <- order(powerload$temp)
lines(powerload$temp[order_temp],
load_quadmod$fitted.values[order_temp])
[The previous syntax is more than you are expected to duplicate in
the course, but in case you’re curious about how the previous syntax
works: In the previous plot() command we used an alternate
plot() syntax of the form X comma Y along with extracting
the variables from the dataframe with the dollar sign, instead of the Y
twiddle X syntax and specifying data=. The
plot() command accepts either method although it’s easy to
confuse the two! Then, the lines argument adds a line in a
similar manner to abline, and it takes that same syntax of
the form X comma Y and extracting the variables from the dataframe with
the dollar sign. We needed to sort them to ensure that the correct Y
went with the correct X, so to do that we created an intermediate object
containing the order of the temp values.]
Now, let’s inspect the residual diagnostics for the quadratic model.
Produce the two residual diagnostic plots [residual plot, and qq plot
of the residuals] from the load_quadmod model [hints,
recall you can execute plot() on the model name, and then
add the extra plot argument “which=1” for the residual plot, or
“which=2” for the qq plot of the residuals]
plot(load_quadmod, which=1)
plot(load_quadmod, which=2)
Comment on the residual diagnostics from the quadratic model (are they reasonably valid?)
the residual diagnostics from the model look good and are reasonably valid
One of the downsides of the higher order model (compared to a linear fit to transformed variables) is that higher order models have more beta parameters to estimate.
How many beta parameters are there in the quadratic model?
There are 3
In a new code chunk, produce summary() of the
load_quadmod model.
Then, in text below, report the values of the beta estimates.
summary(load_quadmod)
##
## Call:
## lm(formula = load ~ temp + temp_sq, data = powerload)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.974 -2.832 1.315 4.014 7.485
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 543.772646 54.530278 9.972 1.27e-09 ***
## temp -12.199390 1.283925 -9.502 3.03e-09 ***
## temp_sq 0.082924 0.007461 11.115 1.71e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.314 on 22 degrees of freedom
## Multiple R-squared: 0.9657, Adjusted R-squared: 0.9626
## F-statistic: 309.5 on 2 and 22 DF, p-value: < 2.2e-16
543.8 -12.2 0.83
Say (in text) what percent of the variability in load is
explained by the quadratic relationship with temp. [Hint,
this is a value in the summary() of the quadratic model.]
96.6%
Consider your answer in 5a and in 3e(ii). In terms of comparing R^2 values, which model would you choose as “better,” the model of log(load) linearly with temp, or the model of load quadratically with temp? Give the values you’re comparing, and say which one is “better” (in the R^2 sense).
the quadratic model is better as 96% is higher 82%
REMARKS:
You are seeing some of the issues involved in model building: Model diagnostics are rarely perfect, and there are competing demands between improving the diagnostics on the one hand, versus keeping the model relatively simple and stable on the other hand.
If needed, linearizing transformations are generally recommended over higher order models. The higher order the model, the more risk there is of overfitting your sample. Additionally, linear models only have two beta parameters whereas higher order models have more than two beta parameters (and needing to estimate more parameters can make the estimates less trustworthy / more unstable from sample to sample).
But note that even with transformations, it is sometimes better to have fewer transformations, in order to keep the interpretations simpler and keep the estimates more trustworthy / more stable from sample to sample, even if it means giving up some “perfection” of the diagnostics.
Model building is a complicated art! No matter what you do, give justifications for your choices, either for the use of transformations or for not using them (even when they might have improved your model diagnostics).
Instructions for uploading:
When you upload your pdf to Gradescope, please indicate in Gradescope the page(s) for the answers to each exercise
[End of lab]