By the end of this project you will:
All questions for this project will be answered in the Project 2 Gradescope assignment. Read through this guidance to find the relevant material and R commands, as well as the required tasks, but note that the only deliverables for this project are to answer the questions in the Gradescope assignment and upload a copy of your R Script at the end of the Gradescope assignment.
This assignment is individual effort. You may reference any published resource (in print or online) and receive help from any individual; however, the work you turn in must be an accurate representation of your knowledge and understanding of the problem. It is NEVER acceptable to copy any portion of another’s work and submit it as your own. Here are a few blatant examples of copying:
Helping your classmates understand the concepts is encouraged, but only provide assistance up to your depth of understanding. If you have to look at your solution while giving help, you are most likely beyond your depth of understanding. Do not give your solution to another cadet in any form (hard copy, soft copy, or verbal).
“The C-130 Hercules primarily performs the tactical portion of the airlift mission. The aircraft is capable of operating from rough, dirt strips and is the primary transport for airdropping troops and equipment into hostile areas.” (Link: C-130 Hercules - Fact Sheet.) This project focuses on the airland mission of the C-130. In particular, you will explore how the variables \(Weight\) and \(Temperature\) affect the takeoff \(Distance\) (the length of runway required for an aircraft to become airborne) for a C-130J Super Hercules.
C-130 pilots are intimately familiar with Takeoff and Landing Data (commonly referred to as TOLD) because many missions require operating in austere environments involving dirt landing zones and short runways. Often, this limits the amount of cargo that can be moved in and out of certain locations, since aircraft gross weight is a significant factor in takeoff distance calculations. A common question that ground users and tactical and strategic planners ask C-130 crews is: “How much can you take?” Unfortunately, without access to the Mission Computer inside a C-130J cockpit, computing takeoff distance at various weights is difficult to accomplish with a high degree of accuracy. The aircraft performance manual offers charts (see example below) that are good for rough estimates (within a hundred feet at best), but perhaps we can use our Math 141 expertise to find a better way to estimate takeoff distance.
Although there are many factors that influence takeoff distance, we will be exploring the impacts of just external Temperature and aircraft Weight in this project. We will expand on our understanding of projections to find an appropriate multivariable model for a given set of data.
The dataset we will use in the project has three recorded variables: Temperature (in degrees Celsius), Weight (in thousands of pounds), and Distance (the distance in feet it takes for a C-130 to take off at a given temperature and weight).
Before we proceed, load the required packages. Remember that in order to use a package, we need to install AND load the package. The packages necessary for this project are already installed on PositCloud; there is no need to install them there.
You will need to load the required packages by executing the commands below. Note: you may have to re-run these commands each time you restart your R session.
library(rgl)
library(plotly)
library(tidyverse)
The data you will use for this project is not pre-installed in
R, so you need to upload the provided dataset into PositCloud.
To load the dataset into RStudio on PositCloud, follow the steps below:
Step 0: Download the C130TOLD.csv file to your
computer. To find the file, open the
Block 2 Resources folder and then the Project 2
folder in the Math 141 Fall 2025 Teams channel. Note: Do NOT
change the file type or name; leave it as a .csv file!
Step 1: In RStudio on PositCloud, in the Files tab in
the bottom right, click on the file folder called “data”.
Step 2: Click the Upload Files button as depicted below.
Steps 3 & 4: Click Choose File and choose the C130TOLD.csv file wherever you saved it. Click OK.
The C130TOLD.csv file should now be uploaded to RStudio on PositCloud; however, you still need to load the dataset to use it. If you followed the import directions exactly, you should not have to change the file path. Copy the following commands into your R Script and Run:
setwd("/cloud/project/")
C130TOLD = read_csv('data/C130TOLD.csv')
Note: you will have to re-run the above
read_csv command to define the data C130TOLD
if you clear your Environment.
ALWAYS view your data after loading. Not only will this confirm whether it was imported and loaded properly, but it also allows you to get familiar with the data. Use either of the commands below to preview your data. Copy the following commands into your R Script and Run:
head(C130TOLD)
View(C130TOLD)
Get familiar with your dataset by using plot_ly().
Examine how the input variables \(Temperature\) and \(Weight\) appear to influence the output
variable \(Distance\). Copy the
following commands into your R Script and Run:
plot_ly(C130TOLD, x=~Temperature, y=~Weight, z=~Distance) %>% add_markers(size=12) %>%
layout(scene=list(xaxis=list(title='Temp'),
yaxis=list(title='Wt'),
zaxis=list(title='Dist')))
For this project, all models will be in the linear form. The first
function we will use to model this data is in the linear form with a
single-variable input, Temperature. Model equation form: \(Distance = m*Temperature+b\). In Block 1,
we learned how to use fitModel() to find the best-fit
parameters. For example, copy the following commands into your
R Script and Run:
linearModel = fitModel(Distance~m*Temperature+b, data=C130TOLD)
coef(linearModel)
Your results should show that the model equation with the best-fit
parameters is:
\(Distance = 42.58*Temperature + 2962.17\).
But how does fitModel() work? In Block 2, we learned
that fitModel() involves building a projection of
the vector of Distance values onto the space of all linear
combinations of \(\bar{1}_n\) (a vector
of 1s) and \(Temperature\) (a vector of
x values). Recall that we want a slope \(m\) and an intercept \(b\) that best fit these data points.
To accomplish vector projection, we need to store each variable from our data table in R as separate vectors to simplify calculations. Copy the following commands into your R Script and Run:
Temperature=C130TOLD$Temperature
Weight=C130TOLD$Weight
Distance=C130TOLD$Distance
Note: you will have to re-run the above commands to
define your variable vectors if you clear your Environment.
We will also need an intercept vector to multiply with the intercept
scalar. The rep() command stands for “replicate”, so it is
replicating the value of 1 fifty times to make a vector of ones. Copy
the following command into your R Script and Run:
ones=rep(1,50)
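As a small variation on the command above, the length of the ones vector can be taken from the data rather than typed by hand, which avoids a mismatch if a dataset ever has a different number of rows. This is only a sketch: the vector toyDistance is made up for illustration and is not the C130TOLD data.

```r
# Hypothetical stand-in for the Distance vector (made-up values, not the C130TOLD data)
toyDistance = c(2500, 2800, 3100, 3500, 4000)

# Build a vector of ones with the same length as the data vector
toyOnes = rep(1, length(toyDistance))
toyOnes

# Check that the two vectors have matching lengths
length(toyOnes) == length(toyDistance)
```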
Use the project() command to find the projection of
Distance onto the space of all linear combinations of \(Temperature\) and \(\bar{1}_n\). We can do this using two
methods:
The first method involves projecting the Distance vector
onto the combination of Temperature, the vector of \(Temperature\) values, and a vector of ones,
\(\bar{1}_n\). Copy the following
command into your R Script and Run:
project(Distance ~ Temperature + 1)
The second method involves projecting the Distance
vector onto a “model matrix” which we refer to as M. This
model matrix is constructed by setting the \(Temperature\) and \(\bar{1}_n\) vectors side-by-side. Copy the
following commands into your R Script and Run:
M = matrix(c(Temperature, ones), nrow=50, ncol=2)
head(M)
Now, we can project the Distance vector onto this model
matrix. Copy the following command into your R Script and
Run:
project(Distance ~ M)
The model matrix method requires keeping track of which column of the matrix corresponds to which term of the model.
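One possible way to make that bookkeeping easier is to build the matrix with base R's cbind(), which sets vectors side-by-side and can label each column. The sketch below uses made-up vectors (toyTemperature, toyOnes) rather than the C130TOLD data, and only shows that the labeled matrix has the same entries as one built with matrix().

```r
# Made-up values for illustration only (not the C130TOLD data)
toyTemperature = c(-10, 0, 10, 20, 30)
toyOnes = rep(1, length(toyTemperature))

# cbind() sets the vectors side-by-side and keeps the given names as column labels
toyM = cbind(Temperature = toyTemperature, ones = toyOnes)
toyM

# The column labels show which column belongs to which term of the model
colnames(toyM)

# Same entries as the matrix() construction; only the labels differ
all(toyM == matrix(c(toyTemperature, toyOnes), nrow=5, ncol=2))
```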
Note that both methods return the same intercept
and slope, which also match the intercept and slope produced by the
fitModel command.
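To see why different routes lead to the same parameters, the projection can be sketched by hand in base R. This is an illustration of the idea only, not a description of how fitModel() or project() are implemented internally, and the toy vectors below are made up (they are not the C130TOLD data). The sketch solves the normal equations \(M^T M p = M^T y\) and compares the result with base R's lm().

```r
# Made-up values for illustration only (not the C130TOLD data)
toyTemperature = c(-10, 0, 10, 20, 30)
toyDistance    = c(2550, 2950, 3400, 3800, 4250)
toyOnes        = rep(1, length(toyTemperature))

# Model matrix: Temperature column next to the column of ones
toyM = cbind(toyTemperature, toyOnes)

# Solve the normal equations (M^T M) p = M^T y for p = (slope, intercept)
params = solve(t(toyM) %*% toyM, t(toyM) %*% toyDistance)
params

# The projection is M %*% params; the residual is orthogonal to each column of M
projection = toyM %*% params
residual   = toyDistance - projection
t(toyM) %*% residual      # both entries are 0 up to rounding error

# Base R's lm() returns the same intercept and slope
coef(lm(toyDistance ~ toyTemperature))
```

The orthogonality check is the key step: the residual has no component left in the directions of the Temperature and ones vectors, which is what makes the result a projection.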
Is this a “good” model? One way to explore whether this is an
adequate model is to examine the residuals, the difference between the
observed Distance values and the associated predicted
values from the model, and the Root-Mean Squared Error (RMSE).
Find the residual vector and use it to compute and report the RMSE. Copy the following commands into your R Script and Run:
model1=2962.17332+Temperature*42.57788
residual1=Distance-model1
RMSE1=sqrt(dot(residual1, residual1))/sqrt(50)
RMSE1 # Displays RMSE value
Recall that RMSE is measured in the same units as the output variable, \(Distance\). One “rule-of-thumb” is to check whether the RMSE is less than a comparison metric equal to 5% of the mean of the output values. Copy the following commands into your R Script and Run:
comparison_metric = 0.05*mean(Distance)
comparison_metric # Displays comparison metric
Our model obtained above is a “best-fit” model because it is impossible to obtain parameters in this form that result in a better RMSE than what we found. The proof of this is outside the scope of this class.
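Although the proof is out of scope, the claim can be spot-checked numerically. The sketch below uses made-up toy vectors (not the C130TOLD data) and a hypothetical helper, toyRMSE(), to show that nudging the best-fit slope or intercept does not lower the RMSE. It also applies the 5% rule-of-thumb to the toy data.

```r
# Made-up values for illustration only (not the C130TOLD data)
toyTemperature = c(-10, 0, 10, 20, 30)
toyDistance    = c(2500, 3000, 3400, 3750, 4300)

# Hypothetical helper: RMSE of a line with slope m and intercept b
toyRMSE = function(m, b) {
  residual = toyDistance - (m * toyTemperature + b)
  sqrt(sum(residual^2) / length(residual))
}

# Best-fit parameters from base R's lm()
fit   = coef(lm(toyDistance ~ toyTemperature))
bBest = fit[["(Intercept)"]]
mBest = fit[["toyTemperature"]]

bestRMSE = toyRMSE(mBest, bBest)
bestRMSE

# Nudging either parameter away from the best fit does not lower the RMSE
toyRMSE(mBest + 1, bBest) >= bestRMSE
toyRMSE(mBest, bBest - 25) >= bestRMSE

# Rule-of-thumb check: is the RMSE below 5% of the mean output?
bestRMSE < 0.05 * mean(toyDistance)
```

A few nudges are not a proof, but they give a quick sanity check that the fitted parameters sit at the bottom of the RMSE surface.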
However, while this is the best-fit linear function using only \(Temperature\) as an input, it does not mean this is the best possible model for this data. We can tell that our model does not have a great fit by examining the function plotted with the associated data, \(Distance\) versus \(Temperature\). Copy the following commands into your R Script and Run:
plotPoints(Distance~Temperature)
plotFun(linearModel(Temperature)~Temperature, add=TRUE, col="red")
As you can see from this plot, there are large differences, or errors, between the observed values of \(Distance\) and the values predicted by the linear model that we developed. The model’s residuals capture these errors and lead to a large RMSE that is not close to our comparison metric. Therefore, we should explore other model forms that can reduce the RMSE and perform better at estimating takeoff distance.
Thus far, we have been ignoring \(Weight\), the other input variable in our
data set. Use the plot_ly() code from earlier to generate
a multi-variable scatterplot. Does the \(Weight\) variable appear to be important in
explaining \(Distance\)? Using either
of the two projection methods outlined above, perform a
similar process with the Weight input variable using the form
\(Distance = a*Weight + b\). Report the
best-fit values for \(a\) and \(b\).
Answer in Gradescope: Q2.1, Q2.2
Find and report the RMSE for this model. Comment on whether this is an improvement. Does this model pass the comparison metric threshold?
Answer in Gradescope: Q2.3 & Q2.4
Next, let’s try modeling with both \(Weight\) and \(Temperature\) input variables included in the linear multi-variable model. Using either of the two projection methods outlined above, find the best-fit model of the form: \(Distance = a*Weight + b*Temperature + c\). Report the best-fit values for \(a\), \(b\), and \(c\).
Answer in Gradescope: Q3.1, Q3.2, Q3.3
Find and report the RMSE for this model. Comment on whether this is an improvement. Did this model achieve a “good” fit by passing the comparison metric?
Answer in Gradescope: Q3.4 & Q3.5
While the RMSE is low enough to consider this current model to be adequate, there are other factors to consider when developing linear models. You will learn more about this in a future statistics course, but for now, just know that the goal is for the plot of residuals versus the output variable to appear reasonably linear.
Let’s examine the following residual versus variable plots to
determine whether the residuals appear reasonably linear. Copy the
following command into your R Script and Run, where
residual3 is what you called the residual vector from the
\(Distance = a*Weight + b*Temperature + c\) model:
plotPoints(residual3~Distance)
What did you notice from the ‘residual3’ versus \(Distance\) plot? Did the residuals appear linear across values of \(Distance\) or was another shape apparent?
Answer in Gradescope: Q3.6
You might remember that \(f(x) = x^2\) has a similar graphical shape to what you examined in the last plot. Interestingly, to ensure the residuals of the next model are reasonably linear versus the output variable, we should add a squared-term (\(Input^2\)) to address this issue. This means we actually square at least one of our input variables to create a model of a polynomial form. Which input variable should we square? Examine each of the residual versus input variable plots to determine which input variable would most benefit from adding a squared-term. Copy the following commands into your R Script and Run:
plotPoints(residual3~Weight)
plotPoints(residual3~Temperature)
We can see that the \(Weight\)
variable has a strong parabolic relationship with the residuals, so we
will add a \(Weight^2\) term to the
model next. The model form will look like this:
\(Distance = a*Weight^2 + b*Weight + c*Temperature + d\).
Using either of the two projection
methods outlined above, find the best-fit parameters for the
above model. Note: for Method 1, you will need to use
I(Weight^2) within project(), while just
Weight^2 works with Method 2. Report the best-fit values
for \(a\), \(b\), \(c\), and \(d\).
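The reason for I() can be seen with base R's model.matrix(), which is assumed here to treat formulas the same way the formula handling inside project() does. Inside a formula, ^ is formula syntax rather than arithmetic, so a bare squared term adds no new column. The toy vector below is made up and is not the C130TOLD data.

```r
# Made-up values for illustration only (not the C130TOLD data)
toyWeight = c(100, 110, 120, 130, 140)

# Without I(), ^ is read as formula syntax, so no squared column is created
colnames(model.matrix(~ toyWeight + toyWeight^2))

# With I(), ^ is ordinary arithmetic and a squared column appears
colnames(model.matrix(~ toyWeight + I(toyWeight^2)))

# In a matrix built by hand, toyWeight^2 is already plain arithmetic
toyM = cbind(toyWeight^2, toyWeight, rep(1, length(toyWeight)))
head(toyM)
```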
Answer in Gradescope: Q4.1, Q4.2, Q4.3, Q4.4
Find and report the RMSE for this model. Is this an improvement? Did
adding the \(Weight^2\) term improve
the residual versus output variable relationship? Copy the following
command into your R Script and Run, where
residual4 is what you called the residual vector from the
\(Distance = a*Weight^2 + b*Weight + c*Temperature + d\) model:
plotPoints(residual4~Distance)
While not perfectly linear, the residuals versus output variable plot no longer displays a strong parabolic relationship and can be considered reasonable. With RMSE at a reasonable range compared to the mean of our \(Distance\) values and acceptable residual versus variable relationships, we have assurance that the latest model is adequate to represent the C-130 TOLD data set.
Answer in Gradescope: Q4.5, Q4.6
Normally, when different variables are added to a linear model, there is a concern that relationships between inputs can cause difficulties with calculating model parameters. Ideally, the input variables have no correlation (linear relationship), which greatly reduces this concern. First, we can plot \(Weight\) versus \(Temperature\) to see whether there is any trend between the two input variables.
plotPoints(Weight~Temperature)
Do you notice any trend between \(Weight\) and \(Temperature\)? To fully confirm the relationship between \(Weight\) and \(Temperature\), we can apply one of the approaches we learned in Block 2: the dot product. When the dot product of two variables equals 0, the variables have no correlation.
The big difference when working with real data is that we need to
“center” the vectors before calculating the dot product. To center a
vector you simply subtract the mean of the vector from each component.
Remember from above that the mean of a vector is found in R
with the mean() command. Compute the centered dot product
and evaluate the result to answer the last two questions.
Answer in Gradescope: Q5.1, Q5.2
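The effect of centering can be sketched on a small made-up example. The toy vectors below are not the C130TOLD data: they form a balanced grid in which every toy weight is paired with every toy temperature. The sketch shows that the raw dot product is far from 0, the centered dot product is 0, and dividing the centered dot product by both vector lengths reproduces the value from base R's cor().

```r
# Made-up balanced design for illustration only (not the C130TOLD data):
# every toy weight is paired with every toy temperature
toyGrid = expand.grid(toyWeight = c(100, 120, 140), toyTemperature = c(0, 15, 30))
toyWeight      = toyGrid$toyWeight
toyTemperature = toyGrid$toyTemperature

# Center each vector by subtracting its mean from every component
centeredWeight      = toyWeight - mean(toyWeight)
centeredTemperature = toyTemperature - mean(toyTemperature)

# Dot product of the centered vectors
centeredDot = sum(centeredWeight * centeredTemperature)
centeredDot

# Without centering, the dot product is far from 0 even for this balanced design
sum(toyWeight * toyTemperature)

# Dividing by both vector lengths gives the correlation coefficient reported by cor()
centeredDot / sqrt(sum(centeredWeight^2) * sum(centeredTemperature^2))
cor(toyWeight, toyTemperature)
```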
When input variables have zero correlation, it simplifies the model building process for multi-variable models. Did you notice that coefficient values for the corresponding variable did not change between Models 1-3?
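This behavior can be illustrated with base R's lm() on a made-up balanced design, where the centered inputs are orthogonal. The vectors and the output formula below are invented for the sketch and are not the C130TOLD data. The slope for each input is the same whether it is fit alone or together with the other input; only the intercept changes.

```r
# Made-up balanced design (not the C130TOLD data): the centered inputs are orthogonal
toyGrid = expand.grid(toyWeight = c(100, 120, 140), toyTemperature = c(0, 15, 30))
toyWeight      = toyGrid$toyWeight
toyTemperature = toyGrid$toyTemperature

# Made-up output that depends on both inputs plus small fixed offsets
toyDistance = 30 * toyWeight + 40 * toyTemperature - 1000 +
  c(5, -3, 2, -4, 6, -1, 3, -2, -6)

# Coefficients from the single-input models
coefT  = coef(lm(toyDistance ~ toyTemperature))
coefW  = coef(lm(toyDistance ~ toyWeight))

# Coefficients from the two-input model
coefWT = coef(lm(toyDistance ~ toyWeight + toyTemperature))

# The slopes match the single-input fits; the intercepts differ
coefT
coefW
coefWT
```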