Learning Objectives

By the end of this project you will:

Use project() to find best-fit models using a vector or matrix approach
Evaluate model performance using Root Mean Squared Error (RMSE)
Build a multi-variable, polynomial model to represent a dataset
Examine the relationship between residuals and the model’s variables
Apply the dot product to understand input variable characteristics

Admin

All questions for this project will be answered in the Project 2 Gradescope assignment. Read through this guidance to find the relevant material and R commands, as well as the required tasks, but note that the only deliverables for this project are to answer the questions in the Gradescope assignment and upload a copy of your R Script at the end of the Gradescope assignment.

Help Policy

This assignment is individual effort. You may reference any published resource (in print or online) and receive help from any individual; however, the work you turn in must be an accurate representation of your knowledge and understanding of the problem. It is NEVER acceptable to copy any portion of another’s work and submit it as your own. Here are a few blatant examples of copying:

Making an electronic copy of another cadet’s solution and then modifying it slightly to make it appear as your own work.
Reading another cadet’s work as you implement your solution.
Completing your solution by following explicit instructions from another cadet, while he/she refers to his/her own solution.

Helping your classmates understand the concepts is encouraged, but only provide assistance up to your depth of understanding. If you have to look at your solution while giving help, you are most likely beyond your depth of understanding. Do not give your solution to another cadet in any form (hard copy, soft copy, or verbal).

Introduction

“The C-130 Hercules primarily performs the tactical portion of the airlift mission. The aircraft is capable of operating from rough, dirt strips and is the primary transport for airdropping troops and equipment into hostile areas” (Link: C-130 Hercules - Fact Sheet). This project focuses on the airland mission of the C-130. In particular, you will explore how the variables \(Weight\) and \(Temperature\) affect the takeoff \(Distance\) (the length of runway required for an aircraft to become airborne) for a C-130J Super Hercules.

C-130 pilots are intimately familiar with Takeoff and Landing Data (commonly referred to as TOLD) because many missions require operating in austere environments involving dirt landing zones and short runways. Often, this limits the amount of cargo that can be moved in and out of certain locations, since aircraft gross weight is a significant factor in takeoff distance calculations. A common question that ground users and tactical and strategic planners ask C-130 crews is: “how much can you take?” Unfortunately, without access to the Mission Computer inside a C-130J cockpit, computing takeoff distance at various weights is difficult to accomplish with a high degree of accuracy. The aircraft performance manual offers charts (see example below) which are good for rough estimations (within a hundred feet at best), but perhaps we can use our Math 141 expertise to find a better way to estimate takeoff distance.

Although there are many factors that influence takeoff distance, we will be exploring the impacts of just external Temperature and aircraft Weight in this project. We will expand on our understanding of projections to find an appropriate multivariable model for a given set of data.

Data

The dataset we will use in the project has three recorded variables: Temperature (in degrees Celsius), Weight (in thousands of pounds), and Distance (the distance in feet it takes for a C-130 to take off at a given temperature and weight).

Packages

Before we proceed, load the required packages. Remember that in order to use a package, we need to install AND load the package. The packages necessary for this project are already installed on PositCloud; there is no need to install them there.

You will need to load the required packages by executing the commands below. Note: you may have to re-run these commands each time you reload your R session.

library(rgl)
library(plotly)
library(tidyverse)
library(mosaic)  # assumed source of fitModel(), project(), dot(), plotPoints(), plotFun()

Importing the Data

The data you will use for this project is not pre-installed in R, so you need to upload the provided dataset into PositCloud. To load the dataset into RStudio on PositCloud, follow the steps below:

Step 0: Download the C130TOLD.csv file to your computer. The file C130TOLD.csv can be found under Block 2 Resources, then Project 2, in the Math 141 Fall 2025 Teams channel. Note: Do NOT change the file type or name; leave it as a .csv file!

Step 1: In RStudio on PositCloud, in the Files tab in the bottom right, click on the file folder called “data”.

Step 2: Click the Upload Files button as depicted below.

Steps 3 & 4: Click Choose File and choose the C130TOLD.csv file from wherever you saved it. Click OK.

The C130TOLD.csv file should now be uploaded to RStudio on PositCloud; however, you still need to read the dataset into R before you can use it. If you followed the import directions exactly, you should not have to change the file path. Copy the following commands into your R Script and Run:

setwd("/cloud/project/")
C130TOLD = read_csv('data/C130TOLD.csv')

Note: you will have to re-run the above read_csv command to define the data C130TOLD if you clear your Environment.

Previewing the Data

ALWAYS view your data after loading. Not only will this confirm whether it was imported and loaded properly, but it also allows you to get familiar with the data. Use either of the commands below to preview your data. Copy the following commands into your R Script and Run:

head(C130TOLD)
View(C130TOLD)

Visualizing the Data

Get familiar with your dataset by using plot_ly(). Examine how the input variables \(Temperature\) and \(Weight\) appear to influence the output variable \(Distance\). Copy the following commands into your R Script and Run:

plot_ly(C130TOLD, x=~Temperature, y=~Weight, z=~Distance) %>% add_markers(size=12) %>%
  layout(scene=list(xaxis=list(title='Temp'),
                    yaxis=list(title='Wt'),
                    zaxis=list(title='Dist')))

Developing Models

Linear Model

For this project, all models will be in the linear form. The first function we will use to model this data is in the linear form with a single-variable input, Temperature. Model equation form: \(Distance = m*Temperature+b\). In Block 1, we learned how to use fitModel() to find the best-fit parameters. For example, copy the following commands into your R Script and Run:

linearModel = fitModel(Distance~m*Temperature+b, data=C130TOLD)
coef(linearModel)

Your results should show the model equation with the best-fit parameters is:
\(Distance = 42.58*Temperature+2962.17\).

But how does fitModel() work? In Block 2, we learned that fitModel() involves building a projection of the vector of Distance values onto the space of all linear combinations of \(\bar{1}_n\) (a vector of 1s) and \(Temperature\) (a vector of x values). Recall that we want a slope \(m\) and an intercept \(b\) that best fit these data points.
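The projection idea can be sketched in base R without any course packages. This is a minimal illustration under stated assumptions, not necessarily how fitModel() or project() is implemented internally: the vectors temp_toy and dist_toy are made-up values (not the C130TOLD data), and the parameters are found by solving the normal equations \(M^T M p = M^T y\).

```r
# Hypothetical toy data (NOT the C130TOLD values), chosen only for illustration
temp_toy <- c(0, 10, 20, 30, 40)
dist_toy <- c(3010, 3380, 3820, 4190, 4600)

# Model matrix: one column per model term (slope term first, then intercept term)
M_toy <- cbind(temp_toy, 1)

# Solve the normal equations (M^T M) p = M^T y for the parameters p = (m, b)
params <- solve(t(M_toy) %*% M_toy, t(M_toy) %*% dist_toy)
params

# The projection is the vector of fitted values; the residual is what is left over
projection   <- M_toy %*% params
residual_toy <- dist_toy - projection

# A projection leaves a residual that is orthogonal to every column of the
# model matrix, so both dot products should be zero up to rounding error
sum(residual_toy * temp_toy)
sum(residual_toy * 1)
```

The two near-zero dot products at the end are what make the fitted vector a projection: no further adjustment of \(m\) or \(b\) can shrink the residual.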

Obtain Projection

To accomplish vector projection, we need to store each variable from our data table in R as separate vectors to simplify calculations. Copy the following commands into your R Script and Run:

Temperature=C130TOLD$Temperature
Weight=C130TOLD$Weight
Distance=C130TOLD$Distance

Note: you will have to re-run the above commands to define your variable vectors if you clear your Environment.

We will also need an intercept vector to multiply with the intercept scalar. The rep() command stands for “replicate”, so it is replicating the value of 1 fifty times to make a vector of ones. Copy the following command into your R Script and Run:

ones=rep(1,50)
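As an optional variation, the length of the ones vector can be taken from the data rather than typed in by hand. The sketch below uses a made-up vector named dist_toy (not the C130TOLD values) so that it runs on its own.

```r
# Hypothetical toy vector (NOT the C130TOLD values)
dist_toy <- c(3010, 3380, 3820)

# length() counts the entries, so the ones vector always matches the data
ones_toy <- rep(1, length(dist_toy))
ones_toy
```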

Model 1 - Temperature as Input

Use the project() command to find the projection of Distance onto the space of all linear combinations of \(Temperature\) and \(\bar{1}\). We can do this using two methods:

Method 1 (Combination of Vectors)

The first method involves projecting the Distance vector onto the space spanned by two vectors: Temperature, the vector of \(Temperature\) values, and \(\bar{1}_n\), a vector of ones. Copy the following command into your R Script and Run:

project(Distance ~ Temperature + 1)

Method 2 (Model Matrix)

The second method involves projecting the Distance vector onto a “model matrix” which we refer to as M. This model matrix is constructed by setting the \(Temperature\) and \(\bar{1}_n\) vectors side-by-side. Copy the following commands into your R Script and Run:

M = matrix(c(Temperature, ones), nrow=50, ncol=2)
head(M)

Now, we can project the Distance vector onto this model matrix. Copy the following command into your R Script and Run:

project(Distance ~ M)

Using the model matrix method involves keeping track of which column of a matrix corresponds to what term of a model.
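One way to make that bookkeeping easier is to label the columns when the matrix is built. The sketch below is an illustration in base R with made-up vectors (temp_toy and dist_toy are not the C130TOLD values); it uses lm.fit(), base R's least-squares routine, as a stand-in solver, and whether project() displays the same labels is not verified here.

```r
# Hypothetical toy vectors (NOT the C130TOLD values)
temp_toy <- c(0, 10, 20, 30, 40)
dist_toy <- c(3010, 3380, 3820, 4190, 4600)

# cbind() with named arguments labels each column of the model matrix
M_named <- cbind(Temperature = temp_toy, Intercept = 1)
colnames(M_named)

# lm.fit() solves the same least-squares problem; its coefficients
# inherit the column names, so each value is tied to its model term
fit_toy <- lm.fit(x = M_named, y = dist_toy)
fit_toy$coefficients
```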

Note that both methods return the same intercept and slope, which also match the intercept and slope produced by the fitModel() command.

Obtain RMSE

Is this a “good” model? One way to explore whether this is an adequate model is to examine the residuals (the differences between the observed Distance values and the associated predicted values from the model) and the Root Mean Squared Error (RMSE).

Find the residual vector and use it to compute and report the RMSE. Copy the following commands into your R Script and Run:

model1=2962.17332+Temperature*42.57788
residual1=Distance-model1
RMSE1=sqrt(dot(residual1, residual1))/sqrt(50)
RMSE1 # Displays RMSE value
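Since the same RMSE calculation is repeated for every model in this project, it may help to see it written once as a reusable function. This is a sketch in base R: the function name rmse and the toy vectors are hypothetical and are not part of any package used in this project.

```r
# Hypothetical helper: square root of the mean squared residual
rmse <- function(observed, predicted) {
  residual <- observed - predicted
  sqrt(sum(residual * residual) / length(residual))
}

# Hypothetical toy values: the residuals are -1, 0, 1, -2
observed_toy  <- c(10, 12, 15, 19)
predicted_toy <- c(11, 12, 14, 21)
rmse(observed_toy, predicted_toy)   # sqrt(6 / 4), about 1.22
```

Dividing the sum of squares by the number of points before taking the square root gives the same value as dividing the square root of the dot product by sqrt(n), as in the commands above.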

Interpretation

Recall that RMSE is measured in the same units as the output variable, \(Distance\). One “rule-of-thumb” is to compare RMSE against a comparison metric equal to 5% of the mean of the output values and determine whether RMSE is less than that metric. Copy the following commands into your R Script and Run:

comparison_metric = 0.05*mean(Distance)
comparison_metric # Displays comparison metric
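The comparison itself can be written as a logical test. The numbers below are made up for illustration (they are not results from the C130TOLD data), and the names dist_toy, rmse_toy, and threshold_toy are hypothetical.

```r
# Hypothetical toy values (NOT results from the C130TOLD data)
dist_toy <- c(3000, 3500, 4000, 4500)
rmse_toy <- 150

# 5% of the mean output: 0.05 * 3750 = 187.5
threshold_toy <- 0.05 * mean(dist_toy)

# TRUE means the RMSE is below the rule-of-thumb threshold
rmse_toy < threshold_toy
```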

Our model obtained above is a “best-fit” model because it is impossible to obtain parameters in this form that result in a better RMSE than what we found. The proof of this is outside the scope of this class.

However, while this is the best-fit linear function using only \(Temperature\) as an input, it does not mean this is the best possible model for this data. We can tell that our model does not have a great fit by examining the function plotted with the associated data, \(Distance\) versus \(Temperature\). Copy the following commands into your R Script and Run:

plotPoints(Distance~Temperature)
plotFun(linearModel(Temperature)~Temperature, add=TRUE, col="red")

As you can see from this plot, there are large differences, or errors, between the observed values of \(Distance\) and the values predicted by the linear model at each \(Temperature\). The model’s residuals capture these errors and lead to a large RMSE that does not come close to meeting our comparison metric. Therefore, we should explore other model forms that can reduce the RMSE and perform better at estimating takeoff distance.

Model 2 - Weight as Input Variable

Thus far, we have been ignoring \(Weight\), the other input variable in our data set. Using the plot_ly() code from earlier to generate a multi-variable scatterplot, does the \(Weight\) variable appear to be important in explaining \(Distance\)? Using either of the two projection methods outlined above, perform a similar process with the Weight input variable using the form \(Distance = a*Weight + b\). Report the best-fit values for \(a\) and \(b\).

Answer in Gradescope: Q2.1, Q2.2

RMSE

Find and report the RMSE for this model. Comment on whether this is an improvement. Does this model pass the comparison metric threshold?

Answer in Gradescope: Q2.3 & Q2.4

Model 3: Multi-variable Model

Next, let’s try modeling with both \(Weight\) and \(Temperature\) input variables included in the linear multi-variable model. Using either of the two projection methods outlined above, find the best-fit model of the form: \(Distance = a*Weight + b*Temperature + c\). Report the best-fit values for \(a\), \(b\), and \(c\).

Answer in Gradescope: Q3.1, Q3.2, Q3.3

RMSE

Find and report the RMSE for this model. Comment on whether this is an improvement. Did this model achieve a “good” fit by passing the comparison metric?

Answer in Gradescope: Q3.4 & Q3.5

While the RMSE is low enough to consider this current model to be adequate, there are other factors to consider when developing linear models. You will learn more about this in a future statistics course, but for now, just know that the goal is for the plot of residuals versus the output variable to appear reasonably linear.

Let’s examine the following residual versus variable plots to determine whether the residuals appear reasonably linear. Copy the following command into your R Script and Run, where residual3 is what you called the residual vector from the \(Distance = a*Weight + b*Temperature + c\) model:

plotPoints(residual3~Distance)

What did you notice from the ‘residual3’ versus \(Distance\) plot? Did the residuals appear linear across values of \(Distance\) or was another shape apparent?

Answer in Gradescope: Q3.6

Model 4: Polynomial Model

You might remember that \(f(x) = x^2\) has a similar graphical shape to what you examined in the last plot. Interestingly, to ensure the residuals of the next model are reasonably linear versus the output variable, we should add a squared-term (\(Input^2\)) to address this issue. This means we actually square at least one of our input variables to create a model of a polynomial form. Which input variable should we square? Examine each of the residual versus input variable plots to determine which input variable would most benefit from adding a squared-term. Copy the following commands into your R Script and Run:

plotPoints(residual3~Weight)
plotPoints(residual3~Temperature)

We can see that the \(Weight\) variable has a strong parabolic relationship with the residuals, so we will add a \(Weight^2\) term to the model next. The model form will look like this:
\(Distance = a*Weight^2 + b*Weight + c*Temperature + d\). Using either of the two projection methods outlined above, find the best-fit parameters for the above model. Note: for Method 1, you will need to use I(Weight^2) within project(), while just Weight^2 works with Method 2. Report the best-fit values for \(a\), \(b\), \(c\), and \(d\).
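The reason I() matters in a formula can be seen with base R's lm(), which follows R's standard formula rules. This sketch uses made-up vectors x_toy and y_toy (not the C130TOLD values) and assumes, as the note above implies, that project() reads formulas the same way.

```r
# Hypothetical toy data (NOT the C130TOLD values): y is exactly 2*x^2 + 3*x + 5
x_toy <- c(1, 2, 3, 4, 5, 6)
y_toy <- 2 * x_toy^2 + 3 * x_toy + 5

# Without I(), ^ is read as a formula operator, so x_toy^2 collapses to
# x_toy and no squared column is created (only 2 coefficients come back)
coef(lm(y_toy ~ x_toy^2 + x_toy))

# With I(), the squaring is done arithmetically and the squared term appears
coef(lm(y_toy ~ I(x_toy^2) + x_toy))

# Outside a formula, ^ is ordinary arithmetic, so a model matrix needs no I()
M_poly <- cbind(x_toy^2, x_toy, 1)
head(M_poly)
```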

Answer in Gradescope: Q4.1, Q4.2, Q4.3, Q4.4

RMSE and Model Fit

Find and report the RMSE for this model. Is this an improvement? Did adding the \(Weight^2\) term improve the residual versus output variable relationship? Copy the following command into your R Script and Run, where residual4 is what you called the residual vector from the \(Distance = a*Weight^2 + b*Weight + c*Temperature + d\) model:

plotPoints(residual4~Distance)

While not perfectly linear, the residuals versus output variable plot no longer displays a strong parabolic relationship and can be considered reasonable. With RMSE at a reasonable range compared to the mean of our \(Distance\) values and acceptable residual versus variable relationships, we have assurance that the latest model is adequate to represent the C-130 TOLD data set.

Answer in Gradescope: Q4.5, Q4.6

Input Variable Correlation

Normally, when different variables are added to a linear model, there is a concern that relationships between inputs can cause difficulties with calculating model parameters. Ideally, the input variables have no correlation (linear relationship) which greatly reduces this concern. First, we can examine the plot of \(Weight\) versus \(Temperature\) to examine if there’s any trend between the two input variables.

plotPoints(Weight~Temperature)

Do you notice any trend between \(Weight\) and \(Temperature\)? To fully confirm the relationship between \(Weight\) and \(Temperature\), we can apply one of the approaches we learned in Block 2: the dot product. When the dot product of two variables equals 0, the variables have no correlation.

The big difference when working with real data is that we need to “center” the vectors before calculating the dot product. To center a vector, you simply subtract the mean of the vector from each component. Remember from above that the mean of a vector is found in R with the mean() command. Compute the centered dot product and evaluate the result to answer the last two questions.
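To see why centering matters, here is a small sketch with made-up vectors u and v (hypothetical names, not the project data). The raw dot product is far from zero, yet the centered dot product is zero, which agrees with base R's cor().

```r
# Hypothetical toy vectors (NOT the C130TOLD values)
u <- c(1, 2, 3, 4)
v <- c(5, 3, 3, 5)

# Center each vector by subtracting its own mean
u_centered <- u - mean(u)   # -1.5, -0.5, 0.5, 1.5
v_centered <- v - mean(v)   #  1,   -1,  -1,   1

sum(u * v)                      # raw dot product: 40
sum(u_centered * v_centered)    # centered dot product: 0

# cor() centers and scales internally, so it also reports no correlation
cor(u, v)
```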

Answer in Gradescope: Q5.1, Q5.2

When input variables have zero correlation, it simplifies the model building process for multi-variable models. Did you notice that coefficient values for the corresponding variable did not change between Models 1-3?
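This behavior can be reproduced on a small made-up example. The sketch below uses base R's lm() and a hypothetical balanced design (every pairing of three weights with three temperatures, not the C130TOLD data), which makes the two inputs uncorrelated; the slope on the weight input is then the same whether or not temperature is in the model.

```r
# Hypothetical toy design (NOT the C130TOLD values): all 9 weight/temperature pairs
design_toy <- expand.grid(wt = c(100, 120, 140), temp = c(0, 15, 30))
wt_toy   <- design_toy$wt
temp_toy <- design_toy$temp

# Hypothetical output: a linear trend plus fixed made-up noise
noise_toy <- c(5, -3, 2, -4, 1, 6, -2, -5, 0)
dist_toy  <- 1000 + 25 * wt_toy + 40 * temp_toy + noise_toy

# The inputs are uncorrelated in this balanced design (zero up to rounding)
cor(wt_toy, temp_toy)

# The weight slope from the single-input fit and from the two-input fit
slope_single <- coef(lm(dist_toy ~ wt_toy))["wt_toy"]
slope_both   <- coef(lm(dist_toy ~ wt_toy + temp_toy))["wt_toy"]
c(slope_single, slope_both)
```

Only the slope is shared between the two fits; the intercept generally changes when another input is added.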

Math 141 Project 2 - Guidance
