2025-03-17

Airmiles Dataset

  • This dataset records the revenue passenger miles flown by commercial airlines in the United States for each year

  • The dataset covers the years 1937 to 1960

Examining the data

Print the first 6 rows of the data

data("airmiles")
?airmiles # open the help page for the dataset
years <- 1937:1960 # the series covers 1937 through 1960
df <- data.frame(years = years, miles = as.numeric(airmiles)) # one row per year
head(df) # show the first 6 rows
##   years miles
## 1  1937   412
## 2  1938   480
## 3  1939   683
## 4  1940  1052
## 5  1941  1385
## 6  1942  1418

Summary of the data

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     412    1580    6431   10528   17532   30514
  • Important notes:

    • The minimum miles flown in a year was 412

    • The maximum miles flown in a year was 30,514

    • The mean miles flown per year across the dataset was 10,528
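The code that produced the summary is not shown on the slide. A minimal sketch that should reproduce it, assuming the same `df` built on the earlier slide (the name `miles_summary` is illustrative only):

```r
data("airmiles")

# Rebuild the data frame from the earlier slide so this chunk runs on its own
df <- data.frame(years = 1937:1960, miles = as.numeric(airmiles))

# Five-number summary plus the mean of the miles column
miles_summary <- summary(df$miles)
print(miles_summary)

# The rows (and therefore the years) where the minimum and maximum occur
df[which.min(df$miles), ]
df[which.max(df$miles), ]
```

The last two lines also show which years the extremes belong to, which the summary table alone does not reveal.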

Slide with ggplot of the passenger miles for U.S. airlines 1937-1960

  • The ggplot on this slide shows how the miles flown increased over time
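The plotting code is hidden on the slide. One way such a plot could be drawn is sketched below; the choice of `geom_line()` with `geom_point()`, the axis labels, and the object name `p` are assumptions, not necessarily what the original slide used:

```r
library(ggplot2)
data("airmiles")

# Rebuild the data frame so this chunk runs on its own
df <- data.frame(years = 1937:1960, miles = as.numeric(airmiles))

# Line plot with a point for each year
p <- ggplot(df, aes(x = years, y = miles)) +
  geom_line() +
  geom_point() +
  labs(
    title = "Passenger miles for U.S. airlines, 1937-1960",
    x = "Year",
    y = "Miles flown"
  )
print(p)
```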

Simple Linear Regression For Airmiles Dataset

The simple linear regression model for this dataset:

\[ y = \beta_0 + \beta_1 x + \epsilon \]

The meaning of each term in the formula:

- \(x\) is the year

- \(y\) is the total miles flown in that year

- \(\beta_0\) is the intercept: the predicted miles flown when \(x = 0\). Since the data only cover 1937 to 1960, this is an extrapolation far outside the observed years rather than a meaningful starting value

- \(\beta_1\) is the slope: the expected change in miles flown for each one-year increase

- \(\epsilon\) is the random error: the difference between the actual miles flown and the miles predicted by the line
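As a rough sketch of how these terms map onto R, the model can be fitted with `lm()`. The second fit, which measures time from 1937, is a hypothetical variant added here only to make the intercept easier to read; the names `fit` and `fit_centered` are illustrative:

```r
data("airmiles")
df <- data.frame(years = 1937:1960, miles = as.numeric(airmiles))

# Ordinary least squares fit of miles = beta_0 + beta_1 * years + error
fit <- lm(miles ~ years, data = df)
coef(fit)  # "(Intercept)" estimates beta_0, "years" estimates beta_1

# Hypothetical variant: count years from 1937, so the intercept becomes
# the predicted miles in 1937 instead of in year 0
fit_centered <- lm(miles ~ I(years - 1937), data = df)
coef(fit_centered)
```

Shifting the year only changes the intercept; the slope, and therefore the fitted line, stays the same.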

Deeper Dive into the meaning of Epsilon in the Simple Regression Formula

Simple regression formula:

\[ y = \beta_0 + \beta_1 x + \epsilon \]

  • For example, if the predicted miles for a given year were 500 but the actual miles were 700, then epsilon is:

  • \[ \epsilon = 700 - 500 = 200 \]

  • A value of 200 means the model under-predicted the miles for that year by 200
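The same calculation can be applied to every year of the real data. In a fitted model the observed differences are called residuals and serve as estimates of \(\epsilon\). A minimal sketch, assuming a straight-line fit with `lm()` (the column names `predicted` and `residual` are illustrative):

```r
data("airmiles")
df <- data.frame(years = 1937:1960, miles = as.numeric(airmiles))
fit <- lm(miles ~ years, data = df)

# Predicted miles for each year, taken from the fitted line
df$predicted <- fitted(fit)

# Actual minus predicted, the same quantity that resid(fit) returns
df$residual <- df$miles - df$predicted
head(df)
```

A positive residual means the line under-predicted that year, and a negative one means it over-predicted.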

Slide with ggplot for the Regression Line

  • The blue line on the graph is the regression line, the straight line that best fits the data

  • The dots are the actual values from the dataset

## `geom_smooth()` using formula = 'y ~ x'
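The plotting code is hidden on this slide as well. A sketch of how a plot like this might be produced, assuming `geom_smooth()` was called with `method = "lm"` (the labels and the name `p_fit` are illustrative):

```r
library(ggplot2)
data("airmiles")
df <- data.frame(years = 1937:1960, miles = as.numeric(airmiles))

# Scatter plot of the observed values with a least-squares line on top
p_fit <- ggplot(df, aes(x = years, y = miles)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x) +
  labs(
    title = "Regression line for airmiles",
    x = "Year",
    y = "Miles flown"
  )
print(p_fit)
```

Passing `formula = y ~ x` explicitly should silence the `geom_smooth()` message shown above. The shaded band drawn by default is the confidence interval around the line and can be turned off with `se = FALSE`.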

Slide with Plotly: Scatter Plot

  • Here you can interact with the scatter plot
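The interactive plot itself does not survive outside the slide deck. A minimal sketch of how a comparable scatter plot could be built with plotly, where the titles and the name `fig` are assumptions:

```r
library(plotly)
data("airmiles")
df <- data.frame(years = 1937:1960, miles = as.numeric(airmiles))

# Interactive scatter plot with one marker per year
fig <- plot_ly(df, x = ~years, y = ~miles, type = "scatter", mode = "markers")
fig <- layout(
  fig,
  title = "Passenger miles for U.S. airlines, 1937-1960",
  xaxis = list(title = "Year"),
  yaxis = list(title = "Miles flown")
)
fig
```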