Item 1: Load the caret, ggplot2, and dplyr libraries.

library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Item 2: Import the Beach Attendance.csv file into an R data frame called “beachattendance”. Put this into a code chunk

beachattendance <- read.csv("C:/Users/ajpay/OneDrive/Documents/MSBA/SCH-MGMT 655 - Machine Learning/Assignments/Assignment 3/Beach Attendance.csv")
head(beachattendance)
##   Observation.ID Beach.ID Daily.Attendance Month Latitude Longitude Is.Holiday
## 1              1        3             3826     1    52.89    20.777          0
## 2              2        3             2862     1    52.89    20.777          0
## 3              3        3             3895     1    52.89    20.777          0
## 4              4        3             3354     1    52.89    20.777          0
## 5              5        3             3017     1    52.89    20.777          0
## 6              6        3             3458     1    52.89    20.777          0
##   Is.Weekend Rain.Index AverageTemperature Wave.Action
## 1          0        0.0                 84         0.0
## 2          0        0.0                 83         0.0
## 3          0        0.0                 70         0.0
## 4          0       10.9                 87        65.4
## 5          1        0.0                 84         0.0
## 6          0        0.0                 63         0.0
nrow(beachattendance)
## [1] 4828

Item 3: Partition the data so that 70% goes into training, 15% goes into validation, and 15% goes into test.

sample <- sample.int(n = nrow(beachattendance), size = nrow(beachattendance)*0.70, replace = FALSE) ##Draws 70% of the row indices for training; no seed is set, so the split changes on every run
beachattendance_training <- beachattendance[sample, ] ##Yields training dataset
beachattendance_validationtest <- beachattendance[-sample, ] ##Yields the remaining 30% (validation + test portion)
sample <- sample.int(n = nrow(beachattendance_validationtest), size = nrow(beachattendance_validationtest)*0.5, replace = FALSE) ##0.5 = share of the validation + test block that goes into validation
beachattendance_validation <- beachattendance_validationtest[sample, ] ##Yields validation dataset
beachattendance_test <- beachattendance_validationtest[-sample, ] ##Yields test portion
head(beachattendance_training)
##      Observation.ID Beach.ID Daily.Attendance Month Latitude Longitude
## 2765           2832        2             1300     7   52.166    20.967
## 4559           4671        2             3040    12   52.166    20.967
## 2301           2355        2             2736     5   52.166    20.967
## 3554           3630        1             1950     9   50.101    21.999
## 4348           4450        2             2044    11   52.166    20.967
## 2407           2462        2             3010     6   52.166    20.967
##      Is.Holiday Is.Weekend Rain.Index AverageTemperature Wave.Action
## 2765          0          0          0                 72           0
## 4559          1          0          0                 84           0
## 2301          0          0          0                 92           0
## 3554          0          0          0                 79           0
## 4348          0          0          0                 84           0
## 2407          0          0          0                 74           0
head(beachattendance_validation)
##      Observation.ID Beach.ID Daily.Attendance Month Latitude Longitude
## 3887           3977        2             2672    10   52.166    20.967
## 651             653        3             2759     1   52.890    20.777
## 2653           2716        3             2094     6   52.890    20.777
## 4635           4750        2             1644    12   52.166    20.967
## 4665           4781        3             3260    12   52.890    20.777
## 1054           1069        2             3139     2   52.166    20.967
##      Is.Holiday Is.Weekend Rain.Index AverageTemperature Wave.Action
## 3887          1          0          0                 98           0
## 651           0          1          0                 61           0
## 2653          0          1          0                 76           0
## 4635          0          0          0                 63           0
## 4665          0          0          0                 79           0
## 1054          0          0          0                 73           0
head(beachattendance_test)
##    Observation.ID Beach.ID Daily.Attendance Month Latitude Longitude Is.Holiday
## 12             12        2             5814     1   52.166    20.967          0
## 20             20        1             3594     1   50.101    21.999          0
## 33             33        2             4621     1   52.166    20.967          0
## 46             46        2             4458     1   52.166    20.967          0
## 58             58        3             4061     1   52.890    20.777          0
## 61             61        3             2904     1   52.890    20.777          0
##    Is.Weekend Rain.Index AverageTemperature Wave.Action
## 12          1        0.0                 71         0.0
## 20          1        0.0                 90         0.0
## 33          0        0.0                 88         0.0
## 46          0        0.3                 88         2.7
## 58          0        0.0                 64         0.0
## 61          0        0.0                 63         0.0
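
Because sample.int() draws a new random sample on every run, figures quoted in the written answers can drift away from the printed output unless a seed is fixed. The following is a minimal sketch of a reproducible 70/15/15 split. It uses a stand-in data frame called demo (hypothetical, with the same row count as the beach file), and the seed value 655 is an arbitrary choice, not something taken from the assignment.

```r
# Reproducible 70/15/15 split on a stand-in data frame (hypothetical data,
# same number of rows as the beach attendance file).
set.seed(655)  # arbitrary seed; any fixed integer makes the split repeatable

n <- 4828
demo <- data.frame(id = seq_len(n), y = rnorm(n))

# Shuffle the row indices once, then cut the shuffled vector into three blocks.
shuffled <- sample.int(n)
n_train  <- floor(0.70 * n)
n_valid  <- floor(0.15 * n)

train_idx <- shuffled[seq_len(n_train)]
valid_idx <- shuffled[n_train + seq_len(n_valid)]
test_idx  <- shuffled[(n_train + n_valid + 1):n]

demo_training   <- demo[train_idx, ]
demo_validation <- demo[valid_idx, ]
demo_test       <- demo[test_idx, ]

# Row counts per split
c(training   = nrow(demo_training),
  validation = nrow(demo_validation),
  test       = nrow(demo_test))
```

Shuffling once and slicing guarantees that the three sets do not overlap and together cover every row.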

Item 4: Run a multiple linear regression that predicts the daily attendance of a beach using all the usable predictor variables available in the dataset.

linear_regression_model <- lm(Daily.Attendance ~ Month + Latitude + Longitude + Is.Holiday + Is.Weekend + Rain.Index + AverageTemperature + Wave.Action, data=beachattendance_training) ##Or, can use all predictors except one using the ~ . -EXCLUDEDVARIABLE notation
summary(linear_regression_model) ##Outputs summary of model & coefficients
## 
## Call:
## lm(formula = Daily.Attendance ~ Month + Latitude + Longitude + 
##     Is.Holiday + Is.Weekend + Rain.Index + AverageTemperature + 
##     Wave.Action, data = beachattendance_training)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12097.3   -628.4      2.0    544.0  19625.1 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        102692.803  17294.212   5.938 3.18e-09 ***
## Month                 -69.878      6.556 -10.658  < 2e-16 ***
## Latitude            -1021.851    176.262  -5.797 7.36e-09 ***
## Longitude           -2251.013    387.199  -5.814 6.69e-09 ***
## Is.Holiday            164.905     54.921   3.003   0.0027 ** 
## Is.Weekend            474.971     65.491   7.252 5.05e-13 ***
## Rain.Index            116.631      6.783  17.194  < 2e-16 ***
## AverageTemperature      6.235      2.069   3.013   0.0026 ** 
## Wave.Action            -4.771      2.032  -2.348   0.0189 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1402 on 3370 degrees of freedom
## Multiple R-squared:  0.173,  Adjusted R-squared:  0.1711 
## F-statistic: 88.14 on 8 and 3370 DF,  p-value: < 2.2e-16
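
The code comment above mentions the ~ . -EXCLUDEDVARIABLE notation. As a small sketch of how that shorthand behaves, the example below fits the same model two ways on a stand-in data frame; demo, x1, x2, and y are hypothetical names used only for illustration.

```r
# Stand-in data frame: an ID column, two predictors, and an outcome.
set.seed(1)
demo <- data.frame(
  Observation.ID = 1:50,
  x1 = rnorm(50),
  x2 = rnorm(50)
)
demo$y <- 3 + 2 * demo$x1 - demo$x2 + rnorm(50)

# "." expands to every column except the outcome;
# "- Observation.ID" then removes the ID column from the predictors.
shorthand_model <- lm(y ~ . - Observation.ID, data = demo)
explicit_model  <- lm(y ~ x1 + x2, data = demo)

# The two formulas should yield the same coefficients.
all.equal(coef(shorthand_model), coef(explicit_model))
```

With the beach data, the same idea would mean subtracting identifier columns such as Observation.ID and Beach.ID rather than listing every predictor by name.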

Item 5: What variables are positively correlated with daily attendance? What variables are negatively correlated with daily attendance?

Positively associated variables (attendance rises as they increase). These variables have positive coefficient estimates in the regression output above:

Is.Holiday (164.905): Holidays have higher attendance.
Is.Weekend (474.971): Weekends have higher attendance.
Rain.Index (116.631): A higher rain index is associated with higher attendance, which is counterintuitive and worth checking against the data.
AverageTemperature (6.235): Higher temperatures slightly increase attendance.

Negatively associated variables (attendance falls as they increase). These variables have negative coefficient estimates:

Month (-69.878): Attendance decreases as the month number increases (possibly a seasonal effect captured by month).
Latitude (-1021.851): Higher latitude (farther north) is associated with lower attendance.
Longitude (-2251.013): Higher longitude (farther east) is associated with lower attendance.
Wave.Action (-4.771): Higher wave action slightly reduces attendance.

Each coefficient describes the association with attendance while holding the other predictors constant, so its sign can differ from the simple pairwise correlation.

Conclusion: Weekends, holidays, warmer temperatures, and a higher rain index are associated with higher attendance, while later months, higher latitude and longitude, and stronger wave action are associated with lower attendance.
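
Reading signs off the printed table by hand is error-prone, so the grouping can also be done in code. This is a rough sketch on simulated data; demo, x1, x2, and y are hypothetical stand-ins, and the same pattern would apply to coef(linear_regression_model).

```r
# Simulated data with one positive and one negative true effect.
set.seed(2)
demo <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
demo$y <- 5 + 1.5 * demo$x1 - 0.8 * demo$x2 + rnorm(200)

model <- lm(y ~ x1 + x2, data = demo)

# Drop the intercept, then group predictors by the sign of their estimate.
slopes   <- coef(model)[-1]
positive <- names(slopes)[slopes > 0]
negative <- names(slopes)[slopes < 0]

# Simple pairwise correlations with the outcome, for comparison.
pairwise <- sapply(demo[c("x1", "x2")], function(col) cor(col, demo$y))

list(positive = positive, negative = negative, pairwise = round(pairwise, 3))
```

Comparing the two views shows whether a predictor's partial effect and its raw correlation point in the same direction.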

Item 6: What is the R-squared on the training data?

From the regression output: Multiple R-squared: 0.173; Adjusted R-squared: 0.1711. The model explains about 17% of the variance in daily attendance on the training data.

Item 7: Use your trained linear regression model to produce predictions on the validation and test data.

beachattendance_validation_predictions <- predict(linear_regression_model, newdata=beachattendance_validation)
beachattendance_validation$LINEAR_PRED = beachattendance_validation_predictions ##saves predictions into validation dataframe
beachattendance_test_predictions <- predict(linear_regression_model, newdata=beachattendance_test)
beachattendance_test$LINEAR_PRED = beachattendance_test_predictions ##saves prediction into test set dataframe

Item 8: Produce evaluation metrics for the predictions made on both the validation and test datasets. What is the mean absolute deviation on the test data? What does this metric mean?

postResample(pred = beachattendance_validation$LINEAR_PRED, obs = beachattendance_validation$Daily.Attendance) ##evaluating validation predictions
##         RMSE     Rsquared          MAE 
## 1.790283e+03 3.564157e-02 8.678430e+02
postResample(pred = beachattendance_test$LINEAR_PRED, obs = beachattendance_test$Daily.Attendance) ##evaluating test predictions
##         RMSE     Rsquared          MAE 
## 1.207379e+03 8.869161e-02 7.979258e+02

Mean Absolute Deviation (MAD) on Test Data = 797.93 (reported by postResample as MAE, 7.979258e+02)

What does this metric mean?

The mean absolute deviation, which postResample labels MAE (Mean Absolute Error), is the average absolute difference between the predicted and actual values. A value of 797.93 means that, on average, the model’s predictions of daily attendance on the test data are off by about 798 people.

A lower MAD is better because it means predictions are closer to actual values.
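
To make the definition concrete, the three metrics can be computed by hand in base R. The values in obs and pred below are made-up numbers for illustration, not rows from the beach data; the R-squared line follows the squared-correlation definition that caret's postResample uses.

```r
# Hypothetical observed and predicted attendance for four days.
obs  <- c(3000, 2500, 4000, 3500)
pred <- c(2800, 2900, 3700, 3600)

errors <- obs - pred            # 200, -400, 300, -100

mae  <- mean(abs(errors))       # average size of the miss, ignoring direction
rmse <- sqrt(mean(errors^2))    # penalizes large misses more heavily than MAE
r2   <- cor(pred, obs)^2        # squared correlation between predictions and actuals

c(RMSE = rmse, Rsquared = r2, MAE = mae)
```

Here the absolute errors are 200, 400, 300, and 100, so the MAE is 250: on average these predictions miss by 250 people.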

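Item 9: Is this model overfit? Support your answer with specific evaluation metrics.

Yes, the model shows signs of overfitting, although the evidence is mixed.

R-squared falls from 0.173 on the training data to 0.0356 on the validation data and 0.0887 on the test data, so the model explains far less of the variance in data it was not trained on. The error metrics are less clear-cut: validation RMSE (1790.28) is higher than the training residual standard error (1402, roughly comparable to a training RMSE), but test RMSE (1207.38) is lower, and test MAE (797.93) is lower than validation MAE (867.84). Overall, the model has little explanatory power on any split, and the drop in R-squared on held-out data indicates that it does not generalize well.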
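
An overfitting check is easier to read when the same metrics are laid out side by side for every split. The sketch below does this on simulated data; demo, splits, and split_metrics are hypothetical names, and the 420/90/90 row counts are only an illustrative 70/15/15 division.

```r
# Simulated data split into training, validation, and test sets.
set.seed(3)
n <- 600
demo <- data.frame(x = rnorm(n))
demo$y <- 10 + 2 * demo$x + rnorm(n)

idx <- sample.int(n)
splits <- list(
  training   = demo[idx[1:420], ],
  validation = demo[idx[421:510], ],
  test       = demo[idx[511:600], ]
)

# Fit on the training split only.
model <- lm(y ~ x, data = splits$training)

# Compute RMSE, R-squared, and MAE for one split.
split_metrics <- function(data) {
  pred <- predict(model, newdata = data)
  c(RMSE     = sqrt(mean((data$y - pred)^2)),
    Rsquared = cor(pred, data$y)^2,
    MAE      = mean(abs(data$y - pred)))
}

# One row per split, one column per metric.
comparison <- t(sapply(splits, split_metrics))
round(comparison, 3)
```

A model that overfits would typically show clearly better numbers in the training row than in the validation and test rows.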