Item 1: Load the caret, ggplot2, and dplyr libraries.
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
Item 2: Import the Beach Attendance.csv file into an R data frame called “beachattendance”. Put this into a code chunk
beachattendance <- read.csv("C:/Users/ajpay/OneDrive/Documents/MSBA/SCH-MGMT 655 - Machine Learning/Assignments/Assignment 3/Beach Attendance.csv")
head(beachattendance)
## Observation.ID Beach.ID Daily.Attendance Month Latitude Longitude Is.Holiday
## 1 1 3 3826 1 52.89 20.777 0
## 2 2 3 2862 1 52.89 20.777 0
## 3 3 3 3895 1 52.89 20.777 0
## 4 4 3 3354 1 52.89 20.777 0
## 5 5 3 3017 1 52.89 20.777 0
## 6 6 3 3458 1 52.89 20.777 0
## Is.Weekend Rain.Index AverageTemperature Wave.Action
## 1 0 0.0 84 0.0
## 2 0 0.0 83 0.0
## 3 0 0.0 70 0.0
## 4 0 10.9 87 65.4
## 5 1 0.0 84 0.0
## 6 0 0.0 63 0.0
nrow(beachattendance)
## [1] 4828
Item 3: Partition the data so that 70% goes into training, 15% goes into validation, and 15% goes into test.
n_rows <- nrow(beachattendance)
train_idx <- sample.int(n = n_rows, size = floor(n_rows * 0.70), replace = FALSE) ##Row indices for the 70% training sample
beachattendance_training <- beachattendance[train_idx, ] ##Yields training dataset
beachattendance_validationtest <- beachattendance[-train_idx, ] ##Yields the remaining 30% (validation + test)
n_holdout <- nrow(beachattendance_validationtest)
validation_idx <- sample.int(n = n_holdout, size = floor(n_holdout * 0.5), replace = FALSE) ##Half of the remaining block goes to validation, i.e. 15% of the full data
beachattendance_validation <- beachattendance_validationtest[validation_idx, ] ##Yields validation dataset
beachattendance_test <- beachattendance_validationtest[-validation_idx, ] ##Yields test dataset
head(beachattendance_training)
## Observation.ID Beach.ID Daily.Attendance Month Latitude Longitude
## 2765 2832 2 1300 7 52.166 20.967
## 4559 4671 2 3040 12 52.166 20.967
## 2301 2355 2 2736 5 52.166 20.967
## 3554 3630 1 1950 9 50.101 21.999
## 4348 4450 2 2044 11 52.166 20.967
## 2407 2462 2 3010 6 52.166 20.967
## Is.Holiday Is.Weekend Rain.Index AverageTemperature Wave.Action
## 2765 0 0 0 72 0
## 4559 1 0 0 84 0
## 2301 0 0 0 92 0
## 3554 0 0 0 79 0
## 4348 0 0 0 84 0
## 2407 0 0 0 74 0
head(beachattendance_validation)
## Observation.ID Beach.ID Daily.Attendance Month Latitude Longitude
## 3887 3977 2 2672 10 52.166 20.967
## 651 653 3 2759 1 52.890 20.777
## 2653 2716 3 2094 6 52.890 20.777
## 4635 4750 2 1644 12 52.166 20.967
## 4665 4781 3 3260 12 52.890 20.777
## 1054 1069 2 3139 2 52.166 20.967
## Is.Holiday Is.Weekend Rain.Index AverageTemperature Wave.Action
## 3887 1 0 0 98 0
## 651 0 1 0 61 0
## 2653 0 1 0 76 0
## 4635 0 0 0 63 0
## 4665 0 0 0 79 0
## 1054 0 0 0 73 0
head(beachattendance_test)
## Observation.ID Beach.ID Daily.Attendance Month Latitude Longitude Is.Holiday
## 12 12 2 5814 1 52.166 20.967 0
## 20 20 1 3594 1 50.101 21.999 0
## 33 33 2 4621 1 52.166 20.967 0
## 46 46 2 4458 1 52.166 20.967 0
## 58 58 3 4061 1 52.890 20.777 0
## 61 61 3 2904 1 52.890 20.777 0
## Is.Weekend Rain.Index AverageTemperature Wave.Action
## 12 1 0.0 71 0.0
## 20 1 0.0 90 0.0
## 33 0 0.0 88 0.0
## 46 0 0.3 88 2.7
## 58 0 0.0 64 0.0
## 61 0 0.0 63 0.0
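Because the partition above is drawn without a fixed seed, every re-run produces different training, validation, and test sets, and therefore different coefficients and metrics. A minimal sketch of a reproducible 70/15/15 split follows. The data frame `beach` is a hypothetical stand-in (only the row count of 4828 comes from this report), and the seed value is an arbitrary assumption.

```r
# Hypothetical stand-in for the real data; only the row count is taken from the report.
beach <- data.frame(Observation.ID = seq_len(4828))

set.seed(655)  # arbitrary seed (an assumption); any fixed value makes the split repeatable

n <- nrow(beach)
train_idx <- sample.int(n, size = floor(0.70 * n))
training <- beach[train_idx, , drop = FALSE]
holdout <- beach[-train_idx, , drop = FALSE]

# Half of the 30% holdout goes to validation, the rest to test.
valid_idx <- sample.int(nrow(holdout), size = floor(0.50 * nrow(holdout)))
validation <- holdout[valid_idx, , drop = FALSE]
test <- holdout[-valid_idx, , drop = FALSE]

# The three pieces should be disjoint and together cover every row once.
split_sizes <- c(training = nrow(training),
                 validation = nrow(validation),
                 test = nrow(test))
print(split_sizes)
```

With 4828 rows this gives 3379 training, 724 validation, and 725 test rows, which agrees with the 3370 residual degrees of freedom plus 9 coefficients reported for the regression.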
Item 4: Run a multiple linear regression that predicts the daily attendance of a beach using all the usable predictor variables available in the dataset.
linear_regression_model <- lm(Daily.Attendance ~ Month + Latitude + Longitude + Is.Holiday + Is.Weekend + Rain.Index + AverageTemperature + Wave.Action, data=beachattendance_training) ##Or, can use all predictors except one using the ~ . -EXCLUDEDVARIABLE notation
summary(linear_regression_model) ##Outputs summary of model & coefficients
##
## Call:
## lm(formula = Daily.Attendance ~ Month + Latitude + Longitude +
## Is.Holiday + Is.Weekend + Rain.Index + AverageTemperature +
## Wave.Action, data = beachattendance_training)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12097.3 -628.4 2.0 544.0 19625.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 102692.803 17294.212 5.938 3.18e-09 ***
## Month -69.878 6.556 -10.658 < 2e-16 ***
## Latitude -1021.851 176.262 -5.797 7.36e-09 ***
## Longitude -2251.013 387.199 -5.814 6.69e-09 ***
## Is.Holiday 164.905 54.921 3.003 0.0027 **
## Is.Weekend 474.971 65.491 7.252 5.05e-13 ***
## Rain.Index 116.631 6.783 17.194 < 2e-16 ***
## AverageTemperature 6.235 2.069 3.013 0.0026 **
## Wave.Action -4.771 2.032 -2.348 0.0189 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1402 on 3370 degrees of freedom
## Multiple R-squared: 0.173, Adjusted R-squared: 0.1711
## F-statistic: 88.14 on 8 and 3370 DF, p-value: < 2.2e-16
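The comment on the `lm()` call mentions the `~ . - EXCLUDEDVARIABLE` shorthand. The sketch below illustrates that the shorthand and the spelled-out formula fit the same model. The data frame `toy` and its values are simulated for illustration only; just the column names are borrowed from the dataset.

```r
set.seed(1)  # arbitrary seed for the simulated data

# Simulated stand-in data; values are made up for illustration.
n <- 200
toy <- data.frame(
  Observation.ID = seq_len(n),
  Beach.ID = sample(1:3, n, replace = TRUE),
  Is.Weekend = rbinom(n, 1, 0.3),
  AverageTemperature = round(runif(n, 60, 100))
)
toy$Daily.Attendance <- 2000 + 500 * toy$Is.Weekend +
  5 * toy$AverageTemperature + rnorm(n, sd = 300)

# Spelled-out predictors.
explicit <- lm(Daily.Attendance ~ Is.Weekend + AverageTemperature, data = toy)

# "." means every other column; the identifier columns are then subtracted.
dotted <- lm(Daily.Attendance ~ . - Observation.ID - Beach.ID, data = toy)

same_fit <- isTRUE(all.equal(coef(explicit)[names(coef(dotted))], coef(dotted)))
print(same_fit)
```

The dotted form is shorter, but it silently picks up any column added to the data frame later, so the explicit form can be the safer choice when the data frame is modified after fitting (for example, by appending a prediction column).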
Item 5: What variables are positively correlated with daily attendance? What variables are negatively correlated with daily attendance?
Positive coefficient estimates (predicted attendance rises as the variable increases, holding the other predictors constant):
Is.Holiday (164.905): Holidays are associated with higher attendance.
Is.Weekend (474.971): Weekends are associated with higher attendance.
Rain.Index (116.631): A higher rain index is associated with higher attendance, which is counterintuitive and worth checking against how the index is defined.
AverageTemperature (6.235): Higher temperatures are associated with slightly higher attendance.
Negative coefficient estimates (predicted attendance falls as the variable increases, holding the other predictors constant):
Month (-69.878): Attendance declines as the month number increases, possibly a seasonal effect.
Latitude (-1021.851): Higher latitude (farther north) is associated with lower attendance.
Longitude (-2251.013): Higher longitude (farther east) is associated with lower attendance.
Wave.Action (-4.771): Higher wave action is associated with slightly lower attendance.
Conclusion: Weekends, holidays, a higher rain index, and warmer temperatures are associated with higher attendance, while the location variables (latitude, longitude), month, and wave action are associated with lower attendance. All eight estimates are significant at the 5% level. These are regression coefficients, so each sign describes the relationship after adjusting for the other predictors and need not match the sign of the simple pairwise correlation.
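The grouping by sign can be read directly from the coefficient estimates. As a small sketch, the vector below copies the estimates printed in the `summary()` output under Item 4; with the fitted model in memory, `coef(linear_regression_model)[-1]` would supply the same values without retyping them.

```r
# Estimates copied from the summary() output shown under Item 4 (intercept omitted).
estimates <- c(
  Month = -69.878,
  Latitude = -1021.851,
  Longitude = -2251.013,
  Is.Holiday = 164.905,
  Is.Weekend = 474.971,
  Rain.Index = 116.631,
  AverageTemperature = 6.235,
  Wave.Action = -4.771
)

positive <- names(estimates)[estimates > 0]
negative <- names(estimates)[estimates < 0]

print(positive)  # Is.Holiday, Is.Weekend, Rain.Index, AverageTemperature
print(negative)  # Month, Latitude, Longitude, Wave.Action
```

Reading the signs from the model object rather than by hand keeps the written answer in step with whatever partition the model was last fit on.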
Item 6: What is the R-squared on the training data?
From the regression output: Multiple R-squared: 0.173, Adjusted R-squared: 0.1711. The model explains about 17% of the variance in daily attendance on the training data.
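The R-squared values can also be pulled from the model object instead of being copied from the printout. The sketch below uses the built-in `mtcars` data as a stand-in (an assumption for illustration, not the beach data) and checks the extracted values against the textbook formulas.

```r
# Stand-in model on a built-in dataset; the beach model would be used the same way.
fit <- lm(mpg ~ wt + hp, data = mtcars)
fit_summary <- summary(fit)

r2 <- fit_summary$r.squared
adj_r2 <- fit_summary$adj.r.squared

# Manual check: R^2 = 1 - SS_residual / SS_total.
ss_res <- sum(residuals(fit)^2)
ss_tot <- sum((mtcars$mpg - mean(mtcars$mpg))^2)
r2_manual <- 1 - ss_res / ss_tot

# Adjusted R^2 penalizes for the number of predictors p.
n <- nrow(mtcars)
p <- 2
adj_r2_manual <- 1 - (1 - r2_manual) * (n - 1) / (n - p - 1)

print(c(r2 = r2, r2_manual = r2_manual))
print(c(adj_r2 = adj_r2, adj_r2_manual = adj_r2_manual))
```

Applied to the beach model, `summary(linear_regression_model)$r.squared` would return the training R-squared for whichever partition was drawn in that run.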
Item 7: Use your trained linear regression model to produce predictions on the validation and test data.
beachattendance_validation_predictions <- predict(linear_regression_model, newdata = beachattendance_validation)
beachattendance_validation$LINEAR_PRED <- beachattendance_validation_predictions ##saves predictions into validation dataframe
beachattendance_test_predictions <- predict(linear_regression_model, newdata = beachattendance_test)
beachattendance_test$LINEAR_PRED <- beachattendance_test_predictions ##saves predictions into test dataframe
Item 8: Produce evaluation metrics for the predictions made on both the validation and test datasets. What is the mean absolute deviation on the test data? What does this metric mean?
postResample(pred = beachattendance_validation$LINEAR_PRED, obs = beachattendance_validation$Daily.Attendance) ##evaluating validation predictions
## RMSE Rsquared MAE
## 1.790283e+03 3.564157e-02 8.678430e+02
postResample(pred = beachattendance_test$LINEAR_PRED, obs = beachattendance_test$Daily.Attendance) ##evaluating test predictions
## RMSE Rsquared MAE
## 1.207379e+03 8.869161e-02 7.979258e+02
Mean Absolute Deviation (MAD) on Test Data = 797.93 (reported by postResample as MAE, 7.979258e+02)
What does this metric mean?
The MAD, which caret labels MAE (Mean Absolute Error), is the average absolute difference between the predicted and actual values. A MAD of 797.93 means that, on average, the model’s predictions for daily attendance on the test data are off by about 798 people.
A lower MAD is better because it means predictions are closer to actual values. For comparison, the MAD on the validation data is 867.84.
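To make the definition concrete, the sketch below computes the metric by hand in base R. The observed values are the first six Daily.Attendance figures from `head(beachattendance)`; the predictions are hypothetical numbers chosen only for illustration.

```r
# Observed values: first six rows of Daily.Attendance shown under Item 2.
obs <- c(3826, 2862, 3895, 3354, 3017, 3458)

# Hypothetical predictions, invented for this example.
pred <- c(3500, 3000, 3600, 3400, 3300, 3200)

abs_err <- abs(obs - pred)            # 326 138 295 46 283 258
mae <- mean(abs_err)                  # 1346 / 6, about 224.33
rmse <- sqrt(mean((obs - pred)^2))    # squares the errors, so large misses weigh more

print(c(MAE = mae, RMSE = rmse))
```

RMSE is never smaller than MAE on the same predictions, and a wide gap between the two suggests a few very large errors. That is consistent with the regression output above, where residuals range from about -12097 to 19625.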
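Item 9: Is this model overfit? Support your answer with specific evaluation metrics.
Yes, there are signs of overfitting, although the evidence is mixed and the fit is weak on every dataset.
R-squared falls from 0.173 on the training data to 0.036 on validation and 0.089 on test, so the model explains less of the variation in data it was not trained on. Validation RMSE (1790.28) is also above the training residual standard error (1402). The test errors, however, are lower than the validation errors (RMSE 1207.38 vs. 1790.28; MAE 797.93 vs. 867.84), so part of the gap likely reflects which rows happened to fall into each random partition. With a training R-squared of only 0.173, the model has limited explanatory power even on the data it was fit to.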
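An overfitting check is easier to read when the same metrics are computed for all three partitions side by side. The sketch below shows one way to do that in base R. The helper `eval_split` and the simulated data frame `toy` are hypothetical names introduced for this example; the R-squared here is the squared correlation between predictions and observations, which is how caret's `postResample` computes it.

```r
# Hypothetical helper: RMSE, squared-correlation R^2, and MAE for one dataset.
eval_split <- function(model, data, response) {
  obs <- data[[response]]
  pred <- predict(model, newdata = data)
  c(RMSE = sqrt(mean((obs - pred)^2)),
    Rsquared = cor(pred, obs)^2,
    MAE = mean(abs(obs - pred)))
}

set.seed(42)  # arbitrary seed for the simulated data

# Simulated stand-in data with a 70/15/15 split.
n <- 600
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- 3 + 2 * toy$x1 - toy$x2 + rnorm(n)
split_label <- sample(rep(c("training", "validation", "test"),
                          times = c(420, 90, 90)))
pieces <- split(toy, split_label)

fit <- lm(y ~ x1 + x2, data = pieces$training)

# One row per partition, one column per metric.
metrics <- t(sapply(pieces[c("training", "validation", "test")],
                    function(d) eval_split(fit, d, "y")))
print(round(metrics, 3))
```

A model that generalizes well shows similar rows for all three partitions. A training row that is clearly better than the other two points to overfitting, while poor values in every row point to a model that is too weak for the data.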