EDA and Modeling - Scallop landings in Massachusetts

Exploratory data analysis

Defining the dependent variable

Sea scallops are the highest-revenue commercial catch in Massachusetts. This is consistent across multiple years. For this reason, scallops are economically significant for the region. Furthermore, predictions on the fleet’s performance will be of particular interest.

Exports of scallops

Both the metric tons landed and average annual price per pound grow over the period and accelerate after 2000. One metric that can be indicative of total demand for scallops is the annual exports.

## `summarise()` has grouped output by 'Year'. You can override using the
## `.groups` argument.

Interestingly, the top importers of scallops from the US had an upward trend from 1990-2015, followed by a sharp downward shift. This slowdown in exports coincides with a period of rising prices and greater volume of landings, which suggests that exports may not be a strong predictor of catch. Still, total exports will be a possible independent/predictor variable in the models built.

Regulation of the industry

A major factor in the actions of Massachusetts scallop fishing fleet members is the number of days they are allowed at sea, the amount of scallops they are able to possess, and the territories they are allowed to harvest in. Days at sea (DAS) and possession limit data are available from regulatory bodies going back to 1999.

The regulatory data shows a few interesting trends. While the total days at sea has declined steadily over time, possession limits, the amount a vessel is allowed to hold onboard at one time, fluctuates. These factors drive the total annual allowed harvest, which has shifted in a sort of “M” pattern. One important element of this: these regulations took effect in the late 1990’s. Using them in a model would require a dramatic drop in the number of observations. To avoid loss of observations, an index of restrictiveness is later created wherein all values prior to 1999 are 0.

Demand from other sectors

Input-output (IO) data from the BEA was used to identify domestic drivers of demand. Industries that contribute high shares of total output in the fishing, hunting and trapping industry were isolated by sorting the total demand for that industry (contained in the rows of the IO table). The industries most reliant upon fishing, hunting and trapping were food manufacturing and food services establishments. These findings are intuitive - those industries would naturally make more use of products like fresh/frozen scallops. The performance of these industries could be used to predict the total demand and total commercial landings of scallops.

## Warning: There were 7 warnings in `mutate()`.
## The first warning was:
## ℹ In argument: `across(everything(), as.numeric)`.
## Caused by warning:
## ! NAs introduced by coercion
## ℹ Run `dplyr::last_dplyr_warnings()` to see the 6 remaining warnings.

Feature creation: restrictiveness index

The index of regulatory restrictiveness is a weighted average of days at sea and the possession limit, both factors in how much the scalloping fleet can harvest in a year.

Correlations

Food manufacturing GDP and the price of lobster (a subtitute product) are positively correlated with the totaltons landed. The proxy variable for restrictiveness is negatively correlated. These are the variables used in the regression forecasts.

Modeling

Training

Training of the models used all years up to 2015.

The first model estimates landings using the previous year’s revenue total and the total harvest allowed per day. While the fit looks reasonable, the limited number of observations may suggest it is overfitting and may not perform well on test data. The model does have a high r-squared (0.65) and the variables are significant at the 95% level.

\[log(Metric tons) = \beta0 + log(harvest per day)\beta1 + log(Dollars(t-1))\beta2 + \epsilon\]

The second model has similar specification but instead considers the possession limit. R-squared is higher (0.9) and has all significant variables.

\[log(Metric tons) = \beta0 + log(possessionlimit)\beta1 + log(Dollars(t-1))\beta2 + \epsilon\] The third model uses the restrictiveness index and the previous year’s total revenue.

\[log(Metric tons) = \beta0 + log(restrictivenessindex)\beta1 + log(Dollars(t-1))\beta2 + \epsilon\]

The fourth model uses food manufacturing performance to estimate total landings. The coefficients were not statistically significant, which rules this model out.

The final model is an ARIMA specification. The fit appears very strong. This makes sense as ARIMA models are suited for variables with high autocorrelation as is the case here.

## Series: Metric_Tons 
## Model: TSLM 
## Transformation: log(Metric_Tons) 
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.38176 -0.10045  0.01901  0.12013  0.24544 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)  
## (Intercept)            5.1805     1.9435   2.665   0.0194 *
## log(Harvest_per_day)   0.2521     0.1572   1.604   0.1327  
## lag(log(Dollars), 1)   0.1465     0.1406   1.042   0.3163  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1861 on 13 degrees of freedom
## Multiple R-squared: 0.6481,  Adjusted R-squared: 0.594
## F-statistic: 11.97 on 2 and 13 DF, p-value: 0.001126

## Warning: Removed 49 rows containing missing values or values outside the scale range
## (`geom_line()`).

## Series: Metric_Tons 
## Model: TSLM 
## Transformation: log(Metric_Tons) 
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.17529 -0.02917  0.02147  0.05654  0.14995 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -2.89155    1.10232  -2.623   0.0211 *  
## lag(log(Dollars), 1)  0.25240    0.03916   6.445 2.18e-05 ***
## log(Poss_Limit)       0.77609    0.11226   6.913 1.06e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.09417 on 13 degrees of freedom
## Multiple R-squared: 0.9099,  Adjusted R-squared: 0.896
## F-statistic: 65.61 on 2 and 13 DF, p-value: 1.6102e-07

## Warning: Removed 49 rows containing missing values or values outside the scale range
## (`geom_line()`).

## Series: Metric_Tons 
## Model: TSLM 
## Transformation: log(Metric_Tons) 
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.0696 -0.3520  0.1134  0.3667  0.8266 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    4.04791    0.81452   4.970 5.74e-06 ***
## lag(log(Dollars), 1)           0.27171    0.04711   5.767 2.86e-07 ***
## RestrictivenessIndex_weighted  0.08361    0.14685   0.569    0.571    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5038 on 61 degrees of freedom
## Multiple R-squared: 0.3564,  Adjusted R-squared: 0.3353
## F-statistic: 16.89 on 2 and 61 DF, p-value: 1.4522e-06

## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_line()`).

## Series: Metric_Tons 
## Model: TSLM 
## Transformation: log(Metric_Tons) 
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.7601 -0.0730  0.0681  0.2142  0.3945 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)
## (Intercept)             -35.7075    30.5551  -1.169    0.261
## lag(log(Dollars), 1)      0.2171     0.3429   0.633    0.536
## log(Food_manufacturing)   6.2484     5.5861   1.119    0.281
## 
## Residual standard error: 0.3411 on 15 degrees of freedom
## Multiple R-squared: 0.6416,  Adjusted R-squared: 0.5939
## F-statistic: 13.43 on 2 and 15 DF, p-value: 0.00045431

## Warning: Removed 47 rows containing missing values or values outside the scale range
## (`geom_line()`).

## Series: Metric_Tons 
## Model: ARIMA(2,0,0) w/ mean 
## Transformation: log(Metric_Tons) 
## 
## Coefficients:
##          ar1      ar2  constant
##       1.2533  -0.3769    1.0811
## s.e.  0.1139   0.1149    0.0264
## 
## sigma^2 estimated as 0.05328:  log likelihood=3.57
## AIC=0.87   AICc=1.53   BIC=9.56

## Warning: The `...` argument of `PACF()` is deprecated as of feasts 0.2.2.
## ℹ ACF variables should be passed to the `y` argument. If multiple variables are
##   to be used, specify them using `vars(...)`.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

Forecast/Test

## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_line()`).

## [1]       NA 5284.290 3190.962 3645.608 3833.047 4070.344 3793.644 6689.741
## [9] 5783.706

## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_line()`).

## [1]       NA 4186.618 1568.229 2953.266 3138.396 2776.129 1597.016 3960.010
## [9] 1745.777

## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_line()`).

## [1]       NA 5785.956 5979.918 7895.039 8200.879 6133.393 6220.552 7019.962
## [9] 6381.695

## [1] 8550.122 8326.095 8324.393 8353.119 8368.664 8368.256 8357.096 8341.463
## [9] 8324.696

RMSE using the test data is lowest for model 2, it is the initial pick. Additional work is needed to refine the model. Specifically, autocorrelation needs to be addressed. To do so, an AR(1) term is added to the model specification.

## Series: Metric_Tons 
## Model: LM w/ ARIMA(1,0,0) errors 
## Transformation: log(Metric_Tons) 
## 
## Coefficients:
##           ar1  lag(log(Dollars), 1)  log(Poss_Limit)  intercept
##       -0.1615                0.2521           0.7806    -2.9274
## s.e.   0.2806                0.0313           0.0920     0.9002
## 
## sigma^2 estimated as 0.001849:  log likelihood=16.92
## AIC=-23.85   AICc=-22.83   BIC=-12.98

## Warning: Removed 49 rows containing missing values or values outside the scale range
## (`geom_line()`).

## [1] 5068.5452 3897.0556  787.1276 2616.5980 2792.3997 2275.0285  746.2515
## [8] 3596.0394 1177.9131

Model selection

Given the improvement to RMSE, model 2.1, which uses an AR(1) term, is the final model.

\[log(Metric tons) = \beta0 + log(possessionlimit)\beta1 + log(Dollars(t-1))\beta2 + AR(1) + \epsilon\]

EDA and Modeling - Scallop landings in Massachusetts

Daniel McLaughlin

2024-12-01