How do livestock size, land use, and location influence whether a Maryland animal feeding operation has an active permit?
The dataset I am using comes from the Maryland Department of the Environment Open Data Portal and includes information on Animal Feeding Operations (AFOs) and Concentrated Animal Feeding Operations (CAFOs) in Maryland. Each row represents one agricultural facility, with its permit and operational details.
The dataset contains 679 observations and 61 variables, including permit status, livestock populations, land use, and geographic location. It also includes administrative information about each facility’s permit type and activity.
For this project, I am going to be focusing on a smaller set of key variables which are relevant for predicting whether a facility has an active permit when using logistic regression.
These selected variables allow me to examine how farm size, livestock intensity, and geographic location can influence the permit status while keeping the model interpretable. This approach helps me simplify my analysis while still seeing meaningful patterns in agricultural operations across Maryland.
Permit Status (categorical outcome variable) This indicates whether a facility's permit is recorded as Issued (active) or not. This is what I am trying to predict.
Cattle - Includes Heifers (quantitative) This represents the number of cattle at each facility. It helps measure farm size and livestock intensity.
Chickens - Not Laying Hens (quantitative) This measures poultry population and helps capture the scale of animal production.
Swine >= 55 lbs (quantitative) This shows the number of larger pigs on the farm which represents the livestock density.
Acres Controlled (quantitative) This represents how much land the operation uses, which may influence permit approval.
Latitude & Longitude (quantitative) These show the geographic location of each facility, which helps display the regional differences.
Permit Type Activity (categorical) This describes the type of operation and permit action, for example CAFO (New) or CAFO (Renew).
County (categorical) This identifies the location of each facility within Maryland.
# Citations/Disclaimer: This code and analysis follow what I learned from course/class notes.
# In this chunk, I load the libraries that I need for my analysis.
# dplyr helps me clean and manipulate data,
# ggplot2 helps me create visualizations,
# readr helps me import the dataset, and pROC is used later for model evaluation.
library(readr)
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.5.2
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.5.2
library(pROC)
## Warning: package 'pROC' was built under R version 4.5.2
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
cafo <- read_csv("Maryland_Department_of_the_Environment_-_LMA_Resource_Management_Program,_Animal_Feeding_Operations_Division(AFO)_-__Permits_20260422.csv")
## Rows: 679 Columns: 61
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (22): DOCUMENT, PERMIT TYPE ACTIVITY, PERMIT CLASS, ALTERNATE AI ID, NP...
## dbl (7): SITE NO, LATITUDE, LONGITUDE, PROPOSED BUILDING QTY, EXISTING BUI...
## num (9): ACRES CONTROLLED, CHICKENS- NOT LAYING HENS-DRY, LAYING HENS - DR...
## lgl (10): ORGANIC, CONTESTED CASE HEARING EXPIRES, CONTESTED CASE HEARING R...
## date (13): NOI RECEIVED, WITHDRAWN DATE, CNMP RECEIVED, CNMP EXPIRES, NMP RE...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# I read in the Maryland AFO/CAFO dataset, which contains information about animal feeding operations, including permits, land use, and location.
names(cafo)
## [1] "DOCUMENT" "SITE NO"
## [3] "PERMIT TYPE ACTIVITY" "PERMIT CLASS"
## [5] "ALTERNATE AI ID" "NPDES"
## [7] "REGISTRATION NO." "SITE NAME"
## [9] "STREET ADDRESS" "CITY, STATE ZIP"
## [11] "MAILING STREET ADDRESS" "MAILING CITY, STATE ZIP"
## [13] "COUNTY" "LATITUDE"
## [15] "LONGITUDE" "NEW TECHNOLOGY"
## [17] "ORGANIC" "NOI RECEIVED"
## [19] "PERMIT START DATE" "PERMIT END DATE"
## [21] "WITHDRAWN DATE" "CNMP RECEIVED"
## [23] "CNMP EXPIRES" "NMP RECEIVED"
## [25] "NMP EXPIRES" "CONSERVATION PLAN RECEIVED"
## [27] "WEB REGISTRATION STATUS" "DEADLINE TO SUBMIT WRITTEN COMMENTS"
## [29] "DEADLINE TO REQUEST HEARING" "PUBLIC HEARING DATE"
## [31] "PUBLIC HEARING LOCATION" "CONTESTED CASE HEARING EXPIRES"
## [33] "CONTESTED CASE HEARING REQUESTED" "CONTESTED CASE HEARING DATE"
## [35] "CONTESTED CASE HEARING LOCATION" "INITIAL RENEWAL NOTIF. RECEIVED"
## [37] "INITIAL RENEWAL NOTIF. REVIEWED" "DATE FEE PAID"
## [39] "AMOUNT PAID" "DATE FEE REIMBURSED"
## [41] "AMOUNT REIMBURSED" "PROPOSED BUILDING QTY"
## [43] "EXISTING BUILDING QTY" "ACRES CONTROLLED"
## [45] "CHICKENS- NOT LAYING HENS-DRY" "LAYING HENS - DRY MANURE"
## [47] "CHICKENS - LIQUID MANURE" "DAIRY CATTLE"
## [49] "CATTLE - INCLUDES HEIFERS" "VEAL"
## [51] "SHEEP AND LAMBS" "DUCKS - DRY MANURE"
## [53] "DUCKS - LIQUID MANURE" "HORSES"
## [55] "SWINE >= 55 LBS" "SWINE < 55 LBS"
## [57] "TURKEYS" "FARM NAME"
## [59] "PERMIT CATEGORY" "PERMIT STATUS"
## [61] "ADMINISTRATIVELY EXTENDED?"
# I use the function head() to preview the first few rows so I can understand what each column looks like in the data
head(cafo)
## # A tibble: 6 × 61
## DOCUMENT `SITE NO` `PERMIT TYPE ACTIVITY` `PERMIT CLASS` `ALTERNATE AI ID`
## <chr> <dbl> <chr> <chr> <chr>
## 1 <NA> 585 CAFO (New) New 14AF
## 2 https://mde… 21912 CAFO (Renew) Renew 19AF
## 3 https://mde… 22189 CAFO (New) New 19AF
## 4 <NA> 22200 CAFO (Renew) Renew 14AF
## 5 <NA> 22941 CAFO (Renew) Renew 14AF
## 6 <NA> 23560 CAFO (New) New 09AF
## # ℹ 56 more variables: NPDES <chr>, `REGISTRATION NO.` <chr>,
## # `SITE NAME` <chr>, `STREET ADDRESS` <chr>, `CITY, STATE ZIP` <chr>,
## # `MAILING STREET ADDRESS` <chr>, `MAILING CITY, STATE ZIP` <chr>,
## # COUNTY <chr>, LATITUDE <dbl>, LONGITUDE <dbl>, `NEW TECHNOLOGY` <chr>,
## # ORGANIC <lgl>, `NOI RECEIVED` <date>, `PERMIT START DATE` <chr>,
## # `PERMIT END DATE` <chr>, `WITHDRAWN DATE` <date>, `CNMP RECEIVED` <date>,
## # `CNMP EXPIRES` <date>, `NMP RECEIVED` <date>, `NMP EXPIRES` <date>, …
# I used colSums(is.na()) in order to check missing values in each variable.
# This helps me decide the columns that I need to clean before analysis.
colSums(is.na(cafo))
## DOCUMENT SITE NO
## 251 0
## PERMIT TYPE ACTIVITY PERMIT CLASS
## 0 0
## ALTERNATE AI ID NPDES
## 0 19
## REGISTRATION NO. SITE NAME
## 4 0
## STREET ADDRESS CITY, STATE ZIP
## 0 0
## MAILING STREET ADDRESS MAILING CITY, STATE ZIP
## 0 0
## COUNTY LATITUDE
## 0 147
## LONGITUDE NEW TECHNOLOGY
## 147 675
## ORGANIC NOI RECEIVED
## 679 8
## PERMIT START DATE PERMIT END DATE
## 0 0
## WITHDRAWN DATE CNMP RECEIVED
## 596 30
## CNMP EXPIRES NMP RECEIVED
## 675 51
## NMP EXPIRES CONSERVATION PLAN RECEIVED
## 124 432
## WEB REGISTRATION STATUS DEADLINE TO SUBMIT WRITTEN COMMENTS
## 156 45
## DEADLINE TO REQUEST HEARING PUBLIC HEARING DATE
## 56 678
## PUBLIC HEARING LOCATION CONTESTED CASE HEARING EXPIRES
## 678 679
## CONTESTED CASE HEARING REQUESTED CONTESTED CASE HEARING DATE
## 679 679
## CONTESTED CASE HEARING LOCATION INITIAL RENEWAL NOTIF. RECEIVED
## 679 265
## INITIAL RENEWAL NOTIF. REVIEWED DATE FEE PAID
## 267 408
## AMOUNT PAID DATE FEE REIMBURSED
## 410 679
## AMOUNT REIMBURSED PROPOSED BUILDING QTY
## 679 132
## EXISTING BUILDING QTY ACRES CONTROLLED
## 132 0
## CHICKENS- NOT LAYING HENS-DRY LAYING HENS - DRY MANURE
## 50 670
## CHICKENS - LIQUID MANURE DAIRY CATTLE
## 679 661
## CATTLE - INCLUDES HEIFERS VEAL
## 664 679
## SHEEP AND LAMBS DUCKS - DRY MANURE
## 677 679
## DUCKS - LIQUID MANURE HORSES
## 678 676
## SWINE >= 55 LBS SWINE < 55 LBS
## 678 675
## TURKEYS FARM NAME
## 677 0
## PERMIT CATEGORY PERMIT STATUS
## 0 0
## ADMINISTRATIVELY EXTENDED?
## 0
# In this chunk, I clean the dataset and select only the variables relevant to my research question, which focuses on land use, location, and permit status.
# I selected the variables needed for my analysis:
# PERMIT STATUS is the outcome variable, which is what I am trying to predict
# PERMIT TYPE ACTIVITY describes the type of farming operation
# COUNTY captures the geographic differences across Maryland
# LATITUDE and LONGITUDE represent the facility location
# ACRES CONTROLLED measures the size of each operation
# I then filter out rows with missing values in these variables so the dataset is complete for analysis.
# I use the mutate() function to create a new binary variable called active_permit.
# This variable converts PERMIT STATUS into a numeric format:
# 1 = Issued, which is an active permit
# 0 = any other status, which I treat as not active
# This step is necessary because binary logistic regression requires a two-level outcome variable.
cafo_clean <- cafo %>%
  select(
    `PERMIT STATUS`,
    `PERMIT TYPE ACTIVITY`,
    COUNTY,
    LATITUDE,
    LONGITUDE,
    `ACRES CONTROLLED`
  ) %>%
  filter(
    !is.na(COUNTY),
    !is.na(LATITUDE),
    !is.na(LONGITUDE),
    !is.na(`ACRES CONTROLLED`),
    !is.na(`PERMIT TYPE ACTIVITY`),
    !is.na(`PERMIT STATUS`)
  ) %>%
  mutate(
    active_permit = ifelse(`PERMIT STATUS` == "Issued", 1, 0)
  )
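As a quick sanity check on how many rows survive the filter, the counts reported elsewhere in this output can be compared directly. This is only a sketch: it assumes LATITUDE and LONGITUDE are missing in the same 147 rows, which the matching NA counts suggest but the output does not confirm, and the variable names are my own.

```r
# Counts copied from the read_csv() message and the colSums(is.na()) output
n_raw <- 679             # rows in the raw file
n_missing_coords <- 147  # NA count for LATITUDE (and for LONGITUDE)

# Assumption: the same rows are missing both coordinates, and no other
# filtered column has missing values (their NA counts are all 0)
n_expected <- n_raw - n_missing_coords

# The null deviance in summary(model) is reported on 531 degrees of freedom,
# which corresponds to n - 1 for the rows actually used by glm()
n_from_model <- 531 + 1

c(expected = n_expected, from_model = n_from_model)
```

Both values are 532, so the filter removes 147 of the 679 facilities (about 22%), and the missing coordinates account for all of the dropped rows.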
# In this chunk, I convert categorical variables into factors.
# PERMIT TYPE ACTIVITY and COUNTY are categorical variables, meaning they represent groups rather than numeric values.
# Converting them to factors ensures that R treats them as categories in later analysis or modeling.
# This matters for regression because each category then gets its own level instead of being handled as plain text.
cafo_clean$`PERMIT TYPE ACTIVITY` <- as.factor(cafo_clean$`PERMIT TYPE ACTIVITY`)
cafo_clean$COUNTY <- as.factor(cafo_clean$COUNTY)
# In this chunk, I analyze how land use varies across different types of farming operations.
# I group the dataset by PERMIT TYPE ACTIVITY, which represents different categories of animal feeding operations, so I can compare them.
# I calculate summary statistics for each group:
# avg_acres is the average land size used by each operation type
# total_facilities is the number of facilities in each category
# active_count is the number of facilities with active permits in each category
# I also sort the results by avg_acres so I can see which operation types use the most land.
operation_summary <- cafo_clean %>%
  group_by(`PERMIT TYPE ACTIVITY`) %>%
  summarise(
    avg_acres = mean(`ACRES CONTROLLED`, na.rm = TRUE),
    total_facilities = n(),
    active_count = sum(active_permit)
  ) %>%
  arrange(desc(avg_acres))
operation_summary
## # A tibble: 7 × 4
## `PERMIT TYPE ACTIVITY` avg_acres total_facilities active_count
## <fct> <dbl> <int> <dbl>
## 1 MAFO (New) 235. 7 2
## 2 CAFO (Renew) 182. 341 242
## 3 MAFO (Renew) 141. 12 8
## 4 CAFO (New) 35.1 140 70
## 5 CAFO (Modify) 0 6 6
## 6 CAFO (Transfer) 0 23 19
## 7 COC (New) 0 3 1
# In this chunk, I explore the distribution of ACRES CONTROLLED.
# ACRES CONTROLLED is a quantitative variable that measures how much land each facility uses.
# I use summary() to look at its overall distribution.
# I use mean() to find the average land size across all farms.
# I use max() to identify the largest operation in the dataset.
summary(cafo_clean$`ACRES CONTROLLED`)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 0.0 0.0 132.2 0.0 5867.0
mean(cafo_clean$`ACRES CONTROLLED`, na.rm = TRUE)
## [1] 132.156
max(cafo_clean$`ACRES CONTROLLED`, na.rm = TRUE)
## [1] 5867
# In this chunk, I create a bar chart to compare the average land use in acres across different operation types.
# The variable `PERMIT TYPE ACTIVITY` represents the type of operation,
# while `avg_acres` represents the average number of acres controlled by each type.
# I use geom_col() because I am plotting already-summarized data (the average acres).
# I map fill to operation type so each category is visually distinct.
# I use a minimal theme to keep the graph clean and easy to read.
# I position the legend at the bottom so the chart does not get cluttered.
ggplot(operation_summary, aes(x = `PERMIT TYPE ACTIVITY`, y = avg_acres, fill = `PERMIT TYPE ACTIVITY`)) +
  geom_col() +
  labs(
    title = "Average Acres Controlled by Operation Type",
    x = "Operation Type",
    y = "Average Acres Controlled",
    caption = "Source: Maryland Department of the Environment"
  ) +
  theme_minimal() +
  theme(legend.position = "bottom")
# In this chunk, I build a logistic regression model to predict whether a facility has an active permit.
# active_permit is my dependent variable, where 1 = active permit and 0 = not active.
# ACRES CONTROLLED represents the size of the farm.
# LATITUDE and LONGITUDE represent the geographic location.
# I use the summary() function to interpret how each predictor affects the likelihood of an active permit.
model <- glm(
  active_permit ~ `ACRES CONTROLLED` + LATITUDE + LONGITUDE,
  data = cafo_clean,
  family = binomial
)
summary(model)
##
## Call:
## glm(formula = active_permit ~ `ACRES CONTROLLED` + LATITUDE +
## LONGITUDE, family = binomial, data = cafo_clean)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.785e+01 2.416e+01 -0.739 0.4601
## `ACRES CONTROLLED` 6.309e-04 3.296e-04 1.914 0.0556 .
## LATITUDE 2.995e-01 3.141e-01 0.953 0.3403
## LONGITUDE -9.085e-02 3.901e-01 -0.233 0.8159
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 686.12 on 531 degrees of freedom
## Residual deviance: 676.60 on 528 degrees of freedom
## AIC: 684.6
##
## Number of Fisher Scoring iterations: 4
# In this chunk, I calculate the confidence intervals and odds ratios for my logistic regression model.
# The confidence intervals show the range of plausible values for each predictor's true effect on the log-odds scale.
# If an interval includes 0 on the log-odds scale (equivalently, 1 on the odds-ratio scale), the effect is not statistically distinguishable from no effect.
# The odds ratios, exp(coef(model)), help me interpret the results in a more practical way.
# An odds ratio slightly above 1 for ACRES CONTROLLED means that as land size increases, the odds of having an active permit also increase slightly.
# This helps me see that land size has a small effect and that latitude and longitude do not show a meaningful influence on permit status.
confint(model)
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) -6.732794e+01 27.995802743
## `ACRES CONTROLLED` 5.336233e-05 0.001356396
## LATITUDE -3.143338e-01 0.919193883
## LONGITUDE -8.819336e-01 0.655015149
exp(coef(model))
## (Intercept) `ACRES CONTROLLED` LATITUDE LONGITUDE
## 1.774869e-08 1.000631e+00 1.349227e+00 9.131547e-01
From the odds ratios, I interpreted how each predictor relates to the odds of a facility having an active permit.
For ACRES CONTROLLED, the odds ratio is about 1.0006, which is slightly above 1. This means that each additional acre increases the odds of having an active permit very slightly. The per-acre effect is extremely small, so land size does not strongly influence permit status.
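Because the odds ratio for ACRES CONTROLLED is expressed per single acre, it can help to rescale it to a larger step. The sketch below copies the coefficient from the summary(model) output and converts it to an odds ratio per 100 acres; the 100-acre step and the variable names are illustrative choices of mine, not part of the original analysis.

```r
# Coefficient for ACRES CONTROLLED, copied from the summary(model) output
beta_acres <- 6.309e-04

# Odds ratio for a 1-acre increase (matches exp(coef(model)) above)
or_per_acre <- exp(beta_acres)

# Odds ratio for a 100-acre increase: multiply the log-odds by 100 first
or_per_100_acres <- exp(100 * beta_acres)

round(c(per_acre = or_per_acre, per_100_acres = or_per_100_acres), 4)
```

On this scale, 100 additional acres corresponds to odds that are roughly 6.5% higher, which is still a modest effect.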
For LATITUDE, the odds ratio is about 1.349, which suggests a positive relationship in which a higher latitude (farther north) is associated with higher odds of having an active permit. This result is not statistically significant (p = 0.34), so I cannot confidently say this relationship is meaningful.
For LONGITUDE, the odds ratio is about 0.913, which means that a higher longitude (farther east) is associated with slightly lower odds of having an active permit. This effect is also not statistically significant (p = 0.82).
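On the log-odds scale, the confidence intervals for LATITUDE and LONGITUDE include 0 (equivalently, their odds-ratio intervals include 1), so even the direction of those effects is uncertain. The interval for ACRES CONTROLLED (about 0.00005 to 0.00136) narrowly excludes 0 even though its Wald p-value is 0.0556, so the evidence for a land-size effect is borderline. Overall, none of the predictors strongly explains permit status on its own.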
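The intervals printed by confint(model) are on the log-odds scale, while the odds ratios are on the exponentiated scale. A minimal sketch of putting both on the same scale is shown below; the bounds are copied from the confint() output above, and the object names are my own.

```r
# Profile-likelihood bounds copied from the confint(model) output (log-odds scale)
ci_log_odds <- matrix(
  c(5.336233e-05, 0.001356396,
    -3.143338e-01, 0.919193883,
    -8.819336e-01, 0.655015149),
  ncol = 2, byrow = TRUE,
  dimnames = list(
    c("ACRES CONTROLLED", "LATITUDE", "LONGITUDE"),
    c("2.5 %", "97.5 %")
  )
)

# Exponentiate to get confidence intervals for the odds ratios
ci_odds_ratio <- exp(ci_log_odds)

# TRUE when the odds-ratio interval does not contain 1
excludes_one <- ci_odds_ratio[, "2.5 %"] > 1 | ci_odds_ratio[, "97.5 %"] < 1

round(ci_odds_ratio, 4)
excludes_one
```

With the fitted model object available, exp(confint(model)) would give the same intervals in one step.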
I used a logistic regression model to examine how land use and location affect whether a Maryland animal feeding operation has an active permit. My outcome variable is active_permit (1 = active, 0 = not active), and my predictors are ACRES CONTROLLED, LATITUDE, and LONGITUDE.
The results show that ACRES CONTROLLED has a small positive effect, which means larger farms are slightly more likely to have an active permit. However, this result is not statistically significant at the 0.05 level, since the p-value is 0.0556. Neither latitude nor longitude is statistically significant, which means that geographic location does not have a strong influence on permit status in this dataset.
The residual deviance (676.60) is only slightly lower than the null deviance (686.12), which suggests that these variables do not strongly explain permit status. Other factors, such as livestock type and regulatory category, may be more important in determining whether a facility has an active permit.
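The size of the model's improvement over an intercept-only model can be quantified by comparing the two deviances reported in summary(model). The sketch below runs a likelihood-ratio (chi-square) test by hand with base R's pchisq(); the numbers are copied from the output above, and the variable names are my own.

```r
# Deviances and degrees of freedom copied from the summary(model) output
null_dev <- 686.12
null_df <- 531
resid_dev <- 676.60
resid_df <- 528

# Drop in deviance and the number of predictors that produced it
dev_drop <- null_dev - resid_dev
df_used <- null_df - resid_df

# Upper-tail chi-square probability for the drop in deviance
p_value <- pchisq(dev_drop, df = df_used, lower.tail = FALSE)

c(dev_drop = dev_drop, df = df_used, p_value = p_value)
```

A drop of 9.52 on 3 degrees of freedom gives a p-value below 0.05, so the three predictors together fit detectably better than no predictors at all. The reduction is still only about 1.4% of the null deviance, which is consistent with the weak AUC reported later.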
# In this chunk, I convert model outputs into probabilities and predicted classes.
# I generate predicted probabilities of having an active permit.
# I convert probabilities into binary predictions using a 0.5 cutoff:
# if the probability is greater than 0.5, I predict an active permit (1); otherwise I predict 0.
cafo_clean$prob <- predict(model, type = "response")
cafo_clean$pred_class <- ifelse(cafo_clean$prob > 0.5, 1, 0)
# In this chunk, I create a confusion matrix to compare my predicted values to the actual values.
# This shows how many predictions were correct and where my model made mistakes.
# It also helps me evaluate how well my model is performing.
cm <- table(Predicted = cafo_clean$pred_class, Actual = cafo_clean$active_permit)
cm
## Actual
## Predicted 0 1
## 1 184 348
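Because table() only shows values that actually occur, the matrix above has a single Predicted row. One way to keep both rows is to convert the vectors to factors with explicit levels first. The data in this sketch are hypothetical toy values, not rows from the dataset.

```r
# Hypothetical toy vectors: every prediction is 1, as in the model above
actual <- c(0, 1, 1, 0, 1)
pred_class <- c(1, 1, 1, 1, 1)

# Explicit levels force table() to keep the empty "0" row
cm_full <- table(
  Predicted = factor(pred_class, levels = c(0, 1)),
  Actual = factor(actual, levels = c(0, 1))
)

cm_full
```

The resulting table is always 2 x 2, so cells such as cm_full["0", "0"] can be indexed safely even when a class is never predicted.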
# In this chunk, I evaluate how well my logistic regression model distinguishes between active and non-active permits.
roc_curve <- roc(cafo_clean$active_permit, cafo_clean$prob)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
# I plot the ROC curve to visually assess how well the model performs.
plot(roc_curve)
# I calculate the AUC to measure the strength of my model.
# A higher AUC indicates that the model is better at separating active from non-active permits.
auc(roc_curve)
## Area under the curve: 0.5376
I evaluated my logistic regression model using a confusion matrix and an ROC curve. The confusion matrix compares predicted and actual values, showing how many predictions were correct and where the model made mistakes. In my results, the model predicted every observation as active (1): it correctly identified all 348 active permits but incorrectly labeled all 184 inactive permits as active. This gives an accuracy of about 65% (348 of 532), a sensitivity of 100%, and a specificity of 0%, because the model never identifies an inactive permit.
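The rates implied by this confusion matrix can be computed directly from its two counts. A minimal sketch, with the counts copied from the table above and variable names of my own choosing:

```r
# Cell counts from the confusion matrix (the model never predicts 0)
tp <- 348  # predicted active, actually active
fp <- 184  # predicted active, actually not active
fn <- 0    # predicted not active, actually active
tn <- 0    # predicted not active, actually not active

accuracy <- (tp + tn) / (tp + fp + fn + tn)
sensitivity <- tp / (tp + fn)  # share of active permits correctly flagged
specificity <- tn / (tn + fp)  # share of inactive permits correctly flagged

round(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity), 3)
```

The accuracy of roughly 65% equals the share of active permits in the cleaned data, so the model does no better than always guessing "active".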
The ROC curve stays close to the diagonal line, and the AUC is 0.5376, which is only slightly better than random guessing (0.5). Numerically, this means that if I pick one active and one inactive facility at random, the model gives the active facility the higher predicted probability only about 53.76% of the time. This indicates the model has a weak ability to distinguish between active and non-active permits.
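AUC can be read as the share of (active, inactive) pairs in which the active facility receives the higher predicted probability. The sketch below checks that reading by hand using hypothetical toy values (not rows from the dataset) and base R only; ties are counted as half, which is the usual convention.

```r
# Hypothetical toy data: 3 active (1) and 2 inactive (0) facilities
actual <- c(1, 1, 1, 0, 0)
prob <- c(0.9, 0.6, 0.4, 0.5, 0.2)

pos <- prob[actual == 1]  # predicted probabilities for active facilities
neg <- prob[actual == 0]  # predicted probabilities for inactive facilities

# Compare every active/inactive pair: 1 if active ranks higher, 0.5 for a tie
pair_scores <- outer(pos, neg, ">") + 0.5 * outer(pos, neg, "==")

# AUC is the average score over all pairs
auc_manual <- mean(pair_scores)
auc_manual
```

Here 5 of the 6 pairs are ranked correctly, so the toy AUC is 5/6. By the same logic, an AUC of 0.5376 means barely more than half of such pairs are ranked correctly.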
In this project, I set out to examine how livestock size, land use, and geographic location influence whether a Maryland animal feeding operation has an active permit. I used a logistic regression model to see whether these factors help predict permit status; the fitted model included ACRES CONTROLLED, LATITUDE, and LONGITUDE as predictors.
I found that ACRES CONTROLLED had a positive relationship with having an active permit, which means larger farms are slightly more likely to be permitted. This effect is weak and not statistically significant at the 0.05 level. Latitude and longitude were not significant, which suggests that geographic location within Maryland is not clearly related to permit status in this dataset.
The model shows limited predictive power overall based on the confusion matrix and ROC curve. The AUC of 0.5376 indicates only a weak ability to distinguish between active and non-active permits, which means the selected predictors explain little of the variation in permit status.
This suggests that other factors, such as livestock type or inspection results, may play a larger role in permit decisions. A limitation of my model is that it uses only a few predictors and assumes a simple relationship between them and permit status.
In the future, I would like to include additional variables such as livestock counts and permit type, and to test more advanced models and interaction effects in order to improve the accuracy of my model.
This project helped me learn that while regression can identify patterns, real-world environmental decisions are often influenced by many complex factors and not just simple predictors.
Maryland Department of the Environment. Animal Feeding Operations (AFO) and Concentrated Animal Feeding Operations (CAFO) Permits Dataset. Maryland Open Data Portal.