Introduction

My Research Question

How do livestock size, land use, and location influence whether a Maryland animal feeding operation has an active permit ?

Description of my Dataset

The dataset I am using comes from the Maryland Department of the Environment Open Data Portal and includes information about Animal Feeding Operations (AFO) and Concentrated Animal Feeding Operations (CAFOs). Each row represents a permitted agricultural facility in Maryland.

The dataset has 697 observations and 62 variables which includes permit status, livestock counts, geographic location, and land usage information. Since this project focuses on regression analysis, I will only use a subset of these variables that are most relevant to predicting permit status.

Variables Used

Permit Status (categorical outcome variable) This tells whether a facility is officially “Issued” (active permit) or not. This is what I am trying to predict.

Cattle - Includes Heifers (quantitative) This represents the number of cattle at each facility. It helps measure farm size and livestock intensity.

Chickens - Not Laying Hens (quantitative) This measures poultry population and helps capture the scale of animal production.

Swine >= 55 lbs (quantitative) This shows the number of larger pigs on the farm, representing livestock density.

Acres Controlled (quantitative) This represents how much land the operation uses, which may influence permit approval.

Latitude & Longitude (quantitative) These show the geographic location of each facility, helping capture regional differences.

Permit Type Activity (categorical) This describes the type of agricultural operation.

County (categorical) This identifies the location of each facility within Maryland.

Importing Dataset

# In this chunk, I am loading the libraries that I need for my analysis.
# dplyr helps me clean and manipulate data, ggplot2 helps me create visualizations,
# readr helps me import the dataset, and pROC is used later for model evaluation.

library(readr)
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.5.2
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.5.2
library(pROC)
## Warning: package 'pROC' was built under R version 4.5.2
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
cafo <- read_csv("Maryland_Department_of_the_Environment_-_LMA_Resource_Management_Program,_Animal_Feeding_Operations_Division(AFO)_-__Permits_20260422.csv")
## Rows: 679 Columns: 61
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (22): DOCUMENT, PERMIT TYPE ACTIVITY, PERMIT CLASS, ALTERNATE AI ID, NP...
## dbl   (7): SITE NO, LATITUDE, LONGITUDE, PROPOSED BUILDING QTY, EXISTING BUI...
## num   (9): ACRES CONTROLLED, CHICKENS- NOT LAYING HENS-DRY, LAYING HENS - DR...
## lgl  (10): ORGANIC, CONTESTED CASE HEARING EXPIRES, CONTESTED CASE HEARING R...
## date (13): NOI RECEIVED, WITHDRAWN DATE, CNMP RECEIVED, CNMP EXPIRES, NMP RE...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# I am reading in the Maryland AFO/CAFO dataset, which contains information about animal feeding operations, including permits, land use, and location.
names(cafo)
##  [1] "DOCUMENT"                            "SITE NO"                            
##  [3] "PERMIT TYPE ACTIVITY"                "PERMIT CLASS"                       
##  [5] "ALTERNATE AI ID"                     "NPDES"                              
##  [7] "REGISTRATION NO."                    "SITE NAME"                          
##  [9] "STREET ADDRESS"                      "CITY, STATE ZIP"                    
## [11] "MAILING STREET ADDRESS"              "MAILING CITY, STATE ZIP"            
## [13] "COUNTY"                              "LATITUDE"                           
## [15] "LONGITUDE"                           "NEW TECHNOLOGY"                     
## [17] "ORGANIC"                             "NOI RECEIVED"                       
## [19] "PERMIT START DATE"                   "PERMIT END DATE"                    
## [21] "WITHDRAWN DATE"                      "CNMP RECEIVED"                      
## [23] "CNMP EXPIRES"                        "NMP RECEIVED"                       
## [25] "NMP EXPIRES"                         "CONSERVATION PLAN RECEIVED"         
## [27] "WEB REGISTRATION STATUS"             "DEADLINE TO SUBMIT WRITTEN COMMENTS"
## [29] "DEADLINE TO REQUEST HEARING"         "PUBLIC HEARING DATE"                
## [31] "PUBLIC HEARING LOCATION"             "CONTESTED CASE HEARING EXPIRES"     
## [33] "CONTESTED CASE HEARING REQUESTED"    "CONTESTED CASE HEARING DATE"        
## [35] "CONTESTED CASE HEARING LOCATION"     "INITIAL RENEWAL NOTIF. RECEIVED"    
## [37] "INITIAL RENEWAL NOTIF. REVIEWED"     "DATE FEE PAID"                      
## [39] "AMOUNT PAID"                         "DATE FEE REIMBURSED"                
## [41] "AMOUNT REIMBURSED"                   "PROPOSED BUILDING QTY"              
## [43] "EXISTING BUILDING QTY"               "ACRES CONTROLLED"                   
## [45] "CHICKENS- NOT LAYING HENS-DRY"       "LAYING HENS - DRY MANURE"           
## [47] "CHICKENS - LIQUID MANURE"            "DAIRY CATTLE"                       
## [49] "CATTLE - INCLUDES HEIFERS"           "VEAL"                               
## [51] "SHEEP AND LAMBS"                     "DUCKS - DRY MANURE"                 
## [53] "DUCKS - LIQUID MANURE"               "HORSES"                             
## [55] "SWINE >= 55 LBS"                     "SWINE < 55 LBS"                     
## [57] "TURKEYS"                             "FARM NAME"                          
## [59] "PERMIT CATEGORY"                     "PERMIT STATUS"                      
## [61] "ADMINISTRATIVELY EXTENDED?"
# I use head() to preview the first few rows so I can understand what each column looks like in real data.
head(cafo)
## # A tibble: 6 × 61
##   DOCUMENT     `SITE NO` `PERMIT TYPE ACTIVITY` `PERMIT CLASS` `ALTERNATE AI ID`
##   <chr>            <dbl> <chr>                  <chr>          <chr>            
## 1 <NA>               585 CAFO (New)             New            14AF             
## 2 https://mde…     21912 CAFO (Renew)           Renew          19AF             
## 3 https://mde…     22189 CAFO (New)             New            19AF             
## 4 <NA>             22200 CAFO (Renew)           Renew          14AF             
## 5 <NA>             22941 CAFO (Renew)           Renew          14AF             
## 6 <NA>             23560 CAFO (New)             New            09AF             
## # ℹ 56 more variables: NPDES <chr>, `REGISTRATION NO.` <chr>,
## #   `SITE NAME` <chr>, `STREET ADDRESS` <chr>, `CITY, STATE ZIP` <chr>,
## #   `MAILING STREET ADDRESS` <chr>, `MAILING CITY, STATE ZIP` <chr>,
## #   COUNTY <chr>, LATITUDE <dbl>, LONGITUDE <dbl>, `NEW TECHNOLOGY` <chr>,
## #   ORGANIC <lgl>, `NOI RECEIVED` <date>, `PERMIT START DATE` <chr>,
## #   `PERMIT END DATE` <chr>, `WITHDRAWN DATE` <date>, `CNMP RECEIVED` <date>,
## #   `CNMP EXPIRES` <date>, `NMP RECEIVED` <date>, `NMP EXPIRES` <date>, …
# I use colSums(is.na()) to check missing values in each variable.
# This helps me decide which columns need cleaning before analysis.
colnames(cafo)
##  [1] "DOCUMENT"                            "SITE NO"                            
##  [3] "PERMIT TYPE ACTIVITY"                "PERMIT CLASS"                       
##  [5] "ALTERNATE AI ID"                     "NPDES"                              
##  [7] "REGISTRATION NO."                    "SITE NAME"                          
##  [9] "STREET ADDRESS"                      "CITY, STATE ZIP"                    
## [11] "MAILING STREET ADDRESS"              "MAILING CITY, STATE ZIP"            
## [13] "COUNTY"                              "LATITUDE"                           
## [15] "LONGITUDE"                           "NEW TECHNOLOGY"                     
## [17] "ORGANIC"                             "NOI RECEIVED"                       
## [19] "PERMIT START DATE"                   "PERMIT END DATE"                    
## [21] "WITHDRAWN DATE"                      "CNMP RECEIVED"                      
## [23] "CNMP EXPIRES"                        "NMP RECEIVED"                       
## [25] "NMP EXPIRES"                         "CONSERVATION PLAN RECEIVED"         
## [27] "WEB REGISTRATION STATUS"             "DEADLINE TO SUBMIT WRITTEN COMMENTS"
## [29] "DEADLINE TO REQUEST HEARING"         "PUBLIC HEARING DATE"                
## [31] "PUBLIC HEARING LOCATION"             "CONTESTED CASE HEARING EXPIRES"     
## [33] "CONTESTED CASE HEARING REQUESTED"    "CONTESTED CASE HEARING DATE"        
## [35] "CONTESTED CASE HEARING LOCATION"     "INITIAL RENEWAL NOTIF. RECEIVED"    
## [37] "INITIAL RENEWAL NOTIF. REVIEWED"     "DATE FEE PAID"                      
## [39] "AMOUNT PAID"                         "DATE FEE REIMBURSED"                
## [41] "AMOUNT REIMBURSED"                   "PROPOSED BUILDING QTY"              
## [43] "EXISTING BUILDING QTY"               "ACRES CONTROLLED"                   
## [45] "CHICKENS- NOT LAYING HENS-DRY"       "LAYING HENS - DRY MANURE"           
## [47] "CHICKENS - LIQUID MANURE"            "DAIRY CATTLE"                       
## [49] "CATTLE - INCLUDES HEIFERS"           "VEAL"                               
## [51] "SHEEP AND LAMBS"                     "DUCKS - DRY MANURE"                 
## [53] "DUCKS - LIQUID MANURE"               "HORSES"                             
## [55] "SWINE >= 55 LBS"                     "SWINE < 55 LBS"                     
## [57] "TURKEYS"                             "FARM NAME"                          
## [59] "PERMIT CATEGORY"                     "PERMIT STATUS"                      
## [61] "ADMINISTRATIVELY EXTENDED?"
colSums(is.na(cafo))
##                            DOCUMENT                             SITE NO 
##                                 251                                   0 
##                PERMIT TYPE ACTIVITY                        PERMIT CLASS 
##                                   0                                   0 
##                     ALTERNATE AI ID                               NPDES 
##                                   0                                  19 
##                    REGISTRATION NO.                           SITE NAME 
##                                   4                                   0 
##                      STREET ADDRESS                     CITY, STATE ZIP 
##                                   0                                   0 
##              MAILING STREET ADDRESS             MAILING CITY, STATE ZIP 
##                                   0                                   0 
##                              COUNTY                            LATITUDE 
##                                   0                                 147 
##                           LONGITUDE                      NEW TECHNOLOGY 
##                                 147                                 675 
##                             ORGANIC                        NOI RECEIVED 
##                                 679                                   8 
##                   PERMIT START DATE                     PERMIT END DATE 
##                                   0                                   0 
##                      WITHDRAWN DATE                       CNMP RECEIVED 
##                                 596                                  30 
##                        CNMP EXPIRES                        NMP RECEIVED 
##                                 675                                  51 
##                         NMP EXPIRES          CONSERVATION PLAN RECEIVED 
##                                 124                                 432 
##             WEB REGISTRATION STATUS DEADLINE TO SUBMIT WRITTEN COMMENTS 
##                                 156                                  45 
##         DEADLINE TO REQUEST HEARING                 PUBLIC HEARING DATE 
##                                  56                                 678 
##             PUBLIC HEARING LOCATION      CONTESTED CASE HEARING EXPIRES 
##                                 678                                 679 
##    CONTESTED CASE HEARING REQUESTED         CONTESTED CASE HEARING DATE 
##                                 679                                 679 
##     CONTESTED CASE HEARING LOCATION     INITIAL RENEWAL NOTIF. RECEIVED 
##                                 679                                 265 
##     INITIAL RENEWAL NOTIF. REVIEWED                       DATE FEE PAID 
##                                 267                                 408 
##                         AMOUNT PAID                 DATE FEE REIMBURSED 
##                                 410                                 679 
##                   AMOUNT REIMBURSED               PROPOSED BUILDING QTY 
##                                 679                                 132 
##               EXISTING BUILDING QTY                    ACRES CONTROLLED 
##                                 132                                   0 
##       CHICKENS- NOT LAYING HENS-DRY            LAYING HENS - DRY MANURE 
##                                  50                                 670 
##            CHICKENS - LIQUID MANURE                        DAIRY CATTLE 
##                                 679                                 661 
##           CATTLE - INCLUDES HEIFERS                                VEAL 
##                                 664                                 679 
##                     SHEEP AND LAMBS                  DUCKS - DRY MANURE 
##                                 677                                 679 
##               DUCKS - LIQUID MANURE                              HORSES 
##                                 678                                 676 
##                     SWINE >= 55 LBS                      SWINE < 55 LBS 
##                                 678                                 675 
##                             TURKEYS                           FARM NAME 
##                                 677                                   0 
##                     PERMIT CATEGORY                       PERMIT STATUS 
##                                   0                                   0 
##          ADMINISTRATIVELY EXTENDED? 
##                                   0

Data Cleaning

# In this chunk, I clean the dataset and select only variables relevant to my research question: whether land use and location influence if a facility has an active permit.
 # I select only the variables I need for my analysis.
  # PERMIT STATUS is my outcome variable (whether a permit is active or not).
  # PERMIT TYPE ACTIVITY describes the type of farming operation.
  # COUNTY helps capture geographic differences across Maryland.
  # LATITUDE and LONGITUDE represent the exact location of each facility.
  # ACRES CONTROLLED measures the land size of each operation.
cafo_clean <- cafo %>%
  select(
    `PERMIT STATUS`,
    `PERMIT TYPE ACTIVITY`,
    COUNTY,
    LATITUDE,
    LONGITUDE,
    `ACRES CONTROLLED`
  ) %>%
  filter(
    !is.na(COUNTY),
    !is.na(LATITUDE),
    !is.na(LONGITUDE),
    !is.na(`ACRES CONTROLLED`),
    !is.na(`PERMIT TYPE ACTIVITY`),
    !is.na(`PERMIT STATUS`)
  ) %>%
  mutate(
    active_permit = ifelse(`PERMIT STATUS` == "Issued", 1, 0)
  )

Summary by Operation Type

# In this chunk, I analyze how land use varies across different types of farming operations.
# I group the dataset by PERMIT TYPE ACTIVITY to compare different operation types.
# This variable represents different categories of animal feeding operations.
# I calculate key summary statistics for each group:
# avg_acres = average land size used by that operation type
# total_facilities = number of facilities in each category
# active_count = number of facilities with active permits in each category
# I sort the results so I can easily see which operation types use the most land.
# I print the summary table to interpret differences across operation types.
operation_summary <- cafo_clean %>%
  group_by(`PERMIT TYPE ACTIVITY`) %>%
  summarise(
    avg_acres   = mean(`ACRES CONTROLLED`, na.rm = TRUE),
    total_facilities = n(),
    active_count  = sum(active_permit)
  ) %>%
  arrange(desc(avg_acres))

operation_summary
## # A tibble: 7 × 4
##   `PERMIT TYPE ACTIVITY` avg_acres total_facilities active_count
##   <chr>                      <dbl>            <int>        <dbl>
## 1 MAFO (New)                 235.                 7            2
## 2 CAFO (Renew)               182.               341          242
## 3 MAFO (Renew)               141.                12            8
## 4 CAFO (New)                  35.1              140           70
## 5 CAFO (Modify)                0                  6            6
## 6 CAFO (Transfer)              0                 23           19
## 7 COC (New)                    0                  3            1

Exploring Acres Controlled

# I also use summary() to look at the overall distribution of ACRES CONTROLLED
# In this chunk, I explore the distribution of ACRES CONTROLLED.
# ACRES CONTROLLED is a quantitative variable that measures how much land
# each facility uses for operations.
# This helps me understand farm size differences across facilities.
# I calculate the mean to understand the average land size across all farms.
# I calculate the maximum value to identify the largest operation in the dataset.
summary(cafo_clean$`ACRES CONTROLLED`)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0     0.0     0.0   132.2     0.0  5867.0
mean(cafo_clean$`ACRES CONTROLLED`, na.rm = TRUE)
## [1] 132.156
max(cafo_clean$`ACRES CONTROLLED`, na.rm = TRUE)
## [1] 5867

Visualization Land Use by Operation Type

# In this chunk, I create a bar chart to compare average land use across operation types.
  # I use geom_col() because I am plotting summarized values (not raw data).
  # I flip the coordinates so the labels are easier to read.
  # I label the graph clearly so it explains what is being shown.
  # I use a clean theme to make the visualization easier to interpret.
  # I remove the legend because the colors already represent the categories.
ggplot(operation_summary, aes(x = `PERMIT TYPE ACTIVITY`, y = avg_acres,fill = `PERMIT TYPE ACTIVITY`)) +
  geom_col() +
  coord_flip() +
  labs(
    title = "Average Acres Controlled by Operation Type",
    x = "Operation Type",
    y = "Average Acres Controlled",
    caption = "Source: Maryland Department of the Environment"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

Logistic Regression Model

# In this chunk, I build a logistic regression model to predict whether a facility has an active permit.
# active_permit is my dependent variable (1 = active permit, 0 = not active).
# ACRES CONTROLLED represents farm size.
# LATITUDE and LONGITUDE represent geographic location.
# I use summary() to interpret how each predictor affects the likelihood of an active permit.
model <- glm(
  active_permit ~ `ACRES CONTROLLED` + LATITUDE + LONGITUDE,
  data = cafo_clean,
  family = binomial
)

summary(model)
## 
## Call:
## glm(formula = active_permit ~ `ACRES CONTROLLED` + LATITUDE + 
##     LONGITUDE, family = binomial, data = cafo_clean)
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)  
## (Intercept)        -1.785e+01  2.416e+01  -0.739   0.4601  
## `ACRES CONTROLLED`  6.309e-04  3.296e-04   1.914   0.0556 .
## LATITUDE            2.995e-01  3.141e-01   0.953   0.3403  
## LONGITUDE          -9.085e-02  3.901e-01  -0.233   0.8159  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 686.12  on 531  degrees of freedom
## Residual deviance: 676.60  on 528  degrees of freedom
## AIC: 684.6
## 
## Number of Fisher Scoring iterations: 4

Analysis of Regression

I used a logistic regression model to examine how land use and location affect whether a Maryland animal feeding operation has an active permit. My outcome variable is active_permit (1 = active, 0 = not active), and my predictors are ACRES CONTROLLED, LATITUDE, and LONGITUDE.

The results show that ACRES CONTROLLED has a small positive effect, meaning larger farms are slightly more likely to have an active permit, but this result is only marginally significant (p ≈ 0.055). Both latitude and longitude are not statistically significant, meaning geographic location does not strongly influence permit status in this dataset.

Overall, the model only slightly improves over a baseline model (small drop in deviance), which suggests these variables do not strongly explain permit status. This implies that other factors, such as livestock type or regulatory category, may be more important in determining whether a facility has an active permit.

Predictions from my Model

# In this chunk, I convert model outputs into probabilities and predicted classes.
# I generate predicted probabilities of having an active permit.
# I convert probabilities into binary predictions using a 0.5 cutoff.
# If probability > 0.5, I predict active permit (1), otherwise 0.

cafo_clean$prob <- predict(model, type = "response")
cafo_clean$pred_class <- ifelse(cafo_clean$prob > 0.5, 1, 0)

Confusion Matrix

# In this chunk, I compare predicted values to actual values.

# This table shows how many predictions were correct and where errors occurred.
# It helps me evaluate how well my model is performing.

table(Predicted = cafo_clean$pred_class,
      Actual = cafo_clean$active_permit)
##          Actual
## Predicted   0   1
##         1 184 348

ROC Curve

# I evaluate how well my model separates active vs non-active permits
roc_curve <- roc(cafo_clean$active_permit, cafo_clean$prob)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
# I plot the ROC curve to visually assess performance
plot(roc_curve)

# I calculate the AUC value to measure overall model strength
# Higher values mean better predictive performance
auc(roc_curve)
## Area under the curve: 0.5376