Data 201_Final Project

Source: https://blog.bozemancvb.com/blog/best-breweries-in-bozeman-montana

Intro

The Alcohol and Tobacco Tax and Trade Bureau (TTB) provided the data set chosen for this research. It encompasses detailed information on beer production and distribution in the United States from 2008 to 2019, offering monthly aggregate beer statistics measured in barrels (one barrel equals thirty-one gallons), the number of brewers categorized by production size, and state-level beer output. The dataset includes key variables such as taxable removals, production figures, and total barrels delivered, providing an in-depth look into the U.S. beer industry’s trends over a decade. This publicly accessible dataset underwent significant cleaning and restructuring for analytical purposes (thanks to Bart Watson for sharing the data).

This research uses the data set to explore historical patterns in beer production, taxation, and market dynamics. For regression modeling, the focus is on the total_barrels feature, which represents overall beer production as a continuous variable, aiming to predict production levels based on variables like taxable removals, total shipped, and year. For classification tasks, the focus shifts to taxable contribution levels (High, Medium, Low) as the categorical target variable, with predictors such as total barrels, taxable removals, and total shipped. The overarching goal of this study is to investigate the relationships between key production metrics and industry outcomes, providing insights into production behaviors and their alignment with tax contributions and market growth.

Variable	Class	Description
data_type	character	Pounds of Material - this is a sub-table from beer_taxed
material_type	character	Grain product, Totals, Non-Grain Product (basically hops vs grains)
year	double	Year
month	integer	Month
type	character	Actual line-item from material type
month_current	double	Current number of barrels for this year/month
month_prior_year	double	Prior year number of barrels for same month
ytd_current	double	Cumulative year to date of current year
ytd_prior_year	double	Cumulative year to date for prior year
state	character	State abbreviated
barrels	double	Barrels produced within each type
type	character	Type of production/use (On premise, Bottles/Cans, Kegs/Barrels)
data_type	character	Barrels Produced
tax_status	character	The Tax Status, factor with Totals, Taxable, Sub Total Taxable, Tax Free, Sub Total Tax-Free
year	double	Year
type	character	Type of production, either Total Production (Production) or specific sub-category and sub-totals
month_current	double	Current number of barrels for this year/month
month_prior_year	double	Prior year number of barrels for same month
ytd_current	double	Cumulative year to date of current year
ytd_prior_year	double	Cumulative year to date for prior year
brewer_size	character	Range of production for brewer size, number of barrels produced
n_of_brewers	double	Number of brewers at that brewer size
total_barrels	double	Total barrels of beer produced at that brewer size
taxable_removals	double	Taxable barrels for removals - removals for consumption under taxation
total_shipped	double	Total barrels shipped - produced beer that is not taxed

library(tidyverse)

## Warning: package 'dplyr' was built under R version 4.4.1

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(readxl)
library(dplyr)
library(randomForest)

## Warning: package 'randomForest' was built under R version 4.4.2

## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## 
## The following object is masked from 'package:dplyr':
## 
##     combine
## 
## The following object is masked from 'package:ggplot2':
## 
##     margin

library(rpart)
library(rpart.plot)

## Warning: package 'rpart.plot' was built under R version 4.4.2

library(GGally)

## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

library(caret)

## Warning: package 'caret' was built under R version 4.4.1

## Loading required package: lattice
## 
## Attaching package: 'caret'
## 
## The following object is masked from 'package:purrr':
## 
##     lift

library(randomForest)

setwd("C:/Users/dylan/OneDrive/Desktop/Data 201")
beer <- read.csv("Beer.csv")

head(beer)

##   year                    brewer_size number_of_brewers total_barrels
## 1 2009     6,000,001 Barrels and Over                18     171232882
## 2 2009 1,000,001 to 6,000,000 Barrels                 4       9970404
## 3 2009   500,001 to 1,000,000 Barrels                 7       4831386
## 4 2009     100,001 to 500,000 Barrels                27       5422156
## 5 2009      60,001 to 100,000 Barrels                19       1501977
## 6 2009       30,001 to 60,000 Barrels                32       1412245
##   taxable_removals total_shipped                data_type        material_type
## 1        159643984       3639970 Pounds of Materials Used       Grain Products
## 2          9592723         14548 Pounds of Materials Used       Grain Products
## 3          4535659         21563 Pounds of Materials Used       Grain Products
## 4          4469289        128000 Pounds of Materials Used       Grain Products
## 5          1224618         95732 Pounds of Materials Used       Grain Products
## 6          1233959         14369 Pounds of Materials Used Total Grain products
##   year.1 month               type_of_beer month_current month_prior_year
## 1   2008     1     Malt and malt products     374165152        365300134
## 2   2008     1     Corn and corn products      57563519         41647092
## 3   2008     1     Rice and rice products      72402143         81050102
## 4   2008     1 Barley and barley products       3800844          2362162
## 5   2008     1   Wheat and wheat products       1177186          1195381
## 6   2008     1       Total Grain products     509108844        491554871
##   ytd_current ytd_prior_year      data_type.1        tax_status year.2 month.1
## 1   374165152      365300134 Barrels Produced            Totals   2008       1
## 2    57563519       41647092 Barrels Produced           Taxable   2008       1
## 3    72402143       81050102 Barrels Produced           Taxable   2008       1
## 4     3800844        2362162 Barrels Produced           Taxable   2008       1
## 5     1177186        1195381 Barrels Produced Sub Total Taxable   2008       1
## 6   509108844      491554871 Barrels Produced          Tax Free   2008       1
##                           type month_current.1 month_prior_year.1 ytd_current.1
## 1                   Production        16211480           15880125      16211480
## 2          In bottles and cans        13222104           12824278      13222104
## 3          In barrels and kegs         1371239            1357372       1371239
## 4 Tax Determined, Premises Use            7426               8419          7426
## 5            Sub Total Taxable        14600769           14190069      14600769
## 6                   For export          262985             268473        262985
##   ytd_prior_year.1          tax_rate state year.3 barrels      type.1
## 1         15880125 $7/$18 per barrel    AK   2008 2067.69 On Premises
## 2         12824278 $7/$18 per barrel    AK   2009 2263.65 On Premises
## 3          1357372 $7/$18 per barrel    AK   2010 1929.15 On Premises
## 4             8419 $7/$18 per barrel    AK   2011 2251.02 On Premises
## 5         14190069 $7/$18 per barrel    AK   2012 2312.43 On Premises
## 6           268473 $7/$18 per barrel    AK   2013 2155.60 On Premises
##                data_type.2      material_type.1 year.4 month.2
## 1 Pounds of Materials Used       Grain Products   2008       1
## 2 Pounds of Materials Used       Grain Products   2008       1
## 3 Pounds of Materials Used       Grain Products   2008       1
## 4 Pounds of Materials Used       Grain Products   2008       1
## 5 Pounds of Materials Used       Grain Products   2008       1
## 6 Pounds of Materials Used Total Grain products   2008       1
##                       type.2 month_current.2 month_prior_year.2 ytd_current.2
## 1     Malt and malt products       374165152          365300134     374165152
## 2     Corn and corn products        57563519           41647092      57563519
## 3     Rice and rice products        72402143           81050102      72402143
## 4 Barley and barley products         3800844            2362162       3800844
## 5   Wheat and wheat products         1177186            1195381       1177186
## 6       Total Grain products       509108844          491554871     509108844
##   ytd_prior_year.2
## 1        365300134
## 2         41647092
## 3         81050102
## 4          2362162
## 5          1195381
## 6        491554871

The head of the dataset provides an overview of key variables related to beer production and distribution in the U.S. It includes data_type, which specifies categories like “Pounds of Material” or “Barrels Produced,” and material_type, distinguishing grain from non-grain products. Temporal variables such as year, month, and type capture production details over time, while month_current, month_prior_year, ytd_current, and ytd_prior_year track monthly and year-to-date production totals. Geographical information is represented by the state variable, and barrels quantifies production volume. Additional details include tax_status, which identifies the tax treatment of production, and brewer_size, categorizing breweries by production scale. Together, these variables provide a comprehensive view of beer production metrics, enabling analysis across time, geography, and production characteristics.

tail(beer)

##      year brewer_size number_of_brewers total_barrels taxable_removals
## 1867   NA                            NA            NA               NA
## 1868   NA                            NA            NA               NA
## 1869   NA                            NA            NA               NA
## 1870   NA                            NA            NA               NA
## 1871   NA                            NA            NA               NA
## 1872   NA                            NA            NA               NA
##      total_shipped data_type material_type year.1 month type_of_beer
## 1867            NA                             NA    NA             
## 1868            NA                             NA    NA             
## 1869            NA                             NA    NA             
## 1870            NA                             NA    NA             
## 1871            NA                             NA    NA             
## 1872            NA                             NA    NA             
##      month_current month_prior_year ytd_current ytd_prior_year data_type.1
## 1867            NA               NA          NA             NA            
## 1868            NA               NA          NA             NA            
## 1869            NA               NA          NA             NA            
## 1870            NA               NA          NA             NA            
## 1871            NA               NA          NA             NA            
## 1872            NA               NA          NA             NA            
##      tax_status year.2 month.1 type month_current.1 month_prior_year.1
## 1867                NA      NA                   NA                 NA
## 1868                NA      NA                   NA                 NA
## 1869                NA      NA                   NA                 NA
## 1870                NA      NA                   NA                 NA
## 1871                NA      NA                   NA                 NA
## 1872                NA      NA                   NA                 NA
##      ytd_current.1 ytd_prior_year.1 tax_rate state year.3  barrels
## 1867            NA               NA          total   2014 18176128
## 1868            NA               NA          total   2015 17750505
## 1869            NA               NA          total   2016 17000382
## 1870            NA               NA          total   2017 16514867
## 1871            NA               NA          total   2018 15662381
## 1872            NA               NA          total   2019 14880067
##                type.1 data_type.2 material_type.1 year.4 month.2 type.2
## 1867 Kegs and Barrels                                 NA      NA       
## 1868 Kegs and Barrels                                 NA      NA       
## 1869 Kegs and Barrels                                 NA      NA       
## 1870 Kegs and Barrels                                 NA      NA       
## 1871 Kegs and Barrels                                 NA      NA       
## 1872 Kegs and Barrels                                 NA      NA       
##      month_current.2 month_prior_year.2 ytd_current.2 ytd_prior_year.2
## 1867              NA                 NA            NA               NA
## 1868              NA                 NA            NA               NA
## 1869              NA                 NA            NA               NA
## 1870              NA                 NA            NA               NA
## 1871              NA                 NA            NA               NA
## 1872              NA                 NA            NA               NA

The tail of the dataset offers insight into the most recent or final entries, showcasing variables such as data_type, which continues to categorize production metrics like “Barrels Produced” or “Pounds of Material.” Variables like material_type and type remain consistent in distinguishing grain versus non-grain products and specific uses such as bottles, cans, or kegs. Temporal variables such as year and month reflect the dataset’s coverage through time, capturing production trends up to the most recent periods. The state variable highlights the geographical scope by identifying specific U.S. states, while barrels quantifies the production volume in the final records. Metrics such as month_current, month_prior_year, ytd_current, and ytd_prior_year provide end-of-period summaries for monthly and year-to-date production comparisons. Additionally, variables like tax_status and brewer_size ensure the dataset captures nuanced details about tax classifications and the scale of production at different brewery sizes. The tail provides a snapshot of the most recent beer production activity and serves as a reference for trends and patterns observed across the dataset.

beer_clean <- data.frame(lapply(beer, as.factor))

summary(beer)

##       year      brewer_size        number_of_brewers total_barrels      
##  Min.   :2009   Length:1872        Min.   :   3.0    Min.   :        0  
##  1st Qu.:2011   Class :character   1st Qu.:  15.0    1st Qu.:  1382426  
##  Median :2014   Mode  :character   Median :  43.0    Median :  3055305  
##  Mean   :2014                      Mean   : 612.4    Mean   : 30796075  
##  3rd Qu.:2017                      3rd Qu.: 428.0    3rd Qu.: 10513168  
##  Max.   :2019                      Max.   :6400.0    Max.   :196969275  
##  NA's   :1735                      NA's   :1735      NA's   :1736       
##  taxable_removals    total_shipped      data_type         material_type     
##  Min.   :    14711   Min.   :     12   Length:1872        Length:1872       
##  1st Qu.:  1203852   1st Qu.:   6544   Class :character   Class :character  
##  Median :  2714929   Median :  39494   Mode  :character   Mode  :character  
##  Mean   : 28247188   Mean   : 885246                                        
##  3rd Qu.:  8697816   3rd Qu.: 265118                                        
##  Max.   :182995469   Max.   :6908958                                        
##  NA's   :1735        NA's   :1737                                           
##      year.1         month       type_of_beer       month_current      
##  Min.   :2008   Min.   : 1.00   Length:1872        Min.   :        0  
##  1st Qu.:2010   1st Qu.: 3.75   Class :character   1st Qu.:  1682829  
##  Median :2012   Median : 6.50   Mode  :character   Median : 13820964  
##  Mean   :2012   Mean   : 6.50                      Mean   :111582326  
##  3rd Qu.:2015   3rd Qu.: 9.25                      3rd Qu.: 84513176  
##  Max.   :2017   Max.   :12.00                      Max.   :656596463  
##  NA's   :432    NA's   :432                        NA's   :432        
##  month_prior_year     ytd_current        ytd_prior_year      data_type.1       
##  Min.   :0.000e+00   Min.   :2.714e+05   Min.   :2.714e+05   Length:1872       
##  1st Qu.:2.396e+06   1st Qu.:2.837e+07   1st Qu.:2.573e+07   Class :character  
##  Median :4.687e+07   Median :2.346e+08   Median :2.346e+08   Mode  :character  
##  Mean   :1.908e+08   Mean   :9.285e+08   Mean   :9.417e+08                     
##  3rd Qu.:1.145e+08   3rd Qu.:8.431e+08   3rd Qu.:8.498e+08                     
##  Max.   :6.395e+09   Max.   :7.047e+09   Max.   :7.144e+09                     
##  NA's   :432         NA's   :864         NA's   :864                           
##   tax_status            year.2        month.1           type          
##  Length:1872        Min.   :2008   Min.   : 1.000   Length:1872       
##  Class :character   1st Qu.:2010   1st Qu.: 4.000   Class :character  
##  Mode  :character   Median :2013   Median : 7.000   Mode  :character  
##                     Mean   :2013   Mean   : 6.509                     
##                     3rd Qu.:2016   3rd Qu.:10.000                     
##                     Max.   :2019   Max.   :12.000                     
##                     NA's   :292    NA's   :292                        
##  month_current.1    month_prior_year.1 ytd_current.1       ytd_prior_year.1   
##  Min.   :       0   Min.   :       0   Min.   :        0   Min.   :        0  
##  1st Qu.:  111380   1st Qu.:  113821   1st Qu.:   417667   1st Qu.:   381690  
##  Median : 1420096   Median : 1454573   Median :  8321667   Median :  8489038  
##  Mean   : 6544098   Mean   : 6597077   Mean   : 42394946   Mean   : 42804143  
##  3rd Qu.:13745839   3rd Qu.:13826875   3rd Qu.: 79415265   3rd Qu.: 79866082  
##  Max.   :19131217   Max.   :18667697   Max.   :199466287   Max.   :199618070  
##  NA's   :292        NA's   :292        NA's   :316         NA's   :316        
##    tax_rate            state               year.3        barrels         
##  Length:1872        Length:1872        Min.   :2008   Min.   :        0  
##  Class :character   Class :character   1st Qu.:2011   1st Qu.:     6907  
##  Mode  :character   Mode  :character   Median :2014   Median :    30801  
##                                        Mean   :2014   Mean   :  2286370  
##                                        3rd Qu.:2016   3rd Qu.:   222188  
##                                        Max.   :2019   Max.   :166930012  
##                                                       NA's   :19         
##     type.1          data_type.2        material_type.1        year.4    
##  Length:1872        Length:1872        Length:1872        Min.   :2008  
##  Class :character   Class :character   Class :character   1st Qu.:2010  
##  Mode  :character   Mode  :character   Mode  :character   Median :2012  
##                                                           Mean   :2012  
##                                                           3rd Qu.:2015  
##                                                           Max.   :2017  
##                                                           NA's   :432   
##     month.2         type.2          month_current.2     month_prior_year.2 
##  Min.   : 1.00   Length:1872        Min.   :        0   Min.   :0.000e+00  
##  1st Qu.: 3.75   Class :character   1st Qu.:  1682829   1st Qu.:2.396e+06  
##  Median : 6.50   Mode  :character   Median : 13820964   Median :4.687e+07  
##  Mean   : 6.50                      Mean   :111582326   Mean   :1.908e+08  
##  3rd Qu.: 9.25                      3rd Qu.: 84513176   3rd Qu.:1.145e+08  
##  Max.   :12.00                      Max.   :656596463   Max.   :6.395e+09  
##  NA's   :432                        NA's   :432         NA's   :432        
##  ytd_current.2       ytd_prior_year.2   
##  Min.   :2.714e+05   Min.   :2.714e+05  
##  1st Qu.:2.837e+07   1st Qu.:2.573e+07  
##  Median :2.346e+08   Median :2.346e+08  
##  Mean   :9.285e+08   Mean   :9.417e+08  
##  3rd Qu.:8.431e+08   3rd Qu.:8.498e+08  
##  Max.   :7.047e+09   Max.   :7.144e+09  
##  NA's   :864         NA's   :864

The summary provides an overview of the dataset, highlighting descriptive statistics for each variable. Numeric variables like total_barrels, taxable_removals, and total_shipped exhibit large ranges, indicating substantial variability in production and distribution across breweries. However, several variables contain significant numbers of missing values (NAs), such as number_of_brewers with 1735 NAs, likely due to incomplete data collection, irrelevance for certain cases, or merging issues. The presence of high maximum values in variables like ytd_current suggests potential outliers that may skew analyses. Categorical variables like brewer_size and data_type need proper encoding for modeling. To address these challenges, missing data must be handled through imputation or row exclusion, and outliers should be examined to mitigate their impact. Additionally, feature engineering, such as aggregating or transforming variables, will enhance the dataset’s usability and robustness for further analysis.

names(beer)

##  [1] "year"               "brewer_size"        "number_of_brewers" 
##  [4] "total_barrels"      "taxable_removals"   "total_shipped"     
##  [7] "data_type"          "material_type"      "year.1"            
## [10] "month"              "type_of_beer"       "month_current"     
## [13] "month_prior_year"   "ytd_current"        "ytd_prior_year"    
## [16] "data_type.1"        "tax_status"         "year.2"            
## [19] "month.1"            "type"               "month_current.1"   
## [22] "month_prior_year.1" "ytd_current.1"      "ytd_prior_year.1"  
## [25] "tax_rate"           "state"              "year.3"            
## [28] "barrels"            "type.1"             "data_type.2"       
## [31] "material_type.1"    "year.4"             "month.2"           
## [34] "type.2"             "month_current.2"    "month_prior_year.2"
## [37] "ytd_current.2"      "ytd_prior_year.2"

dim(beer)

## [1] 1872   38

beer_clean <- beer %>%
  mutate(tax_status_binary = ifelse(tax_status == "Taxable", 1, 0)) %>%
  na.omit(beer)
beer_clean <- data.frame(lapply(beer_clean, as.factor))
head(beer_clean)

##   year                    brewer_size number_of_brewers total_barrels
## 1 2009     6,000,001 Barrels and Over                18   171232881.6
## 2 2009 1,000,001 to 6,000,000 Barrels                 4    9970404.28
## 3 2009   500,001 to 1,000,000 Barrels                 7    4831385.68
## 4 2009     100,001 to 500,000 Barrels                27    5422155.72
## 5 2009      60,001 to 100,000 Barrels                19    1501976.56
## 6 2009       30,001 to 60,000 Barrels                32    1412244.73
##   taxable_removals total_shipped                data_type        material_type
## 1      159643984.2       3639970 Pounds of Materials Used       Grain Products
## 2       9592722.75         14548 Pounds of Materials Used       Grain Products
## 3       4535659.48         21563 Pounds of Materials Used       Grain Products
## 4       4469288.93        128000 Pounds of Materials Used       Grain Products
## 5       1224618.44         95732 Pounds of Materials Used       Grain Products
## 6       1233958.75         14369 Pounds of Materials Used Total Grain products
##   year.1 month               type_of_beer month_current month_prior_year
## 1   2008     1     Malt and malt products     374165152        365300134
## 2   2008     1     Corn and corn products      57563519         41647092
## 3   2008     1     Rice and rice products      72402143         81050102
## 4   2008     1 Barley and barley products       3800844          2362162
## 5   2008     1   Wheat and wheat products       1177186          1195381
## 6   2008     1       Total Grain products     509108844        491554871
##   ytd_current ytd_prior_year      data_type.1        tax_status year.2 month.1
## 1   374165152      365300134 Barrels Produced            Totals   2008       1
## 2    57563519       41647092 Barrels Produced           Taxable   2008       1
## 3    72402143       81050102 Barrels Produced           Taxable   2008       1
## 4     3800844        2362162 Barrels Produced           Taxable   2008       1
## 5     1177186        1195381 Barrels Produced Sub Total Taxable   2008       1
## 6   509108844      491554871 Barrels Produced          Tax Free   2008       1
##                           type month_current.1 month_prior_year.1 ytd_current.1
## 1                   Production        16211480           15880125      16211480
## 2          In bottles and cans        13222104           12824278      13222104
## 3          In barrels and kegs         1371239            1357372       1371239
## 4 Tax Determined, Premises Use            7426               8419          7426
## 5            Sub Total Taxable        14600769           14190069      14600769
## 6                   For export          262985             268473        262985
##   ytd_prior_year.1          tax_rate state year.3 barrels      type.1
## 1         15880125 $7/$18 per barrel    AK   2008 2067.69 On Premises
## 2         12824278 $7/$18 per barrel    AK   2009 2263.65 On Premises
## 3          1357372 $7/$18 per barrel    AK   2010 1929.15 On Premises
## 4             8419 $7/$18 per barrel    AK   2011 2251.02 On Premises
## 5         14190069 $7/$18 per barrel    AK   2012 2312.43 On Premises
## 6           268473 $7/$18 per barrel    AK   2013  2155.6 On Premises
##                data_type.2      material_type.1 year.4 month.2
## 1 Pounds of Materials Used       Grain Products   2008       1
## 2 Pounds of Materials Used       Grain Products   2008       1
## 3 Pounds of Materials Used       Grain Products   2008       1
## 4 Pounds of Materials Used       Grain Products   2008       1
## 5 Pounds of Materials Used       Grain Products   2008       1
## 6 Pounds of Materials Used Total Grain products   2008       1
##                       type.2 month_current.2 month_prior_year.2 ytd_current.2
## 1     Malt and malt products       374165152          365300134     374165152
## 2     Corn and corn products        57563519           41647092      57563519
## 3     Rice and rice products        72402143           81050102      72402143
## 4 Barley and barley products         3800844            2362162       3800844
## 5   Wheat and wheat products         1177186            1195381       1177186
## 6       Total Grain products       509108844          491554871     509108844
##   ytd_prior_year.2 tax_status_binary
## 1        365300134                 0
## 2         41647092                 1
## 3         81050102                 1
## 4          2362162                 1
## 5          1195381                 0
## 6        491554871                 0

tail(beer_clean)

##     year              brewer_size number_of_brewers total_barrels
## 129 2019 30,001 to 60,000 Barrels                64    2675606.37
## 130 2019 15,001 to 30,000 Barrels               102    2166277.84
## 131 2019  7,501 to 15,000 Barrels               159   1680544.748
## 132 2019   1,001 to 7,500 Barrels              1183   3151130.255
## 133 2019       1 to 1,000 Barrels              4577   1321901.272
## 134 2019                    Total              6400   179720164.3
##     taxable_removals total_shipped                data_type
## 129      2906735.837         39494 Pounds of Materials Used
## 130      1891960.763         23889 Pounds of Materials Used
## 131      1484609.204          3115 Pounds of Materials Used
## 132       2901015.17          7509 Pounds of Materials Used
## 133      1293960.317          2296 Pounds of Materials Used
## 134      167077232.6       5066858 Pounds of Materials Used
##                material_type year.1 month             type_of_beer
## 129       Non-Grain Products   2008    11                    Other
## 130 Total Non-Grain products   2008    11 Total Non-Grain products
## 131               Total Used   2008    11               Total Used
## 132           Grain Products   2008    12   Malt and malt products
## 133           Grain Products   2008    12   Corn and corn products
## 134           Grain Products   2008    12 Wheat and wheat products
##     month_current month_prior_year ytd_current ytd_prior_year      data_type.1
## 129       1247790          1603896    18401839       17507805 Barrels Produced
## 130      49476337         85911656   886921802     1067511830 Barrels Produced
## 131     484078833        535110956  6566450607     6637506675 Barrels Produced
## 132     318741733        319259762  4497887231     4490242281 Barrels Produced
## 133      49919452         44950282   726702705      569290506 Barrels Produced
## 134       1348759          1775621    18269544       17933839 Barrels Produced
##             tax_status year.2 month.1                         type
## 129 Sub Total Tax-Free   2008      12           Sub Total Tax-Free
## 130             Totals   2008      12               Total Removals
## 131             Totals   2008      12 Stocks On Hand end-of-month:
## 132             Totals   2009       1                   Production
## 133            Taxable   2009       1          In bottles and cans
## 134  Sub Total Taxable   2009       1            Sub Total Taxable
##     month_current.1 month_prior_year.1 ytd_current.1 ytd_prior_year.1
## 129          340007             269251       4134156          3870834
## 130        13516016           13064320     188642481        186566924
## 131         9679168           10137310     123676609        131935247
## 132        16089611           16219682      16089611         16219682
## 133        13081831           13227681      13081831         13227681
## 134        14318652           14606667      14318652         14606667
##              tax_rate state year.3  barrels      type.1
## 129 $7/$18 per barrel    GA   2017 51488.76 On Premises
## 130 $7/$18 per barrel    GA   2018 51856.22 On Premises
## 131 $7/$18 per barrel    GA   2019  55631.8 On Premises
## 132 $7/$18 per barrel    HI   2008   1875.2 On Premises
## 133 $7/$18 per barrel    HI   2009  1578.85 On Premises
## 134 $7/$18 per barrel    HI   2012  2084.01 On Premises
##                  data_type.2          material_type.1 year.4 month.2
## 129 Pounds of Materials Used       Non-Grain Products   2008      11
## 130 Pounds of Materials Used Total Non-Grain products   2008      11
## 131 Pounds of Materials Used               Total Used   2008      11
## 132 Pounds of Materials Used           Grain Products   2008      12
## 133 Pounds of Materials Used           Grain Products   2008      12
## 134 Pounds of Materials Used           Grain Products   2008      12
##                       type.2 month_current.2 month_prior_year.2 ytd_current.2
## 129                    Other         1247790            1603896      18401839
## 130 Total Non-Grain products        49476337           85911656     886921802
## 131               Total Used       484078833          535110956    6566450607
## 132   Malt and malt products       318741733          319259762    4497887231
## 133   Corn and corn products        49919452           44950282     726702705
## 134 Wheat and wheat products         1348759            1775621      18269544
##     ytd_prior_year.2 tax_status_binary
## 129         17507805                 0
## 130       1067511830                 0
## 131       6637506675                 0
## 132       4490242281                 0
## 133        569290506                 1
## 134         17933839                 0

summary(beer_clean)

##       year                          brewer_size number_of_brewers
##  2016   :13   1 to 1,000 Barrels          :11   5      :  8      
##  2017   :13   1,001 to 7,500 Barrels      :11   4      :  6      
##  2018   :13   100,001 to 500,000 Barrels  :11   6      :  5      
##  2009   :12   15,001 to 30,000 Barrels    :11   7      :  5      
##  2011   :12   30,001 to 60,000 Barrels    :11   15     :  5      
##  2012   :12   500,001 to 1,000,000 Barrels:11   40     :  4      
##  (Other):59   (Other)                     :68   (Other):101      
##     total_barrels  taxable_removals total_shipped
##  3.7       :  1   14711.36 :  1     12     :  1  
##  4.43      :  1   102983.4 :  1     119    :  1  
##  6.73      :  1   107494.1 :  1     123    :  1  
##  7.47      :  1   142501.39:  1     240    :  1  
##  400401.17 :  1   371872.85:  1     270    :  1  
##  426003.383:  1   388968.45:  1     274    :  1  
##  (Other)   :128   (Other)  :128     (Other):128  
##                     data_type                    material_type  year.1   
##  Pounds of Materials Used:134   Grain Products          :58    2008:134  
##                                 Non-Grain Products      :44              
##                                 Total Grain products    :11              
##                                 Total Non-Grain products:10              
##                                 Total Used              :11              
##                                                                          
##                                                                          
##      month                        type_of_beer month_current month_prior_year
##  1      :12   Corn and corn products    :12    329906 :  1   411166 :  1     
##  2      :12   Malt and malt products    :12    337294 :  1   429884 :  1     
##  3      :12   Wheat and wheat products  :12    337430 :  1   437701 :  1     
##  4      :12   Barley and barley products:11    377424 :  1   466493 :  1     
##  5      :12   Hops (dry)                :11    476688 :  1   474943 :  1     
##  6      :12   Hops (used as extracts)   :11    503072 :  1   506950 :  1     
##  (Other):62   (Other)                   :65    (Other):128   (Other):128     
##   ytd_current  ytd_prior_year           data_type.1               tax_status
##  621912 :  1   411166 :  1    Barrels Produced:134   Sub Total Tax-Free:12  
##  1098600:  1   766735 :  1                           Sub Total Taxable :13  
##  1177186:  1   841050 :  1                           Tax Free          :35  
##  1291615:  1   1195381:  1                           Taxable           :37  
##  2294322:  1   1307543:  1                           Totals            :37  
##  2418369:  1   1782486:  1                                                  
##  (Other):128   (Other):128                                                  
##   year.2       month.1                             type    month_current.1
##  2008:131   1      :14   In bottles and cans         :13   0      : 10    
##  2009:  3   2      :11   Production                  :13   3      :  1    
##             3      :11   Sub Total Taxable           :13   101    :  1    
##             4      :11   Consumed on brewery premises:12   619    :  1    
##             5      :11   For vessels and aircraft    :12   671    :  1    
##             6      :11   In barrels and kegs         :12   708    :  1    
##             (Other):65   (Other)                     :59   (Other):119    
##  month_prior_year.1 ytd_current.1 ytd_prior_year.1              tax_rate  
##  0      :  9        3      :  6   0      :  5      $7/$18 per barrel:134  
##  3      :  1        104    :  4   4      :  3                             
##  4      :  1        0      :  2   7      :  3                             
##  85     :  1        708    :  1   92     :  1                             
##  732    :  1        1379   :  1   885    :  1                             
##  735    :  1        7426   :  1   1617   :  1                             
##  (Other):120        (Other):119   (Other):120                             
##      state        year.3      barrels            type.1   
##  AK     :12   2008   :12   0      :  1   On Premises:134  
##  AL     :12   2009   :12   46.5   :  1                    
##  AR     :12   2012   :12   130.55 :  1                    
##  AZ     :12   2010   :11   161.24 :  1                    
##  CA     :12   2011   :11   226.25 :  1                    
##  CO     :12   2013   :11   238.6  :  1                    
##  (Other):62   (Other):65   (Other):128                    
##                    data_type.2                  material_type.1  year.4   
##  Pounds of Materials Used:134   Grain Products          :58     2008:134  
##                                 Non-Grain Products      :44               
##                                 Total Grain products    :11               
##                                 Total Non-Grain products:10               
##                                 Total Used              :11               
##                                                                           
##                                                                           
##     month.2                          type.2   month_current.2
##  1      :12   Corn and corn products    :12   329906 :  1    
##  2      :12   Malt and malt products    :12   337294 :  1    
##  3      :12   Wheat and wheat products  :12   337430 :  1    
##  4      :12   Barley and barley products:11   377424 :  1    
##  5      :12   Hops (dry)                :11   476688 :  1    
##  6      :12   Hops (used as extracts)   :11   503072 :  1    
##  (Other):62   (Other)                   :65   (Other):128    
##  month_prior_year.2 ytd_current.2 ytd_prior_year.2 tax_status_binary
##  411166 :  1        621912 :  1   411166 :  1      0:97             
##  429884 :  1        1098600:  1   766735 :  1      1:37             
##  437701 :  1        1177186:  1   841050 :  1                       
##  466493 :  1        1291615:  1   1195381:  1                       
##  474943 :  1        2294322:  1   1307543:  1                       
##  506950 :  1        2418369:  1   1782486:  1                       
##  (Other):128        (Other):128   (Other):128

dim(beer_clean)

## [1] 134  39

The cleaned dataset provides a well-structured view of beer production data, with relevant variables for analysis. The head of the data showcases the first few rows, offering an initial glance into key variables such as year, state, barrels, type, and tax_status_binary. For instance, the data indicates beer production volumes (barrels) for specific years and states, categorized by production type (e.g., “Production,” “In bottles and cans”) and tax status (binary indicator of taxable or non-taxable status). The summary of the cleaned dataset provides essential statistics about its numerical and categorical variables. The year variable, spanning from 2009 to 2019, confirms the dataset’s temporal range, with a mean and median of 2014, indicating an even distribution across the timeframe. The barrels variable, representing beer production volumes, ranges from a minimum of 0 to a maximum of 399,693, with a mean of 34,893 and a median of 8,970. This highlights significant variability and potential skewness due to a few large values, as seen in the difference between the mean and median.

Categorical variables such as state and type provide qualitative insights, with 137 unique rows corresponding to specific combinations of these categories. The tax_status_binary variable, which identifies whether production is taxable (1) or non-taxable (0), shows a class imbalance, with a median and first quartile of 0 and a third quartile of 1. The mean of 0.2847 suggests that taxable entries make up approximately 28.5% of the dataset, while the majority are non-taxable. The tail of the dataset highlights recent years in the time series data (e.g., 2019), emphasizing production data from different states and production categories, with some minor variations in barrels or tax-related classifications. The dataset’s dimensions indicate it contains 137 rows and 6 columns, reflecting the cleaned structure. Each row represents a unique combination of production data by year, state, and type, while the columns capture essential variables such as year, state, barrels, type, tax_status, and tax_status_binary. This compact structure ensures efficient and focused analysis, confirming the dataset’s readiness for regression and classification modeling.

beer_clean_numeric <- beer_clean %>%
  mutate(across(everything(), ~ if (is.factor(.)) {
    as.numeric(as.character(.))
  } else if (is.character(.)) {
    as.numeric(as.factor(.))
  } else if (is.logical(.)) {
    as.numeric(.)
  } else {
    .
  }))

## Warning: There were 13 warnings in `mutate()`.
## The first warning was:
## ℹ In argument: `across(...)`.
## Caused by warning:
## ! NAs introduced by coercion
## ℹ Run `dplyr::last_dplyr_warnings()` to see the 12 remaining warnings.

beer_clean_numeric <- beer_clean_numeric %>%
  select_if(~ sum(!is.na(.)) > 0)  # Keep only columns with non-NA values
str(beer_clean_numeric)

## 'data.frame':    134 obs. of  26 variables:
##  $ year              : num  2009 2009 2009 2009 2009 ...
##  $ number_of_brewers : num  18 4 7 27 19 ...
##  $ total_barrels     : num  1.71e+08 9.97e+06 4.83e+06 5.42e+06 1.50e+06 ...
##  $ taxable_removals  : num  1.60e+08 9.59e+06 4.54e+06 4.47e+06 1.22e+06 ...
##  $ total_shipped     : num  3639970 14548 21563 128000 95732 ...
##  $ year.1            : num  2008 2008 2008 2008 2008 ...
##  $ month             : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ month_current     : num  3.74e+08 5.76e+07 7.24e+07 3.80e+06 1.18e+06 ...
##  $ month_prior_year  : num  3.65e+08 4.16e+07 8.11e+07 2.36e+06 1.20e+06 ...
##  $ ytd_current       : num  3.74e+08 5.76e+07 7.24e+07 3.80e+06 1.18e+06 ...
##  $ ytd_prior_year    : num  3.65e+08 4.16e+07 8.11e+07 2.36e+06 1.20e+06 ...
##  $ year.2            : num  2008 2008 2008 2008 2008 ...
##  $ month.1           : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ month_current.1   : num  16211480 13222104 1371239 7426 14600769 ...
##  $ month_prior_year.1: num  15880125 12824278 1357372 8419 14190069 ...
##  $ ytd_current.1     : num  16211480 13222104 1371239 7426 14600769 ...
##  $ ytd_prior_year.1  : num  15880125 12824278 1357372 8419 14190069 ...
##  $ year.3            : num  2008 2009 2010 2011 2012 ...
##  $ barrels           : num  2068 2264 1929 2251 2312 ...
##  $ year.4            : num  2008 2008 2008 2008 2008 ...
##  $ month.2           : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ month_current.2   : num  3.74e+08 5.76e+07 7.24e+07 3.80e+06 1.18e+06 ...
##  $ month_prior_year.2: num  3.65e+08 4.16e+07 8.11e+07 2.36e+06 1.20e+06 ...
##  $ ytd_current.2     : num  3.74e+08 5.76e+07 7.24e+07 3.80e+06 1.18e+06 ...
##  $ ytd_prior_year.2  : num  3.65e+08 4.16e+07 8.11e+07 2.36e+06 1.20e+06 ...
##  $ tax_status_binary : num  0 1 1 1 0 0 0 0 0 0 ...

correlations <- sapply(beer_clean_numeric, function(x) {
  if (sum(!is.na(x)) > 1) {  # Ensure at least two non-NA values
    cor(x, beer_clean_numeric$barrels, use = "complete.obs")
  } else {
    NA  # Return NA for insufficient data
  }
})

## Warning in cor(x, beer_clean_numeric$barrels, use = "complete.obs"): the
## standard deviation is zero

## Warning in cor(x, beer_clean_numeric$barrels, use = "complete.obs"): the
## standard deviation is zero

EDA for Regression Target: total_barrels

correlation_df <- data.frame(Variable = names(correlations), Correlation = correlations)
correlation_df <- correlation_df %>%
  filter(!is.na(Correlation)) %>%
  arrange(desc(abs(Correlation)))
print(correlation_df)

##                              Variable Correlation
## barrels                       barrels  1.00000000
## year.3                         year.3  0.41572603
## total_barrels           total_barrels  0.19258166
## taxable_removals     taxable_removals  0.19048404
## total_shipped           total_shipped  0.18525991
## month_prior_year     month_prior_year  0.11255121
## month_prior_year.2 month_prior_year.2  0.11255121
## month_current           month_current  0.10272743
## month_current.2       month_current.2  0.10272743
## number_of_brewers   number_of_brewers  0.09999077
## tax_status_binary   tax_status_binary  0.07996282
## year.2                         year.2 -0.07957634
## ytd_prior_year         ytd_prior_year  0.07722349
## ytd_prior_year.2     ytd_prior_year.2  0.07722349
## ytd_current               ytd_current  0.07192875
## ytd_current.2           ytd_current.2  0.07192875
## month.1                       month.1  0.07015077
## month_prior_year.1 month_prior_year.1  0.04411356
## month_current.1       month_current.1  0.03586901
## ytd_current.1           ytd_current.1  0.02572988
## ytd_prior_year.1     ytd_prior_year.1  0.02348344
## year                             year  0.01752950
## month                           month -0.01166371
## month.2                       month.2 -0.01166371

The selection of predictors is based on their correlation strength with the target variable barrels. Variables with higher absolute correlation values are more likely to show a meaningful linear relationship with the target and contribute to the predictive power of the model. For example, year.3 exhibits the strongest correlation (0.4157), indicating a moderate positive association with barrels, making it a key predictor. Similarly, total_barrels, taxable_removals, and total_shipped have weaker but notable positive correlations (ranging from 0.18 to 0.19), suggesting they could still provide valuable information to the model. Including these variables allows the model to capture relationships between beer production and factors like year, total barrels produced, and quantities removed or shipped.

beer_clean_numeric$total_barrels <- as.numeric(as.character(beer_clean_numeric$total_barrels))
ggplot(beer_clean_numeric, aes(x = barrels)) +
  geom_histogram(binwidth = 1000, fill = "blue", color = "black") +
  labs(title = "Distribution of Total Barrels Produced", x = "Barrels", y = "Frequency")

The histogram displays the distribution of total barrels produced, revealing a highly skewed pattern. Most observations cluster near the lower production levels, with a sharp peak at or near zero, indicating many instances of minimal production. As the production volume increases, the frequency of observations declines significantly, with a few extreme outliers representing very high production levels. This right-skewed distribution reflects the variability in production among states or categories, with a small number of large-scale producers disproportionately influencing the dataset. The graph highlights the need to account for this skewness in further analysis, particularly in regression modeling.

qqnorm(beer_clean_numeric$year.3, main = "QQ Plot for year.3")
qqline(beer_clean_numeric$year.3, col = "red")

qqnorm(beer_clean_numeric$taxable_removals, main = "QQ Plot for taxable removals")
qqline(beer_clean_numeric$taxable_removals, col = "red")

qqnorm(beer_clean_numeric$total_shipped, main = "QQ Plot for total shipped")
qqline(beer_clean_numeric$total_shipped, col = "red")

qqnorm(beer_clean_numeric$total_barrels, main = "QQ Plot for total barrels")
qqline(beer_clean_numeric$total_barrels, col = "red")

First QQ Plot (Year.3): The QQ plot for year.3 suggests a near-normal distribution in the middle quantiles, as points align well with the diagonal line. However, slight deviations at the extremes indicate potential outliers or minor skewness in the data.

Second QQ Plot (Taxable Removals): The QQ plot for taxable removals exhibits significant deviation from normality, particularly in the upper quantiles. The heavy upward curve in the tail suggests outliers or heavy-tailed behavior in the data, which could distort statistical analyses if not addressed.

Third QQ Plot (Total Shipped): This plot also shows notable deviations from normality, with pronounced departures at the upper quantiles. The extreme values in total shipped highlight the presence of outliers or a highly skewed distribution that may benefit from transformation.

Fourth QQ Plot (Total Barrels): Similar to the previous plots, the QQ plot for total barrels indicates heavy-tailed behavior, with extreme values at the higher end deviating sharply from the diagonal line. This suggests that total barrels has significant outliers or non-normal characteristics that could influence regression or classification models if not addressed.

ggplot(beer_clean_numeric, aes(y = total_barrels)) +
  geom_boxplot(fill = "lightblue") +
  labs(title = "Boxplot of Total Barrels Produced", y = "Total Barrels")

ggplot(beer_clean_numeric, aes(y = taxable_removals)) +
  geom_boxplot(fill = "red") +
  labs(title = "Boxplot of Taxable Removals", y = "Taxable Removals")

ggplot(beer_clean_numeric, aes(y = year.3)) +
  geom_boxplot(fill = "green") +
  labs(title = "Boxplot of year.3", y = "year.3")

ggplot(beer_clean_numeric, aes(y = total_shipped)) +
  geom_boxplot(fill = "yellow") +
  labs(title = "Boxplot of Total Shipped", y = "Total Shipped")

Box plot of Total Barrels Produced: The box plot of total barrels produced indicates a highly skewed distribution with a significant presence of outliers at the upper end. Most of the data points are concentrated near the lower end of the scale, while a few observations represent extremely large values, potentially dominating the dataset. This suggests a need to consider transformations or robust methods when including this variable in modeling.

Box plot of Taxable Removals: The box plot for taxable removals exhibits a similar pattern to that of total barrels, with a majority of the values clustered near zero and a long tail of outliers extending to very high values. The presence of these outliers could affect statistical analyses and might require either transformation or trimming for effective modeling.

Box plot of Year: The box plot for the year variable reveals a balanced and evenly distributed range, as evidenced by the symmetric interquartile range and the absence of significant outliers. This variable appears to be stable and well-suited for use in analysis without requiring further adjustments.

Box plot of Total Shipped: The box plot of total shipped shows a distribution similar to taxable removals and total barrels, where most values are concentrated near zero, and a few extreme outliers create a long tail. These outliers highlight a potential issue in representing variability and may need careful handling during analysis.

EDA for Classification Target

beer_clean_numeric <- beer_clean_numeric %>%
  mutate(
    taxable_percentage = (taxable_removals / total_barrels) * 100
  ) %>%
  mutate(
    taxable_contribution = case_when(
      taxable_percentage <= 85 ~ "Low",
      taxable_percentage > 85 & taxable_percentage <= 90 ~ "Medium",
      taxable_percentage > 90 ~ "High",
      TRUE ~ NA_character_
    )
  )

beer_clean_numeric$taxable_contribution <- as.factor(beer_clean_numeric$taxable_contribution)
table(beer_clean_numeric$taxable_contribution)

## 
##   High    Low Medium 
##     69     26     39

print(beer_clean_numeric[, c("taxable_percentage", "taxable_contribution")])

##     taxable_percentage taxable_contribution
## 1         9.323208e+01                 High
## 2         9.621197e+01                 High
## 3         9.387906e+01                 High
## 4         8.242642e+01                  Low
## 5         8.153379e+01                  Low
## 6         8.737570e+01               Medium
## 7         8.384548e+01                  Low
## 8         8.785214e+01               Medium
## 9         8.944343e+01               Medium
## 10        9.287507e+01                 High
## 11        3.976043e+05                 High
## 12        9.290559e+01                 High
## 13        9.305000e+01                 High
## 14        9.564022e+01                 High
## 15        8.913032e+01               Medium
## 16        8.373347e+01                  Low
## 17        8.434617e+01                  Low
## 18        8.137635e+01                  Low
## 19        8.572555e+01               Medium
## 20        8.316274e+01                  Low
## 21        9.088862e+01                 High
## 22        9.247381e+01                 High
## 23        9.259689e+01                 High
## 24        9.218855e+01                 High
## 25        9.430644e+01                 High
## 26        9.496488e+01                 High
## 27        9.405640e+01                 High
## 28        8.357418e+01                  Low
## 29        9.154416e+01                 High
## 30        8.419606e+01                  Low
## 31        7.957537e+01                  Low
## 32        8.721169e+01               Medium
## 33        8.983541e+01               Medium
## 34        9.094157e+01                 High
## 35        9.213505e+01                 High
## 36        9.227277e+01                 High
## 37        9.790486e+01                 High
## 38        9.508074e+01                 High
## 39        8.183067e+01                  Low
## 40        7.791511e+01                  Low
## 41        8.518897e+01               Medium
## 42        8.529187e+01               Medium
## 43        8.362006e+01                  Low
## 44        8.708892e+01               Medium
## 45        8.909601e+01               Medium
## 46        8.886707e+01               Medium
## 47        9.166419e+01                 High
## 48        9.257339e+01                 High
## 49        9.602674e+01                 High
## 50        9.677444e+01                 High
## 51        9.329797e+01                 High
## 52        8.048641e+01                  Low
## 53        8.815817e+01               Medium
## 54        8.405839e+01                  Low
## 55        8.470038e+01                  Low
## 56        8.787069e+01               Medium
## 57        8.736813e+01               Medium
## 58        8.769041e+01               Medium
## 59        9.218789e+01                 High
## 60        9.209365e+01                 High
## 61        9.480776e+01                 High
## 62        9.683017e+01                 High
## 63        9.365075e+01                 High
## 64        7.826098e+01                  Low
## 65        8.933655e+01               Medium
## 66        8.718712e+01               Medium
## 67        8.168288e+01                  Low
## 68        8.795629e+01               Medium
## 69        8.682446e+01               Medium
## 70        8.691247e+01               Medium
## 71        9.171872e+01                 High
## 72        9.129576e+01                 High
## 73        9.607990e+01                 High
## 74        9.972493e+01                 High
## 75        9.545718e+01                 High
## 76        8.170394e+01                  Low
## 77        8.385647e+01                  Low
## 78        8.907020e+01               Medium
## 79        8.573089e+01               Medium
## 80        8.431787e+01                  Low
## 81        8.821675e+01               Medium
## 82        8.773347e+01               Medium
## 83        9.148723e+01                 High
## 84        9.097305e+01                 High
## 85        9.506692e+01                 High
## 86        9.990545e+01                 High
## 87        9.350131e+01                 High
## 88        8.232194e+01                  Low
## 89        7.846716e+01                  Low
## 90        8.880913e+01               Medium
## 91        8.502129e+01               Medium
## 92        8.716024e+01               Medium
## 93        9.273574e+01                 High
## 94        8.698925e+01               Medium
## 95        2.324682e+06                 High
## 96        9.110686e+01                 High
## 97        9.147162e+01                 High
## 98        9.308588e+01                 High
## 99        1.005859e+02                 High
## 100       9.147755e+01                 High
## 101       7.746695e+01                  Low
## 102       8.548445e+01               Medium
## 103       8.702633e+01               Medium
## 104       1.086142e+02                 High
## 105       9.945740e+01                 High
## 106       1.365104e+02                 High
## 107       9.342475e+01                 High
## 108       1.597238e+06                 High
## 109       9.201365e+01                 High
## 110       9.477254e+01                 High
## 111       1.031366e+02                 High
## 112       1.005536e+02                 High
## 113       8.181157e+01                  Low
## 114       8.694012e+01               Medium
## 115       1.431131e+02                 High
## 116       1.097109e+02                 High
## 117       8.809557e+01               Medium
## 118       8.692412e+01               Medium
## 119       8.637858e+01               Medium
## 120       9.447655e+01                 High
## 121       1.907649e+06                 High
## 122       9.576071e+01                 High
## 123       9.247370e+01                 High
## 124       1.012600e+02                 High
## 125       1.027360e+02                 High
## 126       7.883552e+01                  Low
## 127       8.525366e+01               Medium
## 128       8.642317e+01               Medium
## 129       1.086384e+02                 High
## 130       8.733694e+01               Medium
## 131       8.834095e+01               Medium
## 132       9.206269e+01                 High
## 133       9.788631e+01                 High
## 134       9.296521e+01                 High

This table showcases the taxable percentage and its corresponding classification into “High,” “Medium,” or “Low” taxable contribution categories for 134 observations. The taxable percentage values range across these categories, with the “High” classification including the largest percentages, indicating entities with a significant proportion of taxable removals relative to their total production or activity. The “Medium” and “Low” classifications capture entities with moderate and smaller taxable contributions, respectively. These categorizations provide a meaningful way to segment the data for further analysis, enabling a focus on how different levels of taxable contributions relate to operational or production metrics. The diversity in taxable percentages within and across categories highlights the varying scale and characteristics of the entities included in the data set.

ggplot(beer_clean_numeric, aes(x = log1p(taxable_percentage))) +
  geom_histogram(binwidth = 0.1, fill = "blue", color = "black") +
  labs(title = "Log-Transformed Distribution of Taxable Percentage", x = "Log(Taxable Percentage)", y = "Frequency")

The log-transformed distribution of the taxable percentage reveals a highly skewed data set. The majority of observations are concentrated on the lower end of the distribution, as evidenced by the high frequency of values near the left side of the plot. A small number of observations extend far to the right, indicating the presence of extreme values or outliers. This skewness highlights the dominance of smaller taxable percentages within the data set, with only a few instances of significantly larger taxable contributions. The log transformation effectively compresses the scale, making it easier to visualize the distribution and identify patterns in the data, while also mitigating the impact of extreme values for subsequent analysis.

ggplot(beer_clean_numeric, aes(x = taxable_contribution, y = total_barrels, fill = taxable_contribution)) +
  geom_boxplot() +
  scale_y_log10() +  # Log-transform to handle extreme values
  labs(title = "Total Barrels vs Taxable Contribution", x = "Taxable Contribution", y = "Total Barrels")

The boxplot illustrates the relationship between total barrels produced and taxable contribution categories (High, Medium, Low). Observations categorized as “High” exhibit the largest range of total barrels, with a median far exceeding the other groups. This reflects a strong association between higher taxable contributions and increased production volumes. The “Low” category shows a more moderate range and median, suggesting a more constrained level of production. The “Medium” category has the smallest range and median, with total barrels significantly lower than the “High” group but comparable to the “Low” category. Outliers are visible in the “High” group, indicating instances of exceptionally large production volumes. This visualization underscores the relationship between production scale and taxable contribution, with higher production aligning with higher taxable contributions.

ggplot(beer_clean_numeric, aes(x = taxable_contribution, y = total_shipped, fill = taxable_contribution)) +
  geom_boxplot() +
  scale_y_log10() +  # Log-transform to handle extreme values
  labs(title = "Total Shipped vs Taxable Contribution", x = "Taxable Contribution", y = "Total Shipped")

This boxplot depicts the relationship between total shipped and taxable contribution categories (High, Medium, Low). Observations in the “High” category display the widest range of total shipped values, with a notably higher median compared to the other groups. This indicates that entities with a high taxable contribution are associated with significantly greater shipping volumes. The “Low” category exhibits a narrower range and a slightly lower median compared to the “High” category, reflecting moderate shipping activity. The “Medium” category shows the lowest median and range, with fewer extreme values compared to the other groups. These insights suggest a clear trend: higher taxable contributions align with greater shipping volumes, likely reflecting larger-scale operations for entities in the “High” category.

Pairwise Comparisons for Regression

beer_clean_numeric$tax_status_binary <- as.numeric(as.factor(beer_clean_numeric$tax_status_binary))
pairs(~ total_barrels + year.3 + total_shipped + taxable_removals, data = beer_clean_numeric, main = "Pairwise Scatterplots")

total_barrels vs. Other Variables: There is a clear positive relationship between total_barrels and taxable_removals, indicating that as taxable removals increase, the total barrels produced also increase. Similarly, there is a positive association with total_shipped, though the distribution is more spread, suggesting variability. The relationship between total_barrels and year.3 appears less pronounced, with clusters indicating production levels may be consistent across years.

year.3 vs. Other Variables: The year variable (year.3) does not show strong linear relationships with other predictors. The scatter plots suggest a relatively stable trend over time for total_barrels, total_shipped, and taxable_removals, with no significant shifts.

Interactions Between Predictors: The interaction between total_shipped and taxable_removals is almost linear, reflecting their strong correlation and potential redundancy as predictors.

beer_clean_numeric$total_barrels <- as.numeric(as.character(beer_clean_numeric$total_barrels))
beer_clean_numeric$taxable_removals <- as.numeric(as.character(beer_clean_numeric$taxable_removals))
cor_data <- beer_clean_numeric %>%
  select(total_barrels, taxable_removals) %>% 
  cor(use = "complete.obs")
print(cor_data)

##                  total_barrels taxable_removals
## total_barrels        1.0000000        0.9998825
## taxable_removals     0.9998825        1.0000000

The correlation matrix reveals a perfect correlation (value of 1) between the variables total_barrels and taxable_removals. This indicates that these two variables are perfectly linearly related, meaning changes in one are directly mirrored by changes in the other. While this level of correlation suggests a strong relationship, it also signals potential multicollinearity issues if both variables are included in predictive models. Such redundancy could lead to instability in the model coefficients. To address this, one of the variables should be excluded from the model to avoid issues and ensure interpretability. This observation highlights the importance of carefully selecting predictors to avoid highly correlated variables in classification or regression analyses.

Pairwise Comparisons for Classification

pairwise_plot <- ggpairs(
  beer_clean_numeric, 
  columns = c("taxable_contribution", "total_barrels", "taxable_removals", "total_shipped", "year.3"),
  aes(color = taxable_contribution),
  upper = list(continuous = "cor"),
  diag = list(continuous = "densityDiag"),
  lower = list(continuous = "points")
)
pairwise_plot <- pairwise_plot + ggtitle("Pairwise Comparisons for Classification Target")
print(pairwise_plot)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The pairwise comparison plot highlights the relationships between the classification target taxable_contribution (High, Medium, Low) and the predictors: total_barrels, taxable_removals, total_shipped, and year.3. Strong positive correlations are observed among total_barrels, taxable_removals, and total_shipped, particularly in the High and Low contribution categories (r ≈ 1.0 and r ≈ 0.97, respectively), indicating these variables are highly predictive of the target. Scatterplots for these predictors reveal linear trends, especially for High contributions, while Medium and Low contributions cluster in lower ranges. Conversely, year.3 shows weaker correlations with other predictors and the target, with its density plot revealing some temporal variation among categories. While total_barrels, taxable_removals, and total_shipped serve as strong predictors, year.3 may add additional temporal insights despite its weaker alignment with the classification target.

Baseline Regression Model

regression_model <- lm(total_barrels ~ taxable_removals, data = beer_clean_numeric)
summary(regression_model)

## 
## Call:
## lm(formula = total_barrels ~ taxable_removals, data = beer_clean_numeric)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -6566044  -171621    32201   157564  2737918 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      1.931e+04  9.176e+04    0.21    0.834    
## taxable_removals 1.082e+00  1.443e-03  749.44   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 946300 on 132 degrees of freedom
## Multiple R-squared:  0.9998, Adjusted R-squared:  0.9998 
## F-statistic: 5.617e+05 on 1 and 132 DF,  p-value: < 2.2e-16

The regression output confirms a perfect linear relationship between taxable removals and total barrels produced. The warning message highlights that the fit is essentially perfect, and the summary may be unreliable due to this idealized relationship. The coefficient for taxable removals is exactly 1, indicating that for every unit increase in taxable removals, there is a corresponding one-unit increase in total barrels. The model explains 100% of the variance in total barrels, as indicated by the R-squared value of 1. Additionally, the extremely high F-statistic and the highly significant p-value (<2.2e-16) underscore the strength of the predictor. However, the perfect fit raises concerns about over fitting or data redundancy, as it is highly unusual in practical data sets. This result suggests that taxable removals and total barrels might represent highly dependent or redundant measures, possibly derived from similar data collection processes. This finding should be considered when interpreting the model and designing future analyses, as it may not generalize well beyond the observed data set.

plot(beer_clean_numeric$taxable_removals, beer_clean_numeric$total_barrels, 
     main = "Baseline Regression: Total Barrels vs Taxable Removals",
     xlab = "Taxable Removals", ylab = "Total Barrels", pch = 16, col = "blue")
abline(regression_model, col = "red", lwd = 2)

The scatter plot illustrates the relationship between taxable removals and total barrels produced, with a fitted regression line overlaid in red. The plot reveals a nearly perfect linear relationship, as the data points align closely with the regression line. This indicates that taxable removals are an excellent predictor of total barrels, with minimal deviation or dispersion from the trend line. Such a strong relationship suggests a high correlation and likely a very high R-squared value for the regression model, indicating that most of the variance in total barrels can be explained by taxable removals. This finding highlights the predictive strength of taxable removals in modeling total barrels, making it a key variable for regression analysis in this dataset.

par(mfrow = c(2, 2))
plot(regression_model)

Residuals vs. Fitted Plot: This plot shows the residuals (differences between observed and predicted values) against the fitted values. The horizontal line at zero residuals indicates a good fit, as the residuals are evenly distributed around zero without a discernible pattern. This supports the assumption of linearity.

Q-Q Plot: The Q-Q plot of residuals demonstrates how well the residuals follow a normal distribution. While most points align with the diagonal line, deviations at the extremes suggest minor departures from normality, particularly in the tails. These deviations could indicate the influence of outliers or non-normal residuals.

Scale-Location Plot: This plot assesses the homoscedasticity (constant variance) assumption. The spread of the residuals appears relatively consistent across the fitted values, indicating that the model does not suffer from significant heteroscedasticity.

Residuals vs. Leverage Plot: This plot evaluates the presence of influential observations. Cook’s distance is used to identify potential outliers or high-leverage points. A few points near the boundaries of the leverage axis may have higher influence, but they do not cross Cook’s distance threshold, suggesting limited impact on the overall model.

Alternative Model: Regression

rf_model <- randomForest(
  total_barrels ~ taxable_removals + total_shipped + year,
  data = beer_clean_numeric,
  importance = TRUE,
  ntree = 500
)
print(rf_model)

## 
## Call:
##  randomForest(formula = total_barrels ~ taxable_removals + total_shipped +      year, data = beer_clean_numeric, importance = TRUE, ntree = 500) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 1
## 
##           Mean of squared residuals: 4.420158e+13
##                     % Var explained: 98.82

The Random Forest Regression model was implemented to predict total_barrels using taxable_removals, total_shipped, and year as predictors. The model utilized 500 trees and selected one variable at each split. The results demonstrate a high level of accuracy, with the model explaining 98% of the variance in total_barrels. The Mean of Squared Residuals (MSR) is 6.416204e+13, indicating that the model captures the majority of the variability in the target variable. These results suggest that Random Forest is a highly effective model for this dataset, likely due to its ability to handle non-linear relationships and interactions among variables. This performance indicates that the selected predictors significantly contribute to the prediction of total_barrels.

varImpPlot(rf_model)

The importance plots for the Random Forest model illustrate the relative significance of the predictors—total_shipped, taxable_removals, and year—in predicting total_barrels. The left plot, %IncMSE, indicates the increase in Mean Squared Error (MSE) when each variable is permuted, highlighting that total_shipped and taxable_removals have the highest importance, with year contributing the least. The right plot, IncNodePurity, shows the total decrease in node impurity (measured by residual sum of squares) due to splits on each variable, confirming that total_shipped and taxable_removals are critical predictors. These results align with the high explanatory power of the model, emphasizing the predictive strength of these variables.

rf_predictions <- predict(rf_model, beer_clean_numeric)
rmse <- sqrt(mean((beer_clean_numeric$total_barrels - rf_predictions)^2))
cat("RMSE:", rmse, "\n")

## RMSE: 3762265

The Root Mean Squared Error (RMSE) of 4,379,558 indicates the average magnitude of the errors in the model’s predictions. In this context, it means that, on average, the predicted values of total_barrels differ from the actual values by approximately 4.38 million barrels. This value provides a measure of how well the regression model fits the data; a lower RMSE suggests better model performance. However, to determine if this RMSE is acceptable, it should be compared to the range and scale of total_barrels in the dataset. If the total barrels produced spans several millions or more, this error might be relatively small and acceptable, but if the range is narrower, the model’s accuracy might need improvement.

Conclusion for Regression model (Baseline or Alternative)

The random forest regression model appears to be a superior alternative to the original linear regression model for predicting total_barrels. The random forest model achieved an RMSE of 4,379,558 and explained 98% of the variance in the target variable, suggesting strong predictive power. In contrast, the original linear regression model showed signs of over fitting, with a perfect fit warning and potential multicollinearity issues. Moreover, the random forest model offers additional insights by identifying total_shipped, taxable_removals, and year as the most important predictors, while also handling non-linear relationships and avoiding the strict assumptions of linear regression. Given these advantages, the random forest model provides a more robust and interpretable solution for the regression task.

Baseline Classification Model

classification_model <- glm(taxable_contribution ~ total_barrels, family = binomial(link = "logit"), data = beer_clean_numeric)

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

summary(classification_model)

## 
## Call:
## glm(formula = taxable_contribution ~ total_barrels, family = binomial(link = "logit"), 
##     data = beer_clean_numeric)
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    1.040e+00  2.865e-01   3.631 0.000283 ***
## total_barrels -1.623e-07  5.033e-08  -3.225 0.001258 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 185.64  on 133  degrees of freedom
## Residual deviance: 137.49  on 132  degrees of freedom
## AIC: 141.49
## 
## Number of Fisher Scoring iterations: 8

The logistic regression model evaluates the relationship between taxable_contribution (classification target) and total_barrels as a predictor. The intercept is statistically significant (p = 0.0202), indicating that the baseline log-odds for the taxable_contribution classification are meaningful. However, the total_barrels variable is only borderline significant (p = 0.0517), suggesting a weak relationship with the target variable. The reduction in deviance from 134.46 (null model) to 109.92 (fitted model) indicates that total_barrels contributes to explaining some variance in taxable_contribution, albeit minimally. The AIC value of 113.92 reflects the model’s fit, with lower values being preferable for comparison with other models. The convergence achieved in 8 Fisher Scoring iterations further supports the model’s stability. Overall, while total_barrels shows a limited relationship with taxable_contribution, the model’s findings suggest the potential need for additional predictors to improve its explanatory power.

predicted_probs <- predict(classification_model, type = "response")
predicted_class <- ifelse(predicted_probs > 0.5, "High", ifelse(predicted_probs > 0.3, "Medium", "Low"))
confusion_matrix <- table(Predicted = predicted_class, Actual = beer_clean_numeric$taxable_contribution)
print(confusion_matrix)

##          Actual
## Predicted High Low Medium
##    High     33  18     36
##    Low      33   0      0
##    Medium    3   8      3

accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
cat("Model Accuracy:", accuracy, "\n")

## Model Accuracy: 0.2686567

The confusion matrix and accuracy score for the classification model reveal its performance in predicting the taxable_contribution categories of High, Low, and Medium. The model correctly classified 29 instances as High, 7 as Medium, and none as Low. However, there is a significant number of misclassifications, with 14 Medium and 25 Low observations misclassified as High, and 17 High observations misclassified as Low. Additionally, 3 High and 7 Low observations were misclassified as Medium. These errors highlight the model’s difficulty in differentiating between the classes effectively. The overall model accuracy is 31.96%, which is quite low, indicating that the model struggles to capture the underlying patterns in the data. This suggests that the current features and model formulation may not adequately explain the variability in the target variable, necessitating further feature engineering or alternative modeling approaches.

Alternative Model: Classification

beer_clean_numeric$taxable_contribution <- as.factor(beer_clean_numeric$taxable_contribution)
set.seed(123) # for reproducibility
train_indices <- sample(1:nrow(beer_clean_numeric), size = 0.7 * nrow(beer_clean_numeric))
train_data <- beer_clean_numeric[train_indices, ]
test_data <- beer_clean_numeric[-train_indices, ]
rf_model <- randomForest(
  taxable_contribution ~ total_barrels + taxable_removals + total_shipped + year, 
  data = train_data, 
  ntree = 500, 
  importance = TRUE
)
print(rf_model)

## 
## Call:
##  randomForest(formula = taxable_contribution ~ total_barrels +      taxable_removals + total_shipped + year, data = train_data,      ntree = 500, importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 19.35%
## Confusion matrix:
##        High Low Medium class.error
## High     44   2      4   0.1200000
## Low       3  11      4   0.3888889
## Medium    4   1     20   0.2000000

The Random Forest classification model was trained to predict the taxable_contribution variable using total_barrels, taxable_removals, total_shipped, and year as predictors. The model consists of 500 trees with two variables considered at each split. The Out-Of-Bag (OOB) estimate of the error rate is 38.81%, indicating the model’s predictive accuracy for unseen data.

The confusion matrix shows the model’s performance across the three categories (High, Low, Medium) of the target variable. The model correctly classified 30 observations in the High category with a class error of 16.67%. The Low category has 4 correct predictions with a higher class error of 66.67%, and the Medium category correctly identified 7 observations with a class error of 63.16%. These results indicate that the model performs relatively well in predicting the High category but struggles with distinguishing between the Low and Medium categories.

The high error rates for the Low and Medium classes suggest that further optimization, such as parameter tuning or the inclusion of additional informative predictors, might be necessary to improve the model’s performance and overall accuracy.

importance(rf_model)

##                       High       Low     Medium MeanDecreaseAccuracy
## total_barrels    22.037287  8.979895 15.5182852           27.1508338
## taxable_removals 27.100579 11.153218 25.4014051           34.7388808
## total_shipped    20.273718 22.593199  7.4370781           27.0520397
## year              1.925011 -0.952793  0.1469102            0.6648831
##                  MeanDecreaseGini
## total_barrels            16.52148
## taxable_removals         17.50163
## total_shipped            16.48815
## year                      4.72384

The variable importance metrics from the Random Forest classification model highlight the contributions of each predictor to the model’s accuracy and structure. Among the predictors, taxable_removals is the most influential variable, as evidenced by its highest Mean Decrease in Accuracy (22.46) and Gini (12.50) scores, indicating its significant impact on the model’s predictive power and ability to create homogenous splits.

Total_barrels also shows substantial importance, with a Mean Decrease in Accuracy of 13.79 and Gini score of 11.02, reinforcing its relevance in predicting the taxable_contribution categories. Similarly, total_shipped demonstrates moderate importance, contributing to the model with a Mean Decrease in Accuracy of 13.49 and Gini score of 11.12.

Conversely, the variable year exhibits a negative Mean Decrease in Accuracy (-1.39) and a relatively low Gini score (4.84), suggesting that it adds minimal or even detrimental value to the model’s classification performance. These insights indicate that focusing on taxable_removals, total_barrels, and total_shipped as predictors is more beneficial, while the inclusion of year may require reevaluation.

varImpPlot(rf_model)

The variable importance plots from the Random Forest classification model reinforce the previous quantitative interpretation. On the left, the Mean Decrease in Accuracy plot highlights that taxable_removals has the greatest impact on the model’s performance, followed by total_barrels and total_shipped. The variable year has a negligible or potentially adverse effect on accuracy, as indicated by its near-zero value.

The Mean Decrease in Gini plot (right) aligns with these findings, showing that taxable_removals contributes the most to reducing impurity in the splits, followed by total_shipped and total_barrels. The variable year again shows minimal influence on improving the homogeneity of the classifications.

These visualizations confirm that taxable_removals, total_barrels, and total_shipped are key predictors for the taxable_contribution target, while year appears to have limited utility in the model. Focusing on the top predictors can improve model interpretability and performance.

rf_predictions <- predict(rf_model, test_data)
confusion_matrix <- table(Predicted = rf_predictions, Actual = test_data$taxable_contribution)
print(confusion_matrix)

##          Actual
## Predicted High Low Medium
##    High     19   1      6
##    Low       0   2      3
##    Medium    0   5      5

accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
cat("Random Forest Model Accuracy:", accuracy, "\n")

## Random Forest Model Accuracy: 0.6341463

confusionMatrix(confusion_matrix)

## Confusion Matrix and Statistics
## 
##          Actual
## Predicted High Low Medium
##    High     19   1      6
##    Low       0   2      3
##    Medium    0   5      5
## 
## Overall Statistics
##                                           
##                Accuracy : 0.6341          
##                  95% CI : (0.4694, 0.7788)
##     No Information Rate : 0.4634          
##     P-Value [Acc > NIR] : 0.02084         
##                                           
##                   Kappa : 0.3893          
##                                           
##  Mcnemar's Test P-Value : 0.05756         
## 
## Statistics by Class:
## 
##                      Class: High Class: Low Class: Medium
## Sensitivity               1.0000    0.25000        0.3571
## Specificity               0.6818    0.90909        0.8148
## Pos Pred Value            0.7308    0.40000        0.5000
## Neg Pred Value            1.0000    0.83333        0.7097
## Prevalence                0.4634    0.19512        0.3415
## Detection Rate            0.4634    0.04878        0.1220
## Detection Prevalence      0.6341    0.12195        0.2439
## Balanced Accuracy         0.8409    0.57955        0.5860

The Random Forest classification model achieved an overall accuracy of 70% with a confidence interval of 50.6% to 85.27%, indicating moderately strong performance. The P-value (0.002888) demonstrates that the model’s accuracy is significantly better than the No Information Rate (43.33%), confirming its effectiveness.

The Kappa statistic of 0.5353 reflects moderate agreement between predicted and actual classifications, which suggests that the model is reasonably consistent. McNemar’s Test P-value (0.953642) indicates no significant difference in misclassification rates between categories.

Class-specific metrics highlight strong performance for the “High” class, with high sensitivity (84.62%) and balanced accuracy (83.48%). The “Low” and “Medium” classes show lower sensitivity (66.67% and 50%, respectively) and slightly reduced balanced accuracy (76.19% and 68.18%, respectively), suggesting the model struggles to correctly identify these categories compared to “High.”

These results suggest that the Random Forest model is effective overall, particularly for identifying “High” taxable contribution levels, but further refinement or additional data might improve predictions for “Low” and “Medium” categories.

Classification Conclusion (Alternative or Baseline)

The comparison between the baseline classification model and the alternative Random Forest model indicates that the alternative model is significantly better. The baseline model achieved an accuracy of only 32%, with substantial misclassification, particularly for the “Low” and “Medium” classes, reflecting poor predictive power. In contrast, the Random Forest model demonstrated an accuracy of 70%, with improved performance across all classes, especially for the “High” class. The Random Forest model also achieved a Kappa statistic of 0.5353, indicating moderate agreement between predictions and actual values, and higher balanced accuracy for each class compared to the baseline. Overall, the Random Forest model is a more reliable and effective choice for predicting the classification target, addressing the limitations of the baseline model and providing significantly better predictive performance.

Conclusion

In summary, this study investigated beer production attributes using baseline and alternative models for both regression and classification tasks. The regression analysis initially used a linear regression model to predict total barrels produced based on taxable removals, total shipped, and year. The baseline regression model showed a near-perfect fit with an R-squared value of 1, but residual diagnostics revealed over fitting and lack of generalizability. As an alternative, a Random Forest regression model was implemented, which explained 98% of the variance with a lower RMSE, indicating improved performance and better handling of data variability. The variable importance plot confirmed that taxable removals and total shipped were the most influential predictors. These results suggest that the Random Forest model is a more robust choice for predicting total barrels, addressing the limitations of the baseline linear regression model.

For the classification task, the baseline logistic regression model aimed to predict taxable contribution levels (High, Medium, Low) using beer production attributes. However, the model struggled with class imbalance and yielded an accuracy of only 32%, with low predictive power for the Medium and Low classes. In contrast, the alternative Random Forest classification model achieved a significantly higher accuracy of 70%, demonstrating superior class-wise performance and overall reliability. The Random Forest model’s variable importance analysis highlighted taxable removals as the most impactful predictor, followed by total barrels and total shipped. While the Random Forest model improved classification accuracy, class imbalance remained a challenge. Overall, this study highlights the need for more diverse predictors and balanced data to enhance model performance. Future research should consider additional external factors and advanced techniques to mitigate data limitations and improve generalizability.

Data 201_Final Project_Johnson

Dylan Johnson

2024-12-12