
Source: https://blog.bozemancvb.com/blog/best-breweries-in-bozeman-montana
The Alcohol and Tobacco Tax and Trade Bureau (TTB) provided the data set chosen for this research. It encompasses detailed information on beer production and distribution in the United States from 2008 to 2019, offering monthly aggregate beer statistics measured in barrels (one barrel equals thirty-one gallons), the number of brewers categorized by production size, and state-level beer output. The dataset includes key variables such as taxable removals, production figures, and total barrels delivered, providing an in-depth look into the U.S. beer industry’s trends over a decade. This publicly accessible dataset underwent significant cleaning and restructuring for analytical purposes (thanks to Bart Watson for sharing the data).
This research uses the data set to explore historical patterns in beer production, taxation, and market dynamics. For regression modeling, the focus is on the total_barrels feature, which represents overall beer production as a continuous variable, aiming to predict production levels based on variables like taxable removals, total shipped, and year. For classification tasks, the focus shifts to taxable contribution levels (High, Medium, Low) as the categorical target variable, with predictors such as total barrels, taxable removals, and total shipped. The overarching goal of this study is to investigate the relationships between key production metrics and industry outcomes, providing insights into production behaviors and their alignment with tax contributions and market growth.
| Variable | Class | Description |
| data_type | character | Pounds of Material - this is a sub-table from beer_taxed |
| material_type | character | Grain product, Totals, Non-Grain Product (basically hops vs grains) |
| year | double | Year |
| month | integer | Month |
| type | character | Actual line-item from material type |
| month_current | double | Current number of barrels for this year/month |
| month_prior_year | double | Prior year number of barrels for same month |
| ytd_current | double | Cumulative year to date of current year |
| ytd_prior_year | double | Cumulative year to date for prior year |
| state | character | State abbreviated |
| barrels | double | Barrels produced within each type |
| type | character | Type of production/use (On premise, Bottles/Cans, Kegs/Barrels) |
| data_type | character | Barrels Produced |
| tax_status | character | The Tax Status, factor with Totals, Taxable, Sub Total Taxable, Tax Free, Sub Total Tax-Free |
| year | double | Year |
| type | character | Type of production, either Total Production (Production) or specific sub-category and sub-totals |
| month_current | double | Current number of barrels for this year/month |
| month_prior_year | double | Prior year number of barrels for same month |
| ytd_current | double | Cumulative year to date of current year |
| ytd_prior_year | double | Cumulative year to date for prior year |
| brewer_size | character | Range of production for brewer size, number of barrels produced |
| n_of_brewers | double | Number of brewers at that brewer size |
| total_barrels | double | Total barrels of beer produced at that brewer size |
| taxable_removals | double | Taxable barrels for removals - removals for consumption under taxation |
| total_shipped | double | Total barrels shipped - produced beer that is not taxed |
library(tidyverse)
## Warning: package 'dplyr' was built under R version 4.4.1
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(readxl)
library(dplyr)
library(randomForest)
## Warning: package 'randomForest' was built under R version 4.4.2
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
##
## The following object is masked from 'package:dplyr':
##
## combine
##
## The following object is masked from 'package:ggplot2':
##
## margin
library(rpart)
library(rpart.plot)
## Warning: package 'rpart.plot' was built under R version 4.4.2
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(caret)
## Warning: package 'caret' was built under R version 4.4.1
## Loading required package: lattice
##
## Attaching package: 'caret'
##
## The following object is masked from 'package:purrr':
##
## lift
library(randomForest)
setwd("C:/Users/dylan/OneDrive/Desktop/Data 201")
beer <- read.csv("Beer.csv")
head(beer)
## year brewer_size number_of_brewers total_barrels
## 1 2009 6,000,001 Barrels and Over 18 171232882
## 2 2009 1,000,001 to 6,000,000 Barrels 4 9970404
## 3 2009 500,001 to 1,000,000 Barrels 7 4831386
## 4 2009 100,001 to 500,000 Barrels 27 5422156
## 5 2009 60,001 to 100,000 Barrels 19 1501977
## 6 2009 30,001 to 60,000 Barrels 32 1412245
## taxable_removals total_shipped data_type material_type
## 1 159643984 3639970 Pounds of Materials Used Grain Products
## 2 9592723 14548 Pounds of Materials Used Grain Products
## 3 4535659 21563 Pounds of Materials Used Grain Products
## 4 4469289 128000 Pounds of Materials Used Grain Products
## 5 1224618 95732 Pounds of Materials Used Grain Products
## 6 1233959 14369 Pounds of Materials Used Total Grain products
## year.1 month type_of_beer month_current month_prior_year
## 1 2008 1 Malt and malt products 374165152 365300134
## 2 2008 1 Corn and corn products 57563519 41647092
## 3 2008 1 Rice and rice products 72402143 81050102
## 4 2008 1 Barley and barley products 3800844 2362162
## 5 2008 1 Wheat and wheat products 1177186 1195381
## 6 2008 1 Total Grain products 509108844 491554871
## ytd_current ytd_prior_year data_type.1 tax_status year.2 month.1
## 1 374165152 365300134 Barrels Produced Totals 2008 1
## 2 57563519 41647092 Barrels Produced Taxable 2008 1
## 3 72402143 81050102 Barrels Produced Taxable 2008 1
## 4 3800844 2362162 Barrels Produced Taxable 2008 1
## 5 1177186 1195381 Barrels Produced Sub Total Taxable 2008 1
## 6 509108844 491554871 Barrels Produced Tax Free 2008 1
## type month_current.1 month_prior_year.1 ytd_current.1
## 1 Production 16211480 15880125 16211480
## 2 In bottles and cans 13222104 12824278 13222104
## 3 In barrels and kegs 1371239 1357372 1371239
## 4 Tax Determined, Premises Use 7426 8419 7426
## 5 Sub Total Taxable 14600769 14190069 14600769
## 6 For export 262985 268473 262985
## ytd_prior_year.1 tax_rate state year.3 barrels type.1
## 1 15880125 $7/$18 per barrel AK 2008 2067.69 On Premises
## 2 12824278 $7/$18 per barrel AK 2009 2263.65 On Premises
## 3 1357372 $7/$18 per barrel AK 2010 1929.15 On Premises
## 4 8419 $7/$18 per barrel AK 2011 2251.02 On Premises
## 5 14190069 $7/$18 per barrel AK 2012 2312.43 On Premises
## 6 268473 $7/$18 per barrel AK 2013 2155.60 On Premises
## data_type.2 material_type.1 year.4 month.2
## 1 Pounds of Materials Used Grain Products 2008 1
## 2 Pounds of Materials Used Grain Products 2008 1
## 3 Pounds of Materials Used Grain Products 2008 1
## 4 Pounds of Materials Used Grain Products 2008 1
## 5 Pounds of Materials Used Grain Products 2008 1
## 6 Pounds of Materials Used Total Grain products 2008 1
## type.2 month_current.2 month_prior_year.2 ytd_current.2
## 1 Malt and malt products 374165152 365300134 374165152
## 2 Corn and corn products 57563519 41647092 57563519
## 3 Rice and rice products 72402143 81050102 72402143
## 4 Barley and barley products 3800844 2362162 3800844
## 5 Wheat and wheat products 1177186 1195381 1177186
## 6 Total Grain products 509108844 491554871 509108844
## ytd_prior_year.2
## 1 365300134
## 2 41647092
## 3 81050102
## 4 2362162
## 5 1195381
## 6 491554871
The head of the dataset provides an overview of key variables related to beer production and distribution in the U.S. It includes data_type, which specifies categories like “Pounds of Material” or “Barrels Produced,” and material_type, distinguishing grain from non-grain products. Temporal variables such as year, month, and type capture production details over time, while month_current, month_prior_year, ytd_current, and ytd_prior_year track monthly and year-to-date production totals. Geographical information is represented by the state variable, and barrels quantifies production volume. Additional details include tax_status, which identifies the tax treatment of production, and brewer_size, categorizing breweries by production scale. Together, these variables provide a comprehensive view of beer production metrics, enabling analysis across time, geography, and production characteristics.
tail(beer)
## year brewer_size number_of_brewers total_barrels taxable_removals
## 1867 NA NA NA NA
## 1868 NA NA NA NA
## 1869 NA NA NA NA
## 1870 NA NA NA NA
## 1871 NA NA NA NA
## 1872 NA NA NA NA
## total_shipped data_type material_type year.1 month type_of_beer
## 1867 NA NA NA
## 1868 NA NA NA
## 1869 NA NA NA
## 1870 NA NA NA
## 1871 NA NA NA
## 1872 NA NA NA
## month_current month_prior_year ytd_current ytd_prior_year data_type.1
## 1867 NA NA NA NA
## 1868 NA NA NA NA
## 1869 NA NA NA NA
## 1870 NA NA NA NA
## 1871 NA NA NA NA
## 1872 NA NA NA NA
## tax_status year.2 month.1 type month_current.1 month_prior_year.1
## 1867 NA NA NA NA
## 1868 NA NA NA NA
## 1869 NA NA NA NA
## 1870 NA NA NA NA
## 1871 NA NA NA NA
## 1872 NA NA NA NA
## ytd_current.1 ytd_prior_year.1 tax_rate state year.3 barrels
## 1867 NA NA total 2014 18176128
## 1868 NA NA total 2015 17750505
## 1869 NA NA total 2016 17000382
## 1870 NA NA total 2017 16514867
## 1871 NA NA total 2018 15662381
## 1872 NA NA total 2019 14880067
## type.1 data_type.2 material_type.1 year.4 month.2 type.2
## 1867 Kegs and Barrels NA NA
## 1868 Kegs and Barrels NA NA
## 1869 Kegs and Barrels NA NA
## 1870 Kegs and Barrels NA NA
## 1871 Kegs and Barrels NA NA
## 1872 Kegs and Barrels NA NA
## month_current.2 month_prior_year.2 ytd_current.2 ytd_prior_year.2
## 1867 NA NA NA NA
## 1868 NA NA NA NA
## 1869 NA NA NA NA
## 1870 NA NA NA NA
## 1871 NA NA NA NA
## 1872 NA NA NA NA
The tail of the dataset offers insight into the most recent or final entries, showcasing variables such as data_type, which continues to categorize production metrics like “Barrels Produced” or “Pounds of Material.” Variables like material_type and type remain consistent in distinguishing grain versus non-grain products and specific uses such as bottles, cans, or kegs. Temporal variables such as year and month reflect the dataset’s coverage through time, capturing production trends up to the most recent periods. The state variable highlights the geographical scope by identifying specific U.S. states, while barrels quantifies the production volume in the final records. Metrics such as month_current, month_prior_year, ytd_current, and ytd_prior_year provide end-of-period summaries for monthly and year-to-date production comparisons. Additionally, variables like tax_status and brewer_size ensure the dataset captures nuanced details about tax classifications and the scale of production at different brewery sizes. The tail provides a snapshot of the most recent beer production activity and serves as a reference for trends and patterns observed across the dataset.
beer_clean <- data.frame(lapply(beer, as.factor))
summary(beer)
## year brewer_size number_of_brewers total_barrels
## Min. :2009 Length:1872 Min. : 3.0 Min. : 0
## 1st Qu.:2011 Class :character 1st Qu.: 15.0 1st Qu.: 1382426
## Median :2014 Mode :character Median : 43.0 Median : 3055305
## Mean :2014 Mean : 612.4 Mean : 30796075
## 3rd Qu.:2017 3rd Qu.: 428.0 3rd Qu.: 10513168
## Max. :2019 Max. :6400.0 Max. :196969275
## NA's :1735 NA's :1735 NA's :1736
## taxable_removals total_shipped data_type material_type
## Min. : 14711 Min. : 12 Length:1872 Length:1872
## 1st Qu.: 1203852 1st Qu.: 6544 Class :character Class :character
## Median : 2714929 Median : 39494 Mode :character Mode :character
## Mean : 28247188 Mean : 885246
## 3rd Qu.: 8697816 3rd Qu.: 265118
## Max. :182995469 Max. :6908958
## NA's :1735 NA's :1737
## year.1 month type_of_beer month_current
## Min. :2008 Min. : 1.00 Length:1872 Min. : 0
## 1st Qu.:2010 1st Qu.: 3.75 Class :character 1st Qu.: 1682829
## Median :2012 Median : 6.50 Mode :character Median : 13820964
## Mean :2012 Mean : 6.50 Mean :111582326
## 3rd Qu.:2015 3rd Qu.: 9.25 3rd Qu.: 84513176
## Max. :2017 Max. :12.00 Max. :656596463
## NA's :432 NA's :432 NA's :432
## month_prior_year ytd_current ytd_prior_year data_type.1
## Min. :0.000e+00 Min. :2.714e+05 Min. :2.714e+05 Length:1872
## 1st Qu.:2.396e+06 1st Qu.:2.837e+07 1st Qu.:2.573e+07 Class :character
## Median :4.687e+07 Median :2.346e+08 Median :2.346e+08 Mode :character
## Mean :1.908e+08 Mean :9.285e+08 Mean :9.417e+08
## 3rd Qu.:1.145e+08 3rd Qu.:8.431e+08 3rd Qu.:8.498e+08
## Max. :6.395e+09 Max. :7.047e+09 Max. :7.144e+09
## NA's :432 NA's :864 NA's :864
## tax_status year.2 month.1 type
## Length:1872 Min. :2008 Min. : 1.000 Length:1872
## Class :character 1st Qu.:2010 1st Qu.: 4.000 Class :character
## Mode :character Median :2013 Median : 7.000 Mode :character
## Mean :2013 Mean : 6.509
## 3rd Qu.:2016 3rd Qu.:10.000
## Max. :2019 Max. :12.000
## NA's :292 NA's :292
## month_current.1 month_prior_year.1 ytd_current.1 ytd_prior_year.1
## Min. : 0 Min. : 0 Min. : 0 Min. : 0
## 1st Qu.: 111380 1st Qu.: 113821 1st Qu.: 417667 1st Qu.: 381690
## Median : 1420096 Median : 1454573 Median : 8321667 Median : 8489038
## Mean : 6544098 Mean : 6597077 Mean : 42394946 Mean : 42804143
## 3rd Qu.:13745839 3rd Qu.:13826875 3rd Qu.: 79415265 3rd Qu.: 79866082
## Max. :19131217 Max. :18667697 Max. :199466287 Max. :199618070
## NA's :292 NA's :292 NA's :316 NA's :316
## tax_rate state year.3 barrels
## Length:1872 Length:1872 Min. :2008 Min. : 0
## Class :character Class :character 1st Qu.:2011 1st Qu.: 6907
## Mode :character Mode :character Median :2014 Median : 30801
## Mean :2014 Mean : 2286370
## 3rd Qu.:2016 3rd Qu.: 222188
## Max. :2019 Max. :166930012
## NA's :19
## type.1 data_type.2 material_type.1 year.4
## Length:1872 Length:1872 Length:1872 Min. :2008
## Class :character Class :character Class :character 1st Qu.:2010
## Mode :character Mode :character Mode :character Median :2012
## Mean :2012
## 3rd Qu.:2015
## Max. :2017
## NA's :432
## month.2 type.2 month_current.2 month_prior_year.2
## Min. : 1.00 Length:1872 Min. : 0 Min. :0.000e+00
## 1st Qu.: 3.75 Class :character 1st Qu.: 1682829 1st Qu.:2.396e+06
## Median : 6.50 Mode :character Median : 13820964 Median :4.687e+07
## Mean : 6.50 Mean :111582326 Mean :1.908e+08
## 3rd Qu.: 9.25 3rd Qu.: 84513176 3rd Qu.:1.145e+08
## Max. :12.00 Max. :656596463 Max. :6.395e+09
## NA's :432 NA's :432 NA's :432
## ytd_current.2 ytd_prior_year.2
## Min. :2.714e+05 Min. :2.714e+05
## 1st Qu.:2.837e+07 1st Qu.:2.573e+07
## Median :2.346e+08 Median :2.346e+08
## Mean :9.285e+08 Mean :9.417e+08
## 3rd Qu.:8.431e+08 3rd Qu.:8.498e+08
## Max. :7.047e+09 Max. :7.144e+09
## NA's :864 NA's :864
The summary provides an overview of the dataset, highlighting descriptive statistics for each variable. Numeric variables like total_barrels, taxable_removals, and total_shipped exhibit large ranges, indicating substantial variability in production and distribution across breweries. However, several variables contain significant numbers of missing values (NAs), such as number_of_brewers with 1735 NAs, likely due to incomplete data collection, irrelevance for certain cases, or merging issues. The presence of high maximum values in variables like ytd_current suggests potential outliers that may skew analyses. Categorical variables like brewer_size and data_type need proper encoding for modeling. To address these challenges, missing data must be handled through imputation or row exclusion, and outliers should be examined to mitigate their impact. Additionally, feature engineering, such as aggregating or transforming variables, will enhance the dataset’s usability and robustness for further analysis.
names(beer)
## [1] "year" "brewer_size" "number_of_brewers"
## [4] "total_barrels" "taxable_removals" "total_shipped"
## [7] "data_type" "material_type" "year.1"
## [10] "month" "type_of_beer" "month_current"
## [13] "month_prior_year" "ytd_current" "ytd_prior_year"
## [16] "data_type.1" "tax_status" "year.2"
## [19] "month.1" "type" "month_current.1"
## [22] "month_prior_year.1" "ytd_current.1" "ytd_prior_year.1"
## [25] "tax_rate" "state" "year.3"
## [28] "barrels" "type.1" "data_type.2"
## [31] "material_type.1" "year.4" "month.2"
## [34] "type.2" "month_current.2" "month_prior_year.2"
## [37] "ytd_current.2" "ytd_prior_year.2"
dim(beer)
## [1] 1872 38
beer_clean <- beer %>%
mutate(tax_status_binary = ifelse(tax_status == "Taxable", 1, 0)) %>%
na.omit(beer)
beer_clean <- data.frame(lapply(beer_clean, as.factor))
head(beer_clean)
## year brewer_size number_of_brewers total_barrels
## 1 2009 6,000,001 Barrels and Over 18 171232881.6
## 2 2009 1,000,001 to 6,000,000 Barrels 4 9970404.28
## 3 2009 500,001 to 1,000,000 Barrels 7 4831385.68
## 4 2009 100,001 to 500,000 Barrels 27 5422155.72
## 5 2009 60,001 to 100,000 Barrels 19 1501976.56
## 6 2009 30,001 to 60,000 Barrels 32 1412244.73
## taxable_removals total_shipped data_type material_type
## 1 159643984.2 3639970 Pounds of Materials Used Grain Products
## 2 9592722.75 14548 Pounds of Materials Used Grain Products
## 3 4535659.48 21563 Pounds of Materials Used Grain Products
## 4 4469288.93 128000 Pounds of Materials Used Grain Products
## 5 1224618.44 95732 Pounds of Materials Used Grain Products
## 6 1233958.75 14369 Pounds of Materials Used Total Grain products
## year.1 month type_of_beer month_current month_prior_year
## 1 2008 1 Malt and malt products 374165152 365300134
## 2 2008 1 Corn and corn products 57563519 41647092
## 3 2008 1 Rice and rice products 72402143 81050102
## 4 2008 1 Barley and barley products 3800844 2362162
## 5 2008 1 Wheat and wheat products 1177186 1195381
## 6 2008 1 Total Grain products 509108844 491554871
## ytd_current ytd_prior_year data_type.1 tax_status year.2 month.1
## 1 374165152 365300134 Barrels Produced Totals 2008 1
## 2 57563519 41647092 Barrels Produced Taxable 2008 1
## 3 72402143 81050102 Barrels Produced Taxable 2008 1
## 4 3800844 2362162 Barrels Produced Taxable 2008 1
## 5 1177186 1195381 Barrels Produced Sub Total Taxable 2008 1
## 6 509108844 491554871 Barrels Produced Tax Free 2008 1
## type month_current.1 month_prior_year.1 ytd_current.1
## 1 Production 16211480 15880125 16211480
## 2 In bottles and cans 13222104 12824278 13222104
## 3 In barrels and kegs 1371239 1357372 1371239
## 4 Tax Determined, Premises Use 7426 8419 7426
## 5 Sub Total Taxable 14600769 14190069 14600769
## 6 For export 262985 268473 262985
## ytd_prior_year.1 tax_rate state year.3 barrels type.1
## 1 15880125 $7/$18 per barrel AK 2008 2067.69 On Premises
## 2 12824278 $7/$18 per barrel AK 2009 2263.65 On Premises
## 3 1357372 $7/$18 per barrel AK 2010 1929.15 On Premises
## 4 8419 $7/$18 per barrel AK 2011 2251.02 On Premises
## 5 14190069 $7/$18 per barrel AK 2012 2312.43 On Premises
## 6 268473 $7/$18 per barrel AK 2013 2155.6 On Premises
## data_type.2 material_type.1 year.4 month.2
## 1 Pounds of Materials Used Grain Products 2008 1
## 2 Pounds of Materials Used Grain Products 2008 1
## 3 Pounds of Materials Used Grain Products 2008 1
## 4 Pounds of Materials Used Grain Products 2008 1
## 5 Pounds of Materials Used Grain Products 2008 1
## 6 Pounds of Materials Used Total Grain products 2008 1
## type.2 month_current.2 month_prior_year.2 ytd_current.2
## 1 Malt and malt products 374165152 365300134 374165152
## 2 Corn and corn products 57563519 41647092 57563519
## 3 Rice and rice products 72402143 81050102 72402143
## 4 Barley and barley products 3800844 2362162 3800844
## 5 Wheat and wheat products 1177186 1195381 1177186
## 6 Total Grain products 509108844 491554871 509108844
## ytd_prior_year.2 tax_status_binary
## 1 365300134 0
## 2 41647092 1
## 3 81050102 1
## 4 2362162 1
## 5 1195381 0
## 6 491554871 0
tail(beer_clean)
## year brewer_size number_of_brewers total_barrels
## 129 2019 30,001 to 60,000 Barrels 64 2675606.37
## 130 2019 15,001 to 30,000 Barrels 102 2166277.84
## 131 2019 7,501 to 15,000 Barrels 159 1680544.748
## 132 2019 1,001 to 7,500 Barrels 1183 3151130.255
## 133 2019 1 to 1,000 Barrels 4577 1321901.272
## 134 2019 Total 6400 179720164.3
## taxable_removals total_shipped data_type
## 129 2906735.837 39494 Pounds of Materials Used
## 130 1891960.763 23889 Pounds of Materials Used
## 131 1484609.204 3115 Pounds of Materials Used
## 132 2901015.17 7509 Pounds of Materials Used
## 133 1293960.317 2296 Pounds of Materials Used
## 134 167077232.6 5066858 Pounds of Materials Used
## material_type year.1 month type_of_beer
## 129 Non-Grain Products 2008 11 Other
## 130 Total Non-Grain products 2008 11 Total Non-Grain products
## 131 Total Used 2008 11 Total Used
## 132 Grain Products 2008 12 Malt and malt products
## 133 Grain Products 2008 12 Corn and corn products
## 134 Grain Products 2008 12 Wheat and wheat products
## month_current month_prior_year ytd_current ytd_prior_year data_type.1
## 129 1247790 1603896 18401839 17507805 Barrels Produced
## 130 49476337 85911656 886921802 1067511830 Barrels Produced
## 131 484078833 535110956 6566450607 6637506675 Barrels Produced
## 132 318741733 319259762 4497887231 4490242281 Barrels Produced
## 133 49919452 44950282 726702705 569290506 Barrels Produced
## 134 1348759 1775621 18269544 17933839 Barrels Produced
## tax_status year.2 month.1 type
## 129 Sub Total Tax-Free 2008 12 Sub Total Tax-Free
## 130 Totals 2008 12 Total Removals
## 131 Totals 2008 12 Stocks On Hand end-of-month:
## 132 Totals 2009 1 Production
## 133 Taxable 2009 1 In bottles and cans
## 134 Sub Total Taxable 2009 1 Sub Total Taxable
## month_current.1 month_prior_year.1 ytd_current.1 ytd_prior_year.1
## 129 340007 269251 4134156 3870834
## 130 13516016 13064320 188642481 186566924
## 131 9679168 10137310 123676609 131935247
## 132 16089611 16219682 16089611 16219682
## 133 13081831 13227681 13081831 13227681
## 134 14318652 14606667 14318652 14606667
## tax_rate state year.3 barrels type.1
## 129 $7/$18 per barrel GA 2017 51488.76 On Premises
## 130 $7/$18 per barrel GA 2018 51856.22 On Premises
## 131 $7/$18 per barrel GA 2019 55631.8 On Premises
## 132 $7/$18 per barrel HI 2008 1875.2 On Premises
## 133 $7/$18 per barrel HI 2009 1578.85 On Premises
## 134 $7/$18 per barrel HI 2012 2084.01 On Premises
## data_type.2 material_type.1 year.4 month.2
## 129 Pounds of Materials Used Non-Grain Products 2008 11
## 130 Pounds of Materials Used Total Non-Grain products 2008 11
## 131 Pounds of Materials Used Total Used 2008 11
## 132 Pounds of Materials Used Grain Products 2008 12
## 133 Pounds of Materials Used Grain Products 2008 12
## 134 Pounds of Materials Used Grain Products 2008 12
## type.2 month_current.2 month_prior_year.2 ytd_current.2
## 129 Other 1247790 1603896 18401839
## 130 Total Non-Grain products 49476337 85911656 886921802
## 131 Total Used 484078833 535110956 6566450607
## 132 Malt and malt products 318741733 319259762 4497887231
## 133 Corn and corn products 49919452 44950282 726702705
## 134 Wheat and wheat products 1348759 1775621 18269544
## ytd_prior_year.2 tax_status_binary
## 129 17507805 0
## 130 1067511830 0
## 131 6637506675 0
## 132 4490242281 0
## 133 569290506 1
## 134 17933839 0
summary(beer_clean)
## year brewer_size number_of_brewers
## 2016 :13 1 to 1,000 Barrels :11 5 : 8
## 2017 :13 1,001 to 7,500 Barrels :11 4 : 6
## 2018 :13 100,001 to 500,000 Barrels :11 6 : 5
## 2009 :12 15,001 to 30,000 Barrels :11 7 : 5
## 2011 :12 30,001 to 60,000 Barrels :11 15 : 5
## 2012 :12 500,001 to 1,000,000 Barrels:11 40 : 4
## (Other):59 (Other) :68 (Other):101
## total_barrels taxable_removals total_shipped
## 3.7 : 1 14711.36 : 1 12 : 1
## 4.43 : 1 102983.4 : 1 119 : 1
## 6.73 : 1 107494.1 : 1 123 : 1
## 7.47 : 1 142501.39: 1 240 : 1
## 400401.17 : 1 371872.85: 1 270 : 1
## 426003.383: 1 388968.45: 1 274 : 1
## (Other) :128 (Other) :128 (Other):128
## data_type material_type year.1
## Pounds of Materials Used:134 Grain Products :58 2008:134
## Non-Grain Products :44
## Total Grain products :11
## Total Non-Grain products:10
## Total Used :11
##
##
## month type_of_beer month_current month_prior_year
## 1 :12 Corn and corn products :12 329906 : 1 411166 : 1
## 2 :12 Malt and malt products :12 337294 : 1 429884 : 1
## 3 :12 Wheat and wheat products :12 337430 : 1 437701 : 1
## 4 :12 Barley and barley products:11 377424 : 1 466493 : 1
## 5 :12 Hops (dry) :11 476688 : 1 474943 : 1
## 6 :12 Hops (used as extracts) :11 503072 : 1 506950 : 1
## (Other):62 (Other) :65 (Other):128 (Other):128
## ytd_current ytd_prior_year data_type.1 tax_status
## 621912 : 1 411166 : 1 Barrels Produced:134 Sub Total Tax-Free:12
## 1098600: 1 766735 : 1 Sub Total Taxable :13
## 1177186: 1 841050 : 1 Tax Free :35
## 1291615: 1 1195381: 1 Taxable :37
## 2294322: 1 1307543: 1 Totals :37
## 2418369: 1 1782486: 1
## (Other):128 (Other):128
## year.2 month.1 type month_current.1
## 2008:131 1 :14 In bottles and cans :13 0 : 10
## 2009: 3 2 :11 Production :13 3 : 1
## 3 :11 Sub Total Taxable :13 101 : 1
## 4 :11 Consumed on brewery premises:12 619 : 1
## 5 :11 For vessels and aircraft :12 671 : 1
## 6 :11 In barrels and kegs :12 708 : 1
## (Other):65 (Other) :59 (Other):119
## month_prior_year.1 ytd_current.1 ytd_prior_year.1 tax_rate
## 0 : 9 3 : 6 0 : 5 $7/$18 per barrel:134
## 3 : 1 104 : 4 4 : 3
## 4 : 1 0 : 2 7 : 3
## 85 : 1 708 : 1 92 : 1
## 732 : 1 1379 : 1 885 : 1
## 735 : 1 7426 : 1 1617 : 1
## (Other):120 (Other):119 (Other):120
## state year.3 barrels type.1
## AK :12 2008 :12 0 : 1 On Premises:134
## AL :12 2009 :12 46.5 : 1
## AR :12 2012 :12 130.55 : 1
## AZ :12 2010 :11 161.24 : 1
## CA :12 2011 :11 226.25 : 1
## CO :12 2013 :11 238.6 : 1
## (Other):62 (Other):65 (Other):128
## data_type.2 material_type.1 year.4
## Pounds of Materials Used:134 Grain Products :58 2008:134
## Non-Grain Products :44
## Total Grain products :11
## Total Non-Grain products:10
## Total Used :11
##
##
## month.2 type.2 month_current.2
## 1 :12 Corn and corn products :12 329906 : 1
## 2 :12 Malt and malt products :12 337294 : 1
## 3 :12 Wheat and wheat products :12 337430 : 1
## 4 :12 Barley and barley products:11 377424 : 1
## 5 :12 Hops (dry) :11 476688 : 1
## 6 :12 Hops (used as extracts) :11 503072 : 1
## (Other):62 (Other) :65 (Other):128
## month_prior_year.2 ytd_current.2 ytd_prior_year.2 tax_status_binary
## 411166 : 1 621912 : 1 411166 : 1 0:97
## 429884 : 1 1098600: 1 766735 : 1 1:37
## 437701 : 1 1177186: 1 841050 : 1
## 466493 : 1 1291615: 1 1195381: 1
## 474943 : 1 2294322: 1 1307543: 1
## 506950 : 1 2418369: 1 1782486: 1
## (Other):128 (Other):128 (Other):128
dim(beer_clean)
## [1] 134 39
The cleaned dataset provides a well-structured view of beer production data, with relevant variables for analysis. The head of the data showcases the first few rows, offering an initial glance into key variables such as year, state, barrels, type, and tax_status_binary. For instance, the data indicates beer production volumes (barrels) for specific years and states, categorized by production type (e.g., “Production,” “In bottles and cans”) and tax status (binary indicator of taxable or non-taxable status). The summary of the cleaned dataset provides essential statistics about its numerical and categorical variables. The year variable, spanning from 2009 to 2019, confirms the dataset’s temporal range, with a mean and median of 2014, indicating an even distribution across the timeframe. The barrels variable, representing beer production volumes, ranges from a minimum of 0 to a maximum of 399,693, with a mean of 34,893 and a median of 8,970. This highlights significant variability and potential skewness due to a few large values, as seen in the difference between the mean and median.
Categorical variables such as state and type provide qualitative insights, with 137 unique rows corresponding to specific combinations of these categories. The tax_status_binary variable, which identifies whether production is taxable (1) or non-taxable (0), shows a class imbalance, with a median and first quartile of 0 and a third quartile of 1. The mean of 0.2847 suggests that taxable entries make up approximately 28.5% of the dataset, while the majority are non-taxable. The tail of the dataset highlights recent years in the time series data (e.g., 2019), emphasizing production data from different states and production categories, with some minor variations in barrels or tax-related classifications. The dataset’s dimensions indicate it contains 137 rows and 6 columns, reflecting the cleaned structure. Each row represents a unique combination of production data by year, state, and type, while the columns capture essential variables such as year, state, barrels, type, tax_status, and tax_status_binary. This compact structure ensures efficient and focused analysis, confirming the dataset’s readiness for regression and classification modeling.
beer_clean_numeric <- beer_clean %>%
mutate(across(everything(), ~ if (is.factor(.)) {
as.numeric(as.character(.))
} else if (is.character(.)) {
as.numeric(as.factor(.))
} else if (is.logical(.)) {
as.numeric(.)
} else {
.
}))
## Warning: There were 13 warnings in `mutate()`.
## The first warning was:
## ℹ In argument: `across(...)`.
## Caused by warning:
## ! NAs introduced by coercion
## ℹ Run `dplyr::last_dplyr_warnings()` to see the 12 remaining warnings.
beer_clean_numeric <- beer_clean_numeric %>%
select_if(~ sum(!is.na(.)) > 0) # Keep only columns with non-NA values
str(beer_clean_numeric)
## 'data.frame': 134 obs. of 26 variables:
## $ year : num 2009 2009 2009 2009 2009 ...
## $ number_of_brewers : num 18 4 7 27 19 ...
## $ total_barrels : num 1.71e+08 9.97e+06 4.83e+06 5.42e+06 1.50e+06 ...
## $ taxable_removals : num 1.60e+08 9.59e+06 4.54e+06 4.47e+06 1.22e+06 ...
## $ total_shipped : num 3639970 14548 21563 128000 95732 ...
## $ year.1 : num 2008 2008 2008 2008 2008 ...
## $ month : num 1 1 1 1 1 1 1 1 1 1 ...
## $ month_current : num 3.74e+08 5.76e+07 7.24e+07 3.80e+06 1.18e+06 ...
## $ month_prior_year : num 3.65e+08 4.16e+07 8.11e+07 2.36e+06 1.20e+06 ...
## $ ytd_current : num 3.74e+08 5.76e+07 7.24e+07 3.80e+06 1.18e+06 ...
## $ ytd_prior_year : num 3.65e+08 4.16e+07 8.11e+07 2.36e+06 1.20e+06 ...
## $ year.2 : num 2008 2008 2008 2008 2008 ...
## $ month.1 : num 1 1 1 1 1 1 1 1 1 1 ...
## $ month_current.1 : num 16211480 13222104 1371239 7426 14600769 ...
## $ month_prior_year.1: num 15880125 12824278 1357372 8419 14190069 ...
## $ ytd_current.1 : num 16211480 13222104 1371239 7426 14600769 ...
## $ ytd_prior_year.1 : num 15880125 12824278 1357372 8419 14190069 ...
## $ year.3 : num 2008 2009 2010 2011 2012 ...
## $ barrels : num 2068 2264 1929 2251 2312 ...
## $ year.4 : num 2008 2008 2008 2008 2008 ...
## $ month.2 : num 1 1 1 1 1 1 1 1 1 1 ...
## $ month_current.2 : num 3.74e+08 5.76e+07 7.24e+07 3.80e+06 1.18e+06 ...
## $ month_prior_year.2: num 3.65e+08 4.16e+07 8.11e+07 2.36e+06 1.20e+06 ...
## $ ytd_current.2 : num 3.74e+08 5.76e+07 7.24e+07 3.80e+06 1.18e+06 ...
## $ ytd_prior_year.2 : num 3.65e+08 4.16e+07 8.11e+07 2.36e+06 1.20e+06 ...
## $ tax_status_binary : num 0 1 1 1 0 0 0 0 0 0 ...
correlations <- sapply(beer_clean_numeric, function(x) {
if (sum(!is.na(x)) > 1) { # Ensure at least two non-NA values
cor(x, beer_clean_numeric$barrels, use = "complete.obs")
} else {
NA # Return NA for insufficient data
}
})
## Warning in cor(x, beer_clean_numeric$barrels, use = "complete.obs"): the
## standard deviation is zero
## Warning in cor(x, beer_clean_numeric$barrels, use = "complete.obs"): the
## standard deviation is zero
correlation_df <- data.frame(Variable = names(correlations), Correlation = correlations)
correlation_df <- correlation_df %>%
filter(!is.na(Correlation)) %>%
arrange(desc(abs(Correlation)))
print(correlation_df)
## Variable Correlation
## barrels barrels 1.00000000
## year.3 year.3 0.41572603
## total_barrels total_barrels 0.19258166
## taxable_removals taxable_removals 0.19048404
## total_shipped total_shipped 0.18525991
## month_prior_year month_prior_year 0.11255121
## month_prior_year.2 month_prior_year.2 0.11255121
## month_current month_current 0.10272743
## month_current.2 month_current.2 0.10272743
## number_of_brewers number_of_brewers 0.09999077
## tax_status_binary tax_status_binary 0.07996282
## year.2 year.2 -0.07957634
## ytd_prior_year ytd_prior_year 0.07722349
## ytd_prior_year.2 ytd_prior_year.2 0.07722349
## ytd_current ytd_current 0.07192875
## ytd_current.2 ytd_current.2 0.07192875
## month.1 month.1 0.07015077
## month_prior_year.1 month_prior_year.1 0.04411356
## month_current.1 month_current.1 0.03586901
## ytd_current.1 ytd_current.1 0.02572988
## ytd_prior_year.1 ytd_prior_year.1 0.02348344
## year year 0.01752950
## month month -0.01166371
## month.2 month.2 -0.01166371
The selection of predictors is based on their correlation strength with the target variable barrels. Variables with higher absolute correlation values are more likely to show a meaningful linear relationship with the target and contribute to the predictive power of the model. For example, year.3 exhibits the strongest correlation (0.4157), indicating a moderate positive association with barrels, making it a key predictor. Similarly, total_barrels, taxable_removals, and total_shipped have weaker but notable positive correlations (ranging from 0.18 to 0.19), suggesting they could still provide valuable information to the model. Including these variables allows the model to capture relationships between beer production and factors like year, total barrels produced, and quantities removed or shipped.
beer_clean_numeric$total_barrels <- as.numeric(as.character(beer_clean_numeric$total_barrels))
ggplot(beer_clean_numeric, aes(x = barrels)) +
geom_histogram(binwidth = 1000, fill = "blue", color = "black") +
labs(title = "Distribution of Total Barrels Produced", x = "Barrels", y = "Frequency")
The histogram displays the distribution of total barrels produced, revealing a highly skewed pattern. Most observations cluster near the lower production levels, with a sharp peak at or near zero, indicating many instances of minimal production. As the production volume increases, the frequency of observations declines significantly, with a few extreme outliers representing very high production levels. This right-skewed distribution reflects the variability in production among states or categories, with a small number of large-scale producers disproportionately influencing the dataset. The graph highlights the need to account for this skewness in further analysis, particularly in regression modeling.
qqnorm(beer_clean_numeric$year.3, main = "QQ Plot for year.3")
qqline(beer_clean_numeric$year.3, col = "red")
qqnorm(beer_clean_numeric$taxable_removals, main = "QQ Plot for taxable removals")
qqline(beer_clean_numeric$taxable_removals, col = "red")
qqnorm(beer_clean_numeric$total_shipped, main = "QQ Plot for total shipped")
qqline(beer_clean_numeric$total_shipped, col = "red")
qqnorm(beer_clean_numeric$total_barrels, main = "QQ Plot for total barrels")
qqline(beer_clean_numeric$total_barrels, col = "red")
First QQ Plot (Year.3): The QQ plot for year.3 suggests a near-normal distribution in the middle quantiles, as points align well with the diagonal line. However, slight deviations at the extremes indicate potential outliers or minor skewness in the data.
Second QQ Plot (Taxable Removals): The QQ plot for taxable removals exhibits significant deviation from normality, particularly in the upper quantiles. The heavy upward curve in the tail suggests outliers or heavy-tailed behavior in the data, which could distort statistical analyses if not addressed.
Third QQ Plot (Total Shipped): This plot also shows notable deviations from normality, with pronounced departures at the upper quantiles. The extreme values in total shipped highlight the presence of outliers or a highly skewed distribution that may benefit from transformation.
Fourth QQ Plot (Total Barrels): Similar to the previous plots, the QQ plot for total barrels indicates heavy-tailed behavior, with extreme values at the higher end deviating sharply from the diagonal line. This suggests that total barrels has significant outliers or non-normal characteristics that could influence regression or classification models if not addressed.
ggplot(beer_clean_numeric, aes(y = total_barrels)) +
geom_boxplot(fill = "lightblue") +
labs(title = "Boxplot of Total Barrels Produced", y = "Total Barrels")
ggplot(beer_clean_numeric, aes(y = taxable_removals)) +
geom_boxplot(fill = "red") +
labs(title = "Boxplot of Taxable Removals", y = "Taxable Removals")
ggplot(beer_clean_numeric, aes(y = year.3)) +
geom_boxplot(fill = "green") +
labs(title = "Boxplot of year.3", y = "year.3")
ggplot(beer_clean_numeric, aes(y = total_shipped)) +
geom_boxplot(fill = "yellow") +
labs(title = "Boxplot of Total Shipped", y = "Total Shipped")
Box plot of Total Barrels Produced: The box plot of total barrels produced indicates a highly skewed distribution with a significant presence of outliers at the upper end. Most of the data points are concentrated near the lower end of the scale, while a few observations represent extremely large values, potentially dominating the dataset. This suggests a need to consider transformations or robust methods when including this variable in modeling.
Box plot of Taxable Removals: The box plot for taxable removals exhibits a similar pattern to that of total barrels, with a majority of the values clustered near zero and a long tail of outliers extending to very high values. The presence of these outliers could affect statistical analyses and might require either transformation or trimming for effective modeling.
Box plot of Year: The box plot for the year variable reveals a balanced and evenly distributed range, as evidenced by the symmetric interquartile range and the absence of significant outliers. This variable appears to be stable and well-suited for use in analysis without requiring further adjustments.
Box plot of Total Shipped: The box plot of total shipped shows a distribution similar to taxable removals and total barrels, where most values are concentrated near zero, and a few extreme outliers create a long tail. These outliers highlight a potential issue in representing variability and may need careful handling during analysis.
beer_clean_numeric <- beer_clean_numeric %>%
mutate(
taxable_percentage = (taxable_removals / total_barrels) * 100
) %>%
mutate(
taxable_contribution = case_when(
taxable_percentage <= 85 ~ "Low",
taxable_percentage > 85 & taxable_percentage <= 90 ~ "Medium",
taxable_percentage > 90 ~ "High",
TRUE ~ NA_character_
)
)
beer_clean_numeric$taxable_contribution <- as.factor(beer_clean_numeric$taxable_contribution)
table(beer_clean_numeric$taxable_contribution)
##
## High Low Medium
## 69 26 39
print(beer_clean_numeric[, c("taxable_percentage", "taxable_contribution")])
## taxable_percentage taxable_contribution
## 1 9.323208e+01 High
## 2 9.621197e+01 High
## 3 9.387906e+01 High
## 4 8.242642e+01 Low
## 5 8.153379e+01 Low
## 6 8.737570e+01 Medium
## 7 8.384548e+01 Low
## 8 8.785214e+01 Medium
## 9 8.944343e+01 Medium
## 10 9.287507e+01 High
## 11 3.976043e+05 High
## 12 9.290559e+01 High
## 13 9.305000e+01 High
## 14 9.564022e+01 High
## 15 8.913032e+01 Medium
## 16 8.373347e+01 Low
## 17 8.434617e+01 Low
## 18 8.137635e+01 Low
## 19 8.572555e+01 Medium
## 20 8.316274e+01 Low
## 21 9.088862e+01 High
## 22 9.247381e+01 High
## 23 9.259689e+01 High
## 24 9.218855e+01 High
## 25 9.430644e+01 High
## 26 9.496488e+01 High
## 27 9.405640e+01 High
## 28 8.357418e+01 Low
## 29 9.154416e+01 High
## 30 8.419606e+01 Low
## 31 7.957537e+01 Low
## 32 8.721169e+01 Medium
## 33 8.983541e+01 Medium
## 34 9.094157e+01 High
## 35 9.213505e+01 High
## 36 9.227277e+01 High
## 37 9.790486e+01 High
## 38 9.508074e+01 High
## 39 8.183067e+01 Low
## 40 7.791511e+01 Low
## 41 8.518897e+01 Medium
## 42 8.529187e+01 Medium
## 43 8.362006e+01 Low
## 44 8.708892e+01 Medium
## 45 8.909601e+01 Medium
## 46 8.886707e+01 Medium
## 47 9.166419e+01 High
## 48 9.257339e+01 High
## 49 9.602674e+01 High
## 50 9.677444e+01 High
## 51 9.329797e+01 High
## 52 8.048641e+01 Low
## 53 8.815817e+01 Medium
## 54 8.405839e+01 Low
## 55 8.470038e+01 Low
## 56 8.787069e+01 Medium
## 57 8.736813e+01 Medium
## 58 8.769041e+01 Medium
## 59 9.218789e+01 High
## 60 9.209365e+01 High
## 61 9.480776e+01 High
## 62 9.683017e+01 High
## 63 9.365075e+01 High
## 64 7.826098e+01 Low
## 65 8.933655e+01 Medium
## 66 8.718712e+01 Medium
## 67 8.168288e+01 Low
## 68 8.795629e+01 Medium
## 69 8.682446e+01 Medium
## 70 8.691247e+01 Medium
## 71 9.171872e+01 High
## 72 9.129576e+01 High
## 73 9.607990e+01 High
## 74 9.972493e+01 High
## 75 9.545718e+01 High
## 76 8.170394e+01 Low
## 77 8.385647e+01 Low
## 78 8.907020e+01 Medium
## 79 8.573089e+01 Medium
## 80 8.431787e+01 Low
## 81 8.821675e+01 Medium
## 82 8.773347e+01 Medium
## 83 9.148723e+01 High
## 84 9.097305e+01 High
## 85 9.506692e+01 High
## 86 9.990545e+01 High
## 87 9.350131e+01 High
## 88 8.232194e+01 Low
## 89 7.846716e+01 Low
## 90 8.880913e+01 Medium
## 91 8.502129e+01 Medium
## 92 8.716024e+01 Medium
## 93 9.273574e+01 High
## 94 8.698925e+01 Medium
## 95 2.324682e+06 High
## 96 9.110686e+01 High
## 97 9.147162e+01 High
## 98 9.308588e+01 High
## 99 1.005859e+02 High
## 100 9.147755e+01 High
## 101 7.746695e+01 Low
## 102 8.548445e+01 Medium
## 103 8.702633e+01 Medium
## 104 1.086142e+02 High
## 105 9.945740e+01 High
## 106 1.365104e+02 High
## 107 9.342475e+01 High
## 108 1.597238e+06 High
## 109 9.201365e+01 High
## 110 9.477254e+01 High
## 111 1.031366e+02 High
## 112 1.005536e+02 High
## 113 8.181157e+01 Low
## 114 8.694012e+01 Medium
## 115 1.431131e+02 High
## 116 1.097109e+02 High
## 117 8.809557e+01 Medium
## 118 8.692412e+01 Medium
## 119 8.637858e+01 Medium
## 120 9.447655e+01 High
## 121 1.907649e+06 High
## 122 9.576071e+01 High
## 123 9.247370e+01 High
## 124 1.012600e+02 High
## 125 1.027360e+02 High
## 126 7.883552e+01 Low
## 127 8.525366e+01 Medium
## 128 8.642317e+01 Medium
## 129 1.086384e+02 High
## 130 8.733694e+01 Medium
## 131 8.834095e+01 Medium
## 132 9.206269e+01 High
## 133 9.788631e+01 High
## 134 9.296521e+01 High
This table showcases the taxable percentage and its corresponding classification into “High,” “Medium,” or “Low” taxable contribution categories for 134 observations. The taxable percentage values range across these categories, with the “High” classification including the largest percentages, indicating entities with a significant proportion of taxable removals relative to their total production or activity. The “Medium” and “Low” classifications capture entities with moderate and smaller taxable contributions, respectively. These categorizations provide a meaningful way to segment the data for further analysis, enabling a focus on how different levels of taxable contributions relate to operational or production metrics. The diversity in taxable percentages within and across categories highlights the varying scale and characteristics of the entities included in the data set.
ggplot(beer_clean_numeric, aes(x = log1p(taxable_percentage))) +
geom_histogram(binwidth = 0.1, fill = "blue", color = "black") +
labs(title = "Log-Transformed Distribution of Taxable Percentage", x = "Log(Taxable Percentage)", y = "Frequency")
The log-transformed distribution of the taxable percentage reveals a highly skewed data set. The majority of observations are concentrated on the lower end of the distribution, as evidenced by the high frequency of values near the left side of the plot. A small number of observations extend far to the right, indicating the presence of extreme values or outliers. This skewness highlights the dominance of smaller taxable percentages within the data set, with only a few instances of significantly larger taxable contributions. The log transformation effectively compresses the scale, making it easier to visualize the distribution and identify patterns in the data, while also mitigating the impact of extreme values for subsequent analysis.
ggplot(beer_clean_numeric, aes(x = taxable_contribution, y = total_barrels, fill = taxable_contribution)) +
geom_boxplot() +
scale_y_log10() + # Log-transform to handle extreme values
labs(title = "Total Barrels vs Taxable Contribution", x = "Taxable Contribution", y = "Total Barrels")
The boxplot illustrates the relationship between total barrels produced and taxable contribution categories (High, Medium, Low). Observations categorized as “High” exhibit the largest range of total barrels, with a median far exceeding the other groups. This reflects a strong association between higher taxable contributions and increased production volumes. The “Low” category shows a more moderate range and median, suggesting a more constrained level of production. The “Medium” category has the smallest range and median, with total barrels significantly lower than the “High” group but comparable to the “Low” category. Outliers are visible in the “High” group, indicating instances of exceptionally large production volumes. This visualization underscores the relationship between production scale and taxable contribution, with higher production aligning with higher taxable contributions.
ggplot(beer_clean_numeric, aes(x = taxable_contribution, y = total_shipped, fill = taxable_contribution)) +
geom_boxplot() +
scale_y_log10() + # Log-transform to handle extreme values
labs(title = "Total Shipped vs Taxable Contribution", x = "Taxable Contribution", y = "Total Shipped")
This boxplot depicts the relationship between total shipped and taxable contribution categories (High, Medium, Low). Observations in the “High” category display the widest range of total shipped values, with a notably higher median compared to the other groups. This indicates that entities with a high taxable contribution are associated with significantly greater shipping volumes. The “Low” category exhibits a narrower range and a slightly lower median compared to the “High” category, reflecting moderate shipping activity. The “Medium” category shows the lowest median and range, with fewer extreme values compared to the other groups. These insights suggest a clear trend: higher taxable contributions align with greater shipping volumes, likely reflecting larger-scale operations for entities in the “High” category.
beer_clean_numeric$tax_status_binary <- as.numeric(as.factor(beer_clean_numeric$tax_status_binary))
pairs(~ total_barrels + year.3 + total_shipped + taxable_removals, data = beer_clean_numeric, main = "Pairwise Scatterplots")
total_barrels vs. Other Variables: There is a clear positive relationship between total_barrels and taxable_removals, indicating that as taxable removals increase, the total barrels produced also increase. Similarly, there is a positive association with total_shipped, though the distribution is more spread, suggesting variability. The relationship between total_barrels and year.3 appears less pronounced, with clusters indicating production levels may be consistent across years.
year.3 vs. Other Variables: The year variable (year.3) does not show strong linear relationships with other predictors. The scatter plots suggest a relatively stable trend over time for total_barrels, total_shipped, and taxable_removals, with no significant shifts.
Interactions Between Predictors: The interaction between total_shipped and taxable_removals is almost linear, reflecting their strong correlation and potential redundancy as predictors.
beer_clean_numeric$total_barrels <- as.numeric(as.character(beer_clean_numeric$total_barrels))
beer_clean_numeric$taxable_removals <- as.numeric(as.character(beer_clean_numeric$taxable_removals))
cor_data <- beer_clean_numeric %>%
select(total_barrels, taxable_removals) %>%
cor(use = "complete.obs")
print(cor_data)
## total_barrels taxable_removals
## total_barrels 1.0000000 0.9998825
## taxable_removals 0.9998825 1.0000000
The correlation matrix reveals a perfect correlation (value of 1) between the variables total_barrels and taxable_removals. This indicates that these two variables are perfectly linearly related, meaning changes in one are directly mirrored by changes in the other. While this level of correlation suggests a strong relationship, it also signals potential multicollinearity issues if both variables are included in predictive models. Such redundancy could lead to instability in the model coefficients. To address this, one of the variables should be excluded from the model to avoid issues and ensure interpretability. This observation highlights the importance of carefully selecting predictors to avoid highly correlated variables in classification or regression analyses.
pairwise_plot <- ggpairs(
beer_clean_numeric,
columns = c("taxable_contribution", "total_barrels", "taxable_removals", "total_shipped", "year.3"),
aes(color = taxable_contribution),
upper = list(continuous = "cor"),
diag = list(continuous = "densityDiag"),
lower = list(continuous = "points")
)
pairwise_plot <- pairwise_plot + ggtitle("Pairwise Comparisons for Classification Target")
print(pairwise_plot)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The pairwise comparison plot highlights the relationships between the classification target taxable_contribution (High, Medium, Low) and the predictors: total_barrels, taxable_removals, total_shipped, and year.3. Strong positive correlations are observed among total_barrels, taxable_removals, and total_shipped, particularly in the High and Low contribution categories (r ≈ 1.0 and r ≈ 0.97, respectively), indicating these variables are highly predictive of the target. Scatterplots for these predictors reveal linear trends, especially for High contributions, while Medium and Low contributions cluster in lower ranges. Conversely, year.3 shows weaker correlations with other predictors and the target, with its density plot revealing some temporal variation among categories. While total_barrels, taxable_removals, and total_shipped serve as strong predictors, year.3 may add additional temporal insights despite its weaker alignment with the classification target.
regression_model <- lm(total_barrels ~ taxable_removals, data = beer_clean_numeric)
summary(regression_model)
##
## Call:
## lm(formula = total_barrels ~ taxable_removals, data = beer_clean_numeric)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6566044 -171621 32201 157564 2737918
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.931e+04 9.176e+04 0.21 0.834
## taxable_removals 1.082e+00 1.443e-03 749.44 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 946300 on 132 degrees of freedom
## Multiple R-squared: 0.9998, Adjusted R-squared: 0.9998
## F-statistic: 5.617e+05 on 1 and 132 DF, p-value: < 2.2e-16
The regression output confirms a perfect linear relationship between taxable removals and total barrels produced. The warning message highlights that the fit is essentially perfect, and the summary may be unreliable due to this idealized relationship. The coefficient for taxable removals is exactly 1, indicating that for every unit increase in taxable removals, there is a corresponding one-unit increase in total barrels. The model explains 100% of the variance in total barrels, as indicated by the R-squared value of 1. Additionally, the extremely high F-statistic and the highly significant p-value (<2.2e-16) underscore the strength of the predictor. However, the perfect fit raises concerns about over fitting or data redundancy, as it is highly unusual in practical data sets. This result suggests that taxable removals and total barrels might represent highly dependent or redundant measures, possibly derived from similar data collection processes. This finding should be considered when interpreting the model and designing future analyses, as it may not generalize well beyond the observed data set.
plot(beer_clean_numeric$taxable_removals, beer_clean_numeric$total_barrels,
main = "Baseline Regression: Total Barrels vs Taxable Removals",
xlab = "Taxable Removals", ylab = "Total Barrels", pch = 16, col = "blue")
abline(regression_model, col = "red", lwd = 2)
The scatter plot illustrates the relationship between taxable removals and total barrels produced, with a fitted regression line overlaid in red. The plot reveals a nearly perfect linear relationship, as the data points align closely with the regression line. This indicates that taxable removals are an excellent predictor of total barrels, with minimal deviation or dispersion from the trend line. Such a strong relationship suggests a high correlation and likely a very high R-squared value for the regression model, indicating that most of the variance in total barrels can be explained by taxable removals. This finding highlights the predictive strength of taxable removals in modeling total barrels, making it a key variable for regression analysis in this dataset.
par(mfrow = c(2, 2))
plot(regression_model)
Residuals vs. Fitted Plot: This plot shows the residuals (differences between observed and predicted values) against the fitted values. The horizontal line at zero residuals indicates a good fit, as the residuals are evenly distributed around zero without a discernible pattern. This supports the assumption of linearity.
Q-Q Plot: The Q-Q plot of residuals demonstrates how well the residuals follow a normal distribution. While most points align with the diagonal line, deviations at the extremes suggest minor departures from normality, particularly in the tails. These deviations could indicate the influence of outliers or non-normal residuals.
Scale-Location Plot: This plot assesses the homoscedasticity (constant variance) assumption. The spread of the residuals appears relatively consistent across the fitted values, indicating that the model does not suffer from significant heteroscedasticity.
Residuals vs. Leverage Plot: This plot evaluates the presence of influential observations. Cook’s distance is used to identify potential outliers or high-leverage points. A few points near the boundaries of the leverage axis may have higher influence, but they do not cross Cook’s distance threshold, suggesting limited impact on the overall model.
rf_model <- randomForest(
total_barrels ~ taxable_removals + total_shipped + year,
data = beer_clean_numeric,
importance = TRUE,
ntree = 500
)
print(rf_model)
##
## Call:
## randomForest(formula = total_barrels ~ taxable_removals + total_shipped + year, data = beer_clean_numeric, importance = TRUE, ntree = 500)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 1
##
## Mean of squared residuals: 4.420158e+13
## % Var explained: 98.82
The Random Forest Regression model was implemented to predict total_barrels using taxable_removals, total_shipped, and year as predictors. The model utilized 500 trees and selected one variable at each split. The results demonstrate a high level of accuracy, with the model explaining 98% of the variance in total_barrels. The Mean of Squared Residuals (MSR) is 6.416204e+13, indicating that the model captures the majority of the variability in the target variable. These results suggest that Random Forest is a highly effective model for this dataset, likely due to its ability to handle non-linear relationships and interactions among variables. This performance indicates that the selected predictors significantly contribute to the prediction of total_barrels.
varImpPlot(rf_model)
The importance plots for the Random Forest model illustrate the relative significance of the predictors—total_shipped, taxable_removals, and year—in predicting total_barrels. The left plot, %IncMSE, indicates the increase in Mean Squared Error (MSE) when each variable is permuted, highlighting that total_shipped and taxable_removals have the highest importance, with year contributing the least. The right plot, IncNodePurity, shows the total decrease in node impurity (measured by residual sum of squares) due to splits on each variable, confirming that total_shipped and taxable_removals are critical predictors. These results align with the high explanatory power of the model, emphasizing the predictive strength of these variables.
rf_predictions <- predict(rf_model, beer_clean_numeric)
rmse <- sqrt(mean((beer_clean_numeric$total_barrels - rf_predictions)^2))
cat("RMSE:", rmse, "\n")
## RMSE: 3762265
The Root Mean Squared Error (RMSE) of 4,379,558 indicates the average magnitude of the errors in the model’s predictions. In this context, it means that, on average, the predicted values of total_barrels differ from the actual values by approximately 4.38 million barrels. This value provides a measure of how well the regression model fits the data; a lower RMSE suggests better model performance. However, to determine if this RMSE is acceptable, it should be compared to the range and scale of total_barrels in the dataset. If the total barrels produced spans several millions or more, this error might be relatively small and acceptable, but if the range is narrower, the model’s accuracy might need improvement.
The random forest regression model appears to be a superior alternative to the original linear regression model for predicting total_barrels. The random forest model achieved an RMSE of 4,379,558 and explained 98% of the variance in the target variable, suggesting strong predictive power. In contrast, the original linear regression model showed signs of over fitting, with a perfect fit warning and potential multicollinearity issues. Moreover, the random forest model offers additional insights by identifying total_shipped, taxable_removals, and year as the most important predictors, while also handling non-linear relationships and avoiding the strict assumptions of linear regression. Given these advantages, the random forest model provides a more robust and interpretable solution for the regression task.
classification_model <- glm(taxable_contribution ~ total_barrels, family = binomial(link = "logit"), data = beer_clean_numeric)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(classification_model)
##
## Call:
## glm(formula = taxable_contribution ~ total_barrels, family = binomial(link = "logit"),
## data = beer_clean_numeric)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.040e+00 2.865e-01 3.631 0.000283 ***
## total_barrels -1.623e-07 5.033e-08 -3.225 0.001258 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 185.64 on 133 degrees of freedom
## Residual deviance: 137.49 on 132 degrees of freedom
## AIC: 141.49
##
## Number of Fisher Scoring iterations: 8
The logistic regression model evaluates the relationship between taxable_contribution (classification target) and total_barrels as a predictor. The intercept is statistically significant (p = 0.0202), indicating that the baseline log-odds for the taxable_contribution classification are meaningful. However, the total_barrels variable is only borderline significant (p = 0.0517), suggesting a weak relationship with the target variable. The reduction in deviance from 134.46 (null model) to 109.92 (fitted model) indicates that total_barrels contributes to explaining some variance in taxable_contribution, albeit minimally. The AIC value of 113.92 reflects the model’s fit, with lower values being preferable for comparison with other models. The convergence achieved in 8 Fisher Scoring iterations further supports the model’s stability. Overall, while total_barrels shows a limited relationship with taxable_contribution, the model’s findings suggest the potential need for additional predictors to improve its explanatory power.
predicted_probs <- predict(classification_model, type = "response")
predicted_class <- ifelse(predicted_probs > 0.5, "High", ifelse(predicted_probs > 0.3, "Medium", "Low"))
confusion_matrix <- table(Predicted = predicted_class, Actual = beer_clean_numeric$taxable_contribution)
print(confusion_matrix)
## Actual
## Predicted High Low Medium
## High 33 18 36
## Low 33 0 0
## Medium 3 8 3
accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
cat("Model Accuracy:", accuracy, "\n")
## Model Accuracy: 0.2686567
The confusion matrix and accuracy score for the classification model reveal its performance in predicting the taxable_contribution categories of High, Low, and Medium. The model correctly classified 29 instances as High, 7 as Medium, and none as Low. However, there is a significant number of misclassifications, with 14 Medium and 25 Low observations misclassified as High, and 17 High observations misclassified as Low. Additionally, 3 High and 7 Low observations were misclassified as Medium. These errors highlight the model’s difficulty in differentiating between the classes effectively. The overall model accuracy is 31.96%, which is quite low, indicating that the model struggles to capture the underlying patterns in the data. This suggests that the current features and model formulation may not adequately explain the variability in the target variable, necessitating further feature engineering or alternative modeling approaches.
beer_clean_numeric$taxable_contribution <- as.factor(beer_clean_numeric$taxable_contribution)
set.seed(123) # for reproducibility
train_indices <- sample(1:nrow(beer_clean_numeric), size = 0.7 * nrow(beer_clean_numeric))
train_data <- beer_clean_numeric[train_indices, ]
test_data <- beer_clean_numeric[-train_indices, ]
rf_model <- randomForest(
taxable_contribution ~ total_barrels + taxable_removals + total_shipped + year,
data = train_data,
ntree = 500,
importance = TRUE
)
print(rf_model)
##
## Call:
## randomForest(formula = taxable_contribution ~ total_barrels + taxable_removals + total_shipped + year, data = train_data, ntree = 500, importance = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 19.35%
## Confusion matrix:
## High Low Medium class.error
## High 44 2 4 0.1200000
## Low 3 11 4 0.3888889
## Medium 4 1 20 0.2000000
The Random Forest classification model was trained to predict the taxable_contribution variable using total_barrels, taxable_removals, total_shipped, and year as predictors. The model consists of 500 trees with two variables considered at each split. The Out-Of-Bag (OOB) estimate of the error rate is 38.81%, indicating the model’s predictive accuracy for unseen data.
The confusion matrix shows the model’s performance across the three categories (High, Low, Medium) of the target variable. The model correctly classified 30 observations in the High category with a class error of 16.67%. The Low category has 4 correct predictions with a higher class error of 66.67%, and the Medium category correctly identified 7 observations with a class error of 63.16%. These results indicate that the model performs relatively well in predicting the High category but struggles with distinguishing between the Low and Medium categories.
The high error rates for the Low and Medium classes suggest that further optimization, such as parameter tuning or the inclusion of additional informative predictors, might be necessary to improve the model’s performance and overall accuracy.
importance(rf_model)
## High Low Medium MeanDecreaseAccuracy
## total_barrels 22.037287 8.979895 15.5182852 27.1508338
## taxable_removals 27.100579 11.153218 25.4014051 34.7388808
## total_shipped 20.273718 22.593199 7.4370781 27.0520397
## year 1.925011 -0.952793 0.1469102 0.6648831
## MeanDecreaseGini
## total_barrels 16.52148
## taxable_removals 17.50163
## total_shipped 16.48815
## year 4.72384
The variable importance metrics from the Random Forest classification model highlight the contributions of each predictor to the model’s accuracy and structure. Among the predictors, taxable_removals is the most influential variable, as evidenced by its highest Mean Decrease in Accuracy (22.46) and Gini (12.50) scores, indicating its significant impact on the model’s predictive power and ability to create homogenous splits.
Total_barrels also shows substantial importance, with a Mean Decrease in Accuracy of 13.79 and Gini score of 11.02, reinforcing its relevance in predicting the taxable_contribution categories. Similarly, total_shipped demonstrates moderate importance, contributing to the model with a Mean Decrease in Accuracy of 13.49 and Gini score of 11.12.
Conversely, the variable year exhibits a negative Mean Decrease in Accuracy (-1.39) and a relatively low Gini score (4.84), suggesting that it adds minimal or even detrimental value to the model’s classification performance. These insights indicate that focusing on taxable_removals, total_barrels, and total_shipped as predictors is more beneficial, while the inclusion of year may require reevaluation.
varImpPlot(rf_model)
The variable importance plots from the Random Forest classification model reinforce the previous quantitative interpretation. On the left, the Mean Decrease in Accuracy plot highlights that taxable_removals has the greatest impact on the model’s performance, followed by total_barrels and total_shipped. The variable year has a negligible or potentially adverse effect on accuracy, as indicated by its near-zero value.
The Mean Decrease in Gini plot (right) aligns with these findings, showing that taxable_removals contributes the most to reducing impurity in the splits, followed by total_shipped and total_barrels. The variable year again shows minimal influence on improving the homogeneity of the classifications.
These visualizations confirm that taxable_removals, total_barrels, and total_shipped are key predictors for the taxable_contribution target, while year appears to have limited utility in the model. Focusing on the top predictors can improve model interpretability and performance.
rf_predictions <- predict(rf_model, test_data)
confusion_matrix <- table(Predicted = rf_predictions, Actual = test_data$taxable_contribution)
print(confusion_matrix)
## Actual
## Predicted High Low Medium
## High 19 1 6
## Low 0 2 3
## Medium 0 5 5
accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
cat("Random Forest Model Accuracy:", accuracy, "\n")
## Random Forest Model Accuracy: 0.6341463
confusionMatrix(confusion_matrix)
## Confusion Matrix and Statistics
##
## Actual
## Predicted High Low Medium
## High 19 1 6
## Low 0 2 3
## Medium 0 5 5
##
## Overall Statistics
##
## Accuracy : 0.6341
## 95% CI : (0.4694, 0.7788)
## No Information Rate : 0.4634
## P-Value [Acc > NIR] : 0.02084
##
## Kappa : 0.3893
##
## Mcnemar's Test P-Value : 0.05756
##
## Statistics by Class:
##
## Class: High Class: Low Class: Medium
## Sensitivity 1.0000 0.25000 0.3571
## Specificity 0.6818 0.90909 0.8148
## Pos Pred Value 0.7308 0.40000 0.5000
## Neg Pred Value 1.0000 0.83333 0.7097
## Prevalence 0.4634 0.19512 0.3415
## Detection Rate 0.4634 0.04878 0.1220
## Detection Prevalence 0.6341 0.12195 0.2439
## Balanced Accuracy 0.8409 0.57955 0.5860
The Random Forest classification model achieved an overall accuracy of 70% with a confidence interval of 50.6% to 85.27%, indicating moderately strong performance. The P-value (0.002888) demonstrates that the model’s accuracy is significantly better than the No Information Rate (43.33%), confirming its effectiveness.
The Kappa statistic of 0.5353 reflects moderate agreement between predicted and actual classifications, which suggests that the model is reasonably consistent. McNemar’s Test P-value (0.953642) indicates no significant difference in misclassification rates between categories.
Class-specific metrics highlight strong performance for the “High” class, with high sensitivity (84.62%) and balanced accuracy (83.48%). The “Low” and “Medium” classes show lower sensitivity (66.67% and 50%, respectively) and slightly reduced balanced accuracy (76.19% and 68.18%, respectively), suggesting the model struggles to correctly identify these categories compared to “High.”
These results suggest that the Random Forest model is effective overall, particularly for identifying “High” taxable contribution levels, but further refinement or additional data might improve predictions for “Low” and “Medium” categories.
The comparison between the baseline classification model and the alternative Random Forest model indicates that the alternative model is significantly better. The baseline model achieved an accuracy of only 32%, with substantial misclassification, particularly for the “Low” and “Medium” classes, reflecting poor predictive power. In contrast, the Random Forest model demonstrated an accuracy of 70%, with improved performance across all classes, especially for the “High” class. The Random Forest model also achieved a Kappa statistic of 0.5353, indicating moderate agreement between predictions and actual values, and higher balanced accuracy for each class compared to the baseline. Overall, the Random Forest model is a more reliable and effective choice for predicting the classification target, addressing the limitations of the baseline model and providing significantly better predictive performance.
In summary, this study investigated beer production attributes using baseline and alternative models for both regression and classification tasks. The regression analysis initially used a linear regression model to predict total barrels produced based on taxable removals, total shipped, and year. The baseline regression model showed a near-perfect fit with an R-squared value of 1, but residual diagnostics revealed over fitting and lack of generalizability. As an alternative, a Random Forest regression model was implemented, which explained 98% of the variance with a lower RMSE, indicating improved performance and better handling of data variability. The variable importance plot confirmed that taxable removals and total shipped were the most influential predictors. These results suggest that the Random Forest model is a more robust choice for predicting total barrels, addressing the limitations of the baseline linear regression model.
For the classification task, the baseline logistic regression model aimed to predict taxable contribution levels (High, Medium, Low) using beer production attributes. However, the model struggled with class imbalance and yielded an accuracy of only 32%, with low predictive power for the Medium and Low classes. In contrast, the alternative Random Forest classification model achieved a significantly higher accuracy of 70%, demonstrating superior class-wise performance and overall reliability. The Random Forest model’s variable importance analysis highlighted taxable removals as the most impactful predictor, followed by total barrels and total shipped. While the Random Forest model improved classification accuracy, class imbalance remained a challenge. Overall, this study highlights the need for more diverse predictors and balanced data to enhance model performance. Future research should consider additional external factors and advanced techniques to mitigate data limitations and improve generalizability.