This project is geared towards exploring the Tax and Trade Bureau’s Certificate of Label Approval database. As background, every wine company is required to submit a copy of their product labels to the TTB for approval (known as a ‘COLA’). This information is then made available to the public at the TTB’s website but in a very poor format for large scale analysis.
The below code is used to read in the data from the TTB and process it entire into a smaller R data store that only contains the relevant data. Fiels are cleaned to remove non-alphanumeric values and to convert values to upper case (as this is needed to ensure collapsing exmaples such as ‘Chardonnay’ and ‘CHARDONNAY’).
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.1.1
##
## Attaching package: 'dplyr'
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(lubridate)
library(stringr)
print_data <- read.table('print2013.txt', header=FALSE,sep='|')
source_data <- read.table('source2013.txt',sep="|",fill=TRUE,col.names=1:74)
source_data <- source_data[,1:26]
names(print_data) <- c("TTB.ID", "PLANT.REGISTRY", "GRAPE.VARIETAL", "NET.CONTENTS","ALCOHOL.CONTENT", "WINE.APPELLATION","WINE.VINTAGE")
names(source_data) <- c("TTB.ID", "Status", "Vendor.Code", "Serial.#","ClassType.Code","Origin.Code","Brand.Name","Fanciful.Name","Type.of.Application","For.Sale.In","Total.Bottle.Capacity","Grape.Varietal","Wine.Vintage","Formula","Lab.No.","Approval.Date","Permit.Number","Company.Line.One", "Company.Line.Two", "Address.Line.Two", "Address.Line.Three","Contact.Name","Phone.Number","Fax.Number","Email", "PDF.Link")
cola_data <- merge(source_data,print_data, all.x=TRUE)
cola_data <- filter(cola_data, Status == 'APPROVED', ClassType.Code %in% c('TABLE RED WINE','TABLE WHITE WINE','SPARKLING WINE/CHAMPAGNE','ROSE WINE','TABLE FLAVORED WINE','DESSERT FLAVORED WINE'))
cola_data$Approval.Date <- as.Date(cola_data$Approval.Date,'%m/%d/%Y')
cola_data$Grape.Varietal <- str_replace_all(cola_data$Grape.Varietal, "[^[:alnum:]]", " ")
cola_data$WINE.APPELLATION <- str_replace_all(cola_data$WINE.APPELLATION, "[^[:alnum:]]", " ")
cola_data$Grape.Varietal <- toupper(cola_data$Grape.Varietal)
cola_data$WINE.APPELLATION <- toupper(cola_data$WINE.APPELLATION)
cola_data$Company.Line.One <- str_replace_all(cola_data$Company.Line.One, "[^[:alnum:]]", " ")
cola_data$Company.Line.One <- toupper(cola_data$Company.Line.One)
cola_data <- cola_data %>%
select(Approval.Date,Grape.Varietal,Company.Line.One,WINE.APPELLATION)
saveRDS(cola_data,"cola_data.Rds")
This initial development provides a systematic overview of the new labels being approved by the TTB. If anything, this project presents a high level proof of concept into what this dataset can reveal. As additional data is available, year over year comparisons could provide comeplling insights into new areas of growth (for example, are Moscato-based wines being introduced at an increasing rate). Additionally, one can gain visibility into what particular varietals or appellations are being produced by key competitors or even new market entrants.