After watching the documentary “Somm”, about 4 individuals trying to pass the nearly impossible Master Sommelier exam, written and directed my Jason Wise on Netflix, a fellow “techie” decided to make something out of it for the advancement of future generations’ technology.
This is information from a dataset created by the user @zachthoutt on Kaggle from data scraped from WineEnthusiast
in June of 2017. The idea behind this data set is that the descriptions
and ratings from world class sommeleirs would allow machine learning
models to identify the wine through a blind tasting without ACTUALLY
tasting it.
The creators overall goal was “.. to create a model that can identify
the variety, winery, and location of a wine based on a description.”
This analysis uses this data to try and find the best overall wines with the information provided.
Summarize the data to show which wines are considered the best at specific price points to generate a marketing strategy to increase sales.
Originally 3 separate files, but for the purpose of this analysis, only the first original dataset will be used.
Scraped using this code
The original data was scraped and created by the same owner - not generated by an accredited software or company
The original data was collected in 2017 - Not current
There is possible sampling bias - I did not use all 3 of the files. SEE PREPARE
1 cvs file scraped and created by a user on Kaggle found here
We will focus on the type, price and amount of points received.
Data is limited to the ratings from WineEnthusiust
Data is not current - From 2017
Possible sampling bias as the ratings are from regular indivduals as well as sommeleirs and not enough of either to represent the population as a whole
I chose to begin with Excel. This proved unwise as the dataset has approx. 150k rows. Excel is best used with smaller datasets of around 100 or so rows.
To make things simpler, I removed the duplicates.
Next, I deleted the column I wouldn’t be needing : region_2 and the column with a blank header
After that, I removed all of the blank cells.
Finally, I changed the values of the cells in the “price” column to
currency AND filtered the results to only show wines with 95
points or higher.
Note: I plan on using this data again in the future for additional
analysis expanding on this project.
As I’m looking through my data, I see that there are special characters i.e #, $, % in places where normal letters should be.
To take care of this manually, the function would look something like
this :
SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(A2:H100000,“é”,
“e”),“â”, “a”),“á”,“a”),“ó”,“o”),“É”, “E”),“è”, “e”),“Ô,
“i”),“ö”, “o”),“é”, “a”),“ô”, “o”),“ü”, “u”)
But, that’s a lot.
So, I used the Find & Replace function instead.
Taking it a step further, I combined the “region_1” and “province”
columns to make the new “region_1” column using this formula :
CONCAT(E:E,” , “,F:F)
For future analysis sake, I also changed the column header “points” to just “points”.
I chose to transfer the dataset over to BigQuery because SQL allows for analysis of large datasets such as this to be done quicker and easier than Excel.
Now, I’m no Master Sommeleir, but I do know that were are 3 major types of wine: Red, White, and Sparkling
So, using BigQuery, I’ve added a column that helps people like me know which type of wine is red, white or sparkling. Here’s how:
I researched of those 37 types, which was red, white, or sparkling
I established the unique values within the 37 types for this code:
The CAST function saves the query results as a new table, but thats not helpful to me so lets combine the new table with the results and the old table. I accomplished this using the JOIN function
I now have a new table with the row “red_or_white”.
I want to see if I have an outliers in my data as far a pricing, so I use this code:
I see that I do have outliers, but for the sake of this analysis, I believe that this data is beneficial so I will not remove it. I will use it for my analysis.
For owners of wineries- Consider partnering with a Napa winery listed to study the creation process or create a collaboration to increase sales and marketing strategies.
For the general public- Consider shipping the wine to your house if you’re not near Napa, California or look into finding brands located in the Napa, California region at your local store.
All comments, questions, and suggestions are highly appreciated. I’m happy to share my journey with you.