Overview

This will be a multi-part series using the wine review dataset available at https://www.kaggle.com/zynicide/wine-reviews/data.
I focus on getting the data into the optimal format for further exploratory analysis in this part.
I am using the 130,000 record file because it has a few more columns than the 150,000 record version.
The end-product has 115,000 records with no NA’s.
Updated March 20, 2018 to make getting the year more efficient.

Load libraries

You only need tidyverse for this part. Unfortunately I just switched to a Macbook Pro and tidyverse is not installing so I load each individual package I need.

library(dplyr)
library(ggplot2)
library(stringr)
library(readr)
library(tidyr)
library(jsonlite)

Read in the data

wine_01 <- read_csv("winemag-data-130k-v2.csv")

How many records have no NA’s in them?

sum(complete.cases(wine_01))
## [1] 22387

Only 22,387; not good since we started with 130k

NA remediation

What is the distribution of the NA’s?

colSums(is.na(wine_01))
##                    X1               country           description 
##                     0                    63                     0 
##           designation                points                 price 
##                 37465                     0                  8996 
##              province              region_1              region_2 
##                    63                 21247                 79460 
##           taster_name taster_twitter_handle                 title 
##                 26244                 31213                     0 
##               variety                winery 
##                     1                     0

Yikes!! Let’s see if we can get our arms around this a bit.

Price is important and there are only about 9000 that don’t have a price. Let’s drop those rows and see how it impacts the other NA’s.

wine_02 <- wine_01 %>%
    filter(!is.na(price))

colSums(is.na(wine_02))
##                    X1               country           description 
##                     0                    59                     0 
##           designation                points                 price 
##                 34779                     0                     0 
##              province              region_1              region_2 
##                    59                 19575                 70683 
##           taster_name taster_twitter_handle                 title 
##                 24496                 29416                     0 
##               variety                winery 
##                     1                     0

It’s strange that country and province have the same number of NA’s. I’ll look at them to see if there is any connection.

cp_na <- wine_02 %>%
    filter(is.na(country))
head(cp_na[, c(14,2,7,8,9)], 20)
## # A tibble: 20 x 5
##    winery                          country province region_1 region_2
##    <chr>                           <chr>   <chr>    <chr>    <chr>   
##  1 Gotsa Family Wines              <NA>    <NA>     <NA>     <NA>    
##  2 Kakhetia Traditional Winemaking <NA>    <NA>     <NA>     <NA>    
##  3 Tsililis                        <NA>    <NA>     <NA>     <NA>    
##  4 Ross-idi                        <NA>    <NA>     <NA>     <NA>    
##  5 Orbelus                         <NA>    <NA>     <NA>     <NA>    
##  6 St. Donat                       <NA>    <NA>     <NA>     <NA>    
##  7 Familia Deicas                  <NA>    <NA>     <NA>     <NA>    
##  8 Tsililis                        <NA>    <NA>     <NA>     <NA>    
##  9 Stone Castle                    <NA>    <NA>     <NA>     <NA>    
## 10 Teliani Valley                  <NA>    <NA>     <NA>     <NA>    
## 11 Undurraga                       <NA>    <NA>     <NA>     <NA>    
## 12 Mt. Beautiful                   <NA>    <NA>     <NA>     <NA>    
## 13 Orbelus                         <NA>    <NA>     <NA>     <NA>    
## 14 Neumeister                      <NA>    <NA>     <NA>     <NA>    
## 15 Bachelder                       <NA>    <NA>     <NA>     <NA>    
## 16 Neumeister                      <NA>    <NA>     <NA>     <NA>    
## 17 Stone Castle                    <NA>    <NA>     <NA>     <NA>    
## 18 Stone Castle                    <NA>    <NA>     <NA>     <NA>    
## 19 Neumeister                      <NA>    <NA>     <NA>     <NA>    
## 20 Chilcas                         <NA>    <NA>     <NA>     <NA>

Every record without a country also doesn’t have a province or regions, but they all have wineries.
Let’s see if there are any other records for these wineries where we might be able to infer the values.

Get list of wineries that have an NA for Country.

cp_na_wineries <- unique(cp_na$winery)

Extract all records for wineries that have at least one NA for country.

winery_xref <- wine_02 %>%
    filter(winery %in% cp_na_wineries) %>%
    select(winery, country, province, region_1, region_2, everything()) %>%
    arrange(winery)
head(winery_xref, 20)
## # A tibble: 20 x 14
##    winery  country province region_1 region_2     X1 description          
##    <chr>   <chr>   <chr>    <chr>    <chr>     <int> <chr>                
##  1 Amiran… <NA>    <NA>     <NA>     <NA>     119787 This deep red-violet…
##  2 Bachel… Canada  Ontario  St. Dav… <NA>      13062 Elegant and polished…
##  3 Bachel… <NA>    <NA>     <NA>     <NA>      38898 Scents of clover, st…
##  4 Büyülü… <NA>    <NA>     <NA>     <NA>     129590 A blend of 60% Syrah…
##  5 Chilcas Chile   Casabla… <NA>     <NA>       7132 Neutral up front exc…
##  6 Chilcas Chile   Maule V… <NA>     <NA>       7900 Largely neutral arom…
##  7 Chilcas Chile   Maule V… <NA>     <NA>       9139 The nose bears light…
##  8 Chilcas Chile   Lolol V… <NA>     <NA>      11832 Dusty raspberry and …
##  9 Chilcas Chile   Maule V… <NA>     <NA>      20253 Full and confident o…
## 10 Chilcas Chile   Casabla… <NA>     <NA>      20266 A solid blast of tro…
## 11 Chilcas Chile   Maule V… <NA>     <NA>      20712 Earthy, lightly herb…
## 12 Chilcas Chile   Maule V… <NA>     <NA>      25715 Smells like generic …
## 13 Chilcas Chile   Maule V… <NA>     <NA>      26253 A big step up from t…
## 14 Chilcas Chile   Colchag… <NA>     <NA>      26332 Plum, cherry and cas…
## 15 Chilcas Chile   Maule V… <NA>     <NA>      26333 Aromas of herbs, spi…
## 16 Chilcas Chile   Maule V… <NA>     <NA>      28525 Lightly stalky aroma…
## 17 Chilcas Chile   Maule V… <NA>     <NA>      30954 Floral and different…
## 18 Chilcas Chile   Central… <NA>     <NA>      34513 Smells like spiced a…
## 19 Chilcas Chile   Maule V… <NA>     <NA>      39756 Jammy blackberry and…
## 20 Chilcas Chile   Colchag… <NA>     <NA>      40692 Crushed mineral, her…
## # ... with 7 more variables: designation <chr>, points <int>, price <dbl>,
## #   taster_name <chr>, taster_twitter_handle <chr>, title <chr>,
## #   variety <chr>

Most of them could be reasonably inferred with a little work.
Since there are only 59 records with empty countries I’m going to filter them.

wine_03 <- wine_02 %>%
    filter(!is.na(country))

How many NA’s now?

colSums(is.na(wine_03))
##                    X1               country           description 
##                     0                     0                     0 
##           designation                points                 price 
##                 34768                     0                     0 
##              province              region_1              region_2 
##                     0                 19516                 70624 
##           taster_name taster_twitter_handle                 title 
##                 24496                 29416                     0 
##               variety                winery 
##                     1                     0

Getting better. This eliminated all the NA’s for provinces too.

Only one NA for variety. Let’s look at it.

variety_na <- wine_03 %>%
    filter(is.na(variety))
variety_na[, c(13, 4, 8,9,10,11)]
## # A tibble: 1 x 6
##   variety designation region_1 region_2 taster_name taster_twitter_handle
##   <chr>   <chr>       <chr>    <chr>    <chr>       <chr>                
## 1 <NA>    <NA>        <NA>     <NA>     <NA>        <NA>

This record has a lot of NA’s so I’m going to drop it too.

wine_04 <- wine_03 %>%
    filter(!is.na(variety))

Where is the year?

I noticed that the dataset doesn’t have a year, but I observed the year is in all of the titles.

title_sample <- sample_n(wine_04, 10) %>%
    select(title)
title_sample
## # A tibble: 10 x 1
##    title                                                                
##    <chr>                                                                
##  1 Hey Mambo 2007 Sultry Red Red (California)                           
##  2 Domaine Marcel Deiss 2009 Burlenberg Red (Alsace)                    
##  3 Cantine Ermes 2014 Marchese Montefusco Nero d'Avola (Terre Siciliane)
##  4 Boomtown 2014 Pinot Gris (Columbia Valley (WA))                      
##  5 Nicholls 2005 Cabernet Sauvignon (Napa Valley)                       
##  6 Lemelson 2014 Chestnut Hill Vineyard Pinot Noir (Chehalem Mountains) 
##  7 Villa Wolf 2012 Dry Riesling (Pfalz)                                 
##  8 Artesa 2007 Estate Reserve Pinot Noir (Carneros)                     
##  9 Domaine de Poulvarel 2009 Les Grès Red (Costières de Nîmes)          
## 10 Bluestone 2014 Estate Grown Chardonnay (Shenandoah Valley)

Add the vintage year to the dataset from the title data

Thanks to Manish Saraswat for helping me reduce 35 lines of code down to one.

wine_05 <- wine_04 %>%
    mutate(year = str_extract(string = title, pattern = '\\d{4}'))

How many reviews are assigned to each year?

These will be the records with a year of 1900.

table(wine_05$year)
## 
##  1000  1070  1150  1492  1607  1621  1637  1752  1789  1821  1827  1840 
##     2     1     2     6     1     1     2     1     1     4     2     5 
##  1845  1847  1848  1850  1852  1856  1860  1868  1870  1872  1875  1877 
##     3     1     6     3    40     3     7     6     3     1     5     7 
##  1882  1887  1898  1904  1919  1927  1929  1934  1935  1941  1945  1947 
##     4     2    19     1     1     2     6     1     1     1     1     1 
##  1952  1957  1961  1963  1964  1965  1966  1967  1968  1969  1973  1974 
##     2     1     1     3     2     1     1     1     1     1     1     1 
##  1976  1978  1980  1982  1983  1984  1985  1986  1987  1988  1989  1990 
##     1     2     2     1     2     2     4     4     2     5     6     5 
##  1991  1992  1993  1994  1995  1996  1997  1998  1999  2000  2001  2002 
##     4    14     3    23    44    64   297   540   618   734   668   332 
##  2003  2004  2005  2006  2007  2008  2009  2010  2011  2012  2013  2014 
##   498  1602  3290  5170  6488  6715  9049 11087 11413 14712 15167 14863 
##  2015  2016  2017  3000  7200 
##  9610  3542    11     1    14

After looking at the table I’ve decided that I only want vintages from 2000 to 2017

wine_06 <- wine_05 %>%
    filter(year >= 2000 & year <= 2017)

Now how many NA’s?

colSums(is.na(wine_06))
##                    X1               country           description 
##                     0                     0                     0 
##           designation                points                 price 
##                 33887                     0                     0 
##              province              region_1              region_2 
##                     0                 18831                 66376 
##           taster_name taster_twitter_handle                 title 
##                 22629                 27402                     0 
##               variety                winery                  year 
##                     0                     0                     0

I can work with this.

Assign a default value for NA’s where they exist

na_text <- "Not Available"
wine_07 <- wine_06 %>%
    rename(index = "X1") %>%
    select(index, year, points, price, description, everything()) %>%
    replace_na(list(designation = na_text, 
                    region_1 = na_text, 
                    region_2 = na_text, 
                    taster_name = na_text, 
                    taster_twitter_handle = na_text))

wine_chars <- wine_07[, 6:15]

Change all the character columns to factors except description

wine_chars[sapply(wine_chars, is.character)] <- lapply(wine_chars[sapply(wine_chars, is.character)], 
                                       as.factor)

Bind the factor columns to the first columns

wine_08 <- cbind(wine_07[, 1:5], wine_chars)

Add a column for cost per point, and re-arrange variables

wine_09 <- wine_08 %>%
    mutate(cost_per_point = price/points) %>%
    select(index, year, winery, points, price, cost_per_point, everything())

Boom!! Create a final object to do Exploratory Data Analysis

wine <- wine_09

END