This will be a multi-part series using the wine review dataset available at https://www.kaggle.com/zynicide/wine-reviews/data.
I focus on getting the data into the optimal format for further exploratory analysis in this part.
I am using the 130,000 record file because it has a few more columns than the 150,000 record version.
The end-product has 115,000 records with no NA’s.
Updated March 20, 2018 to make getting the year more efficient.
You only need tidyverse for this part. Unfortunately I just switched to a Macbook Pro and tidyverse is not installing so I load each individual package I need.
library(dplyr)
library(ggplot2)
library(stringr)
library(readr)
library(tidyr)
library(jsonlite)
wine_01 <- read_csv("winemag-data-130k-v2.csv")
sum(complete.cases(wine_01))
## [1] 22387
Only 22,387; not good since we started with 130k
colSums(is.na(wine_01))
## X1 country description
## 0 63 0
## designation points price
## 37465 0 8996
## province region_1 region_2
## 63 21247 79460
## taster_name taster_twitter_handle title
## 26244 31213 0
## variety winery
## 1 0
Yikes!! Let’s see if we can get our arms around this a bit.
Price is important and there are only about 9000 that don’t have a price. Let’s drop those rows and see how it impacts the other NA’s.
wine_02 <- wine_01 %>%
filter(!is.na(price))
colSums(is.na(wine_02))
## X1 country description
## 0 59 0
## designation points price
## 34779 0 0
## province region_1 region_2
## 59 19575 70683
## taster_name taster_twitter_handle title
## 24496 29416 0
## variety winery
## 1 0
It’s strange that country and province have the same number of NA’s. I’ll look at them to see if there is any connection.
cp_na <- wine_02 %>%
filter(is.na(country))
head(cp_na[, c(14,2,7,8,9)], 20)
## # A tibble: 20 x 5
## winery country province region_1 region_2
## <chr> <chr> <chr> <chr> <chr>
## 1 Gotsa Family Wines <NA> <NA> <NA> <NA>
## 2 Kakhetia Traditional Winemaking <NA> <NA> <NA> <NA>
## 3 Tsililis <NA> <NA> <NA> <NA>
## 4 Ross-idi <NA> <NA> <NA> <NA>
## 5 Orbelus <NA> <NA> <NA> <NA>
## 6 St. Donat <NA> <NA> <NA> <NA>
## 7 Familia Deicas <NA> <NA> <NA> <NA>
## 8 Tsililis <NA> <NA> <NA> <NA>
## 9 Stone Castle <NA> <NA> <NA> <NA>
## 10 Teliani Valley <NA> <NA> <NA> <NA>
## 11 Undurraga <NA> <NA> <NA> <NA>
## 12 Mt. Beautiful <NA> <NA> <NA> <NA>
## 13 Orbelus <NA> <NA> <NA> <NA>
## 14 Neumeister <NA> <NA> <NA> <NA>
## 15 Bachelder <NA> <NA> <NA> <NA>
## 16 Neumeister <NA> <NA> <NA> <NA>
## 17 Stone Castle <NA> <NA> <NA> <NA>
## 18 Stone Castle <NA> <NA> <NA> <NA>
## 19 Neumeister <NA> <NA> <NA> <NA>
## 20 Chilcas <NA> <NA> <NA> <NA>
Every record without a country also doesn’t have a province or regions, but they all have wineries.
Let’s see if there are any other records for these wineries where we might be able to infer the values.
cp_na_wineries <- unique(cp_na$winery)
winery_xref <- wine_02 %>%
filter(winery %in% cp_na_wineries) %>%
select(winery, country, province, region_1, region_2, everything()) %>%
arrange(winery)
head(winery_xref, 20)
## # A tibble: 20 x 14
## winery country province region_1 region_2 X1 description
## <chr> <chr> <chr> <chr> <chr> <int> <chr>
## 1 Amiran⦠<NA> <NA> <NA> <NA> 119787 This deep red-violetâ¦
## 2 Bachel⦠Canada Ontario St. Dav⦠<NA> 13062 Elegant and polishedâ¦
## 3 Bachel⦠<NA> <NA> <NA> <NA> 38898 Scents of clover, stâ¦
## 4 Büyülü⦠<NA> <NA> <NA> <NA> 129590 A blend of 60% Syrahâ¦
## 5 Chilcas Chile Casabla⦠<NA> <NA> 7132 Neutral up front excâ¦
## 6 Chilcas Chile Maule V⦠<NA> <NA> 7900 Largely neutral aromâ¦
## 7 Chilcas Chile Maule V⦠<NA> <NA> 9139 The nose bears lightâ¦
## 8 Chilcas Chile Lolol V⦠<NA> <NA> 11832 Dusty raspberry and â¦
## 9 Chilcas Chile Maule V⦠<NA> <NA> 20253 Full and confident oâ¦
## 10 Chilcas Chile Casabla⦠<NA> <NA> 20266 A solid blast of troâ¦
## 11 Chilcas Chile Maule V⦠<NA> <NA> 20712 Earthy, lightly herbâ¦
## 12 Chilcas Chile Maule V⦠<NA> <NA> 25715 Smells like generic â¦
## 13 Chilcas Chile Maule V⦠<NA> <NA> 26253 A big step up from tâ¦
## 14 Chilcas Chile Colchag⦠<NA> <NA> 26332 Plum, cherry and casâ¦
## 15 Chilcas Chile Maule V⦠<NA> <NA> 26333 Aromas of herbs, spiâ¦
## 16 Chilcas Chile Maule V⦠<NA> <NA> 28525 Lightly stalky aromaâ¦
## 17 Chilcas Chile Maule V⦠<NA> <NA> 30954 Floral and differentâ¦
## 18 Chilcas Chile Central⦠<NA> <NA> 34513 Smells like spiced aâ¦
## 19 Chilcas Chile Maule V⦠<NA> <NA> 39756 Jammy blackberry andâ¦
## 20 Chilcas Chile Colchag⦠<NA> <NA> 40692 Crushed mineral, herâ¦
## # ... with 7 more variables: designation <chr>, points <int>, price <dbl>,
## # taster_name <chr>, taster_twitter_handle <chr>, title <chr>,
## # variety <chr>
Most of them could be reasonably inferred with a little work.
Since there are only 59 records with empty countries I’m going to filter them.
wine_03 <- wine_02 %>%
filter(!is.na(country))
colSums(is.na(wine_03))
## X1 country description
## 0 0 0
## designation points price
## 34768 0 0
## province region_1 region_2
## 0 19516 70624
## taster_name taster_twitter_handle title
## 24496 29416 0
## variety winery
## 1 0
Getting better. This eliminated all the NA’s for provinces too.
variety_na <- wine_03 %>%
filter(is.na(variety))
variety_na[, c(13, 4, 8,9,10,11)]
## # A tibble: 1 x 6
## variety designation region_1 region_2 taster_name taster_twitter_handle
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 <NA> <NA> <NA> <NA> <NA> <NA>
This record has a lot of NA’s so I’m going to drop it too.
wine_04 <- wine_03 %>%
filter(!is.na(variety))
I noticed that the dataset doesn’t have a year, but I observed the year is in all of the titles.
title_sample <- sample_n(wine_04, 10) %>%
select(title)
title_sample
## # A tibble: 10 x 1
## title
## <chr>
## 1 Hey Mambo 2007 Sultry Red Red (California)
## 2 Domaine Marcel Deiss 2009 Burlenberg Red (Alsace)
## 3 Cantine Ermes 2014 Marchese Montefusco Nero d'Avola (Terre Siciliane)
## 4 Boomtown 2014 Pinot Gris (Columbia Valley (WA))
## 5 Nicholls 2005 Cabernet Sauvignon (Napa Valley)
## 6 Lemelson 2014 Chestnut Hill Vineyard Pinot Noir (Chehalem Mountains)
## 7 Villa Wolf 2012 Dry Riesling (Pfalz)
## 8 Artesa 2007 Estate Reserve Pinot Noir (Carneros)
## 9 Domaine de Poulvarel 2009 Les Grès Red (Costières de Nîmes)
## 10 Bluestone 2014 Estate Grown Chardonnay (Shenandoah Valley)
Thanks to Manish Saraswat for helping me reduce 35 lines of code down to one.
wine_05 <- wine_04 %>%
mutate(year = str_extract(string = title, pattern = '\\d{4}'))
These will be the records with a year of 1900.
table(wine_05$year)
##
## 1000 1070 1150 1492 1607 1621 1637 1752 1789 1821 1827 1840
## 2 1 2 6 1 1 2 1 1 4 2 5
## 1845 1847 1848 1850 1852 1856 1860 1868 1870 1872 1875 1877
## 3 1 6 3 40 3 7 6 3 1 5 7
## 1882 1887 1898 1904 1919 1927 1929 1934 1935 1941 1945 1947
## 4 2 19 1 1 2 6 1 1 1 1 1
## 1952 1957 1961 1963 1964 1965 1966 1967 1968 1969 1973 1974
## 2 1 1 3 2 1 1 1 1 1 1 1
## 1976 1978 1980 1982 1983 1984 1985 1986 1987 1988 1989 1990
## 1 2 2 1 2 2 4 4 2 5 6 5
## 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002
## 4 14 3 23 44 64 297 540 618 734 668 332
## 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014
## 498 1602 3290 5170 6488 6715 9049 11087 11413 14712 15167 14863
## 2015 2016 2017 3000 7200
## 9610 3542 11 1 14
After looking at the table I’ve decided that I only want vintages from 2000 to 2017
wine_06 <- wine_05 %>%
filter(year >= 2000 & year <= 2017)
colSums(is.na(wine_06))
## X1 country description
## 0 0 0
## designation points price
## 33887 0 0
## province region_1 region_2
## 0 18831 66376
## taster_name taster_twitter_handle title
## 22629 27402 0
## variety winery year
## 0 0 0
I can work with this.
na_text <- "Not Available"
wine_07 <- wine_06 %>%
rename(index = "X1") %>%
select(index, year, points, price, description, everything()) %>%
replace_na(list(designation = na_text,
region_1 = na_text,
region_2 = na_text,
taster_name = na_text,
taster_twitter_handle = na_text))
wine_chars <- wine_07[, 6:15]
wine_chars[sapply(wine_chars, is.character)] <- lapply(wine_chars[sapply(wine_chars, is.character)],
as.factor)
wine_08 <- cbind(wine_07[, 1:5], wine_chars)
wine_09 <- wine_08 %>%
mutate(cost_per_point = price/points) %>%
select(index, year, winery, points, price, cost_per_point, everything())
wine <- wine_09