30 11 2020

Overview

Which libraries?

This time these will be needed:

library(plotly)
library(dplyr)

Load data

and take a first look at it:

wines <- read.csv(file = './winemag-data-130k-v2.csv')
colnames(wines)
##  [1] "X"                     "country"               "description"          
##  [4] "designation"           "points"                "price"                
##  [7] "province"              "region_1"              "region_2"             
## [10] "taster_name"           "taster_twitter_handle" "title"                
## [13] "variety"               "winery"

I think country, points and price are suitable for visualization.

Countries distribution

will look good as a histogram. Here are the top ones:
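The code behind this slide is not shown, so the following is only a minimal sketch of how the top countries could be counted and drawn. The helper `top_countries`, the cut-off of `n` countries and the `toy` data frame (made-up rows standing in for `wines`) are assumptions for illustration.

```r
library(dplyr)
library(plotly)

# Count reviews per country and keep the n most frequent countries.
# Rows with a missing or empty country are skipped.
top_countries <- function(df, n = 10) {
  df %>%
    filter(!is.na(country), country != "") %>%
    count(country, name = "reviews", sort = TRUE) %>%
    head(n)
}

# Toy stand-in for `wines` (made-up rows, for illustration only).
toy <- data.frame(
  country = c("US", "US", "France", "Italy", "France", "US", "")
)
top <- top_countries(toy, n = 2)

# Bar chart of the counts, ordered from most to least frequent.
fig <- plot_ly(top, x = ~reorder(country, -reviews), y = ~reviews,
               type = "bar") %>%
  layout(xaxis = list(title = "Country"),
         yaxis = list(title = "Reviews"))
```

On the real data the call would be `top_countries(wines)`, and printing `fig` renders the chart.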

Price distribution

for selected countries will be a boxplot:
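A boxplot like this one could be built roughly as follows. This is a sketch, not the original chunk: the helper names, the choice of countries and the `toy` rows are hypothetical, and rows without a price are dropped before plotting.

```r
library(dplyr)
library(plotly)

# Keep only priced rows for the chosen countries.
select_prices <- function(df, countries) {
  df %>% filter(country %in% countries, !is.na(price))
}

# One box per country, price on the y axis.
price_boxplot <- function(df) {
  plot_ly(df, x = ~country, y = ~price, type = "box") %>%
    layout(xaxis = list(title = "Country"),
           yaxis = list(title = "Price, $"))
}

# Toy stand-in for `wines` (made-up rows, for illustration only).
toy <- data.frame(
  country = c("France", "France", "Italy", "Chile", "Italy"),
  price   = c(30, NA, 18, 12, 25)
)
selected <- select_prices(toy, c("France", "Italy"))
fig <- price_boxplot(selected)
```

With the real data, `select_prices(wines, c(...))` would take whichever countries are of interest.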

There are several outliers

which make the boxplot less informative, so I will list their producers separately:

##  [1] "Chateau Margaux"               "Chateau La Mission Haut-Brion"
##  [3] "Chateau Haut-Brion"            "Chateau Mouton Rothschild"    
##  [5] "Chateau Petrus"                "Chateau les Ormes Sorbet"     
##  [7] "Emmerich Knoll"                "Domaine du Comte Liger-Belair"
##  [9] "Chateau Lafite Rothschild"     "Chateau Cheval Blanc"         
## [11] "Blair"
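A list of producers like the one above could be extracted with a simple price filter. The threshold actually used is not shown, so `cutoff` below is a hypothetical parameter, and `outlier_wineries` and the `toy` rows are made up for illustration.

```r
library(dplyr)

# Wineries that have at least one wine priced above `cutoff`.
# `cutoff` is a hypothetical parameter; the original threshold is unknown.
outlier_wineries <- function(df, cutoff) {
  df %>%
    filter(!is.na(price), price > cutoff) %>%
    distinct(winery) %>%
    pull(winery)
}

# Toy stand-in for `wines` (made-up rows, for illustration only).
toy <- data.frame(
  winery = c("A", "B", "A", "C"),
  price  = c(20, 900, 1500, NA)
)
expensive <- outlier_wineries(toy, cutoff = 500)
```

On the real data this would be `outlier_wineries(wines, cutoff = ...)` with a cutoff chosen by looking at the boxplot.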

Price distribution again,

now without outliers:
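One way to get this cleaner plot is to drop every wine from the outlier producers before drawing the boxplot again. Whether the original filtered by producer or by price is not shown, so treat this as an assumption. The helper `drop_wineries` and the `toy` rows are hypothetical.

```r
library(dplyr)
library(plotly)

# Remove all wines made by the given wineries.
drop_wineries <- function(df, wineries) {
  df %>% filter(!winery %in% wineries)
}

# Toy stand-in for `wines` (made-up rows, for illustration only).
toy <- data.frame(
  country = c("France", "France", "Italy", "Italy"),
  winery  = c("Expensive Estate", "Small Cellar", "Hill Farm", "Hill Farm"),
  price   = c(2500, 30, 18, 25)
)
trimmed <- drop_wineries(toy, c("Expensive Estate"))

# Same boxplot as before, now on the trimmed data.
fig <- plot_ly(trimmed, x = ~country, y = ~price, type = "box") %>%
  layout(yaxis = list(title = "Price, $"))
```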

What’s going on?

Overall median price in this dataset is:

median(wines$price, na.rm = TRUE)
## [1] 25

$25? So this is the scale the boxplot needs to be clearer. And what about points?

Points are here,

and this will be a boxplot again. The median goes first:

median(wines$points)
## [1] 88

Points visualization - points

Points visualization - price

Conclusions:

  1. New World wines are mostly cheaper.
  2. Points do not depend on Old or New World: there are highly rated New World wines (for example, Australia has a max of 100) and low-rated Old World wines (min of 80 for Italy).
  3. Your favorite wine may not be included in this dataset, even though it has more than 125k rows.
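The figures quoted in the conclusions (minimum and maximum points, typical price) can be checked with a per-country summary. This is a sketch only: `country_summary` is a hypothetical helper, and the `toy` rows are made up to show the shape of the result, not taken from the dataset.

```r
library(dplyr)

# Per-country range of points and median price (missing prices ignored).
country_summary <- function(df) {
  df %>%
    group_by(country) %>%
    summarise(
      min_points   = min(points),
      max_points   = max(points),
      median_price = median(price, na.rm = TRUE),
      .groups = "drop"
    )
}

# Toy stand-in for `wines` (made-up rows, for illustration only).
toy <- data.frame(
  country = c("Australia", "Australia", "Italy", "Italy"),
  points  = c(85, 100, 80, 95),
  price   = c(20, 300, 15, NA)
)
summary_tbl <- country_summary(toy)
```

Calling `country_summary(wines)` on the real data gives one row per country to compare against the statements above.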