Term Project: Draft

Import data
Description of the data and definition of variables
Visualize data
Correlation and regression analysis
Share interesting stories you found from the data
Hide the messages, but display the code and its results on the webpage.
List names of all group members (both first and last name) at the top of the webpage.
Use the correct slug.

Import data

Hint: You can choose any data you like but can’t take one that is already taken by other groups.

library(tidyverse)
wine_ratings <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-05-28/winemag-data-130k-v2.csv") %>%
  # lump together least common factor levels
  mutate(variety = fct_lump(variety, 7)) %>%
  # filter out "Other"
  filter(variety != "Other")

# Explain the boxplots. What is the top-rated variety? What about the outliers? Which variety seems to have more consistent ratings than the others?

wine_ratings %>%
  mutate(variety = fct_reorder(variety, points, na.rm = TRUE)) %>%
  ggplot(aes(variety, points, fill = variety)) +
  geom_boxplot() +
  coord_flip() +
  theme(legend.position = "none") +
  labs(title = "Wine ratings by variety",
       subtitle = "Seven most common varieties",
       x = NULL,
       y = NULL)


# The intercept is the predicted rating for the baseline variety, Bordeaux-style Red Blend, when log2(price) = 0, i.e., at a price of 1 (not free).
# A Bordeaux-style Red Blend at that price is expected to score about 78 points on the 100-point scale.
# Points increase by about 2 every time the price doubles.
# Some varieties (e.g., Cabernet Sauvignon) have a negative effect on points relative to the baseline, while others (e.g., Riesling) have a positive effect. Which variety would you produce if you ran a winery and had the choice?
wine_ratings %>%
  lm(points ~ log2(price) + variety, data = .) %>%
  summary()
## 
## Call:
## lm(formula = points ~ log2(price) + variety, data = .)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.8406  -1.5161   0.1576   1.7000   9.2261 
## 
## Coefficients:
##                           Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)               78.15191    0.06455 1210.695  < 2e-16 ***
## log2(price)                2.11226    0.01104  191.377  < 2e-16 ***
## varietyCabernet Sauvignon -0.45211    0.04111  -10.999  < 2e-16 ***
## varietyChardonnay          0.09023    0.04004    2.254   0.0242 *  
## varietyPinot Noir          0.03115    0.03917    0.795   0.4265    
## varietyRed Blend          -0.03726    0.04192   -0.889   0.3741    
## varietyRiesling            1.50753    0.04745   31.773  < 2e-16 ***
## varietySauvignon Blanc     0.37196    0.04859    7.655 1.96e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.397 on 56816 degrees of freedom
##   (3690 observations deleted due to missingness)
## Multiple R-squared:  0.4136, Adjusted R-squared:  0.4136 
## F-statistic:  5726 on 7 and 56816 DF,  p-value: < 2.2e-16

Description of the data and definition of variables

The data contains close to 130,000 wine reviews, one review per row. The points variable is the wine's rating on a scale of 1-100. The data covers more than 700 different varieties of wine from more than 70 countries. Other variables include the winery where the wine was produced, the taster's name, and the price of the wine.

The box plots are sorted in descending order of median rating, so Riesling is the top variety by median, just ahead of Pinot Noir. The interquartile range (the width of each box) shows how consistent the ratings are for the middle half of the wines. Red Blend, with a narrower box, has ratings that are more consistent than the others.
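The medians and interquartile ranges read off the box plots can also be computed directly. The following is a minimal sketch in base R; the helper name `spread_by_variety` and the small `toy` data frame are made up for illustration and are not part of the wine data. Assuming `wine_ratings` from the import step is in memory, the equivalent call would be `spread_by_variety(wine_ratings)`.

```r
# Hypothetical helper: median and interquartile range of points per variety.
spread_by_variety <- function(df) {
  med <- tapply(df$points, df$variety, median, na.rm = TRUE)
  iqr <- tapply(df$points, df$variety, IQR, na.rm = TRUE)
  out <- data.frame(variety = names(med),
                    median_points = as.numeric(med),
                    iqr_points = as.numeric(iqr))
  # Drop empty groups (e.g., an unused "Other" factor level)
  out <- out[!is.na(out$median_points), ]
  # Highest median first
  out[order(-out$median_points), ]
}

# Toy values for illustration only (not taken from the wine data)
toy <- data.frame(
  variety = c("A", "A", "A", "B", "B", "B"),
  points  = c(88, 90, 92, 85, 86, 87)
)
spread_by_variety(toy)
```

A smaller `iqr_points` value corresponds to a narrower box, i.e., more consistent ratings for the middle half of that variety's wines.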

The intercept represents the predicted rating for the baseline variety, Bordeaux-style Red Blend, when log2(price) is 0, that is, at a price of 1 rather than free. In plainer terms, a Bordeaux-style Red Blend at that price is expected to score about 78 points on the 100-point scale. The coefficient on log2(price) is about 2, which means the rating increases by about 2 points every time the price doubles. Some varieties have a negative effect on points relative to the baseline (Cabernet Sauvignon), while others have a positive effect (Riesling). If we ran a winery and could choose which wine to produce, we would choose Riesling because, at a given price, it receives the highest ratings of the seven varieties.
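To make this interpretation concrete, the sketch below plugs coefficients from the first regression's summary() output into the regression equation by hand. The function name `predict_points` and the example prices of 20 and 40 are assumptions chosen for illustration; with the fitted model object available, `predict()` would do the same job.

```r
# Coefficients copied from the summary() output of points ~ log2(price) + variety
intercept  <- 78.15191  # Bordeaux-style Red Blend at log2(price) = 0, i.e., price = 1
slope_log2 <- 2.11226   # change in points per doubling of price
riesling   <- 1.50753   # Riesling offset relative to the baseline variety

# Hypothetical helper: predicted points for a given price and variety offset
predict_points <- function(price, variety_offset = 0) {
  intercept + slope_log2 * log2(price) + variety_offset
}

# Baseline variety at price 1: equals the intercept
predict_points(1)

# Riesling at price 20
predict_points(20, riesling)

# Gain from doubling the price: equals the log2(price) coefficient
predict_points(40, riesling) - predict_points(20, riesling)
```

Because the price enters on a log2 scale, the gain from doubling is the same at every price level, whether the price goes from 10 to 20 or from 40 to 80.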

Visualize data

Hint: Create at least two plots.

library(ggplot2)

ggplot(data = wine_ratings,
       mapping = aes(x = points, y = price)) +
  geom_point()

ggplot(wine_ratings, aes(x = points)) +
  geom_bar()


Correlation and regression analysis

# wine_ratings is already in memory from the import step, so no data() call is needed

# select numeric variables
df <- dplyr::select_if(wine_ratings, is.numeric)

# calculate the correlations
r <- cor(df, use = "complete.obs")
round(r,2)
##          X1 points price
## X1     1.00   0.01   0.0
## points 0.01   1.00   0.4
## price  0.00   0.40   1.0

# simple linear regression of rating on price
wine_lm <- lm(points ~ price,
              data = wine_ratings)
summary(wine_lm)
## 
## Call:
## lm(formula = points ~ price, data = wine_ratings)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -85.684  -1.964  -0.042   2.046  10.202 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 8.765e+01  1.556e-02  5634.6   <2e-16 ***
## price       2.607e-02  2.491e-04   104.7   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.866 on 56822 degrees of freedom
##   (3690 observations deleted due to missingness)
## Multiple R-squared:  0.1616, Adjusted R-squared:  0.1616 
## F-statistic: 1.096e+04 on 1 and 56822 DF,  p-value: < 2.2e-16

Term Project: Draft

Nathan Barnhart, Tom Albert, Riley Williams
