For this project of DATA 608 I will be using the kaggle dataset beer_reviews.csv. The dataset has 1.5 million beer reviews , with ratings for appearance, aroma, palate, taste, and overall impression.
What are the factors that makes a beer favorite among the beer drinkers? How does the features like - beer taste, aroma, appearance, palate, review-time, beer style affects it’s overall rating.
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
##
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
##
## group_rows
## ── Attaching packages ────────────────────────────────────── tidymodels 1.0.0 ──
## ✔ broom 1.0.1 ✔ recipes 1.0.1
## ✔ dials 1.0.0 ✔ rsample 1.1.0
## ✔ infer 1.0.3 ✔ tune 1.0.0
## ✔ modeldata 1.0.1 ✔ workflows 1.1.0
## ✔ parsnip 1.0.2 ✔ workflowsets 1.0.0
## ✔ purrr 0.3.4 ✔ yardstick 1.1.0
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ purrr::discard() masks scales::discard()
## ✖ dplyr::filter() masks stats::filter()
## ✖ kableExtra::group_rows() masks dplyr::group_rows()
## ✖ dplyr::lag() masks stats::lag()
## ✖ infer::observe() masks shiny::observe()
## ✖ recipes::step() masks stats::step()
## • Use suppressPackageStartupMessages() to eliminate package startup messages
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
## Loading required package: gsubfn
## Loading required package: proto
## Loading required package: RSQLite
##
## Attaching package: 'rsconnect'
## The following object is masked from 'package:shiny':
##
## serverInfo
brewery_name: The name of the brewery that made the beer. (String)review_time: The date and time of the review. (String)review_overall: The reviewer’s overall rating of the beer on a scale of 1 to 5. (Float)review_aroma: The reviewer’s rating of the beer’s aroma on a scale of 1 to 5. (Float)review_appearance: The reviewer’s rating of the beer’s appearance on a scale of 1 to 5. (Float)review_profilename: The reviewer’s username. (String)beer_style: The style of beer. (String)review_palate: The reviewer’s rating of the beer’s palate on a scale of 1 to 5. (Float)review_taste: The reviewer’s rating of the beer’s taste on a scale of 1 to 5. (Float)beer_name: The name of the beer. (String)beer_abv: The alcohol by volume of the beer. (Float)I have downloaded the dataset from kaggale and placed in my local projects folder.
df_beers <- read.csv('D:\\MSProjects\\608\\beer_reviews.csv', stringsAsFactors=T)
The whole dataset consists of 1586614 observations and 14 variables.
dim(df_beers)
## [1] 1586614 14
summary(df_beers)
## index brewery_id brewery_name
## Min. : 0 Min. : 1 Boston Beer Company (Samuel Adams): 39444
## 1st Qu.: 396653 1st Qu.: 143 Dogfish Head Brewery : 33839
## Median : 793307 Median : 429 Stone Brewing Co. : 33066
## Mean : 793307 Mean : 3130 Sierra Nevada Brewing Co. : 28751
## 3rd Qu.:1189960 3rd Qu.: 2372 Bell's Brewery, Inc. : 25191
## Max. :1586613 Max. :28003 Rogue Ales : 24083
## (Other) :1402240
## review_time review_overall review_aroma review_appearance
## Min. :8.407e+08 Min. :0.000 Min. :1.000 Min. :0.000
## 1st Qu.:1.173e+09 1st Qu.:3.500 1st Qu.:3.500 1st Qu.:3.500
## Median :1.239e+09 Median :4.000 Median :4.000 Median :4.000
## Mean :1.224e+09 Mean :3.816 Mean :3.736 Mean :3.842
## 3rd Qu.:1.289e+09 3rd Qu.:4.500 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :1.326e+09 Max. :5.000 Max. :5.000 Max. :5.000
##
## review_profilename beer_style
## northyorksammy: 5817 American IPA : 117586
## BuckeyeNation : 4661 American Double / Imperial IPA : 85977
## mikesgroove : 4617 American Pale Ale (APA) : 63469
## Thorpe429 : 3518 Russian Imperial Stout : 54129
## womencantsail : 3497 American Double / Imperial Stout: 50705
## NeroFiddled : 3488 American Porter : 50477
## (Other) :1561016 (Other) :1164271
## review_palate review_taste beer_name
## Min. :1.000 Min. :1.000 90 Minute IPA : 3290
## 1st Qu.:3.500 1st Qu.:3.500 India Pale Ale : 3130
## Median :4.000 Median :4.000 Old Rasputin Russian Imperial Stout: 3111
## Mean :3.744 Mean :3.793 Sierra Nevada Celebration Ale : 3000
## 3rd Qu.:4.000 3rd Qu.:4.500 Two Hearted Ale : 2728
## Max. :5.000 Max. :5.000 Arrogant Bastard Ale : 2704
## (Other) :1568651
## beer_abv beer_beerid
## Min. : 0.01 Min. : 3
## 1st Qu.: 5.20 1st Qu.: 1717
## Median : 6.50 Median :13906
## Mean : 7.04 Mean :21713
## 3rd Qu.: 8.50 3rd Qu.:39441
## Max. :57.70 Max. :77317
## NA's :67785
We see from the summary of the dataset beer_abv has 67785 NA values. Since the dataset if large we will just drop all the NA records from our further analysis.
The new cleaned dataset has 1518829 observations.
df_cleaned <- na.omit(df_beers)
dim(df_cleaned)
## [1] 1518829 14
Next, we also exclude the column index its just a sequential number which is of no use for our analysis.
df_cleaned <- df_cleaned[,-1]
Column beer_style has 2 values so this column will be split into 2 as beer_style1 and beer_style2 and beer_style column will be dropped from the dataset.
df_cleaned <- df_cleaned %>% separate(beer_style, c("beer_style1","beer_style2"), extra="drop",fill = "right" )
unique(as.factor(df_cleaned$beer_style1))
## [1] Hefeweizen English Foreign German American
## [6] Herbed Light Oatmeal Rauchbier Belgian
## [11] Russian M Euro Fruit Irish
## [16] Doppelbock Maibock Dortmunder Low Extra
## [21] Bock Altbier K Pumpkin Rye
## [26] Milk Schwarzbier Munich Vienna Scottish
## [31] Witbier Saison California Scotch Tripel
## [36] Flanders Smoked Dubbel Dunkelweizen Keller
## [41] Winter Bi Chile Czech Cream
## [46] Black Weizenbock Wheatwine Kristalweizen Baltic
## [51] Old Quadrupel Braggot Lambic Eisbock
## [56] Berliner Kvass Roggenbier Faro Gueuze
## [61] Gose Japanese Happoshu Sahti
## 64 Levels: Altbier American Baltic Belgian Berliner Bi Black Bock ... Witbier
head(df_cleaned,10) %>% kable() %>% kable_styling()
| brewery_id | brewery_name | review_time | review_overall | review_aroma | review_appearance | review_profilename | beer_style1 | beer_style2 | review_palate | review_taste | beer_name | beer_abv | beer_beerid |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10325 | Vecchio Birraio | 1234817823 | 1.5 | 2.0 | 2.5 | stcules | Hefeweizen | NA | 1.5 | 1.5 | Sausa Weizen | 5.0 | 47986 |
| 10325 | Vecchio Birraio | 1235915097 | 3.0 | 2.5 | 3.0 | stcules | English | Strong | 3.0 | 3.0 | Red Moon | 6.2 | 48213 |
| 10325 | Vecchio Birraio | 1235916604 | 3.0 | 2.5 | 3.0 | stcules | Foreign | Export | 3.0 | 3.0 | Black Horse Black Beer | 6.5 | 48215 |
| 10325 | Vecchio Birraio | 1234725145 | 3.0 | 3.0 | 3.5 | stcules | German | Pilsener | 2.5 | 3.0 | Sausa Pils | 5.0 | 47969 |
| 1075 | Caldera Brewing Company | 1293735206 | 4.0 | 4.5 | 4.0 | johnmichaelsen | American | Double | 4.0 | 4.5 | Cauldron DIPA | 7.7 | 64883 |
| 1075 | Caldera Brewing Company | 1325524659 | 3.0 | 3.5 | 3.5 | oline73 | Herbed | Spiced | 3.0 | 3.5 | Caldera Ginger Beer | 4.7 | 52159 |
| 1075 | Caldera Brewing Company | 1318991115 | 3.5 | 3.5 | 3.5 | Reidrover | Herbed | Spiced | 4.0 | 4.0 | Caldera Ginger Beer | 4.7 | 52159 |
| 1075 | Caldera Brewing Company | 1306276018 | 3.0 | 2.5 | 3.5 | alpinebryant | Herbed | Spiced | 2.0 | 3.5 | Caldera Ginger Beer | 4.7 | 52159 |
| 1075 | Caldera Brewing Company | 1290454503 | 4.0 | 3.0 | 3.5 | LordAdmNelson | Herbed | Spiced | 3.5 | 4.0 | Caldera Ginger Beer | 4.7 | 52159 |
| 1075 | Caldera Brewing Company | 1285632924 | 4.5 | 3.5 | 5.0 | augustgarage | Herbed | Spiced | 4.0 | 4.0 | Caldera Ginger Beer | 4.7 | 52159 |
I will split the dataset into 70% training and 30% testing dataset as df_train and df_test respectively.
set.seed(3)
df_split <- initial_split(df_cleaned, prop = 0.7)
df_train <- training(df_split)
df_test <- testing(df_split)
Our training dataset has 1063180 observations.
summary(df_train)
## brewery_id brewery_name
## Min. : 1 Boston Beer Company (Samuel Adams): 27036
## 1st Qu.: 141 Dogfish Head Brewery : 23633
## Median : 417 Stone Brewing Co. : 23022
## Mean : 3072 Sierra Nevada Brewing Co. : 20052
## 3rd Qu.: 2281 Bell's Brewery, Inc. : 17539
## Max. :28003 Rogue Ales : 16338
## (Other) :935560
## review_time review_overall review_aroma review_appearance
## Min. :8.846e+08 Min. :0.000 Min. :1.000 Min. :0.00
## 1st Qu.:1.175e+09 1st Qu.:3.500 1st Qu.:3.500 1st Qu.:3.50
## Median :1.240e+09 Median :4.000 Median :4.000 Median :4.00
## Mean :1.225e+09 Mean :3.824 Mean :3.747 Mean :3.85
## 3rd Qu.:1.289e+09 3rd Qu.:4.500 3rd Qu.:4.000 3rd Qu.:4.00
## Max. :1.326e+09 Max. :5.000 Max. :5.000 Max. :5.00
##
## review_profilename beer_style1 beer_style2 review_palate
## northyorksammy: 3751 Length:1063180 Length:1063180 Min. :1.000
## mikesgroove : 3022 Class :character Class :character 1st Qu.:3.500
## BuckeyeNation : 2997 Mode :character Mode :character Median :4.000
## Thorpe429 : 2350 Mean :3.754
## brentk56 : 2269 3rd Qu.:4.000
## womencantsail : 2245 Max. :5.000
## (Other) :1046546
## review_taste beer_name beer_abv
## Min. :1.000 90 Minute IPA : 2279 Min. : 0.010
## 1st Qu.:3.500 Old Rasputin Russian Imperial Stout: 2160 1st Qu.: 5.200
## Median :4.000 Sierra Nevada Celebration Ale : 2098 Median : 6.500
## Mean :3.804 India Pale Ale : 2063 Mean : 7.042
## 3rd Qu.:4.500 Two Hearted Ale : 1969 3rd Qu.: 8.500
## Max. :5.000 Arrogant Bastard Ale : 1922 Max. :57.700
## (Other) :1050689
## beer_beerid
## Min. : 5
## 1st Qu.: 1655
## Median :12804
## Mean :21404
## 3rd Qu.:39238
## Max. :77316
##
dim(df_train)
## [1] 1063180 14
In the below scatter plots of brewery_id versus review_overall and beer_abv versus review_overall, we do not see any relationship.
plot(df_train$brewery_id, df_train$review_overall, main="Brewery/ Ratings",
xlab="Brewery ", ylab="Rating ", pch=19)
plot(df_train$beer_abv, df_train$review_overall, main="Beer abv/ Ratings",
xlab="Beer abv ", ylab="Rating ", pch=19)
Next, I plotted the correlation matrix for all the features and we see a positive coorelation among the partial review factors.
I build a linear regression model by choosing the significant factors and we see the model performs well with high Fstat, smaller residual error and significant p-value. Residual histogram is almost normally distributed, qq plot has most of the point falling under the line with some skewing at both edges
df_train <- dplyr::select_if(df_train,is.numeric)
model1 <- lm(review_overall ~ ., df_train)
(model1_summary <- summary(model1))
##
## Call:
## lm(formula = review_overall ~ ., data = df_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.7312 -0.2415 -0.0030 0.2469 4.0280
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.457e-01 7.511e-03 59.342 <2e-16 ***
## brewery_id 1.095e-07 7.962e-08 1.375 0.169
## review_time 6.790e-11 5.950e-12 11.412 <2e-16 ***
## review_aroma 7.700e-02 8.760e-04 87.896 <2e-16 ***
## review_appearance 4.769e-02 8.400e-04 56.775 <2e-16 ***
## review_palate 2.700e-01 9.112e-04 296.301 <2e-16 ***
## review_taste 5.538e-01 9.308e-04 595.023 <2e-16 ***
## beer_abv -4.117e-02 1.878e-04 -219.221 <2e-16 ***
## beer_beerid -3.553e-07 2.252e-08 -15.774 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4126 on 1063171 degrees of freedom
## Multiple R-squared: 0.6692, Adjusted R-squared: 0.6692
## F-statistic: 2.689e+05 on 8 and 1063171 DF, p-value: < 2.2e-16
residualPlots(model1_summary)
Review time, aroma, appearance, palate and taste form the major factors for the overall rating of a beer.
App1 - Beer aroma / Overall rating by brewery name
Q1<-sqldf("
select
review_overall, review_time, review_aroma ,brewery_name
from
(
select
review_overall, review_time, review_aroma ,brewery_name
from df_beers where brewery_name <>''
group by brewery_name
) group by 4")
ui <- fluidPage(
sidebarPanel(
selectInput(inputId = "Brewery", label = "Brewery name:",
choices = levels(as.factor(unique(Q1$brewery_name)))
),
helpText("Beer aroma / Overall rating by brewery name"),
width = "auto"
),
plotOutput("plot1")
)
## Warning: The select input "Brewery" contains a large number of options; consider
## using server-side selectize for massively improved performance. See the Details
## section of the ?selectizeInput help topic.
server<- function(input, output) {
output$plot1 <-renderPlot({
SelectedCause <- input$Brewery
ggplot(data=Q1[Q1$brewery_name == SelectedCause,]
, aes(x = review_aroma, y = review_overall)) +
labs(x="review_aroma", y="review_overall",
title = "review_aroma/review_overall",
#subtitle = paste("Caused by", SelectedCause)
)+
geom_bar(stat="identity", fill="steelblue") +
#geom_hline(aes(yintercept = mean(review_overall, na.rm = TRUE), linetype = "National Average"), col="red", lwd=1) +
scale_linetype(name = NULL) +
theme_bw()
})
}
shinyApp(ui = ui, server = server)
## PhantomJS not found. You can install it with webshot::install_phantomjs(). If it is installed, please make sure the phantomjs executable can be found via the PATH variable.
APP2 : Review appearance / Overall rating by brewery name and Beer style
Q2<-sqldf("
select
review_overall, review_time, review_aroma ,brewery_name, beer_style,review_appearance
from
(
select
review_overall, review_time, review_aroma ,brewery_name, beer_style,review_appearance
from df_beers where brewery_name <>''
group by brewery_name
) group by 4")
ui <- fluidPage(
sidebarPanel(
selectInput(inputId = "Brewery", label = "Brewery name:",
choices = levels(as.factor(unique(Q2$brewery_name)))
),
selectInput(inputId = "BeerStyle", label = "Beer Style:",
choices = levels(as.factor(unique(Q2$beer_style)))
),
helpText("Review appearance / Overall rating by brewery name and Beer style"),
width = "auto"
),
plotOutput("plot2")
)
## Warning: The select input "Brewery" contains a large number of options; consider
## using server-side selectize for massively improved performance. See the Details
## section of the ?selectizeInput help topic.
server<- function(input, output) {
output$plot2 <-renderPlot({
SelectedCause <- input$Brewery
SelectedState <- input$BeerStyle
ggplot(data = Q2[Q2$brewery_name == SelectedCause & Q2$beer_style == SelectedState,]) +
geom_bar(aes(x = review_appearance, weight = review_overall), fill = "steelblue") +
labs(x="Review appearance", y = "Review overall",
title = "Review appearance / Overall rating by brewery name and Beer style"
# ,subtitle = paste("Caused by", SelectedCause, "in", SelectedState)
) +
#geom_line(aes(x = Year, y = N.Crude.Rate, linetype = "National Average"), col = "red", lwd = 1) +
scale_linetype(name = NULL) +
theme_bw()
ggplot(data=Q1[Q1$brewery_name == SelectedCause,]
, aes(x = review_aroma, y = review_overall)) +
labs(x="review_aroma", y="review_overall",
title = "review_aroma/review_overall",
#subtitle = paste("Caused by", SelectedCause)
)+
geom_bar(stat="identity", fill="steelblue") +
#geom_hline(aes(yintercept = mean(review_overall, na.rm = TRUE), linetype = "National Average"), col="red", lwd=1) +
scale_linetype(name = NULL) +
theme_bw()
})
}
shinyApp(ui = ui, server = server)