Beer Profile and Ratings Analysis

About the Data:

The Beer Profile and Ratings dataset from Kaggle was used for the project. The main data set (beer_profile_and_ratings.csv) contains the following columns:

(General)
• Name: Beer name (label)
• Style: Beer style
• Brewery: Brewery name
• Beer Name: Complete beer name (Brewery + Brew Name)
• Description: Notes on the beer if available
• ABV: Alcohol content of beer (% by volume)
• Min IBU: The minimum IBU value each beer can possess
• Max IBU: The maximum IBU value each beer can possess

(Mouth feel)
• Astringency
• Body
• Alcohol

(Taste)
• Bitter
• Sweet
• Sour
• Salty

(Flavor And Aroma)
• Fruits
• Hoppy
• Spices
• Malty

(Reviews)
• review_aroma
• review_appearance
• review_palate
• review_taste
• review_overall
• number_of_reviews

Loading the libraries

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(reshape2)
## 
## Attaching package: 'reshape2'
## 
## The following object is masked from 'package:tidyr':
## 
##     smiths
library(dplyr)
library(gridExtra)
## 
## Attaching package: 'gridExtra'
## 
## The following object is masked from 'package:dplyr':
## 
##     combine
library(boot)
library(pwr)

library(forecast)
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
library(xts)
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## 
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## 
## 
## ######################### Warning from 'xts' package ##########################
## #                                                                             #
## # The dplyr lag() function breaks how base R's lag() function is supposed to  #
## # work, which breaks lag(my_xts). Calls to lag(my_xts) that you type or       #
## # source() into this session won't work correctly.                            #
## #                                                                             #
## # Use stats::lag() to make sure you're not using dplyr::lag(), or you can add #
## # conflictRules('dplyr', exclude = 'lag') to your .Rprofile to stop           #
## # dplyr from breaking base R's lag() function.                                #
## #                                                                             #
## # Code in packages is not affected. It's protected by R's namespace mechanism #
## # Set `options(xts.warn_dplyr_breaks_lag = FALSE)` to suppress this warning.  #
## #                                                                             #
## ###############################################################################
## 
## Attaching package: 'xts'
## 
## The following objects are masked from 'package:dplyr':
## 
##     first, last
library(tsibble)
## 
## Attaching package: 'tsibble'
## 
## The following object is masked from 'package:zoo':
## 
##     index
## 
## The following object is masked from 'package:lubridate':
## 
##     interval
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, union
library(pageviews)
library(WikipediR)
beers <- read.csv("/Users/bhavyakalra/Desktop/Stats R/Final_project_beer/beer_profile_and_ratings.csv")

str(beers)
## 'data.frame':    3197 obs. of  25 variables:
##  $ Name             : chr  "Amber" "Double Bag" "Long Trail Ale" "Doppelsticke" ...
##  $ Style            : chr  "Altbier" "Altbier" "Altbier" "Altbier" ...
##  $ Brewery          : chr  "Alaskan Brewing Co." "Long Trail Brewing Co." "Long Trail Brewing Co." "Uerige Obergärige Hausbrauerei GmbH / Zum Uerige" ...
##  $ Beer.Name..Full. : chr  "Alaskan Brewing Co. Alaskan Amber" "Long Trail Brewing Co. Double Bag" "Long Trail Brewing Co. Long Trail Ale" "Uerige Obergärige Hausbrauerei GmbH / Zum Uerige Uerige Doppelsticke" ...
##  $ Description      : chr  "Notes:Richly malty and long on the palate, with just enough hop backing to make this beautiful amber colored \""| __truncated__ "Notes:This malty, full-bodied double alt is also known as “Stickebier” – German slang for “secret brew”. Long T"| __truncated__ "Notes:Long Trail Ale is a full-bodied amber ale modeled after the “Alt-biers” of Düsseldorf, Germany. Our top f"| __truncated__ "Notes:" ...
##  $ ABV              : num  5.3 7.2 5 8.5 7.2 6 5.3 5 4.8 5.1 ...
##  $ Min.IBU          : int  25 25 25 25 25 25 25 25 25 25 ...
##  $ Max.IBU          : int  50 50 50 50 50 50 50 50 50 50 ...
##  $ Astringency      : int  13 12 14 13 25 22 28 18 25 35 ...
##  $ Body             : int  32 57 37 55 51 45 40 49 35 31 ...
##  $ Alcohol          : int  9 18 6 31 26 13 3 5 4 5 ...
##  $ Bitter           : int  47 33 42 47 44 46 40 37 38 35 ...
##  $ Sweet            : int  74 55 43 101 45 62 58 73 39 50 ...
##  $ Sour             : int  33 16 11 18 9 25 29 22 13 55 ...
##  $ Salty            : int  0 0 0 1 1 1 0 0 1 5 ...
##  $ Fruits           : int  33 24 10 49 11 34 36 21 8 52 ...
##  $ Hoppy            : int  57 35 54 40 51 60 54 37 60 66 ...
##  $ Spices           : int  8 12 4 16 20 4 8 4 16 8 ...
##  $ Malty            : int  111 84 62 119 95 103 97 98 97 77 ...
##  $ review_aroma     : num  3.5 3.8 3.41 4.15 3.62 ...
##  $ review_appearance: num  3.64 3.85 3.67 4.03 3.97 ...
##  $ review_palate    : num  3.56 3.9 3.6 4.15 3.73 ...
##  $ review_taste     : num  3.64 4.02 3.63 4.21 3.77 ...
##  $ review_overall   : num  3.85 4.03 3.83 4.01 3.82 ...
##  $ number_of_reviews: int  497 481 377 368 96 315 124 445 46 245 ...

Because the dataset has no date or time column to support a time-series analysis, I used the daily page views of the Wikipedia article “Beer” instead.

Beer page from Wikipedia

if (!requireNamespace("pageviews", quietly = TRUE)) {
    install.packages("pageviews")
}
start_date <- "2023010100"  # Start date with hour component
end_date <- "2023111923"    # End date with hour component

page_views <- article_pageviews(project = "en.wikipedia", 
                                article = "Beer",
                                start = start_date, 
                                end = end_date,
                                user_type = "user",
                                platform = "all-access")
str(page_views)
## 'data.frame':    316 obs. of  8 variables:
##  $ project    : chr  "wikipedia" "wikipedia" "wikipedia" "wikipedia" ...
##  $ language   : chr  "en" "en" "en" "en" ...
##  $ article    : chr  "Beer" "Beer" "Beer" "Beer" ...
##  $ access     : chr  "all-access" "all-access" "all-access" "all-access" ...
##  $ agent      : chr  "user" "user" "user" "user" ...
##  $ granularity: chr  "daily" "daily" "daily" "daily" ...
##  $ date       : POSIXct, format: "2023-01-01" "2023-01-02" ...
##  $ views      : num  2375 1942 1812 1772 1846 ...
# Tsibble
page_views$date <- as.Date(page_views$date)

page_views_tsibble <- as_tsibble(page_views, index = date)
ggplot(page_views_tsibble, aes(x = date, y = views)) +
  geom_line() +
  labs(title = "Daily Page Views for 'Beer'", x = "Date", y = "Views")

Baseline Traffic: Daily views are not constant over time. The series shows repeated spikes throughout the period, with only a small dip in the spikes at the start of the year.

Significant Spike: There is a substantial spike in page views starting in May. This likely reflects increased public interest as summer and the holidays draw closer, when people tend to enjoy chilled beer.

Trend Analysis

# Linear regression
model <- lm(views ~ date, data = page_views_tsibble)
summary(model)
## 
## Call:
## lm(formula = views ~ date, data = page_views_tsibble)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -436.79 -154.03  -24.41  155.63  739.73 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1494.7603  2662.9763  -0.561    0.575
## date            0.1904     0.1365   1.396    0.164
## 
## Residual standard error: 221.3 on 314 degrees of freedom
## Multiple R-squared:  0.006164,   Adjusted R-squared:  0.002999 
## F-statistic: 1.948 on 1 and 314 DF,  p-value: 0.1638
# Smoothing
page_views_tsibble$smoothed_views <- zoo::rollmean(page_views_tsibble$views, k = 7, fill = NA)

# Plotting smoothed data
ggplot(page_views_tsibble, aes(x = date, y = smoothed_views)) +
  geom_line() +
  labs(title = "Smoothed Page Views Over Time", x = "Date", y = "Smoothed Views")
## Warning: Removed 6 rows containing missing values (`geom_line()`).

The smoothed plot suggests a rise in the number of people reading the Wikipedia page as the summer holidays draw near. The linear regression, however, does not support a steady trend: the slope for date is positive (about 0.19 additional views per day) but not statistically significant (p = 0.164). The R-squared of 0.006 means that only about 0.6% of the variation in daily views is explained by the passage of time, so other factors account for nearly all of the day-to-day fluctuation in page views.
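
The numbers quoted in an interpretation like this can be read off the fitted model directly rather than from the printed summary. The following is a minimal sketch, assuming a simulated series in place of the downloaded page views; the `sim` data frame and its values are hypothetical and are not the real data.

```r
# Simulated stand-in for the page-view series (hypothetical values, not the real data)
set.seed(42)
sim <- data.frame(
  date  = seq(as.Date("2023-01-01"), by = "day", length.out = 316),
  views = rnorm(316, mean = 2200, sd = 220)
)

# Same model form as in the report: views regressed on date
fit <- lm(views ~ date, data = sim)
s   <- summary(fit)

slope     <- coef(s)["date", "Estimate"]   # estimated change in views per day
p_value   <- coef(s)["date", "Pr(>|t|)"]   # p-value for the test that the slope is zero
r_squared <- s$r.squared                   # share of variance in views explained by date

cat(sprintf("slope = %.3f views/day, p = %.3f, R^2 = %.4f\n",
            slope, p_value, r_squared))
```

Applied to the `model` object fitted above, the same calls should return the values shown in the printed summary: an R-squared of 0.006164 and a p-value of 0.164 for `date`.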

# Seasonality
acf(page_views_tsibble$views, main = "ACF of Page Views")

pacf(page_views_tsibble$views, main = "PACF of Page Views")

The ACF plot shows that page views are correlated over several days: if views are high one day, they are likely to be high the next. The PACF plot suggests that each day’s views are influenced mostly by the day before. There is also a clear repeating pattern, like a “season” of high views, which is consistent with beer being drunk throughout the year.
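
One way to check whether a repeating pattern in daily data is a weekly cycle is to look at the autocorrelation at lag 7 and to run a seasonal decomposition with a period of 7. The sketch below assumes a simulated series with a built-in 7-day cycle; the `views` vector is hypothetical, and the choice of `stl()` is an illustration rather than the method used in this report.

```r
# Hypothetical daily series with a built-in 7-day cycle plus noise (not the real data)
set.seed(1)
n     <- 316
views <- 2000 + 150 * sin(2 * pi * (seq_len(n) - 1) / 7) + rnorm(n, sd = 50)

# Autocorrelation without plotting; element 1 is lag 0, so element 8 is lag 7
a    <- acf(views, lag.max = 14, plot = FALSE)
lag7 <- a$acf[8]

# Treat the series as having 7 observations per cycle and decompose it
views_ts <- ts(views, frequency = 7)
dec      <- stl(views_ts, s.window = "periodic")

# Size of the weekly swing estimated by the decomposition
seasonal_range <- diff(range(dec$time.series[, "seasonal"]))

cat(sprintf("ACF at lag 7 = %.2f, weekly seasonal range = %.1f views\n",
            lag7, seasonal_range))
```

A large positive autocorrelation at lag 7 together with a seasonal component that is large relative to the remainder would point to a weekly cycle. On the real data, `page_views_tsibble$views` would take the place of the simulated `views` vector.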