About the Data:
The Beer Profile and Ratings dataset from Kaggle was used for the project. The main data set (beer_profile_and_ratings.csv) contains the following columns:
(General)
• Name: Beer name (label)
• Style: Beer style
• Brewery: Brewery name
• Beer Name: Complete beer name (Brewery + Brew Name)
• Description: Notes on the beer, if available
• ABV: Alcohol content of the beer (% by volume)
• Min IBU: The minimum IBU value each beer can possess
• Max IBU: The maximum IBU value each beer can possess
(Mouth feel)
• Astringency
• Body
• Alcohol
(Taste)
• Bitter
• Sweet
• Sour
• Salty
(Flavor and Aroma)
• Fruits
• Hoppy
• Spices
• Malty
(Reviews)
• review_aroma
• review_appearance
• review_palate
• review_taste
• review_overall
• number_of_reviews
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(reshape2)
##
## Attaching package: 'reshape2'
##
## The following object is masked from 'package:tidyr':
##
## smiths
library(dplyr)
library(gridExtra)
##
## Attaching package: 'gridExtra'
##
## The following object is masked from 'package:dplyr':
##
## combine
library(boot)
library(pwr)
library(forecast)
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
library(xts)
## Loading required package: zoo
##
## Attaching package: 'zoo'
##
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
##
##
## ######################### Warning from 'xts' package ##########################
## # #
## # The dplyr lag() function breaks how base R's lag() function is supposed to #
## # work, which breaks lag(my_xts). Calls to lag(my_xts) that you type or #
## # source() into this session won't work correctly. #
## # #
## # Use stats::lag() to make sure you're not using dplyr::lag(), or you can add #
## # conflictRules('dplyr', exclude = 'lag') to your .Rprofile to stop #
## # dplyr from breaking base R's lag() function. #
## # #
## # Code in packages is not affected. It's protected by R's namespace mechanism #
## # Set `options(xts.warn_dplyr_breaks_lag = FALSE)` to suppress this warning. #
## # #
## ###############################################################################
##
## Attaching package: 'xts'
##
## The following objects are masked from 'package:dplyr':
##
## first, last
library(tsibble)
##
## Attaching package: 'tsibble'
##
## The following object is masked from 'package:zoo':
##
## index
##
## The following object is masked from 'package:lubridate':
##
## interval
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, union
library(pageviews)
library(WikipediR)
beers <- read.csv("/Users/bhavyakalra/Desktop/Stats R/Final_project_beer/beer_profile_and_ratings.csv")
str(beers)
## 'data.frame': 3197 obs. of 25 variables:
## $ Name : chr "Amber" "Double Bag" "Long Trail Ale" "Doppelsticke" ...
## $ Style : chr "Altbier" "Altbier" "Altbier" "Altbier" ...
## $ Brewery : chr "Alaskan Brewing Co." "Long Trail Brewing Co." "Long Trail Brewing Co." "Uerige Obergärige Hausbrauerei GmbH / Zum Uerige" ...
## $ Beer.Name..Full. : chr "Alaskan Brewing Co. Alaskan Amber" "Long Trail Brewing Co. Double Bag" "Long Trail Brewing Co. Long Trail Ale" "Uerige Obergärige Hausbrauerei GmbH / Zum Uerige Uerige Doppelsticke" ...
## $ Description : chr "Notes:Richly malty and long on the palate, with just enough hop backing to make this beautiful amber colored \""| __truncated__ "Notes:This malty, full-bodied double alt is also known as “Stickebier” – German slang for “secret brew”. Long T"| __truncated__ "Notes:Long Trail Ale is a full-bodied amber ale modeled after the “Alt-biers” of Düsseldorf, Germany. Our top f"| __truncated__ "Notes:" ...
## $ ABV : num 5.3 7.2 5 8.5 7.2 6 5.3 5 4.8 5.1 ...
## $ Min.IBU : int 25 25 25 25 25 25 25 25 25 25 ...
## $ Max.IBU : int 50 50 50 50 50 50 50 50 50 50 ...
## $ Astringency : int 13 12 14 13 25 22 28 18 25 35 ...
## $ Body : int 32 57 37 55 51 45 40 49 35 31 ...
## $ Alcohol : int 9 18 6 31 26 13 3 5 4 5 ...
## $ Bitter : int 47 33 42 47 44 46 40 37 38 35 ...
## $ Sweet : int 74 55 43 101 45 62 58 73 39 50 ...
## $ Sour : int 33 16 11 18 9 25 29 22 13 55 ...
## $ Salty : int 0 0 0 1 1 1 0 0 1 5 ...
## $ Fruits : int 33 24 10 49 11 34 36 21 8 52 ...
## $ Hoppy : int 57 35 54 40 51 60 54 37 60 66 ...
## $ Spices : int 8 12 4 16 20 4 8 4 16 8 ...
## $ Malty : int 111 84 62 119 95 103 97 98 97 77 ...
## $ review_aroma : num 3.5 3.8 3.41 4.15 3.62 ...
## $ review_appearance: num 3.64 3.85 3.67 4.03 3.97 ...
## $ review_palate : num 3.56 3.9 3.6 4.15 3.73 ...
## $ review_taste : num 3.64 4.02 3.63 4.21 3.77 ...
## $ review_overall : num 3.85 4.03 3.83 4.01 3.82 ...
## $ number_of_reviews: int 497 481 377 368 96 315 124 445 46 245 ...
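The column names in the `str()` output differ from the labels in the column list (for example `Beer.Name..Full.` and `Min.IBU`) because `read.csv()` by default passes header names through `make.names()`. A small sketch of that mapping follows. The raw header text is not shown in this report, so the labels "Beer Name (Full)", "Min IBU" and "Max IBU" are assumptions inferred from the output.

```r
# Header labels assumed to appear in the CSV (inferred, not confirmed).
raw_headers <- c("Beer Name (Full)", "Min IBU", "Max IBU")

# read.csv() uses check.names = TRUE by default, which applies make.names():
# characters that are not valid in an R name are replaced with ".".
clean_headers <- make.names(raw_headers)
print(clean_headers)

# To keep the original labels, read.csv(..., check.names = FALSE) can be used,
# at the cost of needing backticks, e.g. beers$`Min IBU`.
```

Keeping the default cleaned names is usually the more convenient choice for `$` access and formulas.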
if (!requireNamespace("pageviews", quietly = TRUE)) {
  install.packages("pageviews")
}
start_date <- "2023010100" # Start date with hour component
end_date <- "2023111923" # End date with hour component
page_views <- article_pageviews(project = "en.wikipedia",
                                article = "Beer",
                                start = start_date,
                                end = end_date,
                                user_type = "user",
                                platform = "all-access")
str(page_views)
## 'data.frame': 316 obs. of 8 variables:
## $ project : chr "wikipedia" "wikipedia" "wikipedia" "wikipedia" ...
## $ language : chr "en" "en" "en" "en" ...
## $ article : chr "Beer" "Beer" "Beer" "Beer" ...
## $ access : chr "all-access" "all-access" "all-access" "all-access" ...
## $ agent : chr "user" "user" "user" "user" ...
## $ granularity: chr "daily" "daily" "daily" "daily" ...
## $ date : POSIXct, format: "2023-01-01" "2023-01-02" ...
## $ views : num 2375 1942 1812 1772 1846 ...
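The requested range runs from 2023-01-01 to 2023-11-19, yet `str()` reports 316 daily rows. A quick count shows how many calendar days the request covers, and a small helper can list which days are absent. This is only a sketch: `missing_days()` is a hypothetical helper and the toy dates are illustrative. The shortfall may mean that the series has gaps or simply that the request was made before the end date, so later days were not yet available.

```r
# Requested range, taken from start_date and end_date (hour component dropped).
start_day <- as.Date("2023-01-01")
end_day   <- as.Date("2023-11-19")

expected_days <- seq(start_day, end_day, by = "day")
n_expected <- length(expected_days)  # calendar days requested
n_observed <- 316                    # rows reported by str(page_views)
n_short <- n_expected - n_observed   # days with no row in the result

# Hypothetical helper: list requested days that have no row in the data.
missing_days <- function(observed, expected) {
  expected[!(expected %in% observed)]
}

# Toy check with one day removed (illustrative dates only).
toy_expected <- seq(as.Date("2023-01-01"), as.Date("2023-01-05"), by = "day")
toy_observed <- toy_expected[-3]
toy_missing <- missing_days(toy_observed, toy_expected)
print(toy_missing)
```

On the real data the call would be `missing_days(as.Date(page_views$date), expected_days)`. Knowing where the absent days fall matters for the rolling mean and the ACF, which both treat consecutive rows as consecutive days.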
# Tsibble
page_views$date <- as.Date(page_views$date)
page_views_tsibble <- as_tsibble(page_views, index = date)
ggplot(page_views_tsibble, aes(x = date, y = views)) +
  geom_line() +
  labs(title = "Daily Page Views for 'Beer'", x = "Date", y = "Views")
Baseline Traffic: Page views are not constant over time. They spike
repeatedly throughout the period, with a slight decrease in views at the
start of the year.
Significant Spike: There is a substantial spike in page views starting in May. This likely reflects increased public interest as summer and the holidays draw closer, since people tend to enjoy chilled beer in summer.
# Linear regression
model <- lm(views ~ date, data = page_views_tsibble)
summary(model)
##
## Call:
## lm(formula = views ~ date, data = page_views_tsibble)
##
## Residuals:
## Min 1Q Median 3Q Max
## -436.79 -154.03 -24.41 155.63 739.73
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1494.7603 2662.9763 -0.561 0.575
## date 0.1904 0.1365 1.396 0.164
##
## Residual standard error: 221.3 on 314 degrees of freedom
## Multiple R-squared: 0.006164, Adjusted R-squared: 0.002999
## F-statistic: 1.948 on 1 and 314 DF, p-value: 0.1638
# Smoothing
page_views_tsibble$smoothed_views <- zoo::rollmean(page_views_tsibble$views, k = 7, fill = NA)
# Plotting smoothed data
ggplot(page_views_tsibble, aes(x = date, y = smoothed_views)) +
  geom_line() +
  labs(title = "Smoothed Page Views Over Time", x = "Date", y = "Smoothed Views")
## Warning: Removed 6 rows containing missing values (`geom_line()`).
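The six removed rows come from the smoothing window itself: a centred 7-day mean has no full window for the first three and last three observations, so `fill = NA` pads those positions. The sketch below reproduces the effect on a toy vector using base R's `stats::filter()` as a stand-in for `zoo::rollmean()`. The substitution and the toy values are assumptions made to keep the example free of extra packages.

```r
# Toy series (illustrative values, not the page-view data).
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

# Centred 7-point moving average. stats::filter() is called with its namespace
# because dplyr::filter() masks it once the tidyverse is attached.
weights <- rep(1 / 7, 7)
smoothed <- as.numeric(stats::filter(x, weights, sides = 2))

print(smoothed)
n_missing <- sum(is.na(smoothed))  # three NAs at each end of the series
```

A 7-day window is a natural choice for daily web traffic because it averages over one full week, but the width remains a judgment call.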
The smoothed graph suggests a slight increase in the number of people
reading the Wikipedia page as time goes on, especially as the summer
holidays draw near. The regression, however, does not support date as a
significant predictor of page views: the slope is about 0.19 views per
day with a p-value of 0.164. Various additional factors contribute to
the day-to-day fluctuations in page views. Only about 0.6% of the
variation in views (R-squared = 0.006) can be attributed to the passage
of time, so any upward trend over the period is weak.
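The figures in the regression summary can be cross-checked by hand, which is a useful habit before interpreting them. The sketch below recomputes the t value, the two-sided p-value and R-squared from the numbers printed by `summary(model)`. The values are copied from that output, so the results agree with it only up to rounding.

```r
# Values copied from the summary(model) output.
estimate  <- 0.1904   # slope for date (views per day)
std_error <- 0.1365   # standard error of the slope
resid_df  <- 314      # residual degrees of freedom
f_stat    <- 1.948    # F-statistic on 1 and 314 DF

# t value and two-sided p-value for the slope.
t_value <- estimate / std_error
p_value <- 2 * pt(-abs(t_value), df = resid_df)

# With a single predictor, R-squared = F / (F + residual df).
r_squared <- f_stat / (f_stat + resid_df)

cat("t =", round(t_value, 3), "\n")
cat("p =", round(p_value, 3), "\n")
cat("R-squared =", round(r_squared, 4), "\n")
```

A p-value well above 0.05 together with an R-squared below 0.01 means a straight-line trend in date describes very little of the variation in this series.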
# Seasonality
acf(page_views_tsibble$views, main = "ACF of Page Views")
pacf(page_views_tsibble$views, main = "PACF of Page Views")
The ACF plot shows that the number of page views is correlated across several days; if views are high one day, they are likely to be high the next. The PACF plot suggests that each day’s views are mostly influenced by the day before. There also appears to be a repeating pattern, like a “season” of high views, on top of a steady baseline, which is consistent with beer being drunk throughout the year.
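One way to check whether a repeating pattern is a weekly one is to read the autocorrelation at lag 7 directly instead of judging it from the plot. The sketch below does this on a toy series with an exact 7-day cycle. The weekly period is an assumption for illustration, not something the plots in this report confirm, and the toy values are not the page-view data.

```r
# Toy series with an exact 7-day cycle (illustrative, not the page-view data).
toy_views <- rep(c(1, 2, 3, 4, 5, 6, 7), times = 10)

# acf() returns lag 0 first, so lag k sits at index k + 1.
toy_acf <- acf(toy_views, lag.max = 14, plot = FALSE)$acf[, 1, 1]
lag1 <- toy_acf[2]
lag7 <- toy_acf[8]

# Approximate 95% band drawn by the default acf() plot.
band_toy <- qnorm(0.975) / sqrt(length(toy_views))

# The same band for a series of 316 observations, as in page_views.
band_316 <- qnorm(0.975) / sqrt(316)

cat("lag 1:", round(lag1, 3), " lag 7:", round(lag7, 3), "\n")
cat("band (n = 70):", round(band_toy, 3), " band (n = 316):", round(band_316, 3), "\n")
```

Applied to the real series, a lag-7 value clearly outside the band while neighbouring lags are smaller would point to a weekly cycle. A slow decay across all lags would instead point to a trend or persistence.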