Introduction

According to the US Energy Information Administration, regular gas costs 58.7% more this month than it did a year ago - $3.491 a gallon last month vs $2.20 in November 2020. (EIA). I was very much interested to know the facts behind the rise in the fuel prices. The data was pulled from the EIA website.

Despite the fact that customers buy gasoline locally, the price of the fuel is mostly decided by the worldwide market for crude oil. When the price of crude oil fluctuates on the global market, whether due to geopolitical events, crude supply variations, or gasoline demand fluctuations, the prices that consumers pay at the pump fluctuate as well.

It is always difficult to forecast fuel prices as there are factors such as Covid 19 in the past few years where the prices have gone up over the time period.

Importing the data and analyzing various columns in it.

library(readxl)
Fuel_prices_data <- read_excel("/Users/nishanthreddy/Downloads/Fuel_prices_data.xlsx")
head(Fuel_prices_data)
## # A tibble: 6 × 2
##   `Week of`           `Weekly U.S. All Grades All Formulations Retail Gasoline …
##   <dttm>                                                                   <dbl>
## 1 2022-01-17 00:00:00                                                       3.40
## 2 2022-01-10 00:00:00                                                       3.39
## 3 2022-01-03 00:00:00                                                       3.38
## 4 2021-12-27 00:00:00                                                       3.38
## 5 2021-12-20 00:00:00                                                       3.40
## 6 2021-12-13 00:00:00                                                       3.41
summary(Fuel_prices_data)
##     Week of                   
##  Min.   :1993-04-05 00:00:00  
##  1st Qu.:2000-06-15 12:00:00  
##  Median :2007-08-27 00:00:00  
##  Mean   :2007-08-27 00:00:00  
##  3rd Qu.:2014-11-06 12:00:00  
##  Max.   :2022-01-17 00:00:00  
##  Weekly U.S. All Grades All Formulations Retail Gasoline Prices Dollars per Gallon
##  Min.   :0.949                                                                    
##  1st Qu.:1.329                                                                    
##  Median :2.284                                                                    
##  Mean   :2.242                                                                    
##  3rd Qu.:2.913                                                                    
##  Max.   :4.165
str(Fuel_prices_data)
## tibble [1,503 × 2] (S3: tbl_df/tbl/data.frame)
##  $ Week of                                                                          : POSIXct[1:1503], format: "2022-01-17" "2022-01-10" ...
##  $ Weekly U.S. All Grades All Formulations Retail Gasoline Prices Dollars per Gallon: num [1:1503] 3.4 3.39 3.38 3.38 3.4 ...
Fuel_prices_data$`Week of` <- as.Date(Fuel_prices_data$`Week of`)
class(Fuel_prices_data$`Week of`)
## [1] "Date"

We can see here the summary of the data which is pulled from past years. Mean, Median values are 2.42 and 2.28 respectively. The data has 1,503 rows and 2 columns.

Finding missing values if any…

colSums(is.na(Fuel_prices_data))
##                                                                           Week of 
##                                                                                 0 
## Weekly U.S. All Grades All Formulations Retail Gasoline Prices Dollars per Gallon 
##                                                                                 0
Fuel_data <- Fuel_prices_data[which(Fuel_prices_data$`Week of` >= "2011-01-01" & Fuel_prices_data$`Week of` <= "2021-12-31" ),]
Fuel_data                                       
## # A tibble: 574 × 2
##    `Week of`  `Weekly U.S. All Grades All Formulations Retail Gasoline Prices D…
##    <date>                                                                  <dbl>
##  1 2021-12-27                                                               3.38
##  2 2021-12-20                                                               3.40
##  3 2021-12-13                                                               3.41
##  4 2021-12-06                                                               3.44
##  5 2021-11-29                                                               3.48
##  6 2021-11-22                                                               3.49
##  7 2021-11-15                                                               3.50
##  8 2021-11-08                                                               3.50
##  9 2021-11-01                                                               3.48
## 10 2021-10-25                                                               3.48
## # … with 564 more rows
head(Fuel_data)
## # A tibble: 6 × 2
##   `Week of`  `Weekly U.S. All Grades All Formulations Retail Gasoline Prices Do…
##   <date>                                                                   <dbl>
## 1 2021-12-27                                                                3.38
## 2 2021-12-20                                                                3.40
## 3 2021-12-13                                                                3.41
## 4 2021-12-06                                                                3.44
## 5 2021-11-29                                                                3.48
## 6 2021-11-22                                                                3.49

We took the data from past 10 years as its more significant and also found that there are no missing values in the data.

par(mfrow=c(1,2))
names(Fuel_data)[names(Fuel_data) == 'Week of'] <- "Date"
names(Fuel_data)[names(Fuel_data) == 'Weekly U.S. All Grades All Formulations Retail Gasoline Prices Dollars per Gallon'] <- "Price"
plot(Fuel_data$Date,Fuel_data$Price, type = "h", main = "Fuel Prices over the Years", ylab = "Fuel Prices", xlab = "Years")
plot(Fuel_data$Date,Fuel_data$Price, type = "o", main = "Fuel Prices over the Years", ylab = "Fuel Prices", xlab = "Years")

plot(Fuel_data$Date,Fuel_data$Price, type = "l", main = "Fuel Prices over the Years", ylab = "Fuel Prices", xlab = "Years")
boxplot(Fuel_data$Price, main = "Fuel Price")

In 2019 the prices again went down as there was sufficient amount of production in the United States for that period and in 2020, COVID 19 pandemic played a major role in the decline of the Fuel prices as the demand was less when compared to the previous years.

There are no outliers in the data as per the boxplot plotted above.

Linear Modeling

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.5     ✓ dplyr   1.0.7
## ✓ tidyr   1.1.4     ✓ stringr 1.4.0
## ✓ readr   2.0.2     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
sample_data <- sort(sample(nrow(Fuel_data), nrow(Fuel_data)*0.8))
train<-Fuel_data[sample_data,]
test<-Fuel_data[-sample_data,]
lm_train = lm(Price ~ Date,data = train)
summary(lm_train)
## 
## Call:
## lm(formula = Price ~ Date, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.14196 -0.32444  0.05582  0.29690  1.11868 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.890e+00  3.128e-01   25.23   <2e-16 ***
## Date        -2.917e-04  1.837e-05  -15.88   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4476 on 457 degrees of freedom
## Multiple R-squared:  0.3556, Adjusted R-squared:  0.3542 
## F-statistic: 252.2 on 1 and 457 DF,  p-value: < 2.2e-16

Looking at the summary of the 80% data, we can see that the pvalue is <0.05 by which we can reject the null hypothesis testing and can conclude that there is a clear relationship between the Fuel prices and Years.

predict_price <- predict(lm_train,newdata = test[c('Date')] )
predicted_test_data <- test
predicted_test_data['predict_price'] <- predict_price
install.packages("mltools", repos = "http://cran.us.r-project.org")
## 
## The downloaded binary packages are in
##  /var/folders/36/rj11vkws2rq06q3731tl873r0000gn/T//RtmpY1LjJs/downloaded_packages
#install.packages("Metrics") 
#rmse(predicted_test_data$Price, predicted_test_data$predict_price)

The Root Mean Square Error value which we got for the predicted model is 0.45.

Plotting the Predict and Actual Prices using ggplot

install.packages("tidyverse", repos = "http://cran.us.r-project.org")
## 
## The downloaded binary packages are in
##  /var/folders/36/rj11vkws2rq06q3731tl873r0000gn/T//RtmpY1LjJs/downloaded_packages
install.packages("ggplot2", repos = "http://cran.us.r-project.org")
## 
## The downloaded binary packages are in
##  /var/folders/36/rj11vkws2rq06q3731tl873r0000gn/T//RtmpY1LjJs/downloaded_packages
ggplot(train, aes(x = Date , y = Price )) +
  geom_line(colour = "Black") +
  geom_line(data = test, aes(x = Date, y = Price),colour = "green") + 
  geom_line(data = predicted_test_data, aes(x = Date, y = predict_price),colour = "red") +
  labs(title = "Fuel Prices over the Years(Predicted)",
       subtitle = "2011-2021",
       x = "Years", y = "Fuel Price per Gallon ($)")

The actual and predicted fuel prices over the past ten years are labelled in the above graph. The actual prices are shown in black color, the predicted prices are shown in the red color and followed by the test data prices which are labelled in the green color.The factors such as abundant production of fuel in the 2016,2019 years and followed by the COVID 19 pandemic in the 2020 are clearly responsible for the changes in the graphin those time periods. We can conclude by saying that Linear regression isnt the best technique for the time series data as it isnt predicting it accurately. Better techniques can be used such as ARIMA for better analysis.