The dataset contains weekly oil production figures for fiscal years 2020 through 2023; each row also carries the figure for the same week of the previous year. It has 173 observations and 7 variables. Here is a brief description of each variable:
Fiscal.Year: The fiscal year of the observation.
Fiscal.Week: The fiscal week of the observation.
Current.Year.Production: The current year’s production in barrels of oil.
Previous.Year.Production: The previous year’s production in barrels of oil.
Difference.From.Same.Week.Last.Year: The difference in production between the current and previous year’s production for the same fiscal week.
Current.Year.Cumulative.Production: The cumulative production for the current year up to the current week.
Cumulative.Difference: The cumulative difference in production between the current and previous year’s production up to the current week.
library(ggplot2) # A popular package for creating graphics in R, with a syntax based on the grammar of graphics
library(forecast) # Provides methods and tools for forecasting time series data
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
library(xts) # Provides an extensible time series class for use with R's time-series functions
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
##
data <- read.csv("weekly-gasoline.csv", header = TRUE, sep = ",")
df <- data.frame(data)
head(df,n=6)
## Fiscal.Year Fiscal.Week Current.Year.Production Previous.Year.Production
## 1 2023 16 2393748000 2417856000
## 2 2023 15 2367876000 2324364000
## 3 2023 14 2222052000 2402568000
## 4 2023 13 2209116000 2858856000
## 5 2023 12 2742138000 2641884000
## 6 2023 11 2561916000 2784768000
## Difference.From.Same.Week.Last.Year Current.Year.Cumulative.Production
## 1 -24108000 39649722000
## 2 43512000 37255974000
## 3 -180516000 34888098000
## 4 -649740000 32666046000
## 5 100254000 30456930000
## 6 -222852000 27714792000
## Cumulative.Difference
## 1 -3024672000
## 2 -3000564000
## 3 -3044076000
## 4 -2863560000
## 5 -2213820000
## 6 -2314074000
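The variable definitions can be checked against the rows printed above. The sketch below retypes the first three rows by hand (the name `first_rows` is only for illustration) and confirms that the difference column equals current minus previous production, and that consecutive cumulative totals differ by one week's production. It assumes the rows are listed newest first, as the printed output suggests.

```r
# First three rows of head(df), retyped by hand for a self-contained check.
first_rows <- data.frame(
  Fiscal.Week = c(16, 15, 14),
  Current.Year.Production = c(2393748000, 2367876000, 2222052000),
  Previous.Year.Production = c(2417856000, 2324364000, 2402568000),
  Difference.From.Same.Week.Last.Year = c(-24108000, 43512000, -180516000),
  Current.Year.Cumulative.Production = c(39649722000, 37255974000, 34888098000)
)

# Difference should be current minus previous for the same fiscal week.
diff_check <- first_rows$Current.Year.Production -
  first_rows$Previous.Year.Production
print(all(diff_check == first_rows$Difference.From.Same.Week.Last.Year))

# Rows are newest first, so the negated successive differences of the
# cumulative column give the production of the newer week in each pair.
weekly_from_cumulative <- -diff(first_rows$Current.Year.Cumulative.Production)
print(weekly_from_cumulative)
```

Because the difference column is an exact function of the two production columns, it carries no independent information about current production.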
str(df) # Examine the structure of the data
## 'data.frame': 173 obs. of 7 variables:
## $ Fiscal.Year : int 2023 2023 2023 2023 2023 2023 2023 2023 2023 2023 ...
## $ Fiscal.Week : int 16 15 14 13 12 11 10 9 8 7 ...
## $ Current.Year.Production : num 2.39e+09 2.37e+09 2.22e+09 2.21e+09 2.74e+09 ...
## $ Previous.Year.Production : num 2.42e+09 2.32e+09 2.40e+09 2.86e+09 2.64e+09 ...
## $ Difference.From.Same.Week.Last.Year: chr "-24108000" "43512000" "-180516000" "-649740000" ...
## $ Current.Year.Cumulative.Production : num 3.96e+10 3.73e+10 3.49e+10 3.27e+10 3.05e+10 ...
## $ Cumulative.Difference : chr "-3024672000" "-3000564000" "-3044076000" "-2863560000" ...
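The data has 173 observations and 7 variables. Two of them, Difference.From.Same.Week.Last.Year and Cumulative.Difference, were read as character and need to be converted to numeric.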
# Clean the data by converting "Difference.From.Same.Week.Last.Year" and "Cumulative.Difference" columns from character to numeric
df$Difference.From.Same.Week.Last.Year <- as.numeric(gsub(",", "", df$Difference.From.Same.Week.Last.Year))
## Warning: NAs introduced by coercion
df$Cumulative.Difference <- as.numeric(gsub(",", "", df$Cumulative.Difference))
## Warning: NAs introduced by coercion
df<-na.omit(df) # remove all NAs
sum(is.na(df)) # check if any value is missing
## [1] 0
No missing values remain, because na.omit() dropped every row in which the conversion produced an NA.
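Before dropping rows, it can help to look at which raw strings fail the conversion. The sketch below uses a small made-up vector (`raw` and its values are assumptions, not taken from the file) to show one way to list the offending entries.

```r
# Hypothetical raw values; the real file may contain different problem strings.
raw <- c("-24108000", "43,512,000", "", "n/a")

# Strip thousands separators, then convert; suppress the coercion warning
# because the NAs are inspected explicitly below.
cleaned <- suppressWarnings(as.numeric(gsub(",", "", raw)))

# Entries that could not be parsed as numbers.
bad <- raw[is.na(cleaned)]
print(bad)
print(sum(is.na(cleaned)))
```

Applied to the real columns, the same pattern would show whether the NAs come from blanks, placeholders, or some other formatting issue, which helps decide whether dropping those rows is acceptable.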
summary(df) # Summary statistics for all variables
## Fiscal.Year Fiscal.Week Current.Year.Production
## Min. :2020 Min. : 1.00 Min. :2.209e+09
## 1st Qu.:2021 1st Qu.:10.25 1st Qu.:2.541e+09
## Median :2022 Median :24.00 Median :2.632e+09
## Mean :2022 Mean :24.68 Mean :2.628e+09
## 3rd Qu.:2022 3rd Qu.:38.00 3rd Qu.:2.741e+09
## Max. :2023 Max. :52.00 Max. :2.953e+09
## Previous.Year.Production Difference.From.Same.Week.Last.Year
## Min. :1.489e+09 Min. :-649740000
## 1st Qu.:2.420e+09 1st Qu.: -77322000
## Median :2.587e+09 Median : 59976000
## Mean :2.527e+09 Mean : 100947737
## 3rd Qu.:2.710e+09 3rd Qu.: 226306500
## Max. :2.953e+09 Max. :1135722000
## Current.Year.Cumulative.Production Cumulative.Difference
## Min. :2.433e+09 Min. :-7.050e+09
## 1st Qu.:2.736e+10 1st Qu.: 1.416e+08
## Median :6.196e+10 Median : 2.118e+09
## Mean :6.354e+10 Mean : 1.717e+09
## 3rd Qu.:9.702e+10 3rd Qu.: 4.250e+09
## Max. :1.350e+11 Max. : 5.863e+09
Fiscal Year
mean(df$Fiscal.Year)
## [1] 2021.588
Fiscal Week
mean(df$Fiscal.Week)
## [1] 24.68421
Current.Year.Production
mean(df$Current.Year.Production)
## [1] 2628006684
Previous.Year.Production
mean(df$Previous.Year.Production)
## [1] 2527058947
Difference.From.Same.Week.Last.Year
mean(df$Difference.From.Same.Week.Last.Year)
## [1] 100947737
Current.Year.Cumulative.Production
mean(df$Current.Year.Cumulative.Production)
## [1] 63544045404
Cumulative.Difference
mean(df$Cumulative.Difference)
## [1] 1716707263
Fiscal Year
median(df$Fiscal.Year)
## [1] 2022
Fiscal Week
median(df$Fiscal.Week)
## [1] 24
Current.Year.Production
median(df$Current.Year.Production)
## [1] 2632035000
Previous.Year.Production
median(df$Previous.Year.Production)
## [1] 2587200000
Difference.From.Same.Week.Last.Year
median(df$Difference.From.Same.Week.Last.Year)
## [1] 59976000
Current.Year.Cumulative.Production
median(df$Current.Year.Cumulative.Production)
## [1] 61960353000
Cumulative.Difference
median(df$Cumulative.Difference)
## [1] 2117829000
Fiscal Year
min(df$Fiscal.Year)
## [1] 2020
max(df$Fiscal.Year)
## [1] 2023
Fiscal Week
min(df$Fiscal.Week)
## [1] 1
max(df$Fiscal.Week)
## [1] 52
Current.Year.Production
min(df$Current.Year.Production)
## [1] 2209116000
max(df$Current.Year.Production)
## [1] 2952642000
Previous.Year.Production
min(df$Previous.Year.Production)
## [1] 1489110000
max(df$Previous.Year.Production)
## [1] 2952642000
Difference.From.Same.Week.Last.Year
min(df$Difference.From.Same.Week.Last.Year)
## [1] -649740000
max(df$Difference.From.Same.Week.Last.Year)
## [1] 1135722000
Current.Year.Cumulative.Production
min(df$Current.Year.Cumulative.Production)
## [1] 2433144000
max(df$Current.Year.Cumulative.Production)
## [1] 1.35e+11
Cumulative.Difference
min(df$Cumulative.Difference)
## [1] -7050414000
max(df$Cumulative.Difference)
## [1] 5862654000
Fiscal Year
sd(df$Fiscal.Year)
## [1] 0.910231
Fiscal Week
sd(df$Fiscal.Week)
## [1] 15.61048
Current.Year.Production
sd(df$Current.Year.Production)
Previous.Year.Production
sd(df$Previous.Year.Production)
## [1] 272720976
Difference.From.Same.Week.Last.Year
sd(df$Difference.From.Same.Week.Last.Year)
## [1] 304899984
Current.Year.Cumulative.Production
sd(df$Current.Year.Cumulative.Production)
## [1] 39799555792
Cumulative.Difference
sd(df$Cumulative.Difference)
## [1] 3041539561
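The per-column calls above can also be collected into a single table. The following is a minimal sketch on a toy data frame (`toy` and its columns are made up for illustration); the same `sapply()` call should work on any data frame whose columns are all numeric.

```r
# Toy numeric data frame standing in for the cleaned production data.
toy <- data.frame(a = c(1, 2, 3, 4), b = c(10, 20, 30, 50))

# One column of statistics per variable; rows are the statistics.
stats <- sapply(toy, function(x) {
  c(mean = mean(x), median = median(x), min = min(x), max = max(x), sd = sd(x))
})
print(stats)
```

Keeping every statistic in one matrix makes it harder to mix up calls when repeating them for each variable.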
ggplot(df, aes(x = Fiscal.Week, y = Current.Year.Production, color = factor(Fiscal.Year))) +
geom_line() +
labs(title = "Weekly Oil Production by Fiscal Year", x = "Fiscal Week", y = "Current Year Production")
ggplot(df, aes(x = Fiscal.Week, y = Previous.Year.Production, color = factor(Fiscal.Year))) +
geom_line() +
labs(title = "Weekly Oil Production by Fiscal Year", x = "Fiscal Week", y = "Previous Year Production")
ggplot(df, aes(x = Fiscal.Week, y = Current.Year.Cumulative.Production, color = factor(Fiscal.Year))) +
geom_line() +
labs(title = "Current Year Cumulative Production by Fiscal Year", x = "Fiscal Week", y = "Current.Year.Cumulative.Production")
ggplot(df, aes(x = Difference.From.Same.Week.Last.Year, y = Cumulative.Difference, color = factor(Fiscal.Year))) +
geom_line() +
labs(title = "Cumulative Difference vs. Difference From Same Week Last Year", x = "Difference From Same Week Last Year", y = "Cumulative Difference")
# Compare the current year's production to the previous year's production
ggplot(df, aes(x = Previous.Year.Production, y = Current.Year.Production,color=factor(Fiscal.Year))) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
labs(title = "Comparison of Current Year's Production to Previous Year's Production", x = "Previous Year Production", y = "Current Year Production")
## `geom_smooth()` using formula = 'y ~ x'
ggplot(df, aes(x = Current.Year.Production)) +
geom_histogram(bins = 15, color = "yellow") # histogram of current-year production
ggplot(df, aes(x = Previous.Year.Production)) +
geom_histogram(bins = 15, color = "pink") # histogram of previous-year production
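df_chron <- df[order(df$Fiscal.Year, df$Fiscal.Week), ] # rows are stored newest first; sort them chronologically
oil_ts <- ts(df_chron$Current.Year.Production, start = c(df_chron$Fiscal.Year[1], df_chron$Fiscal.Week[1]), frequency = 52) # no fixed end, so values are not recycled past the data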
plot(oil_ts, main = "Weekly Oil Production Time Series", xlab = "Fiscal Week", ylab = "Production (barrels)")
oil_decomp <- decompose(oil_ts)
autoplot(oil_decomp) +
labs(title = "Decomposition of Current Weekly Oil Production Time Series")
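Because the rows of the data are listed newest first (2023 week 16 at the top), the direction of the trend depends on the series being in chronological order before it is passed to `ts()`. The sketch below illustrates this on a synthetic series (all names and numbers are made up): it builds three years of weekly values with a known upward slope, shuffles them into newest-first order, restores chronological order, and decomposes the result.

```r
# Synthetic weekly data: 3 fiscal years x 52 weeks, known slope of 0.5 per week.
weeks <- expand.grid(Fiscal.Week = 1:52, Fiscal.Year = 2020:2022)
weeks$t <- seq_len(nrow(weeks))
weeks$Production <- 100 + 0.5 * weeks$t + 10 * sin(2 * pi * weeks$Fiscal.Week / 52)

# Mimic the file layout: newest observation first.
newest_first <- weeks[order(-weeks$Fiscal.Year, -weeks$Fiscal.Week), ]

# Restore chronological order before building the time series.
chronological <- newest_first[order(newest_first$Fiscal.Year,
                                    newest_first$Fiscal.Week), ]
toy_ts <- ts(chronological$Production, start = c(2020, 1), frequency = 52)

# Classical decomposition; the trend is NA for half a period at each end.
toy_decomp <- decompose(toy_ts)
trend_slope <- mean(diff(toy_decomp$trend), na.rm = TRUE)
print(trend_slope)
```

Building the series from `newest_first` instead would flip the sign of the estimated slope, which is why the ordering step matters when interpreting the trend panel.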
From the exploratory data analysis and the time series decomposition, the trend component points to an upward trend in oil production over the observed period. There is also a seasonal component, with production peaking in the summer months and decreasing in the winter months.
The dataset has no categorical variable, so a classification analysis is not possible. We therefore fit a linear regression instead. First, we split the dataset into training and testing sets, using 80% of the rows for training and 20% for testing.
set.seed(42)
train_index <- sample(1:nrow(df), 0.8*nrow(df))
Train Data
train_data <- df[train_index,]
head(train_data)
## Fiscal.Year Fiscal.Week Current.Year.Production Previous.Year.Production
## 50 2022 20 2519580000 2471658000
## 153 2020 21 2656290000 2640414000
## 66 2022 4 2740962000 2512230000
## 26 2022 44 2511054000 2873850000
## 75 2021 47 2814168000 2693334000
## 152 2020 22 2700684000 2664228000
## Difference.From.Same.Week.Last.Year Current.Year.Cumulative.Production
## 50 47922000 52795932000
## 153 15876000 55975542000
## 66 228732000 11045580000
## 26 -362796000 115000000000
## 75 120834000 119000000000
## 152 36456000 58676226000
## Cumulative.Difference
## 50 5038278000
## 153 153468000
## 66 959616000
## 26 3799068000
## 75 3529470000
## 152 189924000
Test Data
test_data <- df[-train_index,]
head(test_data)
## Fiscal.Year Fiscal.Week Current.Year.Production Previous.Year.Production
## 1 2023 16 2393748000 2417856000
## 7 2023 10 2426970000 2635122000
## 11 2023 6 2570148000 2722146000
## 14 2023 3 2625420000 2832396000
## 24 2022 46 2748312000 2743902000
## 45 2022 25 2539278000 2533104000
## Difference.From.Same.Week.Last.Year Current.Year.Cumulative.Production
## 1 -24108000 39649722000
## 7 -208152000 25152876000
## 11 -151998000 15375318000
## 14 -206976000 7609896000
## 24 4410000 120000000000
## 45 6174000 65715174000
## Cumulative.Difference
## 1 -3024672000
## 7 -2091222000
## 11 -1186584000
## 14 -694722000
## 24 3713220000
## 45 5862654000
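The 80/20 split can be sanity-checked by confirming the set sizes and that no row appears in both sets. Here is a minimal sketch on a toy data frame (the `toy` frame and its 114 rows are assumptions chosen for illustration):

```r
set.seed(42)

# Toy data frame with an id column so overlap is easy to check.
toy <- data.frame(id = 1:114, y = rnorm(114))

# Use floor() so the training size is an explicit integer.
n_train <- floor(0.8 * nrow(toy))
train_index <- sample(seq_len(nrow(toy)), n_train)

train <- toy[train_index, ]
test <- toy[-train_index, ]

print(c(train = nrow(train), test = nrow(test)))
print(length(intersect(train$id, test$id)))  # shared rows between the two sets
```

A random split like this ignores time order; for forecasting-style evaluation, holding out the most recent weeks instead may be a reasonable alternative.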
lm_model <- lm(Current.Year.Production ~ Previous.Year.Production + Fiscal.Year + Difference.From.Same.Week.Last.Year, data = train_data)
predictions <- predict(lm_model, newdata = test_data)
mse <- mean((test_data$Current.Year.Production - predictions)^2)
mse
## [1] 5.338338e-13
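The MSE is essentially zero, but this does not by itself show good predictive skill. Difference.From.Same.Week.Last.Year is defined as Current.Year.Production minus Previous.Year.Production, so with both Previous.Year.Production and the difference as predictors the model simply recovers that identity.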
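A near-zero MSE deserves a second look when one predictor is computed from the response. The sketch below uses synthetic data (all names and numbers are made up) to compare a model that includes such a derived difference column with one that uses only the previous-year value.

```r
set.seed(1)
n <- 100

# Synthetic production figures; the relationship is an assumption for illustration.
previous <- rnorm(n, mean = 2500, sd = 250)
current <- 0.3 * previous + rnorm(n, mean = 1800, sd = 150)
toy <- data.frame(current, previous, difference = current - previous)

train <- toy[1:80, ]
test <- toy[81:100, ]

# "Leaky" model: difference + previous reproduces current exactly.
leaky <- lm(current ~ previous + difference, data = train)
# Model without the derived column.
honest <- lm(current ~ previous, data = train)

mse <- function(model) mean((test$current - predict(model, newdata = test))^2)
mse_leaky <- mse(leaky)
mse_honest <- mse(honest)
print(c(leaky = mse_leaky, honest = mse_honest))
print(coef(leaky))
```

In the leaky model the coefficients on `previous` and `difference` both come out at 1, which is the identity current = previous + difference rather than a learned relationship. The second model's MSE is a more realistic measure of how well previous-year production predicts current production.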
ggplot(df, aes(x = Previous.Year.Production, y = Current.Year.Production)) +
geom_point() +
geom_smooth(method = "lm") +
labs(title = "Scatter Plot of Current vs. Previous Year's Gasoline Production",
x = "Previous Year's Production (barrels)", y = "Current Year's Production (barrels)")
## `geom_smooth()` using formula = 'y ~ x'
library("plot3D")
scatter3D(df$Fiscal.Year ,df$Current.Year.Production ,df$Previous.Year.Production)
dt <- cbind(df$Fiscal.Year ,df$Current.Year.Production ,df$Previous.Year.Production, col = NULL)
head(dt)
## [,1] [,2] [,3]
## [1,] 2023 2393748000 2417856000
## [2,] 2023 2367876000 2324364000
## [3,] 2023 2222052000 2402568000
## [4,] 2023 2209116000 2858856000
## [5,] 2023 2742138000 2641884000
## [6,] 2023 2561916000 2784768000
kmeans(dt,centers=3)
## K-means clustering with 3 clusters of sizes 44, 18, 52
##
## Cluster means:
## [,1] [,2] [,3]
## 1 2022.227 2525927727 2688957409
## 2 2021.444 2565525667 2038710333
## 3 2021.096 2736009231 2559111692
##
## Clustering vector:
## [1] 1 2 1 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 1 1 3 1 1 1 1 1 1 3 1 1 1 1
## [38] 1 1 1 1 1 1 1 3 3 3 2 1 3 2 2 1 2 2 3 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [75] 3 3 3 3 3 3 3 3 3 3 2 2 2 2 2 2 2 2 2 2 2 1 1 3 3 3 3 3 1 3 3 3 3 3 3 3 3
## [112] 3 3 3
##
## Within cluster sum of squares by cluster:
## [1] 1.216676e+18 1.660199e+18 1.119528e+18
## (between_SS / total_SS = 62.4 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
cluster_id <- kmeans(dt, centers = 3)$cluster # vector of cluster membership, one entry per row
df$cm <- cluster_id # store on the data frame; dt is a matrix and does not support $
scatter3D(df$Fiscal.Year, df$Current.Year.Production, df$Previous.Year.Production, colvar = df$cm)
We will use the kmeans() function in R to cluster the data into 3 clusters based on the Fiscal.Year, Current.Year.Production, and Previous.Year.Production variables.
set.seed(42)
oil_cluster <- df[,c("Fiscal.Year","Current.Year.Production", "Previous.Year.Production")]
kmeans_model <- kmeans(oil_cluster, centers = 3)
ggplot(oil_cluster, aes(x = Current.Year.Production, y = Previous.Year.Production, color = factor(kmeans_model$cluster))) +
geom_point() +
labs(title = "K-Means Clustering of Oil Production Data", x = "Current Year Production", y = "Previous Year Production")
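K-means relies on Euclidean distance, so a variable measured in billions (production) will dominate one that spans only a few units (fiscal year). The sketch below, on made-up data, shows one common way to handle this: standardise the columns with `scale()` before clustering. The `nstart = 10` argument is an added choice to make the result less dependent on the random starting centers.

```r
set.seed(42)

# Toy data: two fiscal years and production on a much larger scale.
toy <- data.frame(
  year = rep(c(2020, 2023), each = 20),
  production = rnorm(40, mean = 2.6e9, sd = 2e8)
)

# Clustering on raw values: distances are driven almost entirely by production.
raw_fit <- kmeans(toy, centers = 2, nstart = 10)

# Clustering on standardised values: each column has mean 0 and sd 1.
toy_scaled <- scale(toy)
scaled_fit <- kmeans(toy_scaled, centers = 2, nstart = 10)

# Cross-tabulate year against cluster membership for both fits.
print(table(toy$year, raw_fit$cluster))
print(table(toy$year, scaled_fit$cluster))
```

Whether to scale depends on the question being asked; if clusters are meant to reflect production levels only, dropping the year column may be simpler than scaling it.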
The analysis reveals useful information about the weekly oil production data. The time series analysis highlights the seasonality and upward trend of production, which can help inform production scheduling. The linear regression model relates current production to the predictor variables, and the clustering analysis groups weeks with similar production levels, which can inform decisions about resource allocation and investment. Overall, the results can support strategic decision-making in the oil industry.