OVERVIEW OF THE DATASET

The dataset contains weekly oil production figures, each paired with the production for the same fiscal week one year earlier. The fiscal years covered run from 2020 to 2023. It has 173 observations and 7 variables. Here is a brief description of each variable:

Fiscal.Year: The fiscal year of the observation.

Fiscal.Week: The fiscal week of the observation.

Current.Year.Production: The current year’s production in barrels of oil.

Previous.Year.Production: The previous year’s production in barrels of oil.

Difference.From.Same.Week.Last.Year: The difference in production between the current and previous year’s production for the same fiscal week.

Current.Year.Cumulative.Production: The cumulative production for the current year up to the current week.

Cumulative.Difference: The cumulative difference in production between the current and previous year’s production up to the current week.
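
The definitions above imply two identities that are worth checking before any modelling: the weekly difference should equal current minus previous production, and consecutive cumulative totals should differ by one week's production. A minimal sketch using the first three rows printed by head(df) in section 2.1 (the data frame `chk` and its short column names are only for illustration):

```r
# First three rows from the head(df) output (fiscal weeks 16, 15, 14 of 2023)
chk <- data.frame(
  week       = c(16, 15, 14),
  current    = c(2393748000, 2367876000, 2222052000),
  previous   = c(2417856000, 2324364000, 2402568000),
  difference = c(-24108000, 43512000, -180516000),
  cumulative = c(39649722000, 37255974000, 34888098000)
)

# Identity 1: difference = current - previous
diff_ok <- all(chk$current - chk$previous == chk$difference)

# Identity 2: the cumulative total grows by the current week's production.
# Rows are in descending week order, so compare each row with the next one.
cum_step <- chk$cumulative[-nrow(chk)] - chk$cumulative[-1]
cum_ok <- all(cum_step == chk$current[-nrow(chk)])

print(c(difference_identity = diff_ok, cumulative_identity = cum_ok))
```

Both checks hold for these rows, which means the difference and cumulative columns carry no information beyond the two production columns.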

(1). LOAD THE REQUIRED LIBRARIES

library(ggplot2)     # A popular package for creating graphics in R, with a syntax based on the grammar of graphics
library(forecast)    # Provides methods and tools for forecasting time series data
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
library(xts)         # Provides an extensible time series class for use with R's time-series functions
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## 
## ################################### WARNING ###################################
## # We noticed you have dplyr installed. The dplyr lag() function breaks how    #
## # base R's lag() function is supposed to work, which breaks lag(my_xts).      #
## #                                                                             #
## # If you call library(dplyr) later in this session, then calls to lag(my_xts) #
## # that you enter or source() into this session won't work correctly.          #
## #                                                                             #
## # All package code is unaffected because it is protected by the R namespace   #
## # mechanism.                                                                  #
## #                                                                             #
## # Set `options(xts.warn_dplyr_breaks_lag = FALSE)` to suppress this warning.  #
## #                                                                             #
## # You can use stats::lag() to make sure you're not using dplyr::lag(), or you #
## # can add conflictRules('dplyr', exclude = 'lag') to your .Rprofile to stop   #
## # dplyr from breaking base R's lag() function.                                #
## ################################### WARNING ###################################

(2). LOADING AND CLEANING THE DATA

data <- read.csv("weekly-gasoline.csv", header = TRUE, sep = ",")
df <- data.frame(data)

2.1 HEAD and STR

head(df,n=6)
##   Fiscal.Year Fiscal.Week Current.Year.Production Previous.Year.Production
## 1        2023          16              2393748000               2417856000
## 2        2023          15              2367876000               2324364000
## 3        2023          14              2222052000               2402568000
## 4        2023          13              2209116000               2858856000
## 5        2023          12              2742138000               2641884000
## 6        2023          11              2561916000               2784768000
##   Difference.From.Same.Week.Last.Year Current.Year.Cumulative.Production
## 1                           -24108000                        39649722000
## 2                            43512000                        37255974000
## 3                          -180516000                        34888098000
## 4                          -649740000                        32666046000
## 5                           100254000                        30456930000
## 6                          -222852000                        27714792000
##   Cumulative.Difference
## 1           -3024672000
## 2           -3000564000
## 3           -3044076000
## 4           -2863560000
## 5           -2213820000
## 6           -2314074000
str(df)  # Examine the structure of the data
## 'data.frame':    173 obs. of  7 variables:
##  $ Fiscal.Year                        : int  2023 2023 2023 2023 2023 2023 2023 2023 2023 2023 ...
##  $ Fiscal.Week                        : int  16 15 14 13 12 11 10 9 8 7 ...
##  $ Current.Year.Production            : num  2.39e+09 2.37e+09 2.22e+09 2.21e+09 2.74e+09 ...
##  $ Previous.Year.Production           : num  2.42e+09 2.32e+09 2.40e+09 2.86e+09 2.64e+09 ...
##  $ Difference.From.Same.Week.Last.Year: chr  "-24108000" "43512000" "-180516000" "-649740000" ...
##  $ Current.Year.Cumulative.Production : num  3.96e+10 3.73e+10 3.49e+10 3.27e+10 3.05e+10 ...
##  $ Cumulative.Difference              : chr  "-3024672000" "-3000564000" "-3044076000" "-2863560000" ...

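The data has 173 observations and 7 variables. Two of them, Difference.From.Same.Week.Last.Year and Cumulative.Difference, were read as character and are converted to numeric in the next step.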

2.2 Convert All Character To Numeric For Easier Analysis

# Clean the data by converting "Difference.From.Same.Week.Last.Year" and "Cumulative.Difference" columns from character to numeric

df$Difference.From.Same.Week.Last.Year <- as.numeric(gsub(",", "", df$Difference.From.Same.Week.Last.Year))
## Warning: NAs introduced by coercion
df$Cumulative.Difference <- as.numeric(gsub(",", "", df$Cumulative.Difference))
## Warning: NAs introduced by coercion
df<-na.omit(df) # remove all NAs
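
The coercion warnings mean that some entries were not plain numbers, and na.omit() then drops those rows without reporting which ones. A small sketch of how one might inspect the offending strings before discarding anything; the vector `raw` is made-up illustration data, not values from the CSV:

```r
# Hypothetical character column: clean values, one with thousands
# separators, one placeholder string, and one empty string
raw <- c("-24108000", "43,512,000", "N/A", "", "100254000")

# Same conversion as above; warnings are suppressed because the NAs
# are inspected explicitly below
cleaned <- suppressWarnings(as.numeric(gsub(",", "", raw)))

# Entries that failed to convert
bad <- raw[is.na(cleaned)]
print(bad)

# Number of rows na.omit() would drop because of this column
print(sum(is.na(cleaned)))
```

Comparing nrow(df) before and after na.omit() gives the same count for the real data.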

2.3 Check For Missing Values

sum(is.na(df))  # check if any value is missing
## [1] 0

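No missing values remain. This is expected, because na.omit() in step 2.2 already removed every row in which the conversion produced an NA; the k-means output in section 6 shows that 114 rows are left.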

2.4 Quick Summary Statistics Of Each Variable

summary(df)   # Summary statistics for all variables
##   Fiscal.Year    Fiscal.Week    Current.Year.Production
##  Min.   :2020   Min.   : 1.00   Min.   :2.209e+09      
##  1st Qu.:2021   1st Qu.:10.25   1st Qu.:2.541e+09      
##  Median :2022   Median :24.00   Median :2.632e+09      
##  Mean   :2022   Mean   :24.68   Mean   :2.628e+09      
##  3rd Qu.:2022   3rd Qu.:38.00   3rd Qu.:2.741e+09      
##  Max.   :2023   Max.   :52.00   Max.   :2.953e+09      
##  Previous.Year.Production Difference.From.Same.Week.Last.Year
##  Min.   :1.489e+09        Min.   :-649740000                 
##  1st Qu.:2.420e+09        1st Qu.: -77322000                 
##  Median :2.587e+09        Median :  59976000                 
##  Mean   :2.527e+09        Mean   : 100947737                 
##  3rd Qu.:2.710e+09        3rd Qu.: 226306500                 
##  Max.   :2.953e+09        Max.   :1135722000                 
##  Current.Year.Cumulative.Production Cumulative.Difference
##  Min.   :2.433e+09                  Min.   :-7.050e+09   
##  1st Qu.:2.736e+10                  1st Qu.: 1.416e+08   
##  Median :6.196e+10                  Median : 2.118e+09   
##  Mean   :6.354e+10                  Mean   : 1.717e+09   
##  3rd Qu.:9.702e+10                  3rd Qu.: 4.250e+09   
##  Max.   :1.350e+11                  Max.   : 5.863e+09

(3). EXPLORATORY DATA ANALYSIS

3.1 Calculation of mean

Fiscal Year

mean(df$Fiscal.Year)
## [1] 2021.588

Fiscal Week

mean(df$Fiscal.Week)
## [1] 24.68421

Current.Year.Production

mean(df$Current.Year.Production)
## [1] 2628006684

Previous.Year.Production

mean(df$Previous.Year.Production)
## [1] 2527058947

Difference.From.Same.Week.Last.Year

mean(df$Difference.From.Same.Week.Last.Year)
## [1] 100947737

Current.Year.Cumulative.Production

mean(df$Current.Year.Cumulative.Production)
## [1] 63544045404

Cumulative.Difference

mean(df$Cumulative.Difference)
## [1] 1716707263

3.2 Calculation of median

Fiscal Year

median(df$Fiscal.Year)
## [1] 2022

Fiscal Week

median(df$Fiscal.Week)
## [1] 24

Current.Year.Production

median(df$Current.Year.Production)
## [1] 2632035000

Previous.Year.Production

median(df$Previous.Year.Production)
## [1] 2587200000

Difference.From.Same.Week.Last.Year

median(df$Difference.From.Same.Week.Last.Year)
## [1] 59976000

Current.Year.Cumulative.Production

median(df$Current.Year.Cumulative.Production)
## [1] 61960353000

Cumulative.Difference

median(df$Cumulative.Difference)
## [1] 2117829000

3.3 Calculation of minimum and maximum

Fiscal Year

min(df$Fiscal.Year)
## [1] 2020
max(df$Fiscal.Year)
## [1] 2023

Fiscal Week

min(df$Fiscal.Week)
## [1] 1
max(df$Fiscal.Week)
## [1] 52

Current.Year.Production

min(df$Current.Year.Production)
## [1] 2209116000
max(df$Current.Year.Production)
## [1] 2952642000

Previous.Year.Production

min(df$Previous.Year.Production)
## [1] 1489110000
max(df$Previous.Year.Production)
## [1] 2952642000

Difference.From.Same.Week.Last.Year

min(df$Difference.From.Same.Week.Last.Year)
## [1] -649740000
max(df$Difference.From.Same.Week.Last.Year)
## [1] 1135722000

Current.Year.Cumulative.Production

min(df$Current.Year.Cumulative.Production)
## [1] 2433144000
max(df$Current.Year.Cumulative.Production)
## [1] 1.35e+11

Cumulative.Difference

min(df$Cumulative.Difference)
## [1] -7050414000
max(df$Cumulative.Difference)
## [1] 5862654000

3.4 Calculation of standard deviation

Fiscal Year

sd(df$Fiscal.Year)
## [1] 0.910231

Fiscal Week

sd(df$Fiscal.Week)
## [1] 15.61048

Current.Year.Production

sd(df$Current.Year.Production)

Previous.Year.Production

sd(df$Previous.Year.Production)
## [1] 272720976

Difference.From.Same.Week.Last.Year

sd(df$Difference.From.Same.Week.Last.Year)
## [1] 304899984

Current.Year.Cumulative.Production

sd(df$Current.Year.Cumulative.Production)
## [1] 39799555792

Cumulative.Difference

sd(df$Cumulative.Difference)
## [1] 3041539561
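
The per-variable calls in sections 3.1 to 3.4 can also be collected into a single table. A minimal sketch, using a toy data frame built from the six rows shown by head(df) (the name `toy` is illustrative; on the real data one would pass df instead):

```r
# Toy data frame with two of the columns (first six rows from head(df))
toy <- data.frame(
  Current.Year.Production  = c(2393748000, 2367876000, 2222052000,
                               2209116000, 2742138000, 2561916000),
  Previous.Year.Production = c(2417856000, 2324364000, 2402568000,
                               2858856000, 2641884000, 2784768000)
)

# One row per statistic, one column per variable
stats <- sapply(toy, function(x) {
  c(mean = mean(x), median = median(x), min = min(x), max = max(x), sd = sd(x))
})
print(stats)
```

Computing every statistic in one place avoids copy-and-paste slips such as calling the wrong function for one of the variables.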

3.5 Graphs

Visualize the trend of current oil production over time
ggplot(df, aes(x = Fiscal.Week, y = Current.Year.Production, color = factor(Fiscal.Year))) +
  geom_line() +
  labs(title = "Weekly Oil Production by Fiscal Year", x = "Fiscal Week", y = "Current Year Production")

Visualize the trend of previous oil production over time
ggplot(df, aes(x = Fiscal.Week, y = Previous.Year.Production, color = factor(Fiscal.Year))) +
  geom_line() +
  labs(title = "Weekly Oil Production by Fiscal Year", x = "Fiscal Week", y = "Previous Year Production")

Visualize the current year's cumulative production over time
ggplot(df, aes(x = Fiscal.Week, y = Current.Year.Cumulative.Production, color = factor(Fiscal.Year))) +
  geom_line() +
  labs(title = "Current Year Cumulative Production by Fiscal Year", x = "Fiscal Week", y = "Current.Year.Cumulative.Production")

Compare Difference From Same Week Last Year with Cumulative Difference
ggplot(df, aes(x = Difference.From.Same.Week.Last.Year, y = Cumulative.Difference, color = factor(Fiscal.Year))) +
  geom_point() +  # points rather than a line, because the x-axis is not time
  labs(title = "Weekly Difference vs. Cumulative Difference", x = "Difference From Same Week Last Year", y = "Cumulative Difference")

Compare the current year’s production to the previous year’s production
ggplot(df, aes(x = Previous.Year.Production, y = Current.Year.Production,color=factor(Fiscal.Year))) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Comparison of Current Year's Production to Previous Year's Production", x = "Previous Year Production", y = "Current Year Production")
## `geom_smooth()` using formula = 'y ~ x'

3.6 Creating Histogram For Current Year Of Production

ggplot(df, aes(x = Current.Year.Production)) +
  geom_histogram(bins = 15, color = "yellow")  # qplot() is deprecated since ggplot2 3.4.0

3.7 Creating Histogram For Previous Year Of Production

ggplot(df, aes(x = Previous.Year.Production)) +
  geom_histogram(bins = 15, color = "pink")  # qplot() is deprecated since ggplot2 3.4.0

(4) TIME SERIES ANALYSIS

4.1 Create a time series object using the Current.Year.Production variable

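df_sorted <- df[order(df$Fiscal.Year, df$Fiscal.Week), ]  # the CSV is newest-first; ts() expects oldest-first
oil_ts <- ts(df_sorted$Current.Year.Production, start = c(df_sorted$Fiscal.Year[1], df_sorted$Fiscal.Week[1]), frequency = 52)  # no 'end', so ts() does not recycle values to fill a longer span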
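
ts() does not check that the number of values matches the span from start to end; if the span is longer than the data, the values are recycled to fill it. It also takes the values in the order given, so a newest-first table produces a reversed series. A small sketch with synthetic numbers (the data frame `toy` is purely illustrative):

```r
# Ten synthetic weekly values, stored newest first like the CSV
toy <- data.frame(year = 2020, week = 10:1, value = seq(100, 10, by = -10))

# Passing an 'end' beyond the data makes ts() recycle the values
recycled <- ts(toy$value, start = c(2020, 1), end = c(2020, 20), frequency = 52)
print(length(recycled))

# Sorting first and omitting 'end' keeps one value per observed week
toy_sorted <- toy[order(toy$year, toy$week), ]
clean <- ts(toy_sorted$value, start = c(2020, 1), frequency = 52)
print(length(clean))
```

Here `recycled` holds 20 values although only 10 were supplied, while `clean` holds the 10 observations in chronological order.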

4.2 Plot the time series object

plot(oil_ts, main = "Weekly Oil Production Time Series", xlab = "Fiscal Year", ylab = "Production (barrels)")

4.3 Decompose the time series object into its components

oil_decomp <- decompose(oil_ts)

4.4 Visualize the trend, seasonal, and random components of the time series

autoplot(oil_decomp) +
  labs(title = "Decomposition of Current Weekly Oil Production Time Series")

From the exploratory data analysis and the time series decomposition, oil production appears to trend upward over the observed period. The decomposition also suggests a seasonal component, with production peaking in the summer months and decreasing in the winter months. With 114 usable weekly rows after cleaning, the seasonal estimate rests on little more than two yearly cycles, so it should be read as indicative.

(5) LINEAR REGRESSION MODEL

The provided dataset does not have any categorical variables, so we cannot perform a classification analysis. Instead, we will perform a linear regression analysis. First, we need to split our dataset into training and testing sets. We will use 80% of the data for training and 20% for testing.

5.1 Split the dataset

set.seed(42)
train_index <- sample(1:nrow(df), 0.8*nrow(df))

Train Data

train_data <- df[train_index,]
head(train_data)
##     Fiscal.Year Fiscal.Week Current.Year.Production Previous.Year.Production
## 50         2022          20              2519580000               2471658000
## 153        2020          21              2656290000               2640414000
## 66         2022           4              2740962000               2512230000
## 26         2022          44              2511054000               2873850000
## 75         2021          47              2814168000               2693334000
## 152        2020          22              2700684000               2664228000
##     Difference.From.Same.Week.Last.Year Current.Year.Cumulative.Production
## 50                             47922000                        52795932000
## 153                            15876000                        55975542000
## 66                            228732000                        11045580000
## 26                           -362796000                       115000000000
## 75                            120834000                       119000000000
## 152                            36456000                        58676226000
##     Cumulative.Difference
## 50             5038278000
## 153             153468000
## 66              959616000
## 26             3799068000
## 75             3529470000
## 152             189924000

Test Data

test_data <- df[-train_index,]
head(test_data)
##    Fiscal.Year Fiscal.Week Current.Year.Production Previous.Year.Production
## 1         2023          16              2393748000               2417856000
## 7         2023          10              2426970000               2635122000
## 11        2023           6              2570148000               2722146000
## 14        2023           3              2625420000               2832396000
## 24        2022          46              2748312000               2743902000
## 45        2022          25              2539278000               2533104000
##    Difference.From.Same.Week.Last.Year Current.Year.Cumulative.Production
## 1                            -24108000                        39649722000
## 7                           -208152000                        25152876000
## 11                          -151998000                        15375318000
## 14                          -206976000                         7609896000
## 24                             4410000                       120000000000
## 45                             6174000                        65715174000
##    Cumulative.Difference
## 1            -3024672000
## 7            -2091222000
## 11           -1186584000
## 14            -694722000
## 24            3713220000
## 45            5862654000

5.2 Create a linear regression model using the lm() function in R. We will use the Current.Year.Production variable as our response variable and the Previous.Year.Production, Fiscal.Year, and Difference.From.Same.Week.Last.Year variables as our predictor variables.

lm_model <- lm(Current.Year.Production ~ Previous.Year.Production + Fiscal.Year  + Difference.From.Same.Week.Last.Year, data = train_data)

5.3 We can now use our trained model to make predictions on our testing dataset and calculate the accuracy of our predictions using the mean squared error.

predictions <- predict(lm_model, newdata = test_data)
mse <- mean((test_data$Current.Year.Production - predictions)^2)
mse
## [1] 5.338338e-13

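The MSE is effectively zero, but this does not demonstrate predictive skill. By definition, Difference.From.Same.Week.Last.Year equals Current.Year.Production minus Previous.Year.Production, so a model that includes both Previous.Year.Production and the difference reproduces the response exactly, and the remaining error is floating-point rounding.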
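
When one predictor is defined as the response minus another predictor, the regression can recover that identity with coefficients of 1. A minimal sketch on synthetic data (all names and numbers below are illustrative, not taken from the CSV) contrasts such a model with one that leaves the difference out:

```r
set.seed(1)
n <- 100
previous   <- runif(n, 2.2e9, 2.9e9)   # synthetic previous-year production
difference <- rnorm(n, 1e8, 3e8)       # synthetic year-over-year difference
current    <- previous + difference    # identity from the variable definitions
sim <- data.frame(current, previous, difference)

# Model that includes the difference: it recovers the identity
leaky <- lm(current ~ previous + difference, data = sim)
print(coef(leaky))

# Model without the difference: a more honest test of predictive value
honest <- lm(current ~ previous, data = sim)

# Root mean squared error of the fitted values, in barrels
rmse <- function(m) sqrt(mean(residuals(m)^2))
print(c(leaky = rmse(leaky), honest = rmse(honest)))
```

In this simulation the first model's error is at rounding level, while the second model's error is on the order of the spread of the difference. The same comparison on the real training and test sets would show how much Previous.Year.Production alone explains.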

5.4 Visualize the relationship between the two variables using a scatter plot

ggplot(df, aes(x = Previous.Year.Production, y = Current.Year.Production)) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(title = "Scatter Plot of Current vs. Previous Year's Gasoline Production",
       x = "Previous Year's Production (barrels)", y = "Current Year's Production (barrels)")
## `geom_smooth()` using formula = 'y ~ x'

(6) CLUSTERING ANALYSIS USING K-MEANS ALGORITHM.

6.1 Visualize the three variables of interest in 3D

library("plot3D")
scatter3D(df$Fiscal.Year ,df$Current.Year.Production ,df$Previous.Year.Production)

dt <- cbind(df$Fiscal.Year ,df$Current.Year.Production ,df$Previous.Year.Production, col = NULL)
head(dt)
##      [,1]       [,2]       [,3]
## [1,] 2023 2393748000 2417856000
## [2,] 2023 2367876000 2324364000
## [3,] 2023 2222052000 2402568000
## [4,] 2023 2209116000 2858856000
## [5,] 2023 2742138000 2641884000
## [6,] 2023 2561916000 2784768000
kmeans(dt,centers=3)
## K-means clustering with 3 clusters of sizes 44, 18, 52
## 
## Cluster means:
##       [,1]       [,2]       [,3]
## 1 2022.227 2525927727 2688957409
## 2 2021.444 2565525667 2038710333
## 3 2021.096 2736009231 2559111692
## 
## Clustering vector:
##   [1] 1 2 1 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 1 1 3 1 1 1 1 1 1 3 1 1 1 1
##  [38] 1 1 1 1 1 1 1 3 3 3 2 1 3 2 2 1 2 2 3 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
##  [75] 3 3 3 3 3 3 3 3 3 3 2 2 2 2 2 2 2 2 2 2 2 1 1 3 3 3 3 3 1 3 3 3 3 3 3 3 3
## [112] 3 3 3
## 
## Within cluster sum of squares by cluster:
## [1] 1.216676e+18 1.660199e+18 1.119528e+18
##  (between_SS / total_SS =  62.4 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"
set.seed(42)  # make the cluster assignment reproducible
km <- kmeans(dt, centers = 3)  # keep the result under its own name instead of overwriting kmeans()
df$cm <- km$cluster  # store cluster membership on the data frame, where scatter3D() reads it
scatter3D(df$Fiscal.Year, df$Current.Year.Production, df$Previous.Year.Production, colvar = df$cm)

We will use the kmeans() function in R to cluster the data into 3 clusters based on the Fiscal.Year, Current.Year.Production, and Previous.Year.Production variables.

set.seed(42)
oil_cluster <- df[,c("Fiscal.Year","Current.Year.Production", "Previous.Year.Production")]
kmeans_model <- kmeans(oil_cluster, centers = 3)
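
k-means uses Euclidean distance, so columns with large magnitudes dominate the result: the production columns are in the billions, while Fiscal.Year varies by at most 3. One common option, not used in the analysis above, is to standardise the columns first. A sketch on synthetic stand-in data (the data frame `toy` and the choice of nstart = 25 are assumptions for illustration):

```r
set.seed(42)
# Synthetic stand-in for the three clustering columns
toy <- data.frame(
  Fiscal.Year              = sample(2020:2023, 60, replace = TRUE),
  Current.Year.Production  = runif(60, 2.2e9, 2.9e9),
  Previous.Year.Production = runif(60, 1.5e9, 2.9e9)
)

# Standardise each column to mean 0 and standard deviation 1
toy_scaled <- scale(toy)

# Several random starts reduce the chance of a poor local optimum
km_scaled <- kmeans(toy_scaled, centers = 3, nstart = 25)
print(table(km_scaled$cluster))
```

Whether scaling is appropriate depends on the question: if Fiscal.Year is not meant to influence the grouping, dropping it from the input is the simpler choice.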

6.2 Visualize the clusters using a scatterplot and color the points according to their assigned cluster.

ggplot(oil_cluster, aes(x = Current.Year.Production, y = Previous.Year.Production, color = factor(kmeans_model$cluster))) + 
  geom_point() + 
  labs(title = "K-Means Clustering of Oil Production Data", x = "Current Year Production", y = "Previous Year Production")

(7) DISCUSSION

Based on the time series analysis of the Current.Year.Production variable, oil production appears to trend upward over the observed period. The decomposition also suggests a seasonal component, with production peaking in the summer months and decreasing in the winter months. If this pattern holds up on more years of data, companies in the oil industry could use it to align their production schedules with the seasonal cycle.
The linear regression model, which uses Previous.Year.Production, Fiscal.Year, and Difference.From.Same.Week.Last.Year as predictor variables, has a mean squared error (MSE) of 5.338338e-13. This near-zero error follows from the definitions of the variables: current production equals previous production plus the difference, so the model reproduces an identity and the result says little about forecasting. The scatter plot in section 5.4 shows how closely current production follows previous-year production on its own. A model intended to predict future production would need to rely on predictors that are known before the week's production is observed, such as Previous.Year.Production.
The k-means run printed in section 6.1 groups the 114 cleaned observations into three clusters of sizes 44, 18, and 52, which account for 62.4% of the total sum of squares. The cluster means differ mainly in Current.Year.Production and Previous.Year.Production, while the mean Fiscal.Year lies between 2021 and 2022 in every cluster. This grouping can help identify weeks with unusually low previous-year production or unusually high current production, and can support decisions about resource allocation and investment.

(8) CONCLUSION

The data analysis reveals useful information about the weekly oil production data. The time series analysis points to an upward trend and a seasonal pattern in production, which can help inform production scheduling. The linear regression model fits the data almost exactly, but only because its predictors determine the response by definition, so it should be refitted without Difference.From.Same.Week.Last.Year before it is used for prediction. The clustering analysis separates the observations into three groups that differ mainly in production level, which can inform decisions about resource allocation and investment. Overall, the results of the data analysis can be used to inform strategic decision-making in the oil industry.