Fuel Prices Analysis

Author

Vidhi Bhanushali

Fuel Pricing Analysis

The primary motivation behind analyzing U.S. fuel prices since 1995 is to uncover long-term trends, understand the factors driving price fluctuations, and explore the impact of major economic or geopolitical events on these prices. By addressing questions such as the existence of cyclical or seasonal patterns and the influence of external events, the analysis builds on previous work with fleet cost data, serving as the next step in this project.

Policymakers can use insights to stabilize prices, businesses can optimize operations, and consumers, including car owners, can better budget or consider energy-efficient options like electric vehicles or public transport during high-price periods.

1. Prepare

Install Packages & Load Libraries

library("readr")
library("lubridate")


Attaching package: 'lubridate'

The following objects are masked from 'package:base':

    date, intersect, setdiff, union

library("skimr")
library("dplyr")


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

library("tidyr")
library("ggplot2")
library("tidyverse")

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ forcats 1.0.0     ✔ stringr 1.5.1
✔ purrr   1.0.2     ✔ tibble  3.2.1

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Read Data

data <- read.csv("~/Regulated Conventional.csv", stringsAsFactors = FALSE)

2. Wrangle

To prepare the data for analysis, I will fill in missing values to ensure completeness, verify consistency based on prior team efforts to avoid errors, and check that dates are correctly formatted and sorted for accurate trend analysis.

Identifying missing values in the dataset.
Removing rows or columns with excessive missing data.
Ensuring the Date column is in Date format for time-series analysis.
Aggregating the data by time periods (e.g., monthly averages) or regions.
Transforming raw datasets into structured formats by ensuring date conversion, handling missing values, grouping by time periods or regions, and visualizing trends for meaningful insights.
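As a minimal sketch of the first steps listed above, the block below counts missing prices per region and drops rows whose date could not be parsed. It uses a small made-up table, `toy_prices`, in the same long layout (Date, Region, Price); the table and its values are hypothetical and only stand in for the real data.

```r
library(dplyr)

# Hypothetical toy data in the long layout used later (Date, Region, Price)
toy_prices <- tibble(
  Date   = as.Date(c("2008-01-07", "2008-01-07", "2008-01-14", "2008-01-14", NA)),
  Region = c("Midwest", "West_Coast", "Midwest", "West_Coast", "Midwest"),
  Price  = c(3.05, 3.30, NA, 3.28, 3.10)
)

# Count rows, missing prices, and the share of missing prices per region
missing_by_region <- toy_prices %>%
  group_by(Region) %>%
  summarise(
    n_rows        = n(),
    n_missing     = sum(is.na(Price)),
    share_missing = mean(is.na(Price)),
    .groups = "drop"
  )

# Drop rows whose date failed to parse, then sort chronologically within region
clean_prices <- toy_prices %>%
  filter(!is.na(Date)) %>%
  arrange(Region, Date)

print(missing_by_region)
```

Whether a region with many missing prices should be dropped or kept with `na.rm = TRUE` in later summaries is a judgment call that depends on how large `share_missing` turns out to be.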

# Step 1: Select the first 10 columns out of 22
subset_data <- data[, 1:10]

# Step 2: Rename the 10 retained columns
# (renaming before subsetting would assign 10 names to 22 columns and leave the rest unnamed)
colnames(subset_data) <- c("Date", "US_Region", "East_Coast", "New_England", "Central_Atlantic", "Lower_Atlantic", "Midwest", "Gulf_Coast", "Rocky_Mountain", "West_Coast")

# Step 3: Reshape the data to long format
long_format <- subset_data %>% pivot_longer(
    cols = -Date,            # Exclude the 'Date' column from reshaping
    names_to = "Region",     # Create a new column 'Region' for column names
    values_to = "Price"      # Create a new column 'Price' for values
  )

# Step 4: Clean up the Region column
long_format$Region <- str_remove(long_format$Region, "\\s\\(Dollars per Gallon\\)") %>% str_trim()

# Step 5: Convert the Date column to a proper date format
long_format$Date <- as.Date(long_format$Date, format = "%b %d, %Y")

# Step 6: Extract Year, Month, and Day into separate columns
long_format <- long_format %>%
  mutate(
    Year = year(Date),   # Extract the year
    Month = month(Date), # Extract the month
    Day = day(Date)      # Extract the day
  )

# Step 7: Write the transformed dataset to a new CSV file
write.csv(long_format, "Transformed_Dataset.csv", row.names = FALSE)

# Step 8: Display the first few rows of the transformed dataset
head(long_format)

# A tibble: 6 × 6
  Date       Region           Price  Year Month   Day
  <date>     <chr>            <dbl> <dbl> <dbl> <int>
1 1990-08-20 US_Region         1.19  1990     8    20
2 1990-08-20 East_Coast       NA     1990     8    20
3 1990-08-20 New_England      NA     1990     8    20
4 1990-08-20 Central_Atlantic NA     1990     8    20
5 1990-08-20 Lower_Atlantic   NA     1990     8    20
6 1990-08-20 Midwest          NA     1990     8    20

# View(long_format)  # skipped: the dataset is too large to open in the viewer

# long_format is the name of the transformed dataset used in the rest of the analysis

3. Metrics

To summarize the data, I used the summarise function in R to calculate descriptive metrics for each region: mean, median, standard deviation, minimum, maximum, and range. These quantify central tendency and spread. The standard deviation (and its square, the variance) shows which regions have the widest spread of prices, while the coefficient of variation (standard deviation divided by the mean) allows relative variability to be compared across regions with different price levels. The range flags regions with extreme differences between their lowest and highest prices and can point to potential outliers. Together these metrics give a compact summary of the data and a foundation for the statistical and visual analyses that follow.
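The variance and coefficient of variation mentioned above can be derived in the same summarise call. The sketch below shows one way to do this on a small made-up table; `toy_prices` and its values are hypothetical, and the column names `Variance` and `CV` are my own labels.

```r
library(dplyr)

# Hypothetical toy data: two regions with different price levels
toy_prices <- tibble(
  Region = rep(c("Gulf_Coast", "West_Coast"), each = 4),
  Price  = c(2.0, 2.2, 2.4, NA, 3.0, 3.5, 4.0, 4.5)
)

# Variance, standard deviation, and coefficient of variation per region
variability <- toy_prices %>%
  group_by(Region) %>%
  summarise(
    Mean_Price = mean(Price, na.rm = TRUE),
    Variance   = var(Price, na.rm = TRUE),
    Std_Dev    = sd(Price, na.rm = TRUE),
    CV         = Std_Dev / Mean_Price,  # relative variability, unit-free
    .groups = "drop"
  )

print(variability)
```

Because the coefficient of variation is divided by the mean, it stays comparable between a low-price region and a high-price region, which the raw standard deviation does not.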

Percent changes (year-over-year and month-over-month) will help identify periods of significant volatility or growth. During trend analysis, regression metrics such as R-squared and coefficients will quantify the relationship between fuel prices and predictors such as crude oil prices or inflation. Finally, seasonal indices and pre/post-event comparisons should reveal cyclical patterns and the impact of major events; how well this will work is not yet clear, and it is left for a later part of the project.
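A possible way to compute the month-over-month and year-over-year percent changes is sketched below. It first averages prices by calendar month and then joins each month to the previous month and to the same month one year earlier, so gaps in the series produce NA rather than a misleading comparison. The table `toy_prices`, its values, and the helper column names are hypothetical.

```r
library(dplyr)
library(lubridate)

# Hypothetical toy data for one region
toy_prices <- tibble(
  Date   = as.Date(c("2022-01-03", "2022-01-10", "2022-02-07",
                     "2023-01-02", "2023-02-06")),
  Region = "Midwest",
  Price  = c(3.00, 3.20, 3.41, 3.10, 3.72)
)

# Monthly average price per region
monthly <- toy_prices %>%
  mutate(Month_Start = floor_date(Date, "month")) %>%
  group_by(Region, Month_Start) %>%
  summarise(Avg_Price = mean(Price, na.rm = TRUE), .groups = "drop")

# Shift each month forward so it lines up with the month it should be compared to
previous_month <- monthly %>%
  transmute(Region, Month_Start = Month_Start %m+% months(1), Prev_Month_Avg = Avg_Price)
previous_year <- monthly %>%
  transmute(Region, Month_Start = Month_Start %m+% years(1), Prev_Year_Avg = Avg_Price)

# Percent changes relative to the previous month and the same month last year
changes <- monthly %>%
  left_join(previous_month, by = c("Region", "Month_Start")) %>%
  left_join(previous_year,  by = c("Region", "Month_Start")) %>%
  mutate(
    MoM_Pct = 100 * (Avg_Price / Prev_Month_Avg - 1),
    YoY_Pct = 100 * (Avg_Price / Prev_Year_Avg - 1)
  ) %>%
  arrange(Region, Month_Start)

print(changes)
```

A simpler `lag()` on the sorted series would also work when no month is missing; the join is used here only because it does not assume a gap-free series.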

# Overall summary of the transformed dataset
summary(long_format)

      Date               Region              Price            Year     
 Min.   :1990-08-20   Length:16119       Min.   :0.853   Min.   :1990  
 1st Qu.:1999-03-15   Class :character   1st Qu.:1.338   1st Qu.:1999  
 Median :2007-10-11   Mode  :character   Median :2.268   Median :2007  
 Mean   :2007-10-11                      Mean   :2.276   Mean   :2007  
 3rd Qu.:2016-05-09                      3rd Qu.:3.021   3rd Qu.:2016  
 Max.   :2024-12-02                      Max.   :5.419   Max.   :2024  
 NA's   :9                               NA's   :881     NA's   :9     
     Month             Day       
 Min.   : 1.000   Min.   : 1.00  
 1st Qu.: 4.000   1st Qu.: 8.00  
 Median : 7.000   Median :16.00  
 Mean   : 6.551   Mean   :15.72  
 3rd Qu.:10.000   3rd Qu.:23.00  
 Max.   :12.000   Max.   :31.00  
 NA's   :9        NA's   :9

# Load necessary library
#library(dplyr)

# Step 1: Group the data by Region
region_summary <- long_format %>%
  group_by(Region) %>%
  summarise(
    Mean_Price = mean(Price, na.rm = TRUE),  # Mean price for each region
    Median_Price = median(Price, na.rm = TRUE),  # Median price for each region
    Std_Dev_Price = sd(Price, na.rm = TRUE),  # Standard deviation of price
    Min_Price = min(Price, na.rm = TRUE),  # Minimum price
    Max_Price = max(Price, na.rm = TRUE),  # Maximum price
    Price_Range = Max_Price - Min_Price,  # Price range (difference between max and min)
    Price_Change = Max_Price - Min_Price  # Same value as Price_Range (max minus min), not a first-to-last change
  )

# View the summary
print(region_summary)

# A tibble: 9 × 8
  Region   Mean_Price Median_Price Std_Dev_Price Min_Price Max_Price Price_Range
  <chr>         <dbl>        <dbl>         <dbl>     <dbl>     <dbl>       <dbl>
1 Central…       2.39         2.43         0.969     0.913      4.98        4.06
2 East_Co…       2.23         2.24         0.928     0.868      4.76        3.90
3 Gulf_Co…       2.12         2.10         0.859     0.872      4.64        3.76
4 Lower_A…       2.22         2.22         0.906     0.853      4.69        3.84
5 Midwest        2.20         2.19         0.913     0.853      4.88        4.02
6 New_Eng…       2.35         2.33         0.945     0.936      5.08        4.15
7 Rocky_M…       2.29         2.27         0.927     0.953      5.00        4.05
8 US_Regi…       2.18         2.17         0.925     0.885      4.84        3.96
9 West_Co…       2.51         2.55         1.05      1.02       5.42        4.39
# ℹ 1 more variable: Price_Change <dbl>

4. Analyze

Event analyzing

1. 2008 Year analysis

#Creating Average for 2008 as it was crash year

This section analyzes gas prices in 2008, a year marked by the financial crash:

The code filters the data for 2008 and calculates the average gas price by region.
A bar plot compares average prices across regions.
A time series plot shows how prices moved through the year.
A boxplot shows the distribution of prices within each region and highlights outliers.

Together, these steps show regional price disparities, trends, and anomalies during a volatile period, and the visualizations give a concise summary of how gas prices behaved in 2008 to inform further analysis.

#Average for 2008 



# Load required libraries
#library(tidyverse)
#library(lubridate)

# Read the transformed CSV file
long_format <- read.csv("Transformed_Dataset.csv")

# Filter the dataset for the year 2008
data_2008 <- long_format %>%
  filter(Year == 2008)

# Summarize average price by Region for 2008
avg_price_2008 <- data_2008 %>%
  group_by(Region) %>%
  summarise(Average_Price = mean(Price, na.rm = TRUE)) %>%
  arrange(desc(Average_Price))

# Print the summarized data for inspection
print(avg_price_2008)

# A tibble: 9 × 2
  Region           Average_Price
  <chr>                    <dbl>
1 West_Coast                3.37
2 Central_Atlantic          3.33
3 New_England               3.30
4 East_Coast                3.26
5 Lower_Atlantic            3.24
6 Rocky_Mountain            3.21
7 US_Region                 3.21
8 Midwest                   3.17
9 Gulf_Coast                3.14

# Plot the average price for each region in 2008
ggplot(avg_price_2008, aes(x = reorder(Region, -Average_Price), y = Average_Price, fill = Region)) +
  geom_bar(stat = "identity") +
  labs(title = "Average Gas Prices by Region in 2008", 
       x = "Region", y = "Average Price (Dollars per Gallon)") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Plot a time series of gas prices for 2008 by Region
# read.csv() returns Date as text, so it is converted back to Date here; otherwise ggplot2
# treats every date as its own category and geom_line() has no points to connect
ggplot(data_2008, aes(x = as.Date(Date), y = Price, color = Region)) +
  geom_line() +
  labs(title = "Gas Prices in 2008 by Region",
       x = "Date", y = "Price (Dollars per Gallon)") +
  theme_minimal() +
  theme(legend.position = "bottom")

# Plot a boxplot to show price distribution by Region for 2008
ggplot(data_2008, aes(x = Region, y = Price, fill = Region)) +
  geom_boxplot() +
  labs(title = "Price Distribution by Region in 2008",
       x = "Region", y = "Price (Dollars per Gallon)") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Visual aids like line charts will illustrate long-term trends in fuel prices, while box plots will highlight price variations across years or seasons. Scatter plots can be used to depict correlations between fuel prices and factors like crude oil or inflation rates. A heat map could show regional or seasonal differences if data allows. Additionally, an annotated timeline of major economic or geopolitical events could help contextualize significant changes in price trends.
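The heat map idea could look roughly like the sketch below, which averages prices by region and calendar month and draws one tile per combination. The data here is randomly generated placeholder data (`toy_prices`), so the colors carry no meaning; the real `long_format` table would be substituted in practice.

```r
library(dplyr)
library(ggplot2)
library(lubridate)

set.seed(42)  # make the placeholder data reproducible

# Hypothetical weekly placeholder data for three regions in 2008
dates   <- seq(as.Date("2008-01-07"), as.Date("2008-12-29"), by = "week")
regions <- c("East_Coast", "Midwest", "West_Coast")
toy_prices <- tibble(
  Date   = rep(dates, times = length(regions)),
  Region = rep(regions, each = length(dates)),
  Price  = round(runif(length(dates) * length(regions), min = 2, max = 4), 2)
)

# Average price per region and calendar month
heat_data <- toy_prices %>%
  mutate(Month = month(Date, label = TRUE, abbr = TRUE)) %>%
  group_by(Region, Month) %>%
  summarise(Avg_Price = mean(Price, na.rm = TRUE), .groups = "drop")

# One tile per region-month, colored by average price
heat_map <- ggplot(heat_data, aes(x = Month, y = Region, fill = Avg_Price)) +
  geom_tile(color = "white") +
  scale_fill_viridis_c(name = "Avg price ($/gal)") +
  labs(title = "Average Price by Region and Month (placeholder data)",
       x = "Month", y = "Region") +
  theme_minimal()

heat_map
```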

#weekly data reference for year 2008


# Load required libraries
#library(tidyverse)
#library(lubridate)

# Read the transformed CSV file
long_format <- read.csv("Transformed_Dataset.csv")

# Filter the data for the year 2008
data_2008 <- long_format %>%
  filter(Year == 2008)

# Step 1: Calculate weekly averages
data_2008_weekly <- data_2008 %>%
  mutate(Week = week(Date),  # Extract week number from the Date column
         YearWeek = paste(Year, Week, sep = "-")) %>%  # Create a unique Year-Week identifier
  group_by(YearWeek, Region) %>%
  summarise(Weekly_Avg_Price = mean(Price, na.rm = TRUE),
            .groups = "drop")  # Drop the grouping explicitly instead of relying on the default

# Step 2: Plot histogram with density line for each region
# The histogram is drawn on the density scale so that it shares a y-axis with the density curve
ggplot(data_2008_weekly, aes(x = Weekly_Avg_Price, fill = Region)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 0.05, alpha = 0.6,
                 position = "identity", color = "black") + # Histogram
  geom_density(aes(color = Region), fill = NA, linewidth = 1) +  # Overlay density line (linewidth replaces the deprecated size)
  labs(title = "Distribution of Weekly Average Gas Prices by Region in 2008",
       x = "Weekly Average Price (Dollars per Gallon)", 
       y = "Density") +
  theme_minimal() +
  theme(legend.position = "bottom")

The second code block takes a weekly approach, calculating the weekly average gas price for 2008 by region.
The data is aggregated by region and week, using a YearWeek identifier to track weekly fluctuations in prices.
A histogram with an overlaid density curve visualizes the distribution of weekly average prices for each region in 2008.
This gives a more detailed look at weekly price trends and their distribution within each region, at a finer temporal granularity than the first code block.
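An alternative to a text identifier such as YearWeek is to key each week by its starting date, which sorts chronologically and can be used directly on a date axis. The sketch below assumes weeks start on Monday and uses a small made-up table; `toy_prices` and the column name `Week_Start` are hypothetical.

```r
library(dplyr)
library(lubridate)

# Hypothetical toy data: two observations in one week, one in the next
toy_prices <- tibble(
  Date   = as.Date(c("2008-01-07", "2008-01-09", "2008-01-14")),
  Region = "Midwest",
  Price  = c(3.00, 3.10, 3.20)
)

# Key each observation by the Monday that starts its week, then average
weekly <- toy_prices %>%
  mutate(Week_Start = floor_date(Date, unit = "week", week_start = 1)) %>%
  group_by(Region, Week_Start) %>%
  summarise(Weekly_Avg_Price = mean(Price, na.rm = TRUE), .groups = "drop")

print(weekly)
```

A pasted label like "2008-10" sorts before "2008-2" as text, whereas `Week_Start` is a Date and keeps the weeks in calendar order.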

2. East Coast, West Coast & Midwest Comparison

This analysis focuses on fuel price trends for the East Coast, West Coast, and Midwest regions from 2014 to 2024, specifically on December 31st each year. We aim to explore how fuel prices have changed across regions and identify any patterns. After filtering the data for the relevant years and regions, we visualize the trends using a time series plot. However, we acknowledge that the current plot may not fully capture the data’s nuances, and we will reconsider the visualization approach to better represent the trends.

Data Filtering: The dataset is filtered for the regions East Coast, West Coast, and Midwest between 2014 and 2024 and only includes data for December 31st.
Plot Type: A line plot is used with points indicating fuel prices for each year in the selected regions.
Region-based Visualization: Different colors represent the three regions, and the x-axis shows years (2014-2024), with a clear distinction for each region’s fuel price trend.
X-axis Formatting: The x-axis is set to display yearly intervals from 2014 to 2024.

#31 Dec East coast, west coast, Midwest

# Load required libraries
#library(tidyverse)
#library(lubridate)

# Read the transformed CSV file
long_format <- read.csv("Transformed_Dataset.csv")

# Step 1: Filter data for December 31st from 2014 to 2024 for the specified regions
filtered_data <- long_format %>%
  filter(Region %in% c("East_Coast", "West_Coast", "Midwest"),  # Select specific regions
         month(Date) == 12 & day(Date) == 31,  # Filter for December 31st
         Year >= 2014 & Year <= 2024)  # Limit years to 2014-2024

# Step 2: Plot prices on December 31st for the selected regions over the years
ggplot(filtered_data, aes(x = Year, y = Price, color = Region, group = Region)) +
  geom_line(linewidth = 1) +  # Line plot (linewidth replaces the deprecated size argument)
  geom_point(size = 3) +  # Add points to indicate data points
  labs(title = "Fuel Prices on December 31st (2014-2024)",
       x = "Year", 
       y = "Price (Dollars per Gallon)",
       color = "Region") +
  theme_minimal() +
  scale_x_continuous(breaks = seq(2014, 2024, 1)) +  # Ensure all years appear on the x-axis
  theme(legend.position = "bottom")

`geom_line()`: Each group consists of only one observation.
ℹ Do you need to adjust the group aesthetic?

Conclusion:

The line plot, while useful for showing trends over time, is not the ideal choice here. The geom_line() message above indicates that each region has only one matching observation, so there is nothing to connect: the source series appears to be reported weekly, which means an exact December 31st date is present only in years where it coincides with a reporting date. Even where several year-end values are available, they are discrete snapshots taken a year apart. Connecting them with lines suggests a smooth transition between years that did not occur, because prices fluctuate with market conditions throughout the year, and this could lead to misleading interpretations.

To better represent the data and provide a more accurate visualization, alternative approaches like a histogram combined with a density line can be more effective. These visualizations focus on the distribution of prices at specific intervals, such as December 31st of each year, without implying continuity between years. Additionally, we applied a filter to the dataset to narrow down the analysis to the years 2014 through 2024 and to the specific regions of East Coast, West Coast, and Midwest, ensuring that the visualization reflects the most relevant data. This filtering ensures we focus on a defined timeframe and region for a clearer understanding of price trends. Moving forward, we will use these new visualizations to gain a more accurate view of fuel price changes during this period.
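Since an exact December 31st observation may not exist in every year, one option is to take the last reported price of each year as the year-end value. The sketch below illustrates this on a small made-up table; `toy_prices` and its values are hypothetical, and using the last observation as a stand-in for December 31st is my own assumption rather than part of the original analysis.

```r
library(dplyr)
library(lubridate)

# Hypothetical toy data: the last weeks of 2017 and 2018 for two regions
toy_prices <- tibble(
  Date   = as.Date(c("2017-12-18", "2017-12-25", "2018-12-24", "2018-12-31",
                     "2017-12-18", "2017-12-25", "2018-12-24", "2018-12-31")),
  Region = rep(c("East_Coast", "West_Coast"), each = 4),
  Price  = c(2.45, 2.44, 2.28, 2.25, 3.00, 2.99, 3.20, 3.18)
)

# Keep the latest non-missing observation per region and year
year_end <- toy_prices %>%
  filter(!is.na(Price)) %>%
  mutate(Year = year(Date)) %>%
  group_by(Region, Year) %>%
  slice_max(Date, n = 1, with_ties = FALSE) %>%
  ungroup()

print(year_end)
```

This yields one row per region and year regardless of which weekday December 31st falls on, so every year in the 2014 to 2024 window can appear in the comparison.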

#justify plot data
# Load required libraries
#library(tidyverse)
#library(lubridate)

# Create sample filtered data (replace with your dataset)
filtered_data <- data.frame(
  Year = rep(2014:2024, 3),
  Region = rep(c("East_Coast", "West_Coast", "Midwest"), each = 11),
  Price = runif(33, min = 2, max = 4)  # Random prices for illustration
)

# Generate the histogram and overlay line plot
ggplot(filtered_data, aes(x = Year, fill = Region)) +
  # Histogram remains unscaled
  geom_histogram(binwidth = 1, alpha = 0.5, position = "identity") +
  # Line plot is scaled for overlay
  geom_line(aes(y = Price * 20, group = Region, color = Region), linewidth = 1) +
  scale_y_continuous(
    name = "Count (Histogram)",  # Primary y-axis for histogram
    breaks = seq(0, 20, 0.2),  # Set Y-axis breaks at increments of 0.2
    sec.axis = sec_axis(~./20, name = "Price (Dollars per Gallon)")  # Secondary y-axis for line plot
  ) +
  labs(
    title = "Fuel Price Trends and Distribution (2014-2024)",
    x = "Year",
    fill = "Region",
    color = "Region"
  ) +
  theme_minimal() +
  theme(legend.position = "bottom")

The line plot, though useful for showing trends, is not ideal for visualizing fuel prices since it implies continuity between discrete data points. Fuel prices vary annually, and connecting them with lines could create a misleading impression of smooth changes.

A more suitable approach is using a box plot, which better captures the distribution of prices on December 31st for each year. It shows important statistics such as the median, quartiles, and potential outliers. By filtering the data for specific years (2014-2024) and regions, we can present a clearer and more accurate visualization of fuel price fluctuations, avoiding the misleading continuity implied by the line plot.

#Boxplot for 3 data set comparison

# Load required libraries
#library(tidyverse)

# Create sample filtered data (replace with your dataset)
filtered_data <- data.frame(
  Year = rep(2014:2024, 3),
  Region = rep(c("East_Coast", "West_Coast", "Midwest"), each = 11),
  Price = runif(33, min = 2, max = 4)  # Random prices for illustration
)

# Generate the box plot
ggplot(filtered_data, aes(x = Region, y = Price,
                        fill = Region)) +
  geom_boxplot(outlier.color = "red", outlier.shape = 16, alpha = 0.7) +
  labs(
    title = "Fuel Price Distribution by Region (2014-2024)",
    x = "Region",
    y = "Price (Dollars per Gallon)",
    fill = "Region"
  ) +
  theme_minimal() +
  theme(legend.position = "none",  # Remove redundant legend
        axis.text.x = element_text(angle = 45, hjust = 1))  # Rotate x-axis labels

3. December 31 Price Comparison, 2014-2024

Fuel prices fluctuate across regions due to factors like demand, supply chain issues, and economic conditions. Travel patterns, especially during peak seasons, contribute significantly to these variations.

By analyzing maximum fuel prices for each region from 2014 to 2024 on December 31st, we can better understand how these regional dynamics and travel patterns influence fuel price volatility over time.

# Maximum December 31 prices for 10 regions

# Load required libraries
#library(dplyr)

# Create sample filtered data (replace with your dataset)
filtered_data <- data.frame(
  Date = as.Date(paste(rep(2014:2024, 10), "12", "31", sep = "-")),  # Dates for December 31st
  US_Region = rep(c("East_Coast", "New_England", "Central_Atlantic", "Lower_Atlantic", 
                    "Midwest", "Gulf_Coast", "Rocky_Mountain", "West_Coast", "Pacific", "Alaska"), each = 11),
  East_Coast = runif(110, min = 2, max = 4),
  New_England = runif(110, min = 2, max = 4),
  Central_Atlantic = runif(110, min = 2, max = 4),
  Lower_Atlantic = runif(110, min = 2, max = 4),
  Midwest = runif(110, min = 2, max = 4),
  Gulf_Coast = runif(110, min = 2, max = 4),
  Rocky_Mountain = runif(110, min = 2, max = 4),
  West_Coast = runif(110, min = 2, max = 4),
  Pacific = runif(110, min = 2, max = 4),
  Alaska = runif(110, min = 2, max = 4)
)

# Find maximum price for each region and get corresponding dates
max_prices <- filtered_data %>%
  gather(key = "Region", value = "Price", -Date, -US_Region) %>%  # Reshape data to long format
  group_by(Region) %>%
  filter(Price == max(Price)) %>%  # Filter rows with the maximum price for each region
  select(Region, Date, Price)

# View the result
max_prices

# A tibble: 10 × 3
# Groups:   Region [10]
   Region           Date       Price
   <chr>            <date>     <dbl>
 1 East_Coast       2024-12-31  3.96
 2 New_England      2024-12-31  3.97
 3 Central_Atlantic 2019-12-31  3.99
 4 Lower_Atlantic   2020-12-31  4.00
 5 Midwest          2016-12-31  4.00
 6 Gulf_Coast       2023-12-31  3.99
 7 Rocky_Mountain   2014-12-31  4.00
 8 West_Coast       2017-12-31  3.99
 9 Pacific          2016-12-31  4.00
10 Alaska           2020-12-31  3.97

The table shows, for each region column, the year in which the maximum December 31st price occurred between 2014 and 2024, and the peaks fall in different years. Because filtered_data in this block is placeholder data generated with runif(), these peaks illustrate the method only and do not describe actual regional prices (Pacific and Alaska, for instance, are not columns in the transformed dataset). Once the block is run on the real data, differences in peak years between regions can be related to regional economic conditions, travel patterns, and other drivers of price volatility, which can help in anticipating future price movements.

4. Maximum Prices Overall for the Last 10 Years

The analysis of maximum fuel prices over the past decade (2014-2024) across all regions reveals important insights into pricing trends and fluctuations. By identifying the highest recorded prices in each region, we can observe how regional market dynamics, supply and demand, and external factors have influenced fuel costs. This overview of the maximum prices allows us to understand the broader patterns in fuel pricing across various regions, providing a basis for comparing the severity of price increases and fluctuations over the last ten years.
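Applied to the long-format table, finding the highest recorded price per region within a date window could be sketched as follows. The table `toy_prices` and its values are hypothetical and only mimic the Date, Region, Price layout; the window bounds are written out explicitly so they are easy to change.

```r
library(dplyr)

# Hypothetical toy data in the long layout (Date, Region, Price)
toy_prices <- tibble(
  Region = c("Midwest", "Midwest", "Midwest", "West_Coast", "West_Coast"),
  Date   = as.Date(c("2013-06-03", "2022-06-13", "2019-05-06",
                     "2022-06-13", "2015-07-13")),
  Price  = c(4.20, 4.10, 2.80, 5.00, 3.60)
)

# Highest price per region between 2014 and 2024, with the date it occurred
peak_prices <- toy_prices %>%
  filter(!is.na(Price),
         Date >= as.Date("2014-01-01"),
         Date <= as.Date("2024-12-31")) %>%
  group_by(Region) %>%
  slice_max(Price, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  select(Region, Date, Price)

print(peak_prices)
```

In the toy data the 2013 Midwest value is the largest overall but falls outside the window, so it is excluded; `with_ties = FALSE` keeps a single row per region if the same maximum occurs on more than one date.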

# Overall max prices

# Load required libraries
#library(dplyr)

# Create sample filtered data (replace with your dataset)
filtered_data <- data.frame(
  Date = as.Date(paste(rep(2014:2024, 10), sample(1:12, 110, replace = TRUE), sample(1:28, 110, replace = TRUE), sep = "-")),  # Random dates within 2014-2024
  US_Region = rep(c("East_Coast", "New_England", "Central_Atlantic", "Lower_Atlantic", 
                    "Midwest", "Gulf_Coast", "Rocky_Mountain", "West_Coast", "Pacific", "Alaska"), each = 11),
  East_Coast = runif(110, min = 2, max = 4),
  New_England = runif(110, min = 2, max = 4),
  Central_Atlantic = runif(110, min = 2, max = 4),
  Lower_Atlantic = runif(110, min = 2, max = 4),
  Midwest = runif(110, min = 2, max = 4),
  Gulf_Coast = runif(110, min = 2, max = 4),
  Rocky_Mountain = runif(110, min = 2, max = 4),
  West_Coast = runif(110, min = 2, max = 4),
  Pacific = runif(110, min = 2, max = 4),
  Alaska = runif(110, min = 2, max = 4)
)

# Reshape the data from wide to long format
long_data <- filtered_data %>%
  gather(key = "Region", value = "Price", -Date, -US_Region)

# Find the maximum price for each region and corresponding dates
max_prices <- long_data %>%
  group_by(Region) %>%
  filter(Price == max(Price)) %>%  # Filter rows with the maximum price for each region
  select(Region, Date, Price)

# View the result
max_prices

# A tibble: 10 × 3
# Groups:   Region [10]
   Region           Date       Price
   <chr>            <date>     <dbl>
 1 East_Coast       2022-09-21  3.97
 2 New_England      2016-01-13  3.99
 3 Central_Atlantic 2016-06-28  3.98
 4 Lower_Atlantic   2015-02-26  4.00
 5 Midwest          2019-05-05  4.00
 6 Gulf_Coast       2015-12-11  3.95
 7 Rocky_Mountain   2020-03-14  3.95
 8 West_Coast       2021-04-25  3.98
 9 Pacific          2015-10-10  3.96
10 Alaska           2023-06-04  3.94

The table lists the maximum price and its date for each of the 10 region columns between 2014 and 2024. The filtered_data table in this block is placeholder data drawn with runif() between 2 and 4 dollars on random dates, so every peak falls just below $4.00 (for example, East_Coast at $3.97 on 2022-09-21 and New_England at $3.99 on 2016-01-13) and the dates carry no economic meaning. Attributing peaks to supply chain issues, inflation, or the COVID-19 pandemic requires rerunning the block on the actual long_format data. In the real series, price fluctuations are expected to reflect a mix of economic, supply, and geopolitical factors.

5. Fourth of July Holiday

The provided code analyzes fuel prices for specific dates across multiple regions, applying different types of visualizations to display trends. Initially, the code filters data for specific dates such as July 4th, 2024, and July 1st, 2024. Filtering is performed using filter() function from the dplyr package, targeting particular dates and regions. For example, the code checks if data exists for July 4th, 2024, and reshapes it into a long format using gather(). If data exists, a scatter plot is generated using ggplot(), where fuel prices are plotted against regions on the x-axis. Similarly, a scatter plot is also used for July 1st, 2024, with labels showing price values for each region.

Furthermore, the code filters data for January 1st across multiple years from 2014 to 2024, visualizing price changes through a line plot. The geom_line() function is used to connect data points for each region, and labels are added to the plot using geom_text(). This multi-step filtering and visualization process helps in comparing fuel prices on specific dates and identifying trends across different regions. Each plot type—scatter plot for daily prices and line plot for annual trends—helps in understanding different aspects of the data, ensuring clarity and precision in visual analysis.
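When a holiday does not coincide with a reporting date, a possible workaround is to pick the reported date closest to it in each year. The sketch below does this for July 4th on a small made-up table; `toy_prices`, its values, and the helper columns `Holiday` and `Days_From_Holiday` are hypothetical.

```r
library(dplyr)
library(lubridate)

# Hypothetical toy data: weekly observations around July 4th in two years
toy_prices <- tibble(
  Date   = as.Date(c("2023-06-26", "2023-07-03", "2023-07-10",
                     "2024-06-24", "2024-07-01", "2024-07-08")),
  Region = "Midwest",
  Price  = c(3.40, 3.42, 3.45, 3.30, 3.35, 3.38)
)

# For each region and year, keep the observation closest to July 4th
holiday_prices <- toy_prices %>%
  mutate(
    Year              = year(Date),
    Holiday           = make_date(Year, 7, 4),
    Days_From_Holiday = abs(as.integer(Date - Holiday))
  ) %>%
  group_by(Region, Year) %>%
  slice_min(Days_From_Holiday, n = 1, with_ties = FALSE) %>%
  ungroup()

print(holiday_prices)
```

With this toy data the selected date for 2024 is July 1st, which matches the date used for the scatter plot below; changing the arguments of `make_date()` adapts the same approach to other holidays.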

# Load required libraries
#library(dplyr)
#library(ggplot2)

# Create sample filtered data (replace with your actual dataset)
filtered_data <- data.frame(
  Date = as.Date(paste(rep(2014:2024, 10), sample(1:12, 110, replace = TRUE), sample(1:28, 110, replace = TRUE), sep = "-")),  # Random dates
  US_Region = rep(c("East_Coast", "New_England", "Central_Atlantic", "Lower_Atlantic", 
                    "Midwest", "Gulf_Coast", "Rocky_Mountain", "West_Coast", "Pacific", "Alaska"), each = 11),
  East_Coast = runif(110, min = 2, max = 4),
  New_England = runif(110, min = 2, max = 4),
  Central_Atlantic = runif(110, min = 2, max = 4),
  Lower_Atlantic = runif(110, min = 2, max = 4),
  Midwest = runif(110, min = 2, max = 4),
  Gulf_Coast = runif(110, min = 2, max = 4),
  Rocky_Mountain = runif(110, min = 2, max = 4),
  West_Coast = runif(110, min = 2, max = 4),
  Pacific = runif(110, min = 2, max = 4),
  Alaska = runif(110, min = 2, max = 4)
)

# Check if there is data for July 4th, 2024
july_4_data_check <- filtered_data %>%
  filter(Date == "2024-07-04")

# Print out the rows with data for July 4th, 2024 (check if data exists)
print(july_4_data_check)

 [1] Date             US_Region        East_Coast       New_England     
 [5] Central_Atlantic Lower_Atlantic   Midwest          Gulf_Coast      
 [9] Rocky_Mountain   West_Coast       Pacific          Alaska          
<0 rows> (or 0-length row.names)

# If data exists, proceed with reshaping and plotting
if(nrow(july_4_data_check) > 0) {
  # Reshape the data to long format
  july_4_data <- filtered_data %>%
    filter(Date == "2024-07-04") %>%
    gather(key = "Region", value = "Price", -Date, -US_Region)  # Reshape data to long format
  
  # Plotting the prices for July 4th, 2024 across regions
  ggplot(july_4_data, aes(x = Region, y = Price, color = Region)) +
    geom_point(size = 4, alpha = 0.7) +  # Scatter plot points
    labs(
      title = "Fuel Prices on July 4th, 2024 Across Regions",
      x = "Region",
      y = "Price (Dollars per Gallon)",
      color = "Region"
    ) +
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 45, hjust = 1))  # Rotate x-axis labels
} else {
  print("No data found for July 4th, 2024!")
}

[1] "No data found for July 4th, 2024!"

#### ggplot

# Ensure the Date column is properly formatted (if it's not already)
#long_format$Date <- as.Date(long_format$Date)

# Step 1: Filter data for July 1st, 2024 for all regions
july_1_data <- long_format %>%
  filter(Date == "2024-07-01")  # Filter for July 1st, 2024

# Step 2: Scatter plot - Displaying Fuel Prices on July 1st, 2024 for all regions
ggplot(july_1_data, aes(x = Region, y = Price, color = Region)) +
  geom_point(size = 3) +  # Scatter plot points
  geom_text(aes(label = round(Price, 2)), vjust = -0.5, size = 3.5) +  # Label the points with price values
  labs(
    title = "Fuel Prices on July 1st, 2024 Across Regions",
    x = "Region",
    y = "Price (Dollars per Gallon)",
    color = "Region"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))  # Rotate x-axis labels

####### ggplot line visualization

# Ensure the Date column is properly formatted (if it's not already)
long_format$Date <- as.Date(long_format$Date)

# Step 1: Filter data for January 1st from 2014 to 2024 for all regions
january_1_data <- long_format %>%
  filter(format(Date, "%m-%d") == "01-01",  # Keep January 1st observations
         Year >= 2014 & Year <= 2024)        # Limit years to 2014-2024, matching the plot title

# Step 2: Plotting the data - Line plot for January 1st, 2014 to 2024 across all regions
ggplot(january_1_data, aes(x = as.factor(format(Date, "%Y")), y = Price, group = Region, color = Region)) +
  geom_line(linewidth = 1) +  # Line plot connecting the points for each region
  geom_point(size = 3) +  # Add scatter points for each region
  geom_text(aes(label = round(Price, 2)), vjust = -0.5, size = 3.5) +  # Label points with price values
  labs(
    title = "Fuel Prices on January 1st (2014-2024) Across Regions",
    x = "Year",
    y = "Price (Dollars per Gallon)",
    color = "Region"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))  # Rotate x-axis labels

The scatter plots and the line plot, all built with ggplot(), each served a distinct purpose in analyzing fuel price trends across regions.

Scatter plots (geom_point()): These gave a clear view of fuel prices on a specific date, such as July 1st, 2024. (The July 4th, 2024 block found no matching rows in the sample data, so no plot was produced for that date.) Combined with price labels from geom_text(), color by region, and rotated axis labels, the plot makes it easy to compare prices across regions on the same day.

Line plot (geom_line()): For trends over multiple years (e.g., January 1st from 2014 to 2024), the line plot connects the price points for each region over the years, showing the fluctuations in pricing over a longer time frame.

Both approaches were effective in their own right: scatter plots for discrete data on specific days and the line plot for broader trends across multiple years. Together, they give a comprehensive view of fuel price patterns, making it easier to identify anomalies, trends, and differences across regions.

5. Conclusion

In addition to the obvious legal and ethical considerations like data privacy, there are other crucial factors that must be accounted for during data analysis. Focusing solely on a single metric, such as the p-value or statistical significance, may overlook important aspects of the data, such as the quality of the data, sampling methods, and sample size. These factors can heavily influence the results and should not be ignored. For example, biases in data collection or the presence of outliers could skew the analysis, and relying only on one statistical measure might present an incomplete picture of the data. It is essential to consider a broad range of factors to ensure that the findings are both valid and meaningful.

Furthermore, the interpretation of analysis results should be approached with transparency and fairness. This means being clear about how the data was collected, cleaned, and analyzed, and acknowledging any assumptions made during the process. Only then can the results be trusted and used responsibly. Ignoring or oversimplifying the context of the data can lead to misleading conclusions. By ensuring that ethical guidelines are followed, addressing data quality and fairness, and being transparent about the process, we can guarantee that the analysis is responsible, reliable, and valuable for decision-making.