Introduction

In this exercise, we conducted a comprehensive analysis of sales data utilizing two datasets: a smaller dataset of 100 rows and a larger dataset containing 50,000 rows. Our primary objective was to explore sales patterns across various regions and analyze the influence of factors such as sales channels and order priorities on overall sales performance.

We began by visualizing the sales distribution by region, revealing that Europe, both African regions, and Asia ranked as the top four in total sales. This initial exploration laid the groundwork for a deeper analysis of the distinctions in sales between online and offline channels within these regions.

To further our understanding of sales trends, we implemented a sales forecasting exercise that incorporated seasonal trends. This process involved examining historical sales data to identify recurring seasonal patterns and fluctuations over time. Utilizing statistical techniques, we built a forecasting model aimed at predicting future sales based on the identified trends.

Additionally, our analysis included a correlation study to investigate the relationships between various numeric variables within the datasets, such as unit prices, total revenue, and profit margins. This correlation analysis revealed significant associations, providing valuable insights into the factors driving sales performance.

Through the use of visualizations, including box plots and bar charts, we effectively presented our findings, illustrating sales distribution by region, sales channel, and order priority. This thorough examination of the datasets not only deepened our understanding of sales dynamics across different regions and channels but also equipped us with predictive insights to inform strategic decision-making in sales and marketing efforts. Overall, this exercise underscored the importance of data-driven analysis in understanding and forecasting sales trends.

library(tidyverse)
library(openintro)
library(dplyr)
library(tidyr)
library(ggplot2)
library(reshape2)
library(GGally)

The data

The first dataset is a smaller collection comprising 100 rows, while the second is significantly larger, containing 50,000 rows. These datasets include critical variables such as region, sales channel, order priority, unit prices, and total revenue, allowing us to explore sales trends and patterns. By leveraging the richness of this data, we aimed to understand the impact of different factors on sales performance and conduct forecasting exercises to predict future sales trends effectively.

#Setting the file path for larger Sales dataset
file_path <- "C:/Users/aleja/Desktop/50000 Sales Records.csv"

#Loading the CSV file
df1 <- read.csv(file_path)

#First few rows of the dataset
head(df1)
##               Region   Country Item.Type Sales.Channel Order.Priority
## 1 Sub-Saharan Africa   Namibia Household       Offline              M
## 2             Europe   Iceland Baby Food        Online              H
## 3             Europe    Russia      Meat        Online              L
## 4             Europe  Moldova       Meat        Online              L
## 5             Europe     Malta    Cereal        Online              M
## 6               Asia Indonesia      Meat        Online              H
##   Order.Date  Order.ID  Ship.Date Units.Sold Unit.Price Unit.Cost Total.Revenue
## 1  8/31/2015 897751939 10/12/2015       3604     668.27    502.54     2408445.1
## 2 11/20/2010 599480426   1/9/2011       8435     255.28    159.42     2153286.8
## 3  6/22/2017 538911855  6/25/2017       4848     421.89    364.69     2045322.7
## 4  2/28/2012 459845054  3/20/2012       7225     421.89    364.69     3048155.2
## 5  8/12/2010 626391351  9/13/2010       1975     205.70    117.11      406257.5
## 6  8/20/2010 472974574  8/27/2010       2542     421.89    364.69     1072444.4
##   Total.Cost Total.Profit
## 1  1811154.2     597290.9
## 2  1344707.7     808579.1
## 3  1768017.1     277305.6
## 4  2634885.2     413270.0
## 5   231292.2     174965.2
## 6   927042.0     145402.4
#Setting the file path for larger Sales dataset
file_path <- "C:/Users/aleja/Desktop/100 Sales Records.csv"

#Loading the CSV file
df2 <- read.csv(file_path)

#First few rows of the dataset
head(df2)
##                              Region               Country       Item.Type
## 1             Australia and Oceania                Tuvalu       Baby Food
## 2 Central America and the Caribbean               Grenada          Cereal
## 3                            Europe                Russia Office Supplies
## 4                Sub-Saharan Africa Sao Tome and Principe          Fruits
## 5                Sub-Saharan Africa                Rwanda Office Supplies
## 6             Australia and Oceania       Solomon Islands       Baby Food
##   Sales.Channel Order.Priority Order.Date  Order.ID Ship.Date Units.Sold
## 1       Offline              H  5/28/2010 669165933 6/27/2010       9925
## 2        Online              C  8/22/2012 963881480 9/15/2012       2804
## 3       Offline              L   5/2/2014 341417157  5/8/2014       1779
## 4        Online              C  6/20/2014 514321792  7/5/2014       8102
## 5       Offline              L   2/1/2013 115456712  2/6/2013       5062
## 6        Online              C   2/4/2015 547995746 2/21/2015       2974
##   Unit.Price Unit.Cost Total.Revenue Total.Cost Total.Profit
## 1     255.28    159.42    2533654.00 1582243.50    951410.50
## 2     205.70    117.11     576782.80  328376.44    248406.36
## 3     651.21    524.96    1158502.59  933903.84    224598.75
## 4       9.33      6.92      75591.66   56065.84     19525.82
## 5     651.21    524.96    3296425.02 2657347.52    639077.50
## 6     255.28    159.42     759202.72  474115.08    285087.64

Overview

Lets look at the Sales datasets

glimpse(df1)
## Rows: 50,000
## Columns: 14
## $ Region         <chr> "Sub-Saharan Africa", "Europe", "Europe", "Europe", "Eu…
## $ Country        <chr> "Namibia", "Iceland", "Russia", "Moldova ", "Malta", "I…
## $ Item.Type      <chr> "Household", "Baby Food", "Meat", "Meat", "Cereal", "Me…
## $ Sales.Channel  <chr> "Offline", "Online", "Online", "Online", "Online", "Onl…
## $ Order.Priority <chr> "M", "H", "L", "L", "M", "H", "M", "L", "M", "C", "M", …
## $ Order.Date     <chr> "8/31/2015", "11/20/2010", "6/22/2017", "2/28/2012", "8…
## $ Order.ID       <int> 897751939, 599480426, 538911855, 459845054, 626391351, …
## $ Ship.Date      <chr> "10/12/2015", "1/9/2011", "6/25/2017", "3/20/2012", "9/…
## $ Units.Sold     <int> 3604, 8435, 4848, 7225, 1975, 2542, 4398, 49, 4031, 791…
## $ Unit.Price     <dbl> 668.27, 255.28, 421.89, 421.89, 205.70, 421.89, 668.27,…
## $ Unit.Cost      <dbl> 502.54, 159.42, 364.69, 364.69, 117.11, 364.69, 502.54,…
## $ Total.Revenue  <dbl> 2408445.08, 2153286.80, 2045322.72, 3048155.25, 406257.…
## $ Total.Cost     <dbl> 1811154.16, 1344707.70, 1768017.12, 2634885.25, 231292.…
## $ Total.Profit   <dbl> 597290.92, 808579.10, 277305.60, 413270.00, 174965.25, …

Data exploration

#Checking for missing values in df1 (Sales dataset)
summary(df1)
##     Region            Country           Item.Type         Sales.Channel     
##  Length:50000       Length:50000       Length:50000       Length:50000      
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##  Order.Priority      Order.Date           Order.ID          Ship.Date        
##  Length:50000       Length:50000       Min.   :100013196   Length:50000      
##  Class :character   Class :character   1st Qu.:324007046   Class :character  
##  Mode  :character   Mode  :character   Median :550422394   Mode  :character  
##                                        Mean   :549733027                     
##                                        3rd Qu.:776782381                     
##                                        Max.   :999999463                     
##    Units.Sold      Unit.Price       Unit.Cost      Total.Revenue    
##  Min.   :    1   Min.   :  9.33   Min.   :  6.92   Min.   :     28  
##  1st Qu.: 2498   1st Qu.: 81.73   1st Qu.: 35.84   1st Qu.: 276487  
##  Median : 5018   Median :154.06   Median : 97.44   Median : 781325  
##  Mean   : 5000   Mean   :265.65   Mean   :187.32   Mean   :1323716  
##  3rd Qu.: 7493   3rd Qu.:421.89   3rd Qu.:263.33   3rd Qu.:1808642  
##  Max.   :10000   Max.   :668.27   Max.   :524.96   Max.   :6682032  
##    Total.Cost       Total.Profit      
##  Min.   :     21   Min.   :      7.2  
##  1st Qu.: 160637   1st Qu.:  94150.9  
##  Median : 467104   Median : 279536.4  
##  Mean   : 933157   Mean   : 390558.7  
##  3rd Qu.:1190390   3rd Qu.: 564286.7  
##  Max.   :5249075   Max.   :1738178.4
glimpse(df2)
## Rows: 100
## Columns: 14
## $ Region         <chr> "Australia and Oceania", "Central America and the Carib…
## $ Country        <chr> "Tuvalu", "Grenada", "Russia", "Sao Tome and Principe",…
## $ Item.Type      <chr> "Baby Food", "Cereal", "Office Supplies", "Fruits", "Of…
## $ Sales.Channel  <chr> "Offline", "Online", "Offline", "Online", "Offline", "O…
## $ Order.Priority <chr> "H", "C", "L", "C", "L", "C", "M", "H", "M", "H", "H", …
## $ Order.Date     <chr> "5/28/2010", "8/22/2012", "5/2/2014", "6/20/2014", "2/1…
## $ Order.ID       <int> 669165933, 963881480, 341417157, 514321792, 115456712, …
## $ Ship.Date      <chr> "6/27/2010", "9/15/2012", "5/8/2014", "7/5/2014", "2/6/…
## $ Units.Sold     <int> 9925, 2804, 1779, 8102, 5062, 2974, 4187, 8082, 6070, 6…
## $ Unit.Price     <dbl> 255.28, 205.70, 651.21, 9.33, 651.21, 255.28, 668.27, 1…
## $ Unit.Cost      <dbl> 159.42, 117.11, 524.96, 6.92, 524.96, 159.42, 502.54, 9…
## $ Total.Revenue  <dbl> 2533654.00, 576782.80, 1158502.59, 75591.66, 3296425.02…
## $ Total.Cost     <dbl> 1582243.50, 328376.44, 933903.84, 56065.84, 2657347.52,…
## $ Total.Profit   <dbl> 951410.50, 248406.36, 224598.75, 19525.82, 639077.50, 2…
summary(df2)
##     Region            Country           Item.Type         Sales.Channel     
##  Length:100         Length:100         Length:100         Length:100        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##  Order.Priority      Order.Date           Order.ID          Ship.Date        
##  Length:100         Length:100         Min.   :114606559   Length:100        
##  Class :character   Class :character   1st Qu.:338922488   Class :character  
##  Mode  :character   Mode  :character   Median :557708561   Mode  :character  
##                                        Mean   :555020412                     
##                                        3rd Qu.:790755081                     
##                                        Max.   :994022214                     
##    Units.Sold     Unit.Price       Unit.Cost      Total.Revenue    
##  Min.   : 124   Min.   :  9.33   Min.   :  6.92   Min.   :   4870  
##  1st Qu.:2836   1st Qu.: 81.73   1st Qu.: 35.84   1st Qu.: 268721  
##  Median :5382   Median :179.88   Median :107.28   Median : 752314  
##  Mean   :5129   Mean   :276.76   Mean   :191.05   Mean   :1373488  
##  3rd Qu.:7369   3rd Qu.:437.20   3rd Qu.:263.33   3rd Qu.:2212045  
##  Max.   :9925   Max.   :668.27   Max.   :524.96   Max.   :5997055  
##    Total.Cost       Total.Profit    
##  Min.   :   3612   Min.   :   1258  
##  1st Qu.: 168868   1st Qu.: 121444  
##  Median : 363566   Median : 290768  
##  Mean   : 931806   Mean   : 441682  
##  3rd Qu.:1613870   3rd Qu.: 635829  
##  Max.   :4509794   Max.   :1719922
#Checking for missing values
missing_values <- colSums(is.na(df1))
print(missing_values)
##         Region        Country      Item.Type  Sales.Channel Order.Priority 
##              0              0              0              0              0 
##     Order.Date       Order.ID      Ship.Date     Units.Sold     Unit.Price 
##              0              0              0              0              0 
##      Unit.Cost  Total.Revenue     Total.Cost   Total.Profit 
##              0              0              0              0
# Overview of df1
str(df1)
## 'data.frame':    50000 obs. of  14 variables:
##  $ Region        : chr  "Sub-Saharan Africa" "Europe" "Europe" "Europe" ...
##  $ Country       : chr  "Namibia" "Iceland" "Russia" "Moldova " ...
##  $ Item.Type     : chr  "Household" "Baby Food" "Meat" "Meat" ...
##  $ Sales.Channel : chr  "Offline" "Online" "Online" "Online" ...
##  $ Order.Priority: chr  "M" "H" "L" "L" ...
##  $ Order.Date    : chr  "8/31/2015" "11/20/2010" "6/22/2017" "2/28/2012" ...
##  $ Order.ID      : int  897751939 599480426 538911855 459845054 626391351 472974574 854331052 895509612 241871583 409090793 ...
##  $ Ship.Date     : chr  "10/12/2015" "1/9/2011" "6/25/2017" "3/20/2012" ...
##  $ Units.Sold    : int  3604 8435 4848 7225 1975 2542 4398 49 4031 7911 ...
##  $ Unit.Price    : num  668 255 422 422 206 ...
##  $ Unit.Cost     : num  503 159 365 365 117 ...
##  $ Total.Revenue : num  2408445 2153287 2045323 3048155 406258 ...
##  $ Total.Cost    : num  1811154 1344708 1768017 2634885 231292 ...
##  $ Total.Profit  : num  597291 808579 277306 413270 174965 ...
summary(df1)
##     Region            Country           Item.Type         Sales.Channel     
##  Length:50000       Length:50000       Length:50000       Length:50000      
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##  Order.Priority      Order.Date           Order.ID          Ship.Date        
##  Length:50000       Length:50000       Min.   :100013196   Length:50000      
##  Class :character   Class :character   1st Qu.:324007046   Class :character  
##  Mode  :character   Mode  :character   Median :550422394   Mode  :character  
##                                        Mean   :549733027                     
##                                        3rd Qu.:776782381                     
##                                        Max.   :999999463                     
##    Units.Sold      Unit.Price       Unit.Cost      Total.Revenue    
##  Min.   :    1   Min.   :  9.33   Min.   :  6.92   Min.   :     28  
##  1st Qu.: 2498   1st Qu.: 81.73   1st Qu.: 35.84   1st Qu.: 276487  
##  Median : 5018   Median :154.06   Median : 97.44   Median : 781325  
##  Mean   : 5000   Mean   :265.65   Mean   :187.32   Mean   :1323716  
##  3rd Qu.: 7493   3rd Qu.:421.89   3rd Qu.:263.33   3rd Qu.:1808642  
##  Max.   :10000   Max.   :668.27   Max.   :524.96   Max.   :6682032  
##    Total.Cost       Total.Profit      
##  Min.   :     21   Min.   :      7.2  
##  1st Qu.: 160637   1st Qu.:  94150.9  
##  Median : 467104   Median : 279536.4  
##  Mean   : 933157   Mean   : 390558.7  
##  3rd Qu.:1190390   3rd Qu.: 564286.7  
##  Max.   :5249075   Max.   :1738178.4
# Overview of df2
str(df2)
## 'data.frame':    100 obs. of  14 variables:
##  $ Region        : chr  "Australia and Oceania" "Central America and the Caribbean" "Europe" "Sub-Saharan Africa" ...
##  $ Country       : chr  "Tuvalu" "Grenada" "Russia" "Sao Tome and Principe" ...
##  $ Item.Type     : chr  "Baby Food" "Cereal" "Office Supplies" "Fruits" ...
##  $ Sales.Channel : chr  "Offline" "Online" "Offline" "Online" ...
##  $ Order.Priority: chr  "H" "C" "L" "C" ...
##  $ Order.Date    : chr  "5/28/2010" "8/22/2012" "5/2/2014" "6/20/2014" ...
##  $ Order.ID      : int  669165933 963881480 341417157 514321792 115456712 547995746 135425221 871543967 770463311 616607081 ...
##  $ Ship.Date     : chr  "6/27/2010" "9/15/2012" "5/8/2014" "7/5/2014" ...
##  $ Units.Sold    : int  9925 2804 1779 8102 5062 2974 4187 8082 6070 6593 ...
##  $ Unit.Price    : num  255.28 205.7 651.21 9.33 651.21 ...
##  $ Unit.Cost     : num  159.42 117.11 524.96 6.92 524.96 ...
##  $ Total.Revenue : num  2533654 576783 1158503 75592 3296425 ...
##  $ Total.Cost    : num  1582244 328376 933904 56066 2657348 ...
##  $ Total.Profit  : num  951411 248406 224599 19526 639078 ...
summary(df2)
##     Region            Country           Item.Type         Sales.Channel     
##  Length:100         Length:100         Length:100         Length:100        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##  Order.Priority      Order.Date           Order.ID          Ship.Date        
##  Length:100         Length:100         Min.   :114606559   Length:100        
##  Class :character   Class :character   1st Qu.:338922488   Class :character  
##  Mode  :character   Mode  :character   Median :557708561   Mode  :character  
##                                        Mean   :555020412                     
##                                        3rd Qu.:790755081                     
##                                        Max.   :994022214                     
##    Units.Sold     Unit.Price       Unit.Cost      Total.Revenue    
##  Min.   : 124   Min.   :  9.33   Min.   :  6.92   Min.   :   4870  
##  1st Qu.:2836   1st Qu.: 81.73   1st Qu.: 35.84   1st Qu.: 268721  
##  Median :5382   Median :179.88   Median :107.28   Median : 752314  
##  Mean   :5129   Mean   :276.76   Mean   :191.05   Mean   :1373488  
##  3rd Qu.:7369   3rd Qu.:437.20   3rd Qu.:263.33   3rd Qu.:2212045  
##  Max.   :9925   Max.   :668.27   Max.   :524.96   Max.   :5997055  
##    Total.Cost       Total.Profit    
##  Min.   :   3612   Min.   :   1258  
##  1st Qu.: 168868   1st Qu.: 121444  
##  Median : 363566   Median : 290768  
##  Mean   : 931806   Mean   : 441682  
##  3rd Qu.:1613870   3rd Qu.: 635829  
##  Max.   :4509794   Max.   :1719922

Data cleaning and preprocesing

#Checking for missing values in the larger dataset (df1)
missing_values_df1 <- sapply(df1, function(x) sum(is.na(x)))
print("Missing values in df1:")
## [1] "Missing values in df1:"
print(missing_values_df1)
##         Region        Country      Item.Type  Sales.Channel Order.Priority 
##              0              0              0              0              0 
##     Order.Date       Order.ID      Ship.Date     Units.Sold     Unit.Price 
##              0              0              0              0              0 
##      Unit.Cost  Total.Revenue     Total.Cost   Total.Profit 
##              0              0              0              0
#Checking for missing values in the smaller dataset (df2)
missing_values_df2 <- sapply(df2, function(x) sum(is.na(x)))
print("Missing values in df2:")
## [1] "Missing values in df2:"
print(missing_values_df2)
##         Region        Country      Item.Type  Sales.Channel Order.Priority 
##              0              0              0              0              0 
##     Order.Date       Order.ID      Ship.Date     Units.Sold     Unit.Price 
##              0              0              0              0              0 
##      Unit.Cost  Total.Revenue     Total.Cost   Total.Profit 
##              0              0              0              0

No Missing values

Changing date order in both datasets

df1$Order.Date <- as.Date(df1$Order.Date, format = "%m/%d/%Y")
df1$Ship.Date <- as.Date(df1$Ship.Date, format = "%m/%d/%Y")
#Converting date columns to Date type
df2$Order.Date <- as.Date(df2$Order.Date, format="%m/%d/%Y")
df2$Ship.Date <- as.Date(df2$Ship.Date, format="%m/%d/%Y")

Exploratory data analysis

Exploring sales on Larger dataset

#Loading necessary libraries if not before
#library(ggplot2)
#library(dplyr)
library(scales)  #Adding this for scales
## 
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
## 
##     discard
## The following object is masked from 'package:readr':
## 
##     col_factor
#Sales by region
sales_by_region <- df1 %>%
  group_by(Region, Sales.Channel) %>%
  summarise(Total.Sales = sum(Total.Revenue), .groups = 'drop')

#Visualize a vertical bar chart
ggplot(sales_by_region, aes(x = Total.Sales, y = Region, fill = Sales.Channel)) +
  geom_bar(stat = "identity", position = "dodge") +
  scale_x_continuous(labels = scales::comma_format(scale = 1e-3, suffix = "k")) +  # Format the x-axis to show in thousands
  labs(title = "Sales Distribution by Region and Channel", x = "Total Sales (in thousands)", y = "Region") +
  theme_minimal() +  # Clean theme
  theme(axis.text.y = element_text(size = 10))  #Had to adjust size of y-axis labels if needed

Sales per region for Smaller dataset

#Sales by region for df2
sales_by_region_df2 <- df2 %>%
  group_by(Region, Sales.Channel) %>%
  summarise(Total.Sales = sum(Total.Revenue), .groups = 'drop')

#Visualizing a vertical bar chart for df2
ggplot(sales_by_region_df2, aes(x = Total.Sales, y = Region, fill = Sales.Channel)) +
  geom_bar(stat = "identity", position = "dodge") +
  scale_x_continuous(labels = scales::comma_format(scale = 1e-3, suffix = "k")) +  # Format the x-axis to show in thousands
  labs(title = "Sales Distribution by Region and Channel (df2)", x = "Total Sales (in thousands)", y = "Region") +
  theme_minimal() +  # Clean theme
  theme(axis.text.y = element_text(size = 10))  #Just in case of adjust size of y-axis labels

The sales figures per region in the smaller dataset show notable differences. For instance, in the smaller dataset, sales in Africa are significantly higher for the offline channel, while in the larger dataset, sales across both channels are relatively consistent. Additionally, the larger dataset indicates higher sales in Europe.

Trend sales

By analyzing trends in these datasets is crucial for forecasting future sales, identifying seasonality, evaluating performance, understanding customer behavior, and detecting issues. This insight informs strategic decisions, resource allocation, and operational optimization, ultimately driving better financial outcomes and business growth.

#Aggregate total sales by month for df1 (Larger dataset)
# Makeing about the  libraries

library(lubridate)

#Ensuring Order.Date is in Date format and filter out NA values for df1
df1$Order.Date <- as.Date(df1$Order.Date)
df1_clean <- df1 %>% filter(!is.na(Total.Revenue))

#Aggregate total sales by month for df1
df1_monthly_sales <- df1_clean %>%
  group_by(Year = year(Order.Date), Month = month(Order.Date, label = TRUE)) %>%
  summarise(Total.Sales = sum(Total.Revenue, na.rm = TRUE), .groups = 'drop')  # Use .groups = 'drop' to prevent grouping

#Ensure Order.Date is in Date format and filter out NA values for df2
df2$Order.Date <- as.Date(df2$Order.Date, format="%Y-%m-%d")  # Adjust format if necessary
df2_clean <- df2 %>% filter(!is.na(Total.Revenue))

#Aggregate total sales by month for df2
df2_monthly_sales <- df2_clean %>%
  group_by(Year = year(Order.Date), Month = month(Order.Date, label = TRUE)) %>%
  summarise(Total.Sales = sum(Total.Revenue, na.rm = TRUE), .groups = 'drop')  # Use .groups = 'drop' to prevent grouping
#Ensure Order.Date is in Date format and filter out NA values for df1
df1$Order.Date <- as.Date(df1$Order.Date)
df1_clean <- df1 %>% filter(!is.na(Total.Revenue))

#Aggregate total sales by month for df1
df1_monthly_sales <- df1_clean %>%
  group_by(Year = year(Order.Date), Month = month(Order.Date)) %>%
  summarise(Total.Sales = sum(Total.Revenue, na.rm = TRUE), .groups = 'drop') %>%
  mutate(Date = as.Date(paste(Year, Month, "1", sep = "-"), format="%Y-%m-%d")) %>%  # Create a proper date column
  arrange(Date)  #Sorting by the new Date column

#Ensure Order.Date is in Date format and filter out NA values for df2
df2$Order.Date <- as.Date(df2$Order.Date)  #Adjust format if necessary
df2_clean <- df2 %>% filter(!is.na(Total.Revenue))

#Aggregate total sales by month for df2
df2_monthly_sales <- df2_clean %>%
  group_by(Year = year(Order.Date), Month = month(Order.Date)) %>%
  summarise(Total.Sales = sum(Total.Revenue, na.rm = TRUE), .groups = 'drop') %>%
  mutate(Date = as.Date(paste(Year, Month, "1", sep = "-"), format="%Y-%m-%d")) %>%  # Create a proper date column
  arrange(Date)  #Sorting by the new Date column

Redoing some code to fix some problems.

#Creating line plot for df1
ggplot(df1_monthly_sales, aes(x = Date, y = Total.Sales)) +
  geom_line(color = "blue", size = 1) +  #Line color and size
  geom_point(color = "blue", size = 2) +  #Points on the line
  labs(title = "Monthly Sales Trend - Smaller Dataset",
       x = "Date",
       y = "Total Sales") +
  scale_x_date(date_labels = "%b %Y", date_breaks = "1 month") +  # Formatting x-axis
  theme_minimal() +  #Clean theme
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 4))  #Smaller x-axis labels
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

#Creating line plot for df2
ggplot(df2_monthly_sales, aes(x = Date, y = Total.Sales)) +
  geom_line(color = "green", size = 1) +  #Line color and size
  geom_point(color = "green", size = 2) +  #Points on the line
  labs(title = "Monthly Sales Trend - Larger Dataset",
       x = "Date",
       y = "Total Sales") +
  scale_x_date(date_labels = "%b %Y", date_breaks = "1 month") +  # Formatting x-axis
  theme_minimal() +  #Clean theme
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 4))  #Smaller x-axis labels

I aimed to observe the annual trends, and while the visualization may seem a bit cluttered, it does highlight certain months and years with increasing sales, aiding in trend identification. The highest sales occurred between 2011 and 2014, with peaks during the summer months.

Sales per region

#Quick function to plot sales per region for online and offline sales
sales_per_region <- function(df, title) {
  
  #Group by Region and Sales Channel to get the total sales (count of orders) for each
  sales_data <- df %>%
    group_by(Region, Sales.Channel) %>%
    summarise(Sales.Count = n(), .groups = 'drop') # Count of orders
  
  #Creating a bar plot for Sales per Region by Sales Channel
  ggplot(sales_data, aes(x = Region, y = Sales.Count, fill = Sales.Channel)) +
    geom_bar(stat = "identity", position = "dodge") + #Dodge for side-by-side bars
    labs(
      x = "Region",
      y = "Number of Sales",
      title = title
    ) +
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 45, hjust = 1)) #Rotating x-axis labels for readability
}

#Example usage of the function with your dataset df1 (assuming df1 is the dataset with your sales data)
sales_per_region(df1, "Sales Per Region (Online vs Offline)")

Sales online and offline are nearly identical in each region but overall sales are higher in Europe and both African regions, the analysis can shift towards visualizing total sales per region. Instead of splitting by sales channels, lets can focus on comparing total sales across regions.

#Function to plot total sales per region for df1
total_sales_per_region <- function(df, title) {
  
  #Group by Region to get the total sales (count of orders) for each region
  total_sales_data <- df %>%
    group_by(Region) %>%
    summarise(Total.Sales = n(), .groups = 'drop') #Count of orders
  
  #Creating a bar plot for total sales per region
  ggplot(total_sales_data, aes(x = Region, y = Total.Sales, fill = Region)) +
    geom_bar(stat = "identity") + #Creating bars for total sales
    labs(
      x = "Region",
      y = "Total Number of Sales",
      title = title
    ) +
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 45, hjust = 1)) # Rotating x-axis labels for readability
}

#Example usage of the function with dataset df1
total_sales_per_region(df1, "Total Sales Per Region")

#The 'Year' column is created based on the 'Order.Date' column
df1$Year <- year(df1$Order.Date)
df2$Year <- year(df2$Order.Date)

#Filtering the datasets for the desired regions
df1_sales_europe_africa <- df1 %>%
  filter(Region %in% c("Europe", "Middle East and North Africa", "Sub-Saharan Africa"))

df2_sales_europe_africa <- df2 %>%
  filter(Region %in% c("Europe", "Middle East and North Africa", "Sub-Saharan Africa"))

#Summarizing total sales by year and region for df1
df1_sales_summary <- df1_sales_europe_africa %>%
  group_by(Year, Region) %>%
  summarise(Total.Sales = sum(Total.Revenue, na.rm = TRUE))
## `summarise()` has grouped output by 'Year'. You can override using the
## `.groups` argument.
#Summarizing total sales by year and region for df2
df2_sales_summary <- df2_sales_europe_africa %>%
  group_by(Year, Region) %>%
  summarise(Total.Sales = sum(Total.Revenue, na.rm = TRUE))
## `summarise()` has grouped output by 'Year'. You can override using the
## `.groups` argument.
#Visualizing df1 sales
ggplot(df1_sales_summary, aes(x = Year, y = Total.Sales, color = Region, group = Region)) +
  geom_line(size = 1) +
  geom_point(size = 2) +
  labs(title = "Sales Trend (Smaller Dataset) - Europe and Africa",
       x = "Year",
       y = "Total Sales") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

#Visualizing df2 sales
ggplot(df2_sales_summary, aes(x = Year, y = Total.Sales, color = Region, group = Region)) +
  geom_line(size = 1) +
  geom_point(size = 2) +
  labs(title = "Sales Trend (Larger Dataset) - Europe and Africa",
       x = "Year",
       y = "Total Sales") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

The smaller dataset shows consistent sales across regions with minimal changes over the years. In contrast, the larger dataset reveals fluctuations, notably a peak in 2012, followed by a decline. Afterward, sales stabilized, exhibiting varying levels instead of the steady trends observed in the smaller dataset.

Models

#Loading libraries *if needed
#install.packages("randomForest")
library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
## 
##     combine
## The following object is masked from 'package:ggplot2':
## 
##     margin
library(dplyr)
library(lubridate)

#Setting a seed for reproducibility
set.seed(42)

#Preparing the data for df1 (smaller dataset)
df1_clean <- df1 %>% 
  select(Total.Revenue, Units.Sold, Unit.Price, Unit.Cost, Order.Date) %>% 
  na.omit() %>%
  mutate(Year = year(Order.Date), 
         Month = month(Order.Date))  #Extracting Year and Month

#Random Forest model for df1 including Date features
rf_model_df1 <- randomForest(Total.Revenue ~ Units.Sold + Unit.Price + Unit.Cost + Year + Month, 
                              data = df1_clean, 
                              ntree = 500)  #Using 500 trees

#Printing the model summary for df1
print(rf_model_df1)
## 
## Call:
##  randomForest(formula = Total.Revenue ~ Units.Sold + Unit.Price +      Unit.Cost + Year + Month, data = df1_clean, ntree = 500) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 1
## 
##           Mean of squared residuals: 126073109999
##                     % Var explained: 94.12
#Preparing the data for df2 (larger dataset)
df2_clean <- df2 %>% 
  select(Total.Revenue, Units.Sold, Unit.Price, Unit.Cost, Order.Date) %>% 
  na.omit() %>%
  mutate(Year = year(Order.Date), 
         Month = month(Order.Date))  #Extracting Year and Month

#Random Forest model for df2 including Date features
rf_model_df2 <- randomForest(Total.Revenue ~ Units.Sold + Unit.Price + Unit.Cost + Year + Month, 
                              data = df2_clean, 
                              ntree = 500)  #Using 500 trees

#Printing the model summary for df2
print(rf_model_df2)
## 
## Call:
##  randomForest(formula = Total.Revenue ~ Units.Sold + Unit.Price +      Unit.Cost + Year + Month, data = df2_clean, ntree = 500) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 1
## 
##           Mean of squared residuals: 492144359517
##                     % Var explained: 76.68

The Random Forest model accounts for 94.12% of the variation in Total.Revenue, demonstrating strong predictive accuracy using features such as Units.Sold, Unit.Price, Unit.Cost, Year, and Month. This suggests the model effectively captures sales patterns, including seasonal and yearly trends, as reflected in higher sales during the summer months.

However, the large Mean Squared Residuals point to significant errors in a few instances, possibly due to outliers or extreme values that warrant further investigation.

Including Year and Month helps address time-related effects, enhancing the model’s usefulness for revenue forecasting and identifying periods of high sales. With 500 trees, the model generalizes well, reducing the risk of overfitting.

This model offers reliable revenue predictions and emphasizes the role of time-based trends in shaping business decisions like inventory management and marketing strategies.

Conclusions

In conclusion, the synthetic dataset used for this analysis served its purpose of testing for machine learning model, such as the Random Forest regression. While the results, such as the high percentage of variance explained, demonstrate the model’s effectiveness in predicting the target variable, it’s essential to remember that this dataset is not based on real-world data. Thus, the insights generated from this analysis are purely for testing purposes. The experiment provided a valuable opportunity to explore and refine data processing, modeling, and visualization techniques, which can later be applied to real-world datasets with more significant implications.

Essay

The code provided presents a detailed methodology for analyzing sales data from two datasets. This analysis is vital for businesses aiming to predict sales, recognize seasonal patterns, assess performance, and gain deeper insights into customer behavior. By identifying trends and patterns, organizations can make informed strategic choices, allocate resources efficiently, and enhance operations, ultimately resulting in improved financial performance and growth.

The initial segment of the code emphasizes the aggregation of total sales on a monthly basis. The datasets undergo cleaning to ensure the Order.Date column is correctly formatted and to filter out any rows lacking values in the Total.Revenue column. The cleaned datasets, named df1_clean and df2_clean, are subsequently organized by year and month. Total sales are summed for each month, facilitating a clear depiction of monthly trends. Line plots for both datasets are created using ggplot2, visually representing the sales trends over the months. Although the code indicates a peak in sales between 2011 and 2014, particularly during the summer months, the visualizations reveal broader trends and variations in sales over time.

A key aspect of the analysis also involves comparing sales across various regions. The code defines a function to analyze both online and offline sales by region, utilizing bar plots to showcase the differences. This analysis indicates that sales were higher in Europe and both African regions. By concentrating on total sales across regions instead of segregating them by sales channels, the overall sales volume for each region can be compared more effectively. This comparative analysis offers valuable insights into the geographic distribution of the company’s market, guiding where to focus resources and marketing strategies.

The analysis further explores sales trends in specific regions, including Europe and the two African regions: the Middle East and North Africa, and Sub-Saharan Africa. The data is filtered to include only these areas, and total sales are aggregated yearly. The resulting visualizations demonstrate that the smaller dataset shows consistent sales with minimal fluctuations, whereas the larger dataset exhibits greater variability, highlighted by a notable peak in 2012 followed by a downturn. Such analyses enable businesses to understand variations in sales performance across different regions and time periods, assisting in planning for potential market expansions or contractions.

The concluding section of the code implements a Random Forest model to predict Total.Revenue based on several factors, including Units.Sold, Unit.Price, Unit.Cost, and the date features (year and month). The model demonstrates strong predictive accuracy for the smaller dataset, explaining 94.12% of the variance in sales. In contrast, for the larger dataset, it accounts for only 76.68% of the variance, indicating that other influential factors might be at play that are not captured by the current variables. Nonetheless, the model’s capacity to identify seasonal and yearly trends provides valuable insights for forecasting future sales and effectively planning business operations.

In summary, this analysis integrates data cleaning, aggregation, visualization, and predictive modeling to yield a comprehensive understanding of sales performance over time and across regions. The insights obtained can support strategic decision-making, enhance resource allocation, and ultimately foster business growth by pinpointing trends and addressing challenges impacting sales performance.

References

Dataset source:

Downloads 18 – Sample CSV Files / Data Sets for Testing (till 5 Million Records) – Sales Posted on August 26, 2017by Vijay A. Verma Disclaimer – The datasets are generated through random logic in VBA. These are not real sales data and should not be used for any other purpose other than testing.

https://excelbianalytics.com/wp/downloads-18-sample-csv-files-data-sets-for-testing-sales/

---
title: "Data 622: Assignment 1"
author: "Laura B"
date: "`r Sys.Date()`"
output: openintro::lab_report
---

### Introduction

In this exercise, we conducted a comprehensive analysis of sales data utilizing two datasets: a smaller dataset of 100 rows and a larger dataset containing 50,000 rows. Our primary objective was to explore sales patterns across various regions and analyze the influence of factors such as sales channels and order priorities on overall sales performance.

We began by visualizing the sales distribution by region, revealing that Europe, both African regions, and Asia ranked as the top four in total sales. This initial exploration laid the groundwork for a deeper analysis of the distinctions in sales between online and offline channels within these regions.

To further our understanding of sales trends, we implemented a sales forecasting exercise that incorporated seasonal trends. This process involved examining historical sales data to identify recurring seasonal patterns and fluctuations over time. Utilizing statistical techniques, we built a forecasting model aimed at predicting future sales based on the identified trends.

Additionally, our analysis included a correlation study to investigate the relationships between various numeric variables within the datasets, such as unit prices, total revenue, and profit margins. This correlation analysis revealed significant associations, providing valuable insights into the factors driving sales performance.

Through the use of visualizations, including box plots and bar charts, we effectively presented our findings, illustrating sales distribution by region, sales channel, and order priority. This thorough examination of the datasets not only deepened our understanding of sales dynamics across different regions and channels but also equipped us with predictive insights to inform strategic decision-making in sales and marketing efforts. Overall, this exercise underscored the importance of data-driven analysis in understanding and forecasting sales trends.



```{r load-packages, message=FALSE}
library(tidyverse)
library(openintro)
library(dplyr)
library(tidyr)
library(ggplot2)
library(reshape2)
library(GGally)
```

#### The data

The first dataset is a smaller collection comprising 100 rows, while the second is significantly larger, containing 50,000 rows. These datasets include critical variables such as region, sales channel, order priority, unit prices, and total revenue, allowing us to explore sales trends and patterns. By leveraging the richness of this data, we aimed to understand the impact of different factors on sales performance and conduct forecasting exercises to predict future sales trends effectively.



```{r}
#Setting the file path for larger Sales dataset
file_path <- "C:/Users/aleja/Desktop/50000 Sales Records.csv"

#Loading the CSV file
df1 <- read.csv(file_path)

#First few rows of the dataset
head(df1)

```

```{r}
#Setting the file path for larger Sales dataset
file_path <- "C:/Users/aleja/Desktop/100 Sales Records.csv"

#Loading the CSV file
df2 <- read.csv(file_path)

#First few rows of the dataset
head(df2)
```







### Overview

Lets look at the Sales datasets

```{r}
glimpse(df1)
```

#### Data exploration

```{r}
#Checking for missing values in df1 (Sales dataset)
summary(df1)
```

```{r}
glimpse(df2)
```

```{r}
summary(df2)
```




```{r}
#Checking for missing values
missing_values <- colSums(is.na(df1))
print(missing_values)
```

```{r}
# Overview of df1
str(df1)
summary(df1)

# Overview of df2
str(df2)
summary(df2)

```

### Data cleaning and preprocesing

```{r}
#Checking for missing values in the larger dataset (df1)
missing_values_df1 <- sapply(df1, function(x) sum(is.na(x)))
print("Missing values in df1:")
print(missing_values_df1)

#Checking for missing values in the smaller dataset (df2)
missing_values_df2 <- sapply(df2, function(x) sum(is.na(x)))
print("Missing values in df2:")
print(missing_values_df2)

```

No Missing values


Changing date order in both datasets

```{r}
df1$Order.Date <- as.Date(df1$Order.Date, format = "%m/%d/%Y")
df1$Ship.Date <- as.Date(df1$Ship.Date, format = "%m/%d/%Y")

```

```{r}
#Converting date columns to Date type
df2$Order.Date <- as.Date(df2$Order.Date, format="%m/%d/%Y")
df2$Ship.Date <- as.Date(df2$Ship.Date, format="%m/%d/%Y")

```

### Exploratory data analysis

Exploring sales on Larger dataset



```{r}
#Loading necessary libraries if not before
#library(ggplot2)
#library(dplyr)
library(scales)  #Adding this for scales

#Sales by region
sales_by_region <- df1 %>%
  group_by(Region, Sales.Channel) %>%
  summarise(Total.Sales = sum(Total.Revenue), .groups = 'drop')

#Visualize a vertical bar chart
ggplot(sales_by_region, aes(x = Total.Sales, y = Region, fill = Sales.Channel)) +
  geom_bar(stat = "identity", position = "dodge") +
  scale_x_continuous(labels = scales::comma_format(scale = 1e-3, suffix = "k")) +  # Format the x-axis to show in thousands
  labs(title = "Sales Distribution by Region and Channel", x = "Total Sales (in thousands)", y = "Region") +
  theme_minimal() +  # Clean theme
  theme(axis.text.y = element_text(size = 10))  #Had to adjust size of y-axis labels if needed

```

Sales per region for Smaller dataset

```{r}
#Sales by region for df2
sales_by_region_df2 <- df2 %>%
  group_by(Region, Sales.Channel) %>%
  summarise(Total.Sales = sum(Total.Revenue), .groups = 'drop')

#Visualizing a vertical bar chart for df2
ggplot(sales_by_region_df2, aes(x = Total.Sales, y = Region, fill = Sales.Channel)) +
  geom_bar(stat = "identity", position = "dodge") +
  scale_x_continuous(labels = scales::comma_format(scale = 1e-3, suffix = "k")) +  # Format the x-axis to show in thousands
  labs(title = "Sales Distribution by Region and Channel (df2)", x = "Total Sales (in thousands)", y = "Region") +
  theme_minimal() +  # Clean theme
  theme(axis.text.y = element_text(size = 10))  #Just in case of adjust size of y-axis labels

```

The sales figures per region in the smaller dataset show notable differences. For instance, in the smaller dataset, sales in Africa are significantly higher for the offline channel, while in the larger dataset, sales across both channels are relatively consistent. Additionally, the larger dataset indicates higher sales in Europe.


#### Trend sales

By analyzing trends in these datasets is crucial for forecasting future sales, identifying seasonality, evaluating performance, understanding customer behavior, and detecting issues. This insight informs strategic decisions, resource allocation, and operational optimization, ultimately driving better financial outcomes and business growth.
```{r}
#Aggregate total sales by month for df1 (Larger dataset)
# Makeing about the  libraries

library(lubridate)

#Ensuring Order.Date is in Date format and filter out NA values for df1
df1$Order.Date <- as.Date(df1$Order.Date)
df1_clean <- df1 %>% filter(!is.na(Total.Revenue))

#Aggregate total sales by month for df1
df1_monthly_sales <- df1_clean %>%
  group_by(Year = year(Order.Date), Month = month(Order.Date, label = TRUE)) %>%
  summarise(Total.Sales = sum(Total.Revenue, na.rm = TRUE), .groups = 'drop')  # Use .groups = 'drop' to prevent grouping

#Ensure Order.Date is in Date format and filter out NA values for df2
df2$Order.Date <- as.Date(df2$Order.Date, format="%Y-%m-%d")  # Adjust format if necessary
df2_clean <- df2 %>% filter(!is.na(Total.Revenue))

#Aggregate total sales by month for df2
df2_monthly_sales <- df2_clean %>%
  group_by(Year = year(Order.Date), Month = month(Order.Date, label = TRUE)) %>%
  summarise(Total.Sales = sum(Total.Revenue, na.rm = TRUE), .groups = 'drop')  # Use .groups = 'drop' to prevent grouping


```



```{r}
#Ensure Order.Date is in Date format and filter out NA values for df1
df1$Order.Date <- as.Date(df1$Order.Date)
df1_clean <- df1 %>% filter(!is.na(Total.Revenue))

#Aggregate total sales by month for df1
df1_monthly_sales <- df1_clean %>%
  group_by(Year = year(Order.Date), Month = month(Order.Date)) %>%
  summarise(Total.Sales = sum(Total.Revenue, na.rm = TRUE), .groups = 'drop') %>%
  mutate(Date = as.Date(paste(Year, Month, "1", sep = "-"), format="%Y-%m-%d")) %>%  # Create a proper date column
  arrange(Date)  #Sorting by the new Date column

#Ensure Order.Date is in Date format and filter out NA values for df2
df2$Order.Date <- as.Date(df2$Order.Date)  #Adjust format if necessary
df2_clean <- df2 %>% filter(!is.na(Total.Revenue))

#Aggregate total sales by month for df2
df2_monthly_sales <- df2_clean %>%
  group_by(Year = year(Order.Date), Month = month(Order.Date)) %>%
  summarise(Total.Sales = sum(Total.Revenue, na.rm = TRUE), .groups = 'drop') %>%
  mutate(Date = as.Date(paste(Year, Month, "1", sep = "-"), format="%Y-%m-%d")) %>%  # Create a proper date column
  arrange(Date)  #Sorting by the new Date column


```

Redoing some code to fix some problems.


```{r}
#Creating line plot for df1
ggplot(df1_monthly_sales, aes(x = Date, y = Total.Sales)) +
  geom_line(color = "blue", size = 1) +  #Line color and size
  geom_point(color = "blue", size = 2) +  #Points on the line
  labs(title = "Monthly Sales Trend - Smaller Dataset",
       x = "Date",
       y = "Total Sales") +
  scale_x_date(date_labels = "%b %Y", date_breaks = "1 month") +  # Formatting x-axis
  theme_minimal() +  #Clean theme
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 4))  #Smaller x-axis labels

#Creating line plot for df2
ggplot(df2_monthly_sales, aes(x = Date, y = Total.Sales)) +
  geom_line(color = "green", size = 1) +  #Line color and size
  geom_point(color = "green", size = 2) +  #Points on the line
  labs(title = "Monthly Sales Trend - Larger Dataset",
       x = "Date",
       y = "Total Sales") +
  scale_x_date(date_labels = "%b %Y", date_breaks = "1 month") +  # Formatting x-axis
  theme_minimal() +  #Clean theme
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 4))  #Smaller x-axis labels

```

I aimed to observe the annual trends, and while the visualization may seem a bit cluttered, it does highlight certain months and years with increasing sales, aiding in trend identification. The highest sales occurred between 2011 and 2014, with peaks during the summer months.



#### Sales per region

```{r}
#Quick function to plot sales per region for online and offline sales
sales_per_region <- function(df, title) {
  
  #Group by Region and Sales Channel to get the total sales (count of orders) for each
  sales_data <- df %>%
    group_by(Region, Sales.Channel) %>%
    summarise(Sales.Count = n(), .groups = 'drop') # Count of orders
  
  #Creating a bar plot for Sales per Region by Sales Channel
  ggplot(sales_data, aes(x = Region, y = Sales.Count, fill = Sales.Channel)) +
    geom_bar(stat = "identity", position = "dodge") + #Dodge for side-by-side bars
    labs(
      x = "Region",
      y = "Number of Sales",
      title = title
    ) +
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 45, hjust = 1)) #Rotating x-axis labels for readability
}

#Example usage of the function with your dataset df1 (assuming df1 is the dataset with your sales data)
sales_per_region(df1, "Sales Per Region (Online vs Offline)")


```

Sales online and offline are nearly identical in each region but overall sales are higher in Europe and both African regions, the analysis can shift towards visualizing total sales per region. Instead of splitting by sales channels, lets can focus on comparing total sales across regions.

```{r}
#Function to plot total sales per region for df1
total_sales_per_region <- function(df, title) {
  
  #Group by Region to get the total sales (count of orders) for each region
  total_sales_data <- df %>%
    group_by(Region) %>%
    summarise(Total.Sales = n(), .groups = 'drop') #Count of orders
  
  #Creating a bar plot for total sales per region
  ggplot(total_sales_data, aes(x = Region, y = Total.Sales, fill = Region)) +
    geom_bar(stat = "identity") + #Creating bars for total sales
    labs(
      x = "Region",
      y = "Total Number of Sales",
      title = title
    ) +
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 45, hjust = 1)) # Rotating x-axis labels for readability
}

#Example usage of the function with dataset df1
total_sales_per_region(df1, "Total Sales Per Region")

```

```{r}
#The 'Year' column is created based on the 'Order.Date' column
df1$Year <- year(df1$Order.Date)
df2$Year <- year(df2$Order.Date)

#Filtering the datasets for the desired regions
df1_sales_europe_africa <- df1 %>%
  filter(Region %in% c("Europe", "Middle East and North Africa", "Sub-Saharan Africa"))

df2_sales_europe_africa <- df2 %>%
  filter(Region %in% c("Europe", "Middle East and North Africa", "Sub-Saharan Africa"))

#Summarizing total sales by year and region for df1
df1_sales_summary <- df1_sales_europe_africa %>%
  group_by(Year, Region) %>%
  summarise(Total.Sales = sum(Total.Revenue, na.rm = TRUE))

#Summarizing total sales by year and region for df2
df2_sales_summary <- df2_sales_europe_africa %>%
  group_by(Year, Region) %>%
  summarise(Total.Sales = sum(Total.Revenue, na.rm = TRUE))

#Visualizing df1 sales
ggplot(df1_sales_summary, aes(x = Year, y = Total.Sales, color = Region, group = Region)) +
  geom_line(size = 1) +
  geom_point(size = 2) +
  labs(title = "Sales Trend (Smaller Dataset) - Europe and Africa",
       x = "Year",
       y = "Total Sales") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

#Visualizing df2 sales
ggplot(df2_sales_summary, aes(x = Year, y = Total.Sales, color = Region, group = Region)) +
  geom_line(size = 1) +
  geom_point(size = 2) +
  labs(title = "Sales Trend (Larger Dataset) - Europe and Africa",
       x = "Year",
       y = "Total Sales") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

```

The smaller dataset shows consistent sales across regions with minimal changes over the years. In contrast, the larger dataset reveals fluctuations, notably a peak in 2012, followed by a decline. Afterward, sales stabilized, exhibiting varying levels instead of the steady trends observed in the smaller dataset.

### Models

```{r}
#Loading libraries *if needed
#install.packages("randomForest")
library(randomForest)
library(dplyr)
library(lubridate)

#Setting a seed for reproducibility
set.seed(42)

#Preparing the data for df1 (smaller dataset)
df1_clean <- df1 %>% 
  select(Total.Revenue, Units.Sold, Unit.Price, Unit.Cost, Order.Date) %>% 
  na.omit() %>%
  mutate(Year = year(Order.Date), 
         Month = month(Order.Date))  #Extracting Year and Month

#Random Forest model for df1 including Date features
rf_model_df1 <- randomForest(Total.Revenue ~ Units.Sold + Unit.Price + Unit.Cost + Year + Month, 
                              data = df1_clean, 
                              ntree = 500)  #Using 500 trees

#Printing the model summary for df1
print(rf_model_df1)



```

```{r}
#Preparing the data for df2 (larger dataset)
df2_clean <- df2 %>% 
  select(Total.Revenue, Units.Sold, Unit.Price, Unit.Cost, Order.Date) %>% 
  na.omit() %>%
  mutate(Year = year(Order.Date), 
         Month = month(Order.Date))  #Extracting Year and Month

#Random Forest model for df2 including Date features
rf_model_df2 <- randomForest(Total.Revenue ~ Units.Sold + Unit.Price + Unit.Cost + Year + Month, 
                              data = df2_clean, 
                              ntree = 500)  #Using 500 trees

#Printing the model summary for df2
print(rf_model_df2)
```

The Random Forest model accounts for 94.12% of the variation in Total.Revenue, demonstrating strong predictive accuracy using features such as Units.Sold, Unit.Price, Unit.Cost, Year, and Month. This suggests the model effectively captures sales patterns, including seasonal and yearly trends, as reflected in higher sales during the summer months.

However, the large Mean Squared Residuals point to significant errors in a few instances, possibly due to outliers or extreme values that warrant further investigation.

Including Year and Month helps address time-related effects, enhancing the model's usefulness for revenue forecasting and identifying periods of high sales. With 500 trees, the model generalizes well, reducing the risk of overfitting.

This model offers reliable revenue predictions and emphasizes the role of time-based trends in shaping business decisions like inventory management and marketing strategies. 

### Conclusions

In conclusion, the synthetic dataset used for this analysis served its purpose of testing for machine learning model, such as the Random Forest regression. While the results, such as the high percentage of variance explained, demonstrate the model's effectiveness in predicting the target variable, it's essential to remember that this dataset is not based on real-world data. Thus, the insights generated from this analysis are purely for testing purposes. The experiment provided a valuable opportunity to explore and refine data processing, modeling, and visualization techniques, which can later be applied to real-world datasets with more significant implications.

### Essay


The code provided presents a detailed methodology for analyzing sales data from two datasets. This analysis is vital for businesses aiming to predict sales, recognize seasonal patterns, assess performance, and gain deeper insights into customer behavior. By identifying trends and patterns, organizations can make informed strategic choices, allocate resources efficiently, and enhance operations, ultimately resulting in improved financial performance and growth.

The initial segment of the code emphasizes the aggregation of total sales on a monthly basis. The datasets undergo cleaning to ensure the Order.Date column is correctly formatted and to filter out any rows lacking values in the Total.Revenue column. The cleaned datasets, named df1_clean and df2_clean, are subsequently organized by year and month. Total sales are summed for each month, facilitating a clear depiction of monthly trends. Line plots for both datasets are created using ggplot2, visually representing the sales trends over the months. Although the code indicates a peak in sales between 2011 and 2014, particularly during the summer months, the visualizations reveal broader trends and variations in sales over time.

A key aspect of the analysis also involves comparing sales across various regions. The code defines a function to analyze both online and offline sales by region, utilizing bar plots to showcase the differences. This analysis indicates that sales were higher in Europe and both African regions. By concentrating on total sales across regions instead of segregating them by sales channels, the overall sales volume for each region can be compared more effectively. This comparative analysis offers valuable insights into the geographic distribution of the company's market, guiding where to focus resources and marketing strategies.

The analysis further explores sales trends in specific regions, including Europe and the two African regions: the Middle East and North Africa, and Sub-Saharan Africa. The data is filtered to include only these areas, and total sales are aggregated yearly. The resulting visualizations demonstrate that the smaller dataset shows consistent sales with minimal fluctuations, whereas the larger dataset exhibits greater variability, highlighted by a notable peak in 2012 followed by a downturn. Such analyses enable businesses to understand variations in sales performance across different regions and time periods, assisting in planning for potential market expansions or contractions.

The concluding section of the code implements a Random Forest model to predict Total.Revenue based on several factors, including Units.Sold, Unit.Price, Unit.Cost, and the date features (year and month). The model demonstrates strong predictive accuracy for the smaller dataset, explaining 94.12% of the variance in sales. In contrast, for the larger dataset, it accounts for only 76.68% of the variance, indicating that other influential factors might be at play that are not captured by the current variables. Nonetheless, the model's capacity to identify seasonal and yearly trends provides valuable insights for forecasting future sales and effectively planning business operations.

In summary, this analysis integrates data cleaning, aggregation, visualization, and predictive modeling to yield a comprehensive understanding of sales performance over time and across regions. The insights obtained can support strategic decision-making, enhance resource allocation, and ultimately foster business growth by pinpointing trends and addressing challenges impacting sales performance.

### References

Dataset source:
 
Downloads 18 – Sample CSV Files / Data Sets for Testing (till 5 Million Records) – Sales
Posted on August 26, 2017by Vijay A. Verma
Disclaimer – The datasets are generated through random logic in VBA. These are not real sales data and should not be used for any other purpose other than testing.

https://excelbianalytics.com/wp/downloads-18-sample-csv-files-data-sets-for-testing-sales/ 


