Overview

A sales strategy’s success and efficiency can be influenced by a variety of elements, including product type, geography, sales channel, and seasonality. Understanding which sales to prioritize in the fast-paced world of sales can dramatically optimize resource allocation and increase overall sales performance. High-priority sales often necessitate a fast response and can result in large revenue or improved customer retention.

Predicting the priority of sales in the large terrain of sales activities, particularly in a global setting, can be difficult. A sale’s priority level can be influenced by factors such as the region of sale, the type of goods, the sales channel (online/offline), and even the time of year.

This study dove deeply into the prediction of sales prioritization. We aimed to forecast whether a specific sale would have a critical priority by analyzing a small dataset of 1,000 sales and a large dataset of 50,000 sales. The feature variables included a variety of characteristics such as the sale region, item type, sales channel, units sold, unit cost, and time-related parameters such as the month and year of the sale.

As the target variable was binary, we built two machine learning models: Logistic Regression and Random Forest. Our data was split into training and testing sets using an 80/20 split. We trained the models on the training set and evaluated them on the held-out testing set, assuring an unbiased assessment of our prediction prowess for both datasets (small and large).


1. Data Preparation

# load data
df_1000 <- read.csv("https://raw.githubusercontent.com/ex-pr/DATA_622/main/HW%201/1000%20Sales%20Records.csv")
df_50000 <- read.csv("https://raw.githubusercontent.com/ex-pr/DATA_622/main/HW%201/50000%20Sales%20Records.csv")

1.1 Summary Statistics

The small dataset contained 1000 observations of 14 variables.

The large dataset contained 50000 observations of the same 14 variables.

Each record in both datasets described a single sales transaction in full, including the product type, sales channel, financial data, and pertinent dates. Specifically:

  • Region: Geographical region where the sale occurred (e.g. Asia, Europe, North America, etc.)
  • Country: The country where the sale took place
  • Item Type: Type of product sold (e.g. Cereal, Cosmetics, Fruits, etc.)
  • Sales Channel: The method through which the sale was made (offline/online)
  • Order Priority: The order’s priority (C for critical, H for high, M for medium, L for low)
  • Order Date: The date the order was placed
  • Order ID: A unique identifier for the order
  • Ship Date: The date the product was shipped
  • Units Sold: The number of units sold
  • Unit Price: The price per unit of the product
  • Unit Cost: The cost per unit of the product
  • Total Revenue: The total revenue from the sale
  • Total Cost: The total cost of the transaction
  • Total Profit: The total profit from the sale

The target variable Critical Priority would be derived from Order Priority: the value would be 1 for a critical order priority and 0 for a medium, low, or high priority.

As the target variable was binary, the following algorithms were considered, and the subsequent data preparation was based on this selection. The choice of algorithms wasn’t affected by the size of the data, only by the nature of the target variable.

Logistic regression: simple, interpretable, and efficient. It performs well when the relationship between the independent variables and the log-odds is linear, but may underperform when the data has complex, non-linear relationships.

Random Forest: can capture complicated, non-linear patterns in data, is resistant to overfitting, and can handle numerical as well as categorical data. But it is harder to interpret than simpler models, and large datasets can be computationally expensive.

The data source: https://excelbianalytics.com/wp/downloads-18-sample-csv-files-data-sets-for-testing-sales/

DT::datatable(
      df_1000[1:25,],
      extensions = c('Scroller'),
      options = list(scrollY = 350,
                     scrollX = 500,
                     deferRender = TRUE,
                     scroller = TRUE,
                     dom = 'lBfrtip',
                     fixedColumns = TRUE, 
                     searching = FALSE), 
      rownames = FALSE) 
DT::datatable(
      df_50000[1:25,],
      extensions = c('Scroller'),
      options = list(scrollY = 350,
                     scrollX = 500,
                     deferRender = TRUE,
                     scroller = TRUE,
                     dom = 'lBfrtip',
                     fixedColumns = TRUE, 
                     searching = FALSE), 
      rownames = FALSE) 

The tables below provided summary statistics for both datasets. There were no missing values in either.

The variables Units Sold, Total Cost, and Total Profit had wide ranges and could be a source of issues when building the models; scaling these variables would be one solution, as the sketch below illustrates.

It would be expected that the 50000 sales data would provide more variability and more unique values in the categorical columns, but the mean and median values of Units Sold, Total Revenue, Total Cost, and Total Profit were almost the same in both datasets.

Both the small and large datasets used in the assignment could introduce errors:

Large Data: could lead to overfitting. In this case, we might consider using regularization and validation techniques (see the sketch after this list).

Small Data: could lead to underfitting. We could also get a high variance in predictions.

print(dfSummary(df_1000, text.graph.col = FALSE, graph.col = FALSE, style = "grid", valid.col = FALSE), headings = FALSE, max.tbl.height = 300, footnote = NA, col.width=50, method="render")
Summary statistics for the 1000 Sales data (no missing values in any column):

  • Region [character]: Asia 136 (13.6%), Australia and Oceania 79 (7.9%), Central America and the Caribbean 99 (9.9%), Europe 267 (26.7%), Middle East and North Africa 138 (13.8%), North America 19 (1.9%), Sub-Saharan Africa 262 (26.2%).
  • Country [character]: 185 distinct values; most frequent: Cuba 11 (1.1%), Bahrain, Czech Republic, Malaysia, Zimbabwe 10 (1.0%) each, Belarus, Fiji, Mongolia, Niger, Poland 9 (0.9%) each; 175 others 904 (90.4%).
  • Item.Type [character]: 12 types; Beverages 101 (10.1%), Vegetables 97 (9.7%), Office Supplies 89 (8.9%), Baby Food 87 (8.7%), Personal Care 87 (8.7%), Snacks 82 (8.2%), Cereal 79 (7.9%), Clothes 78 (7.8%), Meat 78 (7.8%), Household 77 (7.7%), 2 others 145 (14.5%).
  • Sales.Channel [character]: Offline 520 (52.0%), Online 480 (48.0%).
  • Order.Priority [character]: C 262 (26.2%), H 228 (22.8%), L 268 (26.8%), M 242 (24.2%).
  • Order.Date [character]: 841 distinct dates; no single date appeared more than 3 times (0.3%).
  • Order.ID [integer]: mean 549681325 (sd 257133359); min 102928006 ≤ median 556609714 ≤ max 995529830; IQR 441620457 (CV 0.5); 1000 distinct values.
  • Ship.Date [character]: 835 distinct dates; no single date appeared more than 3 times (0.3%).
  • Units.Sold [integer]: mean 5054 (sd 2901.4); 13 ≤ 5184 ≤ 9998; IQR 5116.5 (CV 0.6); 960 distinct values.
  • Unit.Price [numeric]: mean 262.1 (sd 216); 9.3 ≤ 154.1 ≤ 668.3; IQR 340.2 (CV 0.8); 12 distinct values.
  • Unit.Cost [numeric]: mean 185 (sd 175.3); 6.9 ≤ 97.4 ≤ 525; IQR 206.7 (CV 0.9); 12 distinct values.
  • Total.Revenue [numeric]: mean 1327322 (sd 1486515); 2043.2 ≤ 754939.2 ≤ 6617210; IQR 1452311 (CV 1.1); 999 distinct values.
  • Total.Cost [numeric]: mean 936119.2 (sd 1162571); 1416.8 ≤ 464726.1 ≤ 5204978; IQR 976818.2 (CV 1.2); 999 distinct values.
  • Total.Profit [numeric]: mean 391202.6 (sd 383640.2); 532.6 ≤ 277226 ≤ 1726181; IQR 450080.7 (CV 1); 999 distinct values.
print(dfSummary(df_50000, text.graph.col = FALSE, graph.col = FALSE, style = "grid", valid.col = FALSE), headings = FALSE, max.tbl.height = 300, footnote = NA, col.width=50, method="render")
Summary statistics for the 50000 Sales data (no missing values in any column):

  • Region [character]: Asia 7348 (14.7%), Australia and Oceania 4017 (8.0%), Central America and the Caribbean 5451 (10.9%), Europe 12841 (25.7%), Middle East and North Africa 6128 (12.3%), North America 1099 (2.2%), Sub-Saharan Africa 13116 (26.2%).
  • Country [character]: 185 distinct values; most frequent: Trinidad and Tobago 321 (0.6%), Guinea 318, Cape Verde 315, Maldives 311, Finland 310, Democratic Republic of the Congo 308, Samoa 306, Malta 305, China 303, France 303 (0.6% each); 175 others 46900 (93.8%).
  • Item.Type [character]: 12 types; Fruits 4221 (8.4%), Meat 4221 (8.4%), Cosmetics 4193 (8.4%), Vegetables 4191 (8.4%), Personal Care 4186 (8.4%), Beverages 4173 (8.3%), Snacks 4163 (8.3%), Clothes 4155 (8.3%), Cereal 4141 (8.3%), Household 4139 (8.3%), 2 others 8217 (16.4%).
  • Sales.Channel [character]: Offline 24966 (49.9%), Online 25034 (50.1%).
  • Order.Priority [character]: C 12446 (24.9%), H 12471 (24.9%), L 12588 (25.2%), M 12495 (25.0%).
  • Order.Date [character]: 2766 distinct dates; no single date appeared more than 34 times (0.1%).
  • Order.ID [integer]: mean 549733027 (sd 260917894); min 100013196 ≤ median 550422394 ≤ max 999999463; IQR 452775335 (CV 0.5); 50000 distinct values.
  • Ship.Date [character]: 2811 distinct dates; no single date appeared more than 35 times (0.1%).
  • Units.Sold [integer]: mean 4999.6 (sd 2884.3); 1 ≤ 5017.5 ≤ 10000; IQR 4995.2 (CV 0.6); 9943 distinct values.
  • Unit.Price [numeric]: mean 265.7 (sd 216.9); 9.3 ≤ 154.1 ≤ 668.3; IQR 340.2 (CV 0.8); 12 distinct values.
  • Unit.Cost [numeric]: mean 187.3 (sd 175.6); 6.9 ≤ 97.4 ≤ 525; IQR 227.5 (CV 0.9); 12 distinct values.
  • Total.Revenue [numeric]: mean 1323716 (sd 1463891); 28 ≤ 781324.7 ≤ 6682032; IQR 1532155 (CV 1.1); 41172 distinct values.
  • Total.Cost [numeric]: mean 933157.4 (sd 1145548); 20.8 ≤ 467104 ≤ 5249075; IQR 1029753 (CV 1.2); 41154 distinct values.
  • Total.Profit [numeric]: mean 390558.7 (sd 377758.8); 7.2 ≤ 279536.4 ≤ 1738178; IQR 470135.8 (CV 1); 41163 distinct values.

1.2 Column change

To make the datasets easy to work with when building models and conducting EDA, a series of steps was taken, starting with converting the column names to lower case and substituting dots with underscores.

# A function to change the column names in a data frame to lower case and substituting dot with underscore

change_column_names <- function(df) {
  # Get the column names of the data frame
  
  colnames(df) <- colnames(df)  %>%
  str_to_lower()  %>%
  str_replace_all("\\.", "_")
  
  # Return the updated data frame
  return(df)
}
#Change column names, lower case, substitute dot with _
df_1000 <- change_column_names(df_1000)
df_50000 <- change_column_names(df_50000)

The Order Date and Ship Date may be important when building the model. These columns were transformed to the Date type (printed as yyyy-mm-dd).

# A function to transform character columns holding dates in the format "m/d/yyyy" to the Date type

transform_date_columns <- function(df) {
  # Get the column names 
  col_names <- colnames(df)
  
  # Go through each column in the dataframe
  for (col_name in col_names) {
    # Check if this is a character column 
    if (is.character(df[[col_name]])) {
      # Check if the column values are in the format m/d/yyyy (one- or two-digit day and month)
      if (any(grepl("^\\d{1,2}/\\d{1,2}/\\d{4}$", df[[col_name]]))) {
        # Transform the column values to the Date format; %Y matches four-digit years
        df[[col_name]] <- as.Date(df[[col_name]], format="%m/%d/%Y")
      }
    }
  }
  
  # Return the updated dataframe
  return(df)
}
# Transform the character columns with dates to a date format
df_1000 <- transform_date_columns(df_1000)
df_50000 <- transform_date_columns(df_50000)

The Country column was removed as it was already represented by Region. The Order ID column was removed as well: it was a unique identifier for a sale and wasn’t needed for the models.

# Delete columns country and order_id
df_1000 <- df_1000 %>%
          dplyr::select(-c(country, order_id))

df_50000 <- df_50000 %>%
          dplyr::select(-c(country, order_id))

1.3 New columns

New factor column:

  • critical_priority: Target variable derived from Order Priority. If the order priority was critical, the value was 1; 0 for the rest.

New numeric columns:

  • `ship_time`: The shipping duration, the difference between `Ship Date` and `Order Date`. It could be connected to the order priority.
  • `order_year`: Year extracted from the `Order Date`. It could affect the order priority.
  • `order_month`: Month extracted from the `Order Date`. It could affect the order priority.

The columns Order Date, Ship Date, and Order Priority were then removed, as the derived variables above replaced them and were easier to work with.

Some other columns were also created as a try-out. For example, a binary is_weekend, whose value was 1 if the day of the order was a weekend and 0 otherwise, or the column quarter to determine in which quarter of the year the order was placed. These new columns didn’t change the performance of the models and were removed to avoid overfitting the models.

# The following helper functions were tried out and later disabled (kept for reference).

# A function to add a new column "quarter" based on a date column:
# the quarter of the year corresponding to each date
# add_quarter_column <- function(df, date_col) {
#   # Check if the date column is in the data frame
#   if (!(date_col %in% colnames(df))) {
#     stop("No such date column.")
#   }
#   # Extract the quarter from the date column using lubridate::quarter
#   df$quarter <- factor(quarter(df[[date_col]]))
#   # Return the updated data frame with the new "quarter" column
#   return(df)
# }

# A function to add a dummy variable indicating if the day of the order date
# was a weekend (the value 1) or not (the value 0)
# add_weekend_indicator <- function(df, date_col) {
#   # Check if the specified column exists
#   if (!(date_col %in% colnames(df))) {
#     stop("No such date column.")
#   }
#   # Add a new dummy variable 'is_weekend'
#   df$is_weekend <- factor(ifelse(weekdays(df[[date_col]]) %in% c("Saturday", "Sunday"), 1, 0))
#   # Return the updated dataframe
#   return(df)
# }

# A function to add a dummy variable indicating if the sales channel was online:
# the value in the new column would be 1, otherwise 0
# add_online_indicator <- function(df, channel_column) {
#   # Check if the specified column exists
#   if (!(channel_column %in% colnames(df))) {
#     stop("No such channel column.")
#   }
#   # Add a new dummy variable 'is_online'
#   df$is_online <- factor(ifelse(df[[channel_column]] == "Online", 1, 0))
#   # Return the updated data frame
#   return(df)
# }
# Create new column ship_time

df_1000$ship_time <- as.numeric(difftime(df_1000$ship_date, df_1000$order_date, units = "days"))

df_50000$ship_time <- as.numeric(difftime(df_50000$ship_date, df_50000$order_date, units = "days"))
# Extract month and year from order date 
df_1000$order_month <- month(df_1000$order_date)
df_1000$order_year <- year(df_1000$order_date)

df_50000$order_month <- month(df_50000$order_date)
df_50000$order_year <- year(df_50000$order_date)
# Create 'critical_priority' column based on 'Order Priority', remove date columns
df_1000$critical_priority <- ifelse(df_1000$order_priority == "C", '1', '0')


df_1000 <- df_1000 %>%
    dplyr::select(-c(order_priority, ship_date, order_date))

df_50000$critical_priority <- ifelse(df_50000$order_priority == "C", '1', '0')

df_50000 <- df_50000 %>%
    dplyr::select(-c(order_priority, ship_date, order_date))

1.4 Change column type

Character columns (e.g. region, item_type) were transformed to factors to be used in the models, as was the target variable (critical_priority).

# A function to transform all character columns to factors

transform_character_columns <- function(df) {
  # Identify the character columns in the data frame
  char_cols <- sapply(df, is.character)
  
  # Transform the character columns to factor columns
  df[char_cols] <- lapply(df[char_cols], as.factor)
  
  # Return the updated data frame
  return(df)
}
# Transform to factor character columns
df_1000 <- transform_character_columns(df_1000)
df_50000 <- transform_character_columns(df_50000)

2. Data Exploration

The structure and columns of both datasets were the same. The resulting datasets contained 13 variables.

The data contained inherent dependencies. For example, total_revenue could be calculated by multiplying units_sold by unit_price, and total_profit by subtracting total_cost from total_revenue (a quick verification sketch follows below).

There was no specific label for supervised learning. However, we created a new variable from order_priority called critical_priority to predict if the order was of a critical priority based on other features.

The consistent structure facilitated a uniform analysis.

print(dfSummary(df_1000, text.graph.col = FALSE, graph.col = FALSE, style = "grid", valid.col = FALSE), headings = FALSE, max.tbl.height = 300, footnote = NA, col.width=50, method="render")
Summary statistics for the transformed 1000 Sales data (no missing values in any column):

  • region [factor]: Asia 136 (13.6%), Australia and Oceania 79 (7.9%), Central America and the Caribbean 99 (9.9%), Europe 267 (26.7%), Middle East and North Africa 138 (13.8%), North America 19 (1.9%), Sub-Saharan Africa 262 (26.2%).
  • item_type [factor]: Baby Food 87 (8.7%), Beverages 101 (10.1%), Cereal 79 (7.9%), Clothes 78 (7.8%), Cosmetics 75 (7.5%), Fruits 70 (7.0%), Household 77 (7.7%), Meat 78 (7.8%), Office Supplies 89 (8.9%), Personal Care 87 (8.7%), 2 others 179 (17.9%).
  • sales_channel [factor]: Offline 520 (52.0%), Online 480 (48.0%).
  • units_sold [integer]: mean 5054 (sd 2901.4); 13 ≤ 5184 ≤ 9998; IQR 5116.5 (CV 0.6); 960 distinct values.
  • unit_price [numeric]: mean 262.1 (sd 216); 9.3 ≤ 154.1 ≤ 668.3; IQR 340.2 (CV 0.8); 12 distinct values.
  • unit_cost [numeric]: mean 185 (sd 175.3); 6.9 ≤ 97.4 ≤ 525; IQR 206.7 (CV 0.9); 12 distinct values.
  • total_revenue [numeric]: mean 1327322 (sd 1486515); 2043.2 ≤ 754939.2 ≤ 6617210; IQR 1452311 (CV 1.1); 999 distinct values.
  • total_cost [numeric]: mean 936119.2 (sd 1162571); 1416.8 ≤ 464726.1 ≤ 5204978; IQR 976818.2 (CV 1.2); 999 distinct values.
  • total_profit [numeric]: mean 391202.6 (sd 383640.2); 532.6 ≤ 277226 ≤ 1726181; IQR 450080.7 (CV 1); 999 distinct values.
  • ship_time [numeric]: mean 25 (sd 14.6); 0 ≤ 25 ≤ 50; IQR 25 (CV 0.6); 51 distinct values.
  • order_month [numeric]: mean 6.3 (sd 3.5); 1 ≤ 6 ≤ 12; IQR 6 (CV 0.5); 12 distinct values.
  • order_year [numeric]: mean 2013.2 (sd 2.2); 2010 ≤ 2013 ≤ 2017; IQR 4 (CV 0); 2010: 140 (14.0%), 2011: 121 (12.1%), 2012: 141 (14.1%), 2013: 137 (13.7%), 2014: 146 (14.6%), 2015: 123 (12.3%), 2016: 123 (12.3%), 2017: 69 (6.9%).
  • critical_priority [factor]: 0 738 (73.8%), 1 262 (26.2%).
print(dfSummary(df_50000, text.graph.col = FALSE, graph.col = FALSE, style = "grid", valid.col = FALSE), headings = FALSE, max.tbl.height = 300, footnote = NA, col.width=50, method="render")
Summary statistics for the transformed 50000 Sales data (no missing values in any column):

  • region [factor]: Asia 7348 (14.7%), Australia and Oceania 4017 (8.0%), Central America and the Caribbean 5451 (10.9%), Europe 12841 (25.7%), Middle East and North Africa 6128 (12.3%), North America 1099 (2.2%), Sub-Saharan Africa 13116 (26.2%).
  • item_type [factor]: Baby Food 4078 (8.2%), Beverages 4173 (8.3%), Cereal 4141 (8.3%), Clothes 4155 (8.3%), Cosmetics 4193 (8.4%), Fruits 4221 (8.4%), Household 4139 (8.3%), Meat 4221 (8.4%), Office Supplies 4139 (8.3%), Personal Care 4186 (8.4%), 2 others 8354 (16.7%).
  • sales_channel [factor]: Offline 24966 (49.9%), Online 25034 (50.1%).
  • units_sold [integer]: mean 4999.6 (sd 2884.3); 1 ≤ 5017.5 ≤ 10000; IQR 4995.2 (CV 0.6); 9943 distinct values.
  • unit_price [numeric]: mean 265.7 (sd 216.9); 9.3 ≤ 154.1 ≤ 668.3; IQR 340.2 (CV 0.8); 12 distinct values.
  • unit_cost [numeric]: mean 187.3 (sd 175.6); 6.9 ≤ 97.4 ≤ 525; IQR 227.5 (CV 0.9); 12 distinct values.
  • total_revenue [numeric]: mean 1323716 (sd 1463891); 28 ≤ 781324.7 ≤ 6682032; IQR 1532155 (CV 1.1); 41172 distinct values.
  • total_cost [numeric]: mean 933157.4 (sd 1145548); 20.8 ≤ 467104 ≤ 5249075; IQR 1029753 (CV 1.2); 41154 distinct values.
  • total_profit [numeric]: mean 390558.7 (sd 377758.8); 7.2 ≤ 279536.4 ≤ 1738178; IQR 470135.8 (CV 1); 41163 distinct values.
  • ship_time [numeric]: mean 25 (sd 14.7); 0 ≤ 25 ≤ 50; IQR 26 (CV 0.6); 51 distinct values.
  • order_month [numeric]: mean 6.4 (sd 3.4); 1 ≤ 6 ≤ 12; IQR 6 (CV 0.5); 12 distinct values.
  • order_year [numeric]: mean 2013.3 (sd 2.2); 2010 ≤ 2013 ≤ 2017; IQR 4 (CV 0); 2010: 6594 (13.2%), 2011: 6757 (13.5%), 2012: 6634 (13.3%), 2013: 6523 (13.0%), 2014: 6596 (13.2%), 2015: 6570 (13.1%), 2016: 6551 (13.1%), 2017: 3775 (7.6%).
  • critical_priority [factor]: 0 37554 (75.1%), 1 12446 (24.9%).

2.1 Numeric variables

Units Sold: Roughly uniform distribution; sales records spanned the full range around the 5,000 level, with some records having extremely high or extremely low unit sales.

Unit Price, Unit Cost: The distributions were multi-modal, reflecting the distinct price and cost points of the 12 item types.

Ship Time: The majority of shipping times were centered around 25 days, although there was a broad range.

Total Revenue: There was a right skew, suggesting that the majority of sales records generated revenue of less than $2.5 million, but a few records had significantly larger revenue.

Total Cost: The distribution was right-skewed, as was Total Revenue.

Total Profit: Right-skewed; the distribution showed that the majority of profit values were under $1 million, though some records had significantly larger profit.

The distributions of the numeric values were similar for the small and large data.

# A function to create histograms and density plots for numeric columns.
# The column names are changed for a better representation on the plots: underscores are substituted with spaces and the first letter of each word is capitalized.
# Histograms and density plots are then created for each numeric variable and displayed using facet_wrap.
# A title is added for each data frame.

density_plot_function <-function(df, title) {
  
  # Select numeric columns and rename them
  m_df <- df %>% dplyr::select_if(is.numeric) %>% rename_all(~str_replace_all(., "_", " ") %>%
                tools::toTitleCase()) %>% melt()

   # Create histograms with the density ggplot
  m_df %>% ggplot() + 
  geom_histogram(aes(x=value, y = ..density..), alpha=0.7, fill="gray", colour='gray') +
    geom_density(aes(x=value), color='purple', size=1) + facet_wrap(~variable, scales = 'free',  ncol = 3) + 
    theme_minimal()+
    labs(title=title, x = 'Variables', y = 'Values') 
}
# Create density plots for numeric columns
density_plot_function(df_1000, 'Distributions for 1000 Sales Records')

density_plot_function(df_50000, 'Distributions for 50000 Sales Records')

Total Revenue vs. Units Sold: There was a strong positive association between the number of units sold and the total revenue earned by those sales: as the number of units sold increased, so did the overall revenue. This was understandable given that revenue is directly proportional to the number of units sold multiplied by the unit price.

Total Revenue vs. Unit Price: There was no obvious linear relationship between total revenue and unit price. While higher unit prices could lead to higher revenue, the quantity sold also played a role: some lower-priced items might have had higher sales quantities, resulting in equivalent revenue.

The scatter plots were similar for the small and large data.

# A function to create scatter plots to visualize a grid:
# Units Sold vs Total Revenue
# Unit Price vs Total Revenue
# Total Cost vs Total Profit
# 

plot_relationships <- function(df, title) {
  # Plot for Units Sold vs Total Revenue
  a1 <- ggplot(df, aes(x=units_sold, y=total_revenue)) +
    geom_point(alpha=0.5, color="blue") +
    labs(
        x = "Unit Sold",
        y = "Total Revenue") +
    theme_minimal()
  
  # Plot for Unit Price vs Total Revenue
  a2 <- ggplot(df, aes(x=unit_price, y=total_revenue)) +
    geom_point(alpha=0.5, color="blue") +
    labs(
        x = "Unit Price",
        y = "Total Revenue") +
    theme_minimal()
  
  # Plot for Total Cost vs Total Profit
  a3 <- ggplot(df, aes(x=total_cost, y=total_profit)) +
    geom_point(alpha=0.5, color="blue") +
    labs(
        x = "Total Cost",
        y = "Total Profit") +
    theme_minimal()
  
  # Arrange the plots in a grid
  grid.arrange(a1, a2, a3, ncol=3, top=title)
}
# Scatter plots for numeric columns
plot_relationships(df_1000, 'Relationships for 1000 Sales Records')

plot_relationships(df_50000, 'Relationships for 50000 Sales Records')

Total Profit vs. Units Sold: The scatter plot demonstrated an expected positive linear relationship between the number of Units Sold and Total Profit.

The Order Priority hue offered further information about how the different order priorities were spread among sales records. It was worth noting that a higher order priority didn’t always equate to more units sold or bigger profitability.

The scatter plots were similar for the small and large data.

# Scatter plot for Units Sold vs Total Profit colored by Order Priority
ggplot(df_1000, aes(x=units_sold, y=total_profit, color=critical_priority)) +
    geom_point() +
    scale_color_viridis_d(option = "inferno") +       
    labs(
        x = "Units Sold",
        y = "Total Profit", 
        title = 'Units Sold vs Total Profit colored by Order Priority, 1000 sales') +
    theme_minimal()

ggplot(df_50000, aes(x=units_sold, y=total_profit, color=critical_priority)) +
    geom_point() +
    scale_color_viridis_d(option = "inferno") +       
    labs(
        x = "Units Sold",
        y = "Total Profit", 
        title = 'Units Sold vs Total Profit colored by Order Priority, 50000 sales') +
    theme_minimal()

2.2 Categorical variables

Order Priority Count: On the graph with the distribution of records based on order priority, the majority of orders did not have a critical priority.

Total Profit by Order Priority: On the box plot with the distribution of Total Profit across the order priorities, profits varied widely within each priority, while the median profit appeared to be consistent across them.

Units Sold by Order Priority: There was no apparent trend demonstrating that a specific order priority regularly generated greater or lower sales. Although both priorities had a wide range of sales, the median sales appeared to be larger for the non-critical priority.

The large dataset was more balanced on the binary target than the small one.

# A function to create 3 box plots that compare sales and order priority data:
# The number of orders by order priority.
# The total profit by order priority.
# The units sold by order priority.

priority_sales_function <- function(df, title) {

# Bar plot for 'Order Priority'
a2 <- ggplot(df, aes(x=critical_priority)) +
    geom_bar(fill=viridis(1)) + 
    labs(
        x = "Order Priority",
        y = "Count", 
        title = "Count by Order Priority") +
    theme_minimal() + theme(legend.position = "none")


# Box plot for 'Total Profit' by 'Order Priority'
a4 <- ggplot(df, aes(x=critical_priority, y=total_profit, fill=critical_priority)) +
    geom_boxplot() +
    scale_fill_viridis_d() +
    labs(
        x = "Order Priority",
        y = "Total Profit", 
        title = "Total Profit by Order Priority") +
    theme_minimal() +  theme(legend.position = "none")


a6 <- ggplot(df, aes(x=critical_priority, y=units_sold, fill=critical_priority)) +
    geom_boxplot() +
    scale_fill_viridis_d() +       
        labs(
        x = "Order Priority",
        y = "Units Sold", 
        title = "Units Sold by Order Priority") +
    theme_minimal() +  theme(legend.position = "none")


# Print the plots
grid.arrange(a2, a4, ncol=2, top=title)
grid.arrange(a6, ncol=1)
}
# Bar and box plots for order priority
priority_sales_function(df_1000, "Order Priority for 1000 sales")

priority_sales_function(df_50000, "Order Priority for 50000 sales")

Critical priority sales were distributed similarly across both online and offline sales channels, with non-critical sales being more common.

# Distribution of Critical Priority by Sales Channel
ggplot(df_1000, aes(x = sales_channel, fill = critical_priority)) +
  geom_bar(position = "dodge") +
  labs(
    x = 'Sales Channel', 
    y = 'Count', 
    title = 'Critical Priority by Sales Channel, 1000 Sales') +
  theme_minimal() +  theme(legend.position = "none")

ggplot(df_50000, aes(x = sales_channel, fill = critical_priority)) +
  geom_bar(position = "dodge") +
  labs(
    x = 'Sales Channel', 
    y = 'Count', 
    title = 'Critical Priority by Sales Channel, 50000 Sales') +
  theme_minimal() +  theme(legend.position = "none")

Count vs Region: The region of Sub-Saharan Africa had the most sales records, and within this region the number of sales with a critical priority was substantially smaller than those without. Europe and Central America and the Caribbean had a comparatively larger share of critical-priority sales.

Total Profit vs Region: The median profit was reasonably consistent across regions, although each region had a large range of profits, as evidenced by the length of the boxes and whiskers.

Units Sold by Region: The median number of units sold was rather consistent across regions, as shown on the box plot, similar to the profit distribution, though there was diversity within each region.

All the plots were almost the same for both datasets, except that the medians for units sold by region and total profit by region were more equal for the large dataset than for the small one.

# A function to create a set of three box plots that compare sales data by region:
# The number of orders by region.
# The total profit by region.
# The units sold by region.

byregion_function <- function(df, title) {
  # Reordering the levels of 'Region' based on their counts
  df$region <- factor(df$region, levels = names(sort(table(df$region), decreasing = TRUE)))

# Bar plot for 'Region'
a1 <- ggplot(df, aes(x = region, fill = critical_priority)) +
  geom_bar(position = "dodge") + 
      labs(
        x = "Region",
        y = "Count", 
        title = "Critical Priority by Region") +
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 45, hjust = 1), legend.position = "none")

# Box plot for 'Total Profit' by 'Region'
a2 <- ggplot(df, aes(x=region, y=total_profit, fill=region)) +
    geom_boxplot() +
    scale_fill_viridis_d() +
      labs(
        x = "Region",
        y = "Total Profit", 
        title = "Total Profit by Region") +
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 45, hjust = 1), legend.position = "none")

# Box plot for 'Units Sold' by 'Region'
a3 <- ggplot(df, aes(x=region, y=units_sold, fill=region)) +
    geom_boxplot() +
    scale_fill_viridis_d() +
        labs(
        x = "Region",
        y = "Units Sold", 
        title = "Units Sold by Region") +
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 45, hjust = 1), legend.position = "none")

grid.arrange(a1, ncol=1, top=title)
grid.arrange(a2, ncol=1)
grid.arrange(a3, ncol=1)
  
}
# Boxplots for Region distribution
byregion_function(df_1000, "Distribution by Region, 1000 Sales")

byregion_function(df_50000, "Distribution by Region, 50000 Sales")

Profitability by Item Type: The distribution of Total Profit among different item types showed the variation in profit across them. Some items, such as Cosmetics, had larger median profits, while others, such as Vegetables, had lower median earnings.

Units Sold by Item Type: The distribution of Units Sold across item types showed that the number of units sold varied depending on the product. For example, Baby Food had a greater median unit sale, whereas Cereal had a lower median.

Count by Item Type: The item types Cosmetics, Household, and Office Supplies appeared to have a higher number of critical-priority sales. Fruits, Vegetables, and Snacks, on the other hand, had a smaller proportion of critical-priority sales.

# A function to create a set of three plots to compare distributions by item type:
# The total profit by item type.
# The units sold by item type.
# The count of critical-priority sales by item type.

items_distribution <- function(df, title) {
  # Reordering the levels of 'Item Type' based on their counts
df$item_type <- factor(df$item_type, levels = names(sort(table(df$item_type), decreasing = TRUE)))

# Box plot for 'Total Profit' by 'Item Type'
a1 <- ggplot(df, aes(x=item_type, y=total_profit, fill=item_type)) +
    geom_boxplot() +
    scale_fill_viridis_d() +
        labs(
        x = "Item Type",
        y = "Total Profit", 
        title = "Total Profit by Item Type") +
    theme_minimal()+
    theme(axis.text.x = element_text(angle = 45, hjust = 1), legend.position = "none")

# Box plot for 'Units Sold' by 'Item Type'
a2 <- ggplot(df, aes(x=item_type, y=units_sold, fill=item_type)) +
    geom_boxplot() +
    scale_fill_viridis_d() +
        labs(
        x = "Item Type",
        y = "Units Sold", 
        title = "Units Sold by Item Type") +
    theme_minimal()+
    theme(axis.text.x = element_text(angle = 45, hjust = 1), legend.position = "none")

# Distribution of Critical Priority by Item Type
a3 <- ggplot(df, aes(x = item_type, fill = critical_priority)) +
  geom_bar(position = "dodge") +
  labs(
    x ="'Item Type", 
    y = "Count", 
    title = 'Critical Priority by Item Type') +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))


# Print the plots
grid.arrange(a1, ncol=1, top=title)
grid.arrange(a2, ncol=1)
grid.arrange(a3, ncol=1)
}
# Boxplots for item distribution
items_distribution(df_1000, "Item Type distribution for 1000 sales")

items_distribution(df_50000, "Item Type distribution for 50000 sales")

Count vs. Order Year: The data ranged from 2010 to 2017, with the quantity of sales records fluctuating from year to year.

The next graph showed the monthly sales trend from 2010 to 2017. There appeared to be a repeating pattern, with some months typically seeing more sales than others. The magnitude of sales varied from year to year. In most years, sales peaked around the middle of the year, particularly around June and July.

# Order Year count

ggplot(df_50000, aes(x = as.factor(order_year))) +
  geom_bar() + 
  labs(title = 'Order Year Distribution, 50000 Sales', x = 'Order Year', y = 'Count') +
  theme(text = element_text(size = 12),
        axis.text.x = element_text(angle = 45, hjust = 1),
        plot.title = element_text(hjust = 0.5, size = 15)) +
  theme_minimal() 

# Aggregate data by order year and order month
sales <- df_50000 %>%
  group_by(order_year, order_month) %>%
  summarise(sales = n()) %>%
  ungroup()

# Monthly Sales Trend
ggplot(sales, aes(x = order_month, y = sales, group = order_year, color = as.factor(order_year))) +
  geom_line(aes(linetype = as.factor(order_year))) +
  geom_point(shape = 21, fill = "white") +
  scale_x_continuous(breaks = 1:12, labels = c('Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec')) +
  labs(title = 'Monthly Sales Trend (2010-2017)', x = 'Month', y = 'Count') +
  theme(text = element_text(size = 12),
        plot.title = element_text(hjust = 0.5, size = 15),
        legend.position = "top",
        legend.title = element_text(size = 12)) +
  guides(color = guide_legend(title = "Year"))

2.3 Correlation

The Unit Price and Unit Cost had a significant connection, which made sense given that cost and price were frequently associated. Total Revenue, Total Cost, and Total Profit were likewise closely connected, which was to be expected given that they were derived from one another. Units Sold had a strong relationship with Total Revenue, Total Cost, and Total Profit. This made sense because the number of units sold had a direct influence on revenue, cost, and profit.

Region and Country were entirely dependent on one another for an obvious reason, as were Order Date and Ship Date, which were naturally correlated and possibly additionally affected by the item type.

The correlation matrix for 1000 sales.

rcore <- rcorr(as.matrix(df_1000 %>% dplyr::select(where(is.numeric))))
coeff <- rcore$r
corrplot(coeff, tl.cex = .7, tl.col="black", method = 'color', addCoef.col = "black",
         type="upper", order="hclust",
         diag=FALSE)

The correlation matrix for 50000 sales.

rcore <- rcorr(as.matrix(df_50000 %>% dplyr::select(where(is.numeric))))
coeff <- rcore$r
corrplot(coeff, tl.cex = .7, tl.col="black", method = 'color', addCoef.col = "black",
         type="upper", order="hclust",
         diag=FALSE)

3 Data Transformation

Finally, we split the data into train (80%) and test (20%) sets to evaluate model performance before proceeding to prediction. For the small dataset, the train data contained 800 sales and the test data 200 sales.

Slightly different variations were used depending on each model’s needs.

The preprocessing recipe and sample.split() were used for the simple Logistic Regression. The preprocessing consisted of dummy-encoding the categorical variables (region, item_type, sales_channel) to make the model’s coefficients more interpretable.

For the 1000 Sales data, the response variable had a balance of 74% for the 0 response and 26% for the 1 response. This was the result of the distribution of the original Order Priority column, where the 4 values (C, H, M, L) each made up roughly a quarter of records. For further analysis, it could be an option to try balancing the response variable.

With imbalanced data, most machine learning models predict the majority class more efficiently than the minority class. To address this behavior, we utilized the SMOTE() function on the data in order to achieve more even accuracy rates between classes. The SMOTE function oversampled the minority class, using bootstrapping and k-nearest neighbors to synthetically create additional observations of that class. We ran the models both with and without the imbalance addressed.

temp_1 <- df_1000


# Creating recipe to encode all categorical variables except for a target variable
preprocess_factors <- recipe(critical_priority ~ ., data = temp_1) %>%
  step_dummy(all_nominal(), -all_outcomes(), one_hot = FALSE)

# Fit the preprocessing technique 
preprocess_factors_prep <- prep(preprocess_factors)


# Encode factor variables
temp_1_transformed <- bake(preprocess_factors_prep, new_data = temp_1)
temp_1_transformed <- temp_1_transformed  %>%
  mutate_at(vars(starts_with("item_type"), starts_with("region"), starts_with("sales_channel")), as.factor)

# random seed
set.seed(42)

# 80/20 split of the data set
sample <- sample.split(temp_1_transformed$critical_priority, SplitRatio = 0.8)
train_data  <- subset(temp_1_transformed, sample == TRUE)
test_data   <- subset(temp_1_transformed, sample == FALSE)

# Check dimensions of train and test data
dim(train_data)
## [1] 800  28
dim(test_data)
## [1] 200  28
# Check class distribution of original, train, and test sets
round(prop.table(table(dplyr::select(temp_1_transformed, critical_priority), exclude = NULL)), 4) * 100
## 
##    0    1 
## 73.8 26.2
round(prop.table(table(dplyr::select(train_data, critical_priority), exclude = NULL)), 4) * 100
## 
##     0     1 
## 73.75 26.25
round(prop.table(table(dplyr::select(test_data, critical_priority), exclude = NULL)), 4) * 100
## 
##  0  1 
## 74 26

For the large dataset, the train data contained 40,000 sales and the test data 10,000 sales.

The response variable had a balance of 75% for the 0 response and 25% for the 1 response. This was the result of the distribution of the original Order Priority column, where the 4 values (C, H, M, L) each made up roughly a quarter of records.

temp_3 <- df_50000


# Creating recipe to encode all categorical variables except for a target variable
preprocess_factors <- recipe(critical_priority ~ ., data = temp_3) %>%
  step_dummy(all_nominal(), -all_outcomes(), one_hot = FALSE)

# Fit the preprocessing technique 
preprocess_factors_prep <- prep(preprocess_factors)


# Encode factor variables
temp_3_transformed <- bake(preprocess_factors_prep, new_data = temp_3)
temp_3_transformed <- temp_3_transformed  %>%
  mutate_at(vars(starts_with("item_type"), starts_with("region"), starts_with("sales_channel")), as.factor)

# random seed
set.seed(42)

# 80/20 split of the data set
sample <- sample.split(temp_3_transformed$critical_priority, SplitRatio = 0.8)
train_data_large  <- subset(temp_3_transformed, sample == TRUE)
test_data_large   <- subset(temp_3_transformed, sample == FALSE)

# Check dimensions of train and test data
dim(train_data_large)
## [1] 40000    28
dim(test_data_large)
## [1] 10000    28
# Check class distribution of original, train, and test sets
round(prop.table(table(dplyr::select(temp_3_transformed, critical_priority), exclude = NULL)), 4) * 100
## 
##     0     1 
## 75.11 24.89
round(prop.table(table(dplyr::select(train_data_large, critical_priority), exclude = NULL)), 4) * 100
## 
##     0     1 
## 75.11 24.89
round(prop.table(table(dplyr::select(test_data_large, critical_priority), exclude = NULL)), 4) * 100
## 
##     0     1 
## 75.11 24.89

The train and test data frames were transformed into the corresponding matrix and response vector (X, y) for the k-fold cross-validation. The SMOTE() function was applied to the training data in order to achieve more even accuracy rates between classes.

temp_2 <- df_1000

# random seed
set.seed(42)

# 80/20 split of the data set
sample <- sample.split(temp_2$critical_priority, SplitRatio = 0.8)
df_train  <- subset(temp_2, sample == TRUE)
df_test   <- subset(temp_2, sample == FALSE)

# Fix imbalance
split_df <- as.data.frame(df_train)
split_df$critical_priority <- as.factor(split_df$critical_priority)
split_df <- SMOTE(critical_priority ~ ., split_df, perc.over = 100, perc.under=200)
split_df$critical_priority <- as.factor(split_df$critical_priority)

# Transform to vectors and the corresponding matrix
x_train <- model.matrix(critical_priority ~ ., data=split_df)[,-1]
y_train <- split_df[,"critical_priority"] 

x_test <- model.matrix(critical_priority ~ ., data=df_test)[,-1]
y_test <- df_test[,"critical_priority"] 
temp_4 <- df_50000

# random seed
set.seed(42)


# 80/20 split of the data set
sample <- sample.split(temp_4$critical_priority, SplitRatio = 0.8)
df_train_large  <- subset(temp_4, sample == TRUE)
df_test_large   <- subset(temp_4, sample == FALSE)

# Fix imbalance
split_df_large <- as.data.frame(df_train_large)
split_df_large$critical_priority <- as.factor(split_df_large$critical_priority)
split_df_large  <- SMOTE(critical_priority ~ ., split_df_large, perc.over = 100, perc.under=200)
split_df_large$critical_priority <- as.factor(split_df_large$critical_priority)


# Transform to vectors and the corresponding matrix
x_train_large <- model.matrix(critical_priority ~ ., data=split_df_large)[,-1]
y_train_large <- split_df_large[,"critical_priority"] 

x_test_large <- model.matrix(critical_priority ~ ., data=df_test_large)[,-1]
y_test_large <- df_test_large[,"critical_priority"] 

For the Random Forest model, the untransformed datasets were used. The SMOTE() function was applied to the training data in order to achieve more even accuracy rates between classes.

set.seed(42)

splitIndex <- sample.split(df_1000$critical_priority, SplitRatio = 0.8)
train_data_rf <-subset(df_1000, splitIndex == TRUE)
test_data_rf <- subset(df_1000, splitIndex == FALSE)

# Fix imbalance
split_rf <- as.data.frame(train_data_rf)
split_rf$critical_priority <- as.factor(split_rf$critical_priority)
split_rf <- SMOTE(critical_priority ~ ., split_rf, perc.over = 100, perc.under=200)
split_rf$critical_priority <- as.factor(split_rf$critical_priority)
set.seed(42)

splitIndex <- sample.split(df_50000$critical_priority, SplitRatio = 0.8)
train_data_large_rf <-subset(df_50000, splitIndex == TRUE)
test_data_large_rf <- subset(df_50000, splitIndex == FALSE)

# Fix imbalance
split_rf_large <- as.data.frame(train_data_large_rf)
split_rf_large$critical_priority <- as.factor(split_rf_large$critical_priority)
split_rf_large <- SMOTE(critical_priority ~ ., split_rf_large, perc.over = 100, perc.under=200)
split_rf_large$critical_priority <- as.factor(split_rf_large$critical_priority)

4 Models

4.1 Logistic regression - 1000 Sales

As the first step, we built a generalized linear model on the transformed training dataset, with categorical variables encoded as dummy variables. The variables total_profit, total_cost, and total_revenue were dropped as they were connected with each other and derived from units_sold, unit_price, and unit_cost. The variables item_type_Vegetables and item_type_Snacks were removed because the model raised an error when predicting the probabilities: the rank of the model matrix must be at least equal to the number of parameters, and here it was not. Since our dependent variable took only two values (0 and 1), we used logistic regression. To do so, the function glm() with family = binomial was used. A positive coefficient meant that as the predictor variable grew, so did the log odds of the outcome occurring, increasing the likelihood of the outcome. A negative coefficient implied the inverse.

set.seed(42)
# Build logistic regression model
log_model <- glm(critical_priority ~ . -total_profit - total_revenue - total_cost - item_type_Vegetables - item_type_Snacks, data = train_data, binomial(link = "logit"), control = list(maxit = 1000))

summ(log_model)
Observations 800
Dependent variable critical_priority
Type Generalized linear model
Family binomial
Link logit
χ²(22) 40.59
Pseudo-R² (Cragg-Uhler) 0.07
Pseudo-R² (McFadden) 0.04
AIC 926.46
BIC 1034.20
Est. S.E. z val. p
(Intercept) 66.65 78.95 0.84 0.40
units_sold -0.00 0.00 -3.24 0.00
unit_price 0.06 0.04 1.78 0.08
unit_cost -0.11 0.05 -1.99 0.05
ship_time -0.00 0.01 -0.45 0.65
order_month -0.04 0.02 -1.72 0.09
order_year -0.03 0.04 -0.84 0.40
region_Australia.and.Oceania1 -0.21 0.41 -0.51 0.61
region_Central.America.and.the.Caribbean1 -0.05 0.37 -0.15 0.88
region_Europe1 0.19 0.28 0.67 0.50
region_Middle.East.and.North.Africa1 0.73 0.31 2.36 0.02
region_North.America1 0.80 0.61 1.30 0.19
region_Sub.Saharan.Africa1 0.24 0.28 0.86 0.39
item_type_Beverages1 -0.28 0.71 -0.40 0.69
item_type_Cereal1 -0.79 0.64 -1.24 0.22
item_type_Clothes1 -3.74 1.62 -2.31 0.02
item_type_Cosmetics1 0.14 1.25 0.11 0.91
item_type_Fruits1 -0.54 0.82 -0.65 0.52
item_type_Household1 11.42 4.44 2.57 0.01
item_type_Meat1 12.21 5.38 2.27 0.02
item_type_Office.Supplies1 15.78 6.14 2.57 0.01
item_type_Personal.Care1 0.30 0.74 0.40 0.69
sales_channel_Online1 0.07 0.17 0.43 0.67
Standard errors: MLE

The summary table showed only a few variables significant at the 5% level (units_sold, unit_cost, and several region and item-type dummies such as region_Middle.East.and.North.Africa, item_type_Clothes, item_type_Household, item_type_Meat, and item_type_Office.Supplies); most coefficients were not significant.

The Akaike information criterion (AIC) was 926.46, which could be high; it needed to be compared with other models.

The accuracy of the model was 71.5%. Based on the confusion matrix, the model didn’t perform well in predicting critical priority, as shown by the precision and recall values for class 1 (sensitivity was only 5.8%). The F1 score, the harmonic mean of precision and recall, was 0.095, confirming the poor performance on the positive class.

The null deviance of 921.05 described how well the target variable could be predicted by a model with only an intercept term.

The residual deviance of 880.46 described how well the target variable could be predicted by our current model fit with the predictor variables mentioned above. The lower the value, the better the model could predict the response.

The AUC for the ROC curve was 0.52 (the ability of a binary classifier to distinguish between classes): the model ranked a random positive example higher than a random negative example only 52% of the time, barely better than chance.

# Predict using test data
log_model_prob <- predict(log_model, newdata = test_data, type = "response")
log_model_pred <- ifelse(log_model_prob  > 0.5, 1, 0) 

# Evaluate the Logistic Regression model
conf_matrix_1 <- confusionMatrix(factor(log_model_pred), factor(test_data$critical_priority), "1")


results <- tibble(Model = "Model #1 - Log Regression 1000 Sales", Accuracy=conf_matrix_1$overall[1], 
                  "Classification error rate" = 1 - conf_matrix_1$overall[1],
                  F1 = conf_matrix_1$byClass[7],
                  Deviance= log_model$deviance, 
                  R2 = 1 - log_model$deviance / log_model$null.deviance,
                  Sensitivity = conf_matrix_1$byClass["Sensitivity"],
                  Specificity = conf_matrix_1$byClass["Specificity"],
                  Precision =  conf_matrix_1$byClass["Precision"],
                  AIC= log_model$aic, 
                  ROC = auc(roc(test_data$critical_priority, log_model_pred)))

conf_matrix_1
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 140  49
##          1   8   3
##                                           
##                Accuracy : 0.715           
##                  95% CI : (0.6471, 0.7764)
##     No Information Rate : 0.74            
##     P-Value [Acc > NIR] : 0.8131          
##                                           
##                   Kappa : 0.0049          
##                                           
##  Mcnemar's Test P-Value : 1.17e-07        
##                                           
##             Sensitivity : 0.05769         
##             Specificity : 0.94595         
##          Pos Pred Value : 0.27273         
##          Neg Pred Value : 0.74074         
##              Prevalence : 0.26000         
##          Detection Rate : 0.01500         
##    Detection Prevalence : 0.05500         
##       Balanced Accuracy : 0.50182         
##                                           
##        'Positive' Class : 1               
## 

The feature importance plot below, based on the absolute coefficient values, showed that the item types Office Supplies, Meat, and Household had the largest positive influence, indicating that sales of these items were more likely to be predicted as critical priority. Item type Clothes, on the other hand, had the largest negative influence, indicating that sales from this category were less likely to be a critical priority.

# Get the model coefficients without the intercept; their names identify the features
coefficients <- coef(log_model)[-1]

# Sort coefficients by absolute value in descending order
sort_coef <- order(abs(coefficients), decreasing = TRUE)

# Create the plot
ggplot(data.frame(Feature = names(coefficients)[sort_coef], Coefficient = coefficients[sort_coef]), aes(x = Coefficient, y = reorder(Feature, Coefficient))) +
  geom_col() +
  labs(title = "Feature Importances, Logistic Regression, 1000 Sales", x = "Importance", y = "Feature") +
  theme_minimal()

We then addressed the imbalance in the target variable using the SMOTE() function and trained the logistic model again (without excluding the item_type_Vegetables and item_type_Snacks variables). The result of SMOTE() was a training dataset with 840 observations where the target variable was balanced (50% for class 1 and 50% for class 0). The F1 score rose to 0.37.

The accuracy dropped to 56.5%. The sensitivity for class 1 was now 50%, much better compared to the imbalanced data, where the sensitivity was less than 6%. We kept these results for further comparison.

set.seed(42) # for repeatable results
trainSplit <- as.data.frame(train_data)
trainSplit$critical_priority <- as.factor(trainSplit$critical_priority)
trainSplit <- SMOTE(critical_priority ~ ., trainSplit, perc.over = 100, perc.under=200)
trainSplit$critical_priority <- as.factor(trainSplit$critical_priority)

prop.table(table(trainSplit$critical_priority))
## 
##   0   1 
## 0.5 0.5
# Build logistic regression model
log_model <- glm(critical_priority ~ . -total_profit - total_revenue - total_cost, data = trainSplit, binomial(link = "logit"), control = list(maxit = 1000))

# Predict using test data
log_model_prob <- predict(log_model, newdata = test_data, type = "response")
log_model_pred <- ifelse(log_model_prob  > 0.5, 1, 0) 

# Evaluate the Logistic Regression model
conf_matrix_1 <- confusionMatrix(factor(log_model_pred), factor(test_data$critical_priority), "1")


results <- tibble(Model = "Model #1 - Log Regression 1000 Sales", Accuracy=conf_matrix_1$overall[1], 
                  "Classification error rate" = 1 - conf_matrix_1$overall[1],
                  F1 = conf_matrix_1$byClass[7],
                  Deviance= log_model$deviance, 
                  R2 = 1 - log_model$deviance / log_model$null.deviance,
                  Sensitivity = conf_matrix_1$byClass["Sensitivity"],
                  Specificity = conf_matrix_1$byClass["Specificity"],
                  Precision =  conf_matrix_1$byClass["Precision"],
                  AIC= log_model$aic, 
                  ROC = auc(roc(test_data$critical_priority, log_model_pred)))

conf_matrix_1
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 87 26
##          1 61 26
##                                           
##                Accuracy : 0.565           
##                  95% CI : (0.4933, 0.6348)
##     No Information Rate : 0.74            
##     P-Value [Acc > NIR] : 1.0000000       
##                                           
##                   Kappa : 0.0721          
##                                           
##  Mcnemar's Test P-Value : 0.0002672       
##                                           
##             Sensitivity : 0.5000          
##             Specificity : 0.5878          
##          Pos Pred Value : 0.2989          
##          Neg Pred Value : 0.7699          
##              Prevalence : 0.2600          
##          Detection Rate : 0.1300          
##    Detection Prevalence : 0.4350          
##       Balanced Accuracy : 0.5439          
##                                           
##        'Positive' Class : 1               
## 

Assumptions

The Logistic Regression model required a linear relationship between the independent variables and the logit transformation of the response variable. The visualization below showed approximate linearity between the numeric independent variables and the log-odds of the target.

probabilities <- predict(log_model, type = "response")
predicted.classes <- ifelse(probabilities > 0.5, "pos", "neg")
head(predicted.classes)
##   432   458   298   516   778   421 
## "pos" "pos" "neg" "pos" "neg" "neg"
#Only numeric predictors
data <- trainSplit %>%
  dplyr::select_if(is.numeric) 
predictors <- colnames(data)

# Bind the logit and tidying the data for plot
data <- data %>%
  mutate(logit = log(probabilities/(1-probabilities))) %>%
  gather(key = "predictors", value = "predictor.value", -logit)

#Scatter plot
ggplot(data, aes(logit, predictor.value))+
  geom_point(size = 0.5, alpha = 0.5) +
  geom_smooth(method = "loess") + 
  theme_bw() + 
  facet_wrap(~predictors, scales = "free_y")

Also, observations had to be independent of each other. As this was a randomly created dataset, we could assume that each sale record was independent.

There had to be no multicollinearity. The correlation matrix below didn’t show extremely high multicollinearity. However, the variables total_revenue, total_cost, total_profit were related and could have been a multicollinearity issue, so they were not used in the logistic regression model. There was little correlation between the variables that were used in the model.

rcore <- rcorr(as.matrix(trainSplit %>% dplyr::select(where(is.numeric))))
coeff <- rcore$r
corrplot(coeff, tl.cex = .7, tl.col="black", method = 'color', addCoef.col = "black",
         type="upper", order="hclust",
         diag=FALSE)
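A numeric check of the same assumption (a sketch; it assumes the car package is available and that no coefficients are aliased):

# Variance inflation factors for the fitted model; values above ~5
# would flag multicollinearity among the predictors that were kept
car::vif(log_model)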

4.2 Logistic regression - 50000 Sales

The generalized linear model was built on the transformed training dataset, with categorical variables converted to dummy variables. The variables total_profit, total_cost, total_revenue were dropped as in the previous model. The same glm() function was used to build the model.

A positive coefficient meant that as the predictor variable grew, so did the log odds of the outcome, increasing its likelihood; a negative coefficient implied the inverse. The data was balanced to better predict class 1.

The result of SMOTE() was a dataset with 39828 observations, where the target variable was balanced (50% for class 1 and 50% for class 0).

set.seed(42) # for reproducible results
trainSplit_large <- as.data.frame(train_data_large)
trainSplit_large$critical_priority <- as.factor(trainSplit_large$critical_priority)
trainSplit_large <- SMOTE(critical_priority ~ ., trainSplit_large, perc.over = 100, perc.under=200)
trainSplit_large$critical_priority <- as.factor(trainSplit_large$critical_priority)

prop.table(table(trainSplit_large$critical_priority))
## 
##   0   1 
## 0.5 0.5
set.seed(42)
# Build logistic regression model
log_model_large <- glm(critical_priority ~ . -total_profit - total_revenue - total_cost , data = trainSplit_large, binomial(link = "logit"), control = glm.control(maxit = 1000))

summ(log_model_large)
MODEL INFO:
Observations: 39828
Dependent variable: critical_priority
Type: Generalized linear model
Family: binomial
Link: logit

MODEL FIT:
χ²(24) = 170.72
Pseudo-R² (Cragg-Uhler) = 0.01
Pseudo-R² (McFadden) = 0.00
AIC = 55092.61, BIC = 55307.42

                                             Est.   S.E.   z val.      p
------------------------------------------  -----  -----  -------  -----
(Intercept)                                   5.82   9.35     0.62   0.53
units_sold                                   -0.00   0.00    -2.57   0.01
unit_price                                   -0.00   0.00    -2.14   0.03
unit_cost                                     0.00   0.00     1.73   0.08
ship_time                                    -0.00   0.00    -1.49   0.14
order_month                                   0.00   0.00     0.60   0.55
order_year                                   -0.00   0.00    -0.63   0.53
region_Australia.and.Oceania1                 0.20   0.04     4.95   0.00
region_Central.America.and.the.Caribbean1     0.21   0.04     5.97   0.00
region_Europe1                                0.13   0.03     4.76   0.00
region_Middle.East.and.North.Africa1          0.15   0.03     4.45   0.00
region_North.America1                         0.08   0.07     1.17   0.24
region_Sub.Saharan.Africa1                    0.10   0.03     3.62   0.00
item_type_Beverages1                         -0.06   0.05    -1.14   0.26
item_type_Cereal1                             0.21   0.04     4.79   0.00
item_type_Clothes1                            0.02   0.05     0.39   0.70
item_type_Cosmetics1                          0.16   0.06     2.58   0.01
item_type_Fruits1                            -0.20   0.06    -3.49   0.00
item_type_Household1                          0.24   0.06     3.78   0.00
item_type_Meat1                               0.01   0.06     0.19   0.85
item_type_Office.Supplies1                    0.18   0.07     2.75   0.01
item_type_Personal.Care1                      0.12   0.05     2.42   0.02
item_type_Snacks1                             0.13   0.04     3.10   0.00
item_type_Vegetables1                         0.19   0.04     4.49   0.00
sales_channel_Online1                         0.01   0.02     0.38   0.70

Standard errors: MLE
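Since the coefficients are on the log-odds scale, exponentiating them gives odds ratios, which are easier to read: exp(0.21) ≈ 1.23 for region_Central.America.and.the.Caribbean1, for example, i.e. roughly 23% higher odds of a critical priority.

# Coefficients as odds ratios: the multiplicative change in the odds
# of a critical priority per one-unit increase in a predictor
exp(coef(log_model_large))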

The Akaike information criterion (AIC) was 55092.61, which could be high; it needed to be compared against the other models.

The accuracy of the model was 52%, lower than for the 1,000 sales. The Logistic Regression model correctly identified 47% of class 1 cases, something the imbalanced data did not allow. The F1 score of 0.33 combined the precision and recall of the model for the positive class, confirming the weak performance on class 1.

AUC for the ROC curve was 0.5: the model ranked a random positive example higher than a random negative example 50% of the time, no better than chance.

The null deviance of 55213.33 described how well the target variable could be predicted by a model with only an intercept term; the residual deviance of 55042.615, barely lower, showed that the model could not predict the response variable well.
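The two deviances combine into the pseudo-R² reported in the comparison table:

# 1 - 55042.615 / 55213.33 ~ 0.003: the predictors explain almost
# none of the deviance
1 - log_model_large$deviance / log_model_large$null.deviance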

# Predict using test data
log_model_prob_large <- predict(log_model_large, newdata = test_data_large, type = "response")
log_model_pred_large <- ifelse(log_model_prob_large  > 0.5, 1, 0) 

# Evaluate the Logistic Regression model
conf_matrix_2 <- confusionMatrix(factor(log_model_pred_large), factor(test_data_large$critical_priority), "1")


results <- rbind(results, tibble(Model = "Model #2 - Log Regression 50000 Sales", Accuracy=conf_matrix_2$overall[1], 
                  "Classification error rate" = 1 - conf_matrix_2$overall[1],
                  F1 = conf_matrix_2$byClass[7],
                  Deviance= log_model_large$deviance, 
                  R2 = 1 - log_model_large$deviance / log_model_large$null.deviance,
                  Sensitivity = conf_matrix_2$byClass["Sensitivity"],
                  Specificity = conf_matrix_2$byClass["Specificity"],
                  Precision = conf_matrix_2$byClass["Precision"],
                  AIC= log_model_large$aic,
                  ROC = auc(roc(test_data_large$critical_priority, log_model_pred_large)))) # 

conf_matrix_2
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 4008 1320
##          1 3503 1169
##                                           
##                Accuracy : 0.5177          
##                  95% CI : (0.5079, 0.5275)
##     No Information Rate : 0.7511          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.0025          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.4697          
##             Specificity : 0.5336          
##          Pos Pred Value : 0.2502          
##          Neg Pred Value : 0.7523          
##              Prevalence : 0.2489          
##          Detection Rate : 0.1169          
##    Detection Prevalence : 0.4672          
##       Balanced Accuracy : 0.5016          
##                                           
##        'Positive' Class : 1               
## 

The feature importance plot below was built from the model coefficients. Consistent with the coefficient table above, item type Household had the largest positive coefficient, followed by Cereal and several region indicators, indicating that such sales were more likely to be prioritized; item type Fruits had the largest negative coefficient, indicating that sales from this category were less likely to be a critical priority.

# Get model coefficients without the intercept so that names and
# values stay aligned (keeping the intercept shifts every label by one)
coefficients <- log_model_large$coefficients[-1]
feature_names <- names(coefficients)

# Sort coefficients by absolute value in descending order
sort_coef <- order(abs(coefficients), decreasing = TRUE)

# Create the plot
ggplot(data.frame(Feature = feature_names[sort_coef], Coefficient = coefficients[sort_coef]), aes(x = Coefficient, y = reorder(Feature, Coefficient))) +
  geom_col() +
  labs(title = "Feature Importances, Logistic Regression, 50000 Sales", x = "Coefficient", y = "Feature") +
  theme_minimal()

Assumptions

The Logistic regression model required a linear relationship between the independent variables and the logit transformation of the response variable. The visualization below showed approximate linearity between the independent variables and the log-odds of the target.

The code below took too long to run, so its eval option was set to FALSE in order to publish the report.

probabilities <- predict(log_model_large, type = "response")
predicted.classes <- ifelse(probabilities > 0.5, "pos", "neg")
head(predicted.classes)

#Only numeric predictors
data <- trainSplit_large %>%
  dplyr::select_if(is.numeric) 
predictors <- colnames(data)

# Bind the logit and tidying the data for plot
data <- data %>%
  mutate(logit = log(probabilities/(1-probabilities))) %>%
  gather(key = "predictors", value = "predictor.value", -logit)

#Scatter plot
ggplot(data, aes(logit, predictor.value))+
  geom_point(size = 0.5, alpha = 0.5) +
  geom_smooth(method = "loess") + 
  theme_bw() + 
  facet_wrap(~predictors, scales = "free_y")
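A sketch of one way to keep this chunk evaluable (an assumption, not something run in this report): build the logit plot from a random subsample of the balanced training data instead of all 39,828 rows.

# Subsample before building the logit plot so the loess fits stay fast
set.seed(42)
idx <- sample(nrow(trainSplit_large), 2000)
sample_data <- trainSplit_large[idx, ] %>% dplyr::select_if(is.numeric)
sample_data <- sample_data %>%
  mutate(logit = log(probabilities[idx] / (1 - probabilities[idx]))) %>%
  gather(key = "predictors", value = "predictor.value", -logit)
# then reuse the ggplot() call above with data = sample_data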

Also, observations had to be independent of each other. As this was a randomly created dataset, we could assume that each sale record was independent.

There had to be no multicollinearity. The correlation matrix below didn’t show extremely high multicollinearity. However, the variables total_revenue, total_cost, total_profit were related and could have been a multicollinearity issue, so they were not used in the logistic regression model. The correlations between the remaining variables were weaker than in the small data (coefficients were almost 0).

rcore <- rcorr(as.matrix(trainSplit_large %>% dplyr::select(where(is.numeric))))
coeff <- rcore$r
corrplot(coeff, tl.cex = .7, tl.col="black", method = 'color', addCoef.col = "black",
         type="upper", order="hclust",
         diag=FALSE)

4.3 Cross-validation - 1000 Sales

K-fold cross-validation with Lasso regularization was used in order to improve feature selection for the Logistic regression model. The balanced data after SMOTE() was used with all the variables.

# Train Lasso model
set.seed(42)
lasso_model<- cv.glmnet(x_train, 
                       y_train,
                       alpha = 1, #alpha=1 is lasso
                       family = "binomial",
                       link = "logit",
                       nfolds = 5, 
                       type.measure = "class")
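The penalty strength was chosen by the cross-validation itself; the selected values can be inspected with the standard glmnet accessors:

plot(lasso_model)        # CV misclassification error vs log(lambda)
lasso_model$lambda.min   # lambda that minimizes the CV error
lasso_model$lambda.1se   # largest lambda within one SE of the minimum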

The accuracy of the Lasso model was 52%, down from 56% for the Logistic model on 1,000 sales. The AUC score of 0.46 suggested no discriminative power. Based on the confusion matrix, the model performed worse in predicting critical priority, as shown by the precision and recall values for class 1. The F1 score was 0.27 (the model was 27% accurate in predicting positive cases), worse than in the previous models.

AUC for the ROC curve was 0.46: the model ranked a random positive example higher than a random negative example only 46% of the time. The deviance of 1.42 reported by assess.glmnet() was an average per observation, so it was not directly comparable with the total residual deviance of the unregularized model.

The Lasso regularization kept the region variables “Australia and Oceania”, “Middle East and North Africa” and “North America”, the item_type variables “Cereal”, “Clothes”, “Fruits”, “Meat”, “Office Supplies” and “Vegetables”, and the variables units_sold, ship_time and order_month (see the list of selected features below); the remaining coefficients were shrunk to zero.

The Akaike information criterion approximated for the Lasso fit (via the helper below) was -38, better than in all the previous models, though this approximation was not directly comparable with the AIC reported by glm().

# Approximate AICc and BIC for a cv.glmnet fit at the chosen lambda,
# based on the explained deviance of the underlying glmnet model
glmnet_cv_aicc <- function(fit, lambda = 'lambda.min'){
  whlm <- which(fit$lambda == fit[[lambda]])
  with(fit$glmnet.fit,
       {
         tLL <- nulldev - nulldev * (1 - dev.ratio)[whlm]  # explained deviance
         k <- df[whlm]                                     # nonzero coefficients
         n <- nobs
         return(list('AICc' = - tLL + 2 * k + 2 * k * (k + 1) / (n - k - 1),
                     'BIC' = log(n) * k - tLL))
       })
}
# Predict the test data
lasso_prob <- predict(lasso_model, s = lasso_model$lambda.min, newx = x_test, type = "response")
lasso_pred <- ifelse(lasso_prob > 0.5, 1, 0) 


# The features selected by Lasso
features_matrix <- as.matrix(coef(lasso_model, s = lasso_model$lambda.min))
selected_features <- rownames(features_matrix)[features_matrix != 0]

selected_features
##  [1] "(Intercept)"                        "regionAustralia and Oceania"       
##  [3] "regionMiddle East and North Africa" "regionNorth America"               
##  [5] "item_typeCereal"                    "item_typeClothes"                  
##  [7] "item_typeFruits"                    "item_typeMeat"                     
##  [9] "item_typeOffice Supplies"           "item_typeVegetables"               
## [11] "units_sold"                         "ship_time"                         
## [13] "order_month"
# Confusion matrix
conf_matrix_3 = confusionMatrix(factor(lasso_pred),factor(y_test), "1")


lasso.r1 <- assess.glmnet(lasso_model,           
                                newx = x_test,              
                                newy = y_test )   
lasso.r2 <- glmnet_cv_aicc(lasso_model, 'lambda.min')

conf_matrix_3
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 86 34
##          1 62 18
##                                          
##                Accuracy : 0.52           
##                  95% CI : (0.4484, 0.591)
##     No Information Rate : 0.74           
##     P-Value [Acc > NIR] : 1.000000       
##                                          
##                   Kappa : -0.0619        
##                                          
##  Mcnemar's Test P-Value : 0.005857       
##                                          
##             Sensitivity : 0.3462         
##             Specificity : 0.5811         
##          Pos Pred Value : 0.2250         
##          Neg Pred Value : 0.7167         
##              Prevalence : 0.2600         
##          Detection Rate : 0.0900         
##    Detection Prevalence : 0.4000         
##       Balanced Accuracy : 0.4636         
##                                          
##        'Positive' Class : 1              
## 
results <- rbind(results,tibble(Model = "Model #3 - Lasso Model 1000 Sales", Accuracy=conf_matrix_3$overall[1], 
                  "Classification error rate" = 1 - conf_matrix_3$overall[1],
                  F1 = conf_matrix_3$byClass[7],
                  Deviance=lasso.r1$deviance[[1]], 
                  R2 = NA,
                  Sensitivity = conf_matrix_3$byClass["Sensitivity"],
                  Specificity = conf_matrix_3$byClass["Specificity"],
                  Precision =  conf_matrix_3$byClass["Precision"],
                  AIC= lasso.r2$AICc, 
                  ROC = auc(roc(as.numeric(y_test), as.numeric(lasso_pred)))))

Assumptions

The observations had to be independent of each other. As this was a randomly created dataset, we could assume that each sale record was independent.

Each fold in the k-fold cross-validation had to have approximately the same proportion of samples of each target class (stratified sampling).
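cv.glmnet() assigns folds at random by default, so stratification is not guaranteed. A sketch of enforcing it (assuming the caret package is available) passes an explicit fold vector:

# Stratified fold assignment: createFolds() balances the levels of y
# across folds, and cv.glmnet() accepts it via foldid
set.seed(42)
folds <- caret::createFolds(factor(y_train), k = 5, list = FALSE)
lasso_strat <- cv.glmnet(x_train, y_train, alpha = 1, family = "binomial",
                         foldid = folds, type.measure = "class")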

4.4 Cross-validation - 50000 Sales

The steps to build the k-fold cross-validation with Lasso regularization were repeated for 50000 sales.

# Train Lasso model
set.seed(42)
lasso_model_large <- cv.glmnet(x_train_large, 
                       y_train_large,
                       alpha = 1, #alpha=1 is lasso
                       family = "binomial",
                       link = "logit",
                       nfolds = 5, 
                       type.measure = "class")

The accuracy of the Lasso model was 51%, almost the same as the model without cross-validation. The AUC score of 0.5 suggested no discriminative power. Based on the confusion matrix, the model performed about the same as the previous models, with 49% sensitivity when predicting critical priority, as shown by the precision and recall values for class 1.

F1 score was 0.33 (the model was 33% accurate in predicting positive cases), the same as for the model without Lasso. AUC for the ROC curve was 0.5. The deviance of 1.38 reported by assess.glmnet() was, again, an average per observation and not directly comparable with the residual deviance of the unregularized model.

The Lasso regularization kept all region variables except “Australia and Oceania”, all item_type variables, sales_channel, units_sold and order_month (see the list of selected features below); the remaining coefficients were shrunk to zero.

The Akaike information criterion approximated for the Lasso fit was -86, better than in all the previous models, with the same caveat about comparability with the glm() values.

# Predict the test data
lasso_prob_large <- predict(lasso_model_large, s = lasso_model_large$lambda.min, newx = x_test_large, type = "response")
lasso_pred_large <- ifelse(lasso_prob_large > 0.5, 1, 0) 


# The features selected by Lasso
features_matrix <- as.matrix(coef(lasso_model_large, s = lasso_model_large$lambda.min))
selected_features <- rownames(features_matrix)[features_matrix != 0]

selected_features
##  [1] "(Intercept)"                            
##  [2] "regionCentral America and the Caribbean"
##  [3] "regionEurope"                           
##  [4] "regionMiddle East and North Africa"     
##  [5] "regionNorth America"                    
##  [6] "regionSub-Saharan Africa"               
##  [7] "item_typeBeverages"                     
##  [8] "item_typeCereal"                        
##  [9] "item_typeClothes"                       
## [10] "item_typeCosmetics"                     
## [11] "item_typeFruits"                        
## [12] "item_typeHousehold"                     
## [13] "item_typeMeat"                          
## [14] "item_typeOffice Supplies"               
## [15] "item_typePersonal Care"                 
## [16] "item_typeSnacks"                        
## [17] "item_typeVegetables"                    
## [18] "sales_channelOnline"                    
## [19] "units_sold"                             
## [20] "order_month"
# Confusion matrix
conf_matrix_4 = confusionMatrix(factor(lasso_pred_large),factor(y_test_large), "1")


lasso.r1 <- assess.glmnet(lasso_model_large,           
                                newx = x_test_large,              
                                newy = y_test_large)   
lasso.r2 <- glmnet_cv_aicc(lasso_model_large, 'lambda.min')


conf_matrix_4
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 3882 1262
##          1 3629 1227
##                                           
##                Accuracy : 0.5109          
##                  95% CI : (0.5011, 0.5207)
##     No Information Rate : 0.7511          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.0074          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.4930          
##             Specificity : 0.5168          
##          Pos Pred Value : 0.2527          
##          Neg Pred Value : 0.7547          
##              Prevalence : 0.2489          
##          Detection Rate : 0.1227          
##    Detection Prevalence : 0.4856          
##       Balanced Accuracy : 0.5049          
##                                           
##        'Positive' Class : 1               
## 
results <- rbind(results,tibble(Model = "Model #4- Lasso Model 50000 Sales", Accuracy=conf_matrix_4$overall[1], 
                  "Classification error rate" = 1 - lasso.r1$auc[1],
                  F1 = conf_matrix_4$byClass[7],
                  Deviance=lasso.r1$deviance[[1]], 
                  R2 = NA,
                  Sensitivity = conf_matrix_4$byClass["Sensitivity"],
                  Specificity = conf_matrix_4$byClass["Specificity"],
                  Precision =  conf_matrix_4$byClass["Precision"],
                  AIC= lasso.r2$AICc,
                  ROC = auc(roc(as.numeric(y_test_large), as.numeric(lasso_pred_large)))))

Assumptions

The observations had to be independent of each other. As this was a randomly created dataset, we could assume that each sale record was independent.

Each fold in the k-fold cross-validation had to have approximately the same proportion of samples of each target class (stratified sampling).

4.5 Random Forest - 1000 Sales

Next, we tried a Random Forest model for the small dataset. The original data (without additional dummy variables), with the imbalance in the target variable fixed, was used. The variables total_profit, total_cost, total_revenue were dropped as they were derived from other variables.

set.seed(42)
# Build a Random Forest model
rf_model <- randomForest(critical_priority ~ . -total_profit -total_cost -total_revenue, data = split_rf, ntree = 100, importance = TRUE)
print(rf_model)
## 
## Call:
##  randomForest(formula = critical_priority ~ . - total_profit -      total_cost - total_revenue, data = split_rf, ntree = 100,      importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 100
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 22.5%
## Confusion matrix:
##     0   1 class.error
## 0 340  80   0.1904762
## 1 109 311   0.2595238

The accuracy of the model was 57%. Compared to the previous models, the Random Forest had better accuracy but was worse at predicting critical priority, as shown by the precision and recall values for class 1 (sensitivity 40%). F1 score was 0.33 (the model was 33% accurate in predicting positive cases). AUC for the ROC curve was 0.52.

# Predict test data
rf_model_pred <- predict(rf_model, newdata = test_data_rf)

# Evaluate the model

conf_matrix_5 <- confusionMatrix(factor(rf_model_pred ), factor(test_data_rf$critical_priority), "1")


results <- rbind(results, tibble(Model = "Model #5 - Random Forest 1000 Sales", Accuracy=conf_matrix_5$overall[1], 
                  "Classification error rate" = 1 - conf_matrix_5$overall[1],
                  F1 = conf_matrix_5$byClass[7],
                  Deviance= NA, 
                  R2 = NA,
                  Sensitivity = conf_matrix_5$byClass["Sensitivity"],
                  Specificity = conf_matrix_5$byClass["Specificity"],
                  Precision = conf_matrix_5$byClass["Precision"],
                  AIC= NA,
                  ROC = auc(roc(test_data_rf$critical_priority, as.numeric(rf_model_pred)))))

conf_matrix_5
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 94 31
##          1 54 21
##                                           
##                Accuracy : 0.575           
##                  95% CI : (0.5033, 0.6444)
##     No Information Rate : 0.74            
##     P-Value [Acc > NIR] : 1.00000         
##                                           
##                   Kappa : 0.0341          
##                                           
##  Mcnemar's Test P-Value : 0.01702         
##                                           
##             Sensitivity : 0.4038          
##             Specificity : 0.6351          
##          Pos Pred Value : 0.2800          
##          Neg Pred Value : 0.7520          
##              Prevalence : 0.2600          
##          Detection Rate : 0.1050          
##    Detection Prevalence : 0.3750          
##       Balanced Accuracy : 0.5195          
##                                           
##        'Positive' Class : 1               
## 

The plots below showed the feature importance. Mean Decrease Accuracy measured how much the model accuracy decreased when a variable was dropped; the higher the mean decrease in accuracy or the mean decrease in Gini, the more important the variable was to the model. Units sold and ship_time were the most important variables.

plot(rf_model)

varImpPlot(rf_model, main = "Variable Importance, Random Forest, 1000 Sales", pch=16) 
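The same importances can also be read numerically, which is easier to cite than the plot:

# Per-variable MeanDecreaseAccuracy and MeanDecreaseGini
importance(rf_model)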

Assumptions

Although the Random Forest didn’t have strict assumptions like linear models, there were several things to consider when building the model.

For example, there was no need for data transformations or for the features to follow a specific distribution, as non-linear interactions could be captured without any explicit transformation.

The plot below depicted a single decision tree. Each node reflected a decision on the value of a feature, chosen to separate the data so that the purity of the resulting child nodes increased. Because the tree made binary choices, there were always two branches. The predicted class was shown by the color of the leaf nodes, with darker colors signifying more confident predictions.

# Build a single tree 
single_tree <- rpart(critical_priority ~ . , data = split_rf, method = "class", control = rpart.control(cp = 0.01))

# Plot the tree
rpart.plot(single_tree)

4.6 Random Forest - 50000 Sales

As the last model, a Random Forest was built for the large dataset.

Building the model for the larger dataset took significantly more time than for the smaller dataset.

set.seed(42)
# Build a Random Forest model
rf_model_large <- randomForest(critical_priority ~ . -total_profit -total_cost -total_revenue, data = split_rf_large, ntree = 100, importance = TRUE)
print(rf_model_large)
## 
## Call:
##  randomForest(formula = critical_priority ~ . - total_profit -      total_cost - total_revenue, data = split_rf_large, ntree = 100,      importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 100
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 24.88%
## Confusion matrix:
##       0     1 class.error
## 0 18776  1138  0.05714573
## 1  8771 11143  0.44044391

The accuracy of the model was 71%, better than all the previous models. However, the Random Forest identified few of the critical orders: precision and recall for class 1 were low, and sensitivity was only 9%. F1 score was 0.14 (the model was 14% accurate in predicting positive cases). AUC for the ROC curve was 0.5.

# Predict test data
rf_model_pred_large <- predict(rf_model_large, newdata = test_data_large_rf)

# Evaluate the model

conf_matrix_6 <- confusionMatrix(factor(rf_model_pred_large), factor(test_data_large_rf$critical_priority), "1")


results <- rbind(results, tibble(Model = "Model #6 - Random Forest 50000 Sales", Accuracy=conf_matrix_6$overall[1], 
                  "Classification error rate" = 1 - conf_matrix_6$overall[1],
                  F1 = conf_matrix_6$byClass[7],
                  Deviance= NA, 
                  R2 = NA,
                  Sensitivity = conf_matrix_6$byClass["Sensitivity"],
                  Specificity = conf_matrix_6$byClass["Specificity"],
                  Precision =NA,
                  AIC= NA,
                  ROC = auc(roc(test_data_large_rf$critical_priority, as.numeric(rf_model_pred_large)))))

conf_matrix_6
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 6844 2259
##          1  667  230
##                                           
##                Accuracy : 0.7074          
##                  95% CI : (0.6984, 0.7163)
##     No Information Rate : 0.7511          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.0046          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.09241         
##             Specificity : 0.91120         
##          Pos Pred Value : 0.25641         
##          Neg Pred Value : 0.75184         
##              Prevalence : 0.24890         
##          Detection Rate : 0.02300         
##    Detection Prevalence : 0.08970         
##       Balanced Accuracy : 0.50180         
##                                           
##        'Positive' Class : 1               
## 

The plots below showed the feature importance, read the same way as for the small dataset: the higher the mean decrease in accuracy or the mean decrease in Gini, the more important the variable. Units sold was the most important variable.

plot(rf_model_large)

varImpPlot(rf_model_large,  main = "Variable Importance, Random Forest, 50000 Sales") 

Assumptions

Although the Random Forest didn’t have strict assumptions like linear models, there were several things to consider when building the model.

For example, there was no need for data transformations or for the features to follow a specific distribution, as non-linear interactions could be captured without any explicit transformation.

The plot below depicted a single decision tree, read the same way as for the small dataset: each node reflected a decision on the value of a feature, the tree always branched in two, and the color of the leaf nodes showed the predicted class.

# Build a single tree
single_tree <- rpart(critical_priority ~ . , data = split_rf_large, method = "class", control = rpart.control(cp = 0.01))

# Plot the tree
rpart.plot(single_tree)

5. Model selection

The accuracy of the Logistic Regression was almost the same for the “50000 Sales Records” and the “1000 Sales Records” datasets, while the Random Forest showed greater accuracy on the larger data. The results improved a little for 50,000 sales with Lasso regularization and became worse for 1,000 sales. The Random Forest worked better with 1,000 sales than with 50,000 sales.

Both algorithms struggled to reliably identify critical priorities, especially the Random Forest for 50,000 sales. This could be due to the imbalance in the classes of the original data that we had to fix with the SMOTE() function, the need for more complex modeling methodologies (tuning hyperparameters, feature engineering, or other modeling techniques), and the synthetic nature of the data.
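As a sketch of one such refinement (an assumption, not something run in this report), the mtry hyperparameter of the Random Forest could be tuned against the out-of-bag error:

# tuneRF() grows forests at increasing/decreasing mtry and keeps the
# value with the lowest OOB error (here on the small balanced data,
# with the derived total_* columns excluded as in the fitted model)
set.seed(42)
predictors_rf <- split_rf[, setdiff(names(split_rf),
                  c("critical_priority", "total_profit", "total_cost", "total_revenue"))]
tuned_mtry <- tuneRF(x = predictors_rf, y = split_rf$critical_priority,
                     ntreeTry = 100, stepFactor = 1.5, improve = 0.01)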

On balance, the Random Forest would be the better choice with real data. I would also choose this model for the small data: it produced better results (accuracy, F1, sensitivity), could predict some of the class 1 values, and required no preprocessing such as scaling or normalizing each feature. For the large data, I would choose Logistic regression with Lasso regularization; it had results similar to the other models but didn’t use all the features for training and had a better AIC value. In practice, larger datasets could give better results with a Random Forest, since they are more representative and models trained on more diverse data tend to be more robust and less prone to overfitting, though training time and hardware requirements would be higher.

nice_table <- function(df, cap=NULL, cols=NULL, dig=3, fw=F){
  if (is.null(cols)) {c <- colnames(df)} else {c <- cols}
  table <- df %>% 
    kable(caption=cap, col.names=c, digits=dig) %>% 
    kable_styling(
      bootstrap_options = c("striped", "hover", "condensed"),
      html_font = 'monospace',
      full_width = fw)
  return(table)
}

results %>% 
  nice_table(cap='Logistic Model Comparison') %>% 
  scroll_box(width='100%')
Logistic Model Comparison

| Model                                  | Accuracy | Classification error rate | F1    | Deviance  | R2    | Sensitivity | Specificity | Precision | AIC       | ROC       |
|----------------------------------------|----------|---------------------------|-------|-----------|-------|-------------|-------------|-----------|-----------|-----------|
| Model #1 - Log Regression 1000 Sales   | 0.565    | 0.435                     | 0.374 | 1090.143  | 0.064 | 0.500       | 0.588       | 0.299     | 1140.143  | 0.5439189 |
| Model #2 - Log Regression 50000 Sales  | 0.518    | 0.482                     | 0.326 | 55042.615 | 0.003 | 0.470       | 0.534       | 0.250     | 55092.615 | 0.5016419 |
| Model #3 - Lasso Model 1000 Sales      | 0.520    | 0.480                     | 0.273 | 1.412     | NA    | 0.346       | 0.581       | 0.225     | -38.200   | 0.4636175 |
| Model #4 - Lasso Model 50000 Sales     | 0.511    | 0.489                     | 0.334 | 1.385     | NA    | 0.493       | 0.517       | 0.253     | -86.875   | 0.5049055 |
| Model #5 - Random Forest 1000 Sales    | 0.575    | 0.425                     | 0.331 | NA        | NA    | 0.404       | 0.635       | 0.280     | NA        | 0.5194906 |
| Model #6 - Random Forest 50000 Sales   | 0.707    | 0.293                     | 0.136 | NA        | NA    | 0.092       | 0.911       | NA        | NA        | 0.5018018 |

6. Conclusion

The datasets were generated with random logic and were not real sales data. Since the data was randomly generated, the patterns, connections, and distributions within it were not indicative of real-world sales circumstances. As a result, any patterns or associations found by the models could be artifacts of the random data generating process rather than real-world occurrences. Both the Logistic Regression and Random Forest models struggled to identify critical priorities in both datasets. This difficulty was caused by the data’s inherent randomness, which lacked any underlying patterns that a model could leverage. Models trained on this data should not be used in real-world circumstances; the provided data was not adequate for true predictive modeling applications. The above work could still be used for the preprocessing of sales data, as it wouldn’t lead to costly mistakes on real data. In reality, if I had to make a business choice, I’d choose a model built on the dataset with 50,000 sales, as its larger sample size was more likely to produce a more general model, and, remembering the pros and cons of both algorithms, I would choose the Random Forest. But with these random data, the Random Forest model was the selection for 1,000 sales and Logistic regression with Lasso regularization for 50,000 sales.

As a result, the randomly generated data may be valuable for practice and testing, but it is critical to understand its limitations when evaluating results or contemplating its use.

References

  1. Verma, V. A. (2021). Downloads 18 – Sample CSV Files / Data Sets for Testing (till 5 Million Records) – Sales. Excel BI Analytics. https://excelbianalytics.com/wp/downloads-18-sample-csv-files-data-sets-for-testing-sales/

  2. Nwanganga, F., & Chapple, M. (2020). Practical Machine Learning in R. https://doi.org/10.1002/9781119591542

  3. Sheather, S. (2009). A Modern Approach to Regression with R. Springer Science & Business Media.

  4. Faraway, J. J. (2016). Extending the Linear Model with R: Generalized Linear, Mixed Effects and Nonparametric Regression Models. CRC Press.



Appendix: Essay. Machine Learning models for Sales Data

The work below shows the stages from data preparation to model selection for the sales data.

The datasets contained 1,000 and 50,000 sales respectively, both with 17 predictor variables. There was no specific label for supervised learning, so we created a new variable from order_priority, called critical_priority, to predict whether an order was of critical priority based on the other features. Correlations between columns were thoroughly evaluated during the data processing phase. Notably, the variables total_revenue, total_cost, and total_profit were derived from units_sold, unit_cost, and unit_price.
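As a quick sanity check of these derivations (a sketch; the column names are those produced by read.csv() from the raw headers):

# total_revenue = units_sold * unit_price, total_cost = units_sold * unit_cost,
# total_profit = total_revenue - total_cost
with(df_1000, all.equal(Total.Revenue, Units.Sold * Unit.Price))
with(df_1000, all.equal(Total.Profit, Total.Revenue - Total.Cost))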

The first critical step was data transformation. The target variable was the critical_priority column built from Order Priority. The critical orders were substituted with a 1 and all others with a 0. The datasets were refined further for clarity and efficiency: columns were given standardized names (Order.Priority became order_priority), dates were formatted consistently, and extraneous columns such as Country and Order ID were removed. The Order Date and Ship Date columns were used to create new columns order_year, order_month, ship_time and were removed. Character columns were transformed into model-compatible dummy variables. As the target variable was binary, the following algorithms were considered. The following data preparation was based on this algorithm selection. The choice of the algorithms wasn’t affected by the size of the data, only by the nature of the target variable.

Logistic regression: simple, interpretable, and efficient; it performs well when the relationship between the independent variables and the log odds is linear, but may underperform when the data has complex, non-linear relationships.

Random Forest: can capture complicated, non-linear patterns in data, is resistant to overfitting, and can handle numerical as well as categorical data; but it is more difficult to interpret than simpler models, and large datasets can be computationally expensive.

Logistic Regression was chosen because of its simplicity and applicability for binary classification. Random Forest was used to capture any non-linear relationships in the dataset. The size of the dataset also influenced the decision, as training a more sophisticated model on a larger dataset took substantially longer.

The datasets were divided into training (80%) and testing (20%) sets prior to modeling to ensure unbiased validation and a realistic evaluation of model performance on unseen data. The categorical variables were transformed to dummies for the Logistic models; the data without dummy variables was used for Random Forest. For the Lasso model, the training and testing data with all variables (no dummies) were transformed to a matrix. All datasets had an imbalance in the target variable (~75% for class 0 and ~25% for class 1). With imbalanced data, most machine learning models predict the majority class more readily than the minority class. To address this behavior, we utilized the SMOTE() function on the data in order to achieve more even accuracy rates between classes. The SMOTE function oversampled the minority class by using bootstrapping and k-nearest neighbors to synthetically create additional observations of that event. We ran the models both with and without the imbalance fixed.

Logistic Regression - 1,000 Sales. A logistic regression model was built for the smaller dataset of 1,000 sales. Due to their intricate relationship and probable multicollinearity, the variables total_profit, total_cost, total_revenue were omitted. The variables item_type_Vegetables and item_type_Snacks were also removed because the model raised an error when predicting probabilities: the rank of the data matrix had to be at least equal to the number of parameters, and it was not. The logistic regression model had a 71% accuracy. This model’s Akaike information criterion (AIC) was 926, indicating room for improvement. According to the confusion matrix, the model predicted critical priority with only 6% sensitivity, as evidenced by the precision and recall values for class 1. The F1 score of 0.095 combined the precision and recall of the model, confirming very weak performance on the positive class. According to feature importance, sales of fruits and household items were more likely to be prioritized, whereas beverage sales were not.

We then fixed the imbalance in the target variable using the SMOTE() function and trained the logistic model again (without excluding the item_type_Vegetables and item_type_Snacks variables). The result of SMOTE() was a training dataset with 840 observations, where the target variable was balanced (50% for class 1 and 50% for class 0). The accuracy dropped to 56%, but the sensitivity for class 1 was now 50%, much better than with the imbalanced data, where sensitivity was below 6%. F1 score was 0.37 (the model was 37% accurate in predicting positive cases). AUC for the ROC curve was about 0.54: the model ranked a random positive example higher than a random negative example only slightly more than half the time. The residual deviance of 1090 described how well the target variable could be predicted by the fitted model; the lower the value, the better the model could predict the response variable.

Logistic Regression - 50,000 Sales. For the larger dataset, the same logistic regression model was applied with similar variable exclusions. The imbalance was fixed with the SMOTE() function (50%/50% instead of 25%/75%), and the variables total_profit, total_cost, total_revenue were omitted. This model’s accuracy was 52%, lower than for the 1,000 sales, and the AIC was significantly higher at 55092.61. Based on the coefficients, item types such as Household and Cereal had the largest positive influence, indicating that such sales were more likely to be prioritized, while item type Fruits had the largest negative influence. The model predicted class 1 with a sensitivity of 47%, worse than the balanced result for 1,000 sales. The F1 score of 0.33 combined the precision and recall of the model for the positive class. AUC was about the same, 0.5. The residual deviance of 55042.615, barely below the null deviance, showed that the model couldn’t predict the response variable well.

Cross-validation - 1,000 & 50,000 Sales. Lasso regularization with k-fold cross-validation was used to improve feature selection in the logistic regression model. The accuracy was 52% for the 1,000 sales dataset and 51% for the 50,000 sales dataset. The approximated AIC was -38 for 1,000 sales and -86 for 50,000 sales, better than in the previous models, though these approximations were not directly comparable with the glm() values. F1 score was 0.27 with 35% sensitivity for 1,000 sales, and 0.33 with 49% sensitivity for 50,000 sales. AUC for the ROC curve was 0.46 for 1,000 sales (a random positive example was ranked higher than a random negative example only 46% of the time) and 0.5 for 50,000 sales. The deviances of 1.42 and 1.38 reported by assess.glmnet() were averages per observation and not directly comparable with the residual deviances of the unregularized models. Overall, the results improved a little for 50,000 sales with Lasso regularization and became worse for 1,000 sales.

For 1,000 sales, the Lasso regularization kept the region variables “Australia and Oceania”, “Middle East and North Africa” and “North America”, the item_type variables “Cereal”, “Clothes”, “Fruits”, “Meat”, “Office Supplies” and “Vegetables”, and the variables units_sold, ship_time and order_month. For 50,000 sales, it kept all region variables except “Australia and Oceania”, all item_type variables, sales_channel, units_sold and order_month.

Random Forest - 1,000 & 50,000 Sales. Following that, Random Forest, a well-known ensemble learning method, was used. The algorithm attained 57% accuracy for the 1,000 sales dataset, while the larger dataset of 50,000 sales yielded an accuracy of 71%. For both datasets, feature importance plots revealed that units sold was the most influential variable. The precision and recall values for class 1 on the large dataset showed that the Random Forest predicted critical priority worse than the previous models (sensitivity 9%, F1 score 0.14). For 1,000 sales these metrics stayed almost the same while the accuracy increased (sensitivity 40%, F1 score 0.33). The AUC stayed almost the same for both models, about 0.5. The Random Forest worked better with 1,000 sales than with 50,000 sales.

The accuracy of the Logistic Regression was almost the same for the “50,000 Sales Records” and the “1,000 Sales Records” datasets. However, the Random Forest showed greater accuracy on the larger data. The results improved a little for 50,000 sales with Lasso regularization and became worse for 1,000 sales. The Random Forest worked better with 1,000 sales than with 50,000 sales.

Both algorithms struggled to reliably identify critical priorities, especially the Random Forest for 50,000 sales. This could be due to the imbalance in the classes of the original data that we had to fix with the SMOTE() function, the need for more complex modeling methodologies (tuning hyperparameters, feature engineering, or other modeling techniques), and the synthetic nature of the data.

On balance, the Random Forest would be the better choice with real data. I would also choose this model for the small data: it produced better results (accuracy, F1, sensitivity), could predict some of the class 1 values, and required no preprocessing such as scaling or normalizing each feature. For the large data, I would choose Logistic regression with Lasso regularization; it had results similar to the other models but didn’t use all the features for training and had a better AIC value. In practice, larger datasets could give better results with a Random Forest, since they are more representative and models trained on more diverse data tend to be more robust and less prone to overfitting.

The provided datasets were generated using random logic and were not genuine sales data. Since the data was randomly generated, the patterns, connections, and distributions within it were not indicative of real-world sales circumstances. As a result, any patterns or associations found by the models could be artifacts of the random data generating process rather than real-world occurrences. Both our Logistic Regression and Random Forest models struggled to identify critical priorities in both datasets. This difficulty was caused by the data’s inherent randomness, which lacked any underlying patterns that a model could leverage. Models trained on this data should not be used in real-world circumstances; the provided data was not adequate for true predictive modeling applications. The above work could still be used for the preprocessing of sales data, as it wouldn’t lead to costly mistakes on real data. In reality, if I had to make a business choice, I’d choose a model built on the dataset with 50,000 sales, as its larger sample size was more likely to produce a more general model, and, remembering the pros and cons of both algorithms, I would choose the Random Forest. But with these random data, the Random Forest model was the selection for 1,000 sales and Logistic regression with Lasso regularization for 50,000 sales.

As a result, the randomly generated data might be valuable for practice and testing, but it is critical to understand its limitations when evaluating results or contemplating its use.