AIB TERM PAPER

HARIKRISHNAN PS - CB.BU.P2ASB22079

YADUKRISHNAN K - CB.BU.P2ASB22191

Introduction

The Sales dataset contains information about various sales transactions, likely from a retail or wholesale business. It consists of several variables that provide insights into the quantity of products ordered, pricing details, costs, turnover, and margins associated with each transaction.

Here’s a brief overview of the variables included in the dataset:

  1. Quantity_Ordered: This variable represents the quantity of products ordered in each transaction.

  2. Price_Each: It denotes the price of each product sold.

  3. Cost_price: This variable indicates the cost price of each product.

  4. turnover: The turnover variable likely represents the total revenue generated from each transaction.

  5. margin: The profit on each transaction, calculated as the difference between revenue (turnover) and cost.
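As a quick check of how these variables relate, the sketch below recomputes turnover and margin from the first five rows of the dataset, using the values as displayed by str() later in this paper. Those displayed values are rounded, so the comparison only looks at the size of the gap. The two formulas (turnover = Quantity_Ordered x Price_Each, margin = turnover - Quantity_Ordered x Cost_price) are assumptions inferred from these rows, not documented definitions.

```r
# First five rows as displayed (rounded) by str(Sales)
first_rows <- data.frame(
  Quantity_Ordered = c(1, 1, 2, 1, 1),
  Price_Each       = c(700, 14.9, 12, 150, 12),
  Cost_price       = c(231, 7.47, 6, 97.49, 6),
  turnover         = c(700, 14.9, 24, 150, 12),
  margin           = c(469, 7.47, 11.99, 52.5, 6)
)

# Assumed relationships (inferred from the rows above, not documented)
turnover_check <- with(first_rows, Quantity_Ordered * Price_Each)
margin_check   <- with(first_rows, turnover - Quantity_Ordered * Cost_price)

# Largest absolute deviation from the recorded values
max_turnover_gap <- max(abs(turnover_check - first_rows$turnover))
max_margin_gap   <- max(abs(margin_check - first_rows$margin))
print(c(turnover = max_turnover_gap, margin = max_margin_gap))
```

If the assumed formulas hold, both gaps should be no larger than the rounding in the displayed values.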

Step 1 – collecting data

The dataset contains 1999 observations or rows, suggesting a reasonably large sample size for analysis. Each observation likely corresponds to a unique sales transaction recorded over a period of time.

In the provided code, normalization techniques are applied to the dataset, followed by the construction and evaluation of neural network models to predict the margin associated with each transaction. These models leverage the features provided in the dataset to learn patterns and relationships between the predictors (e.g., quantity ordered, price, turnover) and the target variable (margin).

Step 2 – exploring and preparing the data

Sales <- read.csv("C:/AMRITA SCHOOL OF BUSINESS/Trimister 6/AIB/HARI/Sales_Data.csv")
str(Sales)
'data.frame':   1999 obs. of  5 variables:
 $ Quantity_Ordered: int  1 1 2 1 1 1 1 1 1 1 ...
 $ Price_Each      : num  700 14.9 12 150 12 ...
 $ Cost_price      : num  231 7.47 6 97.49 6 ...
 $ turnover        : num  700 14.9 24 150 12 ...
 $ margin          : num  469 7.47 11.99 52.5 6 ...

The read.csv() call reads the file “Sales_Data.csv” from the specified directory into a data frame named Sales.

The str() call then displays the structure of that data frame: 1999 observations (rows) and 5 variables (columns). Quantity_Ordered is an integer, and the other four variables are numeric.

normalize <- function(x) { 
  return((x - min(x)) / (max(x) - min(x)))
}

This defines a function named normalize that takes a vector x as input and returns its normalized values.
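To make the behaviour of this min-max normalization concrete, here is a small sketch on made-up vectors (the vectors example_values and constant_values are hypothetical and not part of the dataset). It also shows one edge case worth knowing about: a column whose values are all identical has a range of zero, so the division yields NaN.

```r
normalize <- function(x) {
  return((x - min(x)) / (max(x) - min(x)))
}

# Hypothetical values: the smallest maps to 0 and the largest to 1
example_values <- c(10, 20, 30)
normalized_example <- normalize(example_values)
print(normalized_example)

# Hypothetical constant column: the range is zero, so 0 / 0 gives NaN
constant_values <- c(5, 5, 5)
normalized_constant <- normalize(constant_values)
print(normalized_constant)
```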

Sales_norm <- as.data.frame(lapply(Sales, normalize))

This code applies the normalize function to each column of the Sales data frame using lapply and stores the normalized data in a new data frame called Sales_norm.

summary(Sales_norm$margin)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
0.000000 0.003938 0.005257 0.104094 0.044839 1.000000 
summary(Sales$margin)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
   1.495    5.975    7.475  119.903   52.500 1139.000 

These lines provide summary statistics (minimum, quartiles, median, mean, and maximum) for the ‘margin’ column, first after normalization and then before it. The normalized values run from 0 to 1, while the original margins run from 1.495 to 1139.
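Because the model is trained on normalized data, its predictions are also on the 0 to 1 scale. The sketch below shows one way to map normalized values back to the original units, using the minimum and maximum from summary(Sales$margin) above. The helper unnormalize is a hypothetical name introduced here for illustration. It is checked against the normalized median and mean reported above, which should come out close to the original 7.475 and 119.903.

```r
# Hypothetical helper: inverse of the min-max normalization
unnormalize <- function(x_norm, min_x, max_x) {
  x_norm * (max_x - min_x) + min_x
}

margin_min <- 1.495   # Min. from summary(Sales$margin)
margin_max <- 1139    # Max. from summary(Sales$margin)

# Normalized median and mean taken from summary(Sales_norm$margin)
median_back <- unnormalize(0.005257, margin_min, margin_max)
mean_back   <- unnormalize(0.104094, margin_min, margin_max)
print(c(median = median_back, mean = mean_back))
```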

Sales_train <- Sales_norm[1:1500, ]
Sales_test <- Sales_norm[1501:1999, ]

This code splits the normalized data into training and testing sets. The first 1500 rows are assigned to the training set (Sales_train), and the remaining 499 rows are assigned to the testing set (Sales_test).
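A sequential split like this works well when the rows are in no particular order. If the file happened to be sorted (for example by date or product), a random split may give more representative sets. The following is a minimal sketch of that alternative, not the method used in this paper; demo_data is a hypothetical stand-in with the same number of rows as Sales_norm.

```r
set.seed(12345)  # fix the seed so the same rows are drawn on every run

# Hypothetical stand-in for Sales_norm with the same number of rows
demo_data <- data.frame(id = 1:1999, value = runif(1999))

# Draw 1500 row indices at random for training; the rest form the test set
train_idx  <- sample(nrow(demo_data), 1500)
demo_train <- demo_data[train_idx, ]
demo_test  <- demo_data[-train_idx, ]

print(c(train = nrow(demo_train), test = nrow(demo_test)))
```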

Step 3 – training a model on the data

library(neuralnet)
Warning: package 'neuralnet' was built under R version 4.2.3

This code loads the neuralnet library, which is used for building neural networks in R.

set.seed(12345) # to guarantee repeatable results
Sales_model <- neuralnet(margin ~ Quantity_Ordered + Price_Each + Cost_price + turnover,
                              data = Sales_train)

The set.seed() call fixes the random seed, so the same random numbers (including the network's random starting weights) are generated each time the code is run.

The neuralnet() call then constructs a neural network model (Sales_model). The model predicts the ‘margin’ variable from the predictors ‘Quantity_Ordered’, ‘Price_Each’, ‘Cost_price’, and ‘turnover’, using data from the training set (Sales_train). Because the hidden argument is not set, neuralnet uses its default of hidden = 1, that is, a single hidden layer with one neuron.
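To illustrate what a network of this shape computes, the sketch below implements the forward pass in base R under the neuralnet defaults as I understand them (hidden = 1, a logistic activation in the hidden layer, and a linear output neuron). All weights and the input observation are made up for illustration; they are not the fitted weights of Sales_model.

```r
logistic <- function(z) 1 / (1 + exp(-z))

# Hypothetical weights; a trained model would supply its own
hidden_bias    <- 0.1
hidden_weights <- c(Quantity_Ordered = 0.2, Price_Each = 0.5,
                    Cost_price = -0.4, turnover = 0.6)
output_bias    <- -0.3
output_weight  <- 1.5

forward_pass <- function(x) {
  # weighted sum of the four inputs, passed through the logistic function
  hidden_activation <- logistic(hidden_bias + sum(hidden_weights * x))
  # linear output neuron: no activation applied to the final sum
  output_bias + output_weight * hidden_activation
}

# One normalized observation (hypothetical values between 0 and 1)
x <- c(Quantity_Ordered = 0, Price_Each = 0.4, Cost_price = 0.2, turnover = 0.4)
prediction <- forward_pass(x)
print(prediction)
```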

plot(Sales_model)

This command plots the neural network model (Sales_model).

Step 4 – evaluating model performance

# obtain model results
model_results <- compute(Sales_model, Sales_test[1:4])

This code computes the output of the neural network model (Sales_model) using the testing set predictors (Sales_test[1:4]).

# obtain predicted margin values
predicted_Margin <- model_results$net.result

This line extracts the predicted values of the ‘margin’ variable from the model results.

# examine the correlation between predicted and actual values
cor(predicted_Margin, Sales_test$margin)
          [,1]
[1,] 0.9999451

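This code calculates the correlation between the predicted and actual ‘margin’ values in the testing set. A correlation of 0.9999451 indicates an almost perfectly linear relationship between the two. Note that correlation measures how closely the predictions track the actual values, not the size of the prediction errors.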
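Correlation can usefully be paired with an error measure such as the mean absolute error (MAE) or root mean squared error (RMSE). The sketch below computes both on made-up vectors (actual, predicted, and shifted are hypothetical, not model output) and shows why correlation alone can mislead: predictions that are systematically scaled and shifted still correlate perfectly with the actual values.

```r
# Hypothetical actual and predicted values on the normalized scale
actual    <- c(0.10, 0.20, 0.30, 0.40)
predicted <- c(0.10, 0.25, 0.30, 0.35)

mae  <- mean(abs(predicted - actual))        # mean absolute error
rmse <- sqrt(mean((predicted - actual)^2))   # root mean squared error

# Correlation ignores scale and offset: these hypothetical "predictions"
# are far off in absolute terms yet correlate perfectly with the actuals
shifted     <- 2 * actual + 1
cor_shifted <- cor(shifted, actual)
mae_shifted <- mean(abs(shifted - actual))

print(c(mae = mae, rmse = rmse, cor_shifted = cor_shifted, mae_shifted = mae_shifted))
```

The same two lines for mae and rmse could be applied to predicted_Margin and Sales_test$margin to report errors on the normalized scale.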

Step 5 – improving model performance

set.seed(12345) # to guarantee repeatable results
Sales_model2 <- neuralnet(margin ~ Quantity_Ordered + Price_Each + Cost_price + turnover,
                              data = Sales_train, hidden = 5)

This code constructs a second neural network model (Sales_model2). The argument hidden = 5 gives it a single hidden layer with five neurons, compared with the one hidden neuron that neuralnet uses by default.
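One way to see how much complexity hidden = 5 adds is to count the weights the network has to learn. The helper count_weights below is a hypothetical function written for this illustration; it assumes a fully connected feed-forward network with one bias term per neuron.

```r
# Hypothetical helper: number of weights (including bias terms) in a
# fully connected feed-forward network
count_weights <- function(n_inputs, hidden, n_outputs = 1) {
  layer_sizes <- c(n_inputs, hidden, n_outputs)
  # every neuron receives all outputs of the previous layer plus one bias
  sum((head(layer_sizes, -1) + 1) * tail(layer_sizes, -1))
}

weights_default    <- count_weights(4, hidden = 1)        # one hidden neuron
weights_five       <- count_weights(4, hidden = 5)        # one layer, five neurons
weights_two_layers <- count_weights(4, hidden = c(5, 3))  # two hidden layers

print(c(default = weights_default, five = weights_five,
        two_layers = weights_two_layers))
```

In neuralnet, the hidden argument takes a vector with the number of neurons in each hidden layer, so more than one hidden layer would be requested with a vector such as c(5, 3) rather than a single number.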

plot(Sales_model2)

This command plots the second neural network model (Sales_model2).

model_results2 <- compute(Sales_model2, Sales_test[1:4])
predicted_Margin2 <- model_results2$net.result
cor(predicted_Margin2, Sales_test$margin)
          [,1]
[1,] 0.9999671

The compute() call runs the second model (Sales_model2) on the testing set predictors (Sales_test[1:4]), and the next line extracts the predicted ‘margin’ values from the result. The final line calculates the correlation between these predictions and the actual ‘margin’ values in the testing set. At 0.9999671, it is marginally higher than the 0.9999451 obtained with the first model.

Analysis

The R code implements an analysis pipeline for sales data using neural network models. The script begins by loading the sales data from a CSV file named “Sales_Data.csv” into a data frame called Sales. The str() function then shows the structure of this data frame: 1999 observations of 5 variables, namely Quantity_Ordered, Price_Each, Cost_price, turnover, and margin.

Following data loading, a normalization procedure is implemented through a custom-defined normalize() function. This function is applied to each column of the Sales dataframe using lapply(), resulting in a normalized dataframe named Sales_norm. Normalization is crucial for ensuring that all variables are on a similar scale, which aids in the convergence and stability of the neural network models during training.

The script then summarizes the ‘margin’ variable after and before normalization. The normalized values range from 0 to 1, while the original values range from 1.495 to 1139. With a median of 7.475 and a mean of 119.903, the distribution is strongly right-skewed: most transactions have small margins and a few have very large ones.

Next, the normalized data is partitioned into training (Sales_train) and testing (Sales_test) sets. The first 1500 observations are allocated for training the neural network models, while the remaining data is reserved for evaluating model performance.

The neural network modeling stage begins with the loading of the neuralnet library. A neural network model (Sales_model) is then trained using the neuralnet() function, where the ‘margin’ variable serves as the dependent variable, while ‘Quantity_Ordered’, ‘Price_Each’, ‘Cost_price’, and ‘turnover’ are designated as independent variables. Because no hidden argument is supplied, the initial model uses the neuralnet default of a single hidden layer containing one neuron, which is a deliberately simple architecture.

Following model training, a visualization of the neural network structure is generated using the plot() function. This visual representation offers insights into the connections and layers within the model architecture. Subsequently, model results are computed using the testing set, and predicted margin values are derived.

To evaluate model performance, the script calculates the correlation coefficient between the predicted and actual margin values. This metric indicates how closely the predictions track the actual values. The correlation of 0.9999451 on the testing set suggests that the model predicts margin almost perfectly from the provided features. Such a result is plausible here, since margin is defined as the difference between revenue and costs, both of which are captured by the predictors.

In a comparative analysis, the script trains a second neural network model (Sales_model2) of greater complexity, with five neurons in a single hidden layer. The same steps of visualization, result computation, and correlation calculation are repeated for this model, allowing its performance to be compared with that of the initial model.

Both models achieve very high correlations on the testing set (0.9999451 and 0.9999671), so the additional hidden neurons yield only a marginal improvement. These results indicate that neural network models can capture the relationship between the predictors and margin in this sales dataset.