Introduction

The project was created to demonstrate my skills in data science and data analysis.

The challenge is based on the following problem: Your company supplies two brands of firearm makers: Remington and Henry Lever. Each company sells different handgun models which require custom-fitted bullets. Your job is to forecast the sales of each bullet-model that is going through the RS channel, as well as the sales aggregate by handgun (Replacement Sales only).

The first step to perform data analysis is to read the dataset and make a basic analysis of the data.

# Set working directory
setwd("C:/Users/olive/Desktop/Ability Apps - Upwork")

# Read datasets
bullets <- read.csv('bullets_sales.csv', stringsAsFactors =TRUE, na.strings = '')
handgun <- read.csv('handgun_sales.csv', stringsAsFactors =TRUE, na.strings = '')

# Print datasets basic analysis
print(str(bullets))
## 'data.frame':    30684 obs. of  6 variables:
##  $ Date    : Factor w/ 2557 levels "1810-01-01","1810-01-02",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Maker   : Factor w/ 2 levels "Henry Lever",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Model   : Factor w/ 3 levels "model_1","model_2",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Bullets : Factor w/ 3 levels "bullets_A","bullets_B",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Channel : Factor w/ 2 levels "RS","TF": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Quantity: int  0 2 3 2 3 2 1 0 0 2 ...
## NULL
print(str(handgun))
## 'data.frame':    15342 obs. of  4 variables:
##  $ Date    : Factor w/ 2557 levels "1810-01-01","1810-01-02",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Maker   : Factor w/ 2 levels "Henry Lever",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Model   : Factor w/ 3 levels "model_1","model_2",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Quantity: int  4 6 13 23 32 37 33 24 21 29 ...
## NULL

It is s possible to notice observing the basic analysis of the datasets that the ‘bullets_sales.csv’ dataset has 6 variables (5 categorical and 1 numeric) with 30684 observations each. On the other hand, the ‘handgun_sales.csv’ dataset has 4 variables (3 categorical and 1 numeric) with 15342 observations each.

Refine data

Initially, the datasets were evaluated to check for missing values (NA). This is an essential procedure to begin the datasets refinement.

# Load package
library(naniar)

# Get missing values (NA) summary
NA_bullets <- as.data.frame(miss_var_summary(bullets))
NA_handgun <- as.data.frame(miss_var_summary(handgun))

# Print missing values tables
print(NA_bullets)
##   variable n_miss pct_miss
## 1    Maker  15342       50
## 2    Model  15342       50
## 3     Date      0        0
## 4  Bullets      0        0
## 5  Channel      0        0
## 6 Quantity      0        0
print(NA_handgun)
##   variable n_miss pct_miss
## 1     Date      0        0
## 2    Maker      0        0
## 3    Model      0        0
## 4 Quantity      0        0

The NA tables attest that the variables Marker and Model of the bullets dataset has 50% of missing values, those missing values needs to be refined.

It this part of the project the variables will be classified according to its definition. The variable Date is present in both dataset, is important change this variable classification from factor to Date.

# Refine varible Date
bullets$Date <- as.Date(bullets$Date)
handgun$Date <- as.Date(handgun$Date)

The challenge statement indicates that the main objective is to forecast the sales of each bullet-model that is going through the RS channel. Because of that, the bullet dataset needs to be refined to have only the RS channel.

# Separate Replacement Sales Channel
Ord <- bullets$Channel == 'RS'
bullets <- bullets[Ord,]

# Print bullets dataframe structure
print(str(bullets))
## 'data.frame':    15342 obs. of  6 variables:
##  $ Date    : Date, format: "1810-01-01" "1810-01-02" ...
##  $ Maker   : Factor w/ 2 levels "Henry Lever",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ Model   : Factor w/ 3 levels "model_1","model_2",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ Bullets : Factor w/ 3 levels "bullets_A","bullets_B",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Channel : Factor w/ 2 levels "RS","TF": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Quantity: int  87 147 291 237 145 110 145 292 162 156 ...
## NULL

As mentioned before the bullets dataset variables Marker and Model contain missing values, it is necessary to examine with more details the NA´s to refine the dataset.

# Varibles  Marker and Model missing values
print(summary(is.na(bullets$Maker)))
##    Mode    TRUE 
## logical   15342
print(summary(is.na(bullets$Model)))
##    Mode    TRUE 
## logical   15342
# Remove variables
bullets$Maker <- NULL
bullets$Model <- NULL

The variables Maker and Model were excluded from the dataset bullets, after the verification that those variables had all observations as missing values.

The challenge requests to return a 12-months ahead forecast. Because of that, the datasets will be separated into training into test sets according to the date.

# Load package
library(dplyr)

# Aggregate the handgun datasets by date
handgun <-  aggregate(handgun$Quantity,
                          by = list(handgun$Date, handgun$Maker, handgun$Model),
                          FUN = sum)

# Name columns
names(handgun) <- c('Date','Maker','Model','Quantity')

# Aggregate the bullets datasets by date
bullets <-  aggregate(bullets$Quantity,
                          by = list(bullets$Date, bullets$Bullets),
                          FUN = sum)
# Name columns
names(bullets) <- c('Date','Bullet_Type','Quantity')

# Separate bullets dataset into training and test set
bullets_train <- filter(bullets, bullets$Date < '1816-01-01')
bullets_test <- filter(bullets, bullets$Date >= '1816-01-01')

# Separate handgun dataset into training and test set
handgun_train <- filter(handgun, handgun$Date < '1816-01-01')
handgun_test <- filter(handgun, handgun$Date >= '1816-01-01')

Data Analysis

Secondly, after refinement of the dataset, the data will be analysed, using graphics, to highlight important characteristics of the dataset variables.

The first graph focus on the quantity of bullets sells according to the model of the bullets.

Graph 1 - Bullets type lineplot

# Load packages
library(ggplot2)
library(plotly)

# Plot graphic
Plot1 <- ggplot(bullets_train,aes(x=Date, y=Quantity, color=Bullet_Type)) +
  geom_line(size=0.3)+
  theme_minimal() 

ggplotly(Plot1)

It is possible to notice with Graph 1 that, the most sell bullet type is the A followed by the B and C. Beyond that, the graph shows that, in general, the bullets seeling increases over time.

The second graph focus on the quantity of handgun sells according to the Maker type.

Graph 2 - Handgun maker type boxplot

#Refine handgun for plot
handgun_plot <- handgun_train

handgun_plot$Date <- format(handgun_plot$Date, '%Y')
handgun_plot$Date <- as.factor(as.character(handgun_plot$Date))

#Plot graphic
plot_ly(
  handgun_plot,
  x = ~ Date,
  y = ~ Quantity,
  color = ~ as.factor(Maker),
  colors = "Dark2",
  type = "box") %>%
  layout(boxmode = "group") 

Graph 2 shows that the handguns with Marker Henry Lever are increasing sellings over time, on the other hand, the selling of Remington Marker handgun is decreasing.

Predictive Models

The algorithm chosen to be used by the predictive model was the Random Forest, this algorithm is a supervised learning model that can be used to solve both regression (this challenge case) and classification problems, making it a diverse model that is widely used. Furthermore, it´s easy to apply the algorithm and doesn’t request much refinement of the dataset.

To obtain the predict the sales of each bullet for the year 1816, the Random Forest algorithm will be applied in the handgun_training dataset.

# Load package
library(randomForest)

# First Model regressor
regressor <- randomForest(x = bullets_train[1:2],
                         y = bullets_train$Quantity,
                         ntree = 500)

# Predicting a new result with Random Forest regression
y_pred <- predict(regressor, bullets_test[1:2])

# Print prediction summary
print(summary(y_pred))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   251.6   251.6  1248.4   931.9  1295.6  1295.6
# Print real result summary
print(summary(bullets_test$Quantity))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       1     249    1203    1130    1514    3900
# Prediction and real result difference 
diff <- y_pred - bullets_test$Quantity

# Print difference summary
print(summary(diff))
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -2604.37  -254.35   -18.88  -198.36   117.55   777.63

Additionally, the second model will predict the sales aggregate by handgun. the Random Forest algorithm will be applied to the handgun training dataset.

# Second Model regressor
regressor <- randomForest(x = handgun_train[1:3],
                          y = handgun_train$Quantity,
                          ntree = 500)

# Predicting a new result with Random Forest regression
y_pred <- predict(regressor, handgun_test[1:3])

# Print aggregate prediction sale
print(sum(y_pred))
## [1] 332854.5
# Print aggregate real result
print(sum(handgun_test$Quantity))
## [1] 281463

Conclusion

To summarize, both models showed average results with significant differences from the real results values to the prediction values. To improve the efficiency of the models is recommended to unite the handgun and bullets dataset to create more efficient models.