We are data analysts at Carrefour Kenya, currently undertaking a project that will inform the marketing department on the most relevant marketing strategies that will result in the highest number of sales (total price including tax). This project is aimed at analysing the dataset provided by Carrefour and creating insights on how to achieve the highest sales.
Be able to eliminate redundancy in the dataset and identify any existing anomalies.
Carrefour is an international chain of retail supermarkets. It was set up in Kenya in 2016 and has performed well over the years. This project is aimed at creating insights from existing and current trends to develop marketing strategies that will enable the marketing team to achieve higher sales.
The provided data is relevant for this study since it has been sourced from the Carrefour database and reflects current transactions.
library(modelr)
library(broom)
##
## Attaching package: 'broom'
## The following object is masked from 'package:modelr':
##
## bootstrap
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
library(rpart)
library(ggplot2)
library(Amelia)
## Loading required package: Rcpp
## ##
## ## Amelia II: Multiple Imputation
## ## (Version 1.8.0, built: 2021-05-26)
## ## Copyright (C) 2005-2022 James Honaker, Gary King and Matthew Blackwell
## ## Refer to http://gking.harvard.edu/amelia/ for more information
## ##
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(data.table)
##
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
##
## between, first, last
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v tibble 3.1.6 v purrr 0.3.4
## v tidyr 1.2.0 v stringr 1.4.0
## v readr 2.1.2 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x data.table::between() masks dplyr::between()
## x broom::bootstrap() masks modelr::bootstrap()
## x dplyr::filter() masks stats::filter()
## x data.table::first() masks dplyr::first()
## x dplyr::lag() masks stats::lag()
## x data.table::last() masks dplyr::last()
## x purrr::lift() masks caret::lift()
## x purrr::transpose() masks data.table::transpose()
sales.dates <- fread('http://bit.ly/CarreFourSalesDataset')
sales.dates
## Date Sales
## 1: 1/5/2019 548.9715
## 2: 3/8/2019 80.2200
## 3: 3/3/2019 340.5255
## 4: 1/27/2019 489.0480
## 5: 2/8/2019 634.3785
## ---
## 996: 1/29/2019 42.3675
## 997: 3/2/2019 1022.4900
## 998: 2/9/2019 33.4320
## 999: 2/22/2019 69.1110
## 1000: 2/18/2019 649.2990
head(sales.dates, 6)
## Date Sales
## 1: 1/5/2019 548.9715
## 2: 3/8/2019 80.2200
## 3: 3/3/2019 340.5255
## 4: 1/27/2019 489.0480
## 5: 2/8/2019 634.3785
## 6: 3/25/2019 627.6165
tail(sales.dates, 6)
## Date Sales
## 1: 2/18/2019 63.9975
## 2: 1/29/2019 42.3675
## 3: 3/2/2019 1022.4900
## 4: 2/9/2019 33.4320
## 5: 2/22/2019 69.1110
## 6: 2/18/2019 649.2990
str(sales.dates)
## Classes 'data.table' and 'data.frame': 1000 obs. of 2 variables:
## $ Date : chr "1/5/2019" "3/8/2019" "3/3/2019" "1/27/2019" ...
## $ Sales: num 549 80.2 340.5 489 634.4 ...
## - attr(*, ".internal.selfref")=<externalptr>
summary(sales.dates)
## Date Sales
## Length:1000 Min. : 10.68
## Class :character 1st Qu.: 124.42
## Mode :character Median : 253.85
## Mean : 322.97
## 3rd Qu.: 471.35
## Max. :1042.65
dim(sales.dates)
## [1] 1000 2
This is achieved by checking for missing values and imputing any that are found, so that correct predictions can be made.
is.null(sales.dates)
## [1] FALSE
total_null <- sum(is.na(sales.dates))
total_null
## [1] 0
lapply(sales.dates, function(x) sum(is.na(x)))
## $Date
## [1] 0
##
## $Sales
## [1] 0
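No missing values were found in either column. Had any been present, a simple imputation could be applied before modelling; the sketch below is hypothetical (median imputation of Sales), since no NAs exist here.

# Hypothetical sketch: impute missing Sales with the column median (not needed here, as no NAs were found)
if (anyNA(sales.dates$Sales)) {
  sales.dates$Sales[is.na(sales.dates$Sales)] <- median(sales.dates$Sales, na.rm = TRUE)
}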
Consistency is achieved by removing all duplicated rows.
duplicated_rows <- sales.dates[duplicated(sales.dates), ]
duplicated_rows
## Empty data.table (0 rows and 2 cols): Date,Sales
anyDuplicated(sales.dates)
## [1] 0
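No duplicated rows were found. If any had existed, they could be dropped with unique(), which data.table applies row-wise; the line below is a sketch and is a no-op on this dataset.

# Sketch: drop duplicated rows, keeping the first occurrence (no effect here since none exist)
sales.dates <- unique(sales.dates)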
Relevance is achieved by ensuring that all the features provided for the analysis are relevant to the objective, which in this case they are.
We also check that all entries are correct. Any outliers in the dataset can be visualised using a boxplot.
colnames(sales.dates)
## [1] "Date" "Sales"
boxplot(sales.dates$Sales)
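As a complement to the boxplot, the whisker rule can be expressed in code to count how many Sales values fall beyond 1.5 times the interquartile range; this is a quick sketch of the same check.

# Sketch: count Sales values beyond 1.5 * IQR from the quartiles (the boxplot whisker rule)
q <- quantile(sales.dates$Sales, c(0.25, 0.75))
iqr <- IQR(sales.dates$Sales)
sum(sales.dates$Sales < q[1] - 1.5 * iqr | sales.dates$Sales > q[2] + 1.5 * iqr)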
#install.packages('tidyverse')
#install.packages('tibbletime')
#install.packages('anomalize')
#install.packages('timetk')
library(tidyverse)
library(tibbletime)
##
## Attaching package: 'tibbletime'
## The following object is masked from 'package:stats':
##
## filter
library(anomalize)
## == Use anomalize to improve your Forecasts by 50%! =============================
## Business Science offers a 1-hour course - Lab #18: Time Series Anomaly Detection!
## </> Learn more at: https://university.business-science.io/p/learning-labs-pro </>
library(timetk)
##
## Attaching package: 'timetk'
## The following object is masked from 'package:data.table':
##
## :=
head(sales.dates)
## Date Sales
## 1: 1/5/2019 548.9715
## 2: 3/8/2019 80.2200
## 3: 3/3/2019 340.5255
## 4: 1/27/2019 489.0480
## 5: 2/8/2019 634.3785
## 6: 3/25/2019 627.6165
# The source dates use a four-digit year (e.g. "1/5/2019"), so the format string is "%Y"
sales.dates$Date <- as.Date(sales.dates$Date, format = "%m/%d/%Y")
str(sales.dates)
## Classes 'data.table' and 'data.frame': 1000 obs. of 2 variables:
## $ Date : Date, format: "2020-01-05" "2020-03-08" ...
## $ Sales: num 549 80.2 340.5 489 634.4 ...
## - attr(*, ".internal.selfref")=<externalptr>
# Conversion to POSIXct type
sales.dates$Date <- as.POSIXct(sales.dates$Date)
sales_tbl <- as_tibble(sales.dates)
head(sales_tbl)
## # A tibble: 6 x 2
## Date Sales
## <dttm> <dbl>
## 1 2020-01-05 03:00:00 549.
## 2 2020-03-08 03:00:00 80.2
## 3 2020-03-03 03:00:00 341.
## 4 2020-01-27 03:00:00 489.
## 5 2020-02-08 03:00:00 634.
## 6 2020-03-25 03:00:00 628.
#sales_tbl <- na.omit(sales_tbl)
head(sales_tbl)
## # A tibble: 6 x 2
## Date Sales
## <dttm> <dbl>
## 1 2020-01-05 03:00:00 549.
## 2 2020-03-08 03:00:00 80.2
## 3 2020-03-03 03:00:00 341.
## 4 2020-01-27 03:00:00 489.
## 5 2020-02-08 03:00:00 634.
## 6 2020-03-25 03:00:00 628.
sales_tbl %>%
time_decompose(Sales, method = "stl", frequency = "auto", trend = "auto") %>%
anomalize(remainder, method = "gesd", alpha = 0.05, max_anoms = 0.1) %>%
plot_anomaly_decomposition()
## Converting from tbl_df to tbl_time.
## Auto-index message: index = Date
## Note: Index not ordered. tibbletime assumes index is in ascending order. Results may not be as desired.
## frequency = 12 seconds
## Note: Index not ordered. tibbletime assumes index is in ascending order. Results may not be as desired.
## trend = 12 seconds
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
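The "Index not ordered" notes above indicate that the transactions are not sorted by date, which tibbletime expects for decomposition. A sketch of ordering the tibble by Date first is shown below (the hypothetical name sales_tbl_ordered is used so the results above, produced on the unordered data, are left as-is).

# Sketch: sort transactions by Date so the time index is ascending before decomposition
sales_tbl_ordered <- sales_tbl %>% arrange(Date)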
Recomposing.
sales_tbl %>%
time_decompose(Sales, method = "stl", frequency = "auto", trend = "auto") %>%
anomalize(remainder, method = "gesd", alpha = 0.05, max_anoms = 0.1) %>%
time_recompose() %>%
plot_anomalies(time_recomposed = TRUE, ncol = 3, alpha_dots = 0.5)
## Converting from tbl_df to tbl_time.
## Auto-index message: index = Date
## Note: Index not ordered. tibbletime assumes index is in ascending order. Results may not be as desired.
## frequency = 12 seconds
## Note: Index not ordered. tibbletime assumes index is in ascending order. Results may not be as desired.
## trend = 12 seconds
Finding Anomalies.
anomalies = sales_tbl %>%
time_decompose(Sales, method = "stl", frequency = "auto", trend = "auto") %>%
anomalize(remainder, method = "gesd", alpha = 0.05, max_anoms = 0.1) %>%
time_recompose() %>%
filter(anomaly == 'Yes')
## Converting from tbl_df to tbl_time.
## Auto-index message: index = Date
## Note: Index not ordered. tibbletime assumes index is in ascending order. Results may not be as desired.
## frequency = 12 seconds
## Note: Index not ordered. tibbletime assumes index is in ascending order. Results may not be as desired.
## trend = 12 seconds
anomalies
## # A time tibble: 9 x 10
## # Index: Date
## Date observed season trend remainder remainder_l1 remainder_l2
## <dttm> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2020-03-29 03:00:00 923. -14.4 203. 734. -738. 714.
## 2 2020-02-15 03:00:00 1043. 28.7 286. 728. -738. 714.
## 3 2020-02-08 03:00:00 1021. 28.7 201. 791. -738. 714.
## 4 2020-03-08 03:00:00 952. 12.7 217. 722. -738. 714.
## 5 2020-01-30 03:00:00 1034. 10.4 209. 815. -738. 714.
## 6 2020-03-09 03:00:00 935. -34.6 211. 759. -738. 714.
## 7 2020-01-15 03:00:00 1022. -14.4 209. 828. -738. 714.
## 8 2020-02-19 03:00:00 932. -34.6 189. 778. -738. 714.
## 9 2020-03-02 03:00:00 1022. -14.4 91.6 945. -738. 714.
## # ... with 3 more variables: anomaly <chr>, recomposed_l1 <dbl>,
## # recomposed_l2 <dbl>
The dates listed above contain anomalies.
The team should investigate these anomalies rather than treating them as normal entries in the data.
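For a quick hand-off to the marketing team, the anomalous dates and their observed sales can be pulled out of the table above; a short sketch:

# Sketch: extract the anomalous dates and observed sales for follow-up
anomalies %>%
  select(Date, observed) %>%
  arrange(Date)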