Data Description

For this analysis I will be exploring the Superstore Dataset obtained from Kaggle.com. Superstore Dataset | Kaggle

superstore = read.csv("Sample - Superstore.csv")
str(superstore)
## 'data.frame':    9994 obs. of  21 variables:
##  $ Row.ID       : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Order.ID     : chr  "CA-2016-152156" "CA-2016-152156" "CA-2016-138688" "US-2015-108966" ...
##  $ Order.Date   : chr  "11/8/2016" "11/8/2016" "6/12/2016" "10/11/2015" ...
##  $ Ship.Date    : chr  "11/11/2016" "11/11/2016" "6/16/2016" "10/18/2015" ...
##  $ Ship.Mode    : chr  "Second Class" "Second Class" "Second Class" "Standard Class" ...
##  $ Customer.ID  : chr  "CG-12520" "CG-12520" "DV-13045" "SO-20335" ...
##  $ Customer.Name: chr  "Claire Gute" "Claire Gute" "Darrin Van Huff" "Sean O'Donnell" ...
##  $ Segment      : chr  "Consumer" "Consumer" "Corporate" "Consumer" ...
##  $ Country      : chr  "United States" "United States" "United States" "United States" ...
##  $ City         : chr  "Henderson" "Henderson" "Los Angeles" "Fort Lauderdale" ...
##  $ State        : chr  "Kentucky" "Kentucky" "California" "Florida" ...
##  $ Postal.Code  : int  42420 42420 90036 33311 33311 90032 90032 90032 90032 90032 ...
##  $ Region       : chr  "South" "South" "West" "South" ...
##  $ Product.ID   : chr  "FUR-BO-10001798" "FUR-CH-10000454" "OFF-LA-10000240" "FUR-TA-10000577" ...
##  $ Category     : chr  "Furniture" "Furniture" "Office Supplies" "Furniture" ...
##  $ Sub.Category : chr  "Bookcases" "Chairs" "Labels" "Tables" ...
##  $ Product.Name : chr  "Bush Somerset Collection Bookcase" "Hon Deluxe Fabric Upholstered Stacking Chairs, Rounded Back" "Self-Adhesive Address Labels for Typewriters by Universal" "Bretford CR4500 Series Slim Rectangular Table" ...
##  $ Sales        : num  262 731.9 14.6 957.6 22.4 ...
##  $ Quantity     : int  2 3 2 5 2 7 4 6 3 5 ...
##  $ Discount     : num  0 0 0 0.45 0.2 0 0 0.2 0.2 0 ...
##  $ Profit       : num  41.91 219.58 6.87 -383.03 2.52 ...

The raw data consists of 9994 observations of 21 different variables. Metadata as provided by the data set owner:

summary(superstore)
##      Row.ID       Order.ID          Order.Date         Ship.Date        
##  Min.   :   1   Length:9994        Length:9994        Length:9994       
##  1st Qu.:2499   Class :character   Class :character   Class :character  
##  Median :4998   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :4998                                                           
##  3rd Qu.:7496                                                           
##  Max.   :9994                                                           
##   Ship.Mode         Customer.ID        Customer.Name        Segment         
##  Length:9994        Length:9994        Length:9994        Length:9994       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##    Country              City              State            Postal.Code   
##  Length:9994        Length:9994        Length:9994        Min.   : 1040  
##  Class :character   Class :character   Class :character   1st Qu.:23223  
##  Mode  :character   Mode  :character   Mode  :character   Median :56431  
##                                                           Mean   :55190  
##                                                           3rd Qu.:90008  
##                                                           Max.   :99301  
##     Region           Product.ID          Category         Sub.Category      
##  Length:9994        Length:9994        Length:9994        Length:9994       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##  Product.Name           Sales              Quantity        Discount     
##  Length:9994        Min.   :    0.444   Min.   : 1.00   Min.   :0.0000  
##  Class :character   1st Qu.:   17.280   1st Qu.: 2.00   1st Qu.:0.0000  
##  Mode  :character   Median :   54.490   Median : 3.00   Median :0.2000  
##                     Mean   :  229.858   Mean   : 3.79   Mean   :0.1562  
##                     3rd Qu.:  209.940   3rd Qu.: 5.00   3rd Qu.:0.2000  
##                     Max.   :22638.480   Max.   :14.00   Max.   :0.8000  
##      Profit         
##  Min.   :-6599.978  
##  1st Qu.:    1.729  
##  Median :    8.666  
##  Mean   :   28.657  
##  3rd Qu.:   29.364  
##  Max.   : 8399.976

Data Cleaning

#identifying if there are any missing values in the dataset. Will return FALSE if none.
any(is.na(superstore))
## [1] FALSE
#viewing the number of rows and columns
nrow(superstore)
## [1] 9994
ncol(superstore)
## [1] 21
#Code to remove any NA values if there were any in the dataset. 
superstore = superstore[complete.cases(superstore),]
#checking the number of rows and columns to see if any were removed from our above code.
nrow(superstore)
## [1] 9994
ncol(superstore)
## [1] 21

There are no missing values in our dataset because the is.na function returned FALSE. I checked the number of rows and columns before applying code that would remove any NA values if there were any in the dataset. The complete.cases function was used to do this. Then I checked the number of rows and columns after. The number of rows and columns stayed the same, which is expected because our data had no missing values.

superstore$Row.ID = NULL
superstore$Customer.Name = NULL

I am also dropping Row.ID and Customer.Name because they are redundant to any analysis that we will be doing.

str(superstore)
## 'data.frame':    9994 obs. of  19 variables:
##  $ Order.ID    : chr  "CA-2016-152156" "CA-2016-152156" "CA-2016-138688" "US-2015-108966" ...
##  $ Order.Date  : chr  "11/8/2016" "11/8/2016" "6/12/2016" "10/11/2015" ...
##  $ Ship.Date   : chr  "11/11/2016" "11/11/2016" "6/16/2016" "10/18/2015" ...
##  $ Ship.Mode   : chr  "Second Class" "Second Class" "Second Class" "Standard Class" ...
##  $ Customer.ID : chr  "CG-12520" "CG-12520" "DV-13045" "SO-20335" ...
##  $ Segment     : chr  "Consumer" "Consumer" "Corporate" "Consumer" ...
##  $ Country     : chr  "United States" "United States" "United States" "United States" ...
##  $ City        : chr  "Henderson" "Henderson" "Los Angeles" "Fort Lauderdale" ...
##  $ State       : chr  "Kentucky" "Kentucky" "California" "Florida" ...
##  $ Postal.Code : int  42420 42420 90036 33311 33311 90032 90032 90032 90032 90032 ...
##  $ Region      : chr  "South" "South" "West" "South" ...
##  $ Product.ID  : chr  "FUR-BO-10001798" "FUR-CH-10000454" "OFF-LA-10000240" "FUR-TA-10000577" ...
##  $ Category    : chr  "Furniture" "Furniture" "Office Supplies" "Furniture" ...
##  $ Sub.Category: chr  "Bookcases" "Chairs" "Labels" "Tables" ...
##  $ Product.Name: chr  "Bush Somerset Collection Bookcase" "Hon Deluxe Fabric Upholstered Stacking Chairs, Rounded Back" "Self-Adhesive Address Labels for Typewriters by Universal" "Bretford CR4500 Series Slim Rectangular Table" ...
##  $ Sales       : num  262 731.9 14.6 957.6 22.4 ...
##  $ Quantity    : int  2 3 2 5 2 7 4 6 3 5 ...
##  $ Discount    : num  0 0 0 0.45 0.2 0 0 0.2 0.2 0 ...
##  $ Profit      : num  41.91 219.58 6.87 -383.03 2.52 ...

Data Analysis

The variable I want to analyze is the Sales variable because it would be the most valuable to a business to see which products, places, customers etc. result in the most sales.

Two basic graphs to describe the distribution of Sales variable.

library(ggplot2)
ggplot(superstore, aes(x=Sales))+geom_histogram(bins=50)

ggplot(superstore, aes(x=Sales))+geom_density()

Further detailing the two above graphs using: * title * theme * Setting the range of x axis * removing the y-axis label

a = ggplot(superstore, aes(x=Sales))
a+geom_histogram(bins=50)+ggtitle("Histogram of Sales")+theme(plot.title=element_text(hjust=0.5))+coord_cartesian(xlim=c(0, 5000))+ylab(NULL)

a+geom_density()+ggtitle("Density Curve of Sales")+theme(plot.title=element_text(hjust=0.5))+coord_cartesian(xlim=c(0, 5000))+ylab(NULL)

Exploring the Sales variable with some other variables in the data:

ggplot(superstore, aes(x=Ship.Mode, y=Sales, fill=Ship.Mode))+geom_col()+scale_fill_discrete(guide = "none")+ggtitle("Sales vs Shipping Mode")+theme(plot.title=element_text(hjust=0.5))

Based on the above graph, most customers from this superstore data ordered Standard Class shipping.

superstore$Order.Date = as.Date(superstore$Order.Date, format = "%m/%d/%Y")

ggplot(superstore, aes(x=as.Date(Order.Date), y=Sales))+geom_point(color="purple")+xlab("Order Date")+ggtitle("Sales Over Time")+theme(plot.title=element_text(hjust=0.5))

The above graph shows the distribution of order sales across time. There seems to be a pretty even distribution of sales across the years. There is a single point near the beginning of 2014 that resulted in the largest sale amount.

ggplot(superstore, aes(x=Sub.Category,y=Sales,fill=Category))+geom_col()+ggtitle("Sales of Sub Categories")+xlab("Sub Category")+theme(plot.title=element_text(hjust=0.5))+theme(axis.text.x = element_text(angle=90, hjust=1, vjust=.5))

This graph indicates the categories and subcategories that received the most sales. Chairs and Phones were the two most purchased sub categories in the Furniture and Technology categories respectfully. The least purchased sub category is fasteners in office supplies.

ggplot(superstore, aes(x=Segment, y=Sales, fill=Segment))+geom_col()+scale_fill_discrete(guide = "none")+ggtitle("Sales of Segments")+theme(plot.title=element_text(hjust=0.5))

The graph of segments indicates the type of customer that purchased. It looks like consumers are the customers that purchase the most, followed by corporate, and home office customers.

ggplot(superstore, aes(x=State, y=Sales, fill=Region))+geom_col()+theme(axis.text.x = element_text(angle=90, hjust=1, vjust=.5))

Above is a bar graph of each state’s sales. California, New York, and Texas are the top three states with the most sales.