For this analysis I will be exploring the Superstore Dataset obtained from Kaggle.com. Superstore Dataset | Kaggle
superstore = read.csv("Sample - Superstore.csv")
str(superstore)
## 'data.frame': 9994 obs. of 21 variables:
## $ Row.ID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Order.ID : chr "CA-2016-152156" "CA-2016-152156" "CA-2016-138688" "US-2015-108966" ...
## $ Order.Date : chr "11/8/2016" "11/8/2016" "6/12/2016" "10/11/2015" ...
## $ Ship.Date : chr "11/11/2016" "11/11/2016" "6/16/2016" "10/18/2015" ...
## $ Ship.Mode : chr "Second Class" "Second Class" "Second Class" "Standard Class" ...
## $ Customer.ID : chr "CG-12520" "CG-12520" "DV-13045" "SO-20335" ...
## $ Customer.Name: chr "Claire Gute" "Claire Gute" "Darrin Van Huff" "Sean O'Donnell" ...
## $ Segment : chr "Consumer" "Consumer" "Corporate" "Consumer" ...
## $ Country : chr "United States" "United States" "United States" "United States" ...
## $ City : chr "Henderson" "Henderson" "Los Angeles" "Fort Lauderdale" ...
## $ State : chr "Kentucky" "Kentucky" "California" "Florida" ...
## $ Postal.Code : int 42420 42420 90036 33311 33311 90032 90032 90032 90032 90032 ...
## $ Region : chr "South" "South" "West" "South" ...
## $ Product.ID : chr "FUR-BO-10001798" "FUR-CH-10000454" "OFF-LA-10000240" "FUR-TA-10000577" ...
## $ Category : chr "Furniture" "Furniture" "Office Supplies" "Furniture" ...
## $ Sub.Category : chr "Bookcases" "Chairs" "Labels" "Tables" ...
## $ Product.Name : chr "Bush Somerset Collection Bookcase" "Hon Deluxe Fabric Upholstered Stacking Chairs, Rounded Back" "Self-Adhesive Address Labels for Typewriters by Universal" "Bretford CR4500 Series Slim Rectangular Table" ...
## $ Sales : num 262 731.9 14.6 957.6 22.4 ...
## $ Quantity : int 2 3 2 5 2 7 4 6 3 5 ...
## $ Discount : num 0 0 0 0.45 0.2 0 0 0.2 0.2 0 ...
## $ Profit : num 41.91 219.58 6.87 -383.03 2.52 ...
The raw data consists of 9994 observations of 21 different variables. Metadata as provided by the data set owner:
summary(superstore)
## Row.ID Order.ID Order.Date Ship.Date
## Min. : 1 Length:9994 Length:9994 Length:9994
## 1st Qu.:2499 Class :character Class :character Class :character
## Median :4998 Mode :character Mode :character Mode :character
## Mean :4998
## 3rd Qu.:7496
## Max. :9994
## Ship.Mode Customer.ID Customer.Name Segment
## Length:9994 Length:9994 Length:9994 Length:9994
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## Country City State Postal.Code
## Length:9994 Length:9994 Length:9994 Min. : 1040
## Class :character Class :character Class :character 1st Qu.:23223
## Mode :character Mode :character Mode :character Median :56431
## Mean :55190
## 3rd Qu.:90008
## Max. :99301
## Region Product.ID Category Sub.Category
## Length:9994 Length:9994 Length:9994 Length:9994
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## Product.Name Sales Quantity Discount
## Length:9994 Min. : 0.444 Min. : 1.00 Min. :0.0000
## Class :character 1st Qu.: 17.280 1st Qu.: 2.00 1st Qu.:0.0000
## Mode :character Median : 54.490 Median : 3.00 Median :0.2000
## Mean : 229.858 Mean : 3.79 Mean :0.1562
## 3rd Qu.: 209.940 3rd Qu.: 5.00 3rd Qu.:0.2000
## Max. :22638.480 Max. :14.00 Max. :0.8000
## Profit
## Min. :-6599.978
## 1st Qu.: 1.729
## Median : 8.666
## Mean : 28.657
## 3rd Qu.: 29.364
## Max. : 8399.976
#identifying if there are any missing values in the dataset. Will return FALSE if none.
any(is.na(superstore))
## [1] FALSE
#viewing the number of rows and columns
nrow(superstore)
## [1] 9994
ncol(superstore)
## [1] 21
#Code to remove any NA values if there were any in the dataset.
superstore = superstore[complete.cases(superstore),]
#checking the number of rows and columns to see if any were removed from our above code.
nrow(superstore)
## [1] 9994
ncol(superstore)
## [1] 21
There are no missing values in our dataset because the is.na function returned FALSE. I checked the number of rows and columns before applying code that would remove any NA values if there were any in the dataset. The complete.cases function was used to do this. Then I checked the number of rows and columns after. The number of rows and columns stayed the same, which is expected because our data had no missing values.
superstore$Row.ID = NULL
superstore$Customer.Name = NULL
I am also dropping Row.ID and Customer.Name because they are redundant to any analysis that we will be doing.
str(superstore)
## 'data.frame': 9994 obs. of 19 variables:
## $ Order.ID : chr "CA-2016-152156" "CA-2016-152156" "CA-2016-138688" "US-2015-108966" ...
## $ Order.Date : chr "11/8/2016" "11/8/2016" "6/12/2016" "10/11/2015" ...
## $ Ship.Date : chr "11/11/2016" "11/11/2016" "6/16/2016" "10/18/2015" ...
## $ Ship.Mode : chr "Second Class" "Second Class" "Second Class" "Standard Class" ...
## $ Customer.ID : chr "CG-12520" "CG-12520" "DV-13045" "SO-20335" ...
## $ Segment : chr "Consumer" "Consumer" "Corporate" "Consumer" ...
## $ Country : chr "United States" "United States" "United States" "United States" ...
## $ City : chr "Henderson" "Henderson" "Los Angeles" "Fort Lauderdale" ...
## $ State : chr "Kentucky" "Kentucky" "California" "Florida" ...
## $ Postal.Code : int 42420 42420 90036 33311 33311 90032 90032 90032 90032 90032 ...
## $ Region : chr "South" "South" "West" "South" ...
## $ Product.ID : chr "FUR-BO-10001798" "FUR-CH-10000454" "OFF-LA-10000240" "FUR-TA-10000577" ...
## $ Category : chr "Furniture" "Furniture" "Office Supplies" "Furniture" ...
## $ Sub.Category: chr "Bookcases" "Chairs" "Labels" "Tables" ...
## $ Product.Name: chr "Bush Somerset Collection Bookcase" "Hon Deluxe Fabric Upholstered Stacking Chairs, Rounded Back" "Self-Adhesive Address Labels for Typewriters by Universal" "Bretford CR4500 Series Slim Rectangular Table" ...
## $ Sales : num 262 731.9 14.6 957.6 22.4 ...
## $ Quantity : int 2 3 2 5 2 7 4 6 3 5 ...
## $ Discount : num 0 0 0 0.45 0.2 0 0 0.2 0.2 0 ...
## $ Profit : num 41.91 219.58 6.87 -383.03 2.52 ...
The variable I want to analyze is the Sales variable because it would be the most valuable to a business to see which products, places, customers etc. result in the most sales.
Two basic graphs to describe the distribution of Sales variable.
library(ggplot2)
ggplot(superstore, aes(x=Sales))+geom_histogram(bins=50)
ggplot(superstore, aes(x=Sales))+geom_density()
Further detailing the two above graphs using: * title * theme * Setting the range of x axis * removing the y-axis label
a = ggplot(superstore, aes(x=Sales))
a+geom_histogram(bins=50)+ggtitle("Histogram of Sales")+theme(plot.title=element_text(hjust=0.5))+coord_cartesian(xlim=c(0, 5000))+ylab(NULL)
a+geom_density()+ggtitle("Density Curve of Sales")+theme(plot.title=element_text(hjust=0.5))+coord_cartesian(xlim=c(0, 5000))+ylab(NULL)
Exploring the Sales variable with some other variables in the data:
ggplot(superstore, aes(x=Ship.Mode, y=Sales, fill=Ship.Mode))+geom_col()+scale_fill_discrete(guide = "none")+ggtitle("Sales vs Shipping Mode")+theme(plot.title=element_text(hjust=0.5))
Based on the above graph, most customers from this superstore data ordered Standard Class shipping.
superstore$Order.Date = as.Date(superstore$Order.Date, format = "%m/%d/%Y")
ggplot(superstore, aes(x=as.Date(Order.Date), y=Sales))+geom_point(color="purple")+xlab("Order Date")+ggtitle("Sales Over Time")+theme(plot.title=element_text(hjust=0.5))
The above graph shows the distribution of order sales across time. There seems to be a pretty even distribution of sales across the years. There is a single point near the beginning of 2014 that resulted in the largest sale amount.
ggplot(superstore, aes(x=Sub.Category,y=Sales,fill=Category))+geom_col()+ggtitle("Sales of Sub Categories")+xlab("Sub Category")+theme(plot.title=element_text(hjust=0.5))+theme(axis.text.x = element_text(angle=90, hjust=1, vjust=.5))
This graph indicates the categories and subcategories that received the most sales. Chairs and Phones were the two most purchased sub categories in the Furniture and Technology categories respectfully. The least purchased sub category is fasteners in office supplies.
ggplot(superstore, aes(x=Segment, y=Sales, fill=Segment))+geom_col()+scale_fill_discrete(guide = "none")+ggtitle("Sales of Segments")+theme(plot.title=element_text(hjust=0.5))
The graph of segments indicates the type of customer that purchased. It looks like consumers are the customers that purchase the most, followed by corporate, and home office customers.
ggplot(superstore, aes(x=State, y=Sales, fill=Region))+geom_col()+theme(axis.text.x = element_text(angle=90, hjust=1, vjust=.5))
Above is a bar graph of each state’s sales. California, New York, and Texas are the top three states with the most sales.