Rajesh Kumar CUNY R Bridge Final Project

CUNY R Bridge Final Project Assignment

My Data Set: Retail Marketing Sales Profit Loss Data Set chosen from IBM’s Watson Analytics Community https://www.ibm.com/communities/analytics/watson-analytics-blog/retail-sales-marketing-profit-cost/

This Data contains 4 years of Revenue(Planned Vs. Actual), Cost and Profit data for an unknown fictitious retailer. I choose this data because on initial look it looked interesting.

  • The retailer seems to have good cross section of sales from various sales channels
  • The retailer seems to have good cross section of product hierarchy
  • The retailer seems to have good cross section of international sales
  • The data has approx. 85K observations and 14 variables

Data Issues: There are some issues with data as well

  • The data seems to have several NA values, which would need to be handled appropriately
  • Later I also notinced, the Product price seems to have a very large spread starting at around $2 to approx. $1400
  • The pricing and revenue data seemed to have very wide spreads which made it a bit challenging to plot

My Hypothesis for Analysis:

Since there is no summary available on this data, I feel this situation is similar to real life for many companies, who have little analysis in terms of their sales data. Most companies use basic tools for reporting but very few look at data cross sections to understand it better. I would like to conduct the analysis to help bring some insights into the functioning of this company.

These insights might help this company improve its sales and Profitability. Lets go!

Step 1. Data Exploration: This should include summary statistics, means, medians, quartiles, or any other relevant information about the data set. Please include some conclusions in the R Markdown text.

# BONUS: Fetch the data from github... 
library(RCurl)
## Loading required package: bitops
fileURL <- "https://raw.githubusercontent.com/rajk11040/RBridge/master/RetailSalesMarketingProfitCost.csv"
marketingDF <- read.csv(text = getURL(fileURL), header = TRUE, sep = ",")

#Code below for quick reading via local drive.. :)
#myFilePath = file.path("C:", "Users", "User", "Documents", "CUNY Bridge Program","R","RetailSalesMarketingProfitCost.csv" )
#marketingDF <- read.csv(file=myFilePath, header = TRUE, sep = ",")

#total observations
nrow(marketingDF)
## [1] 84672
#total columns
ncol(marketingDF)
## [1] 14
#understanding the structure of the dataset
str(marketingDF)
## 'data.frame':    84672 obs. of  14 variables:
##  $ Year             : int  2004 2004 2004 2004 2004 2004 2004 2004 2004 2004 ...
##  $ Product.line     : Factor w/ 5 levels "Camping Equipment",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Product.type     : Factor w/ 21 levels "Binoculars","Climbing Accessories",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ Product          : Factor w/ 144 levels "Aloe Relief",..: 139 139 139 139 139 139 139 139 139 139 ...
##  $ Order.method.type: Factor w/ 7 levels "E-mail","Fax",..: 6 6 6 6 6 6 6 6 6 6 ...
##  $ Retailer.country : Factor w/ 21 levels "Australia","Austria",..: 21 5 14 4 12 13 6 16 1 15 ...
##  $ Revenue          : num  315044 13445 NA NA 181120 ...
##  $ Planned.revenue  : num  437477 14313 NA NA 235237 ...
##  $ Product.cost     : num  158372 6299 NA NA 89413 ...
##  $ Quantity         : int  66385 2172 NA NA 35696 NA 15205 7833 NA 14328 ...
##  $ Unit.cost        : num  2.55 2.9 NA NA 2.66 ...
##  $ Unit.price       : num  6.59 6.59 NA NA 6.59 NA 6.59 6.59 NA 6.59 ...
##  $ Gross.profit     : num  156673 7146 NA NA 91707 ...
##  $ Unit.sale.price  : num  5.2 6.19 NA NA 5.49 ...
#summary of the data set
summary(marketingDF)
##       Year                        Product.line  
##  Min.   :2004   Camping Equipment       :24108  
##  1st Qu.:2005   Golf Equipment          : 8820  
##  Median :2006   Mountaineering Equipment:12348  
##  Mean   :2006   Outdoor Protection      : 8820  
##  3rd Qu.:2006   Personal Accessories    :30576  
##  Max.   :2007                                   
##                                                 
##                Product.type                 Product     
##  Eyewear             : 9408   Aloe Relief       :  588  
##  Watches             : 7644   Astro Pilot       :  588  
##  Lanterns            : 7056   Auto Pilot        :  588  
##  Cooking Gear        : 5880   Bear Edge         :  588  
##  Navigation          : 5880   Bear Survival Edge:  588  
##  Climbing Accessories: 4116   Bella             :  588  
##  (Other)             :44688   (Other)           :81144  
##    Order.method.type  Retailer.country    Revenue        
##  E-mail     :12096   Australia: 4032   Min.   :       0  
##  Fax        :12096   Austria  : 4032   1st Qu.:   18579  
##  Mail       :12096   Belgium  : 4032   Median :   59867  
##  Sales visit:12096   Brazil   : 4032   Mean   :  189418  
##  Special    :12096   Canada   : 4032   3rd Qu.:  190193  
##  Telephone  :12096   China    : 4032   Max.   :10054289  
##  Web        :12096   (Other)  :60480   NA's   :59929     
##  Planned.revenue     Product.cost        Quantity        Unit.cost     
##  Min.   :      16   Min.   :      6   Min.   :     1   Min.   :  0.85  
##  1st Qu.:   19557   1st Qu.:   9432   1st Qu.:   328   1st Qu.: 11.43  
##  Median :   63907   Median :  32784   Median :  1043   Median : 36.83  
##  Mean   :  198818   Mean   : 111625   Mean   :  3607   Mean   : 84.89  
##  3rd Qu.:  203996   3rd Qu.: 111371   3rd Qu.:  3288   3rd Qu.: 80.00  
##  Max.   :10054289   Max.   :6756853   Max.   :313628   Max.   :690.00  
##  NA's   :59929      NA's   :59929     NA's   :59929    NA's   :59929   
##    Unit.price       Gross.profit     Unit.sale.price  
##  Min.   :   2.06   Min.   : -18160   Min.   :   0.00  
##  1st Qu.:  23.00   1st Qu.:   8333   1st Qu.:  20.15  
##  Median :  66.77   Median :  25794   Median :  62.65  
##  Mean   : 155.99   Mean   :  77793   Mean   : 147.23  
##  3rd Qu.: 148.30   3rd Qu.:  78254   3rd Qu.: 140.96  
##  Max.   :1359.72   Max.   :3521098   Max.   :1307.80  
##  NA's   :59929     NA's   :59929     NA's   :59929
# print basic info to make sure data shows correctly

head(marketingDF,3)
##   Year      Product.line Product.type             Product
## 1 2004 Camping Equipment Cooking Gear TrailChef Water Bag
## 2 2004 Camping Equipment Cooking Gear TrailChef Water Bag
## 3 2004 Camping Equipment Cooking Gear TrailChef Water Bag
##   Order.method.type Retailer.country   Revenue Planned.revenue
## 1         Telephone    United States 315044.33       437477.15
## 2         Telephone           Canada  13444.68        14313.48
## 3         Telephone           Mexico        NA              NA
##   Product.cost Quantity Unit.cost Unit.price Gross.profit Unit.sale.price
## 1     158371.8    66385  2.552857       6.59    156672.57        5.195714
## 2       6298.8     2172  2.900000       6.59      7145.88        6.190000
## 3           NA       NA        NA         NA           NA              NA

So from the above we see that that data has 84672 observations of 14 variables - Year: Year of Observation - Product.line: 1st level of product hierarchy - Product.type: 2nd level of product hierarchy - Product: Product under the 2nd level hierarchy - Order.method.type: Order method like email, phone etc - Retailer.country: Country - Revenue: Actual revenue generated - Planned.revenue: Planned revenue - Product.cost: Total cost of the item - Quantity: Quntity of product sold - Unit.cost: Unit cost of each item - Unit.price: Unit Price of the each item - Gross.profit: Gross Profit for this item - Unit.sale.price: Unit sale price for each item

2. Data wrangling: Please perform some basic transformations. They will need to make sense but could include column renaming, creating a subset of the data, replacing values, or creating new columns with derived data (for example - if it makes sense you could sum two columns together)

# Data cleanup
# Remove any rows that do not contain proper data
marketingDF <- marketingDF[complete.cases(marketingDF),]

#mean
mean(marketingDF$Revenue)
## [1] 189418.3
mean(marketingDF$Gross.profit)
## [1] 77793.11
#median
median(marketingDF$Revenue)
## [1] 59867.27
median(marketingDF$Gross.profit)
## [1] 25793.76
#install.packages("dplyr")
#dplyr will be used for data manipulation
library("dplyr")
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
#Simplify the column names in the data frame using dplyr
marketingDF <- rename(marketingDF, 
                      Order.method = Order.method.type,
                      Country = Retailer.country)
# Arrange the products by their respective profits
profitbyproduct <- aggregate(Gross.profit~Product, marketingDF, mean)
# Sort products by highest first
profitbyproduct <- arrange(profitbyproduct, desc(Gross.profit))
# View the most profitable products in last 4 years mean
head(profitbyproduct)
##    Product Gross.profit
## 1     Zone     414862.6
## 2       TX     403548.7
## 3  Maximus     390558.5
## 4 Infinity     373489.7
## 5    Retro     369931.6
## 6  Inferno     359893.4
#another way to aggregate Based on year and Product
profitbyproduct <- aggregate( Gross.profit ~ Year + Product , marketingDF , mean )
# Sort products by highest first
profitbyproduct <- arrange(profitbyproduct, desc(Year+Gross.profit))
# View the most profitable products in last 4 years mean
head(profitbyproduct)
##   Year  Product Gross.profit
## 1 2006 Infinity     720290.1
## 2 2006  Maximus     660768.2
## 3 2006     Zone     650529.2
## 4 2006  Inferno     624382.2
## 5 2006       TX     612577.8
## 6 2007  Inferno     602423.7
#profitbyproduct[which(profitbyproduct$Year==c(2006,2007)), ]

#Dump this dataframe into a local csv file to double check the results 
#writeFilePath = file.path("C:", "Users", "User", "Documents", "CUNY Bridge Program","R","RetailSalesMarketingProfitCostClean.csv" )
#write.csv(file=writeFilePath, x=marketingDF)


#yearlySummary = group_by(marketingDF, Year)
#summarize(yearlySummary, sum(Revenue), sum(Planned.revenue), sum(Product.cost), sum(Gross.profit))

# More Data Wrangling so the we can look at summarization of these values
yearGroupDF = group_by(marketingDF, Year)
#summarize these values by the years
yearlySummaryDF=summarize(yearGroupDF, sum(Revenue), sum(Planned.revenue), sum(Product.cost), sum(Gross.profit))
yearlySummaryDF <- rename(yearlySummaryDF, 
       Revenue = "sum(Revenue)", 
       planned.revenue = "sum(Planned.revenue)",
       Product.cost = "sum(Product.cost)",
       Gross.profit = "sum(Gross.profit)")
yearlySummaryDF
## # A tibble: 4 x 5
##    Year    Revenue planned.revenue Product.cost Gross.profit
##   <int>      <dbl>           <dbl>        <dbl>        <dbl>
## 1  2004  914352804       968475138    546734895    367617909
## 2  2005 1159195590      1218177467    690511899    468683692
## 3  2006 1495891101      1560738880    872630528    623260572
## 4  2007 1117336274      1171950878    652063452    465272822

3. Graphics: Please make sure to display at least one scatter plot, box plot and histogram. Don’t be limited to this. Please explore the many other options in R packages such as ggplot2.

We draw a simple Scatterplot to understand how the revenue and the profit did Notice that this did not provide us a good understanding as there is too much overlap of dots to understand the data

#GGPlot2
#install.packages("ggplot2")
library(ggplot2)


# PLOT1:Scatter Plot: As overall revenue has risen, how did the profit for the company do ?
ggplot(marketingDF, aes(Revenue, Gross.profit, color=Year)) + 
    geom_point() + 
    theme_bw()+
    labs(title="Profit Vs. Revenue by Year")+ 
    theme(plot.title = element_text(hjust = 0.5))

# PLOT2:Scatter Plot: Which category contributed how muc to revenue and profits
g2 <- ggplot(marketingDF, aes(Revenue, Gross.profit, color=Year)) + 
    geom_point() + 
    theme_bw()+
    labs(title="Profit Vs. Revenue By Product Line")+
    theme(plot.title = element_text(hjust = 0.5))
g2 <- g2 + facet_wrap(~marketingDF$Product.line)
g2

This plot gives us the understanding the clearly Product Line - “Personal Accessories” is the biggest line for this company

Keeping up the subject of Product line

# Plot3: This bar shows all the product categories and their prevelance in each country
ggplot(marketingDF, aes(Country, fill=Product.line)) + 
    geom_bar() + 
    theme_bw()+ 
    labs(title="Product Line by Country")+ 
    theme(plot.title = element_text(hjust = 0.5))

This graph shows how many sales were recorded for each product line by the country

Lets now visualize how the Product line and Profit Margins for these lines worked across the countries

#Plot4: SCatterPlot 
ggplot(marketingDF, aes(x=Country, y=Gross.profit, shape=Product.line, color=Product.line)) + 
    geom_point() + 
    theme_bw()+
    labs(title="Profit Vs. Revenue By Product Line")+
    theme(plot.title = element_text(hjust = 0.5))

As visibile in this graph that the most important Countries were 1. USA 2. Japan 3. UK, China and Canada coming close 3rd Its again also clear the in most of the countries Product Line “Personal Accessories” seems to do the best

Now lets draw a Freq Polygon to see how the unit price of the items exist in the data set

# Plot5: Frequency Polygon
# Shows how the Unit Cost of the item varies 
ggplot(marketingDF, aes(x=Unit.price)) + 
    geom_freqpoly(binwidth=50, col="blue") +
    theme_bw() +
    labs(title="Unit Prices for Products")+ 
    theme(plot.title = element_text(hjust = 0.5))

As seen from the graph, The Products have a wide range of unit prices. While most products are priced under $250. There are items in the dataset that are price close to $1400.

Lets also evaluate how sales order types (emails, web, phone etc) impacted these categories

# Plot 6: This bar shows how most of the customers ordered products by country
# Clearly Web was the most popular means of ordering followed by Sales Visits and then 
ggplot(marketingDF, aes(Country, fill=Order.method)) + 
    geom_bar() + 
    theme_bw() +
    labs(title="Sales Channels (Order Type) by Country")+ 
    theme(plot.title = element_text(hjust = 0.5))

Lets plot the revenue by year using a bar plot to understand how the revenue was across the various year

# PLOT 7: Box Plots
# Understand how a continous variable changes in respect to a categorical variable
# install.packages("gridExtra")
require(gridExtra)
## Loading required package: gridExtra
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
RevByYear <- ggplot(marketingDF, aes(x=factor(Year), weights=Revenue)) + 
    geom_bar(col="green") + 
    theme_bw() + 
    labs(title="Revenue by Year", x="Year", y="Dollar Amount") + 
    theme(plot.title = element_text(hjust = 0.5))
RevByYear

#PLOT 8:
RevByYearByPL <- ggplot(marketingDF, aes(x=factor(Year), weights=Gross.profit)) + 
    geom_bar() + 
    theme_bw() + 
    facet_grid(Product.line ~ .) + 
    labs(title="Revenue by Product Lines", x="Year", y="Dollar Amount") + 
    theme(plot.title = element_text(hjust = 0.5))
RevByYearByPL

# Use Grid to add both graphs together in the same grid
#PLOT 9:
grid.arrange(RevByYear, RevByYearByPL, ncol=2) 

Wow!! The above graph gives us a very clean breakdown of how Revenue performed Year over year for 4 years. The second chart also gives us facets of the same by Product line.

Lets now Grapth below a comparison of how the key financial figures - Actual Revenue - Planned Revenue - Total Product Cost - Gross Profit behaved year over year. This will give us understanding if there is any correlation between these numbers

# GGPlot for merging Major financial values by year on single frame 
library(reshape2)
yearGroupDF = group_by(marketingDF, Year)
yearlySummaryDF=summarize(yearGroupDF, sum(Revenue), sum(Planned.revenue), sum(Product.cost), sum(Gross.profit))
yearlySummaryDF <- rename(yearlySummaryDF, 
       Revenue = "sum(Revenue)", 
       Planned.revenue = "sum(Planned.revenue)",
       Product.cost = "sum(Product.cost)",
       Gross.profit = "sum(Gross.profit)")
yearlySummaryDF
## # A tibble: 4 x 5
##    Year    Revenue Planned.revenue Product.cost Gross.profit
##   <int>      <dbl>           <dbl>        <dbl>        <dbl>
## 1  2004  914352804       968475138    546734895    367617909
## 2  2005 1159195590      1218177467    690511899    468683692
## 3  2006 1495891101      1560738880    872630528    623260572
## 4  2007 1117336274      1171950878    652063452    465272822
tempDF <- melt(yearlySummaryDF, id.vars="Year")
# Everything on the same plot
# Plot 10: 
  ggplot(tempDF, aes(Year,value, col=variable)) + 
  geom_point() + 
  stat_smooth() +
      theme_bw() +
      labs(title="Key Financials by Year",y="Gross Profit")
## `geom_smooth()` using method = 'loess'
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : span too small. fewer data values than degrees of freedom.
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : pseudoinverse used at 2004
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : neighborhood radius 2.015
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : reciprocal condition number 0
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : There are other near singularities as well. 4.0602
## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : span too small.
## fewer data values than degrees of freedom.
## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : pseudoinverse used
## at 2004
## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : neighborhood radius
## 2.015
## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : reciprocal
## condition number 0
## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : There are other
## near singularities as well. 4.0602
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : span too small. fewer data values than degrees of freedom.
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : pseudoinverse used at 2004
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : neighborhood radius 2.015
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : reciprocal condition number 0
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : There are other near singularities as well. 4.0602
## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : span too small.
## fewer data values than degrees of freedom.
## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : pseudoinverse used
## at 2004
## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : neighborhood radius
## 2.015
## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : reciprocal
## condition number 0
## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : There are other
## near singularities as well. 4.0602
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : span too small. fewer data values than degrees of freedom.
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : pseudoinverse used at 2004
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : neighborhood radius 2.015
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : reciprocal condition number 0
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : There are other near singularities as well. 4.0602
## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : span too small.
## fewer data values than degrees of freedom.
## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : pseudoinverse used
## at 2004
## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : neighborhood radius
## 2.015
## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : reciprocal
## condition number 0
## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : There are other
## near singularities as well. 4.0602
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : span too small. fewer data values than degrees of freedom.
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : pseudoinverse used at 2004
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : neighborhood radius 2.015
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : reciprocal condition number 0
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : There are other near singularities as well. 4.0602
## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : span too small.
## fewer data values than degrees of freedom.
## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : pseudoinverse used
## at 2004
## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : neighborhood radius
## 2.015
## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : reciprocal
## condition number 0
## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : There are other
## near singularities as well. 4.0602

Since these all tracked very close to each other and there were dips in each one of these in 2007, clearly there were other factors in the sales that were responsible for the drop in profit for the company

Lets look at Product Price Level Comparison

# Plot 11: Unit Sale price of items 
# understand diversity of the product assortment in this data
# this shows the item prices range from close to 0 to around $1400
ggplot(marketingDF, aes(Unit.price)) + 
    geom_histogram(binwidth=10, fill="palegreen4", col="green") + 
    theme_bw()+
    labs(title="Unit Sale Prices by Products")+ 
    theme(plot.title = element_text(hjust = 0.5))

Lets now see how the top 10 items contributed to the profit

#another way to aggregate Based on year and Product
profitbyproductDF <- aggregate( Gross.profit ~ Year + Product , marketingDF , mean )
profitbyproductDF <- arrange(profitbyproductDF, desc(Year+Gross.profit))

head(profitbyproductDF)
##   Year  Product Gross.profit
## 1 2006 Infinity     720290.1
## 2 2006  Maximus     660768.2
## 3 2006     Zone     650529.2
## 4 2006  Inferno     624382.2
## 5 2006       TX     612577.8
## 6 2007  Inferno     602423.7
# Boxplot grouping top 10 items per year for easier analysis
top2004items <- profitbyproductDF[which(profitbyproductDF$Year == 2004),]
top2004items <- top2004items[1:10,]
top2005items <- profitbyproductDF[which(profitbyproductDF$Year == 2005),]
top2005items <- top2005items[1:10,]
top2006items <- profitbyproductDF[which(profitbyproductDF$Year == 2006),]
top2006items <- top2006items[1:10,]
top2007items <- profitbyproductDF[which(profitbyproductDF$Year == 2007),]
top2007items <- top2007items[1:10,]


topitemsDF <- union(top2004items, union(top2005items, union(top2006items, top2007items)))

nrow(topitemsDF)
## [1] 40
topitemsDF <- mutate(topitemsDF, Year=as.factor(Year))

#Plot 12: Box Plot for top items across the years
ggplot(topitemsDF, aes(Product, Gross.profit)) + 
  geom_boxplot() + 
  theme_bw() +
    labs(title="Top Items vs. Profit")+ 
    theme(plot.title = element_text(hjust = 0.5))

#Top 10 item by Gross Profit
topitemsDF <- group_by(topitemsDF, Product)
topitemsDF <- summarize(topitemsDF, sum(Gross.profit))
topitemsDF <- rename(topitemsDF, Gross.profit = "sum(Gross.profit)") 
topitemsDF <- arrange(topitemsDF, desc(Gross.profit))
topitemsDF
## # A tibble: 15 x 2
##    Product                      Gross.profit
##    <fctr>                              <dbl>
##  1 Zone                              1788581
##  2 Maximus                           1751140
##  3 TX                                1698563
##  4 Infinity                          1595949
##  5 Inferno                           1516148
##  6 Hailstorm Titanium Woods Set      1296648
##  7 Retro                             1009365
##  8 Legend                            1003858
##  9 Cat Eye                            758239
## 10 Hawk Eye                           710121
## 11 Venue                              477406
## 12 Hailstorm Titanium Irons           460765
## 13 Dante                              420196
## 14 Fairway                            363215
## 15 Star Lite                          149333

4. Meaningful question for analysis: Please state at the beginning a meaningful question for analysis. Use the first three steps and anything else that would be helpful to answer the question you are posing from the data set you chose. Please write a brief conclusion paragraph in R markdown at the end. 5. BONUS - place the original .csv in a github file and have R read from the link. This will be a very useful skill as you progress in your data science education and career.

Product Data Level Comparison

We need to do analysis of the all the top items from year 2006 and 2007 to find out, if there is significant anamoly in the products which could point to reason for lower revenue and profits in 2007.

library(sqldf)
## Loading required package: gsubfn
## Loading required package: proto
## Loading required package: RSQLite
#This function compares the dataframe of products and identifies missing products 
compareYears <- function(a1, a2){
    a1NotIna2 <- sqldf('SELECT Product FROM a1 EXCEPT SELECT Product FROM a2')
    missinga2item <- as.vector(a1NotIna2$Product)
    a1extra <- a1[which(a1[ ,2] %in% missinga2item),]
    
    a2NotIna1 <- sqldf('SELECT Product FROM a2 EXCEPT SELECT Product FROM a1')
    missinga1item <- as.vector(a2NotIna1$Product)
    a2extra <- a2[which(a2[ ,2] %in% missinga1item),]

    if(sum(a1extra$Gross.profit) > sum(a2extra$Gross.profit)){
        print("2006 more profitable for top 10 items by")
        return (sum(a1extra$Gross.profit) - sum(a2extra$Gross.profit))
    }else{
        print("2007 more profitable for top 10 items by ")
        return (sum(a2extra$Gross.profit) - sum(a1extra$Gross.profit))
        
    }
}


compareYears(top2006items, top2007items)
## [1] "2006 more profitable for top 10 items by"
## [1] 83370.71

This comparison points to some reasons why the revenue declined in 2007. Many of the items that did not sell well in 2007 were high value items with greater profit margin. Just the top 10 items comparison show the loss of profit of around $83370.

4. Conclusions

The following conclusions could be generated from the above analysis - Revenue Drop in year 2007: The above company experienced drop in revenue as well as drop in profit in the last year of observation. Based on above, the sales revenue (planned and actual), seemed to dropped across all categories evenly (Plot 10). As the planned revenue also dropped, it is expected that the company was aware of this anticipated drop, due to reason unknown for this dataset.

The following reasons can be still infered from the dataset 1. Based on Plot 8 and 9: The Sales of Product category “Outdoor Protection” seem to decline significantly to near zero in 2007 2. Based on Plot 4: The sales of some higher value Products seem to decline from year 2006 to 2007. These brought the overall revenue and profit down 3. - Highest Profitable Product Categories were identfied as Product Categories were - Personal Accessories - Camping Equipment

Further Analysis: Additonal profitability analysis could also be done on the products by the sale channel (order type) to create recommendation for the sales team on which products to suggest more to customers.