The data set

The data set I chose for this project consists of European Bike Store Sales from 2011-2016 collected by PrepInsta Technologies. The categorical data collected on each sale includes: -the date the sale occurred -the month the sale occurred in -the day the sale occurred -the year the sale occurred -the Age Group of the customer -the gender of the customer -the country in which the sale took place -the state in which the sale took place -the product category (which is whether the product was a bike, an accessory, or another product) -the product subcategory which give a little more detail about the specifics of the product) -the name of the product sold

The quantitative data recorded for each sale includes: -the age of the customer -the order quantity -the cost require to make the unit sold -the price per unit of the product sold -the cost of the entire sale -the revenue of the entire sale -the profit of the sale (revenue-cost)

Research Questions

  1. How does customer age impact the sales revenue?
  2. Is there a difference in the frequency in which each gender purchases different product categories?
  3. How does the average sales performance vary across different countries?

Read and Explore the data

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0     ✔ purrr   1.0.1
## ✔ tibble  3.2.1     ✔ dplyr   1.1.1
## ✔ tidyr   1.3.0     ✔ stringr 1.5.0
## ✔ readr   2.1.3     ✔ forcats 1.0.0
## Warning: package 'tibble' was built under R version 4.2.3
## Warning: package 'dplyr' was built under R version 4.2.3
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(stats)
sales <- read_csv("sales.csv")
## Rows: 113036 Columns: 18
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (8): Month, Age_Group, Customer_Gender, Country, State, Product_Categor...
## dbl  (9): Day, Year, Customer_Age, Order_Quantity, Unit_Cost, Unit_Price, Pr...
## date (1): Date
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(sales)
## # A tibble: 6 × 18
##   Date         Day Month     Year Custom…¹ Age_G…² Custo…³ Country State Produ…⁴
##   <date>     <dbl> <chr>    <dbl>    <dbl> <chr>   <chr>   <chr>   <chr> <chr>  
## 1 2013-11-26    26 November  2013       19 Youth … M       Canada  Brit… Access…
## 2 2015-11-26    26 November  2015       19 Youth … M       Canada  Brit… Access…
## 3 2014-03-23    23 March     2014       49 Adults… M       Austra… New … Access…
## 4 2016-03-23    23 March     2016       49 Adults… M       Austra… New … Access…
## 5 2014-05-15    15 May       2014       47 Adults… F       Austra… New … Access…
## 6 2016-05-15    15 May       2016       47 Adults… F       Austra… New … Access…
## # … with 8 more variables: Sub_Category <chr>, Product <chr>,
## #   Order_Quantity <dbl>, Unit_Cost <dbl>, Unit_Price <dbl>, Profit <dbl>,
## #   Cost <dbl>, Revenue <dbl>, and abbreviated variable names ¹​Customer_Age,
## #   ²​Age_Group, ³​Customer_Gender, ⁴​Product_Category
tail(sales)
## # A tibble: 6 × 18
##   Date         Day Month  Year Customer_…¹ Age_G…² Custo…³ Country State Produ…⁴
##   <date>     <dbl> <chr> <dbl>       <dbl> <chr>   <chr>   <chr>   <chr> <chr>  
## 1 2014-04-12    12 April  2014          41 Adults… M       United… Engl… Clothi…
## 2 2016-04-12    12 April  2016          41 Adults… M       United… Engl… Clothi…
## 3 2014-04-02     2 April  2014          18 Youth … M       Austra… Quee… Clothi…
## 4 2016-04-02     2 April  2016          18 Youth … M       Austra… Quee… Clothi…
## 5 2014-03-04     4 March  2014          37 Adults… F       France  Sein… Clothi…
## 6 2016-03-04     4 March  2016          37 Adults… F       France  Sein… Clothi…
## # … with 8 more variables: Sub_Category <chr>, Product <chr>,
## #   Order_Quantity <dbl>, Unit_Cost <dbl>, Unit_Price <dbl>, Profit <dbl>,
## #   Cost <dbl>, Revenue <dbl>, and abbreviated variable names ¹​Customer_Age,
## #   ²​Age_Group, ³​Customer_Gender, ⁴​Product_Category
str(sales)
## spc_tbl_ [113,036 × 18] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Date            : Date[1:113036], format: "2013-11-26" "2015-11-26" ...
##  $ Day             : num [1:113036] 26 26 23 23 15 15 22 22 22 22 ...
##  $ Month           : chr [1:113036] "November" "November" "March" "March" ...
##  $ Year            : num [1:113036] 2013 2015 2014 2016 2014 ...
##  $ Customer_Age    : num [1:113036] 19 19 49 49 47 47 47 47 35 35 ...
##  $ Age_Group       : chr [1:113036] "Youth (<25)" "Youth (<25)" "Adults (35-64)" "Adults (35-64)" ...
##  $ Customer_Gender : chr [1:113036] "M" "M" "M" "M" ...
##  $ Country         : chr [1:113036] "Canada" "Canada" "Australia" "Australia" ...
##  $ State           : chr [1:113036] "British Columbia" "British Columbia" "New South Wales" "New South Wales" ...
##  $ Product_Category: chr [1:113036] "Accessories" "Accessories" "Accessories" "Accessories" ...
##  $ Sub_Category    : chr [1:113036] "Bike Racks" "Bike Racks" "Bike Racks" "Bike Racks" ...
##  $ Product         : chr [1:113036] "Hitch Rack - 4-Bike" "Hitch Rack - 4-Bike" "Hitch Rack - 4-Bike" "Hitch Rack - 4-Bike" ...
##  $ Order_Quantity  : num [1:113036] 8 8 23 20 4 5 4 2 22 21 ...
##  $ Unit_Cost       : num [1:113036] 45 45 45 45 45 45 45 45 45 45 ...
##  $ Unit_Price      : num [1:113036] 120 120 120 120 120 120 120 120 120 120 ...
##  $ Profit          : num [1:113036] 590 590 1366 1188 238 ...
##  $ Cost            : num [1:113036] 360 360 1035 900 180 ...
##  $ Revenue         : num [1:113036] 950 950 2401 2088 418 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Date = col_date(format = ""),
##   ..   Day = col_double(),
##   ..   Month = col_character(),
##   ..   Year = col_double(),
##   ..   Customer_Age = col_double(),
##   ..   Age_Group = col_character(),
##   ..   Customer_Gender = col_character(),
##   ..   Country = col_character(),
##   ..   State = col_character(),
##   ..   Product_Category = col_character(),
##   ..   Sub_Category = col_character(),
##   ..   Product = col_character(),
##   ..   Order_Quantity = col_double(),
##   ..   Unit_Cost = col_double(),
##   ..   Unit_Price = col_double(),
##   ..   Profit = col_double(),
##   ..   Cost = col_double(),
##   ..   Revenue = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

Check for missing values

sum(is.na(sales))
## [1] 0

Exlpore Distributions

summary(sales)
##       Date                 Day           Month                Year     
##  Min.   :2011-01-01   Min.   : 1.00   Length:113036      Min.   :2011  
##  1st Qu.:2013-12-22   1st Qu.: 8.00   Class :character   1st Qu.:2013  
##  Median :2014-06-27   Median :16.00   Mode  :character   Median :2014  
##  Mean   :2014-11-23   Mean   :15.67                      Mean   :2014  
##  3rd Qu.:2016-01-09   3rd Qu.:23.00                      3rd Qu.:2016  
##  Max.   :2016-07-31   Max.   :31.00                      Max.   :2016  
##   Customer_Age    Age_Group         Customer_Gender      Country         
##  Min.   :17.00   Length:113036      Length:113036      Length:113036     
##  1st Qu.:28.00   Class :character   Class :character   Class :character  
##  Median :35.00   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :35.92                                                           
##  3rd Qu.:43.00                                                           
##  Max.   :87.00                                                           
##     State           Product_Category   Sub_Category         Product         
##  Length:113036      Length:113036      Length:113036      Length:113036     
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##  Order_Quantity   Unit_Cost        Unit_Price         Profit       
##  Min.   : 1.0   Min.   :   1.0   Min.   :   2.0   Min.   :  -30.0  
##  1st Qu.: 2.0   1st Qu.:   2.0   1st Qu.:   5.0   1st Qu.:   29.0  
##  Median :10.0   Median :   9.0   Median :  24.0   Median :  101.0  
##  Mean   :11.9   Mean   : 267.3   Mean   : 452.9   Mean   :  285.1  
##  3rd Qu.:20.0   3rd Qu.:  42.0   3rd Qu.:  70.0   3rd Qu.:  358.0  
##  Max.   :32.0   Max.   :2171.0   Max.   :3578.0   Max.   :15096.0  
##       Cost            Revenue       
##  Min.   :    1.0   Min.   :    2.0  
##  1st Qu.:   28.0   1st Qu.:   63.0  
##  Median :  108.0   Median :  223.0  
##  Mean   :  469.3   Mean   :  754.4  
##  3rd Qu.:  432.0   3rd Qu.:  800.0  
##  Max.   :42978.0   Max.   :58074.0

Exlpore Relationships, Outliers, Categorical Variables, Quantitative Varaibles, and Time-Series Data

Research question 1: How does customer age impact the sales revenue?
sales_age <- sales %>% group_by(Customer_Age) %>% summarize(mean_revenue = mean(Revenue))
plot(sales_age$Customer_Age, sales_age$mean_revenue, main = "Customer Age vs. Mean Sales Revenue", xlab = "Customer Age", ylab = "Sales Revenue ($)",
     col = "steelblue", pch = 16)

correlation <- cor(sales_age$Customer_Age, sales_age$mean_revenue)
cat("Correlation coefficient between Customer Age and Mean Revenue:", correlation, "\n")
## Correlation coefficient between Customer Age and Mean Revenue: -0.595246
linear_reg <- lm(sales_age$mean_revenue ~ sales_age$Customer_Age)
summary(linear_reg)
## 
## Call:
## lm(formula = sales_age$mean_revenue ~ sales_age$Customer_Age)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -405.84 -109.37   -1.93  134.47  988.11 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            1002.300     68.787  14.571  < 2e-16 ***
## sales_age$Customer_Age   -7.583      1.241  -6.109 5.49e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 210.8 on 68 degrees of freedom
## Multiple R-squared:  0.3543, Adjusted R-squared:  0.3448 
## F-statistic: 37.31 on 1 and 68 DF,  p-value: 5.487e-08
Reasearch Question 2: Is there a difference in the frequency in which each gender purchases different product categories?
product_counts <- sales %>% 
  group_by(Customer_Gender, Product_Category) %>%
  summarize(count = n())
## `summarise()` has grouped output by 'Customer_Gender'. You can override using
## the `.groups` argument.
barplot(product_counts$count, main = "Most Popular Product Categories by Gender", 
        xlab = "Product Category", ylab = "Count", col = c("lightblue", "salmon"),
        beside = TRUE, legend.text = TRUE, args.legend = list(x = "topright", bty = "n"),
        names.arg = paste(product_counts$Customer_Gender, product_counts$Product_Category, sep = "\n"))

anova_result1 <- aov(count ~ Product_Category + Customer_Gender, data = product_counts)
summary(anova_result1)
##                  Df    Sum Sq   Mean Sq F value  Pr(>F)   
## Product_Category  2 809796657 404898329 655.296 0.00152 **
## Customer_Gender   1   2145624   2145624   3.473 0.20342   
## Residuals         2   1235772    617886                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Research Question 3: How does average the sales performance vary across different countries or states?
sales_location <- sales %>%
  select(Country, Revenue)
mean_revenue <- sales_location %>%
  group_by(Country) %>%
  summarise(mean_revenue = mean(Revenue))
barplot(mean_revenue$mean_revenue, 
        names.arg = mean_revenue$Country,
        main = "Mean Revenue by Country",
        xlab = "Country",
        ylab = "Mean Revenue",
        col = "steelblue",             
        cex.names = 0.8,               
        ylim = c(0, max(mean_revenue$mean_revenue) * 1.1))

anova_result2 <- aov(Revenue ~ Country, data = sales_location)
summary(anova_result2)
##                 Df    Sum Sq   Mean Sq F value Pr(>F)    
## Country          5 1.088e+09 217505918   127.6 <2e-16 ***
## Residuals   113030 1.926e+11   1704183                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

EDA and statistical analysis summary

When looking at my scatter plot, correlation, and linear regression for my first research question. It is clear that as the age of the customer increases, the average revenue per purchase decreases. The correlation of -0.595246 as well as a coefficient of the linear regression being -7.583 further supports the conclusion that as the age of the customer increases, the mean revenue is decreasing. There is also clearly an outlier in the data at the age 71, as this point has far higher mean revenue than any other age. When looking at research question 2, the bar graph reveals that accessories are by far the most popular purchase that both male and female customers make. However, men seem to buy slightly more accessories and clothing from the European Bike Store than women do. It also appears that men and women both purchase bikes equally. However, the anova test reveals that there is no significant difference of what type of product is bought by either gender, as the p-value is 0.20342, which is not statistically significant with a significance level of 0.05. The bar graph reveals that the mean revenue in Australia is the highest, meaning that their performance in terms of revenue made form bike sales is the highest, followed by Germany, the United Kingdom, the Untied States, and then Canada. The anova test confirms that the revenues of each country are significantly different due to the p-value showing up as p < 0.001 significance.