The data set I chose for this project consists of European Bike Store Sales from 2011-2016 collected by PrepInsta Technologies. The categorical data collected on each sale includes: -the date the sale occurred -the month the sale occurred in -the day the sale occurred -the year the sale occurred -the Age Group of the customer -the gender of the customer -the country in which the sale took place -the state in which the sale took place -the product category (which is whether the product was a bike, an accessory, or another product) -the product subcategory which give a little more detail about the specifics of the product) -the name of the product sold
The quantitative data recorded for each sale includes: -the age of the customer -the order quantity -the cost require to make the unit sold -the price per unit of the product sold -the cost of the entire sale -the revenue of the entire sale -the profit of the sale (revenue-cost)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ purrr 1.0.1
## ✔ tibble 3.2.1 ✔ dplyr 1.1.1
## ✔ tidyr 1.3.0 ✔ stringr 1.5.0
## ✔ readr 2.1.3 ✔ forcats 1.0.0
## Warning: package 'tibble' was built under R version 4.2.3
## Warning: package 'dplyr' was built under R version 4.2.3
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(stats)
sales <- read_csv("sales.csv")
## Rows: 113036 Columns: 18
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (8): Month, Age_Group, Customer_Gender, Country, State, Product_Categor...
## dbl (9): Day, Year, Customer_Age, Order_Quantity, Unit_Cost, Unit_Price, Pr...
## date (1): Date
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(sales)
## # A tibble: 6 × 18
## Date Day Month Year Custom…¹ Age_G…² Custo…³ Country State Produ…⁴
## <date> <dbl> <chr> <dbl> <dbl> <chr> <chr> <chr> <chr> <chr>
## 1 2013-11-26 26 November 2013 19 Youth … M Canada Brit… Access…
## 2 2015-11-26 26 November 2015 19 Youth … M Canada Brit… Access…
## 3 2014-03-23 23 March 2014 49 Adults… M Austra… New … Access…
## 4 2016-03-23 23 March 2016 49 Adults… M Austra… New … Access…
## 5 2014-05-15 15 May 2014 47 Adults… F Austra… New … Access…
## 6 2016-05-15 15 May 2016 47 Adults… F Austra… New … Access…
## # … with 8 more variables: Sub_Category <chr>, Product <chr>,
## # Order_Quantity <dbl>, Unit_Cost <dbl>, Unit_Price <dbl>, Profit <dbl>,
## # Cost <dbl>, Revenue <dbl>, and abbreviated variable names ¹Customer_Age,
## # ²Age_Group, ³Customer_Gender, ⁴Product_Category
tail(sales)
## # A tibble: 6 × 18
## Date Day Month Year Customer_…¹ Age_G…² Custo…³ Country State Produ…⁴
## <date> <dbl> <chr> <dbl> <dbl> <chr> <chr> <chr> <chr> <chr>
## 1 2014-04-12 12 April 2014 41 Adults… M United… Engl… Clothi…
## 2 2016-04-12 12 April 2016 41 Adults… M United… Engl… Clothi…
## 3 2014-04-02 2 April 2014 18 Youth … M Austra… Quee… Clothi…
## 4 2016-04-02 2 April 2016 18 Youth … M Austra… Quee… Clothi…
## 5 2014-03-04 4 March 2014 37 Adults… F France Sein… Clothi…
## 6 2016-03-04 4 March 2016 37 Adults… F France Sein… Clothi…
## # … with 8 more variables: Sub_Category <chr>, Product <chr>,
## # Order_Quantity <dbl>, Unit_Cost <dbl>, Unit_Price <dbl>, Profit <dbl>,
## # Cost <dbl>, Revenue <dbl>, and abbreviated variable names ¹Customer_Age,
## # ²Age_Group, ³Customer_Gender, ⁴Product_Category
str(sales)
## spc_tbl_ [113,036 × 18] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Date : Date[1:113036], format: "2013-11-26" "2015-11-26" ...
## $ Day : num [1:113036] 26 26 23 23 15 15 22 22 22 22 ...
## $ Month : chr [1:113036] "November" "November" "March" "March" ...
## $ Year : num [1:113036] 2013 2015 2014 2016 2014 ...
## $ Customer_Age : num [1:113036] 19 19 49 49 47 47 47 47 35 35 ...
## $ Age_Group : chr [1:113036] "Youth (<25)" "Youth (<25)" "Adults (35-64)" "Adults (35-64)" ...
## $ Customer_Gender : chr [1:113036] "M" "M" "M" "M" ...
## $ Country : chr [1:113036] "Canada" "Canada" "Australia" "Australia" ...
## $ State : chr [1:113036] "British Columbia" "British Columbia" "New South Wales" "New South Wales" ...
## $ Product_Category: chr [1:113036] "Accessories" "Accessories" "Accessories" "Accessories" ...
## $ Sub_Category : chr [1:113036] "Bike Racks" "Bike Racks" "Bike Racks" "Bike Racks" ...
## $ Product : chr [1:113036] "Hitch Rack - 4-Bike" "Hitch Rack - 4-Bike" "Hitch Rack - 4-Bike" "Hitch Rack - 4-Bike" ...
## $ Order_Quantity : num [1:113036] 8 8 23 20 4 5 4 2 22 21 ...
## $ Unit_Cost : num [1:113036] 45 45 45 45 45 45 45 45 45 45 ...
## $ Unit_Price : num [1:113036] 120 120 120 120 120 120 120 120 120 120 ...
## $ Profit : num [1:113036] 590 590 1366 1188 238 ...
## $ Cost : num [1:113036] 360 360 1035 900 180 ...
## $ Revenue : num [1:113036] 950 950 2401 2088 418 ...
## - attr(*, "spec")=
## .. cols(
## .. Date = col_date(format = ""),
## .. Day = col_double(),
## .. Month = col_character(),
## .. Year = col_double(),
## .. Customer_Age = col_double(),
## .. Age_Group = col_character(),
## .. Customer_Gender = col_character(),
## .. Country = col_character(),
## .. State = col_character(),
## .. Product_Category = col_character(),
## .. Sub_Category = col_character(),
## .. Product = col_character(),
## .. Order_Quantity = col_double(),
## .. Unit_Cost = col_double(),
## .. Unit_Price = col_double(),
## .. Profit = col_double(),
## .. Cost = col_double(),
## .. Revenue = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
sum(is.na(sales))
## [1] 0
summary(sales)
## Date Day Month Year
## Min. :2011-01-01 Min. : 1.00 Length:113036 Min. :2011
## 1st Qu.:2013-12-22 1st Qu.: 8.00 Class :character 1st Qu.:2013
## Median :2014-06-27 Median :16.00 Mode :character Median :2014
## Mean :2014-11-23 Mean :15.67 Mean :2014
## 3rd Qu.:2016-01-09 3rd Qu.:23.00 3rd Qu.:2016
## Max. :2016-07-31 Max. :31.00 Max. :2016
## Customer_Age Age_Group Customer_Gender Country
## Min. :17.00 Length:113036 Length:113036 Length:113036
## 1st Qu.:28.00 Class :character Class :character Class :character
## Median :35.00 Mode :character Mode :character Mode :character
## Mean :35.92
## 3rd Qu.:43.00
## Max. :87.00
## State Product_Category Sub_Category Product
## Length:113036 Length:113036 Length:113036 Length:113036
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## Order_Quantity Unit_Cost Unit_Price Profit
## Min. : 1.0 Min. : 1.0 Min. : 2.0 Min. : -30.0
## 1st Qu.: 2.0 1st Qu.: 2.0 1st Qu.: 5.0 1st Qu.: 29.0
## Median :10.0 Median : 9.0 Median : 24.0 Median : 101.0
## Mean :11.9 Mean : 267.3 Mean : 452.9 Mean : 285.1
## 3rd Qu.:20.0 3rd Qu.: 42.0 3rd Qu.: 70.0 3rd Qu.: 358.0
## Max. :32.0 Max. :2171.0 Max. :3578.0 Max. :15096.0
## Cost Revenue
## Min. : 1.0 Min. : 2.0
## 1st Qu.: 28.0 1st Qu.: 63.0
## Median : 108.0 Median : 223.0
## Mean : 469.3 Mean : 754.4
## 3rd Qu.: 432.0 3rd Qu.: 800.0
## Max. :42978.0 Max. :58074.0
sales_age <- sales %>% group_by(Customer_Age) %>% summarize(mean_revenue = mean(Revenue))
plot(sales_age$Customer_Age, sales_age$mean_revenue, main = "Customer Age vs. Mean Sales Revenue", xlab = "Customer Age", ylab = "Sales Revenue ($)",
col = "steelblue", pch = 16)
correlation <- cor(sales_age$Customer_Age, sales_age$mean_revenue)
cat("Correlation coefficient between Customer Age and Mean Revenue:", correlation, "\n")
## Correlation coefficient between Customer Age and Mean Revenue: -0.595246
linear_reg <- lm(sales_age$mean_revenue ~ sales_age$Customer_Age)
summary(linear_reg)
##
## Call:
## lm(formula = sales_age$mean_revenue ~ sales_age$Customer_Age)
##
## Residuals:
## Min 1Q Median 3Q Max
## -405.84 -109.37 -1.93 134.47 988.11
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1002.300 68.787 14.571 < 2e-16 ***
## sales_age$Customer_Age -7.583 1.241 -6.109 5.49e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 210.8 on 68 degrees of freedom
## Multiple R-squared: 0.3543, Adjusted R-squared: 0.3448
## F-statistic: 37.31 on 1 and 68 DF, p-value: 5.487e-08
product_counts <- sales %>%
group_by(Customer_Gender, Product_Category) %>%
summarize(count = n())
## `summarise()` has grouped output by 'Customer_Gender'. You can override using
## the `.groups` argument.
barplot(product_counts$count, main = "Most Popular Product Categories by Gender",
xlab = "Product Category", ylab = "Count", col = c("lightblue", "salmon"),
beside = TRUE, legend.text = TRUE, args.legend = list(x = "topright", bty = "n"),
names.arg = paste(product_counts$Customer_Gender, product_counts$Product_Category, sep = "\n"))
anova_result1 <- aov(count ~ Product_Category + Customer_Gender, data = product_counts)
summary(anova_result1)
## Df Sum Sq Mean Sq F value Pr(>F)
## Product_Category 2 809796657 404898329 655.296 0.00152 **
## Customer_Gender 1 2145624 2145624 3.473 0.20342
## Residuals 2 1235772 617886
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
sales_location <- sales %>%
select(Country, Revenue)
mean_revenue <- sales_location %>%
group_by(Country) %>%
summarise(mean_revenue = mean(Revenue))
barplot(mean_revenue$mean_revenue,
names.arg = mean_revenue$Country,
main = "Mean Revenue by Country",
xlab = "Country",
ylab = "Mean Revenue",
col = "steelblue",
cex.names = 0.8,
ylim = c(0, max(mean_revenue$mean_revenue) * 1.1))
anova_result2 <- aov(Revenue ~ Country, data = sales_location)
summary(anova_result2)
## Df Sum Sq Mean Sq F value Pr(>F)
## Country 5 1.088e+09 217505918 127.6 <2e-16 ***
## Residuals 113030 1.926e+11 1704183
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
When looking at my scatter plot, correlation, and linear regression for my first research question. It is clear that as the age of the customer increases, the average revenue per purchase decreases. The correlation of -0.595246 as well as a coefficient of the linear regression being -7.583 further supports the conclusion that as the age of the customer increases, the mean revenue is decreasing. There is also clearly an outlier in the data at the age 71, as this point has far higher mean revenue than any other age. When looking at research question 2, the bar graph reveals that accessories are by far the most popular purchase that both male and female customers make. However, men seem to buy slightly more accessories and clothing from the European Bike Store than women do. It also appears that men and women both purchase bikes equally. However, the anova test reveals that there is no significant difference of what type of product is bought by either gender, as the p-value is 0.20342, which is not statistically significant with a significance level of 0.05. The bar graph reveals that the mean revenue in Australia is the highest, meaning that their performance in terms of revenue made form bike sales is the highest, followed by Germany, the United Kingdom, the Untied States, and then Canada. The anova test confirms that the revenues of each country are significantly different due to the p-value showing up as p < 0.001 significance.