PaulinaSuvorovMVAHW1

library(readxl)
mydata <- read_excel("C:/Users/pauli/Downloads/Adidas US Sales Datasets.xlsx")
head(mydata)

## # A tibble: 6 × 13
##   Retai…¹ Retai…² `Invoice Date`      Region State City  Product Price…³ Units…⁴
##   <chr>     <dbl> <dttm>              <chr>  <chr> <chr> <chr>     <dbl>   <dbl>
## 1 Foot L… 1185732 2020-01-01 00:00:00 North… New … New … Men's …      50    1200
## 2 Foot L… 1185732 2020-01-02 00:00:00 North… New … New … Men's …      50    1000
## 3 Foot L… 1185732 2020-01-03 00:00:00 North… New … New … Women'…      40    1000
## 4 Foot L… 1185732 2020-01-04 00:00:00 North… New … New … Women'…      45     850
## 5 Foot L… 1185732 2020-01-05 00:00:00 North… New … New … Men's …      60     900
## 6 Foot L… 1185732 2020-01-06 00:00:00 North… New … New … Women'…      50    1000
## # … with 4 more variables: `Total Sales` <dbl>, `Operating Profit` <dbl>,
## #   `Operating Margin` <dbl>, `Sales Method` <chr>, and abbreviated variable
## #   names ¹Retailer, ²`Retailer ID`, ³`Price per Unit`, ⁴`Units Sold`

mydata1 <- mydata[,c(-2, -3, -4, -6, -12)]
head(mydata1)

## # A tibble: 6 × 8
##   Retailer    State    Product           Price…¹ Units…² Total…³ Opera…⁴ Sales…⁵
##   <chr>       <chr>    <chr>               <dbl>   <dbl>   <dbl>   <dbl> <chr>  
## 1 Foot Locker New York Men's Street Foo…      50    1200   60000  300000 In-sto…
## 2 Foot Locker New York Men's Athletic F…      50    1000   50000  150000 In-sto…
## 3 Foot Locker New York Women's Street F…      40    1000   40000  140000 In-sto…
## 4 Foot Locker New York Women's Athletic…      45     850   38250  133875 In-sto…
## 5 Foot Locker New York Men's Apparel          60     900   54000  162000 In-sto…
## 6 Foot Locker New York Women's Apparel        50    1000   50000  125000 In-sto…
## # … with abbreviated variable names ¹`Price per Unit`, ²`Units Sold`,
## #   ³`Total Sales`, ⁴`Operating Profit`, ⁵`Sales Method`

The initial data set consists of 9648 observations of 13 variables. For further research the data set is trimmed down to 8 variables (4 of them are characteristic and 4 of them are numeric)

The unit of observation: Product line of Adidas by retailer in specific state in the US

The variables are the following:

-Retailer: name of a retailer -State: name of the state in the US -Product: type of Adidas merchandise -Price per unit: the price of a merchandise in US Dollars -Units sold: quantity of Adidas merchandise -Total sales: total sales in US Dollars -Operating profit: operating profit in US Dollars -Sales method: sales channel

The data set was found on the Kaggle website, the author is Heemali Chaudhari. Retrieved January 5th 2023, from https://www.kaggle.com/datasets/heemalichaudhari/adidas-sales-dataset .

The purpose of analyzing Adidas sales data is to identify the top-performing states, retailers, products, and sales methods in the US.Additionally, we will use regression analysis to examine the relationship between the price per unit and the quantity of units sold on total sales.

RQ: What are the key drivers of Adidas sales?

Let’s begin with some descriptive statistics.

summary(mydata1)

##    Retailer            State             Product          Price per Unit  
##  Length:9648        Length:9648        Length:9648        Min.   :  7.00  
##  Class :character   Class :character   Class :character   1st Qu.: 35.00  
##  Mode  :character   Mode  :character   Mode  :character   Median : 45.00  
##                                                           Mean   : 45.22  
##                                                           3rd Qu.: 55.00  
##                                                           Max.   :110.00  
##    Units Sold      Total Sales    Operating Profit Sales Method      
##  Min.   :   0.0   Min.   :    0   Min.   :     0   Length:9648       
##  1st Qu.: 106.0   1st Qu.: 4065   1st Qu.:  1922   Class :character  
##  Median : 176.0   Median : 7804   Median :  4371   Mode  :character  
##  Mean   : 256.9   Mean   :12455   Mean   : 34425                     
##  3rd Qu.: 350.0   3rd Qu.:15864   3rd Qu.: 52063                     
##  Max.   :1275.0   Max.   :82500   Max.   :390000

library(psych)

## Warning: package 'psych' was built under R version 4.2.2

describe(mydata1[, c(-1, -2, -3, -8)])

##                  vars    n     mean       sd  median  trimmed     mad min
## Price per Unit      1 9648    45.22    14.71   45.00    44.83   14.83   7
## Units Sold          2 9648   256.93   214.25  176.00   221.49  134.92   0
## Total Sales         3 9648 12455.08 12716.39 7803.50  9958.75 6802.91   0
## Operating Profit    4 9648 34425.24 54193.11 4371.42 22270.22 5086.02   0
##                     max  range skew kurtosis     se
## Price per Unit      110    103 0.36     0.43   0.15
## Units Sold         1275   1275 1.46     1.70   2.18
## Total Sales       82500  82500 1.96     4.12 129.46
## Operating Profit 390000 390000 2.23     5.81 551.73

PRICE PER UNIT The average price per unit was 45.22, the range of prices was from 7 to 110 US Dollars.
UNITS SOLD The minimum units sold in US shops were 0 and the maximum was 1275. The 1st quantile is 106, which means that 25 percent of units of products sold is 106 or less and 75 percent of quantity of units is above 106.
TOTAL SALES The minimum value of total sales was 0 US Dollars and the maximum was 82500 US dollars, the mean was 12455.08 US dollars. The median of total sales was 7803.50, meaning half of the retailers in the specific state have total sales of 7803.5 US dollars or less and half of them have more than 7803.5 US dollars.
OPERATING PROFIT The range of operating profits was from 0 to 390000 US dollars, with an average of 34425.24 US dollars. The median operating profit was 4371.42 US dollars, indicating that half of the retailers in the state have profits at or below this amount and half have profits above it.

hist(mydata$`Price per Unit`, xlab = "Price of product", ylab = "Frequency", main = "Product's price distribution", breaks = seq(10, 50, 1000), 
     right = FALSE)

mean(mydata1$`Price per Unit`)

## [1] 45.21663

First, to look at the total sales we should check the distribution of price of product line. The average price of product is 45.22 as already mentioned before. By observing the graph above, we can see that the selling price has normal distribution, which is not skewed to the right or left.

mydata2 <- aggregate(mydata1$`Total Sales` ~ State, mydata1, sum)
colnames(mydata2)<-c("State","Total Sales")
state_with_highest_sales <- mydata2[order(-mydata2$`Total Sales`) , ]
head(state_with_highest_sales)

##             State Total Sales
## 32       New York     8670464
## 5      California     8580508
## 9         Florida     7820589
## 43          Texas     6612371
## 40 South Carolina     3593112
## 18      Louisiana     3377031

Next, we will examine the states with the highest revenue of Adidas merchandise. The top three states by total sales were New York, California, and Florida, in that order.

mydata3 <- aggregate(mydata1$`Total Sales` ~ mydata1$`Sales Method`, mydata1, sum)
colnames(mydata3)<-c("Sales method","Total Sales")
sales_method_with_the_highest_total_sales <- mydata3[order(-mydata3$`Total Sales`) , ]
head(sales_method_with_the_highest_total_sales)

##   Sales method Total Sales
## 2       Online    44965657
## 3       Outlet    39536618
## 1     In-store    35664375

Subsequently, we will analyze the total sales of Adidas merchandise by sales method. Online sales generated the most revenue, followed by outlet sales and in-store sales. The following two graphics show the percentage of total sales generated by each sales method.

sales_method <- c(35664375, 44965657, 39536618)
labels <- c("In-store", "Online", "Outlet")
colors <- c("red", "green", "yellow")

pie(sales_method, labels = labels, col = colors, main = "Pie chart of total sales by sales method")

library(ggplot2)

## Warning: package 'ggplot2' was built under R version 4.2.2

## 
## Attaching package: 'ggplot2'

## The following objects are masked from 'package:psych':
## 
##     %+%, alpha

sales_method <- c(35664375/1000000, 44965657/1000000, 39536618/1000000)
labels <- c("In-store", "Online", "Outlet")

ggplot(mydata3, aes(x = labels , y = sales_method)) +
  geom_col() +
  ggtitle("Total sales by sales method in millions (US Dollars)") +
  ylab("Total sales") +
  xlab("Sales method") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

mydata4 <- aggregate(mydata1$`Total Sales` ~ mydata1$Retailer, mydata1, sum)
colnames(mydata4)<-c("Retailer","Total Sales")
retailer_with_highest_total_sales <- mydata4[order(-mydata4$`Total Sales`) , ]
head(retailer_with_highest_total_sales)

##        Retailer Total Sales
## 6     West Gear    32409558
## 2   Foot Locker    29024945
## 4 Sports Direct    24616622
## 3        Kohl's    13512453
## 5       Walmart    10506085
## 1        Amazon    10096987

Additionally, we analyzed total sales of Adidas merchandise by retailer. West Gear had the highest total sales, followed by Foot Locker and Sports Direct. Below there are two graphic representations of that.

library(ggplot2)

df <- data.frame (
  retailer = c("West Gear", "Foot Locker", "Sports Direct", "Kohl's", "Amazon", "Walmart"),
  total_sales = c( 32409558, 29024945, 24616622, 13512453, 10506085, 10096987
                        ))
  
colors <- c("red", "green", "blue", "yellow", "darkorange", "dodgerblue2"
                 )
pie(df$total_sales, labels = df$retailer, col = colors, main = "Total sales by retailer")

library(ggplot2)
library(scales)

## 
## Attaching package: 'scales'

## The following objects are masked from 'package:psych':
## 
##     alpha, rescale

df <- data.frame (
  retailer = c("West Gear", "Foot Locker", "Sports Direct", "Kohl's", "Amazon", "Walmart"),
  total_sales = c( 32409558/1000000, 29024945/1000000, 24616622/1000000, 13512453/1000000, 10506085/1000000, 10096987/1000000
                        ))

ggplot(df, aes(x = retailer , y = total_sales)) +
  geom_col() +
  ggtitle("Total sales per retailer in millions (US Dollars)") +
  ylab("Total sales") +
  xlab("Retailer") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

mydata5 <- aggregate(mydata1$`Total Sales` ~ mydata1$Product, mydata1, sum)
colnames(mydata5)<-c("Product","Total sales")
product_with_the_highest_sales <- mydata5[order(-mydata5$`Total sales`) , ]
head(product_with_the_highest_sales)

##                     Product Total sales
## 3     Men's Street Footwear    27680769
## 4           Women's Apparel    23870985
## 2   Men's Athletic Footwear    20577180
## 6   Women's Street Footwear    17201563
## 1             Men's Apparel    16520632
## 5 Women's Athletic Footwear    14315521

Next, we look at total sales by the type of Adidas product sold. Men’s Street Footwear generated the most revenue, followed by Women’s Apparel in second place and Men’s Athletic Footwear in third.

product_type <- data.frame(
  product = c("Men's Street Footwear", "Women's Apparel", "Men's Athletic Footwear", "Women's Street Footwear", "Men's Apparel", "Women's Athletic Footwear"), 
  total_sales = c(27680769/1000000, 23870985/1000000, 20577180/1000000, 17201563/1000000, 16520632/1000000, 14315521/1000000)
)
library(ggplot2)
library(scales)
ggplot(product_type, aes(x = product , y = total_sales)) +
  geom_col() +
  ggtitle("Total sales per product in millions (US Dollars)") +
  ylab("Total sales") +
  xlab("Type of Product") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

mydata1$`Total Sales in thousand`<-(mydata$`Total Sales`/1000)

library(car)

## Warning: package 'car' was built under R version 4.2.2

## Loading required package: carData

## 
## Attaching package: 'car'

## The following object is masked from 'package:psych':
## 
##     logit

scatterplot(x=mydata1$`Units Sold`, y=mydata1$`Total Sales in thousand`,
        main = "Relationship between total sales and units sold",
        xlab = "Units sold",
        ylab = "Total Sales",
        smooth = FALSE)

library(car)
scatterplot(x=mydata1$`Price per Unit`, y=mydata1$`Total Sales in thousand`,
        main = "Relationship between total sales and average price per unit",
        ylab = "Total Sales",
        xlab ="Price" ,
        smooth = FALSE)

reg<-lm(mydata1$`Total Sales`~mydata1$`Price per Unit`+mydata1$`Units Sold`,data=mydata1)

summary(reg)

## 
## Call:
## lm(formula = mydata1$`Total Sales` ~ mydata1$`Price per Unit` + 
##     mydata1$`Units Sold`, data = mydata1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17739.8  -1356.7   -453.8   1288.0  27820.8 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              -12694.056    104.254  -121.8   <2e-16 ***
## mydata1$`Price per Unit`    274.618      2.258   121.6   <2e-16 ***
## mydata1$`Units Sold`         49.554      0.155   319.8   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3143 on 9645 degrees of freedom
## Multiple R-squared:  0.9389, Adjusted R-squared:  0.9389 
## F-statistic: 7.411e+04 on 2 and 9645 DF,  p-value: < 2.2e-16

Based on our economic instinct we expected price and units sold both significantly impact total sales, therefore, we conducted a regression analysis. We also created two scatter plots (above), which show a positive correlation between price and units sold as independent variables and total sales as the dependent variable. The scatter plots indicate that units sold have a greater impact on total sales. According to the regression function, if the average price per unit increases by one US Dollar, total sales increase on average by 274.62 US Dollars, all else being equal. In contrast, if the number of units sold increases by one, total sales on average increase by 50 units, all else being equal.

PaulinaSuvorovMVAHW1

Paulina Suvorov

2023-01-05