HW2

library(readxl)
mydata <- read_excel("C:/Users/pauli/OneDrive/Desktop/MVA/Adidas US Sales Datasets.xlsx")
head(mydata)

## # A tibble: 6 × 13
##   Retai…¹ Retai…² `Invoice Date`      Region State City  Product Price…³ Units…⁴
##   <chr>     <dbl> <dttm>              <chr>  <chr> <chr> <chr>     <dbl>   <dbl>
## 1 Foot L… 1185732 2020-01-01 00:00:00 North… New … New … Men's …      50    1200
## 2 Foot L… 1185732 2020-01-02 00:00:00 North… New … New … Men's …      50    1000
## 3 Foot L… 1185732 2020-01-03 00:00:00 North… New … New … Women'…      40    1000
## 4 Foot L… 1185732 2020-01-04 00:00:00 North… New … New … Women'…      45     850
## 5 Foot L… 1185732 2020-01-05 00:00:00 North… New … New … Men's …      60     900
## 6 Foot L… 1185732 2020-01-06 00:00:00 North… New … New … Women'…      50    1000
## # … with 4 more variables: `Total Sales` <dbl>, `Operating Profit` <dbl>,
## #   `Operating Margin` <dbl>, `Sales Method` <chr>, and abbreviated variable
## #   names ¹Retailer, ²`Retailer ID`, ³`Price per Unit`, ⁴`Units Sold`

mydata1 <- mydata[,c(-2, -3, -4, -6, -12)]
colnames(mydata1) <- c("Retailer", "State", "Product", "Priceperunit", "Unitssold", "Totalsales", "Operatingprofit", "Salesmethod")
head(mydata1)

## # A tibble: 6 × 8
##   Retailer    State    Product           Price…¹ Units…² Total…³ Opera…⁴ Sales…⁵
##   <chr>       <chr>    <chr>               <dbl>   <dbl>   <dbl>   <dbl> <chr>  
## 1 Foot Locker New York Men's Street Foo…      50    1200   60000  300000 In-sto…
## 2 Foot Locker New York Men's Athletic F…      50    1000   50000  150000 In-sto…
## 3 Foot Locker New York Women's Street F…      40    1000   40000  140000 In-sto…
## 4 Foot Locker New York Women's Athletic…      45     850   38250  133875 In-sto…
## 5 Foot Locker New York Men's Apparel          60     900   54000  162000 In-sto…
## 6 Foot Locker New York Women's Apparel        50    1000   50000  125000 In-sto…
## # … with abbreviated variable names ¹Priceperunit, ²Unitssold, ³Totalsales,
## #   ⁴Operatingprofit, ⁵Salesmethod

I am using the same data set as in HW1; above, I removed the variables that I do not need. The data set consists of 9648 observations and 8 variables. The unit of observation is a product line of Adidas sold by a retailer in a specific state in the US.
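The column removal above relies on positions (-2, -3, ...), which silently breaks if the spreadsheet layout changes. As a minimal sketch, the same selection can be done by name. The one-row data frame `toy` below is a made-up stand-in for the Excel sheet (all values are invented; only the column names follow the head() output), and `keep` and `toy1` are hypothetical names used for illustration.

```r
# Made-up one-row stand-in for the Excel sheet; only the column names are real
toy <- data.frame(
  "Retailer" = "Retailer A", "Retailer ID" = 1, "Invoice Date" = as.Date("2020-01-01"),
  "Region" = "Region A", "State" = "State A", "City" = "City A",
  "Product" = "Product A", "Price per Unit" = 50, "Units Sold" = 100,
  "Total Sales" = 5000, "Operating Profit" = 1500, "Operating Margin" = 0.3,
  "Sales Method" = "Online",
  check.names = FALSE
)

# Select the columns to keep by name instead of by position
keep <- c("Retailer", "State", "Product", "Price per Unit", "Units Sold",
          "Total Sales", "Operating Profit", "Sales Method")
toy1 <- toy[, keep]

# Rename to the short names used in the rest of the report
names(toy1) <- c("Retailer", "State", "Product", "Priceperunit", "Unitssold",
                 "Totalsales", "Operatingprofit", "Salesmethod")
str(toy1)
```

On this toy frame, selecting by name gives the same columns as dropping positions 2, 3, 4, 6 and 12.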

The variables are the following:

-Retailer: name of the retailer
-State: name of the state in the US
-Product: type of Adidas merchandise
-Price per unit: the price of the merchandise in US Dollars
-Units sold: quantity of Adidas merchandise sold
-Total sales: total sales in US Dollars
-Operating profit: operating profit in US Dollars
-Sales method: sales channel

The data set was found on the Kaggle website, the author is Heemali Chaudhari. Retrieved January 5th 2023, from https://www.kaggle.com/datasets/heemalichaudhari/adidas-sales-dataset .

The research question is different for each hypothesis testing method, so I will state it before each set of hypotheses.

##1. Testing the hypothesis about the population arithmetic mean

Research question: Does Adidas footwear, on average, cost more or less than 80 US Dollars?

H0: The average price of Adidas footwear is 80 (Mean = 80) -> The median price of Adidas footwear is 70 (Median = 70)
H1: The average price of Adidas footwear is not 80 (Mean =/= 80) -> The median price of Adidas footwear is not 70 (Median =/= 70)

For these hypotheses, we conducted some external research on the median and average price of Adidas footwear:
https://www.statista.com/statistics/828403/median-price-of-popular-sneaker-brands-worldwide/
https://www.statista.com/statistics/685483/us-athletic-footwear-average-selling-price/

I did not find any data on the average price of Adidas shoes covering both street wear and athletic wear. I only found that, according to Statista, the average selling price of Adidas athletic footwear in the US in 2017 was 58.16 US Dollars. I decided to “round up” this average price to 80 US Dollars, because the initial 58.16 US Dollars does not include street footwear and I assumed that including it would raise the price. The median price of Adidas shoes in 2017 was 70 US Dollars, according to Statista. (Statista, 2017)

Conditions and assumptions:
-the variable is numeric (the variable Price per unit is numeric)
-normality (the variable is normally distributed in the population; we will check this with the help of the ggplot)
-there are no outliers (we will check this with the help of the ggplot)

set.seed(1)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

sampledata1 <- sample_n(mydata1, size=4000)

First, we take mydata1 and draw a random sample of 4000 units, so that we can later perform the Shapiro-Wilk normality test: shapiro.test() in R accepts at most 5000 observations, and the initial data set has 9648.
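The reason for sampling can be illustrated on simulated data. This is only a sketch: `x_big` and `x_sub` are made-up vectors, not the Adidas prices. It shows that shapiro.test() stops with an error for more than 5000 values and runs on a random subsample.

```r
set.seed(1)

# Simulated stand-in for a variable with more than 5000 observations
x_big <- rnorm(5001)

# shapiro.test() refuses samples larger than 5000; capture the error message
res <- tryCatch(shapiro.test(x_big), error = function(e) conditionMessage(e))
print(res)

# A random subsample of 4000 values is within the allowed range
x_sub <- sample(x_big, size = 4000)
sw <- shapiro.test(x_sub)
print(sw)
```

Note that with several thousand observations the test has a lot of power, so even small departures from normality can give a very small p-value.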

sampledata1filtered <- sampledata1 %>% filter(Product != "Men's Apparel", Product != "Women's Apparel")

Since the data set also includes products other than shoes (men's and women's apparel), we removed them; this leaves a sample of 2633 units.

library(ggplot2)
  sampledata1filtered %>% 
  ggplot(aes(x = Priceperunit) ) +
  geom_histogram(binwidth = 10, colour = "grey", fill = "lightpink1") +
  ylab("Frequency") +
  xlab("Price per unit") +
  ggtitle("Distribution of price per unit") +
  theme_minimal()

According to the graph above, the price per unit looks roughly normally distributed, but we cannot say so for sure. Therefore, we will conduct the Shapiro-Wilk normality test to check the normality of the distribution formally.

H0: Prices of Adidas footwear are normally distributed.
H1: Prices of Adidas footwear are not normally distributed.

library(stats)
shapiro.test(sampledata1filtered$Priceperunit)

## 
##  Shapiro-Wilk normality test
## 
## data:  sampledata1filtered$Priceperunit
## W = 0.99106, p-value = 9.91e-12

We can reject H0 at p < 0.001 and conclude that the prices of Adidas footwear are not normally distributed. Since the normality assumption is not met, we have to use the Wilcoxon signed rank test (condition: the variable must be numeric, which it is).

Again:
H0: The median price of Adidas footwear is 70 (Median = 70)
H1: The median price of Adidas footwear is not 70 (Median =/= 70)

wilcox.test(sampledata1filtered$Priceperunit,
            mu = 70,
            alternative = "two.sided",
            correct = FALSE)

## 
##  Wilcoxon signed rank test
## 
## data:  sampledata1filtered$Priceperunit
## V = 17086, p-value < 2.2e-16
## alternative hypothesis: true location is not equal to 70

We can reject H0 at p < 0.001: the median price is not 70, as it was in 2017.

median(sampledata1filtered$Priceperunit)

## [1] 41

mean(sampledata1filtered$Priceperunit)

## [1] 42.35207

library(effectsize)
effectsize(wilcox.test(sampledata1filtered$Priceperunit,
            mu = 70,
            alternative = "two.sided",
            correct = FALSE))

## r (rank biserial) |         95% CI
## ----------------------------------
## -0.99             | [-0.99, -0.99]
## 
## - Deviation from a difference of 70.

interpret_rank_biserial(0.99, rules="funder2019")

## [1] "very large"
## (Rules: funder2019)

According to the sample data on prices of Adidas footwear, the median price is 41 US Dollars, which is far below the 2017 median of 70 US Dollars (p < 0.001, r = -0.99, a very large effect).
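To make the V statistic and the rank-biserial effect size less of a black box, here is a minimal by-hand sketch on ten made-up prices (the vector `x` is invented, not taken from the data set). It assumes the usual matched-pairs definition, r = (W+ - W-) / (W+ + W-), where W+ and W- are the sums of the ranks of the positive and negative deviations from the hypothesised median.

```r
# Ten made-up prices and the hypothesised median
x  <- c(41, 35, 50, 62, 38, 45, 72, 30, 55, 48)
mu <- 70

# Deviations from the hypothesised median (zeros would be dropped)
d <- x - mu
d <- d[d != 0]

# Rank the absolute deviations, then sum the ranks by sign
rk    <- rank(abs(d))
w_pos <- sum(rk[d > 0])   # this is the V reported by wilcox.test()
w_neg <- sum(rk[d < 0])

# Matched-pairs rank-biserial correlation
r_rb <- (w_pos - w_neg) / (w_pos + w_neg)

c(V = w_pos, r = r_rb)
wilcox.test(x, mu = mu)$statistic
```

In this toy example only one value lies above 70, so V = 1 and r is close to -1. The negative sign says that the values lie mostly below the hypothesised median, which is why the sign is worth reporting alongside the magnitude.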

##Testing the hypothesis about the difference between two population arithmetic means

We will test whether there is a difference between the number of units sold by Walmart and by Sports Direct. I chose these two because Walmart is an American multinational retail corporation that I am not really familiar with, whereas Sports Direct is familiar to me, because we have Sports Direct shops in Slovenia too.

The research question: Is there a difference between the number of units sold by Walmart and the number sold by Sports Direct?
H0: The arithmetic means of the number of units sold by Walmart and Sports Direct are the same -> Distribution locations are the same
H1: The arithmetic means of the number of units sold by Walmart and Sports Direct are different -> Distribution locations are not the same

Before we can decide on which testing method to use, we have to check whether all of the conditions and assumptions for the parametric test are fulfilled.

These are:
-the variable has to be numeric (the number of units sold by a specific retailer is a numeric variable),
-the distribution of the variable is normal in both populations,
-the data come from two independent populations and
-the variable has the same variance in both populations.
The remaining conditions/assumptions are checked below.
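The equal-variance condition could be checked with a rank-based test that does not itself rely on normality. The sketch below uses the Fligner-Killeen test (fligner.test() from the stats package) on simulated data; this is an illustrative choice, not a step taken in this report, and the data frame `toy` is made up.

```r
set.seed(1)

# Simulated, skewed stand-in for units sold by the two retailers
toy <- data.frame(
  Retailer  = factor(rep(c("Walmart", "Sports Direct"), each = 50)),
  Unitssold = c(rexp(50, rate = 1 / 200), rexp(50, rate = 1 / 250))
)

# H0: the variances are the same in both groups
fk <- fligner.test(Unitssold ~ Retailer, data = toy)
print(fk)
```

A small p-value would speak against equal variances; with two groups the test has one degree of freedom.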

First, we will take a look at a ggplot of both distributions.

library(ggplot2)
# Histogram of units sold by Walmart
Walmart <- mydata1 %>%
  filter(Retailer == 'Walmart') %>%
  ggplot(aes(x = Unitssold)) +
  geom_histogram(binwidth = 100, colour = "grey", fill = "lightpink") +
  ylab("Frequency") +
  xlab("Number of units sold by Walmart") +
  theme_minimal()
# Histogram of units sold by Sports Direct
Sportsdirect <- mydata1 %>%
  filter(Retailer == 'Sports Direct') %>%
  ggplot(aes(x = Unitssold)) +
  geom_histogram(binwidth = 100, colour = "grey", fill = "lightpink") +
  ylab("Frequency") +
  xlab("Number of units sold by Sports Direct") +
  theme_minimal()
# Show both histograms side by side
library(ggpubr)
ggarrange(Walmart, Sportsdirect,
          ncol = 2, nrow = 1)

According to the graphs above, we cannot assume these two are distributed normally, both are skewed to the left.

We will check with Shapiro-Wilk Normality Test.

H0: The number of units sold by Walmart is normally distributed.
H1: The number of units sold by Walmart is not normally distributed.

As we can see below, we reject H0 at p < 0.001 and conclude that the number of units sold by Walmart is not normally distributed.

library(rstatix)

## 
## Attaching package: 'rstatix'

## The following objects are masked from 'package:effectsize':
## 
##     cohens_d, eta_squared

## The following object is masked from 'package:stats':
## 
##     filter

mydata1 %>% 
  group_by(Retailer) %>%
    filter(Retailer == 'Walmart') %>% 
      shapiro_test(Unitssold)

## # A tibble: 1 × 4
##   Retailer variable  statistic        p
##   <chr>    <chr>         <dbl>    <dbl>
## 1 Walmart  Unitssold     0.831 3.31e-25

And we do the same for Sports Direct.

H0: The number of units sold by Sports Direct is normally distributed.
H1: The number of units sold by Sports Direct is not normally distributed.

mydata1 %>%
  group_by(Retailer) %>%
    filter(Retailer == 'Sports Direct') %>% 
      shapiro_test(Unitssold)

## # A tibble: 1 × 4
##   Retailer      variable  statistic        p
##   <chr>         <chr>         <dbl>    <dbl>
## 1 Sports Direct Unitssold     0.840 3.32e-41

The results of the Shapiro-Wilk test show that neither of the two distributions is normal; therefore, the null hypothesis (H0: Distribution of the number of units sold by Walmart/Sports Direct is normal) is rejected at p < 0.001 for both retailers.

Since the normality assumption is not met, we will continue with the non-parametric alternative to the independent samples t-test which is the Wilcoxon Rank Sum Test.

Again:
H0: Distribution locations are the same.
H1: Distribution locations are not the same.

mydata2 <- mydata1[mydata1$Retailer == 'Walmart' | mydata1$Retailer == 'Sports Direct',]

wilcox.test(mydata2$Unitssold ~ mydata2$Retailer,
                  paired = FALSE,
                  correct = FALSE,
                  exact = FALSE,
                  alternative = "two.sided")

## 
##  Wilcoxon rank sum test
## 
## data:  mydata2$Unitssold by mydata2$Retailer
## W = 533356, p-value = 9.651e-10
## alternative hypothesis: true location shift is not equal to 0

As we can see above, we can reject H0 at p < 0.001 and conclude that distribution locations are not the same.

library(effectsize)
effectsize(wilcox.test(mydata2$Unitssold ~ mydata2$Retailer,
                  paired = FALSE,
                  correct = FALSE,
                  exact = FALSE,
                  alternative = "two.sided"))

## r (rank biserial) |         95% CI
## ----------------------------------
## -0.16             | [-0.21, -0.11]

interpret_rank_biserial(0.16)

## [1] "small"
## (Rules: funder2019)

According to the data, the numbers of units sold by Walmart and by Sports Direct are different (p < 0.001). The difference is small (r = -0.16).
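The link between the W statistic and the rank-biserial effect size can be sketched by hand on made-up numbers (the data frame `toy` below is invented). It assumes the common definition r = 2W / (n1 * n2) - 1, where W is the statistic reported by wilcox.test() for the first group and n1, n2 are the group sizes.

```r
# Made-up units sold for two retailers (4 and 5 observations)
toy <- data.frame(
  Retailer  = factor(rep(c("Sports Direct", "Walmart"), times = c(4, 5))),
  Unitssold = c(10, 20, 30, 40, 25, 35, 45, 55, 65)
)

wt <- wilcox.test(Unitssold ~ Retailer, data = toy,
                  correct = FALSE, exact = FALSE)

# W counts the pairs in which the first group (Sports Direct) has the larger value
W  <- unname(wt$statistic)
n1 <- sum(toy$Retailer == "Sports Direct")
n2 <- sum(toy$Retailer == "Walmart")

# Rank-biserial correlation: share of favourable pairs minus share of unfavourable pairs
r_rb <- 2 * W / (n1 * n2) - 1

c(W = W, r = r_rb)
```

Here W = 3 out of 20 possible pairs, so r = -0.7: the first group tends to have the smaller values. R orders character groups alphabetically by default, so under that assumption a negative r in a Sports Direct versus Walmart comparison points to lower values for Sports Direct.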

##Test of population proportion

Research question: Can we conclude that more units were sold by in-person methods (in-store + outlet) than online?

First, we need to merge the number of units sold in-store and in outlet stores.

mydata3 <- aggregate(mydata1$Unitssold ~ mydata1$Salesmethod, mydata1, sum)
colnames(mydata3)<-c("Sales method","Units sold")
head(mydata3)

##   Sales method Units sold
## 1     In-store     689990
## 2       Online     939093
## 3       Outlet     849778

# In-person = In-store + Outlet (689990 + 849778); store the counts as numbers, not text
df <- data.frame(
  salesmethod = c("In-person", "Online"),
  unitssold = c(1539768, 939093)
)
View(df)

The total number of units sold was 2,478,861; of these, 1,539,768 units were sold in person and 939,093 online.
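The merge of the in-store and outlet totals can also be computed rather than typed in by hand. A minimal sketch, using the three totals from the aggregate() output above; the names `mydata3_toy`, `Channel` and `df_sketch` are hypothetical and used only for illustration.

```r
# Totals per sales method, copied from the aggregate() output
mydata3_toy <- data.frame(
  Salesmethod = c("In-store", "Online", "Outlet"),
  Unitssold   = c(689990, 939093, 849778)
)

# Everything that is not online counts as in-person
mydata3_toy$Channel <- ifelse(mydata3_toy$Salesmethod == "Online",
                              "Online", "In-person")

# Sum the units within each channel
df_sketch <- aggregate(Unitssold ~ Channel, data = mydata3_toy, FUN = sum)
print(df_sketch)
sum(df_sketch$Unitssold)
```

This gives 1,539,768 in-person units and a grand total of 2,478,861, matching the figures used in the test.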

H0: pi = 0.5 (The proportion of units sold by in-person methods (in-store and outlet) is the same as the proportion of units sold online)
H1: pi > 0.5 (The proportion of units sold by in-person methods (in-store and outlet) is greater than the proportion of units sold online)

Assumptions: we have set pi(0) = 0.5 and n = 2,478,861

n*pi(0) > 5 -> 1,239,430.5 > 5 -> ok
n*(1-pi(0)) > 5 -> 1,239,430.5 > 5 -> also ok

The assumptions are met, so we can do the test of population proportion.

prop.test(x = 1539768,
          n = 2478861,
          p = 0.5,
          correct = FALSE,
          alternative = "greater")

## 
##  1-sample proportions test without continuity correction
## 
## data:  1539768 out of 2478861, null probability 0.5
## X-squared = 145555, df = 1, p-value < 2.2e-16
## alternative hypothesis: true p is greater than 0.5
## 95 percent confidence interval:
##  0.6206526 1.0000000
## sample estimates:
##         p 
## 0.6211595

We can reject the H0 at p < 0.001, and conclude that the proportion of units sold by in-person methods is greater than the proportion of units sold online.
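The test statistic can be reproduced by hand. The sketch below uses the counts from this section and assumes the standard z statistic for one proportion, z = (p - pi(0)) / sqrt(pi(0) * (1 - pi(0)) / n); without continuity correction, the X-squared reported by prop.test() is z squared.

```r
x   <- 1539768   # units sold in person
n   <- 2478861   # all units sold
pi0 <- 0.5       # hypothesised proportion

# Sample-size conditions for the normal approximation
stopifnot(n * pi0 > 5, n * (1 - pi0) > 5)

# Sample proportion and z statistic
p_hat <- x / n
z     <- (p_hat - pi0) / sqrt(pi0 * (1 - pi0) / n)

# Compare z^2 with the X-squared from prop.test()
pt <- prop.test(x = x, n = n, p = pi0, correct = FALSE, alternative = "greater")
c(p_hat = p_hat, z = z, z_squared = z^2, X_squared = unname(pt$statistic))
```

Because n is so large, even a small deviation from 0.5 would be statistically significant here, so the estimated proportion itself (about 0.62) is the more informative number.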

Paulina Suvorov

2023-01-11