Author: Yaqi Hu


Descriptive statistics of dataset

1 Description of data set used in the analysis

1.1 Source
# Data Import
mydata <- read.table("./zara.csv", header = TRUE, sep = ";", dec = ".")

Dataset from: https://www.kaggle.com/datasets/xontoloyo/data-penjualan-zara

The data show product sales from Zara stores.

head(mydata)
##   Product.ID Product.Position Promotion Product.Category Seasonal Sales.Volume
## 1     185102            Aisle        No         Clothing       No         2823
## 2     188771            Aisle        No         Clothing       No          654
## 3     180176          End-cap       Yes         Clothing      Yes         2220
## 4     112917            Aisle       Yes         Clothing      Yes         1568
## 5     192936          End-cap        No         Clothing      Yes         2942
## 6     117590          End-cap        No         Clothing       No         2968
##   brand                                                                 url
## 1  Zara       https://www.zara.com/us/en/basic-puffer-jacket-p06985450.html
## 2  Zara             https://www.zara.com/us/en/tuxedo-jacket-p08896675.html
## 3  Zara      https://www.zara.com/us/en/slim-fit-suit-jacket-p01564520.html
## 4  Zara       https://www.zara.com/us/en/stretch-suit-jacket-p01564300.html
## 5  Zara       https://www.zara.com/us/en/double-faced-jacket-p08281477.html
## 6  Zara https://www.zara.com/us/en/contrasting-collar-jacket-p06987331.html
##                sku                      name
## 1  272145190-250-2       BASIC PUFFER JACKET
## 2 324052738-800-46             TUXEDO JACKET
## 3 335342680-800-44      SLIM FIT SUIT JACKET
## 4 328303236-420-44       STRETCH SUIT JACKET
## 5  312368260-800-2       DOUBLE FACED JACKET
## 6  320298385-807-2 CONTRASTING COLLAR JACKET
##                                                                                                                                                                                              description
## 1          Puffer jacket made of tear-resistant ripstop fabric. High collar and adjustable long sleeves with adhesive straps. Welt pockets at hip. Adjustable hem with side elastics. Front zip closure.
## 2                               Straight fit blazer. Pointed lapel collar and long sleeves with buttoned cuffs. Welt pockets at hip and interior pocket. Central back vent at hem. Front button closure.
## 3                              Slim fit jacket. Notched lapel collar. Long sleeves with buttoned cuffs. Welt pocket at chest and flap pockets at hip. Interior pocket. Back vents. Front button closure.
## 4 Slim fit jacket made of viscose blend fabric. Notched lapel collar. Long sleeves with buttoned cuffs. Welt pocket at chest and flap pockets at hip. Interior pocket. Back vents. Front button closure.
## 5                                                             Jacket made of faux leather faux shearling with fleece interior. Tabbed lapel collar. Long sleeves. Zip pockets at hip. Front zip closure.
## 6                                             Relaxed fit jacket. Contrasting lapel collar and long sleeves with buttoned cuffs. Front pouch pockets. Interior pocket. Washed effect. Front zip closure.
##    price currency                 scraped_at   terms section
## 1  19.99      USD 2024-02-19T08:50:05.654618 jackets     MAN
## 2 169.00      USD 2024-02-19T08:50:06.590930 jackets     MAN
## 3 129.00      USD 2024-02-19T08:50:07.301419 jackets     MAN
## 4 129.00      USD 2024-02-19T08:50:07.882922 jackets     MAN
## 5 139.00      USD 2024-02-19T08:50:08.453847 jackets     MAN
## 6  79.90      USD 2024-02-19T08:50:09.140497 jackets     MAN

1.2 Definition of variables
  • Unit of observation: an item sold at Zara

  • Sample size: 252

  • Product.ID: Identification number for each product

  • Product.Position: Location of the product in the store (Aisle, End-cap, or Front of Store)

  • Promotion: Indicates whether the product is currently being offered at a promotional price.

  • Product.Category: Broad product group of an item

  • Seasonal: Indicates whether the product is a seasonal item

  • Sales.Volume: The number of units sold for an item

  • Brand: Brand of the item

  • URL: web link to the item

  • SKU: Stock Keeping Unit, identification number to manage the inventory for the product

  • Name: Name of the product

  • Description: Description of the product

  • Price: Price of the product in USD

  • Currency: Currency of the product price.

  • Scraped_at: The time when the data was scraped

  • Terms: Subcategory of the product

  • Section: Specifies whether the product is intended for men or women

# Converting categorical variables into factors
mydata$Product.Position <- factor(mydata$Product.Position)
mydata$Promotion <- factor(mydata$Promotion)
mydata$Product.Category <- factor(mydata$Product.Category)
mydata$Seasonal <- factor(mydata$Seasonal)
mydata$brand <- factor(mydata$brand)
mydata$terms <- factor(mydata$terms)
mydata$section <- factor(mydata$section)
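The seven conversions above can also be written as a single step. The following is a minimal sketch on a made-up data frame (`toy` and `cat_vars` are illustrative names, and the values are not taken from the Zara data); it assumes only base R.

```r
# Toy data frame standing in for the Zara data (values are made up).
toy <- data.frame(
  Promotion = c("No", "Yes", "Yes", "No"),
  Seasonal  = c("No", "No", "Yes", "Yes"),
  section   = c("MAN", "WOMAN", "MAN", "MAN"),
  price     = c(19.99, 169.00, 129.00, 79.90),
  stringsAsFactors = FALSE
)

# Names of the columns that should be treated as categorical.
cat_vars <- c("Promotion", "Seasonal", "section")

# Convert all of them at once: lapply() returns a list of factors,
# which is assigned back to the same columns.
toy[cat_vars] <- lapply(toy[cat_vars], factor)

# Check the result: the three columns are factors, price stays numeric.
sapply(toy, class)
levels(toy$section)
```

Keeping the column names in one vector makes it easier to see which variables are treated as categorical.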

2 Descriptive statistics

# Descriptive statistics with summary
summary(mydata[ , -c(1,4,7,8,9,10,11,13,14)])
##        Product.Position Promotion Seasonal   Sales.Volume      price       
##  Aisle         :97      No :132   No :124   Min.   : 529   Min.   :  7.99  
##  End-cap       :86      Yes:120   Yes:128   1st Qu.:1243   1st Qu.: 49.90  
##  Front of Store:69                          Median :1840   Median : 79.90  
##                                             Mean   :1824   Mean   : 86.25  
##                                             3rd Qu.:2399   3rd Qu.:109.00  
##                                             Max.   :2989   Max.   :439.00  
##       terms      section   
##  jackets :140   MAN  :218  
##  jeans   :  8   WOMAN: 34  
##  shoes   : 31              
##  sweaters: 41              
##  t-shirts: 32              
## 
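The summary() call above drops columns by position, which gives a different result if the column order of the file ever changes. A possible alternative is to select the columns by name. The sketch below uses a made-up data frame (`toy` and `keep` are illustrative names) rather than the real data.

```r
# Toy data frame with a mix of identifier, categorical and numeric columns
# (values are made up for illustration).
toy <- data.frame(
  Product.ID   = c(185102, 188771, 180176),
  Promotion    = factor(c("No", "No", "Yes")),
  Sales.Volume = c(2823, 654, 2220),
  url          = c("a", "b", "c"),
  price        = c(19.99, 169.00, 129.00)
)

# Columns to summarise, listed by name instead of by position.
keep <- c("Promotion", "Sales.Volume", "price")

# Same kind of output as summary() on a positional subset,
# but robust to reordered columns.
summary(toy[, keep])
```
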
# Descriptive statistics with pastecs/stat.desc
library(pastecs)
round(stat.desc(mydata[ , c(6, 12)]), 1)
##              Sales.Volume   price
## nbr.val             252.0   252.0
## nbr.null              0.0     0.0
## nbr.na                0.0     0.0
## min                 529.0     8.0
## max                2989.0   439.0
## range              2460.0   431.0
## sum              459573.0 21735.6
## median             1839.5    79.9
## mean               1823.7    86.3
## SE.mean              44.0     3.3
## CI.mean.0.95         86.6     6.5
## var              486790.5  2712.7
## std.dev             697.7    52.1
## coef.var              0.4     0.6

# Descriptive statistics with psych/describe
library(psych)
round(describe(mydata[ , c(6, 12)]),1)
##              vars   n   mean    sd median trimmed   mad min  max range skew
## Sales.Volume    1 252 1823.7 697.7 1839.5  1835.5 868.8 529 2989  2460 -0.1
## price           2 252   86.3  52.1   79.9    80.9  43.1   8  439   431  2.4
##              kurtosis   se
## Sales.Volume     -1.1 44.0
## price            11.0  3.3

# Maximum sales volume
round(max(mydata$Sales.Volume),0)
## [1] 2989

The highest sales volume of an item is 2989 units.

# Minimum sales volume
round(min(mydata$Sales.Volume),0)
## [1] 529

The lowest sales volume of an item is 529 units.

# 20% quantile sales volume

round(quantile(mydata$Sales.Volume, 0.2),0)
##  20% 
## 1096

20% of the items in the sample have a sales volume of 1096 units or less; the other 80% have a higher sales volume.
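The interpretation of a quantile can be checked directly by counting the share of observations at or below it. A minimal sketch, assuming a short made-up vector of sales volumes (`sales` and `q20` are illustrative names):

```r
# Hypothetical sales volumes for ten items (not the real sample).
sales <- c(529, 654, 1096, 1243, 1568, 1840, 2220, 2399, 2823, 2989)

# 20% quantile, using R's default interpolation method (type 7).
q20 <- quantile(sales, 0.2)
q20

# Share of items with a sales volume at or below the 20% quantile.
mean(sales <= q20)  # 0.2
```

With only ten values the quantile is interpolated between two observations, so it does not have to coincide with an actual data point.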

# Median price
round(median(mydata$price),1)
## [1] 79.9

The median shows that half of the items cost $79.90 or less and the other half cost $79.90 or more.

# Mean price
round(mean(mydata$price),1)
## [1] 86.3

The arithmetic mean is higher than the median: the average price is $86.30. This indicates that a few very expensive items (outliers) may be pulling the mean upward.

# Trimmed mean price
round(mean(mydata$price,trim = 0.1),1)
## [1] 80.9

Trimming 10% of the observations from each end of the price distribution lowers the mean to $80.90, which brings it closer to the median.
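To make the effect of trimming concrete, the sketch below compares the mean, the median and the 10% trimmed mean on a made-up price vector with one extreme value, and reproduces the trimmed mean by hand. The vector `price` and the helper names `n`, `k` and `manual` are illustrative only.

```r
# Hypothetical prices with one very expensive item (not the real sample).
price <- c(8, 20, 50, 60, 70, 80, 90, 100, 110, 439)

mean(price)    # 102.7, pulled upward by the 439 item
median(price)  # 75

# Trimmed mean: trim = 0.1 drops 10% of the observations from EACH end.
mean(price, trim = 0.1)  # 72.5

# The same result computed by hand.
n <- length(price)
k <- floor(n * 0.1)                           # observations dropped per end
manual <- mean(sort(price)[(k + 1):(n - k)])  # mean of the remaining values
manual
```

In this toy example one observation is dropped at each end, so the extreme value no longer influences the result and the trimmed mean lands close to the median.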

# Descriptive statistics by groups psych/describeBy
library(psych)
describeBy(x = mydata$price,
           group = mydata$Promotion)
## 
##  Descriptive statistics by group 
## group: No
##    vars   n  mean    sd median trimmed   mad  min max  range skew kurtosis   se
## X1    1 132 80.65 44.93   69.9   76.98 29.65 7.99 349 341.01 1.84     8.13 3.91
## ------------------------------------------------------------ 
## group: Yes
##    vars   n  mean    sd median trimmed   mad   min max  range skew kurtosis
## X1    1 120 92.41 58.54   89.9   85.69 51.22 12.99 439 426.01 2.44    10.39
##      se
## X1 5.34

These descriptive statistics compare the prices of items with and without a promotion. In this sample, promoted items are on average more expensive (mean = USD 92.41) than items without a promotion (mean = USD 80.65). The medians (USD 89.90 versus USD 69.90) point in the same direction.
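If only the group means and medians are of interest, base R can produce them without the psych package. The following is a rough sketch on made-up data (`toy`, `group_means` and `group_medians` are illustrative names), not a replacement for the describeBy() output above.

```r
# Toy data: price and promotion status for six items (values are made up).
toy <- data.frame(
  Promotion = factor(c("No", "No", "No", "Yes", "Yes", "Yes")),
  price     = c(20, 70, 90, 50, 90, 130)
)

# Mean and median price within each promotion group.
group_means   <- tapply(toy$price, toy$Promotion, mean)
group_medians <- tapply(toy$price, toy$Promotion, median)

group_means    # No: 60, Yes: 90
group_medians  # No: 70, Yes: 90

# Difference in mean price between promoted and non-promoted items.
group_means[["Yes"]] - group_means[["No"]]  # 30
```
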

boxplot(mydata$price)

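The boxplot shows that approximately half of the items are priced between about $50 and $110 (the box spans the first and third quartiles). A few very expensive items appear as outliers above the upper whisker.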
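The numbers behind a boxplot can be inspected with boxplot.stats() from base R. The sketch below uses a made-up price vector (`price` and `bs` are illustrative names) to show which values form the box and whiskers and which points are flagged as outliers under the default rule of 1.5 times the box height.

```r
# Hypothetical prices with one very expensive item (not the real sample).
price <- c(8, 20, 50, 60, 70, 80, 90, 100, 110, 439)

bs <- boxplot.stats(price)

# Lower whisker, lower hinge, median, upper hinge, upper whisker.
bs$stats  # 8 50 75 100 110

# Points lying more than 1.5 * (upper hinge - lower hinge) beyond the box.
bs$out    # 439
```

Here the box runs from 50 to 100, so any price above 100 + 1.5 * 50 = 175 is drawn as a separate point rather than included in the whisker.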

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:pastecs':
## 
##     first, last
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
mydata <- mydata %>%
  filter(price<= 200)
boxplot(mydata$price)

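This time I have removed items priced above $200 for illustrative purposes, which removes the most extreme outliers. The boxplot is now zoomed in: the price axis is labelled in steps of 50 instead of 100. Because mydata is overwritten by the filter, the plots that follow also use the filtered data.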
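One possible variation is to store the filtered data under a new name so that the full sample stays available for later steps. A minimal sketch in base R, using a made-up data frame (`toy` and `toy_filtered` are illustrative names):

```r
# Toy data: five items, two of them priced above 200 (values are made up).
toy <- data.frame(
  name  = c("A", "B", "C", "D", "E"),
  price = c(19.99, 129.00, 200.00, 349.00, 439.00)
)

# Keep items priced at 200 or less in a NEW object;
# the original data frame is left unchanged.
toy_filtered <- subset(toy, price <= 200)

nrow(toy)           # 5
nrow(toy_filtered)  # 3, the item priced at exactly 200 is kept
```
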

library(ggplot2)
## 
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
## 
##     %+%, alpha
ggplot(mydata, aes(x = price, fill = section))+
  geom_histogram(position = position_dodge(2), binwidth = 10, colour = "grey") +
  ylab("Frequency") +
  labs(fill = "Section")

The histogram shows how item prices differ between men’s and women’s clothing. The x-axis shows the price of the items and the y-axis shows the number of items priced within each range. The sample includes more men’s clothing than women’s clothing, so the bars for men are generally taller. The histogram suggests that prices for women’s clothing tend to be lower than for men’s clothing: the most frequent price range is 90-100 USD for men’s clothing and 50-60 USD for women’s clothing.
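Because the two sections differ a lot in size, raw frequencies are hard to compare across groups. One option is to look at the share of items per price bin within each section. The sketch below does this in base R on made-up data (`toy`, `bins`, `counts` and `props` are illustrative names, and the bin edges are an assumption chosen to match a bin width of 10).

```r
# Toy data: eight men's items and two women's items (values are made up).
toy <- data.frame(
  section = factor(c(rep("MAN", 8), rep("WOMAN", 2))),
  price   = c(45, 55, 65, 95, 95, 95, 105, 125, 55, 59)
)

# Assign each price to a bin of width 10: [40,50), [50,60), ...
bins <- cut(toy$price, breaks = seq(40, 130, by = 10), right = FALSE)

# Number of items per section and price bin.
counts <- table(toy$section, bins)

# Share of items per bin WITHIN each section (each row sums to 1).
props <- prop.table(counts, margin = 1)
round(props, 2)
```

In this toy example the modal bin is [90,100) for MAN and [50,60) for WOMAN, and the row-wise shares make that visible even though one group is four times larger than the other.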

ggplot(mydata, aes(y = price, x = section)) + geom_boxplot()

These two boxplots compare the prices of women’s and men’s clothing. They also suggest that men’s clothing is priced higher in this sample.