Author: Yaqi Hu
# Data Import
mydata <- read.table("./zara.csv", header = TRUE, sep = ";", dec = ".")
Dataset from: https://www.kaggle.com/datasets/xontoloyo/data-penjualan-zara The data shows product sales from Zara stores.
head(mydata)
## Product.ID Product.Position Promotion Product.Category Seasonal Sales.Volume
## 1 185102 Aisle No Clothing No 2823
## 2 188771 Aisle No Clothing No 654
## 3 180176 End-cap Yes Clothing Yes 2220
## 4 112917 Aisle Yes Clothing Yes 1568
## 5 192936 End-cap No Clothing Yes 2942
## 6 117590 End-cap No Clothing No 2968
## brand url
## 1 Zara https://www.zara.com/us/en/basic-puffer-jacket-p06985450.html
## 2 Zara https://www.zara.com/us/en/tuxedo-jacket-p08896675.html
## 3 Zara https://www.zara.com/us/en/slim-fit-suit-jacket-p01564520.html
## 4 Zara https://www.zara.com/us/en/stretch-suit-jacket-p01564300.html
## 5 Zara https://www.zara.com/us/en/double-faced-jacket-p08281477.html
## 6 Zara https://www.zara.com/us/en/contrasting-collar-jacket-p06987331.html
## sku name
## 1 272145190-250-2 BASIC PUFFER JACKET
## 2 324052738-800-46 TUXEDO JACKET
## 3 335342680-800-44 SLIM FIT SUIT JACKET
## 4 328303236-420-44 STRETCH SUIT JACKET
## 5 312368260-800-2 DOUBLE FACED JACKET
## 6 320298385-807-2 CONTRASTING COLLAR JACKET
## description
## 1 Puffer jacket made of tear-resistant ripstop fabric. High collar and adjustable long sleeves with adhesive straps. Welt pockets at hip. Adjustable hem with side elastics. Front zip closure.
## 2 Straight fit blazer. Pointed lapel collar and long sleeves with buttoned cuffs. Welt pockets at hip and interior pocket. Central back vent at hem. Front button closure.
## 3 Slim fit jacket. Notched lapel collar. Long sleeves with buttoned cuffs. Welt pocket at chest and flap pockets at hip. Interior pocket. Back vents. Front button closure.
## 4 Slim fit jacket made of viscose blend fabric. Notched lapel collar. Long sleeves with buttoned cuffs. Welt pocket at chest and flap pockets at hip. Interior pocket. Back vents. Front button closure.
## 5 Jacket made of faux leather faux shearling with fleece interior. Tabbed lapel collar. Long sleeves. Zip pockets at hip. Front zip closure.
## 6 Relaxed fit jacket. Contrasting lapel collar and long sleeves with buttoned cuffs. Front pouch pockets. Interior pocket. Washed effect. Front zip closure.
## price currency scraped_at terms section
## 1 19.99 USD 2024-02-19T08:50:05.654618 jackets MAN
## 2 169.00 USD 2024-02-19T08:50:06.590930 jackets MAN
## 3 129.00 USD 2024-02-19T08:50:07.301419 jackets MAN
## 4 129.00 USD 2024-02-19T08:50:07.882922 jackets MAN
## 5 139.00 USD 2024-02-19T08:50:08.453847 jackets MAN
## 6 79.90 USD 2024-02-19T08:50:09.140497 jackets MAN
Unit of observation: an item sold at Zara
Sample size: 252
Product.ID: Identification number for each product
Product.Position: Location of the product in the store
Promotion: Indicates whether the product is currently being offered at a promotional price.
Product Category: Broad product group of an item
Seasonal: Product sold seasonally
Sales Volume: The number of units sold for an item.
Brand: Brand of the item
URL: web link to the item
SKU: Stock Keeping Unit, identification number to manage the inventory for the product
Name: Name of the product
Description: Description of the product
Price: Price of the product in USD
Currency: Currency of the product price.
Scraped_at: The time when the data was scraped
Terms: Subcategory of the product
Section: Specifies whether the product is intended for men or women
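The sample size and the variable types listed above can be checked directly after the import. The following is a minimal sketch on a small stand-in table, not on the real file: `csv_text` and `toy` are hypothetical names, and the three rows are illustrative only. With the real data, the same calls would be applied to `mydata`.

```r
# Toy stand-in for zara.csv: hypothetical rows, same ";" separator
csv_text <- "Product.ID;Promotion;Sales.Volume;price
1;No;2823;19.99
2;Yes;654;169.00
3;Yes;2220;129.00"

toy <- read.table(text = csv_text, header = TRUE, sep = ";", dec = ".")

nrow(toy)   # number of observations (the sample size)
ncol(toy)   # number of variables
str(toy)    # type of each column as it was read in
```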
# Converting categorical variables into factors
mydata$Product.Position <- factor(mydata$Product.Position)
mydata$Promotion <- factor(mydata$Promotion)
mydata$Product.Category <- factor(mydata$Product.Category)
mydata$Seasonal <- factor(mydata$Seasonal)
mydata$brand <- factor(mydata$brand)
mydata$terms <- factor(mydata$terms)
mydata$section <- factor(mydata$section)
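The same conversion can be written once for a whole set of columns. This is only a sketch on a hypothetical toy data frame (`toy` and `factor_cols` are illustrative names); it assumes the columns to convert are known by name.

```r
# Hypothetical toy data with two categorical columns stored as text
toy <- data.frame(
  Promotion = c("No", "Yes", "Yes"),
  section   = c("MAN", "MAN", "WOMAN"),
  price     = c(19.99, 169.00, 129.00),
  stringsAsFactors = FALSE
)

# Convert every listed column to a factor in one step
factor_cols <- c("Promotion", "section")
toy[factor_cols] <- lapply(toy[factor_cols], factor)

sapply(toy, class)  # confirm which columns are now factors
```

Listing the columns by name keeps the conversion in one place if more categorical variables are added later.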
# Descriptive statistics with summary
summary(mydata[ , -c(1,4,7,8,9,10,11,13,14)])
## Product.Position Promotion Seasonal Sales.Volume price
## Aisle :97 No :132 No :124 Min. : 529 Min. : 7.99
## End-cap :86 Yes:120 Yes:128 1st Qu.:1243 1st Qu.: 49.90
## Front of Store:69 Median :1840 Median : 79.90
## Mean :1824 Mean : 86.25
## 3rd Qu.:2399 3rd Qu.:109.00
## Max. :2989 Max. :439.00
## terms section
## jackets :140 MAN :218
## jeans : 8 WOMAN: 34
## shoes : 31
## sweaters: 41
## t-shirts: 32
##
Excluded from the summary: 1. Product.ID, 8. url, 9. sku, 10. name, 11. description, 14. scraped_at
Excluded because the dataset contains only one value for each of them (Clothing, Zara, USD): 4. Product.Category, 7. brand, 13. currency
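Dropping columns by position works, but the index vector silently changes meaning if the column order changes. A possible alternative, sketched here on a hypothetical toy data frame (`toy`, `drop_cols` and `toy_summary` are illustrative names), is to exclude columns by name.

```r
# Hypothetical toy data with an ID column and a constant column
toy <- data.frame(
  Product.ID = 1:3,
  Promotion  = factor(c("No", "Yes", "Yes")),
  brand      = "Zara",
  price      = c(19.99, 169.00, 129.00)
)

# Exclude columns by name rather than by position
drop_cols   <- c("Product.ID", "brand")
toy_summary <- toy[, !(names(toy) %in% drop_cols)]

summary(toy_summary)
```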
# Descriptive statistics with pastecs/stat.desc
library(pastecs)
round(stat.desc(mydata[ , c(6, 12)]), 1)
## Sales.Volume price
## nbr.val 252.0 252.0
## nbr.null 0.0 0.0
## nbr.na 0.0 0.0
## min 529.0 8.0
## max 2989.0 439.0
## range 2460.0 431.0
## sum 459573.0 21735.6
## median 1839.5 79.9
## mean 1823.7 86.3
## SE.mean 44.0 3.3
## CI.mean.0.95 86.6 6.5
## var 486790.5 2712.7
## std.dev 697.7 52.1
## coef.var 0.4 0.6
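The `SE.mean` and `CI.mean.0.95` rows can be reproduced from the standard deviation and the sample size. The sketch below uses only the rounded values reported above for Sales.Volume and assumes that the confidence interval half-width is based on the t distribution with n - 1 degrees of freedom; small differences can arise from rounding.

```r
# Values taken from the stat.desc output above (Sales.Volume)
n          <- 252      # nbr.val
mean_sales <- 1823.7   # mean
sd_sales   <- 697.7    # std.dev

# Standard error of the mean: sd / sqrt(n)
se_mean <- sd_sales / sqrt(n)

# Half-width of the 95% confidence interval (assumed t-based)
ci_half_width <- qt(0.975, df = n - 1) * se_mean

round(se_mean, 1)        # compare with SE.mean
round(ci_half_width, 1)  # compare with CI.mean.0.95

# Resulting 95% confidence interval for the mean sales volume
round(c(lower = mean_sales - ci_half_width,
        upper = mean_sales + ci_half_width), 1)
```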
# Descriptive statistics with psych/describe
library(psych)
round(describe(mydata[ , c(6, 12)]),1)
## vars n mean sd median trimmed mad min max range skew
## Sales.Volume 1 252 1823.7 697.7 1839.5 1835.5 868.8 529 2989 2460 -0.1
## price 2 252 86.3 52.1 79.9 80.9 43.1 8 439 431 2.4
## kurtosis se
## Sales.Volume -1.1 44.0
## price 11.0 3.3
# Maximum sales volume
round(max(mydata$Sales.Volume),0)
## [1] 2989
The highest sales volume of an item is 2989 units.
# Minimum sales volume
round(min(mydata$Sales.Volume),0)
## [1] 529
The lowest sales volume of an item is 529 units.
# 20% quantile sales volume
round(quantile(mydata$Sales.Volume, 0.2),0)
## 20%
## 1096
20% of the items in the sample have a sales volume of 1096 units or less; the other 80% have a higher sales volume.
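This reading of a quantile can be checked by counting the share of observations at or below it. A minimal sketch on a hypothetical vector of ten sales volumes (the values are illustrative, not taken from the dataset):

```r
# Hypothetical sales volumes
sales <- c(529, 800, 1096, 1300, 1500, 1840, 2100, 2400, 2700, 2989)

# 20% quantile (R's default interpolation between neighbouring values)
q20 <- unname(quantile(sales, 0.2))

# Share of items with a sales volume at or below the 20% quantile
share_at_or_below <- mean(sales <= q20)
share_at_or_below
```

In small samples or with tied values the share can deviate somewhat from exactly 20%, because the quantile is interpolated between observed values.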
# Median price
round(median(mydata$price),1)
## [1] 79.9
The median shows that half of the items cost $79.90 or less and the other half cost $79.90 or more.
# Mean price
round(mean(mydata$price),1)
## [1] 86.3
The arithmetic mean is higher than the median: the average price is $86.30. This indicates that there might be a few outlier items that are priced very high and pull the mean upwards.
# Trimmed mean price
round(mean(mydata$price,trim = 0.1),1)
## [1] 80.9
When the lowest 10% and the highest 10% of prices are trimmed, the mean decreases to $80.90 and therefore moves closer to the median.
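To make clear what `trim = 0.1` does, the sketch below compares the built-in trimmed mean with a manual version that drops the lowest and highest 10% of the sorted values. The vector `x` is hypothetical and contains one deliberately extreme value.

```r
# Hypothetical prices with one extreme value
x <- c(10, 20, 30, 40, 50, 60, 70, 80, 90, 1000)

plain_mean   <- mean(x)              # pulled upwards by the extreme value
trimmed_mean <- mean(x, trim = 0.1)  # drops 10% of observations at each end

# Manual equivalent: with 10 values, 10% is one observation per end
k           <- floor(length(x) * 0.1)
manual_mean <- mean(sort(x)[(k + 1):(length(x) - k)])

c(plain = plain_mean, trimmed = trimmed_mean,
  manual = manual_mean, median = median(x))
```

In this toy example the trimmed mean coincides with the median, while the plain mean is much larger. This is the same pattern as in the price data, only more pronounced.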
# Descriptive statistics by groups with psych/describeBy
library(psych)
describeBy(x = mydata$price,
           group = mydata$Promotion)
##
## Descriptive statistics by group
## group: No
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 132 80.65 44.93 69.9 76.98 29.65 7.99 349 341.01 1.84 8.13 3.91
## ------------------------------------------------------------
## group: Yes
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 120 92.41 58.54 89.9 85.69 51.22 12.99 439 426.01 2.44 10.39
## se
## X1 5.34
These descriptive statistics compare the prices of items with and without a promotion. In this sample, promotional items are on average more expensive (mean = USD 92.41, median = USD 89.90) than items without a promotion (mean = USD 80.65, median = USD 69.90).
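If only the group means and medians are needed, base R's `tapply()` gives them without an extra package. This is a sketch on a hypothetical toy data frame; the prices are illustrative and not taken from the dataset.

```r
# Hypothetical prices for items with and without a promotion
toy <- data.frame(
  price     = c(50, 70, 90, 80, 100, 150),
  Promotion = factor(c("No", "No", "No", "Yes", "Yes", "Yes"))
)

# Mean and median price within each promotion group
group_means   <- tapply(toy$price, toy$Promotion, mean)
group_medians <- tapply(toy$price, toy$Promotion, median)

rbind(mean = group_means, median = group_medians)
```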
boxplot(mydata$price)
The boxplot shows that approximately half of the items are priced between about $50 and $110, which matches the first and third quartiles in the summary above (49.90 and 109.00).
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:pastecs':
##
## first, last
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
mydata <- mydata %>%
  filter(price <= 200)
boxplot(mydata$price)
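This time I have removed items with a price of more than $200. For illustrative purposes, this removes the most extreme outliers. The boxplot is now zoomed in: the tick interval on the price axis is 50 instead of 100. Note that mydata is overwritten here, so the plots that follow use the filtered data.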
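The $200 cut-off is a manual choice. As a possible alternative, and not the method used in this report, the points that a boxplot draws as outliers (more than 1.5 times the interquartile range beyond the hinges) can be extracted with `boxplot.stats()`. The sketch uses a hypothetical price vector and stores the result under a new name so that the full sample stays available.

```r
# Hypothetical prices with two very expensive items
price <- c(20, 50, 60, 80, 90, 110, 130, 349, 439)

# Values a boxplot would draw as individual outlier points
flagged <- boxplot.stats(price)$out
flagged

# Keep the original vector and store the reduced sample separately
price_no_outliers <- price[!(price %in% flagged)]
price_no_outliers
```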
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
ggplot(mydata, aes(x = price, fill = section)) +
  geom_histogram(position = position_dodge(2), binwidth = 10, colour = "grey") +
  ylab("Frequency") +
  labs(fill = "Section")
The histogram shows how the price of items differs between men's and women's clothing. The x-axis shows the price and the y-axis shows the number of items priced within each 10 USD range. The sample includes far more men's items than women's items. In this sample, women's clothing tends to be priced lower than men's clothing: the most frequent price range is 90-100 USD for men's clothing and 50-60 USD for women's clothing.
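The most frequent price range per section can also be read from a frequency table instead of from the plot. The sketch below bins a hypothetical set of prices into 10 USD intervals with `cut()`; the data and the names `toy`, `breaks` and `counts` are illustrative. The bin edges here are multiples of 10 by construction, whereas the edges ggplot2 chooses for a given `binwidth` may be placed differently unless they are set explicitly.

```r
# Hypothetical prices by section
toy <- data.frame(
  price   = c(55, 52, 95, 99, 91, 58, 120),
  section = c("WOMAN", "WOMAN", "MAN", "MAN", "MAN", "MAN", "MAN")
)

# 10 USD bins: [0,10), [10,20), ..., [190,200)
breaks        <- seq(0, 200, by = 10)
toy$price_bin <- cut(toy$price, breaks = breaks, right = FALSE)

# Number of items per price bin and section, showing only non-empty bins
counts <- table(toy$price_bin, toy$section)
counts[rowSums(counts) > 0, ]
```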
ggplot(mydata, aes(y = price, x = section)) + geom_boxplot()
These two boxplots compare the prices of women's and men's clothing. This result also suggests that men's clothing is priced higher in this sample.