New Chic Data Analysis

Problem Statement

The inference of this project is to check which factors are affecting the dependent variable Likes_count. As our dependent variable is count variable. We thought to prepare the Poisson Regression Model.

1) Mounika Balreddyguda

Loading datasets and packages

Loading required packages

library(shiny)
library(shinythemes)
library(kernlab)
library(AER)
library(dplyr)
library(broom)
library(maxLik)
library(tidyr)
library(tidyverse)
library(skimr)
library(factoextra)
library(ggpubr)
library(plotly)
library(cluster)
library(DT)
library(ggforce)
library(gridExtra)
library(corrplot)
library(MASS)

Loading the datasets

accessories <- read_csv("~/Downloads/accessories.csv", na = c("", "NA"))

Transforming data

Creating discount price variable

By Understanding from the dataset I see that we have only discount % variable. However, we thought of creating a new variable called discount_price to get the exact value.

accessory <- accessories %>%
  mutate(discount_price= (raw_price*(100-discount))/100)

Checking datatypes of the dataset and converting the datatypes of variables as needed

# Check the data types of each variable
str(accessory)
## tibble [6,358 × 23] (S3: tbl_df/tbl/data.frame)
##  $ category             : chr [1:6358] "accessories" "accessories" "accessories" "accessories" ...
##  $ subcategory          : chr [1:6358] "Chapeaux & Bonnets" "Baseball Caps" "Flat Caps" "Cache-oreilles & Masques" ...
##  $ name                 : chr [1:6358] "Masques d'isolation de protection du visage Masques anti-buée anti-éclaboussures" "Chapeau de parasol de casquette de baseball en denim lavé à bordure personnalisée en plein air" "Casquette Gavroche Cabbie Lvy Flat Hat Vintage Painter Beret Hats" "Masque à fleurs en coton multicolore Masque vintage" ...
##  $ current_price        : num [1:6358] 9.99 9.99 17.49 5.99 13.99 ...
##  $ raw_price            : num [1:6358] 17 19 37.3 16 27 ...
##  $ currency             : chr [1:6358] "USD" "USD" "USD" "USD" ...
##  $ discount             : num [1:6358] 41 47 53 63 48 47 50 46 52 50 ...
##  $ likes_count          : num [1:6358] 373 73 140 686 35 91 147 46 121 71 ...
##  $ is_new               : logi [1:6358] FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ brand                : chr [1:6358] NA NA NA NA ...
##  $ brand_url            : chr [1:6358] NA NA NA NA ...
##  $ codCountry           : chr [1:6358] "ID,MY,PH,SG,TH,VN" "ID,MY,PH,SG,TH,VN" "ID,MY,PH,SG,TH,VN" "ID,MY,PH,SG,TH,VN" ...
##  $ variation_0_color    : chr [1:6358] "Gray" "Navy" "Black" "#01" ...
##  $ variation_1_color    : chr [1:6358] "Blue" "Black" "camel" "#02" ...
##  $ variation_0_thumbnail: chr [1:6358] "https://imgaz1.chiccdn.com/thumb/list_grid/oaupload/newchic/images/09/FD/e8204e98-8e30-41ca-a599-c8ed90469e08.jpg" "https://imgaz1.chiccdn.com/thumb/list_grid/oaupload/newchic/images/45/ED/2e7729ff-3dde-4f1c-9ece-d6c2a5f24302.jpg" "https://imgaz1.chiccdn.com/thumb/list_grid/oaupload/newchic/images/A4/2F/42467cfd-63b4-4f6e-901c-90dc34cd52b3.jpg" "https://imgaz1.chiccdn.com/thumb/list_grid/oaupload/newchic/images/39/4F/db60e479-87f5-4a82-94cb-ec17b4e772a4.jpg" ...
##  $ variation_0_image    : chr [1:6358] "https://imgaz1.chiccdn.com/thumb/view/oaupload/newchic/images/09/FD/e8204e98-8e30-41ca-a599-c8ed90469e08.jpg" "https://imgaz1.chiccdn.com/thumb/view/oaupload/newchic/images/45/ED/2e7729ff-3dde-4f1c-9ece-d6c2a5f24302.jpg" "https://imgaz1.chiccdn.com/thumb/view/oaupload/newchic/images/A4/2F/42467cfd-63b4-4f6e-901c-90dc34cd52b3.jpg" "https://imgaz1.chiccdn.com/thumb/view/oaupload/newchic/images/39/4F/db60e479-87f5-4a82-94cb-ec17b4e772a4.jpg" ...
##  $ variation_1_thumbnail: chr [1:6358] "https://imgaz1.chiccdn.com/thumb/list_grid/oaupload/newchic/images/2E/72/1c322075-7a7a-474d-976e-6bcbc45daaec.jpg" "https://imgaz1.chiccdn.com/thumb/list_grid/oaupload/newchic/images/79/30/9b8eb997-55d4-4e67-8b89-e81178cf2e8e.jpg" "https://imgaz1.chiccdn.com/thumb/list_grid/oaupload/newchic/images/30/CE/5f9c589f-81f7-48c4-a24c-4ba41cdf3718.jpg" "https://imgaz1.chiccdn.com/thumb/list_grid/oaupload/newchic/images/D2/B6/0179fdce-340b-4da5-b89b-e49a7ec39d17.jpg" ...
##  $ variation_1_image    : chr [1:6358] "https://imgaz1.chiccdn.com/thumb/view/oaupload/newchic/images/2E/72/1c322075-7a7a-474d-976e-6bcbc45daaec.jpg" "https://imgaz1.chiccdn.com/thumb/view/oaupload/newchic/images/79/30/9b8eb997-55d4-4e67-8b89-e81178cf2e8e.jpg" "https://imgaz1.chiccdn.com/thumb/view/oaupload/newchic/images/30/CE/5f9c589f-81f7-48c4-a24c-4ba41cdf3718.jpg" "https://imgaz1.chiccdn.com/thumb/view/oaupload/newchic/images/D2/B6/0179fdce-340b-4da5-b89b-e49a7ec39d17.jpg" ...
##  $ image_url            : chr [1:6358] "https://imgaz1.chiccdn.com/thumb/view/oaupload/newchic/images/2E/72/1c322075-7a7a-474d-976e-6bcbc45daaec.jpg" "https://imgaz1.chiccdn.com/thumb/view/oaupload/newchic/images/0D/0D/a735a697-871f-4bea-b92d-19947ff8f328.jpg" "https://imgaz1.chiccdn.com/thumb/view/oaupload/newchic/images/30/CE/5f9c589f-81f7-48c4-a24c-4ba41cdf3718.jpg" "https://imgaz1.chiccdn.com/thumb/view/oaupload/newchic/images/E6/1D/834076a4-45ea-476c-8f45-bbe1b95db3d7.jpg" ...
##  $ url                  : chr [1:6358] "https://fr.newchic.com/hats-and-caps-4192/p-1671872.html" "https://fr.newchic.com/baseball-caps-12158/p-1674377.html" "https://fr.newchic.com/flat-caps-12160/p-1637790.html" "https://fr.newchic.com/earmuffsandmouth-maskandblindfold-6011/p-1654953.html" ...
##  $ id                   : num [1:6358] 1671872 1674377 1637790 1654953 1666600 ...
##  $ model                : chr [1:6358] "SKUF08305" "SKUF11351" "SKUE50978" "SKUE76490" ...
##  $ discount_price       : num [1:6358] 10.02 10.06 17.55 5.92 14.03 ...
# Convert the subcategory and currency variables to factors
accessory$subcategory <- as.factor(accessory$subcategory)
accessory$currency <- as.factor(accessory$currency)
skim(accessory)
Data summary
Name accessory
Number of rows 6358
Number of columns 23
_______________________
Column type frequency:
character 14
factor 2
logical 1
numeric 6
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
category 0 1.00 11 11 0 1 0
name 0 1.00 5 254 0 6036 0
brand 6228 0.02 2 10 0 16 0
brand_url 6235 0.02 41 49 0 14 0
codCountry 647 0.90 2 17 0 8 0
variation_0_color 464 0.93 1 52 0 305 0
variation_1_color 892 0.86 1 57 0 313 0
variation_0_thumbnail 464 0.93 57 133 0 5844 0
variation_0_image 464 0.93 52 128 0 5844 0
variation_1_thumbnail 892 0.86 73 133 0 5426 0
variation_1_image 892 0.86 68 128 0 5426 0
image_url 0 1.00 68 128 0 6298 0
url 0 1.00 46 76 0 6299 0
model 0 1.00 9 33 0 6299 0

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
subcategory 0 1 FALSE 35 Cha: 980, Bea: 722, Bas: 583, Fou: 441
currency 0 1 FALSE 1 USD: 6358

Variable type: logical

skim_variable n_missing complete_rate mean count
is_new 0 1 0.02 FAL: 6251, TRU: 107

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
current_price 0 1 12.15 7.14 0.22 8.05 11.30 14.61 92.99 ▇▁▁▁▁
raw_price 0 1 25.27 33.03 0.00 16.41 22.86 29.09 2129.93 ▇▁▁▁▁
discount 0 1 50.76 9.30 0.00 47.00 49.00 56.00 100.00 ▁▁▇▁▁
likes_count 0 1 96.50 201.80 0.00 19.00 44.00 97.00 7277.00 ▇▁▁▁▁
id 0 1 1375095.39 215220.95 39174.00 1215135.50 1356958.50 1555828.75 1723918.00 ▁▁▁▇▇
discount_price 0 1 12.10 7.18 0.00 8.00 11.27 14.54 92.59 ▇▁▁▁▁

Remove any missing values or outliers from the dataset.

# Remove any missing values or outliers from the dataset
accessory <- accessory %>% 
  drop_na() %>%  # Remove rows with missing values
  filter(likes_count < quantile(likes_count, 0.99))  # Remove the top 1% of likes_count values

Explore the data

Calculate the summary statistics and distribution of the ‘likes_count’ variable using the ‘summary’ and ‘hist’ functions.

# Check the summary statistics and distribution of the 'likes_count' variable
summary(accessory$likes_count)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.00   43.00   68.50   94.83  127.50  588.00
ggplot(accessory, aes(x = likes_count)) +
  geom_histogram(bins = 20, fill = "lightblue", color = "black") +
  xlab("Likes Count") +
  ylab("Frequency") +
  ggtitle("Distribution of Likes Count")

Scatter plots for all variables vs. ‘Likes_Count’ Variable.

# Create scatterplots for all variables vs. 'likes_count'

p1 <- ggplot(accessory, aes(x = current_price, y = likes_count)) +
  geom_point() +
  ggtitle("Scatterplot of current_price vs. likes_count")

p2 <- ggplot(accessory, aes(x = raw_price, y = likes_count)) +
  geom_point() +
  ggtitle("Scatterplot of raw_price vs. likes_count")

p3 <- ggplot(accessory, aes(x = discount, y = likes_count)) +
  geom_point() +
  ggtitle("Scatterplot of discount vs. likes_count")

p4 <- ggplot(accessory, aes(x = discount_price, y = likes_count)) +
  geom_point() +
  ggtitle("Scatterplot of discount_price vs. likes_count")

# Arrange the scatterplots in a grid
grid.arrange(p1, p2, p3, p4, ncol = 2)

Compute the correlation coefficients between all variables and ‘likes_count’ using a loop

# Compute the correlation matrix for all variables
cor_matrix <- cor(accessory[, c("current_price", "raw_price", "discount", "likes_count", "discount_price")])

# Create a correlation plot using the 'corrplot' package
corrplot(cor_matrix, method= "number",type = "upper", order = "hclust",tl.col = "black", col= colorRampPalette(c("white", "deepskyblue", "blue4"))(100))

Maximum Likehood Estimations

fit <- fitdistr(accessory$likes_count, "normal")
mean_est <- fit$estimate[1]
sd_est <- fit$estimate[2]

# print the estimated parameter values
cat("Maximum likelihood estimates:\n")
## Maximum likelihood estimates:
cat("Mean:", mean_est, "\n")
## Mean: 94.82979
cat("Standard deviation:", sd_est, "\n")
## Standard deviation: 86.25107

2) Maruthi Sai Phani Teja Chadalapaka

Poisson Model

Mean-Variance Relationship

# Calculate the mean and variance of likes_count for each value of the subcategory variable
mean_var_df <- accessory %>% group_by(subcategory) %>% summarise(mean_likes = mean(likes_count), var_likes = var(likes_count))

# Plot the mean-variance relationship
ggplot(mean_var_df, aes(x = mean_likes, y = var_likes)) +
  geom_point() +
  scale_x_continuous(name = "Mean likes count") +
  scale_y_continuous(name = "Variance of likes count") +
  ggtitle("Mean-Variance Relationship for Likes Count")

Model Preparation for poisson

Dividng the data into test and train

# Split data into training and testing sets
set.seed(123)
train_indices <- sample(nrow(accessory), 0.7*nrow(accessory)) # 70% for training
train_data <- accessory[train_indices, ]
test_data <- accessory[-train_indices, ]

Preparing the model and checking the assumptions

# Create Poisson regression model using training data
poisson_model <- glm(likes_count ~ current_price + raw_price + discount, data = train_data, family = "poisson")

poisson_aug <- augment(poisson_model)

# Linearity assumption: plot residuals vs. fitted values
ggplot(data = poisson_aug, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  xlab("Fitted values") +
  ylab("Residuals")

# Independence assumption: plot residuals vs. order of observations
ggplot(data = poisson_aug, aes(x = .resid)) +
  geom_histogram(binwidth = 0.25) +
  xlab("Residuals")

# Overdispersion assumption: calculate dispersion parameter
dispersiontest(poisson_model)
## 
##  Overdispersion test
## 
## data:  poisson_model
## z = 3.7344, p-value = 9.408e-05
## alternative hypothesis: true dispersion is greater than 1
## sample estimates:
## dispersion 
##   42.64843

The output shows the results of an overdispersion test for the Poisson regression model poisson_model. The test uses the score test statistic, which follows a chi-squared distribution under the null hypothesis that the dispersion parameter equals 1.

The test statistic is z = 3.7344, which is very large and corresponds to a p-value is 9.408e-05, indicating strong evidence against the null hypothesis of no overdispersion. The alternative hypothesis is that the true dispersion parameter is greater than 1, indicating that the Poisson assumption of equal mean and variance may not hold and that there is more variation in the data than expected by the Poisson model.

The estimated dispersion parameter is 42.64843, which is much larger than 1 and supports the conclusion that the model suffers from overdispersion. To address overdispersion, one possible approach is to fit a negative binomial regression model, which allows for greater variability than the Poisson model.

Negative Binomial Regression Model

Model 1 :

Model Equation: $likes_count ~ $current price + $raw price + $discount + $discount price

negbin_model <- glm.nb(likes_count ~ current_price + raw_price + discount +discount_price, data = train_data)

#Linearity Assumptions
# Plot fitted values vs. residuals
plot(fitted(negbin_model), residuals(negbin_model), xlab = "Fitted values", ylab = "Residuals", main = "Residuals vs. Fitted Values")
# Add lowess smoother
lines(lowess(fitted(negbin_model), residuals(negbin_model)), col = "red")

# Check for any patterns or nonlinear relationships

# Independence Assumptions
# Plot residuals vs. order of observations
plot(resid(negbin_model) ~ seq_along(resid(negbin_model)), xlab = "Observation order", ylab = "Residuals", main = "Residuals vs. Observation Order")

# Overdispersion Assumptions

# Calculate Pearson residuals
pearson_resid <- residuals(negbin_model, type = "pearson")

# Calculate deviance residuals
deviance_resid <- residuals(negbin_model, type = "deviance")

# Plot Pearson residuals vs. fitted values
plot(fitted(negbin_model), sqrt(abs(pearson_resid)), xlab = "Fitted values", ylab = "sqrt(abs(Pearson residuals))", main = "Pearson Residuals vs. Fitted Values")

# Add lowess smoother
lines(lowess(fitted(negbin_model), sqrt(abs(pearson_resid))), col = "red")

# Check for any patterns or trends

# Plot deviance residuals vs. fitted values
plot(fitted(negbin_model), sqrt(abs(deviance_resid)), xlab = "Fitted values", ylab = "sqrt(abs(Deviance residuals))", main = "Deviance Residuals vs. Fitted Values")

# Add lowess smoother
lines(lowess(fitted(negbin_model), sqrt(abs(deviance_resid))), col = "red")

Model 2 :

Model Equation: $likes_count ~ $raw price + $discount + $discount price

negbin_model_2 <- glm.nb(likes_count ~ raw_price + discount +discount_price, data = train_data)

#Linearity Assumptions
# Plot fitted values vs. residuals
plot(fitted(negbin_model_2), residuals(negbin_model_2), xlab = "Fitted values", ylab = "Residuals", main = "Residuals vs. Fitted Values")
# Add lowess smoother
lines(lowess(fitted(negbin_model_2), residuals(negbin_model_2)), col = "red")

# Check for any patterns or nonlinear relationships

# Independence Assumptions
# Plot residuals vs. order of observations
plot(resid(negbin_model_2) ~ seq_along(resid(negbin_model_2)), xlab = "Observation order", ylab = "Residuals", main = "Residuals vs. Observation Order")

# Overdispersion Assumptions

# Calculate Pearson residuals
pearson_resid <- residuals(negbin_model_2, type = "pearson")

# Calculate deviance residuals
deviance_resid <- residuals(negbin_model_2, type = "deviance")

# Plot Pearson residuals vs. fitted values
plot(fitted(negbin_model_2), sqrt(abs(pearson_resid)), xlab = "Fitted values", ylab = "sqrt(abs(Pearson residuals))", main = "Pearson Residuals vs. Fitted Values")

# Add lowess smoother
lines(lowess(fitted(negbin_model_2), sqrt(abs(pearson_resid))), col = "red")

# Check for any patterns or trends

# Plot deviance residuals vs. fitted values
plot(fitted(negbin_model_2), sqrt(abs(deviance_resid)), xlab = "Fitted values", ylab = "sqrt(abs(Deviance residuals))", main = "Deviance Residuals vs. Fitted Values")

# Add lowess smoother
lines(lowess(fitted(negbin_model_2), sqrt(abs(deviance_resid))), col = "red")

Comparing two model

Comparing the two models using AIC and BIC

    # Calculate AIC and BIC for both models
    aic_negbin <- AIC(negbin_model)
    bic_negbin <- BIC(negbin_model)
    aic_negbin_2 <- AIC(negbin_model_2)
    bic_negbin_2 <- BIC(negbin_model_2)
    
    # Compare AIC and BIC values to determine which model has a better fit
    if (aic_negbin < aic_negbin_2 & bic_negbin < bic_negbin_2) {
      result <- "Negbin Binomial model 1 is the better fit."
      eqn <- "likes_count ~ current_price + raw_price + discount +discount_price"
    } else if (aic_negbin < aic_negbin_2 & bic_negbin < bic_negbin_2) {
      result <- "Negative Binomial model 2 is the better fit."
      eqn <- "likes_count ~ raw_price + discount +discount_price"
    } else {
      result <- "Neither model has a significantly better fit."
      eqn <- " No Equation"
    }
    
    # Output the results and model equation
    cat("AIC comparison:\n")
## AIC comparison:
    print(cbind(negbin_model = aic_negbin, negbin_model_2 = aic_negbin_2))
##      negbin_model negbin_model_2
## [1,]     693.9664       691.9749
    cat("\nBIC comparison:\n")
## 
## BIC comparison:
    print(cbind(negbin_model = bic_negbin, negbin_model_2 = bic_negbin_2))
##      negbin_model negbin_model_2
## [1,]     707.0128       702.8468
    cat("\n", result, "\n")
## 
##  Neither model has a significantly better fit.
    cat("Model equation:\n", eqn)
## Model equation:
##   No Equation
    # Calculate AIC and BIC for both models
    aic_poisson <- AIC(poisson_model)
    bic_poisson <- BIC(poisson_model)
    aic_negbin <- AIC(negbin_model)
    bic_negbin <- BIC(negbin_model)
    
    # Compare AIC and BIC values to determine which model has a better fit
    if (aic_poisson < aic_negbin & bic_poisson < bic_negbin) {
      result <- "Poisson model is the better fit."
      eqn <- "likes_count ~ current_price + raw_price + discount"
    } else if (aic_negbin < aic_poisson & bic_negbin < bic_poisson) {
      result <- "Negative Binomial model is the better fit."
      eqn <- "likes_count ~ current_price + raw_price + discount + discount_price"
    } else {
      result <- "Neither model has a significantly better fit."
      eqn <- "likes_count ~ current_price + raw_price + discount,"
    }
    
    # Output the results and model equation
    cat("AIC comparison:\n")
## AIC comparison:
    print(cbind(poisson_model = aic_poisson, negbin_model = aic_negbin))
##      poisson_model negbin_model
## [1,]      2924.329     693.9664
    cat("\nBIC comparison:\n")
## 
## BIC comparison:
    print(cbind(poisson_model = bic_poisson, negbin_model = bic_negbin))
##      poisson_model negbin_model
## [1,]      2933.027     707.0128
    cat("\n", result, "\n")
## 
##  Negative Binomial model is the better fit.
    cat("Model equation:\n", eqn)
## Model equation:
##  likes_count ~ current_price + raw_price + discount + discount_price

The table provided shows the Bayesian Information Criterion (BIC) and Akaike Information Criterion (AIC) for two different models, a Poisson regression model and a negative binomial regression model.

The BIC and AIC are both measures of the goodness of fit of a statistical model. They take into account both the likelihood of the data given the model and the complexity of the model, which is typically measured by the number of parameters.

In this case, the Poisson model has 4 degrees of freedom (df) and the negative binomial model has 10 df. The BIC for the Poisson model is 2933.0270 and for the negative binomial model is 707.0128. The lower the BIC value, the better the model fits the data. Therefore, in this case, the Poisson model has a worse fit than the negative binomial model.

Similarly, the AIC for the Poisson model is 2924.3294 and for the negative binomial model is 693.9664. Again, the lower the AIC value, the better the model fits the data. Therefore, in this case, the negative binomial model has a better fit than the Poisson model.

Overall, based on the BIC and AIC values, the negative binomial model seems to be a better fit for the data than the Poisson model.