New Chic Data Analysis
Problem Statement
The inference of this project is to check which factors are affecting the dependent variable Likes_count. As our dependent variable is count variable. We thought to prepare the Poisson Regression Model.
1) Mounika Balreddyguda
Loading datasets and packages
Loading required packages
library(shiny)
library(shinythemes)
library(kernlab)
library(AER)
library(dplyr)
library(broom)
library(maxLik)
library(tidyr)
library(tidyverse)
library(skimr)
library(factoextra)
library(ggpubr)
library(plotly)
library(cluster)
library(DT)
library(ggforce)
library(gridExtra)
library(corrplot)
library(MASS)Loading the datasets
accessories <- read_csv("~/Downloads/accessories.csv", na = c("", "NA"))Transforming data
Creating discount price variable
By Understanding from the dataset I see that we have only discount % variable. However, we thought of creating a new variable called discount_price to get the exact value.
accessory <- accessories %>%
mutate(discount_price= (raw_price*(100-discount))/100)Checking datatypes of the dataset and converting the datatypes of variables as needed
# Check the data types of each variable
str(accessory)## tibble [6,358 × 23] (S3: tbl_df/tbl/data.frame)
## $ category : chr [1:6358] "accessories" "accessories" "accessories" "accessories" ...
## $ subcategory : chr [1:6358] "Chapeaux & Bonnets" "Baseball Caps" "Flat Caps" "Cache-oreilles & Masques" ...
## $ name : chr [1:6358] "Masques d'isolation de protection du visage Masques anti-buée anti-éclaboussures" "Chapeau de parasol de casquette de baseball en denim lavé à bordure personnalisée en plein air" "Casquette Gavroche Cabbie Lvy Flat Hat Vintage Painter Beret Hats" "Masque à fleurs en coton multicolore Masque vintage" ...
## $ current_price : num [1:6358] 9.99 9.99 17.49 5.99 13.99 ...
## $ raw_price : num [1:6358] 17 19 37.3 16 27 ...
## $ currency : chr [1:6358] "USD" "USD" "USD" "USD" ...
## $ discount : num [1:6358] 41 47 53 63 48 47 50 46 52 50 ...
## $ likes_count : num [1:6358] 373 73 140 686 35 91 147 46 121 71 ...
## $ is_new : logi [1:6358] FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ brand : chr [1:6358] NA NA NA NA ...
## $ brand_url : chr [1:6358] NA NA NA NA ...
## $ codCountry : chr [1:6358] "ID,MY,PH,SG,TH,VN" "ID,MY,PH,SG,TH,VN" "ID,MY,PH,SG,TH,VN" "ID,MY,PH,SG,TH,VN" ...
## $ variation_0_color : chr [1:6358] "Gray" "Navy" "Black" "#01" ...
## $ variation_1_color : chr [1:6358] "Blue" "Black" "camel" "#02" ...
## $ variation_0_thumbnail: chr [1:6358] "https://imgaz1.chiccdn.com/thumb/list_grid/oaupload/newchic/images/09/FD/e8204e98-8e30-41ca-a599-c8ed90469e08.jpg" "https://imgaz1.chiccdn.com/thumb/list_grid/oaupload/newchic/images/45/ED/2e7729ff-3dde-4f1c-9ece-d6c2a5f24302.jpg" "https://imgaz1.chiccdn.com/thumb/list_grid/oaupload/newchic/images/A4/2F/42467cfd-63b4-4f6e-901c-90dc34cd52b3.jpg" "https://imgaz1.chiccdn.com/thumb/list_grid/oaupload/newchic/images/39/4F/db60e479-87f5-4a82-94cb-ec17b4e772a4.jpg" ...
## $ variation_0_image : chr [1:6358] "https://imgaz1.chiccdn.com/thumb/view/oaupload/newchic/images/09/FD/e8204e98-8e30-41ca-a599-c8ed90469e08.jpg" "https://imgaz1.chiccdn.com/thumb/view/oaupload/newchic/images/45/ED/2e7729ff-3dde-4f1c-9ece-d6c2a5f24302.jpg" "https://imgaz1.chiccdn.com/thumb/view/oaupload/newchic/images/A4/2F/42467cfd-63b4-4f6e-901c-90dc34cd52b3.jpg" "https://imgaz1.chiccdn.com/thumb/view/oaupload/newchic/images/39/4F/db60e479-87f5-4a82-94cb-ec17b4e772a4.jpg" ...
## $ variation_1_thumbnail: chr [1:6358] "https://imgaz1.chiccdn.com/thumb/list_grid/oaupload/newchic/images/2E/72/1c322075-7a7a-474d-976e-6bcbc45daaec.jpg" "https://imgaz1.chiccdn.com/thumb/list_grid/oaupload/newchic/images/79/30/9b8eb997-55d4-4e67-8b89-e81178cf2e8e.jpg" "https://imgaz1.chiccdn.com/thumb/list_grid/oaupload/newchic/images/30/CE/5f9c589f-81f7-48c4-a24c-4ba41cdf3718.jpg" "https://imgaz1.chiccdn.com/thumb/list_grid/oaupload/newchic/images/D2/B6/0179fdce-340b-4da5-b89b-e49a7ec39d17.jpg" ...
## $ variation_1_image : chr [1:6358] "https://imgaz1.chiccdn.com/thumb/view/oaupload/newchic/images/2E/72/1c322075-7a7a-474d-976e-6bcbc45daaec.jpg" "https://imgaz1.chiccdn.com/thumb/view/oaupload/newchic/images/79/30/9b8eb997-55d4-4e67-8b89-e81178cf2e8e.jpg" "https://imgaz1.chiccdn.com/thumb/view/oaupload/newchic/images/30/CE/5f9c589f-81f7-48c4-a24c-4ba41cdf3718.jpg" "https://imgaz1.chiccdn.com/thumb/view/oaupload/newchic/images/D2/B6/0179fdce-340b-4da5-b89b-e49a7ec39d17.jpg" ...
## $ image_url : chr [1:6358] "https://imgaz1.chiccdn.com/thumb/view/oaupload/newchic/images/2E/72/1c322075-7a7a-474d-976e-6bcbc45daaec.jpg" "https://imgaz1.chiccdn.com/thumb/view/oaupload/newchic/images/0D/0D/a735a697-871f-4bea-b92d-19947ff8f328.jpg" "https://imgaz1.chiccdn.com/thumb/view/oaupload/newchic/images/30/CE/5f9c589f-81f7-48c4-a24c-4ba41cdf3718.jpg" "https://imgaz1.chiccdn.com/thumb/view/oaupload/newchic/images/E6/1D/834076a4-45ea-476c-8f45-bbe1b95db3d7.jpg" ...
## $ url : chr [1:6358] "https://fr.newchic.com/hats-and-caps-4192/p-1671872.html" "https://fr.newchic.com/baseball-caps-12158/p-1674377.html" "https://fr.newchic.com/flat-caps-12160/p-1637790.html" "https://fr.newchic.com/earmuffsandmouth-maskandblindfold-6011/p-1654953.html" ...
## $ id : num [1:6358] 1671872 1674377 1637790 1654953 1666600 ...
## $ model : chr [1:6358] "SKUF08305" "SKUF11351" "SKUE50978" "SKUE76490" ...
## $ discount_price : num [1:6358] 10.02 10.06 17.55 5.92 14.03 ...
# Convert the subcategory and currency variables to factors
accessory$subcategory <- as.factor(accessory$subcategory)
accessory$currency <- as.factor(accessory$currency)skim(accessory)| Name | accessory |
| Number of rows | 6358 |
| Number of columns | 23 |
| _______________________ | |
| Column type frequency: | |
| character | 14 |
| factor | 2 |
| logical | 1 |
| numeric | 6 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| category | 0 | 1.00 | 11 | 11 | 0 | 1 | 0 |
| name | 0 | 1.00 | 5 | 254 | 0 | 6036 | 0 |
| brand | 6228 | 0.02 | 2 | 10 | 0 | 16 | 0 |
| brand_url | 6235 | 0.02 | 41 | 49 | 0 | 14 | 0 |
| codCountry | 647 | 0.90 | 2 | 17 | 0 | 8 | 0 |
| variation_0_color | 464 | 0.93 | 1 | 52 | 0 | 305 | 0 |
| variation_1_color | 892 | 0.86 | 1 | 57 | 0 | 313 | 0 |
| variation_0_thumbnail | 464 | 0.93 | 57 | 133 | 0 | 5844 | 0 |
| variation_0_image | 464 | 0.93 | 52 | 128 | 0 | 5844 | 0 |
| variation_1_thumbnail | 892 | 0.86 | 73 | 133 | 0 | 5426 | 0 |
| variation_1_image | 892 | 0.86 | 68 | 128 | 0 | 5426 | 0 |
| image_url | 0 | 1.00 | 68 | 128 | 0 | 6298 | 0 |
| url | 0 | 1.00 | 46 | 76 | 0 | 6299 | 0 |
| model | 0 | 1.00 | 9 | 33 | 0 | 6299 | 0 |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| subcategory | 0 | 1 | FALSE | 35 | Cha: 980, Bea: 722, Bas: 583, Fou: 441 |
| currency | 0 | 1 | FALSE | 1 | USD: 6358 |
Variable type: logical
| skim_variable | n_missing | complete_rate | mean | count |
|---|---|---|---|---|
| is_new | 0 | 1 | 0.02 | FAL: 6251, TRU: 107 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| current_price | 0 | 1 | 12.15 | 7.14 | 0.22 | 8.05 | 11.30 | 14.61 | 92.99 | ▇▁▁▁▁ |
| raw_price | 0 | 1 | 25.27 | 33.03 | 0.00 | 16.41 | 22.86 | 29.09 | 2129.93 | ▇▁▁▁▁ |
| discount | 0 | 1 | 50.76 | 9.30 | 0.00 | 47.00 | 49.00 | 56.00 | 100.00 | ▁▁▇▁▁ |
| likes_count | 0 | 1 | 96.50 | 201.80 | 0.00 | 19.00 | 44.00 | 97.00 | 7277.00 | ▇▁▁▁▁ |
| id | 0 | 1 | 1375095.39 | 215220.95 | 39174.00 | 1215135.50 | 1356958.50 | 1555828.75 | 1723918.00 | ▁▁▁▇▇ |
| discount_price | 0 | 1 | 12.10 | 7.18 | 0.00 | 8.00 | 11.27 | 14.54 | 92.59 | ▇▁▁▁▁ |
Remove any missing values or outliers from the dataset.
# Remove any missing values or outliers from the dataset
accessory <- accessory %>%
drop_na() %>% # Remove rows with missing values
filter(likes_count < quantile(likes_count, 0.99)) # Remove the top 1% of likes_count valuesExplore the data
Calculate the summary statistics and distribution of the ‘likes_count’ variable using the ‘summary’ and ‘hist’ functions.
# Check the summary statistics and distribution of the 'likes_count' variable
summary(accessory$likes_count)## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.00 43.00 68.50 94.83 127.50 588.00
ggplot(accessory, aes(x = likes_count)) +
geom_histogram(bins = 20, fill = "lightblue", color = "black") +
xlab("Likes Count") +
ylab("Frequency") +
ggtitle("Distribution of Likes Count")Scatter plots for all variables vs. ‘Likes_Count’ Variable.
# Create scatterplots for all variables vs. 'likes_count'
p1 <- ggplot(accessory, aes(x = current_price, y = likes_count)) +
geom_point() +
ggtitle("Scatterplot of current_price vs. likes_count")
p2 <- ggplot(accessory, aes(x = raw_price, y = likes_count)) +
geom_point() +
ggtitle("Scatterplot of raw_price vs. likes_count")
p3 <- ggplot(accessory, aes(x = discount, y = likes_count)) +
geom_point() +
ggtitle("Scatterplot of discount vs. likes_count")
p4 <- ggplot(accessory, aes(x = discount_price, y = likes_count)) +
geom_point() +
ggtitle("Scatterplot of discount_price vs. likes_count")
# Arrange the scatterplots in a grid
grid.arrange(p1, p2, p3, p4, ncol = 2)Compute the correlation coefficients between all variables and ‘likes_count’ using a loop
# Compute the correlation matrix for all variables
cor_matrix <- cor(accessory[, c("current_price", "raw_price", "discount", "likes_count", "discount_price")])
# Create a correlation plot using the 'corrplot' package
corrplot(cor_matrix, method= "number",type = "upper", order = "hclust",tl.col = "black", col= colorRampPalette(c("white", "deepskyblue", "blue4"))(100))Maximum Likehood Estimations
fit <- fitdistr(accessory$likes_count, "normal")
mean_est <- fit$estimate[1]
sd_est <- fit$estimate[2]
# print the estimated parameter values
cat("Maximum likelihood estimates:\n")## Maximum likelihood estimates:
cat("Mean:", mean_est, "\n")## Mean: 94.82979
cat("Standard deviation:", sd_est, "\n")## Standard deviation: 86.25107
2) Maruthi Sai Phani Teja Chadalapaka
Poisson Model
Mean-Variance Relationship
# Calculate the mean and variance of likes_count for each value of the subcategory variable
mean_var_df <- accessory %>% group_by(subcategory) %>% summarise(mean_likes = mean(likes_count), var_likes = var(likes_count))
# Plot the mean-variance relationship
ggplot(mean_var_df, aes(x = mean_likes, y = var_likes)) +
geom_point() +
scale_x_continuous(name = "Mean likes count") +
scale_y_continuous(name = "Variance of likes count") +
ggtitle("Mean-Variance Relationship for Likes Count")Model Preparation for poisson
Dividng the data into test and train
# Split data into training and testing sets
set.seed(123)
train_indices <- sample(nrow(accessory), 0.7*nrow(accessory)) # 70% for training
train_data <- accessory[train_indices, ]
test_data <- accessory[-train_indices, ]Preparing the model and checking the assumptions
# Create Poisson regression model using training data
poisson_model <- glm(likes_count ~ current_price + raw_price + discount, data = train_data, family = "poisson")
poisson_aug <- augment(poisson_model)
# Linearity assumption: plot residuals vs. fitted values
ggplot(data = poisson_aug, aes(x = .fitted, y = .resid)) +
geom_point() +
geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
xlab("Fitted values") +
ylab("Residuals")# Independence assumption: plot residuals vs. order of observations
ggplot(data = poisson_aug, aes(x = .resid)) +
geom_histogram(binwidth = 0.25) +
xlab("Residuals")# Overdispersion assumption: calculate dispersion parameter
dispersiontest(poisson_model)##
## Overdispersion test
##
## data: poisson_model
## z = 3.7344, p-value = 9.408e-05
## alternative hypothesis: true dispersion is greater than 1
## sample estimates:
## dispersion
## 42.64843
The output shows the results of an overdispersion test for the
Poisson regression model poisson_model.
The test uses the score test statistic, which follows a chi-squared
distribution under the null hypothesis that the dispersion parameter
equals 1.
The test statistic is z = 3.7344, which is very large and corresponds to a p-value is 9.408e-05, indicating strong evidence against the null hypothesis of no overdispersion. The alternative hypothesis is that the true dispersion parameter is greater than 1, indicating that the Poisson assumption of equal mean and variance may not hold and that there is more variation in the data than expected by the Poisson model.
The estimated dispersion parameter is 42.64843, which is much larger than 1 and supports the conclusion that the model suffers from overdispersion. To address overdispersion, one possible approach is to fit a negative binomial regression model, which allows for greater variability than the Poisson model.
Negative Binomial Regression Model
Model 1 :
Model Equation: $likes_count ~ $current price + $raw price + $discount + $discount price
negbin_model <- glm.nb(likes_count ~ current_price + raw_price + discount +discount_price, data = train_data)
#Linearity Assumptions
# Plot fitted values vs. residuals
plot(fitted(negbin_model), residuals(negbin_model), xlab = "Fitted values", ylab = "Residuals", main = "Residuals vs. Fitted Values")
# Add lowess smoother
lines(lowess(fitted(negbin_model), residuals(negbin_model)), col = "red")# Check for any patterns or nonlinear relationships
# Independence Assumptions
# Plot residuals vs. order of observations
plot(resid(negbin_model) ~ seq_along(resid(negbin_model)), xlab = "Observation order", ylab = "Residuals", main = "Residuals vs. Observation Order")# Overdispersion Assumptions
# Calculate Pearson residuals
pearson_resid <- residuals(negbin_model, type = "pearson")
# Calculate deviance residuals
deviance_resid <- residuals(negbin_model, type = "deviance")
# Plot Pearson residuals vs. fitted values
plot(fitted(negbin_model), sqrt(abs(pearson_resid)), xlab = "Fitted values", ylab = "sqrt(abs(Pearson residuals))", main = "Pearson Residuals vs. Fitted Values")
# Add lowess smoother
lines(lowess(fitted(negbin_model), sqrt(abs(pearson_resid))), col = "red")# Check for any patterns or trends
# Plot deviance residuals vs. fitted values
plot(fitted(negbin_model), sqrt(abs(deviance_resid)), xlab = "Fitted values", ylab = "sqrt(abs(Deviance residuals))", main = "Deviance Residuals vs. Fitted Values")
# Add lowess smoother
lines(lowess(fitted(negbin_model), sqrt(abs(deviance_resid))), col = "red")Model 2 :
Model Equation: $likes_count ~ $raw price + $discount + $discount price
negbin_model_2 <- glm.nb(likes_count ~ raw_price + discount +discount_price, data = train_data)
#Linearity Assumptions
# Plot fitted values vs. residuals
plot(fitted(negbin_model_2), residuals(negbin_model_2), xlab = "Fitted values", ylab = "Residuals", main = "Residuals vs. Fitted Values")
# Add lowess smoother
lines(lowess(fitted(negbin_model_2), residuals(negbin_model_2)), col = "red")# Check for any patterns or nonlinear relationships
# Independence Assumptions
# Plot residuals vs. order of observations
plot(resid(negbin_model_2) ~ seq_along(resid(negbin_model_2)), xlab = "Observation order", ylab = "Residuals", main = "Residuals vs. Observation Order")# Overdispersion Assumptions
# Calculate Pearson residuals
pearson_resid <- residuals(negbin_model_2, type = "pearson")
# Calculate deviance residuals
deviance_resid <- residuals(negbin_model_2, type = "deviance")
# Plot Pearson residuals vs. fitted values
plot(fitted(negbin_model_2), sqrt(abs(pearson_resid)), xlab = "Fitted values", ylab = "sqrt(abs(Pearson residuals))", main = "Pearson Residuals vs. Fitted Values")
# Add lowess smoother
lines(lowess(fitted(negbin_model_2), sqrt(abs(pearson_resid))), col = "red")# Check for any patterns or trends
# Plot deviance residuals vs. fitted values
plot(fitted(negbin_model_2), sqrt(abs(deviance_resid)), xlab = "Fitted values", ylab = "sqrt(abs(Deviance residuals))", main = "Deviance Residuals vs. Fitted Values")
# Add lowess smoother
lines(lowess(fitted(negbin_model_2), sqrt(abs(deviance_resid))), col = "red")Comparing two model
Comparing the two models using AIC and BIC
# Calculate AIC and BIC for both models
aic_negbin <- AIC(negbin_model)
bic_negbin <- BIC(negbin_model)
aic_negbin_2 <- AIC(negbin_model_2)
bic_negbin_2 <- BIC(negbin_model_2)
# Compare AIC and BIC values to determine which model has a better fit
if (aic_negbin < aic_negbin_2 & bic_negbin < bic_negbin_2) {
result <- "Negbin Binomial model 1 is the better fit."
eqn <- "likes_count ~ current_price + raw_price + discount +discount_price"
} else if (aic_negbin < aic_negbin_2 & bic_negbin < bic_negbin_2) {
result <- "Negative Binomial model 2 is the better fit."
eqn <- "likes_count ~ raw_price + discount +discount_price"
} else {
result <- "Neither model has a significantly better fit."
eqn <- " No Equation"
}
# Output the results and model equation
cat("AIC comparison:\n")## AIC comparison:
print(cbind(negbin_model = aic_negbin, negbin_model_2 = aic_negbin_2))## negbin_model negbin_model_2
## [1,] 693.9664 691.9749
cat("\nBIC comparison:\n")##
## BIC comparison:
print(cbind(negbin_model = bic_negbin, negbin_model_2 = bic_negbin_2))## negbin_model negbin_model_2
## [1,] 707.0128 702.8468
cat("\n", result, "\n")##
## Neither model has a significantly better fit.
cat("Model equation:\n", eqn)## Model equation:
## No Equation
# Calculate AIC and BIC for both models
aic_poisson <- AIC(poisson_model)
bic_poisson <- BIC(poisson_model)
aic_negbin <- AIC(negbin_model)
bic_negbin <- BIC(negbin_model)
# Compare AIC and BIC values to determine which model has a better fit
if (aic_poisson < aic_negbin & bic_poisson < bic_negbin) {
result <- "Poisson model is the better fit."
eqn <- "likes_count ~ current_price + raw_price + discount"
} else if (aic_negbin < aic_poisson & bic_negbin < bic_poisson) {
result <- "Negative Binomial model is the better fit."
eqn <- "likes_count ~ current_price + raw_price + discount + discount_price"
} else {
result <- "Neither model has a significantly better fit."
eqn <- "likes_count ~ current_price + raw_price + discount,"
}
# Output the results and model equation
cat("AIC comparison:\n")## AIC comparison:
print(cbind(poisson_model = aic_poisson, negbin_model = aic_negbin))## poisson_model negbin_model
## [1,] 2924.329 693.9664
cat("\nBIC comparison:\n")##
## BIC comparison:
print(cbind(poisson_model = bic_poisson, negbin_model = bic_negbin))## poisson_model negbin_model
## [1,] 2933.027 707.0128
cat("\n", result, "\n")##
## Negative Binomial model is the better fit.
cat("Model equation:\n", eqn)## Model equation:
## likes_count ~ current_price + raw_price + discount + discount_price
The table provided shows the Bayesian Information Criterion (BIC) and Akaike Information Criterion (AIC) for two different models, a Poisson regression model and a negative binomial regression model.
The BIC and AIC are both measures of the goodness of fit of a statistical model. They take into account both the likelihood of the data given the model and the complexity of the model, which is typically measured by the number of parameters.
In this case, the Poisson model has 4 degrees of freedom (df) and the negative binomial model has 10 df. The BIC for the Poisson model is 2933.0270 and for the negative binomial model is 707.0128. The lower the BIC value, the better the model fits the data. Therefore, in this case, the Poisson model has a worse fit than the negative binomial model.
Similarly, the AIC for the Poisson model is 2924.3294 and for the negative binomial model is 693.9664. Again, the lower the AIC value, the better the model fits the data. Therefore, in this case, the negative binomial model has a better fit than the Poisson model.
Overall, based on the BIC and AIC values, the negative binomial model seems to be a better fit for the data than the Poisson model.