Assignment 5

What Age Groups are likely to recommend a clothing line

This analysis uses the Women’s Clothing E-Commerce dataset, which revolves around reviews written by customers. I will be exploring the dataset to find out:

The probability of a clothing line with high ratings getting recommended.
Age groups that are likely to recommend the clothing line.

Content

This dataset includes 23486 rows and 10 feature variables. Each row corresponds to a customer review, and includes the variables:

Clothing ID: Integer Categorical variable that refers to the specific piece being reviewed.
Age: Positive Integer variable of the reviewer's age.
Title: String variable for the title of the review.
Review Text: String variable for the review body.
Rating: Positive Ordinal Integer variable for the product score granted by the customer from 1 Worst, to 5 Best.
Recommended IND: Binary variable stating whether the customer recommends the product, where 1 is recommended and 0 is not recommended.
Positive Feedback Count: Positive Integer documenting the number of other customers who found this review positive.
Division Name: Categorical name of the product high level division.
Department Name: Categorical name of the product department name.
Class Name: Categorical name of the product class name.

library(dplyr)
library(Zelig)
library(broom)
library(pander)
library(radiant.data)
library(texreg)
library(visreg)
library(ggplot2)

Clothing <- read.csv("/users/sharanbhamra/Desktop/SOC 712/Womens Clothing E-Commerce Reviews.csv") 
head(Clothing)

ggplot(data = Clothing, aes(x = Age)) + geom_histogram( fill = "blue")

The histogram above shows the age distribution of the customers who wrote reviews. Most reviewers at the e-commerce store are between 30 and 45 years old. The mean age is 43, and the distribution is positively (right) skewed.

Simple Logistic Models

A1<- glm(Recommended.IND ~ Age, family = "binomial", data = Clothing)
summary(A1)

## 
## Call:
## glm(formula = Recommended.IND ~ Age, family = "binomial", data = Clothing)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0071   0.5903   0.6196   0.6385   0.6736  
## 
## Coefficients:
##             Estimate Std. Error z value             Pr(>|z|)    
## (Intercept) 1.248494   0.062468   19.99 < 0.0000000000000002 ***
## Age         0.006622   0.001412    4.69           0.00000273 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 21973  on 23485  degrees of freedom
## Residual deviance: 21951  on 23484  degrees of freedom
## AIC: 21955
## 
## Number of Fisher Scoring iterations: 4

visreg(A1, "Age", scale = "response")

The graph above shows the predicted probability of a clothing item being recommended, by age. Reviewers in their 20s are slightly less likely to recommend than reviewers aged 40 and above. The Age coefficient is on the log-odds scale: each additional year of age raises the log-odds of a recommendation by 0.006622 (a change in log-odds, not in probability), which multiplies the odds by exp(0.006622) ≈ 1.0066, an increase of about 0.7% per year.
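To make the scale of the Age coefficient concrete, the sketch below converts the A1 estimates printed in the summary above into an odds ratio and into predicted probabilities. The ages 20, 40 and 60 are illustrative choices, not values taken from the original analysis.

```r
# Coefficients copied from summary(A1) above
b0 <- 1.248494   # intercept (log-odds at Age = 0)
b1 <- 0.006622   # change in log-odds per additional year of age

# Odds ratio for a one-year increase in age
odds_ratio <- exp(b1)

# Predicted probability of recommending at a few illustrative ages
ages  <- c(20, 40, 60)
p_hat <- plogis(b0 + b1 * ages)   # plogis() is the inverse logit

print(round(odds_ratio, 4))
print(data.frame(Age = ages, P_recommend = round(p_hat, 3)))
```

With these coefficients the predicted probability moves from about 0.80 at age 20 to about 0.84 at age 60, which is why the curve in the visreg plot is nearly flat.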

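Rate <- Clothing %>%
  group_by(Rating) %>%
  summarise(py1 = mean(Recommended.IND)) %>%   # share recommending at each rating
  mutate(py0 = 1 - py1)                        # share not recommending
pandoc.table(Rate)                             # print separately so Rate keeps the table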

## 
## -----------------------------
##  Rating     py1       py0    
## -------- --------- ----------
##    1       0.019     0.981   
## 
##    2      0.06006    0.9399  
## 
##    3      0.4141     0.5859  
## 
##    4      0.9669    0.03309  
## 
##    5      0.9981    0.001904 
## -----------------------------

C1<- glm(Recommended.IND ~ Rating, family = "binomial", data = Clothing)
summary(C1)

## 
## Call:
## glm(formula = Recommended.IND ~ Rating, family = "binomial", 
##     data = Clothing)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.5245   0.0634   0.0634   0.3087   3.6188  
## 
## Coefficients:
##             Estimate Std. Error z value            Pr(>|z|)    
## (Intercept) -9.73518    0.17834  -54.59 <0.0000000000000002 ***
## Rating       3.18882    0.05451   58.51 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 21973.1  on 23485  degrees of freedom
## Residual deviance:  6729.3  on 23484  degrees of freedom
## AIC: 6733.3
## 
## Number of Fisher Scoring iterations: 7

visreg(C1, "Rating", scale = "response")

The model and graph above show how the probability of a clothing item being recommended rises with its rating. At a rating of 3, the observed proportion of recommendations is 0.4141 (from the table above), and model C1 predicts a similar value of about 0.46.
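The comparison between the observed proportions and the C1 fit can be checked by hand. The sketch below uses only the coefficients from summary(C1) and the proportions from the Rating table above; the variable names are illustrative.

```r
# Coefficients copied from summary(C1) above
b0 <- -9.73518
b1 <- 3.18882

rating   <- 1:5
fitted_p <- plogis(b0 + b1 * rating)   # model-based probability at each rating

# Observed proportions recommended, copied from the Rating table above
observed_p <- c(0.019, 0.06006, 0.4141, 0.9669, 0.9981)

comparison <- data.frame(Rating   = rating,
                         observed = observed_p,
                         fitted   = round(fitted_p, 4))
print(comparison)
```

Because C1 treats Rating as a single numeric predictor, the fitted values follow one logistic curve and do not reproduce the observed proportions exactly.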

Additive and Interaction Models

C2<- glm(Recommended.IND ~ Rating + Age+ Positive.Feedback.Count, family = "binomial", data = Clothing)
summary(C2)

## 
## Call:
## glm(formula = Recommended.IND ~ Rating + Age + Positive.Feedback.Count, 
##     family = "binomial", data = Clothing)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.6376   0.0583   0.0640   0.2789   3.6549  
## 
## Coefficients:
##                           Estimate Std. Error z value             Pr(>|z|)    
## (Intercept)             -10.063005   0.213882 -47.049 < 0.0000000000000002 ***
## Rating                    3.184583   0.054526  58.405 < 0.0000000000000002 ***
## Age                       0.009206   0.002680   3.435             0.000592 ***
## Positive.Feedback.Count  -0.016082   0.004820  -3.336             0.000849 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 21973.1  on 23485  degrees of freedom
## Residual deviance:  6707.4  on 23482  degrees of freedom
## AIC: 6715.4
## 
## Number of Fisher Scoring iterations: 7

visreg(C2, "Age", scale = "response", xlab = "Age of Customers", ylab = "P(Recommendation)")

C3<- glm(Recommended.IND ~ Rating*Age + Positive.Feedback.Count, family = "binomial", data = Clothing)
summary(C3)

## 
## Call:
## glm(formula = Recommended.IND ~ Rating * Age + Positive.Feedback.Count, 
##     family = "binomial", data = Clothing)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.6294   0.0554   0.0635   0.2951   3.7411  
## 
## Coefficients:
##                           Estimate Std. Error z value             Pr(>|z|)    
## (Intercept)             -12.156607   0.654134 -18.584 < 0.0000000000000002 ***
## Rating                    3.835271   0.200159  19.161 < 0.0000000000000002 ***
## Age                       0.057169   0.014073   4.062            0.0000486 ***
## Positive.Feedback.Count  -0.016040   0.004824  -3.325             0.000883 ***
## Rating:Age               -0.014914   0.004295  -3.472             0.000516 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 21973.1  on 23485  degrees of freedom
## Residual deviance:  6695.8  on 23481  degrees of freedom
## AIC: 6705.8
## 
## Number of Fisher Scoring iterations: 7

visreg(C3, "Age", by = "Positive.Feedback.Count", scale = "response")

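In the additive model C2, the probability of a recommendation increases only very slightly with age once Rating and Positive.Feedback.Count are held constant. In the interaction model C3, the Rating:Age coefficient is negative (-0.0149), so the effect of age depends on the rating: the age slope on the log-odds scale is 0.0572 - 0.0149 × Rating, which is positive for ratings 1 to 3 and negative for ratings 4 and 5. At high ratings, reviewers in their 20s and 30s are therefore slightly more likely to recommend than the older age groups.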
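The way the age effect in C3 changes with the rating can be worked out directly from the two coefficients involved. This is a small sketch using the estimates from summary(C3) above; the variable names are illustrative.

```r
# Coefficients copied from summary(C3) above
b_age        <- 0.057169    # Age main effect
b_rating_age <- -0.014914   # Rating:Age interaction

rating    <- 1:5
age_slope <- b_age + b_rating_age * rating   # log-odds change per year at each rating

print(data.frame(Rating              = rating,
                 age_slope           = round(age_slope, 4),
                 odds_ratio_per_year = round(exp(age_slope), 4)))
```

The slope changes sign between ratings 3 and 4, so a single statement about "the effect of age" is not enough for this model; it has to be read at a given rating.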

Grouped Data Models

Cloth <- Clothing %>%
    group_by(Rating, Age, Positive.Feedback.Count) %>%
    summarise(total = n(), yrecom = sum(Recommended.IND)) %>%
    mutate(nrecom = total - yrecom)
head(Cloth)

dim(Cloth)

## [1] 3246    6

C4<- glm(cbind(yrecom,nrecom) ~ Rating + Age + Positive.Feedback.Count, family = "binomial", data = Cloth)
summary(C4)

## 
## Call:
## glm(formula = cbind(yrecom, nrecom) ~ Rating + Age + Positive.Feedback.Count, 
##     family = "binomial", data = Cloth)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.4876  -0.2313   0.0856   0.3448   3.8433  
## 
## Coefficients:
##                           Estimate Std. Error z value             Pr(>|z|)    
## (Intercept)             -10.063005   0.213886 -47.048 < 0.0000000000000002 ***
## Rating                    3.184583   0.054527  58.403 < 0.0000000000000002 ***
## Age                       0.009206   0.002680   3.435             0.000592 ***
## Positive.Feedback.Count  -0.016082   0.004820  -3.336             0.000849 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 17162.2  on 3245  degrees of freedom
## Residual deviance:  1896.4  on 3242  degrees of freedom
## AIC: 2864.6
## 
## Number of Fisher Scoring iterations: 6

visreg(C4, "Age", scale = "response")

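We grouped the data by the discrete variables Rating, Age and Positive.Feedback.Count, so that the model is fitted to 3246 covariate patterns instead of 23486 individual reviews. Grouping makes the data more compact, but it does not change the model: the coefficients and standard errors of C4 are the same as those of C2. Only the deviance and AIC differ, because the grouped binomial likelihood is measured against a different baseline.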
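The equivalence of the grouped and ungrouped fits can be illustrated on simulated data. The sketch below does not use the Clothing dataset; the sample size, seed and true coefficients are hypothetical values chosen only for the demonstration.

```r
set.seed(712)  # arbitrary seed for reproducibility

# Simulated stand-in data (hypothetical, not the Clothing dataset)
n   <- 2000
sim <- data.frame(rating = sample(1:5, n, replace = TRUE),
                  age    = sample(20:60, n, replace = TRUE))
sim$recommend     <- rbinom(n, 1, plogis(-4 + 1.2 * sim$rating + 0.01 * sim$age))
sim$not_recommend <- 1 - sim$recommend

# Ungrouped fit: one row per review
fit_rows <- glm(recommend ~ rating + age, family = binomial, data = sim)

# Grouped fit: one row per (rating, age) combination with success/failure counts
grouped <- aggregate(cbind(recommend, not_recommend) ~ rating + age,
                     data = sim, FUN = sum)
fit_grouped <- glm(cbind(recommend, not_recommend) ~ rating + age,
                   family = binomial, data = grouped)

# Same coefficients, different deviance and AIC
print(max(abs(coef(fit_rows) - coef(fit_grouped))))
print(c(rows = deviance(fit_rows), grouped = deviance(fit_grouped)))
print(c(rows = AIC(fit_rows), grouped = AIC(fit_grouped)))
```

The coefficient difference is at the level of numerical noise, while the deviance and AIC of the grouped fit are lower, mirroring the pattern seen between C2 and C4.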

Likelihood Ratio Test

anova(A1,C1, C2, C3, test = "Chisq")

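The anova call compares the residual deviance of successive models with a chi-squared test. A1 (Age only) and C1 (Rating only) are not nested, so the first comparison is not a valid likelihood ratio test; C1, C2 and C3 are nested. Adding Age and Positive.Feedback.Count (model 3, C2) lowers the deviance from 6729.3 to 6707.4 on 2 degrees of freedom, and adding the Rating:Age interaction (model 4, C3) lowers it further to 6695.8 on 1 degree of freedom. Both improvements are significant at the 0.001 level, so model 4 fits best of the four, with the lowest deviance.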
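The likelihood ratio tests for the nested models can be reproduced by hand from the residual deviances reported in the summaries above. This is a sketch of the calculation, not the output of the anova call itself.

```r
# Residual deviances and residual degrees of freedom copied from the summaries above
dev <- c(C1 = 6729.3, C2 = 6707.4, C3 = 6695.8)
df  <- c(C1 = 23484,  C2 = 23482,  C3 = 23481)

# Likelihood ratio statistic = drop in deviance between nested models
lr_stat <- unname(-diff(dev))   # C1 -> C2, then C2 -> C3
lr_df   <- unname(-diff(df))    # number of parameters added at each step
p_value <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)

print(data.frame(step = c("C1 vs C2", "C2 vs C3"),
                 LR   = round(lr_stat, 1),
                 df   = lr_df,
                 p    = signif(p_value, 3)))
```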

library(texreg)
htmlreg(list(A1,C1, C2, C3, C4))

Statistical models
                          Model 1        Model 2        Model 3        Model 4        Model 5
(Intercept)               1.25***        -9.74***       -10.06***      -12.16***      -10.06***
                          (0.06)         (0.18)         (0.21)         (0.65)         (0.21)
Age                       0.01***                       0.01***        0.06***        0.01***
                          (0.00)                        (0.00)         (0.01)         (0.00)
Rating                                   3.19***        3.18***        3.84***        3.18***
                                         (0.05)         (0.05)         (0.20)         (0.05)
Positive.Feedback.Count                                 -0.02***       -0.02***       -0.02***
                                                        (0.00)         (0.00)         (0.00)
Rating:Age                                                             -0.01***
                                                                       (0.00)
AIC                       21954.91       6733.27        6715.40        6705.85        2864.59
BIC                       21971.04       6749.40        6747.66        6746.17        2888.93
Log Likelihood            -10975.46      -3364.63       -3353.70       -3347.92       -1428.29
Deviance                  21950.91       6729.27        6707.40        6695.85        1896.41
Num. obs.                 23486          23486          23486          23486          3246
*** p < 0.001, ** p < 0.01, * p < 0.05

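Lower values of AIC and BIC indicate a better fit, but only among models fitted to the same data in the same form. Models 1 to 4 are fitted to the 23486 individual reviews, and of these Model 4 (C3, with the Rating:Age interaction) has the lowest AIC (6705.85) and BIC (6746.17). Model 5 (C4) is the grouped version of Model 3 (C2) and has identical coefficients; its much lower AIC and BIC come from the grouped binomial likelihood and the smaller number of rows (3246), so they cannot be compared with those of the other four models.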
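One way to see that the AIC of Model 5 sits on a different scale is to rebuild the AIC from the deviance. For 0/1 responses the saturated log-likelihood is zero, so AIC equals the deviance plus twice the number of coefficients. The sketch below applies that identity to the values in the table above; the vector names are illustrative.

```r
# Values copied from the texreg table above
deviance_val <- c(M1 = 21950.91, M2 = 6729.27, M3 = 6707.40, M4 = 6695.85, M5 = 1896.41)
aic_reported <- c(M1 = 21954.91, M2 = 6733.27, M3 = 6715.40, M4 = 6705.85, M5 = 2864.59)
k            <- c(M1 = 2, M2 = 2, M3 = 4, M4 = 5, M5 = 4)   # number of coefficients

# For ungrouped 0/1 data: AIC = deviance + 2k
aic_from_deviance <- deviance_val + 2 * k
gap <- aic_reported - aic_from_deviance
print(round(gap, 2))
```

The gap is zero for Models 1 to 4 and large for Model 5, because the grouped fit measures its deviance against a saturated model with a non-zero log-likelihood.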