Data Science - Insights Extraction - Jumia Black Friday Morocco

Introduction :

This notebook covers a statistical analysis and insight extraction of the Jumia Black Friday promotions in Morocco, based on users’ reviews posted on the official Facebook page of Jumia Morocco during November 2017.

Data Exploration :

First of all, we load our data set as a data frame so that we can manipulate it easily.

# loading the reviews data set
df <- read.csv(file = "final_data.csv", stringsAsFactors = FALSE)
# get a summary of our data frame
summary(df)

##    username            review              rating          date          
##  Length:1532        Length:1532        Min.   :1.000   Length:1532       
##  Class :character   Class :character   1st Qu.:1.000   Class :character  
##  Mode  :character   Mode  :character   Median :1.000   Mode  :character  
##                                        Mean   :1.497                     
##                                        3rd Qu.:1.000                     
##                                        Max.   :5.000                     
##      Angry              Haha              Like              Love         
##  Min.   :0.00000   Min.   :0.00000   Min.   : 0.0000   Min.   :0.000000  
##  1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.: 0.0000   1st Qu.:0.000000  
##  Median :0.00000   Median :0.00000   Median : 0.0000   Median :0.000000  
##  Mean   :0.01044   Mean   :0.02611   Mean   : 0.8649   Mean   :0.003264  
##  3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.: 1.0000   3rd Qu.:0.000000  
##  Max.   :4.00000   Max.   :5.00000   Max.   :28.0000   Max.   :1.000000  
##       Sad         Wow       gender            location        
##  Min.   :0   Min.   :0   Length:1532        Length:1532       
##  1st Qu.:0   1st Qu.:0   Class :character   Class :character  
##  Median :0   Median :0   Mode  :character   Mode  :character  
##  Mean   :0   Mean   :0                                        
##  3rd Qu.:0   3rd Qu.:0                                        
##  Max.   :0   Max.   :0

Now let’s visualise the data set so that things become clearer.

library(ggplot2)  # ggplot()
library(plotly)   # ggplotly()
# Count how many reviews fall into each rating class (1 to 5)
star_1 <- length(which(df$rating == 1))
star_2 <- length(which(df$rating == 2))
star_3 <- length(which(df$rating == 3))
star_4 <- length(which(df$rating == 4))
star_5 <- length(which(df$rating == 5))
# Build the vectors of rating labels and their counts
Ratings <- c("1", "2", "3", "4", "5")
Count <- c(star_1, star_2, star_3, star_4, star_5)
output <- data.frame(Ratings, Count)
ggplot(data = output, aes(x = Ratings, y = Count)) +
  geom_bar(aes(fill = Ratings), stat = "identity") +
  theme(legend.position = "none") +
  xlab("Ratings") + ylab("Total Count") + ggtitle("Histogram of different ratings during the Black Friday Promotion by Jumia") -> reviews_by_rating
ggplotly(reviews_by_rating)

## We recommend that you use the dev version of ggplot2 with `ggplotly()`
## Install it with: `devtools::install_github('hadley/ggplot2')`
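The five `length(which(...))` calls above can also be written as a single tabulation. The following is a minimal sketch in base R, assuming a toy `ratings` vector in place of `df$rating`; the names `ratings`, `counts` and `output_toy` are illustrative and not part of the original notebook.

```r
# Toy ratings vector standing in for df$rating (values are illustrative only).
ratings <- c(1, 1, 5, 1, 2, 1, 4, 1, 1, 3)

# Fixing the factor levels keeps all five classes, even those with zero reviews.
counts <- table(factor(ratings, levels = 1:5))

# Same shape as the `output` data frame used for the bar chart.
output_toy <- data.frame(Ratings = names(counts), Count = as.vector(counts))
print(output_toy)
```

Fixing the levels to `1:5` matters when a class is absent from the data: a plain `table(ratings)` would silently drop it from the chart.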

Taking a closer look, the next histogram will show the frequency of occurrence of unigrams. Before that, we first have to prepare a corpus of all the documents in the data frame.

library(tm)  # Corpus(), tm_map(), DocumentTermMatrix()
corpus <- Corpus(VectorSource(df$review))
# Inspect the corpus
corpus

## <<SimpleCorpus>>
## Metadata:  corpus specific: 1, document level (indexed): 0
## Content:  documents: 1532

# Let's inspect some lines from our corpus
inspect(corpus[22:24])

## <<SimpleCorpus>>
## Metadata:  corpus specific: 1, document level (indexed): 0
## Content:  documents: 3
## 
## [1] Black Fuck                        De l arnaque tout simplement     
## [3] CHFARA LAYN3L ZAML BOHA CHARIKA

Data Cleaning

Since our corpus is multilingual, we remove the stop words of each language and clean up the corpus by eliminating numbers, punctuation and extra white space, and by converting the text to lower case. The stop words we discard are common words such as “le”, “la”, “dans”, “sur”, “in”, “with”, etc. We use the tm_map() function from the ‘tm’ package to this end.

## $text
## [1] "à 00 00 Black friday j ai réussi à mettre dans le panier un article Quand j ai voulu validé l achat on m affiche stock épuisé tout ca avant 00 01 on peut accepter qu il y avait beaucoup de demandes mais comment pour des 0 Likeprofessionnels je m en doute maintenant0 Like ont cette faille dans leurs applications"
## 
## $arabicStopwordList
##   [1] "في"       "فيه"      "فيها"     "فيهم"     "على"      "عليك"    
##   [7] "عليكم"    "علينا"    "عليه"     "عليها"    "عليهم"    "علي"     
##  [13] "به"       "بها"      "بهم"      "بهذا"     "بذلك"     "بك"      
##  [19] "بكم"      "بكل"      "بما"      "بمن"      "بنا"      "له"      
##  [25] "لها"      "لهم"      "مع"       "معه"      "معها"     "معهم"    
##  [31] "عن"       "عنا"      "عنه"      "عنها"     "عنهم"     "تحت"     
##  [37] "حتى"      "فوق"      "فوقَ"      "بجانب"    "أمام"     "أمامَ"    
##  [43] "امام"     "خارج"     "بالخارج"  "حولَ"      "حول"      "رغم"     
##  [49] "بالرغم"   "رغمَ"      "منذ"      "منذُ"      "من"       "خلال"    
##  [55] "خلالَ"     "قبل"      "قبلَ"      "وفقا"     "إلى"      "الىوراءَ" 
##  [61] "وراء"     "بينَ"      "بين"      "بينهم"    "بينهما"   "بينكم"   
##  [67] "بينما"    "بدون"     "لكن"      "باتجاه"   "أقل"      "اقل"     
##  [73] "اكثر"     "هذا"      "هذه"      "ذلك"      "تلك"      "هؤلَاء"   
##  [79] "هؤلاء"    "اولائك"   "هذان"     "هذينهتان" "هتينأنا"  "انا"     
##  [85] "أنت"      "هما"      "أنتَ"      "انت"      "أنتِ"      "انتهو"   
##  [91] "هوَ"       "هو"       "هي"       "هيَ"       "نحن"      "أنتم"    
##  [97] "انتم"     "هُم"       "هم"       "منهم"     "وهم"      "التي"    
## [103] "الذي"     "اللذان"   "اللذين"   "اللتان"   "اللتين"   "ان"      
## [109] "وان"      "إن"       "إنه"      "إنها"     "إنهم"     "إنهما"   
## [115] "إني"      "وإن"      "وأن"      "انه"      "انها"     "انهم"    
## [121] "انهما"    "اني"      "أنك"      "إنك"      "انك"      "أنكم"    
## [127] "إنكم"     "انكم"     "اننا"     "أن"       "ألا"      "بأن"     
## [133] "الا"      "بان"      "بانهم"    "أنه"      "أنها"     "أنهم"    
## [139] "أنهما"    "أذ"       "اذ"       "اذا"      "إذ"       "إذا"     
## [145] "وإذ"      "وإذا"     "فاذا"     "ماذا"     "واذ"      "واذا"    
## [151] "لولا"     "لو"       "ولوسوف"   "لن"       "ما"       "لم"      
## [157] "ولم"      "أما"      "اما"      "لا"       "ولا"      "إلا"     
## [163] "أم"       "أو"       "ام"       "او"       "بل"       "قد"      
## [169] "وقد"      "لقد"      "أنما"     "إنما"     "انما"     "و"       
## [175] "كما"      "لما"      "لأن"      "لان"      "لي"       "لى"      
## [181] "لهذأ"     "لذأ"      "لأنه"     "لأنها"    "لأنهم"    "لانه"    
## [187] "لانها"    "لانهم"    "ثم"       "أيضا"     "ايضا"     "كذلك"    
## [193] "بعد"      "ولكن"     "لكنه"     "لكنها"    "لكنهم"    "فقط"     
## [199] "بفضل"     "حيث"      "بحيث"     "لكي"      "هنا"      "هناك"    
## [205] "بسبب"     "ذات"      "ذو"       "ذي"       "ذى"       "وه"      
## [211] "يا"       "فهذا"     "فهو"      "فما"      "فمن"      "فيما"    
## [217] "فهل"      "وهل"      "فهؤلاء"   "كذا"      "لذلك"     "لماذا"   
## [223] "لمن"      "لنا"      "منا"      "منك"      "منكم"     "منهما"   
## [229] "لك"       "ولو"      "مما"      "وما"      "ومن"      "عند"     
## [235] "عندهم"    "عندما"    "عندنا"    "عنهما"    "عنك"      "اذن"     
## [241] "فانا"     "فانهم"    "فهم"      "فه"       "فكل"      "لكل"     
## [247] "لكم"      "فلم"      "فلما"     "فيك"      "فيكم"     "لهذا"    
## [253] "امامَ"     "الى"      "هتينانا"  "انتَ"      "انتِ"      "لذا"

library(dplyr)  # %>%, group_by(), summarise()
corpus.clean <- corpus %>%
  tm_map(content_transformer(tolower)) %>% 
  tm_map(removePunctuation) %>%
  tm_map(removeNumbers) %>%
  tm_map(removeWords, stopwords(kind="fr")) %>%
  tm_map(removeWords, stopwords(kind="en")) %>%
  tm_map(stripWhitespace)
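To make the effect of these cleaning steps concrete, here is a minimal sketch in base R that mimics them on plain character vectors. It is not the implementation used by the ‘tm’ package: the helper `clean_text()` and its arguments are hypothetical, and punctuation is replaced by a space here rather than deleted. Because it accepts any vector of stop words, the same idea would apply to a custom list such as the Arabic one printed above.

```r
# Hypothetical helper: lower-case, strip punctuation and digits,
# drop stop words, and collapse white space.
clean_text <- function(docs, stop_words) {
  docs <- tolower(docs)
  docs <- gsub("[[:punct:]]+", " ", docs)   # punctuation -> space
  docs <- gsub("[[:digit:]]+", " ", docs)   # digits -> space
  vapply(strsplit(docs, "[[:space:]]+"), function(tokens) {
    # Keep non-empty tokens that are not stop words.
    tokens <- tokens[nzchar(tokens) & !(tokens %in% stop_words)]
    paste(tokens, collapse = " ")
  }, character(1))
}

# Toy documents and a tiny stop-word list (illustrative only).
cleaned <- clean_text(
  c("Le Black Friday 2017, dans le panier!!", "Stock EPUISE"),
  stop_words = c("le", "dans")
)
print(cleaned)
```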

# Keep only the date (without the time of day) to reduce the size of the data
df$date <- as.Date(df$date)

The Document Term Matrix

We represent the bag of words tokens with a document term matrix (DTM). The rows of the DTM correspond to documents in the collection, columns correspond to terms, and its elements are the term frequencies. We use a built-in function from the ‘tm’ package to create the DTM.

dtm <- DocumentTermMatrix(corpus.clean)
# Inspect the dtm
inspect(dtm[50:60, 15:21])

## <<DocumentTermMatrix (documents: 11, terms: 7)>>
## Non-/sparse entries: 0/77
## Sparsity           : 100%
## Maximal term length: 18
## Weighting          : term frequency (tf)
## Sample             :
##     Terms
## Docs likeprofessionnels maintenant mettre panier peut quand réussi
##   50                  0          0      0      0    0     0      0
##   51                  0          0      0      0    0     0      0
##   52                  0          0      0      0    0     0      0
##   53                  0          0      0      0    0     0      0
##   54                  0          0      0      0    0     0      0
##   55                  0          0      0      0    0     0      0
##   56                  0          0      0      0    0     0      0
##   57                  0          0      0      0    0     0      0
##   58                  0          0      0      0    0     0      0
##   59                  0          0      0      0    0     0      0
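The structure of a document term matrix is easier to see on a tiny example. The sketch below builds one by hand in base R, assuming three already-cleaned toy documents; the names `docs`, `vocab` and `dtm_toy` are illustrative, and this is not how `DocumentTermMatrix()` is implemented internally.

```r
# Three toy documents, already cleaned (illustrative only).
docs <- c("stock epuise", "arnaque stock", "livraison rapide stock")

tokens <- strsplit(docs, " ", fixed = TRUE)
vocab  <- sort(unique(unlist(tokens)))

# One row per document, one column per term, cells = term frequencies.
dtm_toy <- t(vapply(
  tokens,
  function(tk) as.integer(table(factor(tk, levels = vocab))),
  integer(length(vocab))
))
dimnames(dtm_toy) <- list(Docs = seq_along(docs), Terms = vocab)
print(dtm_toy)

# Column sums give the corpus-wide term frequencies used for the histogram.
term_freq <- sort(colSums(dtm_toy), decreasing = TRUE)

# Sparsity is the share of zero cells, as reported by inspect().
sparsity <- mean(dtm_toy == 0)
```

Real review corpora have many terms that each appear in only a few documents, which is why a small slice of the matrix such as the one inspected above can be entirely zero.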

Here is the histogram showing the frequency of occurrence of unigrams in the corpus.

# Frequency
freq <- sort(colSums(as.matrix(dtm)), decreasing=TRUE)
wf <- data.frame(word=names(freq), freq=freq)
# Plot Histogram
subset(wf, freq>40)    %>%
        ggplot(aes(word, freq)) +
        geom_bar(stat="identity", fill="lightblue", colour="deepskyblue") +
        theme(axis.text.x=element_text(angle=45, hjust=1)) +
        ggtitle("Histogram of word occurrence frequencies during the Black Friday Promotion by Jumia") -> occ_freq
ggplotly(occ_freq)

## We recommend that you use the dev version of ggplot2 with `ggplotly()`
## Install it with: `devtools::install_github('hadley/ggplot2')`

To wrap up the data exploration, we generate the word cloud representation.

m <- as.matrix(TermDocumentMatrix(corpus.clean))
v <- sort(rowSums(m), decreasing=TRUE)
d <- data.frame(word = names(v), freq=v)
set.seed(1234)
wordcloud(words = d$word, freq = d$freq, min.freq = 1,
          max.words=100, random.order=FALSE, rot.per=0.35,
          colors=brewer.pal(8, "Dark2"))

2. Let’s go deeper with Statistical insights

The following plot gives a more detailed representation of how the reviews were distributed by gender.

ggplot(df, aes(x = rating, fill = gender)) +
  geom_bar(position = "dodge") +
  ggtitle("Distribution of different ratings during the Black Friday Promotion by gender") -> g
ggplotly(g)

## We recommend that you use the dev version of ggplot2 with `ggplotly()`
## Install it with: `devtools::install_github('hadley/ggplot2')`

Another important plot shows how the reviews and their frequencies were distributed over time.

# df %>% group_by(date, rating) %>% summarise(n=n()) %>%
#   ggplot(aes(x = date, y = n, fill = rating)) + geom_bar(stat = 'identity', position = 'stack')

df %>% group_by(rating,date) %>% summarise(n=n()) %>%
  ggplot(aes(x=date, y=n, group=rating, color=as.factor(rating))) +
  geom_line(size=0.5) + geom_point() +
  ggtitle("Distribution of each review frequencies by date grouping by rating during the Black Friday by Jumia") +
  labs(x="Days", y="review frequencies", color="rating") -> p

ggplotly(p)

## We recommend that you use the dev version of ggplot2 with `ggplotly()`
## Install it with: `devtools::install_github('hadley/ggplot2')`

However, that’s not enough for data exploration. We still need to visualize where those reviews were coming from, so that we get a better insight into the people who wrote them.

leaflet(my_map) %>% addTiles() %>%
  setView(lng = -8.0188344, lat = 29.8255677, zoom = 05) %>%
  addCircles(lng = ~as.numeric(longitude), lat = ~as.numeric(latitude), weight = 1,
    radius = ~n *500, popup = ~location)
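The notebook does not show how `my_map` was built. A minimal sketch of one way to obtain such a table in base R follows, assuming a per-review `location` column and a separate lookup table of city coordinates; the data frames `reviews` and `coords`, their values, and the name `my_map_toy` are all hypothetical.

```r
# Toy reviews with a location column (illustrative only).
reviews <- data.frame(
  location = c("Casablanca", "Rabat", "Casablanca", "Casablanca", "Rabat"),
  stringsAsFactors = FALSE
)

# Hypothetical lookup table of approximate city coordinates.
coords <- data.frame(
  location  = c("Casablanca", "Rabat"),
  latitude  = c(33.5731, 34.0209),
  longitude = c(-7.5898, -6.8416),
  stringsAsFactors = FALSE
)

# Count reviews per location, then attach the coordinates.
counts     <- aggregate(n ~ location, data = transform(reviews, n = 1L), FUN = sum)
my_map_toy <- merge(counts, coords, by = "location")
print(my_map_toy)
```

The resulting columns (`location`, `n`, `latitude`, `longitude`) match the ones referenced by the `leaflet()` call above.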

In addition, we can look more closely at where the bad reviews are coming from.

cities %>% group_by(rating,location, latitude, longitude) %>% filter(rating == 1) %>%
  summarise(n=n()) -> bad_rating_map

leaflet(bad_rating_map) %>% addTiles() %>%
  setView(lng = -8.0188344, lat = 29.8255677, zoom = 06) %>%
  addCircles(lng = ~as.numeric(longitude), lat = ~as.numeric(latitude), weight = 1,
    radius = ~n *300, popup = ~location, color = "red"
  )

## Warning in validateCoords(lng, lat, funcName): Data contains 5 rows with
## either missing or invalid lat/lon values and will be ignored

Which kind of reviews get the most Like reactions?

# Build the most_liked_reviews data frame (n = number of reviews per rating)
df %>% group_by(Like,rating) %>% summarise(n=n()) -> most_liked_reviews
most_liked_reviews <-most_liked_reviews[, c("rating", "n")]
most_liked_reviews <- aggregate(n~., most_liked_reviews, FUN=sum)
most_liked_reviews$rating <- as.factor(most_liked_reviews$rating)
most_liked_reviews %>% arrange(desc(n)) %>%
  mutate(prop = percent(n / sum(n))) -> most_liked_reviews 
# create the pie chart plot with percentage as labels
pie <- ggplot(most_liked_reviews, aes(x = "", y = n, fill = fct_inorder(rating))) +
       geom_bar(width = 1, stat = "identity") + blank_theme +
  theme(axis.text.x=element_blank()) +
       coord_polar("y", start = 0) +
       geom_label_repel(aes(label = prop), size=5, show.legend = F, nudge_x = 1) +
       guides(fill = guide_legend(title = "Rating"))+
  ggtitle("Proportion of most liked reviews by other users")
pie
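Note that `n` above is a count of reviews, so the pie chart shows the share of reviews in each rating class. If the intent is the share of Like reactions received by each class, the Likes have to be summed instead. The following is a minimal sketch under that assumption, using a toy data frame; the names `toy` and `likes_by_rating` are illustrative.

```r
# Toy data: one row per review (illustrative values only).
toy <- data.frame(
  rating = c(1, 1, 1, 5, 5, 3),
  Like   = c(4, 0, 2, 1, 0, 3)
)

# Total number of Likes received by each rating class.
likes_by_rating <- aggregate(Like ~ rating, data = toy, FUN = sum)

# Share of all Likes that went to each class.
likes_by_rating$prop <- likes_by_rating$Like / sum(likes_by_rating$Like)
print(likes_by_rating)
```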

Let’s go deeper now by extracting & visualizing some interesting statistical insights

Now, we are going to use this data to figure out the real degree of satisfaction of Jumia’s users.
The key variable for measuring this degree of satisfaction is the “rating”. Let’s have a look at the general statistical properties of this variable.

summary(df$rating)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.000   1.000   1.497   1.000   5.000

# Compute the proportion (in %) of each class
round(prop.table(table(df$rating))*100,3)

## 
##      1      2      3      4      5 
## 85.183  1.501  1.567  1.958  9.791

As we can see from the first summary, the mean value is 1.497, so it lies between classes 1 and 2. On the other hand, the median, the 1st quartile and the 3rd quartile all fall in the first class. This shows that a very large share of the ratings belongs to class 1. It can be seen clearly from the proportions of each class, where class 1 accounts for 85.183% of the observations.
Now, let’s see how those ratings are distributed by gender using boxplots.

boxplot(rating~gender, data = df, main="Distribution of the ratings by genders")

At first sight, the distribution of the ratings is the same for males and females. Classes 2, 3, 4 and 5 appear as circles in the boxplots of both genders, which means that those values are considered outliers. This is due to the very low frequency of those classes compared to class 1: the four classes combined do not even reach 15% of all the ratings. Since class 1 is by far the most frequent one, the marks for the 1st quartile, the 3rd quartile and the median can barely be told apart: they are all merged with the main body of the boxplot at class 1. This emphasizes the class 1 proportion of 85.183% found before.
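Why every rating other than 1 is drawn as an outlier can be checked directly: R's boxplot flags points lying more than 1.5 times the interquartile range beyond the hinges, and when both hinges sit at 1 that range is zero. Here is a minimal sketch with a toy vector in which 85% of the values are 1; the vector `x` is illustrative and is not the notebook's data.

```r
# Toy ratings: 17 of 20 values (85%) are in class 1 (illustrative only).
x <- c(rep(1, 17), 2, 3, 5)

# boxplot.stats() returns the five box marks and the points flagged as outliers.
bs <- boxplot.stats(x)

print(bs$stats)  # lower whisker, lower hinge, median, upper hinge, upper whisker
print(bs$out)    # values drawn as circles
```

With both hinges equal to 1, the whisker limits collapse onto 1 as well, so any rating different from 1 falls outside them.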
Now, let’s see whether the mean value 1.497 of the variable rating can be generalised beyond this sample. In other words, if we considered other Jumia users whose reviews are not in this sample, would their mean rating be similar to our sample’s mean? To be more specific, we need to look at the confidence interval generated by this test around our mean.
To answer this question, we’re going to use a statistical test called the “Student mean test” (a one-sample t-test). But first, we need to check that the variable “rating” is normally distributed.

qqnorm(df$rating, col= "green",main="Normal Q-Q Plot: Distribution of the variable rating")
qqline(df$rating, col = "red")

As we can see from the plot, the “Henry line” (in red) and the majority of the rating’s distribution (in green) are almost aligned.
We therefore treat the variable “rating” as approximately normally distributed; with 1532 observations, the t-test on the mean is in any case fairly robust to departures from normality. We can now run our “Student mean test” (with a 95% confidence level).

t.test(df$rating,mu=mean(df$rating),conf.level = 0.95)$conf.int

## [1] 1.433848 1.559625
## attr(,"conf.level")
## [1] 0.95

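As expected, the sample mean 1.497 lies inside the interval [1.4338, 1.5596]: a t-based confidence interval is centred on the sample mean by construction. What the interval tells us is that, at the 95% confidence level, the mean rating of Jumia users beyond this sample plausibly lies between about 1.43 and 1.56, so the sample mean is a reasonably precise estimate.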
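The interval returned by `t.test()` can be reproduced by hand as the sample mean plus or minus the critical t value times the standard error. Here is a minimal sketch on a toy vector; the values in `x` are illustrative and are not the notebook's data.

```r
# Toy ratings (illustrative only).
x <- c(1, 1, 1, 1, 5, 1, 2, 1, 1, 5)
n <- length(x)

# Standard error of the mean and the 97.5% quantile of Student's t.
se     <- sd(x) / sqrt(n)
t_crit <- qt(0.975, df = n - 1)

# 95% confidence interval: mean +/- t_crit * se.
ci_manual <- mean(x) + c(-1, 1) * t_crit * se

# Same interval as computed by t.test().
ci_ttest <- as.numeric(t.test(x, conf.level = 0.95)$conf.int)

print(ci_manual)
print(ci_ttest)
```

Because the interval is built symmetrically around `mean(x)`, the sample mean always sits at its centre; the useful information is the width of the interval.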
Now, let’s see if the mean rating for males is significantly different from the mean rating for females.

aggregate(rating~gender, data = df, mean)

##   gender   rating
## 1 Female 1.528736
## 2   Male 1.457265
## 3    N/A 1.582090

We’re going to use the “Student test” for comparing means. We have already argued that the variable “rating” can be treated as normally distributed, so we can use the test. We first need to convert the “N/A” strings to the value NA and store the result in a new variable: na.gender.

na.gender = df$gender
na.gender[na.gender=="N/A"]=NA
t.test(df$rating~na.gender)

## 
##  Welch Two Sample t-test
## 
## data:  df$rating by na.gender
## t = 0.82301, df = 407.2, p-value = 0.411
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.09924122  0.24218257
## sample estimates:
## mean in group Female   mean in group Male 
##             1.528736             1.457265

As we can see from the result, the p-value of the test is 0.411, which is well above the significance level 0.05, so we cannot say that the means for males and females are significantly different from one another. This makes sense, since the means estimated from the sample are close to each other. We cannot be absolutely sure, but the data suggest that males and females share the same level of satisfaction: level 1, which corresponds to unsatisfied or negative reviews.
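The statistic and the fractional degrees of freedom reported by the Welch test can be recomputed from the group means, variances and sizes. Here is a minimal sketch on two toy groups; the vectors `female` and `male` are illustrative and are not the notebook's data.

```r
# Toy ratings for two groups (illustrative only).
female <- c(1, 1, 5, 1, 2, 1)
male   <- c(1, 1, 1, 4, 1)

# Per-group variance of the mean.
vf <- var(female) / length(female)
vm <- var(male)   / length(male)

# Welch statistic: difference of means over its standard error.
t_manual <- (mean(female) - mean(male)) / sqrt(vf + vm)

# Welch-Satterthwaite approximation of the degrees of freedom.
df_manual <- (vf + vm)^2 /
  (vf^2 / (length(female) - 1) + vm^2 / (length(male) - 1))

# t.test() uses the Welch variant by default (var.equal = FALSE).
res <- t.test(female, male)
print(c(t_manual, unname(res$statistic)))
print(c(df_manual, unname(res$parameter)))
```

The Welch variant does not assume equal variances in the two groups, which is why the degrees of freedom in the output above are not a whole number.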
Now, we’re going to analyse the relationship between the ratings and the Facebook reactions.
In the following, we focus on the Like, Love and Angry reactions, since the other reactions (Haha, Sad and Wow) don’t convey a clear emotion about the review. For instance, we cannot know whether a Haha reaction is used because the review is funny or to mock the review. Of course this ambiguity can also be present in the reactions we keep, but not as strongly as in those we exclude.
To carry out this analysis, we need to modify the data: we binarize the variable “rating” to guarantee a clear polarity. To this end, we’re going to be nice to “Jumia”: we leave the class 1 ratings as they are and turn all the others into 5. Class 1 then represents the unsatisfied class and class 5 the satisfied one.

bin.rating= ifelse(df$rating==1,1,5)

Now, let’s see how “Like” is used on average.

aggregate(df$Like~bin.rating, data = df, mean)

##   bin.rating   df$Like
## 1          1 0.9609195
## 2          5 0.3127753

At first glance, the reviews belonging to class 1 are more liked than the rest, but let’s test that statistically and see if the means are significantly different. As before, we start by checking the normality of the variable “Like” before using the “Student test” for comparing means.

qqnorm(df$Like, col= "orange",main="Normal Q-Q Plot: Distribution of the variable Like")
qqline(df$Like, col = "blue")

The “Henry line” (in blue) and the majority of the Like distribution (in orange) are almost aligned. We therefore treat the Like variable as normally distributed. Let’s compare the means of the two classes (1 and 5).

t.test(df$Like~bin.rating)

## 
##  Welch Two Sample t-test
## 
## data:  df$Like by bin.rating
## t = 6.649, df = 830.31, p-value = 5.347e-11
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.4568067 0.8394817
## sample estimates:
## mean in group 1 mean in group 5 
##       0.9609195       0.3127753

The resulting p-value is 5.347e-11 (almost equal to 0), which is below the significance level 0.05. We can therefore accept the alternative hypothesis: the means are significantly different from each other.
We’re going to do the same for the variable Love.

aggregate(df$Love~bin.rating, data = df, mean)

##   bin.rating     df$Love
## 1          1 0.002298851
## 2          5 0.008810573

Those results give the impression that the means are quite different. Let’s see.

qqnorm(df$Love, col= "yellow",main="Normal Q-Q Plot: Distribution of the variable Love")
qqline(df$Love, col = "green")

As before, we treat the variable Love as approximately normal and run the test.

t.test(df$Love~bin.rating)

## 
##  Welch Two Sample t-test
## 
## data:  df$Love by bin.rating
## t = -1.0245, df = 246.95, p-value = 0.3066
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.019030836  0.006007392
## sample estimates:
## mean in group 1 mean in group 5 
##     0.002298851     0.008810573

The p-value is 0.3066, which is above the significance level 0.05, so we cannot reject the null hypothesis. That means that we cannot say that the means are significantly different between the two classes.
We can then note that, on average, unsatisfied reviews received as many Love reactions as satisfied ones, even though unsatisfied reviews are far more numerous.
Now, it’s time for the Angry variable.

aggregate(df$Angry~bin.rating, data = df, mean)

##   bin.rating    df$Angry
## 1          1 0.006896552
## 2          5 0.030837004

Once again, we get the impression that the means are different. We need to see what the statistical test says.

qqnorm(df$Angry, col= "pink",main="Normal Q-Q Plot: Distribution of the variable Angry")
qqline(df$Angry, col = "brown")

As before, we treat the variable Angry as approximately normal and run the test.

t.test(df$Angry~bin.rating)

## 
##  Welch Two Sample t-test
## 
## data:  df$Angry by bin.rating
## t = -1.1818, df = 231.89, p-value = 0.2385
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.06385366  0.01597275
## sample estimates:
## mean in group 1 mean in group 5 
##     0.006896552     0.030837004

The p-value 0.2385 is above the significance level 0.05, so we cannot reject the null hypothesis. Therefore we cannot claim that the means are significantly different.
Now, we’re going to see if there is any relation between the rating and those reactions and, if so, how they influence each other.
To that end, we’re going to compute the correlation between those variables using Pearson’s statistical test.
Let’s start with the Likes.
We argued earlier that both the variables rating and Like can be treated as normally distributed, so we use the Pearson test.

cor.test(df$rating,df$Like, method = "pearson")

## 
##  Pearson's product-moment correlation
## 
## data:  df$rating and df$Like
## t = -3.8783, df = 1530, p-value = 0.0001096
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.14801807 -0.04882696
## sample estimates:
##         cor 
## -0.09866759

This test resulted in a p-value of 0.0001096 (close to 0), which is below the significance level 0.05. This means that we can reject the null hypothesis, which states that the correlation is equal to zero.
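The t statistic and p-value reported by `cor.test()` follow directly from the correlation r and the sample size n, through t = r·sqrt(n − 2) / sqrt(1 − r²). The sketch below plugs in the values printed above (r = -0.09866759, n = 1532); the helper name `t_from_r` is illustrative.

```r
# t statistic of Pearson's test as a function of r and n.
t_from_r <- function(r, n) r * sqrt(n - 2) / sqrt(1 - r^2)

# Values taken from the cor.test() output above.
r_like <- -0.09866759
n_obs  <- 1532

t_like <- t_from_r(r_like, n_obs)

# Two-sided p-value from Student's t with n - 2 degrees of freedom.
p_like <- 2 * pt(-abs(t_like), df = n_obs - 2)

print(t_like)  # close to the reported t = -3.8783
print(p_like)  # close to the reported p-value = 0.0001096
```

With a sample this large, even a very small correlation produces a small p-value, so significance here says little about the strength of the association.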

p <- ggplot(df, aes(Like, rating))
p + geom_point(aes(size = Like,colour = factor(rating)))

In other words, the variables rating and Like are significantly correlated, with a correlation of -0.09866759, but this is only a weak association between the two variables. The negative sign suggests that the two variables move in opposite directions: when the rating grows, the Likes get fewer and vice versa, as shown in the plot above.
To illustrate how the ratings and the Likes influence each other, and to check the speculation made above, we build a linear regression model between the two variables.

reg.like = lm(df$rating~df$Like);reg.like

## 
## Call:
## lm(formula = df$rating ~ df$Like)
## 
## Coefficients:
## (Intercept)      df$Like  
##     1.54133     -0.05157

As we can see, the slope provided by the linear regression model is equal to -0.05157, which means that the regression line is decreasing, as predicted above. Here is how well the linear regression line fits the data.

p <- ggplot(df, aes(df$Like, df$rating))
p <- p + geom_point(aes(size = df$Like,colour = factor(df$rating)))
p + geom_abline(intercept = reg.like$coef[1], slope = reg.like$coef[2]) + labs(x="Like", y="Ratings", color="rating", size="Size")

This plot shows us that the class 1 ratings (the most unsatisfied ones) are the ones getting most of the Likes.
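For a simple regression with one predictor, the slope and the share of variance explained are both tied to the correlation: the slope equals cov(x, y) / var(x), and R² equals r². With the correlation of -0.09866759 reported earlier, r² is about 0.0097, so the line accounts for roughly 1% of the variance in the ratings. The sketch below checks these identities on toy data; the vectors `likes` and `ratings` are illustrative and are not the notebook's data.

```r
# Toy data (illustrative only).
likes   <- c(0, 1, 0, 3, 2, 0, 5, 1)
ratings <- c(5, 1, 4, 1, 1, 2, 1, 1)

fit <- lm(ratings ~ likes)

# Slope by hand: covariance over the variance of the predictor.
slope_manual <- cov(likes, ratings) / var(likes)

# R-squared of the model versus the squared Pearson correlation.
r        <- cor(likes, ratings)
r2_model <- summary(fit)$r.squared

print(c(unname(coef(fit)[2]), slope_manual))
print(c(r2_model, r^2))
```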
We’re going to do the same for the variable Angry. Let’s run Pearson’s test.

cor.test(df$rating,df$Angry, method = "pearson")

## 
##  Pearson's product-moment correlation
## 
## data:  df$rating and df$Angry
## t = 2.7858, df = 1530, p-value = 0.005406
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.02103256 0.12069214
## sample estimates:
##        cor 
## 0.07103963

We can see that the p-value (0.005406) is below the significance level 0.05. That means that there is a significant correlation between the ratings and the Angry reactions. Still, an estimated correlation of 0.07103963 is weak. This time the estimated correlation is positive, which suggests that the two variables move in the same direction. Let’s see if that is really the case by building a regression model.

reg.Angry = lm(df$rating~df$Angry);reg.Angry

## 
## Call:
## lm(formula = df$rating ~ df$Angry)
## 
## Coefficients:
## (Intercept)     df$Angry  
##      1.4901       0.6386

As expected, the slope provided by the regression model is equal to 0.6386, which corresponds to an increasing regression line. The following plot illustrates this clearly.

p <- ggplot(df, aes(df$Angry, df$rating))
p <- p + geom_point(aes(size = df$Angry,colour = factor(df$rating)))
p + geom_abline(intercept = reg.Angry$coef[1], slope = reg.Angry$coef[2]) + labs(x="Angry", y="Ratings", color="rating", size="Size")

It’s clear from the plot above that the class 5 and class 4 ratings (corresponding to the most satisfied reviews) are the ones getting the most Angry reactions, and that the amount of Angry reactions received by class 1 is considerably lower than that received by class 5.
Let’s now look at how the ratings evolved before and after the event called “Black Friday”, which took place on November 24, 2017.
For this purpose, we’re going to split the dates into two categories: one before Black Friday and one after Black Friday.

bin.date= ifelse(df$date < "2017-11-24","Before BF","After BF")
aggregate(rating~bin.date, data = df, mean)

##    bin.date   rating
## 1  After BF 1.210057
## 2 Before BF 2.673540

boxplot(rating~bin.date, data = df, main="Distribution of the ratings before and after the Black Friday")

As we can see, there is a striking difference between the two plots. Before Black Friday, even though the lower marks of the box still refer to class 1, the main body of the boxplot covers all the classes from 1 to 5, which means that reviews were posted in every class. After Black Friday, by contrast, the body and all the box marks collapse onto class 1, and the remaining classes are treated as outliers. In other words, after Black Friday people became even more dissatisfied with Jumia’s services, even though they were not that satisfied before.
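The before/after split can be illustrated on a few rows. The sketch below, using a toy data frame, compares against an explicit `Date` object and shows that, with a strict `<`, reviews posted on November 24 itself fall into the “After BF” group; the names `toy`, `black_friday` and `means` are illustrative.

```r
# Toy reviews around the event (illustrative values only).
toy <- data.frame(
  date   = as.Date(c("2017-11-20", "2017-11-23", "2017-11-24", "2017-11-27")),
  rating = c(5, 3, 1, 1)
)

black_friday <- as.Date("2017-11-24")

# Strict inequality: the day of the event itself counts as "After BF".
toy$period <- ifelse(toy$date < black_friday, "Before BF", "After BF")

# Mean rating per period.
means <- tapply(toy$rating, toy$period, mean)
print(toy)
print(means)
```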

Jumia black Friday Analysis

Ayoub RMIDI and Basma ESSATOUTI

ENSIAS - Master SDBD

5 December 2017
