Loading in the data
gps <- read.csv("googleplaystore.csv")
Data Cleaning
The following needs to be removed/ cleaned:
1) Record of an App by name Life made wi-fi….
2) The Current Version of the App and Android Version have to be processed to display the current sequence (12.1.2 is 12)
3) The Special characters in Installs(+), Price($), Size(MB) should be removed
4) All the redundant records should be removed
5) The levels in Installs should be fixed to a Descending order
6) Geners should be removed, it is the same as Category and makes it redundant
7) All the other Nas should be removed
8) Fixing Content Rating by combining like levels
gps <- gps[gps$App != "Life Made WI-Fi Touchscreen Photo Frame",] %>% na.omit() %>% unique() %>% select(-Genres)
gps$Current.Ver <- gps$Current.Ver %>% as.character()
i <- 1
for(i in c(seq(1, 8892, by = 1))){
a <- gps$Current.Ver[i]
b <- b <- unlist(strsplit(a, "[.]"))
b <- b[1]
gps$Current.Ver[i] = b
rm(b)
rm(a)
i <- i + 1
}
rm(i)
gps$Android.Ver <- gps$Android.Ver %>% as.character()
i <- 1
for(i in c(seq(1, 8892, by = 1))){
a <- gps$Android.Ver[i]
b <- b <- unlist(strsplit(a, "[.]"))
b <- b[1]
gps$Android.Ver[i] <- b
rm(b)
rm(a)
i <- i + 1
}
rm(i)
gps$Category <- gps$Category %>% droplevels()
gps$Installs = gsub("[^0-9.]", '', (gps$Installs %>% as.character()))
gps$Installs = as.numeric(gps$Installs)
gps$Size = gsub("[^0-9.]", '', (gps$Size %>% as.character()))
gps$Size = as.numeric(gps$Size)
gps$Price = gsub("[^0-9.]", '', (gps$Price %>% as.character()))
gps$Price = as.numeric(gps$Price)
gps$Installs <- factor(gps$Installs, levels = c("1000000000", "500000000", "100000000", "50000000", "10000000", "5000000", "1000000", "500000", "100000", "50000", "10000", "5000", "1000", "500", "100", "50", "10", "5", "1"))
gps$Content.Rating <- gps$Content.Rating %>% as.character() %>% word(1) %>% as.factor() %>% droplevels()
Different Categories and their presence in the market
Of the different kinds of people present in this word, Which category is globally accepted?
What are the top categories that perform the best?
gps_cb <- gps %>% na.omit() %>% group_by(Category) %>% summarize(Sum = sum(Rating), count = n(), `Normalized Rating` = sum(Rating)/n()) %>% arrange(desc(`Normalized Rating`))
kable(head(gps_cb, 7), caption = "Top 7 categories based on Normalized Rating")
| Category | Sum | count | Normalized Rating |
|---|---|---|---|
| EVENTS | 170.2 | 38 | 4.478947 |
| EDUCATION | 416.1 | 95 | 4.380000 |
| ART_AND_DESIGN | 257.3 | 59 | 4.361017 |
| PARENTING | 191.3 | 44 | 4.347727 |
| PERSONALIZATION | 1206.6 | 279 | 4.324731 |
| BOOKS_AND_REFERENCE | 618.0 | 143 | 4.321678 |
| BEAUTY | 158.8 | 37 | 4.291892 |
kable(tail(gps_cb, 7), caption = "Bottom 7 categories based on Normalized Rating")
| Category | Sum | count | Normalized Rating |
|---|---|---|---|
| COMMUNICATION | 844.1 | 206 | 4.097573 |
| LIFESTYLE | 1117.4 | 273 | 4.093040 |
| TRAVEL_AND_LOCAL | 591.9 | 147 | 4.026531 |
| VIDEO_PLAYERS | 467.0 | 116 | 4.025862 |
| MAPS_AND_NAVIGATION | 381.3 | 95 | 4.013684 |
| TOOLS | 2543.0 | 634 | 4.011041 |
| DATING | 558.3 | 141 | 3.959574 |
gps %>% filter(Category %in% (gps_cb$Category %>% head(15))) %>% ggplot(aes(x = Category, y = Rating, fill = Category)) + geom_violin(color = "black") + theme(legend.position = "none") + geom_hline(size = 0.6, color = "red", linetype = 'dashed', alpha = 0.5, yintercept = mean(gps$Rating %>% na.omit()))+coord_flip()+labs(y = "Rating", x = "Category", title = "Category Vs Rating Violin Plot")
rm(gps_cb)
Events, Education and Personalization are the best in overall rating.
While Dating, Tools, Maps and Navigation are the worst rated app categories.
Subtle and Significant Category Forensics
Which category doesn’t perform well on a overall scale?
What are the total number of installs for different categories?
data <- gps %>% group_by(Category) %>% summarise(Count = n(), `Total installs` = sum(as.numeric(as.character(Installs))), `Normalized Installations` = `Total installs`/Count)
data %>%
ggplot(aes(x = reorder(Category, Count), fill = Category, y = Count))+geom_bar(stat = "identity", color = "black")+theme(legend.position = "none")+coord_flip()+
labs(y = "Number of Apps present", x = "Category", title = "Number of Apps present in Different Category")
data %>%
ggplot(aes(x = reorder(Category, `Total installs`), fill = Category, y = `Total installs`))+geom_bar(stat = "identity", color = "black")+theme(legend.position = "none")+coord_flip()+
labs(y = "Total number of installs", x = "Category", title = "Total Number of Installations per Category")
data %>%
ggplot(aes(x = reorder(Category, `Normalized Installations`), fill = Category, y = `Normalized Installations`))+geom_bar(stat = "identity", color = "black")+theme(legend.position = "none")+coord_flip()+
labs(y = "Total number of normalized installations", x = "Category", title = "Total Number of Normalized Installations per Category")
info <- cbind((data %>% arrange(desc(data$Count)) %>% select(Category)),
(data %>% arrange(desc(data$`Total installs`)) %>% select(Category)),
(data %>% arrange(desc(data$`Normalized Installations`)) %>% select(Category)))
names(info) <- c("By Count", "By Total Installs", "By Total Normalized Installs")
kable(head(info, 7), caption = "Top 7 categories")
| By Count | By Total Installs | By Total Normalized Installs |
|---|---|---|
| FAMILY | GAME | COMMUNICATION |
| GAME | COMMUNICATION | SOCIAL |
| TOOLS | SOCIAL | VIDEO_PLAYERS |
| PRODUCTIVITY | PRODUCTIVITY | PRODUCTIVITY |
| FINANCE | TOOLS | PHOTOGRAPHY |
| PERSONALIZATION | FAMILY | TRAVEL_AND_LOCAL |
| COMMUNICATION | PHOTOGRAPHY | GAME |
kable(tail(info, 7), caption = "Bottom 7 categories")
| By Count | By Total Installs | By Total Normalized Installs | |
|---|---|---|---|
| 27 | HOUSE_AND_HOME | LIBRARIES_AND_DEMO | COMICS |
| 28 | LIBRARIES_AND_DEMO | COMICS | LIBRARIES_AND_DEMO |
| 29 | ART_AND_DESIGN | AUTO_AND_VEHICLES | AUTO_AND_VEHICLES |
| 30 | COMICS | MEDICAL | BEAUTY |
| 31 | PARENTING | PARENTING | PARENTING |
| 32 | EVENTS | BEAUTY | EVENTS |
| 33 | BEAUTY | EVENTS | MEDICAL |
rm(info)
Family has the highest competition (number of apps) followed by tools which don’t seem to be rewarding.
Game, and Productivity do well wrt to number of installs even though they have a lot of competition.
Communication and Social are two domains where there is a surge in downloads and the number of competitors are also less.
Comments, Parenting, and Events can be seen as domains that one must ignore as there are very few downloads and so is the competition. Or they can also be seen as domains that have huge potential and are underdog markets that can be exploited with a disruptive idea.
Content Type
What is the distribution of apps of various content?
Is there an effect of content type of installs?
How are they priced?
gps %>% group_by(Content.Rating) %>% summarise(count = n()) %>%
ggplot(aes(x = reorder(Content.Rating, -count), y = count, fill = Content.Rating))+ geom_bar(stat = "identity", color = "black")+
scale_fill_viridis(discrete = TRUE) +
labs(y = "Count", x = "Content Type", title = "Content type and their presence")
Apps for everyone are present in bulk
Installs vs Rating per content type
gps %>%
ggplot(aes(x = Rating, y = as.integer(as.character(gps$Installs)), color = Content.Rating))+
geom_point()+
facet_wrap(~Content.Rating) + geom_vline(xintercept = 5 ,colour = 'red', linetype = 'dashed') + geom_vline(xintercept = 4 ,colour = 'red', linetype = 'dashed') + geom_hline(yintercept = 1001000000 ,colour = 'red', linetype = 'dashed') + geom_hline(yintercept = 751000000 ,colour = 'red', linetype = 'dashed')
Everyone and Teens content type attract the most number of installs and a decent rating.
Rating analysis
App ratings (on a scale of 1 to 5) impact the discoverability, and indicates the company’s overall brand image. Ratings are a key performance indicator of an app.
gps %>% ggplot(aes(x = Rating))+geom_histogram(binwidth = 0.1, fill = "purple")+geom_vline(color = "red", linetype = 'dashed', xintercept = mean(gps$Rating %>% na.omit()))+theme(legend.position = "none")+labs(y = "Count", x = "Rating", title = "Rating Histogram")
Most of the apps have an average rating above 4 due to which there is a significant skewness present in the rating part.
Ratings are right skewed with major apps being highly rated and only few with a low rating.
This indicates that most of the apps in the Market are sized below 25mb.
The applications with large size can make it difficult for users to download. Higher download durations could upset the user and ignore the app.
More importantly the memory a user has in the device is limited. This is to be tested.
Let’s investigate the same with the data.
The distribution of size
gps %>% ggplot(aes(x = Size))+geom_histogram(binwidth = 1, fill = "purple")+theme(legend.position = "none")+labs(y = "Count", x = "Size (MB)", title = "Size distribution Histogram")
gps %>% filter(Size < 120) %>% ggplot(aes(x = Size))+geom_histogram(binwidth = 1, fill = "purple")+theme(legend.position = "none")+labs(y = "Count", x = "Size (MB)", title = "Size distribution Histogram (Sizes below 120Mb)")
Most of the apps in the Market are sized below 25mb.
Next is the size.
The applications with large size can make it difficult for users to download. Higher download durations could upset the user and ignore the app.
More importantly the memory a user has in the device is limited. This is to be tested.
Impact of Size on Rating
p <- gps %>% ggplot(aes(x = Size, y = Rating))+geom_point()+labs(y = "Rating", x = "Size (MB)", title = "Size Vs Rating")
ggMarginal(p, type = "histogram")
p <- gps %>% filter(Size < 100) %>% ggplot(aes(x = Size, y = Rating))+geom_point()+labs(y = "Rating", x = "Size (MB)", title = "Size Vs Rating (Sizes below 100)")
ggMarginal(p, type = "histogram")
p <- gps %>% filter(Size < 10) %>% ggplot(aes(x = Size, y = Rating))+geom_point()+labs(y = "Rating", x = "Size (MB)", title = "Size Vs Rating (Sizes below 10)")
ggMarginal(p, type = "histogram")
rm(p)
Majority of apps with high rating (rating over 4) are between 1 MB to 10 MB.
What is the variation in the number of insallations
gps %>% ggplot(aes(x = Installs)) +
geom_histogram(stat = 'count', bins = 100, fill = "purple", color = 'black')+
coord_flip() + scale_y_continuous(breaks = seq(0, 1500, 100))+labs(y = "Count", x = "Installs Range", title = "Installs distribution Histogram")
The Installs are perfectly distributed as expected. The high end market is very competitive and very few apps fall into that space, while most of the others fall behind the race getting in mediocre downloads. But the interesting part is that there are a lot of apps that fall into 1M downloads point. This is interesting as this gives a hope to a lot of designers as there is scope.
data <- gps %>% group_by(Installs) %>% summarise(Count = n()) %>% arrange(desc(Installs))
data <- data %>% mutate(product = as.numeric(as.character(data$Installs))*data$Count, `market share` = product/146625951338*100, `market share per app in the segment` = `market share`/Count)
names(data) <- c("Installs Range", "Total Number of apps present", "Total Installations","Market share of the segment", "Market share per app in that segment")
kable(data, caption = "Installations and their contribution")
| Installs Range | Total Number of apps present | Total Installations | Market share of the segment | Market share per app in that segment |
|---|---|---|---|---|
| 1 | 3 | 3 | 0.0000000 | 0.0000000 |
| 5 | 9 | 45 | 0.0000000 | 0.0000000 |
| 10 | 69 | 690 | 0.0000005 | 0.0000000 |
| 50 | 56 | 2800 | 0.0000019 | 0.0000000 |
| 100 | 303 | 30300 | 0.0000207 | 0.0000001 |
| 500 | 199 | 99500 | 0.0000679 | 0.0000003 |
| 1000 | 698 | 698000 | 0.0004760 | 0.0000007 |
| 5000 | 426 | 2130000 | 0.0014527 | 0.0000034 |
| 10000 | 989 | 9890000 | 0.0067451 | 0.0000068 |
| 50000 | 462 | 23100000 | 0.0157544 | 0.0000341 |
| 100000 | 1110 | 111000000 | 0.0757028 | 0.0000682 |
| 500000 | 516 | 258000000 | 0.1759579 | 0.0003410 |
| 1000000 | 1486 | 1486000000 | 1.0134632 | 0.0006820 |
| 5000000 | 683 | 3415000000 | 2.3290556 | 0.0034100 |
| 10000000 | 1132 | 11320000000 | 7.7203250 | 0.0068201 |
| 50000000 | 272 | 13600000000 | 9.2753021 | 0.0341004 |
| 100000000 | 369 | 36900000000 | 25.1660771 | 0.0682008 |
| 500000000 | 61 | 30500000000 | 20.8012291 | 0.3410038 |
| 1000000000 | 49 | 49000000000 | 33.4183680 | 0.6820075 |
The top five segments make up 96.38% of the market share. This implies that an app present in the top-most installion segment make up close 0.7% of market.
Impact of Price Does Price have anything to do with the number of installs?
gps %>% ggplot(aes(x = Type, y = as.numeric(as.character(gps$Installs)), fill = Type))+geom_boxplot()+scale_fill_manual(values=c('#E69F00', '#56B4E9'))+labs(y = "Installs", x = "Free or Paid", title = "Influence of type of availability on Installations")
#This is because of the outliers present at a very high value. This can be fixed by using the log transformation.
gps %>%
ggplot(aes(x = Type, y = log(as.numeric(as.character(gps$Installs))), fill = Type))+geom_boxplot()+scale_fill_manual(values=c('#E69F00', '#56B4E9'))+labs(y = "Installs after log transformation", x = "Free or Paid", title = "Influence of type of availability on Installations")
NA
As expected, free apps have higher installations than paid.
Price Vs Category
gps %>%
ggplot(aes(x = Price, y = Category, color = Category))+
geom_jitter() + theme(legend.position = "none")+labs(y = "Category", x = "Price", title = "Price Vs Category")
gps %>% select(c("App", "Price")) %>% arrange(desc(Price)) %>% head(20)
The top 15 most expensive apps appear to be junk.
Ignoring them and proceeding with the analysis gives
gps %>% filter(Price < 80) %>%
ggplot(aes(x = Price, y = Category, color = Category))+
geom_jitter() + theme(legend.position = "none")+labs(y = "Category", x = "Price", title = "Price Vs Category after cleaning junk")
The most expensive apps belong to Medical and Lifestyle followed by Family.
TEXT MINING - Sentimental Analysis
Word Cloud Function
wordcloud_c <- function(data, num_words = 100, background = "white") {
# If text is provided, convert it to a dataframe of word frequencies
if (is.character(data)) {
corpus <- Corpus(VectorSource(data))
corpus <- tm_map(corpus, tolower)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
tdm <- as.matrix(TermDocumentMatrix(corpus))
data <- sort(rowSums(tdm), decreasing = TRUE)
data <- data.frame(word = names(data), freq = as.numeric(data))
}
# Make sure a proper num_words is provided
if (!is.numeric(num_words) || num_words < 3) {
num_words <- 3
}
# Grab the top n most common words
data <- head(data, n = num_words)
if (nrow(data) == 0) {
return(NULL)
}
wordcloud2(data, backgroundColor = background)
}
SA - Review Analysis
The reviews had some special characters, which had to be removed.
This data is grouped with the data used for EDA and all the irrelevant attributes are dropped.
The Sentiment polarity and Sentiment Subjectivity were then grouped with Category to see how they impact it.
gspr <- read.csv("googleplaystore_user_reviews.csv") %>% na.omit()
gspr$App <- as.character(gspr$App)
gspr$Translated_Review <- as.character(gspr$Translated_Review)
data <- gps %>% select(-c("Reviews", "Price", "Last.Updated", "Android.Ver")) %>% as.data.frame()
data$App <- as.character(data$App)
data <- data %>% inner_join(gspr, by = "App")
#gspr$Languages <- textcat(gspr$Translated_Review)
#gspr <- gspr %>% filter(Languages == "english")
#wordcloud_c(gspr[,2], 5)
#get_sentiments(gspr[,2])
#All the sentiment extractions has already been done in the data so proceding to crunch in that info
data$Sentiment_Polarity <- round(data$Sentiment_Polarity, 2)
data$Sentiment_Subjectivity <- round(data$Sentiment_Subjectivity, 2)
Word Clouds for Positive, Neutral and Negative Reviews
data_po <- data %>% filter(Sentiment_Polarity == 1) %>% select("Translated_Review")
data_nu <- data %>% filter(Sentiment_Polarity == 0) %>% select("Translated_Review")
data_ne <- data %>% filter(Sentiment_Polarity == -1) %>% select("Translated_Review")
wordcloud_c(data_po[,1], 100)
wordcloud_c(data_nu[,1], 100)
wordcloud_c(data_ne[,1], 100)
rm(data_ne)
rm(data_nu)
rm(data_po)
SA - Text Forensics
das <- data %>% select(-"Translated_Review")
das <- das %>% group_by(Category, Sentiment) %>% summarise(Count = n(), NP = round(sum(Sentiment_Polarity)/n(),2))
das <- das %>% mutate(s = Sentiment)
das <- das %>% spread(Sentiment, NP)
names(das) <- c("Category", "Count", "s", "Ne_S", "Nu_S", "Po_S")
das <- das %>% spread(s, Count)
names(das) <- c("Category", "Ne_S", "Nu_S", "Po_S", "Ne_C", "Nu_C", "Po_C")
das[is.na(das)] <- 0
das <- das %>% group_by(Category) %>% summarise(Nu_C = sum(Nu_C), Ne_C = sum(Ne_C), Po_C = sum(Po_C), Nu_S = sum(Nu_S), Ne_S = sum(Ne_S), Po_S = sum(Po_S))
das %>% ggplot(aes(y = Po_S, x = reorder(Category, Po_S), fill = Category))+geom_bar(stat = "identity")+theme(legend.position = "none")+coord_flip()+
labs(y = "Sentiment Polarity", x = "Category", title = "Strength of Positive Sentiment Vs Category")
das %>% ggplot(aes(y = Po_C, x = reorder(Category, Po_C), fill = Category))+geom_bar(stat = "identity")+theme(legend.position = "none")+coord_flip()+
labs(y = "Number of Positive Reviews", x = "Category", title = "Number of Positive Vs Category")
das %>% ggplot(aes(y = Ne_S, x = reorder(Category, -Ne_S), fill = Category))+geom_bar(stat = "identity")+theme(legend.position = "none")+coord_flip()+
labs(y = "Sentiment Polarity", x = "Category", title = "Strength of Negative Sentiment Vs Category")
das %>% ggplot(aes(y = Ne_C, x = reorder(Category, Ne_C), fill = Category))+geom_bar(stat = "identity")+theme(legend.position = "none")+coord_flip()+
labs(y = "Number of Negative Reviews", x = "Category", title = "Number of Negative Vs Category")
das %>% ggplot(aes(y = (Po_C-Ne_C)/(Ne_C), x = reorder(Category, ((Po_C-Ne_C)/(Ne_C))), fill = Category))+geom_bar(stat = "identity")+theme(legend.position = "none")+coord_flip()+
labs(y = "Percentage Number of Positive Reviews wrt Negative Reviews", x = "Category", title = "Percentage of Positive Reviews more than Negative Reviews")
das %>% ggplot(aes(y = (Po_S-Ne_S), x = reorder(Category, ((Po_S-Ne_S))), fill = Category))+geom_bar(stat = "identity")+theme(legend.position = "none")+coord_flip()+
labs(y = "Relative polarity of Positive Reviews wrt Negative Reviews", x = "Category", title = "Relative polarity Positive Reviews wrt Negative Reviews")
rm(das)
Even though beauty has high negative reviews it has the highest polarity difference between positive and negative among all the categories. This category can be ignored due to low number of reviews.
Comics has the highest relative number of positive reviews compared to that of negative.
Almost all the categories have similar intensities when it comes to positive or negative reactions. Health and fitness received the most positive reactions compared to that of the negative ones.
Content Vs Reviews
das <- data %>% select(-c("Translated_Review", "App"))
das %>% ggplot(aes(x = as.factor(Content.Rating), fill = Content.Rating))+geom_histogram(stat = "count")+labs(y = "Number of Reviews", x = "Content Type", fill = "Content Type", title = "Number of Reviews per content")+scale_fill_viridis(discrete = TRUE)+facet_wrap(~Sentiment)+coord_flip()
das %>% ggplot(aes(x = as.factor(Content.Rating), y = Sentiment_Polarity, color = Content.Rating))+geom_boxplot()+labs(y = "Polarity", x = "Content Type", title = "Content Type Vs the intensity of reviews")
das %>% ggplot(aes(x = as.factor(Content.Rating), fill = Sentiment))+geom_bar(stat = "count", position = "fill")+labs(y = "Number of Reviews", x = "Content Type", fill = "Content Type", title = "Number of Reviews per content")+scale_fill_viridis(discrete = TRUE)
As expected there Everyone has the highest number of reviews in all categories. This could be because the it has the highest number of apps in the market.
Adult followed by Mature Apps have higher number of strong positive reviews while Teens have the higher number of strong negative reviews.
Machine Learning - Pre processing
Here apart from general preprocessing, I am defining the success to be installations equal to or greater than 500,000 i.e. the app must take 0.00034% of the total market downloads we classify result to be a “hit”, and else it’s “flop”.
The correlation among various variable is as follows:
rm(das)
rm(data)
data <- gps %>% select(c("Category", "Rating", "Reviews", "Size", "Type", "Content.Rating", "Current.Ver", "Installs"))
glimpse(data)
Observations: 8,892
Variables: 8
$ Category [3m[38;5;246m<fct>[39m[23m ART_AND_DESIGN, ART_AND_DE...
$ Rating [3m[38;5;246m<dbl>[39m[23m 4.1, 3.9, 4.7, 4.5, 4.3, 4...
$ Reviews [3m[38;5;246m<fct>[39m[23m 159, 967, 87510, 215644, 9...
$ Size [3m[38;5;246m<dbl>[39m[23m 19.0, 14.0, 8.7, 25.0, 2.8...
$ Type [3m[38;5;246m<fct>[39m[23m Free, Free, Free, Free, Fr...
$ Content.Rating [3m[38;5;246m<fct>[39m[23m Everyone, Everyone, Everyo...
$ Current.Ver [3m[38;5;246m<chr>[39m[23m "1", "2", "1", "Varies wit...
$ Installs [3m[38;5;246m<fct>[39m[23m 10000, 500000, 5000000, 50...
data$Reviews <- as.integer(as.character(data$Reviews))
data$Current.Ver <- as.integer(as.character(data$Current.Ver))
data$Installs <- as.integer(as.character(data$Installs))
data <- data %>% na.omit()
data$Rating <- as.numeric(as.character(data$Rating))
glimpse(data)
Observations: 7,238
Variables: 8
$ Category [3m[38;5;246m<fct>[39m[23m ART_AND_DESIGN, ART_AND_DE...
$ Rating [3m[38;5;246m<dbl>[39m[23m 4.1, 3.9, 4.7, 4.3, 4.4, 3...
$ Reviews [3m[38;5;246m<int>[39m[23m 159, 967, 87510, 967, 167,...
$ Size [3m[38;5;246m<dbl>[39m[23m 19.0, 14.0, 8.7, 2.8, 5.6,...
$ Type [3m[38;5;246m<fct>[39m[23m Free, Free, Free, Free, Fr...
$ Content.Rating [3m[38;5;246m<fct>[39m[23m Everyone, Everyone, Everyo...
$ Current.Ver [3m[38;5;246m<int>[39m[23m 1, 2, 1, 1, 1, 1, 6, 2, 2,...
$ Installs [3m[38;5;246m<int>[39m[23m 10000, 500000, 5000000, 10...
cor(data %>% select(c("Rating", "Reviews", "Size", "Current.Ver", "Installs")))
Rating Reviews Size
Rating 1.000000000 0.080283728 -0.02059633
Reviews 0.080283728 1.000000000 0.03373306
Size -0.020596331 0.033733058 1.00000000
Current.Ver -0.002349692 -0.005979475 0.02677096
Installs 0.053719879 0.643536646 0.01538013
Current.Ver Installs
Rating -0.002349692 0.053719879
Reviews -0.005979475 0.643536646
Size 0.026770959 0.015380130
Current.Ver 1.000000000 -0.005627174
Installs -0.005627174 1.000000000
corrplot(cor(data %>% select(c("Rating", "Reviews", "Size", "Current.Ver", "Installs"))), method = "color", title = "Correlation Plot", mar=c(0,0,1,0))
data <- data %>% mutate(Result = as.factor(ifelse(Installs >= 500000, "Hit", "Flop"))) %>% select(-Installs)
Installs has the highest correlation with reviews
Machine Learning - Data Partition
normalize <- function(x){
return ((x-min(x))/(max(x)-min(x)))
}
data.ml<- cbind(as.data.frame(lapply(data[,c(2,3,4,7)],normalize)), data[,-c(2,3,4,7)])
set.seed(123)
trainindex <- createDataPartition(data.ml$Result, p=0.8, list= FALSE)
trainn <- data.ml[trainindex, ]
testn <- data.ml[-trainindex, ]
set.seed(123)
trainindex <- createDataPartition(data.ml$Result, p=0.8, list= FALSE)
training <- data.ml[trainindex, ]
testing <- data.ml[-trainindex, ]
I split the data into two parts training and testing using data partition at 80% split. I then normalized the continuous variables of the data for models that require normalized input and added the data that is not normalized to models which don’t require normalized inputs.
I have used the following models on the data to classify the records as hit or flop:
Logistic Regression
Decision Trees
Random Forest
Naïve Bayes Classifier
SVM with Default Cost parameter
Tuned SVM (Couldn’t add it as it takes more than 12 hours to run)
These models were run upon all the attributes with Result (“Hit” or “Flop”) being the target attribute.
The outputs of the models were as follows:
Machine Learning - Modelling - Logistic Regression
rm(trainindex)
rm(data.ml)
train <- trainn
test <- testn
logit.reg <- glm(Result ~.,data = train, family = "binomial")
summary(logit.reg)
Call:
glm(formula = Result ~ ., family = "binomial", data = train)
Deviance Residuals:
Min 1Q Median 3Q Max
-5.1087 -0.3290 -0.0005 0.0000 2.7745
Coefficients:
Estimate Std. Error
(Intercept) -10.93293 1455.39801
Rating -1.88838 0.39186
Reviews 25958.79073 991.38262
Size 0.32449 0.63443
Current.Ver 1.08064 1.52771
CategoryAUTO_AND_VEHICLES 0.03749 0.72313
CategoryBEAUTY 0.50437 0.81195
CategoryBOOKS_AND_REFERENCE -0.34906 0.71308
CategoryBUSINESS -0.58562 0.67192
CategoryCOMICS -1.56059 1.05006
CategoryCOMMUNICATION -0.98232 0.79991
CategoryDATING -0.70588 0.76343
CategoryEDUCATION 0.76883 0.73797
CategoryENTERTAINMENT 0.17953 0.97639
CategoryEVENTS -1.53144 1.26783
CategoryFAMILY -0.35114 0.60144
CategoryFINANCE -1.41404 0.68849
CategoryFOOD_AND_DRINK -0.25462 0.77866
CategoryGAME -0.47244 0.63475
CategoryHEALTH_AND_FITNESS -0.23195 0.67832
CategoryHOUSE_AND_HOME 0.77182 0.74567
CategoryLIBRARIES_AND_DEMO 0.05045 0.78375
CategoryLIFESTYLE -0.40941 0.65481
CategoryMAPS_AND_NAVIGATION -0.69873 0.84434
CategoryMEDICAL -1.35866 0.70991
CategoryNEWS_AND_MAGAZINES -0.71658 0.74933
CategoryPARENTING 0.54597 0.82074
CategoryPERSONALIZATION 0.19263 0.67100
CategoryPHOTOGRAPHY 0.44799 0.67702
CategoryPRODUCTIVITY -0.39227 0.67103
CategorySHOPPING 0.15645 0.70957
CategorySOCIAL -0.74732 0.78432
CategorySPORTS -0.78227 0.68603
CategoryTOOLS -0.41091 0.62426
CategoryTRAVEL_AND_LOCAL 0.01793 0.69436
CategoryVIDEO_PLAYERS 0.00502 0.72942
CategoryWEATHER 0.52329 0.82155
TypePaid -13.67281 0.94806
Content.RatingEveryone 9.74847 1455.39787
Content.RatingMature 10.02771 1455.39790
Content.RatingTeen 8.89360 1455.39787
Content.RatingUnrated -3.44570 2058.24319
z value
(Intercept) -0.008
Rating -4.819
Reviews 26.184
Size 0.511
Current.Ver 0.707
CategoryAUTO_AND_VEHICLES 0.052
CategoryBEAUTY 0.621
CategoryBOOKS_AND_REFERENCE -0.490
CategoryBUSINESS -0.872
CategoryCOMICS -1.486
CategoryCOMMUNICATION -1.228
CategoryDATING -0.925
CategoryEDUCATION 1.042
CategoryENTERTAINMENT 0.184
CategoryEVENTS -1.208
CategoryFAMILY -0.584
CategoryFINANCE -2.054
CategoryFOOD_AND_DRINK -0.327
CategoryGAME -0.744
CategoryHEALTH_AND_FITNESS -0.342
CategoryHOUSE_AND_HOME 1.035
CategoryLIBRARIES_AND_DEMO 0.064
CategoryLIFESTYLE -0.625
CategoryMAPS_AND_NAVIGATION -0.828
CategoryMEDICAL -1.914
CategoryNEWS_AND_MAGAZINES -0.956
CategoryPARENTING 0.665
CategoryPERSONALIZATION 0.287
CategoryPHOTOGRAPHY 0.662
CategoryPRODUCTIVITY -0.585
CategorySHOPPING 0.220
CategorySOCIAL -0.953
CategorySPORTS -1.140
CategoryTOOLS -0.658
CategoryTRAVEL_AND_LOCAL 0.026
CategoryVIDEO_PLAYERS 0.007
CategoryWEATHER 0.637
TypePaid -14.422
Content.RatingEveryone 0.007
Content.RatingMature 0.007
Content.RatingTeen 0.006
Content.RatingUnrated -0.002
Pr(>|z|)
(Intercept) 0.9940
Rating 0.00000144 ***
Reviews < 0.0000000000000002 ***
Size 0.6090
Current.Ver 0.4793
CategoryAUTO_AND_VEHICLES 0.9587
CategoryBEAUTY 0.5345
CategoryBOOKS_AND_REFERENCE 0.6245
CategoryBUSINESS 0.3835
CategoryCOMICS 0.1372
CategoryCOMMUNICATION 0.2194
CategoryDATING 0.3552
CategoryEDUCATION 0.2975
CategoryENTERTAINMENT 0.8541
CategoryEVENTS 0.2271
CategoryFAMILY 0.5593
CategoryFINANCE 0.0400 *
CategoryFOOD_AND_DRINK 0.7437
CategoryGAME 0.4567
CategoryHEALTH_AND_FITNESS 0.7324
CategoryHOUSE_AND_HOME 0.3006
CategoryLIBRARIES_AND_DEMO 0.9487
CategoryLIFESTYLE 0.5318
CategoryMAPS_AND_NAVIGATION 0.4079
CategoryMEDICAL 0.0556 .
CategoryNEWS_AND_MAGAZINES 0.3389
CategoryPARENTING 0.5059
CategoryPERSONALIZATION 0.7741
CategoryPHOTOGRAPHY 0.5082
CategoryPRODUCTIVITY 0.5588
CategorySHOPPING 0.8255
CategorySOCIAL 0.3407
CategorySPORTS 0.2542
CategoryTOOLS 0.5104
CategoryTRAVEL_AND_LOCAL 0.9794
CategoryVIDEO_PLAYERS 0.9945
CategoryWEATHER 0.5242
TypePaid < 0.0000000000000002 ***
Content.RatingEveryone 0.9947
Content.RatingMature 0.9945
Content.RatingTeen 0.9951
Content.RatingUnrated 0.9987
---
Signif. codes:
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 7960.4 on 5790 degrees of freedom
Residual deviance: 1925.0 on 5749 degrees of freedom
AIC: 2009
Number of Fisher Scoring iterations: 14
logit.reg <- glm(Result ~.,data = train %>% select(-"Category"), family = "binomial")
summary(logit.reg)
Call:
glm(formula = Result ~ ., family = "binomial", data = train %>%
select(-"Category"))
Deviance Residuals:
Min 1Q Median 3Q Max
-5.0934 -0.3408 -0.0006 0.0000 2.6391
Coefficients:
Estimate Std. Error z value
(Intercept) -11.6063 1455.3979 -0.008
Rating -1.8041 0.3704 -4.870
Reviews 25796.9828 957.4169 26.944
Size 0.2645 0.6097 0.434
Current.Ver 0.9836 1.4986 0.656
TypePaid -13.3421 0.9482 -14.071
Content.RatingEveryone 10.0192 1455.3979 0.007
Content.RatingMature 10.1071 1455.3979 0.007
Content.RatingTeen 9.1242 1455.3979 0.006
Content.RatingUnrated -3.2440 2058.2432 -0.002
Pr(>|z|)
(Intercept) 0.994
Rating 0.00000112 ***
Reviews < 0.0000000000000002 ***
Size 0.664
Current.Ver 0.512
TypePaid < 0.0000000000000002 ***
Content.RatingEveryone 0.995
Content.RatingMature 0.994
Content.RatingTeen 0.995
Content.RatingUnrated 0.999
---
Signif. codes:
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 7960.4 on 5790 degrees of freedom
Residual deviance: 1984.3 on 5781 degrees of freedom
AIC: 2004.3
Number of Fisher Scoring iterations: 14
#confint(logit.reg) to check for confidence intervals
test$predraw <- predict(logit.reg, newdata = test %>% select(-c("Category", "Result")), type = "response")
par <- data.frame("Cut off" = 1, "Accuracy" = 1, "Sensitivity" = 1)
i = 0
j = 1
for(i in seq(0,1,0.001)){
test <- test %>% mutate(pred = factor(as.character(ifelse(predraw < i, "Flop", "Hit")), levels = c("Hit", "Flop")))
par[j,2] <- 1-mean(test$Result != test$pred)
par[j,1] <- i
test1 <- test %>% filter(Result == "Hit")
par[j,3] <- mean(test1$Result == test1$pred)
j <- j+1
i <- i+1
rm(test1)
}
rm(i)
rm(j)
kable(par %>% filter(Accuracy == max(Accuracy)) %>% head(5), caption = "The cut off for the best accuracy")
| Cut.off | Accuracy | Sensitivity |
|---|---|---|
| 0.370 | 0.9502419 | 0.9162791 |
| 0.371 | 0.9502419 | 0.9162791 |
| 0.372 | 0.9502419 | 0.9147287 |
| 0.373 | 0.9502419 | 0.9147287 |
| 0.374 | 0.9502419 | 0.9147287 |
par %>%ggplot(aes(x = Cut.off, y = par$Accuracy))+geom_point(size = 0.05, color = "gray")+geom_line()+geom_hline(yintercept = max(par$Accuracy), color = "red")+geom_vline(xintercept = 0.375, color = "red")+
labs(x = "Cut Off value", y = "Accuracy of prediction", title = "Cut off Vs Accuracy")
rm(par)
test <- test %>% mutate(pred = as.factor(ifelse(predraw <= 0.35, "Flop", "Hit")))
test$pred <- factor(test$pred, levels = c("Hit", "Flop"))
test$Result <- factor(test$Result, levels = c("Hit", "Flop"))
cm <- confusionMatrix(test$pred, test$Result)
cm
Confusion Matrix and Statistics
Reference
Prediction Hit Flop
Hit 591 22
Flop 54 780
Accuracy : 0.9475
95% CI : (0.9347, 0.9584)
No Information Rate : 0.5543
P-Value [Acc > NIR] : < 0.00000000000000022
Kappa : 0.8932
Mcnemar's Test P-Value : 0.0003766
Sensitivity : 0.9163
Specificity : 0.9726
Pos Pred Value : 0.9641
Neg Pred Value : 0.9353
Prevalence : 0.4457
Detection Rate : 0.4084
Detection Prevalence : 0.4236
Balanced Accuracy : 0.9444
'Positive' Class : Hit
logit <- data.frame("Model" = "Logistic Regression", "Accuracy" = cm[["overall"]][["Accuracy"]], "Sensitivity" = cm[["byClass"]][["Sensitivity"]], "Specificity" = cm[["byClass"]][["Specificity"]])
Logistic regression has performed pretty well in all the aspects. Note: It must remembered that our class of interest is Hit.
Machine Learning - Modelling - Decision Trees
rm(test)
rm(train)
rm(logit.reg)
train <- training
test <- testing
set.seed(123)
dt <- rpart(Result ~., data = train, method = "class", cp = 0.00001, minsplit = 1, xval = 10)
dt.pruned <- prune(dt, cp = 0.004)
test$pred <- predict(dt.pruned, newdata = test[,-8], type = "class")
cm <- confusionMatrix(factor(test$pred, levels = c("Hit", "Flop")), factor(test$Result, levels = c("Hit", "Flop")))
cm
Confusion Matrix and Statistics
Reference
Prediction Hit Flop
Hit 593 28
Flop 52 774
Accuracy : 0.9447
95% CI : (0.9317, 0.9559)
No Information Rate : 0.5543
P-Value [Acc > NIR] : < 0.0000000000000002
Kappa : 0.8877
Mcnemar's Test P-Value : 0.01013
Sensitivity : 0.9194
Specificity : 0.9651
Pos Pred Value : 0.9549
Neg Pred Value : 0.9370
Prevalence : 0.4457
Detection Rate : 0.4098
Detection Prevalence : 0.4292
Balanced Accuracy : 0.9422
'Positive' Class : Hit
det <- data.frame("Model" = "Decision Trees", "Accuracy" = cm[["overall"]][["Accuracy"]], "Sensitivity" = cm[["byClass"]][["Sensitivity"]], "Specificity" = cm[["byClass"]][["Specificity"]])
rm(cm)
rpart.plot(dt.pruned, box.palette="RdBu", shadow.col="gray", nn=TRUE)
rpart.rules(dt.pruned, style = "tallw")
Result is 0.00 when
Reviews is 0.00009 to 0.00019
Type is Paid
Result is 0.00 when
Reviews is 0.00019 to 0.00053
Type is Paid
Result is 0.07 when
Reviews < 0.00009
Result is 0.68 when
Reviews is 0.00009 to 0.00019
Type is Free
Result is 0.80 when
Reviews >= 0.00053
Type is Paid
Result is 0.98 when
Reviews >= 0.00019
Type is Free
Machine Learning - Modelling - Random Forest
rm(train)
rm(test)
rm(dt)
rm(dt.pruned)
train <- training
test <- testing
rf <- randomForest(Result ~., data = train)
plot(rf, main = "Error rate of random forest")
varImpPlot(rf, main = "Gini Plot - Importance of Variables")
test$pred = predict(rf, newdata = test[-8])
cm <- confusionMatrix(factor(test$pred, levels = c("Hit", "Flop")), factor(test$Result, levels = c("Hit", "Flop")))
cm
Confusion Matrix and Statistics
Reference
Prediction Hit Flop
Hit 613 44
Flop 32 758
Accuracy : 0.9475
95% CI : (0.9347, 0.9584)
No Information Rate : 0.5543
P-Value [Acc > NIR] : <0.0000000000000002
Kappa : 0.8939
Mcnemar's Test P-Value : 0.207
Sensitivity : 0.9504
Specificity : 0.9451
Pos Pred Value : 0.9330
Neg Pred Value : 0.9595
Prevalence : 0.4457
Detection Rate : 0.4236
Detection Prevalence : 0.4540
Balanced Accuracy : 0.9478
'Positive' Class : Hit
raf <- data.frame("Model" = "Random Forest", "Accuracy" = cm[["overall"]][["Accuracy"]], "Sensitivity" = cm[["byClass"]][["Sensitivity"]], "Specificity" = cm[["byClass"]][["Specificity"]])
rm(cm)
Machine Learning - Modelling - Naive Bayes Classifier
rm(test)
rm(train)
rm(rf)
train <- training
train$Result <- factor(train$Result, levels = c("Hit", "Flop"))
test <- testing
test$Result <- factor(test$Result, levels = c("Hit", "Flop"))
delays.nb <- naiveBayes(Result~., data = train)
delays.nb
Naive Bayes Classifier for Discrete Predictors
Call:
naiveBayes.default(x = X, y = Y, laplace = laplace)
A-priori probabilities:
Y
Hit Flop
0.446037 0.553963
Conditional probabilities:
Rating
Y [,1] [,2]
Hit 0.8122145 0.08782847
Flop 0.7745324 0.16554476
Reviews
Y [,1] [,2]
Hit 0.01358894493 0.05288638212
Flop 0.00002235907 0.00006197387
Size
Y [,1] [,2]
Hit 0.03556935 0.06165665
Flop 0.03563030 0.10756489
Current.Ver
Y [,1] [,2]
Hit 0.0009565182 0.02834286
Flop 0.0017979409 0.03567101
Category
Y ART_AND_DESIGN AUTO_AND_VEHICLES BEAUTY
Hit 0.0050329075 0.0061943477 0.0034843206
Flop 0.0093516209 0.0127805486 0.0065461347
Category
Y BOOKS_AND_REFERENCE BUSINESS COMICS
Hit 0.0154858691 0.0158730159 0.0034843206
Flop 0.0233790524 0.0417705736 0.0084164589
Category
Y COMMUNICATION DATING EDUCATION
Hit 0.0313588850 0.0174216028 0.0174216028
Flop 0.0249376559 0.0224438903 0.0068578554
Category
Y ENTERTAINMENT EVENTS FAMILY
Hit 0.0178087495 0.0007742935 0.2040263260
Flop 0.0028054863 0.0068578554 0.2319201995
Category
Y FINANCE FOOD_AND_DRINK GAME
Hit 0.0240030972 0.0147115757 0.1966705381
Flop 0.0464463840 0.0102867830 0.0692019950
Category
Y HEALTH_AND_FITNESS HOUSE_AND_HOME
Hit 0.0282617112 0.0096786682
Flop 0.0246259352 0.0046758105
Category
Y LIBRARIES_AND_DEMO LIFESTYLE
Hit 0.0054200542 0.0247773906
Flop 0.0112219451 0.0458229426
Category
Y MAPS_AND_NAVIGATION MEDICAL
Hit 0.0116144019 0.0081300813
Flop 0.0130922693 0.0620324190
Category
Y NEWS_AND_MAGAZINES PARENTING
Hit 0.0197444832 0.0038714673
Flop 0.0240024938 0.0059226933
Category
Y PERSONALIZATION PHOTOGRAPHY PRODUCTIVITY
Hit 0.0356174990 0.0460704607 0.0321331785
Flop 0.0386533666 0.0174563591 0.0308603491
Category
Y SHOPPING SOCIAL SPORTS
Hit 0.0329074719 0.0228416570 0.0301974448
Flop 0.0124688279 0.0208852868 0.0317955112
Category
Y TOOLS TRAVEL_AND_LOCAL VIDEO_PLAYERS
Hit 0.0662020906 0.0205187766 0.0185830430
Flop 0.0950748130 0.0202618454 0.0124688279
Category
Y WEATHER
Hit 0.0096786682
Flop 0.0046758105
Type
Y 0 Free NaN
Hit 0.000000000 0.992257065 0.000000000
Flop 0.000000000 0.875311721 0.000000000
Type
Y Paid
Hit 0.007742935
Flop 0.124688279
Content.Rating
Y Adults Everyone Mature
Hit 0.0003871467 0.8033294619 0.0514905149
Flop 0.0000000000 0.8753117207 0.0405236908
Content.Rating
Y Teen Unrated
Hit 0.1447928765 0.0000000000
Flop 0.0838528678 0.0003117207
pred.prob <- as.data.frame(predict(delays.nb, newdata = test, type = "raw"))
pred.prob <- round(pred.prob, 4)
pred.prob <- pred.prob %>% select("Hit")
names(pred.prob) <- "predraw"
test <- cbind(test, pred.prob)
rm(pred.prob)
par <- data.frame("Cut off" = 1, "Accuracy" = 1, "Sensitivity" = 1)
i = 0
j = 1
for(i in seq(0,1,0.001)){
test <- test %>% mutate(pred = factor(as.character(ifelse(predraw < i, "Flop", "Hit")), levels = c("Hit", "Flop")))
par[j,2] <- 1-mean(test$Result != test$pred)
test1 <- test %>% filter(Result == "Hit")
par[j,3] <- mean(test1$Result == test1$pred)
par[j,1] <- i
j <- j+1
i <- i+1
rm(test1)
}
rm(i)
rm(j)
kable(par %>% filter(Accuracy == max(Accuracy)) %>% head(5), caption = "The (Hit) probability cut off for the best accuracy")
| Cut.off | Accuracy | Sensitivity |
|---|---|---|
| 0.022 | 0.9246717 | 0.8573643 |
| 0.023 | 0.9246717 | 0.8573643 |
| 0.024 | 0.9246717 | 0.8573643 |
| 0.028 | 0.9246717 | 0.8542636 |
par %>% ggplot(aes(x = `Cut.off`, y = par$Accuracy))+geom_point(size = 0.05, color = "gray")+geom_line()+geom_hline(yintercept = max(par$Accuracy), color = "red")+geom_vline(xintercept = 0.024, color = "red")+
labs(x = "Cut Off value", y = "Accuracy of prediction", title = "Cut off Vs Accuracy")
rm(par)
test <- test %>% mutate(pred = factor(as.character(ifelse(predraw < 0.024, "Flop", "Hit")), levels = c("Hit", "Flop")))
cm <- confusionMatrix(factor(test$pred, levels = c("Hit", "Flop")), factor(test$Result, levels = c("Hit", "Flop")))
cm
Confusion Matrix and Statistics
Reference
Prediction Hit Flop
Hit 553 17
Flop 92 785
Accuracy : 0.9247
95% CI : (0.9098, 0.9377)
No Information Rate : 0.5543
P-Value [Acc > NIR] : < 0.00000000000000022
Kappa : 0.8458
Mcnemar's Test P-Value : 0.000000000001361
Sensitivity : 0.8574
Specificity : 0.9788
Pos Pred Value : 0.9702
Neg Pred Value : 0.8951
Prevalence : 0.4457
Detection Rate : 0.3822
Detection Prevalence : 0.3939
Balanced Accuracy : 0.9181
'Positive' Class : Hit
nab <- data.frame("Model" = "Naive Bayes", "Accuracy" = cm[["overall"]][["Accuracy"]], "Sensitivity" = cm[["byClass"]][["Sensitivity"]], "Specificity" = cm[["byClass"]][["Specificity"]])
rm(cm)
Machine Learning - Modelling - SVM
rm(test)
rm(train)
rm(delays.nb)
train <- training
train$Result <- factor(train$Result, levels = c("Hit", "Flop"))
test <- testing
test$Result <- factor(test$Result, levels = c("Hit", "Flop"))
set.seed(123)
svm.fit <- svm(Result~., data=train, kernel = "linear")
test$pred <- predict(svm.fit, newdata = test[,-8], type = "class")
cm <- confusionMatrix(factor(test$pred, levels = c("Hit", "Flop")), factor(test$Result, levels = c("Hit", "Flop")))
cm
Confusion Matrix and Statistics
Reference
Prediction Hit Flop
Hit 376 40
Flop 269 762
Accuracy : 0.7865
95% CI : (0.7644, 0.8073)
No Information Rate : 0.5543
P-Value [Acc > NIR] : < 0.00000000000000022
Kappa : 0.5523
Mcnemar's Test P-Value : < 0.00000000000000022
Sensitivity : 0.5829
Specificity : 0.9501
Pos Pred Value : 0.9038
Neg Pred Value : 0.7391
Prevalence : 0.4457
Detection Rate : 0.2598
Detection Prevalence : 0.2875
Balanced Accuracy : 0.7665
'Positive' Class : Hit
linear <- data.frame("MOdel" = "SVM - Linear", "Accuracy" = cm[["overall"]][["Accuracy"]], "Sensitivity" = cm[["byClass"]][["Sensitivity"]], "Specificity" = cm[["byClass"]][["Specificity"]])
rm(test)
rm(train)
rm(svm.fit)
rm(cm)
train <- training
train$Result <- factor(train$Result, levels = c("Hit", "Flop"))
test <- testing
test$Result <- factor(test$Result, levels = c("Hit", "Flop"))
set.seed(123)
svm.fit <- svm(Result~., data=train, kernel = "radial")
test$pred <- predict(svm.fit, newdata = test[,-8], type = "class")
cm <- confusionMatrix(factor(test$pred, levels = c("Hit", "Flop")), factor(test$Result, levels = c("Hit", "Flop")))
cm
Confusion Matrix and Statistics
Reference
Prediction Hit Flop
Hit 343 93
Flop 302 709
Accuracy : 0.727
95% CI : (0.7033, 0.7498)
No Information Rate : 0.5543
P-Value [Acc > NIR] : < 0.00000000000000022
Kappa : 0.4294
Mcnemar's Test P-Value : < 0.00000000000000022
Sensitivity : 0.5318
Specificity : 0.8840
Pos Pred Value : 0.7867
Neg Pred Value : 0.7013
Prevalence : 0.4457
Detection Rate : 0.2370
Detection Prevalence : 0.3013
Balanced Accuracy : 0.7079
'Positive' Class : Hit
radial <- data.frame("MOdel" = "SVM - Radial", "Accuracy" = cm[["overall"]][["Accuracy"]], "Sensitivity" = cm[["byClass"]][["Sensitivity"]], "Specificity" = cm[["byClass"]][["Specificity"]])
rm(test)
rm(train)
rm(svm.fit)
rm(cm)
train <- training
train$Result <- factor(train$Result, levels = c("Hit", "Flop"))
test <- testing
test$Result <- factor(test$Result, levels = c("Hit", "Flop"))
set.seed(123)
svm.fit <- svm(Result~., data=train, kernel = "polynomial")
test$pred <- predict(svm.fit, newdata = test[,-8], type = "class")
cm <- confusionMatrix(factor(test$pred, levels = c("Hit", "Flop")), factor(test$Result, levels = c("Hit", "Flop")))
cm
Confusion Matrix and Statistics
Reference
Prediction Hit Flop
Hit 22 0
Flop 623 802
Accuracy : 0.5695
95% CI : (0.5435, 0.5951)
No Information Rate : 0.5543
P-Value [Acc > NIR] : 0.1277
Kappa : 0.0377
Mcnemar's Test P-Value : <0.0000000000000002
Sensitivity : 0.03411
Specificity : 1.00000
Pos Pred Value : 1.00000
Neg Pred Value : 0.56281
Prevalence : 0.44575
Detection Rate : 0.01520
Detection Prevalence : 0.01520
Balanced Accuracy : 0.51705
'Positive' Class : Hit
polynomial <- data.frame("MOdel" = "SVM - Polynomial", "Accuracy" = cm[["overall"]][["Accuracy"]], "Sensitivity" = cm[["byClass"]][["Sensitivity"]], "Specificity" = cm[["byClass"]][["Specificity"]])
rm(cm)
rm(test)
rm(train)
rm(svm.fit)
svm <- rbind(linear, radial, polynomial)
rm(linear)
rm(radial)
rm(polynomial)
names(svm) <- c("Model", "Accuracy", "Sensitivity", "Specificity")
Machine Learning - Modelling - SVM (Tuning)
set.seed(123)
tunesvm <- tune(svm, Result~., data = train, kernal = "linear", ranges = list(cost = c(seq(0.01,0.1,by = 0.01), seq(0.1,1,by = 0.1), seq(1,10,by = 1))))
dynamic.cost <- tunesvm$best.model$cost
summary(tunesvm)
data.par <- tunesvm$performances
data.par[data.par$error == min(data.par$error),]
set.seed(123)
svm_cost <- svm(Result~., data=train,kernel = "linear", cost = dynamic.cost)
pred_train <- predict(svm_cost, train)
train.error <- mean(pred_train != train$Result)
pred_test <- predict(svm_cost, test)
test.error <- mean(pred_test != test$Result)
linear <- data.frame("Kernal" = "Linear", "Cost" = dynamic.cost,
"train Error" = train.error,
"test Error" = test.error)
set.seed(123)
tunesvm <- tune(svm, Result~., data = train, kernal = "radial", ranges = list(cost = c(seq(0.01,0.1,by = 0.01), seq(0.1,1,by = 0.1), seq(1,10,by = 1))))
dynamic.cost <- tunesvm$best.model$cost
summary(tunesvm)
data.par <- tunesvm$performances
data.par[data.par$error == min(data.par$error),]
svm_cost <- svm(Result~., data=train, kernel = "radial", cost = dynamic.cost)
pred_train <- predict(svm_cost, train)
train.error <- mean(pred_train != train$Result)
pred_test <- predict(svm_cost, test)
test.error <- mean(pred_test != test$Result)
radial <- data.frame("Kernal" = "Radial", "Cost" = dynamic.cost,
"train Error" = train.error,
"test Error" = test.error)
set.seed(123)
tunesvm <- tune(svm, Result~., data = train, kernal = "polynomial", degree = 2, ranges = list(cost = c(seq(0.01,0.1, by = 0.01), seq(0.1,1,by = 0.1), seq(1,10,by = 1))))
dynamic.cost <- tunesvm$best.model$cost
summary(tunesvm)
data.par <- tunesvm$performances
data.par[data.par$error == min(data.par$error),]
svm_cost <- svm(Result~., data=train, kernel = "polynomial", degree = 2, cost = dynamic.cost)
pred_train <- predict(svm_cost, train)
train.error <- mean(pred_train != train$Result)
pred_test <- predict(svm_cost, test)
test.error <- mean(pred_test != test$Result)
polynomial <- data.frame("Kernal" = "Polynomial", "Cost" = dynamic.cost,
"train Error" = train.error,
"test Error" = test.error)
kable(final <- rbind(linear, radial, polynomial), caption = "SVM Performance with different kernals")
In this analysis I used Rating and Reviews which are known only after an App is launched. So the analysis I did so far is purely hypothetical. For this reason I am repeating the entire analysis dropping Rating and Reviews Attributes.
Machine Learning - Modelling - Logistic Regression
rm(trainindex)
rm(data.ml)
train <- trainn %>% select(-c("Rating", "Reviews"))
test <- testn %>% select(-c("Rating", "Reviews"))
logit.reg <- glm(Result ~.,data = train, family = "binomial")
summary(logit.reg)
Call:
glm(formula = Result ~ ., family = "binomial", data = train)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.9539 -1.0772 -0.3188 1.1647 2.9656
Coefficients:
Estimate Std. Error
(Intercept) 10.932973 196.968026
Size 0.290329 0.343408
Current.Ver -1.065856 0.955384
CategoryAUTO_AND_VEHICLES -0.166999 0.446399
CategoryBEAUTY -0.078609 0.520725
CategoryBOOKS_AND_REFERENCE 0.198912 0.389295
CategoryBUSINESS -0.387421 0.380052
CategoryCOMICS -0.481986 0.513833
CategoryCOMMUNICATION 0.884703 0.373083
CategoryDATING 0.028899 0.420699
CategoryEDUCATION 1.603696 0.430444
CategoryENTERTAINMENT 2.278416 0.497425
CategoryEVENTS -1.646004 0.811304
CategoryFAMILY 0.540242 0.340662
CategoryFINANCE -0.034169 0.368334
CategoryFOOD_AND_DRINK 0.929963 0.412082
CategoryGAME 1.751529 0.347747
CategoryHEALTH_AND_FITNESS 0.755244 0.373784
CategoryHOUSE_AND_HOME 1.275349 0.468070
CategoryLIBRARIES_AND_DEMO -0.227878 0.466059
CategoryLIFESTYLE 0.003346 0.367949
CategoryMAPS_AND_NAVIGATION 0.499621 0.413614
CategoryMEDICAL -1.259244 0.407924
CategoryNEWS_AND_MAGAZINES 0.351646 0.381546
CategoryPARENTING 0.162152 0.517472
CategoryPERSONALIZATION 0.894042 0.368283
CategoryPHOTOGRAPHY 1.694726 0.376429
CategoryPRODUCTIVITY 0.701663 0.368215
CategorySHOPPING 1.533020 0.387390
CategorySOCIAL 0.502244 0.384332
CategorySPORTS 0.621386 0.369610
CategoryTOOLS 0.328361 0.349381
CategoryTRAVEL_AND_LOCAL 0.653622 0.384622
CategoryVIDEO_PLAYERS 0.925326 0.397957
CategoryWEATHER 1.610839 0.491177
TypePaid -3.079504 0.235667
Content.RatingEveryone -11.714243 196.967741
Content.RatingMature -11.409212 196.967821
Content.RatingTeen -11.479430 196.967763
Content.RatingUnrated -22.828966 278.554424
z value
(Intercept) 0.056
Size 0.845
Current.Ver -1.116
CategoryAUTO_AND_VEHICLES -0.374
CategoryBEAUTY -0.151
CategoryBOOKS_AND_REFERENCE 0.511
CategoryBUSINESS -1.019
CategoryCOMICS -0.938
CategoryCOMMUNICATION 2.371
CategoryDATING 0.069
CategoryEDUCATION 3.726
CategoryENTERTAINMENT 4.580
CategoryEVENTS -2.029
CategoryFAMILY 1.586
CategoryFINANCE -0.093
CategoryFOOD_AND_DRINK 2.257
CategoryGAME 5.037
CategoryHEALTH_AND_FITNESS 2.021
CategoryHOUSE_AND_HOME 2.725
CategoryLIBRARIES_AND_DEMO -0.489
CategoryLIFESTYLE 0.009
CategoryMAPS_AND_NAVIGATION 1.208
CategoryMEDICAL -3.087
CategoryNEWS_AND_MAGAZINES 0.922
CategoryPARENTING 0.313
CategoryPERSONALIZATION 2.428
CategoryPHOTOGRAPHY 4.502
CategoryPRODUCTIVITY 1.906
CategorySHOPPING 3.957
CategorySOCIAL 1.307
CategorySPORTS 1.681
CategoryTOOLS 0.940
CategoryTRAVEL_AND_LOCAL 1.699
CategoryVIDEO_PLAYERS 2.325
CategoryWEATHER 3.280
TypePaid -13.067
Content.RatingEveryone -0.059
Content.RatingMature -0.058
Content.RatingTeen -0.058
Content.RatingUnrated -0.082
Pr(>|z|)
(Intercept) 0.955735
Size 0.397869
Current.Ver 0.264580
CategoryAUTO_AND_VEHICLES 0.708328
CategoryBEAUTY 0.880007
CategoryBOOKS_AND_REFERENCE 0.609383
CategoryBUSINESS 0.308018
CategoryCOMICS 0.348234
CategoryCOMMUNICATION 0.017724 *
CategoryDATING 0.945234
CategoryEDUCATION 0.000195 ***
CategoryENTERTAINMENT 0.000004640 ***
CategoryEVENTS 0.042475 *
CategoryFAMILY 0.112771
CategoryFINANCE 0.926090
CategoryFOOD_AND_DRINK 0.024024 *
CategoryGAME 0.000000473 ***
CategoryHEALTH_AND_FITNESS 0.043328 *
CategoryHOUSE_AND_HOME 0.006436 **
CategoryLIBRARIES_AND_DEMO 0.624880
CategoryLIFESTYLE 0.992744
CategoryMAPS_AND_NAVIGATION 0.227070
CategoryMEDICAL 0.002022 **
CategoryNEWS_AND_MAGAZINES 0.356719
CategoryPARENTING 0.754012
CategoryPERSONALIZATION 0.015199 *
CategoryPHOTOGRAPHY 0.000006728 ***
CategoryPRODUCTIVITY 0.056704 .
CategorySHOPPING 0.000075801 ***
CategorySOCIAL 0.191282
CategorySPORTS 0.092725 .
CategoryTOOLS 0.347302
CategoryTRAVEL_AND_LOCAL 0.089246 .
CategoryVIDEO_PLAYERS 0.020062 *
CategoryWEATHER 0.001040 **
TypePaid < 0.0000000000000002 ***
Content.RatingEveryone 0.952575
Content.RatingMature 0.953809
Content.RatingTeen 0.953525
Content.RatingUnrated 0.934682
---
Signif. codes:
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 7960.4 on 5790 degrees of freedom
Residual deviance: 6978.9 on 5751 degrees of freedom
AIC: 7058.9
Number of Fisher Scoring iterations: 10
#confint(logit.reg) to check for confidence intervals
test$predraw <- predict(logit.reg, newdata = test %>% select(-"Result"), type = "response")
par <- data.frame("Cut off" = 1, "Accuracy" = 1, "Sensitivity" = 1)
i = 0
j = 1
for(i in seq(0,1,0.001)){
test <- test %>% mutate(pred = factor(as.character(ifelse(predraw < i, "Flop", "Hit")), levels = c("Hit", "Flop")))
par[j,2] <- 1-mean(test$Result != test$pred)
par[j,1] <- i
test1 <- test %>% filter(Result == "Hit")
par[j,3] <- mean(test1$Result == test1$pred)
j <- j+1
i <- i+1
rm(test1)
}
rm(i)
rm(j)
kable(par %>% filter(Accuracy == max(Accuracy)) %>% head(5), caption = "The cut off for the best accuracy")
| Cut.off | Accuracy | Sensitivity |
|---|---|---|
| 0.443 | 0.6862474 | 0.696124 |
par %>%ggplot(aes(x = Cut.off, y = par$Accuracy))+geom_point(size = 0.05, color = "gray")+geom_line()+geom_hline(yintercept = max(par$Accuracy), color = "red")+geom_vline(xintercept = 0.443, color = "red")+
labs(x = "Cut Off value", y = "Accuracy of prediction", title = "Cut off Vs Accuracy")
rm(par)
test <- test %>% mutate(pred = as.factor(ifelse(predraw <= 0.443, "Flop", "Hit")))
test$pred <- factor(test$pred, levels = c("Hit", "Flop"))
test$Result <- factor(test$Result, levels = c("Hit", "Flop"))
cm <- confusionMatrix(test$pred, test$Result)
cm
Confusion Matrix and Statistics
Reference
Prediction Hit Flop
Hit 449 258
Flop 196 544
Accuracy : 0.6862
95% CI : (0.6616, 0.7101)
No Information Rate : 0.5543
P-Value [Acc > NIR] : < 0.00000000000000022
Kappa : 0.3709
Mcnemar's Test P-Value : 0.004198
Sensitivity : 0.6961
Specificity : 0.6783
Pos Pred Value : 0.6351
Neg Pred Value : 0.7351
Prevalence : 0.4457
Detection Rate : 0.3103
Detection Prevalence : 0.4886
Balanced Accuracy : 0.6872
'Positive' Class : Hit
logit.u <- data.frame("Model" = "Logistic Regression (U)", "Accuracy" = cm[["overall"]][["Accuracy"]], "Sensitivity" = cm[["byClass"]][["Sensitivity"]], "Specificity" = cm[["byClass"]][["Specificity"]])
rm(cm)
Machine Learning - Modelling - Decision Trees
rm(test)
rm(train)
rm(logit.reg)
train <- training %>% select(-c("Rating", "Reviews"))
test <- testing %>% select(-c("Rating", "Reviews"))
set.seed(123)
dt <- rpart(Result ~., data = train, method = "class", cp = 0.00001, minsplit = 1, xval = 10)
dt.pruned <- prune(dt, cp = 0.004)
test$pred <- predict(dt.pruned, newdata = test[,-6], type = "class")
cm <- confusionMatrix(factor(test$pred, levels = c("Hit", "Flop")), factor(test$Result, levels = c("Hit", "Flop")))
cm
Confusion Matrix and Statistics
Reference
Prediction Hit Flop
Hit 468 261
Flop 177 541
Accuracy : 0.6973
95% CI : (0.6729, 0.7209)
No Information Rate : 0.5543
P-Value [Acc > NIR] : < 0.00000000000000022
Kappa : 0.3951
Mcnemar's Test P-Value : 0.00007312
Sensitivity : 0.7256
Specificity : 0.6746
Pos Pred Value : 0.6420
Neg Pred Value : 0.7535
Prevalence : 0.4457
Detection Rate : 0.3234
Detection Prevalence : 0.5038
Balanced Accuracy : 0.7001
'Positive' Class : Hit
det.u <- data.frame("Model" = "Decision Tress (U)", "Accuracy" = cm[["overall"]][["Accuracy"]], "Sensitivity" = cm[["byClass"]][["Sensitivity"]], "Specificity" = cm[["byClass"]][["Specificity"]])
rm(cm)
rpart.plot(dt.pruned, box.palette="RdBu", shadow.col="gray", nn=TRUE)
rpart.rules(dt.pruned, style = "tallw")
Result is 0.00 when
Category is LIBRARIES_AND_DEMO or MEDICAL
Size >= 0.0307
Current.Ver >= 0.000000127
Result is 0.03 when
Category is ART_AND_DESIGN or AUTO_AND_VEHICLES or BEAUTY or BOOKS_AND_REFERENCE or BUSINESS or COMICS or COMMUNICATION or DATING or EVENTS or FAMILY or FINANCE or HEALTH_AND_FITNESS or LIBRARIES_AND_DEMO or LIFESTYLE or MAPS_AND_NAVIGATION or MEDICAL or NEWS_AND_MAGAZINES or PARENTING or PERSONALIZATION or PRODUCTIVITY or SOCIAL or SPORTS or TOOLS or TRAVEL_AND_LOCAL
Current.Ver < 0.000000127
Type is Paid
Result is 0.08 when
Category is EDUCATION or ENTERTAINMENT or FOOD_AND_DRINK or GAME or HOUSE_AND_HOME or PHOTOGRAPHY or SHOPPING or VIDEO_PLAYERS or WEATHER
Type is Paid
Result is 0.14 when
Category is ART_AND_DESIGN or BEAUTY or BOOKS_AND_REFERENCE or COMMUNICATION or FAMILY or PERSONALIZATION or SPORTS
Size >= 0.1269
Current.Ver < 0.000000127
Type is Free
Result is 0.16 when
Category is AUTO_AND_VEHICLES or BUSINESS or DATING or FAMILY or LIFESTYLE or PARENTING or SPORTS
Size < 0.0078
Current.Ver >= 0.000000127
Result is 0.16 when
Category is ART_AND_DESIGN or BEAUTY or EVENTS or FINANCE or LIBRARIES_AND_DEMO or MEDICAL
Size < 0.0307
Current.Ver >= 0.000000127
Result is 0.21 when
Category is AUTO_AND_VEHICLES or BOOKS_AND_REFERENCE or BUSINESS or COMICS or COMMUNICATION or DATING or FAMILY or HEALTH_AND_FITNESS or LIFESTYLE or MAPS_AND_NAVIGATION or NEWS_AND_MAGAZINES or PARENTING or PERSONALIZATION or PRODUCTIVITY or SOCIAL or SPORTS or TOOLS or TRAVEL_AND_LOCAL
Size >= 0.0078
Current.Ver >= 0.000000127
Type is Paid
Result is 0.24 when
Category is AUTO_AND_VEHICLES or BUSINESS or COMICS or DATING or EVENTS or FINANCE or HEALTH_AND_FITNESS or LIBRARIES_AND_DEMO or LIFESTYLE or MAPS_AND_NAVIGATION or MEDICAL or NEWS_AND_MAGAZINES or PARENTING or PRODUCTIVITY or SOCIAL or TOOLS or TRAVEL_AND_LOCAL
Size >= 0.0126
Current.Ver < 0.000000091
Type is Free
Result is 0.26 when
Category is ART_AND_DESIGN or AUTO_AND_VEHICLES or BEAUTY or BOOKS_AND_REFERENCE or BUSINESS or COMICS or COMMUNICATION or DATING or EVENTS or FAMILY or FINANCE or HEALTH_AND_FITNESS or LIBRARIES_AND_DEMO or LIFESTYLE or MAPS_AND_NAVIGATION or MEDICAL or NEWS_AND_MAGAZINES or PARENTING or PERSONALIZATION or PRODUCTIVITY or SOCIAL or SPORTS or TOOLS or TRAVEL_AND_LOCAL
Size < 0.0126
Current.Ver < 0.000000127
Type is Free
Result is 0.30 when
Category is EDUCATION or ENTERTAINMENT or FOOD_AND_DRINK or GAME or HOUSE_AND_HOME or PHOTOGRAPHY or SHOPPING or VIDEO_PLAYERS or WEATHER
Size < 0.0029
Type is Free
Result is 0.38 when
Category is AUTO_AND_VEHICLES or BUSINESS or DATING or LIBRARIES_AND_DEMO or LIFESTYLE or MEDICAL or TOOLS or TRAVEL_AND_LOCAL
Size >= 0.0126
Current.Ver is 0.000000091 to 0.000000127
Type is Free
Result is 0.56 when
Category is BOOKS_AND_REFERENCE or COMMUNICATION or HEALTH_AND_FITNESS or MAPS_AND_NAVIGATION or NEWS_AND_MAGAZINES or PERSONALIZATION or PRODUCTIVITY or SOCIAL or TOOLS or TRAVEL_AND_LOCAL
Size < 0.0078
Current.Ver >= 0.000000127
Result is 0.57 when
Category is EDUCATION or ENTERTAINMENT or FOOD_AND_DRINK or GAME or HOUSE_AND_HOME or PHOTOGRAPHY or SHOPPING or VIDEO_PLAYERS or WEATHER
Size is 0.0029 to 0.0089
Type is Free
Result is 0.59 when
Category is ART_AND_DESIGN or BEAUTY or BOOKS_AND_REFERENCE or COMMUNICATION or FAMILY or PERSONALIZATION or SPORTS
Size is 0.0126 to 0.1269
Current.Ver < 0.000000127
Type is Free
Result is 0.72 when
Category is AUTO_AND_VEHICLES or BOOKS_AND_REFERENCE or BUSINESS or COMICS or COMMUNICATION or DATING or FAMILY or HEALTH_AND_FITNESS or LIFESTYLE or MAPS_AND_NAVIGATION or NEWS_AND_MAGAZINES or PARENTING or PERSONALIZATION or PRODUCTIVITY or SOCIAL or SPORTS or TOOLS or TRAVEL_AND_LOCAL
Size >= 0.0078
Current.Ver >= 0.000000127
Type is Free
Result is 0.72 when
Category is COMICS or FINANCE or HEALTH_AND_FITNESS or MAPS_AND_NAVIGATION or NEWS_AND_MAGAZINES or PARENTING or PRODUCTIVITY or SOCIAL
Size >= 0.0126
Current.Ver is 0.000000091 to 0.000000127
Type is Free
Result is 0.78 when
Category is EDUCATION or ENTERTAINMENT or FOOD_AND_DRINK or GAME or HOUSE_AND_HOME or PHOTOGRAPHY or SHOPPING or VIDEO_PLAYERS or WEATHER
Size >= 0.0089
Type is Free
Result is 0.87 when
Category is BEAUTY or FINANCE
Size >= 0.0307
Current.Ver >= 0.000000127
Machine Learning - Modelling - Random Forest
rm(train)
rm(test)
rm(dt)
rm(dt.pruned)
train <- training %>% select(-c("Rating", "Reviews"))
test <- testing %>% select(-c("Rating", "Reviews"))
rf <- randomForest(Result ~., data = train)
#plot(rf, main = "Error rate of random forest")
varImpPlot(rf, main = "Gini Plot - Importance of Variables")
test$pred = predict(rf, newdata = test[-6])
cm <- confusionMatrix(factor(test$pred, levels = c("Hit", "Flop")), factor(test$Result, levels = c("Hit", "Flop")))
cm
Confusion Matrix and Statistics
Reference
Prediction Hit Flop
Hit 440 180
Flop 205 622
Accuracy : 0.7339
95% CI : (0.7104, 0.7566)
No Information Rate : 0.5543
P-Value [Acc > NIR] : <0.0000000000000002
Kappa : 0.4595
Mcnemar's Test P-Value : 0.2213
Sensitivity : 0.6822
Specificity : 0.7756
Pos Pred Value : 0.7097
Neg Pred Value : 0.7521
Prevalence : 0.4457
Detection Rate : 0.3041
Detection Prevalence : 0.4285
Balanced Accuracy : 0.7289
'Positive' Class : Hit
raf.u <- data.frame("Model" = "Random Forest (U)", "Accuracy" = cm[["overall"]][["Accuracy"]], "Sensitivity" = cm[["byClass"]][["Sensitivity"]], "Specificity" = cm[["byClass"]][["Specificity"]])
rm(cm)
Machine Learning - Modelling - Naive Bayes Classifier
rm(test)
rm(train)
rm(rf)
train <- training %>% select(-c("Rating", "Reviews"))
train$Result <- factor(train$Result, levels = c("Hit", "Flop"))
test <- testing %>% select(-c("Rating", "Reviews"))
test$Result <- factor(test$Result, levels = c("Hit", "Flop"))
delays.nb <- naiveBayes(Result~., data = train)
delays.nb
Naive Bayes Classifier for Discrete Predictors
Call:
naiveBayes.default(x = X, y = Y, laplace = laplace)
A-priori probabilities:
Y
Hit Flop
0.446037 0.553963
Conditional probabilities:
Size
Y [,1] [,2]
Hit 0.03556935 0.06165665
Flop 0.03563030 0.10756489
Current.Ver
Y [,1] [,2]
Hit 0.0009565182 0.02834286
Flop 0.0017979409 0.03567101
Category
Y ART_AND_DESIGN AUTO_AND_VEHICLES BEAUTY
Hit 0.0050329075 0.0061943477 0.0034843206
Flop 0.0093516209 0.0127805486 0.0065461347
Category
Y BOOKS_AND_REFERENCE BUSINESS COMICS
Hit 0.0154858691 0.0158730159 0.0034843206
Flop 0.0233790524 0.0417705736 0.0084164589
Category
Y COMMUNICATION DATING EDUCATION
Hit 0.0313588850 0.0174216028 0.0174216028
Flop 0.0249376559 0.0224438903 0.0068578554
Category
Y ENTERTAINMENT EVENTS FAMILY
Hit 0.0178087495 0.0007742935 0.2040263260
Flop 0.0028054863 0.0068578554 0.2319201995
Category
Y FINANCE FOOD_AND_DRINK GAME
Hit 0.0240030972 0.0147115757 0.1966705381
Flop 0.0464463840 0.0102867830 0.0692019950
Category
Y HEALTH_AND_FITNESS HOUSE_AND_HOME
Hit 0.0282617112 0.0096786682
Flop 0.0246259352 0.0046758105
Category
Y LIBRARIES_AND_DEMO LIFESTYLE
Hit 0.0054200542 0.0247773906
Flop 0.0112219451 0.0458229426
Category
Y MAPS_AND_NAVIGATION MEDICAL
Hit 0.0116144019 0.0081300813
Flop 0.0130922693 0.0620324190
Category
Y NEWS_AND_MAGAZINES PARENTING
Hit 0.0197444832 0.0038714673
Flop 0.0240024938 0.0059226933
Category
Y PERSONALIZATION PHOTOGRAPHY PRODUCTIVITY
Hit 0.0356174990 0.0460704607 0.0321331785
Flop 0.0386533666 0.0174563591 0.0308603491
Category
Y SHOPPING SOCIAL SPORTS
Hit 0.0329074719 0.0228416570 0.0301974448
Flop 0.0124688279 0.0208852868 0.0317955112
Category
Y TOOLS TRAVEL_AND_LOCAL VIDEO_PLAYERS
Hit 0.0662020906 0.0205187766 0.0185830430
Flop 0.0950748130 0.0202618454 0.0124688279
Category
Y WEATHER
Hit 0.0096786682
Flop 0.0046758105
Type
Y 0 Free NaN
Hit 0.000000000 0.992257065 0.000000000
Flop 0.000000000 0.875311721 0.000000000
Type
Y Paid
Hit 0.007742935
Flop 0.124688279
Content.Rating
Y Adults Everyone Mature
Hit 0.0003871467 0.8033294619 0.0514905149
Flop 0.0000000000 0.8753117207 0.0405236908
Content.Rating
Y Teen Unrated
Hit 0.1447928765 0.0000000000
Flop 0.0838528678 0.0003117207
pred.prob <- as.data.frame(predict(delays.nb, newdata = test, type = "raw"))
pred.prob <- round(pred.prob, 4)
pred.prob <- pred.prob %>% select("Hit")
names(pred.prob) <- "predraw"
test <- cbind(test, pred.prob)
rm(pred.prob)
par <- data.frame("Cut off" = 1, "Accuracy" = 1, "Sensitivity" = 1)
i = 0
j = 1
for(i in seq(0,1,0.001)){
test <- test %>% mutate(pred = factor(as.character(ifelse(predraw < i, "Flop", "Hit")), levels = c("Hit", "Flop")))
par[j,2] <- 1-mean(test$Result != test$pred)
test1 <- test %>% filter(Result == "Hit")
par[j,3] <- mean(test1$Result == test1$pred)
par[j,1] <- i
j <- j+1
i <- i+1
rm(test1)
}
rm(i)
rm(j)
kable(par %>% filter(Accuracy == max(Accuracy)) %>% head(5), caption = "The (Hit) probability cut off for the best accuracy")
| Cut.off | Accuracy | Sensitivity |
|---|---|---|
| 0.615 | 0.6848652 | 0.6852713 |
| 0.616 | 0.6848652 | 0.6775194 |
par %>% ggplot(aes(x = `Cut.off`, y = par$Accuracy))+geom_point(size = 0.05, color = "gray")+geom_line()+geom_hline(yintercept = max(par$Accuracy), color = "red")+geom_vline(xintercept = 0.615, color = "red")+
labs(x = "Cut Off value", y = "Accuracy of prediction", title = "Cut off Vs Accuracy")
rm(par)
test <- test %>% mutate(pred = factor(as.character(ifelse(predraw < 0.615, "Flop", "Hit")), levels = c("Hit", "Flop")))
cm <- confusionMatrix(factor(test$pred, levels = c("Hit", "Flop")), factor(test$Result, levels = c("Hit", "Flop")))
cm
Confusion Matrix and Statistics
Reference
Prediction Hit Flop
Hit 442 253
Flop 203 549
Accuracy : 0.6849
95% CI : (0.6602, 0.7088)
No Information Rate : 0.5543
P-Value [Acc > NIR] : < 0.0000000000000002
Kappa : 0.367
Mcnemar's Test P-Value : 0.02175
Sensitivity : 0.6853
Specificity : 0.6845
Pos Pred Value : 0.6360
Neg Pred Value : 0.7301
Prevalence : 0.4457
Detection Rate : 0.3055
Detection Prevalence : 0.4803
Balanced Accuracy : 0.6849
'Positive' Class : Hit
nab.u <- data.frame("Model" = "Naive Bayes (U)", "Accuracy" = cm[["overall"]][["Accuracy"]], "Sensitivity" = cm[["byClass"]][["Sensitivity"]], "Specificity" = cm[["byClass"]][["Specificity"]])
rm(cm)
Machine Learning - Modelling - SVM
rm(test)
rm(train)
rm(delays.nb)
train <- training %>% select(-c("Rating", "Reviews"))
train$Result <- factor(train$Result, levels = c("Hit", "Flop"))
test <- testing %>% select(-c("Rating", "Reviews"))
test$Result <- factor(test$Result, levels = c("Hit", "Flop"))
set.seed(123)
svm.fit <- svm(Result~., data=train, kernel = "linear")
test$pred <- predict(svm.fit, newdata = test[,-6], type = "class")
cm <- confusionMatrix(factor(test$pred, levels = c("Hit", "Flop")), factor(test$Result, levels = c("Hit", "Flop")))
cm
Confusion Matrix and Statistics
Reference
Prediction Hit Flop
Hit 277 138
Flop 368 664
Accuracy : 0.6503
95% CI : (0.6251, 0.6749)
No Information Rate : 0.5543
P-Value [Acc > NIR] : 0.0000000000000681
Kappa : 0.2667
Mcnemar's Test P-Value : < 0.00000000000000022
Sensitivity : 0.4295
Specificity : 0.8279
Pos Pred Value : 0.6675
Neg Pred Value : 0.6434
Prevalence : 0.4457
Detection Rate : 0.1914
Detection Prevalence : 0.2868
Balanced Accuracy : 0.6287
'Positive' Class : Hit
linear <- data.frame("MOdel" = "SVM - Linear (U)", "Accuracy" = cm[["overall"]][["Accuracy"]], "Sensitivity" = cm[["byClass"]][["Sensitivity"]], "Specificity" = cm[["byClass"]][["Specificity"]])
rm(test)
rm(train)
rm(svm.fit)
rm(cm)
train <- training %>% select(-c("Rating", "Reviews"))
train$Result <- factor(train$Result, levels = c("Hit", "Flop"))
test <- testing %>% select(-c("Rating", "Reviews"))
test$Result <- factor(test$Result, levels = c("Hit", "Flop"))
set.seed(123)
svm.fit <- svm(Result~., data=train, kernel = "radial")
test$pred <- predict(svm.fit, newdata = test[,-6], type = "class")
cm <- confusionMatrix(factor(test$pred, levels = c("Hit", "Flop")), factor(test$Result, levels = c("Hit", "Flop")))
cm
Confusion Matrix and Statistics
Reference
Prediction Hit Flop
Hit 294 130
Flop 351 672
Accuracy : 0.6676
95% CI : (0.6426, 0.6918)
No Information Rate : 0.5543
P-Value [Acc > NIR] : < 0.00000000000000022
Kappa : 0.3039
Mcnemar's Test P-Value : < 0.00000000000000022
Sensitivity : 0.4558
Specificity : 0.8379
Pos Pred Value : 0.6934
Neg Pred Value : 0.6569
Prevalence : 0.4457
Detection Rate : 0.2032
Detection Prevalence : 0.2930
Balanced Accuracy : 0.6469
'Positive' Class : Hit
radial <- data.frame("MOdel" = "SVM - Radial (U)", "Accuracy" = cm[["overall"]][["Accuracy"]], "Sensitivity" = cm[["byClass"]][["Sensitivity"]], "Specificity" = cm[["byClass"]][["Specificity"]])
rm(test)
rm(train)
rm(svm.fit)
rm(cm)
train <- training %>% select(-c("Rating", "Reviews"))
train$Result <- factor(train$Result, levels = c("Hit", "Flop"))
test <- testing %>% select(-c("Rating", "Reviews"))
test$Result <- factor(test$Result, levels = c("Hit", "Flop"))
set.seed(123)
svm.fit <- svm(Result~., data=train, kernel = "polynomial")
test$pred <- predict(svm.fit, newdata = test[,-6], type = "class")
cm <- confusionMatrix(factor(test$pred, levels = c("Hit", "Flop")), factor(test$Result, levels = c("Hit", "Flop")))
cm
Confusion Matrix and Statistics
Reference
Prediction Hit Flop
Hit 0 0
Flop 645 802
Accuracy : 0.5543
95% CI : (0.5282, 0.5801)
No Information Rate : 0.5543
P-Value [Acc > NIR] : 0.5109
Kappa : 0
Mcnemar's Test P-Value : <0.0000000000000002
Sensitivity : 0.0000
Specificity : 1.0000
Pos Pred Value : NaN
Neg Pred Value : 0.5543
Prevalence : 0.4457
Detection Rate : 0.0000
Detection Prevalence : 0.0000
Balanced Accuracy : 0.5000
'Positive' Class : Hit
polynomial <- data.frame("MOdel" = "SVM - Polynomial (U)", "Accuracy" = cm[["overall"]][["Accuracy"]], "Sensitivity" = cm[["byClass"]][["Sensitivity"]], "Specificity" = cm[["byClass"]][["Specificity"]])
rm(cm)
rm(test)
rm(train)
rm(svm.fit)
svm.u <- rbind(linear, radial, polynomial)
rm(linear)
rm(radial)
rm(polynomial)
names(svm.u) <- c("Model", "Accuracy", "Sensitivity", "Specificity")
Final model Summary
model.ds <- rbind(logit, det, raf, nab, svm)
rm(logit)
rm(det)
rm(raf)
rm(nab)
rm(svm)
model.realtime <- rbind(logit.u, det.u, raf.u, nab.u, svm.u)
rm(logit.u)
rm(det.u)
rm(raf.u)
rm(nab.u)
rm(svm.u)
kable(rbind(model.ds, model.realtime), caption = "Final model results for complete and real time data")
| Model | Accuracy | Sensitivity | Specificity |
|---|---|---|---|
| Logistic Regression | 0.9474775 | 0.9162791 | 0.9725686 |
| Decision Trees | 0.9447132 | 0.9193798 | 0.9650873 |
| Random Forest | 0.9474775 | 0.9503876 | 0.9451372 |
| Naive Bayes | 0.9246717 | 0.8573643 | 0.9788030 |
| SVM - Linear | 0.7864547 | 0.5829457 | 0.9501247 |
| SVM - Radial | 0.7270214 | 0.5317829 | 0.8840399 |
| SVM - Polynomial | 0.5694540 | 0.0341085 | 1.0000000 |
| Logistic Regression (U) | 0.6862474 | 0.6961240 | 0.6783042 |
| Decision Tress (U) | 0.6973048 | 0.7255814 | 0.6745636 |
| Random Forest (U) | 0.7339323 | 0.6821705 | 0.7755611 |
| Naive Bayes (U) | 0.6848652 | 0.6852713 | 0.6845387 |
| SVM - Linear (U) | 0.6503110 | 0.4294574 | 0.8279302 |
| SVM - Radial (U) | 0.6675881 | 0.4558140 | 0.8379052 |
| SVM - Polynomial (U) | 0.5542502 | 0.0000000 | 1.0000000 |
kable((model.ds %>% arrange(desc(Accuracy)) %>% head(2)), caption = "Top 2 models for complete data")
| Model | Accuracy | Sensitivity | Specificity |
|---|---|---|---|
| Logistic Regression | 0.9474775 | 0.9162791 | 0.9725686 |
| Random Forest | 0.9474775 | 0.9503876 | 0.9451372 |
kable((model.realtime%>% arrange(desc(Accuracy)) %>% head(2)), caption = "Top 2 models for data that can be extracted in real time")
| Model | Accuracy | Sensitivity | Specificity |
|---|---|---|---|
| Random Forest (U) | 0.7339323 | 0.6821705 | 0.7755611 |
| Decision Tress (U) | 0.6973048 | 0.7255814 | 0.6745636 |
Conclusion:
It is clear that the size of the app and category define if the app would be a hit or not.
From the EDA it is clear that the app must be below 25 MB and the categories to target would vary depending on the risk one wants to take.
While Game, Family, and Communication give a constant reward, the highest can be gained from Categories that are yet to be explored like Comics (Which are positively seen but are low in market) or Events (Which have a huge criticism and are low in market).
Sources:
https://www.kaggle.com/lava18/google-play-store-apps
http://r-statistics.co/
https://uc-r.github.io/
https://www.r-bloggers.com/
https://medium.com/analytics-vidhya/a-guide-to-machine-learning-in-r-for-beginners-part-5-4c00f2366b90
https://en.proft.me/2016/11/9/classification-using-decis ion-trees-r/
https://www.gormanalysis.com/blog/decision-trees-in-r-us ing-rpart/
http://www.sthda.com/english/articles/35-statistical-machine-learning-essentials/141-cart-model-decision-tree-essentials/
https://dataaspirant.com/2017/02/03/decision-tree-classifier-implementation-in-r/
https://www.blopig.com/blog/2017/04/a-very-basic-introduction-to-random-forests-using-r/
https://www.rdocumentation.org/
https://www.datacamp.com/
Introduction to Data Mining
Book by Michael Steinbach, Pang-Ning Tan, and Vipin Kumar
Data Mining for Business Analytics: Concepts, Techniques, and Applications in R
Book by Galit Shmueli , Peter C. Bruce, Inbal Yahav, Nitin R. Patel, Kenneth C. Lichtendahl Jr.