Loading in the data

gps <- read.csv("googleplaystore.csv")

Data Cleaning

The following needs to be removed/ cleaned:
1) Record of an App by name Life made wi-fi….
2) The Current Version of the App and Android Version have to be processed to display the current sequence (12.1.2 is 12)
3) The Special characters in Installs(+), Price($), Size(MB) should be removed
4) All the redundant records should be removed
5) The levels in Installs should be fixed to a Descending order
6) Geners should be removed, it is the same as Category and makes it redundant
7) All the other Nas should be removed
8) Fixing Content Rating by combining like levels

gps <- gps[gps$App !=   "Life Made WI-Fi Touchscreen Photo Frame",] %>% na.omit() %>% unique() %>% select(-Genres)
gps$Current.Ver <- gps$Current.Ver %>% as.character()
i <- 1
for(i in c(seq(1, 8892, by = 1))){
  a <- gps$Current.Ver[i] 
  b <- b <- unlist(strsplit(a, "[.]"))
  b <- b[1]
  gps$Current.Ver[i] = b
  rm(b)
  rm(a)
  i <- i + 1
}
rm(i)
gps$Android.Ver <- gps$Android.Ver %>% as.character()
i <- 1
for(i in c(seq(1, 8892, by = 1))){
  a <- gps$Android.Ver[i]
  b <- b <- unlist(strsplit(a, "[.]"))
  b <- b[1]
  gps$Android.Ver[i] <- b
  rm(b)
  rm(a)
  i <- i + 1
}
rm(i)

gps$Category <- gps$Category %>% droplevels()
gps$Installs = gsub("[^0-9.]", '', (gps$Installs %>% as.character()))
    gps$Installs = as.numeric(gps$Installs)
gps$Size = gsub("[^0-9.]", '', (gps$Size %>% as.character()))
    gps$Size = as.numeric(gps$Size)
gps$Price = gsub("[^0-9.]", '', (gps$Price %>% as.character()))
    gps$Price = as.numeric(gps$Price)    

gps$Installs <- factor(gps$Installs, levels = c("1000000000", "500000000", "100000000", "50000000", "10000000", "5000000", "1000000", "500000", "100000", "50000", "10000", "5000", "1000", "500", "100", "50", "10", "5", "1"))

gps$Content.Rating <- gps$Content.Rating %>% as.character() %>% word(1) %>% as.factor() %>% droplevels()

Different Categories and their presence in the market

Of the different kinds of people present in this word, Which category is globally accepted?
What are the top categories that perform the best?

gps_cb <- gps %>% na.omit() %>% group_by(Category) %>% summarize(Sum = sum(Rating), count = n(), `Normalized Rating` = sum(Rating)/n()) %>% arrange(desc(`Normalized Rating`))
kable(head(gps_cb, 7), caption = "Top 7 categories based on Normalized Rating")

Category Sum count Normalized Rating
EVENTS 170.2 38 4.478947
EDUCATION 416.1 95 4.380000
ART_AND_DESIGN 257.3 59 4.361017
PARENTING 191.3 44 4.347727
PERSONALIZATION 1206.6 279 4.324731
BOOKS_AND_REFERENCE 618.0 143 4.321678
BEAUTY 158.8 37 4.291892

kable(tail(gps_cb, 7), caption = "Bottom 7 categories based on Normalized Rating")

Category Sum count Normalized Rating
COMMUNICATION 844.1 206 4.097573
LIFESTYLE 1117.4 273 4.093040
TRAVEL_AND_LOCAL 591.9 147 4.026531
VIDEO_PLAYERS 467.0 116 4.025862
MAPS_AND_NAVIGATION 381.3 95 4.013684
TOOLS 2543.0 634 4.011041
DATING 558.3 141 3.959574

  
gps %>% filter(Category %in% (gps_cb$Category %>% head(15))) %>% ggplot(aes(x = Category, y = Rating, fill = Category)) + geom_violin(color = "black") + theme(legend.position = "none") + geom_hline(size = 0.6, color = "red", linetype = 'dashed', alpha = 0.5, yintercept = mean(gps$Rating %>% na.omit()))+coord_flip()+labs(y = "Rating", x = "Category", title = "Category Vs Rating Violin Plot")

rm(gps_cb)

Events, Education and Personalization are the best in overall rating.
While Dating, Tools, Maps and Navigation are the worst rated app categories.

Subtle and Significant Category Forensics

Which category doesn’t perform well on a overall scale?
What are the total number of installs for different categories?

data <- gps %>% group_by(Category) %>% summarise(Count = n(), `Total installs` = sum(as.numeric(as.character(Installs))), `Normalized Installations` = `Total installs`/Count) 

data %>% 
  ggplot(aes(x = reorder(Category, Count), fill = Category, y = Count))+geom_bar(stat = "identity", color = "black")+theme(legend.position = "none")+coord_flip()+
  labs(y = "Number of Apps present", x = "Category", title = "Number of Apps present in Different Category")


data %>% 
  ggplot(aes(x = reorder(Category, `Total installs`), fill = Category, y = `Total installs`))+geom_bar(stat = "identity", color = "black")+theme(legend.position = "none")+coord_flip()+
  labs(y = "Total number of installs", x = "Category", title = "Total Number of Installations per Category")


data %>% 
  ggplot(aes(x = reorder(Category, `Normalized Installations`), fill = Category, y = `Normalized Installations`))+geom_bar(stat = "identity", color = "black")+theme(legend.position = "none")+coord_flip()+
  labs(y = "Total number of normalized installations", x = "Category", title = "Total Number of Normalized Installations per Category")


info <- cbind((data %>% arrange(desc(data$Count)) %>% select(Category)),
(data %>% arrange(desc(data$`Total installs`)) %>% select(Category)),
(data %>% arrange(desc(data$`Normalized Installations`)) %>% select(Category)))
names(info) <- c("By Count", "By Total Installs", "By Total Normalized Installs")
kable(head(info, 7), caption = "Top 7 categories")

By Count By Total Installs By Total Normalized Installs
FAMILY GAME COMMUNICATION
GAME COMMUNICATION SOCIAL
TOOLS SOCIAL VIDEO_PLAYERS
PRODUCTIVITY PRODUCTIVITY PRODUCTIVITY
FINANCE TOOLS PHOTOGRAPHY
PERSONALIZATION FAMILY TRAVEL_AND_LOCAL
COMMUNICATION PHOTOGRAPHY GAME

kable(tail(info, 7), caption = "Bottom 7 categories")

By Count By Total Installs By Total Normalized Installs
27 HOUSE_AND_HOME LIBRARIES_AND_DEMO COMICS
28 LIBRARIES_AND_DEMO COMICS LIBRARIES_AND_DEMO
29 ART_AND_DESIGN AUTO_AND_VEHICLES AUTO_AND_VEHICLES
30 COMICS MEDICAL BEAUTY
31 PARENTING PARENTING PARENTING
32 EVENTS BEAUTY EVENTS
33 BEAUTY EVENTS MEDICAL

rm(info)

Family has the highest competition (number of apps) followed by tools which don’t seem to be rewarding.
Game, and Productivity do well wrt to number of installs even though they have a lot of competition.
Communication and Social are two domains where there is a surge in downloads and the number of competitors are also less.

Comments, Parenting, and Events can be seen as domains that one must ignore as there are very few downloads and so is the competition. Or they can also be seen as domains that have huge potential and are underdog markets that can be exploited with a disruptive idea.

Content Type

What is the distribution of apps of various content?
Is there an effect of content type of installs?
How are they priced?

gps %>% group_by(Content.Rating) %>% summarise(count = n()) %>%
  ggplot(aes(x = reorder(Content.Rating, -count), y = count, fill = Content.Rating))+ geom_bar(stat = "identity", color = "black")+
 scale_fill_viridis(discrete = TRUE) +
  labs(y = "Count", x = "Content Type", title = "Content type and their presence")

Apps for everyone are present in bulk

Installs vs Rating per content type

gps %>% 
  ggplot(aes(x = Rating, y = as.integer(as.character(gps$Installs)), color = Content.Rating))+
  geom_point()+
  facet_wrap(~Content.Rating) + geom_vline(xintercept = 5 ,colour = 'red', linetype = 'dashed') + geom_vline(xintercept = 4 ,colour = 'red', linetype = 'dashed') +  geom_hline(yintercept = 1001000000 ,colour = 'red', linetype = 'dashed') + geom_hline(yintercept = 751000000 ,colour = 'red', linetype = 'dashed')

Everyone and Teens content type attract the most number of installs and a decent rating.

Rating analysis

App ratings (on a scale of 1 to 5) impact the discoverability, and indicates the company’s overall brand image. Ratings are a key performance indicator of an app.

gps  %>% ggplot(aes(x = Rating))+geom_histogram(binwidth = 0.1, fill = "purple")+geom_vline(color = "red", linetype = 'dashed', xintercept = mean(gps$Rating %>% na.omit()))+theme(legend.position = "none")+labs(y = "Count", x = "Rating", title = "Rating Histogram")

Most of the apps have an average rating above 4 due to which there is a significant skewness present in the rating part.
Ratings are right skewed with major apps being highly rated and only few with a low rating.

This indicates that most of the apps in the Market are sized below 25mb.

The applications with large size can make it difficult for users to download. Higher download durations could upset the user and ignore the app.
More importantly the memory a user has in the device is limited. This is to be tested.
Let’s investigate the same with the data.

The distribution of size

gps %>% ggplot(aes(x = Size))+geom_histogram(binwidth = 1, fill = "purple")+theme(legend.position = "none")+labs(y = "Count", x = "Size (MB)", title = "Size distribution Histogram")


gps  %>% filter(Size < 120) %>% ggplot(aes(x = Size))+geom_histogram(binwidth = 1, fill = "purple")+theme(legend.position = "none")+labs(y = "Count", x = "Size (MB)", title = "Size distribution Histogram (Sizes below 120Mb)")

Most of the apps in the Market are sized below 25mb.

Next is the size.
The applications with large size can make it difficult for users to download. Higher download durations could upset the user and ignore the app.
More importantly the memory a user has in the device is limited. This is to be tested.

Impact of Size on Rating

p <- gps %>% ggplot(aes(x = Size, y = Rating))+geom_point()+labs(y = "Rating", x = "Size (MB)", title = "Size Vs Rating") 
ggMarginal(p, type = "histogram")



p <- gps %>% filter(Size < 100) %>% ggplot(aes(x = Size, y = Rating))+geom_point()+labs(y = "Rating", x = "Size (MB)", title = "Size Vs Rating (Sizes below 100)") 
ggMarginal(p, type = "histogram")




p <- gps %>% filter(Size < 10) %>% ggplot(aes(x = Size, y = Rating))+geom_point()+labs(y = "Rating", x = "Size (MB)", title = "Size Vs Rating (Sizes below 10)")  
ggMarginal(p, type = "histogram")

rm(p)

Majority of apps with high rating (rating over 4) are between 1 MB to 10 MB.

What is the variation in the number of insallations

gps %>% ggplot(aes(x = Installs)) +
  geom_histogram(stat = 'count', bins = 100, fill = "purple", color = 'black')+
coord_flip() +  scale_y_continuous(breaks = seq(0, 1500, 100))+labs(y = "Count", x = "Installs Range", title = "Installs distribution Histogram")

The Installs are perfectly distributed as expected. The high end market is very competitive and very few apps fall into that space, while most of the others fall behind the race getting in mediocre downloads. But the interesting part is that there are a lot of apps that fall into 1M downloads point. This is interesting as this gives a hope to a lot of designers as there is scope.

data <- gps %>% group_by(Installs) %>% summarise(Count = n()) %>% arrange(desc(Installs)) 
data <- data %>% mutate(product = as.numeric(as.character(data$Installs))*data$Count, `market share` = product/146625951338*100, `market share per app in the segment` = `market share`/Count)
names(data) <- c("Installs Range", "Total Number of apps present", "Total Installations","Market share of the segment", "Market share per app in that segment")

kable(data, caption = "Installations and their contribution")
Installs Range Total Number of apps present Total Installations Market share of the segment Market share per app in that segment
1 3 3 0.0000000 0.0000000
5 9 45 0.0000000 0.0000000
10 69 690 0.0000005 0.0000000
50 56 2800 0.0000019 0.0000000
100 303 30300 0.0000207 0.0000001
500 199 99500 0.0000679 0.0000003
1000 698 698000 0.0004760 0.0000007
5000 426 2130000 0.0014527 0.0000034
10000 989 9890000 0.0067451 0.0000068
50000 462 23100000 0.0157544 0.0000341
100000 1110 111000000 0.0757028 0.0000682
500000 516 258000000 0.1759579 0.0003410
1000000 1486 1486000000 1.0134632 0.0006820
5000000 683 3415000000 2.3290556 0.0034100
10000000 1132 11320000000 7.7203250 0.0068201
50000000 272 13600000000 9.2753021 0.0341004
100000000 369 36900000000 25.1660771 0.0682008
500000000 61 30500000000 20.8012291 0.3410038
1000000000 49 49000000000 33.4183680 0.6820075

The top five segments make up 96.38% of the market share. This implies that an app present in the top-most installion segment make up close 0.7% of market.

Impact of Price Does Price have anything to do with the number of installs?

gps %>% ggplot(aes(x = Type, y = as.numeric(as.character(gps$Installs)), fill = Type))+geom_boxplot()+scale_fill_manual(values=c('#E69F00', '#56B4E9'))+labs(y = "Installs", x = "Free or Paid", title = "Influence of type of availability on Installations")


#This is because of the outliers present at a very high value. This can be fixed by using the log transformation.

gps  %>% 
  ggplot(aes(x = Type, y = log(as.numeric(as.character(gps$Installs))), fill = Type))+geom_boxplot()+scale_fill_manual(values=c('#E69F00', '#56B4E9'))+labs(y = "Installs after log transformation", x = "Free or Paid", title = "Influence of type of availability on Installations")

NA

As expected, free apps have higher installations than paid.

Price Vs Category

gps %>% 
  ggplot(aes(x = Price, y = Category, color = Category))+
  geom_jitter() + theme(legend.position = "none")+labs(y = "Category", x = "Price", title = "Price Vs Category")


gps %>% select(c("App", "Price")) %>% arrange(desc(Price)) %>% head(20)

The top 15 most expensive apps appear to be junk.

Ignoring them and proceeding with the analysis gives

gps %>% filter(Price < 80) %>%
  ggplot(aes(x = Price, y = Category, color = Category))+
  geom_jitter() + theme(legend.position = "none")+labs(y = "Category", x = "Price", title = "Price Vs Category after cleaning junk")

The most expensive apps belong to Medical and Lifestyle followed by Family.

TEXT MINING - Sentimental Analysis

Word Cloud Function


  wordcloud_c <- function(data, num_words = 100, background = "white") {
  
  # If text is provided, convert it to a dataframe of word frequencies
  if (is.character(data)) {
    corpus <- Corpus(VectorSource(data))
    corpus <- tm_map(corpus, tolower)
    corpus <- tm_map(corpus, removePunctuation)
    corpus <- tm_map(corpus, removeNumbers)
    corpus <- tm_map(corpus, removeWords, stopwords("english"))
    tdm <- as.matrix(TermDocumentMatrix(corpus))
    data <- sort(rowSums(tdm), decreasing = TRUE)
    data <- data.frame(word = names(data), freq = as.numeric(data))
  }
  
  # Make sure a proper num_words is provided
  if (!is.numeric(num_words) || num_words < 3) {
    num_words <- 3
  }  
  
  # Grab the top n most common words
  data <- head(data, n = num_words)
  if (nrow(data) == 0) {
    return(NULL)
  }
  
  wordcloud2(data, backgroundColor = background)
}

SA - Review Analysis

The reviews had some special characters, which had to be removed.
This data is grouped with the data used for EDA and all the irrelevant attributes are dropped.
The Sentiment polarity and Sentiment Subjectivity were then grouped with Category to see how they impact it.

gspr <- read.csv("googleplaystore_user_reviews.csv") %>% na.omit()
gspr$App <- as.character(gspr$App)
gspr$Translated_Review <- as.character(gspr$Translated_Review)
data <- gps %>% select(-c("Reviews", "Price", "Last.Updated", "Android.Ver")) %>% as.data.frame()
data$App <- as.character(data$App)
data <- data %>% inner_join(gspr, by = "App")
#gspr$Languages <- textcat(gspr$Translated_Review)
#gspr <- gspr %>% filter(Languages == "english")
#wordcloud_c(gspr[,2], 5)
#get_sentiments(gspr[,2])
#All the sentiment extractions has already been done in the data so proceding to crunch in that info

data$Sentiment_Polarity <- round(data$Sentiment_Polarity, 2)
data$Sentiment_Subjectivity <- round(data$Sentiment_Subjectivity, 2)

Word Clouds for Positive, Neutral and Negative Reviews

data_po <- data %>% filter(Sentiment_Polarity == 1) %>% select("Translated_Review") 
data_nu <- data %>% filter(Sentiment_Polarity == 0) %>% select("Translated_Review")
data_ne <- data %>% filter(Sentiment_Polarity == -1) %>% select("Translated_Review")
wordcloud_c(data_po[,1], 100)

wordcloud_c(data_nu[,1], 100)

wordcloud_c(data_ne[,1], 100)


rm(data_ne)
rm(data_nu)
rm(data_po)

SA - Text Forensics

das <- data %>% select(-"Translated_Review")
das <- das %>% group_by(Category, Sentiment) %>% summarise(Count = n(), NP = round(sum(Sentiment_Polarity)/n(),2))
das <- das %>% mutate(s = Sentiment)
das <- das %>% spread(Sentiment, NP)
names(das) <-  c("Category", "Count", "s", "Ne_S", "Nu_S", "Po_S")
das <- das %>% spread(s, Count)
names(das) <-  c("Category", "Ne_S", "Nu_S", "Po_S", "Ne_C", "Nu_C", "Po_C")
das[is.na(das)] <- 0
das <- das %>% group_by(Category) %>% summarise(Nu_C = sum(Nu_C), Ne_C = sum(Ne_C), Po_C = sum(Po_C), Nu_S = sum(Nu_S), Ne_S = sum(Ne_S), Po_S = sum(Po_S))


das %>% ggplot(aes(y = Po_S, x = reorder(Category, Po_S), fill = Category))+geom_bar(stat = "identity")+theme(legend.position = "none")+coord_flip()+
  labs(y = "Sentiment Polarity", x = "Category", title = "Strength of Positive Sentiment Vs Category")


das %>% ggplot(aes(y = Po_C, x = reorder(Category, Po_C), fill = Category))+geom_bar(stat = "identity")+theme(legend.position = "none")+coord_flip()+
  labs(y = "Number of Positive Reviews", x = "Category", title = "Number of Positive Vs Category")


das %>% ggplot(aes(y = Ne_S, x = reorder(Category, -Ne_S), fill = Category))+geom_bar(stat = "identity")+theme(legend.position = "none")+coord_flip()+
  labs(y = "Sentiment Polarity", x = "Category", title = "Strength of Negative Sentiment Vs Category")


das %>% ggplot(aes(y = Ne_C, x = reorder(Category, Ne_C), fill = Category))+geom_bar(stat = "identity")+theme(legend.position = "none")+coord_flip()+
  labs(y = "Number of Negative Reviews", x = "Category", title = "Number of Negative Vs Category")



das %>% ggplot(aes(y = (Po_C-Ne_C)/(Ne_C), x = reorder(Category, ((Po_C-Ne_C)/(Ne_C))), fill = Category))+geom_bar(stat = "identity")+theme(legend.position = "none")+coord_flip()+
  labs(y = "Percentage Number of Positive Reviews wrt Negative Reviews", x = "Category", title = "Percentage of Positive Reviews more than Negative Reviews")


das %>% ggplot(aes(y = (Po_S-Ne_S), x = reorder(Category, ((Po_S-Ne_S))), fill = Category))+geom_bar(stat = "identity")+theme(legend.position = "none")+coord_flip()+
  labs(y = "Relative polarity of Positive Reviews wrt Negative Reviews", x = "Category", title = "Relative polarity Positive Reviews wrt Negative Reviews")

rm(das)

Even though beauty has high negative reviews it has the highest polarity difference between positive and negative among all the categories. This category can be ignored due to low number of reviews.
Comics has the highest relative number of positive reviews compared to that of negative.

Almost all the categories have similar intensities when it comes to positive or negative reactions. Health and fitness received the most positive reactions compared to that of the negative ones.

Content Vs Reviews

das <- data %>% select(-c("Translated_Review", "App"))
das %>% ggplot(aes(x = as.factor(Content.Rating), fill = Content.Rating))+geom_histogram(stat = "count")+labs(y = "Number of Reviews", x = "Content Type", fill = "Content Type", title = "Number of Reviews per content")+scale_fill_viridis(discrete = TRUE)+facet_wrap(~Sentiment)+coord_flip()


das %>% ggplot(aes(x = as.factor(Content.Rating), y = Sentiment_Polarity, color = Content.Rating))+geom_boxplot()+labs(y = "Polarity", x = "Content Type", title = "Content Type Vs the intensity of reviews")


das %>% ggplot(aes(x = as.factor(Content.Rating), fill = Sentiment))+geom_bar(stat = "count", position = "fill")+labs(y = "Number of Reviews", x = "Content Type", fill = "Content Type", title = "Number of Reviews per content")+scale_fill_viridis(discrete = TRUE)

As expected there Everyone has the highest number of reviews in all categories. This could be because the it has the highest number of apps in the market.
Adult followed by Mature Apps have higher number of strong positive reviews while Teens have the higher number of strong negative reviews.

Machine Learning - Pre processing

Here apart from general preprocessing, I am defining the success to be installations equal to or greater than 500,000 i.e. the app must take 0.00034% of the total market downloads we classify result to be a “hit”, and else it’s “flop”.
The correlation among various variable is as follows:

rm(das)
rm(data)
data <- gps %>% select(c("Category", "Rating", "Reviews", "Size", "Type", "Content.Rating", "Current.Ver", "Installs"))
glimpse(data)
Observations: 8,892
Variables: 8
$ Category       <fct> ART_AND_DESIGN, ART_AND_DE...
$ Rating         <dbl> 4.1, 3.9, 4.7, 4.5, 4.3, 4...
$ Reviews        <fct> 159, 967, 87510, 215644, 9...
$ Size           <dbl> 19.0, 14.0, 8.7, 25.0, 2.8...
$ Type           <fct> Free, Free, Free, Free, Fr...
$ Content.Rating <fct> Everyone, Everyone, Everyo...
$ Current.Ver    <chr> "1", "2", "1", "Varies wit...
$ Installs       <fct> 10000, 500000, 5000000, 50...
data$Reviews <- as.integer(as.character(data$Reviews))
data$Current.Ver <- as.integer(as.character(data$Current.Ver))
data$Installs <- as.integer(as.character(data$Installs))
data <- data %>% na.omit()
data$Rating <- as.numeric(as.character(data$Rating))
glimpse(data)
Observations: 7,238
Variables: 8
$ Category       <fct> ART_AND_DESIGN, ART_AND_DE...
$ Rating         <dbl> 4.1, 3.9, 4.7, 4.3, 4.4, 3...
$ Reviews        <int> 159, 967, 87510, 967, 167,...
$ Size           <dbl> 19.0, 14.0, 8.7, 2.8, 5.6,...
$ Type           <fct> Free, Free, Free, Free, Fr...
$ Content.Rating <fct> Everyone, Everyone, Everyo...
$ Current.Ver    <int> 1, 2, 1, 1, 1, 1, 6, 2, 2,...
$ Installs       <int> 10000, 500000, 5000000, 10...
cor(data %>% select(c("Rating", "Reviews", "Size", "Current.Ver", "Installs")))
                  Rating      Reviews        Size
Rating       1.000000000  0.080283728 -0.02059633
Reviews      0.080283728  1.000000000  0.03373306
Size        -0.020596331  0.033733058  1.00000000
Current.Ver -0.002349692 -0.005979475  0.02677096
Installs     0.053719879  0.643536646  0.01538013
             Current.Ver     Installs
Rating      -0.002349692  0.053719879
Reviews     -0.005979475  0.643536646
Size         0.026770959  0.015380130
Current.Ver  1.000000000 -0.005627174
Installs    -0.005627174  1.000000000
corrplot(cor(data %>% select(c("Rating", "Reviews", "Size", "Current.Ver", "Installs"))), method = "color", title = "Correlation Plot", mar=c(0,0,1,0))


data <- data %>% mutate(Result = as.factor(ifelse(Installs >= 500000, "Hit", "Flop"))) %>% select(-Installs)

Installs has the highest correlation with reviews

Machine Learning - Data Partition

normalize <- function(x){
  return ((x-min(x))/(max(x)-min(x)))
}
data.ml<- cbind(as.data.frame(lapply(data[,c(2,3,4,7)],normalize)), data[,-c(2,3,4,7)])

set.seed(123)
trainindex <- createDataPartition(data.ml$Result, p=0.8, list= FALSE)
trainn <- data.ml[trainindex, ]
testn <- data.ml[-trainindex, ]

set.seed(123)
trainindex <- createDataPartition(data.ml$Result, p=0.8, list= FALSE)
training <- data.ml[trainindex, ]
testing <- data.ml[-trainindex, ]

I split the data into two parts training and testing using data partition at 80% split. I then normalized the continuous variables of the data for models that require normalized input and added the data that is not normalized to models which don’t require normalized inputs.

I have used the following models on the data to classify the records as hit or flop:

Logistic Regression
Decision Trees
Random Forest
Naïve Bayes Classifier
SVM with Default Cost parameter
Tuned SVM (Couldn’t add it as it takes more than 12 hours to run)

These models were run upon all the attributes with Result (“Hit” or “Flop”) being the target attribute.

The outputs of the models were as follows:

Machine Learning - Modelling - Logistic Regression

rm(trainindex)
rm(data.ml)
train <- trainn
test <- testn
logit.reg <- glm(Result ~.,data = train, family = "binomial")
summary(logit.reg)

Call:
glm(formula = Result ~ ., family = "binomial", data = train)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-5.1087  -0.3290  -0.0005   0.0000   2.7745  

Coefficients:
                               Estimate  Std. Error
(Intercept)                   -10.93293  1455.39801
Rating                         -1.88838     0.39186
Reviews                     25958.79073   991.38262
Size                            0.32449     0.63443
Current.Ver                     1.08064     1.52771
CategoryAUTO_AND_VEHICLES       0.03749     0.72313
CategoryBEAUTY                  0.50437     0.81195
CategoryBOOKS_AND_REFERENCE    -0.34906     0.71308
CategoryBUSINESS               -0.58562     0.67192
CategoryCOMICS                 -1.56059     1.05006
CategoryCOMMUNICATION          -0.98232     0.79991
CategoryDATING                 -0.70588     0.76343
CategoryEDUCATION               0.76883     0.73797
CategoryENTERTAINMENT           0.17953     0.97639
CategoryEVENTS                 -1.53144     1.26783
CategoryFAMILY                 -0.35114     0.60144
CategoryFINANCE                -1.41404     0.68849
CategoryFOOD_AND_DRINK         -0.25462     0.77866
CategoryGAME                   -0.47244     0.63475
CategoryHEALTH_AND_FITNESS     -0.23195     0.67832
CategoryHOUSE_AND_HOME          0.77182     0.74567
CategoryLIBRARIES_AND_DEMO      0.05045     0.78375
CategoryLIFESTYLE              -0.40941     0.65481
CategoryMAPS_AND_NAVIGATION    -0.69873     0.84434
CategoryMEDICAL                -1.35866     0.70991
CategoryNEWS_AND_MAGAZINES     -0.71658     0.74933
CategoryPARENTING               0.54597     0.82074
CategoryPERSONALIZATION         0.19263     0.67100
CategoryPHOTOGRAPHY             0.44799     0.67702
CategoryPRODUCTIVITY           -0.39227     0.67103
CategorySHOPPING                0.15645     0.70957
CategorySOCIAL                 -0.74732     0.78432
CategorySPORTS                 -0.78227     0.68603
CategoryTOOLS                  -0.41091     0.62426
CategoryTRAVEL_AND_LOCAL        0.01793     0.69436
CategoryVIDEO_PLAYERS           0.00502     0.72942
CategoryWEATHER                 0.52329     0.82155
TypePaid                      -13.67281     0.94806
Content.RatingEveryone          9.74847  1455.39787
Content.RatingMature           10.02771  1455.39790
Content.RatingTeen              8.89360  1455.39787
Content.RatingUnrated          -3.44570  2058.24319
                            z value
(Intercept)                  -0.008
Rating                       -4.819
Reviews                      26.184
Size                          0.511
Current.Ver                   0.707
CategoryAUTO_AND_VEHICLES     0.052
CategoryBEAUTY                0.621
CategoryBOOKS_AND_REFERENCE  -0.490
CategoryBUSINESS             -0.872
CategoryCOMICS               -1.486
CategoryCOMMUNICATION        -1.228
CategoryDATING               -0.925
CategoryEDUCATION             1.042
CategoryENTERTAINMENT         0.184
CategoryEVENTS               -1.208
CategoryFAMILY               -0.584
CategoryFINANCE              -2.054
CategoryFOOD_AND_DRINK       -0.327
CategoryGAME                 -0.744
CategoryHEALTH_AND_FITNESS   -0.342
CategoryHOUSE_AND_HOME        1.035
CategoryLIBRARIES_AND_DEMO    0.064
CategoryLIFESTYLE            -0.625
CategoryMAPS_AND_NAVIGATION  -0.828
CategoryMEDICAL              -1.914
CategoryNEWS_AND_MAGAZINES   -0.956
CategoryPARENTING             0.665
CategoryPERSONALIZATION       0.287
CategoryPHOTOGRAPHY           0.662
CategoryPRODUCTIVITY         -0.585
CategorySHOPPING              0.220
CategorySOCIAL               -0.953
CategorySPORTS               -1.140
CategoryTOOLS                -0.658
CategoryTRAVEL_AND_LOCAL      0.026
CategoryVIDEO_PLAYERS         0.007
CategoryWEATHER               0.637
TypePaid                    -14.422
Content.RatingEveryone        0.007
Content.RatingMature          0.007
Content.RatingTeen            0.006
Content.RatingUnrated        -0.002
                                        Pr(>|z|)    
(Intercept)                               0.9940    
Rating                                0.00000144 ***
Reviews                     < 0.0000000000000002 ***
Size                                      0.6090    
Current.Ver                               0.4793    
CategoryAUTO_AND_VEHICLES                 0.9587    
CategoryBEAUTY                            0.5345    
CategoryBOOKS_AND_REFERENCE               0.6245    
CategoryBUSINESS                          0.3835    
CategoryCOMICS                            0.1372    
CategoryCOMMUNICATION                     0.2194    
CategoryDATING                            0.3552    
CategoryEDUCATION                         0.2975    
CategoryENTERTAINMENT                     0.8541    
CategoryEVENTS                            0.2271    
CategoryFAMILY                            0.5593    
CategoryFINANCE                           0.0400 *  
CategoryFOOD_AND_DRINK                    0.7437    
CategoryGAME                              0.4567    
CategoryHEALTH_AND_FITNESS                0.7324    
CategoryHOUSE_AND_HOME                    0.3006    
CategoryLIBRARIES_AND_DEMO                0.9487    
CategoryLIFESTYLE                         0.5318    
CategoryMAPS_AND_NAVIGATION               0.4079    
CategoryMEDICAL                           0.0556 .  
CategoryNEWS_AND_MAGAZINES                0.3389    
CategoryPARENTING                         0.5059    
CategoryPERSONALIZATION                   0.7741    
CategoryPHOTOGRAPHY                       0.5082    
CategoryPRODUCTIVITY                      0.5588    
CategorySHOPPING                          0.8255    
CategorySOCIAL                            0.3407    
CategorySPORTS                            0.2542    
CategoryTOOLS                             0.5104    
CategoryTRAVEL_AND_LOCAL                  0.9794    
CategoryVIDEO_PLAYERS                     0.9945    
CategoryWEATHER                           0.5242    
TypePaid                    < 0.0000000000000002 ***
Content.RatingEveryone                    0.9947    
Content.RatingMature                      0.9945    
Content.RatingTeen                        0.9951    
Content.RatingUnrated                     0.9987    
---
Signif. codes:  
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 7960.4  on 5790  degrees of freedom
Residual deviance: 1925.0  on 5749  degrees of freedom
AIC: 2009

Number of Fisher Scoring iterations: 14
logit.reg <- glm(Result ~.,data = train %>% select(-"Category"), family = "binomial")
summary(logit.reg)

Call:
glm(formula = Result ~ ., family = "binomial", data = train %>% 
    select(-"Category"))

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-5.0934  -0.3408  -0.0006   0.0000   2.6391  

Coefficients:
                         Estimate Std. Error z value
(Intercept)              -11.6063  1455.3979  -0.008
Rating                    -1.8041     0.3704  -4.870
Reviews                25796.9828   957.4169  26.944
Size                       0.2645     0.6097   0.434
Current.Ver                0.9836     1.4986   0.656
TypePaid                 -13.3421     0.9482 -14.071
Content.RatingEveryone    10.0192  1455.3979   0.007
Content.RatingMature      10.1071  1455.3979   0.007
Content.RatingTeen         9.1242  1455.3979   0.006
Content.RatingUnrated     -3.2440  2058.2432  -0.002
                                   Pr(>|z|)    
(Intercept)                           0.994    
Rating                           0.00000112 ***
Reviews                < 0.0000000000000002 ***
Size                                  0.664    
Current.Ver                           0.512    
TypePaid               < 0.0000000000000002 ***
Content.RatingEveryone                0.995    
Content.RatingMature                  0.994    
Content.RatingTeen                    0.995    
Content.RatingUnrated                 0.999    
---
Signif. codes:  
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 7960.4  on 5790  degrees of freedom
Residual deviance: 1984.3  on 5781  degrees of freedom
AIC: 2004.3

Number of Fisher Scoring iterations: 14
#confint(logit.reg) to check for confidence intervals

test$predraw <- predict(logit.reg, newdata = test %>% select(-c("Category", "Result")), type = "response")

par <- data.frame("Cut off" = 1, "Accuracy" = 1, "Sensitivity" = 1)
i = 0
j = 1
for(i in seq(0,1,0.001)){
  test <- test %>% mutate(pred = factor(as.character(ifelse(predraw < i, "Flop", "Hit")), levels = c("Hit", "Flop")))
 par[j,2] <- 1-mean(test$Result != test$pred)
 par[j,1] <- i
 test1 <- test %>% filter(Result == "Hit")
  par[j,3] <- mean(test1$Result == test1$pred)
j <- j+1
i <- i+1
rm(test1)
}
rm(i)
rm(j)

kable(par %>% filter(Accuracy == max(Accuracy)) %>% head(5), caption = "The cut off for the best accuracy")

Cut.off Accuracy Sensitivity
0.370 0.9502419 0.9162791
0.371 0.9502419 0.9162791
0.372 0.9502419 0.9147287
0.373 0.9502419 0.9147287
0.374 0.9502419 0.9147287


par %>%ggplot(aes(x = Cut.off, y = par$Accuracy))+geom_point(size = 0.05, color = "gray")+geom_line()+geom_hline(yintercept = max(par$Accuracy), color = "red")+geom_vline(xintercept = 0.375, color = "red")+
  labs(x = "Cut Off value", y = "Accuracy of prediction", title = "Cut off Vs Accuracy")

rm(par)  

test <- test %>% mutate(pred = as.factor(ifelse(predraw <= 0.35, "Flop", "Hit")))
test$pred <- factor(test$pred, levels = c("Hit", "Flop"))
test$Result <- factor(test$Result, levels = c("Hit", "Flop"))
cm <- confusionMatrix(test$pred, test$Result)
cm
Confusion Matrix and Statistics

          Reference
Prediction Hit Flop
      Hit  591   22
      Flop  54  780
                                               
               Accuracy : 0.9475               
                 95% CI : (0.9347, 0.9584)     
    No Information Rate : 0.5543               
    P-Value [Acc > NIR] : < 0.00000000000000022
                                               
                  Kappa : 0.8932               
                                               
 Mcnemar's Test P-Value : 0.0003766            
                                               
            Sensitivity : 0.9163               
            Specificity : 0.9726               
         Pos Pred Value : 0.9641               
         Neg Pred Value : 0.9353               
             Prevalence : 0.4457               
         Detection Rate : 0.4084               
   Detection Prevalence : 0.4236               
      Balanced Accuracy : 0.9444               
                                               
       'Positive' Class : Hit                  
                                               
logit <- data.frame("Model" = "Logistic Regression", "Accuracy" = cm[["overall"]][["Accuracy"]], "Sensitivity" = cm[["byClass"]][["Sensitivity"]], "Specificity" = cm[["byClass"]][["Specificity"]])

Logistic regression has performed pretty well in all the aspects. Note: It must remembered that our class of interest is Hit.

Machine Learning - Modelling - Decision Trees

rm(test)
rm(train)
rm(logit.reg)
train <- training
test <- testing

set.seed(123)
dt <- rpart(Result ~., data = train, method = "class", cp = 0.00001, minsplit = 1, xval = 10)
dt.pruned <- prune(dt, cp = 0.004)

test$pred <- predict(dt.pruned, newdata = test[,-8], type = "class")

cm <- confusionMatrix(factor(test$pred, levels = c("Hit", "Flop")), factor(test$Result, levels = c("Hit", "Flop")))
cm
Confusion Matrix and Statistics

          Reference
Prediction Hit Flop
      Hit  593   28
      Flop  52  774
                                              
               Accuracy : 0.9447              
                 95% CI : (0.9317, 0.9559)    
    No Information Rate : 0.5543              
    P-Value [Acc > NIR] : < 0.0000000000000002
                                              
                  Kappa : 0.8877              
                                              
 Mcnemar's Test P-Value : 0.01013             
                                              
            Sensitivity : 0.9194              
            Specificity : 0.9651              
         Pos Pred Value : 0.9549              
         Neg Pred Value : 0.9370              
             Prevalence : 0.4457              
         Detection Rate : 0.4098              
   Detection Prevalence : 0.4292              
      Balanced Accuracy : 0.9422              
                                              
       'Positive' Class : Hit                 
                                              
det <- data.frame("Model" = "Decision Trees", "Accuracy" = cm[["overall"]][["Accuracy"]], "Sensitivity" = cm[["byClass"]][["Sensitivity"]], "Specificity" = cm[["byClass"]][["Specificity"]])
rm(cm)
rpart.plot(dt.pruned, box.palette="RdBu", shadow.col="gray", nn=TRUE)


rpart.rules(dt.pruned, style = "tallw")
Result is 0.00 when
               Reviews is 0.00009 to 0.00019
               Type is Paid

Result is 0.00 when
               Reviews is 0.00019 to 0.00053
               Type is Paid

Result is 0.07 when
               Reviews < 0.00009

Result is 0.68 when
               Reviews is 0.00009 to 0.00019
               Type is Free

Result is 0.80 when
               Reviews >= 0.00053
               Type is Paid

Result is 0.98 when
               Reviews >= 0.00019
               Type is Free

Machine Learning - Modelling - Random Forest

rm(train)
rm(test)
rm(dt)
rm(dt.pruned)
train <- training
test <- testing
rf <- randomForest(Result ~., data = train)
plot(rf, main = "Error rate of random forest")

varImpPlot(rf, main = "Gini Plot - Importance of Variables")

test$pred = predict(rf, newdata = test[-8])
cm <- confusionMatrix(factor(test$pred, levels = c("Hit", "Flop")), factor(test$Result, levels = c("Hit", "Flop")))
cm
Confusion Matrix and Statistics

          Reference
Prediction Hit Flop
      Hit  613   44
      Flop  32  758
                                             
               Accuracy : 0.9475             
                 95% CI : (0.9347, 0.9584)   
    No Information Rate : 0.5543             
    P-Value [Acc > NIR] : <0.0000000000000002
                                             
                  Kappa : 0.8939             
                                             
 Mcnemar's Test P-Value : 0.207              
                                             
            Sensitivity : 0.9504             
            Specificity : 0.9451             
         Pos Pred Value : 0.9330             
         Neg Pred Value : 0.9595             
             Prevalence : 0.4457             
         Detection Rate : 0.4236             
   Detection Prevalence : 0.4540             
      Balanced Accuracy : 0.9478             
                                             
       'Positive' Class : Hit                
                                             
raf <- data.frame("Model" = "Random Forest", "Accuracy" = cm[["overall"]][["Accuracy"]], "Sensitivity" = cm[["byClass"]][["Sensitivity"]], "Specificity" = cm[["byClass"]][["Specificity"]])
rm(cm)

Machine Learning - Modelling - Naive Bayes Classifier

rm(test)
rm(train)
rm(rf)
train <- training
train$Result <- factor(train$Result, levels = c("Hit", "Flop"))
test <- testing
test$Result <- factor(test$Result, levels = c("Hit", "Flop"))

delays.nb <- naiveBayes(Result~., data = train)
delays.nb

Naive Bayes Classifier for Discrete Predictors

Call:
naiveBayes.default(x = X, y = Y, laplace = laplace)

A-priori probabilities:
Y
     Hit     Flop 
0.446037 0.553963 

Conditional probabilities:
      Rating
Y           [,1]       [,2]
  Hit  0.8122145 0.08782847
  Flop 0.7745324 0.16554476

      Reviews
Y               [,1]          [,2]
  Hit  0.01358894493 0.05288638212
  Flop 0.00002235907 0.00006197387

      Size
Y            [,1]       [,2]
  Hit  0.03556935 0.06165665
  Flop 0.03563030 0.10756489

      Current.Ver
Y              [,1]       [,2]
  Hit  0.0009565182 0.02834286
  Flop 0.0017979409 0.03567101

      Category
Y      ART_AND_DESIGN AUTO_AND_VEHICLES       BEAUTY
  Hit    0.0050329075      0.0061943477 0.0034843206
  Flop   0.0093516209      0.0127805486 0.0065461347
      Category
Y      BOOKS_AND_REFERENCE     BUSINESS       COMICS
  Hit         0.0154858691 0.0158730159 0.0034843206
  Flop        0.0233790524 0.0417705736 0.0084164589
      Category
Y      COMMUNICATION       DATING    EDUCATION
  Hit   0.0313588850 0.0174216028 0.0174216028
  Flop  0.0249376559 0.0224438903 0.0068578554
      Category
Y      ENTERTAINMENT       EVENTS       FAMILY
  Hit   0.0178087495 0.0007742935 0.2040263260
  Flop  0.0028054863 0.0068578554 0.2319201995
      Category
Y           FINANCE FOOD_AND_DRINK         GAME
  Hit  0.0240030972   0.0147115757 0.1966705381
  Flop 0.0464463840   0.0102867830 0.0692019950
      Category
Y      HEALTH_AND_FITNESS HOUSE_AND_HOME
  Hit        0.0282617112   0.0096786682
  Flop       0.0246259352   0.0046758105
      Category
Y      LIBRARIES_AND_DEMO    LIFESTYLE
  Hit        0.0054200542 0.0247773906
  Flop       0.0112219451 0.0458229426
      Category
Y      MAPS_AND_NAVIGATION      MEDICAL
  Hit         0.0116144019 0.0081300813
  Flop        0.0130922693 0.0620324190
      Category
Y      NEWS_AND_MAGAZINES    PARENTING
  Hit        0.0197444832 0.0038714673
  Flop       0.0240024938 0.0059226933
      Category
Y      PERSONALIZATION  PHOTOGRAPHY PRODUCTIVITY
  Hit     0.0356174990 0.0460704607 0.0321331785
  Flop    0.0386533666 0.0174563591 0.0308603491
      Category
Y          SHOPPING       SOCIAL       SPORTS
  Hit  0.0329074719 0.0228416570 0.0301974448
  Flop 0.0124688279 0.0208852868 0.0317955112
      Category
Y             TOOLS TRAVEL_AND_LOCAL VIDEO_PLAYERS
  Hit  0.0662020906     0.0205187766  0.0185830430
  Flop 0.0950748130     0.0202618454  0.0124688279
      Category
Y           WEATHER
  Hit  0.0096786682
  Flop 0.0046758105

      Type
Y                0        Free         NaN
  Hit  0.000000000 0.992257065 0.000000000
  Flop 0.000000000 0.875311721 0.000000000
      Type
Y             Paid
  Hit  0.007742935
  Flop 0.124688279

      Content.Rating
Y            Adults     Everyone       Mature
  Hit  0.0003871467 0.8033294619 0.0514905149
  Flop 0.0000000000 0.8753117207 0.0405236908
      Content.Rating
Y              Teen      Unrated
  Hit  0.1447928765 0.0000000000
  Flop 0.0838528678 0.0003117207
pred.prob <- as.data.frame(predict(delays.nb, newdata = test, type = "raw"))
pred.prob <-  round(pred.prob, 4)
pred.prob <- pred.prob %>% select("Hit")
names(pred.prob) <-  "predraw"
test <- cbind(test, pred.prob)
rm(pred.prob)

par <- data.frame("Cut off" = 1, "Accuracy" = 1, "Sensitivity" = 1)
i = 0
j = 1
for(i in seq(0,1,0.001)){
  test <- test %>% mutate(pred = factor(as.character(ifelse(predraw < i, "Flop", "Hit")), levels = c("Hit", "Flop")))
 par[j,2] <- 1-mean(test$Result != test$pred)
 test1 <- test %>% filter(Result == "Hit")
 par[j,3] <- mean(test1$Result == test1$pred)
 par[j,1] <- i
j <- j+1
i <- i+1
rm(test1)
}
rm(i)
rm(j)

kable(par %>% filter(Accuracy == max(Accuracy)) %>% head(5), caption = "The (Hit) probability cut off for the best accuracy")

Cut.off Accuracy Sensitivity
0.022 0.9246717 0.8573643
0.023 0.9246717 0.8573643
0.024 0.9246717 0.8573643
0.028 0.9246717 0.8542636


par %>% ggplot(aes(x = `Cut.off`, y = par$Accuracy))+geom_point(size = 0.05, color = "gray")+geom_line()+geom_hline(yintercept = max(par$Accuracy), color = "red")+geom_vline(xintercept = 0.024, color = "red")+
  labs(x = "Cut Off value", y = "Accuracy of prediction", title = "Cut off Vs Accuracy")


rm(par)

test <- test %>% mutate(pred =  factor(as.character(ifelse(predraw < 0.024, "Flop", "Hit")), levels = c("Hit", "Flop")))

cm <- confusionMatrix(factor(test$pred, levels = c("Hit", "Flop")), factor(test$Result, levels = c("Hit", "Flop")))
cm
Confusion Matrix and Statistics

          Reference
Prediction Hit Flop
      Hit  553   17
      Flop  92  785
                                               
               Accuracy : 0.9247               
                 95% CI : (0.9098, 0.9377)     
    No Information Rate : 0.5543               
    P-Value [Acc > NIR] : < 0.00000000000000022
                                               
                  Kappa : 0.8458               
                                               
 Mcnemar's Test P-Value : 0.000000000001361    
                                               
            Sensitivity : 0.8574               
            Specificity : 0.9788               
         Pos Pred Value : 0.9702               
         Neg Pred Value : 0.8951               
             Prevalence : 0.4457               
         Detection Rate : 0.3822               
   Detection Prevalence : 0.3939               
      Balanced Accuracy : 0.9181               
                                               
       'Positive' Class : Hit                  
                                               
nab <- data.frame("Model" = "Naive Bayes", "Accuracy" = cm[["overall"]][["Accuracy"]], "Sensitivity" = cm[["byClass"]][["Sensitivity"]], "Specificity" = cm[["byClass"]][["Specificity"]])
rm(cm)

Machine Learning - Modelling - SVM

rm(test)
rm(train)
rm(delays.nb)

train <- training
train$Result <- factor(train$Result, levels = c("Hit", "Flop"))
test <- testing
test$Result <- factor(test$Result, levels = c("Hit", "Flop"))

set.seed(123)
svm.fit <- svm(Result~., data=train, kernel = "linear")

test$pred <- predict(svm.fit, newdata = test[,-8], type = "class")
cm <- confusionMatrix(factor(test$pred, levels = c("Hit", "Flop")), factor(test$Result, levels = c("Hit", "Flop")))
cm
Confusion Matrix and Statistics

          Reference
Prediction Hit Flop
      Hit  376   40
      Flop 269  762
                                               
               Accuracy : 0.7865               
                 95% CI : (0.7644, 0.8073)     
    No Information Rate : 0.5543               
    P-Value [Acc > NIR] : < 0.00000000000000022
                                               
                  Kappa : 0.5523               
                                               
 Mcnemar's Test P-Value : < 0.00000000000000022
                                               
            Sensitivity : 0.5829               
            Specificity : 0.9501               
         Pos Pred Value : 0.9038               
         Neg Pred Value : 0.7391               
             Prevalence : 0.4457               
         Detection Rate : 0.2598               
   Detection Prevalence : 0.2875               
      Balanced Accuracy : 0.7665               
                                               
       'Positive' Class : Hit                  
                                               
linear <- data.frame("MOdel" = "SVM - Linear", "Accuracy" = cm[["overall"]][["Accuracy"]], "Sensitivity" = cm[["byClass"]][["Sensitivity"]], "Specificity" = cm[["byClass"]][["Specificity"]])

rm(test)
rm(train)
rm(svm.fit)
rm(cm)

train <- training
train$Result <- factor(train$Result, levels = c("Hit", "Flop"))
test <- testing
test$Result <- factor(test$Result, levels = c("Hit", "Flop"))

set.seed(123)
svm.fit <- svm(Result~., data=train, kernel = "radial")

test$pred <- predict(svm.fit, newdata = test[,-8], type = "class")
cm <- confusionMatrix(factor(test$pred, levels = c("Hit", "Flop")), factor(test$Result, levels = c("Hit", "Flop")))
cm
Confusion Matrix and Statistics

          Reference
Prediction Hit Flop
      Hit  343   93
      Flop 302  709
                                               
               Accuracy : 0.727                
                 95% CI : (0.7033, 0.7498)     
    No Information Rate : 0.5543               
    P-Value [Acc > NIR] : < 0.00000000000000022
                                               
                  Kappa : 0.4294               
                                               
 Mcnemar's Test P-Value : < 0.00000000000000022
                                               
            Sensitivity : 0.5318               
            Specificity : 0.8840               
         Pos Pred Value : 0.7867               
         Neg Pred Value : 0.7013               
             Prevalence : 0.4457               
         Detection Rate : 0.2370               
   Detection Prevalence : 0.3013               
      Balanced Accuracy : 0.7079               
                                               
       'Positive' Class : Hit                  
                                               
radial <- data.frame("MOdel" = "SVM - Radial", "Accuracy" = cm[["overall"]][["Accuracy"]], "Sensitivity" = cm[["byClass"]][["Sensitivity"]], "Specificity" = cm[["byClass"]][["Specificity"]])

rm(test)
rm(train)
rm(svm.fit)
rm(cm)

train <- training
train$Result <- factor(train$Result, levels = c("Hit", "Flop"))
test <- testing
test$Result <- factor(test$Result, levels = c("Hit", "Flop"))

set.seed(123)
svm.fit <- svm(Result~., data=train, kernel = "polynomial")

test$pred <- predict(svm.fit, newdata = test[,-8], type = "class")
cm <- confusionMatrix(factor(test$pred, levels = c("Hit", "Flop")), factor(test$Result, levels = c("Hit", "Flop")))
cm
Confusion Matrix and Statistics

          Reference
Prediction Hit Flop
      Hit   22    0
      Flop 623  802
                                             
               Accuracy : 0.5695             
                 95% CI : (0.5435, 0.5951)   
    No Information Rate : 0.5543             
    P-Value [Acc > NIR] : 0.1277             
                                             
                  Kappa : 0.0377             
                                             
 Mcnemar's Test P-Value : <0.0000000000000002
                                             
            Sensitivity : 0.03411            
            Specificity : 1.00000            
         Pos Pred Value : 1.00000            
         Neg Pred Value : 0.56281            
             Prevalence : 0.44575            
         Detection Rate : 0.01520            
   Detection Prevalence : 0.01520            
      Balanced Accuracy : 0.51705            
                                             
       'Positive' Class : Hit                
                                             
polynomial <- data.frame("MOdel" = "SVM - Polynomial", "Accuracy" = cm[["overall"]][["Accuracy"]], "Sensitivity" = cm[["byClass"]][["Sensitivity"]], "Specificity" = cm[["byClass"]][["Specificity"]])

rm(cm)
rm(test)
rm(train)
rm(svm.fit)

svm <- rbind(linear, radial, polynomial)
rm(linear)
rm(radial)
rm(polynomial)
names(svm) <- c("Model", "Accuracy", "Sensitivity", "Specificity")

Machine Learning - Modelling - SVM (Tuning)

set.seed(123)
tunesvm <- tune(svm, Result~., data = train, kernal = "linear", ranges = list(cost = c(seq(0.01,0.1,by = 0.01), seq(0.1,1,by = 0.1), seq(1,10,by = 1))))
dynamic.cost <- tunesvm$best.model$cost
summary(tunesvm)
data.par <- tunesvm$performances
data.par[data.par$error == min(data.par$error),]

set.seed(123)
svm_cost <- svm(Result~., data=train,kernel = "linear", cost = dynamic.cost)
pred_train <- predict(svm_cost, train)
train.error <- mean(pred_train != train$Result)
pred_test <- predict(svm_cost, test)
test.error <- mean(pred_test != test$Result)
linear <- data.frame("Kernal" = "Linear", "Cost" = dynamic.cost,
                     "train Error" = train.error, 
                     "test Error" = test.error)

set.seed(123)
tunesvm <- tune(svm, Result~., data = train, kernal = "radial", ranges = list(cost = c(seq(0.01,0.1,by = 0.01), seq(0.1,1,by = 0.1), seq(1,10,by = 1))))
dynamic.cost <- tunesvm$best.model$cost
summary(tunesvm)
data.par <- tunesvm$performances
data.par[data.par$error == min(data.par$error),] 


svm_cost <- svm(Result~., data=train, kernel = "radial", cost = dynamic.cost)
pred_train <- predict(svm_cost, train)
train.error <- mean(pred_train != train$Result)
pred_test <- predict(svm_cost, test)
test.error <- mean(pred_test != test$Result)
radial <- data.frame("Kernal" = "Radial", "Cost" = dynamic.cost,
                     "train Error" =  train.error, 
                     "test Error" = test.error)

set.seed(123)
tunesvm <- tune(svm, Result~., data = train, kernal = "polynomial", degree = 2, ranges = list(cost = c(seq(0.01,0.1, by = 0.01), seq(0.1,1,by = 0.1), seq(1,10,by = 1))))
dynamic.cost <- tunesvm$best.model$cost
summary(tunesvm)
data.par <- tunesvm$performances
data.par[data.par$error == min(data.par$error),]
svm_cost <- svm(Result~., data=train, kernel = "polynomial", degree = 2, cost = dynamic.cost)
pred_train <- predict(svm_cost, train)
train.error <- mean(pred_train != train$Result)
pred_test <- predict(svm_cost, test)
test.error <- mean(pred_test != test$Result)
polynomial <- data.frame("Kernal" = "Polynomial", "Cost" = dynamic.cost,
                         "train Error" =  train.error, 
                         "test Error" = test.error)


kable(final <- rbind(linear, radial, polynomial), caption = "SVM Performance with different kernals")

In this analysis I used Rating and Reviews which are known only after an App is launched. So the analysis I did so far is purely hypothetical. For this reason I am repeating the entire analysis dropping Rating and Reviews Attributes.

Machine Learning - Modelling - Logistic Regression

rm(trainindex)
rm(data.ml)
train <- trainn %>% select(-c("Rating", "Reviews"))
test <- testn %>% select(-c("Rating", "Reviews"))

logit.reg <- glm(Result ~.,data = train, family = "binomial")
summary(logit.reg)

Call:
glm(formula = Result ~ ., family = "binomial", data = train)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.9539  -1.0772  -0.3188   1.1647   2.9656  

Coefficients:
                              Estimate Std. Error
(Intercept)                  10.932973 196.968026
Size                          0.290329   0.343408
Current.Ver                  -1.065856   0.955384
CategoryAUTO_AND_VEHICLES    -0.166999   0.446399
CategoryBEAUTY               -0.078609   0.520725
CategoryBOOKS_AND_REFERENCE   0.198912   0.389295
CategoryBUSINESS             -0.387421   0.380052
CategoryCOMICS               -0.481986   0.513833
CategoryCOMMUNICATION         0.884703   0.373083
CategoryDATING                0.028899   0.420699
CategoryEDUCATION             1.603696   0.430444
CategoryENTERTAINMENT         2.278416   0.497425
CategoryEVENTS               -1.646004   0.811304
CategoryFAMILY                0.540242   0.340662
CategoryFINANCE              -0.034169   0.368334
CategoryFOOD_AND_DRINK        0.929963   0.412082
CategoryGAME                  1.751529   0.347747
CategoryHEALTH_AND_FITNESS    0.755244   0.373784
CategoryHOUSE_AND_HOME        1.275349   0.468070
CategoryLIBRARIES_AND_DEMO   -0.227878   0.466059
CategoryLIFESTYLE             0.003346   0.367949
CategoryMAPS_AND_NAVIGATION   0.499621   0.413614
CategoryMEDICAL              -1.259244   0.407924
CategoryNEWS_AND_MAGAZINES    0.351646   0.381546
CategoryPARENTING             0.162152   0.517472
CategoryPERSONALIZATION       0.894042   0.368283
CategoryPHOTOGRAPHY           1.694726   0.376429
CategoryPRODUCTIVITY          0.701663   0.368215
CategorySHOPPING              1.533020   0.387390
CategorySOCIAL                0.502244   0.384332
CategorySPORTS                0.621386   0.369610
CategoryTOOLS                 0.328361   0.349381
CategoryTRAVEL_AND_LOCAL      0.653622   0.384622
CategoryVIDEO_PLAYERS         0.925326   0.397957
CategoryWEATHER               1.610839   0.491177
TypePaid                     -3.079504   0.235667
Content.RatingEveryone      -11.714243 196.967741
Content.RatingMature        -11.409212 196.967821
Content.RatingTeen          -11.479430 196.967763
Content.RatingUnrated       -22.828966 278.554424
                            z value
(Intercept)                   0.056
Size                          0.845
Current.Ver                  -1.116
CategoryAUTO_AND_VEHICLES    -0.374
CategoryBEAUTY               -0.151
CategoryBOOKS_AND_REFERENCE   0.511
CategoryBUSINESS             -1.019
CategoryCOMICS               -0.938
CategoryCOMMUNICATION         2.371
CategoryDATING                0.069
CategoryEDUCATION             3.726
CategoryENTERTAINMENT         4.580
CategoryEVENTS               -2.029
CategoryFAMILY                1.586
CategoryFINANCE              -0.093
CategoryFOOD_AND_DRINK        2.257
CategoryGAME                  5.037
CategoryHEALTH_AND_FITNESS    2.021
CategoryHOUSE_AND_HOME        2.725
CategoryLIBRARIES_AND_DEMO   -0.489
CategoryLIFESTYLE             0.009
CategoryMAPS_AND_NAVIGATION   1.208
CategoryMEDICAL              -3.087
CategoryNEWS_AND_MAGAZINES    0.922
CategoryPARENTING             0.313
CategoryPERSONALIZATION       2.428
CategoryPHOTOGRAPHY           4.502
CategoryPRODUCTIVITY          1.906
CategorySHOPPING              3.957
CategorySOCIAL                1.307
CategorySPORTS                1.681
CategoryTOOLS                 0.940
CategoryTRAVEL_AND_LOCAL      1.699
CategoryVIDEO_PLAYERS         2.325
CategoryWEATHER               3.280
TypePaid                    -13.067
Content.RatingEveryone       -0.059
Content.RatingMature         -0.058
Content.RatingTeen           -0.058
Content.RatingUnrated        -0.082
                                        Pr(>|z|)    
(Intercept)                             0.955735    
Size                                    0.397869    
Current.Ver                             0.264580    
CategoryAUTO_AND_VEHICLES               0.708328    
CategoryBEAUTY                          0.880007    
CategoryBOOKS_AND_REFERENCE             0.609383    
CategoryBUSINESS                        0.308018    
CategoryCOMICS                          0.348234    
CategoryCOMMUNICATION                   0.017724 *  
CategoryDATING                          0.945234    
CategoryEDUCATION                       0.000195 ***
CategoryENTERTAINMENT                0.000004640 ***
CategoryEVENTS                          0.042475 *  
CategoryFAMILY                          0.112771    
CategoryFINANCE                         0.926090    
CategoryFOOD_AND_DRINK                  0.024024 *  
CategoryGAME                         0.000000473 ***
CategoryHEALTH_AND_FITNESS              0.043328 *  
CategoryHOUSE_AND_HOME                  0.006436 ** 
CategoryLIBRARIES_AND_DEMO              0.624880    
CategoryLIFESTYLE                       0.992744    
CategoryMAPS_AND_NAVIGATION             0.227070    
CategoryMEDICAL                         0.002022 ** 
CategoryNEWS_AND_MAGAZINES              0.356719    
CategoryPARENTING                       0.754012    
CategoryPERSONALIZATION                 0.015199 *  
CategoryPHOTOGRAPHY                  0.000006728 ***
CategoryPRODUCTIVITY                    0.056704 .  
CategorySHOPPING                     0.000075801 ***
CategorySOCIAL                          0.191282    
CategorySPORTS                          0.092725 .  
CategoryTOOLS                           0.347302    
CategoryTRAVEL_AND_LOCAL                0.089246 .  
CategoryVIDEO_PLAYERS                   0.020062 *  
CategoryWEATHER                         0.001040 ** 
TypePaid                    < 0.0000000000000002 ***
Content.RatingEveryone                  0.952575    
Content.RatingMature                    0.953809    
Content.RatingTeen                      0.953525    
Content.RatingUnrated                   0.934682    
---
Signif. codes:  
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 7960.4  on 5790  degrees of freedom
Residual deviance: 6978.9  on 5751  degrees of freedom
AIC: 7058.9

Number of Fisher Scoring iterations: 10
#confint(logit.reg) to check for confidence intervals

test$predraw <- predict(logit.reg, newdata = test %>% select(-"Result"), type = "response")

par <- data.frame("Cut off" = 1, "Accuracy" = 1, "Sensitivity" = 1)
i = 0
j = 1
for(i in seq(0,1,0.001)){
  test <- test %>% mutate(pred = factor(as.character(ifelse(predraw < i, "Flop", "Hit")), levels = c("Hit", "Flop")))
 par[j,2] <- 1-mean(test$Result != test$pred)
 par[j,1] <- i
 test1 <- test %>% filter(Result == "Hit")
  par[j,3] <- mean(test1$Result == test1$pred)
j <- j+1
i <- i+1
rm(test1)
}
rm(i)
rm(j)

kable(par %>% filter(Accuracy == max(Accuracy)) %>% head(5), caption = "The cut off for the best accuracy")

Cut.off Accuracy Sensitivity
0.443 0.6862474 0.696124


par %>%ggplot(aes(x = Cut.off, y = par$Accuracy))+geom_point(size = 0.05, color = "gray")+geom_line()+geom_hline(yintercept = max(par$Accuracy), color = "red")+geom_vline(xintercept = 0.443, color = "red")+
  labs(x = "Cut Off value", y = "Accuracy of prediction", title = "Cut off Vs Accuracy")

rm(par)  

test <- test %>% mutate(pred = as.factor(ifelse(predraw <= 0.443, "Flop", "Hit")))
test$pred <- factor(test$pred, levels = c("Hit", "Flop"))
test$Result <- factor(test$Result, levels = c("Hit", "Flop"))
cm <- confusionMatrix(test$pred, test$Result)
cm
Confusion Matrix and Statistics

          Reference
Prediction Hit Flop
      Hit  449  258
      Flop 196  544
                                               
               Accuracy : 0.6862               
                 95% CI : (0.6616, 0.7101)     
    No Information Rate : 0.5543               
    P-Value [Acc > NIR] : < 0.00000000000000022
                                               
                  Kappa : 0.3709               
                                               
 Mcnemar's Test P-Value : 0.004198             
                                               
            Sensitivity : 0.6961               
            Specificity : 0.6783               
         Pos Pred Value : 0.6351               
         Neg Pred Value : 0.7351               
             Prevalence : 0.4457               
         Detection Rate : 0.3103               
   Detection Prevalence : 0.4886               
      Balanced Accuracy : 0.6872               
                                               
       'Positive' Class : Hit                  
                                               
logit.u <- data.frame("Model" = "Logistic Regression (U)", "Accuracy" = cm[["overall"]][["Accuracy"]], "Sensitivity" = cm[["byClass"]][["Sensitivity"]], "Specificity" = cm[["byClass"]][["Specificity"]])
rm(cm)

Machine Learning - Modelling - Decision Trees

rm(test)
rm(train)
rm(logit.reg)
train <- training %>% select(-c("Rating", "Reviews"))
test <- testing %>% select(-c("Rating", "Reviews"))

set.seed(123)
dt <- rpart(Result ~., data = train, method = "class", cp = 0.00001, minsplit = 1, xval = 10)
dt.pruned <- prune(dt, cp = 0.004)

test$pred <- predict(dt.pruned, newdata = test[,-6], type = "class")

cm <- confusionMatrix(factor(test$pred, levels = c("Hit", "Flop")), factor(test$Result, levels = c("Hit", "Flop")))
cm
Confusion Matrix and Statistics

          Reference
Prediction Hit Flop
      Hit  468  261
      Flop 177  541
                                               
               Accuracy : 0.6973               
                 95% CI : (0.6729, 0.7209)     
    No Information Rate : 0.5543               
    P-Value [Acc > NIR] : < 0.00000000000000022
                                               
                  Kappa : 0.3951               
                                               
 Mcnemar's Test P-Value : 0.00007312           
                                               
            Sensitivity : 0.7256               
            Specificity : 0.6746               
         Pos Pred Value : 0.6420               
         Neg Pred Value : 0.7535               
             Prevalence : 0.4457               
         Detection Rate : 0.3234               
   Detection Prevalence : 0.5038               
      Balanced Accuracy : 0.7001               
                                               
       'Positive' Class : Hit                  
                                               
det.u <- data.frame("Model" = "Decision Tress (U)", "Accuracy" = cm[["overall"]][["Accuracy"]], "Sensitivity" = cm[["byClass"]][["Sensitivity"]], "Specificity" = cm[["byClass"]][["Specificity"]])
rm(cm)
rpart.plot(dt.pruned, box.palette="RdBu", shadow.col="gray", nn=TRUE)


rpart.rules(dt.pruned, style = "tallw")
Result is 0.00 when
               Category is LIBRARIES_AND_DEMO or MEDICAL
               Size >= 0.0307
               Current.Ver >= 0.000000127

Result is 0.03 when
               Category is ART_AND_DESIGN or AUTO_AND_VEHICLES or BEAUTY or BOOKS_AND_REFERENCE or BUSINESS or COMICS or COMMUNICATION or DATING or EVENTS or FAMILY or FINANCE or HEALTH_AND_FITNESS or LIBRARIES_AND_DEMO or LIFESTYLE or MAPS_AND_NAVIGATION or MEDICAL or NEWS_AND_MAGAZINES or PARENTING or PERSONALIZATION or PRODUCTIVITY or SOCIAL or SPORTS or TOOLS or TRAVEL_AND_LOCAL
               Current.Ver < 0.000000127
               Type is Paid

Result is 0.08 when
               Category is EDUCATION or ENTERTAINMENT or FOOD_AND_DRINK or GAME or HOUSE_AND_HOME or PHOTOGRAPHY or SHOPPING or VIDEO_PLAYERS or WEATHER
               Type is Paid

Result is 0.14 when
               Category is ART_AND_DESIGN or BEAUTY or BOOKS_AND_REFERENCE or COMMUNICATION or FAMILY or PERSONALIZATION or SPORTS
               Size >= 0.1269
               Current.Ver < 0.000000127
               Type is Free

Result is 0.16 when
               Category is AUTO_AND_VEHICLES or BUSINESS or DATING or FAMILY or LIFESTYLE or PARENTING or SPORTS
               Size < 0.0078
               Current.Ver >= 0.000000127

Result is 0.16 when
               Category is ART_AND_DESIGN or BEAUTY or EVENTS or FINANCE or LIBRARIES_AND_DEMO or MEDICAL
               Size < 0.0307
               Current.Ver >= 0.000000127

Result is 0.21 when
               Category is AUTO_AND_VEHICLES or BOOKS_AND_REFERENCE or BUSINESS or COMICS or COMMUNICATION or DATING or FAMILY or HEALTH_AND_FITNESS or LIFESTYLE or MAPS_AND_NAVIGATION or NEWS_AND_MAGAZINES or PARENTING or PERSONALIZATION or PRODUCTIVITY or SOCIAL or SPORTS or TOOLS or TRAVEL_AND_LOCAL
               Size >= 0.0078
               Current.Ver >= 0.000000127
               Type is Paid

Result is 0.24 when
               Category is AUTO_AND_VEHICLES or BUSINESS or COMICS or DATING or EVENTS or FINANCE or HEALTH_AND_FITNESS or LIBRARIES_AND_DEMO or LIFESTYLE or MAPS_AND_NAVIGATION or MEDICAL or NEWS_AND_MAGAZINES or PARENTING or PRODUCTIVITY or SOCIAL or TOOLS or TRAVEL_AND_LOCAL
               Size >= 0.0126
               Current.Ver < 0.000000091
               Type is Free

Result is 0.26 when
               Category is ART_AND_DESIGN or AUTO_AND_VEHICLES or BEAUTY or BOOKS_AND_REFERENCE or BUSINESS or COMICS or COMMUNICATION or DATING or EVENTS or FAMILY or FINANCE or HEALTH_AND_FITNESS or LIBRARIES_AND_DEMO or LIFESTYLE or MAPS_AND_NAVIGATION or MEDICAL or NEWS_AND_MAGAZINES or PARENTING or PERSONALIZATION or PRODUCTIVITY or SOCIAL or SPORTS or TOOLS or TRAVEL_AND_LOCAL
               Size < 0.0126
               Current.Ver < 0.000000127
               Type is Free

Result is 0.30 when
               Category is EDUCATION or ENTERTAINMENT or FOOD_AND_DRINK or GAME or HOUSE_AND_HOME or PHOTOGRAPHY or SHOPPING or VIDEO_PLAYERS or WEATHER
               Size < 0.0029
               Type is Free

Result is 0.38 when
               Category is AUTO_AND_VEHICLES or BUSINESS or DATING or LIBRARIES_AND_DEMO or LIFESTYLE or MEDICAL or TOOLS or TRAVEL_AND_LOCAL
               Size >= 0.0126
               Current.Ver is 0.000000091 to 0.000000127
               Type is Free

Result is 0.56 when
               Category is BOOKS_AND_REFERENCE or COMMUNICATION or HEALTH_AND_FITNESS or MAPS_AND_NAVIGATION or NEWS_AND_MAGAZINES or PERSONALIZATION or PRODUCTIVITY or SOCIAL or TOOLS or TRAVEL_AND_LOCAL
               Size < 0.0078
               Current.Ver >= 0.000000127

Result is 0.57 when
               Category is EDUCATION or ENTERTAINMENT or FOOD_AND_DRINK or GAME or HOUSE_AND_HOME or PHOTOGRAPHY or SHOPPING or VIDEO_PLAYERS or WEATHER
               Size is 0.0029 to 0.0089
               Type is Free

Result is 0.59 when
               Category is ART_AND_DESIGN or BEAUTY or BOOKS_AND_REFERENCE or COMMUNICATION or FAMILY or PERSONALIZATION or SPORTS
               Size is 0.0126 to 0.1269
               Current.Ver < 0.000000127
               Type is Free

Result is 0.72 when
               Category is AUTO_AND_VEHICLES or BOOKS_AND_REFERENCE or BUSINESS or COMICS or COMMUNICATION or DATING or FAMILY or HEALTH_AND_FITNESS or LIFESTYLE or MAPS_AND_NAVIGATION or NEWS_AND_MAGAZINES or PARENTING or PERSONALIZATION or PRODUCTIVITY or SOCIAL or SPORTS or TOOLS or TRAVEL_AND_LOCAL
               Size >= 0.0078
               Current.Ver >= 0.000000127
               Type is Free

Result is 0.72 when
               Category is COMICS or FINANCE or HEALTH_AND_FITNESS or MAPS_AND_NAVIGATION or NEWS_AND_MAGAZINES or PARENTING or PRODUCTIVITY or SOCIAL
               Size >= 0.0126
               Current.Ver is 0.000000091 to 0.000000127
               Type is Free

Result is 0.78 when
               Category is EDUCATION or ENTERTAINMENT or FOOD_AND_DRINK or GAME or HOUSE_AND_HOME or PHOTOGRAPHY or SHOPPING or VIDEO_PLAYERS or WEATHER
               Size >= 0.0089
               Type is Free

Result is 0.87 when
               Category is BEAUTY or FINANCE
               Size >= 0.0307
               Current.Ver >= 0.000000127

Machine Learning - Modelling - Random Forest

rm(train)
rm(test)
rm(dt)
rm(dt.pruned)
train <- training %>% select(-c("Rating", "Reviews"))
test <- testing %>% select(-c("Rating", "Reviews"))
rf <- randomForest(Result ~., data = train)
#plot(rf, main = "Error rate of random forest")
varImpPlot(rf, main = "Gini Plot - Importance of Variables")

test$pred = predict(rf, newdata = test[-6])
cm <- confusionMatrix(factor(test$pred, levels = c("Hit", "Flop")), factor(test$Result, levels = c("Hit", "Flop")))
cm
Confusion Matrix and Statistics

          Reference
Prediction Hit Flop
      Hit  440  180
      Flop 205  622
                                             
               Accuracy : 0.7339             
                 95% CI : (0.7104, 0.7566)   
    No Information Rate : 0.5543             
    P-Value [Acc > NIR] : <0.0000000000000002
                                             
                  Kappa : 0.4595             
                                             
 Mcnemar's Test P-Value : 0.2213             
                                             
            Sensitivity : 0.6822             
            Specificity : 0.7756             
         Pos Pred Value : 0.7097             
         Neg Pred Value : 0.7521             
             Prevalence : 0.4457             
         Detection Rate : 0.3041             
   Detection Prevalence : 0.4285             
      Balanced Accuracy : 0.7289             
                                             
       'Positive' Class : Hit                
                                             
raf.u <- data.frame("Model" = "Random Forest (U)", "Accuracy" = cm[["overall"]][["Accuracy"]], "Sensitivity" = cm[["byClass"]][["Sensitivity"]], "Specificity" = cm[["byClass"]][["Specificity"]])
rm(cm)

Machine Learning - Modelling - Naive Bayes Classifier

rm(test)
rm(train)
rm(rf)
train <- training %>% select(-c("Rating", "Reviews"))
train$Result <- factor(train$Result, levels = c("Hit", "Flop"))
test <- testing %>% select(-c("Rating", "Reviews"))
test$Result <- factor(test$Result, levels = c("Hit", "Flop"))

delays.nb <- naiveBayes(Result~., data = train)
delays.nb

Naive Bayes Classifier for Discrete Predictors

Call:
naiveBayes.default(x = X, y = Y, laplace = laplace)

A-priori probabilities:
Y
     Hit     Flop 
0.446037 0.553963 

Conditional probabilities:
      Size
Y            [,1]       [,2]
  Hit  0.03556935 0.06165665
  Flop 0.03563030 0.10756489

      Current.Ver
Y              [,1]       [,2]
  Hit  0.0009565182 0.02834286
  Flop 0.0017979409 0.03567101

      Category
Y      ART_AND_DESIGN AUTO_AND_VEHICLES       BEAUTY
  Hit    0.0050329075      0.0061943477 0.0034843206
  Flop   0.0093516209      0.0127805486 0.0065461347
      Category
Y      BOOKS_AND_REFERENCE     BUSINESS       COMICS
  Hit         0.0154858691 0.0158730159 0.0034843206
  Flop        0.0233790524 0.0417705736 0.0084164589
      Category
Y      COMMUNICATION       DATING    EDUCATION
  Hit   0.0313588850 0.0174216028 0.0174216028
  Flop  0.0249376559 0.0224438903 0.0068578554
      Category
Y      ENTERTAINMENT       EVENTS       FAMILY
  Hit   0.0178087495 0.0007742935 0.2040263260
  Flop  0.0028054863 0.0068578554 0.2319201995
      Category
Y           FINANCE FOOD_AND_DRINK         GAME
  Hit  0.0240030972   0.0147115757 0.1966705381
  Flop 0.0464463840   0.0102867830 0.0692019950
      Category
Y      HEALTH_AND_FITNESS HOUSE_AND_HOME
  Hit        0.0282617112   0.0096786682
  Flop       0.0246259352   0.0046758105
      Category
Y      LIBRARIES_AND_DEMO    LIFESTYLE
  Hit        0.0054200542 0.0247773906
  Flop       0.0112219451 0.0458229426
      Category
Y      MAPS_AND_NAVIGATION      MEDICAL
  Hit         0.0116144019 0.0081300813
  Flop        0.0130922693 0.0620324190
      Category
Y      NEWS_AND_MAGAZINES    PARENTING
  Hit        0.0197444832 0.0038714673
  Flop       0.0240024938 0.0059226933
      Category
Y      PERSONALIZATION  PHOTOGRAPHY PRODUCTIVITY
  Hit     0.0356174990 0.0460704607 0.0321331785
  Flop    0.0386533666 0.0174563591 0.0308603491
      Category
Y          SHOPPING       SOCIAL       SPORTS
  Hit  0.0329074719 0.0228416570 0.0301974448
  Flop 0.0124688279 0.0208852868 0.0317955112
      Category
Y             TOOLS TRAVEL_AND_LOCAL VIDEO_PLAYERS
  Hit  0.0662020906     0.0205187766  0.0185830430
  Flop 0.0950748130     0.0202618454  0.0124688279
      Category
Y           WEATHER
  Hit  0.0096786682
  Flop 0.0046758105

      Type
Y                0        Free         NaN
  Hit  0.000000000 0.992257065 0.000000000
  Flop 0.000000000 0.875311721 0.000000000
      Type
Y             Paid
  Hit  0.007742935
  Flop 0.124688279

      Content.Rating
Y            Adults     Everyone       Mature
  Hit  0.0003871467 0.8033294619 0.0514905149
  Flop 0.0000000000 0.8753117207 0.0405236908
      Content.Rating
Y              Teen      Unrated
  Hit  0.1447928765 0.0000000000
  Flop 0.0838528678 0.0003117207
pred.prob <- as.data.frame(predict(delays.nb, newdata = test, type = "raw"))
pred.prob <-  round(pred.prob, 4)
pred.prob <- pred.prob %>% select("Hit")
names(pred.prob) <-  "predraw"
test <- cbind(test, pred.prob)
rm(pred.prob)

par <- data.frame("Cut off" = 1, "Accuracy" = 1, "Sensitivity" = 1)
i = 0
j = 1
for(i in seq(0,1,0.001)){
  test <- test %>% mutate(pred = factor(as.character(ifelse(predraw < i, "Flop", "Hit")), levels = c("Hit", "Flop")))
 par[j,2] <- 1-mean(test$Result != test$pred)
 test1 <- test %>% filter(Result == "Hit")
 par[j,3] <- mean(test1$Result == test1$pred)
 par[j,1] <- i
j <- j+1
i <- i+1
rm(test1)
}
rm(i)
rm(j)

kable(par %>% filter(Accuracy == max(Accuracy)) %>% head(5), caption = "The (Hit) probability cut off for the best accuracy")

Cut.off Accuracy Sensitivity
0.615 0.6848652 0.6852713
0.616 0.6848652 0.6775194


par %>% ggplot(aes(x = `Cut.off`, y = par$Accuracy))+geom_point(size = 0.05, color = "gray")+geom_line()+geom_hline(yintercept = max(par$Accuracy), color = "red")+geom_vline(xintercept = 0.615, color = "red")+
  labs(x = "Cut Off value", y = "Accuracy of prediction", title = "Cut off Vs Accuracy")


rm(par)

test <- test %>% mutate(pred =  factor(as.character(ifelse(predraw < 0.615, "Flop", "Hit")), levels = c("Hit", "Flop")))

cm <- confusionMatrix(factor(test$pred, levels = c("Hit", "Flop")), factor(test$Result, levels = c("Hit", "Flop")))
cm
Confusion Matrix and Statistics

          Reference
Prediction Hit Flop
      Hit  442  253
      Flop 203  549
                                              
               Accuracy : 0.6849              
                 95% CI : (0.6602, 0.7088)    
    No Information Rate : 0.5543              
    P-Value [Acc > NIR] : < 0.0000000000000002
                                              
                  Kappa : 0.367               
                                              
 Mcnemar's Test P-Value : 0.02175             
                                              
            Sensitivity : 0.6853              
            Specificity : 0.6845              
         Pos Pred Value : 0.6360              
         Neg Pred Value : 0.7301              
             Prevalence : 0.4457              
         Detection Rate : 0.3055              
   Detection Prevalence : 0.4803              
      Balanced Accuracy : 0.6849              
                                              
       'Positive' Class : Hit                 
                                              
nab.u <- data.frame("Model" = "Naive Bayes (U)", "Accuracy" = cm[["overall"]][["Accuracy"]], "Sensitivity" = cm[["byClass"]][["Sensitivity"]], "Specificity" = cm[["byClass"]][["Specificity"]])
rm(cm)

Machine Learning - Modelling - SVM

rm(test)
rm(train)
rm(delays.nb)

train <- training %>% select(-c("Rating", "Reviews"))
train$Result <- factor(train$Result, levels = c("Hit", "Flop"))
test <- testing %>% select(-c("Rating", "Reviews"))
test$Result <- factor(test$Result, levels = c("Hit", "Flop"))

set.seed(123)
svm.fit <- svm(Result~., data=train, kernel = "linear")

test$pred <- predict(svm.fit, newdata = test[,-6], type = "class")
cm <- confusionMatrix(factor(test$pred, levels = c("Hit", "Flop")), factor(test$Result, levels = c("Hit", "Flop")))
cm
Confusion Matrix and Statistics

          Reference
Prediction Hit Flop
      Hit  277  138
      Flop 368  664
                                               
               Accuracy : 0.6503               
                 95% CI : (0.6251, 0.6749)     
    No Information Rate : 0.5543               
    P-Value [Acc > NIR] : 0.0000000000000681   
                                               
                  Kappa : 0.2667               
                                               
 Mcnemar's Test P-Value : < 0.00000000000000022
                                               
            Sensitivity : 0.4295               
            Specificity : 0.8279               
         Pos Pred Value : 0.6675               
         Neg Pred Value : 0.6434               
             Prevalence : 0.4457               
         Detection Rate : 0.1914               
   Detection Prevalence : 0.2868               
      Balanced Accuracy : 0.6287               
                                               
       'Positive' Class : Hit                  
                                               
linear <- data.frame("MOdel" = "SVM - Linear (U)", "Accuracy" = cm[["overall"]][["Accuracy"]], "Sensitivity" = cm[["byClass"]][["Sensitivity"]], "Specificity" = cm[["byClass"]][["Specificity"]])

rm(test)
rm(train)
rm(svm.fit)
rm(cm)

train <- training %>% select(-c("Rating", "Reviews"))
train$Result <- factor(train$Result, levels = c("Hit", "Flop"))
test <- testing %>% select(-c("Rating", "Reviews"))
test$Result <- factor(test$Result, levels = c("Hit", "Flop"))

set.seed(123)
svm.fit <- svm(Result~., data=train, kernel = "radial")

test$pred <- predict(svm.fit, newdata = test[,-6], type = "class")
cm <- confusionMatrix(factor(test$pred, levels = c("Hit", "Flop")), factor(test$Result, levels = c("Hit", "Flop")))
cm
Confusion Matrix and Statistics

          Reference
Prediction Hit Flop
      Hit  294  130
      Flop 351  672
                                               
               Accuracy : 0.6676               
                 95% CI : (0.6426, 0.6918)     
    No Information Rate : 0.5543               
    P-Value [Acc > NIR] : < 0.00000000000000022
                                               
                  Kappa : 0.3039               
                                               
 Mcnemar's Test P-Value : < 0.00000000000000022
                                               
            Sensitivity : 0.4558               
            Specificity : 0.8379               
         Pos Pred Value : 0.6934               
         Neg Pred Value : 0.6569               
             Prevalence : 0.4457               
         Detection Rate : 0.2032               
   Detection Prevalence : 0.2930               
      Balanced Accuracy : 0.6469               
                                               
       'Positive' Class : Hit                  
                                               
radial <- data.frame("MOdel" = "SVM - Radial (U)", "Accuracy" = cm[["overall"]][["Accuracy"]], "Sensitivity" = cm[["byClass"]][["Sensitivity"]], "Specificity" = cm[["byClass"]][["Specificity"]])

rm(test)
rm(train)
rm(svm.fit)
rm(cm)

train <- training %>% select(-c("Rating", "Reviews"))
train$Result <- factor(train$Result, levels = c("Hit", "Flop"))
test <- testing %>% select(-c("Rating", "Reviews"))
test$Result <- factor(test$Result, levels = c("Hit", "Flop"))

set.seed(123)
svm.fit <- svm(Result~., data=train, kernel = "polynomial")

test$pred <- predict(svm.fit, newdata = test[,-6], type = "class")
cm <- confusionMatrix(factor(test$pred, levels = c("Hit", "Flop")), factor(test$Result, levels = c("Hit", "Flop")))
cm
Confusion Matrix and Statistics

          Reference
Prediction Hit Flop
      Hit    0    0
      Flop 645  802
                                             
               Accuracy : 0.5543             
                 95% CI : (0.5282, 0.5801)   
    No Information Rate : 0.5543             
    P-Value [Acc > NIR] : 0.5109             
                                             
                  Kappa : 0                  
                                             
 Mcnemar's Test P-Value : <0.0000000000000002
                                             
            Sensitivity : 0.0000             
            Specificity : 1.0000             
         Pos Pred Value :    NaN             
         Neg Pred Value : 0.5543             
             Prevalence : 0.4457             
         Detection Rate : 0.0000             
   Detection Prevalence : 0.0000             
      Balanced Accuracy : 0.5000             
                                             
       'Positive' Class : Hit                
                                             
polynomial <- data.frame("MOdel" = "SVM - Polynomial (U)", "Accuracy" = cm[["overall"]][["Accuracy"]], "Sensitivity" = cm[["byClass"]][["Sensitivity"]], "Specificity" = cm[["byClass"]][["Specificity"]])

rm(cm)
rm(test)
rm(train)
rm(svm.fit)

svm.u <- rbind(linear, radial, polynomial)
rm(linear)
rm(radial)
rm(polynomial)
names(svm.u) <- c("Model", "Accuracy", "Sensitivity", "Specificity")

Final model Summary

model.ds <- rbind(logit, det, raf, nab, svm)
rm(logit)
rm(det)
rm(raf)
rm(nab)
rm(svm)
model.realtime <- rbind(logit.u, det.u, raf.u, nab.u, svm.u)
rm(logit.u)
rm(det.u)
rm(raf.u)
rm(nab.u)
rm(svm.u)

kable(rbind(model.ds, model.realtime), caption = "Final model results for complete and real time data")

Model Accuracy Sensitivity Specificity
Logistic Regression 0.9474775 0.9162791 0.9725686
Decision Trees 0.9447132 0.9193798 0.9650873
Random Forest 0.9474775 0.9503876 0.9451372
Naive Bayes 0.9246717 0.8573643 0.9788030
SVM - Linear 0.7864547 0.5829457 0.9501247
SVM - Radial 0.7270214 0.5317829 0.8840399
SVM - Polynomial 0.5694540 0.0341085 1.0000000
Logistic Regression (U) 0.6862474 0.6961240 0.6783042
Decision Tress (U) 0.6973048 0.7255814 0.6745636
Random Forest (U) 0.7339323 0.6821705 0.7755611
Naive Bayes (U) 0.6848652 0.6852713 0.6845387
SVM - Linear (U) 0.6503110 0.4294574 0.8279302
SVM - Radial (U) 0.6675881 0.4558140 0.8379052
SVM - Polynomial (U) 0.5542502 0.0000000 1.0000000


kable((model.ds %>% arrange(desc(Accuracy)) %>% head(2)), caption = "Top 2 models for complete data")

Model Accuracy Sensitivity Specificity
Logistic Regression 0.9474775 0.9162791 0.9725686
Random Forest 0.9474775 0.9503876 0.9451372


kable((model.realtime%>% arrange(desc(Accuracy)) %>% head(2)), caption = "Top 2 models for data that can be extracted in real time")
Model Accuracy Sensitivity Specificity
Random Forest (U) 0.7339323 0.6821705 0.7755611
Decision Tress (U) 0.6973048 0.7255814 0.6745636

Conclusion:

It is clear that the size of the app and category define if the app would be a hit or not.
From the EDA it is clear that the app must be below 25 MB and the categories to target would vary depending on the risk one wants to take.
While Game, Family, and Communication give a constant reward, the highest can be gained from Categories that are yet to be explored like Comics (Which are positively seen but are low in market) or Events (Which have a huge criticism and are low in market).

Sources:
https://www.kaggle.com/lava18/google-play-store-apps
http://r-statistics.co/
https://uc-r.github.io/
https://www.r-bloggers.com/
https://medium.com/analytics-vidhya/a-guide-to-machine-learning-in-r-for-beginners-part-5-4c00f2366b90
https://en.proft.me/2016/11/9/classification-using-decis ion-trees-r/
https://www.gormanalysis.com/blog/decision-trees-in-r-us ing-rpart/
http://www.sthda.com/english/articles/35-statistical-machine-learning-essentials/141-cart-model-decision-tree-essentials/
https://dataaspirant.com/2017/02/03/decision-tree-classifier-implementation-in-r/
https://www.blopig.com/blog/2017/04/a-very-basic-introduction-to-random-forests-using-r/
https://www.rdocumentation.org/
https://www.datacamp.com/

Introduction to Data Mining
Book by Michael Steinbach, Pang-Ning Tan, and Vipin Kumar

Data Mining for Business Analytics: Concepts, Techniques, and Applications in R
Book by Galit Shmueli , Peter C. Bruce, Inbal Yahav, Nitin R. Patel, Kenneth C. Lichtendahl Jr. 

---
title: "Sushanth Chintalapati Advanced BA with R project"
output: html_notebook
editor_options: 
  chunk_output_type: inline
---

```{r packages installation, include=FALSE, message=F, warning=F}
if(!require("pacman")) install.packages("pacman")
pacman::p_load(tidyverse, dplyr, ggplot2, ggExtra, knitr, viridis, corrplot, wordcloud2, tm, textcat, rpart, rpart.plot, randomForest, e1071, caret)
```

```{r Loading Libraries, include=FALSE, message=F, warning=F}
library("tidyverse")
library("dplyr")
library("ggplot2")
library("ggExtra")
library("knitr")
library("viridis")
library("corrplot")
library("e1071")
library("caret")
library("wordcloud2")
library("tm")
library("textcat")
library("rpart")
library("rpart.plot")
library("randomForest")
options(scipen = 999)
```

Loading in the data
```{r Loading Data, message=F, warning=F}
gps <- read.csv("googleplaystore.csv")
```

Data Cleaning

The following needs to be removed/ cleaned:  
1) Record of an App by name Life made wi-fi....  
2) The Current Version of the App and Android Version   have to be processed to display the current sequence (12.1.2 is 12)  
3) The Special characters in Installs(+), Price($), Size(MB) should be removed  
4) All the redundant records should be removed  
5) The levels in Installs should be fixed to a Descending order  
6) Geners should be removed, it is the same as Category and makes it redundant  
7) All the other Nas should be removed  
8) Fixing Content Rating by combining like levels  
```{r Cleaning Data, warning=FALSE, message=F, warning=F}
gps <- gps[gps$App !=	"Life Made WI-Fi Touchscreen Photo Frame",] %>% na.omit() %>% unique() %>% select(-Genres)
gps$Current.Ver <- gps$Current.Ver %>% as.character()
i <- 1
for(i in c(seq(1, 8892, by = 1))){
  a <- gps$Current.Ver[i] 
  b <- b <- unlist(strsplit(a, "[.]"))
  b <- b[1]
  gps$Current.Ver[i] = b
  rm(b)
  rm(a)
  i <- i + 1
}
rm(i)
gps$Android.Ver <- gps$Android.Ver %>% as.character()
i <- 1
for(i in c(seq(1, 8892, by = 1))){
  a <- gps$Android.Ver[i]
  b <- b <- unlist(strsplit(a, "[.]"))
  b <- b[1]
  gps$Android.Ver[i] <- b
  rm(b)
  rm(a)
  i <- i + 1
}
rm(i)

gps$Category <- gps$Category %>% droplevels()
gps$Installs = gsub("[^0-9.]", '', (gps$Installs %>% as.character()))
    gps$Installs = as.numeric(gps$Installs)
gps$Size = gsub("[^0-9.]", '', (gps$Size %>% as.character()))
    gps$Size = as.numeric(gps$Size)
gps$Price = gsub("[^0-9.]", '', (gps$Price %>% as.character()))
    gps$Price = as.numeric(gps$Price)    

gps$Installs <- factor(gps$Installs, levels = c("1000000000", "500000000", "100000000", "50000000", "10000000", "5000000", "1000000", "500000", "100000", "50000", "10000", "5000", "1000", "500", "100", "50", "10", "5", "1"))

gps$Content.Rating <- gps$Content.Rating %>% as.character() %>% word(1) %>% as.factor() %>% droplevels()

```


Different Categories and their presence in the market 

Of the different kinds of people present in this word, Which category is globally accepted?  
What are the top categories that perform the best?  

```{r EDA - Best Categories as per Rating, message=F, warning=F}
gps_cb <- gps %>% na.omit() %>% group_by(Category) %>% summarize(Sum = sum(Rating), count = n(), `Normalized Rating` = sum(Rating)/n()) %>% arrange(desc(`Normalized Rating`))
kable(head(gps_cb, 7), caption = "Top 7 categories based on Normalized Rating")
kable(tail(gps_cb, 7), caption = "Bottom 7 categories based on Normalized Rating")
  
gps %>% filter(Category %in% (gps_cb$Category %>% head(15))) %>% ggplot(aes(x = Category, y = Rating, fill = Category)) + geom_violin(color = "black") + theme(legend.position = "none") + geom_hline(size = 0.6, color = "red", linetype = 'dashed', alpha = 0.5, yintercept = mean(gps$Rating %>% na.omit()))+coord_flip()+labs(y = "Rating", x = "Category", title = "Category Vs Rating Violin Plot")
rm(gps_cb)
```

Events, Education and Personalization are the best in overall rating.  
While Dating, Tools, Maps and Navigation are the worst rated app categories.   

Subtle and Significant Category Forensics

Which category doesn't perform well on a overall scale?  
What are the total number of installs for different categories?  

```{r EDA - Category Forensics, message=F, warning=F}
data <- gps %>% group_by(Category) %>% summarise(Count = n(), `Total installs` = sum(as.numeric(as.character(Installs))), `Normalized Installations` = `Total installs`/Count) 

data %>% 
  ggplot(aes(x = reorder(Category, Count), fill = Category, y = Count))+geom_bar(stat = "identity", color = "black")+theme(legend.position = "none")+coord_flip()+
  labs(y = "Number of Apps present", x = "Category", title = "Number of Apps present in Different Category")

data %>% 
  ggplot(aes(x = reorder(Category, `Total installs`), fill = Category, y = `Total installs`))+geom_bar(stat = "identity", color = "black")+theme(legend.position = "none")+coord_flip()+
  labs(y = "Total number of installs", x = "Category", title = "Total Number of Installations per Category")

data %>% 
  ggplot(aes(x = reorder(Category, `Normalized Installations`), fill = Category, y = `Normalized Installations`))+geom_bar(stat = "identity", color = "black")+theme(legend.position = "none")+coord_flip()+
  labs(y = "Total number of normalized installations", x = "Category", title = "Total Number of Normalized Installations per Category")

info <- cbind((data %>% arrange(desc(data$Count)) %>% select(Category)),
(data %>% arrange(desc(data$`Total installs`)) %>% select(Category)),
(data %>% arrange(desc(data$`Normalized Installations`)) %>% select(Category)))
names(info) <- c("By Count", "By Total Installs", "By Total Normalized Installs")
kable(head(info, 7), caption = "Top 7 categories")
kable(tail(info, 7), caption = "Bottom 7 categories")
rm(info)
```
Family has the highest competition (number of apps) followed by tools which don't seem to be rewarding.   
Game, and Productivity do well wrt to number of installs even though they have a lot of competition.  
Communication and Social are two domains where there is a surge in downloads and the number of competitors are also less.  

Comments, Parenting, and Events can be seen as domains that one must ignore as there are very few downloads and so is the competition. Or they can also be seen as domains that have huge potential and are underdog markets that can be exploited with a disruptive idea.  
 


Content Type 

What is the distribution of apps of various content?  
Is there an effect of content type of installs?  
How are they priced?  
```{r EDA -  Content Type and Count, message=F, warning=F}
gps %>% group_by(Content.Rating) %>% summarise(count = n()) %>%
  ggplot(aes(x = reorder(Content.Rating, -count), y = count, fill = Content.Rating))+ geom_bar(stat = "identity", color = "black")+
 scale_fill_viridis(discrete = TRUE) +
  labs(y = "Count", x = "Content Type", title = "Content type and their presence")
```
Apps for everyone are present in bulk  

Installs vs Rating per content type
```{r EDA -  Content type- Installs and Rating, message=F, warning=F}
gps %>% 
  ggplot(aes(x = Rating, y = as.integer(as.character(gps$Installs)), color = Content.Rating))+
  geom_point()+
  facet_wrap(~Content.Rating) + geom_vline(xintercept = 5 ,colour = 'red', linetype = 'dashed') + geom_vline(xintercept = 4 ,colour = 'red', linetype = 'dashed') +  geom_hline(yintercept = 1001000000 ,colour = 'red', linetype = 'dashed') + geom_hline(yintercept = 751000000 ,colour = 'red', linetype = 'dashed')
```
Everyone and Teens content type attract the most number of installs and a decent rating.  


Rating analysis  

App ratings (on a scale of 1 to 5) impact the discoverability, and indicates the company's overall brand image. Ratings are a key performance indicator of an app.
```{r EDA -  Rating analysis - Distribution, message=F, warning=F}
gps  %>% ggplot(aes(x = Rating))+geom_histogram(binwidth = 0.1, fill = "purple")+geom_vline(color = "red", linetype = 'dashed', xintercept = mean(gps$Rating %>% na.omit()))+theme(legend.position = "none")+labs(y = "Count", x = "Rating", title = "Rating Histogram")
```
Most of the apps have an average rating above 4 due to which there is a significant skewness present in the rating part.   
Ratings are right skewed with major apps being highly rated and only few with a low rating.  

This indicates that most of the apps in the Market are sized below 25mb.  

The applications with large size can make it difficult for users to download. Higher download durations could upset the user and ignore the app.   
More importantly the memory a user has in the device is limited. This is to be tested.  
Let’s investigate the same with the data.  

The distribution of size
```{r EDA -  Size Distribution, warning = FALSE, message=F, warning=F}
gps %>% ggplot(aes(x = Size))+geom_histogram(binwidth = 1, fill = "purple")+theme(legend.position = "none")+labs(y = "Count", x = "Size (MB)", title = "Size distribution Histogram")

gps  %>% filter(Size < 120) %>% ggplot(aes(x = Size))+geom_histogram(binwidth = 1, fill = "purple")+theme(legend.position = "none")+labs(y = "Count", x = "Size (MB)", title = "Size distribution Histogram (Sizes below 120Mb)")
```
Most of the apps in the Market are sized below 25mb.  

Next is the size.  
The applications with large size can make it difficult for users to download. Higher download durations could upset the user and ignore the app.   
More importantly the memory a user has in the device is limited. This is to be tested.  

Impact of Size on Rating
```{r EDA -  Size Vs Rating, message=F, warning=F}
p <- gps %>% ggplot(aes(x = Size, y = Rating))+geom_point()+labs(y = "Rating", x = "Size (MB)", title = "Size Vs Rating") 
ggMarginal(p, type = "histogram")


p <- gps %>% filter(Size < 100) %>% ggplot(aes(x = Size, y = Rating))+geom_point()+labs(y = "Rating", x = "Size (MB)", title = "Size Vs Rating (Sizes below 100)") 
ggMarginal(p, type = "histogram")



p <- gps %>% filter(Size < 10) %>% ggplot(aes(x = Size, y = Rating))+geom_point()+labs(y = "Rating", x = "Size (MB)", title = "Size Vs Rating (Sizes below 10)")  
ggMarginal(p, type = "histogram")
rm(p)
```
Majority of apps with high rating (rating over 4) are between 1 MB to 10 MB.  

What is the variation in the number of insallations
```{r EDA -  Installs Distribution, message=F, warning=F}
gps %>% ggplot(aes(x = Installs)) +
  geom_histogram(stat = 'count', bins = 100, fill = "purple", color = 'black')+
coord_flip() +  scale_y_continuous(breaks = seq(0, 1500, 100))+labs(y = "Count", x = "Installs Range", title = "Installs distribution Histogram")
```
The Installs are perfectly distributed as expected. The high end market is very competitive and very few apps fall into that space, while most of the others fall behind the race getting in mediocre downloads. But the interesting part is that there are a lot of apps that fall into 1M downloads point. This is interesting as this gives a hope to a lot of designers as there is scope.  


```{r EDA -  Installs Market Segmentation, message=F, warning=F}
data <- gps %>% group_by(Installs) %>% summarise(Count = n()) %>% arrange(desc(Installs)) 
data <- data %>% mutate(product = as.numeric(as.character(data$Installs))*data$Count, `market share` = product/146625951338*100, `market share per app in the segment` = `market share`/Count)
names(data) <- c("Installs Range", "Total Number of apps present", "Total Installations","Market share of the segment", "Market share per app in that segment")

kable(data, caption = "Installations and their contribution")
```
The top five segments make up 96.38% of the market share.
This implies that an app present in the top-most installion segment make up close 0.7% of market.  

Impact of Price
Does Price have anything to do with the number of installs?  
```{r EDA -  Price Impact, message=F, warning=F}
gps %>% ggplot(aes(x = Type, y = as.numeric(as.character(gps$Installs)), fill = Type))+geom_boxplot()+scale_fill_manual(values=c('#E69F00', '#56B4E9'))+labs(y = "Installs", x = "Free or Paid", title = "Influence of type of availability on Installations")

#This is because of the outliers present at a very high value. This can be fixed by using the log transformation.

gps  %>% 
  ggplot(aes(x = Type, y = log(as.numeric(as.character(gps$Installs))), fill = Type))+geom_boxplot()+scale_fill_manual(values=c('#E69F00', '#56B4E9'))+labs(y = "Installs after log transformation", x = "Free or Paid", title = "Influence of type of availability on Installations")
           
```
As expected, free apps have higher installations than paid.  


Price Vs Category
```{r EDA -  Category Vs Price, message=F, warning=F}
gps %>% 
  ggplot(aes(x = Price, y = Category, color = Category))+
  geom_jitter() + theme(legend.position = "none")+labs(y = "Category", x = "Price", title = "Price Vs Category")

gps %>% select(c("App", "Price")) %>% arrange(desc(Price)) %>% head(20)
```
The top 15 most expensive apps appear to be junk.

Ignoring them and proceeding with the analysis gives
```{r EDA -  Category Vs Price after removing junk, message=F, warning=F} 
gps %>% filter(Price < 80) %>%
  ggplot(aes(x = Price, y = Category, color = Category))+
  geom_jitter() + theme(legend.position = "none")+labs(y = "Category", x = "Price", title = "Price Vs Category after cleaning junk")
```
The most expensive apps belong to Medical and Lifestyle followed by Family.  

TEXT MINING - Sentimental Analysis

Word Cloud Function
```{r SA - Word Cloud Function, message=F, warning=F}

  wordcloud_c <- function(data, num_words = 100, background = "white") {
  
  # If text is provided, convert it to a dataframe of word frequencies
  if (is.character(data)) {
    corpus <- Corpus(VectorSource(data))
    corpus <- tm_map(corpus, tolower)
    corpus <- tm_map(corpus, removePunctuation)
    corpus <- tm_map(corpus, removeNumbers)
    corpus <- tm_map(corpus, removeWords, stopwords("english"))
    tdm <- as.matrix(TermDocumentMatrix(corpus))
    data <- sort(rowSums(tdm), decreasing = TRUE)
    data <- data.frame(word = names(data), freq = as.numeric(data))
  }
  
  # Make sure a proper num_words is provided
  if (!is.numeric(num_words) || num_words < 3) {
    num_words <- 3
  }  
  
  # Grab the top n most common words
  data <- head(data, n = num_words)
  if (nrow(data) == 0) {
    return(NULL)
  }
  
  wordcloud2(data, backgroundColor = background)
}

```

SA - Review Analysis

The reviews had some special characters, which had to be removed.  
This data is grouped with the data used for EDA and all the irrelevant attributes are dropped.   
The Sentiment polarity and Sentiment Subjectivity were then grouped with Category to see how they impact it.

```{r SA - RA pre-processing, message=F, warning=F}
gspr <- read.csv("googleplaystore_user_reviews.csv") %>% na.omit()
gspr$App <- as.character(gspr$App)
gspr$Translated_Review <- as.character(gspr$Translated_Review)
data <- gps %>% select(-c("Reviews", "Price", "Last.Updated", "Android.Ver")) %>% as.data.frame()
data$App <- as.character(data$App)
data <- data %>% inner_join(gspr, by = "App")
#gspr$Languages <- textcat(gspr$Translated_Review)
#gspr <- gspr %>% filter(Languages == "english")
#wordcloud_c(gspr[,2], 5)
#get_sentiments(gspr[,2])
#All the sentiment extractions has already been done in the data so proceding to crunch in that info

data$Sentiment_Polarity <- round(data$Sentiment_Polarity, 2)
data$Sentiment_Subjectivity <- round(data$Sentiment_Subjectivity, 2)
```

Word Clouds for Positive, Neutral and Negative Reviews
```{r SA - Word Clouds, message=F, warning=F}
data_po <- data %>% filter(Sentiment_Polarity == 1) %>% select("Translated_Review") 
data_nu <- data %>% filter(Sentiment_Polarity == 0) %>% select("Translated_Review")
data_ne <- data %>% filter(Sentiment_Polarity == -1) %>% select("Translated_Review")
wordcloud_c(data_po[,1], 100)
wordcloud_c(data_nu[,1], 100)
wordcloud_c(data_ne[,1], 100)

rm(data_ne)
rm(data_nu)
rm(data_po)
```

SA - Text Forensics
```{r SA - Category Vs Reviews, message=F, warning=F}
das <- data %>% select(-"Translated_Review")
das <- das %>% group_by(Category, Sentiment) %>% summarise(Count = n(), NP = round(sum(Sentiment_Polarity)/n(),2))
das <- das %>% mutate(s = Sentiment)
das <- das %>% spread(Sentiment, NP)
names(das) <-  c("Category", "Count", "s", "Ne_S", "Nu_S", "Po_S")
das <- das %>% spread(s, Count)
names(das) <-  c("Category", "Ne_S", "Nu_S", "Po_S", "Ne_C", "Nu_C", "Po_C")
das[is.na(das)] <- 0
das <- das %>% group_by(Category) %>% summarise(Nu_C = sum(Nu_C), Ne_C = sum(Ne_C), Po_C = sum(Po_C), Nu_S = sum(Nu_S), Ne_S = sum(Ne_S), Po_S = sum(Po_S))


das %>% ggplot(aes(y = Po_S, x = reorder(Category, Po_S), fill = Category))+geom_bar(stat = "identity")+theme(legend.position = "none")+coord_flip()+
  labs(y = "Sentiment Polarity", x = "Category", title = "Strength of Positive Sentiment Vs Category")

das %>% ggplot(aes(y = Po_C, x = reorder(Category, Po_C), fill = Category))+geom_bar(stat = "identity")+theme(legend.position = "none")+coord_flip()+
  labs(y = "Number of Positive Reviews", x = "Category", title = "Number of Positive Vs Category")

das %>% ggplot(aes(y = Ne_S, x = reorder(Category, -Ne_S), fill = Category))+geom_bar(stat = "identity")+theme(legend.position = "none")+coord_flip()+
  labs(y = "Sentiment Polarity", x = "Category", title = "Strength of Negative Sentiment Vs Category")

das %>% ggplot(aes(y = Ne_C, x = reorder(Category, Ne_C), fill = Category))+geom_bar(stat = "identity")+theme(legend.position = "none")+coord_flip()+
  labs(y = "Number of Negative Reviews", x = "Category", title = "Number of Negative Vs Category")


das %>% ggplot(aes(y = (Po_C-Ne_C)/(Ne_C), x = reorder(Category, ((Po_C-Ne_C)/(Ne_C))), fill = Category))+geom_bar(stat = "identity")+theme(legend.position = "none")+coord_flip()+
  labs(y = "Percentage Number of Positive Reviews wrt Negative Reviews", x = "Category", title = "Percentage of Positive Reviews more than Negative Reviews")

das %>% ggplot(aes(y = (Po_S-Ne_S), x = reorder(Category, ((Po_S-Ne_S))), fill = Category))+geom_bar(stat = "identity")+theme(legend.position = "none")+coord_flip()+
  labs(y = "Relative polarity of Positive Reviews wrt Negative Reviews", x = "Category", title = "Relative polarity Positive Reviews wrt Negative Reviews")
rm(das)
```

Even though beauty has high negative reviews it has the highest polarity difference between positive and negative among all the categories. This category can be ignored due to low number of reviews.  
Comics has the highest relative number of positive reviews compared to that of negative.   

Almost all the categories have similar intensities when it comes to positive or negative reactions.
Health and fitness received the most positive reactions compared to that of the negative ones.  
  

Content Vs Reviews
```{r SA Content, message=F, warning=F}
das <- data %>% select(-c("Translated_Review", "App"))
das %>% ggplot(aes(x = as.factor(Content.Rating), fill = Content.Rating))+geom_histogram(stat = "count")+labs(y = "Number of Reviews", x = "Content Type", fill = "Content Type", title = "Number of Reviews per content")+scale_fill_viridis(discrete = TRUE)+facet_wrap(~Sentiment)+coord_flip()

das %>% ggplot(aes(x = as.factor(Content.Rating), y = Sentiment_Polarity, color = Content.Rating))+geom_boxplot()+labs(y = "Polarity", x = "Content Type", title = "Content Type Vs the intensity of reviews")

das %>% ggplot(aes(x = as.factor(Content.Rating), fill = Sentiment))+geom_bar(stat = "count", position = "fill")+labs(y = "Number of Reviews", x = "Content Type", fill = "Content Type", title = "Number of Reviews per content")+scale_fill_viridis(discrete = TRUE)

```

As expected there Everyone has the highest number of reviews in all categories. This could be because the it has the highest number of apps in the market.  
Adult followed by Mature Apps have higher number of strong positive reviews while Teens have the higher number of strong negative reviews.   




Machine Learning - Pre processing

Here apart from general preprocessing, I am defining the success to be installations equal to or greater than 500,000 i.e. the app must take 0.00034% of the total market downloads we classify result to be a “hit”, and else it's “flop”.  
The correlation among various variable is as follows:

```{r ML - Pre processing, message=F, warning=F}
rm(das)
rm(data)
data <- gps %>% select(c("Category", "Rating", "Reviews", "Size", "Type", "Content.Rating", "Current.Ver", "Installs"))
glimpse(data)
data$Reviews <- as.integer(as.character(data$Reviews))
data$Current.Ver <- as.integer(as.character(data$Current.Ver))
data$Installs <- as.integer(as.character(data$Installs))
data <- data %>% na.omit()
data$Rating <- as.numeric(as.character(data$Rating))
glimpse(data)
cor(data %>% select(c("Rating", "Reviews", "Size", "Current.Ver", "Installs")))
corrplot(cor(data %>% select(c("Rating", "Reviews", "Size", "Current.Ver", "Installs"))), method = "color", title = "Correlation Plot", mar=c(0,0,1,0))

data <- data %>% mutate(Result = as.factor(ifelse(Installs >= 500000, "Hit", "Flop"))) %>% select(-Installs)
```
Installs has the highest correlation with reviews

Machine Learning - Data Partition

```{r ML - Data Partition, message=F, warning=F}
normalize <- function(x){
  return ((x-min(x))/(max(x)-min(x)))
}
data.ml<- cbind(as.data.frame(lapply(data[,c(2,3,4,7)],normalize)), data[,-c(2,3,4,7)])

set.seed(123)
trainindex <- createDataPartition(data.ml$Result, p=0.8, list= FALSE)
trainn <- data.ml[trainindex, ]
testn <- data.ml[-trainindex, ]

set.seed(123)
trainindex <- createDataPartition(data.ml$Result, p=0.8, list= FALSE)
training <- data.ml[trainindex, ]
testing <- data.ml[-trainindex, ]
```

I split the data into two parts training and testing using data partition at 80% split. I then normalized the continuous variables of the data for models that require normalized input and added the data that is not normalized to models which don’t require normalized inputs.

I have used the following models on the data to classify the records as hit or flop:

	Logistic Regression
	Decision Trees
	Random Forest
	Naïve Bayes Classifier
	SVM with Default Cost parameter
	Tuned SVM (Couldn’t add it as it takes more than 12 hours to run)

These models were run upon all the attributes with Result (“Hit” or “Flop”) being the target attribute.

The outputs of the models were as follows:


Machine Learning - Modelling - Logistic Regression

```{r ML - Logistic Regression, message=F, warning=F}
rm(trainindex)
rm(data.ml)
train <- trainn
test <- testn
logit.reg <- glm(Result ~.,data = train, family = "binomial")
summary(logit.reg)


logit.reg <- glm(Result ~.,data = train %>% select(-"Category"), family = "binomial")
summary(logit.reg)
#confint(logit.reg) to check for confidence intervals

test$predraw <- predict(logit.reg, newdata = test %>% select(-c("Category", "Result")), type = "response")

par <- data.frame("Cut off" = 1, "Accuracy" = 1, "Sensitivity" = 1)
i = 0
j = 1
for(i in seq(0,1,0.001)){
  test <- test %>% mutate(pred = factor(as.character(ifelse(predraw < i, "Flop", "Hit")), levels = c("Hit", "Flop")))
 par[j,2] <- 1-mean(test$Result != test$pred)
 par[j,1] <- i
 test1 <- test %>% filter(Result == "Hit")
  par[j,3] <- mean(test1$Result == test1$pred)
j <- j+1
i <- i+1
rm(test1)
}
rm(i)
rm(j)

kable(par %>% filter(Accuracy == max(Accuracy)) %>% head(5), caption = "The cut off for the best accuracy")

par %>%ggplot(aes(x = Cut.off, y = par$Accuracy))+geom_point(size = 0.05, color = "gray")+geom_line()+geom_hline(yintercept = max(par$Accuracy), color = "red")+geom_vline(xintercept = 0.375, color = "red")+
  labs(x = "Cut Off value", y = "Accuracy of prediction", title = "Cut off Vs Accuracy")
rm(par)  

test <- test %>% mutate(pred = as.factor(ifelse(predraw <= 0.35, "Flop", "Hit")))
test$pred <- factor(test$pred, levels = c("Hit", "Flop"))
test$Result <- factor(test$Result, levels = c("Hit", "Flop"))
cm <- confusionMatrix(test$pred, test$Result)
cm
logit <- data.frame("Model" = "Logistic Regression", "Accuracy" = cm[["overall"]][["Accuracy"]], "Sensitivity" = cm[["byClass"]][["Sensitivity"]], "Specificity" = cm[["byClass"]][["Specificity"]])
```
Logistic regression has performed pretty well in all the aspects. Note: It must remembered that our class of interest is Hit.


Machine Learning - Modelling - Decision Trees
```{r ML - Decision Trees, message=F, warning=F}
rm(test)
rm(train)
rm(logit.reg)
train <- training
test <- testing

set.seed(123)
dt <- rpart(Result ~., data = train, method = "class", cp = 0.00001, minsplit = 1, xval = 10)
dt.pruned <- prune(dt, cp = 0.004)

test$pred <- predict(dt.pruned, newdata = test[,-8], type = "class")

cm <- confusionMatrix(factor(test$pred, levels = c("Hit", "Flop")), factor(test$Result, levels = c("Hit", "Flop")))
cm
det <- data.frame("Model" = "Decision Trees", "Accuracy" = cm[["overall"]][["Accuracy"]], "Sensitivity" = cm[["byClass"]][["Sensitivity"]], "Specificity" = cm[["byClass"]][["Specificity"]])
rm(cm)
rpart.plot(dt.pruned, box.palette="RdBu", shadow.col="gray", nn=TRUE)

rpart.rules(dt.pruned, style = "tallw")
```

Machine Learning - Modelling - Random Forest
```{r ML - Random Forest, message=F, warning=F}
rm(train)
rm(test)
rm(dt)
rm(dt.pruned)
train <- training
test <- testing
rf <- randomForest(Result ~., data = train)
plot(rf, main = "Error rate of random forest")
varImpPlot(rf, main = "Gini Plot - Importance of Variables")
test$pred = predict(rf, newdata = test[-8])
cm <- confusionMatrix(factor(test$pred, levels = c("Hit", "Flop")), factor(test$Result, levels = c("Hit", "Flop")))
cm
raf <- data.frame("Model" = "Random Forest", "Accuracy" = cm[["overall"]][["Accuracy"]], "Sensitivity" = cm[["byClass"]][["Sensitivity"]], "Specificity" = cm[["byClass"]][["Specificity"]])
rm(cm)
```

Machine Learning - Modelling - Naive Bayes Classifier
```{r ML - Naive Bayes Classifier, message=F, warning=F}
rm(test)
rm(train)
rm(rf)
train <- training
train$Result <- factor(train$Result, levels = c("Hit", "Flop"))
test <- testing
test$Result <- factor(test$Result, levels = c("Hit", "Flop"))

delays.nb <- naiveBayes(Result~., data = train)
delays.nb

pred.prob <- as.data.frame(predict(delays.nb, newdata = test, type = "raw"))
pred.prob <-  round(pred.prob, 4)
pred.prob <- pred.prob %>% select("Hit")
names(pred.prob) <-  "predraw"
test <- cbind(test, pred.prob)
rm(pred.prob)

par <- data.frame("Cut off" = 1, "Accuracy" = 1, "Sensitivity" = 1)
i = 0
j = 1
for(i in seq(0,1,0.001)){
  test <- test %>% mutate(pred = factor(as.character(ifelse(predraw < i, "Flop", "Hit")), levels = c("Hit", "Flop")))
 par[j,2] <- 1-mean(test$Result != test$pred)
 test1 <- test %>% filter(Result == "Hit")
 par[j,3] <- mean(test1$Result == test1$pred)
 par[j,1] <- i
j <- j+1
i <- i+1
rm(test1)
}
rm(i)
rm(j)

kable(par %>% filter(Accuracy == max(Accuracy)) %>% head(5), caption = "The (Hit) probability cut off for the best accuracy")

par %>% ggplot(aes(x = `Cut.off`, y = par$Accuracy))+geom_point(size = 0.05, color = "gray")+geom_line()+geom_hline(yintercept = max(par$Accuracy), color = "red")+geom_vline(xintercept = 0.024, color = "red")+
  labs(x = "Cut Off value", y = "Accuracy of prediction", title = "Cut off Vs Accuracy")

rm(par)

test <- test %>% mutate(pred =  factor(as.character(ifelse(predraw < 0.024, "Flop", "Hit")), levels = c("Hit", "Flop")))

cm <- confusionMatrix(factor(test$pred, levels = c("Hit", "Flop")), factor(test$Result, levels = c("Hit", "Flop")))
cm
nab <- data.frame("Model" = "Naive Bayes", "Accuracy" = cm[["overall"]][["Accuracy"]], "Sensitivity" = cm[["byClass"]][["Sensitivity"]], "Specificity" = cm[["byClass"]][["Specificity"]])
rm(cm)
```

Machine Learning - Modelling - SVM
```{r ML - SVM, message=F, warning=F}
rm(test)
rm(train)
rm(delays.nb)

train <- training
train$Result <- factor(train$Result, levels = c("Hit", "Flop"))
test <- testing
test$Result <- factor(test$Result, levels = c("Hit", "Flop"))

set.seed(123)
svm.fit <- svm(Result~., data=train, kernel = "linear")

test$pred <- predict(svm.fit, newdata = test[,-8], type = "class")
cm <- confusionMatrix(factor(test$pred, levels = c("Hit", "Flop")), factor(test$Result, levels = c("Hit", "Flop")))
cm

linear <- data.frame("MOdel" = "SVM - Linear", "Accuracy" = cm[["overall"]][["Accuracy"]], "Sensitivity" = cm[["byClass"]][["Sensitivity"]], "Specificity" = cm[["byClass"]][["Specificity"]])

rm(test)
rm(train)
rm(svm.fit)
rm(cm)

train <- training
train$Result <- factor(train$Result, levels = c("Hit", "Flop"))
test <- testing
test$Result <- factor(test$Result, levels = c("Hit", "Flop"))

set.seed(123)
svm.fit <- svm(Result~., data=train, kernel = "radial")

test$pred <- predict(svm.fit, newdata = test[,-8], type = "class")
cm <- confusionMatrix(factor(test$pred, levels = c("Hit", "Flop")), factor(test$Result, levels = c("Hit", "Flop")))
cm

radial <- data.frame("MOdel" = "SVM - Radial", "Accuracy" = cm[["overall"]][["Accuracy"]], "Sensitivity" = cm[["byClass"]][["Sensitivity"]], "Specificity" = cm[["byClass"]][["Specificity"]])

rm(test)
rm(train)
rm(svm.fit)
rm(cm)

train <- training
train$Result <- factor(train$Result, levels = c("Hit", "Flop"))
test <- testing
test$Result <- factor(test$Result, levels = c("Hit", "Flop"))

set.seed(123)
svm.fit <- svm(Result~., data=train, kernel = "polynomial")

test$pred <- predict(svm.fit, newdata = test[,-8], type = "class")
cm <- confusionMatrix(factor(test$pred, levels = c("Hit", "Flop")), factor(test$Result, levels = c("Hit", "Flop")))
cm

polynomial <- data.frame("MOdel" = "SVM - Polynomial", "Accuracy" = cm[["overall"]][["Accuracy"]], "Sensitivity" = cm[["byClass"]][["Sensitivity"]], "Specificity" = cm[["byClass"]][["Specificity"]])

rm(cm)
rm(test)
rm(train)
rm(svm.fit)

svm <- rbind(linear, radial, polynomial)
rm(linear)
rm(radial)
rm(polynomial)
names(svm) <- c("Model", "Accuracy", "Sensitivity", "Specificity")

```


Machine Learning - Modelling - SVM (Tuning)

```{r ML - SVM Tuning, eval = FALSE, message=F, warning=F}
set.seed(123)
tunesvm <- tune(svm, Result~., data = train, kernal = "linear", ranges = list(cost = c(seq(0.01,0.1,by = 0.01), seq(0.1,1,by = 0.1), seq(1,10,by = 1))))
dynamic.cost <- tunesvm$best.model$cost
summary(tunesvm)
data.par <- tunesvm$performances
data.par[data.par$error == min(data.par$error),]

set.seed(123)
svm_cost <- svm(Result~., data=train,kernel = "linear", cost = dynamic.cost)
pred_train <- predict(svm_cost, train)
train.error <- mean(pred_train != train$Result)
pred_test <- predict(svm_cost, test)
test.error <- mean(pred_test != test$Result)
linear <- data.frame("Kernal" = "Linear", "Cost" = dynamic.cost,
                     "train Error" = train.error, 
                     "test Error" = test.error)

set.seed(123)
tunesvm <- tune(svm, Result~., data = train, kernal = "radial", ranges = list(cost = c(seq(0.01,0.1,by = 0.01), seq(0.1,1,by = 0.1), seq(1,10,by = 1))))
dynamic.cost <- tunesvm$best.model$cost
summary(tunesvm)
data.par <- tunesvm$performances
data.par[data.par$error == min(data.par$error),] 


svm_cost <- svm(Result~., data=train, kernel = "radial", cost = dynamic.cost)
pred_train <- predict(svm_cost, train)
train.error <- mean(pred_train != train$Result)
pred_test <- predict(svm_cost, test)
test.error <- mean(pred_test != test$Result)
radial <- data.frame("Kernal" = "Radial", "Cost" = dynamic.cost,
                     "train Error" =  train.error, 
                     "test Error" = test.error)

set.seed(123)
tunesvm <- tune(svm, Result~., data = train, kernal = "polynomial", degree = 2, ranges = list(cost = c(seq(0.01,0.1, by = 0.01), seq(0.1,1,by = 0.1), seq(1,10,by = 1))))
dynamic.cost <- tunesvm$best.model$cost
summary(tunesvm)
data.par <- tunesvm$performances
data.par[data.par$error == min(data.par$error),]
svm_cost <- svm(Result~., data=train, kernel = "polynomial", degree = 2, cost = dynamic.cost)
pred_train <- predict(svm_cost, train)
train.error <- mean(pred_train != train$Result)
pred_test <- predict(svm_cost, test)
test.error <- mean(pred_test != test$Result)
polynomial <- data.frame("Kernal" = "Polynomial", "Cost" = dynamic.cost,
                         "train Error" =  train.error, 
                         "test Error" = test.error)


kable(final <- rbind(linear, radial, polynomial), caption = "SVM Performance with different kernals")
```



In this analysis I used Rating and Reviews which are known only after an App is launched. So the analysis I did so far is purely hypothetical.
For this reason I am repeating the entire analysis dropping Rating and Reviews Attributes.

Machine Learning - Modelling - Logistic Regression

```{r ML - Logistic Regression for Real Data, message=F, warning=F}
rm(trainindex)
rm(data.ml)
train <- trainn %>% select(-c("Rating", "Reviews"))
test <- testn %>% select(-c("Rating", "Reviews"))

logit.reg <- glm(Result ~.,data = train, family = "binomial")
summary(logit.reg)
#confint(logit.reg) to check for confidence intervals

test$predraw <- predict(logit.reg, newdata = test %>% select(-"Result"), type = "response")

par <- data.frame("Cut off" = 1, "Accuracy" = 1, "Sensitivity" = 1)
i = 0
j = 1
for(i in seq(0,1,0.001)){
  test <- test %>% mutate(pred = factor(as.character(ifelse(predraw < i, "Flop", "Hit")), levels = c("Hit", "Flop")))
 par[j,2] <- 1-mean(test$Result != test$pred)
 par[j,1] <- i
 test1 <- test %>% filter(Result == "Hit")
  par[j,3] <- mean(test1$Result == test1$pred)
j <- j+1
i <- i+1
rm(test1)
}
rm(i)
rm(j)

kable(par %>% filter(Accuracy == max(Accuracy)) %>% head(5), caption = "The cut off for the best accuracy")

par %>%ggplot(aes(x = Cut.off, y = par$Accuracy))+geom_point(size = 0.05, color = "gray")+geom_line()+geom_hline(yintercept = max(par$Accuracy), color = "red")+geom_vline(xintercept = 0.443, color = "red")+
  labs(x = "Cut Off value", y = "Accuracy of prediction", title = "Cut off Vs Accuracy")
rm(par)  

test <- test %>% mutate(pred = as.factor(ifelse(predraw <= 0.443, "Flop", "Hit")))
test$pred <- factor(test$pred, levels = c("Hit", "Flop"))
test$Result <- factor(test$Result, levels = c("Hit", "Flop"))
cm <- confusionMatrix(test$pred, test$Result)
cm
logit.u <- data.frame("Model" = "Logistic Regression (U)", "Accuracy" = cm[["overall"]][["Accuracy"]], "Sensitivity" = cm[["byClass"]][["Sensitivity"]], "Specificity" = cm[["byClass"]][["Specificity"]])
rm(cm)
```


Machine Learning - Modelling - Decision Trees
```{r ML - Decision Trees for Real Data, message=F, warning=F}
rm(test)
rm(train)
rm(logit.reg)
train <- training %>% select(-c("Rating", "Reviews"))
test <- testing %>% select(-c("Rating", "Reviews"))

set.seed(123)
dt <- rpart(Result ~., data = train, method = "class", cp = 0.00001, minsplit = 1, xval = 10)
dt.pruned <- prune(dt, cp = 0.004)

test$pred <- predict(dt.pruned, newdata = test[,-6], type = "class")

cm <- confusionMatrix(factor(test$pred, levels = c("Hit", "Flop")), factor(test$Result, levels = c("Hit", "Flop")))
cm
det.u <- data.frame("Model" = "Decision Tress (U)", "Accuracy" = cm[["overall"]][["Accuracy"]], "Sensitivity" = cm[["byClass"]][["Sensitivity"]], "Specificity" = cm[["byClass"]][["Specificity"]])
rm(cm)
rpart.plot(dt.pruned, box.palette="RdBu", shadow.col="gray", nn=TRUE)

rpart.rules(dt.pruned, style = "tallw")
```

Machine Learning - Modelling - Random Forest
```{r ML - Random Forest  for Real Data, message=F, warning=F}
rm(train)
rm(test)
rm(dt)
rm(dt.pruned)
train <- training %>% select(-c("Rating", "Reviews"))
test <- testing %>% select(-c("Rating", "Reviews"))
rf <- randomForest(Result ~., data = train)
#plot(rf, main = "Error rate of random forest")
varImpPlot(rf, main = "Gini Plot - Importance of Variables")
test$pred = predict(rf, newdata = test[-6])
cm <- confusionMatrix(factor(test$pred, levels = c("Hit", "Flop")), factor(test$Result, levels = c("Hit", "Flop")))
cm
raf.u <- data.frame("Model" = "Random Forest (U)", "Accuracy" = cm[["overall"]][["Accuracy"]], "Sensitivity" = cm[["byClass"]][["Sensitivity"]], "Specificity" = cm[["byClass"]][["Specificity"]])
rm(cm)
```

Machine Learning - Modelling - Naive Bayes Classifier
```{r ML - Naive Bayes Classifier for Real Data, message=F, warning=F}
rm(test)
rm(train)
rm(rf)
train <- training %>% select(-c("Rating", "Reviews"))
train$Result <- factor(train$Result, levels = c("Hit", "Flop"))
test <- testing %>% select(-c("Rating", "Reviews"))
test$Result <- factor(test$Result, levels = c("Hit", "Flop"))

delays.nb <- naiveBayes(Result~., data = train)
delays.nb

pred.prob <- as.data.frame(predict(delays.nb, newdata = test, type = "raw"))
pred.prob <-  round(pred.prob, 4)
pred.prob <- pred.prob %>% select("Hit")
names(pred.prob) <-  "predraw"
test <- cbind(test, pred.prob)
rm(pred.prob)

par <- data.frame("Cut off" = 1, "Accuracy" = 1, "Sensitivity" = 1)
i = 0
j = 1
for(i in seq(0,1,0.001)){
  test <- test %>% mutate(pred = factor(as.character(ifelse(predraw < i, "Flop", "Hit")), levels = c("Hit", "Flop")))
 par[j,2] <- 1-mean(test$Result != test$pred)
 test1 <- test %>% filter(Result == "Hit")
 par[j,3] <- mean(test1$Result == test1$pred)
 par[j,1] <- i
j <- j+1
i <- i+1
rm(test1)
}
rm(i)
rm(j)

kable(par %>% filter(Accuracy == max(Accuracy)) %>% head(5), caption = "The (Hit) probability cut off for the best accuracy")

par %>% ggplot(aes(x = `Cut.off`, y = par$Accuracy))+geom_point(size = 0.05, color = "gray")+geom_line()+geom_hline(yintercept = max(par$Accuracy), color = "red")+geom_vline(xintercept = 0.615, color = "red")+
  labs(x = "Cut Off value", y = "Accuracy of prediction", title = "Cut off Vs Accuracy")

rm(par)

test <- test %>% mutate(pred =  factor(as.character(ifelse(predraw < 0.615, "Flop", "Hit")), levels = c("Hit", "Flop")))

cm <- confusionMatrix(factor(test$pred, levels = c("Hit", "Flop")), factor(test$Result, levels = c("Hit", "Flop")))
cm
nab.u <- data.frame("Model" = "Naive Bayes (U)", "Accuracy" = cm[["overall"]][["Accuracy"]], "Sensitivity" = cm[["byClass"]][["Sensitivity"]], "Specificity" = cm[["byClass"]][["Specificity"]])
rm(cm)
```

Machine Learning - Modelling - SVM
```{r ML - SVM for Real Data, message=F, warning=F}
rm(test)
rm(train)
rm(delays.nb)

train <- training %>% select(-c("Rating", "Reviews"))
train$Result <- factor(train$Result, levels = c("Hit", "Flop"))
test <- testing %>% select(-c("Rating", "Reviews"))
test$Result <- factor(test$Result, levels = c("Hit", "Flop"))

set.seed(123)
svm.fit <- svm(Result~., data=train, kernel = "linear")

test$pred <- predict(svm.fit, newdata = test[,-6], type = "class")
cm <- confusionMatrix(factor(test$pred, levels = c("Hit", "Flop")), factor(test$Result, levels = c("Hit", "Flop")))
cm

linear <- data.frame("MOdel" = "SVM - Linear (U)", "Accuracy" = cm[["overall"]][["Accuracy"]], "Sensitivity" = cm[["byClass"]][["Sensitivity"]], "Specificity" = cm[["byClass"]][["Specificity"]])

rm(test)
rm(train)
rm(svm.fit)
rm(cm)

train <- training %>% select(-c("Rating", "Reviews"))
train$Result <- factor(train$Result, levels = c("Hit", "Flop"))
test <- testing %>% select(-c("Rating", "Reviews"))
test$Result <- factor(test$Result, levels = c("Hit", "Flop"))

set.seed(123)
svm.fit <- svm(Result~., data=train, kernel = "radial")

test$pred <- predict(svm.fit, newdata = test[,-6], type = "class")
cm <- confusionMatrix(factor(test$pred, levels = c("Hit", "Flop")), factor(test$Result, levels = c("Hit", "Flop")))
cm

radial <- data.frame("MOdel" = "SVM - Radial (U)", "Accuracy" = cm[["overall"]][["Accuracy"]], "Sensitivity" = cm[["byClass"]][["Sensitivity"]], "Specificity" = cm[["byClass"]][["Specificity"]])

rm(test)
rm(train)
rm(svm.fit)
rm(cm)

train <- training %>% select(-c("Rating", "Reviews"))
train$Result <- factor(train$Result, levels = c("Hit", "Flop"))
test <- testing %>% select(-c("Rating", "Reviews"))
test$Result <- factor(test$Result, levels = c("Hit", "Flop"))

set.seed(123)
svm.fit <- svm(Result~., data=train, kernel = "polynomial")

test$pred <- predict(svm.fit, newdata = test[,-6], type = "class")
cm <- confusionMatrix(factor(test$pred, levels = c("Hit", "Flop")), factor(test$Result, levels = c("Hit", "Flop")))
cm

polynomial <- data.frame("MOdel" = "SVM - Polynomial (U)", "Accuracy" = cm[["overall"]][["Accuracy"]], "Sensitivity" = cm[["byClass"]][["Sensitivity"]], "Specificity" = cm[["byClass"]][["Specificity"]])

rm(cm)
rm(test)
rm(train)
rm(svm.fit)

svm.u <- rbind(linear, radial, polynomial)
rm(linear)
rm(radial)
rm(polynomial)
names(svm.u) <- c("Model", "Accuracy", "Sensitivity", "Specificity")
```



Final model Summary
```{r Final Model Summary, message=F, warning=F}
model.ds <- rbind(logit, det, raf, nab, svm)
rm(logit)
rm(det)
rm(raf)
rm(nab)
rm(svm)
model.realtime <- rbind(logit.u, det.u, raf.u, nab.u, svm.u)
rm(logit.u)
rm(det.u)
rm(raf.u)
rm(nab.u)
rm(svm.u)

kable(rbind(model.ds, model.realtime), caption = "Final model results for complete and real time data")

kable((model.ds %>% arrange(desc(Accuracy)) %>% head(2)), caption = "Top 2 models for complete data")

kable((model.realtime%>% arrange(desc(Accuracy)) %>% head(2)), caption = "Top 2 models for data that can be extracted in real time")
```

```{r Conclusion, echo = F, eval = F, include = F}

```
Conclusion:

It is clear that the size of the app and category define if the app would be a hit or not.  
From the EDA it is clear that the app must be below 25 MB and the categories to target would vary depending on the risk one wants to take.  
While Game, Family, and Communication give a constant reward, the highest can be gained from Categories that are yet to be explored like Comics (Which are positively seen but are low in market) or Events (Which have a huge criticism and are low in market).  


```{r Sources, echo = F, eval = F, include = F}

```

Sources:  
https://www.kaggle.com/lava18/google-play-store-apps  
http://r-statistics.co/  
https://uc-r.github.io/  
https://www.r-bloggers.com/  
https://medium.com/analytics-vidhya/a-guide-to-machine-learning-in-r-for-beginners-part-5-4c00f2366b90  
https://en.proft.me/2016/11/9/classification-using-decis
ion-trees-r/  
https://www.gormanalysis.com/blog/decision-trees-in-r-us
ing-rpart/  
http://www.sthda.com/english/articles/35-statistical-machine-learning-essentials/141-cart-model-decision-tree-essentials/  
https://dataaspirant.com/2017/02/03/decision-tree-classifier-implementation-in-r/  
https://www.blopig.com/blog/2017/04/a-very-basic-introduction-to-random-forests-using-r/  
https://www.rdocumentation.org/  
https://www.datacamp.com/  

Introduction to Data Mining  
Book by Michael Steinbach, Pang-Ning Tan, and Vipin Kumar

Data Mining for Business Analytics: Concepts, Techniques, and Applications in R   
Book by Galit Shmueli , Peter C. Bruce, Inbal Yahav, Nitin R. Patel,
Kenneth C. Lichtendahl Jr. 
