Required packages


library(readr)
library(dplyr)
library(tidyr)
library(lubridate)
library(stringr)
library(knitr)
library(forecast)
library(car)

Executive Summary

The aim of this assignment is to pre-process user review data for Google Play applications so that it is ready for analysis of how the applications perform, e.g. correlations between price, reviews and sentiment, and ranking of apps within different categories.

The two data sets used, googleplaystore.csv and googleplaystore_user_reviews.csv, were imported and merged on a common variable (App). To better understand the data, the structure and class of each variable were examined. Irrelevant variables were dropped to simplify the process. Data type conversions were then carried out: some variables were converted to numeric or date, and others to unordered or ordered factors.

The data was already in a tidy format, so no reshaping was needed. Three new columns were created with the mutate function, separating the date on which the application was last updated into day, month and year for easier comparison of data by month or year.

Several missing values were identified in the data set. Rows with missing values in Rating, Sentiment_Polarity and Translated_Review were removed, whereas missing values in Size were replaced by the mean size of the app’s category. The missing values in Price arose during type conversion for free applications, so they were replaced with 0.

Lastly, outlier handling and transformation were performed to reduce the effect of outliers on the results. Outliers were handled by capping: values beyond the 1.5 × IQR fences were replaced with the nearest of the 5th or 95th percentile. For the heavily right-skewed Reviews variable, a log10 transformation was applied to reduce the skewness. This resulted in a more normal distribution and eliminated the apparent outliers in that variable.

Data

Data obtained from: https://www.kaggle.com/lava18/google-play-store-apps

The data was scraped from the Google Play Store for over 10k apps, together with their reviews.

Datasets used: googleplaystore.csv and googleplaystore_user_reviews.csv

App information is stored in googleplaystore.csv, while review information is stored in googleplaystore_user_reviews.csv. Variable descriptions for each dataset are shown below.

Both datasets were imported and merged (left_join) on a common variable, ‘App’, for easier analysis. The left_join is appropriate as it keeps every app in googleplaystore and attaches the matching rows from googleplaystore_user_reviews, so that each review is matched to the appropriate app.

To simplify the process, variables ‘Android Ver’ and ‘Current Ver’ were removed as version history will not be useful in the analysis.

Loading the data into R.

googleplaystore <- read_csv("googleplaystore.csv")
googleplaystore_user_reviews <- read_csv("googleplaystore_user_reviews.csv")
playstoredescription <- read_csv("playstoredescription.csv")
UserReviewsdescription <- read_csv("UserReviewsdescription.csv")

Variables description in googleplaystore:

Variable        Description
App             Application name
Category        Category the app belongs to
Rating          Overall user rating of the app (as when scraped)
Reviews         Number of user reviews for the app (as when scraped)
Size            Size of the app (as when scraped)
Installs        Number of user downloads/installs for the app (as when scraped)
Type            Paid or Free
Price           Price of the app (as when scraped)
Content Rating  Age group the app is targeted at - Children / Mature 21+ / Adult
Genres          An app can belong to multiple genres (apart from its main category). For example, a musical family game will belong to the Music, Game and Family genres.
Last Updated    Date when the app was last updated on Play Store (as when scraped)
Current Ver     Current version of the app available on Play Store (as when scraped)
Android Ver     Min required Android version (as when scraped)

Variables description in googleplaystore_user_reviews:

Variable                Description
App                     Application name
Translated_Review       User review (Preprocessed and translated to English)
Sentiment               Positive/Negative/Neutral (Preprocessed)
Sentiment_Polarity      Sentiment polarity score
Sentiment_Subjectivity  Sentiment subjectivity score

Joining the Data Sets

The common variable in the two data sets is App, where the names of the apps are stored. To enable meaningful analysis, we join the two data sets on the app names. The new table we will be working with is called apps.

apps <- googleplaystore %>% left_join(googleplaystore_user_reviews, by = "App")

Understand

Checking the structure of the data

str(apps)
Classes ‘spec_tbl_df’, ‘tbl_df’, ‘tbl’ and 'data.frame':    131971 obs. of  17 variables:
 $ App                   : chr  "Photo Editor & Candy Camera & Grid & ScrapBook" "Coloring book moana" "Coloring book moana" "Coloring book moana" ...
 $ Category              : chr  "ART_AND_DESIGN" "ART_AND_DESIGN" "ART_AND_DESIGN" "ART_AND_DESIGN" ...
 $ Rating                : num  4.1 3.9 3.9 3.9 3.9 3.9 3.9 3.9 3.9 3.9 ...
 $ Reviews               : num  159 967 967 967 967 967 967 967 967 967 ...
 $ Size                  : chr  "19M" "14M" "14M" "14M" ...
 $ Installs              : chr  "10,000+" "500,000+" "500,000+" "500,000+" ...
 $ Type                  : chr  "Free" "Free" "Free" "Free" ...
 $ Price                 : chr  "0" "0" "0" "0" ...
 $ Content Rating        : chr  "Everyone" "Everyone" "Everyone" "Everyone" ...
 $ Genres                : chr  "Art & Design" "Art & Design;Pretend Play" "Art & Design;Pretend Play" "Art & Design;Pretend Play" ...
 $ Last Updated          : chr  "January 7, 2018" "January 15, 2018" "January 15, 2018" "January 15, 2018" ...
 $ Current Ver           : chr  "1.0.0" "2.0.0" "2.0.0" "2.0.0" ...
 $ Android Ver           : chr  "4.0.3 and up" "4.0.3 and up" "4.0.3 and up" "4.0.3 and up" ...
 $ Translated_Review     : chr  NA "A kid's excessive ads. The types ads allowed app, let alone kids" "It bad >:(" "like" ...
 $ Sentiment             : chr  NA "Negative" "Negative" "Neutral" ...
 $ Sentiment_Polarity    : num  NA -0.25 -0.725 0 NaN 0.5 -0.8 NaN 0 0.5 ...
 $ Sentiment_Subjectivity: num  NA 1 0.833 0 NaN ...

Removing unused variables

Version history is not useful for our analysis, so we remove the following variables:

  • Android Ver
  • Current Ver
apps<-apps %>% select(-c(`Android Ver`,`Current Ver`))
colnames(apps)
 [1] "App"                    "Category"              
 [3] "Rating"                 "Reviews"               
 [5] "Size"                   "Installs"              
 [7] "Type"                   "Price"                 
 [9] "Content Rating"         "Genres"                
[11] "Last Updated"           "Translated_Review"     
[13] "Sentiment"              "Sentiment_Polarity"    
[15] "Sentiment_Subjectivity"

Converting variables into ordered and unordered factors

  • Installs
  • Type
  • Content Rating
  • Sentiment
  • Category
apps <- apps %>% mutate(
  # Install buckets have a natural order, so use an ordered factor
  Installs = factor(Installs,
                    levels = c("0", "0+", "1+", "5+", "10+", "50+", "100+", "500+", "1,000+",
                               "5,000+", "10,000+", "50,000+", "100,000+", "500,000+",
                               "1,000,000+", "5,000,000+", "10,000,000+", "50,000,000+",
                               "100,000,000+", "500,000,000+", "1,000,000,000+"),
                    ordered = TRUE),
  # Type has no inherent order
  Type = factor(Type, levels = c("Free", "Paid")),
  `Content Rating` = factor(`Content Rating`,
                            levels = c("Everyone", "Everyone 10+", "Teen", "Mature 17+", "Adults only 18+"),
                            ordered = TRUE),
  Sentiment = factor(Sentiment,
                     levels = c("Negative", "Neutral", "Positive"),
                     ordered = TRUE),
  # Category has no inherent order
  Category = factor(Category)
  )
str(apps[,c("Installs", "Type", "Content Rating", "Sentiment", "Category")]) 
Classes ‘tbl_df’, ‘tbl’ and 'data.frame':   131971 obs. of  5 variables:
 $ Installs      : Ord.factor w/ 21 levels "0"<"0+"<"1+"<..: 11 14 14 14 14 14 14 14 14 14 ...
 $ Type          : Factor w/ 2 levels "Free","Paid": 1 1 1 1 1 1 1 1 1 1 ...
 $ Content Rating: Ord.factor w/ 5 levels "Everyone"<"Everyone 10+"<..: 1 1 1 1 1 1 1 1 1 1 ...
 $ Sentiment     : Ord.factor w/ 3 levels "Negative"<"Neutral"<..: NA 1 1 2 NA 3 1 NA 2 3 ...
 $ Category      : Factor w/ 34 levels "1.9","ART_AND_DESIGN",..: 2 2 2 2 2 2 2 2 2 2 ...
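
To show what an ordered factor adds, here is a minimal sketch on made-up install buckets (installs_raw is hypothetical, and "Free" stands in for any value that is not in the level list). It illustrates two things: comparisons follow the declared level order, and values outside the declared levels silently become NA, so it may be worth checking the raw values before fixing the levels.

```r
# Hypothetical install buckets; "Free" stands in for a value outside the level list
installs_raw <- c("10+", "1,000+", "5+", "Free")
installs <- factor(installs_raw, levels = c("5+", "10+", "1,000+"), ordered = TRUE)

# Ordered factors support comparisons that follow the declared level order
at_least_10 <- installs >= "10+"

# Values not listed in levels silently become NA
n_unmatched <- sum(is.na(installs))
```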

Date conversion

Converting ‘Last Updated’ into date format in a new column, then dropping the original ‘Last Updated’ column. Writing to a new column avoids errors if the code is run twice, since re-parsing an already converted column would turn the dates into NA. This preserves the integrity of the data.

apps <- apps %>% mutate(Updated = mdy(apps$`Last Updated`))
 1 failed to parse.
apps <- apps %>% select(-`Last Updated`)

str(apps$Updated)
 Date[1:131971], format: "2018-01-07" "2018-01-15" "2018-01-15" "2018-01-15" "2018-01-15" ...
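
The parse warning can be followed up by locating the entries that did not convert. The sketch below uses made-up strings (raw_dates is hypothetical) and assumes an English locale for the month names. quiet = TRUE only suppresses the warning message.

```r
library(lubridate)

# Hypothetical inputs: two valid dates and one value that is not a date
raw_dates <- c("January 7, 2018", "August 1, 2018", "unknown")
parsed <- mdy(raw_dates, quiet = TRUE)

# Entries that failed to parse come back as NA and can be located directly
failed <- raw_dates[is.na(parsed)]
```

Applied to the real column, the same is.na() mask would point at the single row reported as failing to parse.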

Changing Price from character to numeric

apps$Price <- substr(apps$Price,2,nchar(apps$Price)) %>% as.numeric()
NAs introduced by coercion
str(apps$Price)
 num [1:131971] NA NA NA NA NA NA NA NA NA NA ...

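We note that free apps, recorded as “0”, become NA in the conversion: substr drops the first character (intended to remove the leading “$”), leaving an empty string that as.numeric converts to NA. We will impute the 0s back in the Scan section.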
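
One possible alternative, shown on made-up values (price_raw is hypothetical), is to remove only a leading dollar sign by pattern rather than dropping the first character by position. This is a sketch of a different technique from the one used in this report, not a replacement for the imputation step.

```r
library(stringr)

# Hypothetical raw prices in the same style as the Price column
price_raw <- c("0", "$4.99", "$0.99")

# Dropping the first character turns "0" into "", which as.numeric() maps to NA
by_position <- suppressWarnings(as.numeric(substr(price_raw, 2, nchar(price_raw))))

# Removing only a leading dollar sign keeps "0" intact
by_pattern <- as.numeric(str_remove(price_raw, "^\\$"))
```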

Changing application size variable to numeric

Application sizes are recorded in megabytes, in kilobytes, or as “Varies with device”. To convert this variable to numeric for analysis, a common unit of measurement is needed. We decided to use megabytes.

We first extract the numeric part from the string, then the unit letter (‘M’ or ‘k’). Anything else is recognised as the ‘Varies with device’ value, which we allow to be NA as we do not have enough information.
Finally, we put it all together. If the size is in kilobytes, we multiply by 0.001 to express it in megabytes. NAs are recorded for ‘Varies with device’; these will be imputed in a later section with the mean size of the respective category.

unit_size <- str_extract(apps$Size, "[A-Za-z]")  # first letter: "M", "k", or "V" for "Varies with device"
value_size <- substr(apps$Size,start = 1,stop=(nchar(apps$Size)-1)) %>% as.numeric()
NAs introduced by coercion
conversion <-function(x,y) {ifelse(x=="M",y,ifelse(x=="k",y*0.001,NA))}
size<-conversion(unit_size,value_size)
apps<-apps %>% mutate(Size=size)

class(apps$Size)
[1] "numeric"
summary(apps$Size)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   0.01   10.00   24.00   33.33   52.00  100.00   48407 
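
The conversion logic can be checked on a handful of made-up strings. The sketch below is a variant that anchors the patterns to the start and end of the string (size_raw and the other names are hypothetical); it follows the same rule of treating ‘k’ as 0.001 megabytes and anything without a trailing M or k as NA.

```r
library(stringr)

# Hypothetical raw sizes in the same style as the Size column
size_raw <- c("19M", "8.7M", "201k", "Varies with device")

# Numeric part: digits with an optional decimal fraction at the start of the string
size_value <- as.numeric(str_extract(size_raw, "^[0-9.]+"))
# Unit: a trailing M or k; anything else yields NA
size_unit <- str_extract(size_raw, "[Mk]$")

# Express everything in megabytes; unknown units stay NA
size_mb <- ifelse(size_unit == "M", size_value,
                  ifelse(size_unit == "k", size_value * 0.001, NA_real_))
```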

Checking the data structure to confirm that all variables are in the right class.

str(apps)
Classes ‘spec_tbl_df’, ‘tbl_df’, ‘tbl’ and 'data.frame':    131971 obs. of  15 variables:
 $ App                   : chr  "Photo Editor & Candy Camera & Grid & ScrapBook" "Coloring book moana" "Coloring book moana" "Coloring book moana" ...
 $ Category              : Factor w/ 34 levels "1.9","ART_AND_DESIGN",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ Rating                : num  4.1 3.9 3.9 3.9 3.9 3.9 3.9 3.9 3.9 3.9 ...
 $ Reviews               : num  159 967 967 967 967 967 967 967 967 967 ...
 $ Size                  : num  19 14 14 14 14 14 14 14 14 14 ...
 $ Installs              : Ord.factor w/ 21 levels "0"<"0+"<"1+"<..: 11 14 14 14 14 14 14 14 14 14 ...
 $ Type                  : Factor w/ 2 levels "Free","Paid": 1 1 1 1 1 1 1 1 1 1 ...
 $ Price                 : num  NA NA NA NA NA NA NA NA NA NA ...
 $ Content Rating        : Ord.factor w/ 5 levels "Everyone"<"Everyone 10+"<..: 1 1 1 1 1 1 1 1 1 1 ...
 $ Genres                : chr  "Art & Design" "Art & Design;Pretend Play" "Art & Design;Pretend Play" "Art & Design;Pretend Play" ...
 $ Translated_Review     : chr  NA "A kid's excessive ads. The types ads allowed app, let alone kids" "It bad >:(" "like" ...
 $ Sentiment             : Ord.factor w/ 3 levels "Negative"<"Neutral"<..: NA 1 1 2 NA 3 1 NA 2 3 ...
 $ Sentiment_Polarity    : num  NA -0.25 -0.725 0 NaN 0.5 -0.8 NaN 0 0.5 ...
 $ Sentiment_Subjectivity: num  NA 1 0.833 0 NaN ...
 $ Updated               : Date, format: "2018-01-07" "2018-01-15" ...
  • App is the name of each app, so it is fine to keep as character data.
  • Genres is fine to keep as character data.
  • Sentiment_Polarity and Sentiment_Subjectivity are our analysis variables and should be numeric.
  • All other variables have been changed into the correct class.

Tidy & Manipulate Data I

The data is already in a tidy format since:
1. Each variable has its own column: each column relates to an attribute of the app or review.
2. Each observation has its own row: each row relates to an app and an individual review.
3. Each value is in its own cell.

head(apps, 6)

Tidy & Manipulate Data II

If we want to analyse whether the date an app was last updated has an effect on the sentiment of its reviews, it is useful to have the year, month and day of the last update in separate columns.

Creating the Year, Month and Day columns from the Updated values.

apps <- apps %>% mutate(Day = day(apps$Updated), 
                  Month = month(apps$Updated), 
                  Year = year(apps$Updated))
str(apps[,c("Day", "Month", "Year")])
Classes ‘tbl_df’, ‘tbl’ and 'data.frame':   131971 obs. of  3 variables:
 $ Day  : int  7 15 15 15 15 15 15 15 15 15 ...
 $ Month: num  1 1 1 1 1 1 1 1 1 1 ...
 $ Year : num  2018 2018 2018 2018 2018 ...

Scan I

Checking for missing values (NA, infinite and NaN).

Checking the total number of missing and special values and displaying them to handle one by one.

n <- colSums(is.na(apps)) %>% as.data.frame()
names(n) <- "NA"

i <- sapply(apps, is.infinite) %>% as.data.frame() %>% 
colSums()

nan <- sapply(apps, is.nan) %>% as.data.frame() %>% 
colSums()

x <- n %>% mutate(Infinite = i, Nan = nan) 
row.names(x) <- colnames(apps)
x

Checking why reviews and installs have only one NA

apps[which(is.na(apps$Reviews)),]

The values appear to be in the wrong columns, most likely due to a scraping error. Since we have a large data sample, we deal with this by deleting the row.

apps <- apps[!is.na(apps$Reviews), ]
sum(is.na(apps$Reviews))
[1] 0
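
A note on row removal by missing values: a logical mask is generally safer than negative which() indexing, because which() returns an empty vector when nothing is missing and the negative index then selects no rows at all. The toy data frame below (toy is hypothetical) is a minimal sketch of that edge case.

```r
# Hypothetical data frame with no missing values
toy <- data.frame(id = 1:3, reviews = c(10, 20, 30))

# which() returns integer(0) here, and toy[-integer(0), ] selects zero rows
dropped_everything <- toy[-which(is.na(toy$reviews)), ]

# A logical mask keeps all rows in the same situation
kept <- toy[!is.na(toy$reviews), ]
```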

For missing values in Rating, we remove all affected rows. Since our analysis is on ratings and sentiment, apps that do not have any rating are not useful to include.

apps <- apps[!is.na(apps$Rating), ]
sum(is.na(apps$Rating))
[1] 0

When converting Price to numeric, the value 0 was turned into NA. As 0 is a valid price and carries information, we impute the NAs in Price with 0.


apps$Price[which(is.na(apps$Price))] <- 0
sum(is.na(apps$Price))
[1] 0

For the Size variable, there was a value called “Varies with device”. When converting to numeric format, this became NA.

We will impute these NAs with the average size of the apps in the same category.
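
To make the idea concrete on a small scale, the sketch below applies the same grouped mean imputation to a made-up tibble (toy_sizes is hypothetical). One caveat worth keeping in mind: if every Size in a category were missing, the group mean would be NaN rather than a usable value.

```r
library(dplyr)

# Hypothetical sizes in megabytes, with one missing value per category
toy_sizes <- tibble(
  Category = c("GAME", "GAME", "GAME", "TOOLS", "TOOLS"),
  Size     = c(40, 60, NA, 5, NA)
)

# Replace each missing Size with the mean of the non-missing sizes in its category
toy_imputed <- toy_sizes %>%
  group_by(Category) %>%
  mutate(Size = ifelse(is.na(Size), mean(Size, na.rm = TRUE), Size)) %>%
  ungroup()
```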

apps <- apps %>% 
  group_by(Category) %>% 
  mutate(Size = ifelse(is.na(Size), 
                           mean(Size,na.rm = T),
                           Size)) %>% ungroup()

# Checking if Size is imputed 
sum(is.na(apps$Size))
[1] 0
# Displaying sizes by categories 
apps %>% 
  group_by(Category) %>% summarise(mean=round(mean(Size),2))

Translated_Review is where the review text is stored. An app user may or may not leave a written review after giving a rating. If no written review is given, it is recorded as NaN. Reviews that were not recorded at all appear as NA here. Since we are going to analyse sentiment, we only look at rows where a review was left. Therefore we deal with missing values in Translated_Review by removing them.

apps <- apps[!is.na(apps$Translated_Review), ]
sum(is.na(apps$Translated_Review))
[1] 0

Sentiment Polarity is one of the variables for analysis, so the data set should have no missing values here. We remove rows with missing values in Sentiment_Polarity. Also note that is.na here includes the NaNs that were created for reviews that were recorded without any text. We exclude these from our analysis.

apps <- apps[!is.na(apps$Sentiment_Polarity), ]
sum(is.na(apps$Sentiment_Polarity))
[1] 0

Final missing value check:

Checking for NA, Inf and NaN.

n <- colSums(is.na(apps)) %>% as.data.frame()
names(n) <- "NA"

i <- sapply(apps, is.infinite) %>% as.data.frame() %>% 
colSums()

nan <- sapply(apps, is.nan) %>% as.data.frame() %>% 
colSums()

x <- n %>% mutate(Infinite = i, Nan = nan) 
row.names(x) <- colnames(apps)
x

We have dealt with all the missing, infinite and NaN values.
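
Since the same check is run twice, it could be wrapped in a small helper. The function below is a hypothetical sketch (count_special is not part of any package) that counts NA, infinite and NaN values per column in base R. Note that is.na() also counts NaN, so the Missing column includes the NaN_count values.

```r
# Hypothetical helper: counts NA, infinite and NaN values per column
count_special <- function(df) {
  data.frame(
    Variable  = names(df),
    Missing   = vapply(df, function(col) sum(is.na(col)), integer(1)),
    Infinite  = vapply(df, function(col) if (is.numeric(col)) sum(is.infinite(col)) else 0L, integer(1)),
    NaN_count = vapply(df, function(col) if (is.numeric(col)) sum(is.nan(col)) else 0L, integer(1)),
    row.names = NULL
  )
}

# Usage on a made-up data frame
toy <- data.frame(a = c(1, NA, NaN, Inf), b = c("x", NA, "y", "z"))
res <- count_special(toy)
print(res)
```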

Scan II

Identify numeric data

check_numeric <-sapply(apps, is.numeric) %>% as.data.frame()
names(check_numeric) <-"Numeric"
check<-check_numeric %>% mutate(Variable=colnames(apps),Numeric=Numeric) 
check<-check%>% filter(Numeric==T) %>% select(Variable,Numeric)
check

We can see numeric data are
* Rating
* Reviews
* Size
* Price
* Sentiment_Polarity
* Sentiment_Subjectivity

We will ignore the variables Day, Month and Year, as they were created by us and checking them for outliers is not relevant.

Checking for outliers in them:

par(mfrow=c(2,3))
Boxplot(apps$Rating, main="Rating")
 [1] 42886 42887 42888 42889 42890 42891 42892 42893 42894 42895
Boxplot(apps$Reviews, main="Reviews")
 [1] 43462 43463 43464 43465 43466 43467 43468 43469 43470 43471
Boxplot(apps$Size, main="Size")
 [1] 20255 20256 20257 20258 20259 20260 20261 20262 20263 20264
Boxplot(apps$Sentiment_Polarity, main= "Sentiment Polarity")
 [1]  11  33 116 168 292 300 515 673 715 717  22  44  77 112 163 264 266 337
[19] 473 474
Boxplot(apps$Sentiment_Subjectivity, main = "Sentiment Subjectivity")
 [1]   3   6  26  29  46  72  80  81  84 102

For Price, we group the data by Type of app (Free/Paid) and check for outliers within each group, as the data would otherwise be heavily skewed by the large number of free apps.


Boxplot(apps$Price~apps$Type, main = "Price grouped by type of app")
 [1] "49836" "49837" "49838" "49839" "49840" "49841" "49842" "49843" "49844"
[10] "49845"

Reviews looks to be severely right skewed.

hist(apps$Reviews, main="Reviews")

It makes more sense to transform this variable before considering capping, so that we do not lose too much information.

Capping the Outliers for:

  • Rating
  • Size
  • Sentiment_Polarity
  • Sentiment_Subjectivity

We cap them at the 5th and 95th percentiles, as it makes sense for outlying values in these variables to still have an effect, just not an excessive one.

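cap <- function(x){
  # 5th, 25th, 75th and 95th percentiles of x
  quantiles <- quantile(x, probs = c(0.05, 0.25, 0.75, 0.95), na.rm = TRUE)
  # Values below the lower fence (Q1 - 1.5*IQR) are set to the 5th percentile
  x[ x < quantiles[2] - 1.5*IQR(x, na.rm = TRUE) ] <- quantiles[1]
  # Values above the upper fence (Q3 + 1.5*IQR) are set to the 95th percentile
  x[ x > quantiles[3] + 1.5*IQR(x, na.rm = TRUE) ] <- quantiles[4]
  x
}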
apps[,c("Rating", "Size","Sentiment_Polarity", "Sentiment_Subjectivity")] <- sapply(apps[,c("Rating", "Size","Sentiment_Polarity", "Sentiment_Subjectivity")], cap) %>% 
  as.data.frame()
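
A toy run of this capping rule on a made-up vector helps to see what it does. The function cap_demo below repeats the same rule under a hypothetical name, and x is invented for illustration. With R's default quantile type, the single extreme value is pulled in to the 95th percentile, which here still lies above the upper fence.

```r
# Same capping rule as above, under a hypothetical name for the demonstration
cap_demo <- function(x) {
  quantiles <- quantile(x, probs = c(0.05, 0.25, 0.75, 0.95), na.rm = TRUE)
  x[x < quantiles[2] - 1.5 * IQR(x, na.rm = TRUE)] <- quantiles[1]
  x[x > quantiles[3] + 1.5 * IQR(x, na.rm = TRUE)] <- quantiles[4]
  x
}

# Made-up data: nine ordinary values and one extreme value
x <- c(1:9, 100)
upper_fence <- unname(quantile(x, 0.75) + 1.5 * IQR(x))  # 7.75 + 1.5 * 4.5 = 14.5
capped <- cap_demo(x)

# The extreme value is replaced by the 95th percentile (59.05), not by the fence
print(capped[10])
print(upper_fence)
```

Because the replacement value is a percentile rather than the fence itself, a capped value can still be flagged as an outlier when the boxplot is redrawn.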

Capping the Outliers for Price grouped by Type (Paid apps get capped among paid apps only)


apps <- apps %>% 
  group_by(Type) %>% 
  mutate(Price = cap(Price)) %>% ungroup()

Checking if Outliers are capped:

par(mfrow=c(2,2), pty = "s" )
Boxplot(apps$Rating, main="Rating")
Boxplot(apps$Size, main="Size")
Boxplot(apps$Sentiment_Polarity, main= "Sentiment Polarity")
Boxplot(apps$Sentiment_Subjectivity, main = "Sentiment Subjectivity")
 [1]   3   6  26  29  46  72  80  81  84 102

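It is seen that outliers remain in Sentiment Subjectivity even after capping to the nearest of the 5th or 95th percentile. This can happen when the percentile used as the cap itself lies beyond the 1.5 × IQR fence.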

Checking outliers for Price

Boxplot(apps$Price~apps$Type, main = "Price grouped by type of app")
 [1] "49836" "49837" "49838" "49839" "49840" "49841" "49842" "49843" "49844"
[10] "49845"

It is seen that, due to the number of highly priced apps, outliers remain even after capping to the nearest percentile.

Transform

We see from the previous section that number of reviews is heavily skewed.


hist(apps$Reviews, main="Reviews")

For this heavily right-skewed variable, we apply a log10 transformation to bring it closer to a normal distribution.

apps <- apps %>% mutate(Reviews_t = log10(apps$Reviews))

Checking the distribution of the transformed variable (reviews)


hist(apps$Reviews_t, main="Log10(Reviews)")


Boxplot(apps$Reviews_t, main="log10(Reviews)")

We also note that after the transformation no outliers are flagged in the boxplot.
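
The effect of the transformation can be seen on a few made-up counts (reviews is hypothetical): values spanning several orders of magnitude are mapped onto an evenly spaced scale. The sketch also shows that a count of 0 would map to -Inf, so it is worth confirming that no zero counts remain before transforming.

```r
# Hypothetical review counts spanning several orders of magnitude
reviews <- c(1, 10, 1000, 1e6)
reviews_log <- log10(reviews)

# A count of 0 maps to -Inf, which would need handling before analysis
zero_case <- log10(0)
```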

Summary

Arranging the columns and the rows (by Category and rating) and checking the structure of the final data.


apps<-apps %>% select(App,Category,Genres,Size,Updated,Year,Month,Day,Type,Price,`Content Rating`,Reviews,Rating,Installs,
                        Translated_Review ,Sentiment,Sentiment_Polarity,  Sentiment_Subjectivity,Reviews_t)

#arranging the data according to Categories (A-Z) and ratings (high to low) within these categories.
apps<-apps %>% arrange(Category,desc(Rating))

head(apps,10)

#checking the structure of the final data
str(apps)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame':   72566 obs. of  19 variables:
 $ App                   : chr  "Colorfit - Drawing & Coloring" "Colorfit - Drawing & Coloring" "Colorfit - Drawing & Coloring" "Colorfit - Drawing & Coloring" ...
 $ Category              : Factor w/ 34 levels "1.9","ART_AND_DESIGN",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ Genres                : chr  "Art & Design;Creativity" "Art & Design;Creativity" "Art & Design;Creativity" "Art & Design;Creativity" ...
 $ Size                  : num  25 25 25 25 25 25 25 25 25 25 ...
 $ Updated               : Date, format: "2017-10-11" "2017-10-11" ...
 $ Year                  : num  2017 2017 2017 2017 2017 ...
 $ Month                 : num  10 10 10 10 10 10 10 10 10 10 ...
 $ Day                   : int  11 11 11 11 11 11 11 11 11 11 ...
 $ Type                  : Factor w/ 2 levels "Free","Paid": 1 1 1 1 1 1 1 1 1 1 ...
 $ Price                 : num  0 0 0 0 0 0 0 0 0 0 ...
 $ Content Rating        : Ord.factor w/ 5 levels "Everyone"<"Everyone 10+"<..: 1 1 1 1 1 1 1 1 1 1 ...
 $ Reviews               : num  20260 20260 20260 20260 20260 ...
 $ Rating                : num  4.7 4.7 4.7 4.7 4.7 4.7 4.7 4.7 4.7 4.7 ...
 $ Installs              : Ord.factor w/ 21 levels "0"<"0+"<"1+"<..: 14 14 14 14 14 14 14 14 14 14 ...
 $ Translated_Review     : chr  "Good luck getting pictures free. Everyday supposed collect diamonds color pictures free. I rarely get diamonds "| __truncated__ "This terrible - there's way get back color chooser brush/paint screen. The way I get back go picture I'm workin"| __truncated__ "I love I'd like see pictures every beautiful thing world. Great job. Should maybe play games watch videos recei"| __truncated__ "I really liked first got really buggy. The worst part I took lot time shading went save changed colors blue! Ve"| __truncated__ ...
 $ Sentiment             : Ord.factor w/ 3 levels "Negative"<"Neutral"<..: 3 1 3 1 3 1 3 3 3 1 ...
 $ Sentiment_Polarity    : num  0.15 -0.25 0.5375 -0.1542 0.0833 ...
 $ Sentiment_Subjectivity: num  0.721 0.375 0.613 0.568 0.567 ...
 $ Reviews_t             : num  4.31 4.31 4.31 4.31 4.31 ...

After arranging the columns and rows, we checked the structure of the final data. The data is in the right format, ordered properly and tidy, has no missing values, and all irrelevant variables have been dropped. Outliers have been capped or transformed, although some remain in Sentiment Subjectivity and Price. The data is now ready for analysis to understand how apps are doing across categories, types (Free/Paid) and price ranges by analysing rating, sentiment, sentiment polarity, number of installs, content rating, etc.

Below we present some basic analysis that could be helpful in understanding the data:

Finding the top 5 rated categories in the app store

ratings <- apps %>% group_by(Category) %>% summarise("Avg rating"=round(mean(Rating),2)) 
ratings<-ratings%>% arrange (desc(`Avg rating`))
head(ratings,5)

Finding the top 5 rated apps in the app store

apprating<-apps %>% group_by(App)  %>% summarise(rating=mean(Rating))
apprating<- apprating %>% arrange(desc(rating))
head(apprating,5)

Sentiment Polarity versus Last updated year

boxplot(Sentiment_Polarity~Year,data = apps,
        main="Sentiment spread based on last updated Year",
        xlab = "Year last updated",
        ylab = "Sentiment Polarity",
        col=(c("lightblue") ))

Sentiment Polarity versus Price

boxplot(Sentiment_Polarity~Price, data = apps,
        main="Sentiment spread based on Price",
        xlab = "Price of App",
        ylab = "Sentiment Polarity",
        col=(c("gold")))

