MATH2349 Semester 1, 2019

class: center, middle, inverse, title-slide

# MATH2349 Semester 1, 2019
## Assignment 3
### Yinan Zhang, Tanmay Nagi, Sabrina Loh
### 2019-06-07

---

class: middle
name: cc-by

## Required packages

```r
library(readr)
library(dplyr)
library(tidyr)
library(lubridate)
library(stringr)
library(knitr)
library(forecast)
library(car)
library(kableExtra)
```

---

# Executive Summary 
The aim of this assignment is to pre-process user review data for Google Play applications so that it is ready for analysis of app effectiveness, e.g. correlations between price, reviews and sentiment, and ranking of the apps within different categories.

The two data sets used, googleplaystore.csv and googleplaystore_user_reviews.csv, were imported and merged on a common variable (App). To better understand the data, the structure and the class of each variable were examined. Irrelevant variables were dropped to simplify the process. Data type conversions were then carried out on some variables, and the relevant variables were converted to factors.

The data was already in a tidy format, so no reshaping was needed. Three new columns were created with the mutate function, separating the date on which the application was last updated into day, month and year for easier comparison of the data by month or year.

---
## Executive Summary Cont..

Several missing values were identified in the data set. Rows with missing values in Rating, Sentiment Polarity and Translated Review were removed, whereas missing values in Size were imputed with the mean size of their individual category. The missing values in Price were due to the applications being free, so they were replaced with 0.

Lastly, outlier handling and transformation were performed to reduce the effect of outliers on the results. The capping method was used to handle outliers: values lying more than 1.5 × IQR beyond the quartiles were replaced with the 5th or 95th percentile. For the heavily right-skewed data, a log10 transformation was applied to the variable to reduce the skewness before capping. This resulted in a more normal distribution and eliminated most of the apparent outliers.

---

# Data

Data obtained from: https://www.kaggle.com/lava18/google-play-store-apps
Data on over 10k apps, as well as their reviews, was scraped from the Google Play Store.

Datasets used: googleplaystore.csv and googleplaystore_user_reviews.csv
App information is stored in the googleplaystore.csv while reviews information is stored in the googleplaystore_user_reviews.csv. Variable descriptions for each dataset are shown below.

Both datasets were imported and merged (left_join) by a common variable, ‘App’, for easier analysis. The left_join is appropriate as it keeps every row of googleplaystore and attaches the matching rows from googleplaystore_user_reviews, so that each review is matched to the appropriate app. Apps without any reviews are kept, with NA in the review columns.

To simplify the process, variables ‘Android Ver’ and ‘Current Ver’ were removed as version history will not be useful in the analysis.

---
#Loading the data into R.

```r
googleplaystore <- read_csv("googleplaystore.csv")
```

```
## Parsed with column specification:
## cols(
##   App = col_character(),
##   Category = col_character(),
##   Rating = col_double(),
##   Reviews = col_double(),
##   Size = col_character(),
##   Installs = col_character(),
##   Type = col_character(),
##   Price = col_character(),
##   `Content Rating` = col_character(),
##   Genres = col_character(),
##   `Last Updated` = col_character(),
##   `Current Ver` = col_character(),
##   `Android Ver` = col_character()
## )
```

```r
googleplaystore_user_reviews <- read_csv("googleplaystore_user_reviews.csv")
```

```
## Parsed with column specification:
## cols(
##   App = col_character(),
##   Translated_Review = col_character(),
##   Sentiment = col_character(),
##   Sentiment_Polarity = col_double(),
##   Sentiment_Subjectivity = col_double()
## )
```

```r
playstoredescription <- read_csv("playstoredescription.csv")
```

```
## Parsed with column specification:
## cols(
##   Variable = col_character(),
##   Description = col_character()
## )
```

```r
UserReviewsdescription <- read_csv("UserReviewsdescription.csv")
```

```
## Parsed with column specification:
## cols(
##   Variable = col_character(),
##   Description = col_character()
## )
```

---
##Variables description in googleplaystore:

<table class="table table-striped table-hover" style="font-size: 12px; width: auto !important; margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:left;"> Variable </th>
   <th style="text-align:left;"> Description </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;font-weight: bold;border-right:1px solid;"> App </td>
   <td style="text-align:left;width: 30em; "> Application name </td>
  </tr>
  <tr>
   <td style="text-align:left;font-weight: bold;border-right:1px solid;"> Category </td>
   <td style="text-align:left;width: 30em; "> Category the app belongs to </td>
  </tr>
  <tr>
   <td style="text-align:left;font-weight: bold;border-right:1px solid;"> Rating </td>
   <td style="text-align:left;width: 30em; "> Overall user rating of the app (as when scraped) </td>
  </tr>
  <tr>
   <td style="text-align:left;font-weight: bold;border-right:1px solid;"> Reviews </td>
   <td style="text-align:left;width: 30em; "> Number of user reviews for the app (as when scraped) </td>
  </tr>
  <tr>
   <td style="text-align:left;font-weight: bold;border-right:1px solid;"> Size </td>
   <td style="text-align:left;width: 30em; "> Size of the app (as when scraped) </td>
  </tr>
  <tr>
   <td style="text-align:left;font-weight: bold;border-right:1px solid;"> Installs </td>
   <td style="text-align:left;width: 30em; "> Number of user downloads/installs for the app (as when scraped) </td>
  </tr>
  <tr>
   <td style="text-align:left;font-weight: bold;border-right:1px solid;"> Type </td>
   <td style="text-align:left;width: 30em; "> Paid or Free </td>
  </tr>
  <tr>
   <td style="text-align:left;font-weight: bold;border-right:1px solid;"> Price </td>
   <td style="text-align:left;width: 30em; "> Price of the app (as when scraped) </td>
  </tr>
  <tr>
   <td style="text-align:left;font-weight: bold;border-right:1px solid;"> Content Rating </td>
   <td style="text-align:left;width: 30em; "> Age group the app is targeted at - Children / Mature 21+ / Adult </td>
  </tr>
  <tr>
   <td style="text-align:left;font-weight: bold;border-right:1px solid;"> Genres </td>
   <td style="text-align:left;width: 30em; "> An app can belong to multiple genres (apart from its main category). For eg, a musical family game will belong to Music, Game, Family genres. </td>
  </tr>
  <tr>
   <td style="text-align:left;font-weight: bold;border-right:1px solid;"> Last Updated </td>
   <td style="text-align:left;width: 30em; "> Date when the app was last updated on Play Store (as when scraped) </td>
  </tr>
  <tr>
   <td style="text-align:left;font-weight: bold;border-right:1px solid;"> Current Ver </td>
   <td style="text-align:left;width: 30em; "> Current version of the app available on Play Store (as when scraped) </td>
  </tr>
  <tr>
   <td style="text-align:left;font-weight: bold;border-right:1px solid;"> Android Ver </td>
   <td style="text-align:left;width: 30em; "> Min required Android version (as when scraped) </td>
  </tr>
</tbody>
</table>

---

##Variables description in googleplaystore_user_reviews:
<table class="table table-striped table-hover" style="font-size: 12px; width: auto !important; margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:left;"> Variable </th>
   <th style="text-align:left;"> Description </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;font-weight: bold;border-right:1px solid;"> App </td>
   <td style="text-align:left;width: 30em; "> Application name </td>
  </tr>
  <tr>
   <td style="text-align:left;font-weight: bold;border-right:1px solid;"> Translated_Review </td>
   <td style="text-align:left;width: 30em; "> User review (Preprocessed and translated to English) </td>
  </tr>
  <tr>
   <td style="text-align:left;font-weight: bold;border-right:1px solid;"> Sentiment </td>
   <td style="text-align:left;width: 30em; "> Positive/Negative/Neutral (Preprocessed) </td>
  </tr>
  <tr>
   <td style="text-align:left;font-weight: bold;border-right:1px solid;"> Sentiment_Polarity </td>
   <td style="text-align:left;width: 30em; "> Sentiment polarity score </td>
  </tr>
  <tr>
   <td style="text-align:left;font-weight: bold;border-right:1px solid;"> Sentiment_Subjectivity </td>
   <td style="text-align:left;width: 30em; "> Sentiment subjectivity score </td>
  </tr>
</tbody>
</table>

---

#Joining the Data Sets

The common variable in the two data sets is App, which stores the name of each app. To enable meaningful analysis, we join the two data sets on the app names. The new table we will be working with is called apps.

```r
apps <- googleplaystore %>% left_join(googleplaystore_user_reviews, by = "App")
```
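
To illustrate what the join does to the row count, here is a minimal sketch on two hypothetical toy tables (`store` and `reviews` are made-up names and values, not the real data). It uses base R `merge()` with `all.x = TRUE`, which plays the role of `left_join` here so that the sketch runs without extra packages. Each app row is repeated once per matching review, and an app with no reviews is kept with NA in the review column.

```r
# Hypothetical toy tables; names and values are made up for illustration
store <- data.frame(App = c("A", "B", "C"), Rating = c(4.1, 3.9, 4.5),
                    stringsAsFactors = FALSE)
reviews <- data.frame(App = c("B", "B", "B", "C"),
                      Sentiment = c("Negative", "Neutral", "Positive", "Positive"),
                      stringsAsFactors = FALSE)

# Base R stand-in for a left join: keep every row of `store`
joined <- merge(store, reviews, by = "App", all.x = TRUE)

nrow(joined)                 # 1 row for A (no reviews) + 3 for B + 1 for C
joined[joined$App == "A", ]  # Sentiment is NA for the app without reviews
```

This is why app-level values such as Rating repeat across many rows of the joined table.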

---
# Understand

Checking the structure of the data

```r
str(apps)
```

```
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame':	131971 obs. of  17 variables:
##  $ App                   : chr  "Photo Editor & Candy Camera & Grid & ScrapBook" "Coloring book moana" "Coloring book moana" "Coloring book moana" ...
##  $ Category              : chr  "ART_AND_DESIGN" "ART_AND_DESIGN" "ART_AND_DESIGN" "ART_AND_DESIGN" ...
##  $ Rating                : num  4.1 3.9 3.9 3.9 3.9 3.9 3.9 3.9 3.9 3.9 ...
##  $ Reviews               : num  159 967 967 967 967 967 967 967 967 967 ...
##  $ Size                  : chr  "19M" "14M" "14M" "14M" ...
##  $ Installs              : chr  "10,000+" "500,000+" "500,000+" "500,000+" ...
##  $ Type                  : chr  "Free" "Free" "Free" "Free" ...
##  $ Price                 : chr  "0" "0" "0" "0" ...
##  $ Content Rating        : chr  "Everyone" "Everyone" "Everyone" "Everyone" ...
##  $ Genres                : chr  "Art & Design" "Art & Design;Pretend Play" "Art & Design;Pretend Play" "Art & Design;Pretend Play" ...
##  $ Last Updated          : chr  "January 7, 2018" "January 15, 2018" "January 15, 2018" "January 15, 2018" ...
##  $ Current Ver           : chr  "1.0.0" "2.0.0" "2.0.0" "2.0.0" ...
##  $ Android Ver           : chr  "4.0.3 and up" "4.0.3 and up" "4.0.3 and up" "4.0.3 and up" ...
##  $ Translated_Review     : chr  NA "A kid's excessive ads. The types ads allowed app, let alone kids" "It bad >:(" "like" ...
##  $ Sentiment             : chr  NA "Negative" "Negative" "Neutral" ...
##  $ Sentiment_Polarity    : num  NA -0.25 -0.725 0 NaN 0.5 -0.8 NaN 0 0.5 ...
##  $ Sentiment_Subjectivity: num  NA 1 0.833 0 NaN ...
```

---
##Removing unused variables

Version history is not useful for our analysis, so we remove the following variables:

* Android Ver  
* Current Ver

```r
apps<-apps %>% select(-c(`Android Ver`,`Current Ver`))
colnames(apps)
```

```
##  [1] "App"                    "Category"              
##  [3] "Rating"                 "Reviews"               
##  [5] "Size"                   "Installs"              
##  [7] "Type"                   "Price"                 
##  [9] "Content Rating"         "Genres"                
## [11] "Last Updated"           "Translated_Review"     
## [13] "Sentiment"              "Sentiment_Polarity"    
## [15] "Sentiment_Subjectivity"
```

---

##Converting the variable into ordered and unordered factors

* Installs  
* Type    
* Content Rating   
* Sentiment   
* Category

---
##Converting the variable into ordered and unordered factors code...

```r
apps <- apps %>% mutate(
  Installs = factor(apps$Installs, 
                    levels = c( "0","0+","1+","5+","10+","50+","100+","500+","1,000+",
                                "5,000+","10,000+",  "50,000+", "100,000+",  "500,000+",
                                "1,000,000+","5,000,000+" ,  "10,000,000+" ,  "50,000,000+", "100,000,000+",
                                "500,000,000+","1,000,000,000+") ,
                    labels = c( "0","0+","1+","5+","10+","50+","100+","500+","1,000+",
                                "5,000+","10,000+",  "50,000+", "100,000+",  "500,000+",
                                "1,000,000+","5,000,000+" ,  "10,000,000+" ,  "50,000,000+",
                                "100,000,000+", "500,000,000+","1,000,000,000+"),
                    ordered=T),
  Type = factor(apps$Type, 
                levels = c("Free", "Paid"),
                labels = c("Free", "Paid")),
  `Content Rating`= factor(apps$`Content Rating`,
                           levels = c("Everyone", "Everyone 10+",  "Teen", "Mature 17+", "Adults only 18+"),
                           labels = c("Everyone", "Everyone 10+",  "Teen", "Mature 17+", "Adults only 18+"),
                           ordered = T ),
  Sentiment = factor(apps$Sentiment, 
                      levels = c("Negative", "Neutral", "Positive"),
                      labels = c("Negative", "Neutral", "Positive"),
                      ordered = T),
  Category = factor(apps$Category)
  )
str(apps[,c("Installs", "Type", "Content Rating", "Sentiment", "Category")]) 
```

```
## Classes 'tbl_df', 'tbl' and 'data.frame':	131971 obs. of  5 variables:
##  $ Installs      : Ord.factor w/ 21 levels "0"<"0+"<"1+"<..: 11 14 14 14 14 14 14 14 14 14 ...
##  $ Type          : Factor w/ 2 levels "Free","Paid": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Content Rating: Ord.factor w/ 5 levels "Everyone"<"Everyone 10+"<..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Sentiment     : Ord.factor w/ 3 levels "Negative"<"Neutral"<..: NA 1 1 2 NA 3 1 NA 2 3 ...
##  $ Category      : Factor w/ 34 levels "1.9","ART_AND_DESIGN",..: 2 2 2 2 2 2 2 2 2 2 ...
```
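
As a brief sketch of why the ordering matters (the vector below is hypothetical), an ordered factor supports comparisons between levels, and any value that is not listed in `levels` silently becomes NA:

```r
# Hypothetical sample of install brackets, including one unexpected value
installs <- c("10+", "1,000+", "100+", "Free")
installs_f <- factor(installs, levels = c("10+", "100+", "1,000+"), ordered = TRUE)

installs_f[1] < installs_f[2]   # TRUE: "10+" comes before "1,000+" in the level order
max(installs_f, na.rm = TRUE)   # highest bracket present: 1,000+
sum(is.na(installs_f))          # 1: "Free" is not a declared level, so it becomes NA
```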

---
##Date conversion

We convert 'Last Updated' to date format in a new column, Updated, and then drop the original 'Last Updated' column. Writing to a new column avoids errors if the code is run twice, since parsing an already converted column again would turn the dates into NA. This protects the integrity of the data.

```r
apps <- apps %>% mutate(Updated = mdy(apps$`Last Updated`))
apps <- apps %>% select(-`Last Updated`)

str(apps$Updated)
```

```
##  Date[1:131971], format: "2018-01-07" "2018-01-15" "2018-01-15" "2018-01-15" "2018-01-15" ...
```

---
##Changing Price from character to numeric

```r
apps$Price <- substr(apps$Price,2,nchar(apps$Price)) %>% as.numeric()

str(apps$Price)
```

```
##  num [1:131971] NA NA NA NA NA NA NA NA NA NA ...
```

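We note that the price of every free app becomes NA in this conversion: substr() drops the first character (the currency symbol of paid prices), which turns "0" into an empty string, and as.numeric() converts an empty string to NA. We will impute the 0s back in the Scan section.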
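
The effect can be reproduced on a hypothetical vector of price strings. The second half of the sketch shows one possible alternative, assuming paid prices carry a leading "$": removing only that symbol keeps "0" intact.

```r
price_raw <- c("0", "$4.99", "$0.99")   # hypothetical values in the raw format

# Dropping the first character turns "0" into "", which as.numeric() maps to NA
dropped <- as.numeric(substr(price_raw, 2, nchar(price_raw)))
dropped    # NA 4.99 0.99

# Alternative sketch (not the approach used here): strip only a leading "$"
stripped <- as.numeric(sub("^\\$", "", price_raw))
stripped   # 0.00 4.99 0.99
```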

---
## Changing application size variable to numeric

Application sizes are recorded in megabytes, in kilobytes, or as ‘Varies with device’. We want to convert this variable to numeric for better analysis, so a common unit of measurement is needed. We decided to use megabytes.

We first extract the numeric part of the string, then extract the unit letter, ‘M’ or ‘k’. Anything else is treated as the ‘Varies with device’ value, which we allow to be NA as we do not have enough information.   
Finally, we put it all together. If the size is in kilobytes, we multiply by 0.001 to express it in megabytes. NA is recorded for the ‘Varies with device’ value and is imputed in a later section with the mean size of the respective category.

---
## Changing application size variable to numeric code..

```r
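# First letter in Size: "M" (megabytes), "k" (kilobytes) or "V" ("Varies with device")
unit_size <- str_extract(apps$Size, "[A-Za-z]")
# Numeric part: everything except the last character (NA for "Varies with device")
value_size <- substr(apps$Size, start = 1, stop = (nchar(apps$Size) - 1)) %>% as.numeric()

# Megabytes stay as they are, kilobytes are scaled to megabytes, anything else becomes NA
conversion <- function(x, y) {ifelse(x == "M", y, ifelse(x == "k", y * 0.001, NA))}
size <- conversion(unit_size, value_size)
apps <- apps %>% mutate(Size = size)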

class(apps$Size)
```

```
## [1] "numeric"
```

```r
summary(apps$Size)
```

```
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.01   10.00   24.00   33.33   52.00  100.00   48407
```
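
The same conversion can be checked on a few hypothetical size strings. This sketch uses base R regular expressions rather than the stringr call above, with the character class `[A-Za-z]`; a range such as `A-z` would also span the punctuation characters that sit between `Z` and `a` in ASCII.

```r
size_raw <- c("19M", "201k", "Varies with device")   # hypothetical examples

# First letter of each string: the unit, or "V" for "Varies with device"
unit <- sub("^[^A-Za-z]*([A-Za-z]).*$", "\\1", size_raw)
# Everything except the last character, as a number (NA when it is not numeric)
value <- suppressWarnings(as.numeric(substr(size_raw, 1, nchar(size_raw) - 1)))

size_mb <- ifelse(unit == "M", value,
                  ifelse(unit == "k", value * 0.001, NA))
size_mb   # 19, 0.201 and NA
```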

---
##Final Check on data structure that all variables are in the right class.


```r
str(apps)
```

```
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame':	131971 obs. of  15 variables:
##  $ App                   : chr  "Photo Editor & Candy Camera & Grid & ScrapBook" "Coloring book moana" "Coloring book moana" "Coloring book moana" ...
##  $ Category              : Factor w/ 34 levels "1.9","ART_AND_DESIGN",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Rating                : num  4.1 3.9 3.9 3.9 3.9 3.9 3.9 3.9 3.9 3.9 ...
##  $ Reviews               : num  159 967 967 967 967 967 967 967 967 967 ...
##  $ Size                  : num  19 14 14 14 14 14 14 14 14 14 ...
##  $ Installs              : Ord.factor w/ 21 levels "0"<"0+"<"1+"<..: 11 14 14 14 14 14 14 14 14 14 ...
##  $ Type                  : Factor w/ 2 levels "Free","Paid": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Price                 : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Content Rating        : Ord.factor w/ 5 levels "Everyone"<"Everyone 10+"<..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Genres                : chr  "Art & Design" "Art & Design;Pretend Play" "Art & Design;Pretend Play" "Art & Design;Pretend Play" ...
##  $ Translated_Review     : chr  NA "A kid's excessive ads. The types ads allowed app, let alone kids" "It bad >:(" "like" ...
##  $ Sentiment             : Ord.factor w/ 3 levels "Negative"<"Neutral"<..: NA 1 1 2 NA 3 1 NA 2 3 ...
##  $ Sentiment_Polarity    : num  NA -0.25 -0.725 0 NaN 0.5 -0.8 NaN 0 0.5 ...
##  $ Sentiment_Subjectivity: num  NA 1 0.833 0 NaN ...
##  $ Updated               : Date, format: "2018-01-07" "2018-01-15" ...
```
---
##Final Check on data structure cont...

* App is the name of each app, so it is fine to keep it as character data.  
* Genres is fine to keep as character data.  
* Sentiment_Polarity and Sentiment_Subjectivity are our analysis variables and should be numeric.   
* All other variables have been changed into the correct class.

---
#	Tidy & Manipulate Data I

The data is already in a tidy format since:  
1. Each variable has its own column: each column relates to one attribute of the app or review.    
2. Each observation has its own row: each row relates to one app and one individual review.  
3. Each value has its own cell.

```r
head(apps, 6)
```

```
## # A tibble: 6 x 15
##   App   Category Rating Reviews  Size Installs Type  Price `Content Rating`
##   <chr> <fct>     <dbl>   <dbl> <dbl> <ord>    <fct> <dbl> <ord>           
## 1 Phot~ ART_AND~    4.1     159    19 10,000+  Free     NA Everyone        
## 2 Colo~ ART_AND~    3.9     967    14 500,000+ Free     NA Everyone        
## 3 Colo~ ART_AND~    3.9     967    14 500,000+ Free     NA Everyone        
## 4 Colo~ ART_AND~    3.9     967    14 500,000+ Free     NA Everyone        
## 5 Colo~ ART_AND~    3.9     967    14 500,000+ Free     NA Everyone        
## 6 Colo~ ART_AND~    3.9     967    14 500,000+ Free     NA Everyone        
## # ... with 6 more variables: Genres <chr>, Translated_Review <chr>,
## #   Sentiment <ord>, Sentiment_Polarity <dbl>,
## #   Sentiment_Subjectivity <dbl>, Updated <date>
```

---
#	Tidy & Manipulate Data II

If we want to analyse whether the time an app was last updated has an effect on the sentiment of its reviews, it is useful to have the year, month and day of the last update in separate columns.

Creating the Year, Month and Day columns from the Updated values.

```r
apps <- apps %>% mutate(Day = day(apps$Updated), 
                  Month = month(apps$Updated), 
                  Year = year(apps$Updated))
str(apps[,c("Day", "Month", "Year")])
```

```
## Classes 'tbl_df', 'tbl' and 'data.frame':	131971 obs. of  3 variables:
##  $ Day  : int  7 15 15 15 15 15 15 15 15 15 ...
##  $ Month: num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Year : num  2018 2018 2018 2018 2018 ...
```

---
class: middle, center, inverse

#Scan I

---
## Checking for NA, Infinite and NaN values.

```r
n <- colSums(is.na(apps)) %>% as.data.frame()
names(n) <- "NA"
i <- sapply(apps, is.infinite) %>% as.data.frame() %>% 
colSums()
nan <- sapply(apps, is.nan) %>% as.data.frame() %>% 
colSums()
x <- n %>% mutate(Infinite = i, Nan = nan) 
row.names(x) <- colnames(apps)
x
```

```
##                            NA Infinite   Nan
## App                         0        0     0
## Category                    0        0     0
## Rating                   1513        0  1513
## Reviews                     1        0     0
## Size                    48407        0     0
## Installs                    1        0     0
## Type                        2        0     0
## Price                  129645        0     0
## Content Rating              3        0     0
## Genres                      0        0     0
## Translated_Review        9319        0     0
## Sentiment               59356        0     0
## Sentiment_Polarity      59356        0 50047
## Sentiment_Subjectivity  59356        0 50047
## Updated                     1        0     0
## Day                         1        0     0
## Month                       1        0     0
## Year                        1        0     0
```
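
The NA and NaN columns overlap because `is.na()` is also TRUE for NaN, while `is.nan()` is FALSE for a plain NA. A minimal sketch on a hypothetical vector:

```r
v <- c(0.5, NA, NaN, -0.25)   # hypothetical polarity scores

is.na(v)    # FALSE  TRUE  TRUE FALSE  (NaN counts as missing)
is.nan(v)   # FALSE FALSE  TRUE FALSE  (NA is not NaN)

sum(is.na(v)) - sum(is.nan(v))   # number of plain NA values: 1
```

So in the table above, the NA column already includes every value counted in the NaN column.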

---
##Checking why reviews and installs have only one NA

```r
apps[which(is.na(apps$Reviews)),]
```

```
## # A tibble: 1 x 18
##   App   Category Rating Reviews  Size Installs Type  Price `Content Rating`
##   <chr> <fct>     <dbl>   <dbl> <dbl> <ord>    <fct> <dbl> <ord>           
## 1 Life~ 1.9          19      NA    NA <NA>     <NA>     NA <NA>            
## # ... with 9 more variables: Genres <chr>, Translated_Review <chr>,
## #   Sentiment <ord>, Sentiment_Polarity <dbl>,
## #   Sentiment_Subjectivity <dbl>, Updated <date>, Day <int>, Month <dbl>,
## #   Year <dbl>
```
The values appear to be in the wrong columns, most likely because of an error in the scraping. Since we have a large data sample, we deal with this row by deleting it.

```r
apps <- apps[-which(is.na(apps$Reviews)),]
sum(is.na(apps$Reviews))
```

```
## [1] 0
```
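
One caveat with the `-which()` idiom, sketched on a hypothetical data frame: when no row matches, `which()` returns an empty vector and the negative index then drops every row. Logical indexing with `!is.na()` gives the same result when there are matches and stays safe when there are none.

```r
df <- data.frame(id = 1:3, Reviews = c(10, 20, 30))   # hypothetical, no NA present

nrow(df[-which(is.na(df$Reviews)), ])   # 0: all rows are lost
nrow(df[!is.na(df$Reviews), ])          # 3: nothing is removed
```

This matters mainly if a removal step is run a second time on data that has already been cleaned.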

---
### Missing values in ratings

For missing values in Rating, we remove all rows that contain them. Since our analysis concerns ratings and sentiments, apps that do not have a rating are not useful to include.

```r
apps <- apps[-which(is.na(apps$Rating)),]
sum(is.na(apps$Rating))
```

```
## [1] 0
```

---
### Missing values in Price

When Price was converted to numeric, the value 0 was changed to NA. As 0 is still a valid price and carries information, we impute the NAs in Price with 0.

```r
apps$Price[which(is.na(apps$Price))] <- 0
sum(is.na(apps$Price))
```

```
## [1] 0
```

---
### Missing values in Size

For the Size variable, there was a value called "Varies with device". When the variable was converted to numeric format, this value became NA.

We impute these NAs with the mean size of the apps in the same category.

```r
apps <- apps %>% 
  group_by(Category) %>% 
  mutate(Size = ifelse(is.na(Size), 
                           mean(Size,na.rm = T),
                           Size)) %>% ungroup()
sum(is.na(apps$Size))
```

```
## [1] 0
```
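
A minimal sketch of the same group-wise mean imputation on a hypothetical data frame, written with base R `ave()` instead of the dplyr pipeline above so that it runs without extra packages:

```r
toy <- data.frame(
  Category = c("GAME", "GAME", "GAME", "TOOLS", "TOOLS"),
  Size     = c(10, 30, NA, 4, NA)
)

# Replace each NA with the mean of the non-missing sizes in the same category
toy$Size <- ave(toy$Size, toy$Category,
                FUN = function(s) ifelse(is.na(s), mean(s, na.rm = TRUE), s))
toy$Size   # 10 30 20 4 4
```

If a category had no observed sizes at all, its mean would be NaN and the gap would remain, so checking the imputed column for missing values afterwards is worthwhile.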

---
### Missing values in Translated Review

Translated_Review holds the text of each review. An app user may or may not leave a written review after giving a rating. If no rating is given, then it is recorded as NaN. Some unrecorded reviews are recorded as NA here. Since we are going to analyse the sentiments, we only look at reviews that were actually left. Therefore we deal with missing values in Translated_Review by removing those rows.

```r
apps <- apps[-which(is.na(apps$Translated_Review)),]
sum(is.na(apps$Translated_Review))
```

```
## [1] 0
```

---

### Missing values in Sentiment Polarity

Sentiment Polarity is one of the variables for analysis, so the data set should have no missing values here. We therefore remove rows with missing values in Sentiment Polarity. Also note that is.na here also covers the NaNs, which were created for apps that had a review but no text. We exclude these from our analysis.

```r
apps <- apps[-which(is.na(apps$Sentiment_Polarity)),]
sum(is.na(apps$Sentiment_Polarity))
```

```
## [1] 0
```

---
## Final missing value check:

###Checking for NA, Inf and Nan.

```r
n <- colSums(is.na(apps)) %>% as.data.frame()
names(n) <- "NA"

i <- sapply(apps, is.infinite) %>% as.data.frame() %>% 
colSums()

nan <- sapply(apps, is.nan) %>% as.data.frame() %>% 
colSums()

x <- n %>% mutate(Infinite = i, Nan = nan) 
row.names(x) <- colnames(apps)
x
```

```
##                        NA Infinite Nan
## App                     0        0   0
## Category                0        0   0
## Rating                  0        0   0
## Reviews                 0        0   0
## Size                    0        0   0
## Installs                0        0   0
## Type                    0        0   0
## Price                   0        0   0
## Content Rating          0        0   0
## Genres                  0        0   0
## Translated_Review       0        0   0
## Sentiment               0        0   0
## Sentiment_Polarity      0        0   0
## Sentiment_Subjectivity  0        0   0
## Updated                 0        0   0
## Day                     0        0   0
## Month                   0        0   0
## Year                    0        0   0
```

---
##	Scan II

###Identify numeric data

```r
check_numeric <-sapply(apps, is.numeric) %>% as.data.frame()
names(check_numeric) <-"Numeric"
check<-check_numeric %>% mutate(Variable=colnames(apps),Numeric=Numeric) 
check<-check%>% filter(Numeric==T) %>% select(Variable,Numeric)
check
```

```
##                 Variable Numeric
## 1                 Rating    TRUE
## 2                Reviews    TRUE
## 3                   Size    TRUE
## 4                  Price    TRUE
## 5     Sentiment_Polarity    TRUE
## 6 Sentiment_Subjectivity    TRUE
## 7                    Day    TRUE
## 8                  Month    TRUE
## 9                   Year    TRUE
```
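
A more compact way to obtain the same list of names, shown on a hypothetical data frame:

```r
# Hypothetical toy table with one character and two numeric columns
toy <- data.frame(App = c("A", "B"), Rating = c(4.1, 3.9), Reviews = c(159, 967),
                  stringsAsFactors = FALSE)

# Keep the names of the columns for which is.numeric() is TRUE
numeric_cols <- names(toy)[sapply(toy, is.numeric)]
numeric_cols   # "Rating" "Reviews"
```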

---
###Identify numeric data cont..

We can see numeric data are  
* Rating  
* Reviews  
* Size  
* Price  
* Sentiment_Polarity  
* Sentiment_Subjectivity

We will ignore the variables Day, Month and Year as these have been created by us.

---
###Checking for outliers in the numeric data:

```r
par(mfrow=c(1,3))
boxplot(apps$Rating, main="Rating")
boxplot(apps$Reviews, main="Reviews")
boxplot(apps$Size, main="Size")
```

![](MATH2349_Assignment_3_-_YZ_TN_SL_-_v3_files/figure-html/unnamed-chunk-25-1.png)

---

```r
par(mfrow=c(1,3))
boxplot(apps$Sentiment_Polarity, main= "Sentiment Polarity")
boxplot(apps$Sentiment_Subjectivity, main = "Sentiment Subjectivity")
```

![](MATH2349_Assignment_3_-_YZ_TN_SL_-_v3_files/figure-html/unnamed-chunk-26-1.png)

---
### Outliers in Price

For Price, we group the data by the type of app (Free/Paid) before checking for outliers, as the data would otherwise be heavily skewed by the large number of free apps.

```r
Boxplot(apps$Price~apps$Type, main = "Price grouped by type of app")
```

```
##  [1] "49836" "49837" "49838" "49839" "49840" "49841" "49842" "49843"
##  [9] "49844" "49845"
```

---

### Outliers in Reviews

Reviews looks to be severely right skewed. It makes more sense to transform the data before capping any outliers, so that we do not lose too much information.

```r
hist(apps$Reviews, main="Reviews")
```

![](MATH2349_Assignment_3_-_YZ_TN_SL_-_v3_files/figure-html/unnamed-chunk-28-1.png)

---

## Capping Outliers

Capping the Outliers for:  
* Rating  
* Size  
* Sentiment_Polarity
* Sentiment_Subjectivity

We cap values that lie more than 1.5 × IQR beyond the quartiles at the 5th and 95th percentiles. It makes sense for these variables to keep the influence of their extreme values; the effect just should not be excessive.

```r
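cap <- function(x){
  # 5th, 25th, 75th and 95th percentiles
  quantiles <- quantile(x, probs = c(0.05, 0.25, 0.75, 0.95), na.rm = TRUE)
  # Values below the lower fence (Q1 - 1.5*IQR) are raised to the 5th percentile
  x[ x < quantiles[2] - 1.5*IQR(x, na.rm = TRUE) ] <- quantiles[1]
  # Values above the upper fence (Q3 + 1.5*IQR) are lowered to the 95th percentile
  x[ x > quantiles[3] + 1.5*IQR(x, na.rm = TRUE) ] <- quantiles[4]
  x
}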
apps[,c("Rating", "Size","Sentiment_Polarity", "Sentiment_Subjectivity")] <- sapply(apps[,c("Rating", "Size","Sentiment_Polarity", "Sentiment_Subjectivity")], cap) %>% 
  as.data.frame()
```
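
To see what the function does, here is a sketch on a hypothetical vector with one extreme value. The definition is repeated so that the example is self-contained.

```r
cap <- function(x){
  quantiles <- quantile(x, probs = c(0.05, 0.25, 0.75, 0.95), na.rm = TRUE)
  x[ x < quantiles[2] - 1.5*IQR(x, na.rm = TRUE) ] <- quantiles[1]
  x[ x > quantiles[3] + 1.5*IQR(x, na.rm = TRUE) ] <- quantiles[4]
  x
}

x <- c(1:19, 100)   # hypothetical data: 100 lies far above the upper fence
capped <- cap(x)

max(capped) == quantile(x, 0.95)   # the extreme value now sits at the 95th percentile
all(capped[1:19] == x[1:19])       # values inside the fences are unchanged
```

Because replaced values are set to the 5th or 95th percentile rather than to the fence itself, a variable whose 95th percentile lies beyond the fence can still show outliers in a boxplot after capping.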

---

###Capping Outliers cont...

Capping the Outliers for Price grouped by Type (Paid apps get capped among paid apps only)

```r
apps <- apps %>% 
  group_by(Type) %>% 
  mutate(Price = cap(Price)) %>% ungroup()
```

---

## Checking Outliers are capped:

The plots show that outliers remain in Sentiment Subjectivity even after capping to the nearest quantile.

```r
par(mfrow=c(2,2), pty = "s" )
Boxplot(apps$Rating, main="Rating")
Boxplot(apps$Size, main="Size")
Boxplot(apps$Sentiment_Polarity, main= "Sentiment Polarity")
Boxplot(apps$Sentiment_Subjectivity, main = "Sentiment Subjectivity")
```

![](MATH2349_Assignment_3_-_YZ_TN_SL_-_v3_files/figure-html/unnamed-chunk-31-1.png)

```
##  [1]   3   6  26  29  46  72  80  81  84 102
```

---

###Checking outliers for Price

```r
Boxplot(apps$Price~apps$Type, main = "Price grouped by type of app")
```

![](MATH2349_Assignment_3_-_YZ_TN_SL_-_v3_files/figure-html/unnamed-chunk-32-1.png)

```
##  [1] "49836" "49837" "49838" "49839" "49840" "49841" "49842" "49843"
##  [9] "49844" "49845"
```
Because of the number of highly priced apps, outliers remain even after capping to the nearest quantile.

---

#	Transform

We see from the previous section that number of reviews is heavily skewed.

```r
hist(apps$Reviews, main="Reviews")
```

![](MATH2349_Assignment_3_-_YZ_TN_SL_-_v3_files/figure-html/unnamed-chunk-33-1.png)

---
###Applying transformation.
Because Reviews is heavily right skewed, we apply a log10 transformation to bring it closer to a normal distribution.

```r
apps <- apps %>% mutate(Reviews_t = log10(apps$Reviews))
```
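
One point to keep in mind with this transformation, sketched on hypothetical review counts: `log10(0)` is `-Inf`, so a count of zero would need separate handling. Using `log10(x + 1)` is one common alternative; it is shown only for comparison and is not what is done here.

```r
reviews <- c(0, 9, 99, 999, 1000000)   # hypothetical counts across several orders of magnitude

log10(reviews)       # the first element is -Inf
log10(reviews + 1)   # all values are finite; the first four are 0, 1, 2 and 3
```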

---
###Checking the distribution of the transformed variable (reviews)

```r
hist(apps$Reviews_t, main="Log10(Reviews)")
```

![](MATH2349_Assignment_3_-_YZ_TN_SL_-_v3_files/figure-html/unnamed-chunk-35-1.png)

---

```r
Boxplot(apps$Reviews_t, main="log10(Reviews)")
```

![](MATH2349_Assignment_3_-_YZ_TN_SL_-_v3_files/figure-html/unnamed-chunk-36-1.png)
We also note that after the transformation, the boxplot no longer flags any values as outliers.

---

#Conclusions

Finally, we arrange the columns and check the structure of the final data.

```r
apps<-apps %>% select(App,Category,Genres,Size,Updated,Type,Price,`Content Rating`,Reviews,Rating,Installs,
                        Translated_Review ,Sentiment,Sentiment_Polarity,  Sentiment_Subjectivity,Reviews_t)
str(apps)
```

```
## Classes 'tbl_df', 'tbl' and 'data.frame':	72566 obs. of  16 variables:
##  $ App                   : chr  "Coloring book moana" "Coloring book moana" "Coloring book moana" "Coloring book moana" ...
##  $ Category              : Factor w/ 34 levels "1.9","ART_AND_DESIGN",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Genres                : chr  "Art & Design;Pretend Play" "Art & Design;Pretend Play" "Art & Design;Pretend Play" "Art & Design;Pretend Play" ...
##  $ Size                  : num  14 14 14 14 14 14 14 14 14 14 ...
##  $ Updated               : Date, format: "2018-01-15" "2018-01-15" ...
##  $ Type                  : Factor w/ 2 levels "Free","Paid": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Price                 : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Content Rating        : Ord.factor w/ 5 levels "Everyone"<"Everyone 10+"<..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Reviews               : num  967 967 967 967 967 967 967 967 967 967 ...
##  $ Rating                : num  3.9 3.9 3.9 3.9 3.9 3.9 3.9 3.9 3.9 3.9 ...
##  $ Installs              : Ord.factor w/ 21 levels "0"<"0+"<"1+"<..: 14 14 14 14 14 14 14 14 14 14 ...
##  $ Translated_Review     : chr  "A kid's excessive ads. The types ads allowed app, let alone kids" "It bad >:(" "like" "I love colors inspyering" ...
##  $ Sentiment             : Ord.factor w/ 3 levels "Negative"<"Neutral"<..: 1 1 2 3 1 2 3 3 3 3 ...
##  $ Sentiment_Polarity    : num  -0.25 -0.371 0 0.5 -0.371 ...
##  $ Sentiment_Subjectivity: num  1 0.833 0 0.6 0.9 ...
##  $ Reviews_t             : num  2.99 2.99 2.99 2.99 2.99 ...
```