R provides a wide range of functions for different preprocessing tasks. Here we load the following packages:
1) readr imports files into R with the read_csv() function.
2) Hmisc replaces NA values in a column using the impute() function.
3) dplyr provides mutate(), summarise(), inner_join() and related functions.
4) outliers provides helpers for detecting outliers; the cap() function used later to cap outliers is defined in this report.
5) tidyr provides separate() and unite() for tidying the data.
6) knitr is used to convert the R Markdown file to HTML.
7) lubridate provides date helpers such as is.Date().
8) magrittr provides the pipe operator %>%.
library(readr)
library(Hmisc)
library(dplyr)
library(outliers)
library(tidyr)
library(knitr)
library(lubridate)
library(magrittr)
The purpose of this assignment is to showcase the knowledge gained in the Data Preprocessing course. Data preprocessing makes the data ready for further statistical analysis. The steps performed are:
Step 1: Importing data from an open source.
Step 2: Understanding the structure of the data.
Step 3: Tidying and cleaning the data.
Step 4: Scanning for possible NA values, special values or inconsistencies.
Step 5: Scanning for possible outliers.
Step 6: Transforming data from heavily skewed to approximately normal.
We imported the data from the Kaggle website; the dataset is available at https://www.kaggle.com/lava18/google-play-store-apps#googleplaystore.csv. The two datasets, Google_playstore_data and Google_playstore_user_reviews, are merged with inner_join() by App, which keeps only the apps that appear in both datasets. After the join we ended up with 122662 observations and 16 variables.
The dataset (data_combined1) is large and contains a variety of data types. It has 122662 observations and 16 variables:
App = Application name
Category = Category of the application, e.g. sports, game, lifestyle, personalization
Rating = Rating given to the application
Reviews = Reviews provided for the application
Size = Size of the application
Installs = Number of installations
Type = Type of the application, e.g. free, paid
Price = Price of the application
Content Rating = Rating based on age group, e.g. teen, everyone
Last Updated = Last update date of the application
Current Ver = Current version of the application
Android Ver = Android version required to run the application
Translated_Review = User review (preprocessed and translated to English)
Sentiment = Positive/Negative/Neutral (preprocessed)
Sentiment_Polarity = Sentiment polarity score
Sentiment_Subjectivity = Sentiment subjectivity score
Google_playstore_data <- read_csv("D:/Data Prepocessing/Assignment3/googleplaystore.csv")
Google_playstore_user_reviews <- read_csv("D:/Data Prepocessing/Assignment3/googleplaystore_user_reviews.csv")
data_combined1<-inner_join(Google_playstore_data,Google_playstore_user_reviews,by="App")
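To see what an inner join by App does to the row count, here is a minimal sketch on two made-up tables. The names `apps` and `reviews` and all their values are hypothetical, not taken from the dataset. It uses base R's merge(), which performs an inner join by default, so it runs without extra packages; inner_join() should return the same set of rows, though possibly in a different order.

```r
# Hypothetical toy tables standing in for the two Kaggle files
apps <- data.frame(
  App = c("A", "B", "C"),
  Rating = c(4.1, 3.9, 4.7),
  stringsAsFactors = FALSE
)
reviews <- data.frame(
  App = c("A", "A", "B", "D"),
  Sentiment = c("Positive", "Negative", "Neutral", "Positive"),
  stringsAsFactors = FALSE
)

# merge() keeps only the keys present in both tables (an inner join)
joined <- merge(apps, reviews, by = "App")

# App "A" has two reviews, so its app-level columns appear twice;
# "C" (no reviews) and "D" (no app record) are dropped
print(joined)
print(dim(joined))
```

Each app row is repeated once per matching review, which is why app-level fields such as Rating repeat across consecutive rows of the combined data.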
The type of each variable is determined with the sapply() function, and the structure of the dataset with the str() function. The dataset (data_combined1) contains character, integer and numeric variables. To satisfy requirement 3, we convert character variables to other types. ‘Content Rating’ is converted to an ordered, labelled factor. ‘Last Updated’ is converted from character to date: instead of calling as.Date() directly, we separate it into “Date” (the day), “Month” and “Year”, relabel the month names as numbers, recombine the parts into one variable with unite(), and convert the result to the Date type. The resulting dataset (data_combined3) contains all the data types specified in the requirements.
sapply(data_combined1, typeof)
## App Category Rating
## "character" "character" "double"
## Reviews Size Installs
## "integer" "character" "character"
## Type Price Content Rating
## "character" "character" "character"
## Last Updated Current Ver Android Ver
## "character" "character" "character"
## Translated_Review Sentiment Sentiment_Polarity
## "character" "character" "double"
## Sentiment_Subjectivity
## "double"
str(data_combined1)
## Classes 'tbl_df', 'tbl' and 'data.frame': 122662 obs. of 16 variables:
## $ App : chr "Coloring book moana" "Coloring book moana" "Coloring book moana" "Coloring book moana" ...
## $ Category : chr "ART_AND_DESIGN" "ART_AND_DESIGN" "ART_AND_DESIGN" "ART_AND_DESIGN" ...
## $ Rating : num 3.9 3.9 3.9 3.9 3.9 3.9 3.9 3.9 3.9 3.9 ...
## $ Reviews : int 967 967 967 967 967 967 967 967 967 967 ...
## $ Size : chr "14M" "14M" "14M" "14M" ...
## $ Installs : chr "500,000+" "500,000+" "500,000+" "500,000+" ...
## $ Type : chr "Free" "Free" "Free" "Free" ...
## $ Price : chr "0" "0" "0" "0" ...
## $ Content Rating : chr "Everyone" "Everyone" "Everyone" "Everyone" ...
## $ Last Updated : chr "15-Jan-18" "15-Jan-18" "15-Jan-18" "15-Jan-18" ...
## $ Current Ver : chr "2.0.0" "2.0.0" "2.0.0" "2.0.0" ...
## $ Android Ver : chr "4.0.3 and up" "4.0.3 and up" "4.0.3 and up" "4.0.3 and up" ...
## $ Translated_Review : chr "A kid's excessive ads. The types ads allowed app, let alone kids" "It bad >:(" "like" "nan" ...
## $ Sentiment : chr "Negative" "Negative" "Neutral" "nan" ...
## $ Sentiment_Polarity : num -0.25 -0.725 0 NaN 0.5 -0.8 NaN 0 0.5 0.5 ...
## $ Sentiment_Subjectivity: num 1 0.833 0 NaN 0.6 ...
class(data_combined1)
## [1] "tbl_df" "tbl" "data.frame"
dim(data_combined1)
## [1] 122662 16
data_combined1$`Content Rating` <- data_combined1$`Content Rating` %>%
  factor(levels = c("Teen", "Mature 17+", "Everyone 10+", "Everyone"),
         labels = c("TEENS", "MATURE 17+", "EVERYONE 10+", "EVERYONE"),
         ordered = TRUE)
data_combined2 <- data_combined1 %>% separate(`Last Updated`, into = c("Date", "Month", "Year"), sep = "-")
data_combined2$Month <- data_combined2$Month %>% factor(levels = c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"), labels = c(1:12), ordered = TRUE)
data_combined3 <- data_combined2 %>% unite('Last Updated', Date, Month, Year, sep = "/")
# The united string is day/month/two-digit year (e.g. "15/1/18"), so the format is given
# explicitly; without it as.Date() reads the leading 15 as the year
data_combined3$`Last Updated` <- as.Date(data_combined3$`Last Updated`, format = "%d/%m/%y")
is.Date(data_combined3$`Last Updated`)
## [1] TRUE
str(data_combined3)
## Classes 'tbl_df', 'tbl' and 'data.frame': 122662 obs. of 16 variables:
## $ App : chr "Coloring book moana" "Coloring book moana" "Coloring book moana" "Coloring book moana" ...
## $ Category : chr "ART_AND_DESIGN" "ART_AND_DESIGN" "ART_AND_DESIGN" "ART_AND_DESIGN" ...
## $ Rating : num 3.9 3.9 3.9 3.9 3.9 3.9 3.9 3.9 3.9 3.9 ...
## $ Reviews : int 967 967 967 967 967 967 967 967 967 967 ...
## $ Size : chr "14M" "14M" "14M" "14M" ...
## $ Installs : chr "500,000+" "500,000+" "500,000+" "500,000+" ...
## $ Type : chr "Free" "Free" "Free" "Free" ...
## $ Price : chr "0" "0" "0" "0" ...
## $ Content Rating : Ord.factor w/ 4 levels "TEENS"<"MATURE 17+"<..: 4 4 4 4 4 4 4 4 4 4 ...
## $ Last Updated : Date, format: "2018-01-15" "2018-01-15" ...
## $ Current Ver : chr "2.0.0" "2.0.0" "2.0.0" "2.0.0" ...
## $ Android Ver : chr "4.0.3 and up" "4.0.3 and up" "4.0.3 and up" "4.0.3 and up" ...
## $ Translated_Review : chr "A kid's excessive ads. The types ads allowed app, let alone kids" "It bad >:(" "like" "nan" ...
## $ Sentiment : chr "Negative" "Negative" "Neutral" "nan" ...
## $ Sentiment_Polarity : num -0.25 -0.725 0 NaN 0.5 -0.8 NaN 0 0.5 0.5 ...
## $ Sentiment_Subjectivity: num 1 0.833 0 NaN 0.6 ...
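The two conversions described above (character to ordered factor, character to date) can be tried on a few made-up values. This is only a sketch: the vectors `content` and `raw` are hypothetical, and "Unrated" is included purely to show what happens to a value that is not listed in `levels`.

```r
# Hypothetical values; "Unrated" is deliberately absent from the levels below
content <- c("Everyone", "Teen", "Unrated", "Everyone")
content_f <- factor(
  content,
  levels = c("Teen", "Mature 17+", "Everyone 10+", "Everyone"),
  labels = c("TEENS", "MATURE 17+", "EVERYONE 10+", "EVERYONE"),
  ordered = TRUE
)
print(content_f)              # values not listed in `levels` become NA
print(sum(is.na(content_f)))

# A day/month/two-digit-year string, as produced by uniting the parts with "/"
raw <- "15/1/18"
no_format   <- as.Date(raw)                       # falls back to "%Y/%m/%d"
with_format <- as.Date(raw, format = "%d/%m/%y")  # day, month, two-digit year

print(no_format)    # the leading 15 is taken as the year
print(with_format)  # 15 January 2018
```

Checking the parsed year after a conversion is a cheap way to catch a format mismatch, since as.Date() does not raise an error when a string happens to fit one of its default formats.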
Tidy data has to meet three requirements: 1) Every variable must form a column. 2) Every observation must form a row. 3) Every value must have its own cell. Our dataset (data_combined3) is in tidy format, as it meets these requirements.
head(data_combined3)
As per requirement 6, a new variable (rating_percentage) is created from the existing variable (Rating) by expressing each rating as a percentage of the maximum rating of 5. It is added with the mutate() function and the result is saved as data_combined4. Thus the requirement is satisfied.
# Rating is on a 0-5 scale, so dividing by 5 and multiplying by 100 gives a percentage
data_combined4 <- mutate(data_combined3, rating_percentage = (Rating / 5) * 100)
head(data_combined4)
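As a quick check of the percentage formula, the same calculation can be run on a few made-up ratings. The data frame `toy` is hypothetical, and base R's transform() is used here in place of mutate() only so that the sketch runs without extra packages.

```r
# Hypothetical ratings on the 0-5 scale
toy <- data.frame(Rating = c(3.9, 4.5, 5.0))

# transform() adds a derived column, much like mutate() does
toy <- transform(toy, rating_percentage = (Rating / 5) * 100)
print(toy)
```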
As per requirement 7, the dataset is scanned for missing values, inconsistencies and obvious errors. First, colSums() together with is.na() counts the missing values in each variable. The missing values are then handled by imputation. Since Rating, rating_percentage, Sentiment_Polarity and Sentiment_Subjectivity are numeric, they are imputed with the mean, and Sentiment_Polarity and Sentiment_Subjectivity are rounded to 2 decimal places. “Translated_Review” and “Content Rating” are categorical, so they are imputed with the mode (the most frequent value).
We then check for remaining NaN values with the is.nan() function, summing the result for each variable. Every count is 0, so no NaN values remain after imputation. Note that is.nan() only detects numeric NaN; the literal string "nan" visible in character columns such as Sentiment is not flagged by this check. This completes the checks for requirement 7.
colSums(is.na(data_combined4))
## App Category Rating
## 0 0 40
## Reviews Size Installs
## 0 0 0
## Type Price Content Rating
## 0 0 40
## Last Updated Current Ver Android Ver
## 0 0 0
## Translated_Review Sentiment Sentiment_Polarity
## 10 0 50047
## Sentiment_Subjectivity rating_percentage
## 50047 40
data_combined4$Rating <- impute(data_combined4$Rating, fun = mean)
data_combined4$rating_percentage <- impute(data_combined4$rating_percentage, fun = mean)
data_combined4$Sentiment_Polarity <- impute(data_combined4$Sentiment_Polarity, fun = mean)
data_combined4$Sentiment_Polarity <- round(data_combined4$Sentiment_Polarity, digits = 2)
data_combined4$Sentiment_Polarity <- as.numeric(data_combined4$Sentiment_Polarity)
is.numeric(data_combined4$Sentiment_Polarity)
## [1] TRUE
data_combined4$Sentiment_Subjectivity <- impute(data_combined4$Sentiment_Subjectivity, fun = mean)
data_combined4$Sentiment_Subjectivity <- round(data_combined4$Sentiment_Subjectivity, digits = 2)
data_combined4$Sentiment_Subjectivity <- as.numeric(data_combined4$Sentiment_Subjectivity)
is.numeric(data_combined4$Sentiment_Subjectivity)
## [1] TRUE
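# base R's mode() returns the storage mode (e.g. "character"), so a helper finds the most frequent value
stat_mode <- function(x) names(which.max(table(x)))
data_combined4$Translated_Review <- impute(data_combined4$Translated_Review, fun = stat_mode)
data_combined4$`Content Rating` <- impute(data_combined4$`Content Rating`, fun = stat_mode)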
head(data_combined4)
sum(is.nan(data_combined4$App))
## [1] 0
sum(is.nan(data_combined4$Category))
## [1] 0
sum(is.nan(data_combined4$Rating))
## [1] 0
sum(is.nan(data_combined4$Reviews))
## [1] 0
sum(is.nan(data_combined4$Size))
## [1] 0
sum(is.nan(data_combined4$Installs))
## [1] 0
sum(is.nan(data_combined4$Type))
## [1] 0
sum(is.nan(data_combined4$Price))
## [1] 0
sum(is.nan(data_combined4$`Content Rating`))
## [1] 0
sum(is.nan(data_combined4$`Last Updated`))
## [1] 0
sum(is.nan(data_combined4$`Current Ver`))
## [1] 0
sum(is.nan(data_combined4$`Android Ver`))
## [1] 0
sum(is.nan(data_combined4$Translated_Review))
## [1] 0
sum(is.nan(data_combined4$Sentiment))
## [1] 0
sum(is.nan(data_combined4$Sentiment_Subjectivity))
## [1] 0
sum(is.nan(data_combined4$rating_percentage))
## [1] 0
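The difference between is.na() and is.nan(), and between mean and mode imputation, can be seen on two made-up columns. This is a base R sketch rather than the Hmisc workflow used above: `score`, `review` and the helper `stat_mode` are hypothetical names introduced only for illustration.

```r
# Hypothetical toy columns
score  <- c(0.5, NA, NaN, -0.25)
review <- c("good", NA, "good", "nan", "bad")

# is.na() is TRUE for both NA and NaN; is.nan() is TRUE only for numeric NaN
print(sum(is.na(score)))
print(sum(is.nan(score)))
print(sum(is.nan(review)))   # the string "nan" is not a numeric NaN

# Mean imputation for a numeric column
score_filled <- ifelse(is.na(score), mean(score, na.rm = TRUE), score)
print(score_filled)

# base::mode() reports the storage mode, not the most frequent value
print(mode(review))

# Hypothetical helper returning the most frequent non-missing value
stat_mode <- function(x) names(which.max(table(x)))
review_filled <- ifelse(is.na(review), stat_mode(review), review)
print(review_filled)
```

Text placeholders such as "nan" need a separate check, for example comparing the column against the string directly, because they are ordinary character values as far as R is concerned.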
All numeric variables in the dataset (data_combined4) are scanned for outliers, which are displayed using the plot() and boxplot() functions. Outliers are handled by capping: observations below the lower outlier fence of the box plot (Q1 - 1.5 × IQR) are replaced with the 5th percentile, and observations above the upper fence (Q3 + 1.5 × IQR) are replaced with the 95th percentile. Due to the page limit, only two plots are displayed.
plot(data_combined4$Rating)
plot(data_combined4$rating_percentage)
boxplot(data_combined4$Reviews)
boxplot(data_combined4$Sentiment_Polarity)
boxplot(data_combined4$Sentiment_Subjectivity)
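cap <- function(x) {
  # 5th percentile, Q1, Q3 and 95th percentile
  quantiles <- quantile(x, c(0.05, 0.25, 0.75, 0.95))
  # values below the lower fence are replaced with the 5th percentile
  x[x < quantiles[2] - 1.5 * IQR(x)] <- quantiles[1]
  # values above the upper fence are replaced with the 95th percentile
  x[x > quantiles[3] + 1.5 * IQR(x)] <- quantiles[4]
  x
}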
data_combined4$Rating <- data_combined4$Rating %>% cap()
data_combined4$rating_percentage <- data_combined4$rating_percentage %>% cap()
data_combined4$Reviews <- data_combined4$Reviews %>% cap()
data_combined4$Sentiment_Polarity <- data_combined4$Sentiment_Polarity %>% cap()
data_combined4$Sentiment_Subjectivity <- data_combined4$Sentiment_Subjectivity %>% cap()
plot(data_combined4$Rating)
plot(data_combined4$rating_percentage)
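The capping rule can be traced by hand on a small made-up vector. The function is restated below so that the block runs on its own; the vector `x` is hypothetical and was chosen so that exactly one value lies above the upper fence.

```r
# Capping rule restated so that this block is self-contained
cap <- function(x) {
  quantiles <- quantile(x, c(0.05, 0.25, 0.75, 0.95))
  x[x < quantiles[2] - 1.5 * IQR(x)] <- quantiles[1]
  x[x > quantiles[3] + 1.5 * IQR(x)] <- quantiles[4]
  x
}

# Hypothetical data: the values 1 to 19 plus one extreme observation
x <- c(1:19, 100)

# Q1 = 5.75 and Q3 = 15.25, so IQR = 9.5 and the fences are -8.5 and 29.5
print(quantile(x, c(0.25, 0.75)))

capped <- cap(x)
print(tail(capped, 1))   # 100 is replaced with the 95th percentile
print(head(capped, 19))  # values inside the fences are unchanged
```

Only the value 100 lies outside the fences here, so it is the only element that changes.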
Transformation is an important step before proceeding with statistical analysis. Transformation is performed to:
* Change the scale of a variable or standardise it for better understanding
* Transform a non-linear relationship into a linear relationship
* Reduce skewness or heterogeneity of variances
In the task below, the Reviews variable is chosen for transformation. The histogram of Reviews shows that the variable is heavily right skewed, so a log transformation is used to reduce the right skewness. Here log() (the natural logarithm) is applied and the result is squared before plotting; the histogram of the transformed variable is used to show that its distribution is closer to normal. Finally, summary statistics of the log-transformed variable are reported.
hist(data_combined4$Reviews)
log_reviews <- log(data_combined4$Reviews)
log_reviews1 <- log_reviews ^ 2
hist_reviews <- hist(log_reviews1)
summary(log_reviews)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.303 10.223 11.982 12.073 14.222 16.200
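The effect of a log transformation on right skewness can be illustrated with a small made-up vector. The vector `reviews` and the `skewness` helper (a simple moment-based definition) are hypothetical and defined here only for illustration; they are not part of the analysis above.

```r
# Hypothetical right-skewed counts spanning several orders of magnitude
reviews <- c(10, 100, 1000, 10000, 100000)

# Simple moment-based skewness (population form), defined here for illustration
skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

print(skewness(reviews))        # strongly positive: long right tail
print(skewness(log(reviews)))   # close to 0 for this toy vector

# log() is the natural logarithm; log10() differs only by a constant factor
print(all.equal(log(reviews), log10(reviews) * log(10)))
```

Because log() and log10() differ only by a constant factor, either one changes the shape of the distribution in the same way. The minimum of 2.303 in the summary above is approximately log(10), which matches the use of the natural logarithm.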