Preprocessing a data sample from the Google Play Store

Setup

In the chunk below, I’m loading the readr package, which provides functions for importing data. Note that the import step later in this report uses base R's read.csv(), which works without readr.

library(readr) # Provides functions for importing data

Data Description

This data was obtained from kaggle.com. The URL of the dataset is: https://www.kaggle.com/lava18/google-play-store-apps

To download the data directly, the URL is: https://www.kaggle.com/lava18/google-play-store-apps/downloads/google-play-store-apps.zip/6

Below are some notable points about this data:

The main zip file contains two files, googleplaystore and googleplaystore_userreviews. I have used googleplaystore because it is more relevant and contains all the required variables.
The data file is in CSV format.
The data file contains 13 columns/variables.
The data file contains 10841 rows.
The information in this data was scraped from the Google Play Store, which is the official app store for Android mobile phones (the largest mobile operating system).
The 13 variables are as follows:
- App - The official application name as given by its publisher on the Google Play Store.
- Category - The category the application belongs to, e.g. ART_AND_DESIGN, Music, Games, Weather.
- Rating - Users can rate the app on a scale of 1 to 5, where 5 is excellent and 1 is the worst. This variable holds the rating of each app.
- Reviews - Users can share their views and discuss the good and bad points of an app. This variable holds the number of reviews each app has.
- Size - The size of each app, stored as text such as "19M" (19 megabytes).
- Installs - The approximate number of user downloads of each app, e.g. "10,000+".
- Type - Whether the app is free to use or the user has to pay for it.
- Price - The amount a user needs to pay to access the app completely, i.e. the price of the app.
- Content.Rating - The target audience for the app's content, for example Everyone, Teen, Mature 17+, or Adults only 18+.
- Genres - An app can belong to several genres in the Play Store in addition to its main category. For example, a children's musical game may belong to the genres Music, Children, and Family. The Genres variable records these.
- Last.Updated - The date when the app was last updated on the Play Store.
- Current.Ver - The most recent version of the app on the Google Play Store.
- Android.Ver - The minimum version of Android OS required for the app to be usable on a particular phone.
Data modification - There is only one modification to this data. In the variable “Reviews”, I have changed the entry 3.0M to 3000000 (M stands for million). I did this so that the Reviews variable can be read as an integer, because all of its other entries are integers.
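The manual change to the Reviews entry could also be done in code, which keeps the raw file untouched and makes the step repeatable. The following is only a minimal sketch under stated assumptions: reviews_raw is a made-up vector standing in for the Reviews column read as text, and to_count() is a hypothetical helper, not part of the original analysis.

```r
# Hypothetical stand-in for the Reviews column read as character
reviews_raw <- c("159", "967", "3.0M")

# Hypothetical helper: convert entries such as "3.0M" into whole numbers
to_count <- function(x) {
  is_millions <- grepl("M$", x)            # entries ending in "M"
  value <- as.numeric(sub("M$", "", x))    # strip the suffix and parse
  value[is_millions] <- value[is_millions] * 1e6
  as.integer(round(value))
}

reviews_clean <- to_count(reviews_raw)
reviews_clean
# 159 967 3000000
```

Entries without the M suffix pass through unchanged, so the same function can be applied to the whole column.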

Read/Import Data

In this segment, I’m importing the CSV data using read.csv(). The file is saved on the local computer in the folder named data on the C: drive. The data is assigned to the variable googlem, and the stringsAsFactors argument is set to FALSE. In #2, I’m checking the head of the data; in #3, I’m saving the data as an .RData file; and in #4, I’m checking whether the data is a data frame.

googlem <- read.csv("C:/data/googleplaystore.csv", stringsAsFactors = FALSE) #1
head(googlem) #2

save(googlem, file = "googlem.RData") #3
is.data.frame(googlem) #4

## [1] TRUE

The head() function is helpful for a quick look at the contents of a data frame.
By default, head() shows the first 6 rows and all the columns of the data.
The default value of stringsAsFactors was TRUE in R versions before 4.0.0; from R 4.0.0 onward it is FALSE.
stringsAsFactors is set to FALSE so that text columns are read as character rather than factor. Automatically converting every text column to a factor is not advisable for a large file with many rows; the columns that are genuinely categorical can be converted afterwards.
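The effect of the stringsAsFactors argument can be seen on a tiny example. This is an illustrative sketch only: csv_text is a made-up two-row CSV passed through the text argument of read.csv(), not the Play Store file.

```r
# Made-up CSV content, used instead of a file on disk
csv_text <- "App,Type\nA,Free\nB,Paid"

as_chr <- read.csv(text = csv_text, stringsAsFactors = FALSE)
as_fct <- read.csv(text = csv_text, stringsAsFactors = TRUE)

class(as_chr$Type)  # "character"
class(as_fct$Type)  # "factor"
```

Setting the argument explicitly, as done above for googlem, gives the same result regardless of the R version in use.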

Inspect and Understand

In this section, I’m inspecting the data using various R functions.

In #1, the first function used is dim(), which checks the dimensions of the data; in simple words, it gives the total number of rows and columns.
In #2, the second function used is str(), which checks the data types of all the variables. It shows the complete structure of the data and tells us whether a variable is character, int, num, factor, etc. It also tells us whether a factor is ordered or not.
In #3, I have converted two character variables, Type and Content.Rating, into factors because their values represent categories.
In #4, I have checked the levels of the factor variables and then rearranged them using the factor() function. After this, I checked the structure of the two variables using str() to confirm that the levels were rearranged.
In #5, I checked the column names using the colnames() function.

#1
dim(googlem)

## [1] 10841    13

#2
str(googlem)

## 'data.frame':    10841 obs. of  13 variables:
##  $ App           : chr  "Photo Editor & Candy Camera & Grid & ScrapBook" "Coloring book moana" "U Launcher Lite – FREE Live Cool Themes, Hide Apps" "Sketch - Draw & Paint" ...
##  $ Category      : chr  "ART_AND_DESIGN" "ART_AND_DESIGN" "ART_AND_DESIGN" "ART_AND_DESIGN" ...
##  $ Rating        : num  4.1 3.9 4.7 4.5 4.3 4.4 3.8 4.1 4.4 4.7 ...
##  $ Reviews       : int  159 967 87510 215644 967 167 178 36815 13791 121 ...
##  $ Size          : chr  "19M" "14M" "8.7M" "25M" ...
##  $ Installs      : chr  "10,000+" "500,000+" "5,000,000+" "50,000,000+" ...
##  $ Type          : chr  "Free" "Free" "Free" "Free" ...
##  $ Price         : chr  "0" "0" "0" "0" ...
##  $ Content.Rating: chr  "Everyone" "Everyone" "Everyone" "Teen" ...
##  $ Genres        : chr  "Art & Design" "Art & Design;Pretend Play" "Art & Design" "Art & Design" ...
##  $ Last.Updated  : chr  "7-Jan-18" "15-Jan-18" "1-Aug-18" "8-Jun-18" ...
##  $ Current.Ver   : chr  "1.0.0" "2.0.0" "1.2.4" "Varies with device" ...
##  $ Android.Ver   : chr  "4.0.3 and up" "4.0.3 and up" "4.0.3 and up" "4.2 and up" ...

#3
googlem$Type <- as.factor(googlem$Type)
googlem$Content.Rating <- as.factor(googlem$Content.Rating)

#4
levels(googlem$Type)

## [1] "0"    "Free" "NaN"  "Paid"

levels(googlem$Content.Rating)

## [1] ""                "Adults only 18+" "Everyone"        "Everyone 10+"   
## [5] "Mature 17+"      "Teen"            "Unrated"

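# Level names must match the existing values exactly ("NaN", not "NAN");
# any value that is not listed in levels becomes NA.
googlem$Type <- factor(googlem$Type, levels = c("0", "Free", "Paid", "NaN"), ordered = TRUE)

# The empty-string level "" is not listed here, so those entries become NA.
googlem$Content.Rating <- factor(googlem$Content.Rating, levels = c("Unrated", "Everyone", "Everyone 10+", "Teen", "Mature 17+", "Adults only 18+"), ordered = TRUE)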

#str(googlem)
str(googlem$Type)

##  Ord.factor w/ 4 levels "0"<"Free"<"Paid"<..: 2 2 2 2 2 2 2 2 2 2 ...

str(googlem$Content.Rating)

##  Ord.factor w/ 6 levels "Unrated"<"Everyone"<..: 2 2 2 4 2 2 2 2 2 2 ...

#5
colnames(googlem)

##  [1] "App"            "Category"       "Rating"         "Reviews"       
##  [5] "Size"           "Installs"       "Type"           "Price"         
##  [9] "Content.Rating" "Genres"         "Last.Updated"   "Current.Ver"   
## [13] "Android.Ver"
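One detail of factor() worth checking when levels are rearranged: any value that does not exactly match one of the supplied levels is turned into NA without a warning, and the matching is case-sensitive. The sketch below uses a small made-up vector (type_demo), not the googlem data, to show this.

```r
# Made-up values similar to the Type column
type_demo <- c("Free", "Paid", "NaN", "Free")

# "NAN" does not match the value "NaN", so that entry is lost
f_wrong <- factor(type_demo, levels = c("Free", "Paid", "NAN"), ordered = TRUE)
sum(is.na(f_wrong))  # 1

# With the exact spelling, every value is kept
f_right <- factor(type_demo, levels = c("Free", "Paid", "NaN"), ordered = TRUE)
sum(is.na(f_right))  # 0
```

Counting NA values before and after a call to factor() is a quick way to confirm that no observations were dropped by the relabelling.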

Subset One

In #1, I’m taking a subset of the first 10 observations and then checking its dimensions using dim() to confirm that the subset was created properly.
In #2, I’m converting the subset into a matrix using as.matrix() and confirming the conversion using is.matrix().
In #3, I’m checking the structure of the matrix using str(). To reconfirm, I’m checking the class using class(), which reports it as a matrix.
The matrix is of type character, and every variable is displayed as character, even those stored as integer or numeric in the main data. This is because a matrix can hold only one data type: when the data frame contains character columns, as.matrix() coerces all the values to character, so the numbers are stored as strings and shown in quotation marks ("").

In #4, I’m reconfirming that the matrix is character and not any other data type.

#1
googlemsubset <- googlem[1:10, ] # R indexes from 1; an index of 0 is silently ignored
dim(googlemsubset)

## [1] 10 13

#2
googlematrix <- as.matrix(googlemsubset)
is.matrix(googlematrix)

## [1] TRUE

#3
str(googlematrix)

##  chr [1:10, 1:13] "Photo Editor & Candy Camera & Grid & ScrapBook" ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:10] "1" "2" "3" "4" ...
##   ..$ : chr [1:13] "App" "Category" "Rating" "Reviews" ...

class(googlematrix)

## [1] "matrix"

#4

is.character(googlematrix)

## [1] TRUE
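The coercion to character described above can be reproduced on a small example. This is a sketch with a made-up data frame (mixed), not the googlem subset: a data frame with one character column and one integer column becomes a character matrix, while the integer column on its own stays numeric.

```r
# Made-up data frame with one character and one integer column
mixed <- data.frame(App = c("A", "B"), Reviews = c(159L, 967L),
                    stringsAsFactors = FALSE)

m <- as.matrix(mixed)
typeof(m)          # "character": every value is coerced to a string

nums <- as.matrix(mixed["Reviews"])
is.numeric(nums)   # TRUE: with only numeric columns, no coercion to character
```

So if a numeric matrix is needed, one option is to select only the numeric columns before calling as.matrix().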

Subsetting II

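In the code below, I’m subsetting the data frame to include only the first and last variables (App and Android.Ver). After subsetting, I’m saving the result to an .RData file using the save.image() function, which saves every object in the current workspace, including this subset.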

googlemsubsetII <-googlem[c("App", "Android.Ver")]
googlemsubsetII

save.image(file = "googlesubsetII.RData")
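If only the subset should be written to disk rather than the whole workspace, save() with the object name is an alternative. The sketch below is illustrative and uses assumptions throughout: demo is a made-up data frame, the file is written to a temporary path, and the first and last columns are picked by position instead of by name.

```r
# Made-up data frame standing in for googlem
demo <- data.frame(App = c("A", "B"),
                   Rating = c(4.1, 3.9),
                   Android.Ver = c("4.0 and up", "4.2 and up"),
                   stringsAsFactors = FALSE)

# Select the first and last columns by position
subset_demo <- demo[c(1, ncol(demo))]

# Save only this object, then load it back to check the round trip
path <- tempfile(fileext = ".RData")
save(subset_demo, file = path)
rm(subset_demo)
loaded <- load(path)   # load() returns the names of the restored objects
loaded                 # "subset_demo"
```

Selecting by position keeps working if the column names change, whereas selecting by name, as in the original code, is easier to read.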

Forming a new DataFrame

In #1, I’m creating a new data frame using the data.frame() function. I have declared two variables with 4 observations each. The first variable, EmpM, holds the employee names as character strings, and the second variable, SalM, holds the salaries, which R stores as numeric because the values are entered as plain numbers. I’m then inspecting the structure of the data frame with str(), which shows one character and one numeric variable.
In #2, I’m converting the employee names into an ordered factor, EmployeeM2, with its levels in alphabetical order. This factor is a separate object; the EmpM column inside the data frame remains character.
In #3, I’m displaying the structure of each variable and of the data frame.
In #4, I’m creating a new variable, combining it with the data frame using cbind(), and displaying the new data frame.
In #5, I’m checking the dimensions of the new data frame using dim() and its attributes using attributes().

#1
EmpM <- c('Eva','Peter','John', 'Mack')
SalM <- c(21000, 23400, 26800, 15000) #Salary of the Employee
EmpM.dataM <- data.frame(EmpM, SalM, stringsAsFactors = FALSE)
str(EmpM.dataM)

## 'data.frame':    4 obs. of  2 variables:
##  $ EmpM: chr  "Eva" "Peter" "John" "Mack"
##  $ SalM: num  21000 23400 26800 15000

#2
factor(EmpM)

## [1] Eva   Peter John  Mack 
## Levels: Eva John Mack Peter

EmployeeM2 <- as.factor(EmpM)

EmployeeM2 <- factor(EmployeeM2, levels = c("Eva", "John", "Mack", "Peter"), ordered = TRUE)
levels(EmployeeM2)

## [1] "Eva"   "John"  "Mack"  "Peter"

#3
str(EmployeeM2)

##  Ord.factor w/ 4 levels "Eva"<"John"<"Mack"<..: 1 4 2 3

str(SalM)

##  num [1:4] 21000 23400 26800 15000

str(EmpM.dataM)

## 'data.frame':    4 obs. of  2 variables:
##  $ EmpM: chr  "Eva" "Peter" "John" "Mack"
##  $ SalM: num  21000 23400 26800 15000

#4
EmployM.age <- c(25, 36, 42, 32)
EmployeeMdata <- cbind(EmpM.dataM, EmployM.age)
EmployeeMdata

#5
dim(EmployeeMdata)

## [1] 4 3

attributes(EmployeeMdata)

## $names
## [1] "EmpM"        "SalM"        "EmployM.age"
## 
## $class
## [1] "data.frame"
## 
## $row.names
## [1] 1 2 3 4
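For comparison, the ordered factor and an integer salary can be stored directly in the data frame, so that str() reports an ordinal and an integer variable. This is a sketch of one possible variant, not the code used above; the names emp, name, salary and age are chosen only for this illustration.

```r
# Variant: build the data frame with an ordered factor and integer salaries
emp <- data.frame(
  name = factor(c("Eva", "Peter", "John", "Mack"),
                levels = c("Eva", "John", "Mack", "Peter"),
                ordered = TRUE),
  salary = as.integer(c(21000, 23400, 26800, 15000))
)

# Add a third variable, as cbind() does in #4
emp$age <- c(25, 36, 42, 32)

str(emp)   # name: Ord.factor, salary: int, age: num
dim(emp)   # 4 3
```

Assigning with emp$age is equivalent here to cbind(emp, age), but it names the new column explicitly.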