Setup

In the chunk below, I load the packages needed for the functions used in this report. The readr library is loaded to support importing data.

library(readr) # Used for importing data

Data Description

This data comes from kaggle.com. The URL of the dataset is: https://www.kaggle.com/lava18/google-play-store-apps

To download the data directly, the URL is: https://www.kaggle.com/lava18/google-play-store-apps/downloads/google-play-store-apps.zip/6

Below are some notable points about this data:

1. The main zip file contains two files, googleplaystore and googleplaystore_userreviews. I have used googleplaystore because it is more relevant and contains all the necessary variables.

  1. The data file is in CSV format.
  2. The data file contains 13 columns/variables.
  3. The data file contains 10841 rows.
  4. The information in this data is scraped from the Google Play Store, which is the official app store for Android mobile phones (the largest mobile operating system).
  5. The 13 variables are as follows:
    • App - The official application name as given by its publisher on the Google Play Store.
    • Category - The category the application belongs to, for example Music family, Music games, Arts, Weather, etc.
    • Rating - Users can rate an app on a scale of 1 to 5, where 5 is excellent and 1 is the worst. This variable holds the rating of each app.
    • Reviews - Users can share their views and discuss the good and bad points of an app. This variable holds the number of reviews each app has.
    • Size - The size of each app, in megabytes.
    • Installs - The approximate number of user downloads of each app.
    • Type - Whether the app is free to use or the user has to pay for it.
    • Price - The amount a user needs to pay to access the app in full, i.e. the price of the app.
    • Content.Rating - The target audience for the content of the app, i.e. whether the application is suitable for everyone, teens, mature users (17+) or adults only (18+).
    • Genres - An app can belong to several genres in the Play Store (in addition to its main category). For example, a children's musical game may belong to the Music, Children and Family genres. The Genres variable records these.
    • Last.Updated - The date when the app was last updated on the Play Store.
    • Current.Ver - The current version of the app on the Google Play Store.
    • Android.Ver - The minimum version of Android OS required for the app to be usable on a particular phone.
  6. Data modification - There is only one modification to this data: in the variable “Reviews”, I changed 3.0M to 3,000,000, where M stands for million. I did this so that the Reviews variable is read as an integer, because all of its other entries are integers.
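
The manual edit described in the last point could also be done in code. The following is a minimal sketch, not the method used in this report: parse_reviews is a hypothetical helper, and the sample values are made up for illustration. It assumes the only non-numeric form is a number followed by "M".

```r
# Hypothetical helper: turn review counts such as "3.0M" into plain integers.
# Assumption: values are either plain digits or a number with an "M" suffix.
parse_reviews <- function(x) {
  is_million <- grepl("M$", x)           # which entries carry the "M" suffix
  num <- as.numeric(sub("M$", "", x))    # strip the suffix and parse the number
  as.integer(ifelse(is_million, num * 1e6, num))
}

reviews_raw <- c("159", "967", "3.0M")   # illustrative values only
reviews_clean <- parse_reviews(reviews_raw)
reviews_clean
# [1]     159     967 3000000
```

Doing the conversion in code keeps the raw CSV untouched, so the change stays visible and repeatable.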

Read/Import Data

In this segment, I import the CSV data using read.csv. The data is saved on the local computer in a folder named data on the C drive. The data is assigned to the variable googlem, and the stringsAsFactors argument is set to FALSE. In line #2, I check the head of the data; in line #3, I save the data as an RData file; and in line #4, I check whether the data is a data frame.

googlem <- read.csv("C:/data/googleplaystore.csv", stringsAsFactors = FALSE) #1
head(googlem) #2
save(googlem, file = "googlem.RData") #3
is.data.frame(googlem)#4
## [1] TRUE
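
The same import, save and check steps can be tried without the Kaggle file. This is a minimal sketch under stated assumptions: the CSV contents, the temporary file paths and the name demo are all made up for illustration and are not part of the dataset.

```r
# Write a tiny illustrative CSV to a temporary file (made-up contents).
tmp_csv <- tempfile(fileext = ".csv")
writeLines(c("App,Rating", "Demo App,4.1", "Other App,3.9"), tmp_csv)

# Import it the same way as above; strings stay character, not factor.
demo <- read.csv(tmp_csv, stringsAsFactors = FALSE)

# Save the object, remove it, and load it back to confirm the round trip.
tmp_rdata <- tempfile(fileext = ".RData")
save(demo, file = tmp_rdata)
rm(demo)
load(tmp_rdata)

is.data.frame(demo)
# [1] TRUE
```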

Inspect and Understand

In this section, I inspect the data using various R functions.

#1
dim(googlem) 
## [1] 10841    13
#2
str(googlem)
## 'data.frame':    10841 obs. of  13 variables:
##  $ App           : chr  "Photo Editor & Candy Camera & Grid & ScrapBook" "Coloring book moana" "U Launcher Lite – FREE Live Cool Themes, Hide Apps" "Sketch - Draw & Paint" ...
##  $ Category      : chr  "ART_AND_DESIGN" "ART_AND_DESIGN" "ART_AND_DESIGN" "ART_AND_DESIGN" ...
##  $ Rating        : num  4.1 3.9 4.7 4.5 4.3 4.4 3.8 4.1 4.4 4.7 ...
##  $ Reviews       : int  159 967 87510 215644 967 167 178 36815 13791 121 ...
##  $ Size          : chr  "19M" "14M" "8.7M" "25M" ...
##  $ Installs      : chr  "10,000+" "500,000+" "5,000,000+" "50,000,000+" ...
##  $ Type          : chr  "Free" "Free" "Free" "Free" ...
##  $ Price         : chr  "0" "0" "0" "0" ...
##  $ Content.Rating: chr  "Everyone" "Everyone" "Everyone" "Teen" ...
##  $ Genres        : chr  "Art & Design" "Art & Design;Pretend Play" "Art & Design" "Art & Design" ...
##  $ Last.Updated  : chr  "7-Jan-18" "15-Jan-18" "1-Aug-18" "8-Jun-18" ...
##  $ Current.Ver   : chr  "1.0.0" "2.0.0" "1.2.4" "Varies with device" ...
##  $ Android.Ver   : chr  "4.0.3 and up" "4.0.3 and up" "4.0.3 and up" "4.2 and up" ...
#3
googlem$Type <- as.factor(googlem$Type)
googlem$Content.Rating <- as.factor(googlem$Content.Rating)

#4
levels(googlem$Type)
## [1] "0"    "Free" "NaN"  "Paid"
levels(googlem$Content.Rating)
## [1] ""                "Adults only 18+" "Everyone"        "Everyone 10+"   
## [5] "Mature 17+"      "Teen"            "Unrated"
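# Level names must match levels() exactly ("NaN", not "NAN"), otherwise those rows become NA
googlem$Type <- factor(googlem$Type, levels = c("0", "Free", "Paid", "NaN"), ordered = TRUE)

# The empty-string level "" is not listed here, so rows with "" become NA
googlem$Content.Rating <- factor(googlem$Content.Rating, levels = c("Unrated", "Everyone", "Everyone 10+", "Teen", "Mature 17+", "Adults only 18+"), ordered = TRUE)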

#str(googlem)
str(googlem$Type)
##  Ord.factor w/ 4 levels "0"<"Free"<"Paid"<..: 2 2 2 2 2 2 2 2 2 2 ...
str(googlem$Content.Rating)
##  Ord.factor w/ 6 levels "Unrated"<"Everyone"<..: 2 2 2 4 2 2 2 2 2 2 ...
#5
colnames(googlem)
##  [1] "App"            "Category"       "Rating"         "Reviews"       
##  [5] "Size"           "Installs"       "Type"           "Price"         
##  [9] "Content.Rating" "Genres"         "Last.Updated"   "Current.Ver"   
## [13] "Android.Ver"
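
When a factor is recoded with an explicit levels argument, any value whose spelling is not in that list becomes NA, and the match is case sensitive. A minimal sketch of this behaviour, using a short made-up vector rather than the real Type column:

```r
# Illustrative values only; "NaN" here is a character string, as in levels(googlem$Type).
type_raw <- c("Free", "Paid", "NaN", "0", "Free")

# "NAN" does not match "NaN", so that entry is silently turned into NA.
mismatched <- factor(type_raw, levels = c("0", "Free", "Paid", "NAN"), ordered = TRUE)

# With the exact spelling, every value keeps its level.
matched <- factor(type_raw, levels = c("0", "Free", "Paid", "NaN"), ordered = TRUE)

sum(is.na(mismatched))
# [1] 1
sum(is.na(matched))
# [1] 0
```

Comparing the NA count before and after recoding is a quick way to catch a misspelled level.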

Subset One

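  1. In #4, we reconfirm that the matrix is of type character and not any other data type. Because a matrix can hold only one type, as.matrix converts every column of a mixed data frame to character.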
#1
googlemsubset <- googlem[1:10, ] # first 10 rows (R indexing starts at 1)
dim(googlemsubset)
## [1] 10 13
#2
googlematrix <- as.matrix(googlemsubset)
is.matrix(googlematrix)
## [1] TRUE
#3
str(googlematrix)
##  chr [1:10, 1:13] "Photo Editor & Candy Camera & Grid & ScrapBook" ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:10] "1" "2" "3" "4" ...
##   ..$ : chr [1:13] "App" "Category" "Rating" "Reviews" ...
class(googlematrix)
## [1] "matrix"
#4

is.character(googlematrix)
## [1] TRUE
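
The character result comes from how as.matrix treats mixed column types. A minimal sketch with a made-up two-column data frame (the names mixed, name and score are illustrative only):

```r
# A small data frame with one character and one numeric column.
mixed <- data.frame(name = c("a", "b"), score = c(1.5, 2),
                    stringsAsFactors = FALSE)

# Mixed column types: every cell is coerced to character.
m_all <- as.matrix(mixed)
typeof(m_all)
# [1] "character"

# Numeric columns only: the matrix stays numeric.
m_num <- as.matrix(mixed["score"])
typeof(m_num)
# [1] "double"
```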

Subsetting II

In the code below, I subset the data frame to include only the first and last variables. After subsetting, I save the result to an .RData file using the save.image() function, which stores every object in the current workspace, not only the subset.

googlemsubsetII <-googlem[c("App", "Android.Ver")]
googlemsubsetII
save.image(file = "googlesubsetII.RData")
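
If only the subset should be stored, save() with an explicit object name is an alternative to save.image(). A minimal sketch, assuming a small made-up data frame (apps) in place of googlem and a temporary file path:

```r
# Illustrative stand-in for googlem with three columns.
apps <- data.frame(App = c("A", "B"),
                   Rating = c(4.1, 3.9),
                   Android.Ver = c("4.0.3 and up", "4.2 and up"),
                   stringsAsFactors = FALSE)

# Select the first and last variables by name, or by position.
subset_cols <- apps[c("App", "Android.Ver")]
subset_pos  <- apps[c(1, ncol(apps))]
identical(subset_cols, subset_pos)

# save() writes only the named object; load() returns the names it restored.
out <- tempfile(fileext = ".RData")
save(subset_cols, file = out)
restored <- load(out, envir = new.env())
restored
# [1] "subset_cols"
```

Selecting by position with ncol() avoids hard-coding the last column's name.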

Forming a new DataFrame

#1
EmpM <- c('Eva','Peter','John', 'Mack')
SalM <- c(21000, 23400, 26800, 15000) #Salary of the Employee
EmpM.dataM <- data.frame(EmpM, SalM, stringsAsFactors = FALSE)
str(EmpM.dataM)
## 'data.frame':    4 obs. of  2 variables:
##  $ EmpM: chr  "Eva" "Peter" "John" "Mack"
##  $ SalM: num  21000 23400 26800 15000
#2
factor(EmpM)
## [1] Eva   Peter John  Mack 
## Levels: Eva John Mack Peter
EmployeeM2 <- as.factor(EmpM)

EmployeeM2 <- factor(EmployeeM2, levels = c("Eva", "John", "Mack", "Peter"), ordered = TRUE)
levels(EmployeeM2)
## [1] "Eva"   "John"  "Mack"  "Peter"
#3
str(EmployeeM2)
##  Ord.factor w/ 4 levels "Eva"<"John"<"Mack"<..: 1 4 2 3
str(SalM)
##  num [1:4] 21000 23400 26800 15000
str(EmpM.dataM)
## 'data.frame':    4 obs. of  2 variables:
##  $ EmpM: chr  "Eva" "Peter" "John" "Mack"
##  $ SalM: num  21000 23400 26800 15000
#4
EmployM.age <- c(25 ,36 , 42 ,32)
EmployeeMdata <- cbind(EmpM.dataM,EmployM.age)
EmployeeMdata
#5
dim(EmployeeMdata)
## [1] 4 3
attributes(EmployeeMdata)
## $names
## [1] "EmpM"        "SalM"        "EmployM.age"
## 
## $class
## [1] "data.frame"
## 
## $row.names
## [1] 1 2 3 4
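
Adding a column with cbind, as in #4, gives the same result as assigning the column directly. A minimal sketch with made-up names (staff, age), not the objects used above:

```r
# Illustrative two-row data frame.
staff <- data.frame(name = c("Eva", "Peter"),
                    salary = c(21000, 23400),
                    stringsAsFactors = FALSE)
age <- c(25, 36)

# Option 1: bind the vector on as a new column (the column takes the vector's name).
via_cbind <- cbind(staff, age)

# Option 2: assign the new column directly.
via_assign <- staff
via_assign$age <- age

isTRUE(all.equal(via_cbind, via_assign))
# [1] TRUE
dim(via_cbind)
# [1] 2 3
```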