In the chunk below, I load the packages needed for the functions used in this report. The readr library is loaded for importing data.
library(readr) # Advantageous for importing data
This data was obtained from kaggle.com. The URL of the data is: https://www.kaggle.com/lava18/google-play-store-apps
To download the data directly, the URL is: https://www.kaggle.com/lava18/google-play-store-apps/downloads/google-play-store-apps.zip/6
Below are some notable points about this data:
1. The main zip file contains two files, googleplaystore and googleplaystore_userreviews. I have used googleplaystore because it was more relevant and contains all the required variables.
In this segment, I import the CSV data using read.csv(). The data is saved on the local computer in the folder named data on drive C. The data is assigned to the variable googlem, and the stringsAsFactors argument is set to FALSE. In line #2 I check the head of the data, in line #3 I save the data as an .RData file, and in line #4 I check whether the data is a data frame.
googlem <- read.csv("C:/data/googleplaystore.csv", stringsAsFactors = FALSE) #1
head(googlem) #2
save(googlem, file = "googlem.RData") #3
is.data.frame(googlem) #4
## [1] TRUE
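The import, save and check steps above can be tried without the Kaggle file. The sketch below is only an illustration, assuming a tiny hypothetical CSV written to a temporary file; the names tmp_csv, demo, tmp_rdata and restored are not part of the original analysis.

```r
# Hypothetical three-row CSV standing in for googleplaystore.csv
tmp_csv <- tempfile(fileext = ".csv")
writeLines(c("App,Category,Rating",
             "Alpha,TOOLS,4.1",
             "Beta,GAME,3.9",
             "Gamma,TOOLS,4.7"), tmp_csv)

# Import with strings kept as character vectors rather than factors
demo <- read.csv(tmp_csv, stringsAsFactors = FALSE)

# Save to an .RData file, then reload into a separate environment
# to confirm the object survives the round trip
tmp_rdata <- tempfile(fileext = ".RData")
save(demo, file = tmp_rdata)
restored <- new.env()
load(tmp_rdata, envir = restored)

is.data.frame(demo)
all.equal(demo, restored$demo)
```

Loading into a separate environment avoids overwriting an object of the same name that may already exist in the workspace.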
In this section, I inspect the data using various R functions.
#1
dim(googlem)
## [1] 10841 13
#2
str(googlem)
## 'data.frame': 10841 obs. of 13 variables:
## $ App : chr "Photo Editor & Candy Camera & Grid & ScrapBook" "Coloring book moana" "U Launcher Lite – FREE Live Cool Themes, Hide Apps" "Sketch - Draw & Paint" ...
## $ Category : chr "ART_AND_DESIGN" "ART_AND_DESIGN" "ART_AND_DESIGN" "ART_AND_DESIGN" ...
## $ Rating : num 4.1 3.9 4.7 4.5 4.3 4.4 3.8 4.1 4.4 4.7 ...
## $ Reviews : int 159 967 87510 215644 967 167 178 36815 13791 121 ...
## $ Size : chr "19M" "14M" "8.7M" "25M" ...
## $ Installs : chr "10,000+" "500,000+" "5,000,000+" "50,000,000+" ...
## $ Type : chr "Free" "Free" "Free" "Free" ...
## $ Price : chr "0" "0" "0" "0" ...
## $ Content.Rating: chr "Everyone" "Everyone" "Everyone" "Teen" ...
## $ Genres : chr "Art & Design" "Art & Design;Pretend Play" "Art & Design" "Art & Design" ...
## $ Last.Updated : chr "7-Jan-18" "15-Jan-18" "1-Aug-18" "8-Jun-18" ...
## $ Current.Ver : chr "1.0.0" "2.0.0" "1.2.4" "Varies with device" ...
## $ Android.Ver : chr "4.0.3 and up" "4.0.3 and up" "4.0.3 and up" "4.2 and up" ...
#3
googlem$Type <- as.factor(googlem$Type)
googlem$Content.Rating <- as.factor(googlem$Content.Rating)
#4
levels(googlem$Type)
## [1] "0" "Free" "NaN" "Paid"
levels(googlem$Content.Rating)
## [1] "" "Adults only 18+" "Everyone" "Everyone 10+"
## [5] "Mature 17+" "Teen" "Unrated"
# Level names must match the data exactly ("NaN", as reported by levels() above)
googlem$Type <- factor(googlem$Type, levels = c("0", "Free", "Paid", "NaN"), ordered = TRUE)
googlem$Content.Rating <- factor(googlem$Content.Rating, levels = c("Unrated", "Everyone", "Everyone 10+", "Teen", "Mature 17+", "Adults only 18+"), ordered = TRUE)
#str(googlem)
str(googlem$Type)
## Ord.factor w/ 4 levels "0"<"Free"<"Paid"<..: 2 2 2 2 2 2 2 2 2 2 ...
str(googlem$Content.Rating)
## Ord.factor w/ 6 levels "Unrated"<"Everyone"<..: 2 2 2 4 2 2 2 2 2 2 ...
#5
colnames(googlem)
## [1] "App" "Category" "Rating" "Reviews"
## [5] "Size" "Installs" "Type" "Price"
## [9] "Content.Rating" "Genres" "Last.Updated" "Current.Ver"
## [13] "Android.Ver"
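One point worth checking when re-declaring factor levels, as done for Type and Content.Rating: any value whose spelling is not listed in levels becomes NA without a warning. The sketch below illustrates this on a toy vector; type_raw, mismatch and matched are hypothetical names and the values are assumed, not taken from the real data.

```r
# Toy values mimicking the Type column (assumed for illustration)
type_raw <- c("Free", "Paid", "NaN", "Free", "0")

# A level spelled differently from the data ("NAN" vs "NaN") silently yields NA
mismatch <- factor(type_raw, levels = c("0", "Free", "Paid", "NAN"), ordered = TRUE)
sum(is.na(mismatch))

# Matching the spelling exactly keeps every value
matched <- factor(type_raw, levels = c("0", "Free", "Paid", "NaN"), ordered = TRUE)
sum(is.na(matched))

# Ordered factors support comparisons that follow the declared level order
matched[1] < matched[2]
```

Counting NA values before and after the conversion is a cheap way to confirm that no category was lost.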
#1
googlemsubset <- googlem[1:10, ] # R indexing starts at 1; an index of 0 is silently ignored
dim(googlemsubset)
## [1] 10 13
#2
googlematrix <- as.matrix(googlemsubset)
is.matrix(googlematrix)
## [1] TRUE
#3
str(googlematrix)
## chr [1:10, 1:13] "Photo Editor & Candy Camera & Grid & ScrapBook" ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:10] "1" "2" "3" "4" ...
## ..$ : chr [1:13] "App" "Category" "Rating" "Reviews" ...
class(googlematrix)
## [1] "matrix"
#4
is.character(googlematrix)
## [1] TRUE
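The matrix above is of type character because a matrix can hold only one data type, so as.matrix() coerces a data frame with mixed columns to character. A minimal sketch on a toy data frame (df, m_mixed and m_num are hypothetical names with assumed values):

```r
# Toy data frame with one character and one numeric column
df <- data.frame(name = c("a", "b"), score = c(1.5, 2), stringsAsFactors = FALSE)

# Mixed column types force the whole matrix to character
m_mixed <- as.matrix(df)
is.character(m_mixed)

# Keeping only numeric columns yields a numeric matrix
m_num <- as.matrix(df["score"])
is.numeric(m_num)
dim(m_num)
```

If numeric operations are needed on the matrix, selecting the numeric columns before the conversion avoids the coercion.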
In the code below, I subset my data frame to include only the first and last variables. After subsetting, I save the result to an .RData file using the save.image() function, which writes every object in the current workspace, not only the subset.
googlemsubsetII <-googlem[c("App", "Android.Ver")]
googlemsubsetII
save.image(file = "googlesubsetII.RData")
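If only the subset should end up in the file, save() with an explicit object name is one alternative to save.image(). The sketch below assumes toy objects and a temporary file; subset_demo, scratch, out_file and env are hypothetical names.

```r
# Toy objects; only subset_demo is meant to be kept
subset_demo <- data.frame(App = c("Alpha", "Beta"),
                          Android.Ver = c("4.0 and up", "5.0 and up"),
                          stringsAsFactors = FALSE)
scratch <- 1:10

# save() stores just the named object
out_file <- tempfile(fileext = ".RData")
save(subset_demo, file = out_file)

# load() returns the names of the objects it restored
env <- new.env()
loaded <- load(out_file, envir = env)
loaded
```

Here the file contains subset_demo only, so scratch is not restored into env.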
In #1, I create a new data frame using the data.frame() function, declaring two variables with four observations each. The first variable, EmpM (employee name), is treated as an ordinal variable that can be ordered alphabetically, and the second variable, SalM (salary), is numeric. In #2, I build the ordered factor EmployeeM2 from EmpM, and in #3 I inspect the structure of each object. As the output shows, the data frame itself still stores EmpM as character and SalM as numeric; storing salary as an integer would require as.integer(). In #4, I add an age column to the data frame using cbind().
In #5, I check the dimensions of the new data frame using dim() and its attributes using attributes().
#1
EmpM <- c('Eva','Peter','John', 'Mack')
SalM <- c(21000, 23400, 26800, 15000) #Salary of the Employee
EmpM.dataM <- data.frame(EmpM, SalM, stringsAsFactors = FALSE)
str(EmpM.dataM)
## 'data.frame': 4 obs. of 2 variables:
## $ EmpM: chr "Eva" "Peter" "John" "Mack"
## $ SalM: num 21000 23400 26800 15000
#2
factor(EmpM)
## [1] Eva Peter John Mack
## Levels: Eva John Mack Peter
EmployeeM2 <- as.factor(EmpM)
EmployeeM2 <- factor(EmployeeM2, levels = c("Eva", "John", "Mack", "Peter"), ordered = TRUE)
levels(EmployeeM2)
## [1] "Eva" "John" "Mack" "Peter"
#3
str(EmployeeM2)
## Ord.factor w/ 4 levels "Eva"<"John"<"Mack"<..: 1 4 2 3
str(SalM)
## num [1:4] 21000 23400 26800 15000
str(EmpM.dataM)
## 'data.frame': 4 obs. of 2 variables:
## $ EmpM: chr "Eva" "Peter" "John" "Mack"
## $ SalM: num 21000 23400 26800 15000
#4
EmployM.age <- c(25, 36, 42, 32)
EmployeeMdata <- cbind(EmpM.dataM, EmployM.age)
EmployeeMdata
#5
dim(EmployeeMdata)
## [1] 4 3
attributes(EmployeeMdata)
## $names
## [1] "EmpM" "SalM" "EmployM.age"
##
## $class
## [1] "data.frame"
##
## $row.names
## [1] 1 2 3 4
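For a data frame that holds the ordered factor and an integer salary directly, the conversions can be applied while the columns are being built. This is a sketch of one possible approach, not the code used above; emp, sal, emp_df and the column names Employee and Salary are hypothetical, while the values are the ones from the example.

```r
emp <- c("Eva", "Peter", "John", "Mack")
sal <- c(21000, 23400, 26800, 15000)

# Build the data frame with an ordered factor and an integer column
emp_df <- data.frame(
  Employee = factor(emp, levels = sort(emp), ordered = TRUE),
  Salary   = as.integer(sal)
)

is.ordered(emp_df$Employee)
is.integer(emp_df$Salary)
levels(emp_df$Employee)
str(emp_df)
```

Running str(emp_df) then reports an Ord.factor column and an int column, which is the structure the description in #1 aims for.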