This R markdown encompasses data cleaning, analysis, and visualization.
The dataset provides data about Google Play Store applications, such as their prices, ratings, etc.
The chosen data set is imported from Kaggle.
Link: https://www.kaggle.com/ucdkaggle/data-cleaning-analysis/report
Import the necessary libraries
Original_DF <- read.csv(file ='/Users/salahkaf/Desktop/googleplaystore.csv')
Four ways to get initial understanding of the data
WAY 1 - Exploring the data set head and tail.
head(Original_DF,3) #Top 3 rows
App Category
1 Photo Editor & Candy Camera & Grid & ScrapBook ART_AND_DESIGN
2 Coloring book moana ART_AND_DESIGN
3 U Launcher Lite – FREE Live Cool Themes, Hide Apps ART_AND_DESIGN
Rating Reviews Size Installs Type Price Content.Rating
1 4.1 159 19M 10,000+ Free 0 Everyone
2 3.9 967 14M 500,000+ Free 0 Everyone
3 4.7 87510 8.7M 5,000,000+ Free 0 Everyone
Genres Last.Updated Current.Ver Android.Ver
1 Art & Design January 7, 2018 1.0.0 4.0.3 and up
2 Art & Design;Pretend Play January 15, 2018 2.0.0 4.0.3 and up
3 Art & Design August 1, 2018 1.2.4 4.0.3 and up
tail(Original_DF,3) #Last 3 rows
App
10839 Parkinson Exercices FR
10840 The SCP Foundation DB fr nn5n
10841 iHoroscope - 2018 Daily Horoscope & Astrology
Category Rating Reviews Size
10839 MEDICAL NaN 3 9.5M
10840 BOOKS_AND_REFERENCE 4.5 114 Varies with device
10841 LIFESTYLE 4.5 398307 19M
Installs Type Price Content.Rating Genres
10839 1,000+ Free 0 Everyone Medical
10840 1,000+ Free 0 Mature 17+ Books & Reference
10841 10,000,000+ Free 0 Everyone Lifestyle
Last.Updated Current.Ver Android.Ver
10839 January 20, 2017 1.0 2.2 and up
10840 January 19, 2015 Varies with device Varies with device
10841 July 25, 2018 Varies with device Varies with device
WAY 2 - Exploring the data set dimensions and glimpse.
dim(Original_DF) #Dimensions of DF
[1] 10841 13
glimpse(Original_DF) #Glimpse about DF
Rows: 10,841
Columns: 13
$ App <chr> "Photo Editor & Candy Camera & Grid & ScrapBo…
$ Category <chr> "ART_AND_DESIGN", "ART_AND_DESIGN", "ART_AND_…
$ Rating <dbl> 4.1, 3.9, 4.7, 4.5, 4.3, 4.4, 3.8, 4.1, 4.4, …
$ Reviews <chr> "159", "967", "87510", "215644", "967", "167"…
$ Size <chr> "19M", "14M", "8.7M", "25M", "2.8M", "5.6M", …
$ Installs <chr> "10,000+", "500,000+", "5,000,000+", "50,000,…
$ Type <chr> "Free", "Free", "Free", "Free", "Free", "Free…
$ Price <chr> "0", "0", "0", "0", "0", "0", "0", "0", "0", …
$ Content.Rating <chr> "Everyone", "Everyone", "Everyone", "Teen", "…
$ Genres <chr> "Art & Design", "Art & Design;Pretend Play", …
$ Last.Updated <chr> "January 7, 2018", "January 15, 2018", "Augus…
$ Current.Ver <chr> "1.0.0", "2.0.0", "1.2.4", "Varies with devic…
$ Android.Ver <chr> "4.0.3 and up", "4.0.3 and up", "4.0.3 and up…
WAY 3 - Exploring the data set structure and summary.
str(Original_DF) #Structure of DF
'data.frame': 10841 obs. of 13 variables:
$ App : chr "Photo Editor & Candy Camera & Grid & ScrapBook" "Coloring book moana" "U Launcher Lite – FREE Live Cool Themes, Hide Apps" "Sketch - Draw & Paint" ...
$ Category : chr "ART_AND_DESIGN" "ART_AND_DESIGN" "ART_AND_DESIGN" "ART_AND_DESIGN" ...
$ Rating : num 4.1 3.9 4.7 4.5 4.3 4.4 3.8 4.1 4.4 4.7 ...
$ Reviews : chr "159" "967" "87510" "215644" ...
$ Size : chr "19M" "14M" "8.7M" "25M" ...
$ Installs : chr "10,000+" "500,000+" "5,000,000+" "50,000,000+" ...
$ Type : chr "Free" "Free" "Free" "Free" ...
$ Price : chr "0" "0" "0" "0" ...
$ Content.Rating: chr "Everyone" "Everyone" "Everyone" "Teen" ...
$ Genres : chr "Art & Design" "Art & Design;Pretend Play" "Art & Design" "Art & Design" ...
$ Last.Updated : chr "January 7, 2018" "January 15, 2018" "August 1, 2018" "June 8, 2018" ...
$ Current.Ver : chr "1.0.0" "2.0.0" "1.2.4" "Varies with device" ...
$ Android.Ver : chr "4.0.3 and up" "4.0.3 and up" "4.0.3 and up" "4.2 and up" ...
summary(Original_DF) #Summary of DF
App Category Rating
Length:10841 Length:10841 Min. : 1.000
Class :character Class :character 1st Qu.: 4.000
Mode :character Mode :character Median : 4.300
Mean : 4.193
3rd Qu.: 4.500
Max. :19.000
NA's :1474
Reviews Size Installs
Length:10841 Length:10841 Length:10841
Class :character Class :character Class :character
Mode :character Mode :character Mode :character
Type Price Content.Rating
Length:10841 Length:10841 Length:10841
Class :character Class :character Class :character
Mode :character Mode :character Mode :character
Genres Last.Updated Current.Ver
Length:10841 Length:10841 Length:10841
Class :character Class :character Class :character
Mode :character Mode :character Mode :character
Android.Ver
Length:10841
Class :character
Mode :character
WAY 4 - Exploring the data set missing values.
# Total number of missing values in the data set
cat("The total number of missing values in the dataset is" , sum(is.na(Original_DF)))
The total number of missing values in the dataset is 1474
# Total number of missing values in the data set per column name
colSums(is.na(Original_DF))
App Category Rating Reviews
0 0 1474 0
Size Installs Type Price
0 0 0 0
Content.Rating Genres Last.Updated Current.Ver
0 0 0 0
Android.Ver
0
Four ways of subsetting / choosing row or columns.
Here we want to remove the unnecessary columns for this analysis.
They are [1 - Last.Updated 2 - Current.Ver 3 - Android.Ver].
At this point, we will consider all rows are important so we will keep them from manipulation.
Except in the fourth way we will select specific rows for demonstration purposes.
Way 1 - using regular index method for columns selection.
subsetted1 <- Original_DF[,-(11:13)] #Remove last three columns
head(subsetted1,3) #Show first three rows
App Category
1 Photo Editor & Candy Camera & Grid & ScrapBook ART_AND_DESIGN
2 Coloring book moana ART_AND_DESIGN
3 U Launcher Lite – FREE Live Cool Themes, Hide Apps ART_AND_DESIGN
Rating Reviews Size Installs Type Price Content.Rating
1 4.1 159 19M 10,000+ Free 0 Everyone
2 3.9 967 14M 500,000+ Free 0 Everyone
3 4.7 87510 8.7M 5,000,000+ Free 0 Everyone
Genres
1 Art & Design
2 Art & Design;Pretend Play
3 Art & Design
WAY 2 - using subset function for columns selection.
subsetted2 <- subset(Original_DF, select = c(App:Genres)) #Remove last three columns
head(subsetted2,3) #Show first three rows
App Category
1 Photo Editor & Candy Camera & Grid & ScrapBook ART_AND_DESIGN
2 Coloring book moana ART_AND_DESIGN
3 U Launcher Lite – FREE Live Cool Themes, Hide Apps ART_AND_DESIGN
Rating Reviews Size Installs Type Price Content.Rating
1 4.1 159 19M 10,000+ Free 0 Everyone
2 3.9 967 14M 500,000+ Free 0 Everyone
3 4.7 87510 8.7M 5,000,000+ Free 0 Everyone
Genres
1 Art & Design
2 Art & Design;Pretend Play
3 Art & Design
WAY 3 - using select function for columns selection
subsetted3 <- select(Original_DF, 1:10) #Remove last three columns
head(subsetted3,3) #Show first three rows
App Category
1 Photo Editor & Candy Camera & Grid & ScrapBook ART_AND_DESIGN
2 Coloring book moana ART_AND_DESIGN
3 U Launcher Lite – FREE Live Cool Themes, Hide Apps ART_AND_DESIGN
Rating Reviews Size Installs Type Price Content.Rating
1 4.1 159 19M 10,000+ Free 0 Everyone
2 3.9 967 14M 500,000+ Free 0 Everyone
3 4.7 87510 8.7M 5,000,000+ Free 0 Everyone
Genres
1 Art & Design
2 Art & Design;Pretend Play
3 Art & Design
WAY 4 - subsetting rows and columns by the three above methods.
And Selecting only apps in "business" category.
subsetted4A <- Original_DF[Original_DF$Category =="BUSINESS",1:10] #Regular indexing
subsetted4B <- subset(Original_DF, Category == "BUSINESS" , select = c(App:Genres)) #Subset method.
subsetted4C <- filter(select(Original_DF, 1:10), Category == "BUSINESS") #Filter method
head(subsetted4A,2)
App Category Rating Reviews
188 Visual Voicemail by MetroPCS BUSINESS 4.1 16129
189 Indeed Job Search BUSINESS 4.3 674730
Size Installs Type Price Content.Rating Genres
188 Varies with device 10,000,000+ Free 0 Everyone Business
189 Varies with device 50,000,000+ Free 0 Everyone Business
App Category Rating Reviews
188 Visual Voicemail by MetroPCS BUSINESS 4.1 16129
189 Indeed Job Search BUSINESS 4.3 674730
Size Installs Type Price Content.Rating Genres
188 Varies with device 10,000,000+ Free 0 Everyone Business
189 Varies with device 50,000,000+ Free 0 Everyone Business
App Category Rating Reviews
1 Visual Voicemail by MetroPCS BUSINESS 4.1 16129
2 Indeed Job Search BUSINESS 4.3 674730
Size Installs Type Price Content.Rating Genres
1 Varies with device 10,000,000+ Free 0 Everyone Business
2 Varies with device 50,000,000+ Free 0 Everyone Business
c. Four ways to Preprocess data (Cleaning, etc)
WAY 1 - remove missing values using complete case code.
dim(subsetted1) #Before cleaning is 10841 rows
[1] 10841 10
subsetted1 <- subsetted1[complete.cases(subsetted1),]
dim(subsetted1) #After cleaning should be 10841 - 1474 = 9367 rows
[1] 9367 10
WAY 2 - remove missing values using na.omit.
dim(subsetted2) #Before cleaning is 10841 rows
[1] 10841 10
subsetted2 <- na.omit(subsetted2)
dim(subsetted2) #After cleaning should be 10841 - 1474 = 9367 rows
[1] 9367 10
WAY 3 - replacing missing values with mean values.
MeanOfRatings <- mean(subsetted3$Rating, na.rm = T) #Finding the mean of the "Rating" column
subsetted3$Rating[is.na(subsetted3$Rating)] <- MeanOfRatings
dim(subsetted3) #Should not change, thus there must be 10841.
[1] 10841 10
anyNA(subsetted3) #Should be "False", which implies there is no missing values.
[1] FALSE
WAY 4 - Giving the misssing values avalue of 0.
subsetted4D <- Original_DF[,1:10]
subsetted4D$Rating <-
gsub(NaN, 0, as.double(subsetted4D$Rating)) #Replacing missing values values with 0
subsetted4D$Rating <- as.double(subsetted4D$Rating) #Saving the column as double
subsetted4D[23:25,1:3] #Checking row 23-25 for confirmation.
App Category Rating
23 Superheroes Wallpapers | 4K Backgrounds ART_AND_DESIGN 4.7
24 Mcqueen Coloring pages ART_AND_DESIGN 0.0
25 HD Mickey Minnie Wallpapers ART_AND_DESIGN 4.7
Subsetted 3 & 4 results are not apporiate for our analysis as they provide inaccurate results.
(Replacing missing values with 0s or average is not preferable here).
So we will go with subsetted 1 & 2 (Removing rows with missing values).
(13.6% of data will be lost, which is acceptable in this situation).
Since Subsetted1, and subsetted2 sets are exactly the same,
we will take subsetted1 as the main data set to continue our analysis.
Final touch is to convert some columns into ineger type.
subsetted1$Reviews <- as.integer(subsetted1$Reviews) #Make the "Reviews" column type as integer.
subsetted1$Installs <- str_replace_all(subsetted1$Installs,",","") #Removing ","
subsetted1$Installs <- str_sub(subsetted1$Installs,1,nchar(subsetted1$Installs)-1) #Removing "+"
subsetted1$Installs <- as.integer(subsetted1$Installs) #Convert to integers
#Converting to integers will make one row has an NA value
#removing that row
dim(subsetted1)
[1] 9367 10
[1] 9366 10
Write one function and use the function for the dataset .
The ratings in the data set out of 5,
will will create a column to make the ratings in percentage form.
# Function creation - The function calculate the rating percentage out of 100% instead of 5 stars.
ScaleConversion <- function(x) {
RatingPercentage = (x/5) * 100
return(RatingPercentage)
}
# applying the function on the data set
subsetted1 <- mutate(subsetted1, RatingPercentage = ScaleConversion(subsetted1$Rating))
# Placing the new column next to "Rating" column
subsetted1 <- relocate(subsetted1, RatingPercentage, .after= Rating)
head(subsetted1,2)
App Category
1 Photo Editor & Candy Camera & Grid & ScrapBook ART_AND_DESIGN
2 Coloring book moana ART_AND_DESIGN
Rating RatingPercentage Reviews Size Installs Type Price
1 4.1 82 159 19M 10000 Free 0
2 3.9 78 967 14M 500000 Free 0
Content.Rating Genres
1 Everyone Art & Design
2 Everyone Art & Design;Pretend Play
This data set gives general understanding about apps
ratings, prices, number of downloads, and categories, etc.
The aim of this analysis is to answer these three questions,
Q1 - Which genres are most popular?
Q2- Which genres are most liked?
Q3 - What is the free/paid ratio?
Furthermore, for simplicity purposes. Each question will be treated with a unique DF.
Solutions:
Q1 - Which genres are most popular?
Prepare a subset suitable for plotting.
A1DF <- subsetted1 #Unique DF for Answer 1
most_downloaded <- data.frame(tapply(A1DF$Installs,A1DF$Category,sum)) # Sum number of installs per category
head(most_downloaded,4)
tapply.A1DF.Installs..A1DF.Category..sum.
ART_AND_DESIGN 124233100
AUTO_AND_VEHICLES 53129800
BEAUTY 26916200
BOOKS_AND_REFERENCE 1921291655
most_downloaded$Category <- rownames(most_downloaded) #Make the index as a column
rownames(most_downloaded) <- NULL #Remove the index since there is no need for it any more
colnames(most_downloaded)[1] <- "Total_Downloads" #Renaming the downloads count column
most_downloaded <- relocate(most_downloaded, Category, .before = "Total_Downloads") #Reordering columns
most_downloaded
Category Total_Downloads
1 ART_AND_DESIGN 124233100
2 AUTO_AND_VEHICLES 53129800
3 BEAUTY 26916200
4 BOOKS_AND_REFERENCE 1921291655
5 BUSINESS 1001768120
6 COMICS 56036100
7 COMMUNICATION 32647241530
8 DATING 264295110
9 EDUCATION 871352000
10 ENTERTAINMENT 2869160000
11 EVENTS 15949410
12 FAMILY 10257701590
13 FINANCE 876612400
14 FOOD_AND_DRINK 273777750
15 GAME 35085862717
16 HEALTH_AND_FITNESS 1583056220
17 HOUSE_AND_HOME 168582000
18 LIBRARIES_AND_DEMO 62083000
19 LIFESTYLE 537561120
20 MAPS_AND_NAVIGATION 724267560
21 MEDICAL 53205576
22 NEWS_AND_MAGAZINES 7496210650
23 PARENTING 31116110
24 PERSONALIZATION 2325352930
25 PHOTOGRAPHY 10088243130
26 PRODUCTIVITY 14176070180
27 SHOPPING 3247831540
28 SOCIAL 14069841475
29 SPORTS 1751131465
30 TOOLS 11450724500
31 TRAVEL_AND_LOCAL 6868859300
32 VIDEO_PLAYERS 6221897200
33 WEATHER 426096500
Plotting A1. Bar Chart
A1plot_bar <- ggplot(data=most_downloaded, aes(y=Category,x=Total_Downloads
)) +
geom_bar(stat= "identity",fill="#56B4E9") +
ggtitle("Number of Installs per Category") +
theme(plot.title = element_text(hjust = 0.5))+
xlab("Installs") + ylab("Category")
A1plot_bar

Plotting A1 Cloud words
most_downloaded[7,1] <- "COM" #Changing COMMUNICATION TO COM in order to fit it in the next plot.
A1Word_plot <- wordcloud(words = most_downloaded$Category,
freq = most_downloaded$Total_Downloads, min.freq = 1,
max.words=2000, random.order=FALSE, rot.per=0.35,
colors=brewer.pal(8, "Dark2"))

Create a table for top 10 categories
top10_downloaded <- arrange(most_downloaded, desc(Total_Downloads)) #Arrange DF in descending order
top10_downloaded <- top10_downloaded[1:10,]
top10_downloaded$Percentage <-
round((top10_downloaded$Total_Downloads)*100/sum(top10_downloaded$Total_Downloads),2)#Percentage Generating
top10_downloaded$Percentage <- paste(top10_downloaded$Percentage,"%") #Append"%" sign
top10_downloaded
Category Total_Downloads Percentage
1 GAME 35085862717 23.65 %
2 COM 32647241530 22.01 %
3 PRODUCTIVITY 14176070180 9.56 %
4 SOCIAL 14069841475 9.48 %
5 TOOLS 11450724500 7.72 %
6 FAMILY 10257701590 6.91 %
7 PHOTOGRAPHY 10088243130 6.8 %
8 NEWS_AND_MAGAZINES 7496210650 5.05 %
9 TRAVEL_AND_LOCAL 6868859300 4.63 %
10 VIDEO_PLAYERS 6221897200 4.19 %
Q2- Which categories are most liked?
Here we need to set an important criterion, some apps are rated above 4 starts,
yet the number of voters are less than 1000 person.
It is unfair to make comparison between 3star apps with milion view with a 4.5 starts app with 250 view only.
We will omit all applications where number of review is less than 1000.
A2DF <- subsetted1 #Unique DF for A2
dim(A2DF) # 9366 rows before filtering
[1] 9366 11
A2DF <- filter(A2DF, Reviews >= 1000)
dim(A2DF) # 5896 rows after filtering
[1] 5896 11
A2mean <- data.frame(tapply(A2DF$Rating,A2DF$Category,mean)) #Weight the average per category
head(A2mean,2)
tapply.A2DF.Rating..A2DF.Category..mean.
ART_AND_DESIGN 4.453846
AUTO_AND_VEHICLES 4.361765
A2mean$AverageRating <-round(A2mean[,1],2)
A2mean$Category <- rownames(A2mean)
rownames(A2mean) <- NULL
A2mean[,1] <-NULL
A2mean <- arrange(A2mean, desc(AverageRating))[1:10,]
A2mean <- relocate(A2mean, Category, .before = "AverageRating") #Reordering columns
A2mean
Category AverageRating
1 ART_AND_DESIGN 4.45
2 EDUCATION 4.42
3 HEALTH_AND_FITNESS 4.42
4 BOOKS_AND_REFERENCE 4.41
5 AUTO_AND_VEHICLES 4.36
6 MEDICAL 4.36
7 PARENTING 4.35
8 PERSONALIZATION 4.34
9 EVENTS 4.32
10 GAME 4.32
PLotting top 10 genres based on their rating.
Q2plot_Pie <- ggplot(A2mean, aes(x="", y=AverageRating, fill=Category)) +
geom_bar(stat="identity", width=1, color="White") +
coord_polar("y", start=0) +
ggtitle("Top 10 Rated Genres out of 5 Stars") +
theme(plot.title = element_text(hjust = 0.5))+
geom_label(aes(label = AverageRating),
position = position_stack(vjust = 0.35),
show.legend = FALSE) +
theme_void() # remove background, grid, numeric labels
Q2plot_Pie

Q3 - What is the free/paid ratio?
A3DF <- subsetted1 #Unique DF for Q3
A3DF <- count(A3DF,Type)
A3DF$percentage <- round(A3DF$n/sum(A3DF$n),2)
A3DF
Type n percentage
1 Free 8719 0.93
2 Paid 647 0.07
Plotting waffle chart
A3DF <- c(Free = 93, Paid = 7)
waffle(A3DF)

Concluded insights:
1 - Gaming industry is dominat in the Google Play Store.
2 - Altough communication apps are also dominant, they are not in the top 10 liked categoires.
3 - Most Apps are free. Apps devlopers are more interested in ads or microtransactions (inside app purchase).