R - Google Play Store applications

This R markdown encompasses data cleaning, analysis, and visualization.
The dataset provides data about Google Play Store applications, such as their prices, ratings, etc. 

The chosen data set is imported from Kaggle.
Link: https://www.kaggle.com/ucdkaggle/data-cleaning-analysis/report
Import the necessary libraries

library('tidyverse') #For data manipulation through Tidyr and Dplyr, and for plotting through ggplot2.
library('wordcloud') #For plotting word cloud.
library('waffle') # for plotting waffle chart.

Original_DF <- read.csv(file ='/Users/salahkaf/Desktop/googleplaystore.csv')

Four ways to get initial understanding of the data 
WAY 1 - Exploring the data set head and tail.

head(Original_DF,3) #Top 3 rows

                                                 App       Category
1     Photo Editor & Candy Camera & Grid & ScrapBook ART_AND_DESIGN
2                                Coloring book moana ART_AND_DESIGN
3 U Launcher Lite – FREE Live Cool Themes, Hide Apps ART_AND_DESIGN
  Rating Reviews Size   Installs Type Price Content.Rating
1    4.1     159  19M    10,000+ Free     0       Everyone
2    3.9     967  14M   500,000+ Free     0       Everyone
3    4.7   87510 8.7M 5,000,000+ Free     0       Everyone
                     Genres     Last.Updated Current.Ver  Android.Ver
1              Art & Design  January 7, 2018       1.0.0 4.0.3 and up
2 Art & Design;Pretend Play January 15, 2018       2.0.0 4.0.3 and up
3              Art & Design   August 1, 2018       1.2.4 4.0.3 and up

tail(Original_DF,3) #Last 3 rows

                                                App
10839                        Parkinson Exercices FR
10840                 The SCP Foundation DB fr nn5n
10841 iHoroscope - 2018 Daily Horoscope & Astrology
                 Category Rating Reviews               Size
10839             MEDICAL    NaN       3               9.5M
10840 BOOKS_AND_REFERENCE    4.5     114 Varies with device
10841           LIFESTYLE    4.5  398307                19M
         Installs Type Price Content.Rating            Genres
10839      1,000+ Free     0       Everyone           Medical
10840      1,000+ Free     0     Mature 17+ Books & Reference
10841 10,000,000+ Free     0       Everyone         Lifestyle
          Last.Updated        Current.Ver        Android.Ver
10839 January 20, 2017                1.0         2.2 and up
10840 January 19, 2015 Varies with device Varies with device
10841    July 25, 2018 Varies with device Varies with device

WAY 2 - Exploring the data set dimensions and glimpse.

dim(Original_DF) #Dimensions of DF

[1] 10841    13

glimpse(Original_DF) #Glimpse about DF

Rows: 10,841
Columns: 13
$ App            <chr> "Photo Editor & Candy Camera & Grid & ScrapBo…
$ Category       <chr> "ART_AND_DESIGN", "ART_AND_DESIGN", "ART_AND_…
$ Rating         <dbl> 4.1, 3.9, 4.7, 4.5, 4.3, 4.4, 3.8, 4.1, 4.4, …
$ Reviews        <chr> "159", "967", "87510", "215644", "967", "167"…
$ Size           <chr> "19M", "14M", "8.7M", "25M", "2.8M", "5.6M", …
$ Installs       <chr> "10,000+", "500,000+", "5,000,000+", "50,000,…
$ Type           <chr> "Free", "Free", "Free", "Free", "Free", "Free…
$ Price          <chr> "0", "0", "0", "0", "0", "0", "0", "0", "0", …
$ Content.Rating <chr> "Everyone", "Everyone", "Everyone", "Teen", "…
$ Genres         <chr> "Art & Design", "Art & Design;Pretend Play", …
$ Last.Updated   <chr> "January 7, 2018", "January 15, 2018", "Augus…
$ Current.Ver    <chr> "1.0.0", "2.0.0", "1.2.4", "Varies with devic…
$ Android.Ver    <chr> "4.0.3 and up", "4.0.3 and up", "4.0.3 and up…

WAY 3 - Exploring the data set structure and summary.

str(Original_DF) #Structure of DF

'data.frame':   10841 obs. of  13 variables:
 $ App           : chr  "Photo Editor & Candy Camera & Grid & ScrapBook" "Coloring book moana" "U Launcher Lite – FREE Live Cool Themes, Hide Apps" "Sketch - Draw & Paint" ...
 $ Category      : chr  "ART_AND_DESIGN" "ART_AND_DESIGN" "ART_AND_DESIGN" "ART_AND_DESIGN" ...
 $ Rating        : num  4.1 3.9 4.7 4.5 4.3 4.4 3.8 4.1 4.4 4.7 ...
 $ Reviews       : chr  "159" "967" "87510" "215644" ...
 $ Size          : chr  "19M" "14M" "8.7M" "25M" ...
 $ Installs      : chr  "10,000+" "500,000+" "5,000,000+" "50,000,000+" ...
 $ Type          : chr  "Free" "Free" "Free" "Free" ...
 $ Price         : chr  "0" "0" "0" "0" ...
 $ Content.Rating: chr  "Everyone" "Everyone" "Everyone" "Teen" ...
 $ Genres        : chr  "Art & Design" "Art & Design;Pretend Play" "Art & Design" "Art & Design" ...
 $ Last.Updated  : chr  "January 7, 2018" "January 15, 2018" "August 1, 2018" "June 8, 2018" ...
 $ Current.Ver   : chr  "1.0.0" "2.0.0" "1.2.4" "Varies with device" ...
 $ Android.Ver   : chr  "4.0.3 and up" "4.0.3 and up" "4.0.3 and up" "4.2 and up" ...

summary(Original_DF) #Summary of DF

     App              Category             Rating      
 Length:10841       Length:10841       Min.   : 1.000  
 Class :character   Class :character   1st Qu.: 4.000  
 Mode  :character   Mode  :character   Median : 4.300  
                                       Mean   : 4.193  
                                       3rd Qu.: 4.500  
                                       Max.   :19.000  
                                       NA's   :1474    
   Reviews              Size             Installs        
 Length:10841       Length:10841       Length:10841      
 Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character  
                                                         
                                                         
                                                         
                                                         
     Type              Price           Content.Rating    
 Length:10841       Length:10841       Length:10841      
 Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character  
                                                         
                                                         
                                                         
                                                         
    Genres          Last.Updated       Current.Ver       
 Length:10841       Length:10841       Length:10841      
 Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character  
                                                         
                                                         
                                                         
                                                         
 Android.Ver       
 Length:10841      
 Class :character  
 Mode  :character

WAY 4 - Exploring the data set missing values.

# Total number of missing values in the data set
cat("The total number of missing values in the dataset is" , sum(is.na(Original_DF)))

The total number of missing values in the dataset is 1474

# Total number of missing values in the data set per column name
colSums(is.na(Original_DF))

           App       Category         Rating        Reviews 
             0              0           1474              0 
          Size       Installs           Type          Price 
             0              0              0              0 
Content.Rating         Genres   Last.Updated    Current.Ver 
             0              0              0              0 
   Android.Ver 
             0

Four ways of subsetting / choosing row or columns.
Here we want to remove the unnecessary columns for this analysis. 
They are [1 - Last.Updated 2 - Current.Ver 3 - Android.Ver].
At this point, we will consider all rows are important so we will keep them from manipulation.
Except in the fourth way we will select specific rows for demonstration purposes.
Way 1 - using regular index method for columns selection.

subsetted1 <- Original_DF[,-(11:13)] #Remove last three columns
head(subsetted1,3) #Show first three rows

                                                 App       Category
1     Photo Editor & Candy Camera & Grid & ScrapBook ART_AND_DESIGN
2                                Coloring book moana ART_AND_DESIGN
3 U Launcher Lite – FREE Live Cool Themes, Hide Apps ART_AND_DESIGN
  Rating Reviews Size   Installs Type Price Content.Rating
1    4.1     159  19M    10,000+ Free     0       Everyone
2    3.9     967  14M   500,000+ Free     0       Everyone
3    4.7   87510 8.7M 5,000,000+ Free     0       Everyone
                     Genres
1              Art & Design
2 Art & Design;Pretend Play
3              Art & Design

WAY 2 - using subset function for columns selection.

subsetted2 <- subset(Original_DF, select = c(App:Genres)) #Remove last three columns
head(subsetted2,3) #Show first three rows

                                                 App       Category
1     Photo Editor & Candy Camera & Grid & ScrapBook ART_AND_DESIGN
2                                Coloring book moana ART_AND_DESIGN
3 U Launcher Lite – FREE Live Cool Themes, Hide Apps ART_AND_DESIGN
  Rating Reviews Size   Installs Type Price Content.Rating
1    4.1     159  19M    10,000+ Free     0       Everyone
2    3.9     967  14M   500,000+ Free     0       Everyone
3    4.7   87510 8.7M 5,000,000+ Free     0       Everyone
                     Genres
1              Art & Design
2 Art & Design;Pretend Play
3              Art & Design

WAY 3 - using select function for columns selection

subsetted3 <- select(Original_DF, 1:10) #Remove last three columns
head(subsetted3,3) #Show first three rows

                                                 App       Category
1     Photo Editor & Candy Camera & Grid & ScrapBook ART_AND_DESIGN
2                                Coloring book moana ART_AND_DESIGN
3 U Launcher Lite – FREE Live Cool Themes, Hide Apps ART_AND_DESIGN
  Rating Reviews Size   Installs Type Price Content.Rating
1    4.1     159  19M    10,000+ Free     0       Everyone
2    3.9     967  14M   500,000+ Free     0       Everyone
3    4.7   87510 8.7M 5,000,000+ Free     0       Everyone
                     Genres
1              Art & Design
2 Art & Design;Pretend Play
3              Art & Design

WAY 4 - subsetting rows and columns by the three above methods.
And Selecting only apps in "business" category.

subsetted4A <- Original_DF[Original_DF$Category =="BUSINESS",1:10] #Regular indexing
subsetted4B <- subset(Original_DF, Category == "BUSINESS" , select = c(App:Genres)) #Subset method.
subsetted4C <- filter(select(Original_DF, 1:10), Category == "BUSINESS") #Filter method
head(subsetted4A,2)

                             App Category Rating Reviews
188 Visual Voicemail by MetroPCS BUSINESS    4.1   16129
189            Indeed Job Search BUSINESS    4.3  674730
                  Size    Installs Type Price Content.Rating   Genres
188 Varies with device 10,000,000+ Free     0       Everyone Business
189 Varies with device 50,000,000+ Free     0       Everyone Business

head(subsetted4B,2)

                             App Category Rating Reviews
188 Visual Voicemail by MetroPCS BUSINESS    4.1   16129
189            Indeed Job Search BUSINESS    4.3  674730
                  Size    Installs Type Price Content.Rating   Genres
188 Varies with device 10,000,000+ Free     0       Everyone Business
189 Varies with device 50,000,000+ Free     0       Everyone Business

head(subsetted4C,2)

                           App Category Rating Reviews
1 Visual Voicemail by MetroPCS BUSINESS    4.1   16129
2            Indeed Job Search BUSINESS    4.3  674730
                Size    Installs Type Price Content.Rating   Genres
1 Varies with device 10,000,000+ Free     0       Everyone Business
2 Varies with device 50,000,000+ Free     0       Everyone Business

c. Four ways to Preprocess data (Cleaning, etc)
WAY 1 - remove missing values using complete case code.

dim(subsetted1) #Before cleaning is 10841 rows

[1] 10841    10

subsetted1 <- subsetted1[complete.cases(subsetted1),]
dim(subsetted1) #After cleaning should be 10841 - 1474 = 9367 rows

[1] 9367   10

WAY 2 - remove missing values using na.omit.

dim(subsetted2) #Before cleaning is 10841 rows

[1] 10841    10

subsetted2 <- na.omit(subsetted2)
dim(subsetted2) #After cleaning should be 10841 - 1474 = 9367 rows

[1] 9367   10

WAY 3 - replacing missing values with mean values.

MeanOfRatings <- mean(subsetted3$Rating, na.rm = T) #Finding the mean of the "Rating" column
subsetted3$Rating[is.na(subsetted3$Rating)] <- MeanOfRatings
dim(subsetted3) #Should not change, thus there must be 10841.

[1] 10841    10

anyNA(subsetted3) #Should be "False", which implies there is no missing values.

[1] FALSE

WAY 4 - Giving the misssing values avalue of 0.

subsetted4D <- Original_DF[,1:10]
subsetted4D$Rating <- 
  gsub(NaN, 0, as.double(subsetted4D$Rating))  #Replacing missing values values with 0
subsetted4D$Rating <- as.double(subsetted4D$Rating) #Saving the column as double
subsetted4D[23:25,1:3] #Checking row 23-25 for confirmation.

                                       App       Category Rating
23 Superheroes Wallpapers | 4K Backgrounds ART_AND_DESIGN    4.7
24                  Mcqueen Coloring pages ART_AND_DESIGN    0.0
25             HD Mickey Minnie Wallpapers ART_AND_DESIGN    4.7

Subsetted 3 & 4 results are not apporiate for our analysis as they provide inaccurate results.
(Replacing missing values with 0s or average is not preferable here).
So we will go with subsetted 1 & 2 (Removing rows with missing values).
(13.6% of data will be lost, which is acceptable in this situation).
Since Subsetted1, and subsetted2 sets are exactly the same,
we will take subsetted1 as the main data set to continue our analysis.
Final touch is to convert some columns into ineger type.

subsetted1$Reviews <- as.integer(subsetted1$Reviews) #Make the "Reviews" column type as integer.

subsetted1$Installs <- str_replace_all(subsetted1$Installs,",","") #Removing ","
subsetted1$Installs <- str_sub(subsetted1$Installs,1,nchar(subsetted1$Installs)-1) #Removing "+"
subsetted1$Installs <- as.integer(subsetted1$Installs) #Convert to integers

#Converting to integers will make one row has an NA value
#removing that row 
dim(subsetted1)

[1] 9367   10

subsetted1 <- na.omit(subsetted1)
dim(subsetted1)

[1] 9366   10

Write one function and use the function for the dataset .
The ratings in the data set out of 5,
will will create a column to make the ratings in percentage form.

# Function creation - The function calculate the rating percentage out of 100% instead of 5 stars.
ScaleConversion <- function(x) {
  RatingPercentage = (x/5) * 100
  return(RatingPercentage)
}
# applying the function on the data set
subsetted1 <- mutate(subsetted1, RatingPercentage = ScaleConversion(subsetted1$Rating))
# Placing the new column next to "Rating" column
subsetted1 <- relocate(subsetted1,  RatingPercentage, .after= Rating)
head(subsetted1,2)

                                             App       Category
1 Photo Editor & Candy Camera & Grid & ScrapBook ART_AND_DESIGN
2                            Coloring book moana ART_AND_DESIGN
  Rating RatingPercentage Reviews Size Installs Type Price
1    4.1               82     159  19M    10000 Free     0
2    3.9               78     967  14M   500000 Free     0
  Content.Rating                    Genres
1       Everyone              Art & Design
2       Everyone Art & Design;Pretend Play

This data set gives general understanding about apps
ratings, prices, number of downloads, and categories, etc.
The aim of this analysis is to answer these three questions, 
Q1 - Which genres are most popular?
Q2- Which genres are most liked?
Q3 - What is the free/paid ratio?
Furthermore, for simplicity purposes. Each question will be treated with a unique DF.
Solutions:
Q1 - Which genres are most popular?
Prepare a subset suitable for plotting.

A1DF <- subsetted1 #Unique DF for Answer 1 
most_downloaded <- data.frame(tapply(A1DF$Installs,A1DF$Category,sum)) # Sum number of installs per category
head(most_downloaded,4)

                    tapply.A1DF.Installs..A1DF.Category..sum.
ART_AND_DESIGN                                      124233100
AUTO_AND_VEHICLES                                    53129800
BEAUTY                                               26916200
BOOKS_AND_REFERENCE                                1921291655

most_downloaded$Category <- rownames(most_downloaded) #Make the index as a column 
rownames(most_downloaded) <- NULL #Remove the index since there is no need for it any more
colnames(most_downloaded)[1] <- "Total_Downloads" #Renaming the downloads count column
most_downloaded <- relocate(most_downloaded, Category, .before = "Total_Downloads") #Reordering columns
most_downloaded

              Category Total_Downloads
1       ART_AND_DESIGN       124233100
2    AUTO_AND_VEHICLES        53129800
3               BEAUTY        26916200
4  BOOKS_AND_REFERENCE      1921291655
5             BUSINESS      1001768120
6               COMICS        56036100
7        COMMUNICATION     32647241530
8               DATING       264295110
9            EDUCATION       871352000
10       ENTERTAINMENT      2869160000
11              EVENTS        15949410
12              FAMILY     10257701590
13             FINANCE       876612400
14      FOOD_AND_DRINK       273777750
15                GAME     35085862717
16  HEALTH_AND_FITNESS      1583056220
17      HOUSE_AND_HOME       168582000
18  LIBRARIES_AND_DEMO        62083000
19           LIFESTYLE       537561120
20 MAPS_AND_NAVIGATION       724267560
21             MEDICAL        53205576
22  NEWS_AND_MAGAZINES      7496210650
23           PARENTING        31116110
24     PERSONALIZATION      2325352930
25         PHOTOGRAPHY     10088243130
26        PRODUCTIVITY     14176070180
27            SHOPPING      3247831540
28              SOCIAL     14069841475
29              SPORTS      1751131465
30               TOOLS     11450724500
31    TRAVEL_AND_LOCAL      6868859300
32       VIDEO_PLAYERS      6221897200
33             WEATHER       426096500

Plotting A1.  Bar Chart

A1plot_bar <- ggplot(data=most_downloaded, aes(y=Category,x=Total_Downloads
)) +
  geom_bar(stat= "identity",fill="#56B4E9") + 
  ggtitle("Number of Installs per Category") +
  theme(plot.title = element_text(hjust = 0.5))+
  xlab("Installs") + ylab("Category")
A1plot_bar

Plotting A1 Cloud words

most_downloaded[7,1] <- "COM" #Changing COMMUNICATION TO COM in order to fit it in the next plot.
A1Word_plot <- wordcloud(words = most_downloaded$Category,
                         freq = most_downloaded$Total_Downloads, min.freq = 1,
          max.words=2000, random.order=FALSE, rot.per=0.35,
          colors=brewer.pal(8, "Dark2"))

Create a table for top 10 categories

top10_downloaded <- arrange(most_downloaded, desc(Total_Downloads)) #Arrange DF in descending order
top10_downloaded <- top10_downloaded[1:10,]
top10_downloaded$Percentage <- 
 round((top10_downloaded$Total_Downloads)*100/sum(top10_downloaded$Total_Downloads),2)#Percentage Generating
top10_downloaded$Percentage <- paste(top10_downloaded$Percentage,"%") #Append"%" sign
top10_downloaded

             Category Total_Downloads Percentage
1                GAME     35085862717    23.65 %
2                 COM     32647241530    22.01 %
3        PRODUCTIVITY     14176070180     9.56 %
4              SOCIAL     14069841475     9.48 %
5               TOOLS     11450724500     7.72 %
6              FAMILY     10257701590     6.91 %
7         PHOTOGRAPHY     10088243130      6.8 %
8  NEWS_AND_MAGAZINES      7496210650     5.05 %
9    TRAVEL_AND_LOCAL      6868859300     4.63 %
10      VIDEO_PLAYERS      6221897200     4.19 %

Q2- Which categories are most liked?
Here we need to set an important criterion, some apps are rated above 4 starts,
yet the number of voters are less than 1000 person.
It is unfair to make comparison between 3star apps with milion view with a 4.5 starts app with 250 view only.
We will omit all applications where number of review is less than 1000.

A2DF <- subsetted1 #Unique DF for A2
dim(A2DF) # 9366 rows before filtering

[1] 9366   11

A2DF <- filter(A2DF, Reviews >= 1000)  
dim(A2DF) # 5896 rows after filtering

[1] 5896   11

A2mean <- data.frame(tapply(A2DF$Rating,A2DF$Category,mean)) #Weight the average per category
head(A2mean,2)

                  tapply.A2DF.Rating..A2DF.Category..mean.
ART_AND_DESIGN                                    4.453846
AUTO_AND_VEHICLES                                 4.361765

A2mean$AverageRating <-round(A2mean[,1],2)
A2mean$Category <- rownames(A2mean)
rownames(A2mean) <- NULL
A2mean[,1] <-NULL
A2mean <- arrange(A2mean, desc(AverageRating))[1:10,]
A2mean <- relocate(A2mean, Category, .before = "AverageRating") #Reordering columns
A2mean

              Category AverageRating
1       ART_AND_DESIGN          4.45
2            EDUCATION          4.42
3   HEALTH_AND_FITNESS          4.42
4  BOOKS_AND_REFERENCE          4.41
5    AUTO_AND_VEHICLES          4.36
6              MEDICAL          4.36
7            PARENTING          4.35
8      PERSONALIZATION          4.34
9               EVENTS          4.32
10                GAME          4.32

PLotting top 10 genres based on their rating.

Q2plot_Pie <- ggplot(A2mean, aes(x="", y=AverageRating, fill=Category)) +
  geom_bar(stat="identity", width=1, color="White") +
  coord_polar("y", start=0) + 
  ggtitle("Top 10 Rated Genres out of 5 Stars") +
  theme(plot.title = element_text(hjust = 0.5))+
    geom_label(aes(label = AverageRating),
             position = position_stack(vjust = 0.35),
             show.legend = FALSE) +
  theme_void() # remove background, grid, numeric labels
Q2plot_Pie

Q3 - What is the free/paid ratio?

A3DF <- subsetted1 #Unique DF for Q3
A3DF <- count(A3DF,Type)
A3DF$percentage <- round(A3DF$n/sum(A3DF$n),2)
A3DF

  Type    n percentage
1 Free 8719       0.93
2 Paid  647       0.07

Plotting waffle chart

A3DF <- c(Free = 93, Paid = 7)
waffle(A3DF)

Concluded insights:

1 - Gaming industry is dominat in the Google Play Store.
2 - Altough communication apps are also dominant, they are not in the top 10 liked categoires.
3 - Most Apps are free. Apps devlopers are more interested in ads or microtransactions (inside app purchase).