The data come from a Kaggle dataset.

The goal of this project is to understand how the properties of apps in the Google Play Store affect their popularity among users.

We start by looking at the structure of the data:

'data.frame':   10841 obs. of  13 variables:
 $ App           : Factor w/ 9660 levels "- Free Comics - Comic Apps",..: 7206 2551 8970 8089 7272 7103 8149 5568 4926 5806 ...
 $ Category      : Factor w/ 34 levels "1.9","ART_AND_DESIGN",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ Rating        : num  4.1 3.9 4.7 4.5 4.3 4.4 3.8 4.1 4.4 4.7 ...
 $ Reviews       : Factor w/ 6002 levels "0","1","10","100",..: 1183 5924 5681 1947 5924 1310 1464 3385 816 485 ...
 $ Size          : Factor w/ 462 levels "1,000+","1.0M",..: 55 30 368 102 64 222 55 118 146 120 ...
 $ Installs      : Factor w/ 22 levels "0","0+","1,000,000,000+",..: 8 20 13 16 11 17 17 4 4 8 ...
 $ Type          : Factor w/ 4 levels "0","Free","NaN",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ Price         : Factor w/ 93 levels "$0.99","$1.00",..: 92 92 92 92 92 92 92 92 92 92 ...
 $ Content.Rating: Factor w/ 7 levels "","Adults only 18+",..: 3 3 3 6 3 3 3 3 3 3 ...
 $ Genres        : Factor w/ 120 levels "Action","Action;Action & Adventure",..: 10 13 10 10 12 10 10 10 10 12 ...
 $ Last.Updated  : Factor w/ 1378 levels "1.0.19","April 1, 2016",..: 562 482 117 825 757 901 76 726 1317 670 ...
 $ Current.Ver   : Factor w/ 2834 levels "","0.0.0.2","0.0.1",..: 122 1020 468 2827 280 116 280 2393 1457 1431 ...
 $ Android.Ver   : Factor w/ 35 levels "","1.0 and up",..: 17 17 17 20 22 10 17 20 12 17 ...
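
The listing above is the kind of output str() prints for a data frame. As a minimal sketch of why so many columns arrive as factors, the toy example below reads a two-row inline CSV (made up for illustration; it stands in for the real file, whose name is not given here). Setting stringsAsFactors = TRUE is an assumption that reproduces the Factor columns shown above.

```r
# Toy stand-in for the Kaggle CSV (illustrative rows, not the real file).
csv_text <- "App,Category,Rating,Reviews
Foo,ART_AND_DESIGN,4.1,159
Bar,ART_AND_DESIGN,3.9,3.0M"

# Assumption: strings are read as factors (the default before R 4.0).
toy <- read.csv(text = csv_text, stringsAsFactors = TRUE)

# A single non-numeric entry ("3.0M") turns the whole Reviews column
# into a factor, while Rating stays numeric.
str(toy)
```

This is why one misplaced value in a column is enough to change its type for all rows.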

Preliminary findings

We have 13 variables and our first task is to figure out which variables should be used in our analysis.

We need variables that either influence or measure the popularity of an app, and that are complete enough to derive insights from (e.g. not 90% NA values).
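
One way to check the completeness criterion is to compute the share of missing values per column. The sketch below uses a small made-up data frame whose column names mirror the dataset; the 90% cut-off is the illustrative threshold mentioned above, not a fixed rule.

```r
# Toy data frame; column names mirror the dataset, values are made up.
toy <- data.frame(
  Rating  = c(4.1, NA, 4.7, NA),
  Reviews = c(159, 967, 87510, 215)
)

# is.na() gives a logical matrix; colMeans() turns it into the share of
# missing values per column.
na_share <- colMeans(is.na(toy))
print(na_share)

# Hypothetical cut-off: keep columns with less than 90% missing values.
usable <- names(na_share)[na_share < 0.9]
```

On the real data the same two lines could be applied to dat directly, keeping in mind that blank strings and "NaN" levels in factor columns are not counted by is.na().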


The App name is unlikely to be relevant to our analysis, as users are not generally inclined to like a product based on its name alone.

Current.Ver and Android.Ver are also set aside: Last.Updated already shows how recently the app was updated, so they add little new information.


Now, we can focus on the remaining 10 variables we intend to analyze:

Numeric variables:

Categorical variables:

As we can see, most of the columns are stored as factors, so the ones that hold numbers need to be converted into numeric values first.
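
Converting a factor to numeric has a well-known pitfall: as.numeric() applied directly to a factor returns the internal level codes, not the numbers in the labels. A minimal sketch on a toy factor:

```r
# Toy factor whose labels look like numbers.
f <- factor(c("159", "967", "159"))

# Pitfall: as.numeric() on a factor returns the internal level codes.
codes <- as.numeric(f)

# Converting via the labels gives the intended numbers.
values <- as.numeric(as.character(f))

# Equivalent form that converts each level only once.
values2 <- as.numeric(levels(f))[f]
```

Here codes holds the level positions, while values and values2 hold the review counts themselves.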

Numeric variables:

Let’s start with rating:

summary(na.omit(dat$Rating))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.000   4.000   4.300   4.193   4.500  19.000 

The maximum rating is 19, which is outside the valid range (ratings go up to 5), so we need to find the row with the corrupted data:

corrupted <- dat %>% filter(Rating == 19)
corrupted
                                      App Category Rating Reviews   Size
1 Life Made WI-Fi Touchscreen Photo Frame      1.9     19    3.0M 1,000+
  Installs Type    Price Content.Rating            Genres Last.Updated
1     Free    0 Everyone                February 11, 2018       1.0.19
  Current.Ver Android.Ver
1  4.0 and up            
head(dat, 1)
                                             App       Category Rating
1 Photo Editor & Candy Camera & Grid & ScrapBook ART_AND_DESIGN    4.1
  Reviews Size Installs Type Price Content.Rating       Genres
1     159  19M  10,000+ Free     0       Everyone Art & Design
     Last.Updated Current.Ver  Android.Ver
1 January 7, 2018       1.0.0 4.0.3 and up

Comparing the corrupted row with a normal row shows that its Category value is missing (and its Genres value is blank), so every later value sits one column to the left of where it belongs: Category holds the rating 1.9, Rating holds the review count 19, and so on. Since we cannot recover the true values, I decided to remove this row.

dat <- subset(dat, Category != "1.9")
summary(na.omit(dat$Rating))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.000   4.000   4.300   4.192   4.500   5.000 
ggplot(dat, aes(x = Rating)) + 
  geom_histogram()

Now we get the correct summary and distribution of rating.
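
A slightly more general variant of this clean-up is sketched below in base R on toy data: it flags any rating above 5 rather than matching the value 19 exactly, and it drops the factor level left behind by the removed row. The toy values are illustrative; the second row mimics the shifted record.

```r
# Toy data; the second row mimics the shifted record.
toy <- data.frame(
  Category = factor(c("ART_AND_DESIGN", "1.9", "BUSINESS")),
  Rating   = c(4.1, 19, 4.5)
)

# Flag ratings above 5 (NA ratings are kept, not flagged).
bad <- !is.na(toy$Rating) & toy$Rating > 5
clean <- toy[!bad, ]

# The removed value "1.9" would otherwise linger as an empty factor level.
clean <- droplevels(clean)
levels(clean$Category)
```

Dropping unused levels matters later, because an empty "1.9" category would otherwise still show up in tables and plot legends.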

Next we can take a look at the price:

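# Strip the dollar sign and convert to numeric. na.omit() is not used here:
# inside mutate() it would change the column length if any value were NA.
dat <- dat %>% mutate(Price = as.double(gsub("\\$", "", Price)))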
summary(dat$Price)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.000   0.000   0.000   1.027   0.000 400.000 
ggplot(dat, aes(x = Price)) + 
  geom_histogram()

The median and the third quartile are both 0, so at least three quarters of the apps in the dataset are free. Price therefore shows little variation, and we may want to focus on other independent variables to find patterns.
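
The share of free apps can also be computed directly instead of being read off the quartiles. A minimal sketch on made-up price strings in the same format as the raw column:

```r
# Toy price strings in the same format as the raw column (values made up).
price_raw <- c("0", "0", "$0.99", "0", "$1.00")

# Strip the dollar sign, then convert to numeric.
price <- as.numeric(gsub("$", "", price_raw, fixed = TRUE))

# The mean of a logical vector is the proportion of TRUE values.
free_share <- mean(price == 0)
free_share
```

Applied to the cleaned column, mean(dat$Price == 0) would give the exact proportion for the real data.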

Next is Reviews:

# Convert the factor labels (not the level codes) to numbers.
dat <- dat %>% mutate(Reviews = as.numeric(levels(Reviews))[Reviews])
summary(dat$Reviews)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
       0       38     2094   444153    54776 78158306 
ggplot(dat, aes(x = Reviews)) + 
  geom_histogram()
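
The summary shows a mean (444153) far above the median (2094), so the distribution is heavily right-skewed and a plain histogram will place almost every app in the first bin. One common option, sketched here as a suggestion rather than as part of the original analysis, is a log10 transform. The five values below are the quantiles from the summary above.

```r
# Min, 1st quartile, median, 3rd quartile and max from the summary above.
reviews <- c(0, 38, 2094, 54776, 78158306)

# log10(x + 1) keeps apps with zero reviews (log10(0) would be -Inf).
log_reviews <- log10(reviews + 1)
round(log_reviews, 2)
```

The transformed values span roughly 0 to 8, so a histogram of them shows the shape of the distribution instead of a single spike.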

Size is trickier: it contains a non-numeric level ('Varies with device'), and the numeric values use different units (M and k), so we need to handle each group separately.

length(dat$Size[dat$Size == 'Varies with device'])
[1] 1695
Numeric_size <- subset(dat, Size != 'Varies with device')
Numeric_size <- Numeric_size %>% mutate(Size = as.character(Size))

large_size <- Numeric_size %>% filter(str_detect(Size, 'M'))
small_size <- Numeric_size %>% filter(str_detect(Size, 'k'))
head(large_size, 3)
                                                 App       Category Rating
1     Photo Editor & Candy Camera & Grid & ScrapBook ART_AND_DESIGN    4.1
2                                Coloring book moana ART_AND_DESIGN    3.9
3 U Launcher Lite – FREE Live Cool Themes, Hide Apps ART_AND_DESIGN    4.7
  Reviews Size   Installs Type Price Content.Rating
1     159  19M    10,000+ Free     0       Everyone
2     967  14M   500,000+ Free     0       Everyone
3   87510 8.7M 5,000,000+ Free     0       Everyone
                     Genres     Last.Updated Current.Ver  Android.Ver
1              Art & Design  January 7, 2018       1.0.0 4.0.3 and up
2 Art & Design;Pretend Play January 15, 2018       2.0.0 4.0.3 and up
3              Art & Design   August 1, 2018       1.2.4 4.0.3 and up
head(small_size, 3)
                            App          Category Rating Reviews Size
1             Restart Navigator AUTO_AND_VEHICLES    4.0    1403 201k
2               Plugin:AOT v5.0          BUSINESS    3.1    4034  23k
3 Hangouts Dialer - Call Phones     COMMUNICATION    4.0  122498  79k
     Installs Type Price Content.Rating          Genres       Last.Updated
1    100,000+ Free     0       Everyone Auto & Vehicles    August 26, 2014
2    100,000+ Free     0       Everyone        Business September 11, 2015
3 10,000,000+ Free     0       Everyone   Communication  September 2, 2015
           Current.Ver  Android.Ver
1                1.0.1   2.2 and up
2 3.0.1.11 (Build 311)   2.2 and up
3        0.1.100944346 4.0.3 and up

The data set is now split into two groups: apps whose size is given in M go into large_size, and apps whose size is given in k go into small_size. Next we strip the unit and convert Size to numeric in each group:

large_size <- large_size %>% mutate(Size = as.double(substr(large_size$Size, 1,  nchar(large_size$Size) - 1)))

ggplot(large_size, aes(x = Size)) + 
  geom_histogram()

small_size <- small_size %>% mutate(Size = as.double(substr(small_size$Size, 1,  nchar(small_size$Size) - 1)))

ggplot(small_size, aes(x = Size)) + 
  geom_histogram()
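
As an alternative to keeping two separate data frames, the sizes could be brought onto one scale in a single column. The sketch below uses base R on toy strings in the three formats seen in the column. Treating 1 M as 1024 k is an assumption; use 1000 if the data follow decimal units.

```r
# Toy size strings in the three formats seen in the column.
size_raw <- c("19M", "8.7M", "201k", "Varies with device")

# Numeric part: drop the trailing unit letter; other entries become NA.
number <- suppressWarnings(as.numeric(sub("[Mk]$", "", size_raw)))

# Assumption: 1 M = 1024 k, so k values are divided by 1024 to get M.
size_mb <- ifelse(grepl("k$", size_raw), number / 1024, number)
size_mb
```

With every size in the same unit, a single histogram can cover all apps with a known size, and 'Varies with device' is carried along as NA.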