The data comes from a Kaggle dataset.
The goal of this project is to understand how the properties of Google Play Store apps affect their popularity among users.
'data.frame': 10841 obs. of 13 variables:
$ App : Factor w/ 9660 levels "- Free Comics - Comic Apps",..: 7206 2551 8970 8089 7272 7103 8149 5568 4926 5806 ...
$ Category : Factor w/ 34 levels "1.9","ART_AND_DESIGN",..: 2 2 2 2 2 2 2 2 2 2 ...
$ Rating : num 4.1 3.9 4.7 4.5 4.3 4.4 3.8 4.1 4.4 4.7 ...
$ Reviews : Factor w/ 6002 levels "0","1","10","100",..: 1183 5924 5681 1947 5924 1310 1464 3385 816 485 ...
$ Size : Factor w/ 462 levels "1,000+","1.0M",..: 55 30 368 102 64 222 55 118 146 120 ...
$ Installs : Factor w/ 22 levels "0","0+","1,000,000,000+",..: 8 20 13 16 11 17 17 4 4 8 ...
$ Type : Factor w/ 4 levels "0","Free","NaN",..: 2 2 2 2 2 2 2 2 2 2 ...
$ Price : Factor w/ 93 levels "$0.99","$1.00",..: 92 92 92 92 92 92 92 92 92 92 ...
$ Content.Rating: Factor w/ 7 levels "","Adults only 18+",..: 3 3 3 6 3 3 3 3 3 3 ...
$ Genres : Factor w/ 120 levels "Action","Action;Action & Adventure",..: 10 13 10 10 12 10 10 10 10 12 ...
$ Last.Updated : Factor w/ 1378 levels "1.0.19","April 1, 2016",..: 562 482 117 825 757 901 76 726 1317 670 ...
$ Current.Ver : Factor w/ 2834 levels "","0.0.0.2","0.0.1",..: 122 1020 468 2827 280 116 280 2393 1457 1431 ...
$ Android.Ver : Factor w/ 35 levels "","1.0 and up",..: 17 17 17 20 22 10 17 20 12 17 ...
We have 13 variables, and our first task is to figure out which of them should be used in our analysis.
We need variables that either affect or measure an app’s popularity and that are complete enough to derive insights from (e.g. not 90% NA values).
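The completeness criterion above can be checked with a few lines of base R. This is a minimal sketch on a toy data frame (the column `Sparse`, the values, and the 50% threshold are hypothetical, chosen only for illustration), not the check used in this analysis:

```r
# Toy data frame standing in for `dat` (hypothetical values, not the Kaggle data)
toy <- data.frame(
  Rating  = c(4.1, NA, 4.7, NA),
  Reviews = c(159, 967, 87510, 12),
  Sparse  = c(NA, NA, NA, 1)
)

# Share of missing values in each column
na_share <- colMeans(is.na(toy))

# Keep columns with at most 50% missing values (hypothetical threshold)
usable <- names(na_share)[na_share <= 0.5]

print(na_share)
print(usable)
```

Note that `is.na()` only counts true NA values. Missing values stored as text inside a factor, such as the "NaN" level of Type or the empty-string levels seen in the structure above, would need to be recoded first.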
The App name is unlikely to be relevant to our analysis, as users are generally not inclined to like a product based only on its name.
Current.Ver and Android.Ver are also set aside, since Last.Updated already shows how recently the app was updated.
Now, we can focus on the remaining 10 variables we intend to analyze:
Numeric variables:
Categorical variables:
As the structure above shows, most of the columns are stored as factors, so we need to convert the numeric ones into numeric values first.
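One pitfall is worth illustrating before converting anything: calling `as.numeric()` directly on a factor returns the internal level codes, not the numbers shown in the labels. The sketch below uses a small hypothetical factor rather than the real columns:

```r
# Toy factor mimicking a numeric column that was read in as text
f <- factor(c("159", "967", "87510", "967"))

# Pitfall: as.numeric() on a factor returns the internal level codes
codes <- as.numeric(f)

# Going through character recovers the numbers written in the labels
values <- as.numeric(as.character(f))

print(codes)
print(values)
```

The form `as.numeric(levels(f))[f]` gives the same result as `as.numeric(as.character(f))`.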
Let’s start with rating:
summary(na.omit(dat$Rating))
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.000 4.000 4.300 4.193 4.500 19.000
The maximum rating is 19, which is outside the valid rating range of 0 to 5, so we need to find the row that holds the corrupted data:
corrupted <- dat %>% filter(Rating == 19)
corrupted
App Category Rating Reviews Size
1 Life Made WI-Fi Touchscreen Photo Frame 1.9 19 3.0M 1,000+
Installs Type Price Content.Rating Genres Last.Updated
1 Free 0 Everyone February 11, 2018 1.0.19
Current.Ver Android.Ver
1 4.0 and up
head(dat, 1)
App Category Rating
1 Photo Editor & Candy Camera & Grid & ScrapBook ART_AND_DESIGN 4.1
Reviews Size Installs Type Price Content.Rating Genres
1 159 19M 10,000+ Free 0 Everyone Art & Design
Last.Updated Current.Ver Android.Ver
1 January 7, 2018 1.0.0 4.0.3 and up
Comparing the corrupted row with a normal one, it is clear that the corrupted row is missing values in the Category and Genres columns, so the remaining values all sit in the wrong columns. Since we do not have the correct data for this app, I decided to remove the row.
dat <- subset(dat, Category != "1.9")
summary(na.omit(dat$Rating))
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.000 4.000 4.300 4.192 4.500 5.000
ggplot(dat, aes(x = Rating)) +
geom_histogram()
Now we get the correct summary and distribution of rating.
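The inspection above located one shifted row by its impossible rating. A more general check might flag any row whose Category parses as a number or whose Rating falls outside 0 to 5. The following is a sketch on a hypothetical toy data frame, assuming Category has already been converted to character:

```r
# Toy data with one shifted row (hypothetical values)
toy <- data.frame(
  App      = c("App A", "App B", "App C"),
  Category = c("ART_AND_DESIGN", "1.9", "BUSINESS"),
  Rating   = c(4.1, 19, 3.9),
  stringsAsFactors = FALSE
)

# A category label should not parse as a plain number
shifted <- grepl("^[0-9]+(\\.[0-9]+)?$", toy$Category)

# Ratings are expected to lie between 0 and 5; NA ratings are not flagged
out_of_range <- !is.na(toy$Rating) & (toy$Rating < 0 | toy$Rating > 5)

suspect <- toy[shifted | out_of_range, ]
clean   <- toy[!(shifted | out_of_range), ]

print(suspect)
```

Rows with a missing Rating are kept in `clean` here, which matches the use of `na.omit()` only at summary time in the analysis above.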
Next we can take a look at the price:
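# Strip the dollar sign and convert to numeric; na.omit() is dropped because it would shorten the column if any NA appeared
dat <- dat %>% mutate(Price = as.double(gsub("\\$", "", as.character(Price))))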
summary(dat$Price)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.000 0.000 0.000 1.027 0.000 400.000
ggplot(dat, aes(x = Price)) +
geom_histogram()
The third quartile of Price is 0, so at least 75% of the apps in the dataset are free. Price therefore has little variety in this dataset, and we might focus on other independent variables to find patterns.
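To put a number on how many apps are free, and to look at paid apps on their own, something like the following could be used. The price strings below are hypothetical and only mimic the format of the Price column:

```r
# Toy price strings in the same format as the Price column (hypothetical values)
price_raw <- c("0", "$4.99", "0", "0", "$0.99")

# Remove the literal dollar sign, then convert to numeric
price <- as.numeric(gsub("$", "", price_raw, fixed = TRUE))

# Share of free apps
share_free <- mean(price == 0)

# Prices of paid apps only, which can be summarised or plotted separately
paid <- price[price > 0]

print(share_free)
print(summary(paid))
```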
Next is Reviews:
# Convert the factor labels (not the internal level codes) to numbers
dat <- dat %>% mutate(Reviews = as.numeric(as.character(Reviews)))
summary(dat$Reviews)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0 38 2094 444153 54776 78158306
ggplot(dat, aes(x = Reviews)) +
geom_histogram()
Size is trickier: it contains a non-numeric level ('Varies with device'), and the numeric values use different units (M and k), so we need to handle these cases separately.
length(dat$Size[dat$Size == 'Varies with device'])
[1] 1695
Numeric_size <- subset(dat, Size != 'Varies with device')
Numeric_size <- Numeric_size %>% mutate(Size = as.character(Size))
large_size <- Numeric_size %>% filter(str_detect(Size, 'M'))
small_size <- Numeric_size %>% filter(str_detect(Size, 'k'))
head(large_size, 3)
App Category Rating
1 Photo Editor & Candy Camera & Grid & ScrapBook ART_AND_DESIGN 4.1
2 Coloring book moana ART_AND_DESIGN 3.9
3 U Launcher Lite – FREE Live Cool Themes, Hide Apps ART_AND_DESIGN 4.7
Reviews Size Installs Type Price Content.Rating
1 159 19M 10,000+ Free 0 Everyone
2 967 14M 500,000+ Free 0 Everyone
3 87510 8.7M 5,000,000+ Free 0 Everyone
Genres Last.Updated Current.Ver Android.Ver
1 Art & Design January 7, 2018 1.0.0 4.0.3 and up
2 Art & Design;Pretend Play January 15, 2018 2.0.0 4.0.3 and up
3 Art & Design August 1, 2018 1.2.4 4.0.3 and up
head(small_size, 3)
App Category Rating Reviews Size
1 Restart Navigator AUTO_AND_VEHICLES 4.0 1403 201k
2 Plugin:AOT v5.0 BUSINESS 3.1 4034 23k
3 Hangouts Dialer - Call Phones COMMUNICATION 4.0 122498 79k
Installs Type Price Content.Rating Genres Last.Updated
1 100,000+ Free 0 Everyone Auto & Vehicles August 26, 2014
2 100,000+ Free 0 Everyone Business September 11, 2015
3 10,000,000+ Free 0 Everyone Communication September 2, 2015
Current.Ver Android.Ver
1 1.0.1 2.2 and up
2 3.0.1.11 (Build 311) 2.2 and up
3 0.1.100944346 4.0.3 and up
As the output shows, the data set is now split into two subsets: large_size holds the apps whose size is given in M, and small_size holds those given in k.
large_size <- large_size %>% mutate(Size = as.double(substr(large_size$Size, 1, nchar(large_size$Size) - 1)))
ggplot(large_size, aes(x = Size)) +
geom_histogram()
small_size <- small_size %>% mutate(Size = as.double(substr(small_size$Size, 1, nchar(small_size$Size) - 1)))
ggplot(small_size, aes(x = Size)) +
geom_histogram()
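The two subsets above keep their original units, so their histograms are not directly comparable. If a single Size column is wanted, one option is to convert everything to the same unit. The helper `size_to_mb` below is hypothetical and not part of the analysis above, and it assumes that 1M equals 1024k, which the dataset does not state:

```r
# Toy size strings in the same format as the Size column (hypothetical values)
size_raw <- c("19M", "8.7M", "201k", "23k", "Varies with device")

# Hypothetical helper: convert "19M" / "201k" style strings to a number in M
size_to_mb <- function(x) {
  x <- as.character(x)
  is_mb <- grepl("M$", x)
  is_kb <- grepl("k$", x)
  # Drop the trailing unit letter; anything else non-numeric becomes NA
  num <- suppressWarnings(as.numeric(sub("[Mk]$", "", x)))
  out <- rep(NA_real_, length(x))
  out[is_mb] <- num[is_mb]
  out[is_kb] <- num[is_kb] / 1024  # assumption: 1M = 1024k
  out
}

size_mb <- size_to_mb(size_raw)
print(size_mb)
```

Values such as 'Varies with device' come back as NA, so they can be dropped or handled separately, as in the split above.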