PlayStore
Introduction
Web-scraped data on roughly 10,000 Play Store apps for analysing the Android market
| Columns | Description |
|---|---|
| App | Application name |
| Category | Category the app belongs to |
| Rating | Overall user rating of the app (as when scraped) |
| Reviews | Number of user reviews for the app (as when scraped) |
| Size | Size of the app (as when scraped) |
| Installs | Number of user downloads/installs for the app (as when scraped) |
| Type | Paid or Free |
| Price | Price of the app (as when scraped) |
| Content Rating | Age group the app is targeted at - Children / Mature 21+ / Adult |
| Genres | An app can belong to multiple genres (apart from its main category). For example, a musical family game belongs to the Music, Game, and Family genres. |
| Last Updated | Date when the app was last updated on Play Store (as when scraped) |
| Current Ver | Current version of the app available on Play Store (as when scraped) |
| Android Ver | Min required Android version (as when scraped) |
Sections
The document is structured with the following sections:
- PlayStore Data
- PART 1 - Subsets
- PART 2 - Visualizations
Required Packages
The packages required for this markdown are:
library(tidyverse) # the tidyverse collection of packages (includes dplyr, ggplot2 and readr)
library(DT) # interactive JavaScript data tables
library(dplyr) # data manipulation verbs (already attached by tidyverse)
# Download the file from the web source
playstore <- read_csv("https://myxavier-my.sharepoint.com/:x:/g/personal/quinteromartinezr_xavier_edu/EcUCWmwEE-pNkZNupTrVWIgB5AjBfzuLuk53URcyrL1xgw?download=1")
#Removing space character from column names
names(playstore)[9] <- "ContentRating"
names(playstore)[11] <- "LastUpdated"
names(playstore)[12] <- "CurrentVersion"
names(playstore)[13] <- "AndroidVersion"
playstore_orig <- playstore
# Exclude observations with missing values
playstore <- na.omit(playstore)
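# Clean Up - Size (stored in megabytes)
sizes <- as.character(playstore$Size)
# Flag sizes reported in kilobytes before the unit letter is removed
is_kb <- grepl("[kK]$", sizes)
# Eliminate the unit letters; "Varies with device" has no number and becomes NA
sizes <- sub("[MkK]$", "", sizes)
sizes[sizes == "Varies with device"] <- NA
# Convert kilobyte values to megabytes (assuming 1 MB = 1024 kB) so the column has one unit
playstore$Size <- ifelse(is_kb, as.numeric(sizes) / 1024, as.numeric(sizes))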
# Clean Up - Number of Installs
inst <- as.character(playstore$Installs)
# Eliminate characters
# gsub() removes every "+" and ","; sub() strips only the first comma, so "1,000,000+" would become NA
inst <- gsub("[+,]", "", inst)
playstore$Installs <- as.numeric(inst)
# Clean Up - Android Version
version <- as.character(playstore$AndroidVersion)
version <- sub(" and up", "", version)
playstore$AndroidVersion <- as.character(version)
# Clean Up - Characters in Price column
prices <- as.character(playstore$Price)
# Eliminate characters
prices <- sub("\\$", "", prices)
playstore$Price <- as.numeric(prices)
# Convert date column from text to date format
dates <- as.character(playstore$LastUpdated)
playstore$LastUpdated <- as.Date(dates, "%B %d, %Y")
# Exclude observations with missing values
playstore <- na.omit(playstore)
PlayStore Data
Data After Clean Up
Data Before Clean Up
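To see the effect of the clean-up on individual values, the steps can be tried on a few sample strings. This is a minimal sketch, not the notebook's own code: the `raw` data frame and its values are hypothetical, the conversion of kilobyte sizes to megabytes with a factor of 1024 is an assumption of the sketch, and the date parsing assumes a locale with English month names.

```r
# Hypothetical raw values in the formats found in the scraped columns
raw <- data.frame(
  Size        = c("19M", "201k", "Varies with device"),
  Installs    = c("10,000+", "1,000,000+", "500+"),
  Price       = c("0", "$4.99", "$0.99"),
  LastUpdated = c("January 7, 2018", "August 1, 2018", "June 20, 2018"),
  stringsAsFactors = FALSE
)

# Size: strip the unit letter and express kilobyte values in megabytes.
# "Varies with device" has no numeric part and becomes NA.
is_kb    <- grepl("[kK]$", raw$Size)
size_num <- suppressWarnings(as.numeric(sub("[MkK]$", "", raw$Size)))
size_mb  <- ifelse(is_kb, size_num / 1024, size_num)

# Installs: gsub() removes every "+" and ",", including repeated commas
installs <- as.numeric(gsub("[+,]", "", raw$Installs))

# Price: drop the literal dollar sign
price <- as.numeric(sub("$", "", raw$Price, fixed = TRUE))

# Last Updated: %B matches full month names in the current locale
last_updated <- as.Date(raw$LastUpdated, format = "%B %d, %Y")

cleaned <- data.frame(size_mb, installs, price, last_updated)
print(cleaned)
```

Rows whose size is "Varies with device" end up as `NA` and are therefore dropped by the final `na.omit()` call.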
PART 1 - Subsets
Subset #1
Summary of application activity by application type
playstore %>%
  group_by(Type) %>%
  summarise(Rating = mean(Rating, na.rm = TRUE), Reviews = mean(Reviews, na.rm = TRUE),
            Installs = mean(Installs, na.rm = TRUE), Price = mean(Price, na.rm = TRUE)) %>%
  datatable() %>%
  formatCurrency('Price', currency = "$ ", interval = 3, mark = ",") %>%
  formatRound(c('Rating', 'Reviews', 'Installs'), 2)
Subset #2
Paid applications with at least 10K installs and at least 10K reviews, sorted by category
playstore %>%
  filter(Type == 'Paid', Installs >= 10000, Reviews >= 10000) %>%
  # group_by() alone does not change the displayed table, so sort by category instead
  arrange(Category) %>%
  datatable()
Mutation 1
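Most profitable paid application by category, estimated as Installs × Price
playstore %>%
  select(App, Type, Category, ContentRating, Reviews, Installs, Price) %>%
  filter(Type == 'Paid') %>%
  # Installs is a lower bound (e.g. "10,000+"), so Sales is only an estimate
  mutate(Sales = Installs * Price) %>%
  group_by(Category) %>%
  # Keep the app with the highest estimated sales, not the highest price
  filter(Sales == max(Sales, na.rm = TRUE)) %>%
  ungroup() %>%
  datatable()
Mutation 2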
Feedback ratio: number of reviews divided by number of installs
playstore %>%
  select(App, Type, ContentRating, Reviews, Installs, Price) %>%
  mutate(Feedback = Reviews / Installs) %>%
  datatable() %>%
  formatPercentage('Feedback', 2)
Mutation 3
Each application's share of the total number of installs
playstore %>%
  select(App, Type, ContentRating, Reviews, Installs, Price) %>%
  mutate(Frequency = Installs / sum(Installs)) %>%
  datatable() %>%
  formatPercentage('Frequency', 2)
PART 2 - Visualizations
Plot 1
Distribution of ratings. It is skewed to the left: most ratings are concentrated above 4 stars, with a long tail toward lower ratings
ggplot(data = playstore, aes(x = Rating)) +
  geom_bar()
Plot 2
Bar chart of application type. The majority of applications are free
ggplot(data = playstore, aes(x = Type)) +
  geom_bar()
Plot 3
Effect of application type on the relationship between rating and reviews
playstore %>%
  ggplot(aes(x = Rating, y = Reviews, color = Type)) +
  geom_point()
Plot 4
Effect of content rating on the relationship between rating and reviews
playstore %>%
  ggplot(aes(x = Rating, y = Reviews, color = ContentRating)) +
  geom_point()
Plot 5
Evaluate sales behavior for paid applications
playstore %>%
  filter(Type == 'Paid') %>%
  # Estimated sales: Installs is a lower bound, so this is approximate
  mutate(sales = Installs * Price) %>%
  ggplot(aes(x = sales, y = Installs)) +
  geom_point(alpha = .25) +
  # Each point is a single app, so the axes show per-app values rather than medians or totals
  scale_y_continuous(name = "Installs", labels = scales::comma) +
  scale_x_log10(name = "Estimated Sales", labels = scales::dollar)
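If a per-category view is wanted instead of one point per app, the same estimate can be aggregated before plotting. The following is a rough sketch under stated assumptions: the `toy` data frame and the names `by_category`, `median_installs` and `total_sales` are hypothetical stand-ins, and sales are again approximated as Installs × Price.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data standing in for the cleaned playstore table
toy <- data.frame(
  Category = c("GAME", "GAME", "TOOLS", "TOOLS", "TOOLS"),
  Type     = c("Paid", "Paid", "Paid", "Paid", "Free"),
  Installs = c(1000, 5000, 100, 10000, 50000),
  Price    = c(2.99, 0.99, 4.99, 1.99, 0),
  stringsAsFactors = FALSE
)

# One row per category: median installs and total estimated sales of paid apps
by_category <- toy %>%
  filter(Type == "Paid") %>%
  mutate(sales = Installs * Price) %>%
  group_by(Category) %>%
  summarise(median_installs = median(Installs),
            total_sales = sum(sales),
            .groups = "drop")

# One labelled point per category, with sales on a log scale
p <- ggplot(by_category,
            aes(x = total_sales, y = median_installs, label = Category)) +
  geom_point() +
  geom_text(vjust = -1) +
  scale_y_continuous(name = "Median Installs", labels = scales::comma) +
  scale_x_log10(name = "Total Sales Volume", labels = scales::dollar)

print(by_category)
```

Printing `p` draws the chart. With the real data, `toy` would be replaced by the cleaned `playstore` table.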