PlayStore

Introduction

Web-scraped data on 10k Play Store apps for analysing the Android market.

Column          Description
App             Application name
Category        Category the app belongs to
Rating          Overall user rating of the app (at the time of scraping)
Reviews         Number of user reviews for the app (at the time of scraping)
Size            Size of the app (at the time of scraping)
Installs        Number of user downloads/installs for the app (at the time of scraping)
Type            Paid or Free
Price           Price of the app (at the time of scraping)
Content Rating  Age group the app is targeted at: Children / Mature 21+ / Adult
Genres          An app can belong to multiple genres (apart from its main category). For example, a musical family game belongs to the Music, Game, and Family genres.
Last Updated    Date when the app was last updated on Play Store (at the time of scraping)
Current Ver     Current version of the app available on Play Store (at the time of scraping)
Android Ver     Minimum required Android version (at the time of scraping)

Sections

The document is structured with the following sections:

  • Data Sample
  • PART 1 - Subsets
  • PART 2 - Visualizations

Required Packages

The packages required for this markdown are:

library(tidyverse) # the tidyverse collection of packages (includes dplyr, ggplot2, readr)
library(DT)        # interactive JavaScript data tables
library(dplyr)     # already attached by tidyverse; listed explicitly
#download file from web source
playstore <- read_csv("https://myxavier-my.sharepoint.com/:x:/g/personal/quinteromartinezr_xavier_edu/EcUCWmwEE-pNkZNupTrVWIgB5AjBfzuLuk53URcyrL1xgw?download=1")

#Removing space character from column names
names(playstore)[9] <- "ContentRating"
names(playstore)[11] <- "LastUpdated"
names(playstore)[12] <- "CurrentVersion"
names(playstore)[13] <- "AndroidVersion"
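
Renaming by position depends on the column order of the CSV staying fixed. As a minimal sketch of an alternative, the same four columns could be renamed by name with `dplyr::rename()`. The `toy` table below is made up for illustration and only mirrors the column names that contain spaces.

```r
library(dplyr)

# Hypothetical one-row table mirroring the column names that contain spaces
toy <- tibble(
  App = "Example App",
  `Content Rating` = "Everyone",
  `Last Updated` = "January 7, 2018",
  `Current Ver` = "1.0.0",
  `Android Ver` = "4.0.3 and up"
)

# Rename by name (new = old), so the result does not depend on column position
toy_renamed <- toy %>%
  rename(
    ContentRating  = `Content Rating`,
    LastUpdated    = `Last Updated`,
    CurrentVersion = `Current Ver`,
    AndroidVersion = `Android Ver`
  )

names(toy_renamed)
```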

playstore_orig <- playstore

# Exclude observations with missing values
playstore <- na.omit(playstore)

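# Clean Up -  Size (stored in megabytes)
sizes <- as.character(playstore$Size)
# Flag sizes reported in kilobytes before the unit suffix is removed
in_kb <- grepl("[kK]$", sizes)
# Eliminate the unit suffix; "Varies with device" has no numeric size
sizes <- sub("[MkK]$", "", sizes)
sizes[sizes %in% "Varies with device"] <- NA
sizes <- as.numeric(sizes)
# Convert kilobytes to megabytes so the whole column uses one unit
sizes[in_kb] <- sizes[in_kb] / 1024
playstore$Size <- sizes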
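
Size strings mix kilobyte (`k`) and megabyte (`M`) suffixes, so removing the suffix is only safe if the values end up in a common unit. The sketch below shows one way to do that on a few sample strings. The helper name `parse_size_mb` is hypothetical, and the factor of 1024 kB per MB is an assumption (1000 may be the intended convention).

```r
# Hypothetical helper: parse Play Store size strings into megabytes
parse_size_mb <- function(x) {
  x <- as.character(x)
  in_kb <- grepl("[kK]$", x)  # values reported in kilobytes
  # Strip the unit suffix; text without a number (e.g. "Varies with device") becomes NA
  num <- suppressWarnings(as.numeric(sub("[MkK]$", "", x)))
  num[in_kb] <- num[in_kb] / 1024  # assumption: 1 MB = 1024 kB
  num
}

sizes_mb <- parse_size_mb(c("19M", "512k", "Varies with device"))
sizes_mb
```

Here "512k" is half a megabyte, while a plain suffix strip would have stored it as 512.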

# Clean Up -  Number of Installs
inst <- as.character(playstore$Installs)
# Remove the "+" suffix and every thousands separator
# (gsub, not sub, so values such as "1,000,000+" lose all their commas)
inst <- gsub("[+,]", "", inst)
playstore$Installs <- as.numeric(inst)
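
Install counts of one million or more contain two or more commas, which matters because `sub()` replaces only the first match while `gsub()` replaces all of them. A short illustration on a single sample value:

```r
inst_example <- "1,000,000+"

# sub() removes only the first comma, so a second one is left behind
first_only <- sub(",", "", inst_example)
first_only

# gsub() with a character class removes every "+" and ","
cleaned <- gsub("[+,]", "", inst_example)
installs_num <- as.numeric(cleaned)
installs_num
```

A string that still contains a comma converts to `NA` under `as.numeric()`, so such rows would be lost in a later `na.omit()` call.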

# Clean Up -  Android Version
version <- as.character(playstore$AndroidVersion)
version <-  sub(" and up", "", version)
playstore$AndroidVersion <- as.character(version)

# Clean Up - Characters in Price column 
prices <- as.character(playstore$Price)
# Eliminate characters
prices <- sub("\\$", "", prices)
playstore$Price <- as.numeric(prices)

# Convert date column from text to date format
dates <- as.character(playstore$LastUpdated)
playstore$LastUpdated <- as.Date(dates, "%B %d, %Y")
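
The format string `"%B %d, %Y"` expects a full month name, a day, a comma, and a four-digit year. Month names are matched in the session's locale, so the sketch below temporarily switches `LC_TIME` to `"C"` (English names). The sample date string is made up for illustration.

```r
# %B matches month names in the current locale; "C" gives English names
old_locale <- Sys.getlocale("LC_TIME")
Sys.setlocale("LC_TIME", "C")

parsed <- as.Date("January 7, 2018", format = "%B %d, %Y")
iso <- format(parsed, "%Y-%m-%d")
iso

# Restore the previous locale setting
Sys.setlocale("LC_TIME", old_locale)
```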

# Exclude observations whose values could not be converted (e.g. Size "Varies with device")
playstore <- na.omit(playstore)

PlayStore Data

Data After Clean Up

Data Before Clean Up

PART 1 - Subsets

Subset #1

Summary of application activity by application type

playstore %>%
  group_by(Type) %>%
  summarise(Rating = mean(Rating, na.rm = TRUE), Reviews = mean(Reviews, na.rm = TRUE),
            Installs = mean(Installs, na.rm = TRUE), Price = mean(Price, na.rm = TRUE)) %>%
  datatable() %>% 
  formatCurrency('Price', currency = "$ ", interval = 3, mark = ",") %>%
  formatRound(c('Rating','Reviews', 'Installs'), 2)

Subset #2

Paid applications with at least 10,000 installs and at least 10,000 reviews, sorted by category

playstore %>%
  filter(Type == 'Paid', Installs >= 10000, Reviews >= 10000) %>%
  arrange(Category) %>%   # group_by() alone does not change how the table is displayed
  datatable()

Mutation 1

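Most profitable paid application by category, using Installs * Price as an estimate of sales

playstore %>%
  select(App, Type, Category, ContentRating, Reviews, Installs, Price) %>%
  filter(Type == 'Paid') %>%
  mutate(Sales = Installs * Price) %>%
  group_by(Category) %>%
  # Keep the app with the highest estimated sales in each category
  filter(Sales == max(Sales, na.rm = TRUE)) %>%
  ungroup() %>%
  datatable()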
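
The most expensive app in a category is not necessarily the one with the highest estimated sales. The sketch below ranks by `Installs * Price` within each category on a small made-up table. The `toy` data is hypothetical, and because install counts are reported as lower bounds, the product is only a rough estimate.

```r
library(dplyr)

# Made-up apps, for illustration only
toy <- tibble(
  App      = c("A", "B", "C", "D"),
  Category = c("GAME", "GAME", "TOOLS", "TOOLS"),
  Installs = c(100, 100000, 5000, 1000),
  Price    = c(9.99, 0.99, 1.99, 4.99)
)

top_by_sales <- toy %>%
  mutate(Sales = Installs * Price) %>%           # estimated sales per app
  group_by(Category) %>%
  filter(Sales == max(Sales, na.rm = TRUE)) %>%  # top earner per category
  ungroup()

top_by_sales
```

In this toy table the top earners are B and C, whereas selecting on the maximum price would have returned A and D.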

Mutation 2

Feedback ratio: number of reviews divided by number of installs

playstore %>%
  select(App, Type, ContentRating, Reviews, Installs, Price) %>%
  mutate(Feedback = Reviews / Installs) %>%
  datatable() %>%
  formatPercentage('Feedback', 2)
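
If any app reports zero installs, the ratio is undefined (`0 / 0` is `NaN`, and a positive number divided by zero is `Inf`). A minimal sketch of guarding against that case, on made-up data, could look like this:

```r
library(dplyr)

# Made-up apps, for illustration only; app B has no installs
toy <- tibble(
  App      = c("A", "B", "C"),
  Reviews  = c(50, 0, 200),
  Installs = c(1000, 0, 10000)
)

toy_feedback <- toy %>%
  # Compute the ratio only where it is defined; otherwise record a missing value
  mutate(Feedback = if_else(Installs > 0, Reviews / Installs, NA_real_))

toy_feedback
```

Since install counts are lower bounds (the raw values end in "+"), the ratio is best read as an upper bound on the true share of installers who left a review.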

Mutation 3

Share of total installs per application: each app's installs divided by the sum of installs across all apps

playstore %>%
  select(App, Type, ContentRating, Reviews, Installs, Price) %>%
  mutate(Frequency = Installs / sum(Installs)) %>%
  datatable() %>%
  formatPercentage('Frequency', 2)

PART 2 - Visualizations

Plot 1

Bar chart of the distribution of ratings. The distribution is left-skewed: most ratings are concentrated above 4 stars, with a tail toward lower ratings.

ggplot(data = playstore, aes(x = Rating)) +
  geom_bar()
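
A distribution that piles up at high values and trails off toward low values is left-skewed: the mean sits below the median and the skewness is negative. The sketch below checks this on a handful of made-up ratings. The `ratings` vector is hypothetical and not taken from the data set.

```r
# Made-up ratings: most values sit above 4, a few trail off below
ratings <- c(4.5, 4.4, 4.3, 4.6, 4.2, 4.7, 4.1, 3.0, 2.0, 4.4)

rating_mean   <- mean(ratings)    # pulled down by the low values
rating_median <- median(ratings)

# Sample skewness (third standardized moment); negative means a longer left tail
skewness <- mean((ratings - rating_mean)^3) / sd(ratings)^3

rating_mean < rating_median
skewness < 0
```

The same two checks could be run on `playstore$Rating` to confirm the shape seen in the plot.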

Plot 2

Bar chart of the number of applications by type. The majority of applications are free.

ggplot(data = playstore, aes(x = Type)) +
  geom_bar()

Plot 3

Effect of application type on the relationship between rating and number of reviews

playstore %>%
  ggplot(aes(x = Rating, y = Reviews, color = Type)) +
    geom_point()

Plot 4

Effect of content rating on the relationship between rating and number of reviews

playstore %>%
  ggplot(aes(x = Rating, y = Reviews, color = ContentRating)) +
  geom_point()

Plot 5

Estimated sales (Installs * Price) versus installs for paid applications

playstore %>%
  filter(Type == 'Paid') %>%
  mutate(sales = Installs * Price) %>%
  ggplot(aes(x = sales, y = Installs)) +
  geom_point(alpha = .25)  +
  scale_y_continuous(name = "Installs", labels = scales::comma) +
  scale_x_log10(name = "Estimated Sales (Installs x Price)", labels = scales::dollar)

Reinaldo Quintero

2019-09-24