App Store Strategy Game EDA
Mobile Strategy Games
Background
The mobile games industry is worth billions of dollars, and companies spend vast amounts on developing and marketing games for an equally large market. Strategy is one of the most popular mobile game genres. Using this data set, we explore which characteristics the most popular Mobile Strategy Games on the Apple App Store share, and lay the groundwork for predicting popularity. Developers and companies can use these insights when planning their own Mobile Strategy Games.
Read Data
The data we use is the 17K Mobile Strategy Games data set, which covers 17,007 strategy games on the Apple App Store. It was collected on 3 August 2019 using the iTunes API and the App Store sitemap.
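# Packages used in this step: dplyr for %>% and glimpse(), janitor for clean_names()
library(dplyr)
library(janitor)
games <- read.csv("data/appstore_games.csv")
# Convert the variable names to snake_case (spaces become underscores)
games <- games %>%
  clean_names("snake")
glimpse(games)
## Observations: 17,007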
## Variables: 18
## $ url <fct> https://apps.apple.com/us/app/sud...
## $ id <int> 284921427, 284926400, 284946595, ...
## $ name <fct> "Sudoku", "Reversi", "Morocco", "...
## $ subtitle <fct> "", "", "", "", "", "Original bra...
## $ icon_url <fct> https://is2-ssl.mzstatic.com/imag...
## $ average_user_rating <dbl> 4.0, 3.5, 3.0, 3.5, 3.5, 3.0, 2.5...
## $ user_rating_count <int> 3553, 284, 8376, 190394, 28, 47, ...
## $ price <dbl> 2.99, 1.99, 0.00, 0.00, 2.99, 0.0...
## $ in_app_purchases <fct> "", "", "", "", "", "1.99", "", "...
## $ description <fct> "Join over 21,000,000 of our fans...
## $ developer <fct> "Mighty Mighty Good Games", "Kiss...
## $ age_rating <fct> 4+, 4+, 4+, 4+, 4+, 4+, 4+, 4+, 4...
## $ languages <fct> "DA, NL, EN, FI, FR, DE, IT, JA, ...
## $ size <dbl> 15853568, 12328960, 674816, 21552...
## $ primary_genre <fct> Games, Games, Games, Games, Games...
## $ genres <fct> "Games, Strategy, Puzzle", "Games...
## $ original_release_date <fct> 11/07/2008, 11/07/2008, 11/07/200...
## $ current_version_release_date <fct> 30/05/2017, 17/05/2018, 5/09/2017...
The data includes the following variables:
url: the link to the app on the App Store.
id: the ID of the app in the App Store.
name: the name of the app.
subtitle: the secondary text under the name.
icon_url: the URL of the app's icon image.
average_user_rating: the average user rating of the app, rounded to the nearest 0.5.
user_rating_count: the number of user ratings the app has received internationally.
price: the price of the app in the App Store (USD).
in_app_purchases: the prices of the available in-app purchases.
description: a short description of the app.
developer: the team that develops the app.
age_rating: the age rating of the app.
languages: the languages the app supports.
size: the size of the app (bytes).
primary_genre: the main genre of the app.
genres: all genres of the app.
original_release_date: when the app was released.
current_version_release_date: when the app was last updated.
Data Preprocess
N/A Values
First, we check whether the data contains any N/A values.
## url id
## 0 0
## name subtitle
## 0 0
## icon_url average_user_rating
## 0 9446
## user_rating_count price
## 9446 24
## in_app_purchases description
## 0 0
## developer age_rating
## 0 0
## languages size
## 0 1
## primary_genre genres
## 0 0
## original_release_date current_version_release_date
## 0 0
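The chunk that produced these counts is not shown in the source. A minimal sketch of one common way to count N/A values per column follows, using a small made-up data frame (`toy` and its values are hypothetical and only stand in for `games`):

```r
# Hypothetical toy data frame standing in for `games`
toy <- data.frame(
  average_user_rating = c(4.0, NA, 3.5, NA),
  price               = c(0.99, 0, NA, 0),
  name                = c("A", "B", "C", "D"),
  stringsAsFactors    = FALSE
)

# is.na() returns a logical matrix; colSums() counts the TRUE values per column
na_counts <- colSums(is.na(toy))
print(na_counts)
```

On the real data the same idea would be `colSums(is.na(games))`.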
The check shows N/A values in average_user_rating, user_rating_count, price, and size, although the counts for the last two (24 and 1) are negligible. Rather than dropping these rows and losing data, we can fill the N/A values with either the mean or the median of each variable. In this case, we use the mean.
avg_missing <- apply(games[, c("average_user_rating", "user_rating_count", "price", "size")],
                     2,
                     mean,
                     na.rm = TRUE)
avg_missing
## average_user_rating user_rating_count price
## 4.060905e+00 3.306531e+03 8.134187e-01
## size
## 1.157064e+08
Because the mean of average_user_rating (about 4.06) is not a multiple of 0.5 like the recorded ratings, we need to round it first to keep the data tidy.
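The rounding step itself is not shown in the source. One possible way to round a value to the nearest 0.5 is sketched below; `round_to_step` is a hypothetical helper name, not something taken from the original analysis:

```r
# Round a value to the nearest multiple of `step` (0.5 matches the rating scale)
round_to_step <- function(x, step = 0.5) {
  round(x / step) * step
}

mean_rating <- 4.060905                      # mean reported in the output above
rounded_rating <- round_to_step(mean_rating)
print(rounded_rating)
```

Dividing by the step, rounding to a whole number, and multiplying back keeps the filled-in value on the same 0.5 grid as the observed ratings.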
After finding the mean of each variable, we use it to fill the N/A values of that variable.
games2 <- games %>%
  mutate(average_user_rating = ifelse(is.na(average_user_rating), avg_missing[1], average_user_rating),
         user_rating_count = ifelse(is.na(user_rating_count), avg_missing[2], user_rating_count),
         price = ifelse(is.na(price), avg_missing[3], price),
         size = ifelse(is.na(size), avg_missing[4], size))
Then we check that no N/A values remain.
## url id
## 0 0
## name subtitle
## 0 0
## icon_url average_user_rating
## 0 0
## user_rating_count price
## 0 0
## in_app_purchases description
## 0 0
## developer age_rating
## 0 0
## languages size
## 0 0
## primary_genre genres
## 0 0
## original_release_date current_version_release_date
## 0 0
There are no more N/A values in our data.
Adding Variables
Now that the data is clean, we review the variables and decide which of them are worth using as predictor variables for our model.
## Observations: 17,007
## Variables: 18
## $ url <fct> https://apps.apple.com/us/app/sud...
## $ id <int> 284921427, 284926400, 284946595, ...
## $ name <fct> "Sudoku", "Reversi", "Morocco", "...
## $ subtitle <fct> "", "", "", "", "", "Original bra...
## $ icon_url <fct> https://is2-ssl.mzstatic.com/imag...
## $ average_user_rating <dbl> 4.0, 3.5, 3.0, 3.5, 3.5, 3.0, 2.5...
## $ user_rating_count <dbl> 3553.000, 284.000, 8376.000, 1903...
## $ price <dbl> 2.99, 1.99, 0.00, 0.00, 2.99, 0.0...
## $ in_app_purchases <fct> "", "", "", "", "", "1.99", "", "...
## $ description <fct> "Join over 21,000,000 of our fans...
## $ developer <fct> "Mighty Mighty Good Games", "Kiss...
## $ age_rating <fct> 4+, 4+, 4+, 4+, 4+, 4+, 4+, 4+, 4...
## $ languages <fct> "DA, NL, EN, FI, FR, DE, IT, JA, ...
## $ size <dbl> 15853568, 12328960, 674816, 21552...
## $ primary_genre <fct> Games, Games, Games, Games, Games...
## $ genres <fct> "Games, Strategy, Puzzle", "Games...
## $ original_release_date <fct> 11/07/2008, 11/07/2008, 11/07/200...
## $ current_version_release_date <fct> 30/05/2017, 17/05/2018, 5/09/2017...
We can make the data easier to classify by transforming it as follows.
games2 <- games2 %>%
  # Convert size from bytes to megabytes (MB)
  mutate(size = round(size / 1000000, 2)) %>%
  # Create a new variable that groups apps by size (thresholds are in MB)
  mutate(size_group = ifelse(size <= 1000, "1 GB and under",
                      ifelse(size <= 2000, "1 GB to 2 GB",
                      ifelse(size <= 3000, "2 GB to 3 GB",
                      ifelse(size <= 4000, "3 GB to 4 GB", "Over 4 GB"))))) %>%
  # Simplify the in_app_purchases variable to Yes/No
  mutate(in_app_purchases = ifelse(in_app_purchases != "", "Yes", "No")) %>%
  # Label an app "unpopular" only if it has both few ratings and a low average rating
  mutate(popularity = ifelse(user_rating_count <= 1000 & average_user_rating < 3.5, "unpopular", "popular")) %>%
  # Turn the remaining factor variables into character, because not all of them need to be factors
  mutate_if(is.factor, as.character)
# lubridate provides dmy(); stringr provides str_count()
library(lubridate)
library(stringr)
# Parse the day/month/year strings into Date objects
games2$original_release_date <- dmy(games2$original_release_date)
games2$current_version_release_date <- dmy(games2$current_version_release_date)
# Create new variables: the release year and the number of days between the original release and the latest update
games2$release_year <- as.numeric(format(games2$original_release_date, "%Y"))
games2$day_since_updated <- as.numeric(games2$current_version_release_date - games2$original_release_date)
# Classify that release-to-update gap into 3 categories
games2 <- games2 %>%
  mutate(update = ifelse(day_since_updated <= 90, "recently updated",
                  ifelse(day_since_updated <= 600, "a while since update", "not updated")))
# Count the languages and the genres of each app (number of commas plus one)
games2$lang_count <- str_count(games2$languages, ",") + 1
games2$genres_count <- str_count(games2$genres, ",") + 1
# Turn the categorical variables into factors to ease the classification
games2$size_group <- as.factor(games2$size_group)
games2$in_app_purchases <- as.factor(games2$in_app_purchases)
games2$popularity <- as.factor(games2$popularity)
games2$age_rating <- as.factor(games2$age_rating)
games2$primary_genre <- as.factor(games2$primary_genre)
games2$update <- as.factor(games2$update)
# Check our data frame
str(games2)
## 'data.frame': 17007 obs. of 25 variables:
## $ url : chr "https://apps.apple.com/us/app/sudoku/id284921427" "https://apps.apple.com/us/app/reversi/id284926400" "https://apps.apple.com/us/app/morocco/id284946595" "https://apps.apple.com/us/app/sudoku-free/id285755462" ...
## $ id : int 284921427 284926400 284946595 285755462 285831220 286210009 286313771 286363959 286566987 286682679 ...
## $ name : chr "Sudoku" "Reversi" "Morocco" "Sudoku (Free)" ...
## $ subtitle : chr "" "" "" "" ...
## $ icon_url : chr "https://is2-ssl.mzstatic.com/image/thumb/Purple127/v4/7d/23/c6/7d23c660-aba8-308a-05c0-19385a377c0e/source/512x512bb.jpg" "https://is4-ssl.mzstatic.com/image/thumb/Purple128/v4/f7/e8/10/f7e810c8-72b4-cd85-e2d3-fbcb1e3ef381/source/512x512bb.jpg" "https://is5-ssl.mzstatic.com/image/thumb/Purple118/v4/98/b2/41/98b241cc-29b7-5f67-0060-1e030f35562f/source/512x512bb.jpg" "https://is3-ssl.mzstatic.com/image/thumb/Purple117/v4/64/da/aa/64daaaa4-40b5-9e9f-9d60-b936b5d2f3ca/source/512x512bb.jpg" ...
## $ average_user_rating : num 4 3.5 3 3.5 3.5 3 2.5 2.5 2.5 2.5 ...
## $ user_rating_count : num 3553 284 8376 190394 28 ...
## $ price : num 2.99 1.99 0 0 2.99 0 0 0.99 0 0 ...
## $ in_app_purchases : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 2 1 1 1 1 ...
## $ description : chr "Join over 21,000,000 of our fans and download one of our Sudoku games today!\\n\\nMakers of the Best Sudoku Gam"| __truncated__ "The classic game of Reversi, also known as Othello, is a much-loved strategy board game. It is often described "| __truncated__ "Play the classic strategy game Othello (also known as Reversi) on your iPhone or iPod Touch. The object is to f"| __truncated__ "Top 100 free app for over a year.\\nRated \"Best Sudoku Game of the Year\" by Apple.\\nRated #9 Game of the Yea"| __truncated__ ...
## $ developer : chr "Mighty Mighty Good Games" "Kiss The Machine" "Bayou Games" "Mighty Mighty Good Games" ...
## $ age_rating : Factor w/ 4 levels "12+","17+","4+",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ languages : chr "DA, NL, EN, FI, FR, DE, IT, JA, KO, NB, PL, PT, RU, ZH, ES, SV, ZH" "EN" "EN" "DA, NL, EN, FI, FR, DE, IT, JA, KO, NB, PL, PT, RU, ZH, ES, SV, ZH" ...
## $ size : num 15.85 12.33 0.67 21.55 34.69 ...
## $ primary_genre : Factor w/ 21 levels "Book","Business",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ genres : chr "Games, Strategy, Puzzle" "Games, Strategy, Board" "Games, Board, Strategy" "Games, Strategy, Puzzle" ...
## $ original_release_date : Date, format: "2008-07-11" "2008-07-11" ...
## $ current_version_release_date: Date, format: "2017-05-30" "2018-05-17" ...
## $ size_group : Factor w/ 5 levels "1 GB and under",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ popularity : Factor w/ 2 levels "popular","unpopular": 1 1 1 1 1 2 2 2 2 2 ...
## $ release_year : num 2008 2008 2008 2008 2008 ...
## $ day_since_updated : num 3245 3597 3343 3233 3656 ...
## $ update : Factor w/ 3 levels "a while since update",..: 2 2 2 2 2 2 2 2 2 3 ...
## $ lang_count : num 17 1 1 17 15 1 1 1 1 1 ...
## $ genres_count : num 3 3 3 3 4 4 4 3 4 3 ...
This is how the data set looks after adding the new variables, which give us more candidate predictor variables.
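The nested ifelse() calls used to build size_group can also be expressed with base R's cut(). This is only an alternative sketch, not the author's method, and the sizes below are made-up values in MB:

```r
# Made-up app sizes in MB, chosen to sit on and around the thresholds
size_mb <- c(15.85, 999.99, 1000, 1000.01, 2500, 3999, 4200)

size_group <- cut(
  size_mb,
  breaks = c(-Inf, 1000, 2000, 3000, 4000, Inf),
  labels = c("1 GB and under", "1 GB to 2 GB", "2 GB to 3 GB",
             "3 GB to 4 GB", "Over 4 GB"),
  right = TRUE  # intervals are (lower, upper], matching the <= comparisons
)
print(table(size_group))
```

cut() returns a factor directly and keeps all five levels even when a group is empty, so a separate as.factor() call would not be needed for this variable.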
Data Exploration
Top 10 Free/Paid Strategy Games
The following tables show the 10 Mobile Strategy Games with the most user ratings, split into free and paid games.
Free Games
# DT provides datatable()
library(DT)
free_app <- games2 %>%
  filter(price == 0) %>%
  group_by(genres, in_app_purchases, name, icon_url) %>%
  summarize(rate_count = max(user_rating_count),
            price = max(price)) %>%
  # Drop the grouping so the top 10 is taken over all free games, not per group
  ungroup() %>%
  arrange(desc(rate_count))
free_app %>%
  # Pick the top 10 while rate_count is still numeric, then format it for display
  top_n(10, wt = rate_count) %>%
  mutate(rate_count = format(rate_count, big.mark = ",")) %>%
  select(name, genres, in_app_purchases, rate_count) %>%
  datatable(class = "nowrap hover row-border", escape = FALSE,
            options = list(dom = 't', scrollX = TRUE))
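When building a "top N" table like this, it helps to rank on the numeric counts and add thousands separators only at the end, since formatted values are strings. A base R sketch of that order of operations, using made-up app names and counts (the `apps` data frame is hypothetical):

```r
# Made-up apps and rating counts for illustration
apps <- data.frame(
  name             = c("A", "B", "C", "D", "E"),
  rate_count       = c(950, 12000, 3553, 190394, 28),
  stringsAsFactors = FALSE
)

# 1. Rank on the numeric column, largest first
ranked <- apps[order(-apps$rate_count), ]
# 2. Keep the top 3 rows
top3 <- head(ranked, 3)
# 3. Only now convert the counts to display strings with thousands separators
top3$rate_count <- format(top3$rate_count, big.mark = ",", trim = TRUE)
print(top3)
```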
Paid Games
paid_app <- games2 %>%
  filter(price != 0) %>%
  group_by(genres, in_app_purchases, name, icon_url) %>%
  summarize(rate_count = max(user_rating_count),
            price = max(price)) %>%
  # Drop the grouping so the top 10 is taken over all paid games, not per group
  ungroup() %>%
  arrange(desc(rate_count))
paid_app %>%
  # Pick the top 10 while rate_count is still numeric, then format it for display
  top_n(10, wt = rate_count) %>%
  mutate(rate_count = format(rate_count, big.mark = ",")) %>%
  select(name, genres, in_app_purchases, rate_count, price) %>%
  datatable(class = "nowrap hover row-border", escape = FALSE,
            options = list(dom = 't', scrollX = TRUE))
Apps Size
In this section we check whether the size of an app affects its ratings.
# ggplot2 provides the plotting functions used below
library(ggplot2)
games2 %>%
  mutate(price = ifelse(price == 0 | is.na(price), "FREE APPS", "PAID APPS")) %>%
  group_by(average_user_rating, size_group, price) %>%
  summarize(total = n()) %>%
  ggplot(aes(x = size_group, y = total, fill = as.factor(average_user_rating))) +
  # position = "fill" rescales every bar to 100%, so the bars show shares, not counts
  geom_col(position = "fill") +
  coord_flip(expand = FALSE) +
  theme(legend.position = "bottom",
        plot.title = element_text(face = "bold", size = 25, hjust = 0.5, vjust = 0.5),
        plot.background = element_rect(fill = "white"),
        strip.text = element_text(colour = "white", face = "bold", size = 15),
        strip.background = element_rect(fill = "gray60"),
        axis.title = element_blank()) +
  labs(title = "Apps Size vs Reviews",
       fill = "User Rating") +
  facet_wrap(~ price, ncol = 1)
The chart suggests that among free apps, larger apps tend to receive lower ratings. Paid apps behave differently: large paid apps still receive decent ratings.
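Because the bars are drawn with position = "fill", each one shows the share of every rating within a size group rather than the raw counts. A base R sketch of that normalization, using a made-up count table (the numbers are invented for illustration and do not come from the data set):

```r
# Made-up counts of apps per size group (rows) and rating (columns)
counts <- matrix(
  c(30, 50, 20,
    10, 10, 20),
  nrow = 2, byrow = TRUE,
  dimnames = list(size_group = c("1 GB and under", "1 GB to 2 GB"),
                  rating     = c("3.5", "4", "4.5"))
)

# margin = 1 divides each row by its row total, so every row sums to 1
shares <- prop.table(counts, margin = 1)
print(round(shares, 2))
```

Shares hide how many apps each group contains, so it may be worth checking the raw counts per size group before reading too much into the bars for groups with few apps.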
Age Rating
Age ratings tell us who the target audience is. The following chart shows the distribution of age ratings in our data.
# waffle provides waffle()
library(waffle)
waffled <- games2 %>%
  group_by(age_rating) %>%
  summarize(Total = n()) %>%
  # Share of each age rating, as a whole-number percentage
  mutate(age = round(Total / sum(Total) * 100)) %>%
  arrange(-age)
age_counts <- waffled$age
names(age_counts) <- waffled$age_rating
waffle(age_counts) +
  theme(axis.text.y = element_blank(),
        axis.ticks.y = element_blank(),
        axis.text.x = element_blank(),
        plot.title = element_text(hjust = 0.5)) +
  labs(title = "Variety of Age Rating")
The chart shows that most of the Mobile Strategy Games in the App Store carry the 4+ rating. Since 4+ is the least restrictive age rating in this data, this suggests that most of these games are meant to be suitable for a broad audience, including young children.
Genres and Languages
Next, we look at which genres and languages are most common among the apps.
Genres
For the genres, we are going to remove Games and Strategy from the pool because all of the apps are supposed to be Strategy Games.
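# tidyr provides separate_rows(); treemap provides treemap()
library(tidyr)
library(treemap)
appgenres <- games2 %>%
  # Split on ", " so that multi-word genre names are not broken apart
  separate_rows(genres, sep = ", ", convert = TRUE) %>%
  filter(genres != "Games" & genres != "Strategy") %>%
  group_by(genres) %>%
  summarize(total = n()) %>%
  arrange(-total)
# Pass the summarized data frame as treemap()'s first argument
treemap(appgenres, index = "genres",
        vSize = "total",
        fontsize.labels = 15,
        palette = "YlOrRd",
        title = "Genres of Mobile Strategy Games",
        fontsize.title = 20)
The treemap shows that the top three additional genres among these apps are Entertainment, Puzzle, and Simulation.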
Languages
Then we check the popular languages used in Mobile Strategy Games in App Store.
applanguages <- games2 %>%
  # One row per language code of each app
  separate_rows(languages, sep = ", ", convert = TRUE) %>%
  group_by(languages) %>%
  summarize(total = n()) %>%
  arrange(-total)
# Pass the summarized data frame as treemap()'s first argument
treemap(applanguages, index = "languages",
        vSize = "total",
        fontsize.labels = 15,
        palette = "YlOrRd",
        title = "Languages of Mobile Strategy Games",
        fontsize.title = 20)
The most common language is, without any doubt, EN (English), followed by ZH (Chinese).
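The idea behind both the per-app language count and the language totals is to split each comma-separated string and count the pieces. A small base R sketch, using made-up language strings in the same "XX, YY" format (this is an illustration of the idea, not the code used for the treemap):

```r
# Made-up language strings, one per app
languages <- c("DA, NL, EN", "EN", "EN, ZH", "ZH, JA, EN")

# Split each string on ", " to get one vector of language codes per app
codes_per_app <- strsplit(languages, ", ", fixed = TRUE)

# Number of languages per app (the same idea as lang_count)
lang_count <- lengths(codes_per_app)

# Frequency of each code across all apps, most common first
lang_totals <- sort(table(unlist(codes_per_app)), decreasing = TRUE)
print(lang_count)
print(lang_totals)
```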
Summary
Many variables can affect the popularity of an app, in this case a Mobile Strategy Game. In this project, we are going to create a model that can help us identify what makes a Mobile Strategy Game popular.