Create a Kaggle account and download this data set (AppleStore.csv) about apps in the App Store: https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps#AppleStore.csv
The author also provides a reference R package (data only); however, it differs from the CSV provided on the website, so we will use the CSV. We can get a quick preview with head().
data <- read.csv('AppleStore.csv')
head(data)
- What is the average (mean) and standard deviation of size_bytes?
The average (mean) of size_bytes is 199134453.825066 bytes and the standard deviation is 359206913.538703 bytes.
library(dplyr)

data %>%
  summarise(size_bytes_mean = mean(size_bytes),
            size_bytes_sd   = sd(size_bytes))
- Which prime_genre category has the lowest average (mean) value of user_rating?
prime_genre_groups_by_rating <- data %>%
  group_by(prime_genre) %>%
  summarise(average_rating = mean(user_rating)) %>%
  arrange(average_rating)

head(prime_genre_groups_by_rating)
The prime_genre with the lowest average (mean) value of user_rating is Catalogs.
- What interesting observations can you make from this data set?
Different measures of central tendency are suited to different distributions. The mean will give you a good idea of the average of a normally distributed variable, but if a variable has a skewed distribution, the mean can be heavily influenced by a few outliers. In the case of a long right tail, for example, the mean can be considerably higher than what we would naturally consider to be the average value.
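To make this concrete, we can compare the mean and median of size_bytes directly; a wide gap between the two is a quick signal of skew:

# A mean far above the median suggests a long right tail.
data %>%
  summarise(size_mean   = mean(size_bytes),
            size_median = median(size_bytes),
            ratio       = size_mean / size_median)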
We can examine the distributions of size_bytes and user_rating with histograms.
library(ggplot2)

ggplot(data, aes(x = size_bytes)) +
  geom_histogram(aes(fill = ..count..), alpha = 0.8) +
  labs(title = "Histogram for Size in Bytes", x = "Bytes", y = "Count")
ggplot(data, aes(x = user_rating)) +
  geom_histogram(aes(fill = ..count..), alpha = 0.8) +
  labs(title = "Histogram for Average User Ratings", x = "Rating", y = "Count") +
  scale_fill_gradient("Count", low = "darkgreen", high = "green")
The distribution for size_bytes is heavily skewed right: the mean of 199134453.825066 bytes is 2.049699 times the median value of 97153024 bytes.
The distribution for user_rating is bimodal, with a large number of 0 values and a slightly left-skewed distribution for the values 1 through 5. Since the variable is discrete, this is a rare case where the mode might be a useful measure of the average: the mode of user_rating is 4.5.
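R has no built-in function for the statistical mode, but for a discrete variable like user_rating a frequency table is enough; a minimal sketch:

# Tabulate rating frequencies and take the most common value.
rating_counts <- table(data$user_rating)
as.numeric(names(rating_counts)[which.max(rating_counts)])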
We can further explore this variable using a series of box plots grouped by the prime_genre variable.
ggplot(data, aes(x = prime_genre, y = user_rating)) +
  geom_boxplot(fill = "lightgreen", alpha = 0.8) +
  theme(axis.text.x = element_text(angle = 60, hjust = 1)) +
  labs(title = "Ratings Per Prime Genre", x = "Prime Genre", y = "User Rating Distribution")
We can see that the Catalogs, Finance, Navigation, and Book categories have high variances: users leave the widest range of review scores for these apps. The Catalogs category also has the lowest mean. However, if you visit the Catalogs section on the App Store today, there are 0 apps listed. Maybe this category was discontinued?
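To put numbers behind that reading of the box plots, here is a sketch of the per-genre spread, using standard deviation as a stand-in for variance:

# user_rating spread and center per genre, widest spread first.
data %>%
  group_by(prime_genre) %>%
  summarise(rating_sd   = sd(user_rating),
            rating_mean = mean(user_rating)) %>%
  arrange(desc(rating_sd)) %>%
  head()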
The Productivity genre's ratings are high and tightly distributed: people who rate productivity apps consistently rate them well. For the highest mean rating, it's a three-way tie between Photo & Video, Health & Fitness, and Games, all with a mean rating of 4.5. Apps in the Photo & Video category seem to perform very well, so we have some stiff competition!
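We can double-check the top of the table by re-sorting the per-genre summary we built earlier:

# Reuse the earlier summary table, highest average rating first.
prime_genre_groups_by_rating %>%
  arrange(desc(average_rating)) %>%
  head()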
The data set also distinguishes between user_rating, the overall average, and user_rating_ver, the average rating for the current version. We've all experienced breaking changes or seen apps go from bad to great, but do reviews really change that much? By defining a new variable review_change and using the same histogram technique, we can see how many apps move, and how far.
data %>%
  mutate(review_change = user_rating_ver - user_rating) %>%
  ggplot(aes(x = review_change)) +
  geom_histogram(aes(fill = ..x..), alpha = 0.8, show.legend = FALSE) +
  labs(title = "Histogram for Change in Rating", x = "Change in Rating", y = "Count") +
  scale_fill_gradient("Change", low = "red", high = "green")
So the answer is no, app ratings do not change much between the overall average and the current-version average in this data set. However, some apps have improved a lot. Let's give credit where credit is due; here are the top 5 most improved apps:
data %>%
  mutate(review_change = user_rating_ver - user_rating) %>%
  arrange(desc(review_change)) %>%
  select(track_name, review_change) %>%
  head(n = 5)
Correlations are another useful starting point during exploratory data analysis. Always be careful of spurious correlations! A visual correlation matrix is a fun and fast way to get an overview of all the correlations between your numeric variables.
numeric_variables <- c("size_bytes", "price", "rating_count_tot", "user_rating",
                       "sup_devices.num", "ipadSc_urls.num", "lang.num")

library(corrplot)
corrplot(cor(data %>% select(all_of(numeric_variables))),
         type = "upper", order = "hclust",
         tl.col = "black", tl.srt = 45)
Among the numeric variables we selected, there are no strong correlations. This might seem like a disappointing result, but there is some good news: since these variables seem to contain different information, we don't have to worry about multicollinearity when building a linear model in the next step.
rating_count_tot tells us how many ratings an app has across all versions. Assuming a typical funnel (e.g., awareness > download > use > rate), it's possible this variable can act as a proxy for active users.
Since growth is a primary goal in app development, we would want to know which, if any, of the other variables in this set contribute to rating_count_tot.
We can build a simple linear model to try to determine the effect of the following explanatory variables on rating_count_tot:
- price: Does price limit your user base, or do yuppie economics rule the iPhone app ecosystem?
- prime_genre (as a categorical variable): Are people just using their phones for specific types of activity?
- user_rating: Is quality what leads to quantity? The App Store and iTunes are said to rely on reviews for promotion algorithms. Word of mouth is supposed to be the best kind of advertising.

fit <- lm(rating_count_tot ~ price + as.factor(prime_genre) + user_rating, data = data)
summary(fit)
##
## Call:
## lm(formula = rating_count_tot ~ price + as.factor(prime_genre) +
## user_rating, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -54358 -15578 -11069 -1138 2926784
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) -4645.9 7265.5 -0.639
## price -508.5 155.2 -3.276
## as.factor(prime_genre)Business -4112.5 12274.6 -0.335
## as.factor(prime_genre)Catalogs -2268.9 24833.3 -0.091
## as.factor(prime_genre)Education -5622.8 7964.6 -0.706
## as.factor(prime_genre)Entertainment -1365.4 7833.3 -0.174
## as.factor(prime_genre)Finance 5419.8 10247.9 0.529
## as.factor(prime_genre)Food & Drink 5653.2 11856.5 0.477
## as.factor(prime_genre)Games 3179.6 7248.2 0.439
## as.factor(prime_genre)Health & Fitness -418.0 9084.4 -0.046
## as.factor(prime_genre)Lifestyle -837.5 9482.3 -0.088
## as.factor(prime_genre)Medical -4825.6 17265.0 -0.280
## as.factor(prime_genre)Music 18795.6 9620.8 1.954
## as.factor(prime_genre)Navigation 7022.6 13181.3 0.533
## as.factor(prime_genre)News 5076.8 11231.8 0.452
## as.factor(prime_genre)Photo & Video 3361.0 8209.5 0.409
## as.factor(prime_genre)Productivity -2369.8 9127.4 -0.260
## as.factor(prime_genre)Reference 14629.1 11812.5 1.238
## as.factor(prime_genre)Shopping 8003.5 9871.0 0.811
## as.factor(prime_genre)Social Networking 37448.5 9197.2 4.072
## as.factor(prime_genre)Sports 6299.4 10015.5 0.629
## as.factor(prime_genre)Travel 4788.0 10987.9 0.436
## as.factor(prime_genre)Utilities -1785.6 8579.0 -0.208
## as.factor(prime_genre)Weather 12134.8 11384.9 1.066
## user_rating 4311.2 598.6 7.202
## Pr(>|t|)
## (Intercept) 0.52255
## price 0.00106 **
## as.factor(prime_genre)Business 0.73760
## as.factor(prime_genre)Catalogs 0.92721
## as.factor(prime_genre)Education 0.48022
## as.factor(prime_genre)Entertainment 0.86163
## as.factor(prime_genre)Finance 0.59691
## as.factor(prime_genre)Food & Drink 0.63352
## as.factor(prime_genre)Games 0.66091
## as.factor(prime_genre)Health & Fitness 0.96330
## as.factor(prime_genre)Lifestyle 0.92962
## as.factor(prime_genre)Medical 0.77987
## as.factor(prime_genre)Music 0.05078 .
## as.factor(prime_genre)Navigation 0.59421
## as.factor(prime_genre)News 0.65128
## as.factor(prime_genre)Photo & Video 0.68226
## as.factor(prime_genre)Productivity 0.79515
## as.factor(prime_genre)Reference 0.21559
## as.factor(prime_genre)Shopping 0.41750
## as.factor(prime_genre)Social Networking 0.000047166184515 ***
## as.factor(prime_genre)Sports 0.52939
## as.factor(prime_genre)Travel 0.66303
## as.factor(prime_genre)Utilities 0.83512
## as.factor(prime_genre)Weather 0.28652
## user_rating 0.000000000000653 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 75240 on 7172 degrees of freedom
## Multiple R-squared: 0.01648, Adjusted R-squared: 0.01319
## F-statistic: 5.008 on 24 and 7172 DF, p-value: 0.00000000000001255
There are three results from this summary we want to pay attention to.
R-squared measures how much of the variance in our outcome variable is explained by our regressors. In this case we have a very low adjusted R-squared of 0.013, so we would need to change some variables in this model.
The next is the F-test. In this case, the p-value for our F-statistic is below the significance level. All this means is that we are outperforming the intercept-only (flat line) model (a good start!).
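Both statistics can be pulled straight out of the summary object rather than read off the printout; a small sketch:

fit_summary <- summary(fit)
fit_summary$r.squared      # multiple R-squared
fit_summary$adj.r.squared  # adjusted R-squared

# Recompute the overall F-test p-value from the stored F statistic.
f <- fit_summary$fstatistic
pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE)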
Lastly, we can consider the individual explanatory variables themselves. We are interested in significant (p-value < 0.05) and important (coefficient size) variables. Three stand out:
- price has a significant (p ≈ 0.001) and negative (-508.5) effect on rating_count_tot.
- prime_genre has a significant effect through its Social Networking level (p ≈ 0.00005), with a large positive coefficient (37448.5).
- user_rating has a significant (p ≈ 6.5e-13) and positive (4311.2) effect.

(The Music genre was also marginally significant (p ≈ 0.051) and positive (18795.6).)
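A quick way to pull out just those rows is to filter the coefficient matrix by p-value:

# Keep only the coefficients significant at the 0.05 level.
coefs <- summary(fit)$coefficients
coefs[coefs[, "Pr(>|t|)"] < 0.05, ]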
How might we interpret these model results? Social Networking might be biased by a single outlier: Facebook. Otherwise, it seems free and quality are the biggest determinants according to this very basic model. Keep in mind our R-squared is low, so we aren't getting much explanatory power in the first place.
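We can eyeball the single-outlier hypothesis by listing the most-rated apps in that genre:

# If one app dwarfs the rest, it could be driving the genre's coefficient.
data %>%
  filter(prime_genre == "Social Networking") %>%
  arrange(desc(rating_count_tot)) %>%
  select(track_name, rating_count_tot) %>%
  head(n = 3)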
My intuition is that this model has some serious flaws that would come out in regression diagnostics. At best it would require a lot of work to pass muster. However, I think the exercise is informative because it raises some interesting questions and insights into the drivers of app growth.
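For anyone who wants to follow up, base R's standard diagnostic plots are one place to start:

# Residuals vs fitted, normal Q-Q, scale-location, and residuals vs leverage.
par(mfrow = c(2, 2))
plot(fit)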
Summary
The first takeaway is that app quality (as measured by user_rating) has a bigger positive impact on total reviews than price has a negative one. iPhone and iPad users are willing to pay for quality.
Next, the fact that reviews and scores are clustered by category in both our box plots and our linear model shows the ecosystem has both mature and immature segments (i.e., "blue ocean" categories).
Should developers view these "blue ocean" categories as potential target markets? It's a question that requires further analysis. On the one hand, the lack of good apps could be driven by the fact that people associate their phones with certain kinds of activities (chat, selfies, passing time) and not others (browsing catalogs). On the other hand, some categories are surprisingly less mature than we would expect, such as Finance. Intuitively we all want mobile finance, so this may be a dynamic space to explore.