Create a Kaggle account and download this data set (AppleStore.csv) about apps in the App Store: https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps#AppleStore.csv
The author also provides a reference R package (data only); however, it differs from the CSV provided on the website, so we will use the CSV. We can get a quick preview with head().
data <- read.csv('AppleStore.csv')
head(data)
- What is the average (mean) and standard deviation of size_bytes?
The average (mean) of size_bytes is 199134453.825066 bytes and the standard deviation is 359206913.538703 bytes.
library(dplyr)

data %>%
  summarise(size_bytes_mean = mean(size_bytes),
            size_bytes_sd   = sd(size_bytes))
- Which prime_genre category has the lowest average (mean) value of user_rating?
prime_genre_groups_by_rating <- data %>%
  group_by(prime_genre) %>%
  summarise(average_rating = mean(user_rating)) %>%
  arrange(average_rating)

head(prime_genre_groups_by_rating)
The prime_genre with the lowest average (mean) value of user_rating is Catalogs.
- What interesting observations can you make from this data set?
Different measures of central tendency are suited to different distributions. The mean will give you a good idea of the average of a normally distributed variable, but if a variable has a skewed distribution, the mean can be heavily influenced by a few outliers. In the case of a long right tail, for example, the mean can be considerably higher than what we would naturally consider to be the average value.
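To make this concrete, we can compare the mean and median of size_bytes directly; a wide gap between the two is a quick signal of skew:

# A mean far above the median suggests a long right tail.
data %>%
  summarise(size_mean   = mean(size_bytes),
            size_median = median(size_bytes),
            ratio       = size_mean / size_median)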
We can examine the distributions of size_bytes and user_rating with histograms.
library(ggplot2)

ggplot(data, aes(x = size_bytes)) +
  geom_histogram(aes(fill = ..count..), alpha = 0.8) +
  labs(title = "Histogram for Size in Bytes", x = "Bytes", y = "Count")
ggplot(data, aes(x = user_rating)) +
  geom_histogram(aes(fill = ..count..), alpha = 0.8) +
  labs(title = "Histogram for Average User Ratings", x = "Rating", y = "Count") +
  scale_fill_gradient("Count", low = "darkgreen", high = "green")
The distribution for size_bytes is heavily skewed right: the mean of 199134453.825066 bytes is 2.049699 times the median value of 97153024 bytes.
The distribution for user_rating is bimodal, with a large number of 0 values and a slightly left-skewed distribution for the values 1 through 5. Since the variable is discrete, this is a rare case where the mode might be a useful measure of the average: the mode of user_rating is 4.5.
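R has no built-in function for the statistical mode, but for a discrete variable like user_rating a frequency table is enough; a minimal sketch:

# Tabulate rating frequencies and take the most common value.
rating_counts <- table(data$user_rating)
as.numeric(names(rating_counts)[which.max(rating_counts)])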
We can further explore this variable using a series of box plots grouped by the prime_genre variable.
ggplot(data, aes(x = prime_genre, y = user_rating)) +
  geom_boxplot(fill = "lightgreen", alpha = 0.8) +
  theme(axis.text.x = element_text(angle = 60, hjust = 1)) +
  labs(title = "Ratings Per Prime Genre", x = "Prime Genre", y = "User Rating Distribution")
We can see that the Catalogs, Finance, Navigation, and Book categories have high variances: users leave the widest range of review scores for these apps. The Catalogs category also has the lowest mean. However, if you visit the Catalogs section on the App Store today, there are 0 apps listed. Maybe this category was discontinued?
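To put numbers behind that reading of the box plots, here is a sketch of the per-genre spread, using standard deviation as a stand-in for variance:

# user_rating spread and center per genre, widest spread first.
data %>%
  group_by(prime_genre) %>%
  summarise(rating_sd   = sd(user_rating),
            rating_mean = mean(user_rating)) %>%
  arrange(desc(rating_sd)) %>%
  head()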
The Productivity genre's ratings are high and tightly distributed: people who rate productivity apps consistently rate them well. For the highest mean rating, it's a three-way tie between Photo & Video, Health & Fitness, and Games, all with a mean rating of 4.5. Apps in the Photo & Video category seem to perform very well, so we have some stiff competition!
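We can double-check the top of the table by re-sorting the per-genre summary we built earlier:

# Reuse the earlier summary table, highest average rating first.
prime_genre_groups_by_rating %>%
  arrange(desc(average_rating)) %>%
  head()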
The data set also distinguishes between user_rating, the overall average, and user_rating_ver, the average rating for the current version. We've all experienced breaking changes or seen apps go from bad to great, but do reviews really change that much? By defining a new variable review_change and using the same histogram technique, we can see how many apps move, and how far.
data %>%
  mutate(review_change = user_rating_ver - user_rating) %>%
  ggplot(aes(x = review_change)) +
  geom_histogram(aes(fill = ..x..), alpha = 0.8, show.legend = FALSE) +
  labs(title = "Histogram for Change in Rating", x = "Change in Rating", y = "Count") +
  scale_fill_gradient("Change", low = "red", high = "green")
So the answer is no, app ratings do not change much between the overall average and the current-version average in this data set. However, some apps have improved a lot. Let's give credit where credit is due; here are the top 5 most improved apps:
data %>%
  mutate(review_change = user_rating_ver - user_rating) %>%
  arrange(desc(review_change)) %>%
  select(track_name, review_change) %>%
  head(n = 5)
Correlations are another useful starting point during exploratory data analysis. Always be careful of spurious correlations! A visual correlation matrix is a fun and fast way to get an overview of all the correlations between your numeric variables.
numeric_variables <- c("size_bytes", "price", "rating_count_tot", "user_rating",
                       "sup_devices.num", "ipadSc_urls.num", "lang.num")

library(corrplot)
corrplot(cor(data %>% select(all_of(numeric_variables))),
         type = "upper", order = "hclust",
         tl.col = "black", tl.srt = 45)
Among the numeric variables we selected, there are no strong correlations. This might seem like a disappointing result, but there is some good news: since these variables seem to contain different information, we don't have to worry about multicollinearity when building a linear model in the next step.
rating_count_tot tells us how many ratings an app has across all versions. Assuming a typical funnel (e.g., awareness > download > use > rate), it's possible this variable can act as a proxy for active users.
Since growth is a primary goal in app development, we would want to know which, if any, of the other variables in this set contribute to rating_count_tot.
We can build a simple linear model to try to determine the effect of the following explanatory variables on rating_count_tot:
- price: Does price limit your user base, or do yuppie economics rule the iPhone app ecosystem?
- prime_genre (as a categorical variable): Are people just using their phones for specific types of activity?
- user_rating: Is quality what leads to quantity? The App Store and iTunes are said to rely on reviews for promotion algorithms. Word of mouth is supposed to be the best kind of advertising.

fit <- lm(rating_count_tot ~ price + as.factor(prime_genre) + user_rating, data = data)
summary(fit)
##
## Call:
## lm(formula = rating_count_tot ~ price + as.factor(prime_genre) +
## user_rating, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -54358 -15578 -11069 -1138 2926784
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) -4645.9 7265.5 -0.639
## price -508.5 155.2 -3.276
## as.factor(prime_genre)Business -4112.5 12274.6 -0.335
## as.factor(prime_genre)Catalogs -2268.9 24833.3 -0.091
## as.factor(prime_genre)Education -5622.8 7964.6 -0.706
## as.factor(prime_genre)Entertainment -1365.4 7833.3 -0.174
## as.factor(prime_genre)Finance 5419.8 10247.9 0.529
## as.factor(prime_genre)Food & Drink 5653.2 11856.5 0.477
## as.factor(prime_genre)Games 3179.6 7248.2 0.439
## as.factor(prime_genre)Health & Fitness -418.0 9084.4 -0.046
## as.factor(prime_genre)Lifestyle -837.5 9482.3 -0.088
## as.factor(prime_genre)Medical -4825.6 17265.0 -0.280
## as.factor(prime_genre)Music 18795.6 9620.8 1.954
## as.factor(prime_genre)Navigation 7022.6 13181.3 0.533
## as.factor(prime_genre)News 5076.8 11231.8 0.452
## as.factor(prime_genre)Photo & Video 3361.0 8209.5 0.409
## as.factor(prime_genre)Productivity -2369.8 9127.4 -0.260
## as.factor(prime_genre)Reference 14629.1 11812.5 1.238
## as.factor(prime_genre)Shopping 8003.5 9871.0 0.811
## as.factor(prime_genre)Social Networking 37448.5 9197.2 4.072
## as.factor(prime_genre)Sports 6299.4 10015.5 0.629
## as.factor(prime_genre)Travel 4788.0 10987.9 0.436
## as.factor(prime_genre)Utilities -1785.6 8579.0 -0.208
## as.factor(prime_genre)Weather 12134.8 11384.9 1.066
## user_rating 4311.2 598.6 7.202
## Pr(>|t|)
## (Intercept) 0.52255
## price 0.00106 **
## as.factor(prime_genre)Business 0.73760
## as.factor(prime_genre)Catalogs 0.92721
## as.factor(prime_genre)Education 0.48022
## as.factor(prime_genre)Entertainment 0.86163
## as.factor(prime_genre)Finance 0.59691
## as.factor(prime_genre)Food & Drink 0.63352
## as.factor(prime_genre)Games 0.66091
## as.factor(prime_genre)Health & Fitness 0.96330
## as.factor(prime_genre)Lifestyle 0.92962
## as.factor(prime_genre)Medical 0.77987
## as.factor(prime_genre)Music 0.05078 .
## as.factor(prime_genre)Navigation 0.59421
## as.factor(prime_genre)News 0.65128
## as.factor(prime_genre)Photo & Video 0.68226
## as.factor(prime_genre)Productivity 0.79515
## as.factor(prime_genre)Reference 0.21559
## as.factor(prime_genre)Shopping 0.41750
## as.factor(prime_genre)Social Networking 0.000047166184515 ***
## as.factor(prime_genre)Sports 0.52939
## as.factor(prime_genre)Travel 0.66303
## as.factor(prime_genre)Utilities 0.83512
## as.factor(prime_genre)Weather 0.28652
## user_rating 0.000000000000653 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 75240 on 7172 degrees of freedom
## Multiple R-squared: 0.01648, Adjusted R-squared: 0.01319
## F-statistic: 5.008 on 24 and 7172 DF, p-value: 0.00000000000001255
There are three results from this summary we want to pay attention to.
R-squared measures how much of the variance in our outcome variable is explained by our regressors. In this case we have a very low adjusted R-squared of 0.013, so we would need to change some variables in this model.
The next is the F-test. In this case, the p-value for our F-statistic is below the significance level. All this means is that we are outperforming the intercept-only (flat line) model (a good start!).
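Both statistics can be pulled straight out of the summary object rather than read off the printout; a small sketch:

fit_summary <- summary(fit)
fit_summary$r.squared      # multiple R-squared
fit_summary$adj.r.squared  # adjusted R-squared

# Recompute the overall F-test p-value from the stored F statistic.
f <- fit_summary$fstatistic
pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE)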
Lastly, we can consider the individual explanatory variables themselves. We are interested in significant (p-value < 0.05) and important (coefficient size) variables. Three stand out:
- price has a significant (p ≈ 0.001) and negative (-508.5) effect on rating_count_tot.
- prime_genre has a significant effect through its Social Networking level (p ≈ 0.00005), with a large positive coefficient (37448.5).
- user_rating has a significant (p ≈ 6.5e-13) and positive (4311.2) effect.

(The Music genre was also marginally significant (p ≈ 0.051) and positive (18795.6).)
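A quick way to pull out just those rows is to filter the coefficient matrix by p-value:

# Keep only the coefficients significant at the 0.05 level.
coefs <- summary(fit)$coefficients
coefs[coefs[, "Pr(>|t|)"] < 0.05, ]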
How might we interpret these model results? Social Networking might be biased by a single outlier: Facebook. Otherwise, it seems free and quality are the biggest determinants according to this very basic model. Keep in mind our R-squared is low, so we aren't getting much explanatory power in the first place.
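We can eyeball the single-outlier hypothesis by listing the most-rated apps in that genre:

# If one app dwarfs the rest, it could be driving the genre's coefficient.
data %>%
  filter(prime_genre == "Social Networking") %>%
  arrange(desc(rating_count_tot)) %>%
  select(track_name, rating_count_tot) %>%
  head(n = 3)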
My intuition is that this model has some serious flaws that would come out in regression diagnostics. At best it would require a lot of work to pass muster. However, I think the exercise is informative because it raises some interesting questions and insights into the drivers of app growth.
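For anyone who wants to follow up, base R's standard diagnostic plots are one place to start:

# Residuals vs fitted, normal Q-Q, scale-location, and residuals vs leverage.
par(mfrow = c(2, 2))
plot(fit)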
Summary
The first takeaway is that app quality (as measured by user_rating) has a bigger positive impact on total reviews than price has a negative one. iPhone and iPad users are willing to pay for quality.
Next, the fact that reviews and scores are clustered by category in both our box plots and our linear model shows the ecosystem has both mature and immature segments (i.e., "blue ocean" categories).
Should developers view these "blue ocean" categories as potential target markets? It's a question that requires further analysis. On the one hand, the lack of good apps could be driven by the fact that people associate their phones with certain kinds of activities (chat, selfies, passing time) and not others (browsing catalogs). On the other hand, some categories are surprisingly less mature than we would expect, such as Finance. Intuitively we all want mobile finance, so this may be a dynamic space to explore.