Homework 1

google-playstore

apps

reviews

Reading in Data

Author

Rohan Lekhwani

Published

January 3, 2024

library(tidyverse)
library(readxl)
library(here)
library(dplyr)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Reading the Data

The RStudio working directory has been set with setwd() so that “googleplaystore.csv” sits at its root, which is where here("googleplaystore.csv") looks for the file in this setup.

googleplaystore <- read_csv(here("googleplaystore.csv"))
googleplaystore

# A tibble: 10,841 × 13
   App       Category Rating Reviews Size  Installs Type  Price `Content Rating`
   <chr>     <chr>     <dbl>   <dbl> <chr> <chr>    <chr> <chr> <chr>           
 1 Photo Ed… ART_AND…    4.1     159 19M   10,000+  Free  0     Everyone        
 2 Coloring… ART_AND…    3.9     967 14M   500,000+ Free  0     Everyone        
 3 U Launch… ART_AND…    4.7   87510 8.7M  5,000,0… Free  0     Everyone        
 4 Sketch -… ART_AND…    4.5  215644 25M   50,000,… Free  0     Teen            
 5 Pixel Dr… ART_AND…    4.3     967 2.8M  100,000+ Free  0     Everyone        
 6 Paper fl… ART_AND…    4.4     167 5.6M  50,000+  Free  0     Everyone        
 7 Smoke Ef… ART_AND…    3.8     178 19M   50,000+  Free  0     Everyone        
 8 Infinite… ART_AND…    4.1   36815 29M   1,000,0… Free  0     Everyone        
 9 Garden C… ART_AND…    4.4   13791 33M   1,000,0… Free  0     Everyone        
10 Kids Pai… ART_AND…    4.7     121 3.1M  10,000+  Free  0     Everyone        
# ℹ 10,831 more rows
# ℹ 4 more variables: Genres <chr>, `Last Updated` <chr>, `Current Ver` <chr>,
#   `Android Ver` <chr>
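read_csv() guesses each column’s type from the data. As a minimal sketch of how that guess could be pinned down instead, the example below writes a tiny, made-up stand-in for the file to a temporary location and reads it back with an explicit col_types specification. The toy rows and the chosen column types are assumptions for illustration, not part of the original analysis.

```r
library(readr)

# Hypothetical two-row stand-in for googleplaystore.csv (toy values only)
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "App,Rating,Reviews,Installs",
  "Toy App A,4.1,159,\"10,000+\"",
  "Toy App B,,967,\"500,000+\""
), csv_path)

# Declare the column types up front instead of relying on guessing
toy <- read_csv(
  csv_path,
  col_types = cols(
    App = col_character(),
    Rating = col_double(),
    Reviews = col_double(),
    Installs = col_character()
  )
)

# problems() lists cells that could not be parsed as the declared type
n_problems <- nrow(problems(toy))
n_problems

unlink(csv_path)
```

An empty Rating cell is read as NA here, and Installs stays a character column because of the “,” and “+” in its values.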

Cleaning the Dataset

Remove Duplicates

The raw dataset has 10,841 observations and 13 variables. However, it contains duplicates: the query below counts only 10,358 distinct rows, so 483 rows are exact copies of another row.

n_distinct(googleplaystore)

[1] 10358
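To see which rows are repeated, and not only how many distinct rows there are, one option is to group by every column and keep the groups with more than one member. The sketch below uses a small made-up tibble; the names toy and dupes are hypothetical.

```r
library(dplyr)

# Toy data: the first two rows are exact duplicates, the last two are not
toy <- tibble(
  App = c("A", "A", "B", "C", "C"),
  Reviews = c(10, 10, 5, 7, 8)
)

# Number of redundant rows = total rows minus distinct rows
n_redundant <- nrow(toy) - n_distinct(toy)
n_redundant

# Rows that match another row in every column
dupes <- toy %>%
  group_by(across(everything())) %>%
  filter(n() > 1) %>%
  ungroup()
dupes
```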

The following query de-duplicates the dataset by dropping rows that match another row in every column:

googleplaystore %>% distinct()

# A tibble: 10,358 × 13
   App       Category Rating Reviews Size  Installs Type  Price `Content Rating`
   <chr>     <chr>     <dbl>   <dbl> <chr> <chr>    <chr> <chr> <chr>           
 1 Photo Ed… ART_AND…    4.1     159 19M   10,000+  Free  0     Everyone        
 2 Coloring… ART_AND…    3.9     967 14M   500,000+ Free  0     Everyone        
 3 U Launch… ART_AND…    4.7   87510 8.7M  5,000,0… Free  0     Everyone        
 4 Sketch -… ART_AND…    4.5  215644 25M   50,000,… Free  0     Teen            
 5 Pixel Dr… ART_AND…    4.3     967 2.8M  100,000+ Free  0     Everyone        
 6 Paper fl… ART_AND…    4.4     167 5.6M  50,000+  Free  0     Everyone        
 7 Smoke Ef… ART_AND…    3.8     178 19M   50,000+  Free  0     Everyone        
 8 Infinite… ART_AND…    4.1   36815 29M   1,000,0… Free  0     Everyone        
 9 Garden C… ART_AND…    4.4   13791 33M   1,000,0… Free  0     Everyone        
10 Kids Pai… ART_AND…    4.7     121 3.1M  10,000+  Free  0     Everyone        
# ℹ 10,348 more rows
# ℹ 4 more variables: Genres <chr>, `Last Updated` <chr>, `Current Ver` <chr>,
#   `Android Ver` <chr>

However, this does not handle the case where the same app appears more than once with slightly different values, for example in Reviews:

googleplaystore %>% 
  distinct() %>%
  select(App, Category, Rating, Reviews) %>%
  arrange(desc(Reviews))

# A tibble: 10,358 × 4
   App                                      Category      Rating  Reviews
   <chr>                                    <chr>          <dbl>    <dbl>
 1 Facebook                                 SOCIAL           4.1 78158306
 2 Facebook                                 SOCIAL           4.1 78128208
 3 WhatsApp Messenger                       COMMUNICATION    4.4 69119316
 4 WhatsApp Messenger                       COMMUNICATION    4.4 69109672
 5 Instagram                                SOCIAL           4.5 66577446
 6 Instagram                                SOCIAL           4.5 66577313
 7 Instagram                                SOCIAL           4.5 66509917
 8 Messenger – Text and Video Chat for Free COMMUNICATION    4   56646578
 9 Messenger – Text and Video Chat for Free COMMUNICATION    4   56642847
10 Clash of Clans                           GAME             4.6 44893888
# ℹ 10,348 more rows

The following query de-duplicates the dataset on the App variable, keeping the first row found for each app name:

googleplaystore_distinct <- googleplaystore %>%
  distinct(App, .keep_all = TRUE)
googleplaystore_distinct

# A tibble: 9,660 × 13
   App       Category Rating Reviews Size  Installs Type  Price `Content Rating`
   <chr>     <chr>     <dbl>   <dbl> <chr> <chr>    <chr> <chr> <chr>           
 1 Photo Ed… ART_AND…    4.1     159 19M   10,000+  Free  0     Everyone        
 2 Coloring… ART_AND…    3.9     967 14M   500,000+ Free  0     Everyone        
 3 U Launch… ART_AND…    4.7   87510 8.7M  5,000,0… Free  0     Everyone        
 4 Sketch -… ART_AND…    4.5  215644 25M   50,000,… Free  0     Teen            
 5 Pixel Dr… ART_AND…    4.3     967 2.8M  100,000+ Free  0     Everyone        
 6 Paper fl… ART_AND…    4.4     167 5.6M  50,000+  Free  0     Everyone        
 7 Smoke Ef… ART_AND…    3.8     178 19M   50,000+  Free  0     Everyone        
 8 Infinite… ART_AND…    4.1   36815 29M   1,000,0… Free  0     Everyone        
 9 Garden C… ART_AND…    4.4   13791 33M   1,000,0… Free  0     Everyone        
10 Kids Pai… ART_AND…    4.7     121 3.1M  10,000+  Free  0     Everyone        
# ℹ 9,650 more rows
# ℹ 4 more variables: Genres <chr>, `Last Updated` <chr>, `Current Ver` <chr>,
#   `Android Ver` <chr>
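Because distinct() keeps the first row it meets for each App, which copy survives depends on the row order of the file. A possible alternative, sketched below, is to keep the copy with the most reviews for each app. The review counts are taken from the table above, but their order within the toy tibble is an assumption made for illustration.

```r
library(dplyr)

# Review counts copied from the output above; row order is assumed
toy <- tibble(
  App = c("Facebook", "Facebook", "Instagram", "Instagram", "Instagram"),
  Reviews = c(78158306, 78128208, 66577313, 66577446, 66509917)
)

# Approach used above: first row per app
first_seen <- toy %>%
  distinct(App, .keep_all = TRUE)

# Alternative: the row with the highest Reviews per app (ties broken arbitrarily)
most_reviewed <- toy %>%
  group_by(App) %>%
  slice_max(Reviews, n = 1, with_ties = FALSE) %>%
  ungroup()

first_seen
most_reviewed
```

Both approaches return one row per app; they differ only when the first copy is not the one with the most reviews.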

Remove Missing Values

The dataset also contains missing values. Rows with a missing value in any column can be dropped using the following query.

googleplaystore_cleaned <- googleplaystore_distinct %>% drop_na()
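Before dropping rows it can help to check which columns the missing values come from. The following is a minimal sketch on a made-up tibble (toy, na_counts and cleaned are hypothetical names); the same summarize() call could be applied to googleplaystore_distinct.

```r
library(dplyr)
library(tidyr)

# Toy data with one missing Rating and one missing Type, both in row "B"
toy <- tibble(
  App = c("A", "B", "C"),
  Rating = c(4.1, NA, 3.9),
  Type = c("Free", NA, "Paid")
)

# Count missing values in every column
na_counts <- toy %>%
  summarize(across(everything(), ~ sum(is.na(.x))))
na_counts

# drop_na() removes any row with a missing value in any column
cleaned <- toy %>% drop_na()
cleaned
```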

Dataset Description

High Level Description

After cleaning, the dataset comprises 8,195 rows and 13 columns.

googleplaystore_cleaned

# A tibble: 8,195 × 13
   App       Category Rating Reviews Size  Installs Type  Price `Content Rating`
   <chr>     <chr>     <dbl>   <dbl> <chr> <chr>    <chr> <chr> <chr>           
 1 Photo Ed… ART_AND…    4.1     159 19M   10,000+  Free  0     Everyone        
 2 Coloring… ART_AND…    3.9     967 14M   500,000+ Free  0     Everyone        
 3 U Launch… ART_AND…    4.7   87510 8.7M  5,000,0… Free  0     Everyone        
 4 Sketch -… ART_AND…    4.5  215644 25M   50,000,… Free  0     Teen            
 5 Pixel Dr… ART_AND…    4.3     967 2.8M  100,000+ Free  0     Everyone        
 6 Paper fl… ART_AND…    4.4     167 5.6M  50,000+  Free  0     Everyone        
 7 Smoke Ef… ART_AND…    3.8     178 19M   50,000+  Free  0     Everyone        
 8 Infinite… ART_AND…    4.1   36815 29M   1,000,0… Free  0     Everyone        
 9 Garden C… ART_AND…    4.4   13791 33M   1,000,0… Free  0     Everyone        
10 Kids Pai… ART_AND…    4.7     121 3.1M  10,000+  Free  0     Everyone        
# ℹ 8,185 more rows
# ℹ 4 more variables: Genres <chr>, `Last Updated` <chr>, `Current Ver` <chr>,
#   `Android Ver` <chr>

The dataset has 11 columns of type <chr>; the remaining two, Rating and Reviews, are of type <dbl>. Each observation pertains to a single app. The variables are described below:

The App variable lists the name of the app under observation.
The Category variable gives the category the app is grouped into.
The Rating variable lists the rating of the app on the Google Play Store.
The Reviews variable represents the number of user reviews the app received on the Google Play Store.
The Size variable gives the size of the app (either in KB, MB or varying by device).
The Installs variable marks the number of devices the app is installed on, reported as a bucketed lower bound such as “10,000+”.
The Type variable lists whether an app is “Free” or “Paid”.
The Price variable gives the price of the app if the Type is “Paid” (for example, “$6.99”) and 0 for “Free” apps.
The Content Rating variable gives the user group the app is suitable for.
The Genres variable represents the genres the app falls under. This is similar to the Category variable, but an app can belong to multiple genres (apart from its main category). For example, a musical family game will belong to the Music, Game and Family genres.
The Last Updated variable gives the date on which the latest update was published on the Google Play Store.
The Current Ver variable marks the current published version of the app.
The Android Ver variable lists the Android version a device needs to run the app.
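Several of these variables hold numbers stored as text. A rough sketch of how Size, Installs and Price could be converted is shown below. The toy values, the “k” suffix for kilobytes, the “(varies)” placeholder for device-dependent sizes, and the 1024 KB per MB conversion are all assumptions; the printed output above truncates the non-numeric Size entries, so the sketch simply treats anything without a k or M suffix as missing.

```r
library(dplyr)
library(readr)
library(stringr)

# Toy values in the style of the dataset (assumed, not taken from the file)
toy <- tibble(
  Size = c("19M", "8.7M", "201k", "(varies)"),
  Installs = c("10,000+", "5,000,000+", "1,000+", "1,000,000,000+"),
  Price = c("0", "$6.99", "$0.99", "0")
)

parsed <- toy %>%
  mutate(
    # Leading number and trailing unit letter of Size; NA when absent
    size_value = as.numeric(str_extract(Size, "^[0-9.]+")),
    size_unit = str_to_upper(str_extract(Size, "[kKmM]$")),
    # Express every size in MB; entries without a unit stay missing
    size_mb = case_when(
      size_unit == "M" ~ size_value,
      size_unit == "K" ~ size_value / 1024,
      TRUE ~ NA_real_
    ),
    # parse_number() ignores the grouping commas, the "+" and the "$"
    installs_min = parse_number(Installs),
    price_usd = parse_number(Price)
  )
parsed
```

Keeping the converted values in new columns leaves the original text columns available for checking.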

How was the Data likely collected?

The dataset is published on Kaggle. It was collected in 2018 by web-scraping Google Play Store listings; the detail page for each app provides all of the information present in the dataset.

The following query gives the total apps under each category:

googleplaystore_cleaned %>% 
  group_by(Category) %>%
  summarize(total_apps = n()) %>%
  arrange(desc(total_apps))

# A tibble: 33 × 2
   Category        total_apps
   <chr>                <int>
 1 FAMILY                1608
 2 GAME                   912
 3 TOOLS                  718
 4 FINANCE                302
 5 LIFESTYLE              301
 6 PRODUCTIVITY           301
 7 PERSONALIZATION        298
 8 MEDICAL                290
 9 BUSINESS               263
10 PHOTOGRAPHY            263
# ℹ 23 more rows

We see that the “FAMILY” category has the most apps in the dataset.

Descriptive Statistics

Summary Statistics on Ratings

The following query summarizes the mean, median and sd for the Rating variable across all apps.

googleplaystore_cleaned %>%
  summarize(mean_rating = mean(Rating),
            median_rating = median(Rating),
            sd_rating = sd(Rating))

# A tibble: 1 × 3
  mean_rating median_rating sd_rating
        <dbl>         <dbl>     <dbl>
1        4.17           4.3     0.537

The summary above shows that most apps are well rated: the mean rating is 4.17 and the median is 4.3. Rating statistics grouped by app category can be computed using the following query, arranged in descending order of mean rating:

googleplaystore_cleaned %>%
  group_by(Category) %>%
  summarize(mean_rating = mean(Rating),
            median_rating = median(Rating),
            sd_rating = sd(Rating)) %>%
  arrange(desc(mean_rating))

# A tibble: 33 × 4
   Category            mean_rating median_rating sd_rating
   <chr>                     <dbl>         <dbl>     <dbl>
 1 EVENTS                     4.44           4.5     0.419
 2 EDUCATION                  4.36           4.4     0.264
 3 ART_AND_DESIGN             4.36           4.4     0.361
 4 BOOKS_AND_REFERENCE        4.34           4.5     0.438
 5 PERSONALIZATION            4.33           4.4     0.359
 6 PARENTING                  4.3            4.4     0.518
 7 BEAUTY                     4.28           4.3     0.363
 8 GAME                       4.25           4.3     0.384
 9 SOCIAL                     4.25           4.3     0.457
10 WEATHER                    4.24           4.3     0.338
# ℹ 23 more rows

We observe that apps categorized as “EVENTS” have the highest mean rating in the dataset.

Summary Statistics on Reviews

The number of reviews is an implicit indicator of an app’s popularity. The following query summarizes the mean, median and sd for the Reviews variable across all apps.

googleplaystore_cleaned %>%
  summarize(mean_reviews = mean(Reviews),
            median_reviews = median(Reviews),
            sd_reviews = sd(Reviews))

# A tibble: 1 × 3
  mean_reviews median_reviews sd_reviews
         <dbl>          <dbl>      <dbl>
1      255280.           3003   1985713.

From the above tibble we observe that the mean number of reviews (about 255,280) is far larger than the median (3,003), which implies a strongly right-skewed distribution in which a few outlier apps have a huge number of reviews.
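The effect of a single very large value on the mean can be illustrated with a handful of made-up review counts. This is only a toy sketch; the numbers are hypothetical, and the log10 summary is one possible way to describe such a skewed variable, not something the original analysis does.

```r
library(dplyr)

# Four modest review counts and one very large one (toy values)
toy <- tibble(Reviews = c(120, 950, 3000, 4100, 78000000))

skew_summary <- toy %>%
  summarize(
    mean_reviews = mean(Reviews),        # pulled up by the single outlier
    median_reviews = median(Reviews),    # unaffected by the outlier's size
    median_log10 = median(log10(Reviews + 1))  # summary on a log scale
  )
skew_summary
```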

googleplaystore_cleaned %>%
  select(App, Reviews) %>%
  arrange(desc(Reviews)) %>%
  head()

# A tibble: 6 × 2
  App                                       Reviews
  <chr>                                       <dbl>
1 Facebook                                 78158306
2 WhatsApp Messenger                       69119316
3 Instagram                                66577313
4 Messenger – Text and Video Chat for Free 56642847
5 Clash of Clans                           44891723
6 Clean Master- Space Cleaner & Antivirus  42916526

The above tibble lists the apps with the highest number of reviews. Review statistics can also be grouped by app category using the following query, arranged in descending order of mean reviews:

googleplaystore_cleaned %>%
  group_by(Category) %>%
  summarize(mean_reviews = mean(Reviews),
            median_reviews = median(Reviews),
            sd_reviews = sd(Reviews)) %>%
  arrange(desc(mean_reviews))

# A tibble: 33 × 4
   Category        mean_reviews median_reviews sd_reviews
   <chr>                  <dbl>          <dbl>      <dbl>
 1 SOCIAL              1122795.          9606    7317698.
 2 COMMUNICATION       1116449.         15162.   5900344.
 3 GAME                 682342.         32840    2561632.
 4 VIDEO_PLAYERS        455973.          6567    2340948.
 5 PHOTOGRAPHY          400575.         31985    1181863.
 6 ENTERTAINMENT        340810.         37884.   1036260.
 7 TOOLS                319437.          1038.   2195072.
 8 SHOPPING             247509.         20362.    867401.
 9 PRODUCTIVITY         184686.          6752     556558.
10 PERSONALIZATION      179674.          1516.    835898.
# ℹ 23 more rows

This matches our previous finding. Apps in the “SOCIAL”, “COMMUNICATION” and “GAME” categories tend to have the most reviews even though they do not have the highest mean ratings. This may be because these categories have far more users, and a larger, more varied user base tends to produce a less skewed distribution of ratings.

Summary Statistics on Category

The following query provides a summary over the count of apps published under each category as well as their proportion within the total number of apps.

googleplaystore_cleaned %>%
  group_by(Category) %>%
  summarize(app_count = n(),
            app_proportion = n() / nrow(googleplaystore_cleaned)) %>%
  arrange(desc(app_count))

# A tibble: 33 × 3
   Category        app_count app_proportion
   <chr>               <int>          <dbl>
 1 FAMILY               1608         0.196 
 2 GAME                  912         0.111 
 3 TOOLS                 718         0.0876
 4 FINANCE               302         0.0369
 5 LIFESTYLE             301         0.0367
 6 PRODUCTIVITY          301         0.0367
 7 PERSONALIZATION       298         0.0364
 8 MEDICAL               290         0.0354
 9 BUSINESS              263         0.0321
10 PHOTOGRAPHY           263         0.0321
# ℹ 23 more rows

From the above tibble we observe that the largest share of apps in our dataset, about 19.6%, is published under the “FAMILY” category.
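The same count-and-share table can also be written with count(), which combines the grouping, counting and sorting steps. The sketch below runs on a small made-up vector of categories; toy and shares are hypothetical names.

```r
library(dplyr)

# Toy category labels (not the real distribution)
toy <- tibble(
  Category = c("FAMILY", "FAMILY", "GAME", "TOOLS", "FAMILY", "GAME")
)

shares <- toy %>%
  count(Category, sort = TRUE, name = "app_count") %>%  # group, count, sort
  mutate(app_proportion = app_count / sum(app_count))   # share of all rows
shares
```

Dividing by sum(app_count) avoids referring to the data frame by name a second time inside the pipeline.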

Summary Statistics on App Type

The split between the two app types, “Free” and “Paid”, is another useful statistic.

googleplaystore_cleaned %>%
  group_by(Type) %>%
  summarize(count = n(),
            proportion = n() / nrow(googleplaystore_cleaned))

# A tibble: 2 × 3
  Type  count proportion
  <chr> <int>      <dbl>
1 Free   7591     0.926 
2 Paid    604     0.0737

We observe that about 93% of the apps in our dataset are “Free”. Another interesting statistic is the most installed app for each Type:

googleplaystore_cleaned %>%
  mutate(Installs = parse_number(Installs)) %>%  # parse_number() already ignores "," and "+"
  group_by(Type) %>%
  filter(Installs == max(Installs)) %>%
  arrange(desc(Type))

# A tibble: 22 × 13
# Groups:   Type [2]
   App       Category Rating Reviews Size  Installs Type  Price `Content Rating`
   <chr>     <chr>     <dbl>   <dbl> <chr>    <dbl> <chr> <chr> <chr>           
 1 Minecraft FAMILY      4.5  2.38e6 Vari…      1e7 Paid  $6.99 Everyone 10+    
 2 Hitman S… GAME        4.6  4.08e5 29M        1e7 Paid  $0.99 Mature 17+      
 3 Google P… BOOKS_A…    3.9  1.43e6 Vari…      1e9 Free  0     Teen            
 4 Messenge… COMMUNI…    4    5.66e7 Vari…      1e9 Free  0     Everyone        
 5 WhatsApp… COMMUNI…    4.4  6.91e7 Vari…      1e9 Free  0     Everyone        
 6 Google C… COMMUNI…    4.3  9.64e6 Vari…      1e9 Free  0     Everyone        
 7 Gmail     COMMUNI…    4.3  4.60e6 Vari…      1e9 Free  0     Everyone        
 8 Hangouts  COMMUNI…    4    3.42e6 Vari…      1e9 Free  0     Everyone        
 9 Skype - … COMMUNI…    4.1  1.05e7 Vari…      1e9 Free  0     Everyone        
10 Google P… ENTERTA…    4.3  7.17e6 Vari…      1e9 Free  0     Teen            
# ℹ 12 more rows
# ℹ 4 more variables: Genres <chr>, `Last Updated` <chr>, `Current Ver` <chr>,
#   `Android Ver` <chr>

We observe that “Minecraft” and “Hitman Sniper” are the most installed “Paid” apps, both in the 10 million+ installs bucket, while the remaining 20 rows are the most installed “Free” apps, all in the 1 billion+ bucket. Several apps tie because Installs is recorded in buckets, not as exact counts. An interesting observation is that a majority of the most installed “Free” apps on the Google Play Store are owned by Google itself!
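Since Installs is bucketed, many apps share the maximum value within a Type. A small sketch with made-up apps shows how slice_max() with with_ties = TRUE returns every tied row, and how the ties can then be counted per group. The app names and install buckets below are hypothetical.

```r
library(dplyr)
library(readr)

# Toy apps: two Paid apps tie for the top Paid bucket
toy <- tibble(
  App = c("P1", "P2", "P3", "F1", "F2"),
  Type = c("Paid", "Paid", "Paid", "Free", "Free"),
  Installs = c("10,000,000+", "10,000,000+", "1,000,000+",
               "1,000,000,000+", "500,000+")
)

top_installed <- toy %>%
  mutate(Installs = parse_number(Installs)) %>%
  group_by(Type) %>%
  slice_max(Installs, n = 1, with_ties = TRUE) %>%  # keep all tied rows
  summarize(n_tied = n(), max_installs = max(Installs))
top_installed
```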

Potential Research Questions

The following are some of the potential research questions that the dataset can be used to answer:

Does the Size of an app relate to the number of Installs? Users tend to avoid installing bulky apps. Yet “GAME” apps, which tend to be larger, form one of the most reviewed categories above. An important research question is whether there is an app size that maximizes the number of installs. A follow-up question is whether some categories, such as “GAME”, are outliers whose apps are both large and widely installed. Do users pay more for apps that are lighter?
What is the best monetization strategy for “Paid” apps? Relations between the Price, Rating and Installs variables can tell us how pricing relates to an app’s ratings and installs. Is there a price threshold above which users avoid downloading apps? Answering this would help deduce the best monetization strategy for a new app.
Are apps that target older Android Ver less popular? An app that targets an older Android Ver most likely has backward-compatible components and is less likely to use the newest Android features. Are such apps less popular in terms of ratings, reviews and the number of installs? An alternative hypothesis is that apps targeting a newer Android Ver have not been on the Play Store for long and might consequently be less popular. These follow-up questions can be answered by analyzing the dataset.
How does the distribution of Genres vary across categories? A genre might appear in multiple categories. An important question is which genre belongs to the largest number of categories. Is this genre also popular among users? Which genres tend to be the least popular among users? Which genres contain the highest-priced apps?
How does the last update date of an app affect its popularity among users? The Last Updated variable provides the date when the latest update of the app was published on the Google Play Store. Do apps that have gone longer without an update receive lower user ratings? Do these apps tend to be installed less? It could also be the case that Last Updated has no effect on any of the popularity metrics.
Does the number of reviews correlate positively with the number of installs? Active users tend to leave reviews, but does that also mean the app is installed more or has a higher rating on the Play Store? This might be especially important for paid apps, since users tend to read other people’s reviews before deciding to pay for an app.
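As a starting point for the last question, a rank correlation between Reviews and the parsed Installs could be computed. The sketch below uses made-up values and Spearman’s method, chosen here as an assumption because both variables are heavily skewed and Installs is bucketed; it is not a result from the dataset.

```r
library(dplyr)
library(readr)

# Toy review counts and install buckets (hypothetical values)
toy <- tibble(
  Reviews = c(150, 900, 87000, 215000, 36000, 120),
  Installs = c("10,000+", "500,000+", "5,000,000+",
               "50,000,000+", "1,000,000+", "10,000+")
)

rho <- toy %>%
  mutate(installs_min = parse_number(Installs)) %>%
  summarize(rho = cor(Reviews, installs_min, method = "spearman")) %>%
  pull(rho)
rho
```

On the real data the same pipeline could be run on googleplaystore_cleaned, optionally after group_by(Type) to compare “Free” and “Paid” apps.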