Overview

This report loads a multilingual mobile app reviews dataset and prepares a small, tidy data frame for basic analysis. Steps are kept simple and explicit: select a few columns, assign clear names, and standardize common encodings (e.g., language codes and dates).
Dataset source: https://www.kaggle.com/datasets/pratyushpuri/multilingual-mobile-app-reviews-dataset-2025

For context on interpreting app‑store ratings alongside written reviews, see:
Discrepancies Between Text Reviews and Numerical Ratings for Mobile Apps (2024): https://pmc.ncbi.nlm.nih.gov/articles/PMC10773744/
The article examines mismatches between star ratings and the sentiment in review text across mobile marketplaces. These discrepancies suggest that relying on ratings alone can misrepresent user experience, motivating clean preprocessing before comparisons by app or category.

Setup

library(tidyverse)
library(lubridate)

Load Data (GitHub Raw URL)

url <- "https://raw.githubusercontent.com/savibaraili/dataset-607-week1/main/multilingual_mobile_app_reviews_2025.csv"
reviews_raw <- readr::read_csv(url, show_col_types = FALSE)

# View the column names (so selections below can be edited if needed)
names(reviews_raw)
##  [1] "review_id"         "user_id"           "app_name"         
##  [4] "app_category"      "review_text"       "review_language"  
##  [7] "rating"            "review_date"       "verified_purchase"
## [10] "device_type"       "num_helpful_votes" "user_age"         
## [13] "user_country"      "user_gender"       "app_version"
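Because the selections below depend on these column names, it can help to fail early if the file ever changes. The following is a minimal sketch, not part of the original report: `missing_columns()` is a hypothetical helper, and `demo_raw` is a toy stand-in for `reviews_raw` so the example runs without downloading the CSV.

```r
# Hypothetical helper (an assumption, not in the original report):
# returns the required column names that are absent from a data frame.
missing_columns <- function(df, required) {
  setdiff(required, names(df))
}

# Toy stand-in for reviews_raw; values are invented for illustration
demo_raw <- data.frame(
  app_name        = "Netflix",
  review_text     = "Works perfectly",
  rating          = 5,
  review_language = "th"
)

required <- c("app_name", "review_text", "rating",
              "review_language", "review_date")

missing <- missing_columns(demo_raw, required)

# Report the problem instead of letting select() fail later
if (length(missing) > 0) {
  message("Missing columns: ", paste(missing, collapse = ", "))
}
```

The same helper works on a tibble, since it only inspects `names()`.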

Beginner Cleaning: Select, Rename, Transform

The code below does four things:
1) Select a small subset of columns and rename them (new_name = old_name).
2) Convert rating to numeric.
3) Parse the review date.
4) Expand common language codes for readability.

# 1) Select a small subset with clear names (new = old)
reviews_clean <- reviews_raw %>%
  dplyr::select(
    app_name     = app_name,
    review_text  = review_text,
    rating       = rating,          # outcome variable of interest
    language     = review_language, # e.g., "en", "es"
    review_date  = review_date,
    app_category = app_category,
    app_version  = app_version,
    country      = user_country
  )

# 2) Convert rating to numeric
reviews_clean <- reviews_clean %>%
  mutate(
    rating = as.numeric(rating)
  )

# 3) Parse review date (e.g., 2024-10-09 19:26:40); values that fail to parse become NA
reviews_clean <- reviews_clean %>%
  mutate(
    review_date = lubridate::ymd_hms(review_date, quiet = TRUE)
  )

# 4) Expand language codes (includes codes observed in this dataset preview)
reviews_clean <- reviews_clean %>%
  mutate(
    language = dplyr::recode(
      language,
      "en" = "English",
      "es" = "Spanish",
      "fr" = "French",
      "de" = "German",
      "hi" = "Hindi",
      "ar" = "Arabic",
      "ru" = "Russian",
      "zh" = "Chinese",
      # Added based on observed values:
      "no" = "Norwegian",
      "vi" = "Vietnamese",
      "tl" = "Filipino",
      "th" = "Thai",
      "da" = "Danish",
      "ja" = "Japanese",
      "ms" = "Malay",
      .default = language
    )
  )

# Quick peek at the cleaned data
glimpse(reviews_clean)
## Rows: 2,514
## Columns: 8
## $ app_name     <chr> "MX Player", "Tinder", "Netflix", "Venmo", "Google Drive"…
## $ review_text  <chr> "Qui doloribus consequuntur. Perspiciatis tempora assumen…
## $ rating       <dbl> 1.3, 1.6, 3.6, 3.8, 3.2, 5.0, 4.0, 1.2, 1.8, 4.6, 2.7, 4.…
## $ language     <chr> "Norwegian", "Russian", "Spanish", "Vietnamese", "Filipin…
## $ review_date  <dttm> 2024-10-09 19:26:40, 2024-06-21 17:29:40, 2024-10-31 13:…
## $ app_category <chr> "Travel & Local", "Navigation", "Dating", "Productivity",…
## $ app_version  <chr> "1.4", "8.9", "2.8.37.5926", "10.2", "4.7", "11.2.87.6917…
## $ country      <chr> "China", "Germany", "Nigeria", "India", "South Korea", "S…
head(reviews_clean, 10)
## # A tibble: 10 × 8
##    app_name         review_text rating language review_date         app_category
##    <chr>            <chr>        <dbl> <chr>    <dttm>              <chr>       
##  1 MX Player        Qui dolori…    1.3 Norwegi… 2024-10-09 19:26:40 Travel & Lo…
##  2 Tinder           Great app …    1.6 Russian  2024-06-21 17:29:40 Navigation  
##  3 Netflix          The interf…    3.6 Spanish  2024-10-31 13:47:12 Dating      
##  4 Venmo            Latest upd…    3.8 Vietnam… 2025-03-12 06:16:22 Productivity
##  5 Google Drive     Perfect fo…    3.2 Filipino 2024-04-21 03:48:27 Education   
##  6 Netflix          Works perf…    5   Thai     2024-01-15 02:49:03 Music & Aud…
##  7 Signal           Basso bell…    4   Danish   2024-05-20 21:28:14 Travel & Lo…
##  8 Canva            Odcinek sk…    1.2 Japanese 2025-05-26 05:21:22 Social Netw…
##  9 Microsoft Office Eius odio …    1.8 Malay    2023-09-13 07:50:14 Video Playe…
## 10 Dropbox          Husband at…    4.6 French   2024-08-22 15:25:24 News & Maga…
## # ℹ 2 more variables: app_version <chr>, country <chr>
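Type conversion and date parsing turn unparseable values into NA without stopping, and any language code missing from the mapping is left as-is, so a quick check after cleaning can be useful. The sketch below runs on an invented toy tibble (`demo`) rather than the real data. The two-character test for leftover codes is a heuristic assumed here, not something the original report uses.

```r
library(dplyr)
library(lubridate)

# Toy stand-in for the raw columns; all values are invented for illustration
demo <- tibble(
  rating      = c("4.5", "n/a", "3"),
  review_date = c("2024-10-09 19:26:40", "not a date", "2024-06-21 17:29:40"),
  language    = c("en", "xx", "fr")
)

demo_clean <- demo %>%
  mutate(
    rating      = suppressWarnings(as.numeric(rating)),  # "n/a" becomes NA
    review_date = ymd_hms(review_date, quiet = TRUE),    # unparseable becomes NA
    language    = recode(language, "en" = "English", "fr" = "French")
  )

# How many values became NA in each converted column?
na_counts <- demo_clean %>%
  summarise(across(c(rating, review_date), ~ sum(is.na(.x))))

# Heuristic: values that are still two characters long were probably not mapped
unmapped <- demo_clean %>%
  filter(nchar(language) == 2) %>%
  count(language)
```

Running the same two summaries on `reviews_clean` would show whether any dates failed to parse and which codes, if any, still need an entry in the mapping.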

(Optional) Quick Checks

# Basic rating summary
if ("rating" %in% names(reviews_clean) && is.numeric(reviews_clean$rating)) {
  summary(reviews_clean$rating)
}
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   1.000   2.100   3.000   3.021   4.000   5.000      37
# Simple histogram
if ("rating" %in% names(reviews_clean) && is.numeric(reviews_clean$rating)) {
  ggplot(reviews_clean, aes(x = rating)) +
    geom_histogram(binwidth = 0.5, boundary = 0) +
    labs(title = "Ratings Histogram", x = "Rating", y = "Count")
}
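The summary above reports 37 missing ratings, and ggplot2 drops such rows from a histogram with a warning. One option is to count the missing values and drop them explicitly before plotting. A minimal sketch on invented toy data (`demo` is an assumption, standing in for `reviews_clean`):

```r
library(dplyr)

# Toy ratings (invented); NA marks a review without a rating
demo <- tibble(rating = c(1.3, NA, 4.0, 5.0, NA))

n_missing   <- sum(is.na(demo$rating))   # number of missing ratings
pct_missing <- mean(is.na(demo$rating))  # share of missing ratings

# Keep only rows with a rating, so later plots and means use a known subset
demo_rated <- demo %>%
  filter(!is.na(rating))
```

Filtering explicitly makes it clear how many reviews a plot or mean is based on, rather than relying on the plotting function to drop rows.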

Conclusions

A small, tidy data frame (reviews_clean) is ready for simple filtering, grouping, and plotting. Possible next steps include comparing ratings by app or category and adding basic text analysis for review_text (e.g., word counts or simple sentiment).
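As a rough illustration of those next steps, the sketch below computes a mean rating per category and a simple word count per review. It uses an invented toy tibble (`demo_clean`) with the same column names as `reviews_clean`. Counting runs of non-whitespace characters as words is an assumption made here for simplicity.

```r
library(dplyr)
library(stringr)

# Toy stand-in for reviews_clean; values are invented for illustration
demo_clean <- tibble(
  app_category = c("Dating", "Dating", "Education"),
  rating       = c(3.6, NA, 3.2),
  review_text  = c("Great app overall",
                   "Latest update broke login",
                   "Perfect for study")
)

# Review count and mean rating per category, ignoring missing ratings
by_category <- demo_clean %>%
  group_by(app_category) %>%
  summarise(
    n_reviews   = n(),
    mean_rating = mean(rating, na.rm = TRUE),
    .groups     = "drop"
  ) %>%
  arrange(desc(mean_rating))

# Word count per review: number of runs of non-whitespace characters
with_words <- demo_clean %>%
  mutate(n_words = str_count(review_text, "\\S+"))
```

Note that `n_reviews` counts every row in a category, including rows whose rating is missing, so it can exceed the number of ratings behind `mean_rating`.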

Appendix

sessionInfo()
## R version 4.5.1 (2025-06-13)
## Platform: x86_64-apple-darwin20
## Running under: macOS Ventura 13.7.8
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.5-x86_64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.5-x86_64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.1
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: America/New_York
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] lubridate_1.9.4 forcats_1.0.0   stringr_1.5.1   dplyr_1.1.4    
##  [5] purrr_1.1.0     readr_2.1.5     tidyr_1.3.1     tibble_3.3.0   
##  [9] ggplot2_3.5.2   tidyverse_2.0.0
## 
## loaded via a namespace (and not attached):
##  [1] bit_4.6.0          gtable_0.3.6       jsonlite_2.0.0     crayon_1.5.3      
##  [5] compiler_4.5.1     tidyselect_1.2.1   parallel_4.5.1     jquerylib_0.1.4   
##  [9] scales_1.4.0       yaml_2.3.10        fastmap_1.2.0      R6_2.6.1          
## [13] labeling_0.4.3     generics_0.1.4     curl_7.0.0         knitr_1.50        
## [17] bslib_0.9.0        pillar_1.11.0      RColorBrewer_1.1-3 tzdb_0.5.0        
## [21] rlang_1.1.6        utf8_1.2.6         cachem_1.1.0       stringi_1.8.7     
## [25] xfun_0.53          sass_0.4.10        bit64_4.6.0-1      timechange_0.3.0  
## [29] cli_3.6.5          withr_3.0.2        magrittr_2.0.3     digest_0.6.37     
## [33] grid_4.5.1         vroom_1.6.5        rstudioapi_0.17.1  hms_1.1.3         
## [37] lifecycle_1.0.4    vctrs_0.6.5        evaluate_1.0.5     glue_1.8.0        
## [41] farver_2.1.2       rmarkdown_2.29     tools_4.5.1        pkgconfig_2.0.3   
## [45] htmltools_0.5.8.1