This report loads a multilingual mobile app reviews dataset and
prepares a small, tidy data frame for basic analysis. Steps are kept
simple and explicit: select a few columns, assign clear names, and
standardize common encodings (e.g., language codes and dates).
Dataset source: https://www.kaggle.com/datasets/pratyushpuri/multilingual-mobile-app-reviews-dataset-2025
For context on interpreting app-store ratings alongside written
reviews, see "Discrepancies Between Text Reviews and Numerical Ratings
for Mobile Apps" (2024):
https://pmc.ncbi.nlm.nih.gov/articles/PMC10773744/
The article examines mismatches between star ratings and the sentiment
in review text across mobile marketplaces. These discrepancies suggest
that relying on ratings alone can misrepresent user experience,
motivating clean preprocessing before comparisons by app or
category.
library(tidyverse)
library(lubridate)
url <- "https://raw.githubusercontent.com/savibaraili/dataset-607-week1/main/multilingual_mobile_app_reviews_2025.csv"
reviews_raw <- readr::read_csv(url, show_col_types = FALSE)
# View the column names (so selections below can be edited if needed)
names(reviews_raw)
## [1] "review_id" "user_id" "app_name"
## [4] "app_category" "review_text" "review_language"
## [7] "rating" "review_date" "verified_purchase"
## [10] "device_type" "num_helpful_votes" "user_age"
## [13] "user_country" "user_gender" "app_version"
The code below does four things:
1) Select a small subset of columns and rename them (new_name = old_name).
2) Convert rating to numeric.
3) Parse the review date.
4) Expand a few common language codes for readability.
# 1) Select a small subset with clear names (new = old)
reviews_clean <- reviews_raw %>%
  dplyr::select(
    app_name     = app_name,
    review_text  = review_text,
    rating       = rating,            # target
    language     = review_language,   # e.g., "en", "es"
    review_date  = review_date,
    app_category = app_category,
    app_version  = app_version,
    country      = user_country
  )
# 2) Convert rating to numeric
reviews_clean <- reviews_clean %>%
  mutate(
    rating = as.numeric(rating)
  )
# 3) Parse review date (ISO 8601 format, e.g., 2024-10-09 19:26:40)
reviews_clean <- reviews_clean %>%
  mutate(
    review_date = lubridate::ymd_hms(review_date, quiet = TRUE)
  )
# 4) Expand language codes (includes codes observed in this dataset preview)
reviews_clean <- reviews_clean %>%
  mutate(
    language = dplyr::recode(
      language,
      "en" = "English",
      "es" = "Spanish",
      "fr" = "French",
      "de" = "German",
      "hi" = "Hindi",
      "ar" = "Arabic",
      "ru" = "Russian",
      "zh" = "Chinese",
      # Added based on observed values:
      "no" = "Norwegian",
      "vi" = "Vietnamese",
      "tl" = "Filipino",
      "th" = "Thai",
      "da" = "Danish",
      "ja" = "Japanese",
      "ms" = "Malay",
      .default = language
    )
  )
# Quick peek at the cleaned data
glimpse(reviews_clean)
## Rows: 2,514
## Columns: 8
## $ app_name <chr> "MX Player", "Tinder", "Netflix", "Venmo", "Google Drive"…
## $ review_text <chr> "Qui doloribus consequuntur. Perspiciatis tempora assumen…
## $ rating <dbl> 1.3, 1.6, 3.6, 3.8, 3.2, 5.0, 4.0, 1.2, 1.8, 4.6, 2.7, 4.…
## $ language <chr> "Norwegian", "Russian", "Spanish", "Vietnamese", "Filipin…
## $ review_date <dttm> 2024-10-09 19:26:40, 2024-06-21 17:29:40, 2024-10-31 13:…
## $ app_category <chr> "Travel & Local", "Navigation", "Dating", "Productivity",…
## $ app_version <chr> "1.4", "8.9", "2.8.37.5926", "10.2", "4.7", "11.2.87.6917…
## $ country <chr> "China", "Germany", "Nigeria", "India", "South Korea", "S…
head(reviews_clean, 10)
## # A tibble: 10 × 8
## app_name review_text rating language review_date app_category
## <chr> <chr> <dbl> <chr> <dttm> <chr>
## 1 MX Player Qui dolori… 1.3 Norwegi… 2024-10-09 19:26:40 Travel & Lo…
## 2 Tinder Great app … 1.6 Russian 2024-06-21 17:29:40 Navigation
## 3 Netflix The interf… 3.6 Spanish 2024-10-31 13:47:12 Dating
## 4 Venmo Latest upd… 3.8 Vietnam… 2025-03-12 06:16:22 Productivity
## 5 Google Drive Perfect fo… 3.2 Filipino 2024-04-21 03:48:27 Education
## 6 Netflix Works perf… 5 Thai 2024-01-15 02:49:03 Music & Aud…
## 7 Signal Basso bell… 4 Danish 2024-05-20 21:28:14 Travel & Lo…
## 8 Canva Odcinek sk… 1.2 Japanese 2025-05-26 05:21:22 Social Netw…
## 9 Microsoft Office Eius odio … 1.8 Malay 2023-09-13 07:50:14 Video Playe…
## 10 Dropbox Husband at… 4.6 French 2024-08-22 15:25:24 News & Maga…
## # ℹ 2 more variables: app_version <chr>, country <chr>
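Because unmatched codes fall through the recode step unchanged, it can help to list any values that still look like raw codes. The sketch below is only an illustration on a toy tibble: the values in toy_lang are hypothetical, and the pattern (two or three lowercase letters) is an assumption about what an unexpanded code looks like.

```r
library(dplyr)

# Toy stand-in for reviews_clean$language after recoding (hypothetical values)
toy_lang <- tibble::tibble(
  language = c("English", "Spanish", "pt", "ko", "pt", NA)
)

# Values that are still short lowercase codes were not covered by the lookup
unmapped <- toy_lang %>%
  filter(!is.na(language), grepl("^[a-z]{2,3}$", language)) %>%
  count(language, sort = TRUE)

unmapped
```

Running the same filter on reviews_clean would show which codes, if any, still need an entry in the lookup.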
# Basic rating summary
if ("rating" %in% names(reviews_clean) && is.numeric(reviews_clean$rating)) {
  summary(reviews_clean$rating)
}
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.000 2.100 3.000 3.021 4.000 5.000 37
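The summary reports 37 missing ratings, but the output alone does not show whether they were already missing in the source file or were introduced by as.numeric(). One way to tell the two apart, sketched here on a toy character column (the values in toy_raw are hypothetical, not taken from the dataset), is to flag rows where a non-missing raw value fails to convert.

```r
library(dplyr)

# Toy stand-in for a raw rating column read as text (hypothetical values)
toy_raw <- tibble::tibble(rating = c("4.5", "3", NA, "five", ""))

toy_checked <- toy_raw %>%
  mutate(
    rating_num = suppressWarnings(as.numeric(rating)),
    # TRUE only when a non-missing raw value failed to convert
    lost_in_conversion = !is.na(rating) & is.na(rating_num)
  )

n_missing_in_source <- sum(is.na(toy_raw$rating))
n_lost_in_conversion <- sum(toy_checked$lost_in_conversion)
```

If the raw column is already numeric when read, the second count will be zero and all missing values come from the source.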
# Simple histogram
if ("rating" %in% names(reviews_clean) && is.numeric(reviews_clean$rating)) {
  ggplot(reviews_clean, aes(x = rating)) +
    geom_histogram(binwidth = 0.5, boundary = 0) +
    labs(title = "Ratings Histogram", x = "Rating", y = "Count")
}
A small, tidy data frame (reviews_clean) is ready for simple filtering,
grouping, and plotting. Possible next steps include comparing ratings by
app or category and adding basic text analysis for review_text (e.g.,
word counts or simple sentiment).
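As a rough illustration of those next steps, the sketch below summarises ratings by category and adds a whitespace-based word count. It runs on a toy tibble with hypothetical rows that mirror the column names in reviews_clean. Splitting on whitespace is an assumption that will undercount words in languages written without spaces (for example Chinese, Japanese, or Thai), so it is a starting point rather than a finished text-analysis step.

```r
library(dplyr)
library(stringr)

# Toy rows using the same column names as reviews_clean (hypothetical values)
toy_reviews <- tibble::tibble(
  app_category = c("Dating", "Dating", "Education", "Education"),
  rating       = c(4.0, 2.0, 5.0, NA),
  review_text  = c("Great app", "Crashes on every launch",
                   "Perfect for students", NA)
)

by_category <- toy_reviews %>%
  # Count runs of non-whitespace characters as words; NA text stays NA
  mutate(word_count = str_count(review_text, "\\S+")) %>%
  group_by(app_category) %>%
  summarise(
    n_reviews   = n(),
    mean_rating = mean(rating, na.rm = TRUE),
    mean_words  = mean(word_count, na.rm = TRUE),
    .groups = "drop"
  )

by_category
```

Replacing toy_reviews with reviews_clean, and app_category with app_name, would give the per-app version of the same comparison.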
sessionInfo()
## R version 4.5.1 (2025-06-13)
## Platform: x86_64-apple-darwin20
## Running under: macOS Ventura 13.7.8
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.5-x86_64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.5-x86_64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.1
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: America/New_York
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] lubridate_1.9.4 forcats_1.0.0 stringr_1.5.1 dplyr_1.1.4
## [5] purrr_1.1.0 readr_2.1.5 tidyr_1.3.1 tibble_3.3.0
## [9] ggplot2_3.5.2 tidyverse_2.0.0
##
## loaded via a namespace (and not attached):
## [1] bit_4.6.0 gtable_0.3.6 jsonlite_2.0.0 crayon_1.5.3
## [5] compiler_4.5.1 tidyselect_1.2.1 parallel_4.5.1 jquerylib_0.1.4
## [9] scales_1.4.0 yaml_2.3.10 fastmap_1.2.0 R6_2.6.1
## [13] labeling_0.4.3 generics_0.1.4 curl_7.0.0 knitr_1.50
## [17] bslib_0.9.0 pillar_1.11.0 RColorBrewer_1.1-3 tzdb_0.5.0
## [21] rlang_1.1.6 utf8_1.2.6 cachem_1.1.0 stringi_1.8.7
## [25] xfun_0.53 sass_0.4.10 bit64_4.6.0-1 timechange_0.3.0
## [29] cli_3.6.5 withr_3.0.2 magrittr_2.0.3 digest_0.6.37
## [33] grid_4.5.1 vroom_1.6.5 rstudioapi_0.17.1 hms_1.1.3
## [37] lifecycle_1.0.4 vctrs_0.6.5 evaluate_1.0.5 glue_1.8.0
## [41] farver_2.1.2 rmarkdown_2.29 tools_4.5.1 pkgconfig_2.0.3
## [45] htmltools_0.5.8.1