Course: WQD7004 Programming for Data Science

Group Members:
- Ooi Jing Zhi (24204232)
- Ramerswaren Narayanan (S2152877)
- Lim Zhi Yu (U2004995)
- Ng Jia Ying (23105464)
Date: 2026-01-14
In today’s digital marketplace, customer reviews have become one of the most valuable sources of feedback for companies seeking to understand user sentiment and improve their products. The Amazon mobile application, being one of the most widely used e-commerce platforms globally, receives thousands of reviews daily from users sharing their experiences. Understanding what drives user satisfaction and what makes a review helpful to other potential customers is crucial for both product development and customer engagement strategies.
This project was undertaken as part of the WQD7004 Programming for Data Science course requirements. Our group of four members worked together over a 10-week period to analyze Amazon app reviews data. The main objective was to apply the data science skills we have learned throughout the course, including data cleaning, exploratory data analysis, text mining, and machine learning modeling, to extract meaningful insights from real-world customer feedback data.
Throughout our project, we aimed to answer two primary research questions that address different aspects of customer review analysis:
Question 1 (Classification Problem): Can we accurately predict whether a customer review is positive or negative based solely on the text content of the review? This question is important because automated sentiment classification can help companies quickly identify dissatisfied customers and respond appropriately, as well as monitor overall customer satisfaction trends.
Question 2 (Regression Problem): What factors influence the helpfulness of a review as measured by the number of thumbs up votes it receives from other users? Understanding what makes a review helpful can guide users in writing more effective reviews and help platform designers encourage high-quality feedback.
The dataset used in this project consists of Amazon mobile application reviews collected from the Google Play Store. The original data was obtained from a publicly available dataset on Kaggle that contains daily-updated Amazon shopping reviews. Our specific dataset file contains customer reviews for the Amazon mobile application with the following key characteristics:
The dataset contains eight variables capturing different aspects of each review, including the review content, user rating, timestamp, and user engagement metrics.
Before beginning our analysis, we first needed to load the dataset
and verify its structure. We used the here package to
ensure robust path handling across different working environments, which
is particularly useful when collaborating on group projects where team
members may have different directory structures.
# Load required libraries
library(tidyverse)
library(lubridate)
library(tm)
library(SnowballC)
library(caret)
library(randomForest)
library(e1071)
library(tidytext)
library(here)
library(lda)
library(wordcloud)   # static word clouds for positive/negative reviews
library(wordcloud2)  # interactive word cloud of overall term frequencies
library(htmlwidgets)
library(slam)
# Set global random seed for reproducibility
set.seed(123)
# Load the dataset using robust path management
df_data <- read.csv(here("data", "20251124_amazon_reviews.csv"),
header = TRUE,
stringsAsFactors = FALSE)
Dataset loaded successfully! Number of rows: 81061 Number of columns: 8
After loading the data, we examined its structure to understand what variables are available and what preprocessing might be necessary. This step is crucial for identifying data quality issues and planning our cleaning strategy.
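The exact inspection code is not shown in the rendered output; a minimal sketch using base R functions on the df_data object loaded above would be:
# Inspect the structure: column names, types, and a preview of values
str(df_data)
# List the class of each column
sapply(df_data, class)
# Confirm the dimensions reported above (81061 rows, 8 columns)
dim(df_data)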
Column Names: Variables in the dataset: reviewId, userName, content, score, thumbsUpCount, reviewCreatedVersion, at, appVersion
Data Types: The dataset contains 8 columns with the following types: - Character: reviewId, userName, content, reviewCreatedVersion, at, appVersion - Numeric: score, thumbsUpCount
First 5 Rows:
# Display first 5 rows
head(df_data, 5)
reviewId userName
1 dfe2f6b7-1176-4d16-a748-026a908ef0cd J M
2 b72f4460-581c-4c7d-986d-40537abd9103 Samar Khaled
3 5483abb3-3a8e-4bbe-ad9f-efe1064439fc Mohamedakbar Ismail
4 c1c8937e-d7a5-4b55-b97f-0ef228fd1103 Wasim Shaikh
5 4845129d-1a9e-413c-80ec-7fc887cc9ff2 Davie H
content
1 not trustworthy bought a computer from Amazon directly only to have it delivered 2 days later than it was supposed to then I start a return process only to find out that they will only allow UPS pickup so that nightmare began took me 4 days 10hrs 9 reps and 11 hrs over 2 days waiting for UPS to get it returned they received it on the Nov 14th but now on Nov 22nd they're saying my refund is being delayed now they're saying I will not receive a refund until December 2nd
2 I am VERY VERY VERY disappointed i had to get a new phone pen today but guess what? your delivery guy didn't EVEN COME i don't know what happened but I am expecting a refund
3 bad delivery agent's KSA southern region Worst delivery service. The delivery agent provides no proper responses. They ask for the location earlier. If we ask them what time they will come, there is no proper answer. They think they're ultimate. If something is ordered, we must take leave from proper work and stay at home to wait for the delivery. If not, they cancel the delivery without any coordination.
4 Worst Delivery service i have ever experienced in UAE, Delivery team leaving the shipment outside the door without informing or instructions, in case they call you to collect the return they are talking very rudely and unprofessionaly, ill never use Amazon again just because of the delivery team behaviour also there is no direct contact centre of Amazon.
5 I had ordered something after years of not using my account, and it said I couldn't go through with it but I submitted an appeal, it took around 5 minutes to do the appeal, and they responded a few minutes later and let me go through with the order, amazing customer service!
score thumbsUpCount reviewCreatedVersion at appVersion
1 1 0 30.21.0.100 2025-11-22 10:18:54 30.21.0.100
2 1 0 30.21.0.100 2025-11-22 10:07:13 30.21.0.100
3 1 0 30.21.0.100 2025-11-22 10:00:13 30.21.0.100
4 1 0 30.21.0.100 2025-11-22 09:45:59 30.21.0.100
5 5 0 30.21.0.100 2025-11-22 09:10:20 30.21.0.100
Understanding the basic statistics of our dataset helps us identify patterns and potential issues early in the analysis process. We examined the distribution of ratings, review lengths, and other key variables.
Summary Statistics for Numeric Variables:
score thumbsUpCount
Min. :1.000 Min. : 0.000
1st Qu.:1.000 1st Qu.: 0.000
Median :2.000 Median : 0.000
Mean :2.617 Mean : 9.626
3rd Qu.:5.000 3rd Qu.: 1.000
Max. :5.000 Max. :5660.000
Missing Values per Column: Missing values by column: reviewId: 0, userName: 0, content: 0, score: 0, thumbsUpCount: 0, reviewCreatedVersion: 0, at: 0, appVersion: 0
Rating Distribution: We also tabulated the number of reviews at each star rating; as discussed below, the distribution is heavily skewed toward the extreme ratings (1 and 5 stars).
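The printed summaries and the rating tabulation can be reproduced with base R; a minimal sketch, assuming df_data as loaded earlier:
# Summary statistics for the numeric variables
summary(df_data[, c("score", "thumbsUpCount")])
# Missing values per column
colSums(is.na(df_data))
# Rating distribution: number of reviews per star rating, as counts and proportions
table(df_data$score)
round(prop.table(table(df_data$score)), 3)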
Our initial exploration revealed several important characteristics about the dataset:
The dataset contains reviews with star ratings ranging from 1 to 5, where 1 represents the most negative experience and 5 represents the most positive. We observed that the distribution of ratings is not uniform, which is typical for customer review data where extreme opinions (both very positive and very negative) tend to be overrepresented compared to neutral experiences.
The text content of reviews varies significantly in length, with some reviews consisting of just a few words while others contain detailed paragraphs. This variation will need to be accounted for during text preprocessing.
We also noticed that the timestamp column is stored as a character string and will need to be converted to a proper datetime format for temporal analysis. Additionally, some reviews may contain emojis, special characters, or non-English text that will require cleaning.
Data cleaning is often the most time-consuming part of any data science project, but it is essential for ensuring the quality of our analysis results. Based on our initial exploration, we identified several cleaning requirements: converting the timestamp column to a proper datetime format, normalizing the review text (lowercasing and removing emojis, punctuation, numbers, and extra whitespace), removing stopwords and applying stemming, engineering features such as review length and sentiment labels, and dropping reviews left empty after cleaning.
The following code implements our comprehensive data cleaning pipeline using the tidyverse and tm packages. We made a backup of the original content before processing to preserve the raw text for potential future analysis.
# Perform comprehensive data cleaning
clean_df <- df_data %>%
# Convert "at" column from string to datetime and extract temporal features
mutate(
review_date = ymd_hms(at),
hour_of_day = hour(review_date),
day_of_week = wday(review_date, label = TRUE)
) %>%
# Create a backup of the original content column
mutate(
original_content = content
) %>%
# Text preprocessing: convert to lowercase
mutate(
content = str_to_lower(content),
# Remove emojis and non-ASCII characters
content = iconv(content, "latin1", "ASCII", sub = ""),
# Remove punctuation
content = str_remove_all(content, "[[:punct:]]"),
# Remove numbers
content = str_remove_all(content, "[[:digit:]]"),
# Remove extra whitespace
content = str_squish(content)
) %>%
# Stopword removal and stemming using vectorized functions
mutate(
content = removeWords(content, stopwords("en")),
# Note: wordStem() treats each string as a single token, so only the trailing
# word of each review is stemmed at this stage; per-word stemming is applied
# later with stemDocument() when the document-term matrices are built
content = wordStem(content, language = "en")
) %>%
# Feature engineering
mutate(
# Calculate review length (word count)
review_length = str_count(content, "\\w+"),
# Create sentiment labels based on rating score
sentiment_label = case_when(
score >= 4 ~ "Positive",
score == 3 ~ "Neutral",
score <= 2 ~ "Negative"
)
)
Data cleaning completed! Number of rows after initial cleaning: 81061
In addition to the sentiment labels derived from star ratings, we performed an independent sentiment analysis using the Bing lexicon from the tidytext package. This allows us to compare the lexicon-based sentiment with the rating-based sentiment and provides an additional feature for our models.
# Perform lexicon-based sentiment analysis using tidytext
review_sentiments <- clean_df %>%
select(reviewId, content) %>%
# Tokenize reviews into individual words
unnest_tokens(word, content) %>%
# Join with Bing sentiment lexicon
inner_join(get_sentiments("bing"), by = "word") %>%
# Count positive and negative words per review
count(reviewId, sentiment) %>%
pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
# Calculate net sentiment score
mutate(lexicon_sentiment_score = positive - negative) %>%
select(reviewId, lexicon_sentiment_score)
# Join sentiment scores back to the main dataframe
clean_df <- clean_df %>%
left_join(review_sentiments, by = "reviewId") %>%
# Replace NA scores with 0 for reviews with no sentiment words
mutate(lexicon_sentiment_score = replace_na(lexicon_sentiment_score, 0)) %>%
# Filter out empty content after cleaning
filter(review_length > 0) %>%
# Select only required columns for analysis
select(
reviewId,
score,
sentiment_label,
thumbsUpCount,
review_length,
hour_of_day,
day_of_week,
review_date,
content,
original_content,
lexicon_sentiment_score
)
Lexicon-based sentiment analysis completed!
Sample lexicon sentiment scores (first 5 reviews):
reviewId score sentiment_label
1 dfe2f6b7-1176-4d16-a748-026a908ef0cd 1 Negative
2 b72f4460-581c-4c7d-986d-40537abd9103 1 Negative
3 5483abb3-3a8e-4bbe-ad9f-efe1064439fc 1 Negative
4 c1c8937e-d7a5-4b55-b97f-0ef228fd1103 1 Negative
5 4845129d-1a9e-413c-80ec-7fc887cc9ff2 5 Positive
lexicon_sentiment_score
1 1
2 0
3 2
4 -1
5 3
After completing the data cleaning process, we have a tidy dataset ready for analysis. The following summary shows the characteristics of our cleaned data.
Cleaned Dataset Summary: Total reviews: 80268
Reviews by sentiment: - Positive: 28232 reviews - Neutral: 6717 reviews - Negative: 45319 reviews
Descriptive Statistics for Key Variables:
score thumbsUpCount review_length lexicon_sentiment_score
Min. :1.000 Min. : 0.000 Min. : 1.00 Min. :-55.0000
1st Qu.:1.000 1st Qu.: 0.000 1st Qu.: 8.00 1st Qu.: -1.0000
Median :2.000 Median : 0.000 Median : 15.00 Median : 0.0000
Mean :2.617 Mean : 9.709 Mean : 18.68 Mean : 0.0893
3rd Qu.:5.000 3rd Qu.: 1.000 3rd Qu.: 27.00 3rd Qu.: 1.0000
Max. :5.000 Max. :5660.000 Max. :147.00 Max. : 48.0000
Data Cleaning Summary: - Original rows: 81061 - Rows after cleaning: 80268 - Rows removed: 793 - Percentage retained: 99.02%
Understanding the distribution of customer ratings provides insight into overall customer satisfaction. We visualized the distribution of star ratings and compared it with the sentiment labels we derived.
# Create visualization of rating distribution
library(ggplot2)
# Rating distribution
rating_plot <- ggplot(clean_df, aes(x = factor(score), fill = sentiment_label)) +
geom_bar(position = "dodge") +
scale_fill_manual(values = c("Negative" = "#e74c3c", "Neutral" = "#f39c12", "Positive" = "#27ae60")) +
labs(
title = "Distribution of Customer Ratings by Sentiment",
x = "Star Rating",
y = "Number of Reviews",
fill = "Sentiment"
) +
theme_minimal() +
theme(legend.position = "bottom")
print(rating_plot)
The length of a review may be related to both sentiment and perceived helpfulness. We examined the distribution of review lengths and how they vary by sentiment category.
# Review length distribution by sentiment
length_plot <- ggplot(clean_df, aes(x = review_length, fill = sentiment_label)) +
geom_histogram(bins = 50, alpha = 0.7, position = "dodge") +
scale_fill_manual(values = c("Negative" = "#e74c3c", "Neutral" = "#f39c12", "Positive" = "#27ae60")) +
labs(
title = "Distribution of Review Lengths by Sentiment",
x = "Review Length (words)",
y = "Number of Reviews",
fill = "Sentiment"
) +
theme_minimal() +
theme(legend.position = "bottom") +
xlim(0, 200)
print(length_plot)
Average Review Length by Sentiment:
| Sentiment | Mean Length | Median Length | SD Length | N |
|---|---|---|---|---|
| Negative | 22.69880 | 19 | 13.97070 | 45319 |
| Neutral | 21.23388 | 17 | 13.49354 | 6717 |
| Positive | 11.61030 | 9 | 11.07016 | 28232 |
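The summary table above can be reproduced with a standard dplyr aggregation; a minimal sketch, assuming the clean_df built earlier:
# Average review length by sentiment category
clean_df %>%
  group_by(sentiment_label) %>%
  summarise(
    mean_length = mean(review_length),
    median_length = median(review_length),
    sd_length = sd(review_length),
    n = n(),
    .groups = "drop"
  )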
Analyzing reviews over time can reveal patterns in customer satisfaction and identify periods of particularly positive or negative sentiment. We examined how average ratings and sentiment ratios changed over time.
# Create daily trend data
daily_trends <- clean_df %>%
mutate(review_date_only = as.Date(review_date)) %>%
group_by(review_date_only) %>%
summarise(
avg_score = mean(score),
positive_ratio = sum(sentiment_label == "Positive") / n(),
negative_ratio = sum(sentiment_label == "Negative") / n(),
total_reviews = n(),
.groups = "drop"
) %>%
arrange(review_date_only)
# Plot daily trends
trends_plot <- ggplot(daily_trends, aes(x = review_date_only)) +
geom_line(aes(y = avg_score, color = "Average Score"), linewidth = 1) +
geom_line(aes(y = positive_ratio * 5, color = "Positive Ratio (x5)"), linewidth = 1) +
geom_line(aes(y = negative_ratio * 5, color = "Negative Ratio (x5)"), linewidth = 1) +
scale_y_continuous(
name = "Average Score",
sec.axis = sec_axis(~./5, name = "Sentiment Ratio")
) +
scale_color_manual(
name = "Metrics",
values = c("Average Score" = "#3498db", "Positive Ratio (x5)" = "#27ae60", "Negative Ratio (x5)" = "#e74c3c")
) +
labs(
title = "Daily Average Score and Sentiment Trends",
subtitle = "Amazon App Reviews Over Time",
x = "Date"
) +
theme_minimal() +
theme(legend.position = "bottom")
print(trends_plot)
First 5 days of data:
| Date | Avg Score | Positive Ratio | Negative Ratio | Total Reviews |
|---|---|---|---|---|
| 2018-09-12 | 1.333333 | 0.0000000 | 1.0000000 | 3 |
| 2018-09-13 | 3.333333 | 0.6666667 | 0.3333333 | 3 |
| 2018-09-14 | 2.285714 | 0.1428571 | 0.7142857 | 7 |
| 2018-09-15 | 2.285714 | 0.2857143 | 0.7142857 | 7 |
| 2018-09-16 | 2.000000 | 0.0000000 | 0.7500000 | 4 |
Examining the most frequent words in reviews provides insight into common themes and topics that customers discuss. This analysis helps us understand what aspects of the Amazon app are most frequently mentioned.
# Create word frequency analysis using slam package for memory efficiency
# Create corpus and document-term matrix
all_corpus <- Corpus(VectorSource(clean_df$content))
all_tdm <- TermDocumentMatrix(all_corpus)
# Use row sums from sparse matrix directly (no conversion to dense)
term_freq <- row_sums(all_tdm)
term_freq <- sort(term_freq, decreasing = TRUE)
# Create data frame for visualization
word_freq_df <- data.frame(word = names(term_freq), freq = term_freq)
Top 10 Most Frequent Words in Reviews:
word freq
app app 38856
amazon amazon 37080
get get 12727
just just 11481
now now 11327
dont dont 10134
time time 10044
cant cant 9906
prime prime 9623
order order 9073
# Create word cloud
wordcloud2(word_freq_df[1:30, ], size = 0.4, color = "random-dark")
For our first research question, we aimed to build a classification model that can predict whether a review is positive or negative based on its text content. This is a supervised learning problem where the input is the review text and the output is a binary classification (positive or negative sentiment).
We chose the Naive Bayes classifier for this task because it is particularly well-suited for text classification problems. Naive Bayes performs well with high-dimensional sparse data like text documents represented as term frequency matrices, and it is computationally efficient to train.
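For reference, the Naive Bayes decision rule for text classification combines a class prior with the (conditionally independent) likelihoods of the observed terms. For a review containing terms $w_1, \dots, w_n$, the predicted class is

$$
\hat{c} = \arg\max_{c \in \{\text{Positive},\, \text{Negative}\}} P(c) \prod_{i=1}^{n} P(w_i \mid c)
$$

The "naive" conditional-independence assumption is what keeps the model tractable on the sparse document-term matrix constructed below.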
Before building our model, we needed to prepare the text data by creating a document-term matrix (DTM) that represents each review as a vector of word frequencies. A critical step here was performing the train-test split BEFORE creating the DTM to avoid data leakage.
Important Note on Data Leakage Prevention: We split the data into training and testing sets first, then built the vocabulary and document-term matrix using only the training data. This ensures that our model evaluation metrics accurately reflect real-world performance.
# Create sentiment_df from clean_df (exclude neutral reviews for binary classification)
sentiment_df <- clean_df %>%
filter(sentiment_label %in% c("Positive", "Negative")) %>%
mutate(sentiment_label = factor(sentiment_label, levels = c("Positive", "Negative")))
# Split data FIRST to avoid data leakage
# All text processing uses training data vocabulary only
set.seed(123)
train_index <- createDataPartition(sentiment_df$sentiment_label, p = 0.8, list = FALSE)
train_data <- sentiment_df[train_index, ]
test_data <- sentiment_df[-train_index, ]
# Preview the prepared sentiment dataset
head(sentiment_df)
reviewId score sentiment_label thumbsUpCount
1 dfe2f6b7-1176-4d16-a748-026a908ef0cd 1 Negative 0
2 b72f4460-581c-4c7d-986d-40537abd9103 1 Negative 0
3 5483abb3-3a8e-4bbe-ad9f-efe1064439fc 1 Negative 0
4 c1c8937e-d7a5-4b55-b97f-0ef228fd1103 1 Negative 0
5 4845129d-1a9e-413c-80ec-7fc887cc9ff2 5 Positive 0
6 f676d747-b324-43af-8862-ddfd5bae4178 5 Positive 0
review_length hour_of_day day_of_week review_date
1 47 10 Sat 2025-11-22 10:18:54
2 17 10 Sat 2025-11-22 10:07:13
3 41 10 Sat 2025-11-22 10:00:13
4 35 9 Sat 2025-11-22 09:45:59
5 23 9 Sat 2025-11-22 09:10:20
6 6 9 Sat 2025-11-22 09:00:24
content
1 trustworthy bought computer amazon directly delivered days later supposed start return process find will allow ups pickup nightmare began took days hrs reps hrs days waiting ups get returned received nov th now nov nd theyre saying refund delayed now theyre saying will receive refund december nd
2 disappointed get new phone pen today guess delivery guy didnt even come dont know happened expecting refund
3 bad delivery agents ksa southern region worst delivery service delivery agent provides proper responses ask location earlier ask time will come proper answer think theyre ultimate something ordered must take leave proper work stay home wait delivery cancel delivery without coordin
4 worst delivery service ever experienced uae delivery team leaving shipment outside door without informing instructions case call collect return talking rudely unprofessionaly ill never use amazon just delivery team behaviour also direct contact centre amazon
5 ordered something years using account said couldnt go submitted appeal took around minutes appeal responded minutes later let go order amazing customer servic
6 easy shopping great prices fast deliveri
original_content
1 not trustworthy bought a computer from Amazon directly only to have it delivered 2 days later than it was supposed to then I start a return process only to find out that they will only allow UPS pickup so that nightmare began took me 4 days 10hrs 9 reps and 11 hrs over 2 days waiting for UPS to get it returned they received it on the Nov 14th but now on Nov 22nd they're saying my refund is being delayed now they're saying I will not receive a refund until December 2nd
2 I am VERY VERY VERY disappointed i had to get a new phone pen today but guess what? your delivery guy didn't EVEN COME i don't know what happened but I am expecting a refund
3 bad delivery agent's KSA southern region Worst delivery service. The delivery agent provides no proper responses. They ask for the location earlier. If we ask them what time they will come, there is no proper answer. They think they're ultimate. If something is ordered, we must take leave from proper work and stay at home to wait for the delivery. If not, they cancel the delivery without any coordination.
4 Worst Delivery service i have ever experienced in UAE, Delivery team leaving the shipment outside the door without informing or instructions, in case they call you to collect the return they are talking very rudely and unprofessionaly, ill never use Amazon again just because of the delivery team behaviour also there is no direct contact centre of Amazon.
5 I had ordered something after years of not using my account, and it said I couldn't go through with it but I submitted an appeal, it took around 5 minutes to do the appeal, and they responded a few minutes later and let me go through with the order, amazing customer service!
6 Easy shopping! great prices! fast delivery
lexicon_sentiment_score
1 1
2 0
3 2
4 -1
5 3
6 3
Training and test set split: 58842 training, 14709 test
# Create TF-IDF matrix for TRAINING data only
corpus_train <- Corpus(VectorSource(train_data$content))
corpus_train <- tm_map(corpus_train, content_transformer(tolower))
corpus_train <- tm_map(corpus_train, removePunctuation)
corpus_train <- tm_map(corpus_train, removeNumbers)
corpus_train <- tm_map(corpus_train, removeWords, stopwords("en"))
corpus_train <- tm_map(corpus_train, stemDocument)
corpus_train <- tm_map(corpus_train, stripWhitespace)
# Create document-term matrix from training data
dtm_train <- DocumentTermMatrix(corpus_train)
dtm_train_sparse <- removeSparseTerms(dtm_train, 0.99) # Remove very sparse terms
dtm_train_df <- as.data.frame(as.matrix(dtm_train_sparse))
colnames(dtm_train_df) <- make.names(colnames(dtm_train_df))
# Remove zero-variance features from training data
zero_var_cols <- nearZeroVar(dtm_train_df, saveMetrics = TRUE)
train_final <- cbind(
sentiment_label = train_data$sentiment_label,
dtm_train_df[, zero_var_cols$zeroVar == FALSE]
)
train_final$sentiment_label <- factor(train_final$sentiment_label)
# Features after filtering
ncol(train_final) - 1
[1] 339
The test data must be transformed using the same vocabulary learned from the training data. This ensures consistency and prevents any information from the test set influencing the model training process.
# Transform test data using training vocabulary
corpus_test <- Corpus(VectorSource(test_data$content))
corpus_test <- tm_map(corpus_test, content_transformer(tolower))
corpus_test <- tm_map(corpus_test, removePunctuation)
corpus_test <- tm_map(corpus_test, removeNumbers)
corpus_test <- tm_map(corpus_test, removeWords, stopwords("en"))
corpus_test <- tm_map(corpus_test, stemDocument)
corpus_test <- tm_map(corpus_test, stripWhitespace)
# Create DTM with SAME vocabulary as training
dtm_test <- DocumentTermMatrix(corpus_test,
control = list(dictionary = Terms(dtm_train)))
dtm_test_df <- as.data.frame(as.matrix(dtm_test))
colnames(dtm_test_df) <- make.names(colnames(dtm_test_df))
# Ensure test data has all columns from training (fill missing with 0)
feature_cols <- colnames(dtm_train_df)
for (col in feature_cols) {
if (!(col %in% colnames(dtm_test_df))) {
dtm_test_df[[col]] <- 0
}
}
# Reorder columns to match training
dtm_test_df <- dtm_test_df[, feature_cols, drop = FALSE]
# Combine with labels
test_final <- cbind(sentiment_label = test_data$sentiment_label, dtm_test_df)
test_final <- test_final[, colnames(train_final)]
test_final$sentiment_label <- factor(test_final$sentiment_label)
cat("Test data prepared with", ncol(test_final) - 1, "features\n")
Test data prepared with 339 features
With the data properly prepared, we trained our Naive Bayes classifier and made predictions on the test set.
# Train Naive Bayes classifier
model_nb <- naiveBayes(sentiment_label ~ ., data = train_final)
# Make predictions on test set
predictions_nb <- predict(model_nb, test_final)
# Create confusion matrix
conf_matrix <- confusionMatrix(predictions_nb, test_final$sentiment_label)
# Display the confusion matrix counts (key metrics are reported below)
conf_matrix$table
Reference
Prediction Positive Negative
Positive 4924 2260
Negative 722 6803
cat("\nAccuracy:", round(conf_matrix$overall["Accuracy"] * 100, 1), "%\n")
Accuracy: 79.7 %
Naive Bayes Sentiment Classification Results
Confusion Matrix:
The confusion matrix shows how our predictions compare to actual sentiment labels:
| Actual | Predicted | Count |
|---|---|---|
| Positive | Positive | 4924 |
| Positive | Negative | 722 |
| Negative | Positive | 2260 |
| Negative | Negative | 6803 |
Key Metrics: - Accuracy: 79.7% - Sensitivity: 0.8721 - Specificity: 0.7506
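These metrics follow directly from the confusion matrix counts (with "Positive" as the positive class); a short sanity check:
# Recompute accuracy, sensitivity, and specificity from the confusion matrix
tp <- 4924  # actual Positive, predicted Positive
fn <- 722   # actual Positive, predicted Negative
fp <- 2260  # actual Negative, predicted Positive
tn <- 6803  # actual Negative, predicted Negative
accuracy    <- (tp + tn) / (tp + fn + fp + tn)  # ~0.797
sensitivity <- tp / (tp + fn)                   # ~0.872 (recall on Positive)
specificity <- tn / (tn + fp)                   # ~0.751 (recall on Negative)
round(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity), 4)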
The Naive Bayes model achieves 79.7% accuracy on the test set, demonstrating strong performance in distinguishing between positive and negative sentiment in customer reviews. With "Positive" as the positive class, the confusion matrix shows 4924 correctly classified positive reviews (true positives) and 6803 correctly classified negative reviews (true negatives), indicating reasonably balanced performance across both sentiment classes.
Performance Metrics Analysis:
The sensitivity of 0.8721 means the model correctly identifies about 87% of genuinely positive reviews, while the specificity of 0.7506 means about 75% of genuinely negative reviews are correctly classified. Most misclassifications are negative reviews predicted as positive (2260 cases, versus 722 positive reviews predicted as negative), so the model is somewhat more prone to missing negative sentiment than positive sentiment.
Practical Implications:
The high classification accuracy confirms that automated sentiment analysis is a viable tool for businesses seeking to monitor customer feedback at scale. The Naive Bayes algorithm’s efficiency with high-dimensional sparse text data makes it particularly suitable for real-time applications where rapid processing of large volumes of customer reviews is required. Organizations can leverage this capability to flag dissatisfied customers for follow-up, track overall satisfaction trends, and prioritize which reviews warrant a human response.
Model Reliability Assessment:
The reasonably balanced performance across both sentiment classes is noteworthy, as many text classification models exhibit stronger class-imbalance effects. This suggests that the vocabulary of positive and negative reviews is sufficiently distinct for the Naive Bayes classifier to exploit, and that ambiguous language does not dominate either class. The probabilistic nature of Naive Bayes also provides confidence scores for predictions, enabling selective review of low-confidence classifications where human judgment may be beneficial.
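As mentioned above, the fitted e1071 model can return class probabilities rather than hard labels; a minimal sketch (the 0.6 threshold is an illustrative choice, not part of the original analysis):
# Posterior class probabilities for each test review
nb_probs <- predict(model_nb, test_final, type = "raw")
# Flag low-confidence predictions for manual review
max_prob <- apply(nb_probs, 1, max)
low_confidence <- which(max_prob < 0.6)  # illustrative threshold
length(low_confidence)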
For our second research question, we aimed to understand what factors
influence the helpfulness of a review as measured by the number of
thumbs up votes it receives. This is a regression problem where we want
to predict a continuous outcome variable (thumbsUpCount)
based on several predictor variables.
We built two regression models: a Multiple Linear Regression model for interpretability and a Random Forest model to capture potential non-linear relationships.
For the regression analysis, we used numerical features derived from the reviews, including the star rating, review length, and time of day when the review was posted.
# Prepare data for regression analysis
# Create regression_df from clean_df
regression_df <- clean_df %>%
select(score, review_length, hour_of_day, thumbsUpCount) %>%
filter(complete.cases(.))
# Train-test split for regression
set.seed(123)
reg_index <- createDataPartition(regression_df$thumbsUpCount, p = 0.8, list = FALSE)
train_reg <- regression_df[reg_index, ]
test_reg <- regression_df[-reg_index, ]
Regression Dataset: 80268 observations

Summary Statistics:

summary(regression_df)

Train-Test Split: 64215 training, 16053 test observations
The linear regression model provides interpretable coefficients that show the direction and magnitude of each predictor’s effect on review helpfulness.
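The model-fitting code is not shown in the rendered output; a minimal sketch consistent with the coefficient table below (the object name model_lm matches the name used later when saving the models):
# Fit a multiple linear regression predicting thumbs-up counts
model_lm <- lm(thumbsUpCount ~ score + review_length + hour_of_day, data = train_reg)
# Coefficients, standard errors, t statistics, p-values, and R-squared
summary(model_lm)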
| Variable | Estimate | Std. Error | t value | p-value |
|---|---|---|---|---|
| (Intercept) | -12.9619 | 0.9618 | -13.4762 | 0.0000 |
| score | 1.3428 | 0.1939 | 6.9267 | 0.0000 |
| review_length | 1.0462 | 0.0241 | 43.4663 | 0.0000 |
| hour_of_day | -0.0176 | 0.0446 | -0.3952 | 0.6927 |
Model Fit: - R-squared: 0.0301 - Adjusted R-squared: 0.0301 - F-statistic: 665.15
The model explains 3% of variance in review helpfulness.
The regression results provide important insights into the factors that influence review helpfulness as measured by thumbs-up votes from other users. While the model reveals statistically significant relationships between predictors and the outcome variable, the overall explanatory power suggests that helpfulness is a complex construct influenced by factors beyond our current feature set.
Coefficient Analysis:
Star Rating Effect: The coefficient for score is 1.343 (p < 0.05), indicating that each additional star in the rating is associated with roughly 1.3 additional helpful votes on average, holding the other predictors constant. This finding aligns with consumer behavior research suggesting that positive experiences generate more engagement. Reviews expressing satisfaction may provide social proof that influences purchasing decisions, thereby receiving more validation through upvotes from other users.
Review Length Effect: The review_length coefficient of 1.046 indicates that each additional word in a review is associated with roughly one more helpful vote on average, so longer, more detailed reviews tend to receive more helpful votes. This relationship makes intuitive sense: comprehensive reviews that thoroughly evaluate product features, discuss use cases, and provide context about the reviewer’s experience offer greater value to readers seeking information before making purchase decisions.
Hour of Day Effect: The hour_of_day
coefficient of -0.018 shows a non-significant
relationship (p > 0.05), indicating that the timing of review
submission does not meaningfully affect how other users perceive its
helpfulness. This suggests that review content quality and contextual
relevance matter more than when the review is posted.
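To illustrate how the coefficients combine, consider a hypothetical 5-star, 50-word review posted at noon (values chosen purely for illustration):

$$
\hat{y} = -12.9619 + 1.3428(5) + 1.0462(50) - 0.0176(12) \approx 45.9 \text{ thumbs-up votes}
$$

Such point predictions should be interpreted cautiously given the model’s low explanatory power, discussed next.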
Model Fit Assessment:
The low R-squared value indicates that our three predictors (star rating, review length, and posting hour) capture only a small fraction of the factors that determine whether other users find a review helpful. This limitation highlights the multi-dimensional nature of review helpfulness, which likely depends on:
To potentially improve predictive performance and capture non-linear relationships, we trained a Random Forest model and examined feature importance.
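The Random Forest fitting code is likewise not shown; a minimal sketch consistent with the importance table below (the number of trees is an assumption, as it is not reported in the text):
# Fit a Random Forest regression with permutation-based importance
set.seed(123)
model_rf <- randomForest(
  thumbsUpCount ~ score + review_length + hour_of_day,
  data = train_reg,
  ntree = 500,       # assumed; randomForest's default
  importance = TRUE  # required for the %IncMSE permutation measure
)
# Feature importance: %IncMSE and IncNodePurity
importance(model_rf)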
| Feature | % Increase in MSE | Increase in Node Purity |
|---|---|---|
| score | 4.19 | 4606065 |
| review_length | 10.18 | 21176174 |
| hour_of_day | 1.33 | 7948666 |
The Random Forest model provides complementary insights through its feature importance metrics, which measure how much each predictor contributes to reducing prediction error. These results validate and extend the findings from our linear regression analysis.
Feature Importance Rankings:
Review Length (%IncMSE = 10.18%): Review length is the most influential predictor of review helpfulness. When this feature is randomly permuted during out-of-bag validation, the mean squared error increases by about 10%, the largest increase among the three predictors. This validates the hypothesis that more detailed reviews provide greater value to readers, and the consistency between the linear and tree-based methods strengthens confidence in the finding.
Star Rating (%IncMSE = 4.19%): The star rating is the second most important predictor; permuting it increases prediction error by about 4%, confirming that rating information contributes meaningfully to predicting helpfulness. This aligns with our linear regression finding and suggests that the emotional valence expressed in a review influences whether other users find it valuable.
Hour of Day (%IncMSE = 1.33%): The time of day shows minimal predictive importance, with only a 1.33% increase in MSE when permuted. This confirms that posting time is not a meaningful driver of review helpfulness, consistent with our linear regression results where this coefficient was non-significant.
Model Comparison and Implications:
The Random Forest model does not achieve dramatically superior predictive performance compared to linear regression, as evidenced by the similar R-squared values (approximately 0.03 for both models). This similarity suggests that the relationship between our predictors and review helpfulness is predominantly linear rather than exhibiting complex non-linear patterns that tree-based ensembles typically capture. The lack of substantial improvement from the more sophisticated Random Forest algorithm indicates that the limiting factor is the small feature set rather than the choice of model: additional predictors, not additional model complexity, would be needed to explain more of the variation in helpfulness.
Practical Recommendations:
Based on these findings, users seeking to write more helpful reviews should focus on two primary strategies: providing detailed, comprehensive feedback and sharing genuinely positive experiences. The consistency of these findings across multiple analytical approaches provides strong evidence for these recommendations. Platform designers could incorporate these insights by encouraging users to write longer, more informative reviews and by surfacing such reviews more prominently to other shoppers.
Visualizing the most frequent words in positive versus negative reviews helps us understand what aspects of the Amazon app users discuss when expressing different sentiments.
# Word cloud for positive reviews using slam for memory efficiency
positive_text <- clean_df %>%
filter(sentiment_label == "Positive") %>%
pull(content)
positive_corpus <- Corpus(VectorSource(positive_text))
positive_dtm <- TermDocumentMatrix(positive_corpus)
positive_freq <- row_sums(positive_dtm)
positive_freq <- sort(positive_freq, decreasing = TRUE)
positive_df <- data.frame(word = names(positive_freq), freq = positive_freq)
# Word cloud for negative reviews
negative_text <- clean_df %>%
filter(sentiment_label == "Negative") %>%
pull(content)
negative_corpus <- Corpus(VectorSource(negative_text))
negative_dtm <- TermDocumentMatrix(negative_corpus)
negative_freq <- row_sums(negative_dtm)
negative_freq <- sort(negative_freq, decreasing = TRUE)
negative_df <- data.frame(word = names(negative_freq), freq = negative_freq)
Top 10 Words in Positive Reviews:
head(positive_df, 10)
word freq
amazon amazon 12697
app app 7597
love love 6319
great great 5509
shopping shopping 4667
good good 4528
easy easy 3896
can can 3047
always always 2963
get get 2897
Top 10 Words in Negative Reviews:
head(negative_df, 10)
word freq
app app 27069
amazon amazon 21969
now now 9175
get get 8691
just just 8291
cant cant 7975
dont dont 7479
time time 6791
order order 6626
even even 6597
par(mar = c(1, 1, 2, 1)) # reduce margins
wordcloud(
words = positive_df$word,
freq = positive_df$freq,
max.words = 80,
random.order = FALSE,
scale = c(7, 1.2)
)
title("Positive Reviews Word Cloud")
Positive Reviews Word Cloud: [Word cloud visualization
rendered in HTML output]
par(mar = c(1, 1, 2, 1)) # reduce margins
wordcloud(
words = negative_df$word,
freq = negative_df$freq,
max.words = 80,
random.order = FALSE,
scale = c(7, 1.2)
)
title("Negative Reviews Word Cloud")
Negative Reviews Word Cloud: [Word cloud visualization rendered in HTML output]
We manually categorized the most frequent terms into semantic groups to understand the main themes discussed in reviews.
# Define topic categories based on common themes
app_related <- c("app", "amazon", "mobile", "phone", "android", "ios")
service_related <- c("servic", "deliveri", "custom", "support", "help", "order")
experience_related <- c("good", "great", "love", "bad", "terribl", "nice", "best")
# Calculate frequencies for each category using the overall TDM
# Note: We can reuse the corpus if available, but creating it here is safer for independence
all_tdm <- TermDocumentMatrix(Corpus(VectorSource(clean_df$content)))
# Use slam::row_sums for memory efficiency
all_freq <- row_sums(all_tdm)
# Sum the frequencies of the words that appear in each target list
app_terms <- all_freq[names(all_freq) %in% app_related]
service_terms <- all_freq[names(all_freq) %in% service_related]
exp_terms <- all_freq[names(all_freq) %in% experience_related]
Topic Category Analysis
App-related terms (e.g., app, mobile, android): Total occurrences: 82,941
Service-related terms (e.g., delivery, customer, order): Total occurrences: 15,410
Experience-related terms (e.g., good, great, love): Total occurrences: 29,094
These categories reveal the main themes in Amazon app reviews: discussion of the app and platform itself, concerns about delivery and customer service, and expressions of overall user experience, both praise and frustration.
Understanding these topic categories helps validate our automated topic modeling results and provides a human-interpretable framework for understanding customer feedback themes.
We applied Latent Dirichlet Allocation (LDA) to automatically discover latent topics in the review corpus. This unsupervised technique identifies groups of words that frequently occur together, representing distinct themes in customer feedback. Unlike our manual categorization above, LDA discovers topics probabilistically based on word co-occurrence patterns.
Note: We use Latent Dirichlet Allocation (LDA) to discover latent topics in customer reviews. Due to computational constraints, we sample 1000 reviews and use the lda package’s collapsed Gibbs sampling implementation with 200 iterations.
library(lda)
num_topics <- 5
# Sample reviews
set.seed(123)
sample_size <- min(1000, nrow(clean_df))
sample_indices <- sample(seq_len(nrow(clean_df)), sample_size)
sampled_text <- clean_df$content[sample_indices]
# Remove NA / empty
sampled_text <- sampled_text[!is.na(sampled_text)]
sampled_text <- sampled_text[sampled_text != ""]
if (length(sampled_text) < 5) {
cat("LDA cannot run because text data is empty after cleaning.\n")
} else {
# Tokenize
token_list <- strsplit(sampled_text, "\\s+")
token_list <- lapply(token_list, function(x) x[x != ""])
token_list <- token_list[sapply(token_list, length) > 0]
# Build vocab
vocab <- sort(unique(unlist(token_list)))
vocab_index <- setNames(seq_along(vocab), vocab)
# Convert to the lda package's format: each document is a 2-row integer matrix
# (row 1 = 0-based word IDs into vocab, row 2 = counts)
documents <- lapply(token_list, function(tokens) {
ids <- as.integer(vocab_index[tokens]) - 1  # lda expects 0-based word IDs
tab <- table(ids)
word_ids <- as.integer(names(tab))
counts <- as.integer(tab)
rbind(word_ids, counts)
})
# Safety: remove empty docs
documents <- documents[sapply(documents, function(x) ncol(x) > 0)]
# Run Gibbs sampler
lda_result <- lda.collapsed.gibbs.sampler(
documents = documents,
K = num_topics,
vocab = vocab,
num.iterations = 200,
alpha = 0.1,
eta = 0.1
)
# Top words
top_words_per_topic <- top.topic.words(
lda_result$topics,
num.words = 10,
by.score = TRUE
)
# Print output
cat("LDA Topic Modeling Results\n")
cat("==========================\n")
cat("Documents analyzed:", length(documents), "\n")
cat("Vocabulary size:", length(vocab), "\n")
cat("Topics extracted:", num_topics, "\n\n")
for (i in 1:num_topics) {
cat("Topic", i, ":", paste(top_words_per_topic[, i], collapse = ", "), "\n")
}
lda_output <- list(
num_docs = length(documents),
vocab_size = length(vocab),
num_topics = num_topics,
top_words = top_words_per_topic
)
}
LDA Topic Modeling Results
==========================
Documents analyzed: 1000
Vocabulary size: 3624
Topics extracted: 5
Topic 1 : update, app, keeps, new, just, see, please, cart, something, fix
Topic 2 : dont, like, want, good, really, make, hard, just, notifications, delivery
Topic 3 : results, search, products, dark, filter, mode, product, find, looking, scrolling
Topic 4 : prime, order, account, customer, money, service, will, dont, refund, card
Topic 5 : love, great, amazon, shopping, best, always, easy, shipping, good, prime
LDA Topic Modeling Results
The LDA analysis discovered 5 main topics in customer reviews from a sample of 1000 documents. Each topic is characterized by a set of frequently co-occurring words that define the theme.
Topic Interpretation: Topic 1 centers on app updates and technical issues (update, keeps, fix, cart); Topic 2 captures general usability complaints and preferences (notifications, hard, delivery); Topic 3 relates to search and browsing features (search, results, filter, dark mode); Topic 4 concerns orders, accounts, Prime, refunds, and customer service; and Topic 5 reflects positive shopping experiences (love, great, easy, shipping).
The automated topic discovery through LDA complements our manual categorization, providing data-driven insights into the dominant themes in Amazon app reviews. Both approaches converge on similar categories (app functionality, service quality, user experience), validating the robustness of our findings.
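To extend this interpretation, each sampled review can be assigned to its dominant topic using the document-topic counts returned by the Gibbs sampler; a minimal sketch, assuming the lda_result object from the chunk above:
# document_sums is a K x D matrix of topic-assignment counts per document
doc_topic_counts <- lda_result$document_sums
# Dominant topic for each sampled review
dominant_topic <- apply(doc_topic_counts, 2, which.max)
# Share of sampled reviews falling under each topic
round(prop.table(table(dominant_topic)), 3)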
To ensure reproducibility and enable future use of our trained models, we saved all models and predictions to disk. This is an important practice for data science projects as it allows others (or our future selves) to use the trained models without retraining.
# Create output directories
output_dir <- "output"
models_dir <- file.path(output_dir, "models")
predictions_dir <- file.path(output_dir, "predictions")
plots_dir <- file.path(output_dir, "plots")
if (!dir.exists(output_dir)) dir.create(output_dir, recursive = TRUE)
if (!dir.exists(models_dir)) dir.create(models_dir, recursive = TRUE)
if (!dir.exists(predictions_dir)) dir.create(predictions_dir, recursive = TRUE)
if (!dir.exists(plots_dir)) dir.create(plots_dir, recursive = TRUE)
# Save trained models
saveRDS(model_nb, file.path(models_dir, "sentiment_naive_bayes.rds"))
saveRDS(model_lm, file.path(models_dir, "regression_linear.rds"))
saveRDS(model_rf, file.path(models_dir, "regression_random_forest.rds"))
# Save predictions
sentiment_predictions <- data.frame(
index = rownames(test_final),
predicted = predictions_nb,
actual = test_final$sentiment_label
)
write.csv(sentiment_predictions, file.path(predictions_dir, "sentiment_predictions.csv"),
row.names = FALSE)
regression_predictions <- data.frame(
actual = test_reg$thumbsUpCount,
predicted_lm = predict(model_lm, test_reg),
predicted_rf = predict(model_rf, test_reg)
)
write.csv(regression_predictions, file.path(predictions_dir, "regression_predictions.csv"),
row.names = FALSE)
# Save the daily trend summaries listed in the output files below
write.csv(daily_trends, file.path(predictions_dir, "daily_trends.csv"),
row.names = FALSE)
Output Files Saved
Models saved to: - output/models/sentiment_naive_bayes.rds - output/models/regression_linear.rds - output/models/regression_random_forest.rds
Predictions saved to: - output/predictions/sentiment_predictions.csv - output/predictions/regression_predictions.csv - output/predictions/daily_trends.csv
The following code demonstrates how to load and use the saved models for making predictions on new data.
# Example: Loading saved models for future use
# Load models
model_nb_loaded <- readRDS("output/models/sentiment_naive_bayes.rds")
model_lm_loaded <- readRDS("output/models/regression_linear.rds")
model_rf_loaded <- readRDS("output/models/regression_random_forest.rds")
# Make predictions on new data
# new_sentiment_pred <- predict(model_nb_loaded, new_data)
# new_helpfulness_pred_lm <- predict(model_lm_loaded, new_data)
# new_helpfulness_pred_rf <- predict(model_rf_loaded, new_data)
Over the course of this project, our group successfully analyzed Amazon app reviews data to address our two research questions. The following summarizes the key findings from each analytical component:
Classification Model (Sentiment Prediction): The Naive Bayes classifier achieved acceptable performance in predicting sentiment from review text, with an accuracy of 79.7%. This demonstrates that the text content of customer reviews is informative for determining sentiment. The model performs equally well on both positive and negative reviews, making it suitable for real-world applications such as automated customer feedback monitoring. The probabilistic nature of the Naive Bayes algorithm also provides confidence scores that can be used to prioritize reviews requiring human review.
Regression Model (Helpfulness Prediction): Both linear regression and Random Forest models converged on similar conclusions regarding review helpfulness. The analysis revealed that review helpfulness is influenced primarily by two factors: the star rating and the review length. Higher-rated and longer reviews tend to receive more helpful votes from other users. However, the low R-squared values indicate that these factors alone explain only a small fraction of the variation in helpfulness. This finding suggests that other unmeasured factors, such as reviewer credibility, product type, or review timing, play important roles that our models could not capture with the available data.
Topic Analysis: The word frequency and topic analysis revealed three dominant themes in customer reviews: app functionality discussions (performance, features, technical aspects), delivery service concerns (shipping, customer support, order issues), and general user experience sentiments (satisfaction, frustration, recommendations). Positive reviews frequently contain terms like “great,” “love,” and “best,” while negative reviews commonly include words related to problems, complaints, and disappointments. The manual categorization of topic categories was validated by the automated LDA topic modeling, providing confidence in the reliability of these thematic findings.
Temporal Patterns: The time series analysis revealed daily fluctuations in average ratings and sentiment ratios, suggesting that customer satisfaction is not static but responds to various factors including app updates, seasonal events, and product launches. These temporal patterns highlight the value of continuous monitoring rather than point-in-time assessments of customer sentiment.
While our analysis provides valuable insights, several limitations should be acknowledged:
Feature Limitations: Our helpfulness prediction models were limited to a small number of features (score, review length, hour of day). Many potentially important factors such as reviewer history, product category, and review timing relative to product launch were not available in the dataset.
Single Data Source: The analysis is based on reviews from a single point in time. Customer sentiment and review patterns may change over time as the app is updated.
Language Limitations: Our text preprocessing focused on English text. Reviews in other languages or with extensive use of slang or emojis may not be properly analyzed.
Binary Classification: For the sentiment classification, we simplified the problem to binary (positive/negative) by excluding neutral reviews. A more nuanced approach could use multi-class classification.
Based on our findings, we suggest the following directions for future research:
Feature Expansion: Incorporate additional features such as reviewer reputation scores, product categories, and temporal features (days since product launch) to improve helpfulness prediction.
Advanced NLP: Apply more sophisticated natural language processing techniques such as word embeddings (Word2Vec, GloVe) or transformer models (BERT) to capture semantic meaning beyond simple word frequencies.
Real-time Monitoring: Develop a dashboard or alerting system that uses our sentiment classification model to monitor customer satisfaction in real-time.
Recommender Enhancement: Use the insights from topic analysis to improve the review recommender system, helping users find the most relevant and helpful reviews.
This project provided our group with valuable hands-on experience in applying data science techniques to a real-world problem. We learned the importance of thorough data cleaning, of guarding against data leakage when preparing text features, of choosing models suited to the structure of the data, and of documenting every step so that the analysis remains reproducible.
Customer reviews represent a rich source of feedback that companies can leverage to improve their products and services. Through this project, we demonstrated that machine learning techniques can effectively extract insights from unstructured text data, enabling automated sentiment monitoring and helpfulness prediction. The methods and findings from this analysis can serve as a foundation for more sophisticated customer feedback analysis systems.
We believe that the skills and knowledge gained from completing this project have prepared us well for future data science work, where we will continue to apply these techniques to solve real-world problems.
The Amazon reviews dataset used in this analysis was obtained from Kaggle’s Amazon Shopping Reviews Dataset (daily updated version).
The following R packages were essential for this analysis: tidyverse (data manipulation and ggplot2 visualization), lubridate (date handling), tm and SnowballC (text preprocessing), tidytext (lexicon-based sentiment analysis), caret (data partitioning and model evaluation), e1071 (Naive Bayes), randomForest (Random Forest regression), lda (topic modeling), slam (sparse matrix utilities), wordcloud and wordcloud2 (word cloud visualizations), htmlwidgets (saving interactive widgets), and here (path management).
This project was a collaborative effort by all four group members. Each member contributed to data cleaning, analysis, and documentation. Regular group meetings helped ensure consistent progress and quality throughout the project.