Title Page

Course: WQD7004 Programming for Data Science

Group Members:

  • Ooi Jing Zhi (24204232)
  • Ramerswaren Narayanan (S2152877)
  • Lim Zhi Yu (U2004995)
  • Ng Jia Ying (23105464)

Date: 2026-01-14


Introduction and Project Overview

Project Background and Motivation

In today’s digital marketplace, customer reviews have become one of the most valuable sources of feedback for companies seeking to understand user sentiment and improve their products. The Amazon mobile application, being one of the most widely used e-commerce platforms globally, receives thousands of reviews daily from users sharing their experiences. Understanding what drives user satisfaction and what makes a review helpful to other potential customers is crucial for both product development and customer engagement strategies.

This project was undertaken as part of the WQD7004 Programming for Data Science course requirements. Our group of four members worked together over a 10-week period to analyze Amazon app reviews data. The main objective was to apply the data science skills we have learned throughout the course, including data cleaning, exploratory data analysis, text mining, and machine learning modeling, to extract meaningful insights from real-world customer feedback data.

Research Questions and Objectives

Throughout our project, we aimed to answer two primary research questions that address different aspects of customer review analysis:

Question 1 (Classification Problem): Can we accurately predict whether a customer review is positive or negative based solely on the text content of the review? This question is important because automated sentiment classification can help companies quickly identify dissatisfied customers and respond appropriately, as well as monitor overall customer satisfaction trends.

Question 2 (Regression Problem): What factors influence the helpfulness of a review as measured by the number of thumbs up votes it receives from other users? Understanding what makes a review helpful can guide users in writing more effective reviews and help platform designers encourage high-quality feedback.

Dataset Description

The dataset used in this project consists of Amazon mobile application reviews collected from the Google Play Store. The original data was obtained from a publicly available dataset on Kaggle that contains daily-updated Amazon shopping reviews. Our specific dataset file contains customer reviews for the Amazon mobile application with the following key characteristics:

  • Dataset Title: Amazon Shopping Reviews Dataset
  • Data Source: Kaggle (publicly available dataset)
  • Time Period: Reviews from September 2018 through November 2025 (a snapshot of a daily-updated dataset)
  • File Size: Approximately 23 MB

The dataset contains eight variables capturing different aspects of each review, including the review content, user rating, timestamp, and user engagement metrics.

Data Loading and Exploration

Loading the Dataset

Before beginning our analysis, we first needed to load the dataset and verify its structure. We used the here package to ensure robust path handling across different working environments, which is particularly useful when collaborating on group projects where team members may have different directory structures.

# Load required libraries
library(tidyverse)
library(lubridate)
library(tm)
library(SnowballC)
library(caret)
library(randomForest)
library(e1071)
library(tidytext)
library(here)
library(lda)
library(wordcloud2)
library(htmlwidgets)
library(slam)

# Set global random seed for reproducibility
set.seed(123)

# Load the dataset using robust path management
df_data <- read.csv(here("data", "20251124_amazon_reviews.csv"),
                    header = TRUE,
                    stringsAsFactors = FALSE)

Dataset loaded successfully! Number of rows: 81061 Number of columns: 8

Initial Data Exploration

After loading the data, we examined its structure to understand what variables are available and what preprocessing might be necessary. This step is crucial for identifying data quality issues and planning our cleaning strategy.

Column Names: reviewId, userName, content, score, thumbsUpCount, reviewCreatedVersion, at, appVersion

Data Types: The dataset contains 8 columns with the following types:

  • Character: reviewId, userName, content, reviewCreatedVersion, at, appVersion
  • Numeric: score, thumbsUpCount
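
The structural checks reported here can be produced with a few standard base R calls; the exact chunk is not shown in the report, so the sketch below is an assumption about how these summaries were obtained.

# Inspect structure and completeness of the raw data
names(df_data)            # column names listed above
str(df_data)              # data type of each column
colSums(is.na(df_data))   # missing values per column (reported further below)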

First 5 Rows:

# Display first 5 rows
head(df_data, 5)
                              reviewId            userName
1 dfe2f6b7-1176-4d16-a748-026a908ef0cd                 J M
2 b72f4460-581c-4c7d-986d-40537abd9103        Samar Khaled
3 5483abb3-3a8e-4bbe-ad9f-efe1064439fc Mohamedakbar Ismail
4 c1c8937e-d7a5-4b55-b97f-0ef228fd1103        Wasim Shaikh
5 4845129d-1a9e-413c-80ec-7fc887cc9ff2             Davie H
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   content
1 not trustworthy bought a computer from Amazon directly only to have it delivered 2 days later than it was supposed to then I start a return process only to find out that they will only allow UPS pickup so that nightmare began took me 4 days 10hrs 9 reps and 11 hrs over 2 days waiting for UPS to get it returned they received it on the Nov 14th but now on Nov 22nd they're saying my refund is being delayed now they're saying I will not receive a refund until December 2nd
2                                                                                                                                                                                                                                                                                                            I am VERY VERY VERY disappointed i had to get a new phone pen today but guess what? your delivery guy didn't EVEN COME i don't know what happened but I am expecting a refund
3                                                                 bad delivery agent's KSA southern region Worst delivery service. The delivery agent provides no proper responses. They ask for the location earlier. If we ask them what time they will come, there is no proper answer. They think they're ultimate. If something is ordered, we must take leave from proper work and stay at home to wait for the delivery. If not, they cancel the delivery without any coordination.
4                                                                                                                     Worst Delivery service i have ever experienced in UAE, Delivery team leaving the shipment outside the door without informing or instructions, in case they call you to collect the return they are talking very rudely and unprofessionaly, ill never use Amazon again just because of the delivery team behaviour also there is no direct contact centre of Amazon.
5                                                                                                                                                                                                      I had ordered something after years of not using my account, and it said I couldn't go through with it but I submitted an appeal, it took around 5 minutes to do the appeal, and they responded a few minutes later and let me go through with the order, amazing customer service!
  score thumbsUpCount reviewCreatedVersion                  at  appVersion
1     1             0          30.21.0.100 2025-11-22 10:18:54 30.21.0.100
2     1             0          30.21.0.100 2025-11-22 10:07:13 30.21.0.100
3     1             0          30.21.0.100 2025-11-22 10:00:13 30.21.0.100
4     1             0          30.21.0.100 2025-11-22 09:45:59 30.21.0.100
5     5             0          30.21.0.100 2025-11-22 09:10:20 30.21.0.100

Dataset Summary Statistics

Understanding the basic statistics of our dataset helps us identify patterns and potential issues early in the analysis process. We examined the distribution of ratings, review lengths, and other key variables.

Summary Statistics for Numeric Variables:

     score       thumbsUpCount     
 Min.   :1.000   Min.   :   0.000  
 1st Qu.:1.000   1st Qu.:   0.000  
 Median :2.000   Median :   0.000  
 Mean   :2.617   Mean   :   9.626  
 3rd Qu.:5.000   3rd Qu.:   1.000  
 Max.   :5.000   Max.   :5660.000  

Missing Values per Column: reviewId: 0, userName: 0, content: 0, score: 0, thumbsUpCount: 0, reviewCreatedVersion: 0, at: 0, appVersion: 0. No missing values were detected in any column.

Rating Distribution: Customer ratings are distributed as follows:

  • ⭐⭐⭐⭐⭐ 5 stars: 23280 reviews
  • ⭐⭐⭐⭐ 4 stars: 5268 reviews
  • ⭐⭐⭐ 3 stars: 6733 reviews
  • ⭐⭐ 2 stars: 8711 reviews
  • ⭐ 1 star: 37069 reviews

Key Observations from Initial Exploration

Our initial exploration revealed several important characteristics about the dataset:

The dataset contains reviews with star ratings ranging from 1 to 5, where 1 represents the most negative experience and 5 represents the most positive. We observed that the distribution of ratings is not uniform, which is typical for customer review data where extreme opinions (both very positive and very negative) tend to be overrepresented compared to neutral experiences.

The text content of reviews varies significantly in length, with some reviews consisting of just a few words while others contain detailed paragraphs. This variation will need to be accounted for during text preprocessing.

We also noticed that the timestamp column is stored as a character string and will need to be converted to a proper datetime format for temporal analysis. Additionally, some reviews may contain emojis, special characters, or non-English text that will require cleaning.

Data Cleaning and Preprocessing

Data Cleaning Strategy

Data cleaning is often the most time-consuming part of any data science project, but it is essential for ensuring the quality of our analysis results. Based on our initial exploration, we identified several cleaning requirements:

  1. DateTime Conversion: Convert the timestamp string to proper datetime format to enable temporal analysis
  2. Text Cleaning: Remove emojis, special characters, numbers, and convert text to lowercase
  3. Stopword Removal: Remove common English words that carry little semantic meaning
  4. Stemming: Reduce words to their root form to consolidate similar terms
  5. Feature Engineering: Create derived variables such as review length and sentiment labels

Data Cleaning Implementation

The following code implements our comprehensive data cleaning pipeline using the tidyverse and tm packages. We made a backup of the original content before processing to preserve the raw text for potential future analysis.

# Perform comprehensive data cleaning
clean_df <- df_data %>%
  # Convert "at" column from string to datetime and extract temporal features
  mutate(
    review_date = ymd_hms(at),
    hour_of_day = hour(review_date),
    day_of_week = wday(review_date, label = TRUE)
  ) %>%
  # Create a backup of the original content column
  mutate(
    original_content = content
  ) %>%
  # Text preprocessing: convert to lowercase
  mutate(
    content = str_to_lower(content),
    # Remove emojis and non-ASCII characters
    content = iconv(content, "latin1", "ASCII", sub = ""),
    # Remove punctuation
    content = str_remove_all(content, "[[:punct:]]"),
    # Remove numbers
    content = str_remove_all(content, "[[:digit:]]"),
    # Remove extra whitespace
    content = str_squish(content)
  ) %>%
  # Stopword removal and per-word stemming
  mutate(
    content = removeWords(content, stopwords("en")),
    # wordStem() stems single words, so split each review, stem each word, and re-join
    content = vapply(strsplit(content, " ", fixed = TRUE),
                     function(w) paste(wordStem(w, language = "en"), collapse = " "),
                     character(1))
  ) %>%
  # Feature engineering
  mutate(
    # Calculate review length (word count)
    review_length = str_count(content, "\\w+"),
    # Create sentiment labels based on rating score
    sentiment_label = case_when(
      score >= 4 ~ "Positive",
      score == 3 ~ "Neutral",
      score <= 2 ~ "Negative"
    )
  )

Data cleaning completed! Number of rows after initial cleaning: 81061

Lexicon-Based Sentiment Analysis

In addition to the sentiment labels derived from star ratings, we performed an independent sentiment analysis using the Bing lexicon from the tidytext package. This allows us to compare the lexicon-based sentiment with the rating-based sentiment and provides an additional feature for our models.

# Perform lexicon-based sentiment analysis using tidytext
review_sentiments <- clean_df %>%
  select(reviewId, content) %>%
  # Tokenize reviews into individual words
  unnest_tokens(word, content) %>%
  # Join with Bing sentiment lexicon
  inner_join(get_sentiments("bing"), by = "word") %>%
  # Count positive and negative words per review
  count(reviewId, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  # Calculate net sentiment score
  mutate(lexicon_sentiment_score = positive - negative) %>%
  select(reviewId, lexicon_sentiment_score)

# Join sentiment scores back to the main dataframe
clean_df <- clean_df %>%
  left_join(review_sentiments, by = "reviewId") %>%
  # Replace NA scores with 0 for reviews with no sentiment words
  mutate(lexicon_sentiment_score = replace_na(lexicon_sentiment_score, 0)) %>%
  # Filter out empty content after cleaning
  filter(review_length > 0) %>%
  # Select only required columns for analysis
  select(
    reviewId,
    score,
    sentiment_label,
    thumbsUpCount,
    review_length,
    hour_of_day,
    day_of_week,
    review_date,
    content,
    original_content,
    lexicon_sentiment_score
  )

Lexicon-based sentiment analysis completed!

Sample lexicon sentiment scores (first 5 reviews):

                              reviewId score sentiment_label
1 dfe2f6b7-1176-4d16-a748-026a908ef0cd     1        Negative
2 b72f4460-581c-4c7d-986d-40537abd9103     1        Negative
3 5483abb3-3a8e-4bbe-ad9f-efe1064439fc     1        Negative
4 c1c8937e-d7a5-4b55-b97f-0ef228fd1103     1        Negative
5 4845129d-1a9e-413c-80ec-7fc887cc9ff2     5        Positive
  lexicon_sentiment_score
1                       1
2                       0
3                       2
4                      -1
5                       3

Cleaned Dataset Summary

After completing the data cleaning process, we have a tidy dataset ready for analysis. The following summary shows the characteristics of our cleaned data.

Cleaned Dataset Summary: Total reviews: 80268

Reviews by sentiment:

  • Positive: 28232 reviews
  • Neutral: 6717 reviews
  • Negative: 45319 reviews

Descriptive Statistics for Key Variables:

     score       thumbsUpCount      review_length    lexicon_sentiment_score
 Min.   :1.000   Min.   :   0.000   Min.   :  1.00   Min.   :-55.0000       
 1st Qu.:1.000   1st Qu.:   0.000   1st Qu.:  8.00   1st Qu.: -1.0000       
 Median :2.000   Median :   0.000   Median : 15.00   Median :  0.0000       
 Mean   :2.617   Mean   :   9.709   Mean   : 18.68   Mean   :  0.0893       
 3rd Qu.:5.000   3rd Qu.:   1.000   3rd Qu.: 27.00   3rd Qu.:  1.0000       
 Max.   :5.000   Max.   :5660.000   Max.   :147.00   Max.   : 48.0000       

Data Cleaning Summary:

  • Original rows: 81061
  • Rows after cleaning: 80268
  • Rows removed: 793
  • Percentage retained: 99.02%

Exploratory Data Analysis

Rating Distribution Analysis

Understanding the distribution of customer ratings provides insight into overall customer satisfaction. We visualized the distribution of star ratings and compared it with the sentiment labels we derived.

# Create visualization of rating distribution
library(ggplot2)

# Rating distribution
rating_plot <- ggplot(clean_df, aes(x = factor(score), fill = sentiment_label)) +
  geom_bar(position = "dodge") +
  scale_fill_manual(values = c("Negative" = "#e74c3c", "Neutral" = "#f39c12", "Positive" = "#27ae60")) +
  labs(
    title = "Distribution of Customer Ratings by Sentiment",
    x = "Star Rating",
    y = "Number of Reviews",
    fill = "Sentiment"
  ) +
  theme_minimal() +
  theme(legend.position = "bottom")

print(rating_plot)

Review Length Analysis

The length of a review may be related to both sentiment and perceived helpfulness. We examined the distribution of review lengths and how they vary by sentiment category.

# Review length distribution by sentiment
length_plot <- ggplot(clean_df, aes(x = review_length, fill = sentiment_label)) +
  geom_histogram(bins = 50, alpha = 0.7, position = "dodge") +
  scale_fill_manual(values = c("Negative" = "#e74c3c", "Neutral" = "#f39c12", "Positive" = "#27ae60")) +
  labs(
    title = "Distribution of Review Lengths by Sentiment",
    x = "Review Length (words)",
    y = "Number of Reviews",
    fill = "Sentiment"
  ) +
  theme_minimal() +
  theme(legend.position = "bottom") +
  xlim(0, 200)

print(length_plot)

Average Review Length by Sentiment:

Review Length Statistics by Sentiment

  Sentiment   Mean Length   Median Length   SD Length       N
  Negative       22.69880              19    13.97070   45319
  Neutral        21.23388              17    13.49354    6717
  Positive       11.61030               9    11.07016   28232
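
The table above can be reproduced with a standard dplyr grouped summary; a brief sketch:

# Review length statistics by sentiment (reproduces the table above)
clean_df %>%
  group_by(sentiment_label) %>%
  summarise(
    mean_length   = mean(review_length),
    median_length = median(review_length),
    sd_length     = sd(review_length),
    n             = n(),
    .groups = "drop"
  )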

Temporal Analysis

Analyzing reviews over time can reveal patterns in customer satisfaction and identify periods of particularly positive or negative sentiment. We examined how average ratings and sentiment ratios changed over time.

# Create daily trend data
daily_trends <- clean_df %>%
  mutate(review_date_only = as.Date(review_date)) %>%
  group_by(review_date_only) %>%
  summarise(
    avg_score = mean(score),
    positive_ratio = sum(sentiment_label == "Positive") / n(),
    negative_ratio = sum(sentiment_label == "Negative") / n(),
    total_reviews = n(),
    .groups = "drop"
  ) %>%
  arrange(review_date_only)

# Plot daily trends
trends_plot <- ggplot(daily_trends, aes(x = review_date_only)) +
  geom_line(aes(y = avg_score, color = "Average Score"), linewidth = 1) +
  geom_line(aes(y = positive_ratio * 5, color = "Positive Ratio (x5)"), linewidth = 1) +
  geom_line(aes(y = negative_ratio * 5, color = "Negative Ratio (x5)"), linewidth = 1) +
  scale_y_continuous(
    name = "Average Score",
    sec.axis = sec_axis(~./5, name = "Sentiment Ratio")
  ) +
  scale_color_manual(
    name = "Metrics",
    values = c("Average Score" = "#3498db", "Positive Ratio (x5)" = "#27ae60", "Negative Ratio (x5)" = "#e74c3c")
  ) +
  labs(
    title = "Daily Average Score and Sentiment Trends",
    subtitle = "Amazon App Reviews Over Time",
    x = "Date"
  ) +
  theme_minimal() +
  theme(legend.position = "bottom")

print(trends_plot)

First 5 days of data:

Daily Trends: First 5 Days

  Date         Avg Score   Positive Ratio   Negative Ratio   Total Reviews
  2018-09-12    1.333333        0.0000000        1.0000000               3
  2018-09-13    3.333333        0.6666667        0.3333333               3
  2018-09-14    2.285714        0.1428571        0.7142857               7
  2018-09-15    2.285714        0.2857143        0.7142857               7
  2018-09-16    2.000000        0.0000000        0.7500000               4

Word Frequency Analysis

Examining the most frequent words in reviews provides insight into common themes and topics that customers discuss. This analysis helps us understand what aspects of the Amazon app are most frequently mentioned.

# Create word frequency analysis using slam package for memory efficiency

# Create corpus and document-term matrix
all_corpus <- Corpus(VectorSource(clean_df$content))
all_tdm <- TermDocumentMatrix(all_corpus)

# Use row sums from sparse matrix directly (no conversion to dense)
term_freq <- row_sums(all_tdm)
term_freq <- sort(term_freq, decreasing = TRUE)

# Create data frame for visualization
word_freq_df <- data.frame(word = names(term_freq), freq = term_freq)

Top 10 Most Frequent Words in Reviews:

         word  freq
app       app 38856
amazon amazon 37080
get       get 12727
just     just 11481
now       now 11327
dont     dont 10134
time     time 10044
cant     cant  9906
prime   prime  9623
order   order  9073
# Create word cloud
wordcloud2(word_freq_df[1:30, ], size = 0.4, color = "random-dark")

Classification Model: Sentiment Prediction

Problem Definition

For our first research question, we aimed to build a classification model that can predict whether a review is positive or negative based on its text content. This is a supervised learning problem where the input is the review text and the output is a binary classification (positive or negative sentiment).

We chose the Naive Bayes classifier for this task because it is particularly well-suited for text classification problems. Naive Bayes performs well with high-dimensional sparse data like text documents represented as term frequency matrices, and it is computationally efficient to train.

Data Preparation for Classification

Before building our model, we needed to prepare the text data by creating a document-term matrix (DTM) that represents each review as a vector of word frequencies. A critical step here was performing the train-test split BEFORE creating the DTM to avoid data leakage.

Important Note on Data Leakage Prevention: We split the data into training and testing sets first, then built the vocabulary and document-term matrix using only the training data. This ensures that our model evaluation metrics accurately reflect real-world performance.

# Create sentiment_df from clean_df (exclude neutral reviews for binary classification)
sentiment_df <- clean_df %>%
  filter(sentiment_label %in% c("Positive", "Negative")) %>%
  mutate(sentiment_label = factor(sentiment_label, levels = c("Positive", "Negative")))

# Split data FIRST to avoid data leakage
# All text processing uses training data vocabulary only
set.seed(123)
train_index <- createDataPartition(sentiment_df$sentiment_label, p = 0.8, list = FALSE)
train_data <- sentiment_df[train_index, ]
test_data <- sentiment_df[-train_index, ]

# Show dataset info naturally
head(sentiment_df)
                              reviewId score sentiment_label thumbsUpCount
1 dfe2f6b7-1176-4d16-a748-026a908ef0cd     1        Negative             0
2 b72f4460-581c-4c7d-986d-40537abd9103     1        Negative             0
3 5483abb3-3a8e-4bbe-ad9f-efe1064439fc     1        Negative             0
4 c1c8937e-d7a5-4b55-b97f-0ef228fd1103     1        Negative             0
5 4845129d-1a9e-413c-80ec-7fc887cc9ff2     5        Positive             0
6 f676d747-b324-43af-8862-ddfd5bae4178     5        Positive             0
  review_length hour_of_day day_of_week         review_date
1            47          10         Sat 2025-11-22 10:18:54
2            17          10         Sat 2025-11-22 10:07:13
3            41          10         Sat 2025-11-22 10:00:13
4            35           9         Sat 2025-11-22 09:45:59
5            23           9         Sat 2025-11-22 09:10:20
6             6           9         Sat 2025-11-22 09:00:24
                                                                                                                                                                                                                                                                                                                                            content
1  trustworthy bought  computer  amazon directly     delivered days later    supposed    start  return process   find    will  allow ups pickup   nightmare began took  days hrs reps  hrs  days waiting  ups  get  returned  received    nov th  now  nov nd theyre saying  refund   delayed now theyre saying  will  receive  refund  december nd
2                                                                                                                                                                                                                          disappointed    get  new phone pen today  guess   delivery guy didnt even come  dont know  happened    expecting  refund
3                             bad delivery agents ksa southern region worst delivery service  delivery agent provides  proper responses  ask   location earlier   ask   time  will come    proper answer  think theyre ultimate  something  ordered  must take leave  proper work  stay  home  wait   delivery    cancel  delivery without  coordin
4                                                         worst delivery service   ever experienced  uae delivery team leaving  shipment outside  door without informing  instructions  case  call   collect  return   talking  rudely  unprofessionaly ill never use amazon  just    delivery team behaviour also    direct contact centre  amazon
5                                                                                                                                                          ordered something  years   using  account   said  couldnt go      submitted  appeal  took around minutes    appeal   responded   minutes later  let  go    order amazing customer servic
6                                                                                                                                                                                                                                                                                                          easy shopping great prices fast deliveri
                                                                                                                                                                                                                                                                                                                                                                                                                                                                          original_content
1 not trustworthy bought a computer from Amazon directly only to have it delivered 2 days later than it was supposed to then I start a return process only to find out that they will only allow UPS pickup so that nightmare began took me 4 days 10hrs 9 reps and 11 hrs over 2 days waiting for UPS to get it returned they received it on the Nov 14th but now on Nov 22nd they're saying my refund is being delayed now they're saying I will not receive a refund until December 2nd
2                                                                                                                                                                                                                                                                                                            I am VERY VERY VERY disappointed i had to get a new phone pen today but guess what? your delivery guy didn't EVEN COME i don't know what happened but I am expecting a refund
3                                                                 bad delivery agent's KSA southern region Worst delivery service. The delivery agent provides no proper responses. They ask for the location earlier. If we ask them what time they will come, there is no proper answer. They think they're ultimate. If something is ordered, we must take leave from proper work and stay at home to wait for the delivery. If not, they cancel the delivery without any coordination.
4                                                                                                                     Worst Delivery service i have ever experienced in UAE, Delivery team leaving the shipment outside the door without informing or instructions, in case they call you to collect the return they are talking very rudely and unprofessionaly, ill never use Amazon again just because of the delivery team behaviour also there is no direct contact centre of Amazon.
5                                                                                                                                                                                                      I had ordered something after years of not using my account, and it said I couldn't go through with it but I submitted an appeal, it took around 5 minutes to do the appeal, and they responded a few minutes later and let me go through with the order, amazing customer service!
6                                                                                                                                                                                                                                                                                                                                                                                                                                               Easy shopping! great prices! fast delivery
  lexicon_sentiment_score
1                       1
2                       0
3                       2
4                      -1
5                       3
6                       3

Training and test set split: 58842 training, 14709 test

# Create TF-IDF matrix for TRAINING data only
corpus_train <- Corpus(VectorSource(train_data$content))
corpus_train <- tm_map(corpus_train, content_transformer(tolower))
corpus_train <- tm_map(corpus_train, removePunctuation)
corpus_train <- tm_map(corpus_train, removeNumbers)
corpus_train <- tm_map(corpus_train, removeWords, stopwords("en"))
corpus_train <- tm_map(corpus_train, stemDocument)
corpus_train <- tm_map(corpus_train, stripWhitespace)

# Create document-term matrix from training data
dtm_train <- DocumentTermMatrix(corpus_train)
dtm_train_sparse <- removeSparseTerms(dtm_train, 0.99)  # Remove very sparse terms
dtm_train_df <- as.data.frame(as.matrix(dtm_train_sparse))
colnames(dtm_train_df) <- make.names(colnames(dtm_train_df))

# Remove zero-variance features from training data
zero_var_cols <- nearZeroVar(dtm_train_df, saveMetrics = TRUE)
train_final <- cbind(
  sentiment_label = train_data$sentiment_label,
  dtm_train_df[, zero_var_cols$zeroVar == FALSE]
)
train_final$sentiment_label <- factor(train_final$sentiment_label)

# Features after filtering
ncol(train_final) - 1
[1] 339

Transforming Test Data

The test data must be transformed using the same vocabulary learned from the training data. This ensures consistency and prevents any information from the test set influencing the model training process.

# Transform test data using training vocabulary
corpus_test <- Corpus(VectorSource(test_data$content))
corpus_test <- tm_map(corpus_test, content_transformer(tolower))
corpus_test <- tm_map(corpus_test, removePunctuation)
corpus_test <- tm_map(corpus_test, removeNumbers)
corpus_test <- tm_map(corpus_test, removeWords, stopwords("en"))
corpus_test <- tm_map(corpus_test, stemDocument)
corpus_test <- tm_map(corpus_test, stripWhitespace)

# Create DTM with SAME vocabulary as training
dtm_test <- DocumentTermMatrix(corpus_test,
                               control = list(dictionary = Terms(dtm_train)))
dtm_test_df <- as.data.frame(as.matrix(dtm_test))
colnames(dtm_test_df) <- make.names(colnames(dtm_test_df))

# Ensure test data has all columns from training (fill missing with 0)
feature_cols <- colnames(dtm_train_df)
for (col in feature_cols) {
  if (!(col %in% colnames(dtm_test_df))) {
    dtm_test_df[[col]] <- 0
  }
}
# Reorder columns to match training
dtm_test_df <- dtm_test_df[, feature_cols, drop = FALSE]

# Combine with labels
test_final <- cbind(sentiment_label = test_data$sentiment_label, dtm_test_df)
test_final <- test_final[, colnames(train_final)]
test_final$sentiment_label <- factor(test_final$sentiment_label)

cat("Test data prepared with", ncol(test_final) - 1, "features\n")
Test data prepared with 339 features

Model Training and Prediction

With the data properly prepared, we trained our Naive Bayes classifier and made predictions on the test set.

# Train Naive Bayes classifier
model_nb <- naiveBayes(sentiment_label ~ ., data = train_final)

# Make predictions on test set
predictions_nb <- predict(model_nb, test_final)

# Create confusion matrix
conf_matrix <- confusionMatrix(predictions_nb, test_final$sentiment_label)

# Show key metrics only
# Just the confusion matrix without all the extra stats
conf_matrix$table
          Reference
Prediction Positive Negative
  Positive     4924     2260
  Negative      722     6803
cat("\nAccuracy:", round(conf_matrix$overall["Accuracy"] * 100, 1), "%\n")

Accuracy: 79.7 %

Naive Bayes Sentiment Classification Results

Confusion Matrix:

The confusion matrix shows how our predictions compare to actual sentiment labels:

Confusion Matrix: Predicted vs Actual Sentiment

  Predicted   Actual     Count
  Positive    Positive    4924
  Negative    Positive     722
  Positive    Negative    2260
  Negative    Negative    6803

Key Metrics:

  • Accuracy: 79.7%
  • Sensitivity (Positive class): 0.8721
  • Specificity (Negative class): 0.7506

Classification Results Interpretation

The Naive Bayes model achieves 79.7% accuracy on the test set, demonstrating solid performance in distinguishing between positive and negative sentiment in customer reviews. The confusion matrix shows 4924 true positives (correctly identified positive reviews) and 6803 true negatives (correctly identified negative reviews), with most errors coming from negative reviews misclassified as positive (2260 cases).

Performance Metrics Analysis:

  • Overall Accuracy: 79.7% of predictions correctly classify review sentiment
  • Sensitivity (Positive Class): 87.2% of positive reviews are correctly identified, meaning the model captures the majority of favorable feedback
  • Specificity (Negative Class): 75.1% of negative reviews are correctly identified, indicating reasonable, though weaker, detection of unfavorable sentiment
  • Class Balance: Sensitivity exceeds specificity by roughly 12 percentage points, so the model is somewhat better at recognizing positive reviews than negative ones, despite negative reviews forming the majority class

Practical Implications:

The high classification accuracy confirms that automated sentiment analysis is a viable tool for businesses seeking to monitor customer feedback at scale. The Naive Bayes algorithm’s efficiency with high-dimensional sparse text data makes it particularly suitable for real-time applications where rapid processing of large volumes of customer reviews is required. Organizations can leverage this capability to:

  1. Proactive Issue Detection: Automatically flag negative reviews for immediate customer service response
  2. Trend Monitoring: Track sentiment shifts over time to identify emerging product concerns
  3. Automated Reporting: Generate sentiment summaries without manual review of each comment

Model Reliability Assessment:

The model holds up reasonably well on both classes despite the class imbalance in the data (negative reviews outnumber positive ones), although it recognizes positive reviews somewhat more reliably than negative ones. This suggests that the vocabulary differences between positive and negative reviews are distinct enough for the Naive Bayes classifier to exploit, while a portion of negative reviews use language the model mistakes for positive sentiment. The probabilistic nature of Naive Bayes also provides posterior probabilities for each prediction, enabling selective review of low-confidence classifications where human judgment may be beneficial.
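
As an illustration of how these confidence scores could be used, the sketch below requests class posteriors from the fitted classifier and flags low-confidence cases for manual review; the 0.7 threshold is an assumption chosen for illustration, not a value from the analysis.

# Posterior class probabilities from the Naive Bayes model (e1071)
nb_probs <- predict(model_nb, test_final, type = "raw")

# Flag predictions whose winning-class probability falls below 0.7
# (threshold chosen for illustration only)
prediction_confidence <- apply(nb_probs, 1, max)
low_confidence_idx <- which(prediction_confidence < 0.7)
cat("Reviews flagged for manual review:", length(low_confidence_idx), "\n")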

Regression Model: Helpfulness Prediction

Problem Definition

For our second research question, we aimed to understand what factors influence the helpfulness of a review as measured by the number of thumbs up votes it receives. This is a regression problem where we want to predict a continuous outcome variable (thumbsUpCount) based on several predictor variables.

We built two regression models: a Multiple Linear Regression model for interpretability and a Random Forest model to capture potential non-linear relationships.

Data Preparation for Regression

For the regression analysis, we used numerical features derived from the reviews, including the star rating, review length, and time of day when the review was posted.

# Prepare data for regression analysis
# Create regression_df from clean_df
regression_df <- clean_df %>%
  select(score, review_length, hour_of_day, thumbsUpCount) %>%
  filter(complete.cases(.))

# Train-test split for regression
set.seed(123)
reg_index <- createDataPartition(regression_df$thumbsUpCount, p = 0.8, list = FALSE)
train_reg <- regression_df[reg_index, ]
test_reg <- regression_df[-reg_index, ]

Regression Dataset: 80268 observations

Train-Test Split: 64215 training, 16053 test observations
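
The report references the fitted objects model_lm and model_rf later (for the coefficient table, importance scores, predictions, and saving to disk) but does not display the fitting step itself. A minimal sketch of how the two models could be fit on the training split, assuming the standard lm() and randomForest() interfaces:

# Multiple linear regression: thumbs-up count as a function of rating,
# review length, and posting hour
model_lm <- lm(thumbsUpCount ~ score + review_length + hour_of_day,
               data = train_reg)
summary(model_lm)     # produces the coefficient table reported below

# Random Forest regression on the same predictors
# (ntree = 500 is the package default; importance = TRUE enables %IncMSE)
set.seed(123)
model_rf <- randomForest(thumbsUpCount ~ score + review_length + hour_of_day,
                         data = train_reg,
                         ntree = 500,
                         importance = TRUE)
importance(model_rf)  # produces the feature importance table reported later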

Multiple Linear Regression Model

The linear regression model provides interpretable coefficients that show the direction and magnitude of each predictor’s effect on review helpfulness.

Linear Regression Coefficients

  Variable         Estimate   Std. Error    t value   p-value
  (Intercept)      -12.9619       0.9618   -13.4762    0.0000
  score              1.3428       0.1939     6.9267    0.0000
  review_length      1.0462       0.0241    43.4663    0.0000
  hour_of_day       -0.0176       0.0446    -0.3952    0.6927

Model Fit:

  • R-squared: 0.0301
  • Adjusted R-squared: 0.0301
  • F-statistic: 665.15

The model explains 3% of variance in review helpfulness.

Linear Regression Results Interpretation

The regression results provide important insights into the factors that influence review helpfulness as measured by thumbs-up votes from other users. While the model reveals statistically significant relationships between predictors and the outcome variable, the overall explanatory power suggests that helpfulness is a complex construct influenced by factors beyond our current feature set.

Coefficient Analysis:

  • Star Rating Effect: The coefficient for score is 1.343 (p < 0.05), indicating that each additional star is associated with roughly 1.3 additional helpful votes, holding review length and posting hour constant (a worked example follows this list). This finding aligns with consumer behavior research suggesting that positive experiences generate more engagement: reviews expressing satisfaction may provide social proof that influences purchasing decisions, thereby receiving more validation through upvotes from other users.

  • Review Length Effect: The review_length coefficient of 1.046 indicates that each additional word is associated with about one more helpful vote on average, so longer, more detailed reviews tend to receive more helpful votes. This relationship makes intuitive sense: comprehensive reviews that thoroughly evaluate product features, discuss use cases, and provide context about the reviewer’s experience offer greater value to readers seeking information before making purchase decisions.

  • Hour of Day Effect: The hour_of_day coefficient of -0.018 shows a non-significant relationship (p > 0.05), indicating that the timing of review submission does not meaningfully affect how other users perceive its helpfulness. This suggests that review content quality and contextual relevance matter more than when the review is posted.
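
To make the coefficient magnitudes concrete, here is a quick back-of-envelope prediction from the fitted equation for a hypothetical 5-star, 50-word review posted at noon; the review characteristics are invented for illustration.

# Predicted thumbs-up count from the fitted linear model, using the rounded
# coefficients in the table above (hypothetical review: 5 stars, 50 words, hour 12)
-12.9619 + 1.3428 * 5 + 1.0462 * 50 + (-0.0176) * 12
# [1] 45.8509   (about 46 expected helpful votes)

The same calculation could be obtained with predict(model_lm, newdata = ...) once the fitted model object is available.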

Model Fit Assessment:

  • R-squared: 0.0301 - The model explains only 3% of variance in review helpfulness
  • Adjusted R-squared: 0.0301 - Adjusted for the number of predictors

The low R-squared value indicates that our three predictors (star rating, review length, and posting hour) capture only a small fraction of the factors that determine whether other users find a review helpful. This limitation highlights the multi-dimensional nature of review helpfulness, which likely depends on:

  1. Reviewer Credibility: Factors such as the reviewer’s purchase history, verified purchase status, and past review quality
  2. Product Context: Specific aspects of the product being reviewed, including category, price point, and complexity
  3. Temporal Factors: The timing of the review relative to product launch, seasonal trends, or competitive events
  4. Content Quality: Beyond length, factors such as readability, structure, presence of images, and specificity of claims
  5. Social Proof: Early engagement with the review, which may create momentum effects in subsequent voting behavior

Random Forest Regression Model

To potentially improve predictive performance and capture non-linear relationships, we trained a Random Forest model and examined feature importance.

Random Forest Feature Importance

  Feature         % Increase in MSE   Increase in Node Purity
  score                        4.19                   4606065
  review_length               10.18                  21176174
  hour_of_day                  1.33                   7948666

Feature Importance Analysis

The Random Forest model provides complementary insights through its feature importance metrics, which measure how much each predictor contributes to reducing prediction error. These results validate and extend the findings from our linear regression analysis.

Feature Importance Rankings:

  • Review Length (%IncMSE = 10.18%): Review length is the most influential predictor of review helpfulness. When this feature is randomly permuted during out-of-bag validation, prediction error (MSE) increases by about 10%, and it also shows the largest increase in node purity. This validates the hypothesis that more detailed reviews provide greater value to readers, and the consistency between linear and tree-based methods strengthens confidence in the finding.

  • Star Rating (%IncMSE = 4.19%): The star rating ranks second by the permutation measure, with a 4.19% increase in MSE when permuted, although its node purity contribution is the smallest of the three features. This is broadly consistent with the linear regression result that rating carries some information about helpfulness, but its effect is clearly weaker than that of review length.

  • Hour of Day (%IncMSE = 1.33%): The time of day shows minimal predictive importance, with only a 1.33% increase in MSE when permuted. This confirms that posting time is not a meaningful driver of review helpfulness, consistent with our linear regression results where this coefficient was non-significant.

Model Comparison and Implications:

The Random Forest model does not achieve dramatically superior predictive performance compared to linear regression, as evidenced by the similar R-squared values (approximately 0.03 for both models). This similarity suggests that the relationship between our predictors and review helpfulness is predominantly linear rather than exhibiting complex non-linear patterns that tree-based ensembles typically capture. The lack of substantial improvement from the more sophisticated Random Forest algorithm indicates that:

  1. Linear Relationships Predominate: The effects of rating and review length on helpfulness appear to follow consistent, additive patterns rather than threshold effects or interactions
  2. Feature Engineering Limitations: The predictive ceiling is constrained more by the limited feature set than by model complexity
  3. Complex Underlying Phenomenon: Review helpfulness remains largely unexplained by observable review characteristics alone
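
The test-set comparison referred to above is not shown as code in the report; a short sketch of how it could be computed on the held-out observations, assuming caret's postResample() helper:

# Evaluate both regression models on the held-out test set
pred_lm <- predict(model_lm, test_reg)
pred_rf <- predict(model_rf, test_reg)

postResample(pred = pred_lm, obs = test_reg$thumbsUpCount)  # RMSE, Rsquared, MAE
postResample(pred = pred_rf, obs = test_reg$thumbsUpCount)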

Practical Recommendations:

Based on these findings, users seeking to write more helpful reviews should focus on two primary strategies: providing detailed, comprehensive feedback and sharing genuinely positive experiences. The consistency of these findings across multiple analytical approaches provides strong evidence for these recommendations. Platform designers could incorporate these insights by:

  1. Highlighting Quality Indicators: Display review length badges or “detailed review” labels to signal comprehensive content
  2. Encouraging Detailed Feedback: Prompt reviewers with specific questions about product features, use cases, and expectations
  3. Rating-Adaptive Display: Consider prioritizing longer reviews from highly-rated submissions in recommendation algorithms

Topic Analysis

Word Cloud by Sentiment

Visualizing the most frequent words in positive versus negative reviews helps us understand what aspects of the Amazon app users discuss when expressing different sentiments.

# Word cloud for positive reviews using slam for memory efficiency
positive_text <- clean_df %>%
  filter(sentiment_label == "Positive") %>%
  pull(content)

positive_corpus <- Corpus(VectorSource(positive_text))
positive_dtm <- TermDocumentMatrix(positive_corpus)
positive_freq <- row_sums(positive_dtm)
positive_freq <- sort(positive_freq, decreasing = TRUE)
positive_df <- data.frame(word = names(positive_freq), freq = positive_freq)

# Word cloud for negative reviews
negative_text <- clean_df %>%
  filter(sentiment_label == "Negative") %>%
  pull(content)

negative_corpus <- Corpus(VectorSource(negative_text))
negative_dtm <- TermDocumentMatrix(negative_corpus)
negative_freq <- row_sums(negative_dtm)
negative_freq <- sort(negative_freq, decreasing = TRUE)
negative_df <- data.frame(word = names(negative_freq), freq = negative_freq)

Top 10 Words in Positive Reviews:

head(positive_df, 10)
             word  freq
amazon     amazon 12697
app           app  7597
love         love  6319
great       great  5509
shopping shopping  4667
good         good  4528
easy         easy  3896
can           can  3047
always     always  2963
get           get  2897

Top 10 Words in Negative Reviews:

head(negative_df, 10)
         word  freq
app       app 27069
amazon amazon 21969
now       now  9175
get       get  8691
just     just  8291
cant     cant  7975
dont     dont  7479
time     time  6791
order   order  6626
even     even  6597
library(wordcloud)  # the wordcloud() function used below comes from the wordcloud package

par(mar = c(1, 1, 2, 1))  # reduce margins

wordcloud(
  words = positive_df$word,
  freq  = positive_df$freq,
  max.words = 80,
  random.order = FALSE,
  scale = c(7, 1.2)
)

title("Positive Reviews Word Cloud")

Positive Reviews Word Cloud: [Word cloud visualization rendered in HTML output]

par(mar = c(1, 1, 2, 1))  # reduce margins

wordcloud(
  words = negative_df$word,
  freq  = negative_df$freq,
  max.words = 80,
  random.order = FALSE,
  scale = c(7, 1.2)
)

title("Negative Reviews Word Cloud")

Negative Reviews Word Cloud: [Word cloud visualization rendered in HTML output]

Topic Category Analysis

We manually categorized the most frequent terms into semantic groups to understand the main themes discussed in reviews.

# Define topic categories based on common themes
app_related <- c("app", "amazon", "mobile", "phone", "android", "ios")
service_related <- c("servic", "deliveri", "custom", "support", "help", "order")
experience_related <- c("good", "great", "love", "bad", "terribl", "nice", "best")

# Calculate frequencies for each category using the overall TDM
# Note: We can reuse the corpus if available, but creating it here is safer for independence
all_tdm <- TermDocumentMatrix(Corpus(VectorSource(clean_df$content)))

# Use slam::row_sums for memory efficiency
all_freq <- row_sums(all_tdm)

# Sum the frequencies of the words that appear in each category list
app_terms <- all_freq[names(all_freq) %in% app_related]
service_terms <- all_freq[names(all_freq) %in% service_related]
exp_terms <- all_freq[names(all_freq) %in% experience_related]

Topic Category Frequencies

App-related terms (e.g., app, mobile, android): 82941 total occurrences

Service-related terms (e.g., delivery, customer, order): 15410 total occurrences

Experience-related terms (e.g., good, great, love): 29094 total occurrences

These categories reveal the main themes in Amazon app reviews:

  1. App Functionality: Users frequently discuss the app’s performance, features, and technical aspects (mobile, android, phone)
  2. Delivery Service: Shipping, delivery, and customer service are major topics of discussion (delivery, customer, order)
  3. User Experience: Overall satisfaction and experience are commonly expressed (good, great, love, bad, terrible)

Understanding these topic categories helps validate our automated topic modeling results and provides a human-interpretable framework for understanding customer feedback themes.

Topic Modeling with LDA

We applied Latent Dirichlet Allocation (LDA) to automatically discover latent topics in the review corpus. This unsupervised technique identifies groups of words that frequently occur together, representing distinct themes in customer feedback. Unlike our manual categorization above, LDA discovers topics probabilistically based on word co-occurrence patterns.

Note: We use a sampled subset of 1000 reviews due to computational constraints, and run the lda package’s collapsed Gibbs sampler for 200 iterations.

library(lda)

num_topics <- 5

# Sample reviews
set.seed(123)
sample_size <- min(1000, nrow(clean_df))
sample_indices <- sample(seq_len(nrow(clean_df)), sample_size)
sampled_text <- clean_df$content[sample_indices]

# Remove NA / empty
sampled_text <- sampled_text[!is.na(sampled_text)]
sampled_text <- sampled_text[sampled_text != ""]

if (length(sampled_text) < 5) {
  cat("LDA cannot run because text data is empty after cleaning.\n")
} else {

  # Tokenize
  token_list <- strsplit(sampled_text, "\\s+")
  token_list <- lapply(token_list, function(x) x[x != ""])
  token_list <- token_list[sapply(token_list, length) > 0]

  # Build vocab
  vocab <- sort(unique(unlist(token_list)))
  vocab_index <- setNames(seq_along(vocab), vocab)

  # Convert to lda format: each doc is 2-row integer matrix (word_id, count)
  # IMPORTANT: lda needs word_id to start at 0 (NOT 1)
  documents <- lapply(token_list, function(tokens) {
    ids <- as.integer(vocab_index[tokens]) - 1  # convert to 0-based word IDs
    tab <- table(ids)
    word_ids <- as.integer(names(tab))
    counts <- as.integer(tab)
    rbind(word_ids, counts)
  })

  # Safety: remove empty docs
  documents <- documents[sapply(documents, function(x) ncol(x) > 0)]

  # Run Gibbs sampler
  lda_result <- lda.collapsed.gibbs.sampler(
    documents = documents,
    K = num_topics,
    vocab = vocab,
    num.iterations = 200,
    alpha = 0.1,
    eta = 0.1
  )

  # Top words
  top_words_per_topic <- top.topic.words(
    lda_result$topics,
    num.words = 10,
    by.score = TRUE
  )

 # Print output
  cat("LDA Topic Modeling Results\n")
  cat("==========================\n")
  cat("Documents analyzed:", length(documents), "\n")
  cat("Vocabulary size:", length(vocab), "\n")
  cat("Topics extracted:", num_topics, "\n\n")

  for (i in 1:num_topics) {
    cat("Topic", i, ":", paste(top_words_per_topic[, i], collapse = ", "), "\n")
  }

  lda_output <- list(
    num_docs = length(documents),
    vocab_size = length(vocab),
    num_topics = num_topics,
    top_words = top_words_per_topic
  )
}
LDA Topic Modeling Results
==========================
Documents analyzed: 1000 
Vocabulary size: 3624 
Topics extracted: 5 

Topic 1 : update, app, keeps, new, just, see, please, cart, something, fix 
Topic 2 : dont, like, want, good, really, make, hard, just, notifications, delivery 
Topic 3 : results, search, products, dark, filter, mode, product, find, looking, scrolling 
Topic 4 : prime, order, account, customer, money, service, will, dont, refund, card 
Topic 5 : love, great, amazon, shopping, best, always, easy, shipping, good, prime 

LDA Topic Modeling Results

The LDA analysis discovered 5 main topics in customer reviews from a sample of 1000 documents. Each topic is characterized by a set of frequently co-occurring words that define the theme.

Topic Interpretation:

  • Topic 1 (update, app, keeps, please, cart, fix): App updates and technical problems, with users asking for bugs to be fixed
  • Topic 2 (dont, like, want, hard, notifications, delivery): General usability frustrations, including notifications and delivery expectations
  • Topic 3 (results, search, products, dark, filter, mode): Search results, filtering, and interface features such as dark mode
  • Topic 4 (prime, order, account, customer, service, refund, card): Orders, accounts, payments, refunds, and customer service interactions
  • Topic 5 (love, great, shopping, best, easy, shipping): Positive overall shopping experiences and satisfaction with shipping

The automated topic discovery through LDA complements our manual categorization, providing data-driven insights into the dominant themes in Amazon app reviews. Both approaches converge on similar categories (app functionality, service quality, user experience), validating the robustness of our findings.

Model Persistence and Reproducibility

Saving Models and Predictions

To ensure reproducibility and enable future use of our trained models, we saved all models and predictions to disk. This is an important practice for data science projects as it allows others (or our future selves) to use the trained models without retraining.

# Create output directories
output_dir <- "output"
models_dir <- file.path(output_dir, "models")
predictions_dir <- file.path(output_dir, "predictions")
plots_dir <- file.path(output_dir, "plots")

if (!dir.exists(output_dir)) dir.create(output_dir, recursive = TRUE)
if (!dir.exists(models_dir)) dir.create(models_dir, recursive = TRUE)
if (!dir.exists(predictions_dir)) dir.create(predictions_dir, recursive = TRUE)
if (!dir.exists(plots_dir)) dir.create(plots_dir, recursive = TRUE)

# Save trained models
saveRDS(model_nb, file.path(models_dir, "sentiment_naive_bayes.rds"))
saveRDS(model_lm, file.path(models_dir, "regression_linear.rds"))
saveRDS(model_rf, file.path(models_dir, "regression_random_forest.rds"))

# Save predictions
sentiment_predictions <- data.frame(
  index = rownames(test_final),
  predicted = predictions_nb,
  actual = test_final$sentiment_label
)
write.csv(sentiment_predictions, file.path(predictions_dir, "sentiment_predictions.csv"),
          row.names = FALSE)

regression_predictions <- data.frame(
  actual = test_reg$thumbsUpCount,
  predicted_lm = predict(model_lm, test_reg),
  predicted_rf = predict(model_rf, test_reg)
)
write.csv(regression_predictions, file.path(predictions_dir, "regression_predictions.csv"),
          row.names = FALSE)

# Save the daily trend summary referenced in the output file list below
write.csv(daily_trends, file.path(predictions_dir, "daily_trends.csv"),
          row.names = FALSE)

Output Files Saved

Models saved to:

  • output/models/sentiment_naive_bayes.rds
  • output/models/regression_linear.rds
  • output/models/regression_random_forest.rds

Predictions saved to:

  • output/predictions/sentiment_predictions.csv
  • output/predictions/regression_predictions.csv
  • output/predictions/daily_trends.csv

Loading Saved Models

The following code demonstrates how to load and use the saved models for making predictions on new data.

# Example: Loading saved models for future use

# Load models
model_nb_loaded <- readRDS("output/models/sentiment_naive_bayes.rds")
model_lm_loaded <- readRDS("output/models/regression_linear.rds")
model_rf_loaded <- readRDS("output/models/regression_random_forest.rds")

# Make predictions on new data
# new_sentiment_pred <- predict(model_nb_loaded, new_data)
# new_helpfulness_pred_lm <- predict(model_lm_loaded, new_data)
# new_helpfulness_pred_rf <- predict(model_rf_loaded, new_data)
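
The commented calls above assume new_data is already in each model’s feature space. For the sentiment model, a new raw review would first need the same preprocessing and the training vocabulary; a minimal sketch follows, in which the example review text is invented for illustration and the training objects (dtm_train, train_final) are assumed to still be available or to have been saved alongside the model.

# Score a single new review with the loaded classifier (illustrative only)
new_review <- "app keeps crashing after the latest update, very frustrating"

new_corpus <- Corpus(VectorSource(new_review))
new_corpus <- tm_map(new_corpus, content_transformer(tolower))
new_corpus <- tm_map(new_corpus, removePunctuation)
new_corpus <- tm_map(new_corpus, removeNumbers)
new_corpus <- tm_map(new_corpus, removeWords, stopwords("en"))
new_corpus <- tm_map(new_corpus, stemDocument)
new_corpus <- tm_map(new_corpus, stripWhitespace)

# Project the new review onto the training vocabulary so columns match the model
new_dtm <- DocumentTermMatrix(new_corpus,
                              control = list(dictionary = Terms(dtm_train)))
new_df <- as.data.frame(as.matrix(new_dtm))
colnames(new_df) <- make.names(colnames(new_df))

# Keep only the features the classifier was trained on, filling absent terms with 0
feature_cols <- setdiff(colnames(train_final), "sentiment_label")
for (col in feature_cols) {
  if (!(col %in% colnames(new_df))) {
    new_df[[col]] <- 0
  }
}
new_df <- new_df[, feature_cols, drop = FALSE]

predict(model_nb_loaded, new_df)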

Discussion and Conclusions

Summary of Findings

Over the course of this project, our group successfully analyzed Amazon app reviews data to address our two research questions. The following summarizes the key findings from each analytical component:

Classification Model (Sentiment Prediction): The Naive Bayes classifier achieved acceptable performance in predicting sentiment from review text, with an accuracy of 79.7%. This demonstrates that the text content of customer reviews is informative for determining sentiment. The model performs reasonably well on both positive and negative reviews (sensitivity 87.2%, specificity 75.1%), making it suitable for real-world applications such as automated customer feedback monitoring. The probabilistic nature of the Naive Bayes algorithm also provides confidence scores that can be used to prioritize reviews requiring human review.

Regression Model (Helpfulness Prediction): Both linear regression and Random Forest models converged on similar conclusions regarding review helpfulness. The analysis revealed that review helpfulness is influenced primarily by two factors: the star rating and the review length. Higher-rated and longer reviews tend to receive more helpful votes from other users. However, the low R-squared values indicate that these factors alone explain only a small fraction of the variation in helpfulness. This finding suggests that other unmeasured factors, such as reviewer credibility, product type, or review timing, play important roles that our models could not capture with the available data.

Topic Analysis: The word frequency and topic analysis revealed three dominant themes in customer reviews: app functionality discussions (performance, features, technical aspects), delivery service concerns (shipping, customer support, order issues), and general user experience sentiments (satisfaction, frustration, recommendations). Positive reviews frequently contain terms like “great,” “love,” and “best,” while negative reviews commonly include words related to problems, complaints, and disappointments. The manual categorization of topic categories was validated by the automated LDA topic modeling, providing confidence in the reliability of these thematic findings.

Temporal Patterns: The time series analysis revealed daily fluctuations in average ratings and sentiment ratios, suggesting that customer satisfaction is not static but responds to various factors including app updates, seasonal events, and product launches. These temporal patterns highlight the value of continuous monitoring rather than point-in-time assessments of customer sentiment.

Limitations of the Study

While our analysis provides valuable insights, several limitations should be acknowledged:

  1. Feature Limitations: Our helpfulness prediction models were limited to a small number of features (score, review length, hour of day). Many potentially important factors such as reviewer history, product category, and review timing relative to product launch were not available in the dataset.

  2. Single Data Source: The analysis is based on reviews from a single point in time. Customer sentiment and review patterns may change over time as the app is updated.

  3. Language Limitations: Our text preprocessing focused on English text. Reviews in other languages or with extensive use of slang or emojis may not be properly analyzed.

  4. Binary Classification: For the sentiment classification, we simplified the problem to binary (positive/negative) by excluding neutral reviews. A more nuanced approach could use multi-class classification.

Recommendations for Future Work

Based on our findings, we suggest the following directions for future research:

  1. Feature Expansion: Incorporate additional features such as reviewer reputation scores, product categories, and temporal features (days since product launch) to improve helpfulness prediction.

  2. Advanced NLP: Apply more sophisticated natural language processing techniques such as word embeddings (Word2Vec, GloVe) or transformer models (BERT) to capture semantic meaning beyond simple word frequencies.

  3. Real-time Monitoring: Develop a dashboard or alerting system that uses our sentiment classification model to monitor customer satisfaction in real-time.

  4. Recommender Enhancement: Use the insights from topic analysis to improve the review recommender system, helping users find the most relevant and helpful reviews.

Project Reflections

This project provided our group with valuable hands-on experience in applying data science techniques to a real-world problem. We learned the importance of:

  • Thorough Data Exploration: Understanding the data before jumping into modeling is crucial for identifying quality issues and selecting appropriate methods.
  • Proper Train-Test Separation: Implementing proper data splitting and preventing data leakage ensures that our model evaluation metrics are trustworthy.
  • Reproducible Workflow: Documenting our process and saving models enables others to build upon our work.
  • Interpretable Results: While complex models can achieve high accuracy, understanding why predictions are made is often more valuable than the predictions themselves.

Concluding Remarks

Customer reviews represent a rich source of feedback that companies can leverage to improve their products and services. Through this project, we demonstrated that machine learning techniques can effectively extract insights from unstructured text data, enabling automated sentiment monitoring and helpfulness prediction. The methods and findings from this analysis can serve as a foundation for more sophisticated customer feedback analysis systems.

We believe that the skills and knowledge gained from completing this project have prepared us well for future data science work, where we will continue to apply these techniques to solve real-world problems.

References and Acknowledgments

Data Source

The Amazon reviews dataset used in this analysis was obtained from Kaggle’s Amazon Shopping Reviews Dataset (daily updated version).

R Packages Used

The following R packages were essential for this analysis:

  • tidyverse: Data manipulation and visualization
  • lubridate: Date and time handling
  • tm: Text mining and document-term matrix creation
  • SnowballC: Word stemming
  • caret: Machine learning and model evaluation
  • randomForest: Random Forest modeling
  • e1071: Naive Bayes implementation
  • tidytext: Text mining with tidy data principles
  • here: Robust file path management
  • lda: Latent Dirichlet Allocation for topic modeling
  • wordcloud and wordcloud2: Word cloud visualization (static and interactive, respectively)
  • htmlwidgets: Interactive HTML widgets
  • slam: Sparse Lightweight Arrays and Matrices (memory-efficient operations)

Group Contributions

This project was a collaborative effort by all four group members. Each member contributed to data cleaning, analysis, and documentation. Regular group meetings helped ensure consistent progress and quality throughout the project.