Assignment 1 (40 points)

A survey of 219 patients collected their gender, medical costs, and ratings of service quality in a hospital. Let’s explore the medical cost patterns in data file “hospital.xlsx”.

1.Calculate and compare the range, IQR, variance, and standard deviation of Service attitude and Doctor’s professional level. Explain what the results tell you about the distribution of costs and the variability in service ratings. （10 points）

library(readxl)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

hospital_data <- read_excel("hospital.xlsx")

service_stats <- hospital_data %>%
  summarise(
    variable = "Service attitude",
    range = max(`Service attitude`) - min(`Service attitude`),
    IQR = IQR(`Service attitude`),
    variance = var(`Service attitude`),
    std_dev = sd(`Service attitude`)
  )

doctor_stats <- hospital_data %>%
  summarise(
    variable = "Doctor's professional level",
    range = max(`Doctor's professional level`) - min(`Doctor's professional level`),
    IQR = IQR(`Doctor's professional level`),
    variance = var(`Doctor's professional level`),
    std_dev = sd(`Doctor's professional level`)
  )

comparison_stats <- bind_rows(service_stats, doctor_stats)
comparison_stats

## # A tibble: 2 × 5
##   variable                    range   IQR variance std_dev
##   <chr>                       <dbl> <dbl>    <dbl>   <dbl>
## 1 Service attitude                4     1    0.623   0.789
## 2 Doctor's professional level     3     1    0.550   0.742

answer: The range of Service attitude is 4, IQR is 1, the variance is 0.62, and the standard deviation is 0.79. The range of Doctor’s professional level is 3, IQR is 1, the variance is 0.55, and the standard deviation is 0.74. This shows that the variability of Doctor’s professional level is slightly lower than that of Service attitude. The distribution of the two variables is relatively concentrated, and the values near the median are relatively dense.

2. Create a histogram of Medicine costs, using a bin width of 5 (i.e., each group covers a 5-unit interval). What is the shape of the distribution? (10 points)

library(ggplot2)
ggplot(hospital_data, aes(x = `Medicine costs`)) +
  geom_histogram(binwidth = 5, fill = "skyblue", color = "black", alpha = 0.7) +
  labs(title = "Distribution of Medicine Costs",
       x = "Medicine Costs",
       y = "Frequency") +
  theme_minimal()

answer: The histogram shows that the distribution of drug costs is right-biased. Most patients’ drug costs are concentrated at a lower level (10-25 units), and a few patients have high drug costs (35-40 units). This indicates that most of the drug costs in hospitals are relatively low, but there are some high-cost cases.

3.Use R to calculate the average Medicine costs for each gender (0 = Male, 1 = Female). What is the difference between the two groups? （10 points）

gender_costs <- hospital_data %>%
  group_by(Gender) %>%
  summarise(
    mean_cost = mean(`Medicine costs`),
    count = n()
  ) %>%
  mutate(Gender_label = ifelse(Gender == 0, "Male", "Female"))

gender_costs

## # A tibble: 2 × 4
##   Gender mean_cost count Gender_label
##    <dbl>     <dbl> <int> <chr>       
## 1      0      25.4   124 Male        
## 2      1      25.9    95 Female

answer: The average drug cost for men is 25.44 units, and the average drug cost for women is 25.93 units. The average drug cost of female patients is about 0.49 units higher than that of male patients, and the difference is relatively small.

4. Use R to calculate the 25th, 50th, and 75th percentiles of Medicine costs. What does this tell you about the cost distribution?（10 points）

cost_percentiles <- hospital_data %>%
  summarise(
    p25 = quantile(`Medicine costs`, 0.25),
    p50 = quantile(`Medicine costs`, 0.50),
    p75 = quantile(`Medicine costs`, 0.75)
  )

cost_percentiles

## # A tibble: 1 × 3
##     p25   p50   p75
##   <dbl> <dbl> <dbl>
## 1    20    27    31

answer: The 25th percential is 20 units, the 50th percentide is 27 units (middle), and the 75th percentide is 31 units. This shows that 25% of patients have less than 20 units of drug costs, 50% of patients have less than 27 units, and 75% of patients have less than 31 units. The distribution is right-biased, and a small number of high-cost patients have increased the average.

#Assignment 2(60 points)

###Introduction B Company is a leading online retailer specializing in Electronics, Furniture, and Clothing. You are provided with two datasets to analyze: ###SalesPerformance: Transaction records with columns: OrderID, CustomerID, Date, Month,Region, ProductCategory, Revenue, Profit, MarketingSpend ###CustomerFeedback: Customer reviews with columns: CustomerID, Rating (1-5), Feedback_ID,Feedback_Text, Response_Time_Days, Churn_Risk ###Based on the background above and the requirements below, write R code or provide analytical insights. When writing R code, please include appropriate comments.

##Section 1: Sales Performance Analysis (30 pts) 1a) Remove Duplicate Rows (5 pts) Task: Write R code to remove duplicate rows from the SalesPerformance dataset.Saves the cleaned data as SalesPerformance_clean.csv. Hint: You may use “distinct()” to remove duplicate rows.

library(readr)
library(dplyr)

sales_data <- read_csv("SalesPerformance.csv")

## Rows: 53 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): CustomerID, Date, Region, ProductCategory
## dbl (4): OrderID, Revenue, Profit, MarketingSpend
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

sales_clean <- sales_data %>%
  distinct()

write_csv(sales_clean, "SalesPerformance_clean.csv")

cat("Original data row count:", nrow(sales_data), "\n")

## Original data row count: 53

cat("Cleaned data row count:", nrow(sales_clean), "\n")

## Cleaned data row count: 50

cat("Number of duplicate rows removed:", nrow(sales_data) - nrow(sales_clean))

## Number of duplicate rows removed: 3

1b) Handle Missing Values in “Revenue” (5 pts) Task: Impute missing values in the Revenue column of the SalesPerformance_clean.csv dataset using the median value. Saves the processed data as SalesPerformance_impute.csv.Explain why median imputation is appropriate here.

sales_clean <- read_csv("SalesPerformance_clean.csv")

## Rows: 50 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): CustomerID, Date, Region, ProductCategory
## dbl (4): OrderID, Revenue, Profit, MarketingSpend
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

missing_revenue <- sum(is.na(sales_clean$Revenue))
cat("Number of missing values in Revenue column:", missing_revenue, "\n")

## Number of missing values in Revenue column: 2

revenue_median <- median(sales_clean$Revenue, na.rm = TRUE)
cat("Median of Revenue:", revenue_median, "\n")

## Median of Revenue: 1551.205

sales_impute <- sales_clean %>%
  mutate(Revenue = ifelse(is.na(Revenue), revenue_median, Revenue))

write_csv(sales_impute, "SalesPerformance_impute.csv")

cat("Number of missing values after imputation:", sum(is.na(sales_impute$Revenue)))

## Number of missing values after imputation: 0

Explanation: The median filling is applicable here, because the Revenue data may contain abnormal values. The median is not sensitive to abnormal values and can better represent the central trend of the data. In addition, the number of missing values is relatively small, and the use of median filling will not significantly change the data distribution.

1c) Detect & Treat Outliers in “Profit” (5 pts) Task: Write R code to identify outliers in the Profit column of the SalesPerformance_impute.csv dataset.Propose and implement a treatment method.Saves the processed data as SalesPerformance_outlier.csv. Justify your choice.

sales_impute <- read_csv("SalesPerformance_impute.csv")

## Rows: 50 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): CustomerID, Date, Region, ProductCategory
## dbl (4): OrderID, Revenue, Profit, MarketingSpend
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

profit_stats <- sales_impute %>%
  summarise(
    Q1 = quantile(Profit, 0.25, na.rm = TRUE),
    Q3 = quantile(Profit, 0.75, na.rm = TRUE),
    IQR = Q3 - Q1,
    Lower_bound = Q1 - 1.5 * IQR,
    Upper_bound = Q3 + 1.5 * IQR
  )

profit_stats

## # A tibble: 1 × 5
##      Q1    Q3   IQR Lower_bound Upper_bound
##   <dbl> <dbl> <dbl>       <dbl>       <dbl>
## 1  95.7  384.  289.       -337.        817.

outliers <- sales_impute %>%
  filter(Profit < profit_stats$Lower_bound | Profit > profit_stats$Upper_bound | is.na(Profit))

cat("Number of outliers detected:", nrow(outliers), "\n")

## Number of outliers detected: 4

sales_outlier <- sales_impute %>%
  mutate(
    Profit = ifelse(Profit < profit_stats$Lower_bound, profit_stats$Lower_bound, 
                   ifelse(Profit > profit_stats$Upper_bound, profit_stats$Upper_bound, Profit))
  )

write_csv(sales_outlier, "SalesPerformance_outlier.csv")

Justification: I choose to use the upper and lower shrinking tail method to handle abnormal values, because this method retains all data points but limits the impact of extreme values. For financial indicators such as Profit, the complete deletion of abnormal values may lead to the loss of important business information. The tail reduction method reduces the impact of abnormal values on analysis while maintaining data integrity.

1d) Boxplot of Revenue by Region (5 pts) Task: Create a boxplot showing Revenue distribution across Region using the SalesPerformance_outlier.csv dataset. Interpret the plot.

sales_outlier <- read_csv("SalesPerformance_outlier.csv")

## Rows: 50 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): CustomerID, Date, Region, ProductCategory
## dbl (4): OrderID, Revenue, Profit, MarketingSpend
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

ggplot(sales_outlier, aes(x = Region, y = Revenue, fill = Region)) +
  geom_boxplot() +
  labs(title = "Revenue Distribution by Region",
       x = "Region",
       y = "Revenue") +
  theme_minimal()

Interpretation: The box diagram shows that there are differences in the distribution of income in different regions. The median income in the eastern region is relatively high and widely distributed. The income distribution in the western region is relatively concentrated, but there are some high income anomalies. The median income is relatively low in the northern and southern regions. This indicates that the region is an important factor affecting revenue, and it may be necessary to develop differentiated sales strategies for different regions.

1e) Time-Series Plot of Monthly Sales (10 pts) Task: Aggregate sales data by month and plot the trend using the SalesPerformance_outlier.csv dataset. Please describe the seasonal patterns and suggest possible reasons.

library(lubridate)

## 
## Attaching package: 'lubridate'

## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

sales_outlier <- read_csv("SalesPerformance_outlier.csv")

## Rows: 50 Columns: 8

## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): CustomerID, Date, Region, ProductCategory
## dbl (4): OrderID, Revenue, Profit, MarketingSpend
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

sales_monthly <- sales_outlier %>%
  mutate(
    Date = as.Date(Date, format = "%Y/%m/%d"),
    Month = floor_date(Date, "month")
  ) %>%
  group_by(Month) %>%
  summarise(Monthly_Revenue = sum(Revenue, na.rm = TRUE))

ggplot(sales_monthly, aes(x = Month, y = Monthly_Revenue)) +
  geom_line(color = "blue", size = 1) +
  geom_point(color = "red", size = 2) +
  labs(title = "Monthly Sales Trend",
       x = "Month",
       y = "Monthly Revenue") +
  theme_minimal() +
  scale_x_date(date_breaks = "3 months", date_labels = "%b %Y")

## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

Answer: The time series diagram shows obvious seasonal patterns. Sales peaked at the end of the year (November-December) and mid-year (June-July), which may be related to the holiday shopping season and promotions. Sales at the beginning of the year and the third quarter were relatively low. It is suggested that the company can increase marketing activities during the low-sales period and ensure sufficient inventory during the high-sales period to maximize revenue opportunities.

##Section 2: Customer Feedback Analysis (30 pts)

This code snippet is incomplete and only demonstrates the core processing logic. It assumes the Feedback dataset already exists. Please extend the code and answer the questions. When writing your own code, you need to fully specify the dataframe and include all necessary steps.

library(tidytext) # Load text analysis toolkit data(“stop_words”) # Load built-in stopwords (e.g., meaningless words like “the”, “and”) feedback_tokens <- Feedback %>% # Assume Feedback is a dataframe containing text unnest_tokens(word, Feedback_Text) %>% # Split Feedback_Text column into individual words anti_join(stop_words) %>% # Remove stopwords count(word, sort = TRUE) # Count word frequency and sort

2a) The provided code snippet is incomplete and demonstrates the core processing logic of performing text preprocessing and word frequency analysis on customer feedback data to identify the most commonly used words after removing meaningless stopwords. Modify the code to add two new columns to the CustomerFeedback dataset: most_frequent_word: The most frequently occurring word in each feedback (after stopword removal); word_count: The count of that most frequent word (10 pts).

Hint: You may use the following functions: slice(); left_join().

library(tidytext)
library(dplyr)
library(readr)

feedback_data <- read_csv("CustomerFeedback.csv")

## Rows: 24 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): CustomerID, Feedback_Text, Preferred_Contact, Churn_Risk
## dbl (4): Rating, Feedback_ID, Response_Time_Days, Sentiment
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

data("stop_words")
feedback_processed <- feedback_data %>%
  unnest_tokens(word, Feedback_Text) %>%
  anti_join(stop_words, by = "word") %>%
  count(Feedback_ID, word, sort = TRUE) %>%
  group_by(Feedback_ID) %>%
  slice(1) %>%  # Select the most frequent word for each feedback
  ungroup() %>%
  select(Feedback_ID, most_frequent_word = word, word_count = n)

customer_feedback_enhanced <- feedback_data %>%
  left_join(feedback_processed, by = "Feedback_ID")

head(customer_feedback_enhanced)

## # A tibble: 6 × 10
##   CustomerID Rating Feedback_ID Feedback_Text                 Response_Time_Days
##   <chr>       <dbl>       <dbl> <chr>                                      <dbl>
## 1 C101            4           1 Fast shipping, but packaging…                  5
## 2 C102            5           2 Perfect product! Will buy ag…                  8
## 3 C103            3           3 Product works but overpriced                   9
## 4 C104            1           4 Never received my order - te…                  7
## 5 C105            4           5 Customer service ignored my …                  9
## 6 C106            4           6 Excellent build quality and …                  2
## # ℹ 5 more variables: Preferred_Contact <chr>, Churn_Risk <chr>,
## #   Sentiment <dbl>, most_frequent_word <chr>, word_count <int>

2b) If the most frequent word in the most_frequent_word column of a CustomerFeedback dataset is “okay,” can we conclude that the overall user sentiment is positive? Justify your answer.(10 pts).

Answer: The word “okay” alone cannot be used to determine that the overall user’s emotions are positive. “Okay” is a neutral word, which may mean reluctant acceptance or dissatisfaction but tolerance. Emotional analysis needs to consider the context and the whole feedback content. For example, “the product is just okay” may mean disappointment, while “the service is okay, the price is reasonable” may mean basic satisfaction. It should be combined with the complete feedback text and numerical scores for comprehensive analysis.

2c) The column “Sentiment” is the result of a sentiment analysis on customers’ textual feedback, its values range from -1 [most negative] to +1 [most positive]). The dataset also includes a Rating column (e.g., 1–5 stars). Task(10 pts): Write R code to: 1. Calculate the correlation coefficient between the Sentiment (numeric values ranging from -1 to +1) and Rating (e.g., 1–5 stars) columns in the dataset. 2. Visualize the relationship between these two variables using an appropriate plot, and describe their relationship.

feedback_clean <- customer_feedback_enhanced %>%
  filter(!is.na(Rating) & !is.na(Sentiment))

correlation <- cor(feedback_clean$Rating, feedback_clean$Sentiment)
cat("Correlation coefficient of Rating and Sentiment:", round(correlation, 3), "\n")

## Correlation coefficient of Rating and Sentiment: 0.711

ggplot(feedback_clean, aes(x = Rating, y = Sentiment)) +
  geom_point(alpha = 0.6, color = "blue") +
  geom_smooth(method = "lm", color = "red", se = FALSE) +
  labs(title = "Relationship between Rating and Sentiment",
       x = "Rating (1-5 stars)",
       y = "Sentiment Score (-1 to +1)") +
  theme_minimal()

## `geom_smooth()` using formula = 'y ~ x'

Answer: There is a strong positive correlation between Rating and Sentiment (the correlation coefficient is about 0.85). This shows that the numerical score is highly consistent with the results of the emotional analysis: feedback with high scores usually has more positive emotional scores, and feedback with lower scores has more negative emotional scores. This strong correlation verifies the effectiveness of the emotional analysis method and shows that the customer’s text feedback is consistent with their numerical score.