Assignment 1 (40 points)

A survey of 219 patients collected their gender, medical costs, and ratings of service quality in a hospital. Let’s explore the medical cost patterns in data file “hospital.xlsx”.

1.Calculate and compare the range, IQR, variance, and standard deviation of Service attitude and Doctor’s professional level. Explain what the results tell you about their distributions. （10 points）

# Read Excel file
hospital_data <- read_excel("hospital.xlsx")

# Select columns for analysis
cols <- c("Service attitude", "Doctor's professional level")

# Calculate Range
range_values <- sapply(hospital_data[cols], function(x) range(x))

# Calculate IQR
iqr_values <- sapply(hospital_data[cols], function(x) IQR(x))

# Calculate Variance
var_values <- sapply(hospital_data[cols], function(x) var(x))

# Calculate Standard Deviation
sd_values <- sapply(hospital_data[cols], function(x) sd(x))

# Combine results into one data frame
summary_stats <- data.frame(
  Statistic = c("Range", "IQR", "Variance", "Standard Deviation"),
  `Service attitude` = c(range_values[1], iqr_values[1], var_values[1], sd_values[1]),
  `Doctor's professional level` = c(range_values[2], iqr_values[2], var_values[2], sd_values[2])
)

print(summary_stats)

##            Statistic Service.attitude Doctor.s.professional.level
## 1              Range        3.0000000                   7.0000000
## 2                IQR        1.0000000                   1.0000000
## 3           Variance        0.6230992                   0.5499560
## 4 Standard Deviation        0.7893663                   0.7415902

answer:

Range

Doctor’s professional level (7) has a much wider range than service attitude (3), indicating greater variability in patients’ evaluations of doctors’ professionalism. In contrast, service attitude ratings are more closely clustered.

IQR

Both variables share the same IQR of 1, showing that the middle 50% of their scores are similarly distributed.

Variance and Standard Deviation

Service attitude has slightly higher variance (0.623) and standard deviation (0.789) compared to Doctor’s professional level (0.550 and 0.742). This implies that patients’ opinions on service attitude are somewhat more dispersed, while ratings for doctor professionalism are relatively consistent.

Overall conclusion: Although a few extreme values exist in Doctor’s professional level ratings, the overall distribution remains more concentrated. Most patients provide similar evaluations of doctors’ professionalism, whereas their perceptions of service attitude vary more widely.

2. Create a histogram of Medicine costs, using a bin width of 5 (i.e., each group covers a 5-unit interval). What is the shape of the distribution? (10 points)

# Calculate mean and median
mean_cost <- mean(hospital_data$`Medicine costs`)
median_cost <- median(hospital_data$`Medicine costs`)

# Create histogram with mean and median lines
ggplot(hospital_data, aes(x = `Medicine costs`)) +
  geom_histogram(binwidth = 5, color = "black", fill = "steelblue") +
  geom_vline(aes(xintercept = mean_cost, color = "Mean"), linetype = "dashed") +
  geom_vline(aes(xintercept = median_cost, color = "Median"), linetype = "dashed") +
  scale_color_manual(
    name = "Statistics Lines",
    values = c("Mean" = "red", "Median" = "orange")
  ) +
  labs(
    title = "Histogram of Medicine Costs with Mean and Median",
    x = "Medicine Costs",
    y = "Frequency"
  )

answer:

The distribution of Medicine Costs is slightly left-skewed. Most patients incur relatively higher medicine expenses (mainly within the 20–35 range), while a small number of lower-cost cases extend the tail toward the left

3.Use R to calculate the average Medicine costs for each gender (0 = Male, 1 = Female). What is the difference between the two groups? （10 points）

# Calculate the average Medicine costs by gender
avg_costs <- aggregate(`Medicine costs` ~ Gender, data = hospital_data, FUN = mean)

# Print results
print(avg_costs)

##   Gender Medicine costs
## 1      0       25.43548
## 2      1       25.92632

# Calculate the difference between the two groups
difference <- diff(avg_costs$`Medicine costs`)
cat(paste("Difference (Female - Male):", round(difference, 5)))

## Difference (Female - Male): 0.49083

answer:

The average medicine cost for male patients is approximately 25.44, whereas for female patients it is about 25.93. The difference between the two groups is only around 0.49, indicating that female patients spend slightly more on medication on average. Nonetheless, this difference is minimal, suggesting that gender has little influence on medicine costs overall.

4. Use R to calculate the 25th, 50th, and 75th percentiles of Medicine costs. What does this tell you about the cost distribution?（10 points）

# Calculate percentiles (25th, 50th, 75th)
percentiles <- quantile(hospital_data$`Medicine costs`, probs = c(0.25, 0.5, 0.75))

# Print results
print(percentiles)

## 25% 50% 75% 
##  20  27  31

answer:

25% of patients have costs ≤ 20, 50% of patients have costs ≤ 27,

75% of patients have costs ≤ 31, Only 25% of patients have costs above 31.

Most patients (75%) incur relatively predictable costs (≤31).

The hospital can confidently budget for the majority of cases falling in the 20-31 range.

Therefore, the hospital can budget confidently for the majority of cases, while directing cost-control strategies toward the upper 25% of patients with higher expenditures.

#Assignment 2(60 points)

###Introduction B Company is a leading online retailer specializing in Electronics, Furniture, and Clothing. You are provided with two datasets to analyze: ###SalesPerformance: Transaction records with columns: OrderID, CustomerID, Date, Month,Region, ProductCategory, Revenue, Profit, MarketingSpend ###CustomerFeedback: Customer reviews with columns: CustomerID, Rating (1-5), Feedback_ID,Feedback_Text, Response_Time_Days, Churn_Risk ###Based on the background above and the requirements below, write R code or provide analytical insights. When writing R code, please include appropriate comments.

##Section 1: Sales Performance Analysis (30 pts) 1a) Remove Duplicate Rows (5 pts) Task: Write R code to remove duplicate rows from the SalesPerformance dataset.Saves the cleaned data as SalesPerformance_clean.csv. Hint: You may use “distinct()” to remove duplicate rows.

# Read the dataset
SalesPerformance <- read.csv("SalesPerformance.csv")

# Check the original dimensions and duplicate count
cat("Original dataset dimensions:", dim(SalesPerformance), "\n")

## Original dataset dimensions: 53 8

cat("Number of duplicate rows:", sum(duplicated(SalesPerformance)), "\n")

## Number of duplicate rows: 3

# Remove duplicate rows using distinct() function
SalesPerformance_clean <- SalesPerformance %>%
  distinct()

# Verify the cleaning results
cat("\nCleaned dataset dimensions:", dim(SalesPerformance_clean), "\n")

## 
## Cleaned dataset dimensions: 50 8

cat("Number of rows removed:", nrow(SalesPerformance) - nrow(SalesPerformance_clean), "\n")

## Number of rows removed: 3

# Save the cleaned data as CSV file
write.csv(SalesPerformance_clean, "SalesPerformance_clean.csv", row.names = FALSE)

1b) Handle Missing Values in “Revenue” (5 pts) Task: Impute missing values in the Revenue column of the SalesPerformance_clean.csv dataset using the median value. Saves the processed data as SalesPerformance_impute.csv.Explain why median imputation is appropriate here.

# Read the cleaned dataset
SalesPerformance_clean <- read.csv("SalesPerformance_clean.csv")

# Check for missing values
cat("Missing values in Revenue column:", sum(is.na(SalesPerformance_clean$Revenue)), "\n")

## Missing values in Revenue column: 2

# Check the distribution of Revenue
summary(SalesPerformance_clean$Revenue)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   404.9  1002.3  1551.2  1570.2  1869.3  5000.0       2

# Calculate median Revenue
revenue_median <- median(SalesPerformance_clean$Revenue, na.rm = TRUE)
cat("Median Revenue for imputation:", revenue_median, "\n")

## Median Revenue for imputation: 1551.205

# Impute missing values with median
SalesPerformance_impute <- SalesPerformance_clean %>%
  mutate(Revenue = ifelse(is.na(Revenue), revenue_median, Revenue))

# Verify that missing values have been handled
cat("Missing values after imputation:", sum(is.na(SalesPerformance_impute$Revenue)), "\n")

## Missing values after imputation: 0

# Save the processed data
write.csv(SalesPerformance_impute, "SalesPerformance_impute.csv", row.names = FALSE)


# Display summary statistics to compare before and after imputation
print(summary(SalesPerformance_clean$Revenue))

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   404.9  1002.3  1551.2  1570.2  1869.3  5000.0       2

print(summary(SalesPerformance_impute$Revenue))

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   404.9  1024.0  1551.2  1569.5  1854.2  5000.0

Explanation: 1. Revenue data frequently contains extreme values, and the median remains robust in the presence of outliers.

Median imputation preserves the overall distribution trend without significantly shifting the data’s central tendency.

1c) Detect & Treat Outliers in “Profit” (5 pts) Task: Write R code to identify outliers in the Profit column of the SalesPerformance_impute.csv dataset.Propose and implement a treatment method.Saves the processed data as SalesPerformance_outlier.csv. Justify your choice.

# Read the dataset after imputation
SalesPerformance_impute <- read.csv("SalesPerformance_impute.csv")

# Check basic statistics of Profit
summary(SalesPerformance_impute$Profit)

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.      NA's 
##   -100.00     95.72    238.06   3887.93    384.40 167000.37         2

# Compute Q1 and Q3
Q1 <- quantile(SalesPerformance_impute$Profit, 0.25, na.rm = TRUE)
Q3 <- quantile(SalesPerformance_impute$Profit, 0.75, na.rm = TRUE)
IQR_value <- Q3 - Q1

# Define lower and upper bounds for outliers
lower_bound <- Q1 - 1.5 * IQR_value
upper_bound <- Q3 + 1.5 * IQR_value

cat("Lower bound:", lower_bound, "\n")

## Lower bound: -337.2975

cat("Upper bound:", upper_bound, "\n")

## Upper bound: 817.4225

# Identify outliers
outliers <- SalesPerformance_impute %>%
  filter(Profit < lower_bound | Profit > upper_bound)

print(outliers$Profit)

## [1] 167000.4   9347.0

# Replace extreme values with boundary values
SalesPerformance_outlier <- SalesPerformance_impute %>%
  mutate(Profit = ifelse(Profit < lower_bound, lower_bound,
                         ifelse(Profit > upper_bound, upper_bound, Profit)))

summary(SalesPerformance_outlier$Profit)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
## -100.00   95.72  238.06  248.09  384.40  817.42       2

# Save the processed dataset
write.csv(SalesPerformance_outlier, "SalesPerformance_outlier.csv", row.names = FALSE)

# visualize before vs after
plot_before <- ggplot(SalesPerformance_impute[is.finite(SalesPerformance_impute$Profit), ], 
                     aes(x = "", y = Profit)) +
  geom_boxplot(aes(fill = "Before"), 
               outlier.colour = "red", outlier.shape = 16, outlier.size = 2) +
  scale_fill_manual(values = c("Before" = "orange")) +
  labs(title = "Before Outlier Treatment",
       x = "",
       y = "Profit") +
  theme_minimal() +
  theme(legend.position = "none",
        plot.title = element_text(hjust = 0.5, face = "bold"))

plot_after <- ggplot(SalesPerformance_outlier[is.finite(SalesPerformance_outlier$Profit), ], 
                    aes(x = "", y = Profit)) +
  geom_boxplot(aes(fill = "After"), 
               outlier.colour = "red", outlier.shape = 16, outlier.size = 2) +
  scale_fill_manual(values = c("After" = "steelblue")) +
  labs(title = "After Outlier Treatment",
       x = "",
       y = "Profit") +
  theme_minimal() +
  theme(legend.position = "none",
        plot.title = element_text(hjust = 0.5, face = "bold"))

plot_before + plot_after

Justification: The IQR method was applied to identify outliers.

Two observations exceeded the upper bound of 817.42 and were addressed using Winsorization, where values beyond the threshold were replaced with 817.42. This technique retains all data points while reducing the impact of extreme values.

Following the adjustment, the mean profit dropped from 3,887.93 to 248.09, becoming more aligned with the median (238.06). As a result, the dataset is cleaner and more representative of the overall profit distribution for further analysis.

1d) Boxplot of Revenue by Region (5 pts) Task: Create a boxplot showing Revenue distribution across Region using the SalesPerformance_outlier.csv dataset. Interpret the plot.

# Read the dataset
SalesPerformance_outlier <- read.csv("SalesPerformance_outlier.csv")

# Create boxplot of Revenue by Region
ggplot(SalesPerformance_outlier, aes(x = Region, y = Revenue, fill = Region)) +
  geom_boxplot(outlier.colour = "red") +
  labs(title = "Revenue Distribution by Region",
       x = "Region",
       y = "Revenue")

Interpretation:

The boxplot indicates that the West region has the highest median revenue along with the greatest variability, suggesting strong yet fluctuating sales performance. In contrast, the South region shows the lowest median revenue and the smallest spread, reflecting more stable but lower sales levels. The East and North regions exhibit moderate revenue patterns, although the East includes a notable high outlier near 5,000.

1e) Time-Series Plot of Monthly Sales (10 pts) Task: Aggregate sales data by month and plot the trend using the SalesPerformance_outlier.csv dataset. Please describe the seasonal patterns and suggest possible reasons.

# Read data
SalesPerformance_outlier <- read.csv("SalesPerformance_outlier.csv")

# Get month data
SalesPerformance_outlier <- SalesPerformance_outlier %>%
  mutate(Month = month(Date, label = TRUE, abbr = TRUE))

# Aggregate sales by Month
monthly_sales <- SalesPerformance_outlier %>%
  group_by(Month) %>%
  summarise(Total_Revenue = sum(Revenue, na.rm = TRUE)) %>%
  arrange(match(Month, month.abb))

# View aggregated data
print(monthly_sales)

## # A tibble: 12 × 2
##    Month Total_Revenue
##    <ord>         <dbl>
##  1 1月           1101 
##  2 2月           3183.
##  3 3月           4521.
##  4 4月           7792.
##  5 5月           5199.
##  6 6月           4437.
##  7 7月           5942.
##  8 8月           4284.
##  9 9月           4403.
## 10 10月          9344.
## 11 11月         10443.
## 12 12月         17826.

# Plot monthly sales trend
ggplot(monthly_sales, aes(x = Month, y = Total_Revenue, group = 1)) +
  geom_line(color = "steelblue", linewidth = 1.2) +
  geom_point(size = 2, color = "steelblue") +
  labs(title = "Monthly Sales Trend",
       x = "Month",
       y = "Total Revenue") +
  theme_minimal()

Answer:

Seasonal patterns:

Revenue shows a steady increase from January to April. Sales then remain relatively stable from May to September, followed by a sharp rise beginning in October. Notably, November and December together contribute roughly 35% of the annual revenue, with December reaching the highest level of the year.

Possible Reasons:

The year-end surge is likely driven by major shopping events such as Black Friday and the Christmas season, which substantially boost consumer spending. After the holidays, a decline in demand is observed due to spending fatigue, followed by gradual month-to-month recovery. The sales peak in April may be associated with consumer purchases funded by tax refunds, while a smaller rise in July could relate to Amazon Prime Day promotions.

##Section 2: Customer Feedback Analysis (30 pts)

This code snippet is incomplete and only demonstrates the core processing logic. It assumes the Feedback dataset already exists. Please extend the code and answer the questions. When writing your own code, you need to fully specify the dataframe and include all necessary steps.

library(tidytext) # Load text analysis toolkit data(“stop_words”) # Load built-in stopwords (e.g., meaningless words like “the”, “and”) feedback_tokens <- Feedback %>% # Assume Feedback is a dataframe containing text unnest_tokens(word, Feedback_Text) %>% # Split Feedback_Text column into individual words anti_join(stop_words) %>% # Remove stopwords count(word, sort = TRUE) # Count word frequency and sort

2a) The provided code snippet is incomplete and demonstrates the core processing logic of performing text preprocessing and word frequency analysis on customer feedback data to identify the most commonly used words after removing meaningless stopwords. Modify the code to add two new columns to the CustomerFeedback dataset: most_frequent_word: The most frequently occurring word in each feedback (after stopword removal); word_count: The count of that most frequent word (10 pts).

Hint: You may use the following functions: slice(); left_join().

library(tidytext) # Load text analysis toolkit

## Warning: package 'tidytext' was built under R version 4.5.2

# Read dataset
Feedback <- read.csv("CustomerFeedback.csv")

data("stop_words") # Load built-in stopwords (e.g., meaningless words like "the", "and")

feedback_tokens <- Feedback %>% # Assume Feedback is a dataframe containing text
  unnest_tokens(word, Feedback_Text) %>% # Split Feedback_Text column into individual words
  anti_join(stop_words) %>%  # Remove stopwords
  count(Feedback_ID, word, sort = TRUE)   # Count word frequency and sort

## Joining with `by = join_by(word)`

# Find the most frequent word for each feedback
most_frequent_words <- feedback_tokens %>%
  group_by(Feedback_ID) %>%
  slice(which.max(n)) %>%
  ungroup() %>%
  select(Feedback_ID, most_frequent_word = word, word_count = n)

# Add the new columns to the original CustomerFeedback dataset
Feedback_with_freq <- Feedback %>%
  left_join(most_frequent_words, by = "Feedback_ID")

# Display the result
print(Feedback_with_freq)

##    CustomerID Rating Feedback_ID
## 1        C101      4           1
## 2        C102      5           2
## 3        C103      3           3
## 4        C104      1           4
## 5        C105      4           5
## 6        C106      4           6
## 7        C107      5           7
## 8        C108      3           8
## 9        C109      4           9
## 10       C110      4          10
## 11       C111      2          11
## 12       C112      4          12
## 13       C113      4          13
## 14       C114      1          14
## 15       C115      2          15
## 16       C121      4          16
## 17       C122      2          17
## 18       C125      5          18
## 19       C128      4          19
## 20       C130     NA          20
## 21       C133      3          21
## 22       C135      5          22
## 23       C136      1          23
## 24       C137      3          24
##                                      Feedback_Text Response_Time_Days
## 1         Fast shipping, but packaging was damaged                  5
## 2                  Perfect product! Will buy again                  8
## 3                     Product works but overpriced                  9
## 4       Never received my order - terrible service                  7
## 5           Customer service ignored my complaints                  9
## 6        Excellent build quality and fast delivery                  2
## 7                             Good value for money                  6
## 8  Wrong item shipped and no response to my emails                  5
## 9                     Highly recommend this seller                  9
## 10             Average experience, nothing special                  2
## 11                No resolution after 3 complaints                  7
## 12             Exactly as described in the listing                  7
## 13               Best purchase I've made this year                  1
## 14                       Delivery was 2 weeks late                  5
## 15        It gets the job done but could be better                  1
## 16        Satisfied overall, but shipping was slow                  6
## 17    Product broke after 2 days - garbage quality                  4
## 18                    Exceeded all my expectations                  6
## 19                    Not bad but not great either                  6
## 20  Received a defective item and had to return it                  2
## 21    Customer support team was incredibly helpful                  2
## 22   Minor issues but seller resolved them quickly                  3
## 23       Still waiting for my refund after 3 weeks                  8
## 24  Decent product but not worth the premium price                  4
##    Preferred_Contact Churn_Risk Sentiment most_frequent_word word_count
## 1              Phone     Medium     -0.11            damaged          1
## 2              Email       High      0.98                buy          1
## 3              Phone        Low      0.12         overpriced          1
## 4              Phone       High     -0.82           received          1
## 5              Email       High     -0.33         complaints          1
## 6              Email       High      0.91              build          1
## 7              Phone       High      0.67              money          1
## 8              Phone     Medium     -0.74             emails          1
## 9              Phone        Low      0.85             highly          1
## 10             Phone     Medium     -0.04            average          1
## 11             Email     Medium     -0.59                  3          1
## 12             Phone     Medium      0.46            listing          1
## 13             Email        Low      0.88           purchase          1
## 14             Phone       High     -0.27                  2          1
## 15             Email     Medium      0.33                job          1
## 16             Phone     Medium      0.39          satisfied          1
## 17             Email       High     -0.47                  2          1
## 18             Phone     Medium      0.79           exceeded          1
## 19             Phone     Medium     -0.07                bad          1
## 20             Email        Low     -0.60          defective          1
## 21             Email     Medium      0.59           customer          1
## 22             Phone        Low      0.37             issues          1
## 23             Phone       High     -0.71                  3          1
## 24             Phone     Medium     -0.13             decent          1

2b) If the most frequent word in the most_frequent_word column of a CustomerFeedback dataset is “okay,” can we conclude that the overall user sentiment is positive? Justify your answer.(10 pts).

Answer:

No, we cannot conclude that the overall user sentiment is positive solely because the most frequent word is “okay.”

Justification:

The word “okay” is neutral to slightly positive, but it often signifies mediocrity, indifference, or mild satisfaction rather than strong positivity.

It can even be used in a negative context (e.g., “The delivery was just okay, I expected better.”).

And a single word does not capture the full context of the feedback. The overall sentiment of a sentence or review depends on the combination of words and their structure.

2c) The column “Sentiment” is the result of a sentiment analysis on customers’ textual feedback, its values range from -1 [most negative] to +1 [most positive]). The dataset also includes a Rating column (e.g., 1–5 stars). Task(10 pts): Write R code to: 1. Calculate the correlation coefficient between the Sentiment (numeric values ranging from -1 to +1) and Rating (e.g., 1–5 stars) columns in the dataset. 2. Visualize the relationship between these two variables using an appropriate plot, and describe their relationship.

# Read the dataset
customer_feedback <- read.csv("CustomerFeedback.csv")

# Data clean 
customer_feedback[!is.finite(customer_feedback$Sentiment) | !is.finite(customer_feedback$Rating), ]

##    CustomerID Rating Feedback_ID                                  Feedback_Text
## 20       C130     NA          20 Received a defective item and had to return it
##    Response_Time_Days Preferred_Contact Churn_Risk Sentiment
## 20                  2             Email        Low      -0.6

customer_feedback_clean <- na.omit(customer_feedback[, c("Sentiment", "Rating")])

# Calculate correlation coefficient
correlation <- cor(customer_feedback_clean$Sentiment, customer_feedback_clean$Rating, use = "complete.obs", method = "pearson")

# Print result
cat("Correlation between Sentiment and Rating:", correlation, "\n")

## Correlation between Sentiment and Rating: 0.7109679

# Scatter plot with trend line
ggplot(customer_feedback_clean, aes(x = Sentiment, y = Rating)) +
  geom_point(color = "steelblue", alpha = 0.6) +
  geom_smooth(method = "lm", color = "red", se = TRUE) +
  labs(
    title = "Relationship between Sentiment Score and Customer Rating",
    x = "Sentiment Score (-1 to +1)",
    y = "Customer Rating (1–5 stars)"
  ) +
  theme_minimal()

## `geom_smooth()` using formula = 'y ~ x'

Answer:

The correlation coefficient (r = 0.7109679) indicates a strong positive linear relationship between sentiment and rating.

This suggests that when customers express more positive sentiment in their feedback, they are more likely to provide higher ratings. Conversely, more negative sentiment typically corresponds to lower ratings.

BA Individual Assignment

2230006040 Gong Jingjing

Assignment 1 (40 points)

1.Calculate and compare the range, IQR, variance, and standard deviation of Service attitude and Doctor’s professional level. Explain what the results tell you about their distributions. （10 points）

2. Create a histogram of Medicine costs, using a bin width of 5 (i.e., each group covers a 5-unit interval). What is the shape of the distribution? (10 points)

3.Use R to calculate the average Medicine costs for each gender (0 = Male, 1 = Female). What is the difference between the two groups? （10 points）

4. Use R to calculate the 25th, 50th, and 75th percentiles of Medicine costs. What does this tell you about the cost distribution?（10 points）

This code snippet is incomplete and only demonstrates the core processing logic. It assumes the Feedback dataset already exists. Please extend the code and answer the questions. When writing your own code, you need to fully specify the dataframe and include all necessary steps.