Assignment 1 (40 points)

A survey of 219 patients collected their gender, medical costs, and ratings of service quality in a hospital. Let’s explore the medical cost patterns.

1.Calculate and compare the range, IQR, variance, and standard deviation of Service attitude and Doctor’s professional level. Explain what the results tell you about the distribution of costs and the variability in service ratings. （10 points）

library(tidyverse)

## -- Attaching core tidyverse packages ------------------------ tidyverse 2.0.0 --
## v dplyr     1.1.4     v readr     2.1.5
## v forcats   1.0.0     v stringr   1.5.1
## v ggplot2   3.5.2     v tibble    3.3.0
## v lubridate 1.9.4     v tidyr     1.3.1
## v purrr     1.1.0     
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
## i Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

dt <- read_csv("hospital.csv")

## Rows: 219 Columns: 5
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## dbl (5): ID, Gender, Medicine costs, Service attitude, Doctor's professional...
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.

dt_summary <- dt %>%
  summarise(
    range = max(`Service attitude`, na.rm = TRUE) - min(`Service attitude`, na.rm = TRUE),
    IQR = IQR(`Service attitude`, na.rm = TRUE), 
    variance = var(`Service attitude`, na.rm = TRUE),
    standard_vairance = sd(`Service attitude`, na.rm = TRUE))
print(dt_summary)

## # A tibble: 1 x 4
##   range   IQR variance standard_vairance
##   <dbl> <dbl>    <dbl>             <dbl>
## 1     4     1    0.623             0.789

dt_summary2 <- dt %>%
  summarise(
    range = max(`Doctor's professional level`, na.rm = TRUE) - min(`Doctor's professional level`, na.rm = TRUE),
    IQR = IQR(`Doctor's professional level`, na.rm = TRUE), 
    variance = var(`Doctor's professional level`, na.rm = TRUE),
    standard_vairance = sd(`Doctor's professional level`, na.rm = TRUE))
print(dt_summary2)

## # A tibble: 1 x 4
##   range   IQR variance standard_vairance
##   <dbl> <dbl>    <dbl>             <dbl>
## 1     3     1    0.550             0.742

answer: The result shows that doctor’s professional level has a higher range than service attitude, means patients had a more siginificance difference in doctor’s professional level. Besides, compared the SD and IQR, both of them has same IQR. However, service attitude has a lower sd, which means the doctor’s professional level varies greater among different patients. What’s more, both of the two factor has the same IQR = 1, which means the 50% of the data is distributed within a range of 1 point. Most people’s ratings are very close.

2. Create a histogram of Medicine costs, using a bin width of 5 (i.e., each group covers a 5-unit interval). What is the shape of the distribution? (10 points)

dt <- read_csv("hospital.csv")

## Rows: 219 Columns: 5
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## dbl (5): ID, Gender, Medicine costs, Service attitude, Doctor's professional...
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.

library(ggplot2)

ggplot(dt, aes(x = `Medicine costs`)) +
  geom_histogram(binwidth = 5,
                 fill = "blue", color = "white")+
  labs(
    title = "Medicine Costs Distribution",
    x = "Medicine Costs ($)",
    y = "Frequency"
  ) +
  theme_minimal()

answer: The shape shows that most of patients pay on $20-$25, and least of patients cost %0-$5. The average medicine cost may lie on $20-$25.

3.Use R to calculate the average Medicine costs for each gender (0 = Male, 1 = Female). What is the difference between the two groups? （10 points）

dt <- read_csv("hospital.csv")

## Rows: 219 Columns: 5
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## dbl (5): ID, Gender, Medicine costs, Service attitude, Doctor's professional...
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.

dt2 <- dt %>%
  mutate(Gender = ifelse(Gender == 0, "Male", "Female"))

gender_means <- aggregate(`Medicine costs` ~ Gender, 
                         data = dt2, 
                         FUN = mean, 
                         na.rm = TRUE)

print(gender_means)

##   Gender Medicine costs
## 1 Female       25.92632
## 2   Male       25.43548

answer:

4. Use R to calculate the 25th, 50th, and 75th percentiles of Medicine costs. What does this tell you about the cost distribution?（10 points）

quantile(dt$`Medicine costs`, probs = c(0.25, 0.5, 0.75), na.rm = TRUE)

## 25% 50% 75% 
##  20  27  31

answer: The distribution of medicine costs is asymmetric. The distance from the 25th to the 50th percentile is 7, while the distance from the 50th to the 75th percentile is 4. This indicates a right-skewed cost distribution. It means most of the data is concentrated in the lower-cost range, with a few high-cost values pulling the overall distribution upward.

#Assignment 2(60 points)

###Introduction B Company is a leading online retailer specializing in Electronics, Furniture, and Clothing. You are provided with two datasets to analyze: ###SalesPerformance: Transaction records with columns: OrderID, CustomerID, Date, Month,Region, ProductCategory, Revenue, Profit, MarketingSpend ###CustomerFeedback: Customer reviews with columns: CustomerID, Rating (1-5), Feedback_ID,Feedback_Text, Response_Time_Days, Churn_Risk ###Based on the background above and the requirements below, write R code or provide analytical insights. When writing R code, please include appropriate comments.

##Section 1: Sales Performance Analysis (30 pts) 1a) Remove Duplicate Rows (5 pts) Task: Write R code to remove duplicate rows from the SalesPerformance dataset.Saves the cleaned data as SalesPerformance_clean.csv. Hint: You may use “distinct()” to remove duplicate rows.

library(tidyverse)
dt3 <- read_csv("SalesPerformance.csv")

## Rows: 53 Columns: 8
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (4): CustomerID, Date, Region, ProductCategory
## dbl (4): OrderID, Revenue, Profit, MarketingSpend
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.

SalesPerformance_clean <- dt3 %>%
  distinct()

1b) Handle Missing Values in “Revenue” (5 pts) Task: Impute missing values in the Revenue column of the SalesPerformance_clean.csv dataset using the median value. Saves the processed data as SalesPerformance_impute.csv.Explain why median imputation is appropriate here.

median(dt3$Revenue, na.rm = TRUE)

## [1] 1560.31

mean(dt3$Revenue, na.rm = TRUE)

## [1] 1562.163

SalesPerformance_impute <- dt3 %>% summarise(median(dt3$Revenue, na.rm = FALSE))

Explanation: By calculating median and mean, the two amount is relatively close. The median is not affected by extreme values and can better represent income.

1c) Detect & Treat Outliers in “Profit” (5 pts) Task: Write R code to identify outliers in the Profit column of the SalesPerformance_impute.csv dataset.Propose and implement a treatment method.Saves the processed data as SalesPerformance_outlier.csv. Justify your choice.

clean_outliers <- function(df, col_name) {
  Q1 <- quantile(df[[col_name]], probs = 0.25, na.rm = TRUE)
  Q3 <- quantile(df[[col_name]], probs = 0.75, na.rm = TRUE)
  IQR_value <- Q3 - Q1
  lower_bound <- Q1 - 1.5 * IQR_value
  upper_bound <- Q3 + 1.5 * IQR_value
  indices_outlier <- which(
    (!is.na(df[[col_name]])) & 
    (df[[col_name]] < lower_bound | df[[col_name]] > upper_bound)
  )
  df_cleaned <- df
  df_cleaned[[col_name]][df_cleaned[[col_name]] < lower_bound & !is.na(df_cleaned[[col_name]])] <- lower_bound
  df_cleaned[[col_name]][df_cleaned[[col_name]] > upper_bound & !is.na(df_cleaned[[col_name]])] <- upper_bound
  
  return(df_cleaned)
  quantile()
}

SalesPerformance_outlier <- clean_outliers(SalesPerformance_clean, "Profit")

Justification: I first use probs() function to find out 25th and 75th of the profit column so that I could get the IQR. The lower bound among data = Q1 - 1.5IQR, the higher bound = Q3 + 1.5IQR. Thus, I got the outliers in the profit column.

1d) Boxplot of Revenue by Region (5 pts) Task: Create a boxplot showing Revenue distribution across Region using the SalesPerformance_outlier.csv dataset. Interpret the plot.

boxplot(Revenue ~ Region, 
        data = SalesPerformance_outlier,
        main = "Revenue Distribution across Region",
        xlab = "Region", 
        ylab = "Revenue ($)",
        col = "red"
        )

Interpretation: Based on the boxplot, the West region demonstrates the strongest revenue performance with the highest median value, indicating it is the most consistently profitable market. In contrast, the South region shows the lowest median revenue, suggesting it may require strategic improvements. The variability in revenue also exist differences. The East region exhibits the widest interquartile range and contained a high-value outlier in the upper revenue range. The South region shows the most concentrated revenue distribution.

1e) Time-Series Plot of Monthly Sales (10 pts) Task: Aggregate sales data by month and plot the trend using the SalesPerformance_outlier.csv dataset. Please describe the seasonal patterns and suggest possible reasons.

SalesPerformance_outlier$Date <- as.Date(SalesPerformance_outlier$Date)

monthly_sales <- SalesPerformance_outlier %>%
  mutate(YearMonth = floor_date(Date, "month")) %>%
  group_by(YearMonth) %>%
  summarise(MonthlySales = sum(Revenue, na.rm = TRUE))

library(ggplot2)

ggplot(monthly_sales, aes(x = YearMonth, y = MonthlySales)) +
  geom_line(color = "pink") +
  geom_point(color = "blue") +
  labs(
    title = "Monthly Sales Trend",
    x = "Month",
    y = "Revenue",
  ) +
  theme_minimal(base_size = 12)

Answer: The image shows a rapid increase in winter sessions from $4000 to over $10,000. This may due to the cold weather in winter that people may more possible to get a sick and need to buy a medicine. On contrast, the summer day witnessed a quickly decreased in revenue from the peak of $10,000 to the bottom to $2500. This is because summer illnesses are usually milder and have a shorter course. For example, in the case of heatstroke, prevention and treatment mainly rely on physical cooling and hydration, with limited use of medication. It appears a moderate fluctuation during the summer and autumn, then increase again when the weather getting cold.

##Section 2: Customer Feedback Analysis (20 pts)

This code snippet is incomplete and only demonstrates the core processing logic. It assumes the Feedback dataset already exists. Please extend the code and answer the questions. When writing your own code, you need to fully specify the dataframe and include all necessary steps.

library(tidytext) # Load text analysis toolkit data(“stop_words”) # Load built-in stopwords (e.g., meaningless words like “the”, “and”) feedback_tokens <- Feedback %>% # Assume Feedback is a dataframe containing text unnest_tokens(word, Feedback_Text) %>% # Split Feedback_Text column into individual words anti_join(stop_words) %>% # Remove stopwords count(word, sort = TRUE) # Count word frequency and sort

2a) The provided code snippet is incomplete and demonstrates the core processing logic of performing text preprocessing and word frequency analysis on customer feedback data to identify the most commonly used words after removing meaningless stopwords. Modify the code to add two new columns to the CustomerFeedback dataset: most_frequent_word: The most frequently occurring word in each feedback (after stopword removal); word_count: The count of that most frequent word (10 pts).

Hint: You may use the following functions: slice(); left_join().

library(tidytext) # Load text analysis toolkit
data("stop_words") # Load built-in stopwords (e.g., meaningless words like "the", "and")
Feedback <- read_csv("CustomerFeedback.csv")

## Rows: 24 Columns: 8
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (4): CustomerID, Feedback_Text, Preferred_Contact, Churn_Risk
## dbl (4): Rating, Feedback_ID, Response_Time_Days, Sentiment
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.

feedback_tokens <- Feedback %>%
  mutate(Feedback_ID = row_number()) %>%
  unnest_tokens(word, Feedback_Text) %>%
  anti_join(stop_words) %>%
  count(Feedback_ID, word, sort = TRUE) %>%
  group_by(Feedback_ID) %>%
  slice(1) %>%
  ungroup()

## Joining with `by = join_by(word)`

Feedback <- Feedback %>%
  mutate(Feedback_ID = row_number()) %>%
  left_join(feedback_tokens, by = "Feedback_ID")

2b) If the most frequent word in the most_frequent_word column of a CustomerFeedback dataset is “okay,” can we conclude that the overall user sentiment is positive? Justify your answer.(10 pts).

Answer: No. Firstly, “Okay” is a neatrual word, it can be both positive and negative in expressing feelings. It do not mean a strongly positive attitude. Secondly, because we only get the most frequently word instead of full sentences or phrases, we may miss crucial context. Lastly, the presence of many “okay” ratings could indicate widespread mediocrity rather than genuine satisfaction.

2c) The column “Sentiment” is the result of a sentiment analysis on customers’ textual feedback, its values range from -1 [most negative] to +1 [most positive]). The dataset also includes a Rating column (e.g., 1–5 stars). Task(10 pts): Write R code to: 1. Calculate the correlation coefficient between the Sentiment (numeric values ranging from -1 to +1) and Rating (e.g., 1–5 stars) columns in the dataset. 2. Visualize the relationship between these two variables using an appropriate plot, and describe their relationship.

correlation <- cor(Feedback$Sentiment, Feedback$Rating, 
                   use = "complete.obs") 

ggplot(Feedback, aes(x = Sentiment, y = Rating)) +  
  geom_point(alpha = 1, color = "black") +
  geom_smooth(method = "lm", color = "blue", se = TRUE) +
  labs(
    title = "Correlation coefficient between the Sentiment and Rating",
    x = "Sentiment (numeric values ranging from -1 to +1)", 
    y = "Rating (e.g., 1–5 stars)"  
  ) +
  theme_minimal()

## `geom_smooth()` using formula = 'y ~ x'

## Warning: Removed 1 row containing non-finite outside the scale range
## (`stat_smooth()`).

## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).

Answer: There is a positive correlation between the Sentiment score and the Rating. When the Sentiment score increases, the customer Rating also systematically rises. It means the sentiment of the customer’s textual feedback is highly consistent with the numerical rating. The correlation between 2 factor is 0.71, this indicates that text sentiment analysis can effectively predict customers’ numerical ratings. Besides, here may exist several point that do not close to the regression line. This may because some customer has a negative sentiment but moderate rating, as well as has a postive sentiment but a low rating. But it would not affect the liner regression.