Spring2025-HW1

knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(dplyr)
library(lubridate)
library(ggplot2) #2a
library(zoo) #2b

## Warning: package 'zoo' was built under R version 4.3.3

## 
## Attaching package: 'zoo'
## 
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric

library(xts) #2b

## 
## ######################### Warning from 'xts' package ##########################
## #                                                                             #
## # The dplyr lag() function breaks how base R's lag() function is supposed to  #
## # work, which breaks lag(my_xts). Calls to lag(my_xts) that you type or       #
## # source() into this session won't work correctly.                            #
## #                                                                             #
## # Use stats::lag() to make sure you're not using dplyr::lag(), or you can add #
## # conflictRules('dplyr', exclude = 'lag') to your .Rprofile to stop           #
## # dplyr from breaking base R's lag() function.                                #
## #                                                                             #
## # Code in packages is not affected. It's protected by R's namespace mechanism #
## # Set `options(xts.warn_dplyr_breaks_lag = FALSE)` to suppress this warning.  #
## #                                                                             #
## ###############################################################################
## 
## Attaching package: 'xts'
## 
## The following objects are masked from 'package:dplyr':
## 
##     first, last

library(car) #5

## Loading required package: carData
## 
## Attaching package: 'car'
## 
## The following object is masked from 'package:dplyr':
## 
##     recode
## 
## The following object is masked from 'package:purrr':
## 
##     some

R Markdown

Sales and Customer Interaction Dataset

In this homework, you have been given a dataset that captures comprehensive details of 1,000 sales transactions made in 2024, focusing on various aspects of customer interactions and sales dynamics. It includes a mix of numeric and categorical variables, allowing for in-depth analysis.

Key features of the dataset include:

Customer_ID: Unique identifier for each customer.
Product_Category: Category of the purchased product (e.g., Electronics, Clothing, Home Appliances, etc.).
Purchase_Amount: Total amount spent on each purchase, in dollars.
Purchase_Date: Date of the transaction
Customer_Age: Age of the customer at the time of purchase.
Customer_Gender: Gender of the customer, Male or Female
Store_Location: Location of the store where the purchase occurred.
Satisfaction_Score (response variable) : Numeric response variable indicating the customer’s satisfaction level

Note:

For the first homework assignment, we have provided the pre-processing steps for the variables. Kindly ensure you understand and can modify these steps as needed. In future homework and exams, students will be expected to handle data pre-processing independently, and no code will be provided. The only aspect that should remain unchanged is the seed setting to 100.

set.seed(100) #this seed has been set to 100 to ensure results are reproducible. DO NOT CHANGE THIS SEED

Sales_dataset = read.csv("large_sales_dataset.csv", header=TRUE) #reads data
Sales_dataset$Product_Category=as.factor(Sales_dataset$Product_Category)
Sales_dataset$Purchase_Date=as.Date(Sales_dataset$Purchase_Date, format="%m/%d/%Y")
Sales_dataset$Customer_Gender=as.factor(Sales_dataset$Customer_Gender)
Sales_dataset$Store_Location=as.factor(Sales_dataset$Store_Location)

head(Sales_dataset)

##   Customer_ID Product_Category Purchase_Amount Purchase_Date Customer_Age
## 1         101            Books         1091.97    2024-07-24           62
## 2         102        Furniture          869.33    2024-05-04           62
## 3         102            Books         1592.04    2024-12-29           62
## 4         102      Electronics         1600.56    2024-12-07           62
## 5         105             Toys         1030.42    2024-02-20           35
## 6         105            Books          383.44    2024-11-25           35
##   Customer_Gender Store_Location Satisfaction_Score
## 1          Female        Chicago           7.281961
## 2          Female       New York           8.124978
## 3          Female         Boston           8.229209
## 4          Female         Boston           8.882945
## 5          Female        Atlanta           6.811557
## 6          Female      San Diego           5.779122

#Dividing the dataset into training and testing datasets
testRows_sales = sample(nrow(Sales_dataset),0.2*nrow(Sales_dataset))
testData_sales = Sales_dataset[testRows_sales, ]
trainData_sales = Sales_dataset[-testRows_sales, ]
row.names(trainData_sales) <- NULL
head(trainData_sales)

##   Customer_ID Product_Category Purchase_Amount Purchase_Date Customer_Age
## 1         102        Furniture          869.33    2024-05-04           62
## 2         102            Books         1592.04    2024-12-29           62
## 3         102      Electronics         1600.56    2024-12-07           62
## 4         105             Toys         1030.42    2024-02-20           35
## 5         105            Books          383.44    2024-11-25           35
## 6         109      Electronics         1632.42    2024-04-12           27
##   Customer_Gender Store_Location Satisfaction_Score
## 1          Female       New York           8.124978
## 2          Female         Boston           8.229209
## 3          Female         Boston           8.882945
## 4          Female        Atlanta           6.811557
## 5          Female      San Diego           5.779122
## 6            Male         Boston          10.000000

Question 1: Data Analysis (8 points)

For this question, use the “Sales_dataset”

1a)(4 points) Output a table that has both the average and median Purchase_Amount grouped by Product_Category, Customer_Gender and Customer_Age. Show the last 15 rows of the table

Note: For age group, use the following bins

‘0-18’, ‘19-25’, ‘26-35’, ‘36-45’, ‘46-55’, ‘56-65’, ‘66-75’, ‘76-85’, ‘86-100’

#library("dplyr")

copy_Sales_dataset <- Sales_dataset %>%
  mutate(Age_Group = cut(Customer_Age, 
                         breaks = c(0, 18, 25, 35, 45, 55, 65, 75, 85, 100), 
                         labels = c("0-18", "19-25", "26-35", "36-45", "46-55","56-65", "66-75", "76-85", "86-100")))

#head(ageCut_Sales_dataset)

q1a_table <- copy_Sales_dataset %>%
  group_by(Product_Category, Customer_Gender, Age_Group) %>%
  summarise(mean_Purchase_Amount = mean(Purchase_Amount), media_Purchase_Amount = median(Purchase_Amount))

## `summarise()` has grouped output by 'Product_Category', 'Customer_Gender'. You
## can override using the `.groups` argument.

tail(q1a_table, 15)

## # A tibble: 15 × 5
## # Groups:   Product_Category, Customer_Gender [3]
##    Product_Category Customer_Gender Age_Group mean_Purchase_Amount
##    <fct>            <fct>           <fct>                    <dbl>
##  1 Home Appliances  Male            46-55                    1372.
##  2 Home Appliances  Male            56-65                    1109.
##  3 Home Appliances  Male            66-75                     895.
##  4 Toys             Female          19-25                    1013.
##  5 Toys             Female          26-35                    1027.
##  6 Toys             Female          36-45                     586.
##  7 Toys             Female          46-55                    1053.
##  8 Toys             Female          56-65                     948.
##  9 Toys             Female          66-75                     929.
## 10 Toys             Male            0-18                     1115.
## 11 Toys             Male            19-25                    1139.
## 12 Toys             Male            26-35                    1155.
## 13 Toys             Male            36-45                     932.
## 14 Toys             Male            46-55                     778.
## 15 Toys             Male            56-65                    1008.
## # ℹ 1 more variable: media_Purchase_Amount <dbl>

Q1b)(2 points) Which customer age group has the highest total purchase amount across all categories?

totalPurchaseAge <- copy_Sales_dataset %>%
  group_by(Age_Group) %>%
  summarise(total_Purchase_Age = sum(Purchase_Amount))

totalPurchaseAge

## # A tibble: 7 × 2
##   Age_Group total_Purchase_Age
##   <fct>                  <dbl>
## 1 0-18                  28611.
## 2 19-25                205891.
## 3 26-35                279965.
## 4 36-45                186527.
## 5 46-55                147993.
## 6 56-65                104559.
## 7 66-75                 39057.

q1btable <- totalPurchaseAge %>%
  filter(total_Purchase_Age == max(total_Purchase_Age))

q1btable

## # A tibble: 1 × 2
##   Age_Group total_Purchase_Age
##   <fct>                  <dbl>
## 1 26-35                279965.

Response to Q1b

The age group of 26-35 has the highest total purchase amount.

Q1c)(2 points) Which product category has the highest average satisfaction score, and in which location is it most frequently purchased?

meanAvgSat <- copy_Sales_dataset %>%
  group_by(Product_Category) %>%
  summarise(mean_Satisfaction_Score = mean(Satisfaction_Score))

meanAvgSat

## # A tibble: 6 × 2
##   Product_Category mean_Satisfaction_Score
##   <fct>                              <dbl>
## 1 Books                               7.07
## 2 Clothing                            7.02
## 3 Electronics                         8.88
## 4 Furniture                           7.95
## 5 Home Appliances                     7.96
## 6 Toys                                6.02

q1c1table <- meanAvgSat %>%
  filter(mean_Satisfaction_Score == max(mean_Satisfaction_Score))

q1c1table$Product_Category

## [1] Electronics
## Levels: Books Clothing Electronics Furniture Home Appliances Toys

Response to Q1C1

Electronics is the Product Group that has the highest average satisfaction score.

q1c2table <- copy_Sales_dataset %>%
  filter(Product_Category == q1c1table$Product_Category) %>%
  count(Store_Location) %>%
  filter(n == max(n))

q1c2table$Store_Location

## [1] Boston
## 13 Levels: Atlanta Boston Chicago Dallas Denver Houston ... Seattle

Response to Q1C2

Boston is the most frequently-purchased store location in the Product Group - Electronics.

Question 2: Time Series Analysis (8 points)

For this question, use “Sales_dataset”

Q2a) (3 points) Calculate the monthly average sales per product category and plot them.

#rmb to convert with xts object

q2atable <- copy_Sales_dataset %>%
  group_by(Product_Category, Month = floor_date(Purchase_Date, unit = "month")) %>%
  summarize(monthly_avg_sales = mean(Purchase_Amount))

## `summarise()` has grouped output by 'Product_Category'. You can override using
## the `.groups` argument.

q2atable

## # A tibble: 72 × 3
## # Groups:   Product_Category [6]
##    Product_Category Month      monthly_avg_sales
##    <fct>            <date>                 <dbl>
##  1 Books            2024-01-01              997.
##  2 Books            2024-02-01             1002.
##  3 Books            2024-03-01             1067.
##  4 Books            2024-04-01              934.
##  5 Books            2024-05-01             1252.
##  6 Books            2024-06-01              965.
##  7 Books            2024-07-01              995.
##  8 Books            2024-08-01             1316.
##  9 Books            2024-09-01             1055.
## 10 Books            2024-10-01             1253.
## # ℹ 62 more rows

q2aplot <- ggplot(q2atable, aes(x = q2atable$Month, y = q2atable$monthly_avg_sales, color = q2atable$Product_Category)) + geom_line() + geom_point() + labs(title = "Monthly Average Sales per Product Category", x = "Month",y = "Avg Sales",color = "Product Category")

q2aplot

Q2b)(5 points) Calculate the rolling means of average monthly sales (calculated in 2a) using a 3 month window. Use the center alignment to calculate the rolling means.

Now again plot the monthly average sales of each product category along with the rolling means.

i) What difference do you observe? How are the rolling means beneficial as compared to simple averages?

q2btable <- q2atable %>%
  group_by(Product_Category) %>%
  mutate(rolling_mean_3m = rollmean(monthly_avg_sales, 3, align = "center", fill = NA))
  
q2btable

## # A tibble: 72 × 4
## # Groups:   Product_Category [6]
##    Product_Category Month      monthly_avg_sales rolling_mean_3m
##    <fct>            <date>                 <dbl>           <dbl>
##  1 Books            2024-01-01              997.             NA 
##  2 Books            2024-02-01             1002.           1022.
##  3 Books            2024-03-01             1067.           1001.
##  4 Books            2024-04-01              934.           1084.
##  5 Books            2024-05-01             1252.           1050.
##  6 Books            2024-06-01              965.           1071.
##  7 Books            2024-07-01              995.           1092.
##  8 Books            2024-08-01             1316.           1122.
##  9 Books            2024-09-01             1055.           1208.
## 10 Books            2024-10-01             1253.           1081.
## # ℹ 62 more rows

q2bplot <- ggplot(q2btable, aes(x = Month)) + geom_line(aes(y = monthly_avg_sales, color = "Avg Sales",)) + geom_line(aes(y = rolling_mean_3m, color = "Rolling Mean Sales (3-month)")) + facet_wrap(~ Product_Category) + labs(title = "Avg Sales and Rolling Mean Sales per Product Category", x = "Month", y = "Avg Sales", color="Avg Sales / Rolling Sales")

q2bplot

## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_line()`).

Response to Q2bi

The plot of Rolling Means 3 month window is more smooth, less steep and less fluctuate comparing to the plot of Monthly Avg Sales. Using rolling Mean for plot can help reduce noise and identify pattern such as seasonal pattern.

Question 3: Data Exploration (8 points)

For this question, use trainData_sales

Q3a)(4 points) Create a boxplot of the response variable versus the following predicting variables

i) Store_Location

ii) Product_Category

Explain the relationship between the response and the two variables based on the boxplots.

q3aiplot <- ggplot(trainData_sales, aes(Store_Location, Satisfaction_Score, color = Store_Location)) + geom_boxplot() +labs(title = "Satisfaction Score vs Store Location", x = "Satisfaction Score", y = "Store Location")

q3aiplot

q3aiiplot <- ggplot(trainData_sales, aes(Product_Category, Satisfaction_Score, color = Product_Category)) + geom_boxplot() +labs(title = "Satisfaction Score vs Product Category", x = "Satisfaction Score", y = "Product Category")

q3aiiplot

Response to Q3

The box plot of Satisfaction Score vs Store Location varies across different locations, for example locations in LA and Chicago have wider interquartile range in store satisfaction comparing to locations in Atlanta and Seattle with narrower interquartile range indicating consistent score among customers. There are outliers in Boston and Denver that may affect the overall result with lowering score.

The box plot of Satisfaction Score vs Product Category also varies across different categories, for example Electronics has the highest median as ~9 comparing with the lowest median score ~6 in Toys while the rest range from 7-8. The outliers in Toys in both quartile-ends may indicate insight to concerns.

Though the box plot for both predicting variables vary but do not tell there are significant relations since they overlapping.

Q3b)(4 points) Create scatterplots of the response variable against the following predictors:

i) Customer_Age

ii) Purchase_Amount

Describe the general trend of each plot.

Output the R^2 for each plot. Use the following R^2 cut-offs while explaining if it is a weak, moderate, or strong relationship.

R^2<=0.3 (weak)

0.3<R^2<0.7 (moderate)

R^2>=0.7 (strong)

#show regression line for 3bi
q3bi <- lm(Satisfaction_Score~ Customer_Age, data = trainData_sales)

#R^2 for 3bi
cat("R^2 for 3bi:", summary(q3bi)$r.squared)

## R^2 for 3bi: 0.002581523

q3biplot <- plot(trainData_sales$Customer_Age, trainData_sales$Satisfaction_Score, main = "Scartterplot of Satisfaction vs Age", xlab= "Age", ylab = "Satisfaction Score") + abline(q3bi, col = "red")

#show regression line for 3bii
q3bii <- lm(Satisfaction_Score~ Purchase_Amount, data = trainData_sales)

#R^2 for 3bii
cat("R^2 for 3bii:", summary(q3bii)$r.squared)

## R^2 for 3bii: 0.146777

q3biiplot <- plot(trainData_sales$Purchase_Amount, trainData_sales$Satisfaction_Score, main = "Scartterplot of Satisfaction vs Sales", xlab= "Sales", ylab = "Satisfaction Score") + abline(q3bii, col = "red")

Response to Q3b

The scatter plot of q3bi show close to zero relationship between Age and Satisfaction Score while the plot of 3bii shows positive relationship between Sales and Satisfaction Score. Overall, the relationship for both are weak as the small R^2 value.

Question 4: Simple linear regression and ANOVA (12 points)

For this question, use trainData_sales

Q4a) (5 points) Create a simple linear regression model using the predictor “Purchase_Amount”. Call it model1. Display the summary.

model1 <- q3bii

summary(model1)

## 
## Call:
## lm(formula = Satisfaction_Score ~ Purchase_Amount, data = trainData_sales)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.9529 -0.8960 -0.0057  0.9767  3.3668 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     6.5431520  0.0937456   69.80   <2e-16 ***
## Purchase_Amount 0.0009585  0.0000818   11.72   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.327 on 798 degrees of freedom
## Multiple R-squared:  0.1468, Adjusted R-squared:  0.1457 
## F-statistic: 137.3 on 1 and 798 DF,  p-value: < 2.2e-16

i) What are the model parameters and their estimates?

The model parameters are the 1.) intercept and its estimates ~6.54315, 2.) B1 (Purchase_Amount) its estimates ~0.00095, 3.) the variance of error term ~(1.327^2)

ii) Interpret the coefficient of the predictor “Purchase_Amount” in the context of the problem.

For an increase of a dollar in the Purchase Amount, there will be an increase of ~0.00095 in the Satisfaction Score.

iii) Find a 95% confidence interval for the coefficient of “Purchase_Amount”. Is the coefficient significant at this level?

confint(model1, level = 0.95)

##                        2.5 %      97.5 %
## (Intercept)     6.3591349416 6.727169038
## Purchase_Amount 0.0007978727 0.001119022

response to q4aiii

The 95% confidence interval is (0.000797 to 0.001119) and is significant at this level.

Q4b)(3 points) Is the coefficient of Purchase_Amount statistically significant?

i) State the null hypothesis

H0 : B1(Purchase_Amount) = 0

ii) State the alternative hypothesis

H1 : B1(Purchase_Amount) != 0

iii) Which test is used for testing the significance of coefficient?

Comparing t-value with critical point or

comparing p-value with alpha level /(significant level) or

just use the p-value from R

Q4c)( 4 points) Perform an ANOVA F-test on the means of Product_Category.

#4ci
q4ctable <- aov(Satisfaction_Score ~ Product_Category, data = trainData_sales)
summary(q4ctable)

##                   Df Sum Sq Mean Sq F value Pr(>F)    
## Product_Category   5  674.9  134.97   110.2 <2e-16 ***
## Residuals        794  972.5    1.22                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

i) State the null and alternative hypotheses.

Null hypotheses, H0 : All means are equal

Alternative hypotheses, H1: Atleast 2 means are statistically significant.

ii) Using an α-level of 0.01, do we reject the null hypothesis that the means are equal? Explain your conclusion.

The p-value of the F-test for Product_Category is <2e-16 which is small than the α-level of 0.01. thus rejecting H0 (equal means) concludes that at least 2 means for the Satisfaction_Score across Product_Category are statistically significant.

iii) Which means are plausibly similar at the confidence level of 99%?

TukeyHSD(q4ctable, conf.level = 0.99)

##   Tukey multiple comparisons of means
##     99% family-wise confidence level
## 
## Fit: aov(formula = Satisfaction_Score ~ Product_Category, data = trainData_sales)
## 
## $Product_Category
##                                    diff        lwr        upr     p adj
## Clothing-Books               0.03700886 -0.4171450  0.4911627 0.9997875
## Electronics-Books            1.79972660  1.3547529  2.2447003 0.0000000
## Furniture-Books              0.89568210  0.4220455  1.3693187 0.0000000
## Home Appliances-Books        0.91781089  0.4704596  1.3651622 0.0000000
## Toys-Books                  -1.06748120 -1.5173015 -0.6176609 0.0000000
## Electronics-Clothing         1.76271774  1.3101126  2.2153229 0.0000000
## Furniture-Clothing           0.85867325  0.3778599  1.3394866 0.0000000
## Home Appliances-Clothing     0.88080203  0.4258592  1.3357449 0.0000000
## Toys-Clothing               -1.10449006 -1.5618610 -0.6471192 0.0000000
## Furniture-Electronics       -0.90404450 -1.3761963 -0.4318927 0.0000000
## Home Appliances-Electronics -0.88191572 -1.3276947 -0.4361367 0.0000000
## Toys-Electronics            -2.86720780 -3.3154645 -2.4189511 0.0000000
## Home Appliances-Furniture    0.02212878 -0.4522644  0.4965220 0.9999866
## Toys-Furniture              -1.96316330 -2.4398855 -1.4864411 0.0000000
## Toys-Home Appliances        -1.98529208 -2.4359090 -1.5346752 0.0000000

response for Q4ciii

The mean of Satisfaction Score in Clothing vs Books and Home Appliances vs Furniture are plausibly similar at the confidence level of 99% as the CI including 0 and p-value is large.

iv) Compare the satisfaction score of the following pair of product categories:

Electronics-Clothing

CI and p-value of Electronics-Clothing

The p-value is close to 0 indicates both the Satisfaction Score is statistically significant different. The CI comparing Electronics vs Clothing have only positive value (1.3101126 to 2.2153229) indicates that the mean of Satisfaction Score in Electronics is significantly higher.

Question 5: Model Diagnostics (14 points)

Q5a)(4 points) Perform the following model diagnostics on model1 created in Q4a.

Note: Both a histogram and a normal QQ plot with a pointwise confidence envelope must be plotted (tip: qqPlot() from the car package can generate a pointwise confidence envelope.

Explain your conclusion

i) Check for linearity assumption

Response vs Predictor variable plot evaluate linearity assumption.

#q3bii <- lm(Satisfaction_Score~ Purchase_Amount, data = trainData_sales)
#plot(model1)

ggplot(data = trainData_sales, aes(x = Purchase_Amount, y = Satisfaction_Score)) + 
  geom_point(col = "blue") + labs(x = "Purchase", y = "Satisfaction Score", title = "Linearity Plot") + geom_smooth(method = "lm", color = "red")

## `geom_smooth()` using formula = 'y ~ x'

#alternative way
#plot(trainData_sales$Purchase_Amount, trainData_sales$Satisfaction_Score, xlab = "Purchase", ylab = "Score", main = "Linearity Plot", pty = 2, lwd = 2) + abline(model1, col = "red")

ii) Check for constant variance

The residual vs fitted plot evaluates constant variance and uncorrelated.

ggplot(data = trainData_sales, aes(x = model1$fitted.values, y = model1$residuals)) + 
  geom_point(col = "blue") + labs(x = "Fitted Values", y = "Residuals", title = "Residuals vs Fitted Plot") + geom_hline(yintercept = 0 )

#alternative way 
#plot(model1$fitted, model1$residuals, xlab="Fitted values", ylab="Residuals", main = "Residuals vs Fitted plot", pty=2, lwd=2) + abline(h = 0, col = "black")

iii) Check for normality

QQplot and histogram evaluate normality.

hist(model1$residuals, xlab = "Residuals", main = "")

qqPlot(model1$residuals, ylab = "Residuals")

## [1] 594 180

Response to Q5a

The Linearity Plot displays the linearity trend that holds the linear assumption.

The residuals scattered around the 0 line indicate that both the constant variance assumption and the assumption of uncorrelated errors both hold.

There are departures in the end-tails of the qqplot may indicate violation of normality, however the Histogram of residuals displays symmetric. The normality assumption may be violated.

Q5b)(1 point) Based on your conclusion in Q5a, would you propose any transformation of the predictor or response variable? Explain with reasoning.

Because the normality does not hold, it is better to transform the variable using function that do not consider normality.

Q5c)(2 points) Suppose you have fit a simple linear regression model using Sales as the response variable and Advertising as the predictor. You are interested in understanding the difference between confidence intervals and prediction intervals.

i) Briefly explain the difference between a confidence interval for the mean response and a prediction interval for a new observation at a given value of the predictor.

The prediction interval is used to provide an interval estimate for a prediction of y, Sales for one member of the population with a particular value of x, Advertising; the confidence interval is used to provide an interval estimate for the true average value of y, Sales for all members of the population with a particular value of x, Advertising.

ii) Why is the prediction interval wider than the confidence interval?

The prediction intervals have additional variance from the variation of a new measurement, accounting for both the uncertainty in the mean and the variability of individual observations around the mean.

iii) Suppose that for an Advertising value of 100, the fitted model returns the following:

Interpret each interval in the context of the problem.

95% Confidence Interval: (10.2, 11.4)

The 95% Confidence Interval [10.2, 11.4] represents the range within which the true mean Sales value is expected to lie with 95% confidence, given an Advertising value of 100.

95% Prediction Interval: (7.5, 14.1)

The 95% Prediction Interval [7.5, 14.1] represents the range within which a future individual Sales value is expected to lie with 95% confidence, given an Advertising value of 100.

iv) If the residual plot shows evidence of non-constant variance, how might that impact the reliability of the prediction interval? How might a log transformation of the response variable help address this issue?

Non-constant variance may result in intervals that are too narrow or too wide fro predictor values that underestimate uncertainty where variance is high. Log transformation can stabilize variance by the log scale resulting in more adjusted prediction interval.

Q5d)(2 points) Create a linear regression model, named model2, that uses the log transformed Satisfaction_Score as the response, and the log transformed Purchase_Amount as the predictor. Display the summary.

model2 <- lm(log(Satisfaction_Score)~ log(Purchase_Amount), data = trainData_sales)
summary(model2)

## 
## Call:
## lm(formula = log(Satisfaction_Score) ~ log(Purchase_Amount), 
##     data = trainData_sales)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.86480 -0.12028  0.01694  0.13832  0.47907 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          1.485931   0.046705   31.82   <2e-16 ***
## log(Purchase_Amount) 0.077005   0.007001   11.00   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1906 on 798 degrees of freedom
## Multiple R-squared:  0.1316, Adjusted R-squared:  0.1306 
## F-statistic:   121 on 1 and 798 DF,  p-value: < 2.2e-16

Q5e)(2 points) Compare the R-squared values of model1 and model2. Did the transformation improve the explanatory power of the model?

Response to Q5e

The transformation reduce the explanatory power of the model as the R-squared values decrease.

Q5f) (3 points) Perform the same model diagnostics on model2 as you did on model1 in Q5a. Assess and interpret all model assumptions. Based on your interpretation of the model assumptions, is model2 a better fit than model1? A model is considered a good fit only if all the assumptions hold.

#linearity

ggplot(data = trainData_sales, aes(x = log(trainData_sales$Purchase_Amount), y = log(trainData_sales$Satisfaction_Score))) + geom_point(col = "blue") + labs(x = "Purchase", y = "Satisfaction Score", title = "Linearity Plot") + geom_smooth(method = "lm", color = "red")

## Warning: Use of `trainData_sales$Purchase_Amount` is discouraged.
## ℹ Use `Purchase_Amount` instead.

## Warning: Use of `trainData_sales$Satisfaction_Score` is discouraged.
## ℹ Use `Satisfaction_Score` instead.

## Warning: Use of `trainData_sales$Purchase_Amount` is discouraged.
## ℹ Use `Purchase_Amount` instead.

## Warning: Use of `trainData_sales$Satisfaction_Score` is discouraged.
## ℹ Use `Satisfaction_Score` instead.

## `geom_smooth()` using formula = 'y ~ x'

#constant variance and uncorrelated

ggplot(data = trainData_sales, aes(x = model2$fitted.values, y = model2$residuals)) + geom_point(col = "blue") + labs(x = "Fitted Values", y = "Residuals", title = "Residuals vs Fitted Plot") + geom_hline(yintercept = 0)

#normality 

hist(model2$residuals, xlab = "Residuals", main = "")

qqPlot(model2$residuals, ylab = "Residuals")

## [1] 594 353

Response to Q5f

Linearity and constant variance assumption are violated as the point on the linearity plot does not show linear pattern and for the residual plot, the variability of the residuals changes from left (1.6) to right (2). The variability is much smaller for the values in the left. The qqplot suggests only normality assumption is hold as the deviation in the end tails is minor.

Question 6: Prediction (5 points)

For this question, use testData_sales

Q6a)(3 points) Using testData_sales, predict the satisfaction score using model1.

Show the first 10 predictions along with their true values.

predModel1 <- predict(model1, newdata = testData_sales)

q6ai_PredTable <- data.frame(Satisfaction_Score = testData_sales$Satisfaction_Score, predModel1)

head(q6ai_PredTable, 10)

##     Satisfaction_Score predModel1
## 714           9.401932   8.031362
## 503           5.557678   6.550302
## 358           8.634280   7.832033
## 624           8.394701   7.974056
## 985           9.168275   7.275348
## 718           4.708268   6.963268
## 919           9.076011   8.280098
## 470           8.757487   8.411673
## 966           7.671931   7.794002
## 516          10.000000   7.300258

Calculate the mean squared prediction error.

#answer for 6aii - MSE for prediction
mean((predModel1 - q6ai_PredTable$Satisfaction_Score)^2)

## [1] 1.785631

Calculate average and standard deviation of the predictions of the satisfaction score using model1.

#answer for 6aiii - Avg of the predictions of the satisfaction score
mean(predModel1)

## [1] 7.496399

#answer for 6aiii - Standard deviation of the predictions of the satisfaction score
sd(predModel1)

## [1] 0.5212317

Q6b)(2 points) Using the first row of testData_sales, predict the satisfaction score using model1. What is the 99% prediction interval (PI)? Provide an interpretation of your results.

predict(model1, newdata = testData_sales[1,], interval = "prediction", level = 0.99)

##          fit      lwr      upr
## 714 8.031362 4.600441 11.46228

Response to Q6b

The prediction interval for the satisfaction score is [4.600441, 11.46228] with a predicted value of 8.031362. This means there is 99% confident that the true satisfaction score for this observation lie within this interval.

Question 7: Research Question (2 points)

(2 points) Research and explain the arguments that go into the predict function in detail. Please include object, newdata, interval, type, se.fit, and level.

Response to Q7

Predict() function is to predict new value based on a model.

Object is the fitted model that created and used to against the newdata. It passes function for example, lm(), aov(), glm().

newdata is the dataframe containing all predictor variables for prediction.

interval is the type of interval the user wants to use for example, “confidence”, “prediction” as well as reflecting uncertainty.

type is the type of prediction the user wants to return for example, “response”, “term”.

se.fit to return whether to compute standard error, default as “FALSE”.

level is to return the confident interval, default as “0.95”.

Question 8: Changing the Baseline (3 points)

(3 points) In the pre-processing of the data we did:

Sales_dataset$Product_Category = as.factor(Sales_dataset$Product_Category) ,

explain ways using both code and descriptions that you can tell what the baseline is and how you can change the baseline for Product_Category.

#assign factor
newProduct_Category = as.factor(Sales_dataset$Product_Category)

#checking the baseline 
levels(newProduct_Category)

## [1] "Books"           "Clothing"        "Electronics"     "Furniture"      
## [5] "Home Appliances" "Toys"

#assign baseline manually
newProduct_Category2 = relevel(Sales_dataset$Product_Category, ref = "Clothing")

levels(newProduct_Category2)

## [1] "Clothing"        "Books"           "Electronics"     "Furniture"      
## [5] "Home Appliances" "Toys"

Response to Q8

as.factor() function is to assign factor to categorical type of variable. The baseline is assigned by default and can simply check the baseline using level() and the first value is the baseline. You may change the baseline thru relevel().

The End