knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(lubridate)
library(ggplot2) #2a
library(zoo) #2b
## Warning: package 'zoo' was built under R version 4.3.3
##
## Attaching package: 'zoo'
##
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
library(xts) #2b
##
## ######################### Warning from 'xts' package ##########################
## # #
## # The dplyr lag() function breaks how base R's lag() function is supposed to #
## # work, which breaks lag(my_xts). Calls to lag(my_xts) that you type or #
## # source() into this session won't work correctly. #
## # #
## # Use stats::lag() to make sure you're not using dplyr::lag(), or you can add #
## # conflictRules('dplyr', exclude = 'lag') to your .Rprofile to stop #
## # dplyr from breaking base R's lag() function. #
## # #
## # Code in packages is not affected. It's protected by R's namespace mechanism #
## # Set `options(xts.warn_dplyr_breaks_lag = FALSE)` to suppress this warning. #
## # #
## ###############################################################################
##
## Attaching package: 'xts'
##
## The following objects are masked from 'package:dplyr':
##
## first, last
library(car) #5
## Loading required package: carData
##
## Attaching package: 'car'
##
## The following object is masked from 'package:dplyr':
##
## recode
##
## The following object is masked from 'package:purrr':
##
## some
In this homework, you have been given a dataset that captures comprehensive details of 1,000 sales transactions made in 2024, focusing on various aspects of customer interactions and sales dynamics. It includes a mix of numeric and categorical variables, allowing for in-depth analysis.
Key features of the dataset include:
Customer_ID: Unique identifier for each customer.
Product_Category: Category of the purchased product (e.g., Electronics, Clothing, Home Appliances, etc.).
Purchase_Amount: Total amount spent on each purchase, in dollars.
Purchase_Date: Date of the transaction
Customer_Age: Age of the customer at the time of purchase.
Customer_Gender: Gender of the customer, Male or Female
Store_Location: Location of the store where the purchase occurred.
Satisfaction_Score (response variable) : Numeric response variable indicating the customer’s satisfaction level
Note:
For the first homework assignment, we have provided the pre-processing steps for the variables. Kindly ensure you understand and can modify these steps as needed. In future homework and exams, students will be expected to handle data pre-processing independently, and no code will be provided. The only aspect that should remain unchanged is the seed setting to 100.
set.seed(100) #this seed has been set to 100 to ensure results are reproducible. DO NOT CHANGE THIS SEED
Sales_dataset = read.csv("large_sales_dataset.csv", header=TRUE) #reads data
Sales_dataset$Product_Category=as.factor(Sales_dataset$Product_Category)
Sales_dataset$Purchase_Date=as.Date(Sales_dataset$Purchase_Date, format="%m/%d/%Y")
Sales_dataset$Customer_Gender=as.factor(Sales_dataset$Customer_Gender)
Sales_dataset$Store_Location=as.factor(Sales_dataset$Store_Location)
head(Sales_dataset)
## Customer_ID Product_Category Purchase_Amount Purchase_Date Customer_Age
## 1 101 Books 1091.97 2024-07-24 62
## 2 102 Furniture 869.33 2024-05-04 62
## 3 102 Books 1592.04 2024-12-29 62
## 4 102 Electronics 1600.56 2024-12-07 62
## 5 105 Toys 1030.42 2024-02-20 35
## 6 105 Books 383.44 2024-11-25 35
## Customer_Gender Store_Location Satisfaction_Score
## 1 Female Chicago 7.281961
## 2 Female New York 8.124978
## 3 Female Boston 8.229209
## 4 Female Boston 8.882945
## 5 Female Atlanta 6.811557
## 6 Female San Diego 5.779122
#Dividing the dataset into training and testing datasets
testRows_sales = sample(nrow(Sales_dataset),0.2*nrow(Sales_dataset))
testData_sales = Sales_dataset[testRows_sales, ]
trainData_sales = Sales_dataset[-testRows_sales, ]
row.names(trainData_sales) <- NULL
head(trainData_sales)
## Customer_ID Product_Category Purchase_Amount Purchase_Date Customer_Age
## 1 102 Furniture 869.33 2024-05-04 62
## 2 102 Books 1592.04 2024-12-29 62
## 3 102 Electronics 1600.56 2024-12-07 62
## 4 105 Toys 1030.42 2024-02-20 35
## 5 105 Books 383.44 2024-11-25 35
## 6 109 Electronics 1632.42 2024-04-12 27
## Customer_Gender Store_Location Satisfaction_Score
## 1 Female New York 8.124978
## 2 Female Boston 8.229209
## 3 Female Boston 8.882945
## 4 Female Atlanta 6.811557
## 5 Female San Diego 5.779122
## 6 Male Boston 10.000000
For this question, use the “Sales_dataset”
1a)(4 points) Output a table that has both the average and median Purchase_Amount grouped by Product_Category, Customer_Gender and Customer_Age. Show the last 15 rows of the table
Note: For age group, use the following bins
‘0-18’, ‘19-25’, ‘26-35’, ‘36-45’, ‘46-55’, ‘56-65’, ‘66-75’, ‘76-85’, ‘86-100’
#library("dplyr")
copy_Sales_dataset <- Sales_dataset %>%
mutate(Age_Group = cut(Customer_Age,
breaks = c(0, 18, 25, 35, 45, 55, 65, 75, 85, 100),
labels = c("0-18", "19-25", "26-35", "36-45", "46-55","56-65", "66-75", "76-85", "86-100")))
#head(ageCut_Sales_dataset)
q1a_table <- copy_Sales_dataset %>%
group_by(Product_Category, Customer_Gender, Age_Group) %>%
summarise(mean_Purchase_Amount = mean(Purchase_Amount), media_Purchase_Amount = median(Purchase_Amount))
## `summarise()` has grouped output by 'Product_Category', 'Customer_Gender'. You
## can override using the `.groups` argument.
tail(q1a_table, 15)
## # A tibble: 15 × 5
## # Groups: Product_Category, Customer_Gender [3]
## Product_Category Customer_Gender Age_Group mean_Purchase_Amount
## <fct> <fct> <fct> <dbl>
## 1 Home Appliances Male 46-55 1372.
## 2 Home Appliances Male 56-65 1109.
## 3 Home Appliances Male 66-75 895.
## 4 Toys Female 19-25 1013.
## 5 Toys Female 26-35 1027.
## 6 Toys Female 36-45 586.
## 7 Toys Female 46-55 1053.
## 8 Toys Female 56-65 948.
## 9 Toys Female 66-75 929.
## 10 Toys Male 0-18 1115.
## 11 Toys Male 19-25 1139.
## 12 Toys Male 26-35 1155.
## 13 Toys Male 36-45 932.
## 14 Toys Male 46-55 778.
## 15 Toys Male 56-65 1008.
## # ℹ 1 more variable: media_Purchase_Amount <dbl>
Q1b)(2 points) Which customer age group has the highest total purchase amount across all categories?
totalPurchaseAge <- copy_Sales_dataset %>%
group_by(Age_Group) %>%
summarise(total_Purchase_Age = sum(Purchase_Amount))
totalPurchaseAge
## # A tibble: 7 × 2
## Age_Group total_Purchase_Age
## <fct> <dbl>
## 1 0-18 28611.
## 2 19-25 205891.
## 3 26-35 279965.
## 4 36-45 186527.
## 5 46-55 147993.
## 6 56-65 104559.
## 7 66-75 39057.
q1btable <- totalPurchaseAge %>%
filter(total_Purchase_Age == max(total_Purchase_Age))
q1btable
## # A tibble: 1 × 2
## Age_Group total_Purchase_Age
## <fct> <dbl>
## 1 26-35 279965.
Response to Q1b
The age group of 26-35 has the highest total purchase amount.
Q1c)(2 points) Which product category has the highest average satisfaction score, and in which location is it most frequently purchased?
meanAvgSat <- copy_Sales_dataset %>%
group_by(Product_Category) %>%
summarise(mean_Satisfaction_Score = mean(Satisfaction_Score))
meanAvgSat
## # A tibble: 6 × 2
## Product_Category mean_Satisfaction_Score
## <fct> <dbl>
## 1 Books 7.07
## 2 Clothing 7.02
## 3 Electronics 8.88
## 4 Furniture 7.95
## 5 Home Appliances 7.96
## 6 Toys 6.02
q1c1table <- meanAvgSat %>%
filter(mean_Satisfaction_Score == max(mean_Satisfaction_Score))
q1c1table$Product_Category
## [1] Electronics
## Levels: Books Clothing Electronics Furniture Home Appliances Toys
Response to Q1C1
Electronics is the Product Group that has the highest average satisfaction score.
q1c2table <- copy_Sales_dataset %>%
filter(Product_Category == q1c1table$Product_Category) %>%
count(Store_Location) %>%
filter(n == max(n))
q1c2table$Store_Location
## [1] Boston
## 13 Levels: Atlanta Boston Chicago Dallas Denver Houston ... Seattle
Response to Q1C2
Boston is the most frequently-purchased store location in the Product Group - Electronics.
For this question, use “Sales_dataset”
Q2a) (3 points) Calculate the monthly average sales per product category and plot them.
#rmb to convert with xts object
q2atable <- copy_Sales_dataset %>%
group_by(Product_Category, Month = floor_date(Purchase_Date, unit = "month")) %>%
summarize(monthly_avg_sales = mean(Purchase_Amount))
## `summarise()` has grouped output by 'Product_Category'. You can override using
## the `.groups` argument.
q2atable
## # A tibble: 72 × 3
## # Groups: Product_Category [6]
## Product_Category Month monthly_avg_sales
## <fct> <date> <dbl>
## 1 Books 2024-01-01 997.
## 2 Books 2024-02-01 1002.
## 3 Books 2024-03-01 1067.
## 4 Books 2024-04-01 934.
## 5 Books 2024-05-01 1252.
## 6 Books 2024-06-01 965.
## 7 Books 2024-07-01 995.
## 8 Books 2024-08-01 1316.
## 9 Books 2024-09-01 1055.
## 10 Books 2024-10-01 1253.
## # ℹ 62 more rows
q2aplot <- ggplot(q2atable, aes(x = q2atable$Month, y = q2atable$monthly_avg_sales, color = q2atable$Product_Category)) + geom_line() + geom_point() + labs(title = "Monthly Average Sales per Product Category", x = "Month",y = "Avg Sales",color = "Product Category")
q2aplot
Q2b)(5 points) Calculate the rolling means of average monthly sales (calculated in 2a) using a 3 month window. Use the center alignment to calculate the rolling means.
Now again plot the monthly average sales of each product category along with the rolling means.
i) What difference do you observe? How are the rolling means beneficial as compared to simple averages?
q2btable <- q2atable %>%
group_by(Product_Category) %>%
mutate(rolling_mean_3m = rollmean(monthly_avg_sales, 3, align = "center", fill = NA))
q2btable
## # A tibble: 72 × 4
## # Groups: Product_Category [6]
## Product_Category Month monthly_avg_sales rolling_mean_3m
## <fct> <date> <dbl> <dbl>
## 1 Books 2024-01-01 997. NA
## 2 Books 2024-02-01 1002. 1022.
## 3 Books 2024-03-01 1067. 1001.
## 4 Books 2024-04-01 934. 1084.
## 5 Books 2024-05-01 1252. 1050.
## 6 Books 2024-06-01 965. 1071.
## 7 Books 2024-07-01 995. 1092.
## 8 Books 2024-08-01 1316. 1122.
## 9 Books 2024-09-01 1055. 1208.
## 10 Books 2024-10-01 1253. 1081.
## # ℹ 62 more rows
q2bplot <- ggplot(q2btable, aes(x = Month)) + geom_line(aes(y = monthly_avg_sales, color = "Avg Sales",)) + geom_line(aes(y = rolling_mean_3m, color = "Rolling Mean Sales (3-month)")) + facet_wrap(~ Product_Category) + labs(title = "Avg Sales and Rolling Mean Sales per Product Category", x = "Month", y = "Avg Sales", color="Avg Sales / Rolling Sales")
q2bplot
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_line()`).
Response to Q2bi
The plot of Rolling Means 3 month window is more smooth, less steep and less fluctuate comparing to the plot of Monthly Avg Sales. Using rolling Mean for plot can help reduce noise and identify pattern such as seasonal pattern.
For this question, use trainData_sales
Q3a)(4 points) Create a boxplot of the response variable versus the following predicting variables
i) Store_Location
ii) Product_Category
Explain the relationship between the response and the two variables based on the boxplots.
q3aiplot <- ggplot(trainData_sales, aes(Store_Location, Satisfaction_Score, color = Store_Location)) + geom_boxplot() +labs(title = "Satisfaction Score vs Store Location", x = "Satisfaction Score", y = "Store Location")
q3aiplot
q3aiiplot <- ggplot(trainData_sales, aes(Product_Category, Satisfaction_Score, color = Product_Category)) + geom_boxplot() +labs(title = "Satisfaction Score vs Product Category", x = "Satisfaction Score", y = "Product Category")
q3aiiplot
Response to Q3
The box plot of Satisfaction Score vs Store Location varies across different locations, for example locations in LA and Chicago have wider interquartile range in store satisfaction comparing to locations in Atlanta and Seattle with narrower interquartile range indicating consistent score among customers. There are outliers in Boston and Denver that may affect the overall result with lowering score.
The box plot of Satisfaction Score vs Product Category also varies across different categories, for example Electronics has the highest median as ~9 comparing with the lowest median score ~6 in Toys while the rest range from 7-8. The outliers in Toys in both quartile-ends may indicate insight to concerns.
Though the box plot for both predicting variables vary but do not tell there are significant relations since they overlapping.
Q3b)(4 points) Create scatterplots of the response variable against the following predictors:
i) Customer_Age
ii) Purchase_Amount
Describe the general trend of each plot.
Output the R^2 for each plot. Use the following R^2 cut-offs while explaining if it is a weak, moderate, or strong relationship.
R^2<=0.3 (weak)
0.3<R^2<0.7 (moderate)
R^2>=0.7 (strong)
#show regression line for 3bi
q3bi <- lm(Satisfaction_Score~ Customer_Age, data = trainData_sales)
#R^2 for 3bi
cat("R^2 for 3bi:", summary(q3bi)$r.squared)
## R^2 for 3bi: 0.002581523
q3biplot <- plot(trainData_sales$Customer_Age, trainData_sales$Satisfaction_Score, main = "Scartterplot of Satisfaction vs Age", xlab= "Age", ylab = "Satisfaction Score") + abline(q3bi, col = "red")
#show regression line for 3bii
q3bii <- lm(Satisfaction_Score~ Purchase_Amount, data = trainData_sales)
#R^2 for 3bii
cat("R^2 for 3bii:", summary(q3bii)$r.squared)
## R^2 for 3bii: 0.146777
q3biiplot <- plot(trainData_sales$Purchase_Amount, trainData_sales$Satisfaction_Score, main = "Scartterplot of Satisfaction vs Sales", xlab= "Sales", ylab = "Satisfaction Score") + abline(q3bii, col = "red")
Response to Q3b
The scatter plot of q3bi show close to zero relationship between Age and Satisfaction Score while the plot of 3bii shows positive relationship between Sales and Satisfaction Score. Overall, the relationship for both are weak as the small R^2 value.
For this question, use trainData_sales
Q4a) (5 points) Create a simple linear regression model using the predictor “Purchase_Amount”. Call it model1. Display the summary.
model1 <- q3bii
summary(model1)
##
## Call:
## lm(formula = Satisfaction_Score ~ Purchase_Amount, data = trainData_sales)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.9529 -0.8960 -0.0057 0.9767 3.3668
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.5431520 0.0937456 69.80 <2e-16 ***
## Purchase_Amount 0.0009585 0.0000818 11.72 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.327 on 798 degrees of freedom
## Multiple R-squared: 0.1468, Adjusted R-squared: 0.1457
## F-statistic: 137.3 on 1 and 798 DF, p-value: < 2.2e-16
i) What are the model parameters and their estimates?
The model parameters are the 1.) intercept and its estimates ~6.54315, 2.) B1 (Purchase_Amount) its estimates ~0.00095, 3.) the variance of error term ~(1.327^2)
ii) Interpret the coefficient of the predictor “Purchase_Amount” in the context of the problem.
For an increase of a dollar in the Purchase Amount, there will be an increase of ~0.00095 in the Satisfaction Score.
iii) Find a 95% confidence interval for the coefficient of “Purchase_Amount”. Is the coefficient significant at this level?
confint(model1, level = 0.95)
## 2.5 % 97.5 %
## (Intercept) 6.3591349416 6.727169038
## Purchase_Amount 0.0007978727 0.001119022
response to q4aiii
The 95% confidence interval is (0.000797 to 0.001119) and is significant at this level.
Q4b)(3 points) Is the coefficient of Purchase_Amount statistically significant?
i) State the null hypothesis
H0 : B1(Purchase_Amount) = 0
ii) State the alternative hypothesis
H1 : B1(Purchase_Amount) != 0
iii) Which test is used for testing the significance of coefficient?
Comparing t-value with critical point or
comparing p-value with alpha level /(significant level) or
just use the p-value from R
Q4c)( 4 points) Perform an ANOVA F-test on the means of Product_Category.
#4ci
q4ctable <- aov(Satisfaction_Score ~ Product_Category, data = trainData_sales)
summary(q4ctable)
## Df Sum Sq Mean Sq F value Pr(>F)
## Product_Category 5 674.9 134.97 110.2 <2e-16 ***
## Residuals 794 972.5 1.22
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
i) State the null and alternative hypotheses.
Null hypotheses, H0 : All means are equal
Alternative hypotheses, H1: Atleast 2 means are statistically significant.
ii) Using an α-level of 0.01, do we reject the null hypothesis that the means are equal? Explain your conclusion.
The p-value of the F-test for Product_Category is <2e-16 which is small than the α-level of 0.01. thus rejecting H0 (equal means) concludes that at least 2 means for the Satisfaction_Score across Product_Category are statistically significant.
iii) Which means are plausibly similar at the confidence level of 99%?
TukeyHSD(q4ctable, conf.level = 0.99)
## Tukey multiple comparisons of means
## 99% family-wise confidence level
##
## Fit: aov(formula = Satisfaction_Score ~ Product_Category, data = trainData_sales)
##
## $Product_Category
## diff lwr upr p adj
## Clothing-Books 0.03700886 -0.4171450 0.4911627 0.9997875
## Electronics-Books 1.79972660 1.3547529 2.2447003 0.0000000
## Furniture-Books 0.89568210 0.4220455 1.3693187 0.0000000
## Home Appliances-Books 0.91781089 0.4704596 1.3651622 0.0000000
## Toys-Books -1.06748120 -1.5173015 -0.6176609 0.0000000
## Electronics-Clothing 1.76271774 1.3101126 2.2153229 0.0000000
## Furniture-Clothing 0.85867325 0.3778599 1.3394866 0.0000000
## Home Appliances-Clothing 0.88080203 0.4258592 1.3357449 0.0000000
## Toys-Clothing -1.10449006 -1.5618610 -0.6471192 0.0000000
## Furniture-Electronics -0.90404450 -1.3761963 -0.4318927 0.0000000
## Home Appliances-Electronics -0.88191572 -1.3276947 -0.4361367 0.0000000
## Toys-Electronics -2.86720780 -3.3154645 -2.4189511 0.0000000
## Home Appliances-Furniture 0.02212878 -0.4522644 0.4965220 0.9999866
## Toys-Furniture -1.96316330 -2.4398855 -1.4864411 0.0000000
## Toys-Home Appliances -1.98529208 -2.4359090 -1.5346752 0.0000000
response for Q4ciii
The mean of Satisfaction Score in Clothing vs Books and Home Appliances vs Furniture are plausibly similar at the confidence level of 99% as the CI including 0 and p-value is large.
iv) Compare the satisfaction score of the following pair of product categories:
Electronics-Clothing
CI and p-value of Electronics-Clothing
The p-value is close to 0 indicates both the Satisfaction Score is statistically significant different. The CI comparing Electronics vs Clothing have only positive value (1.3101126 to 2.2153229) indicates that the mean of Satisfaction Score in Electronics is significantly higher.
Q5a)(4 points) Perform the following model diagnostics on model1 created in Q4a.
Note: Both a histogram and a normal QQ plot with a pointwise confidence envelope must be plotted (tip: qqPlot() from the car package can generate a pointwise confidence envelope.
Explain your conclusion
i) Check for linearity assumption
Response vs Predictor variable plot evaluate linearity assumption.
#q3bii <- lm(Satisfaction_Score~ Purchase_Amount, data = trainData_sales)
#plot(model1)
ggplot(data = trainData_sales, aes(x = Purchase_Amount, y = Satisfaction_Score)) +
geom_point(col = "blue") + labs(x = "Purchase", y = "Satisfaction Score", title = "Linearity Plot") + geom_smooth(method = "lm", color = "red")
## `geom_smooth()` using formula = 'y ~ x'
#alternative way
#plot(trainData_sales$Purchase_Amount, trainData_sales$Satisfaction_Score, xlab = "Purchase", ylab = "Score", main = "Linearity Plot", pty = 2, lwd = 2) + abline(model1, col = "red")
ii) Check for constant variance
The residual vs fitted plot evaluates constant variance and uncorrelated.
ggplot(data = trainData_sales, aes(x = model1$fitted.values, y = model1$residuals)) +
geom_point(col = "blue") + labs(x = "Fitted Values", y = "Residuals", title = "Residuals vs Fitted Plot") + geom_hline(yintercept = 0 )
#alternative way
#plot(model1$fitted, model1$residuals, xlab="Fitted values", ylab="Residuals", main = "Residuals vs Fitted plot", pty=2, lwd=2) + abline(h = 0, col = "black")
iii) Check for normality
QQplot and histogram evaluate normality.
hist(model1$residuals, xlab = "Residuals", main = "")
qqPlot(model1$residuals, ylab = "Residuals")
## [1] 594 180
Response to Q5a
The Linearity Plot displays the linearity trend that holds the linear assumption.
The residuals scattered around the 0 line indicate that both the constant variance assumption and the assumption of uncorrelated errors both hold.
There are departures in the end-tails of the qqplot may indicate violation of normality, however the Histogram of residuals displays symmetric. The normality assumption may be violated.
Q5b)(1 point) Based on your conclusion in Q5a, would you propose any transformation of the predictor or response variable? Explain with reasoning.
Because the normality does not hold, it is better to transform the variable using function that do not consider normality.
Q5c)(2 points) Suppose you have fit a simple linear regression model using Sales as the response variable and Advertising as the predictor. You are interested in understanding the difference between confidence intervals and prediction intervals.
i) Briefly explain the difference between a confidence interval for the mean response and a prediction interval for a new observation at a given value of the predictor.
The prediction interval is used to provide an interval estimate for a prediction of y, Sales for one member of the population with a particular value of x, Advertising; the confidence interval is used to provide an interval estimate for the true average value of y, Sales for all members of the population with a particular value of x, Advertising.
ii) Why is the prediction interval wider than the confidence interval?
The prediction intervals have additional variance from the variation of a new measurement, accounting for both the uncertainty in the mean and the variability of individual observations around the mean.
iii) Suppose that for an Advertising value of 100, the fitted model returns the following:
Interpret each interval in the context of the problem.
95% Confidence Interval: (10.2, 11.4)
The 95% Confidence Interval [10.2, 11.4] represents the range within which the true mean Sales value is expected to lie with 95% confidence, given an Advertising value of 100.
95% Prediction Interval: (7.5, 14.1)
The 95% Prediction Interval [7.5, 14.1] represents the range within which a future individual Sales value is expected to lie with 95% confidence, given an Advertising value of 100.
iv) If the residual plot shows evidence of non-constant variance, how might that impact the reliability of the prediction interval? How might a log transformation of the response variable help address this issue?
Non-constant variance may result in intervals that are too narrow or too wide fro predictor values that underestimate uncertainty where variance is high. Log transformation can stabilize variance by the log scale resulting in more adjusted prediction interval.
Q5d)(2 points) Create a linear regression model, named model2, that uses the log transformed Satisfaction_Score as the response, and the log transformed Purchase_Amount as the predictor. Display the summary.
model2 <- lm(log(Satisfaction_Score)~ log(Purchase_Amount), data = trainData_sales)
summary(model2)
##
## Call:
## lm(formula = log(Satisfaction_Score) ~ log(Purchase_Amount),
## data = trainData_sales)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.86480 -0.12028 0.01694 0.13832 0.47907
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.485931 0.046705 31.82 <2e-16 ***
## log(Purchase_Amount) 0.077005 0.007001 11.00 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1906 on 798 degrees of freedom
## Multiple R-squared: 0.1316, Adjusted R-squared: 0.1306
## F-statistic: 121 on 1 and 798 DF, p-value: < 2.2e-16
Q5e)(2 points) Compare the R-squared values of model1 and model2. Did the transformation improve the explanatory power of the model?
Response to Q5e
The transformation reduce the explanatory power of the model as the R-squared values decrease.
Q5f) (3 points) Perform the same model diagnostics on model2 as you did on model1 in Q5a. Assess and interpret all model assumptions. Based on your interpretation of the model assumptions, is model2 a better fit than model1? A model is considered a good fit only if all the assumptions hold.
#linearity
ggplot(data = trainData_sales, aes(x = log(trainData_sales$Purchase_Amount), y = log(trainData_sales$Satisfaction_Score))) + geom_point(col = "blue") + labs(x = "Purchase", y = "Satisfaction Score", title = "Linearity Plot") + geom_smooth(method = "lm", color = "red")
## Warning: Use of `trainData_sales$Purchase_Amount` is discouraged.
## ℹ Use `Purchase_Amount` instead.
## Warning: Use of `trainData_sales$Satisfaction_Score` is discouraged.
## ℹ Use `Satisfaction_Score` instead.
## Warning: Use of `trainData_sales$Purchase_Amount` is discouraged.
## ℹ Use `Purchase_Amount` instead.
## Warning: Use of `trainData_sales$Satisfaction_Score` is discouraged.
## ℹ Use `Satisfaction_Score` instead.
## `geom_smooth()` using formula = 'y ~ x'
#constant variance and uncorrelated
ggplot(data = trainData_sales, aes(x = model2$fitted.values, y = model2$residuals)) + geom_point(col = "blue") + labs(x = "Fitted Values", y = "Residuals", title = "Residuals vs Fitted Plot") + geom_hline(yintercept = 0)
#normality
hist(model2$residuals, xlab = "Residuals", main = "")
qqPlot(model2$residuals, ylab = "Residuals")
## [1] 594 353
Response to Q5f
Linearity and constant variance assumption are violated as the point on the linearity plot does not show linear pattern and for the residual plot, the variability of the residuals changes from left (1.6) to right (2). The variability is much smaller for the values in the left. The qqplot suggests only normality assumption is hold as the deviation in the end tails is minor.
For this question, use testData_sales
Q6a)(3 points) Using testData_sales, predict the satisfaction score using model1.
predModel1 <- predict(model1, newdata = testData_sales)
q6ai_PredTable <- data.frame(Satisfaction_Score = testData_sales$Satisfaction_Score, predModel1)
head(q6ai_PredTable, 10)
## Satisfaction_Score predModel1
## 714 9.401932 8.031362
## 503 5.557678 6.550302
## 358 8.634280 7.832033
## 624 8.394701 7.974056
## 985 9.168275 7.275348
## 718 4.708268 6.963268
## 919 9.076011 8.280098
## 470 8.757487 8.411673
## 966 7.671931 7.794002
## 516 10.000000 7.300258
#answer for 6aii - MSE for prediction
mean((predModel1 - q6ai_PredTable$Satisfaction_Score)^2)
## [1] 1.785631
#answer for 6aiii - Avg of the predictions of the satisfaction score
mean(predModel1)
## [1] 7.496399
#answer for 6aiii - Standard deviation of the predictions of the satisfaction score
sd(predModel1)
## [1] 0.5212317
Q6b)(2 points) Using the first row of testData_sales, predict the satisfaction score using model1. What is the 99% prediction interval (PI)? Provide an interpretation of your results.
predict(model1, newdata = testData_sales[1,], interval = "prediction", level = 0.99)
## fit lwr upr
## 714 8.031362 4.600441 11.46228
Response to Q6b
The prediction interval for the satisfaction score is [4.600441, 11.46228] with a predicted value of 8.031362. This means there is 99% confident that the true satisfaction score for this observation lie within this interval.
(2 points) Research and explain the arguments that go into the predict function in detail. Please include object, newdata, interval, type, se.fit, and level.
Response to Q7
Predict() function is to predict new value based on a model.
Object is the fitted model that created and used to against the newdata. It passes function for example, lm(), aov(), glm().
newdata is the dataframe containing all predictor variables for prediction.
interval is the type of interval the user wants to use for example, “confidence”, “prediction” as well as reflecting uncertainty.
type is the type of prediction the user wants to return for example, “response”, “term”.
se.fit to return whether to compute standard error, default as “FALSE”.
level is to return the confident interval, default as “0.95”.
(3 points) In the pre-processing of the data we did:
Sales_dataset$Product_Category = as.factor(Sales_dataset$Product_Category) ,
explain ways using both code and descriptions that you can tell what the baseline is and how you can change the baseline for Product_Category.
#assign factor
newProduct_Category = as.factor(Sales_dataset$Product_Category)
#checking the baseline
levels(newProduct_Category)
## [1] "Books" "Clothing" "Electronics" "Furniture"
## [5] "Home Appliances" "Toys"
#assign baseline manually
newProduct_Category2 = relevel(Sales_dataset$Product_Category, ref = "Clothing")
levels(newProduct_Category2)
## [1] "Clothing" "Books" "Electronics" "Furniture"
## [5] "Home Appliances" "Toys"
Response to Q8
as.factor() function is to assign factor to categorical type of variable. The baseline is assigned by default and can simply check the baseline using level() and the first value is the baseline. You may change the baseline thru relevel().
The End