In this homework, you have been given a dataset that captures comprehensive details of 1,000 sales transactions made in 2024, focusing on various aspects of customer interactions and sales dynamics. It includes a mix of numeric and categorical variables, allowing for in-depth analysis.
Key features of the dataset include:
Customer_ID: Unique identifier for each customer.
Product_Category: Category of the purchased product (e.g., Electronics, Clothing, Home Appliances, etc.).
Purchase_Amount: Total amount spent on each purchase, in dollars.
Purchase_Date: Date of the transaction.
Customer_Age: Age of the customer at the time of purchase.
Customer_Gender: Gender of the customer (Male or Female).
Store_Location: Location of the store where the purchase occurred.
Satisfaction_Score (response variable): Numeric response variable indicating the customer’s satisfaction level.
Note:
For the first homework assignment, we have provided the pre-processing steps for the variables. Kindly ensure you understand and can modify these steps as needed. In future homework and exams, students will be expected to handle data pre-processing independently, and no code will be provided. The only aspect that should remain unchanged is the seed setting to 100.
set.seed(100) #this seed has been set to 100 to ensure results are reproducible. DO NOT CHANGE THIS SEED
Sales_dataset = read.csv("large_sales_dataset.csv", header=TRUE) #reads data
Sales_dataset$Product_Category=as.factor(Sales_dataset$Product_Category)
Sales_dataset$Purchase_Date=as.Date(Sales_dataset$Purchase_Date, format="%m/%d/%Y")
Sales_dataset$Customer_Gender=as.factor(Sales_dataset$Customer_Gender)
Sales_dataset$Store_Location=as.factor(Sales_dataset$Store_Location)
head(Sales_dataset)
## Customer_ID Product_Category Purchase_Amount Purchase_Date Customer_Age
## 1 203 Books 1121.89 2024-09-05 55
## 2 536 Books 236.11 2024-03-01 55
## 3 961 Electronics 756.51 2024-08-26 38
## 4 371 Toys 23.10 2024-12-26 65
## 5 207 Electronics 1638.27 2024-01-27 46
## 6 172 Clothing 261.26 2024-01-29 22
## Customer_Gender Store_Location Satisfaction_Score
## 1 Female Denver 7.924694
## 2 Male Seattle 9.132629
## 3 Male Houston 9.637018
## 4 Female Chicago 5.519188
## 5 Male Seattle 10.000000
## 6 Female San Francisco 5.933145
#Dividing the dataset into training and testing datasets
testRows_sales = sample(nrow(Sales_dataset),0.2*nrow(Sales_dataset))
testData_sales = Sales_dataset[testRows_sales, ]
trainData_sales = Sales_dataset[-testRows_sales, ]
row.names(trainData_sales) <- NULL
head(trainData_sales)
## Customer_ID Product_Category Purchase_Amount Purchase_Date Customer_Age
## 1 536 Books 236.11 2024-03-01 55
## 2 961 Electronics 756.51 2024-08-26 38
## 3 371 Toys 23.10 2024-12-26 65
## 4 207 Electronics 1638.27 2024-01-27 46
## 5 172 Clothing 261.26 2024-01-29 22
## 6 121 Home Appliances 41.76 2024-11-11 27
## Customer_Gender Store_Location Satisfaction_Score
## 1 Male Seattle 9.132629
## 2 Male Houston 9.637018
## 3 Female Chicago 5.519188
## 4 Male Seattle 10.000000
## 5 Female San Francisco 5.933145
## 6 Male Denver 6.691315
For this question, use the “Sales_dataset”
Q1a)(4 points) Output a table that has both the average and median Purchase_Amount grouped by Product_Category, Customer_Gender, and Customer_Age. Show the last 15 rows of the table.
Note: For age group, use the following bins
‘0-18’, ‘19-25’, ‘26-35’, ‘36-45’, ‘46-55’, ‘56-65’, ‘66-75’, ‘76-85’, ‘86-100’
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
AgegroupsSales <- Sales_dataset
AgegroupsSales <- AgegroupsSales %>%
mutate(Age_Groups = cut(Customer_Age, breaks = c(0,18,25,35,45,55,65,75,85,100), labels = c('0-18', '19-25', '26-35', '36-45', '46-55', '56-65', '66-75', '76-85', '86-100'))) #refer to the piped data's column directly rather than reaching back into Sales_dataset with $
Q1Table <- AgegroupsSales %>%
group_by(Product_Category,Age_Groups,Customer_Gender)%>%
summarise(mean = mean(Purchase_Amount),median = median(Purchase_Amount))
## `summarise()` has grouped output by 'Product_Category', 'Age_Groups'. You can
## override using the `.groups` argument.
Q1Table
## # A tibble: 81 × 5
## # Groups: Product_Category, Age_Groups [42]
## Product_Category Age_Groups Customer_Gender mean median
## <fct> <fct> <fct> <dbl> <dbl>
## 1 Books 0-18 Male 45.0 45.0
## 2 Books 19-25 Female 785. 521.
## 3 Books 19-25 Male 913. 888.
## 4 Books 26-35 Female 1323. 1373.
## 5 Books 26-35 Male 1043. 1163.
## 6 Books 36-45 Female 1102. 1072.
## 7 Books 36-45 Male 1035. 1087.
## 8 Books 46-55 Female 915. 975.
## 9 Books 46-55 Male 1076. 1213.
## 10 Books 56-65 Female 951. 1055.
## # ℹ 71 more rows
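The default tibble print shows only the first 10 rows. Since the prompt asks for the last 15 rows, one simple option is tail() (a sketch; the rows themselves appear when the document is knit):
tail(Q1Table, 15) #display the last 15 rows of the grouped summary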
Q1b)(2 points) Which customer age group has the highest total purchase amount across all categories?
AgegroupsSales %>%
group_by(Age_Groups) %>%
summarise(HpurchaserAMT = sum(Purchase_Amount))
## # A tibble: 7 × 2
## Age_Groups HpurchaserAMT
## <fct> <dbl>
## 1 0-18 16762.
## 2 19-25 110907.
## 3 26-35 189693.
## 4 36-45 187040.
## 5 46-55 193727.
## 6 56-65 198254.
## 7 66-75 96220.
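Rather than reading the winner off the table, the top group can also be extracted programmatically; a sketch using dplyr's slice_max():
AgegroupsSales %>%
group_by(Age_Groups) %>%
summarise(HpurchaserAMT = sum(Purchase_Amount)) %>%
slice_max(HpurchaserAMT, n = 1) #age group with the highest total purchase amount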
The age group with the highest total purchase amount across all categories is 56-65, with a total of about $198,254.
Q1c)(2 points) Which product category has the highest average satisfaction score, and in which location is it most frequently purchased?
HASS <- Sales_dataset %>%
group_by(Product_Category)%>%
summarise(HmeanSS = mean(Satisfaction_Score))
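HASS is computed above but never displayed; a quick way to surface the winning category (a sketch):
HASS %>% slice_max(HmeanSS, n = 1) #category with the highest average satisfaction score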
The product category with the highest average satisfaction score is Electronics.
FP <- Sales_dataset[Sales_dataset$Product_Category == "Electronics",] %>%
group_by(Product_Category,Store_Location)%>%
summarise(LocSales = n()) #n() is the idiomatic dplyr row counter; same result as length(Store_Location)
## `summarise()` has grouped output by 'Product_Category'. You can override using
## the `.groups` argument.
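Similarly, the most frequent location can be pulled out directly (a sketch):
FP %>% slice_max(LocSales, n = 1) #store location with the most Electronics purchases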
The location where Electronics is most frequently purchased is Boston.
For this question, use “Sales_dataset”
Q2a) (3 points) Calculate the monthly average sales per product category and plot them.
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(ggplot2)
MonthSales <- Sales_dataset %>%
mutate(Month = month(Purchase_Date))
MASpP <- MonthSales %>%
group_by(Product_Category,Month)%>%
summarise(MonthlySale = mean(Purchase_Amount))
## `summarise()` has grouped output by 'Product_Category'. You can override using
## the `.groups` argument.
ggplot(data = MASpP, mapping = aes(Month, MonthlySale, color = Product_Category))+geom_line()+geom_point()+labs(title = "Monthly Average Sales per Category", x="Month",y="Sales",color="Product Category") #use bare column names inside aes() rather than MASpP$
Q2b)(5 points) Calculate the rolling means of the average monthly sales (calculated in 2a) using a 3-month window. Use center alignment to calculate the rolling means.
Now again plot the monthly average sales of each product category along with the rolling means.
i) What difference do you observe? How are the rolling means beneficial as compared to simple averages?
library(zoo)
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
RolledMean <- MASpP %>%
mutate(TMRolled = rollmean(MonthlySale,3,fill = NA,align = "center"))
ggplot(data = RolledMean, mapping = aes(Month, TMRolled, color = Product_Category))+geom_line()+geom_point()+labs(title = "Three Month Rolling Mean for Sales per Category", x="Month",y="Sales",color="Product Category")
## Warning: Removed 12 rows containing missing values or values outside the scale range
## (`geom_line()`).
## Warning: Removed 12 rows containing missing values or values outside the scale range
## (`geom_point()`).
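The prompt asks for the monthly averages and the rolling means together; a sketch overlaying the raw series (dashed) on the smoothed one (solid):
ggplot(RolledMean, aes(Month, color = Product_Category)) +
geom_line(aes(y = MonthlySale), linetype = "dashed") +
geom_line(aes(y = TMRolled)) +
geom_point(aes(y = TMRolled)) +
labs(title = "Monthly Averages (dashed) vs. 3-Month Rolling Means (solid)", x = "Month", y = "Sales", color = "Product Category")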
Response to Q2b
The rolling-mean curves are much smoother and less erratic than the raw monthly averages. By smoothing out month-to-month noise, rolling means reveal the underlying trend more clearly than simple averages, making them more useful for understanding how sales evolve and for anticipating future behavior.
For this question, use trainData_sales
Q3a)(4 points) Create a boxplot of the response variable versus the following predicting variables
i) Store_Location
ii) Product_Category
Explain the relationship between the response and the two variables based on the boxplots.
ggplot(trainData_sales, aes(Store_Location,Satisfaction_Score))+geom_boxplot()+labs(title = "Satisfaction Score VS Store Location",x = "Location",y= "Satisfaction Score")
ggplot(trainData_sales, aes(Product_Category,Satisfaction_Score))+geom_boxplot()+labs(title = "Satisfaction Score VS Product Category",x = "Category",y= "Satisfaction Score")
Response to Q3a
Satisfaction score does not appear to vary much by store location: the boxes overlap substantially, so location looks like a weak predictor. There is much more separation across product categories, suggesting that category predicts satisfaction score far better. Electronics has a much higher median than the other categories, overlapping them only in its lower quartile. Toys has the lowest median and the largest number of outliers; since most of those outliers are unusually high satisfaction scores, some additional factor may be driving the typically low scores for Toys.
Q3b)(4 points) Create scatterplots of the response variable against the following predictors:
i) Customer_Age
ii) Purchase_Amount
Describe the general trend of each plot.
Output the R^2 for each plot. Use the following R^2 cut-offs while explaining if it is a weak, moderate, or strong relationship.
R^2<=0.3 (weak)
0.3<R^2<0.7 (moderate)
R^2>=0.7 (strong)
ggplot(trainData_sales,aes(Customer_Age,Satisfaction_Score))+geom_point()+labs(title = "Satisfaction Score VS Customer Age",x = "Customer Age",y= "Satisfaction Score")
r_extraction <- summary(lm(Satisfaction_Score ~ Customer_Age,data = trainData_sales))
r_Squared = r_extraction$r.squared
r_Squared #output the R^2 as the prompt requests
ggplot(trainData_sales,aes(Purchase_Amount,Satisfaction_Score))+geom_point()+labs(title = "Satisfaction Score VS Purchase Amount",x = "Purchase Amount",y= "Satisfaction Score")
r_extraction2 <- summary(lm(Satisfaction_Score ~ Purchase_Amount,data = trainData_sales))
rSquare2 <- r_extraction2$r.squared
rSquare2
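To make the weak/moderate/strong classification reproducible, a small helper applying the stated cut-offs (a sketch; the function name strength is mine):
strength <- function(r2) if (r2 <= 0.3) "weak" else if (r2 < 0.7) "moderate" else "strong"
c(Customer_Age = strength(r_Squared), Purchase_Amount = strength(rSquare2))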
Response to Q3b
The first plot shows little to no change in satisfaction score with customer age. The second plot shows a positive association between satisfaction score and purchase amount.
The R^2 for the first plot is very low (~0.001), indicating a weak relationship between the predictor and the response. The R^2 for the second plot (~0.163) is still in the weak range, but considerably larger than the first.
For this question, use trainData_sales
Q4a) (5 points) Create a simple linear regression model using the predictor “Purchase_Amount”. Call it model1. Display the summary.
i) What are the model parameters and their estimates?
ii) Interpret the coefficient of the predictor “Purchase_Amount” in the context of the problem.
iii) Find a 95% confidence interval for the coefficient of “Purchase_Amount”. Is the coefficient significant at this level?
model1 <- lm(Satisfaction_Score ~ Purchase_Amount,trainData_sales)
summary(model1)
##
## Call:
## lm(formula = Satisfaction_Score ~ Purchase_Amount, data = trainData_sales)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.8394 -0.9163 0.0100 0.9824 3.4846
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.417e+00 9.558e-02 67.14 <2e-16 ***
## Purchase_Amount 1.046e-03 8.382e-05 12.48 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.333 on 798 degrees of freedom
## Multiple R-squared: 0.1633, Adjusted R-squared: 0.1623
## F-statistic: 155.8 on 1 and 798 DF, p-value: < 2.2e-16
Response to Q4a
i. The intercept estimate is B0 = 6.417 and the slope estimate is B1 = 1.046e-03 ≈ 0.001046.
ii. For each one-dollar increase in Purchase_Amount, the model estimates that Satisfaction_Score increases by about 0.001046 points, on average.
iii. Using the confint() function below, the 95% confidence interval for the coefficient is (0.00088, 0.00121); since the interval does not contain zero, the coefficient is significant at this level.
confint(model1, "Purchase_Amount",level = .95)
## 2.5 % 97.5 %
## Purchase_Amount 0.0008816338 0.001210702
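As a sanity check, the same interval can be reproduced by hand from the summary output in Q4a (estimate ± t-quantile × standard error):
0.001046 + c(-1, 1) * qt(0.975, 798) * 8.382e-05 #matches confint() above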
Q4b)(3 points) Is the coefficient of Purchase_Amount statistically significant?
i) State the null hypothesis
ii) State the alternative hypothesis
iii) Which test is used for testing the significance of coefficient?
Response to Q4b
i. H0: B1 = 0 (the coefficient of Purchase_Amount is zero; Purchase_Amount has no linear effect on Satisfaction_Score).
ii. HA: B1 != 0 (the coefficient of Purchase_Amount is nonzero, i.e., statistically significant).
iii. The t-test on the coefficient is used; the summary reports its t value (12.48) and p-value (< 2e-16).
Q4c)( 4 points) Perform an ANOVA F-test on the means of Product_Category.
i) State the null and alternative hypotheses.
ii) Using an α-level of 0.01, do we reject the null hypothesis that the means are equal? Explain your conclusion.
iii) Which means are plausibly similar at the confidence level of 99%?
meanbyCategory <- trainData_sales %>%
group_by(Product_Category)%>%
summarise(meanSatisfaction = mean(Satisfaction_Score))
meanbyCategory
#99% confidence interval for the mean Satisfaction_Score within each category.
#t.test() must be run on the Satisfaction_Score values for each category, not on the
#logical vector Product_Category == "...", which would instead give a CI for that
#category's share of rows.
for (category in levels(trainData_sales$Product_Category)) {
Trial <- t.test(trainData_sales$Satisfaction_Score[trainData_sales$Product_Category == category], conf.level = .99)
print(category)
print(Trial$conf.int)
}
iv) Compare the satisfaction score of the following pair of product categories:
Electronics-Clothing
EandC <- trainData_sales[(trainData_sales$Product_Category=="Electronics"|trainData_sales$Product_Category=="Clothing"),]
modelT <- lm(Satisfaction_Score ~ Product_Category, EandC)
forTukey <- aov(modelT)
TukeyHSD(forTukey,conf.level = .99)
## Tukey multiple comparisons of means
## 99% family-wise confidence level
##
## Fit: aov(formula = modelT)
##
## $Product_Category
## diff lwr upr p adj
## Electronics-Clothing 1.885893 1.558771 2.213015 0
Response to Q4c
i. H0: the mean Satisfaction_Score is the same for all product categories (mu_Books = mu_Clothing = ... = mu_Toys). HA: at least one category mean differs from the others.
ii. We reject the null hypothesis: the ANOVA p-value below (< 2.2e-16) is far below the alpha-level of 0.01, so there is strong evidence that the category means are not all equal.
modelA <- lm(Satisfaction_Score ~ Product_Category, trainData_sales)
anova(modelA)
## Analysis of Variance Table
##
## Response: Satisfaction_Score
## Df Sum Sq Mean Sq F value Pr(>F)
## Product_Category 5 732.94 146.588 121.09 < 2.2e-16 ***
## Residuals 794 961.19 1.211
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
print(.01 > 2.2e-16)
## [1] TRUE
Response to Q4ciii
At the 99% confidence level, categories whose confidence intervals for mean Satisfaction_Score overlap are plausibly similar. From the intervals above and the boxplots in Q3a, Electronics sits well above the rest, while the categories whose intervals overlap are plausibly similar to one another.
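A more complete way to answer this is to run Tukey comparisons across all six categories at once; pairs whose intervals contain zero have plausibly equal means (a sketch, output omitted):
TukeyHSD(aov(Satisfaction_Score ~ Product_Category, data = trainData_sales), conf.level = .99)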
Response to Q4civ
A Tukey comparison of the two categories shows a significant difference in means: the estimated difference is about 1.89 with an adjusted p-value of essentially 0, and the 99% confidence interval (1.56, 2.21) does not include zero. There is therefore a statistically significant difference in mean satisfaction between Electronics and Clothing.
Q5a)(4 points) Perform the following model diagnostics on model1 created in Q4a.
i) Check for linearity assumption
ii) Check for constant variance
iii) Check for normality
Note: Both a histogram and a normal QQ plot with a pointwise confidence envelope must be plotted (tip: qqPlot() from the car package can generate a pointwise confidence envelope).
Explain your conclusion
#Linearity: no pattern in the residuals-vs-fitted plot, so the assumption holds
#Constant variance: the residuals are evenly spread around zero, so the assumption holds
plot(model1, 1)
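A scale-location plot targets the constant-variance assumption more directly (optional, not required by the prompt):
plot(model1, 3) #a flat trend line suggests homoscedasticity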
# Normality: tails on both ends of the QQ plot may indicate a violation of normality
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
qqPlot(model1$residuals, envelope = .95)
## [1] 528 667
#Bell shaped curve suggests normality (Looks like a normal distribution)
hist(residuals(model1))
Response to Q5a
Linearity assumption: there is no obvious pattern in the residuals, and the trend line runs nearly horizontally at 0, so linearity holds.
Constant variance: the residuals are evenly spread around a mean of roughly zero with no funneling, which supports constant variance.
Normality: the QQ plot shows tails at both ends of the line, which could indicate a mild departure from normality; however, the histogram of the residuals closely resembles a normal distribution. Taken together, the normality assumption appears reasonable.
Q5b)(1 point) Based on your conclusion in Q5a, would you propose any transformation of the predictor or response variable? Explain with reasoning.
Response to Q5b
I would not recommend a transformation, because the diagnostic plots above do not show violations of the model assumptions.
Q5c)(2 points) Use a Box-Cox transformation (boxCox() in the car package or boxcox() in the MASS package) to find the optimal λ value rounded to the nearest half integer. What transformation of the response, if any, does it suggest to perform?
bc <- boxCox(model1) #plots the profile log-likelihood over a grid of lambda values
lambda <- bc$x[which.max(bc$y)] #lambda that maximizes the log-likelihood
lambda
## [1] 1.191919
Response to Q5c
The optimal λ is approximately 1.19, which rounds to 1 (the nearest half integer). Since λ = 1 corresponds to no transformation and lies within the confidence interval for λ, no transformation of the response is suggested.
Q5d)(2 points) Create a linear regression model, named model2, that uses the log transformed Satisfaction_Score as the response, and the log transformed Purchase_Amount as the predictor. Display the summary.
model2 <- lm(log(Satisfaction_Score) ~ log(Purchase_Amount),trainData_sales)
summary(model2)
##
## Call:
## lm(formula = log(Satisfaction_Score) ~ log(Purchase_Amount),
## data = trainData_sales)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.8521 -0.1223 0.0194 0.1447 0.4992
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.452993 0.046363 31.34 <2e-16 ***
## log(Purchase_Amount) 0.081082 0.006946 11.67 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1919 on 798 degrees of freedom
## Multiple R-squared: 0.1458, Adjusted R-squared: 0.1448
## F-statistic: 136.3 on 1 and 798 DF, p-value: < 2.2e-16
summary(model1)
##
## Call:
## lm(formula = Satisfaction_Score ~ Purchase_Amount, data = trainData_sales)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.8394 -0.9163 0.0100 0.9824 3.4846
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.417e+00 9.558e-02 67.14 <2e-16 ***
## Purchase_Amount 1.046e-03 8.382e-05 12.48 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.333 on 798 degrees of freedom
## Multiple R-squared: 0.1633, Adjusted R-squared: 0.1623
## F-statistic: 155.8 on 1 and 798 DF, p-value: < 2.2e-16
Q5e)(2 points) Compare the R-squared values of model1 and model2. Did the transformation improve the explanatory power of the model?
Response to Q5e
No. The log-log transformation decreased R-squared from 0.1633 to 0.1458, so the explanatory power of the model went down.
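The two values can also be pulled side by side directly from the model objects:
c(model1 = summary(model1)$r.squared, model2 = summary(model2)$r.squared)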
Q5f) (3 points) Perform the same model diagnostics on model2 as you did on model1 in Q5a. Assess and interpret all model assumptions. Based on your interpretation of the model assumptions, is model2 a better fit than model1? A model is considered a good fit only if all the assumptions hold.
plot(model2, 1)
qqPlot(model2$residuals, envelope = .95)
## [1] 528 69
hist(residuals(model2))
Response to Q5f
Linearity: the trend line shows a bowed shape in the residuals, so the linearity assumption is violated.
Constant variance: the residuals are spread much less evenly than in the previous model, suggesting non-constant variance.
Normality: the QQ plot still shows tails, with an additional flagged outlier, and the histogram of residuals is now left-skewed.
Based on these observations, model2 is not a better fit than model1: the transformation introduced assumption violations rather than resolving them.
For this question, use testData_sales
Q6a)(3 points) Using testData_sales, predict the satisfaction score using model1.
Show the first 10 predictions along with their true values.
Calculate the mean squared prediction error.
Calculate average and standard deviation of the predictions of the satisfaction score using model1.
testModel1 <- predict(model1, newdata = testData_sales)
comparison <- data.frame(Test_Values = testModel1,Source_Values = testData_sales$Satisfaction_Score)
head(comparison, 10)
## Test_Values Source_Values
## 714 6.758070 9.465528
## 503 6.717227 9.766702
## 358 7.533249 9.053531
## 624 6.501717 7.836710
## 985 7.299200 7.046928
## 718 8.262041 10.000000
## 919 6.617318 6.162825
## 470 6.930907 8.958263
## 966 7.213435 8.347239
## 516 7.687925 9.015264
MSE = mean((comparison$Source_Values - comparison$Test_Values)^2)
MSE
## [1] 1.741135
Standard_deviation = sd(comparison$Test_Values) #standard deviation of the predictions
Standard_deviation
## [1] 0.6171612
Average_prediction = mean(comparison$Test_Values) #average of the predictions, as the prompt asks
Average_prediction
Standard deviation of the predictions = 0.6171612
Q6b)(2 points) Using the first row of testData_sales, predict the satisfaction score using model1. What is the 99% prediction interval (PI)? Provide an interpretation of your results.
oneRow <- testData_sales[1,]
oneRowPred <- predict(model1,newdata = oneRow, interval = "prediction",level = .99)
head(oneRowPred)
## fit lwr upr
## 714 6.75807 3.311727 10.20441
Response to Q6b
The 99% prediction interval is (3.311727, 10.20441): for a new observation with this Purchase_Amount, the model predicts with 99% confidence that the satisfaction score falls between ~3.3 and ~10.2. The upper bound exceeds the maximum possible score of 10, so the interval does not respect the bounded scale. The point prediction (6.75807) is also far from the actual value (~9.47), although the true value does fall inside the wide interval. Together these observations suggest that model1 may be a poor representation of the data as a whole.
Question 7 (2 points)
Research and explain the arguments that go into the predict() function in detail. Please include object, newdata, interval, type, se.fit, and level.
Response to Q7
object- the fitted model object used to generate predictions for newdata.
newdata- a data frame containing the observations to predict. It should have the same predictor columns as the data on which the model was trained.
interval- either "confidence" or "prediction". The output then includes the lower and upper bounds of the chosen interval alongside the point prediction.
type- the type of prediction to return (for lm, "response" or "terms"). The shape of the returned values differs depending on the value given.
se.fit- determines whether the prediction should include standard errors. If TRUE, the function returns a list with the predictions, the standard errors of the predictions, and the residual degrees of freedom.
level- the confidence level for the interval (e.g., 0.95), i.e., the degree of certainty that the interval covers the quantity being predicted.
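A sketch combining these arguments in one call, reusing model1 and testData_sales from earlier (the structure of the return value depends on se.fit):
p <- predict(model1, newdata = testData_sales,
interval = "prediction", #lower/upper bounds of a prediction interval
level = 0.99, #99% level for the interval
se.fit = TRUE) #also return standard errors of the fit
str(p, max.level = 1) #a list: fit (matrix with fit/lwr/upr), se.fit, df, residual.scale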
Question 8 (3 points)
In the pre-processing of the data we did: Sales_dataset$Product_Category = as.factor(Sales_dataset$Product_Category). Explain, using both code and descriptions, how you can tell what the baseline is and how you can change the baseline for Product_Category.
levels(Sales_dataset$Product_Category)
## [1] "Books" "Clothing" "Electronics" "Furniture"
## [5] "Home Appliances" "Toys"
#Sales_dataset$Product_Category <- relevel(Sales_dataset$Product_Category, ref = "Electronics") #kept commented out so the baseline is not actually changed here
Response to Q8
When a column is converted to a factor, its values are assigned to levels. By default, as.factor() sorts the levels alphabetically, and the first level is the baseline (reference level) that the other levels are compared against. You can find the levels with the "levels()" function; the first value listed is the baseline, so here it is "Books". You can change the baseline with "relevel()", assigning the result back to the column. In the code above, relevel() would make Electronics the new baseline.
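A sketch of the change in action; relevel() returns a new factor, so the result must be assigned back for the new baseline to take effect:
Sales_dataset$Product_Category <- relevel(Sales_dataset$Product_Category, ref = "Electronics")
levels(Sales_dataset$Product_Category) #"Electronics" is now listed first, i.e., it is the baseline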
The End