R Markdown

Sales and Customer Interaction Dataset

In this homework, you have been given a dataset that captures comprehensive details of 1,000 sales transactions made in 2024, focusing on various aspects of customer interactions and sales dynamics. It includes a mix of numeric and categorical variables, allowing for in-depth analysis.

Key features of the dataset include: Customer_ID, Product_Category, Purchase_Amount, Purchase_Date, Customer_Age, Customer_Gender, Store_Location, and Satisfaction_Score.

set.seed(100) #this seed has been set to 100 to ensure results are reproducible. DO NOT CHANGE THIS SEED

Sales_dataset = read.csv("large_sales_dataset.csv", header=TRUE) #reads data
Sales_dataset$Product_Category=as.factor(Sales_dataset$Product_Category)
Sales_dataset$Purchase_Date=as.Date(Sales_dataset$Purchase_Date, format="%m/%d/%Y")
Sales_dataset$Customer_Gender=as.factor(Sales_dataset$Customer_Gender)
Sales_dataset$Store_Location=as.factor(Sales_dataset$Store_Location)

head(Sales_dataset)
##   Customer_ID Product_Category Purchase_Amount Purchase_Date Customer_Age
## 1         203            Books         1121.89    2024-09-05           55
## 2         536            Books          236.11    2024-03-01           55
## 3         961      Electronics          756.51    2024-08-26           38
## 4         371             Toys           23.10    2024-12-26           65
## 5         207      Electronics         1638.27    2024-01-27           46
## 6         172         Clothing          261.26    2024-01-29           22
##   Customer_Gender Store_Location Satisfaction_Score
## 1          Female         Denver           7.924694
## 2            Male        Seattle           9.132629
## 3            Male        Houston           9.637018
## 4          Female        Chicago           5.519188
## 5            Male        Seattle          10.000000
## 6          Female  San Francisco           5.933145
#Dividing the dataset into training and testing datasets
testRows_sales = sample(nrow(Sales_dataset),0.2*nrow(Sales_dataset))
testData_sales = Sales_dataset[testRows_sales, ]
trainData_sales = Sales_dataset[-testRows_sales, ]
row.names(trainData_sales) <- NULL
head(trainData_sales)
##   Customer_ID Product_Category Purchase_Amount Purchase_Date Customer_Age
## 1         536            Books          236.11    2024-03-01           55
## 2         961      Electronics          756.51    2024-08-26           38
## 3         371             Toys           23.10    2024-12-26           65
## 4         207      Electronics         1638.27    2024-01-27           46
## 5         172         Clothing          261.26    2024-01-29           22
## 6         121  Home Appliances           41.76    2024-11-11           27
##   Customer_Gender Store_Location Satisfaction_Score
## 1            Male        Seattle           9.132629
## 2            Male        Houston           9.637018
## 3          Female        Chicago           5.519188
## 4            Male        Seattle          10.000000
## 5          Female  San Francisco           5.933145
## 6            Male         Denver           6.691315

Question 1: Data Analysis (8 points)

For this question, use the “Sales_dataset”

1a)(4 points) Output a table that has both the average and median Purchase_Amount grouped by Product_Category, Customer_Gender and Customer_Age. Show the last 15 rows of the table

Note: For age group, use the following bins

‘0-18’, ‘19-25’, ‘26-35’, ‘36-45’, ‘46-55’, ‘56-65’, ‘66-75’, ‘76-85’, ‘86-100’

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
AgegroupsSales <- Sales_dataset

AgegroupsSales <- AgegroupsSales %>%
  mutate(Age_Groups = cut(Customer_Age, #bin customer ages into the requested groups
                          breaks = c(0, 18, 25, 35, 45, 55, 65, 75, 85, 100),
                          labels = c('0-18', '19-25', '26-35', '36-45', '46-55', '56-65', '66-75', '76-85', '86-100')))

Q1Table <- AgegroupsSales %>%
  group_by(Product_Category,Age_Groups,Customer_Gender)%>%
  summarise(mean = mean(Purchase_Amount),median = median(Purchase_Amount))
## `summarise()` has grouped output by 'Product_Category', 'Age_Groups'. You can
## override using the `.groups` argument.
Q1Table
## # A tibble: 81 × 5
## # Groups:   Product_Category, Age_Groups [42]
##    Product_Category Age_Groups Customer_Gender   mean median
##    <fct>            <fct>      <fct>            <dbl>  <dbl>
##  1 Books            0-18       Male              45.0   45.0
##  2 Books            19-25      Female           785.   521. 
##  3 Books            19-25      Male             913.   888. 
##  4 Books            26-35      Female          1323.  1373. 
##  5 Books            26-35      Male            1043.  1163. 
##  6 Books            36-45      Female          1102.  1072. 
##  7 Books            36-45      Male            1035.  1087. 
##  8 Books            46-55      Female           915.   975. 
##  9 Books            46-55      Male            1076.  1213. 
## 10 Books            56-65      Female           951.  1055. 
## # ℹ 71 more rows
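
The prompt asks for the last 15 rows of the table; using the Q1Table object built above, they could be displayed with:

tail(Q1Table, 15) #show the last 15 rows of the grouped summary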

Q1b)(2 points) Which customer age group has the highest total purchase amount across all categories?

AgegroupsSales %>%
  group_by(Age_Groups) %>%
  summarise(HpurchaserAMT = sum(Purchase_Amount))
## # A tibble: 7 × 2
##   Age_Groups HpurchaserAMT
##   <fct>              <dbl>
## 1 0-18              16762.
## 2 19-25            110907.
## 3 26-35            189693.
## 4 36-45            187040.
## 5 46-55            193727.
## 6 56-65            198254.
## 7 66-75             96220.

The age group with the highest total purchase amount across all categories is 56-65, at roughly $198,254.

Q1c)(2 points) Which product category has the highest average satisfaction score, and in which location is it most frequently purchased?

HASS <- Sales_dataset %>%
  group_by(Product_Category)%>%
  summarise(HmeanSS = mean(Satisfaction_Score))

The product category with the highest average satisfaction score is Electronics.

FP <- Sales_dataset[Sales_dataset$Product_Category == "Electronics", ] %>%
  group_by(Product_Category, Store_Location) %>%
  summarise(LocSales = n()) #count of Electronics purchases per location
## `summarise()` has grouped output by 'Product_Category'. You can override using
## the `.groups` argument.

The location where Electronics is most frequently purchased is Boston.
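
Since neither HASS nor FP is printed above, both answers can also be extracted programmatically; a short sketch using those two tables:

HASS[which.max(HASS$HmeanSS), ] #category with the highest average satisfaction score
FP[which.max(FP$LocSales), ]    #location where Electronics is purchased most often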

Question 2: Time Series Analysis (8 points)

For this question, use “Sales_dataset”

Q2a) (3 points) Calculate the monthly average sales per product category and plot them.

library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library(ggplot2)
MonthSales <- Sales_dataset %>%
  mutate(Month = month(Purchase_Date))

MASpP <- MonthSales %>%
  group_by(Product_Category,Month)%>%
  summarise(MonthlySale = mean(Purchase_Amount))
## `summarise()` has grouped output by 'Product_Category'. You can override using
## the `.groups` argument.
ggplot(data = MASpP, mapping = aes(x = Month, y = MonthlySale, color = Product_Category)) +
  geom_line() +
  geom_point() +
  labs(title = "Monthly Average Sales per Category", x = "Month", y = "Sales", color = "Product Category")

Q2b)(5 points) Calculate the rolling means of average monthly sales (calculated in 2a) using a 3 month window. Use the center alignment to calculate the rolling means.

Now again plot the monthly average sales of each product category along with the rolling means.

i) What difference do you observe? How are the rolling means beneficial as compared to simple averages?

library(zoo)
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
RolledMean <- MASpP %>%
  mutate(TMRolled = rollmean(MonthlySale,3,fill = NA,align = "center"))

ggplot(data = RolledMean, mapping = aes(x = Month, y = TMRolled, color = Product_Category)) +
  geom_line() +
  geom_point() +
  labs(title = "Three Month Rolling Mean for Sales per Category", x = "Month", y = "Sales", color = "Product Category")
## Warning: Removed 12 rows containing missing values or values outside the scale range
## (`geom_line()`).
## Warning: Removed 12 rows containing missing values or values outside the scale range
## (`geom_point()`).
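
The prompt also asks for the raw monthly averages to be plotted along with the rolling means. A sketch of such an overlay, reusing the RolledMean table from above:

ggplot(RolledMean, aes(x = Month, color = Product_Category)) +
  geom_line(aes(y = MonthlySale), linetype = "dashed") + #raw monthly averages
  geom_line(aes(y = TMRolled)) +                         #3-month rolling means
  labs(title = "Monthly Average Sales and 3-Month Rolling Means",
       x = "Month", y = "Sales", color = "Product Category")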

Response to Q2b

The rolling-mean curves are much smoother and less erratic than the raw monthly averages. By averaging out month-to-month noise, rolling means make the underlying trend easier to see, which makes them more useful than simple monthly averages for anticipating future sales.

Question 3: Data Exploration (8 points)

For this question, use trainData_sales

Q3a)(4 points) Create a boxplot of the response variable versus the following predicting variables

i) Store_Location

ii) Product_Category

Explain the relationship between the response and the two variables based on the boxplots.

ggplot(trainData_sales, aes(Store_Location,Satisfaction_Score))+geom_boxplot()+labs(title = "Satisfaction Score VS Store Location",x = "Location",y= "Satisfaction Score")

ggplot(trainData_sales, aes(Product_Category,Satisfaction_Score))+geom_boxplot()+labs(title = "Satisfaction Score VS Product Category",x = "Category",y= "Satisfaction Score")

Response to Q3a

Satisfaction score does not appear to vary much by store location: the boxes overlap substantially, indicating that location is a weak predictor of satisfaction. There is much more variability across product categories, suggesting that satisfaction score is better explained by that variable. Electronics has a much higher median than the rest of the group, overlapping the other boxes substantially only in its lower quartile. Toys has the lowest median and the largest number of outliers; since most of those outliers are high satisfaction scores, there may be some additional reason for the category's low typical score.

Q3b)(4 points) Create scatterplots of the response variable against the following predictors:

i) Customer_Age

ii) Purchase_Amount

Describe the general trend of each plot.

Output the R^2 for each plot. Use the following R^2 cut-offs while explaining if it is a weak, moderate, or strong relationship.

R^2<=0.3 (weak)

0.3<R^2<0.7 (moderate)

R^2>=0.7 (strong)

ggplot(trainData_sales,aes(Customer_Age,Satisfaction_Score))+geom_point()+labs(title = "Satisfaction Score VS Customer Age",x = "Customer Age",y= "Satisfaction Score")

r_extraction <- summary(lm(Satisfaction_Score ~ Customer_Age, data = trainData_sales))

r_Squared = r_extraction$r.squared
r_Squared #output the R^2 for the Customer_Age plot
ggplot(trainData_sales,aes(Purchase_Amount,Satisfaction_Score))+geom_point()+labs(title = "Satisfaction Score VS Purchase Amount",x = "Purchase Amount",y= "Satisfaction Score")

r_extraction2 <- summary(lm(Satisfaction_Score ~ Purchase_Amount, data = trainData_sales))

rSquare2 <- r_extraction2$r.squared
rSquare2 #output the R^2 for the Purchase_Amount plot

Response to Q3b

Plot one shows little to no change in satisfaction score with age. In the second plot, there appears to be a positive correlation between satisfaction score and purchase amount.

The R^2 for plot one is very low (~0.001), indicating a weak relationship between the predictor and the response variable. For the second plot, R^2 (~0.163) is still in the weak range by the given cut-offs, but considerably stronger than in the first plot.

Question 4: Simple linear regression and ANOVA (12 points)

For this question, use trainData_sales

Q4a) (5 points) Create a simple linear regression model using the predictor “Purchase_Amount”. Call it model1. Display the summary.

i) What are the model parameters and their estimates?

ii) Interpret the coefficient of the predictor “Purchase_Amount” in the context of the problem.

iii) Find a 95% confidence interval for the coefficient of “Purchase_Amount”. Is the coefficient significant at this level?

model1 <- lm(Satisfaction_Score ~ Purchase_Amount,trainData_sales)

summary(model1)
## 
## Call:
## lm(formula = Satisfaction_Score ~ Purchase_Amount, data = trainData_sales)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.8394 -0.9163  0.0100  0.9824  3.4846 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     6.417e+00  9.558e-02   67.14   <2e-16 ***
## Purchase_Amount 1.046e-03  8.382e-05   12.48   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.333 on 798 degrees of freedom
## Multiple R-squared:  0.1633, Adjusted R-squared:  0.1623 
## F-statistic: 155.8 on 1 and 798 DF,  p-value: < 2.2e-16

Response to Q4a

i. The model parameter B0 (intercept) is estimated as 6.417, the parameter B1 (slope on Purchase_Amount) as 1.046e-03 ≈ 0.001046, and the residual standard error as 1.333.

ii. For each one-unit (one dollar) increase in Purchase_Amount, Satisfaction_Score is estimated to increase by about 0.001046.

iii. Using the confint() function below, the 95% confidence interval for the coefficient is (0.00088, 0.00121). Since this interval does not contain zero, the coefficient is significant at this level.

confint(model1, "Purchase_Amount",level = .95)
##                        2.5 %      97.5 %
## Purchase_Amount 0.0008816338 0.001210702

Q4b)(3 points) Is the coefficient of Purchase_Amount statistically significant?

i) State the null hypothesis

ii) State the alternative hypothesis

iii) Which test is used for testing the significance of coefficient?

Response to Q4b

i. H0: B1 = 0 (Purchase_Amount has no linear effect on Satisfaction_Score).

ii. HA: B1 != 0 (Purchase_Amount has a nonzero effect).

iii. The t-test on the coefficient is used; its p-value (< 2e-16) is far below any conventional α-level, so the coefficient of Purchase_Amount is statistically significant.

Q4c)( 4 points) Perform an ANOVA F-test on the means of Product_Category.

i) State the null and alternative hypotheses.

ii) Using an α-level of 0.01, do we reject the null hypothesis that the means are equal? Explain your conclusion.

iii) Which means are plausibly similar at the confidence level of 99%?

meanbyCategory <- trainData_sales %>%
  group_by(Product_Category)%>%
  summarise(meanSatisfaction = mean(Satisfaction_Score))

#Note: these t-tests produce 99% CIs for the proportion of transactions in each category
Trial <- t.test(trainData_sales$Product_Category == "Books",conf.level = .99)
print("Books")
## [1] "Books"
Trial$conf.int
## [1] 0.143752 0.213748
## attr(,"conf.level")
## [1] 0.99
Trial <- t.test(trainData_sales$Product_Category == "Clothing",conf.level = .99)
print("Clothing")
## [1] "Clothing"
Trial$conf.int
## [1] 0.1391401 0.2083599
## attr(,"conf.level")
## [1] 0.99
Trial <- t.test(trainData_sales$Product_Category == "Electronics",conf.level = .99)
print("Electronics")
## [1] "Electronics"
Trial$conf.int
## [1] 0.1391401 0.2083599
## attr(,"conf.level")
## [1] 0.99
Trial <- t.test(trainData_sales$Product_Category == "Furniture", conf.level = .99)
print("Furniture")
## [1] "Furniture"
Trial$conf.int
## [1] 0.1004054 0.1620946
## attr(,"conf.level")
## [1] 0.99
Trial <- t.test(trainData_sales$Product_Category == "Home Appliances",conf.level = .99)
print("Home Appliances")
## [1] "Home Appliances"
Trial$conf.int
## [1] 0.1391401 0.2083599
## attr(,"conf.level")
## [1] 0.99
Trial <- t.test(trainData_sales$Product_Category == "Toys", conf.level = .99)
print("Toys")
## [1] "Toys"
Trial$conf.int
## [1] 0.1345387 0.2029613
## attr(,"conf.level")
## [1] 0.99
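
Note that the intervals above are for the share of transactions in each category rather than the mean satisfaction scores themselves. A common way to compare all pairs of category means directly is Tukey's HSD on the full ANOVA model; a sketch:

fullAOV <- aov(Satisfaction_Score ~ Product_Category, data = trainData_sales)
TukeyHSD(fullAOV, conf.level = .99) #pairs whose 99% intervals contain 0 are plausibly similar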

iv) Compare the satisfaction score of the following pair of product categories:

Electronics-Clothing

EandC <- trainData_sales[(trainData_sales$Product_Category=="Electronics"|trainData_sales$Product_Category=="Clothing"),]

modelT <- lm(Satisfaction_Score ~ Product_Category, EandC)

forTukey <- aov(modelT)
TukeyHSD(forTukey,conf.level = .99)
##   Tukey multiple comparisons of means
##     99% family-wise confidence level
## 
## Fit: aov(formula = modelT)
## 
## $Product_Category
##                          diff      lwr      upr p adj
## Electronics-Clothing 1.885893 1.558771 2.213015     0

Response to Q4c

  1. H0: the mean satisfaction scores of all product categories are equal (M1 = M2 = ... = M6)

     HA: at least one category mean differs from the others

  2. We reject the null hypothesis because the p-value of the F-test (< 2.2e-16) is less than the α-level of 0.01

    modelA <- lm(Satisfaction_Score ~ Product_Category, trainData_sales)
    
    anova(modelA)
    ## Analysis of Variance Table
    ## 
    ## Response: Satisfaction_Score
    ##                   Df Sum Sq Mean Sq F value    Pr(>F)    
    ## Product_Category   5 732.94 146.588  121.09 < 2.2e-16 ***
    ## Residuals        794 961.19   1.211                      
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    print(.01 > 2.2e-16)
    ## [1] TRUE

Response to Q4ciii

At the 99% confidence level, the interval for each category is wide enough that all of the intervals overlap, so the categories could all plausibly be similar.

Response to Q4civ

A Tukey test on the two product categories shows a significant difference in means: the adjusted p-value is essentially zero, and the 99% confidence interval for the difference (1.56, 2.21) does not include zero. There is therefore a statistically significant difference in mean satisfaction between Electronics and Clothing.

Question 5: Model Diagnostics (14 points)

Q5a)(4 points) Perform the following model diagnostics on model1 created in Q4a.

i) Check for linearity assumption

ii) Check for constant variance

iii) Check for normality

Note: Both a histogram and a normal QQ plot with a pointwise confidence envelope must be plotted (tip: qqPlot() from the car package can generate a pointwise confidence envelope).

Explain your conclusion

#Linearity: no pattern in the residual plot, so the assumption holds
#Constant variance: the assumption holds because the residual spread is roughly constant
plot(model1, 1)

# Normality: tails on both ends of the plot may indicate a violation of normality
library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
qqPlot(model1$residuals, envelope = .95)

## [1] 528 667
#Bell shaped curve suggests normality (Looks like a normal distribution)
hist(residuals(model1))

Response to Q5a

Linearity assumption- There is no obvious pattern in the residuals, and the fitted line runs nearly horizontal at 0.

Constant variance- The residuals are evenly spread around a mean of nearly zero, indicating constant variance.

Normality- The QQ plot has tails on both ends of the line, which may indicate a violation of normality. However, the histogram of the residuals closely resembles a normal distribution. Taken together, these indicate adherence to the normality assumption.
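
As an additional numeric check of normality (not required by the prompt), a Shapiro-Wilk test on the residuals could complement the plots:

shapiro.test(residuals(model1)) #null hypothesis: the residuals are normally distributed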

Q5b)(1 point) Based on your conclusion in Q5a, would you propose any transformation of the predictor or response variable? Explain with reasoning.

Response to Q5b

I would not recommend a transformation, because the current plots do not show violations of the assumptions.

Q5c)(2 points) Use a Box-Cox transformation (boxCox() in the car package or boxcox() in the MASS package) to find the optimal λ value rounded to the nearest half integer. What transformation of the response, if any, does it suggest to perform?

boxcox <- boxCox(model1) #plots the profile log-likelihood over a grid of lambda values

#extract the lambda that maximizes the log-likelihood rather than printing the full grid
optimal_lambda <- boxcox$x[which.max(boxcox$y)]
optimal_lambda
## [1] 1.191919

Response to Q5c

The optimal λ is roughly 1.19, which rounds to 1 (the nearest half integer). Since λ ≈ 1 and the confidence interval for λ includes 1, no transformation of the response is suggested.

Q5d)(2 points) Create a linear regression model, named model2, that uses the log transformed Satisfaction_Score as the response, and the log transformed Purchase_Amount as the predictor. Display the summary.

model2 <- lm(log(Satisfaction_Score) ~ log(Purchase_Amount),trainData_sales)
summary(model2)
## 
## Call:
## lm(formula = log(Satisfaction_Score) ~ log(Purchase_Amount), 
##     data = trainData_sales)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.8521 -0.1223  0.0194  0.1447  0.4992 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          1.452993   0.046363   31.34   <2e-16 ***
## log(Purchase_Amount) 0.081082   0.006946   11.67   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1919 on 798 degrees of freedom
## Multiple R-squared:  0.1458, Adjusted R-squared:  0.1448 
## F-statistic: 136.3 on 1 and 798 DF,  p-value: < 2.2e-16
summary(model1)
## 
## Call:
## lm(formula = Satisfaction_Score ~ Purchase_Amount, data = trainData_sales)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.8394 -0.9163  0.0100  0.9824  3.4846 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     6.417e+00  9.558e-02   67.14   <2e-16 ***
## Purchase_Amount 1.046e-03  8.382e-05   12.48   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.333 on 798 degrees of freedom
## Multiple R-squared:  0.1633, Adjusted R-squared:  0.1623 
## F-statistic: 155.8 on 1 and 798 DF,  p-value: < 2.2e-16

Q5e)(2 points) Compare the R-squared values of model1 and model2. Did the transformation improve the explanatory power of the model?

Response to Q5e

No. The log-log transformation decreased R-squared from 0.1633 (model1) to 0.1458 (model2), so it did not improve the explanatory power of the model.

Q5f) (3 points) Perform the same model diagnostics on model2 as you did on model1 in Q5a. Assess and interpret all model assumptions. Based on your interpretation of the model assumptions, is model2 a better fit than model1? A model is considered a good fit only if all the assumptions hold.

plot(model2, 1) #residuals vs fitted: check linearity and constant variance

qqPlot(model2$residuals, envelope = .95) #normal QQ plot with a pointwise confidence envelope

## [1] 528  69
hist(residuals(model2)) #histogram of residuals for a second look at normality

Response to Q5f

Linearity- The trend line shows a bowed shape in the residuals, so the linearity assumption is violated.

Constant variance- The residuals are much more loosely grouped than in the previous model.

Normality- The QQ plot still has tails, plus an additional outlier, and the histogram is now left-skewed.

Based on my observations, model2 is not a better fit for the data because of the above issues.

Question 6: Prediction (5 points)

For this question, use testData_sales

Q6a)(3 points) Using testData_sales, predict the satisfaction score using model1.

  1. Show the first 10 predictions along with their true values.

  2. Calculate the mean squared prediction error.

  3. Calculate the average and standard deviation of the predictions of the satisfaction score using model1.

testModel1 <- predict(model1, newdata = testData_sales)

comparison <- data.frame(Test_Values = testModel1,Source_Values = testData_sales$Satisfaction_Score)

head(comparison, 10)
##     Test_Values Source_Values
## 714    6.758070      9.465528
## 503    6.717227      9.766702
## 358    7.533249      9.053531
## 624    6.501717      7.836710
## 985    7.299200      7.046928
## 718    8.262041     10.000000
## 919    6.617318      6.162825
## 470    6.930907      8.958263
## 966    7.213435      8.347239
## 516    7.687925      9.015264
MSE = mean((comparison$Source_Values - comparison$Test_Values)^2)
MSE
## [1] 1.741135
Standard_deviation = sd(comparison$Test_Values) #standard deviation of the predictions

deviations = abs(comparison$Test_Values - mean(comparison$Test_Values))
AVG_deviation = mean(deviations) #mean absolute deviation of the predictions
AVG_deviation
## [1] 0.5440862

Standard deviation = .6171612

Average deviation = .5440862
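
Part 3 of the prompt asks for the average of the predictions themselves; using the comparison data frame and Standard_deviation computed above, that would be:

mean(comparison$Test_Values) #average of the model1 predictions
Standard_deviation           #standard deviation of the predictions (0.6171612, as reported above)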

Q6b)(2 points) Using the first row of testData_sales, predict the satisfaction score using model1. What is the 99% prediction interval (PI)? Provide an interpretation of your results.

oneRow <- testData_sales[1,]

oneRowPred <- predict(model1,newdata = oneRow, interval = "prediction",level = .99)

head(oneRowPred)
##         fit      lwr      upr
## 714 6.75807 3.311727 10.20441

Response to Q6b

The 99% prediction interval is (3.311727, 10.20441): the model expects a new observation with this purchase amount to fall between ~3.3 and ~10.2 with 99% probability. This is somewhat concerning given that the maximum reported satisfaction score is 10. Additionally, the predicted value for this row (6.75807) is far from the true value of ~9.47. These observations suggest the model may be a poor representation of the data as a whole.

Question 7 (2 points)

(2 points) Research and explain the arguments that go into the predict function in detail. Please include object, newdata, interval, type, se.fit, and level.

Response to Q7

object- the fitted model object (for example, the result of lm()) that is used to generate predictions for newdata.

newdata- a data frame containing the observations to predict. It must have the same predictor columns as the data the model was trained on; if omitted, the fitted values for the training data are returned.

interval- takes "confidence" or "prediction"; the output then includes the lower and upper bounds of the chosen interval alongside the point estimate.

type- determines the type (scale) of prediction; the returned values differ depending on the value given (for lm models, "response" for predictions or "terms" for per-term contributions).

se.fit- determines whether the prediction should include standard errors. If TRUE, predict() returns a list with the predictions, the standard errors of the predictions, and the residual degrees of freedom.

level- the confidence (or prediction) level for the interval, i.e., the degree of certainty that the interval covers the target quantity (default 0.95).
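
As an illustration, reusing model1 and testData_sales from above, these arguments might be combined as follows (the exact output shape depends on the options chosen):

p <- predict(model1,                   #object: the fitted lm model
             newdata = testData_sales, #new observations with the same predictor columns
             interval = "prediction",  #also return lower/upper prediction bounds
             se.fit = TRUE,            #also return standard errors of the fit
             level = 0.99)             #99% intervals
str(p) #a list containing the fit matrix, se.fit, df, and residual.scale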

Question 8: Changing the Baseline (3 points)

(3 points) In the pre-processing of the data we did: Sales_dataset$Product_Category = as.factor(Sales_dataset$Product_Category) , explain ways using both code and descriptions that you can tell what the baseline is and how you can change the baseline for Product_Category.

levels(Sales_dataset$Product_Category)
## [1] "Books"           "Clothing"        "Electronics"     "Furniture"      
## [5] "Home Appliances" "Toys"
#relevel() returns a new factor, so assign the result back to change the baseline:
#Sales_dataset$Product_Category <- relevel(Sales_dataset$Product_Category, ref = "Electronics")

Response to Q8

When a column is converted to a factor, its values are assigned to levels. By default, R orders the levels alphabetically, and the first level is the baseline that the other levels are compared against. You can find the levels using the levels() function; the first value listed is the baseline (here, "Books"). You can change the baseline with relevel(), remembering to assign the result back to the column. In the code above, relevel() would make Electronics the new baseline.
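
A brief sketch pulling these together (the explicit level order in the factor() alternative is a choice for illustration, not a requirement):

levels(Sales_dataset$Product_Category)[1] #the first level is the current baseline ("Books")
#make Electronics the baseline; relevel() returns a new factor, so assign it back
Sales_dataset$Product_Category <- relevel(Sales_dataset$Product_Category, ref = "Electronics")
#an equivalent alternative: rebuild the factor with an explicit level order
#Sales_dataset$Product_Category <- factor(Sales_dataset$Product_Category,
#                                         levels = c("Electronics", "Books", "Clothing",
#                                                    "Furniture", "Home Appliances", "Toys"))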

The End