Note that the rmarkdown package does not need to be explicitly installed or loaded here, as RStudio automatically does both when needed.
When you need to use a package, please just use the library command to call the package into R to be used. There is no need to include the install.packages() command in your answers. You can assume that I have already installed the package.
Put your answers to the questions after "Answer:". To start a new line, use the backslash symbol.
The dataset is called salesData.csv. It contains the sales data of a department store chain to its loyalty club members. The columns are as follows:
Please put all your commands between ```{r} and ``` below.
Before answering the following questions, please delete three rows from the dataset depending on the last two digits of your student id. If your student id ends with 12, then delete cases 13, 14 and 15 from the dataset. Read the dataset into R, and enter the commands to remove the three cases.
mydata <- read.csv("salesData.csv")
mydata <- mydata[-c(66,67,68),]
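The row-removal rule can be sketched as a small helper. This assumes the rule generalizes from the single example given (an ID ending in 12 removes cases 13 to 15, so last two digits d remove rows d + 1 to d + 3). `rows_to_drop` and the toy data frame are hypothetical names used only for illustration.

```r
# Hypothetical helper: rows to delete for a student ID ending in `last_two`.
# Assumes the pattern from the example (12 -> 13, 14, 15) holds for any ID.
rows_to_drop <- function(last_two) {
  as.integer(last_two) + 1:3
}

# Toy data frame standing in for salesData.csv so the sketch runs on its own.
toy <- data.frame(id = 1:100, value = seq(10, 1000, by = 10))

drop <- rows_to_drop(65)      # an ID ending in 65 gives rows 66, 67, 68
toy_trimmed <- toy[-drop, ]   # negative indexing removes those rows

nrow(toy)          # rows before removal
nrow(toy_trimmed)  # rows after removal
```

Negative indexing with a vector of row positions is the same mechanism used in the command above; the helper only makes the mapping from ID digits to row numbers explicit.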
Use the summary command to get some summary statistics that describe the dataset, and then display the first 10 cases in the dataset. What is the mean meanShoppingCartValue? What is the median nPurch value? (1 point)
summary(mydata)
## id nItems mostFreqStore mostFreqCat
## Min. : 1 Min. : 1.0 Length:5119 Length:5119
## 1st Qu.:1384 1st Qu.: 83.0 Class :character Class :character
## Median :2744 Median : 157.0 Mode :character Mode :character
## Mean :2742 Mean : 186.2
## 3rd Qu.:4100 3rd Qu.: 258.0
## Max. :5455 Max. :1469.0
## nCats preferredBrand nBrands nPurch
## Min. : 1.00 Length:5119 Min. : 1.0 Min. : 1.00
## 1st Qu.:27.00 Class :character 1st Qu.: 45.0 1st Qu.:11.00
## Median :37.00 Mode :character Median : 76.0 Median :17.00
## Mean :36.35 Mean : 81.9 Mean :19.87
## 3rd Qu.:47.00 3rd Qu.:111.0 3rd Qu.:26.00
## Max. :73.00 Max. :517.0 Max. :88.00
## salesLast3Mon salesThisMon daysSinceLastPurch meanItemPrice
## Min. : 189 Min. : 0.0 Min. : 1.000 Min. : 1.867
## 1st Qu.:1067 1st Qu.: 480.9 1st Qu.: 2.000 1st Qu.: 6.019
## Median :1332 Median : 607.8 Median : 3.000 Median : 8.530
## Mean :1322 Mean : 605.8 Mean : 6.278 Mean : 12.257
## 3rd Qu.:1573 3rd Qu.: 731.3 3rd Qu.: 7.000 3rd Qu.: 13.191
## Max. :2791 Max. :1362.8 Max. :89.000 Max. :377.900
## meanShoppingCartValue customerDuration coupon gender
## Min. : 17.35 Min. : 0.0 Length:5119 Length:5119
## 1st Qu.: 54.46 1st Qu.: 547.0 Class :character Class :character
## Median : 76.57 Median : 649.0 Mode :character Mode :character
## Mean : 91.80 Mean : 644.1
## 3rd Qu.:110.61 3rd Qu.: 745.0
## Max. :914.04 Max. :1355.0
## income age
## Length:5119 Min. :17.00
## Class :character 1st Qu.:28.00
## Mode :character Median :38.00
## Mean :38.01
## 3rd Qu.:48.00
## Max. :65.00
print(mydata[1:10,])
## id nItems mostFreqStore mostFreqCat nCats preferredBrand nBrands nPurch
## 1 1 1469 S10 Alcohol 72 Veina 517 82
## 2 2 1463 S10 Alcohol 73 Veina 482 88
## 3 3 262 S2 Shoes 55 Bo 126 56
## 4 4 293 S2 Bakery 50 Veina 108 43
## 5 5 108 S2 Beverages 32 Bo 79 18
## 6 6 216 S1 Alcohol 41 Bo 98 35
## 7 7 174 S3 Packaged food 36 Bo 78 34
## 8 8 122 S9 Shoes 31 Bo 62 12
## 9 9 204 S6 Bakery 41 Bo 99 26
## 10 10 308 S9 Alcohol 52 Bo 103 33
## salesLast3Mon salesThisMon daysSinceLastPurch meanItemPrice
## 1 2741.97 1283.87 1 1.866555
## 2 2790.58 1242.60 1 1.907437
## 3 1529.55 682.57 1 5.837977
## 4 1765.81 730.23 1 6.026655
## 5 1180.00 552.54 12 10.925926
## 6 1345.29 662.52 2 6.228194
## 7 1338.81 621.46 2 7.694310
## 8 1256.96 367.07 4 10.302951
## 9 1963.60 780.78 14 9.625490
## 10 1584.59 695.52 1 5.144773
## meanShoppingCartValue customerDuration coupon gender income age
## 1 33.43866 821 No male Low Income 47
## 2 31.71114 657 Yes male Low Income 45
## 3 27.31339 548 Yes female Low Income 46
## 4 41.06535 596 Yes male Low Income 45
## 5 65.55556 603 Yes female Medium Income 29
## 6 38.43686 673 Yes female Low Income 32
## 7 39.37676 612 Yes female Low Income 29
## 8 104.74667 517 Yes female Medium Income 18
## 9 75.52308 709 No female Medium Income 45
## 10 48.01788 480 No female Low Income 35
mean(mydata$meanShoppingCartValue)
## [1] 91.79511
median(mydata$nPurch)
## [1] 17
Answer: The mean meanShoppingCartValue is 91.79511. The median nPurch value is 17.
Use the tidyverse package for this question. First, create a copy of mydata and name the copy mydata1. Then change mydata1 into a tibble. Use mydata1 as input to a pipe operator to create a new column called hicustomerDuration that divides the customers into two groups based on the median of customerDuration: hicustomerDuration = 1 if customerDuration > median(customerDuration); otherwise hicustomerDuration = 0. Next, select the customers who used a coupon, and for this selected group of customers, find out if there is any gender difference in mean customerDuration. (3 points)
Answer:
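One possible shape for these steps is sketched below. It is not a submitted answer: `mydata_toy` is a hypothetical six-row stand-in with invented values so that the code runs on its own, and `coupon_by_gender` and `meanDuration` are illustrative names. With the real data, mydata1 would be created from mydata instead.

```r
library(tidyverse)

# Hypothetical stand-in for mydata (values invented for illustration only).
mydata_toy <- data.frame(
  customerDuration = c(500, 620, 700, 480, 810, 655),
  coupon = c("Yes", "No", "Yes", "Yes", "No", "Yes"),
  gender = c("male", "female", "female", "male", "male", "female")
)

# Copy the data and convert the copy to a tibble.
mydata1 <- as_tibble(mydata_toy)

# Flag customers whose duration is above the median (1) or not (0).
mydata1 <- mydata1 %>%
  mutate(hicustomerDuration = if_else(customerDuration > median(customerDuration), 1, 0))

# Keep coupon users only, then compare mean customerDuration by gender.
coupon_by_gender <- mydata1 %>%
  filter(coupon == "Yes") %>%
  group_by(gender) %>%
  summarise(meanDuration = mean(customerDuration), n = n())

coupon_by_gender
```

Comparing the two group means shows the direction of any gender difference; whether it is statistically reliable would need a test such as `t.test()` on the filtered data.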
Use ggplot2 to draw a scatterplot with salesThisMon on the y-axis and daysSinceLastPurch on the x-axis, set the point size to 2, and then overlay the graph with a regression line in red. Label the x-axis as “Days Since Last Purchase” and the y-axis as “Sales This Month”. What is the relationship between the two variables? (2 points)
Answer:
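A minimal sketch of the requested plot follows. The data frame `toy` is simulated purely so the code runs on its own, so the slope it shows says nothing about the real dataset; with the actual data, mydata would replace `toy`.

```r
library(ggplot2)

# Simulated stand-in data (hypothetical values, not taken from salesData.csv).
set.seed(1)
toy <- data.frame(daysSinceLastPurch = sample(1:30, 100, replace = TRUE))
toy$salesThisMon <- 700 - 5 * toy$daysSinceLastPurch + rnorm(100, sd = 50)

p <- ggplot(toy, aes(x = daysSinceLastPurch, y = salesThisMon)) +
  geom_point(size = 2) +                                          # scatterplot, point size 2
  geom_smooth(method = "lm", formula = y ~ x, colour = "red") +   # linear regression line in red
  labs(x = "Days Since Last Purchase", y = "Sales This Month")

p
```

The sign of the fitted line's slope indicates the direction of the relationship, and `cor()` on the two columns gives its strength.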
Is there any income difference in the use of coupons? Is the effect practically significant? (2 points)
library(survey)
## Loading required package: grid
## Loading required package: Matrix
## Loading required package: survival
##
## Attaching package: 'survey'
## The following object is masked from 'package:graphics':
##
## dotchart
dsrs <- svydesign(id=~1, data=mydata)
## Warning in svydesign.default(id = ~1, data = mydata): No weights or
## probabilities supplied, assuming equal probability
summary(dsrs)
## Independent Sampling design (with replacement)
## svydesign(id = ~1, data = mydata)
## Probabilities:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1 1 1 1 1 1
## Data variables:
## [1] "id" "nItems" "mostFreqStore"
## [4] "mostFreqCat" "nCats" "preferredBrand"
## [7] "nBrands" "nPurch" "salesLast3Mon"
## [10] "salesThisMon" "daysSinceLastPurch" "meanItemPrice"
## [13] "meanShoppingCartValue" "customerDuration" "coupon"
## [16] "gender" "income" "age"
result <- svychisq(~income+coupon, dsrs, statistic="Chisq")
result$observed
## coupon
## income No Yes
## High Income 342 654
## Low Income 524 1034
## Medium Income 847 1718
sum(result$observed)
## [1] 5119
result$expected
## coupon
## income No Yes
## High Income 333.2971 662.7029
## Low Income 521.3624 1036.6376
## Medium Income 858.3405 1706.6595
result$stdres
## coupon
## income No Yes
## High Income 0.6511828 -0.6511828
## Low Income 0.1697931 -0.1697931
## Medium Income -0.6718226 0.6718226
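Answer: There is no evidence of an income difference in the use of coupons. The observed counts are very close to the expected counts, and every standardized residual is below 1 in absolute value (the largest is about 0.67), well inside the usual ±2 threshold. The effect is also not practically significant: roughly 66% to 67% of customers in each income group used a coupon.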
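Practical significance for a contingency table is often summarized with Cramér's V. The sketch below recomputes it from the observed counts printed above. Note that it uses a plain Pearson `chisq.test()` from base R rather than `svychisq()`, and `cramers_v` is an illustrative variable name.

```r
# Observed counts copied from result$observed above.
obs <- matrix(
  c(342,  654,
    524, 1034,
    847, 1718),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    income = c("High Income", "Low Income", "Medium Income"),
    coupon = c("No", "Yes")
  )
)

# Plain Pearson chi-squared test (not the survey-adjusted version).
test <- chisq.test(obs)

# Cramer's V = sqrt(chi-squared / (n * (min(rows, cols) - 1))).
n <- sum(obs)
cramers_v <- sqrt(unname(test$statistic) / (n * (min(dim(obs)) - 1)))

test$p.value                  # p-value of the test
cramers_v                     # effect size between 0 and 1
prop.table(obs, margin = 1)   # share of coupon users within each income group
```

By common rules of thumb, a Cramér's V below about 0.1 is read as a negligible association.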
Use the caret package to build a logistic regression machine learning model using coupon as the target variable and meanShoppingCartValue, salesThisMon, gender and income as the feature variables. Please randomly split the dataset into a training dataset and a testing dataset. The training dataset should contain 80% of the data. Use min-max normalization on meanShoppingCartValue and salesThisMon before you train the model. Name the transformed variables meanShoppingCartValue1 and salesThisMon1. Write down the logistic regression equation. How accurate is your machine learning model according to the confusion matrix? Please print the confusion matrix and accuracy value to the screen. (4 points)
Answer:
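The workflow described here could look roughly like the sketch below. It runs on simulated data (`toy`, random labels and values invented for illustration), so the accuracy it prints is meaningless for the real dataset. `min_max`, `train_set`, `test_set` and the seed are assumptions, and `createDataPartition()` performs a stratified random split, which is one of several ways to get an 80/20 split.

```r
library(caret)

# Simulated stand-in data (hypothetical; not from salesData.csv).
set.seed(123)
n <- 200
toy <- data.frame(
  coupon = factor(sample(c("No", "Yes"), n, replace = TRUE)),
  meanShoppingCartValue = runif(n, 20, 300),
  salesThisMon = runif(n, 0, 1300),
  gender = factor(sample(c("female", "male"), n, replace = TRUE)),
  income = factor(sample(c("Low Income", "Medium Income", "High Income"), n, replace = TRUE))
)

# Min-max normalization rescales a numeric vector to the range [0, 1].
min_max <- function(x) (x - min(x)) / (max(x) - min(x))
toy$meanShoppingCartValue1 <- min_max(toy$meanShoppingCartValue)
toy$salesThisMon1 <- min_max(toy$salesThisMon)

# 80% training / 20% testing split, stratified on the target.
idx <- createDataPartition(toy$coupon, p = 0.8, list = FALSE)
train_set <- toy[idx, ]
test_set <- toy[-idx, ]

# Logistic regression fitted through caret's glm method.
fit <- train(
  coupon ~ meanShoppingCartValue1 + salesThisMon1 + gender + income,
  data = train_set, method = "glm", family = "binomial"
)

# Coefficients for writing out the logistic regression equation.
coef(fit$finalModel)

# Confusion matrix and accuracy on the held-out test set.
pred <- predict(fit, newdata = test_set)
cm <- confusionMatrix(pred, test_set$coupon)
print(cm$table)
print(cm$overall["Accuracy"])
```

The coefficients from `coef(fit$finalModel)` are on the log-odds scale, so the equation takes the form log(p / (1 - p)) = intercept + sum of coefficient × feature, where p is the probability of the second factor level ("Yes" here).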
Build a regression model using meanShoppingCartValue as the dependent variable and customerDuration, daysSinceLastPurch, gender and income as the independent variables. Write down the regression equation. Which variables are significant at the 5% level? Interpret the R-squared value. What is the predicted meanShoppingCartValue of a customer with the following characteristics? customerDuration = 365, daysSinceLastPurch = 10, gender = “female”, income = “Low Income” (3 points)
modell <- lm(meanShoppingCartValue ~ customerDuration+daysSinceLastPurch+gender+income, mydata)
summary(modell)
##
## Call:
## lm(formula = meanShoppingCartValue ~ customerDuration + daysSinceLastPurch +
## gender + income, data = mydata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -134.21 -14.30 -2.37 9.94 711.04
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.623e+02 2.623e+00 61.891 <2e-16 ***
## customerDuration -3.644e-03 3.367e-03 -1.082 0.2793
## daysSinceLastPurch 1.655e+00 6.269e-02 26.405 <2e-16 ***
## gendermale 2.140e+00 1.068e+00 2.004 0.0452 *
## incomeLow Income -1.206e+02 1.617e+00 -74.596 <2e-16 ***
## incomeMedium Income -8.512e+01 1.451e+00 -58.674 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 36.51 on 5113 degrees of freedom
## Multiple R-squared: 0.652, Adjusted R-squared: 0.6516
## F-statistic: 1916 on 5 and 5113 DF, p-value: < 2.2e-16
predict(modell, data.frame(customerDuration= 365, daysSinceLastPurch=10, gender="female", income="Low Income"))
## 1
## 56.93051
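Answer: Regression equation: meanShoppingCartValue = 162.3 − 0.003644 × customerDuration + 1.655 × daysSinceLastPurch + 2.140 × gendermale − 120.6 × incomeLow Income − 85.12 × incomeMedium Income. At the 5% level, daysSinceLastPurch, gendermale, incomeLow Income and incomeMedium Income are significant (all p < 0.05); customerDuration is not (p = 0.2793). The multiple R-squared is 0.652 (adjusted 0.6516), so the model explains about 65% of the variance in meanShoppingCartValue. The predicted meanShoppingCartValue is 56.93051.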
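As a sanity check, the prediction can be reproduced by hand from the coefficient table. The sketch below uses the rounded estimates printed by summary(modell), so it agrees with predict() only to about two decimal places; `b`, `x` and `manual_pred` are illustrative names.

```r
# Coefficients copied (rounded) from the summary(modell) output.
b <- c(
  intercept          = 162.3,
  customerDuration   = -0.003644,
  daysSinceLastPurch = 1.655,
  gendermale         = 2.140,
  incomeLow          = -120.6,
  incomeMedium       = -85.12
)

# Customer profile: 365 days as a customer, 10 days since last purchase,
# female (gendermale = 0), Low Income (incomeLow = 1, incomeMedium = 0).
x <- c(1, 365, 10, 0, 1, 0)

# Linear prediction: intercept plus the sum of coefficient * value.
manual_pred <- sum(b * x)
manual_pred
```

Female and High Income are the reference levels, which is why they appear as zeros in the dummy variables rather than as their own coefficients.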