R Assignment

Note that the rmarkdown package does not need to be explicitly installed or loaded here, as RStudio automatically does both when needed.

When you need to use a package, please just use the library command to call the package into R to be used. There is no need to include the install.packages() command in your answers. You can assume that I have already installed the package.

Put your answers to the questions after Answer:. If you want to start a new line, use the backward slash symbol.

Course code: MKTG 3010D

Your name: Tam Lai Yan

Your student id: 1155125618

The dataset is called salesData.csv. It contains the sales data of a department store chain to its loyalty club members. The columns are as follows:

  • id = customer’s loyalty card number
  • nItems = number of items bought by the customer in the past
  • mostFreqStore = most frequently visited branch store
  • mostFreqCat = most frequently bought product category
  • nCats = number of product categories bought in the past
  • preferredBrand = most preferred brand
  • nBrands = number of brands bought in the past
  • nPurch = number of purchases made in the past
  • salesLast3Mon = sales to the customer in the last three months
  • salesThisMon = sales to the customer in the current month
  • daysSinceLastPurch = days since last purchase
  • meanItemPrice = mean item price of previous purchases
  • meanShoppingCartValue = mean shopping cart value
  • customerDuration = number of days the customer has been with the company
  • coupon = whether or not the customer has used a discount coupon in the last purchase
  • gender
  • income
  • age

Please put all your commands in between ```{r} and ``` below.

Before answering the following questions, please delete three rows from the dataset depending on the last two digits of your student id. If your student id ends with 12, then delete cases 13, 14 and 15 from the dataset. Read the dataset into R, and enter the commands to remove the three cases.

mydata <- read.csv("salesData.csv")
mydata <- mydata[-c(19,20,21),]
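
The row numbers above follow from the student id ending in 18. A minimal sketch of that rule, assuming it generalizes from the single example given (id ending in 12 means cases 13, 14 and 15); `rows_to_drop` is a hypothetical helper, not part of the assignment:

```r
# Hypothetical helper: derive the three case numbers to delete from the
# last two digits of a student id (assumed rule: digits + 1, + 2, + 3).
rows_to_drop <- function(student_id) {
  n <- nchar(student_id)
  last_two <- as.integer(substr(student_id, n - 1, n))
  last_two + 1:3
}

rows_to_drop("1155125612")  # the example in the instructions: 13 14 15
rows_to_drop("1155125618")  # this submission: 19 20 21
```

The result can be passed straight to negative indexing, as in `mydata[-rows_to_drop("1155125618"), ]`.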

Question 1

Use the summary command to get some summary statistics that describe the dataset, and then display the first 10 cases in the dataset. What is the mean meanShoppingCartValue? What is the median nPurch value? (1 point)

summary(mydata)
##        id           nItems       mostFreqStore      mostFreqCat       
##  Min.   :   1   Min.   :   1.0   Length:5119        Length:5119       
##  1st Qu.:1384   1st Qu.:  83.0   Class :character   Class :character  
##  Median :2744   Median : 157.0   Mode  :character   Mode  :character  
##  Mean   :2742   Mean   : 186.2                                        
##  3rd Qu.:4100   3rd Qu.: 258.0                                        
##  Max.   :5455   Max.   :1469.0                                        
##      nCats       preferredBrand        nBrands           nPurch     
##  Min.   : 1.00   Length:5119        Min.   :  1.00   Min.   : 1.00  
##  1st Qu.:27.00   Class :character   1st Qu.: 45.00   1st Qu.:11.00  
##  Median :37.00   Mode  :character   Median : 75.00   Median :17.00  
##  Mean   :36.35                      Mean   : 81.89   Mean   :19.88  
##  3rd Qu.:47.00                      3rd Qu.:111.00   3rd Qu.:26.00  
##  Max.   :73.00                      Max.   :517.00   Max.   :88.00  
##  salesLast3Mon   salesThisMon    daysSinceLastPurch meanItemPrice    
##  Min.   : 189   Min.   :   0.0   Min.   : 1.00      Min.   :  1.867  
##  1st Qu.:1067   1st Qu.: 480.9   1st Qu.: 2.00      1st Qu.:  6.019  
##  Median :1332   Median : 607.8   Median : 3.00      Median :  8.533  
##  Mean   :1322   Mean   : 605.7   Mean   : 6.28      Mean   : 12.258  
##  3rd Qu.:1573   3rd Qu.: 731.3   3rd Qu.: 7.00      3rd Qu.: 13.191  
##  Max.   :2791   Max.   :1362.8   Max.   :89.00      Max.   :377.900  
##  meanShoppingCartValue customerDuration    coupon             gender         
##  Min.   : 17.35        Min.   :   0.0   Length:5119        Length:5119       
##  1st Qu.: 54.44        1st Qu.: 546.5   Class :character   Class :character  
##  Median : 76.57        Median : 649.0   Mode  :character   Mode  :character  
##  Mean   : 91.82        Mean   : 644.0                                        
##  3rd Qu.:110.62        3rd Qu.: 745.0                                        
##  Max.   :914.04        Max.   :1355.0                                        
##     income               age    
##  Length:5119        Min.   :17  
##  Class :character   1st Qu.:28  
##  Mode  :character   Median :38  
##                     Mean   :38  
##                     3rd Qu.:48  
##                     Max.   :65

print(mydata[1:10,]) 
##    id nItems mostFreqStore   mostFreqCat nCats preferredBrand nBrands nPurch
## 1   1   1469           S10       Alcohol    72          Veina     517     82
## 2   2   1463           S10       Alcohol    73          Veina     482     88
## 3   3    262            S2         Shoes    55             Bo     126     56
## 4   4    293            S2        Bakery    50          Veina     108     43
## 5   5    108            S2     Beverages    32             Bo      79     18
## 6   6    216            S1       Alcohol    41             Bo      98     35
## 7   7    174            S3 Packaged food    36             Bo      78     34
## 8   8    122            S9         Shoes    31             Bo      62     12
## 9   9    204            S6        Bakery    41             Bo      99     26
## 10 10    308            S9       Alcohol    52             Bo     103     33
##    salesLast3Mon salesThisMon daysSinceLastPurch meanItemPrice
## 1        2741.97      1283.87                  1      1.866555
## 2        2790.58      1242.60                  1      1.907437
## 3        1529.55       682.57                  1      5.837977
## 4        1765.81       730.23                  1      6.026655
## 5        1180.00       552.54                 12     10.925926
## 6        1345.29       662.52                  2      6.228194
## 7        1338.81       621.46                  2      7.694310
## 8        1256.96       367.07                  4     10.302951
## 9        1963.60       780.78                 14      9.625490
## 10       1584.59       695.52                  1      5.144773
##    meanShoppingCartValue customerDuration coupon gender        income age
## 1               33.43866              821     No   male    Low Income  47
## 2               31.71114              657    Yes   male    Low Income  45
## 3               27.31339              548    Yes female    Low Income  46
## 4               41.06535              596    Yes   male    Low Income  45
## 5               65.55556              603    Yes female Medium Income  29
## 6               38.43686              673    Yes female    Low Income  32
## 7               39.37676              612    Yes female    Low Income  29
## 8              104.74667              517    Yes female Medium Income  18
## 9               75.52308              709     No female Medium Income  45
## 10              48.01788              480     No female    Low Income  35

mean(mydata$meanShoppingCartValue)
## [1] 91.82078

median(mydata$nPurch)
## [1] 17

Answer: The mean meanShoppingCartValue is 91.82078. The median nPurch value is 17.

Question 2

Use the tidyverse package for this question. First, create a copy of mydata and name the copy mydata1. Then change mydata1 into a tibble. Use mydata1 as input to a pipe operator to create a new column called hicustomerDuration that divides all the customers into two groups based on the median of customerDuration: hicustomerDuration = 1 if customerDuration > median(customerDuration); otherwise hicustomerDuration = 0. Next, select the customers who used a coupon, and for this selected group of customers, find out if there is any gender difference in mean customerDuration. (3 points)

mydata1 <- mydata
# install.packages("tidyverse", dependencies=TRUE)  # not needed: the package is assumed to be installed
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.3     ✓ purrr   0.3.4
## ✓ tibble  3.0.6     ✓ dplyr   1.0.3
## ✓ tidyr   1.1.3     ✓ stringr 1.4.0
## ✓ readr   1.4.0     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

mydata1 <- mydata %>% as_tibble()
mydata1 %>%
  mutate(hicustomerDuration = as.integer(customerDuration > median(customerDuration))) %>%  # 1 if above the median, else 0
  filter(coupon == "Yes") %>%
  select(gender, customerDuration) %>%
  group_by(gender) %>%
  summarize(Avg_customerDuration = mean(customerDuration)) %>%
  ungroup()
## # A tibble: 2 x 2
##   gender Avg_customerDuration
## * <chr>                 <dbl>
## 1 female                 642.
## 2 male                   651.

Answer: Among customers who used a coupon, the mean customerDuration is about 642 days for females and about 651 days for males, so males have been with the company about 9 days longer on average.
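
The question asks for a 1/0 indicator, whereas a bare comparison such as `x > median(x)` returns TRUE/FALSE. A minimal base R sketch of the same steps, using only the ten rows printed in Question 1 (so the median here is that of these ten rows, not of the full dataset); the data frame name `first10` is an assumption made for illustration:

```r
# customerDuration, coupon and gender of the first 10 cases shown in Question 1.
first10 <- data.frame(
  customerDuration = c(821, 657, 548, 596, 603, 673, 612, 517, 709, 480),
  coupon = c("No", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "No", "No"),
  gender = c("male", "male", "female", "male", "female",
             "female", "female", "female", "female", "female")
)

# Median split coded as integers: 1 if above the median, otherwise 0.
first10$hicustomerDuration <-
  as.integer(first10$customerDuration > median(first10$customerDuration))

# Keep coupon users only, then compare mean duration by gender.
used <- first10[first10$coupon == "Yes", ]
means <- tapply(used$customerDuration, used$gender, mean)
means
```

With only ten rows the group means differ from the full-data values of 642 and 651; the sketch is meant to show the mechanics, not to reproduce the answer.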

Question 3

Use ggplot2 to draw a scatterplot using salesThisMon as the y axis and daysSinceLastPurch as the x axis, set point size = 2, and then overlay the graph with a regression line in red. Label the x-axis as “Days Since Last Purchase” and the y-axis as “Sales This Month”. What is the relationship between the two variables? (2 points)

library(ggplot2)

g <- ggplot(data = mydata, aes(x = daysSinceLastPurch, y = salesThisMon))
# Both variables are numeric, so continuous scales are used for the axis labels;
# the question asks for a red regression line.
g + geom_point(size = 2) +
  stat_smooth(method = "lm", col = "red") +
  scale_x_continuous(name = "Days Since Last Purchase") +
  scale_y_continuous(name = "Sales This Month")
## `geom_smooth()` using formula 'y ~ x'

Answer: The two variables are negatively related: salesThisMon tends to decrease as daysSinceLastPurch increases.
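
The direction of a relationship read off a scatterplot can also be checked numerically. A minimal sketch on simulated values (hypothetical data, not taken from salesData.csv), showing that a downward-sloping fitted line corresponds to a negative slope and a negative correlation:

```r
# Simulated stand-ins for daysSinceLastPurch and salesThisMon (assumed values).
set.seed(1)
days  <- sample(1:60, 200, replace = TRUE)
sales <- 700 - 5 * days + rnorm(200, sd = 50)

fit   <- lm(sales ~ days)      # same model that stat_smooth(method = "lm") draws
slope <- coef(fit)[["days"]]   # sign gives the direction of the relationship
r     <- cor(days, sales)      # strength and direction of the linear association

c(slope = slope, correlation = r)
```

On the real data the same two calls would be `lm(salesThisMon ~ daysSinceLastPurch, data = mydata)` and `cor(mydata$daysSinceLastPurch, mydata$salesThisMon)`.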

Question 4

Is there any income difference in the use of coupons? Is the effect practically significant? (2 points)

# install.packages("survey")  # not needed: the package is assumed to be installed
library(survey)
## Loading required package: grid
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
## Loading required package: survival
## 
## Attaching package: 'survey'
## The following object is masked from 'package:graphics':
## 
##     dotchart

dsrs <- svydesign(id=~1, data=mydata)
## Warning message:
## In svydesign.default(id = ~1, data = mydata) :
##   No weights or probabilities supplied, assuming equal probability
summary(dsrs)
## Independent Sampling design (with replacement)
## svydesign(id = ~1, data = mydata)
## Probabilities:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       1       1       1       1       1       1 
## Data variables:
##  [1] "id"                    "nItems"                "mostFreqStore"         "mostFreqCat"           "nCats"                
##  [6] "preferredBrand"        "nBrands"               "nPurch"                "salesLast3Mon"         "salesThisMon"         
## [11] "daysSinceLastPurch"    "meanItemPrice"         "meanShoppingCartValue" "customerDuration"      "coupon"               
## [16] "gender"                "income"                "age"                  

result <- svychisq(~income+coupon, dsrs, statistic="Chisq")
result
## 
##  Pearson's X^2: Rao & Scott adjustment
## 
## data:  svychisq(~income + coupon, dsrs, statistic = "Chisq")
## X-squared = 0.53029, df = 2, p-value = 0.7671

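Answer: Since the p-value (0.7671) is greater than 0.05, there is no statistically significant income difference in the use of coupons. With no detectable association, there is no evidence of a practically significant effect either.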
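
Practical significance is usually judged with an effect size rather than a p-value. A minimal sketch computing Cramér's V from the figures printed above, under the assumption that the reported X-squared can be treated as an ordinary Pearson statistic for a 3 x 2 table (df = 2) on the 5119 cases:

```r
# Values copied from the output above; the table dimensions are inferred from df = 2.
x2     <- 0.53029   # reported X-squared
n      <- 5119      # number of cases after deleting three rows
n_rows <- 3         # income levels (assumed from df = 2)
n_cols <- 2         # coupon: Yes / No

# Cramér's V = sqrt(X^2 / (n * (min(rows, cols) - 1)))
cramers_v <- sqrt(x2 / (n * (min(n_rows, n_cols) - 1)))
round(cramers_v, 3)
```

A value this close to 0, far below the 0.1 that is often treated as a small effect, is consistent with the conclusion that the effect is not practically significant.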

Question 5

Use the caret package to build a logistic regression machine learning model using coupon as the target variable and meanShoppingCartValue, salesThisMon, gender and income as the feature variables. Please randomly split the dataset into the training dataset and the testing dataset. The training dataset should contain 80% of the data. Use min-max to normalize meanShoppingCartValue and salesThisMon first before you train the model. Name the transformed variables as meanShoppingCartValue1 and salesThisMon1. Write down the logistic regression equation. How accurate is your machine learning model according to the confusion matrix? Please print the confusion matrix and accuracy value to the screen. (4 points)
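
Min-max normalization rescales a variable to the range 0 to 1 via (x - min) / (max - min). A minimal sketch on the ten rows printed in Question 1, storing the results as new columns so that a model formula can refer to them; the data frame name `toy` is an assumption made for illustration:

```r
# Min-max normalization: the smallest value maps to 0 and the largest to 1.
normalize <- function(x) (x - min(x)) / (max(x) - min(x))

# salesThisMon and meanShoppingCartValue of the first 10 cases shown in Question 1.
toy <- data.frame(
  salesThisMon = c(1283.87, 1242.60, 682.57, 730.23, 552.54,
                   662.52, 621.46, 367.07, 780.78, 695.52),
  meanShoppingCartValue = c(33.43866, 31.71114, 27.31339, 41.06535, 65.55556,
                            38.43686, 39.37676, 104.74667, 75.52308, 48.01788)
)

# Add the transformed variables as columns, using the names the question asks for.
toy$salesThisMon1          <- normalize(toy$salesThisMon)
toy$meanShoppingCartValue1 <- normalize(toy$meanShoppingCartValue)

range(toy$salesThisMon1)
range(toy$meanShoppingCartValue1)
```

Keeping the normalized values as columns of the data frame is what lets a formula such as `coupon ~ meanShoppingCartValue1 + salesThisMon1 + gender + income` pick them up after the training/testing split.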

# install.packages("caret", dependencies=TRUE)
library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift

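normalize <- function(x){return((x-min(x))/(max(x)-min(x)))}
# Min-max normalized copies. They are stored as standalone vectors rather than
# columns of mydata, so Model1 below is fit on the original, unnormalized variables.
meanShoppingCartValue1 <- normalize(mydata$meanShoppingCartValue)
salesThisMon1 <- normalize(mydata$salesThisMon)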
set.seed(5455)
inTrain <- createDataPartition(y=mydata$coupon, p=0.8, list=FALSE)
training <- mydata[inTrain,]
testing <- mydata[-inTrain,]
dim(training)
## [1] 4096   18

dim(testing)
## [1] 1023   18

Model1 <- train(coupon ~ meanShoppingCartValue+salesThisMon+gender+income, data=training, method="glm")
summary(Model1)
## 
## Call:
## NULL
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6991  -1.4552   0.8820   0.9102   1.0110  
## 
## Coefficients:
##                         Estimate Std. Error z value Pr(>|z|)   
## (Intercept)            1.346e-01  2.275e-01   0.592  0.55412   
## meanShoppingCartValue  2.800e-03  1.005e-03   2.785  0.00535 **
## salesThisMon          -8.486e-05  1.869e-04  -0.454  0.64980   
## gendermale             7.201e-02  7.040e-02   1.023  0.30637   
## `incomeLow Income`     4.662e-01  1.613e-01   2.890  0.00385 **
## `incomeMedium Income`  3.620e-01  1.264e-01   2.864  0.00418 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 5222.1  on 4095  degrees of freedom
## Residual deviance: 5210.0  on 4090  degrees of freedom
## AIC: 5222
## 
## Number of Fisher Scoring iterations: 4
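
Read off the coefficient table, the fitted equation is logit(p) = 0.1346 + 0.0028 * meanShoppingCartValue - 0.00008486 * salesThisMon + 0.07201 * gendermale + 0.4662 * incomeLowIncome + 0.3620 * incomeMediumIncome. A minimal sketch that turns it into a predicted probability, assuming p is the probability of coupon = "Yes" and using the rounded coefficients as printed; `predict_prob` and the example customer are hypothetical:

```r
# Rounded coefficients copied from the summary(Model1) output above.
b <- c(intercept = 0.1346, cart = 0.0028, sales = -0.00008486,
       male = 0.07201, low = 0.4662, medium = 0.3620)

# Hypothetical helper: dummy arguments are 1 if the category applies, else 0.
predict_prob <- function(cart, sales, male, low, medium) {
  eta <- b[["intercept"]] + b[["cart"]] * cart + b[["sales"]] * sales +
    b[["male"]] * male + b[["low"]] * low + b[["medium"]] * medium
  plogis(eta)  # inverse logit: 1 / (1 + exp(-eta))
}

# Example: a low-income female customer at the sample means from Question 1.
p <- predict_prob(cart = 91.82, sales = 605.7, male = 0, low = 1, medium = 0)
p
```

Because the coefficients are rounded, the result can differ slightly from what `predict(Model1, type = "prob")` would return for the same customer.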

# Predict on the testing data with Model1, the model trained above.
ModelPredictions <- predict(Model1, newdata=testing)
cm <- table(testing$coupon, ModelPredictions)
acc <- sum(diag(cm)) / sum(cm)
cm
##      ModelPredictions
##        No Yes
##   No    0 342
##   Yes   0 681
acc
## [1] 0.6656891
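
The accuracy can be checked by hand from the confusion matrix, and it helps to compare it with the accuracy of always predicting the majority class. A minimal sketch that rebuilds the matrix from the counts printed above (the dimnames are labels chosen for illustration):

```r
# Confusion matrix from the output above: rows are actual, columns are predicted.
cm <- matrix(c(0, 0, 342, 681), nrow = 2,
             dimnames = list(actual = c("No", "Yes"), predicted = c("No", "Yes")))

# Accuracy = correctly classified cases / all cases.
acc <- sum(diag(cm)) / sum(cm)

# Baseline: accuracy obtained by always predicting the most common actual class.
baseline <- max(rowSums(cm)) / sum(cm)

c(accuracy = acc, baseline = baseline)
```

Here the two numbers coincide because every test case is predicted as "Yes", so the model is about 66.6% accurate but does no better than the majority-class baseline.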

Question 6

Build a regression model using meanShoppingCartValue as the dependent variable and customerDuration, daysSinceLastPurch, gender and income as the independent variables. Write down the regression equation. Which variables are significant at the 5% level? Interpret the R-squared value. What is the predicted meanShoppingCartValue of a customer with the following characteristics? customerDuration= 365, daysSinceLastPurch=10, gender="female", income="Low Income" (3 points)

Model2 <- lm(meanShoppingCartValue ~ customerDuration+daysSinceLastPurch+gender+income, mydata)
summary(Model2)
## 
## Call:
## lm(formula = meanShoppingCartValue ~ customerDuration + daysSinceLastPurch + 
##     gender + income, data = mydata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -134.26  -14.30   -2.37    9.94  710.96 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          1.625e+02  2.624e+00  61.917   <2e-16 ***
## customerDuration    -3.746e-03  3.370e-03  -1.112    0.266    
## daysSinceLastPurch   1.655e+00  6.273e-02  26.383   <2e-16 ***
## gendermale           2.196e+00  1.069e+00   2.054    0.040 *  
## incomeLow Income    -1.207e+02  1.617e+00 -74.639   <2e-16 ***
## incomeMedium Income -8.521e+01  1.451e+00 -58.722   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 36.52 on 5113 degrees of freedom
## Multiple R-squared:  0.6522, Adjusted R-squared:  0.6519 
## F-statistic:  1918 on 5 and 5113 DF,  p-value: < 2.2e-16

predict(Model2, data.frame(customerDuration= 365, daysSinceLastPurch=10, gender="female", income="Low Income"))
##        1 
## 56.93139

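Answer: Regression equation: meanShoppingCartValue = 162.5 - 0.003746 * customerDuration + 1.655 * daysSinceLastPurch + 2.196 * gendermale - 120.7 * incomeLow_Income - 85.21 * incomeMedium_Income. At the 5% level, daysSinceLastPurch, gendermale, incomeLow_Income and incomeMedium_Income are significant (p < 0.05); customerDuration is not (p = 0.266). The R-squared of 0.6522 means the model explains about 65% of the variance in meanShoppingCartValue. The predicted meanShoppingCartValue of the customer is 56.93139.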
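
The prediction can be reproduced by plugging the customer's values into the regression equation. A minimal sketch using the rounded coefficients printed in the summary, with female and Low Income coded through the dummy variables (gendermale = 0, incomeLow_Income = 1, incomeMedium_Income = 0); the vector names are chosen for illustration:

```r
# Rounded coefficients copied from the summary(Model2) output.
coefs <- c(intercept = 162.5, customerDuration = -0.003746,
           daysSinceLastPurch = 1.655, gendermale = 2.196,
           incomeLow = -120.7, incomeMedium = -85.21)

# Customer: customerDuration = 365, daysSinceLastPurch = 10, female, Low Income.
x <- c(intercept = 1, customerDuration = 365, daysSinceLastPurch = 10,
       gendermale = 0, incomeLow = 1, incomeMedium = 0)

pred <- sum(coefs * x)  # linear predictor = predicted meanShoppingCartValue
pred
```

The hand calculation gives roughly 56.98, which differs from the 56.93139 returned by predict() only because the printed coefficients are rounded to four significant digits.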