R Assignment

Note that the rmarkdown package does not need to be explicitly installed or loaded here, as RStudio automatically does both when needed.

When you need a package, load it with the library() command. There is no need to include the install.packages() command in your answers; you can assume that I have already installed the package.

Put your answers to the questions after Answer:. If you want to start a new line, use a backslash (\).

Course code: MKTG3010D

Your name: Yip Lai Yee

Your student id: 1155125165

The dataset is called salesData.csv. It contains the sales data of a department store chain to its loyalty club members. The columns are as follows:

  • id = customer’s loyalty card number
  • nItems = number of items bought by the customer in the past
  • mostFreqStore = most frequently visited branch store
  • mostFreqCat = most frequently bought product category
  • nCats = number of product categories bought in the past
  • preferredBrand = most preferred brand
  • nBrands = number of brands bought in the past
  • nPurch = number of purchases made in the past
  • salesLast3Mon = sales to the customer in the last three months
  • salesThisMon = sales to the customer in the current month
  • daysSinceLastPurch = days since last purchase
  • meanItemPrice = mean item price of previous purchases
  • meanShoppingCartValue = mean shopping cart value
  • customerDuration = number of days the customer has been with the company
  • coupon = whether or not the customer has used a discount coupon in the last purchase
  • gender
  • income
  • age

Please put all your commands between ```{r} and ``` below.

Before answering the following questions, please delete three rows from the dataset, determined by the last two digits of your student id. For example, if your student id ends with 12, delete cases 13, 14 and 15 from the dataset. Read the dataset into R, and enter the commands to remove the three cases.

mydata <- read.csv("salesData.csv")
mydata <- mydata[-c(66,67,68),]
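
The row numbers above follow from the student id: an id ending in 65 drops cases 66, 67 and 68. As a minimal sketch, assuming the rule generalizes from the example in the instructions (last two digits n means cases n+1, n+2 and n+3), the indices could be derived instead of typed by hand. The names student_id, last_two, rows_to_drop and toy are hypothetical, and toy is a stand-in for the real dataset.

```r
# Assumption: the "ends with 12 -> delete 13, 14, 15" example generalizes to n+1..n+3.
student_id <- "1155125165"
last_two <- as.integer(substr(student_id, nchar(student_id) - 1, nchar(student_id)))
rows_to_drop <- last_two + 1:3

# Toy stand-in for the real data frame (100 made-up rows).
toy <- data.frame(id = 1:100)
toy <- toy[-rows_to_drop, , drop = FALSE]

print(rows_to_drop)
print(nrow(toy))
```

Keeping the id as a character string avoids any issue with leading zeros or numeric precision when extracting the last two digits.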

Question 1

Use the summary command to get some summary statistics that describe the dataset, and then display the first 10 cases in the dataset. What is the mean meanShoppingCartValue? What is the median nPurch value? (1 point)

summary(mydata)
##        id           nItems       mostFreqStore      mostFreqCat       
##  Min.   :   1   Min.   :   1.0   Length:5119        Length:5119       
##  1st Qu.:1384   1st Qu.:  83.0   Class :character   Class :character  
##  Median :2744   Median : 157.0   Mode  :character   Mode  :character  
##  Mean   :2742   Mean   : 186.2                                        
##  3rd Qu.:4100   3rd Qu.: 258.0                                        
##  Max.   :5455   Max.   :1469.0                                        
##      nCats       preferredBrand        nBrands          nPurch     
##  Min.   : 1.00   Length:5119        Min.   :  1.0   Min.   : 1.00  
##  1st Qu.:27.00   Class :character   1st Qu.: 45.0   1st Qu.:11.00  
##  Median :37.00   Mode  :character   Median : 76.0   Median :17.00  
##  Mean   :36.35                      Mean   : 81.9   Mean   :19.87  
##  3rd Qu.:47.00                      3rd Qu.:111.0   3rd Qu.:26.00  
##  Max.   :73.00                      Max.   :517.0   Max.   :88.00  
##  salesLast3Mon   salesThisMon    daysSinceLastPurch meanItemPrice    
##  Min.   : 189   Min.   :   0.0   Min.   : 1.000     Min.   :  1.867  
##  1st Qu.:1067   1st Qu.: 480.9   1st Qu.: 2.000     1st Qu.:  6.019  
##  Median :1332   Median : 607.8   Median : 3.000     Median :  8.530  
##  Mean   :1322   Mean   : 605.8   Mean   : 6.278     Mean   : 12.257  
##  3rd Qu.:1573   3rd Qu.: 731.3   3rd Qu.: 7.000     3rd Qu.: 13.191  
##  Max.   :2791   Max.   :1362.8   Max.   :89.000     Max.   :377.900  
##  meanShoppingCartValue customerDuration    coupon             gender         
##  Min.   : 17.35        Min.   :   0.0   Length:5119        Length:5119       
##  1st Qu.: 54.46        1st Qu.: 547.0   Class :character   Class :character  
##  Median : 76.57        Median : 649.0   Mode  :character   Mode  :character  
##  Mean   : 91.80        Mean   : 644.1                                        
##  3rd Qu.:110.61        3rd Qu.: 745.0                                        
##  Max.   :914.04        Max.   :1355.0                                        
##     income               age       
##  Length:5119        Min.   :17.00  
##  Class :character   1st Qu.:28.00  
##  Mode  :character   Median :38.00  
##                     Mean   :38.01  
##                     3rd Qu.:48.00  
##                     Max.   :65.00
print(mydata[1:10,]) 
##    id nItems mostFreqStore   mostFreqCat nCats preferredBrand nBrands nPurch
## 1   1   1469           S10       Alcohol    72          Veina     517     82
## 2   2   1463           S10       Alcohol    73          Veina     482     88
## 3   3    262            S2         Shoes    55             Bo     126     56
## 4   4    293            S2        Bakery    50          Veina     108     43
## 5   5    108            S2     Beverages    32             Bo      79     18
## 6   6    216            S1       Alcohol    41             Bo      98     35
## 7   7    174            S3 Packaged food    36             Bo      78     34
## 8   8    122            S9         Shoes    31             Bo      62     12
## 9   9    204            S6        Bakery    41             Bo      99     26
## 10 10    308            S9       Alcohol    52             Bo     103     33
##    salesLast3Mon salesThisMon daysSinceLastPurch meanItemPrice
## 1        2741.97      1283.87                  1      1.866555
## 2        2790.58      1242.60                  1      1.907437
## 3        1529.55       682.57                  1      5.837977
## 4        1765.81       730.23                  1      6.026655
## 5        1180.00       552.54                 12     10.925926
## 6        1345.29       662.52                  2      6.228194
## 7        1338.81       621.46                  2      7.694310
## 8        1256.96       367.07                  4     10.302951
## 9        1963.60       780.78                 14      9.625490
## 10       1584.59       695.52                  1      5.144773
##    meanShoppingCartValue customerDuration coupon gender        income age
## 1               33.43866              821     No   male    Low Income  47
## 2               31.71114              657    Yes   male    Low Income  45
## 3               27.31339              548    Yes female    Low Income  46
## 4               41.06535              596    Yes   male    Low Income  45
## 5               65.55556              603    Yes female Medium Income  29
## 6               38.43686              673    Yes female    Low Income  32
## 7               39.37676              612    Yes female    Low Income  29
## 8              104.74667              517    Yes female Medium Income  18
## 9               75.52308              709     No female Medium Income  45
## 10              48.01788              480     No female    Low Income  35
mean(mydata$meanShoppingCartValue)
## [1] 91.79511
median(mydata$nPurch)
## [1] 17

Answer:

Question 2

Use the tidyverse package for this question. First, create a copy of mydata and name the copy mydata1. Then change mydata1 into a tibble. Use mydata1 as input to a pipe operator to create a new column called hicustomerDuration that divides all the customers into two groups based on the median of customerDuration: hicustomerDuration = 1 if customerDuration > median(customerDuration); otherwise hicustomerDuration = 0. Next, select the customers who used a coupon, and for this selected group of customers, find out if there is any gender difference in mean customerDuration. (3 points)
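
The steps in this question can be sketched on a small made-up data frame. This is only an illustration, assuming coupon is coded "Yes"/"No" as in the rows printed for Question 1; toy, toy1, result and meanDuration are hypothetical names, and the numbers say nothing about salesData.csv.

```r
library(dplyr)  # part of the tidyverse; library(tidyverse) would load it as well

# Toy stand-in for mydata (values are made up for illustration).
toy <- data.frame(
  customerDuration = c(100, 200, 300, 400, 500, 600),
  coupon = c("Yes", "No", "Yes", "Yes", "No", "Yes"),
  gender = c("male", "female", "female", "male", "male", "female")
)

# Copy into a tibble.
toy1 <- as_tibble(toy)

# Flag customers above the median duration (median taken over all customers).
toy1 <- toy1 %>%
  mutate(hicustomerDuration = if_else(customerDuration > median(customerDuration), 1, 0))

# Keep coupon users only, then compare mean duration by gender.
result <- toy1 %>%
  filter(coupon == "Yes") %>%
  group_by(gender) %>%
  summarise(meanDuration = mean(customerDuration))

print(result)
```

Because the median is computed inside mutate() before filter(), the split is based on all customers rather than on coupon users only, which matches the order of steps in the question.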

Answer:

Question 3

Use ggplot2 to draw a scatterplot using salesThisMon as the y axis and daysSinceLastPurch as the x axis, set point size = 2, and then overlay the graph with a regression line in red. Label the x-axis as “Days Since Last Purchase” and the y-axis as “Sales This Month”. What is the relationship between the two variables? (2 points)
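
A minimal sketch of the plot layers described here, using simulated data rather than salesData.csv. The data frame toy and the object p are hypothetical, and the downward trend is built into the simulation, so it carries no information about the real relationship.

```r
library(ggplot2)

# Simulated stand-in data; the slope and noise level are arbitrary choices.
set.seed(1)
toy <- data.frame(daysSinceLastPurch = sample(1:30, 50, replace = TRUE))
toy$salesThisMon <- 700 - 5 * toy$daysSinceLastPurch + rnorm(50, sd = 100)

p <- ggplot(toy, aes(x = daysSinceLastPurch, y = salesThisMon)) +
  geom_point(size = 2) +                      # scatterplot with point size 2
  geom_smooth(method = "lm", formula = y ~ x, # linear regression line
              se = FALSE, colour = "red") +
  labs(x = "Days Since Last Purchase", y = "Sales This Month")

print(p)
```

Setting se = FALSE hides the confidence band so that only the red line is drawn; that is a presentation choice, not something the question requires.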

Answer:

Question 4

Is there any income difference in the use of coupons? Is the effect practically significant? (2 points)
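
The question does not name a method. One common approach, assumed here, is a chi-square test of independence for statistical significance and Cramér's V as an effect size for practical significance. The sketch below uses a made-up 2 x 2 table; tab, test and cramers_v are hypothetical names.

```r
# Made-up counts for illustration only. With real data the table would come from
# something like table(mydata$income, mydata$coupon).
tab <- matrix(c(30, 20,
                20, 30),
              nrow = 2, byrow = TRUE,
              dimnames = list(income = c("Low Income", "Medium Income"),
                              coupon = c("No", "Yes")))

# Chi-square test of independence (no continuity correction, to keep the arithmetic plain).
test <- chisq.test(tab, correct = FALSE)

# Cramer's V = sqrt(chi-square / (n * (min(rows, cols) - 1))).
cramers_v <- sqrt(as.numeric(test$statistic) / (sum(tab) * (min(dim(tab)) - 1)))

print(test)
print(cramers_v)
```

For this toy table the chi-square statistic is 4 and Cramér's V is 0.2. A small p-value with a small V would suggest a difference that is statistically detectable but of limited practical importance.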

Answer:

Question 5

Use the caret package to build a logistic regression machine learning model using coupon as the target variable and meanShoppingCartValue, salesThisMon, gender and income as the feature variables. Please randomly split the dataset into the training dataset and the testing dataset. The training dataset should contain 80% of the data. Use min-max to normalize meanShoppingCartValue and salesThisMon before you train the model. Name the transformed variables meanShoppingCartValue1 and salesThisMon1. Write down the logistic regression equation. How accurate is your machine learning model according to the confusion matrix? Please print the confusion matrix and accuracy value to the screen. (4 points)
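
The workflow in this question (min-max scaling, an 80/20 split, a logistic model, a confusion matrix) can be sketched as follows on simulated data. Everything here is an assumption for illustration: toy, min_max, train_idx, train_set, test_set, fit, pred and cm are hypothetical names, the seed is arbitrary, and the simulated coupon variable has no connection to salesData.csv.

```r
library(caret)

set.seed(123)
n <- 200

# Simulated stand-in data with the same column names as the assignment.
toy <- data.frame(
  meanShoppingCartValue = runif(n, 20, 300),
  salesThisMon = runif(n, 0, 1300),
  gender = factor(sample(c("female", "male"), n, replace = TRUE)),
  income = factor(sample(c("Low Income", "Medium Income"), n, replace = TRUE))
)
toy$coupon <- factor(ifelse(toy$meanShoppingCartValue + rnorm(n, sd = 80) > 160, "Yes", "No"))

# Min-max normalization: rescale each variable to the range [0, 1].
min_max <- function(x) (x - min(x)) / (max(x) - min(x))
toy$meanShoppingCartValue1 <- min_max(toy$meanShoppingCartValue)
toy$salesThisMon1 <- min_max(toy$salesThisMon)

# Random 80/20 split, stratified on the target.
train_idx <- createDataPartition(toy$coupon, p = 0.8, list = FALSE)
train_set <- toy[train_idx, ]
test_set <- toy[-train_idx, ]

# Logistic regression through caret; glm treats the second factor level ("Yes") as the event.
fit <- train(coupon ~ meanShoppingCartValue1 + salesThisMon1 + gender + income,
             data = train_set, method = "glm", family = "binomial")
print(coef(fit$finalModel))  # coefficients for writing out the equation

# Evaluate on the held-out data.
pred <- predict(fit, newdata = test_set)
cm <- confusionMatrix(pred, test_set$coupon)
print(cm$table)
print(cm$overall["Accuracy"])
```

The scaling here uses the minimum and maximum of the full dataset, following the order given in the question; scaling with training-set values only is a stricter alternative.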

Answer:

Question 6

Build a regression model using meanShoppingCartValue as the dependent variable and customerDuration, daysSinceLastPurch, gender and income as the independent variables. Write down the regression equation. Which variables are significant at the 5% level? Interpret the R-squared value. What is the predicted meanShoppingCartValue of a customer with the following characteristics? customerDuration=365, daysSinceLastPurch=10, gender=“female”, income=“Low Income” (3 points)
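
A minimal sketch of fitting the model and predicting for one new customer, again on simulated data. The names toy, fit, fit_summary, new_customer and pred are hypothetical, and the coefficients used to simulate the outcome are arbitrary, so the printed numbers are not answers for salesData.csv.

```r
set.seed(42)
n <- 120

# Simulated stand-in data with the same column names as the assignment.
toy <- data.frame(
  customerDuration = round(runif(n, 100, 1000)),
  daysSinceLastPurch = sample(1:60, n, replace = TRUE),
  gender = sample(c("female", "male"), n, replace = TRUE),
  income = sample(c("Low Income", "Medium Income"), n, replace = TRUE)
)
toy$meanShoppingCartValue <- 50 + 0.02 * toy$customerDuration +
  0.5 * toy$daysSinceLastPurch + rnorm(n, sd = 15)

# Linear regression; character columns are treated as categorical predictors.
fit <- lm(meanShoppingCartValue ~ customerDuration + daysSinceLastPurch + gender + income,
          data = toy)
fit_summary <- summary(fit)
print(fit_summary$coefficients)  # estimates and p-values (compare p-values with 0.05)
print(fit_summary$r.squared)     # share of variance explained by the model

# Prediction for the customer profile given in the question.
new_customer <- data.frame(customerDuration = 365, daysSinceLastPurch = 10,
                           gender = "female", income = "Low Income")
pred <- predict(fit, newdata = new_customer)
print(pred)
```

The column names and category labels in new_customer must match those in the fitted data exactly, otherwise predict() stops with an error.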

Answer: