Data

data <- read.csv('C:/Users/clyu/Desktop/School/699/699 Data.csv')
data$ContentType <- as.factor(data$ContentType)
summary(data)
##    customerid        ContentType     sends             opens       
##  Min.   :3.633e+05   0: 79466    Min.   :   1.00   Min.   :   0.0  
##  1st Qu.:3.766e+09   1:104164    1st Qu.:  12.00   1st Qu.:   0.0  
##  Median :4.312e+09               Median :  46.00   Median :   1.0  
##  Mean   :4.144e+09               Mean   :  75.34   Mean   :  14.7  
##  3rd Qu.:5.125e+09               3rd Qu.:  94.00   3rd Qu.:   7.0  
##  Max.   :5.396e+09               Max.   :5175.00   Max.   :5153.0  
##      clicks            orders             revenue        
##  Min.   :   0.00   Min.   :  0.00000   Min.   :    0.00  
##  1st Qu.:   0.00   1st Qu.:  0.00000   1st Qu.:    0.00  
##  Median :   0.00   Median :  0.00000   Median :    0.00  
##  Mean   :   6.74   Mean   :  0.09455   Mean   :   19.19  
##  3rd Qu.:   2.00   3rd Qu.:  0.00000   3rd Qu.:    0.00  
##  Max.   :4996.00   Max.   :148.00000   Max.   :44554.59
str(data)
## 'data.frame':    183630 obs. of  7 variables:
##  $ customerid : num  1856094 1856094 26747851 26747851 43406107 ...
##  $ ContentType: Factor w/ 2 levels "0","1": 1 2 1 2 2 1 2 1 1 2 ...
##  $ sends      : int  86 46 91 53 3 454 78 23 9 1 ...
##  $ opens      : int  0 0 4 0 3 347 51 22 0 0 ...
##  $ clicks     : int  0 0 0 0 1 208 16 21 0 0 ...
##  $ orders     : int  0 0 0 0 0 1 0 0 0 0 ...
##  $ revenue    : num  0 0 0 0 0 ...
head(data)
##   customerid ContentType sends opens clicks orders revenue
## 1    1856094           0    86     0      0      0    0.00
## 2    1856094           1    46     0      0      0    0.00
## 3   26747851           0    91     4      0      0    0.00
## 4   26747851           1    53     0      0      0    0.00
## 5   43406107           1     3     3      1      0    0.00
## 6   47963744           0   454   347    208      1  115.99

Univariate plot for the variable of your interest

  1. histogram
  2. skewness value = 46
  3. kurtosis value = 4411
  4. Results description: The variable of interest, y, is a discrete count of orders. Y is not normally distributed: it is strongly right-skewed, with a skewness of about 46 and a kurtosis of about 4411. The histogram shows that, of the 183,630 observations, more than roughly 175,000 have zero orders placed through email in the past 8 months. Given how skewed the data are, models that assume approximately normal errors, such as linear regression, are not appropriate for this dataset.

## [1] 45.52438
## [1] 4411.404
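
The code that produced the two values above is not shown in this report. As a minimal sketch, the moment-based definitions of skewness and kurtosis can be written in base R as follows. The function names and the toy vector are hypothetical, and it is an assumption that the reported values use these non-excess, population-moment formulas.

```r
# Moment-based skewness: third central moment over variance^(3/2).
skewness_moment <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Moment-based (non-excess) kurtosis: fourth central moment over variance^2.
kurtosis_moment <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^4) / mean((x - m)^2)^2
}

# Hypothetical toy counts standing in for data$orders (mostly zeros, a few large values).
toy_orders <- c(rep(0, 95), 1, 1, 2, 3, 10)

skewness_moment(toy_orders)   # positive for a right-skewed variable
kurtosis_moment(toy_orders)   # large when a few extreme values dominate
mean(toy_orders == 0)         # share of rows with zero orders

# Histogram of the count variable.
hist(toy_orders, breaks = 20, main = "Orders", xlab = "orders")
```

On the real data, the same calls would take `data$orders` in place of `toy_orders`.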

Bivariate plot for your Y variable and one X.

  1. Box plot
  2. Figure description

The box plot below shows customer order counts by email content type. Content type is a categorical variable with two levels, non-promo and promo; order count is a discrete numerical variable. Both content types have many outliers, and the median and mean of orders are close to 0 for each. Comparing the two content types, non-promo emails have more outliers, and larger ones, than promo emails. Orders are highly right-skewed for both non-promo and promo content.

## Warning: `fun.y` is deprecated. Use `fun` instead.
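
The plotting code is not shown here, but the warning above suggests a `stat_summary()` call that passed the mean through `fun.y`. A minimal sketch of such a box plot using the current `fun` argument might look like the following. The toy data frame is hypothetical, and the mapping of level 0 to non-promo and level 1 to promo is an assumption.

```r
library(ggplot2)

# Hypothetical toy data with the same column names as the report's data frame.
toy <- data.frame(
  ContentType = factor(c(0, 0, 0, 0, 1, 1, 1, 1)),
  orders      = c(0, 0, 1, 12, 0, 0, 0, 3)
)

p_box <- ggplot(toy, aes(x = ContentType, y = orders)) +
  geom_boxplot() +
  # `fun` replaces the deprecated `fun.y`; this layer marks the group mean.
  stat_summary(fun = mean, geom = "point", shape = 18, size = 3, colour = "red") +
  # Assumed label mapping: 0 = non-promo, 1 = promo.
  scale_x_discrete(labels = c("0" = "non-promo", "1" = "promo")) +
  labs(x = "Email content type", y = "Orders")

p_box
```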

Scatter plots between X and Y

  1. Scatter plot
  2. Relationship description

The three graphs each plot one of three x variables against the y variable. None of the three scatter plots shows a linear relationship between x and y. All three have a very similar pattern, with most of the data points located near the origin.

## `geom_smooth()` using formula 'y ~ x'

## `geom_smooth()` using formula 'y ~ x'

## `geom_smooth()` using formula 'y ~ x'
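
The three messages above come from `geom_smooth()` calls that did not set `formula` explicitly. As a sketch, one scatter plot per x variable could be built as follows. The choice of `sends`, `opens`, and `clicks` as the three x variables, the linear smoother, and the helper name `scatter_vs_orders` are assumptions; the toy rows are the six shown by `head(data)`.

```r
library(ggplot2)

# The six rows printed by head(data), numeric columns only.
toy <- data.frame(
  sends  = c(86, 46, 91, 53, 3, 454),
  opens  = c(0, 0, 4, 0, 3, 347),
  clicks = c(0, 0, 0, 0, 1, 208),
  orders = c(0, 0, 0, 0, 0, 1)
)

# Hypothetical helper: scatter plot of one x variable against orders.
scatter_vs_orders <- function(df, xvar) {
  ggplot(df, aes(x = .data[[xvar]], y = orders)) +
    geom_point(alpha = 0.3) +
    # Setting `formula` explicitly avoids the "using formula" message.
    geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
    labs(x = xvar, y = "orders")
}

# One plot per assumed x variable.
plots <- lapply(c("sends", "opens", "clicks"), scatter_vs_orders, df = toy)
plots[[1]]
```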

Multivariate plot

  1. Scatter plot
  2. Relationship description

The scatter plot below shows the x variable opens against the y variable orders, with points distinguished by a second x variable, content type. The pattern is still very similar between the two types of email content: the data sit close to the origin, and there are many zero values. The differences between the two content types are also visible. In this scatter plot, non-promo emails have more opens, but promo emails have more orders.

## `geom_smooth()` using formula 'y ~ x'
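
A minimal sketch of a scatter plot like the one described, with content type mapped to colour, is given below. Using `colour` rather than `fill` and fitting a linear smoother per group are assumptions; the toy rows are taken from `head(data)`.

```r
library(ggplot2)

# Rows from head(data): content type, opens, and orders.
toy <- data.frame(
  ContentType = factor(c(0, 1, 0, 1, 1, 0)),
  opens       = c(0, 0, 4, 0, 3, 347),
  orders      = c(0, 0, 0, 0, 0, 1)
)

p_multi <- ggplot(toy, aes(x = opens, y = orders, colour = ContentType)) +
  geom_point(alpha = 0.4) +
  # One fitted line per content type, because colour defines the groups.
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  labs(x = "Opens", y = "Orders", colour = "Content type")

p_multi
```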