# Load the email data and treat ContentType as a categorical variable
data <- read.csv('C:/Users/clyu/Desktop/School/699/699 Data.csv')
data$ContentType <- as.factor(data$ContentType)
summary(data)
## customerid ContentType sends opens
## Min. :3.633e+05 0: 79466 Min. : 1.00 Min. : 0.0
## 1st Qu.:3.766e+09 1:104164 1st Qu.: 12.00 1st Qu.: 0.0
## Median :4.312e+09 Median : 46.00 Median : 1.0
## Mean :4.144e+09 Mean : 75.34 Mean : 14.7
## 3rd Qu.:5.125e+09 3rd Qu.: 94.00 3rd Qu.: 7.0
## Max. :5.396e+09 Max. :5175.00 Max. :5153.0
## clicks orders revenue
## Min. : 0.00 Min. : 0.00000 Min. : 0.00
## 1st Qu.: 0.00 1st Qu.: 0.00000 1st Qu.: 0.00
## Median : 0.00 Median : 0.00000 Median : 0.00
## Mean : 6.74 Mean : 0.09455 Mean : 19.19
## 3rd Qu.: 2.00 3rd Qu.: 0.00000 3rd Qu.: 0.00
## Max. :4996.00 Max. :148.00000 Max. :44554.59
str(data)
## 'data.frame': 183630 obs. of 7 variables:
## $ customerid : num 1856094 1856094 26747851 26747851 43406107 ...
## $ ContentType: Factor w/ 2 levels "0","1": 1 2 1 2 2 1 2 1 1 2 ...
## $ sends : int 86 46 91 53 3 454 78 23 9 1 ...
## $ opens : int 0 0 4 0 3 347 51 22 0 0 ...
## $ clicks : int 0 0 0 0 1 208 16 21 0 0 ...
## $ orders : int 0 0 0 0 0 1 0 0 0 0 ...
## $ revenue : num 0 0 0 0 0 ...
head(data)
## customerid ContentType sends opens clicks orders revenue
## 1 1856094 0 86 0 0 0 0.00
## 2 1856094 1 46 0 0 0 0.00
## 3 26747851 0 91 4 0 0 0.00
## 4 26747851 1 53 0 0 0 0.00
## 5 43406107 1 3 3 1 0 0.00
## 6 47963744 0 454 347 208 1 115.99
## [1] 45.52438
## [1] 4411.404
The box plot below shows customer order counts by email content type. Content type is a categorical variable with two levels, non-promo content and promo content, and orders is a discrete numerical variable. The plot shows many outliers in orders for both content types, while the median and mean of orders are both close to 0. Comparing the two content types, non-promo emails have more outliers, and larger ones, than promo emails. Orders are strongly right-skewed for both content types.
## Warning: `fun.y` is deprecated. Use `fun` instead.
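The deprecation warning above usually comes from passing `fun.y` to `stat_summary()`. The following is a minimal sketch of how such a box plot could be drawn with the newer `fun` argument (ggplot2 3.3.0 or later). The plotting code that produced the original figure is not shown in this report, so the layers, the toy data frame `toy`, and the mapping of level 0 to non-promo and level 1 to promo are all assumptions made for illustration.

```r
library(ggplot2)

# Toy stand-in for `data`: made-up values, same column names as the real file
set.seed(1)
toy <- data.frame(
  ContentType = factor(rep(c(0, 1), each = 50)),
  orders      = rpois(100, lambda = 0.1)
)

# Box plot of orders by content type, with each group's mean overlaid as a point.
# `fun = mean` replaces the deprecated `fun.y = mean`.
# Assumption: level "0" is non-promo and level "1" is promo.
p_box <- ggplot(toy, aes(x = ContentType, y = orders)) +
  geom_boxplot() +
  stat_summary(fun = mean, geom = "point", shape = 18, size = 3, colour = "red") +
  scale_x_discrete(labels = c("0" = "Non-promo", "1" = "Promo")) +
  labs(x = "Email content type", y = "Orders")

print(p_box)
```

Replacing `toy` with `data` would apply the same layers to the real file, provided the column names match.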
## Scatter plots between X and Y
1. Scatter plot
2. Relationship description

The three graphs each plot one of the three x variables against the y variable. The scatter plots show no apparent linear relationship between any x variable and y. All three plots share a very similar pattern: most of the data points lie near the origin.
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
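The three `geom_smooth()` messages above indicate that a smoothing method was set without an explicit formula. As a hedged sketch of how the three scatter plots might be built, the code below assumes the x variables are `sends`, `opens`, and `clicks`, that y is `orders`, and that the smoother is a linear fit. None of these choices is confirmed by the report, and the data frame `toy` holds made-up values. Passing `formula = y ~ x` explicitly silences the message, and a Pearson correlation per x variable gives a numeric check on the visual impression of linearity.

```r
library(ggplot2)

# Toy stand-in for `data`: made-up values, same column names as the real file
set.seed(2)
toy <- data.frame(
  sends  = rpois(200, lambda = 40),
  opens  = rpois(200, lambda = 5),
  clicks = rpois(200, lambda = 2),
  orders = rpois(200, lambda = 0.1)
)

# Assumed x variables; the report does not name them in this section
x_vars <- c("sends", "opens", "clicks")

# One scatter plot per x variable, each with a linear fit overlaid.
# Stating the formula explicitly avoids the "using formula 'y ~ x'" message.
plots <- lapply(x_vars, function(v) {
  ggplot(toy, aes(x = .data[[v]], y = orders)) +
    geom_point(alpha = 0.4) +
    geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
    labs(x = v, y = "orders")
})
names(plots) <- x_vars

# Pearson correlation of each x variable with orders
cors <- sapply(x_vars, function(v) cor(toy[[v]], toy$orders))
print(cors)
```

Each plot can be displayed with, for example, `print(plots$opens)`. Correlations near zero would be consistent with the description above, although a weak linear correlation alone does not rule out a nonlinear relationship.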
## Multivariate plot

1. Scatter plot
2. Relationship description

The scatter plot below shows the x variable opens against the y variable orders, grouped by a second x variable, content type. The pattern is very similar for the two email content types: the data points cluster near the origin, and many values are zero. The differences between the two content types are also clear: non-promo emails have more opens, while promo emails have more orders.
## `geom_smooth()` using formula 'y ~ x'
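A plot of this kind could be sketched as follows. The original plotting code is not included in the report, so this version assumes that content type is mapped to the `colour` aesthetic and that a linear smoother is fitted per group. The data frame `toy` is again a made-up stand-in. The `aggregate()` call at the end is one way to check the comparison of opens and orders between content types numerically rather than by eye.

```r
library(ggplot2)

# Toy stand-in for `data`: made-up values, same column names as the real file
set.seed(3)
toy <- data.frame(
  ContentType = factor(rep(c(0, 1), each = 100)),
  opens       = rpois(200, lambda = 5),
  orders      = rpois(200, lambda = 0.1)
)

# opens vs. orders, with points and fitted lines coloured by content type
p_multi <- ggplot(toy, aes(x = opens, y = orders, colour = ContentType)) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  labs(x = "opens", y = "orders", colour = "ContentType")

print(p_multi)

# Total opens and orders per content type, as a numeric cross-check
totals <- aggregate(cbind(opens, orders) ~ ContentType, data = toy, FUN = sum)
print(totals)
```

On the real data, comparing the two rows of `totals` would show directly whether one content type has more opens and the other more orders.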