This is my practical exam report for DataCamp’s Data Analyst Professional Certificate.

Data source

Data for the Pens and Printers company is imported from “https://s3.amazonaws.com/talent-assets.datacamp.com/product_sales.csv”.

Summary of data

The data set consists of 15,000 rows and 8 columns: ‘week’, ‘sales_method’, ‘customer_id’, ‘nb_sold’, ‘revenue’, ‘years_as_customer’, ‘nb_site_visits’ and ‘state’.

‘week’ ranges from 1 to 6. There are 3 sales methods: “Email”, “Call” and the combination “Email + Call”. However, 23 orders are labelled “em + call”, which most likely means “Email + Call”, so they are converted to “Email + Call”. Likewise, 10 orders labelled “email” are converted to “Email”.

There are 15,000 distinct customers from 50 states.

There are 151,270 products sold during the 6 weeks of the sales campaign, with a min of 7, mean of 10.08, median of 10 and max of 16 per customer.

The total revenue is 1,308,138 with min of 32.54, median of 89.50, mean of 93.93 and max of 238.32.

There are 1,074 missing values in ‘revenue’, about 7.2% of rows, so they will be replaced by the mean of ‘revenue’, 93.93.

‘years_as_customer’ has a min of 0, median of 3, mean of 4.966 and max of 63.

The total of ‘nb_site_visits’ is 374,863, with a min of 12, median of 25, mean of 24.99 and max of 41.

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)

df <- read_csv("https://s3.amazonaws.com/talent-assets.datacamp.com/product_sales.csv")
## Rows: 15000 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): sales_method, customer_id, state
## dbl (5): week, nb_sold, revenue, years_as_customer, nb_site_visits
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(df,5)
## # A tibble: 5 × 8
##    week sales_method customer_id               nb_sold revenue years_as_customer
##   <dbl> <chr>        <chr>                       <dbl>   <dbl>             <dbl>
## 1     2 Email        2e72d641-95ac-497b-bbf8-…      10    NA                   0
## 2     6 Email + Call 3998a98d-70f5-44f7-942e-…      15   225.                  1
## 3     5 Call         d1de9884-8059-4065-b10f-…      11    52.6                 6
## 4     4 Email        78aa75a4-ffeb-4817-b1d0-…      11    NA                   3
## 5     3 Email        10e6d446-10a5-42e5-8210-…       9    90.5                 0
## # ℹ 2 more variables: nb_site_visits <dbl>, state <chr>
View(df)
glimpse(df)
## Rows: 15,000
## Columns: 8
## $ week              <dbl> 2, 6, 5, 4, 3, 6, 4, 1, 5, 5, 3, 2, 5, 2, 5, 4, 2, 6…
## $ sales_method      <chr> "Email", "Email + Call", "Call", "Email", "Email", "…
## $ customer_id       <chr> "2e72d641-95ac-497b-bbf8-4861764a7097", "3998a98d-70…
## $ nb_sold           <dbl> 10, 15, 11, 11, 9, 13, 11, 10, 11, 11, 9, 9, 11, 10,…
## $ revenue           <dbl> NA, 225.47, 52.55, NA, 90.49, 65.01, 113.38, 99.94, …
## $ years_as_customer <dbl> 0, 1, 6, 3, 0, 10, 9, 1, 10, 7, 4, 2, 2, 1, 1, 2, 6,…
## $ nb_site_visits    <dbl> 24, 28, 26, 25, 28, 24, 28, 22, 31, 23, 28, 23, 30, …
## $ state             <chr> "Arizona", "Kansas", "Wisconsin", "Indiana", "Illino…
summary(df)
##       week       sales_method       customer_id           nb_sold     
##  Min.   :1.000   Length:15000       Length:15000       Min.   : 7.00  
##  1st Qu.:2.000   Class :character   Class :character   1st Qu.: 9.00  
##  Median :3.000   Mode  :character   Mode  :character   Median :10.00  
##  Mean   :3.098                                         Mean   :10.08  
##  3rd Qu.:5.000                                         3rd Qu.:11.00  
##  Max.   :6.000                                         Max.   :16.00  
##                                                                       
##     revenue       years_as_customer nb_site_visits     state          
##  Min.   : 32.54   Min.   : 0.000    Min.   :12.00   Length:15000      
##  1st Qu.: 52.47   1st Qu.: 1.000    1st Qu.:23.00   Class :character  
##  Median : 89.50   Median : 3.000    Median :25.00   Mode  :character  
##  Mean   : 93.93   Mean   : 4.966    Mean   :24.99                     
##  3rd Qu.:107.33   3rd Qu.: 7.000    3rd Qu.:27.00                     
##  Max.   :238.32   Max.   :63.000    Max.   :41.00                     
##  NA's   :1074
str(df)
## spc_tbl_ [15,000 × 8] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ week             : num [1:15000] 2 6 5 4 3 6 4 1 5 5 ...
##  $ sales_method     : chr [1:15000] "Email" "Email + Call" "Call" "Email" ...
##  $ customer_id      : chr [1:15000] "2e72d641-95ac-497b-bbf8-4861764a7097" "3998a98d-70f5-44f7-942e-789bb8ad2fe7" "d1de9884-8059-4065-b10f-86eef57e4a44" "78aa75a4-ffeb-4817-b1d0-2f030783c5d7" ...
##  $ nb_sold          : num [1:15000] 10 15 11 11 9 13 11 10 11 11 ...
##  $ revenue          : num [1:15000] NA 225.5 52.5 NA 90.5 ...
##  $ years_as_customer: num [1:15000] 0 1 6 3 0 10 9 1 10 7 ...
##  $ nb_site_visits   : num [1:15000] 24 28 26 25 28 24 28 22 31 23 ...
##  $ state            : chr [1:15000] "Arizona" "Kansas" "Wisconsin" "Indiana" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   week = col_double(),
##   ..   sales_method = col_character(),
##   ..   customer_id = col_character(),
##   ..   nb_sold = col_double(),
##   ..   revenue = col_double(),
##   ..   years_as_customer = col_double(),
##   ..   nb_site_visits = col_double(),
##   ..   state = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>
table(df$sales_method)
## 
##         Call    em + call        email        Email Email + Call 
##         4962           23           10         7456         2549
df %>% summarize(n_customers = n_distinct(customer_id))
## # A tibble: 1 × 1
##   n_customers
##         <int>
## 1       15000

Cleaning data

After replacing the incorrect values in ‘sales_method’, there are 3 sales methods, “Email”, “Call” and “Email + Call”, and no more “em + call” or “email” labels.

I replace the missing values of ‘revenue’ with the mean of 93.93. The total ‘revenue’ is now 1,409,019 instead of 1,308,138, and the median is 91.86 instead of 89.50.
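As a quick arithmetic check of these figures, the sketch below recomputes the share of missing rows and the total after imputation from the numbers reported in this section. The variable names are illustrative and not part of the original analysis.

```r
# Figures taken from the data summary in this report
n_rows       <- 15000
n_missing    <- 1074
mean_revenue <- 93.93
total_before <- 1308138   # sum of the non-missing revenue values

# Share of rows with missing revenue, in percent
pct_missing <- round(100 * n_missing / n_rows, 1)

# Each imputed row adds the mean to the total
total_after <- total_before + n_missing * mean_revenue

pct_missing          # 7.2
round(total_after)   # 1409019
```

Mean imputation leaves the mean unchanged at 93.93 but shifts the median, because 1,074 identical values are added near the centre of the distribution.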

The Pens and Printers company was founded in 1984, but the summary shows a max ‘years_as_customer’ of 63, which must be an error. Another customer has a ‘years_as_customer’ of 47. I replace these erroneous values with 5, the mean of the column (4.966) rounded to a whole year.
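A minimal sketch of how such values could be flagged from the founding year rather than from a hand-picked cut-off. It uses a small made-up data frame, and `reference_year` is an assumption, since the report does not state when the data were collected.

```r
founding_year  <- 1984
reference_year <- 2024                             # assumption: year the data were collected
max_plausible  <- reference_year - founding_year   # 40 under this assumption

# Toy data standing in for the real 'years_as_customer' column
toy <- data.frame(years_as_customer = c(0, 3, 63, 7, 47, 1))

# TRUE where the tenure exceeds the age of the company
toy$implausible <- toy$years_as_customer > max_plausible

# Replace flagged values with the rounded mean of the plausible ones
plausible_mean  <- round(mean(toy$years_as_customer[!toy$implausible]))
toy$years_clean <- ifelse(toy$implausible, plausible_mean, toy$years_as_customer)

toy
```

Deriving the threshold from the founding year keeps the rule explicit, and the flag column records which rows were changed.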

df_1 <- df %>% mutate(sales_method = case_when(
  sales_method == "em + call" ~ "Email + Call", 
  sales_method == "email" ~ "Email", TRUE ~ sales_method))

df_2 <- df_1 %>% mutate(revenue = replace_na(revenue, 93.93) )
table(df_2$sales_method)  # check if errors are corrected
## 
##         Call        Email Email + Call 
##         4962         7466         2572
summary(df_2$revenue)    # check if errors are corrected
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   32.54   53.04   91.86   93.93  106.07  238.32
df_e <- df_2 %>% filter(years_as_customer > 40)
head(df_e) # check whether implausible values occur
## # A tibble: 2 × 8
##    week sales_method customer_id               nb_sold revenue years_as_customer
##   <dbl> <chr>        <chr>                       <dbl>   <dbl>             <dbl>
## 1     2 Email        18919515-a618-430c-9a05-…      10    97.2                63
## 2     4 Call         2ea97d34-571d-4e1b-95be-…      10    50.5                47
## # ℹ 2 more variables: nb_site_visits <dbl>, state <chr>
# final cleaned data
df_cln <- df_2 %>% mutate( years_as_customer = ifelse(years_as_customer >40, 5, years_as_customer) )
summary(df_cln)
##       week       sales_method       customer_id           nb_sold     
##  Min.   :1.000   Length:15000       Length:15000       Min.   : 7.00  
##  1st Qu.:2.000   Class :character   Class :character   1st Qu.: 9.00  
##  Median :3.000   Mode  :character   Mode  :character   Median :10.00  
##  Mean   :3.098                                         Mean   :10.08  
##  3rd Qu.:5.000                                         3rd Qu.:11.00  
##  Max.   :6.000                                         Max.   :16.00  
##     revenue       years_as_customer nb_site_visits     state          
##  Min.   : 32.54   Min.   : 0.000    Min.   :12.00   Length:15000      
##  1st Qu.: 53.04   1st Qu.: 1.000    1st Qu.:23.00   Class :character  
##  Median : 91.86   Median : 3.000    Median :25.00   Mode  :character  
##  Mean   : 93.93   Mean   : 4.959    Mean   :24.99                     
##  3rd Qu.:106.07   3rd Qu.: 7.000    3rd Qu.:27.00                     
##  Max.   :238.32   Max.   :39.000    Max.   :41.00

Graphic summary of data

Number of customers by sales method. ‘Email’ only is the method used to contact the most customers, followed by ‘Call’ and then the combination ‘Email + Call’. However, the number of customers contacted by ‘Email’ only gradually decreased over the weeks.

The numbers of customers by sales method are as follows:

Call: 4962 (33.08%)

Email: 7466 (49.77%)

Email + Call: 2572 (17.15%)
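The percentages above can be reproduced from the counts alone. The short sketch below does this with base R; the vector names are illustrative.

```r
# Customer counts per sales method, as reported in this section
counts <- c("Call" = 4962, "Email" = 7466, "Email + Call" = 2572)

# Share of each method in percent, rounded to two decimals
shares <- round(100 * prop.table(counts), 2)
shares
```

On the cleaned data, the same table could be obtained with `df_cln %>% count(sales_method) %>% mutate(pct = 100 * n / sum(n))`.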

df_cln %>% ggplot(aes(sales_method, fill = sales_method)) + 
  geom_bar(stat = "count") +
  scale_y_continuous(labels = scales::comma) +
  labs(x= "Sale Methods", y="Number of customers", 
  title = "Number of Customers By Sale Methods", subtitle = "Data source: Pens and Printers Company") +
  # label each bar with its count (after_stat() replaces the deprecated ..count..)
  stat_count(geom = "text", colour = "white", size = 3.5,
             aes(label = after_stat(count)), position = position_stack(vjust = 0.5))

The number of customers decreases by week.

df_cln %>% 
  ggplot(aes(week, fill = factor(week))) + 
  geom_bar(stat = "count") +
  scale_x_continuous(breaks = 1:6, labels = 1:6) +
  scale_y_continuous(labels = scales::comma) +
  labs(x= "Week", y="Number of customers", 
  title = "Number of Customers By Week", subtitle = "Data source: Pens and Printers Company") +
  # label each bar with its count (after_stat() replaces the deprecated ..count..)
  stat_count(geom = "text", colour = "white", size = 3.5,
             aes(label = after_stat(count)), position = position_stack(vjust = 0.5))

The number of customers contacted by all sales methods decreases sharply in the 6th week. As a result, the number of orders and the total revenue drop accordingly.

The number of customers by sales method changes differently over the weeks. The number of customers contacted by “Email” decreases steadily until the last week. The numbers of customers contacted by “Call” and “Email + Call” increase steadily until week 5, then drop suddenly in week 6. The combined method follows the same pattern as “Call”, but at a lower level.

df_cln %>% ggplot(aes(week, fill= sales_method)) + 
  geom_bar(position = position_dodge(width = 0.8)) +
  scale_x_continuous(breaks = 1:6, labels = 1:6) +
  scale_y_continuous(labels = scales::comma) +
  labs(x= "Week", y="Number of customers", 
  title = "Number of Customers By Sale Methods and Week",
  subtitle = "Data source: Pens and Printers Company")

The number of products sold increases sharply in weeks 4 and 5, even though the number of customers is not increasing. In these weeks, the number of contacts by the combination of email and call increased.

week_sold <- df_cln %>%
  group_by(week) %>%
  summarize(num_sold = sum(nb_sold))

ggplot(week_sold, aes(week, num_sold)) +
  geom_line(color = "blue", linewidth = 0.7) +   # linewidth replaces the deprecated size for lines
  scale_x_continuous(breaks = 1:6, labels = 1:6) +
  scale_y_continuous(labels = scales::comma) +
  labs(x= "Week", y="Number of products sold", title = "Number of Products Sold By Week",
       subtitle = "Data source: Pens and Printers Company")

The number of site visits by week follows the same pattern as the number of products sold by week.

wk_visits <- df_cln %>%
  group_by(week) %>%
  summarize(num_visits = sum(nb_site_visits))

ggplot(wk_visits, aes(week, num_visits)) +
  geom_line(color = "green", linewidth = 0.7) + 
  scale_x_continuous(breaks = 1:6, labels = 1:6) +
  scale_y_continuous(labels = scales::comma) +
  labs(x= "Week", y="Site visits", title = "Site Visits By Week",
       subtitle = "Data source: Pens and Printers Company")

The mean number of years as a customer decreases by week. Most loyal customers, with 5 or more years as customers, placed their orders in the first 3 weeks of the sales campaign.

wk_years <- df_cln %>%
  group_by(week) %>%
  summarize(mean_years = mean(years_as_customer))

ggplot(wk_years, aes(week, mean_years)) +
  geom_line(color = "pink", linewidth = 1) +   # alpha must lie in [0, 1], so alpha = 3 is dropped
  scale_x_continuous(breaks = 1:6, labels = 1:6) +
  labs(x= "Week", y="Mean years as customers", title = "Mean Years as Customer By Week",
       subtitle = "Data source: Pens and Printers Company")

Number of products sold by state.

Top 20 states by number of products bought.

California, Texas, New York, Florida, Illinois, Pennsylvania, Ohio, Michigan, Georgia and North Carolina are the top 10 states by number of products bought.

st_sold <- df_cln %>%
  group_by(state) %>%
  summarize(sum_state_sold = sum(nb_sold)) %>%
  arrange(desc(sum_state_sold))
st_sold[1:20,]
## # A tibble: 20 × 2
##    state          sum_state_sold
##    <chr>                   <dbl>
##  1 California              18859
##  2 Texas                   11957
##  3 New York                 9734
##  4 Florida                  9201
##  5 Illinois                 6143
##  6 Pennsylvania             5979
##  7 Ohio                     5699
##  8 Michigan                 4998
##  9 Georgia                  4930
## 10 North Carolina           4559
## 11 New Jersey               4338
## 12 Virginia                 3790
## 13 Indiana                  3558
## 14 Washington               3424
## 15 Tennessee                3414
## 16 Arizona                  3238
## 17 Missouri                 3122
## 18 Massachusetts            2913
## 19 Maryland                 2669
## 20 Wisconsin                2528

Revenue by state.

Top 20 states that bring in the most revenue for the company.

California, Texas, New York, Florida, Illinois, Pennsylvania, Ohio, Michigan, Georgia and North Carolina are the top 10 states by revenue.

st_revenue <- df_cln %>%
  group_by(state) %>%
  summarize(sum_state_revenue = sum(revenue)) %>%
  arrange(desc(sum_state_revenue))
st_revenue[1:20,]
## # A tibble: 20 × 2
##    state          sum_state_revenue
##    <chr>                      <dbl>
##  1 California               173534.
##  2 Texas                    113621.
##  3 New York                  89442.
##  4 Florida                   84978.
##  5 Illinois                  56500.
##  6 Pennsylvania              55822.
##  7 Ohio                      52332.
##  8 Michigan                  47431.
##  9 Georgia                   46150.
## 10 North Carolina            41142.
## 11 New Jersey                39533.
## 12 Virginia                  36192.
## 13 Indiana                   33160.
## 14 Washington                32841.
## 15 Tennessee                 30701.
## 16 Arizona                   29643.
## 17 Missouri                  28208.
## 18 Massachusetts             27480.
## 19 Maryland                  24480.
## 20 Wisconsin                 23680.

Revenue by week follows a pattern similar to that of the number of products sold and the number of site visits.

wk_revenue <- df_cln %>%
  group_by(week) %>%
  summarize(total_revenue = sum(revenue))

ggplot(wk_revenue, aes(week, total_revenue)) +
  geom_line(color = "red", linewidth = 0.7) + 
  scale_x_continuous(breaks = 1:6, labels = 1:6) +
  scale_y_continuous(labels = scales::comma)+
  labs(x= "Week", y="Revenue", title = "Revenue By Week",
       subtitle = "Data source: Pens and Printers Company")

Total revenue is 1,409,019. It is high in the first week, decreases until week 3, then increases again until week 5. It drops sharply in week 6.

Revenue by the methods:

“Email”: 723,415.8

“Email + Call”: 441,038.3

“Call”: 244,564.8

Total revenue by week and sales method. The revenue from the email method decreases sharply over time. The revenue from the call method does not increase significantly.

The graph shows that the combination of email and call is the sales method that best increases revenue. Therefore, this sales method, emails combined with calls, is recommended for the company.
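Because the three groups differ a lot in size, it helps to look at revenue per customer as well as totals. The sketch below divides the revenue totals by the customer counts reported in this document; the vector names are illustrative, and the totals include the imputed revenue values.

```r
# Totals and customer counts per method, taken from the figures in this report
revenue   <- c("Call" = 244564.8, "Email" = 723415.8, "Email + Call" = 441038.3)
customers <- c("Call" = 4962,     "Email" = 7466,     "Email + Call" = 2572)

# Average revenue per customer for each method
revenue_per_customer <- round(revenue / customers, 2)
revenue_per_customer
```

Under these figures, “Email” has the largest total mainly because it reached the most customers, while “Email + Call” earns the most per customer.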

sum(df_cln$revenue)
## [1] 1409019
df_cln %>% group_by(sales_method) %>% 
  summarize(sum_revenue = sum(revenue))
## # A tibble: 3 × 2
##   sales_method sum_revenue
##   <chr>              <dbl>
## 1 Call             244565.
## 2 Email            723416.
## 3 Email + Call     441038.
wk_sale_revenue <- df_cln %>%
group_by(week, sales_method) %>%
  summarize(total_revenue = sum(revenue), .groups = 'drop')

ggplot(wk_sale_revenue, aes(week, total_revenue, color = sales_method)) +
  geom_line() + 
  scale_x_continuous(breaks = 1:6, labels = 1:6) +
  scale_y_continuous(labels = scales::comma)+
labs(x= "Week", y="Revenue", title = "Revenue By Week and Sale Methods",
       subtitle = "Data source: Pens and Printers Company")

I consider revenue the main metric for monitoring the company’s performance, so I investigate the effect of the sales methods on revenue further.

The difference in revenue between the sales methods is significant (ANOVA, p-value < 2e-16).

Using the TukeyHSD test, the largest difference in revenue is between the combined approach of email plus call and call only (122.19), followed by the combined method versus email only (74.58), then email only versus call only (47.61).

The ‘Email + Call’ method, with its much higher revenue, is what makes the overall numbers of products sold and ‘revenue’ increase in weeks 4 and 5. In these weeks, ‘Email’ contacts decrease and ‘Call’ contacts increase a little.

Therefore, the combined ‘Email + Call’ and ‘Email’ only are better approaches than ‘Call’ only.

model1 <- aov(revenue ~ sales_method, data = df_cln)
summary(model1)
##                 Df   Sum Sq  Mean Sq F value Pr(>F)    
## sales_method     2 25421358 12710679   32246 <2e-16 ***
## Residuals    14997  5911407      394                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
TukeyHSD(model1)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = revenue ~ sales_method, data = df_cln)
## 
## $sales_method
##                         diff       lwr       upr p adj
## Email-Call          47.60714  46.75479  48.45949     0
## Email + Call-Call  122.18922 121.05855 123.31990     0
## Email + Call-Email  74.58208  73.51811  75.64606     0

Regression models

Revenue is the dependent variable in a regression model that investigates the effect of the independent variables on revenue.

Before building the model, I evaluate the correlation between variables.

The results show no strong correlation among the independent variables (the variables other than ‘revenue’).

library(GGally)
## Warning: package 'GGally' was built under R version 4.3.2
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
df_cor <- df_cln %>% select(-customer_id, - state, - week) 
ggpairs(df_cor)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The glm model will be built with ‘revenue’ as the dependent variable and the 4 other variables above as independent variables.

The summary results show that ‘nb_site_visits’ and ‘years_as_customer’ are not significant, with p-values > 0.05.

AIC: 125284

Therefore, a shorter model will be built without these two variables.

model2 <- glm(revenue ~ sales_method + nb_sold + nb_site_visits + years_as_customer, data = df_cln, family = "gaussian")
summary(model2)
## 
## Call:
## glm(formula = revenue ~ sales_method + nb_sold + nb_site_visits + 
##     years_as_customer, family = "gaussian", data = df_cln)
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              -25.64682    1.06564 -24.067   <2e-16 ***
## sales_methodEmail         45.87221    0.28915 158.644   <2e-16 ***
## sales_methodEmail + Call 100.65211    0.44627 225.543   <2e-16 ***
## nb_sold                    7.96411    0.09449  84.282   <2e-16 ***
## nb_site_visits            -0.03557    0.04221  -0.843    0.399    
## years_as_customer          0.01296    0.02580   0.502    0.615    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 248.1067)
## 
##     Null deviance: 31332766  on 14999  degrees of freedom
## Residual deviance:  3720113  on 14994  degrees of freedom
## AIC: 125284
## 
## Number of Fisher Scoring iterations: 2

The new glm model is slightly better, with a smaller AIC of 125281.
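For a Gaussian model, the AIC that R reports can be recomputed from the residual deviance, which makes the comparison easy to check by hand. The helper `gaussian_aic` below is a hypothetical name introduced for illustration; the inputs are the residual deviances and coefficient counts printed in the model summaries in this report.

```r
# AIC of a Gaussian linear model from its residual deviance (RSS).
# n: number of rows, k: number of estimated coefficients.
# The "+ 1" counts the residual variance as an extra parameter.
gaussian_aic <- function(rss, n, k) {
  n * (log(2 * pi * rss / n) + 1) + 2 * (k + 1)
}

n <- 15000
# Model with sales_method, nb_sold, nb_site_visits and years_as_customer
aic_full  <- gaussian_aic(rss = 3720113, n = n, k = 6)
# Model with sales_method and nb_sold only
aic_short <- gaussian_aic(rss = 3720351, n = n, k = 4)

round(c(full = aic_full, short = aic_short))
aic_full - aic_short   # about 3
```

A gap of about 3 AIC points is modest, so the two models fit almost equally well, and the shorter one is preferred mainly because it is simpler.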

model3 <- glm(revenue ~ sales_method + nb_sold, data = df_cln, family = "gaussian")
summary(model3)
## 
## Call:
## glm(formula = revenue ~ sales_method + nb_sold, family = "gaussian", 
##     data = df_cln)
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              -26.07568    0.83252  -31.32   <2e-16 ***
## sales_methodEmail         45.86657    0.28909  158.66   <2e-16 ***
## sales_methodEmail + Call 100.66652    0.44599  225.72   <2e-16 ***
## nb_sold                    7.92490    0.08433   93.98   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 248.0896)
## 
##     Null deviance: 31332766  on 14999  degrees of freedom
## Residual deviance:  3720351  on 14996  degrees of freedom
## AIC: 125281
## 
## Number of Fisher Scoring iterations: 2

However, the variable ‘nb_sold’ has only a modest effect on ‘revenue’ (estimated coefficient between 7.92 and 7.96 in the two models above). Moreover, it is fairly strongly correlated with ‘revenue’, r = 0.662. It is not something the sales team directly controls; it acts as a covariate of ‘revenue’.

Therefore, the model I want to build uses ‘sales_method’ only.

AIC: 132225, bigger than that of the previous models; however, this model is more practical.

model4 <- glm(revenue ~ sales_method , data = df_cln, family = "gaussian")
summary(model4)
## 
## Call:
## glm(formula = revenue ~ sales_method, family = "gaussian", data = df_cln)
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               49.2875     0.2818   174.9   <2e-16 ***
## sales_methodEmail         47.6071     0.3636   130.9   <2e-16 ***
## sales_methodEmail + Call 122.1892     0.4824   253.3   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 394.1727)
## 
##     Null deviance: 31332766  on 14999  degrees of freedom
## Residual deviance:  5911407  on 14997  degrees of freedom
## AIC: 132225
## 
## Number of Fisher Scoring iterations: 2
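
With a single categorical predictor, the intercept is the mean of the reference level (here “Call”) and each other coefficient is the difference from that mean. The sketch below shows this on a small made-up data set; the revenue values are invented for illustration only.

```r
# Made-up data: three sales methods with two orders each
toy <- data.frame(
  sales_method = rep(c("Call", "Email", "Email + Call"), each = 2),
  revenue      = c(48, 52, 95, 99, 170, 174)
)

# Gaussian glm with one categorical predictor, as in model4
fit <- glm(revenue ~ sales_method, data = toy, family = "gaussian")

# Group means computed directly
group_means <- tapply(toy$revenue, toy$sales_method, mean)

coef(fit)      # intercept = mean of "Call"; other terms = differences from it
group_means    # 50, 97, 172
```

Read the same way, model4 implies mean revenues of 49.2875 for “Call”, 49.2875 + 47.6071 = 96.8946 for “Email” and 49.2875 + 122.1892 = 171.4767 for “Email + Call”.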

Conclusion

There are 15,000 customers involved in this campaign: 7,466 for “Email” only, 4,962 for “Call” only and 2,572 for the combined “Email + Call”.

Total revenue is 1,409,019. It is high in the first week, decreases until week 3, then increases again until week 5. It drops sharply in week 6.

Revenue by the methods:

“Email”: 723,415.8

“Email + Call”: 441,038.3

“Call”: 244,564.8

Revenue from “Email” only is high in the first week, then decreases steadily until the last week of the campaign.

Revenue from “Call” only is low in the first week, then increases slowly and drops back in the last week.

Revenue from the combined sales method, “Email + Call”, is low in the first week, then increases following the pattern of the “Call” only method. This group has fewer customers than “Call” only, but its total revenue is higher (441,038.3 versus 244,564.8).

In week 6, the revenue from all sales methods drops quickly.

The numbers of customers differ considerably across the three sales methods. The combined sales method group is especially small, but the revenue from this group is fairly high.

The top 10 states by revenue for the company are California, Texas, New York, Florida, Illinois, Pennsylvania, Ohio, Michigan, Georgia and North Carolina.

The combined “Email + Call” and “Email” only are the most valuable sales methods, bringing in the most revenue as well as the most products sold for the company.

Email only is the method recommended for continued use.

More data is needed to confirm the superiority of the combined ‘Email + Call’ method.