library(car)
library(DT)
library(tidyverse)
library(pander)
library(mosaic)
library(readr)
PestControl <- read_csv("Data/PestControl.csv")

Background

I work for a marketing company that markets for Terminix Pest Control. I take all the sales that have been sold in a day (in the marketing company’s system) and enter them into a system that Terminix can view. In my company’s system, I can view customers sales status. In other words, I can view if they have been canceled or serviced. In this analysis, I will look at the contract value of 249 sales and if they are serviced or canceled. I want to see if the contract value effects the cancellation rate. Sales reps want to sell higher contract values, because they get a percentage of the yearly contract value. However, customers don’t want to pay such a high value. That being said, the yearly contract values I am looking at are (in USD) 396 (99 quarterly), 436 (109 quarterly), 476 (119 quarterly), 516 (129 quarterly), and the few that have been sold of 556 (139 quarterly). For the status of the sale; 1 = serviced and 2 = canceled. My hypotheses are

\[ H_0: \beta_1 = 0 \]

\[ H_a: \beta_1 \neq 0 \] With a significance level of \(\alpha=0.05\).

This is stating that if the slope is equal to zero (the null hypothesis) then all of the contract values have the exact same cancellation rate. If the slope does not equal zero (the alternative hypothesis) then all of the contract values have different cancellation rates. I think the alternative hypothesis will be correct.

I will use a linear regression to model this.

\[ \underbrace{Y_i}_\text{Average contract value} = \overbrace{\beta_0}^\text{y-int} + \overbrace{\beta_1}^\text{slope} \underbrace{X_i}_\text{Cancellation rate} + \epsilon_i \quad \text{where} \ \epsilon_i \sim N(0, \sigma^2) \]

Data

datatable(PestControl)

Analysis

PestControl.lm <- lm(Price ~ Status, data = PestControl)
pander(summary(PestControl.lm))
  Estimate Std. Error t value Pr(>|t|)
(Intercept) 449.2 8.679 51.75 1.428e-134
Status -0.02354 6.589 -0.003573 0.9972
Fitting linear model: Price ~ Status
Observations Residual Std. Error \(R^2\) Adjusted \(R^2\)
249 44.72 5.168e-08 -0.004049

We can see from this test that the intercept is 449.2 and the slope is -0.02354 which is actually a slight decrease showing that the change in the average y-value decreases 0.02354 units for a one unit change in the x-value. But, this is not enough to be significant (because there is so little change in y). The slope has a p-value of 0.9972 so the data supports the alternative hypothesis. So our estimated equation is

\[ \underbrace{Y_i}_\text{Average contract value} = \overbrace{449.2}^\text{y-int} + \overbrace{−0.02354}^\text{slope} \underbrace{X_i}_\text{Cancellation rate} \]

The graph below shows a plot of the linear regression line.

plot(Price ~ Status, data = PestControl, col=c("firebrick", "plum4")[as.factor(Status)], xlab="Status (1=serviced, 2=canceled)", ylab="Average Contract Value", pch=21, bg="slategray2", main="Cancelation Rate of Pest Control Accounts", cex.main=1)
abline(PestControl.lm)

Appropriateness of the Regression

par(mfrow=c(1,3))
plot(PestControl.lm, which=1:2)
plot(PestControl.lm$residuals, main="Residuals vs Order", xlab="", ylab="Residuals")

It is clear that a linear regression test is not the most appropriate for analyzing this data because the residuals do not appear to be normal, so the error terms are also assumed to violate the normality assumption. There dramatic trends that are visible in the residual vs order plot, thus the independence assumption is violated.

Conclusion

The p-value obtained of 0.9972 gives us insufficient evidence to reject the null. In other words, our slope is zero and it seems that all contract values have about the same cancellation rate. We can also see that from the graph above.