Introduction

The analysis is based on the following scenario.

A car company established a new sales strategy. Instead of requiring customers to pay for a car in full at the time of purchase, it lets them pay a deposit and then make regular installments until the full cost is covered. After some time, customers start to default. A call center is then established to reach out to customers and persuade them to repay the amount due.
Three months after the calls were made, the company wishes to understand whether the calls had an impact on loan repayment.

Analysis

The data that I have generated contains three variables: reached, cumulative_paid and county. Reached is binary: Yes (coded as 1) for a person who received a call, and No (coded as 0) for a person who did not. Cumulative_paid is the total amount paid to date. County is the county where the buyer comes from.

A simple way to determine impact is to take the difference between the amount cumulatively paid by those who received a call and by those who did not. If the same loan amount was provided to everyone, this might work. However, if this was not the case, the data will be skewed, and a simple subtraction of means will give misleading results. We must first check the distribution of the data to identify the appropriate procedure for measuring impact.
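As a rough illustration of the naive comparison, the sketch below computes the difference in mean and in median cumulative payment between the two groups. The data frame `toy` and all its values are made up for illustration and are not the data analysed here; the single very large payment is there to show how one extreme value can dominate a difference in means.

```r
# Hypothetical toy data (not the article's data set): one very large
# payment in the "not reached" group mimics right skew.
toy <- data.frame(
  reached = c(0, 0, 0, 0, 1, 1, 1, 1),
  cumulative_paid = c(1000, 2000, 3000, 200000, 4000, 5000, 6000, 7000)
)

# Group means and medians of the amount paid
group_mean <- tapply(toy$cumulative_paid, toy$reached, mean)
group_median <- tapply(toy$cumulative_paid, toy$reached, median)

# Naive impact estimates: reached minus not reached
diff_mean <- unname(group_mean["1"] - group_mean["0"])
diff_median <- unname(group_median["1"] - group_median["0"])

print(c(diff_mean = diff_mean, diff_median = diff_median))
```

In this toy example the mean difference is negative while the median difference is positive, so the two summaries disagree even on the direction of the effect.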

We first import the data from a Stata file and view the first 10 rows.

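library(haven)   # read_dta() for Stata files
library(dplyr)   # mutate()
options(scipen = 100)   # prefer fixed over scientific notation when printing
data <- read_dta("D:/My projects/Stanley/data.dta")
attach(data)   # make the columns accessible by name
head(data, 10)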
## # A tibble: 10 x 3
##      reached cumulative_paid       county
##    <dbl+lbl>           <dbl>    <dbl+lbl>
##  1   0 [No]            75000 1 [County A]
##  2   0 [No]             5450 2 [County B]
##  3   0 [No]             5990 3 [County C]
##  4   0 [No]             1640 3 [County C]
##  5   0 [No]             1340 3 [County C]
##  6   0 [No]           150775 4 [County D]
##  7   0 [No]           201720 4 [County D]
##  8   1 [Yes]           88710 5 [County E]
##  9   0 [No]             2680 3 [County C]
## 10   1 [Yes]          103355 5 [County E]

We are interested in the amount that has been cumulatively paid. Let us check its distribution using a simple histogram. The graph shows that the data is skewed to the right. If the data had been normally distributed, we could have used the difference in means to measure impact. This is not the case here, so an alternative is to take the natural logarithm of the variable.

hist(data$cumulative_paid)

After taking the logarithm, skewness is no longer apparent in the graph below.

data <- mutate(data, logpaid = log(cumulative_paid))
hist(data$logpaid)
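A histogram is a visual check; a numeric summary can complement it. The sketch below is only an illustration on simulated data: the log-normal draws stand in for cumulative_paid, and `skewness_hat` is a hypothetical helper written here, not a base R function. It compares a simple moment-based skewness estimate before and after the log transformation.

```r
# Simulated right-skewed amounts (an assumption for illustration only;
# log-normal draws stand in for cumulative_paid).
set.seed(1)
sim_paid <- rlnorm(500, meanlog = 9, sdlog = 1.5)

# Moment-based sample skewness: mean of the cubed standardized values
# (skewness_hat is a hypothetical helper, not part of base R).
skewness_hat <- function(x) {
  z <- (x - mean(x)) / sd(x)
  mean(z^3)
}

skew_raw <- skewness_hat(sim_paid)       # expected to be clearly positive
skew_log <- skewness_hat(log(sim_paid))  # expected to be close to zero

print(c(raw = skew_raw, log = skew_log))
```

A value near zero after the transformation is consistent with a roughly symmetric distribution, which is what the second histogram suggests for the real data.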

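We now fit a log-linear model. The response variable is the natural logarithm of cumulative_paid, and the predictor of interest is reached (1 = Yes, 0 = No); county is included as an adjustment variable.

\[\begin{equation} \log(\mathrm{cumulative\_paid})=\beta_{0}+\beta_{1}\,\mathrm{reached}+\beta_{2}\,\mathrm{county} \end{equation}\]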

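To determine the impact, exponentiate the coefficient of reached from the model output: exp(0.1614) = 1.18. In other words, those who received a call are estimated to have repaid about 18% more than those who did not, after adjusting for the county where the buyer comes from. Note, however, that this coefficient is not statistically significant (p = 0.4064), so the data do not provide clear evidence that the calls had an effect.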

fit <- lm(logpaid ~ reached + county, data = data)
summary(fit)
## 
## Call:
## lm(formula = logpaid ~ reached + county, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.7937 -1.3288 -0.3422  1.3747  3.3528 
## 
## Coefficients:
##             Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)   9.0075     0.2650  33.985 <0.0000000000000002 ***
## reached       0.1614     0.1941   0.831              0.4064    
## county        0.2089     0.0808   2.586              0.0101 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.613 on 327 degrees of freedom
## Multiple R-squared:  0.02345,    Adjusted R-squared:  0.01747 
## F-statistic: 3.925 on 2 and 327 DF,  p-value: 0.02067
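
The back-transformation of the reached coefficient can be sketched as follows. The estimate, standard error and residual degrees of freedom are copied by hand from the output above; the variable names are chosen here for illustration. The sketch converts the coefficient into a percentage change and adds an approximate 95% confidence interval.

```r
# Values copied from the summary(fit) output above
beta_reached <- 0.1614   # estimate for reached
se_reached   <- 0.1941   # its standard error
df_resid     <- 327      # residual degrees of freedom

# Ratio of expected payments on the original scale, reached vs not reached
ratio <- exp(beta_reached)
pct_change <- 100 * (ratio - 1)

# 95% confidence interval on the log scale, then back-transformed
t_crit <- qt(0.975, df_resid)
ci_log <- beta_reached + c(-1, 1) * t_crit * se_reached
ci_pct <- 100 * (exp(ci_log) - 1)

print(round(c(pct_change = pct_change, lower = ci_pct[1], upper = ci_pct[2]), 1))
```

The interval includes zero, in line with the p-value of 0.4064 reported for reached. When the fitted object is available, `exp(coef(fit))` and `exp(confint(fit))` should give the same quantities directly.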