Using R, build a regression model for data that interests you. Conduct residual analysis. Was the linear model appropriate? Why or why not?

For this discussion, I will look at the Kaggle “Blood Transfusion Dataset”, available at https://www.kaggle.com/datasets/whenamancodes/blood-transfusion-dataset

The data come from the Blood Transfusion Service Center in Hsin-Chu City, Taiwan.

Data Set Information: To demonstrate the RFMTC marketing model (a modified version of RFM), this study adopted the donor database of the Blood Transfusion Service Center in Hsin-Chu City in Taiwan. The center sends its blood transfusion service bus to one university in Hsin-Chu City to collect donated blood about every three months. To build an RFMTC model, we selected 748 donors at random from the donor database. Each of these 748 donor records includes R (Recency - months since last donation), F (Frequency - total number of donations), M (Monetary - total blood donated in c.c.), T (Time - months since first donation), and a binary variable representing whether he/she donated blood in March 2007 (1 stands for donating blood; 0 stands for not donating blood).

Donate <- read.csv("https://raw.githubusercontent.com/marjete/605.week11.db/main/transfusion.csv")

summary(Donate)

##  Recency..months. Frequency..times. Monetary..c.c..blood. Time..months.  
##  Min.   : 0.000   Min.   : 1.000    Min.   :  250         Min.   : 2.00  
##  1st Qu.: 2.750   1st Qu.: 2.000    1st Qu.:  500         1st Qu.:16.00  
##  Median : 7.000   Median : 4.000    Median : 1000         Median :28.00  
##  Mean   : 9.507   Mean   : 5.515    Mean   : 1379         Mean   :34.28  
##  3rd Qu.:14.000   3rd Qu.: 7.000    3rd Qu.: 1750         3rd Qu.:50.00  
##  Max.   :74.000   Max.   :50.000    Max.   :12500         Max.   :98.00  
##  whether.he.she.donated.blood.in.March.2007
##  Min.   :0.000                             
##  1st Qu.:0.000                             
##  Median :0.000                             
##  Mean   :0.238                             
##  3rd Qu.:0.000                             
##  Max.   :1.000

colnames(Donate) <- c("Recency","Frequency","Monetary","Time", "Donated07")
head(Donate)

##   Recency Frequency Monetary Time Donated07
## 1       2        50    12500   98         1
## 2       0        13     3250   28         1
## 3       1        16     4000   35         1
## 4       2        20     5000   45         1
## 5       1        24     6000   77         0
## 6       4         4     1000    4         0
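
In the six rows printed above, Monetary is exactly 250 times Frequency (for example, 50 donations and 12500 c.c.), which suggests each donation is recorded as 250 c.c. If that holds for every row, Monetary and Frequency carry the same information as predictors. The check below is only a sketch: `head_rows` is a name introduced here for illustration, and it re-enters the printed rows by hand so the block runs on its own.

```r
# The six rows printed by head(Donate), re-entered by hand for illustration.
head_rows <- data.frame(
  Frequency = c(50, 13, 16, 20, 24, 4),
  Monetary  = c(12500, 3250, 4000, 5000, 6000, 1000)
)

# TRUE when every row satisfies Monetary = 250 * Frequency.
all(head_rows$Monetary == 250 * head_rows$Frequency)

# The same check on the full data frame, if it is loaded in the session.
if (exists("Donate")) {
  all(Donate$Monetary == 250 * Donate$Frequency)
}
```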

lm_Donate <- lm(Donated07 ~ Monetary, data=Donate)
summary(lm_Donate)

## 
## Call:
## lm(formula = Donated07 ~ Monetary, data = Donate)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.8520 -0.2298 -0.1819 -0.1659  0.8341 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1.500e-01  2.093e-02   7.165 1.87e-12 ***
## Monetary    6.382e-05  1.043e-05   6.120 1.51e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4161 on 746 degrees of freedom
## Multiple R-squared:  0.0478, Adjusted R-squared:  0.04652 
## F-statistic: 37.45 on 1 and 746 DF,  p-value: 1.515e-09

plot(Donated07 ~ Monetary, data=Donate)

abline(lm_Donate)

qqnorm(resid(lm_Donate))
qqline(resid(lm_Donate))
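
A residuals-versus-fitted plot is a common companion to the Q-Q plot. The sketch below assumes the renamed `Donate` data frame from above; if it is not in the session, it falls back to the six rows printed by head(Donate), which is enough to run the code but not to draw conclusions. The names `fit` and `res` are introduced here for illustration.

```r
# Fall back to the six rows printed by head(Donate) if the full data frame
# is not already in the session (assumption: for illustration only).
if (!exists("Donate")) {
  Donate <- data.frame(
    Recency   = c(2, 0, 1, 2, 1, 4),
    Frequency = c(50, 13, 16, 20, 24, 4),
    Monetary  = c(12500, 3250, 4000, 5000, 6000, 1000),
    Time      = c(98, 28, 35, 45, 77, 4),
    Donated07 = c(1, 1, 1, 1, 0, 0)
  )
}

lm_Donate <- lm(Donated07 ~ Monetary, data = Donate)
fit <- fitted(lm_Donate)
res <- resid(lm_Donate)

# Residuals vs fitted values: with a 0/1 response every point lies on one of
# two parallel lines, res = 0 - fit or res = 1 - fit.
plot(fit, res, xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs fitted")
abline(h = 0, lty = 2)

# Check whether any fitted value falls outside [0, 1], where it could not be
# read as a probability.
range(fit)
```

Because the response takes only the values 0 and 1, the two-band pattern appears regardless of how well Monetary predicts donation, so the plot speaks to the model form rather than to the predictor.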

Conclusion

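The goal was to investigate whether the total volume of blood donated in c.c. (Monetary) is useful for predicting whether someone donated blood in March 2007. The linear model does not seem to be appropriate. Donated07 is a binary (0/1) variable, so for any given value of Monetary the residuals can take only two values and cannot be normally distributed. In addition, although the Monetary coefficient is statistically significant (p = 1.51e-09), the model explains less than 5% of the variance in the response (R-squared = 0.0478).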
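
Because Donated07 is a 0/1 outcome, one commonly used alternative is logistic regression, which models the probability of donating and keeps predictions between 0 and 1. The following is a minimal sketch and not part of the original analysis: `glm_Donate` and `p_hat` are names introduced here, and the fallback data frame re-enters the six rows printed by head(Donate) only so that the block runs on its own.

```r
# Fall back to the six rows printed by head(Donate) if the full data frame
# is not already in the session (assumption: for illustration only).
if (!exists("Donate")) {
  Donate <- data.frame(
    Recency   = c(2, 0, 1, 2, 1, 4),
    Frequency = c(50, 13, 16, 20, 24, 4),
    Monetary  = c(12500, 3250, 4000, 5000, 6000, 1000),
    Time      = c(98, 28, 35, 45, 77, 4),
    Donated07 = c(1, 1, 1, 1, 0, 0)
  )
}

# Logistic regression: the log-odds of donating are modeled as linear in Monetary.
glm_Donate <- glm(Donated07 ~ Monetary, data = Donate, family = binomial)
summary(glm_Donate)

# Predicted probabilities of donating; these lie between 0 and 1.
p_hat <- predict(glm_Donate, type = "response")
range(p_hat)
```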

Yeh, I-Cheng, Yang, King-Jang, and Ting, Tao-Ming, “Knowledge discovery on RFM model using Bernoulli sequence,” Expert Systems with Applications, 2008 (doi:10.1016/j.eswa.2008.07.018).

605.db 11

Marjete Vucinaj

2022-10-30
