Using R, build a regression model for data that interests you. Conduct residual analysis. Was the linear model appropriate? Why or why not?
For this discussion, I will look at the Kaggle dataset “Blood Transfusion Dataset”: https://www.kaggle.com/datasets/whenamancodes/blood-transfusion-dataset
The data were taken from the Blood Transfusion Service Center in Hsin-Chu City, Taiwan.
Data Set Information: To demonstrate the RFMTC marketing model (a modified version of RFM), this study adopted the donor database of the Blood Transfusion Service Center in Hsin-Chu City in Taiwan. The center sends its blood transfusion service bus to one university in Hsin-Chu City to collect donated blood about every three months. To build an RFMTC model, 748 donors were selected at random from the donor database. Each of these 748 donor records includes R (Recency - months since last donation), F (Frequency - total number of donations), M (Monetary - total blood donated in c.c.), T (Time - months since first donation), and a binary variable representing whether he/she donated blood in March 2007 (1 stands for donating blood; 0 stands for not donating blood).
Donate <- read.csv("https://raw.githubusercontent.com/marjete/605.week11.db/main/transfusion.csv")
summary(Donate)
##  Recency..months. Frequency..times. Monetary..c.c..blood. Time..months.
##  Min.   : 0.000   Min.   : 1.000    Min.   :  250         Min.   : 2.00
##  1st Qu.: 2.750   1st Qu.: 2.000    1st Qu.:  500         1st Qu.:16.00
##  Median : 7.000   Median : 4.000    Median : 1000         Median :28.00
##  Mean   : 9.507   Mean   : 5.515    Mean   : 1379         Mean   :34.28
##  3rd Qu.:14.000   3rd Qu.: 7.000    3rd Qu.: 1750         3rd Qu.:50.00
##  Max.   :74.000   Max.   :50.000    Max.   :12500         Max.   :98.00
##  whether.he.she.donated.blood.in.March.2007
##  Min.   :0.000
##  1st Qu.:0.000
##  Median :0.000
##  Mean   :0.238
##  3rd Qu.:0.000
##  Max.   :1.000
colnames(Donate) <- c("Recency","Frequency","Monetary","Time", "Donated07")
head(Donate)
##   Recency Frequency Monetary Time Donated07
## 1       2        50    12500   98         1
## 2       0        13     3250   28         1
## 3       1        16     4000   35         1
## 4       2        20     5000   45         1
## 5       1        24     6000   77         0
## 6       4         4     1000    4         0
lm_Donate <- lm(Donated07 ~ Monetary, data=Donate)
summary(lm_Donate)
##
## Call:
## lm(formula = Donated07 ~ Monetary, data = Donate)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -0.8520 -0.2298 -0.1819 -0.1659  0.8341
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.500e-01  2.093e-02   7.165 1.87e-12 ***
## Monetary    6.382e-05  1.043e-05   6.120 1.51e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4161 on 746 degrees of freedom
## Multiple R-squared: 0.0478, Adjusted R-squared: 0.04652
## F-statistic: 37.45 on 1 and 746 DF, p-value: 1.515e-09
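# Scatterplot of the binary response against Monetary, with the fitted line
plot(Donated07 ~ Monetary, data = Donate)
abline(lm_Donate)
# Normal Q-Q plot of the residuals to check the normality assumption
qqnorm(resid(lm_Donate))
qqline(resid(lm_Donate))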
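A residuals-versus-fitted plot is a common companion to the Q-Q plot above. The following is a minimal sketch and not part of the original analysis. To stay self-contained, it rebuilds only the six rows printed by head(Donate), and the names demo, fit, res, fv and on_band are illustrative. With the full data, demo would simply be replaced by Donate.

```r
# Six rows copied from the head(Donate) output above (illustration only;
# use the full Donate data frame for the real check).
demo <- data.frame(
  Monetary  = c(12500, 3250, 4000, 5000, 6000, 1000),
  Donated07 = c(1, 1, 1, 1, 0, 0)
)

fit <- lm(Donated07 ~ Monetary, data = demo)
res <- resid(fit)    # observed minus fitted
fv  <- fitted(fit)   # fitted values from the linear model

# Residuals against fitted values, with a reference line at zero
plot(fv, res, xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs. fitted (demo rows)")
abline(h = 0, lty = 2)

# With a 0/1 response, every residual is either -fitted (when y = 0)
# or 1 - fitted (when y = 1), so the points lie on two parallel lines.
on_band <- abs(res + fv) < 1e-8 | abs(res - (1 - fv)) < 1e-8
all(on_band)
```

If the points line up in two bands rather than scattering randomly around zero, that pattern comes from the binary response itself and suggests the constant-variance and normality assumptions are not met.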
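The goal was to investigate whether the total volume of blood donated in c.c. (Monetary) is useful for predicting whether someone donated blood in March 2007. Monetary is statistically significant (p = 1.51e-09), but the linear model does not seem to be appropriate. The response is binary, so each residual equals either minus the fitted value or one minus the fitted value; the residuals therefore fall into two groups and cannot follow a normal distribution, which is the assumption the Q-Q plot checks. The model also explains very little of the variation (Multiple R-squared: 0.0478). A model designed for a 0/1 outcome, such as logistic regression, would be a better choice.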
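For a 0/1 response, logistic regression is the usual alternative to a straight-line fit. The sketch below is an assumption about how one might proceed, not something done in the original analysis. It again uses only the six rows printed by head(Donate) so that it runs on its own, and demo, logit_fit, new_donors and p_hat are illustrative names.

```r
# Six rows copied from the head(Donate) output above (illustration only;
# use the full Donate data frame for a real fit).
demo <- data.frame(
  Monetary  = c(12500, 3250, 4000, 5000, 6000, 1000),
  Donated07 = c(1, 1, 1, 1, 0, 0)
)

# Logistic regression: models the log-odds of donating as linear in Monetary
logit_fit <- glm(Donated07 ~ Monetary, data = demo, family = binomial)

# Predicted probabilities for a few hypothetical donation totals (in c.c.)
new_donors <- data.frame(Monetary = c(250, 1000, 12500))
p_hat <- predict(logit_fit, newdata = new_donors, type = "response")
p_hat
```

Because the predictions pass through the logistic function, they always stay strictly between 0 and 1, which a linear model does not guarantee. With only six rows the estimates themselves are not meaningful; the point is the model form.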
Yeh, I-Cheng, Yang, King-Jang, and Ting, Tao-Ming, “Knowledge discovery on RFM model using Bernoulli sequence,” Expert Systems with Applications, 2008 (doi:10.1016/j.eswa.2008.07.018).