Packages Required:

library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
blood_transfusion <- read.csv("~/Desktop/BANA 4137/project /blood_transfusion.csv")
summary(blood_transfusion)
##     Recency         Frequency         Monetary          Time      
##  Min.   : 0.000   Min.   : 1.000   Min.   :  250   Min.   : 2.00  
##  1st Qu.: 2.750   1st Qu.: 2.000   1st Qu.:  500   1st Qu.:16.00  
##  Median : 7.000   Median : 4.000   Median : 1000   Median :28.00  
##  Mean   : 9.507   Mean   : 5.515   Mean   : 1379   Mean   :34.28  
##  3rd Qu.:14.000   3rd Qu.: 7.000   3rd Qu.: 1750   3rd Qu.:50.00  
##  Max.   :74.000   Max.   :50.000   Max.   :12500   Max.   :98.00  
##     Class          
##  Length:748        
##  Class :character  
##  Mode  :character  
##                    
##                    
## 

Dataset

I selected this dataset, called blood_transfusion, from Kaggle. It was originally used to demonstrate the RFMTC marketing model (a modified version of RFM) and comes from the donor database of the Blood Transfusion Service Center in Hsin-Chu City, Taiwan. The center sends its blood transfusion service bus to a university in Hsin-Chu City to collect donated blood about every three months. Each of the 748 donor records includes R (Recency: months since last donation), F (Frequency: total number of donations), M (Monetary: total blood donated in c.c.), T (Time: months since first donation), and a binary variable indicating whether the donor gave blood in March 2007. The original description codes this variable as 1 for donating and 0 for not donating; in this CSV it appears as the text labels “donated” and “not donated”. Source: https://www.kaggle.com/datasets/whenamancodes/blood-transfusion-dataset/data

What does each column represent? What variables?

colnames(blood_transfusion)
## [1] "Recency"   "Frequency" "Monetary"  "Time"      "Class"
ncol(blood_transfusion)
## [1] 5

1. Recency: The number of months since the last donation.

2. Frequency: Total number of donations.

3. Monetary: Total volume of blood donated in cubic centimeters.

4. Time: Number of months since the first donation.

5. Class: Whether the donor gave blood in March 2007, stored as the text labels “donated” and “not donated”.
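
Because summary() reports Class as a character column, it can help to convert it to a factor before counting or grouping by it. The sketch below is only an illustration: sample_donors is a made-up name for a small data frame whose Class values are copied from the first six rows that head() prints later in this report. On the full data the same calls would be applied to blood_transfusion instead.

```r
# Illustrative mini data frame: Class values copied from the first six rows of the head() output
sample_donors <- data.frame(
  Class = c("donated", "donated", "donated", "donated", "not donated", "not donated"),
  stringsAsFactors = FALSE
)

# Convert the character column to a factor with an explicit level order
sample_donors$Class <- factor(sample_donors$Class, levels = c("not donated", "donated"))

# Count each class and express the counts as proportions
class_counts <- table(sample_donors$Class)
class_props <- prop.table(class_counts)
print(class_counts)
print(class_props)
```

Setting the level order explicitly keeps “not donated” as the reference level, which is an assumption made here for convenience rather than something the dataset prescribes.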

What does each row represent? How many rows/observations?

nrow(blood_transfusion)
## [1] 748

nrow() returns the number of rows in the dataset. Each row represents one individual donor, so the dataset contains 748 donors.

Brief descriptive analytics, such as the missing values proportions

missing <- colSums(is.na(blood_transfusion)) / nrow(blood_transfusion)
missing
##   Recency Frequency  Monetary      Time     Class 
##         0         0         0         0         0
sum(is.na(blood_transfusion))
## [1] 0

No column contains missing values, so the dataset is complete for analysis.

What are the dimensions of this data?

dim(blood_transfusion)
## [1] 748   5

The dataset contains 748 rows and 5 columns.

Check out the first 10 rows

head(blood_transfusion, 10)
##    Recency Frequency Monetary Time       Class
## 1        2        50    12500   98     donated
## 2        0        13     3250   28     donated
## 3        1        16     4000   35     donated
## 4        2        20     5000   45     donated
## 5        1        24     6000   77 not donated
## 6        4         4     1000    4 not donated
## 7        2         7     1750   14     donated
## 8        1        12     3000   35 not donated
## 9        2         9     2250   22     donated
## 10       5        46    11500   98     donated

Check out the last 10 rows

tail(blood_transfusion, 10)
##     Recency Frequency Monetary Time       Class
## 739      23         1      250   23 not donated
## 740      23         4     1000   52 not donated
## 741      23         1      250   23 not donated
## 742      23         7     1750   88 not donated
## 743      16         3      750   86 not donated
## 744      23         2      500   38 not donated
## 745      21         2      500   52 not donated
## 746      23         3      750   62 not donated
## 747      39         1      250   39 not donated
## 748      72         1      250   72 not donated

Questions:

Question 1: How does the time since the first donation relate to the recency of donation?

I use a scatter plot with a fitted linear regression line to visualize the relationship between time since first donation (Time) and recency of donation (Recency, the number of months since the most recent donation).

custom_palette <- c("red", "black")

ggplot(blood_transfusion, aes(x = Time, y = Recency)) +
  geom_point(color = custom_palette[1], size = 3, alpha = 0.7) +  
  geom_smooth(method = "lm", color = custom_palette[2], linetype = "dashed", se = FALSE) +  
  labs(title = "Time Since First Donation vs. Recency of Donation",
       x = "Time Since First Donation (months)",
       y = "Recency of Donation") +
  theme_minimal() +  
  theme(plot.title = element_text(face = "bold", size = 16),  
        axis.title = element_text(size = 14),  
        axis.text = element_text(size = 12))
## `geom_smooth()` using formula = 'y ~ x'

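Most data points cluster in the lower-left corner, which means that most donors both started donating relatively recently and made their last donation within the past few months. Because a donor's last donation cannot come before the first, Recency can never exceed Time, so every point lies on or below the diagonal. The direction and strength of the linear relationship should be read from the slope of the fitted line and confirmed with a correlation coefficient rather than inferred from the cluster alone.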
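
To put a number on the relationship shown in the scatter plot, the correlation and the slope of the least-squares line can be computed directly. The sketch below uses only the ten rows printed by head() above, stored under the made-up name time_recency_rows, so its result is not representative of all 748 donors. On the full data the same calls would take blood_transfusion$Time and blood_transfusion$Recency.

```r
# Ten rows copied from the head() output above (an illustrative subset, not the full data)
time_recency_rows <- data.frame(
  Recency = c(2, 0, 1, 2, 1, 4, 2, 1, 2, 5),
  Time    = c(98, 28, 35, 45, 77, 4, 14, 35, 22, 98)
)

# Structural check: months since the last donation cannot exceed months since the first
stopifnot(all(time_recency_rows$Recency <= time_recency_rows$Time))

# Pearson correlation between the two variables
r <- cor(time_recency_rows$Time, time_recency_rows$Recency)

# Slope of the least-squares line, the same kind of fit geom_smooth(method = "lm") draws
fit <- lm(Recency ~ Time, data = time_recency_rows)
slope <- unname(coef(fit)["Time"])

print(c(correlation = r, slope = slope))
```

The sign of the slope always matches the sign of the correlation, so either value indicates whether the fitted line rises or falls.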

Question 2: Is there a correlation between the frequency of donation and the total volume donated?

I use a 2D density plot to visualize the relationship between the frequency of blood donation and the total volume donated.

ggplot(blood_transfusion, aes(x = Frequency, y = Monetary)) +
  # Contour lines have no fill, so the unsupported `fill` argument is dropped
  geom_density_2d(color = "purple", alpha = 0.5) +
  labs(title = "Correlation between Frequency and Total Volume Donated",
       x = "Frequency of Donation",
       y = "Total Volume Donated (cubic cm)") +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold", size = 16),
        axis.title = element_text(size = 14),
        axis.text = element_text(size = 12))

The plot shows contour lines that indicate the density of data points: nested, closely spaced contours mark regions where observations are concentrated. The concentration sits at low donation frequencies and correspondingly low total volumes, so most donors have given only a few times. In every row printed by head() and tail() above, Monetary equals exactly 250 × Frequency (for example, 50 donations and 12500 c.c.), so the two variables rise together rather than in opposite directions. The plot does not support a reading in which less frequent donors give larger volumes.
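
The link between Frequency and Monetary can also be checked numerically. The sketch below uses the ten rows printed by head() above, stored under the made-up name freq_volume_rows, and tests whether each donor's total volume is a fixed 250 c.c. per donation. Whether this holds for all 748 rows is an assumption that would need to be confirmed by running the same comparison on blood_transfusion.

```r
# Ten rows copied from the head() output above (an illustrative subset, not the full data)
freq_volume_rows <- data.frame(
  Frequency = c(50, 13, 16, 20, 24, 4, 7, 12, 9, 46),
  Monetary  = c(12500, 3250, 4000, 5000, 6000, 1000, 1750, 3000, 2250, 11500)
)

# Is total volume always exactly 250 c.c. times the number of donations?
is_fixed_volume <- all(freq_volume_rows$Monetary == 250 * freq_volume_rows$Frequency)

# If so, the Pearson correlation between the two columns is 1
r <- cor(freq_volume_rows$Frequency, freq_volume_rows$Monetary)

print(is_fixed_volume)
print(r)
```

When one column is an exact multiple of the other, a density plot of the pair carries no more information than the distribution of Frequency alone.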

Question 3: Are there any patterns or trends in the frequency and recency of donation over time?

I use a line plot to show how the total frequency of blood donation changes with the time since the first donation.

total_frequency <- aggregate(Frequency ~ Time, data = blood_transfusion, FUN = sum)

ggplot(total_frequency, aes(x = Time, y = Frequency)) +
  geom_line(color = "pink", linewidth = 1) +  
  labs(title = "Trend of Donation Frequency Over Time",
       x = "Months since First Donation",
       y = "Total Frequency of Donation") +
  theme_minimal() + 
  theme(plot.title = element_text(face = "bold", size = 16),  
        axis.title = element_text(size = 14),  
        axis.text = element_text(size = 12))

The plot shows how total donation frequency varies with the number of months since the first donation. The pink line connects the total for each value of Time and fluctuates considerably, with notable peaks around 25 months and just before 100 months. Because Time measures how long ago each donor first gave rather than a calendar date, and each total is a sum over all donors who share the same Time value, a peak can reflect either a large group of donors who started together or a few very frequent donors. The plot is therefore best read as a comparison across donor tenure; on its own it cannot establish cyclical or seasonal patterns.
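
To separate the effect of group size from per-donor behavior, the total can be placed next to the mean frequency and the number of donors at each Time value. The sketch below is an illustration on nine rows copied from the head() and tail() output above, stored under the made-up name donor_rows. On the full data the same aggregate() calls would use data = blood_transfusion.

```r
# Nine rows copied from the head() and tail() output above (an illustrative subset)
donor_rows <- data.frame(
  Time      = c(98, 28, 35, 35, 98, 23, 23, 52, 52),
  Frequency = c(50, 13, 16, 12, 46, 1, 1, 4, 2)
)

# Total, mean, and donor count for each value of Time
total_by_time <- aggregate(Frequency ~ Time, data = donor_rows, FUN = sum)
mean_by_time  <- aggregate(Frequency ~ Time, data = donor_rows, FUN = mean)
count_by_time <- aggregate(Frequency ~ Time, data = donor_rows, FUN = length)

# aggregate() orders the groups by Time in each call, so the rows line up
by_time <- data.frame(
  Time   = total_by_time$Time,
  total  = total_by_time$Frequency,
  mean   = mean_by_time$Frequency,
  donors = count_by_time$Frequency
)
print(by_time)
```

Plotting the mean column instead of the total would show whether long-standing donors give more often individually, independent of how many donors share each Time value.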