Packages Required:
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
blood_transfusion <- read.csv("~/Desktop/BANA 4137/project /blood_transfusion.csv")
summary(blood_transfusion)
## Recency Frequency Monetary Time
## Min. : 0.000 Min. : 1.000 Min. : 250 Min. : 2.00
## 1st Qu.: 2.750 1st Qu.: 2.000 1st Qu.: 500 1st Qu.:16.00
## Median : 7.000 Median : 4.000 Median : 1000 Median :28.00
## Mean : 9.507 Mean : 5.515 Mean : 1379 Mean :34.28
## 3rd Qu.:14.000 3rd Qu.: 7.000 3rd Qu.: 1750 3rd Qu.:50.00
## Max. :74.000 Max. :50.000 Max. :12500 Max. :98.00
## Class
## Length:748
## Class :character
## Mode :character
##
##
##
This dataset, blood_transfusion, comes from Kaggle. To demonstrate the RFMTC marketing model (a modified version of RFM), the original study used the donor database of the Blood Transfusion Service Center in Hsin-Chu City, Taiwan. The center sends its blood transfusion service bus to a university in Hsin-Chu City about every three months to collect donated blood. Each of the 748 donor records includes R (Recency: months since last donation), F (Frequency: total number of donations), M (Monetary: total blood donated in c.c.), T (Time: months since first donation), and a binary variable indicating whether the donor gave blood in March 2007 (1 stands for donating blood; 0 stands for not donating blood). In this copy of the data, that variable is the Class column, stored as the text values “donated” and “not donated”. Source: https://www.kaggle.com/datasets/whenamancodes/blood-transfusion-dataset/data
What does each column represent? What variables?
colnames(blood_transfusion)
## [1] "Recency" "Frequency" "Monetary" "Time" "Class"
ncol(blood_transfusion)
## [1] 5
1. Recency: The number of months since the last donation.
2. Frequency: Total number of donations.
3. Monetary: Total volume of blood donated in cubic centimeters.
4. Time: Number of months since the first donation.
5. Class: Whether the donor gave blood in March 2007, stored as the text values “donated” and “not donated”.
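Because Class is stored as text while the dataset description uses a 1/0 coding, it can help to recode it before modeling or counting. The following is a minimal sketch, assuming the only two values are “donated” and “not donated”; it uses the Class values of the first ten rows printed by head() in this report as a stand-in, and the names class_sample and donated_flag are made up for the example.

```r
# Stand-in data: Class values of the first 10 rows shown by head() in this report.
class_sample <- c("donated", "donated", "donated", "donated", "not donated",
                  "not donated", "donated", "not donated", "donated", "donated")

# Recode to the 1/0 convention from the dataset description:
# 1 = donated in March 2007, 0 = did not donate.
donated_flag <- as.integer(class_sample == "donated")

# Count donors in each group.
table(donated_flag)
```

On the full data, blood_transfusion$Class would take the place of class_sample.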
What does each row represent? How many rows/observations?
nrow(blood_transfusion)
## [1] 748
nrow() returns the total number of rows, that is, the number of observations. Each row represents one donor, so the dataset covers 748 donors.
Brief descriptive analytics: proportion of missing values
missing <- colSums(is.na(blood_transfusion)) / nrow(blood_transfusion)
missing
## Recency Frequency Monetary Time Class
## 0 0 0 0 0
sum(is.na(blood_transfusion))
## [1] 0
No column has missing values, so the dataset is complete for analysis.
What are the dimensions of this data?
dim(blood_transfusion)
## [1] 748 5
The dataset contains 748 rows and 5 columns.
Check out the first 10 rows
head(blood_transfusion, 10)
## Recency Frequency Monetary Time Class
## 1 2 50 12500 98 donated
## 2 0 13 3250 28 donated
## 3 1 16 4000 35 donated
## 4 2 20 5000 45 donated
## 5 1 24 6000 77 not donated
## 6 4 4 1000 4 not donated
## 7 2 7 1750 14 donated
## 8 1 12 3000 35 not donated
## 9 2 9 2250 22 donated
## 10 5 46 11500 98 donated
Check out the last 10 rows
tail(blood_transfusion, 10)
## Recency Frequency Monetary Time Class
## 739 23 1 250 23 not donated
## 740 23 4 1000 52 not donated
## 741 23 1 250 23 not donated
## 742 23 7 1750 88 not donated
## 743 16 3 750 86 not donated
## 744 23 2 500 38 not donated
## 745 21 2 500 52 not donated
## 746 23 3 750 62 not donated
## 747 39 1 250 39 not donated
## 748 72 1 250 72 not donated
Question 1: How does the time since the first donation relate to the recency of donation?
I use a scatter plot with a fitted linear regression line to visualize the relationship between “Time Since First Donation” and “Recency of Donation” (months since the most recent donation).
custom_palette <- c("red", "black")
# Colors are set directly on each layer, so no color scale is needed.
ggplot(blood_transfusion, aes(x = Time, y = Recency)) +
  geom_point(color = custom_palette[1], size = 3, alpha = 0.7) +
  geom_smooth(method = "lm", color = custom_palette[2], linetype = "dashed", se = FALSE) +
  labs(title = "Time Since First Donation vs. Recency of Donation",
       x = "Time Since First Donation (months)",
       y = "Recency of Donation") +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold", size = 16),
        axis.title = element_text(size = 14),
        axis.text = element_text(size = 12))
## `geom_smooth()` using formula = 'y ~ x'
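Most data points cluster in the lower-left corner, meaning that most donors have a short donation history and donated recently. Since the last donation cannot come before the first, Recency can never exceed Time, so every point lies on or below the line Recency = Time. The direction of the relationship should be read from the slope of the dashed regression line and checked with a correlation coefficient. These two variables alone cannot show whether donors give more often shortly after their first donation; that question also needs Frequency.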
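One way to check the direction of the relationship numerically is to compute a correlation coefficient alongside the fitted slope. The sketch below is illustrative only: it uses the ten rows printed by tail() above as a stand-in for the full data frame (they are the last rows of the file, not a random sample), and the names tail_rows, r, and slope are made up for this example.

```r
# Stand-in data: Recency and Time from the 10 rows printed by tail() above.
tail_rows <- data.frame(
  Recency = c(23, 23, 23, 23, 16, 23, 21, 23, 39, 72),
  Time    = c(23, 52, 23, 88, 86, 38, 52, 62, 39, 72)
)

# Months since the last donation can never exceed months since the first one.
all(tail_rows$Recency <= tail_rows$Time)

# Pearson correlation: its sign tells whether the fitted line slopes up or down.
r <- cor(tail_rows$Time, tail_rows$Recency)
r

# The slope of the least-squares line always has the same sign as r.
slope <- unname(coef(lm(Recency ~ Time, data = tail_rows))["Time"])
slope
```

On the full data, blood_transfusion would replace tail_rows, and the result can be compared with the direction of the dashed line in the plot.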
Question 2: Is there a correlation between the frequency of donation and the total volume donated?
I use a 2D density plot to visualize the relationship between the frequency of blood donation and the total volume donated.
# geom_density_2d() draws contour lines only, so it takes no fill argument.
ggplot(blood_transfusion, aes(x = Frequency, y = Monetary)) +
  geom_density_2d(color = "purple", alpha = 0.5) +
  labs(title = "Correlation between Frequency and Total Volume Donated",
       x = "Frequency of Donation",
       y = "Total Volume Donated (cubic cm)") +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold", size = 16),
        axis.title = element_text(size = 14),
        axis.text = element_text(size = 12))
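The plot visualizes the relationship between the frequency of donation and the total volume donated. Its contour lines mark regions where data points are concentrated, and the density is highest at low donation frequencies and correspondingly low volumes. In every row printed by head() and tail(), Monetary equals 250 times Frequency (for example, 50 donations and 12,500 c.c.), and the summary statistics are consistent with this (mean Frequency 5.515 × 250 ≈ mean Monetary 1379). The two variables therefore appear to rise together in fixed proportion, so the plot does not support the idea that less frequent donors give larger volumes.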
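The relationship between Frequency and Monetary can also be checked directly instead of being read from the contours. This is a small sketch under stated assumptions: it uses only the ten rows printed by head() earlier in this report, and the names head_rows, volume_per_donation, and r_fm are invented for the example.

```r
# Stand-in data: Frequency and Monetary from the first 10 rows shown by head().
head_rows <- data.frame(
  Frequency = c(50, 13, 16, 20, 24, 4, 7, 12, 9, 46),
  Monetary  = c(12500, 3250, 4000, 5000, 6000, 1000, 1750, 3000, 2250, 11500)
)

# Volume given per donation for each donor.
volume_per_donation <- head_rows$Monetary / head_rows$Frequency

# A single distinct value means Monetary is a fixed multiple of Frequency.
unique(volume_per_donation)

# Pearson correlation between the two columns.
r_fm <- cor(head_rows$Frequency, head_rows$Monetary)
r_fm
```

If the same check holds on the full blood_transfusion data frame, the two columns carry the same information, and only one of them is needed in a model.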
Question 3: Are there any patterns or trends in the frequency and recency of donation over time?
I use a line plot to show how the total frequency of blood donation changes with the time since the first donation.
total_frequency <- aggregate(Frequency ~ Time, data = blood_transfusion, FUN = sum)
ggplot(total_frequency, aes(x = Time, y = Frequency)) +
  geom_line(color = "pink", linewidth = 1) +
  labs(title = "Trend of Donation Frequency Over Time",
       x = "Months since First Donation",
       y = "Total Frequency of Donation") +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold", size = 16),
        axis.title = element_text(size = 14),
        axis.text = element_text(size = 12))
The plot shows the total donation frequency at each value of months since the first donation. The pink line connects these totals and fluctuates considerably, with prominent peaks around 25 months and just before 100 months. Because each point is a sum over all donors with the same Time value, a peak can reflect more donors at that value as well as more donations per donor. Time here is measured from each donor’s first donation rather than on a calendar, so the plot shows how totals vary with donor tenure, not seasonal variation.
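To separate the effect of donor counts from donations per donor, the totals can be set beside the number of donors and the mean frequency at each Time value. This is a rough sketch rather than part of the original analysis: it reuses the ten rows printed by head() as a stand-in, and the names head_rows, donor_count, mean_frequency, and summary_by_time are hypothetical.

```r
# Stand-in data: Time and Frequency from the first 10 rows shown by head().
head_rows <- data.frame(
  Time      = c(98, 28, 35, 45, 77, 4, 14, 35, 22, 98),
  Frequency = c(50, 13, 16, 20, 24, 4, 7, 12, 9, 46)
)

# Total donations per Time value (the quantity drawn by the line plot).
total_frequency <- aggregate(Frequency ~ Time, data = head_rows, FUN = sum)

# Number of donors and mean donations per donor at each Time value.
donor_count    <- aggregate(Frequency ~ Time, data = head_rows, FUN = length)
mean_frequency <- aggregate(Frequency ~ Time, data = head_rows, FUN = mean)

# All three results are sorted by Time, so their rows line up.
summary_by_time <- data.frame(
  Time   = total_frequency$Time,
  Total  = total_frequency$Frequency,
  Donors = donor_count$Frequency,
  Mean   = mean_frequency$Frequency
)
summary_by_time
```

Plotting Mean instead of Total against Time would show how donations per donor vary with tenure, independent of how many donors share each Time value.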
Conclusion:
The analysis of the Blood Transfusion dataset gives a first picture of donor behavior. The data are complete, with 748 donors and no missing values. Most donors cluster at short times since first donation and low recency, donations are concentrated at low frequencies, and total donation frequency peaks around 25 months and just before 100 months since the first donation. With findings like these, stakeholders can refine their strategies to attract and retain donors, ultimately contributing to the availability and accessibility of life-saving blood supplies.