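In this step, two datasets were loaded: the real dataset (datase_asthma-2017.csv) and the fake dataset (mock_data.csv).
Both datasets were loaded using the read_csv function in R, and the column specifications were checked to confirm that the data types were consistent (all six columns in both files were read as numeric, dbl).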
library(tidyverse)  # provides read_csv(), the %>% pipe, dplyr verbs, and ggplot2
# Load the real dataset
real_data <- read_csv("datase_asthma-2017.csv")
## Rows: 398 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (6): Census_tract, ED_visits, ED_hosp, UC_visits, Asthma_use, Total_members
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Load the fake dataset
fake_data <- read_csv("mock_data.csv")
## Rows: 398 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (6): Census_tract, ED_visits, ED_hosp, UC_visits, Asthma_use, Total_members
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# View the first few rows
head(real_data)
## # A tibble: 6 × 6
## Census_tract ED_visits ED_hosp UC_visits Asthma_use Total_members
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 42003010300 2 0 0 6 83
## 2 42003020100 32 19 3 223 2114
## 3 42003020300 0 0 0 2 48
## 4 42003030500 11 3 3 61 422
## 5 42003040200 2 1 1 18 138
## 6 42003040400 0 0 0 1 17
head(fake_data)
## # A tibble: 6 × 6
## Census_tract ED_visits ED_hosp UC_visits Asthma_use Total_members
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 42003802967 0 17 4 216 1424
## 2 42003593473 12 12 0 74 2198
## 3 42003935835 21 8 2 225 210
## 4 42003162506 18 11 4 167 793
## 5 42003596976 3 3 6 145 509
## 6 42003481509 18 19 5 205 76
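The type check described above can also be made explicit rather than read off the printed column specification. The following is a minimal sketch under stated assumptions, not the report's own code: it writes a two-row stand-in file (the first two rows shown by head(real_data)) to a temporary path, and it reads Census_tract as character, a common choice for identifiers that the original analysis did not make.

```r
library(readr)

# Stand-in file: header plus the first two rows printed by head(real_data).
# The temporary path is hypothetical; substitute the real CSV path in practice.
csv_lines <- c(
  "Census_tract,ED_visits,ED_hosp,UC_visits,Asthma_use,Total_members",
  "42003010300,2,0,0,6,83",
  "42003020100,32,19,3,223,2114"
)
demo_path <- tempfile(fileext = ".csv")
writeLines(csv_lines, demo_path)

# Declare the types up front: the tract ID as text, every other column as double.
demo <- read_csv(
  demo_path,
  col_types = cols(Census_tract = col_character(), .default = col_double())
)

# Fail loudly if any measurement column is not numeric.
count_cols <- setdiff(names(demo), "Census_tract")
stopifnot(all(vapply(demo[count_cols], is.numeric, logical(1))))
```

Supplying col_types also silences the column specification message, and a file with an unexpected non-numeric value then produces parsing problems instead of silently changing a column's type.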
Summary statistics were generated for both datasets using the summary() function in R. The summary provides key metrics such as minimum, maximum, mean, median, and quartiles for each column.
# Summary statistics for real dataset
summary(real_data)
## Census_tract ED_visits ED_hosp UC_visits
## Min. :4.2e+10 Min. : 0.000 Min. : 0.000 Min. :0.0000
## 1st Qu.:4.2e+10 1st Qu.: 1.000 1st Qu.: 0.000 1st Qu.:0.0000
## Median :4.2e+10 Median : 2.000 Median : 1.000 Median :0.0000
## Mean :4.2e+10 Mean : 3.241 Mean : 1.628 Mean :0.6457
## 3rd Qu.:4.2e+10 3rd Qu.: 4.000 3rd Qu.: 2.000 3rd Qu.:1.0000
## Max. :4.2e+10 Max. :32.000 Max. :19.000 Max. :6.0000
## Asthma_use Total_members
## Min. : 0.00 Min. : 1.0
## 1st Qu.: 20.00 1st Qu.: 249.5
## Median : 35.00 Median : 412.0
## Mean : 38.21 Mean : 466.2
## 3rd Qu.: 53.00 3rd Qu.: 646.8
## Max. :240.00 Max. :2398.0
# Summary statistics for fake dataset
summary(fake_data)
## Census_tract ED_visits ED_hosp UC_visits
## Min. :4.2e+10 Min. : 0.00 Min. : 0.000 Min. :0.000
## 1st Qu.:4.2e+10 1st Qu.: 7.00 1st Qu.: 4.250 1st Qu.:1.000
## Median :4.2e+10 Median :15.50 Median : 9.000 Median :3.000
## Mean :4.2e+10 Mean :15.52 Mean : 9.578 Mean :2.995
## 3rd Qu.:4.2e+10 3rd Qu.:24.00 3rd Qu.:15.000 3rd Qu.:5.000
## Max. :4.2e+10 Max. :32.00 Max. :19.000 Max. :6.000
## Asthma_use Total_members
## Min. : 0.00 Min. : 7.0
## 1st Qu.: 65.25 1st Qu.: 568.5
## Median :126.00 Median :1223.5
## Mean :122.72 Mean :1200.9
## 3rd Qu.:181.50 3rd Qu.:1776.5
## Max. :240.00 Max. :2384.0
# Aggregate by Census Tract for both datasets
real_summary <- real_data %>%
group_by(Census_tract) %>%
summarise(
total_ed_visits = sum(ED_visits, na.rm = TRUE),
total_hosp = sum(ED_hosp, na.rm = TRUE),
total_uc_visits = sum(UC_visits, na.rm = TRUE),
total_asthma_use = sum(Asthma_use, na.rm = TRUE),
total_members = sum(Total_members, na.rm = TRUE)
)
fake_summary <- fake_data %>%
group_by(Census_tract) %>%
summarise(
total_ed_visits = sum(ED_visits, na.rm = TRUE),
total_hosp = sum(ED_hosp, na.rm = TRUE),
total_uc_visits = sum(UC_visits, na.rm = TRUE),
total_asthma_use = sum(Asthma_use, na.rm = TRUE),
total_members = sum(Total_members, na.rm = TRUE)
)
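Observation: The fake dataset has much higher means than the real dataset for every count column (for example, mean ED_visits is 15.52 versus 3.24, and mean Total_members is 1200.9 versus 466.2), even though the maximum values are almost identical. The fake values are spread across the whole range instead of being concentrated near the low end, so the fake data is inflated and does not accurately reflect the real data distribution.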
ggplot(real_data, aes(x = ED_visits)) +
geom_histogram(binwidth = 5, fill = "blue", alpha = 0.5) +
geom_histogram(data = fake_data, aes(x = ED_visits), binwidth = 5, fill = "red", alpha = 0.5) +
labs(title = "Comparison of Emergency Department Visits", x = "ED Visits", y = "Count") +
theme_minimal()
1. Histogram: Emergency Department Visits: The histogram compares the
distribution of ED_visits between the real and fake datasets.
Observation: The fake dataset has a higher frequency of ED_visits in the
range of 10-30, while the real dataset has most visits in the 0-10
range. This indicates that the fake dataset overestimates the number of
emergency department visits.
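The overlaid histogram above sets fill outside aes(), so the plot has no legend telling the reader which color is which dataset. One possible alternative, sketched here on the six ED_visits values per dataset shown by head() (illustration only, not the full data), stacks both datasets into one data frame with a hypothetical `dataset` column and maps fill to it.

```r
library(ggplot2)

# Six rows per dataset, copied from the head() output above (illustration only).
real_head <- data.frame(ED_visits = c(2, 32, 0, 11, 2, 0))
fake_head <- data.frame(ED_visits = c(0, 12, 21, 18, 3, 18))

# Stack the two datasets and record where each row came from.
combined <- rbind(
  cbind(real_head, dataset = "Real"),
  cbind(fake_head, dataset = "Fake")
)

# Mapping fill to `dataset` lets ggplot2 draw the legend automatically;
# position = "identity" overlays the bars instead of stacking them.
p <- ggplot(combined, aes(x = ED_visits, fill = dataset)) +
  geom_histogram(binwidth = 5, alpha = 0.5, position = "identity") +
  scale_fill_manual(values = c(Real = "blue", Fake = "red")) +
  labs(title = "Comparison of Emergency Department Visits",
       x = "ED Visits", y = "Count", fill = "Dataset") +
  theme_minimal()
```

With the full data, the same pattern would apply to real_data and fake_data after adding the `dataset` column to each.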
ggplot() +
geom_boxplot(data = real_data, aes(y = Asthma_use, x = "Real"), fill = "blue", alpha = 0.5) +
geom_boxplot(data = fake_data, aes(y = Asthma_use, x = "Fake"), fill = "red", alpha = 0.5) +
labs(title = "Asthma Controller Medication Use Comparison", x = "Dataset", y = "Asthma Use") +
theme_minimal()
2. Boxplot: Asthma Use Comparison: The boxplot compares the
distribution of Asthma_use between the real and fake datasets.
Observation: The fake dataset has a much higher median and interquartile
range for Asthma_use compared to the real dataset, suggesting that the
fake data overestimates asthma medication usage.
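The median and interquartile range that the boxplot displays can be checked by hand from the summary() output. The sketch below simply copies the Asthma_use quartiles printed earlier and subtracts them; the data frame and column names are made up for the illustration.

```r
# Quartiles of Asthma_use copied from the summary() output above.
asthma_quartiles <- data.frame(
  dataset = c("Real", "Fake"),
  q1      = c(20.00, 65.25),
  median  = c(35.00, 126.00),
  q3      = c(53.00, 181.50)
)

# Interquartile range = third quartile minus first quartile.
asthma_quartiles$iqr <- asthma_quartiles$q3 - asthma_quartiles$q1

print(asthma_quartiles)
```

The interquartile range is 33 for the real dataset and 116.25 for the fake dataset, which matches the much taller box in the plot.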
ggplot(real_data, aes(x = Total_members, y = Asthma_use)) +
geom_point(color = "blue", alpha = 0.5) +
geom_point(data = fake_data, aes(x = Total_members, y = Asthma_use), color = "red", alpha = 0.5) +
labs(title = "Comparison of Total Members vs Asthma Use", x = "Total Members", y = "Asthma Use") +
theme_minimal()
3. Scatter Plot: Total Members vs Asthma Use: The scatter plot compares the relationship between Total_members and Asthma_use for both datasets.
Observation: The fake dataset shows a wider spread of Asthma_use values across different Total_members counts, whereas the real dataset has a more concentrated distribution. This suggests that the fake dataset does not accurately capture the relationship between these two variables.
# Comparing Mean and Standard Deviation for Key Metrics
comparison <- data.frame(
Metric = c("ED_visits", "ED_hosp", "UC_visits", "Asthma_use", "Total_members"),
Real_Mean = c(mean(real_data$ED_visits, na.rm = TRUE),
mean(real_data$ED_hosp, na.rm = TRUE),
mean(real_data$UC_visits, na.rm = TRUE),
mean(real_data$Asthma_use, na.rm = TRUE),
mean(real_data$Total_members, na.rm = TRUE)),
Fake_Mean = c(mean(fake_data$ED_visits, na.rm = TRUE),
mean(fake_data$ED_hosp, na.rm = TRUE),
mean(fake_data$UC_visits, na.rm = TRUE),
mean(fake_data$Asthma_use, na.rm = TRUE),
mean(fake_data$Total_members, na.rm = TRUE)),
Real_SD = c(sd(real_data$ED_visits, na.rm = TRUE),
sd(real_data$ED_hosp, na.rm = TRUE),
sd(real_data$UC_visits, na.rm = TRUE),
sd(real_data$Asthma_use, na.rm = TRUE),
sd(real_data$Total_members, na.rm = TRUE)),
Fake_SD = c(sd(fake_data$ED_visits, na.rm = TRUE),
sd(fake_data$ED_hosp, na.rm = TRUE),
sd(fake_data$UC_visits, na.rm = TRUE),
sd(fake_data$Asthma_use, na.rm = TRUE),
sd(fake_data$Total_members, na.rm = TRUE))
)
print(comparison)
## Metric Real_Mean Fake_Mean Real_SD Fake_SD
## 1 ED_visits 3.2412060 15.520101 3.8677261 9.714221
## 2 ED_hosp 1.6281407 9.577889 2.5209274 5.911839
## 3 UC_visits 0.6457286 2.994975 0.9975382 2.010044
## 4 Asthma_use 38.2110553 122.718593 26.6922130 67.874950
## 5 Total_members 466.1608040 1200.904523 295.7141537 695.923952
Observation: The fake dataset has higher means and standard deviations than the real dataset for all five metrics: the means are roughly 2.6 to 5.9 times higher and the standard deviations roughly 2.0 to 2.5 times higher. This indicates that the fake data is not only inflated but also more variable, which could lead to misleading conclusions if used in analysis.
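The comparison table above repeats the same mean() and sd() call for every column. A more compact way to build the same kind of table is sketched below; `col_stat` and `compare_metrics` are hypothetical helper names, and the input is only the six rows per dataset shown by head(), so the numbers differ from the full-data table.

```r
metrics <- c("ED_visits", "ED_hosp", "UC_visits", "Asthma_use", "Total_members")

# Six rows per dataset, copied from the head() output earlier in this report.
real_head <- data.frame(
  ED_visits = c(2, 32, 0, 11, 2, 0),
  ED_hosp = c(0, 19, 0, 3, 1, 0),
  UC_visits = c(0, 3, 0, 3, 1, 0),
  Asthma_use = c(6, 223, 2, 61, 18, 1),
  Total_members = c(83, 2114, 48, 422, 138, 17)
)
fake_head <- data.frame(
  ED_visits = c(0, 12, 21, 18, 3, 18),
  ED_hosp = c(17, 12, 8, 11, 3, 19),
  UC_visits = c(4, 0, 2, 4, 6, 5),
  Asthma_use = c(216, 74, 225, 167, 145, 205),
  Total_members = c(1424, 2198, 210, 793, 509, 76)
)

# Hypothetical helper: apply one statistic to every metric column.
col_stat <- function(df, cols, stat) {
  unname(vapply(df[cols], stat, numeric(1), na.rm = TRUE))
}

# Hypothetical helper: one row per metric, mean and SD for each dataset.
compare_metrics <- function(real, fake, cols) {
  data.frame(
    Metric = cols,
    Real_Mean = col_stat(real, cols, mean),
    Fake_Mean = col_stat(fake, cols, mean),
    Real_SD = col_stat(real, cols, sd),
    Fake_SD = col_stat(fake, cols, sd)
  )
}

comparison_head <- compare_metrics(real_head, fake_head, metrics)
print(comparison_head)
```

Calling compare_metrics(real_data, fake_data, metrics) on the full datasets should reproduce the layout of the table above, and adding a new metric only requires extending the `metrics` vector.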
ggplot() +
geom_density(data = real_data, aes(x = ED_visits, color = "Real"), fill = "blue", alpha = 0.3) +
geom_density(data = fake_data, aes(x = ED_visits, color = "Fake"), fill = "red", alpha = 0.3) +
labs(title = "Density Plot of ED Visits", x = "ED Visits", y = "Density") +
scale_color_manual(values = c("Real" = "blue", "Fake" = "red")) +
theme_minimal()
ggplot() +
geom_violin(data = real_data, aes(y = UC_visits, x = "Real"), fill = "blue", alpha = 0.5) +
geom_violin(data = fake_data, aes(y = UC_visits, x = "Fake"), fill = "red", alpha = 0.5) +
labs(title = "Urgent Care Visits Comparison", x = "Dataset", y = "Urgent Care Visits") +
theme_minimal()
library(reshape2)
# Compute correlation matrices
real_corr <- cor(real_data %>% select(-Census_tract), use = "complete.obs")
fake_corr <- cor(fake_data %>% select(-Census_tract), use = "complete.obs")
# Melt data for heatmap
real_melted <- melt(real_corr)
fake_melted <- melt(fake_corr)
# Heatmap for real dataset
ggplot(real_melted, aes(Var1, Var2, fill = value)) +
geom_tile() +
scale_fill_gradient2(low = "blue", high = "red", mid = "white", midpoint = 0) +
labs(title = "Correlation Heatmap (Real Data)", x = "", y = "") +
theme_minimal()
# Heatmap for fake dataset
ggplot(fake_melted, aes(Var1, Var2, fill = value)) +
geom_tile() +
scale_fill_gradient2(low = "blue", high = "red", mid = "white", midpoint = 0) +
labs(title = "Correlation Heatmap (Fake Data)", x = "", y = "") +
theme_minimal()
Additional Section: Analysis of Each Graph
1. Density Plot: ED Visits: The density plot shows the distribution of
ED_visits for both datasets.
Observation: The fake dataset has a much higher density in
the 10-30 range, while the real dataset is concentrated in the 0-10
range. This confirms that the fake dataset overestimates the frequency
of emergency department visits.
2. Violin Plot: Urgent Care Visits: The violin plot compares the distribution of UC_visits between the real and fake datasets.
Observation: The fake dataset has a wider distribution of UC_visits, with more values in the 2-6 range, while the real dataset has most values clustered around 0-1. This indicates that the fake dataset overestimates urgent care visits.
3. Heatmap: Correlation Matrix: The heatmap shows the correlation between variables in both datasets.
Observation: The real dataset shows a stronger correlation between Total_members and Asthma_use, while the fake dataset has weaker correlations overall. This suggests that the fake dataset does not accurately capture the relationships between variables.
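Two separate heatmaps have to be compared by eye. Subtracting one correlation matrix from the other gives the gap for each pair of variables directly. The sketch below shows only the mechanics, using the six head() rows of two columns per dataset; six rows are far too few to draw conclusions, and the object names are made up for the illustration.

```r
# Six rows per dataset, copied from the head() output (illustration only).
real_head <- data.frame(
  Asthma_use = c(6, 223, 2, 61, 18, 1),
  Total_members = c(83, 2114, 48, 422, 138, 17)
)
fake_head <- data.frame(
  Asthma_use = c(216, 74, 225, 167, 145, 205),
  Total_members = c(1424, 2198, 210, 793, 509, 76)
)

# Pearson correlation matrices, computed as in the heatmap code above.
real_corr_head <- cor(real_head, use = "complete.obs")
fake_corr_head <- cor(fake_head, use = "complete.obs")

# Positive entries mean the pair is more strongly positively correlated
# in the real rows than in the fake rows.
corr_gap <- real_corr_head - fake_corr_head
print(round(corr_gap, 2))
```

Applied to the full real_corr and fake_corr matrices, the same subtraction could be melted and plotted as a single difference heatmap.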
The analysis reveals that the fake dataset significantly overestimates key metrics such as ED_visits, ED_hosp, UC_visits, and Asthma_use compared to the real dataset. The distributions and correlations in the fake dataset do not accurately reflect those in the real dataset, which could lead to incorrect conclusions if used in further analysis.
To improve the accuracy of the fake dataset, the following measures could be applied:
1. Adjust the Range of Values: Reduce the maximum values for ED_visits, ED_hosp, UC_visits, and Asthma_use to better match the real dataset.
2. Refine the Distribution: Ensure that the distribution of values in the fake dataset closely matches the real dataset, particularly for key metrics like Asthma_use and Total_members.
3. Improve Correlations: Adjust the fake dataset to better reflect the correlations between variables, especially between Total_members and Asthma_use.
We adjust the max values to make sure the fake dataset better matches the real one in terms of averages (mean) and spread (standard deviation). Right now, the fake dataset has much higher values than the real one, which makes it unrealistic.
By adjusting these max values, we’re making sure the fake dataset doesn’t overshoot the real one but still has enough variation to look natural. If we keep the fake max values too high, the whole dataset will be inflated and won’t reflect real trends.
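Why lowering the maximum also lowers the mean can be illustrated with a small calculation. The sketch assumes, purely for illustration, that each fake column is drawn uniformly from the integers 0 to its maximum; the report does not state how the mock files were generated. The maxima are the ones reported by summary() for the original and adjusted fake files, and `uniform_mean` is a hypothetical helper.

```r
# Mean of a discrete uniform distribution on 0..max_value is max_value / 2.
uniform_mean <- function(max_value) mean(0:max_value)

caps <- data.frame(
  Metric = c("ED_visits", "ED_hosp", "UC_visits", "Asthma_use"),
  Original_Max = c(32, 19, 6, 240),  # maxima of the original fake file
  Adjusted_Max = c(15, 10, 4, 119)   # maxima of the adjusted fake file
)

# Expected mean under the uniform assumption, before and after the adjustment.
caps$Original_Uniform_Mean <- vapply(caps$Original_Max, uniform_mean, numeric(1))
caps$Adjusted_Uniform_Mean <- vapply(caps$Adjusted_Max, uniform_mean, numeric(1))

print(caps)
```

Under this assumption the expected means are 16, 9.5, 3, and 120 before the adjustment and 7.5, 5, 2, and 59.5 after it, which is close to the fake means reported in the comparison tables. If the assumption holds, the maximum alone fixes the mean at about half its value, so matching a right-skewed real distribution would need a different sampling scheme, not just a lower cap.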
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ lubridate 1.9.3 ✔ tibble 3.2.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# Load the real dataset
real_data <- read_csv("datase_asthma-2017.csv")
## Rows: 398 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (6): Census_tract, ED_visits, ED_hosp, UC_visits, Asthma_use, Total_members
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Load the fake dataset
fake_data <- read_csv("mock_data_new.csv")
## Rows: 398 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (6): Census_tract, ED_visits, ED_hosp, UC_visits, Asthma_use, Total_members
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# View the first few rows
head(real_data)
## # A tibble: 6 × 6
## Census_tract ED_visits ED_hosp UC_visits Asthma_use Total_members
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 42003010300 2 0 0 6 83
## 2 42003020100 32 19 3 223 2114
## 3 42003020300 0 0 0 2 48
## 4 42003030500 11 3 3 61 422
## 5 42003040200 2 1 1 18 138
## 6 42003040400 0 0 0 1 17
head(fake_data)
## # A tibble: 6 × 6
## Census_tract ED_visits ED_hosp UC_visits Asthma_use Total_members
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 42003080863 12 2 3 28 996
## 2 42003945826 6 1 4 90 661
## 3 42003574160 8 4 3 117 615
## 4 42003592335 8 10 4 94 910
## 5 42003436405 6 4 4 84 33
## 6 42003724223 10 9 3 21 420
Summary statistics were generated for both datasets using the summary() function.
# Summary statistics for real dataset
summary(real_data)
## Census_tract ED_visits ED_hosp UC_visits
## Min. :4.2e+10 Min. : 0.000 Min. : 0.000 Min. :0.0000
## 1st Qu.:4.2e+10 1st Qu.: 1.000 1st Qu.: 0.000 1st Qu.:0.0000
## Median :4.2e+10 Median : 2.000 Median : 1.000 Median :0.0000
## Mean :4.2e+10 Mean : 3.241 Mean : 1.628 Mean :0.6457
## 3rd Qu.:4.2e+10 3rd Qu.: 4.000 3rd Qu.: 2.000 3rd Qu.:1.0000
## Max. :4.2e+10 Max. :32.000 Max. :19.000 Max. :6.0000
## Asthma_use Total_members
## Min. : 0.00 Min. : 1.0
## 1st Qu.: 20.00 1st Qu.: 249.5
## Median : 35.00 Median : 412.0
## Mean : 38.21 Mean : 466.2
## 3rd Qu.: 53.00 3rd Qu.: 646.8
## Max. :240.00 Max. :2398.0
# Summary statistics for fake dataset
summary(fake_data)
## Census_tract ED_visits ED_hosp UC_visits
## Min. :4.2e+10 Min. : 0.000 Min. : 0.000 Min. :0.00
## 1st Qu.:4.2e+10 1st Qu.: 3.000 1st Qu.: 2.000 1st Qu.:1.00
## Median :4.2e+10 Median : 7.000 Median : 5.000 Median :2.00
## Mean :4.2e+10 Mean : 7.302 Mean : 4.899 Mean :2.06
## 3rd Qu.:4.2e+10 3rd Qu.:11.000 3rd Qu.: 8.000 3rd Qu.:3.00
## Max. :4.2e+10 Max. :15.000 Max. :10.000 Max. :4.00
## Asthma_use Total_members
## Min. : 0.00 Min. : 2.0
## 1st Qu.: 26.00 1st Qu.:230.2
## Median : 57.50 Median :487.0
## Mean : 58.22 Mean :486.5
## 3rd Qu.: 88.75 3rd Qu.:745.8
## Max. :119.00 Max. :996.0
# Aggregate by Census Tract for both datasets
real_summary <- real_data %>%
group_by(Census_tract) %>%
summarise(
total_ed_visits = sum(ED_visits, na.rm = TRUE),
total_hosp = sum(ED_hosp, na.rm = TRUE),
total_uc_visits = sum(UC_visits, na.rm = TRUE),
total_asthma_use = sum(Asthma_use, na.rm = TRUE),
total_members = sum(Total_members, na.rm = TRUE)
)
fake_summary <- fake_data %>%
group_by(Census_tract) %>%
summarise(
total_ed_visits = sum(ED_visits, na.rm = TRUE),
total_hosp = sum(ED_hosp, na.rm = TRUE),
total_uc_visits = sum(UC_visits, na.rm = TRUE),
total_asthma_use = sum(Asthma_use, na.rm = TRUE),
total_members = sum(Total_members, na.rm = TRUE)
)
Observation: The adjusted fake dataset has significantly lower means and maximum values compared to the original fake dataset. However, the means are still higher than the real dataset, particularly for ED_visits, ED_hosp, and Asthma_use. This suggests that further adjustments may be needed to fully align the fake dataset with the real one.
ggplot(real_data, aes(x = ED_visits)) +
geom_histogram(binwidth = 5, fill = "blue", alpha = 0.5) +
geom_histogram(data = fake_data, aes(x = ED_visits), binwidth = 5, fill = "red", alpha = 0.5) +
labs(title = "Comparison of Emergency Department Visits", x = "ED Visits", y = "Count") +
theme_minimal()
1. Histogram: Emergency Department Visits: The histogram compares the
distribution of ED_visits between the real and adjusted fake datasets.
Observation: The adjusted fake dataset shows a more reasonable
distribution of ED_visits, with most values falling in the 0-10 range.
However, there is still a higher frequency of visits in the 10-15 range
compared to the real dataset, which has most visits in the 0-5
range.
ggplot() +
geom_boxplot(data = real_data, aes(y = Asthma_use, x = "Real"), fill = "blue", alpha = 0.5) +
geom_boxplot(data = fake_data, aes(y = Asthma_use, x = "Fake"), fill = "red", alpha = 0.5) +
labs(title = "Asthma Controller Medication Use Comparison", x = "Dataset", y = "Asthma Use") +
theme_minimal()
2. Boxplot: Asthma Use Comparison: The boxplot compares the
distribution of Asthma_use between the real and adjusted fake datasets.
Observation: The adjusted fake dataset has a lower median and
interquartile range for Asthma_use compared to the original fake
dataset, but it is still higher than the real dataset. This indicates
that the fake dataset still overestimates asthma medication usage,
though to a lesser extent.
ggplot(real_data, aes(x = Total_members, y = Asthma_use)) +
geom_point(color = "blue", alpha = 0.5) +
geom_point(data = fake_data, aes(x = Total_members, y = Asthma_use), color = "red", alpha = 0.5) +
labs(title = "Comparison of Total Members vs Asthma Use", x = "Total Members", y = "Asthma Use") +
theme_minimal()
3. Scatter Plot: Total Members vs Asthma Use: The scatter plot
compares the relationship between Total_members and Asthma_use for both
datasets.
Observation: The adjusted fake dataset shows a more
concentrated distribution of Asthma_use values across different
Total_members counts, which is closer to the real dataset. However, the
fake dataset still has a wider spread, particularly for higher values of
Total_members.
# Comparing Mean and Standard Deviation for Key Metrics
comparison <- data.frame(
Metric = c("ED_visits", "ED_hosp", "UC_visits", "Asthma_use", "Total_members"),
Real_Mean = c(mean(real_data$ED_visits, na.rm = TRUE),
mean(real_data$ED_hosp, na.rm = TRUE),
mean(real_data$UC_visits, na.rm = TRUE),
mean(real_data$Asthma_use, na.rm = TRUE),
mean(real_data$Total_members, na.rm = TRUE)),
Fake_Mean = c(mean(fake_data$ED_visits, na.rm = TRUE),
mean(fake_data$ED_hosp, na.rm = TRUE),
mean(fake_data$UC_visits, na.rm = TRUE),
mean(fake_data$Asthma_use, na.rm = TRUE),
mean(fake_data$Total_members, na.rm = TRUE)),
Real_SD = c(sd(real_data$ED_visits, na.rm = TRUE),
sd(real_data$ED_hosp, na.rm = TRUE),
sd(real_data$UC_visits, na.rm = TRUE),
sd(real_data$Asthma_use, na.rm = TRUE),
sd(real_data$Total_members, na.rm = TRUE)),
Fake_SD = c(sd(fake_data$ED_visits, na.rm = TRUE),
sd(fake_data$ED_hosp, na.rm = TRUE),
sd(fake_data$UC_visits, na.rm = TRUE),
sd(fake_data$Asthma_use, na.rm = TRUE),
sd(fake_data$Total_members, na.rm = TRUE))
)
print(comparison)
## Metric Real_Mean Fake_Mean Real_SD Fake_SD
## 1 ED_visits 3.2412060 7.301508 3.8677261 4.714472
## 2 ED_hosp 1.6281407 4.899497 2.5209274 3.151898
## 3 UC_visits 0.6457286 2.060302 0.9975382 1.416485
## 4 Asthma_use 38.2110553 58.223618 26.6922130 36.424626
## 5 Total_members 466.1608040 486.547739 295.7141537 294.545462
Observation: The adjusted fake dataset has lower means and standard deviations compared to the original fake dataset, but it still overestimates key metrics like ED_visits, ED_hosp, and Asthma_use. The standard deviations are closer to the real dataset, indicating that the variability in the fake dataset has been reduced.
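How much the adjustment narrowed the gap can be expressed as the ratio of fake mean to real mean, before and after. The sketch below only re-uses the means printed in the two comparison tables of this report; the vector and column names are made up for the illustration.

```r
metrics <- c("ED_visits", "ED_hosp", "UC_visits", "Asthma_use", "Total_members")

# Means copied from the printed comparison tables.
real_mean        <- c(3.2412060, 1.6281407, 0.6457286, 38.2110553, 466.1608040)
fake_mean_before <- c(15.520101, 9.577889, 2.994975, 122.718593, 1200.904523)
fake_mean_after  <- c(7.301508, 4.899497, 2.060302, 58.223618, 486.547739)

# A ratio of 1 would mean the fake mean equals the real mean.
ratios <- data.frame(
  Metric = metrics,
  Ratio_Before = fake_mean_before / real_mean,
  Ratio_After = fake_mean_after / real_mean
)

print(ratios, digits = 3)
```

Every ratio moves closer to 1 after the adjustment, but all of them stay above 1, which is consistent with the observation that the adjusted fake dataset still overestimates the real values.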
# Convert necessary columns to numeric
numeric_cols <- c("ED_visits", "ED_hosp", "UC_visits", "Asthma_use", "Total_members")
real_data[numeric_cols] <- lapply(real_data[numeric_cols], as.numeric)
fake_data[numeric_cols] <- lapply(fake_data[numeric_cols], as.numeric)
# ---- 1. Density Plot: ED Visits ----
ggplot() +
geom_density(data = real_data, aes(x = ED_visits, color = "Real"), fill = "blue", alpha = 0.3) +
geom_density(data = fake_data, aes(x = ED_visits, color = "Fake"), fill = "red", alpha = 0.3) +
labs(title = "Density Plot of ED Visits", x = "ED Visits", y = "Density") +
scale_color_manual(values = c("Real" = "blue", "Fake" = "red")) +
theme_minimal()
# ---- 2. Violin Plot: Urgent Care Visits ----
ggplot() +
geom_violin(data = real_data, aes(y = UC_visits, x = "Real"), fill = "blue", alpha = 0.5) +
geom_violin(data = fake_data, aes(y = UC_visits, x = "Fake"), fill = "red", alpha = 0.5) +
labs(title = "Urgent Care Visits Comparison", x = "Dataset", y = "Urgent Care Visits") +
theme_minimal()
# ---- 3. Heatmap: Correlation Matrix ----
library(reshape2)  # provides melt(), used below to reshape the matrices
# Compute correlation matrices
real_corr <- cor(real_data %>% select(-Census_tract), use = "complete.obs")
fake_corr <- cor(fake_data %>% select(-Census_tract), use = "complete.obs")
# Melt data for heatmap
real_melted <- melt(real_corr)
fake_melted <- melt(fake_corr)
# Heatmap for real dataset
ggplot(real_melted, aes(Var1, Var2, fill = value)) +
geom_tile() +
scale_fill_gradient2(low = "blue", high = "red", mid = "white", midpoint = 0) +
labs(title = "Correlation Heatmap (Real Data)", x = "", y = "") +
theme_minimal()
# Heatmap for fake dataset
ggplot(fake_melted, aes(Var1, Var2, fill = value)) +
geom_tile() +
scale_fill_gradient2(low = "blue", high = "red", mid = "white", midpoint = 0) +
labs(title = "Correlation Heatmap (Fake Data)", x = "", y = "") +
theme_minimal()
Additional Section: Analysis of Each Graph
1. Density Plot: ED Visits: The density plot shows the distribution of ED_visits for both the real and adjusted fake datasets.
Observation: The adjusted fake dataset has a more reasonable distribution, with most values falling in the 0-10 range. However, there is still a higher density in the 10-15 range compared to the real dataset, which is concentrated in the 0-5 range.
2. Violin Plot: Urgent Care Visits: The violin plot compares the distribution of UC_visits between the real and adjusted fake datasets.
Observation: The adjusted fake dataset has a narrower distribution of UC_visits, with most values falling in the 0-4 range. This is closer to the real dataset, which has most values clustered around 0-1. However, the fake dataset still overestimates urgent care visits.
3. Heatmap: Correlation Matrix: The heatmap shows the correlation between variables in both datasets.
Observation: The adjusted fake dataset shows stronger correlations between variables compared to the original fake dataset, particularly between Total_members and Asthma_use. However, the correlations are still weaker than in the real dataset, indicating that the fake dataset does not fully capture the relationships between variables.
The adjustments made to the fake dataset in Part 2 have improved its alignment with the real dataset, particularly in terms of reducing the inflated values for key metrics like ED_visits, ED_hosp, and Asthma_use. However, the fake dataset still overestimates some metrics and does not fully capture the correlations between variables.
To further improve the accuracy of the fake dataset, the following measures could be applied:
1. Further Reduce Maximum Values: Adjust the maximum values for ED_visits, ED_hosp, and Asthma_use to better match the real dataset.
2. Refine the Distribution: Ensure that the distribution of values in the fake dataset more closely matches the real dataset, particularly for Asthma_use and Total_members.
3. Improve Correlations: Adjust the fake dataset to better reflect the correlations between variables, especially between Total_members and Asthma_use.
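The third measure could be approached by generating Asthma_use from Total_members instead of drawing the two columns independently. The following is a rough sketch under loudly stated assumptions, not a method used in this report: tract sizes are drawn from an exponential distribution with the real mean, and each member is assumed to use medication with a fixed probability equal to the ratio of the real means. All object names are hypothetical.

```r
set.seed(42)      # arbitrary seed, only so the sketch is reproducible
n_tracts <- 398   # row count reported by read_csv()

# Assumption: tract sizes are right-skewed; an exponential draw with the real
# mean (about 466 members) is a crude stand-in, not a fitted model.
total_members <- round(rexp(n_tracts, rate = 1 / 466)) + 1

# Assumption: a fixed per-member usage probability, taken as
# mean(Asthma_use) / mean(Total_members) from the real data.
use_rate <- 38.21 / 466.16

# Binomial draw: Asthma_use can never exceed Total_members and grows with it.
asthma_use <- rbinom(n_tracts, size = total_members, prob = use_rate)

fake_sketch <- data.frame(Total_members = total_members, Asthma_use = asthma_use)
cor(fake_sketch$Total_members, fake_sketch$Asthma_use)
```

Because Asthma_use is tied to Total_members by construction, the two columns come out strongly positively correlated. Whether that correlation is closer to the real one would still have to be checked against real_corr.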