PART 1 OF WEEK 5 ANALYSIS:

COMPARISON OF THE DATABASE FOR ASTHMA FROM 2017 AND THE FIRST SET OF MOCK DATA CREATED

Step 1: Load the datasets

In this step, two datasets were loaded:

  1. Real Dataset: datase_asthma-2017.csv containing 398 rows and 6 columns. The columns include Census_tract, ED_visits, ED_hosp, UC_visits, Asthma_use, and Total_members.
  2. Fake Dataset: mock_data.csv also containing 398 rows and 6 columns with the same structure as the real dataset.

Both datasets were loaded using the read_csv function in R, and the column specifications were checked to ensure the data types were consistent (all columns were numeric).

# Load the packages used below (read_csv, %>%, ggplot)
library(tidyverse)
# Load the real dataset
real_data <- read_csv("datase_asthma-2017.csv")
## Rows: 398 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (6): Census_tract, ED_visits, ED_hosp, UC_visits, Asthma_use, Total_members
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Load the fake dataset
fake_data <- read_csv("mock_data.csv")
## Rows: 398 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (6): Census_tract, ED_visits, ED_hosp, UC_visits, Asthma_use, Total_members
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# View the first few rows
head(real_data)
## # A tibble: 6 × 6
##   Census_tract ED_visits ED_hosp UC_visits Asthma_use Total_members
##          <dbl>     <dbl>   <dbl>     <dbl>      <dbl>         <dbl>
## 1  42003010300         2       0         0          6            83
## 2  42003020100        32      19         3        223          2114
## 3  42003020300         0       0         0          2            48
## 4  42003030500        11       3         3         61           422
## 5  42003040200         2       1         1         18           138
## 6  42003040400         0       0         0          1            17
head(fake_data)
## # A tibble: 6 × 6
##   Census_tract ED_visits ED_hosp UC_visits Asthma_use Total_members
##          <dbl>     <dbl>   <dbl>     <dbl>      <dbl>         <dbl>
## 1  42003802967         0      17         4        216          1424
## 2  42003593473        12      12         0         74          2198
## 3  42003935835        21       8         2        225           210
## 4  42003162506        18      11         4        167           793
## 5  42003596976         3       3         6        145           509
## 6  42003481509        18      19         5        205            76
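
The consistency check described in Step 1 can also be done programmatically rather than by reading the column specification. The sketch below uses base R only; `same_structure()` is a hypothetical helper name, and the two small data frames simply reuse the first three rows printed by head() above.

```r
# Hypothetical helper: TRUE when two data frames have the same column
# names (in the same order) and the same column classes.
same_structure <- function(a, b) {
  identical(names(a), names(b)) &&
    identical(sapply(a, class), sapply(b, class))
}

# First three rows of each dataset, copied from the head() output above.
real_head <- data.frame(
  Census_tract  = c(42003010300, 42003020100, 42003020300),
  ED_visits     = c(2, 32, 0),
  ED_hosp       = c(0, 19, 0),
  UC_visits     = c(0, 3, 0),
  Asthma_use    = c(6, 223, 2),
  Total_members = c(83, 2114, 48)
)
fake_head <- data.frame(
  Census_tract  = c(42003802967, 42003593473, 42003935835),
  ED_visits     = c(0, 12, 21),
  ED_hosp       = c(17, 12, 8),
  UC_visits     = c(4, 0, 2),
  Asthma_use    = c(216, 74, 225),
  Total_members = c(1424, 2198, 210)
)

structure_ok <- same_structure(real_head, fake_head)
print(structure_ok)
```

In the analysis itself the call would be same_structure(real_data, fake_data), which should fail loudly if a later mock file drops or renames a column.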

Step 2: Create Summary Statistics (Pivot Tables)

Summary statistics were generated for both datasets using the summary() function in R. The summary provides key metrics such as minimum, maximum, mean, median, and quartiles for each column.

# Summary statistics for real dataset
summary(real_data)
##   Census_tract       ED_visits         ED_hosp         UC_visits     
##  Min.   :4.2e+10   Min.   : 0.000   Min.   : 0.000   Min.   :0.0000  
##  1st Qu.:4.2e+10   1st Qu.: 1.000   1st Qu.: 0.000   1st Qu.:0.0000  
##  Median :4.2e+10   Median : 2.000   Median : 1.000   Median :0.0000  
##  Mean   :4.2e+10   Mean   : 3.241   Mean   : 1.628   Mean   :0.6457  
##  3rd Qu.:4.2e+10   3rd Qu.: 4.000   3rd Qu.: 2.000   3rd Qu.:1.0000  
##  Max.   :4.2e+10   Max.   :32.000   Max.   :19.000   Max.   :6.0000  
##    Asthma_use     Total_members   
##  Min.   :  0.00   Min.   :   1.0  
##  1st Qu.: 20.00   1st Qu.: 249.5  
##  Median : 35.00   Median : 412.0  
##  Mean   : 38.21   Mean   : 466.2  
##  3rd Qu.: 53.00   3rd Qu.: 646.8  
##  Max.   :240.00   Max.   :2398.0
# Summary statistics for fake dataset
summary(fake_data)
##   Census_tract       ED_visits        ED_hosp         UC_visits    
##  Min.   :4.2e+10   Min.   : 0.00   Min.   : 0.000   Min.   :0.000  
##  1st Qu.:4.2e+10   1st Qu.: 7.00   1st Qu.: 4.250   1st Qu.:1.000  
##  Median :4.2e+10   Median :15.50   Median : 9.000   Median :3.000  
##  Mean   :4.2e+10   Mean   :15.52   Mean   : 9.578   Mean   :2.995  
##  3rd Qu.:4.2e+10   3rd Qu.:24.00   3rd Qu.:15.000   3rd Qu.:5.000  
##  Max.   :4.2e+10   Max.   :32.00   Max.   :19.000   Max.   :6.000  
##    Asthma_use     Total_members   
##  Min.   :  0.00   Min.   :   7.0  
##  1st Qu.: 65.25   1st Qu.: 568.5  
##  Median :126.00   Median :1223.5  
##  Mean   :122.72   Mean   :1200.9  
##  3rd Qu.:181.50   3rd Qu.:1776.5  
##  Max.   :240.00   Max.   :2384.0
# Aggregate by Census Tract for both datasets
real_summary <- real_data %>%
  group_by(Census_tract) %>%
  summarise(
    total_ed_visits = sum(ED_visits, na.rm = TRUE),
    total_hosp = sum(ED_hosp, na.rm = TRUE),
    total_uc_visits = sum(UC_visits, na.rm = TRUE),
    total_asthma_use = sum(Asthma_use, na.rm = TRUE),
    total_members = sum(Total_members, na.rm = TRUE)
  )

fake_summary <- fake_data %>%
  group_by(Census_tract) %>%
  summarise(
    total_ed_visits = sum(ED_visits, na.rm = TRUE),
    total_hosp = sum(ED_hosp, na.rm = TRUE),
    total_uc_visits = sum(UC_visits, na.rm = TRUE),
    total_asthma_use = sum(Asthma_use, na.rm = TRUE),
    total_members = sum(Total_members, na.rm = TRUE)
  )
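
Grouping by Census_tract only changes the data when a tract appears on more than one row. A quick base R check, shown here on the six tract IDs printed by head(real_data) in Step 1, tells whether the aggregation is doing any work. `has_repeated_tracts()` is a hypothetical helper name.

```r
# Hypothetical helper: TRUE when at least one tract ID occurs more than once.
has_repeated_tracts <- function(tract_ids) {
  anyDuplicated(tract_ids) > 0
}

# Tract IDs copied from the head(real_data) output in Step 1.
tracts <- c(42003010300, 42003020100, 42003020300,
            42003030500, 42003040200, 42003040400)

repeated <- has_repeated_tracts(tracts)
print(repeated)
```

If the same check returns FALSE on the full Census_tract column, each group holds a single row and the summarised tables carry the same values as the input under new column names.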

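Observation: The fake dataset has much higher means than the real dataset for every measure, from about 2.6 times higher (Total_members) to about 5.9 times higher (ED_hosp). The maxima are almost identical in the two datasets, and each fake mean sits close to half of its column maximum, which is the pattern that values spread evenly between 0 and the maximum would produce. The fake data is therefore inflated and does not reflect the right-skewed distribution of the real data.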
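
The size of the inflation can be read directly off the summary() output. The sketch below copies the printed means and maxima into vectors (the vector names are introduced here for illustration) and computes two ratios: fake mean over real mean, and fake mean over fake maximum.

```r
# Means and maxima copied from the summary() output above.
metrics   <- c("ED_visits", "ED_hosp", "UC_visits", "Asthma_use", "Total_members")
real_mean <- c(3.241, 1.628, 0.6457, 38.21, 466.2)
fake_mean <- c(15.52, 9.578, 2.995, 122.72, 1200.9)
fake_max  <- c(32, 19, 6, 240, 2384)

inflation <- data.frame(
  Metric           = metrics,
  # How many times larger the fake mean is than the real mean.
  Fake_over_real   = round(fake_mean / real_mean, 2),
  # A value near 0.5 is what evenly spread values between 0 and max give.
  Fake_mean_to_max = round(fake_mean / fake_max, 2)
)
print(inflation)
```

A second column close to 0.5 for every metric is consistent with, but does not prove, roughly uniform generation of the mock values.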

Step 3: Create Charts for Trend Analysis

Histogram: Emergency Department Visits

ggplot(real_data, aes(x = ED_visits)) +
  geom_histogram(binwidth = 5, fill = "blue", alpha = 0.5) +
  geom_histogram(data = fake_data, aes(x = ED_visits), binwidth = 5, fill = "red", alpha = 0.5) +
  labs(title = "Comparison of Emergency Department Visits", x = "ED Visits", y = "Count") +
  theme_minimal()

1. Histogram: Emergency Department Visits: The histogram overlays the distribution of ED_visits in the real dataset (blue) and the fake dataset (red). Observation: The fake dataset has a higher frequency of ED_visits in the range of 10-30, while the real dataset has most visits in the 0-10 range. This indicates that the fake dataset overestimates the number of emergency department visits.
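
Because the observation is phrased in ranges such as 0-10 and 10-30, it helps if the bins start at 0 and the two colours are explained by a legend. The sketch below is one possible way to do that, not the plot used in this report: it stacks both sources into one long data frame and maps the source to fill. The six values per dataset are copied from the head() outputs in Step 1 and stand in for the full columns; `ed_long` and `Dataset` are names introduced here.

```r
# Toy ED_visits values copied from the head() outputs in Step 1; in the real
# analysis these would be real_data$ED_visits and fake_data$ED_visits.
real_ed <- c(2, 32, 0, 11, 2, 0)
fake_ed <- c(0, 12, 21, 18, 3, 18)

# Stack both sources into one long data frame with a label column.
ed_long <- data.frame(
  ED_visits = c(real_ed, fake_ed),
  Dataset   = rep(c("Real", "Fake"), times = c(length(real_ed), length(fake_ed)))
)

# Plot only if ggplot2 is installed, so the data step runs on its own.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(ed_long, aes(x = ED_visits, fill = Dataset)) +
    # boundary = 0 with closed = "left" gives bins [0,5), [5,10), ...
    geom_histogram(binwidth = 5, boundary = 0, closed = "left",
                   position = "identity", alpha = 0.5) +
    scale_fill_manual(values = c(Real = "blue", Fake = "red")) +
    labs(title = "Comparison of Emergency Department Visits",
         x = "ED Visits", y = "Count") +
    theme_minimal()
  print(p)
}
```

Mapping the dataset to fill, instead of hard-coding a colour per layer, is what makes ggplot draw the legend.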

Boxplot: Asthma Use Comparison

ggplot() +
  geom_boxplot(data = real_data, aes(y = Asthma_use, x = "Real"), fill = "blue", alpha = 0.5) +
  geom_boxplot(data = fake_data, aes(y = Asthma_use, x = "Fake"), fill = "red", alpha = 0.5) +
  labs(title = "Asthma Controller Medication Use Comparison", x = "Dataset", y = "Asthma Use") +
  theme_minimal()

2. Boxplot: Asthma Use Comparison: The boxplot compares the distribution of Asthma_use between the real and fake datasets. Observation: The fake dataset has a much higher median (126 versus 35) and a much wider interquartile range (116.25 versus 33) for Asthma_use than the real dataset, suggesting that the fake data overestimates asthma medication usage.
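
The box heights in the plot correspond to the interquartile range, which can be worked out from the quartiles that summary() printed in Step 2. A small sketch of that arithmetic, with `asthma_q` as a name introduced here:

```r
# Quartiles of Asthma_use copied from the summary() output in Step 2.
asthma_q <- data.frame(
  Dataset = c("Real", "Fake"),
  Q1      = c(20, 65.25),
  Median  = c(35, 126),
  Q3      = c(53, 181.5)
)

# Interquartile range = Q3 - Q1, the height of each box in the plot.
asthma_q$IQR <- asthma_q$Q3 - asthma_q$Q1
print(asthma_q)
```

On the full columns the same figures come from median() and IQR() applied to real_data$Asthma_use and fake_data$Asthma_use.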

Scatter Plot: Total Members vs Asthma Use

ggplot(real_data, aes(x = Total_members, y = Asthma_use)) +
  geom_point(color = "blue", alpha = 0.5) +
  geom_point(data = fake_data, aes(x = Total_members, y = Asthma_use), color = "red", alpha = 0.5) +
  labs(title = "Comparison of Total Members vs Asthma Use", x = "Total Members", y = "Asthma Use") +
  theme_minimal()

3. Scatter Plot: Total Members vs Asthma Use: The scatter plot compares the relationship between Total_members and Asthma_use for both datasets. Observation: The fake dataset shows a wider spread of Asthma_use values across different Total_members counts, whereas the real dataset has a more concentrated distribution. This suggests that the fake dataset does not accurately capture the relationship between these two variables.
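
The visual impression from the scatter plot can be backed by a Pearson correlation. The sketch below only illustrates the calculation on the six rows per dataset that head() printed in Step 1; six rows are far too few to estimate the correlation of the full data, so the printed numbers should not be quoted as results.

```r
# Six rows per dataset, copied from the head() outputs in Step 1. Six rows
# are far too few for a real estimate; they only illustrate the calculation.
real_members <- c(83, 2114, 48, 422, 138, 17)
real_asthma  <- c(6, 223, 2, 61, 18, 1)
fake_members <- c(1424, 2198, 210, 793, 509, 76)
fake_asthma  <- c(216, 74, 225, 167, 145, 205)

# Pearson correlation between membership and asthma medication use.
cor_real <- cor(real_members, real_asthma)
cor_fake <- cor(fake_members, fake_asthma)

print(c(Real = cor_real, Fake = cor_fake))
```

For the full comparison the inputs would be the Total_members and Asthma_use columns of real_data and fake_data.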

Step 4: Compare Distributions

# Comparing Mean and Standard Deviation for Key Metrics
comparison <- data.frame(
  Metric = c("ED_visits", "ED_hosp", "UC_visits", "Asthma_use", "Total_members"),
  Real_Mean = c(mean(real_data$ED_visits, na.rm = TRUE), 
                mean(real_data$ED_hosp, na.rm = TRUE), 
                mean(real_data$UC_visits, na.rm = TRUE), 
                mean(real_data$Asthma_use, na.rm = TRUE), 
                mean(real_data$Total_members, na.rm = TRUE)),
  Fake_Mean = c(mean(fake_data$ED_visits, na.rm = TRUE), 
                mean(fake_data$ED_hosp, na.rm = TRUE), 
                mean(fake_data$UC_visits, na.rm = TRUE), 
                mean(fake_data$Asthma_use, na.rm = TRUE), 
                mean(fake_data$Total_members, na.rm = TRUE)),
  Real_SD = c(sd(real_data$ED_visits, na.rm = TRUE), 
              sd(real_data$ED_hosp, na.rm = TRUE), 
              sd(real_data$UC_visits, na.rm = TRUE), 
              sd(real_data$Asthma_use, na.rm = TRUE), 
              sd(real_data$Total_members, na.rm = TRUE)),
  Fake_SD = c(sd(fake_data$ED_visits, na.rm = TRUE), 
              sd(fake_data$ED_hosp, na.rm = TRUE), 
              sd(fake_data$UC_visits, na.rm = TRUE), 
              sd(fake_data$Asthma_use, na.rm = TRUE), 
              sd(fake_data$Total_members, na.rm = TRUE))
)

print(comparison)
##          Metric   Real_Mean   Fake_Mean     Real_SD    Fake_SD
## 1     ED_visits   3.2412060   15.520101   3.8677261   9.714221
## 2       ED_hosp   1.6281407    9.577889   2.5209274   5.911839
## 3     UC_visits   0.6457286    2.994975   0.9975382   2.010044
## 4    Asthma_use  38.2110553  122.718593  26.6922130  67.874950
## 5 Total_members 466.1608040 1200.904523 295.7141537 695.923952

Observation: The fake dataset has higher means and standard deviations than the real dataset for every metric: the means are roughly 2.6 to 5.9 times larger and the standard deviations roughly 2 to 2.5 times larger. The fake data is therefore both inflated and more variable, which could lead to misleading conclusions if it were used in analysis.
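
The comparison table above repeats the same mean() and sd() call for every column. A more compact variant is sketched below under the assumption that both inputs are data frames sharing the listed columns; `compare_stats()` is a hypothetical helper, and the toy input reuses the first three rows printed by head() in Step 1.

```r
# Hypothetical helper: mean and SD of each listed column in two data frames.
compare_stats <- function(real, fake, cols) {
  # Apply one statistic to every listed column and drop the names.
  stat <- function(df, f) unname(sapply(df[cols], f, na.rm = TRUE))
  data.frame(
    Metric    = cols,
    Real_Mean = stat(real, mean),
    Fake_Mean = stat(fake, mean),
    Real_SD   = stat(real, sd),
    Fake_SD   = stat(fake, sd)
  )
}

# Toy input: the first three rows printed by head() in Step 1.
real_toy <- data.frame(ED_visits = c(2, 32, 0), ED_hosp = c(0, 19, 0))
fake_toy <- data.frame(ED_visits = c(0, 12, 21), ED_hosp = c(17, 12, 8))

toy_comparison <- compare_stats(real_toy, fake_toy, c("ED_visits", "ED_hosp"))
print(toy_comparison)
```

Called with real_data, fake_data and the five metric names, the helper should return the same columns as the table above.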

ADDITIONAL COMPARISON CHARTS

1. Density Plot: ED Visits

ggplot() +
  geom_density(data = real_data, aes(x = ED_visits, color = "Real"), fill = "blue", alpha = 0.3) +
  geom_density(data = fake_data, aes(x = ED_visits, color = "Fake"), fill = "red", alpha = 0.3) +
  labs(title = "Density Plot of ED Visits", x = "ED Visits", y = "Density") +
  scale_color_manual(values = c("Real" = "blue", "Fake" = "red")) +
  theme_minimal()

2. Violin Plot: Urgent Care Visits

ggplot() +
  geom_violin(data = real_data, aes(y = UC_visits, x = "Real"), fill = "blue", alpha = 0.5) +
  geom_violin(data = fake_data, aes(y = UC_visits, x = "Fake"), fill = "red", alpha = 0.5) +
  labs(title = "Urgent Care Visits Comparison", x = "Dataset", y = "Urgent Care Visits") +
  theme_minimal()

3. Heatmap: Correlation Matrix

library(reshape2)

# Compute correlation matrices
real_corr <- cor(real_data %>% select(-Census_tract), use = "complete.obs")
fake_corr <- cor(fake_data %>% select(-Census_tract), use = "complete.obs")

# Melt data for heatmap
real_melted <- melt(real_corr)
fake_melted <- melt(fake_corr)

# Heatmap for real dataset
ggplot(real_melted, aes(Var1, Var2, fill = value)) +
  geom_tile() +
  scale_fill_gradient2(low = "blue", high = "red", mid = "white", midpoint = 0) +
  labs(title = "Correlation Heatmap (Real Data)", x = "", y = "") +
  theme_minimal()

# Heatmap for fake dataset
ggplot(fake_melted, aes(Var1, Var2, fill = value)) +
  geom_tile() +
  scale_fill_gradient2(low = "blue", high = "red", mid = "white", midpoint = 0) +
  labs(title = "Correlation Heatmap (Fake Data)", x = "", y = "") +
  theme_minimal()

Additional Section: Analysis of Each Graph

  1. Density Plot: ED Visits: The density plot shows the distribution of ED_visits for both datasets. Observation: The fake dataset has a much higher density in the 10-30 range, while the real dataset is concentrated in the 0-10 range. This confirms that the fake dataset overestimates the frequency of emergency department visits.

  2. Violin Plot: Urgent Care Visits: The violin plot compares the distribution of UC_visits between the real and fake datasets. Observation: The fake dataset has a wider distribution of UC_visits, with more values in the 2-6 range, while the real dataset has most values clustered around 0-1. This indicates that the fake dataset overestimates urgent care visits.

  3. Heatmap: Correlation Matrix: The heatmap shows the correlation between variables in both datasets. Observation: The real dataset shows a stronger correlation between Total_members and Asthma_use, while the fake dataset has weaker correlations overall. This suggests that the fake dataset does not accurately capture the relationships between variables.
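
Relationships between variables are not only a matter of correlation; some combinations of values may be impossible. The sketch below checks one such rule under an assumption that the source does not state: that ED_hosp counts ED visits ending in a hospital admission, so it should never exceed ED_visits. `count_violations()` is a hypothetical helper, and the rows are the ones printed by head() in Step 1.

```r
# Assumption (not stated in the source): ED_hosp counts ED visits that ended
# in a hospital admission, so it should never exceed ED_visits.
count_violations <- function(df) {
  sum(df$ED_hosp > df$ED_visits, na.rm = TRUE)
}

# Rows copied from the head() outputs in Step 1.
real_head <- data.frame(ED_visits = c(2, 32, 0, 11, 2, 0),
                        ED_hosp   = c(0, 19, 0, 3, 1, 0))
fake_head <- data.frame(ED_visits = c(0, 12, 21, 18, 3, 18),
                        ED_hosp   = c(17, 12, 8, 11, 3, 19))

violations <- c(Real = count_violations(real_head),
                Fake = count_violations(fake_head))
print(violations)
```

If the assumption holds, a row with more hospitalisations than visits marks the mock data as implausible regardless of how well its means match.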

Conclusion

The analysis reveals that the fake dataset significantly overestimates key metrics such as ED_visits, ED_hosp, UC_visits, and Asthma_use compared to the real dataset. The distributions and correlations in the fake dataset do not accurately reflect those in the real dataset, which could lead to incorrect conclusions if used in further analysis.

To improve the accuracy of the fake dataset, the following measures could be applied:

  1. Adjust the Range of Values: Reduce the maximum values for ED_visits, ED_hosp, UC_visits, and Asthma_use to better match the real dataset.
  2. Refine the Distribution: Ensure that the distribution of values in the fake dataset closely matches the real dataset, particularly for key metrics like Asthma_use and Total_members.
  3. Improve Correlations: Adjust the fake dataset to better reflect the correlations between variables, especially between Total_members and Asthma_use.
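
One way to act on the "refine the distribution" measure is to draw counts from a right-skewed distribution fitted to the real mean and standard deviation. The sketch below uses a negative binomial fitted by the method of moments; this is a stand-in chosen here for illustration, not the method used to build the mock files, and `simulate_counts()` is a hypothetical helper.

```r
# Method-of-moments negative binomial: for mean mu and variance v > mu,
# size = mu^2 / (v - mu). Hypothetical helper, not the author's generator.
simulate_counts <- function(n, mu, sd) {
  v <- sd^2
  stopifnot(v > mu)  # the negative binomial needs variance above the mean
  rnbinom(n, size = mu^2 / (v - mu), mu = mu)
}

set.seed(2017)  # arbitrary seed, for reproducibility only

# Real mean and SD of ED_visits, copied from the comparison table in Step 4.
mock_ed <- simulate_counts(398, mu = 3.2412, sd = 3.8677)

print(c(mean = mean(mock_ed), sd = sd(mock_ed)))
```

Unlike a hard cap on the maximum, this keeps occasional large values while holding the mean near the real one. Whether the negative binomial is a good fit for each column would still need to be checked against the real histograms.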

PART 2 OF WEEK 5 ANALYSIS:

COMPARISON OF THE DATABASE FOR ASTHMA FROM 2017 AND THE SECOND SET OF MOCK DATA CREATED

In this part, the maximum value of each column in the fake dataset is lowered so that its mean and standard deviation move closer to those of the real dataset. In the first mock dataset the values were much higher than the real ones, which made it unrealistic.

Why These Specific Adjustments?

  1. ED_visits (Max: 15)
  • The real dataset has a mean of 3.24 and a max of 32.
  • The fake dataset has a mean of 15.52, which is far too high.
  • Reducing the max to 15 lowers the overall average and brings it closer to 3.24 while still allowing variation.
  2. ED_hosp (Max: 10)
  • Real mean: 1.63, fake mean: 9.58 (too high).
  • Since the real max is 19, cutting the fake max to 10 balances the numbers while keeping it realistic.
  3. UC_visits (Max: 4)
  • Real mean: 0.65, fake mean: 2.99.
  • Real max is 6, so setting the fake max to 4 keeps it closer to the real range without making it too extreme.
  4. Asthma_use (Max: 120)
  • Real mean: 38.21, fake mean: 122.72.
  • The real max is 240, but the fake dataset is far off.
  • Lowering the max to 120 ensures that the mean does not skew too high while still allowing higher values.
  5. Total_members (Max: 1000)
  • Real mean: 466, fake mean: 1201 (a large difference).
  • The real max is 2398, but setting the fake max to 1000 brings the average down while still keeping the dataset diverse.

The Goal?

By adjusting these max values, we’re making sure the fake dataset doesn’t overshoot the real one but still has enough variation to look natural. If we keep the fake max values too high, the whole dataset will be inflated and won’t reflect real trends.
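
How far a new maximum moves the mean can be estimated before generating any data. The sketch below assumes, as a simplification that the source does not confirm, that each mock column is drawn uniformly from the integers 0 to max, in which case the expected mean is max / 2. The maxima and real means are the ones listed above; the column names are introduced here.

```r
# Assumption: each mock column is drawn uniformly from the integers 0..max.
# The expected value of that distribution is max / 2.
metrics   <- c("ED_visits", "ED_hosp", "UC_visits", "Asthma_use", "Total_members")
new_max   <- c(15, 10, 4, 120, 1000)
real_mean <- c(3.24, 1.63, 0.65, 38.21, 466)

expected <- data.frame(
  Metric            = metrics,
  New_max           = new_max,
  Expected_mean     = new_max / 2,
  Real_mean         = real_mean,
  # Cap at which a uniform draw would reproduce the real mean.
  Max_for_real_mean = 2 * real_mean
)
print(expected)
```

Under this assumption, matching the real ED_visits mean would need a cap near 6.5, far below the real maximum of 32, so a cap alone cannot reproduce both the mean and the range of a right-skewed column.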

Step 1: Load the datasets

  1. Real Dataset: The same real dataset (datase_asthma-2017.csv) was loaded, containing 398 rows and 6 columns.
  2. Fake Dataset: A new fake dataset (mock_data_new.csv) was loaded, which was adjusted to better match the real dataset. The adjustments included reducing the maximum values for key metrics like ED_visits, ED_hosp, UC_visits, Asthma_use, and Total_members.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ lubridate 1.9.3     ✔ tibble    3.2.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# Load the real dataset
real_data <- read_csv("datase_asthma-2017.csv")
## Rows: 398 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (6): Census_tract, ED_visits, ED_hosp, UC_visits, Asthma_use, Total_members
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Load the fake dataset
fake_data <- read_csv("mock_data_new.csv")
## Rows: 398 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (6): Census_tract, ED_visits, ED_hosp, UC_visits, Asthma_use, Total_members
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# View the first few rows
head(real_data)
## # A tibble: 6 × 6
##   Census_tract ED_visits ED_hosp UC_visits Asthma_use Total_members
##          <dbl>     <dbl>   <dbl>     <dbl>      <dbl>         <dbl>
## 1  42003010300         2       0         0          6            83
## 2  42003020100        32      19         3        223          2114
## 3  42003020300         0       0         0          2            48
## 4  42003030500        11       3         3         61           422
## 5  42003040200         2       1         1         18           138
## 6  42003040400         0       0         0          1            17
head(fake_data)
## # A tibble: 6 × 6
##   Census_tract ED_visits ED_hosp UC_visits Asthma_use Total_members
##          <dbl>     <dbl>   <dbl>     <dbl>      <dbl>         <dbl>
## 1  42003080863        12       2         3         28           996
## 2  42003945826         6       1         4         90           661
## 3  42003574160         8       4         3        117           615
## 4  42003592335         8      10         4         94           910
## 5  42003436405         6       4         4         84            33
## 6  42003724223        10       9         3         21           420

Step 2: Create Summary Statistics (Pivot Tables)

Summary statistics were generated for both datasets using the summary() function.

# Summary statistics for real dataset
summary(real_data)
##   Census_tract       ED_visits         ED_hosp         UC_visits     
##  Min.   :4.2e+10   Min.   : 0.000   Min.   : 0.000   Min.   :0.0000  
##  1st Qu.:4.2e+10   1st Qu.: 1.000   1st Qu.: 0.000   1st Qu.:0.0000  
##  Median :4.2e+10   Median : 2.000   Median : 1.000   Median :0.0000  
##  Mean   :4.2e+10   Mean   : 3.241   Mean   : 1.628   Mean   :0.6457  
##  3rd Qu.:4.2e+10   3rd Qu.: 4.000   3rd Qu.: 2.000   3rd Qu.:1.0000  
##  Max.   :4.2e+10   Max.   :32.000   Max.   :19.000   Max.   :6.0000  
##    Asthma_use     Total_members   
##  Min.   :  0.00   Min.   :   1.0  
##  1st Qu.: 20.00   1st Qu.: 249.5  
##  Median : 35.00   Median : 412.0  
##  Mean   : 38.21   Mean   : 466.2  
##  3rd Qu.: 53.00   3rd Qu.: 646.8  
##  Max.   :240.00   Max.   :2398.0
# Summary statistics for fake dataset
summary(fake_data)
##   Census_tract       ED_visits         ED_hosp         UC_visits   
##  Min.   :4.2e+10   Min.   : 0.000   Min.   : 0.000   Min.   :0.00  
##  1st Qu.:4.2e+10   1st Qu.: 3.000   1st Qu.: 2.000   1st Qu.:1.00  
##  Median :4.2e+10   Median : 7.000   Median : 5.000   Median :2.00  
##  Mean   :4.2e+10   Mean   : 7.302   Mean   : 4.899   Mean   :2.06  
##  3rd Qu.:4.2e+10   3rd Qu.:11.000   3rd Qu.: 8.000   3rd Qu.:3.00  
##  Max.   :4.2e+10   Max.   :15.000   Max.   :10.000   Max.   :4.00  
##    Asthma_use     Total_members  
##  Min.   :  0.00   Min.   :  2.0  
##  1st Qu.: 26.00   1st Qu.:230.2  
##  Median : 57.50   Median :487.0  
##  Mean   : 58.22   Mean   :486.5  
##  3rd Qu.: 88.75   3rd Qu.:745.8  
##  Max.   :119.00   Max.   :996.0
# Aggregate by Census Tract for both datasets
real_summary <- real_data %>%
  group_by(Census_tract) %>%
  summarise(
    total_ed_visits = sum(ED_visits, na.rm = TRUE),
    total_hosp = sum(ED_hosp, na.rm = TRUE),
    total_uc_visits = sum(UC_visits, na.rm = TRUE),
    total_asthma_use = sum(Asthma_use, na.rm = TRUE),
    total_members = sum(Total_members, na.rm = TRUE)
  )

fake_summary <- fake_data %>%
  group_by(Census_tract) %>%
  summarise(
    total_ed_visits = sum(ED_visits, na.rm = TRUE),
    total_hosp = sum(ED_hosp, na.rm = TRUE),
    total_uc_visits = sum(UC_visits, na.rm = TRUE),
    total_asthma_use = sum(Asthma_use, na.rm = TRUE),
    total_members = sum(Total_members, na.rm = TRUE)
  )

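Observation: The adjusted fake dataset has much lower means and maximum values than the original fake dataset. However, its means are still higher than those of the real dataset for ED_visits (7.30 versus 3.24), ED_hosp (4.90 versus 1.63), UC_visits (2.06 versus 0.65), and Asthma_use (58.22 versus 38.21); only Total_members (486.5 versus 466.2) is close. The new maxima are also well below the real maxima (for example, 15 versus 32 for ED_visits), so the upper tail of the real data is no longer represented. This suggests that further adjustments are needed to fully align the fake dataset with the real one.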

Step 3: Create Charts for Trend Analysis

Histogram: Emergency Department Visits

ggplot(real_data, aes(x = ED_visits)) +
  geom_histogram(binwidth = 5, fill = "blue", alpha = 0.5) +
  geom_histogram(data = fake_data, aes(x = ED_visits), binwidth = 5, fill = "red", alpha = 0.5) +
  labs(title = "Comparison of Emergency Department Visits", x = "ED Visits", y = "Count") +
  theme_minimal()

1. Histogram: Emergency Department Visits: The histogram overlays the distribution of ED_visits in the real dataset (blue) and the adjusted fake dataset (red). Observation: The adjusted fake dataset shows a more reasonable distribution of ED_visits, with most values falling in the 0-10 range. However, there is still a higher frequency of visits in the 10-15 range compared to the real dataset, which has most visits in the 0-5 range.

Boxplot: Asthma Use Comparison

ggplot() +
  geom_boxplot(data = real_data, aes(y = Asthma_use, x = "Real"), fill = "blue", alpha = 0.5) +
  geom_boxplot(data = fake_data, aes(y = Asthma_use, x = "Fake"), fill = "red", alpha = 0.5) +
  labs(title = "Asthma Controller Medication Use Comparison", x = "Dataset", y = "Asthma Use") +
  theme_minimal()

2. Boxplot: Asthma Use Comparison: The boxplot compares the distribution of Asthma_use between the real and adjusted fake datasets. Observation: The adjusted fake dataset has a lower median (57.5) and interquartile range (62.75) for Asthma_use than the original fake dataset (126 and 116.25), but both are still higher than in the real dataset (35 and 33). This indicates that the fake dataset still overestimates asthma medication usage, though to a lesser extent.

Scatter Plot: Total Members vs Asthma Use

ggplot(real_data, aes(x = Total_members, y = Asthma_use)) +
  geom_point(color = "blue", alpha = 0.5) +
  geom_point(data = fake_data, aes(x = Total_members, y = Asthma_use), color = "red", alpha = 0.5) +
  labs(title = "Comparison of Total Members vs Asthma Use", x = "Total Members", y = "Asthma Use") +
  theme_minimal()

3. Scatter Plot: Total Members vs Asthma Use: The scatter plot compares the relationship between Total_members and Asthma_use for both datasets. Observation: The adjusted fake dataset shows a more concentrated distribution of Asthma_use values across different Total_members counts, which is closer to the real dataset. However, the fake dataset still has a wider spread, particularly for higher values of Total_members.

Step 4: Compare Distributions

# Comparing Mean and Standard Deviation for Key Metrics
comparison <- data.frame(
  Metric = c("ED_visits", "ED_hosp", "UC_visits", "Asthma_use", "Total_members"),
  Real_Mean = c(mean(real_data$ED_visits, na.rm = TRUE), 
                mean(real_data$ED_hosp, na.rm = TRUE), 
                mean(real_data$UC_visits, na.rm = TRUE), 
                mean(real_data$Asthma_use, na.rm = TRUE), 
                mean(real_data$Total_members, na.rm = TRUE)),
  Fake_Mean = c(mean(fake_data$ED_visits, na.rm = TRUE), 
                mean(fake_data$ED_hosp, na.rm = TRUE), 
                mean(fake_data$UC_visits, na.rm = TRUE), 
                mean(fake_data$Asthma_use, na.rm = TRUE), 
                mean(fake_data$Total_members, na.rm = TRUE)),
  Real_SD = c(sd(real_data$ED_visits, na.rm = TRUE), 
              sd(real_data$ED_hosp, na.rm = TRUE), 
              sd(real_data$UC_visits, na.rm = TRUE), 
              sd(real_data$Asthma_use, na.rm = TRUE), 
              sd(real_data$Total_members, na.rm = TRUE)),
  Fake_SD = c(sd(fake_data$ED_visits, na.rm = TRUE), 
              sd(fake_data$ED_hosp, na.rm = TRUE), 
              sd(fake_data$UC_visits, na.rm = TRUE), 
              sd(fake_data$Asthma_use, na.rm = TRUE), 
              sd(fake_data$Total_members, na.rm = TRUE))
)

print(comparison)
##          Metric   Real_Mean  Fake_Mean     Real_SD    Fake_SD
## 1     ED_visits   3.2412060   7.301508   3.8677261   4.714472
## 2       ED_hosp   1.6281407   4.899497   2.5209274   3.151898
## 3     UC_visits   0.6457286   2.060302   0.9975382   1.416485
## 4    Asthma_use  38.2110553  58.223618  26.6922130  36.424626
## 5 Total_members 466.1608040 486.547739 295.7141537 294.545462

Observation: The adjusted fake dataset has lower means and standard deviations than the original fake dataset, but it still overestimates ED_visits, ED_hosp, UC_visits, and Asthma_use; only the Total_members mean (486.5 versus 466.2) is now close to the real value. The standard deviations are closer to those of the real dataset, indicating that the variability in the fake dataset has been reduced.
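
The progress between the two mock datasets is easier to judge as a ratio to the real mean. The sketch below copies the means from the Part 1 and Part 2 comparison tables; the labels Mock_v1 and Mock_v2 are introduced here for the first and second mock datasets.

```r
# Means copied from the two comparison tables (Part 1 and Part 2).
progress <- data.frame(
  Metric  = c("ED_visits", "ED_hosp", "UC_visits", "Asthma_use", "Total_members"),
  Real    = c(3.2412, 1.6281, 0.6457, 38.2111, 466.1608),
  Mock_v1 = c(15.5201, 9.5779, 2.9950, 122.7186, 1200.9045),
  Mock_v2 = c(7.3015, 4.8995, 2.0603, 58.2236, 486.5477)
)

# Ratio of each mock mean to the real mean; 1 would be a perfect match.
progress$Ratio_v1 <- round(progress$Mock_v1 / progress$Real, 2)
progress$Ratio_v2 <- round(progress$Mock_v2 / progress$Real, 2)
print(progress[, c("Metric", "Ratio_v1", "Ratio_v2")])
```

Every ratio falls between the two versions, and the metric left furthest from the real mean in relative terms is UC_visits.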

ADDITIONAL GRAPHS

# Convert necessary columns to numeric
numeric_cols <- c("ED_visits", "ED_hosp", "UC_visits", "Asthma_use", "Total_members")
real_data[numeric_cols] <- lapply(real_data[numeric_cols], as.numeric)
fake_data[numeric_cols] <- lapply(fake_data[numeric_cols], as.numeric)

# ---- 1. Density Plot: ED Visits ----
ggplot() +
  geom_density(data = real_data, aes(x = ED_visits, color = "Real"), fill = "blue", alpha = 0.3) +
  geom_density(data = fake_data, aes(x = ED_visits, color = "Fake"), fill = "red", alpha = 0.3) +
  labs(title = "Density Plot of ED Visits", x = "ED Visits", y = "Density") +
  scale_color_manual(values = c("Real" = "blue", "Fake" = "red")) +
  theme_minimal()

# ---- 2. Violin Plot: Urgent Care Visits ----
ggplot() +
  geom_violin(data = real_data, aes(y = UC_visits, x = "Real"), fill = "blue", alpha = 0.5) +
  geom_violin(data = fake_data, aes(y = UC_visits, x = "Fake"), fill = "red", alpha = 0.5) +
  labs(title = "Urgent Care Visits Comparison", x = "Dataset", y = "Urgent Care Visits") +
  theme_minimal()

# ---- 3. Heatmap: Correlation Matrix ----
# Compute correlation matrices
real_corr <- cor(real_data %>% select(-Census_tract), use = "complete.obs")
fake_corr <- cor(fake_data %>% select(-Census_tract), use = "complete.obs")

# Melt data for heatmap (melt() comes from the reshape2 package)
library(reshape2)
real_melted <- melt(real_corr)
fake_melted <- melt(fake_corr)

# Heatmap for real dataset
ggplot(real_melted, aes(Var1, Var2, fill = value)) +
  geom_tile() +
  scale_fill_gradient2(low = "blue", high = "red", mid = "white", midpoint = 0) +
  labs(title = "Correlation Heatmap (Real Data)", x = "", y = "") +
  theme_minimal()

# Heatmap for fake dataset
ggplot(fake_melted, aes(Var1, Var2, fill = value)) +
  geom_tile() +
  scale_fill_gradient2(low = "blue", high = "red", mid = "white", midpoint = 0) +
  labs(title = "Correlation Heatmap (Fake Data)", x = "", y = "") +
  theme_minimal()

Additional Section: Analysis of Each Graph

  1. Density Plot: ED Visits: The density plot shows the distribution of ED_visits for both the real and adjusted fake datasets. Observation: The adjusted fake dataset has a more reasonable distribution, with most values falling in the 0-10 range. However, there is still a higher density in the 10-15 range compared to the real dataset, which is concentrated in the 0-5 range.

  2. Violin Plot: Urgent Care Visits: The violin plot compares the distribution of UC_visits between the real and adjusted fake datasets. Observation: The adjusted fake dataset has a narrower distribution of UC_visits, with most values falling in the 0-4 range. This is closer to the real dataset, which has most values clustered around 0-1. However, the fake dataset still overestimates urgent care visits.

  3. Heatmap: Correlation Matrix: The heatmap shows the correlation between variables in both datasets. Observation: The adjusted fake dataset shows stronger correlations between variables compared to the original fake dataset, particularly between Total_members and Asthma_use. However, the correlations are still weaker than in the real dataset, indicating that the fake dataset does not fully capture the relationships between variables.

CONCLUSION

The adjustments made to the fake dataset in Part 2 have improved its alignment with the real dataset, particularly in terms of reducing the inflated values for key metrics like ED_visits, ED_hosp, and Asthma_use. However, the fake dataset still overestimates some metrics and does not fully capture the correlations between variables.

To further improve the accuracy of the fake dataset, the following measures could be applied:

  1. Further Reduce Maximum Values: Adjust the maximum values for ED_visits, ED_hosp, and Asthma_use to better match the real dataset.
  2. Refine the Distribution: Ensure that the distribution of values in the fake dataset more closely matches the real dataset, particularly for Asthma_use and Total_members.
  3. Improve Correlations: Adjust the fake dataset to better reflect the correlations between variables, especially between Total_members and Asthma_use.
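
For the "improve correlations" measure, one option is to generate Asthma_use from Total_members instead of drawing the two columns independently. The sketch below treats asthma medication users as a binomial share of each tract's members. This is an illustrative assumption rather than a model taken from the source; `simulate_asthma_use()` is a hypothetical helper, and the membership counts are random stand-ins.

```r
# Assumption: asthma medication users are a binomial share of each tract's
# members. The rate below is the real mean Asthma_use over the real mean
# Total_members from the comparison table; the source does not model it so.
simulate_asthma_use <- function(total_members, rate) {
  rbinom(length(total_members), size = total_members, prob = rate)
}

set.seed(2017)  # arbitrary seed, for reproducibility only

rate <- 38.2111 / 466.1608

# Stand-in membership counts; real_data$Total_members would be used in practice.
members <- sample(1:1000, 398, replace = TRUE)
asthma  <- simulate_asthma_use(members, rate)

print(cor(members, asthma))
```

Because every simulated count is tied to its tract's membership, Asthma_use can never exceed Total_members and the two columns are correlated by construction. A constant rate probably makes the link tighter than in the real data, so the rate could be varied per tract if the resulting correlation overshoots.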