a) Find a data set of your interest with at least five meaningful continuous variables. The sample size should be larger than 1000. Introduce your data set and five selected continuous variables.

ANWSER:

The dataset was obtained from County Health Rankings (https://www.countyhealthrankings.org/), and includes many variables for each county, such as demographic, health behavior, and socioeconomic factors. The County Health Rankings & Roadmaps (CHR&R) is a program hosted by the University of Wisconsin Population Health Institute. Its goal is to assist communities in improving residents’ health and promoting health equity by providing actionable data, evidence-based strategies, and guidance. The website ranks the health of U.S. counties, highlighting various factors that influence health, including health behaviors, clinical care, social and economic factors, and the physical environment. These rankings help communities understand their health status and identify areas for improvement.

I plan to use Physical Inactivity, Flu Vaccinations, Children in Single-Parent Households, Alcohol-Impaired Driving Deaths,Access to Exercise Opportunities. Each of these represents the corresponding proportion in the statistical data for different counties across the United States.

b) Create a pair plot of your continuous variables. Also compute the covariance and correlation matrices. Comment on your findings.

Anwser:

data4 <- read_excel("C:/Users/sukuu/OneDrive/桌面/learn/previous courses/Advanced Regression Analysis/Project/program/All_Counties.xlsx")

#### View the structure of the dataset
str(data4)
## tibble [3,143 × 87] (S3: tbl_df/tbl/data.frame)
##  $ State                                            : chr [1:3143] "Alabama" "Alabama" "Alabama" "Alabama" ...
##  $ County                                           : chr [1:3143] "Autauga (AU)" "Baldwin (BA)" "Barbour (BR)" "Bibb (BI)" ...
##  $ Population                                       : num [1:3143] 59759 246435 24706 22005 59512 ...
##  $ % Below 18 Years of Age                          : num [1:3143] 0.232 0.208 0.204 0.197 0.227 0.211 0.224 0.211 0.207 0.186 ...
##  $ % 65 and Older                                   : num [1:3143] 0.164 0.219 0.206 0.178 0.191 0.178 0.215 0.186 0.207 0.244 ...
##  $ % Non-Hispanic Black                             : num [1:3143] 0.21 0.082 0.472 0.207 0.017 0.673 0.442 0.22 0.393 0.04 ...
##  $ % American Indian or Alaska Native               : num [1:3143] 0.005 0.008 0.007 0.006 0.006 0.009 0.005 0.005 0.004 0.008 ...
##  $ % Asian                                          : num [1:3143] 0.012 0.012 0.005 0.003 0.004 0.003 0.015 0.009 0.009 0.004 ...
##  $ % Native Hawaiian or Other Pacific Islander      : num [1:3143] 0.001 0.001 0.003 0.001 0.001 0.008 0.001 0.001 0.001 0.001 ...
##  $ % Hispanic                                       : num [1:3143] 0.034 0.05 0.051 0.032 0.101 0.094 0.018 0.043 0.036 0.021 ...
##  $ % Non-Hispanic White                             : num [1:3143] 0.721 0.83 0.453 0.74 0.86 0.218 0.508 0.702 0.545 0.91 ...
##  $ % Not Proficient in English                      : num [1:3143] 0 0.01 0.02 0 0.02 0.02 0 0.01 0 0.01 ...
##  $ % Female                                         : num [1:3143] 0.513 0.512 0.471 0.465 0.503 0.446 0.53 0.514 0.523 0.496 ...
##  $ % Rural                                          : num [1:3143] 0.407 0.376 0.659 1 0.905 1 0.694 0.365 0.507 1 ...
##  $ Life Expectancy                                  : num [1:3143] 75.3 76.7 72.4 72.3 73.4 72.1 73.2 71.7 72.5 72.7 ...
##  $ Premature Age-Adjusted Mortality                 : num [1:3143] 470 420 630 620 570 630 620 650 580 630 ...
##  $ High School Graduation                           : num [1:3143] 0.87 0.89 0.85 0.88 0.98 0.83 0.8 0.92 0.87 0.92 ...
##  $ Child Mortality                                  : num [1:3143] 60 50 80 100 60 NA 70 70 100 90 ...
##  $ Disconnected Youth                               : num [1:3143] 0.07 0.08 NA NA 0.1 NA NA 0.08 0.17 NA ...
##  $ Frequent Physical Distress                       : num [1:3143] 0.12 0.11 0.16 0.14 0.13 0.17 0.14 0.14 0.14 0.14 ...
##  $ Infant Mortality                                 : num [1:3143] 7 5 NA NA 7 NA NA 7 9 NA ...
##  $ Uninsured Adults                                 : num [1:3143] 0.12 0.13 0.16 0.14 0.16 0.16 0.15 0.17 0.16 0.16 ...
##  $ Reading Scores                                   : num [1:3143] 3.1 3.2 2.4 2.7 2.9 2.1 2.5 2.9 2.6 2.8 ...
##  $ Uninsured Children                               : num [1:3143] 0.04 0.05 0.04 0.04 0.05 0.03 0.04 0.04 0.04 0.04 ...
##  $ Frequent Mental Distress                         : num [1:3143] 0.18 0.17 0.21 0.19 0.19 0.21 0.2 0.19 0.2 0.2 ...
##  $ Math Scores                                      : num [1:3143] 2.9 3 2 2.6 2.8 2 2.1 2.8 2.5 2.6 ...
##  $ School Segregation                               : num [1:3143] 0.05 0.09 0.04 0.09 0.1 0.01 0.12 0.19 0.13 0.06 ...
##  $ Diabetes Prevalence                              : num [1:3143] 0.11 0.1 0.16 0.12 0.11 0.19 0.14 0.12 0.13 0.11 ...
##  $ Other Primary Care Providers                     : chr [1:3143] "1,990:1" "1,200:1" "1,240:1" "29.584027777777777" ...
##  $ HIV Prevalence                                   : num [1:3143] 218 174 514 269 81 616 377 255 360 133 ...
##  $ School Funding Adequacy                          : num [1:3143] -3607 -537 -23627 -6971 -2789 ...
##  $ Traffic Volume                                   : num [1:3143] 46 39 34 8 14 16 22 56 37 4 ...
##  $ Homeownership                                    : num [1:3143] 0.75 0.78 0.65 0.76 0.79 0.65 0.71 0.71 0.72 0.8 ...
##  $ Food Insecurity                                  : num [1:3143] 0.13 0.12 0.18 0.15 0.14 0.16 0.14 0.16 0.13 0.15 ...
##  $ Severe Housing Cost Burden                       : num [1:3143] 0.11 0.12 0.11 0.1 0.09 0.16 0.12 0.11 0.1 0.09 ...
##  $ Limited Access to Healthy Foods                  : num [1:3143] 0.13 0.08 0.1 0 0.03 0.32 0.07 0.15 0.08 0.05 ...
##  $ Broadband Access                                 : num [1:3143] 0.89 0.89 0.68 0.8 0.83 0.61 0.79 0.83 0.78 0.78 ...
##  $ Gender Pay Gap                                   : num [1:3143] 0.75 0.72 0.78 0.81 0.88 0.67 0.78 0.76 0.77 0.88 ...
##  $ Drug Overdose Deaths                             : num [1:3143] 9 26 NA 22 21 NA 17 23 15 21 ...
##  $ Median Household Income                          : num [1:3143] 70100 71700 41200 54300 60600 35800 41900 52800 45600 52600 ...
##  $ Living Wage                                      : num [1:3143] 41.4 43.6 35.2 38.4 42.9 ...
##  $ Children Eligible for Free or Reduced Price Lunch: num [1:3143] 0.28 0.37 0.57 0.49 0.43 0.76 0.71 0.51 0.66 0.54 ...
##  $ Residential Segregation - Black/White            : num [1:3143] 29 40 19 32 63 35 24 46 31 48 ...
##  $ Child Care Cost Burden                           : num [1:3143] 0.2 0.2 0.25 0.24 0.35 0.39 0.34 0.23 0.33 0.22 ...
##  $ Child Care Centers                               : num [1:3143] 6 6 5 5 5 5 5 6 6 6 ...
##  $ Homicides                                        : num [1:3143] 7 3 13 10 6 19 17 10 11 8 ...
##  $ Suicides                                         : num [1:3143] 17 19 13 17 22 NA 12 22 19 23 ...
##  $ Firearm Fatalities                               : num [1:3143] 17 14 27 21 18 28 25 27 24 26 ...
##  $ Motor Vehicle Crash Deaths                       : num [1:3143] 18 16 29 25 31 56 39 22 28 32 ...
##  $ Juvenile Arrests                                 : num [1:3143] 5 20 NA NA 7 7 16 31 NA 23 ...
##  $ Insufficient Sleep                               : num [1:3143] 0.38 0.35 0.44 0.4 0.37 0.46 0.41 0.38 0.43 0.38 ...
##  $ Voter Turnout                                    : num [1:3143] 0.662 0.653 0.54 0.546 0.642 0.591 0.629 0.579 0.585 0.595 ...
##  $ Census Participation                             : num [1:3143] 0.688 0.564 0.545 0.56 0.667 0.488 0.558 0.636 0.583 0.457 ...
##  $ Air Pollution - Particulate Matter               : num [1:3143] 10 7.6 9.4 9.8 9.6 9.3 9.1 9.7 9.8 9.5 ...
##  $ High School Completion                           : num [1:3143] 0.9 0.92 0.76 0.79 0.82 0.73 0.88 0.85 0.83 0.82 ...
##  $ Premature Death                                  : num [1:3143] 9400 9000 13100 12700 11500 13800 12700 13200 12100 12400 ...
##  $ Poor or Fair Health                              : num [1:3143] 0.17 0.15 0.27 0.22 0.19 0.31 0.23 0.21 0.22 0.21 ...
##  $ Adult Smoking                                    : num [1:3143] 0.17 0.15 0.25 0.22 0.2 0.26 0.21 0.21 0.2 0.22 ...
##  $ Uninsured                                        : num [1:3143] 0.1 0.11 0.13 0.11 0.13 0.12 0.12 0.13 0.13 0.13 ...
##  $ Drinking Water Violations                        : chr [1:3143] "No" "Yes" "No" "No" ...
##  $ Poor Physical Health Days                        : num [1:3143] 3.9 3.7 5 4.6 4.2 5.2 4.6 4.5 4.5 4.5 ...
##  $ Some College                                     : num [1:3143] 0.64 0.66 0.4 0.39 0.52 0.33 0.39 0.55 0.52 0.53 ...
##  $ Primary Care Physicians                          : chr [1:3143] "2,270:1" "1,600:1" "2,500:1" "1,500:1" ...
##  $ Poor Mental Health Days                          : num [1:3143] 5.7 5.4 6 5.8 5.7 6 6.1 5.7 5.8 6.2 ...
##  $ Dentists                                         : chr [1:3143] "3,320:1" "2,120:1" "3,090:1" "4,400:1" ...
##  $ Low Birthweight                                  : num [1:3143] 0.1 0.08 0.12 0.1 0.08 0.15 0.13 0.1 0.13 0.07 ...
##  $ Mental Health Providers                          : chr [1:3143] "2,720:1" "37.917361111111113" "3,530:1" "2,000:1" ...
##  $ Preventable Hospital Stays                       : num [1:3143] 3915 2799 3040 4651 2499 ...
##  $ Adult Obesity                                    : num [1:3143] 0.39 0.37 0.43 0.4 0.38 0.49 0.42 0.39 0.44 0.38 ...
##  $ Severe Housing Problems                          : num [1:3143] 0.15 0.12 0.15 0.12 0.11 0.1 0.12 0.11 0.16 0.07 ...
##  $ Unemployment                                     : num [1:3143] 0.023 0.024 0.041 0.025 0.022 0.028 0.034 0.03 0.026 0.023 ...
##  $ Driving Alone to Work                            : num [1:3143] 0.85 0.81 0.82 0.88 0.85 0.89 0.86 0.85 0.86 0.85 ...
##  $ Mammography Screening                            : num [1:3143] 0.4 0.44 0.42 0.36 0.35 0.51 0.39 0.33 0.3 0.35 ...
##  $ Food Environment Index                           : num [1:3143] 6.7 7.5 6 7.6 7.7 4.2 7.2 5.9 7.2 7.2 ...
##  $ Physical Inactivity                              : num [1:3143] 0.29 0.28 0.4 0.36 0.3 0.44 0.37 0.34 0.35 0.33 ...
##  $ Flu Vaccinations                                 : num [1:3143] 0.37 0.42 0.35 0.28 0.36 0.28 0.29 0.34 0.41 0.43 ...
##  $ Long Commute - Driving Alone                     : num [1:3143] 0.43 0.38 0.37 0.54 0.61 0.45 0.35 0.31 0.31 0.46 ...
##  $ Access to Exercise Opportunities                 : num [1:3143] 0.54 0.63 0.55 0.43 0.41 0.42 0.43 0.67 0.64 0.38 ...
##  $ Children in Poverty                              : num [1:3143] 0.16 0.16 0.38 0.26 0.16 0.39 0.33 0.26 0.31 0.23 ...
##  $ Excessive Drinking                               : num [1:3143] 0.15 0.16 0.12 0.15 0.15 0.12 0.13 0.14 0.13 0.14 ...
##  $ Income Inequality                                : num [1:3143] 4.6 4.5 5.6 5.7 4.6 7.1 4.9 4.9 4.7 4.5 ...
##  $ Alcohol-Impaired Driving Deaths                  : num [1:3143] 0.29 0.36 0.31 0.15 0.13 0.27 0.28 0.1 0.25 0.11 ...
##  $ Sexually Transmitted Infections                  : num [1:3143] 577 318 729 663 318 ...
##  $ Children in Single-Parent Households             : num [1:3143] 0.23 0.19 0.51 0.31 0.23 0.63 0.39 0.33 0.25 0.29 ...
##  $ Social Associations                              : num [1:3143] 12.7 9.7 8.4 8.9 7.6 5.8 9 13.6 17.7 9.2 ...
##  $ Teen Births                                      : num [1:3143] 21 20 37 32 25 37 35 26 30 30 ...
##  $ Injury Deaths                                    : num [1:3143] 68 78 85 100 98 108 110 97 95 104 ...
# Display the first few rows
head(data4)
## # A tibble: 6 × 87
##   State   County       Population `% Below 18 Years of Age` `% 65 and Older`
##   <chr>   <chr>             <dbl>                     <dbl>            <dbl>
## 1 Alabama Autauga (AU)      59759                     0.232            0.164
## 2 Alabama Baldwin (BA)     246435                     0.208            0.219
## 3 Alabama Barbour (BR)      24706                     0.204            0.206
## 4 Alabama Bibb (BI)         22005                     0.197            0.178
## 5 Alabama Blount (BL)       59512                     0.227            0.191
## 6 Alabama Bullock (BU)      10202                     0.211            0.178
## # ℹ 82 more variables: `% Non-Hispanic Black` <dbl>,
## #   `% American Indian or Alaska Native` <dbl>, `% Asian` <dbl>,
## #   `% Native Hawaiian or Other Pacific Islander` <dbl>, `% Hispanic` <dbl>,
## #   `% Non-Hispanic White` <dbl>, `% Not Proficient in English` <dbl>,
## #   `% Female` <dbl>, `% Rural` <dbl>, `Life Expectancy` <dbl>,
## #   `Premature Age-Adjusted Mortality` <dbl>, `High School Graduation` <dbl>,
## #   `Child Mortality` <dbl>, `Disconnected Youth` <dbl>, …
library(GGally)
library(tidyverse)

# Select continuous variables
data4_4 <- tibble(data4) %>%
  dplyr::select('Physical Inactivity', 'Flu Vaccinations', 'Children in Single-Parent Households', 'Alcohol-Impaired Driving Deaths'
,'Access to Exercise Opportunities')

# Create pairs plot
ggpairs(data4_4)

# Calculate covariance and correlation matrices
cov_matrix <- cov(data4_4, use ="pairwise.complete.obs")
cor_matrix <- cor(data4_4, use ="pairwise.complete.obs")
print(cov_matrix)
##                                      Physical Inactivity Flu Vaccinations
## Physical Inactivity                         0.0027540618    -0.0021191955
## Flu Vaccinations                           -0.0021191955     0.0108929779
## Children in Single-Parent Households        0.0028176761    -0.0014539806
## Alcohol-Impaired Driving Deaths            -0.0007997593    -0.0004044585
## Access to Exercise Opportunities           -0.0049893453     0.0088773463
##                                      Children in Single-Parent Households
## Physical Inactivity                                           0.002817676
## Flu Vaccinations                                             -0.001453981
## Children in Single-Parent Households                          0.010551677
## Alcohol-Impaired Driving Deaths                              -0.000490900
## Access to Exercise Opportunities                             -0.001763820
##                                      Alcohol-Impaired Driving Deaths
## Physical Inactivity                                    -0.0007997593
## Flu Vaccinations                                       -0.0004044585
## Children in Single-Parent Households                   -0.0004909000
## Alcohol-Impaired Driving Deaths                         0.0231770075
## Access to Exercise Opportunities                        0.0009018042
##                                      Access to Exercise Opportunities
## Physical Inactivity                                     -0.0049893453
## Flu Vaccinations                                         0.0088773463
## Children in Single-Parent Households                    -0.0017638197
## Alcohol-Impaired Driving Deaths                          0.0009018042
## Access to Exercise Opportunities                         0.0532181286
print(cor_matrix)
##                                      Physical Inactivity Flu Vaccinations
## Physical Inactivity                            1.0000000      -0.38691267
## Flu Vaccinations                              -0.3869127       1.00000000
## Children in Single-Parent Households           0.5229527      -0.13580290
## Alcohol-Impaired Driving Deaths               -0.1000737      -0.02558866
## Access to Exercise Opportunities              -0.4138829       0.37165013
##                                      Children in Single-Parent Households
## Physical Inactivity                                            0.52295271
## Flu Vaccinations                                              -0.13580290
## Children in Single-Parent Households                           1.00000000
## Alcohol-Impaired Driving Deaths                               -0.03146420
## Access to Exercise Opportunities                              -0.07529047
##                                      Alcohol-Impaired Driving Deaths
## Physical Inactivity                                      -0.10007366
## Flu Vaccinations                                         -0.02558866
## Children in Single-Parent Households                     -0.03146420
## Alcohol-Impaired Driving Deaths                           1.00000000
## Access to Exercise Opportunities                          0.02636836
##                                      Access to Exercise Opportunities
## Physical Inactivity                                       -0.41388290
## Flu Vaccinations                                           0.37165013
## Children in Single-Parent Households                      -0.07529047
## Alcohol-Impaired Driving Deaths                            0.02636836
## Access to Exercise Opportunities                           1.00000000

My finding: According the pairplot, we can know the correlation by Upper triangle, and we can use the lower triangle to understand the distribution of these variables. The middle part shows the density curve, which helps us see if the data is randomly distributed.

c) Create a few 2D density plots of joint distributions of some selected variable pairs. Do they look normal to you? Give comments.

Anwser:

df <- data4_4 %>%
  dplyr::select(`Physical Inactivity`, `Flu Vaccinations`) %>%
  na.omit() #remove NA
ggplot(df, aes(x = `Physical Inactivity`, y = `Flu Vaccinations`)) +
  geom_density_2d_filled() +
  labs(title = "2D Density Plot - Physical Inactivity vs. Flu Vaccinations",
       x = "Physical Inactivity",
       y = "Flu Vaccinations") +
  theme_minimal() #let the plot more beautiful

#### According the plot, we can see that there is skewness, meanings it is not profect normal, but it is approximately normality

d) Select a pair of variables, use the Gaussian KDE method to approximate a joint distribution. Create a density plot.

Anwser:

#Select 'Physical Inactivity','Flu Vaccinations' variables and remove missing values
df_kde <- data4_4 %>%
  dplyr::select('Physical Inactivity', 'Flu Vaccinations' ) %>%
  na.omit()

#Extract x, y values
x <- df_kde$`Physical Inactivity`
y <- df_kde$`Flu Vaccinations`
#Plot the KDE image
kde <- kde2d(x, y, n = 500)
image(kde, col = viridis(50), xlab = 'Physical Inactivity', ylab = 'Flu Vaccinations', main = "Bivariate Gaussian KDE")
points(x, y, pch = ".",col = "white", alpha = 0.1)

e) Use the inverse sampling method to create new samples for all the five variables together (study how to do it, you can use generative AI for help). Then create scatterplots of new samples on top of old samples for a few variable pairs to see how good the sampling is.

data_full <- data4_4 %>%
  dplyr::select(`Physical Inactivity`, 
                `Flu Vaccinations`, 
                `Children in Single-Parent Households`, 
                `Alcohol-Impaired Driving Deaths`,
                `Access to Exercise Opportunities`) %>%
  na.omit()

n <- nrow(data_full)
#convert the data frame to a matric for prosessing
X <- as.matrix(data_full)

#computer pseudo-observations for each variable
U <- apply(X, 2, function(x)rank(x)/(n + 1))

#transform uniform valuse to standard normal values
Z <- qnorm(U)

#Estimate the correlation matric from the transformed data
R <- cor(Z)

#generate new multivariate normal samples with the estimated correlation matrix
Z_new <- mvrnorm(n = n, mu = rep(0, ncol(X)), Sigma = R)

#transform the new model samples back to uniform values
U_new <- pnorm(Z_new)

new_samples <- data.frame(matrix(NA, nrow = n, ncol = ncol(X)))
colnames(new_samples) <- colnames(data_full)

#obtain new samples
for (i in 1:ncol(X)) {
  new_samples[[i]] <- sapply(U_new[, i], function(p)
    quantile(X[, i], probs = p, type = 8))
}
head(new_samples)
##   Physical Inactivity Flu Vaccinations Children in Single-Parent Households
## 1                0.26             0.47                                 0.18
## 2                0.24             0.35                                 0.26
## 3                0.24             0.34                                 0.23
## 4                0.30             0.38                                 0.15
## 5                0.23             0.46                                 0.16
## 6                0.20             0.42                                 0.11
##   Alcohol-Impaired Driving Deaths Access to Exercise Opportunities
## 1                            0.32                             0.33
## 2                            0.30                             0.65
## 3                            0.23                             0.68
## 4                            0.39                             0.75
## 5                            0.10                             0.62
## 6                            0.32                             0.52
#create scatterplots comparing orignal samples with new samples, and we choise "Physical Inactivity" vs. "Flu Vaccinations"
df_plot1 <- data.frame(
  Old_PhysicalInactivity = data_full$`Physical Inactivity`,
  Old_FluVaccinations = data_full$`Flu Vaccinations`,
  New_PhysicalInactivity = new_samples$`Physical Inactivity`,
  New_FluVaccinations = new_samples$`Flu Vaccinations`
)
ggplot(df_plot1, aes(x = Old_PhysicalInactivity, y = Old_FluVaccinations))+
  geom_point(color = "black", alpha = 0.5)+
geom_point(aes(x= New_PhysicalInactivity, y = New_FluVaccinations), color = "red", alpha = 0.5)+
  labs(title = "Old vs. New Samples: Physical Inactivity vs. Flu Vaccinations",
       x = "Physical Inactivity",
       y = "Flu Vaccinations") +
  theme_minimal()

#### According the plot, we can see the black points and red points have same distribution, indicating we generate a good model