EXAM 2 - MATH 217

Author

Blossom Anyanwu

Libraries and Data Loading

# Load libraries and set the working directory
library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.2     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(sf)

Linking to GEOS 3.11.0, GDAL 3.5.3, PROJ 9.1.0; sf_use_s2() is TRUE

setwd("/Users/blossomanyanwu/Documents/MATH 217 HM")

# Load Data sets 
community<-read_csv("finalmerge.csv")

New names:
• `` -> `...1`
Rows: 1406 Columns: 16
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (9): State, Census Tract, copdrates, 95% Confidence Interval, Confidence...
dbl (6): ...1, StateFIPS, CensusTract, Year, Number, parkdistancepopulation
lgl (1): ...11
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

walkability<- read_csv("marylandwalk.csv")

New names:
• `` -> `...1`
Rows: 3926 Columns: 118
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr   (2): CSA_Name, CBSA_Name
dbl (116): ...1, OBJECTID, GEOID10, GEOID20, STATEFP, COUNTYFP, TRACTCE, BLK...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

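# Subset Montgomery County (MoCo). Its Maryland county FIPS code is 031;
# COUNTYFP was read as a number, so the leading zero is dropped.
moco_walkability <- walkability %>%
  filter(COUNTYFP == 31)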

# Remove the % signs from the percentage columns using gsub

community$oldhousing <- gsub("%", "", community$oldhousing)
community$copdrates <- gsub("%", "", community$copdrates)

# Convert the columns to numeric (entries that are not numbers become NA)

community$oldhousing <- as.numeric(community$oldhousing)
community$copdrates <- as.numeric(community$copdrates)

Warning: NAs introduced by coercion
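The coercion warning means some entries were not plain numbers once the % sign was removed. A minimal sketch of how those entries could be inspected, assuming a made-up vector in place of community$copdrates (the value "Suppressed" is hypothetical and only illustrates a non-numeric entry):

```r
# Hypothetical raw values standing in for a column such as community$copdrates
raw <- c("5.2%", "6.1%", "Suppressed", NA, "4.8%")

# Strip the percent sign, then convert; non-numeric entries become NA
cleaned <- suppressWarnings(as.numeric(gsub("%", "", raw, fixed = TRUE)))

# Entries that were present in the raw data but failed to convert
failed <- raw[is.na(cleaned) & !is.na(raw)]
print(table(failed))

# Total number of missing values after conversion
print(sum(is.na(cleaned)))
```

Tabulating the failed entries shows whether the NAs come from suppressed or placeholder values, which matters because lm() later drops those rows.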

1. For this project I am using two data sets. The first is a Walkability Index data set from the Environmental Protection Agency. The second covers community design and several public health outcomes. It comes from the CDC Public Health Tracking Network, and for the final product I merged three separate data sets because each one was so sparse.

# 2a: In the walkability data set, two variables are the walkability score (NatWalkInd) and R_MedWageWk, which represents the percent of a census tract earning more than 1250 USD but less than 3300 USD.

# R_MedWageWk
ggplot(walkability, aes(x = R_MedWageWk)) +
  geom_histogram(binwidth = 2, fill = "red", color = "black", alpha = 0.7) +
  labs(title = "Income Distribution (more than 1250 but less than 3300 USD)",
       y = "Frequency") +
  theme_minimal()

# Walkability Index Score
ggplot(walkability, aes(x = NatWalkInd)) +
  geom_histogram(binwidth = 2, fill = "red", color = "black", alpha = 0.7) +
  labs(title = "Walkability Score Distribution for Maryland",
       y = "Frequency") +
  theme_minimal()

# Kernel Density Plot
d <- density(walkability$NatWalkInd)
plot(d, main="Walkability Score Distribution")
polygon(d, col="red", border="blue")

# 3. Name 2 categorical variables. Define them. Create a bargraph for each of those 2 variables. Describe what the bargraphs show for each variable.

# My first data set has Census Tract and Census County Tract as categorical variables. However, every value of these variables is different, so a bar graph would not be a useful representation. In the walkability data set, the categorical variables whose values repeat across rows are COUNTYFP and CSA_Name. COUNTYFP identifies the county by its FIPS code and CSA_Name names the combined statistical area, so both serve as geographic identifiers.

data_freqs <- walkability %>%
  count(COUNTYFP)

# Treat the county code as a category, not a continuous number
ggplot(data_freqs, aes(x = factor(COUNTYFP), y = n)) +
  geom_bar(stat = "identity") + 
  labs(title = "Number of Observations Per County FP",
       x = "Counties",
       y = "Frequency") +
  theme_classic()
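Question 3 asks for a bar graph of each variable, so a second graph for CSA_Name could follow the same pattern. A minimal sketch, assuming a made-up toy data frame in place of walkability (the area names are hypothetical):

```r
library(dplyr)
library(ggplot2)

# Hypothetical stand-in for the walkability data; the CSA_Name values are made up
toy <- data.frame(
  CSA_Name = c("Area A", "Area A", "Area B", "Area A", "Area B")
)

# Count the rows in each category
csa_freqs <- toy %>%
  count(CSA_Name)

# One bar per category, with bar height equal to the count
p <- ggplot(csa_freqs, aes(x = CSA_Name, y = n)) +
  geom_col() +
  labs(title = "Number of Observations Per CSA",
       x = "Combined Statistical Area",
       y = "Frequency") +
  theme_classic()
print(p)
```

With the real data, replacing toy with walkability would give the second bar graph; long area names may need rotated axis labels to stay readable.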

4. The MoCo community data set and the walkability data set do not have over 500 observations. The original data sets do have over 500, but for the sake of my project I queried only values from the state of Maryland and Montgomery County. Because the smaller size comes from this filtering and not from how the data were collected, I do not believe there was bias in the data collection process.

# 5. Perform a χ² test on your two categorical variables. State the null and alternative hypotheses. Write the results of this chi-square test in terms of the chi-square value and the p-value. Write a conclusion in context of your variables.

# Chi-square test with chisq.test() function
chi_square_results <- chisq.test(table(walkability$COUNTYFP, walkability$TRACTCE))

Warning in chisq.test(table(walkability$COUNTYFP, walkability$TRACTCE)):
Chi-squared approximation may be incorrect

# Null Hypothesis (H0): There is no association between COUNTYFP and TRACTCE
# Alternative Hypothesis (Ha): There is an association between COUNTYFP and TRACTCE.

# Print chi-square statistic and p-value
cat("Chi-square statistic: ", chi_square_results$statistic, "\n")

Chi-square statistic:  81928.09

cat("p-value: ", chi_square_results$p.value, "\n")

p-value:  0

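# Interpretation: The chi-square p-value is less than 0.05 (it prints as 0 because it is too small to be represented), so we can reject the null hypothesis. There is statistically significant evidence of an association between COUNTYFP and TRACTCE, the two variables in the table passed to chisq.test(). R also warned that the chi-squared approximation may be incorrect, which typically happens when expected cell counts are small, so this result should be read with caution.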
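The warning from chisq.test() usually points to small expected counts in some cells. One possible way to check this and to get a p-value that does not rely on the large-sample approximation is sketched below, assuming simulated toy data (the variables county and group are hypothetical, not columns from walkability):

```r
set.seed(217)

# Hypothetical toy data: two categorical variables with some sparse cells
county <- sample(c("A", "B", "C"), size = 60, replace = TRUE)
group  <- sample(c("low", "mid", "high", "rare"), size = 60, replace = TRUE,
                 prob = c(0.4, 0.35, 0.2, 0.05))
tab <- table(county, group)

# Expected counts under independence; values below about 5 trigger the warning
expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)
print(round(expected, 1))

# Monte Carlo p-value based on B simulated tables with the same margins
res <- chisq.test(tab, simulate.p.value = TRUE, B = 2000)
print(res$statistic)
print(res$p.value)
```

The simulated p-value can never be exactly 0, so it also avoids the "p-value: 0" display. On a table as large as COUNTYFP by TRACTCE the simulation may be slow, so this is a sketch of the idea and not a tested recipe for that table.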

# 7. Perform a linear or multiple linear (or logistic regression) on variables in your dataset. Be sure to address the basic assumptions for your regression.

model3 <- lm(copdrates ~ oldhousing, data = community)  # y ~ x represents dependent variable ~ independent
summary(model3)


Call:
lm(formula = copdrates ~ oldhousing, data = community)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.9315 -1.2323 -0.3673  1.0400 10.1918 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 3.993742   0.126692   31.52   <2e-16 ***
oldhousing  0.024456   0.001935   12.64   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.844 on 1266 degrees of freedom
  (138 observations deleted due to missingness)
Multiple R-squared:  0.1121,    Adjusted R-squared:  0.1114 
F-statistic: 159.8 on 1 and 1266 DF,  p-value: < 2.2e-16
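Question 7 also asks about the basic regression assumptions (linearity, normal residuals, constant variance). A minimal sketch of the usual diagnostic checks, assuming simulated toy data in place of community (the numbers used to generate it are made up and only loosely mimic the fitted model):

```r
set.seed(217)

# Hypothetical toy data standing in for community$oldhousing and community$copdrates
toy <- data.frame(oldhousing = runif(200, 0, 100))
toy$copdrates <- 4 + 0.025 * toy$oldhousing + rnorm(200, sd = 1.8)

fit <- lm(copdrates ~ oldhousing, data = toy)

# Four standard diagnostic plots: residuals vs fitted (linearity),
# normal Q-Q (normality), scale-location (constant variance),
# and residuals vs leverage (influential points)
par(mfrow = c(2, 2))
plot(fit)
par(mfrow = c(1, 1))

# Shapiro-Wilk test as a numeric check on residual normality
sw <- shapiro.test(residuals(fit))
print(sw$p.value)
```

Running the same plot() call on model3 would show whether the residuals of the real model fan out or curve, which the summary table alone cannot reveal.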

# 8. Model Analysis: The coefficient of 0.024 indicates that each one-unit increase in oldhousing is associated with a 0.024 increase in copdrates. Both p-values are far below 0.05, suggesting a statistically significant relationship between housing age and COPD rates. However, the R-squared value is low: the model explains only about 11.2% of the variance in the dependent variable, which suggests that other variables may have more of an impact. I will therefore likely use a multiple regression model that includes more variables to improve the fit. Overall, these values indicate a positive but weak relationship between housing age and COPD rates in MD census tracts.
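To make the coefficient concrete, the fitted line can be evaluated by hand. A small worked example, using the estimates from the summary output and assuming a hypothetical tract with an oldhousing value of 50:

```r
# Coefficients copied from the summary(model3) output
intercept <- 3.993742
slope     <- 0.024456

# Predicted copdrates for a hypothetical tract with oldhousing = 50
pred_50 <- intercept + slope * 50
print(pred_50)  # 5.216542

# Predicted difference between oldhousing = 0 and oldhousing = 100
diff_0_100 <- slope * 100
print(diff_0_100)  # 2.4456
```

So across the full 0 to 100 range of oldhousing, the model predicts a change of only about 2.4 in copdrates, which fits the low R-squared.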

# 9. For this project I plan to do more hypothesis testing. I am fairly sure of the relationships my variables have with each other at this point, but I believe hypothesis testing will help me support these relationships further. To do this, I will try bootstrapping methods, drawing repeated samples from my data, with a specific focus on variables like COPD rates and the housing age percentage.
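One way the bootstrapping plan could look is a percentile confidence interval for the regression slope. This is a minimal sketch, assuming simulated toy data in place of community (the generating values are made up) and 1000 resamples as an arbitrary choice:

```r
set.seed(217)

# Hypothetical toy data standing in for community$oldhousing and community$copdrates
toy <- data.frame(oldhousing = runif(200, 0, 100))
toy$copdrates <- 4 + 0.025 * toy$oldhousing + rnorm(200, sd = 1.8)

n_boot <- 1000

# Resample rows with replacement, refit the model, and keep the slope each time
boot_slopes <- replicate(n_boot, {
  idx <- sample(nrow(toy), replace = TRUE)
  coef(lm(copdrates ~ oldhousing, data = toy[idx, ]))[["oldhousing"]]
})

# 95% percentile bootstrap confidence interval for the slope
ci <- quantile(boot_slopes, c(0.025, 0.975))
print(ci)
```

If the interval from the real data excludes 0, that would agree with the small p-value for oldhousing without leaning on the normality assumption.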