── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.2 ✔ readr 2.1.4
✔ forcats 1.0.0 ✔ stringr 1.5.0
✔ ggplot2 3.4.4 ✔ tibble 3.2.1
✔ lubridate 1.9.2 ✔ tidyr 1.3.0
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(sf)
Linking to GEOS 3.11.0, GDAL 3.5.3, PROJ 9.1.0; sf_use_s2() is TRUE
setwd("/Users/blossomanyanwu/Documents/MATH 217 HM")
# Load data sets
community <- read_csv("finalmerge.csv")
New names:
• `` -> `...1`
Rows: 1406 Columns: 16
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (9): State, Census Tract, copdrates, 95% Confidence Interval, Confidence...
dbl (6): ...1, StateFIPS, CensusTract, Year, Number, parkdistancepopulation
lgl (1): ...11
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
walkability <- read_csv("marylandwalk.csv")
New names:
• `` -> `...1`
Rows: 3926 Columns: 118
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): CSA_Name, CBSA_Name
dbl (116): ...1, OBJECTID, GEOID10, GEOID20, STATEFP, COUNTYFP, TRACTCE, BLK...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Subset MoCo
# Montgomery County, MD has county FIPS code 031 (read in as the number 31); 3 is Anne Arundel County
moco_walkability <- walkability %>% filter(COUNTYFP == 31)
# Remove the % signs from the columns that contain them using gsub
community$oldhousing <- gsub("%", "", community$oldhousing)
community$copdrates <- gsub("%", "", community$copdrates)
# Convert the columns to numeric (depending on later results)
community$oldhousing <- as.numeric(community$oldhousing)
community$copdrates <- as.numeric(community$copdrates)
Warning: NAs introduced by coercion
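The coercion warning means some entries were not plain numbers once the % sign was removed. A minimal sketch of how those entries could be inspected is shown here. The vector `raw` is a made-up stand-in for a column such as `community$oldhousing`; the actual problem values in the data are not known from the output above.

```r
# Toy stand-in for a percent column; these values are invented for illustration
raw <- c("12.5%", "40%", "Suppressed", "", "7.1%", NA)

# Strip the percent sign, then convert (warning silenced because it is expected here)
cleaned <- gsub("%", "", raw, fixed = TRUE)
numeric_vals <- suppressWarnings(as.numeric(cleaned))

# Entries that were present as text but could not be converted to a number
failed <- !is.na(raw) & is.na(numeric_vals)
print(unique(raw[failed]))  # the distinct text values that became NA
print(sum(failed))          # how many rows were affected
```

Running the same check on the real columns before overwriting them would show whether the NAs come from blanks, suppression flags, or something else.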
1. For this project I am using 2 data sets. The first is a Walkability Index data set from the Environmental Protection Agency. The second is a data set on community design and public health outcomes from the CDC Public Health Tracking Network. For the final product I merged 3 separate data sets from that source because each one was sparse on its own.
# 2a: In the Walkability data set, two variables are the walkability score (NatWalkInd)
# and R_MedWageWk, which counts the workers living in each area who earn between
# about 1250 and 3300 USD per month.
# R_MedWageWk
ggplot(walkability, aes(x = R_MedWageWk)) +
  geom_histogram(binwidth = 2, fill = "red", color = "black", alpha = 0.7) +
  labs(title = "Income Distribution (more than 1250 but less than 3300 USD)",
       y = "Frequency") +
  theme_minimal()
# Walkability Index Score
ggplot(walkability, aes(x = NatWalkInd)) +
  geom_histogram(binwidth = 2, fill = "red", color = "black", alpha = 0.7) +
  labs(title = "Walkability Score Distribution For Montgomery County",
       y = "Frequency") +
  theme_minimal()
# Kernel Density Plot
d <- density(walkability$NatWalkInd)
plot(d, main = "Walkability Score Distribution")
polygon(d, col = "red", border = "blue")
# 3. Name 2 categorical variables. Define them. Create a bargraph for each of those 2 variables.
# Describe what the bargraphs show for each variable.
# My 1st data set has Census Tract and Census County Tract as categorical variables. However,
# every value of these variables is distinct, so a bar graph would not be a useful representation.
# In the walkability data set, the categorical variables that vary across rows are COUNTYFP and
# CSA_Name. COUNTYFP is the county FIPS code and CSA_Name is the name of the Combined
# Statistical Area, so both act as geographic identifiers.
data_freqs <- walkability %>% count(COUNTYFP)
ggplot(data_freqs, aes(x = COUNTYFP, y = n)) +
  geom_bar(stat = "identity") +
  labs(title = "Number of Observations Per County FP",
       x = "Counties",
       y = "Frequency") +
  theme_classic()
4. The MoCo community data set and the walkability data set do not have over 500 observations. The original data sets have over 500; however, for the sake of my project I queried values only from the state of Maryland and Montgomery County. Because of this, I do not believe there was bias in the data collection process.
# 5. Perform a χ² test on your two categorical variables. State the null and alternative hypotheses.
# Write the results of this chi-square test in terms of the chi-square value and the p-value.
# Write a conclusion in context of your variables.
# Chi-square test with chisq.test() function
chi_square_results <- chisq.test(table(walkability$COUNTYFP, walkability$TRACTCE))
Warning in chisq.test(table(walkability$COUNTYFP, walkability$TRACTCE)):
Chi-squared approximation may be incorrect
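# The test above is run on COUNTYFP and TRACTCE, so the hypotheses are stated for those two variables.
# Null Hypothesis (H0): There is no association between COUNTYFP and TRACTCE.
# Alternative Hypothesis (Ha): There is an association between COUNTYFP and TRACTCE.
# Print chi-square statistic and p-value
cat("Chi-square statistic: ", chi_square_results$statistic, "\n")
cat("p-value: ", chi_square_results$p.value, "\n")
# Interpretation: The chi-square p-value is less than 0.05, so we can reject the null hypothesis.
# There is statistically significant evidence to suggest an association between COUNTYFP and TRACTCE.
# Because R warned that the chi-squared approximation may be incorrect, which usually points to
# small expected counts in some cells of the table, this result should be read with caution.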
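The warning about the chi-squared approximation can be checked directly by looking at the expected cell counts. The sketch below uses simulated data, not the walkability data, and shows one possible workflow: bin a numeric score into categories with `cut()`, inspect the expected counts, and fall back to a Monte Carlo p-value when they are small. The county labels, the break points, and the threshold of 5 are illustrative assumptions.

```r
set.seed(217)

# Hypothetical data: 300 areas, a county label, and a numeric score from 1 to 20
county <- sample(c("A", "B", "C"), 300, replace = TRUE)
walk_score <- runif(300, min = 1, max = 20)

# Turn the numeric score into a categorical variable (break points are arbitrary)
walk_class <- cut(walk_score,
                  breaks = c(1, 7, 14, 20),
                  labels = c("low", "medium", "high"),
                  include.lowest = TRUE)

tab <- table(county, walk_class)
res <- chisq.test(tab)

# Common rule of thumb: the approximation is doubtful if any expected count is below 5
min_expected <- min(res$expected)
if (min_expected < 5) {
  # Monte Carlo p-value, which does not rely on the large-sample approximation
  res <- chisq.test(tab, simulate.p.value = TRUE, B = 2000)
}

cat("Smallest expected count:", min_expected, "\n")
cat("Chi-square statistic:", res$statistic, "\n")
cat("p-value:", res$p.value, "\n")
```

A table of county by tract code has a very large number of cells with few observations in each, which is the typical situation in which this warning appears.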
# 7. Perform a linear or multiple linear (or logistic regression) on variables in your dataset.
# Be sure to address the basic assumptions for your regression.
model3 <- lm(copdrates ~ oldhousing, data = community) # y ~ x represents dependent variable ~ independent
summary(model3)
Call:
lm(formula = copdrates ~ oldhousing, data = community)

Residuals:
    Min      1Q  Median      3Q     Max
-4.9315 -1.2323 -0.3673  1.0400 10.1918

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.993742   0.126692   31.52   <2e-16 ***
oldhousing  0.024456   0.001935   12.64   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.844 on 1266 degrees of freedom
  (138 observations deleted due to missingness)
Multiple R-squared:  0.1121,  Adjusted R-squared:  0.1114
F-statistic: 159.8 on 1 and 1266 DF,  p-value: < 2.2e-16
# 8. Model Analysis: The coefficient of 0.024 indicates that each one-unit increase in the
# oldhousing variable is associated with an average increase of 0.024 in the COPD rates variable.
# Both p-values are much smaller than 0.05, suggesting a statistically significant relationship
# between housing age and COPD rates. However, the R-squared value is low: the model explains
# only about 11.2% of the variance in COPD rates, which suggests that other variables may have
# more of an impact. I will therefore likely fit a multiple regression model that includes more
# variables to improve the fit. Taken together, these values indicate a weak but statistically
# significant positive relationship between housing age and COPD rates in MD census tracts.
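Question 7 also asks about the basic regression assumptions. A minimal sketch of how they could be checked is given here, using synthetic data in place of `community`. The variable names mirror the report, but the values, sample size, and noise level are invented, so the output says nothing about the real model.

```r
set.seed(217)

# Synthetic stand-in data; values are made up for illustration only
oldhousing <- runif(200, min = 0, max = 100)
copdrates <- 4 + 0.025 * oldhousing + rnorm(200, sd = 1.8)
toy <- data.frame(oldhousing, copdrates)

fit <- lm(copdrates ~ oldhousing, data = toy)

# Standard diagnostic plots for an lm fit:
#   residuals vs fitted  -> linearity
#   normal Q-Q           -> normality of residuals
#   scale-location       -> constant variance
#   residuals vs leverage -> influential points
par(mfrow = c(2, 2))
plot(fit)
par(mfrow = c(1, 1))

# Shapiro-Wilk test as a rough numerical check on residual normality
shapiro_res <- shapiro.test(residuals(fit))
cat("Shapiro-Wilk p-value:", shapiro_res$p.value, "\n")
```

Applied to `model3`, the same plots would show whether the large positive residuals in the summary (maximum 10.19 against a minimum of -4.93) point to skewed errors or outlying tracts.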
# 9. For this project I plan to do more hypothesis testing. I am fairly sure of the relationships
# my variables have with each other at this point, but I believe hypothesis testing will give
# further evidence for them. To do this I will try bootstrapping methods, drawing repeated samples
# from my data, with a specific focus on variables like COPD rates and the housing age % data.
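One way the bootstrapping plan could look is a percentile bootstrap for the regression slope. This is a sketch on synthetic data, not the project's analysis: the data frame `toy`, the 1000 resamples, and the 95% level are all assumptions chosen for illustration.

```r
set.seed(217)

# Synthetic stand-in for the community data; values are invented
n <- 200
oldhousing <- runif(n, min = 0, max = 100)
copdrates <- 4 + 0.025 * oldhousing + rnorm(n, sd = 1.8)
toy <- data.frame(oldhousing, copdrates)

# Resample rows with replacement and refit the model each time
n_boot <- 1000
boot_slopes <- replicate(n_boot, {
  idx <- sample(nrow(toy), replace = TRUE)
  coef(lm(copdrates ~ oldhousing, data = toy[idx, ]))[["oldhousing"]]
})

# Percentile interval: the middle 95% of the bootstrap slopes
ci <- quantile(boot_slopes, probs = c(0.025, 0.975))
print(ci)
```

If the interval computed on the real data excludes zero, that would agree with the small p-value for oldhousing in the regression summary without relying on the normal-error assumption.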