#Brief Introduction:
##The topic of my dataset is the top 5 counties in Maryland, Virginia, and the District of Columbia. The primary aim is to explore how population changes and housing characteristics relate to socioeconomic outcomes for the top 5 counties by population change in Maryland, Virginia, and the District of Columbia.
#Variable Definitions
##name: The name of the county or city.
##state: The state in which the county is located.
##pop2000, pop2010, pop2017: Population of the county in 2000, 2010, and 2017.
##pop_change: The percentage change in population from 2010 to 2017.
##poverty: The percentage of the population living below the poverty line.
##homeownership: The percentage of homes that are owner-occupied.
##multi_unit: The percentage of housing units in buildings with multiple units, reflecting housing density.
##unemployment_rate: The unemployment rate as a percentage, reflecting economic conditions.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(ggplot2)
county <- read_csv("county.csv")
## Rows: 3142 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (5): name, state, metro, median_edu, smoking_ban
## dbl (10): pop2000, pop2010, pop2017, pop_change, poverty, homeownership, mul...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#Loading dataset
glimpse(county)
## Rows: 3,142
## Columns: 15
## $ name              <chr> "Autauga County", "Baldwin County", "Barbour County"…
## $ state             <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Alabama…
## $ pop2000           <dbl> 43671, 140415, 29038, 20826, 51024, 11714, 21399, 11…
## $ pop2010           <dbl> 54571, 182265, 27457, 22915, 57322, 10914, 20947, 11…
## $ pop2017           <dbl> 55504, 212628, 25270, 22668, 58013, 10309, 19825, 11…
## $ pop_change        <dbl> 1.48, 9.19, -6.22, 0.73, 0.68, -2.28, -2.69, -1.51, …
## $ poverty           <dbl> 13.7, 11.8, 27.2, 15.2, 15.6, 28.5, 24.4, 18.6, 18.8…
## $ homeownership     <dbl> 77.5, 76.7, 68.0, 82.9, 82.0, 76.9, 69.0, 70.7, 71.4…
## $ multi_unit        <dbl> 7.2, 22.6, 11.1, 6.6, 3.7, 9.9, 13.7, 14.3, 8.7, 4.3…
## $ unemployment_rate <dbl> 3.86, 3.99, 5.90, 4.39, 4.02, 4.93, 5.49, 4.93, 4.08…
## $ metro             <chr> "yes", "yes", "no", "yes", "yes", "no", "no", "yes",…
## $ median_edu        <chr> "some_college", "some_college", "hs_diploma", "hs_di…
## $ per_capita_income <dbl> 27841.70, 27779.85, 17891.73, 20572.05, 21367.39, 15…
## $ median_hh_income  <dbl> 55317, 52562, 33368, 43404, 47412, 29655, 36326, 436…
## $ smoking_ban       <chr> "none", "none", "partial", "none", "none", "none", N…
#Small look at data
Filtered_State <-county %>%
 filter(state == 'Maryland' |state == 'District of Columbia' | state == "Virginia")

#Filtering to keep only three states: District of Columbia, Virginia, and Maryland. I used ChatGPT to figure out that `|` (logical OR) combines the conditions.
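The OR condition above can also be written with `%in%`, which is the form the later chunks of this analysis use for the same three states. A minimal sketch on a small made-up data frame (`toy` and its rows are hypothetical, not taken from `county.csv`) that compares the two forms:

```r
library(dplyr)

# Hypothetical toy data; these rows are illustrative, not from county.csv
toy <- tibble(
  name  = c("A County", "B County", "C County", "D County"),
  state = c("Maryland", "Texas", "Virginia", "District of Columbia")
)

# Form 1: chain equality tests with | (logical OR)
with_or <- toy %>%
  filter(state == "Maryland" | state == "District of Columbia" | state == "Virginia")

# Form 2: test membership in a vector of state names
with_in <- toy %>%
  filter(state %in% c("Maryland", "District of Columbia", "Virginia"))

# Both forms should keep the same rows
identical(with_or$name, with_in$name)
```

The `%in%` form tends to be easier to extend when more states are added, since only the vector changes.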
WEE <- Filtered_State %>%
  filter(pop2000 > 150000) %>% #Keep counties with a 2000 population greater than 150,000
  arrange(desc(pop2000), desc(pop_change)) #Sort by pop2000 in descending order; ties are broken by pop_change in descending order
Advancedd <- WEE %>%
  select(name, state, pop2000, pop2010, pop2017, pop_change, poverty,
         homeownership, multi_unit, unemployment_rate, metro, median_edu,
         per_capita_income, median_hh_income, smoking_ban)
#Saving a new dataset with the variables that I originally planned to use (this keeps all 15 columns).
FirstGraph <- Advancedd %>%
  filter(pop_change > 1.00) %>% ##Keep counties with pop_change greater than 1.00
  filter(poverty > 6.00) ##Keep counties with a poverty rate greater than 6.00
correlation_result <- cor(FirstGraph$pop_change, FirstGraph$poverty)
correlation_result
## [1] 0.5439203
#Finding correlation between pop_change and poverty rates.
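In a regression with a single predictor, the square of the Pearson correlation equals the model's Multiple R-squared, so the correlation above already hints at how much variation a simple model can explain. A minimal sketch, using made-up values (the two vectors below are hypothetical and are not the counties in `FirstGraph`):

```r
# Hypothetical values for illustration only; not the counties in FirstGraph
pop_change <- c(1.2, 2.5, 3.1, 4.8, 5.0, 6.3)
poverty    <- c(7.0, 6.5, 9.2, 8.1, 12.4, 11.0)

r   <- cor(pop_change, poverty)    # Pearson correlation
fit <- lm(poverty ~ pop_change)    # simple linear regression
r2  <- summary(fit)$r.squared      # Multiple R-squared from the model

# With one predictor, r squared and R-squared agree
all.equal(r^2, r2)

# The same relationship applied to the correlation reported above
round(0.5439203^2, 4)
```

Squaring the reported correlation of 0.5439 gives about 0.2958, so roughly 30% of the variation in poverty is shared with `pop_change` in this subset.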
Exempt1 <- lm(poverty ~ pop_change, data = FirstGraph)
summary(Exempt1)
## 
## Call:
## lm(formula = poverty ~ pop_change, data = FirstGraph)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.7527 -3.0265 -0.4563  1.9840 11.6288 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)    4.002      2.863   1.398   0.1875  
## pop_change     1.574      0.701   2.245   0.0444 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.599 on 12 degrees of freedom
## Multiple R-squared:  0.2958, Adjusted R-squared:  0.2372 
## F-statistic: 5.042 on 1 and 12 DF,  p-value: 0.04436
#First linear regression of poverty on pop_change. The slope is significant at the 0.05 level (p = 0.044), but the adjusted R-squared of 0.237 means the model explains little of the variation in poverty.
fortnite2.0 <- lm(poverty ~ pop_change + pop2000 + pop2010 + pop2017 + unemployment_rate, data = FirstGraph) #Multiple regression: adding pop2000, pop2010, pop2017, and unemployment_rate as independent variables. Adjusted R-squared rises to 0.424, but the overall p-value (0.086) and every coefficient p-value are above 0.05.
summary(fortnite2.0) #Summary
## 
## Call:
## lm(formula = poverty ~ pop_change + pop2000 + pop2010 + pop2017 + 
##     unemployment_rate, data = FirstGraph)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.7970 -2.1968 -0.5206  1.2559  7.7659 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)
## (Intercept)       -4.332e+00  8.773e+00  -0.494    0.635
## pop_change         2.276e+00  1.420e+00   1.603    0.148
## pop2000            6.443e-05  5.467e-05   1.179    0.272
## pop2010            4.338e-05  1.580e-04   0.274    0.791
## pop2017           -9.606e-05  1.233e-04  -0.779    0.458
## unemployment_rate  1.866e+00  1.850e+00   1.009    0.343
## 
## Residual standard error: 3.995 on 8 degrees of freedom
## Multiple R-squared:  0.6458, Adjusted R-squared:  0.4244 
## F-statistic: 2.917 on 5 and 8 DF,  p-value: 0.08637
fortnite3.0 <- lm(poverty ~ pop_change + pop2000 + pop2010 + pop2017 + unemployment_rate + homeownership + multi_unit + per_capita_income + median_hh_income, data = FirstGraph) #Adding homeownership, multi_unit, per_capita_income, and median_hh_income as new independent variables.
summary(fortnite3.0) #Summary
## 
## Call:
## lm(formula = poverty ~ pop_change + pop2000 + pop2010 + pop2017 + 
##     unemployment_rate + homeownership + multi_unit + per_capita_income + 
##     median_hh_income, data = FirstGraph)
## 
## Residuals:
##       1       2       3       4       5       6       7       8       9      10 
##  1.8587  0.5483 -0.9085 -0.8976 -0.1095 -2.1375 -0.5809 -1.6542  0.2736  2.0985 
##      11      12      13      14 
## -0.6132  1.8636  0.8393 -0.5806 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)  
## (Intercept)        1.004e+02  4.067e+01   2.468   0.0691 .
## pop_change        -1.285e+00  1.418e+00  -0.906   0.4160  
## pop2000            5.302e-05  7.206e-05   0.736   0.5026  
## pop2010           -2.423e-04  2.233e-04  -1.085   0.3390  
## pop2017            1.810e-04  1.597e-04   1.133   0.3205  
## unemployment_rate -4.684e-01  1.954e+00  -0.240   0.8224  
## homeownership     -8.916e-01  4.142e-01  -2.153   0.0977 .
## multi_unit        -5.906e-01  3.790e-01  -1.558   0.1941  
## per_capita_income  3.232e-05  3.082e-04   0.105   0.9215  
## median_hh_income  -1.013e-04  1.465e-04  -0.692   0.5270  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.367 on 4 degrees of freedom
## Multiple R-squared:  0.9378, Adjusted R-squared:  0.798 
## F-statistic: 6.705 on 9 and 4 DF,  p-value: 0.04128
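#Result: Stronger R-squared (adjusted 0.798) and an overall p-value of 0.041, but with 9 predictors and only 4 residual degrees of freedom, no individual predictor is significant at the 0.05 level. Not good enough.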
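Adjusted R-squared penalizes R-squared for the number of predictors, which is why it can lag well behind Multiple R-squared when a model has many predictors and few observations. A minimal sketch of the standard formula; the helper name `adj_r2` is hypothetical, and `n = 14` is inferred from the residual degrees of freedom in the summaries (for example, 12 residual df plus 1 predictor plus 1 intercept):

```r
# Adjusted R-squared from R-squared, sample size n, and number of predictors p
adj_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

# R-squared values copied from the model summaries; n = 14 counties
adj_r2(0.2958, n = 14, p = 1)  # one predictor
adj_r2(0.6458, n = 14, p = 5)  # five predictors
adj_r2(0.9378, n = 14, p = 9)  # nine predictors
```

These come out to about 0.237, 0.424, and 0.798, in line with the Adjusted R-squared values printed in the summaries (small differences come from rounding R-squared to four digits).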
fortnite4.0 <- lm(poverty ~ pop_change + pop2010 + pop2017 + homeownership + multi_unit, data = FirstGraph) #Removed independent variables: per_capita_income, median_hh_income, unemployment_rate, and pop2000
summary(fortnite4.0) #Results: The overall p-value is smaller (0.0037), and homeownership and multi_unit are now significant, but adjusted R-squared decreased to 0.758. Not good enough.
## 
## Call:
## lm(formula = poverty ~ pop_change + pop2010 + pop2017 + homeownership + 
##     multi_unit, data = FirstGraph)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4334 -0.6994  0.1005  1.2366  3.8432 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    1.036e+02  3.024e+01   3.424  0.00903 **
## pop_change    -6.798e-01  1.349e+00  -0.504  0.62779   
## pop2010       -6.961e-05  1.169e-04  -0.595  0.56809   
## pop2017        6.398e-05  1.083e-04   0.591  0.57098   
## homeownership -1.078e+00  2.959e-01  -3.642  0.00657 **
## multi_unit    -6.394e-01  2.289e-01  -2.794  0.02342 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.589 on 8 degrees of freedom
## Multiple R-squared:  0.8512, Adjusted R-squared:  0.7582 
## F-statistic: 9.153 on 5 and 8 DF,  p-value: 0.003658
top_counties <- FirstGraph %>%
  filter(state %in% c("Maryland", "Virginia", "District of Columbia")) %>%
  group_by(state) %>%
  arrange(desc(homeownership)) %>%
  slice_head(n = 10) 
 #Keeping the top 10 counties per state by homeownership, because homeownership has the lowest p-value among the predictors in the lm above.
Top_countieswithighestmultiunit <- top_counties %>%
  filter(state %in% c("Maryland", "Virginia", "District of Columbia")) %>%
  group_by(state) %>%
  arrange(desc(multi_unit)) %>%
  slice_head(n = 5 ) #Filtering for states, group by states, Keeping the top 5 multi_unit counties in Maryland, Virginia, District of Columbia
Topg <- Top_countieswithighestmultiunit %>%
  filter(state %in% c("Maryland", "Virginia", "District of Columbia")) %>%
  group_by(state) %>%
  arrange(desc(pop2000)) %>%
  slice_head(n = 4) 
#Now I want to keep the 4 most populous counties in each of Maryland, Virginia, and the District of Columbia, ranked by pop2000.
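The three chunks above share one pattern: group by state, sort, then keep the first few rows of each group. A minimal sketch of that pattern on made-up data (the `toy` tibble and its values are hypothetical, not rows from `county.csv`):

```r
library(dplyr)

# Hypothetical toy data; names and populations are made up for illustration
toy <- tibble(
  name    = c("A", "B", "C", "D", "E"),
  state   = c("Maryland", "Maryland", "Maryland", "Virginia", "Virginia"),
  pop2000 = c(300000, 500000, 200000, 400000, 250000)
)

top2 <- toy %>%
  group_by(state) %>%           # slice_head() below works within each state
  arrange(desc(pop2000)) %>%    # sort all rows, largest population first
  slice_head(n = 2) %>%         # keep the first 2 rows of every state
  ungroup()

top2
```

A state with fewer rows than `n` simply keeps all of its rows, which is worth remembering when a group has only one row.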
Topg2.0 <- Topg %>%
  select(name, state, homeownership, multi_unit, pop2010, pop2000, unemployment_rate, pop_change, poverty) #Creating a new dataset by selecting the variables to keep. This dataset will be used for the final linear model.
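fortnite5.0 <- lm(poverty ~ pop_change + pop2000 + pop2010 + homeownership + multi_unit + unemployment_rate, data = Topg2.0) #Independent variables: pop_change, pop2000, pop2010, homeownership, multi_unit, unemployment_rate. Dependent variable: poverty.
summary(fortnite5.0) #Result: Highest adjusted R-squared (0.9716) and an overall p-value of 0.021, but the model fits 6 predictors to only 9 counties (2 residual degrees of freedom) and no individual predictor is significant at 0.05.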
## 
## Call:
## lm(formula = poverty ~ pop_change + pop2000 + pop2010 + homeownership + 
##     multi_unit + unemployment_rate, data = Topg2.0)
## 
## Residuals:
##        1        2        3        4        5        6        7        8 
## -0.14942  0.39906 -0.32704  0.56294 -1.06842 -0.01156 -0.15022  0.68272 
##        9 
##  0.06194 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)  
## (Intercept)        9.784e+01  2.329e+01   4.202   0.0522 .
## pop_change         5.900e-01  4.435e-01   1.330   0.3149  
## pop2000           -1.003e-04  1.554e-04  -0.646   0.5847  
## pop2010            9.484e-05  1.488e-04   0.637   0.5891  
## homeownership     -1.164e+00  4.381e-01  -2.658   0.1172  
## multi_unit        -7.036e-01  4.910e-01  -1.433   0.2883  
## unemployment_rate  1.672e+00  4.152e+00   0.403   0.7262  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.058 on 2 degrees of freedom
## Multiple R-squared:  0.9929, Adjusted R-squared:  0.9716 
## F-statistic: 46.59 on 6 and 2 DF,  p-value: 0.02116
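#FINAL MODEL: y = mx + b, extended to several predictors
#Intercept (b): 97.84, the predicted poverty rate when every predictor equals 0
#Coefficient of pop_change: 0.590; each coefficient acts like m for its own predictor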
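The "y = mx + b" idea carries over to several predictors: the prediction is the intercept plus each coefficient times its predictor value. A rough sketch using the coefficients printed by `summary(fortnite5.0)`; the county described by `county_x` is hypothetical, and because the printed coefficients are rounded, the result is only approximate:

```r
# Coefficients copied (rounded) from the summary(fortnite5.0) output
coefs <- c(
  intercept         = 97.84,
  pop_change        = 0.5900,
  pop2000           = -1.003e-04,
  pop2010           = 9.484e-05,
  homeownership     = -1.164,
  multi_unit        = -0.7036,
  unemployment_rate = 1.672
)

# Hypothetical county; these predictor values are made up for illustration
county_x <- c(
  pop_change        = 3.0,
  pop2000           = 300000,
  pop2010           = 320000,
  homeownership     = 65,
  multi_unit        = 30,
  unemployment_rate = 4.0
)

# Predicted poverty = intercept + sum of (coefficient * predictor value)
predicted_poverty <- coefs[["intercept"]] + sum(coefs[names(county_x)] * county_x)
predicted_poverty
```

With the fitted model object available, `predict(fortnite5.0, newdata = ...)` would do the same calculation with unrounded coefficients.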
library(psych)
## 
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
pairs.panels(Topg2.0[3:8],  #Visualizing the correlations among the predictors used in linear model 5 (fortnite5.0)
             gap = 0,
             pch = 21,
             lm = TRUE)

MEE <- WEE %>%
  filter(state %in% c("Maryland", "Virginia", "District of Columbia")) %>%
  group_by(state) %>%
  arrange(desc(pop2017)) %>%
  slice_head(n = 8)
#I only want the top 8 counties with the highest population in pop2017 (taken within each state, because of group_by).
MEEE <- MEE %>%
  filter(pop_change > 0)
#Remove the counties with negative population change, so now it's the top 7 counties with the highest population.
ggplot(MEEE, aes(x = pop2010, y = pop2017, color = pop_change)) +
  geom_point(alpha = 0.8, size = 3) +  # Adjust size and transparency
  scale_color_gradient2(low = "blue", mid = "white", high = "red", midpoint = median(Advancedd$pop_change)) +
  theme_minimal() +  # Change theme
  labs(title = "Population Change from 2010 to 2017",
       x = "Population in 2010",
       y = "Population in 2017",
       color = "Population Change",
       caption = "Source: Census Quick Facts (no longer available as of 2020) and its accompanying pages. Includes the top 7 counties by population from Maryland and Virginia, plus DC.") + # Labels
  theme(plot.title = element_text(hjust = 0.5))

#Source/Help Used for GGplot: https://r-graph-gallery.com/ggplot2-package.html
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
plot1 <- ggplot(MEEE,
            aes(x = pop2000,
                y = pop2017,
                size = pop_change,
                text = paste("State:", state, "Name:", name))) +
  geom_point(alpha = 0.5, color = "black") +
  xlim(155000,970000) +
  ylim(250000,1150000) +
  labs(title = "Population in 2017 and Population Change for the Top 7 Most Populous Counties in the DMV",
       caption = "Source: Census Quick Facts (no longer available as of 2020) and its accompanying pages. Includes the top 7 counties by population from Maryland and Virginia, plus DC.",
       x = "Population in 2000",
       y = "Population in 2017",
       size = "Population Change") +
  theme_minimal(base_size = 10)
ggplotly(plot1)
#source/Help Used: Professor Saidi Walkthrough From Week 6

Summary Essay:

In my project, I explored demographic and economic trends in Maryland, Virginia, and the District of Columbia. By examining population changes and their relationship with socioeconomic factors such as poverty, I aimed to uncover patterns that could inform regional planning and policy. The data for this project was sourced from Census Quick Facts. I cleaned and prepared the data by filtering to include only the relevant counties with the highest populations. The main tool for my analysis was a series of scatter plots, which showed population changes from 2010 to 2017. These visualizations revealed that certain areas experienced population growth with increased poverty rates, while others experienced population growth with a stable poverty rate throughout.

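Higher homeownership rates did not necessarily correlate with lower poverty levels: in the final model, the homeownership coefficient was negative but not statistically significant (p = 0.117). Some areas with population growth did not see the expected decreases in poverty, suggesting other factors were at play.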

In my linear regression analysis, I used multiple regression to examine the factors affecting poverty rates across different regions. The adjusted R-squared value of 0.9716 suggests that the model explains approximately 97% of the variance in poverty rates. However, this high value could also suggest that the model is overfitting, especially since it fits six predictors to only nine counties and the p-values associated with the individual predictors do not indicate statistical significance. This means that, according to the data and the model used, changes in population, homeownership rates, housing unit types, and unemployment do not have statistically significant effects on poverty rates at the conventional threshold of 0.05 for p-values. Future studies might benefit from examining these relationships with a larger dataset or different model specifications to validate these findings.
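The overfitting concern can be illustrated with a small simulation. The sketch below is an assumption-laden illustration, not part of the original analysis: it fits pure random noise (the simulated data frame `sim` is hypothetical) with nine observations, the same count as the final model, and compares R-squared for one predictor against six.

```r
set.seed(1)  # arbitrary seed so the simulated numbers are reproducible

n <- 9  # same number of observations as the final model

# Pure noise: the outcome has no real relationship to any predictor
sim <- data.frame(y = rnorm(n), matrix(rnorm(n * 6), nrow = n))
names(sim) <- c("y", paste0("x", 1:6))

r2_one <- summary(lm(y ~ x1, data = sim))$r.squared  # one noise predictor
r2_six <- summary(lm(y ~ ., data = sim))$r.squared   # all six noise predictors

c(one_predictor = r2_one, six_predictors = r2_six)
```

Because R-squared can never decrease when predictors are added to the same data, a high value from a model with nearly as many predictors as observations says little on its own about how well the model would generalize.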