#knitr::opts_chunk$set(echo=TRUE, warning = FALSE, message = FALSE, results = 'hide', include=TRUE, fig.keep = 'all')
## SET UP ##
#Setup for our analysis, including necessary packages.
setwd("~/Documents/UPenn/Penn 2022 Spring/Planning by Numbers/Assignment 2")
library(tidyverse)
library(tidycensus)
library(ggplot2)
library(dplyr)
library(data.table)
require("ggrepel")
library(hrbrthemes)
library(viridis)
library(ggridges)
library(lubridate)
library(tigris)
library(gganimate)
library(riem)
library(gridExtra)
library(knitr)
library(kableExtra)
library(mapview)
library(ggcorrplot)
library(RColorBrewer)
library(stargazer)
library(correlationfunnel)
library(tidyquant)
library(plotly)
library(shiny)
library(bslib)
library(shinydashboard)
library(DT)
library(car)
library(AICcmodavg)
options(scipen=999)
dat <- read.csv("cwns2012.csv", header = TRUE)
TO: Mayor Kenney
FROM: Riddhi Batra & Emily Goldstein
DATE: 3 March 2022
RE: Water Infrastructure Needs in Philadelphia
Dear Mayor Kenney,
Since former Mayor Michael Nutter’s introduction of the Greenworks Philadelphia plan in 2009, Philadelphia has been on track to perform major improvements to its waste and stormwater infrastructure with the goal of reducing the overall burden on Philly’s aging water infrastructure and saving the City critical dollars in future facility repairs and improvements. In an effort to understand the potential scale and impact of these improvements, we have performed several comparative statistical analyses across geographic regions, states, and counties, studying their relationship to Philadelphia County, and identifying critical metrics impacting Philly’s potential infrastructure need costs.1 We have outlined our key findings below and provided a more detailed and technical evaluation in our attached analysis.
Philadelphia’s water infrastructure needs are closely related to the county’s population and population density.
We identified positive relationships between Philadelphia’s population and population density and its share of infrastructure needs. This finding is intuitive and unsurprising. Philadelphia County alone accounts for almost 20% of Pennsylvania’s official needs and Philadelphia County contains the highest population density of all counties in PA (and growing– between 2000 – 2010 by 0.56%). Additionally, over 24% of the residential population receiving treatment in Pennsylvania belong to Philadelphia County. We also evaluated the relationship between the percentage of white identifying populations and median income, finding these metrics to have little to no association with infrastructure needs.
The City of Philadelphia’s total need accounts for just over 3% of the four states within the Delaware River Watershed, and its mean population density falls within the third quartile.
Within the Delaware River Watershed (DRW), a critical source for Philadelphia’s drinking water, the City of Philadelphia accounted for over 3% of the four states, including New York, New Jersey, and Delaware, also within the DRW. Like Philadelphia’s high need percentage in PA, it also has a comparatively high need relative to other related regions. While Philadelphia is dominant in its percentage share of need in PA, the state of PA itself falls near the middle range of need at number 28 when compared to mean need to all 49 states. However, in terms of median need, PA falls at number 18 within the third quartile of all states—a discrepancy which can be explained by the significant differences in min and max need values within PA that contribute towards skewing the state’s mean need.
Our analysis showed that every unit increase in residences receiving treatment is accompanied by an almost $700 increase in total need for Philadelphia.
Within the population of Philadelphia only about 62% of the residential and non-residential population is receiving water treatment. We anticipate this percentage share of the population using city water collection and treatment to continue to grow as it has over the past 10 years of our study period. As the number of Philly’s water infrastructure users increases, so will the cost burden experienced by the City.
We also recommend that particular attention be paid to upgrading the quality and capacity of CSS (Combined Sewer System) infrastructure in Philadelphia through adequate budgetary allocations and consistent quality evaluations.
The presence of a CSS facility also impacts Philadelphia’s water infrastructure requirements. Our enclosed regression analysis found that the increase in official need per unit increase in population density is 132% higher for facilities with a CSS than those without. This could be attributed to the high maintenance costs associated with Combined Sewer Systems.
A strategy taking into consideration the steady increase in the key metrics of population and population density, along with astute attention to Philadelphia’s aging CSS system, should support the City in its efforts to improve its water infrastructure and lower its operational costs.
Sincerely,
Riddhi Batra & Emily Goldstein
Future city planners aspiring to be semi-competent in R.
ADDENDUM: A refinement of our study on factors impacting Philadelphia’s water infrastructure.
The most significant predictors of future water infrastructure needs are the total count of residents receiving water treatment, total population and population density, residential burden per facility, and the total land area covered per facility.
Through a series of further tests evaluating various factors that impact water infrastructure, we identified these six indicators to be the most statistically significant predictors of water infrastructure need. We used statistical analysis by way of multivariate regression to model and identify the factors most impacting water infrastructure need. We built 6 models to test a total of 7 different factors which, during our previous studies, had displayed varying correlations with “need”. The variables exhibiting medium to high correlation were the total residents receiving water treatment, population density, and residential burden of need. Those exhibiting little to no correlation included absolute population counts, median income, land area, and percent of white population.
Residential burden has the greatest monetary impact on future water infrastructure needs.
Through our model, we determined that our statistically significant factors have the following predicted costs associated with need.
A unit increase in residential burden results in an increase of $78,500 in total need
A unit increase in population density results in an increase of $22,390 in total need
A unit increase in land area results in an increase of $1,926 in total need
A unit increase in residences receiving water treatment will result in an increase of $565 in total need
A unit increase in absolute population count results in a decrease of $20 in total official need
Future infrastructure budgeting and policy should consider these factors and their associated costs. Due to the aggressive and forward-thinking policies of the previous administration, Philadelphia is in an advantageous position to be a leader in efficient and environmentally sensitive wastewater management. Key to achieving this will be strategic investments and planning for future infrastructure needs. We believe the factors identified within this memo will be critical to ensuring Philadelphia is able to achieve its Greenworks policy goals.
Your gals, R & E.
Created a new data frame including change and percent change for all metrics.
dat1 <- select(dat, 1:18, 21:26, 30:31, 35, 37:38, 40)%>%
mutate(CHANGE_RES_REC_COLLCTN = PROJ_RES_REC_COLLCTN - PRES_RES_REC_COLLCTN)%>%
mutate(PCT_CHANGE_RES_REC_COLLCTN = (CHANGE_RES_REC_COLLCTN/PRES_RES_REC_COLLCTN)*100)%>%
mutate(CHANGE_RES_REC_TRMT = PROJ_RES_REC_TRMT - PRES_RES_REC_TRMT)%>%
mutate(PCT_CHANGE_RES_REC_TRMT = (CHANGE_RES_REC_TRMT/PRES_RES_REC_TRMT)*100)%>%
mutate(CHANGE_N_RES_REC_TRMT = PROJ_N_RES_REC_TRMT - PRES_N_RES_REC_TRTM)%>%
mutate(PCT_CHANGE_N_RES_REC_TRMT = (CHANGE_N_RES_REC_TRMT/PRES_N_RES_REC_TRTM)*100)%>%
mutate(CHANGE_N_RES_REC_COLLCTN = PROJ_N_RES_REC_COLLCTN - PRES_N_RES_REC_COLLCTN)%>%
mutate(PCT_CHANGE_N_RES_REC_COLLCTN = (CHANGE_N_RES_REC_COLLCTN/PRES_N_RES_REC_COLLCTN)*100)%>%
mutate(ALAND_SQMILE = ALAND/2.59e+6)%>%
mutate(POP_DENSITY_10 = POP10/ALAND_SQMILE)%>%
mutate(POP_DENSITY_00 = POP00/ALAND_SQMILE)%>%
mutate(CHANGE_POP_DENSITY = POP_DENSITY_10 - POP_DENSITY_00)%>%
mutate(PCT_CHANGE_POP_DENSITY = (CHANGE_POP_DENSITY/POP_DENSITY_00)*100)%>%
mutate(MEDINC99_INFL = MEDINC99*1.29)%>%
mutate(CHANGE_MEDINC = MEDINC09 - MEDINC99_INFL)%>%
mutate(PCT_CHANGE_MEDINC = (CHANGE_MEDINC/MEDINC99_INFL)*100)
Summary of Total Reported Need
The facility with the highest reported need is CWNS NUMBER 36002001005, in Queens County, NY, EPA Region 2.
summary(dat1$TOTAL_OFFICIAL_NEED)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 704599 2409822 21623215 7660571 5160108526
Population Density
Summary of Change in Population Density
summary(dat1$PCT_CHANGE_POP_DENSITY)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -46.605 -0.640 4.269 7.341 11.730 110.355
Density Plot 1
dat1%>%
filter(POP_DENSITY_10 < 5000)%>%
ggplot(aes(x=POP_DENSITY_10)) +
geom_density(fill="orchid", color="orchid3", alpha=0.8)
Density Plot 2
dat1%>%
filter(POP_DENSITY_10 >= 5000)%>%
ggplot(aes(x=POP_DENSITY_10)) +
geom_density(fill="orchid", color="orchid3", alpha=0.8)
Median Income
Change in Median Income
summary(dat1$CHANGE_MEDINC)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -15488.5 -3924.0 -1614.8 -1530.3 706.9 25537.0
Percent Change in Median Income
summary(dat1$PCT_CHANGE_MEDINC)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -36.025 -6.512 -2.663 -2.447 1.244 49.526
Median Income Plotted
ggplot(dat1, aes(x=MEDINC09)) +
geom_density(fill="orchid", color="orchid3", alpha=0.8)
Residential Collection
Change in Collection - Residential
summary(dat1$CHANGE_RES_REC_COLLCTN)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -88018.0 19.0 275.5 4636.7 1800.0 1300000.0
Percent Change in Collection - Residential
summary(dat1$PCT_CHANGE_RES_REC_COLLCTN)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -100.000 1.745 12.500 Inf 35.780 Inf 23
Non-Residential Collection
Change in Collection - Non-Residential
summary(dat1$CHANGE_N_RES_REC_COLLCTN)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -254000.0 0.0 0.0 376.8 0.0 354330.0
Percent Change in Collection - Non-Residential
summary(dat1$PCT_CHANGE_N_RES_REC_COLLCTN)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -100.00 0.00 23.46 Inf 100.00 Inf 7217
Residential Treatment
Change in Treatment - Residential
summary(dat1$CHANGE_RES_REC_TRMT)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -187386 31 326 5477 2096 1300028
Percent Change in Treatment - Residential
summary(dat1$PCT_CHANGE_RES_REC_TRMT)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -100.000 2.302 13.636 Inf 37.883 Inf 18
Non-Residential Treatment
Change in Treatment - Non-Residential
summary(dat1$CHANGE_N_RES_REC_TRMT)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -254000.0 0.0 0.0 437.9 0.0 354330.0
Percent Change in Treatment - Non-Residential
summary(dat1$PCT_CHANGE_N_RES_REC_TRMT)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -100.00 0.00 24.62 Inf 100.00 Inf 7102
Top Ten Facilities with the Most Need
dat_top10 <- dat1%>%
slice_max(TOTAL_OFFICIAL_NEED, n = 10)
DT::datatable(
dat_top10,
height = 150,
options = list(scrollX = TRUE)
)
Summary Stats by Need of EPA Regions
We charted the correlation between population density and official need for different groupings of the data set.
Of the 10 EPAs represented in the CWNS data set:
– EPA 9 has a mid-range population density with the highest official need.
– EPA 2 has the highest population density with a high official need.
– EPA 8 has the lowest population density and the lowest reported need.
– EPA 3, which contains Philadelphia, has a relatively lower population density and official need.
These findings are consistent with the presence of large urban centers in EPA 9 and EPA 2 – situated on the West and East coasts respectively, both regions contain cities and counties with high populations and water requirements – such as Los Angeles, San Franciso, New York, Queens, and Hudson.
dat_epa <- group_by(dat1, EPA_REGION)%>%
summarise(count = n(),
mean_need = mean(TOTAL_OFFICIAL_NEED),
median_need = median(TOTAL_OFFICIAL_NEED),
mean_popdensity00 = mean(POP_DENSITY_00),
mean_popdensity10 = mean(POP_DENSITY_10),
median_popdensity00 = median(POP_DENSITY_00),
median_popdensity10 = median(POP_DENSITY_10))
summary(dat_epa$median_need)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 946351 1896198 3022067 4276328 3977127 17740293
Summary Stats by Density of EPA Regions
summary(dat_epa$mean_popdensity00)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 105.6 179.5 411.7 568.6 596.5 2208.0
Top Ten by EPA Region
dat_top10epa <- dat1%>%
group_by(EPA_REGION)%>%
slice_max(TOTAL_OFFICIAL_NEED, n=10)
DT::datatable(
dat_top10epa,
height = 100,
options = list(scrollX = TRUE)
)
#CHUNK FOR BREAKING UP TABLES
Bottom Ten by EPA Region
dat_bottom10epa <- dat1%>%
group_by(EPA_REGION)%>%
slice_min(TOTAL_OFFICIAL_NEED, n=10)
DT::datatable(
dat_bottom10epa,
height = 100,
options = list(scrollX = TRUE)
)
Plot: Population Density vs. Total Need by EPA
ggplot(dat_epa, aes(x = mean_popdensity10,
y = mean_need)) +
geom_point(size=5, shape=20, color="darkorchid4") +
geom_text_repel(aes(label = rownames(dat_epa), size = 4)) +
theme(legend.position = "bottom") +
ggtitle("Population Density vs. Total Need, by EPA") +
xlab("Mean Population Density (2010)") + ylab("Mean Official Need")
When grouped by “State”. we can observe population versus official need in a manner consistent with our EPA grouping, in addition to an unexpected cameo:
– California, falling under EPA 9, has mid-range population density with a high need.
– New Jersey has the highest population density with a mid-range need, and New York has a high population density with a high need; both belonging to EPA 2.
– Pennsylvania has a mid-range population density with a low official need.
Surprisingly, the highest need of all states is shouldered by Hawaii, despite a low population density. This could be attributed to the fact that Hawaii’s current water infrastructure is outdated1 and requires extra attention due to its geographic susceptibility to climate change. 2
Top Ten by State
dat_state <- group_by(dat1, STATEFP, STATE)%>%
summarise(count = n(),
mean_need = mean(TOTAL_OFFICIAL_NEED),
median_need = median(TOTAL_OFFICIAL_NEED),
mean_popdensity00 = mean(POP_DENSITY_00),
mean_popdensity10 = mean(POP_DENSITY_10),
median_popdensity00 = median(POP_DENSITY_00),
median_popdensity10 = median(POP_DENSITY_10))%>%
filter(count > 2)
dat_top10state <- dat1%>%
group_by(STATEFP)%>%
slice_max(TOTAL_OFFICIAL_NEED, n = 10)
DT::datatable(
dat_top10state,
height = 100,
options = list(scrollX = TRUE)
)
#CHUNK FOR BREAKING UP TABLES
Bottom Ten by State
dat_bottom10state <- dat1%>%
group_by(STATEFP)%>%
slice_min(TOTAL_OFFICIAL_NEED, n = 10)
DT::datatable(
dat_bottom10state,
height = 100,
options = list(scrollX = TRUE)
)
Population Density vs. Total Need by State
ggplot(dat_state, aes(x = mean_popdensity10,
y = mean_need)) +
geom_point(size=4, shape=20, color="springgreen4") +
geom_text_repel(aes(label = STATE, size = 2)) +
theme(legend.position = "bottom") +
ggtitle("Population Density vs. Total Need, by State") +
xlab("Mean Population Density (2010)") + ylab("Mean Official Need")
These regions were selected for additional study as they contain states that fall within the Delaware River Watershed.
dat_epa_2and3 <- filter(dat1, EPA_REGION == 2 | EPA_REGION == 3)
In order to begin contextualizing Philadelphia’s water needs, we decided to focus our statistical analysis on the four states falling under the Delaware River Watershed (12,800 sq miles). While Philadelphia’s water supply is split between two sources – the Schyulkill and Delaware rivers – we’ve chosen to compare infrastructure and resources to its northeastern neighbors, in particular the states of New York, New Jersey, and Delaware. As they are geographically related and rely on the Delaware River Watershed for much of their potable water needs, these states are a logical point of reference. Within the EPA’s regional, large aquatic ecosystems (LAE) classification system, these states fall within regions 2 (NY, NJ) and 3 (PA, DE).
dat_PA <- filter(dat1, STATEFP == 42)
dat_NJ <- filter(dat1, STATEFP == 34)
dat_NY <- filter(dat1, STATEFP == 9)
dat_DE <- filter(dat1, STATEFP == 10)
dat_focus_states <- filter(dat1, STATE %in% c("PA", "NJ", "NY", "DE"))
Summary Statistics
Of the four states sharing the Delaware River Watershed:
– The highest need is for the state of New York, which also has a relatively high population density.
– Delaware has the lowest need and population statistics.
– Seen in conjunction with New York and New Jersey, Pennsylvania’s population density and total need is much lower.
Total Official Need
summary(dat_focus_states$TOTAL_OFFICIAL_NEED)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 1067266 3371533 37872196 8494217 5160108526
Median Income
summary(dat_focus_states$CHANGE_MEDINC)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -7174.6 -1722.5 -547.4 -494.3 1052.2 14726.6
Population Density
summary(dat_focus_states$CHANGE_POP_DENSITY)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -79.88 -1.52 11.16 34.49 45.69 2148.63
Residential Collection
summary(dat_focus_states$CHANGE_RES_REC_COLLCTN)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -88018.0 0.0 100.0 841.1 748.5 35408.0
Residential Treatment
summary(dat_focus_states$CHANGE_RES_REC_TRMT)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -111031.0 0.0 180.5 1184.9 1047.8 54959.0
Box Plots: Official Need by Focus States
We constructed box plots to understand the distribution of total need within each of these four states. Facilities with outlier values in New York, New Jersey, and Pennsylvania skewed the representation of our box plots.
New York’s need is skewing comparative visualization– it is literally off the chart!
ggplot(dat_focus_states,
aes(x=STATE, y=TOTAL_OFFICIAL_NEED)) +
geom_boxplot(fill="seagreen3", alpha=0.2) +
geom_jitter(color="gray50", size=0.4, alpha=0.9) +
ggtitle("Water Infrastructure Need: Delaware River Watershed") +
xlab("STATE") + ylab("Official Need")
##NY need is skewing comparative visualization - literally off the charts!
This calls for a log transformation of the data.
We decided to log transform the “total official need” variables to fix our box plots. The dots behind each box plot represent need values for each state’s facility, which gave us a way to visualize the spread of data.
– As surmised above, the largest spread in values can be seen in New Jersey, New York, and Pennsylvania.
– Median values of total need are about the same for all states, with a slight upward skew in the range of data for Delaware.
ggplot(dat_focus_states,
aes(x=STATE, y=TOTAL_OFFICIAL_NEED)) +
scale_y_log10() +
geom_boxplot(fill="seagreen3", alpha=0.2) +
geom_jitter(color="gray50", size=0.4, alpha=0.9) +
ggtitle("Water Infrastructure Need: Delaware River Watershed") +
xlab("STATE") + ylab("Official Need (Log Transformed)")
Summary of Percent Change Residential Collection for Focus States
summary(dat_focus_states$PCT_CHANGE_RES_REC_COLLCTN)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -79.095 0.000 4.349 Inf 14.333 Inf 2
Change in Residential Population Receiving Collection
ggplot(dat_focus_states,
aes(x=STATE, y=PCT_CHANGE_RES_REC_COLLCTN)) +
geom_boxplot(fill="seagreen4", alpha=0.2) +
xlab("STATE") + ylab("Change") +
ylim(-75,75)
dat_focus_states_summary <- filter(dat1, STATE %in% c("PA", "NJ", "NY", "DE"))%>%
group_by(STATE)%>%
summarise(count = n(),
mean_need = mean(TOTAL_OFFICIAL_NEED),
median_need = median(TOTAL_OFFICIAL_NEED),
mean_popdensity00 = mean(POP_DENSITY_00),
mean_popdensity10 = mean(POP_DENSITY_10),
median_popdensity00 = median(POP_DENSITY_00),
median_popdensity10 = median(POP_DENSITY_10))
Population Density vs. Total Need by Focus State
ggplot(dat_focus_states_summary, aes(x = mean_popdensity10,
y = mean_need)) +
geom_point(size=5, shape=20, color="olivedrab4") +
geom_text_repel(aes(label = STATE, size=3)) +
theme(legend.position = "bottom") +
ggtitle("Water Infrastructure Need: Delaware River Watershed") +
xlab("Mean Population Density (2010)") + ylab("Mean Official Need")
Within Pennsylvania, Philadelphia County unsurprisingly reported the highest total need with the highest population density within the state of Pennsylvania. While all other counties were clustered near the graph origin with low population density and low need, Delaware and Lackawanna Counties exhibited some deviation from the origin yet remain far from Philadelphia Counties statistics.
dat_PA_counties <- group_by(dat_PA, COUNTYNAME)%>%
summarise(count = n(),
mean_need = mean(TOTAL_OFFICIAL_NEED),
median_need = median(TOTAL_OFFICIAL_NEED),
mean_popdensity00 = mean(POP_DENSITY_00),
mean_popdensity10 = mean(POP_DENSITY_10),
median_popdensity00 = median(POP_DENSITY_00),
median_popdensity10 = median(POP_DENSITY_10))
Population Density vs. Official Need
Even when compared to the top 10 counties exhibiting high need within Pennsylvania, Philadelphia’s needs stack up much higher.
ggplot(dat_PA_counties, aes(x = mean_popdensity10,
y = mean_need)) +
geom_point(size=5, shape=20, color="darkgoldenrod1") +
geom_text_repel(aes(label = COUNTYNAME, size=3)) +
theme(legend.position = "bottom") +
ggtitle("Water Infrastructure Need: Pennsylvania Counties") +
xlab("Mean Population Density (2010)") + ylab("Mean Official Need")
Certain counties are standing out a lot…
datPA <- filter(dat1, STATE == "PA")%>%
top_n(10, TOTAL_OFFICIAL_NEED)
Official Need for Top 10 Counties
This isn’t very surprising, given the high population density of Philadelphia County (11373 in 2010) as compared to other Pennsylvania counties. The non-normal distribution of the following density plot helps visualize this stark difference.
ggplot(datPA, aes(x=as.factor(COUNTYNAME), fill=as.factor(TOTAL_OFFICIAL_NEED))) +
geom_bar(color="darkgoldenrod3", fill="darkgoldenrod1", alpha=0.8) +
xlab ("County") + ylab ("Total Official Need")
The following chart representing absolute values of need shows the comparatively high need of Philadelphia County with respect to Delaware and Lackawanna.
dat1%>%
filter(COUNTYNAME %in% c("Lackawanna", "Philadelphia", "Delaware"))%>%
ggplot(aes(x=COUNTYNAME, y=TOTAL_OFFICIAL_NEED)) +
geom_boxplot(fill="darkgoldenrod1", alpha=0.2) +
geom_jitter(color="gray50", size=0.4, alpha=0.9) +
ggtitle("Water Infrastructure Need: Pennsylvania Counties") +
xlab("STATE") + ylab("Official Need")
Log Transform
On log-transforming the chart variables, the extent of Philadelphia County’s need, relative to other PA counties can be clearly noticed in the differences in median values and the spread of data. This clearly demonstrates the extent to which Philadelphia County alone contributes to Pennsylvania’s water infrastructure needs.
#Log Transform
dat1%>%
filter(COUNTYNAME %in% c("Lackawanna", "Philadelphia", "Delaware"))%>%
ggplot(aes(x=COUNTYNAME, y=TOTAL_OFFICIAL_NEED)) +
geom_boxplot(fill="darkgoldenrod1", alpha=0.2) +
scale_y_log10() +
geom_jitter(color="gray50", size=0.4, alpha=0.9) +
ggtitle("Water Infrastructure Need: Pennsylvania Counties") +
xlab("STATE") + ylab("Official Need (Log Transformed)")
#DensityPlots
ggplot(dat_PA, aes(x=POP_DENSITY_10)) +
geom_density(fill="darkgoldenrod1", color="darkgoldenrod2", alpha=0.8) +
ggtitle("Variation in Population Density: Pennsylvania Counties") +
xlab("Population Density (2010)")
Philadelphia Water Department has three water treatment plants, the Baxter Plant, Queen Lane Plant and Belmont Plant. Located in Torresdale, the Baxter Plant cleans water from the Delaware. Both the Queen Lane (East Falls) and Belmont (Wynnefield) plants source water from the Schuylkill.
Summary
##LOOK AT PHILLY
dat_PHL <- filter(dat1, COUNTYFP == 101, STATEFP == 42)
#colnames(dat_PHL)
summary(dat_PHL$TOTAL_OFFICIAL_NEED)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 249190563 383117348 517044133 429336765 519409866 521775599
Total Need
#Density Plot: Total Need
ggplot(dat_PHL, aes(x=TOTAL_OFFICIAL_NEED)) +
geom_density(fill="coral", color="coral2", alpha=0.8) +
ggtitle("Variation in Need: Philadelphia Facilities") +
xlab("Official Need")
Of the three facilities, the highest reported need is for Philadelphia Water Dept (SW).
#Compare Changes in Collection and Treatment
#Barplots
ggplot(dat_PHL, aes(y=CHANGE_RES_REC_TRMT, x=CWNS_NUMBER)) +
geom_bar(position="dodge", stat="identity", color="coral1", fill="coral1", alpha=0.6) +
ggtitle("Change in Residential Population Receiving Treatment") +
xlab("CWNS Facility") + ylab("Change in Treatment Received")
Magnitude of overall budgeted need for CSS is significantly more than budgeted need for MS4.
Means are significantly different as the test statistic is greater than the critical value.
Test statistic is within the interval of 95% confidence.
dat_YesCSS <- filter(dat, CSS == "1")
dat_YesMS4 <- filter(dat, MS4 == "1")
#sample means
meanNeed_YesCSS <-mean(dat_YesCSS$TOTAL_OFFICIAL_NEED)
meanNeed_YesMS4 <-mean(dat_YesMS4$TOTAL_OFFICIAL_NEED)
##Mean need for CSS is larger than MS4
#sample standardized deviations
stdNeed_YesCSS <-sd(dat_YesCSS$TOTAL_OFFICIAL_NEED)
stdNeed_YesMS4 <-sd(dat_YesMS4$TOTAL_OFFICIAL_NEED)
#number of obs
numobs_YesCSS <- dim(dat_YesCSS)[1]
numobs_YesMS4 <- dim(dat_YesMS4)[1]
#test stat calc: difference in the means & standardized deviations
numerator_teststat <- meanNeed_YesCSS - meanNeed_YesMS4
denom_teststat <-
((stdNeed_YesCSS*stdNeed_YesCSS)/numobs_YesCSS +
(stdNeed_YesMS4*stdNeed_YesMS4)/numobs_YesMS4)^.5
test_stat<-numerator_teststat/denom_teststat
abs(test_stat)>1.96
## [1] TRUE
Paired T-test of CSS vs. MS4s
Reject null hypothesis of equal means.
t.test(dat_YesCSS$TOTAL_OFFICIAL_NEED, dat_YesMS4$TOTAL_OFFICIAL_NEED)
##
## Welch Two Sample t-test
##
## data: dat_YesCSS$TOTAL_OFFICIAL_NEED and dat_YesMS4$TOTAL_OFFICIAL_NEED
## t = 6.712, df = 699.19, p-value = 0.00000000003972
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 70514032 128823442
## sample estimates:
## mean of x mean of y
## 113053301 13384564
##Reject null hypothesis of equal means.
Summary of Official Need for CSS Facilities
summary(dat_YesCSS$TOTAL_OFFICIAL_NEED)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 2832885 8813826 113053301 36254645 5160108526
Summary of Official Need for MS4 Facilities
summary(dat_YesMS4$TOTAL_OFFICIAL_NEED)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 628112 2161876 13384564 6708263 3289580276
Mean need of TMDL impacted sites is greater than non-TMDL sites.
Test statistic is within the interval of 95% confidence.
dat_YesTMDL <- filter(dat, TMDL_INDICATOR == "Y")
dat_NoTMDL <- filter(dat, TMDL_INDICATOR == "N")
#sample means
meanNeed_YesTMDL <-mean(dat_YesTMDL$TOTAL_OFFICIAL_NEED)
meanNeed_NoTMDL <-mean(dat_NoTMDL$TOTAL_OFFICIAL_NEED)
##Mean need of TMDL impacted sites is greater than non-TMDL sites.
#sample standardized deviations
stdNeed_YesTMDL <-sd(dat_YesTMDL$TOTAL_OFFICIAL_NEED)
stdNeed_NoTMDL <-sd(dat_NoTMDL$TOTAL_OFFICIAL_NEED)
#number of obs
numobs_YesTMDL <- dim(dat_YesTMDL)[1]
numobs_NoTMDL <- dim(dat_NoTMDL)[1]
#test stat calc: difference in the means & standardized deviations
numerator_teststatTMDL <- meanNeed_NoTMDL - meanNeed_YesTMDL
denom_teststatTMDL <-
((stdNeed_NoTMDL*stdNeed_NoTMDL)/numobs_NoTMDL +
(stdNeed_YesTMDL*stdNeed_YesTMDL)/numobs_YesTMDL)^.5
test_statTMDL<-numerator_teststatTMDL/denom_teststatTMDL
abs(test_statTMDL)>1.96
## [1] TRUE
Paired T-test of TMDL and Official Need
Summary of Official Need for Non-TMDL Facilities
#paired t-test
summary(dat_NoTMDL$TOTAL_OFFICIAL_NEED)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 690815 2371608 20228097 7536767 5160108526
Summary of Official Need for TMDL Facilities
summary(dat_YesTMDL$TOTAL_OFFICIAL_NEED)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 1249456 3862453 58996731 14372950 3945228972
Paired T-test of Need by EPA Regions with TMDL Facilities
Region 1
Fail to reject null hypothesis; cannot establish significance.
#compare TMDL by need by EPA regions
t.test(TOTAL_OFFICIAL_NEED ~ TMDL_INDICATOR,
data = filter(dat, EPA_REGION == 1), paired = FALSE)
##
## Welch Two Sample t-test
##
## data: TOTAL_OFFICIAL_NEED by TMDL_INDICATOR
## t = -0.85922, df = 3.0193, p-value = 0.453
## alternative hypothesis: true difference in means between group N and group Y is not equal to 0
## 95 percent confidence interval:
## -542766502 311336752
## sample estimates:
## mean in group N mean in group Y
## 32685334 148400209
##High p-value/low t-value
Region 2
Fail to reject null hypothesis; cannot establish significance.
t.test(TOTAL_OFFICIAL_NEED ~ TMDL_INDICATOR,
data = filter(dat, EPA_REGION == 2), paired = FALSE)
##
## Welch Two Sample t-test
##
## data: TOTAL_OFFICIAL_NEED by TMDL_INDICATOR
## t = -3.5245, df = 5.0045, p-value = 0.01681
## alternative hypothesis: true difference in means between group N and group Y is not equal to 0
## 95 percent confidence interval:
## -3033874754 -475206220
## sample estimates:
## mean in group N mean in group Y
## 39662349 1794202835
##High p-value/low t-value
Region 3
Fail to reject null hypothesis; cannot establish significance.
t.test(TOTAL_OFFICIAL_NEED ~ TMDL_INDICATOR,
data = filter(dat, EPA_REGION == 3), paired = FALSE)
##
## Welch Two Sample t-test
##
## data: TOTAL_OFFICIAL_NEED by TMDL_INDICATOR
## t = 2.6523, df = 31.104, p-value = 0.01247
## alternative hypothesis: true difference in means between group N and group Y is not equal to 0
## 95 percent confidence interval:
## 2196333 16806577
## sample estimates:
## mean in group N mean in group Y
## 18403975 8902520
##High p-value/low t-value
Region 4
Fail to reject null hypothesis; cannot establish significance.
t.test(TOTAL_OFFICIAL_NEED ~ TMDL_INDICATOR,
data = filter(dat, EPA_REGION == 4), paired = FALSE)
##
## Welch Two Sample t-test
##
## data: TOTAL_OFFICIAL_NEED by TMDL_INDICATOR
## t = 0.12885, df = 18.325, p-value = 0.8989
## alternative hypothesis: true difference in means between group N and group Y is not equal to 0
## 95 percent confidence interval:
## -17470739 19756906
## sample estimates:
## mean in group N mean in group Y
## 22352317 21209234
##High p-value/low t-value -->
Region 5
Reject null hypothesis, significance between need and TMDL.
t.test(TOTAL_OFFICIAL_NEED ~ TMDL_INDICATOR,
data = filter(dat, EPA_REGION == 5), paired = FALSE)
##
## Welch Two Sample t-test
##
## data: TOTAL_OFFICIAL_NEED by TMDL_INDICATOR
## t = 3.9798, df = 1436.2, p-value = 0.00007241
## alternative hypothesis: true difference in means between group N and group Y is not equal to 0
## 95 percent confidence interval:
## 5244922 15440674
## sample estimates:
## mean in group N mean in group Y
## 15088120 4745322
##Low p-value/high t-value --> reject null hypothesis
Region 6
Reject null hypothesis, significance between need and TMDL.
t.test(TOTAL_OFFICIAL_NEED ~ TMDL_INDICATOR,
data = filter(dat, EPA_REGION == 6), paired = FALSE)
##
## Welch Two Sample t-test
##
## data: TOTAL_OFFICIAL_NEED by TMDL_INDICATOR
## t = -3.074, df = 63.917, p-value = 0.003104
## alternative hypothesis: true difference in means between group N and group Y is not equal to 0
## 95 percent confidence interval:
## -80860108 -17158802
## sample estimates:
## mean in group N mean in group Y
## 9455932 58465387
##Low p-value/high t-value --> reject null hypothesis
Regions 7 & 8
Failed t-test; not enough values.
#t.test(TOTAL_OFFICIAL_NEED ~ TMDL_INDICATOR,
# data = filter(dat, EPA_REGION == 7), paired = FALSE)
##failed t-test, not enough values
#t.test(TOTAL_OFFICIAL_NEED ~ TMDL_INDICATOR,
# data = filter(dat, EPA_REGION == 8), paired = FALSE)
##failed t-test, not enough values
Region 9
Fail to reject null hypothesis; cannot establish significance.
t.test(TOTAL_OFFICIAL_NEED ~ TMDL_INDICATOR,
data = filter(dat, EPA_REGION == 9), paired = FALSE)
##
## Welch Two Sample t-test
##
## data: TOTAL_OFFICIAL_NEED by TMDL_INDICATOR
## t = -0.87332, df = 1.0125, p-value = 0.5415
## alternative hypothesis: true difference in means between group N and group Y is not equal to 0
## 95 percent confidence interval:
## -1749036595 1517883950
## sample estimates:
## mean in group N mean in group Y
## 65458334 181034656
##Very high p-value/low t-value --> cannot establish significance; accept null hypothesis?
Region 10
Fail to reject null hypothesis; cannot establish significance.
t.test(TOTAL_OFFICIAL_NEED ~ TMDL_INDICATOR,
data = filter(dat, EPA_REGION == 10), paired = FALSE)
##
## Welch Two Sample t-test
##
## data: TOTAL_OFFICIAL_NEED by TMDL_INDICATOR
## t = 1.2645, df = 252, p-value = 0.2072
## alternative hypothesis: true difference in means between group N and group Y is not equal to 0
## 95 percent confidence interval:
## -4907121 22511440
## sample estimates:
## mean in group N mean in group Y
## 26134688 17332528
##High p-value/low t-value --> fail to reject null hypothesis; cannot establish significance
Cannot establish significant association between facility needs and TMDL impacted-sites.
Some significance was established in regions 5 & 6; otherwise there were too few values in regions 7 & 8; could not establish significance in regions 1 - 4, 9, 10.
Cross Tables of Official Need and Region
Summary of Official Need
library(ggplot2)
library(vcd)
library (gmodels)
library (MASS)
library(fabricatr)
library(reactable)
summary(dat$TOTAL_OFFICIAL_NEED)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 704599 2409822 21623215 7660571 5160108526
## cross table of need and region --> too large need to group
# (dat$TOTAL_OFFICIAL_NEED, dat$EPA_REGION, fisher = FALSE, chisq = TRUE,
# expected = TRUE, sresid = FALSE, format="SPSS")
##CHI2: 74282.34; P: 0.00000000000000057
##Minimum expected frequency: 0.03361045; Cells with Expected Frequency < 5: 79130 of 79140 (99.98736%)
Official Need Grouped by Five Equal Sections
dat1_intervalfive <- dat1 %>% mutate(intervalfive_Need = ntile(dat1$TOTAL_OFFICIAL_NEED, 5))%>%
mutate(intervalfive_pchmedinc = ntile(dat1$PCT_CHANGE_MEDINC, 5))%>%
mutate(intervalfive_pchpopden = ntile(dat1$PCT_CHANGE_POP_DENSITY, 5))%>%
mutate(intervalfive_pwhite10 = ntile(dat1$PCTWHITE10, 5))%>%
mutate(intervalfive_medinc09 = ntile(dat1$MEDINC09, 5))%>%
mutate(intervalfive_pop10 = ntile(dat1$POP_DENSITY_10, 5))
CrossTable(dat1_intervalfive$EPA_REGION, dat1_intervalfive$intervalfive_Need, fisher = FALSE, chisq = TRUE,
expected = TRUE, sresid = FALSE, format="SPSS")
##
## Cell Contents
## |-------------------------|
## | Count |
## | Expected Values |
## | Chi-square contribution |
## | Row Percent |
## | Column Percent |
## | Total Percent |
## |-------------------------|
##
## Total Observations in Table: 8420
##
## | dat1_intervalfive$intervalfive_Need
## dat1_intervalfive$EPA_REGION | 1 | 2 | 3 | 4 | 5 | Row Total |
## -----------------------------|-----------|-----------|-----------|-----------|-----------|-----------|
## 1 | 42 | 51 | 63 | 83 | 126 | 365 |
## | 73.000 | 73.000 | 73.000 | 73.000 | 73.000 | |
## | 13.164 | 6.630 | 1.370 | 1.370 | 38.479 | |
## | 11.507% | 13.973% | 17.260% | 22.740% | 34.521% | 4.335% |
## | 2.494% | 3.029% | 3.741% | 4.929% | 7.482% | |
## | 0.499% | 0.606% | 0.748% | 0.986% | 1.496% | |
## -----------------------------|-----------|-----------|-----------|-----------|-----------|-----------|
## 2 | 105 | 109 | 130 | 150 | 120 | 614 |
## | 122.800 | 122.800 | 122.800 | 122.800 | 122.800 | |
## | 2.580 | 1.551 | 0.422 | 6.025 | 0.064 | |
## | 17.101% | 17.752% | 21.173% | 24.430% | 19.544% | 7.292% |
## | 6.235% | 6.473% | 7.720% | 8.907% | 7.126% | |
## | 1.247% | 1.295% | 1.544% | 1.781% | 1.425% | |
## -----------------------------|-----------|-----------|-----------|-----------|-----------|-----------|
## 3 | 125 | 165 | 239 | 258 | 244 | 1031 |
## | 206.200 | 206.200 | 206.200 | 206.200 | 206.200 | |
## | 31.976 | 8.232 | 5.217 | 13.013 | 6.929 | |
## | 12.124% | 16.004% | 23.181% | 25.024% | 23.666% | 12.245% |
## | 7.423% | 9.798% | 14.192% | 15.321% | 14.489% | |
## | 1.485% | 1.960% | 2.838% | 3.064% | 2.898% | |
## -----------------------------|-----------|-----------|-----------|-----------|-----------|-----------|
## 4 | 226 | 266 | 231 | 320 | 335 | 1378 |
## | 275.600 | 275.600 | 275.600 | 275.600 | 275.600 | |
## | 8.927 | 0.334 | 7.218 | 7.153 | 12.802 | |
## | 16.401% | 19.303% | 16.763% | 23.222% | 24.311% | 16.366% |
## | 13.420% | 15.796% | 13.717% | 19.002% | 19.893% | |
## | 2.684% | 3.159% | 2.743% | 3.800% | 3.979% | |
## -----------------------------|-----------|-----------|-----------|-----------|-----------|-----------|
## 5 | 311 | 415 | 424 | 347 | 255 | 1752 |
## | 350.400 | 350.400 | 350.400 | 350.400 | 350.400 | |
## | 4.430 | 11.910 | 15.459 | 0.033 | 25.974 | |
## | 17.751% | 23.687% | 24.201% | 19.806% | 14.555% | 20.808% |
## | 18.468% | 24.644% | 25.178% | 20.606% | 15.143% | |
## | 3.694% | 4.929% | 5.036% | 4.121% | 3.029% | |
## -----------------------------|-----------|-----------|-----------|-----------|-----------|-----------|
## 6 | 253 | 234 | 207 | 185 | 184 | 1063 |
## | 212.600 | 212.600 | 212.600 | 212.600 | 212.600 | |
## | 7.677 | 2.154 | 0.148 | 3.583 | 3.847 | |
## | 23.801% | 22.013% | 19.473% | 17.404% | 17.310% | 12.625% |
## | 15.024% | 13.895% | 12.292% | 10.986% | 10.926% | |
## | 3.005% | 2.779% | 2.458% | 2.197% | 2.185% | |
## -----------------------------|-----------|-----------|-----------|-----------|-----------|-----------|
## 7 | 427 | 251 | 205 | 126 | 92 | 1101 |
## | 220.200 | 220.200 | 220.200 | 220.200 | 220.200 | |
## | 194.215 | 4.308 | 1.049 | 40.298 | 74.638 | |
## | 38.783% | 22.797% | 18.619% | 11.444% | 8.356% | 13.076% |
## | 25.356% | 14.905% | 12.173% | 7.482% | 5.463% | |
## | 5.071% | 2.981% | 2.435% | 1.496% | 1.093% | |
## -----------------------------|-----------|-----------|-----------|-----------|-----------|-----------|
## 8 | 131 | 138 | 116 | 88 | 42 | 515 |
## | 103.000 | 103.000 | 103.000 | 103.000 | 103.000 | |
## | 7.612 | 11.893 | 1.641 | 2.184 | 36.126 | |
## | 25.437% | 26.796% | 22.524% | 17.087% | 8.155% | 6.116% |
## | 7.779% | 8.195% | 6.888% | 5.226% | 2.494% | |
## | 1.556% | 1.639% | 1.378% | 1.045% | 0.499% | |
## -----------------------------|-----------|-----------|-----------|-----------|-----------|-----------|
## 9 | 22 | 14 | 21 | 64 | 197 | 318 |
## | 63.600 | 63.600 | 63.600 | 63.600 | 63.600 | |
## | 27.210 | 38.682 | 28.534 | 0.003 | 279.804 | |
## | 6.918% | 4.403% | 6.604% | 20.126% | 61.950% | 3.777% |
## | 1.306% | 0.831% | 1.247% | 3.800% | 11.698% | |
## | 0.261% | 0.166% | 0.249% | 0.760% | 2.340% | |
## -----------------------------|-----------|-----------|-----------|-----------|-----------|-----------|
## 10 | 42 | 41 | 48 | 63 | 89 | 283 |
## | 56.600 | 56.600 | 56.600 | 56.600 | 56.600 | |
## | 3.766 | 4.300 | 1.307 | 0.724 | 18.547 | |
## | 14.841% | 14.488% | 16.961% | 22.261% | 31.449% | 3.361% |
## | 2.494% | 2.435% | 2.850% | 3.741% | 5.285% | |
## | 0.499% | 0.487% | 0.570% | 0.748% | 1.057% | |
## -----------------------------|-----------|-----------|-----------|-----------|-----------|-----------|
## Column Total | 1684 | 1684 | 1684 | 1684 | 1684 | 8420 |
## | 20.000% | 20.000% | 20.000% | 20.000% | 20.000% | |
## -----------------------------|-----------|-----------|-----------|-----------|-----------|-----------|
##
##
## Statistics for All Table Factors
##
##
## Pearson's Chi-squared test
## ------------------------------------------------------------
## Chi^2 = 1025.513 d.f. = 36 p = 0.00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000699429
##
##
##
## Minimum expected frequency: 56.6
Percent Change in Median Income
# Potential Association: Percent change in median income
CrossTable(dat1_intervalfive$intervalfive_pchmedinc, dat1_intervalfive$intervalfive_Need, fisher = FALSE, chisq = TRUE,
expected = TRUE, sresid = FALSE, format="SPSS")
##
## Cell Contents
## |-------------------------|
## | Count |
## | Expected Values |
## | Chi-square contribution |
## | Row Percent |
## | Column Percent |
## | Total Percent |
## |-------------------------|
##
## Total Observations in Table: 8420
##
## | dat1_intervalfive$intervalfive_Need
## dat1_intervalfive$intervalfive_pchmedinc | 1 | 2 | 3 | 4 | 5 | Row Total |
## -----------------------------------------|-----------|-----------|-----------|-----------|-----------|-----------|
## 1 | 335 | 364 | 343 | 333 | 309 | 1684 |
## | 336.800 | 336.800 | 336.800 | 336.800 | 336.800 | |
## | 0.010 | 2.197 | 0.114 | 0.043 | 2.295 | |
## | 19.893% | 21.615% | 20.368% | 19.774% | 18.349% | 20.000% |
## | 19.893% | 21.615% | 20.368% | 19.774% | 18.349% | |
## | 3.979% | 4.323% | 4.074% | 3.955% | 3.670% | |
## -----------------------------------------|-----------|-----------|-----------|-----------|-----------|-----------|
## 2 | 334 | 327 | 331 | 350 | 342 | 1684 |
## | 336.800 | 336.800 | 336.800 | 336.800 | 336.800 | |
## | 0.023 | 0.285 | 0.100 | 0.517 | 0.080 | |
## | 19.834% | 19.418% | 19.656% | 20.784% | 20.309% | 20.000% |
## | 19.834% | 19.418% | 19.656% | 20.784% | 20.309% | |
## | 3.967% | 3.884% | 3.931% | 4.157% | 4.062% | |
## -----------------------------------------|-----------|-----------|-----------|-----------|-----------|-----------|
## 3 | 342 | 325 | 329 | 350 | 338 | 1684 |
## | 336.800 | 336.800 | 336.800 | 336.800 | 336.800 | |
## | 0.080 | 0.413 | 0.181 | 0.517 | 0.004 | |
## | 20.309% | 19.299% | 19.537% | 20.784% | 20.071% | 20.000% |
## | 20.309% | 19.299% | 19.537% | 20.784% | 20.071% | |
## | 4.062% | 3.860% | 3.907% | 4.157% | 4.014% | |
## -----------------------------------------|-----------|-----------|-----------|-----------|-----------|-----------|
## 4 | 299 | 340 | 349 | 332 | 364 | 1684 |
## | 336.800 | 336.800 | 336.800 | 336.800 | 336.800 | |
## | 4.242 | 0.030 | 0.442 | 0.068 | 2.197 | |
## | 17.755% | 20.190% | 20.724% | 19.715% | 21.615% | 20.000% |
## | 17.755% | 20.190% | 20.724% | 19.715% | 21.615% | |
## | 3.551% | 4.038% | 4.145% | 3.943% | 4.323% | |
## -----------------------------------------|-----------|-----------|-----------|-----------|-----------|-----------|
## 5 | 374 | 328 | 332 | 319 | 331 | 1684 |
## | 336.800 | 336.800 | 336.800 | 336.800 | 336.800 | |
## | 4.109 | 0.230 | 0.068 | 0.941 | 0.100 | |
## | 22.209% | 19.477% | 19.715% | 18.943% | 19.656% | 20.000% |
## | 22.209% | 19.477% | 19.715% | 18.943% | 19.656% | |
## | 4.442% | 3.895% | 3.943% | 3.789% | 3.931% | |
## -----------------------------------------|-----------|-----------|-----------|-----------|-----------|-----------|
## Column Total | 1684 | 1684 | 1684 | 1684 | 1684 | 8420 |
## | 20.000% | 20.000% | 20.000% | 20.000% | 20.000% | |
## -----------------------------------------|-----------|-----------|-----------|-----------|-----------|-----------|
##
##
## Statistics for All Table Factors
##
##
## Pearson's Chi-squared test
## ------------------------------------------------------------
## Chi^2 = 19.28741 d.f. = 16 p = 0.2540411
##
##
##
## Minimum expected frequency: 336.8
Percent Change in Population Density
# Potential Association: Percent change in population density
CrossTable(dat1_intervalfive$intervalfive_pchpopden, dat1_intervalfive$intervalfive_Need, fisher = FALSE, chisq = TRUE,
expected = TRUE, sresid = FALSE, format="SPSS")
##
## Cell Contents
## |-------------------------|
## | Count |
## | Expected Values |
## | Chi-square contribution |
## | Row Percent |
## | Column Percent |
## | Total Percent |
## |-------------------------|
##
## Total Observations in Table: 8420
##
## | dat1_intervalfive$intervalfive_Need
## dat1_intervalfive$intervalfive_pchpopden | 1 | 2 | 3 | 4 | 5 | Row Total |
## -----------------------------------------|-----------|-----------|-----------|-----------|-----------|-----------|
## 1 | 453 | 393 | 371 | 292 | 175 | 1684 |
## | 336.800 | 336.800 | 336.800 | 336.800 | 336.800 | |
## | 40.090 | 9.378 | 3.473 | 5.959 | 77.729 | |
## | 26.900% | 23.337% | 22.031% | 17.340% | 10.392% | 20.000% |
## | 26.900% | 23.337% | 22.031% | 17.340% | 10.392% | |
## | 5.380% | 4.667% | 4.406% | 3.468% | 2.078% | |
## -----------------------------------------|-----------|-----------|-----------|-----------|-----------|-----------|
## 2 | 345 | 371 | 374 | 327 | 267 | 1684 |
## | 336.800 | 336.800 | 336.800 | 336.800 | 336.800 | |
## | 0.200 | 3.473 | 4.109 | 0.285 | 14.466 | |
## | 20.487% | 22.031% | 22.209% | 19.418% | 15.855% | 20.000% |
## | 20.487% | 22.031% | 22.209% | 19.418% | 15.855% | |
## | 4.097% | 4.406% | 4.442% | 3.884% | 3.171% | |
## -----------------------------------------|-----------|-----------|-----------|-----------|-----------|-----------|
## 3 | 323 | 351 | 326 | 347 | 337 | 1684 |
## | 336.800 | 336.800 | 336.800 | 336.800 | 336.800 | |
## | 0.565 | 0.599 | 0.346 | 0.309 | 0.000 | |
## | 19.181% | 20.843% | 19.359% | 20.606% | 20.012% | 20.000% |
## | 19.181% | 20.843% | 19.359% | 20.606% | 20.012% | |
## | 3.836% | 4.169% | 3.872% | 4.121% | 4.002% | |
## -----------------------------------------|-----------|-----------|-----------|-----------|-----------|-----------|
## 4 | 300 | 308 | 325 | 354 | 397 | 1684 |
## | 336.800 | 336.800 | 336.800 | 336.800 | 336.800 | |
## | 4.021 | 2.463 | 0.413 | 0.878 | 10.760 | |
## | 17.815% | 18.290% | 19.299% | 21.021% | 23.575% | 20.000% |
## | 17.815% | 18.290% | 19.299% | 21.021% | 23.575% | |
## | 3.563% | 3.658% | 3.860% | 4.204% | 4.715% | |
## -----------------------------------------|-----------|-----------|-----------|-----------|-----------|-----------|
## 5 | 263 | 261 | 288 | 364 | 508 | 1684 |
## | 336.800 | 336.800 | 336.800 | 336.800 | 336.800 | |
## | 16.171 | 17.060 | 7.071 | 2.197 | 87.023 | |
## | 15.618% | 15.499% | 17.102% | 21.615% | 30.166% | 20.000% |
## | 15.618% | 15.499% | 17.102% | 21.615% | 30.166% | |
## | 3.124% | 3.100% | 3.420% | 4.323% | 6.033% | |
## -----------------------------------------|-----------|-----------|-----------|-----------|-----------|-----------|
## Column Total | 1684 | 1684 | 1684 | 1684 | 1684 | 8420 |
## | 20.000% | 20.000% | 20.000% | 20.000% | 20.000% | |
## -----------------------------------------|-----------|-----------|-----------|-----------|-----------|-----------|
##
##
## Statistics for All Table Factors
##
##
## Pearson's Chi-squared test
## ------------------------------------------------------------
## Chi^2 = 309.038 d.f. = 16 p = 0.00000000000000000000000000000000000000000000000000000003417345
##
##
##
## Minimum expected frequency: 336.8
Percentage of Population, White in 2010
# Potential Association: Percentage of Population White 2010
CrossTable(dat1_intervalfive$intervalfive_pwhite10, dat1_intervalfive$intervalfive_Need, fisher = FALSE, chisq = TRUE,
expected = TRUE, sresid = FALSE, format="SPSS")
##
## Cell Contents
## |-------------------------|
## | Count |
## | Expected Values |
## | Chi-square contribution |
## | Row Percent |
## | Column Percent |
## | Total Percent |
## |-------------------------|
##
## Total Observations in Table: 8420
##
## | dat1_intervalfive$intervalfive_Need
## dat1_intervalfive$intervalfive_pwhite10 | 1 | 2 | 3 | 4 | 5 | Row Total |
## ----------------------------------------|-----------|-----------|-----------|-----------|-----------|-----------|
## 1 | 315 | 292 | 267 | 329 | 481 | 1684 |
## | 336.800 | 336.800 | 336.800 | 336.800 | 336.800 | |
## | 1.411 | 5.959 | 14.466 | 0.181 | 61.739 | |
## | 18.705% | 17.340% | 15.855% | 19.537% | 28.563% | 20.000% |
## | 18.705% | 17.340% | 15.855% | 19.537% | 28.563% | |
## | 3.741% | 3.468% | 3.171% | 3.907% | 5.713% | |
## ----------------------------------------|-----------|-----------|-----------|-----------|-----------|-----------|
## 2 | 284 | 276 | 307 | 349 | 468 | 1684 |
## | 336.800 | 336.800 | 336.800 | 336.800 | 336.800 | |
## | 8.277 | 10.976 | 2.637 | 0.442 | 51.109 | |
## | 16.865% | 16.390% | 18.230% | 20.724% | 27.791% | 20.000% |
## | 16.865% | 16.390% | 18.230% | 20.724% | 27.791% | |
## | 3.373% | 3.278% | 3.646% | 4.145% | 5.558% | |
## ----------------------------------------|-----------|-----------|-----------|-----------|-----------|-----------|
## 3 | 298 | 318 | 333 | 382 | 353 | 1684 |
## | 336.800 | 336.800 | 336.800 | 336.800 | 336.800 | |
## | 4.470 | 1.049 | 0.043 | 6.066 | 0.779 | |
## | 17.696% | 18.884% | 19.774% | 22.684% | 20.962% | 20.000% |
## | 17.696% | 18.884% | 19.774% | 22.684% | 20.962% | |
## | 3.539% | 3.777% | 3.955% | 4.537% | 4.192% | |
## ----------------------------------------|-----------|-----------|-----------|-----------|-----------|-----------|
## 4 | 334 | 400 | 360 | 339 | 251 | 1684 |
## | 336.800 | 336.800 | 336.800 | 336.800 | 336.800 | |
## | 0.023 | 11.859 | 1.598 | 0.014 | 21.858 | |
## | 19.834% | 23.753% | 21.378% | 20.131% | 14.905% | 20.000% |
## | 19.834% | 23.753% | 21.378% | 20.131% | 14.905% | |
## | 3.967% | 4.751% | 4.276% | 4.026% | 2.981% | |
## ----------------------------------------|-----------|-----------|-----------|-----------|-----------|-----------|
## 5 | 453 | 398 | 417 | 285 | 131 | 1684 |
## | 336.800 | 336.800 | 336.800 | 336.800 | 336.800 | |
## | 40.090 | 11.121 | 19.098 | 7.967 | 125.753 | |
## | 26.900% | 23.634% | 24.762% | 16.924% | 7.779% | 20.000% |
## | 26.900% | 23.634% | 24.762% | 16.924% | 7.779% | |
## | 5.380% | 4.727% | 4.952% | 3.385% | 1.556% | |
## ----------------------------------------|-----------|-----------|-----------|-----------|-----------|-----------|
## Column Total | 1684 | 1684 | 1684 | 1684 | 1684 | 8420 |
## | 20.000% | 20.000% | 20.000% | 20.000% | 20.000% | |
## ----------------------------------------|-----------|-----------|-----------|-----------|-----------|-----------|
##
##
## Statistics for All Table Factors
##
##
## Pearson's Chi-squared test
## ------------------------------------------------------------
## Chi^2 = 408.9846 d.f. = 16 p = 0.00000000000000000000000000000000000000000000000000000000000000000000000000004758703
##
##
##
## Minimum expected frequency: 336.8
Median Income in 2009
CrossTable(dat1_intervalfive$intervalfive_medinc09, dat1_intervalfive$intervalfive_Need, fisher = FALSE, chisq = TRUE,
expected = TRUE, sresid = FALSE, format="SPSS")
##
## Cell Contents
## |-------------------------|
## | Count |
## | Expected Values |
## | Chi-square contribution |
## | Row Percent |
## | Column Percent |
## | Total Percent |
## |-------------------------|
##
## Total Observations in Table: 8420
##
## | dat1_intervalfive$intervalfive_Need
## dat1_intervalfive$intervalfive_medinc09 | 1 | 2 | 3 | 4 | 5 | Row Total |
## ----------------------------------------|-----------|-----------|-----------|-----------|-----------|-----------|
## 1 | 432 | 401 | 342 | 322 | 187 | 1684 |
## | 336.800 | 336.800 | 336.800 | 336.800 | 336.800 | |
## | 26.909 | 12.238 | 0.080 | 0.650 | 66.627 | |
## | 25.653% | 23.812% | 20.309% | 19.121% | 11.105% | 20.000% |
## | 25.653% | 23.812% | 20.309% | 19.121% | 11.105% | |
## | 5.131% | 4.762% | 4.062% | 3.824% | 2.221% | |
## ----------------------------------------|-----------|-----------|-----------|-----------|-----------|-----------|
## 2 | 351 | 362 | 365 | 346 | 260 | 1684 |
## | 336.800 | 336.800 | 336.800 | 336.800 | 336.800 | |
## | 0.599 | 1.886 | 2.361 | 0.251 | 17.513 | |
## | 20.843% | 21.496% | 21.675% | 20.546% | 15.439% | 20.000% |
## | 20.843% | 21.496% | 21.675% | 20.546% | 15.439% | |
## | 4.169% | 4.299% | 4.335% | 4.109% | 3.088% | |
## ----------------------------------------|-----------|-----------|-----------|-----------|-----------|-----------|
## 3 | 361 | 325 | 344 | 318 | 336 | 1684 |
## | 336.800 | 336.800 | 336.800 | 336.800 | 336.800 | |
## | 1.739 | 0.413 | 0.154 | 1.049 | 0.002 | |
## | 21.437% | 19.299% | 20.428% | 18.884% | 19.952% | 20.000% |
## | 21.437% | 19.299% | 20.428% | 18.884% | 19.952% | |
## | 4.287% | 3.860% | 4.086% | 3.777% | 3.990% | |
## ----------------------------------------|-----------|-----------|-----------|-----------|-----------|-----------|
## 4 | 300 | 340 | 331 | 327 | 386 | 1684 |
## | 336.800 | 336.800 | 336.800 | 336.800 | 336.800 | |
## | 4.021 | 0.030 | 0.100 | 0.285 | 7.187 | |
## | 17.815% | 20.190% | 19.656% | 19.418% | 22.922% | 20.000% |
## | 17.815% | 20.190% | 19.656% | 19.418% | 22.922% | |
## | 3.563% | 4.038% | 3.931% | 3.884% | 4.584% | |
## ----------------------------------------|-----------|-----------|-----------|-----------|-----------|-----------|
## 5 | 240 | 256 | 302 | 371 | 515 | 1684 |
## | 336.800 | 336.800 | 336.800 | 336.800 | 336.800 | |
## | 27.821 | 19.384 | 3.596 | 3.473 | 94.285 | |
## | 14.252% | 15.202% | 17.933% | 22.031% | 30.582% | 20.000% |
## | 14.252% | 15.202% | 17.933% | 22.031% | 30.582% | |
## | 2.850% | 3.040% | 3.587% | 4.406% | 6.116% | |
## ----------------------------------------|-----------|-----------|-----------|-----------|-----------|-----------|
## Column Total | 1684 | 1684 | 1684 | 1684 | 1684 | 8420 |
## | 20.000% | 20.000% | 20.000% | 20.000% | 20.000% | |
## ----------------------------------------|-----------|-----------|-----------|-----------|-----------|-----------|
##
##
## Statistics for All Table Factors
##
##
## Pearson's Chi-squared test
## ------------------------------------------------------------
## Chi^2 = 292.6544 d.f. = 16 p = 0.00000000000000000000000000000000000000000000000000008450559
##
##
##
## Minimum expected frequency: 336.8
Population Density in 2010
CrossTable(dat1_intervalfive$intervalfive_pop10, dat1_intervalfive$intervalfive_Need, fisher = FALSE, chisq = TRUE,
expected = TRUE, sresid = FALSE, format="SPSS")
##
## Cell Contents
## |-------------------------|
## | Count |
## | Expected Values |
## | Chi-square contribution |
## | Row Percent |
## | Column Percent |
## | Total Percent |
## |-------------------------|
##
## Total Observations in Table: 8420
##
## | dat1_intervalfive$intervalfive_Need
## dat1_intervalfive$intervalfive_pop10 | 1 | 2 | 3 | 4 | 5 | Row Total |
## -------------------------------------|-----------|-----------|-----------|-----------|-----------|-----------|
## 1 | 566 | 412 | 369 | 245 | 92 | 1684 |
## | 336.800 | 336.800 | 336.800 | 336.800 | 336.800 | |
## | 155.976 | 16.790 | 3.079 | 25.021 | 177.931 | |
## | 33.610% | 24.466% | 21.912% | 14.549% | 5.463% | 20.000% |
## | 33.610% | 24.466% | 21.912% | 14.549% | 5.463% | |
## | 6.722% | 4.893% | 4.382% | 2.910% | 1.093% | |
## -------------------------------------|-----------|-----------|-----------|-----------|-----------|-----------|
## 2 | 414 | 421 | 354 | 319 | 176 | 1684 |
## | 336.800 | 336.800 | 336.800 | 336.800 | 336.800 | |
## | 17.695 | 21.050 | 0.878 | 0.941 | 76.771 | |
## | 24.584% | 25.000% | 21.021% | 18.943% | 10.451% | 20.000% |
## | 24.584% | 25.000% | 21.021% | 18.943% | 10.451% | |
## | 4.917% | 5.000% | 4.204% | 3.789% | 2.090% | |
## -------------------------------------|-----------|-----------|-----------|-----------|-----------|-----------|
## 3 | 281 | 369 | 366 | 387 | 281 | 1684 |
## | 336.800 | 336.800 | 336.800 | 336.800 | 336.800 | |
## | 9.245 | 3.079 | 2.532 | 7.482 | 9.245 | |
## | 16.686% | 21.912% | 21.734% | 22.981% | 16.686% | 20.000% |
## | 16.686% | 21.912% | 21.734% | 22.981% | 16.686% | |
## | 3.337% | 4.382% | 4.347% | 4.596% | 3.337% | |
## -------------------------------------|-----------|-----------|-----------|-----------|-----------|-----------|
## 4 | 221 | 274 | 314 | 377 | 498 | 1684 |
## | 336.800 | 336.800 | 336.800 | 336.800 | 336.800 | |
## | 39.815 | 11.710 | 1.543 | 4.798 | 77.154 | |
## | 13.124% | 16.271% | 18.646% | 22.387% | 29.572% | 20.000% |
## | 13.124% | 16.271% | 18.646% | 22.387% | 29.572% | |
## | 2.625% | 3.254% | 3.729% | 4.477% | 5.914% | |
## -------------------------------------|-----------|-----------|-----------|-----------|-----------|-----------|
## 5 | 202 | 208 | 281 | 356 | 637 | 1684 |
## | 336.800 | 336.800 | 336.800 | 336.800 | 336.800 | |
## | 53.952 | 49.256 | 9.245 | 1.095 | 267.577 | |
## | 11.995% | 12.352% | 16.686% | 21.140% | 37.827% | 20.000% |
## | 11.995% | 12.352% | 16.686% | 21.140% | 37.827% | |
## | 2.399% | 2.470% | 3.337% | 4.228% | 7.565% | |
## -------------------------------------|-----------|-----------|-----------|-----------|-----------|-----------|
## Column Total | 1684 | 1684 | 1684 | 1684 | 1684 | 8420 |
## | 20.000% | 20.000% | 20.000% | 20.000% | 20.000% | |
## -------------------------------------|-----------|-----------|-----------|-----------|-----------|-----------|
##
##
## Statistics for All Table Factors
##
##
## Pearson's Chi-squared test
## ------------------------------------------------------------
## Chi^2 = 1043.86 d.f. = 16 p = 0.00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000004522926
##
##
##
## Minimum expected frequency: 336.8
2010 Population and Official Need
Low degree of correlation @ 0.18
##Correlation between 2010population and official need
cor.test(dat1$POP10, dat1$TOTAL_OFFICIAL_NEED, method = "pearson")
##
## Pearson's product-moment correlation
##
## data: dat1$POP10 and dat1$TOTAL_OFFICIAL_NEED
## t = 16.355, df = 8418, p-value < 0.00000000000000022
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1547099 0.1961150
## sample estimates:
## cor
## 0.1754901
## small correlation @ 0.18
##scatter plot##
ggplot(dat1, aes(x = POP10,
y = TOTAL_OFFICIAL_NEED)) +
geom_point(size=2, shape=20, color="orchid") +
theme(legend.position = "bottom") +
ggtitle("2010 Population x Total Official Need") +
geom_smooth(method=lm) +
xlab("Total Population (2010)") + ylab("Total Official Need")
##Degree of correlation:
###Perfect: If value is near ? 1, then it said to be a perfect correlation: as one variable increases, the other variable tends to also increase (if positive) or decrease (if negative).
###High degree: value between ? 0.50 and ? 1
###Moderate degree: value between ? 0.30 and ? 0.49
###Low degree: value below + .29
###No correlation: value is 0.
2010 Population Density and Official Need
Moderate degree of correlation @ 0.43
cor.test(dat1$POP_DENSITY_10, dat1$TOTAL_OFFICIAL_NEED, method = "pearson")
##
## Pearson's product-moment correlation
##
## data: dat1$POP_DENSITY_10 and dat1$TOTAL_OFFICIAL_NEED
## t = 43.501, df = 8418, p-value < 0.00000000000000022
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4108153 0.4456975
## sample estimates:
## cor
## 0.428416
ggplot(dat1, aes(x = POP_DENSITY_10,
y = TOTAL_OFFICIAL_NEED)) +
geom_point(size=2, shape=20, color="orchid") +
theme(legend.position = "bottom") +
ggtitle("2010 Population Density x Total Official Need") +
geom_smooth(method=lm) +
xlab("Population Density (2010)") + ylab("Total Official Need")
##
Median Income and Official Need
Little to no correlation @ 0.05
cor.test(dat1$MEDINC09, dat1$TOTAL_OFFICIAL_NEED, method = "pearson")
##
## Pearson's product-moment correlation
##
## data: dat1$MEDINC09 and dat1$TOTAL_OFFICIAL_NEED
## t = 4.8097, df = 8418, p-value = 0.000001538
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.03102467 0.07362785
## sample estimates:
## cor
## 0.05235008
ggplot(dat1, aes(x = MEDINC09,
y = TOTAL_OFFICIAL_NEED)) +
geom_point(size=2, shape=20, color="orchid") +
theme(legend.position = "bottom") +
ggtitle("Median Income x Total Official Need") +
geom_smooth(method=lm) +
xlab("Median Income (2009)") + ylab("Total Official Need")
Change in Residential Collection and Official Need
Low degree of correlation @ 0.22
cor.test(dat1$CHANGE_RES_REC_COLLCTN, dat1$TOTAL_OFFICIAL_NEED, method = "pearson")
##
## Pearson's product-moment correlation
##
## data: dat1$CHANGE_RES_REC_COLLCTN and dat1$TOTAL_OFFICIAL_NEED
## t = 20.412, df = 8418, p-value < 0.00000000000000022
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1967184 0.2374248
## sample estimates:
## cor
## 0.217166
ggplot(dat1, aes(x = CHANGE_RES_REC_COLLCTN,
y = TOTAL_OFFICIAL_NEED)) +
geom_point(size=2, shape=20, color="orchid") +
theme(legend.position = "bottom") +
ggtitle("Change in Residential Collection x Total Official Need") +
geom_smooth(method=lm) +
xlab("Change in Res Collection") + ylab("Total Official Need")
Change in Residential Collection and Median Income
Little to no correlation @ 0.064
cor.test(dat1$MEDINC09, dat1$CHANGE_RES_REC_COLLCTN, method = "pearson")
##
## Pearson's product-moment correlation
##
## data: dat1$MEDINC09 and dat1$CHANGE_RES_REC_COLLCTN
## t = 5.8425, df = 8418, p-value = 0.000000005333
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.04224740 0.08479516
## sample estimates:
## cor
## 0.06355016
ggplot(dat1, aes(x = MEDINC09,
y = CHANGE_RES_REC_COLLCTN)) +
geom_point(size=2, shape=20, color="orchid") +
theme(legend.position = "bottom") +
ggtitle("Change in Residential Collection x Median Income") +
geom_smooth(method=lm) +
xlab("Median Income (2009)") + ylab("Change in Res Collection")
library(tidyverse)
library(tidycensus)
library(ggplot2)
library(dplyr)
library(data.table)
require("ggrepel")
library(hrbrthemes)
library(viridis)
library(ggridges)
library(Hmisc)
library(stargazer)
options(scipen=999)
dat <- read.csv("cwns2012.csv", header = TRUE)
dat2 <- dat%>%
mutate(LOG_NEED = log(TOTAL_OFFICIAL_NEED))%>%
mutate(RES_BURDEN = TOTAL_OFFICIAL_NEED/POP10)%>%
mutate(LOG_RES_BURDEN = log(RES_BURDEN))%>%
mutate(LOG_MEDINC09 = log(MEDINC09))%>%
mutate(MEDINC99_INFL = MEDINC99*1.29)%>%
mutate(LOG_MEDINC99_INFL = log(MEDINC99_INFL))%>%
mutate(LOG_POP10 = log(POP10))%>%
mutate(LOG_POP00 = log(POP00))%>%
mutate(ALAND_SQMILE = ALAND/2.59e+6)%>%
mutate(POP_DENSITY_10 = POP10/ALAND_SQMILE)%>%
mutate(LOG_POP_DENSITY_10 = log(POP_DENSITY_10))%>%
mutate(LOG_RES_REC_COLLCTN = log(PRES_RES_REC_COLLCTN))%>%
mutate(LOG_RES_REC_TRMT = log(PRES_RES_REC_TRMT))
Box Plots
dat2_focus_states <- filter(dat2, STATE %in% c("PA", "NJ", "NY", "DE"))
#Boxplots
ggplot(dat2_focus_states,
aes(x=STATE, y=TOTAL_OFFICIAL_NEED)) +
geom_boxplot(fill="seagreen3", alpha=0.2) +
geom_jitter(color="gray50", size=0.4, alpha=0.9) +
ggtitle("Water Infrastructure Need: Delaware River Watershed") +
xlab("STATE") + ylab("Official Need")
Box Plot with Log Transformed Variables
ggplot(dat2_focus_states,
aes(x=STATE, y=LOG_NEED)) +
geom_boxplot(fill="seagreen3", alpha=0.2) +
geom_jitter(color="gray50", size=0.4, alpha=0.9) +
ggtitle("Water Infrastructure Need: Delaware River Watershed") +
xlab("STATE") + ylab("Official Need (Log Adjusted)")
Density Plots
ggplot(dat2_focus_states,
aes(x=RES_BURDEN)) +
geom_density(fill="seagreen3", color="seagreen1", alpha=0.7) +
ggtitle("Residential Burden: Delaware River Watershed") +
xlab("Residential Burden") + ylab("Density")
Density plots with Log Transformed Variables
#Density plots with Log Transformed Variables
ggplot(dat2_focus_states,
aes(x=LOG_RES_BURDEN)) +
geom_density(fill="seagreen3", color="seagreen1", alpha=0.7) +
ggtitle("Residential Burden: Delaware River Watershed") +
xlab("Residential Burden (Log Adjusted)") + ylab("Density")
Density Plots
dat2_PA <- filter(dat2, STATEFP == 42)
#Density plots
ggplot(dat2_PA,
aes(x=MEDINC09)) +
geom_density(fill="darkgoldenrod1", color="darkgoldenrod2", alpha=0.7) +
ggtitle("Median Income Distribution: Pennsylvania Counties") +
xlab("Median Income (2009)") + ylab("Density")
Density Plots with Log Transformed Variables
#Density plots with Log Transformed Variables
ggplot(dat2_PA,
aes(x=LOG_MEDINC09)) +
geom_density(fill="darkgoldenrod1", color="darkgoldenrod2", alpha=0.7) +
ggtitle("Median Income Distribution: Pennsylvania Counties") +
xlab("Median Income (2009, Log Adjusted)") + ylab("Density")
Bar Plots
Total Need
#Barplots: Total Need
dat2%>%
filter(COUNTYFP == 101, STATEFP == 42)%>%
ggplot(aes(y=TOTAL_OFFICIAL_NEED, x=CWNS_NUMBER)) +
geom_bar(position="dodge", stat="identity", color="coral1", fill="coral1", alpha=0.6) +
ggtitle("Total Official Need: Philadelphia Facilities") +
xlab("CWNS Facility") + ylab("Total Need")
Residential Burden
Log adjustment not required as variables are similar in value.
#Barplots: Res Burden
dat2%>%
filter(COUNTYFP == 101, STATEFP == 42)%>%
ggplot(aes(y=RES_BURDEN, x=CWNS_NUMBER)) +
geom_bar(position="dodge", stat="identity", color="coral1", fill="coral1", alpha=0.6) +
ggtitle("Residential Burden: Philadelphia Facilities") +
xlab("CWNS Facility") + ylab("Residential Burden")
##log adjustment not required as variables are similar in value
Total Population vs. Official Need
cor = 0.18, small correlation
cor.test(dat2$POP10, dat2$TOTAL_OFFICIAL_NEED, method = "pearson")
##
## Pearson's product-moment correlation
##
## data: dat2$POP10 and dat2$TOTAL_OFFICIAL_NEED
## t = 16.355, df = 8418, p-value < 0.00000000000000022
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1547099 0.1961150
## sample estimates:
## cor
## 0.1754901
#cor = 0.18, small correlation
Population Density vs. Official Need
cor = 0.43, moderate correlation
cor.test(dat2$POP_DENSITY_10, dat2$TOTAL_OFFICIAL_NEED, method = "pearson")
##
## Pearson's product-moment correlation
##
## data: dat2$POP_DENSITY_10 and dat2$TOTAL_OFFICIAL_NEED
## t = 43.501, df = 8418, p-value < 0.00000000000000022
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4108153 0.4456975
## sample estimates:
## cor
## 0.428416
#cor=0.43, variables exhibit a medium correlation
Median Income vs. Official Need
cor = 0.052, little to no correlation
cor.test(dat2$MEDINC09, dat2$TOTAL_OFFICIAL_NEED, method = "pearson")
##
## Pearson's product-moment correlation
##
## data: dat2$MEDINC09 and dat2$TOTAL_OFFICIAL_NEED
## t = 4.8097, df = 8418, p-value = 0.000001538
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.03102467 0.07362785
## sample estimates:
## cor
## 0.05235008
#cor 0.052, little to no correlation
Residential Burden vs. Official Need
cor = 0.33, moderate correlation
cor.test(dat2$RES_BURDEN, dat2$TOTAL_OFFICIAL_NEED, method = "pearson")
##
## Pearson's product-moment correlation
##
## data: dat2$RES_BURDEN and dat2$TOTAL_OFFICIAL_NEED
## t = 32.449, df = 8418, p-value < 0.00000000000000022
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.3143058 0.3522786
## sample estimates:
## cor
## 0.3334274
#cor 0.33, medium correlation
Present Residential Population Receiving Collection vs. Official Need
cor = 0.52, high correlation
cor.test(dat$PRES_RES_REC_COLLCTN, dat$TOTAL_OFFICIAL_NEED, method = "pearson")
##
## Pearson's product-moment correlation
##
## data: dat$PRES_RES_REC_COLLCTN and dat$TOTAL_OFFICIAL_NEED
## t = 57.264, df = 8418, p-value < 0.00000000000000022
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.5139253 0.5446732
## sample estimates:
## cor
## 0.5294731
#cor 0.52, high correlation
Present Residential Pop Receiving Treatment vs. Official Need
cor = 0.53, high correlation
#Need vs. Present Residential Pop Receiving Treatment
cor.test(dat$PRES_RES_REC_TRMT, dat$TOTAL_OFFICIAL_NEED, method = "pearson")
##
## Pearson's product-moment correlation
##
## data: dat$PRES_RES_REC_TRMT and dat$TOTAL_OFFICIAL_NEED
## t = 57.507, df = 8418, p-value < 0.00000000000000022
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.5155744 0.5462493
## sample estimates:
## cor
## 0.5310858
#cor 0.53, high correlation
Present Non-Residential Pop Receiving Collection vs. Official Need
cor = 0.44, moderate correlation
#Need vs. Present Non-Residential Pop Receiving Collection
cor.test(dat$PRES_N_RES_REC_COLLCTN, dat$TOTAL_OFFICIAL_NEED, method = "pearson")
##
## Pearson's product-moment correlation
##
## data: dat$PRES_N_RES_REC_COLLCTN and dat$TOTAL_OFFICIAL_NEED
## t = 45.619, df = 8418, p-value < 0.00000000000000022
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4279247 0.4621801
## sample estimates:
## cor
## 0.4452153
#cor 0.44, medium correlation
Present Non-Residential Pop Receiving Treatment vs. Official Need
cor = 0.43, moderate correlation
#Need vs. Present Non-Residential Pop Receiving Treatment
cor.test(dat$PRES_N_RES_REC_TRTM, dat$TOTAL_OFFICIAL_NEED, method = "pearson")
##
## Pearson's product-moment correlation
##
## data: dat$PRES_N_RES_REC_TRTM and dat$TOTAL_OFFICIAL_NEED
## t = 43.771, df = 8418, p-value < 0.00000000000000022
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4130163 0.4478193
## sample estimates:
## cor
## 0.4305779
#cor 0.43, medium correlation
Land Area vs. Official Need
cor = 0.003, almost no correlation
#Need vs. Land Area
cor.test(dat1$ALAND_SQMILE, dat1$TOTAL_OFFICIAL_NEED, method = "pearson")
##
## Pearson's product-moment correlation
##
## data: dat1$ALAND_SQMILE and dat1$TOTAL_OFFICIAL_NEED
## t = 0.36351, df = 8418, p-value = 0.7162
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.01739962 0.02531992
## sample estimates:
## cor
## 0.003961961
#cor 0.003, almost no correlation
Water Area vs. Official Need
cor = 0.02, almost no correlation
#Need vs. Water Area
cor.test(dat$AWATER, dat$TOTAL_OFFICIAL_NEED, method = "pearson")
##
## Pearson's product-moment correlation
##
## data: dat$AWATER and dat$TOTAL_OFFICIAL_NEED
## t = 2.2582, df = 8418, p-value = 0.02396
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.003247044 0.045941400
## sample estimates:
## cor
## 0.02460544
#cor = 0.02, almost no correlation
Variables with Significant Correlation
– Population Density
– Residential Burden
– Present Residential Pop Receiving Collection/Treatment
– Present Non-Residential Pop Receiving Collection/Treatment
Based on the correlation tests, the highest correlation is exhibited between Official Need and Present Residential Pop Receiving Treatment. We chose to use the median value because the large range of observations within the data set can disproportionately skew the mean value.
###Based on correlation tests, the highest correlation is exhibited between Need and Present Residential Pop Receiving Treatment
###Construct dummy groups to classify facilities below median values
###We chose to use the median value because the large range of observations within the data set can disproportionately skew the mean value
median_trmt <- median(dat1$PRES_RES_REC_TRMT)
##Growing or Shrinking Cities
###Based on pop density change mean value
mean_popdensity <- mean(dat1$CHANGE_POP_DENSITY)
##CSS
###Whether or not it's part of a CSS facility
##already a binary!
dat1$DUMMY_TRMT <- ifelse(dat1$PRES_RES_REC_TRMT >= 2759.5, 1, 0)
dat1$DUMMY_POPDENSITY <- ifelse(dat1$CHANGE_POP_DENSITY >= 24.43, 1, 0)
DT::datatable(
dat1,
height = 200,
options = list(scrollX = TRUE)
)
From correlation tests, variables with significant correlation which might yield statistically significant regression models:
Population Density
Residential Burden
Present Residential Pop Receiving Treatment
Regression Plot
mod.pop.density <- lm(TOTAL_OFFICIAL_NEED ~ POP_DENSITY_10, data = dat1)
summary(mod.pop.density)
##
## Call:
## lm(formula = TOTAL_OFFICIAL_NEED ~ POP_DENSITY_10, data = dat1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1722672791 -10797745 -6312025 -3289469 4552683118
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5993102.0 1371913.1 4.368 0.0000127 ***
## POP_DENSITY_10 29325.2 674.1 43.501 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 121500000 on 8418 degrees of freedom
## Multiple R-squared: 0.1835, Adjusted R-squared: 0.1834
## F-statistic: 1892 on 1 and 8418 DF, p-value: < 0.00000000000000022
#R2 = 0.184; Adjusted R2 = 0.183
##p values??
#Regression Plot
ggplot(dat1, aes(POP_DENSITY_10, TOTAL_OFFICIAL_NEED)) +
geom_point(size=3, shape=20, color="orchid") +
geom_smooth(method='lm', se=FALSE, color="grey40") +
ggtitle("Regression: Need and Population Density") +
xlab("Population Density") + ylab("Total Need")
Residual Plot
Funnel shaped plot indicates heteroscedasticity, which implies a non-constant variation in error terms, i.e. error terms increase as fitted value increases.
##Residual Plots
summary(mod.pop.density$residuals)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1722672791 -10797745 -6312025 0 -3289469 4552683118
plot(fitted(mod.pop.density), mod.pop.density$residuals,
main = "Fitted Model Against Residuals: Population Density",
xlab = "Fitted Model", ylab = "Residuals",
col = "orchid", pch = 16)
abline(0,0, col = "grey40")
Histogram of Regressed Population Density and Official Need
Residuals are mostly clustered around mean with a skew towards the left, evident from large negative values of min, 1st quartile, and median. Using log adjusted values can help cluster data closer to mean.
hist(mod.pop.density$residuals, breaks = 10,
main = "Regression Residual Spread: Population Density",
xlab = "Residuals",
col = "orchid", border = "orchid3")
#mod_pop_density_predict <- predict(mod.pop.density)
#DT::datatable(
# mod_pop_density_predict,
# height = 100,
# options = list(scrollX = TRUE)
#)
Log Official Need vs. Log Population Density
Removing infinite values from “log adjusted need” column.
Low R squared, as seen from large spread of data about the mode means that the model does not sufficiently account for variance.
##remove infinite values from "log adjusted need" column
dat2_reg <- dat2%>%
filter(LOG_NEED > -Inf)
mod.log.pop.density <- lm(LOG_NEED ~ LOG_POP_DENSITY_10, data = dat2_reg)
summary(mod.log.pop.density)
##
## Call:
## lm(formula = LOG_NEED ~ LOG_POP_DENSITY_10, data = dat2_reg)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.8064 -1.1451 -0.0249 1.0973 6.1504
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.84607 0.05523 232.58 <0.0000000000000002 ***
## LOG_POP_DENSITY_10 0.40626 0.01085 37.45 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.69 on 8261 degrees of freedom
## Multiple R-squared: 0.1451, Adjusted R-squared: 0.145
## F-statistic: 1402 on 1 and 8261 DF, p-value: < 0.00000000000000022
ggplot(dat2_reg, aes(LOG_POP_DENSITY_10, LOG_NEED)) +
geom_point(size=3, shape=20, color="orchid") +
geom_smooth(method='lm', se=FALSE, color="grey40") +
ggtitle("Regression: Need and Population Density") +
xlab("Population Density (Log Adjusted)") + ylab("Total Need (Log Adjusted)")
#R2 = 0.1451; Adjusted R2 = 0.145
##low R squared, as seen from large spread of data about the model
##means that the model does not sufficiently account for variance
Residual Plot of Log Population Density
Much better spread, residuals clustered closer to the mean in a more homoscedastic plot, closer to normal distribution. However, variance still increases as model values increase.
summary(mod.log.pop.density$residuals)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -7.80641 -1.14511 -0.02493 0.00000 1.09732 6.15044
#better, less negative skew from mean
plot(fitted(mod.log.pop.density), mod.log.pop.density$residuals,
main = "Fitted Model Against Residuals: Log Adj. Population Density",
xlab = "Fitted Model", ylab = "Residuals",
col = "orchid", pch = 16)
abline(0,0, col = "grey40")
hist(mod.log.pop.density$residuals, breaks = 10,
main = "Regression Residual Spread: Population Density",
xlab = "Residuals",
col = "orchid", border = "orchid3")
##much better spread, residuals clustered closer to the mean in a more homoscedastic plot, closer to normal distribution
##however, variance still increases as model values increase
Large range across mean values seen in summary of residual plot.
mod.res.burden <- lm(TOTAL_OFFICIAL_NEED ~ RES_BURDEN, data = dat2)
summary(mod.res.burden)
##
## Call:
## lm(formula = TOTAL_OFFICIAL_NEED ~ RES_BURDEN, data = dat2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1966254161 -12024541 -8159307 -6205059 4926111573
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7884945 1444881 5.457 0.0000000498 ***
## RES_BURDEN 97748 3012 32.449 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 126800000 on 8418 degrees of freedom
## Multiple R-squared: 0.1112, Adjusted R-squared: 0.1111
## F-statistic: 1053 on 1 and 8418 DF, p-value: < 0.00000000000000022
#R2 = 0.11; Adjusted R2 = 0.11
##low R squared
##p values??
ggplot(dat2, aes(RES_BURDEN, TOTAL_OFFICIAL_NEED)) +
geom_point(size=3, shape=20, color="orchid") +
geom_smooth(method='lm', se=FALSE, color="grey40") +
ggtitle("Regression: Need and Residential Burden") +
xlab("Residential Burden") + ylab("Total Need")
#Residual Plots
summary(mod.res.burden$residuals)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1966254161 -12024541 -8159307 0 -6205059 4926111573
##large range across mean values
Histogram of Residuals
Large range but maximum values clustered closer to mean, skewed toward the left of mean value.
hist(mod.res.burden$residuals, breaks = 10,
main = "Regression Residual Spread: Residential Burden",
xlab = "Residuals",
col = "orchid", border = "orchid3")
##large range but maximum values clustered closer to mean, skewed toward the left of mean value
Heteroscedastic plot, exhibits high non-covariance as fitted values increase.
plot(fitted(mod.res.burden), mod.res.burden$residuals,
main = "Fitted Model Against Residuals: Residential Burden",
xlab = "Fitted Model", ylab = "Residuals",
col = "orchid", pch = 16)
abline(0,0, col = "grey40")
##heteroscedastic plot, exhibits high non-covariance as fitted values increase
Range far from mean on both sides.
mod.res.trmt <- lm(TOTAL_OFFICIAL_NEED ~ PRES_RES_REC_TRMT, data = dat)
summary(mod.res.trmt)
##
## Call:
## lm(formula = TOTAL_OFFICIAL_NEED ~ PRES_RES_REC_TRMT, data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1981362007 -7599612 -5839460 -3348041 4666003598
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6109417.49 1270553.40 4.808 0.00000155 ***
## PRES_RES_REC_TRMT 670.21 11.65 57.507 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 113900000 on 8418 degrees of freedom
## Multiple R-squared: 0.2821, Adjusted R-squared: 0.282
## F-statistic: 3307 on 1 and 8418 DF, p-value: < 0.00000000000000022
#R2 = 0.282; Adjusted R2 = 0.282
ggplot(dat, aes(PRES_RES_REC_TRMT, TOTAL_OFFICIAL_NEED)) +
geom_point(size=3, shape=20, color="orchid") +
geom_smooth(method='lm', se=FALSE, color="grey40") +
ggtitle("Regression: Need and Residential Treatment") +
xlab("Residential Treatment") + ylab("Total Need")
#Residual Plots
summary(mod.res.trmt$residuals)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1981362007 -7599612 -5839460 0 -3348041 4666003598
#range far from mean on both sides
Large range but maximum values clustered closer to mean.
hist(mod.res.trmt$residuals, breaks = 10,
main = "Regression Residual Spread: Residential Treatment",
xlab = "Residuals",
col = "orchid", border = "orchid3")
##large range but maximum values clustered closer to mean
Heteroscedastic plot, high non-variance.
plot(fitted(mod.res.trmt), mod.res.trmt$residuals,
main = "Fitted Model Against Residuals: Residential Collection",
xlab = "Fitted Model", ylab = "Residuals",
col = "orchid", pch = 16)
abline(0,0, col = "grey40")
##heteroscedastic plot, high non-variance
Log Need vs. Log Residential Treatment
Removed infinite values. Model still does not explain variance very well, but is much better than non-log-adjusted variables.
#LOG Need vs. LOG Residential Treatment
##remove infinite values
dat2_reg <- dat2%>%
filter(LOG_NEED > -Inf)%>%
filter(LOG_RES_REC_TRMT > -Inf)
mod.log.res.trmt <- lm(LOG_NEED ~ LOG_RES_REC_TRMT, data = dat2_reg)
summary(mod.log.res.trmt)
##
## Call:
## lm(formula = LOG_NEED ~ LOG_RES_REC_TRMT, data = dat2_reg)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.4591 -0.8723 0.0485 0.9241 9.5216
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.610620 0.068825 139.64 <0.0000000000000002 ***
## LOG_RES_REC_TRMT 0.640192 0.008284 77.28 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.391 on 8240 degrees of freedom
## Multiple R-squared: 0.4202, Adjusted R-squared: 0.4201
## F-statistic: 5972 on 1 and 8240 DF, p-value: < 0.00000000000000022
#R2 = 0.42; Adjusted R2 = 0.42
##model still does not explain variance very well, but is much better than non-log-adjusted variables
##p-value?
Residual Plots
Less spread of range across mean.
#Residual Plots
summary(mod.log.res.trmt$residuals)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -8.45911 -0.87230 0.04848 0.00000 0.92411 9.52156
##less spread of range across mean
Tighter clustering around mean, almost exhibiting a normal distribution.
hist(mod.log.res.trmt$residuals, breaks = 10,
main = "Regression Residual Spread: Log Adj. Residential Treatment",
xlab = "Residuals",
col = "orchid", border = "orchid3")
##tighter clustering around mean, almost exhibiting a normal distribution
Error terms definitely exhibit more homoscedasticity, with more variance towards the left of mean. However, terms are still spread along a wide range from the fitted model.
plot(fitted(mod.log.res.trmt), mod.log.res.trmt$residuals,
main = "Fitted Model Against Residuals: Log Adj. Residential Treatment",
xlab = "Fitted Model", ylab = "Residuals",
col = "orchid", pch = 16)
abline(0,0, col = "grey40")
##error terms definitely exhibit more homoscedasticity, with more variance towards the left of mean
##however, terms are still spread along a wide range from the fitted model
#predict(mod.log.res.trmt)
#predict(mod.log.res.trmt, int = "c")
#predict(mod.log.res.trmt, int = "p")
Comparison of regression results through a comparative stargazer table.
– Unit increase in pop density would result in an increase of $29,325.19 in total need.
– Unit increase in residential burden would result in an increase of $97,748.53 in total need.
– Unit increase in residential population receiving collection would result in an increase of $793.925 in total need.
##compare regression results through a comparative stargazer table
stargazer(mod.pop.density, mod.res.burden, mod.res.trmt, type = "text", star.cutoffs = c(0.05, 0.01, 0.001))
##
## ==================================================================================
## Dependent variable:
## --------------------------------------------------
## TOTAL_OFFICIAL_NEED
## (1) (2) (3)
## ----------------------------------------------------------------------------------
## POP_DENSITY_10 29,325.190***
## (674.121)
##
## RES_BURDEN 97,748.530***
## (3,012.402)
##
## PRES_RES_REC_TRMT 670.210***
## (11.654)
##
## Constant 5,993,102.000*** 7,884,945.000*** 6,109,417.000***
## (1,371,913.000) (1,444,881.000) (1,270,553.000)
##
## ----------------------------------------------------------------------------------
## Observations 8,420 8,420 8,420
## R2 0.184 0.111 0.282
## Adjusted R2 0.183 0.111 0.282
## Residual Std. Error (df = 8418) 121,493,463.000 126,763,417.000 113,928,392.000
## F Statistic (df = 1; 8418) 1,892.368*** 1,052.918*** 3,307.086***
## ==================================================================================
## Note: *p<0.05; **p<0.01; ***p<0.001
##unit increase in pop density would be accompanied by an increase of 29,325.19 in total need
##unit increase in residential burden --> increase of 97,748.53 in total need
##unit increase in residential population receiving collection would result in an increase of 793.925 in total need
R-squared values are not necessarily better, but residual plots (section above) do exhibit a better fit about the mean as such, log adjustment proves useful for clustering data with high min/max values, it doesn’t necessarily affect the predictive power of the regression model.
stargazer(mod.log.pop.density, mod.log.res.trmt, type = "text", star.cutoffs = c(0.05, 0.01, 0.001))
##
## ===========================================================================
## Dependent variable:
## -------------------------------------------------------
## LOG_NEED
## (1) (2)
## ---------------------------------------------------------------------------
## LOG_POP_DENSITY_10 0.406***
## (0.011)
##
## LOG_RES_REC_TRMT 0.640***
## (0.008)
##
## Constant 12.846*** 9.611***
## (0.055) (0.069)
##
## ---------------------------------------------------------------------------
## Observations 8,263 8,242
## R2 0.145 0.420
## Adjusted R2 0.145 0.420
## Residual Std. Error 1.690 (df = 8261) 1.391 (df = 8240)
## F Statistic 1,402.444*** (df = 1; 8261) 5,972.063*** (df = 1; 8240)
## ===========================================================================
## Note: *p<0.05; **p<0.01; ***p<0.001
##R-squared values are not necessarily better, but residual plots (above) do exhibit a better fit about the mean
##as such, log adjustment proves useful for clustering data with high min/max values, it doesn't necessarily affect the predictive power of the regression model
Need vs. Treatment; CSS Presence Dummies
#Need vs. Treatment; CSS Presence Dummies
##THIS IS IMP!!!
dat_dummy_css0 <- filter(dat1, CSS == 0)
dat_dummy_css1 <- filter(dat1, CSS == 1)
cor = 0.60, high correlation
cor.test(dat_dummy_css0$PRES_RES_REC_TRMT, dat_dummy_css0$TOTAL_OFFICIAL_NEED, method = "pearson")
##
## Pearson's product-moment correlation
##
## data: dat_dummy_css0$PRES_RES_REC_TRMT and dat_dummy_css0$TOTAL_OFFICIAL_NEED
## t = 66.159, df = 7722, p-value < 0.00000000000000022
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.5870440 0.6155166
## sample estimates:
## cor
## 0.6014713
#cor = 0.60, high correlation
mod.poptrmt.css0 <- lm(TOTAL_OFFICIAL_NEED ~ POP_DENSITY_10 + PRES_RES_REC_TRMT, data = dat_dummy_css0)
summary(mod.poptrmt.css0)
##
## Call:
## lm(formula = TOTAL_OFFICIAL_NEED ~ POP_DENSITY_10 + PRES_RES_REC_TRMT,
## data = dat_dummy_css0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1315325939 -4955708 -3562491 -1004169 2895170321
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3808597.32 724162.54 5.259 0.000000148 ***
## POP_DENSITY_10 2380.55 754.21 3.156 0.0016 **
## PRES_RES_REC_TRMT 461.00 7.18 64.208 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 57110000 on 7721 degrees of freedom
## Multiple R-squared: 0.3626, Adjusted R-squared: 0.3624
## F-statistic: 2196 on 2 and 7721 DF, p-value: < 0.00000000000000022
#R2 = 0.36; Adjusted R2 = 0.36
cor = 0.54, high correlation
cor.test(dat_dummy_css1$PRES_RES_REC_TRMT, dat_dummy_css1$TOTAL_OFFICIAL_NEED, method = "pearson")
##
## Pearson's product-moment correlation
##
## data: dat_dummy_css1$PRES_RES_REC_TRMT and dat_dummy_css1$TOTAL_OFFICIAL_NEED
## t = 17.347, df = 694, p-value < 0.00000000000000022
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4959223 0.5997712
## sample estimates:
## cor
## 0.549969
#cor = 0.54, high correlation
mod.poptrmt.css1 <- lm(TOTAL_OFFICIAL_NEED ~ POP_DENSITY_10 + PRES_RES_REC_TRMT, data = dat_dummy_css1)
summary(mod.poptrmt.css1)
##
## Call:
## lm(formula = TOTAL_OFFICIAL_NEED ~ POP_DENSITY_10 + PRES_RES_REC_TRMT,
## data = dat_dummy_css1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1355805129 -42970102 -18453362 -9399433 4117860020
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15582255.13 12531599.67 1.243 0.214
## POP_DENSITY_10 20477.14 2096.12 9.769 <0.0000000000000002 ***
## PRES_RES_REC_TRMT 833.24 63.44 13.134 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 306700000 on 693 degrees of freedom
## Multiple R-squared: 0.3869, Adjusted R-squared: 0.3851
## F-statistic: 218.7 on 2 and 693 DF, p-value: < 0.00000000000000022
#R2 = 0.39; Adjusted R2 = 0.39
For facilities without CSS:
– unit increase in population density results in a $2,380 increase in total need
– unit increase in residential population receiving treatment results in a $461 increase in total need
Comparatively, for facilities with CSS:
– unit increase in population density results in a $20,477 increase in total need
– unit increase in residential population receiving treatment results in a $833 increase in total need
stargazer(mod.poptrmt.css0, mod.poptrmt.css1, type = "text", star.cutoffs = c(0.05, 0.01, 0.001))
##
## ==========================================================================
## Dependent variable:
## ------------------------------------------------------
## TOTAL_OFFICIAL_NEED
## (1) (2)
## --------------------------------------------------------------------------
## POP_DENSITY_10 2,380.555** 20,477.130***
## (754.205) (2,096.116)
##
## PRES_RES_REC_TRMT 461.001*** 833.237***
## (7.180) (63.441)
##
## Constant 3,808,597.000*** 15,582,255.000
## (724,162.500) (12,531,600.000)
##
## --------------------------------------------------------------------------
## Observations 7,724 696
## R2 0.363 0.387
## Adjusted R2 0.362 0.385
## Residual Std. Error 57,107,195.000 (df = 7721) 306,726,863.000 (df = 693)
## F Statistic 2,196.043*** (df = 2; 7721) 218.659*** (df = 2; 693)
## ==========================================================================
## Note: *p<0.05; **p<0.01; ***p<0.001
##for facilities without CSS:
###unit increase in population density results in a $2,380 increase in total need
###unit increase in residential population receiving treatment results in a $461 increase in total need
##comparatively, for facilities with CSS:
#####unit increase in population density results in a $20,477 increase in total need
###unit increase in residential population receiving treatment results in a $833 increase in total need
Need vs. Treatment: Mean Treatment Threshold Dummies
Too low, no point in looking much further.
dat_dummy_trmt0 <- filter(dat1, DUMMY_TRMT == 0)
dat_dummy_trmt1 <- filter(dat1, DUMMY_TRMT == 1)%>%
mutate(RES_BURDEN = TOTAL_OFFICIAL_NEED/POP10)
mod.income.dummytrmt0 <- lm(TOTAL_OFFICIAL_NEED ~ MEDINC09 + PRES_RES_REC_TRMT, data = dat_dummy_trmt0)
summary(mod.income.dummytrmt0)
##
## Call:
## lm(formula = TOTAL_OFFICIAL_NEED ~ MEDINC09 + PRES_RES_REC_TRMT,
## data = dat_dummy_trmt0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7358465 -2095107 -1148444 85561 268672717
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2056471.918 559390.560 -3.676 0.00024 ***
## MEDINC09 65.608 9.521 6.891 0.00000000000636 ***
## PRES_RES_REC_TRMT 1162.541 170.618 6.814 0.00000000001086 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8117000 on 4207 degrees of freedom
## Multiple R-squared: 0.02308, Adjusted R-squared: 0.02261
## F-statistic: 49.69 on 2 and 4207 DF, p-value: < 0.00000000000000022
#R2 = 0.023; Adjusted R2 = 0.023
##too low, no point looking much further
Create a multiple linear regression with variables exhibiting high correlations. Controlled for dummy variable of “residential population receiving treatment > median of res pop receiving treatment”
The values are reasonable, let’s plot and look at residuals.
##create a multiple linear regression with variables exhibiting high correlations
##controlled for dummy variable of "residential population receiving treatment > of res pop receiving treatment"
mod.dummytrmt1 <- lm(TOTAL_OFFICIAL_NEED ~ POP_DENSITY_10 + RES_BURDEN + PRES_RES_REC_TRMT, data = dat_dummy_trmt1)
summary(mod.dummytrmt1)
##
## Call:
## lm(formula = TOTAL_OFFICIAL_NEED ~ POP_DENSITY_10 + RES_BURDEN +
## PRES_RES_REC_TRMT, data = dat_dummy_trmt1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1830743422 -15644256 4978917 12951099 4141397389
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -19571664.88 2408965.46 -8.125 0.000000000000000585 ***
## POP_DENSITY_10 21590.08 844.40 25.569 < 0.0000000000000002 ***
## RES_BURDEN 92444.79 3606.06 25.636 < 0.0000000000000002 ***
## PRES_RES_REC_TRMT 524.15 15.28 34.307 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 140300000 on 4206 degrees of freedom
## Multiple R-squared: 0.4439, Adjusted R-squared: 0.4435
## F-statistic: 1119 on 3 and 4206 DF, p-value: < 0.00000000000000022
#R2 = 0.44; Adjusted R2 = 0.44
##reasonable values, we can plot and look at residuals
Higher spread from the mean.
ggplot(dat_dummy_trmt1, aes(PRES_RES_REC_TRMT + RES_BURDEN + POP_DENSITY_10, TOTAL_OFFICIAL_NEED)) +
geom_point(size=3, shape=20, color="orchid") +
geom_smooth(method='lm', se=FALSE, color="grey40") +
ggtitle("Multiple Regression with Dummies") +
xlab("Res Trmt + Pop Density + Res Burden") + ylab("Total Need")
summary(mod.dummytrmt1$residuals)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1830743422 -15644256 4978917 0 12951099 4141397389
##high spread from mean
Still a heteroscedastic plot. Log transforming would help.
plot(fitted(mod.dummytrmt1), mod.dummytrmt1$residuals,
main = "Fitted Model Against Residuals: Residential Trmt with Dummies",
xlab = "Fitted Model", ylab = "Residuals",
col = "orchid", pch = 16)
abline(0,0, col = "grey40")
##still a heteroscedastic plot
##log transforming would help, but we're going to sleep instead
Model 1: Need vs Residential Treatment vs Pop Density
Test of Correlation
cor = 0.29, medium correlation
##Cor test of Res TRMT vs Pop Density
cor.test(dat2$PRES_RES_REC_TRMT, dat2$POP_DENSITY_10)
##
## Pearson's product-moment correlation
##
## data: dat2$PRES_RES_REC_TRMT and dat2$POP_DENSITY_10
## t = 27.398, df = 8418, p-value < 0.00000000000000022
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2664011 0.3056251
## sample estimates:
## cor
## 0.286133
##cor= 0.29 medium correlation
Summary of Model 1
mod1 <- lm(TOTAL_OFFICIAL_NEED ~ PRES_RES_REC_TRMT + POP_DENSITY_10, data = dat2)
summary(mod1)
##
## Call:
## lm(formula = TOTAL_OFFICIAL_NEED ~ PRES_RES_REC_TRMT + POP_DENSITY_10,
## data = dat2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1674822946 -5293001 1168371 3285259 4330928960
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2359258.97 1221586.22 -1.931 0.0535 .
## PRES_RES_REC_TRMT 561.48 11.44 49.095 <0.0000000000000002 ***
## POP_DENSITY_10 20610.86 620.34 33.225 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 107100000 on 8417 degrees of freedom
## Multiple R-squared: 0.3653, Adjusted R-squared: 0.3651
## F-statistic: 2422 on 2 and 8417 DF, p-value: < 0.00000000000000022
Test of Colinearity
Testing the variation inflation factors of all predictors. Testing VIF and its square root.
Low mulitcolinearity, variables can be used in same regression model
##TEST OF COLINEARITY
vif(mod1)
## PRES_RES_REC_TRMT POP_DENSITY_10
## 1.089173 1.089173
sqrt(vif(mod1))
## PRES_RES_REC_TRMT POP_DENSITY_10
## 1.043634 1.043634
##low multicolinearity; can be used in same regression model
Model 2: Need vs Residential Treatment vs Pop Density vs Residential Burden
Test of Correlation
cor = 0.10, low correlation
cor.test(dat2$PRES_RES_REC_TRMT, dat2$RES_BURDEN)
##
## Pearson's product-moment correlation
##
## data: dat2$PRES_RES_REC_TRMT and dat2$RES_BURDEN
## t = 9.5363, df = 8418, p-value < 0.00000000000000022
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.0822030 0.1244668
## sample estimates:
## cor
## 0.1033816
##cor= 0.10 low correlation
Summary of Model 2
mod2 <- lm(TOTAL_OFFICIAL_NEED ~ PRES_RES_REC_TRMT + POP_DENSITY_10 + RES_BURDEN, data = dat2)
summary(mod2)
##
## Call:
## lm(formula = TOTAL_OFFICIAL_NEED ~ PRES_RES_REC_TRMT + POP_DENSITY_10 +
## RES_BURDEN, data = dat2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1594708443 -5904970 5950114 9810337 4182802531
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -12897449.71 1186726.35 -10.87 <0.0000000000000002 ***
## PRES_RES_REC_TRMT 526.33 10.77 48.85 <0.0000000000000002 ***
## POP_DENSITY_10 20391.56 581.75 35.05 <0.0000000000000002 ***
## RES_BURDEN 81601.80 2400.26 34.00 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 100500000 on 8416 degrees of freedom
## Multiple R-squared: 0.4419, Adjusted R-squared: 0.4417
## F-statistic: 2222 on 3 and 8416 DF, p-value: < 0.00000000000000022
Test of Colinearity
Testing the variation inflation factors of all predictors. Testing VIF and its square root.
Low mulitcolinearity, variables can be used in same regression model
vif(mod2)
## PRES_RES_REC_TRMT POP_DENSITY_10 RES_BURDEN
## 1.099300 1.089307 1.010928
sqrt(vif(mod2))
## PRES_RES_REC_TRMT POP_DENSITY_10 RES_BURDEN
## 1.048475 1.043699 1.005449
##low multicolinearity for all; can be used in same regression model
Model 3: Need vs Residential Treatment vs Pop Density vs Residential Burden vs Median Income
Test of Correlation
cor = 0.01, very, very low correlation
##Cor test of Res TRMT vs Median Income
cor.test(dat2$PRES_RES_REC_TRMT, dat2$MEDINC09)
##
## Pearson's product-moment correlation
##
## data: dat2$PRES_RES_REC_TRMT and dat2$MEDINC09
## t = 8.8694, df = 8418, p-value < 0.00000000000000022
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.07501528 0.11734014
## sample estimates:
## cor
## 0.09622121
##cor= 0.01 very low correlation
Summary of Model 3
mod3 <- lm(TOTAL_OFFICIAL_NEED ~ PRES_RES_REC_TRMT + POP_DENSITY_10 + RES_BURDEN + MEDINC09, data = dat2)
summary(mod3)
##
## Call:
## lm(formula = TOTAL_OFFICIAL_NEED ~ PRES_RES_REC_TRMT + POP_DENSITY_10 +
## RES_BURDEN + MEDINC09, data = dat2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1569618867 -5853994 4869682 9189983 4176075399
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2677629.65 4469571.45 0.599 0.549136
## PRES_RES_REC_TRMT 528.27 10.78 49.004 < 0.0000000000000002 ***
## POP_DENSITY_10 20772.23 590.80 35.159 < 0.0000000000000002 ***
## RES_BURDEN 80933.54 2405.66 33.643 < 0.0000000000000002 ***
## MEDINC09 -266.25 73.67 -3.614 0.000303 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 100400000 on 8415 degrees of freedom
## Multiple R-squared: 0.4428, Adjusted R-squared: 0.4425
## F-statistic: 1672 on 4 and 8415 DF, p-value: < 0.00000000000000022
Test of Colinearity
Testing the variation inflation factors of all predictors. Testing VIF and its square root.
Low mulitcolinearity, variables can be used in same regression model
vif(mod3)
## PRES_RES_REC_TRMT POP_DENSITY_10 RES_BURDEN MEDINC09
## 1.102048 1.125063 1.016935 1.048167
sqrt(vif(mod3))
## PRES_RES_REC_TRMT POP_DENSITY_10 RES_BURDEN MEDINC09
## 1.049785 1.060690 1.008432 1.023800
##low multicolinearity for all; can be used in same regression model
Model 4: Need vs Residential Treatment vs Pop Density vs Residential Burden vs Median Income vs Total Population
Test of Correlation
cor = 0.40, moderate correlation
##Cor test of Res TRMT vs Population
cor.test(dat2$PRES_RES_REC_TRMT, dat2$POP10)
##
## Pearson's product-moment correlation
##
## data: dat2$PRES_RES_REC_TRMT and dat2$POP10
## t = 39.203, df = 8418, p-value < 0.00000000000000022
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.3747010 0.4108284
## sample estimates:
## cor
## 0.3929163
##cor= 0.40 moderate correlation
Summary of Model 4
mod4 <- lm(TOTAL_OFFICIAL_NEED ~ PRES_RES_REC_TRMT + POP_DENSITY_10 + RES_BURDEN + MEDINC09 + POP10, data = dat2)
summary(mod4)
##
## Call:
## lm(formula = TOTAL_OFFICIAL_NEED ~ PRES_RES_REC_TRMT + POP_DENSITY_10 +
## RES_BURDEN + MEDINC09 + POP10, data = dat2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1520341109 -5929113 3512562 6920069 4158589432
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1732427.407 4464463.141 -0.388 0.6980
## PRES_RES_REC_TRMT 566.556 11.368 49.839 <0.0000000000000002 ***
## POP_DENSITY_10 22263.887 605.606 36.763 <0.0000000000000002 ***
## RES_BURDEN 78279.176 2405.790 32.538 <0.0000000000000002 ***
## MEDINC09 -122.785 74.597 -1.646 0.0998 .
## POP10 -17.634 1.747 -10.092 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 99790000 on 8414 degrees of freedom
## Multiple R-squared: 0.4495, Adjusted R-squared: 0.4491
## F-statistic: 1374 on 5 and 8414 DF, p-value: < 0.00000000000000022
Test of Colinearity
Testing the variation inflation factors of all predictors. Testing VIF and its square root.
Low mulitcolinearity, variables can be used in same regression model
vif(mod4)
## PRES_RES_REC_TRMT POP_DENSITY_10 RES_BURDEN MEDINC09
## 1.240147 1.196321 1.029236 1.087666
## POP10
## 1.344847
sqrt(vif(mod4))
## PRES_RES_REC_TRMT POP_DENSITY_10 RES_BURDEN MEDINC09
## 1.113619 1.093765 1.014513 1.042912
## POP10
## 1.159676
##low multicolinearity for all; can be used in same regression model
Model 5: Need vs Residential Treatment vs Pop Density vs Residential Burden vs Median Income vs Total Population vs Land Area
Test of Correlation
cor = 0.08, very low correlation
##Cor test of Res TRMT vs Land Area
cor.test(dat2$PRES_RES_REC_TRMT, dat2$ALAND_SQMILE)
##
## Pearson's product-moment correlation
##
## data: dat2$PRES_RES_REC_TRMT and dat2$ALAND_SQMILE
## t = 7.4768, df = 8418, p-value = 0.00000000000008383
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.05996557 0.10240409
## sample estimates:
## cor
## 0.08122164
##cor= 0.08 very low correlation
Summary of Model 5
mod5 <- lm(TOTAL_OFFICIAL_NEED ~ PRES_RES_REC_TRMT + POP_DENSITY_10 + RES_BURDEN + MEDINC09 + POP10 + ALAND_SQMILE, data = dat2)
summary(mod5)
##
## Call:
## lm(formula = TOTAL_OFFICIAL_NEED ~ PRES_RES_REC_TRMT + POP_DENSITY_10 +
## RES_BURDEN + MEDINC09 + POP10 + ALAND_SQMILE, data = dat2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1520159726 -5904094 3678604 6977894 4158122802
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4297714.345 4603422.802 -0.934 0.3505
## PRES_RES_REC_TRMT 565.735 11.371 49.754 <0.0000000000000002 ***
## POP_DENSITY_10 22483.124 613.068 36.673 <0.0000000000000002 ***
## RES_BURDEN 78312.457 2405.237 32.559 <0.0000000000000002 ***
## MEDINC09 -107.052 74.898 -1.429 0.1530
## POP10 -18.584 1.796 -10.347 <0.0000000000000002 ***
## ALAND_SQMILE 1821.022 799.975 2.276 0.0229 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 99760000 on 8413 degrees of freedom
## Multiple R-squared: 0.4498, Adjusted R-squared: 0.4494
## F-statistic: 1146 on 6 and 8413 DF, p-value: < 0.00000000000000022
Test of Colinearity
Testing the variation inflation factors of all predictors. Testing VIF and its square root.
Low mulitcolinearity, variables can be used in same regression model
vif(mod5)
## PRES_RES_REC_TRMT POP_DENSITY_10 RES_BURDEN MEDINC09
## 1.241395 1.226593 1.029274 1.097006
## POP10 ALAND_SQMILE
## 1.421547 1.080440
sqrt(vif(mod5))
## PRES_RES_REC_TRMT POP_DENSITY_10 RES_BURDEN MEDINC09
## 1.114179 1.107516 1.014532 1.047381
## POP10 ALAND_SQMILE
## 1.192286 1.039442
##low multicolinearity for all; can be used in same regression model
Model 6: Need vs Residential Treatment vs Pop Density vs Residential Burden vs Median Income vs Total Population vs Land Area vs Percent of White Population
Test of Correlation
cor = 0.20, low correlation
##Cor test of Res TRMT vs PctWhite
cor.test(dat2$PRES_RES_REC_TRMT, dat2$PCTWHITE10)
##
## Pearson's product-moment correlation
##
## data: dat2$PRES_RES_REC_TRMT and dat2$PCTWHITE10
## t = -18.993, df = 8418, p-value < 0.00000000000000022
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2231039 -0.1821384
## sample estimates:
## cor
## -0.2027099
##cor = -0.20 low correlation
Summary of Model 6
mod6 <- lm(TOTAL_OFFICIAL_NEED ~ PRES_RES_REC_TRMT + POP_DENSITY_10 + RES_BURDEN + MEDINC09 + POP10 + ALAND_SQMILE + PCTWHITE10, data = dat2)
summary(mod6)
##
## Call:
## lm(formula = TOTAL_OFFICIAL_NEED ~ PRES_RES_REC_TRMT + POP_DENSITY_10 +
## RES_BURDEN + MEDINC09 + POP10 + ALAND_SQMILE + PCTWHITE10,
## data = dat2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1521189881 -5899219 3607320 7382288 4158100267
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2154104.225 7650604.689 0.282 0.7783
## PRES_RES_REC_TRMT 565.143 11.384 49.642 <0.0000000000000002 ***
## POP_DENSITY_10 22368.754 622.560 35.930 <0.0000000000000002 ***
## RES_BURDEN 78350.424 2405.489 32.572 <0.0000000000000002 ***
## MEDINC09 -96.809 75.524 -1.282 0.1999
## POP10 -19.076 1.855 -10.281 <0.0000000000000002 ***
## ALAND_SQMILE 1787.892 800.584 2.233 0.0256 *
## PCTWHITE10 -82488.063 78126.659 -1.056 0.2911
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 99760000 on 8412 degrees of freedom
## Multiple R-squared: 0.4499, Adjusted R-squared: 0.4494
## F-statistic: 982.7 on 7 and 8412 DF, p-value: < 0.00000000000000022
Test of Colinearity
Testing the variation inflation factors of all predictors. Testing VIF and its square root.
Low mulitcolinearity, variables can be used in same regression model
vif(mod6)
## PRES_RES_REC_TRMT POP_DENSITY_10 RES_BURDEN MEDINC09
## 1.244417 1.264887 1.029504 1.115413
## POP10 ALAND_SQMILE PCTWHITE10
## 1.517256 1.082103 1.190893
sqrt(vif(mod6))
## PRES_RES_REC_TRMT POP_DENSITY_10 RES_BURDEN MEDINC09
## 1.115535 1.124672 1.014645 1.056131
## POP10 ALAND_SQMILE PCTWHITE10
## 1.231769 1.040242 1.091280
##low multicolinearity for all; can be used in same regression model
Models 2,3,4 are statistically significant improvements to model 1; models 5 and 6 do not improve the model.
anova(mod1, mod2, mod3, mod4, mod5, mod6)
## Analysis of Variance Table
##
## Model 1: TOTAL_OFFICIAL_NEED ~ PRES_RES_REC_TRMT + POP_DENSITY_10
## Model 2: TOTAL_OFFICIAL_NEED ~ PRES_RES_REC_TRMT + POP_DENSITY_10 + RES_BURDEN
## Model 3: TOTAL_OFFICIAL_NEED ~ PRES_RES_REC_TRMT + POP_DENSITY_10 + RES_BURDEN +
## MEDINC09
## Model 4: TOTAL_OFFICIAL_NEED ~ PRES_RES_REC_TRMT + POP_DENSITY_10 + RES_BURDEN +
## MEDINC09 + POP10
## Model 5: TOTAL_OFFICIAL_NEED ~ PRES_RES_REC_TRMT + POP_DENSITY_10 + RES_BURDEN +
## MEDINC09 + POP10 + ALAND_SQMILE
## Model 6: TOTAL_OFFICIAL_NEED ~ PRES_RES_REC_TRMT + POP_DENSITY_10 + RES_BURDEN +
## MEDINC09 + POP10 + ALAND_SQMILE + PCTWHITE10
## Res.Df RSS Df Sum of Sq F
## 1 8417 96594435962878001152
## 2 8416 84930596049859624960 1 11663839913018376192 1171.9278
## 3 8415 84798962329488867328 1 131633720370757632 13.2259
## 4 8414 83784738288426221568 1 1014224041062645760 101.9045
## 5 8413 83733165028634722304 1 51573259791499264 5.1818
## 6 8412 83722070102993846272 1 11094925640876032 1.1148
## Pr(>F)
## 1
## 2 < 0.00000000000000022 ***
## 3 0.0002778 ***
## 4 < 0.00000000000000022 ***
## 5 0.0228492 *
## 6 0.2910786
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Models 2,3,4 are statistically significant improvements to model 1; model 5 does not improve the model.
anova(mod1, mod2, mod3, mod4, mod5)
## Analysis of Variance Table
##
## Model 1: TOTAL_OFFICIAL_NEED ~ PRES_RES_REC_TRMT + POP_DENSITY_10
## Model 2: TOTAL_OFFICIAL_NEED ~ PRES_RES_REC_TRMT + POP_DENSITY_10 + RES_BURDEN
## Model 3: TOTAL_OFFICIAL_NEED ~ PRES_RES_REC_TRMT + POP_DENSITY_10 + RES_BURDEN +
## MEDINC09
## Model 4: TOTAL_OFFICIAL_NEED ~ PRES_RES_REC_TRMT + POP_DENSITY_10 + RES_BURDEN +
## MEDINC09 + POP10
## Model 5: TOTAL_OFFICIAL_NEED ~ PRES_RES_REC_TRMT + POP_DENSITY_10 + RES_BURDEN +
## MEDINC09 + POP10 + ALAND_SQMILE
## Res.Df RSS Df Sum of Sq F
## 1 8417 96594435962878001152
## 2 8416 84930596049859624960 1 11663839913018376192 1171.9118
## 3 8415 84798962329488867328 1 131633720370757632 13.2258
## 4 8414 83784738288426221568 1 1014224041062645760 101.9031
## 5 8413 83733165028634722304 1 51573259791499264 5.1818
## Pr(>F)
## 1
## 2 < 0.00000000000000022 ***
## 3 0.0002778 ***
## 4 < 0.00000000000000022 ***
## 5 0.0228501 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##mods 2,3,4 are statistically significant improvements to the mod1; mod5 does not improve
Models 3 & 4 are statistically significant improvements to model 2.
anova(mod2, mod3, mod4)
## Analysis of Variance Table
##
## Model 1: TOTAL_OFFICIAL_NEED ~ PRES_RES_REC_TRMT + POP_DENSITY_10 + RES_BURDEN
## Model 2: TOTAL_OFFICIAL_NEED ~ PRES_RES_REC_TRMT + POP_DENSITY_10 + RES_BURDEN +
## MEDINC09
## Model 3: TOTAL_OFFICIAL_NEED ~ PRES_RES_REC_TRMT + POP_DENSITY_10 + RES_BURDEN +
## MEDINC09 + POP10
## Res.Df RSS Df Sum of Sq F
## 1 8416 84930596049859624960
## 2 8415 84798962329488867328 1 131633720370757632 13.219
## 3 8414 83784738288426221568 1 1014224041062645760 101.853
## Pr(>F)
## 1
## 2 0.0002788 ***
## 3 < 0.00000000000000022 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Backward Selection
Run backward selection for model 6.
Variable percentage of white population drops.
AIC = 310172.3
step(mod6, direction = "backward")
## Start: AIC=310172.3
## TOTAL_OFFICIAL_NEED ~ PRES_RES_REC_TRMT + POP_DENSITY_10 + RES_BURDEN +
## MEDINC09 + POP10 + ALAND_SQMILE + PCTWHITE10
##
## Df Sum of Sq RSS AIC
## - PCTWHITE10 1 11094925640876032 83733165028634722304 310171
## - MEDINC09 1 16353266232967168 83738423369226813440 310172
## <none> 83722070102993846272 310172
## - ALAND_SQMILE 1 49637430109929472 83771707533103775744 310175
## - POP10 1 1051974830555332608 84774044933549178880 310275
## - RES_BURDEN 1 10558850351510257664 94280920454504103936 311170
## - POP_DENSITY_10 1 12848774705816813568 96570844808810659840 311372
## - PRES_RES_REC_TRMT 1 24526603207986151424 108248673310979997696 312334
##
## Step: AIC=310171.5
## TOTAL_OFFICIAL_NEED ~ PRES_RES_REC_TRMT + POP_DENSITY_10 + RES_BURDEN +
## MEDINC09 + POP10 + ALAND_SQMILE
##
## Df Sum of Sq RSS AIC
## <none> 83733165028634722304 310171
## - MEDINC09 1 20332521454518272 83753497550089240576 310171
## - ALAND_SQMILE 1 51573259791499264 83784738288426221568 310175
## - POP10 1 1065627219167199232 84798792247801921536 310276
## - RES_BURDEN 1 10550977588116946944 94284142616751669248 311169
## - POP_DENSITY_10 1 13385746909705256960 97118911938339979264 311418
## - PRES_RES_REC_TRMT 1 24637882449163943936 108371047477798666240 312341
##
## Call:
## lm(formula = TOTAL_OFFICIAL_NEED ~ PRES_RES_REC_TRMT + POP_DENSITY_10 +
## RES_BURDEN + MEDINC09 + POP10 + ALAND_SQMILE, data = dat2)
##
## Coefficients:
## (Intercept) PRES_RES_REC_TRMT POP_DENSITY_10 RES_BURDEN
## -4297714.34 565.74 22483.12 78312.46
## MEDINC09 POP10 ALAND_SQMILE
## -107.05 -18.58 1821.02
##mod 6 drops pctwhite10
Run backward for model 5.
No variables drop. Model 5 is preferred based on backward selection. Lower AIC, higher quality model.
AIC = 310171.5
step(mod5, direction="backward") ##lower AIC indicates the preferred model
## Start: AIC=310171.5
## TOTAL_OFFICIAL_NEED ~ PRES_RES_REC_TRMT + POP_DENSITY_10 + RES_BURDEN +
## MEDINC09 + POP10 + ALAND_SQMILE
##
## Df Sum of Sq RSS AIC
## <none> 83733165028634722304 310171
## - MEDINC09 1 20332521454518272 83753497550089240576 310171
## - ALAND_SQMILE 1 51573259791499264 83784738288426221568 310175
## - POP10 1 1065627219167199232 84798792247801921536 310276
## - RES_BURDEN 1 10550977588116946944 94284142616751669248 311169
## - POP_DENSITY_10 1 13385746909705256960 97118911938339979264 311418
## - PRES_RES_REC_TRMT 1 24637882449163943936 108371047477798666240 312341
##
## Call:
## lm(formula = TOTAL_OFFICIAL_NEED ~ PRES_RES_REC_TRMT + POP_DENSITY_10 +
## RES_BURDEN + MEDINC09 + POP10 + ALAND_SQMILE, data = dat2)
##
## Coefficients:
## (Intercept) PRES_RES_REC_TRMT POP_DENSITY_10 RES_BURDEN
## -4297714.34 565.74 22483.12 78312.46
## MEDINC09 POP10 ALAND_SQMILE
## -107.05 -18.58 1821.02
##mod5 is preferred model?? Don't drop any variables??
Forward Selection
Setting up the forward selection
fullmod <- lm(TOTAL_OFFICIAL_NEED ~ PRES_RES_REC_TRMT + POP_DENSITY_10 + RES_BURDEN + MEDINC09
+ POP10 + ALAND_SQMILE + PCTWHITE10, data = dat2)
intonly <- lm(TOTAL_OFFICIAL_NEED ~1, data = dat2)
step(intonly, scope=list(lower=intonly, upper=fullmod), direction="forward")
## Start: AIC=315190.2
## TOTAL_OFFICIAL_NEED ~ 1
##
## Df Sum of Sq RSS AIC
## + PRES_RES_REC_TRMT 1 42924911711518359552 109262933935920676864 312402
## + POP_DENSITY_10 1 27932597247564005376 124255248399875031040 313485
## + RES_BURDEN 1 16919306870371106816 135268538777067929600 314200
## + POP10 1 4686893417223847936 147500952230215188480 314929
## + PCTWHITE10 1 3431418268165898240 148756427379273138176 315000
## + MEDINC09 1 417075467969036288 151770770179470000128 315169
## <none> 152187845647439036416 315190
## + ALAND_SQMILE 1 2388913005264896 152185456734433771520 315192
##
## Step: AIC=312402.2
## TOTAL_OFFICIAL_NEED ~ PRES_RES_REC_TRMT
##
## Df Sum of Sq RSS AIC
## + POP_DENSITY_10 1 12668497973042675712 96594435962878001152 311367
## + RES_BURDEN 1 11933516856222908416 97329417079697768448 311430
## + PCTWHITE10 1 286683731388301312 108976250204532375552 312382
## + ALAND_SQMILE 1 235095190528573440 109027838745392103424 312386
## + POP10 1 198160531418267648 109064773404502409216 312389
## <none> 109262933935920676864 312402
## + MEDINC09 1 239385716244480 109262694550204432384 312404
##
## Step: AIC=311366.5
## TOTAL_OFFICIAL_NEED ~ PRES_RES_REC_TRMT + POP_DENSITY_10
##
## Df Sum of Sq RSS AIC
## + RES_BURDEN 1 11663839913018376192 84930596049859624960 310285
## + POP10 1 2151275306185752576 94443160656692248576 311179
## + MEDINC09 1 389668980873494528 96204766982004506624 311335
## + PCTWHITE10 1 66375998289903616 96528059964588097536 311363
## <none> 96594435962878001152 311367
## + ALAND_SQMILE 1 8378328201969664 96586057634676031488 311368
##
## Step: AIC=310285
## TOTAL_OFFICIAL_NEED ~ PRES_RES_REC_TRMT + POP_DENSITY_10 + RES_BURDEN
##
## Df Sum of Sq RSS AIC
## + POP10 1 1118880214647357440 83811715835212267520 310175
## + MEDINC09 1 131633720370757632 84798962329488867328 310274
## <none> 84930596049859624960 310285
## + PCTWHITE10 1 16526324248657920 84914069725610967040 310285
## + ALAND_SQMILE 1 15269208604672 84930580780651020288 310287
##
## Step: AIC=310175.3
## TOTAL_OFFICIAL_NEED ~ PRES_RES_REC_TRMT + POP_DENSITY_10 + RES_BURDEN +
## POP10
##
## Df Sum of Sq RSS AIC
## + ALAND_SQMILE 1 58218285123026944 83753497550089240576 310171
## + MEDINC09 1 26977546786045952 83784738288426221568 310175
## <none> 83811715835212267520 310175
## + PCTWHITE10 1 18198042573750272 83793517792638517248 310176
##
## Step: AIC=310171.5
## TOTAL_OFFICIAL_NEED ~ PRES_RES_REC_TRMT + POP_DENSITY_10 + RES_BURDEN +
## POP10 + ALAND_SQMILE
##
## Df Sum of Sq RSS AIC
## + MEDINC09 1 20332521454518272 83733165028634722304 310171
## <none> 83753497550089240576 310171
## + PCTWHITE10 1 15074180862427136 83738423369226813440 310172
##
## Step: AIC=310171.5
## TOTAL_OFFICIAL_NEED ~ PRES_RES_REC_TRMT + POP_DENSITY_10 + RES_BURDEN +
## POP10 + ALAND_SQMILE + MEDINC09
##
## Df Sum of Sq RSS AIC
## <none> 83733165028634722304 310171
## + PCTWHITE10 1 11094925640876032 83722070102993846272 310172
##
## Call:
## lm(formula = TOTAL_OFFICIAL_NEED ~ PRES_RES_REC_TRMT + POP_DENSITY_10 +
## RES_BURDEN + POP10 + ALAND_SQMILE + MEDINC09, data = dat2)
##
## Coefficients:
## (Intercept) PRES_RES_REC_TRMT POP_DENSITY_10 RES_BURDEN
## -4297714.34 565.74 22483.12 78312.46
## POP10 ALAND_SQMILE MEDINC09
## -18.58 1821.02 -107.05
#Start: TOTAL_OFFICIAL_NEED ~ 1
##AIC = 315190.2
#Step: ~ PRES_RES_REC_TRMT
##AIC = 312402.2
#Step: PRES_RES_REC_TRMT + POP_DENSITY_10
##AIC = 311366.5
#Step: TOTAL_OFFICIAL_NEED ~ PRES_RES_REC_TRMT + POP_DENSITY_10 + RES_BURDEN
##AIC = 310285
#Step: TOTAL_OFFICIAL_NEED ~ PRES_RES_REC_TRMT + POP_DENSITY_10 + RES_BURDEN + POP10
##AIC = 310175.3
#Step: TOTAL_OFFICIAL_NEED ~ PRES_RES_REC_TRMT + POP_DENSITY_10 + RES_BURDEN + POP10 + ALAND_SQMILE
##AIC = 310171.5
#Step: TOTAL_OFFICIAL_NEED ~ PRES_RES_REC_TRMT + POP_DENSITY_10 + RES_BURDEN + POP10 + ALAND_SQMILE + MEDINC09
##AIC = 310171.5
#Step: TOTAL_OFFICIAL_NEED ~ PRES_RES_REC_TRMT + POP_DENSITY_10 + RES_BURDEN + POP10 + ALAND_SQMILE + MEDINC09 + PCTWHITE10
##AIC = 310172
The AIC stops stops reducing once ALAND_SQMILE and MEDINC09 are added to the regression.
Also noted is a marginal increase of 0.5 when adding PCTWHITE10, implying a model that is not relatively as lean as the previous few.
This points us towards constructing a new model that excludes MEDINC09 but keeps ALAND_SQMILE.
Test Model
Otherwise know as model 4.1.
VIF values below 2 affirm low multicolinearity.
mod4.1 <- lm(TOTAL_OFFICIAL_NEED ~ PRES_RES_REC_TRMT + POP_DENSITY_10 + RES_BURDEN + POP10 + ALAND_SQMILE, data = dat2)
sqrt(vif(mod4.1))
## PRES_RES_REC_TRMT POP_DENSITY_10 RES_BURDEN POP10
## 1.114064 1.101199 1.013024 1.166715
## ALAND_SQMILE
## 1.035008
ANOVA testing determines model 4.1 brings statistical significance to the model.
anova(mod1, mod2, mod3, mod4.1, mod5, mod6)
## Analysis of Variance Table
##
## Model 1: TOTAL_OFFICIAL_NEED ~ PRES_RES_REC_TRMT + POP_DENSITY_10
## Model 2: TOTAL_OFFICIAL_NEED ~ PRES_RES_REC_TRMT + POP_DENSITY_10 + RES_BURDEN
## Model 3: TOTAL_OFFICIAL_NEED ~ PRES_RES_REC_TRMT + POP_DENSITY_10 + RES_BURDEN +
## MEDINC09
## Model 4: TOTAL_OFFICIAL_NEED ~ PRES_RES_REC_TRMT + POP_DENSITY_10 + RES_BURDEN +
## POP10 + ALAND_SQMILE
## Model 5: TOTAL_OFFICIAL_NEED ~ PRES_RES_REC_TRMT + POP_DENSITY_10 + RES_BURDEN +
## MEDINC09 + POP10 + ALAND_SQMILE
## Model 6: TOTAL_OFFICIAL_NEED ~ PRES_RES_REC_TRMT + POP_DENSITY_10 + RES_BURDEN +
## MEDINC09 + POP10 + ALAND_SQMILE + PCTWHITE10
## Res.Df RSS Df Sum of Sq F
## 1 8417 96594435962878001152
## 2 8416 84930596049859624960 1 11663839913018376192 1171.9278
## 3 8415 84798962329488867328 1 131633720370757632 13.2259
## 4 8414 83753497550089240576 1 1045464779399626752 105.0434
## 5 8413 83733165028634722304 1 20332521454518272 2.0429
## 6 8412 83722070102993846272 1 11094925640876032 1.1148
## Pr(>F)
## 1
## 2 < 0.00000000000000022 ***
## 3 0.0002778 ***
## 4 < 0.00000000000000022 ***
## 5 0.1529534
## 6 0.2910786
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(mod1, mod2, mod3, mod4.1)
## Analysis of Variance Table
##
## Model 1: TOTAL_OFFICIAL_NEED ~ PRES_RES_REC_TRMT + POP_DENSITY_10
## Model 2: TOTAL_OFFICIAL_NEED ~ PRES_RES_REC_TRMT + POP_DENSITY_10 + RES_BURDEN
## Model 3: TOTAL_OFFICIAL_NEED ~ PRES_RES_REC_TRMT + POP_DENSITY_10 + RES_BURDEN +
## MEDINC09
## Model 4: TOTAL_OFFICIAL_NEED ~ PRES_RES_REC_TRMT + POP_DENSITY_10 + RES_BURDEN +
## POP10 + ALAND_SQMILE
## Res.Df RSS Df Sum of Sq F
## 1 8417 96594435962878001152
## 2 8416 84930596049859624960 1 11663839913018376192 1171.767
## 3 8415 84798962329488867328 1 131633720370757632 13.224
## 4 8414 83753497550089240576 1 1045464779399626752 105.029
## Pr(>F)
## 1
## 2 < 0.00000000000000022 ***
## 3 0.000278 ***
## 4 < 0.00000000000000022 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##mod4.1 brings statistical significance to the model
This tells us that model 4 with the lowest AIC in backward selection and model 4.1 in forward selection, low VIF, and statistically significant contributions through the anova test, are the two leanest models.
While mod5 matched with mod4 for the lowest AIC, the anova test for mod5 demonstrated that a multivariate regression model including both ALAND_SQMILE with MEDINC09 is not as robust.
Cor tests to confirm:
cor.test(dat2$ALAND_SQMILE, dat2$MEDINC09)
##
## Pearson's product-moment correlation
##
## data: dat2$ALAND_SQMILE and dat2$MEDINC09
## t = -5.0237, df = 8418, p-value = 0.0000005171
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.07594447 -0.03335190
## sample estimates:
## cor
## -0.05467306
#cor = -0.054, very very low
cor.test(dat2$PRES_RES_REC_TRMT, dat2$MEDINC09)
##
## Pearson's product-moment correlation
##
## data: dat2$PRES_RES_REC_TRMT and dat2$MEDINC09
## t = 8.8694, df = 8418, p-value < 0.00000000000000022
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.07501528 0.11734014
## sample estimates:
## cor
## 0.09622121
#cor = 0.096
cor.test(dat2$PRES_RES_REC_TRMT, dat2$ALAND_SQMILE)
##
## Pearson's product-moment correlation
##
## data: dat2$PRES_RES_REC_TRMT and dat2$ALAND_SQMILE
## t = 7.4768, df = 8418, p-value = 0.00000000000008383
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.05996557 0.10240409
## sample estimates:
## cor
## 0.08122164
#cor = 0.081
Summary Tables
Model 4.1
summary(mod4.1)
##
## Call:
## lm(formula = TOTAL_OFFICIAL_NEED ~ PRES_RES_REC_TRMT + POP_DENSITY_10 +
## RES_BURDEN + POP10 + ALAND_SQMILE, data = dat2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1528463923 -5893694 4009844 7202038 4160210331
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -10548295.447 1437703.018 -7.337 0.000000000000239 ***
## PRES_RES_REC_TRMT 565.969 11.370 49.777 < 0.0000000000000002 ***
## POP_DENSITY_10 22389.663 609.609 36.728 < 0.0000000000000002 ***
## RES_BURDEN 78499.798 2401.811 32.684 < 0.0000000000000002 ***
## POP10 -19.113 1.758 -10.874 < 0.0000000000000002 ***
## ALAND_SQMILE 1926.530 796.611 2.418 0.0156 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 99770000 on 8414 degrees of freedom
## Multiple R-squared: 0.4497, Adjusted R-squared: 0.4493
## F-statistic: 1375 on 5 and 8414 DF, p-value: < 0.00000000000000022
Model 4
summary(mod4)
##
## Call:
## lm(formula = TOTAL_OFFICIAL_NEED ~ PRES_RES_REC_TRMT + POP_DENSITY_10 +
## RES_BURDEN + MEDINC09 + POP10, data = dat2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1520341109 -5929113 3512562 6920069 4158589432
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1732427.407 4464463.141 -0.388 0.6980
## PRES_RES_REC_TRMT 566.556 11.368 49.839 <0.0000000000000002 ***
## POP_DENSITY_10 22263.887 605.606 36.763 <0.0000000000000002 ***
## RES_BURDEN 78279.176 2405.790 32.538 <0.0000000000000002 ***
## MEDINC09 -122.785 74.597 -1.646 0.0998 .
## POP10 -17.634 1.747 -10.092 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 99790000 on 8414 degrees of freedom
## Multiple R-squared: 0.4495, Adjusted R-squared: 0.4491
## F-statistic: 1374 on 5 and 8414 DF, p-value: < 0.00000000000000022
Stargazer function analysis of models 4 & 4.1
R-squared value for model 4.1 is marginally better, while Adjusted R-squared is the same for both smaller p-values are demonstrated for all variables in model 4.1 as compared to model 4, so we can reject the null hypothesis.
However, both ALAND_SQMILE and MEDINC09 are less statistically significant than other variables in the model.
stargazer(mod4, mod4.1, type = "text", star.cutoffs = c(0.05, 0.01, 0.001))
##
## ==================================================================
## Dependent variable:
## ----------------------------------
## TOTAL_OFFICIAL_NEED
## (1) (2)
## ------------------------------------------------------------------
## PRES_RES_REC_TRMT 566.556*** 565.969***
## (11.368) (11.370)
##
## POP_DENSITY_10 22,263.890*** 22,389.660***
## (605.606) (609.609)
##
## RES_BURDEN 78,279.180*** 78,499.800***
## (2,405.790) (2,401.811)
##
## MEDINC09 -122.785
## (74.597)
##
## POP10 -17.634*** -19.113***
## (1.747) (1.758)
##
## ALAND_SQMILE 1,926.530*
## (796.611)
##
## Constant -1,732,427.000 -10,548,295.000***
## (4,464,463.000) (1,437,703.000)
##
## ------------------------------------------------------------------
## Observations 8,420 8,420
## R2 0.449 0.450
## Adjusted R2 0.449 0.449
## Residual Std. Error (df = 8414) 99,788,663.000 99,770,057.000
## F Statistic (df = 5; 8414) 1,373.863*** 1,375.003***
## ==================================================================
## Note: *p<0.05; **p<0.01; ***p<0.001
##R-squared value for mod4.1 is marginally better, while Adjusted R-squared is the same for both
##smaller p-values are demonstrated for all variables in mod4.1 as compared to mod4, so we can reject the null hypothesis
##however, both ALAND_SQMILE and MEDINC09 are less statistically significant than other variables in the model
We can try dropping ALAND_SQMILE and MEDINC09, which gives us a comparison with model 3.
While the p-values for all variables are statistically significant in model 3, it exhibits marginally lower R-squared values than models 4 and 4.1.
stargazer(mod3, mod4, mod4.1, type = "text", star.cutoffs = c(0.05, 0.01, 0.001))
##
## =======================================================================================================
## Dependent variable:
## -----------------------------------------------------------------------------------
## TOTAL_OFFICIAL_NEED
## (1) (2) (3)
## -------------------------------------------------------------------------------------------------------
## PRES_RES_REC_TRMT 528.272*** 566.556*** 565.969***
## (10.780) (11.368) (11.370)
##
## POP_DENSITY_10 20,772.230*** 22,263.890*** 22,389.660***
## (590.802) (605.606) (609.609)
##
## RES_BURDEN 80,933.540*** 78,279.180*** 78,499.800***
## (2,405.657) (2,405.790) (2,401.811)
##
## MEDINC09 -266.253*** -122.785
## (73.668) (74.597)
##
## POP10 -17.634*** -19.113***
## (1.747) (1.758)
##
## ALAND_SQMILE 1,926.530*
## (796.611)
##
## Constant 2,677,630.000 -1,732,427.000 -10,548,295.000***
## (4,469,571.000) (4,464,463.000) (1,437,703.000)
##
## -------------------------------------------------------------------------------------------------------
## Observations 8,420 8,420 8,420
## R2 0.443 0.449 0.450
## Adjusted R2 0.443 0.449 0.449
## Residual Std. Error 100,384,858.000 (df = 8415) 99,788,663.000 (df = 8414) 99,770,057.000 (df = 8414)
## F Statistic 1,671.829*** (df = 4; 8415) 1,373.863*** (df = 5; 8414) 1,375.003*** (df = 5; 8414)
## =======================================================================================================
## Note: *p<0.05; **p<0.01; ***p<0.001
##While the p-values for all variables are statistically significant in mod3, it exhibits marginally lower R-squared values than mods 4 and 4.1
##Ultimately, no model is absolutely perfect, but mod 4.1 does a good job across all tests of correlation, multicolinearity, variance, quality, residual fit, and statistical significance.
Build Prediction Frames
Model 4.1
##mod4.1 <- lm(TOTAL_OFFICIAL_NEED ~ PRES_RES_REC_TRMT + POP_DENSITY_10 + RES_BURDEN + POP10 + ALAND_SQMILE, data = dat2)
pred.frame <- data.frame (PRES_RES_REC_TRMT = range(dat2$PRES_RES_REC_TRMT),
POP_DENSITY_10 = range(dat2$POP_DENSITY_10),
RES_BURDEN = range(dat2$RES_BURDEN),
POP10 = range(dat2$POP10),
ALAND_SQMILE = range(dat2$ALAND_SQMILE))
pp <- predict(mod4.1, int="p", newdata = pred.frame)
pc <- predict(mod4.1, int="c", newdata = pred.frame)
Predication Frame: Need and Residential Treatment
par(mfrow=c(1,1))
plot (dat2$PRES_RES_REC_TRMT, dat2$TOTAL_OFFICIAL_NEED, ylim = range (dat2$TOTAL_OFFICIAL_NEED, pp, na.rm = T))
pred.TRMT <- pred.frame$PRES_RES_REC_TRMT
matlines (pred.TRMT, pc, lty = c(1,2,2), col="black")
matlines (pred.TRMT, pp, lty = c(1,3,3), col="black")
Predication Frame: Need and Population Density
plot (dat2$POP_DENSITY_10, dat2$TOTAL_OFFICIAL_NEED, ylim = range (dat2$TOTAL_OFFICIAL_NEED, pp, na.rm = T))
pred.POPDENS <- pred.frame$POP_DENSITY_10
matlines (pred.POPDENS, pc, lty = c(1,2,2), col="black")
matlines (pred.POPDENS, pp, lty = c(1,3,3), col="black")
Plot of model 4.1
plot(mod4.1)
Final Thoughts
The R-squared value for model 4.1 is reasonable with a value of 0.45. The stargazer output confirmed that all variables are statistically significant. However, both land area and median income are less statistically significant than other variables in the model.
Ultimately, no model can be absolutely perfect, but model 4.1 does a good job across all tests of correlation, multicolinearity, variance, quality, residual fit, and statistical significance.