For this effort our goal is broadly stated as “Quantify the relationship between flood losses, migration patterns, and real-estate value indices to determine whether these factors have a statistically significant affect on each other.” This is important to risq stakeholders because should these associations exist, it means that:
tax-revenues are at risk in areas of high vulnearability to flood related catastrophe and
that home values are likely to decrease in areas where climate damages are on the rise.
Such results would increase incentives for stakeholders to double-down on Risq insights to protect themselves against long term losses. As well, it should spur local governments to prioritize climate adaptation and catastrophe planning to prevent local tax bases from dissolving.
Upon examination of this data, our initial findings are as follows:
Flood losses do have an effect on migration patterns. Increases in claims payouts are associated with decreases in population growth.
Flood losses do not have a statistically significant effect on Zillow’s home value index.
Migration has a statistically significant effect on Zillow’s home value index.
What does this mean for stakeholders? ( TBD with Evan)
Ultimately this exercise involved aggregating multiple datasets containing information on flood losses, population migration, income, and real estate value while accounting for external variables like inflation, population density, and the coastal vs non-coastal nature of various geographies. In total, 7 datasets were gathered and processed:
Three variables were engineered to perform this analysis. They are:
migration_index: calculated as a % change in total population over a 3 year period using data from 2010 - 2018. Populations were aggregated at the US county level. The population deltas were grouped by population density and scaled into z-scores.
zhvi_index: Calculated as a % change in Zillow’s home value index over 3 year time periods. These % changes were grouped by population density and converted into z-scores and
adjusted_claims_per_capita: Annual flood damages were adjusted to beginning of year 2020 dollars using inflation data. The adjusted annual claims were then aggregated across the same 3 year periods over which migration / ZHVI fluxes were calculated. These county level claims were divided by county population to normalize for the effect of population density.
With these variables in hand our investigation hinged on the following methods:
The output from these experiments led us to the conclusions outlined in section 1. Further details around the specific design of these tests are discussed in section 3.
#####################################################################################
#### PHASE 0: Loading all raw datasets
#####################################################################################
# (1) load in damages datasets / claims payouts
# link to data source: https://www.fema.gov/media-library/assets/documents/180374
damages = fread('openFEMA_claims20190831.csv')
head(damages)
## agriculturestructureindicator asofdate basefloodelevation
## 1: 2019-08-31 NA
## 2: 2019-08-31 NA
## 3: 2019-08-31 NA
## 4: 2019-08-31 NA
## 5: 2019-08-31 NA
## 6: 2019-08-31 NA
## basementenclosurecrawlspacetype reportedcity condominiumindicator
## 1: 0 OCEANSIDE N
## 2: 0 NEW ORLEANS N
## 3: 0 NAVARRE N
## 4: 1 BEAUFORT N
## 5: 0 MELBOURNE N
## 6: 0 VIRGINIA BEACH N
## policycount countycode crsdiscount dateofloss elevatedbuildingindicator
## 1: 1 6073 0.00 1998-02-07 N
## 2: 1 22071 0.00 2005-08-29 N
## 3: 1 12113 0.05 1998-09-28 N
## 4: 1 45013 0.00 1994-10-07 N
## 5: 1 12009 0.00 1996-03-11 N
## 6: 1 51810 0.00 1998-02-03 Y
## elevationcertificateindicator elevationdifference censustract floodzone
## 1: 999 6073018519 X
## 2: 999 22071000616 X
## 3: 999 12113010815 X
## 4: 999 45013000700 X
## 5: 999 12009063107 X
## 6: 999 51810043600 AE
## houseworship latitude locationofcontents longitude lowestadjacentgrade
## 1: 33.2 -117.3 NA
## 2: 29.9 -90.0 NA
## 3: 30.4 -86.9 NA
## 4: 32.4 -80.7 NA
## 5: 28.3 -80.7 NA
## 6: 36.9 -76.0 NA
## lowestfloorelevation numberoffloorsintheinsuredbuilding nonprofitindicator
## 1: NA 4
## 2: NA 2
## 3: NA 1
## 4: NA 2
## 5: NA 1
## 6: NA 1
## obstructiontype occupancytype originalconstructiondate originalnbdate
## 1: 10 1 1963-01-01 1997-01-11
## 2: 10 1 1967-07-01 1990-07-12
## 3: 10 1 1972-01-01 1997-07-24
## 4: 10 1 1960-01-01 1993-10-01
## 5: 10 1 1988-01-01 1996-01-11
## 6: 10 1 1970-01-01 1995-12-03
## amountpaidonbuildingclaim amountpaidoncontentsclaim
## 1: 0.00 0
## 2: 0.00 0
## 3: 8813.21 1720
## 4: 2906.00 0
## 5: 3875.53 1545
## 6: 14985.92 1510
## amountpaidonincreasedcostofcomplianceclaim postfirmconstructionindicator
## 1: NA N
## 2: NA N
## 3: 0 N
## 4: NA N
## 5: NA Y
## 6: 0 N
## ratemethod smallbusinessindicatorbuilding state
## 1: 7 CA
## 2: 7 LA
## 3: 1 FL
## 4: 7 SC
## 5: 7 FL
## 6: 1 VA
## totalbuildinginsurancecoverage totalcontentsinsurancecoverage yearofloss
## 1: 200000 50000 1998
## 2: 100000 40000 2005
## 3: 100000 50000 1998
## 4: 100000 25000 1994
## 5: 100000 25000 1996
## 6: 88000 12600 1998
## reportedzipcode primaryresidence
## 1: 92056
## 2: 70131 Y
## 3: 32566
## 4: 29902
## 5: 32940
## 6: 23451
# (2) load in real estate value indices
# linke to data source: https://www.zillow.com/research/data/
real_estate_value = fread('zhvi_risq_county_homevalue_index.csv')
head(real_estate_value)
## V1 GEOID year STATEFP lon lat RegionName Metro
## 1: 1 1027 2008 1 -85.86055 33.26909 <NA> <NA>
## 2: 2 1091 2008 1 -87.78952 32.24761 <NA> <NA>
## 3: 3 1049 2008 1 -85.80414 34.45977 De Kalb County Fort Payne
## 4: 4 1019 2008 1 -85.60379 34.17596 Cherokee County
## 5: 5 1065 2008 1 -87.62912 32.76266 <NA> <NA>
## 6: 6 1105 2008 1 -87.29440 32.63847 <NA> <NA>
## StateCodeFIPS STATEABBR STATE MunicipalCodeFIPS meanCountyAverage
## 1: NA <NA> <NA> NA NA
## 2: NA <NA> <NA> NA NA
## 3: 1 AL Alabama 49 89731.29
## 4: 1 AL Alabama 19 121605.57
## 5: NA <NA> <NA> NA NA
## 6: NA <NA> <NA> NA NA
## meanStateAverage MeanStateMinusCounty state_stdev acs_pc1 acs_pc2
## 1: NA 4181.528 32186.68 -4.062239 0.0656347
## 2: NA 4181.528 32186.68 -4.264940 1.9842919
## 3: 128633.2 4181.528 32186.68 -3.596758 0.4997780
## 4: 128633.2 4181.528 32186.68 -3.020672 0.1175345
## 5: NA 4181.528 32186.68 -4.357674 1.1254512
## 6: NA 4181.528 32186.68 -6.958061 1.8764552
## acs_pc3 acs_pc4 ImputedCountyAverages ProductCountyZHVI grp
## 1: -0.9808717 -0.18526281 85732.22 85732.22 01__2008
## 2: -0.7748036 -1.78987741 93152.16 93152.16 01__2008
## 3: -1.0720726 -0.12143096 102678.99 89731.29 01__2008
## 4: -1.0980994 -0.09957345 105956.24 121605.57 01__2008
## 5: -0.7375154 -1.09452420 84050.58 84050.58 01__2008
## 6: 0.2331603 0.17748290 60758.55 60758.55 01__2008
## county_homevalue_index
## 1: 0.4471943
## 2: 0.4858980
## 3: 0.4680542
## 4: 0.6343160
## 5: 0.4384226
## 6: 0.3169273
# (3) load in population data (census / migration from american fact finder. Choose "county" from "Geographies" the from "Topics" select People and then "Basic Count / Estimate"
# link to data source: https://factfinder.census.gov/faces/tableservices/jsf/pages/productview.xhtml?pid=PEP_2018_PEPANNRES&prodType=table
migration = fread('migration.csv')
head(migration)
## countycode 2010 2011 2012 2013 2014 2015 2016 2017 2018
## 1: 1001 54754 55208 54936 54713 54876 54838 55242 55443 55601
## 2: 1003 183111 186540 190143 194886 199189 202995 207712 212619 218022
## 3: 1005 27330 27350 27174 26944 26758 26294 25819 25158 24881
## 4: 1007 22872 22747 22664 22516 22541 22562 22576 22555 22400
## 5: 1009 57373 57554 57570 57611 57521 57522 57517 57827 57840
## 6: 1011 10878 10677 10607 10551 10665 10400 10381 10176 10138
# (4) load in inflation data
# link to data source: https://data.bls.gov/timeseries/CUUR0000SA0
inflation = fread('calculator_net.csv')
head(inflation)
## Year Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov
## 1: 2019 1.55% 1.52% 1.86% 2.00% 1.79% 1.65% 1.81% 1.75% 1.71% 1.76% 2.05%
## 2: 2018 2.07% 2.21% 2.36% 2.46% 2.80% 2.87% 2.95% 2.70% 2.28% 2.52% 2.18%
## 3: 2017 2.50% 2.74% 2.38% 2.20% 1.87% 1.63% 1.73% 1.94% 2.23% 2.04% 2.20%
## 4: 2016 1.37% 1.02% 0.85% 1.13% 1.02% 1.01% 0.84% 1.06% 1.46% 1.64% 1.69%
## 5: 2015 -0.09% -0.03% -0.07% -0.20% -0.04% 0.12% 0.17% 0.20% -0.04% 0.17% 0.50%
## 6: 2014 1.58% 1.13% 1.51% 1.95% 2.13% 2.07% 1.99% 1.70% 1.66% 1.66% 1.32%
## Dec Annual
## 1: 2.29% 1.81%
## 2: 1.91% 2.44%
## 3: 2.11% 2.13%
## 4: 2.07% 1.26%
## 5: 0.73% 0.12%
## 6: 0.76% 1.62%
# (5) load in county density data
# link to data source: https://www.cdc.gov/nchs/data_access/urban_rural.htm#Data_Files_and_Documentation
density = fread("density.csv")
head(density[, c('countycode', 'state', 'population', 'density_indicator')])
## countycode state population density_indicator
## 1: 2201 AK . 6
## 2: 2232 AK . 6
## 3: 2280 AK . 6
## 4: 51560 VA . 6
## 5: 6037 CA 9962789 1
## 6: 17031 IL 5231351 1
# (6) load in is_coastal indicator (census data)
# link to data source: https://www2.census.gov › library › stories › 2018/08 › coastline-counties-...
coastlines = fread("coastlines.csv")
head(coastlines)
## countycode state coast coast_pop
## 1: 1003 Alabama Gulf of Mexico 208,563
## 2: 1097 Alabama Gulf of Mexico 414,836
## 3: 2013 Alaska Pacific 3,296
## 4: 2016 Alaska Pacific 5,647
## 5: 2020 Alaska Pacific 298,192
## 6: 2050 Alaska Pacific 17,968
# (7) load in income data set
# link to data source: https://data.world/tylerudite/2015-median-income-by-county/workspace/file?filename=2015+Median+Income+by+County.csv
income = fread("income.csv")
head(income)
## county_state county population median_income state
## 1: Autauga County, Alabama Autauga County 55221 51281 AL
## 2: Baldwin County, Alabama Baldwin County 195121 50254 AL
## 3: Barbour County, Alabama Barbour County 26932 32964 AL
## 4: Bibb County, Alabama Bibb County 22604 38678 AL
## 5: Blount County, Alabama Blount County 57710 45813 AL
## 6: Bullock County, Alabama Bullock County 10678 31938 AL
## state_name
## 1: Alabama
## 2: Alabama
## 3: Alabama
## 4: Alabama
## 5: Alabama
## 6: Alabama
interesting_vars = c('dateofloss', 'countycode', 'amountpaidonbuildingclaim')
damages$dateofloss = parse_date(damages$dateofloss, "%Y-%m-%d")
summary(damages$dateofloss)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## "1970-08-31" "1992-12-11" "2004-07-13" "2001-03-13" "2011-08-27" "2019-08-30"
damage = damages[dateofloss > '2010-01-01']
df = dplyr::select(damage[!is.na(countycode) & !is.na(amountpaidonbuildingclaim)], interesting_vars)
## Note: Using an external vector in selections is ambiguous.
## ℹ Use `all_of(interesting_vars)` instead of `interesting_vars` to silence this message.
## ℹ See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This message is displayed once per session.
# the income data encodes county as a string (all counties have 5 characters in it. So we will force all other data sets to have that same form for ease of joining later on)
df[, countycode := str_pad(countycode, 5, side = "left", pad = "0")]
# (2) subset real estate datat to only the variables of interest
real_estate_vars = c('ProductCountyZHVI', 'year', 'GEOID')
real_estate_value = dplyr::select(real_estate_value, real_estate_vars)
## Note: Using an external vector in selections is ambiguous.
## ℹ Use `all_of(real_estate_vars)` instead of `real_estate_vars` to silence this message.
## ℹ See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This message is displayed once per session.
real_estate_value[, countycode := str_pad(GEOID, 5, side = "left", pad = "0")]
real_est = real_estate_value[year >= 2010]
# (3) melt the migration data to hold year as a single column (rather than a dummmy indicator)
migration_melt = melt(migration, id.vars = c('countycode'),
measure.vars = c('2010', '2011', '2012', '2013', '2014', '2015', '2016', '2017', '2018'),
variable.name = c('year'),
value.name = c('population'),
variable.factor = FALSE)
migration_melt$year = as.numeric(migration_melt$year)
migration_melt[, countycode := str_pad(countycode, 5, side = "left", pad = "0")]
inflate = inflation[, c('Year', 'Annual')]
# remove the "%" sign in the raw data and convert to numeric, then put data in form of 1.01 (rather than .01)
inflate$Annual = gsub("%", "", as.character(factor(inflate$Annual)))
inflate$Annual = as.numeric(inflate$Annual)
inflate[, Annual := Annual* .01 + 1]
# initialize a total_multiplier to convert a given year to beginning of 2020 dollars
inflate[, total_multiplier := 0]
# loop through the list to calculate a cumulative product for each year
for (yr in unique(inflate$Year)) {
inflate[Year == yr]$total_multiplier = prod(inflate[Year >= yr]$Annual)
}
# some issues with this approach -- it is saying that all 2019 dollars should be multiplied by 1.018 to conver them into beginning of 2020 dollars
# (5) reverse density indicator scale
density[, density_indicator := abs(density_indicator - 7)]
density[, countycode := str_pad(countycode, 5, side = "left", pad = "0")]
# (6) no cleansing required for coastal data set
coastlines[, countycode := str_pad(countycode, 5, side = "left", pad = "0")]
# (7) use R's fips data set to get county code from county name in income dataset
# lpad the countycode
fips = as.data.table(force(county.fips))
income[, state_name := tolower(state_name)]
income[, short_county := tolower(gsub(" County", "", county))]
income[, polyname := paste0(state_name, ",", short_county)]
incomes = merge(income, fips, by = c('polyname'), all.x = TRUE)
incomes[, countycode := str_pad(fips, 5, side = "left", pad = "0")]
df = as.data.table(df %>% as_tibble() %>% mutate(
year = year(dateofloss)
))
# create a new column for the year grouping. We have 10 years of data (2010 - 2019)
# options include: 3 buckets of 3 years; 2 buckets of 5 years, dropping 2019; or year-to-year variation (this is likely noisy)
# seems like maybe 3 buckets of 3 years is a good place to start ( we lose a year of data, but we get to see change across an extra time groups)
df[, time_group := ifelse(year <= 2012, 2012,
ifelse(year <= 2015, 2015,
ifelse(year <= 2018, 2018, NA)))]
# this will cause us to lose damages data after 2018 (around 3% of the data)
df <- df[!is.na(df$time_group),]
df2 <- merge(df, inflate[, c('Year', 'total_multiplier')], by.x=c("year"), by.y = c("Year"))
df2[, adjusted_claims := amountpaidonbuildingclaim * total_multiplier]
# now that all claim dollars are on the same scale (begining of year 2020)
# we can sum all claim dollars by time group ( 3 year bucket) and county
df3 = as.data.table(df2 %>%
group_by(time_group, countycode) %>%
summarise(total_claims = sum(adjusted_claims, na.rm = TRUE)))
## `summarise()` regrouping output by 'time_group' (override with `.groups` argument)
# we add roughly 2.6 billion dollars of claims by enforcing this adjustment
dist = as.data.table(df2 %>% group_by(year) %>% summarise(count = n(), mult = mean(total_multiplier)))
## `summarise()` ungrouping output (override with `.groups` argument)
ggplot(dist, aes(x = year, y = count))+ geom_bar(stat = 'identity') + ggtitle('Distribution of Claims Data Points by Year')
df4 = merge(df3, coastlines[, c('countycode', 'coast')], by = "countycode", all.x = TRUE)
df4[, is_coastal := ifelse(is.na(coast), 0, 1)]
df4 = select(df4,-c(coast))
# read in data for rural / urban classes from CDC
df5 = merge(x = df4, y = density[,c('countycode', 'density_indicator', 'density_label')], by = "countycode", all.x = TRUE)
# drops ~250 rows where density indicator is missing (drops data from 66 counties...)
df5 = df5[!is.na(df5$density_indicator),]
# lpad the countycode with 0's to ensure no loss of information during join
df6 = merge(df5, incomes[, c('countycode', 'median_income')], by = 'countycode', all.x = TRUE)
df6$median_income = as.numeric(df6$median_income)
# drop 409 rows were median income is missing
df6 = df6[!is.na(df6$median_income),]
migration_melt[, time_group := ifelse(year <= 2012, 2012,
ifelse(year <= 2015, 2015,
ifelse(year <= 2018, 2018, NA)))]
# compute year to year change in population by county and time group
# set lag length (equal to bucket size)
length = 3
migration_melt2 = as.data.table(migration_melt %>%
group_by(countycode) %>%
arrange(year, .by_group = TRUE) %>%
mutate(pct_change = (population - lag(population, n = length - 1)) / lag(population, n = length - 1)))
# join migration data to damages data
df7 <- merge(migration_melt2,df6,by=c("time_group", "countycode"))
df7 = df7[time_group == year]
# repeat identical steps for real_estate_value data
# compute identical time group to align with damages
real_est[, time_group := ifelse(year <= 2012, 2012,
ifelse(year <= 2015, 2015,
ifelse(year <= 2018, 2018, NA)))]
real_estate_value2 = as.data.table(real_est %>%
group_by(countycode) %>%
arrange(year, .by_group = TRUE) %>%
mutate(index_delta = (ProductCountyZHVI - lag(ProductCountyZHVI, n = length - 1)) / lag(ProductCountyZHVI, n = length - 1)))
real_estate_value2 = real_estate_value2[time_group == year]
# join real_estate data to damages data
df8 <- merge(real_estate_value2[, c('index_delta', 'time_group', 'countycode')],df7,by=c("time_group", "countycode"))
head(df8)
## time_group countycode index_delta year population pct_change total_claims
## 1: 2012 01001 -0.03525311 2012 54936 0.0033239581 0.000
## 2: 2012 01003 -0.09568093 2012 190143 0.0384029359 827838.469
## 3: 2012 01007 0.00000000 2012 22664 -0.0090940888 1936.504
## 4: 2012 01015 0.00000000 2012 117257 -0.0102973573 24432.293
## 5: 2012 01017 -0.09603776 2012 34104 -0.0005275189 0.000
## 6: 2012 01019 -0.06875065 2012 25963 -0.0004235004 0.000
## is_coastal density_indicator density_label median_income
## 1: 0 4 medium metro 51281
## 2: 1 3 small metro 50254
## 3: 0 5 large fringe metro 38678
## 4: 0 3 small metro 41703
## 5: 0 2 micropolitan 34177
## 6: 0 1 non-core 36296
ggplot(df8, aes(pct_change))+
geom_histogram()+
xlab('Net Migration')+
ggtitle('Distribution of Migration Patterns')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# hist(df8$pct_change, xlab = 'Net Migration', main = 'Distribution of Migration Patterns')
ggplot(df8, aes(total_claims))+
geom_boxplot()+
ggtitle('Distribution of Flood Claims Payouts')
df9 <- as.data.table(df8 %>% group_by(density_indicator) %>% mutate(
z_score = scale(pct_change),
scaled_index = scale(index_delta)))
hist(df9$z_score)
summary(df9$z_score)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -4.0501 -0.6264 -0.1174 0.0000 0.4843 13.8255
hist(df9$scaled_index)
A preliminary linear regression model of migration_index on total_claims, the binary is_coastal indicator, the density indicator, and median income was fit. The results indicate that being near the coast is associated with a .054 unit decrease in migration. Additionally, we can say that every 1 unit decrease in population density we see a .011 unit decrease in migration. This means that people are moving into denser regions. We also see that for every $3,200 increase in median income we see a .1 unit increase in migration. This means that people tend to move into areas that have higher median incomes. Lastly, the results of this regression fit indicate that for every 1,000,000,000 increase in claims we see a .3367 unit decline in migration.
Summary of linear regression fit
summary(lm)
##
## Call:
## lm(formula = scaled_index ~ total_claims + is_coastal + density_indicator +
## median_income, data = df9)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9290 -0.5713 -0.3421 0.7061 3.7075
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.177e-01 5.203e-02 -2.263 0.02369 *
## total_claims -3.321e-10 1.648e-10 -2.015 0.04395 *
## is_coastal -5.409e-02 4.557e-02 -1.187 0.23529
## density_indicator -1.102e-02 1.017e-02 -1.083 0.27903
## median_income 3.207e-06 1.240e-06 2.586 0.00973 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9989 on 5815 degrees of freedom
## Multiple R-squared: 0.002024, Adjusted R-squared: 0.001337
## F-statistic: 2.948 on 4 and 5815 DF, p-value: 0.01905
Principal component analysis was conducted on claims_per_capita, the density_indicator, and median income (should density indicator go into PCA if is_coastal didn’t?). We can see that all three principal components are significant to the outcome (z_scored migration fluxes.)
df.pca <- prcomp(formula = ~ claims_per_capita + density_indicator + median_income,
data = df9,
center = TRUE,
scale. = TRUE)
lm.pca = lm(df9$z_score ~ df.pca$x[,1] + df.pca$x[,2] + df.pca$x[,3])
summary(lm.pca)
##
## Call:
## lm(formula = df9$z_score ~ df.pca$x[, 1] + df.pca$x[, 2] + df.pca$x[,
## 3])
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.0330 -0.6170 -0.1202 0.4698 13.0367
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.005e-17 1.265e-02 0.000 1.000000
## df.pca$x[, 1] 9.892e-02 1.019e-02 9.706 < 2e-16 ***
## df.pca$x[, 2] 4.528e-02 1.265e-02 3.578 0.000348 ***
## df.pca$x[, 3] 3.339e-01 1.866e-02 17.897 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.965 on 5816 degrees of freedom
## Multiple R-squared: 0.06844, Adjusted R-squared: 0.06796
## F-statistic: 142.4 on 3 and 5816 DF, p-value: < 2.2e-16
Contribution of each variable to each component
df.pca$rotation # PC2 is mostly claims per capita
## PC1 PC2 PC3
## claims_per_capita 0.0266746 -0.99964402 0.0005415214
## density_indicator 0.7068603 0.01847888 -0.7071117900
## median_income 0.7068501 0.01924471 0.7071015650
A series of permutation tests were conducted to determine whether the association between a particular target and a given label was significant or not. To check whether or not losses are an important factor of migration patterns, we assigned a label to each row of our data set based on the value of per capita claims. Any row with per capita claims greater than 7 (the 75% percentile value of claims_per_capita) was assigned a label of “high” indicating that losses were above average. All other rows were assigned a label of “low.” Then, the absolute difference between the mean z_score of the “high” group and the “low” group was computed. This absolute difference in means served as our permutation test statistic. Next we generated 10,000 samples of our data by shuffling the index of the assigned “high” / “low” labels. For each sample, the absolute difference between the mean z-score of the “high” and “low” group was calculated and stored. Each of these calculations make up a distribution of permutation test statistics. We can then plot the distribution of our permutation test statistics along with our originally observed test statistic to get a sense of where our data falls in relation to the permutation sample statistics. This can serve as a visual sense of how much the test statistic actually depends on the labels. We can compute a p-value to measure how likely we are to observe our test statistic assuming there is no dependence on the label.
plot(density(migration_test$permutation_results),
xlab="Absolute Difference in Average Migration Index Between Groups With Damages in Top 25% vs Bottom 75%" ,
main="Observed Test Statistic vs Distribution of Simulated Test Statistics", las=1)
abline(v=migration_test$initial_test_statistic, col="blue", lty="dotted")
text(.11, 10, "observed test statistic", col="blue", cex=0.7)
text(.07, 20, "probability of our result if claims have no affect on migration", col="black", cex=0.7)
mean(migration_test$permutation_results >= migration_test$initial_test_statistic) # .0927
## [1] 5e-04
plot(density(real_estate_test$permutation_results),
xlab="Absolute Difference in Average ZHVI Index Between Groups With Damages in Top 25% vs Bottom 75% " ,
main="Observed Test Statistic vs Distribution of Simulated Test Statistics", las=1)
abline(v=real_estate_test$initial_test_statistic, col="blue", lty="dotted")
text(.11, 10, "observed test statistic", col="blue", cex=0.7)
text(.07, 20, "simulated distribution of test statistic on permuted data", col="black", cex=0.7)
mean(real_estate_test$permutation_results >= real_estate_test$initial_test_statistic) # .0927
## [1] 0.011
plot(density(real_estate_test2$permutation_results),
xlab="Absolute Difference in Average Migration Index Between Groups With Migration Indices in Top 25% vs Bottom 75% " ,
main="Observed vs Simulated Test Statistic", las=1,
xlim = c(0, real_estate_test2$initial_test_statistic*1.2))
abline(v=real_estate_test2$initial_test_statistic, col="blue", lty="dotted")
text(.11, 10, "observed test statistic", col="blue", cex=0.7)
text(.07, 20, "simulated distribution of test statistic on permuted data", col="black", cex=0.7)
mean(real_estate_test2$permutation_results >= real_estate_test2$initial_test_statistic) # .0927
## [1] 0
Migration vs Claims:
print(cor(test_dat$claims_per_capita, test_dat$z_score))
## [1] -0.04111326
ggplot(test_dat[claims_per_capita < 10], aes(y = z_score, x = claims_per_capita))+
geom_point()+
ggtitle('Migration vs Claims')
Migration vs Real Estate Value:
print(cor(test_dat$scaled_index, test_dat$z_score))
## [1] 0.2692948
ggplot(test_dat[claims_per_capita < 10], aes(y = z_score, x = scaled_index))+
geom_point()+
ggtitle('Migration vs Real Estate Value')
Real Estate Value vs Claims:
print(cor(test_dat$claims_per_capita, test_dat$scaled_index))
## [1] -0.02666343
ggplot(test_dat[claims_per_capita < 10], aes(y = scaled_index, x = claims_per_capita))+geom_point()
The first data set contains information regarding total claims payouts made due to flood related damages available on FEMA’s website. The full dataset contained 39 different variables but only 3 were used: date, county, and total payouts.
head(df[, c('dateofloss', 'amountpaidonbuildingclaim', 'countycode')], 3)
## dateofloss amountpaidonbuildingclaim countycode
## 1: 2017-08-27 195857.43 48201
## 2: 2015-02-21 2618.18 25001
## 3: 2012-05-20 0.00 16059
The second data set contained information downloaded from zillow’s website. The home value index draws on Zestimates calculated on over 100 million U.S. homes, including new construction homes and/or homes that have not traded on the open market in many years. Each month every home’s current zestimate is compared to the previous month’s and the % change is tracked. Those changes are weighted according the each home’s overall value and then aggregated across the housing market in a particular market segment.
Three of the 24 variables in this dataset were of interest: the index, the year, and the county code.
head(real_est[, c('ProductCountyZHVI', 'year', 'GEOID')], 3)
## ProductCountyZHVI year GEOID
## 1: 85732.22 2010 1027
## 2: 93152.16 2010 1091
## 3: 83986.03 2010 1049
The third data set contained census data downloaded from the american fact finder. This dataset was obtained by navigating to the “Geographies” section and choosing “county” and navigating to the “Topics” section, choosing “People” and then “Basic Count / Estimate”
head(migration_melt, 3)
## countycode year population time_group
## 1: 01001 2010 54754 2012
## 2: 01003 2010 183111 2012
## 3: 01005 2010 27330 2012
The fourth data set was downloaded from the Bureau of Labor Statistics and contains inflation data. This data was used to normalize the historical claims amounts into a common scale (2020 dollars).
head(inflate, 3)
## Year Annual total_multiplier
## 1: 2019 1.0181 1.018100
## 2: 2018 1.0244 1.042942
## 3: 2017 1.0213 1.065156
The fifth data set was downloaded from the CDC and carries a “density” indicator to inform whether a given county is a major metro or a rural village. There are 6 total classifications of population density.
head(density, 3)
## countycode state county CBSA title population
## 1: 2201 AK Prince of Wales-Outer Ket .
## 2: 2232 AK Skagway-Hoonah-Angoon Census Area .
## 3: 2280 AK Wrangell-Petersburg Census Area .
## density_indicator density_label
## 1: 6 non-core
## 2: 6 non-core
## 3: 6 non-core
unique(density$density_label)
## [1] "non-core" "large central metro" "large fringe metro"
## [4] "medium metro" "small metro" "micropolitan"
The sixth data set was taken from the census and can be found by navigating to the “libray/stories/2018/08” and selecting the “coastline-counties” file. This data carries information on counties that are considered coastal (there are roughly 255 out of 3,000 counties considered to be coastal)
head(coastlines, 3)
## countycode state coast coast_pop
## 1: 1003 Alabama Gulf of Mexico 208,563
## 2: 1097 Alabama Gulf of Mexico 414,836
## 3: 2013 Alaska Pacific 3,296
The final data set contains information on median county level household income and was taken from data world.
head(incomes, 3)
## polyname county_state county population
## 1: alabama,autauga Autauga County, Alabama Autauga County 55221
## 2: alabama,baldwin Baldwin County, Alabama Baldwin County 195121
## 3: alabama,barbour Barbour County, Alabama Barbour County 26932
## median_income state state_name short_county fips countycode
## 1: 51281 AL alabama autauga 1001 01001
## 2: 50254 AL alabama baldwin 1003 01003
## 3: 32964 AL alabama barbour 1005 01005