1 Executive Summary (Business Interpretation)

For this effort, our goal is broadly stated as “Quantify the relationship between flood losses, migration patterns, and real-estate value indices to determine whether these factors have a statistically significant effect on each other.” This matters to Risq stakeholders because, should these associations exist:

  1. tax revenues are at risk in areas highly vulnerable to flood-related catastrophe, and

  2. home values are likely to decrease in areas where climate damages are on the rise.

Such results would strengthen the incentive for stakeholders to invest in Risq insights to protect themselves against long-term losses. They should also spur local governments to prioritize climate adaptation and catastrophe planning to prevent local tax bases from eroding.

Upon examination of this data, our initial findings are as follows:

  1. Flood losses do have an effect on migration patterns. Increases in claims payouts are associated with decreases in population growth.

  2. Flood losses do not have a statistically significant effect on Zillow’s home value index.

  3. Migration has a statistically significant effect on Zillow’s home value index.

What does this mean for stakeholders? (TBD with Evan)

2 Overview of Approach

Ultimately, this exercise involved aggregating multiple datasets containing information on flood losses, population migration, income, and real estate value, while accounting for external variables like inflation, population density, and the coastal vs non-coastal nature of various geographies. In total, 7 datasets were gathered and processed:

  1. Flood damages (sourced from FEMA)
  2. Real estate value (sourced from Zillow)
  3. Population data (sourced from American Fact Finder)
  4. Inflation data (sourced from BLS)
  5. Population density data (sourced from CDC)
  6. Coastal indicator (sourced from US census)
  7. Income data set (sourced from Data World)

Three variables were engineered to perform this analysis. They are:

migration_index: calculated as a % change in total population over a 3-year period using data from 2010 - 2018. Populations were aggregated at the US county level. The population deltas were grouped by population density and scaled into z-scores.

zhvi_index: calculated as a % change in Zillow’s home value index over the same 3-year time periods. These % changes were grouped by population density and converted into z-scores.

adjusted_claims_per_capita: annual flood damages were adjusted to beginning-of-year 2020 dollars using inflation data. The adjusted annual claims were then aggregated across the same 3-year periods over which the migration / ZHVI fluxes were calculated. These county-level claims were divided by county population to normalize for the effect of population density.
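A condensed sketch of these three transformations (illustrative only; dt stands in for the merged county-by-period table built in section 4, and the input column names follow the code there):

dt[, migration_index := scale(pct_change), by = density_indicator]   # 3-year population delta, z-scored within density group
dt[, zhvi_index := scale(index_delta), by = density_indicator]       # 3-year ZHVI delta, z-scored within density group
dt[, adjusted_claims_per_capita := total_claims / population]        # inflation-adjusted claims per resident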

With these variables in hand, our investigation hinged on the following methods:

  1. Permutation test
  2. Linear regression fit
  3. Principal component analysis

The output from these experiments led us to the conclusions outlined in section 1. Further details around the specific design of these tests are discussed in section 4.

2.1 Experimental Design & Assumptions

  1. A 3-year aggregation window is used for all engineered variables (2010-2012, 2013-2015, 2016-2018)
  2. Every ZHVI index value is converted to a % of the maximum value across all counties (see the sketch below)
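Illustratively, and only as an assumption about how this normalization was applied (the county_homevalue_index column in the Zillow extract appears consistent with it):

real_estate_value[, county_homevalue_index := ProductCountyZHVI / max(ProductCountyZHVI, na.rm = TRUE)]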

4. Modeling
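The chunks below assume a setup block along these lines (not shown in the original; the library list is inferred from the functions called):

library(data.table)  # fread, melt, := assignment
library(dplyr)       # select, group_by, summarise, mutate, lag
library(stringr)     # str_pad
library(readr)       # parse_date
library(lubridate)   # year
library(ggplot2)     # plots
library(maps)        # county.fips lookup table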

#####################################################################################
#### PHASE 0: Loading all raw datasets
#####################################################################################
# (1) load in damages datasets / claims payouts
# link to data source: https://www.fema.gov/media-library/assets/documents/180374
damages = fread('openFEMA_claims20190831.csv')
head(damages)
##    agriculturestructureindicator   asofdate basefloodelevation
## 1:                               2019-08-31                 NA
## 2:                               2019-08-31                 NA
## 3:                               2019-08-31                 NA
## 4:                               2019-08-31                 NA
## 5:                               2019-08-31                 NA
## 6:                               2019-08-31                 NA
##    basementenclosurecrawlspacetype   reportedcity condominiumindicator
## 1:                               0      OCEANSIDE                    N
## 2:                               0    NEW ORLEANS                    N
## 3:                               0        NAVARRE                    N
## 4:                               1       BEAUFORT                    N
## 5:                               0      MELBOURNE                    N
## 6:                               0 VIRGINIA BEACH                    N
##    policycount countycode crsdiscount dateofloss elevatedbuildingindicator
## 1:           1       6073        0.00 1998-02-07                         N
## 2:           1      22071        0.00 2005-08-29                         N
## 3:           1      12113        0.05 1998-09-28                         N
## 4:           1      45013        0.00 1994-10-07                         N
## 5:           1      12009        0.00 1996-03-11                         N
## 6:           1      51810        0.00 1998-02-03                         Y
##    elevationcertificateindicator elevationdifference censustract floodzone
## 1:                                               999  6073018519         X
## 2:                                               999 22071000616         X
## 3:                                               999 12113010815         X
## 4:                                               999 45013000700         X
## 5:                                               999 12009063107         X
## 6:                                               999 51810043600        AE
##    houseworship latitude locationofcontents longitude lowestadjacentgrade
## 1:                  33.2                       -117.3                  NA
## 2:                  29.9                        -90.0                  NA
## 3:                  30.4                        -86.9                  NA
## 4:                  32.4                        -80.7                  NA
## 5:                  28.3                        -80.7                  NA
## 6:                  36.9                        -76.0                  NA
##    lowestfloorelevation numberoffloorsintheinsuredbuilding nonprofitindicator
## 1:                   NA                                  4                   
## 2:                   NA                                  2                   
## 3:                   NA                                  1                   
## 4:                   NA                                  2                   
## 5:                   NA                                  1                   
## 6:                   NA                                  1                   
##    obstructiontype occupancytype originalconstructiondate originalnbdate
## 1:              10             1               1963-01-01     1997-01-11
## 2:              10             1               1967-07-01     1990-07-12
## 3:              10             1               1972-01-01     1997-07-24
## 4:              10             1               1960-01-01     1993-10-01
## 5:              10             1               1988-01-01     1996-01-11
## 6:              10             1               1970-01-01     1995-12-03
##    amountpaidonbuildingclaim amountpaidoncontentsclaim
## 1:                      0.00                         0
## 2:                      0.00                         0
## 3:                   8813.21                      1720
## 4:                   2906.00                         0
## 5:                   3875.53                      1545
## 6:                  14985.92                      1510
##    amountpaidonincreasedcostofcomplianceclaim postfirmconstructionindicator
## 1:                                         NA                             N
## 2:                                         NA                             N
## 3:                                          0                             N
## 4:                                         NA                             N
## 5:                                         NA                             Y
## 6:                                          0                             N
##    ratemethod smallbusinessindicatorbuilding state
## 1:          7                                   CA
## 2:          7                                   LA
## 3:          1                                   FL
## 4:          7                                   SC
## 5:          7                                   FL
## 6:          1                                   VA
##    totalbuildinginsurancecoverage totalcontentsinsurancecoverage yearofloss
## 1:                         200000                          50000       1998
## 2:                         100000                          40000       2005
## 3:                         100000                          50000       1998
## 4:                         100000                          25000       1994
## 5:                         100000                          25000       1996
## 6:                          88000                          12600       1998
##    reportedzipcode primaryresidence
## 1:           92056                 
## 2:           70131                Y
## 3:           32566                 
## 4:           29902                 
## 5:           32940                 
## 6:           23451
# (2) load in real estate value indices 
# link to data source: https://www.zillow.com/research/data/
real_estate_value = fread('zhvi_risq_county_homevalue_index.csv')
head(real_estate_value)
##    V1 GEOID year STATEFP       lon      lat      RegionName      Metro
## 1:  1  1027 2008       1 -85.86055 33.26909            <NA>       <NA>
## 2:  2  1091 2008       1 -87.78952 32.24761            <NA>       <NA>
## 3:  3  1049 2008       1 -85.80414 34.45977  De Kalb County Fort Payne
## 4:  4  1019 2008       1 -85.60379 34.17596 Cherokee County           
## 5:  5  1065 2008       1 -87.62912 32.76266            <NA>       <NA>
## 6:  6  1105 2008       1 -87.29440 32.63847            <NA>       <NA>
##    StateCodeFIPS STATEABBR   STATE MunicipalCodeFIPS meanCountyAverage
## 1:            NA      <NA>    <NA>                NA                NA
## 2:            NA      <NA>    <NA>                NA                NA
## 3:             1        AL Alabama                49          89731.29
## 4:             1        AL Alabama                19         121605.57
## 5:            NA      <NA>    <NA>                NA                NA
## 6:            NA      <NA>    <NA>                NA                NA
##    meanStateAverage MeanStateMinusCounty state_stdev   acs_pc1   acs_pc2
## 1:               NA             4181.528    32186.68 -4.062239 0.0656347
## 2:               NA             4181.528    32186.68 -4.264940 1.9842919
## 3:         128633.2             4181.528    32186.68 -3.596758 0.4997780
## 4:         128633.2             4181.528    32186.68 -3.020672 0.1175345
## 5:               NA             4181.528    32186.68 -4.357674 1.1254512
## 6:               NA             4181.528    32186.68 -6.958061 1.8764552
##       acs_pc3     acs_pc4 ImputedCountyAverages ProductCountyZHVI      grp
## 1: -0.9808717 -0.18526281              85732.22          85732.22 01__2008
## 2: -0.7748036 -1.78987741              93152.16          93152.16 01__2008
## 3: -1.0720726 -0.12143096             102678.99          89731.29 01__2008
## 4: -1.0980994 -0.09957345             105956.24         121605.57 01__2008
## 5: -0.7375154 -1.09452420              84050.58          84050.58 01__2008
## 6:  0.2331603  0.17748290              60758.55          60758.55 01__2008
##    county_homevalue_index
## 1:              0.4471943
## 2:              0.4858980
## 3:              0.4680542
## 4:              0.6343160
## 5:              0.4384226
## 6:              0.3169273
# (3) load in population data (census / migration from American Fact Finder). Choose "county" from "Geographies", then from "Topics" select "People" and then "Basic Count / Estimate"
# link to data source: https://factfinder.census.gov/faces/tableservices/jsf/pages/productview.xhtml?pid=PEP_2018_PEPANNRES&prodType=table
migration = fread('migration.csv') 
head(migration)
##    countycode   2010   2011   2012   2013   2014   2015   2016   2017   2018
## 1:       1001  54754  55208  54936  54713  54876  54838  55242  55443  55601
## 2:       1003 183111 186540 190143 194886 199189 202995 207712 212619 218022
## 3:       1005  27330  27350  27174  26944  26758  26294  25819  25158  24881
## 4:       1007  22872  22747  22664  22516  22541  22562  22576  22555  22400
## 5:       1009  57373  57554  57570  57611  57521  57522  57517  57827  57840
## 6:       1011  10878  10677  10607  10551  10665  10400  10381  10176  10138
# (4) load in inflation data
# link to data source: https://data.bls.gov/timeseries/CUUR0000SA0
inflation = fread('calculator_net.csv')
head(inflation)
##    Year    Jan    Feb    Mar    Apr    May   Jun   Jul   Aug    Sep   Oct   Nov
## 1: 2019  1.55%  1.52%  1.86%  2.00%  1.79% 1.65% 1.81% 1.75%  1.71% 1.76% 2.05%
## 2: 2018  2.07%  2.21%  2.36%  2.46%  2.80% 2.87% 2.95% 2.70%  2.28% 2.52% 2.18%
## 3: 2017  2.50%  2.74%  2.38%  2.20%  1.87% 1.63% 1.73% 1.94%  2.23% 2.04% 2.20%
## 4: 2016  1.37%  1.02%  0.85%  1.13%  1.02% 1.01% 0.84% 1.06%  1.46% 1.64% 1.69%
## 5: 2015 -0.09% -0.03% -0.07% -0.20% -0.04% 0.12% 0.17% 0.20% -0.04% 0.17% 0.50%
## 6: 2014  1.58%  1.13%  1.51%  1.95%  2.13% 2.07% 1.99% 1.70%  1.66% 1.66% 1.32%
##      Dec Annual
## 1: 2.29%  1.81%
## 2: 1.91%  2.44%
## 3: 2.11%  2.13%
## 4: 2.07%  1.26%
## 5: 0.73%  0.12%
## 6: 0.76%  1.62%
# (5) load in county density data
# link to data source: https://www.cdc.gov/nchs/data_access/urban_rural.htm#Data_Files_and_Documentation
density = fread("density.csv")
head(density[, c('countycode', 'state', 'population', 'density_indicator')])
##    countycode state population density_indicator
## 1:       2201    AK          .                 6
## 2:       2232    AK          .                 6
## 3:       2280    AK          .                 6
## 4:      51560    VA          .                 6
## 5:       6037    CA    9962789                 1
## 6:      17031    IL    5231351                 1
# (6) load in is_coastal indicator (census data)
# link to data source: https://www2.census.gov/library/stories/2018/08/coastline-counties-...
coastlines = fread("coastlines.csv")
head(coastlines)
##    countycode   state          coast coast_pop
## 1:       1003 Alabama Gulf of Mexico   208,563
## 2:       1097 Alabama Gulf of Mexico   414,836
## 3:       2013  Alaska        Pacific     3,296
## 4:       2016  Alaska        Pacific     5,647
## 5:       2020  Alaska        Pacific   298,192
## 6:       2050  Alaska        Pacific    17,968
# (7) load in income data set
# link to data source: https://data.world/tylerudite/2015-median-income-by-county/workspace/file?filename=2015+Median+Income+by+County.csv
income = fread("income.csv")
head(income)
##               county_state         county population median_income state
## 1: Autauga County, Alabama Autauga County      55221         51281    AL
## 2: Baldwin County, Alabama Baldwin County     195121         50254    AL
## 3: Barbour County, Alabama Barbour County      26932         32964    AL
## 4:    Bibb County, Alabama    Bibb County      22604         38678    AL
## 5:  Blount County, Alabama  Blount County      57710         45813    AL
## 6: Bullock County, Alabama Bullock County      10678         31938    AL
##    state_name
## 1:    Alabama
## 2:    Alabama
## 3:    Alabama
## 4:    Alabama
## 5:    Alabama
## 6:    Alabama

PHASE 1: Data Pre-Processing / Cleaning

  1. subset damages data to only the variables of interest for the selected time period (migration data begins in 2010); also drop rows with null values for county or payouts
interesting_vars = c('dateofloss',  'countycode', 'amountpaidonbuildingclaim')
damages$dateofloss = parse_date(damages$dateofloss, "%Y-%m-%d")
summary(damages$dateofloss)
##         Min.      1st Qu.       Median         Mean      3rd Qu.         Max. 
## "1970-08-31" "1992-12-11" "2004-07-13" "2001-03-13" "2011-08-27" "2019-08-30"
damage = damages[dateofloss > '2010-01-01']
df = dplyr::select(damage[!is.na(countycode) & !is.na(amountpaidonbuildingclaim)], all_of(interesting_vars))
# the income data encodes county as a 5-character string, so we force all other data sets into the same form for ease of joining later on
df[, countycode := str_pad(countycode, 5, side = "left", pad = "0")]
# (2) subset real estate data to only the variables of interest
real_estate_vars = c('ProductCountyZHVI', 'year', 'GEOID')
real_estate_value = dplyr::select(real_estate_value, all_of(real_estate_vars))
real_estate_value[, countycode := str_pad(GEOID, 5, side = "left", pad = "0")]
real_est = real_estate_value[year >= 2010]
# (3) melt the migration data to hold year as a single column (rather than a dummy indicator)
migration_melt = melt(migration, id.vars = c('countycode'),
                      measure.vars = c('2010', '2011', '2012', '2013', '2014', '2015', '2016', '2017', '2018'),
                      variable.name = c('year'),
                      value.name = c('population'),
                      variable.factor = FALSE)
migration_melt$year = as.numeric(migration_melt$year)
migration_melt[, countycode := str_pad(countycode, 5, side = "left", pad = "0")]

PHASE 1: Data Pre-Processing / Cleaning (Continued)

(4) calculate inflation rate for each year and build a “total inflation multiplier” for each year relative to 2019

inflate = inflation[, c('Year', 'Annual')]
# remove the "%" sign in the raw data and convert to numeric, then put data in form of 1.01 (rather than .01)
inflate$Annual = gsub("%", "", as.character(factor(inflate$Annual)))
inflate$Annual = as.numeric(inflate$Annual)
inflate[, Annual := Annual* .01 + 1]
# compute a total_multiplier converting a given year's dollars to beginning-of-2020 dollars:
# the cumulative product of the annual multipliers from that year through 2019
inflate[, total_multiplier := sapply(Year, function(yr) prod(Annual[Year >= yr]))]
# note that this approach multiplies all 2019 dollars by 1.018 to convert them into beginning-of-2020 dollars
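# sanity check: the 2017 multiplier should be 1.0213 * 1.0244 * 1.0181 = 1.065156,
# which matches the inflate output shown in the appendix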
# (5) reverse the density indicator scale so larger values mean denser (maps 1..6 to 6..1)
density[, density_indicator := abs(density_indicator - 7)]
density[, countycode := str_pad(countycode, 5, side = "left", pad = "0")]
# (6) no cleansing required for coastal data set
coastlines[, countycode := str_pad(countycode, 5, side = "left", pad = "0")]
# (7) use the county.fips lookup table (from the maps package) to get county codes from county names in the income dataset
fips = as.data.table(maps::county.fips)
income[, state_name := tolower(state_name)]
income[, short_county := tolower(gsub(" County", "", county))]
income[, polyname := paste0(state_name, ",", short_county)]
incomes = merge(income, fips, by = c('polyname'), all.x = TRUE)
incomes[, countycode := str_pad(fips, 5, side = "left", pad = "0")]

PHASE 2: Merging Datasets together

damages data is currently daily, so we want to create a new column to represent the year

df = as.data.table(df %>% as_tibble() %>% mutate(
  year = year(dateofloss)
))
# create a new column for the year grouping. We have 10 years of data (2010 - 2019)
# options include: 3 buckets of 3 years; 2 buckets of 5 years, dropping 2019; or year-to-year variation (likely noisy)
# 3 buckets of 3 years seems like a good place to start (we lose a year of data, but we get to see change across an extra time group)
df[, time_group := ifelse(year <= 2012, 2012, 
                              ifelse(year <= 2015, 2015,
                                     ifelse(year <= 2018, 2018, NA)))]
# this will cause us to lose damages data after 2018 (around 3% of the data)
df <- df[!is.na(df$time_group),]

PHASE 2: Merging Datasets together

STEP 1: Adjust historical damages using inflation data

adjust the aggregated damages for inflation (merge inflation by year) and multiply the “total multiplier” by the aggregated amount paid on building claim

df2 <- merge(df, inflate[, c('Year', 'total_multiplier')], by.x=c("year"), by.y = c("Year"))
df2[, adjusted_claims := amountpaidonbuildingclaim * total_multiplier]
# now that all claim dollars are on the same scale (beginning of year 2020)
# we can sum all claim dollars by time group (3-year bucket) and county
df3 = as.data.table(df2 %>% 
                          group_by(time_group, countycode) %>%
                          summarise(total_claims = sum(adjusted_claims, na.rm = TRUE)))
## `summarise()` regrouping output by 'time_group' (override with `.groups` argument)
# applying this adjustment adds roughly 2.6 billion dollars of claims
dist = as.data.table(df2 %>% group_by(year) %>% summarise(count = n(), mult = mean(total_multiplier)))
## `summarise()` ungrouping output (override with `.groups` argument)
ggplot(dist, aes(x = year, y = count))+ geom_bar(stat = 'identity') + ggtitle('Distribution of Claims Data Points by Year')

PHASE 2: Merging Datasets together

STEP 2: Join in exogenous data on coastal counties, income, suburban/rural classes

df4 = merge(df3, coastlines[, c('countycode', 'coast')], by = "countycode", all.x = TRUE)
df4[, is_coastal := ifelse(is.na(coast), 0, 1)]
df4 = select(df4, -c(coast))
# merge in the rural / urban density classes from the CDC
df5 = merge(x = df4, y = density[,c('countycode', 'density_indicator', 'density_label')], by = "countycode", all.x = TRUE)
# drops ~250 rows where density indicator is missing (drops data from 66 counties...)
df5 = df5[!is.na(df5$density_indicator),]
# countycode was left-padded with 0's earlier, so the join keys align with no loss of information
df6 = merge(df5, incomes[, c('countycode', 'median_income')], by = 'countycode', all.x = TRUE)
df6$median_income = as.numeric(df6$median_income)
# drop 409 rows where median income is missing
df6 = df6[!is.na(df6$median_income),]

real estate data is joined below ……

PHASE 2: Merging Datasets together

STEP 3: Join in migration and real estate deltas

compute an identical time group for the migration data to align with damages

migration_melt[, time_group := ifelse(year <= 2012, 2012, 
                                      ifelse(year <= 2015, 2015,
                                             ifelse(year <= 2018, 2018, NA)))]
# compute year to year change in population by county and time group
# set the lag length (bucket size minus 1: compare the first and last year in each 3-year bucket)
bucket_size = 3
migration_melt2 = as.data.table(migration_melt %>%
                                 group_by(countycode) %>%
                                 arrange(year, .by_group = TRUE) %>%
                                 mutate(pct_change = (population - lag(population, n = bucket_size - 1)) / lag(population, n = bucket_size - 1)))
# join migration data to damages data
df7 <- merge(migration_melt2,df6,by=c("time_group", "countycode"))
df7 = df7[time_group == year]
# repeat identical steps for real_estate_value data
# compute identical time group to align with damages
real_est[, time_group := ifelse(year <= 2012, 2012, 
                                         ifelse(year <= 2015, 2015,
                                                ifelse(year <= 2018, 2018, NA)))]
real_estate_value2 = as.data.table(real_est %>%
                                     group_by(countycode) %>%
                                     arrange(year, .by_group = TRUE) %>%
                                     mutate(index_delta = (ProductCountyZHVI - lag(ProductCountyZHVI, n = bucket_size - 1)) / lag(ProductCountyZHVI, n = bucket_size - 1)))
real_estate_value2 = real_estate_value2[time_group == year]
# join real_estate data to damages data
df8 <- merge(real_estate_value2[, c('index_delta', 'time_group', 'countycode')],df7,by=c("time_group", "countycode"))
head(df8)
##    time_group countycode index_delta year population    pct_change total_claims
## 1:       2012      01001 -0.03525311 2012      54936  0.0033239581        0.000
## 2:       2012      01003 -0.09568093 2012     190143  0.0384029359   827838.469
## 3:       2012      01007  0.00000000 2012      22664 -0.0090940888     1936.504
## 4:       2012      01015  0.00000000 2012     117257 -0.0102973573    24432.293
## 5:       2012      01017 -0.09603776 2012      34104 -0.0005275189        0.000
## 6:       2012      01019 -0.06875065 2012      25963 -0.0004235004        0.000
##    is_coastal density_indicator      density_label median_income
## 1:          0                 4       medium metro         51281
## 2:          1                 3        small metro         50254
## 3:          0                 5 large fringe metro         38678
## 4:          0                 3        small metro         41703
## 5:          0                 2       micropolitan         34177
## 6:          0                 1           non-core         36296

WE HAVE NOW SUCCESSFULLY COMPILED OUR DATA SET, END OF PHASE 2

PHASE 3: Analysis

visualize the distribution of migration patterns; it should be roughly normal and slightly positive

ggplot(df8, aes(pct_change))+
  geom_histogram()+
  xlab('Net Migration')+
  ggtitle('Distribution of Migration Patterns')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# hist(df8$pct_change, xlab = 'Net Migration', main = 'Distribution of Migration Patterns')

visualize the distribution of flood claims payouts

ggplot(df8, aes(total_claims))+
  geom_boxplot()+
  ggtitle('Distribution of Flood Claims Payouts')

scale the migration and ZHVI deltas into z-scores, grouping by the population density indicator

df9 <- as.data.table(df8 %>% group_by(density_indicator) %>% mutate(
  z_score = scale(pct_change),
  scaled_index = scale(index_delta)))
hist(df9$z_score)

summary(df9$z_score)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -4.0501 -0.6264 -0.1174  0.0000  0.4843 13.8255
hist(df9$scaled_index)

4.1 Linear Regression

A preliminary linear regression was fit with the scaled index as the response and total_claims, the binary is_coastal indicator, the density indicator, and median income as predictors. Per the summary below, only total claims and median income are statistically significant at the .05 level: every $1 billion increase in adjusted claims is associated with a roughly .33 unit decline in the index, while every $10,000 increase in median income is associated with a roughly .032 unit increase. The coefficients on is_coastal (-.054) and the density indicator (-.011) are negative but not statistically significant. Note that the model explains very little of the overall variance (R-squared ≈ .002), so these associations, while significant, are weak.

Summary of linear regression fit

fit = lm(scaled_index ~ total_claims + is_coastal + density_indicator + median_income, data = df9)
summary(fit)
## 
## Call:
## lm(formula = scaled_index ~ total_claims + is_coastal + density_indicator + 
##     median_income, data = df9)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9290 -0.5713 -0.3421  0.7061  3.7075 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)   
## (Intercept)       -1.177e-01  5.203e-02  -2.263  0.02369 * 
## total_claims      -3.321e-10  1.648e-10  -2.015  0.04395 * 
## is_coastal        -5.409e-02  4.557e-02  -1.187  0.23529   
## density_indicator -1.102e-02  1.017e-02  -1.083  0.27903   
## median_income      3.207e-06  1.240e-06   2.586  0.00973 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9989 on 5815 degrees of freedom
## Multiple R-squared:  0.002024,   Adjusted R-squared:  0.001337 
## F-statistic: 2.948 on 4 and 5815 DF,  p-value: 0.01905

4.2 Principal Component Analysis

Principal component analysis was conducted on claims_per_capita, the density_indicator, and median income (an open question: should the density indicator go into the PCA if is_coastal didn’t?). Regressing the z-scored migration fluxes on the resulting component scores shows that all three principal components are statistically significant predictors of the outcome.
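The claims_per_capita column used here and below does not appear in the preprocessing code above; per the definition in section 2 (county-level claims divided by county population), it is assumed to be derived roughly as:

df9[, claims_per_capita := total_claims / population]  # assumption: per-capita claims, per the section 2 definition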

df.pca <- prcomp(formula = ~ claims_per_capita + density_indicator + median_income, 
                 data = df9, 
                 center = TRUE,
                 scale. = TRUE)
lm.pca = lm(df9$z_score ~ df.pca$x[,1] + df.pca$x[,2] + df.pca$x[,3])
summary(lm.pca)
## 
## Call:
## lm(formula = df9$z_score ~ df.pca$x[, 1] + df.pca$x[, 2] + df.pca$x[, 
##     3])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.0330 -0.6170 -0.1202  0.4698 13.0367 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   7.005e-17  1.265e-02   0.000 1.000000    
## df.pca$x[, 1] 9.892e-02  1.019e-02   9.706  < 2e-16 ***
## df.pca$x[, 2] 4.528e-02  1.265e-02   3.578 0.000348 ***
## df.pca$x[, 3] 3.339e-01  1.866e-02  17.897  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.965 on 5816 degrees of freedom
## Multiple R-squared:  0.06844,    Adjusted R-squared:  0.06796 
## F-statistic: 142.4 on 3 and 5816 DF,  p-value: < 2.2e-16

Contribution of each variable to each component

df.pca$rotation # PC2 is mostly claims per capita
##                         PC1         PC2           PC3
## claims_per_capita 0.0266746 -0.99964402  0.0005415214
## density_indicator 0.7068603  0.01847888 -0.7071117900
## median_income     0.7068501  0.01924471  0.7071015650

4.3 Permutation Tests

A series of permutation tests was conducted to determine whether the association between a particular target and a given label was significant. To check whether losses are an important factor in migration patterns, we assigned a label to each row of our data set based on the value of per capita claims: any row with per capita claims greater than 7 (the 75th percentile of claims_per_capita) was labeled “high,” indicating elevated losses, and all other rows were labeled “low.” The absolute difference between the mean z_score of the “high” group and the “low” group served as our test statistic. We then generated 10,000 samples of our data by shuffling the assignment of the “high” / “low” labels, recomputing and storing the absolute difference in mean z-scores for each sample. These values form the permutation distribution of the test statistic. Plotting this distribution alongside the originally observed test statistic gives a visual sense of how much the statistic actually depends on the labels, and the p-value (the share of permuted statistics at least as large as the observed one) measures how likely our result would be if there were no dependence on the label.
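A minimal sketch of this procedure, assuming df9 carries the claims_per_capita column (the migration_test, real_estate_test, and real_estate_test2 objects below are assumed to come from similar code):

# helper returning the observed statistic and the permutation distribution
permutation_test = function(values, labels, n_perm = 10000) {
  observed = abs(mean(values[labels == "high"]) - mean(values[labels == "low"]))
  permuted = replicate(n_perm, {
    shuffled = sample(labels)  # shuffle labels to break any dependence on them
    abs(mean(values[shuffled == "high"]) - mean(values[shuffled == "low"]))
  })
  list(initial_test_statistic = observed, permutation_results = permuted)
}
# label rows by whether per capita claims exceed the 75th percentile
labels = ifelse(df9$claims_per_capita > quantile(df9$claims_per_capita, 0.75), "high", "low")
migration_test = permutation_test(df9$z_score, labels)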

i. We can reject the hypothesis that claims have no effect on migration patterns (p = 5e-04).
plot(density(migration_test$permutation_results), 
     xlab="Absolute Difference in Average Migration Index Between Groups With Damages in Top 25% vs Bottom 75%" , 
     main="Observed Test Statistic vs Distribution of Simulated Test Statistics", las=1)
abline(v=migration_test$initial_test_statistic, col="blue", lty="dotted")
text(.11, 10, "observed test statistic", col="blue", cex=0.7)
text(.07, 20, "probability of our result if claims have no affect on migration", col="black", cex=0.7)

mean(migration_test$permutation_results >= migration_test$initial_test_statistic)
## [1] 5e-04
ii. We fail to reject (at the 1% level) the hypothesis that total claims have no effect on ZHVI (p = 0.011).
plot(density(real_estate_test$permutation_results), 
     xlab="Absolute Difference in Average ZHVI Index Between Groups With Damages in Top 25% vs Bottom 75% " , 
     main="Observed Test Statistic vs Distribution of Simulated Test Statistics", las=1)
abline(v=real_estate_test$initial_test_statistic, col="blue", lty="dotted")
text(.11, 10, "observed test statistic", col="blue", cex=0.7)
text(.07, 20, "simulated distribution of test statistic on permuted data", col="black", cex=0.7)

mean(real_estate_test$permutation_results >= real_estate_test$initial_test_statistic)
## [1] 0.011
iii. Relationship between ZHVI and migration: we can reject the hypothesis that migration has no effect on ZHVI (p ≈ 0).
plot(density(real_estate_test2$permutation_results), 
     xlab="Absolute Difference in Average Migration Index Between Groups With Migration Indices in Top 25% vs Bottom 75% " , 
     main="Observed vs Simulated Test Statistic", las=1,
     xlim = c(0, real_estate_test2$initial_test_statistic*1.2))
abline(v=real_estate_test2$initial_test_statistic, col="blue", lty="dotted")
text(.11, 10, "observed test statistic", col="blue", cex=0.7)
text(.07, 20, "simulated distribution of test statistic on permuted data", col="black", cex=0.7)

mean(real_estate_test2$permutation_results >= real_estate_test2$initial_test_statistic)
## [1] 0

5. Next Steps

Correlations & Plots

Migration vs Claims:

print(cor(test_dat$claims_per_capita, test_dat$z_score))
## [1] -0.04111326
ggplot(test_dat[claims_per_capita < 10], aes(y = z_score, x = claims_per_capita))+
  geom_point()+
  ggtitle('Migration vs Claims')

Migration vs Real Estate Value:

print(cor(test_dat$scaled_index, test_dat$z_score))
## [1] 0.2692948
ggplot(test_dat[claims_per_capita < 10], aes(y = z_score, x = scaled_index))+
  geom_point()+
  ggtitle('Migration vs Real Estate Value')

Real Estate Value vs Claims:

print(cor(test_dat$claims_per_capita, test_dat$scaled_index))
## [1] -0.02666343
ggplot(test_dat[claims_per_capita < 10], aes(y = scaled_index, x = claims_per_capita))+geom_point()

Appendix – Review of Data Sets

The first data set contains information regarding total claims payouts made for flood-related damages, available on FEMA’s website. The full dataset contains 39 variables, but only 3 were used: date, county, and total payouts.

head(df[, c('dateofloss', 'amountpaidonbuildingclaim', 'countycode')], 3)
##    dateofloss amountpaidonbuildingclaim countycode
## 1: 2017-08-27                 195857.43      48201
## 2: 2015-02-21                   2618.18      25001
## 3: 2012-05-20                      0.00      16059

The second data set contains information downloaded from Zillow’s website. The home value index draws on Zestimates calculated for over 100 million U.S. homes, including new-construction homes and homes that have not traded on the open market in many years. Each month, every home’s current Zestimate is compared to the previous month’s and the % change is tracked. Those changes are weighted according to each home’s overall value and then aggregated across the housing market in a particular market segment.
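In other words, the index change for a segment is, roughly, a value-weighted mean of per-home changes. An illustrative sketch of that weighting (hypothetical numbers; not Zillow’s published methodology):

home_value = c(250000, 400000, 150000)      # hypothetical current home values
pct_change = c(0.010, -0.005, 0.002)        # hypothetical month-over-month Zestimate changes
index_change = sum(home_value * pct_change) / sum(home_value)  # value-weighted mean change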

Three of the 24 variables in this dataset were of interest: the index, the year, and the county code.

head(real_est[, c('ProductCountyZHVI', 'year', 'GEOID')], 3)
##    ProductCountyZHVI year GEOID
## 1:          85732.22 2010  1027
## 2:          93152.16 2010  1091
## 3:          83986.03 2010  1049

The third data set contains census data downloaded from American Fact Finder. It was obtained by navigating to the “Geographies” section and choosing “county,” then navigating to the “Topics” section and choosing “People” and then “Basic Count / Estimate.”

head(migration_melt, 3)
##    countycode year population time_group
## 1:      01001 2010      54754       2012
## 2:      01003 2010     183111       2012
## 3:      01005 2010      27330       2012

The fourth data set was downloaded from the Bureau of Labor Statistics and contains inflation data. This data was used to normalize the historical claims amounts into a common scale (2020 dollars).

head(inflate, 3)
##    Year Annual total_multiplier
## 1: 2019 1.0181         1.018100
## 2: 2018 1.0244         1.042942
## 3: 2017 1.0213         1.065156

The fifth data set was downloaded from the CDC and carries a “density” indicator describing whether a given county is a major metro area or a rural village. There are 6 classifications of population density in total.

head(density, 3)
##    countycode state                            county CBSA title population
## 1:       2201    AK         Prince of Wales-Outer Ket                     .
## 2:       2232    AK Skagway-Hoonah-Angoon Census Area                     .
## 3:       2280    AK   Wrangell-Petersburg Census Area                     .
##    density_indicator density_label
## 1:                 6      non-core
## 2:                 6      non-core
## 3:                 6      non-core
unique(density$density_label)
## [1] "non-core"            "large central metro" "large fringe metro" 
## [4] "medium metro"        "small metro"         "micropolitan"

The sixth data set was taken from the Census and can be found by navigating to “library/stories/2018/08” and selecting the “coastline-counties” file. This data identifies the counties considered coastal (roughly 255 of the ~3,000 U.S. counties).

head(coastlines, 3)
##    countycode   state          coast coast_pop
## 1:       1003 Alabama Gulf of Mexico   208,563
## 2:       1097 Alabama Gulf of Mexico   414,836
## 3:       2013  Alaska        Pacific     3,296

The final data set contains county-level median household income and was taken from Data World.

head(incomes, 3)
##           polyname            county_state         county population
## 1: alabama,autauga Autauga County, Alabama Autauga County      55221
## 2: alabama,baldwin Baldwin County, Alabama Baldwin County     195121
## 3: alabama,barbour Barbour County, Alabama Barbour County      26932
##    median_income state state_name short_county fips countycode
## 1:         51281    AL    alabama      autauga 1001      01001
## 2:         50254    AL    alabama      baldwin 1003      01003
## 3:         32964    AL    alabama      barbour 1005      01005