We will work on this exercise with a new data set ("fema_claims_random.csv"). First, a little story about this data set: in 1968, the US Congress passed the National Flood Insurance Act, creating the National Flood Insurance Program (NFIP) to reduce future flood losses through flood hazard identification, floodplain management, and insurance protection. In other words, the NFIP offers insurance coverage for building structures, as well as for contents and personal property within those structures, to eligible and insurable properties. The data you'll be using is derived from the NFIP system of record, staged in the NFIP reporting platform, and redacted to protect policyholders' personally identifiable information. The original dataset has more than 2.5 million observations, but here you will use a 5% random sample stratified by the state where the property is located. That is, within each state, we took a 5% random sample of the NFIP records. Each row in the data set represents one record, and you can look at the description of each variable in the accompanying workbook. For this homework, we will be using just a subset of variables.

Imagine you are working for a planning research group that is trying to understand the consequences of flooding on the real estate market. For your annual report, you would like to provide the reader with a general description of FEMA's NFIP program across space and time.
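For reference, a stratified sample like the one described above can be drawn with dplyr's grouped sampling. This is a minimal sketch, not the actual extraction code; the input file name fema_claims_full.csv is hypothetical:

library(dplyr)

#hypothetical: the full NFIP extract (~2.5 million rows)
fema_full <- read.csv("fema_claims_full.csv")

#within each state, keep a 5% simple random sample
set.seed(1)
fema_sample <- fema_full %>%
  group_by(state) %>%
  slice_sample(prop = 0.05) %>%
  ungroup()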
#load the packages used in this exercise
library(janitor)    #tabyl() and cleaning helpers; masks stats::chisq.test and stats::fisher.test
library(tidyverse)  #tidyverse 1.3.2: ggplot2, dplyr, tidyr, readr, purrr, etc.; dplyr::filter() and dplyr::lag() mask the stats versions
library(scales)     #axis/label formatting; masks purrr::discard and readr::col_factor
library(readxl)     #read_xls() for the zoning workbook
Create a new table of data from "MASSZONING.xls", then read in the FEMA claims file:
masszoning <- read_xls("MASSZONING.xls")
femaclaim <- read.csv("fema_claims_random.csv")
view(femaclaim)
#extract the two coverage variables; contents coverage has missing values, so drop the NAs
building_insurance <- femaclaim$totalbuildinginsurancecoverage
content_insurance <- na.omit(femaclaim$totalcontentsinsurancecoverage)
#find the sample mean
x_hat_building <- mean(building_insurance)
x_hat_building
## [1] 157332.1
x_hat_content <- mean(content_insurance)
x_hat_content
## [1] 30321.81
#find the median
median_building <- median(building_insurance)
median_building
## [1] 100000
median_content <- median(content_insurance)
median_content
## [1] 11500
#find the standard deviation
sd_building <- sd(building_insurance)
sd_building
## [1] 1176810
sd_content <- sd(content_insurance)
sd_content
## [1] 49813.11
The mean total building insurance coverage (~$157,000) is roughly five times the mean total contents coverage (~$30,000). The medians ($100,000 vs. $11,500) show the same gap, and in both cases the mean far exceeds the median, so both distributions are strongly right-skewed.
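As a side note, the same statistics can be computed in a single pass with dplyr's summarize(); this sketch just restates the calculations above, with na.rm = TRUE playing the role of na.omit():

#all six summary statistics in one call
femaclaim %>%
  summarize(
    mean_building   = mean(totalbuildinginsurancecoverage, na.rm = TRUE),
    mean_contents   = mean(totalcontentsinsurancecoverage, na.rm = TRUE),
    median_building = median(totalbuildinginsurancecoverage, na.rm = TRUE),
    median_contents = median(totalcontentsinsurancecoverage, na.rm = TRUE),
    sd_building     = sd(totalbuildinginsurancecoverage, na.rm = TRUE),
    sd_contents     = sd(totalcontentsinsurancecoverage, na.rm = TRUE)
  )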
#number of flood claims by year of loss
ggplot(data = femaclaim, aes(x = yearofloss)) + geom_bar()
• What's the year with the highest number of flood claims? What extreme weather event could explain this pattern?

2005 has the highest number of flood claims, due to Hurricane Katrina.
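To read the peak off the data rather than the plot, one can count claims per year; a minimal sketch:

#year(s) with the largest number of claims
femaclaim %>%
  count(yearofloss) %>%
  slice_max(n, n = 1)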
Ho: the average total building insurance coverage is the same across the years compared. Ha: the average total building insurance coverage differs between the years compared.
a <- 0.05
#pull() keeps only the numeric coverage column, so t.test() compares
#vectors rather than two-column data frames that mix yearofloss into the means
fema_subset_2000 <- femaclaim %>% filter(yearofloss == 2000) %>% pull(totalbuildinginsurancecoverage)
fema_subset_2010 <- femaclaim %>% filter(yearofloss == 2010) %>% pull(totalbuildinginsurancecoverage)
fema_subset_2020 <- femaclaim %>% filter(yearofloss == 2020) %>% pull(totalbuildinginsurancecoverage)
t.test(fema_subset_2000, fema_subset_2010)
##
## Welch Two Sample t-test
##
## data: fema_subset_2000 and fema_subset_2010
## t = -0.77039, df = 5303.8, p-value = 0.4411
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -127728.76 55661.36
## sample estimates:
## mean of x mean of y
## 103766.3 139800.0
t.test(fema_subset_2010, fema_subset_2020)
##
## Welch Two Sample t-test
##
## data: fema_subset_2010 and fema_subset_2020
## t = -0.61511, df = 5541.3, p-value = 0.5385
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -120275.11 62824.09
## sample estimates:
## mean of x mean of y
## 139800.0 168525.5
t.test(fema_subset_2000, fema_subset_2010)$p.value < a
## [1] FALSE
t.test(fema_subset_2010, fema_subset_2020)$p.value < a
## [1] FALSE
Because both p-values are larger than the significance level (a = 0.05), we cannot reject the null hypothesis: there is no evidence that the average total building insurance coverage changed between 2000, 2010, and 2020.
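Pairwise t-tests only compare two years at a time, and running several of them inflates the chance of a false positive. As a complementary sketch, not part of the assignment's method, a one-way ANOVA tests all three years in a single model:

#one-way ANOVA: does mean building coverage differ across 2000, 2010, and 2020?
fema_three_years <- femaclaim %>%
  filter(yearofloss %in% c(2000, 2010, 2020))
summary(aov(totalbuildinginsurancecoverage ~ factor(yearofloss),
            data = fema_three_years))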
Ho: among 2005 flood claims, occupancy type is independent of state (Louisiana vs. Mississippi); Ha: occupancy type is not independent of state, i.e., the two variables are associated.

a <- 0.01
fema_subset_types <- femaclaim %>% filter(state %in% c('MS', 'LA') & yearofloss == 2005) %>% select(state, occupancytype)
#inspect the new table
head(fema_subset_types)
## state occupancytype
## 1 LA 1
## 2 LA 1
## 3 LA 1
## 4 LA 1
## 5 LA 1
## 6 LA 1
fema_subset_types_chisqtest <- chisq.test(fema_subset_types$state, fema_subset_types$occupancytype)
## Warning in stats::chisq.test(x, y, ...): Chi-squared approximation may be
## incorrect
fema_subset_types_chisqtest
##
## Pearson's Chi-squared test
##
## data: fema_subset_types$state and fema_subset_types$occupancytype
## X-squared = 73.182, df = 6, p-value = 9.084e-14
# the p-value of our test is 9.084e-14, which is smaller than the significance level a = 0.01
fema_subset_types_chisqtest$p.value < a
## [1] TRUE
We reject the null hypothesis: occupancy type is not independent of state. Among 2005 flood claims, the mix of occupancy types differs between Louisiana and Mississippi.
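The "Chi-squared approximation may be incorrect" warning typically means some cells of the contingency table have small expected counts. One way to check, and to see which occupancy types drive the association, is to inspect the standard components of the test object; a sketch:

#observed counts by state and occupancy type
table(fema_subset_types$state, fema_subset_types$occupancytype)

#expected counts under independence; cells below ~5 trigger the warning
fema_subset_types_chisqtest$expected

#standardized residuals: cells with large absolute values drive the association
fema_subset_types_chisqtest$stdres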