Part I

What are different ways in which two variables might be correlated (i.e. direct causation, codetermination, etc…) ? Please give one example for each.

ANSWER: direct causation: cookies are very sweet because it contains a lot of sugar (sweetness of cookie vs amount of sugar added) codetermination: more butter and flour will make cookies have higher calorie (calorie vs amount of different ingredients’ energy ) reverse causation: sweetness of cookie decreases because of the increase amount of flour used (sweetness of cookie vs amount of flour) hidden variables (confounders): adding more sugar to cookie will increase the calorie and increase the sweetness

Part II

The output of the bivariate regression of Home Energy Rating System (HERS) Index score on annual utility cost per square foot, for our sample of 20,452 homes with HERS ratings, tells us that there is a constant of 0.634 and a coefficient of 0.006.

2)(a) Write the equation for this bivariate regression. y = bx + k y = 0.006x + 0.634

x = Annual utility cost per square foot y = Index score

2(b) A one unit increase in the HERS index score is associated with how much change in the annual utility costs for a 2,000 sf home? 1 = 0.006x

x=1/0.006
x*2000
## [1] 333333.3

Part III

SUMMARY

We will continue to work on the FEMA claims data. Here is the CSV and codebook:

Setting up working environment
## 
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6     ✔ purrr   0.3.4
## ✔ tibble  3.1.7     ✔ dplyr   1.0.9
## ✔ tidyr   1.2.0     ✔ stringr 1.4.0
## ✔ readr   2.1.2     ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## 
## Attaching package: 'scales'
## 
## 
## The following object is masked from 'package:purrr':
## 
##     discard
## 
## 
## The following object is masked from 'package:readr':
## 
##     col_factor
## 
## 
## Learn more about sjPlot with 'browseVignettes("sjPlot")'.

Create a new table of data from the “MASSZONG.xls”

femaclaim <- read.csv("fema_claims_random.csv")

We would like to know the relationship between the total amount of building claims (variable: amountpaidonbuildingclaim) and the types of Elevation Certificate (variable: elevationcertificateindicator).

  1. Create a bar plot of types of certificates (hint: In order for the elevation certificate variable to be read as categorical data, you need to convert the variable as factor).
  2. Plot the histogram of the amount paid on building claims.
  3. Write out the equation for the fitted model.
  4. Plot the relationship between two variables (hint: make sure you exclude NAs before you plot).
  5. Fit the model and paste the result.
  6. In a short paragraph, describe how these two variables are associated. Make sure to explicitly discuss and interpret each coefficient (the hypothesis testing using t-stats and p-value) and R2.
  7. If you could add variables to increase the explanatory power of the model, what variable would you want to add? Why?
ec <-
  femaclaim$elevationcertificateindicator %>%
  na.omit() %>%
  as.factor() %>%
  plot()

2. Plot the histogram of the amount paid on building claims.

bc <- (na.omit(femaclaim$amountpaidonbuildingclaim))

femaclaim %>%
  ggplot(aes(amountpaidonbuildingclaim, na.rm = TRUE))+
  geom_histogram(bins = 100)+
  xlim(0, 90000)
## Warning: Removed 42383 rows containing non-finite values (stat_bin).
## Warning: Removed 2 rows containing missing values (geom_bar).

3. Write out the equation for the fitted model. buildingclaim = coefficient*elevationcertificateindicator + constant y = 31,322.1-9314.2(x_2 )+8661.3(x_3 )+ 6015.5(x_4) + ε

  1. Plot the relationship between two variables (hint: make sure you exclude NAs before you plot).
cor <- femaclaim %>%
  ggplot (aes(x=elevationcertificateindicator, y=amountpaidonbuildingclaim)) +
  ylim(1,1050000) +
  geom_point(color = as.factor(femaclaim$elevationcertificateindicator), alpha = 0.05, size = 4, fill = "white") 

cor
## Warning: Removed 106575 rows containing missing values (geom_point).

  1. Fit the model and paste the result.
lm <- lm(amountpaidonbuildingclaim~as.factor(elevationcertificateindicator), data = femaclaim)
summary(lm)
## 
## Call:
## lm(formula = amountpaidonbuildingclaim ~ as.factor(elevationcertificateindicator), 
##     data = femaclaim)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -39983 -28888 -18363  10397 972957 
## 
## Coefficients:
##                                           Estimate Std. Error t value Pr(>|t|)
## (Intercept)                                31322.1      473.8  66.110  < 2e-16
## as.factor(elevationcertificateindicator)2  -9314.2     1115.6  -8.349  < 2e-16
## as.factor(elevationcertificateindicator)3   8661.3      750.7  11.538  < 2e-16
## as.factor(elevationcertificateindicator)4   6015.5     2331.5   2.580  0.00989
##                                              
## (Intercept)                               ***
## as.factor(elevationcertificateindicator)2 ***
## as.factor(elevationcertificateindicator)3 ***
## as.factor(elevationcertificateindicator)4 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 50690 on 22032 degrees of freedom
##   (106483 observations deleted due to missingness)
## Multiple R-squared:  0.01243,    Adjusted R-squared:  0.01229 
## F-statistic: 92.42 on 3 and 22032 DF,  p-value: < 2.2e-16
  1. In a short paragraph, describe how these two variables are associated. Make sure to explicitly discuss and interpret each coefficient (the hypothesis testing using t-stats and p-value) and R2.

-if the certificate indicator increase by 1, the amount paid on building claim increase by 3757.2. The baseline one expects to get (with code 1) for building claim is 3757.2. Newer certificate without BFE will give one more building claim. -p value and t value are smaller than 0.01 significance level, thus we can reject the null hypothesis. The amount paid on building claim and elevation certificate indicator are correlated

  1. If you could add variables to increase the explanatory power of the model, what variable would you want to add? Why? Since we are looking at elevation certificate’s relationship with building claim, I would also water to add elevation difference in this model

Because the p values are larger than the critical value, we can not reject the null hypothesis. The average total insurance amount paid in dollars on the building has not changed over time.


Now we want to see the relationship between the total amount of contents claims (variable: amountpaidoncontentsclaim) and the elevation of the lowest floor (variable: lowestfloorelevation).

Plot the histogram of the total amount of insurance claims for contents.

ggplot(femaclaim, aes(x = amountpaidoncontentsclaim, na.rm = TRUE)) + geom_histogram(bins=100) + xlim(0,100000)
## Warning: Removed 79726 rows containing non-finite values (stat_bin).
## Warning: Removed 2 rows containing missing values (geom_bar).

Plot the histogram of the elevation of the lowest floors.

#whole range
ggplot(femaclaim, aes(x = lowestfloorelevation, na.rm = TRUE)) + geom_histogram(bins = 100) + xlim(-100,1000)
## Warning: Removed 481 rows containing non-finite values (stat_bin).
## Warning: Removed 2 rows containing missing values (geom_bar).

#focused range with y limit
ggplot(femaclaim, aes(x = lowestfloorelevation, na.rm = TRUE)) + geom_histogram(bins = 100) + xlim(-100,1000)+ylim(0,300)
## Warning: Removed 481 rows containing non-finite values (stat_bin).
## Warning: Removed 11 rows containing missing values (geom_bar).

Write out the equation for the fitted model.

-total amount of contents claims (variable: amountpaidoncontentsclaim) = intercept + coefficient * elevation of the lowest floor (variable: lowestfloorelevation)

Plot the relationship between two variables.

femaclaim %>%
  select(amountpaidoncontentsclaim, lowestfloorelevation) %>%
  na.omit() %>%
  ggplot(aes(x=lowestfloorelevation, y=amountpaidoncontentsclaim)) + geom_point()

Fit the model and paste the result.

lm_2 <- lm(amountpaidoncontentsclaim~lowestfloorelevation, data = femaclaim)
summary(lm_2)
## 
## Call:
## lm(formula = amountpaidoncontentsclaim ~ lowestfloorelevation, 
##     data = femaclaim)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -25827 -12837  -9120   1416 485880 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          1.412e+04  1.300e+02 108.575  < 2e-16 ***
## lowestfloorelevation 1.171e+00  3.377e-01   3.468 0.000526 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 28760 on 49130 degrees of freedom
##   (79387 observations deleted due to missingness)
## Multiple R-squared:  0.0002447,  Adjusted R-squared:  0.0002243 
## F-statistic: 12.02 on 1 and 49130 DF,  p-value: 0.0005258

In a short paragraph, describe how these two variables are associated. Make sure to explicitly discuss and interpret each coefficient (the hypothesis testing using t-stats and p-value) and R2.

-The intercept is 1412, indicating that the baseline (min number of lowest floor level) gets 1412 total amount of contents claims. With one level of floor elevation change, the amount of claim increase by 1.17. The p-value and t value are below significance level 0.01, hence we can reject the null hypothesis. The total amount of contents claims and elevation of the lowest floor are correlated.

If you could add variables to increase the explanatory power of the model, what variable would you want to add? Why?

-I would include the number of floors in the insured building, because this model concerns with the floor levels’ relationship with amount of claim for contents