What are different ways in which two variables might be correlated (i.e. direct causation, codetermination, etc…) ? Please give one example for each.
ANSWER: direct causation: cookies are very sweet because it contains a lot of sugar (sweetness of cookie vs amount of sugar added) codetermination: more butter and flour will make cookies have higher calorie (calorie vs amount of different ingredients’ energy ) reverse causation: sweetness of cookie decreases because of the increase amount of flour used (sweetness of cookie vs amount of flour) hidden variables (confounders): adding more sugar to cookie will increase the calorie and increase the sweetness
The output of the bivariate regression of Home Energy Rating System (HERS) Index score on annual utility cost per square foot, for our sample of 20,452 homes with HERS ratings, tells us that there is a constant of 0.634 and a coefficient of 0.006.
2)(a) Write the equation for this bivariate regression. y = bx + k y = 0.006x + 0.634
x = Annual utility cost per square foot y = Index score
2(b) A one unit increase in the HERS index score is associated with how much change in the annual utility costs for a 2,000 sf home? 1 = 0.006x
x=1/0.006
x*2000
## [1] 333333.3
We will continue to work on the FEMA claims data. Here is the CSV and codebook:
##
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.7 ✔ dplyr 1.0.9
## ✔ tidyr 1.2.0 ✔ stringr 1.4.0
## ✔ readr 2.1.2 ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
##
## Attaching package: 'scales'
##
##
## The following object is masked from 'package:purrr':
##
## discard
##
##
## The following object is masked from 'package:readr':
##
## col_factor
##
##
## Learn more about sjPlot with 'browseVignettes("sjPlot")'.
Create a new table of data from the “MASSZONG.xls”
femaclaim <- read.csv("fema_claims_random.csv")
ec <-
femaclaim$elevationcertificateindicator %>%
na.omit() %>%
as.factor() %>%
plot()
2. Plot the histogram of the amount paid on building claims.
bc <- (na.omit(femaclaim$amountpaidonbuildingclaim))
femaclaim %>%
ggplot(aes(amountpaidonbuildingclaim, na.rm = TRUE))+
geom_histogram(bins = 100)+
xlim(0, 90000)
## Warning: Removed 42383 rows containing non-finite values (stat_bin).
## Warning: Removed 2 rows containing missing values (geom_bar).
3. Write out the equation for the fitted model. buildingclaim =
coefficient*elevationcertificateindicator + constant y =
31,322.1-9314.2(x_2 )+8661.3(x_3 )+ 6015.5(x_4) + ε
cor <- femaclaim %>%
ggplot (aes(x=elevationcertificateindicator, y=amountpaidonbuildingclaim)) +
ylim(1,1050000) +
geom_point(color = as.factor(femaclaim$elevationcertificateindicator), alpha = 0.05, size = 4, fill = "white")
cor
## Warning: Removed 106575 rows containing missing values (geom_point).
lm <- lm(amountpaidonbuildingclaim~as.factor(elevationcertificateindicator), data = femaclaim)
summary(lm)
##
## Call:
## lm(formula = amountpaidonbuildingclaim ~ as.factor(elevationcertificateindicator),
## data = femaclaim)
##
## Residuals:
## Min 1Q Median 3Q Max
## -39983 -28888 -18363 10397 972957
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 31322.1 473.8 66.110 < 2e-16
## as.factor(elevationcertificateindicator)2 -9314.2 1115.6 -8.349 < 2e-16
## as.factor(elevationcertificateindicator)3 8661.3 750.7 11.538 < 2e-16
## as.factor(elevationcertificateindicator)4 6015.5 2331.5 2.580 0.00989
##
## (Intercept) ***
## as.factor(elevationcertificateindicator)2 ***
## as.factor(elevationcertificateindicator)3 ***
## as.factor(elevationcertificateindicator)4 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 50690 on 22032 degrees of freedom
## (106483 observations deleted due to missingness)
## Multiple R-squared: 0.01243, Adjusted R-squared: 0.01229
## F-statistic: 92.42 on 3 and 22032 DF, p-value: < 2.2e-16
-if the certificate indicator increase by 1, the amount paid on building claim increase by 3757.2. The baseline one expects to get (with code 1) for building claim is 3757.2. Newer certificate without BFE will give one more building claim. -p value and t value are smaller than 0.01 significance level, thus we can reject the null hypothesis. The amount paid on building claim and elevation certificate indicator are correlated
Because the p values are larger than the critical value, we can not reject the null hypothesis. The average total insurance amount paid in dollars on the building has not changed over time.
Plot the histogram of the total amount of insurance claims for contents.
ggplot(femaclaim, aes(x = amountpaidoncontentsclaim, na.rm = TRUE)) + geom_histogram(bins=100) + xlim(0,100000)
## Warning: Removed 79726 rows containing non-finite values (stat_bin).
## Warning: Removed 2 rows containing missing values (geom_bar).
Plot the histogram of the elevation of the lowest floors.
#whole range
ggplot(femaclaim, aes(x = lowestfloorelevation, na.rm = TRUE)) + geom_histogram(bins = 100) + xlim(-100,1000)
## Warning: Removed 481 rows containing non-finite values (stat_bin).
## Warning: Removed 2 rows containing missing values (geom_bar).
#focused range with y limit
ggplot(femaclaim, aes(x = lowestfloorelevation, na.rm = TRUE)) + geom_histogram(bins = 100) + xlim(-100,1000)+ylim(0,300)
## Warning: Removed 481 rows containing non-finite values (stat_bin).
## Warning: Removed 11 rows containing missing values (geom_bar).
Write out the equation for the fitted model.
-total amount of contents claims (variable: amountpaidoncontentsclaim) = intercept + coefficient * elevation of the lowest floor (variable: lowestfloorelevation)
Plot the relationship between two variables.
femaclaim %>%
select(amountpaidoncontentsclaim, lowestfloorelevation) %>%
na.omit() %>%
ggplot(aes(x=lowestfloorelevation, y=amountpaidoncontentsclaim)) + geom_point()
Fit the model and paste the result.
lm_2 <- lm(amountpaidoncontentsclaim~lowestfloorelevation, data = femaclaim)
summary(lm_2)
##
## Call:
## lm(formula = amountpaidoncontentsclaim ~ lowestfloorelevation,
## data = femaclaim)
##
## Residuals:
## Min 1Q Median 3Q Max
## -25827 -12837 -9120 1416 485880
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.412e+04 1.300e+02 108.575 < 2e-16 ***
## lowestfloorelevation 1.171e+00 3.377e-01 3.468 0.000526 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 28760 on 49130 degrees of freedom
## (79387 observations deleted due to missingness)
## Multiple R-squared: 0.0002447, Adjusted R-squared: 0.0002243
## F-statistic: 12.02 on 1 and 49130 DF, p-value: 0.0005258
In a short paragraph, describe how these two variables are associated. Make sure to explicitly discuss and interpret each coefficient (the hypothesis testing using t-stats and p-value) and R2.
-The intercept is 1412, indicating that the baseline (min number of lowest floor level) gets 1412 total amount of contents claims. With one level of floor elevation change, the amount of claim increase by 1.17. The p-value and t value are below significance level 0.01, hence we can reject the null hypothesis. The total amount of contents claims and elevation of the lowest floor are correlated.
If you could add variables to increase the explanatory power of the model, what variable would you want to add? Why?
-I would include the number of floors in the insured building, because this model concerns with the floor levels’ relationship with amount of claim for contents