question: Identify and describe at least 4 different pitfalls to watch out for in regressions that might affect the reliability of your coefficient estimates. -multicollinearity: when using multivariate regression, if the independent variables are positively correlated, then the coefficient will get larger -multicollinearity: when using multivariate regression, if the independent variables are negatively correlated, then the coefficient will get smaller -overfitting: the production of an analysis that corresponds too closely or exactly to a particular set of data, and may therefore fail to fit to additional data or predict future observations reliably -extrapolation: predict beyond the “scope of the model” when one uses an estimated regression equation to predict a new response for values not in the range of the sample data used to determine the estimated regression equation. -excluding Important Predictor Variables: the linear association between two variables ignoring other relevant variables can differ both in magnitude and direction from the association that controls for other relevant variables.
question: Why would you use a logistic regression or logit model instead of a linear regression? answer: Linear regression is used to predict the continuous dependent variable. Logistic Regression is used to predict the categorical dependent variable (ex: yes or no, number of hours…)
We will continue to work on the FEMA claims data. Here is the CSV and codebook:
##
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.7 ✔ dplyr 1.0.9
## ✔ tidyr 1.2.0 ✔ stringr 1.4.0
## ✔ readr 2.1.2 ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
##
## Attaching package: 'scales'
##
##
## The following object is masked from 'package:purrr':
##
## discard
##
##
## The following object is masked from 'package:readr':
##
## col_factor
##
##
## Install package "strengejacke" from GitHub (`devtools::install_github("strengejacke/strengejacke")`) to load all sj-packages at once!
Create a new table of data from the “MASSZONG.xls”
femaclaim <- read_csv("~/Documents/GitHub/QR2022FA/county_claims_2016_2020.csv")
## Rows: 2438 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (4): countycode, total_insurance, total_insurance_building, total_insura...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
femaclaim_1 <- read_csv("~/Documents/GitHub/QR2022FA/county_characteristics_2016_2020.csv")
## Rows: 3221 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (10): countycode, white_percentage, black_percentage, ame_ala_percentage...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#View(femaclaim)
#View(femaclaim_1)
####Step 1: Data preparation
Import both datasets to R. Create a new dataset joining the county claims dataset to the county characteristics dataset. (Hint: from the library, tidyverse, use the function inner_join() and the FIPS code column to join both datasets)
femajoined <- inner_join(femaclaim,femaclaim_1,by="countycode")
#View(femajoined)
####Step 2: Explore Relationships between variables
femajoined %>%
ggplot(aes(x=median_value, y=total_insurance, color=total_insurance/median_value))+
geom_point(size=2,alpha=0.1) +
geom_smooth(method = lm, color="pink", alpha=.3)+
labs(title="Total insurance paid and the county median property value")+
xlab("median property value") +
ylab("total insurance paid")
## `geom_smooth()` using formula 'y ~ x'
answer: There is a positive correlation between median property value
and total insurance amount paid.
3.Regress the total insurance paid on the median property value. In a short paragraph, describe how these two variables are associated. Make sure to discuss and interpret each coefficient and R2
cor_value_insurance <- lm(total_insurance~median_value,data=femajoined)
summary(cor_value_insurance)
##
## Call:
## lm(formula = total_insurance ~ median_value, data = femajoined)
##
## Residuals:
## Min 1Q Median 3Q Max
## -389044 -61283 -14717 36535 3014645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.223e+05 4.960e+03 24.66 <2e-16 ***
## median_value 4.960e-01 2.469e-02 20.09 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 130200 on 2436 degrees of freedom
## Multiple R-squared: 0.1421, Adjusted R-squared: 0.1418
## F-statistic: 403.6 on 1 and 2436 DF, p-value: < 2.2e-16
answer: The regression showed that there is a positive correlation between median property value and total insurance amount paid. When the median value increase by 1, the total insurance paid increase by 0.496. Assuming the median value is 0, the insurance paid is 1.223e+05.The R-squared is 0.1421, meaning that about 14% of the data can be explained by the model.
log_insurance_paid <- log(femajoined$total_insurance)
log_median_value <- log(femajoined$median_value)
summary(log_insurance_paid)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.21 11.80 12.17 12.08 12.46 15.06
summary(log_median_value)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 10.19 11.59 11.86 11.92 12.18 13.97
4.1) Create a scatter plot of these two log scale variables. Discuss the similarity and differences between this plot and the plot from question (2). answer: The graph after log transformation better depict the positive correlation between median property value and total insurance amount paid than the graph in question (2). The data points in this graph are more evenly distributed along the fitted line, and you can understand the correlation by eye. Whereas you can hardly understand the correlation by eyes in the graph in question 2.
4.2) Add the line of best fit with confidence intervals at the 95% level. Remember to label your y-axis, x-axis, and provide a title to your graph.
data_frame(log_median_value, log_insurance_paid) %>%
ggplot(aes(x=log_median_value, y=log_insurance_paid, color=log_insurance_paid/log_median_value))+
geom_point(size=2,alpha=0.2) +
geom_smooth(method = lm, color="pink", alpha=.3, level=0.95)+
labs(title="Total insurance paid and the county median property value in log")+
xlab("log median property value") +
ylab("log total insurance paid")
## Warning: `data_frame()` was deprecated in tibble 1.1.0.
## Please use `tibble()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
## `geom_smooth()` using formula 'y ~ x'
*** 5. Regress the natural logarithm of the total insurance paid on the
natural logarithm of the median property value. In a short paragraph,
describe how these two variables are associated. Make sure to discuss
and interpret each coefficient and R2. In addition, please discuss the
confidence interval of the coefficient for median property value and how
it is presented visually in the plot from 4.2.
summary(lm(log_insurance_paid~log_median_value))
##
## Call:
## lm(formula = log_insurance_paid ~ log_median_value)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.77085 -0.22981 0.04049 0.30758 2.22922
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.22639 0.27376 19.09 <2e-16 ***
## log_median_value 0.57494 0.02295 25.05 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5301 on 2436 degrees of freedom
## Multiple R-squared: 0.2048, Adjusted R-squared: 0.2045
## F-statistic: 627.6 on 1 and 2436 DF, p-value: < 2.2e-16
answer: The regression showed that there is a positive correlation between log median property value and log total insurance amount paid. When the median value increase by 1%, the total insurance paid increase by 0.57494%. Assuming the median value is 0, the insurance paid is 5.22639. The R-squared is 0.2048, meaning that about 20% of the data can be explained by the model. Confidence interval of the coefficient for median property value is presented through the fitted line’s surrounding grey bands in the graphic. (? check with TA)
####Step 3. Make better predictions
We will create a multivariate regression to predict the total insurance paid:
summary(lm(log_insurance_paid ~ log_median_value+femajoined$black_percentage))
##
## Call:
## lm(formula = log_insurance_paid ~ log_median_value + femajoined$black_percentage)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.72554 -0.22653 0.03781 0.29091 2.30801
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.8212783 0.2763333 17.447 < 2e-16 ***
## log_median_value 0.6043657 0.0230516 26.218 < 2e-16 ***
## femajoined$black_percentage 0.0054425 0.0007383 7.371 2.3e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5244 on 2435 degrees of freedom
## Multiple R-squared: 0.2222, Adjusted R-squared: 0.2216
## F-statistic: 347.8 on 2 and 2435 DF, p-value: < 2.2e-16
The regression showed that there is a positive correlation between log median property value and log total insurance amount paid and black_percentage. When the median value increase by 1%, the total insurance paid increase by 0.6043657%. Assuming the median value is 0, the insurance paid is 4.8212783. When black_percentage increase by 1%, insurance paid increase by 0.0054425%. The R-squared is 0.2222, meaning that about 22% of the data can be explained by the model.