library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(pastecs)
##
## Attaching package: 'pastecs'
##
## The following objects are masked from 'package:dplyr':
##
## first, last
##
## The following object is masked from 'package:tidyr':
##
## extract
library(readr)
Workers_Compensation_Claims_Data <- read_csv("Workers__Compensation_Claims_Data.csv")
## Rows: 56 Columns: 18
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (18): Year, Subject employers, Subject employees, Accepted disabling cla...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(Workers_Compensation_Claims_Data)
## # A tibble: 6 × 18
## Year `Subject employers` `Subject employees` `Accepted disabling claims`
## <dbl> <dbl> <dbl> <dbl>
## 1 1968 49021 671900 32509
## 2 1969 52191 700800 35372
## 3 1970 52789 704300 30338
## 4 1971 58768 732500 30663
## 5 1972 62584 778800 34835
## 6 1973 65788 820600 36802
## # ℹ 14 more variables: `Est. accepted nondisabling claims` <dbl>,
## # `Est. total accepted claims` <dbl>, `Denied claims` <dbl>,
## # `Disabling claim denial rate` <dbl>, `Fatality claims` <dbl>,
## # `Net PTD claims` <dbl>,
## # `Rate: accepted disabling claims per 100 employees` <dbl>,
## # `Rate: fatality claims per 100,000 employees` <dbl>,
## # `Aggravation claims: Accepted disabling` <dbl>, …
Dependent variable: Accepted disabling claims
Independent variables: Year, Subject employees, Rate: accepted disabling claims per 100 employees
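Going by its name, the rate variable is the number of accepted disabling claims per 100 subject employees. Here is a minimal sketch of that calculation, using the first three years printed by `head()` above. It assumes the dataset computes the rate as claims divided by employees, times 100; the file itself does not show the formula. The `claims` data frame and its column names are illustrative, not taken from the CSV.

```r
# First three rows shown by head() above (1968-1970).
claims <- data.frame(
  year = c(1968, 1969, 1970),
  subject_employees = c(671900, 700800, 704300),
  accepted_disabling_claims = c(32509, 35372, 30338)
)

# Assumed definition: accepted disabling claims per 100 subject employees.
claims$rate_per_100 <- claims$accepted_disabling_claims /
  claims$subject_employees * 100

print(round(claims$rate_per_100, 2))
```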
stat.desc(Workers_Compensation_Claims_Data$`Accepted disabling claims`)
## nbr.val nbr.null nbr.na min max range
## 5.600000e+01 0.000000e+00 0.000000e+00 1.801000e+04 4.784400e+04 2.983400e+04
## sum median mean SE.mean CI.mean.0.95 var
## 1.636954e+06 2.936350e+04 2.923132e+04 1.065983e+03 2.136278e+03 6.363393e+07
## std.dev coef.var
## 7.977088e+03 2.728952e-01
stat.desc(Workers_Compensation_Claims_Data$Year)
## nbr.val nbr.null nbr.na min max range
## 5.600000e+01 0.000000e+00 0.000000e+00 1.968000e+03 2.023000e+03 5.500000e+01
## sum median mean SE.mean CI.mean.0.95 var
## 1.117480e+05 1.995500e+03 1.995500e+03 2.179449e+00 4.367714e+00 2.660000e+02
## std.dev coef.var
## 1.630951e+01 8.173143e-03
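Several of the `stat.desc()` entries are derived from the count, mean, and standard deviation. The base-R sketch below reproduces three of them for Year. It assumes the column holds the consecutive years 1968 through 2023 (consistent with the min, max, and 56 values reported above) and that `CI.mean.0.95` is the t-based half-width of the 95% confidence interval for the mean.

```r
year <- 1968:2023   # assumed contents of the Year column
n <- length(year)

se_mean  <- sd(year) / sqrt(n)                # SE.mean: standard error of the mean
ci_half  <- qt(0.975, df = n - 1) * se_mean   # CI.mean.0.95: half-width of the 95% CI
coef_var <- sd(year) / mean(year)             # coef.var: coefficient of variation

print(round(c(SE.mean = se_mean, CI.mean.0.95 = ci_half, coef.var = coef_var), 6))
```

Under those assumptions the three values should agree with the `stat.desc()` output above (SE.mean, CI.mean.0.95, and coef.var).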
workers_comp <- Workers_Compensation_Claims_Data
comp_model <- lm(`Denied claims` ~ `Subject employees`, data = workers_comp)
summary(comp_model)
##
## Call:
## lm(formula = `Denied claims` ~ `Subject employees`, data = workers_comp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7303 -4620 -1870 4318 9299
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.074e+03 2.741e+03 0.757 0.45268
## `Subject employees` 6.661e-03 1.889e-03 3.526 0.00089 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5321 on 52 degrees of freedom
## (2 observations deleted due to missingness)
## Multiple R-squared: 0.193, Adjusted R-squared: 0.1774
## F-statistic: 12.43 on 1 and 52 DF, p-value: 0.0008902
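The numbers in this summary are linked to one another. As a rough check that uses only the rounded values printed above: with a single predictor the F-statistic is the square of the slope's t value, and R-squared can be recovered from F and the residual degrees of freedom. The variable names are illustrative.

```r
t_value  <- 3.526   # t value reported for `Subject employees`
df_resid <- 52      # residual degrees of freedom
df_model <- 1       # one predictor

# With one predictor, F equals the squared t value of the slope.
f_stat <- t_value^2

# R-squared expressed through F and the degrees of freedom.
r_squared <- (f_stat * df_model) / (f_stat * df_model + df_resid)

print(round(c(F = f_stat, R.squared = r_squared), 4))
```

Both results should land close to the reported F-statistic and Multiple R-squared, with small differences due to rounding in the printed summary.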
An R-squared of 0.9988 would mean that a model accounts for 99.88% of the variation in the dependent variable, which is a very strong fit. The simple regression summarized above, however, reports a Multiple R-squared of 0.193: Subject employees alone explains about 19.3% of the variation in Denied claims, although its coefficient is significant (p = 0.00089). I believe the p-values indicate that the key variables are Subject employers, Subject employees, and the percentage of disabling claims that are accepted or denied on time. On the other hand, variables such as Accepted disabling claims, Estimated accepted nondisabling claims, and Denied claims do not appear to be significant and probably do not play a meaningful role in explaining the dependent variable.
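# Column names that contain spaces must be wrapped in backticks,
# and names are case-sensitive (the column is `Year`, not `year`).
workers_multiple <- lm(
  `Denied claims` ~ `Subject employees` + Year + `Subject employers`,
  data = workers_comp
)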
summary(workers_multiple)
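Column names in this dataset contain spaces, so they are not syntactic R names and a formula has to wrap them in backticks. The sketch below shows this on the six rows printed by `head()` above. It uses Accepted disabling claims as the response only because the Denied claims values are not shown in that output. The `claims` data frame is illustrative, and six rows are far too few for any inference.

```r
# Rows printed by head() above; check.names = FALSE keeps the spaces in the names.
claims <- data.frame(
  `Subject employees` = c(671900, 700800, 704300, 732500, 778800, 820600),
  `Accepted disabling claims` = c(32509, 35372, 30338, 30663, 34835, 36802),
  check.names = FALSE
)

# Backticks let the formula refer to column names that contain spaces.
fit <- lm(`Accepted disabling claims` ~ `Subject employees`, data = claims)

print(names(claims))
print(coef(fit))
```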
workers_multiple <- lm(
  `Denied claims` ~ `Subject employees` + `Accepted disabling claims` + `Subject employers`,
  data = workers_comp
)
summary(workers_multiple)
##
## Call:
## lm(formula = `Denied claims` ~ `Subject employees` + `Accepted disabling claims` +
## `Subject employers`, data = workers_comp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4617.1 -1481.6 -337.8 1831.6 6926.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.802e+03 3.861e+03 -1.244 0.219382
## `Subject employees` 3.985e-02 2.714e-03 14.686 < 2e-16 ***
## `Accepted disabling claims` 2.479e-01 6.969e-02 3.557 0.000833 ***
## `Subject employers` -5.390e-01 4.173e-02 -12.914 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2548 on 50 degrees of freedom
## (2 observations deleted due to missingness)
## Multiple R-squared: 0.8221, Adjusted R-squared: 0.8114
## F-statistic: 76.99 on 3 and 50 DF, p-value: < 2.2e-16
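The degrees of freedom and the adjusted R-squared in this output can be reproduced by hand. The sketch below uses only the numbers printed above (56 rows, 2 dropped for missing values, 3 predictors plus an intercept); the variable names are illustrative.

```r
n_rows       <- 56   # rows in the data
n_missing    <- 2    # observations deleted due to missingness
n_predictors <- 3    # Subject employees, Accepted disabling claims, Subject employers

n_used   <- n_rows - n_missing            # observations actually used in the fit
df_resid <- n_used - (n_predictors + 1)   # subtract the slopes and the intercept

r_squared     <- 0.8221                   # Multiple R-squared as printed
adj_r_squared <- 1 - (1 - r_squared) * (n_used - 1) / df_resid

print(c(n_used = n_used, df_resid = df_resid))
print(round(adj_r_squared, 4))
```

The adjustment penalizes R-squared for each added predictor, which is why the adjusted value sits slightly below the multiple R-squared.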
plot(workers_multiple,which=1)
I believe this model does not meet the linearity assumption. The
residuals-versus-fitted plot shows a curved trend rather than a random
scatter around zero, which indicates that the linear model is not
capturing the actual relationship.
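To illustrate what a curved residual pattern means, here is a minimal sketch on simulated data (not the claims data). The response is generated with a quadratic trend, so a straight-line fit leaves a U-shaped pattern in the residuals-versus-fitted plot, and adding a squared term removes it. All names and numbers below are made up for the illustration. Whether a squared term or some other change is appropriate for `workers_multiple` is not something this sketch can decide.

```r
set.seed(42)

# Simulated data with a deliberately curved (quadratic) relationship.
x <- seq(1, 10, length.out = 60)
y <- 2 + 0.5 * x^2 + rnorm(60, sd = 1)

fit_linear <- lm(y ~ x)            # straight line: misses the curvature
fit_quad   <- lm(y ~ x + I(x^2))   # adds a squared term

# Residuals vs fitted: curved for the linear fit, roughly flat for the quadratic fit.
plot(fit_linear, which = 1)
plot(fit_quad, which = 1)

# Nested-model F test: does the squared term improve the fit?
comparison <- anova(fit_linear, fit_quad)
print(comparison)
```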