Homework 6

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(ggplot2)

library(pastecs)

## 
## Attaching package: 'pastecs'
## 
## The following objects are masked from 'package:dplyr':
## 
##     first, last
## 
## The following object is masked from 'package:tidyr':
## 
##     extract

library(readr)
Workers_Compensation_Claims_Data <- read_csv("Workers__Compensation_Claims_Data.csv")

## Rows: 56 Columns: 18
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (18): Year, Subject employers, Subject employees, Accepted disabling cla...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(Workers_Compensation_Claims_Data)

## # A tibble: 6 × 18
##    Year `Subject employers` `Subject employees` `Accepted disabling claims`
##   <dbl>               <dbl>               <dbl>                       <dbl>
## 1  1968               49021              671900                       32509
## 2  1969               52191              700800                       35372
## 3  1970               52789              704300                       30338
## 4  1971               58768              732500                       30663
## 5  1972               62584              778800                       34835
## 6  1973               65788              820600                       36802
## # ℹ 14 more variables: `Est. accepted nondisabling claims` <dbl>,
## #   `Est. total accepted claims` <dbl>, `Denied claims` <dbl>,
## #   `Disabling claim denial rate` <dbl>, `Fatality claims` <dbl>,
## #   `Net PTD claims` <dbl>,
## #   `Rate: accepted disabling claims per 100 employees` <dbl>,
## #   `Rate: fatality claims per 100,000 employees` <dbl>,
## #   `Aggravation claims: Accepted disabling` <dbl>, …

Dependent variable: Accepted disabling claims

Independent variables: Year, Subject employees, Rate: accepted disabling claims per 100 employees

stat.desc(Workers_Compensation_Claims_Data$`Accepted disabling claims`)

##      nbr.val     nbr.null       nbr.na          min          max        range 
## 5.600000e+01 0.000000e+00 0.000000e+00 1.801000e+04 4.784400e+04 2.983400e+04 
##          sum       median         mean      SE.mean CI.mean.0.95          var 
## 1.636954e+06 2.936350e+04 2.923132e+04 1.065983e+03 2.136278e+03 6.363393e+07 
##      std.dev     coef.var 
## 7.977088e+03 2.728952e-01
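
Several entries in the `stat.desc()` output are derived from the others, which allows a quick sanity check. The sketch below recomputes SE.mean, CI.mean.0.95, and coef.var from the printed nbr.val, mean, and std.dev values. The variable names are my own, and the inputs are copied from the rounded output above, so the results should agree only up to rounding.

```r
# Values copied from the stat.desc() output for `Accepted disabling claims`
n_obs       <- 56
mean_claims <- 29231.32
sd_claims   <- 7977.088

# Standard error of the mean: sd / sqrt(n)
se_mean <- sd_claims / sqrt(n_obs)

# Half-width of the 95% confidence interval, from the t distribution with n - 1 df
ci_half_width <- qt(0.975, df = n_obs - 1) * se_mean

# Coefficient of variation: standard deviation relative to the mean
coef_var <- sd_claims / mean_claims

round(c(SE.mean = se_mean, CI.mean.0.95 = ci_half_width, coef.var = coef_var), 4)
```

A coefficient of variation near 0.27 means the standard deviation is roughly 27% of the mean number of accepted disabling claims.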

stat.desc(Workers_Compensation_Claims_Data$Year)

##      nbr.val     nbr.null       nbr.na          min          max        range 
## 5.600000e+01 0.000000e+00 0.000000e+00 1.968000e+03 2.023000e+03 5.500000e+01 
##          sum       median         mean      SE.mean CI.mean.0.95          var 
## 1.117480e+05 1.995500e+03 1.995500e+03 2.179449e+00 4.367714e+00 2.660000e+02 
##      std.dev     coef.var 
## 1.630951e+01 8.173143e-03

workers_comp <- Workers_Compensation_Claims_Data

comp_model <- lm(`Denied claims` ~ `Subject employees`, data = workers_comp)

summary(comp_model)

## 
## Call:
## lm(formula = `Denied claims` ~ `Subject employees`, data = workers_comp)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -7303  -4620  -1870   4318   9299 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         2.074e+03  2.741e+03   0.757  0.45268    
## `Subject employees` 6.661e-03  1.889e-03   3.526  0.00089 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5321 on 52 degrees of freedom
##   (2 observations deleted due to missingness)
## Multiple R-squared:  0.193,  Adjusted R-squared:  0.1774 
## F-statistic: 12.43 on 1 and 52 DF,  p-value: 0.0008902

An R-squared value of 0.193 shows that the model accounts for only about 19.3% of the variation in Denied claims, which is a weak fit. The p-value for Subject employees (0.00089) indicates that it is a statistically significant predictor, while the intercept is not significant (p = 0.45). Because one predictor leaves most of the variation unexplained, other variables such as Subject employers and Accepted disabling claims may play a meaningful role in explaining the dependent variable.
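
To make the meaning of R-squared concrete, here is a minimal sketch that computes it by hand as one minus the ratio of residual to total variation and compares it with the value `summary()` reports. It uses the built-in `cars` data set as a stand-in, not the workers' compensation file, and the object names are my own.

```r
# Stand-in data: the built-in `cars` data set, not the workers' compensation file
fit <- lm(dist ~ speed, data = cars)

# Residual sum of squares: variation the model leaves unexplained
rss <- sum(residuals(fit)^2)

# Total sum of squares: variation of the response around its mean
tss <- sum((cars$dist - mean(cars$dist))^2)

r_squared_manual <- 1 - rss / tss

# The value reported by summary() should match the manual calculation
r_squared_reported <- summary(fit)$r.squared
all.equal(r_squared_manual, r_squared_reported)
```

The same calculation applies to any `lm` fit, so it could be repeated on `comp_model` once the data file is loaded.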

workers_multiple <- lm(`Denied claims` ~ `Subject employees` + Year + `Subject employers`, data = workers_comp)

summary(workers_multiple)

workers_multiple <- lm(`Denied claims` ~ `Subject employees` + `Accepted disabling claims` + `Subject employers`, data = workers_comp)

summary(workers_multiple)

## 
## Call:
## lm(formula = `Denied claims` ~ `Subject employees` + `Accepted disabling claims` + 
##     `Subject employers`, data = workers_comp)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4617.1 -1481.6  -337.8  1831.6  6926.0 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 -4.802e+03  3.861e+03  -1.244 0.219382    
## `Subject employees`          3.985e-02  2.714e-03  14.686  < 2e-16 ***
## `Accepted disabling claims`  2.479e-01  6.969e-02   3.557 0.000833 ***
## `Subject employers`         -5.390e-01  4.173e-02 -12.914  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2548 on 50 degrees of freedom
##   (2 observations deleted due to missingness)
## Multiple R-squared:  0.8221, Adjusted R-squared:  0.8114 
## F-statistic: 76.99 on 3 and 50 DF,  p-value: < 2.2e-16
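
The columns of the coefficient table are linked by simple arithmetic, which can be checked from the printed values. The sketch below recomputes the t value and p-value for `Accepted disabling claims`, and the adjusted R-squared, using numbers copied from the output above. The variable names are my own, and because the inputs are rounded, the results should match the table only approximately.

```r
# Values copied from the `Accepted disabling claims` row of the coefficient table
estimate  <- 2.479e-01
std_error <- 6.969e-02
resid_df  <- 50   # residual degrees of freedom reported by summary()

# t value = estimate / standard error
t_value <- estimate / std_error

# Two-sided p-value from the t distribution
p_value <- 2 * pt(abs(t_value), df = resid_df, lower.tail = FALSE)

# Adjusted R-squared penalizes R-squared for the number of predictors
r_squared    <- 0.8221
n_used       <- 54   # 50 residual df + 3 predictors + 1 intercept
n_predictors <- 3
adj_r_squared <- 1 - (1 - r_squared) * (n_used - 1) / (n_used - n_predictors - 1)

round(c(t = t_value, p = p_value, adj.R2 = adj_r_squared), 4)
```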

plot(workers_multiple, which = 1)

I believe this model does not meet the linearity assumption. The curved trend in the residuals-versus-fitted plot indicates that a straight-line relationship does not capture how Denied claims actually relates to the predictors.
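
As an illustration of what a curved residual pattern means and one possible way to respond to it, the sketch below fits a straight line and a quadratic to simulated data with a known curved relationship. The data, the object names, and the choice of a squared term are all assumptions made for the example. Nothing here is fitted to the homework data, and a squared term is only one of several possible remedies.

```r
set.seed(1)  # make the simulated example reproducible

# Simulated data with a genuinely curved relationship (not the homework data)
x <- seq(1, 10, length.out = 60)
y <- 2 + 0.5 * x^2 + rnorm(60, sd = 2)
sim <- data.frame(x = x, y = y)

linear_fit    <- lm(y ~ x, data = sim)
quadratic_fit <- lm(y ~ x + I(x^2), data = sim)

# Residuals vs. fitted: the straight-line fit shows a U-shaped pattern
plot(linear_fit, which = 1)

# The quadratic fit should scatter around zero with no clear trend
plot(quadratic_fit, which = 1)

# Nested-model F test: does the squared term improve the fit?
anova(linear_fit, quadratic_fit)
```

If a similar check on `workers_multiple` showed the curve flattening after adding a squared term or transforming a predictor, that would support the conclusion that the linear specification is the problem.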

2025-04-09