library(tidyverse)
library(knitr)         # kable()
library(DataExplorer)  # introduce()

Using R, build a regression model for data that interests you. Conduct residual analysis.

df <-  read.csv("https://raw.githubusercontent.com/waheeb123/Data-621/main/Homeworks/Homework%204/insurance_training_data.csv")
str(df)
## 'data.frame':    8161 obs. of  26 variables:
##  $ INDEX      : int  1 2 4 5 6 7 8 11 12 13 ...
##  $ TARGET_FLAG: int  0 0 0 0 0 1 0 1 1 0 ...
##  $ TARGET_AMT : num  0 0 0 0 0 ...
##  $ KIDSDRIV   : int  0 0 0 0 0 0 0 1 0 0 ...
##  $ AGE        : int  60 43 35 51 50 34 54 37 34 50 ...
##  $ HOMEKIDS   : int  0 0 1 0 0 1 0 2 0 0 ...
##  $ YOJ        : int  11 11 10 14 NA 12 NA NA 10 7 ...
##  $ INCOME     : chr  "$67,349" "$91,449" "$16,039" "" ...
##  $ PARENT1    : chr  "No" "No" "No" "No" ...
##  $ HOME_VAL   : chr  "$0" "$257,252" "$124,191" "$306,251" ...
##  $ MSTATUS    : chr  "z_No" "z_No" "Yes" "Yes" ...
##  $ SEX        : chr  "M" "M" "z_F" "M" ...
##  $ EDUCATION  : chr  "PhD" "z_High School" "z_High School" "<High School" ...
##  $ JOB        : chr  "Professional" "z_Blue Collar" "Clerical" "z_Blue Collar" ...
##  $ TRAVTIME   : int  14 22 5 32 36 46 33 44 34 48 ...
##  $ CAR_USE    : chr  "Private" "Commercial" "Private" "Private" ...
##  $ BLUEBOOK   : chr  "$14,230" "$14,940" "$4,010" "$15,440" ...
##  $ TIF        : int  11 1 4 7 1 1 1 1 1 7 ...
##  $ CAR_TYPE   : chr  "Minivan" "Minivan" "z_SUV" "Minivan" ...
##  $ RED_CAR    : chr  "yes" "yes" "no" "yes" ...
##  $ OLDCLAIM   : chr  "$4,461" "$0" "$38,690" "$0" ...
##  $ CLM_FREQ   : int  2 0 2 0 2 0 0 1 0 0 ...
##  $ REVOKED    : chr  "No" "No" "No" "No" ...
##  $ MVR_PTS    : int  3 0 3 0 3 0 0 10 0 1 ...
##  $ CAR_AGE    : int  18 1 10 6 17 7 1 7 1 17 ...
##  $ URBANICITY : chr  "Highly Urban/ Urban" "Highly Urban/ Urban" "Highly Urban/ Urban" "Highly Urban/ Urban" ...
classes <- as.data.frame(unlist(lapply(df, class))) |>
    rownames_to_column()
cols <- c("Variable", "Class")
colnames(classes) <- cols
classes_summary <- classes |>
    group_by(Class) |>
    summarize(Count = n(),
              Variables = paste(sort(unique(Variable)),collapse=", "))
knitr::kable(classes_summary, "latex", booktabs = TRUE) |>
  kableExtra::column_spec(2:3, width = "7cm")

INCOME, HOME_VAL, BLUEBOOK, and OLDCLAIM are all character variables that will need to be coerced to integers after we strip the “$” and the thousands-separator commas from their strings. TARGET_FLAG and the remaining character variables will all need to be coerced to factors.

vars <- c("INCOME", "HOME_VAL", "BLUEBOOK", "OLDCLAIM")
df <- df |>
    mutate(across(all_of(vars), ~gsub("\\$|,", "", .) |> as.integer()))
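
The factor coercion described above is not shown in the code, so here is a minimal base-R sketch of one way to do it. The `toy` data frame and its values are made up for illustration; only the column names are taken from the dataset.

```r
# Toy stand-in for the insurance data; the values are made up for illustration.
toy <- data.frame(
  TARGET_FLAG = c(0L, 1L, 0L, 1L),
  SEX         = c("M", "z_F", "M", "z_F"),
  CAR_USE     = c("Private", "Commercial", "Private", "Private"),
  AGE         = c(60L, 43L, 35L, 51L),
  stringsAsFactors = FALSE
)

# Columns to convert: the binary target plus every character column.
chr_cols    <- names(toy)[vapply(toy, is.character, logical(1))]
factor_cols <- c("TARGET_FLAG", chr_cols)

# Convert in place; numeric predictors such as AGE are left untouched.
toy[factor_cols] <- lapply(toy[factor_cols], factor)

str(toy)
```

Applied to `df`, the same pattern should be run after the dollar columns have been converted to integers, otherwise they would be turned into factors as well.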

We remove the column named INDEX from the dataset, then we take a look at a summary of the dataset’s completeness.

df <- df |> select(-INDEX)
completeness <- introduce(df)
knitr::kable(t(completeness), format = "simple")
rows                      8161
columns                     25
discrete_columns            10
continuous_columns          15
all_missing_columns          0
total_missing_values      1879
complete_rows             6448
total_observations      204025
memory_usage           1183032

None of our columns are completely devoid of data. There are 6,448 complete rows in the dataset, which is about 79% of our 8,161 observations. There are 1,879 total missing values. We take a look at which variables contain these missing values and what the spread is.
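
The completeness figures can also be checked by hand. The sketch below uses base R on a made-up data frame (`toy` and its values are hypothetical): the share of complete rows is the number of complete rows divided by the number of rows, and the per-variable counts show where the gaps are.

```r
# Toy data frame with a few missing values (made-up numbers).
toy <- data.frame(
  YOJ     = c(11, 11, NA, 14, NA),
  INCOME  = c(67349, 91449, 16039, NA, 114986),
  CAR_AGE = c(18, 1, 10, 6, 17)
)

# Share of rows with no missing value in any column.
complete_share <- mean(complete.cases(toy))

# Missing-value count per variable, largest first.
missing_by_var <- sort(colSums(is.na(toy)), decreasing = TRUE)

complete_share
missing_by_var
```

Running the same two expressions on `df` would give the complete-row share and the variables responsible for the missing values.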

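# Assumption: `look` is the missing-value plot from DataExplorer::plot_missing();
# the line that created it is not shown in the source.
look <- DataExplorer::plot_missing(df)
look <- look +
     scale_fill_brewer(palette = "Paired")
look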

df %>%
  ggplot(aes(CAR_AGE)) +
  geom_histogram(bins = 20)

df %>%
  ggplot(aes(INCOME)) +
  geom_histogram(bins = 20)

ggplot(df, aes(x = CAR_AGE, y = INCOME)) +
  geom_point() +
  stat_smooth(method = "lm")
## `geom_smooth()` using formula = 'y ~ x'

The histograms show that neither CAR_AGE nor INCOME is normally distributed. The scatterplot with its fitted line nevertheless suggests a positive, roughly linear trend between the two variables, with considerable scatter around the line.

lm1 <- lm(INCOME ~ CAR_AGE , data = df)
summary(lm1)
## 
## Call:
## lm(formula = INCOME ~ CAR_AGE, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -115751  -28785   -5247   21581  302879 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 33191.32     898.45   36.94   <2e-16 ***
## CAR_AGE      3440.01      88.86   38.71   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 43240 on 7235 degrees of freedom
##   (924 observations deleted due to missingness)
## Multiple R-squared:  0.1716, Adjusted R-squared:  0.1715 
## F-statistic:  1499 on 1 and 7235 DF,  p-value: < 2.2e-16
plot(lm1)

The linear regression model shows a statistically significant relationship between car age and income.

Each additional year of car age is associated with an estimated increase of $3,440.01 in income. The intercept of $33,191.32 is the estimated income when car age is zero. The model explains about 17.16% of the variance in income, so car age alone accounts for only a modest share of the variability. Both the intercept and the car age coefficient have p-values below 0.001. The residual standard error is $43,240, which is the typical size of the gap between observed and predicted income.

The residual quantiles in the summary are not symmetric around zero: the median is -5,247, and the largest residual (302,879) is far greater in magnitude than the smallest (-115,751). This points to right-skewed residuals, so the normality assumption looks doubtful and should be checked against the diagnostic plots produced by plot(lm1). In addition, 924 observations were dropped because of missing values, which may limit how well the model generalizes. In conclusion, car age is a statistically significant predictor of income, but the model’s explanatory power is low, and both the skewed residuals and the missing data warrant further consideration.
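
As a companion to plot(lm1), the sketch below shows a minimal residual check in base R. It runs on simulated data, not on the insurance dataset: the variable names, the seed, and the noise distribution (a shifted exponential, chosen only to produce right-skewed errors) are all assumptions made for illustration.

```r
set.seed(621)  # arbitrary seed so the sketch is reproducible

# Simulated stand-in: a linear trend plus right-skewed, mean-zero noise.
n   <- 500
x   <- runif(n, 0, 25)
y   <- 30000 + 3000 * x + (rexp(n, rate = 1 / 40000) - 40000)
fit <- lm(y ~ x)

res <- residuals(fit)

# With an intercept, OLS residuals average to zero by construction,
# so the median is the quicker indicator of skew.
c(mean = mean(res), median = median(res))

# Residuals vs fitted: look for curvature or a funnel shape.
plot(fitted(fit), res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Normal Q-Q plot: points bending away from the line indicate
# non-normal residuals.
qqnorm(res)
qqline(res)
```

For the real model, the same calls can be made with `lm1` in place of `fit`. A median residual well below zero, together with an upward bend at the right end of the Q-Q plot, would be consistent with right-skewed residuals.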