Using R, build a regression model for data that interests you. Conduct residual analysis.
df <- read.csv("https://raw.githubusercontent.com/waheeb123/Data-621/main/Homeworks/Homework%204/insurance_training_data.csv")
str(df)
## 'data.frame': 8161 obs. of 26 variables:
## $ INDEX : int 1 2 4 5 6 7 8 11 12 13 ...
## $ TARGET_FLAG: int 0 0 0 0 0 1 0 1 1 0 ...
## $ TARGET_AMT : num 0 0 0 0 0 ...
## $ KIDSDRIV : int 0 0 0 0 0 0 0 1 0 0 ...
## $ AGE : int 60 43 35 51 50 34 54 37 34 50 ...
## $ HOMEKIDS : int 0 0 1 0 0 1 0 2 0 0 ...
## $ YOJ : int 11 11 10 14 NA 12 NA NA 10 7 ...
## $ INCOME : chr "$67,349" "$91,449" "$16,039" "" ...
## $ PARENT1 : chr "No" "No" "No" "No" ...
## $ HOME_VAL : chr "$0" "$257,252" "$124,191" "$306,251" ...
## $ MSTATUS : chr "z_No" "z_No" "Yes" "Yes" ...
## $ SEX : chr "M" "M" "z_F" "M" ...
## $ EDUCATION : chr "PhD" "z_High School" "z_High School" "<High School" ...
## $ JOB : chr "Professional" "z_Blue Collar" "Clerical" "z_Blue Collar" ...
## $ TRAVTIME : int 14 22 5 32 36 46 33 44 34 48 ...
## $ CAR_USE : chr "Private" "Commercial" "Private" "Private" ...
## $ BLUEBOOK : chr "$14,230" "$14,940" "$4,010" "$15,440" ...
## $ TIF : int 11 1 4 7 1 1 1 1 1 7 ...
## $ CAR_TYPE : chr "Minivan" "Minivan" "z_SUV" "Minivan" ...
## $ RED_CAR : chr "yes" "yes" "no" "yes" ...
## $ OLDCLAIM : chr "$4,461" "$0" "$38,690" "$0" ...
## $ CLM_FREQ : int 2 0 2 0 2 0 0 1 0 0 ...
## $ REVOKED : chr "No" "No" "No" "No" ...
## $ MVR_PTS : int 3 0 3 0 3 0 0 10 0 1 ...
## $ CAR_AGE : int 18 1 10 6 17 7 1 7 1 17 ...
## $ URBANICITY : chr "Highly Urban/ Urban" "Highly Urban/ Urban" "Highly Urban/ Urban" "Highly Urban/ Urban" ...
library(tibble)   # rownames_to_column()
library(dplyr)    # group_by(), summarize(), mutate(), select()
library(knitr)    # kable()

# One row per variable, holding the variable name and its class
classes <- as.data.frame(unlist(lapply(df, class))) |>
  rownames_to_column()
cols <- c("Variable", "Class")
colnames(classes) <- cols

# Count the variables in each class and list their names
classes_summary <- classes |>
  group_by(Class) |>
  summarize(Count = n(),
            Variables = paste(sort(unique(Variable)), collapse = ", "))
kable(classes_summary, "latex", booktabs = TRUE) |>
  kableExtra::column_spec(2:3, width = "7cm")
INCOME, HOME_VAL, BLUEBOOK, and OLDCLAIM are all character variables that
will need to be coerced to integers after we strip the “$” and the comma
thousands separators from their strings.
TARGET_FLAG and the remaining character variables will all
need to be coerced to factors.
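The two coercions described above can be tried on a few rows first. The sketch below is a minimal base R illustration, not the report's own code: the `toy` data frame and its values are made up to resemble the columns in the `str()` output, and `chr_cols` is a helper name introduced here.

```r
# Toy rows imitating three of the columns listed above (illustrative values only).
toy <- data.frame(
  INCOME      = c("$67,349", "$91,449", ""),
  TARGET_FLAG = c(0L, 0L, 1L),
  CAR_USE     = c("Private", "Commercial", "Private"),
  stringsAsFactors = FALSE
)

# Strip "$" and "," before converting; an empty string becomes NA.
toy$INCOME <- as.integer(gsub("\\$|,", "", toy$INCOME))

# Coerce the flag and every remaining character column to a factor.
# This must run after the currency columns are numeric, or they would
# be turned into factors as well.
toy$TARGET_FLAG <- factor(toy$TARGET_FLAG)
chr_cols <- names(toy)[sapply(toy, is.character)]
toy[chr_cols] <- lapply(toy[chr_cols], factor)

str(toy)
```

Empty strings such as the blank INCOME entry turn into NA, so the currency columns contribute to the missing-value counts once converted.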
vars <- c("INCOME", "HOME_VAL", "BLUEBOOK", "OLDCLAIM")
# Remove "$" and "," from the currency strings, then convert to integer
df <- df |>
  mutate(across(all_of(vars), ~gsub("\\$|,", "", .) |> as.integer()))
We remove the column named INDEX from the dataset, then
we take a look at a summary of the dataset’s completeness.
library(DataExplorer)  # introduce()

df <- df |> select(-INDEX)
completeness <- introduce(df)
knitr::kable(t(completeness), format = "simple")
| rows | 8161 |
| columns | 25 |
| discrete_columns | 10 |
| continuous_columns | 15 |
| all_missing_columns | 0 |
| total_missing_values | 1879 |
| complete_rows | 6448 |
| total_observations | 204025 |
| memory_usage | 1183032 |
None of our columns are completely devoid of data. There are 6,448 complete rows in the dataset, which is about 79% of our 8,161 observations. There are 1,879 missing values in total. We take a look at which variables contain these missing values and what the spread is.
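A per-variable tally of missing values can be sketched in base R as follows. This is an illustration under assumptions: `toy` is a made-up stand-in for `df`, and `na_summary` and `complete_share` are names introduced here rather than objects from the report.

```r
# Toy data frame standing in for df (illustrative values only).
toy <- data.frame(
  AGE     = c(60L, 43L, NA, 51L),
  YOJ     = c(11L, NA, 10L, NA),
  CAR_AGE = c(18L, 1L, 10L, 6L)
)

# Count and percentage of missing values per column.
na_count <- colSums(is.na(toy))
na_summary <- data.frame(
  variable    = names(na_count),
  n_missing   = as.integer(na_count),
  pct_missing = round(100 * as.numeric(na_count) / nrow(toy), 1)
)

# Keep only columns with missing values, largest count first.
na_summary <- na_summary[order(-na_summary$n_missing), ]
na_summary <- na_summary[na_summary$n_missing > 0, ]
print(na_summary)

# Share of rows with no missing value in any column.
complete_share <- mean(complete.cases(toy))
```

Applied to the real data, `mean(complete.cases(df))` is the same ratio as complete_rows divided by rows in the completeness table.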
## `geom_smooth()` using formula = 'y ~ x'
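Histograms reveal that both CAR_AGE and INCOME exhibit non-normal distributions. This does not by itself rule out a linear model, since the normality assumption concerns the residuals rather than the variables. The scatterplot with a fitted line suggests a positive, roughly linear relationship between the two variables.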
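The plotting code is not shown in the report. The `geom_smooth()` message suggests ggplot2 was used, so plots of this kind could be produced along the following lines. This is a sketch on simulated data: the `toy` data frame, its coefficients, and the plot object names are assumptions made for illustration.

```r
library(ggplot2)

# Simulated stand-in for the two columns (not the insurance data).
set.seed(1)
toy <- data.frame(CAR_AGE = sample(0:25, 200, replace = TRUE))
toy$INCOME <- pmax(0, 33000 + 3400 * toy$CAR_AGE + rnorm(200, sd = 40000))

# Histogram of each variable to inspect its distribution.
p_income <- ggplot(toy, aes(x = INCOME)) +
  geom_histogram(bins = 30)
p_car_age <- ggplot(toy, aes(x = CAR_AGE)) +
  geom_histogram(bins = 26)

# Scatterplot with a least-squares line and its confidence band.
p_scatter <- ggplot(toy, aes(x = CAR_AGE, y = INCOME)) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = "lm")

print(p_scatter)
```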
##
## Call:
## lm(formula = INCOME ~ CAR_AGE, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -115751 -28785 -5247 21581 302879
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33191.32 898.45 36.94 <2e-16 ***
## CAR_AGE 3440.01 88.86 38.71 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 43240 on 7235 degrees of freedom
## (924 observations deleted due to missingness)
## Multiple R-squared: 0.1716, Adjusted R-squared: 0.1715
## F-statistic: 1499 on 1 and 7235 DF, p-value: < 2.2e-16
The linear regression model shows a statistically significant positive relationship between car age and income.
Each one-unit increase in CAR_AGE is associated with an estimated increase of $3,440.01 in income; because this is a simple regression with a single predictor, no other variables are being held constant. The intercept of $33,191.32 is the estimated income when car age is zero. Both coefficients have p-values below 0.001, but the model explains only about 17.16% of the variance in income, so car age alone accounts for a modest share of its variability. The residual standard error of $43,240 is the typical size of the gap between observed and predicted income. The residual quantiles are also asymmetric (median -5,247, minimum -115,751, maximum 302,879), which points to right-skewed residuals and is at odds with the normality assumption. In addition, 924 of the 8,161 observations (about 11%) were dropped because of missing values, which may limit how well the model generalizes. In conclusion, car age is a significant predictor of income, but the model’s explanatory power is low, and both the residual skew and the missing data warrant further examination.
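A residual analysis of a model of this form could follow the sketch below. It runs on simulated data with deliberately right-skewed noise, not on the insurance data; `sim`, `fit`, `res`, and `skew` are names introduced here, and the simulation parameters are assumptions chosen only to resemble the fitted coefficients.

```r
# Simulated data with right-skewed errors (an assumption for illustration).
set.seed(621)
n <- 500
sim <- data.frame(CAR_AGE = sample(0:25, n, replace = TRUE))
noise <- rexp(n, rate = 1 / 40000) - 40000   # mean-zero, right-skewed
sim$INCOME <- 33000 + 3400 * sim$CAR_AGE + noise

fit <- lm(INCOME ~ CAR_AGE, data = sim)
res <- residuals(fit)

# Diagnostic panels: residuals vs fitted, normal Q-Q,
# scale-location, and residuals vs leverage.
op <- par(mfrow = c(2, 2))
plot(fit)
par(op)

# Sample skewness of the residuals; values well above 0 indicate a long right tail.
skew <- mean((res - mean(res))^3) / sd(res)^3
print(skew)

# Shapiro-Wilk normality test; shapiro.test() accepts 3 to 5000 values,
# so a model fitted on more rows would need a subsample.
print(shapiro.test(res))
```

In the residuals-vs-fitted panel, a funnel shape would suggest non-constant variance, and points bending away from the line in the Q-Q panel would suggest non-normal residuals.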