library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(modelsummary)
Reading in the data
catalog = read_csv("catalog.csv")
## Rows: 200 Columns: 21
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (21): SpendRat, Age, LenRes, Income, TotAsset, SecAssets, ShortLiq, Long...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(catalog)
Any Age value below 18 is invalid because we are not legally allowed to target minors, so those rows must be removed. Also, nobody can have lived at a residence for longer than they have been alive, so we keep only rows where LenRes is less than Age.
catalog = catalog %>%
filter(Age > 17, LenRes < Age)
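To see how many rows each rule removes, the two conditions can be evaluated separately before filtering. The sketch below uses a small made-up data frame (`toy` is hypothetical, not the real catalog.csv) and base R only. It applies the same logic as the filter above, including the fact that a row with LenRes equal to Age is dropped.

```r
# Hypothetical toy data; the real catalog.csv is not reproduced here.
toy <- data.frame(
  Age    = c(16, 25, 40, 30, 52),
  LenRes = c(2, 30, 10, 30, 51)
)

# Flag each rule separately so the reason for every drop is visible.
too_young  <- toy$Age <= 17
lenres_bad <- toy$LenRes >= toy$Age   # ties (LenRes == Age) are flagged too

# Same condition as filter(Age > 17, LenRes < Age)
kept <- toy[!too_young & !lenres_bad, ]

c(dropped_minor  = sum(too_young),
  dropped_lenres = sum(lenres_bad & !too_young),
  kept           = nrow(kept))
```

Counting the drops per rule makes it easy to confirm that the row count after filtering matches what the later summaries report.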
Then, we need to convert all of the catalog variables to factors because they each represent binary information.
catalog$CollGifts = as.factor(catalog$CollGifts)
catalog$BricMortar = as.factor(catalog$BricMortar)
catalog$MarthaHome = as.factor(catalog$MarthaHome)
catalog$SunAds = as.factor(catalog$SunAds)
catalog$ThemeColl = as.factor(catalog$ThemeColl)
catalog$CustDec = as.factor(catalog$CustDec)
catalog$RetailKids = as.factor(catalog$RetailKids)
catalog$TeenWr = as.factor(catalog$TeenWr)
catalog$Carlovers = as.factor(catalog$Carlovers)
catalog$CountryColl = as.factor(catalog$CountryColl)
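The ten assignments above all follow the same pattern, so they could also be written as one step over a vector of column names. This is a minimal sketch on a hypothetical three-row data frame (`toy` and `indicator_cols` are illustrative names, not part of the original analysis), using base R.

```r
# Hypothetical stand-in for the catalog data.
toy <- data.frame(
  SpendRat   = c(12.5, 80.1, 3.4),
  CollGifts  = c(0, 1, 1),
  BricMortar = c(1, 0, 0)
)

# Names of the 0/1 indicator columns to convert (the report has ten of them).
indicator_cols <- c("CollGifts", "BricMortar")

# lapply() returns one factor per column; assigning back replaces them in place.
toy[indicator_cols] <- lapply(toy[indicator_cols], as.factor)

str(toy)
```

With dplyr loaded, `mutate(across(all_of(indicator_cols), as.factor))` should give the same result. Either form avoids copy-and-paste mistakes in the column names.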
Next, I want to check the levels of the factor variables to make sure there isn’t anything in there that’s not a “0” or “1”.
levels(catalog$CollGifts)
## [1] "0" "1"
levels(catalog$BricMortar)
## [1] "0" "1"
levels(catalog$MarthaHome)
## [1] "0" "1"
levels(catalog$SunAds)
## [1] "0" "1"
levels(catalog$ThemeColl)
## [1] "0" "1"
levels(catalog$CustDec)
## [1] "0" "1"
levels(catalog$RetailKids)
## [1] "0" "1"
levels(catalog$TeenWr)
## [1] "0" "1"
levels(catalog$Carlovers)
## [1] "0" "1"
levels(catalog$CountryColl)
## [1] "0" "1"
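The same check can be done for every factor column at once. The sketch below is one possible way, shown on hypothetical data (`toy` deliberately contains a stray value of 2 so that the check has something to catch).

```r
# Hypothetical data; the stray 2 in BricMortar is deliberate.
toy <- data.frame(
  CollGifts  = factor(c(0, 1, 1, 0)),
  BricMortar = factor(c(1, 0, 2, 0))
)

# Collect the levels of every factor column in one list.
factor_cols <- names(toy)[sapply(toy, is.factor)]
lvls <- lapply(toy[factor_cols], levels)

# TRUE where a column's levels are exactly "0" and "1".
is_binary <- sapply(lvls, function(l) identical(l, c("0", "1")))
is_binary
```

Any column reported as FALSE would need a closer look before modelling.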
Finally, I am going to leave Income as a numeric variable. It is recorded as an ordered code (nine distinct values, 1 to 9), but it represents information that is in fact continuous (amount of money).
Using the datasummary_skim function from the modelsummary package gives us a really nice output for viewing the summary statistics of all of our variables.
datasummary_skim(catalog)
Variable | Unique | Missing Pct. | Mean | SD | Min | Median | Max |
---|---|---|---|---|---|---|---|
SpendRat | 183 | 0 | 43.8 | 66.1 | 0.1 | 18.8 | 401.4 |
Age | 56 | 0 | 54.7 | 13.6 | 21.0 | 53.0 | 89.0 |
LenRes | 40 | 0 | 14.6 | 9.9 | 0.0 | 11.0 | 46.0 |
Income | 9 | 0 | 4.5 | 1.4 | 1.0 | 5.0 | 9.0 |
TotAsset | 139 | 0 | 184.7 | 155.0 | 5.0 | 150.0 | 999.0 |
SecAssets | 71 | 0 | 40.9 | 79.8 | 0.0 | 28.0 | 999.0 |
ShortLiq | 32 | 0 | 240.6 | 66.9 | 160.0 | 230.0 | 999.0 |
LongLiq | 26 | 0 | 439.5 | 55.6 | 400.0 | 430.0 | 999.0 |
WlthIdx | 56 | 0 | 367.1 | 90.0 | 90.0 | 360.0 | 880.0 |
SpendVol | 65 | 0 | 568.4 | 154.0 | 0.0 | 610.0 | 780.0 |
SpenVel | 75 | 0 | 219.5 | 217.3 | 0.0 | 160.0 | 999.0 |

Variable | Level | N | % |
---|---|---|---|
CollGifts | 0 | 94 | 51.1 |
CollGifts | 1 | 90 | 48.9 |
BricMortar | 0 | 131 | 71.2 |
BricMortar | 1 | 53 | 28.8 |
MarthaHome | 0 | 117 | 63.6 |
MarthaHome | 1 | 67 | 36.4 |
SunAds | 0 | 105 | 57.1 |
SunAds | 1 | 79 | 42.9 |
ThemeColl | 0 | 111 | 60.3 |
ThemeColl | 1 | 73 | 39.7 |
CustDec | 0 | 120 | 65.2 |
CustDec | 1 | 64 | 34.8 |
RetailKids | 0 | 119 | 64.7 |
RetailKids | 1 | 65 | 35.3 |
TeenWr | 0 | 89 | 48.4 |
TeenWr | 1 | 95 | 51.6 |
Carlovers | 0 | 133 | 72.3 |
Carlovers | 1 | 51 | 27.7 |
CountryColl | 0 | 107 | 58.2 |
CountryColl | 1 | 77 | 41.8 |
The most meaningful statistic for the continuous variables, in my opinion, is the median. For the categorical variables the most meaningful statistic is the count for each level.
I also want to see the distribution of SpendRat, so I am going to plot a histogram.
ggplot(catalog, aes(x = SpendRat)) +
geom_histogram(binwidth = 15, color = "#2c7fb8", fill = "gold") +
theme_minimal() +
ggtitle("Distribution of Catalog Spending")
The histogram shows a long right tail, with most of the data in
the first few bins close to zero. SpendRat values under 50 are clearly
much more common than those above 50. The data is therefore not
normally distributed; it is right-skewed.
summary(catalog$SpendRat)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.080 6.077 18.805 43.792 50.273 401.420
sd(catalog$SpendRat)
## [1] 66.09873
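The right skew seen in the histogram also shows up in these numbers: the mean (43.8) sits well above the median (18.8). The sketch below shows the same comparison, plus a moment-based skewness value, on a hypothetical vector `x` (made-up values, not the SpendRat data).

```r
# Hypothetical right-skewed values standing in for SpendRat.
x <- c(1, 2, 3, 5, 8, 13, 21, 34, 55, 200)

m   <- mean(x)
med <- median(x)

# Moment-based sample skewness: a positive value means a longer right tail.
skew <- mean((x - m)^3) / mean((x - m)^2)^(3 / 2)

c(mean = m, median = med, skewness = skew)
```

A mean far above the median together with a positive skewness is the usual numerical signature of a long right tail.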
plot(catalog)
lm1 = lm(SpendRat ~ ., data = catalog)
summary(lm1)
##
## Call:
## lm(formula = SpendRat ~ ., data = catalog)
##
## Residuals:
## Min 1Q Median 3Q Max
## -101.616 -31.110 -8.238 16.176 273.558
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -37.53015 179.15157 -0.209 0.83433
## Age 0.44224 0.40620 1.089 0.27789
## LenRes 0.74215 0.50013 1.484 0.13976
## Income -1.45995 3.53909 -0.413 0.68050
## TotAsset -0.04689 0.08417 -0.557 0.57818
## SecAssets 0.10371 0.26069 0.398 0.69128
## ShortLiq 0.12096 0.13839 0.874 0.38341
## LongLiq -0.07824 0.46023 -0.170 0.86521
## WlthIdx -0.02007 0.11985 -0.167 0.86724
## SpendVol 0.01635 0.04444 0.368 0.71349
## SpenVel 0.02413 0.02743 0.880 0.38040
## CollGifts1 25.96189 11.76414 2.207 0.02872 *
## BricMortar1 35.20239 11.07492 3.179 0.00177 **
## MarthaHome1 28.37021 10.72825 2.644 0.00898 **
## SunAds1 -0.70414 13.12672 -0.054 0.95729
## ThemeColl1 21.83030 10.45189 2.089 0.03829 *
## CustDec1 8.27352 11.63807 0.711 0.47816
## RetailKids1 -4.49706 11.16289 -0.403 0.68758
## TeenWr1 13.49246 9.72981 1.387 0.16742
## Carlovers1 9.93914 10.06260 0.988 0.32475
## CountryColl1 6.64051 13.66314 0.486 0.62761
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 58.27 on 163 degrees of freedom
## Multiple R-squared: 0.3078, Adjusted R-squared: 0.2229
## F-statistic: 3.624 on 20 and 163 DF, p-value: 2.3e-06
car::vif(lm1)
## Age LenRes Income TotAsset SecAssets ShortLiq
## 1.655402 1.332905 1.320161 9.174431 23.345506 4.623496
## LongLiq WlthIdx SpendVol SpenVel CollGifts BricMortar
## 35.327844 6.276715 2.525015 1.915365 1.874120 1.363124
## MarthaHome SunAds ThemeColl CustDec RetailKids TeenWr
## 1.444192 2.287891 1.416909 1.665059 1.542838 1.281235
## Carlovers CountryColl
## 1.099383 2.461967
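We seem to have a multicollinearity issue among the predictors LongLiq, SecAssets, TotAsset and WlthIdx, whose variance inflation factors (35.3, 23.3, 9.2 and 6.3) are well above those of the other variables.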
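For a single numeric predictor, the variance inflation factor is 1 / (1 - R²), where R² comes from regressing that predictor on all the other predictors. The sketch below computes it by hand on simulated data. The data frame `toy` and the helper `vif_by_hand` are hypothetical, and `x2` is built to be nearly a copy of `x1` so the effect is visible.

```r
set.seed(42)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.2)   # nearly a copy of x1, so strongly collinear
x3 <- rnorm(n)                  # unrelated to the others
y  <- 1 + x1 + x3 + rnorm(n)
toy <- data.frame(y, x1, x2, x3)

# VIF for one predictor: regress it on the other predictors, then 1 / (1 - R^2).
vif_by_hand <- function(data, target, response) {
  others <- setdiff(names(data), c(target, response))
  fit <- lm(reformulate(others, response = target), data = data)
  1 / (1 - summary(fit)$r.squared)
}

vifs <- sapply(c("x1", "x2", "x3"), vif_by_hand, data = toy, response = "y")
round(vifs, 2)
```

In this simulation `x1` and `x2` should come out with large values and `x3` close to 1. Common rules of thumb treat values above 5 or 10 as a warning sign, although these cutoffs are conventions rather than strict tests.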
There is a relationship between the predictors and the response, as indicated by the overall F-test p-value of 2.3e-06.
CollGifts, BricMortar, MarthaHome and ThemeColl appear to have a statistically significant relationship with the response at the 0.05 level.
The coefficient for Age is 0.44. This suggests that for each additional year of age, SpendRat is expected to increase by about 0.44, holding the other predictors fixed, although this relationship was not found to be significant (p = 0.28).
lm1_step = step(lm1, direction = "both")
## Start: AIC=1515.65
## SpendRat ~ Age + LenRes + Income + TotAsset + SecAssets + ShortLiq +
## LongLiq + WlthIdx + SpendVol + SpenVel + CollGifts + BricMortar +
## MarthaHome + SunAds + ThemeColl + CustDec + RetailKids +
## TeenWr + Carlovers + CountryColl
##
## Df Sum of Sq RSS AIC
## - SunAds 1 10 553440 1513.7
## - WlthIdx 1 95 553526 1513.7
## - LongLiq 1 98 553529 1513.7
## - SpendVol 1 459 553890 1513.8
## - SecAssets 1 537 553968 1513.8
## - RetailKids 1 551 553982 1513.8
## - Income 1 578 554008 1513.8
## - CountryColl 1 802 554233 1513.9
## - TotAsset 1 1054 554485 1514.0
## - CustDec 1 1716 555147 1514.2
## - ShortLiq 1 2594 556024 1514.5
## - SpenVel 1 2627 556057 1514.5
## - Carlovers 1 3312 556743 1514.8
## - Age 1 4024 557455 1515.0
## <none> 553431 1515.7
## - TeenWr 1 6529 559960 1515.8
## - LenRes 1 7476 560907 1516.1
## - ThemeColl 1 14812 568242 1518.5
## - CollGifts 1 16536 569966 1519.1
## - MarthaHome 1 23743 577174 1521.4
## - BricMortar 1 34303 587734 1524.7
##
## Step: AIC=1513.65
## SpendRat ~ Age + LenRes + Income + TotAsset + SecAssets + ShortLiq +
## LongLiq + WlthIdx + SpendVol + SpenVel + CollGifts + BricMortar +
## MarthaHome + ThemeColl + CustDec + RetailKids + TeenWr +
## Carlovers + CountryColl
##
## Df Sum of Sq RSS AIC
## - LongLiq 1 92 553532 1511.7
## - WlthIdx 1 99 553540 1511.7
## - SpendVol 1 458 553898 1511.8
## - SecAssets 1 529 553969 1511.8
## - Income 1 586 554026 1511.8
## - RetailKids 1 656 554097 1511.9
## - CountryColl 1 955 554395 1512.0
## - TotAsset 1 1044 554485 1512.0
## - CustDec 1 1718 555159 1512.2
## - ShortLiq 1 2591 556032 1512.5
## - SpenVel 1 2659 556099 1512.5
## - Carlovers 1 3340 556780 1512.8
## - Age 1 4076 557516 1513.0
## <none> 553440 1513.7
## - TeenWr 1 6599 560039 1513.8
## - LenRes 1 7513 560954 1514.1
## + SunAds 1 10 553431 1515.7
## - ThemeColl 1 15012 568453 1516.6
## - CollGifts 1 16616 570057 1517.1
## - MarthaHome 1 23879 577319 1519.4
## - BricMortar 1 36503 589943 1523.4
##
## Step: AIC=1511.68
## SpendRat ~ Age + LenRes + Income + TotAsset + SecAssets + ShortLiq +
## WlthIdx + SpendVol + SpenVel + CollGifts + BricMortar + MarthaHome +
## ThemeColl + CustDec + RetailKids + TeenWr + Carlovers + CountryColl
##
## Df Sum of Sq RSS AIC
## - WlthIdx 1 48 553581 1509.7
## - SpendVol 1 451 553983 1509.8
## - Income 1 624 554156 1509.9
## - RetailKids 1 674 554206 1509.9
## - CountryColl 1 907 554439 1510.0
## - SecAssets 1 1524 555057 1510.2
## - CustDec 1 1709 555241 1510.2
## - TotAsset 1 2103 555635 1510.4
## - ShortLiq 1 3333 556865 1510.8
## - Carlovers 1 3446 556979 1510.8
## - Age 1 3984 557516 1511.0
## - SpenVel 1 4244 557776 1511.1
## <none> 553532 1511.7
## - TeenWr 1 6509 560041 1511.8
## - LenRes 1 7495 561027 1512.2
## + LongLiq 1 92 553440 1513.7
## + SunAds 1 3 553529 1513.7
## - ThemeColl 1 14957 568489 1514.6
## - CollGifts 1 16637 570170 1515.1
## - MarthaHome 1 23788 577320 1517.4
## - BricMortar 1 37109 590641 1521.6
##
## Step: AIC=1509.7
## SpendRat ~ Age + LenRes + Income + TotAsset + SecAssets + ShortLiq +
## SpendVol + SpenVel + CollGifts + BricMortar + MarthaHome +
## ThemeColl + CustDec + RetailKids + TeenWr + Carlovers + CountryColl
##
## Df Sum of Sq RSS AIC
## - SpendVol 1 403 553983 1507.8
## - Income 1 612 554193 1507.9
## - RetailKids 1 649 554230 1507.9
## - CountryColl 1 911 554491 1508.0
## - SecAssets 1 1608 555189 1508.2
## - CustDec 1 1667 555248 1508.2
## - TotAsset 1 3379 556960 1508.8
## - Carlovers 1 3427 557007 1508.8
## - Age 1 4112 557693 1509.1
## - SpenVel 1 4215 557796 1509.1
## - ShortLiq 1 4563 558144 1509.2
## <none> 553581 1509.7
## - TeenWr 1 6623 560204 1509.9
## - LenRes 1 7459 561040 1510.2
## + WlthIdx 1 48 553532 1511.7
## + LongLiq 1 41 553540 1511.7
## + SunAds 1 7 553574 1511.7
## - ThemeColl 1 14924 568505 1512.6
## - CollGifts 1 16635 570215 1513.2
## - MarthaHome 1 23912 577493 1515.5
## - BricMortar 1 37094 590675 1519.6
##
## Step: AIC=1507.83
## SpendRat ~ Age + LenRes + Income + TotAsset + SecAssets + ShortLiq +
## SpenVel + CollGifts + BricMortar + MarthaHome + ThemeColl +
## CustDec + RetailKids + TeenWr + Carlovers + CountryColl
##
## Df Sum of Sq RSS AIC
## - Income 1 509 554492 1506.0
## - RetailKids 1 758 554741 1506.1
## - CountryColl 1 903 554886 1506.1
## - SecAssets 1 1277 555260 1506.3
## - CustDec 1 2072 556055 1506.5
## - TotAsset 1 3076 557059 1506.8
## - Carlovers 1 3434 557417 1507.0
## - Age 1 3735 557718 1507.1
## - ShortLiq 1 4185 558168 1507.2
## - SpenVel 1 5206 559189 1507.5
## <none> 553983 1507.8
## - LenRes 1 7071 561054 1508.2
## - TeenWr 1 7360 561343 1508.3
## + SpendVol 1 403 553581 1509.7
## + LongLiq 1 70 553914 1509.8
## + SunAds 1 3 553980 1509.8
## + WlthIdx 1 0 553983 1509.8
## - ThemeColl 1 14656 568639 1510.6
## - CollGifts 1 17381 571364 1511.5
## - MarthaHome 1 23510 577494 1513.5
## - BricMortar 1 36941 590924 1517.7
##
## Step: AIC=1506
## SpendRat ~ Age + LenRes + TotAsset + SecAssets + ShortLiq + SpenVel +
## CollGifts + BricMortar + MarthaHome + ThemeColl + CustDec +
## RetailKids + TeenWr + Carlovers + CountryColl
##
## Df Sum of Sq RSS AIC
## - RetailKids 1 893 555385 1504.3
## - CountryColl 1 1228 555720 1504.4
## - SecAssets 1 1427 555918 1504.5
## - CustDec 1 1877 556369 1504.6
## - Carlovers 1 3161 557653 1505.0
## - TotAsset 1 3581 558073 1505.2
## - Age 1 3670 558162 1505.2
## - ShortLiq 1 4254 558745 1505.4
## - SpenVel 1 5023 559515 1505.7
## <none> 554492 1506.0
## - TeenWr 1 7204 561696 1506.4
## - LenRes 1 7787 562279 1506.6
## + Income 1 509 553983 1507.8
## + SpendVol 1 299 554193 1507.9
## + LongLiq 1 99 554393 1508.0
## + SunAds 1 6 554486 1508.0
## + WlthIdx 1 0 554492 1508.0
## - ThemeColl 1 14375 568867 1508.7
## - CollGifts 1 16889 571381 1509.5
## - MarthaHome 1 23125 577617 1511.5
## - BricMortar 1 36434 590926 1515.7
##
## Step: AIC=1504.3
## SpendRat ~ Age + LenRes + TotAsset + SecAssets + ShortLiq + SpenVel +
## CollGifts + BricMortar + MarthaHome + ThemeColl + CustDec +
## TeenWr + Carlovers + CountryColl
##
## Df Sum of Sq RSS AIC
## - CustDec 1 1206 556591 1502.7
## - CountryColl 1 1246 556631 1502.7
## - SecAssets 1 1334 556719 1502.7
## - Carlovers 1 3007 558393 1503.3
## - TotAsset 1 3565 558950 1503.5
## - ShortLiq 1 4060 559445 1503.6
## - Age 1 4301 559686 1503.7
## - SpenVel 1 4574 559959 1503.8
## <none> 555385 1504.3
## - TeenWr 1 6807 562192 1504.5
## - LenRes 1 7239 562624 1504.7
## + RetailKids 1 893 554492 1506.0
## + Income 1 644 554741 1506.1
## + SpendVol 1 390 554995 1506.2
## + LongLiq 1 149 555237 1506.2
## + SunAds 1 122 555263 1506.3
## + WlthIdx 1 7 555378 1506.3
## - ThemeColl 1 14425 569810 1507.0
## - CollGifts 1 16067 571452 1507.5
## - MarthaHome 1 23823 579208 1510.0
## - BricMortar 1 36484 591869 1514.0
##
## Step: AIC=1502.7
## SpendRat ~ Age + LenRes + TotAsset + SecAssets + ShortLiq + SpenVel +
## CollGifts + BricMortar + MarthaHome + ThemeColl + TeenWr +
## Carlovers + CountryColl
##
## Df Sum of Sq RSS AIC
## - CountryColl 1 1236 557827 1501.1
## - SecAssets 1 1755 558346 1501.3
## - Carlovers 1 3049 559640 1501.7
## - Age 1 3716 560307 1501.9
## - TotAsset 1 3892 560483 1502.0
## - SpenVel 1 4114 560705 1502.0
## - ShortLiq 1 4154 560745 1502.1
## <none> 556591 1502.7
## - TeenWr 1 6138 562729 1502.7
## - LenRes 1 7692 564283 1503.2
## + CustDec 1 1206 555385 1504.3
## + SpendVol 1 669 555922 1504.5
## + Income 1 389 556202 1504.6
## + RetailKids 1 222 556369 1504.6
## + LongLiq 1 159 556432 1504.6
## + WlthIdx 1 51 556539 1504.7
## + SunAds 1 0 556591 1504.7
## - ThemeColl 1 14916 571506 1505.6
## - CollGifts 1 17764 574355 1506.5
## - MarthaHome 1 31862 588452 1510.9
## - BricMortar 1 37381 593972 1512.7
##
## Step: AIC=1501.1
## SpendRat ~ Age + LenRes + TotAsset + SecAssets + ShortLiq + SpenVel +
## CollGifts + BricMortar + MarthaHome + ThemeColl + TeenWr +
## Carlovers
##
## Df Sum of Sq RSS AIC
## - SecAssets 1 1751 559578 1499.7
## - Carlovers 1 2906 560733 1500.1
## - TotAsset 1 4226 562053 1500.5
## - SpenVel 1 4255 562082 1500.5
## - Age 1 4265 562092 1500.5
## - ShortLiq 1 4448 562275 1500.6
## - TeenWr 1 5758 563585 1501.0
## <none> 557827 1501.1
## - LenRes 1 7888 565715 1501.7
## + CountryColl 1 1236 556591 1502.7
## + CustDec 1 1196 556631 1502.7
## + Income 1 679 557148 1502.9
## + SpendVol 1 617 557210 1502.9
## + SunAds 1 245 557582 1503.0
## + RetailKids 1 232 557595 1503.0
## + LongLiq 1 94 557732 1503.1
## + WlthIdx 1 46 557781 1503.1
## - ThemeColl 1 19861 577688 1505.5
## - MarthaHome 1 30912 588739 1509.0
## - CollGifts 1 31502 589329 1509.2
## - BricMortar 1 39718 597545 1511.8
##
## Step: AIC=1499.68
## SpendRat ~ Age + LenRes + TotAsset + ShortLiq + SpenVel + CollGifts +
## BricMortar + MarthaHome + ThemeColl + TeenWr + Carlovers
##
## Df Sum of Sq RSS AIC
## - TotAsset 1 2601 562179 1498.5
## - Carlovers 1 2784 562361 1498.6
## - ShortLiq 1 3313 562891 1498.8
## - SpenVel 1 4015 563593 1499.0
## - Age 1 4917 564495 1499.3
## - TeenWr 1 5231 564809 1499.4
## <none> 559578 1499.7
## - LenRes 1 7852 567430 1500.2
## + SecAssets 1 1751 557827 1501.1
## + CustDec 1 1616 557962 1501.2
## + LongLiq 1 1270 558308 1501.3
## + CountryColl 1 1232 558346 1501.3
## + Income 1 805 558773 1501.4
## + SunAds 1 503 559075 1501.5
## + SpendVol 1 142 559435 1501.6
## + RetailKids 1 127 559451 1501.6
## + WlthIdx 1 2 559576 1501.7
## - ThemeColl 1 21335 580912 1504.6
## - MarthaHome 1 29970 589548 1507.3
## - CollGifts 1 30308 589886 1507.4
## - BricMortar 1 39974 599552 1510.4
##
## Step: AIC=1498.53
## SpendRat ~ Age + LenRes + ShortLiq + SpenVel + CollGifts + BricMortar +
## MarthaHome + ThemeColl + TeenWr + Carlovers
##
## Df Sum of Sq RSS AIC
## - ShortLiq 1 1368 563547 1497.0
## - Carlovers 1 2491 564670 1497.3
## - SpenVel 1 2659 564838 1497.4
## - Age 1 4951 567130 1498.2
## - TeenWr 1 5173 567353 1498.2
## <none> 562179 1498.5
## - LenRes 1 7996 570175 1499.1
## + TotAsset 1 2601 559578 1499.7
## + CountryColl 1 1634 560545 1500.0
## + WlthIdx 1 1423 560757 1500.1
## + CustDec 1 1393 560786 1500.1
## + Income 1 1360 560820 1500.1
## + SunAds 1 800 561379 1500.3
## + LongLiq 1 422 561757 1500.4
## + RetailKids 1 206 561973 1500.5
## + SpendVol 1 155 562024 1500.5
## + SecAssets 1 126 562053 1500.5
## - ThemeColl 1 20580 582759 1503.2
## - MarthaHome 1 28672 590851 1505.7
## - CollGifts 1 30914 593093 1506.4
## - BricMortar 1 40959 603138 1509.5
##
## Step: AIC=1496.98
## SpendRat ~ Age + LenRes + SpenVel + CollGifts + BricMortar +
## MarthaHome + ThemeColl + TeenWr + Carlovers
##
## Df Sum of Sq RSS AIC
## - SpenVel 1 1817 565364 1495.6
## - Carlovers 1 2401 565949 1495.8
## - Age 1 4878 568425 1496.6
## - TeenWr 1 5668 569215 1496.8
## <none> 563547 1497.0
## - LenRes 1 7758 571305 1497.5
## + CountryColl 1 1663 561884 1498.4
## + CustDec 1 1422 562125 1498.5
## + ShortLiq 1 1368 562179 1498.5
## + Income 1 1117 562430 1498.6
## + SunAds 1 929 562618 1498.7
## + TotAsset 1 656 562891 1498.8
## + RetailKids 1 135 563412 1498.9
## + SpendVol 1 71 563476 1499.0
## + LongLiq 1 8 563539 1499.0
## + SecAssets 1 8 563540 1499.0
## + WlthIdx 1 0 563547 1499.0
## - ThemeColl 1 21624 585171 1501.9
## - MarthaHome 1 29449 592996 1504.3
## - CollGifts 1 31553 595100 1505.0
## - BricMortar 1 40728 604275 1507.8
##
## Step: AIC=1495.57
## SpendRat ~ Age + LenRes + CollGifts + BricMortar + MarthaHome +
## ThemeColl + TeenWr + Carlovers
##
## Df Sum of Sq RSS AIC
## - Carlovers 1 2105 567469 1494.3
## - Age 1 4033 569396 1494.9
## - TeenWr 1 5363 570727 1495.3
## <none> 565364 1495.6
## - LenRes 1 7298 572662 1495.9
## + SpenVel 1 1817 563547 1497.0
## + CountryColl 1 1656 563707 1497.0
## + CustDec 1 1060 564304 1497.2
## + SunAds 1 893 564470 1497.3
## + Income 1 893 564471 1497.3
## + ShortLiq 1 526 564838 1497.4
## + TotAsset 1 453 564911 1497.4
## + RetailKids 1 71 565293 1497.5
## + LongLiq 1 35 565329 1497.6
## + SpendVol 1 25 565339 1497.6
## + SecAssets 1 5 565359 1497.6
## + WlthIdx 1 4 565360 1497.6
## - ThemeColl 1 21657 587020 1500.5
## - MarthaHome 1 28866 594230 1502.7
## - CollGifts 1 31334 596697 1503.5
## - BricMortar 1 39910 605274 1506.1
##
## Step: AIC=1494.26
## SpendRat ~ Age + LenRes + CollGifts + BricMortar + MarthaHome +
## ThemeColl + TeenWr
##
## Df Sum of Sq RSS AIC
## - Age 1 3605 571073 1493.4
## - TeenWr 1 5618 573087 1494.1
## <none> 567469 1494.3
## - LenRes 1 7775 575244 1494.8
## + Carlovers 1 2105 565364 1495.6
## + SpenVel 1 1520 565949 1495.8
## + CountryColl 1 1491 565978 1495.8
## + SunAds 1 1259 566210 1495.8
## + CustDec 1 1116 566353 1495.9
## + Income 1 550 566919 1496.1
## + ShortLiq 1 523 566946 1496.1
## + TotAsset 1 396 567073 1496.1
## + RetailKids 1 42 567427 1496.2
## + SpendVol 1 35 567434 1496.2
## + LongLiq 1 30 567439 1496.2
## + WlthIdx 1 9 567460 1496.2
## + SecAssets 1 7 567462 1496.2
## - ThemeColl 1 21497 588965 1499.1
## - MarthaHome 1 28957 596425 1501.4
## - CollGifts 1 34167 601636 1503.0
## - BricMortar 1 39869 607338 1504.8
##
## Step: AIC=1493.42
## SpendRat ~ LenRes + CollGifts + BricMortar + MarthaHome + ThemeColl +
## TeenWr
##
## Df Sum of Sq RSS AIC
## - TeenWr 1 4142 575216 1492.8
## <none> 571073 1493.4
## + Age 1 3605 567469 1494.3
## + CountryColl 1 2062 569011 1494.8
## + SunAds 1 1681 569393 1494.9
## + Carlovers 1 1677 569396 1494.9
## + SpenVel 1 816 570257 1495.2
## + Income 1 753 570321 1495.2
## + CustDec 1 661 570412 1495.2
## + ShortLiq 1 628 570446 1495.2
## + TotAsset 1 473 570600 1495.3
## + SpendVol 1 362 570712 1495.3
## + RetailKids 1 342 570731 1495.3
## + WlthIdx 1 37 571037 1495.4
## + SecAssets 1 26 571048 1495.4
## + LongLiq 1 1 571073 1495.4
## - LenRes 1 13686 584759 1495.8
## - ThemeColl 1 21820 592893 1498.3
## - MarthaHome 1 30234 601307 1500.9
## - CollGifts 1 33676 604750 1502.0
## - BricMortar 1 39905 610979 1503.8
##
## Step: AIC=1492.75
## SpendRat ~ LenRes + CollGifts + BricMortar + MarthaHome + ThemeColl
##
## Df Sum of Sq RSS AIC
## <none> 575216 1492.8
## + TeenWr 1 4142 571073 1493.4
## + Age 1 2129 573087 1494.1
## + Carlovers 1 1963 573253 1494.1
## + CountryColl 1 1488 573728 1494.3
## + SunAds 1 986 574230 1494.4
## + ShortLiq 1 943 574273 1494.5
## + SpenVel 1 743 574473 1494.5
## + Income 1 513 574703 1494.6
## + TotAsset 1 338 574878 1494.6
## + CustDec 1 306 574910 1494.7
## + RetailKids 1 196 575020 1494.7
## + SpendVol 1 34 575182 1494.7
## + SecAssets 1 4 575212 1494.8
## + WlthIdx 1 1 575215 1494.8
## + LongLiq 1 1 575215 1494.8
## - LenRes 1 13896 589112 1495.1
## - ThemeColl 1 24503 599719 1498.4
## - MarthaHome 1 30273 605489 1500.2
## - CollGifts 1 34790 610006 1501.6
## - BricMortar 1 51783 626999 1506.6
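The AIC values in the trace above can be reproduced by hand. For linear models, step() relies on extractAIC(), which computes n * log(RSS / n) + 2 * (number of coefficients). The sketch below plugs in RSS values read from the trace. `step_aic` is a hypothetical helper name, and n = 184 is taken from the model summary (163 residual degrees of freedom plus 21 coefficients).

```r
# Values read from the step() trace in this report.
n <- 184   # rows left after filtering

# extractAIC() for lm: n * log(RSS / n) + 2 * edf
step_aic <- function(rss, n_coef, n) n * log(rss / n) + 2 * n_coef

full_aic  <- step_aic(rss = 553431, n_coef = 21, n = n)  # start of the trace
final_aic <- step_aic(rss = 575216, n_coef = 6,  n = n)  # last step

round(c(full = full_aic, final = final_aic), 2)
```

These values should agree with the first and last "AIC=" lines of the trace up to rounding of the printed RSS. They differ from what AIC() returns by an additive constant, which does not affect the comparison between models fitted to the same data.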
summary(lm1_step)
##
## Call:
## lm(formula = SpendRat ~ LenRes + CollGifts + BricMortar + MarthaHome +
## ThemeColl, data = catalog)
##
## Residuals:
## Min 1Q Median 3Q Max
## -97.342 -30.315 -7.095 13.601 272.223
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -15.7432 9.6691 -1.628 0.10525
## LenRes 0.8779 0.4233 2.074 0.03955 *
## CollGifts1 30.3194 9.2406 3.281 0.00124 **
## BricMortar1 39.2957 9.8165 4.003 9.17e-05 ***
## MarthaHome1 28.6729 9.3680 3.061 0.00255 **
## ThemeColl1 25.5835 9.2909 2.754 0.00651 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 56.85 on 178 degrees of freedom
## Multiple R-squared: 0.2806, Adjusted R-squared: 0.2604
## F-statistic: 13.88 on 5 and 178 DF, p-value: 1.873e-11
Using a stepwise selection process on our original model, we are left with a five-variable model with an adjusted R-squared of roughly 26%. While this is an improvement on the 22% achieved by our original model, it still isn’t good.
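The adjusted R-squared penalises the plain R-squared for the number of predictors: 1 - (1 - R²)(n - 1) / (n - p - 1). The sketch below applies that formula to the values printed in the two model summaries. `adj_r2` is a hypothetical helper, and n = 184 and the predictor counts (20 and 5) are read from the degrees of freedom in the output.

```r
# Adjusted R-squared from R-squared, sample size n and number of predictors p.
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

n <- 184
full    <- adj_r2(r2 = 0.3078, n = n, p = 20)  # original model
reduced <- adj_r2(r2 = 0.2806, n = n, p = 5)   # model chosen by step()

round(c(full = full, reduced = reduced), 4)
```

This makes the trade-off explicit: the reduced model has a lower plain R-squared (0.28 versus 0.31) but a higher adjusted value, because it spends far fewer degrees of freedom.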
#par(mfrow = c(2,2))
plot(lm1_step)
The QQ plot suggests that the residuals are not normally distributed, although it seems like this could be improved by trimming the tails. The residual plots identify several large outliers, and the leverage plot identifies several observations with unusually high leverage.
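The observations highlighted in these plots can also be listed numerically. The sketch below is a minimal example on simulated data (`toy` is hypothetical and contains one deliberately unusual point). The cutoffs used, 2p/n for leverage, 3 for standardized residuals and 4/n for Cook's distance, are common rules of thumb that I am assuming here, not thresholds taken from the plots.

```r
set.seed(1)
toy <- data.frame(x = c(rnorm(49), 8))   # last point has an extreme x value
toy$y <- 2 + 3 * toy$x + rnorm(50)
toy$y[50] <- toy$y[50] - 30              # and it does not follow the trend

fit <- lm(y ~ x, data = toy)

p <- length(coef(fit))
n <- nrow(toy)

lev   <- hatvalues(fit)        # leverage of each observation
cooks <- cooks.distance(fit)   # influence on the fitted coefficients
rstd  <- rstandard(fit)        # standardized residuals

# Rule-of-thumb cutoffs (conventions, not hard rules).
flagged <- data.frame(
  obs           = seq_len(n),
  high_leverage = lev > 2 * p / n,
  large_resid   = abs(rstd) > 3,
  high_cooks    = cooks > 4 / n
)
flagged[flagged$high_leverage | flagged$large_resid | flagged$high_cooks, ]
```

Applied to lm1_step, the same three functions would give row numbers that can be inspected in the data before deciding whether any observation should be removed.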
lm2 = lm(SpendRat ~ . + log(Age) - Age, data = catalog)
summary(lm2)
##
## Call:
## lm(formula = SpendRat ~ . + log(Age) - Age, data = catalog)
##
## Residuals:
## Min 1Q Median 3Q Max
## -101.354 -30.552 -7.855 16.688 272.809
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -100.87973 191.38978 -0.527 0.59885
## LenRes 0.74654 0.50366 1.482 0.14021
## Income -1.50917 3.54660 -0.426 0.67101
## TotAsset -0.04625 0.08425 -0.549 0.58375
## SecAssets 0.08682 0.25930 0.335 0.73818
## ShortLiq 0.11587 0.13824 0.838 0.40316
## LongLiq -0.05007 0.45803 -0.109 0.91309
## WlthIdx -0.02162 0.11994 -0.180 0.85716
## SpendVol 0.01302 0.04414 0.295 0.76838
## SpenVel 0.02467 0.02744 0.899 0.36993
## CollGifts1 25.96856 11.77802 2.205 0.02887 *
## BricMortar1 35.45369 11.08033 3.200 0.00165 **
## MarthaHome1 28.22609 10.75515 2.624 0.00950 **
## SunAds1 -0.34539 13.13803 -0.026 0.97906
## ThemeColl1 21.48743 10.44842 2.057 0.04133 *
## CustDec1 8.11430 11.64682 0.697 0.48698
## RetailKids1 -4.79173 11.19595 -0.428 0.66922
## TeenWr1 13.26228 9.73176 1.363 0.17483
## Carlovers1 9.89343 10.07896 0.982 0.32776
## CountryColl1 6.73293 13.67937 0.492 0.62324
## log(Age) 20.06358 21.54960 0.931 0.35321
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 58.33 on 163 degrees of freedom
## Multiple R-squared: 0.3065, Adjusted R-squared: 0.2214
## F-statistic: 3.601 on 20 and 163 DF, p-value: 2.595e-06
lm2_step = step(lm2, direction = "both")
## Start: AIC=1516.01
## SpendRat ~ Age + LenRes + Income + TotAsset + SecAssets + ShortLiq +
## LongLiq + WlthIdx + SpendVol + SpenVel + CollGifts + BricMortar +
## MarthaHome + SunAds + ThemeColl + CustDec + RetailKids +
## TeenWr + Carlovers + CountryColl + log(Age) - Age
##
## Df Sum of Sq RSS AIC
## - SunAds 1 2 554508 1514.0
## - LongLiq 1 41 554547 1514.0
## - WlthIdx 1 111 554617 1514.0
## - SpendVol 1 296 554802 1514.1
## - SecAssets 1 381 554888 1514.1
## - Income 1 616 555122 1514.2
## - RetailKids 1 623 555129 1514.2
## - CountryColl 1 824 555330 1514.3
## - TotAsset 1 1025 555531 1514.3
## - CustDec 1 1651 556157 1514.5
## - ShortLiq 1 2390 556896 1514.8
## - SpenVel 1 2750 557256 1514.9
## - log(Age) 1 2949 557455 1515.0
## - Carlovers 1 3278 557784 1515.1
## <none> 554506 1516.0
## - TeenWr 1 6318 560824 1516.1
## - LenRes 1 7474 561980 1516.5
## - ThemeColl 1 14388 568894 1518.7
## - CollGifts 1 16538 571044 1519.4
## - MarthaHome 1 23431 577937 1521.6
## - BricMortar 1 34829 589335 1525.2
##
## Step: AIC=1514.01
## SpendRat ~ LenRes + Income + TotAsset + SecAssets + ShortLiq +
## LongLiq + WlthIdx + SpendVol + SpenVel + CollGifts + BricMortar +
## MarthaHome + ThemeColl + CustDec + RetailKids + TeenWr +
## Carlovers + CountryColl + log(Age)
##
## Df Sum of Sq RSS AIC
## - LongLiq 1 39 554547 1512.0
## - WlthIdx 1 113 554621 1512.0
## - SpendVol 1 296 554804 1512.1
## - SecAssets 1 383 554891 1512.1
## - Income 1 620 555129 1512.2
## - RetailKids 1 712 555220 1512.2
## - TotAsset 1 1027 555536 1512.3
## - CountryColl 1 1043 555552 1512.3
## - CustDec 1 1673 556181 1512.6
## - ShortLiq 1 2403 556912 1512.8
## - SpenVel 1 2772 557280 1512.9
## - log(Age) 1 3007 557516 1513.0
## - Carlovers 1 3335 557844 1513.1
## <none> 554508 1514.0
## - TeenWr 1 6369 560878 1514.1
## - LenRes 1 7503 562011 1514.5
## + SunAds 1 2 554506 1516.0
## - ThemeColl 1 14621 569130 1516.8
## - CollGifts 1 16600 571108 1517.4
## - MarthaHome 1 23550 578058 1519.7
## - BricMortar 1 36859 591368 1523.8
##
## Step: AIC=1512.02
## SpendRat ~ LenRes + Income + TotAsset + SecAssets + ShortLiq +
## WlthIdx + SpendVol + SpenVel + CollGifts + BricMortar + MarthaHome +
## ThemeColl + CustDec + RetailKids + TeenWr + Carlovers + CountryColl +
## log(Age)
##
## Df Sum of Sq RSS AIC
## - WlthIdx 1 80 554628 1510.0
## - SpendVol 1 295 554842 1510.1
## - Income 1 647 555194 1510.2
## - RetailKids 1 722 555269 1510.3
## - CountryColl 1 1013 555560 1510.3
## - SecAssets 1 1444 555991 1510.5
## - CustDec 1 1668 556215 1510.6
## - TotAsset 1 1884 556431 1510.6
## - log(Age) 1 2969 557516 1511.0
## - ShortLiq 1 3356 557904 1511.1
## - Carlovers 1 3413 557960 1511.2
## - SpenVel 1 4151 558698 1511.4
## <none> 554547 1512.0
## - TeenWr 1 6333 560880 1512.1
## - LenRes 1 7616 562163 1512.5
## + LongLiq 1 39 554508 1514.0
## + SunAds 1 1 554547 1514.0
## - ThemeColl 1 14593 569141 1514.8
## - CollGifts 1 16613 571160 1515.5
## - MarthaHome 1 23526 578073 1517.7
## - BricMortar 1 37340 591888 1522.0
##
## Step: AIC=1510.05
## SpendRat ~ LenRes + Income + TotAsset + SecAssets + ShortLiq +
## SpendVol + SpenVel + CollGifts + BricMortar + MarthaHome +
## ThemeColl + CustDec + RetailKids + TeenWr + Carlovers + CountryColl +
## log(Age)
##
## Df Sum of Sq RSS AIC
## - SpendVol 1 229 554856 1508.1
## - Income 1 631 555258 1508.2
## - RetailKids 1 690 555318 1508.3
## - CountryColl 1 1020 555648 1508.4
## - SecAssets 1 1541 556169 1508.6
## - CustDec 1 1609 556236 1508.6
## - log(Age) 1 3065 557693 1509.1
## - TotAsset 1 3176 557804 1509.1
## - Carlovers 1 3386 558014 1509.2
## - SpenVel 1 4076 558704 1509.4
## - ShortLiq 1 4396 559024 1509.5
## <none> 554628 1510.0
## - TeenWr 1 6462 561090 1510.2
## - LenRes 1 7537 562165 1510.5
## + WlthIdx 1 80 554547 1512.0
## + LongLiq 1 6 554621 1512.0
## + SunAds 1 3 554625 1512.0
## - ThemeColl 1 14517 569145 1512.8
## - CollGifts 1 16610 571237 1513.5
## - MarthaHome 1 23569 578196 1515.7
## - BricMortar 1 37273 591901 1520.0
##
## Step: AIC=1508.12
## SpendRat ~ LenRes + Income + TotAsset + SecAssets + ShortLiq +
## SpenVel + CollGifts + BricMortar + MarthaHome + ThemeColl +
## CustDec + RetailKids + TeenWr + Carlovers + CountryColl +
## log(Age)
##
## Df Sum of Sq RSS AIC
## - Income 1 549 555406 1506.3
## - RetailKids 1 775 555631 1506.4
## - CountryColl 1 1006 555862 1506.5
## - SecAssets 1 1326 556182 1506.6
## - CustDec 1 1934 556791 1506.8
## - log(Age) 1 2862 557718 1507.1
## - TotAsset 1 3156 558013 1507.2
## - Carlovers 1 3392 558248 1507.2
## - ShortLiq 1 4167 559024 1507.5
## - SpenVel 1 4913 559769 1507.7
## <none> 554856 1508.1
## - TeenWr 1 7081 561937 1508.5
## - LenRes 1 7311 562167 1508.5
## + SpendVol 1 229 554628 1510.0
## + LongLiq 1 18 554838 1510.1
## + WlthIdx 1 14 554842 1510.1
## + SunAds 1 1 554855 1510.1
## - ThemeColl 1 14351 569208 1510.8
## - CollGifts 1 17235 572092 1511.8
## - MarthaHome 1 23363 578220 1513.7
## - BricMortar 1 37144 592001 1518.0
##
## Step: AIC=1506.3
## SpendRat ~ LenRes + TotAsset + SecAssets + ShortLiq + SpenVel +
## CollGifts + BricMortar + MarthaHome + ThemeColl + CustDec +
## RetailKids + TeenWr + Carlovers + CountryColl + log(Age)
##
## Df Sum of Sq RSS AIC
## - RetailKids 1 922 556328 1504.6
## - CountryColl 1 1367 556773 1504.8
## - SecAssets 1 1490 556896 1504.8
## - CustDec 1 1735 557141 1504.9
## - log(Age) 1 2757 558162 1505.2
## - Carlovers 1 3106 558512 1505.3
## - TotAsset 1 3694 559100 1505.5
## - ShortLiq 1 4241 559647 1505.7
## - SpenVel 1 4721 560127 1505.9
## <none> 555406 1506.3
## - TeenWr 1 6910 562316 1506.6
## - LenRes 1 8110 563516 1507.0
## + Income 1 549 554856 1508.1
## + SpendVol 1 148 555258 1508.2
## + LongLiq 1 34 555371 1508.3
## + WlthIdx 1 14 555391 1508.3
## + SunAds 1 3 555402 1508.3
## - ThemeColl 1 14062 569468 1508.9
## - CollGifts 1 16712 572118 1509.8
## - MarthaHome 1 22971 578377 1511.8
## - BricMortar 1 36595 592001 1516.0
##
## Step: AIC=1504.61
## SpendRat ~ LenRes + TotAsset + SecAssets + ShortLiq + SpenVel +
## CollGifts + BricMortar + MarthaHome + ThemeColl + CustDec +
## TeenWr + Carlovers + CountryColl + log(Age)
##
## Df Sum of Sq RSS AIC
## - CustDec 1 1069 557398 1503.0
## - SecAssets 1 1389 557717 1503.1
## - CountryColl 1 1390 557718 1503.1
## - Carlovers 1 2954 559283 1503.6
## - log(Age) 1 3358 559686 1503.7
## - TotAsset 1 3676 560004 1503.8
## - ShortLiq 1 4041 560370 1503.9
## - SpenVel 1 4270 560598 1504.0
## <none> 556328 1504.6
## - TeenWr 1 6516 562844 1504.8
## - LenRes 1 7525 563853 1505.1
## + RetailKids 1 922 555406 1506.3
## + Income 1 697 555631 1506.4
## + SpendVol 1 214 556114 1506.5
## + SunAds 1 113 556215 1506.6
## + LongLiq 1 65 556263 1506.6
## + WlthIdx 1 0 556328 1506.6
## - ThemeColl 1 14088 570416 1507.2
## - CollGifts 1 15868 572197 1507.8
## - MarthaHome 1 23619 579947 1510.3
## - BricMortar 1 36669 592998 1514.3
##
## Step: AIC=1502.96
## SpendRat ~ LenRes + TotAsset + SecAssets + ShortLiq + SpenVel +
## CollGifts + BricMortar + MarthaHome + ThemeColl + TeenWr +
## Carlovers + CountryColl + log(Age)
##
## Df Sum of Sq RSS AIC
## - CountryColl 1 1370 558767 1501.4
## - SecAssets 1 1790 559188 1501.5
## - log(Age) 1 2909 560307 1501.9
## - Carlovers 1 2997 560395 1502.0
## - SpenVel 1 3868 561266 1502.2
## - TotAsset 1 3981 561379 1502.3
## - ShortLiq 1 4132 561529 1502.3
## - TeenWr 1 5916 563313 1502.9
## <none> 557398 1503.0
## - LenRes 1 7934 565332 1503.6
## + CustDec 1 1069 556328 1504.6
## + Income 1 438 556959 1504.8
## + SpendVol 1 428 556969 1504.8
## + RetailKids 1 256 557141 1504.9
## + LongLiq 1 76 557322 1504.9
## + WlthIdx 1 15 557383 1505.0
## + SunAds 1 0 557397 1505.0
## - ThemeColl 1 14580 571977 1505.7
## - CollGifts 1 17505 574903 1506.7
## - MarthaHome 1 31366 588764 1511.0
## - BricMortar 1 37504 594902 1512.9
##
## Step: AIC=1501.41
## SpendRat ~ LenRes + TotAsset + SecAssets + ShortLiq + SpenVel +
## CollGifts + BricMortar + MarthaHome + ThemeColl + TeenWr +
## Carlovers + log(Age)
##
## Df Sum of Sq RSS AIC
## - SecAssets 1 1791 560558 1500.0
## - Carlovers 1 2842 561610 1500.3
## - log(Age) 1 3324 562092 1500.5
## - SpenVel 1 3990 562757 1500.7
## - TotAsset 1 4347 563114 1500.8
## - ShortLiq 1 4439 563207 1500.9
## - TeenWr 1 5503 564270 1501.2
## <none> 558767 1501.4
## - LenRes 1 8180 566947 1502.1
## + CountryColl 1 1370 557398 1503.0
## + CustDec 1 1050 557718 1503.1
## + Income 1 766 558001 1503.2
## + SpendVol 1 369 558398 1503.3
## + RetailKids 1 271 558496 1503.3
## + SunAds 1 267 558500 1503.3
## + LongLiq 1 30 558738 1503.4
## + WlthIdx 1 10 558757 1503.4
## - ThemeColl 1 19579 578346 1505.8
## - MarthaHome 1 30347 589114 1509.1
## - CollGifts 1 31443 590211 1509.5
## - BricMortar 1 39956 598723 1512.1
##
## Step: AIC=1500
## SpendRat ~ LenRes + TotAsset + ShortLiq + SpenVel + CollGifts +
## BricMortar + MarthaHome + ThemeColl + TeenWr + Carlovers +
## log(Age)
##
## Df Sum of Sq RSS AIC
## - TotAsset 1 2689 563247 1498.9
## - Carlovers 1 2723 563281 1498.9
## - ShortLiq 1 3289 563847 1499.1
## - SpenVel 1 3748 564306 1499.2
## - log(Age) 1 3937 564495 1499.3
## - TeenWr 1 4982 565540 1499.6
## <none> 560558 1500.0
## - LenRes 1 8121 568679 1500.7
## + SecAssets 1 1791 558767 1501.4
## + CustDec 1 1448 559110 1501.5
## + LongLiq 1 1409 559149 1501.5
## + CountryColl 1 1370 559188 1501.5
## + Income 1 907 559651 1501.7
## + SunAds 1 538 560020 1501.8
## + RetailKids 1 153 560405 1502.0
## + SpendVol 1 43 560515 1502.0
## + WlthIdx 1 25 560533 1502.0
## - ThemeColl 1 21020 581578 1504.8
## - MarthaHome 1 29366 589924 1507.4
## - CollGifts 1 30232 590790 1507.7
## - BricMortar 1 40241 600799 1510.8
##
## Step: AIC=1498.88
## SpendRat ~ LenRes + ShortLiq + SpenVel + CollGifts + BricMortar +
## MarthaHome + ThemeColl + TeenWr + Carlovers + log(Age)
##
## Df Sum of Sq RSS AIC
## - ShortLiq 1 1314 564561 1497.3
## - SpenVel 1 2401 565648 1497.7
## - Carlovers 1 2422 565669 1497.7
## - log(Age) 1 3884 567130 1498.2
## - TeenWr 1 4907 568154 1498.5
## <none> 563247 1498.9
## - LenRes 1 8312 571559 1499.6
## + TotAsset 1 2689 560558 1500.0
## + CountryColl 1 1803 561444 1500.3
## + WlthIdx 1 1663 561584 1500.3
## + Income 1 1502 561745 1500.4
## + CustDec 1 1229 562018 1500.5
## + SunAds 1 853 562394 1500.6
## + LongLiq 1 401 562846 1500.8
## + SpendVol 1 309 562938 1500.8
## + RetailKids 1 245 563002 1500.8
## + SecAssets 1 133 563114 1500.8
## - ThemeColl 1 20263 583509 1503.4
## - MarthaHome 1 28067 591314 1505.8
## - CollGifts 1 30843 594090 1506.7
## - BricMortar 1 41239 604486 1509.9
##
## Step: AIC=1497.31
## SpendRat ~ LenRes + SpenVel + CollGifts + BricMortar + MarthaHome +
## ThemeColl + TeenWr + Carlovers + log(Age)
##
## Df Sum of Sq RSS AIC
## - SpenVel 1 1615 566176 1495.8
## - Carlovers 1 2339 566900 1496.1
## - log(Age) 1 3864 568425 1496.6
## - TeenWr 1 5392 569953 1497.1
## <none> 564561 1497.3
## - LenRes 1 8056 572617 1497.9
## + CountryColl 1 1830 562731 1498.7
## + ShortLiq 1 1314 563247 1498.9
## + CustDec 1 1259 563302 1498.9
## + Income 1 1250 563311 1498.9
## + SunAds 1 981 563579 1499.0
## + TotAsset 1 714 563847 1499.1
## + SpendVol 1 183 564377 1499.2
## + RetailKids 1 167 564394 1499.3
## + SecAssets 1 10 564550 1499.3
## + WlthIdx 1 9 564552 1499.3
## + LongLiq 1 7 564554 1499.3
## - ThemeColl 1 21277 585838 1502.1
## - MarthaHome 1 28814 593375 1504.5
## - CollGifts 1 31472 596033 1505.3
## - BricMortar 1 41013 605573 1508.2
##
## Step: AIC=1495.84
## SpendRat ~ LenRes + CollGifts + BricMortar + MarthaHome + ThemeColl +
## TeenWr + Carlovers + log(Age)
##
## Df Sum of Sq RSS AIC
## - Carlovers 1 2068 568244 1494.5
## - log(Age) 1 3221 569396 1494.9
## - TeenWr 1 5137 571313 1495.5
## <none> 566176 1495.8
## - LenRes 1 7585 573761 1496.3
## + CountryColl 1 1805 564371 1497.2
## + SpenVel 1 1615 564561 1497.3
## + Income 1 1010 565166 1497.5
## + CustDec 1 952 565224 1497.5
## + SunAds 1 941 565235 1497.5
## + ShortLiq 1 528 565648 1497.7
## + TotAsset 1 506 565670 1497.7
## + RetailKids 1 95 566081 1497.8
## + LongLiq 1 31 566144 1497.8
## + SecAssets 1 2 566174 1497.8
## + WlthIdx 1 0 566175 1497.8
## + SpendVol 1 0 566176 1497.8
## - ThemeColl 1 21340 587515 1500.6
## - MarthaHome 1 28312 594488 1502.8
## - CollGifts 1 31273 597448 1503.7
## - BricMortar 1 40223 606399 1506.5
##
## Step: AIC=1494.51
## SpendRat ~ LenRes + CollGifts + BricMortar + MarthaHome + ThemeColl +
## TeenWr + log(Age)
##
## Df Sum of Sq RSS AIC
## - log(Age) 1 2829 571073 1493.4
## - TeenWr 1 5385 573629 1494.2
## <none> 568244 1494.5
## - LenRes 1 8080 576324 1495.1
## + Carlovers 1 2068 566176 1495.8
## + CountryColl 1 1629 566615 1496.0
## + SpenVel 1 1344 566900 1496.1
## + SunAds 1 1311 566933 1496.1
## + CustDec 1 1007 567237 1496.2
## + Income 1 640 567604 1496.3
## + ShortLiq 1 526 567718 1496.3
## + TotAsset 1 443 567801 1496.4
## + RetailKids 1 61 568183 1496.5
## + LongLiq 1 26 568218 1496.5
## + SecAssets 1 4 568240 1496.5
## + SpendVol 1 1 568243 1496.5
## + WlthIdx 1 0 568244 1496.5
## - ThemeColl 1 21204 589448 1499.2
## - MarthaHome 1 28441 596685 1501.5
## - CollGifts 1 34085 602329 1503.2
## - BricMortar 1 40162 608406 1505.1
##
## Step: AIC=1493.42
## SpendRat ~ LenRes + CollGifts + BricMortar + MarthaHome + ThemeColl +
## TeenWr
##
## Df Sum of Sq RSS AIC
## - TeenWr 1 4142 575216 1492.8
## <none> 571073 1493.4
## + log(Age) 1 2829 568244 1494.5
## + CountryColl 1 2062 569011 1494.8
## + SunAds 1 1681 569393 1494.9
## + Carlovers 1 1677 569396 1494.9
## + SpenVel 1 816 570257 1495.2
## + Income 1 753 570321 1495.2
## + CustDec 1 661 570412 1495.2
## + ShortLiq 1 628 570446 1495.2
## + TotAsset 1 473 570600 1495.3
## + SpendVol 1 362 570712 1495.3
## + RetailKids 1 342 570731 1495.3
## + WlthIdx 1 37 571037 1495.4
## + SecAssets 1 26 571048 1495.4
## + LongLiq 1 1 571073 1495.4
## - LenRes 1 13686 584759 1495.8
## - ThemeColl 1 21820 592893 1498.3
## - MarthaHome 1 30234 601307 1500.9
## - CollGifts 1 33676 604750 1502.0
## - BricMortar 1 39905 610979 1503.8
##
## Step: AIC=1492.75
## SpendRat ~ LenRes + CollGifts + BricMortar + MarthaHome + ThemeColl
##
## Df Sum of Sq RSS AIC
## <none> 575216 1492.8
## + TeenWr 1 4142 571073 1493.4
## + Carlovers 1 1963 573253 1494.1
## + log(Age) 1 1587 573629 1494.2
## + CountryColl 1 1488 573728 1494.3
## + SunAds 1 986 574230 1494.4
## + ShortLiq 1 943 574273 1494.5
## + SpenVel 1 743 574473 1494.5
## + Income 1 513 574703 1494.6
## + TotAsset 1 338 574878 1494.6
## + CustDec 1 306 574910 1494.7
## + RetailKids 1 196 575020 1494.7
## + SpendVol 1 34 575182 1494.7
## + SecAssets 1 4 575212 1494.8
## + WlthIdx 1 1 575215 1494.8
## + LongLiq 1 1 575215 1494.8
## - LenRes 1 13896 589112 1495.1
## - ThemeColl 1 24503 599719 1498.4
## - MarthaHome 1 30273 605489 1500.2
## - CollGifts 1 34790 610006 1501.6
## - BricMortar 1 51783 626999 1506.6
summary(lm2_step)
##
## Call:
## lm(formula = SpendRat ~ LenRes + CollGifts + BricMortar + MarthaHome +
## ThemeColl, data = catalog)
##
## Residuals:
## Min 1Q Median 3Q Max
## -97.342 -30.315 -7.095 13.601 272.223
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -15.7432 9.6691 -1.628 0.10525
## LenRes 0.8779 0.4233 2.074 0.03955 *
## CollGifts1 30.3194 9.2406 3.281 0.00124 **
## BricMortar1 39.2957 9.8165 4.003 9.17e-05 ***
## MarthaHome1 28.6729 9.3680 3.061 0.00255 **
## ThemeColl1 25.5835 9.2909 2.754 0.00651 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 56.85 on 178 degrees of freedom
## Multiple R-squared: 0.2806, Adjusted R-squared: 0.2604
## F-statistic: 13.88 on 5 and 178 DF, p-value: 1.873e-11
# log(0) is -Inf, so drop any rows with infinite numeric values after the transform
catalogNew = catalog
catalogNew$LogLenRes = log(catalogNew$LenRes)
catalogNew = catalogNew %>%
filter(if_all(where(is.numeric), ~ !is.infinite(.x)))
lm3 = lm(SpendRat ~ LogLenRes + CollGifts + BricMortar + MarthaHome + ThemeColl, data = catalogNew)
summary(lm3)
##
## Call:
## lm(formula = SpendRat ~ LogLenRes + CollGifts + BricMortar +
## MarthaHome + ThemeColl, data = catalogNew)
##
## Residuals:
## Min 1Q Median 3Q Max
## -92.34 -30.12 -7.22 12.87 273.68
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -27.690 16.556 -1.673 0.096188 .
## LogLenRes 10.178 5.970 1.705 0.089989 .
## CollGifts1 30.677 9.351 3.281 0.001248 **
## BricMortar1 38.977 9.922 3.928 0.000122 ***
## MarthaHome1 28.193 9.439 2.987 0.003219 **
## ThemeColl1 25.237 9.377 2.691 0.007798 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 57.22 on 177 degrees of freedom
## Multiple R-squared: 0.2728, Adjusted R-squared: 0.2522
## F-statistic: 13.28 on 5 and 177 DF, p-value: 5.457e-11
catalogNew = catalog
catalogNew$SqLenRes = (catalogNew$LenRes)^2
lm4 = lm(SpendRat ~ SqLenRes + CollGifts + BricMortar + MarthaHome + ThemeColl, data = catalogNew)
summary(lm4)
##
## Call:
## lm(formula = SpendRat ~ SqLenRes + CollGifts + BricMortar + MarthaHome +
## ThemeColl, data = catalogNew)
##
## Residuals:
## Min 1Q Median 3Q Max
## -96.825 -31.244 -7.079 13.609 274.786
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -9.57649 7.96444 -1.202 0.23080
## SqLenRes 0.02128 0.01014 2.098 0.03735 *
## CollGifts1 30.05467 9.23542 3.254 0.00136 **
## BricMortar1 39.17983 9.81364 3.992 9.55e-05 ***
## MarthaHome1 28.81888 9.36695 3.077 0.00242 **
## ThemeColl1 25.90091 9.29400 2.787 0.00590 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 56.83 on 178 degrees of freedom
## Multiple R-squared: 0.281, Adjusted R-squared: 0.2608
## F-statistic: 13.91 on 5 and 178 DF, p-value: 1.787e-11
After stepwise selection we were left with a five-variable model containing one numeric predictor, LenRes. Transforming that predictor did not meaningfully improve the fit: the adjusted R-squared was 0.2604 with LenRes, 0.2522 with log(LenRes), and 0.2608 with LenRes squared.
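As a sanity check on the step() trace, the AIC values it prints for a linear model can be reproduced from the residual sum of squares as n * log(RSS / n) + 2p. The sketch below assumes n = 184 observations (178 residual degrees of freedom plus 6 coefficients in the final summary) and plugs in the RSS of the final model. The helper name `step_aic` is introduced here for illustration only.

```r
# AIC as reported by step()/extractAIC() for an lm: n * log(RSS / n) + 2 * p,
# where p counts the estimated coefficients (including the intercept).
step_aic <- function(rss, n, p) {
  n * log(rss / n) + 2 * p
}

n_obs <- 184   # assumption: 178 residual df + 6 coefficients
aic_final <- step_aic(rss = 575216, n = n_obs, p = 6)
round(aic_final, 2)   # compare with "Step: AIC=1492.75" in the trace

# The same formula agrees with extractAIC() on a built-in data set.
fit <- lm(mpg ~ wt + hp, data = mtcars)
manual <- step_aic(sum(residuals(fit)^2), nrow(mtcars), length(coef(fit)))
all.equal(manual, extractAIC(fit)[2])
```

Because every candidate model in the trace is fit to the same rows, differences in this AIC depend only on the change in RSS and the 2-per-coefficient penalty.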
#par(mfrow = c(2,2))
plot(lm2_step)
#par(mfrow = c(2,2))
plot(lm3)
#par(mfrow = c(2,2))
plot(lm4)
The Q-Q plots suggest that the residuals are not normally distributed, although it seems this could be improved by trimming the tails. The residual plots identify several large outliers, and the leverage plots identify several observations with unusually high leverage.
Removing the outliers
cooks = cooks.distance(lm4)
inf_cooks = cooks > .01
catalogNew = catalogNew[-inf_cooks,]
lm4 = lm(SpendRat ~ SqLenRes + CollGifts + BricMortar + MarthaHome + ThemeColl, data = catalogNew)
summary(lm4)
##
## Call:
## lm(formula = SpendRat ~ SqLenRes + CollGifts + BricMortar + MarthaHome +
## ThemeColl, data = catalogNew)
##
## Residuals:
## Min 1Q Median 3Q Max
## -97.22 -30.34 -6.88 13.61 274.23
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -9.28172 7.98410 -1.163 0.24659
## SqLenRes 0.02085 0.01017 2.050 0.04184 *
## CollGifts1 29.53890 9.27264 3.186 0.00171 **
## BricMortar1 39.76420 9.85693 4.034 8.14e-05 ***
## MarthaHome1 29.40524 9.41138 3.124 0.00208 **
## ThemeColl1 25.75667 9.30756 2.767 0.00625 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 56.9 on 177 degrees of freedom
## Multiple R-squared: 0.2826, Adjusted R-squared: 0.2623
## F-statistic: 13.94 on 5 and 177 DF, p-value: 1.73e-11
#par(mfrow = c(2,2))
plot(lm4)
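Refitting after this step raised the adjusted R-squared only slightly (0.2608 to 0.2623), not enough to make the model useful. The residual degrees of freedom fell from 178 to 177, so only one row was actually removed: negating the logical vector inf_cooks yields indices of 0 and -1, which drops the first row rather than the flagged rows.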
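The sketch below contrasts indexing with `-flag` and with `!flag` when `flag` is a logical vector of Cook's distance flags. It uses a small made-up data frame: `toy` and the 4/n cutoff are illustrative assumptions, not the catalog data or the 0.01 threshold used above.

```r
# Toy data: a straight line plus small noise, with one gross outlier in row 20.
toy <- data.frame(x = 1:20, y = 2 * (1:20) + rep(c(-1, 1), 10))
toy$y[20] <- 100

fit <- lm(y ~ x, data = toy)
flag <- cooks.distance(fit) > 4 / nrow(toy)   # logical vector, one entry per row

# Unary minus turns TRUE/FALSE into -1/0, so this drops row 1 only.
wrong <- toy[-flag, ]

# Logical negation keeps exactly the rows that were not flagged.
right <- toy[!flag, ]

nrow(wrong)       # one row fewer than toy, and row 20 is still present
20 %in% right$x   # FALSE: the flagged outlier is gone
```

Applied to the catalog model, the logical-mask form would presumably remove more than one observation, so the refitted coefficients would differ from those printed above.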
Read in adult data
adult = read_csv("adult.csv", na = "?")
## Rows: 32561 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (9): work_class, education, marital_status, occupation, relationship, ra...
## dbl (6): age, wgt, education_num, capital_gain, capital_loss, hours_per_week
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
ggplot(adult, aes(x = age, y = income)) +
geom_boxplot() +
theme_minimal() +
ggtitle("Income increases with age")
Age and income seem to be strongly associated.
ggplot(adult, aes(fill = income, x = work_class)) +
geom_bar(position = "fill") +
theme_minimal() +
scale_x_discrete(guide = guide_axis(n.dodge = 2)) +
scale_y_continuous(labels = scales::percent) +
ggtitle("Income by work_class")
Some categories of work class seem influential, but overall the
association does not seem as strong as for some of the other predictors.
ggplot(adult, aes(x = wgt, y = income)) +
geom_boxplot() +
theme_minimal() +
ggtitle("Weight seems to be unrelated to income")
ggplot(adult, aes(fill = income, x = education)) +
geom_bar(position = "fill") +
theme_minimal() +
scale_x_discrete(guide = guide_axis(n.dodge = 2)) +
scale_y_continuous(labels = scales::percent) +
ggtitle("More education is correlated with higher income")
Education and income seem to be strongly associated.
ggplot(adult, aes(x = education_num, y = income)) +
geom_boxplot() +
theme_minimal() +
ggtitle("More education is correlated with higher income")
ggplot(adult, aes(fill = income, x = marital_status)) +
geom_bar(position = "fill") +
theme_minimal() +
scale_x_discrete(guide = guide_axis(n.dodge = 2)) +
scale_y_continuous(labels = scales::percent) +
ggtitle("Marital status is correlated with income")
Marital status and income seem to be strongly associated.
ggplot(adult, aes(fill = income, x = occupation)) +
geom_bar(position = "fill") +
theme_minimal() +
scale_x_discrete(guide = guide_axis(n.dodge = 4)) +
scale_y_continuous(labels = scales::percent) +
ggtitle("Occupation is correlated with income")
Occupation and income seem to be strongly associated.
ggplot(adult, aes(fill = income, x = relationship)) +
geom_bar(position = "fill") +
theme_minimal() +
scale_x_discrete(guide = guide_axis(n.dodge = 2)) +
scale_y_continuous(labels = scales::percent) +
ggtitle("Relationship is correlated with income")
Relationship and income seem to be strongly associated.
ggplot(adult, aes(fill = income, x = race)) +
geom_bar(position = "fill") +
theme_minimal() +
scale_x_discrete(guide = guide_axis(n.dodge = 2)) +
scale_y_continuous(labels = scales::percent) +
ggtitle("Race is correlated with income")
Race is associated with income, but not as dramatically as some other
predictors.
ggplot(adult, aes(fill = income, x = sex)) +
geom_bar(position = "fill") +
theme_minimal() +
scale_x_discrete(guide = guide_axis(n.dodge = 2)) +
scale_y_continuous(labels = scales::percent) +
ggtitle("Sex is correlated with income")
Sex is strongly associated with income.
ggplot(adult, aes(x = capital_gain, y = income)) +
geom_boxplot() +
theme_minimal() +
ggtitle("Capital gain is correlated with income")
ggplot(adult, aes(x = capital_loss, y = income)) +
geom_boxplot() +
theme_minimal() +
ggtitle("Capital loss is correlated with income")
ggplot(adult, aes(x = hours_per_week, y = income)) +
geom_boxplot() +
theme_minimal() +
ggtitle("Hours per week is correlated with income")
ggplot(adult, aes(fill = income, x = native_country)) +
geom_bar(position = "fill") +
theme_minimal() +
scale_x_discrete(guide = guide_axis(n.dodge = 8)) +
scale_y_continuous(labels = scales::percent) +
ggtitle("Native country is correlated with income")
adult$income = as.factor(adult$income)
glm1 = glm(income ~ age + education + occupation + relationship + sex + capital_gain + capital_loss + hours_per_week, data = adult, family = "binomial")
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(glm1)
##
## Call:
## glm(formula = income ~ age + education + occupation + relationship +
## sex + capital_gain + capital_loss + hours_per_week, family = "binomial",
## data = adult)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.086e+00 2.121e-01 -23.979 < 2e-16 ***
## age 2.719e-02 1.592e-03 17.079 < 2e-16 ***
## education11th 8.283e-02 2.125e-01 0.390 0.696659
## education12th 3.719e-01 2.695e-01 1.380 0.167556
## education1st-4th -6.384e-01 4.793e-01 -1.332 0.182897
## education5th-6th -5.030e-01 3.316e-01 -1.517 0.129264
## education7th-8th -6.023e-01 2.370e-01 -2.541 0.011048 *
## education9th -2.597e-01 2.650e-01 -0.980 0.327049
## educationAssoc-acdm 1.320e+00 1.775e-01 7.439 1.02e-13 ***
## educationAssoc-voc 1.295e+00 1.710e-01 7.577 3.54e-14 ***
## educationBachelors 1.887e+00 1.587e-01 11.888 < 2e-16 ***
## educationDoctorate 2.772e+00 2.148e-01 12.903 < 2e-16 ***
## educationHS-grad 7.815e-01 1.547e-01 5.052 4.38e-07 ***
## educationMasters 2.190e+00 1.688e-01 12.980 < 2e-16 ***
## educationPreschool -1.720e+01 9.261e+01 -0.186 0.852636
## educationProf-school 2.673e+00 2.021e-01 13.227 < 2e-16 ***
## educationSome-college 1.132e+00 1.569e-01 7.211 5.55e-13 ***
## occupationArmed-Forces -5.889e-01 1.495e+00 -0.394 0.693626
## occupationCraft-repair -1.411e-02 7.793e-02 -0.181 0.856348
## occupationExec-managerial 7.206e-01 7.495e-02 9.614 < 2e-16 ***
## occupationFarming-fishing -1.322e+00 1.358e-01 -9.735 < 2e-16 ***
## occupationHandlers-cleaners -7.443e-01 1.407e-01 -5.290 1.23e-07 ***
## occupationMachine-op-inspct -3.482e-01 9.981e-02 -3.489 0.000486 ***
## occupationOther-service -9.602e-01 1.156e-01 -8.303 < 2e-16 ***
## occupationPriv-house-serv -3.833e+00 1.780e+00 -2.153 0.031334 *
## occupationProf-specialty 4.210e-01 7.939e-02 5.303 1.14e-07 ***
## occupationProtective-serv 4.173e-01 1.188e-01 3.513 0.000443 ***
## occupationSales 1.995e-01 7.938e-02 2.513 0.011968 *
## occupationTech-support 6.209e-01 1.089e-01 5.700 1.20e-08 ***
## occupationTransport-moving -1.967e-01 9.766e-02 -2.015 0.043956 *
## relationshipNot-in-family -1.867e+00 5.669e-02 -32.926 < 2e-16 ***
## relationshipOther-relative -2.031e+00 1.941e-01 -10.464 < 2e-16 ***
## relationshipOwn-child -2.972e+00 1.414e-01 -21.017 < 2e-16 ***
## relationshipUnmarried -1.862e+00 9.821e-02 -18.962 < 2e-16 ***
## relationshipWife 1.268e+00 1.027e-01 12.349 < 2e-16 ***
## sexMale 8.342e-01 7.795e-02 10.702 < 2e-16 ***
## capital_gain 3.220e-04 1.055e-05 30.514 < 2e-16 ***
## capital_loss 6.463e-04 3.759e-05 17.193 < 2e-16 ***
## hours_per_week 2.973e-02 1.644e-03 18.084 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 34483 on 30717 degrees of freedom
## Residual deviance: 20232 on 30679 degrees of freedom
## (1843 observations deleted due to missingness)
## AIC: 20310
##
## Number of Fisher Scoring iterations: 12
glmtoolbox::adjR2(glm1)
## [1] 0.41255
The adjusted R-squared is 0.41255.
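For reference, this value can be reproduced by hand from the deviances in summary(glm1), assuming glmtoolbox::adjR2 reports the adjusted deviance-based R-squared, 1 - (residual deviance / residual df) / (null deviance / null df). The helper `adj_dev_r2` is a name made up for this sketch, and the deviances are the rounded values printed above.

```r
# Adjusted deviance-based R-squared:
# 1 - (residual deviance / residual df) / (null deviance / null df)
adj_dev_r2 <- function(res_dev, res_df, null_dev, null_df) {
  1 - (res_dev / res_df) / (null_dev / null_df)
}

# Values copied from the printed summary(glm1) output (rounded deviances).
r2_glm1 <- adj_dev_r2(res_dev = 20232, res_df = 30679,
                      null_dev = 34483, null_df = 30717)
round(r2_glm1, 4)   # compare with the 0.41255 reported above

# The same quantities can be pulled directly from any fitted glm object.
toy_fit <- glm(am ~ wt, data = mtcars, family = binomial)
adj_dev_r2(deviance(toy_fit), df.residual(toy_fit),
           toy_fit$null.deviance, toy_fit$df.null)
```

Unlike the R-squared of a linear model, this measures the proportional reduction in deviance per degree of freedom, so it is not directly comparable to the catalog models' adjusted R-squared.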
car::vif(glm1)
## GVIF Df GVIF^(1/(2*Df))
## age 1.108509 1 1.052858
## education 1.758439 15 1.018992
## occupation 2.001138 13 1.027041
## relationship 3.072433 5 1.118789
## sex 2.785575 1 1.669004
## capital_gain 1.028287 1 1.014045
## capital_loss 1.010351 1 1.005162
## hours_per_week 1.108306 1 1.052761
Odds ratio for hours_per_week
exp(2.973e-02)
## [1] 1.030176
For every additional hour worked per week, the odds of making over 50K are multiplied by about 1.03 (roughly a 3% increase), holding the other predictors constant.
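A short sketch of the arithmetic behind this interpretation, using the hours_per_week coefficient from the summary above. The 10-hour comparison is an illustrative choice added here, not part of the original analysis.

```r
# Coefficient for hours_per_week, copied from summary(glm1) (log-odds scale).
b_hours <- 2.973e-02

# Odds ratio for one additional hour per week.
or_1h <- exp(b_hours)

# The same effect expressed as a percentage change in the odds.
pct_1h <- (or_1h - 1) * 100

# Odds ratios multiply, so a 10-hour difference corresponds to exp(10 * b).
or_10h <- exp(10 * b_hours)

round(c(or_1h = or_1h, pct_1h = pct_1h, or_10h = or_10h), 3)
```

The effect compounds multiplicatively: ten extra hours multiplies the odds by about 1.35, not by 1.30.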
6. Comment on the results obtained. How well do these models fit the data? Can we use any of them to predict the spending ratio?
None of the models achieved an adjusted R-squared above 0.3, and we would want a value closer to 0.7 before using a model to make predictions. Overall, the models do not fit the data well.
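The comparison can be checked from the printed summaries alone. The sketch below recomputes the adjusted R-squared from each model's multiple R-squared, assuming n equals the residual degrees of freedom plus 6 coefficients. The R-squared inputs are the rounded values printed above, so the results match only to about three decimals, and `adj_r2` and `models` are names introduced for this example.

```r
# Adjusted R-squared from R-squared, sample size n, and number of predictors p.
adj_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

# Multiple R-squared values copied from the three summaries above.
models <- data.frame(
  model = c("lm2_step (LenRes)", "lm3 (log LenRes)", "lm4 (LenRes^2)"),
  r2    = c(0.2806, 0.2728, 0.2810),
  n     = c(184, 183, 184),   # assumption: residual df + 6 coefficients
  p     = 5
)
models$adj_r2 <- adj_r2(models$r2, models$n, models$p)

models
any(models$adj_r2 > 0.3)   # FALSE: no model clears the 0.3 mark
```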