library(caret)
Warning: package ‘caret’ was built under R version 4.4.2Loading required package: lattice
Registered S3 method overwritten by 'data.table':
method from
print.data.table
Attaching package: ‘caret’
The following object is masked from ‘package:purrr’:
lift
Reading in the dataset
df = read_csv("catalog.csv")
Rows: 200 Columns: 21── Column specification ─────────────────────────────────────────
Delimiter: ","
dbl (21): SpendRat, Age, LenRes, Income, TotAsset, SecAssets,...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
str(df)
spc_tbl_ [200 × 21] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ SpendRat : num [1:200] 11.8 16.8 11.4 31.3 1.9 ...
$ Age : num [1:200] 0 35 46 41 46 46 46 56 48 54 ...
$ LenRes : num [1:200] 2 3 9 2 7 15 16 31 8 8 ...
$ Income : num [1:200] 3 5 5 2 9 5 4 6 5 5 ...
$ TotAsset : num [1:200] 122 195 123 117 493 138 162 117 119 50 ...
$ SecAssets : num [1:200] 27 36 24 25 105 27 25 27 23 10 ...
$ ShortLiq : num [1:200] 225 220 200 222 310 340 230 300 250 200 ...
$ LongLiq : num [1:200] 422 420 420 419 500 450 430 440 430 420 ...
$ WlthIdx : num [1:200] 286 430 290 279 520 440 360 400 360 230 ...
$ SpendVol : num [1:200] 503 690 600 543 680 440 690 500 610 660 ...
$ SpenVel : num [1:200] 285 570 280 308 100 50 180 10 0 0 ...
$ CollGifts : num [1:200] 1 0 1 1 0 0 1 1 1 0 ...
$ BricMortar : num [1:200] 0 1 0 0 1 1 0 1 0 1 ...
$ MarthaHome : num [1:200] 0 1 0 0 1 1 0 1 1 0 ...
$ SunAds : num [1:200] 1 0 1 1 0 0 1 0 0 0 ...
$ ThemeColl : num [1:200] 0 0 1 1 0 0 0 1 1 0 ...
$ CustDec : num [1:200] 1 1 1 0 1 1 0 1 1 0 ...
$ RetailKids : num [1:200] 1 1 1 0 0 0 0 1 0 0 ...
$ TeenWr : num [1:200] 1 0 0 0 0 0 0 1 0 1 ...
$ Carlovers : num [1:200] 0 0 0 0 0 1 0 1 0 0 ...
$ CountryColl: num [1:200] 1 0 1 1 0 0 1 0 1 0 ...
- attr(*, "spec")=
.. cols(
.. SpendRat = col_double(),
.. Age = col_double(),
.. LenRes = col_double(),
.. Income = col_double(),
.. TotAsset = col_double(),
.. SecAssets = col_double(),
.. ShortLiq = col_double(),
.. LongLiq = col_double(),
.. WlthIdx = col_double(),
.. SpendVol = col_double(),
.. SpenVel = col_double(),
.. CollGifts = col_double(),
.. BricMortar = col_double(),
.. MarthaHome = col_double(),
.. SunAds = col_double(),
.. ThemeColl = col_double(),
.. CustDec = col_double(),
.. RetailKids = col_double(),
.. TeenWr = col_double(),
.. Carlovers = col_double(),
.. CountryColl = col_double()
.. )
- attr(*, "problems")=<externalptr>
#Data Cleaning The purpose of this code chunks below is to make a cleaner more coherent variables to ensure that the information can be interpreted easier. Some of the changes that were made were that of making the dummy variables into a factor variable, because they should not be a numeric value.The sole purpose of a dummy variable is to represent whether or not something occurs in an observation or not, with no real numeric representation of anything. The other changes that were made to this data set was making Income variable a factor variable because it is ordinal. Observations that were taken out were based on the notion of people being \(18\) or older and if a person did not meet that criteria then that person was removed. The last instance of observations being removed from the dataset was if LenRes was greater than Age which is impossible, because a person cannot be living in a residential area for longer than they have been alive.
Making integer variables into factor variables.
df$CollGifts = as.factor(df$CollGifts)
df$BricMortar = as.factor(df$BricMortar)
df$MarthaHome = as.factor(df$MarthaHome)
df$SunAds = as.factor(df$SunAds)
df$ThemeColl = as.factor(df$ThemeColl)
df$CustDec = as.factor(df$CustDec)
df$RetailKids = as.factor(df$RetailKids)
df$TeenWr = as.factor(df$TeenWr)
df$Carlovers = as.factor(df$Carlovers)
df$CountryColl = as.factor(df$CountryColl)
df$Income = as.factor(df$Income)
Age should not be less than 18 and the Length of a resident can not be greater than their age.
df1 = df[df$Age >= 18,]
df1 = df[df$LenRes < df$Age,]
Univariate Statistics Age and LenRes are going to be the median, while all the other variables are going to be the average. The factor variables we are going to disregard.
describe(df1)
df1|>
ggplot(aes(SpendRat))+
geom_boxplot(outlier.colour = "red", outlier.size = 1)+
labs(title = "Boxplot of Spending Ratio")+
theme_minimal()
ggpairs(df1, columns = c(1:5), title = "Scatter Plot Matrix for Catalog Spending", axisLabels = "show")
ggpairs(df1, columns = c(5:11), title = "Scatter Plot Matrix for Catalog Spending")
ggpairs(df, columns = c(12:17), title = "Scatter Plot Matrix for Catalog Spending")
ggpairs(df, columns = c(18:21), title = "Scatter Plot Matrix for Catalog Spending")
Spending Ratio as the response variable and all the other predictors is the formula in this given lm function. What we can see in these coefficient is that as Spending Ratio increase by \(1\) then Age increases by \(0.36\). The variables that are statistically significant, that is the alternative hypothesis is\(B_1\neq 0\). Simply put means that their is a relationship between the response variable and the predictor variable. From the given output of this formula the only statistically significant predictors are BricMortar1,MarthaHome1, ThemeColl1, these variables are dummy variables. If BricMortar occurs it increases by \(35.96\) by a one unit increase in SpendingRatio. If MarthaHome occurs it increases by \(28.14\) when a one unit increase in SpendingRatio occurs. Likewise, if ThemeColl occurs it increases by \(22.94\) with one unit increase in SpendingRatio. The model has a p-value of \(8.398e-05\) which is increadibly small, which we can reject the null hypthesis which states that model has found at least one predictor that has a relationship with the response variable. The Adjusted R-Squared is \(.19\) which is incredibly small and tells us that this model may not be the best fit for this linear model.
lm.fit = lm(SpendRat ~ .,data = df1)
summary(lm.fit)
Call:
lm(formula = SpendRat ~ ., data = df1)
Residuals:
Min 1Q Median 3Q Max
-107.220 -30.190 -5.462 15.617 276.820
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -4.20057 194.20726 -0.022 0.98277
Age 0.36028 0.43090 0.836 0.40437
LenRes 0.77997 0.51065 1.527 0.12868
Income2 -3.84984 45.84860 -0.084 0.93319
Income3 -19.04629 46.03617 -0.414 0.67964
Income4 -8.37050 45.05843 -0.186 0.85287
Income5 -16.52870 45.17640 -0.366 0.71496
Income6 -4.19143 46.22194 -0.091 0.92786
Income7 -12.56324 48.24412 -0.260 0.79489
Income8 -30.87936 77.51184 -0.398 0.69089
Income9 -58.69974 77.27794 -0.760 0.44864
TotAsset -0.02744 0.08812 -0.311 0.75595
SecAssets 0.12072 0.27318 0.442 0.65916
ShortLiq 0.13201 0.14439 0.914 0.36197
LongLiq -0.13223 0.48577 -0.272 0.78583
WlthIdx -0.03116 0.12468 -0.250 0.80301
SpendVol 0.01216 0.04575 0.266 0.79084
SpenVel 0.02785 0.02843 0.980 0.32883
CollGifts1 22.69078 12.52795 1.811 0.07203 .
BricMortar1 35.96048 11.37754 3.161 0.00189 **
MarthaHome1 28.14030 10.97695 2.564 0.01130 *
SunAds1 -1.01572 13.62597 -0.075 0.94067
ThemeColl1 22.94005 10.79143 2.126 0.03510 *
CustDec1 9.85642 12.26606 0.804 0.42288
RetailKids1 -6.03001 11.50007 -0.524 0.60078
TeenWr1 12.20007 10.16173 1.201 0.23173
Carlovers1 8.13114 10.53941 0.771 0.44158
CountryColl1 7.63132 14.43239 0.529 0.59772
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 59.22 on 156 degrees of freedom
Multiple R-squared: 0.3158, Adjusted R-squared: 0.1974
F-statistic: 2.667 on 27 and 156 DF, p-value: 8.398e-05
Using a stepwise function to produce the most statistically significant variables for predicting SpendingRatio. All the variables within this model are statistically significant, but the predictor that is the most significant is BricMortar. This model’s p-value indicates that there lies a relationship between the predictors and the response variable. Adjusted R-Squared shows that this model is probably not the best fit for this regression problem.
step.lm = step(lm.fit, direction = c("both"),trace = 0)
summary(step.lm)
Call:
lm(formula = SpendRat ~ LenRes + CollGifts + BricMortar + MarthaHome +
ThemeColl, data = df1)
Residuals:
Min 1Q Median 3Q Max
-97.342 -30.315 -7.095 13.601 272.223
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -15.7432 9.6691 -1.628 0.10525
LenRes 0.8779 0.4233 2.074 0.03955 *
CollGifts1 30.3194 9.2406 3.281 0.00124 **
BricMortar1 39.2957 9.8165 4.003 9.17e-05 ***
MarthaHome1 28.6729 9.3680 3.061 0.00255 **
ThemeColl1 25.5835 9.2909 2.754 0.00651 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 56.85 on 178 degrees of freedom
Multiple R-squared: 0.2806, Adjusted R-squared: 0.2604
F-statistic: 13.88 on 5 and 178 DF, p-value: 1.873e-11
Residual vs Fitted Diagnostic Plot appears to be a horizontal line, which means the residuals follow a linear pattern. Q-Q Plot looks to follow normal distribution. Scale-Location plot looks to be a horizontal line, implying that there is homoscedasticity. Residual vs Leverage plot is looking for influential observation, which does not appear to be any.
par(mfrow = c(2,2))
plot(step.lm)
colSums(is.na(df1))
SpendRat Age LenRes Income TotAsset
0 0 0 0 0
SecAssets ShortLiq LongLiq WlthIdx SpendVol
0 0 0 0 0
SpenVel CollGifts BricMortar MarthaHome SunAds
0 0 0 0 0
ThemeColl CustDec RetailKids TeenWr Carlovers
0 0 0 0 0
CountryColl
0
sapply(df1,function(x) sum(is.infinite(x)))
SpendRat Age LenRes Income TotAsset
0 0 0 0 0
SecAssets ShortLiq LongLiq WlthIdx SpendVol
0 0 0 0 0
SpenVel CollGifts BricMortar MarthaHome SunAds
0 0 0 0 0
ThemeColl CustDec RetailKids TeenWr Carlovers
0 0 0 0 0
CountryColl
0
sapply(df1,function(x) sum(is.nan(x)))
SpendRat Age LenRes Income TotAsset
0 0 0 0 0
SecAssets ShortLiq LongLiq WlthIdx SpendVol
0 0 0 0 0
SpenVel CollGifts BricMortar MarthaHome SunAds
0 0 0 0 0
ThemeColl CustDec RetailKids TeenWr Carlovers
0 0 0 0 0
CountryColl
0
trans.sqrt.lm = lm(SpendRat ~ + sqrt(Age) + sqrt(LenRes) + Income + sqrt(TotAsset) + sqrt(SecAssets) + sqrt(ShortLiq) + sqrt(WlthIdx) + sqrt(SpendVol) + sqrt(SpenVel) + CollGifts + BricMortar + MarthaHome + SunAds + ThemeColl + CustDec + RetailKids + TeenWr + Carlovers + CountryColl, data = df1)
summary(trans.sqrt.lm)
Call:
lm(formula = SpendRat ~ +sqrt(Age) + sqrt(LenRes) + Income +
sqrt(TotAsset) + sqrt(SecAssets) + sqrt(ShortLiq) + sqrt(WlthIdx) +
sqrt(SpendVol) + sqrt(SpenVel) + CollGifts + BricMortar +
MarthaHome + SunAds + ThemeColl + CustDec + RetailKids +
TeenWr + Carlovers + CountryColl, data = df1)
Residuals:
Min 1Q Median 3Q Max
-108.412 -30.805 -7.869 17.013 275.893
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -112.4946 97.6907 -1.152 0.25126
sqrt(Age) 4.7821 6.2689 0.763 0.44671
sqrt(LenRes) 5.1060 3.8752 1.318 0.18956
Income2 -7.4717 45.8054 -0.163 0.87063
Income3 -19.7003 45.9661 -0.429 0.66881
Income4 -9.8178 45.0062 -0.218 0.82760
Income5 -17.7502 44.9835 -0.395 0.69368
Income6 -6.8964 45.8841 -0.150 0.88072
Income7 -13.4174 48.0385 -0.279 0.78038
Income8 -29.9430 77.8476 -0.385 0.70103
Income9 -57.1757 76.9249 -0.743 0.45843
sqrt(TotAsset) -3.6373 3.2618 -1.115 0.26650
sqrt(SecAssets) 3.0217 3.5800 0.844 0.39993
sqrt(ShortLiq) 2.6235 4.3461 0.604 0.54695
sqrt(WlthIdx) 1.6918 5.0082 0.338 0.73596
sqrt(SpendVol) 0.5051 1.4616 0.346 0.73013
sqrt(SpenVel) 0.6871 0.7347 0.935 0.35115
CollGifts1 23.8873 12.4799 1.914 0.05743 .
BricMortar1 35.8156 11.2314 3.189 0.00172 **
MarthaHome1 27.0611 10.7985 2.506 0.01323 *
SunAds1 -2.3202 13.4741 -0.172 0.86350
ThemeColl1 22.9131 10.7287 2.136 0.03426 *
CustDec1 7.7378 12.1745 0.636 0.52598
RetailKids1 -4.1634 11.3745 -0.366 0.71483
TeenWr1 11.9090 10.1690 1.171 0.24333
Carlovers1 7.3275 10.4793 0.699 0.48544
CountryColl1 6.9159 14.1623 0.488 0.62600
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 59.11 on 157 degrees of freedom
Multiple R-squared: 0.3138, Adjusted R-squared: 0.2002
F-statistic: 2.762 on 26 and 157 DF, p-value: 5.682e-05
step(trans.sqrt.lm, direction = "both", trace = 0)|>
summary()
Call:
lm(formula = SpendRat ~ sqrt(LenRes) + CollGifts + BricMortar +
MarthaHome + ThemeColl, data = df1)
Residuals:
Min 1Q Median 3Q Max
-95.47 -29.85 -6.68 13.96 272.22
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -26.015 14.322 -1.816 0.07099 .
sqrt(LenRes) 6.391 3.338 1.915 0.05713 .
CollGifts1 30.687 9.265 3.312 0.00112 **
BricMortar1 39.419 9.834 4.008 8.98e-05 ***
MarthaHome1 28.576 9.384 3.045 0.00268 **
ThemeColl1 25.221 9.304 2.711 0.00737 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 56.95 on 178 degrees of freedom
Multiple R-squared: 0.2781, Adjusted R-squared: 0.2578
F-statistic: 13.71 on 5 and 178 DF, p-value: 2.52e-11
Diagnostic Plots look to meet assumption!
par(mfrow = c(2,2))
step(trans.sqrt.lm, direction = "both", trace = 0)|>
plot()
trans.x2.lm = lm(SpendRat ~ + I(Age^2) + I(LenRes^2) + Income + I(TotAsset^2) + I(SecAssets^2) + I(ShortLiq^2) + I(WlthIdx^2) + I(SpendVol^2) + I(SpenVel^2) + CollGifts + BricMortar + MarthaHome + SunAds + ThemeColl + CustDec + RetailKids + TeenWr + Carlovers + CountryColl, data = df1)
summary(trans.x2.lm)
Call:
lm(formula = SpendRat ~ +I(Age^2) + I(LenRes^2) + Income + I(TotAsset^2) +
I(SecAssets^2) + I(ShortLiq^2) + I(WlthIdx^2) + I(SpendVol^2) +
I(SpenVel^2) + CollGifts + BricMortar + MarthaHome + SunAds +
ThemeColl + CustDec + RetailKids + TeenWr + Carlovers + CountryColl,
data = df1)
Residuals:
Min 1Q Median 3Q Max
-103.887 -31.899 -6.159 16.207 276.596
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.910e+01 5.192e+01 -0.368 0.7135
I(Age^2) 2.285e-03 3.650e-03 0.626 0.5322
I(LenRes^2) 1.792e-02 1.206e-02 1.486 0.1394
Income2 -1.941e-01 4.571e+01 -0.004 0.9966
Income3 -1.696e+01 4.609e+01 -0.368 0.7134
Income4 -4.227e+00 4.511e+01 -0.094 0.9255
Income5 -1.334e+01 4.528e+01 -0.295 0.7687
Income6 -1.742e+00 4.597e+01 -0.038 0.9698
Income7 -6.670e+00 4.844e+01 -0.138 0.8907
Income8 -2.675e+01 7.680e+01 -0.348 0.7281
Income9 -6.127e+01 7.700e+01 -0.796 0.4274
I(TotAsset^2) -9.757e-06 5.538e-05 -0.176 0.8604
I(SecAssets^2) 1.259e-05 8.296e-05 0.152 0.8795
I(ShortLiq^2) 7.449e-05 1.111e-04 0.671 0.5034
I(WlthIdx^2) 7.917e-06 1.317e-04 0.060 0.9521
I(SpendVol^2) -1.414e-05 5.166e-05 -0.274 0.7846
I(SpenVel^2) 4.534e-05 2.898e-05 1.565 0.1196
CollGifts1 2.285e+01 1.248e+01 1.831 0.0690 .
BricMortar1 3.659e+01 1.125e+01 3.253 0.0014 **
MarthaHome1 2.718e+01 1.096e+01 2.479 0.0142 *
SunAds1 5.175e-01 1.343e+01 0.039 0.9693
ThemeColl1 2.207e+01 1.075e+01 2.052 0.0418 *
CustDec1 1.180e+01 1.208e+01 0.977 0.3299
RetailKids1 -7.596e+00 1.139e+01 -0.667 0.5057
TeenWr1 1.264e+01 9.981e+00 1.267 0.2071
Carlovers1 8.481e+00 1.048e+01 0.809 0.4195
CountryColl1 6.433e+00 1.427e+01 0.451 0.6528
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 58.85 on 157 degrees of freedom
Multiple R-squared: 0.3199, Adjusted R-squared: 0.2073
F-statistic: 2.841 on 26 and 157 DF, p-value: 3.499e-05
step(trans.x2.lm, direction = "both", trace = 0)|>
summary()
Call:
lm(formula = SpendRat ~ I(LenRes^2) + CollGifts + BricMortar +
MarthaHome + ThemeColl, data = df1)
Residuals:
Min 1Q Median 3Q Max
-96.825 -31.244 -7.079 13.609 274.786
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -9.57649 7.96444 -1.202 0.23080
I(LenRes^2) 0.02128 0.01014 2.098 0.03735 *
CollGifts1 30.05467 9.23542 3.254 0.00136 **
BricMortar1 39.17983 9.81364 3.992 9.55e-05 ***
MarthaHome1 28.81888 9.36695 3.077 0.00242 **
ThemeColl1 25.90091 9.29400 2.787 0.00590 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 56.83 on 178 degrees of freedom
Multiple R-squared: 0.281, Adjusted R-squared: 0.2608
F-statistic: 13.91 on 5 and 178 DF, p-value: 1.787e-11
Diagnostic Plots appear to meet all assumptions!
par(mfrow = c(2,2))
step(trans.x2.lm, direction = "both", trace = 0)|>
plot()
The linear models that were run with various transformations yielded poor results for interpretation, but that does not mean that they yielded poor results for prediction. We absolutely can use these models to predict the spending ratio, now we do not know how “good” at predicting they will be, but nonetheless they still can be used.
adult = read_csv("adult.csv", na = "?")
Rows: 32561 Columns: 15── Column specification ────────────────────────────────────────
Delimiter: ","
chr (9): work_class, education, marital_status, occupation, ...
dbl (6): age, wgt, education_num, capital_gain, capital_loss...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
adult = na.omit(adult) # remove na
# remove na's
str(adult)
tibble [30,162 × 15] (S3: tbl_df/tbl/data.frame)
$ age : num [1:30162] 39 50 38 53 28 37 49 52 31 42 ...
$ work_class : chr [1:30162] "State-gov" "Self-emp-not-inc" "Private" "Private" ...
$ wgt : num [1:30162] 77516 83311 215646 234721 338409 ...
$ education : chr [1:30162] "Bachelors" "Bachelors" "HS-grad" "11th" ...
$ education_num : num [1:30162] 13 13 9 7 13 14 5 9 14 13 ...
$ marital_status: chr [1:30162] "Never-married" "Married-civ-spouse" "Divorced" "Married-civ-spouse" ...
$ occupation : chr [1:30162] "Adm-clerical" "Exec-managerial" "Handlers-cleaners" "Handlers-cleaners" ...
$ relationship : chr [1:30162] "Not-in-family" "Husband" "Not-in-family" "Husband" ...
$ race : chr [1:30162] "White" "White" "White" "Black" ...
$ sex : chr [1:30162] "Male" "Male" "Male" "Male" ...
$ capital_gain : num [1:30162] 2174 0 0 0 0 ...
$ capital_loss : num [1:30162] 0 0 0 0 0 0 0 0 0 0 ...
$ hours_per_week: num [1:30162] 40 13 40 40 40 40 16 45 50 40 ...
$ native_country: chr [1:30162] "United-States" "United-States" "United-States" "United-States" ...
$ income : chr [1:30162] "<=50K" "<=50K" "<=50K" "<=50K" ...
- attr(*, "na.action")= 'omit' Named int [1:2399] 15 28 39 52 62 70 78 94 107 129 ...
..- attr(*, "names")= chr [1:2399] "15" "28" "39" "52" ...
adult.num = adult|> select_if(is.numeric)
adult$income = ifelse(adult$income == ">50K",1 ,0)
adult$income = as.numeric(adult$income)
adult.num["income"] = adult$income
suppressMessages(
adult.num|>
ggpairs(columns = 1:6,
title = "Scatter Plot Matrix on Adults Income Exceeding $50k/yr")
)
cor(adult.num)
Warning: the standard deviation is zero
age wgt education_num
age 1.00000000 -0.0765108361 0.04352609
wgt -0.07651084 1.0000000000 -0.04499174
education_num 0.04352609 -0.0449917421 1.00000000
capital_gain 0.08015423 0.0004215674 0.12441600
capital_loss 0.06016548 -0.0097495278 0.07964641
hours_per_week 0.10159876 -0.0228857516 0.15252207
income NA NA NA
capital_gain capital_loss hours_per_week income
age 0.0801542263 0.060165480 0.10159876 NA
wgt 0.0004215674 -0.009749528 -0.02288575 NA
education_num 0.1244159953 0.079646410 0.15252207 NA
capital_gain 1.0000000000 -0.032229327 0.08043180 NA
capital_loss -0.0322293265 1.000000000 0.05241705 NA
hours_per_week 0.0804318007 0.052417049 1.00000000 NA
income NA NA NA 1
Box plot of Income by Age
adult.num|>
ggplot(aes(x = as.factor(Income), y = age))+
geom_boxplot()+
labs(title = "Boxplot of Income by Age",
x = "Income",
y = "Age")+
theme_minimal()
Box plot of Income by wgt
adult.num|>
ggplot(aes(x = as.factor(Income), y = wgt))+
geom_boxplot()+
labs(title = "Boxplot of Income by wgt",
x = "Income",
y = "wgt")+
theme_minimal()
Box plot of Income by education
adult.num|>
ggplot(aes(x = as.factor(Income), y = education_num))+
geom_boxplot()+
labs(title = "Boxplot of Income by Education",
x = "Income",
y = "Education")+
theme_minimal()
Box plot of Income by Capital Gain
adult.num|>
ggplot(aes(x = as.factor(Income), y = capital_gain))+
geom_boxplot()+
labs(title = "Boxplot of Income by Capital Gain",
x = "Income",
y = "Capital Gain")+
theme_minimal()
Box plot of Income by Capital Loss
adult.num|>
ggplot(aes(x = as.factor(Income), y = capital_loss))+
geom_boxplot()+
labs(title = "Boxplot of Income by Capital Loss",
x = "Income",
y = "Capital Loss")+
theme_minimal()
Box plot of Income by Hours worked in a week
adult.num|>
ggplot(aes(x = as.factor(Income), y = hours_per_week))+
geom_boxplot()+
labs(title = "Boxplot of Income hours worked in a week",
x = "Income",
y = "Hours Worked In A Week")+
theme_minimal()
From these scatter plots it does not look like there is any meaningful correlation coefficient. From the Box plots that were produced looked skewed except for the variable hours_per_week.
full.glm = glm(income ~ . -wgt ,family = "binomial", data = adult.num,)
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(full.glm)
Call:
glm(formula = income ~ . - wgt, family = "binomial", data = adult.num)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -8.284e+00 1.196e-01 -69.29 <2e-16 ***
age 4.437e-02 1.297e-03 34.21 <2e-16 ***
education_num 3.218e-01 7.057e-03 45.60 <2e-16 ***
capital_gain 3.205e-04 1.004e-05 31.94 <2e-16 ***
capital_loss 6.990e-04 3.355e-05 20.83 <2e-16 ***
hours_per_week 3.976e-02 1.400e-03 28.40 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 33851 on 30161 degrees of freedom
Residual deviance: 25054 on 30156 degrees of freedom
AIC: 25066
Number of Fisher Scoring iterations: 7
step.glm = stepAIC(full.glm, direction = "both", trace = 0)
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurredWarning: glm.fit: fitted probabilities numerically 0 or 1 occurredWarning: glm.fit: fitted probabilities numerically 0 or 1 occurredWarning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(step.glm)
Call:
glm(formula = income ~ (age + wgt + education_num + capital_gain +
capital_loss + hours_per_week) - wgt, family = "binomial",
data = adult.num)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -8.284e+00 1.196e-01 -69.29 <2e-16 ***
age 4.437e-02 1.297e-03 34.21 <2e-16 ***
education_num 3.218e-01 7.057e-03 45.60 <2e-16 ***
capital_gain 3.205e-04 1.004e-05 31.94 <2e-16 ***
capital_loss 6.990e-04 3.355e-05 20.83 <2e-16 ***
hours_per_week 3.976e-02 1.400e-03 28.40 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 33851 on 30161 degrees of freedom
Residual deviance: 25054 on 30156 degrees of freedom
AIC: 25066
Number of Fisher Scoring iterations: 7
print(rsquared)
[1] 0.2598751
print(odd.ratio)
(Intercept) age education_num capital_gain
0.0002524105 1.0453689219 1.3796332866 1.0003205864
capital_loss hours_per_week
1.0006992972 1.0405660447
For every year of age the odds of earning >$50k increases by .04. For every year of education the odds of earning >$50k increase by .37. For every one-unit increase in capital_gain multiple .0003 and .0006 for capital_loss. Each additional hour worked per week increases the odds of higher income by .04.