The dataset teengamb concerns a study of teenage gambling in Britain. Fit a regression model with the expenditure on gambling as the response and the sex, status, income and verbal score as predictors. Present a summary output and then answer/carry out the fol- lowing.
library(faraway)
data("teengamb")
teengamb$sex <- factor(teengamb$sex)
levels(teengamb$sex) <- c("male", "female")
summary(teengamb)
## sex status income verbal gamble
## male :28 Min. :18.00 Min. : 0.600 Min. : 1.00 Min. : 0.0
## female:19 1st Qu.:28.00 1st Qu.: 2.000 1st Qu.: 6.00 1st Qu.: 1.1
## Median :43.00 Median : 3.250 Median : 7.00 Median : 6.0
## Mean :45.23 Mean : 4.642 Mean : 6.66 Mean : 19.3
## 3rd Qu.:61.50 3rd Qu.: 6.210 3rd Qu.: 8.00 3rd Qu.: 19.4
## Max. :75.00 Max. :15.000 Max. :10.00 Max. :156.0
fit <- lm(gamble ~ sex+status+income+verbal, data=teengamb)
summary(fit)
##
## Call:
## lm(formula = gamble ~ sex + status + income + verbal, data = teengamb)
##
## Residuals:
## Min 1Q Median 3Q Max
## -51.082 -11.320 -1.451 9.452 94.252
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 22.55565 17.19680 1.312 0.1968
## sexfemale -22.11833 8.21111 -2.694 0.0101 *
## status 0.05223 0.28111 0.186 0.8535
## income 4.96198 1.02539 4.839 1.79e-05 ***
## verbal -2.95949 2.17215 -1.362 0.1803
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22.69 on 42 degrees of freedom
## Multiple R-squared: 0.5267, Adjusted R-squared: 0.4816
## F-statistic: 11.69 on 4 and 42 DF, p-value: 1.815e-06
(a) Percentage of variation in response : 0.5267
(b) INCOME
As seen income variable has the least P-value and hence is most significant.Therefore, if INCOME is removed from the regression model then there will be biggest decrease in the coefficient of determination. As, a proof we can see from below that as we remove each predictor, only income predictor shows a significant decrease in r-square value.
fit_inc <- lm(gamble ~ sex+status+verbal, data=teengamb)
summary(fit_inc)$r.squared
## [1] 0.2628501
fit_sex <- lm(gamble ~ income+status+verbal, data=teengamb)
summary(fit_sex)$r.squared
## [1] 0.4449587
fit_stat <- lm(gamble ~ sex+income+verbal, data=teengamb)
summary(fit_stat)$r.squared
## [1] 0.5263344
fit_verb <- lm(gamble ~ sex+status+income, data=teengamb)
summary(fit_verb)$r.squared
## [1] 0.5058054
fit <- lm(gamble ~ sex+status+verbal+income, data=teengamb)
fit$residuals
## 1 2 3 4 5 6
## 10.6507430 9.3711318 5.4630298 -17.4957487 29.5194692 -2.9846919
## 7 8 9 10 11 12
## -7.0242994 -12.3060734 6.8496267 -10.3329505 1.5934936 -3.0958161
## 13 14 15 16 17 18
## 0.1172839 9.5331344 2.8488167 17.2107726 -25.2627227 -27.7998544
## 19 20 21 22 23 24
## 13.1446553 -15.9510624 -16.0041386 -9.5801478 -27.2711657 94.2522174
## 25 26 27 28 29 30
## 0.6993361 -9.1670510 -25.8747696 -8.7455549 -6.8803097 -19.8090866
## 31 32 33 34 35 36
## 10.8793766 15.0599340 11.7462296 -3.5932770 -14.4016736 45.6051264
## 37 38 39 40 41 42
## 20.5472529 11.2429290 -51.0824078 8.8669438 -1.4513921 -3.8361619
## 43 44 45 46 47
## -4.3831786 -14.8940753 5.4506347 1.4092321 7.1662399
min(fit$residuals)
## [1] -51.08241
max(fit$residuals)
## [1] 94.25222
(c) smallest (negative) residual = -51.08241
largest (positive) residual = 94.25222
mean(fit$residuals)
## [1] 6.42512e-16
median(fit$residuals)
## [1] -1.451392
(d) Mean of residuals = 6.42512e-16
Median of residuals = -1.451392
Difference between mean and median indicates the skewness. If mean is greater than the median then it is positively skewed and if the mean is less than the median the the data set is said to be negatively skewed.
res <- (fit$residuals)
hist(res)
plot(density(res))
(e) From the histogram and density plots above and the values that Mean is greater than median we can see that the plot is POSITIVELY skewed
cor(resid(fit), fitted(fit))
## [1] -1.646916e-17
plot(fit, which = 1)
(f) The correlation of residuals with the fitted values = -1.646916e-17 From the plot above we can note that there is no particular linear pattern as a lot of points in the center with a dip and are a little scattered from the fitted model. But for the initial few values (till 20) the model proves to be a good fit.
cor(fit$residuals, teengamb$income)
## [1] -7.214367e-17
(g) Correlation of residuals with income = -7.214367e-17
(h) In the data set, Male = 0 and Female = 1. From the summary we can see that the coefficient for female = -22.11833. Hence we can say that, on an average, the teenage females spent 22.1183301 pounds less on gambling than teenage males.