Question 1

The dataset teengamb concerns a study of teenage gambling in Britain. Fit a regression model with the expenditure on gambling as the response and the sex, status, income and verbal score as predictors. Present a summary output and then answer/carry out the fol- lowing.

library(faraway)
data("teengamb")
teengamb$sex <- factor(teengamb$sex)
levels(teengamb$sex) <- c("male", "female")
summary(teengamb)
##      sex         status          income           verbal          gamble     
##  male  :28   Min.   :18.00   Min.   : 0.600   Min.   : 1.00   Min.   :  0.0  
##  female:19   1st Qu.:28.00   1st Qu.: 2.000   1st Qu.: 6.00   1st Qu.:  1.1  
##              Median :43.00   Median : 3.250   Median : 7.00   Median :  6.0  
##              Mean   :45.23   Mean   : 4.642   Mean   : 6.66   Mean   : 19.3  
##              3rd Qu.:61.50   3rd Qu.: 6.210   3rd Qu.: 8.00   3rd Qu.: 19.4  
##              Max.   :75.00   Max.   :15.000   Max.   :10.00   Max.   :156.0
fit <- lm(gamble ~ sex+status+income+verbal, data=teengamb)
summary(fit)
## 
## Call:
## lm(formula = gamble ~ sex + status + income + verbal, data = teengamb)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -51.082 -11.320  -1.451   9.452  94.252 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  22.55565   17.19680   1.312   0.1968    
## sexfemale   -22.11833    8.21111  -2.694   0.0101 *  
## status        0.05223    0.28111   0.186   0.8535    
## income        4.96198    1.02539   4.839 1.79e-05 ***
## verbal       -2.95949    2.17215  -1.362   0.1803    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 22.69 on 42 degrees of freedom
## Multiple R-squared:  0.5267, Adjusted R-squared:  0.4816 
## F-statistic: 11.69 on 4 and 42 DF,  p-value: 1.815e-06

(a) Percentage of variation in response : 0.5267

(b) INCOME

As seen income variable has the least P-value and hence is most significant.Therefore, if INCOME is removed from the regression model then there will be biggest decrease in the coefficient of determination. As, a proof we can see from below that as we remove each predictor, only income predictor shows a significant decrease in r-square value.

fit_inc <- lm(gamble ~ sex+status+verbal, data=teengamb)
summary(fit_inc)$r.squared
## [1] 0.2628501
fit_sex <- lm(gamble ~ income+status+verbal, data=teengamb)
summary(fit_sex)$r.squared
## [1] 0.4449587
fit_stat <- lm(gamble ~ sex+income+verbal, data=teengamb)
summary(fit_stat)$r.squared
## [1] 0.5263344
fit_verb <- lm(gamble ~ sex+status+income, data=teengamb)
summary(fit_verb)$r.squared
## [1] 0.5058054
fit <- lm(gamble ~ sex+status+verbal+income, data=teengamb)

fit$residuals
##           1           2           3           4           5           6 
##  10.6507430   9.3711318   5.4630298 -17.4957487  29.5194692  -2.9846919 
##           7           8           9          10          11          12 
##  -7.0242994 -12.3060734   6.8496267 -10.3329505   1.5934936  -3.0958161 
##          13          14          15          16          17          18 
##   0.1172839   9.5331344   2.8488167  17.2107726 -25.2627227 -27.7998544 
##          19          20          21          22          23          24 
##  13.1446553 -15.9510624 -16.0041386  -9.5801478 -27.2711657  94.2522174 
##          25          26          27          28          29          30 
##   0.6993361  -9.1670510 -25.8747696  -8.7455549  -6.8803097 -19.8090866 
##          31          32          33          34          35          36 
##  10.8793766  15.0599340  11.7462296  -3.5932770 -14.4016736  45.6051264 
##          37          38          39          40          41          42 
##  20.5472529  11.2429290 -51.0824078   8.8669438  -1.4513921  -3.8361619 
##          43          44          45          46          47 
##  -4.3831786 -14.8940753   5.4506347   1.4092321   7.1662399
min(fit$residuals)
## [1] -51.08241
max(fit$residuals)
## [1] 94.25222

(c) smallest (negative) residual = -51.08241

largest (positive) residual = 94.25222

mean(fit$residuals)
## [1] 6.42512e-16
median(fit$residuals)
## [1] -1.451392

(d) Mean of residuals = 6.42512e-16

Median of residuals = -1.451392

Difference between mean and median indicates the skewness. If mean is greater than the median then it is positively skewed and if the mean is less than the median the the data set is said to be negatively skewed.

res <- (fit$residuals)
hist(res)

plot(density(res))

(e) From the histogram and density plots above and the values that Mean is greater than median we can see that the plot is POSITIVELY skewed

cor(resid(fit), fitted(fit))
## [1] -1.646916e-17
plot(fit, which = 1)

(f) The correlation of residuals with the fitted values = -1.646916e-17   From the plot above we can note that there is no particular linear pattern as a lot of points in the center with a dip and are a little scattered from the fitted model. But for the initial few values (till 20) the model proves to be a good fit.

cor(fit$residuals, teengamb$income)
## [1] -7.214367e-17

(g) Correlation of residuals with income = -7.214367e-17

(h) In the data set, Male = 0 and Female = 1. From the summary we can see that the coefficient for female = -22.11833. Hence we can say that, on an average, the teenage females spent 22.1183301 pounds less on gambling than teenage males.