Note

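  1. Residual plots: the larger the residuals, the weaker the model's predictions.
  2. Multiple regression analysis works best when the outcome is continuous and the residuals are approximately normally distributed. Predictors do not have to be normal and may be categorical (such as sex below).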
# data from the faraway package (ggplot2 is used for the plot below)
library(faraway)
library(ggplot2)
str(teengamb)
## 'data.frame':    47 obs. of  5 variables:
##  $ sex   : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ status: int  51 28 37 28 65 61 28 27 43 18 ...
##  $ income: num  2 2.5 2 7 2 3.47 5.5 6.42 2 6 ...
##  $ verbal: int  8 8 6 4 8 6 7 5 6 7 ...
##  $ gamble: num  0 0 0 7.3 19.6 0.1 1.45 6.6 1.7 0.1 ...

The teengamb data frame has 47 rows and 5 columns. It comes from a survey conducted to study teenage gambling in Britain. The data frame contains the following columns:

  sex: 0 = male, 1 = female
  status: socioeconomic status score based on parents’ occupation
  income: income in pounds per week
  verbal: verbal score, in words out of 12 correctly defined
  gamble: expenditure on gambling in pounds per year

# convert sex to a factor variable
teengamb$sex <- as.factor(teengamb$sex)

Check the correlations between the numeric variables to help decide which ones to use:

dta_gamble <- teengamb[, c("status","income","gamble","verbal")]
cor(dta_gamble)
##             status     income      gamble     verbal
## status  1.00000000 -0.2750340 -0.05042081  0.5316102
## income -0.27503402  1.0000000  0.62207690 -0.1755707
## gamble -0.05042081  0.6220769  1.00000000 -0.2200562
## verbal  0.53161022 -0.1755707 -0.22005619  1.0000000

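Income has the strongest correlation with gamble (0.62), whereas status is almost uncorrelated with it (-0.05). The regression below nevertheless uses status, together with sex, as predictors of gamble, so a weak status effect should be expected.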
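To make the comparison explicit, the correlations with gamble can be ranked by absolute value. The sketch below copies the three values by hand from the correlation matrix above instead of recomputing them, so it runs without the data; the names `r_gamble`, `ranked` and `r_squared` are illustrative only. Sex is not covered because it is a factor and was left out of the matrix.

```r
# Correlations with gamble, copied by hand from the cor() output above
r_gamble <- c(status = -0.05042081, income = 0.62207690, verbal = -0.22005619)

# Rank the variables by the absolute strength of their linear association
ranked <- sort(abs(r_gamble), decreasing = TRUE)
print(round(ranked, 3))

# Squared correlation: the share of variance in gamble that each variable
# would explain on its own in a simple linear regression
r_squared <- round(r_gamble[names(ranked)]^2, 3)
print(r_squared)
```

On its own, income would account for roughly 39% of the variance in gamble, while status would account for well under 1%.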

# check the data: gamble against status, with one regression line per sex
ggplot(teengamb, aes(x = status, y = gamble, color = sex)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  theme_bw()
## `geom_smooth()` using formula 'y ~ x'

The plot suggests that yearly expenditure on gambling differs between the sexes.

#multiple linear regression
teengambml <- lm(gamble ~ status + sex, data = teengamb)
summary(teengambml)
## 
## Call:
## lm(formula = gamble ~ status + sex, data = teengamb)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -35.873 -15.755  -3.007  10.924 111.586 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  60.2233    15.1347   3.979 0.000255 ***
## status       -0.5855     0.2727  -2.147 0.037321 *  
## sex1        -35.7094     9.4899  -3.763 0.000493 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 27.99 on 44 degrees of freedom
## Multiple R-squared:  0.2454, Adjusted R-squared:  0.2111 
## F-statistic: 7.154 on 2 and 44 DF,  p-value: 0.002042

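$$\text{gamble} = 60.22 - 0.586 \cdot \text{status} - 35.71 \cdot \text{sex} + \varepsilon$$

Here sex is 1 for female and 0 for male, and the error term ε has an estimated standard deviation of about 28 (the residual standard error, 27.99).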

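The intercept, 60.22, is the predicted yearly gambling expenditure in pounds for a male teenager with a status score of 0; it is not a minimum. Holding sex constant, each additional point of status is associated with about 0.59 pounds less spent per year. Holding status constant, females are predicted to spend about 35.71 pounds less per year than males.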
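As a worked example of how the coefficients combine, the sketch below plugs the estimates from the summary output into the regression equation. The helper `predict_gamble` and the status value of 50 are hypothetical and chosen only for illustration; the coefficients are copied by hand, so nothing is refitted.

```r
# Coefficients copied from the summary() output above
b_intercept <- 60.2233
b_status    <- -0.5855
b_female    <- -35.7094  # the sex1 coefficient (1 = female)

# Hypothetical helper, not part of the original analysis
predict_gamble <- function(status, female) {
  b_intercept + b_status * status + b_female * female
}

status_example <- 50  # illustrative value chosen for this sketch
pred_male   <- predict_gamble(status_example, female = 0)
pred_female <- predict_gamble(status_example, female = 1)

print(round(c(male = pred_male, female = pred_female), 2))
# male 30.95, female -4.76
print(round(pred_male - pred_female, 2))
# 35.71: the gap equals the sex1 coefficient at any status value
```

The negative prediction for the female case is not a possible expenditure, which is one sign that a straight-line model fits these data imperfectly.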

The R-squared value of 0.2454 indicates that the regression model accounts for only 24.54% of the variability in the outcome measure.
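The other fit statistics in the summary follow from R-squared, the sample size and the number of predictors. The sketch below reproduces them using the standard formulas for adjusted R-squared and the overall F-test; the inputs are copied by hand from the output above, so small rounding differences are expected.

```r
r2 <- 0.2454  # Multiple R-squared from the summary
n  <- 47      # observations
p  <- 2       # predictors: status and sex1

# Adjusted R-squared penalises R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Overall F statistic on p and n - p - 1 degrees of freedom
f_stat <- (r2 / p) / ((1 - r2) / (n - p - 1))
p_val  <- pf(f_stat, p, n - p - 1, lower.tail = FALSE)

print(round(adj_r2, 4))  # 0.2111
print(round(f_stat, 2))  # 7.15
print(p_val)             # close to the reported 0.002042
```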

Diagnostics

Checking multicollinearity and the normality of the residuals

# variance inflation factors (multicollinearity check)
vif(teengambml)
##   status     sex1 
## 1.300895 1.300895
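Both values are identical because the model has only two predictors: in that case each VIF equals 1 / (1 - r^2), where r is the correlation between the two predictor columns. The sketch below inverts that formula to show what a VIF of 1.30 means; `vif_from_r2` is a hypothetical helper, and the reported VIF is copied by hand from the output above.

```r
vif_reported <- 1.300895  # from the vif() output above

# With two predictors, VIF = 1 / (1 - r^2), so r^2 = 1 - 1 / VIF
r2_between <- 1 - 1 / vif_reported
print(round(r2_between, 3))        # 0.231
print(round(sqrt(r2_between), 3))  # 0.481 (absolute correlation)

# Hypothetical helper: VIF implied by a given R-squared among predictors
vif_from_r2 <- function(r2) 1 / (1 - r2)
print(vif_from_r2(c(0, 0.5, 0.9)))  # 1 2 10
```

About 23% of the variance in each predictor is shared with the other, which is a mild degree of collinearity.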
hist( x = residuals(teengambml),
      xlab = "Value of residual",
      main = "",
      breaks = 20)

The histogram shows that the residuals are not normally distributed, although most of them are close to 0.
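A formal test can complement the visual check. The sketch below applies the Shapiro-Wilk test from base R to the residuals; it refits the model so that it runs on its own, assumes the faraway package is installed, and uses the illustrative names `gamb`, `fit`, `res` and `sw`. This test is an addition and was not part of the original analysis.

```r
library(faraway)  # provides the teengamb data

gamb <- teengamb
gamb$sex <- as.factor(gamb$sex)
fit <- lm(gamble ~ status + sex, data = gamb)

res <- residuals(fit)
sw <- shapiro.test(res)  # H0: the residuals come from a normal distribution
print(sw)
```

A small p-value would count as evidence against normality, in line with the skewed histogram.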

plot(teengambml, which = 2)

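The normal Q-Q plot suggests that there might be outliers in the data. This agrees with the residual summary above, where the largest residual (111.59) is far from the third quartile (10.92).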

plot(teengambml, which = 1)

plot(teengambml, which = 3)

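In both plots the smoothed lines are not horizontal. This means that the residuals still show a systematic pattern against the fitted values (possible nonlinearity or non-constant variance), and that other factors not in the model may help explain gambling expenditure. The model is not good enough.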
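One possible next step, sketched here under assumptions rather than taken from the original analysis, is to add income, the variable most strongly correlated with gamble, and compare the nested models with an F-test. The names `gamb`, `fit_base` and `fit_income` are illustrative, and the faraway package is assumed to be installed.

```r
library(faraway)  # provides the teengamb data

gamb <- teengamb
gamb$sex <- as.factor(gamb$sex)

fit_base   <- lm(gamble ~ status + sex, data = gamb)
fit_income <- lm(gamble ~ status + sex + income, data = gamb)

# F-test for nested models: does adding income reduce the residual error?
print(anova(fit_base, fit_income))
print(summary(fit_income)$adj.r.squared)

# Redraw the same diagnostic plots for the extended model
plot(fit_income, which = 1)
plot(fit_income, which = 3)
```

If the extended model fits better, the diagnostic plots should still be checked again, because an added predictor does not by itself remove outliers or non-constant variance.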