# data from the faraway package
library(faraway)
str(teengamb)
## 'data.frame': 47 obs. of 5 variables:
## $ sex : int 1 1 1 1 1 1 1 1 1 1 ...
## $ status: int 51 28 37 28 65 61 28 27 43 18 ...
## $ income: num 2 2.5 2 7 2 3.47 5.5 6.42 2 6 ...
## $ verbal: int 8 8 6 4 8 6 7 5 6 7 ...
## $ gamble: num 0 0 0 7.3 19.6 0.1 1.45 6.6 1.7 0.1 ...
The teengamb data frame has 47 rows and 5 columns. A survey was conducted to study teenage gambling in Britain. This data frame contains the following columns:
sex: 0 = male, 1 = female
status: socioeconomic status score based on parents’ occupation
income: in pounds per week
verbal: verbal score in words out of 12 correctly defined
gamble: expenditure on gambling in pounds per year
# convert sex to a factor variable
teengamb$sex <- as.factor(teengamb$sex)
dta_gamble <- teengamb[, c("status","income","gamble","verbal")]
cor(dta_gamble)
## status income gamble verbal
## status 1.00000000 -0.2750340 -0.05042081 0.5316102
## income -0.27503402 1.0000000 0.62207690 -0.1755707
## gamble -0.05042081 0.6220769 1.00000000 -0.2200562
## verbal 0.53161022 -0.1755707 -0.22005619 1.0000000
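In the correlation matrix, income has the strongest correlation with gamble (0.62), while status is almost uncorrelated with gamble (-0.05) and moderately correlated with verbal (0.53). The multiple linear regression below uses status, together with sex, as predictors of gamble.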
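To compare the predictors at a glance, the correlations with gamble can be ranked by absolute size. This is a minimal sketch that copies the values from the cor() output above rather than recomputing them; the names r_gamble and ranked are illustrative.

```r
# Correlations with gamble, copied from the cor() output above
r_gamble <- c(status = -0.05042081, income = 0.62207690, verbal = -0.22005619)

# Order by absolute size, strongest linear association first
ranked <- r_gamble[order(abs(r_gamble), decreasing = TRUE)]
print(ranked)
```

With the data loaded, the same vector should be obtainable from the gamble row of cor(dta_gamble).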
# check data: gamble against status, colored by sex
library(ggplot2)
ggplot(data = teengamb, aes(x = status, y = gamble, color = sex)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  theme_bw()
## `geom_smooth()` using formula 'y ~ x'
Based on the plot, yearly expenditure on gambling differs between the sexes.
#multiple linear regression
teengambml <- lm(gamble ~ status + sex, data = teengamb)
summary(teengambml)
##
## Call:
## lm(formula = gamble ~ status + sex, data = teengamb)
##
## Residuals:
## Min 1Q Median 3Q Max
## -35.873 -15.755 -3.007 10.924 111.586
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 60.2233 15.1347 3.979 0.000255 ***
## status -0.5855 0.2727 -2.147 0.037321 *
## sex1 -35.7094 9.4899 -3.763 0.000493 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 27.99 on 44 degrees of freedom
## Multiple R-squared: 0.2454, Adjusted R-squared: 0.2111
## F-statistic: 7.154 on 2 and 44 DF, p-value: 0.002042
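$$\text{gamble} = 60.22 - 0.586\,\text{status} - 35.71\,\text{sex} + \varepsilon$$
Here sex is 1 for females and 0 for males, and the residual standard error is 27.99.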
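The intercept (60.22) is the predicted yearly gambling expenditure, in pounds, for a male with a status score of 0. It is not a minimum. Holding status fixed, females are predicted to spend about 35.71 pounds per year less than males. Holding sex fixed, each additional point of status is associated with about 0.59 pounds less spent per year.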
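To make the coefficient interpretation concrete, the sketch below plugs values into the fitted equation by hand. The coefficients are copied from the summary output above. The helper name predict_gamble and the status score of 50 are illustrative assumptions, not part of the original analysis.

```r
# Coefficients copied from summary(teengambml) above
b0 <- 60.2233        # intercept: male (sex = 0) with status = 0
b_status <- -0.5855  # change in gamble per one-point increase in status
b_sex1 <- -35.7094   # shift for females (sex = 1) relative to males

# Hypothetical helper: predicted yearly gambling expenditure in pounds
predict_gamble <- function(status, sex) {
  b0 + b_status * status + b_sex1 * sex
}

male_50 <- predict_gamble(status = 50, sex = 0)
female_50 <- predict_gamble(status = 50, sex = 1)
print(c(male = male_50, female = female_50))

# At equal status, the male-female gap equals the size of the sex1 coefficient
print(male_50 - female_50)
```

The prediction for a female at this status score comes out negative, which is impossible for an expenditure and hints at the limits of a purely linear model for these data.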
The multiple R-squared of 0.2454 indicates that the regression model accounts for only 24.54% of the variability in gamble.
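The adjusted R-squared in the same output can be checked by hand from the multiple R-squared, the sample size, and the number of predictors. This sketch uses only values printed above; the variable names are illustrative.

```r
# Values copied from the summary output above
r2 <- 0.2454  # multiple R-squared
n <- 47       # observations in teengamb
p <- 2        # predictors: status and sex1

# Adjusted R-squared penalizes R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(round(adj_r2, 4))
```

The denominator n - p - 1 is the 44 residual degrees of freedom reported in the summary.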
Checking multicollinearity and the normality of the residuals
#Calculates variance-inflation factors
vif(teengambml)
## status sex1
## 1.300895 1.300895
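A variance inflation factor for predictor j is 1 / (1 - R_j^2), where R_j^2 comes from regressing that predictor on the others. With only two predictors, R_j^2 is the squared correlation between them, which is why both values above are identical. The sketch below backs that correlation out of the printed value; the variable names are illustrative.

```r
vif_value <- 1.300895  # copied from the vif() output above

# VIF_j = 1 / (1 - R_j^2), so R_j^2 = 1 - 1 / VIF_j
r2_between <- 1 - 1 / vif_value

# With two predictors, R_j^2 is the squared correlation between status and sex1
abs_cor <- sqrt(r2_between)
print(round(c(r2 = r2_between, abs_cor = abs_cor), 3))
```

A value this close to 1 is usually read as little multicollinearity between status and sex.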
hist(x = residuals(teengambml),
     xlab = "Value of residual",
     main = "",
     breaks = 20)
The histogram shows that the residuals are not normally distributed, although most of them are close to 0.
plot(teengambml, which = 2)
Based on the normal Q-Q plot, there may be outliers in the data; the largest residual is 111.586.
plot(teengambml, which = 1)
plot(teengambml, which = 3)
In the residuals vs fitted and scale-location plots, the smoothed lines are not horizontal. This suggests that the residuals vary systematically with the fitted values (non-linearity or non-constant variance) and that other factors may help explain gamble. The model is not good enough.