str(teengamb)
## 'data.frame': 47 obs. of 5 variables:
## $ sex : int 1 1 1 1 1 1 1 1 1 1 ...
## $ status: int 51 28 37 28 65 61 28 27 43 18 ...
## $ income: num 2 2.5 2 7 2 3.47 5.5 6.42 2 6 ...
## $ verbal: int 8 8 6 4 8 6 7 5 6 7 ...
## $ gamble: num 0 0 0 7.3 19.6 0.1 1.45 6.6 1.7 0.1 ...
The teengamb data frame has 47 rows and 5 columns. A survey was conducted to study teenage gambling in Britain. This data frame contains the following columns:

- sex: 0 = male, 1 = female
- status: socioeconomic status score based on parents' occupation
- income: in pounds per week
- verbal: verbal score in words out of 12 correctly defined
- gamble: expenditure on gambling in pounds per year
summary(teengamb)
## sex status income verbal
## Min. :0.0000 Min. :18.00 Min. : 0.600 Min. : 1.00
## 1st Qu.:0.0000 1st Qu.:28.00 1st Qu.: 2.000 1st Qu.: 6.00
## Median :0.0000 Median :43.00 Median : 3.250 Median : 7.00
## Mean :0.4043 Mean :45.23 Mean : 4.642 Mean : 6.66
## 3rd Qu.:1.0000 3rd Qu.:61.50 3rd Qu.: 6.210 3rd Qu.: 8.00
## Max. :1.0000 Max. :75.00 Max. :15.000 Max. :10.00
## gamble
## Min. : 0.0
## 1st Qu.: 1.1
## Median : 6.0
## Mean : 19.3
## 3rd Qu.: 19.4
## Max. :156.0
#convert sex to a factor variable
teengamb$sex <- as.factor(teengamb$sex)
teengamb |> head() |> knitr::kable()
| sex | status | income | verbal | gamble |
|---|---|---|---|---|
| 1 | 51 | 2.00 | 8 | 0.0 |
| 1 | 28 | 2.50 | 8 | 0.0 |
| 1 | 37 | 2.00 | 6 | 0.0 |
| 1 | 28 | 7.00 | 4 | 7.3 |
| 1 | 65 | 2.00 | 8 | 19.6 |
| 1 | 61 | 3.47 | 6 | 0.1 |
Check the correlations among the numeric variables to help decide which predictors to use.
dta_gamble <- teengamb[, c("status","income","gamble","verbal")]
cor(dta_gamble)
## status income gamble verbal
## status 1.00000000 -0.2750340 -0.05042081 0.5316102
## income -0.27503402 1.0000000 0.62207690 -0.1755707
## gamble -0.05042081 0.6220769 1.00000000 -0.2200562
## verbal 0.53161022 -0.1755707 -0.22005619 1.0000000
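In this matrix, income has the strongest correlation with gamble (0.62), while status (-0.05) and verbal (-0.22) are only weakly correlated with it. The multiple linear regression below uses status, together with sex, as predictors of gamble.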
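As a rough check on which of these correlations are distinguishable from zero, the t-test for a Pearson correlation can be computed from r and n alone. This is a minimal sketch that copies two coefficients from the matrix above and assumes all 47 rows were used; `cor_t_test` is a hypothetical helper name, not part of the original analysis.

```r
# Hypothetical helper: two-sided t-test for a Pearson correlation,
# using t = r * sqrt(n - 2) / sqrt(1 - r^2) with n - 2 degrees of freedom.
cor_t_test <- function(r, n) {
  t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)
  p_value <- 2 * pt(-abs(t_stat), df = n - 2)
  c(t = t_stat, p = p_value)
}

n <- 47  # rows in teengamb, per str() above (assumed to be complete cases)

# Correlations with gamble, copied from the cor() output
income_test <- cor_t_test(0.6220769, n)
status_test <- cor_t_test(-0.05042081, n)

print(round(income_test, 4))
print(round(status_test, 4))
```

With the data frame loaded, `cor.test(teengamb$gamble, teengamb$income)` should report the same statistic directly.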
#identify the outliers
boxplot(gamble ~ sex, data = teengamb)$out
## [1] 156.0 19.6
#look up the rows that contain the outliers (filter() here is dplyr::filter)
filter(teengamb, gamble == 156.0)
## sex status income verbal gamble
## 24 0 27 10 4 156
filter(teengamb, gamble == 19.6)
## sex status income verbal gamble
## 5 1 65 2 8 19.6
#remove the rows that contain the outliers
rmgamb <- teengamb[-c(24, 5), ]
#Check data again
str(rmgamb)
## 'data.frame': 45 obs. of 5 variables:
## $ sex : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ status: int 51 28 37 28 61 28 27 43 18 18 ...
## $ income: num 2 2.5 2 7 3.47 5.5 6.42 2 6 3 ...
## $ verbal: int 8 8 6 4 6 7 5 6 7 6 ...
## $ gamble: num 0 0 0 7.3 0.1 1.45 6.6 1.7 0.1 0.1 ...
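The removal step hard-codes the outlier values and row numbers. A possible alternative is to flag rows with a logical index built from `boxplot.stats()`, which applies the same 1.5 × IQR (hinge-based) rule as `boxplot()`. The sketch below is only an illustration: it uses the six rows printed by `head(teengamb)` as stand-in data, and the names `head_rows`, `flagged`, and `cleaned` are hypothetical. Because the original boxplot is grouped by sex, on the full data the rule would need to be applied within each group.

```r
# Stand-in data: the six rows printed by head(teengamb) above
head_rows <- data.frame(
  sex    = factor(c(1, 1, 1, 1, 1, 1)),
  status = c(51, 28, 37, 28, 65, 61),
  income = c(2.00, 2.50, 2.00, 7.00, 2.00, 3.47),
  verbal = c(8, 8, 6, 4, 8, 6),
  gamble = c(0.0, 0.0, 0.0, 7.3, 19.6, 0.1)
)

# Values lying beyond 1.5 * IQR from the hinges
out_values <- boxplot.stats(head_rows$gamble)$out

# Logical index of flagged rows, so no row numbers are hard-coded
is_outlier <- head_rows$gamble %in% out_values

flagged <- head_rows[is_outlier, ]
cleaned <- head_rows[!is_outlier, ]

print(flagged)
print(nrow(cleaned))
```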
#linear regression
teengambml <- lm(gamble ~ status + sex, data = rmgamb)
summary(teengambml)
##
## Call:
## lm(formula = gamble ~ status + sex, data = rmgamb)
##
## Residuals:
## Min 1Q Median 3Q Max
## -28.759 -15.255 -1.708 5.964 59.248
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 45.1429 12.9135 3.496 0.001130 **
## status -0.3787 0.2307 -1.641 0.108181
## sex1 -29.4228 7.9999 -3.678 0.000663 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 21.83 on 42 degrees of freedom
## Multiple R-squared: 0.2469, Adjusted R-squared: 0.211
## F-statistic: 6.884 on 2 and 42 DF, p-value: 0.002596
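$$\widehat{\text{gamble}} = 45.14 - 0.3787 \times \text{status} - 29.4228 \times \text{female}$$
with a residual standard error of 21.83, where female is 1 for sex = 1 and 0 otherwise.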
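The intercept, 45.14, is the predicted gambling expenditure in pounds per
year for a male teenager with a status score of 0; it is not a minimum
amount. Holding status constant, females are predicted to spend 29.42
pounds per year less than males.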
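To make the coefficients concrete, the fitted equation can be evaluated by hand. This is a minimal sketch using the estimates copied from the summary output; `predict_gamble` is a hypothetical helper, and a status score of 50 is an arbitrary illustrative value.

```r
# Coefficients copied from the summary(teengambml) output above
b_intercept <- 45.1429
b_status    <- -0.3787
b_female    <- -29.4228

# Hypothetical helper: predicted gamble in pounds per year;
# female is 1 for sex == 1 and 0 for sex == 0
predict_gamble <- function(status, female) {
  b_intercept + b_status * status + b_female * female
}

male_50   <- predict_gamble(status = 50, female = 0)
female_50 <- predict_gamble(status = 50, female = 1)

print(male_50)              # about 26.21
print(female_50)            # about -3.21
print(male_50 - female_50)  # about 29.42, the size of the sex coefficient
```

The negative prediction for females shows a limit of this linear fit: it can return expenditures below zero, which are not possible in practice. With the fitted object available, `predict(teengambml, newdata = ...)` would give the unrounded equivalents.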
For sex, |t| = 3.678 > 1.96 and the p-value (0.000663) is below the 0.05
significance level, so we reject the null hypothesis that its coefficient
is zero. The same holds for the intercept (t = 3.496, p = 0.00113).
For status, |t| = 1.641 < 1.96 and the p-value (0.108) is above 0.05, so
we cannot reject the null hypothesis. The effect of status on gamble is
not statistically significant once sex is in the model.
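The t-values and p-values in the coefficient table can be reproduced from the estimates and standard errors. The sketch below copies those numbers from the output; it also computes the exact two-sided 5% critical value for 42 residual degrees of freedom, since 1.96 is the large-sample (normal) approximation and the exact value here is slightly larger, at about 2.02. The conclusions are the same under either threshold.

```r
# Estimates and standard errors copied from the coefficient table above
est <- c(intercept = 45.1429, status = -0.3787, sex1 = -29.4228)
se  <- c(intercept = 12.9135, status = 0.2307,  sex1 = 7.9999)
df_resid <- 42  # residual degrees of freedom from the summary output

# t statistic is estimate / standard error; p-value is two-sided
t_stat  <- est / se
p_value <- 2 * pt(-abs(t_stat), df = df_resid)

# Exact two-sided 5% critical value for 42 degrees of freedom
t_crit <- qt(0.975, df = df_resid)

print(round(t_stat, 3))
print(signif(p_value, 3))
print(round(t_crit, 3))
```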
R-squared is 0.2469, which indicates that the regression model accounts
for only about 24.69% of the variability in the outcome measure.
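The adjusted R-squared and the F-statistic in the summary follow from the multiple R-squared, the sample size, and the number of predictors. A short sketch, using only values copied from the output above:

```r
r2 <- 0.2469  # Multiple R-squared from the summary output
n  <- 45      # observations in rmgamb
p  <- 2       # predictors: status and sex

# Adjusted R-squared penalises R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Overall F-statistic and its p-value on (p, n - p - 1) degrees of freedom
f_stat <- (r2 / p) / ((1 - r2) / (n - p - 1))
f_p    <- pf(f_stat, df1 = p, df2 = n - p - 1, lower.tail = FALSE)

print(round(adj_r2, 3))
print(round(f_stat, 2))
print(signif(f_p, 2))
```

Up to rounding of the copied R-squared, these should agree with the reported adjusted R-squared of 0.211 and F-statistic of 6.884 on 2 and 42 degrees of freedom.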
#Check the normality of the residuals
hist( x = residuals(teengambml),
xlab = "Value of residual",
main = "",
breaks = 20)
Based on the histogram, most of the residuals are close to 0, but their distribution is not exactly normal.
#Normal Q-Q
plot(teengambml, which = 2)
Based on the Q-Q plot, even though we removed 2 outliers, several points still appear to be outliers.
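A formal test can complement the visual checks of normality. The sketch below runs a Shapiro-Wilk test on the residuals of a model fitted to simulated data, so that it runs on its own; the data frame `sim`, its coefficients, and the skewed error term are all made up for illustration and are not the teengamb data. With the model from this analysis, the equivalent call would be `shapiro.test(residuals(teengambml))`.

```r
set.seed(42)
n <- 45

# Simulated stand-in data (NOT teengamb): same variable names, made-up values
sim <- data.frame(
  status = runif(n, min = 18, max = 75),
  sex    = factor(rbinom(n, size = 1, prob = 0.4))
)

# Right-skewed errors (centred exponential) to mimic non-normal residuals
sim$gamble <- 45 - 0.4 * sim$status - 29 * (sim$sex == "1") +
  (rexp(n, rate = 1 / 20) - 20)

fit <- lm(gamble ~ status + sex, data = sim)

# Null hypothesis: the residuals come from a normal distribution
sw <- shapiro.test(residuals(fit))
print(sw)
```

A small p-value would be evidence against normality of the residuals, in line with what a skewed histogram or a curved Q-Q plot suggests.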
#Check the linearity of the relationship
plot(teengambml, which = 1)
Based on the plot, the red line is not a straight horizontal line. This suggests that some pattern remains in the residuals after fitting the model.