0426

str(teengamb)

## 'data.frame':    47 obs. of  5 variables:
##  $ sex   : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ status: int  51 28 37 28 65 61 28 27 43 18 ...
##  $ income: num  2 2.5 2 7 2 3.47 5.5 6.42 2 6 ...
##  $ verbal: int  8 8 6 4 8 6 7 5 6 7 ...
##  $ gamble: num  0 0 0 7.3 19.6 0.1 1.45 6.6 1.7 0.1 ...

The teengamb data frame has 47 rows and 5 columns. A survey was conducted to study teenage gambling in Britain. This data frame contains the following columns:

sex: 0 = male, 1 = female
status: socioeconomic status score based on parents’ occupation
income: in pounds per week
verbal: verbal score in words out of 12 correctly defined
gamble: expenditure on gambling in pounds per year

summary(teengamb)

##       sex             status          income           verbal     
##  Min.   :0.0000   Min.   :18.00   Min.   : 0.600   Min.   : 1.00  
##  1st Qu.:0.0000   1st Qu.:28.00   1st Qu.: 2.000   1st Qu.: 6.00  
##  Median :0.0000   Median :43.00   Median : 3.250   Median : 7.00  
##  Mean   :0.4043   Mean   :45.23   Mean   : 4.642   Mean   : 6.66  
##  3rd Qu.:1.0000   3rd Qu.:61.50   3rd Qu.: 6.210   3rd Qu.: 8.00  
##  Max.   :1.0000   Max.   :75.00   Max.   :15.000   Max.   :10.00  
##      gamble     
##  Min.   :  0.0  
##  1st Qu.:  1.1  
##  Median :  6.0  
##  Mean   : 19.3  
##  3rd Qu.: 19.4  
##  Max.   :156.0

# Convert sex to a factor variable
teengamb$sex <- as.factor(teengamb$sex)

teengamb |> head() |> knitr::kable()

sex	status	income	verbal	gamble
1	51	2.00	8	0.0
1	28	2.50	8	0.0
1	37	2.00	6	0.0
1	28	7.00	4	7.3
1	65	2.00	8	19.6
1	61	3.47	6	0.1

Check the correlations between the numeric variables to help decide which ones to use as predictors.

dta_gamble <- teengamb[, c("status","income","gamble","verbal")]
cor(dta_gamble)

##             status     income      gamble     verbal
## status  1.00000000 -0.2750340 -0.05042081  0.5316102
## income -0.27503402  1.0000000  0.62207690 -0.1755707
## gamble -0.05042081  0.6220769  1.00000000 -0.2200562
## verbal  0.53161022 -0.1755707 -0.22005619  1.0000000

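In the correlation matrix, income has the strongest correlation with gamble (0.62), while status is almost uncorrelated with it (-0.05). The multiple linear regression below nevertheless uses status, together with sex, as predictors of gamble.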
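One way to read the matrix is to rank the candidate predictors by the absolute value of their correlation with gamble. The sketch below copies the three values from the cor() output above rather than recomputing them; the name cor_with_gamble is only an illustrative choice.

```r
# Correlations with gamble, copied from the cor() output above
cor_with_gamble <- c(status = -0.05042081,
                     income =  0.62207690,
                     verbal = -0.22005619)

# Order the predictors by absolute correlation, strongest first
ranked <- cor_with_gamble[order(abs(cor_with_gamble), decreasing = TRUE)]
print(round(ranked, 3))
```

Pairwise correlations leave out sex, which was converted to a factor and is not part of this matrix, so the ranking is only a rough screening step.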

Remove outlier

#identify the outliers
boxplot(gamble ~ sex, data = teengamb)$out

## [1] 156.0  19.6

# Look up the rows that contain the outlying values (filter() comes from dplyr)
dplyr::filter(teengamb, gamble == 156.0)

##    sex status income verbal gamble
## 24   0     27     10      4    156

dplyr::filter(teengamb, gamble == 19.6)

##   sex status income verbal gamble
## 5   1     65      2      8   19.6

# Remove the two outlier rows (rows 24 and 5)
rmgamb <- teengamb[-c(24, 5), ]

#Check data again
str(rmgamb)

## 'data.frame':    45 obs. of  5 variables:
##  $ sex   : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ status: int  51 28 37 28 61 28 27 43 18 18 ...
##  $ income: num  2 2.5 2 7 3.47 5.5 6.42 2 6 3 ...
##  $ verbal: int  8 8 6 4 6 7 5 6 7 6 ...
##  $ gamble: num  0 0 0 7.3 0.1 1.45 6.6 1.7 0.1 0.1 ...

#linear regression
teengambml <- lm(gamble ~ status + sex, data = rmgamb)
summary(teengambml)

## 
## Call:
## lm(formula = gamble ~ status + sex, data = rmgamb)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -28.759 -15.255  -1.708   5.964  59.248 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  45.1429    12.9135   3.496 0.001130 ** 
## status       -0.3787     0.2307  -1.641 0.108181    
## sex1        -29.4228     7.9999  -3.678 0.000663 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 21.83 on 42 degrees of freedom
## Multiple R-squared:  0.2469, Adjusted R-squared:  0.211 
## F-statistic: 6.884 on 2 and 42 DF,  p-value: 0.002596

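Fitted model: gamble = 45.1429 - 0.3787 × status - 29.4228 × female, where female = 1 for female and 0 for male. The residual standard error is 21.83; it is not a term in the equation.
The intercept, 45.14, is the predicted yearly gambling expenditure in pounds for a male teenager with a status score of 0; it is not a minimum. Holding status constant, females are predicted to spend 29.42 pounds per year less than males.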
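To make the coefficients concrete, the following sketch evaluates the fitted equation by hand for a male and a female teenager with the same status score. The coefficients are copied from the summary() output above; the helper name fitted_gamble and the status value of 45 (roughly the sample mean) are illustrative assumptions.

```r
# Coefficients copied from the summary(teengambml) output above
b <- c(intercept = 45.1429, status = -0.3787, female = -29.4228)

# Fitted gamble for a given status score and sex (female: 1 = female, 0 = male)
fitted_gamble <- function(status, female) {
  b[["intercept"]] + b[["status"]] * status + b[["female"]] * female
}

male_45   <- fitted_gamble(status = 45, female = 0)
female_45 <- fitted_gamble(status = 45, female = 1)
print(c(male = male_45, female = female_45, difference = female_45 - male_45))
```

The difference equals the sex coefficient whatever status is used. The fitted value for the female case comes out slightly negative, which a linear model allows even though expenditure cannot be below zero.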
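For sex, |t| = 3.678 exceeds the critical value (about 2.02 with 42 degrees of freedom) and the p-value (0.00066) is below the 0.05 significance level, so we reject the null hypothesis that its coefficient is zero. The same holds for the intercept (t = 3.496, p = 0.0011).
For status, |t| = 1.641 is below the critical value and the p-value (0.108) is above 0.05, so we cannot reject the null hypothesis. After accounting for sex, the association between status and gamble is not statistically significant.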
The R-squared of 0.2469 indicates that the regression model accounts for only about 24.7% of the variability in gamble.
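As a rough consistency check, the adjusted R-squared and the F-statistic in the summary output can be recovered from the multiple R-squared of 0.2469, the 45 observations, and the 2 predictors. The sketch below also computes the two-sided 5% critical t value for 42 degrees of freedom; all variable names are illustrative.

```r
# Values copied from the summary(teengambml) output above
r2 <- 0.2469   # multiple R-squared
n  <- 45       # observations after removing two rows
p  <- 2        # predictors: status and sex

df_resid <- n - p - 1                           # residual degrees of freedom
adj_r2   <- 1 - (1 - r2) * (n - 1) / df_resid   # adjusted R-squared
f_stat   <- (r2 / p) / ((1 - r2) / df_resid)    # overall F-statistic
t_crit   <- qt(0.975, df = df_resid)            # two-sided 5% critical t value

print(round(c(adj_r2 = adj_r2, f_stat = f_stat, t_crit = t_crit), 3))
```

Up to rounding, the first two values should agree with the reported 0.211 and 6.884, and the critical t value is slightly larger than the normal approximation of 1.96.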

Diagnostic

#Check the normality of the residuals
hist( x = residuals(teengambml),
      xlab = "Value of residual",
      main = "",
      breaks = 20)

Based on the histogram, most residuals are close to 0, but their distribution is not exactly normal: the right tail is longer, with residuals ranging from -28.8 to 59.2.

#Normal Q-Q
plot(teengambml, which = 2)

Based on the Q-Q plot, several points still depart from the reference line, so some outliers remain even though two were removed.

#Check the linearity of the relationship
plot(teengambml, which = 1)

Based on the plot, the red line is not a straight horizontal line, which suggests that some pattern remains in the residuals after fitting the model.

2022-04-26