Using R, build a multiple regression model for data that interests you. Include in this model at least one quadratic term, one dichotomous term, and one dichotomous vs. quantitative interaction term. Interpret all coefficients. Conduct residual analysis. Was the linear model appropriate? Why or why not?
url <- "https://raw.githubusercontent.com/saayedalam/Data/master/master.csv"
data <- read.csv(url, fileEncoding = "UTF-8-BOM")  # strip the byte-order mark that garbled the first column name
head(data)
## country year sex age suicides_no population
## 1 Albania 1987 male 15-24 years 21 312900
## 2 Albania 1987 male 35-54 years 16 308000
## 3 Albania 1987 female 15-24 years 14 289700
## 4 Albania 1987 male 75+ years 1 21800
## 5 Albania 1987 male 25-34 years 9 274300
## 6 Albania 1987 female 75+ years 1 35600
## suicides.100k.pop country.year HDI.for.year gdp_for_year....
## 1 6.71 Albania1987 NA 2,156,624,900
## 2 5.19 Albania1987 NA 2,156,624,900
## 3 4.83 Albania1987 NA 2,156,624,900
## 4 4.59 Albania1987 NA 2,156,624,900
## 5 3.28 Albania1987 NA 2,156,624,900
## 6 2.81 Albania1987 NA 2,156,624,900
## gdp_per_capita.... generation
## 1 796 Generation X
## 2 796 Silent
## 3 796 Generation X
## 4 796 G.I. Generation
## 5 796 Boomers
## 6 796 G.I. Generation
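# Caution: inside a formula, (x^2) is formula crossing rather than squaring, so this call fits no quadratic term; I(x^2) would be required for one.
multi_reg <- lm(suicides.100k.pop ~ sex + (gdp_per_capita....^2) + (sex * gdp_per_capita....), data = data)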
multi_reg
##
## Call:
## lm(formula = suicides.100k.pop ~ sex + (gdp_per_capita....^2) +
## (sex * gdp_per_capita....), data = data)
##
## Coefficients:
## (Intercept) sexmale
## 5.053e+00 1.546e+01
## gdp_per_capita.... sexmale:gdp_per_capita....
## 2.012e-05 -3.666e-05
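The printed coefficients contain no squared term, even though the formula includes `(gdp_per_capita....^2)`. The reason is that `^` inside an R formula is the crossing operator, so a variable "squared" collapses back to the variable itself. The sketch below illustrates this with `terms()` on two toy formulas; the names `y` and `x` are placeholders and not columns of the dataset.

```r
# Toy formulas with placeholder names y and x (not columns of the dataset).
f_plain <- y ~ (x^2)        # ^ is the formula crossing operator here
f_quad  <- y ~ x + I(x^2)   # I() protects ^ so it means arithmetic squaring

# terms() reports which model terms a formula actually expands to.
labels_plain <- attr(terms(f_plain), "term.labels")
labels_quad  <- attr(terms(f_quad), "term.labels")

print(labels_plain)
print(labels_quad)
```

Under this reading, a model with a genuine quadratic term would be requested with something like `lm(suicides.100k.pop ~ sex * gdp_per_capita.... + I(gdp_per_capita....^2), data = data)`. That call was not run for this report, so all output shown here still refers to the model without the squared term.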
summary(multi_reg)
##
## Call:
## lm(formula = suicides.100k.pop ~ sex + (gdp_per_capita....^2) +
## (sex * gdp_per_capita....), data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -20.514 -6.823 -3.110 3.511 204.749
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.053e+00 1.983e-01 25.482 < 2e-16 ***
## sexmale 1.546e+01 2.805e-01 55.142 < 2e-16 ***
## gdp_per_capita.... 2.012e-05 7.832e-06 2.569 0.010192 *
## sexmale:gdp_per_capita.... -3.666e-05 1.108e-05 -3.310 0.000934 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.45 on 27816 degrees of freedom
## Multiple R-squared: 0.1536, Adjusted R-squared: 0.1535
## F-statistic: 1683 on 3 and 27816 DF, p-value: < 2.2e-16
The p-value of the F-statistic is very small (< 2.2e-16). Therefore, we reject the null hypothesis that none of the predictors is related to the suicide rate. However, the R-squared value shows that the model explains only about 15% of the variance in the suicide rate. Overall, this is not a good model. Let us do residual analysis to check the model assumptions.
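To interpret the coefficients, it helps to combine them by sex, because the interaction term gives males and females different slopes. The sketch below copies the rounded estimates from the summary output; the vector `b` and its element names are illustrative labels, not part of the model object.

```r
# Rounded estimates copied from the summary() output above.
# The vector name `b` and its element labels are illustrative only.
b <- c(intercept = 5.053, sexmale = 15.46,
       gdp = 2.012e-05, sexmale_gdp = -3.666e-05)

# Females are the reference level (sexmale = 0).
intercept_female <- b[["intercept"]]
slope_female     <- b[["gdp"]]

# For males, the dummy shifts the intercept and the interaction shifts the slope.
intercept_male <- b[["intercept"]] + b[["sexmale"]]
slope_male     <- b[["gdp"]] + b[["sexmale_gdp"]]

# Change in suicides per 100k for a 10,000-unit rise in GDP per capita.
per_10k <- c(female = slope_female, male = slope_male) * 10000

print(c(intercept_female = intercept_female, intercept_male = intercept_male))
print(per_10k)
```

Read this way, the intercept (5.053) is the predicted rate per 100k for females at a GDP per capita of zero, and `sexmale` (15.46) is how much higher the predicted male rate is at that same point. The GDP coefficient is the female slope: about 0.2 more suicides per 100k for each 10,000-unit increase in GDP per capita (presumably US dollars; the column name lost its unit on import). The interaction makes the male slope 2.012e-05 - 3.666e-05 = -1.654e-05, so the predicted male rate falls slightly as GDP per capita rises.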
library(ggplot2)
# Residuals vs. fitted values
ggplot(data = multi_reg, aes(fitted(multi_reg), resid(multi_reg))) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "Fitted values", y = "Residuals") +
  theme_minimal()
ggplot(data = multi_reg, aes(sample = resid(multi_reg))) +
stat_qq() +
stat_qq_line() +
theme_minimal()
The residual variance is not constant across the fitted values, and the residuals are not normally distributed. Therefore, the linear model was not appropriate for these data.
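The non-normality can also be read off the residual quantiles in the summary output, without the plots. The sketch below copies those five numbers and computes two rough skewness indicators: the ratio of the upper to the lower tail length, and the quartile (Bowley) skewness. The variable names are my own, and this is only a quick check, not a formal test.

```r
# Residual five-number summary copied from the summary() output above.
res_q <- c(min = -20.514, q1 = -6.823, median = -3.110, q3 = 3.511, max = 204.749)

# Distance from the median to each extreme; equal for a symmetric distribution.
upper_tail <- res_q[["max"]] - res_q[["median"]]
lower_tail <- res_q[["median"]] - res_q[["min"]]
tail_ratio <- upper_tail / lower_tail

# Quartile (Bowley) skewness: 0 for symmetric data, positive for right skew.
bowley <- (res_q[["q3"]] + res_q[["q1"]] - 2 * res_q[["median"]]) /
  (res_q[["q3"]] - res_q[["q1"]])

print(c(tail_ratio = tail_ratio, bowley = bowley))
```

The upper tail is roughly 12 times longer than the lower tail and the quartile skewness is positive, both of which point to a strongly right-skewed residual distribution.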