Using R, build a multiple regression model for data that interests you. Include in this model at least one quadratic term, one dichotomous term, and one dichotomous vs. quantitative interaction term. Interpret all coefficients. Conduct residual analysis. Was the linear model appropriate? Why or why not?

url <- "https://raw.githubusercontent.com/saayedalam/Data/master/master.csv"
data <- read.csv(url)
head(data)
##   ï..country year    sex         age suicides_no population
## 1    Albania 1987   male 15-24 years          21     312900
## 2    Albania 1987   male 35-54 years          16     308000
## 3    Albania 1987 female 15-24 years          14     289700
## 4    Albania 1987   male   75+ years           1      21800
## 5    Albania 1987   male 25-34 years           9     274300
## 6    Albania 1987 female   75+ years           1      35600
##   suicides.100k.pop country.year HDI.for.year gdp_for_year....
## 1              6.71  Albania1987           NA    2,156,624,900
## 2              5.19  Albania1987           NA    2,156,624,900
## 3              4.83  Albania1987           NA    2,156,624,900
## 4              4.59  Albania1987           NA    2,156,624,900
## 5              3.28  Albania1987           NA    2,156,624,900
## 6              2.81  Albania1987           NA    2,156,624,900
##   gdp_per_capita....      generation
## 1                796    Generation X
## 2                796          Silent
## 3                796    Generation X
## 4                796 G.I. Generation
## 5                796         Boomers
## 6                796 G.I. Generation
multi_reg <- lm(suicides.100k.pop ~ sex + (gdp_per_capita....^2) + (sex * gdp_per_capita....), data = data)
multi_reg
## 
## Call:
## lm(formula = suicides.100k.pop ~ sex + (gdp_per_capita....^2) + 
##     (sex * gdp_per_capita....), data = data)
## 
## Coefficients:
##                (Intercept)                     sexmale  
##                  5.053e+00                   1.546e+01  
##         gdp_per_capita....  sexmale:gdp_per_capita....  
##                  2.012e-05                  -3.666e-05
summary(multi_reg)
## 
## Call:
## lm(formula = suicides.100k.pop ~ sex + (gdp_per_capita....^2) + 
##     (sex * gdp_per_capita....), data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -20.514  -6.823  -3.110   3.511 204.749 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 5.053e+00  1.983e-01  25.482  < 2e-16 ***
## sexmale                     1.546e+01  2.805e-01  55.142  < 2e-16 ***
## gdp_per_capita....          2.012e-05  7.832e-06   2.569 0.010192 *  
## sexmale:gdp_per_capita.... -3.666e-05  1.108e-05  -3.310 0.000934 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 17.45 on 27816 degrees of freedom
## Multiple R-squared:  0.1536, Adjusted R-squared:  0.1535 
## F-statistic:  1683 on 3 and 27816 DF,  p-value: < 2.2e-16

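The p-value of the F-statistic is very small (< 2.2e-16), so we reject the null hypothesis that none of the predictors is related to the suicide rate. However, the R-squared value shows that the model explains only about 15% of the variance in suicides per 100k population, so overall this is not a good model. Note also that the coefficient table contains no squared term: inside a formula, ^ is a formula operator rather than arithmetic, so (gdp_per_capita....^2) reduces to the plain linear term.

The intercept (5.05) is the predicted rate for females at a GDP per capita of zero. The sexmale coefficient (15.46) means that, at that baseline, males are predicted to have about 15.5 more suicides per 100k than females. The gdp_per_capita.... coefficient (2.012e-05) is the slope for females: about 0.20 more suicides per 100k for every 10,000-unit increase in GDP per capita. The interaction coefficient (-3.666e-05) is the difference between the male and female slopes, so the slope for males is 2.012e-05 - 3.666e-05 = -1.654e-05, or about 0.17 fewer suicides per 100k for every 10,000-unit increase. Let us do a residual analysis to verify the model assumptions.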
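To get the squared term the assignment asks for, the power has to be wrapped in I(), which tells R to evaluate ^ arithmetically instead of as a formula operator. The following is a minimal sketch on simulated data; the data frame sim and its columns x, g, and y are made up for illustration and are not taken from the suicide dataset.

```r
# Simulated data for illustration only; not the suicide dataset.
set.seed(42)
n <- 200
sim <- data.frame(
  x = runif(n, 0, 10),
  g = factor(sample(c("female", "male"), n, replace = TRUE))
)
sim$y <- 1 + 2 * sim$x - 0.3 * sim$x^2 + 1.5 * (sim$g == "male") + rnorm(n)

# Without I(), ^ is a formula operator, so (x^2) is treated as plain x
fit_plain <- lm(y ~ g + (x^2) + g * x, data = sim)

# With I(), x^2 is computed arithmetically and gets its own coefficient
fit_quad <- lm(y ~ g + x + I(x^2) + g:x, data = sim)

names(coef(fit_plain))
names(coef(fit_quad))
```

The first model has four coefficients, the same structure as the output above, while the second has five, including one for I(x^2). Applying the same idea to the real data would presumably mean writing I(gdp_per_capita....^2) in the formula, which would change every estimate reported here.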

library(ggplot2)
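# Residuals vs. fitted values; an even band around zero would suggest constant variance
resid_df <- data.frame(fitted = fitted(multi_reg), resid = resid(multi_reg))
ggplot(data = resid_df, aes(x = fitted, y = resid)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "Fitted values", y = "Residuals") +
  theme_minimal()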

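# Normal Q-Q plot of the residuals; points far from the line indicate non-normality
ggplot(data = data.frame(resid = resid(multi_reg)), aes(sample = resid)) +
  stat_qq() +
  stat_qq_line() +
  theme_minimal()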

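The residual variance is not constant: the spread of the residuals changes across the fitted values. The residuals are also not normally distributed; the model summary already points to a strong right skew (median -3.11, maximum 204.75), and the Q-Q plot confirms it. Since both the constant-variance and the normality assumptions are violated, the linear model was not appropriate for these data.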
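The visual checks can be backed up with two quick numbers. The sketch below uses simulated data, not the suicide dataset: the variables x and y and the noise model are assumptions chosen so that the residuals are right-skewed and their spread grows with the fitted values. It compares the residual standard deviation in the upper and lower half of the fitted values and computes a simple moment-based skewness.

```r
# Simulated data for illustration only; not the suicide dataset.
set.seed(1)
n <- 500
x <- runif(n, 0, 10)
# Exponential noise: right-skewed, with a spread that increases with x
y <- 2 + 0.5 * x + rexp(n, rate = 1 / (1 + x))

fit <- lm(y ~ x)
res <- resid(fit)

# Constant variance check: ratio of residual SDs, upper half vs. lower half
# of the fitted values (a ratio near 1 would be consistent with constant variance)
lower_half <- fitted(fit) <= median(fitted(fit))
sd_ratio <- sd(res[!lower_half]) / sd(res[lower_half])

# Normality check: moment-based skewness (near 0 for a symmetric distribution)
skewness <- mean((res - mean(res))^3) / sd(res)^3

sd_ratio
skewness
```

A ratio well above 1 and a clearly positive skewness would point to the same two problems seen in the plots. Replacing fit with multi_reg would run the same checks on the model above.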