https://www.kaggle.com/datasets/mikhail1681/adult-mortality-rate-2019-2021
raw_data <- read.csv("https://raw.githubusercontent.com/RonBalaban/CUNY-SPS-R/main/Adult%20mortality%20rate%20(2019-2021).csv")
colnames(raw_data)
## [1] "Countries" "Continent"
## [3] "Average_Pop.thousands.people." "Average_GDP.M.."
## [5] "Average_GDP_per_capita..." "Average_HEXP..."
## [7] "Development_level" "AMR_female.per_1000_female_adults."
## [9] "AMR_male.per_1000_male_adults." "Average_CDR"
# Make linear model of response variable average crude mortality rate by predictor average population
mortality_population.lm <- lm(Average_CDR ~ Average_Pop.thousands.people., data=raw_data)
mortality_population.lm
##
## Call:
## lm(formula = Average_CDR ~ Average_Pop.thousands.people., data = raw_data)
##
## Coefficients:
## (Intercept) Average_Pop.thousands.people.
## 8.161e+00 -4.110e-07
# Get summary of our model
summary(mortality_population.lm)
##
## Call:
## lm(formula = Average_CDR ~ Average_Pop.thousands.people., data = raw_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9896 -1.7593 -0.4998 1.2313 10.2422
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.161e+00 2.449e-01 33.319 <2e-16 ***
## Average_Pop.thousands.people. -4.110e-07 1.472e-06 -0.279 0.781
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.963 on 154 degrees of freedom
## Multiple R-squared: 0.0005056, Adjusted R-squared: -0.005985
## F-statistic: 0.0779 on 1 and 154 DF, p-value: 0.7805
library(ggplot2)
ggplot(mortality_population.lm, aes(x = .fitted, y = .resid)) +
geom_point() +
geom_hline(yintercept = 0, color='red') +
labs(x = "Fitted Values", y = "Residuals")
qqnorm(resid(mortality_population.lm))
qqline(resid(mortality_population.lm))
par(mfrow=c(2,2))
plot(mortality_population.lm)
As we can see, there does not seem to be a strong relationship between average population and average crude mortality rate. The residuals are centered fairly close to zero, with a median of \(-0.4998\), but the minimum \((-6.9896)\) and maximum \((10.2422)\) are not equidistant from zero, which points to a longer right tail. The 1st and 3rd quartiles are closer to symmetric, with values of \(-1.7593\) and \(1.2313\) respectively. The Q-Q plot tells the same story: past roughly the +2 theoretical quantile, the points depart sharply from the reference line, indicating right skew in the residuals.
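The asymmetry described above can be checked with a little arithmetic on the printed five-number summary. The sketch below copies the residual quantiles from the `summary()` output rather than refitting the model. The quartile (Bowley) skewness it computes is an extra measure that the analysis above does not use, and the variable names are illustrative only.

```r
# Residual five-number summary, copied from summary(mortality_population.lm)
res_summary <- c(min = -6.9896, q1 = -1.7593, median = -0.4998,
                 q3 = 1.2313, max = 10.2422)

# Distance from the median to each extreme: a longer upper tail suggests right skew
lower_tail <- res_summary[["median"]] - res_summary[["min"]]
upper_tail <- res_summary[["max"]] - res_summary[["median"]]

# Quartile (Bowley) skewness: 0 for a symmetric middle half, positive for right skew
bowley_skew <- (res_summary[["q3"]] + res_summary[["q1"]] -
                  2 * res_summary[["median"]]) /
  (res_summary[["q3"]] - res_summary[["q1"]])

print(c(lower_tail = lower_tail, upper_tail = upper_tail, bowley_skew = bowley_skew))
```

The upper tail comes out noticeably longer than the lower one, while the quartile skewness is positive but small, which fits the observation that the quartiles look more balanced than the extremes.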
Looking at the test statistic, the slope's t-value is minuscule at \(-0.279\), with \(P(>|t|) = 0.781\), far above any conventional significance level. The \(<2e-16\) p-value with the 3 asterisks (***) belongs to the intercept, not to the population coefficient. The fitted line has an intercept of \(8.161e+00\) and a slope of \(-4.110e-07\), which is not statistically distinguishable from zero.
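To make the slope row of the coefficient table concrete, here is a minimal sketch that rebuilds its t-value and two-sided p-value from the printed estimate and standard error. The inputs are the rounded values from the `summary()` output, so results agree with the table only to a few digits. The per-billion rescaling assumes the predictor is measured in thousands of people, as its column name suggests.

```r
# Slope row of summary(mortality_population.lm), copied from the printed output
estimate  <- -4.110e-07
std_error <- 1.472e-06
df_resid  <- 154   # residual degrees of freedom reported by summary()

# t statistic: the estimate divided by its standard error
t_value <- estimate / std_error

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_value), df = df_resid)

# Rescale the slope: predicted change in CDR per additional billion people
# (1e6 units of "thousands of people"), assuming that unit for the predictor
effect_per_billion <- estimate * 1e6

print(c(t_value = t_value, p_value = p_value, effect_per_billion = effect_per_billion))
```

The t-value comes out at about \(-0.279\) and the p-value at about \(0.78\), matching the slope row rather than the intercept row.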
The R-squared values agree with this. The Multiple R-squared is \(0.0005056\) and the Adjusted R-squared is \(-0.005985\) (with 154 residual degrees of freedom), showing that a linear model with total population as the independent variable accounts for almost none of the variance in the crude mortality rate. Population is therefore not a good predictor of the crude mortality rate.
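The adjusted R-squared and the F-statistic both follow from the multiple R-squared and the sample size. As a rough check, the sketch below recomputes them from the printed values. It assumes 156 observations, inferred from the 154 residual degrees of freedom plus the two estimated coefficients, and it uses rounded inputs, so the results are approximate.

```r
# Values copied from summary(mortality_population.lm)
r_squared <- 0.0005056
df_resid  <- 154
n_pred    <- 1                       # one predictor in the model
n_obs     <- df_resid + n_pred + 1   # assumed: 156 rows used in the fit

# Adjusted R-squared penalises R-squared for the number of predictors
adj_r_squared <- 1 - (1 - r_squared) * (n_obs - 1) / df_resid

# Overall F statistic and its p-value, derived from R-squared
f_stat  <- (r_squared / n_pred) / ((1 - r_squared) / df_resid)
f_pval  <- pf(f_stat, n_pred, df_resid, lower.tail = FALSE)

print(c(adj_r_squared = adj_r_squared, f_stat = f_stat, f_pval = f_pval))
```

A negative adjusted R-squared is possible whenever the multiple R-squared is smaller than the penalty term, which is the case here.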
# Make linear model of average crude mortality rate by average health expenditure (HEXP)
mortality_Hexp.lm <- lm(Average_CDR ~ Average_HEXP..., data=raw_data)
mortality_Hexp.lm
##
## Call:
## lm(formula = Average_CDR ~ Average_HEXP..., data = raw_data)
##
## Coefficients:
## (Intercept) Average_HEXP...
## 7.9801765 0.0001403
# Get summary of our model
summary(mortality_Hexp.lm)
##
## Call:
## lm(formula = Average_CDR ~ Average_HEXP..., data = raw_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.0881 -1.7851 -0.5841 1.1844 10.2985
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.9801765 0.2784704 28.66 <2e-16 ***
## Average_HEXP... 0.0001403 0.0001264 1.11 0.269
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.952 on 154 degrees of freedom
## Multiple R-squared: 0.007933, Adjusted R-squared: 0.001491
## F-statistic: 1.231 on 1 and 154 DF, p-value: 0.2689
ggplot(mortality_Hexp.lm, aes(x = .fitted, y = .resid)) +
geom_point() +
geom_hline(yintercept = 0, color='red') +
labs(x = "Fitted Values", y = "Residuals")
qqnorm(resid(mortality_Hexp.lm))
qqline(resid(mortality_Hexp.lm))
par(mfrow=c(2,2))
plot(mortality_Hexp.lm)
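One way to read the `Average_HEXP...` row of the summary above is through a 95% confidence interval for its slope. The sketch below builds that interval by hand from the printed estimate and standard error. It is an illustration using rounded values, not part of the original analysis. On the fitted object itself, `confint(mortality_Hexp.lm)` should give the same interval at full precision.

```r
# Slope row of summary(mortality_Hexp.lm), copied from the printed output
estimate  <- 0.0001403
std_error <- 0.0001264
df_resid  <- 154

# Critical value of the t distribution for a two-sided 95% interval
t_crit <- qt(0.975, df = df_resid)

# Interval: estimate plus or minus critical value times standard error
ci_95 <- estimate + c(-1, 1) * t_crit * std_error
names(ci_95) <- c("lower", "upper")

print(ci_95)

# TRUE when zero lies inside the interval, i.e. the slope is not
# significant at the 5% level
print(ci_95[["lower"]] < 0 && ci_95[["upper"]] > 0)
```

The interval straddles zero, in line with the reported p-value of 0.269 for this coefficient.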