library(RCurl)
## Loading required package: bitops
library(knitr)
Data: Skin cancer mortality
References: PennState Eberly College Of Science
https://feliperego.github.io/blog/2015/10/23/Interpreting-Model-Output-In-R
NOTES: The data, provided by PennState Eberly College Of Science, are very old. They were compiled in the 1950s, so Alaska and Hawaii were not yet states. Washington, D.C. is included in the data set even though it is not technically a state.
Mort: number of deaths per 10 million people
Lat: Latitude
Long: Longitude
sc.data <- read.csv(text = getURL("https://raw.githubusercontent.com/AjayArora35/compmath/master/AArora_Posting11.csv"), header = TRUE, stringsAsFactors = FALSE)
kable(sc.data[sample(nrow(sc.data), 10), ], align='l', caption = "Sample of 10 rows", row.names=FALSE)
| State | Lat | Mort | Ocean | Long |
|---|---|---|---|---|
| Michigan | 43.5 | 117 | 0 | 84.5 |
| Texas | 31.5 | 229 | 1 | 98.0 |
| Delaware | 39.0 | 200 | 1 | 75.5 |
| Montana | 47.0 | 109 | 0 | 110.5 |
| Iowa | 42.2 | 128 | 0 | 93.8 |
| Georgia | 33.0 | 214 | 1 | 83.5 |
| Washington | 47.5 | 117 | 1 | 121.0 |
| Colorado | 39.0 | 149 | 0 | 105.5 |
| Kentucky | 37.8 | 147 | 0 | 85.0 |
| Utah | 39.5 | 142 | 0 | 111.5 |
Scatterplot (Linear Model):
sc.lm <- lm(Mort ~ Lat, data = sc.data)
plot(sc.data$Lat, sc.data$Mort, xlab = 'Latitude', ylab = 'Mortality (Deaths per 10 million)', main='Skin Cancer Mortality vs. State Latitude')
abline(sc.lm, col="red")
Summary (Linear Model):
summary(sc.lm)
##
## Call:
## lm(formula = Mort ~ Lat, data = sc.data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -38.972 -13.185 0.972 12.006 43.938
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 389.1894 23.8123 16.34 < 2e-16 ***
## Lat -5.9776 0.5984 -9.99 3.31e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19.12 on 47 degrees of freedom
## Multiple R-squared: 0.6798, Adjusted R-squared: 0.673
## F-statistic: 99.8 on 1 and 47 DF, p-value: 3.309e-13
Mortality = 389.1894 - 5.9776 * Lat means that for every one-degree increase in latitude, the predicted mortality falls by about 5.98 deaths per 10 million people.
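As a quick worked check of the fitted equation, the sketch below plugs two latitudes into the reported coefficients. The helper name `predict_mort` is made up for illustration; with the fitted model in hand, `predict(sc.lm, newdata = data.frame(Lat = c(30, 40)))` should give the same values up to rounding of the coefficients.

```r
# Coefficients copied from the summary(sc.lm) output
b0 <- 389.1894   # intercept
b1 <- -5.9776    # slope for Lat

# Hypothetical helper: predicted deaths per 10 million at a given latitude
predict_mort <- function(lat) b0 + b1 * lat

pred <- predict_mort(c(30, 40))
pred         # 209.8614 150.0854
diff(pred)   # -59.776: ten degrees further north is ten times the slope
```

Both latitudes lie inside the range covered by the sample of states, so these are interpolations rather than extrapolations.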
The intuition is that living at the higher latitudes of the northern USA means less exposure to harmful sun rays and, therefore, less risk of death from skin cancer. (PennState Eberly College Of Science)
Correlation: As Latitude increases, mortality decreases.
cor(sc.data$Lat, sc.data$Mort)
## [1] -0.8245178
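Residual standard error: 19.12 on 47 degrees of freedom means that observed mortality typically deviates from the fitted line by about 19 deaths per 10 million. The 47 degrees of freedom are the 49 observations minus the 2 estimated coefficients.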
Multiple R-squared: 0.6798 means that about 68% of the variation in skin cancer mortality across states is explained by latitude.
Adjusted R-squared: 0.673. This value is more informative in models with more than one predictor.
F-statistic: 99.8. This is not particularly useful here because the model has a single predictor, so the F-test is equivalent to the t-test on the Lat coefficient. It is more informative in models with several predictors.
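With a single predictor, the figures in the summary are tied together by simple identities: Multiple R-squared is the squared correlation, Adjusted R-squared follows from it and the sample size, and the F-statistic is the squared t value of the slope. A minimal sketch that checks this using only the numbers printed above (the variable names are illustrative, and n = 49 is inferred from the 47 residual degrees of freedom):

```r
# Values copied from the cor() and summary(sc.lm) output
r     <- -0.8245178   # correlation between Lat and Mort
t_lat <- -9.99        # t value for Lat (rounded in the printout)
n     <- 49           # observations: 47 residual df + 2 coefficients
p     <- 1            # number of predictors

r2     <- r^2                                   # Multiple R-squared
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)  # Adjusted R-squared
f_stat <- t_lat^2                               # F-statistic with one predictor

# about 0.6798, 0.6730 and 99.8001, in line with the summary output
round(c(r2 = r2, adj_r2 = adj_r2, f_stat = f_stat), 4)
```

The small gap between 99.8001 and the reported 99.8 comes from the t value being rounded in the printout.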
Residuals are the differences between the observed response values (Mortality) and the response values that the model predicted. The Residuals section of the model output summarizes them in 5 points: minimum, first quartile, median, third quartile, and maximum. When assessing how well the model fits the data, look for a distribution that is roughly symmetrical around zero (0). Here the median (0.972) is close to zero, and the quartiles and extremes are of similar magnitude on either side.
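The five summary points and the residual standard error can both be recomputed directly from the residuals. The sketch below uses R's built-in `cars` data set purely as a stand-in (an assumption made so the snippet runs without downloading the skin cancer file); the same lines should apply unchanged to `sc.lm`.

```r
# Stand-in model on a built-in data set (NOT the skin cancer data)
fit <- lm(dist ~ speed, data = cars)
res <- resid(fit)                 # observed minus fitted values

five <- quantile(res)             # Min, 1Q, Median, 3Q, Max as in summary(fit)
five
mean(res)                         # essentially zero when the model has an intercept

# Residual standard error: sqrt(residual sum of squares / residual df)
rse <- sqrt(sum(res^2) / df.residual(fit))
all.equal(rse, summary(fit)$sigma)   # TRUE
```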
plot(fitted(sc.lm), resid(sc.lm), xlab = 'Fitted values', ylab = 'Residuals')
abline(h = 0)  # horizontal reference line at zero
The residuals-versus-fitted plot shows the points scattered close to evenly about the zero line.
hist(sc.lm$residuals)
Furthermore, the histogram of the residuals is roughly symmetric and centered near zero, in line with the quartiles in the model summary.
qqnorm(sc.lm$residuals)
qqline(sc.lm$residuals)
Lastly, the Q-Q plot shows that the residuals stay very close to the straight line, a good indicator that the residuals are approximately normally distributed.
Conclusion:
NULL Hypothesis: Latitude has no linear relationship with skin cancer mortality (the slope is zero)
Alternate Hypothesis: Latitude has a linear relationship with skin cancer mortality (the slope is not zero)
The P-Value is 3.31e-13, which is smaller than 0.05, so we can reject the NULL hypothesis. There is sufficient statistical evidence to conclude that there is a significant linear relationship between latitude and skin cancer mortality. However, it should be noted that other variables in the data set, such as Ocean and Long, could provide insight as well.
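The same conclusion can be read from a 95% confidence interval for the slope, which is what `confint(sc.lm)` would report for the fitted model. A minimal sketch that rebuilds it from the printed estimate and standard error (the variable names are illustrative):

```r
# Values copied from the Coefficients table of summary(sc.lm)
est <- -5.9776   # slope estimate for Lat
se  <- 0.5984    # its standard error
df  <- 47        # residual degrees of freedom

t_crit <- qt(0.975, df)                 # two-sided 95% critical value
ci <- est + c(-1, 1) * t_crit * se      # lower and upper bound
round(ci, 2)                            # about -7.18 and -4.77

# Two-sided p-value for the slope, recomputed from the t statistic
p_val <- 2 * pt(-abs(est / se), df)
p_val < 0.05                            # TRUE
```

The interval lies entirely below zero, which agrees with rejecting the NULL hypothesis of a zero slope.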