library(RCurl)
## Loading required package: bitops
library(knitr)

Data: Skin cancer mortality

References: PennState Eberly College Of Science

        https://feliperego.github.io/blog/2015/10/23/Interpreting-Model-Output-In-R

NOTES: The data are old and were provided by the PennState Eberly College Of Science. They were compiled in the 1950s, so Alaska and Hawaii were not yet states. Washington, D.C. is included in the data set even though it is not technically a state.

Mort: number of deaths per 10 million people

Lat: Latitude

Long: Longitude

sc.data <- read.csv(text = getURL("https://raw.githubusercontent.com/AjayArora35/compmath/master/AArora_Posting11.csv"), header = TRUE, stringsAsFactors = FALSE)
kable(sc.data[sample(nrow(sc.data), 10), ], align='l', caption = "Sample of 10 rows", row.names=FALSE)
Sample of 10 rows
State       Lat   Mort  Ocean  Long
Michigan    43.5  117   0      84.5
Texas       31.5  229   1      98.0
Delaware    39.0  200   1      75.5
Montana     47.0  109   0      110.5
Iowa        42.2  128   0      93.8
Georgia     33.0  214   1      83.5
Washington  47.5  117   1      121.0
Colorado    39.0  149   0      105.5
Kentucky    37.8  147   0      85.0
Utah        39.5  142   0      111.5

Scatterplot (Linear Model):

sc.lm <- lm(Mort ~ Lat, data = sc.data)
plot(sc.data$Lat, sc.data$Mort, xlab = 'Latitude', ylab = 'Mortality (Deaths per 10 million)', main='Skin Cancer Mortality vs. State Latitude')
abline(sc.lm, col="red")

Summary (Linear Model):

summary(sc.lm)
## 
## Call:
## lm(formula = Mort ~ Lat, data = sc.data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -38.972 -13.185   0.972  12.006  43.938 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 389.1894    23.8123   16.34  < 2e-16 ***
## Lat          -5.9776     0.5984   -9.99 3.31e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 19.12 on 47 degrees of freedom
## Multiple R-squared:  0.6798, Adjusted R-squared:  0.673 
## F-statistic:  99.8 on 1 and 47 DF,  p-value: 3.309e-13

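Mortality = 389.1894 - 5.9776x, where x is latitude in degrees, means that each one-degree increase in latitude is associated with about 5.98 fewer deaths per 10 million people. The intercept, 389.1894, is the predicted mortality at latitude 0, which lies well outside the latitudes of the contiguous United States, so it should not be interpreted on its own.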
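To make the fitted equation concrete, here is a minimal sketch that plugs latitudes into it. The coefficients are copied (rounded) from the summary output above, and `predict_mort` is a hypothetical helper introduced only for illustration; it is not part of the original analysis.

```r
# Coefficients copied (rounded) from the summary() output above
b0 <- 389.1894   # intercept
b1 <- -5.9776    # slope for Lat

# Hypothetical helper: predicted mortality (deaths per 10 million) at a latitude
predict_mort <- function(lat) b0 + b1 * lat

predict_mort(40)                      # 150.0854
predict_mort(41) - predict_mort(40)   # -5.9776, i.e. the slope
```

With the fitted model in memory, `predict(sc.lm, newdata = data.frame(Lat = 40))` should give essentially the same value, differing only through rounding of the coefficients.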

The intuition is that living at the higher latitudes of the northern USA means less exposure to harmful sun rays and, therefore, less risk of death from skin cancer. (PennState Eberly College Of Science)

Correlation: As latitude increases, mortality decreases, so the correlation coefficient is negative.

cor(sc.data$Lat, sc.data$Mort)
## [1] -0.8245178

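Residual standard error: 19.12 on 47 degrees of freedom. Observed mortality typically deviates from the fitted line by about 19 deaths per 10 million. The 47 degrees of freedom are the 49 observations minus the 2 estimated coefficients.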

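Multiple R-squared: 0.6798 means that about 68% of the variation in skin cancer mortality is explained by latitude. This describes how well the line fits; it does not by itself show that latitude causes the difference.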
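In a regression with a single predictor, R-squared equals the square of the correlation coefficient. A quick check, using the correlation value printed by `cor()` above:

```r
# Correlation between Lat and Mort, copied from the cor() output above
r <- -0.8245178

# For simple linear regression, R-squared is the squared correlation
r_squared <- r^2
round(r_squared, 4)   # 0.6798, matching Multiple R-squared in the summary
```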

Adjusted R-squared: 0.673. This value penalizes R-squared for the number of predictors, so it is more interesting in models with more than one predictor.
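As a sketch of where the adjusted value comes from, the standard formula 1 - (1 - R²)(n - 1)/(n - p - 1) can be applied to the numbers in the summary. Here n = 49 is inferred from the 47 residual degrees of freedom plus the 2 estimated coefficients.

```r
r_squared <- 0.6798   # Multiple R-squared from the summary
n <- 49               # observations: 47 residual df + 2 coefficients
p <- 1                # number of predictors (Lat)

# Adjusted R-squared shrinks R-squared according to the number of predictors
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / (n - p - 1)
round(adj_r_squared, 3)   # 0.673
```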

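F-statistic: 99.8. This is not particularly useful here because the model has a single predictor, so the F-test is equivalent to the t-test on the Lat coefficient and gives the same p-value. It is more interesting in models with several predictors.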
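The link between the F-statistic and the t value for Lat can be checked directly. This sketch uses the rounded t value from the summary, so the results agree with the printed output only up to rounding.

```r
t_lat    <- -9.99   # t value for Lat, from the coefficient table
df_resid <- 47      # residual degrees of freedom

# With one predictor, the F-statistic is the square of the slope's t value
f_stat <- t_lat^2   # 99.8001

# Two-sided p-value from the t distribution and upper-tail p-value from F(1, 47)
p_from_t <- 2 * pt(-abs(t_lat), df_resid)
p_from_f <- pf(f_stat, 1, df_resid, lower.tail = FALSE)

all.equal(p_from_t, p_from_f)   # the two tests give the same p-value
```

Both p-values should be on the order of 1e-13, in line with the value reported in the summary.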

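Residuals are the differences between the observed response values (Mortality) and the response values that the model predicted. The Residuals section of the model output summarizes them with 5 points: minimum, first quartile, median, third quartile, and maximum. When assessing how well the model fits the data, look for a distribution that is roughly symmetrical about zero: a median close to 0, and first and third quartiles (and minimum and maximum) of similar magnitude. Here the median is 0.972, the quartiles are -13.185 and 12.006, and the extremes are -38.972 and 43.938, which is reasonably symmetric.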
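To illustrate what a residual is, the sketch below computes fitted values and residuals by hand for three of the sampled rows shown earlier. It uses the rounded coefficients from the summary, so the results can differ from `resid(sc.lm)` in the last decimal place.

```r
# Coefficients copied (rounded) from the summary() output
b0 <- 389.1894
b1 <- -5.9776

# Three rows taken from the sample table above
obs <- data.frame(
  State = c("Michigan", "Texas", "Delaware"),
  Lat   = c(43.5, 31.5, 39.0),
  Mort  = c(117, 229, 200)
)

# Residual = observed mortality minus the mortality predicted by the line
obs$Fitted   <- b0 + b1 * obs$Lat
obs$Residual <- obs$Mort - obs$Fitted
obs
```

Michigan falls below the line (negative residual), while Texas and Delaware fall above it. Delaware's residual of about 43.9 is close to the Max value in the Residuals summary.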

plot(fitted(sc.lm),resid(sc.lm))
abline(0,0)

The residuals-versus-fitted plot shows the points scattered roughly evenly about the zero line.

hist(sc.lm$residuals)

Furthermore, the histogram shows the residuals spread roughly symmetrically about zero.

qqnorm(sc.lm$residuals)
qqline(sc.lm$residuals)

Lastly, the Q-Q plot shows that the residuals stay very close to the straight line, a good indicator that the residuals are approximately normally distributed.

Conclusion:

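NULL Hypothesis: Latitude has no linear relationship with skin cancer mortality (the slope for Lat is zero)

Alternate Hypothesis: Latitude has a linear relationship with skin cancer mortality (the slope for Lat is not zero)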

The p-value for Lat is 3.31e-13, which is smaller than 0.05, so we can reject the NULL hypothesis. There is sufficient statistical evidence to conclude that there is a significant linear relationship between latitude and skin cancer mortality. However, other variables could provide insight as well:

  1. Length of exposure to sunshine and frequency (Different seasons)
  2. Individual genetic propensity to skin ailments
  3. Skin color
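Returning to the hypothesis test above, a 95% confidence interval for the Lat slope gives the same conclusion in the units of the data. This is a minimal sketch built from the rounded estimate and standard error in the summary; with the fitted model in memory, `confint(sc.lm)` would be the usual way to obtain it.

```r
est      <- -5.9776   # slope estimate for Lat
se       <- 0.5984    # its standard error
df_resid <- 47        # residual degrees of freedom

# 95% interval: estimate plus or minus the t critical value times the standard error
t_crit <- qt(0.975, df_resid)
ci <- est + c(-1, 1) * t_crit * se
round(ci, 2)   # -7.18 -4.77
```

The interval lies entirely below zero, which agrees with rejecting the NULL hypothesis of a zero slope at the 0.05 level.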