Some analysis of US Prostate Cancer Rates and Soil Minerals prompted by this Google+ post and comments linking to this blog post.
My initial reason for investigating this is the seeming inconsistency of the visual correlation for soil selenium and prostate cancer with the research literature indicating a connection between low body selenium levels and prostate cancer.
This is of special interest because of the Feb. 2014 media coverage emphasizing the negative consequences of selenium. See http://www.cancer.gov/newscenter/qa/2008/selectqa
This media flurry was a reaction to the publication of Baseline Selenium Status and Effects of Selenium and Vitamin E Supplementation on Prostate Cancer Risk
Clark LC, Combs GF Jr, Turnbull BW, et al; Nutritional Prevention of Cancer Study Group. Effects of selenium supplementation for cancer prevention in patients with carcinoma of the skin: a randomized controlled trial JAMA. 1996;276(24):1957-1963 PDF
Yoshizawa K, Willett WC, Morris SJ, Stampfer MJ, Spiegelman D, Rimm EB, et al. Study of prediagnostic selenium level in toenails and the risk of advanced prostate cancer. J Natl Cancer Inst. 1998; 90(16):12191224. [PubMed: 9719083] (referred to as “Epi study”)
Duffield-Lillico AJ, Dalkin BL, Reid ME, Turnbull BW, Slate EH, Jacobs ET, et al. Selenium supplementation, baseline plasma selenium status and incidence of prostate cancer: an analysis of the complete treatment period of the Nutritional Prevention of Cancer Trial. BJU Int. 2003; 91(7):608612. PubMed: 12699469 PDF (secondary analysis of 1996 Clark “NPC study”)
Prostate cancer data from http://ratecalc.cancer.gov/ratecalc/ (raw data by county downloadable from maps)
Soil minerals data from http://mrdata.usgs.gov/geochem/doc/averages/countydata.htm
County code data from http://www.census.gov/geo/reference/codes/cou.html
fips_codes <- read.csv("~/Health/Cancer Mortality Maps/fips_codes_website.csv", stringsAsFactors=FALSE)
counties <- read.csv("~/Health/Cancer Mortality Maps/national_county.csv", stringsAsFactors=FALSE)
counties$FIPS <- counties$State.ANSI * 1000 + counties$County.ANSI
PCRate <- read.csv("~/Health/Cancer Mortality Maps/ProstateCancerRateCalcResults.csv", stringsAsFactors=FALSE)
PCRateUS <- PCRate[1,]
PCRateCounty <- PCRate[-1,]
PCRateCounty$FIPS <- PCRateCounty$Geographic.code
# Note removed second row of CSV (mostly units) and changed sigma to sd in names
minerals <- read.csv("~/Health/Cancer Mortality Maps/county-averages.csv", stringsAsFactors=FALSE)
# Combine data into a single data frame
data <- merge(PCRateCounty, minerals, by="FIPS")
First take a look at the data.
Mineral correlations with mortality rate from prostate cancer (and each other).
dataCor <- cor(data[,c("Mortality.rate", "As", "Se", "Hg", "Pb", "Zn", "Cu", "Al",
"Na", "Mg", "P", "Ti", "Ca", "Mn", "Fe")],
use="pairwise.complete.obs")
corrplot(dataCor)
Numerically the correlation of soil selenium and prostate cancer is 0.07
This seems neglible to me.
Also note that the correlation is positive while the blog post posits a negative (inverse) visual correlation.
Distributions for selenium and mortality rate and their scatter plot.
summary(data$Se)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.100 0.187 0.276 0.339 0.412 5.320 24
Boxplot(data$Se, main="Soil Selenium (ppm)",
labels=paste0(data$County, " county, ", data$State))
## [1] "Boyd county, Nebraska" "Pueblo county, Colorado"
## [3] "Barbour county, West Virginia" "Delta county, Colorado"
## [5] "Stanley county, South Dakota" "Redwood county, Minnesota"
## [7] "Clay county, South Dakota" "Rooks county, Kansas"
## [9] "Knox county, Nebraska" "Murray county, Minnesota"
summary(data$Mortality.rate)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 28.2 31.3 31.7 34.4 228.0
Boxplot(data$Mortality.rate, main="Prostate Cancer Mortality Rates",
labels=paste0(data$County, " county, ", data$State))
## [1] "Jackson county, Colorado" "McPherson county, Nebraska"
## [3] "Shannon county, South Dakota" "Hinsdale county, Colorado"
## [5] "Wayne county, Utah" "Issaquena county, Mississippi"
## [7] "Arthur county, Nebraska" "Jeff Davis county, Texas"
## [9] "Slope county, North Dakota" "Summit county, Colorado"
## [11] "Loving county, Texas" "Mineral county, Colorado"
## [13] "San Juan county, Colorado" "Buffalo county, South Dakota"
## [15] "Alpine county, California" "King county, Texas"
## [17] "Grant county, Nebraska" "Kenedy county, Texas"
## [19] "Niobrara county, Wyoming" "Sublette county, Wyoming"
scatterplot(Mortality.rate ~ Se, data=data,
xlab="Soil Selenium (ppm)", ylab="Prostate Cancer Mortality Rate",
labels=paste0(data$County, " county, ", data$State),
id.method=list("x", "y"), id.n=5)
## Alpine county, California Delta county, Colorado
## 158 229
## Mineral county, Colorado Pueblo county, Colorado
## 254 265
## San Juan county, Colorado Boyd county, Nebraska
## 270 1623
## Buffalo county, South Dakota Stanley county, South Dakota
## 2323 2373
## King county, Texas Barbour county, West Virginia
## 2605 2883
Exclude the outliers to get a better look at the rest of the scatter plot.
scatterplot(Mortality.rate ~ Se, data=data, xlim=c(0,2), ylim=c(0,80),
xlab="Soil Selenium (ppm)", ylab="Prostate Cancer Mortality Rate",
labels=paste0(data$County, " county, ", data$State),
id.method=list("x", "y"), id.n=5)
## Alpine county, California Delta county, Colorado
## 158 229
## Mineral county, Colorado Pueblo county, Colorado
## 254 265
## San Juan county, Colorado Boyd county, Nebraska
## 270 1623
## Buffalo county, South Dakota Stanley county, South Dakota
## 2323 2373
## King county, Texas Barbour county, West Virginia
## 2605 2883
Lastly, look at the regression line for mortality rate on soil selenium.
mod <- lm(Mortality.rate ~ Se, data=data)
summary(mod)
##
## Call:
## lm(formula = Mortality.rate ~ Se, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -31.79 -3.43 -0.24 2.80 59.02
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 31.049 0.182 170.35 < 2e-16 ***
## Se 1.646 0.429 3.84 0.00013 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.06 on 3027 degrees of freedom
## (24 observations deleted due to missingness)
## Multiple R-squared: 0.00484, Adjusted R-squared: 0.00452
## F-statistic: 14.7 on 1 and 3027 DF, p-value: 0.000126
We do get a statistically significant coefficent for Se, but the effect size is small and the R^2 is tiny (0.005, i.e. Se explains very little of the variance in mortality rate).
These are provided to check if we see any visual correlation as described in this blog post.
Notice there is some mismatch in county codes between the data and the map package (e.g. no data for Alaska).
df = data.frame(region=data$FIPS, value=data$Se)
choroplethr(df, lod="county", title="Soil Selenium (ppm)")
## [1] "The following counties were missing and are being set to NA: 34017, 48423, 02050, 48499, 36081, 02105, 02122, 02150, 02164, 48271, 48295, 48301, 48323, 51810, 51840, 51600, 51013, 51019, 51031, 02180, 51053, 02188, 02240, 02090, 02198, 40093, 15005, 02100, 02170, 48119, 48183, 48343, 51179, 51515, 51590, 51620, 51680, 51720, 51740, 51830, 12086, 02060, 48223, 51520, 51580, 51610, 51640, 51678, 51683, 51685, 51710, 51730, 51750, 51790, 55083, 02290, 36005, 36047, 36085, 25019, 02282, 15003, 26083, 51077, 51149, 51199, 51570, 51630, 51650, 51700, 51735, 51770, 51690, 51087, 22101, 08014, 35006, 40153, 48357, 48379, 48449, 48459, 48063, 48159, 48195, 51155, 51510, 51530, 51540, 51595, 51660, 51670, 51760, 51775, 51820, 55115, 55007, 02070, 02110, 02130, 02185, 02195, 02220, 02230, 02020, 02068, 02013, 02261, 02270, 02275, 15001, 15007, 04012, 15009"
df = data.frame(region=data$FIPS, value=data$Mortality.rate)
choroplethr(df, lod="county", title="Prostate Cancer Mortality Rates")
## [1] "The following counties were missing and are being set to NA: 02050, 36081, 02105, 02122, 02150, 02164, 51810, 51840, 51600, 51013, 51019, 51031, 02180, 51053, 02188, 02240, 02090, 02198, 15005, 02100, 02170, 51179, 51515, 51590, 51620, 51680, 51720, 51740, 51830, 12086, 02060, 51520, 51580, 51610, 51640, 51678, 51683, 51685, 51710, 51730, 51750, 51790, 55083, 02290, 36005, 36047, 36085, 02282, 15003, 51077, 51149, 51199, 51570, 51630, 51650, 51700, 51735, 51770, 51690, 51087, 08014, 35006, 51155, 51510, 51530, 51540, 51595, 51660, 51670, 51760, 51775, 51820, 55115, 02070, 02110, 02130, 02185, 02195, 02220, 02230, 02020, 02068, 02013, 02261, 02270, 02275, 15001, 15007, 04012, 15009"
I’m not sure if there is a visual correlation or not, but given the numerical correlation above I think any visual correlation is spurious–probably caused either by some bias in the numbers by county visual size and/or the tendency of humans to see patterns even when they aren’t really present.
Lastly, to be clear, I agree with the information in What You Should Know about Selenium also by Kas Thomas. I just don’t think the soil Se data supports the conclusions drawn.