Some analysis of US Prostate Cancer Rates and Soil Minerals prompted by this Google+ post and comments linking to this blog post.

My initial reason for investigating this is the seeming inconsistency of the visual correlation for soil selenium and prostate cancer with the research literature indicating a connection between low body selenium levels and prostate cancer.

This is of special interest because of the Feb. 2014 media coverage emphasizing the negative consequences of selenium. See http://www.cancer.gov/newscenter/qa/2008/selectqa
This media flurry was a reaction to the publication of Baseline Selenium Status and Effects of Selenium and Vitamin E Supplementation on Prostate Cancer Risk

Research Papers

Clark LC, Combs GF Jr, Turnbull BW, et al; Nutritional Prevention of Cancer Study Group. Effects of selenium supplementation for cancer prevention in patients with carcinoma of the skin: a randomized controlled trial JAMA. 1996;276(24):1957-1963 PDF

Yoshizawa K, Willett WC, Morris SJ, Stampfer MJ, Spiegelman D, Rimm EB, et al. Study of prediagnostic selenium level in toenails and the risk of advanced prostate cancer. J Natl Cancer Inst. 1998; 90(16):1219–1224. [PubMed: 9719083] (referred to as “Epi study”)

Duffield-Lillico AJ, Dalkin BL, Reid ME, Turnbull BW, Slate EH, Jacobs ET, et al. Selenium supplementation, baseline plasma selenium status and incidence of prostate cancer: an analysis of the complete treatment period of the Nutritional Prevention of Cancer Trial. BJU Int. 2003; 91(7):608–612. PubMed: 12699469 PDF (secondary analysis of 1996 Clark “NPC study”)

Data Sources

Prostate cancer data from http://ratecalc.cancer.gov/ratecalc/ (raw data by county downloadable from maps)
Soil minerals data from http://mrdata.usgs.gov/geochem/doc/averages/countydata.htm
County code data from http://www.census.gov/geo/reference/codes/cou.html

fips_codes <- read.csv("~/Health/Cancer Mortality Maps/fips_codes_website.csv", stringsAsFactors=FALSE)

counties <- read.csv("~/Health/Cancer Mortality Maps/national_county.csv", stringsAsFactors=FALSE)
counties$FIPS <- counties$State.ANSI * 1000 + counties$County.ANSI

PCRate <- read.csv("~/Health/Cancer Mortality Maps/ProstateCancerRateCalcResults.csv", stringsAsFactors=FALSE)
PCRateUS <- PCRate[1,]
PCRateCounty <- PCRate[-1,]
PCRateCounty$FIPS <- PCRateCounty$Geographic.code

# Note removed second row of CSV (mostly units) and changed sigma to sd in names
minerals <- read.csv("~/Health/Cancer Mortality Maps/county-averages.csv", stringsAsFactors=FALSE)

# Combine data into a single data frame
data <- merge(PCRateCounty, minerals, by="FIPS")

Exploratory Data Analysis

First take a look at the data.

Mineral correlations with mortality rate from prostate cancer (and each other).

dataCor <- cor(data[,c("Mortality.rate", "As", "Se", "Hg", "Pb", "Zn", "Cu", "Al",
                       "Na", "Mg", "P", "Ti", "Ca", "Mn", "Fe")],
               use="pairwise.complete.obs")
corrplot(dataCor)

plot of chunk unnamed-chunk-3

Numerically the correlation of soil selenium and prostate cancer is 0.07
This seems neglible to me.

Also note that the correlation is positive while the blog post posits a negative (inverse) visual correlation.

Distributions for selenium and mortality rate and their scatter plot.

summary(data$Se)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.100   0.187   0.276   0.339   0.412   5.320      24
Boxplot(data$Se, main="Soil Selenium (ppm)",
        labels=paste0(data$County, " county, ", data$State))

plot of chunk unnamed-chunk-4

##  [1] "Boyd county, Nebraska"         "Pueblo county, Colorado"      
##  [3] "Barbour county, West Virginia" "Delta county, Colorado"       
##  [5] "Stanley county, South Dakota"  "Redwood county, Minnesota"    
##  [7] "Clay county, South Dakota"     "Rooks county, Kansas"         
##  [9] "Knox county, Nebraska"         "Murray county, Minnesota"
summary(data$Mortality.rate)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0    28.2    31.3    31.7    34.4   228.0
Boxplot(data$Mortality.rate, main="Prostate Cancer Mortality Rates",
        labels=paste0(data$County, " county, ", data$State))

plot of chunk unnamed-chunk-4

##  [1] "Jackson county, Colorado"      "McPherson county, Nebraska"   
##  [3] "Shannon county, South Dakota"  "Hinsdale county, Colorado"    
##  [5] "Wayne county, Utah"            "Issaquena county, Mississippi"
##  [7] "Arthur county, Nebraska"       "Jeff Davis county, Texas"     
##  [9] "Slope county, North Dakota"    "Summit county, Colorado"      
## [11] "Loving county, Texas"          "Mineral county, Colorado"     
## [13] "San Juan county, Colorado"     "Buffalo county, South Dakota" 
## [15] "Alpine county, California"     "King county, Texas"           
## [17] "Grant county, Nebraska"        "Kenedy county, Texas"         
## [19] "Niobrara county, Wyoming"      "Sublette county, Wyoming"
scatterplot(Mortality.rate ~ Se, data=data,
            xlab="Soil Selenium (ppm)", ylab="Prostate Cancer Mortality Rate",
            labels=paste0(data$County, " county, ", data$State),
            id.method=list("x", "y"), id.n=5)

plot of chunk unnamed-chunk-5

##     Alpine county, California        Delta county, Colorado 
##                           158                           229 
##      Mineral county, Colorado       Pueblo county, Colorado 
##                           254                           265 
##     San Juan county, Colorado         Boyd county, Nebraska 
##                           270                          1623 
##  Buffalo county, South Dakota  Stanley county, South Dakota 
##                          2323                          2373 
##            King county, Texas Barbour county, West Virginia 
##                          2605                          2883

Exclude the outliers to get a better look at the rest of the scatter plot.

scatterplot(Mortality.rate ~ Se, data=data, xlim=c(0,2), ylim=c(0,80),
            xlab="Soil Selenium (ppm)", ylab="Prostate Cancer Mortality Rate",
            labels=paste0(data$County, " county, ", data$State),
            id.method=list("x", "y"), id.n=5)

plot of chunk unnamed-chunk-6

##     Alpine county, California        Delta county, Colorado 
##                           158                           229 
##      Mineral county, Colorado       Pueblo county, Colorado 
##                           254                           265 
##     San Juan county, Colorado         Boyd county, Nebraska 
##                           270                          1623 
##  Buffalo county, South Dakota  Stanley county, South Dakota 
##                          2323                          2373 
##            King county, Texas Barbour county, West Virginia 
##                          2605                          2883

Lastly, look at the regression line for mortality rate on soil selenium.

mod <- lm(Mortality.rate ~ Se, data=data)
summary(mod)
## 
## Call:
## lm(formula = Mortality.rate ~ Se, data = data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -31.79  -3.43  -0.24   2.80  59.02 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   31.049      0.182  170.35  < 2e-16 ***
## Se             1.646      0.429    3.84  0.00013 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.06 on 3027 degrees of freedom
##   (24 observations deleted due to missingness)
## Multiple R-squared:  0.00484,    Adjusted R-squared:  0.00452 
## F-statistic: 14.7 on 1 and 3027 DF,  p-value: 0.000126

We do get a statistically significant coefficent for Se, but the effect size is small and the R^2 is tiny (0.005, i.e. Se explains very little of the variance in mortality rate).

Choropleth Maps

These are provided to check if we see any visual correlation as described in this blog post.

Notice there is some mismatch in county codes between the data and the map package (e.g. no data for Alaska).

df = data.frame(region=data$FIPS, value=data$Se)
choroplethr(df, lod="county", title="Soil Selenium (ppm)")
## [1] "The following counties were missing and are being set to NA: 34017, 48423, 02050, 48499, 36081, 02105, 02122, 02150, 02164, 48271, 48295, 48301, 48323, 51810, 51840, 51600, 51013, 51019, 51031, 02180, 51053, 02188, 02240, 02090, 02198, 40093, 15005, 02100, 02170, 48119, 48183, 48343, 51179, 51515, 51590, 51620, 51680, 51720, 51740, 51830, 12086, 02060, 48223, 51520, 51580, 51610, 51640, 51678, 51683, 51685, 51710, 51730, 51750, 51790, 55083, 02290, 36005, 36047, 36085, 25019, 02282, 15003, 26083, 51077, 51149, 51199, 51570, 51630, 51650, 51700, 51735, 51770, 51690, 51087, 22101, 08014, 35006, 40153, 48357, 48379, 48449, 48459, 48063, 48159, 48195, 51155, 51510, 51530, 51540, 51595, 51660, 51670, 51760, 51775, 51820, 55115, 55007, 02070, 02110, 02130, 02185, 02195, 02220, 02230, 02020, 02068, 02013, 02261, 02270, 02275, 15001, 15007, 04012, 15009"

plot of chunk unnamed-chunk-8

df = data.frame(region=data$FIPS, value=data$Mortality.rate)
choroplethr(df, lod="county", title="Prostate Cancer Mortality Rates")
## [1] "The following counties were missing and are being set to NA: 02050, 36081, 02105, 02122, 02150, 02164, 51810, 51840, 51600, 51013, 51019, 51031, 02180, 51053, 02188, 02240, 02090, 02198, 15005, 02100, 02170, 51179, 51515, 51590, 51620, 51680, 51720, 51740, 51830, 12086, 02060, 51520, 51580, 51610, 51640, 51678, 51683, 51685, 51710, 51730, 51750, 51790, 55083, 02290, 36005, 36047, 36085, 02282, 15003, 51077, 51149, 51199, 51570, 51630, 51650, 51700, 51735, 51770, 51690, 51087, 08014, 35006, 51155, 51510, 51530, 51540, 51595, 51660, 51670, 51760, 51775, 51820, 55115, 02070, 02110, 02130, 02185, 02195, 02220, 02230, 02020, 02068, 02013, 02261, 02270, 02275, 15001, 15007, 04012, 15009"

plot of chunk unnamed-chunk-8

I’m not sure if there is a visual correlation or not, but given the numerical correlation above I think any visual correlation is spurious–probably caused either by some bias in the numbers by county visual size and/or the tendency of humans to see patterns even when they aren’t really present.

Lastly, to be clear, I agree with the information in What You Should Know about Selenium also by Kas Thomas. I just don’t think the soil Se data supports the conclusions drawn.