The data I’ll be using comes from the Radiation Exposure Monitoring System (REMS) query tool provided by the Department of Energy (DOE). It provides annual aggregations of radiation exposure levels across all DOE sites and different types of workers. Reducing radiological exposure is a goal for the Department of Energy, and being able to identify sites that may have a higher risk of exposure (based on previous years’ data and trends) could be worthwhile. In our case, we’ll be attempting to predict the value of Average.Meas..TED..mrem., which measures the average total effective dose (in millirem, mrem) across workers at a site in the reporting period. Other dose measurements, such as the skin/whole-body equivalent dose, as well as photon and neutron dosimetry measurements, are provided for most sites.

This dataset is a bit of a lagging indicator, as most sites report to the REMS system on a yearly basis. However, more up-to-date exposure records are not publicly available, so using the REMS query tool as a proxy for exposure levels across the department will be helpful. It should also be noted that the function and operations of each constituent site are very different, so longitudinal comparisons should be made across the timeframe, not across the dimension of site/reporting organization. A panel regression model could be used in that case for a more “apples to apples” comparison, but that is beyond the scope of this course. The ability to predict areas within the department with abnormally higher-than-usual exposure would be valuable for better allocating resources related to occupational radiological protection.

# Read in REMS data
dat <- read.csv("/Users/andrewbowen/CUNY/machineLearningBigData622/data/rems-data.csv", check.names = TRUE)

Data Wrangling

We have some categorical columns that we’ll want to convert to factor types.

# Convert categorical features to factor type
dat$Site <- as.factor(dat$Site)
dat$`Program.Office` <- as.factor(dat$`Program.Office`)
dat$`Operations.Office` <- as.factor(dat$`Operations.Office`)
dat$`Reporting.Organization` <- as.factor(dat$`Reporting.Organization`)
dat$`Facility.Type` <- as.factor(dat$`Facility.Type`)
# dat$`Monitoring.Year` <- as.integer(dat$`Monitoring.Year`)
# Rename our most commonly used columns
renames <- c(avg_ted = "Average.Meas..TED..mrem.",
             avg_ced = "Average.Meas..CED..mrem.",
             avg_photon_ed = "Average.Meas..ED.Photon..mrem.",
             avg_neutron_ed = "Average.Meas..ED.Neutron..mrem.",
             avg_eq_skwb = "Average.Meas..EqD.SkWB..mrem.",
             site = "Site",
             program_office = "Program.Office",
             facility_type = "Facility.Type",
             year = "Monitoring.Year",
             labor_category = "Labor.Category",
             occupation = "Occupation",
             operations_office = "Operations.Office",
             reporting_organization = "Reporting.Organization",
             monitoring_status = "Monitoring.Status")
dat <- dat %>% rename(any_of(renames))
# Only select a few columns within our dataset
df <- dat %>% dplyr::select(c("year", "avg_ted", 
                              "avg_ced", "avg_photon_ed", 
                              "avg_neutron_ed", "site", 
                              "program_office", "avg_eq_skwb", 
                              "facility_type"))


# Filter to values from this millennium (for faster computation; dose definitions also changed over the full period)
df$year <- as.integer(df$year)
df <- df %>% dplyr::filter(year > 2000)
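
As an aside, the factor conversions at the top of this block could be written more compactly with dplyr’s across() helper; a minimal sketch, assuming the raw columns are read in as character vectors:

# Alternative: convert every character column to a factor in one step
dat <- dat %>%
  dplyr::mutate(dplyr::across(where(is.character), as.factor))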

EDA

First let’s look at the distribution of our target variable: average total effective dose (TED). The distribution is heavily skewed toward zero (negative values wouldn’t really make sense here, as a negative dosimetry reading wouldn’t happen).

ggplot(df, aes(x=avg_ted)) + 
  geom_histogram() + 
  labs(x="Average Total Effective Dose (mrem)", y="Count", title="Average TED (REMS)")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

For the tree-based models we’ll train later on, a skewed target variable isn’t a blocker: tree-based methods (including random forest) split on thresholds and can handle skewed data without transformation.
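
If we did want to tame the skew for the linear models below, a log transform is the usual tool; an illustrative sketch (we keep the untransformed target for modeling):

# Illustrative only: log1p handles the many zero doses without producing -Inf
ggplot(df, aes(x = log1p(avg_ted))) +
  geom_histogram(bins = 30) +
  labs(x = "log(1 + Average TED)", title = "Log-transformed Average TED")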

Imputation

We’ll impute any missing values using the mice predictive mean matching (pmm) method. In our case, many values in the raw data are reported as zero (presumably because some sites do not have full reporting dating back to the beginning of our timeframe).
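
Before imputing, it’s worth checking how much is actually missing; a minimal base-R sketch:

# Count missing values per column to gauge how much imputation is needed
colSums(is.na(df))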

# Impute missing values via predictive mean matching
# (mice's argument for silencing the iteration log is printFlag, not trace)
imputed <- mice(df, method="pmm", printFlag=FALSE, seed=1234)
# Convert imputation results back to our dataframe
df <- as.data.frame(complete(imputed))
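
A quick sanity check that the completed data has no remaining missing values:

# Should pass silently if imputation filled every NA
stopifnot(!anyNA(df))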

We’ll now create a training dataset with roughly 70% of the data, holding out the remaining 30% for validation.

# Create train-test split (set a seed so the split is reproducible)
set.seed(1234)
in_train <- sample(c(TRUE, FALSE), nrow(df), replace=TRUE, prob=c(0.7, 0.3))
train  <- df[in_train, ]

# Create test dataset
test <- df[!in_train, ]

test_avg_ted <- test %>% 
  dplyr::select("avg_ted")
test <- test %>% dplyr::select(-c("avg_ted"))
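
If we wanted an exact 70/30 split (the Bernoulli sampling above only hits 70% in expectation), caret’s createDataPartition is an alternative that also stratifies on the outcome; a sketch:

# Hypothetical alternative: exact 70% split, stratified on the target
idx <- caret::createDataPartition(df$avg_ted, p = 0.7, list = FALSE)
train_alt <- df[idx, ]
test_alt  <- df[-idx, ]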

Feature Plot

# Create a pared-down dataset to produce a feature plot
dose_features <- df %>% dplyr::select(
  `Average Photon Effective Dose` = avg_photon_ed,
  `Average Committed Effective Dose` = avg_ced,
  `Average Neutron Effective Dose` = avg_neutron_ed,
  `Average Equivalent Dose: Skin Whole Body` = avg_eq_skwb,
  `Average Total Effective Dose` = avg_ted
)

# Produce feature plot of the dose predictors against the target (average TED)
featurePlot(x = dose_features %>% dplyr::select(-`Average Total Effective Dose`),
            y = dose_features$`Average Total Effective Dose`)
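
As a numeric complement to the feature plot, we can look at a quick correlation matrix of the aggregate dose columns; a sketch using the renamed columns in df:

# Pairwise correlations between the aggregate dose measurements
round(cor(df %>% dplyr::select(avg_ted, avg_ced, avg_photon_ed,
                               avg_neutron_ed, avg_eq_skwb)), 2)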

Model Training

Since we have such a wide dataset (>100 features), principal component analysis (PCA) can be a good dimensionality-reduction technique for us to use. This can help us reduce the number of features we feed to our models while retaining most of the variance in the data.

# Perform principal component analysis on the raw numeric data
numeric_data <- dat %>% dplyr::select(-c(program_office, facility_type,
                                         site, year, labor_category,
                                         occupation,
                                         operations_office,
                                         reporting_organization,
                                         monitoring_status))
pca <- princomp(~ ., data = numeric_data)

summary(pca)
## Importance of components:
##                              Comp.1       Comp.2       Comp.3       Comp.4
## Standard deviation     1.716226e+04 1.078369e+04 3.886693e+03 2825.2579518
## Proportion of Variance 6.685009e-01 2.639295e-01 3.428575e-02    0.0181163
## Cumulative Proportion  6.685009e-01 9.324304e-01 9.667162e-01    0.9848325
##                              Comp.5       Comp.6       Comp.7       Comp.8
## Standard deviation     1.905928e+03 1.536165e+03 6.321246e+02 4.276717e+02
## Proportion of Variance 8.244537e-03 5.355856e-03 9.068983e-04 4.151205e-04
## Cumulative Proportion  9.930770e-01 9.984329e-01 9.993398e-01 9.997549e-01
##                              Comp.9      Comp.10      Comp.11      Comp.12
## Standard deviation     2.787405e+02 1.102064e+02 7.516278e+01 6.624431e+01
## Proportion of Variance 1.763410e-04 2.756558e-05 1.282209e-05 9.959794e-06
## Cumulative Proportion  9.999312e-01 9.999588e-01 9.999716e-01 9.999816e-01
##                             Comp.13      Comp.14      Comp.15      Comp.16
## Standard deviation     4.851569e+01 3.693533e+01 3.153566e+01 3.055276e+01
## Proportion of Variance 5.342170e-06 3.096258e-06 2.257132e-06 2.118626e-06
## Cumulative Proportion  9.999869e-01 9.999900e-01 9.999923e-01 9.999944e-01
##                             Comp.17      Comp.18      Comp.19      Comp.20
## Standard deviation     2.748461e+01 2.292157e+01 2.114723e+01 1.580138e+01
## Proportion of Variance 1.714480e-06 1.192455e-06 1.014986e-06 5.666872e-07
## Cumulative Proportion  9.999961e-01 9.999973e-01 9.999983e-01 9.999989e-01
##                             Comp.21      Comp.22      Comp.23      Comp.24
## Standard deviation     1.367612e+01 1.204263e+01 8.142959e+00 6.891764e+00
## Proportion of Variance 4.245011e-07 3.291515e-07 1.504935e-07 1.077988e-07
## Cumulative Proportion  9.999993e-01 9.999996e-01 9.999998e-01 9.999999e-01
##                             Comp.25      Comp.26      Comp.27      Comp.28
## Standard deviation     4.642945e+00 2.922755e+00 2.467855e+00 2.043667e+00
## Proportion of Variance 4.892606e-08 1.938822e-08 1.382269e-08 9.479240e-09
## Cumulative Proportion  9.999999e-01 9.999999e-01 1.000000e+00 1.000000e+00
##                             Comp.29      Comp.30      Comp.31      Comp.32
## Standard deviation     1.780297e+00 1.586761e+00 1.404239e+00 1.246676e+00
## Proportion of Variance 7.193463e-09 5.714472e-09 4.475433e-09 3.527448e-09
## Cumulative Proportion  1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00
##                             Comp.33      Comp.34      Comp.35      Comp.36
## Standard deviation     1.195705e+00 1.069046e+00 8.624343e-01 6.788272e-01
## Proportion of Variance 3.244898e-09 2.593855e-09 1.688128e-09 1.045855e-09
## Cumulative Proportion  1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00
##                             Comp.37      Comp.38      Comp.39      Comp.40
## Standard deviation     6.094692e-01 5.027676e-01 4.880482e-01 4.394611e-01
## Proportion of Variance 8.430567e-10 5.737038e-10 5.406033e-10 4.383228e-10
## Cumulative Proportion  1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00
##                             Comp.41      Comp.42      Comp.43      Comp.44
## Standard deviation     4.148370e-01 3.874922e-01 3.503802e-01 3.002763e-01
## Proportion of Variance 3.905785e-10 3.407840e-10 2.786329e-10 2.046423e-10
## Cumulative Proportion  1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00
##                             Comp.45      Comp.46      Comp.47      Comp.48
## Standard deviation     2.780527e-01 2.556352e-01 2.375567e-01 2.235539e-01
## Proportion of Variance 1.754719e-10 1.483183e-10 1.280819e-10 1.134273e-10
## Cumulative Proportion  1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00
##                             Comp.49      Comp.50      Comp.51      Comp.52
## Standard deviation     2.137582e-01 1.841430e-01 1.537095e-01 1.384024e-01
## Proportion of Variance 1.037048e-10 7.695978e-11 5.362344e-11 4.347511e-11
## Cumulative Proportion  1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00
##                             Comp.53      Comp.54      Comp.55      Comp.56
## Standard deviation     1.141239e-01 1.040308e-01 1.012944e-01 8.167526e-02
## Proportion of Variance 2.956016e-11 2.456277e-11 2.328758e-11 1.514029e-11
## Cumulative Proportion  1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00
##                             Comp.57      Comp.58      Comp.59      Comp.60
## Standard deviation     7.573647e-02 7.286229e-02 6.416087e-02 5.778982e-02
## Proportion of Variance 1.301857e-11 1.204922e-11 9.343159e-12 7.579769e-12
## Cumulative Proportion  1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00
##                             Comp.61      Comp.62      Comp.63      Comp.64
## Standard deviation     4.897011e-02 4.285102e-02 3.826783e-02 3.100608e-02
## Proportion of Variance 5.442714e-12 4.167500e-12 3.323693e-12 2.181960e-12
## Cumulative Proportion  1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00
##                             Comp.65      Comp.66      Comp.67      Comp.68
## Standard deviation     2.795750e-02 2.746029e-02 2.617495e-02 2.383115e-02
## Proportion of Variance 1.773985e-12 1.711447e-12 1.554981e-12 1.288972e-12
## Cumulative Proportion  1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00
##                             Comp.69      Comp.70      Comp.71      Comp.72
## Standard deviation     2.332425e-02 2.199738e-02 1.977349e-02 1.908425e-02
## Proportion of Variance 1.234721e-12 1.098235e-12 8.874006e-13 8.266157e-13
## Cumulative Proportion  1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00
##                             Comp.73      Comp.74      Comp.75      Comp.76
## Standard deviation     1.628615e-02 1.442723e-02 1.322185e-02 1.200315e-02
## Proportion of Variance 6.019914e-13 4.724099e-13 3.967688e-13 3.269972e-13
## Cumulative Proportion  1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00
##                             Comp.77      Comp.78      Comp.79      Comp.80
## Standard deviation     1.191247e-02 1.013242e-02 9.926162e-03 9.490402e-03
## Proportion of Variance 3.220747e-13 2.330125e-13 2.236227e-13 2.044196e-13
## Cumulative Proportion  1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00
##                             Comp.81      Comp.82      Comp.83      Comp.84
## Standard deviation     8.068678e-03 7.819510e-03 7.232488e-03 6.119766e-03
## Proportion of Variance 1.477604e-13 1.387753e-13 1.187213e-13 8.500078e-14
## Cumulative Proportion  1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00
##                             Comp.85      Comp.86      Comp.87      Comp.88
## Standard deviation     5.925768e-03 5.655741e-03 5.126255e-03 5.105079e-03
## Proportion of Variance 7.969710e-14 7.259927e-14 5.964221e-14 5.915046e-14
## Cumulative Proportion  1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00
##                             Comp.89      Comp.90      Comp.91      Comp.92
## Standard deviation     4.472500e-03 4.180649e-03 3.449114e-03 2.162971e-03
## Proportion of Variance 4.539981e-14 3.966803e-14 2.700028e-14 1.061829e-14
## Cumulative Proportion  1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00
##                             Comp.93      Comp.94      Comp.95      Comp.96
## Standard deviation     8.793332e-04 1.575414e-04 3.284262e-05 1.038321e-06
## Proportion of Variance 1.754931e-15 5.633036e-17 2.448098e-18 2.446904e-21
## Cumulative Proportion  1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00
##                             Comp.97      Comp.98      Comp.99 Comp.100 Comp.101
## Standard deviation     1.255162e-08 1.116293e-08 9.092454e-09        0        0
## Proportion of Variance 3.575631e-25 2.828195e-25 1.876357e-25        0        0
## Cumulative Proportion  1.000000e+00 1.000000e+00 1.000000e+00        1        1
##                        Comp.102 Comp.103 Comp.104 Comp.105 Comp.106 Comp.107
## Standard deviation            0        0        0        0        0        0
## Proportion of Variance        0        0        0        0        0        0
## Cumulative Proportion         1        1        1        1        1        1
##                        Comp.108 Comp.109
## Standard deviation            0        0
## Proportion of Variance        0        0
## Cumulative Proportion         1        1

We can produce a scree plot, which is native to the princomp return value (much as plot methods exist for models produced by R’s lm function). This will highlight a good cutoff for down-selecting the most important components.

# Create a screeplot from our PCA showing the most important components
plot(pca)
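
We can also pick the cutoff programmatically from the cumulative proportion of variance; a short sketch:

# Number of components needed to explain 99% of total variance
cum_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
which(cum_var >= 0.99)[1]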

We can now train a lasso regression model. For this (as well as the ridge regression model below), we’ll add an argument to both center and scale the features we input to our model. Note that caret’s preProcess applies to the predictors only, not the outcome, so our target keeps the skewed distribution we saw earlier.

# Hold out the feature columns (used later when fitting the random forest)
features <- train %>% dplyr::select(-c(avg_ted))
lasso <- train(avg_ted ~ ., data=train, method="lasso",
               tuneGrid=data.frame(.fraction = seq(0, 0.5, by=0.01)),
               trControl=trainControl(method='cv'),
               preProcess=c('center','scale'),
               metric="Rsquared")

A naive, unpenalized linear model would likely overfit the training data; the lasso’s penalty shrinks coefficients to guard against this. Note that summary() here describes the structure of the underlying fitted object rather than reporting performance metrics:

summary(lasso)
##             Length Class      Mode     
## call           4   -none-     call     
## actions       93   -none-     list     
## allset        64   -none-     numeric  
## beta.pure   5952   -none-     numeric  
## vn            66   -none-     character
## mu             1   -none-     numeric  
## normx         64   -none-     numeric  
## meanx         64   -none-     numeric  
## lambda         1   -none-     numeric  
## L1norm        93   -none-     numeric  
## penalty       93   -none-     numeric  
## df            93   -none-     numeric  
## Cp            93   -none-     numeric  
## sigma2         1   -none-     numeric  
## xNames        66   -none-     character
## problemType    1   -none-     character
## tuneValue      1   data.frame list     
## obsLevels      1   -none-     logical  
## param          0   -none-     list
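
caret keeps the cross-validation results on the returned object, so we can check which value of the fraction parameter was selected and visualize the tuning curve:

# Best value of the lasso `fraction` parameter chosen by cross-validation
lasso$bestTune

# Plot Rsquared across the tuning grid
plot(lasso)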

We can also train a ridge regression model (see The Elements of Statistical Learning by Hastie, Tibshirani, and Friedman for a good introduction).
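
Where the lasso penalizes the \(L_1\) norm of the coefficients, ridge regression minimizes the residual sum of squares plus an \(L_2\) penalty on the coefficients:

\[
\hat{\beta}^{\text{ridge}} = \arg\min_{\beta} \left\{ \sum_{i=1}^{n} \left(y_i - x_i^\top \beta\right)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \right\}
\]

The penalty parameter \(\lambda\) is what we tune over below.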

# Train a ridge regression model tuned over different values for penalty param lambda
grid <- data.frame(lambda = seq(0, 5, by=0.1))
ridgeModel <- train(avg_ted ~ ., data=train,
                    method="ridge",
                    tuneGrid=grid,
                    trControl=trainControl(method='cv', number=5),
                    preProcess=c('center','scale'),
                    metric="Rsquared")


summary(ridgeModel)
##             Length Class      Mode     
## call           4   -none-     call     
## actions       93   -none-     list     
## allset        64   -none-     numeric  
## beta.pure   5952   -none-     numeric  
## vn            66   -none-     character
## mu             1   -none-     numeric  
## normx         64   -none-     numeric  
## meanx         64   -none-     numeric  
## lambda         1   -none-     numeric  
## L1norm        93   -none-     numeric  
## penalty       93   -none-     numeric  
## df            93   -none-     numeric  
## Cp            93   -none-     numeric  
## sigma2         1   -none-     numeric  
## xNames        66   -none-     character
## problemType    1   -none-     character
## tuneValue      1   data.frame list     
## obsLevels      1   -none-     logical  
## param          0   -none-     list

We can plot the tuning results over different values of \(\lambda\) via the plot method to see which value of \(\lambda\) performs best against our evaluation metric (in this case \(R^2\)).

plot(ridgeModel)

We can also train a random forest model against our data. As a tree-based method, random forest will be more robust to some of the outliers and skewness present in our raw data, so we won’t have to perform the same centering and scaling. This will also hold for the XGBoost model we train later on.

# Train random forest (the argument name is `ntree`; a misspelled `ntrees` is silently ignored)
rfModel <- randomForest(x = features, y = train$avg_ted, ntree = 500)

We can plot the error of our random forest model against the number of trees, which behaves like a tuning parameter. With 500 trees (the randomForest default), we can see the error rate elbow out at roughly 50 trees.

# Plot out-of-bag error against the number of trees
plot(rfModel)

Lastly, we’ll train an XGBoost model.

# Set up parameter grid for XGBoost tuning
param <- data.frame(nrounds=c(20), lambda=seq(0, 2, by=0.4), alpha=0.1, eta=0.05)

# Train XGBoost model against REMS data
xgBoostModel <- train(avg_ted ~ ., 
                      data=train, 
                      method="xgbLinear",
                      tuneGrid=param,
                      metric="Rsquared",
                      nthread = 1)

Similar to above, we can plot the hyperparameter tuning results of our XGBoost model.

plot(xgBoostModel)

Model Evaluation

We held out a test set to measure model performance. For each model, we’ll predict on the test set and calculate the root mean squared error (RMSE) as well as the \(R^2\).
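
As a reminder, for observed values \(y_i\) and predictions \(\hat{y}_i\):

\[
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}
\]

and yardstick::rsq_vec computes \(R^2\) as the squared correlation between the observed and predicted values (so the argument order in the calls below does not change the result).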

# Convert the held-out target column to a numeric vector
test_vals <- as.double(test_avg_ted$avg_ted)

First we can evaluate our penalized regression models, lasso and ridge, followed by the random forest and XGBoost models.

lassoPred <- predict(lasso, test)

# Compute model diagnostics (RMSE and R-Squared)
rmseLasso <- RMSE(lassoPred, test_vals)
rsqLasso <- yardstick::rsq_vec(lassoPred, test_vals)

print(glue("Lasso Regression RMSE: {rmseLasso}"))
## Lasso Regression RMSE: 16.1579133104546
print(glue("Lasso R-squared: {rsqLasso}"))
## Lasso R-squared: 0.937162543472018
ridgePred <- predict(ridgeModel, test)

# Compute model diagnostics (RMSE and R-Squared)
rmseRidge <- RMSE(ridgePred, test_vals)
rsqRidge <- yardstick::rsq_vec(ridgePred, test_vals)

print(glue("Ridge Regression RMSE: {rmseRidge}"))
## Ridge Regression RMSE: 8.62629651650898
print(glue("Ridge R-squared: {rsqRidge}"))
## Ridge R-squared: 0.968301335918049
rfPred <- predict(rfModel, test)

# Compute model diagnostics (RMSE and R-Squared)
rmseRF <- RMSE(rfPred, test_vals)
rsqRF <- yardstick::rsq_vec(rfPred, test_vals)

print(glue("Random Forest Regression RMSE: {rmseRF}"))
## Random Forest Regression RMSE: 8.78398826681894
print(glue("Random Forest R-squared: {rsqRF}"))
## Random Forest R-squared: 0.970484027504779
xgBoostPred <- predict(xgBoostModel, test)

# Compute model diagnostics (RMSE and R-Squared)
rmseXGBoost <- RMSE(xgBoostPred, test_vals)
rsqXGBoost <- yardstick::rsq_vec(xgBoostPred, test_vals)

print(glue("XGBoost RMSE: {rmseXGBoost}"))
## XGBoost RMSE: 7.00140672573539
print(glue("XGBoost R-squared: {rsqXGBoost}"))
## XGBoost R-squared: 0.979195054952625

Overall, we see the best performance from the XGBoost model against the unseen test data, with \(R^2 \approx 0.979\) and an RMSE of about 7.0. Our other models all produced \(R^2 > 0.93\) against the unseen test data, which is nothing to sneeze at.
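
To make the comparison easier to scan, we can collect the metrics computed above into a single table; a minimal sketch:

# Assemble the test-set metrics into one comparison table, sorted by RMSE
results <- data.frame(
  model = c("Lasso", "Ridge", "Random Forest", "XGBoost"),
  RMSE  = c(rmseLasso, rmseRidge, rmseRF, rmseXGBoost),
  R2    = c(rsqLasso, rsqRidge, rsqRF, rsqXGBoost)
)
results[order(results$RMSE), ]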

Additionally, we can plot variable importance to see if our models rely on different combinations of features.

varImpPlot(rfModel)

varImpLasso <- varImp(lasso) 
varImpLasso
## loess r-squared variable importance
## 
##                 Overall
## avg_photon_ed  100.0000
## avg_eq_skwb     69.4612
## avg_neutron_ed  17.5145
## facility_type    0.6780
## avg_ced          0.3634
## site             0.2482
## year             0.2376
## program_office   0.0000
varImpRidge <- varImp(ridgeModel) 
varImpRidge
## loess r-squared variable importance
## 
##                 Overall
## avg_photon_ed  100.0000
## avg_eq_skwb     69.4612
## avg_neutron_ed  17.5145
## facility_type    0.6780
## avg_ced          0.3634
## site             0.2482
## year             0.2376
## program_office   0.0000
varImpRF <- varImp(rfModel) 
varImpRF
##                   Overall
## year             641360.9
## avg_ced         1404942.7
## avg_photon_ed  31452448.6
## avg_neutron_ed  4044201.4
## site            2576432.7
## program_office   472578.3
## avg_eq_skwb    21913707.7
## facility_type    768765.5
varImpXGBoost <- varImp(xgBoostModel)
varImpXGBoost$importance
##                                                                    Overall
## avg_photon_ed                                                 1.000000e+02
## avg_eq_skwb                                                   3.626958e+00
## avg_neutron_ed                                                3.321888e+00
## avg_ced                                                       9.572025e-01
## year                                                          5.638203e-02
## siteHanford: Pacific Northwest National Laboratory            2.588072e-02
## siteHanford: Hanford Site                                     2.298195e-02
## siteLawrence Livermore National Laboratory                    2.238816e-02
## siteLos Alamos National Laboratory                            2.153151e-02
## program_officeOffice of Nuclear Energy Science and Technology 2.025779e-02
## facility_typeWEAPONS FABRICATION AND TEST                     1.205295e-02
## siteSavannah River Site                                       1.164647e-02
## siteUranium Mill Tailings Remedial Action Project             1.104083e-02
## facility_typeRESEARCH, GENERAL                                9.774588e-03
## siteHanford: Office of River Protection                       8.289449e-03
## siteOak Ridge: East Tennessee Technology Park                 7.986381e-03
## siteOak Ridge: Y-12 National Security Complex                 6.851421e-03
## facility_typeMAINTENANCE AND SUPPORT                          6.215192e-03
## siteIdaho                                                     6.047675e-03
## siteRocky Flats Site                                          5.978804e-03
## program_officeOffice of Environmental Management              4.640324e-03
## siteOak Ridge: Oak Ridge National Laboratory                  3.950990e-03
## facility_typeFUEL/URANIUM ENRICHMENT                          3.276804e-03
## siteSandia National Laboratories                              2.929734e-03
## sitePaducah Gaseous Diffusion Plant                           2.752162e-03
## facility_typeWASTE PROCESSING/MANAGEMENT                      2.486306e-03
## sitePantex Plant                                              2.303456e-03
## program_officeNational Nuclear Security Administration        2.224424e-03
## sitePortsmouth Gaseous Diffusion Plant                        1.274147e-03
## facility_typeFUEL PROCESSING                                  9.858173e-04
## program_officeOffice of Science                               5.451012e-04
## siteArgonne National Laboratory                               3.285178e-04
## facility_typeREACTOR                                          2.715631e-04
## siteBrookhaven National Laboratory                            1.292806e-04
## siteFernald                                                   9.550642e-05
## facility_typeOTHER                                            2.009615e-05
## siteBattelle Memorial - Columbus                              0.000000e+00
## siteEnergy Technology Engineering Center                      0.000000e+00
## siteFermi National Accelerator Laboratory                     0.000000e+00
## siteGrand Junction Office                                     0.000000e+00
## siteKansas City National Security Campus                      0.000000e+00
## siteKnolls Atomic Power Laboratory                            0.000000e+00
## siteLawrence Berkeley National Laboratory                     0.000000e+00
## siteMound Site                                                0.000000e+00
## siteNational Renewable Energy Laboratory                      0.000000e+00
## siteNevada National Security Site                             0.000000e+00
## siteNew Brunswick Laboratory                                  0.000000e+00
## siteOak Ridge: Oak Ridge Institute for Science and Education  0.000000e+00
## siteOffice of Secure Transportation                           0.000000e+00
## sitePrinceton Plasma Physics Laboratory                       0.000000e+00
## siteRMI Environmental Services                                0.000000e+00
## siteSavannah River National Laboratory                        0.000000e+00
## siteSeparations Process Research Unit                         0.000000e+00
## siteService Center Personnel                                  0.000000e+00
## siteSLAC National Accelerator Laboratory                      0.000000e+00
## siteSRS Tritium Facilities                                    0.000000e+00
## siteThomas Jefferson National Accelerator Facility            0.000000e+00
## siteWaste Isolation Pilot Plant                               0.000000e+00
## siteWest Valley Demonstration Project                         0.000000e+00
## program_officeOffice of Civilian Radioactive Waste Management 0.000000e+00
## program_officeOffice of Fossil Energy                         0.000000e+00
## program_officeOffice of Legacy Management                     0.000000e+00
## program_officeOther (unknown)                                 0.000000e+00
## program_officeOther Headquarters Level Organizations          0.000000e+00
## facility_typeFUEL FABRICATION                                 0.000000e+00
## facility_typeRESEARCH, FUSION                                 0.000000e+00

Across all of our models, the average photon effective dose is the dominant feature within this dataset.

Conclusion

While this data is a lagging indicator for the department (exposure records are collected and aggregated at a yearly cadence), the ability to have a predictive model of radiological exposures, even a high-level one, could be of value to the department. The following lines of work could help improve both our models’ performance and their value to the department.