—– Introduction —–

In early 2018, I sent 10 bean germplasm entries to five field locations for testing. I sent five entries that, based on a Finlay-Wilkinson model, I predicted would perform similarly across sites, so have low GxE, or be ‘stable’:

Starlight
UI537
Yolano
55012
Aztec

I also sent five entries that I predicted would be very variable across sites, so have high GxE, or be ‘labile’:

93_207G
97_409P
Buster
Max
ND307

These ten entries were the most distinguishable sets that were Durango beans and that I had sufficient seed bulked to send out.

I wanted to see if:

1. The Finlay-Wilkinson model was predictive, that is, if germplasm had the same type of GxE in both the training and the test data.
2. Using just the test data, we could see two groups of germplasm. That is, there was a significant difference in the slopes of these two groups of lines using just the test data.

Here, the yield data that I received for these ten varieties from these five sites in 2018 is the ‘test’ dataset. The yield data from 1981-2015 from the CDBN was the ‘training’ dataset, and gave predicted yield values (yhat) for a Finlay-Wilkinson model, the ‘prediction’ data.

1. The 2018 GxE Dataset

workbook <- loadWorkbook("Predicted FW.xlsx")
wbsheets <- readWorksheet(workbook, sheet = getSheets(workbook))

FW_data <- as_tibble(wbsheets$Long_format_R)

fwdf <- FW_data %>%
  mutate(h_index = case_when(
    Location_code == 'MOCO' ~ -70.26,
    Location_code == 'NDHA' ~ -0.87,
    Location_code == 'MISA' ~ 7.16,
    Location_code == 'NESB' ~ 13.60,
    Location_code == 'WAOT' ~ 37.90
  ),
  gxe_type = case_when(
    CDBN_ID %in% c('55012', 'Aztec', 'Starlight', 'UI537', 'Yolano') ~ 'stable',
    CDBN_ID %in% c('93_207G', '97_409P', 'Buster', 'Max', 'ND307') ~ 'labile'
  ))

fwdf$CDBN_ID <- factor(fwdf$CDBN_ID, levels = c('55012', 'Aztec', 'Starlight', 'UI537', 'Yolano', '93_207G', '97_409P', 'Buster', 'Max', 'ND307'))
fwdf$gxe_type <- factor(fwdf$gxe_type, levels = c('stable', 'labile'))

fwdf

y is actual data, either that I used to train the FW model or that we collected in 2018 to test the FW model (Data_type distinguishes between these datasets).

yhat is predicted data from the FW model, and SD_yhat is the standard deviation around those predictions. I had predicted data from 30 sites for all 312 varieties, but I filtered the data to just keep the 5 sites and 10 varieties that I actually have test data for, so we can compare those.

CDBN_ID is the name of the germplasm entry. Location_code is a four-letter code corresponding to where the entry was grown. Data_type refers to whether the y value was from the ‘training’ or the ‘test’ dataset. And the h_index is the environmental index predicted for each location code by the Finlay-Wilkinson model. This ranks how good different locations are for growing common bean.

2. GxE Predictions

The figure below shows the predicted slopes and predicted yields for each germplasm entry, against the environmental index (the h_index) based on the training set data.

The slopes indicate the type of GxE. Flatter training set slopes I called “stable”, and steeper slopes I called “labile”.

fwdf %>%
  filter(Data_type == 'Training' & CDBN_ID != "Stampede") %>%
  ggplot(aes(x = h_index, y = yhat)) +
  geom_point(aes(color = gxe_type)) +
  geom_line(aes(color = gxe_type, group = CDBN_ID), lty = 2) + 
  scale_color_viridis(discrete = TRUE, end = 0.8, direction = -1) +
  theme(legend.position = "top") +
  ylim(c(0, 4800)) + 
  xlim(c(-75, 40)) + 
  labs(x = "Environmental Index", y = "Predicted Yields (kg/ha)", title = "Predicted GxE groupings")

—– Results —–

Here is both the training and the test data for all ten entries, broken out by entry name and colored by the expected type of GxE. The dashed black line is the predicted slope from the Finlay-Wilkinson analysis, and the solid colored line (with the 95% confidence interval about that line shaded in grey) is the slope based on only the ‘test’ data from 2018.

fwdf %>%
  filter(CDBN_ID != "Stampede") %>%
  ggplot(aes(x = h_index, y = y)) +
  geom_point(aes(color = gxe_type, shape = Data_type), alpha = 0.6) +
  geom_path(aes(y = yhat), lty = 2) +
  geom_smooth(data = filter(fwdf, Data_type == "Test"), aes(color = gxe_type),
              method = "lm") +
  theme(legend.position = "top") + 
  scale_color_viridis(discrete = TRUE, end = 0.8, direction = -1) +
  scale_shape_manual(values = c(15, 20)) +
  ylim(c(0, 4800)) + 
  xlim(c(-75, 40)) + 
  facet_wrap(~CDBN_ID, ncol = 5) +
  labs(x = "Environmental Index", y = "Yield (kg/ha)",
       title = "Predicted and Test Slopes across Five Sites")

fwdf %>%
  filter(CDBN_ID != "Stampede" & Location_code != "MOCO") %>%
  ggplot(aes(x = h_index, y = y)) +
  geom_point(aes(color = gxe_type, shape = Data_type), alpha = 0.6) +
  geom_path(aes(y = yhat), lty = 2) +
  geom_smooth(data = filter(fwdf, Data_type == "Test" & 
                              Location_code != "MOCO"), aes(color = gxe_type),
              method = "lm") +
  theme(legend.position = "top") + 
  scale_color_viridis(discrete = TRUE, end = 0.8, direction = -1) +
  scale_shape_manual(values = c(15, 20)) +
  ylim(c(0, 4800)) + 
  xlim(c(-10, 55)) + 
  facet_wrap(~CDBN_ID, ncol = 5) +
  labs(x = "Environmental Index", y = "Yield (kg/ha)", 
       title = "Predicted and Test Slopes across Four Sites")

Are the slopes for these lines similar to our predictions?

Most slopes are within the 95% CI of the Test data slope (except for ND307). Also, for sites and CDBN entries where I have both training and test data, these values mostly seem similar. I don’t really have enough data in the test set to look for statistical outliers, so this is more of a visual sanity check.

However, I wanted to see two distinguishable groups of slopes for the two groups of five. Overall, it looks like all of the slopes regressed towards the mean and there aren’t two distinguishable groups here. Let’s compare the slope values for our two ‘groups’ to be sure, using an ANOVA.

I will also do the same analysis for just the four sites that are good common bean growing environments. Columbia, Missouri is an outlier environment, and I have been warned that this could strongly skew the results.

3. Statistical Comparison of Slopes

3a. Stabile vs Labile across Five Sites

The output of linear models of yield against environmental index plus an interaction between environmental index and GxE type are shown below. I wanted to see a significant effect of the interaction between the environmental index and the type of GxE predicted. I don’t see that in my test data, even though I do in the training data for the model.

Moreover, if I take a random sample of the training data of the same size as my test data, I still have the power to detect this interaction in the training data, suggesting that I have the power to detect an interaction in the test dataset if it were there, but it isn’t.

NB: Ideally the model would have CDBN_ID as a random effect, but I haven’t done that here.

set.seed(1234)
sample_t <- sample(1:93, 64, replace = FALSE)
train_sized <- filter(fwdf, Data_type == "Training" & !is.na(y))
five_sites_test <- lm(y ~ h_index + gxe_type:h_index, data = filter(fwdf, Data_type == "Test" & !is.na(y))) # 64 datapoints
five_sites_pred <- lm(yhat ~ h_index + gxe_type:h_index, data = filter(fwdf, Data_type == "Training"))
five_sites_train <- lm(y ~ h_index + gxe_type:h_index, data = train_sized[sample_t,]) # 93 datapoints

Model output from the test data does not show a significant interaction

summary(five_sites_test)
## 
## Call:
## lm(formula = y ~ h_index + gxe_type:h_index, data = filter(fwdf, 
##     Data_type == "Test" & !is.na(y)))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1697.33  -521.45    53.68   496.20  1545.14 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            2621.3240    88.0775  29.762  < 2e-16 ***
## h_index                  20.4861     3.2083   6.385 2.61e-08 ***
## h_index:gxe_typelabile   -0.9379     4.7655  -0.197    0.845    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 692.9 on 61 degrees of freedom
## Multiple R-squared:  0.5336, Adjusted R-squared:  0.5183 
## F-statistic: 34.89 on 2 and 61 DF,  p-value: 7.89e-11

Here is model output for the predicted interaction based on the yhat.

summary(five_sites_pred)
## 
## Call:
## lm(formula = yhat ~ h_index + gxe_type:h_index, data = filter(fwdf, 
##     Data_type == "Training"))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -817.12 -164.72  -34.53  161.21 1045.96 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            2842.356     26.835  105.92   <2e-16 ***
## h_index                  14.247      1.006   14.16   <2e-16 ***
## h_index:gxe_typelabile   22.487      1.558   14.43   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 270.9 on 104 degrees of freedom
##   (22 observations deleted due to missingness)
## Multiple R-squared:  0.9208, Adjusted R-squared:  0.9193 
## F-statistic: 604.4 on 2 and 104 DF,  p-value: < 2.2e-16

Here is the model output for a random sample of the training data of the same size as my test data. It has a significant interaction.

summary(five_sites_train)
## 
## Call:
## lm(formula = y ~ h_index + gxe_type:h_index, data = train_sized[sample_t, 
##     ])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2018.40  -306.38    21.57   463.43  1397.60 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            2751.726    100.856  27.284  < 2e-16 ***
## h_index                  18.666      4.008   4.658 2.13e-05 ***
## h_index:gxe_typelabile   15.003      6.588   2.277   0.0267 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 686.5 on 54 degrees of freedom
##   (7 observations deleted due to missingness)
## Multiple R-squared:  0.5579, Adjusted R-squared:  0.5415 
## F-statistic: 34.07 on 2 and 54 DF,  p-value: 2.691e-10

3b. Four Sites

What about models omitting Columbia, Missouri?

Here, there is an interaction between the environmental index and the type of GxE, though it isn’t significant. Doing the same analysis for a subsample of the training data indicates that I should have had the power to detect a significant interaction.

sample_t <- sample(1:80, 50, replace = FALSE)
train_sized <- filter(fwdf, Data_type == "Training" & Location_code != "MOCO" & !is.na(y))
four_sites_test <- lm(y ~ h_index + gxe_type:h_index, data = filter(fwdf, Data_type == "Test" & Location_code != "MOCO"))
four_sites_pred <- lm(yhat ~ h_index + gxe_type:h_index, data = filter(fwdf, Data_type == "Training" & Location_code != "MOCO"))
four_sites_train <- lm(y ~ h_index + gxe_type:h_index, data = train_sized[sample_t,])

Model output from the test data has a weak interaction

summary(four_sites_test)
## 
## Call:
## lm(formula = y ~ h_index + gxe_type:h_index, data = filter(fwdf, 
##     Data_type == "Test" & Location_code != "MOCO"))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1465.19  -511.94    45.67   581.87  1663.63 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            2664.153    132.684  20.079   <2e-16 ***
## h_index                   8.624      9.209   0.936    0.354    
## h_index:gxe_typelabile   16.623     11.354   1.464    0.150    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 734.5 on 47 degrees of freedom
## Multiple R-squared:  0.1392, Adjusted R-squared:  0.1025 
## F-statistic: 3.799 on 2 and 47 DF,  p-value: 0.02955

Here is model output for the predicted interaction based on the yhat.

summary(four_sites_pred)
## 
## Call:
## lm(formula = yhat ~ h_index + gxe_type:h_index, data = filter(fwdf, 
##     Data_type == "Training" & Location_code != "MOCO"))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -735.73 -146.62  -43.25  146.74  407.91 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            2821.830     33.016  85.469  < 2e-16 ***
## h_index                  12.642      2.136   5.918 6.48e-08 ***
## h_index:gxe_typelabile   26.819      2.157  12.435  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 206.3 on 86 degrees of freedom
##   (20 observations deleted due to missingness)
## Multiple R-squared:  0.8656, Adjusted R-squared:  0.8625 
## F-statistic:   277 on 2 and 86 DF,  p-value: < 2.2e-16

Here is the model output for a random sample of the training data of the same size as my test data. It has a significant interaction.

summary(four_sites_train)
## 
## Call:
## lm(formula = y ~ h_index + gxe_type:h_index, data = train_sized[sample_t, 
##     ])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1617.3  -395.3   131.2   488.4  1177.4 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            2941.101    187.463  15.689   <2e-16 ***
## h_index                  -5.821     14.083  -0.413   0.6818    
## h_index:gxe_typelabile   33.237     12.890   2.579   0.0142 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 737.4 on 36 degrees of freedom
##   (11 observations deleted due to missingness)
## Multiple R-squared:  0.3124, Adjusted R-squared:  0.2742 
## F-statistic: 8.179 on 2 and 36 DF,  p-value: 0.001179

—– Conclusions —–

Though the training data indicated that these two sets of five varieties would differ in their GxE, I find only weak evidence for this in the test dataset from 2018, and that evidence only when the poorest performing site was excluded. Moreover, I see regression towards the mean for every slope for each of the 10 varieties that we tested in 2018. This indicates to me that the Finlay-Wilkinson model I used doesn’t have sufficient accuracy to be useful for predicting differences in GxE across these environments, and thus for looking at the genetics of GxE across environments.

More data, particularly from locations which are good environments to grow common bean, might help, but it seems like a poor choice to do more work in support for an underlying model which has performed so poorly.

I’ve now done an analysis of GxE that incorporates weather data specific to each location and year, and the results are looking very promising. I would like to followup on this analysis with some crosses, and perhaps with field testing in 2020.

—– More Method Details —–

An earlier attempt to describe this approach can be found here: http://rpubs.com/alice_macqueen/FW-Predictions_5-location. However, the explanation is a little lacking, there, and it doesn’t have the final list of ten varieties that I chose.

Methods to generate predictions:

These predictions are based on a Finlay-Wilkinson (FW) model fitted using a Bayesian Gibbs sampler and a matrix of variety relatedness (A). The Gibbs sampler shrinks estimates for each variety towards the average performance of the model, and generally gives better predictive power than an ordinary least squares model. The A matrix was calculated in Tassel using its recommended methodology (centered IBS). The SNP matrix used in Tassel was generated from the GBS data collected by Phil McClean, Rian Lee, and Alice MacQueen using the ApeKI enzyme, aligned using bwa mem to the P. vulgaris genome V 2.0, and with SNP calls using NGSEP.

The FW model was fitted for 312 bean varieties across 30 CDBN locations. The 30 locations were picked from 77 CDBN locations by selecting locations that had grown ten check varieties (CELRK, Fleetwood, Montcalm, NW63, Othello, Midnight, Redkloud, UI114, UI59, & Viva) at least once. The year each variety was grown was ignored in this model. Future work will account for the effect of year by incorporating daily weather data into the model.

Also, if you compare by CDBN_ID, you see the same results, namely, that the test data does a very poor job of recapitulating the slope of GxE in the training data. Max and 97_409P are the two entries that really break this model - they are both ‘labile’ entries that look a lot more like ‘stable’ lines in the test data. Also, for the four site model, Aztec also messes up the model.

In this analysis, I was expecting for the first four entries to have low values for the interaction estimate, and the last five entries to have high values.

five_sites_test <- lm(y ~ h_index + h_index:CDBN_ID, data = filter(fwdf, Data_type == "Test"))
five_sites_pred <- lm(yhat ~ h_index + h_index:CDBN_ID, data = filter(fwdf, Data_type == "Training"))
five_sites_train <- lm(y ~ h_index + h_index:CDBN_ID, data = filter(fwdf, Data_type == "Training"))

summary(five_sites_test)
## 
## Call:
## lm(formula = y ~ h_index + h_index:CDBN_ID, data = filter(fwdf, 
##     Data_type == "Test"))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1507.34  -486.93    40.67   427.84  1523.98 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              2620.6872    92.1467  28.440   <2e-16 ***
## h_index                    15.4900     6.7533   2.294   0.0258 *  
## h_index:CDBN_IDAztec        6.5984     9.5040   0.694   0.4905    
## h_index:CDBN_IDStarlight    0.7870    11.1482   0.071   0.9440    
## h_index:CDBN_IDUI537        9.1341    11.1482   0.819   0.4163    
## h_index:CDBN_IDYolano       8.4155     9.5040   0.885   0.3799    
## h_index:CDBN_ID93_207G      8.5959    11.1482   0.771   0.4441    
## h_index:CDBN_ID97_409P      0.7706    11.1482   0.069   0.9452    
## h_index:CDBN_IDBuster      11.5899    11.1482   1.040   0.3032    
## h_index:CDBN_IDMax         -3.5270    11.1482  -0.316   0.7530    
## h_index:CDBN_IDND307        3.3657     9.5040   0.354   0.7246    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 722.1 on 53 degrees of freedom
## Multiple R-squared:  0.5599, Adjusted R-squared:  0.4769 
## F-statistic: 6.743 on 10 and 53 DF,  p-value: 1.163e-06
summary(five_sites_pred)
## 
## Call:
## lm(formula = yhat ~ h_index + h_index:CDBN_ID, data = filter(fwdf, 
##     Data_type == "Training"))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -338.95 -177.75  -88.07  151.46  648.76 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              2830.9948    20.9288 135.268  < 2e-16 ***
## h_index                    -0.3134     2.5802  -0.121    0.904    
## h_index:CDBN_IDAztec       16.3895     2.9790   5.502 3.12e-07 ***
## h_index:CDBN_IDStarlight   14.1022     3.0431   4.634 1.13e-05 ***
## h_index:CDBN_IDUI537       16.8866     3.0126   5.605 1.99e-07 ***
## h_index:CDBN_IDYolano      16.4203     3.2241   5.093 1.76e-06 ***
## h_index:CDBN_ID93_207G     31.5113     3.4835   9.046 1.67e-14 ***
## h_index:CDBN_ID97_409P     25.4985     3.6485   6.989 3.66e-10 ***
## h_index:CDBN_IDBuster      38.0783     3.2650  11.662  < 2e-16 ***
## h_index:CDBN_IDMax         42.8426     2.9882  14.337  < 2e-16 ***
## h_index:CDBN_IDND307       36.9348     3.3381  11.065  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 209.7 on 96 degrees of freedom
##   (22 observations deleted due to missingness)
## Multiple R-squared:  0.9562, Adjusted R-squared:  0.9516 
## F-statistic: 209.4 on 10 and 96 DF,  p-value: < 2.2e-16
summary(five_sites_train)
## 
## Call:
## lm(formula = y ~ h_index + h_index:CDBN_ID, data = filter(fwdf, 
##     Data_type == "Training"))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2044.5  -369.4    64.0   446.9  1431.5 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              2729.817     89.242  30.589  < 2e-16 ***
## h_index                    -6.198      8.340  -0.743 0.459942    
## h_index:CDBN_IDAztec       20.400      9.724   2.098 0.039629 *  
## h_index:CDBN_IDStarlight   25.856      9.845   2.626 0.010656 *  
## h_index:CDBN_IDUI537       27.761     10.014   2.772 0.007176 ** 
## h_index:CDBN_IDYolano      30.694     10.432   2.942 0.004452 ** 
## h_index:CDBN_ID93_207G     20.914     15.297   1.367 0.176064    
## h_index:CDBN_ID97_409P     49.218     19.664   2.503 0.014725 *  
## h_index:CDBN_IDBuster      38.200     10.883   3.510 0.000800 ***
## h_index:CDBN_IDMax         46.569     11.618   4.008 0.000154 ***
## h_index:CDBN_IDND307       54.088     14.751   3.667 0.000483 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 677.9 on 68 degrees of freedom
##   (50 observations deleted due to missingness)
## Multiple R-squared:  0.6444, Adjusted R-squared:  0.5921 
## F-statistic: 12.32 on 10 and 68 DF,  p-value: 7.355e-12
four_sites_test <- lm(y ~ h_index + h_index:CDBN_ID, data = filter(fwdf, Data_type == "Test" & Location_code != "MOCO"))
four_sites_pred <- lm(yhat ~ h_index + h_index:CDBN_ID, data = filter(fwdf, Data_type == "Training" & Location_code != "MOCO"))
four_sites_train <- lm(y ~ h_index + h_index:CDBN_ID, data = filter(fwdf, Data_type == "Training" & Location_code != "MOCO"))

summary(four_sites_test)
## 
## Call:
## lm(formula = y ~ h_index + h_index:CDBN_ID, data = filter(fwdf, 
##     Data_type == "Test" & Location_code != "MOCO"))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1468.49  -353.52   -60.12   504.21  1279.17 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               2664.15     129.57  20.561  < 2e-16 ***
## h_index                    -16.60      18.08  -0.919  0.36399    
## h_index:CDBN_IDAztec        53.50      24.79   2.158  0.03716 *  
## h_index:CDBN_IDStarlight    21.43      24.79   0.864  0.39264    
## h_index:CDBN_IDUI537        35.58      24.79   1.435  0.15918    
## h_index:CDBN_IDYolano       15.62      24.79   0.630  0.53224    
## h_index:CDBN_ID93_207G      47.89      24.79   1.931  0.06071 .  
## h_index:CDBN_ID97_409P      37.04      24.79   1.494  0.14321    
## h_index:CDBN_IDBuster       39.23      24.79   1.582  0.12163    
## h_index:CDBN_IDMax          15.67      24.79   0.632  0.53094    
## h_index:CDBN_IDND307        69.42      24.79   2.800  0.00791 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 717.3 on 39 degrees of freedom
## Multiple R-squared:  0.3188, Adjusted R-squared:  0.1441 
## F-statistic: 1.825 on 10 and 39 DF,  p-value: 0.08814
summary(four_sites_pred)
## 
## Call:
## lm(formula = yhat ~ h_index + h_index:CDBN_ID, data = filter(fwdf, 
##     Data_type == "Training" & Location_code != "MOCO"))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -234.88 -103.61  -32.74   76.48  377.57 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              2822.345     23.337 120.936  < 2e-16 ***
## h_index                    -8.643      3.640  -2.375  0.02001 *  
## h_index:CDBN_IDAztec       18.649      4.098   4.550 1.94e-05 ***
## h_index:CDBN_IDStarlight   14.181      4.743   2.990  0.00373 ** 
## h_index:CDBN_IDUI537       34.233      4.336   7.894 1.51e-11 ***
## h_index:CDBN_IDYolano      31.669      4.864   6.511 6.61e-09 ***
## h_index:CDBN_ID93_207G     42.972      4.403   9.760 3.65e-15 ***
## h_index:CDBN_ID97_409P     44.451      5.019   8.856 2.05e-13 ***
## h_index:CDBN_IDBuster      53.953      3.995  13.504  < 2e-16 ***
## h_index:CDBN_IDMax         46.757      3.832  12.201  < 2e-16 ***
## h_index:CDBN_IDND307       47.813      4.098  11.666  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 145.2 on 78 degrees of freedom
##   (20 observations deleted due to missingness)
## Multiple R-squared:  0.9396, Adjusted R-squared:  0.9319 
## F-statistic: 121.3 on 10 and 78 DF,  p-value: < 2.2e-16
summary(four_sites_train)
## 
## Call:
## lm(formula = y ~ h_index + h_index:CDBN_ID, data = filter(fwdf, 
##     Data_type == "Training" & Location_code != "MOCO"))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2075.41  -253.20    24.06   414.19  1413.56 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               2770.66     126.00  21.989  < 2e-16 ***
## h_index                    -45.85      16.50  -2.778 0.007463 ** 
## h_index:CDBN_IDAztec        56.52      19.29   2.931 0.004919 ** 
## h_index:CDBN_IDStarlight    50.39      21.38   2.357 0.022037 *  
## h_index:CDBN_IDUI537       127.77      34.52   3.702 0.000497 ***
## h_index:CDBN_IDYolano       82.94      21.95   3.778 0.000391 ***
## h_index:CDBN_ID93_207G      59.40      19.98   2.974 0.004360 ** 
## h_index:CDBN_ID97_409P      87.63      23.21   3.775 0.000394 ***
## h_index:CDBN_IDBuster       74.64      18.60   4.014 0.000182 ***
## h_index:CDBN_IDMax          84.83      17.45   4.862 1.01e-05 ***
## h_index:CDBN_IDND307        92.32      19.50   4.735 1.58e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 651.2 on 55 degrees of freedom
##   (43 observations deleted due to missingness)
## Multiple R-squared:  0.5072, Adjusted R-squared:  0.4176 
## F-statistic: 5.662 on 10 and 55 DF,  p-value: 9.043e-06