In early 2018, I sent 10 bean germplasm entries to five field locations for testing. I sent five entries that, based on a Finlay-Wilkinson model, I predicted would perform similarly across sites, so have low GxE, or be ‘stable’:
Starlight
UI537
Yolano
55012
Aztec
I also sent five entries that I predicted would be very variable across sites, so have high GxE, or be ‘labile’:
93_207G
97_409P
Buster
Max
ND307
These ten entries were the most distinguishable sets that were Durango beans and that I had sufficient seed bulked to send out.
I wanted to see if:
1. The Finlay-Wilkinson model was predictive, that is, if germplasm had the same type of GxE in both the training and the test data.
2. Using just the test data, we could see two groups of germplasm. That is, there was a significant difference in the slopes of these two groups of lines using just the test data.
Here, the yield data that I received for these ten varieties from these five sites in 2018 is the ‘test’ dataset. The yield data from 1981-2015 from the CDBN was the ‘training’ dataset, and gave predicted yield values (yhat) for a Finlay-Wilkinson model, the ‘prediction’ data.
workbook <- loadWorkbook("Predicted FW.xlsx")
wbsheets <- readWorksheet(workbook, sheet = getSheets(workbook))
FW_data <- as_tibble(wbsheets$Long_format_R)
fwdf <- FW_data %>%
mutate(h_index = case_when(
Location_code == 'MOCO' ~ -70.26,
Location_code == 'NDHA' ~ -0.87,
Location_code == 'MISA' ~ 7.16,
Location_code == 'NESB' ~ 13.60,
Location_code == 'WAOT' ~ 37.90
),
gxe_type = case_when(
CDBN_ID %in% c('55012', 'Aztec', 'Starlight', 'UI537', 'Yolano') ~ 'stable',
CDBN_ID %in% c('93_207G', '97_409P', 'Buster', 'Max', 'ND307') ~ 'labile'
))
fwdf$CDBN_ID <- factor(fwdf$CDBN_ID, levels = c('55012', 'Aztec', 'Starlight', 'UI537', 'Yolano', '93_207G', '97_409P', 'Buster', 'Max', 'ND307'))
fwdf$gxe_type <- factor(fwdf$gxe_type, levels = c('stable', 'labile'))
fwdf
y is actual data, either that I used to train the FW model or that we collected in 2018 to test the FW model (Data_type distinguishes between these datasets).
yhat is predicted data from the FW model, and SD_yhat is the standard deviation around those predictions. I had predicted data from 30 sites for all 312 varieties, but I filtered the data to just keep the 5 sites and 10 varieties that I actually have test data for, so we can compare those.
CDBN_ID is the name of the germplasm entry. Location_code is a four-letter code corresponding to where the entry was grown. Data_type refers to whether the y value was from the ‘training’ or the ‘test’ dataset. And the h_index is the environmental index predicted for each location code by the Finlay-Wilkinson model. This ranks how good different locations are for growing common bean.
The figure below shows the predicted slopes and predicted yields for each germplasm entry, against the environmental index (the h_index) based on the training set data.
The slopes indicate the type of GxE. Flatter training set slopes I called “stable”, and steeper slopes I called “labile”.
fwdf %>%
filter(Data_type == 'Training' & CDBN_ID != "Stampede") %>%
ggplot(aes(x = h_index, y = yhat)) +
geom_point(aes(color = gxe_type)) +
geom_line(aes(color = gxe_type, group = CDBN_ID), lty = 2) +
scale_color_viridis(discrete = TRUE, end = 0.8, direction = -1) +
theme(legend.position = "top") +
ylim(c(0, 4800)) +
xlim(c(-75, 40)) +
labs(x = "Environmental Index", y = "Predicted Yields (kg/ha)", title = "Predicted GxE groupings")
Here is both the training and the test data for all ten entries, broken out by entry name and colored by the expected type of GxE. The dashed black line is the predicted slope from the Finlay-Wilkinson analysis, and the solid colored line (with the 95% confidence interval about that line shaded in grey) is the slope based on only the ‘test’ data from 2018.
fwdf %>%
filter(CDBN_ID != "Stampede") %>%
ggplot(aes(x = h_index, y = y)) +
geom_point(aes(color = gxe_type, shape = Data_type), alpha = 0.6) +
geom_path(aes(y = yhat), lty = 2) +
geom_smooth(data = filter(fwdf, Data_type == "Test"), aes(color = gxe_type),
method = "lm") +
theme(legend.position = "top") +
scale_color_viridis(discrete = TRUE, end = 0.8, direction = -1) +
scale_shape_manual(values = c(15, 20)) +
ylim(c(0, 4800)) +
xlim(c(-75, 40)) +
facet_wrap(~CDBN_ID, ncol = 5) +
labs(x = "Environmental Index", y = "Yield (kg/ha)",
title = "Predicted and Test Slopes across Five Sites")
fwdf %>%
filter(CDBN_ID != "Stampede" & Location_code != "MOCO") %>%
ggplot(aes(x = h_index, y = y)) +
geom_point(aes(color = gxe_type, shape = Data_type), alpha = 0.6) +
geom_path(aes(y = yhat), lty = 2) +
geom_smooth(data = filter(fwdf, Data_type == "Test" &
Location_code != "MOCO"), aes(color = gxe_type),
method = "lm") +
theme(legend.position = "top") +
scale_color_viridis(discrete = TRUE, end = 0.8, direction = -1) +
scale_shape_manual(values = c(15, 20)) +
ylim(c(0, 4800)) +
xlim(c(-10, 55)) +
facet_wrap(~CDBN_ID, ncol = 5) +
labs(x = "Environmental Index", y = "Yield (kg/ha)",
title = "Predicted and Test Slopes across Four Sites")
Are the slopes for these lines similar to our predictions?
Most slopes are within the 95% CI of the Test data slope (except for ND307). Also, for sites and CDBN entries where I have both training and test data, these values mostly seem similar. I don’t really have enough data in the test set to look for statistical outliers, so this is more of a visual sanity check.
However, I wanted to see two distinguishable groups of slopes for the two groups of five. Overall, it looks like all of the slopes regressed towards the mean and there aren’t two distinguishable groups here. Let’s compare the slope values for our two ‘groups’ to be sure, using an ANOVA.
I will also do the same analysis for just the four sites that are good common bean growing environments. Columbia, Missouri is an outlier environment, and I have been warned that this could strongly skew the results.
The output of linear models of yield against environmental index plus an interaction between environmental index and GxE type are shown below. I wanted to see a significant effect of the interaction between the environmental index and the type of GxE predicted. I don’t see that in my test data, even though I do in the training data for the model.
Moreover, if I take a random sample of the training data of the same size as my test data, I still have the power to detect this interaction in the training data, suggesting that I have the power to detect an interaction in the test dataset if it were there, but it isn’t.
NB: Ideally the model would have CDBN_ID as a random effect, but I haven’t done that here.
set.seed(1234)
sample_t <- sample(1:93, 64, replace = FALSE)
train_sized <- filter(fwdf, Data_type == "Training" & !is.na(y))
five_sites_test <- lm(y ~ h_index + gxe_type:h_index, data = filter(fwdf, Data_type == "Test" & !is.na(y))) # 64 datapoints
five_sites_pred <- lm(yhat ~ h_index + gxe_type:h_index, data = filter(fwdf, Data_type == "Training"))
five_sites_train <- lm(y ~ h_index + gxe_type:h_index, data = train_sized[sample_t,]) # 93 datapoints
Model output from the test data does not show a significant interaction
summary(five_sites_test)
##
## Call:
## lm(formula = y ~ h_index + gxe_type:h_index, data = filter(fwdf,
## Data_type == "Test" & !is.na(y)))
##
## Residuals:
## Min 1Q Median 3Q Max
## -1697.33 -521.45 53.68 496.20 1545.14
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2621.3240 88.0775 29.762 < 2e-16 ***
## h_index 20.4861 3.2083 6.385 2.61e-08 ***
## h_index:gxe_typelabile -0.9379 4.7655 -0.197 0.845
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 692.9 on 61 degrees of freedom
## Multiple R-squared: 0.5336, Adjusted R-squared: 0.5183
## F-statistic: 34.89 on 2 and 61 DF, p-value: 7.89e-11
Here is model output for the predicted interaction based on the yhat.
summary(five_sites_pred)
##
## Call:
## lm(formula = yhat ~ h_index + gxe_type:h_index, data = filter(fwdf,
## Data_type == "Training"))
##
## Residuals:
## Min 1Q Median 3Q Max
## -817.12 -164.72 -34.53 161.21 1045.96
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2842.356 26.835 105.92 <2e-16 ***
## h_index 14.247 1.006 14.16 <2e-16 ***
## h_index:gxe_typelabile 22.487 1.558 14.43 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 270.9 on 104 degrees of freedom
## (22 observations deleted due to missingness)
## Multiple R-squared: 0.9208, Adjusted R-squared: 0.9193
## F-statistic: 604.4 on 2 and 104 DF, p-value: < 2.2e-16
Here is the model output for a random sample of the training data of the same size as my test data. It has a significant interaction.
summary(five_sites_train)
##
## Call:
## lm(formula = y ~ h_index + gxe_type:h_index, data = train_sized[sample_t,
## ])
##
## Residuals:
## Min 1Q Median 3Q Max
## -2018.40 -306.38 21.57 463.43 1397.60
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2751.726 100.856 27.284 < 2e-16 ***
## h_index 18.666 4.008 4.658 2.13e-05 ***
## h_index:gxe_typelabile 15.003 6.588 2.277 0.0267 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 686.5 on 54 degrees of freedom
## (7 observations deleted due to missingness)
## Multiple R-squared: 0.5579, Adjusted R-squared: 0.5415
## F-statistic: 34.07 on 2 and 54 DF, p-value: 2.691e-10
What about models omitting Columbia, Missouri?
Here, there is an interaction between the environmental index and the type of GxE, though it isn’t significant. Doing the same analysis for a subsample of the training data indicates that I should have had the power to detect a significant interaction.
sample_t <- sample(1:80, 50, replace = FALSE)
train_sized <- filter(fwdf, Data_type == "Training" & Location_code != "MOCO" & !is.na(y))
four_sites_test <- lm(y ~ h_index + gxe_type:h_index, data = filter(fwdf, Data_type == "Test" & Location_code != "MOCO"))
four_sites_pred <- lm(yhat ~ h_index + gxe_type:h_index, data = filter(fwdf, Data_type == "Training" & Location_code != "MOCO"))
four_sites_train <- lm(y ~ h_index + gxe_type:h_index, data = train_sized[sample_t,])
Model output from the test data has a weak interaction
summary(four_sites_test)
##
## Call:
## lm(formula = y ~ h_index + gxe_type:h_index, data = filter(fwdf,
## Data_type == "Test" & Location_code != "MOCO"))
##
## Residuals:
## Min 1Q Median 3Q Max
## -1465.19 -511.94 45.67 581.87 1663.63
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2664.153 132.684 20.079 <2e-16 ***
## h_index 8.624 9.209 0.936 0.354
## h_index:gxe_typelabile 16.623 11.354 1.464 0.150
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 734.5 on 47 degrees of freedom
## Multiple R-squared: 0.1392, Adjusted R-squared: 0.1025
## F-statistic: 3.799 on 2 and 47 DF, p-value: 0.02955
Here is model output for the predicted interaction based on the yhat.
summary(four_sites_pred)
##
## Call:
## lm(formula = yhat ~ h_index + gxe_type:h_index, data = filter(fwdf,
## Data_type == "Training" & Location_code != "MOCO"))
##
## Residuals:
## Min 1Q Median 3Q Max
## -735.73 -146.62 -43.25 146.74 407.91
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2821.830 33.016 85.469 < 2e-16 ***
## h_index 12.642 2.136 5.918 6.48e-08 ***
## h_index:gxe_typelabile 26.819 2.157 12.435 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 206.3 on 86 degrees of freedom
## (20 observations deleted due to missingness)
## Multiple R-squared: 0.8656, Adjusted R-squared: 0.8625
## F-statistic: 277 on 2 and 86 DF, p-value: < 2.2e-16
Here is the model output for a random sample of the training data of the same size as my test data. It has a significant interaction.
summary(four_sites_train)
##
## Call:
## lm(formula = y ~ h_index + gxe_type:h_index, data = train_sized[sample_t,
## ])
##
## Residuals:
## Min 1Q Median 3Q Max
## -1617.3 -395.3 131.2 488.4 1177.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2941.101 187.463 15.689 <2e-16 ***
## h_index -5.821 14.083 -0.413 0.6818
## h_index:gxe_typelabile 33.237 12.890 2.579 0.0142 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 737.4 on 36 degrees of freedom
## (11 observations deleted due to missingness)
## Multiple R-squared: 0.3124, Adjusted R-squared: 0.2742
## F-statistic: 8.179 on 2 and 36 DF, p-value: 0.001179
Though the training data indicated that these two sets of five varieties would differ in their GxE, I find only weak evidence for this in the test dataset from 2018, and that evidence only when the poorest performing site was excluded. Moreover, I see regression towards the mean for every slope for each of the 10 varieties that we tested in 2018. This indicates to me that the Finlay-Wilkinson model I used doesn’t have sufficient accuracy to be useful for predicting differences in GxE across these environments, and thus for looking at the genetics of GxE across environments.
More data, particularly from locations which are good environments to grow common bean, might help, but it seems like a poor choice to do more work in support for an underlying model which has performed so poorly.
I’ve now done an analysis of GxE that incorporates weather data specific to each location and year, and the results are looking very promising. I would like to followup on this analysis with some crosses, and perhaps with field testing in 2020.
An earlier attempt to describe this approach can be found here: http://rpubs.com/alice_macqueen/FW-Predictions_5-location. However, the explanation is a little lacking, there, and it doesn’t have the final list of ten varieties that I chose.
These predictions are based on a Finlay-Wilkinson (FW) model fitted using a Bayesian Gibbs sampler and a matrix of variety relatedness (A). The Gibbs sampler shrinks estimates for each variety towards the average performance of the model, and generally gives better predictive power than an ordinary least squares model. The A matrix was calculated in Tassel using its recommended methodology (centered IBS). The SNP matrix used in Tassel was generated from the GBS data collected by Phil McClean, Rian Lee, and Alice MacQueen using the ApeKI enzyme, aligned using bwa mem to the P. vulgaris genome V 2.0, and with SNP calls using NGSEP.
The FW model was fitted for 312 bean varieties across 30 CDBN locations. The 30 locations were picked from 77 CDBN locations by selecting locations that had grown ten check varieties (CELRK, Fleetwood, Montcalm, NW63, Othello, Midnight, Redkloud, UI114, UI59, & Viva) at least once. The year each variety was grown was ignored in this model. Future work will account for the effect of year by incorporating daily weather data into the model.
Also, if you compare by CDBN_ID, you see the same results, namely, that the test data does a very poor job of recapitulating the slope of GxE in the training data. Max and 97_409P are the two entries that really break this model - they are both ‘labile’ entries that look a lot more like ‘stable’ lines in the test data. Also, for the four site model, Aztec also messes up the model.
In this analysis, I was expecting for the first four entries to have low values for the interaction estimate, and the last five entries to have high values.
five_sites_test <- lm(y ~ h_index + h_index:CDBN_ID, data = filter(fwdf, Data_type == "Test"))
five_sites_pred <- lm(yhat ~ h_index + h_index:CDBN_ID, data = filter(fwdf, Data_type == "Training"))
five_sites_train <- lm(y ~ h_index + h_index:CDBN_ID, data = filter(fwdf, Data_type == "Training"))
summary(five_sites_test)
##
## Call:
## lm(formula = y ~ h_index + h_index:CDBN_ID, data = filter(fwdf,
## Data_type == "Test"))
##
## Residuals:
## Min 1Q Median 3Q Max
## -1507.34 -486.93 40.67 427.84 1523.98
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2620.6872 92.1467 28.440 <2e-16 ***
## h_index 15.4900 6.7533 2.294 0.0258 *
## h_index:CDBN_IDAztec 6.5984 9.5040 0.694 0.4905
## h_index:CDBN_IDStarlight 0.7870 11.1482 0.071 0.9440
## h_index:CDBN_IDUI537 9.1341 11.1482 0.819 0.4163
## h_index:CDBN_IDYolano 8.4155 9.5040 0.885 0.3799
## h_index:CDBN_ID93_207G 8.5959 11.1482 0.771 0.4441
## h_index:CDBN_ID97_409P 0.7706 11.1482 0.069 0.9452
## h_index:CDBN_IDBuster 11.5899 11.1482 1.040 0.3032
## h_index:CDBN_IDMax -3.5270 11.1482 -0.316 0.7530
## h_index:CDBN_IDND307 3.3657 9.5040 0.354 0.7246
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 722.1 on 53 degrees of freedom
## Multiple R-squared: 0.5599, Adjusted R-squared: 0.4769
## F-statistic: 6.743 on 10 and 53 DF, p-value: 1.163e-06
summary(five_sites_pred)
##
## Call:
## lm(formula = yhat ~ h_index + h_index:CDBN_ID, data = filter(fwdf,
## Data_type == "Training"))
##
## Residuals:
## Min 1Q Median 3Q Max
## -338.95 -177.75 -88.07 151.46 648.76
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2830.9948 20.9288 135.268 < 2e-16 ***
## h_index -0.3134 2.5802 -0.121 0.904
## h_index:CDBN_IDAztec 16.3895 2.9790 5.502 3.12e-07 ***
## h_index:CDBN_IDStarlight 14.1022 3.0431 4.634 1.13e-05 ***
## h_index:CDBN_IDUI537 16.8866 3.0126 5.605 1.99e-07 ***
## h_index:CDBN_IDYolano 16.4203 3.2241 5.093 1.76e-06 ***
## h_index:CDBN_ID93_207G 31.5113 3.4835 9.046 1.67e-14 ***
## h_index:CDBN_ID97_409P 25.4985 3.6485 6.989 3.66e-10 ***
## h_index:CDBN_IDBuster 38.0783 3.2650 11.662 < 2e-16 ***
## h_index:CDBN_IDMax 42.8426 2.9882 14.337 < 2e-16 ***
## h_index:CDBN_IDND307 36.9348 3.3381 11.065 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 209.7 on 96 degrees of freedom
## (22 observations deleted due to missingness)
## Multiple R-squared: 0.9562, Adjusted R-squared: 0.9516
## F-statistic: 209.4 on 10 and 96 DF, p-value: < 2.2e-16
summary(five_sites_train)
##
## Call:
## lm(formula = y ~ h_index + h_index:CDBN_ID, data = filter(fwdf,
## Data_type == "Training"))
##
## Residuals:
## Min 1Q Median 3Q Max
## -2044.5 -369.4 64.0 446.9 1431.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2729.817 89.242 30.589 < 2e-16 ***
## h_index -6.198 8.340 -0.743 0.459942
## h_index:CDBN_IDAztec 20.400 9.724 2.098 0.039629 *
## h_index:CDBN_IDStarlight 25.856 9.845 2.626 0.010656 *
## h_index:CDBN_IDUI537 27.761 10.014 2.772 0.007176 **
## h_index:CDBN_IDYolano 30.694 10.432 2.942 0.004452 **
## h_index:CDBN_ID93_207G 20.914 15.297 1.367 0.176064
## h_index:CDBN_ID97_409P 49.218 19.664 2.503 0.014725 *
## h_index:CDBN_IDBuster 38.200 10.883 3.510 0.000800 ***
## h_index:CDBN_IDMax 46.569 11.618 4.008 0.000154 ***
## h_index:CDBN_IDND307 54.088 14.751 3.667 0.000483 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 677.9 on 68 degrees of freedom
## (50 observations deleted due to missingness)
## Multiple R-squared: 0.6444, Adjusted R-squared: 0.5921
## F-statistic: 12.32 on 10 and 68 DF, p-value: 7.355e-12
four_sites_test <- lm(y ~ h_index + h_index:CDBN_ID, data = filter(fwdf, Data_type == "Test" & Location_code != "MOCO"))
four_sites_pred <- lm(yhat ~ h_index + h_index:CDBN_ID, data = filter(fwdf, Data_type == "Training" & Location_code != "MOCO"))
four_sites_train <- lm(y ~ h_index + h_index:CDBN_ID, data = filter(fwdf, Data_type == "Training" & Location_code != "MOCO"))
summary(four_sites_test)
##
## Call:
## lm(formula = y ~ h_index + h_index:CDBN_ID, data = filter(fwdf,
## Data_type == "Test" & Location_code != "MOCO"))
##
## Residuals:
## Min 1Q Median 3Q Max
## -1468.49 -353.52 -60.12 504.21 1279.17
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2664.15 129.57 20.561 < 2e-16 ***
## h_index -16.60 18.08 -0.919 0.36399
## h_index:CDBN_IDAztec 53.50 24.79 2.158 0.03716 *
## h_index:CDBN_IDStarlight 21.43 24.79 0.864 0.39264
## h_index:CDBN_IDUI537 35.58 24.79 1.435 0.15918
## h_index:CDBN_IDYolano 15.62 24.79 0.630 0.53224
## h_index:CDBN_ID93_207G 47.89 24.79 1.931 0.06071 .
## h_index:CDBN_ID97_409P 37.04 24.79 1.494 0.14321
## h_index:CDBN_IDBuster 39.23 24.79 1.582 0.12163
## h_index:CDBN_IDMax 15.67 24.79 0.632 0.53094
## h_index:CDBN_IDND307 69.42 24.79 2.800 0.00791 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 717.3 on 39 degrees of freedom
## Multiple R-squared: 0.3188, Adjusted R-squared: 0.1441
## F-statistic: 1.825 on 10 and 39 DF, p-value: 0.08814
summary(four_sites_pred)
##
## Call:
## lm(formula = yhat ~ h_index + h_index:CDBN_ID, data = filter(fwdf,
## Data_type == "Training" & Location_code != "MOCO"))
##
## Residuals:
## Min 1Q Median 3Q Max
## -234.88 -103.61 -32.74 76.48 377.57
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2822.345 23.337 120.936 < 2e-16 ***
## h_index -8.643 3.640 -2.375 0.02001 *
## h_index:CDBN_IDAztec 18.649 4.098 4.550 1.94e-05 ***
## h_index:CDBN_IDStarlight 14.181 4.743 2.990 0.00373 **
## h_index:CDBN_IDUI537 34.233 4.336 7.894 1.51e-11 ***
## h_index:CDBN_IDYolano 31.669 4.864 6.511 6.61e-09 ***
## h_index:CDBN_ID93_207G 42.972 4.403 9.760 3.65e-15 ***
## h_index:CDBN_ID97_409P 44.451 5.019 8.856 2.05e-13 ***
## h_index:CDBN_IDBuster 53.953 3.995 13.504 < 2e-16 ***
## h_index:CDBN_IDMax 46.757 3.832 12.201 < 2e-16 ***
## h_index:CDBN_IDND307 47.813 4.098 11.666 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 145.2 on 78 degrees of freedom
## (20 observations deleted due to missingness)
## Multiple R-squared: 0.9396, Adjusted R-squared: 0.9319
## F-statistic: 121.3 on 10 and 78 DF, p-value: < 2.2e-16
summary(four_sites_train)
##
## Call:
## lm(formula = y ~ h_index + h_index:CDBN_ID, data = filter(fwdf,
## Data_type == "Training" & Location_code != "MOCO"))
##
## Residuals:
## Min 1Q Median 3Q Max
## -2075.41 -253.20 24.06 414.19 1413.56
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2770.66 126.00 21.989 < 2e-16 ***
## h_index -45.85 16.50 -2.778 0.007463 **
## h_index:CDBN_IDAztec 56.52 19.29 2.931 0.004919 **
## h_index:CDBN_IDStarlight 50.39 21.38 2.357 0.022037 *
## h_index:CDBN_IDUI537 127.77 34.52 3.702 0.000497 ***
## h_index:CDBN_IDYolano 82.94 21.95 3.778 0.000391 ***
## h_index:CDBN_ID93_207G 59.40 19.98 2.974 0.004360 **
## h_index:CDBN_ID97_409P 87.63 23.21 3.775 0.000394 ***
## h_index:CDBN_IDBuster 74.64 18.60 4.014 0.000182 ***
## h_index:CDBN_IDMax 84.83 17.45 4.862 1.01e-05 ***
## h_index:CDBN_IDND307 92.32 19.50 4.735 1.58e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 651.2 on 55 degrees of freedom
## (43 observations deleted due to missingness)
## Multiple R-squared: 0.5072, Adjusted R-squared: 0.4176
## F-statistic: 5.662 on 10 and 55 DF, p-value: 9.043e-06