This file will take Jeff’s latest CDBN experiment data (V1.5) from 2017-12-21 and do the following:
CDBN_ID’s and Seq_ID’s, as well as some other variety information that I have, and save a new Germplasm sheet: Germplasm_ahmKinshipCDBN_ID and Seq_ID columns from the Germplasm_ahm sheet to Jeff’s Phenotypes sheet so that it will be easier to combine the phenotypic data with the genotype data in the future. I save a new Phenotypes sheet: Phenotypes_ahmlocation-year combinations don’t have missing data. I save a new Locations sheet: Locations_ahm which has just the 77 unique locations, plus I save a Locations_by_Year sheet that has all ~690 Location by year combinations, which is the way Jeff had the data.CDBN-metadata_V1.5.1.xlsx. One file has all the phenotype data, which can be related back to the metadata using the Location_code, Year, CDBN_ID and/or Seq_ID keys, and is called CDBN-phenotypes_V1.5.1.xlsx. Finally I may have a weather data file called CDBN-weather_V1.5.1.xlsx.wb <- loadWorkbook("All_experiments_V1.5.1.xlsx")
lst = readWorksheet(wb, sheet = getSheets(wb))
metadata1 <- loadWorkbook("../../CDBN Variety Info/CDBN_Metadata_sequencing_2017-12-08.xlsx")
metadata = readWorksheet(metadata1, sheet = getSheets(metadata1))
all_wea <- readRDS("../../CDBN Site Data/Weather Data/R_Weather/All_Weather_Data.rds")
wbp <- loadWorkbook("../../CDBN Site Data/Weather Data/R_Weather/R-Weather_CDBN-Planting-Dates-and-Est_1975-2015_V3.xlsx")
Plantings_est <- as_tibble(readWorksheet(wbp, sheet = getSheets(wbp)))
lst$GermplasmI added the echo = FALSE parameter to most of the code for this step, so the html file does not show several hundred regular expression substitutions that generated the CDBN_ID’s from the Genotypes. If you want to see these, refer to the .Rmd document.
Here are four punctuation fixes, as examples:
Germplasm$CDBN_ID <- gsub("-","_", Germplasm$CDBN_ID)
Germplasm$CDBN_ID <- gsub(":","_", Germplasm$CDBN_ID)
Germplasm$CDBN_ID <- gsub(" ","_", Germplasm$CDBN_ID)
Germplasm$CDBN_ID <- gsub("Long's_Peak","Longs_Peak", Germplasm$CDBN_ID)
Germplasm$CDBN_ID <- gsub("^ICB_10_5$", "ICB_10", Germplasm$CDBN_ID)
Germplasm$CDBN_ID <- gsub("^CO_75511$", "Grand Mesa", Germplasm$CDBN_ID)
kinship <- read.table("../../CDBN Genomics/CDBN_SNPs/8_Tassel/Filter_159Ct_MAF1per_CDBN_001to324_kinship.txt", skip = 3, sep = "\t")
kinship <- kinship %>%
separate(V1, into = c("Seq_ID", "CDBN_ID"), sep = 9)
kinship$Seq_ID <- gsub("_$", "", kinship$Seq_ID)
colnames(kinship) <- c("Seq_ID","CDBN_ID", kinship$CDBN_ID)
kinship %>%
anti_join(Germplasm, by = "CDBN_ID") %>%
dplyr::select(CDBN_ID)
## CDBN_ID
## 1 UNS_117_2
## 2 Canario
## 3 Canario707_2
## 4 L94C356_2
## 5 Stampede_2
## 6 Beryl_2
kinship <- kinship %>%
dplyr::select(-UNS_117_2, -Canario, -Canario707_2, -L94C356_2, -Stampede_2, -Beryl_2) %>%
filter(!(CDBN_ID %in% c("UNS_117_2", "Canario", "Canario707_2", "L94C356_2", "Stampede_2", "Beryl_2")))
# write.table(kinship, "../../CDBN Genomics/CDBN SNPs/8_Tassel/Filter_159Ct_MAF1per_CDBN_312_kinship.txt", sep = "\t", quote = FALSE, row.names = FALSE, col.names = TRUE)
The kinship table below is empty, which means that there are no remaining CDBN_ID’s in my sequenced kinship matrix or metadata for sequenced lines that are unaccounted for.
kinship %>%
anti_join(Germplasm, by = "CDBN_ID") %>%
dplyr::select(CDBN_ID)
## [1] CDBN_ID
## <0 rows> (or 0-length row.names)
metadata %>%
anti_join(Germplasm, by = "CDBN_ID") %>%
dplyr::select(CDBN_ID)
## [1] CDBN_ID
## <0 rows> (or 0-length row.names)
Germplasm_ahm <- Germplasm %>%
left_join(metadata, by = "CDBN_ID") %>%
dplyr::select(Genotype:GBS_barcode_length.2, MDP_ID, Flower_color:Stem_thickness_below_cotyledons_width.1, Hue:Width_10..cm., Gene_pool:Market.class, Seed.From.1:Seed.From.2, BEANCAP_ID:Synonym.10)
Germplasm_small <- Germplasm_ahm %>%
dplyr::select(Genotype, CDBN_ID, Seq_ID, Gene_pool, Race, Market.class)
Phenotypes <- lst$Phenotypes
Phenotypes_ahm <- Phenotypes %>%
left_join(Germplasm_small, by = "Genotype") %>%
dplyr::select(Genotype, CDBN_ID, Seq_ID, Location_code, Year, Gene_pool, Race, Market.class, everything())
This step is different from what I did in V1.3, because my cleanup of the weather data has resulted in a few more location code changes that I should make for the Locations and Locations_by_year worksheets.
And this is different from what I did in November, because Jeff sent me a new file for the Location data: CDBN_weather_stations_V1.0_GL_JW1.xlsx which has new complete latitudes, longitudes, and elevations for 83 sites, finished in early November by Greg Lohrey (Thanks!). So that’s all good,now I will:
loc_transl_wb <- loadWorkbook("../CDBN Phenotype Summaries/Location_code_translations_w_JW_check.xlsx")
loc_climatebin = readWorksheet(loc_transl_wb, sheet = getSheets(loc_transl_wb))
small_climatebin <- loc_climatebin %>%
dplyr::select(Location_code, State, Climate_bin) # All we really care about are State and Climate_bin here, and we need to be able to link it to the other two Location tables by Location_code.
weather_station_wb <- loadWorkbook("../CDBN Phenotype Summaries/CDBN_weather_stations_V1.0_GL_JW1.xlsx")
station_lst = readWorksheet(weather_station_wb, sheet = getSheets(weather_station_wb))
stations <- station_lst$CDBN_locations
Locations <- lst$Locations
weather_climate <- loc_climatebin %>%
left_join(stations, by = "Location_code")
Locations_ahm <- Locations %>%
left_join(weather_climate, by = c("Location_code")) %>%
dplyr::select(Location_code, Climate_bin, Latitude, Lat_best, Longitude, Long_best, Elev_best, everything()) %>%
mutate_if(is.numeric, funs(na_if(., -99))) %>%
arrange(desc(Latitude)) %>%
filter(Location_code != "")
Unfortunately, this give me 11 year*location combinations without latitude and longitude information, so I need to go back through and clean these up. I suspect they are mostly merging issues due to errors in the original datasets.
But, because the unique locations don’t have any missing values for lat/long, I think that these can be filled in from other datapoints that share the same location code. I need to fix Lat/Long for the 11 year*location combinations.
Loc_to_fix <- Locations_ahm %>%
filter(is.na(Lat_best))
sort(unique(Loc_to_fix$Location_code)) # "CAD2" "COF2" "IDK2" "MOPO" "TXCS" "TXMU" "TXVE"
## [1] "IDK2" "MISA" "MNRA"
# MISA and MNRA
CAD2Fixing CAD2 to have the same lat/long as CADV.
#Locations_ahm %>%
# filter(Location_code %in% c("CAD2", "CADV"))
Locations_ahm <- Locations_ahm %>%
# filter(Location_code %in% c("CADV","CAD2")) %>%
mutate(Latitude = ifelse(Location_code == "CAD2",
38.53782,
Latitude),
Longitude = ifelse(Location_code == "CAD2",
-121.757,
Longitude),
Institution = ifelse(Location_code %in% c("CADV","CAD2"),
"UC Davis",
Institution),
Location_code = ifelse(Location_code == "CAD2",
"CADV",
Location_code)
)
COFCSome COFC are messed up because the locations didn’t merge correctly.
#Locations_ahm %>%
# filter(Location_code %in% c("COF2", "COFC"))
Locations_ahm <- Locations_ahm %>%
# filter(Location_code %in% c("COFC","COF2")) %>%
mutate(Latitude = ifelse(Location_code == "COF2",
40.65167,
Latitude),
Longitude = ifelse(Location_code == "COF2",
-104.9992,
Longitude),
Location.x = ifelse(Location_code %in% c("COFC","COF2"),
"Ft. Collins",
Location.x),
)
IDK2 should have the same lat/long as IDKI.
#Locations_ahm %>%
# filter(Location_code %in% c("IDKI", "IDK2"))
# Just give IDK2 IDKI's data.
Locations_ahm <- Locations_ahm %>%
mutate(Latitude = ifelse(Location_code == "IDK2",
42.55103,
Latitude),
Lat_best = ifelse(Location_code == "IDK2",
42.55103,
Lat_best),
Longitude = ifelse(Location_code == "IDK2",
-114.34,
Longitude),
Long_best =ifelse(Location_code == "IDK2",
-114.34,
Long_best),
Climate_bin = ifelse(Location_code == "IDK2",
"RockiesWest",
Climate_bin),
Elev_best = ifelse(Location_code == "IDK2",
1200,
Elev_best),
)
Fix issues where only one of the two lat/long columns - site or weather station - have info. To fix this, I am using the lat/long information from the other column for now, that should be good enough.
Locations_ahm <- Locations_ahm %>%
mutate(Latitude = ifelse(Location_code %in% c("MOPO", "TXCS", "TXMU", "TXVE"),
Lat_best,
Latitude),
Longitude = ifelse(Location_code %in% c("MOPO", "TXCS", "TXMU", "TXVE"),
Long_best,
Longitude)
)
Locations_ahm <- Locations_ahm %>%
mutate(Lat_best = ifelse(Location_code %in% c("MISA", "MNRA"),
Latitude,
Lat_best),
Long_best = ifelse(Location_code %in% c("MISA", "MNRA"),
Longitude,
Long_best)
)
Locations_ahm has all of the site*year combinations, which is a little too much information, really, given that the latitudes, longitudes, and elevations don’t change by the year.
Locations_unique <- Locations_ahm %>%
group_by(Location_code) %>%
mutate(Num_Year = n()) %>%
filter(row_number(State.x) == 1) %>%
arrange(Location_code) %>%
mutate(First_Year = Year,
State = State.x,
Location = Location.x) %>%
dplyr::select(Location_code, Climate_bin, State, Latitude, Longitude, Lat_best, Long_best, Elev_best, Location, First_Year, Num_Year, Institution, Research_sta, Soil_series, Soil_class, NOAA, Notes)
Locations_ahm sheetThe Locations_ahm sheet has the Locations_by_year data.
Add estimated or actual planting dates to this sheet, and ensure names are synced between the planting data estimates file and the Locations_by_Year sheet.
First, which are the problem locations with estimated planting dates that don’t match my locations_by_years table?
Plantings_est %>%
anti_join(Locations_ahm, by = c("Location_code", "Year")) %>%
filter(!(Location_code %in% c("COGR", "MNRO", "NEMI", "NYLE", "NYRO", "NYVA"))) # CAD2 MIFR MIMO NEMI NES3 WAPR
## # A tibble: 38 x 7
## Location_code location_country loc_2nd_level Location Year
## <chr> <chr> <chr> <chr> <dbl>
## 1 ABBI CAN Alberta Bow Island 2002
## 2 ABLE CAN Alberta Lethbridge 2002
## 3 CAD2 USA California Davis 2007
## 4 COFC USA Colorado Ft. Collins 2002
## 5 COFR USA Colorado Fruita 2002
## 6 IDPA USA Idaho Parma 2002
## 7 MBMO CAN Manitoba Morden 2002
## 8 MIFR USA Michigan Frankenmuth 2009
## 9 MIFR USA Michigan Frankenmuth 2010
## 10 MIFR USA Michigan Frankenmuth 2011
## # ... with 28 more rows, and 2 more variables: Planting_date_est <dttm>,
## # PD_check <chr>
Locations_ahm %>%
anti_join(Plantings_est, by = c("Location_code", "Year")) %>%
dplyr::select(Location_code, Year) # WARO MISA MIEN MIEL (MIEL was the site of seed quality tests not growth).
## Location_code Year
## 1 WARO 1981
## 2 WARO 1982
## 3 WARO 1983
## 4 WARO 1984
## 5 WARO 1985
## 6 WARO 1986
## 7 WARO 1987
## 8 WARO 1988
## 9 WARO 1990
## 10 WARO 1991
## 11 WARO 1992
## 12 WARO 1993
## 13 WARO 1994
## 14 WARO 1999
## 15 MISA 2002
## 16 MISA 2009
## 17 MISA 2010
## 18 MISA 2011
## 19 MISA 2012
## 20 MISA 2013
## 21 MISA 2014
## 22 MISA 2015
## 23 MIEN 2002
## 24 MIEN 2004
## 25 MIEN 2006
## 26 MIEN 2011
## 27 MIEL 1999
# MISA 2002 MIEN 2002 MIEL 1999 (MIEL was the site of seed quality tests not growth).
Now change the problem locations to their synonymous sites according to the “Location_code_translations_w_JW_check.xlsx” file.
I also make a DSSATwthcode variable which should help with merging with the weather data. Mostly, sites that end in “2” are renamed. Usually these sites were treatment plots at the same site as the one with the first three letters of the code, so the weather at the treatment plot is the same, only the treatment - e.g. white mold control, drought - differs.
Plantings_est <- Plantings_est %>%
dplyr::select(Location_code, Year, Planting_date_est, PD_check) %>%
mutate(Location_code = ifelse(Location_code == "WAPR",
"WARO",
Location_code),
Location_code = ifelse(Location_code == "MIFR",
"MISA",
Location_code),
Location_code = ifelse(Location_code == "MIMO",
"MIEN",
Location_code),
DSSATwthCode = ifelse(Location_code == "CAD2",
"CADV",
Location_code),
DSSATwthCode = ifelse(Location_code == "MIFR",
"MISA",
DSSATwthCode),
DSSATwthCode = ifelse(Location_code == "MIMO",
"MIEN",
DSSATwthCode),
DSSATwthCode = ifelse(Location_code == "COF2",
"COFC",
DSSATwthCode),
DSSATwthCode = ifelse(Location_code == "IDK2",
"IDKI",
DSSATwthCode),
DSSATwthCode = ifelse(Location_code == "IDP2",
"IDPA",
DSSATwthCode),
DSSATwthCode = ifelse(Location_code == "NES2",
"NESB",
DSSATwthCode),
DSSATwthCode = ifelse(Location_code == "NES3",
"NEMI",
DSSATwthCode),
DSSATwthCode = ifelse(Location_code == "WAPR",
"WARO",
DSSATwthCode)
)
Plantings_est <- Plantings_est %>%
mutate(PD_check = ifelse(PD_check == "M",
"Not_estimated",
PD_check),
PD_check = ifelse(PD_check == "R",
"Known",
PD_check),
PD_check = ifelse(PD_check == "E",
"Estimated",
PD_check)
)
Now that the problem locations are fixed or dropped, join the estimated planting dates to the location dataset.
Also, clean the planting date estimate so that R knows it is a date.
Locations_ahm <- Locations_ahm %>%
left_join(Plantings_est, by = c("Location_code", "Year")) %>%
dplyr::select(Location_code:Year, DSSATwthCode, Planting_date_est, PD_check, everything())
Locations_ahm$Planting_date_est <- as.Date(Locations_ahm$Planting_date_est)
Ensure weather data can be merged with the Locations_ahm sheet. To this end, make a Location_code_wea (why not call itDSSATwthCode which is the key in all_wea, that will make things simple) that is used to merge the Locations_ahm data frame with the all_wea data frame. The Location_codes sometimes encode unique information (CADV vs CAD2, maybe) that shouldn’t be lost, but the sites still experience the same weather in some cases. Did that two code chunks ago.
Which weather can’t be merged with my locations_by_years table? (Locations_ahm)
all_wea$Year <- as.numeric(all_wea$Year)
# Locations_ahm %>%
# anti_join(all_wea, by = c("DSSATwthCode", "Year")) %>%
# arrange(Location_code, Year)
Ok, there are quite a lot of errors here, and Jeff is sending me new files for some. So just make v1.5.1 for now, and v1.5.2 can have weather data.
Assign Germplasm_ahm to lst$Germplasm, Phenotypes_ahm to lst$Phenotypes, Locations_ahm to lst$Locations_by_Years, Locations_unique to lst$Locations_ahm. Save two new xlsx files (both V1.5.1) for Jeff to look over. One file has all the germplasm and location-year metadata, including the weather data, if I can, and is called CDBN-metadata_V1.5.1.xlsx. One file has all the phenotype data, which can be related back to the metadata using the Location_code, Year, CDBN_ID and/or Seq_ID keys, and is called CDBN-phenotypes_V1.5.1.xlsx. Finally I may have a weather data file called CDBN-weather_V1.5.1.xlsx… but as of now there are too many rows in the table for an Excel file (more than 1,048,575… so just keep all_wea as a rds object for now.)
NB: I commented out this code because it only needed to run once to generate the new Excel file.
# createSheet(wb, name = "Kinship")
# createSheet(wb, name = "Germplasm_ahm")
# createSheet(wb, name = "Locations_by_Years")
# createSheet(wb, name = "Locations_ahm")
# writeWorksheet(wb, Germplasm_ahm, sheet = "Germplasm_ahm")
# writeWorksheet(wb, Locations_ahm, sheet = "Locations_by_Years")
# writeWorksheet(wb, Locations_unique, sheet = "Locations_ahm")
# writeWorksheet(wb, kinship, sheet = "Kinship")
# saveWorkbook(wb)
# wb1 <- loadWorkbook("CDBN-metadata_V1.5.1.xlsx", create = TRUE)
# createSheet(wb1, name = "Kinship")
# createSheet(wb1, name = "Germplasm_ahm")
# createSheet(wb1, name = "Locations_by_Years")
# createSheet(wb1, name = "Locations_ahm")
# writeWorksheet(wb1, Germplasm_ahm, sheet = "Germplasm_ahm")
# writeWorksheet(wb1, Locations_ahm, sheet = "Locations_by_Years")
# writeWorksheet(wb1, Locations_unique, sheet = "Locations_ahm")
# writeWorksheet(wb1, kinship, sheet = "Kinship")
# saveWorkbook(wb1)
# wb2 <- loadWorkbook("CDBN-phenotypes_V1.5.1.xlsx", create = TRUE)
# createSheet(wb2, name = "Phenotypes_ahm")
# writeWorksheet(wb2, Phenotypes_ahm, sheet = "Phenotypes_ahm")
# saveWorkbook(wb2)
# ?saveWorkbook
# wb3 <- loadWorkbook("CDBN-weather_V1.5.1.xlsx", create = TRUE)
#createSheet(wb3, name = "Daily_weather")
#writeWorksheet(wb3, all_wea, sheet = "Daily_weather")
#saveWorkbook(wb3)