All-experiments_V1.5

This file will take Jeff’s latest CDBN experiment data (V1.5) from 2017-12-21 and do the following:

Make some additions to the Germplasm sheet. I add my CDBN_ID’s and Seq_ID’s, as well as some other variety information that I have, and save a new Germplasm sheet: Germplasm_ahm
Make a Kinship sheet (V1.3): Add a kinship sheet with a Centered_IBS matrix for all 312 varieties I have sequence data for: Kinship
Make some additions to the Phenotypes sheet. I add my CDBN_ID and Seq_ID columns from the Germplasm_ahm sheet to Jeff’s Phenotypes sheet so that it will be easier to combine the phenotypic data with the genotype data in the future. I save a new Phenotypes sheet: Phenotypes_ahm
Make some additions to the Locations sheet. I add the Climate_bin information to the Locations sheet and check that location-year combinations don’t have missing data. I save a new Locations sheet: Locations_ahm which has just the 77 unique locations, plus I save a Locations_by_Year sheet that has all ~690 Location by year combinations, which is the way Jeff had the data.
New V1.5: Make some additions to the Locations_by_Year sheet. Add estimated or actual planting dates to this sheet, and ensure names are synced between the planting data estimates file and the Locations_by_Year sheet. Clean up weather data and ensure it can be merged with the Locations_by_Year sheet.
New V1.5: Save two new xlsx files (both V1.5.1) for Jeff to look over. One file has all the germplasm and location-year metadata, including the weather data, if I can, and is called CDBN-metadata_V1.5.1.xlsx. One file has all the phenotype data, which can be related back to the metadata using the Location_code, Year, CDBN_ID and/or Seq_ID keys, and is called CDBN-phenotypes_V1.5.1.xlsx. Finally I may have a weather data file called CDBN-weather_V1.5.1.xlsx.

wb <- loadWorkbook("All_experiments_V1.5.1.xlsx")
lst = readWorksheet(wb, sheet = getSheets(wb))
metadata1 <- loadWorkbook("../../CDBN Variety Info/CDBN_Metadata_sequencing_2017-12-08.xlsx")
metadata = readWorksheet(metadata1, sheet = getSheets(metadata1))
all_wea <- readRDS("../../CDBN Site Data/Weather Data/R_Weather/All_Weather_Data.rds")
wbp <- loadWorkbook("../../CDBN Site Data/Weather Data/R_Weather/R-Weather_CDBN-Planting-Dates-and-Est_1975-2015_V3.xlsx")
Plantings_est <- as_tibble(readWorksheet(wbp, sheet = getSheets(wbp)))

1. Add CDBN_ID’s to `lst$Germplasm`

I added the echo = FALSE parameter to most of the code for this step, so the html file does not show several hundred regular expression substitutions that generated the CDBN_ID’s from the Genotypes. If you want to see these, refer to the .Rmd document.

Here are four punctuation fixes, as examples:

Germplasm$CDBN_ID <- gsub("-","_", Germplasm$CDBN_ID)
Germplasm$CDBN_ID <- gsub(":","_", Germplasm$CDBN_ID)
Germplasm$CDBN_ID <- gsub(" ","_", Germplasm$CDBN_ID)
Germplasm$CDBN_ID <- gsub("Long's_Peak","Longs_Peak", Germplasm$CDBN_ID)
Germplasm$CDBN_ID <- gsub("^ICB_10_5$", "ICB_10", Germplasm$CDBN_ID)
Germplasm$CDBN_ID <- gsub("^CO_75511$", "Grand Mesa", Germplasm$CDBN_ID)

1a. Clean the kinship matrix generated from the GBS data

kinship <- read.table("../../CDBN Genomics/CDBN_SNPs/8_Tassel/Filter_159Ct_MAF1per_CDBN_001to324_kinship.txt", skip = 3, sep = "\t")

kinship <- kinship %>%
  separate(V1, into = c("Seq_ID", "CDBN_ID"), sep = 9) 
kinship$Seq_ID <- gsub("_$", "", kinship$Seq_ID)
colnames(kinship) <- c("Seq_ID","CDBN_ID", kinship$CDBN_ID)

kinship %>%
  anti_join(Germplasm, by = "CDBN_ID") %>%
  dplyr::select(CDBN_ID)

##        CDBN_ID
## 1    UNS_117_2
## 2      Canario
## 3 Canario707_2
## 4    L94C356_2
## 5   Stampede_2
## 6      Beryl_2

kinship <- kinship %>%
  dplyr::select(-UNS_117_2, -Canario, -Canario707_2, -L94C356_2, -Stampede_2, -Beryl_2) %>%
  filter(!(CDBN_ID %in% c("UNS_117_2", "Canario", "Canario707_2", "L94C356_2", "Stampede_2", "Beryl_2")))

# write.table(kinship, "../../CDBN Genomics/CDBN SNPs/8_Tassel/Filter_159Ct_MAF1per_CDBN_312_kinship.txt", sep = "\t", quote = FALSE, row.names = FALSE, col.names = TRUE)

1b. Test that the second round of punctuation fixes actually ties the CDBN_ID keys to the Germplasm table.

The kinship table below is empty, which means that there are no remaining CDBN_ID’s in my sequenced kinship matrix or metadata for sequenced lines that are unaccounted for.

kinship %>% 
  anti_join(Germplasm, by = "CDBN_ID") %>%
  dplyr::select(CDBN_ID)

## [1] CDBN_ID
## <0 rows> (or 0-length row.names)

metadata %>% 
  anti_join(Germplasm, by = "CDBN_ID") %>%
  dplyr::select(CDBN_ID)

## [1] CDBN_ID
## <0 rows> (or 0-length row.names)

2. Join Germplasm and Metadata

Germplasm_ahm <- Germplasm %>% 
  left_join(metadata, by = "CDBN_ID") %>%
  dplyr::select(Genotype:GBS_barcode_length.2, MDP_ID, Flower_color:Stem_thickness_below_cotyledons_width.1, Hue:Width_10..cm., Gene_pool:Market.class, Seed.From.1:Seed.From.2, BEANCAP_ID:Synonym.10)

3. Make a Germplasm table to add to the Phenotypes worksheet

Germplasm_small <- Germplasm_ahm %>%
  dplyr::select(Genotype, CDBN_ID, Seq_ID, Gene_pool, Race, Market.class)

3a. Join small Genotypes table to Phenotypes worksheet

Phenotypes <- lst$Phenotypes

Phenotypes_ahm <- Phenotypes %>%
  left_join(Germplasm_small, by = "Genotype") %>%
  dplyr::select(Genotype, CDBN_ID, Seq_ID, Location_code, Year, Gene_pool, Race, Market.class, everything())

4. Add Climate_bin to Locations worksheet

This step is different from what I did in V1.3, because my cleanup of the weather data has resulted in a few more location code changes that I should make for the Locations and Locations_by_year worksheets.

And this is different from what I did in November, because Jeff sent me a new file for the Location data: CDBN_weather_stations_V1.0_GL_JW1.xlsx which has new complete latitudes, longitudes, and elevations for 83 sites, finished in early November by Greg Lohrey (Thanks!). So that’s all good,now I will:

combine it with the Locations in the All_experiments datasheet
combine that with the location code translations that I have come up with for my climate_bins.

loc_transl_wb <- loadWorkbook("../CDBN Phenotype Summaries/Location_code_translations_w_JW_check.xlsx")
loc_climatebin = readWorksheet(loc_transl_wb, sheet = getSheets(loc_transl_wb))
small_climatebin <- loc_climatebin %>%
  dplyr::select(Location_code, State, Climate_bin) # All we really care about are State and Climate_bin here, and we need to be able to link it to the other two Location tables by Location_code.

weather_station_wb <- loadWorkbook("../CDBN Phenotype Summaries/CDBN_weather_stations_V1.0_GL_JW1.xlsx")
station_lst = readWorksheet(weather_station_wb, sheet = getSheets(weather_station_wb))
stations <- station_lst$CDBN_locations

Locations <- lst$Locations

weather_climate <- loc_climatebin %>%
  left_join(stations, by = "Location_code")

Locations_ahm <- Locations %>%
  left_join(weather_climate, by = c("Location_code")) %>%
  dplyr::select(Location_code, Climate_bin, Latitude, Lat_best, Longitude, Long_best, Elev_best, everything()) %>%
  mutate_if(is.numeric, funs(na_if(., -99))) %>%
  arrange(desc(Latitude)) %>%
  filter(Location_code != "")

Unfortunately, this give me 11 year*location combinations without latitude and longitude information, so I need to go back through and clean these up. I suspect they are mostly merging issues due to errors in the original datasets.

4a. Some locations don’t have associated Lat and Long

But, because the unique locations don’t have any missing values for lat/long, I think that these can be filled in from other datapoints that share the same location code. I need to fix Lat/Long for the 11 year*location combinations.

Loc_to_fix <- Locations_ahm %>%
  filter(is.na(Lat_best))

sort(unique(Loc_to_fix$Location_code)) # "CAD2" "COF2" "IDK2" "MOPO" "TXCS" "TXMU" "TXVE"

## [1] "IDK2" "MISA" "MNRA"

# MISA and MNRA

4b. Fix `CAD2`

Fixing CAD2 to have the same lat/long as CADV.

#Locations_ahm %>%
#  filter(Location_code %in% c("CAD2", "CADV"))

Locations_ahm <- Locations_ahm %>%
#  filter(Location_code %in% c("CADV","CAD2")) %>%
    mutate(Latitude = ifelse(Location_code == "CAD2",
                           38.53782,
                           Latitude),
         Longitude = ifelse(Location_code == "CAD2",
                            -121.757,
                            Longitude),
         Institution = ifelse(Location_code %in% c("CADV","CAD2"),
                              "UC Davis",
                              Institution),
         Location_code = ifelse(Location_code == "CAD2",
                                "CADV",
                                Location_code)
         )

4c. Fix `COFC`

Some COFC are messed up because the locations didn’t merge correctly.

#Locations_ahm %>%
#  filter(Location_code %in% c("COF2", "COFC"))

Locations_ahm <- Locations_ahm %>%
#  filter(Location_code %in% c("COFC","COF2")) %>%
    mutate(Latitude = ifelse(Location_code == "COF2",
                           40.65167,
                           Latitude),
         Longitude = ifelse(Location_code == "COF2",
                            -104.9992,
                            Longitude),
         Location.x = ifelse(Location_code %in% c("COFC","COF2"),
                              "Ft. Collins",
                              Location.x),
         )

4d. Fix IDK2

IDK2 should have the same lat/long as IDKI.

#Locations_ahm %>%
#  filter(Location_code %in% c("IDKI", "IDK2"))

# Just give IDK2 IDKI's data.

Locations_ahm <- Locations_ahm %>%
  mutate(Latitude = ifelse(Location_code == "IDK2",
                           42.55103,
                           Latitude),
         Lat_best = ifelse(Location_code == "IDK2",
                           42.55103,
                           Lat_best),
         Longitude = ifelse(Location_code == "IDK2",
                            -114.34,
                            Longitude),
         Long_best =ifelse(Location_code == "IDK2",
                            -114.34,
                            Long_best),
         Climate_bin = ifelse(Location_code == "IDK2",
                            "RockiesWest",
                            Climate_bin),
         Elev_best = ifelse(Location_code == "IDK2",
                            1200,
                            Elev_best),
               )

4e. Supply missing Lat/Long or Lat_best/Long_best

Fix issues where only one of the two lat/long columns - site or weather station - have info. To fix this, I am using the lat/long information from the other column for now, that should be good enough.

Locations_ahm <- Locations_ahm %>%
  mutate(Latitude = ifelse(Location_code %in% c("MOPO", "TXCS", "TXMU", "TXVE"),
                           Lat_best,
                           Latitude),
         Longitude = ifelse(Location_code %in% c("MOPO", "TXCS", "TXMU", "TXVE"),
                           Long_best,
                           Longitude)
  )
      

Locations_ahm <- Locations_ahm %>%
  mutate(Lat_best = ifelse(Location_code %in% c("MISA", "MNRA"),
                           Latitude,
                           Lat_best),
         Long_best = ifelse(Location_code %in% c("MISA", "MNRA"),
                           Longitude,
                           Long_best)
  )

5. Make a unique Locations sheet.

Locations_ahm has all of the site*year combinations, which is a little too much information, really, given that the latitudes, longitudes, and elevations don’t change by the year.

Locations_unique <- Locations_ahm %>%
  group_by(Location_code) %>% 
  mutate(Num_Year = n()) %>%
  filter(row_number(State.x) == 1) %>%
  arrange(Location_code) %>%
  mutate(First_Year = Year,
         State = State.x,
         Location = Location.x) %>%
  dplyr::select(Location_code, Climate_bin, State, Latitude, Longitude, Lat_best, Long_best, Elev_best, Location, First_Year, Num_Year, Institution, Research_sta, Soil_series, Soil_class, NOAA, Notes)

6. Make changes to the `Locations_ahm` sheet

The Locations_ahm sheet has the Locations_by_year data.

Add estimated or actual planting dates to this sheet, and ensure names are synced between the planting data estimates file and the Locations_by_Year sheet.

Clean up estimated planting date tibble

First, which are the problem locations with estimated planting dates that don’t match my locations_by_years table?

Plantings_est %>%
  anti_join(Locations_ahm, by = c("Location_code", "Year")) %>%
  filter(!(Location_code %in% c("COGR", "MNRO", "NEMI", "NYLE", "NYRO", "NYVA"))) # CAD2 MIFR MIMO NEMI NES3 WAPR

## # A tibble: 38 x 7
##    Location_code location_country loc_2nd_level    Location  Year
##            <chr>            <chr>         <chr>       <chr> <dbl>
##  1          ABBI              CAN       Alberta  Bow Island  2002
##  2          ABLE              CAN       Alberta  Lethbridge  2002
##  3          CAD2              USA    California       Davis  2007
##  4          COFC              USA      Colorado Ft. Collins  2002
##  5          COFR              USA      Colorado      Fruita  2002
##  6          IDPA              USA         Idaho       Parma  2002
##  7          MBMO              CAN      Manitoba      Morden  2002
##  8          MIFR              USA      Michigan Frankenmuth  2009
##  9          MIFR              USA      Michigan Frankenmuth  2010
## 10          MIFR              USA      Michigan Frankenmuth  2011
## # ... with 28 more rows, and 2 more variables: Planting_date_est <dttm>,
## #   PD_check <chr>

Locations_ahm %>%
  anti_join(Plantings_est, by = c("Location_code", "Year")) %>%
  dplyr::select(Location_code, Year) # WARO MISA MIEN MIEL (MIEL was the site of seed quality tests not growth).

##    Location_code Year
## 1           WARO 1981
## 2           WARO 1982
## 3           WARO 1983
## 4           WARO 1984
## 5           WARO 1985
## 6           WARO 1986
## 7           WARO 1987
## 8           WARO 1988
## 9           WARO 1990
## 10          WARO 1991
## 11          WARO 1992
## 12          WARO 1993
## 13          WARO 1994
## 14          WARO 1999
## 15          MISA 2002
## 16          MISA 2009
## 17          MISA 2010
## 18          MISA 2011
## 19          MISA 2012
## 20          MISA 2013
## 21          MISA 2014
## 22          MISA 2015
## 23          MIEN 2002
## 24          MIEN 2004
## 25          MIEN 2006
## 26          MIEN 2011
## 27          MIEL 1999

# MISA 2002 MIEN 2002 MIEL 1999 (MIEL was the site of seed quality tests not growth).

Now change the problem locations to their synonymous sites according to the “Location_code_translations_w_JW_check.xlsx” file.

I also make a DSSATwthcode variable which should help with merging with the weather data. Mostly, sites that end in “2” are renamed. Usually these sites were treatment plots at the same site as the one with the first three letters of the code, so the weather at the treatment plot is the same, only the treatment - e.g. white mold control, drought - differs.

Plantings_est <- Plantings_est %>%
  dplyr::select(Location_code, Year, Planting_date_est, PD_check) %>%
    mutate(Location_code = ifelse(Location_code == "WAPR",
                           "WARO",
                           Location_code),
           Location_code = ifelse(Location_code == "MIFR",
                           "MISA",
                           Location_code),
           Location_code = ifelse(Location_code == "MIMO",
                           "MIEN",
                           Location_code),
           DSSATwthCode = ifelse(Location_code == "CAD2",
                           "CADV",
                           Location_code),
           DSSATwthCode = ifelse(Location_code == "MIFR",
                           "MISA",
                           DSSATwthCode),
           DSSATwthCode = ifelse(Location_code == "MIMO",
                           "MIEN",
                           DSSATwthCode),
           DSSATwthCode = ifelse(Location_code == "COF2",
                          "COFC",
                          DSSATwthCode),
           DSSATwthCode = ifelse(Location_code == "IDK2",
                          "IDKI",
                          DSSATwthCode),
           DSSATwthCode = ifelse(Location_code == "IDP2",
                          "IDPA",
                          DSSATwthCode),
           DSSATwthCode = ifelse(Location_code == "NES2",
                          "NESB",
                          DSSATwthCode),
           DSSATwthCode = ifelse(Location_code == "NES3",
                           "NEMI",
                           DSSATwthCode),
           DSSATwthCode = ifelse(Location_code == "WAPR",
                           "WARO",
                           DSSATwthCode)
    )

Plantings_est <- Plantings_est %>%
  mutate(PD_check = ifelse(PD_check == "M",
                           "Not_estimated",
                           PD_check),
         PD_check = ifelse(PD_check == "R",
                           "Known",
                           PD_check),
         PD_check = ifelse(PD_check == "E",
                           "Estimated",
                           PD_check)
         )

Now that the problem locations are fixed or dropped, join the estimated planting dates to the location dataset.

Also, clean the planting date estimate so that R knows it is a date.

Locations_ahm <- Locations_ahm %>%
  left_join(Plantings_est, by = c("Location_code", "Year")) %>%
  dplyr::select(Location_code:Year, DSSATwthCode, Planting_date_est, PD_check, everything())

Locations_ahm$Planting_date_est <- as.Date(Locations_ahm$Planting_date_est)

Ensure weather data can be merged with the Locations_ahm sheet. To this end, make a Location_code_wea (why not call itDSSATwthCode which is the key in all_wea, that will make things simple) that is used to merge the Locations_ahm data frame with the all_wea data frame. The Location_codes sometimes encode unique information (CADV vs CAD2, maybe) that shouldn’t be lost, but the sites still experience the same weather in some cases. Did that two code chunks ago.

Which weather can’t be merged with my locations_by_years table? (Locations_ahm)

all_wea$Year <- as.numeric(all_wea$Year)

# Locations_ahm %>%
#  anti_join(all_wea, by = c("DSSATwthCode", "Year")) %>%
#  arrange(Location_code, Year)

Ok, there are quite a lot of errors here, and Jeff is sending me new files for some. So just make v1.5.1 for now, and v1.5.2 can have weather data.

7. Save the new data to an Excel file.

Assign Germplasm_ahm to lst$Germplasm, Phenotypes_ahm to lst$Phenotypes, Locations_ahm to lst$Locations_by_Years, Locations_unique to lst$Locations_ahm. Save two new xlsx files (both V1.5.1) for Jeff to look over. One file has all the germplasm and location-year metadata, including the weather data, if I can, and is called CDBN-metadata_V1.5.1.xlsx. One file has all the phenotype data, which can be related back to the metadata using the Location_code, Year, CDBN_ID and/or Seq_ID keys, and is called CDBN-phenotypes_V1.5.1.xlsx. Finally I may have a weather data file called CDBN-weather_V1.5.1.xlsx… but as of now there are too many rows in the table for an Excel file (more than 1,048,575… so just keep all_wea as a rds object for now.)

NB: I commented out this code because it only needed to run once to generate the new Excel file.

 # createSheet(wb, name = "Kinship")
 # createSheet(wb, name = "Germplasm_ahm")
 # createSheet(wb, name = "Locations_by_Years")
 # createSheet(wb, name = "Locations_ahm")
 # writeWorksheet(wb, Germplasm_ahm, sheet = "Germplasm_ahm")
 # writeWorksheet(wb, Locations_ahm, sheet = "Locations_by_Years")
 # writeWorksheet(wb, Locations_unique, sheet = "Locations_ahm")
 # writeWorksheet(wb, kinship, sheet = "Kinship")
 # saveWorkbook(wb)

# wb1 <- loadWorkbook("CDBN-metadata_V1.5.1.xlsx", create = TRUE)
# createSheet(wb1, name = "Kinship")
# createSheet(wb1, name = "Germplasm_ahm")
# createSheet(wb1, name = "Locations_by_Years")
# createSheet(wb1, name = "Locations_ahm")
# writeWorksheet(wb1, Germplasm_ahm, sheet = "Germplasm_ahm")
# writeWorksheet(wb1, Locations_ahm, sheet = "Locations_by_Years")
# writeWorksheet(wb1, Locations_unique, sheet = "Locations_ahm")
# writeWorksheet(wb1, kinship, sheet = "Kinship")
# saveWorkbook(wb1)
 
# wb2 <- loadWorkbook("CDBN-phenotypes_V1.5.1.xlsx", create = TRUE)
# createSheet(wb2, name = "Phenotypes_ahm")
# writeWorksheet(wb2, Phenotypes_ahm, sheet = "Phenotypes_ahm")
# saveWorkbook(wb2)
# ?saveWorkbook
 
 # wb3 <- loadWorkbook("CDBN-weather_V1.5.1.xlsx", create = TRUE)
 #createSheet(wb3, name = "Daily_weather")
 #writeWorksheet(wb3, all_wea, sheet = "Daily_weather")
 #saveWorkbook(wb3)

All-experiments_V1.5_Cleanup

Alice MacQueen

2017-12-21

1. Add CDBN_ID’s to `lst$Germplasm`

1a. Clean the kinship matrix generated from the GBS data

1b. Test that the second round of punctuation fixes actually ties the CDBN_ID keys to the Germplasm table.

2. Join Germplasm and Metadata

3. Make a Germplasm table to add to the Phenotypes worksheet

3a. Join small Genotypes table to Phenotypes worksheet

4. Add Climate_bin to Locations worksheet

4a. Some locations don’t have associated Lat and Long

4b. Fix `CAD2`

4c. Fix `COFC`

4d. Fix IDK2

4e. Supply missing Lat/Long or Lat_best/Long_best

5. Make a unique Locations sheet.

6. Make changes to the `Locations_ahm` sheet

Clean up estimated planting date tibble

7. Save the new data to an Excel file.

All-experiments_V1.5_Cleanup

Alice MacQueen

2017-12-21

1. Add CDBN_ID’s to lst$Germplasm

1a. Clean the kinship matrix generated from the GBS data

1b. Test that the second round of punctuation fixes actually ties the CDBN_ID keys to the Germplasm table.

2. Join Germplasm and Metadata

3. Make a Germplasm table to add to the Phenotypes worksheet

3a. Join small Genotypes table to Phenotypes worksheet

4. Add Climate_bin to Locations worksheet

4a. Some locations don’t have associated Lat and Long

4b. Fix CAD2

4c. Fix COFC

4d. Fix IDK2

4e. Supply missing Lat/Long or Lat_best/Long_best

5. Make a unique Locations sheet.

6. Make changes to the Locations_ahm sheet

Clean up estimated planting date tibble

7. Save the new data to an Excel file.

1. Add CDBN_ID’s to `lst$Germplasm`

4b. Fix `CAD2`

4c. Fix `COFC`

6. Make changes to the `Locations_ahm` sheet