Creation date: 2018 April 20
Updated: 2018 April 21
This document provides basic QA/QC tools for the Spring 2017 willow utilization data.
Analyses aim to characterize basic data structure and identify potential issues such as:
The field “Record Id” has been added to help index rows for QA/QC purposes. It’s a sequential index of rows.
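A minimal sketch of how such a sequential index could be attached, assuming the records are held as a list of dictionaries and that numbering starts at 1 (neither is stated in the source). The helper name `add_record_id` is hypothetical.

```python
def add_record_id(rows, field="Record Id"):
    """Return copies of the rows with a sequential index added.

    Assumption: the index is 1-based and follows the raw row order.
    """
    return [{field: i, **row} for i, row in enumerate(rows, start=1)]


# Illustrative rows only; these are not values from the real data set.
rows = [
    {"stid": "A1", "live": "live"},
    {"stid": None, "live": "DEAD"},
]
indexed = add_record_id(rows)
```

Because the index is assigned before any filtering or sorting, a flagged record can still be traced back to its row in the raw file.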
There are 1285 records in the raw data set.
A total of 195 records have an NA or incomplete (e.g., only year) date.
## Warning: 39 failed to parse.
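One way to list the records behind a count like this is to attempt a strict parse and keep the row numbers that fail. The sketch below assumes an ISO-style `YYYY-MM-DD` date format, which the source does not confirm. `parse_date` is a hypothetical helper and the sample values are made up.

```python
from datetime import datetime


def parse_date(value, fmt="%Y-%m-%d"):
    """Return a date, or None when the value is missing or incomplete.

    Assumption: complete dates follow the format given in `fmt`.
    """
    if value is None:
        return None
    try:
        return datetime.strptime(str(value).strip(), fmt).date()
    except ValueError:
        # Covers year-only entries, "NA" strings, and other malformed values.
        return None


# Illustrative values only.
raw_dates = ["2017-05-14", "2017", None, "NA", "2017-06-02"]
parsed = [parse_date(d) for d in raw_dates]

# 1-based row numbers of the dates that could not be parsed.
bad_rows = [i for i, d in enumerate(parsed, start=1) if d is None]
```

Here `bad_rows` holds the positions of the year-only, missing, and "NA" entries, which can then be checked against the field forms.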
Questions/Issues:
1. We’ve got varying levels of “missingness” for various key fields (e.g., date, plant, stid, live). Check the field forms.
There are 6 records with a missing value for “stid”.
There are 1279 distinct stid values.
There are 6 records with a missing value for “plant”. Should this be renamed to “willid”?
There are 23 records with a missing value for “live”.
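The per-field missing counts above could be reproduced with a small helper like the one below. It is a sketch that assumes missing values appear as `None`, empty strings, or the literal text "NA". The source does not say how missing values are stored, and the names `MISSING_TOKENS`, `is_missing`, and `count_missing` are hypothetical.

```python
# Assumption: these tokens (compared case-insensitively) mean "no value".
MISSING_TOKENS = {"", "na"}


def is_missing(value):
    """True when a value is None or one of the assumed missing tokens."""
    return value is None or str(value).strip().lower() in MISSING_TOKENS


def count_missing(rows, fields):
    """Count missing values for each field across all rows."""
    return {f: sum(is_missing(row.get(f)) for row in rows) for f in fields}


# Illustrative rows only; not values from the real data set.
rows = [
    {"stid": "S01", "plant": "12", "live": "live"},
    {"stid": None, "plant": "13", "live": "NA"},
    {"stid": "S03", "plant": "", "live": None},
]
missing = count_missing(rows, ["stid", "plant", "live"])
```

Note that codes such as "nd" or "missing" are deliberately not treated as missing here, since how they relate to NA is still an open question.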
| live | n |
|---|---|
| dead | 113 |
| DEAD | 68 |
| live | 947 |
| missing | 42 |
| nd | 87 |
| new | 3 |
| NEW | 2 |
| NA | 23 |
Questions/Issues:
Need to standardize the encoding for the ‘live’ field:
* “NEW” to “retag”?
* “NEW” vs “new”? “DEAD” vs “dead”?
* “nd” vs “NA” vs “missing”?
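As a rough illustration of the case issue only, the sketch below folds the ‘live’ codes to lower case and re-tallies the counts from the table above. It deliberately leaves “nd”, “missing”, and NA as separate categories and does not recode “new” to “retag”, since those questions are still open. `normalize_live` is a hypothetical helper.

```python
from collections import Counter

# Counts copied from the 'live' table; None stands in for NA.
raw_counts = {
    "dead": 113, "DEAD": 68, "live": 947, "missing": 42,
    "nd": 87, "new": 3, "NEW": 2, None: 23,
}


def normalize_live(value):
    """Lower-case and trim a 'live' code; leave NA (None) untouched."""
    if value is None:
        return None
    return str(value).strip().lower()


# Re-tally the counts under the normalized codes.
normalized = Counter()
for value, n in raw_counts.items():
    normalized[normalize_live(value)] += n
```

Under this case-only normalization, “dead” and “DEAD” combine to 181 records and “new” and “NEW” to 5, with the total unchanged at 1285.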
| br_sch | n |
|---|---|
| 0 | 433 |
| 1 | 451 |
| 2 | 31 |
| 3 | 11 |
| 4 | 5 |
| 5 | 4 |
| 6 | 3 |
| NA | 347 |
Examine the records with browse scheme = NA.
Questions/Issues: Are the NA values correct?
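Pulling the NA records out for inspection could look like the sketch below, assuming rows are dictionaries and NA is represented as `None`. The helper `rows_with_na` is hypothetical and the rows are made up.

```python
def rows_with_na(rows, field):
    """Return the rows whose value for `field` is NA (None) or absent."""
    return [row for row in rows if row.get(field) is None]


# Illustrative rows only.
rows = [
    {"Record Id": 1, "br_sch": 0},
    {"Record Id": 2, "br_sch": None},
    {"Record Id": 3, "br_sch": 2},
    {"Record Id": 4},
]
na_rows = rows_with_na(rows, "br_sch")
na_ids = [row["Record Id"] for row in na_rows]
```

The resulting Record Ids can be cross-checked against the field forms to decide whether each NA is a true non-observation or a data-entry gap.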
Not necessarily something to change, but the scheme fields are named differently between the production and utilization data sets. For example: ub_scheme in production, ubr_sch in utilization. A ubr_sch value of 120: is this correct? Some of the other scheme codes seem really big…
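A simple range check can surface scheme codes that look too large. The sketch below uses a hypothetical upper bound of 6, borrowed from the largest br_sch value observed above. Whether ubr_sch should share that range is an assumption, not something the source confirms.

```python
def flag_large_codes(rows, field, max_expected):
    """Return rows whose numeric code exceeds the expected maximum.

    NA (None) values are skipped rather than flagged.
    """
    flagged = []
    for row in rows:
        value = row.get(field)
        if value is not None and value > max_expected:
            flagged.append(row)
    return flagged


# Illustrative rows only; 120 mirrors the value questioned in the text.
rows = [
    {"Record Id": 1, "ubr_sch": 1},
    {"Record Id": 2, "ubr_sch": 120},
    {"Record Id": 3, "ubr_sch": None},
]
suspect = flag_large_codes(rows, "ubr_sch", max_expected=6)
```

Flagged rows are candidates for review only. A large code may be legitimate if the scheme fields are encoded differently from br_sch.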