Definitions

Julian Flowers 02/10/2018

National Institute of Standards and Technology. NIST Big Data Interoperability Framework: Volume 1, Definitions (NIST Special Publication 1500-1), 2015. Available from: http://nvlpubs.nist.gov/nistpubs/SpecialPublications/NIST.SP.1500-1.pdf

Some big data and data science terms and definitions, extracted from Volume 1 (Definitions) of the NIST Big Data Interoperability Framework.

library(readtext)
library(tidytext)
library(dplyr)

# read the NIST definitions PDF directly from the NIST site
defs <- readtext("http://nvlpubs.nist.gov/nistpubs/SpecialPublications/NIST.SP.1500-1.pdf")

# tokenise the report text into sentences
def_sents <- defs %>%
  unnest_tokens(sentences, text, token = "sentences")

# sentences 494 to 516 cover the glossary in Appendix A of the report
def_sents %>% 
  .[494:516, 2] %>%
  data.frame() 
##                                                                                                                                                                                                                                                                                                                                                            .
## 1                                                                                                                                        big data consists of extensive datasetsprimarily in the characteristics of volume, variety, velocity,   and/or variabilitythat require a scalable architecture for efficient storage, manipulation, and analysis.
## 2                                                                                                          big data engineering includes advanced techniques that harness independent resources for building   scalable data systems when the characteristics of the datasets require new architectures for efficient   storage, manipulation, and analysis.
## 3                                                                                                                                            the big data paradigm consists of the distribution of data systems across horizontally coupled,   independent resources to achieve the scalability needed for the efficient processing of extensive   datasets.
## 4                                                                                                                                                                                                                                                                  computational portability is the movement of the computation to the location of the data.
## 5                                                                                                                                                                                                          data governance refers to the overall management of the availability, usability, integrity, and   security of the data employed in an enterprise.
## 6                                                                                                                                                                        the data lifecycle is the set of processes that transforms raw data into actionable knowledge, which   includes data collection, preparation, analytics, visualization, and access.
## 7                                                                                                                                                                                                                              data science is the empirical synthesis of actionable knowledge from raw data through the complete data   life cycle process.
## 8                                                                                                                                                                                                            the data science is extraction of actionable knowledge directly from data through a process of discovery,   hypothesis, and hypothesis testing.
## 9                                                                                    a latency is a practitioner who has sufficient knowledge in the overlapping regimes of business needs,   domain knowledge, analytical skills, and software and systems engineering to manage the end-to-end   data processes through each stage in the data life cycle.
## 10                                                                                                                                                                                            distributed computing is a computing system in which components located on networked   computers communicate and coordinate their actions by passing messages.
## 11                                                                                                                                                                                                           distributed file systems contain multi-structured (object) datasets that are distributed across the   computing nodes of the server cluster(s).
## 12  a federated database system is a type of meta-database management system, which transparently maps   multiple autonomous database systems into a single federated database. horizontal scaling implies the coordination of individual resources (e.g., server) that are integrated to act   in parallel as a single system (i.e., operate as a cluster).
## 13                                                                                                                                                                                                                                                                                             latency refers to the delay in processing or in availability.
## 14                                                                                                                                                                                                                       massively parallel processing refers to a multitude of individual processors working in parallel to execute   a particular program.
## 15                                                                                                                                                                                    non-relational models, frequently referred to as nosql, refer to logical data models that do not follow   relational algebra for the storage and manipulation of data.
## 16 resource negotiation consists of built-in data management capabilities that provide the necessary   support functions, such as operations management, workflow integration, security, governance, support   for additional processing models, and controls for multi-tenant environments, providing higher   availability and lower latency applications.
## 17                                                                                                   a-1  nist big data interoperability framework: volume 1, definitions schema-on-read is the application of a data schema through preparation steps such as transformations,   cleansing, and integration at the time the data is read from the database.
## 18     shared-disk file systems, such as storage area networks (sans) and network attached storage (nas),   use a single storage pool, which is accessed from multiple computing resources. validity refers to appropriateness of the data for its intended use.         value refers to the inherent wealth, economic and social, embedded in any dataset.
## 19                                                                                                                                                                                                                                                                                          variability refers to the change in other data characteristics.
## 20                                                                                                                                                                                                                                                                                    variety refers to data from multiple repositories, domains, or types.
## 21                                                                                                                                                                                                                                                                    velocity refers to the rate of data flow. veracity refers to the accuracy of the data.
## 22                                                                                                                                                  vertical scaling implies increasing the system parameters of processing speed, storage, and memory for   greater performance. volatility refers to the tendency for data structures to change over time.
## 23                                                                                                                                                                                                                                                                                                                 volume refers to the size of the dataset.
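
The raw sentences still carry PDF-extraction artifacts: long runs of spaces from the page layout, the repeated Appendix A page header ("a-1 nist big data interoperability framework: volume 1, definitions"), and a few entries where two short definitions are tokenised into one sentence. A light clean-up pass along the following lines (using stringr, which is an assumption here rather than part of the chunks above) squashes the whitespace and strips the page header:

library(stringr)

# squash layout whitespace, then drop the Appendix A page header
clean_defs <- def_sents %>%
  .[494:516, 2] %>%
  str_squish() %>%
  str_remove("^a-1 nist big data interoperability framework: volume 1, definitions ")

data.frame(clean_defs)

A few entries still arrive merged (horizontal scaling, validity, value, veracity, and volatility are tokenised onto the end of the preceding sentence), so they are separated by hand below. The tidied definitions: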
big data consists of extensive datasets (primarily in the characteristics of volume, variety, velocity, and/or variability) that require a scalable architecture for efficient storage, manipulation, and analysis.
big data engineering includes advanced techniques that harness independent resources for building scalable data systems when the characteristics of the datasets require new architectures for efficient storage, manipulation, and analysis.
the big data paradigm consists of the distribution of data systems across horizontally coupled, independent resources to achieve the scalability needed for the efficient processing of extensive datasets.
computational portability is the movement of the computation to the location of the data.
data governance refers to the overall management of the availability, usability, integrity, and security of the data employed in an enterprise.
the data lifecycle is the set of processes that transforms raw data into actionable knowledge, which includes data collection, preparation, analytics, visualization, and access.
data science is the empirical synthesis of actionable knowledge from raw data through the complete data life cycle process.
data science is the extraction of actionable knowledge directly from data through a process of discovery, hypothesis, and hypothesis testing.
a data scientist is a practitioner who has sufficient knowledge in the overlapping regimes of business needs, domain knowledge, analytical skills, and software and systems engineering to manage the end-to-end data processes through each stage in the data life cycle.
distributed computing is a computing system in which components located on networked computers communicate and coordinate their actions by passing messages.
distributed file systems contain multi-structured (object) datasets that are distributed across the computing nodes of the server cluster(s).
a federated database system is a type of meta-database management system, which transparently maps multiple autonomous database systems into a single federated database.
horizontal scaling implies the coordination of individual resources (e.g., server) that are integrated to act in parallel as a single system (i.e., operate as a cluster).
latency refers to the delay in processing or in availability.
massively parallel processing refers to a multitude of individual processors working in parallel to execute a particular program.
non-relational models, frequently referred to as nosql, refer to logical data models that do not follow relational algebra for the storage and manipulation of data.
resource negotiation consists of built-in data management capabilities that provide the necessary support functions, such as operations management, workflow integration, security, governance, support for additional processing models, and controls for multi-tenant environments, providing higher availability and lower latency applications.
schema-on-read is the application of a data schema through preparation steps such as transformations, cleansing, and integration at the time the data is read from the database.
shared-disk file systems, such as storage area networks (sans) and network attached storage (nas), use a single storage pool, which is accessed from multiple computing resources.
validity refers to appropriateness of the data for its intended use.
value refers to the inherent wealth, economic and social, embedded in any dataset.
variability refers to the change in other data characteristics.
variety refers to data from multiple repositories, domains, or types.
velocity refers to the rate of data flow.
veracity refers to the accuracy of the data.
vertical scaling implies increasing the system parameters of processing speed, storage, and memory for greater performance.
volatility refers to the tendency for data structures to change over time.
volume refers to the size of the dataset.
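
As a possible next step (not part of the NIST text itself), the cleaned sentences can be split on their linking phrase to give a small term/definition lookup table. The sketch below assumes the clean_defs vector from the earlier chunk and uses tidyr; the alternation only covers the linking phrases visible in this extract, so one or two entries (the non-relational models and shared-disk file systems definitions, for example) would need their own patterns.

library(tidyr)

# split each sentence into the term (before the linking phrase) and its definition
glossary <- data.frame(sentence = clean_defs, stringsAsFactors = FALSE) %>%
  extract(sentence,
          into = c("term", "definition"),
          regex = "^(.+?) (?:is|refers to|refer to|consists of|includes|implies|contain) (.+)$")

head(glossary)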