Julian Flowers 02/10/2018
Some big data and data science terms and definitions, extracted from the NIST Big Data Interoperability Framework: Volume 1, Definitions (NIST SP 1500-1).
library(readtext)   # readtext() pulls the text layer straight from the PDF
library(tidytext)   # unnest_tokens() for sentence tokenisation
library(dplyr)
defs <- readtext("http://nvlpubs.nist.gov/nistpubs/SpecialPublications/NIST.SP.1500-1.pdf")
def_sents <- defs %>%
  unnest_tokens(sentences, text, token = "sentences")  # one row per sentence
## print the glossary sentences (rows 494-516 of the tokenised text)
def_sents %>%
  .[494:516, 2] %>%
  data.frame()
## .
## 1 big data consists of extensive datasets (primarily in the characteristics of volume, variety, velocity, and/or variability) that require a scalable architecture for efficient storage, manipulation, and analysis.
## 2 big data engineering includes advanced techniques that harness independent resources for building scalable data systems when the characteristics of the datasets require new architectures for efficient storage, manipulation, and analysis.
## 3 the big data paradigm consists of the distribution of data systems across horizontally coupled, independent resources to achieve the scalability needed for the efficient processing of extensive datasets.
## 4 computational portability is the movement of the computation to the location of the data.
## 5 data governance refers to the overall management of the availability, usability, integrity, and security of the data employed in an enterprise.
## 6 the data lifecycle is the set of processes that transforms raw data into actionable knowledge, which includes data collection, preparation, analytics, visualization, and access.
## 7 data science is the empirical synthesis of actionable knowledge from raw data through the complete data life cycle process.
## 8 data science is the extraction of actionable knowledge directly from data through a process of discovery, hypothesis, and hypothesis testing.
## 9 a data scientist is a practitioner who has sufficient knowledge in the overlapping regimes of business needs, domain knowledge, analytical skills, and software and systems engineering to manage the end-to-end data processes through each stage in the data life cycle.
## 10 distributed computing is a computing system in which components located on networked computers communicate and coordinate their actions by passing messages.
## 11 distributed file systems contain multi-structured (object) datasets that are distributed across the computing nodes of the server cluster(s).
## 12 a federated database system is a type of meta-database management system, which transparently maps multiple autonomous database systems into a single federated database. horizontal scaling implies the coordination of individual resources (e.g., server) that are integrated to act in parallel as a single system (i.e., operate as a cluster).
## 13 latency refers to the delay in processing or in availability.
## 14 massively parallel processing refers to a multitude of individual processors working in parallel to execute a particular program.
## 15 non-relational models, frequently referred to as nosql, refer to logical data models that do not follow relational algebra for the storage and manipulation of data.
## 16 resource negotiation consists of built-in data management capabilities that provide the necessary support functions, such as operations management, workflow integration, security, governance, support for additional processing models, and controls for multi-tenant environments, providing higher availability and lower latency applications.
## 17 schema-on-read is the application of a data schema through preparation steps such as transformations, cleansing, and integration at the time the data is read from the database.
## 18 shared-disk file systems, such as storage area networks (sans) and network attached storage (nas), use a single storage pool, which is accessed from multiple computing resources. validity refers to appropriateness of the data for its intended use. value refers to the inherent wealth, economic and social, embedded in any dataset.
## 19 variability refers to the change in other data characteristics.
## 20 variety refers to data from multiple repositories, domains, or types.
## 21 velocity refers to the rate of data flow. veracity refers to the accuracy of the data.
## 22 vertical scaling implies increasing the system parameters of processing speed, storage, and memory for greater performance. volatility refers to the tendency for data structures to change over time.
## 23 volume refers to the size of the dataset.
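The slice 494:516 is hard-coded. One way to locate the glossary programmatically (an assumption on my part, not something shown above) is to search for the sentence that opens it:

## find the row where the first definition starts; 494:516 was read off from here
grep("big data consists of extensive", def_sents$sentences)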
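The printed rows still bundle each term with its definition. As a rough sketch (not part of the original analysis), they can be split into a two-column glossary by separating on the first linking phrase; the verb list below is an assumption, and rows phrased differently (e.g. "contain") are left with an NA definition:

library(tidyr)

## split each sentence into term + definition on the first linking phrase
glossary <- def_sents %>%
  slice(494:516) %>%
  separate(sentences, into = c("term", "definition"),
           sep = " refers to | is | are | consists of | includes | implies ",
           extra = "merge", fill = "right")

head(glossary[, c("term", "definition")])

Here `extra = "merge"` keeps any later occurrence of a linking phrase inside the definition column rather than discarding it.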