Date: 9 September, 2022
Author: Brandon Ou (bro9tn)
Note: This article draws mostly from IBM’s own article about data science. Feel free to read the original article for more information.
Data science combines data collection, processing, analysis, and
summary to communicate potential insights in data. The data science
lifecycle, or process, involves many steps, several of which are
shown below:
IBM breaks down data science into the following categories:
Data collection deals with how data is gathered (e.g. web scraping, manual entry, or video logs).
Data processing and storage concerns how to clean (e.g. remove unnecessary information or excess “noise”), transform, or combine data into a form that is usable for data analysis. It also covers how to store the data, whether in data lakes, data warehouses, or elsewhere.
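As a rough illustration of cleaning, transforming, and combining data, the sketch below works on a small made-up table in base R. The data frames `raw` and `species` and all of their columns are hypothetical, invented only for this example; a real pipeline will look different.

```r
# Hypothetical raw measurements: one duplicated row, one missing value,
# and a free-text column that the analysis does not need
raw <- data.frame(
  id        = c(1, 2, 2, 3, 4),
  length_cm = c(5.1, 4.9, 4.9, NA, 6.3),
  notes     = c("ok", "ok", "ok", "sensor error", "ok")
)

clean <- raw[!duplicated(raw), ]            # clean: drop exact duplicate rows
clean <- clean[!is.na(clean$length_cm), ]   # clean: drop rows with a missing measurement
clean$notes <- NULL                         # clean: remove an unnecessary column
clean$length_mm <- clean$length_cm * 10     # transform: derive a column in another unit

# Combine with a second (hypothetical) table that shares the id column
species <- data.frame(
  id      = c(1, 2, 4),
  species = c("setosa", "setosa", "virginica")
)
combined <- merge(clean, species, by = "id")
print(combined)
```

The result keeps one row per remaining `id`, with the measurement in both units and the species label attached.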
In data analysis, data scientists use various techniques to examine data for trends. This could take the form of hypothesis testing, developing machine learning or deep learning models, or something else.
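A minimal sketch of two such techniques, assuming R's built-in `iris` dataset: a two-sample t-test as one instance of hypothesis testing, and a simple linear regression as a very basic model. The choice of variables is only for illustration and is not taken from IBM's article.

```r
# Hypothesis test: do setosa and versicolor differ in mean petal length?
setosa     <- iris$Petal.Length[iris$Species == "setosa"]
versicolor <- iris$Petal.Length[iris$Species == "versicolor"]
test <- t.test(setosa, versicolor)   # Welch two-sample t-test (R's default)
print(test$p.value)

# Simple model: predict petal width from petal length with linear regression
fit <- lm(Petal.Width ~ Petal.Length, data = iris)
print(coef(fit))
```

A small p-value suggests the difference in means is unlikely to be due to chance alone, and the fitted slope describes how petal width changes with petal length in this dataset.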
Insights found in the Data Analysis stage are presented in a form that the target audience can understand. This often includes creating graphs to visualize trends and tables to communicate data, and may involve sophisticated data visualization tools. For example,
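species_colors <- c("red", "green3", "blue")  # fill color per species
plot(iris$Petal.Length, iris$Petal.Width,
     pch = 21, bg = species_colors[as.integer(iris$Species)],
     main = "Iris Petal Length vs Width",
     xlab = "Petal Length (cm)", ylab = "Petal Width (cm)")
legend("topleft", legend = levels(iris$Species), pt.bg = species_colors, pch = 21)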
Data science is especially important for analyzing large-scale trends. In ambiguous situations, data often provides some of the best available insight. For example, companies can analyze the actions of their stakeholders and use the results to better tailor their products to the needs of customers.
IBM offers several examples of data science uses, also known as use cases:
Cloud computing gives machines more computing power by giving them access to additional computing resources. This is generally achieved by connecting to those resources over the internet.
Large datasets are difficult for a single machine to analyze, as cleaning the data and training models on it can take a very long time, in extreme cases years. As a result, extra computing power may be needed to reduce computation time. Cloud computing therefore allows data science to happen faster, since manipulating large datasets takes less time when more resources are available.
So how can data science be performed? There are countless tools that facilitate data science. Some include: