Introduction to Data Science and Cloud Computing

Data Science and IBM

Date: 9 September, 2022

Author: Brandon Ou (bro9tn)

Note: This article draws mostly from IBM’s own article about data science. Feel free to read the original article for more information.

What is Data Science?

Data science combines data collection, processing, analysis, and summarization to draw out and communicate the insights found in data. The data science lifecycle, or process, involves many steps.

IBM breaks down data science into the following categories:

Data Ingestion

This stage deals with how data is collected, for example through web scraping, manual entry, or video logs.
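
As a minimal sketch of ingestion, the Python snippet below (Python is one of the tools listed later in this article) reads a few manually entered records from CSV text using only the standard library. The column names, the sample values, and the `ingest` helper are made up for illustration; a real pipeline would more likely read from a file, an API, or a scraper.

```python
import csv
import io

# Hypothetical sample: a few manually entered records in CSV form.
RAW_CSV = """species,petal_length_cm,petal_width_cm
setosa,1.4,0.2
versicolor,4.7,1.4
virginica,6.0,2.5
"""

def ingest(csv_text):
    """Parse CSV text into a list of dicts, one per record (values stay strings)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return list(reader)

records = ingest(RAW_CSV)
print(len(records), "records ingested")
print(records[0])
```

Note that every value is still a string at this point; converting types and handling bad rows belongs to the next stage.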

Data Storage and Processing

This stage concerns how to clean data (e.g. removing unnecessary information or excess “noise”), transform it, or combine it into a form that is usable for analysis. It also concerns where the data is stored, whether in data lakes, data warehouses, or elsewhere.
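
The clean, transform, and combine steps described above can be sketched in a few lines of Python. This is only an illustration under assumed inputs: the records, the `cities` lookup, and the `clean` and `combine` helpers are all hypothetical, and real projects typically rely on a dedicated library for this work.

```python
# Hypothetical raw records: stray whitespace, a missing value, and a duplicate.
raw_records = [
    {"id": "1", "name": " Alice ", "age": "34"},
    {"id": "2", "name": "Bob", "age": ""},        # missing age
    {"id": "3", "name": "carol", "age": "29"},
    {"id": "1", "name": " Alice ", "age": "34"},  # duplicate of id 1
]

# Hypothetical second source to combine with the first, keyed by id.
cities = {"1": "Richmond", "3": "Norfolk"}

def clean(records):
    """Drop duplicates and rows with missing ages; normalize names and types."""
    seen_ids = set()
    cleaned = []
    for row in records:
        if row["id"] in seen_ids or not row["age"].strip():
            continue  # skip duplicates and incomplete rows
        seen_ids.add(row["id"])
        cleaned.append({
            "id": row["id"],
            "name": row["name"].strip().title(),  # transform: tidy the text
            "age": int(row["age"]),               # transform: string -> integer
        })
    return cleaned

def combine(records, city_by_id):
    """Attach a city to each record; None when the second source has no match."""
    return [{**row, "city": city_by_id.get(row["id"])} for row in records]

tidy = combine(clean(raw_records), cities)
for row in tidy:
    print(row)
```

Dropping incomplete rows is only one possible policy; depending on the analysis, filling in missing values may be the better choice.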

Data Analysis

Here, data scientists apply various techniques to look for trends in the data. This could take the form of hypothesis testing, developing machine learning or deep learning models, or something else.
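
As one small example of developing a model, the Python sketch below fits a straight line to a handful of points with ordinary least squares, which is about the simplest model there is. The measurements are made up for illustration and the `fit_line` helper is hypothetical; it stands in for the far richer models a real analysis would use.

```python
from statistics import mean

# Hypothetical measurements in cm, made up for illustration.
lengths = [1.4, 1.5, 4.5, 4.7, 5.1, 6.0]
widths = [0.2, 0.2, 1.5, 1.4, 1.9, 2.5]

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept

slope, intercept = fit_line(lengths, widths)
print(f"width ~ {slope:.3f} * length + {intercept:.3f}")

# Use the fitted line to estimate the width of a 3.0 cm petal.
predicted = slope * 3.0 + intercept
print(f"predicted width at 3.0 cm: {predicted:.2f}")
```

A positive slope here would indicate that longer petals tend to be wider, which is the kind of trend this stage is meant to surface.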

Communication

Insights found in the Data Analysis stage are presented in a form that the target audience can understand. This often includes creating graphs to visualize trends and tables to communicate data, and may involve sophisticated data visualization tools. For example, the following R code plots petal length against petal width for the built-in iris dataset, colored by species:

# Scatter plot of petal length vs. petal width, with point fill color set by species
plot(iris$Petal.Length, iris$Petal.Width,
     pch = 21, bg = c("red", "green3", "blue")[unclass(iris$Species)],
     main = "Iris Petal Length vs Width",
     xlab = "Petal Length (cm)", ylab = "Petal Width (cm)")

Importance of Data Science

Data science is especially valuable for analyzing large-scale trends. In ambiguous situations, data often provides some of the best available insight. Companies can analyze how their stakeholders behave and use the results to better tailor their products to customers’ needs.

IBM offers several examples of data science uses, also known as use cases:

Banks can leverage customer data to assess the risk of loans.
A robotic process automation provider can use previous emails to infer the purpose of incoming emails and sort them by urgency.
Hospitals can use medical records to predict the likelihood that various diseases will develop or are already present in patients.

Data Science & Cloud Computing

What is Cloud Computing?

Cloud computing is the on-demand delivery of computing resources, such as servers, storage, and processing power, over the internet. Instead of being limited to its own hardware, a machine can draw on these remote resources as needed.

Cloud Computing’s Effect on Data Science

Large datasets are difficult for a single machine to analyze, because cleaning the data and training models on it can take an impractically long time. Extra computing power reduces that computation time. Cloud computing therefore lets data science happen faster: large-scale data manipulation takes less time when more resources are available.
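
The idea of spreading work over more resources can be sketched as a split, process, and combine pattern. In the Python example below, a dataset is split into chunks, each chunk is summarized independently, and the partial results are combined into a mean. Everything here is an assumption for illustration: the helper names are hypothetical, and local threads only stand in for remote cloud machines (in standard Python, threads do not actually speed up CPU-bound work like this).

```python
from concurrent.futures import ThreadPoolExecutor

def split(data, n_chunks):
    """Split a list into n_chunks contiguous pieces of near-equal size."""
    size, extra = divmod(len(data), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < extra else 0)
        chunks.append(data[start:end])
        start = end
    return chunks

def partial_stats(chunk):
    """Work done by one worker: the sum and count of its chunk."""
    return sum(chunk), len(chunk)

def distributed_mean(data, n_workers=4):
    """Compute the mean by combining per-chunk results from several workers."""
    chunks = split(data, n_workers)
    # Threads are only a local stand-in for remote cloud machines here.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        results = list(pool.map(partial_stats, chunks))
    total = sum(s for s, _ in results)
    count = sum(c for _, c in results)
    return total / count

data = list(range(1, 1_000_001))  # hypothetical dataset: the integers 1 to 1,000,000
print(distributed_mean(data))
```

With more workers, each chunk is smaller, which is the intuition behind the speedup. In practice the gain also depends on the cost of moving data between machines.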

Data Science Tools

So how is data science actually performed? Many tools facilitate it. Some include:

R, RStudio, and Python allow users to write custom code and employ open-source libraries to analyze data.
SAS is a comprehensive suite of data analytics tools developed by the company SAS Institute.
Machine learning libraries and tools, such as PyTorch, TensorFlow, and Apache Spark, also help users find hidden insights in data.

References & Extra Resources

References

Discussion

Of the references, I found IBM’s discussion of the definition of data science the most useful, as well as IBM’s discussion of cloud computing and how it applies to data science, because these helped me describe various terms more formally.