Data analytics is the process of turning data into meaningful insights. While this definition is a bit broad, so too is the world of analytics. Many types of analytic techniques fall into this category, such as basic data summaries, data visualizations, regression techniques, and others. Data science, on the other hand, focuses on discovering new questions that we may not have known needed answering in order to drive innovation. We can think of data analytics as a subcategory of data science. In this course, while we often refer to ourselves as data analysts, we strive to go far beyond what is done by an analyst and into the data science space. It should be noted that most companies use these terms interchangeably (though they may not know that they have done so). You can be confident that this program will give you the skills necessary to excel at both jobs.
Analytics normally falls into three or four broad categories: descriptive, diagnostic, predictive, and prescriptive (diagnostic analytics is sometimes omitted, as is the case in your textbook). Each type of analytics has a different level of complexity; however, as the complexity rises, so too does the value to the business or customer.
Descriptive analytics focuses on what happened in the past. It involves things like data queries, reports, and descriptive statistics. Data dashboards are often found in descriptive analytics, as they are useful for visually illustrating several metrics simultaneously.
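As a small illustration of descriptive statistics, the sketch below summarizes a week of sales figures using Python's standard `statistics` module. The `daily_sales` values are made up for this example and do not come from the textbook.

```python
import statistics

# Hypothetical daily sales figures (illustrative data only).
daily_sales = [120, 135, 128, 150, 142, 138, 160]

# Descriptive analytics: summarize what already happened.
summary = {
    "count": len(daily_sales),
    "mean": statistics.mean(daily_sales),
    "median": statistics.median(daily_sales),
    "stdev": statistics.stdev(daily_sales),  # sample standard deviation
    "min": min(daily_sales),
    "max": max(daily_sales),
}

for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

A dashboard is essentially a visual arrangement of summaries like these, refreshed as new data arrive.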
Diagnostic analytics focuses on why something happened. Data mining is a broad term that describes the use of analytical techniques to understand patterns and relationships within data sets. Using data mining techniques, we can begin to build insight as to why there is a relationship within the data.
Predictive analytics consists of techniques that use models constructed from past data to predict the future or ascertain the impact of one variable on another. This type of analytics goes beyond understanding the relationship between data and tries to understand if the relationship will continue moving forward. Techniques like linear regression, simulation and logistic regression all fall under this umbrella. Often it is data mining that unearths these relationships and so data mining plays a role in this type of analytics, too.
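To make the idea concrete, here is a minimal sketch of simple linear regression fitted by ordinary least squares and then used to predict a new value. The data (`ad_spend`, `sales`) and the helper `fit_line` are hypothetical, chosen only for illustration; real projects would normally rely on a statistics library.

```python
# Hypothetical past observations: advertising spend (x) and sales (y).
ad_spend = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [3.1, 4.9, 7.2, 8.8, 11.0]


def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sum of squared deviations of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(ad_spend, sales)

# Use the model built from past data to predict an unseen case.
predicted = intercept + slope * 6.0
print(f"slope={slope:.2f}, intercept={intercept:.2f}, prediction at x=6: {predicted:.2f}")
```

The slope estimates the impact of one variable on another, and the prediction assumes the past relationship continues to hold.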
Prescriptive analytics indicates a course of action moving forward for the company or client. Thus, the output of a model is a decision. Forecasts and predictions come from predictive analytics. When those forecasts and predictions are combined with a set of rules, the analysis becomes prescriptive. Hence, prescriptive analytics are sometimes referred to as rule-based models.
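The step from prediction to prescription can be sketched as a forecast fed into a rule that returns a decision. The function `recommend_order`, the inventory rule, and the `safety_stock` default below are all hypothetical, invented purely to illustrate the idea.

```python
def recommend_order(forecast_demand, on_hand, safety_stock=20):
    """Turn a demand forecast into an order decision using a simple rule.

    Rule (an assumption for this sketch): order enough units to cover the
    forecast plus a safety stock, and never order a negative quantity.
    """
    shortfall = forecast_demand + safety_stock - on_hand
    return max(0, shortfall)


# The forecast would come from a predictive model; here it is simply given.
print(recommend_order(forecast_demand=150, on_hand=100))
print(recommend_order(forecast_demand=50, on_hand=100))
```

The output here is not a number to interpret but an action to take, which is what distinguishes this from the predictive step.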
According to Wikipedia, “Big data is a field that treats ways to analyze, systematically extract information from, or otherwise deal with data sets that are too large or complex to be dealt with by traditional data-processing application software”. While data sets with a large number of columns are often useful in statistical techniques, they also tend to increase the number of discoveries that ultimately prove incorrect.
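One way to see why many columns invite incorrect discoveries is the multiple-comparisons effect. The sketch below assumes that each column is tested independently at a significance level of 0.05 and that none of the columns has any real relationship with the outcome. Both assumptions are ours, made for illustration. Under them, the chance of at least one false discovery among \(m\) tests is \(1 - (1 - \alpha)^m\).

```python
def prob_any_false_discovery(num_tests, alpha=0.05):
    """Probability of at least one false positive among independent tests.

    Assumes every null hypothesis is true and the tests are independent.
    """
    return 1 - (1 - alpha) ** num_tests


for m in (1, 10, 100):
    print(f"{m:>3} columns tested: {prob_any_false_discovery(m):.3f}")
```

As the number of columns grows, a "significant" finding somewhere in the data becomes almost guaranteed even when nothing real is there.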
Normally, we discuss big data in terms of V’s. While there is some disagreement as to the number of V’s that should be used, the standard is somewhere between three and five:
Other data V’s not discussed in your textbook include value and variability. As a result of these challenges, technologies like Hadoop were developed to help deal with big data. You can read more about Hadoop at its website.
Because data are collected electronically, we are able to collect more of it. To be useful, these data must be stored, and this storage has led to vast quantities of data. Many companies now store in excess of 100 terabytes of data (a terabyte is 1,024 gigabytes). Thus, the volume of data has made traditional processing techniques difficult.
In the chart below, it is helpful to recall that \(2^{10} = 1,024\).
| Name | Equal to: | Size in Bytes |
|---|---|---|
| Bit | 1 bit | 1/8 |
| Nibble | 4 bits | 1/2 (rare) |
| Byte | 8 bits | 1 |
| Kilobyte | 1,024 bytes | 1,024 |
| Megabyte | 1,024 kilobytes | 1,048,576 |
| Gigabyte | 1,024 megabytes | 1,073,741,824 |
| Terabyte | 1,024 gigabytes | 1,099,511,627,776 |
| Petabyte | 1,024 terabytes | 1,125,899,906,842,624 |
| Exabyte | 1,024 petabytes | 1,152,921,504,606,846,976 |
| Zettabyte | 1,024 exabytes | 1,180,591,620,717,411,303,424 |
| Yottabyte | 1,024 zettabytes | 1,208,925,819,614,629,174,706,176 |
Real-time capture and analysis of data present unique challenges both in how data are stored, and the speed with which those data can be analyzed. The New York Stock Exchange, for example, collects 1 terabyte of data in a single trading session, and having current data and real-time rules for trades and predictive modeling are important for managing stock portfolios. This increase in velocity both in terms of intake and in terms of processing has made previous processing techniques obsolete.
To determine how many kilobytes are in 1 terabyte, we observe: \[1\,\text{TB} \times \dfrac{2^{10}\,\text{GB}}{1\,\text{TB}} \times \dfrac{2^{10}\,\text{MB}}{1\,\text{GB}} \times \dfrac{2^{10}\,\text{KB}}{1\,\text{MB}} = 2^{30}\,\text{KB}.\]
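The same unit-cancelling logic can be sketched in code: each step between adjacent units in the chart is a factor of \(2^{10} = 1{,}024\). The `UNITS` list and the `convert` helper are hypothetical names introduced for this example.

```python
# Units in ascending order; each is 1,024 times the one before it.
UNITS = ["byte", "kilobyte", "megabyte", "gigabyte", "terabyte",
         "petabyte", "exabyte", "zettabyte", "yottabyte"]


def convert(amount, from_unit, to_unit):
    """Convert between binary storage units (each step is a factor of 2**10)."""
    steps = UNITS.index(from_unit) - UNITS.index(to_unit)
    return amount * 1024 ** steps


kb_per_tb = convert(1, "terabyte", "kilobyte")
print(kb_per_tb, kb_per_tb == 2 ** 30)
print(convert(100, "terabyte", "gigabyte"))
```

Converting from a smaller unit to a larger one works the same way, with a negative number of steps producing a fraction.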
While deciding what flavor of data science you might pursue, it makes sense to keep a number of options in mind:
Predictive models are used to forecast financial performance, to assess the risk of investment portfolios and projects, and to construct financial instruments such as derivatives.
A relatively new area of application for analytics is the management of an organization’s human resources (people analytics). The HR function is charged with ensuring that the organization has the mix of skill sets necessary to meet its needs, is hiring the highest-quality talent and providing an environment that retains it, and achieves its organizational diversity goals.
A better understanding of consumer behavior through the use of scanner data and data generated from social media has led to an increased interest in marketing analytics. As a result, analytics are heavily used in marketing. A better understanding of consumer behavior through analytics leads to the better use of advertising budgets, more effective pricing strategies, improved forecasting of demand, improved product-line management, and increased customer satisfaction and loyalty.
The use of analytics in health care is on the increase because of pressure to simultaneously control costs and provide more effective treatment. A study by McKinsey Global Institute (MGI) and McKinsey & Company estimates that the health care system in the United States could save more than $300 billion per year by better utilizing analytics; these savings are approximately the equivalent of the entire gross domestic product of countries such as Finland, Singapore, and Ireland.
The core service of companies such as UPS and FedEx is the efficient delivery of goods, and analytics has long been used to achieve efficiency. The optimal sorting of goods, vehicle and staff scheduling, and vehicle routing are all key to profitability for logistics companies such as UPS and FedEx.
Finding optimal travel routes for delivery drivers is a problem known to be very difficult (computationally infeasible at realistic scale) to solve exactly. Thus, data science techniques in this area often deal with finding “better” solutions rather than finding the “best” solutions.
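The difference between a “better” and a “best” route can be sketched on a tiny example. Below, a nearest-neighbor heuristic (one common shortcut, chosen here for illustration) is compared with an exhaustive search over every ordering. The `stops` coordinates are made up, and exhaustive search is only practical because there are six of them.

```python
import itertools
import math

# Hypothetical stop coordinates; index 0 is the depot.
stops = [(0, 0), (2, 6), (5, 1), (6, 5), (8, 2), (1, 3)]


def tour_length(order):
    """Length of the closed route visiting stops in the given order."""
    return sum(math.dist(stops[a], stops[b])
               for a, b in zip(order, order[1:] + order[:1]))


def nearest_neighbor_tour(start=0):
    """Heuristic: always drive to the closest stop not yet visited."""
    unvisited = set(range(len(stops))) - {start}
    tour = [start]
    while unvisited:
        here = stops[tour[-1]]
        nxt = min(unvisited, key=lambda i: math.dist(here, stops[i]))
        tour.append(nxt)
        unvisited.remove(nxt)
    return tour


def brute_force_tour(start=0):
    """Exact answer: try every ordering of the remaining stops."""
    others = [i for i in range(len(stops)) if i != start]
    candidates = ([start] + list(p) for p in itertools.permutations(others))
    return min(candidates, key=tour_length)


heuristic = nearest_neighbor_tour()
optimal = brute_force_tour()
print(f"heuristic: {heuristic} length {tour_length(heuristic):.2f}")
print(f"optimal:   {optimal} length {tour_length(optimal):.2f}")
```

The exhaustive search checks \((n-1)!\) orderings for \(n\) stops, which is 120 here but grows far too quickly for a driver with dozens of deliveries. The heuristic finishes almost instantly, at the cost of possibly returning a longer route.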
Government and other nonprofit groups have used analytics to drive out inefficiencies and increase the effectiveness and accountability of programs. Nonprofit agencies have also used analytics to ensure their effectiveness and accountability to their donors and clients.
The use of analytics for player evaluation and on-field strategy is now commonplace. Professional sports teams use analytics to assess players for amateur drafts and to make decisions on financial compensation during contract negotiations. Teams use analytics to assist with on-field decisions such as which pitchers to use and for how long. The use of analytics for off-the-field business decisions related to the fan experience inside stadiums is also increasing rapidly. Ensuring customer satisfaction is important for any company, and fans are the customers of sports teams. Over $4.5 million is spent annually on sports analytics.
Web analytics is the analysis of online activity, which includes, but is not limited to, visits to web sites and social media sites such as Facebook and LinkedIn. Web analytics obviously has huge implications for promoting and selling products and services via the Internet. Leading companies apply descriptive and advanced analytics to data collected in online experiments to determine the best way to configure web sites, position ads, and utilize social networks for the promotion of products and services.
Data science can be used to model traffic patterns for cars, bikes and pedestrians. Based on data from cellphones, city planners can determine optimal street light and traffic light patterns.
Analytics has been used in military operations since World War II. Analytics can predict the outcome of a mission, the probability of a successful operation, and the likely casualties that may be incurred. It is also common to use analytics in applications related to information security and espionage, as we saw during World War II.
U.S. Immigration and Customs Enforcement (ICE) has used facial recognition technology to mine driver’s license photo databases, with the goal of deporting undocumented immigrants. Additionally, the American judicial system employs software to gauge an incarcerated person’s risk of reoffending.
Tax evasion costs the U.S. government $458 billion each year. To combat these crimes, the IRS has modernized its fraud-detection protocols. The agency has improved efficiency by constructing multidimensional taxpayer profiles from public social media data, assorted metadata, emailing analysis, electronic payment patterns and more. Based on those profiles, the agency forecasts individual tax returns. Anyone with wildly different real and forecasted returns gets flagged for auditing.
When singles match on Tinder, a carefully-crafted algorithm works behind the scenes, boosting the probability of matches. Initially, this algorithm relied on users’ Elo scores, essentially an attractiveness ranking. Now, it prioritizes matches between active users, users near each other and users who seem like each other’s “types” based on their swiping history.
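For readers unfamiliar with Elo scores, the sketch below shows the classic Elo update as used in chess ratings. It illustrates the general rating idea only. How Tinder actually computed or applied its scores is not described here, and the function names and the `k` value of 32 are assumptions for this example.

```python
def expected_score(rating_a, rating_b):
    """Expected result for A against B under the standard Elo model."""
    return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))


def update_rating(rating_a, rating_b, actual_score, k=32):
    """Return A's new rating after an outcome (1 = win, 0.5 = draw, 0 = loss)."""
    return rating_a + k * (actual_score - expected_score(rating_a, rating_b))


# Two equally rated players: the winner gains half of k.
print(update_rating(1500, 1500, actual_score=1))
# Beating a much stronger opponent earns a larger gain.
print(round(update_rating(1500, 1900, actual_score=1), 1))
```

The key property is that a rating moves more when the outcome is more surprising relative to the two current ratings.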
With the advent of big data and the dramatic increase in the use of analytics and data science to improve decision making, increased attention has been paid to ethical concerns around data privacy and the ethical use of models based on data.
As businesses, health authorities, doctors, and others collect data, they have an obligation to protect the data and to not misuse that data. Data breaches occur when data is accessed without proper authorization. Data breaches should be a concern for anyone who collects or uses data. As a result, data privacy laws have been designed to protect individuals’ data from being used against their wishes.
One of the strictest data privacy laws is the General Data Protection Regulation (GDPR), which went into effect in the European Union in May 2018. The law stipulates that the request for consent to use an individual’s data must be easily understood and accessible, the intended uses of the data must be specified, and it must be easy for an individual to withdraw consent. The law also stipulates that an individual has a right to a copy of their data and the right “to be forgotten” (the right to demand that their data be erased). It is imperative that everyone understands the laws that govern the collection, storage, and use of data both in their physical region and in regions where their company operates.
Ethical issues that arise are just as important as the legal issues. Analytics professionals have a responsibility to behave ethically, which includes protecting data and being transparent about the data, how it was collected, and what it does and does not contain. Analysts must be transparent about the methods used to analyze the data and any assumptions that have to be made for the methods used. Finally, analysts must provide valid conclusions and understandable recommendations to their clients.
Following four years of preparation and debate, GDPR was approved by the European Parliament in April 2016, and the official texts of the regulation were published in all of the official languages of the European Union in May 2016. The legislation came into force across the European Union on 25 May 2018.
At its core, GDPR is a new set of rules designed to give European Union citizens more control over their personal data. It aims to simplify the regulatory environment for business so both citizens and businesses in the European Union can fully benefit from the digital economy. As it is the gold standard in data protection, it is likely to become more widespread over the coming years.
Data breaches inevitably happen. Under the terms of GDPR, not only do organizations have to ensure that personal data is gathered legally and under strict conditions, but also that those who collect and manage it are obliged to protect it from misuse and exploitation, as well as to respect the rights of data owners.
GDPR applies to any organization operating within the European Union, as well as any organizations outside of the European Union which offer goods or services to customers or businesses in the European Union. That ultimately means that almost every major corporation in the world needs a GDPR compliance strategy.
There are two different types of data-handlers the legislation applies to: ‘processors’ and ‘controllers’. A controller is a “person, public authority, agency or other body which, alone or jointly with others, determines the purposes and means of processing of personal data”, while the processor is a “person, public authority, agency or other body which processes personal data on behalf of the controller”.
GDPR ultimately places legal obligations on a processor to maintain records of personal data and how it is processed, providing a much higher level of legal liability should the organization be breached. Controllers are also forced to ensure that all contracts with processors are in compliance with GDPR.
The types of data considered personal under the existing legislation include name, address, and photos. GDPR extends the definition of personal data so that something like an IP address can be personal data. It also includes sensitive personal data such as genetic data, and biometric data which could be processed to uniquely identify an individual.
One of the major changes GDPR brings is providing consumers with a right to know when their data has been hacked. Organizations are required to notify the appropriate national bodies as soon as possible in order to ensure European Union citizens can take appropriate measures to prevent their data from being abused. Consumers are also promised easier access to their own personal data in terms of how it is processed, with organizations required to detail how they use customer information in a clear and understandable way.
GDPR also brings a clarified ‘right to be forgotten’ process, which gives people who no longer want their personal data processed the right to have it deleted, provided there are no grounds for retaining it.
Failure to comply with GDPR can result in a fine of up to 20 million euros or four per cent of the company’s annual global turnover, whichever is higher, a figure which for some could mean billions. Less severe infringements carry a lower cap of 10 million euros or two per cent of turnover. Fines depend on the severity of the breach and on whether the company is deemed to have taken compliance and regulations around security in a serious enough manner.
Camm, Jeffrey D. Business Analytics. Third edition, Cengage, 2019.
McKinsey Global Institute. (2016) The Age of Analytics: Competing in a Data-Driven World. Available here.
Wikipedia contributors. (2021, April 12). Apache Hadoop. In Wikipedia, The Free Encyclopedia. Available here.
Wikipedia contributors. (2021, April 12). Big data. In Wikipedia, The Free Encyclopedia. Available here.