Hands-On Machine Learning with R

Bradley Boehmke

Chapter I

I.1 Introduction to Machine Learning

Machine learning (ML) continues to grow in importance for many organizations across nearly all domains. Some example applications of machine learning in practice include:

  • Predicting the likelihood of a patient returning to the hospital (readmission) within 30 days of discharge.

  • Segmenting customers based on common attributes or purchasing behavior for targeted marketing.

  • Predicting coupon redemption rates for a given marketing campaign.

  • Predicting customer churn so an organization can perform preventative intervention.

These tasks all seek to learn from data. The algorithms used to do so, or learners, can be classified according to the amount and type of supervision needed during training. The two main groups are supervised learners and unsupervised learners.

I.2 Supervised learning

According to Kuhn and Johnson (2013, 26:2), predictive modeling is “…the process of developing a mathematical tool or model that generates an accurate prediction.”

Examples of predictive modeling include:

  • using customer attributes to predict the probability of the customer churning in the next 6 weeks;

  • using home attributes to predict the sales price;

  • using employee attributes to predict the likelihood of attrition;

  • using patient attributes and symptoms to predict the risk of readmission;

  • using production attributes to predict time to market.

Most supervised learning problems can be bucketed into one of two categories: regression or classification.

Regression problems

When the objective of our supervised learning is to predict a numeric outcome, we refer to this as a regression problem (not to be confused with linear regression modeling).
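
For illustration, the sketch below frames a simple regression problem in R. It assumes the AmesHousing package (Kuhn 2017a) is installed and uses two of the column names from its processed data (Sale_Price, Gr_Liv_Area); the linear model is only the simplest possible example of a learner that produces a numeric prediction.

  # Sketch of a regression problem: predict a numeric outcome (sale price)
  # from a single home attribute. Assumes the AmesHousing package is installed.
  ames <- AmesHousing::make_ames()

  # Simple linear model: sale price as a function of above-ground living area
  fit <- lm(Sale_Price ~ Gr_Liv_Area, data = ames)
  summary(fit)

Any model that outputs a numeric prediction (e.g., random forests, gradient boosting, neural networks) addresses the same kind of regression problem; only the algorithm changes.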

Classification problems

When the objective of our supervised learning is to predict a categorical outcome, we refer to this as a classification problem.
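
As a minimal sketch, the following R code frames a classification problem using the built-in mtcars data (predicting transmission type, a categorical outcome) rather than one of the data sets used later in the book, so that it is fully self-contained.

  # Sketch of a classification problem: predict a categorical outcome.
  # In mtcars, am is coded 0 = automatic, 1 = manual.
  fit <- glm(am ~ hp + wt, data = mtcars, family = binomial)

  # Predicted probability that each car has a manual transmission
  head(predict(fit, type = "response"))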

I.3 Unsupervised learning

Unsupervised learning, in contrast to supervised learning, includes a set of statistical tools to better understand and describe your data, but performs the analysis without a target variable. In essence, unsupervised learning is concerned with identifying groups in a data set. The groups may be defined by the rows (i.e., clustering) or the columns (i.e., dimension reduction); however, the motive in each case is quite different.
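
As a minimal sketch of the clustering case, the following R code groups the rows of the built-in USArrests data with k-means; the choice of four clusters is arbitrary and purely for illustration.

  # Sketch of clustering: groups are defined by the rows of the data.
  # Variables are scaled first so no single variable dominates the distances.
  set.seed(123)  # k-means depends on random starting values
  km <- kmeans(scale(USArrests), centers = 4, nstart = 25)

  # Number of states assigned to each cluster
  table(km$cluster)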

Unsupervised learning is often performed as part of an exploratory data analysis (EDA), and its techniques are often used in organizations to:

  • Divide consumers into different homogeneous groups so that tailored marketing strategies can be developed and deployed for each segment.

  • Identify groups of online shoppers with similar browsing and purchase histories, as well as items that are of particular interest to the shoppers within each group.

  • Identify products that have similar purchasing behavior so that managers can manage them as product groups.

I.4 The data sets

At this point, this book reviews tasks that typically need to be performed prior to the ML tasks, such as those listed below (a brief illustrative sketch follows the list):

  • Feature selection (i.e., removing unnecessary variables and retaining only those variables you wish to include in your modeling process).

  • Recoding variable names and values so that they are meaningful and more interpretable.

  • Recoding, removing, or otherwise handling missing values.
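
A hedged sketch of what these steps might look like with dplyr is shown below; the data frame raw_df and its column names are hypothetical and used purely for illustration.

  # Hypothetical preparation steps with dplyr; raw_df and its columns are
  # assumptions for illustration only.
  library(dplyr)

  clean_df <- raw_df %>%
    select(-id, -notes) %>%             # feature selection: drop variables not needed for modeling
    rename(sale_price = SalePrice) %>%  # recode variable names to be more meaningful
    filter(!is.na(sale_price))          # one simple way to handle missing values: drop incomplete rows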

Most of the example data sets we use throughout this book have already gone through the necessary cleaning processes. The data sets we focus on are listed below, followed by a sketch of how they might be accessed in R:

  1. Property sales information as described in De Cock (2011).

  2. Employee attrition information originally provided by IBM Watson Analytics Lab.

  3. Image information for handwritten numbers, originally presented to AT&T Bell Labs to help build automatic mail-sorting machines for the USPS.

  4. Grocery items and quantities purchased. Each observation represents a single basket of goods that were purchased together.
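
The sketch below shows one way these data sets might be accessed in R. It assumes the AmesHousing (Kuhn 2017a), rsample (Kuhn and Wickham 2019), and dslabs (Irizarry 2018) packages are installed; the exact home of the attrition data varies by package version, and the grocery file path is a placeholder.

  # 1. Ames property sales data (De Cock 2011)
  ames <- AmesHousing::make_ames()

  # 2. IBM employee attrition data; bundled with older rsample versions,
  #    newer releases move it to the modeldata package
  attrition <- rsample::attrition

  # 3. MNIST handwritten digits
  mnist <- dslabs::read_mnist()

  # 4. Grocery basket data; replace the placeholder path with your local copy
  my_basket <- read.csv("path/to/my_basket.csv")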

References

Cireşan, Dan, Ueli Meier, and Jürgen Schmidhuber. 2012. “Multi-Column Deep Neural Networks for Image Classification.” arXiv Preprint arXiv:1202.2745.

De Cock, Dean. 2011. “Ames, Iowa: Alternative to the Boston Housing Data as an End of Semester Regression Project.” Journal of Statistics Education 19 (3). Taylor & Francis.

Irizarry, Rafael A. 2018. Dslabs: Data Science Labs. <https://CRAN.R-project.org/package=dslabs>.

Kuhn, Max. 2017a. AmesHousing: The Ames Iowa Housing Data. <https://CRAN.R-project.org/package=AmesHousing>.

Kuhn, Max, and Kjell Johnson. 2013. Applied Predictive Modeling. Vol. 26. Springer.

Kuhn, Max, and Hadley Wickham. 2019. Rsample: General Resampling Infrastructure. <https://CRAN.R-project.org/package=rsample>.

LeCun, Yann, Bernhard E Boser, John S Denker, Donnie Henderson, Richard E Howard, Wayne E Hubbard, and Lawrence D Jackel. 1990. “Handwritten Digit Recognition with a Back-Propagation Network.” In Advances in Neural Information Processing Systems, 396–404.

LeCun, Yann, Léon Bottou, Yoshua Bengio, and Patrick Haffner. 1998. “Gradient-Based Learning Applied to Document Recognition.” Proceedings of the IEEE 86 (11). IEEE: 2278–2324.