Regression Models

Measuring Performance

When the outcome is a number, the most common method for characterizing a model’s predictive capabilities is to use the root mean squared error (RMSE). Another common metric is the coefficient of determination, commonly written as R².

Root mean squared error (RMSE)

Definition

This metric is a function of the model residuals, which are the observed values minus the model predictions. The mean squared error (MSE) is calculated by squaring the residuals and averaging them. The RMSE is then calculated by taking the square root of the MSE so that it is in the same units as the original data. The value is usually interpreted either as how far (on average) the residuals are from zero or as the average distance between the observed values and the model predictions.
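
The calculation described above can be sketched in a few lines of Python. This is only an illustration: the function name `rmse` and the sample numbers are made up for the example and do not come from any particular library or data set.

```python
import math

def rmse(observed, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    # Residual = observed value minus model prediction.
    residuals = [o - p for o, p in zip(observed, predicted)]
    # MSE = average of the squared residuals.
    mse = sum(r * r for r in residuals) / len(residuals)
    # Square root brings the value back to the units of the data.
    return math.sqrt(mse)

# Hypothetical observations and predictions, for illustration only.
observed = [3.0, 5.0, 7.0, 9.0]
predicted = [2.0, 5.0, 8.0, 11.0]
rmse_value = rmse(observed, predicted)
print(round(rmse_value, 3))
```

Here the residuals are 1, 0, -1 and -2, so the MSE is 6 / 4 = 1.5 and the RMSE is the square root of 1.5, about 1.225.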

R-Squared (R²)

This value can be interpreted as the proportion of the information in the data that is explained by the model. In other words, it is the percentage of the variation in the data that is explained by the relationship between the two variables.

Definition

A value of XX% can be read in any of these ways: there is XX% less variation around the fitted line than around the mean; the relationship accounts for XX% of the variation; or the relationship between the two variables explains XX% of the variation in the data.
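
The "less variation around the line than around the mean" reading can be made concrete with a small Python sketch. It assumes the usual formula, 1 minus the ratio of the squared error around the predictions to the squared error around the mean; the function name `r_squared` and the numbers are hypothetical.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_obs = sum(observed) / len(observed)
    # Variation around the model's predictions (the fitted line).
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    # Variation around the mean of the observed values.
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1 - ss_res / ss_tot

# Hypothetical observations and predictions, for illustration only.
observed = [3.0, 5.0, 7.0, 9.0]
predicted = [2.0, 5.0, 8.0, 11.0]
r2_value = r_squared(observed, predicted)
print(round(r2_value, 2))
```

With these numbers the variation around the mean is 20 and the variation around the predictions is 6, so the result is 1 - 6/20 = 0.7: there is 70% less variation around the predictions than around the mean.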

Bias and Variance

Definition

Bias is the inability of a machine learning method (such as linear regression) to capture the true relationship in the data. It shows up as error on the training set.

Variance is the difference in fits between data sets. It shows up as the gap between how well the model fits the training set and how well it fits the points of the test set.
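
A rough way to see both ideas at once is to fit two very different models to the same training points and compare their errors on a training set and a test set. The sketch below is an illustration under stated assumptions, not a standard recipe: the data are invented (roughly y = x² plus noise), and the two models (a least-squares straight line and a "memorizing" nearest-neighbour predictor) and all function names are chosen only for the example.

```python
import math

def rmse(pairs, predict):
    """Root mean squared error of predict(x) against the observed y values."""
    squared = [(y - predict(x)) ** 2 for x, y in pairs]
    return math.sqrt(sum(squared) / len(squared))

def fit_line(pairs):
    """Ordinary least-squares straight line; returns a prediction function."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept

def fit_nearest(pairs):
    """Memorizes the training points; predicts the y of the closest training x."""
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]

# Invented data: a curved relationship (about y = x**2) with some noise.
train = [(1, 1.2), (2, 3.6), (3, 9.5), (4, 15.4), (5, 25.8)]
test = [(1.4, 1.6), (2.4, 6.3), (3.4, 11.0), (4.4, 20.1)]

line = fit_line(train)
nearest = fit_nearest(train)

line_train, line_test = rmse(train, line), rmse(test, line)
nn_train, nn_test = rmse(train, nearest), rmse(test, nearest)

print(f"straight line:    train RMSE {line_train:.2f}, test RMSE {line_test:.2f}")
print(f"nearest neighbor: train RMSE {nn_train:.2f}, test RMSE {nn_test:.2f}")
```

On this toy data the straight line cannot follow the curve, so it has a clearly non-zero training error (roughly 1.96), which is the bias. The memorizing model fits the training set perfectly (error 0) but does much worse on the test set (roughly 2.82), and that large gap between the two data sets is the variance. The line's gap between training and test error is much smaller.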

Market Basket Analysis (MBA)

Market basket analysis is a technique for identifying association rules in your data. It works on three concepts:

Support, Confidence, Lift

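  • Support: P(A∩B). The probability that A and B are bought together. A support of 5% means that 5% of all transactions contain both product A and product B.

  • Confidence: P(B|A) = P(A∩B) / P(A). The probability that product B is bought given that product A is bought. A confidence of 70% means that 70% of the customers who bought product A also bought product B.

  • Lift: Confidence / P(B) = P(A∩B) / (P(A) × P(B)). A lift greater than 1 means that A and B are bought together more often than would be expected if they were independent, so the rule is useful for prediction. A lift of 1 means A and B are independent, and a lift less than 1 means that buying A makes buying B less likely.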
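
The three measures can be computed directly by counting transactions. The following Python sketch is illustrative only: the function name `rule_metrics`, the product names, and the five transactions are invented for the example, and it evaluates a single rule A -> B rather than searching for rules.

```python
def rule_metrics(transactions, a, b):
    """Support, confidence and lift for the rule a -> b.

    transactions is a list of sets of item names.
    """
    n = len(transactions)
    p_a = sum(1 for t in transactions if a in t) / n
    p_b = sum(1 for t in transactions if b in t) / n
    # Fraction of transactions that contain both items.
    p_ab = sum(1 for t in transactions if a in t and b in t) / n

    support = p_ab
    confidence = p_ab / p_a      # P(B|A)
    lift = confidence / p_b      # P(A and B) / (P(A) * P(B))
    return support, confidence, lift

# Invented transactions, for illustration only.
transactions = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread"},
    {"milk"},
    {"butter", "milk"},
]

support, confidence, lift = rule_metrics(transactions, "bread", "butter")
print(f"support={support:.2f} confidence={confidence:.2f} lift={lift:.2f}")
```

In this example bread appears in 3 of 5 transactions, butter in 3 of 5, and both together in 2 of 5. The support is therefore 0.40, the confidence is 0.4 / 0.6, about 0.67, and the lift is about 1.11, slightly above 1.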

Correlation Does Not Imply Causation

It’s often very tempting to look at statistical information, spot correlation, and then assume causation. It’s a mistake that gets made often, but things are rarely this simple or straightforward.

Of course, circumstances can be that straightforward occasionally, but assuming that they are is never a good idea because you will often jump to the wrong conclusions.

Just because correlation is evident, that doesn’t mean that A causes B. In statistics, it’s a logical fallacy to suggest that correlation proves causation, and no one will take you seriously if your research falls into this trap.