Simple linear regression: your first step!

In your first exercise, you’ll familiarize yourself with the concept of simple linear regression. You are given measurements of grey kangaroos’ nose width and nose length. You can find the data in kang_nose, which is loaded in your workspace. It has two columns: nose_width and nose_length.

Your job is to describe a possible linear relationship between the grey kangaroo’s nose width (mm) and nose length (mm). Make use of the lm() function as shown in the video and consult the help file if necessary. Remember to explore your data first!

We caught Skippy and measured its nose width, stored in nose_width_new, but it escaped before we could measure its nose length. Can you help?

kang_nose <- read.csv("kang_nose.csv")
nose_width_new <- read.csv("nose_width_new.csv")
# Plot nose length as a function of nose width.
plot(kang_nose)

# Describe the linear relationship between the two variables: lm_kang
lm_kang <- lm(nose_length ~ nose_width, data=kang_nose)

# Print the coefficients of lm_kang
lm_kang$coefficients
## (Intercept)  nose_width 
##   27.893058    2.701175
# Predict and print the nose length of the escaped kangaroo
predict(lm_kang, nose_width_new)
##        1 
## 703.1869
  • There is a visible linear pattern between the nose width and length of Grey Kangaroos! With a nose width of 250mm, your prediction tells us Skippy has a nose length of around 700mm.
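
If you don’t have the CSV files at hand, the new observation is easy to rebuild. This is a minimal sketch, assuming nose_width_new is a one-row data frame whose single column, nose_width, holds the 250 mm measurement mentioned above:

# Hypothetical reconstruction of the new observation (assumed structure)
nose_width_new <- data.frame(nose_width = 250)
predict(lm_kang, nose_width_new)  # roughly 27.89 + 2.70 * 250 = 703.2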

Performance measure: RMSE

Now that you’ve got a grasp on the concept of simple linear regression, let’s move on to assessing the performance. Let’s stick to the Kangaroo example. The dataset, kang_nose, as well as the linear model you built, lm_kang, are available so you can start right away.

In this exercise, you’ll plot the regression line through the data points. Next, you’ll calculate the residuals,

  • \[res_i = Y_i - \hat{Y}_i\]

These are the errors you made by fitting a line through the data points. Lastly, you’ll determine the Root-Mean-Square-Error,

  • \[RMSE = \sqrt{\frac{1}{N}\sum_{i=1}^{N} res^2_i}\]

This estimates the standard deviation of the model's errors. Remember, the smaller the RMSE, the better your fit!
# Build model and make plot
lm_kang <- lm(nose_length ~ nose_width, data=kang_nose)
plot(kang_nose, xlab = "nose width", ylab = "nose length")
abline(lm_kang$coefficients, col = "red")

# Apply predict() to lm_kang: nose_length_est. These are the fitted values y_hat_1, y_hat_2, ..., y_hat_N.
nose_length_est <- predict(lm_kang)

# Determine the residuals: the difference between the true and the predicted values: res
res <- kang_nose$nose_length - nose_length_est

# Calculate RMSE, assign it to rmse and print it
rmse <- sqrt(mean(res ^ 2))
print(rmse)
## [1] 43.26288
  • The regression line passes nicely through the data points! What do you think of the RMSE value of 43 mm? High or low? It’s difficult to interpret when you have no other model to compare with.
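
Note that the RMSE above divides by N, while the residual standard error that summary(lm_kang) reports (44.26 in the next exercise) divides by the residual degrees of freedom, N - 2 for a simple linear regression. A quick sketch of the link, assuming kang_nose has 45 rows as the 43 degrees of freedom suggest:

# RMSE divides by N; the residual standard error divides by N - 2
n <- nrow(kang_nose)
sqrt(sum(res ^ 2) / n)        # RMSE, about 43.26
sqrt(sum(res ^ 2) / (n - 2))  # residual standard error, about 44.26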

Performance measures: R-squared

You’ve correctly calculated the RMSE in the last exercise, but were you able to interpret it? You can compare the RMSE to the total variance of your response by calculating R^2, which is unitless! The closer R^2 is to 1, the stronger the linear association between the predictor and the response variable.

R calculates these performance measures for you. You can display them by applying summary() to your linear model object. Your job is to now manually calculate R^2 and compare your value to the value that R calculated automatically.

  • \[R^2=1-\frac{SS_{res}}{SS_{tot}}\]

Here, SSres is the residual sum of squares

  • \[SS_{res} = \sum_{i=1}^{n} res^2_i = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2\]

whereas SStot is the total sum of squares

  • \[SS_{tot} = \sum_{i=1}^{n} (Y_i - \bar{Y})^2\]
# Calculate the residual sum of squares: ss_res
ss_res <- sum(res ^ 2)

# Determine the total sum of squares: ss_tot
ss_tot <- sum((kang_nose$nose_length - mean(kang_nose$nose_length)) ^ 2)

# Calculate R-squared and assign it to r_sq. Also print it.
r_sq <- 1 - ss_res / ss_tot
r_sq
## [1] 0.7768914
# Apply summary() to lm_kang
summary(lm_kang)
## 
## Call:
## lm(formula = nose_length ~ nose_width, data = kang_nose)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -69.876 -32.912  -4.855  30.227  86.307 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  27.8931    54.2991   0.514     0.61    
## nose_width    2.7012     0.2207  12.236 1.34e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 44.26 on 43 degrees of freedom
## Multiple R-squared:  0.7769, Adjusted R-squared:  0.7717 
## F-statistic: 149.7 on 1 and 43 DF,  p-value: 1.342e-15
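
If you don’t feel like computing R-squared by hand, you can also pull it straight out of the summary object; a short sketch:

# Extract the performance measures R computed for you
summary(lm_kang)$r.squared      # 0.7768914, identical to r_sq
summary(lm_kang)$adj.r.squared  # the adjusted R-squared, about 0.7717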

Another take at regression: be critical

You are given data on GDP per capita and its relation to the percentage of urban population for several UN countries, measured in the year 2014 (Source: The World Bank). This dataset is stored in a data frame world_bank_train and has two variables: cgdp and urb_pop.

Have a look at the data. Do you think a relationship between the two is plausible? Try to set up a linear model for the percentage of urban population based on the GDP per capita.

Afghanistan has a GDP per capita of 413 USD, stored in cgdp_afg, but its urban population in 2014 is not known yet. Can you predict the outcome using your model?

world_bank_train <- read.csv("world_bank_train.csv")
cgdp_afg <- read.csv("cgdp_afg.csv")
# Plot urb_pop as a function of cgdp
plot(urb_pop ~ cgdp, data = world_bank_train)

# Set up a linear model between the two variables: lm_wb
lm_wb <- lm(urb_pop ~ cgdp, data = world_bank_train)

# Add a red regression line to your scatter plot
abline(lm_wb$coefficients, col = "red")

# Summarize lm_wb and select R-squared
summary(lm_wb)$r.squared
## [1] 0.3822067
# Predict the urban population of Afghanistan based on cgdp_afg
predict(lm_wb, cgdp_afg)
##        1 
## 45.01204
  • You must admit, though, that this linear model is barely acceptable and far from satisfying: R-squared is quite low and the regression line doesn’t fit the data well. Is the predicted 45% to be trusted?
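
As with the kangaroo example, predict() only needs cgdp_afg to be a data frame with a cgdp column. A minimal sketch, assuming it holds the single value of 413 USD mentioned above:

# Hypothetical reconstruction of the new observation (assumed structure)
cgdp_afg <- data.frame(cgdp = 413)
predict(lm_wb, cgdp_afg)  # around 45%, as above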

Non-linear, but still linear?

In the last exercise, your scatter plot didn’t show a strong linear relationship. You confirmed this with the regression line and R^2.

To improve your model, take a step back and study the nature of the data. The predictor variable is numerical, while the response variable is expressed as a percentage. It would make more sense if there were a linear relationship between percentage changes in GDP per capita and changes in the response.

To approximate percentage changes, take the natural logarithm of the GDP per capita and use this as your new predictor variable. A model solution to the previous exercise is included below; it’s up to you to make the necessary changes.
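
The reason the natural logarithm helps is that, for small relative changes, differences on the log scale approximate percentage changes, regardless of the starting level. A tiny numeric illustration:

# A 1% increase in GDP per capita shifts log(cgdp) by about 0.01,
# whether the country is poor or rich
log(1010) - log(1000)    # about 0.00995
log(50500) - log(50000)  # also about 0.00995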

# Convert the cgdp variable of world_bank_train to log(cgdp) in both plot() and lm().

# Plot: change the formula and xlab
plot(urb_pop ~ log(cgdp), data = world_bank_train, 
     xlab = "log(GDP per Capita)", 
     ylab = "Percentage of urban population")

# Linear model: change the formula
lm_wb_log <- lm(urb_pop ~ log(cgdp), data = world_bank_train)

# Add a red regression line to your scatter plot
abline(lm_wb_log$coefficients, col = "red")

# Summarize lm_wb_log and select R-squared
summary(lm_wb_log)$r.squared
## [1] 0.5787588
# Predict the urban population of Afghanistan based on cgdp_afg
predict(lm_wb_log, cgdp_afg)
##        1 
## 25.86759
  • Your scatter plot now shows a stronger linear relationship! This model predicts the urban population of Afghanistan to be around 26%, a huge difference from the 45% you had before! In the next exercise you will compare the R-squared measures for both these linear regression attempts.

Interpreting R-squared

In your last exercise, you transformed your predictor wonderfully. Your prediction differed substantially between the two models, so which one should you take? To answer this, you can have a look at the R^2 values of both attempts. The first linear model you built, which did not use a transformation, is available as lm_wb. The second model, where log() was used, is pre-loaded as lm_wb_log.

Which of the following statements is true?

lm_wb
## 
## Call:
## lm(formula = urb_pop ~ cgdp, data = world_bank_train)
## 
## Coefficients:
## (Intercept)         cgdp  
##   4.470e+01    7.578e-04
lm_wb_log
## 
## Call:
## lm(formula = urb_pop ~ log(cgdp), data = world_bank_train)
## 
## Coefficients:
## (Intercept)    log(cgdp)  
##      -42.60        11.37
  • lm_wb_log returns the best predictions; its explained variance is the highest.
  • The second model is clearly better at predicting the percentage of urban population based on the GDP per capita. This type of model is called log-linear, as you take the logarithm of your predictor variable but leave your response variable unchanged. Overall, your model still leaves quite a lot of variance unexplained; if you want more precise predictions, you’ll have to add other relevant variables in a multiple linear regression model.
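
To make the log-linear coefficient concrete: the slope of 11.37 on log(cgdp) means that a 1% increase in GDP per capita is associated with roughly 11.37 * log(1.01), or about 0.11, extra percentage points of urban population. A quick sketch:

# Effect of a 1% increase in GDP per capita on the predicted urban population
coef(lm_wb_log)["log(cgdp)"] * log(1.01)  # about 0.11 percentage points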

Going all-in with predictors!

In this exercise you’ll add even more predictors: inventory (inv), the size of the district (size_dist) and the shop size (sq_ft).

Your job is to set up this model, verify if the fit is good and finally measure the accuracy. Make sure you interpret the results at every step!

shop_data <- read.csv("shop_data.csv")
# Plot sales as a function of each new predictor. Is linearity plausible?
plot(sales ~ sq_ft, shop_data)

plot(sales ~ size_dist, shop_data)

plot(sales ~ inv, shop_data)

# Build a linear model for net sales based on all other variables: lm_shop
lm_shop <- lm(sales ~ ., data = shop_data)

# Summarize lm_shop. Have a look at the R-squared and adjusted R-squared and try to interpret them. Is R-squared close to 1? Is the adjusted R-squared higher than in the video example (0.906)?
summary(lm_shop)
## 
## Call:
## lm(formula = sales ~ ., data = shop_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -26.338  -9.699  -4.496   4.040  41.139 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -18.85941   30.15023  -0.626 0.538372    
## sq_ft        16.20157    3.54444   4.571 0.000166 ***
## inv           0.17464    0.05761   3.032 0.006347 ** 
## ads          11.52627    2.53210   4.552 0.000174 ***
## size_dist    13.58031    1.77046   7.671 1.61e-07 ***
## comp         -5.31097    1.70543  -3.114 0.005249 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 17.65 on 21 degrees of freedom
## Multiple R-squared:  0.9932, Adjusted R-squared:  0.9916 
## F-statistic: 611.6 on 5 and 21 DF,  p-value: < 2.2e-16
  • Multiple R-squared: 0.9932, Adjusted R-squared: 0.9916. Both are very close to 1, and the adjusted R-squared is higher than the video example’s 0.906, so the model fits the training data very well.

Are all predictors relevant?

To further analyze the performance, take a look at the p-values of every predictor. Are they all relevant? Remember that you should verify the assumptions on the error variable before interpreting these results.

A shop owner who didn’t participate in the questionnaire has caught wind of your amazing model! He asked us to estimate his net sales based on the data he provided. Can you help him out? The data can be found in shop_new. shop_data and lm_shop are also available in the workspace.

shop_new <- read.csv("shop_new.csv")
# Plot the residuals as a function of the fitted values
plot(lm_shop$fitted.values, lm_shop$residuals)

# Make a Q-Q plot of your residual quantiles
qqnorm(lm_shop$residuals,  ylab = "Residual Quantiles")

# Summarize your model, are there any irrelevant predictors?
summary(lm_shop)
## 
## Call:
## lm(formula = sales ~ ., data = shop_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -26.338  -9.699  -4.496   4.040  41.139 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -18.85941   30.15023  -0.626 0.538372    
## sq_ft        16.20157    3.54444   4.571 0.000166 ***
## inv           0.17464    0.05761   3.032 0.006347 ** 
## ads          11.52627    2.53210   4.552 0.000174 ***
## size_dist    13.58031    1.77046   7.671 1.61e-07 ***
## comp         -5.31097    1.70543  -3.114 0.005249 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 17.65 on 21 degrees of freedom
## Multiple R-squared:  0.9932, Adjusted R-squared:  0.9916 
## F-statistic: 611.6 on 5 and 21 DF,  p-value: < 2.2e-16
# Predict the net sales based on shop_new.
predict(lm_shop, shop_new)
##        1 
## 262.5006
  • There is no clear pattern in your residuals. Moreover, the residual quantiles lie approximately on a straight line. From the small p-values you can conclude that every predictor is important!
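
To make the Q-Q plot easier to judge, you can add a reference line with qqline(); R’s built-in diagnostics for lm objects produce the same two checks in one call. A small sketch:

# Add a reference line to the Q-Q plot of the residuals
qqnorm(lm_shop$residuals, ylab = "Residual Quantiles")
qqline(lm_shop$residuals)

# Alternatively, R's standard diagnostics: residuals vs fitted and normal Q-Q
plot(lm_shop, which = 1:2)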

Are all predictors relevant? Take 2!

Your job is to build a multiple linear regression model for the energy in a chocolate bar based on all other variables and judge its performance.

choco_data <- read.csv("choco_data.csv")
# Plot energy per 100g as a function of each predictor. Is linearity plausible?
plot(energy ~ protein, choco_data)

plot(energy ~ fat, choco_data)

plot(energy ~ size, choco_data)

# Build a linear model for the energy based on all other variables: lm_choco
lm_choco <- lm(energy ~ ., data = choco_data)

# Plot the residuals as a function of the fitted values
plot(lm_choco$fitted.values, lm_choco$residuals)

# Make a Q-Q plot of your residual quantiles
qqnorm(lm_choco$residuals,  ylab = "Residual Quantiles")

# Summarize lm_choco
summary(lm_choco)
## 
## Call:
## lm(formula = energy ~ ., data = choco_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -107.084  -35.756   -8.323   36.100  104.660 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1338.3333    40.0928  33.381  < 2e-16 ***
## protein       23.0019     3.6636   6.279 6.97e-08 ***
## fat           24.4662     1.6885  14.490  < 2e-16 ***
## size          -0.8183     0.6035  -1.356    0.181    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 52.2 on 52 degrees of freedom
## Multiple R-squared:  0.9019, Adjusted R-squared:  0.8962 
## F-statistic: 159.3 on 3 and 52 DF,  p-value: < 2.2e-16

Interpreting the residuals and p-values

In the last exercise you built a linear regression model for the energy (per 100g) in a chocolate bar based on its size, percentage of protein and percentage of fat. Are all these predictors relevant at a 95% confidence level? The residual and quantile plots from the previous exercise are already available. You can also build the summary() of lm_choco again if you want; the linear model is available in the workspace.

Which of the following statements is true?

  • The predictors protein and fat are statistically insignificant; it is best to remove them.
  • The predictor size is statistically insignificant; it is best to remove it.
  • The residuals seem to be dependent; the p-values must be ignored.
  • The residuals don’t appear to be normally distributed; the p-values must be ignored.

The second statement is correct: the p-value for size (0.181) exceeds 0.05, so size is statistically insignificant and is best removed.
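
Following that conclusion, a natural next step is to refit the model without size and check that the fit barely changes. A sketch, with lm_choco_2 being a name introduced here:

# Refit without the insignificant predictor and compare the adjusted R-squared
lm_choco_2 <- lm(energy ~ protein + fat, data = choco_data)
summary(lm_choco_2)$adj.r.squared  # expect a value close to the 0.8962 above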

Does your model generalize?

Remember the dataset on GDP per capita and the percentage of urban population? You were advised to take the logarithm of GDP per capita. Both the regression line and R^2 showed a better result for the log-linear model, lm_wb_log, than for the simple linear model, lm_wb.

You might be wondering whether we misguided you into producing an overfitted model. The workspace contains world_bank_train, which you’ve used before to build the log-linear model, but now world_bank_test is also available. You can use this dataset to test whether your model generalizes well.

You’ll be comparing the RMSE for both sets:

  • \[RMSE = \sqrt{\frac{1}{N}\sum_{i=1}^{N} res^2_i}\]
world_bank_test <- read.csv("world_bank_test.csv")
# Build the log-linear model
lm_wb_log <- lm(urb_pop ~ log(cgdp), data = world_bank_train)

# Calculate rmse_train
rmse_train <- sqrt(mean(lm_wb_log$residuals ^ 2))

# The real percentage of urban population in the test set, the ground truth
world_bank_test_truth <- world_bank_test$urb_pop

# The predictions of the percentage of urban population in the test set
world_bank_test_input <- data.frame(cgdp = world_bank_test$cgdp)
world_bank_test_output <- predict(lm_wb_log, world_bank_test_input)

# The residuals: the difference between the ground truth and the predictions
res_test <- world_bank_test_truth - world_bank_test_output

# Use res_test to calculate rmse_test
rmse_test <- sqrt(mean(res_test ^ 2))

# Print the ratio of the test RMSE over the training RMSE
rmse_test/rmse_train
## [1] 1.08308
  • The test RMSE is only slightly larger than the training RMSE, which means your model generalizes well to unseen observations. You can conclude that the logarithm transformation did improve your model: it fits the data better and generalizes well.

Your own k-NN algorithm!

In the video, Gilles briefly showed you how to set up your own k-NN algorithm. Now it’s time to take a closer look at how it works.

We went ahead and defined a function my_knn that contains a k-NN algorithm. Its arguments are:

  • x_pred: predictor values of the new observations (this will be the cgdp column of world_bank_test),
  • x: predictor values of the training set (the cgdp column of world_bank_train),
  • y: corresponding response values of the training set (the urb_pop column of world_bank_train),
  • k: the number of neighbors (this will be 30).

The function returns the predicted values for your new observations (predict_knn).

You’ll apply a k-NN algorithm to the GDP / capita of the countries in world_bank_test to predict their percentage of urban population.

###
# You don't have to change this!
# The algorithm is already coded for you; 
# inspect it and try to understand how it works!
my_knn <- function(x_pred, x, y, k){
  m <- length(x_pred)
  predict_knn <- rep(0, m)
  for (i in 1:m) {
    
    # Calculate the absolute distance between x_pred[i] and x (between the new observation and your training sample)
    dist <- abs(x_pred[i] - x)
    
    # Apply order() to dist: sort_index contains the indices of the
    # elements of dist in ascending order, so sort_index[1:k] gives
    # the indices of the k nearest neighbors.
    sort_index <- order(dist)

    # predict_knn[i] is the mean() of the responses y (the percentages
    # of urban population in the training set) of the k nearest neighbors.
    predict_knn[i] <- mean(y[sort_index[1:k]])
    
  }
  return(predict_knn)
}
###

# world_bank_train and world_bank_test are pre-loaded

# Apply your algorithm on the test set: test_output
test_output <- my_knn(world_bank_test$cgdp, world_bank_train$cgdp, world_bank_train$urb_pop, 30)

# Have a look at the plot of the output
plot(world_bank_train, 
     xlab = "GDP per Capita", 
     ylab = "Percentage Urban Population")
points(world_bank_test$cgdp, test_output, col = "green")
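
The explicit for loop makes the algorithm easy to follow, but the same idea fits in a few lines of sapply(). A sketch of an equivalent, more compact version (my_knn_compact is a name introduced here):

# Equivalent k-NN regressor without an explicit loop
my_knn_compact <- function(x_pred, x, y, k) {
  sapply(x_pred, function(xp) {
    nearest <- order(abs(xp - x))[1:k]  # indices of the k closest training points
    mean(y[nearest])                    # average their responses
  })
}

# Should reproduce test_output from above
test_output_compact <- my_knn_compact(world_bank_test$cgdp,
                                      world_bank_train$cgdp,
                                      world_bank_train$urb_pop, 30)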

Parametric vs non-parametric!

So now you’ve built three different models for the same data:

  • a simple linear model, lm_wb,
  • a log-linear model, lm_wb_log, and
  • a non-parametric k-NN model.

The k-NN model is actually just a function that takes test and training data and predicts response values on the fly: my_knn().

These objects are all stored in the workspace, as are world_bank_train and world_bank_test.

Have a look at the sample code below, which shows the first steps for building a fancy plot. In the end, three lines should be plotted, representing the predicted responses for the test set, together with the true responses.

You’ll also calculate the RMSE of the test set for the simple linear, log-linear and k-NN regression. Have a look at the results, which regression approach performs the best?

# world_bank_train and world_bank_test are pre-loaded
# lm_wb and lm_wb_log have been trained on world_bank_train
# The my_knn() function is available

# Define ranks to order the predictor variables in the test set
ranks <- order(world_bank_test$cgdp)

# Scatter plot of test set
plot(world_bank_test,
     xlab = "GDP per Capita", ylab = "Percentage Urban Population")

# Predict with simple linear model and add line
test_output_lm <- predict(lm_wb, data.frame(cgdp = world_bank_test$cgdp))
lines(world_bank_test$cgdp[ranks], test_output_lm[ranks], lwd = 2, col = "blue")

# Predict with log-linear model and add line
test_output_lm_log <- predict(lm_wb_log, data.frame(cgdp = world_bank_test$cgdp))
lines(world_bank_test$cgdp[ranks], test_output_lm_log[ranks], lwd = 2, col = "red")

# Predict with k-NN and add line
test_output_knn <- my_knn(world_bank_test$cgdp, world_bank_train$cgdp, world_bank_train$urb_pop, 30)
lines(world_bank_test$cgdp[ranks], test_output_knn[ranks], lwd = 2, col = "green")

# Calculate RMSE on the test set for simple linear model
sqrt(mean( (test_output_lm - world_bank_test$urb_pop) ^ 2))
## [1] 17.41897
# Calculate RMSE on the test set for log-linear model
sqrt(mean( (test_output_lm_log - world_bank_test$urb_pop) ^ 2))
## [1] 15.01911
# Calculate RMSE on the test set for k-NN technique
sqrt(mean( (test_output_knn - world_bank_test$urb_pop) ^ 2))
## [1] 16.10026
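
To wrap up the comparison, you can collect the three test RMSEs in a named vector so the best performer stands out at a glance; a small sketch reusing the predictions computed above:

# Gather the test RMSEs and pick the approach with the lowest error
rmse_test_all <- c(
  lm     = sqrt(mean((test_output_lm     - world_bank_test$urb_pop) ^ 2)),
  lm_log = sqrt(mean((test_output_lm_log - world_bank_test$urb_pop) ^ 2)),
  knn    = sqrt(mean((test_output_knn    - world_bank_test$urb_pop) ^ 2))
)
which.min(rmse_test_all)  # lm_log: the log-linear model wins with an RMSE of about 15.0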