Simple linear regression: your first step!
In your first exercise, you’ll familiarize yourself with the concept of simple linear regression. You are given measurements of grey kangaroos’ nose width and length (Source). You can find the data in kang_nose, which is loaded in your workspace. It has two columns: nose_width and nose_length.
Your job is to describe a possible linear relationship between the grey kangaroo’s nose width (mm) and nose length (mm). Make use of the lm() function as shown in the video and consult the help file if necessary. Remember to explore your data first!
We caught Skippy and measured its nose width, stored in nose_width_new, but it escaped before we could measure its nose length. Can you help?
kang_nose <- read.csv("kang_nose.csv")
nose_width_new <- read.csv("nose_width_new.csv")
# Plot nose length as a function of nose width.
plot(kang_nose)

# Describe the linear relationship between the two variables: lm_kang
lm_kang <- lm(nose_length ~ nose_width, data=kang_nose)
# Print the coefficients of lm_kang
lm_kang$coefficients
## (Intercept) nose_width
## 27.893058 2.701175
# Predict and print the nose length of the escaped kangaroo
predict(lm_kang, nose_width_new)
## 1
## 703.1869
- There is a visible linear pattern between the nose width and length of grey kangaroos! With a nose width of 250 mm, your prediction tells us Skippy has a nose length of around 700 mm.
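You can sanity-check that number by plugging the fitted coefficients into the regression equation by hand. The 250 mm below comes from the note above, not from nose_width_new itself.
# Manual prediction: intercept + slope * nose width (250 mm, per the note above)
coef(lm_kang)[1] + coef(lm_kang)[2] * 250  # about 703.19, matching predict()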
Another take at regression: be critical
You are given data on GDP per capita and its relation to the percentage of urban population for several UN countries, measured in the year 2014 (Source: The World Bank). This dataset is stored in a data frame world_bank_train and has two variables: cgdp and urb_pop.
Have a look at the data: do you think a relationship between the two is plausible? Try to set up a linear model for the percentage of urban population based on the GDP per capita.
Afghanistan has a GDP per capita of 413 USD, stored in cgdp_afg, but its urban population in 2014 is not known yet. Can you predict the outcome using your model?
world_bank_train <- read.csv("world_bank_train.csv")
cgdp_afg <- read.csv("cgdp_afg.csv")
# Plot urb_pop as a function of cgdp
plot(urb_pop ~ cgdp, data = world_bank_train)
# Set up a linear model between the two variables: lm_wb
lm_wb <- lm(urb_pop ~ cgdp, data = world_bank_train)
# Add a red regression line to your scatter plot
abline(lm_wb$coefficients, col = "red")

# Summarize lm_wb and select R-squared
summary(lm_wb)$r.squared
## [1] 0.3822067
# Predict the urban population of Afghanistan based on cgdp_afg
predict(lm_wb, cgdp_afg)
## 1
## 45.01204
- You must admit, though, that your linear model is barely acceptable and far from satisfying. R-squared is quite low and the regression line doesn’t seem to fit well, so is the predicted 45% to be trusted?
Non-linear, but still linear?
In the last exercise, your scatter plot didn’t show a strong linear relationship. You confirmed this with the regression line and R-squared.
To improve your model, take a step back and study the nature of the data. The predictor variable is numerical, while the response variable is expressed as a percentage. It would make more sense if there were a linear relationship between relative (percentage) changes in the GDP per capita and changes in the response.
To capture these relative changes, take the natural logarithm of the GDP per capita and use it as your new predictor variable. A model solution to the previous exercise is included in the editor; it’s up to you to make some changes.
# Convert the cgdp variable of world_bank_train to log(cgdp) in both plot() and lm().
# Plot: change the formula and xlab
plot(urb_pop ~ log(cgdp), data = world_bank_train,
xlab = "log(GDP per Capita)",
ylab = "Percentage of urban population")
# Linear model: change the formula
lm_wb_log <- lm(urb_pop ~ log(cgdp), data = world_bank_train)
# Add a red regression line to your scatter plot
abline(lm_wb_log$coefficients, col = "red")

# Summarize lm_wb_log and select R-squared
summary(lm_wb_log)$r.squared
## [1] 0.5787588
# Predict the urban population of Afghanistan based on cgdp_afg
predict(lm_wb_log, cgdp_afg)
## 1
## 25.86759
- Your scatter plot now shows a stronger linear relationship! This model predicts the urban population of Afghanistan to be around 26%: quite a difference from the 45% you had before! In the next exercise you will compare the R-squared measures for both of these linear regression attempts.
Interpreting R-squared
In your last exercise, you transformed your predictor with great success. The predictions of your two models differed substantially, so which one should you trust? To answer this, you can have a look at the R-squared values of both attempts. The first linear model you built, which did not use a transformation, is available as lm_wb. The second model, where log() was used, is pre-loaded as lm_wb_log.
Which of the following statements is true?
lm_wb
##
## Call:
## lm(formula = urb_pop ~ cgdp, data = world_bank_train)
##
## Coefficients:
## (Intercept) cgdp
## 4.470e+01 7.578e-04
lm_wb_log
##
## Call:
## lm(formula = urb_pop ~ log(cgdp), data = world_bank_train)
##
## Coefficients:
## (Intercept) log(cgdp)
## -42.60 11.37
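Before settling on an answer, you can compare the proportion of explained variance of both models directly; the values below were already computed earlier in this section.
# Compare the explained variance of both models
summary(lm_wb)$r.squared      # 0.3822067
summary(lm_wb_log)$r.squared  # 0.5787588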
- lm_wb_log returns the better prediction; its explained variance is the highest.
- The second model is clearly better at predicting the percentage of urban population based on GDP per capita. This type of model is called log-linear, as you take the logarithm of your predictor variable but leave the response variable unchanged. Overall, your model still leaves quite a lot of variance unexplained. If you want more precise predictions, you’ll have to add other relevant variables in a multiple linear regression model.
Going all-in with predictors!
In this exercise you’ll add even more predictors: inventory (inv), the size of the district (size_dist) and the shop size (sq_ft).
Your job is to set up this model, verify if the fit is good and finally measure the accuracy. Make sure you interpret the results at every step!
shop_data <- read.csv("shop_data.csv")
# Plot sales as a function of shop size, district size and inventory. Is linearity plausible?
plot(sales ~ sq_ft, shop_data)

plot(sales ~ size_dist, shop_data)

plot(sales ~ inv, shop_data)

# Build a linear model for net sales based on all other variables: lm_shop
lm_shop <- lm(sales ~ ., data = shop_data)
# Generate a summary() of lm_shop. Have a look at the R-squared and adjusted R-squared
# and try to interpret them. Is R-squared close to 1? Is the adjusted R-squared higher
# than in the video example (0.906)?
summary(lm_shop)
##
## Call:
## lm(formula = sales ~ ., data = shop_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -26.338 -9.699 -4.496 4.040 41.139
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -18.85941 30.15023 -0.626 0.538372
## sq_ft 16.20157 3.54444 4.571 0.000166 ***
## inv 0.17464 0.05761 3.032 0.006347 **
## ads 11.52627 2.53210 4.552 0.000174 ***
## size_dist 13.58031 1.77046 7.671 1.61e-07 ***
## comp -5.31097 1.70543 -3.114 0.005249 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.65 on 21 degrees of freedom
## Multiple R-squared: 0.9932, Adjusted R-squared: 0.9916
## F-statistic: 611.6 on 5 and 21 DF, p-value: < 2.2e-16
- Both the multiple R-squared (0.9932) and the adjusted R-squared (0.9916) are very close to 1, pointing to an excellent fit, and the adjusted R-squared is higher than in the video example (0.906).
Are all predictors relevant?
To further analyze the performance, take a look at the p-values of every predictor. Are they all relevant? Remember that you should verify the assumptions on the error variable before interpreting these results.
There is a shop owner that didn’t participate in the questionnaire, who has caught wind of your amazing model! He asked us to estimate his net sales based on the data he provided. Can you help him out? The data can be found in shop_new. shop_data and lm_shop are also available in the workspace.
shop_new <- read.csv("shop_new.csv")
# Plot the residuals as a function of your fitted values
plot(lm_shop$fitted.values, lm_shop$residuals)

# Make a Q-Q plot of your residual quantiles
qqnorm(lm_shop$residuals, ylab = "Residual Quantiles")
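As an optional extra, not part of the original exercise, you could overlay the theoretical reference line on the Q-Q plot to make the normality check easier to read.
# Optional: add the theoretical reference line to the Q-Q plot
qqline(lm_shop$residuals)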

# Summarize your model. Are there any irrelevant predictors?
summary(lm_shop)
##
## Call:
## lm(formula = sales ~ ., data = shop_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -26.338 -9.699 -4.496 4.040 41.139
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -18.85941 30.15023 -0.626 0.538372
## sq_ft 16.20157 3.54444 4.571 0.000166 ***
## inv 0.17464 0.05761 3.032 0.006347 **
## ads 11.52627 2.53210 4.552 0.000174 ***
## size_dist 13.58031 1.77046 7.671 1.61e-07 ***
## comp -5.31097 1.70543 -3.114 0.005249 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.65 on 21 degrees of freedom
## Multiple R-squared: 0.9932, Adjusted R-squared: 0.9916
## F-statistic: 611.6 on 5 and 21 DF, p-value: < 2.2e-16
# Predict the net sales based on shop_new.
predict(lm_shop, shop_new)
## 1
## 262.5006
- There is no clear pattern in your residuals. Moreover, the residual quantiles lie approximately on a straight line. Given the small p-values, you can conclude that every predictor is relevant!
Are all predictors relevant? Take 2!
Let’s take a different dataset. In 2002, a study in Brisbane examined different chocolate bars and measured their energy per 100 grams, percentage of protein, percentage of fat and total size. You’re wondering whether energy is related to the other variables. You can find the results in choco_data.
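The code for this exercise is not reproduced here, but a minimal sketch could look like the following. The column names energy, protein, fat and size are assumptions about choco_data, not taken from the original, and choco_data is assumed to be loaded in the workspace.
# Build a linear model for energy based on the other variables
# (column names are assumed, see the note above)
lm_choco <- lm(energy ~ protein + fat + size, data = choco_data)
# Diagnostic plots: residuals vs fitted values and a Q-Q plot of the residuals
plot(lm_choco$fitted.values, lm_choco$residuals)
qqnorm(lm_choco$residuals, ylab = "Residual Quantiles")
# Inspect the coefficients and their p-values
summary(lm_choco)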
Interpreting the residuals and p-values
In the last exercise you built a linear regression model for the energy (per 100 g) in a chocolate bar based on its size, percentage of protein and percentage of fat. Are all these predictors relevant at a 95% confidence level? The residual and quantile plots from the previous exercise are already available. You can also build the summary() of lm_choco again if you want; the linear model is available in the workspace.
Which of the following statements is true?
- The predictors protein and fat are statistically insignificant; it is best to remove them.
- The predictor size is statistically insignificant; it is best to remove it.
- The residuals seem to be dependent; the p-values must be ignored.
- The residuals don’t appear to be normally distributed; the p-values must be ignored.
The second statement is correct: the predictor size is statistically insignificant, so it is best to remove it.
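Acting on that conclusion would mean refitting the model without size; again a sketch using the column names assumed above.
# Refit without the statistically insignificant predictor (assumed column names)
lm_choco_2 <- lm(energy ~ protein + fat, data = choco_data)
summary(lm_choco_2)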
Does your model generalize?
Remember the dataset on GDP per capita and percentage of urban development? You were advised to take the logarithm of GDP per capita. Both the regression line and R-squared showed a better result for the log-linear model, lm_wb_log, than for the simple linear model, lm_wb.
You might be wondering whether we misled you into producing an overfitted model. The workspace contains world_bank_train, which you used before to build the log-linear model. Now, however, world_bank_test is also available. You can use this dataset to test whether your model generalizes well.
You’ll be comparing the RMSE for both sets:
\[ RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n} res_i^2} \]
world_bank_test <- read.csv("world_bank_test.csv")
# Build the log-linear model
lm_wb_log <- lm(urb_pop ~ log(cgdp), data = world_bank_train)
# Calculate rmse_train
rmse_train <- sqrt(mean(lm_wb_log$residuals ^ 2))
# The real percentage of urban population in the test set, the ground truth
world_bank_test_truth <- world_bank_test$urb_pop
# The predictions of the percentage of urban population in the test set
world_bank_test_input <- data.frame(cgdp = world_bank_test$cgdp)
world_bank_test_output <- predict(lm_wb_log, world_bank_test_input)
# The residuals: the differences between the predictions and the ground truth
res_test <- world_bank_test_output - world_bank_test_truth
# Use res_test to calculate rmse_test
rmse_test <- sqrt(mean(res_test ^ 2))
# Print the ratio of the test RMSE over the training RMSE
rmse_test/rmse_train
## [1] 1.08308
- The test RMSE is only slightly larger than the training RMSE, which means your model generalizes well to unseen observations. You can conclude that the logarithm transformation did improve your model: it fits the data better and does a good job of generalizing!
Your own k-NN algorithm!
In the video, Gilles briefly showed you how to set up your own k-NN algorithm. Now it’s time to inspect up close how it works.
We went ahead and defined a function my_knn that contains a k-NN algorithm. Its arguments are:
- x_pred: predictor values of the new observations (this will be the cgdp column of world_bank_test),
- x: predictor values of the training set (the cgdp column of world_bank_train),
- y: corresponding response values of the training set (the urb_pop column of world_bank_train),
- k: the number of neighbors (this will be 30).
The function returns the predicted values for your new observations (predict_knn).
You’ll apply a k-NN algorithm to the GDP / capita of the countries in world_bank_test to predict their percentage of urban population.
###
# You don't have to change this!
# The algorithm is already coded for you;
# inspect it and try to understand how it works!
my_knn <- function(x_pred, x, y, k) {
  m <- length(x_pred)
  predict_knn <- rep(0, m)
  for (i in 1:m) {
    # Absolute distance between the new observation x_pred[i]
    # and every observation in the training sample x
    dist <- abs(x_pred[i] - x)
    # order() returns the indices of dist in ascending order,
    # so sort_index[1:k] are the indices of the k nearest neighbors
    sort_index <- order(dist)
    # The prediction is the mean response of the k nearest neighbors
    # (y holds the percentages of urban population in the training set)
    predict_knn[i] <- mean(y[sort_index[1:k]])
  }
  return(predict_knn)
}
###
# world_bank_train and world_bank_test are pre-loaded
# Apply your algorithm on the test set: test_output
test_output <- my_knn(world_bank_test$cgdp, world_bank_train$cgdp, world_bank_train$urb_pop, 30)
# Have a look at the plot of the output
plot(world_bank_train,
xlab = "GDP per Capita",
ylab = "Percentage Urban Population")
points(world_bank_test$cgdp, test_output, col = "green")

Parametric vs non-parametric!
So now you’ve built three different models for the same data:
- a simple linear model, lm_wb,
- a log-linear model, lm_wb_log, and
- a non-parametric k-NN model.
The k-NN model is actually simply a function that takes test and training data and predicts response variables on the fly: my_knn().
These objects are all stored in the workspace, as are world_bank_train and world_bank_test.
Have a look at the sample code, which shows the first steps for building a fancy plot. In the end, three lines should be plotted that represent the predicted responses for the test set, together with the true responses; a possible sketch follows below.
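The exercise’s original plotting code isn’t reproduced here, but a minimal sketch of the idea, relying only on objects defined earlier in this section, could look like this (the colour choices are arbitrary):
# Sort the test set by GDP per capita so the prediction lines draw left to right
ranks <- order(world_bank_test$cgdp)
world_bank_test_sorted <- world_bank_test[ranks, ]
# True responses as points
plot(urb_pop ~ cgdp, data = world_bank_test_sorted,
xlab = "GDP per Capita",
ylab = "Percentage Urban Population")
# Predicted responses of the three models as lines
lines(world_bank_test_sorted$cgdp, predict(lm_wb, world_bank_test_sorted), col = "blue")
lines(world_bank_test_sorted$cgdp, predict(lm_wb_log, world_bank_test_sorted), col = "red")
lines(world_bank_test_sorted$cgdp,
my_knn(world_bank_test_sorted$cgdp, world_bank_train$cgdp, world_bank_train$urb_pop, 30),
col = "green")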