---
title: "Model Building & Evaluation (lab10)"
author: Lauren Kroll
date: "4/23/2020"
output:
  html_notebook: default
  html_document: default
subtitle: BSAD 343H, Business Analytics, Spring 2020
---
The methods used in deriving a predictive model and later evaluating its performance form the building blocks of any work in machine learning. The first step is to split a given set of data into two separate datasets: the training set and the testing set, with the training set containing at least 50% of the data. In principle, the bigger the training set, the better the derived model. We use the training set to build a model that best predicts the data. Later, we test the built model on the testing data set and assess how well the model predicts the actual values. This approach is common in supervised machine learning, where the output of a model can be assessed against an expected outcome.
The concept is not very complicated, but it is important to understand the steps involved in the training stage and to differentiate them from the steps in the testing stage.
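As a point of reference, the split itself is straightforward to perform in R. In this lab the training and testing data are already provided as separate files, so the sketch below is only illustrative; it assumes a single combined data frame named fulldata (a hypothetical name) and performs a random 70/30 split.

# Illustrative sketch only (not part of the lab): split a hypothetical combined
# data frame 'fulldata' into a ~70% training part and a ~30% testing part
set.seed(123)                                                  # make the random split reproducible
train_rows = sample(nrow(fulldata), size = floor(0.7 * nrow(fulldata)))
train_part = fulldata[train_rows, ]                            # rows selected for training
test_part = fulldata[-train_rows, ]                            # remaining rows for testing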
Remember to always set your working directory to the source file location. Go to ‘Session’, scroll down to ‘Set Working Directory’, and click ‘To Source File Location’. Read the instructions below carefully and follow them to complete the tasks and answer any questions. Submit your work in Sakai as detailed in previous notes.
For your assignment you may be using different data sets than those included here. Always read the provided instructions carefully before executing any included code chunks and/or adding your own code. For clarity, tasks/questions to be completed/answered are highlighted in red and numbered according to their placement in the task section. The red color is only visible in Preview mode. Quite often you will need to add your own code chunk.
Execute all code chunks (both those already included and those you add), preview, check integrity, and submit your final work (html file) in Sakai.
The first half of this lab focuses on building a model using the training data. The data we will be using was obtained from the Kaggle site [1] and refers to world university rankings [2]. The data looks at university world scoring based on different ranking criteria such as quality of education, quality of faculty, and rank for patents. Spend some time visiting the referenced websites to become more acquainted with the data. The data obtained is divided into two sets: a training set and a testing set. We begin by reading the training data set from the ‘universityrank_training.csv’ file, and checking the header lines to make sure the data is read correctly.
traindata = read.csv(file="universityrank_training.csv", header=TRUE)
head(traindata)
## world_rank institution country national_rank
## 1 1 Harvard University USA 1
## 2 2 Massachusetts Institute of Technology USA 2
## 3 3 Stanford University USA 3
## 4 4 University of Cambridge United Kingdom 1
## 5 5 California Institute of Technology USA 4
## 6 6 Princeton University USA 5
## quality_of_education alumni_employment quality_of_faculty publications
## 1 7 9 1 1
## 2 9 17 3 12
## 3 17 11 5 4
## 4 10 24 4 16
## 5 2 29 7 37
## 6 8 14 2 53
## influence citations broad_impact patents score year
## 1 1 1 NA 5 100.00 2012
## 2 4 4 NA 1 91.67 2012
## 3 2 2 NA 15 89.50 2012
## 4 16 11 NA 50 86.17 2012
## 5 22 22 NA 18 85.21 2012
## 6 33 26 NA 101 82.50 2012
Next, we extract the two columns of interest and name them so we can easily refer to them later in the code.
patent_train=traindata$patents
score_train = traindata$score
The first model we will build is a simple linear model. We will use the patents ranking variable to predict the university score. To better understand the data: the lower the patents ranking number, the better. A value of 1 is a top rank for patents and represents the highest category in terms of number of patents owned by the particular academic institution. On the other hand, the higher the calculated total score the better, as reflected by the world rank number. A value of 100 is a perfect score.
linear_train = lm(score_train ~ patent_train)
summary(linear_train)
##
## Call:
## lm(formula = score_train ~ patent_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.876 -4.010 -1.118 1.512 45.471
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 54.6397362 0.3857798 141.63 <2e-16 ***
## patent_train -0.0157558 0.0008281 -19.03 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.535 on 1198 degrees of freedom
## Multiple R-squared: 0.2321, Adjusted R-squared: 0.2314
## F-statistic: 362 on 1 and 1198 DF, p-value: < 2.2e-16
plot(patent_train,score_train)
abline(linear_train, col="blue", lwd=2)
##### 1A) Complete the steps in the code chunk below to build a non-linear quadratic model. Follow the steps used in lab08 for costs versus servers (4pts)
# First define a new variable which is the squared value of patent_train (defined above)
patent_train2=(patent_train)^2
# Next derive the quadratic regression model. You may want to call it quad_train.
quad_train = lm(score_train ~ patent_train + patent_train2)
# Publish the summary statistics
summary(quad_train)
##
## Call:
## lm(formula = score_train ~ patent_train + patent_train2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.555 -2.345 -0.582 1.302 40.843
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.966e+01 4.971e-01 120.02 <2e-16 ***
## patent_train -6.249e-02 3.319e-03 -18.83 <2e-16 ***
## patent_train2 5.972e-05 4.127e-06 14.47 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.955 on 1197 degrees of freedom
## Multiple R-squared: 0.3464, Adjusted R-squared: 0.3453
## F-statistic: 317.2 on 2 and 1197 DF, p-value: < 2.2e-16
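As a side note, the same quadratic model can be fit without creating the squared variable by hand, by wrapping the squared term in I() inside the formula. This is only an equivalent alternative to the steps above, not a required part of the task.

# Equivalent alternative (optional): specify the squared term inline with I()
quad_train_alt = lm(score_train ~ patent_train + I(patent_train^2))
summary(quad_train_alt)   # coefficients should match quad_train above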
##### 1B) Looking at the values for R-squared and Adjusted R-squared for both the linear and the quadratic models, select the best predictive model. Explain your logic (2pts)
The R-squared for the linear model is 0.2321, and the R-squared for the quadratic model is 0.3464. The Adjusted R-squared for the linear model is 0.2314, and the Adjusted R-squared for the quadratic model is 0.3453. The best predictor can be determined by finding the highest Adjusted R-squared value. In this case, it is 0.3453, which comes from the quadratic model. From this, you can conclude that the quadratic model is the better predictive model.
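If you prefer not to read these numbers off the printed summaries, the same values can be extracted programmatically; this is optional and only a convenience.

# Optional: pull the R-squared values directly from the model summaries
summary(linear_train)$r.squared       # R-squared, linear model
summary(quad_train)$r.squared         # R-squared, quadratic model
summary(linear_train)$adj.r.squared   # adjusted R-squared, linear model
summary(quad_train)$adj.r.squared     # adjusted R-squared, quadratic model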
The second half of predictive modeling is about testing the model using a different data set, called the testing data. Again, we must first read the testing data set and make sure the data is read properly.
testdata = read.csv("universityrank_testing.csv", header=TRUE)
head(testdata)
## world_rank institution country national_rank
## 1 1 Harvard University USA 1
## 2 2 Stanford University USA 2
## 3 3 Massachusetts Institute of Technology USA 3
## 4 4 University of Cambridge United Kingdom 1
## 5 5 University of Oxford United Kingdom 2
## 6 6 Columbia University USA 4
## quality_of_education alumni_employment quality_of_faculty publications
## 1 1 1 1 1
## 2 9 2 4 5
## 3 3 11 2 15
## 4 2 10 5 11
## 5 7 13 10 7
## 6 13 6 9 13
## influence citations broad_impact patents score year
## 1 1 1 1 3 100.00 2015
## 2 3 3 4 10 98.66 2015
## 3 2 2 2 1 97.54 2015
## 4 6 12 13 48 96.81 2015
## 5 12 7 9 15 96.46 2015
## 6 13 11 12 4 96.14 2015
We extract again the two columns of interest, this time from the testing data set, and name them accordingly.
patent_test = testdata$patents
score_test = testdata$score
We are ready now to check if the derived models are actually good predictive models. First we calculate the predicted test data score using the linear model as derived from the training set. Later we will consider the quadratic model.
# Calculate the predicted test data score
score_predict1 = coef(linear_train)[1] + coef(linear_train)[2]*patent_test
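The same values can also be obtained with R's built-in predict() function, provided the new data is supplied in a data frame whose column name matches the predictor name used when the model was fit (patent_train). This is only an equivalent alternative to the manual calculation above.

# Equivalent alternative (optional): let predict() apply the fitted model
score_predict1_alt = predict(linear_train, newdata = data.frame(patent_train = patent_test))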
For a visual, qualitative evaluation we can plot the actual testing data and overlay the predicted values.
# Plot the actual values for patent and score as observed in the testing data set
plot(patent_test, score_test, main='Test Data -- Score vs Patent')
# Overlay the predicted values as calculated from the linear model derived using the training data
par(new=TRUE, xaxt="n", yaxt="n", ann=FALSE)
# The red color distinguishes the predicted values which, because of the linear model, fall exactly on a line
plot(patent_test, score_predict1, col="red")
A better way to assess the goodness of a predictive model is to scatter plot the actual values against the predicted values. In a perfect predictive model the points would line up along the diagonal line. This is rarely, if ever, the case!
#Plot predicted values from the linear model versus actual values from the test data
plot(score_test, score_predict1, xlab='Actual', ylab='Predict', main='Linear Model -- Predict vs Actual Test')
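To make the comparison with the diagonal easier to judge, a 45-degree reference line can be overlaid on the scatter plot; this overlay is an optional addition to the plot above.

# Optional: overlay the 45-degree line that a perfect predictor would follow
abline(0, 1, col="blue", lwd=2)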
From the plot we can easily see that most of the predicted values are far from the diagonal line; some scatter like this is expected. Finally, to quantify the goodness of a model, we need to calculate the Root Mean Square Error (RMSE).
#Calculate RMSE for Linear Model
error1 = sum((score_predict1 - score_test)^2)/length(score_test)
rmse1 = sqrt(error1)
rmse1
## [1] 5.95868
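Since the same RMSE calculation will be repeated for the quadratic model, it can optionally be wrapped in a small helper function; this is just a refactoring of the computation above.

# Optional helper: RMSE from predicted and actual values
rmse = function(predicted, actual) {
  sqrt(mean((predicted - actual)^2))
}
rmse(score_predict1, score_test)   # reproduces rmse1 above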
It is hard to judge whether this number is good unless we compare it to other possibilities. Of course, a perfect scenario would have an RMSE of zero. We now need to repeat the above calculations for the non-linear quadratic model.
##### 2A) Fill in the code chunk below to calculate the predicted values for the non-linear quadratic model (4pts)
# Calculate score_predict2 based on the quadratic model and the patent test data. You need to refer again to the coefficients of the quadratic model derived earlier and the actual patent values obtained from the testing data
patent_test2=patent_test^2
score_predict2 = coef(quad_train)[1] + coef(quad_train)[2]*patent_test + coef(quad_train)[3]*patent_test2
For a visual representation, similar to the linear model, we need to do the following.
##### 2B) Fill-in the code chunk below to plot the actual Score vs Patent for the test data, and overlay the predicted values as calculated in 2A. Label axes and title properly (4pts)
# Plot the actual values for patent and score as observed in the testing data set
plot(patent_test, score_test, xlab='Patent Rank', ylab='Score', main='Test Data -- Score vs Patent (Quadratic Model)')
# Overlay the predicted values as calculated from the quadratic model derived using the training data
par(new=TRUE, xaxt="n", yaxt="n", ann=FALSE)
# The green color distinguishes the predicted values which, because of the quadratic model, fall exactly on a parabola
plot(patent_test, score_predict2, col="green")
##### 2C) Plot the Predict vs Actual for the quadratic model. Label axes and title properly (2pts)
#Plot the predicted values from the quadratic model versus the actual values from the test data
plot(score_test, score_predict2, xlab='Actual', ylab='Predict', main='Quadratic Model -- Predict vs Actual Test')
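As with the linear model, the 45-degree reference line can optionally be overlaid here as well to make the comparison with the diagonal easier to see.

# Optional: overlay the 45-degree line that a perfect predictor would follow
abline(0, 1, col="blue", lwd=2)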
##### 2D) By looking at the scatterplots for the linear and quadratic models are you able to tell which model is better? Explain your logic (1pt)
Just by looking at the plots, the quadratic model seems to fit better, because it follows the shape of the data. The linear model leaves a lot of space between the line and the actual data points, while the quadratic curve tracks the data more closely. A better way is to quantify the goodness of the model by calculating the RMSE again.
##### 2E) Calculate the root mean square error (RMSE) for the quadratic model (2pts)
#Calculate RMSE for Quadratic Model
error2 = sum((score_predict2 - score_test)^2)/length(score_test)
rmse2 = sqrt(error2)
rmse2
## [1] 5.685396
##### 2F) Based on the root mean square error (RMSE) which model is better? How does your conclusion reconcile with the results from Task 1? (1pt)
The RMSE of the quadratic model is 5.69, and the RMSE of the linear model is 5.96. This means the quadratic model has a smaller prediction error and fits the data better than the linear model. This reconciles with Task 1, where we determined that the quadratic model was the better predictor.
source [1]: http://www.kaggle.com
source [2]: http://www.cwur.org