The best way to create a predictive model is to begin with a complete data set. Split the data into two seperate datasets: the training set and the testing set, with the training set greater than or equal to 50% of the data. We will use the training set to build a model best predictor of the data. Lastly, we will test the built model on the testing data set, and assess how good the model is in predicting the actual values.
The concept may seem overwhelming, but we already have almost all of the tools to perform this analysis.
Remember to always set your working directory to the source file location. Go to ‘Session’, scroll down to ‘Set Working Directory’, and click ‘To Source File Location’. Read carefully the below and follow the instructions to complete the tasks and answer any questions. Submit your work to RPubs as detailed in previous notes.
For your assignment you may be using different data sets than what is included here. Always read carefully the instructions on Sakai. Tasks/questions to be completed/answered are highlighted in larger bolded fonts and numbered according to their particular placement in the task section.
The first half of this lab focus on building a model using the training data. The data we will be using was obtained from Kaggle site[1] and is in reference to world university rankings [2] . The data looks at university world scoring based on different ranking criteria such as rank for quality of education, rank for quality of faculty, and rank for patents. We begin by reading the training data set ‘universityrank_training.csv’ file, and checking the header lines to make sure the data is read correctly.
traindata = read.csv(file="data/universityrank_training.csv", header=TRUE)
head(traindata)
## world_rank institution country
## 1 1 Harvard University USA
## 2 2 Massachusetts Institute of Technology USA
## 3 3 Stanford University USA
## 4 4 University of Cambridge United Kingdom
## 5 5 California Institute of Technology USA
## 6 6 Princeton University USA
## national_rank quality_of_education alumni_employment quality_of_faculty
## 1 1 7 9 1
## 2 2 9 17 3
## 3 3 17 11 5
## 4 1 10 24 4
## 5 4 2 29 7
## 6 5 8 14 2
## publications influence citations broad_impact patents score year
## 1 1 1 1 NA 5 100.00 2012
## 2 12 4 4 NA 1 91.67 2012
## 3 4 2 2 NA 15 89.50 2012
## 4 16 16 11 NA 50 86.17 2012
## 5 37 22 22 NA 18 85.21 2012
## 6 53 33 26 NA 101 82.50 2012
Next, we extract the two columns of interest and call them properly so we can easily refer to them later in the code.
patent_train = traindata$patents
score_train = traindata$score
The first model we will build is a simple linear model and we also plot the patents versus the score data. We will use the patents ranking variable to predict the university score. To better understand the data, the lower the patents ranking number the better it is. A value of 1 is a top rank for patents and represent the highest category in terms of number of patents owned by the particular academic institution. On the other hand the higher the calculated total score the better, as reflected by the world rank number. A value of 100 is a perfect score.
linear_train = lm(score_train ~ patent_train)
summary(linear_train)
##
## Call:
## lm(formula = score_train ~ patent_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.876 -4.014 -1.121 1.515 45.470
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 54.6406641 0.3859848 141.56 <2e-16 ***
## patent_train -0.0157607 0.0008291 -19.01 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.539 on 1197 degrees of freedom
## Multiple R-squared: 0.2319, Adjusted R-squared: 0.2313
## F-statistic: 361.4 on 1 and 1197 DF, p-value: < 2.2e-16
plot(patent_train,score_train)
abline(linear_train, col="blue", lwd=2)
##### Consider now building a non-linear quadratic model. Follow the steps in lab07 to derive a quadratic model similar to what we did for costs versus servers. #### TASK 1: First, define a new variable patent_train2 which is the squared value of patent_train (defined above)
patent_train2 = patent_train^2
patent_train2
## [1] 25 1 225 2500 324 10201 676 4356 25
## [10] 256 10201 100 81 9 49 1156 529 841
## [19] 169 484 10201 784 3844 1089 441 1936 400
## [28] 1681 10201 3249 7396 121 1225 64 10201 841
## [37] 196 144 10201 10201 10201 361 10201 529 1024
## [46] 10201 4225 10201 7056 729 2601 2916 3844 10201
## [55] 4 1225 10201 8464 10201 10201 10201 10201 10201
## [64] 9801 10201 10201 10201 10201 5476 10201 3721 7225
## [73] 10201 2116 16 10201 10201 10201 841 2401 2116
## [82] 10201 10201 7396 10201 10201 1600 10201 10000 10201
## [91] 10201 2116 10201 4761 10201 10000 8464 10201 10201
## [100] 10201 49 121 225 1 1521 100 361 10201
## [109] 10201 1764 289 196 49 25 169 144 4
## [118] 3600 2601 484 576 3969 900 4489 576 400
## [127] 2304 900 10201 5329 2601 36 2304 1764 289
## [136] 10201 10201 1296 1600 16 10201 729 9216 4489
## [145] 10201 10201 10201 6561 10201 4489 10201 3600 8100
## [154] 10000 81 10201 1024 10000 10201 10201 1600 10201
## [163] 10201 10201 10201 5329 10201 10201 1936 5329 3969
## [172] 10201 400 10201 10201 2601 10201 10201 10201 10201
## [181] 10201 6084 2916 10201 10201 10201 729 10201 8100
## [190] 10201 3364 2304 6561 6561 676 10201 6561 10201
## [199] 10201 1024 4 36 1 2304 256 16 784
## [208] 22201 41616 2025 144 81 196 784 100 324
## [217] 2304 7056 9 169 361 1600 3721 25 784
## [226] 1764 3249 4096 2025 14641 22201 2916 2916 25281
## [235] 543169 32761 51529 2601 1156 41616 8836 41616 1156
## [244] 5476 1444 10000 1444 543169 4624 114244 1089 400
## [253] 4096 28900 49 94249 4624 32761 64 4624 6084
## [262] 484 16384 94249 405769 32761 51529 10000 8836 94249
## [271] 2601 441 543169 529 37636 114244 5776 2025 231361
## [280] 19044 28900 32761 576 7056 8836 784 256 114244
## [289] 25281 32761 139129 729 1681 4096 10000 1936 37636
## [298] 51529 10000 405769 51529 10000 7056 6084 78400 1156
## [307] 225 3721 10000 231361 139129 16384 41616 67081 6084
## [316] 543169 51529 25281 37636 67081 37636 114244 114244 543169
## [325] 13225 10000 114244 121 304704 1156 543169 4624 6084
## [334] 181476 4096 14641 78400 6084 13225 181476 25281 41616
## [343] 94249 543169 16384 67081 25281 676 19044 139129 78400
## [352] 181476 94249 181476 231361 3249 139129 231361 51529 181476
## [361] 78400 7921 51529 37636 25281 543169 543169 543169 139129
## [370] 25281 543169 19044 2601 784 51529 41616 41616 51529
## [379] 405769 94249 231361 114244 78400 543169 22201 625 405769
## [388] 3721 10000 231361 32761 32761 28900 405769 19044 22201
## [397] 2304 41616 181476 51529 181476 41616 37636 2916 543169
## [406] 8836 16384 139129 28900 114244 139129 231361 94249 51529
## [415] 11881 231361 3249 28900 304704 51529 25281 67081 231361
## [424] 51529 139129 11881 231361 32761 231361 41616 304704 78400
## [433] 41616 37636 94249 16384 11881 19044 139129 22201 78400
## [442] 5476 19044 181476 181476 94249 5776 304704 405769 94249
## [451] 405769 51529 67081 41616 114244 51529 543169 231361 114244
## [460] 139129 7921 67081 139129 139129 114244 405769 304704 41616
## [469] 94249 231361 51529 543169 19044 139129 51529 231361 41616
## [478] 181476 11881 405769 8836 543169 543169 114244 543169 139129
## [487] 231361 304704 543169 181476 16384 19044 543169 139129 78400
## [496] 28900 13225 543169 94249 78400 304704 543169 181476 405769
## [505] 543169 543169 543169 181476 181476 181476 67081 304704 405769
## [514] 4624 1764 51529 304704 13225 67081 139129 543169 19044
## [523] 78400 67081 543169 139129 51529 181476 32761 139129 543169
## [532] 114244 543169 3249 181476 231361 231361 543169 114244 543169
## [541] 139129 543169 67081 405769 304704 405769 181476 231361 78400
## [550] 405769 16384 78400 94249 94249 405769 10000 231361 181476
## [559] 405769 41616 405769 19044 51529 405769 94249 139129 181476
## [568] 543169 114244 41616 405769 114244 543169 7921 67081 94249
## [577] 543169 543169 51529 7056 231361 94249 543169 7921 543169
## [586] 41616 304704 114244 78400 37636 231361 19044 139129 181476
## [595] 543169 543169 543169 543169 32761 13225 22201 32761 181476
## [604] 405769 405769 6084 28900 32761 231361 11881 543169 14641
## [613] 16384 543169 22201 11881 94249 304704 405769 13225 304704
## [622] 304704 543169 41616 139129 231361 25281 4624 7056 405769
## [631] 78400 181476 543169 405769 139129 405769 51529 231361 405769
## [640] 16384 304704 78400 543169 22201 114244 543169 8836 114244
## [649] 405769 181476 304704 405769 37636 139129 405769 543169 231361
## [658] 28900 543169 405769 114244 543169 51529 78400 139129 405769
## [667] 543169 231361 139129 543169 41616 181476 405769 41616 114244
## [676] 94249 139129 405769 304704 304704 114244 14641 139129 543169
## [685] 181476 304704 543169 304704 543169 405769 543169 181476 405769
## [694] 405769 94249 28900 67081 51529 114244 67081 78400 405769
## [703] 67081 231361 139129 114244 405769 543169 543169 543169 543169
## [712] 181476 7921 304704 139129 543169 405769 231361 14641 114244
## [721] 231361 543169 405769 94249 543169 543169 405769 41616 16384
## [730] 405769 181476 139129 304704 543169 14641 304704 543169 304704
## [739] 22201 405769 139129 67081 543169 543169 231361 405769 181476
## [748] 139129 304704 543169 543169 139129 28900 304704 94249 67081
## [757] 304704 543169 304704 28900 543169 181476 543169 181476 405769
## [766] 231361 543169 543169 51529 139129 231361 304704 22201 181476
## [775] 543169 78400 181476 543169 32761 181476 114244 543169 304704
## [784] 139129 304704 231361 405769 543169 231361 304704 304704 543169
## [793] 181476 41616 543169 231361 543169 304704 139129 94249 14641
## [802] 543169 51529 304704 304704 543169 405769 543169 231361 543169
## [811] 543169 181476 405769 543169 231361 543169 543169 543169 94249
## [820] 543169 543169 231361 543169 543169 139129 543169 543169 405769
## [829] 51529 405769 304704 304704 181476 231361 543169 543169 543169
## [838] 139129 543169 405769 304704 405769 231361 41616 405769 304704
## [847] 304704 543169 231361 304704 543169 114244 304704 181476 543169
## [856] 543169 543169 139129 543169 51529 181476 405769 543169 543169
## [865] 139129 78400 543169 25281 543169 231361 543169 231361 304704
## [874] 543169 181476 543169 543169 405769 114244 543169 405769 231361
## [883] 304704 304704 67081 543169 231361 543169 543169 543169 139129
## [892] 231361 543169 304704 405769 543169 304704 304704 304704 304704
## [901] 543169 139129 304704 543169 405769 543169 114244 405769 543169
## [910] 405769 304704 304704 543169 304704 67081 114244 543169 231361
## [919] 543169 304704 543169 405769 181476 543169 304704 543169 405769
## [928] 543169 231361 94249 231361 543169 543169 114244 543169 543169
## [937] 543169 67081 543169 304704 543169 543169 94249 231361 543169
## [946] 543169 181476 543169 543169 543169 543169 304704 304704 405769
## [955] 94249 405769 78400 543169 543169 543169 405769 231361 543169
## [964] 231361 231361 304704 94249 51529 304704 304704 543169 181476
## [973] 405769 304704 139129 51529 181476 25281 231361 543169 78400
## [982] 94249 304704 139129 405769 67081 78400 543169 231361 67081
## [991] 543169 78400 543169 405769 231361 231361 181476 543169 405769
## [1000] 231361 543169 543169 181476 231361 304704 231361 543169 543169
## [1009] 543169 405769 304704 543169 51529 231361 405769 405769 231361
## [1018] 405769 78400 304704 114244 543169 543169 543169 304704 543169
## [1027] 114244 304704 114244 181476 304704 139129 543169 78400 405769
## [1036] 543169 37636 543169 304704 181476 543169 304704 405769 543169
## [1045] 405769 231361 543169 405769 543169 304704 304704 543169 543169
## [1054] 405769 543169 543169 543169 405769 405769 405769 405769 139129
## [1063] 543169 543169 51529 181476 304704 405769 94249 231361 543169
## [1072] 405769 181476 181476 405769 543169 304704 231361 231361 405769
## [1081] 139129 543169 231361 304704 405769 543169 304704 405769 405769
## [1090] 543169 543169 543169 543169 543169 231361 405769 543169 543169
## [1099] 543169 543169 543169 181476 543169 543169 139129 543169 543169
## [1108] 543169 543169 543169 543169 543169 543169 543169 94249 543169
## [1117] 543169 304704 543169 139129 543169 543169 114244 181476 543169
## [1126] 543169 543169 543169 543169 543169 304704 304704 543169 543169
## [1135] 543169 304704 181476 543169 405769 139129 78400 231361 405769
## [1144] 139129 543169 543169 543169 231361 543169 543169 543169 543169
## [1153] 543169 543169 405769 543169 543169 543169 543169 78400 543169
## [1162] 543169 543169 543169 543169 543169 543169 543169 543169 304704
## [1171] 231361 405769 543169 543169 304704 543169 543169 543169 543169
## [1180] 139129 405769 543169 543169 543169 405769 405769 543169 543169
## [1189] 543169 543169 543169 405769 543169 543169 405769 543169 543169
## [1198] 543169 405769
quad_train = lm(score_train ~ patent_train + patent_train2)
summary(quad_train)
##
## Call:
## lm(formula = score_train ~ patent_train + patent_train2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.556 -2.347 -0.578 1.301 40.842
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.966e+01 4.973e-01 119.97 <2e-16 ***
## patent_train -6.250e-02 3.321e-03 -18.82 <2e-16 ***
## patent_train2 5.975e-05 4.130e-06 14.47 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.957 on 1196 degrees of freedom
## Multiple R-squared: 0.3463, Adjusted R-squared: 0.3452
## F-statistic: 316.8 on 2 and 1196 DF, p-value: < 2.2e-16
The better model is the quadriatic model because both the R squared and the adjusted R Squared are larger in the quadritic model than in the linear model.
The second half of predictive modeling is about testing the model using a different data set called the testing data. Again we must first read the testing data set, and make sure the dataset is read propertly.
testdata = read.csv("data/universityrank_testing.csv", header=TRUE)
head(testdata)
## world_rank institution country
## 1 1000 Yanbian University China
## 2 1 Harvard University USA
## 3 2 Stanford University USA
## 4 3 Massachusetts Institute of Technology USA
## 5 4 University of Cambridge United Kingdom
## 6 5 University of Oxford United Kingdom
## national_rank quality_of_education alumni_employment quality_of_faculty
## 1 84 355 478 210
## 2 1 1 1 1
## 3 2 9 2 4
## 4 3 3 11 2
## 5 1 2 10 5
## 6 2 7 13 10
## publications influence citations broad_impact patents score year
## 1 890 790 800 1000 737 44.18 2014
## 2 1 1 1 1 3 100.00 2015
## 3 5 3 3 4 10 98.66 2015
## 4 15 2 2 2 1 97.54 2015
## 5 11 6 12 13 48 96.81 2015
## 6 7 12 7 9 15 96.46 2015
We extract again the two columns of interest, in reference this time to the testing data set, and call them accordingly.
patent_test = testdata$patents
score_test = testdata$score
We are ready now to check if the derived models are actually good predictive models. First we calculate the predicted test data score using the linear models derived earlier.
# Calculate the predicted score for the linear model (built in Part A) run on the patent_test data
score_predict1 = coef(linear_train)[1] + coef(linear_train)[2]*patent_test
For a visual representation we can plot the actual testing data, and overlay the predicted values
# Plot the actual values for patent and score as observed in the testing data set
plot(patent_test, score_test, main='Test Data -- Score vs Patent')
# Overlay the predicted values as calculated from the linear model and derived using the training model. The red color is used to distinguish the predicted values which, because of the linear model, will fit exactly a line
par(new=TRUE, xaxt="n", yaxt="n", ann=FALSE)
plot(patent_test, score_predict1, col="red")
A better way to qualify the goodness of a predictive model is to scatter plot the actual values against the predicted values. In a perfect predictive model the points will line up along the diagonal line. This is rarely the case, if ever!
#Plot predicted values from the linear model versus actual values form the test data
plot(score_test, score_predict1, xlab='Actual', ylab='Predict', main='Linear Model -- Predict vs Actual Test')
From the plot we can easily see that most of the predicted values versus actual are far from the diagoonal line. In many cases this is fine. Finally, to quantify the goodness of a model, we need to calculate the Root Mean Square Error (RMSE) and we name the variable as rmse1.
#Calculate RMSE for Linear Model
error1 = sum((score_predict1 - score_test)^2)/length(score_test)
rmse1 = sqrt(error1)
rmse1
## [1] 5.956123
It is hard to judge the goodness of the number unless we compare to other possibilities. Of course a perfect scenario will have zero RMSE. We now need to repeat the above calculations for the non-linear quadratic model.
score_predict2 = coef(quad_train)[1] + coef(quad_train)[2]*patent_test + coef(quad_train)[3]*patent_test^2
For a visual representation, similar to the linear model, we need to do the following.
# TASK 6A: Plot the actual values for patent and score as observed in the testing data set. Write the code below:
plot(patent_test, score_test, main='Test Data -- Score vs Patent')
# Overlay the predicted values as calculated from the quadratic model and derived using the training model.The GREEN color is used to distinguish the predicted values which, because of the quadratic model, will in this case fit exactly a parabola.
# TASK 6B: Write the code below the following code.
par(new=TRUE, xaxt="n", yaxt="n", ann=FALSE)
plot(patent_test, score_predict2, col="green")
#Plot the predicted values form the quadratic model versus the actual values from the test data. Write the code below:
plot(score_test, score_predict2, xlab='Actual', ylab='Predict', main='Quadriactic Model -- Predict vs Actual Test')
Based on the graphs, the quadratic models seems to fit the scatter plot better than the linear model. The quadriatic model seems to pass by much more points in the scatter plot than the linear model.
A better way to quantify the goodness of the model is to calculate again the RMSE.
#Calculate RMSE for Quadratic Model. Write the code below and name the variable as rmse2:
error2 = sum((score_predict2 - score_test)^2)/length(score_test)
rmse2 = sqrt(error2)
rmse2
## [1] 5.684821
The closer the RSME is to zero the better. Therfore the best model is the quadritic model since the RSME is a little smaller than that of the linear model. This result agrees with my observations in task four. The two observations lead me to belive that the best model is the quadriatic model.
source [1]: http://www.kaggle.com source [2]: http://www.cwur.org