About

The best way to create a predictive model is to begin with a complete data set. Split the data into two seperate datasets: the training set and the testing set, with the training set greater than or equal to 50% of the data. We will use the training set to build a model best predictor of the data. Lastly, we will test the built model on the testing data set, and assess how good the model is in predicting the actual values.

The concept may seem overwhelming, but we already have almost all of the tools to perform this analysis.

Setup

Remember to always set your working directory to the source file location. Go to ‘Session’, scroll down to ‘Set Working Directory’, and click ‘To Source File Location’. Read carefully the below and follow the instructions to complete the tasks and answer any questions. Submit your work to RPubs as detailed in previous notes.

Note

For your assignment you may be using different data sets than what is included here. Always read carefully the instructions on Sakai. Tasks/questions to be completed/answered are highlighted in larger bolded fonts and numbered according to their particular placement in the task section.


PART A: TRAINING SET & MODEL BUILDING

The first half of this lab focus on building a model using the training data. The data we will be using was obtained from Kaggle site[1] and is in reference to world university rankings [2] . The data looks at university world scoring based on different ranking criteria such as rank for quality of education, rank for quality of faculty, and rank for patents. We begin by reading the training data set ‘universityrank_training.csv’ file, and checking the header lines to make sure the data is read correctly.

traindata = read.csv(file="data/universityrank_training.csv", header=TRUE)
head(traindata)
##   world_rank                           institution        country
## 1          1                    Harvard University            USA
## 2          2 Massachusetts Institute of Technology            USA
## 3          3                   Stanford University            USA
## 4          4               University of Cambridge United Kingdom
## 5          5    California Institute of Technology            USA
## 6          6                  Princeton University            USA
##   national_rank quality_of_education alumni_employment quality_of_faculty
## 1             1                    7                 9                  1
## 2             2                    9                17                  3
## 3             3                   17                11                  5
## 4             1                   10                24                  4
## 5             4                    2                29                  7
## 6             5                    8                14                  2
##   publications influence citations broad_impact patents  score year
## 1            1         1         1           NA       5 100.00 2012
## 2           12         4         4           NA       1  91.67 2012
## 3            4         2         2           NA      15  89.50 2012
## 4           16        16        11           NA      50  86.17 2012
## 5           37        22        22           NA      18  85.21 2012
## 6           53        33        26           NA     101  82.50 2012

Next, we extract the two columns of interest and call them properly so we can easily refer to them later in the code.

patent_train = traindata$patents
score_train = traindata$score

The first model we will build is a simple linear model and we also plot the patents versus the score data. We will use the patents ranking variable to predict the university score. To better understand the data, the lower the patents ranking number the better it is. A value of 1 is a top rank for patents and represent the highest category in terms of number of patents owned by the particular academic institution. On the other hand the higher the calculated total score the better, as reflected by the world rank number. A value of 100 is a perfect score.

linear_train = lm(score_train ~ patent_train)
summary(linear_train)
## 
## Call:
## lm(formula = score_train ~ patent_train)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -9.876 -4.014 -1.121  1.515 45.470 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  54.6406641  0.3859848  141.56   <2e-16 ***
## patent_train -0.0157607  0.0008291  -19.01   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.539 on 1197 degrees of freedom
## Multiple R-squared:  0.2319, Adjusted R-squared:  0.2313 
## F-statistic: 361.4 on 1 and 1197 DF,  p-value: < 2.2e-16
plot(patent_train,score_train)
abline(linear_train, col="blue", lwd=2)

##### Consider now building a non-linear quadratic model. Follow the steps in lab07 to derive a quadratic model similar to what we did for costs versus servers. #### TASK 1: First, define a new variable patent_train2 which is the squared value of patent_train (defined above)

patent_train2 = patent_train^2

patent_train2
##    [1]     25      1    225   2500    324  10201    676   4356     25
##   [10]    256  10201    100     81      9     49   1156    529    841
##   [19]    169    484  10201    784   3844   1089    441   1936    400
##   [28]   1681  10201   3249   7396    121   1225     64  10201    841
##   [37]    196    144  10201  10201  10201    361  10201    529   1024
##   [46]  10201   4225  10201   7056    729   2601   2916   3844  10201
##   [55]      4   1225  10201   8464  10201  10201  10201  10201  10201
##   [64]   9801  10201  10201  10201  10201   5476  10201   3721   7225
##   [73]  10201   2116     16  10201  10201  10201    841   2401   2116
##   [82]  10201  10201   7396  10201  10201   1600  10201  10000  10201
##   [91]  10201   2116  10201   4761  10201  10000   8464  10201  10201
##  [100]  10201     49    121    225      1   1521    100    361  10201
##  [109]  10201   1764    289    196     49     25    169    144      4
##  [118]   3600   2601    484    576   3969    900   4489    576    400
##  [127]   2304    900  10201   5329   2601     36   2304   1764    289
##  [136]  10201  10201   1296   1600     16  10201    729   9216   4489
##  [145]  10201  10201  10201   6561  10201   4489  10201   3600   8100
##  [154]  10000     81  10201   1024  10000  10201  10201   1600  10201
##  [163]  10201  10201  10201   5329  10201  10201   1936   5329   3969
##  [172]  10201    400  10201  10201   2601  10201  10201  10201  10201
##  [181]  10201   6084   2916  10201  10201  10201    729  10201   8100
##  [190]  10201   3364   2304   6561   6561    676  10201   6561  10201
##  [199]  10201   1024      4     36      1   2304    256     16    784
##  [208]  22201  41616   2025    144     81    196    784    100    324
##  [217]   2304   7056      9    169    361   1600   3721     25    784
##  [226]   1764   3249   4096   2025  14641  22201   2916   2916  25281
##  [235] 543169  32761  51529   2601   1156  41616   8836  41616   1156
##  [244]   5476   1444  10000   1444 543169   4624 114244   1089    400
##  [253]   4096  28900     49  94249   4624  32761     64   4624   6084
##  [262]    484  16384  94249 405769  32761  51529  10000   8836  94249
##  [271]   2601    441 543169    529  37636 114244   5776   2025 231361
##  [280]  19044  28900  32761    576   7056   8836    784    256 114244
##  [289]  25281  32761 139129    729   1681   4096  10000   1936  37636
##  [298]  51529  10000 405769  51529  10000   7056   6084  78400   1156
##  [307]    225   3721  10000 231361 139129  16384  41616  67081   6084
##  [316] 543169  51529  25281  37636  67081  37636 114244 114244 543169
##  [325]  13225  10000 114244    121 304704   1156 543169   4624   6084
##  [334] 181476   4096  14641  78400   6084  13225 181476  25281  41616
##  [343]  94249 543169  16384  67081  25281    676  19044 139129  78400
##  [352] 181476  94249 181476 231361   3249 139129 231361  51529 181476
##  [361]  78400   7921  51529  37636  25281 543169 543169 543169 139129
##  [370]  25281 543169  19044   2601    784  51529  41616  41616  51529
##  [379] 405769  94249 231361 114244  78400 543169  22201    625 405769
##  [388]   3721  10000 231361  32761  32761  28900 405769  19044  22201
##  [397]   2304  41616 181476  51529 181476  41616  37636   2916 543169
##  [406]   8836  16384 139129  28900 114244 139129 231361  94249  51529
##  [415]  11881 231361   3249  28900 304704  51529  25281  67081 231361
##  [424]  51529 139129  11881 231361  32761 231361  41616 304704  78400
##  [433]  41616  37636  94249  16384  11881  19044 139129  22201  78400
##  [442]   5476  19044 181476 181476  94249   5776 304704 405769  94249
##  [451] 405769  51529  67081  41616 114244  51529 543169 231361 114244
##  [460] 139129   7921  67081 139129 139129 114244 405769 304704  41616
##  [469]  94249 231361  51529 543169  19044 139129  51529 231361  41616
##  [478] 181476  11881 405769   8836 543169 543169 114244 543169 139129
##  [487] 231361 304704 543169 181476  16384  19044 543169 139129  78400
##  [496]  28900  13225 543169  94249  78400 304704 543169 181476 405769
##  [505] 543169 543169 543169 181476 181476 181476  67081 304704 405769
##  [514]   4624   1764  51529 304704  13225  67081 139129 543169  19044
##  [523]  78400  67081 543169 139129  51529 181476  32761 139129 543169
##  [532] 114244 543169   3249 181476 231361 231361 543169 114244 543169
##  [541] 139129 543169  67081 405769 304704 405769 181476 231361  78400
##  [550] 405769  16384  78400  94249  94249 405769  10000 231361 181476
##  [559] 405769  41616 405769  19044  51529 405769  94249 139129 181476
##  [568] 543169 114244  41616 405769 114244 543169   7921  67081  94249
##  [577] 543169 543169  51529   7056 231361  94249 543169   7921 543169
##  [586]  41616 304704 114244  78400  37636 231361  19044 139129 181476
##  [595] 543169 543169 543169 543169  32761  13225  22201  32761 181476
##  [604] 405769 405769   6084  28900  32761 231361  11881 543169  14641
##  [613]  16384 543169  22201  11881  94249 304704 405769  13225 304704
##  [622] 304704 543169  41616 139129 231361  25281   4624   7056 405769
##  [631]  78400 181476 543169 405769 139129 405769  51529 231361 405769
##  [640]  16384 304704  78400 543169  22201 114244 543169   8836 114244
##  [649] 405769 181476 304704 405769  37636 139129 405769 543169 231361
##  [658]  28900 543169 405769 114244 543169  51529  78400 139129 405769
##  [667] 543169 231361 139129 543169  41616 181476 405769  41616 114244
##  [676]  94249 139129 405769 304704 304704 114244  14641 139129 543169
##  [685] 181476 304704 543169 304704 543169 405769 543169 181476 405769
##  [694] 405769  94249  28900  67081  51529 114244  67081  78400 405769
##  [703]  67081 231361 139129 114244 405769 543169 543169 543169 543169
##  [712] 181476   7921 304704 139129 543169 405769 231361  14641 114244
##  [721] 231361 543169 405769  94249 543169 543169 405769  41616  16384
##  [730] 405769 181476 139129 304704 543169  14641 304704 543169 304704
##  [739]  22201 405769 139129  67081 543169 543169 231361 405769 181476
##  [748] 139129 304704 543169 543169 139129  28900 304704  94249  67081
##  [757] 304704 543169 304704  28900 543169 181476 543169 181476 405769
##  [766] 231361 543169 543169  51529 139129 231361 304704  22201 181476
##  [775] 543169  78400 181476 543169  32761 181476 114244 543169 304704
##  [784] 139129 304704 231361 405769 543169 231361 304704 304704 543169
##  [793] 181476  41616 543169 231361 543169 304704 139129  94249  14641
##  [802] 543169  51529 304704 304704 543169 405769 543169 231361 543169
##  [811] 543169 181476 405769 543169 231361 543169 543169 543169  94249
##  [820] 543169 543169 231361 543169 543169 139129 543169 543169 405769
##  [829]  51529 405769 304704 304704 181476 231361 543169 543169 543169
##  [838] 139129 543169 405769 304704 405769 231361  41616 405769 304704
##  [847] 304704 543169 231361 304704 543169 114244 304704 181476 543169
##  [856] 543169 543169 139129 543169  51529 181476 405769 543169 543169
##  [865] 139129  78400 543169  25281 543169 231361 543169 231361 304704
##  [874] 543169 181476 543169 543169 405769 114244 543169 405769 231361
##  [883] 304704 304704  67081 543169 231361 543169 543169 543169 139129
##  [892] 231361 543169 304704 405769 543169 304704 304704 304704 304704
##  [901] 543169 139129 304704 543169 405769 543169 114244 405769 543169
##  [910] 405769 304704 304704 543169 304704  67081 114244 543169 231361
##  [919] 543169 304704 543169 405769 181476 543169 304704 543169 405769
##  [928] 543169 231361  94249 231361 543169 543169 114244 543169 543169
##  [937] 543169  67081 543169 304704 543169 543169  94249 231361 543169
##  [946] 543169 181476 543169 543169 543169 543169 304704 304704 405769
##  [955]  94249 405769  78400 543169 543169 543169 405769 231361 543169
##  [964] 231361 231361 304704  94249  51529 304704 304704 543169 181476
##  [973] 405769 304704 139129  51529 181476  25281 231361 543169  78400
##  [982]  94249 304704 139129 405769  67081  78400 543169 231361  67081
##  [991] 543169  78400 543169 405769 231361 231361 181476 543169 405769
## [1000] 231361 543169 543169 181476 231361 304704 231361 543169 543169
## [1009] 543169 405769 304704 543169  51529 231361 405769 405769 231361
## [1018] 405769  78400 304704 114244 543169 543169 543169 304704 543169
## [1027] 114244 304704 114244 181476 304704 139129 543169  78400 405769
## [1036] 543169  37636 543169 304704 181476 543169 304704 405769 543169
## [1045] 405769 231361 543169 405769 543169 304704 304704 543169 543169
## [1054] 405769 543169 543169 543169 405769 405769 405769 405769 139129
## [1063] 543169 543169  51529 181476 304704 405769  94249 231361 543169
## [1072] 405769 181476 181476 405769 543169 304704 231361 231361 405769
## [1081] 139129 543169 231361 304704 405769 543169 304704 405769 405769
## [1090] 543169 543169 543169 543169 543169 231361 405769 543169 543169
## [1099] 543169 543169 543169 181476 543169 543169 139129 543169 543169
## [1108] 543169 543169 543169 543169 543169 543169 543169  94249 543169
## [1117] 543169 304704 543169 139129 543169 543169 114244 181476 543169
## [1126] 543169 543169 543169 543169 543169 304704 304704 543169 543169
## [1135] 543169 304704 181476 543169 405769 139129  78400 231361 405769
## [1144] 139129 543169 543169 543169 231361 543169 543169 543169 543169
## [1153] 543169 543169 405769 543169 543169 543169 543169  78400 543169
## [1162] 543169 543169 543169 543169 543169 543169 543169 543169 304704
## [1171] 231361 405769 543169 543169 304704 543169 543169 543169 543169
## [1180] 139129 405769 543169 543169 543169 405769 405769 543169 543169
## [1189] 543169 543169 543169 405769 543169 543169 405769 543169 543169
## [1198] 543169 405769

TASK 2: Second, derive the quadratic regression model to predict the score_train from patent_train. You may want to call it quad_train.

quad_train = lm(score_train ~ patent_train + patent_train2)

TASK 3: Third, publish the summary statistics.

summary(quad_train)
## 
## Call:
## lm(formula = score_train ~ patent_train + patent_train2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -13.556  -2.347  -0.578   1.301  40.842 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    5.966e+01  4.973e-01  119.97   <2e-16 ***
## patent_train  -6.250e-02  3.321e-03  -18.82   <2e-16 ***
## patent_train2  5.975e-05  4.130e-06   14.47   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.957 on 1196 degrees of freedom
## Multiple R-squared:  0.3463, Adjusted R-squared:  0.3452 
## F-statistic: 316.8 on 2 and 1196 DF,  p-value: < 2.2e-16

TASK 4: Based on the values of R-Squared and Adj R-Squared for both the linear and the quadratic select the best predictive model. Explain your argument.

ANSWER TASK 4:

The better model is the quadriatic model because both the R squared and the adjusted R Squared are larger in the quadritic model than in the linear model.


PART B: TESTING SET & MODEL EVALUATION

The second half of predictive modeling is about testing the model using a different data set called the testing data. Again we must first read the testing data set, and make sure the dataset is read propertly.

testdata = read.csv("data/universityrank_testing.csv", header=TRUE)
head(testdata)
##   world_rank                           institution        country
## 1       1000                    Yanbian University          China
## 2          1                    Harvard University            USA
## 3          2                   Stanford University            USA
## 4          3 Massachusetts Institute of Technology            USA
## 5          4               University of Cambridge United Kingdom
## 6          5                  University of Oxford United Kingdom
##   national_rank quality_of_education alumni_employment quality_of_faculty
## 1            84                  355               478                210
## 2             1                    1                 1                  1
## 3             2                    9                 2                  4
## 4             3                    3                11                  2
## 5             1                    2                10                  5
## 6             2                    7                13                 10
##   publications influence citations broad_impact patents  score year
## 1          890       790       800         1000     737  44.18 2014
## 2            1         1         1            1       3 100.00 2015
## 3            5         3         3            4      10  98.66 2015
## 4           15         2         2            2       1  97.54 2015
## 5           11         6        12           13      48  96.81 2015
## 6            7        12         7            9      15  96.46 2015

We extract again the two columns of interest, in reference this time to the testing data set, and call them accordingly.

patent_test = testdata$patents
score_test = testdata$score

We are ready now to check if the derived models are actually good predictive models. First we calculate the predicted test data score using the linear models derived earlier.

# Calculate the predicted score for the linear model (built in Part A) run on the patent_test data
score_predict1 = coef(linear_train)[1] + coef(linear_train)[2]*patent_test

For a visual representation we can plot the actual testing data, and overlay the predicted values

# Plot the actual values for patent and score as observed in the testing data set
plot(patent_test, score_test, main='Test Data -- Score vs Patent')
# Overlay the predicted values as calculated from the linear model and derived using the training model. The red color is used to distinguish the predicted values which, because of the linear model, will fit exactly a line
par(new=TRUE, xaxt="n", yaxt="n", ann=FALSE)
plot(patent_test, score_predict1, col="red")

A better way to qualify the goodness of a predictive model is to scatter plot the actual values against the predicted values. In a perfect predictive model the points will line up along the diagonal line. This is rarely the case, if ever!

#Plot predicted values from the linear model versus actual values form the test data
plot(score_test, score_predict1, xlab='Actual', ylab='Predict', main='Linear Model -- Predict vs Actual Test')

From the plot we can easily see that most of the predicted values versus actual are far from the diagoonal line. In many cases this is fine. Finally, to quantify the goodness of a model, we need to calculate the Root Mean Square Error (RMSE) and we name the variable as rmse1.

#Calculate RMSE for Linear Model
error1 = sum((score_predict1 - score_test)^2)/length(score_test)
rmse1 = sqrt(error1)
rmse1
## [1] 5.956123

It is hard to judge the goodness of the number unless we compare to other possibilities. Of course a perfect scenario will have zero RMSE. We now need to repeat the above calculations for the non-linear quadratic model.

Start by calculating the predicted values and overlay the calculated predicted values over the actual values.

TASK 5: Calculate score_predict2 based on the quadratic model and the patent test data. You need to refer again to the coefficients of the quadratic model derived earlier and the actual patent values obtained from the testing data.

score_predict2 = coef(quad_train)[1] + coef(quad_train)[2]*patent_test + coef(quad_train)[3]*patent_test^2 

For a visual representation, similar to the linear model, we need to do the following.

TASK 6: Plot the score versus patent actual test data and overlay the predictied values as calculated in TASK 5.

# TASK 6A: Plot the actual values for patent and score as observed in the testing data set. Write the code below:
plot(patent_test, score_test, main='Test Data -- Score vs Patent')

# Overlay the predicted values as calculated from the quadratic model and derived using the training model.The GREEN color is used to distinguish the predicted values which, because of the quadratic model, will in this case fit exactly a parabola.
# TASK 6B: Write the code below the following code.
par(new=TRUE, xaxt="n", yaxt="n", ann=FALSE)
plot(patent_test, score_predict2, col="green")  

TASK 7: Plot the predicted values against the true values. Label properly.

#Plot the predicted values form the quadratic model versus the actual values from the test data. Write the code below:
plot(score_test, score_predict2, xlab='Actual', ylab='Predict', main='Quadriactic Model -- Predict vs Actual Test')

TASK 8: By looking at the scatterplots for the linear and quadratic models are you able to tell which model is better? Elaborate.

ANSWER TASK 8:

Based on the graphs, the quadratic models seems to fit the scatter plot better than the linear model. The quadriatic model seems to pass by much more points in the scatter plot than the linear model.

A better way to quantify the goodness of the model is to calculate again the RMSE.

TASK 9: Calculate the root mean square error (RMSE) for the quadratic model.

#Calculate RMSE  for Quadratic Model. Write the code below and name the variable as rmse2:
error2 = sum((score_predict2 - score_test)^2)/length(score_test)
rmse2 = sqrt(error2)
rmse2
## [1] 5.684821

TASK 10: Based on the root mean square error (RMSE) which model is better? How do results reconcile with the results from Task 4?

ANSWER TASK 10:

The closer the RSME is to zero the better. Therfore the best model is the quadritic model since the RSME is a little smaller than that of the linear model. This result agrees with my observations in task four. The two observations lead me to belive that the best model is the quadriatic model.

source [1]: http://www.kaggle.com source [2]: http://www.cwur.org