About

The best way to create a predictive model is to begin with a complete data set. Split the data into two separate datasets: the training set and the testing set, with the training set containing at least 50% of the data. We will use the training set to build the model that best predicts the data. Lastly, we will test the built model on the testing data set and assess how well it predicts the actual values.
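
As a rough illustration of such a split (not part of this lab; the data frame name mydata and the 70/30 proportion are just hypothetical placeholders), a random split could be done in R along these lines:

# Illustrative sketch only: split a hypothetical data frame 'mydata'
# into roughly 70% training rows and 30% testing rows
set.seed(42)                                   # make the split reproducible
train_rows = sample(nrow(mydata), size=floor(0.7*nrow(mydata)))
train_set = mydata[train_rows, ]               # rows used to build the model
test_set = mydata[-train_rows, ]               # held-out rows used for testing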

The concept may seem overwhelming, but we already have almost all of the tools to perform this analysis.

Setup

Remember to always set your working directory to the source file location. Go to ‘Session’, scroll down to ‘Set Working Directory’, and click ‘To Source File Location’. Read the instructions below carefully and follow them to complete the tasks and answer any questions. Submit your work to RPubs as detailed in previous notes.

Note

For your assignment you may be using different data sets from the ones included here. Always read the instructions on Sakai carefully. Tasks/questions to be completed/answered are highlighted in larger bolded fonts and numbered according to their placement in the task section.


Task 1: Training Set & Model Building

The first half of this lab focuses on building a model using the training data. The data we will be using was obtained from the Kaggle site [1] and refers to world university rankings [2]. The data looks at each university's world score based on different ranking criteria, such as rank for quality of education, rank for quality of faculty, and rank for patents. We begin by reading the training data set from the ‘universityrank_training.csv’ file and checking the header lines to make sure the data is read correctly.

traindata = read.csv(file="data/universityrank_training.csv", header=TRUE)
head(traindata)
##   world_rank                           institution        country
## 1          1                    Harvard University            USA
## 2          2 Massachusetts Institute of Technology            USA
## 3          3                   Stanford University            USA
## 4          4               University of Cambridge United Kingdom
## 5          5    California Institute of Technology            USA
## 6          6                  Princeton University            USA
##   national_rank quality_of_education alumni_employment quality_of_faculty
## 1             1                    7                 9                  1
## 2             2                    9                17                  3
## 3             3                   17                11                  5
## 4             1                   10                24                  4
## 5             4                    2                29                  7
## 6             5                    8                14                  2
##   publications influence citations broad_impact patents  score year
## 1            1         1         1           NA       5 100.00 2012
## 2           12         4         4           NA       1  91.67 2012
## 3            4         2         2           NA      15  89.50 2012
## 4           16        16        11           NA      50  86.17 2012
## 5           37        22        22           NA      18  85.21 2012
## 6           53        33        26           NA     101  82.50 2012

Here we observe that there are 1199 rows (observations) and 14 variables (columns) that describe the data.
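
If you want to confirm these dimensions yourself, a quick sanity check (not shown in the original output) is:

# Check the size of the training data: number of rows and columns
dim(traindata)
# str(traindata)   # optionally, inspect the type of each variable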

Next, we extract the two columns of interest and give them descriptive names so we can easily refer to them later in the code.

patent_train = traindata$patents
score_train = traindata$score

The first model we will build is a simple linear model. We will use the patents ranking variable to predict the university score. To better understand the data: the lower the patents ranking number, the better. A value of 1 is the top rank for patents and represents the highest category in terms of number of patents owned by the particular academic institution. On the other hand, the higher the calculated total score the better, as reflected by the world rank number. A value of 100 is a perfect score.

linear_train = lm(score_train ~ patent_train)
summary(linear_train)
## 
## Call:
## lm(formula = score_train ~ patent_train)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -9.876 -4.014 -1.121  1.515 45.470 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  54.6406641  0.3859848  141.56   <2e-16 ***
## patent_train -0.0157607  0.0008291  -19.01   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.539 on 1197 degrees of freedom
## Multiple R-squared:  0.2319, Adjusted R-squared:  0.2313 
## F-statistic: 361.4 on 1 and 1197 DF,  p-value: < 2.2e-16
plot(patent_train,score_train)
abline(linear_train, col="blue", lwd=2)

Using a linear model with the variables score_train and patent_train, the fitted equation comes out to be: score_train = 54.64 - 0.02(patent_train). The y-intercept is a score of 54.64 (the predicted score when the patent rank would be 0), and the slope is -0.02, meaning that every time the patent rank increases by one, the predicted score decreases by about 0.02 points. The R-squared is low, but more importantly, the Adj R-squared is also relatively low with a value of 0.2313, showing that this model might not be a good predictor of the data. It is important to note that in the above scenario, we are working with the training data.
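
To make the interpretation concrete, we can plug an illustrative patent rank (100 is used here purely as an example value) into the fitted equation using the stored coefficients:

# Predicted score for an example patent rank of 100, using the fitted coefficients
coef(linear_train)[1] + coef(linear_train)[2]*100
# roughly 54.64 - 0.0158 * 100, i.e. a predicted score of about 53.1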

1A) Consider now building a non-linear quadratic model. Follow the steps in lab07 to derive a quadratic model similar to what we did for costs versus servers.
# First define a new variable which is the squared value of patent_train (defined above)

patent_train2 = patent_train^2

# Next derive the quadratic regression model.  You may want to call it quad_train.

quad_train = lm(score_train ~ patent_train + patent_train2)

# Publish the summary statistics

summary(quad_train)
## 
## Call:
## lm(formula = score_train ~ patent_train + patent_train2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -13.556  -2.347  -0.578   1.301  40.842 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    5.966e+01  4.973e-01  119.97   <2e-16 ***
## patent_train  -6.250e-02  3.321e-03  -18.82   <2e-16 ***
## patent_train2  5.975e-05  4.130e-06   14.47   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.957 on 1196 degrees of freedom
## Multiple R-squared:  0.3463, Adjusted R-squared:  0.3452 
## F-statistic: 316.8 on 2 and 1196 DF,  p-value: < 2.2e-16

Using a quadratic model (still on the training data) gives a higher Adj R-squared, and the equation we get is: score_train = 5.966e+01 - 6.250e-02(patent_train) + 5.975e-05(patent_train2).

1B) Based on the values of R-Squared and Adj R-Squared for both the linear and the quadratic models, select the best predictive model. Explain your argument.
                  R-squared   Adj. R-squared
Linear Model         0.2319           0.2313
Quadratic Model      0.3463           0.3452

Based on the R-squared and the Adj R-squared (the latter being the better indicator), the quadratic model is a better model to portray the data set because of its higher values.

Task 2: Testing Set & Model Evaluation

The second half of predictive modeling is about testing the model using a different data set called the testing data. Again we must first read the testing data set and make sure it is read properly.

testdata = read.csv("data/universityrank_testing.csv", header=TRUE)
head(testdata)
##   world_rank                           institution        country
## 1       1000                    Yanbian University          China
## 2          1                    Harvard University            USA
## 3          2                   Stanford University            USA
## 4          3 Massachusetts Institute of Technology            USA
## 5          4               University of Cambridge United Kingdom
## 6          5                  University of Oxford United Kingdom
##   national_rank quality_of_education alumni_employment quality_of_faculty
## 1            84                  355               478                210
## 2             1                    1                 1                  1
## 3             2                    9                 2                  4
## 4             3                    3                11                  2
## 5             1                    2                10                  5
## 6             2                    7                13                 10
##   publications influence citations broad_impact patents  score year
## 1          890       790       800         1000     737  44.18 2014
## 2            1         1         1            1       3 100.00 2015
## 3            5         3         3            4      10  98.66 2015
## 4           15         2         2            2       1  97.54 2015
## 5           11         6        12           13      48  96.81 2015
## 6            7        12         7            9      15  96.46 2015

We again extract the two columns of interest, this time from the testing data set, and name them accordingly.

patent_test = testdata$patents
score_test = testdata$score

We are ready now to check if the derived models are actually good predictive models. First we calculate the predicted test data score using the linear and the quadratic models derived earlier. Here we are using the testing data on the model we built with the training data set earlier.

# Calculate the predicted test data score 
score_predict1 = coef(linear_train)[1] + coef(linear_train)[2]*patent_test
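
As a side note, the same predictions could also be obtained with R's built-in predict() function (an equivalent alternative to writing out the coefficients), as long as the new data frame uses the same variable name the model was fit with:

# Equivalent alternative using predict(); the column name must match the
# predictor name used when fitting the model (patent_train)
score_predict1_alt = predict(linear_train, newdata=data.frame(patent_train=patent_test))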

For a visual representation, we can plot the actual testing data and overlay the predicted values.

# Plot the actual values for patent and score as observed in the testing data set
plot(patent_test, score_test, main='Test Data -- Score vs Patent')
# Overlay the predicted values as calculated from the linear model and derived using the training model
par(new=TRUE, xaxt="n", yaxt="n", ann=FALSE)
# The red color is used to distinguish the predicted values which, because of the linear model, will fit exactly a line
plot(patent_test, score_predict1, col="red")

The above graph shows the actual testing data set values along with the predictions obtained by applying the testing data to the model we built in Task 1, or in other words, the training model. From this plot, one notices that there is a gap between the line of predicted values and the actual data set.

A better way to assess the goodness of a predictive model is to scatter plot the actual values against the predicted values. In a perfect predictive model the points would line up along the diagonal line. This is rarely the case, if ever!

#Plot predicted values from the linear model versus actual values from the test data
plot(score_test, score_predict1, xlab='Actual', ylab='Predict', main='Linear Model -- Predict vs Actual Test')
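
To make the comparison to the diagonal easier to see, you can optionally overlay the perfect-prediction line (predicted equal to actual); this overlay is an extra added here, not part of the original lab output:

# Optional: overlay the diagonal where predicted == actual
abline(a=0, b=1, col="blue", lwd=2)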

This graph may look strange at first, but in most real-life cases it is not that unusual. From the plot we can easily see that most of the predicted values are far from the diagonal line when plotted against the actual values. In many cases this is fine. Finally, to quantify the goodness of a model, we need to calculate the Root Mean Square Error (RMSE).

#Calculate RMSE for Linear Model
error1 = sum((score_predict1 - score_test)^2)/length(score_test)
rmse1 = sqrt(error1)
rmse1
## [1] 5.956123

The root mean square error calculation is one way to assess how good your model is (using a different data set). The smaller the value, the better the predictor; however, it is hard to judge the goodness of the number unless we compare it to other possibilities. Of course, a perfect scenario would have an RMSE of zero. We now need to repeat the above calculations for the non-linear quadratic model.
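
Since the RMSE will be computed more than once, a small helper function could reduce repetition; this is just a convenience sketch, not something the lab requires:

# Helper: root mean square error between predicted and actual values
rmse = function(predicted, actual) {
  sqrt(mean((predicted - actual)^2))
}
# e.g. rmse(score_predict1, score_test) reproduces the rmse1 value computed above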

2A) Start by calculating the predicted values and overlay the calculated predicted values over the actual values.
# Calculate score_predict2 based on the quadratic model and the patent test data.  You need to refer again to the coefficients of the quadratic model derived earlier and the actual patent values obtained from the testing data
patent_test2 = patent_test^2

score_predict2 = coef(quad_train)[1] + coef(quad_train)[2]*patent_test + coef(quad_train)[3]*patent_test2  
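
As with the linear case, predict() should give the same result, provided the new data frame supplies both predictors under the names used when the model was fit:

# Equivalent alternative with predict(); both columns must be named as in quad_train
score_predict2_alt = predict(quad_train,
                             newdata=data.frame(patent_train=patent_test,
                                                patent_train2=patent_test^2))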

For a visual representation, similar to the linear model, we need to do the following.

2B) Plot the score versus patent actual test data and overlay the predicted values as calculated in 2A.
# Plot the actual values for patent and score as observed in the testing data set
plot(patent_test, score_test, main='Test Data -- Score vs Patent')
# Overlay the predicted values as calculated from the linear model and derived using the training model
par(new=TRUE, xaxt="n", yaxt="n", ann=FALSE)
# The green color is used to distinguish the predicted values which, because of the quadratic model, will in this case fit exactly a parabola
plot(patent_test, score_predict2, col="green")

The plot above agrees with the decision we made earlier that the quadratic model would be a better fit for the data. The green parabola represents the predicted values, plotted against the points in the testing data set. Though the points don’t line up exactly, they are better represented here than in the linear model.

2C) Plot the predicted values against the true values. Label properly.

#Plot the predicted values from the quadratic model versus the actual values from the test data
plot(score_test, score_predict2, xlab='Actual', ylab='Predict', main='Quadratic Model -- Predict vs Actual Test')

2D) By looking at the scatterplots for the linear and quadratic models are you able to tell which model is better? Elaborate.

By just looking at the scatterplots and “predictor” lines, one can tell that the quadratic model is the better model to use. Though the points do not (and usually never do) line up perfectly on the predicted line or parabola, they are closer to, and better represented by, the parabola in the quadratic model.

A better way to quantify the goodness of the model is to calculate again the RMSE.

2E) Calculate the root mean square error (RMSE) for the quadratic model.

#Calculate RMSE  for Quadratic Model
error2 = sum((score_predict2 - score_test)^2)/length(score_test)
rmse2 = sqrt(error2)
rmse2
## [1] 5.684821
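
For a quick side-by-side view (an optional extra, not required by the lab), the two error values can be printed together:

# Compare the two models' RMSE values side by side
c(linear = rmse1, quadratic = rmse2)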

2F) Based on the root mean square error (RMSE) which model is better? How do results reconcile with the results from Task 1?

Because the root mean square error is smaller for the quadratic model, our results line up with the results from Task 1. The smaller the value of the root mean square error, the better the model; this is similar to the idea that the higher the adjusted R-squared, the better the model.

source [1]: http://www.kaggle.com
source [2]: http://www.cwur.org