The best way to create a predictive model is to begin with a complete data set. Split the data into two separate subsets: a training set and a testing set, with the training set containing at least 50% of the data. We use the training set to build the model that best predicts the outcome of interest. Lastly, we apply the fitted model to the testing data set and assess how well it predicts the actual values.
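For illustration only (this lab provides pre-split CSV files, so you will not need this step here), a random split can be done with the sample() function; the data frame name mydata and the 60/40 ratio below are hypothetical choices for the sketch:
# Illustrative train/test split (hypothetical data frame mydata; 60/40 ratio chosen only for the example)
set.seed(123)                                    # make the random split reproducible
n = nrow(mydata)                                 # total number of rows
train_rows = sample(1:n, size = round(0.6*n))    # randomly choose 60% of the row indices
train_set = mydata[train_rows, ]                 # training set
test_set = mydata[-train_rows, ]                 # testing set (remaining 40%)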
The concept may seem overwhelming, but we already have almost all of the tools to perform this analysis.
Remember to always set your working directory to the source file location: go to ‘Session’, scroll down to ‘Set Working Directory’, and click ‘To Source File Location’. Read the material below carefully, follow the instructions to complete the tasks, and answer any questions. Submit your work to RPubs as detailed in previous notes.
For your assignment you may be using different data sets from the ones included here. Always read the instructions on Sakai carefully. Tasks/questions to be completed/answered are highlighted in larger bolded fonts and numbered according to their placement in the task section.
The first half of this lab focuses on building a model using the training data. The data we will be using were obtained from the Kaggle site [1] and refer to world university rankings [2]. The data set looks at university world scores based on different ranking criteria, such as rank for quality of education, rank for quality of faculty, and rank for patents. We begin by reading the training data set from the ‘universityrank_training.csv’ file and checking the header lines to make sure the data are read correctly.
traindata = read.csv(file="data/universityrank_training.csv", header=TRUE)
head(traindata)
## world_rank institution country
## 1 1 Harvard University USA
## 2 2 Massachusetts Institute of Technology USA
## 3 3 Stanford University USA
## 4 4 University of Cambridge United Kingdom
## 5 5 California Institute of Technology USA
## 6 6 Princeton University USA
## national_rank quality_of_education alumni_employment quality_of_faculty
## 1 1 7 9 1
## 2 2 9 17 3
## 3 3 17 11 5
## 4 1 10 24 4
## 5 4 2 29 7
## 6 5 8 14 2
## publications influence citations broad_impact patents score year
## 1 1 1 1 NA 5 100.00 2012
## 2 12 4 4 NA 1 91.67 2012
## 3 4 2 2 NA 15 89.50 2012
## 4 16 16 11 NA 50 86.17 2012
## 5 37 22 22 NA 18 85.21 2012
## 6 53 33 26 NA 101 82.50 2012
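As an optional extra check beyond the header lines, dim() and str() report the size of the data frame and the type of each column:
# Optional checks on the training data (not required for the tasks)
dim(traindata)   # number of rows and columns
str(traindata)   # column names and types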
Next, we extract the two columns of interest and give them descriptive names so we can easily refer to them later in the code.
patent_train = traindata$patents
score_train = traindata$score
plot(patent_train,score_train)
The first model we will build is a simple linear model. We will use the patents ranking variable to predict the university score. To better understand the data: the lower the patents ranking number, the better. A value of 1 is the top rank for patents and represents the highest category in terms of the number of patents owned by a particular academic institution. On the other hand, the higher the calculated total score, the better, as reflected by the world rank number; a value of 100 is a perfect score.
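If you want to confirm these scales yourself (optional), summary() on the two extracted columns shows their observed ranges in the training data:
# Optional: inspect the observed ranges of the two variables
summary(patent_train)
summary(score_train)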
linear_train = lm(score_train ~ patent_train)
summary(linear_train)
##
## Call:
## lm(formula = score_train ~ patent_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.876 -4.014 -1.121 1.515 45.470
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 54.6406641 0.3859848 141.56 <2e-16 ***
## patent_train -0.0157607 0.0008291 -19.01 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.539 on 1197 degrees of freedom
## Multiple R-squared: 0.2319, Adjusted R-squared: 0.2313
## F-statistic: 361.4 on 1 and 1197 DF, p-value: < 2.2e-16
plot(patent_train,score_train)
abline(linear_train, col="blue", lwd=2)
Even though the model is fit on a training data set, it does not predict the scores in the training data with 100% accuracy (i.e. it has errors in predicting scores even on the training data). Remember the analogy we used in class: the students might stop training and settle on a method of learning that gives them answers that are close to accurate. Maybe they think this is enough, or maybe that is the best they can do.
#These are the predicted scores for the training data set
score_predict_train = coef(linear_train)[1] + coef(linear_train)[2]*patent_train
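As a side note, the same predicted training scores can be obtained without typing the coefficients by hand; the built-in fitted() and predict() functions should return the same values as the formula above (up to any rows dropped for missing values):
# Equivalent ways to obtain the predicted training scores (should match score_predict_train)
head(fitted(linear_train))
head(predict(linear_train))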
#Let us compare them with the actual scores in the training data set
plot(score_train, score_predict_train, xlab='Actual', ylab='Predict', main='Linear Model -- Predict vs Actual Training')
That’s the best the linear regression model can do. There is no linear model that the function lm() can find that provides better accuracy, so it selected the model and coefficients shown in summary(linear_train), and the model is now ready to be tested on a testing data set. This is similar to our analogy in class: the students did their best learning from the answer key, but they developed a method that is not very accurate. They stopped at this point, and they are now ready to be tested.
However, before we test the linear model, we can choose to build another model and compare adjusted R-squared values. Then we can later test both of them.
# First define a new variable which is the squared value of patent_train (defined above)
patent_train2 = patent_train^2
# Next derive the quadratic regression model. You may want to call it quad_train.
quad_train = lm(score_train ~ patent_train + patent_train2)
# Publish the summary statistics
summary(quad_train)
##
## Call:
## lm(formula = score_train ~ patent_train + patent_train2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.556 -2.347 -0.578 1.301 40.842
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.966e+01 4.973e-01 119.97 <2e-16 ***
## patent_train -6.250e-02 3.321e-03 -18.82 <2e-16 ***
## patent_train2 5.975e-05 4.130e-06 14.47 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.957 on 1196 degrees of freedom
## Multiple R-squared: 0.3463, Adjusted R-squared: 0.3452
## F-statistic: 316.8 on 2 and 1196 DF, p-value: < 2.2e-16
The values for the linear model are 0.2319 for R-squared and 0.2313 for adjusted R-squared. The quadratic model values are 0.3463 for R-squared and 0.3452 for adjusted R-squared. Based on these numbers, I think the quadratic model is the better predictive model: it has a higher adjusted R-squared value than the linear model, and a higher R-squared value as well.
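Rather than reading these numbers off the printed summaries, you can also extract them programmatically; r.squared and adj.r.squared are standard components of the object returned by summary() for an lm model:
# Extract R-squared and adjusted R-squared values directly from the model summaries
summary(linear_train)$r.squared
summary(linear_train)$adj.r.squared
summary(quad_train)$r.squared
summary(quad_train)$adj.r.squared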
The second half of predictive modeling is about testing the model using a different data set called the testing data. Again we must first read the testing data set and make sure it is read properly.
testdata = read.csv("data/universityrank_testing.csv", header=TRUE)
head(testdata)
## world_rank institution country
## 1 1000 Yanbian University China
## 2 1 Harvard University USA
## 3 2 Stanford University USA
## 4 3 Massachusetts Institute of Technology USA
## 5 4 University of Cambridge United Kingdom
## 6 5 University of Oxford United Kingdom
## national_rank quality_of_education alumni_employment quality_of_faculty
## 1 84 355 478 210
## 2 1 1 1 1
## 3 2 9 2 4
## 4 3 3 11 2
## 5 1 2 10 5
## 6 2 7 13 10
## publications influence citations broad_impact patents score year
## 1 890 790 800 1000 737 44.18 2014
## 2 1 1 1 1 3 100.00 2015
## 3 5 3 3 4 10 98.66 2015
## 4 15 2 2 2 1 97.54 2015
## 5 11 6 12 13 48 96.81 2015
## 6 7 12 7 9 15 96.46 2015
We extract again the two columns of interest, this time from the testing data set, and name them accordingly.
patent_test = testdata$patents
score_test = testdata$score
We are ready now to check if the derived models are actually good predictive models. First we calculate the predicted test data score using the linear and the quadratic models derived earlier.
# Calculate the predicted test data score. Notice that here we use patent_test because we want the predicted scores for the testing data set
score_predict1 = coef(linear_train)[1] + coef(linear_train)[2]*patent_test
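An equivalent way to get the same predictions (optional) is to call predict() on the fitted model with the test values supplied as new data; the column name must match the predictor name used when the model was fit:
# Same linear-model predictions via predict(); the column must be named patent_train to match the formula
score_predict1_alt = predict(linear_train, newdata = data.frame(patent_train = patent_test))
head(score_predict1_alt)   # should match head(score_predict1)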
For a visual representation we can plot the actual testing data and overlay the predicted values.
par(mar = c(5, 4, 4, 4) + 0.3) # Leave space for second axis
# Plot the actual values for patent and score as observed in the testing data set
plot(patent_test, score_test, main='Test Data -- Score vs Patent', ylab = "Actual Scores")
# Overlay the predicted values as calculated from the linear model and derived using the training model
par(new=TRUE)
# The red color is used to distinguish the predicted values, which, because of the linear model, lie exactly on a line
plot(patent_test, score_predict1, axes = FALSE, bty = "n", xlab = "", ylab = "", col="red")
#This creates a second axis for the predicted scores
axis(side = 4, at = pretty(range(score_predict1)), col ="red")
mtext("Predcited Scores", side = 4, line = 3, col = "red")
# You can also look at the first few predicted and actual scores to better understand the plot above
head(score_predict1)
## [1] 43.02501 54.59338 54.48306 54.62490 53.88415 54.40425
head(score_test)
## [1] 44.18 100.00 98.66 97.54 96.81 96.46
A better way to assess the goodness of a predictive model is to scatter plot the actual values against the predicted values. In a perfect predictive model the points would line up along the diagonal line. This is rarely the case, if ever!
#Plot predicted values from the linear model versus actual values from the test data
plot(score_test, score_predict1, xlab='Actual', ylab='Predict', main='Linear Model -- Predict vs Actual Test')
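To make the diagonal easier to judge, you can overlay the reference line y = x on this plot (optional); points falling on this line would be perfect predictions:
# Overlay the ideal y = x reference line on the current plot
abline(a = 0, b = 1, col = "gray", lty = 2)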
From the plot we can easily see that many of the predicted values fall far from the diagonal line; some prediction error is expected. Finally, to quantify the goodness of a model, we need to calculate the Root Mean Square Error (RMSE).
#Calculate RMSE for Linear Model
error1 = sum((score_predict1 - score_test)^2)/length(score_test)
rmse1 = sqrt(error1)
rmse1
## [1] 5.956123
#If you would like to understand why we used the above formula for RMSE, see the example at the end of the worksheet
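Since the same RMSE formula will be used again for the quadratic model, you could optionally wrap it in a small helper function; rmse() below is a name chosen for this sketch, not a built-in function:
# Optional helper: RMSE as a reusable function
rmse = function(predicted, actual) {
  sqrt(sum((predicted - actual)^2) / length(actual))
}
rmse(score_predict1, score_test)   # same value as rmse1 above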
It is hard to judge the goodness of this number unless we compare it to other possibilities. Of course a perfect scenario would have an RMSE of zero. We now need to repeat the above calculations for the non-linear quadratic model.
# Calculate score_predict2 based on the quadratic model and the patent test data. You need to refer again to the coefficients of the quadratic model derived earlier and the actual patent values obtained from the testing data
patent_test2 = patent_test^2
score_predict2 = coef(quad_train)[1] + coef(quad_train)[2]*patent_test + coef(quad_train)[3]*patent_test2
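Just as with the linear model, these values could also be obtained with predict() by supplying both predictor columns as new data (optional):
# Same quadratic-model predictions via predict(); both columns must match the names used in the formula
score_predict2_alt = predict(quad_train, newdata = data.frame(patent_train = patent_test, patent_train2 = patent_test2))
head(score_predict2_alt)   # should match head(score_predict2)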
For a visual representation, similar to the linear model, we need to do the following.
# Plot the actual values for patent and score as observed in the testing data set
plot(patent_test, score_test)
# Overlay the predicted values as calculated from the quadratic model and derived using the training model
par(new=TRUE, xaxt="n", yaxt="n", ann=FALSE)
# The green color is used to distinguish the predicted values, which, because of the quadratic model, lie exactly on a parabola
plot(patent_test, score_predict2, col="green")
#Plot the predicted values from the quadratic model versus the actual values from the test data
plot(score_test, score_predict2, xlab='Actual', ylab='Predict', main='Quadratic Model -- Predict vs Actual Test')
From the plots it is hard to tell which model is better because the two sets of predictions are very similar and sit on top of each other. The predicted values themselves are also very similar, which doesn’t help distinguish the two models.
A better way to quantify the goodness of the model is to calculate again the RMSE.
#Calculate RMSE for Quadratic Model
error2=sum((score_predict2 - score_test)^2)/length(score_test)
rmse2=sqrt(error2)
rmse2
## [1] 5.684821
The RMSE for the linear model was 5.956123, while the RMSE for the quadratic model was 5.684821. Putting these results together, it is evident that the quadratic model is the better one because its RMSE is lower than that of the linear model. In task 1, the quadratic model also had higher R-squared and adjusted R-squared values than the linear model.
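To put the comparison side by side, the numbers reported above can be collected in a small data frame (optional):
# Side-by-side comparison of the two models using the values reported above
comparison = data.frame(
  model = c("linear", "quadratic"),
  r_squared = c(0.2319, 0.3463),
  adj_r_squared = c(0.2313, 0.3452),
  test_RMSE = c(5.956123, 5.684821)
)
comparison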
Source [1]: http://www.kaggle.com
Source [2]: http://www.cwur.org
The following worked example shows how the RMSE formula used above is built up step by step from a small vector of predicted and actual values.
Predict_example = c(1,2,3)
Actual_example = c(1,1,1)
(Predict_example - Actual_example)
## [1] 0 1 2
(Predict_example - Actual_example)^2
## [1] 0 1 4
sum((Predict_example - Actual_example)^2)
## [1] 5
length(Predict_example)
## [1] 3
length(Actual_example)
## [1] 3
Mean_Square_Error = sum((Predict_example - Actual_example)^2)/length(Predict_example)
#You can also do Mean_Square_Error = sum((Predict_example - Actual_example)^2)/length(Actual_example)
Mean_Square_Error
## [1] 1.666667
root_mean_square_error = sqrt(Mean_Square_Error)
root_mean_square_error
## [1] 1.290994