---
title: "Sensitivity Analysis & Simulations (lab09)"
author: "Uriel Reyes Vazquez"
date: "4/29/2020"
output:
  html_notebook: default
  html_document: default
  pdf_document: default
subtitle: BSAD 343, Business Analytics, Spring 2020
---


### About

In this lab, we will learn how to run a sensitivity analysis on the marketing use case, and how to generate random samples from a simple normal distribution similar to a Monte Carlo simulation.

### Setup

Remember to always set your working directory to the source file location. Go to 'Session', scroll down to 'Set Working Directory', and click 'To Source File Location'. Read the instructions below carefully and follow them to complete the tasks and answer the questions. Submit your work in Sakai as detailed in previous notes.

### Note

For your assignment you may be using different data sets than what is included here. Always read the provided instructions carefully before executing any included code chunks and/or adding your own code. For clarity, tasks/questions to be completed/answered are highlighted in red and numbered according to their placement in the task section. The red color is only apparent in Preview mode. Quite often you will need to add your own code chunk.

Execute all code chunks (both those already included and any you add), preview, check integrity, and submit the final work ($html$ file) in Sakai.

--------------

### Task 1: Sensitivity Analysis

Sensitivity analysis is the study of how the uncertainty in the output of a mathematical model or system (numerical or otherwise) can be apportioned to different sources of uncertainty in its inputs. We will use the `lpSolveAPI` R-package as we did in the previous lab. 

In order to conduct a sensitivity analysis, we will need to install the `lpSolveAPI` package again, unless you already have it in your R environment.

```{r}
# require() attempts to load the package and returns FALSE if it is not installed
# dependencies = TRUE makes sure that package dependencies are also installed
if(!require("lpSolveAPI",quietly = TRUE))
  install.packages("lpSolveAPI",dependencies = TRUE, repos = "https://cloud.r-project.org")
```

We will revisit and solve again the marketing case discussed in class (also part of previous lab). 
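For reference, the problem we are encoding is the following linear program (writing $x_1$ for the radio spend and $x_2$ for the TV spend, per the marketing case):

$$
\begin{aligned}
\max\quad & 275.691\,x_1 + 48.341\,x_2 \\
\text{s.t.}\quad & x_1 + x_2 \le 350000 \\
& x_1 \ge 15000 \\
& x_2 \ge 75000 \\
& 2x_1 - x_2 = 0 \\
& x_1 \ge 0,\; x_2 \ge 0
\end{aligned}
$$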
```{r}
# We start with `0` constraint and `2` decision variables. The object name `lpmark` is discretionary.
lpmark = make.lp(0, 2)

# Define type of optimization as maximum and dump the screen output into a `dummy` variable
dummy = lp.control(lpmark, sense="max") 
# Set the objective function coefficients 
set.objfn(lpmark, c(275.691, 48.341))
```

Add all constraints to the model.

```{r}
add.constraint(lpmark, c(1, 1), "<=", 350000)
add.constraint(lpmark, c(1, 0), ">=", 15000)
add.constraint(lpmark, c(0, 1), ">=", 75000)
add.constraint(lpmark, c(2, -1), "=", 0)
add.constraint(lpmark, c(1, 0), ">=", 0)
add.constraint(lpmark, c(0, 1), ">=", 0)
```

Now, view the problem setting in tabular/matrix form. This is a good checkpoint to confirm that our constraints have been properly set.

```{r}
lpmark
# solve
solve(lpmark) 
```

Next we get the optimum results.
```{r}
# display the objective function optimum value
get.objective(lpmark)

# display the decision variables optimum values
get.variables(lpmark)
```

For sensitivity analysis we introduce the command that obtains the sensitivities due to changes in resources, as expressed in the right-hand side (rhs) of the constraints.

```{r}
# display sensitivity to rhs constraints.
# There will be a total of m+n values, where m is the number of constraints and n is the number of decision variables
get.sensitivity.rhs(lpmark) 
```

For this exercise we are only interested in the first six values (corresponding to the six constraints) of the output labeled `duals`. 
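A small sketch to isolate just those six values:

```{r}
# The first m = 6 dual values correspond to the constraints;
# the remaining n = 2 correspond to the decision variables
sens = get.sensitivity.rhs(lpmark)
sens$duals[1:6]
```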

<span style="color:red">
##### 1A)  Identify and explain what the binding/non-binding constraints, the surplus/slack, and the marginal values are and what they represent in the context of our marketing use case 
</span>

The first constraint, x_radio + x_tv <= 350000, is binding. If the maximum budget were increased by $1, the optimal solution would change and the sales value would increase by the marginal (shadow) value of $124.12.
The second constraint (x_radio >= 15000) is non-binding, as indicated by its zero dual value. Its surplus is calculated below.
```{r}
surplus_c_2= get.variables(lpmark)[1]-15000
surplus_c_2
```
The third constraint, x_tv >= 75000, is also non-binding. Its surplus is calculated as follows:
```{r}
surplus_c_3 = get.variables(lpmark)[2]-75000
surplus_c_3
```
The fourth constraint, 2 x_radio - x_tv = 0, is binding. If the right-hand side of this equation were increased by 1, the optimal solution would change and the sales value would increase by the marginal value of $75.78.

To acquire a better understanding of the sensitivity results, and to confirm the integrity of the calculations, independent tests can be conducted. To demonstrate, we will repeat the linear programming (LP) optimization problem, slightly tweak one binding constraint, and verify the impact.

<span style="color:red">
##### 1B)  Define a new model object `lpmark1` and add the constraints exactly as entered at beginning of this task. All being equal, change the budget constraint by a marginal $1 value and solve the LP problem.  Note the new optimum value for sales. Calculate the differential change in optimum sales from the earlier computed optimum sales,  and compare your calculation to the value obtained from the prior sensitivity calculation 
</span>

```{r}
# Define a new model object called lpmark1
lpmark1 = make.lp(0, 2)
# Repeat rest of commands with the one constraint change for budget. Solve and display the objective function optimum value

dummy = lp.control(lpmark1, sense="max") 

set.objfn(lpmark1, c(275.691, 48.341))

add.constraint(lpmark1, c(1, 1), "<=", 350001)
add.constraint(lpmark1, c(1, 0), ">=", 15000)
add.constraint(lpmark1, c(0, 1), ">=", 75000)
add.constraint(lpmark1, c(2, -1), "=", 0)
add.constraint(lpmark1, c(1, 0), ">=", 0)
add.constraint(lpmark1, c(0, 1), ">=", 0)

lpmark1

solve(lpmark1) 

get.objective(lpmark1)
get.variables(lpmark1) 
```
Subtracting the two optimum values gives the marginal change.
```{r}
get.objective(lpmark1)-get.objective(lpmark)
```
The marginal change in sales value matches the dual value of $124.12 reported by the sensitivity analysis above.
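We can pull that shadow price directly from the sensitivity output as a cross-check:

```{r}
# The first dual value is the shadow price of the budget constraint
get.sensitivity.rhs(lpmark)$duals[1]
```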

<span style="color:red">
##### 1C)  Following the logic of calculations in 1B) and without running another code, explain the steps to validate the integrity of the other remaining marginal value identified in 1A). Use a different line to explain each step 
</span>

Revert the right-hand side of the first constraint back to 350000, then add 1 to the right-hand side of the second constraint, keep the other constraints unchanged, solve, and calculate the difference between the new optimum value and the original optimum value.

Revert the right-hand side of the second constraint back to 15000, then add 1 to the right-hand side of the third constraint, keep the other constraints unchanged, solve, and calculate the difference again.

Revert the right-hand side of the third constraint back to 75000, then add 1 to the right-hand side of the fourth constraint, keep the other constraints unchanged, solve, and calculate the difference; it should match the marginal value of $75.78.

----------

### Task 2: Random Sampling & Monte Carlo Simulation

Monte Carlo Simulations utilize repeated random sampling from a given universe or population distribution to derive certain results. This type of simulation is known as a probabilistic simulation, as opposed to a deterministic simulation.

An example of a Monte Carlo simulation is the one applied to approximate the value of `pi`. The simulation is based on generating random points within a unit square and seeing how many fall within the circle enclosed by the square (marked in red). The value of `pi` is then estimated as four times the ratio of points inside the circle to the total number of points generated inside the square, since that ratio approximates the circle's area of `pi/4`. The higher the number of sampled points, the closer the estimate gets to the actual value; after selecting 30,000 random points, the estimate for `pi` gets much closer to the actual value, approaching four decimal places of precision. The interested and curious reader is encouraged to explore the mathematical foundation behind the logic.

![](montecarlo.gif)
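To make the logic concrete, here is a minimal sketch of the estimate in R (sampling points uniformly in the unit square and testing membership in the inscribed circle of radius 0.5):

```{r}
# Sample n points uniformly in the unit square
n = 30000
x = runif(n)
y = runif(n)
# The circle inscribed in the unit square has area pi/4, so the fraction
# of points landing inside it estimates pi/4
inside = sum((x - 0.5)^2 + (y - 0.5)^2 <= 0.25)
pi_estimate = 4 * inside / n
pi_estimate
```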


For this task we will be running a Monte Carlo simulation to calculate the probability that the daily return from S&P will be > 5%. 
We will assume that the historical S&P daily return follows a normal distribution with an average daily return of 0.03 (%) and a standard deviation of 0.97 (%). Consider this as our population.
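Since the population distribution is fully specified, the exact probability is available from the normal CDF and serves as a benchmark for the simulated estimates (using the 0.05 threshold applied in the code below, this works out to about 0.49):

```{r}
# Exact population probability of a return at or above the 0.05 threshold
1 - pnorm(0.05, mean = 0.03, sd = 0.97)
```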

To begin we will generate 100 random samples from the normal distribution. For the generated samples we will calculate the mean, standard deviation, and probability of occurrence where the simulation result is greater than 5%. 

To generate random samples from a normal distribution we will use the `rnorm()` function in R. In the example below we set the number of runs (or samples) to 100.

```{r}
# number of simulations/samples
runs = 100
# generate random samples from the normal distribution with the given mean and standard deviation
sims = rnorm(runs, mean=0.03, sd=0.97)
```

```{r}
# Mean calculated from the random distribution of samples
average = mean(sims)
average
```

```{r}
# STD calculated from the random distribution of samples
std = sd(sims) 
std
```

```{r}
# probability of occurrence on any given day: the count (sum) of samples where the result is greater than 5%, divided by the total number of samples
prob = sum(sims >=0.05)/runs
prob
```
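Note that `rnorm()` draws fresh random numbers on every run, so the values above will differ each time the chunk is executed. If reproducible results are desired, a seed can be fixed before sampling (a minimal sketch; the seed value is arbitrary):

```{r}
# Calling set.seed() before rnorm() makes the random draws repeatable
set.seed(42)
```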

<span style="color:red">
##### 2A)  Repeat the above calculations for the two cases where the number of simulations/samples is equal to 1000 and 10000.  For each case record the mean, standard deviation, and probability 
</span>

```{r}
# Repeat calculations here
runs1 = 1000

sims1 =  rnorm(runs1,mean=0.03,sd=0.97)

average1 = mean(sims1)
average1

std1 = sd(sims1) 
std1

prob1 = sum(sims1 >=0.05)/runs1
prob1


runs2 = 10000

sims2 =  rnorm(runs2,mean=0.03,sd=0.97)

average2 = mean(sims2)
average2

std2 = sd(sims2) 
std2

prob2 = sum(sims2 >=0.05)/runs2
prob2
```

<span style="color:red">
##### 2B)  List in separate lines the values for mean, standard deviation, and probability for all three cases: 100, 1000, and 10000 simulations. Describe how the values change/behave as the number of simulations is increased. What is your best bet on the probability of occurrence greater than 5% and why? How is this behavior similar to the use case demonstration to calculate `pi`?
</span>
runs 100 
mean = 0.1272613
sd = 1.078883
prob = 0.5

runs 1000
mean = 0.02324018
sd = 0.9686136
prob = 0.501

runs 10000
mean = 0.0262328
sd = 0.9665475
prob = 0.4898

The larger the sample, the closer the computed mean and standard deviation get to the population values given at the start of the problem (0.03 and 0.97). Accordingly, the best bet for the probability of occurrence greater than 5% is the 0.4898 from the 10000-run case, since it rests on the largest number of samples. This mirrors the `pi` demonstration: in both cases, increasing the number of random samples makes the Monte Carlo estimate converge toward the true value.

----------

### Task 3: Training Set & Model Building

The first half of this lab focuses on building a model using the training data. The data we will be using was obtained from the Kaggle site [1] and is in reference to world university rankings [2]. The data looks at university world scoring based on different ranking criteria such as quality of education, quality of faculty, and rank for patents. Spend some time visiting the referenced websites to get more acquainted with the data. The data obtained is divided into two sets: a training set and a testing set. We begin by reading the training data set 'universityrank_training.csv' file, and checking the header lines to make sure the data is read correctly.

```{r}
traindata = read.csv(file="universityrank_training.csv", header=TRUE)
head(traindata)
```

Next, we extract the two columns of interest and call them properly so we can easily refer to them later in the code. 

```{r}
patent_train = traindata$patents
score_train = traindata$score
```

The first model we will build is a simple linear model. We will use the patents ranking variable to predict the university score. To better understand the data: the lower the patents ranking number, the better. A value of 1 is the top rank for patents and represents the highest category in terms of the number of patents owned by the particular academic institution. On the other hand, the higher the calculated total score the better, as reflected by the world rank number. A value of 100 is a perfect score.

```{r}
# Fit a simple linear model predicting score from patents rank
linear_train = lm(score_train ~ patent_train)
summary(linear_train)

# Scatter plot of the training data with the fitted line overlaid
plot(patent_train, score_train)
abline(linear_train, col="blue", lwd=2)
```

<span style="color:red">
##### 3A) Complete the steps in the code chunk below to build a non-linear quadratic model. Follow the steps used in lab08 for costs versus servers 
</span>

```{r}
# First define a new variable which is the squared value of patent_train (defined above)
patent_train2 = patent_train^2
# Next derive the quadratic regression model.  You may want to call it quad_train.
quad_train = lm(score_train ~ patent_train + patent_train2)
# Publish the summary statistics
summary(quad_train)
```
```{r}
# Compute fitted values manually from the quadratic model coefficients
score_fitted = coef(quad_train)[1] + coef(quad_train)[2] * patent_train + coef(quad_train)[3] * patent_train2

# Plot the training data, then overlay the fitted quadratic curve in red
plot(patent_train, score_train, pch = 16)
par(new=TRUE, xaxt="n", yaxt="n", ann=FALSE)
plot(patent_train, score_fitted, col = "red", pch = 16)
```
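As a quick sanity check, base R's `fitted()` returns the same fitted values as the manual coefficient calculation above:

```{r}
# fitted() extracts the model's fitted values directly; the difference
# from the manual calculation should be zero up to floating-point error
max(abs(fitted(quad_train) - score_fitted))
```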

<span style="color:red">
##### 3B) Looking at the values for R-Squared and Adjusted R-Squared for both the linear and the quadratic select the best predictive model. Explain your logic 
</span>

The quadratic model is the better predictor: its R-squared (0.3464) and adjusted R-squared (0.3453) are both noticeably higher than the linear model's (0.2321 and 0.2314), so it explains more of the variance in score even after accounting for the extra quadratic term.

----------

### Task 4: Testing Set & Model Evaluation

The second half of predictive modeling is about testing the model using a different data set called the testing data. Again we must first read the testing data set and make sure it is read properly.

```{r}
testdata = read.csv("universityrank_testing.csv", header=TRUE)
head(testdata)
```

We extract again the two columns of interest, in reference this time to the testing data set, and call them accordingly.

```{r}
patent_test = testdata$patents
score_test = testdata$score
```

We are ready now to check if the derived models are actually good predictive models. First we calculate the predicted test data score using the linear model as derived from the training set. Later we will consider the quadratic model. 

```{r}
# Calculate the predicted test data score 
score_predict1 = coef(linear_train)[1] + coef(linear_train)[2]*patent_test
```
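Equivalently, base R's `predict()` applies the fitted coefficients for us (a sketch; note the `newdata` column must reuse the training variable name `patent_train`):

```{r}
# predict() evaluates the fitted line at the test patent values
score_predict1_alt = predict(linear_train, newdata = data.frame(patent_train = patent_test))
# Should be zero up to floating-point error
max(abs(score_predict1_alt - score_predict1))
```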

For a visual qualitative evaluation we can plot the actual testing data and overlay the predicted values.

```{r}
# Plot the actual values for patent and score as observed in the testing data set
plot(patent_test, score_test, main='Test Data -- Score vs Patent')
# Overlay the predicted values as calculated from the linear model and derived using the training model
par(new=TRUE, xaxt="n", yaxt="n", ann=FALSE)
# The red color is used to distinguish the predicted values which, because of the linear model, will fall exactly on a line
plot(patent_test, score_predict1, col="red")
```

A better way to qualify the goodness of a predictive model is to scatter plot the actual values against the predicted values. In a perfect predictive model the points will line up along the diagonal line.  This is rarely the case, if ever!

```{r}
# Plot predicted values from the linear model versus actual values from the test data
plot(score_test, score_predict1, xlab='Actual', ylab='Predict', main='Linear Model -- Predict vs Actual Test')
```

From the plot we can easily see that most of the predicted values are far from the diagonal line; in many cases some scatter like this is expected. Finally, to quantify the goodness of a model, we need to calculate the Root Mean Square Error (RMSE).
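For $n$ test observations with actual scores $y_i$ and predicted scores $\hat{y}_i$, the quantity computed below is:

$$
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2}
$$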

```{r}
#Calculate RMSE for Linear Model
error1 = sum((score_predict1 - score_test)^2)/length(score_test)
rmse1 = sqrt(error1)
rmse1
```


It is hard to judge the goodness of this number unless we compare it to other possibilities. Of course a perfect scenario would have zero RMSE. We now need to repeat the above calculations for the non-linear quadratic model.

<span style="color:red">
##### 4A) Fill in the code chunk below to calculate the predicted values for the non-linear quadratic model 
</span>

```{r}
# Calculate score_predict2 using the quadratic model as derived earlier from the training data and as applied to the actual patent values obtained from the testing data
patent_test2 = patent_test^2
score_predict2 = coef(quad_train)[1] + coef(quad_train)[2] * patent_test + coef(quad_train)[3] * patent_test2

```

For a visual representation, similar to the linear model, we need to do the following.

<span style="color:red">
##### 4B) Fill in the code chunk below to plot the actual Score vs Patent for the test data, and overlay the predicted values as calculated in 4A. Label axes and title properly (4pts)
</span>

```{r}
# Plot the actual values for patent and score as observed in the testing data set
plot(patent_test, score_test, main='Test Data -- Score vs Patent')
# Overlay the predicted values as calculated from the quadratic model and derived using the training model. 
par(new=TRUE, xaxt="n", yaxt="n", ann=FALSE)
# The green color is used to distinguish the predicted values which, because of the quadratic model, will in this case fall exactly on a parabola
plot(patent_test, score_predict2, col="green")
```

<span style="color:red">
#### 4C) Plot the Predict vs Actual for the quadratic model.  Label axes and title properly (2pts)
</span>

```{r}
# Plot the predicted values from the quadratic model versus the actual values from the test data
plot(score_test, score_predict2, xlab='Actual', ylab='Predict', main='Quad Model -- Predict vs Actual Test')
```

<span style="color:red">
#### 4D) By looking at the scatterplots for the linear and quadratic models are you able to tell which model is better? Explain your logic (1pt)
</span>

The quadratic model appears to be the better fit: its predicted values cluster somewhat closer to the diagonal than the linear model's, although the difference is subtle from the scatterplots alone.

A better way is to quantify the goodness of the model by calculating again the RMSE.

<span style="color:red">
#### 4E) Calculate the root mean square error (RMSE) for the quadratic model (2pts)
</span>

```{r}
#Calculate RMSE  for Quadratic Model
error2 = sum((score_predict2 - score_test)^2)/length(score_test)
rmse2 = sqrt(error2)
rmse2
```
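To put the two models side by side:

```{r}
# Test-set RMSE for both models; the smaller value indicates better predictions
c(linear = rmse1, quadratic = rmse2)
```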

<span style="color:red">
#### 4F) Based on the root mean square error (RMSE) which model is better? How does your conclusion reconcile with the results from Task 3? (1pt)
</span>

The RMSE for the quadratic model (5.69) is lower than the RMSE for the linear model (5.96), meaning the quadratic model's test predictions carry less error, making it the better predictor. This reconciles with Task 3, where the quadratic model also posted the higher R-squared on the training data.

source [1]: [http://www.kaggle.com](http://www.kaggle.com)

source [2]: [http://www.cwur.org](http://www.cwur.org) 


