For the review, We will use “EuroStore” data set (same data set in Lab 10)- THIS IS NOT THE DATA SET WE WILL USE IN THE EXAM. THE DATA WE WILL USE FOR EXAM 3 IS UPLOADED TO SAKAI IN FOLDER “EXAM_3”>> “DATA_FOR_EXAM_3.ZIP”. This review will be used as examples for questions that could be asked in the exam. THE EXAM WILL INCLUDE ADDITIONAL QUESTIONS THAT ARE NOT DISCUSSED HERE AS WELL. PLEASE READ EVERYTHING IN SLIDES, ASSIGNMENTS, AND CLASS NOTES.

IMPORTANT: In the exam, you will have the questions on paper. It will not be in R studio. Therefore, you will have to create a new R Notebook to do your calculations. In order to do that, you click on file>>new file>> R Notebook >> then save the R notebook in the folder where you have the data for exam_3 >> then set your working directory to source file

1- Calculate signal to noise ratio for “Sales” column; does this data have high / low potential for noise? Explain your answer. (USE TRAINING DATA SET “EuroStore_Train.csv”)

# first you set working direcotry to source file
# then we should read the data 
store_data = read.csv(file = "EuroStore_Train.csv")
head(store_data) # you did this to see that data looks good in R studio
#then we need to read the "Sales" column 
sales_train = store_data$Sales
# snr = mean / sd 
mean_sales_train = mean(sales_train)
sd_sales_train = sd(sales_train)
snr_sales_train = mean_sales_train/sd_sales_train
snr_sales_train

the signal to noise ration is 9.145139. In the exam you have to show your calculations (i.e. the snr formula, the mean, the sd, and the snr value. don’t just write a number) It seems that it has low potential for noise because Snr is much higher than 1.

2- Do we have outliers in “Fuel Volume”? If yes, list those outliers; show your calculations (USE TRAINING DATA SET “EuroStore_Train.csv”)

#first you need to decide on the method. As you know we have two methods. I will pick mean+-3*sd

# we ened to read the column 
fuel_vol_train = store_data$Fuel.Volume
mean_fuel_vol_train = mean(fuel_vol_train)
sd_fuel_vol_train = sd(fuel_vol_train)

upper_threshold = mean_fuel_vol_train+3*sd_fuel_vol_train
upper_threshold
lower_threshold = mean_fuel_vol_train - 3*sd_fuel_vol_train
lower_threshold

Upper threshold = 69660.98, lower threshold = 56443.82. open the EuroStore_Train.csv, and see if any of the fuel volume values is outside the range 56443.82 to 69660.98. You will not find any; therefore, you can say that we don’t have outliers (remember, show your calcualtions i.e. upper limit, lower limit, formulas, and reasoning)

3- This week TV GRP = 182, Radio GRP = 190. Calculate the corresponding z-value or z-score. If we want to compare this week’s GRP with GRP values in the data set for both TV and Radio commercials (i.e. comparing this week’s TV GRP with TV GRP values in the data set, and comparing this week’s Radio GRP with Radio GRP values in the data set ), which of the commercials did better, TV or Radio? (USE TRAINING DATA SET “EuroStore_Train.csv”)

# we need to read the data
tv_train = store_data$TV
radio_train = store_data$Radio

# the z score formula is z = (X-mean)/sd, where X is the value we are calculating the z score for.
mean_tv_train = mean(tv_train)
sd_tv_train = sd(tv_train)
z_tv = (182-mean_tv_train) / sd_tv_train
z_tv

mean_radio_train = mean(radio_train)
sd_radio_train = sd(radio_train)
z_radio = (190-mean_radio_train)/sd_radio_train
z_radio

z score for TV GRP = 182 is 1.949604, and z score for Radio GRP = 190 is 1.310603.

How do I know which of the commercials did better this week? The question asks you to compare this week’s TV GRP with TV GRP values in the data set, and compare this week’s Radio GRP with Radio GRP values in the data set, and then decide which of the commercials did better. Actually the z scores will help a lot. Z score is the number of standard deviations the value is away from the mean. We know that 180 is 1.949604 away from the mean of TV GRP’s in the data set, and that 190 is 1.310603 standard deviations away from the mean of Radio GRP’s in the data set. As a result, this week the TV commercials did better this week comparing to hestorical values.

4- You need to create a predictive model that predicts sales. You will create the model using the training data set, and then test it using the testing data set.

4.A Calculate different correlation factors and list by order the most potential pedictive variables, execlude week and holiday.(USE TRAINING DATA SET “EuroStore_Train.csv”)

# here we will claculate the correlation coefficient between sales and all other varaibles in the data set 

cor(store_data)

you need to look at the correlation values in the table and rank the predictive varaibles accordingly. Based on the results we can see that the order is as the following (the first is the most potential predicitve variable, and the last is the least potential)

1- Temprature 2- Fuel Volume 3- Fuel Price 4- Radio 5- TV

(side note: remember that this is not enough to say that temprature, or fuel volume are good predcitors. You need to create models and test them; we can’t rely on correaltion coefficient only)

4.B The manager of the marekting department requested that you evaluate two models that predcits sales. Model_1 uses temprature and fuel price as predcitors, and Model_2 uses fuel volume and radio as predciotrs.

You need to evaluate the two models and show which one is the best.

First part of questions (models creation) :you need to create the two models using the training data set, and test them using the testing data set. After you create them using the training data set, you need to write the equations for the two models, explain the coefficients of the equation, and write the Rsquare and adjusted R square values. You need to compare the two models based on the training data set results using the R squared and adjusted R squared values.

Second part of question (models testing, evaluation, and selection): For the testing you need to compare the RMSE values for each model; also you can plot the predcited values vs actual values and compare (you don’t need to draw any plot on the exam paper; this is for you to have better insights on the best model).You also need to compare the two models based on the testing data set results i.e. based on RMSE (for this question You need to use both the training data set “EuroStore_Train.csv” and the testing data set “EuroStore_Test.csv”)

# let us start by creating the two models
# we already extracted the sales, fuel volume, Tv, and radio from training data set. we need to extract the temprature and fuel price 

temp_train = store_data$Temp
fuel_p_train = store_data$Fuel.Price
# now we can create the first model 
model_1 = lm (sales_train~temp_train+fuel_p_train)
summary(model_1)
# we can also create the second model
model_2 = lm (sales_train~fuel_vol_train+radio_train)
summary(model_2)

From the summary results you can asnwer the first part of the question

Model_1 Equation: Sales = -6761.09 + 204.36 * temprature + 234.78 * fuel price Coefficients: if temprature and price are both equal to zero the sales will be -6761.09 i.e. we will lose money. If temprature is increased by 1, the sales will increase by 204.36, and if fuel price is increased by 1, the sales will increase by 234.78. R squared = 0.6339, Adjusted R-squared = 0.6198

Model_2 Equation: Sales = -1.350e+04 + 5.627e-01* Fuel Vol + 9.481e+00 *radio coefficients: if fuel volume and radio are both equal to zero the sales will be -1.350e+04 i.e. we will lose money. If fuel volume is increased by 1, the sales will increase by 5.627e-01, and if radio is increased by 1, the sales will increase by 9.481e+00. R-squared: 0.4987, Adjusted R-squared: 0.4794

Based on these results, you can see that model_1 could be better (it has a higher R square and adjusted R square); however, we need to test the two models first before making the decision on which one is the best; remember that a model could be over-fitted to the training data set and do badly on testing data set. We don’t like such model; we need a model that does well on both training and testing.

# let us test the two models
# first we need to read the testing data set
store_data_2 = read.csv(file = "EuroStore_Test.csv")
head(store_data_2)
# we need to read the columns 
sales_test = store_data_2$Sales
temp_test = store_data_2$Temp
fuel_p_test = store_data_2$Fuel.Price
fuel_vol_test = store_data_2$Fuel.Volume
radio_test = store_data_2$Radio
# we will now run the two models on the testing data set and calculate the predicted values of sales. 
#We have two models, so we will have two vectors of predcited values for sales. We will compare the two vectors of predcited values for sales with the actual values for sales in the testing data set (this is what is called testing)

sales_predict_model_1 = coef(model_1)[1]+coef(model_1)[2]*temp_test+coef(model_1)[3]*fuel_p_test

plot(sales_predict_model_1,sales_test)
mse_model_1 = sum((sales_predict_model_1-sales_test)^2)/length(sales_test)
rmse_model_1 =sqrt(mse_model_1)
rmse_model_1

the rmse of model_1 is 3089.406. The plot shows not much correlation between actual sales and predicted sales. However, we can’t assess until we see results of model_2

sales_predict_model_2 = coef(model_2)[1]+coef(model_2)[2]*fuel_vol_test+coef(model_2)[3]*fuel_p_test

plot(sales_predict_model_2,sales_test)
mse_model_2 = sum((sales_predict_model_2-sales_test)^2)/length(sales_test)
rmse_model_2 =sqrt(mse_model_2)
rmse_model_2

the rmse of model_2 is 1870.863, much better than rmse of model_1. Also, looking at the plot of model_2, we will see better correaltion between predcited values and actual values.

As a result, I can say that model_2 is a better model. Although model_1 had higher R squared and Adjusted R squared in training data set, when we introduced the testing data set, model 2 had more accuracy.

  1. After you select your best predictive model, you have to formulate and solve the linear optimization problem. Your goal is to select the optimal values for the two predictors in order to optimize sales. Assume that maximum capacity of the fuel station is 90000, maximum Radio GRP is 30000,and every increase in Radio GRP by 1 leads to sales of at least 10 liters of fuel.Also,

Formulate the problem, identify decision valriables, solve the problem, identify binidng and none-binidng cosntraints, and calculate slack and surplus.

Based on the analysis, you will find that model_2 is the best, so we will use it to solve this question. The goal is to find the optimal fuel volume, and optimal radio GRP to maximize sales

Formulation: decision variables X1 = Fuel volume, X2 = Radio

Obj Max Sales = 5.627e-01* X1 + 9.481e+00 X2 Sub to X1 <= 90000 X2<= 30000 X1-10X2 >= 0 X1 >= 0 X2 >= 0

In order to solve, we need to download LpSolverAPI package

# we first need to downlaod the package to solve linear optimization problem 

if(!require("lpSolveAPI",quietly = TRUE))
  install.packages("lpSolveAPI",dependencies = TRUE, repos = "https://cloud.r-project.org")

now we can use the package to solve

lpSales <- make.lp(0, 2) 
dump = lp.control(lpSales, sense="max")  
# Set the objective function.

set.objfn(lpSales, c(5.627e-01, 9.481e+00))

# add constraints
add.constraint(lpSales, c(1,0 ), "<=", 90000)
add.constraint(lpSales, c(0,1), "<=", 30000)
add.constraint(lpSales, c(1,-10), ">=", 0)
add.constraint(lpSales, c(1,0 ), ">=", 0)
add.constraint(lpSales, c(0,1), ">=", 0)
# View the problem formulation in tabular/matrix form
lpSales

# Solve 
solve(lpSales) 

# Display the objective function optimum value
get.objective(lpSales)

# Display the decision variables optimum values
get.variables(lpSales) 

Solution: X 1 = 90000, X2 = 9000, Sales = 135972

get.sensitivity.rhs(lpSales)

First constraint is binding becuase marginal value =1.5108 <> 0, an increase in the right hand side of the contraint by 1 (if capacity of fuel station icnreases by 1) will increase optimal sales by 1.5108

Third constraint is binding because marginal value = -0.9481 <> 0, if the right hand side of the constraint is increased by 1, the optimal sales will decreases by -0.9481

All other constraints are non-binding becuase they have marginal value = 0

For surplus and slack, I will do it for second constraint only: Second constraint is non-binding and has <=, so it has slack (if it has >= then it has surplus) Second constraint is X2 <= 30000 Optimal solution for X2 = 9000, the slack = 30000-9000 = 21000

  1. Create a quadratic regression model to predict sales based on temprature (use training data set)

Try to solve this on your own, and send me questions

---
title: "Review_Final"
output: html_notebook
---

### For the review, We will use "EuroStore" data set (same data set in Lab 10)- THIS IS NOT THE DATA SET WE WILL USE IN THE EXAM. THE DATA WE WILL USE FOR EXAM 3 IS UPLOADED TO SAKAI IN FOLDER "EXAM_3">> "DATA_FOR_EXAM_3.ZIP". This review will be used as examples for questions that could be asked in the exam. THE EXAM WILL INCLUDE ADDITIONAL QUESTIONS THAT ARE NOT DISCUSSED HERE AS WELL. PLEASE READ EVERYTHING IN SLIDES, ASSIGNMENTS, AND CLASS NOTES. 

### IMPORTANT: In the exam, you will have the questions on paper. It will not be in R studio. Therefore, you will have to create a new R Notebook to do your calculations. In order to do that, you click on file>>new file>> R Notebook >> then save the R notebook in the folder where you have the data for exam_3 >> then set your working directory to source file

#1- Calculate signal to noise ratio for "Sales" column; does this data have high / low potential for noise? Explain your answer. (USE TRAINING DATA SET "EuroStore_Train.csv")

```{r}
# first you set working direcotry to source file
# then we should read the data 
store_data = read.csv(file = "EuroStore_Train.csv")
head(store_data) # you did this to see that data looks good in R studio

```

```{r}
#then we need to read the "Sales" column 
sales_train = store_data$Sales
# snr = mean / sd 
mean_sales_train = mean(sales_train)
sd_sales_train = sd(sales_train)
snr_sales_train = mean_sales_train/sd_sales_train
snr_sales_train

```

the signal to noise ration is 9.145139. In the exam you have to show your calculations (i.e. the snr formula, the mean, the sd, and the snr value. don't just write a number)
It seems that it has low potential for noise because Snr is much higher than 1.

#2- Do we have outliers in "Fuel Volume"? If yes, list those outliers; show your calculations (USE TRAINING DATA SET "EuroStore_Train.csv")

```{r}
#first you need to decide on the method. As you know we have two methods. I will pick mean+-3*sd

# we ened to read the column 
fuel_vol_train = store_data$Fuel.Volume
mean_fuel_vol_train = mean(fuel_vol_train)
sd_fuel_vol_train = sd(fuel_vol_train)

upper_threshold = mean_fuel_vol_train+3*sd_fuel_vol_train
upper_threshold
lower_threshold = mean_fuel_vol_train - 3*sd_fuel_vol_train
lower_threshold
```

Upper threshold = 69660.98, lower threshold = 56443.82. open the EuroStore_Train.csv, and see if any of the fuel volume values is outside the range 56443.82 to 69660.98. You will not find any; therefore, you can say that we don't have outliers (remember, show your calcualtions i.e. upper limit, lower limit, formulas, and reasoning)

#3- This week TV GRP = 182, Radio GRP = 190. Calculate the corresponding z-value or z-score. If we want to compare this week's GRP with GRP values in the data set for both TV and Radio commercials (i.e. comparing this week's TV GRP with TV GRP values in the data set, and comparing this week's Radio GRP with Radio GRP values in the data set ), which of the commercials did better, TV or Radio? (USE TRAINING DATA SET "EuroStore_Train.csv")

```{r}
# we need to read the data
tv_train = store_data$TV
radio_train = store_data$Radio

# the z score formula is z = (X-mean)/sd, where X is the value we are calculating the z score for.
mean_tv_train = mean(tv_train)
sd_tv_train = sd(tv_train)
z_tv = (182-mean_tv_train) / sd_tv_train
z_tv

mean_radio_train = mean(radio_train)
sd_radio_train = sd(radio_train)
z_radio = (190-mean_radio_train)/sd_radio_train
z_radio
```
z score for TV GRP = 182 is 1.949604, and z score for Radio GRP = 190 is 1.310603. 

How do I know which of the commercials did better this week? The question asks you to compare this week's TV GRP with TV GRP values in the data set, and compare this week's Radio GRP with Radio GRP values in the data set, and then decide which of the commercials did better. Actually the z scores will help a lot. Z score is the number of standard deviations the value is away from the mean. We know that 180 is 1.949604 away from the mean of TV GRP's in the data set, and that 190 is 1.310603 standard deviations away from the mean of Radio GRP's in the data set. As a result, this week the TV commercials did better this week comparing to hestorical values. 


#4- You need to create a predictive model that predicts sales. You will create the model using the training data set, and then test it using the testing data set.

#4.A Calculate different correlation factors and list by order the most potential pedictive variables, execlude week and holiday.(USE TRAINING DATA SET "EuroStore_Train.csv")

```{r}
# here we will claculate the correlation coefficient between sales and all other varaibles in the data set 

cor(store_data)
```

you need to look at the correlation values in the table and rank the predictive varaibles accordingly. Based on the results we can see that the order is as the following (the first is the most potential predicitve variable, and the last is the least potential)

1- Temprature 2- Fuel Volume 3- Fuel Price 4- Radio 5- TV

(side note: remember that this is not enough to say that temprature, or fuel volume are good predcitors. You need to create models and test them; we can't rely on correaltion coefficient only)

4.B The manager of the marekting department requested that you evaluate two models that predcits sales. Model_1 uses temprature and fuel price as predcitors, and Model_2 uses fuel volume and radio as predciotrs.

You need to evaluate the two models and show which one is the best.

First part of questions (models creation) :you need to create the two models using the training data set, and test them using the testing data set. After you create them using the training data set, you need to write the equations for the two models, explain the coefficients of the equation, and write the Rsquare and adjusted R square values. You need to compare the two models based on the training data set results using the R squared and adjusted R squared values.

Second part of question (models testing, evaluation, and selection): For the testing you need to compare the RMSE values for each model; also you can plot the predcited values vs actual values and compare (you don't need to draw any plot on the exam paper; this is for you to have  better insights on the best model).You also need to compare the two models based on the testing data set results i.e. based on RMSE 
(for this question You need to use both the training data set "EuroStore_Train.csv" and the testing data set "EuroStore_Test.csv")

```{r}
# let us start by creating the two models
# we already extracted the sales, fuel volume, Tv, and radio from training data set. we need to extract the temprature and fuel price 

temp_train = store_data$Temp
fuel_p_train = store_data$Fuel.Price

```


```{r}
# now we can create the first model 
model_1 = lm (sales_train~temp_train+fuel_p_train)
summary(model_1)
```

```{r}
# we can also create the second model
model_2 = lm (sales_train~fuel_vol_train+radio_train)
summary(model_2)
```

From the summary results you can asnwer the first part of the question 

Model_1 
Equation: Sales = -6761.09 + 204.36 * temprature + 234.78 * fuel price 
Coefficients: if temprature and price are both equal to zero the sales will be -6761.09 i.e. we will lose money. If temprature is increased by 1, the sales will increase by 204.36, and if fuel price is increased by 1, the sales will increase by 234.78. 
R squared = 0.6339,	Adjusted R-squared = 0.6198 

Model_2 
Equation: Sales = -1.350e+04 + 5.627e-01* Fuel Vol + 9.481e+00 *radio
coefficients: if fuel volume and radio are both equal to zero the sales will be -1.350e+04 i.e. we will lose money. If fuel volume is increased by 1, the sales will increase by 5.627e-01, and if radio is increased by 1, the sales will increase by 9.481e+00.
R-squared:  0.4987,	Adjusted R-squared:  0.4794 

Based on these results, you can see that model_1 could be better (it has a higher R square and adjusted R square); however, we need to test the two models first before making the decision on which one is the best; remember that a model could be over-fitted to the training data set and do badly on testing data set. We don't like such model; we need a model that does well on both training and testing. 

```{r}
# let us test the two models
# first we need to read the testing data set
store_data_2 = read.csv(file = "EuroStore_Test.csv")
head(store_data_2)
```

```{r}
# we need to read the columns 
sales_test = store_data_2$Sales
temp_test = store_data_2$Temp
fuel_p_test = store_data_2$Fuel.Price
fuel_vol_test = store_data_2$Fuel.Volume
radio_test = store_data_2$Radio
```

```{r}
# we will now run the two models on the testing data set and calculate the predicted values of sales. 
#We have two models, so we will have two vectors of predcited values for sales. We will compare the two vectors of predcited values for sales with the actual values for sales in the testing data set (this is what is called testing)

sales_predict_model_1 = coef(model_1)[1]+coef(model_1)[2]*temp_test+coef(model_1)[3]*fuel_p_test

plot(sales_predict_model_1,sales_test)
```

```{r}
mse_model_1 = sum((sales_predict_model_1-sales_test)^2)/length(sales_test)
rmse_model_1 =sqrt(mse_model_1)
rmse_model_1

```

the rmse of model_1 is 3089.406. The plot shows not much correlation between actual sales and predicted sales. However, we can't assess until we see results of model_2 

```{r}
sales_predict_model_2 = coef(model_2)[1]+coef(model_2)[2]*fuel_vol_test+coef(model_2)[3]*fuel_p_test

plot(sales_predict_model_2,sales_test)
```

```{r}
mse_model_2 = sum((sales_predict_model_2-sales_test)^2)/length(sales_test)
rmse_model_2 =sqrt(mse_model_2)
rmse_model_2

```
the rmse of model_2 is 1870.863, much better than rmse of model_1. Also, looking at the plot of model_2, we will see better correaltion between predcited values and actual values. 

As a result, I can say that model_2 is a better model. Although model_1 had higher R squared and Adjusted R squared in training data set, when we introduced the testing data set, model 2 had more accuracy.

5. After you select your best predictive model, you have to formulate and solve the linear optimization problem. Your goal is to select the optimal values for the two predictors in order to optimize sales. Assume that maximum capacity of the fuel station is 90000, maximum Radio GRP is 30000,and every increase in Radio GRP by 1 leads to sales of at least 10 liters of fuel.Also, 

Formulate the problem, identify decision valriables, solve the problem, identify binidng and none-binidng cosntraints, and calculate slack and surplus.



Based on the analysis, you will find that model_2 is the best, so we will use it to solve this question. 
The goal is to find the optimal fuel volume, and optimal radio GRP to maximize sales

Formulation:
decision variables X1 = Fuel volume, X2 = Radio

Obj Max  Sales = 5.627e-01* X1 + 9.481e+00 *X2
Sub to 
X1 <= 90000
X2<= 30000
X1-10*X2 >= 0
X1 >= 0
X2 >= 0

In order to solve, we need to download LpSolverAPI package
```{r}
# we first need to downlaod the package to solve linear optimization problem 

if(!require("lpSolveAPI",quietly = TRUE))
  install.packages("lpSolveAPI",dependencies = TRUE, repos = "https://cloud.r-project.org")
```

now we can use the package to solve 

```{r}
lpSales <- make.lp(0, 2) 
```

```{r}
dump = lp.control(lpSales, sense="max")  
# Set the objective function.

set.objfn(lpSales, c(5.627e-01, 9.481e+00))

# add constraints
add.constraint(lpSales, c(1,0 ), "<=", 90000)
add.constraint(lpSales, c(0,1), "<=", 30000)
add.constraint(lpSales, c(1,-10), ">=", 0)
add.constraint(lpSales, c(1,0 ), ">=", 0)
add.constraint(lpSales, c(0,1), ">=", 0)
```


```{r}
# View the problem formulation in tabular/matrix form
lpSales

# Solve 
solve(lpSales) 

# Display the objective function optimum value
get.objective(lpSales)

# Display the decision variables optimum values
get.variables(lpSales) 
```

Solution: X 1 = 90000, X2 = 9000, Sales = 135972

```{r}
get.sensitivity.rhs(lpSales)
```

First constraint is binding becuase marginal value =1.5108 <> 0, an increase in the right hand side of the contraint by 1 (if capacity of fuel station icnreases by 1) will increase optimal sales by 1.5108

Third constraint is binding because marginal value = -0.9481 <> 0, if the right hand side of the constraint is increased by 1, the  optimal sales will decreases by -0.9481

All other constraints are non-binding becuase they have marginal value = 0

For surplus and slack, I will do it for second constraint only: 
Second constraint is non-binding and has <=, so it has slack (if it has >= then it has surplus)
Second constraint is 
X2 <= 30000
Optimal solution for X2 = 9000, the slack = 30000-9000 = 21000


6. Create a quadratic regression model to predict sales based on temprature (use training data set)

Try to solve this on your own, and send me questions 

