Answer at least 4 of the following questions:
Describe the assumptions related to a multiple linear regression. The assumptions in multiple linear regression:
Which of the aforementioned assumptions could be checked using the residual plots? You can check for constant variance (that is, the absence of heteroskedasticity), for normality of the residuals, and for whether a linear model is a good fit.
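As a minimal sketch of how these checks might look in R, the block below fits a multiple linear regression to the built-in `mtcars` data (an illustrative choice, not data from this exam) and draws two common residual plots. A funnel shape in the first plot would suggest non-constant variance, a curved pattern would suggest that a linear model is a poor fit, and points far from the reference line in the Q-Q plot would suggest non-normal residuals.

```r
# Illustrative model on built-in data (assumption: mtcars stands in for real data)
fit <- lm(mpg ~ wt + hp, data = mtcars)

res <- resid(fit)          # residuals
fitted_vals <- fitted(fit) # fitted values

# Residuals vs fitted: checks linearity and constant variance
plot(fitted_vals, res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Normal Q-Q plot: checks normality of the residuals
qqnorm(res)
qqline(res)
```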
What is overfitting in predictive modeling?
Compare the linear regression and the logistic regression in as much detail as possible.
Linear regression is used when the response variable is numeric.
In linear regression, the fitted regression is a straight line.
Logistic regression is used when the response variable is categorical.
In logistic regression, the fitted curve is an S-shaped (sigmoid) curve.
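The contrast between the two fits can be sketched on simulated data. Everything in the block below (the data, the coefficients, the variable names) is made up for illustration. The linear model is fitted with `lm()` and predicts on the scale of the numeric response, while the logistic model is fitted with `glm(..., family = binomial)` and predicts a probability between 0 and 1.

```r
set.seed(42)

# Simulated data (hypothetical): one predictor, two kinds of response
x <- seq(-3, 3, length.out = 200)
y_num <- 2 + 1.5 * x + rnorm(200)         # numeric response
y_bin <- rbinom(200, 1, plogis(2 * x))    # binary (0/1) response

lin_fit <- lm(y_num ~ x)                       # linear regression
log_fit <- glm(y_bin ~ x, family = binomial)   # logistic regression

# Predictions on an evenly spaced grid
grid <- data.frame(x = seq(-3, 3, by = 0.5))
lin_pred <- predict(lin_fit, newdata = grid)                     # straight line
log_pred <- predict(log_fit, newdata = grid, type = "response")  # S-shaped curve

print(round(cbind(x = grid$x, linear = lin_pred, logistic = log_pred), 3))
```

On the evenly spaced grid the linear predictions change by a constant amount per step, whereas the logistic predictions flatten out as they approach 0 and 1.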
Answer the following questions using the two vectors given in the R code where y is the true response, and yhat is the predicted response.
set.seed(1000)
y <- c(1,0,1,1,0,1,1,0,1,1,0,1,1,1,1,0,0,1,1,0,1,1,0,0,1,1,1,1,0,0,0,1,1,0,1,0)
yhat <- c(1,0,0,1,0,1,1,1,1,0,1,1,0,0,1,0,1,1,1,1,0,1,0,0,1,0,1,1,0,0,1,0,1,0,0,0)
Create a confusion matrix without using the ConfusionMatrix function or similar predetermined functions.
table_cf <- table(y,yhat)
table_cf
##    yhat
## y    0  1
##   0  9  5
##   1  8 14
Calculate the accuracy, sensitivity, and specificity without using the ConfusionMatrix function or similar predetermined functions.
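# Read each cell from the confusion matrix (rows = y, columns = yhat)
TP <- table_cf["1", "1"]
TN <- table_cf["0", "0"]
FP <- table_cf["0", "1"]
FN <- table_cf["1", "0"]
tot <- sum(table_cf)
acc <- (TP + TN)/tot
sens_1 <- TP/(TP + FN)
specif_1 <- TN/(TN + FP)
cat(acc,sens_1,specif_1)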
## 0.6388889 0.6363636 0.6428571
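The same three measures can also be computed directly from the two vectors, without building a table first. The sketch below wraps the calculation in a helper called `classification_metrics`, a hypothetical name introduced here for illustration. It assumes the positive class is coded as 1.

```r
y    <- c(1,0,1,1,0,1,1,0,1,1,0,1,1,1,1,0,0,1,1,0,1,1,0,0,1,1,1,1,0,0,0,1,1,0,1,0)
yhat <- c(1,0,0,1,0,1,1,1,1,0,1,1,0,0,1,0,1,1,1,1,0,1,0,0,1,0,1,1,0,0,1,0,1,0,0,0)

# Hypothetical helper: counts each cell with logical comparisons
classification_metrics <- function(actual, predicted, positive = 1) {
  tp <- sum(actual == positive & predicted == positive)  # true positives
  tn <- sum(actual != positive & predicted != positive)  # true negatives
  fp <- sum(actual != positive & predicted == positive)  # false positives
  fn <- sum(actual == positive & predicted != positive)  # false negatives
  c(accuracy    = (tp + tn) / length(actual),
    sensitivity = tp / (tp + fn),
    specificity = tn / (tn + fp))
}

metrics <- classification_metrics(y, yhat)
print(round(metrics, 4))
```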
Suppose that we obtained the following result using a linear regression for the median housing price in Greensboro:
Answer (i) - (iv) where the variable Highway is a binary variable (Yes for a house near a highway and No otherwise), and for (iv) use n=210.
i = -0.0007/0.04
ii = 8.31 * 0.1675
iii = -0.4782/0.2205
iv = 2*pt(iii, 209)
cat(i,ii,iii,iv)
## -0.0175 1.391925 -2.168707 0.03123393
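A two-sided p-value is safest to compute from the lower tail at `-abs(t)`, which gives the same answer whether the statistic is negative or positive. The sketch below uses a hypothetical helper, `two_sided_p`, and takes the 209 degrees of freedom as given in the code above. For a model with an intercept and k predictors, the usual residual degrees of freedom would be n - k - 1.

```r
# Hypothetical helper: two-sided p-value for a t statistic
two_sided_p <- function(t_stat, df) {
  2 * pt(-abs(t_stat), df)
}

t_highway <- -0.4782 / 0.2205                      # estimate / standard error
p_highway <- two_sided_p(t_highway, df = 209)      # df taken from the text
p_flipped <- two_sided_p(-t_highway, df = 209)     # same value for a positive t

print(c(t = t_highway, p = p_highway))
```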
Interpret the coefficient of the variable Room.
Answer For each one-unit increase in Room, the median housing price increases by about 1.39 on average, holding Age and Highway fixed.
Interpret the coefficient of the variable Highway(Yes).
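Answer Holding Age and Room fixed, the median housing price of a house near a highway (Highway = Yes) is on average 0.4782 lower than that of a house not near a highway (Highway = No).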
When Age = 50, Room = 3, Highway = Yes, and the intercept estimate is 5, estimate the predicted value of the median housing price in Greensboro.
# intercept + Age*50 + Room*3 + Highway(Yes)*1
est_1 <- 5 - 0.0007*50 + ii*3 - 0.4782*1
est_1
## [1] 8.662575
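The prediction is the sum of each coefficient multiplied by the matching predictor value, with Highway = Yes coded as 1. A minimal sketch, assuming the coefficient values used in the calculations above (the names in the vectors are illustrative labels):

```r
# Coefficients as used above; the Room estimate is t value * standard error
coefs <- c(Intercept = 5, Age = -0.0007, Room = 8.31 * 0.1675, HighwayYes = -0.4782)

# Predictor values: 1 for the intercept, Age = 50, Room = 3, Highway = Yes -> 1
x_new <- c(Intercept = 1, Age = 50, Room = 3, HighwayYes = 1)

pred_price <- sum(coefs * x_new)
print(pred_price)
```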
Suppose that we obtained the following result using a logistic regression for the admission status (Yes or No) of a graduate school:
Answer (i) - (iv) where the variable SOP denotes Statement of Purpose and it has two categories, Fair and Good. Answer
i = 0.0023/2.25
ii = 2.33 * 0.3318
iii = 0.2312/0.1023
iv = 2*pt(-abs(iii), 22)
cat(i,ii,iii,round(iv, 4))
## 0.001022222 0.773094 2.26002 0.0341
Convert the coefficient of the variable College GPA to the odds ratio and interpret it.
ORgpa = exp(0.773094)
Answer If College GPA increases by one unit, the odds of admission are multiplied by about 2.17, holding GRE and SOP fixed.
Convert the coefficient of the variable SOP(Good) to the odds ratio and interpret it.
ORsop = exp(0.2312)
Answer Holding GRE and College GPA fixed, the odds of admission with a Good SOP are about 1.26 times the odds with a Fair SOP.
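Exponentiating a logistic regression coefficient gives an odds ratio, so all three can be converted in one step. The sketch below assumes the coefficient values used in the calculations above. The vector names are illustrative labels, not names from the original output.

```r
# Coefficients on the log-odds scale; GPA estimate is z value * standard error
coefs <- c(GRE = 0.0023, GPA = 2.33 * 0.3318, SOP_Good = 0.2312)

odds_ratios <- exp(coefs)                  # multiplicative change in the odds
pct_change  <- 100 * (odds_ratios - 1)     # same change as a percentage

print(round(odds_ratios, 2))
print(round(pct_change, 1))
```

An odds ratio above 1 means the odds of admission rise with the predictor. It describes a change in odds, not in probability.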
When GRE = 2000, College GPA = 3.2, SOP = Good, and the intercept estimate is -2, estimate the probability of admission status = Yes, and predict the admission status using the estimated probability.
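# Linear predictor (log-odds): SOP = Good enters as 0.2312*1
logit = -2 + 0.0023*2000 + ii*3.2 + 0.2312*1
pAdmis = round(plogis(logit), 4)
pAdmis
## [1] 0.9951
Answer The estimated probability of admission is about 0.995. Because this is greater than 0.5, the predicted admission status is Yes.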
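The linear predictor of a logistic regression is on the log-odds scale, so it has to pass through the inverse logit, p = 1 / (1 + exp(-log-odds)), to become a probability between 0 and 1. A minimal sketch, assuming the coefficient values used in the calculations above and a 0.5 classification cutoff (the cutoff is an assumption, not stated in the question):

```r
intercept  <- -2
b_gre      <- 0.0023
b_gpa      <- 2.33 * 0.3318   # z value * standard error
b_sop_good <- 0.2312

# Log-odds for GRE = 2000, College GPA = 3.2, SOP = Good (coded as 1)
log_odds <- intercept + b_gre * 2000 + b_gpa * 3.2 + b_sop_good * 1

# Inverse logit turns log-odds into a probability
prob_yes <- plogis(log_odds)

# Predict Yes when the probability exceeds the assumed 0.5 cutoff
pred_status <- ifelse(prob_yes > 0.5, "Yes", "No")

print(round(c(log_odds = log_odds, prob_yes = prob_yes), 4))
print(pred_status)
```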
# Question 5 (20)
Suppose that you want to predict whether or not a customer purchases an item at your online store.
Provide your approach step-by-step to build a predictive model. If necessary, provide a list of predictors and algorithms.
Answer
I would import a data set with a binary response variable (Purchase = 1, No purchase = 0) and predictors such as gender, number of previous purchases, days since last purchase, price of purchase, and number of items purchased, and split the data into a 70% training set and a 30% test set.
I would build several classification models, such as logistic regression, naive Bayes, and a decision tree, using the training data.
I would apply each model to the test data.
I would evaluate the prediction performance of each model using a confusion matrix.
I would choose the model with the best accuracy, sensitivity, and specificity.
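The steps above can be sketched in base R for one of the candidate models. The data are simulated, and the column names, coefficients, and 0.5 cutoff are all assumptions made for illustration. Only logistic regression is shown; naive Bayes and a decision tree would go through the same split, predict, and evaluate steps.

```r
set.seed(1)
n <- 500

# Simulated (hypothetical) customer data
shop <- data.frame(
  days_since_last  = rexp(n, rate = 1 / 30),
  n_prev_purchases = rpois(n, lambda = 3)
)
eta <- -0.5 - 0.03 * shop$days_since_last + 0.4 * shop$n_prev_purchases
shop$purchase <- rbinom(n, size = 1, prob = plogis(eta))

# Step 1: 70% / 30% train-test split
train_idx <- sample(seq_len(n), size = round(0.7 * n))
train <- shop[train_idx, ]
test  <- shop[-train_idx, ]

# Step 2: fit a classifier on the training data
fit <- glm(purchase ~ days_since_last + n_prev_purchases,
           data = train, family = binomial)

# Step 3: apply the model to the test data (assumed 0.5 cutoff)
p_hat <- predict(fit, newdata = test, type = "response")
pred  <- as.integer(p_hat > 0.5)

# Step 4: confusion matrix and performance measures
cm <- table(actual    = factor(test$purchase, levels = 0:1),
            predicted = factor(pred, levels = 0:1))
acc  <- sum(diag(cm)) / sum(cm)
sens <- cm["1", "1"] / sum(cm["1", ])
spec <- cm["0", "0"] / sum(cm["0", ])

print(cm)
print(round(c(accuracy = acc, sensitivity = sens, specificity = spec), 3))
```

Repeating steps 2 to 4 for each candidate model and comparing the resulting measures on the same test set gives the basis for the final choice.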