Source file ⇒ The_Analytics_Edge_edX_MIT062015_2.Rmd
These are my notes for the lectures of the The_Analytics_Edge_edX_MIT15.0 71x_June2015“ by Professor Dimitris Bertsimas. The goal of these notes is to provide the reproducible R code for all the lectures.
I am also pursuing the course offered by Coursera :Applied Logistic Regression by Professor Stanley Lemeshow.
A good list of resources about using R for Logistic Regression are given below:
QUICK QUESTION
Q:Which of the following dependent variables are categorical? (Select all that apply.)
Ans:Deciding whether to buy, sell, or hold a stock,The winner of an election with two candidates ,The day of the week with the highest revenue, Whether or not revenue will exceed $50,000
EXPLANATION:The weekly revenue of a company is not categorical, since it has a large number of possible values, on a continuous range. The number of daily car thefts in New York City is also not categorical because the number of car thefts could range from 0 to hundreds.On the other hand, the other options each have a limiited number of possible outcomes.
Q:Which of the following dependent variables are binary? (Select all that apply.)
Ans:The winner of an election with two candidates, Whether or not revenue will exceed $50,000
EXPLANATION:The only variables with two possible outcomes are the winner of an election with two candidates, and whether or not revenue will exceed $50,000.
QUICK QUESTION
Q:Suppose the coefficients of a logistic regression model with two independent variables are as follows:
\(\beta_0=-1.5\), \(\beta_1=3\), \(\beta_2=-0.5\)
And we have an observation with the following values for the independent variables:
\(x_1=1\), \(x_2=5\)
So the logit(i.e.Log(odds)) equation becomes:
\(log(Odds)=\beta_0 + \beta_1 x_1 + \beta_2 x_2\)
What is the value of the Logit for this observation? Recall that the Logit is log(Odds).
#substituting the values of betas and x's in the logit eqn, we get
logit<- -1.5 + 3*1 + -0.5*5
logit
## [1] -1
Ans:-1
EXPLANATION:The Logit is just log(Odds), and looks like the linear regression equation. So the Logit is -1.5 + 31 - 0.55 = -1.
Q:What is the value of the Odds for this observation? Note that you can compute e^x, for some number x, in your R console by typing exp(x). The function exp() computes the exponential of its argument.4
we know that \(Odds=e^{(\beta_0 + \beta_1 x_1 + \beta_2 x_2)}\)
Odds<-exp(logit)
Odds
## [1] 0.3678794
Ans:0.3678794
EXPLANATION:Using the value of the Logit from the previous question, we have that Odds = e^(-1) = 0.3678794.
Q:What is the value of P(y = 1) for this observation?
We know that Odds Ratio is \(Odds=\frac{P(y=1)}{P(y=0)}\)
and the Logistic response Function is:
\[P(y=1)=\frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2)}}\] \[=\frac{1}{1 + e^{-logit}}\]
#From the logistic function we get
1/(1 + exp(-(-1)))
## [1] 0.2689414
Ans:0.2689414
EXPLANATION:Using the Logistic Response Function, we can compute that P(y = 1) = 1/(1 + e^(-Logit)) = 1/(1 + e^(1)) = 0.2689414.
The variables in the dataset quality.csv are as follows:
MemberID numbers the patients from 1 to 131, and is just an identifying number.
InpatientDays is the number of inpatient visits, or number of days the person spent in the hospital.
ERVisits is the number of times the patient visited the emergency room.
OfficeVisits is the number of times the patient visited any doctor’s office.
Narcotics is the number of prescriptions the patient had for narcotics.
DaysSinceLastERVisit is the number of days between the patient’s last emergency room visit and the end of the study period (set to the length of the study period if they never visited the ER).
Pain is the number of visits for which the patient complained about pain.
TotalVisits is the total number of times the patient visited any healthcare provider.
ProviderCount is the number of providers that served the patient.
MedicalClaims is the number of days on which the patient had a medical claim.
ClaimLines is the total number of medical claims.
StartedOnCombination is whether or not the patient was started on a combination of drugs to treat their diabetes (TRUE or FALSE).
AcuteDrugGapSmall is the fraction of acute drugs that were refilled quickly after the prescription ran out.
PoorCare is the outcome or dependent variable, and is equal to 1 if the patient had poor care, and equal to 0 if the patient had good care.
# Unit 3, Modeling the Expert
# Video 4
# Read in dataset
quality = read.csv("quality.csv")
# Look at structure
str(quality)
## 'data.frame': 131 obs. of 14 variables:
## $ MemberID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ InpatientDays : int 0 1 0 0 8 2 16 2 2 4 ...
## $ ERVisits : int 0 1 0 1 2 0 1 0 1 2 ...
## $ OfficeVisits : int 18 6 5 19 19 9 8 8 4 0 ...
## $ Narcotics : int 1 1 3 0 3 2 1 0 3 2 ...
## $ DaysSinceLastERVisit: num 731 411 731 158 449 ...
## $ Pain : int 10 0 10 34 10 6 4 5 5 2 ...
## $ TotalVisits : int 18 8 5 20 29 11 25 10 7 6 ...
## $ ProviderCount : int 21 27 16 14 24 40 19 11 28 21 ...
## $ MedicalClaims : int 93 19 27 59 51 53 40 28 20 17 ...
## $ ClaimLines : int 222 115 148 242 204 156 261 87 98 66 ...
## $ StartedOnCombination: logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ AcuteDrugGapSmall : int 0 1 5 0 0 4 0 0 0 0 ...
## $ PoorCare : int 0 0 0 0 0 1 0 0 1 0 ...
# Table outcome for DV
table(quality$PoorCare)
##
## 0 1
## 98 33
# Baseline accuracy (in a classification problem, the standard baseline method is to just predict the most frequent outcome for all observations)
98/131
## [1] 0.7480916
#This is the baseline model which we will try to beat with our logistic model
# Install and load caTools package
#install.packages("caTools")
library(caTools)
##
## Attaching package: 'caTools'
## The following objects are masked from 'package:base64enc':
##
## base64decode, base64encode
# Randomly split data
set.seed(88)
split = sample.split(quality$PoorCare, SplitRatio = 0.75) #the DV is split into 75% which we get from baseline accuracy above
split # TRUE means we should put that obs in Training set and FALSE means that obs should be put in testing set
## [1] TRUE TRUE TRUE TRUE FALSE TRUE FALSE TRUE FALSE FALSE TRUE
## [12] FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [23] TRUE TRUE FALSE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE
## [34] TRUE TRUE TRUE FALSE TRUE TRUE TRUE FALSE FALSE TRUE TRUE
## [45] FALSE TRUE FALSE TRUE FALSE TRUE TRUE FALSE FALSE TRUE TRUE
## [56] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE
## [67] TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE
## [78] TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE
## [89] TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE
## [100] TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE FALSE TRUE FALSE
## [111] FALSE TRUE TRUE FALSE TRUE TRUE TRUE FALSE TRUE TRUE FALSE
## [122] TRUE TRUE FALSE TRUE TRUE FALSE TRUE TRUE TRUE FALSE
# Create training and testing sets
qualityTrain = subset(quality, split == TRUE)
qualityTest = subset(quality, split == FALSE)
nrow(qualityTrain)
## [1] 99
nrow(qualityTest)
## [1] 32
# Logistic Regression Model
QualityLog = glm(PoorCare ~ OfficeVisits + Narcotics, data=qualityTrain, family=binomial)
summary(QualityLog)
##
## Call:
## glm(formula = PoorCare ~ OfficeVisits + Narcotics, family = binomial,
## data = qualityTrain)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.06303 -0.63155 -0.50503 -0.09689 2.16686
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.64613 0.52357 -5.054 4.33e-07 ***
## OfficeVisits 0.08212 0.03055 2.688 0.00718 **
## Narcotics 0.07630 0.03205 2.381 0.01728 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 111.888 on 98 degrees of freedom
## Residual deviance: 89.127 on 96 degrees of freedom
## AIC: 95.127
##
## Number of Fisher Scoring iterations: 4
# Make predictions on training set
predictTrain = predict(QualityLog, type="response") #type="response" gives us probabilities
# Analyze predictions
summary(predictTrain) # since we are dealing with probabilities, all the nos will be between 0 & 1
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.06623 0.11910 0.15970 0.25250 0.26760 0.98460
#lets see if we are predicting higher probabilities for the actual poorcare cases as we expect
tapply(predictTrain, qualityTrain$PoorCare, mean) #avg prediction for each of the TRUE outcomes
## 0 1
## 0.1894512 0.4392246
QUICK QUESTION
Q:In R, create a logistic regression model to predict “PoorCare” using the independent variables “StartedOnCombination” and “ProviderCount”. Use the training set we created in the previous video to build the model.
# Logistic Regression Model on the training set
QualityLog1 = glm(PoorCare ~ StartedOnCombination + ProviderCount, data=qualityTrain, family=binomial)
summary(QualityLog1)
##
## Call:
## glm(formula = PoorCare ~ StartedOnCombination + ProviderCount,
## family = binomial, data = qualityTrain)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.61826 -0.72782 -0.64555 -0.08407 1.94662
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.00097 0.55097 -3.632 0.000282 ***
## StartedOnCombinationTRUE 1.95230 1.22342 1.596 0.110541
## ProviderCount 0.03366 0.01983 1.697 0.089706 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 111.89 on 98 degrees of freedom
## Residual deviance: 104.37 on 96 degrees of freedom
## AIC: 110.37
##
## Number of Fisher Scoring iterations: 4
#Ans:1.95230
QUICK QUESTION
Q:StartedOnCombination is a binary variable, which equals 1 if the patient is started on a combination of drugs to treat their diabetes, and equals 0 if the patient is not started on a combination of drugs. All else being equal, does this model imply that starting a patient on a combination of drugs is indicative of poor care, or good care?
Ans:Poor Care
EXPLANATION:The coefficient value is positive, meaning that positive values of the variable make the outcome of 1 more likely. This corresponds to Poor Care.
# Video 5
# Confusion matrix for threshold of 0.5
table(qualityTrain$PoorCare, predictTrain > 0.5) #first arg is for the rows(true coutcomes) and second arg is for the col labelling(predicted outcomes)
##
## FALSE TRUE
## 0 70 4
## 1 15 10
# Sensitivity and specificity
10/25
## [1] 0.4
70/74
## [1] 0.9459459
# Confusion matrix for increased threshold of 0.7
table(qualityTrain$PoorCare, predictTrain > 0.7)
##
## FALSE TRUE
## 0 73 1
## 1 17 8
# Sensitivity and specificity
8/25 #Sensitivity=TP/total no of positive cases
## [1] 0.32
73/74 #specificity=TN/total no of negative cases
## [1] 0.9864865
# Confusion matrix for aa decreased threshold of 0.2
table(qualityTrain$PoorCare, predictTrain > 0.2)
##
## FALSE TRUE
## 0 54 20
## 1 9 16
# Sensitivity and specificity
16/25
## [1] 0.64
54/74
## [1] 0.7297297
#from the above we can see that higher threshold value will have LOWER sensitivity & a HIGHER specificity and lower threshold value will have HIGHER sensitivity & a LOWER specificity
#So the big question is how to select the threshold value .This is done in the next section
CONFUSION MATRIX OR CLASSIFICATION MATRIX
Confusion_matrix
#lets recreate the confusion matrix give in the Qs
#Confusion Matrix #1:
conf1<-matrix(c(15,10,5,20),ncol=2,byrow=TRUE)
colnames(conf1)<-c("Predicted = 0","Predicted = 1")
rownames(conf1)<-c("Actual = 0","Actual = 1")
conf1<-as.table(conf1)
conf1
## Predicted = 0 Predicted = 1
## Actual = 0 15 10
## Actual = 1 5 20
#Confusion Matrix #2:
conf2<-matrix(c(20,5,10,15),ncol=2,byrow=TRUE)
colnames(conf2)<-c("Predicted = 0","Predicted = 1")
rownames(conf2)<-c("Actual = 0","Actual = 1")
conf1<-as.table(conf2)
conf2
## Predicted = 0 Predicted = 1
## Actual = 0 20 5
## Actual = 1 10 15
#QUICK QUESTION
#What is the sensitivity of Confusion Matrix #1?
20/(5+20)
## [1] 0.8
#Ans:0.8
#EXPLANATION:The sensitivity of a confusion matrix is the true positives, divided by the true positives plus the false negatives. In this case, it is 20/(20+5) = 0.8
#What is the specificity of Confusion Matrix #1?
15/(15+10)
## [1] 0.6
#Ans:0.6
#EXPLANATION:The specificity of a confusion matrix is the true negatives, divided by the true negatives plus the false positives. In this case, it is 15/(15+10) = 0.6
#########################
#QUICK QUESTION
#To go from Confusion Matrix #1 to Confusion Matrix #2, did we increase or decrease the threshold value?
#lets calc sensitivity for Confusion Matrix #2
15/(10+15)
## [1] 0.6
#Ans:We increased the threshold value.
#EXPLANATION:We predict the outcome 1 less often in Confusion Matrix #2. This means we must have increased the threshold.(Also we see that sensitivity has reduced from the 1st matrix to second matrix implying that the threshold value is higher for 2nd confusion matrix)
# Install and load ROCR package
#install.packages("ROCR")
library(ROCR)
## Loading required package: gplots
##
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
##
## lowess
# Prediction function
ROCRpred = prediction(predictTrain, qualityTrain$PoorCare)
# Performance function
ROCRperf = performance(ROCRpred, "tpr", "fpr") #This defines what we want to plot on x & y axis
# Plot ROC curve
plot(ROCRperf)
# Add colors
plot(ROCRperf, colorize=TRUE)
# Add threshold labels
plot(ROCRperf, colorize=TRUE, print.cutoffs.at=seq(0,1,by=0.1), text.adj=c(-0.2,1.7))
This question will ask about the following ROC curve:
ROC-Threshold Curve
QUICK QUESTION
Q:Given this ROC curve, which threshold would you pick if you wanted to correctly identify a small group of patients who are receiving the worst care with high confidence?
Ans:t=0.7
EXPLANATION:The threshold 0.7 is best to identify a small group of patients who are receiving the worst care with high confidence, since at this threshold we make very few false positive mistakes, and identify about 35% of the true positives. The threshold t = 0.8 is not a good choice, since it makes about the same number of false positives, but only identifies 10% of the true positives. The thresholds 0.2 and 0.3 both identify more of the true positives, but they make more false positive mistakes, so our confidence decreases.
Q:Which threshold would you pick if you wanted to correctly identify half of the patients receiving poor care, while making as few errors as possible?
Ans:t=0.3
EXPLANATION:The threshold 0.3 is the best choice in this scenerio. The threshold 0.2 also identifies over half of the patients receiving poor care, but it makes many more false positive mistakes. The thresholds 0.7 and 0.8 don’t identify at least half of the patients receiving poor care.
QUICK QUESTION
IMPORTANT NOTE: This question uses the original model with the independent variables “OfficeVisits” and “Narcotics”. Be sure to use this model, instead of the model you built in Quick Question 4.
# Logistic Regression Model
QualityLog = glm(PoorCare ~ OfficeVisits + Narcotics, data=qualityTrain, family=binomial)
summary(QualityLog)
##
## Call:
## glm(formula = PoorCare ~ OfficeVisits + Narcotics, family = binomial,
## data = qualityTrain)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.06303 -0.63155 -0.50503 -0.09689 2.16686
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.64613 0.52357 -5.054 4.33e-07 ***
## OfficeVisits 0.08212 0.03055 2.688 0.00718 **
## Narcotics 0.07630 0.03205 2.381 0.01728 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 111.888 on 98 degrees of freedom
## Residual deviance: 89.127 on 96 degrees of freedom
## AIC: 95.127
##
## Number of Fisher Scoring iterations: 4
#Compute the test set predictions in R by running the command:
predictTest = predict(QualityLog, type="response", newdata=qualityTest)
#You can compute the test set AUC by running the following two commands in R:
ROCRpredTest = prediction(predictTest, qualityTest$PoorCare)
auc = as.numeric(performance(ROCRpredTest, "auc")@y.values)
auc
## [1] 0.7994792
#What is the AUC of this model on the test set?
#Ans:0.7994792
#The AUC of a model has the following nice interpretation: given a random patient from the dataset who actually received poor care, and a random patient from the dataset who actually received good care, the AUC is the perecentage of time that our model will classify which is which correctly.
QUICK QUESTION
Q:Why was the city of Framingham, Massachusetts selected for this study? Select all that apply.
Ans:It had an appropriate size, It had a stable population to observe over time, The doctors and residents in Framingham were willing to participate
EXPLANATION:The reasons for Framingham being selected for this study are listed on Slide 4 of the previous video: it had an appropriate size, it had a stable population, and the doctors and residents in the town were willing to participate. However, the city did not represent all types of people in the United States (we’ll see later in the lecture how to extend the model to different populations) and there were not an abnormally large number of people with heart disease.
QUICK QUESTION
Q:Are “risk factors” the independent variables or the dependent variables in our model?
Ans:Independent Variables
EXPLANATION:Risk factors are the independent variables in our model, and are what we will use to predict the dependent variable.
Q:In many situations, a dataset is handed to you and you are tasked with discovering which variables are important. But for the Framingham Heart Study, the researchers had to collect data from patients. In a situation like this one, where data needs to be collected by the researchers, should the potential risk factors be defined before or after the data is collected?
Ans:Before
EXPLANATION:The researchers should first hypothesize potential risk factors, and then collect data corresponding to those risk factors. Of course, they could always define more risk factors later and collect more data, but this data would take longer to collect.
# Unit 3, The Framingham Heart Study
# Video 3
# Read in the dataset
framingham = read.csv("framingham.csv")
# Look at structure
str(framingham)
## 'data.frame': 4240 obs. of 16 variables:
## $ male : int 1 0 1 0 0 0 0 0 1 1 ...
## $ age : int 39 46 48 61 46 43 63 45 52 43 ...
## $ education : int 4 2 1 3 3 2 1 2 1 1 ...
## $ currentSmoker : int 0 0 1 1 1 0 0 1 0 1 ...
## $ cigsPerDay : int 0 0 20 30 23 0 0 20 0 30 ...
## $ BPMeds : int 0 0 0 0 0 0 0 0 0 0 ...
## $ prevalentStroke: int 0 0 0 0 0 0 0 0 0 0 ...
## $ prevalentHyp : int 0 0 0 1 0 1 0 0 1 1 ...
## $ diabetes : int 0 0 0 0 0 0 0 0 0 0 ...
## $ totChol : int 195 250 245 225 285 228 205 313 260 225 ...
## $ sysBP : num 106 121 128 150 130 ...
## $ diaBP : num 70 81 80 95 84 110 71 71 89 107 ...
## $ BMI : num 27 28.7 25.3 28.6 23.1 ...
## $ heartRate : int 80 95 75 65 85 77 60 79 76 93 ...
## $ glucose : int 77 76 70 103 85 99 85 78 79 88 ...
## $ TenYearCHD : int 0 0 0 1 0 0 1 0 0 0 ...
# Load the library caTools
library(caTools)
# Randomly split the data into training and testing sets
set.seed(1000)
split = sample.split(framingham$TenYearCHD, SplitRatio = 0.65)
# Split up the data using subset
train = subset(framingham, split==TRUE)
test = subset(framingham, split==FALSE)
# Logistic Regression Model
framinghamLog = glm(TenYearCHD ~ ., data = train, family=binomial)
summary(framinghamLog)
##
## Call:
## glm(formula = TenYearCHD ~ ., family = binomial, data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8487 -0.6007 -0.4257 -0.2842 2.8369
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.886574 0.890729 -8.854 < 2e-16 ***
## male 0.528457 0.135443 3.902 9.55e-05 ***
## age 0.062055 0.008343 7.438 1.02e-13 ***
## education -0.058923 0.062430 -0.944 0.34525
## currentSmoker 0.093240 0.194008 0.481 0.63080
## cigsPerDay 0.015008 0.007826 1.918 0.05514 .
## BPMeds 0.311221 0.287408 1.083 0.27887
## prevalentStroke 1.165794 0.571215 2.041 0.04126 *
## prevalentHyp 0.315818 0.171765 1.839 0.06596 .
## diabetes -0.421494 0.407990 -1.033 0.30156
## totChol 0.003835 0.001377 2.786 0.00533 **
## sysBP 0.011344 0.004566 2.485 0.01297 *
## diaBP -0.004740 0.008001 -0.592 0.55353
## BMI 0.010723 0.016157 0.664 0.50689
## heartRate -0.008099 0.005313 -1.524 0.12739
## glucose 0.008935 0.002836 3.150 0.00163 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2020.7 on 2384 degrees of freedom
## Residual deviance: 1792.3 on 2369 degrees of freedom
## (371 observations deleted due to missingness)
## AIC: 1824.3
##
## Number of Fisher Scoring iterations: 5
# Predictions on the test set
predictTest = predict(framinghamLog, type="response", newdata=test)
# Confusion matrix with threshold of 0.5
table(test$TenYearCHD, predictTest > 0.5)
##
## FALSE TRUE
## 0 1069 6
## 1 187 11
# Overall Accuracy of the model
(1069+11)/(1069+6+187+11)
## [1] 0.8483896
#this needs to be compared with Baseline model
# Baseline accuracy/model
(1069+6)/(1069+6+187+11)
## [1] 0.8444619
#Our model barely beats the baseline model.Is our model worthy?
# Test set AUC
library(ROCR)
ROCRpred = prediction(predictTest, test$TenYearCHD)
as.numeric(performance(ROCRpred, "auc")@y.values)
## [1] 0.7421095
#Model can differentiate low-risk from high risk patients as out of sample i.e. test was AUC=0.74
QUICK QUESTION
In the previous video, we computed the following confusion matrix for our logistic regression model on our test set with a threshold of 0.5:
FALSE TRUE 0 1069 6 1 187 11
Using this confusion matrix, answer the following questions.
Q:What is the sensitivity of our logistic regression model on the test set, using a threshold of 0.5?
11/(187+11)
## [1] 0.05555556
Ans:0.05555556
Q:What is the specificity of our logistic regression model on the test set, using a threshold of 0.5?
1069/(1069+6)
## [1] 0.9944186
Ans:0.9944186
EXPLANATION:Using this confusion matrix, we can compute that the sensitivity is 11/(11+187) and the specificity is 1069/(1069+6).
QUICK QUESTION
Q:For which of the following models should external validation be used? Consider both the population used to train the model, and the population that the model will be used on. (Select all that apply.)
Ans:
A model to predict obesity risk. Data from a random sample of California residents was used to build the model, and we want to use the model to predict the obesity risk of all United States residents.
A model to predict the probability of a runner winning a marathon. Data from all runners in the Boston Marathon was used to build the model, and we want use the model to predict the probability of winning for all people who run marathons.
EXPLANATION:In the first and third models, we are using a special sub-population to build the model. While we can use the model for that sub-population, we should use external validation to test the model on other populations. The second model uses data from a special sub-population, but the model is only intended for that sub-population, so external validation is not necessary.
VIDEO 5: INTERVENTIONS
QUICK QUESTION
Q:In Video 3, we built a logistic regression model and found that the following variables were significant (or almost significant) for predicting ten year risk of CHD: male, age, number of cigarettes per day, whether or not the patient previously had a stroke, whether or not the patient is currently hypertensive, total cholesterol level, systolic blood pressure, and blood glucose level. Which one of the following variables would be the most dramatically affected by a behavioral intervention? HINT: Think about how much control the patient has over each of the variables.
Ans:Number of Cigarettes per day
EXPLANATION:The number of cigarettes smoked per day would be the most dramatically affected by a behavioral intervention. This is a variable that the patient has the ability to control the most.
VIDEO 2: DEALING WITH MISSING DATA (R script reproduced here)
# Unit 3, Recitation
# Video 2
# Read in data
polling = read.csv("PollingData.csv")
str(polling)
## 'data.frame': 145 obs. of 7 variables:
## $ State : Factor w/ 50 levels "Alabama","Alaska",..: 1 1 2 2 3 3 3 4 4 4 ...
## $ Year : int 2004 2008 2004 2008 2004 2008 2012 2004 2008 2012 ...
## $ Rasmussen : int 11 21 NA 16 5 5 8 7 10 NA ...
## $ SurveyUSA : int 18 25 NA NA 15 NA NA 5 NA NA ...
## $ DiffCount : int 5 5 1 6 8 9 4 8 5 2 ...
## $ PropR : num 1 1 1 1 1 ...
## $ Republican: int 1 1 1 1 1 1 1 1 1 1 ...
table(polling$Year)
##
## 2004 2008 2012
## 50 50 45
summary(polling) # to see the no NAs
## State Year Rasmussen SurveyUSA
## Arizona : 3 Min. :2004 Min. :-41.0000 Min. :-33.0000
## Arkansas : 3 1st Qu.:2004 1st Qu.: -8.0000 1st Qu.:-11.7500
## California : 3 Median :2008 Median : 1.0000 Median : -2.0000
## Colorado : 3 Mean :2008 Mean : 0.0404 Mean : -0.8243
## Connecticut: 3 3rd Qu.:2012 3rd Qu.: 8.5000 3rd Qu.: 8.0000
## Florida : 3 Max. :2012 Max. : 39.0000 Max. : 30.0000
## (Other) :127 NA's :46 NA's :71
## DiffCount PropR Republican
## Min. :-19.000 Min. :0.0000 Min. :0.0000
## 1st Qu.: -6.000 1st Qu.:0.0000 1st Qu.:0.0000
## Median : 1.000 Median :0.6250 Median :1.0000
## Mean : -1.269 Mean :0.5259 Mean :0.5103
## 3rd Qu.: 4.000 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. : 11.000 Max. :1.0000 Max. :1.0000
##
# Install and load mice package
#install.packages("mice") #Multiple Imputation by Chained Equations(mice) package
library(mice)
## Loading required package: Rcpp
## mice 2.25 2015-11-09
##
## Attaching package: 'mice'
## The following object is masked from 'package:tidyr':
##
## complete
# Multiple imputation
simple = polling[c("Rasmussen", "SurveyUSA", "PropR", "DiffCount")] #creating a smaller dataset with less variables
summary(simple)
## Rasmussen SurveyUSA PropR DiffCount
## Min. :-41.0000 Min. :-33.0000 Min. :0.0000 Min. :-19.000
## 1st Qu.: -8.0000 1st Qu.:-11.7500 1st Qu.:0.0000 1st Qu.: -6.000
## Median : 1.0000 Median : -2.0000 Median :0.6250 Median : 1.000
## Mean : 0.0404 Mean : -0.8243 Mean :0.5259 Mean : -1.269
## 3rd Qu.: 8.5000 3rd Qu.: 8.0000 3rd Qu.:1.0000 3rd Qu.: 4.000
## Max. : 39.0000 Max. : 30.0000 Max. :1.0000 Max. : 11.000
## NA's :46 NA's :71
set.seed(144)
imputed = complete(mice(simple)) #dataset with the imputed values for the NAs
##
## iter imp variable
## 1 1 Rasmussen SurveyUSA
## 1 2 Rasmussen SurveyUSA
## 1 3 Rasmussen SurveyUSA
## 1 4 Rasmussen SurveyUSA
## 1 5 Rasmussen SurveyUSA
## 2 1 Rasmussen SurveyUSA
## 2 2 Rasmussen SurveyUSA
## 2 3 Rasmussen SurveyUSA
## 2 4 Rasmussen SurveyUSA
## 2 5 Rasmussen SurveyUSA
## 3 1 Rasmussen SurveyUSA
## 3 2 Rasmussen SurveyUSA
## 3 3 Rasmussen SurveyUSA
## 3 4 Rasmussen SurveyUSA
## 3 5 Rasmussen SurveyUSA
## 4 1 Rasmussen SurveyUSA
## 4 2 Rasmussen SurveyUSA
## 4 3 Rasmussen SurveyUSA
## 4 4 Rasmussen SurveyUSA
## 4 5 Rasmussen SurveyUSA
## 5 1 Rasmussen SurveyUSA
## 5 2 Rasmussen SurveyUSA
## 5 3 Rasmussen SurveyUSA
## 5 4 Rasmussen SurveyUSA
## 5 5 Rasmussen SurveyUSA
summary(imputed)
## Rasmussen SurveyUSA PropR DiffCount
## Min. :-41.000 Min. :-33.000 Min. :0.0000 Min. :-19.000
## 1st Qu.: -8.000 1st Qu.:-11.000 1st Qu.:0.0000 1st Qu.: -6.000
## Median : 3.000 Median : 1.000 Median :0.6250 Median : 1.000
## Mean : 1.731 Mean : 1.517 Mean :0.5259 Mean : -1.269
## 3rd Qu.: 11.000 3rd Qu.: 18.000 3rd Qu.:1.0000 3rd Qu.: 4.000
## Max. : 39.000 Max. : 30.000 Max. :1.0000 Max. : 11.000
#last step is to copy back the var with imputed values back into the main dataset
polling$Rasmussen = imputed$Rasmussen
polling$SurveyUSA = imputed$SurveyUSA
summary(polling)
## State Year Rasmussen SurveyUSA
## Arizona : 3 Min. :2004 Min. :-41.000 Min. :-33.000
## Arkansas : 3 1st Qu.:2004 1st Qu.: -8.000 1st Qu.:-11.000
## California : 3 Median :2008 Median : 3.000 Median : 1.000
## Colorado : 3 Mean :2008 Mean : 1.731 Mean : 1.517
## Connecticut: 3 3rd Qu.:2012 3rd Qu.: 11.000 3rd Qu.: 18.000
## Florida : 3 Max. :2012 Max. : 39.000 Max. : 30.000
## (Other) :127
## DiffCount PropR Republican
## Min. :-19.000 Min. :0.0000 Min. :0.0000
## 1st Qu.: -6.000 1st Qu.:0.0000 1st Qu.:0.0000
## Median : 1.000 Median :0.6250 Median :1.0000
## Mean : -1.269 Mean :0.5259 Mean :0.5103
## 3rd Qu.: 4.000 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. : 11.000 Max. :1.0000 Max. :1.0000
##
# Video 3
#Lets start building the models.First we split the data into training & test sets
# Subset data into training set and test set
Train = subset(polling, Year == 2004 | Year == 2008)
Test = subset(polling, Year == 2012)
# Smart Baseline
table(Train$Republican) #lets see the breakdown of the dependent/outcome var
##
## 0 1
## 47 53
#Baseline model is that model which predicts the most common outcome.Here the baseline model predicts 53%.But the model here is a weak model considering that even where Democrats are ahead by 15-20%, it still predicts Republican outcome.Hence we want a smarter/better baseline model as compared to this naive model
#sign() returns a 1 when a postive no is passed into it and returns a -1 if a negative no is passed and finally returns a 0 if 0 is passed into it.
sign(20)
## [1] 1
sign(-10)
## [1] -1
sign(0)
## [1] 0
table(sign(Train$Rasmussen))
##
## -1 0 1
## 42 3 55
table(Train$Republican, sign(Train$Rasmussen)) # this smart base model just makes 4 mistakes out of 100 obs
##
## -1 0 1
## 0 42 2 3
## 1 0 1 52
# Video 4
#Before building the model it is reasonable to suspect that multicollinearity exists among the Independent var as all of them are predicting the same thing i.e. how is the Republican candidate is performing in each state
# Multicollinearity
#cor(Train) This gives error as we hav a factor var (Non-numeric)
str(Train)
## 'data.frame': 100 obs. of 7 variables:
## $ State : Factor w/ 50 levels "Alabama","Alaska",..: 1 1 2 2 3 3 4 4 5 5 ...
## $ Year : int 2004 2008 2004 2008 2004 2008 2004 2008 2004 2008 ...
## $ Rasmussen : int 11 21 16 16 5 5 7 10 -11 -27 ...
## $ SurveyUSA : int 18 25 21 21 15 8 5 9 -11 -24 ...
## $ DiffCount : int 5 5 1 6 8 9 8 5 -8 -5 ...
## $ PropR : num 1 1 1 1 1 1 1 1 0 0 ...
## $ Republican: int 1 1 1 1 1 1 1 1 0 0 ...
cor(Train[c("Rasmussen", "SurveyUSA", "PropR", "DiffCount", "Republican")]) # so we select only numeric Independent var and also the dependent var.
## Rasmussen SurveyUSA PropR DiffCount Republican
## Rasmussen 1.0000000 0.9194508 0.8404803 0.5124098 0.8021191
## SurveyUSA 0.9194508 1.0000000 0.8756581 0.5541816 0.8205806
## PropR 0.8404803 0.8756581 1.0000000 0.8273785 0.9484204
## DiffCount 0.5124098 0.5541816 0.8273785 1.0000000 0.8092777
## Republican 0.8021191 0.8205806 0.9484204 0.8092777 1.0000000
#We are seeing large correlation values which are a concern
# Logistic Regression Model building with just one var as a starting point.
mod1 = glm(Republican~PropR, data=Train, family="binomial") # we hav selected that var which is the most correlated with the outcome var 'Republican'
summary(mod1)
##
## Call:
## glm(formula = Republican ~ PropR, family = "binomial", data = Train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.22880 -0.06541 0.10260 0.10260 1.37392
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.146 1.977 -3.108 0.001882 **
## PropR 11.390 3.153 3.613 0.000303 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 138.269 on 99 degrees of freedom
## Residual deviance: 15.772 on 98 degrees of freedom
## AIC: 19.772
##
## Number of Fisher Scoring iterations: 8
# Training set predictions using the above model
pred1 = predict(mod1, type="response")
table(Train$Republican, pred1 >= 0.5) # this model makes 4 mistakes similar to our smart base model
##
## FALSE TRUE
## 0 45 2
## 1 2 51
#Lets now see whether we can improve upon this model by adding another var
# Two-variable model
mod2 = glm(Republican~SurveyUSA+DiffCount, data=Train, family="binomial") # here we select 2 I.Vs var which are the least correlated among themselves as together they will have more predictive power
pred2 = predict(mod2, type="response") # this is a training set prediction
table(Train$Republican, pred2 >= 0.5)
##
## FALSE TRUE
## 0 45 2
## 1 1 52
# This model shows that we made 3 mistakes i.e. one less that model before but not a big improvemnt over the bseline model
summary(mod2)
##
## Call:
## glm(formula = Republican ~ SurveyUSA + DiffCount, family = "binomial",
## data = Train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.01196 -0.00698 0.01005 0.05074 1.54975
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.1405 1.2456 -0.916 0.3599
## SurveyUSA 0.2976 0.1949 1.527 0.1267
## DiffCount 0.7673 0.4188 1.832 0.0669 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 138.269 on 99 degrees of freedom
## Residual deviance: 12.439 on 97 degrees of freedom
## AIC: 18.439
##
## Number of Fisher Scoring iterations: 10
#postives of this model
#1.this model has a lower AIC value indicating a better/stronger model than the previous one.
#2. coeff are +ve which is intuitively correct
#-ves
#I.V.coeff are NON significant
#We will adopt the Two var model
# Video 5
#Time to evaluate the model on the testing set
# Smart baseline accuracy comparision with Test outcome values
table(Test$Republican, sign(Test$Rasmussen)) # here the model makes 4 mistake and 2 inconclusive results
##
## -1 0 1
## 0 18 2 4
## 1 0 0 21
# Test set predictions
TestPrediction = predict(mod2, newdata=Test, type="response")
table(Test$Republican, TestPrediction >= 0.5) # we make just one mistake
##
## FALSE TRUE
## 0 23 1
## 1 0 21
# Analyze mistake by taking out that particular obs from the test dataset
subset(Test, TestPrediction >= 0.5 & Republican == 0)
## State Year Rasmussen SurveyUSA DiffCount PropR Republican
## 24 Florida 2012 2 0 6 0.6666667 0
# Overall these predictions are better that what the smart baseline model is predicting
The music industry has a well-developed market with a global annual revenue around $15 billion. The recording industry is highly competitive and is dominated by three big production companies which make up nearly 82% of the total annual album sales.
Artists are at the core of the music industry and record labels provide them with the necessary resources to sell their music on a large scale. A record label incurs numerous costs (studio recording, marketing, distribution, and touring) in exchange for a percentage of the profits from album sales, singles and concert tickets.
Unfortunately, the success of an artist’s release is highly uncertain: a single may be extremely popular, resulting in widespread radio play and digital downloads, while another single may turn out quite unpopular, and therefore unprofitable.
Knowing the competitive nature of the recording industry, record labels face the fundamental decision problem of which musical releases to support to maximize their financial success.
How can we use analytics to predict the popularity of a song? In this assignment, we challenge ourselves to predict whether a song will reach a spot in the Top 10 of the Billboard Hot 100 Chart.
Taking an analytics approach, we aim to use information about a song’s properties to predict its popularity. The dataset songs.csv consists of all songs which made it to the Top 10 of the Billboard Hot 100 Chart from 1990-2010 plus a sample of additional songs that didn’t make the Top 10. This data comes from three sources: Wikipedia, Billboard.com, and EchoNest.
The variables included in the dataset either describe the artist or the song, or they are associated with the following song attributes: time signature, loudness, key, pitch, tempo, and timbre.
Here’s a detailed description of the variables:
year = the year the song was released
songtitle = the title of the song
artistname = the name of the artist of the song
songID and artistID = identifying variables for the song and artist
timesignature and timesignature_confidence = a variable estimating the time signature of the song, and the confidence in the estimate
loudness = a continuous variable indicating the average amplitude of the audio in decibels
tempo and tempo_confidence = a variable indicating the estimated beats per minute of the song, and the confidence in the estimate
key and key_confidence = a variable with twelve levels indicating the estimated key of the song (C, C#, . . ., B), and the confidence in the estimate
energy = a variable that represents the overall acoustic energy of the song, using a mix of features such as loudness
pitch = a continuous variable that indicates the pitch of the song
timbre_0_min, timbre_0_max, timbre_1_min, timbre_1_max, . . . , timbre_11_min, and timbre_11_max = variables that indicate the minimum/maximum values over all segments for each of the twelve values in the timbre vector (resulting in 24 continuous variables)
Top10 = a binary variable indicating whether or not the song made it to the Top 10 of the Billboard Hot 100 Chart (1 if it was in the top 10, and 0 if it was not)
Use the read.csv function to load the dataset “songs.csv” into R.
#Lets import the dataset
songs<-read.csv("songs.csv")
str(songs)
## 'data.frame': 7574 obs. of 39 variables:
## $ year : int 2010 2010 2010 2010 2010 2010 2010 2010 2010 2010 ...
## $ songtitle : Factor w/ 7141 levels "'03 Bonnie & Clyde",..: 6204 5522 241 3098 47 607 254 4419 2887 6756 ...
## $ artistname : Factor w/ 1032 levels "50 Cent","98 Degrees",..: 3 3 3 3 3 3 3 3 3 12 ...
## $ songID : Factor w/ 7549 levels "SOAACNI1315CD4AC42",..: 595 5439 5252 1716 3431 1020 1831 3964 6904 2473 ...
## $ artistID : Factor w/ 1047 levels "AR00B1I1187FB433EB",..: 671 671 671 671 671 671 671 671 671 507 ...
## $ timesignature : int 3 4 4 4 4 4 4 4 4 4 ...
## $ timesignature_confidence: num 0.853 1 1 1 0.788 1 0.968 0.861 0.622 0.938 ...
## $ loudness : num -4.26 -4.05 -3.57 -3.81 -4.71 ...
## $ tempo : num 91.5 140 160.5 97.5 140.1 ...
## $ tempo_confidence : num 0.953 0.921 0.489 0.794 0.286 0.347 0.273 0.83 0.018 0.929 ...
## $ key : int 11 10 2 1 6 4 10 5 9 11 ...
## $ key_confidence : num 0.453 0.469 0.209 0.632 0.483 0.627 0.715 0.423 0.751 0.602 ...
## $ energy : num 0.967 0.985 0.99 0.939 0.988 ...
## $ pitch : num 0.024 0.025 0.026 0.013 0.063 0.038 0.026 0.033 0.027 0.004 ...
## $ timbre_0_min : num 0.002 0 0.003 0 0 ...
## $ timbre_0_max : num 57.3 57.4 57.4 57.8 56.9 ...
## $ timbre_1_min : num -6.5 -37.4 -17.2 -32.1 -223.9 ...
## $ timbre_1_max : num 171 171 171 221 171 ...
## $ timbre_2_min : num -81.7 -149.6 -72.9 -138.6 -147.2 ...
## $ timbre_2_max : num 95.1 180.3 157.9 173.4 166 ...
## $ timbre_3_min : num -285 -380.1 -204 -73.5 -128.1 ...
## $ timbre_3_max : num 259 384 251 373 389 ...
## $ timbre_4_min : num -40.4 -48.7 -66 -55.6 -43.9 ...
## $ timbre_4_max : num 73.6 100.4 152.1 119.2 99.3 ...
## $ timbre_5_min : num -104.7 -87.3 -98.7 -77.5 -96.1 ...
## $ timbre_5_max : num 183.1 42.8 141.4 141.2 38.3 ...
## $ timbre_6_min : num -88.8 -86.9 -88.9 -70.8 -110.8 ...
## $ timbre_6_max : num 73.5 75.5 66.5 64.5 72.4 ...
## $ timbre_7_min : num -71.1 -65.8 -67.4 -63.7 -55.9 ...
## $ timbre_7_max : num 82.5 106.9 80.6 96.7 110.3 ...
## $ timbre_8_min : num -52 -61.3 -59.8 -78.7 -56.5 ...
## $ timbre_8_max : num 39.1 35.4 46 41.1 37.6 ...
## $ timbre_9_min : num -35.4 -81.9 -46.3 -49.2 -48.6 ...
## $ timbre_9_max : num 71.6 74.6 59.9 95.4 67.6 ...
## $ timbre_10_min : num -126.4 -103.8 -108.3 -102.7 -52.8 ...
## $ timbre_10_max : num 18.7 121.9 33.3 46.4 22.9 ...
## $ timbre_11_min : num -44.8 -38.9 -43.7 -59.4 -50.4 ...
## $ timbre_11_max : num 26 22.5 25.7 37.1 32.8 ...
## $ Top10 : int 0 0 0 0 0 0 0 0 0 1 ...
#PROBLEM 1.1 - UNDERSTANDING THE DATA
#How many observations (songs) are from the year 2010?
table(songs$year) #or
##
## 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004
## 328 196 186 324 198 258 178 329 380 357 363 282 518 434 479
## 2005 2006 2007 2008 2009 2010
## 392 479 622 415 483 373
nrow(songs[songs$year>=2010,])
## [1] 373
#Ans:7574
############################
#PROBLEM 1.2 - UNDERSTANDING THE DATA
#How many songs does the dataset include for which the artist name is "Michael Jackson"?
nrow(songs[songs$artistname=="Michael Jackson",])
## [1] 18
#or
MichaelJackson = subset(songs, artistname == "Michael Jackson")
nrow(MichaelJackson) #or str(MichaelJackson)
## [1] 18
#Ans:18
########################################
#PROBLEM 1.3 - UNDERSTANDING THE DATA
#Which of these songs by Michael Jackson made it to the Top 10? Select all that apply.
MichaelJackson$songtitle[MichaelJackson$Top10=="1"]
## [1] You Rock My World You Are Not Alone Black or White Remember the Time
## [5] In The Closet
## 7141 Levels: '03 Bonnie & Clyde '69 ... Zumbi
MichaelJackson[c("songtitle","Top10")]
## songtitle Top10
## 4329 You Rock My World 1
## 6205 She's Out of My Life 0
## 6206 Wanna Be Startin' Somethin' 0
## 6207 You Are Not Alone 1
## 6208 Billie Jean 0
## 6209 The Way You Make Me Feel 0
## 6210 Black or White 1
## 6211 Rock with You 0
## 6212 Bad 0
## 6213 I Just Can't Stop Loving You 0
## 6214 Man in the Mirror 0
## 6215 Thriller 0
## 6216 Beat It 0
## 6217 The Girl Is Mine 0
## 6218 Remember the Time 1
## 6219 Don't Stop 'Til You Get Enough 0
## 6220 Heal the World 0
## 6915 In The Closet 1
#Ans:You Rock My World ,You Are Not Alone
#################################
#PROBLEM 1.4 - UNDERSTANDING THE DATA
#The variable corresponding to the estimated time signature (timesignature) is discrete, meaning that it only takes integer values (0, 1, 2, 3, . . . ). What are the values of this variable that occur in our dataset? Select all that apply.
table(songs$timesignature)
##
## 0 1 3 4 5 7
## 10 143 503 6787 112 19
#Ans:0,1,3,4,5,7
#Which timesignature value is the most frequent among songs in our dataset?
#Ans:4
#EXPLANATION:We can also read from the table that 6787 songs have a value of 4 for the timesignature, which is the highest count out of all of the possible timesignature values.
##################################
#PROBLEM 1.5 - UNDERSTANDING THE DATA
#Out of all of the songs in our dataset, the song with the highest tempo is one of the following songs. Which one is it?
songs$songtitle[songs$tempo==max(songs$tempo)]
## [1] Wanna Be Startin' Somethin'
## 7141 Levels: '03 Bonnie & Clyde '69 ... Zumbi
#or
which.max(songs$tempo)
## [1] 6206
songs$songtitle[6206]
## [1] Wanna Be Startin' Somethin'
## 7141 Levels: '03 Bonnie & Clyde '69 ... Zumbi
#Ans:Wanna Be Startin' Somethin'
#######################################
#PROBLEM 2.1 - CREATING OUR PREDICTION MODEL
#We wish to predict whether or not a song will make it to the Top 10. To do this, first use the subset function to split the data into a training set "SongsTrain" consisting of all the observations up to and including 2009 song releases, and a testing set "SongsTest", consisting of the 2010 song releases.
# Subset data into training set and test set
SongsTrain= subset(songs, year<=2009)
SongsTest= subset(songs, year==2010)
#How many observations (songs) are in the training set?
nrow(SongsTrain) #or str(SongsTrain)
## [1] 7201
#Ans:7201
##################################
#PROBLEM 2.2 - CREATING OUR PREDICTION MODEL
#In this problem, our outcome variable is "Top10" - we are trying to predict whether or not a song will make it to the Top 10 of the Billboard Hot 100 Chart. Since the outcome variable is binary, we will build a logistic regression model*. We'll start by using all song attributes as our independent variables, which we'll call Model 1.
#We will only use the variables in our dataset that describe the numerical attributes of the song in our logistic regression model. So we won't use the variables "year", "songtitle", "artistname", "songID" or "artistID".
#We have seen in the lecture that, to build the logistic regression model, we would normally explicitly input the formula including all the independent variables in R. However, in this case, this is a tedious amount of work since we have a large number of independent variables.There is a nice trick to avoid doing so. Let's suppose that, except for the outcome variable Top10, all other variables in the training set are inputs to Model 1. Then, we can use the formula
#SongsLog1 = glm(Top10 ~ ., data=SongsTrain, family=binomial)
#to build our model. Notice that the "."(called as period#) is used in place of enumerating all the independent variables. (Also, keep in mind that you can choose to put quotes around binomial, or leave out the quotes. R can understand this argument either way.)
#However, in our case, we want to exclude some of the variables in our dataset from being used as independent variables ("year", "songtitle", "artistname", "songID", and "artistID"). To do this, we can use the following trick.
#First define a vector of variable names called nonvars - these are the variables that we won't use in our model.
nonvars = c("year", "songtitle", "artistname", "songID", "artistID")
#To remove these variables from your training and testing sets, type the following commands in your R console:
SongsTrain = SongsTrain[ , !(names(SongsTrain) %in% nonvars) ]
SongsTest = SongsTest[ , !(names(SongsTest) %in% nonvars) ]
#Now, use the glm function to build a logistic regression model to predict Top10 using all of the other variables as the independent variables. You should use SongsTrain to build the model.
# Logistic Regression Model building
SongsLog1 = glm(Top10~., data=SongsTrain, family="binomial")
summary(SongsLog1)
##
## Call:
## glm(formula = Top10 ~ ., family = "binomial", data = SongsTrain)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9220 -0.5399 -0.3459 -0.1845 3.0770
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.470e+01 1.806e+00 8.138 4.03e-16 ***
## timesignature 1.264e-01 8.674e-02 1.457 0.145050
## timesignature_confidence 7.450e-01 1.953e-01 3.815 0.000136 ***
## loudness 2.999e-01 2.917e-02 10.282 < 2e-16 ***
## tempo 3.634e-04 1.691e-03 0.215 0.829889
## tempo_confidence 4.732e-01 1.422e-01 3.329 0.000873 ***
## key 1.588e-02 1.039e-02 1.529 0.126349
## key_confidence 3.087e-01 1.412e-01 2.187 0.028760 *
## energy -1.502e+00 3.099e-01 -4.847 1.25e-06 ***
## pitch -4.491e+01 6.835e+00 -6.570 5.02e-11 ***
## timbre_0_min 2.316e-02 4.256e-03 5.441 5.29e-08 ***
## timbre_0_max -3.310e-01 2.569e-02 -12.882 < 2e-16 ***
## timbre_1_min 5.881e-03 7.798e-04 7.542 4.64e-14 ***
## timbre_1_max -2.449e-04 7.152e-04 -0.342 0.732087
## timbre_2_min -2.127e-03 1.126e-03 -1.889 0.058843 .
## timbre_2_max 6.586e-04 9.066e-04 0.726 0.467571
## timbre_3_min 6.920e-04 5.985e-04 1.156 0.247583
## timbre_3_max -2.967e-03 5.815e-04 -5.103 3.34e-07 ***
## timbre_4_min 1.040e-02 1.985e-03 5.237 1.63e-07 ***
## timbre_4_max 6.110e-03 1.550e-03 3.942 8.10e-05 ***
## timbre_5_min -5.598e-03 1.277e-03 -4.385 1.16e-05 ***
## timbre_5_max 7.736e-05 7.935e-04 0.097 0.922337
## timbre_6_min -1.686e-02 2.264e-03 -7.445 9.66e-14 ***
## timbre_6_max 3.668e-03 2.190e-03 1.675 0.093875 .
## timbre_7_min -4.549e-03 1.781e-03 -2.554 0.010661 *
## timbre_7_max -3.774e-03 1.832e-03 -2.060 0.039408 *
## timbre_8_min 3.911e-03 2.851e-03 1.372 0.170123
## timbre_8_max 4.011e-03 3.003e-03 1.336 0.181620
## timbre_9_min 1.367e-03 2.998e-03 0.456 0.648356
## timbre_9_max 1.603e-03 2.434e-03 0.659 0.510188
## timbre_10_min 4.126e-03 1.839e-03 2.244 0.024852 *
## timbre_10_max 5.825e-03 1.769e-03 3.292 0.000995 ***
## timbre_11_min -2.625e-02 3.693e-03 -7.108 1.18e-12 ***
## timbre_11_max 1.967e-02 3.385e-03 5.811 6.21e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 6017.5 on 7200 degrees of freedom
## Residual deviance: 4759.2 on 7167 degrees of freedom
## AIC: 4827.2
##
## Number of Fisher Scoring iterations: 6
#Looking at the summary of your model, what is the value of the Akaike Information Criterion (AIC)?
#Ans:4827.2
################################
#PROBLEM 2.3 - CREATING OUR PREDICTION MODEL
#Let's now think about the variables in our dataset related to the confidence of the time signature, key and tempo (timesignature_confidence, key_confidence, and tempo_confidence). Our model seems to indicate that these confidence variables are significant (rather than the variables timesignature, key and tempo themselves). What does the model suggest?
#Ans:The higher our confidence about time signature, key and tempo, the more likely the song is to be in the Top 10
#EXPLANATION:If you look at the output summary(model), where model is the name of your logistic regression model, you can see that the coefficient estimates for the confidence variables (timesignature_confidence, key_confidence, and tempo_confidence) are positive. This means that higher confidence leads to a higher predicted probability of a Top 10 hit.
###########################################
#PROBLEM 2.4 - CREATING OUR PREDICTION MODEL
#In general, if the confidence is low for the time signature, tempo, and key, then the song is more likely to be complex. What does Model 1 suggest in terms of complexity?
#Ans:Mainstream listeners tend to prefer less complex songs
#EXPLANATION:Since the coefficient values for timesignature_confidence, tempo_confidence, and key_confidence are all positive, lower confidence leads to a lower predicted probability of a song being a hit. So mainstream listeners tend to prefer less complex songs.
######################################
#PROBLEM 2.5 - CREATING OUR PREDICTION MODEL
#Songs with heavier instrumentation tend to be louder (have higher values in the variable "loudness") and more energetic (have higher values in the variable "energy").
#By inspecting the coefficient of the variable "loudness", what does Model 1 suggest?
#Ans:Mainstream listeners prefer songs with heavy instrumentation
#By inspecting the coefficient of the variable "energy", do we draw the same conclusions as above?
#Ans:No
#EXPLANATION:The coefficient estimate for loudness is positive, meaning that mainstream listeners prefer louder songs, which are those with heavier instrumentation. However, the coefficient estimate for energy is negative, meaning that mainstream listeners prefer songs that are less energetic, which are those with light instrumentation. These coefficients lead us to different conclusions!
#######################################################
#PROBLEM 3.1 - BEWARE OF MULTICOLLINEARITY ISSUES!
#What is the correlation between the variables "loudness" and "energy" in the training set?
cor(SongsTrain$loudness,SongsTrain$energy)
## [1] 0.7399067
#Ans:0.7399067
#Given that these two variables are highly correlated, Model 1 suffers from multicollinearity. To avoid this issue, we will omit one of these two variables and rerun the logistic regression. In the rest of this problem, we'll build two variations of our original model: Model 2, in which we keep "energy" and omit "loudness", and Model 3, in which we keep "loudness" and omit "energy".
###################################################
#PROBLEM 3.2 - BEWARE OF MULTICOLLINEARITY ISSUES!
#Create Model 2, which is Model 1 without the independent variable "loudness". This can be done with the following command:
SongsLog2 = glm(Top10 ~ . - loudness, data=SongsTrain, family=binomial)
summary(SongsLog2)
##
## Call:
## glm(formula = Top10 ~ . - loudness, family = binomial, data = SongsTrain)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0983 -0.5607 -0.3602 -0.1902 3.3107
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.241e+00 7.465e-01 -3.002 0.002686 **
## timesignature 1.625e-01 8.734e-02 1.860 0.062873 .
## timesignature_confidence 6.885e-01 1.924e-01 3.578 0.000346 ***
## tempo 5.521e-04 1.665e-03 0.332 0.740226
## tempo_confidence 5.497e-01 1.407e-01 3.906 9.40e-05 ***
## key 1.740e-02 1.026e-02 1.697 0.089740 .
## key_confidence 2.954e-01 1.394e-01 2.118 0.034163 *
## energy 1.813e-01 2.608e-01 0.695 0.486991
## pitch -5.150e+01 6.857e+00 -7.511 5.87e-14 ***
## timbre_0_min 2.479e-02 4.240e-03 5.847 5.01e-09 ***
## timbre_0_max -1.007e-01 1.178e-02 -8.551 < 2e-16 ***
## timbre_1_min 7.143e-03 7.710e-04 9.265 < 2e-16 ***
## timbre_1_max -7.830e-04 7.064e-04 -1.108 0.267650
## timbre_2_min -1.579e-03 1.109e-03 -1.424 0.154531
## timbre_2_max 3.889e-04 8.964e-04 0.434 0.664427
## timbre_3_min 6.500e-04 5.949e-04 1.093 0.274524
## timbre_3_max -2.462e-03 5.674e-04 -4.339 1.43e-05 ***
## timbre_4_min 9.115e-03 1.952e-03 4.670 3.02e-06 ***
## timbre_4_max 6.306e-03 1.532e-03 4.115 3.87e-05 ***
## timbre_5_min -5.641e-03 1.255e-03 -4.495 6.95e-06 ***
## timbre_5_max 6.937e-04 7.807e-04 0.889 0.374256
## timbre_6_min -1.612e-02 2.235e-03 -7.214 5.45e-13 ***
## timbre_6_max 3.814e-03 2.157e-03 1.768 0.076982 .
## timbre_7_min -5.102e-03 1.755e-03 -2.907 0.003644 **
## timbre_7_max -3.158e-03 1.811e-03 -1.744 0.081090 .
## timbre_8_min 4.488e-03 2.810e-03 1.597 0.110254
## timbre_8_max 6.423e-03 2.950e-03 2.177 0.029497 *
## timbre_9_min -4.282e-04 2.955e-03 -0.145 0.884792
## timbre_9_max 3.525e-03 2.377e-03 1.483 0.138017
## timbre_10_min 2.993e-03 1.804e-03 1.660 0.097004 .
## timbre_10_max 7.367e-03 1.731e-03 4.255 2.09e-05 ***
## timbre_11_min -2.837e-02 3.630e-03 -7.815 5.48e-15 ***
## timbre_11_max 1.829e-02 3.341e-03 5.476 4.34e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 6017.5 on 7200 degrees of freedom
## Residual deviance: 4871.8 on 7168 degrees of freedom
## AIC: 4937.8
##
## Number of Fisher Scoring iterations: 6
#We just subtracted the variable loudness. We couldn't do this with the variables "songtitle" and "artistname", because they are not numeric variables, and we might get different values in the test set that the training set has never seen. But this approach (subtracting the variable from the model formula) will always work when you want to remove numeric variables.
#Look at the summary of SongsLog2, and inspect the coefficient of the variable "energy". What do you observe?
#Ans:Model 2 suggests that songs with high energy levels tend to be more popular. This contradicts our observation in Model 1
#EXPLANATION:The coefficient estimate for energy is positive in Model 2, suggesting that songs with higher energy levels tend to be more popular. However, note that the variable energy is not significant in this model.
#################################################
#PROBLEM 3.3 - BEWARE OF MULTICOLLINEARITY ISSUES!
#Now, create Model 3, which should be exactly like Model 1, but without the variable "energy".
SongsLog3 = glm(Top10 ~ . - energy, data=SongsTrain, family=binomial)
summary(SongsLog3)
##
## Call:
## glm(formula = Top10 ~ . - energy, family = binomial, data = SongsTrain)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9182 -0.5417 -0.3481 -0.1874 3.4171
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.196e+01 1.714e+00 6.977 3.01e-12 ***
## timesignature 1.151e-01 8.726e-02 1.319 0.187183
## timesignature_confidence 7.143e-01 1.946e-01 3.670 0.000242 ***
## loudness 2.306e-01 2.528e-02 9.120 < 2e-16 ***
## tempo -6.460e-04 1.665e-03 -0.388 0.698107
## tempo_confidence 3.841e-01 1.398e-01 2.747 0.006019 **
## key 1.649e-02 1.035e-02 1.593 0.111056
## key_confidence 3.394e-01 1.409e-01 2.409 0.015984 *
## pitch -5.328e+01 6.733e+00 -7.914 2.49e-15 ***
## timbre_0_min 2.205e-02 4.239e-03 5.200 1.99e-07 ***
## timbre_0_max -3.105e-01 2.537e-02 -12.240 < 2e-16 ***
## timbre_1_min 5.416e-03 7.643e-04 7.086 1.38e-12 ***
## timbre_1_max -5.115e-04 7.110e-04 -0.719 0.471928
## timbre_2_min -2.254e-03 1.120e-03 -2.012 0.044190 *
## timbre_2_max 4.119e-04 9.020e-04 0.457 0.647915
## timbre_3_min 3.179e-04 5.869e-04 0.542 0.588083
## timbre_3_max -2.964e-03 5.758e-04 -5.147 2.64e-07 ***
## timbre_4_min 1.105e-02 1.978e-03 5.585 2.34e-08 ***
## timbre_4_max 6.467e-03 1.541e-03 4.196 2.72e-05 ***
## timbre_5_min -5.135e-03 1.269e-03 -4.046 5.21e-05 ***
## timbre_5_max 2.979e-04 7.855e-04 0.379 0.704526
## timbre_6_min -1.784e-02 2.246e-03 -7.945 1.94e-15 ***
## timbre_6_max 3.447e-03 2.182e-03 1.580 0.114203
## timbre_7_min -5.128e-03 1.768e-03 -2.900 0.003733 **
## timbre_7_max -3.394e-03 1.820e-03 -1.865 0.062208 .
## timbre_8_min 3.686e-03 2.833e-03 1.301 0.193229
## timbre_8_max 4.658e-03 2.988e-03 1.559 0.119022
## timbre_9_min -9.318e-05 2.957e-03 -0.032 0.974859
## timbre_9_max 1.342e-03 2.424e-03 0.554 0.579900
## timbre_10_min 4.050e-03 1.827e-03 2.217 0.026637 *
## timbre_10_max 5.793e-03 1.759e-03 3.294 0.000988 ***
## timbre_11_min -2.638e-02 3.683e-03 -7.162 7.96e-13 ***
## timbre_11_max 1.984e-02 3.365e-03 5.896 3.74e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 6017.5 on 7200 degrees of freedom
## Residual deviance: 4782.7 on 7168 degrees of freedom
## AIC: 4848.7
##
## Number of Fisher Scoring iterations: 6
#Look at the summary of Model 3 and inspect the coefficient of the variable "loudness". Remembering that higher loudness and energy both occur in songs with heavier instrumentation, do we make the same observation about the popularity of heavy instrumentation as we did with Model 2?
#Ans:Yes
#EXPLANATION: Looking at the output of summary(SongsLog3), we can see that loudness has a positive coefficient estimate, meaning that our model predicts that songs with heavier instrumentation tend to be more popular. This is the same conclusion we got from Model 2.
#########################################
#PROBLEM 4.1 - VALIDATING OUR MODEL
#Make predictions on the test set using Model 3. What is the accuracy of Model 3 on the test set, using a threshold of 0.45? (Compute the accuracy as a number between 0 and 1.)
# Predictions on the test set
testPredict = predict(SongsLog3, type="response", newdata=SongsTest)
# Confusion matrix with threshold of 0.5
table(SongsTest$Top10, testPredict > 0.45)
##
## FALSE TRUE
## 0 309 5
## 1 40 19
#therefore the Overall Accuracy is
(309+19)/(309+5+40+19)
## [1] 0.8793566
#0.8793566
######################################
#PROBLEM 4.2 - VALIDATING OUR MODEL
#Let's check if there's any incremental benefit in using Model 3 instead of a baseline model. Given the difficulty of guessing which song is going to be a hit, an easier model would be to pick the most frequent outcome (a song is not a Top 10 hit) for all songs. What would the accuracy of the baseline model be on the test set? (Give your answer as a number between 0 and 1.)
# Baseline accuracy/model
(309+5)/(309+5+40+19) #From the Confusion matrix:an easier model would be to pick the most frequent outcome (a song is not a Top 10 hit) for all songs
## [1] 0.8418231
#or
#You can compute the baseline accuracy by tabling the outcome variable in the test set
table(SongsTest$Top10)
##
## 0 1
## 314 59
#The baseline model would get 314 observations correct, and 59 wrong, for an accuracy of
314/(314+59)
## [1] 0.8418231
#Ans:0.8418231
#####################################
#PROBLEM 4.3 - VALIDATING OUR MODEL
#It seems that Model 3 gives us a small improvement over the baseline model. Still, does it create an edge?
#Let's view the two models from an investment perspective. A production company is interested in investing in songs that are highly likely to make it to the Top 10. The company's objective is to minimize its risk of financial losses attributed to investing in songs that end up unpopular.
#A competitive edge can therefore be achieved if we can provide the production company a list of songs that are highly likely to end up in the Top 10. We note that the baseline model does not prove useful, as it simply does not label any song as a hit. Let us see what our model has to offer.
#How many songs does Model 3 correctly predict as Top 10 hits in 2010 (remember that all songs in 2010 went into our test set), using a threshold of 0.45?
# Predictions on the test set
testPredict = predict(SongsLog3, type="response", newdata=SongsTest)
# Confusion matrix with threshold of 0.5
table(SongsTest$Top10, testPredict > 0.45)
##
## FALSE TRUE
## 0 309 5
## 1 40 19
#Ans:19
#How many non-hit songs does Model 3 predict will be Top 10 hits (again, looking at the test set), using a threshold of 0.45?
#Ans:5 # from the above confusion matrix
#EXPLANATION:According to our model's confusion matrix:table(SongsTest$Top10, testPredict >= 0.45),We have 19 true positives (Top 10 hits that we predict correctly), and 5 false positives (songs that we predict will be Top 10 hits, but end up not being Top 10 hits).
##################################################
#PROBLEM 4.4 - VALIDATING OUR MODEL
#Using the confusion matrix:
table(SongsTest$Top10, testPredict >= 0.45)
##
## FALSE TRUE
## 0 309 5
## 1 40 19
#What is the sensitivity of Model 3 on the test set, using a threshold of 0.45?
19/(19+40)
## [1] 0.3220339
#Ans:0.3220339
#What is the specificity of Model 3 on the test set, using a threshold of 0.45?
309/(309+5)
## [1] 0.9840764
#Ans:0.9840764
####################################
#PROBLEM 4.5 - VALIDATING OUR MODEL
#What conclusions can you make about our model?
#Ans:Model 3 favors specificity over sensitivity. & Model 3 provides conservative predictions, and predicts that a song will make it to the Top 10 very rarely. So while it detects less than half of the Top 10 songs, we can be very confident in the songs that it does predict to be Top 10 hits.
#EXPLANATION:Model 3 has a very high specificity, meaning that it favors specificity over sensitivity. While Model 3 only captures less than half of the Top 10 songs, it still can offer a competitive edge, since it is very conservative in its predictions.
In many criminal justice systems around the world, inmates deemed not to be a threat to society are released from prison under the parole system prior to completing their sentence. They are still considered to be serving their sentence while on parole, and they can be returned to prison if they violate the terms of their parole.
Parole boards are charged with identifying which inmates are good candidates for release on parole. They seek to release inmates who will not commit additional crimes after release. In this problem, we will build and validate a model that predicts if an inmate will violate the terms of his or her parole. Such a model could be useful to a parole board when deciding to approve or deny an application for parole.
For this prediction task, we will use data from the United States 2004 National Corrections Reporting Program, a nationwide census of parole releases that occurred during 2004. We limited our focus to parolees who served no more than 6 months in prison and whose maximum sentence for all charges did not exceed 18 months. The dataset contains all such parolees who either successfully completed their term of parole during 2004 or those who violated the terms of their parole during that year. The dataset contains the following variables:
male: 1 if the parolee is male, 0 if female
race: 1 if the parolee is white, 2 otherwise
age: the parolee’s age (in years) when he or she was released from prison
state: a code for the parolee’s state. 2 is Kentucky, 3 is Louisiana, 4 is Virginia, and 1 is any other state. The three states were selected due to having a high representation in the dataset.
time.served: the number of months the parolee served in prison (limited by the inclusion criteria to not exceed 6 months).
max.sentence: the maximum sentence length for all charges, in months (limited by the inclusion criteria to not exceed 18 months).
multiple.offenses: 1 if the parolee was incarcerated for multiple offenses, 0 otherwise. crime: a code for the parolee’s main crime leading to incarceration. 2 is larceny, 3 is drug-related crime, 4 is driving-related crime, and 1 is any other crime.
violator: 1 if the parolee violated the parole, and 0 if the parolee completed the parole without violation.
#PROBLEM 1.1 - LOADING THE DATASET
#lets load the dataframe and see the details of this dataset
parole<-read.csv("parole.csv ")
str(parole)
## 'data.frame': 675 obs. of 9 variables:
## $ male : int 1 0 1 1 1 1 1 0 0 1 ...
## $ race : int 1 1 2 1 2 2 1 1 1 2 ...
## $ age : num 33.2 39.7 29.5 22.4 21.6 46.7 31 24.6 32.6 29.1 ...
## $ state : int 1 1 1 1 1 1 1 1 1 1 ...
## $ time.served : num 5.5 5.4 5.6 5.7 5.4 6 6 4.8 4.5 4.7 ...
## $ max.sentence : int 18 12 12 18 12 18 18 12 13 12 ...
## $ multiple.offenses: int 0 0 0 0 0 0 0 0 0 0 ...
## $ crime : int 4 3 3 1 1 4 3 1 3 2 ...
## $ violator : int 0 0 0 0 0 0 0 0 0 0 ...
summary(parole)
## male race age state
## Min. :0.0000 Min. :1.000 Min. :18.40 Min. :1.000
## 1st Qu.:1.0000 1st Qu.:1.000 1st Qu.:25.35 1st Qu.:2.000
## Median :1.0000 Median :1.000 Median :33.70 Median :3.000
## Mean :0.8074 Mean :1.424 Mean :34.51 Mean :2.887
## 3rd Qu.:1.0000 3rd Qu.:2.000 3rd Qu.:42.55 3rd Qu.:4.000
## Max. :1.0000 Max. :2.000 Max. :67.00 Max. :4.000
## time.served max.sentence multiple.offenses crime
## Min. :0.000 Min. : 1.00 Min. :0.0000 Min. :1.000
## 1st Qu.:3.250 1st Qu.:12.00 1st Qu.:0.0000 1st Qu.:1.000
## Median :4.400 Median :12.00 Median :1.0000 Median :2.000
## Mean :4.198 Mean :13.06 Mean :0.5363 Mean :2.059
## 3rd Qu.:5.200 3rd Qu.:15.00 3rd Qu.:1.0000 3rd Qu.:3.000
## Max. :6.000 Max. :18.00 Max. :1.0000 Max. :4.000
## violator
## Min. :0.0000
## 1st Qu.:0.0000
## Median :0.0000
## Mean :0.1156
## 3rd Qu.:0.0000
## Max. :1.0000
#How many parolees are contained in the dataset?
nrow(parole) #or str(parole)
## [1] 675
#Ans:675
##########################################
#PROBLEM 1.2 - LOADING THE DATASET
#How many of the parolees in the dataset violated the terms of their parole?
table(parole$violator)
##
## 0 1
## 597 78
#Ans:78
###########################################
#PROBLEM 2.1 - PREPARING THE DATASET
#You should be familiar with unordered factors (if not, review the Week 2 homework problem "Reading Test Scores"). Which variables in this dataset are unordered factors with at least three levels? Select all that apply.
sapply(parole,class) #to get classes of all var
## male race age state
## "integer" "integer" "numeric" "integer"
## time.served max.sentence multiple.offenses crime
## "numeric" "integer" "integer" "integer"
## violator
## "integer"
#Ans:state , crime
#EXPLANATION:While the variables male, race, state, crime, and violator are all unordered factors, only state and crime have at least 3 levels in this dataset.
#############################
#PROBLEM 2.2 - PREPARING THE DATASET
#In the last subproblem, we identified variables that are unordered factors with at least 3 levels, so we need to convert them to factors for our prediction problem (we introduced this idea in the "Reading Test Scores" problem last week). Using the as.factor() function, convert these variables to factors. Keep in mind that we are not changing the values, just the way R understands them (the values are still numbers).
parole$state<-as.factor(parole$state)
parole$crime<-as.factor(parole$crime)
summary(parole)
## male race age state time.served
## Min. :0.0000 Min. :1.000 Min. :18.40 1:143 Min. :0.000
## 1st Qu.:1.0000 1st Qu.:1.000 1st Qu.:25.35 2:120 1st Qu.:3.250
## Median :1.0000 Median :1.000 Median :33.70 3: 82 Median :4.400
## Mean :0.8074 Mean :1.424 Mean :34.51 4:330 Mean :4.198
## 3rd Qu.:1.0000 3rd Qu.:2.000 3rd Qu.:42.55 3rd Qu.:5.200
## Max. :1.0000 Max. :2.000 Max. :67.00 Max. :6.000
## max.sentence multiple.offenses crime violator
## Min. : 1.00 Min. :0.0000 1:315 Min. :0.0000
## 1st Qu.:12.00 1st Qu.:0.0000 2:106 1st Qu.:0.0000
## Median :12.00 Median :1.0000 3:153 Median :0.0000
## Mean :13.06 Mean :0.5363 4:101 Mean :0.1156
## 3rd Qu.:15.00 3rd Qu.:1.0000 3rd Qu.:0.0000
## Max. :18.00 Max. :1.0000 Max. :1.0000
#How does the output of summary() change for a factor variable as compared to a numerical variable?
#Ans:The output becomes similar to that of the table() function applied to that variable
#EXPLANATION:The output of summary(parole$state) or summary(parole$crime) now shows a breakdown of the number of parolees with each level of the factor, which is most similar to the output of the table() function.
##########################################
#PROBLEM 3.1 - SPLITTING INTO A TRAINING AND TESTING SET
set.seed(144)
library(caTools)
split = sample.split(parole$violator, SplitRatio = 0.7)
train = subset(parole, split == TRUE)
test = subset(parole, split == FALSE)
#Roughly what proportion of parolees have been allocated to the training and testing sets?
#Ans:70% to the training set, 30% to the testing set
#EXPLANATION:SplitRatio=0.7 causes split to take the value TRUE roughly 70% of the time, so train should contain roughly 70% of the values in the dataset. You can verify this by running nrow(train) and nrow(test).
###############################
#PROBLEM 3.2 - SPLITTING INTO A TRAINING AND TESTING SET
#Now, suppose you re-ran lines [1]-[5] of Problem 3.1. What would you expect?
#Ans:The exact same training/testing set split as the first execution of [1]-[5]
#If you instead ONLY re-ran lines [3]-[5], what would you expect?
#Ans:A different training/testing set split from the first execution of [1]-[5]
#If you instead called set.seed() with a different number and then re-ran lines [3]-[5] of Problem 3.1, what would you expect?
#Ans:A different training/testing set split from the first execution of [1]-[5]
#EXPLANATION:If you set a random seed, split, set the seed again to the same value, and then split again, you will get the same split. However, if you set the seed and then split twice, you will get different splits. If you set the seed to different values, you will get different splits.You can also verify this by running the specified code in R. If you have training sets train1 and train2, the function sum(train1 != train2) will count the number of values in those two data frames that are different.
##############################
#PROBLEM 4.1 - BUILDING A LOGISTIC REGRESSION MODEL
#If you tested other training/testing set splits in the previous section, please re-run the original 5 lines of code to obtain the original split.
#Using glm (and remembering the parameter family="binomial"), train a logistic regression model on the training set. Your dependent variable is "violator", and you should use all of the other variables as independent variables.
mod<-glm(violator~.,data=train,family = binomial)
summary(mod)
##
## Call:
## glm(formula = violator ~ ., family = binomial, data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7041 -0.4236 -0.2719 -0.1690 2.8375
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.2411574 1.2938852 -3.278 0.00105 **
## male 0.3869904 0.4379613 0.884 0.37690
## race 0.8867192 0.3950660 2.244 0.02480 *
## age -0.0001756 0.0160852 -0.011 0.99129
## state2 0.4433007 0.4816619 0.920 0.35739
## state3 0.8349797 0.5562704 1.501 0.13335
## state4 -3.3967878 0.6115860 -5.554 2.79e-08 ***
## time.served -0.1238867 0.1204230 -1.029 0.30359
## max.sentence 0.0802954 0.0553747 1.450 0.14705
## multiple.offenses 1.6119919 0.3853050 4.184 2.87e-05 ***
## crime2 0.6837143 0.5003550 1.366 0.17180
## crime3 -0.2781054 0.4328356 -0.643 0.52054
## crime4 -0.0117627 0.5713035 -0.021 0.98357
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 340.04 on 472 degrees of freedom
## Residual deviance: 251.48 on 460 degrees of freedom
## AIC: 277.48
##
## Number of Fisher Scoring iterations: 6
#What variables are significant in this model? Significant variables should have a least one star, or should have a probability less than 0.05 (the column Pr(>|z|) in the summary output). Select all that apply.
#Ans:race,state4,multiple.offenses
#############################################
#PROBLEM 4.2 - BUILDING A LOGISTIC REGRESSION MODEL
#What can we say based on the coefficient of the multiple.offenses variable?
#The following two properties might be useful to you when answering this question:
#1) If we have a coefficient c for a variable, then that means the log odds (or Logit) are increased by c for a unit increase in the variable.
#2) If we have a coefficient c for a variable, then that means the odds are multiplied by e^c for a unit increase in the variable.
exp(1.6119919)
## [1] 5.012786
#Ans:Our model predicts that a parolee who committed multiple offenses has 5.01 times higher odds of being a violator than a parolee who did not commit multiple offenses but is otherwise identical.
#EXPLANATION:For parolees A and B who are identical other than A having committed multiple offenses, the predicted log odds of A is 1.61 more than the predicted log odds of B. Then we have:
#ln(odds of A) = ln(odds of B) + 1.61
#exp(ln(odds of A)) = exp(ln(odds of B) + 1.61)
#exp(ln(odds of A)) = exp(ln(odds of B)) * exp(1.61)
#odds of A = exp(1.61) * odds of B
#odds of A= 5.01 * odds of B
#In the second step we raised e to the power of both sides. In the third step we used the exponentiation rule that e^(a+b) = e^a * e^b. In the fourth step we used the rule that e^(ln(x)) = x.
###############################################
#PROBLEM 4.3 - BUILDING A LOGISTIC REGRESSION MODEL
#Consider a parolee who is male, of white race, aged 50 years at prison release, from the state of Maryland, served 3 months, had a maximum sentence of 12 months, did not commit multiple offenses, and committed a larceny. Answer the following questions based on the model's predictions for this individual. (HINT: You should use the coefficients of your model, the Logistic Response Function, and the Odds equation to solve this problem.)
#According to the model, what are the odds this individual is a violator?
Logodds<--4.2411574+0.3869904*1+0.8867192*1+-0.0001756*50+0+-0.1238867 *3+0.0802954*12+0+ 0.6837143*1
Logodds
## [1] -1.700629
exp(Logodds)
## [1] 0.1825687
#Ans:0.1825687
#According to the model, what is the probability this individual is a violator?
0.1825687/(1+0.1825687) #or
## [1] 0.1543832
1/(1+exp(-(-1.700629)))
## [1] 0.1543831
#Ans:0.1543832
#EXPLANATION:
#From the logistic regression equation,
#we have log(odds) = -4.2411574 + 0.3869904*male + 0.8867192*race - 0.0001756*age + 0.4433007*state2 + 0.8349797*state3 - 3.3967878*state4 - 0.1238867*time.served + 0.0802954*max.sentence + 1.6119919*multiple.offenses + 0.6837143*crime2 - 0.2781054*crime3 - 0.0117627*crime4. This parolee has male=1, race=1, age=50, state2=0, state3=0, state4=0, time.served=3, max.sentence=12, multiple.offenses=0, crime2=1, crime3=0, crime4=0. We conclude that log(odds) = -1.700629.
#Therefore, the odds ratio is exp(-1.700629) = 0.183, and the predicted probability of violation is 1/(1+exp(1.700629)) = 0.154.
###############################################
#PROBLEM 5.1 - EVALUATING THE MODEL ON THE TESTING SET
#Use the predict() function to obtain the model's predicted probabilities for parolees in the testing set, remembering to pass type="response".
predictions<-predict(mod,type="response", newdata=test)
#What is the maximum predicted probability of a violation?
summary(predictions) #or max(predictpar)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.002334 0.023780 0.057910 0.146600 0.147500 0.907300
#Ans:0.9072791
#######################################################
#PROBLEM 5.2 - EVALUATING THE MODEL ON THE TESTING SET
#In the following questions, evaluate the model's predictions on the test set using a threshold of 0.5.
# Confusion matrix with threshold of 0.5
table(test$violator, as.numeric(predictions >= 0.5))
##
## 0 1
## 0 167 12
## 1 11 12
#What is the model's sensitivity?
12/(11+12) # The sensitivity (proportion of the actual violators we got correct) is 12/(11+12) = 0.522
## [1] 0.5217391
#Ans:0.5217391
#What is the model's specificity?
167/(167+12) #the specificity (proportion of the actual non-violators we got correct) is 167/(167+12) = 0.933
## [1] 0.9329609
#Ans:0.9329609
#What is the model's accuracy?
(167+12)/(167+12+11+12) #There are 202 observations in the test set. The accuracy (percentage of values on the diagonal) is (167+12)/202 = 0.886
## [1] 0.8861386
#Ans:0.8861386
##################################################
#PROBLEM 5.3 - EVALUATING THE MODEL ON THE TESTING SET
#What is the accuracy of a simple model that predicts that every parolee is a non-violator?
(167+12)/(167+12+11+12) #From the Confusion matrix:an easier model would be to pick the most frequent outcome i.e. corresponding to 0s (Non violators)
## [1] 0.8861386
#or If you table the outcome variable using the following command:
table(test$violator)
##
## 0 1
## 179 23
179/(179+23)
## [1] 0.8861386
#you can see that there are 179 negative examples, which are the ones that the baseline model would get correct. Thus the baseline model would have an accuracy of 179/202 = 0.886.
####################################################
#PROBLEM 5.4 - EVALUATING THE MODEL ON THE TESTING SET
#Consider a parole board using the model to predict whether parolees will be violators or not. The job of a parole board is to make sure that a prisoner is ready to be released into free society, and therefore parole boards tend to be particularily concerned with releasing prisoners who will violate their parole. Which of the following most likely describes their preferences and best course of action?
#Ans:The board assigns more cost to a false negative than a false positive, and should therefore use a logistic regression cutoff less than 0.5.
#EXPLANATION:If the board used the model for parole decisions, a negative prediction would lead to a prisoner being granted parole, while a positive prediction would lead to a prisoner being denied parole. The parole board would experience more regret for releasing a prisoner who then violates parole (a negative prediction that is actually positive, or false negative) than it would experience for denying parole to a prisoner who would not have violated parole (a positive prediction that is actually negative, or false positive).
#Decreasing the cutoff leads to more positive predictions, which increases false positives and decreases false negatives. Meanwhile, increasing the cutoff leads to more negative predictions, which increases false negatives and decreases false positives. The parole board assigns high cost to false negatives, and therefore should decrease the cutoff.
##################################################
#PROBLEM 5.5 - EVALUATING THE MODEL ON THE TESTING SET
#Which of the following is the most accurate assessment of the value of the logistic regression model with a cutoff 0.5 to a parole board, based on the model's accuracy as compared to the simple baseline model?
#Ans:The model is likely of value to the board, and using a different logistic regression cutoff is likely to improve the model's value.
#EXPLANATION:The model at cutoff 0.5 has 12 false positives and 11 false negatives, while the baseline model has 0 false positives and 23 false negatives. Because a parole board is likely to assign more cost to a false negative, the model at cutoff 0.5 is likely of value to the board.
#From the previous question, the parole board would likely benefit from decreasing the logistic regression cutoffs, which decreases the false negative rate while increasing the false positive rate.
######################################
#PROBLEM 5.6 - EVALUATING THE MODEL ON THE TESTING SET
#Using the ROCR package, what is the AUC value for the model?
# Test set AUC
library(ROCR)
ROCRpred = prediction(predictions, test$violator)
as.numeric(performance(ROCRpred, "auc")@y.values)
## [1] 0.8945834
#Ans:0.8945834
################################
#PROBLEM 5.7 - EVALUATING THE MODEL ON THE TESTING SET
#Describe the meaning of AUC in this context.
#Ans:The probability the model can correctly differentiate between a randomly selected parole violator and a randomly selected parole non-violator.
#EXPLANATION:The AUC deals with differentiating between a randomly selected positive and negative example. It is independent of the regression cutoff selected.
##########################################
#PROBLEM 6.1 - IDENTIFYING BIAS IN OBSERVATIONAL DATA
#Our goal has been to predict the outcome of a parole decision, and we used a publicly available dataset of parole releases for predictions. In this final problem, we'll evaluate a potential source of bias associated with our analysis. It is always important to evaluate a dataset for possible sources of bias.
#The dataset contains all individuals released from parole in 2004, either due to completing their parole term or violating the terms of their parole. However, it does not contain parolees who neither violated their parole nor completed their term in 2004, causing non-violators to be underrepresented. This is called "selection bias" or "selecting on the dependent variable," because only a subset of all relevant parolees were included in our analysis, based on our dependent variable in this analysis (parole violation). How could we improve our dataset to best address selection bias?
#Ans:We should use a dataset tracking a group of parolees from the start of their parole until either they violated parole or they completed their term.
#EXPLANATION:While expanding the dataset to include the missing parolees and labeling each as violator=0 would improve the representation of non-violators, it does not capture the true outcome, since the parolee might become a violator after 2004. Though labeling these new examples with violator=NA correctly identifies that we don't know their true outcome, we cannot train or test a prediction model with a missing dependent variable.
#As a result, a prospective dataset that tracks a cohort of parolees and observes the true outcome of each is more desirable. Unfortunately, such datasets are often more challenging to obtain (for instance, if a parolee had a 10-year term, it might require tracking that individual for 10 years before building the model). Such a prospective analysis would not be possible using the 2004 National Corrections Reporting Program dataset.
######################################################
In the lending industry, investors provide loans to borrowers in exchange for the promise of repayment with interest. If the borrower repays the loan, then the lender profits from the interest. However, if the borrower is unable to repay the loan, then the lender loses money. Therefore, lenders face the problem of predicting the risk of a borrower being unable to repay a loan.
To address this problem, we will use publicly available data from LendingClub.com, a website that connects borrowers and investors over the Internet. This dataset represents 9,578 3-year loans that were funded through the LendingClub.com platform between May 2007 and February 2010. The binary dependent variable not_fully_paid indicates that the loan was not paid back in full (the borrower either defaulted or the loan was “charged off,” meaning the borrower was deemed unlikely to ever pay it back).
To predict this dependent variable, we will use the following independent variables available to the investor when deciding whether to fund a loan:
#PROBLEM 1.1 - PREPARING THE DATASET
#Lets load the dataset
loans<-read.csv("loans.csv")
str(loans)
## 'data.frame': 9578 obs. of 14 variables:
## $ credit.policy : int 1 1 1 1 1 1 1 1 1 1 ...
## $ purpose : Factor w/ 7 levels "all_other","credit_card",..: 3 2 3 3 2 2 3 1 5 3 ...
## $ int.rate : num 0.119 0.107 0.136 0.101 0.143 ...
## $ installment : num 829 228 367 162 103 ...
## $ log.annual.inc : num 11.4 11.1 10.4 11.4 11.3 ...
## $ dti : num 19.5 14.3 11.6 8.1 15 ...
## $ fico : int 737 707 682 712 667 727 667 722 682 707 ...
## $ days.with.cr.line: num 5640 2760 4710 2700 4066 ...
## $ revol.bal : int 28854 33623 3511 33667 4740 50807 3839 24220 69909 5630 ...
## $ revol.util : num 52.1 76.7 25.6 73.2 39.5 51 76.8 68.6 51.1 23 ...
## $ inq.last.6mths : int 0 0 1 1 0 0 0 0 1 1 ...
## $ delinq.2yrs : int 0 0 0 0 1 0 0 0 0 0 ...
## $ pub.rec : int 0 0 0 0 0 0 1 0 0 0 ...
## $ not.fully.paid : int 0 0 0 0 0 0 1 1 0 0 ...
summary(loans)
## credit.policy purpose int.rate
## Min. :0.000 all_other :2331 Min. :0.0600
## 1st Qu.:1.000 credit_card :1262 1st Qu.:0.1039
## Median :1.000 debt_consolidation:3957 Median :0.1221
## Mean :0.805 educational : 343 Mean :0.1226
## 3rd Qu.:1.000 home_improvement : 629 3rd Qu.:0.1407
## Max. :1.000 major_purchase : 437 Max. :0.2164
## small_business : 619
## installment log.annual.inc dti fico
## Min. : 15.67 Min. : 7.548 Min. : 0.000 Min. :612.0
## 1st Qu.:163.77 1st Qu.:10.558 1st Qu.: 7.213 1st Qu.:682.0
## Median :268.95 Median :10.928 Median :12.665 Median :707.0
## Mean :319.09 Mean :10.932 Mean :12.607 Mean :710.8
## 3rd Qu.:432.76 3rd Qu.:11.290 3rd Qu.:17.950 3rd Qu.:737.0
## Max. :940.14 Max. :14.528 Max. :29.960 Max. :827.0
## NA's :4
## days.with.cr.line revol.bal revol.util inq.last.6mths
## Min. : 179 Min. : 0 Min. : 0.00 Min. : 0.000
## 1st Qu.: 2820 1st Qu.: 3187 1st Qu.: 22.70 1st Qu.: 0.000
## Median : 4140 Median : 8596 Median : 46.40 Median : 1.000
## Mean : 4562 Mean : 16914 Mean : 46.87 Mean : 1.572
## 3rd Qu.: 5730 3rd Qu.: 18250 3rd Qu.: 71.00 3rd Qu.: 2.000
## Max. :17640 Max. :1207359 Max. :119.00 Max. :33.000
## NA's :29 NA's :62 NA's :29
## delinq.2yrs pub.rec not.fully.paid
## Min. : 0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.: 0.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Median : 0.0000 Median :0.0000 Median :0.0000
## Mean : 0.1638 Mean :0.0621 Mean :0.1601
## 3rd Qu.: 0.0000 3rd Qu.:0.0000 3rd Qu.:0.0000
## Max. :13.0000 Max. :5.0000 Max. :1.0000
## NA's :29 NA's :29
#What proportion of the loans in the dataset were not paid in full? Please input a number between 0 and 1.
prop.table(table(loans$not.fully.paid))
##
## 0 1
## 0.8399457 0.1600543
#Ans:0.1600543
############################################
#PROBLEM 1.2 - PREPARING THE DATASET
#Which of the following variables has at least one missing observation? Select all that apply.
colnames(loans)[apply(is.na(loans), 2, any)] #or
## [1] "log.annual.inc" "days.with.cr.line" "revol.util"
## [4] "inq.last.6mths" "delinq.2yrs" "pub.rec"
which(apply(is.na(loans), 2, sum)>0) # gives col names as well as col no
## log.annual.inc days.with.cr.line revol.util inq.last.6mths
## 5 8 10 11
## delinq.2yrs pub.rec
## 12 13
#or
colnames(loans)[apply(is.na(loans), 2, any)] #or very crude way summary(loans)
## [1] "log.annual.inc" "days.with.cr.line" "revol.util"
## [4] "inq.last.6mths" "delinq.2yrs" "pub.rec"
#Ans:log.annual.inc, days.with.cr.line, revol.util, inq.last.6mths, delinq.2yrs and pub.rec
####################################
#PROBLEM 1.3 - PREPARING THE DATASET
#Which of the following is the best reason to fill in the missing values for these variables instead of removing observations with missing data? (Hint: you can use the subset() function to build a data frame with the observations missing at least one value. To test if a variable, for example pub.rec, is missing a value, use is.na(pub.rec).
#Ans:We want to be able to predict risk for all borrowers, instead of just the ones with all data reported.
#EXPLANATION:Answering this question requires analyzing the loans with missing data. We can build a data frame limited to observations with some missing data with the following command:
missing = subset(loans, is.na(log.annual.inc) | is.na(days.with.cr.line) | is.na(revol.util) | is.na(inq.last.6mths) | is.na(delinq.2yrs) | is.na(pub.rec))
nrow(missing)
## [1] 62
table(missing$not.fully.paid)
##
## 0 1
## 50 12
#From nrow(missing), we see that only 62 of 9578 loans have missing data; removing this small number of observations would not lead to overfitting. From table(missing$not.fully.paid), we see that 12 of 62 loans with missing data were not fully paid, or 19.35%. This rate is similar to the 16.01% across all loans, so the form of biasing described is not an issue. However, to predict risk for loans with missing data we need to fill in the missing values instead of removing the observations.
##########################################################
#PROBLEM 1.4 - PREPARING THE DATASET
#For the rest of this problem, we'll be using a revised version of the dataset that has the missing values filled in with multiple imputation (which was discussed in the Recitation of this Unit). To ensure everybody has the same data frame going forward, you can either run the commands below in your R console (if you haven't already, run the command install.packages("mice") first), or you can download and load into R the dataset we created after running the imputation: loans_imputed.csv.
#IMPORTANT NOTE: On certain operating systems, the imputation results are not the same even if you set the random seed. If you decide to do the imputation yourself, please still read the provided imputed dataset (loans_imputed.csv) into R and compare your results, using the summary function. If the results are different, please make sure to use the data in loans_imputed.csv for the rest of the problem.
library(mice)
set.seed(144)
vars.for.imputation = setdiff(names(loans), "not.fully.paid") #Note that to do this imputation, we set vars.for.imputation to all variables in the data frame except for not.fully.paid, to impute the values using all of the other independent variables.
imputed = complete(mice(loans[vars.for.imputation]))
##
## iter imp variable
## 1 1 log.annual.inc days.with.cr.line revol.util inq.last.6mths delinq.2yrs pub.rec
## 1 2 log.annual.inc days.with.cr.line revol.util inq.last.6mths delinq.2yrs pub.rec
## 1 3 log.annual.inc days.with.cr.line revol.util inq.last.6mths delinq.2yrs pub.rec
## 1 4 log.annual.inc days.with.cr.line revol.util inq.last.6mths delinq.2yrs pub.rec
## 1 5 log.annual.inc days.with.cr.line revol.util inq.last.6mths delinq.2yrs pub.rec
## 2 1 log.annual.inc days.with.cr.line revol.util inq.last.6mths delinq.2yrs pub.rec
## 2 2 log.annual.inc days.with.cr.line revol.util inq.last.6mths delinq.2yrs pub.rec
## 2 3 log.annual.inc days.with.cr.line revol.util inq.last.6mths delinq.2yrs pub.rec
## 2 4 log.annual.inc days.with.cr.line revol.util inq.last.6mths delinq.2yrs pub.rec
## 2 5 log.annual.inc days.with.cr.line revol.util inq.last.6mths delinq.2yrs pub.rec
## 3 1 log.annual.inc days.with.cr.line revol.util inq.last.6mths delinq.2yrs pub.rec
## 3 2 log.annual.inc days.with.cr.line revol.util inq.last.6mths delinq.2yrs pub.rec
## 3 3 log.annual.inc days.with.cr.line revol.util inq.last.6mths delinq.2yrs pub.rec
## 3 4 log.annual.inc days.with.cr.line revol.util inq.last.6mths delinq.2yrs pub.rec
## 3 5 log.annual.inc days.with.cr.line revol.util inq.last.6mths delinq.2yrs pub.rec
## 4 1 log.annual.inc days.with.cr.line revol.util inq.last.6mths delinq.2yrs pub.rec
## 4 2 log.annual.inc days.with.cr.line revol.util inq.last.6mths delinq.2yrs pub.rec
## 4 3 log.annual.inc days.with.cr.line revol.util inq.last.6mths delinq.2yrs pub.rec
## 4 4 log.annual.inc days.with.cr.line revol.util inq.last.6mths delinq.2yrs pub.rec
## 4 5 log.annual.inc days.with.cr.line revol.util inq.last.6mths delinq.2yrs pub.rec
## 5 1 log.annual.inc days.with.cr.line revol.util inq.last.6mths delinq.2yrs pub.rec
## 5 2 log.annual.inc days.with.cr.line revol.util inq.last.6mths delinq.2yrs pub.rec
## 5 3 log.annual.inc days.with.cr.line revol.util inq.last.6mths delinq.2yrs pub.rec
## 5 4 log.annual.inc days.with.cr.line revol.util inq.last.6mths delinq.2yrs pub.rec
## 5 5 log.annual.inc days.with.cr.line revol.util inq.last.6mths delinq.2yrs pub.rec
loans[vars.for.imputation] = imputed
#What best describes the process we just used to handle missing values?
#Ans:We predicted missing variable values using the available independent variables for each observation
#EXPLANATION:Imputation predicts missing variable values for a given observation using the variable values that are reported. We called the imputation on a data frame with the dependent variable not.fully.paid removed, so we predicted the missing values using only other independent variables.
####################################
#PROBLEM 2.1 - PREDICTION MODELS
#Now that we have prepared the dataset, we need to split it into a training and testing set. To ensure everybody obtains the same split, set the random seed to 144 (even though you already did so earlier in the problem) and use the sample.split function to select the 70% of observations for the training set (the dependent variable for sample.split is not.fully.paid). Name the data frames train and test.
# - SPLITTING INTO A TRAINING AND TESTING SET
set.seed(144)
library(caTools)
split = sample.split(loans$not.fully.paid, SplitRatio = 0.7)
train = subset(loans, split == TRUE)
test = subset(loans, split == FALSE)
#Now, use logistic regression trained on the training set to predict the dependent variable not.fully.paid using all the independent variables.
mod<-glm(not.fully.paid~.,data=train,family = binomial)
summary(mod)
##
## Call:
## glm(formula = not.fully.paid ~ ., family = binomial, data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2097 -0.6214 -0.4950 -0.3601 2.6414
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 9.260e+00 1.554e+00 5.958 2.55e-09 ***
## credit.policy -3.327e-01 1.011e-01 -3.292 0.000995 ***
## purpose2 -6.100e-01 1.344e-01 -4.538 5.67e-06 ***
## purpose3 -3.181e-01 9.185e-02 -3.463 0.000534 ***
## purpose4 1.386e-01 1.753e-01 0.791 0.429074
## purpose5 1.774e-01 1.479e-01 1.199 0.230496
## purpose6 -4.783e-01 2.009e-01 -2.381 0.017260 *
## purpose7 4.159e-01 1.419e-01 2.932 0.003373 **
## int.rate 6.202e-01 2.085e+00 0.297 0.766111
## installment 1.279e-03 2.093e-04 6.110 9.96e-10 ***
## log.annual.inc -4.357e-01 7.151e-02 -6.093 1.11e-09 ***
## dti 4.733e-03 5.501e-03 0.861 0.389508
## fico -9.406e-03 1.707e-03 -5.510 3.60e-08 ***
## days.with.cr.line 3.174e-06 1.587e-05 0.200 0.841463
## revol.bal 3.103e-06 1.169e-06 2.653 0.007966 **
## revol.util 1.796e-03 1.532e-03 1.172 0.241022
## inq.last.6mths 8.386e-02 1.577e-02 5.317 1.06e-07 ***
## delinq.2yrs -7.794e-02 6.532e-02 -1.193 0.232814
## pub.rec 3.191e-01 1.134e-01 2.814 0.004899 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 5896.6 on 6704 degrees of freedom
## Residual deviance: 5485.7 on 6686 degrees of freedom
## AIC: 5523.7
##
## Number of Fisher Scoring iterations: 5
#Which independent variables are significant in our model? (Significant variables have at least one star, or a Pr(>|z|) value less than 0.05.) Select all that apply.
#Ans:credit.policy ,purpose2,purpose3, purpose6, purpose7 ,installment ,log.annual.inc ,fico,revol.bal,inq.last.6mths ,pub.rec
# Note that some have a positive coefficient (meaning that higher values of the variable lead to an increased risk of defaulting) and some have a negative coefficient (meaning that higher values of the variable lead to a decreased risk of defaulting).
###########################################
#PROBLEM 2.2 - PREDICTION MODELS
#Consider two loan applications, which are identical other than the fact that the borrower in Application A has FICO credit score 700 while the borrower in Application B has FICO credit score 710.
#Let Logit(A) be the log odds of loan A not being paid back in full, according to our logistic regression model, and define Logit(B) similarly for loan B. What is the value of Logit(A) - Logit(B)?
-9.406e-03*-10 # coeff of FICO * (diff in FICO values=700-710=-10)
## [1] 0.09406
#Ans:0.09406
#EXPLANATION:Because Application A is identical to Application B other than having a FICO score 10 lower, its predicted log odds differ by -0.009317 * -10 = 0.09317 from the predicted log odds of Application B.
#Now, let O(A) be the odds of loan A not being paid back in full, according to our logistic regression model, and define O(B) similarly for loan B. What is the value of O(A)/O(B)? (HINT: Use the mathematical rule that exp(A + B + C) = exp(A)*exp(B)*exp(C). Also, remember that exp() is the exponential function in R.)
exp(0.09406) #logA-logB=log(A/B) hence oddsA/oddsb=exp(log(oddsA)-log(oddsB))=exp(log(oddsA/oddsB))
## [1] 1.098626
#Ans:1.0976
#EXPLANATION:Using the answer from the previous question, the predicted odds of loan A not being paid back in full are exp(0.09317) = 1.0976 times larger than the predicted odds for loan B. Intuitively, it makes sense that loan A should have higher odds of non-payment than loan B, since the borrower has a worse credit score.
###################################################
#PROBLEM 2.3 - PREDICTION MODELS
#Predict the probability of the test set loans not being paid back in full (remember type="response" for the predict function). Store these predicted probabilities in a variable named predicted.risk and add it to your test set (we will use this variable in later parts of the problem). Compute the confusion matrix using a threshold of 0.5.
predicted.risk<-predict(mod,type="response", newdata=test)
## Warning: contrasts dropped from factor purpose
test$predicted.risk<-predicted.risk # Adding the the predicted prob as a var in the test dataset.This can be achieved in sinle command:test$predicted.risk = predict(mod, newdata=test, type="response")
# Confusion matrix with threshold of 0.5
table(test$not.fully.paid, as.numeric(predicted.risk >= 0.5))
##
## 0 1
## 0 2400 13
## 1 457 3
#What is the accuracy of the logistic regression model? Input the accuracy as a number between 0 and 1.
(2400+3)/(2400+13+457+3)
## [1] 0.8364079
#Ans: 0.8364079
#2403 predictions are correct (accuracy 2403/2873=0.8364)
#What is the accuracy of the baseline model? Input the accuracy as a number between 0 and 1.
(2400+13)/(2400+13+457+3) #From the Confusion matrix:an easier model would be to pick the most frequent outcome
## [1] 0.8398886
#or from
table(test$not.fully.paid)
##
## 0 1
## 2413 460
#Ans:0.8398886
#2413 predictions would be correct in the baseline model of guessing every loan would be paid back in full (accuracy 2413/2873=0.8399)
#######################################
#PROBLEM 2.4 - PREDICTION MODELS (2 points possible)
#Use the ROCR package to compute the test set AUC.
# Test set AUC
library(ROCR)
ROCRpred = prediction(predicted.risk, test$not.fully.paid)
as.numeric(performance(ROCRpred, "auc")@y.values)
## [1] 0.6721337
#Ans:0.6721337
#The model has poor accuracy at the threshold 0.5. But despite the poor accuracy, we will see later how an investor can still leverage this logistic regression model to make profitable investments.
############################################
#PROBLEM 3.1 - A "SMART BASELINE"
#In the previous problem, we built a logistic regression model that has an AUC significantly higher than the AUC of 0.5 that would be obtained by randomly ordering observations.
#However, LendingClub.com assigns the interest rate to a loan based on their estimate of that loan's risk. This variable, int.rate, is an independent variable in our dataset. In this part, we will investigate using the loan's interest rate as a "smart baseline" to order the loans according to risk.
#Using the training set, build a bivariate logistic regression model (aka a logistic regression model with a single independent variable) that predicts the dependent variable not.fully.paid using only the variable int.rate.
bivariate<-glm(not.fully.paid~ int.rate,data=train,family = binomial)
summary(bivariate)
##
## Call:
## glm(formula = not.fully.paid ~ int.rate, family = binomial, data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.0547 -0.6271 -0.5442 -0.4361 2.2914
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.6726 0.1688 -21.76 <2e-16 ***
## int.rate 15.9214 1.2702 12.54 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 5896.6 on 6704 degrees of freedom
## Residual deviance: 5734.8 on 6703 degrees of freedom
## AIC: 5738.8
##
## Number of Fisher Scoring iterations: 4
#The variable int.rate is highly significant in the bivariate model, but it is not significant at the 0.05 level in the model trained with all the independent variables. What is the most likely explanation for this difference?
#Ans:nt.rate is correlated with other risk-related variables, and therefore does not incrementally improve the model when those other variables are included.
#EXPLANATION:Decreased significance between a bivariate and multivariate model is typically due to correlation. From cor(train$int.rate, train$fico), we can see that the interest rate is moderately well correlated with a borrower's credit score.Training/testing set split rarely has a large effect on the significance of variables (this can be verified in this case by trying out a few other training/testing splits), and the models were trained on the same observations.
##############################################################
#PROBLEM 3.2 - A "SMART BASELINE"
#Make test set predictions for the bivariate model. What is the highest predicted probability of a loan not being paid in full on the testing set?
pred.bivariate<-predict(bivariate,type="response", newdata=test)
max(pred.bivariate) #or
## [1] 0.426624
summary(pred.bivariate)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.06196 0.11550 0.15080 0.15960 0.18930 0.42660
#Ans:0.426624
#With a logistic regression cutoff of 0.5, how many loans would be predicted as not being paid in full on the testing set?
#Ans:0
#EXPLANATION:According to the summary function, the maximum predicted probability of the loan not being paid back is 0.4266, which means no loans would be flagged at a logistic regression cutoff of 0.5.
#####################################################
#PROBLEM 3.3 - A "SMART BASELINE"
#What is the test set AUC of the bivariate model?
# Test set AUC
library(ROCR)
ROCRpred = prediction(pred.bivariate, test$not.fully.paid)
as.numeric(performance(ROCRpred, "auc")@y.values)
## [1] 0.6239081
################################################################
#PROBLEM 4.1 - COMPUTING THE PROFITABILITY OF AN INVESTMENT
#While thus far we have predicted if a loan will be paid back or not, an investor needs to identify loans that are expected to be profitable. If the loan is paid back in full, then the investor makes interest on the loan. However, if the loan is not paid back, the investor loses the money invested. Therefore, the investor should seek loans that best balance this risk and reward.
#To compute interest revenue, consider a $c investment in a loan that has an annual interest rate r over a period of t years. Using continuous compounding of interest, this investment pays back c * exp(rt) dollars by the end of the t years, where exp(rt) is e raised to the r*t power.
#How much does a $10 investment with an annual interest rate of 6% pay back after 3 years, using continuous compounding of interest? Hint: remember to convert the percentage to a proportion before doing the math. Enter the number of dollars, without the $ sign.
10*exp(0.06*3)
## [1] 11.97217
#Ans:11.97217
#EXPLANATION:In this problem, we have c=10, r=0.06, and t=3. Using the formula above, the final value is 10*exp(0.06*3) = 11.97.
#################################################
#PROBLEM 4.2 - COMPUTING THE PROFITABILITY OF AN INVESTMENT
#While the investment has value c * exp(rt) dollars after collecting interest, the investor had to pay $c for the investment. What is the profit to the investor if the investment is paid back in full?
#Ans:c * exp(rt) - c
#EXPLANATION:A person's profit is what they get minus what they paid for it. In this case, the investor gets c * exp(rt) but paid c, yielding a profit of c * exp(rt) - c.
#########################################################
#PROBLEM 4.3 - COMPUTING THE PROFITABILITY OF AN INVESTMENT
#Now, consider the case where the investor made a $c investment, but it was not paid back in full. Assume, conservatively, that no money was received from the borrower (often a lender will receive some but not all of the value of the loan, making this a pessimistic assumption of how much is received). What is the profit to the investor in this scenario?
#Ans:-c
#EXPLANATION:A person's profit is what they get minus what they paid for it. In this case, the investor gets no money but paid c dollars, yielding a profit of -c dollars.
###########################################################
#PROBLEM 5.1 - A SIMPLE INVESTMENT STRATEGY
#In the previous subproblem, we concluded that an investor who invested c dollars in a loan with interest rate r for t years makes c * (exp(rt) - 1) dollars of profit if the loan is paid back in full and -c dollars of profit if the loan is not paid back in full (pessimistically).
#In order to evaluate the quality of an investment strategy, we need to compute this profit for each loan in the test set. For this variable, we will assume a $1 investment (aka c=1). To create the variable, we first assign to the profit for a fully paid loan, exp(rt)-1, to every observation, and we then replace this value with -1 in the cases where the loan was not paid in full. All the loans in our dataset are 3-year loans, meaning t=3 in our calculations. Enter the following commands in your R console to create this new variable:
test$profit = exp(test$int.rate*3) - 1
test$profit[test$not.fully.paid == 1] = -1
#What is the maximum profit of a $10 investment in any loan in the testing set (do not include the $ sign in your answer)?
max(test$profit[test$not.fully.paid == 0]) *10
## [1] 8.894769
#or
summary(test$profit)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.0000 0.2858 0.4111 0.2094 0.4980 0.8895
0.8895 *10
## [1] 8.895
#8.894769
#EXPLANATION:From summary(test$profit), we see the maximum profit for a $1 investment in any loan is $0.8895. Therefore, the maximum profit of a $10 investment is 10 times as large, or $8.895.
#################################################################
#PROBLEM 6.1 - AN INVESTMENT STRATEGY BASED ON RISK
#A simple investment strategy of equally investing in all the loans would yield profit $20.94 for a $100 investment. But this simple investment strategy does not leverage the prediction model we built earlier in this problem. As stated earlier, investors seek loans that balance reward with risk, in that they simultaneously have high interest rates and a low risk of not being paid back.
#To meet this objective, we will analyze an investment strategy in which the investor only purchases loans with a high interest rate (a rate of at least 15%), but amongst these loans selects the ones with the lowest predicted risk of not being paid back in full. We will model an investor who invests $1 in each of the most promising 100 loans.
#First, use the subset() function to build a data frame called highInterest consisting of the test set loans with an interest rate of at least 15%.
str(test)
## 'data.frame': 2873 obs. of 16 variables:
## $ credit.policy : int 1 1 1 1 1 1 1 1 1 1 ...
## $ purpose : Factor w/ 7 levels "all_other","credit_card",..: 2 3 3 3 1 3 1 2 7 2 ...
## ..- attr(*, "contrasts")= num [1:7, 1:6] 0 1 0 0 0 0 0 0 0 1 ...
## .. ..- attr(*, "dimnames")=List of 2
## .. .. ..$ : chr "all_other" "credit_card" "debt_consolidation" "educational" ...
## .. .. ..$ : chr "2" "3" "4" "5" ...
## $ int.rate : num 0.107 0.136 0.122 0.132 0.08 ...
## $ installment : num 228.2 366.9 84.1 253.6 188 ...
## $ log.annual.inc : num 11.1 10.4 10.2 11.8 11.2 ...
## $ dti : num 14.29 11.63 10 9.16 16.08 ...
## $ fico : int 707 682 707 662 772 662 772 797 712 682 ...
## $ days.with.cr.line: num 2760 4710 2730 4298 4889 ...
## $ revol.bal : int 33623 3511 5630 5122 29797 4175 3660 6844 3534 43039 ...
## $ revol.util : num 76.7 25.6 23 18.2 23.2 51.5 6.8 14.4 54.4 93.4 ...
## $ inq.last.6mths : int 0 1 1 2 1 0 0 0 0 3 ...
## $ delinq.2yrs : int 0 0 0 1 0 1 0 0 0 0 ...
## $ pub.rec : int 0 0 0 0 0 0 0 0 0 0 ...
## $ not.fully.paid : int 0 0 0 0 0 0 0 0 0 0 ...
## $ predicted.risk : num 0.077 0.1735 0.1087 0.1023 0.0676 ...
## $ profit : num 0.379 0.502 0.442 0.488 0.271 ...
highInterest<-subset(test,int.rate>=0.15)
str(highInterest)
## 'data.frame': 437 obs. of 16 variables:
## $ credit.policy : int 1 1 1 1 1 1 1 1 1 1 ...
## $ purpose : Factor w/ 7 levels "all_other","credit_card",..: 7 2 7 3 3 3 3 3 3 5 ...
## ..- attr(*, "contrasts")= num [1:7, 1:6] 0 1 0 0 0 0 0 0 0 1 ...
## .. ..- attr(*, "dimnames")=List of 2
## .. .. ..$ : chr "all_other" "credit_card" "debt_consolidation" "educational" ...
## .. .. ..$ : chr "2" "3" "4" "5" ...
## $ int.rate : num 0.15 0.153 0.156 0.152 0.162 ...
## $ installment : num 225 444 700 730 881 ...
## $ log.annual.inc : num 12.3 11 11.4 11.2 11.6 ...
## $ dti : num 6.45 19.52 1.53 21.84 24.84 ...
## $ fico : int 677 667 667 677 682 662 667 662 672 662 ...
## $ days.with.cr.line: num 6240 2701 3604 2640 7619 ...
## $ revol.bal : int 56411 33074 4569 27839 51039 26255 34841 20658 28511 2076 ...
## $ revol.util : num 75.3 68.8 56.4 72.7 93.8 74.3 89.6 55.4 56.7 23.1 ...
## $ inq.last.6mths : int 0 2 1 0 0 1 0 0 1 2 ...
## $ delinq.2yrs : int 0 0 0 1 1 0 0 0 0 0 ...
## $ pub.rec : int 0 0 0 0 0 0 0 0 0 0 ...
## $ not.fully.paid : int 1 0 0 0 1 0 0 0 0 0 ...
## $ predicted.risk : num 0.164 0.169 0.332 0.203 0.221 ...
## $ profit : num -1 0.584 0.599 0.578 -1 ...
#What is the average profit of a $1 investment in one of these high-interest loans (do not include the $ sign in your answer)?
mean(highInterest$profit) #or summary(highInterest$profit)
## [1] 0.2251015
#Ans:0.2251015
#What proportion of the high-interest loans were not paid back in full?
prop.table(table(highInterest$not.fully.paid)) #110 of the 437 loans were not paid back in full, for a proportion of 0.2517.
##
## 0 1
## 0.7482838 0.2517162
#Ans:0.2517162
####################################################
#PROBLEM 6.2 - AN INVESTMENT STRATEGY BASED ON RISK
#Next, we will determine the 100th smallest predicted probability of not paying in full by sorting the predicted risks in increasing order and selecting the 100th element of this sorted list. Find the highest predicted risk that we will include by typing the following command into your R console:
cutoff = sort(highInterest$predicted.risk, decreasing=FALSE)[100]
cutoff
## [1] 0.1769593
#Use the subset() function to build a data frame called selectedLoans consisting of the high-interest loans with predicted risk not exceeding the cutoff we just computed. Check to make sure you have selected 100 loans for investment.
selectedLoans<-subset(highInterest,predicted.risk <= cutoff)
nrow(selectedLoans) # check to see if it has 100 obs
## [1] 100
summary(selectedLoans$profit)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.0000 0.5830 0.5992 0.3287 0.6317 0.8508
#What is the profit of the investor, who invested $1 in each of these 100 loans (do not include the $ sign in your answer)?
sum(selectedLoans$profit)
## [1] 32.87361
#Ans:31.28
#How many of 100 selected loans were not paid back in full?
table(selectedLoans$not.fully.paid)
##
## 0 1
## 82 18
#Ans:18
#We have now seen how analytics can be used to select a subset of the high-interest loans that were paid back at only a slightly lower rate than average, resulting in a significant increase in the profit from our investor's $100 investment. Although the logistic regression models developed in this problem did not have large AUC values, we see that they still provided the edge needed to improve the profitability of an investment portfolio.
#We conclude with a note of warning. Throughout this analysis we assume that the loans we invest in will perform in the same way as the loans we used to train our model, even though our training set covers a relatively short period of time. If there is an economic shock like a large financial downturn, default rates might be significantly higher than those observed in the training set and we might end up losing money instead of profiting. Investors must pay careful attention to such risk when making investment decisions.
#Note:Slight discrepancy in answers is due to imputation effect of missing data
Last week, in the Moneyball lecture, we discussed how regular season performance is not strongly correlated with winning the World Series in baseball. In this homework question, we’ll use the same data to investigate how well we can predict the World Series winner at the beginning of the playoffs.
To begin, load the dataset baseball.csv into R using the read.csv function, and call the data frame “baseball”. This is the same data file we used during the Moneyball lecture, and the data comes from Baseball-Reference.com.
As a reminder, this dataset contains data concerning a baseball team’s performance in a given year. It has the following variables:
#Lets start by loading the relevant dataset
baseball<-read.csv("baseball.csv")
str(baseball)
## 'data.frame': 1232 obs. of 15 variables:
## $ Team : Factor w/ 39 levels "ANA","ARI","ATL",..: 2 3 4 5 7 8 9 10 11 12 ...
## $ League : Factor w/ 2 levels "AL","NL": 2 2 1 1 2 1 2 1 2 1 ...
## $ Year : int 2012 2012 2012 2012 2012 2012 2012 2012 2012 2012 ...
## $ RS : int 734 700 712 734 613 748 669 667 758 726 ...
## $ RA : int 688 600 705 806 759 676 588 845 890 670 ...
## $ W : int 81 94 93 69 61 85 97 68 64 88 ...
## $ OBP : num 0.328 0.32 0.311 0.315 0.302 0.318 0.315 0.324 0.33 0.335 ...
## $ SLG : num 0.418 0.389 0.417 0.415 0.378 0.422 0.411 0.381 0.436 0.422 ...
## $ BA : num 0.259 0.247 0.247 0.26 0.24 0.255 0.251 0.251 0.274 0.268 ...
## $ Playoffs : int 0 1 1 0 0 0 1 0 0 1 ...
## $ RankSeason : int NA 4 5 NA NA NA 2 NA NA 6 ...
## $ RankPlayoffs: int NA 5 4 NA NA NA 4 NA NA 2 ...
## $ G : int 162 162 162 162 162 162 162 162 162 162 ...
## $ OOBP : num 0.317 0.306 0.315 0.331 0.335 0.319 0.305 0.336 0.357 0.314 ...
## $ OSLG : num 0.415 0.378 0.403 0.428 0.424 0.405 0.39 0.43 0.47 0.402 ...
#PROBLEM 1.1 - LIMITING TO TEAMS MAKING THE PLAYOFFS
#Each row in the baseball dataset represents a team in a particular year.
#How many team/year pairs are there in the whole dataset?
nrow(baseball) #or str(baseball)
## [1] 1232
#Ans:1232
############################################
#PROBLEM 1.2 - LIMITING TO TEAMS MAKING THE PLAYOFFS
#Though the dataset contains data from 1962 until 2012, we removed several years with shorter-than-usual seasons. Using the table() function, identify the total number of years included in this dataset.
length(table(baseball$Year)) #or
## [1] 47
length(unique(baseball$Year))
## [1] 47
#Ans:47
################################################
#PROBLEM 1.3 - LIMITING TO TEAMS MAKING THE PLAYOFFS
#Because we're only analyzing teams that made the playoffs, use the subset() function to replace baseball with a data frame limited to teams that made the playoffs (so your subsetted data frame should still be called "baseball"). How many team/year pairs are included in the new dataset?
baseball<-subset(baseball,Playoffs ==1)
nrow(baseball) #or str(baseball)
## [1] 244
#Ans:244
########################################
#PROBLEM 1.4 - LIMITING TO TEAMS MAKING THE PLAYOFFS
#Through the years, different numbers of teams have been invited to the playoffs. Which of the following has been the number of teams making the playoffs in some season? Select all that apply.
str(baseball)
## 'data.frame': 244 obs. of 15 variables:
## $ Team : Factor w/ 39 levels "ANA","ARI","ATL",..: 3 4 9 12 25 26 32 33 36 39 ...
## $ League : Factor w/ 2 levels "AL","NL": 2 1 2 1 1 1 2 2 1 2 ...
## $ Year : int 2012 2012 2012 2012 2012 2012 2012 2012 2012 2012 ...
## $ RS : int 700 712 669 726 804 713 718 765 808 731 ...
## $ RA : int 600 705 588 670 668 614 649 648 707 594 ...
## $ W : int 94 93 97 88 95 94 94 88 93 98 ...
## $ OBP : num 0.32 0.311 0.315 0.335 0.337 0.31 0.327 0.338 0.334 0.322 ...
## $ SLG : num 0.389 0.417 0.411 0.422 0.453 0.404 0.397 0.421 0.446 0.428 ...
## $ BA : num 0.247 0.247 0.251 0.268 0.265 0.238 0.269 0.271 0.273 0.261 ...
## $ Playoffs : int 1 1 1 1 1 1 1 1 1 1 ...
## $ RankSeason : int 4 5 2 6 3 4 4 6 5 1 ...
## $ RankPlayoffs: int 5 4 4 2 3 4 1 3 5 4 ...
## $ G : int 162 162 162 162 162 162 162 162 162 162 ...
## $ OOBP : num 0.306 0.315 0.305 0.314 0.311 0.306 0.313 0.313 0.309 0.303 ...
## $ OSLG : num 0.378 0.403 0.39 0.402 0.419 0.378 0.393 0.387 0.408 0.373 ...
table(baseball$Year) #or
##
## 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1973 1974 1975 1976 1977
## 2 2 2 2 2 2 2 4 4 4 4 4 4 4 4
## 1978 1979 1980 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993
## 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010
## 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8
## 2011 2012
## 8 10
table(table(baseball$Year)) #fancier o/p
##
## 2 4 8 10
## 7 23 16 1
#Ans:2,4,8,10
########################################
#PROBLEM 2.1 - ADDING AN IMPORTANT PREDICTOR
#It's much harder to win the World Series if there are 10 teams competing for the championship versus just two. Therefore, we will add the predictor variable NumCompetitors to the baseball data frame. NumCompetitors will contain the number of total teams making the playoffs in the year of a particular team/year pair. For instance, NumCompetitors should be 2 for the 1962 New York Yankees, but it should be 8 for the 1998 Boston Red Sox.
#We start by storing the output of the table() function that counts the number of playoff teams from each year:
PlayoffTable = table(baseball$Year)
PlayoffTable
##
## 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1973 1974 1975 1976 1977
## 2 2 2 2 2 2 2 4 4 4 4 4 4 4 4
## 1978 1979 1980 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993
## 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010
## 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8
## 2011 2012
## 8 10
#We will use this stored table to look up the number of teams in the playoffs in the year of each team/year pair.
#Just as we can use the names() function to get the names of a data frame's columns, we can use it to get the names of the entries in a table. What best describes the output of names(PlayoffTable)?
names(PlayoffTable)
## [1] "1962" "1963" "1964" "1965" "1966" "1967" "1968" "1969" "1970" "1971"
## [11] "1973" "1974" "1975" "1976" "1977" "1978" "1979" "1980" "1982" "1983"
## [21] "1984" "1985" "1986" "1987" "1988" "1989" "1990" "1991" "1992" "1993"
## [31] "1996" "1997" "1998" "1999" "2000" "2001" "2002" "2003" "2004" "2005"
## [41] "2006" "2007" "2008" "2009" "2010" "2011" "2012"
class(names(PlayoffTable)) #or
## [1] "character"
str(names(PlayoffTable))
## chr [1:47] "1962" "1963" "1964" "1965" "1966" "1967" ...
#Ans:Vector of years stored as strings (type chr)
#EXPLANATION:From the call str(names(PlayoffTable)) we see PlayoffTable has names of type chr, which are the years of the teams in the dataset.
#########################################################
#PROBLEM 2.2 - ADDING AN IMPORTANT PREDICTOR
#Given a vector of names, the table will return a vector of frequencies. Which function call returns the number of playoff teams in 1990 and 2001? (HINT: If you are not sure how these commands work, go ahead and try them out in your R console!)
PlayoffTable[c("1990", "2001")]
##
## 1990 2001
## 4 8
#Ans:PlayoffTable[c("1990", "2001")]
#EXPLANATION:Because PlayoffTable is an object and not a function, we look up elements in it with square brackets instead of parentheses. We build the vector of years to be passed with the c() function. Because the names of PlayoffTable are strings and not numbers, we need to pass "1990" and "2001".
####################################################
#PROBLEM 2.3 - ADDING AN IMPORTANT PREDICTOR
#Putting it all together, we want to look up the number of teams in the playoffs for each team/year pair in the dataset, and store it as a new variable named NumCompetitors in the baseball data frame. Which of the following function calls accomplishes this? (HINT: Test out the functions if you are not sure what they do.)
baseball$NumCompetitors = PlayoffTable[as.character(baseball$Year)]
#Ans:baseball$NumCompetitors = PlayoffTable[as.character(baseball$Year)]
#EXPLANATION:Because PlayoffTable is an object and not a function, we look up elements in it with square brackets instead of parentheses. as.character() is needed to convert the Year variable in the dataset to a string, which we know from the previous parts is needed to look up elements in a table. If you're not sure what a function does, remember you can look it up with the ? function. For instance, you could type ?as.character to look up information about as.character.
##############################################
#PROBLEM 2.4 - ADDING AN IMPORTANT PREDICTOR
#Add the NumCompetitors variable to your baseball data frame. How many playoff team/year pairs are there in our dataset from years where 8 teams were invited to the playoffs?
table(baseball$NumCompetitors)
##
## 2 4 8 10
## 14 92 128 10
#Ans:128
###########################################
#PROBLEM 3.1 - BIVARIATE MODELS FOR PREDICTING WORLD SERIES WINNER
#In this problem, we seek to predict whether a team won the World Series; in our dataset this is denoted with a RankPlayoffs value of 1. Add a variable named WorldSeries to the baseball data frame, by typing the following command in your R console:
baseball$WorldSeries = as.numeric(baseball$RankPlayoffs == 1)
#WorldSeries takes value 1 if a team won the World Series in the indicated year and a 0 otherwise. How many observations do we have in our dataset where a team did NOT win the World Series?
table(baseball$WorldSeries)
##
## 0 1
## 197 47
#Ans:197
########################################
#PROBLEM 3.2 - BIVARIATE MODELS FOR PREDICTING WORLD SERIES WINNER
#When we're not sure which of our variables are useful in predicting a particular outcome, it's often helpful to build bivariate models, which are models that predict the outcome using a single independent variable. Which of the following variables is a significant predictor of the WorldSeries variable in a bivariate logistic regression model? To determine significance, remember to look at the stars in the summary output of the model. We'll define an independent variable as significant if there is at least one star at the end of the coefficients row for that variable (this is equivalent to the probability column having a value smaller than 0.05). Note that you have to build 12 models to answer this question! Use the entire dataset baseball to build the models. (Select all that apply.)
summary(glm(WorldSeries~Year, data=baseball, family="binomial"))
##
## Call:
## glm(formula = WorldSeries ~ Year, family = "binomial", data = baseball)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.0297 -0.6797 -0.5435 -0.4648 2.1504
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 72.23602 22.64409 3.19 0.00142 **
## Year -0.03700 0.01138 -3.25 0.00115 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 239.12 on 243 degrees of freedom
## Residual deviance: 228.35 on 242 degrees of freedom
## AIC: 232.35
##
## Number of Fisher Scoring iterations: 4
#similarly doing for other var one by one
summary(glm(WorldSeries~RA , data=baseball, family="binomial"))
##
## Call:
## glm(formula = WorldSeries ~ RA, family = "binomial", data = baseball)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.9749 -0.6883 -0.6118 -0.4746 2.1577
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.888174 1.483831 1.272 0.2032
## RA -0.005053 0.002273 -2.223 0.0262 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 239.12 on 243 degrees of freedom
## Residual deviance: 233.88 on 242 degrees of freedom
## AIC: 237.88
##
## Number of Fisher Scoring iterations: 4
summary(glm(WorldSeries~RankSeason , data=baseball, family="binomial"))
##
## Call:
## glm(formula = WorldSeries ~ RankSeason, family = "binomial",
## data = baseball)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.7805 -0.7131 -0.5918 -0.4882 2.1781
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.8256 0.3268 -2.527 0.0115 *
## RankSeason -0.2069 0.1027 -2.016 0.0438 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 239.12 on 243 degrees of freedom
## Residual deviance: 234.75 on 242 degrees of freedom
## AIC: 238.75
##
## Number of Fisher Scoring iterations: 4
summary(glm(WorldSeries~NumCompetitors , data=baseball, family="binomial"))
##
## Call:
## glm(formula = WorldSeries ~ NumCompetitors, family = "binomial",
## data = baseball)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.9871 -0.8017 -0.5089 -0.5089 2.2643
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.03868 0.43750 0.088 0.929559
## NumCompetitors -0.25220 0.07422 -3.398 0.000678 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 239.12 on 243 degrees of freedom
## Residual deviance: 226.96 on 242 degrees of freedom
## AIC: 230.96
##
## Number of Fisher Scoring iterations: 4
#Ans:Year,RA,RankSeason ,NumCompetitors
#XPLANATION:The results come from building each bivariate model and looking at its summary. For instance, the result for the variable Year can be obtained by running summary(glm(WorldSeries~Year, data=baseball, family="binomial")). You can save time on repeated model building by using the up arrow in your R terminal. The W and SLG variables were both nearly significant, with p = 0.0577 and 0.0504, respectively.
####################################
#PROBLEM 4.1 - MULTIVARIATE MODELS FOR PREDICTING WORLD SERIES WINNER
#In this section, we'll consider multivariate models that combine the variables we found to be significant in bivariate models. Build a model using all of the variables that you found to be significant in the bivariate models. How many variables are significant in the combined model?
LogModel<-glm(WorldSeries~Year+RankSeason+RA+NumCompetitors, data=baseball, family="binomial")
summary(LogModel)
##
## Call:
## glm(formula = WorldSeries ~ Year + RankSeason + RA + NumCompetitors,
## family = "binomial", data = baseball)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.0336 -0.7689 -0.5139 -0.4583 2.2195
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 12.5874376 53.6474210 0.235 0.814
## Year -0.0061425 0.0274665 -0.224 0.823
## RankSeason -0.0685046 0.1203459 -0.569 0.569
## RA -0.0008238 0.0027391 -0.301 0.764
## NumCompetitors -0.1794264 0.1815933 -0.988 0.323
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 239.12 on 243 degrees of freedom
## Residual deviance: 226.37 on 239 degrees of freedom
## AIC: 236.37
##
## Number of Fisher Scoring iterations: 4
#Ans:0 none
###################################
#PROBLEM 4.2 - MULTIVARIATE MODELS FOR PREDICTING WORLD SERIES WINNER
#Often, variables that were significant in bivariate models are no longer significant in multivariate analysis due to correlation between the variables. Which of the following variable pairs have a high degree of correlation (a correlation greater than 0.8 or less than -0.8)? Select all that apply.
cor(baseball[c("Year","RankSeason","RA","NumCompetitors")])
## Year RankSeason RA NumCompetitors
## Year 1.0000000 0.3852191 0.4762422 0.9139548
## RankSeason 0.3852191 1.0000000 0.3991413 0.4247393
## RA 0.4762422 0.3991413 1.0000000 0.5136769
## NumCompetitors 0.9139548 0.4247393 0.5136769 1.0000000
#or
vars<-c("Year","RankSeason","RA","NumCompetitors")
cor(baseball[,vars])
## Year RankSeason RA NumCompetitors
## Year 1.0000000 0.3852191 0.4762422 0.9139548
## RankSeason 0.3852191 1.0000000 0.3991413 0.4247393
## RA 0.4762422 0.3991413 1.0000000 0.5136769
## NumCompetitors 0.9139548 0.4247393 0.5136769 1.0000000
#Ans:Year/NumCompetitors (0.9139548)
###########################################
#PROBLEM 4.3 - MULTIVARIATE MODELS FOR PREDICTING WORLD SERIES WINNER
#Build all six of the two variable models listed in the previous problem. Together with the four bivariate models, you should have 10 different logistic regression models. Which model has the best AIC value (the minimum AIC value)?
Model1 = glm(WorldSeries ~ Year + RA, data=baseball, family=binomial)
summary(Model1)
##
## Call:
## glm(formula = WorldSeries ~ Year + RA, family = binomial, data = baseball)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.0402 -0.6878 -0.5298 -0.4785 2.1370
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 63.610741 25.654830 2.479 0.0132 *
## Year -0.032084 0.013323 -2.408 0.0160 *
## RA -0.001766 0.002585 -0.683 0.4945
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 239.12 on 243 degrees of freedom
## Residual deviance: 227.88 on 241 degrees of freedom
## AIC: 233.88
##
## Number of Fisher Scoring iterations: 4
Model2 = glm(WorldSeries ~ Year + RankSeason, data=baseball, family=binomial)
summary(Model2)
##
## Call:
## glm(formula = WorldSeries ~ Year + RankSeason, family = binomial,
## data = baseball)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.0560 -0.6957 -0.5379 -0.4528 2.2673
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 63.64855 24.37063 2.612 0.00901 **
## Year -0.03254 0.01231 -2.643 0.00822 **
## RankSeason -0.10064 0.11352 -0.887 0.37534
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 239.12 on 243 degrees of freedom
## Residual deviance: 227.55 on 241 degrees of freedom
## AIC: 233.55
##
## Number of Fisher Scoring iterations: 4
Model3 = glm(WorldSeries ~ Year + NumCompetitors, data=baseball, family=binomial)
summary(Model3)
##
## Call:
## glm(formula = WorldSeries ~ Year + NumCompetitors, family = binomial,
## data = baseball)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.0050 -0.7823 -0.5115 -0.4970 2.2552
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 13.350467 53.481896 0.250 0.803
## Year -0.006802 0.027328 -0.249 0.803
## NumCompetitors -0.212610 0.175520 -1.211 0.226
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 239.12 on 243 degrees of freedom
## Residual deviance: 226.90 on 241 degrees of freedom
## AIC: 232.9
##
## Number of Fisher Scoring iterations: 4
Model4 = glm(WorldSeries ~ RA + RankSeason, data=baseball, family=binomial)
summary(Model4)
##
## Call:
## glm(formula = WorldSeries ~ RA + RankSeason, family = binomial,
## data = baseball)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.9374 -0.6933 -0.5936 -0.4564 2.1979
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.487461 1.506143 0.988 0.323
## RA -0.003815 0.002441 -1.563 0.118
## RankSeason -0.140824 0.110908 -1.270 0.204
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 239.12 on 243 degrees of freedom
## Residual deviance: 232.22 on 241 degrees of freedom
## AIC: 238.22
##
## Number of Fisher Scoring iterations: 4
Model5 = glm(WorldSeries ~ RA + NumCompetitors, data=baseball, family=binomial)
summary(Model5)
##
## Call:
## glm(formula = WorldSeries ~ RA + NumCompetitors, family = binomial,
## data = baseball)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.0433 -0.7826 -0.5133 -0.4701 2.2208
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.716895 1.528736 0.469 0.63911
## RA -0.001233 0.002661 -0.463 0.64313
## NumCompetitors -0.229385 0.088399 -2.595 0.00946 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 239.12 on 243 degrees of freedom
## Residual deviance: 226.74 on 241 degrees of freedom
## AIC: 232.74
##
## Number of Fisher Scoring iterations: 4
Model6 = glm(WorldSeries ~ RankSeason + NumCompetitors, data=baseball, family=binomial)
summary(Model6)
##
## Call:
## glm(formula = WorldSeries ~ RankSeason + NumCompetitors, family = binomial,
## data = baseball)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.0090 -0.7592 -0.5204 -0.4501 2.2562
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.12277 0.45737 0.268 0.78837
## RankSeason -0.07697 0.11711 -0.657 0.51102
## NumCompetitors -0.22784 0.08201 -2.778 0.00546 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 239.12 on 243 degrees of freedom
## Residual deviance: 226.52 on 241 degrees of freedom
## AIC: 232.52
##
## Number of Fisher Scoring iterations: 4
#Bivariate model constructed earlier
summary(glm(WorldSeries~Year, data=baseball, family="binomial"))
##
## Call:
## glm(formula = WorldSeries ~ Year, family = "binomial", data = baseball)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.0297 -0.6797 -0.5435 -0.4648 2.1504
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 72.23602 22.64409 3.19 0.00142 **
## Year -0.03700 0.01138 -3.25 0.00115 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 239.12 on 243 degrees of freedom
## Residual deviance: 228.35 on 242 degrees of freedom
## AIC: 232.35
##
## Number of Fisher Scoring iterations: 4
summary(glm(WorldSeries~RA , data=baseball, family="binomial"))
##
## Call:
## glm(formula = WorldSeries ~ RA, family = "binomial", data = baseball)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.9749 -0.6883 -0.6118 -0.4746 2.1577
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.888174 1.483831 1.272 0.2032
## RA -0.005053 0.002273 -2.223 0.0262 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 239.12 on 243 degrees of freedom
## Residual deviance: 233.88 on 242 degrees of freedom
## AIC: 237.88
##
## Number of Fisher Scoring iterations: 4
summary(glm(WorldSeries~RankSeason , data=baseball, family="binomial"))
##
## Call:
## glm(formula = WorldSeries ~ RankSeason, family = "binomial",
## data = baseball)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.7805 -0.7131 -0.5918 -0.4882 2.1781
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.8256 0.3268 -2.527 0.0115 *
## RankSeason -0.2069 0.1027 -2.016 0.0438 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 239.12 on 243 degrees of freedom
## Residual deviance: 234.75 on 242 degrees of freedom
## AIC: 238.75
##
## Number of Fisher Scoring iterations: 4
summary(glm(WorldSeries~NumCompetitors , data=baseball, family="binomial"))
##
## Call:
## glm(formula = WorldSeries ~ NumCompetitors, family = "binomial",
## data = baseball)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.9871 -0.8017 -0.5089 -0.5089 2.2643
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.03868 0.43750 0.088 0.929559
## NumCompetitors -0.25220 0.07422 -3.398 0.000678 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 239.12 on 243 degrees of freedom
## Residual deviance: 226.96 on 242 degrees of freedom
## AIC: 230.96
##
## Number of Fisher Scoring iterations: 4
#Ans:NumCompetitors
#EXPLANATION:None of the models with two independent variables had both variables significant, so none seem promising as compared to a simple bivariate model. Indeed the model with the lowest AIC value is the model with just NumCompetitors as the independent variable.This seems to confirm the claim made by Billy Beane in Moneyball that all that matters in the Playoffs is luck, since NumCompetitors has nothing to do with the quality of the teams!
######################################################
######################################################################