There are ten questions. Each question may cover several topics. The exam is due on February 23rd and must be submitted online. Questions 1 through 10 are each worth 10 points. The exam should take you 2 to 4 hours.
NOTE: For questions 2, 3, 6, and 7A, include all R code or other methods used. For example, it is possible to use Excel to carry out some of the analysis; if you use Excel or something other than R, take a screenshot of the setup and the results. Basically, show your work.
A hospital is interested in learning about the risk of readmission for people who were treated for a specific chronic ailment. They want to know what contributes to an individual’s risk of readmission. Risk of readmission is based on the likelihood that the person will come back to the hospital within a short period of time. There are several data mining tasks that are relevant in this situation. Discuss the relevant ones and how the analysis may be of value to the hospital.
There are 10 steps in data mining, and most of them are applicable (and self-explanatory) in this scenario.
- Develop an understanding of the purpose of the data mining project.
- Obtain the dataset to be used in the analysis.
For this, the hospital needs to isolate the data for patients with the chronic ailment in question. This can easily be done by applying a filter to the initial data pull.
- Explore, clean, and preprocess the data
I would recommend starting with an exploratory data analysis (EDA). This would allow the hospital to identify which variables are significant, which data points are outliers, and what information may be missing.
- Reduce the data dimension, if necessary.
Since hospital data sets tend to be quite large, they may want to utilize a smaller training partition than the “standard” 60% split.
- Determine the data mining task.
- Partition the data (for supervised tasks).
- Choose the data mining techniques to be used.
The data mining technique would depend on which variables were identified as being highly correlated with readmission. Linear regression tends to be the easiest to understand, so that would be a good starting point; with many predictors, including categorical variables, a multiple regression could be effective.
- Use algorithms to perform the task.
This would be one of the more involved parts of the process, as it is where the hospital would develop multiple variations of the techniques they have selected. Validation data would come into play at this point to provide feedback on each model's performance (a minimal sketch of the partitioning, modeling, and validation steps appears after this list).
- Interpret the results of the algorithms.
- Deploy the model.
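Putting several of these steps together, here is a minimal sketch of what the partitioning, modeling, and validation steps might look like in R. Everything here is hypothetical: the data frame readmit, the 0/1 outcome readmitted_30d, and the predictor names are placeholders, and a logistic regression is used only as one reasonable choice for a yes/no readmission outcome.
#sketch of steps 6-9 on a hypothetical readmit data frame (all column names are placeholders)
set.seed(1)
train.idx <- sample(nrow(readmit), 0.6 * nrow(readmit))   #the "standard" 60% training split
train.re <- readmit[train.idx, ]
valid.re <- readmit[-train.idx, ]
#logistic regression, since readmission within a short window is a yes/no outcome
fit.re <- glm(readmitted_30d ~ age + num_prior_visits + length_of_stay,
              data = train.re, family = binomial)
valid.re$risk <- predict(fit.re, newdata = valid.re, type = "response")  #predicted readmission risk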
Describe a situation where you want to keep outliers. Describe one where you don’t. Give an example of each.
When deciding how to treat an outlier, it is important to know your data set and the intention of the analysis. Say you are looking at elementary student performance scores across a school district: outliers could be useful in identifying groups, or even specific students, who may need extra attention. Knowing your data will also help you quickly identify outliers that should be dropped because of incorrectly entered data. For example, if a student is recorded as having a 991% on an exam, we can assume it is a typo, since scores only go up to 100%.
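As a rough illustration of both cases, here is a short sketch. The data frame scores and the column pct_score are hypothetical placeholders.
#hypothetical scores data frame with a pct_score column (0-100 scale)
summary(scores$pct_score)                        #quick scan for impossible values
boxplot(scores$pct_score, main = "Exam scores")  #visual check for outliers
scores <- scores[scores$pct_score <= 100, ]      #scores above 100% are data-entry errors: drop them
#genuinely low outliers are kept, since they flag students who may need extra attention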
Missing data can be problematic. There are several ways for handling missing data. Given the following table define two data imputation methods which we have covered for filling in the missing values.
| SMBL | DATA1 | DATA2 | DATA3 | DATA4 | DATA5 | DATA6 | DATA7 | DATA8 |
|---|---|---|---|---|---|---|---|---|
| AMCVQ | 42.688 | 1.205 | 2.529 | 51.843 | 83.727 | 293.990 | 317.446 | 225.398 |
| AMCVQ | 122.870 | 4.552 | 5.417 | 141.299 | 235.692 | 752.615 | 687.529 | 579.124 |
| FLMIQ | 30.316 | 334.278 | 1018.666 | 1494.990 | 1154.597 | 3923.971 | 1598.786 | 949.843 |
| FLMIQ | 5.967 | 450.905 | NA | 1587.916 | 1281.084 | 3490.832 | 1554.884 | 820.065 |
| FLMIQ | 6.683 | 496.159 | 997.805 | 1728.750 | 1283.719 | 3573.218 | 1539.465 | 838.176 |
| GRA | 78.300 | 975.700 | 514.200 | 2228.900 | NA | 6230.600 | 3228.300 | 1730.100 |
| GRA | 40.600 | 596.800 | 491.900 | 1681.300 | 2214.200 | 6297.600 | 3154.900 | 1736.100 |
| GRA | 68.300 | 831.400 | 376.100 | 1774.900 | 1487.100 | 4945.800 | 3307.900 | 1871.300 |
| 3ABNTQ | 10.769 | 24.714 | NA | 57.870 | 43.906 | 217.751 | 92.565 | 50.138 |
| 3ABNTQ | 9.101 | 26.922 | 24.752 | 66.764 | NA | 201.567 | 113.664 | 53.285 |
| 3ABNTQ | 14.009 | 22.981 | 21.229 | 66.798 | 138.719 | 184.281 | 105.423 | 42.931 |
| ENRNQ | 114.917 | 1576.158 | 111.463 | 2726.907 | 2431.992 | 13238.940 | 11107.180 | 6868.435 |
| ENRNQ | 256.000 | 2255.000 | 164.000 | 3979.000 | 3708.000 | 16137.000 | 11348.000 | 7112.000 |
| ENRNQ | 170.000 | 2151.000 | NA | 4669.000 | 4412.000 | 23422.000 | 13742.000 | 9170.000 |
| PTFC | 49.095 | 68.454 | 327.389 | 460.970 | 318.313 | 1563.586 | 1099.739 | 612.082 |
| PTFC | 43.474 | 62.420 | 283.631 | 404.144 | 1553.831 | 1228.061 | 1001.243 | 471.075 |
| PTFC | 51.759 | 49.722 | 268.550 | 387.359 | 240.931 | 1020.457 | 309.609 | 287.098 |
| RVSIQ | 0.465 | 2.157 | 2.136 | 5.449 | 6.215 | 7.889 | 3.191 | 1.317 |
| RVSIQ | 3.063 | 3.412 | NA | 10.406 | 5.742 | 14.988 | 3.226 | 1.923 |
| RVSIQ | 17.407 | 11.884 | 8.074 | 39.894 | 15.313 | 47.288 | 7.269 | 3.999 |
| CQB | 97.863 | 361.198 | 421.806 | 902.867 | 488.422 | 2596.127 | 1892.729 | 1177.823 |
| CQB | 96.924 | 296.494 | 428.260 | 846.513 | 612.870 | 2416.790 | 1807.437 | 1071.341 |
| CQB | 70.428 | 274.323 | 392.190 | 772.355 | NA | 2262.492 | 1806.793 | 1005.606 |
The easiest method of filling in missing values is simply to copy the value directly before or after it. A second, and possibly more accurate, method is to insert the mean value of the column (or of the column within the same symbol group); a sketch of that approach appears after the filled table below.
#select one of your methods to fill in the missing values in the table
#fill each NA with the value directly above it (fill() is from tidyr, kable() from knitr)
library(tidyr)
library(knitr)
twob2 <- fill(twob, 2:9, .direction = "down")
kable(twob2)
| SMBL | DATA1 | DATA2 | DATA3 | DATA4 | DATA5 | DATA6 | DATA7 | DATA8 |
|---|---|---|---|---|---|---|---|---|
| AMCVQ | 42.688 | 1.205 | 2.529 | 51.843 | 83.727 | 293.990 | 317.446 | 225.398 |
| AMCVQ | 122.870 | 4.552 | 5.417 | 141.299 | 235.692 | 752.615 | 687.529 | 579.124 |
| FLMIQ | 30.316 | 334.278 | 1018.666 | 1494.990 | 1154.597 | 3923.971 | 1598.786 | 949.843 |
| FLMIQ | 5.967 | 450.905 | 1018.666 | 1587.916 | 1281.084 | 3490.832 | 1554.884 | 820.065 |
| FLMIQ | 6.683 | 496.159 | 997.805 | 1728.750 | 1283.719 | 3573.218 | 1539.465 | 838.176 |
| GRA | 78.300 | 975.700 | 514.200 | 2228.900 | 1283.719 | 6230.600 | 3228.300 | 1730.100 |
| GRA | 40.600 | 596.800 | 491.900 | 1681.300 | 2214.200 | 6297.600 | 3154.900 | 1736.100 |
| GRA | 68.300 | 831.400 | 376.100 | 1774.900 | 1487.100 | 4945.800 | 3307.900 | 1871.300 |
| 3ABNTQ | 10.769 | 24.714 | 376.100 | 57.870 | 43.906 | 217.751 | 92.565 | 50.138 |
| 3ABNTQ | 9.101 | 26.922 | 24.752 | 66.764 | 43.906 | 201.567 | 113.664 | 53.285 |
| 3ABNTQ | 14.009 | 22.981 | 21.229 | 66.798 | 138.719 | 184.281 | 105.423 | 42.931 |
| ENRNQ | 114.917 | 1576.158 | 111.463 | 2726.907 | 2431.992 | 13238.940 | 11107.180 | 6868.435 |
| ENRNQ | 256.000 | 2255.000 | 164.000 | 3979.000 | 3708.000 | 16137.000 | 11348.000 | 7112.000 |
| ENRNQ | 170.000 | 2151.000 | 164.000 | 4669.000 | 4412.000 | 23422.000 | 13742.000 | 9170.000 |
| PTFC | 49.095 | 68.454 | 327.389 | 460.970 | 318.313 | 1563.586 | 1099.739 | 612.082 |
| PTFC | 43.474 | 62.420 | 283.631 | 404.144 | 1553.831 | 1228.061 | 1001.243 | 471.075 |
| PTFC | 51.759 | 49.722 | 268.550 | 387.359 | 240.931 | 1020.457 | 309.609 | 287.098 |
| RVSIQ | 0.465 | 2.157 | 2.136 | 5.449 | 6.215 | 7.889 | 3.191 | 1.317 |
| RVSIQ | 3.063 | 3.412 | 2.136 | 10.406 | 5.742 | 14.988 | 3.226 | 1.923 |
| RVSIQ | 17.407 | 11.884 | 8.074 | 39.894 | 15.313 | 47.288 | 7.269 | 3.999 |
| CQB | 97.863 | 361.198 | 421.806 | 902.867 | 488.422 | 2596.127 | 1892.729 | 1177.823 |
| CQB | 96.924 | 296.494 | 428.260 | 846.513 | 612.870 | 2416.790 | 1807.437 | 1071.341 |
| CQB | 70.428 | 274.323 | 392.190 | 772.355 | 612.870 | 2262.492 | 1806.793 | 1005.606 |
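The second method mentioned above, mean imputation, could be sketched as follows. This assumes the same twob data frame and uses dplyr; replacing each NA with the mean of its SMBL group would likely be more accurate than the overall column mean, since the symbols are on very different scales.
#fill each NA with the mean of that column within the same SMBL group (requires dplyr)
library(dplyr)
twob3 <- twob %>%
  group_by(SMBL) %>%
  mutate(across(starts_with("DATA"), ~ ifelse(is.na(.x), mean(.x, na.rm = TRUE), .x))) %>%
  ungroup()
kable(twob3)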
NOTE: For this question you will use the data set “dataforquestion3” that is provided in the Exam module.
Split the data into 10 roughly equal partitions. That means you will have roughly 63 (633/10) records in each partition. Determine the number of records with a value of BK = 0 in each partition.
library(readxl)  #read_excel() comes from the readxl package
three <- read_excel("D:/MSBA/3-Winter 2020/560/midterm/MidtermData.xlsx",sheet = "Q3DATA")
## New names:
## * DATA1 -> DATA1...27
## * DATA4 -> DATA4...28
## * DATA5 -> DATA5...29
## * DATA6 -> DATA6...30
## * DATA8x -> DATA8x...31
## * ... and 35 more problems
#manual method of splitting
part1<-three[1:63,]
part2<-three[64:126,]
part3<-three[127:189,]
part4<-three[190:252,]
part5<-three[253:314,]
part6<-three[315:377,]
part7<-three[378:441,]
part8<-three[442:504,]
part9<-three[505:567,]
part10<-three[568:633,]
cat(paste("Part 1 Record Count where BK is '0'= ",nrow(part1)-sum(part1$bktype)),
paste("Part 2 Record Count where BK is '0'= ",nrow(part2)-sum(part2$bktype)),
paste("Part 3 Record Count where BK is '0'= ",nrow(part3)-sum(part3$bktype)),
paste("Part 4 Record Count where BK is '0'= ",nrow(part4)-sum(part4$bktype)),
paste("Part 5 Record Count where BK is '0'= ",nrow(part5)-sum(part5$bktype)),
paste("Part 6 Record Count where BK is '0'= ",nrow(part6)-sum(part6$bktype)),
paste("Part 7 Record Count where BK is '0'= ",nrow(part7)-sum(part7$bktype)),
paste("Part 8 Record Count where BK is '0'= ",nrow(part8)-sum(part8$bktype)),
paste("Part 9 Record Count where BK is '0'= ",nrow(part9)-sum(part9$bktype)),
paste("Part 10 Record Count where BK is '0'= ",nrow(part10)-sum(part10$bktype)),
paste("Total Record Count where BK is '0'= ",nrow(three)-sum(three$bktype)),
sep = '\n')
## Part 1 Record Count where BK is '0'= 13
## Part 2 Record Count where BK is '0'= 6
## Part 3 Record Count where BK is '0'= 15
## Part 4 Record Count where BK is '0'= 4
## Part 5 Record Count where BK is '0'= 6
## Part 6 Record Count where BK is '0'= 10
## Part 7 Record Count where BK is '0'= 10
## Part 8 Record Count where BK is '0'= 3
## Part 9 Record Count where BK is '0'= 5
## Part 10 Record Count where BK is '0'= 7
## Total Record Count where BK is '0'= 79
#loop method of splitting (random assignment, so partition sizes are only approximately equal)
partitions<- split(three, sample(1:10, nrow(three), replace=TRUE))
for (i in 1:10){
assign(paste("partition", i, sep=""), as.data.frame(partitions[i]))
}
x= c(sum(partition1$X1.bktype==0),
sum(partition2$X2.bktype==0),
sum(partition3$X3.bktype==0),
sum(partition4$X4.bktype==0),
sum(partition5$X5.bktype==0),
sum(partition6$X6.bktype==0),
sum(partition7$X7.bktype==0),
sum(partition8$X8.bktype==0),
sum(partition9$X9.bktype==0),
sum(partition10$X10.bktype==0))
t=c("Total",sum(three$bktype==0))
count_frame <- data.frame(frame=seq(1,10,1), records_where_bk_0 = x)
count_frame <- rbind(count_frame, t)
count_frame
## frame records_where_bk_0
## 1 1 6
## 2 2 15
## 3 3 11
## 4 4 9
## 5 5 5
## 6 6 8
## 7 7 11
## 8 8 5
## 9 9 4
## 10 10 5
## 11 Total 79
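Note that sample(1:10, nrow(three), replace = TRUE) assigns each record to a partition independently, so the partition sizes (and therefore the counts above) are only approximately equal and vary between runs. A small variant that forces the partitions to be as close to equal-sized as possible, reusing the same three data frame:
#shuffle exactly balanced partition labels, then split on them
set.seed(22)
grp <- sample(rep(1:10, length.out = nrow(three)))     #63-64 rows per label
equal_parts <- split(three, grp)
sapply(equal_parts, function(p) sum(p$bktype == 0))    #BK = 0 count per partition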
Generate a subset of the data set with all the records that have a value of BK = 0 (should be 79) and an equal number of BK=1 (randomly selected from the remaining records, BK=1). Calculate the mean for DATA1 for each type BK=0, BK=1.
names(three)[27]<-"DATA_1"
three$DATA_1 <-as.numeric(three$DATA_1)
sub.bk0 <-subset(three,bktype == 0)
sub.bk1 <-subset(three,bktype == 1)
sub.bk1 <- sub.bk1[1:79,]  #take the first 79 BK = 1 records (a random sample of 79 would also satisfy the prompt)
sub.three<-rbind(sub.bk0,sub.bk1)
cat(paste("DATA1 Mean when BK is '0'= ",mean(sub.bk0$DATA_1)),
paste("DATA1 Mean when BK is '1'= ",mean(sub.bk1$DATA_1)),
sep = '\n')
## DATA1 Mean when BK is '0'= 26.7249777351835
## DATA1 Mean when BK is '1'= 96.9087426104911
Conduct a t test for difference in the means of the two groups in the subset.
#reduce subset to necessary fields
sub.three.t <-as.data.frame(sub.three[,c(8,27)])
#compute summary statistics by BK value (group_by and summarise are from the dplyr package)
group_by(sub.three.t,bktype) %>%
summarise(
count = n(),
mean = mean(DATA_1, na.rm = TRUE),
sd = sd(DATA_1, na.rm = TRUE)
)
## # A tibble: 2 x 4
## bktype count mean sd
## <dbl> <int> <dbl> <dbl>
## 1 0 79 26.7 67.6
## 2 1 79 96.9 228.
#visualize the data using box plots (ggboxplot is from the ggpubr package)
ggboxplot(sub.three.t, x = "bktype", y = "DATA_1", title = "BK 0 vs. BK 1",
color = "bktype", palette = c("blueviolet", "darkgoldenrod1"),
ylab = "DATA 1", xlab = "BK Type")
#compute t-test
res<-t.test(DATA_1 ~ bktype, data = sub.three.t, var.equal = TRUE)
res
##
## Two Sample t-test
##
## data: DATA_1 by bktype
## t = -2.6279, df = 156, p-value = 0.00945
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -122.93860 -17.42893
## sample estimates:
## mean in group 0 mean in group 1
## 26.72498 96.90874
How many principal components should we use? Explain.
Following the Pareto principle, I would utilize the first three principal components, as they account for 81.4% of the variance in this model. Additionally, the first three components are the only ones with eigenvalues greater than 1, which indicates high practical significance. Using these three PCs should allow us to build a model that is not overfit; if we find that accuracy is lacking, we could then consider adding the fourth PC.
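For reference, these quantities come straight out of R's PCA output. A minimal sketch, assuming the numeric predictors used for the PCA are in a data frame called pcadata (a placeholder name):
pca <- prcomp(pcadata, scale. = TRUE)   #standardize the variables, then extract the PCs
summary(pca)$importance                 #proportion and cumulative proportion of variance explained
eigenvalues <- pca$sdev^2               #the eigenvalue of each PC is its variance
which(eigenvalues > 1)                  #Kaiser criterion: keep PCs with eigenvalues greater than 1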
Discuss the weights for the first four PCs – PC1 to PC4. How do we interpret them?
The weights of the PCs are given by their eigenvector (loading) scores. A larger absolute value indicates a stronger relationship between the component and that variable, with the sign giving the direction. A summary of the PC weights could be interpreted as follows (a short sketch of how to extract these loadings follows the list):
- PC1: strong positive weights on age, residence, savings, and employ. Combined with negative weights on debt and credit cards, this could indicate a financially stable group of people.
- PC2: negative weights on debt and credit cards. Income is low, and the negative weights on age, residence, and employ could indicate a younger population that has not yet established lines of credit.
- PC3: strong negative weights on income, education, and credit cards. With a negative weight on age and lower employment, this could indicate students.
- PC4: strong positive weights on credit cards and savings but a negative weight on income. This could indicate someone who is a low earner or underemployed but fiscally savvy.
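The loadings discussed in the list above can be read directly from the rotation matrix; a one-line sketch, reusing the hypothetical pca object from the previous sketch:
round(pca$rotation[, 1:4], 2)   #rows = original variables, columns = PC1 to PC4 weights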
- The Technical Competence Factor: very high potential, company fit, and job fit but low scores on the “soft-skill” categories
- The Likability Factor: confident, attractive, and likable with some potential but limited “technical” skills
- The Customer Service Factor: essentially no experience but a great communicator who is extremely organized
- The Referral Factor: an amazing resume and letter with moderate experience
Calculate the confusion matrix for cutoff of 0.6 and 0.8.
AY <- c(0,1,0,1,1,0,1)
PY <- c(0.5,0.9,0.7,0.7,0.3,0.4,0.5)
DF <-data.frame("Actual Y"=AY,"Predicted Y"=PY)
pred.6<- ifelse(DF$Predicted.Y>0.6, 1,0)
pred.8<- ifelse(DF$Predicted.Y>0.8, 1,0)
DFp<-data.frame("Output for threshold 0.6"=pred.6, "Output for threshold 0.8"=pred.8)
new.DF<-cbind(DF,DFp)
new.DF
## Actual.Y Predicted.Y Output.for.threshold.0.6 Output.for.threshold.0.8
## 1 0 0.5 0 0
## 2 1 0.9 1 1
## 3 0 0.7 1 0
## 4 1 0.7 1 0
## 5 1 0.3 0 0
## 6 0 0.4 0 0
## 7 1 0.5 0 0
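The two confusion matrices can be read off the table above or tabulated directly from the vectors already defined; a short sketch:
table(Predicted = pred.6, Actual = AY)   #confusion matrix for the 0.6 cutoff
table(Predicted = pred.8, Actual = AY)   #confusion matrix for the 0.8 cutoff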
Provide the recall, precision, accuracy, and F1 Measure values for both 0.6 and 0.8
y <- as.factor(new.DF$Actual.Y)
predictions.6 <- as.factor(new.DF$Output.for.threshold.0.6)
predictions.8 <- as.factor(new.DF$Output.for.threshold.0.8)
cm.6 <- confusionMatrix(predictions.6, reference = y)  #confusionMatrix() is from the caret package
cm.8 <- confusionMatrix(predictions.8, reference = y)
w<- c('Recall', 'Precision', 'Accuracy', 'F1 Measure')
c.6<-c(cm.6$byClass["Recall"],cm.6$byClass["Precision"],cm.6$byClass["Balanced Accuracy"],cm.6$byClass["F1"])
c.8<-c(cm.8$byClass["Recall"],cm.8$byClass["Precision"],cm.8$byClass["Balanced Accuracy"],cm.8$byClass["F1"])
cm.DF<-data.frame("Value"=w,"Cutoff Scores for 0.6"=c.6,"Cutoff Scores for 0.8"=c.8)
cm.DF[,c(2:3)]
## Cutoff.Scores.for.0.6 Cutoff.Scores.for.0.8
## Recall 0.6666667 1.0000000
## Precision 0.5000000 0.5000000
## Balanced Accuracy 0.5833333 0.6250000
## F1 0.5714286 0.6666667
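As a check on these figures, the same metrics can be written out by hand from the 0.6 confusion matrix. Note that confusionMatrix() treats the first factor level ("0") as the positive class by default, and the accuracy figures reported above are caret's balanced accuracy:
#by-hand check for the 0.6 cutoff, treating 0 as the positive class
TP <- sum(AY == 0 & pred.6 == 0); FN <- sum(AY == 0 & pred.6 == 1)
FP <- sum(AY == 1 & pred.6 == 0); TN <- sum(AY == 1 & pred.6 == 1)
recall    <- TP / (TP + FN)                                 #2/3
precision <- TP / (TP + FP)                                 #2/4
f1        <- 2 * precision * recall / (precision + recall)
bal.acc   <- (recall + TN / (TN + FP)) / 2                  #mean of sensitivity and specificity
c(Recall = recall, Precision = precision, F1 = f1, Balanced.Accuracy = bal.acc)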
train <- read_excel("D:/MSBA/3-Winter 2020/560/midterm/MidtermData.xlsx",sheet = "Q6Train")
valid <- read_excel("D:/MSBA/3-Winter 2020/560/midterm/MidtermData.xlsx",sheet = "Q6Test")
#convert the categorical variables to integer codes
df<-rbind(train,valid)
df$Age<-as.integer(as.factor(df$Age))
df$`Spectacle Prescription`<-as.integer(as.factor(df$`Spectacle Prescription`))
df$Astigmatic<-as.integer(as.factor(df$Astigmatic))
df$`Tear Production Rate`<-as.integer(as.factor(df$`Tear Production Rate`))
df$`Class Label`<-as.integer(as.factor(df$`Class Label`))
train.df <- as.data.frame(df[1:10,])
valid.df <- as.data.frame(df[11:20,])
# initialize a data frame with two columns: k and RMSE (RMSE() is from the caret package)
accuracy.df <- data.frame(k = seq(1,5,by=1), RMSE = rep(0,5))
# compute knn for different k on validation.
set.seed(22)
for(i in 1:5){
knn.pred<-class::knn(train = train.df,
test = valid.df,
cl = train.df[,6], k = i)
accuracy.df[i,2]<-RMSE(as.numeric(as.character(knn.pred)),valid.df[,6])
}
as.data.frame(accuracy.df[3:5,])
## k RMSE
## 3 3 1.1832160
## 4 4 0.5477226
## 5 5 0.9486833
A k value of 4 should be used since it produces the lowest RMSE (the best fit).
#When building a decision tree we want to choose the feature with the least Gini Index as the root node.
#calculate for 'Color'
p.Cg = 2/5
p.Cy = 2/5
p.Cb = 1/5
#grey
Cg.Ae = 2/2
Cg.Ag = 0/2
Cg.Am = 0/2
Cg.At = 0/2
GI.Cg = 1-(Cg.Ae^2 + Cg.Ag^2 + Cg.Am^2 + Cg.At^2)
#yellow
Cy.Ae = 0/2
Cy.Ag = 1/2
Cy.Am = 0/2
Cy.At = 1/2
GI.Cy = 1-(Cy.Ae^2 + Cy.Ag^2 + Cy.Am^2 + Cy.At^2)
#brown
Cb.Ae = 0/1
Cb.Ag = 0/1
Cb.Am = 1/1
Cb.At = 0/1
GI.Cb = 1-(Cb.Ae^2 + Cb.Ag^2 + Cb.Am^2 + Cb.At^2)
#weighted sum
WGI.C = (p.Cg*GI.Cg)+(p.Cy*GI.Cy)+(p.Cb*GI.Cb)
#calculate for 'Height'
p.h10 = 3/5
p.h4 = 1/5
p.h3 = 1/5
#10
h10.Ae = 2/3
h10.Ag = 1/3
h10.Am = 0/3
h10.At = 0/3
GI.h10 = 1-(h10.Ae^2 + h10.Ag^2 + h10.Am^2 + h10.At^2)
#4
h4.Ae = 0/1
h4.Ag = 0/1
h4.Am = 0/1
h4.At = 1/1
GI.h4 = 1-(h4.Ae^2 + h4.Ag^2 + h4.Am^2 + h4.At^2)
#3
h3.Ae = 0/1
h3.Ag = 0/1
h3.Am = 1/1
h3.At = 0/1
GI.h3 = 1-(h3.Ae^2 + h3.Ag^2 + h3.Am^2 + h3.At^2)
#weighted sum
WGI.h = (p.h10*GI.h10)+(p.h4*GI.h4)+(p.h3*GI.h3)
#compare Gini Index and choose the lowest to be the root node
af <- c('Color','Height')
gi <- c(WGI.C,WGI.h)
df<-data.frame("Features"=af,"Gini Index"=gi)
df
## Features Gini.Index
## 1 Color 0.2000000
## 2 Height 0.2666667
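The per-branch arithmetic above can be wrapped in a small reusable helper. The animal labels below are placeholders chosen only to reproduce the class counts used in the hand calculation, so treat this as a sketch rather than the exam's actual data:
#weighted Gini index of a label vector after splitting on a categorical feature
weighted.gini <- function(feature, label) {
  sum(sapply(split(label, feature), function(branch) {
    p <- table(branch) / length(branch)                #class proportions within the branch
    (length(branch) / length(label)) * (1 - sum(p^2))  #branch weight times Gini impurity
  }))
}
color  <- c("grey", "grey", "yellow", "yellow", "brown")          #placeholder records
height <- c(10, 10, 10, 4, 3)
animal <- c("elephant", "elephant", "giraffe", "tiger", "monkey")
weighted.gini(color, animal)    #0.2, matching WGI.C above
weighted.gini(height, animal)   #0.2667, matching WGI.h above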
Say you have 1000 fruits which could be either ‘banana’, ‘orange’ or ‘other’. These are the 3 possible classes of the Y variable.
We have data for the following X variables, all of which are binary (1 or 0):
• Long
• Sweet
• Yellow
The following table is an example of the data:
| Fruit | Long (x1) | Sweet (x2) | Yellow (x3) |
|---|---|---|---|
| Orange | 0 | 1 | 0 |
| Banana | 1 | 0 | 1 |
| Banana | 1 | 1 | 1 |
| Other | 1 | 1 | 0 |
If a fruit is long, sweet, and yellow, what is it? Show your work.
We can utilize the Naïve Bayes theorem to solve this problem. For each fruit class we compute

$$P(\text{fruit} \mid \text{Long, Sweet, Yellow}) = \frac{P(\text{Long} \mid \text{fruit})\,P(\text{Sweet} \mid \text{fruit})\,P(\text{Yellow} \mid \text{fruit})\,P(\text{fruit})}{P(\text{Long})\,P(\text{Sweet})\,P(\text{Yellow})}$$
#compute the probabilities for the types of fruit
p.b = 500/1000
p.o = 300/1000
p.r = 200/1000
#compute the probability of traits that will go in the denominator
#NOTE: we could technically skip this, as the denominator is the same for all the fruits and so does not change which fruit scores highest
p.l = 500/1000
p.s = 650/1000
p.y = 800/1000
#compute the conditional probability (likelihood) of each trait for each type of fruit
#BANANA
p.lb = 400/500
p.sb = 350/500
p.yb = 450/500
tp.b = p.lb*p.sb*p.yb
#ORANGE
p.lo = 0/300
p.so = 150/300
p.yo = 300/300
tp.o = p.lo*p.so*p.yo
#OTHER
p.lr = 100/200
p.sr = 150/200
p.yr = 50/200
tp.r = p.lr*p.sr*p.yr
#input the 3 equations into the Naive Bayes formula to calculate probability
d=(p.l*p.s*p.y)
pb.lsy = (p.lb*p.sb*p.yb*p.b)/d
po.lsy = (p.lo*p.so*p.yo*p.o)/d
pr.lsy = (p.lr*p.sr*p.yr*p.r)/d
f<-c('Banana','Orange','Other')
p<-c(pb.lsy,po.lsy,pr.lsy)
df<-data.frame("Fruit"=f,"Probability of Being Long, Sweet, and Yellow"=p)
df
## Fruit Probability.given.Long..Sweet..and.Yellow
## 1 Banana 0.96923077
## 2 Orange 0.00000000
## 3 Other 0.07211538
If a fruit is long, sweet, and yellow, it will most likely be a BANANA.
Using a data set that relates stopping distance and speed, the following is the result of conducting linear regression on the results:
Residuals:
    Min      1Q  Median      3Q     Max
-29.069  -9.525  -2.272   9.215  43.201

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -17.5791     6.7584  -2.601   0.0123 *
speed         3.9324     0.4155   9.464 1.49e-12 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 15.38 on 48 degrees of freedom
Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
Provide an interpretation of the results.
**Residuals:** Residuals give the difference between the model's predicted values and the actual values (in this data set, the stopping distance). Our residuals are not very symmetrically distributed, which means some of the model's predictions were far from the actual values.
**Coefficients:** The coefficients in a linear regression represent the intercept and the slope of the fitted line. If the model is accurate, the line will fit as closely as possible to the points in the data set. The coefficient section has several components:
- Estimate - The first estimate is the intercept (here it says the predicted stopping distance at a speed of 0 is -17.5791 feet, which is questionable given the context of the data). The second estimate is the slope, i.e. the effect of speed on stopping distance: for every 1 mph increase in speed, the required stopping distance increases by 3.9324 feet.
- Standard Error - This measures the average amount that the coefficient estimates vary from the actual average response. We want smaller numbers here, and unsurprisingly the intercept's Std. Error is relatively high (again calling that negative intercept into question).
- t value - This measures how many standard errors the estimate is away from 0. The farther from zero, the more confident we can be in rejecting the null hypothesis and concluding that there is a relationship between speed and distance. The t values are also used to calculate the p-values.
- Pr(>|t|) - This is the probability of observing a t value at least as extreme as the one estimated if there were no true relationship. Generally speaking, a p-value of 5% or less is a good indication of a relationship in the data. R makes this easy to interpret by including "Signif. codes" stars next to the p-values to flag highly significant values.
**Residual Standard Error:** The residual standard error (RSE) measures the quality of the linear regression fit. Since no model is perfect, the RSE gives an estimate of the typical size of the deviations (in our data, about 15.38 feet). The degrees of freedom are the number of data points available after accounting for the parameters estimated (in this case 2: intercept and slope), giving 50 - 2 = 48.
**R-Squared:** R² measures how well the model fits the actual data; in this example it captures the strength of the linear relationship between speed and distance. We want it as close to 1 as we can get without overfitting. The model above has an R² of about 0.65, which means roughly 65% of the variation in stopping distance is explained by speed. This is a reasonably strong R², suggesting that speed is a good predictor of stopping distance.
**F-Statistic:** The F-statistic helps determine whether there is a relationship between the predictor and the response variables. The further it is above 1, the stronger the evidence (this is of course affected by the sample size). This model has an F-statistic of 89.57, much larger than 1, further confirming that the model captures a real relationship.
If I have a speed of 12 miles per hour, what will my stopping distance be?
To calculate the stopping distance we want to use the following equation:
Y = mX + C

- Y = stopping distance
- X = speed
- m = the coefficient of speed (the slope)
- C = the intercept
X = 12
m = 3.9324
C = -17.5791
Y = m * X + C
cat("It will take",Y,"feet to stop.")
## It will take 29.6097 feet to stop.
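The coefficients and fit statistics above match the regression of dist on speed in R's built-in cars data set (stopping distance in feet versus speed in mph). Assuming that is the source, the model and this 12-mph prediction could also be reproduced with lm() and predict():
fit <- lm(dist ~ speed, data = cars)             #refit the model on the built-in cars data
summary(fit)                                     #reproduces the output summarized above
predict(fit, newdata = data.frame(speed = 12))   #29.6097 feet, the same answer as above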
Which variable selection methods are most likely computationally intensive – they will take longer to run than the other methods due to the amount of computation required or the amount of memory required? Explain.
Any of the wrapper methods of variable selection are going to be computationally intensive because they are iterative: they repeatedly fit and evaluate candidate models to generate a sequence of outcomes. The most common wrapper methods are listed below (a sketch of stepwise selection follows the list):
- Forward Selection
- Backward Elimination
- Recursive Feature Elimination
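A minimal sketch of forward selection and backward elimination with base R's step(), assuming a data frame dat with response y and a set of candidate predictors (placeholder names); each step refits many candidate models, which is exactly why wrapper methods are expensive:
full.model <- lm(y ~ ., data = dat)   #model containing every candidate predictor
null.model <- lm(y ~ 1, data = dat)   #intercept-only model
step(null.model, scope = formula(full.model), direction = "forward")   #forward selection
step(full.model, direction = "backward")                               #backward elimination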
Embedded methods are also going to be computationally intensive, since they combine the iterative qualities of wrapper methods with the simpler (and therefore quicker) filter methods. The most popular embedded methods are listed below (a brief glmnet sketch follows the list):
- LASSO Regression
- Ridge Regression
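A brief sketch of the embedded approach using the glmnet package, assuming a numeric predictor matrix X and a response vector y (placeholders); the penalty is estimated as part of fitting the model, which is what makes these methods "embedded":
library(glmnet)
lasso.fit <- cv.glmnet(X, y, alpha = 1)   #alpha = 1 gives the LASSO penalty
ridge.fit <- cv.glmnet(X, y, alpha = 0)   #alpha = 0 gives the ridge penalty
coef(lasso.fit, s = "lambda.min")         #LASSO shrinks some coefficients exactly to zero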
There are also numerous methods that do not fit neatly into one of the basic selection categories. These alternatives often require the model to “manually” search through the data in order to produce results, which naturally means longer processing times on larger and/or more complex data sets. Two examples of these alternative methods that we have used this quarter are:
- k-NN algorithms (which must find the nearest matching neighbors for every record)
- Trees (which involve splitting the data into various combinations):
  - Classification
  - Regression
  - Boosted
  - Pruned