set.seed(100)
library(class)
library(data.table)
library(tidyverse)
library(DT)
library(ROCR)

Directions: This midterm contains machine learning topics that were covered in lectures as well as related statistics and coding questions. The exam must be completed using R code. You are free to use any functions from R’s packages unless stated otherwise. For each question, please show your work, including all relevant code and explanations.

Policies: This exam is open book and open note. You may use any materials that you find helpful in solving the problems. However, you must explain your answers in your own words and cite any sources. No collaboration with others is allowed. Please do not discuss the midterm with anyone during the exam period.

Question 1: K Nearest Neighbors and Cross Validation

How many neighbors should we choose in KNN to create the most accurate predictions? One approach is to utilize cross-validation to estimate the error that would be obtained on a testing set.

To evaluate this question, we will use the data set contained in humberside_leukaemia_lymphoma.csv. This file contains similar information to the humberside data.frame that is available in the spatstat.data library. The CSV file includes some additional information that will be useful for this problem. In particular, the variables include:

To utilize KNN and Cross Validation, we will take the following steps:

Development Notes: We recommend using programming techniques to simplify the work of iteratively building many models. Writing a function that fits the model and calculates the predictive accuracy may help.
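The development notes above can be sketched as follows. This is a minimal illustration rather than the exam's reference solution: the data are simulated, and the column names x, y, case, and group, as well as the helper functions knn_accuracy and cv_accuracy, are assumptions introduced here.

```r
library(class)

# Simulated stand-in for the exam data. The column names x, y, case, and
# group are assumptions, not taken from the CSV file.
set.seed(100)
n <- 200
dat <- data.frame(
  x = runif(n, 4500, 5500),
  y = runif(n, 4000, 5000),
  case = rbinom(n, 1, 0.3),
  group = sample(rep(1:5, length.out = n))
)

# Accuracy of KNN with k neighbors when one group is held out for validation.
knn_accuracy <- function(dat, k, holdout) {
  train <- dat[dat$group != holdout, ]
  test <- dat[dat$group == holdout, ]
  pred <- knn(
    train = train[, c("x", "y")],
    test = test[, c("x", "y")],
    cl = factor(train$case),
    k = k
  )
  mean(as.character(pred) == as.character(test$case))
}

# Average the held-out accuracy over all groups for each candidate k.
cv_accuracy <- function(dat, k_values) {
  groups <- sort(unique(dat$group))
  sapply(k_values, function(k) {
    mean(sapply(groups, function(g) knn_accuracy(dat, k, g)))
  })
}

k_values <- seq(1, 15, by = 2)
cv_results <- data.frame(k = k_values, accuracy = cv_accuracy(dat, k_values))
best_k <- cv_results$k[which.max(cv_results$accuracy)]
cv_results
```

Wrapping the fit and the accuracy calculation in one function keeps the loop over candidate values of k short, and the same function can be reused for any choice of held-out group.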

Question 2: Logistic Regression

Using the same data as in the previous question, we will use logistic regression to create a predictive model of the disease in terms of the spatial coordinates x and y. For this model, use group 5 as the testing set and all of the other groups as the training set. Then answer the following questions:

2a

Show a summary table of the model’s coefficients, rounded to a reasonable number of digits.

            Estimate Std. Error z value Pr(>|z|)
(Intercept)    7.407     13.285   0.558    0.577
x             -0.001      0.002  -0.498    0.619
y              0.000      0.002  -0.294    0.769
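A table of this form could be produced along the following lines. The sketch uses simulated data with assumed column names (x, y, case, group), so its estimates will not match the values shown above.

```r
# Simulated stand-in for the exam data; column names are assumptions.
set.seed(100)
n <- 200
dat <- data.frame(
  x = runif(n, 4500, 5500),
  y = runif(n, 4000, 5000),
  case = rbinom(n, 1, 0.3),
  group = sample(rep(1:5, length.out = n))
)

# Group 5 is the testing set; all other groups form the training set.
train <- dat[dat$group != 5, ]

# Logistic regression of disease status on the spatial coordinates.
mod <- glm(case ~ x + y, family = binomial, data = train)

# Coefficient table (estimate, standard error, z value, p-value), rounded.
coef_table <- round(summary(mod)$coefficients, 3)
coef_table
```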

2b

Did the patient’s geographic location impact the likelihood of developing childhood leukaemia or lymphoma?

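The model provides no evidence that a patient’s geographic location affects the likelihood of developing childhood leukaemia or lymphoma. The p-values for both coordinates (0.619 for x and 0.769 for y) are far above 0.05, and the estimated coefficients are close to zero.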

2d

If instead of classifications, we decided to use the logistic regression’s predicted probabilities, what would be the median absolute error on the testing set? Note that the absolute value is given by |a - b|, which is always a non-negative number. R’s absolute value function is abs().

[1] 0.3310779

2e

Let’s imagine for a moment that we had not used logistic regression at all. Instead, for each patient in the testing set, we estimated the likelihood of disease by using the percentage of patients with the disease in the training set. What would be the median absolute error on the testing set using this prediction?

[1] 0.3128834
[1] 0.3128834
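The two error measures from 2d and 2e can be computed side by side, as in the sketch below. It again relies on simulated data with assumed column names (x, y, case, group), so the numbers it prints will differ from the results above.

```r
# Simulated stand-in for the exam data; column names are assumptions.
set.seed(100)
n <- 200
dat <- data.frame(
  x = runif(n, 4500, 5500),
  y = runif(n, 4000, 5000),
  case = rbinom(n, 1, 0.3),
  group = sample(rep(1:5, length.out = n))
)
train <- dat[dat$group != 5, ]
test <- dat[dat$group == 5, ]

# 2d: median absolute error of the model's predicted probabilities.
mod <- glm(case ~ x + y, family = binomial, data = train)
prob_model <- predict(mod, newdata = test, type = "response")
mae_model <- median(abs(test$case - prob_model))

# 2e: median absolute error when every test patient receives the
# proportion of cases observed in the training set.
prob_base <- mean(train$case)
mae_base <- median(abs(test$case - prob_base))

c(model = mae_model, baseline = mae_base)
```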

2f

Given the results obtained in the previous two questions, does logistic regression improve upon simply using the average result for all of the patients? Do these results surprise you? Explain your reasoning.

No. The median absolute error from logistic regression (0.331) is larger than the error from simply using the average result (0.313).

However, the median absolute error weights every individual difference equally, which makes it better suited to assessing an estimated linear relationship. In logistic regression the relationship is not linear, and the predicted probabilities are ultimately converted into classes using a pre-determined threshold. That is why classification models are usually assessed with metrics such as the accuracy rate or AUC rather than regression error scores such as RMSE or MAE.

Question 3: Conceptual Questions

Answer each question with a short paragraph.

3a

Why can we use cross-validation to estimate the predictive accuracy of a model?

Once a model is trained, we want to be sure that it works well on unseen data and achieves the desired predictive accuracy. Cross-validation is a resampling procedure that partitions the available sample into complementary subsets, fits the model on one subset, validates it on the other, and repeats this over multiple rounds with different partitions before combining the results. For example, in k-fold cross-validation the sample is partitioned into k folds; we fit the model on k-1 folds and validate it on the remaining fold, repeating the procedure until every fold has served as the validation set. The validation errors are then averaged into an overall estimate of the model’s predictive performance. With cross-validation, every observation is used for both training and validation, and each observation is used for validation exactly once. Because each model is scored only on data it was not trained on, and because the result is averaged over several folds, the combined estimate is less variable and less biased than an error measured on a single split or on the training data itself.
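The property that each observation is validated exactly once can be checked directly. The sketch below only illustrates the fold bookkeeping; the sample size and number of folds are arbitrary choices, and no model is fitted.

```r
set.seed(100)
n <- 23   # arbitrary sample size
k <- 5    # arbitrary number of folds

# Randomly assign every observation to one of k folds of near-equal size.
fold <- sample(rep(seq_len(k), length.out = n))

# Count how often each observation lands in a validation set.
times_validated <- integer(n)
for (j in seq_len(k)) {
  validation_idx <- which(fold == j)
  training_idx <- which(fold != j)
  # Training and validation sets never overlap within a round.
  stopifnot(length(intersect(validation_idx, training_idx)) == 0)
  times_validated[validation_idx] <- times_validated[validation_idx] + 1L
}

all(times_validated == 1)
```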

3b

In linear regression, does having a significant p-value for a coefficient ensure that the variable has a meaningful impact on the outcome?

A significant p-value does not by itself ensure that the variable has a meaningful impact on the outcome. We need to consider the p-value together with the estimated coefficient. The p-value tests the null hypothesis that the coefficient is zero, that is, that the variable has no linear association with the dependent variable. If the p-value is less than the significance level, the sample data provide enough evidence to reject the null hypothesis, so we can conclude that there is a non-zero association: changes in the independent variable are associated with changes in the response. The estimated coefficient, on the other hand, measures the effect size: the change in the dependent variable associated with a one-unit change in the independent variable. Using the coefficient’s standard error, we can calculate a confidence interval for the strength of the effect, and from that decide whether the variable has a meaningful impact on the outcome and how large that impact may plausibly be.
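A small simulation illustrates the distinction. The data below are entirely made up: the true slope is set to a very small value, and the sample is large enough that the slope is still statistically significant.

```r
set.seed(100)
n <- 100000
x <- rnorm(n)
y <- 0.02 * x + rnorm(n)   # true slope is tiny by construction

fit <- lm(y ~ x)
slope <- summary(fit)$coefficients["x", ]

estimate <- slope["Estimate"]    # effect size
p_value <- slope["Pr(>|t|)"]     # evidence against a zero slope
ci <- confint(fit)["x", ]        # 95% confidence interval for the slope
r_squared <- summary(fit)$r.squared

list(estimate = estimate, p_value = p_value, ci = ci, r_squared = r_squared)
```

With a sample this large the p-value is far below 0.05, yet the slope and the share of variance explained are both close to zero, so the variable has little practical impact on the outcome.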

3c

A laboratory employs a diagnostic blood test that is designed to detect cancer. Like any test, it occasionally leads to mistaken conclusions. What would be the real-world meaning of the false positives and false negatives of the test? Define what these events would be, and then describe the consequences of these mistakes.

A false positive is an error in which the test result wrongly indicates that a patient has cancer when in reality the patient does not. When a false positive occurs, a healthy person is subjected to unnecessary cancer-related follow-up tests and treatment. A false negative is an error in which the test result wrongly indicates that a patient does not have cancer when in reality the patient does. When a false negative occurs, the patient’s treatment is delayed, which may lower the patient’s chance of survival.
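These two error types correspond to the off-diagonal cells of a confusion matrix. The example below uses eight hypothetical patients invented purely for illustration.

```r
# Hypothetical true status and blood test results for eight patients.
actual <- factor(
  c("cancer", "cancer", "cancer",
    "healthy", "healthy", "healthy", "healthy", "healthy"),
  levels = c("cancer", "healthy")
)
test_result <- factor(
  c("positive", "positive", "negative",
    "positive", "negative", "negative", "negative", "negative"),
  levels = c("positive", "negative")
)

confusion <- table(test_result, actual)

# False positive: the test is positive but the patient is healthy.
false_positives <- confusion["positive", "healthy"]
# False negative: the test is negative but the patient has cancer.
false_negatives <- confusion["negative", "cancer"]

false_positive_rate <- false_positives / sum(actual == "healthy")
false_negative_rate <- false_negatives / sum(actual == "cancer")

confusion
```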

3d

What are the challenges of using hierarchical or kmeans clustering when some of the inputs are categorical variables?

Hierarchical and k-means clustering rely on a dissimilarity or distance measure between data points and are commonly used with numerical data. The sample space of a categorical variable is discrete and often has no natural ordering, so the Euclidean distance between points or clusters is not meaningful. In addition, hierarchical clustering uses linkage methods such as complete, single, or average linkage to calculate the distance between clusters before merging them, and k-means uses cluster means for that purpose. Neither is directly applicable to categorical data, for which a mean is undefined. Categorical data should instead be clustered using modes as the cluster centers, as in the k-modes method.
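One possible workaround, shown here only as an illustration and not as part of the answer above, is to replace Euclidean distance with a simple matching dissimilarity (the share of variables on which two observations disagree) and pass it to hierarchical clustering. The data frame and its columns are hypothetical.

```r
# Hypothetical categorical data for five patients.
patients <- data.frame(
  blood_type = c("A", "A", "B", "B", "O"),
  smoker = c("yes", "yes", "no", "no", "no"),
  stringsAsFactors = FALSE
)

# Simple matching dissimilarity: fraction of variables that differ.
n <- nrow(patients)
mismatch <- matrix(0, n, n)
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    mismatch[i, j] <- mean(unlist(patients[i, ]) != unlist(patients[j, ]))
  }
}

# Hierarchical clustering accepts any dissimilarity matrix via as.dist().
tree <- hclust(as.dist(mismatch), method = "average")
clusters <- cutree(tree, k = 2)
clusters
```

Coding the categories as integers and using Euclidean distance would instead impose an artificial ordering (for example, treating blood type "A" as closer to "B" than to "O"), which is the problem described above.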