Question 1

You have been hired by a local electronics retailer and the above dataset has been given to you. Manager Bayes Jr. 9th wants to create a spreadsheet to predict whether a customer is a likely prospect. To that end:

  1. Compute prior probabilities for Prospect = Yes/No.

  2. Compute the conditional probabilities P(age-group=youth|prospect=yes) and P(age-group=youth|prospect=no), where age-group is a predictor variable. Compute the conditional probabilities for each predictor variable.

  3. Assuming the assumptions of Naive Bayes are met, compute P(prospect|X), where X is one of the predictor variables.

Prior Probabilities

For “Prospect”:

Prior.Prob(“yes”) = 9/14 = 0.64

Prior.Prob(“no”) = 5/14 = 0.36
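
The same priors can be computed directly in R; a minimal sketch, assuming the 14 records are in a data frame named shop with a prospect column (both names are assumptions for illustration only):

# class priors as relative frequencies of the Prospect label (assumed data frame 'shop')
prop.table(table(shop$prospect))   # expected: yes = 9/14 ≈ 0.64, no = 5/14 ≈ 0.36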

Conditional Probabilities

Conditional Probability: P(B|A) = P(A and B)/P(A)

The methodology was taken from the following source:

https://people.revoledu.com/kardi/tutorial/Questionnaire/Conditional%20Probability.htm

  1. Agegroup

youth_yes = 2/9= 0.22

middle_yes = 4/9 = 0.44

senior_yes = 3/9 = 0.33

youth_no = 3/5 = 0.6

middle_no = 0/5 = 0

senior_no = 2/5 = 0.4

  2. Networth

high_yes = 2/9 = 0.22

low_yes = 3/9 = 0.33

medium_yes = 4/9 = 0.44

high_no = 2/5 = 0.4

low_no = 1/5 = 0.2

medium_no = 2/5 = 0.4

  3. Status

employed_yes = 3/9 = 0.33

unemployed_yes = 6/9 = 0.67

employed_no = 4/5 = 0.8

unemployed_no = 1/5 = 0.2

  4. Credit Rating

fair_yes = 6/9 = 0.67

excellent_yes = 3/9 = 0.33

fair_no = 2/5 = 0.4

excellent_no = 3/5 = 0.6
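
The per-predictor tables above can also be generated in one step in R; a sketch, again assuming a data frame shop whose columns are named agegroup, networth, status, creditrating and prospect (all names assumed for illustration):

# P(predictor level | prospect): column-wise proportions of the cross-tabulation
cond_prob <- function(x) prop.table(table(x, shop$prospect), margin = 2)
cond_prob(shop$agegroup)   # e.g. youth|yes = 2/9, youth|no = 3/5
cond_prob(shop$status)     # e.g. employed|yes = 3/9, employed|no = 4/5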

Posterior Probabilities

Posterior Probability: P(A|B) = P(B|A)*P(A)/P(B)

The methodology was taken from the following source:

https://people.revoledu.com/kardi/tutorial/Questionnaire/Bayes%20Rule.htm

For the “Status” predictor:

employed_yes = 0.64/0.5 * 0.33 = 0.4224

unemployed_yes = 0.64/0.5 * 0.67 = 0.8576

employed_no = 0.36/0.5 * 0.8 = 0.576

unemployed_no = 0.36/0.5 * 0.2 = 0.144
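
A quick R check of the “Status” posteriors above, plugging in the rounded priors and conditionals already computed (the 0.5 denominator is P(employed) = P(unemployed) = 7/14):

# Bayes' rule: P(prospect | status) = P(status | prospect) * P(prospect) / P(status)
prior_yes <- 0.64; prior_no <- 0.36
p_status  <- 0.5                      # P(employed) = P(unemployed) = 7/14
prior_yes / p_status * 0.33           # employed_yes   = 0.4224
prior_yes / p_status * 0.67           # unemployed_yes = 0.8576
prior_no  / p_status * 0.80           # employed_no    = 0.576
prior_no  / p_status * 0.20           # unemployed_no  = 0.144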

Question 2

You have just recently joined a data science team.

There are two datasets, junk1.txt and junk2.csv.

They have two options:

  1. They can go back to the client and ask for more data to remedy problems with the data.

  2. They can accept the data and undertake a major analytics exercise.

The team is relying on your data science skills to determine how they should proceed.

Can you explore the data and recommend actions for each file, enumerating the reasons?

Data Set junk1.csv

# reading file
library(dplyr)
df = read.csv("/Users/Olga/Desktop/data622/HW1/junk1.csv",sep="", header=TRUE)
head(df)
##           a         b class
## 1 1.6204214 3.0036241     1
## 2 1.4340220 0.7852487     1
## 3 2.4766615 0.9367761     1
## 4 0.5283093 0.1196222     1
## 5 1.0054081 0.7872866     1
## 6 1.1032636 0.7330594     1

The data set consists of 3 columns: a, b, and class.

dim(df)
## [1] 100   3

The data set has 100 observations and 3 columns.

str(df)
## 'data.frame':    100 obs. of  3 variables:
##  $ a    : num  1.62 1.434 2.477 0.528 1.005 ...
##  $ b    : num  3.004 0.785 0.937 0.12 0.787 ...
##  $ class: int  1 1 1 1 1 1 1 1 1 1 ...

Columns “a” and “b” have a numeric data type, whereas “class” is stored as an integer. Integer is not an appropriate data type for the “class” column; it should be converted to a factor.

#  changing class from integer to a factor with 2 levels
df$class<-as.factor(df$class)
summary(df)
##        a                  b            class 
##  Min.   :-2.29854   Min.   :-3.17174   1:50  
##  1st Qu.:-0.85014   1st Qu.:-1.04712   2:50  
##  Median :-0.04754   Median :-0.07456         
##  Mean   : 0.04758   Mean   : 0.01324         
##  3rd Qu.: 1.09109   3rd Qu.: 1.05342         
##  Max.   : 3.00604   Max.   : 3.10230
any(duplicated(df))   # TRUE if the data set contains duplicated rows
## [1] FALSE

There are no duplicate rows in the data set.

The data set looks balanced, as both classes are represented equally: 50 observations of class “1” and 50 of class “2”.

The data in columns a and b look approximately normally distributed.

plot(df$class)

plot(density(df$a))

plot(density(df$b))
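
If a formal check of that impression is wanted, a Shapiro-Wilk normality test could be run on each column (a quick sketch; large p-values are consistent with normality, and the density plots above remain the main evidence):

# Shapiro-Wilk test of normality for columns a and b
shapiro.test(df$a)
shapiro.test(df$b)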

Data Set junk2.csv

#  reading file
library(dplyr)
df = read.csv("/Users/Olga/Desktop/data622/HW1/junk2.csv")
head(df)
##            a           b class
## 1  3.1886481  0.92917735     0
## 2  0.8224527  0.04760314     0
## 3  0.8147247  0.02910931     0
## 4 -1.5065362  3.13231360     0
## 5  0.4426887  2.84942822     0
## 6  0.8564405 -0.66143851     0

The data set consists of 3 columns: a, b, and class.

dim(df)
## [1] 4000    3

The data set has 4000 observations and 3 columns.

str(df)
## 'data.frame':    4000 obs. of  3 variables:
##  $ a    : num  3.189 0.822 0.815 -1.507 0.443 ...
##  $ b    : num  0.9292 0.0476 0.0291 3.1323 2.8494 ...
##  $ class: int  0 0 0 0 0 0 0 0 0 0 ...

Columns “a” and “b” have a numeric data type, whereas “class” is stored as an integer. Integer is not an appropriate data type for the “class” column; it should be converted to a factor.

#  changing class from integer to a factor with 2 levels
df$class<-as.factor(df$class)
summary(df)
##        a                  b            class   
##  Min.   :-4.16505   Min.   :-3.90472   0:3750  
##  1st Qu.:-1.01447   1st Qu.:-0.89754   1: 250  
##  Median : 0.08754   Median :-0.08358           
##  Mean   :-0.05126   Mean   : 0.05624           
##  3rd Qu.: 0.89842   3rd Qu.: 1.00354           
##  Max.   : 4.62647   Max.   : 4.31052
any(duplicated(df))   # TRUE if the data set contains duplicated rows
## [1] FALSE

There are no duplicate rows in the data set.

The data set does not look balanced, as the classes are not represented equally: 3750 observations of class “0” and only 250 of class “1”.

plot(df$class)

Conclusions

  1. Conclusion for data set junk1.csv:

The data set looks fine: there are no missing values or duplicates, and it is perfectly balanced.

  2. Conclusion for data set junk2.csv:

The data set is imbalanced. That means that models such as logistic regression or random forest will tend to generalize by effectively discarding the rare class. If we use decision trees, we may not need to balance the data set.

The best possible decision in this case is to obtain a proper data set; if that is not possible, the following steps can be taken:

  • The best way to decide is to explore the data fully. If any other attributes are also imbalanced over their values, this will affect the classification results as well.

  • Resample the training set (under-/over-sampling). In our case it is possible to implement under-sampling by reducing the size of the abundant class, as the quantity of data is sufficient (a short sketch is given after this list).

  • An elegant approach was proposed on Quora. Instead of relying on random samples to cover the variety of the training samples, the author suggests clustering the abundant class into r groups, with r being the number of cases in the rare class. For each group, only the medoid (centre of the cluster) is kept. The model is then trained on the rare class and the medoids only.

https://www.quora.com/In-classification-how-do-you-handle-an-unbalanced-training-set/answers/1144228?srid=h3G6o

  • Ensemble different resampled datasets. One easy best practice is to build n models that each use all the samples of the rare class together with a different sample of the abundant class.

  • Use K-fold cross-validation together with over-sampling. Over-sampling takes the observed rare samples and applies bootstrapping to generate new random data; the over-sampling should be done within each fold rather than before splitting, to avoid information leakage.

This is not an exhaustive list of techniques, and there is no single best approach; combining different techniques and models helps to evaluate what works best.
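
As an illustration of the under-sampling option mentioned above, a minimal base-R sketch on junk2.csv, using the df already loaded (random under-sampling of class “0” down to the size of class “1”):

# random under-sampling: shrink the abundant class "0" to the size of the rare class "1"
set.seed(1)
minority <- df[df$class == "1", ]
majority <- df[df$class == "0", ]
majority_down <- majority[sample(nrow(majority), nrow(minority)), ]
df_balanced <- rbind(minority, majority_down)
table(df_balanced$class)   # expected: 250 observations of each class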

Question 3

Please find kNN.R

This R script requires a dataset, labelcol and K (the number of nearest neighbors to be considered).

The dataset MUST be numeric, except the labelcol. The labelcol must be the last column in the data.frame; all the other columns must come before it.

To Do:

Please find icu.csv. The formula to fit is “STA ~ TYP + COMA + AGE + INF”.

Read icu.csv and subset it to the 5 features in the formula, with STA as the labelcol.

Split the icu data 70/30 into train/test and run kNN.R for K = (3, 5, 7, 15, 25, 50).

Submit the resulting confusion matrix and accuracy for each K.

Plot Accuracy vs K.

Write a short summary of your findings.

# reading file
library(dplyr)
icu.data = read.csv("/Users/Olga/Desktop/data622/HW1/icu.csv", header=TRUE)
head(icu.data)
##   ID STA AGE SEX RACE SER CAN CRN INF CPR SYS HRA PRE TYP FRA PO2 PH PCO
## 1  8   0  27   1    1   0   0   0   1   0 142  88   0   1   0   0  0   0
## 2 12   0  59   0    1   0   0   0   0   0 112  80   1   1   0   0  0   0
## 3 14   0  77   0    1   1   0   0   0   0 100  70   0   0   0   0  0   0
## 4 28   0  54   0    1   0   0   0   1   0 142 103   0   1   1   0  0   0
## 5 32   0  87   1    1   1   0   0   1   0 110 154   1   1   0   0  0   0
## 6 38   0  69   0    1   0   0   0   1   0 110 132   0   1   0   1  0   0
##   BIC CRE LOC
## 1   0   0   0
## 2   0   0   0
## 3   0   0   0
## 4   0   0   0
## 5   0   0   0
## 6   1   0   0
dim(icu.data)
## [1] 200  21
icu.data$STA<-as.factor(icu.data$STA)
summary(icu.data$STA)
##   0   1 
## 160  40
euclideanDist <- function(a, b){
  d = 0
  for(i in c(1:(length(a)) ))
  {
    d = d + (a[[i]]-b[[i]])^2
  }
  d = sqrt(d)
  return(d)
}

knn_predict2 <- function(test_data, train_data, k_value, labelcol){
  pred <- c()  #empty pred vector 
  #LOOP-1
  for(i in c(1:nrow(test_data))){   #looping over each record of test data
    eu_dist =c()          #eu_dist & eu_char empty  vector
    eu_char = c()
    good = 0              #good & bad variable initialization with 0 value
    bad = 0
    
    #LOOP-2-looping over train data 
    for(j in c(1:nrow(train_data))){
 
      #adding euclidean distance b/w test data point and train data to eu_dist vector
      eu_dist <- c(eu_dist, euclideanDist(test_data[i,-c(labelcol)], train_data[j,-c(labelcol)]))
 
      #adding class variable of training data in eu_char
      eu_char <- c(eu_char, as.character(train_data[j,][[labelcol]]))
    }
    
    eu <- data.frame(eu_char, eu_dist) #eu dataframe created with eu_char & eu_dist columns
 
    eu <- eu[order(eu$eu_dist),]       #sorting eu dataframe to get top K neighbors
    eu <- eu[1:k_value,]               #eu dataframe with top K neighbors
 
    tbl.sm.df<-table(eu$eu_char)
    cl_label<-  names(tbl.sm.df)[[as.integer(which.max(tbl.sm.df))]]
    
    pred <- c(pred, cl_label)
  }
  return(pred) #return pred vector
}
  

accuracy <- function(test_data,labelcol,predcol){
  correct = 0
  for(i in c(1:nrow(test_data))){
    if(test_data[i,labelcol] == test_data[i,predcol]){ 
      correct = correct+1
    }
  }
  accu = (correct/nrow(test_data)) * 100  
  return(accu)
}
# creating COMA variable
icu.data["COMA"] <- ifelse(icu.data$LOC==2,1,0)
# selecting necessary columns
icu <- icu.data[,c("TYP", "COMA", "AGE", "INF", "STA")]
#  assigning current data set 
knn.df<-icu

labelcol <- 5 #  STA is the fifth column

# create prediction column
predictioncol<-labelcol+1
# create train/test partitions
set.seed(2)
n<-nrow(knn.df)
knn.df<- knn.df[sample(n),]
#  taking the first 70% of the shuffled rows as the training set
train.df <- knn.df[1:as.integer(0.7*n),]
#  inspecting the training set
head(train.df)
##     TYP COMA AGE INF STA
## 37    0    0  74   0   0
## 140   1    0  73   0   0
## 114   1    0  58   0   0
## 34    1    0  23   0   0
## 185   1    0  69   0   1
## 184   1    1  65   0   1
train.df$STA<-as.factor(train.df$STA)
K = 3 # number of neighbors to determine the class
table(train.df[,labelcol])
## 
##   0   1 
## 107  33
test.df <- knn.df[as.integer(0.7*n +1):n,]
#  changing labelcol of test data set to factor
test.df$STA<-as.factor(test.df$STA)
table(test.df[,labelcol])
## 
##  0  1 
## 53  7
predictions <- knn_predict2(test.df, train.df, K,labelcol) #calling knn_predict()
test.df[,predictioncol] <- predictions 
print(accuracy(test.df,labelcol,predictioncol))
## [1] 80
table(test.df[[predictioncol]],test.df[[labelcol]])
##    
##      0  1
##   0 48  7
##   1  5  0
acc_summary <- data.frame()
for (K in c(3,5,7,15,25,50)) # number of neighbors to determine the class
{
  
test.df <- knn.df[as.integer(0.7*n +1):n,]
predictions <- knn_predict2(test.df, train.df, K,labelcol)
test.df[,predictioncol] <- predictions 
print(accuracy(test.df,labelcol,predictioncol))
print(table(test.df[[predictioncol]],test.df[[labelcol]]))

acc <- c(K, accuracy(test.df,labelcol,predictioncol))
acc_summary <- rbind(acc_summary, acc)

}
## [1] 80
##    
##      0  1
##   0 48  7
##   1  5  0
## [1] 85
##    
##      0  1
##   0 50  6
##   1  3  1
## [1] 83.33333
##    
##      0  1
##   0 50  7
##   1  3  0
## [1] 88.33333
##    
##      0  1
##   0 53  7
## [1] 88.33333
##    
##      0  1
##   0 53  7
## [1] 88.33333
##    
##      0  1
##   0 53  7
colnames(acc_summary) <- c("K", "Accuracy")
acc_summary
##    K Accuracy
## 1  3 80.00000
## 2  5 85.00000
## 3  7 83.33333
## 4 15 88.33333
## 5 25 88.33333
## 6 50 88.33333
plot(acc_summary,type="o",xlab="K",ylab="Accuracy",main="Accuracy vs K")

Summary:

As shown in the graph, K = 15 or more gives the highest accuracy. However, the data set is imbalanced, with an abundant “0” class: for K ≥ 15 the classifier simply predicts “0” for every test observation, so accuracy is high for class “0” and essentially zero for class “1”. The steps listed in the answer to Question 2 could be taken to balance the data set and obtain better performance on class “1”.
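
To make that weakness explicit, per-class recall (sensitivity and specificity) can be computed from the confusion matrix instead of overall accuracy; a small sketch below, assuming test.df still holds the predictions from the last loop iteration (K = 50):

# build a 2x2 confusion matrix even if one class is never predicted
pred   <- factor(test.df[[predictioncol]], levels = c("0", "1"))
actual <- factor(test.df[[labelcol]],      levels = c("0", "1"))
cm <- table(pred, actual)
sens <- cm["1", "1"] / sum(cm[, "1"])   # recall on the rare class "1"
spec <- cm["0", "0"] / sum(cm[, "0"])   # recall on the abundant class "0"
c(sensitivity = sens, specificity = spec, balanced.accuracy = (sens + spec) / 2)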