You have been hired by a local electronics retailer and the above dataset has been given to you. Manager Bayes Jr.9th wants to create a spreadsheet to predict is a customer is likely prospect. To that end:
Compute prior probabilities for the Prospect Yes/No
Compute the conditional probabilities
P(age-group=youth|prospect=yes) and P(age-group=youth|prospect=no) where age-group is a predictor variable. Compute the conditional probabilities for each predictor variable namely
P(prospect|X) where X is one of the predictor variable.
for “Prospect”:
Prior.Prob(“yes”) = 9/14 = 0.64
Prior.Prob(“no”) = 5/14 = 0.36
Conditional Probability: P(B|A) = P(A and B)/P(A)
The methodology was taken from the following source:
https://people.revoledu.com/kardi/tutorial/Questionnaire/Conditional%20Probability.htm
youth_yes = 2/9= 0.22
middle_yes = 4/9 = 0.44
senior_yes = 3/9 = 0.33
youth_no = 3/5 = 0.6
middle_no = 0/5 = 0
senior_no = 2/5 = 0.4
high_yes = 2/9 = 0.22
low_yes = 3/9 = 0.33
medium_yes = 4/9 = 0.44
high_no = 2/5 = 0.4
low_no = 1/5 = 0.2
medium_no = 2/5 = 0.4
employed_yes = 3/9 = 0.33
unemployed_yes = 6/9 = 0.67
employed_no = 4/5 = 0.8
unemployed_no = 1/5 = 0.2
fair_yes = 6/9 = 0.67
excellent_yes = 3/9 = 0.33
fair_no = 2/5 = 0.4
excellent_no = 3/5 = 0.6
Posterior Probability: P(A|B) = P(B|A)*P(A)/P(B)
The methodology was taken from the following source:
https://people.revoledu.com/kardi/tutorial/Questionnaire/Bayes%20Rule.htm
For “Status” class:
employed_yes = 0.64/0.5 * 0.33 = 0.4224
unemployed_yes = 0.64/0.5 * 0.67 = 0.8576
employed_no = 0.36/0.5 * 0.8 = 0.576
unemployed_no = 0.36/0.5 * 0.2 = 0.144
You just recently joined a datascience team.
There are two datasets junk1.txt and junk2.csv
They have two options
They can go back to the client and ask for more data to remedy problems with the data.
They can accept the data and undertake a major analytics exercise.
The team is relying on your dsc skills to determine how they should proceed.
Can you explore the data and recommend actions for each file enumerating the reasons.
# reading file
library(dplyr)
df = read.csv("/Users/Olga/Desktop/data622/HW1/junk1.csv",sep="", header=TRUE)
head(df)
## a b class
## 1 1.6204214 3.0036241 1
## 2 1.4340220 0.7852487 1
## 3 2.4766615 0.9367761 1
## 4 0.5283093 0.1196222 1
## 5 1.0054081 0.7872866 1
## 6 1.1032636 0.7330594 1
Data set consists of 3 columns: a, b and class
dim(df)
## [1] 100 3
Data set has 100 obs and 3 columns
str(df)
## 'data.frame': 100 obs. of 3 variables:
## $ a : num 1.62 1.434 2.477 0.528 1.005 ...
## $ b : num 3.004 0.785 0.937 0.12 0.787 ...
## $ class: int 1 1 1 1 1 1 1 1 1 1 ...
Column “a” and “b” have numeric data type, whereas “class” has integer. It seems that integer is not an appropriate data type for “class” column, instead type should be changed to factor.
# changing claas from integer to factor of 2 levels
df$class<-as.factor(df$class)
summary(df)
## a b class
## Min. :-2.29854 Min. :-3.17174 1:50
## 1st Qu.:-0.85014 1st Qu.:-1.04712 2:50
## Median :-0.04754 Median :-0.07456
## Mean : 0.04758 Mean : 0.01324
## 3rd Qu.: 1.09109 3rd Qu.: 1.05342
## Max. : 3.00604 Max. : 3.10230
isTRUE(duplicated(df))
## [1] FALSE
There is no duplicate rows in the data set.
Data Set looks balanced as both classes are presented equally: 50 obs of class “1” and 50 obs of class “0”.
Data of column a and b is close to normally distributed data.
plot(df$class)
plot(density(df$a))
plot(density(df$b))
# reading file
library(dplyr)
df = read.csv("/Users/Olga/Desktop/data622/HW1/junk2.csv")
head(df)
## a b class
## 1 3.1886481 0.92917735 0
## 2 0.8224527 0.04760314 0
## 3 0.8147247 0.02910931 0
## 4 -1.5065362 3.13231360 0
## 5 0.4426887 2.84942822 0
## 6 0.8564405 -0.66143851 0
Data set consists of 3 columns: a, b and class
dim(df)
## [1] 4000 3
Data set has 4000 obs and 3 columns.
str(df)
## 'data.frame': 4000 obs. of 3 variables:
## $ a : num 3.189 0.822 0.815 -1.507 0.443 ...
## $ b : num 0.9292 0.0476 0.0291 3.1323 2.8494 ...
## $ class: int 0 0 0 0 0 0 0 0 0 0 ...
Column “a” and “b” have numeric data type, whereas “class” has integer. It seems that integer is not an appropriate data type for “class” column, instead type should be changed to factor.
# changing claas from integer to factor of 2 levels
df$class<-as.factor(df$class)
summary(df)
## a b class
## Min. :-4.16505 Min. :-3.90472 0:3750
## 1st Qu.:-1.01447 1st Qu.:-0.89754 1: 250
## Median : 0.08754 Median :-0.08358
## Mean :-0.05126 Mean : 0.05624
## 3rd Qu.: 0.89842 3rd Qu.: 1.00354
## Max. : 4.62647 Max. : 4.31052
isTRUE(duplicated(df))
## [1] FALSE
There is no duplicate rows in the data set.
Data Set does not look balanced as both classes are not presented equally: 250 obs of class “1” and 3750 obs of class “0”.
plot(df$class)
Data set looks fine, there are no missing values or duplicates, also it is perfectly balanced.
Data Set is imbalanced. That means if we use, for example, logistic regression or random forest, then these models tend to generalize by discarding the rare class. If we use decision trees we may not need to balance our data set.
The best possible decision in this case is to get proper data set, if it is not possible then the following steps can be taken:
The best way to decide is to explore the data fully. If any other attributes are also imbalanced over their values this will also affect classification results.
Resample the training set (Under/over-sampling). In our case it is possible to implement under-sampling by reducing the size of the abundant class as the quantity of data is sufficient.
An elegant approach was proposed on Quora. Instead of relying on random samples to cover the variety of the training samples, he suggests clustering the abundant class in r groups, with r being the number of cases in r. For each group, only the medoid (centre of cluster) is kept. The model is then trained with the rare class and the medoids only.
Ensemble different resampled datasets. One easy best practice is building n models that use all the samples of the rare class and n-differing samples of the abundant class.
Use K-fold Cross-Validation with oversampling. Over-sampling takes observed rare samples and applies bootstrapping to generate new random data based on a distribution function.
It is not an exclusive list of techniques and there is no best approach. Combining different techniques and models can help to evaluate what works best.
Please find kNN.R
This R script requires a dataset, labelcol and K (number of nearest neighbors to be considered)
The dataset MUST Be numeric, except the labelcol The labelcol must be the last column in the data.frame All the other columns must be before the labelcol
To DO:
Please find icu.csv The formula to fit is “STA ~ TYP + COMA + AGE + INF”
Read the icu.csv subset it with these 5 features in the formula and STA is the labelcol.
Split the icu 70/30 train/test and run the kNN.R for K=(3,5,7,15,25,50)
submit the result confusionMatrix, Accuracy for each K
Plot Accuracy vs K.
write a short summary of your findings.
# reading file
library(dplyr)
icu.data = read.csv("/Users/Olga/Desktop/data622/HW1/icu.csv", header=TRUE)
head(icu.data)
## ID STA AGE SEX RACE SER CAN CRN INF CPR SYS HRA PRE TYP FRA PO2 PH PCO
## 1 8 0 27 1 1 0 0 0 1 0 142 88 0 1 0 0 0 0
## 2 12 0 59 0 1 0 0 0 0 0 112 80 1 1 0 0 0 0
## 3 14 0 77 0 1 1 0 0 0 0 100 70 0 0 0 0 0 0
## 4 28 0 54 0 1 0 0 0 1 0 142 103 0 1 1 0 0 0
## 5 32 0 87 1 1 1 0 0 1 0 110 154 1 1 0 0 0 0
## 6 38 0 69 0 1 0 0 0 1 0 110 132 0 1 0 1 0 0
## BIC CRE LOC
## 1 0 0 0
## 2 0 0 0
## 3 0 0 0
## 4 0 0 0
## 5 0 0 0
## 6 1 0 0
dim(icu.data)
## [1] 200 21
icu.data$STA<-as.factor(icu.data$STA)
summary(icu.data$STA)
## 0 1
## 160 40
euclideanDist <- function(a, b){
d = 0
for(i in c(1:(length(a)) ))
{
d = d + (a[[i]]-b[[i]])^2
}
d = sqrt(d)
return(d)
}
knn_predict2 <- function(test_data, train_data, k_value, labelcol){
pred <- c() #empty pred vector
#LOOP-1
for(i in c(1:nrow(test_data))){ #looping over each record of test data
eu_dist =c() #eu_dist & eu_char empty vector
eu_char = c()
good = 0 #good & bad variable initialization with 0 value
bad = 0
#LOOP-2-looping over train data
for(j in c(1:nrow(train_data))){
#adding euclidean distance b/w test data point and train data to eu_dist vector
eu_dist <- c(eu_dist, euclideanDist(test_data[i,-c(labelcol)], train_data[j,-c(labelcol)]))
#adding class variable of training data in eu_char
eu_char <- c(eu_char, as.character(train_data[j,][[labelcol]]))
}
eu <- data.frame(eu_char, eu_dist) #eu dataframe created with eu_char & eu_dist columns
eu <- eu[order(eu$eu_dist),] #sorting eu dataframe to gettop K neighbors
eu <- eu[1:k_value,] #eu dataframe with top K neighbors
tbl.sm.df<-table(eu$eu_char)
cl_label<- names(tbl.sm.df)[[as.integer(which.max(tbl.sm.df))]]
pred <- c(pred, cl_label)
}
return(pred) #return pred vector
}
accuracy <- function(test_data,labelcol,predcol){
correct = 0
for(i in c(1:nrow(test_data))){
if(test_data[i,labelcol] == test_data[i,predcol]){
correct = correct+1
}
}
accu = (correct/nrow(test_data)) * 100
return(accu)
}
# creating COMA variable
icu.data["COMA"] <- ifelse(icu.data$LOC==2,1,0)
# selecting necessary columns
icu <- icu.data[,c("TYP", "COMA", "AGE", "INF", "STA")]
# assigning current data set
knn.df<-icu
labelcol <- 5 # STA is the fifth column
# create prediction column
predictioncol<-labelcol+1
# create train/test partitions
set.seed(2)
n<-nrow(knn.df)
knn.df<- knn.df[sample(n),]
# changing data type to integer
train.df <- knn.df[1:as.integer(0.7*n),]
# changing labelcol of tain data set to factor
head(train.df)
## TYP COMA AGE INF STA
## 37 0 0 74 0 0
## 140 1 0 73 0 0
## 114 1 0 58 0 0
## 34 1 0 23 0 0
## 185 1 0 69 0 1
## 184 1 1 65 0 1
train.df$STA<-as.factor(train.df$STA)
K = 3 # number of neighbors to determine the class
table(train.df[,labelcol])
##
## 0 1
## 107 33
test.df <- knn.df[as.integer(0.7*n +1):n,]
# changing labelcol of test data set to factor
test.df$STA<-as.factor(test.df$STA)
table(test.df[,labelcol])
##
## 0 1
## 53 7
predictions <- knn_predict2(test.df, train.df, K,labelcol) #calling knn_predict()
test.df[,predictioncol] <- predictions
print(accuracy(test.df,labelcol,predictioncol))
## [1] 80
table(test.df[[predictioncol]],test.df[[labelcol]])
##
## 0 1
## 0 48 7
## 1 5 0
acc_summary <- data.frame()
for (K in c(3,5,7,15,25,50)) # number of neighbors to determine the class
{
test.df <- knn.df[as.integer(0.7*n +1):n,]
predictions <- knn_predict2(test.df, train.df, K,labelcol)
test.df[,predictioncol] <- predictions
print(accuracy(test.df,labelcol,predictioncol))
print(table(test.df[[predictioncol]],test.df[[labelcol]]))
acc <- c(K, accuracy(test.df,labelcol,predictioncol))
acc_summary <- rbind(acc_summary, acc)
}
## [1] 80
##
## 0 1
## 0 48 7
## 1 5 0
## [1] 85
##
## 0 1
## 0 50 6
## 1 3 1
## [1] 83.33333
##
## 0 1
## 0 50 7
## 1 3 0
## [1] 88.33333
##
## 0 1
## 0 53 7
## [1] 88.33333
##
## 0 1
## 0 53 7
## [1] 88.33333
##
## 0 1
## 0 53 7
colnames(acc_summary) <- c("K", "Accuracy")
acc_summary
## K Accuracy
## 1 3 80.00000
## 2 5 85.00000
## 3 7 83.33333
## 4 15 88.33333
## 5 25 88.33333
## 6 50 88.33333
plot(acc_summary,type="o",xlab="K",ylab="Accuracy",main="Accuracy vs K")
Summary:
As shown on the graph, K=15 or more is the optimal number of neighbours as it provides highest accuracy, but the data set is imbalanced with abundant “0” class, thus we got high accuracy for “0” class and significantly lower for “1” class. The steps listed in the answer to Q2 can be taken in order to balance the data set and get better accuracy for “1” class.