You have been hired by a local electronics retailer, and the above dataset has been given to you. Manager Bayes Jr. 9th wants to create a spreadsheet to predict whether a customer is a likely prospect.
Prior YES probability: 64%
Prior NO probability: 36%
P(age-group=youth|prospect=yes): 23%
P(age-group=youth|prospect=no): 13%
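For reference, Bayes' rule gives the single-attribute posterior (shown here for the youth age group) as:
P(prospect=yes|age-group=youth) = P(age-group=youth|prospect=yes) * P(prospect=yes) / [ P(age-group=youth|prospect=yes) * P(prospect=yes) + P(age-group=youth|prospect=no) * P(prospect=no) ]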
## networth
## class.prospect high low medium Sum
## no 0.14 0.07 0.14 0.36
## yes 0.14 0.21 0.29 0.64
## Sum 0.29 0.29 0.43 1.00
## age.group
## class.prospect middle senior youth Sum
## no 0.00 0.14 0.21 0.36
## yes 0.29 0.21 0.14 0.64
## Sum 0.29 0.36 0.36 1.00
## status
## class.prospect employed unemployed Sum
## no 0.29 0.07 0.36
## yes 0.21 0.43 0.64
## Sum 0.50 0.50 1.00
## credit_rating
## class.prospect excellent fair Sum
## no 0.21 0.14 0.36
## yes 0.21 0.43 0.64
## Sum 0.43 0.57 1.00
P(prospect=no|age-group=youth): 24%
P(prospect=yes|age-group=youth): 50%
P(prospect=no|age-group=middle): 18%
P(prospect=yes|age-group=middle): 42%
P(prospect=no|age-group=senior): 24%
P(prospect=yes|age-group=senior): 50%
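As a sketch of how the spreadsheet could combine all four attributes into one prediction, the hypothetical helper below (not part of the original analysis) applies naive Bayes to the data frame d loaded in the code section:
score_prospect <- function(d, age, net, stat, credit) {
  # P(value | class) from a row-conditional proportion table
  lik <- function(var, val, cls) {
    tab <- prop.table(table(d$class.prospect, d[[var]]), margin = 1)
    tab[cls, val]
  }
  post <- sapply(c("yes", "no"), function(cls) {
    prior <- mean(d$class.prospect == cls)            # class prior
    prior * lik("age.group", age, cls) * lik("networth", net, cls) *
      lik("status", stat, cls) * lik("credit_rating", credit, cls)
  })
  post / sum(post)                                    # normalised posteriors for yes / no
}
# e.g. score_prospect(d, "youth", "medium", "employed", "fair")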
You have just recently joined a data science team.
There are two datasets, junk1.txt and junk2.csv. The team has two options: 1. Go back to the client and ask for more data to remedy problems with the data. 2. Accept the data and undertake a major analytics exercise.
The team is relying on your data science skills to determine how it should proceed. Explore the data and recommend an action for each file, enumerating the reasons.
The summaries below show that the two datasets are similar on average. The statistical description shows that they are close in mean and standard deviation, but they differ in median, skew, kurtosis, and standard error.
## a b class
## Min. :-2.29854 Min. :-3.17174 Min. :1.0
## 1st Qu.:-0.85014 1st Qu.:-1.04712 1st Qu.:1.0
## Median :-0.04754 Median :-0.07456 Median :1.5
## Mean : 0.04758 Mean : 0.01324 Mean :1.5
## 3rd Qu.: 1.09109 3rd Qu.: 1.05342 3rd Qu.:2.0
## Max. : 3.00604 Max. : 3.10230 Max. :2.0
## a b class
## Min. :-4.16505 Min. :-3.90472 Min. :0.0000
## 1st Qu.:-1.01447 1st Qu.:-0.89754 1st Qu.:0.0000
## Median : 0.08754 Median :-0.08358 Median :0.0000
## Mean :-0.05126 Mean : 0.05624 Mean :0.0625
## 3rd Qu.: 0.89842 3rd Qu.: 1.00354 3rd Qu.:0.0000
## Max. : 4.62647 Max. : 4.31052 Max. :1.0000
## Observations: 100
## Variables: 3
## $ a <dbl> 1.6204214, 1.4340220, 2.4766615, 0.5283093, 1.0054081, 1.1032...
## $ b <dbl> 3.00362413, 0.78524873, 0.93677611, 0.11962219, 0.78728662, 0...
## $ class <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## Observations: 4,000
## Variables: 3
## $ a <dbl> 3.18864809, 0.82245267, 0.81472472, -1.50653621, 0.44268866, ...
## $ b <dbl> 0.92917735, 0.04760314, 0.02910931, 3.13231360, 2.84942822, -...
## $ class <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## vars n mean sd median trimmed mad min max range skew kurtosis
## a 1 100 0.05 1.27 -0.05 0.03 1.47 -2.30 3.01 5.30 0.13 -0.80
## b 2 100 0.01 1.45 -0.07 -0.02 1.49 -3.17 3.10 6.27 0.12 -0.53
## class 3 100 1.50 0.50 1.50 1.50 0.74 1.00 2.00 1.00 0.00 -2.02
## se
## a 0.13
## b 0.14
## class 0.05
## vars n mean sd median trimmed mad min max range skew kurtosis
## a 1 4000 -0.05 1.30 0.09 -0.02 1.40 -4.17 4.63 8.79 -0.17 -0.34
## b 2 4000 0.06 1.31 -0.08 0.03 1.39 -3.90 4.31 8.22 0.21 -0.35
## class 3 4000 0.06 0.24 0.00 0.00 0.00 0.00 1.00 1.00 3.61 11.06
## se
## a 0.02
## b 0.02
## class 0.00
The initial plots show that junk1 looks more spread out as a result of its lower number of data points, while junk2, with many more points, forms a much denser cluster. Both datasets have a binary (two-level) class variable. The boxplots show that junk1 is fairly symmetrical with no outliers, while junk2 has several outliers.
To transform the data, we convert the class column of both datasets to a factor. Rerunning the summary shows that class now has two levels. junk1 has levels 1 and 2, split evenly 50:50. junk2 has levels 0 and 1 and is heavily skewed toward 0, which accounts for about 94% of the data points.
## a b class
## Min. :-2.29854 Min. :-3.17174 1:50
## 1st Qu.:-0.85014 1st Qu.:-1.04712 2:50
## Median :-0.04754 Median :-0.07456
## Mean : 0.04758 Mean : 0.01324
## 3rd Qu.: 1.09109 3rd Qu.: 1.05342
## Max. : 3.00604 Max. : 3.10230
## a b class
## Min. :-4.16505 Min. :-3.90472 0:3750
## 1st Qu.:-1.01447 1st Qu.:-0.89754 1: 250
## Median : 0.08754 Median :-0.08358
## Mean :-0.05126 Mean : 0.05624
## 3rd Qu.: 0.89842 3rd Qu.: 1.00354
## Max. : 4.62647 Max. : 4.31052
The transformed junk1 plots show a weak negative correlation between a and b. The density and histogram plots confirm the 50:50 class split, with the two classes intersecting near 0. In addition, the scatter plot shows that b versus a trends negatively when class = 2 and positively when class = 1.
The transformed junk2 plots show a weak positive correlation. The histograms confirm the high count of class 0. The density plots show class 0 spread evenly, while class 1 concentrates most of its density (roughly 80%) with a between 0 and 2.5 and b between -2.5 and 0. In addition, the scatter plot shows a noticeable positive trend when class = 0 and a slight positive trend when class = 1; it also shows class 0 scattered across the whole range of the data, while class 1 is clustered between 0 and 2.5.
I recommend GLM binomial (logistic regression) modeling, since both datasets have a two-level factor class. However, as the model summaries below show, none of the junk1 coefficients are significant, while the junk2 coefficients are highly significant. In addition, junk1 has a limited amount of data, whereas junk2 is far more robust in comparison. Therefore, I recommend using only the junk2 data for modeling. Some additional work is needed to remove or investigate outlier points (see the sketch after the model summaries below) and to check whether further transformations are needed before finalizing the model.
##
## Call:
## glm(formula = class ~ ., family = "binomial", data = j1)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.30886 -1.16690 0.02148 1.16699 1.29651
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.005331 0.200613 0.027 0.979
## a -0.101126 0.160193 -0.631 0.528
## b -0.043367 0.140270 -0.309 0.757
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 138.63 on 99 degrees of freedom
## Residual deviance: 138.17 on 97 degrees of freedom
## AIC: 144.17
##
## Number of Fisher Scoring iterations: 3
##
## Call:
## glm(formula = class ~ ., family = "binomial", data = j2)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.77408 -0.34178 -0.15964 -0.06007 2.81798
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.06598 0.14995 -27.11 <2e-16 ***
## a 1.25104 0.08405 14.88 <2e-16 ***
## b -1.25555 0.08651 -14.51 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1870.3 on 3999 degrees of freedom
## Residual deviance: 1359.4 on 3997 degrees of freedom
## AIC: 1365.4
##
## Number of Fisher Scoring iterations: 7
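As a sketch of the outlier/influence check recommended above, assuming the fitted junk2 model object lj2 from the code section below:
cd <- cooks.distance(lj2)
which(cd > 4 / nrow(j2))   # observations with unusually high influence (rule-of-thumb cutoff)
plot(lj2, which = 4)       # Cook's distance plot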
Use icu.csv. The formula to fit is "STA ~ TYP + COMA + AGE + INF".
Read icu.csv and subset it to the five columns in the formula, with STA as the label column. (COMA is not a column in the raw file; it is derived from LOC, with COMA = 1 when LOC == 2, as in the code section below.)
##
## Call:
## glm(formula = STA ~ TYP + COMA + AGE + INF, family = "binomial",
## data = i)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9864 -0.6855 -0.3734 -0.2460 2.5714
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.33474 1.05476 -5.058 4.24e-07 ***
## TYP 2.08673 0.76488 2.728 0.00637 **
## COMA 2.44415 0.82463 2.964 0.00304 **
## AGE 0.02952 0.01151 2.565 0.01033 *
## INF 0.47553 0.41185 1.155 0.24825
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 200.16 on 199 degrees of freedom
## Residual deviance: 160.09 on 195 degrees of freedom
## AIC: 170.09
##
## Number of Fisher Scoring iterations: 6
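For interpretation, the ICU coefficients can be read as odds ratios; a minimal sketch, assuming the fitted object i.lm from the code section below:
exp(coef(i.lm))     # odds ratios for TYP, COMA, AGE and INF
exp(confint(i.lm))  # 95% profile-likelihood confidence intervals (via MASS)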
Run kNN.R for K = (3, 5, 7, 15, 25, 50). Submit the resulting confusion matrix and accuracy for each K.
## [1] "Neighbors:" "3"
## [1] 95
##
## 0 1
## 0 57 3
## [1] "Neighbors:" "5"
## [1] 95
##
## 0 1
## 0 57 3
## [1] "Neighbors:" "7"
## [1] 95
##
## 0 1
## 0 57 3
## [1] "Neighbors:" "15"
## [1] 95
##
## 0 1
## 0 57 3
## [1] "Neighbors:" "25"
## [1] 95
##
## 0 1
## 0 57 3
## [1] "Neighbors:" "50"
## [1] 95
##
## 0 1
## 0 57 3
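If a fuller confusion matrix report is wanted, caret (already loaded) can produce one for any single K; a minimal sketch, assuming test.df, labelcol and predictioncol from the kNN loop in the code section below:
confusionMatrix(factor(test.df[[predictioncol]], levels = c(0, 1)),
                factor(test.df[[labelcol]], levels = c(0, 1)))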
Plot Accuracy vs. K and write a short summary of the findings. In this run the accuracy is 95% for every K tested: the 60-record test set contains 57 zeros and 3 ones, and the classifier predicts class 0 for every record, so increasing K from 3 to 50 has no effect on the result.
Code used in analysis
knitr::opts_chunk$set(
echo = FALSE,
message = FALSE,
warning = FALSE
)
#knitr::opts_chunk$set(echo = TRUE)
require(knitr)
library(ggplot2)
library(tidyr)
library(MASS)
library(psych)
library(kableExtra)
library(dplyr)
library(faraway)
library(gridExtra)
library(reshape2)
library(leaps)
library(pROC)
library(caret)
library(naniar)
library(pander)
library(mlbench)
library(e1071)
library(fpp2)
library(mlr)
library(GGally)
#x<-read.clipboard()
#summary(x)
#write.csv((x),"x.csv")
d<-read.csv("x.csv")
# 1 )Prior Prob
p.y<-nrow(subset(d,d$class.prospect=="yes"))/nrow(d) #YES
p.n<-1-p.y #NO
g.y<-nrow(subset(d,d$age.group=="youth"))/nrow(d) #YOUTH
g.m<-nrow(subset(d,d$age.group=="middle"))/nrow(d) #MIDDLE
g.s<-1-g.y-g.m #SENIOR
s.e<-nrow(subset(d,d$status=="employed"))/nrow(d) #EMPLOYED
s.u<-1-s.e #UNEMPLOYED
n.h<-nrow(subset(d,d$networth=="high"))/nrow(d) #HIGH
n.m<-nrow(subset(d,d$networth=="medium"))/nrow(d) #MEDIUM
n.l<-1-n.h-n.m #LOW
c.e<-nrow(subset(d,d$credit_rating=="excellent"))/nrow(d) #EXCELLENT
c.f<-1-c.e #FAIR
n<-addmargins(prop.table(xtabs(~class.prospect+networth, data=d) ,margin = NULL))
g<-addmargins(prop.table(xtabs(~class.prospect+age.group, data=d) ,margin = NULL))
s<-addmargins(prop.table(xtabs(~class.prospect+status, data=d) ,margin = NULL))
c<-addmargins(prop.table(xtabs(~class.prospect+credit_rating, data=d) ,margin = NULL))
round(n,2)
round(g,2)
round(s,2)
round(c,2)
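# Posterior class probabilities by age group, as reported above.
# Naming: g = age group (y/m/s = youth/middle/senior) and p = prospect class (y/n = yes/no);
# each posterior combines the age-group proportion (g.*) with the class priors (p.*).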
gypn<-round((g.y*p.n)/((g.y*p.n)+((1-g.y)*p.y)),2)*100
gypy<-round((g.y*p.y)/((g.y*p.y)+((1-g.y)*p.n)),2)*100
gmpn<-round((g.m*p.n)/((g.m*p.n)+((1-g.m)*p.y)),2)*100
gmpy<-round((g.m*p.y)/((g.m*p.y)+((1-g.m)*p.n)),2)*100
gspn<-round((g.s*p.n)/((g.s*p.n)+((1-g.s)*p.y)),2)*100
gspy<-round((g.s*p.y)/((g.s*p.y)+((1-g.s)*p.n)),2)*100
j1<-read.table("junk1.txt", header=TRUE)
j2<-read.csv("junk2.csv", header=TRUE)
#summarize data
summary(j1)
summary(j2)
glimpse(j1)
glimpse(j2)
describe(j1)
describe(j2)
#plot data
plot(j1)
plot(j2)
boxplot(j1)
boxplot(j2)
#transform data
j1$class <-as.factor(j1$class)
j2$class <-as.factor(j2$class)
summary(j1)
summary(j2)
#Plot of Transformation
ggpairs(j1)
ggplot(j1, aes(x=a, y=b, color = class)) +geom_point() +stat_smooth(method="glm", se=TRUE)
qplot(a, data = j1, geom = "histogram",
fill = class)+facet_wrap(~ class)
qplot(b, data = j1, geom = "histogram",
fill = class)+facet_wrap(~ class)
qplot(a, data = j1, geom = "density",
fill = class)
qplot(b, data = j1, geom = "density",
fill = class)
ggpairs(j2)
ggplot(j2, aes(x=a, y=b, color = class)) +geom_point() +stat_smooth(method="glm", se=TRUE)
qplot(a, data = j2, geom = "histogram",
fill = class)+facet_wrap(~ class)
qplot(b, data = j2, geom = "histogram",
fill = class)+facet_wrap(~ class)
qplot(a, data = j2, geom = "density",
fill = class)
qplot(b, data = j2, geom = "density",
fill = class)
# Scatter plots with fitted lines, and GLM (binomial) models
plot(j1$a,j1$b)
abline(lm(b~a,data=j1)) # regress b on a to match the plot axes
lj1<-glm(class~., data=j1, family = "binomial")
summary(lj1)
par(mfrow=(c(2,2)))
plot(lj1)
par(mfrow=(c(1,1)))
plot(j2$a,j2$b)
abline(lm(b~a,data=j2)) # regress b on a to match the plot axes
lj2<-glm(class~., data=j2, family = "binomial")
summary(lj2)
par(mfrow=(c(2,2)))
plot(lj2)
i<-read.csv("icu.csv", header=TRUE)
i$COMA<-if_else(i$LOC==2, 1,0) # derive the COMA indicator from LOC (1 when LOC == 2)
i<<-i
i.lm<-glm(STA ~ TYP + COMA + AGE + INF, data=i, family="binomial")
summary(i.lm)
i <- dplyr::select(i, STA, TYP, COMA, AGE, INF) # keep only the five modeling columns
i.train<- i[1:(nrow(i)*.7),]
i.test<-i[(nrow(i)*.7+1):nrow(i),]
### KNN Function
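# euclideanDist: Euclidean distance between two rows of feature values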
euclideanDist <- function(a, b){
d = 0
for(i in c(1:(length(a)) ))
{
d = d + (a[[i]]-b[[i]])^2
}
d = sqrt(d)
return(d)
}
knn_predict2 <- function(test_data, train_data, k_value, labelcol){
pred <- c() #empty pred vector
#LOOP-1
for(i in c(1:nrow(test_data))){ #looping over each record of test data
eu_dist =c() #eu_dist & eu_char empty vector
eu_char = c()
good = 0 #good & bad variable initialization with 0 value
bad = 0
#LOOP-2-looping over train data
for(j in c(1:nrow(train_data))){
#adding euclidean distance b/w test data point and train data to eu_dist vector
eu_dist <- c(eu_dist, euclideanDist(test_data[i,-c(labelcol)], train_data[j,-c(labelcol)]))
#adding class variable of training data in eu_char
eu_char <- c(eu_char, as.character(train_data[j,][[labelcol]]))
}
eu <- data.frame(eu_char, eu_dist) #eu dataframe created with eu_char & eu_dist columns
eu <- eu[order(eu$eu_dist),] #sorting eu dataframe to gettop K neighbors
eu <- eu[1:k_value,] #eu dataframe with top K neighbors
tbl.sm.df<-table(eu$eu_char)
cl_label<- names(tbl.sm.df)[[as.integer(which.max(tbl.sm.df))]]
pred <- c(pred, cl_label)
}
return(pred) #return pred vector
}
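# accuracy: percentage of test rows whose predicted label (predcol) matches the true label (labelcol)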
accuracy <- function(test_data,labelcol,predcol){
correct = 0
for(i in c(1:nrow(test_data))){
if(test_data[i,labelcol] == test_data[i,predcol]){
correct = correct+1
}
}
accu = (correct/nrow(test_data)) * 100
return(accu)
}
K<-c(3,5,7,15,25,50)
nb<<-data.frame()
for (u in K)
{
#load data
knn.df<-i
labelcol <- 3 # index of the label column used by the kNN functions
predictioncol<-ncol(knn.df)+1 # append predictions as a new column
# create train/test partitions
set.seed(2)
n<-nrow(knn.df)
knn.df<- knn.df[sample(n),]
train.df <- knn.df[1:as.integer(0.7*n),]
K = u # number of neighbors to determine the class
table(train.df[,labelcol])
test.df <- knn.df[as.integer(0.7*n +1):n,]
table(test.df[,labelcol])
predictions <- knn_predict2(test.df, train.df, K,labelcol) # call knn_predict2()
test.df[,predictioncol] <- predictions # add predictions to the test data
print(c("Neighbors:", u))
print(accuracy(test.df,labelcol,predictioncol))
print(table(test.df[[predictioncol]],test.df[[labelcol]]))
ac<-accuracy(test.df,labelcol,predictioncol)
nb<<- rbind(nb,data.frame(K,ac))
}
plot(nb$K, nb$ac, type="b", xlab="K", ylab="Accuracy (%)", main="Accuracy vs K")