You have been hired by a local electronics retailer, and the above dataset has been given to you. Manager Bayes Jr. 9th wants to create a spreadsheet to predict whether a customer is a likely prospect.
Prior YES probability: 64%
Prior NO probability: 36%
P(age-group=youth|prospect=yes): 23%
P(age-group=youth|prospect=no): 13%
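For reference, Bayes' rule gives the single-attribute posterior (shown here for the youth age group) as:
P(prospect=yes|age-group=youth) = P(age-group=youth|prospect=yes) * P(prospect=yes) / [ P(age-group=youth|prospect=yes) * P(prospect=yes) + P(age-group=youth|prospect=no) * P(prospect=no) ]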
## networth
## class.prospect high low medium Sum
## no 0.14 0.07 0.14 0.36
## yes 0.14 0.21 0.29 0.64
## Sum 0.29 0.29 0.43 1.00
## age.group
## class.prospect middle senior youth Sum
## no 0.00 0.14 0.21 0.36
## yes 0.29 0.21 0.14 0.64
## Sum 0.29 0.36 0.36 1.00
## status
## class.prospect employed unemployed Sum
## no 0.29 0.07 0.36
## yes 0.21 0.43 0.64
## Sum 0.50 0.50 1.00
## credit_rating
## class.prospect excellent fair Sum
## no 0.21 0.14 0.36
## yes 0.21 0.43 0.64
## Sum 0.43 0.57 1.00
P(prospect=no|age-group=youth): 24%
P(prospect=yes|age-group=youth): 50%
P(prospect=no|age-group=middle): 18%
P(prospect=yes|age-group=middle): 42%
P(prospect=no|age-group=senior): 24%
P(prospect=yes|age-group=senior): 50%
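As a sketch of how the spreadsheet could combine all four attributes into one prediction, the hypothetical helper below (not part of the original analysis) applies naive Bayes to the data frame d loaded in the code section:
score_prospect <- function(d, age, net, stat, credit) {
  # P(value | class) from a row-conditional proportion table
  lik <- function(var, val, cls) {
    tab <- prop.table(table(d$class.prospect, d[[var]]), margin = 1)
    tab[cls, val]
  }
  post <- sapply(c("yes", "no"), function(cls) {
    prior <- mean(d$class.prospect == cls)            # class prior
    prior * lik("age.group", age, cls) * lik("networth", net, cls) *
      lik("status", stat, cls) * lik("credit_rating", credit, cls)
  })
  post / sum(post)                                    # normalised posteriors for yes / no
}
# e.g. score_prospect(d, "youth", "medium", "employed", "fair")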
You have just recently joined a data science team.
There are two datasets, junk1.txt and junk2.csv. The team has two options: 1. Go back to the client and ask for more data to remedy problems with the data. 2. Accept the data and undertake a major analytics exercise.
The team is relying on your data science skills to determine how it should proceed. Explore the data and recommend an action for each file, enumerating the reasons.
The summaries below show that the two datasets are similar on average. The statistical description shows that they are close in mean and standard deviation, but they differ in median, skew, kurtosis, and standard error.
## a b class
## Min. :-2.29854 Min. :-3.17174 Min. :1.0
## 1st Qu.:-0.85014 1st Qu.:-1.04712 1st Qu.:1.0
## Median :-0.04754 Median :-0.07456 Median :1.5
## Mean : 0.04758 Mean : 0.01324 Mean :1.5
## 3rd Qu.: 1.09109 3rd Qu.: 1.05342 3rd Qu.:2.0
## Max. : 3.00604 Max. : 3.10230 Max. :2.0
## a b class
## Min. :-4.16505 Min. :-3.90472 Min. :0.0000
## 1st Qu.:-1.01447 1st Qu.:-0.89754 1st Qu.:0.0000
## Median : 0.08754 Median :-0.08358 Median :0.0000
## Mean :-0.05126 Mean : 0.05624 Mean :0.0625
## 3rd Qu.: 0.89842 3rd Qu.: 1.00354 3rd Qu.:0.0000
## Max. : 4.62647 Max. : 4.31052 Max. :1.0000
## Observations: 100
## Variables: 3
## $ a <dbl> 1.6204214, 1.4340220, 2.4766615, 0.5283093, 1.0054081, 1.1032...
## $ b <dbl> 3.00362413, 0.78524873, 0.93677611, 0.11962219, 0.78728662, 0...
## $ class <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## Observations: 4,000
## Variables: 3
## $ a <dbl> 3.18864809, 0.82245267, 0.81472472, -1.50653621, 0.44268866, ...
## $ b <dbl> 0.92917735, 0.04760314, 0.02910931, 3.13231360, 2.84942822, -...
## $ class <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## vars n mean sd median trimmed mad min max range skew kurtosis
## a 1 100 0.05 1.27 -0.05 0.03 1.47 -2.30 3.01 5.30 0.13 -0.80
## b 2 100 0.01 1.45 -0.07 -0.02 1.49 -3.17 3.10 6.27 0.12 -0.53
## class 3 100 1.50 0.50 1.50 1.50 0.74 1.00 2.00 1.00 0.00 -2.02
## se
## a 0.13
## b 0.14
## class 0.05
## vars n mean sd median trimmed mad min max range skew kurtosis
## a 1 4000 -0.05 1.30 0.09 -0.02 1.40 -4.17 4.63 8.79 -0.17 -0.34
## b 2 4000 0.06 1.31 -0.08 0.03 1.39 -3.90 4.31 8.22 0.21 -0.35
## class 3 4000 0.06 0.24 0.00 0.00 0.00 0.00 1.00 1.00 3.61 11.06
## se
## a 0.02
## b 0.02
## class 0.00
The initial plots show that junk1 looks more spread out as a result of its lower number of data points, while junk2, with many more points, forms a much denser cluster. Both datasets have a binary (two-level) class variable. The boxplots show that junk1 is fairly symmetrical with no outliers, while junk2 has several outliers.
To transform the data, we convert the class column of both datasets to a factor. Rerunning the summary shows that class now has two levels. junk1 has levels 1 and 2, split evenly 50:50. junk2 has levels 0 and 1 and is heavily skewed toward 0, which accounts for about 94% of the data points.
## a b class
## Min. :-2.29854 Min. :-3.17174 1:50
## 1st Qu.:-0.85014 1st Qu.:-1.04712 2:50
## Median :-0.04754 Median :-0.07456
## Mean : 0.04758 Mean : 0.01324
## 3rd Qu.: 1.09109 3rd Qu.: 1.05342
## Max. : 3.00604 Max. : 3.10230
## a b class
## Min. :-4.16505 Min. :-3.90472 0:3750
## 1st Qu.:-1.01447 1st Qu.:-0.89754 1: 250
## Median : 0.08754 Median :-0.08358
## Mean :-0.05126 Mean : 0.05624
## 3rd Qu.: 0.89842 3rd Qu.: 1.00354
## Max. : 4.62647 Max. : 4.31052
The transformed junk1 plots show a weak negative correlation between a and b. The density and histogram plots confirm the 50:50 class split, with the two classes intersecting near 0. In addition, the scatter plot shows that b versus a trends negatively when class = 2 and positively when class = 1.
The transformed junk2 plots show a weak positive correlation. The histograms confirm the high count of class 0. The density plots show class 0 spread evenly, while class 1 concentrates most of its density (roughly 80%) with a between 0 and 2.5 and b between -2.5 and 0. In addition, the scatter plot shows a noticeable positive trend when class = 0 and a slight positive trend when class = 1; it also shows class 0 scattered across the whole range of the data, while class 1 is clustered between 0 and 2.5.
I recommend GLM binomial (logistic regression) modeling, since both datasets have a two-level factor class. However, as the model summaries below show, none of the junk1 coefficients are significant, while the junk2 coefficients are highly significant. In addition, junk1 has a limited amount of data, whereas junk2 is far more robust in comparison. Therefore, I recommend using only the junk2 data for modeling. Some additional work is needed to remove or investigate outlier points (see the sketch after the model summaries below) and to check whether further transformations are needed before finalizing the model.
##
## Call:
## glm(formula = class ~ ., family = "binomial", data = j1)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.30886 -1.16690 0.02148 1.16699 1.29651
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.005331 0.200613 0.027 0.979
## a -0.101126 0.160193 -0.631 0.528
## b -0.043367 0.140270 -0.309 0.757
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 138.63 on 99 degrees of freedom
## Residual deviance: 138.17 on 97 degrees of freedom
## AIC: 144.17
##
## Number of Fisher Scoring iterations: 3
##
## Call:
## glm(formula = class ~ ., family = "binomial", data = j2)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.77408 -0.34178 -0.15964 -0.06007 2.81798
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.06598 0.14995 -27.11 <2e-16 ***
## a 1.25104 0.08405 14.88 <2e-16 ***
## b -1.25555 0.08651 -14.51 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1870.3 on 3999 degrees of freedom
## Residual deviance: 1359.4 on 3997 degrees of freedom
## AIC: 1365.4
##
## Number of Fisher Scoring iterations: 7
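As a sketch of the outlier/influence check recommended above, assuming the fitted junk2 model object lj2 from the code section below:
cd <- cooks.distance(lj2)
which(cd > 4 / nrow(j2))   # observations with unusually high influence (rule-of-thumb cutoff)
plot(lj2, which = 4)       # Cook's distance plot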
Use icu.csv. The formula to fit is "STA ~ TYP + COMA + AGE + INF".
Read icu.csv and subset it to the five columns in the formula, with STA as the label column. (COMA is not a column in the raw file; it is derived from LOC, with COMA = 1 when LOC == 2, as in the code section below.)
##
## Call:
## glm(formula = STA ~ TYP + COMA + AGE + INF, family = "binomial",
## data = i)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9864 -0.6855 -0.3734 -0.2460 2.5714
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.33474 1.05476 -5.058 4.24e-07 ***
## TYP 2.08673 0.76488 2.728 0.00637 **
## COMA 2.44415 0.82463 2.964 0.00304 **
## AGE 0.02952 0.01151 2.565 0.01033 *
## INF 0.47553 0.41185 1.155 0.24825
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 200.16 on 199 degrees of freedom
## Residual deviance: 160.09 on 195 degrees of freedom
## AIC: 170.09
##
## Number of Fisher Scoring iterations: 6
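For interpretation, the ICU coefficients can be read as odds ratios; a minimal sketch, assuming the fitted object i.lm from the code section below:
exp(coef(i.lm))     # odds ratios for TYP, COMA, AGE and INF
exp(confint(i.lm))  # 95% profile-likelihood confidence intervals (via MASS)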
Run kNN.R for K = (3, 5, 7, 15, 25, 50). Submit the resulting confusion matrix and accuracy for each K.
## [1] "Neighbors:" "3"
## [1] 95
##
## 0 1
## 0 57 3
## [1] "Neighbors:" "5"
## [1] 95
##
## 0 1
## 0 57 3
## [1] "Neighbors:" "7"
## [1] 95
##
## 0 1
## 0 57 3
## [1] "Neighbors:" "15"
## [1] 95
##
## 0 1
## 0 57 3
## [1] "Neighbors:" "25"
## [1] 95
##
## 0 1
## 0 57 3
## [1] "Neighbors:" "50"
## [1] 95
##
## 0 1
## 0 57 3
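If a fuller confusion matrix report is wanted, caret (already loaded) can produce one for any single K; a minimal sketch, assuming test.df, labelcol and predictioncol from the kNN loop in the code section below:
confusionMatrix(factor(test.df[[predictioncol]], levels = c(0, 1)),
                factor(test.df[[labelcol]], levels = c(0, 1)))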
Plot Accuracy vs. K and write a short summary of the findings. In this run the accuracy is 95% for every K tested: the 60-record test set contains 57 zeros and 3 ones, and the classifier predicts class 0 for every record, so increasing K from 3 to 50 has no effect on the result.
Code used in analysis
knitr::opts_chunk$set(
echo = FALSE,
message = FALSE,
warning = FALSE
)
#knitr::opts_chunk$set(echo = TRUE)
require(knitr)
library(ggplot2)
library(tidyr)
library(MASS)
library(psych)
library(kableExtra)
library(dplyr)
library(faraway)
library(gridExtra)
library(reshape2)
library(leaps)
library(pROC)
library(caret)
library(naniar)
library(pander)
library(mlbench)
library(e1071)
library(fpp2)
library(mlr)
library(GGally)
#x<-read.clipboard()
#summary(x)
#write.csv((x),"x.csv")
d<-read.csv("x.csv")
# 1 )Prior Prob
p.y<-nrow(subset(d,d$class.prospect=="yes"))/nrow(d) #YES
p.n<-1-p.y #NO
g.y<-nrow(subset(d,d$age.group=="youth"))/nrow(d) #YOUTH
g.m<-nrow(subset(d,d$age.group=="middle"))/nrow(d) #MIDDLE
g.s<-1-g.y-g.m #SENIOR
s.e<-nrow(subset(d,d$status=="employed"))/nrow(d) #EMPLOYED
s.u<-1-s.e #UNEMPLOYED
n.h<-nrow(subset(d,d$networth=="high"))/nrow(d) #HIGH
n.m<-nrow(subset(d,d$networth=="medium"))/nrow(d) #MEDIUM
n.l<-1-n.h-n.m #LOW
c.e<-nrow(subset(d,d$credit_rating=="excellent"))/nrow(d) #EXCELLENT
c.f<-1-c.e #FAIR
n<-addmargins(prop.table(xtabs(~class.prospect+networth, data=d) ,margin = NULL))
g<-addmargins(prop.table(xtabs(~class.prospect+age.group, data=d) ,margin = NULL))
s<-addmargins(prop.table(xtabs(~class.prospect+status, data=d) ,margin = NULL))
c<-addmargins(prop.table(xtabs(~class.prospect+credit_rating, data=d) ,margin = NULL))
round(n,2)
round(g,2)
round(s,2)
round(c,2)
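# Posterior class probabilities by age group, as reported above.
# Naming: g = age group (y/m/s = youth/middle/senior) and p = prospect class (y/n = yes/no);
# each posterior combines the age-group proportion (g.*) with the class priors (p.*).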
gypn<-round((g.y*p.n)/((g.y*p.n)+((1-g.y)*p.y)),2)*100
gypy<-round((g.y*p.y)/((g.y*p.y)+((1-g.y)*p.n)),2)*100
gmpn<-round((g.m*p.n)/((g.m*p.n)+((1-g.m)*p.y)),2)*100
gmpy<-round((g.m*p.y)/((g.m*p.y)+((1-g.m)*p.n)),2)*100
gspn<-round((g.s*p.n)/((g.s*p.n)+((1-g.s)*p.y)),2)*100
gspy<-round((g.s*p.y)/((g.s*p.y)+((1-g.s)*p.n)),2)*100
j1<-read.table("junk1.txt", header=TRUE)
j2<-read.csv("junk2.csv", header=TRUE)
#summarize data
summary(j1)
summary(j2)
glimpse(j1)
glimpse(j2)
describe(j1)
describe(j2)
#plot data
plot(j1)
plot(j2)
boxplot(j1)
boxplot(j2)
#transform data
j1$class <-as.factor(j1$class)
j2$class <-as.factor(j2$class)
summary(j1)
summary(j2)
#Plot of Transformation
ggpairs(j1)
ggplot(j1, aes(x=a, y=b, color = class)) +geom_point() +stat_smooth(method="glm", se=TRUE)
qplot(a, data = j1, geom = "histogram",
fill = class)+facet_wrap(~ class)
qplot(b, data = j1, geom = "histogram",
fill = class)+facet_wrap(~ class)
qplot(a, data = j1, geom = "density",
fill = class)
qplot(b, data = j1, geom = "density",
fill = class)
ggpairs(j2)
ggplot(j2, aes(x=a, y=b, color = class)) +geom_point() +stat_smooth(method="glm", se=TRUE)
qplot(a, data = j2, geom = "histogram",
fill = class)+facet_wrap(~ class)
qplot(b, data = j2, geom = "histogram",
fill = class)+facet_wrap(~ class)
qplot(a, data = j2, geom = "density",
fill = class)
qplot(b, data = j2, geom = "density",
fill = class)
# Scatter plots with fitted lines, and GLM (binomial) models
plot(j1$a,j1$b)
abline(lm(b~a,data=j1)) # regress b on a to match the plot axes
lj1<-glm(class~., data=j1, family = "binomial")
summary(lj1)
par(mfrow=(c(2,2)))
plot(lj1)
par(mfrow=(c(1,1)))
plot(j2$a,j2$b)
abline(lm(b~a,data=j2)) # regress b on a to match the plot axes
lj2<-glm(class~., data=j2, family = "binomial")
summary(lj2)
par(mfrow=(c(2,2)))
plot(lj2)
i<-read.csv("icu.csv", header=TRUE)
i$COMA<-if_else(i$LOC==2, 1,0) # derive the COMA indicator from LOC (1 when LOC == 2)
i<<-i
i.lm<-glm(STA ~ TYP + COMA + AGE + INF, data=i, family="binomial")
summary(i.lm)
i <- dplyr::select(i, STA, TYP, COMA, AGE, INF) # keep only the five modeling columns
i.train<- i[1:(nrow(i)*.7),]
i.test<-i[(nrow(i)*.7+1):nrow(i),]
### KNN Function
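# euclideanDist: Euclidean distance between two rows of feature values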
euclideanDist <- function(a, b){
d = 0
for(i in c(1:(length(a)) ))
{
d = d + (a[[i]]-b[[i]])^2
}
d = sqrt(d)
return(d)
}
knn_predict2 <- function(test_data, train_data, k_value, labelcol){
pred <- c() #empty pred vector
#LOOP-1
for(i in c(1:nrow(test_data))){ #looping over each record of test data
eu_dist =c() #eu_dist & eu_char empty vector
eu_char = c()
good = 0 #good & bad variable initialization with 0 value
bad = 0
#LOOP-2-looping over train data
for(j in c(1:nrow(train_data))){
#adding euclidean distance b/w test data point and train data to eu_dist vector
eu_dist <- c(eu_dist, euclideanDist(test_data[i,-c(labelcol)], train_data[j,-c(labelcol)]))
#adding class variable of training data in eu_char
eu_char <- c(eu_char, as.character(train_data[j,][[labelcol]]))
}
eu <- data.frame(eu_char, eu_dist) #eu dataframe created with eu_char & eu_dist columns
eu <- eu[order(eu$eu_dist),] #sorting eu dataframe to gettop K neighbors
eu <- eu[1:k_value,] #eu dataframe with top K neighbors
tbl.sm.df<-table(eu$eu_char)
cl_label<- names(tbl.sm.df)[[as.integer(which.max(tbl.sm.df))]]
pred <- c(pred, cl_label)
}
return(pred) #return pred vector
}
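# accuracy: percentage of test rows whose predicted label (predcol) matches the true label (labelcol)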
accuracy <- function(test_data,labelcol,predcol){
correct = 0
for(i in c(1:nrow(test_data))){
if(test_data[i,labelcol] == test_data[i,predcol]){
correct = correct+1
}
}
accu = (correct/nrow(test_data)) * 100
return(accu)
}
K<-c(3,5,7,15,25,50)
nb<<-data.frame()
for (u in K)
{
#load data
knn.df<-i
labelcol <- 3 # index of the label column used by the kNN functions
predictioncol<-ncol(knn.df)+1 # append predictions as a new column
# create train/test partitions
set.seed(2)
n<-nrow(knn.df)
knn.df<- knn.df[sample(n),]
train.df <- knn.df[1:as.integer(0.7*n),]
K = u # number of neighbors to determine the class
table(train.df[,labelcol])
test.df <- knn.df[as.integer(0.7*n +1):n,]
table(test.df[,labelcol])
predictions <- knn_predict2(test.df, train.df, K,labelcol) # call knn_predict2()
test.df[,predictioncol] <- predictions # add predictions to the test data
print(c("Neighbors:", u))
print(accuracy(test.df,labelcol,predictioncol))
print(table(test.df[[predictioncol]],test.df[[labelcol]]))
ac<-accuracy(test.df,labelcol,predictioncol)
nb<<- rbind(nb,data.frame(K,ac))
}
plot(nb$K, nb$ac, type="b", xlab="K", ylab="Accuracy (%)", main="Accuracy vs K")