Case Study

In our fictitious data set, the city of Seattle only receives 911 calls for four reasons: hot latte spills all over laps (ouch!), beaver attacks on unsuspecting passersby (watch out for those beavers!), seal attacks (can’t be too careful), and Marshawn Lynch sightings (people get very excited and choose to call 911 for some reason).
Your task is to run some analysis on this data set and extract insights. Please answer the questions to the best of your ability - there is some room for interpretation.
Installing Packages and Loading Libraries
install.packages("plotly")
install.packages("ggplot2")
install.packages("knitr")
library(plotly)
library(ggplot2)
library(knitr)
Import the file from the folder
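The import step itself is not shown in the write-up; a minimal sketch, assuming the data lives in a CSV named fictitious_911_calls.csv in the working directory (the file name is an assumption), with plyr supplying the count() function used below:

# Hypothetical file name -- adjust to match the actual data file
datasheet <- read.csv("fictitious_911_calls.csv", stringsAsFactors = FALSE)
ds_fictitious <- data.frame(datasheet)
require("plyr")   # count(df, vars) below is the plyr form of count()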
count(ds_fictitious, c("Type"))
## Type freq
## 1 Beaver Accident 508
## 2 Latte Spills 416
## 3 Marshawn Lynch Sighting 324
## 4 Seal Attack 266
ggplot(ds_fictitious, aes(x=Type)) +
geom_bar(fill = "dark blue") + ggtitle("Common reasons for calling 911") +
geom_text(stat='count',aes(label=..count..),vjust=-0.25) + theme_bw() +
theme(
plot.background = element_blank()
,panel.grid.major = element_blank()
,panel.grid.minor = element_blank()
,panel.border = element_blank()
)
ds_fictitious_freq <- count(ds_fictitious, c("Type"))
pie_ds <- data.frame(ds_fictitious_freq)
pie <- plot_ly(pie_ds, labels = ~Type, values = ~freq, type = 'pie') %>%
layout(title = 'Reason for calling 911 (%)',
xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE))
pie
Please create a graph of the 911 calls using the ‘Latitude’ and ‘Longitude’ (graph type is up to you; differentiate call types using colors)
ds_fictitious <- data.frame(datasheet)
min_long <- min(ds_fictitious$Longitude)
max_long <- max(ds_fictitious$Longitude)
min_lat <- min(ds_fictitious$Latitude)
max_lat <- max(ds_fictitious$Latitude)
plot <- ggplot(ds_fictitious, aes(x = Longitude, y = Latitude, color = Type)) +
  geom_point(size = 0.8, alpha = 0.03) +  # alpha must be set inside geom_point(), not ggplot()
scale_x_continuous(limits=c(min_long, max_long)) +
scale_y_continuous(limits=c(min_lat, max_lat))
plot
require ("ggplot2")
ggplot (ds_fictitious, aes (x = Longitude, y = Latitude, colour = Type)) + stat_density2d ()
With the density plot, we are now able to isolate the areas for the different types of calls.
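The map object used in the next two chunks is a base-map layer that is not defined in the visible code; a minimal sketch, assuming the ggmap package (the centre point and zoom level are assumptions, and the satellite view further below would swap maptype for "satellite"):

require("ggmap")
# Fetch Seattle map tiles and wrap them as a ggplot base layer
seattle_tiles <- get_map(location = c(lon = -122.33, lat = 47.60),
                         zoom = 11, maptype = "roadmap")
map <- ggmap(seattle_tiles)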
map + geom_point(data = ds_fictitious, aes(x = Longitude, y = Latitude, color = Type), size=1)
With the use of actual maps, we can see that almost all the Seal Attacks take place in the water. The Beaver Accidents are all concentrated near Bellevue, the Latte Spills occur to the north, and the Marshawn Lynch Sightings occur in the central and southern areas.
# The same overlay, this time on the satellite base map
map + geom_point(data = ds_fictitious, aes(x = Longitude, y = Latitude, color = Type), size = 1)
The satellite view offers clarity and even helps us spot the odd incidents that happen outside the concentrated areas. For example, there is a single odd Seal Attack in the far north amongst the Latte Spills.
Beaver Accident - around 20 data points are in the water, which is odd; even amongst them, 2 are strikingly far to the west and are the most likely to have been incorrectly labelled. Seal Attack - almost all of the seal attacks are clustered around the Elliott Bay region; two of these points are far to the north and 1 is far to the south-east, on land. Just from these observations, we see no reason to believe the Latte Spills and Marshawn Lynch Sightings have been incorrectly labelled.

The analyst would need to revisit the data and check whether these points are outliers or correct data that simply occur outside the trend region.
require ("lattice")
xyplot(Longitude ~ Latitude, ds_fictitious, groups = Type, xlab ="Longitude" , ylab ="Latitude", pch= 10, auto.key=list(columns = 2))
Can we make an intelligent decision as to why a resident dialed 911? (In other words, if we take off the labels - can we still determine which category a 911 call would most likely fall into? Please describe this algorithm and your reason for choosing it.)
First, let us check if we can determine the number of clusters just from Longitude and Latitude.
# Keep the co-ordinates numeric -- kmeans() cannot handle factor columns
# (the as.character() round-trip is a no-op if they are already numeric)
ds_fictitious <- data.frame(datasheet)
ds_fictitious$Latitude  <- as.numeric(as.character(ds_fictitious$Latitude))
ds_fictitious$Longitude <- as.numeric(as.character(ds_fictitious$Longitude))
# deletion of missing
ds_fictitious <- na.omit(ds_fictitious)
# Removing labels from data set
ds_long_lat <- ds_fictitious[ -c(1, 4) ]
# Compute and plot wss for k = 2 to k = 15
set.seed(123)
k.max <- 15
data <- ds_long_lat
wss <- sapply(1:k.max,
              function(k){ kmeans(data, k, nstart = 50, iter.max = 15)$tot.withinss })
wss
## [1] 16.2450348 5.5648686 3.4762822 2.4637603 1.9734418 1.5373042
## [7] 1.2484664 1.0402637 0.9323950 0.8437858 0.7613676 0.6925716
## [13] 0.6306789 0.5834480 0.5433205
plot(1:k.max, wss,
type="b", pch = 19, frame = FALSE,
xlab="Number of clusters K",
ylab="Total within-clusters SSE")
From the elbow graph, we see that after k = 4 the drop in within-cluster SSE is much smaller than before. Hence k = 4 is an optimal number of clusters (we already know that to be true). This helps reconfirm our hypothesis that the calls have an intrinsic location attribute which can be used to predict the type of incident just from the available co-ordinates.
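As a quick sanity check (a sketch, not part of the original write-up), we can fit k = 4 directly and cross-tabulate the resulting clusters against the true labels:

# Fit k-means at the elbow-suggested k = 4 and compare clusters with labels
set.seed(123)
km4 <- kmeans(ds_long_lat, centers = 4, nstart = 50)
table(cluster = km4$cluster, type = ds_fictitious$Type)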
Let us create a model to determine if we can predict the category of a 911 call using only ‘Latitude’ and ‘Longitude’.
The first hunch I have is to use the K Nearest Neighbour (KNN) classifier. When we drew the plots above, we already got an idea that the different call types follow broad general patterns which are in fact very prominent by location. This suggested that a distance-based feature would be significant. Also, as KNN is not known to over-generalise the data and is very effective on small data sets, it should be a good model for consideration.
require("class")
set.seed(99)
#Sampling the dataset provided (70% train, 30% test)
ds_sample <- sample(2, nrow(ds_fictitious), replace=TRUE, prob=c(0.7, 0.3))
ds_train <- ds_fictitious[ds_sample==1,] #Select the 70% of rows
ds_test <- ds_fictitious[ds_sample==2,] #Select the 30% of rows
# Removing the labels
ds_train1 <- ds_train[-c(1, 4)]
ds_test1 <- ds_test[-c(1, 4)]
# Storing the label for test and train data
train_labels <- ds_train$Type
# dim(train_labels)
# class(train_labels)
test_labels <- ds_test$Type
# dim(test_labels)
# class(test_labels)
# Training the model
test_pred1 <- knn(train = ds_train1, test = ds_test1, cl = train_labels, k = 4, prob = TRUE)
The model is built using 70% of the data, while the remaining 30% is reserved for validation. We use the knn function from R's class package, which uses Euclidean distance by default. The only attributes made available to the model are the Longitude and Latitude co-ordinates.
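As an aside (a sketch, not part of the original analysis), the choice of k = 4 could be sanity-checked by scanning a range of k values against the held-out set:

# Held-out accuracy for k = 1..15; which.max() picks the best-performing k
acc_by_k <- sapply(1:15, function(k) {
  mean(knn(train = ds_train1, test = ds_test1, cl = train_labels, k = k)
       == test_labels)
})
which.max(acc_by_k)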
Now, we evaluate the model:
# Evaluate model performance
require("gmodels")   # for CrossTable()
CrossTable(x = test_labels, y = test_pred1, prop.chisq = FALSE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 449
##
##
## | test_pred1
## test_labels | Beaver Accident | Latte Spills | Marshawn Lynch Sighting | Seal Attack | Row Total |
## ------------------------|-------------------------|-------------------------|-------------------------|-------------------------|-------------------------|
## Beaver Accident | 158 | 0 | 0 | 1 | 159 |
## | 0.994 | 0.000 | 0.000 | 0.006 | 0.354 |
## | 0.994 | 0.000 | 0.000 | 0.013 | |
## | 0.352 | 0.000 | 0.000 | 0.002 | |
## ------------------------|-------------------------|-------------------------|-------------------------|-------------------------|-------------------------|
## Latte Spills | 0 | 125 | 3 | 1 | 129 |
## | 0.000 | 0.969 | 0.023 | 0.008 | 0.287 |
## | 0.000 | 0.954 | 0.036 | 0.013 | |
## | 0.000 | 0.278 | 0.007 | 0.002 | |
## ------------------------|-------------------------|-------------------------|-------------------------|-------------------------|-------------------------|
## Marshawn Lynch Sighting | 0 | 6 | 80 | 1 | 87 |
## | 0.000 | 0.069 | 0.920 | 0.011 | 0.194 |
## | 0.000 | 0.046 | 0.964 | 0.013 | |
## | 0.000 | 0.013 | 0.178 | 0.002 | |
## ------------------------|-------------------------|-------------------------|-------------------------|-------------------------|-------------------------|
## Seal Attack | 1 | 0 | 0 | 73 | 74 |
## | 0.014 | 0.000 | 0.000 | 0.986 | 0.165 |
## | 0.006 | 0.000 | 0.000 | 0.961 | |
## | 0.002 | 0.000 | 0.000 | 0.163 | |
## ------------------------|-------------------------|-------------------------|-------------------------|-------------------------|-------------------------|
## Column Total | 159 | 131 | 83 | 76 | 449 |
## | 0.354 | 0.292 | 0.185 | 0.169 | |
## ------------------------|-------------------------|-------------------------|-------------------------|-------------------------|-------------------------|
##
##
From the above confusion matrix, we can draw the following inferences:

1. Beaver Accidents - Out of 159 Beaver Accidents, only 1 data point was incorrectly classified, as a Seal Attack.
2. Latte Spills - Of the 129 data points, 125 were labelled correctly, with 3 incorrectly labelled as Marshawn Lynch Sightings and 1 as a Seal Attack.
3. Marshawn Lynch Sightings - 80 of 87 labels were right here, with 6 categorized as Latte Spills and 1 as a Seal Attack.
4. Seal Attacks - Again the model performs exceptionally well, correctly identifying 73 of 74 data points, with just one misclassified as a Beaver Accident.
Accuracy on test set - 97.10%
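For reference, the figure above can be reproduced in one line from the predictions (not shown in the original output):

# Proportion of test points whose predicted label matches the true label
mean(test_pred1 == test_labels)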
What these numbers tell us, in conclusion, is that we can indeed make an intelligent decision as to why a resident dialed 911 using just the location co-ordinates of the call!
Should we be concerned that 'Latitude' and 'Longitude' are not necessarily Euclidean?
The first model does use Euclidean distance, and it is true that ‘Latitude’ and ‘Longitude’ are not necessarily Euclidean. Since latitude and longitude are not co-ordinates in a Cartesian system, some predictive power is lost when distances are reconstructed directly from them. As the Earth is (approximately) a sphere, the direct mapping does raise concerns. In an ideal situation, we would transform the co-ordinates into a projected (planar) system, or compute great-circle distances, before measuring anything.
However, to get close to 95% of the answer, we can choose to proceed with using Lat and Long as distance co-ordinates. Additionally, we test alternative models, particularly Decision Trees/Random Forests, along with traditional classifiers such as Naive Bayes. The models can then be compared to address our question.
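For illustration (a sketch, not part of the original analysis), the great-circle distance that a distance-based model would ideally use can be computed with the haversine formula; the function name below is introduced here:

# Haversine distance in km between two lat/long points on a sphere
haversine_km <- function(lat1, lon1, lat2, lon2, r = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * r * asin(sqrt(a))
}
haversine_km(47.60, -122.33, 47.61, -122.20)   # roughly 9.8 km across Seattle

At city scale these distances are very nearly proportional to the Euclidean ones computed on raw degrees, which is why treating the co-ordinates as planar works as well as it does here.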
Next, a Naive Bayes model is chosen, as we can build it fast and comparatively easily. Even though the model assumes independence between attributes, it is one of the most popular classifiers available: it uses both attributes simultaneously, and new training data can be added without rebuilding the model. It also lets us get around the distance problem, since it does not rely on a distance metric at all.
# For Naive Bayes the co-ordinates are treated as categorical, so every
# column (including Latitude and Longitude) is converted to a factor
ds_fictitious <- data.frame(lapply(datasheet, as.factor))
set.seed(364)
#Sampling the dataset provided (65% train, 35% test)
require("caret")   # for createDataPartition()
ds_sample <- createDataPartition(ds_fictitious$Type, p = 0.65, list = FALSE)
training_set <- ds_fictitious[ds_sample,]
test_set <- ds_fictitious[-ds_sample,]
# Removing the unused fourth column (keeping Type, Latitude and Longitude)
training_set <- training_set[-c(4)]
test_set <- test_set[-c(4)]
# Training the model
require("e1071")   # for naiveBayes()
classifier <- naiveBayes(training_set[, 2:3], training_set[, 1])
The model is built using 65% of the data. We use the naiveBayes function (from the e1071 package) to build the model.
Now, we evaluate the model:
# Evaluate model performance
table(predict(classifier, test_set[,2:3]), test_set[,1])
##
##                           Beaver Accident Latte Spills Marshawn Lynch Sighting Seal Attack
##   Beaver Accident                     161           56                      70          62
##   Latte Spills                          1           81                       7           2
##   Marshawn Lynch Sighting              10            8                      33           2
##   Seal Attack                           5            0                       3          27
summary(test_set)
## Type Latitude Longitude
## Beaver Accident :177 47.6 : 10 -122.3234: 4
## Latte Spills :145 47.59 : 9 -122.341 : 3
## Marshawn Lynch Sighting:113 47.58 : 4 -122.3359: 3
## Seal Attack : 93 47.61 : 4 -122.3214: 3
## 47.5199: 3 -122.3142: 3
## 47.5449: 3 -122.2684: 3
## (Other):495 (Other) :509
Accuracy on test set - 57.20% (302 of the 528 test points fall on the diagonal of the table above). We observe a relatively low accuracy when the prediction table is set against the actual test labels. The model is inclined to classify more calls as Beaver Accidents, as that class has the highest number of data points in the training set. Naive Bayes is known to handle discrete variables well, but in our case the two attributes carry their predictive power jointly rather than independently, which perhaps explains the low accuracy.
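The figure above can be verified directly from the confusion table (a small sketch; nb_tab is a name introduced here):

# Overall accuracy: the diagonal of the confusion table over the grand total
nb_tab <- table(predict(classifier, test_set[, 2:3]), test_set[, 1])
sum(diag(nb_tab)) / sum(nb_tab)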
The Random Forest model was built in Weka using just the co-ordinate variables, as with the other models. The model specifics are detailed below.
=== Weka Run information ===
Scheme:       weka.classifiers.trees.RandomForest -P 100 -I 100 -num-slots 1 -K 0 -M 1.0 -V 0.001 -S 1
Relation:     rev data for test
Instances:    1514
Attributes:   3
              Type
              Latitude
              Longitude
Test mode:    split 70.0% train, remainder test
=== Classifier model (full training set) ===
RandomForest
Bagging with 100 iterations and base learner
weka.classifiers.trees.RandomTree -K 0 -M 1.0 -V 0.001 -S 1 -do-not-check-capabilities
Time taken to build model: 0.16 seconds
=== Evaluation on test split ===
Time taken to test model on test split: 0.02 seconds
=== Summary ===
Correctly Classified Instances         437               96.2555 %
Incorrectly Classified Instances        17                3.7445 %
Kappa statistic                          0.9489
Mean absolute error                      0.0244
Root mean squared error                  0.1269
Relative absolute error                  6.6232 %
Root relative squared error             29.6135 %
Total Number of Instances              454
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure MCC ROC Area PRC Area Class
0.987 0.007 0.987 0.987 0.987 0.981 0.995 0.989 Beaver Accident
0.974 0.008 0.961 0.974 0.967 0.961 0.991 0.973 Seal Attack
1.000 0.036 0.907 1.000 0.951 0.935 0.992 0.970 Latte Spills
0.875 0.000 1.000 0.875 0.933 0.919 0.983 0.969 Marshawn Lynch Sighting
Weighted Avg. 0.963 0.013 0.965 0.963 0.962 0.951 0.991 0.977
=== Confusion Matrix ===
   a   b   c   d   <-- classified as
 155   2   0   0 |   a = Beaver Accident
   2  74   0   0 |   b = Seal Attack
   0   0 117   0 |   c = Latte Spills
   0   1  12  91 |   d = Marshawn Lynch Sighting
### Accuracy on test set - 96.25%
Inference: Beaver Accident - only two labels were misclassified, as Seal Attacks. Seal Attack - again, only two labels were misclassified, as Beaver Accidents. Latte Spills - the model predicts this class with 100% recall. Marshawn Lynch Sighting - this is where we observe some misclassification (1 as a Seal Attack and 12 as Latte Spills); the Marshawn sightings are spread across the map, which perhaps explains why the model cannot differentiate here as cleanly as before.
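For completeness, a roughly equivalent model can be fit without leaving R (a sketch using the randomForest package rather than Weka; the 70/30 split mirrors the Weka run, but the exact partition, and hence the accuracy, will differ slightly):

require("randomForest")
set.seed(1)
ds_rf <- data.frame(datasheet)                 # numeric co-ordinates, as for KNN
ds_rf$Type <- as.factor(ds_rf$Type)
idx <- sample(nrow(ds_rf), 0.7 * nrow(ds_rf))
rf_model <- randomForest(Type ~ Latitude + Longitude,
                         data = ds_rf[idx, ], ntree = 100)
rf_pred <- predict(rf_model, ds_rf[-idx, ])
mean(rf_pred == ds_rf[-idx, "Type"])           # held-out accuracy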
### Conclusion
We observe that both KNN and Random Forest perform exceedingly well in predicting the labels with just ‘Latitude’ and ‘Longitude’. As the co-ordinates cover data points only within a single city, the error introduced by Lat and Long not being Euclidean distances is small enough that roughly 95%+ accuracy is attainable regardless. However, if we wish to sidestep this factor completely, we can proceed with the Random Forest algorithm, which does not depend on a distance metric and delivers almost the same results.