Case Study

In our fictitious data set, the city of Seattle only receives 911 calls for four reasons: hot latte spills all over laps (ouch!), beaver attacks on unsuspecting passersby (watch out for those beavers!), seal attacks (can’t be too careful), and Marshawn Lynch sightings (people get very excited and choose to call 911 for some reason).
Your task is to run some analysis on this data set and extract insights. Please answer the questions to the best of your ability - there is some room for interpretation.
Installing Packages and Loading Libraries
install.packages("plotly")
install.packages("ggplot2")
install.packages("knitr")
library(plotly)
library(ggplot2)
library(knitr)
Import the file from the folder
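The import step itself is not shown in the write-up; a minimal sketch, assuming the data lives in a CSV named fictitious_911_calls.csv in the working directory (the file name is an assumption), with plyr supplying the count() function used below:

# Hypothetical file name -- adjust to match the actual data file
datasheet <- read.csv("fictitious_911_calls.csv", stringsAsFactors = FALSE)
ds_fictitious <- data.frame(datasheet)
require("plyr")   # count(df, vars) below is the plyr form of count()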
count(ds_fictitious, c("Type"))
## Type freq
## 1 Beaver Accident 508
## 2 Latte Spills 416
## 3 Marshawn Lynch Sighting 324
## 4 Seal Attack 266
ggplot(ds_fictitious, aes(x=Type)) +
geom_bar(fill = "dark blue") + ggtitle("Common reasons for calling 911") +
geom_text(stat='count',aes(label=..count..),vjust=-0.25) + theme_bw() +
theme(
plot.background = element_blank()
,panel.grid.major = element_blank()
,panel.grid.minor = element_blank()
,panel.border = element_blank()
)
ds_fictitious_freq <- count(ds_fictitious, c("Type"))
pie_ds <- data.frame(ds_fictitious_freq)
pie <- plot_ly(pie_ds, labels = ~Type, values = ~freq, type = 'pie') %>%
layout(title = 'Reason for calling 911 (%)',
xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE))
pie
Please create a graph of the 911 calls using the ‘Latitude’ and ‘Longitude’ (graph type is up to you; differentiate call types using colors)
ds_fictitious <- data.frame(datasheet)
min_long <- min(ds_fictitious$Longitude)
max_long <- max(ds_fictitious$Longitude)
min_lat <- min(ds_fictitious$Latitude)
max_lat <- max(ds_fictitious$Latitude)
plot <- ggplot(ds_fictitious, aes(x = Longitude, y = Latitude, color = Type)) +
  geom_point(size = 0.8, alpha = 0.03) +  # alpha must be set inside geom_point(), not ggplot()
scale_x_continuous(limits=c(min_long, max_long)) +
scale_y_continuous(limits=c(min_lat, max_lat))
plot
require ("ggplot2")
ggplot (ds_fictitious, aes (x = Longitude, y = Latitude, colour = Type)) + stat_density2d ()
With the density plot, we are now able to isolate the areas for the different types of calls.
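The map object used in the next two chunks is a base-map layer that is not defined in the visible code; a minimal sketch, assuming the ggmap package (the centre point and zoom level are assumptions, and the satellite view further below would swap maptype for "satellite"):

require("ggmap")
# Fetch Seattle map tiles and wrap them as a ggplot base layer
seattle_tiles <- get_map(location = c(lon = -122.33, lat = 47.60),
                         zoom = 11, maptype = "roadmap")
map <- ggmap(seattle_tiles)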
map + geom_point(data = ds_fictitious, aes(x = Longitude, y = Latitude, color = Type), size=1)
With the use of actual maps, we can see that almost all the Seal Attacks take place in the water. The Beaver Accidents are all concentrated near Bellevue, the Latte Spills occur to the north, and the Marshawn Lynch Sightings occur in the central and southern areas.
# The same overlay, this time on the satellite base map
map + geom_point(data = ds_fictitious, aes(x = Longitude, y = Latitude, color = Type), size = 1)
The satellite view offers clarity and even helps us spot the odd incidents that happen outside the concentrated areas. For example, there is a single odd Seal Attack in the far north amongst the Latte Spills.
Beaver Accident - around 20 data points are in the water, which is odd; even amongst them, 2 are strikingly far to the west and are the most likely to have been incorrectly labelled. Seal Attack - almost all of the seal attacks are clustered around the Elliott Bay region; two of these points are far to the north and 1 is far to the south-east, on land. Just from these observations, we see no reason to believe the Latte Spills and Marshawn Lynch Sightings have been incorrectly labelled.

The analyst would need to revisit the data and check whether these points are outliers or correct data that simply occur outside the trend region.
require ("lattice")
xyplot(Longitude ~ Latitude, ds_fictitious, groups = Type, xlab ="Longitude" , ylab ="Latitude", pch= 10, auto.key=list(columns = 2))
Can we make an intelligent decision as to why a resident dialed 911? (In other words, if we take off the labels - can we still determine which category a 911 call would most likely fall into? Please describe this algorithm and your reason for choosing it.)
First, let us check if we can determine the number of clusters just from Longitude and Latitude.
# Keep the co-ordinates numeric -- kmeans() cannot handle factor columns
# (the as.character() round-trip is a no-op if they are already numeric)
ds_fictitious <- data.frame(datasheet)
ds_fictitious$Latitude  <- as.numeric(as.character(ds_fictitious$Latitude))
ds_fictitious$Longitude <- as.numeric(as.character(ds_fictitious$Longitude))
# deletion of missing
ds_fictitious <- na.omit(ds_fictitious)
# Removing labels from data set
ds_long_lat <- ds_fictitious[ -c(1, 4) ]
# Compute and plot wss for k = 2 to k = 15
set.seed(123)
k.max <- 15
data <- ds_long_lat
wss <- sapply(1:k.max,
              function(k){ kmeans(data, k, nstart = 50, iter.max = 15)$tot.withinss })
wss
## [1] 16.2450348 5.5648686 3.4762822 2.4637603 1.9734418 1.5373042
## [7] 1.2484664 1.0402637 0.9323950 0.8437858 0.7613676 0.6925716
## [13] 0.6306789 0.5834480 0.5433205
plot(1:k.max, wss,
type="b", pch = 19, frame = FALSE,
xlab="Number of clusters K",
ylab="Total within-clusters SSE")
From the elbow graph, we see that after k = 4 the drop in within-cluster SSE is much smaller than before. Hence k = 4 is an optimal number of clusters (we already know that to be true). This helps reconfirm our hypothesis that the calls have an intrinsic location attribute which can be used to predict the type of incident just from the available co-ordinates.
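As a quick sanity check (a sketch, not part of the original write-up), we can fit k = 4 directly and cross-tabulate the resulting clusters against the true labels:

# Fit k-means at the elbow-suggested k = 4 and compare clusters with labels
set.seed(123)
km4 <- kmeans(ds_long_lat, centers = 4, nstart = 50)
table(cluster = km4$cluster, type = ds_fictitious$Type)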
Let us create a model to determine if we can predict the category of a 911 call using only ‘Latitude’ and ‘Longitude’.
The first hunch I have is to use the K Nearest Neighbour (KNN) classifier. When we drew the plots above, we already got an idea that the different call types follow broad general patterns which are in fact very prominent by location. This suggested that a distance-based feature would be significant. Also, as KNN is not known to over-generalise the data and is very effective on small data sets, it should be a good model for consideration.
require("class")
set.seed(99)
#Sampling the dataset provided (70% train, 30% test)
ds_sample <- sample(2, nrow(ds_fictitious), replace=TRUE, prob=c(0.7, 0.3))
ds_train <- ds_fictitious[ds_sample==1,] #Select the 70% of rows
ds_test <- ds_fictitious[ds_sample==2,] #Select the 30% of rows
# Removing the labels
ds_train1 <- ds_train[-c(1, 4)]
ds_test1 <- ds_test[-c(1, 4)]
# Storing the label for test and train data
train_labels <- ds_train$Type
# dim(train_labels)
# class(train_labels)
test_labels <- ds_test$Type
# dim(test_labels)
# class(test_labels)
# Training the model
test_pred1 <- knn(train = ds_train1, test = ds_test1, cl = train_labels, k = 4, prob = TRUE)
The model is built using 70% of the data, while the remaining 30% is reserved for validation. We use the knn function from R's class package, which uses Euclidean distance by default. The only attributes made available to the model are the Longitude and Latitude co-ordinates.
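As an aside (a sketch, not part of the original analysis), the choice of k = 4 could be sanity-checked by scanning a range of k values against the held-out set:

# Held-out accuracy for k = 1..15; which.max() picks the best-performing k
acc_by_k <- sapply(1:15, function(k) {
  mean(knn(train = ds_train1, test = ds_test1, cl = train_labels, k = k)
       == test_labels)
})
which.max(acc_by_k)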
Now, we evaluate the model:
# Evaluate model performance
require("gmodels")   # for CrossTable()
CrossTable(x = test_labels, y = test_pred1, prop.chisq = FALSE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 449
##
##
## | test_pred1
## test_labels | Beaver Accident | Latte Spills | Marshawn Lynch Sighting | Seal Attack | Row Total |
## ------------------------|-------------------------|-------------------------|-------------------------|-------------------------|-------------------------|
## Beaver Accident | 158 | 0 | 0 | 1 | 159 |
## | 0.994 | 0.000 | 0.000 | 0.006 | 0.354 |
## | 0.994 | 0.000 | 0.000 | 0.013 | |
## | 0.352 | 0.000 | 0.000 | 0.002 | |
## ------------------------|-------------------------|-------------------------|-------------------------|-------------------------|-------------------------|
## Latte Spills | 0 | 125 | 3 | 1 | 129 |
## | 0.000 | 0.969 | 0.023 | 0.008 | 0.287 |
## | 0.000 | 0.954 | 0.036 | 0.013 | |
## | 0.000 | 0.278 | 0.007 | 0.002 | |
## ------------------------|-------------------------|-------------------------|-------------------------|-------------------------|-------------------------|
## Marshawn Lynch Sighting | 0 | 6 | 80 | 1 | 87 |
## | 0.000 | 0.069 | 0.920 | 0.011 | 0.194 |
## | 0.000 | 0.046 | 0.964 | 0.013 | |
## | 0.000 | 0.013 | 0.178 | 0.002 | |
## ------------------------|-------------------------|-------------------------|-------------------------|-------------------------|-------------------------|
## Seal Attack | 1 | 0 | 0 | 73 | 74 |
## | 0.014 | 0.000 | 0.000 | 0.986 | 0.165 |
## | 0.006 | 0.000 | 0.000 | 0.961 | |
## | 0.002 | 0.000 | 0.000 | 0.163 | |
## ------------------------|-------------------------|-------------------------|-------------------------|-------------------------|-------------------------|
## Column Total | 159 | 131 | 83 | 76 | 449 |
## | 0.354 | 0.292 | 0.185 | 0.169 | |
## ------------------------|-------------------------|-------------------------|-------------------------|-------------------------|-------------------------|
##
##
From the above confusion matrix, we can draw the following inferences:

1. Beaver Accidents - Out of 159 Beaver Accidents, only 1 data point was incorrectly classified, as a Seal Attack.
2. Latte Spills - Of the 129 data points, 125 were labelled correctly, with 3 incorrectly labelled as Marshawn Lynch Sightings and 1 as a Seal Attack.
3. Marshawn Lynch Sightings - 80 of 87 labels were right here, with 6 categorized as Latte Spills and 1 as a Seal Attack.
4. Seal Attacks - Again the model performs exceptionally well, correctly identifying 73 of 74 data points, with just one misclassified as a Beaver Accident.
Accuracy on test set - 97.10%
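For reference, the figure above can be reproduced in one line from the predictions (not shown in the original output):

# Proportion of test points whose predicted label matches the true label
mean(test_pred1 == test_labels)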
What these numbers tell us, in conclusion, is that we can indeed make an intelligent decision as to why a resident dialed 911 using just the location co-ordinates of the call!
Should we be concerned that 'Latitude' and 'Longitude' are not necessarily Euclidean?
The first model does use Euclidean distance, and it is true that ‘Latitude’ and ‘Longitude’ are not necessarily Euclidean. Since latitude and longitude are not co-ordinates in a Cartesian system, some predictive power is lost when distances are reconstructed directly from them. As the Earth is (approximately) a sphere, the direct mapping does raise concerns. In an ideal situation, we would transform the co-ordinates into a projected (planar) system, or compute great-circle distances, before measuring anything.
However, to get close to 95% of the answer, we can choose to proceed with using Lat and Long as distance co-ordinates. Additionally, we test alternative models, particularly Decision Trees/Random Forests, along with traditional classifiers such as Naive Bayes. The models can then be compared to address our question.
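For illustration (a sketch, not part of the original analysis), the great-circle distance that a distance-based model would ideally use can be computed with the haversine formula; the function name below is introduced here:

# Haversine distance in km between two lat/long points on a sphere
haversine_km <- function(lat1, lon1, lat2, lon2, r = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * r * asin(sqrt(a))
}
haversine_km(47.60, -122.33, 47.61, -122.20)   # roughly 9.8 km across Seattle

At city scale these distances are very nearly proportional to the Euclidean ones computed on raw degrees, which is why treating the co-ordinates as planar works as well as it does here.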
Next, a Naive Bayes model is chosen, as we can build it fast and comparatively easily. Even though the model assumes independence between attributes, it is one of the most popular classifiers available: it uses both attributes simultaneously, and new training data can be added without rebuilding the model. It also lets us get around the distance problem, since it does not rely on a distance metric at all.
# For Naive Bayes the co-ordinates are treated as categorical, so every
# column (including Latitude and Longitude) is converted to a factor
ds_fictitious <- data.frame(lapply(datasheet, as.factor))
set.seed(364)
#Sampling the dataset provided (65% train, 35% test)
require("caret")   # for createDataPartition()
ds_sample <- createDataPartition(ds_fictitious$Type, p = 0.65, list = FALSE)
training_set <- ds_fictitious[ds_sample,]
test_set <- ds_fictitious[-ds_sample,]
# Removing the unused fourth column (keeping Type, Latitude and Longitude)
training_set <- training_set[-c(4)]
test_set <- test_set[-c(4)]
# Training the model
require("e1071")   # for naiveBayes()
classifier <- naiveBayes(training_set[, 2:3], training_set[, 1])
The model is built using 65% of the data. We use the naiveBayes function (from the e1071 package) to build the model.
Now, we evaluate the model:
# Evaluate model performance
table(predict(classifier, test_set[,2:3]), test_set[,1])
##
##                           Beaver Accident Latte Spills Marshawn Lynch Sighting Seal Attack
##   Beaver Accident                     161           56                      70          62
##   Latte Spills                          1           81                       7           2
##   Marshawn Lynch Sighting              10            8                      33           2
##   Seal Attack                           5            0                       3          27
summary(test_set)
## Type Latitude Longitude
## Beaver Accident :177 47.6 : 10 -122.3234: 4
## Latte Spills :145 47.59 : 9 -122.341 : 3
## Marshawn Lynch Sighting:113 47.58 : 4 -122.3359: 3
## Seal Attack : 93 47.61 : 4 -122.3214: 3
## 47.5199: 3 -122.3142: 3
## 47.5449: 3 -122.2684: 3
## (Other):495 (Other) :509
Accuracy on test set - 57.20% (302 of the 528 test points fall on the diagonal of the table above). We observe a relatively low accuracy when the prediction table is set against the actual test labels. The model is inclined to classify more calls as Beaver Accidents, as that class has the highest number of data points in the training set. Naive Bayes is known to handle discrete variables well, but in our case the two attributes carry their predictive power jointly rather than independently, which perhaps explains the low accuracy.
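The figure above can be verified directly from the confusion table (a small sketch; nb_tab is a name introduced here):

# Overall accuracy: the diagonal of the confusion table over the grand total
nb_tab <- table(predict(classifier, test_set[, 2:3]), test_set[, 1])
sum(diag(nb_tab)) / sum(nb_tab)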
The Random Forest model was built in Weka using just the co-ordinate variables, as with the other models. The model specifics are detailed below.
=== Weka Run information ===
Scheme:       weka.classifiers.trees.RandomForest -P 100 -I 100 -num-slots 1 -K 0 -M 1.0 -V 0.001 -S 1
Relation:     rev data for test
Instances:    1514
Attributes:   3
              Type
              Latitude
              Longitude
Test mode:    split 70.0% train, remainder test
=== Classifier model (full training set) ===
RandomForest
Bagging with 100 iterations and base learner
weka.classifiers.trees.RandomTree -K 0 -M 1.0 -V 0.001 -S 1 -do-not-check-capabilities
Time taken to build model: 0.16 seconds
=== Evaluation on test split ===
Time taken to test model on test split: 0.02 seconds
=== Summary ===
Correctly Classified Instances         437               96.2555 %
Incorrectly Classified Instances        17                3.7445 %
Kappa statistic                          0.9489
Mean absolute error                      0.0244
Root mean squared error                  0.1269
Relative absolute error                  6.6232 %
Root relative squared error             29.6135 %
Total Number of Instances              454
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure MCC ROC Area PRC Area Class
0.987 0.007 0.987 0.987 0.987 0.981 0.995 0.989 Beaver Accident
0.974 0.008 0.961 0.974 0.967 0.961 0.991 0.973 Seal Attack
1.000 0.036 0.907 1.000 0.951 0.935 0.992 0.970 Latte Spills
0.875 0.000 1.000 0.875 0.933 0.919 0.983 0.969 Marshawn Lynch Sighting
Weighted Avg. 0.963 0.013 0.965 0.963 0.962 0.951 0.991 0.977
=== Confusion Matrix ===
   a   b   c   d   <-- classified as
 155   2   0   0 |   a = Beaver Accident
   2  74   0   0 |   b = Seal Attack
   0   0 117   0 |   c = Latte Spills
   0   1  12  91 |   d = Marshawn Lynch Sighting
### Accuracy on test set - 96.25%
Inference: Beaver Accident - only two labels were misclassified, as Seal Attacks. Seal Attack - again, only two labels were misclassified, as Beaver Accidents. Latte Spills - the model predicts this class with 100% recall. Marshawn Lynch Sighting - this is where we observe some misclassification (1 as a Seal Attack and 12 as Latte Spills); the Marshawn sightings are spread across the map, which perhaps explains why the model cannot differentiate here as cleanly as before.
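For completeness, a roughly equivalent model can be fit without leaving R (a sketch using the randomForest package rather than Weka; the 70/30 split mirrors the Weka run, but the exact partition, and hence the accuracy, will differ slightly):

require("randomForest")
set.seed(1)
ds_rf <- data.frame(datasheet)                 # numeric co-ordinates, as for KNN
ds_rf$Type <- as.factor(ds_rf$Type)
idx <- sample(nrow(ds_rf), 0.7 * nrow(ds_rf))
rf_model <- randomForest(Type ~ Latitude + Longitude,
                         data = ds_rf[idx, ], ntree = 100)
rf_pred <- predict(rf_model, ds_rf[-idx, ])
mean(rf_pred == ds_rf[-idx, "Type"])           # held-out accuracy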
### Conclusion
We observe that both KNN and Random Forest perform exceedingly well in predicting the labels with just ‘Latitude’ and ‘Longitude’. As the co-ordinates cover data points only within a single city, the error introduced by Lat and Long not being Euclidean distances is small enough that roughly 95%+ accuracy is attainable regardless. However, if we wish to sidestep this factor completely, we can proceed with the Random Forest algorithm, which does not depend on a distance metric and delivers almost the same results.