For years, people have associated the phrase “single” with people ready to mingle, but the last few years has seen people championing the literal meaning. According to Richard Fry and Kim Parker of the Pew Research Center, “roughly 4 in 10 adults ages 25-34 were unpartnered”, a sharp increase in the last 20 years1. The new generations have been at the forefront of this lifestyle, which promotes freedom, independence and non-conformity2
Stepping into Adult Life Single is the New Norm
As the choice to live single is chosen by more and more adults, understanding the kind of places they would choose to work and/or live in becomes just as important. Cities around the world must recognize the factors that may or may not make them popular among singles, as it would give planners and governments an opportunity to boost the development and reputation of their city.
This project attempts to provide insights into the case presented above by asking the following Question:
Can we predict whether a city would appeal to a single adult?
To answer this question, I have sourced a preprocessed dataset from data.world which lists the features of the top 125 cities to live in according to a report by the US News3. This dataset contains several features related to each city, including but not limited to weather, work-life, standard of living and - our target - the proportion of single people in the city.
To answer this question, we will be going through a numerous set of steps outlined below
library(tidyverse)
library(caret)
library(RColorBrewer)
library(ROCR)
library(MLmetrics)
library(mltools)
library(data.table)
library(mice)
library(class)
library(gridExtra)
library(DT)
cities = read.csv("Best US Cities Modified.csv", fileEncoding = "UTF-16LE",
sep = "\t")[1:125,]
names(cities)
## [1] "Avg.Annual.Rainfall" "Avg.Commute.Time...continuous"
## [3] "Avg.High.Temps" "Avg.Low.Temps"
## [5] "City" "Link"
## [7] "Median.Home.Price" "Median.Monthly.Rent"
## [9] "Property.Crime" "State"
## [11] "Violent.crime" "Average.Annual.Salary"
## [13] "Median.Age" "Metro.Population"
## [15] "Percent.Single" "Temperature.Range"
## [17] "Total.Students" "Total.Teachers"
## [19] "Unemployment.Rate" "US.News.Rank"
## [21] "PathOrder" "City.and.State"
## [23] "Coastal" "Coastal.String"
A lot of columns such as PathOrder and
City.and.State are redundant or unnecessary and will be
removed, along with other changes detailed below
cities_summary1 = data.frame(summary(cities)) %>%
mutate(Var1 = rep(1:(n()/ncol(cities)),ncol(cities))) %>%
pivot_wider(id_cols = Var1, names_from = Var2, values_from = Freq) %>%
select(-Var1) %>% replace(is.na(.),"")
datatable(cities_summary1, rownames = F, extensions = "Scroller",
options = list(dom = 't',ordering = F,scrollX = T))
There are quite a few quantitative features and only 1 character
feature that can be a factor, Coastal.String. All these
variables will be cleaned and prepared for Exploratory Data Analysis
# Function to normalize numeric columns
normalize <- function(x){
(x - min(x)) / (max(x) - min(x))
}
# Replace Coastal logical variable with factor of Coastal.Spring
cities_filter = cities %>% mutate(city_type = factor(Coastal.String)) %>%
# Rename Avg.Commute.Time...continuous
rename(Avg.Commute.Time = Avg.Commute.Time...continuous) %>%
# Drop unnecessary columns including Coastal.Spring
select(-c("City","Link","State","City.and.State", "PathOrder", "Coastal",
"Coastal.String")) %>%
# Create a new column representing if city is in top 50 of US News 2019 Ranking
mutate(top_50 = if_else(US.News.Rank <= 50, 1, 0)) %>%
# Normalize only Percent.Single for now
mutate(Percent.Single = normalize(Percent.Single)) %>%
# remove US.News.Rank
select(-US.News.Rank)
fivenum(cities_filter$Percent.Single) # Make top quartile of values 1s and rest 0s
## [1] 0.0000000 0.2777778 0.4055556 0.5000000 1.0000000
# One-Hot encode factor variables (only city_type)
cities_clean = one_hot(as.data.table(cities_filter), cols = "auto",
sparsifyNAs = TRUE, naCols = FALSE,
dropCols = TRUE,dropUnusedLevels = TRUE) %>%
# Create target variable
mutate(singles_like = cut(Percent.Single, c(-1,0.5,1), labels = c(0,1))) %>%
# remove Percent.Single
select(-Percent.Single)
md.pattern(cities_clean, rotate.names = T)
## Avg.Commute.Time Median.Monthly.Rent Property.Crime Violent.crime
## 94 1 1 1 1
## 19 1 1 1 1
## 2 1 1 1 1
## 6 1 1 1 1
## 3 1 1 1 1
## 1 1 1 1 1
## 0 0 0 0
## Average.Annual.Salary Median.Age Metro.Population Unemployment.Rate
## 94 1 1 1 1
## 19 1 1 1 1
## 2 1 1 1 1
## 6 1 1 1 1
## 3 1 1 1 1
## 1 1 1 1 1
## 0 0 0 0
## city_type_Coastal city_type_Non Coastal top_50 singles_like Total.Students
## 94 1 1 1 1 1
## 19 1 1 1 1 1
## 2 1 1 1 1 1
## 6 1 1 1 1 1
## 3 1 1 1 1 0
## 1 1 1 1 1 0
## 0 0 0 0 4
## Total.Teachers Median.Home.Price Avg.Annual.Rainfall Avg.High.Temps
## 94 1 1 1 1
## 19 1 1 0 0
## 2 1 0 1 1
## 6 1 0 0 0
## 3 0 1 1 1
## 1 0 0 1 1
## 4 9 25 25
## Avg.Low.Temps Temperature.Range
## 94 1 1 0
## 19 0 0 4
## 2 1 1 1
## 6 0 0 5
## 3 1 1 2
## 1 1 1 3
## 25 25 117
# As we can see, there are missing values in some of the quantitative columns
# We lose the most amount of rows if we keep the weather data
# Let's see the relationship between them and the target variable
hist1 = ggplot(cities_clean, aes(x = Avg.Annual.Rainfall, y = ..density..,
fill = singles_like)) +
geom_histogram(alpha = 0.5, color = "white", bins = 15) # No major relationship
hist2 = ggplot(cities_clean, aes(x = Avg.High.Temps, y = ..density..,
fill = singles_like)) +
geom_histogram(alpha = 0.5, color = "white", bins = 15) # No major relationship
hist3 = ggplot(cities_clean, aes(x = Avg.Low.Temps, y = ..density..,
fill = singles_like)) +
geom_histogram(alpha = 0.5, color = "white", bins = 15) # No major relationship
hist4 = ggplot(cities_clean, aes(x = Temperature.Range, y = ..density..,
fill = singles_like)) +
geom_histogram(alpha = 0.5, color = "white", bins = 15) # No major relationship
grid.arrange(hist1, hist2, hist3, hist4, ncol = 2)
## Warning: The dot-dot notation (`..density..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(density)` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: Removed 25 rows containing non-finite values (`stat_bin()`).
## Removed 25 rows containing non-finite values (`stat_bin()`).
## Removed 25 rows containing non-finite values (`stat_bin()`).
## Removed 25 rows containing non-finite values (`stat_bin()`).
# We don't lose too much information by deleting them, so we can remove those columns
# What about Total.Students nad Total.Teachers?
hist5 = ggplot(cities_clean, aes(x = Total.Students, y = ..density..,
fill = singles_like)) +
geom_histogram(alpha = 0.5, color = "white", bins = 5) # Same thing nothing major
hist6 = ggplot(cities_clean, aes(x = Total.Teachers, y = ..density..,
fill = singles_like)) +
geom_histogram(alpha = 0.5, color = "white", bins = 5) # Same thing nothing major
grid.arrange(hist5, hist6, ncol = 2)
## Warning: Removed 4 rows containing non-finite values (`stat_bin()`).
## Warning: Removed 4 rows containing non-finite values (`stat_bin()`).
# We can remove these 2 columns as well. Intuitively the other columns seem to have a substantial connection, so we will not remove them and we'll delete any reamining rows that have NA entries
cities_final = cities_clean %>% select(-c("Avg.Annual.Rainfall","Avg.High.Temps",
"Avg.Low.Temps", "Temperature.Range",
"Total.Teachers","Total.Students")) %>%
na.omit() %>% mutate_if(is.numeric, normalize)
# We get the mice now
md.pattern(cities_final, rotate.names = T)
## /\ /\
## { `---' }
## { O O }
## ==> V <== No need for mice. This data set is completely observed.
## \ \|/ /
## `-----'
## Avg.Commute.Time Median.Home.Price Median.Monthly.Rent Property.Crime
## 116 1 1 1 1
## 0 0 0 0
## Violent.crime Average.Annual.Salary Median.Age Metro.Population
## 116 1 1 1 1
## 0 0 0 0
## Unemployment.Rate city_type_Coastal city_type_Non Coastal top_50
## 116 1 1 1 1
## 0 0 0 0
## singles_like
## 116 1 0
## 0 0
cities_summary2 = data.frame(summary(cities_final)) %>%
mutate(Var1 = rep(1:(n()/ncol(cities_final)),ncol(cities_final))) %>%
pivot_wider(id_cols = Var1, names_from = Var2, values_from = Freq) %>%
select(-Var1) %>% replace(is.na(.),"")
datatable(cities_summary2, rownames = F, extensions = "Scroller",
options = list(dom = 't',ordering = F,scrollX = T))
# Specific Columns that I'm interested in
eda_cols = c("Avg.Commute.Time", "Median.Home.Price",
"Median.Monthly.Rent","Violent.Crime",
"Avg.Annual.Salary","Unemployment.Rate")
eda_plot1 = ggplot(cities_final, aes(
x=Avg.Commute.Time, y=..density.., fill = singles_like)) +
geom_histogram(alpha = 0.5, color = "white", bins = 8)
eda_plot2 = ggplot(cities_final, aes(
x=Median.Home.Price, y=..density.., fill = singles_like)) +
geom_histogram(alpha = 0.5, color = "white", bins = 8)
eda_plot3 = ggplot(cities_final, aes(
x=Median.Monthly.Rent, y=..density.., fill = singles_like)) +
geom_histogram(alpha = 0.5, color = "white", bins = 8)
eda_plot4 = ggplot(cities_final, aes(
x=Violent.crime, y=..density.., fill = singles_like)) +
geom_histogram(alpha = 0.5, color = "white", bins = 8)
eda_plot5 = ggplot(cities_final, aes(
x=Average.Annual.Salary, y=..density.., fill = singles_like)) +
geom_histogram(alpha = 0.5, color = "white", bins = 8)
eda_plot6 = ggplot(cities_final, aes(
x=Unemployment.Rate, y=..density.., fill = singles_like)) +
geom_histogram(alpha = 0.5, color = "white", bins = 8)
grid.arrange(eda_plot1, eda_plot2, eda_plot3, eda_plot4, eda_plot5, eda_plot6, ncol = 3)
From the naked eye it looks like the distributions of the quantitative variables irrespecive of class are similar. However, there are enough features and variations within them that the classifier should be able to harness.
# table(cities_final$singles_like)[2]/sum(table(cities_final$singles_like))
set.seed(3)
cities_train_idx = createDataPartition(cities_final$singles_like, p = 0.70, groups=1,
list=FALSE)
cities_train = cities_final[cities_train_idx,]
cities_tune_test = cities_final[-cities_train_idx,]
cities_tune_idx = createDataPartition(cities_tune_test$singles_like, p = 0.5, list = F)
cities_tune = cities_tune_test[cities_tune_idx,]
cities_test = cities_tune_test[-cities_tune_idx,]
# table(cities_train$singles_like)[2]/sum(table(cities_train$singles_like))
# table(cities_tune$singles_like)[2]/sum(table(cities_tune$singles_like))
# table(cities_test$singles_like)[2]/sum(table(cities_test$singles_like))
Prevalences * dataset - 26.72 % * train - 26.83% * tune - 27.78 % * test - 25 %
chooseK = function(k, train_set, val_set, train_class, val_class){
set.seed(3)
class_knn = knn(train = train_set, test = val_set, cl = train_class, k = k,
use.all = TRUE)
pred_mat = table(class_knn, val_class)
accu = sum(pred_mat[row(pred_mat) == col(pred_mat)]) / sum(pred_mat)
print(confusionMatrix(
as.factor(class_knn), as.factor(cities_tune$singles_like),
positive = "1", dnn=c("Prediction", "Actual"), mode ="sens_spec"))
return(cbind(k = k, accuracy = accu))
}
cities_NN_k = sapply(seq(1, 15, 2),chooseK, train_set = cities_train,
val_set = cities_tune, train_class = cities_train$singles_like,
val_class = cities_tune$singles_like)
## Confusion Matrix and Statistics
##
## Actual
## Prediction 0 1
## 0 13 1
## 1 0 4
##
## Accuracy : 0.9444
## 95% CI : (0.7271, 0.9986)
## No Information Rate : 0.7222
## P-Value [Acc > NIR] : 0.02264
##
## Kappa : 0.8525
##
## Mcnemar's Test P-Value : 1.00000
##
## Sensitivity : 0.8000
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 0.9286
## Prevalence : 0.2778
## Detection Rate : 0.2222
## Detection Prevalence : 0.2222
## Balanced Accuracy : 0.9000
##
## 'Positive' Class : 1
##
## Confusion Matrix and Statistics
##
## Actual
## Prediction 0 1
## 0 13 1
## 1 0 4
##
## Accuracy : 0.9444
## 95% CI : (0.7271, 0.9986)
## No Information Rate : 0.7222
## P-Value [Acc > NIR] : 0.02264
##
## Kappa : 0.8525
##
## Mcnemar's Test P-Value : 1.00000
##
## Sensitivity : 0.8000
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 0.9286
## Prevalence : 0.2778
## Detection Rate : 0.2222
## Detection Prevalence : 0.2222
## Balanced Accuracy : 0.9000
##
## 'Positive' Class : 1
##
## Confusion Matrix and Statistics
##
## Actual
## Prediction 0 1
## 0 13 1
## 1 0 4
##
## Accuracy : 0.9444
## 95% CI : (0.7271, 0.9986)
## No Information Rate : 0.7222
## P-Value [Acc > NIR] : 0.02264
##
## Kappa : 0.8525
##
## Mcnemar's Test P-Value : 1.00000
##
## Sensitivity : 0.8000
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 0.9286
## Prevalence : 0.2778
## Detection Rate : 0.2222
## Detection Prevalence : 0.2222
## Balanced Accuracy : 0.9000
##
## 'Positive' Class : 1
##
## Confusion Matrix and Statistics
##
## Actual
## Prediction 0 1
## 0 13 1
## 1 0 4
##
## Accuracy : 0.9444
## 95% CI : (0.7271, 0.9986)
## No Information Rate : 0.7222
## P-Value [Acc > NIR] : 0.02264
##
## Kappa : 0.8525
##
## Mcnemar's Test P-Value : 1.00000
##
## Sensitivity : 0.8000
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 0.9286
## Prevalence : 0.2778
## Detection Rate : 0.2222
## Detection Prevalence : 0.2222
## Balanced Accuracy : 0.9000
##
## 'Positive' Class : 1
##
## Confusion Matrix and Statistics
##
## Actual
## Prediction 0 1
## 0 13 1
## 1 0 4
##
## Accuracy : 0.9444
## 95% CI : (0.7271, 0.9986)
## No Information Rate : 0.7222
## P-Value [Acc > NIR] : 0.02264
##
## Kappa : 0.8525
##
## Mcnemar's Test P-Value : 1.00000
##
## Sensitivity : 0.8000
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 0.9286
## Prevalence : 0.2778
## Detection Rate : 0.2222
## Detection Prevalence : 0.2222
## Balanced Accuracy : 0.9000
##
## 'Positive' Class : 1
##
## Confusion Matrix and Statistics
##
## Actual
## Prediction 0 1
## 0 13 1
## 1 0 4
##
## Accuracy : 0.9444
## 95% CI : (0.7271, 0.9986)
## No Information Rate : 0.7222
## P-Value [Acc > NIR] : 0.02264
##
## Kappa : 0.8525
##
## Mcnemar's Test P-Value : 1.00000
##
## Sensitivity : 0.8000
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 0.9286
## Prevalence : 0.2778
## Detection Rate : 0.2222
## Detection Prevalence : 0.2222
## Balanced Accuracy : 0.9000
##
## 'Positive' Class : 1
##
## Confusion Matrix and Statistics
##
## Actual
## Prediction 0 1
## 0 13 1
## 1 0 4
##
## Accuracy : 0.9444
## 95% CI : (0.7271, 0.9986)
## No Information Rate : 0.7222
## P-Value [Acc > NIR] : 0.02264
##
## Kappa : 0.8525
##
## Mcnemar's Test P-Value : 1.00000
##
## Sensitivity : 0.8000
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 0.9286
## Prevalence : 0.2778
## Detection Rate : 0.2222
## Detection Prevalence : 0.2222
## Balanced Accuracy : 0.9000
##
## 'Positive' Class : 1
##
## Confusion Matrix and Statistics
##
## Actual
## Prediction 0 1
## 0 13 1
## 1 0 4
##
## Accuracy : 0.9444
## 95% CI : (0.7271, 0.9986)
## No Information Rate : 0.7222
## P-Value [Acc > NIR] : 0.02264
##
## Kappa : 0.8525
##
## Mcnemar's Test P-Value : 1.00000
##
## Sensitivity : 0.8000
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 0.9286
## Prevalence : 0.2778
## Detection Rate : 0.2222
## Detection Prevalence : 0.2222
## Balanced Accuracy : 0.9000
##
## 'Positive' Class : 1
##
cities_NN_k = tibble(k = cities_NN_k[1,],
accuracy = cities_NN_k[2,])
ggplot(cities_NN_k, aes(x = k, y = accuracy)) +
geom_line(color = "orange", size = 1.5) + geom_point(size = 3)
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
Surprisingly, the accuracy of the knn model stays the same regardless of
the value of k, suggesting that there is a clear separation between the
two classes and the classifer is performing very well as a result.
However, further evaluation can confirm whether that is the case or if
there could be something else that’s happening. We will proceed forward
with a k-value of 3
set.seed(3)
cities_3NN = knn(train = cities_train, test = cities_tune,
cl = cities_train$singles_like, k = 3, use.all = TRUE, prob = TRUE)
cities_prob = tibble(attr(cities_3NN, "prob"))
cities_final_model = tibble(k_prob = cities_prob$`attr(cities_3NN, "prob")`,
pred=cities_3NN, target=cities_tune$singles_like)
cities_final_model$pos_prec <- ifelse(cities_final_model$pred == 0,
1-cities_final_model$k_prob,
cities_final_model$k_prob)
densityplot(cities_final_model$pos_prec)
confusionMatrix(cities_final_model$pred, cities_final_model$target, positive = "1",
dnn=c("Prediction", "Actual"), mode = "sens_spec")
## Confusion Matrix and Statistics
##
## Actual
## Prediction 0 1
## 0 13 1
## 1 0 4
##
## Accuracy : 0.9444
## 95% CI : (0.7271, 0.9986)
## No Information Rate : 0.7222
## P-Value [Acc > NIR] : 0.02264
##
## Kappa : 0.8525
##
## Mcnemar's Test P-Value : 1.00000
##
## Sensitivity : 0.8000
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 0.9286
## Prevalence : 0.2778
## Detection Rate : 0.2222
## Detection Prevalence : 0.2222
## Balanced Accuracy : 0.9000
##
## 'Positive' Class : 1
##
#ROC and AUC curves
cities_pred <- prediction(cities_final_model$pos_prec,cities_final_model$target)
knn_perf <- performance(cities_pred,"tpr","fpr")
plot(knn_perf, colorize=TRUE)
abline(a=0, b= 1)
knn_perf_AUC <- performance(cities_pred,"auc")
print(knn_perf_AUC@y.values)
## [[1]]
## [1] 0.8923077
#LogLoss
LogLoss(as.numeric(cities_final_model$pos_prec), as.numeric(cities_final_model$target))
## [1] 19.24907
#F1Score
F1_Score(y_pred = cities_final_model$pred, y_true = cities_final_model$target, positive = "1")
## [1] 0.8888889
Further evaluation validates previous findings: the ROC and the AUC show curve that is not entirely perpendicular, with the AUC producing a value of 0.8923. The F1 score of 0.8889 implies that the precision and recall are both very strong for this model. The sensitivity value of 0.8 is relatively slow, but draws emphasis to the lack of enough data: Having more cities to work with would’ve improved the sensitivity of the model as well.
As mentioned at the start of this project, the goal was to see if we could predict if a city would be one that single people would like to live in. The model that we have built in the process can predict this with an accuracy of 94.44 %, with supplemental metrics such as the AUC and F1 Score that are just as strong. However, the lack of enough data was very apparent with the model’s sensitivity not being as high as its other metrics. Regardless, the model benefitted from robust preprocessing and the implemented data cleaning techniques.
Future work would involve both collecting data from more cities but also collecting more data in general. Additionally, different kinds of models can be applied to the dataset to draw more insightful conclusions, such as which factors have more of a say in determining whether a single adult would like living in a city.