Living Solo

For years, people have associated the phrase “single” with people ready to mingle, but the last few years has seen people championing the literal meaning. According to Richard Fry and Kim Parker of the Pew Research Center, “roughly 4 in 10 adults ages 25-34 were unpartnered”, a sharp increase in the last 20 years1. The new generations have been at the forefront of this lifestyle, which promotes freedom, independence and non-conformity2

Stepping into Adult Life Single is the New Norm

As the choice to live single is chosen by more and more adults, understanding the kind of places they would choose to work and/or live in becomes just as important. Cities around the world must recognize the factors that may or may not make them popular among singles, as it would give planners and governments an opportunity to boost the development and reputation of their city.

Question

This project attempts to provide insights into the case presented above by asking the following Question:

Can we predict whether a city would appeal to a single adult?

About the Data

To answer this question, I have sourced a preprocessed dataset from data.world which lists the features of the top 125 cities to live in according to a report by the US News3. This dataset contains several features related to each city, including but not limited to weather, work-life, standard of living and - our target - the proportion of single people in the city.

link to data

Methods

To answer this question, we will be going through a numerous set of steps outlined below

  • Load the Data - from a csv file
  • Prepare the Data for EDA
    • Clean the Data - Remove unnecessary features and create ones if needed
    • One-Hot Encode Factor Variables
    • Create the Target Variable
  • Before conducting EDA, resolve missing data
  • Conduct EDA
    • View relationships between quantitative variables and target variable
    • Create final summary statistics for the dataset
  • Build Model - We will be using a K-Nearest-Neighbor Algorithm for this question
    • Create train, tune and test data from the dataset and note prevalences
    • Determine what the value of k should be for the model
    • Create table of predicted and expected labels, as well as positive prediction prob
  • Evaluate Model - using a series of evaluation metrics
    • Confusion Matrix - note sensitivity
    • ROC/AUC - plot relationship between false and true positive rate
    • density - create density plot of model’s positive prediction probability
    • F1_score - determine overall precision and recall

Load Dependencies

library(tidyverse)
library(caret)
library(RColorBrewer)
library(ROCR)
library(MLmetrics)
library(mltools)
library(data.table)
library(mice)
library(class)
library(gridExtra)
library(DT)

Establish Data and Check Columns

cities = read.csv("Best US Cities Modified.csv", fileEncoding = "UTF-16LE",
                      sep = "\t")[1:125,]
names(cities)
##  [1] "Avg.Annual.Rainfall"           "Avg.Commute.Time...continuous"
##  [3] "Avg.High.Temps"                "Avg.Low.Temps"                
##  [5] "City"                          "Link"                         
##  [7] "Median.Home.Price"             "Median.Monthly.Rent"          
##  [9] "Property.Crime"                "State"                        
## [11] "Violent.crime"                 "Average.Annual.Salary"        
## [13] "Median.Age"                    "Metro.Population"             
## [15] "Percent.Single"                "Temperature.Range"            
## [17] "Total.Students"                "Total.Teachers"               
## [19] "Unemployment.Rate"             "US.News.Rank"                 
## [21] "PathOrder"                     "City.and.State"               
## [23] "Coastal"                       "Coastal.String"

A lot of columns such as PathOrder and City.and.State are redundant or unnecessary and will be removed, along with other changes detailed below

Initial Summary Statistics

cities_summary1 = data.frame(summary(cities)) %>% 
  mutate(Var1 = rep(1:(n()/ncol(cities)),ncol(cities))) %>%
  pivot_wider(id_cols = Var1, names_from = Var2, values_from = Freq) %>%
  select(-Var1) %>% replace(is.na(.),"")
datatable(cities_summary1, rownames = F, extensions = "Scroller",
          options = list(dom = 't',ordering = F,scrollX = T))

There are quite a few quantitative features and only 1 character feature that can be a factor, Coastal.String. All these variables will be cleaned and prepared for Exploratory Data Analysis

Clean Variables

# Function to normalize numeric columns
normalize <- function(x){
 (x - min(x)) / (max(x) - min(x))
}

# Replace Coastal logical variable with factor of Coastal.Spring
cities_filter = cities %>% mutate(city_type = factor(Coastal.String)) %>%
  # Rename Avg.Commute.Time...continuous
  rename(Avg.Commute.Time = Avg.Commute.Time...continuous) %>%
  # Drop unnecessary columns including Coastal.Spring
  select(-c("City","Link","State","City.and.State", "PathOrder", "Coastal", 
            "Coastal.String")) %>%
  # Create a new column representing if city is in top 50 of US News 2019 Ranking
  mutate(top_50 = if_else(US.News.Rank <= 50, 1, 0)) %>%
  # Normalize only Percent.Single for now
  mutate(Percent.Single = normalize(Percent.Single)) %>%
  # remove US.News.Rank
  select(-US.News.Rank)

fivenum(cities_filter$Percent.Single) # Make top quartile of values 1s and rest 0s
## [1] 0.0000000 0.2777778 0.4055556 0.5000000 1.0000000
# One-Hot encode factor variables (only city_type)
cities_clean = one_hot(as.data.table(cities_filter), cols = "auto",
                              sparsifyNAs = TRUE, naCols = FALSE, 
                              dropCols = TRUE,dropUnusedLevels = TRUE) %>%
  # Create target variable
  mutate(singles_like = cut(Percent.Single, c(-1,0.5,1), labels = c(0,1))) %>%
  # remove Percent.Single
  select(-Percent.Single)

Missing Data

md.pattern(cities_clean, rotate.names = T)

##    Avg.Commute.Time Median.Monthly.Rent Property.Crime Violent.crime
## 94                1                   1              1             1
## 19                1                   1              1             1
## 2                 1                   1              1             1
## 6                 1                   1              1             1
## 3                 1                   1              1             1
## 1                 1                   1              1             1
##                   0                   0              0             0
##    Average.Annual.Salary Median.Age Metro.Population Unemployment.Rate
## 94                     1          1                1                 1
## 19                     1          1                1                 1
## 2                      1          1                1                 1
## 6                      1          1                1                 1
## 3                      1          1                1                 1
## 1                      1          1                1                 1
##                        0          0                0                 0
##    city_type_Coastal city_type_Non Coastal top_50 singles_like Total.Students
## 94                 1                     1      1            1              1
## 19                 1                     1      1            1              1
## 2                  1                     1      1            1              1
## 6                  1                     1      1            1              1
## 3                  1                     1      1            1              0
## 1                  1                     1      1            1              0
##                    0                     0      0            0              4
##    Total.Teachers Median.Home.Price Avg.Annual.Rainfall Avg.High.Temps
## 94              1                 1                   1              1
## 19              1                 1                   0              0
## 2               1                 0                   1              1
## 6               1                 0                   0              0
## 3               0                 1                   1              1
## 1               0                 0                   1              1
##                 4                 9                  25             25
##    Avg.Low.Temps Temperature.Range    
## 94             1                 1   0
## 19             0                 0   4
## 2              1                 1   1
## 6              0                 0   5
## 3              1                 1   2
## 1              1                 1   3
##               25                25 117
# As we can see, there are missing values in some of the quantitative columns
# We lose the most amount of rows if we keep the weather data
# Let's see the relationship between them and the target variable

hist1 = ggplot(cities_clean, aes(x = Avg.Annual.Rainfall, y = ..density.., 
                                 fill = singles_like)) +
  geom_histogram(alpha = 0.5, color = "white", bins = 15) # No major relationship

hist2 = ggplot(cities_clean, aes(x = Avg.High.Temps, y = ..density.., 
                                 fill = singles_like)) +
  geom_histogram(alpha = 0.5, color = "white", bins = 15) # No major relationship

hist3 = ggplot(cities_clean, aes(x = Avg.Low.Temps, y = ..density.., 
                                 fill = singles_like)) +
  geom_histogram(alpha = 0.5, color = "white", bins = 15) # No major relationship

hist4 = ggplot(cities_clean, aes(x = Temperature.Range, y = ..density.., 
                                 fill = singles_like)) + 
  geom_histogram(alpha = 0.5, color = "white", bins = 15) # No major relationship

grid.arrange(hist1, hist2, hist3, hist4, ncol = 2)
## Warning: The dot-dot notation (`..density..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(density)` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: Removed 25 rows containing non-finite values (`stat_bin()`).
## Removed 25 rows containing non-finite values (`stat_bin()`).
## Removed 25 rows containing non-finite values (`stat_bin()`).
## Removed 25 rows containing non-finite values (`stat_bin()`).

# We don't lose too much information by deleting them, so we can remove those columns
# What about Total.Students nad Total.Teachers?
hist5 = ggplot(cities_clean, aes(x = Total.Students, y = ..density..,
                                 fill = singles_like)) +
  geom_histogram(alpha = 0.5, color = "white", bins = 5) # Same thing nothing major

hist6 = ggplot(cities_clean, aes(x = Total.Teachers, y = ..density..,
                                 fill = singles_like)) +
  geom_histogram(alpha = 0.5, color = "white", bins = 5) # Same thing nothing major

grid.arrange(hist5, hist6, ncol = 2)
## Warning: Removed 4 rows containing non-finite values (`stat_bin()`).
## Warning: Removed 4 rows containing non-finite values (`stat_bin()`).

# We can remove these 2 columns as well. Intuitively the other columns seem to have a substantial connection, so we will not remove them and we'll delete any reamining rows that have NA entries

cities_final = cities_clean %>% select(-c("Avg.Annual.Rainfall","Avg.High.Temps", 
                                          "Avg.Low.Temps", "Temperature.Range",
                                          "Total.Teachers","Total.Students")) %>% 
  na.omit() %>% mutate_if(is.numeric, normalize)

# We get the mice now
md.pattern(cities_final, rotate.names = T)
##  /\     /\
## {  `---'  }
## {  O   O  }
## ==>  V <==  No need for mice. This data set is completely observed.
##  \  \|/  /
##   `-----'

##     Avg.Commute.Time Median.Home.Price Median.Monthly.Rent Property.Crime
## 116                1                 1                   1              1
##                    0                 0                   0              0
##     Violent.crime Average.Annual.Salary Median.Age Metro.Population
## 116             1                     1          1                1
##                 0                     0          0                0
##     Unemployment.Rate city_type_Coastal city_type_Non Coastal top_50
## 116                 1                 1                     1      1
##                     0                 0                     0      0
##     singles_like  
## 116            1 0
##                0 0

Exploratory Data Analysis

Final Summary Statistics

cities_summary2 = data.frame(summary(cities_final)) %>% 
  mutate(Var1 = rep(1:(n()/ncol(cities_final)),ncol(cities_final))) %>%
  pivot_wider(id_cols = Var1, names_from = Var2, values_from = Freq) %>%
  select(-Var1) %>% replace(is.na(.),"")
datatable(cities_summary2, rownames = F, extensions = "Scroller",
          options = list(dom = 't',ordering = F,scrollX = T))

Quantitative Data Plots

# Specific Columns that I'm interested in
eda_cols = c("Avg.Commute.Time", "Median.Home.Price", 
             "Median.Monthly.Rent","Violent.Crime",
             "Avg.Annual.Salary","Unemployment.Rate")

eda_plot1 = ggplot(cities_final, aes(
  x=Avg.Commute.Time, y=..density.., fill = singles_like)) + 
  geom_histogram(alpha = 0.5, color = "white", bins = 8)
eda_plot2 = ggplot(cities_final, aes(
  x=Median.Home.Price, y=..density.., fill = singles_like)) + 
  geom_histogram(alpha = 0.5, color = "white", bins = 8)
eda_plot3 = ggplot(cities_final, aes(
  x=Median.Monthly.Rent, y=..density.., fill = singles_like)) + 
  geom_histogram(alpha = 0.5, color = "white", bins = 8)
eda_plot4 = ggplot(cities_final, aes(
  x=Violent.crime, y=..density.., fill = singles_like)) + 
  geom_histogram(alpha = 0.5, color = "white", bins = 8)
eda_plot5 = ggplot(cities_final, aes(
  x=Average.Annual.Salary, y=..density.., fill = singles_like)) + 
  geom_histogram(alpha = 0.5, color = "white", bins = 8)
eda_plot6 = ggplot(cities_final, aes(
  x=Unemployment.Rate, y=..density.., fill = singles_like)) + 
  geom_histogram(alpha = 0.5, color = "white", bins = 8)

grid.arrange(eda_plot1, eda_plot2, eda_plot3, eda_plot4, eda_plot5, eda_plot6, ncol = 3)

From the naked eye it looks like the distributions of the quantitative variables irrespecive of class are similar. However, there are enough features and variations within them that the classifier should be able to harness.

Model

Creating Train-Tune-Test Sets

# table(cities_final$singles_like)[2]/sum(table(cities_final$singles_like))

set.seed(3)
cities_train_idx = createDataPartition(cities_final$singles_like, p = 0.70, groups=1,
                                    list=FALSE)

cities_train = cities_final[cities_train_idx,]

cities_tune_test = cities_final[-cities_train_idx,]
cities_tune_idx = createDataPartition(cities_tune_test$singles_like, p = 0.5, list = F)

cities_tune = cities_tune_test[cities_tune_idx,]
cities_test = cities_tune_test[-cities_tune_idx,]

# table(cities_train$singles_like)[2]/sum(table(cities_train$singles_like))
# table(cities_tune$singles_like)[2]/sum(table(cities_tune$singles_like))
# table(cities_test$singles_like)[2]/sum(table(cities_test$singles_like))

Prevalences * dataset - 26.72 % * train - 26.83% * tune - 27.78 % * test - 25 %

Determining k

chooseK = function(k, train_set, val_set, train_class, val_class){
  
  set.seed(3)
  class_knn = knn(train = train_set, test = val_set, cl = train_class, k = k,
                  use.all = TRUE)
  
  pred_mat = table(class_knn, val_class)
  accu = sum(pred_mat[row(pred_mat) == col(pred_mat)]) / sum(pred_mat)
  
  print(confusionMatrix(
    as.factor(class_knn), as.factor(cities_tune$singles_like), 
    positive = "1", dnn=c("Prediction", "Actual"), mode ="sens_spec"))
  
  return(cbind(k = k, accuracy = accu))
  
  
}

cities_NN_k = sapply(seq(1, 15, 2),chooseK, train_set = cities_train, 
                     val_set = cities_tune, train_class = cities_train$singles_like,
                     val_class = cities_tune$singles_like)
## Confusion Matrix and Statistics
## 
##           Actual
## Prediction  0  1
##          0 13  1
##          1  0  4
##                                           
##                Accuracy : 0.9444          
##                  95% CI : (0.7271, 0.9986)
##     No Information Rate : 0.7222          
##     P-Value [Acc > NIR] : 0.02264         
##                                           
##                   Kappa : 0.8525          
##                                           
##  Mcnemar's Test P-Value : 1.00000         
##                                           
##             Sensitivity : 0.8000          
##             Specificity : 1.0000          
##          Pos Pred Value : 1.0000          
##          Neg Pred Value : 0.9286          
##              Prevalence : 0.2778          
##          Detection Rate : 0.2222          
##    Detection Prevalence : 0.2222          
##       Balanced Accuracy : 0.9000          
##                                           
##        'Positive' Class : 1               
##                                           
## Confusion Matrix and Statistics
## 
##           Actual
## Prediction  0  1
##          0 13  1
##          1  0  4
##                                           
##                Accuracy : 0.9444          
##                  95% CI : (0.7271, 0.9986)
##     No Information Rate : 0.7222          
##     P-Value [Acc > NIR] : 0.02264         
##                                           
##                   Kappa : 0.8525          
##                                           
##  Mcnemar's Test P-Value : 1.00000         
##                                           
##             Sensitivity : 0.8000          
##             Specificity : 1.0000          
##          Pos Pred Value : 1.0000          
##          Neg Pred Value : 0.9286          
##              Prevalence : 0.2778          
##          Detection Rate : 0.2222          
##    Detection Prevalence : 0.2222          
##       Balanced Accuracy : 0.9000          
##                                           
##        'Positive' Class : 1               
##                                           
## Confusion Matrix and Statistics
## 
##           Actual
## Prediction  0  1
##          0 13  1
##          1  0  4
##                                           
##                Accuracy : 0.9444          
##                  95% CI : (0.7271, 0.9986)
##     No Information Rate : 0.7222          
##     P-Value [Acc > NIR] : 0.02264         
##                                           
##                   Kappa : 0.8525          
##                                           
##  Mcnemar's Test P-Value : 1.00000         
##                                           
##             Sensitivity : 0.8000          
##             Specificity : 1.0000          
##          Pos Pred Value : 1.0000          
##          Neg Pred Value : 0.9286          
##              Prevalence : 0.2778          
##          Detection Rate : 0.2222          
##    Detection Prevalence : 0.2222          
##       Balanced Accuracy : 0.9000          
##                                           
##        'Positive' Class : 1               
##                                           
## Confusion Matrix and Statistics
## 
##           Actual
## Prediction  0  1
##          0 13  1
##          1  0  4
##                                           
##                Accuracy : 0.9444          
##                  95% CI : (0.7271, 0.9986)
##     No Information Rate : 0.7222          
##     P-Value [Acc > NIR] : 0.02264         
##                                           
##                   Kappa : 0.8525          
##                                           
##  Mcnemar's Test P-Value : 1.00000         
##                                           
##             Sensitivity : 0.8000          
##             Specificity : 1.0000          
##          Pos Pred Value : 1.0000          
##          Neg Pred Value : 0.9286          
##              Prevalence : 0.2778          
##          Detection Rate : 0.2222          
##    Detection Prevalence : 0.2222          
##       Balanced Accuracy : 0.9000          
##                                           
##        'Positive' Class : 1               
##                                           
## Confusion Matrix and Statistics
## 
##           Actual
## Prediction  0  1
##          0 13  1
##          1  0  4
##                                           
##                Accuracy : 0.9444          
##                  95% CI : (0.7271, 0.9986)
##     No Information Rate : 0.7222          
##     P-Value [Acc > NIR] : 0.02264         
##                                           
##                   Kappa : 0.8525          
##                                           
##  Mcnemar's Test P-Value : 1.00000         
##                                           
##             Sensitivity : 0.8000          
##             Specificity : 1.0000          
##          Pos Pred Value : 1.0000          
##          Neg Pred Value : 0.9286          
##              Prevalence : 0.2778          
##          Detection Rate : 0.2222          
##    Detection Prevalence : 0.2222          
##       Balanced Accuracy : 0.9000          
##                                           
##        'Positive' Class : 1               
##                                           
## Confusion Matrix and Statistics
## 
##           Actual
## Prediction  0  1
##          0 13  1
##          1  0  4
##                                           
##                Accuracy : 0.9444          
##                  95% CI : (0.7271, 0.9986)
##     No Information Rate : 0.7222          
##     P-Value [Acc > NIR] : 0.02264         
##                                           
##                   Kappa : 0.8525          
##                                           
##  Mcnemar's Test P-Value : 1.00000         
##                                           
##             Sensitivity : 0.8000          
##             Specificity : 1.0000          
##          Pos Pred Value : 1.0000          
##          Neg Pred Value : 0.9286          
##              Prevalence : 0.2778          
##          Detection Rate : 0.2222          
##    Detection Prevalence : 0.2222          
##       Balanced Accuracy : 0.9000          
##                                           
##        'Positive' Class : 1               
##                                           
## Confusion Matrix and Statistics
## 
##           Actual
## Prediction  0  1
##          0 13  1
##          1  0  4
##                                           
##                Accuracy : 0.9444          
##                  95% CI : (0.7271, 0.9986)
##     No Information Rate : 0.7222          
##     P-Value [Acc > NIR] : 0.02264         
##                                           
##                   Kappa : 0.8525          
##                                           
##  Mcnemar's Test P-Value : 1.00000         
##                                           
##             Sensitivity : 0.8000          
##             Specificity : 1.0000          
##          Pos Pred Value : 1.0000          
##          Neg Pred Value : 0.9286          
##              Prevalence : 0.2778          
##          Detection Rate : 0.2222          
##    Detection Prevalence : 0.2222          
##       Balanced Accuracy : 0.9000          
##                                           
##        'Positive' Class : 1               
##                                           
## Confusion Matrix and Statistics
## 
##           Actual
## Prediction  0  1
##          0 13  1
##          1  0  4
##                                           
##                Accuracy : 0.9444          
##                  95% CI : (0.7271, 0.9986)
##     No Information Rate : 0.7222          
##     P-Value [Acc > NIR] : 0.02264         
##                                           
##                   Kappa : 0.8525          
##                                           
##  Mcnemar's Test P-Value : 1.00000         
##                                           
##             Sensitivity : 0.8000          
##             Specificity : 1.0000          
##          Pos Pred Value : 1.0000          
##          Neg Pred Value : 0.9286          
##              Prevalence : 0.2778          
##          Detection Rate : 0.2222          
##    Detection Prevalence : 0.2222          
##       Balanced Accuracy : 0.9000          
##                                           
##        'Positive' Class : 1               
## 
cities_NN_k = tibble(k = cities_NN_k[1,],
                             accuracy = cities_NN_k[2,])

ggplot(cities_NN_k, aes(x = k, y = accuracy)) +
  geom_line(color = "orange", size = 1.5) + geom_point(size = 3)
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

Surprisingly, the accuracy of the knn model stays the same regardless of the value of k, suggesting that there is a clear separation between the two classes and the classifer is performing very well as a result. However, further evaluation can confirm whether that is the case or if there could be something else that’s happening. We will proceed forward with a k-value of 3

Further Evaluation

set.seed(3)
cities_3NN = knn(train = cities_train, test = cities_tune, 
                 cl = cities_train$singles_like, k = 3, use.all = TRUE, prob = TRUE)

cities_prob = tibble(attr(cities_3NN, "prob"))
cities_final_model = tibble(k_prob = cities_prob$`attr(cities_3NN, "prob")`, 
                            pred=cities_3NN, target=cities_tune$singles_like)
cities_final_model$pos_prec <- ifelse(cities_final_model$pred == 0,
                                      1-cities_final_model$k_prob,
                                      cities_final_model$k_prob)

densityplot(cities_final_model$pos_prec)

confusionMatrix(cities_final_model$pred, cities_final_model$target, positive = "1", 
                dnn=c("Prediction", "Actual"), mode = "sens_spec")
## Confusion Matrix and Statistics
## 
##           Actual
## Prediction  0  1
##          0 13  1
##          1  0  4
##                                           
##                Accuracy : 0.9444          
##                  95% CI : (0.7271, 0.9986)
##     No Information Rate : 0.7222          
##     P-Value [Acc > NIR] : 0.02264         
##                                           
##                   Kappa : 0.8525          
##                                           
##  Mcnemar's Test P-Value : 1.00000         
##                                           
##             Sensitivity : 0.8000          
##             Specificity : 1.0000          
##          Pos Pred Value : 1.0000          
##          Neg Pred Value : 0.9286          
##              Prevalence : 0.2778          
##          Detection Rate : 0.2222          
##    Detection Prevalence : 0.2222          
##       Balanced Accuracy : 0.9000          
##                                           
##        'Positive' Class : 1               
## 
#ROC and AUC curves
cities_pred <- prediction(cities_final_model$pos_prec,cities_final_model$target)

knn_perf <- performance(cities_pred,"tpr","fpr")

plot(knn_perf, colorize=TRUE)
abline(a=0, b= 1)

knn_perf_AUC <- performance(cities_pred,"auc")

print(knn_perf_AUC@y.values)
## [[1]]
## [1] 0.8923077
#LogLoss
LogLoss(as.numeric(cities_final_model$pos_prec), as.numeric(cities_final_model$target))
## [1] 19.24907
#F1Score
F1_Score(y_pred = cities_final_model$pred, y_true = cities_final_model$target, positive = "1")
## [1] 0.8888889

Further evaluation validates previous findings: the ROC and the AUC show curve that is not entirely perpendicular, with the AUC producing a value of 0.8923. The F1 score of 0.8889 implies that the precision and recall are both very strong for this model. The sensitivity value of 0.8 is relatively slow, but draws emphasis to the lack of enough data: Having more cities to work with would’ve improved the sensitivity of the model as well.

Conclusions

As mentioned at the start of this project, the goal was to see if we could predict if a city would be one that single people would like to live in. The model that we have built in the process can predict this with an accuracy of 94.44 %, with supplemental metrics such as the AUC and F1 Score that are just as strong. However, the lack of enough data was very apparent with the model’s sensitivity not being as high as its other metrics. Regardless, the model benefitted from robust preprocessing and the implemented data cleaning techniques.

Future Work

Future work would involve both collecting data from more cities but also collecting more data in general. Additionally, different kinds of models can be applied to the dataset to draw more insightful conclusions, such as which factors have more of a say in determining whether a single adult would like living in a city.

References