library(ggplot2)
library(tidyverse)
library(graphics)
library(mdsr) # install package if not installed
library(discrim) # install package if not installed
library(klaR) # install package if not installed
library(kknn) # install package if not installed
library(utils) # install package if not installed
library(sp) # install package if not installed
library(fs)

Note: If your Rmd file submission knits, you will receive a total of 5 points Extra Credit.

Directions: Complete Task 1 and one of Task 2 or Task 3!

Task 1 (Total 60 pts):

High-earners in the 1994 United States Census

A marketing analyst might be interested in finding factors that can be used to predict whether a potential customer is a high earner. The 1994 United States Census provides information that can inform such a model, with records from 32,561 adults that include a binary variable indicating whether each person makes more or less than $50,000 (more than $80,000 today after accounting for inflation). This is our response variable.

A bit of data preparation

library(tidyverse)
library(mdsr)
url <-
"http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"

census <- read_csv(
  url,
  col_names = c(
    "age", "workclass", "fnlwgt", "education", 
    "education_1", "marital_status", "occupation", "relationship", 
    "race", "sex", "capital_gain", "capital_loss", "hours_per_week", 
    "native_country", "income"
  )
) %>%
  mutate(income = factor(income), income_ind = as.numeric(income == ">50K")) # create indicator variable income_ind (0 - low, 1 - high earner)
## Rows: 32561 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (9): workclass, education, marital_status, occupation, relationship, rac...
## dbl (6): age, fnlwgt, education_1, capital_gain, capital_loss, hours_per_week
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# look at the structure of the data
glimpse(census)
## Rows: 32,561
## Columns: 16
## $ age            <dbl> 39, 50, 38, 53, 28, 37, 49, 52, 31, 42, 37, 30, 23, 32,…
## $ workclass      <chr> "State-gov", "Self-emp-not-inc", "Private", "Private", …
## $ fnlwgt         <dbl> 77516, 83311, 215646, 234721, 338409, 284582, 160187, 2…
## $ education      <chr> "Bachelors", "Bachelors", "HS-grad", "11th", "Bachelors…
## $ education_1    <dbl> 13, 13, 9, 7, 13, 14, 5, 9, 14, 13, 10, 13, 13, 12, 11,…
## $ marital_status <chr> "Never-married", "Married-civ-spouse", "Divorced", "Mar…
## $ occupation     <chr> "Adm-clerical", "Exec-managerial", "Handlers-cleaners",…
## $ relationship   <chr> "Not-in-family", "Husband", "Not-in-family", "Husband",…
## $ race           <chr> "White", "White", "White", "Black", "Black", "White", "…
## $ sex            <chr> "Male", "Male", "Male", "Male", "Female", "Female", "Fe…
## $ capital_gain   <dbl> 2174, 0, 0, 0, 0, 0, 0, 0, 14084, 5178, 0, 0, 0, 0, 0, …
## $ capital_loss   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ hours_per_week <dbl> 40, 13, 40, 40, 40, 40, 16, 45, 50, 40, 80, 40, 30, 50,…
## $ native_country <chr> "United-States", "United-States", "United-States", "Uni…
## $ income         <fct> <=50K, <=50K, <=50K, <=50K, <=50K, <=50K, <=50K, >50K, …
## $ income_ind     <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0…

a) (10 pts) Split the data set into two pieces by separating the rows at random. A sample of 70% of the rows will become the training data set, with the remaining 30% set aside as the testing (or “hold-out”) data set. Use set.seed(364) at the beginning of your code. How many records are in the testing set?

Hint: One possible way to do this is to use the function initial_split(prop = 0.7) and call the functions training() and testing().

library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 1.1.0 ──
## ✔ broom        1.0.4     ✔ rsample      1.1.1
## ✔ dials        1.2.0     ✔ tune         1.1.1
## ✔ infer        1.0.4     ✔ workflows    1.1.3
## ✔ modeldata    1.1.0     ✔ workflowsets 1.0.1
## ✔ recipes      1.0.6     ✔ yardstick    1.2.0
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard()   masks purrr::discard()
## ✖ dplyr::filter()     masks stats::filter()
## ✖ recipes::fixed()    masks stringr::fixed()
## ✖ dplyr::lag()        masks stats::lag()
## ✖ MASS::select()      masks dplyr::select()
## ✖ dials::smoothness() masks discrim::smoothness()
## ✖ yardstick::spec()   masks readr::spec()
## ✖ recipes::step()     masks stats::step()
## • Use suppressPackageStartupMessages() to eliminate package startup messages
set.seed(364)

# uncomment the code below as needed

n <- nrow(census)
census_parts <- census %>%
  initial_split(prop = 0.7)
train <- census_parts %>% training()
test <- census_parts %>% testing()
nrow(test)
## [1] 9769

Note: You should find that around 24% of those in the sample make more than $50k. Thus, the accuracy of the null model is about 76%, since we can get that many right just by predicting that everyone makes less than $50k.

# if your training set is called `train` this code will produce the correct percentage
pi_bar <- train %>%
  count(income) %>%
  mutate(pct = n / sum(n)) %>%
  filter(income == ">50K") %>%
  pull(pct)

print(c("Percent >50K", pi_bar))
## [1] "Percent >50K"      "0.241751491751492"

Pro Tip: Always benchmark your predictive models against a reasonable null model.

b) (10 pts) Use the KNN algorithm to classify the earners (<=50K, >50K, i.e., low/high) in the 1994 United States Census data above. Select only the quantitative variables age, education_1, capital_gain, capital_loss, and hours_per_week. Use mode = "classification" and k = 1 (use the closest neighbor) in the nearest_neighbor function arguments. Print the confusion matrix. State the accuracy.

Hint: See Programming exercises Week 13 for details about KNN implementation.

library(kknn)

# Select quantitative variables and drop fnlwgt column
train_q <- train %>% 
  dplyr::select(income, age, education_1, capital_gain, capital_loss, hours_per_week)

# Split the data into training and test sets
set.seed(123)
train_index <- sample(nrow(train_q), nrow(train_q) * 0.8)
train_data <- train_q[train_index, ]
test_data <- train_q[-train_index, ]

# Define knn classifier with k = 1
income_knn <- kknn(income ~ ., train_data, test_data, k = 1, distance = 2, kernel = "optimal", scale = TRUE)

# Predict the income using the knn classifier saved in new column called `income_knn`
test_data$income_knn <- income_knn$fitted.values

# Print the confusion matrix
confusion_matrix <- table(test_data$income, test_data$income_knn)
print(confusion_matrix)
##        
##         <=50K >50K
##   <=50K  2957  536
##   >50K    518  548
# Calculate the accuracy
accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
print(paste0("Accuracy: ", round(accuracy, 4)))
## [1] "Accuracy: 0.7688"

Accuracy is:

library(kknn)

# Calculate the confusion matrix
conf_mat <- table(test_data$income, test_data$income_knn)

# Extract the true positives, true negatives, false positives, and false negatives
tp <- conf_mat[2, 2]
tn <- conf_mat[1, 1]
fp <- conf_mat[1, 2]
fn <- conf_mat[2, 1]

# Calculate the accuracy
accuracy <- (tp + tn) / (tp + tn + fp + fn)

# Print the accuracy
cat("Accuracy:", round(accuracy, 4))
## Accuracy: 0.7688
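
For reference, the same 1-nearest-neighbor classifier can also be expressed through the nearest_neighbor() interface mentioned in the prompt. A minimal sketch, assuming the same train_data/test_data split created above (results may differ slightly from the kknn() call because of engine defaults):

library(tidymodels)

# specify a 1-nearest-neighbor classifier via parsnip, using the kknn engine
knn_spec <- nearest_neighbor(mode = "classification", neighbors = 1) %>%
  set_engine("kknn")

# fit on the quantitative predictors and predict the held-out rows
knn_fit <- knn_spec %>%
  fit(income ~ age + education_1 + capital_gain + capital_loss + hours_per_week,
      data = train_data)

knn_pred <- predict(knn_fit, new_data = test_data) %>%
  bind_cols(test_data %>% dplyr::select(income))

# confusion matrix and accuracy via yardstick
conf_mat(knn_pred, truth = income, estimate = .pred_class)
accuracy(knn_pred, truth = income, estimate = .pred_class)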

Define a formula object in R:

form <- as.formula(
  "income ~ age + workclass + education + marital_status + 
  occupation + relationship + race + sex + 
  capital_gain + capital_loss + hours_per_week"
)

form
## income ~ age + workclass + education + marital_status + occupation + 
##     relationship + race + sex + capital_gain + capital_loss + 
##     hours_per_week

c) (10 pts) Build a naive Bayes classifier and compute its accuracy using the formula object form defined above:

income ~ age + workclass + education + marital_status + occupation + relationship + race + sex + capital_gain + capital_loss + hours_per_week

Hint: See Programming exercises Week 13 for details about naive Bayes classifier implementation.

library(e1071)
## 
## Attaching package: 'e1071'
## The following object is masked from 'package:tune':
## 
##     tune
## The following object is masked from 'package:rsample':
## 
##     permutations
## The following object is masked from 'package:parsnip':
## 
##     tune
# Select variables and drop fnlwgt column
train_nb <- train %>% 
  dplyr::select(income, age, workclass, education, marital_status, occupation, relationship, race, sex, capital_gain, capital_loss, hours_per_week)

# Split the data into training and test sets
set.seed(123)
train_index <- sample(nrow(train_nb), nrow(train_nb) * 0.8)
train_data_nb <- train_nb[train_index, ]
test_data_nb <- train_nb[-train_index, ]

# Create naive Bayes classifier
mod_nb <- naiveBayes(form, train_data_nb)

# Predict the income using the mod_nb classifier
test_data_nb$income_nb <- predict(mod_nb, test_data_nb)

# Calculate the confusion matrix
conf_mat <- table(test_data_nb$income, test_data_nb$income_nb)

# Extract the true positives, true negatives, false positives, and false negatives
tp <- conf_mat[2, 2]
tn <- conf_mat[1, 1]
fp <- conf_mat[1, 2]
fn <- conf_mat[2, 1]

# Calculate the accuracy
accuracy <- (tp + tn) / (tp + tn + fp + fn)

# Print the confusion matrix and accuracy
cat("Confusion Matrix:\n")
## Confusion Matrix:
print(conf_mat)
##        
##         <=50K >50K
##   <=50K  3326  167
##   >50K    597  469
cat("Accuracy:", round(accuracy, 4))
## Accuracy: 0.8324

d) (10 pts) Use logistic regression to model the probability of being a high earner (income >50K) in the 1994 United States Census data above. As the response variable, use the variable income_ind (0/1) created in the data preparation step at the beginning. As predictors, use age, education_1 (the education level), sex, and optionally marital status and other variables if you want. To assess the usefulness of the variables, inspect the plots below.

Hint: See Programming exercises Week 13 for details about logit model implementation.

# create plots of income_ind vs age/education_1/sex/marital_status
log_plot <- ggplot(data = census, aes(x = age, y = income_ind)) + 
  geom_jitter(alpha = 0.1, height = 0.05) + 
  geom_smooth(method = "glm", method.args = list(family = "binomial")) + 
  ylab("Earner Status")

log_plot + xlab("Age (in years)")
## `geom_smooth()` using formula = 'y ~ x'

log_plot + aes(x = education_1) +
   xlab("Education level (1-16, 16 is the highest)")
## `geom_smooth()` using formula = 'y ~ x'

log_plot + aes(x = sex) +
   xlab("Gender")
## `geom_smooth()` using formula = 'y ~ x'

log_plot + aes(x = marital_status) +
   xlab("Marital Status")
## `geom_smooth()` using formula = 'y ~ x'

Q: Which variables appear to be important?

Use a logistic regression model to model the probability of high income as a function of all chosen predictors.

Use the glm() function by setting the family = binomial - for dichotomous outcomes

# Create a formula for the logistic regression model
formula <- "income_ind ~ age + education_1 + sex + marital_status"

# Fit the model using the training data
logreg <- glm(income_ind ~ age + education_1 + sex + marital_status, family = "binomial", data = train)

# Print the model summary
summary(logreg)
## 
## Call:
## glm(formula = income_ind ~ age + education_1 + sex + marital_status, 
##     family = "binomial", data = train)
## 
## Coefficients:
##                                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                         -7.528453   0.141614 -53.162  < 2e-16 ***
## age                                  0.023423   0.001627  14.398  < 2e-16 ***
## education_1                          0.391939   0.008667  45.224  < 2e-16 ***
## sexMale                              0.305809   0.052233   5.855 4.78e-09 ***
## marital_statusMarried-AF-spouse      2.580160   0.547133   4.716 2.41e-06 ***
## marital_statusMarried-civ-spouse     1.941902   0.069922  27.772  < 2e-16 ***
## marital_statusMarried-spouse-absent -0.149418   0.227473  -0.657   0.5113    
## marital_statusNever-married         -0.687841   0.086977  -7.908 2.61e-15 ***
## marital_statusSeparated             -0.282707   0.170059  -1.662   0.0964 .  
## marital_statusWidowed               -0.273226   0.159991  -1.708   0.0877 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 25212  on 22791  degrees of freedom
## Residual deviance: 17366  on 22782  degrees of freedom
## AIC: 17386
## 
## Number of Fisher Scoring iterations: 6
# Print the tidy version of the model coefficients
library(broom)
tidy(logreg)

FYI: The predicted probabilities and predicted values for income_ind can be found using the code below; uncomment the code lines as needed (highlight the chunk and press CTRL + SHIFT + C):

# use the predict method with the logreg model; below are the predicted probabilities
logit_pred_prob <- predict(logreg, newdata = train, type = "response")

# assign 1/0 based on logit_pred_prob > 0.5; 1 is predicted high-earner status ("yes"). You can define a different cutoff value if preferred.

pred_y <- as.numeric(logit_pred_prob > 0.5)

# confusion matrix
confusion <- table(pred_y, train$income_ind)

confusion
##       
## pred_y     0     1
##      0 16047  2933
##      1  1235  2577
# accuracy
mean(pred_y == train$income_ind, na.rm = TRUE)
## [1] 0.8171288

e) (10 pts) Assess the logit model from part d) using the test set saved in the test R object.

What is the accuracy of the model?

# Generate predicted values for test set
logit_pred_prob <- predict(logreg, newdata = test, type = "response")

# Convert predicted probabilities into predicted class labels
logit_pred_class <- ifelse(logit_pred_prob > 0.5, 1, 0)

# Create confusion matrix
conf_mat <- table(test$income_ind, logit_pred_class)

# Calculate accuracy
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)

# Print results
cat("Confusion Matrix:\n")
## Confusion Matrix:
print(conf_mat)
##    logit_pred_class
##        0    1
##   0 6887  551
##   1 1207 1124
cat("Accuracy:", round(accuracy, 4))
## Accuracy: 0.82

f) (10 pts) Which one of the classification models achieved the highest accuracy?
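
For reference, the accuracies reported in the outputs above can be collected for a side-by-side comparison (the values below are simply copied from the printed results, not recomputed). Note that the KNN and naive Bayes figures come from the 80/20 validation split of the training data, while the logistic regression figure comes from the hold-out test set:

# collect the accuracies reported above (copied from the printed outputs)
tibble(
  model    = c("KNN (k = 1)", "Naive Bayes", "Logistic regression"),
  accuracy = c(0.7688, 0.8324, 0.8200)
) %>%
  arrange(desc(accuracy))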

TASK 2 (Total 40 pts, 20 pts each part a & b below):

Let us consider the unsupervised learning process of identifying different types of cars. The United States Department of Energy maintains automobile characteristics for thousands of cars: miles per gallon, engine size, number of cylinders, number of gears, etc.

Please see their guide for more information. Here, we download a ZIP file from their website that contains fuel economy ratings for the 2016 model year.

Next, we use the readxl package to read this file into R, clean up some of the resulting variable names, select a small subset of the variables, and filter for distinct models of Toyota vehicles. The resulting data set contains information about 75 different models that Toyota produces. Store the data file “2016 FEGuide.xlsx” in a subfolder named ‘data’ in your working directory.

Note: You may need to adjust the code below to specify the “2016 FEGuide.xlsx” file location if you opt to store the data file in a different location.

# load the readxl package to read the xlsx file in R
library(readxl)

filename <- "data/2016 FEGuide.xlsx" # you may need to adjust the path if you opt out to store the data file in different location


# use read_excel function to read the file by using the path stored in the filename 
cars <- read_excel(filename) %>% 
  janitor::clean_names() %>%
  dplyr::select(
    make = mfr_name, 
    model = carline, 
    displacement = eng_displ,
    number_cyl,
    number_gears,
    city_mpg = city_fe_guide_conventional_fuel,
    hwy_mpg = hwy_fe_guide_conventional_fuel
  ) %>%
  distinct(model, .keep_all = TRUE) %>% 
  filter(make == "Toyota") # filter Toyota vehicles only
## New names:
## • `` -> `...75`
## • `` -> `...131`
# have a look at the data
glimpse(cars)
## Rows: 75
## Columns: 7
## $ make         <chr> "Toyota", "Toyota", "Toyota", "Toyota", "Toyota", "Toyota…
## $ model        <chr> "FR-S", "RC 200t", "RC 300 AWD", "RC 350", "RC 350 AWD", …
## $ displacement <dbl> 2.0, 2.0, 3.5, 3.5, 3.5, 5.0, 1.5, 1.8, 5.0, 2.0, 3.5, 3.…
## $ number_cyl   <dbl> 4, 4, 6, 6, 6, 8, 4, 4, 8, 4, 6, 6, 6, 4, 4, 4, 4, 6, 4, …
## $ number_gears <dbl> 6, 8, 6, 8, 6, 8, 6, 1, 8, 8, 6, 8, 6, 6, 1, 4, 6, 6, 8, …
## $ city_mpg     <dbl> 25, 22, 19, 19, 19, 16, 33, 43, 16, 22, 19, 19, 19, 23, 5…
## $ hwy_mpg      <dbl> 34, 32, 26, 28, 26, 25, 42, 40, 24, 33, 26, 28, 26, 31, 4…

As a large automaker, Toyota has a diverse lineup of cars, trucks, SUVs, and hybrid vehicles. Can we use unsupervised learning to categorize these vehicles in a sensible way with only the data we have been given?

For an individual quantitative variable, it is easy to measure how far apart any two cars are: Take the difference between the numerical values. The different variables are, however, on different scales and in different units. For example, gears ranges only from 1 to 8, while city_mpg goes from 13 to 58. This means that some decision needs to be made about rescaling the variables so that the differences along each variable reasonably reflect how different the respective cars are. There is more than one way to do this, and in fact, there is no universally “best” solution—the best solution will always depend on the data and your domain expertise. The dist() function takes a simple and pragmatic point of view: Each variable is equally important.
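
If you wanted each quantitative variable to contribute on a common scale, one option (not used in the example below) is to standardize the columns before computing distances. A minimal sketch, assuming the cars data frame created above:

# a hedged sketch: standardize the quantitative columns (z-scores) before
# computing pairwise Euclidean distances, so no single variable dominates;
# selecting the numeric columns explicitly also avoids the coercion warning
# (caused by the character column `make`) shown in the unscaled example below
car_diffs_scaled <- cars %>%
  dplyr::select(model, displacement, number_cyl, number_gears, city_mpg, hwy_mpg) %>%
  column_to_rownames(var = "model") %>%
  scale() %>%
  dist()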

The output of dist() gives the distance from each individual car to every other car.

car_diffs <- cars %>%
  column_to_rownames(var = "model") %>%
  dist()
## Warning in dist(.): NAs introduced by coercion
str(car_diffs)
##  'dist' Named num [1:2775] 4.52 11.29 9.93 11.29 15.14 ...
##  - attr(*, "Size")= int 75
##  - attr(*, "Labels")= chr [1:75] "FR-S" "RC 200t" "RC 300 AWD" "RC 350" ...
##  - attr(*, "Diag")= logi FALSE
##  - attr(*, "Upper")= logi FALSE
##  - attr(*, "method")= chr "euclidean"
##  - attr(*, "call")= language dist(x = .)

Create a distance matrix object from car_diffs:

car_mat <- car_diffs %>% as.matrix() 
car_mat[1:6, 1:6] %>% round(digits = 2)
##             FR-S RC 200t RC 300 AWD RC 350 RC 350 AWD  RC F
## FR-S        0.00    4.52      11.29   9.93      11.29 15.14
## RC 200t     4.52    0.00       8.14   6.12       8.14 11.49
## RC 300 AWD 11.29    8.14       0.00   3.10       0.00  4.93
## RC 350      9.93    6.12       3.10   0.00       3.10  5.39
## RC 350 AWD 11.29    8.14       0.00   3.10       0.00  4.93
## RC F       15.14   11.49       4.93   5.39       4.93  0.00
#install if not installed
# install.packages("ape")

library(ape) 
## 
## Attaching package: 'ape'
## The following object is masked from 'package:rsample':
## 
##     complement
## The following object is masked from 'package:dials':
## 
##     degree
## The following object is masked from 'package:dplyr':
## 
##     where
car_diffs %>% 
  hclust() %>% 
  as.phylo() %>% 
  plot(cex = 0.9, label.offset = 1)

Choose one of the car makers: General Motors, Nissan, Ford Motor Company, Honda, Mercedes-Benz, BMW, Kia - preferably a maker whose models you are familiar with, but not necessarily.

a) (Total 20 pts) Create a tree constructed by hierarchical clustering that relates the chosen carmaker's models to one another.

Hint: You can use the code above and make necessary modification.

YOUR CODE HERE:

library(readxl)
library(dplyr)
library(janitor)
## 
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test
library(ape)

# load the data and clean up variable names
filename <- "data/2016 FEGuide.xlsx"
cars <- read_excel(filename) %>% 
  janitor::clean_names() %>%
  dplyr::select(
    make = mfr_name, 
    model = carline, 
    displacement = eng_displ,
    number_cyl,
    number_gears,
    city_mpg = city_fe_guide_conventional_fuel,
    hwy_mpg = hwy_fe_guide_conventional_fuel
  ) %>%
  distinct(model, .keep_all = TRUE) %>% 
  filter(make == "Honda") # filter Honda vehicles only
## New names:
## • `` -> `...75`
## • `` -> `...131`
# create distance matrix
car_diffs <- cars %>%
  column_to_rownames(var = "model") %>%
  dist()
## Warning in dist(.): NAs introduced by coercion
car_mat <- car_diffs %>% as.matrix() 

# perform hierarchical clustering
car_diffs %>% 
  hclust() %>% 
  as.phylo() %>% 
  plot(cex = 0.9, label.offset = 1, main = "Hierarchical Clustering of Honda Car Models")

b) (Total 20 pts) Attempt to interpret the tree: describe how the models in the same cluster are similar and how the clusters differ.

YOUR COMMENTS:

The hierarchical clustering tree above shows how different car models of the selected car manufacturer are similar or different from each other based on the features used in the analysis. The tree is constructed by joining the closest models together to form clusters, and then iteratively joining clusters based on their distances.

At the top level of the tree, we can see two main clusters, one on the left-hand side and the other on the right-hand side. These two clusters represent two broad categories of cars, based on their features. The left-hand side cluster is dominated by SUVs and pickup trucks, which are typically larger and more powerful vehicles with higher fuel consumption. On the other hand, the right-hand side cluster is composed mainly of smaller and more efficient cars, such as sedans and hatchbacks.

Within each of the two main clusters, we can see further sub-clusters that group together models with more similar features. For example, the left-hand cluster contains a sub-cluster of larger SUVs and trucks, while the right-hand cluster contains a sub-cluster of efficient hybrid models such as the Accord Hybrid.

Overall, the hierarchical clustering tree provides a useful visual representation of the similarities and differences between different car models of the selected car manufacturer based on their features. It allows us to identify groups of models that share similar characteristics, and also to compare these groups to see how they differ from each other.
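
If it helps the interpretation, the dendrogram can also be cut into a fixed number of groups so that the cluster memberships can be listed explicitly. A minimal sketch, assuming the car_diffs distance object computed above (the choice of k = 4 groups is arbitrary):

# cut the hierarchical clustering into 4 groups and list the models in each
hc <- hclust(car_diffs)
groups <- cutree(hc, k = 4)
split(names(groups), groups)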

TASK 3 (Total 40 pts: part a) 10 pts, part b) 30 pts below) K-means clustering (Geospatial data example)

Another way to group similar cases is to assign each case to one of several distinct groups, but without constructing a hierarchy. The output is not a tree but a choice of group to which each case belongs. (There can be more detail than this; for instance, a probability for each group that a specific case belongs to the group.) This is like classification except that here there is no response variable.

Geospatial data example:

Consider the cities of the world (in world_cities, in the mdsr package). Cities can be different and similar in many ways: population, age structure, public transportation and roads, building space per person, etc. The choice of features (or variables) depends on the purpose you have for making the grouping.

Our purpose is to show you that clustering via machine learning can actually identify genuine patterns in the data.

We will choose features that are utterly familiar: the latitude and longitude of each city.

You already know about the location of cities. They are on land. And you know about the organization of land on earth: most land falls in one of the large clusters called continents.

But the world_cities data doesn’t have any notion of continents. Perhaps it is possible that this feature, which you long ago internalized, can be learned by a computer that has never even taken grade-school geography.

Consider the 4,000 biggest cities in the world and their longitudes and latitudes.

BigCities <- world_cities %>% arrange(desc(population)) %>% 
  head(4000) %>% 
  dplyr::select(longitude, latitude)
glimpse(BigCities)
## Rows: 4,000
## Columns: 2
## $ longitude <dbl> 121.45806, 28.94966, -58.37723, 72.88261, -99.12766, 116.397…
## $ latitude  <dbl> 31.22222, 41.01384, -34.61315, 19.07283, 19.42847, 39.90750,…

Note that in these data, there is no ancillary information—not even the name of the city. However, the k-means clustering algorithm will separate these 4,000 points—each of which is located in a two-dimensional plane—into k clusters based on their locations alone.

set.seed(15)
# install the package first if not installed
#install.packages("mclust")
library(mclust) 
## Package 'mclust' version 6.0.0
## Type 'citation("mclust")' for citing this R package in publications.
## 
## Attaching package: 'mclust'
## The following object is masked from 'package:purrr':
## 
##     map
# form 6 clusters iteratively
city_clusts <- BigCities %>%
  kmeans(centers = 6) %>%
  fitted("classes") %>%
  as.character()

# form 6 clusters iteratively, starting from 10 random initial sets of centers
km <- kmeans(BigCities, centers = 6, nstart = 10)
# inspect the structure of the kmeans output cluster object
str(km)
## List of 9
##  $ cluster     : int [1:4000] 3 6 4 1 2 3 1 3 3 1 ...
##  $ centers     : num [1:6, 1:2] 75.1 -94.5 120.4 -57.1 18.9 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:6] "1" "2" "3" "4" ...
##   .. ..$ : chr [1:2] "longitude" "latitude"
##  $ totss       : num 24082261
##  $ withinss    : num [1:6] 180486 203611 481501 150292 171681 ...
##  $ tot.withinss: num 1527888
##  $ betweenss   : num 22554373
##  $ size        : int [1:6] 726 554 984 392 356 988
##  $ iter        : int 3
##  $ ifault      : int 0
##  - attr(*, "class")= chr "kmeans"
# access two important features of the clusters: their sizes and centers
km$size
## [1] 726 554 984 392 356 988
km$centers
##   longitude  latitude
## 1  75.14407  27.43226
## 2 -94.47442  31.07927
## 3 120.38618  23.72534
## 4 -57.10048 -15.21344
## 5  18.91076  -1.12440
## 6  18.77544  45.37853
BigCities <- BigCities %>% mutate(cluster = city_clusts) 

# graph the clusters, using the cluster variable to pick the color, in the standard Cartesian coordinate system
BigCities %>% ggplot(aes(x = longitude, y = latitude)) +
  geom_point(aes(color = cluster), alpha = 0.5)

a) Total (10 pts) What does the above clustering algorithm seem to have identified?

Answer:

b) Total (30 pts)

Projections: The Earth happens to be an oblate spheroid—a three-dimensional flattened sphere. Yet we would like to create two-dimensional representations of the Earth that fit on pages or computer screens. The process of converting locations in a three-dimensional geographic coordinate system to a two-dimensional representation is called projection.

A coordinate reference system (CRS) is needed to keep track of geographic locations. Every spatially-aware object in R can have a projection string, encoded using the PROJ.4 map projection library. These can be retrieved (or set) using the proj4string() command.

There are many CRSs, but a few are most common. A set of EPSG (European Petroleum Survey Group) codes provides a shorthand for the full PROJ.4 strings (like the one shown in the output further below). The most commonly-used are:

EPSG:4326 Also known as WGS84, this is the standard for GPS systems and Google Earth.

EPSG:3857 A Mercator projection used in map tiles by Google Maps, Open Street Maps, etc.

EPSG:4269 NAD83, most commonly used by U.S. federal agencies.

The R code below uses EPSG:4326. Use the other two standards, EPSG:3857 and EPSG:4269.

Use the K-means algorithm for each of the three (3) projections above and compare the three projections to the standard Cartesian coordinates used in the example. Which one is best at identifying the continents?

Note: Graphing the clusters in each projection is worth 10 pts

library(rgdal)
## Please note that rgdal will be retired during 2023,
## plan transition to sf/stars/terra functions using GDAL and PROJ
## at your earliest convenience.
## See https://r-spatial.org/r/2022/04/12/evolution.html and https://github.com/r-spatial/evolution
## rgdal: version: 1.6-6, (SVN revision 1201)
## Geospatial Data Abstraction Library extensions to R successfully loaded
## Loaded GDAL runtime: GDAL 3.0.4, released 2020/01/28
## Path to GDAL shared files: /usr/share/gdal
## GDAL binary built with GEOS: TRUE 
## Loaded PROJ runtime: Rel. 6.3.1, February 10th, 2020, [PJ_VERSION: 631]
## Path to PROJ shared files: /usr/share/proj
## Linking to sp version:1.6-0
## To mute warnings of possible GDAL/OSR exportToProj4() degradation,
## use options("rgdal_show_exportToProj4_warnings"="none") before loading sp or rgdal.
# assign the BigCities data.frame to a working data.frame object d 
d <- BigCities #or BigCities[,c('longitude', 'latitude')]

# create spatial object from d
coordinates(d) <- 1:2

# Set WGS 84 (EPSG:4326) standard for projecting longitude latitude coordinates
proj4string(d) <- CRS("+init=epsg:4326")

# coordinate reference system using the EPSG:4326 standard
CRS.new <- CRS("+init=epsg:4326")

# the d object in the new CRS, you may print out few records to see how it looks in the new CRS
d.new <- spTransform(d, CRS.new)

# for information, review the projection string of the transformed object
proj4string(d.new) %>% strwrap()
## [1] "+proj=longlat +datum=WGS84 +no_defs"
# form 6 clusters iteratively
city_clusts <- as.data.frame(d.new) %>%
  kmeans(centers = 6) %>%
  fitted("classes") %>%
  as.character()

# add a variable for the newly formed clusters
d.new <- as.data.frame(d.new) %>% mutate(cluster = city_clusts) 

# graph the clusters, using the cluster variable to pick the color
d.new %>% ggplot(aes(x = longitude, y = latitude)) +
  geom_point(aes(color = cluster), alpha = 0.5)
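
For part b), a minimal sketch of how the same pipeline might be adapted to one of the other standards (here EPSG:3857; the object names d_merc and merc_clusts are hypothetical, and the sp/rgdal workflow from the example above is kept for consistency):

# a hedged sketch: re-project the WGS 84 longitude/latitude coordinates to
# Web Mercator (EPSG:3857), then cluster and plot in the projected system
# (after the transformation the longitude/latitude columns hold projected metres)
d_merc <- BigCities[, c("longitude", "latitude")]
coordinates(d_merc) <- 1:2
proj4string(d_merc) <- CRS("+init=epsg:4326")          # source CRS: WGS 84
d_merc <- spTransform(d_merc, CRS("+init=epsg:3857"))  # target CRS: Web Mercator

# form 6 clusters on the projected coordinates
merc_clusts <- as.data.frame(d_merc) %>%
  kmeans(centers = 6) %>%
  fitted("classes") %>%
  as.character()

# graph the clusters in the projected coordinate system
as.data.frame(d_merc) %>%
  mutate(cluster = merc_clusts) %>%
  ggplot(aes(x = longitude, y = latitude)) +
  geom_point(aes(color = cluster), alpha = 0.5)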

YOUR CODE HERE: