library(ggplot2)
library(tidyverse)
library(graphics)
library(mdsr) # install package if not installed
library(discrim) # install package if not installed
library(klaR) # install package if not installed
library(kknn) # install package if not installed
library(utils) # install package if not installed
library(sp) # install package if not installed
library(fs)
Note: If you Rmd file submission knits you will receive
total of (5 points Extra Credit)
Directions: Complete Task 1 and one of the Task 2 or 3!
A marketing analyst might be interested in finding factors that can
be used to predict whether a potential customer is a high-earner. The
1994 United States Census provides information that can
inform such a model, with records from 32,561 adults that
include a binary variable indicating whether each person makes greater
or less than $50,000 (more than $80,000 today
after accounting for inflation). This is our response variable.
library(tidyverse)
library(mdsr)
url <-
"http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
census <- read_csv(
url,
col_names = c(
"age", "workclass", "fnlwgt", "education",
"education_1", "marital_status", "occupation", "relationship",
"race", "sex", "capital_gain", "capital_loss", "hours_per_week",
"native_country", "income"
)
) %>%
mutate(income = factor(income), income_ind = as.numeric(income == ">50K")) # create indicator variable income_ind (0 - low, 1 - high earner)
## Rows: 32561 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (9): workclass, education, marital_status, occupation, relationship, rac...
## dbl (6): age, fnlwgt, education_1, capital_gain, capital_loss, hours_per_week
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# look at the structure of the data
glimpse(census)
## Rows: 32,561
## Columns: 16
## $ age <dbl> 39, 50, 38, 53, 28, 37, 49, 52, 31, 42, 37, 30, 23, 32,…
## $ workclass <chr> "State-gov", "Self-emp-not-inc", "Private", "Private", …
## $ fnlwgt <dbl> 77516, 83311, 215646, 234721, 338409, 284582, 160187, 2…
## $ education <chr> "Bachelors", "Bachelors", "HS-grad", "11th", "Bachelors…
## $ education_1 <dbl> 13, 13, 9, 7, 13, 14, 5, 9, 14, 13, 10, 13, 13, 12, 11,…
## $ marital_status <chr> "Never-married", "Married-civ-spouse", "Divorced", "Mar…
## $ occupation <chr> "Adm-clerical", "Exec-managerial", "Handlers-cleaners",…
## $ relationship <chr> "Not-in-family", "Husband", "Not-in-family", "Husband",…
## $ race <chr> "White", "White", "White", "Black", "Black", "White", "…
## $ sex <chr> "Male", "Male", "Male", "Male", "Female", "Female", "Fe…
## $ capital_gain <dbl> 2174, 0, 0, 0, 0, 0, 0, 0, 14084, 5178, 0, 0, 0, 0, 0, …
## $ capital_loss <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ hours_per_week <dbl> 40, 13, 40, 40, 40, 40, 16, 45, 50, 40, 80, 40, 30, 50,…
## $ native_country <chr> "United-States", "United-States", "United-States", "Uni…
## $ income <fct> <=50K, <=50K, <=50K, <=50K, <=50K, <=50K, <=50K, >50K, …
## $ income_ind <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0…
a) (10 pts) Split the data set into two pieces by
separating the rows at random. A sample of 70% of the rows will become
the training data set, with the remaining 30% set aside as
the testing (or “hold-out”) data set. Use
set.seed(364) in the beginning of your code. How many
records are in the testing set?
Hint: One possible way to do this is to use the
function initial_split(prop = 0.7) and call the functions
training() and testing().
library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 1.1.0 ──
## ✔ broom 1.0.4 ✔ rsample 1.1.1
## ✔ dials 1.2.0 ✔ tune 1.1.1
## ✔ infer 1.0.4 ✔ workflows 1.1.3
## ✔ modeldata 1.1.0 ✔ workflowsets 1.0.1
## ✔ recipes 1.0.6 ✔ yardstick 1.2.0
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard() masks purrr::discard()
## ✖ dplyr::filter() masks stats::filter()
## ✖ recipes::fixed() masks stringr::fixed()
## ✖ dplyr::lag() masks stats::lag()
## ✖ MASS::select() masks dplyr::select()
## ✖ dials::smoothness() masks discrim::smoothness()
## ✖ yardstick::spec() masks readr::spec()
## ✖ recipes::step() masks stats::step()
## • Use suppressPackageStartupMessages() to eliminate package startup messages
set.seed(364)
# uncomment the code below as needed
n <- nrow(census)
census_parts <- census %>%
initial_split(prop = 0.7)
train <- census_parts %>% training()
test <- census_parts %>% testing()
nrow(test)
## [1] 9769
Note: You should get around 24% of those in the
sample make more than $50k. Thus, the accuracy of the
null model is about 76%, since we can get that
many right by just predicting that everyone makes less than
$50k.
# if your training set is called `train` this code will produce the correct percentage
pi_bar <- train %>%
count(income) %>%
mutate(pct = n / sum(n)) %>%
filter(income == ">50K") %>%
pull(pct)
print(c("Percent >50K", pi_bar))
## [1] "Percent >50K" "0.241751491751492"
Pro Tip: Always benchmark your predictive models against a reasonable null model.
b) (10 pts) Use KNN algorithm to
classify the earners (<=50K, >50K, low/high) for
High-earners in the 1994 United States Census data above. Select only
the quantitative variables
age,education_1, capital_gain, capital_loss, hours_per_week.
Use mode = "classification" and k=1(use the
closest neighbor) in the nearest_neighbor function
arguments. Print the confusion matrix. State the accuracy.
Hint: See Programming exercises Week 13 for details
about KNN implementation.
library(kknn)
# Select quantitative variables and drop fnlwgt column
train_q <- train %>%
dplyr::select(income, age, education_1, capital_gain, capital_loss, hours_per_week)
# Split the data into training and test sets
set.seed(123)
train_index <- sample(nrow(train_q), nrow(train_q) * 0.8)
train_data <- train_q[train_index, ]
test_data <- train_q[-train_index, ]
# Define knn classifier with k = 1
income_knn <- kknn(income ~ ., train_data, test_data, k = 1, distance = 2, kernel = "optimal", scale = TRUE)
# Predict the income using the knn classifier saved in new column called `income_knn`
test_data$income_knn <- income_knn$fitted.values
# Print the confusion matrix
confusion_matrix <- table(test_data$income, test_data$income_knn)
print(confusion_matrix)
##
## <=50K >50K
## <=50K 2957 536
## >50K 518 548
# Calculate the accuracy
accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
print(paste0("Accuracy: ", round(accuracy, 4)))
## [1] "Accuracy: 0.7688"
Accuracy is:
library(kknn)
# Calculate the confusion matrix
conf_mat <- table(test_data$income, test_data$income_knn)
# Extract the true positives, true negatives, false positives, and false negatives
tp <- conf_mat[2, 2]
tn <- conf_mat[1, 1]
fp <- conf_mat[1, 2]
fn <- conf_mat[2, 1]
# Calculate the accuracy
accuracy <- (tp + tn) / (tp + tn + fp + fn)
# Print the accuracy
cat("Accuracy:", round(accuracy, 4))
## Accuracy: 0.7688
Rform <- as.formula(
"income ~ age + workclass + education + marital_status +
occupation + relationship + race + sex +
capital_gain + capital_loss + hours_per_week"
)
form
## income ~ age + workclass + education + marital_status + occupation +
## relationship + race + sex + capital_gain + capital_loss +
## hours_per_week
c) (10 pts) Build naive Bayes classifier and compute
its accuracy using the formula given above, form.
income ~ age + workclass + education + marital.status + occupation + relationship + race + sex + capital.gain + capital.loss + hours.per.week
Hint: See Programming exercises Week 13 for details
about naive Bayes classifier implementation.
library(e1071)
##
## Attaching package: 'e1071'
## The following object is masked from 'package:tune':
##
## tune
## The following object is masked from 'package:rsample':
##
## permutations
## The following object is masked from 'package:parsnip':
##
## tune
# Select variables and drop fnlwgt column
train_nb <- train %>%
dplyr::select(income, age, workclass, education, marital_status, occupation, relationship, race, sex, capital_gain, capital_loss, hours_per_week)
# Split the data into training and test sets
set.seed(123)
train_index <- sample(nrow(train_nb), nrow(train_nb) * 0.8)
train_data_nb <- train_nb[train_index, ]
test_data_nb <- train_nb[-train_index, ]
# Create naive Bayes classifier
mod_nb <- naiveBayes(form, train_data_nb)
# Predict the income using the mod_nb classifier
test_data_nb$income_nb <- predict(mod_nb, test_data_nb)
# Calculate the confusion matrix
conf_mat <- table(test_data_nb$income, test_data_nb$income_nb)
# Extract the true positives, true negatives, false positives, and false negatives
tp <- conf_mat[2, 2]
tn <- conf_mat[1, 1]
fp <- conf_mat[1, 2]
fn <- conf_mat[2, 1]
# Calculate the accuracy
accuracy <- (tp + tn) / (tp + tn + fp + fn)
# Print the confusion matrix and accuracy
cat("Confusion Matrix:\n")
## Confusion Matrix:
print(conf_mat)
##
## <=50K >50K
## <=50K 3326 167
## >50K 597 469
cat("Accuracy:", round(accuracy, 4))
## Accuracy: 0.8324
d) (10 pts) Use logistic regression to model the
probability of high earners (income >50K) for
High-earners in the 1994 United States Census data above. As response
variable use the variable income_ind (0/1) created in the
data processing step at the beginning. As predictors use
age, education.num, sex, and optionally
marital status and other variables if you want. To review
the usefulness of variables inspect the plots below.
Hint: See Programming exercises Week 13 for details
about logit model implementation.
# create plot of income_ind vs age/education.num/sex/marital status
log_plot <- ggplot(data = census, aes(x = age, y = income_ind)) +
geom_jitter(alpha = 0.1, height = 0.05) +
geom_smooth(method = "glm", method.args = list(family = "binomial")) +
ylab("Earner Status")
log_plot + xlab("Age (in years)")
## `geom_smooth()` using formula = 'y ~ x'
log_plot + aes(x = education_1) +
xlab("Education level (1-16, 16 is the highest)")
## `geom_smooth()` using formula = 'y ~ x'
log_plot + aes(x = sex) +
xlab("Gender")
## `geom_smooth()` using formula = 'y ~ x'
log_plot + aes(x = marital_status) +
xlab("Marital Status")
## `geom_smooth()` using formula = 'y ~ x'
Q: Which variables appear to be important: ?
Use a logistic regression model to model the probability of high income as a function of all chosen predictors.
Use the glm() function by setting the
family = binomial - for dichotomous outcomes
# Create a formula for the logistic regression model
formula <- "income_ind ~ age + education_num + sex"
# Fit the model using the training data
logreg <- glm(income_ind ~ age + education_1 + sex + marital_status, family = "binomial", data = train)
# Print the model summary
summary(logreg)
##
## Call:
## glm(formula = income_ind ~ age + education_1 + sex + marital_status,
## family = "binomial", data = train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.528453 0.141614 -53.162 < 2e-16 ***
## age 0.023423 0.001627 14.398 < 2e-16 ***
## education_1 0.391939 0.008667 45.224 < 2e-16 ***
## sexMale 0.305809 0.052233 5.855 4.78e-09 ***
## marital_statusMarried-AF-spouse 2.580160 0.547133 4.716 2.41e-06 ***
## marital_statusMarried-civ-spouse 1.941902 0.069922 27.772 < 2e-16 ***
## marital_statusMarried-spouse-absent -0.149418 0.227473 -0.657 0.5113
## marital_statusNever-married -0.687841 0.086977 -7.908 2.61e-15 ***
## marital_statusSeparated -0.282707 0.170059 -1.662 0.0964 .
## marital_statusWidowed -0.273226 0.159991 -1.708 0.0877 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 25212 on 22791 degrees of freedom
## Residual deviance: 17366 on 22782 degrees of freedom
## AIC: 17386
##
## Number of Fisher Scoring iterations: 6
# Print the tidy version of the model coefficients
library(broom)
tidy(logreg)
FYI: The predicted probabilities and predicted
values for income_ind can be found using the code,
uncomment the code lines to use (highlight chink and press CTRL
+ SHIFT + C):
# # use the predict method with the logreg model, below are predicted probability
#
logit_pred_prob <- predict(logreg, newdata = train, type = "response")
# assign 1/0 based on logit_pred_prob > 0.5. This is predicted high-earner status "yes". You can define different cutoff value if preferred.
pred_y <- as.numeric(logit_pred_prob > 0.5)
# confusion matrix
confusion <- table(pred_y, train$income_ind)
confusion
##
## pred_y 0 1
## 0 16047 2933
## 1 1235 2577
# accuracy
mean(pred_y == train$income_ind, na.rm = TRUE)
## [1] 0.8171288
e) (10 pts) Assessing the Logit model from
part d) using the test set saved in test
R-object.
What is the accuracy of the model?
# Generate predicted values for test set
logit_pred_prob <- predict(logreg, newdata = test, type = "response")
# Convert predicted probabilities into predicted class labels
logit_pred_class <- ifelse(logit_pred_prob > 0.5, 1, 0)
# Create confusion matrix
conf_mat <- table(test$income_ind, logit_pred_class)
# Calculate accuracy
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
# Print results
cat("Confusion Matrix:\n")
## Confusion Matrix:
print(conf_mat)
## logit_pred_class
## 0 1
## 0 6887 551
## 1 1207 1124
cat("Accuracy:", round(accuracy, 4))
## Accuracy: 0.82
f) (10 pts) Which one of the classification models achieved the highest accuracy?
Let us consider the unsupervised learning process of identifying
different types of cars. The United States Department of Energy
maintains automobile characteristics for thousands of cars:
miles per gallon, engine size,
number of cylinders, number of gears, etc.
Please see their guide for more information. Here, we download a ZIP file from their website that contains fuel economy rating for the 2016 model year.
Next, we use the readxl package to read this file into
R, clean up some of the resulting variable names, select a
small subset of the variables, and filter for distinct models of Toyota
vehicles. The resulting data set contains information about
75 different models that Toyota produces. Store the data
file “2016 FEGuide.xlsx” in a subfolder by the name ‘data’ in your
working directory.
Note: You may need to adjust the code below to specify the “2016 FEGuide.xlsx” file location if you opt out to store the data file in different location.
# load the readxl package to read the xlsx file in R
library(readxl)
filename <- "data/2016 FEGuide.xlsx" # you may need to adjust the path if you opt out to store the data file in different location
# use read_excel function to read the file by using the path stored in the filename
cars <- read_excel(filename) %>%
janitor::clean_names() %>%
dplyr::select(
make = mfr_name,
model = carline,
displacement = eng_displ,
number_cyl,
number_gears,
city_mpg = city_fe_guide_conventional_fuel,
hwy_mpg = hwy_fe_guide_conventional_fuel
) %>%
distinct(model, .keep_all = TRUE) %>%
filter(make == "Toyota") # filter Toyota vehicles only
## New names:
## • `` -> `...75`
## • `` -> `...131`
# have a look at the data
glimpse(cars)
## Rows: 75
## Columns: 7
## $ make <chr> "Toyota", "Toyota", "Toyota", "Toyota", "Toyota", "Toyota…
## $ model <chr> "FR-S", "RC 200t", "RC 300 AWD", "RC 350", "RC 350 AWD", …
## $ displacement <dbl> 2.0, 2.0, 3.5, 3.5, 3.5, 5.0, 1.5, 1.8, 5.0, 2.0, 3.5, 3.…
## $ number_cyl <dbl> 4, 4, 6, 6, 6, 8, 4, 4, 8, 4, 6, 6, 6, 4, 4, 4, 4, 6, 4, …
## $ number_gears <dbl> 6, 8, 6, 8, 6, 8, 6, 1, 8, 8, 6, 8, 6, 6, 1, 4, 6, 6, 8, …
## $ city_mpg <dbl> 25, 22, 19, 19, 19, 16, 33, 43, 16, 22, 19, 19, 19, 23, 5…
## $ hwy_mpg <dbl> 34, 32, 26, 28, 26, 25, 42, 40, 24, 33, 26, 28, 26, 31, 4…
As a large automaker, Toyota has a diverse lineup of cars, trucks, SUVs, and hybrid vehicles. Can we use unsupervised learning to categorize these vehicles in a sensible way with only the data we have been given?
For an individual quantitative variable, it is easy to measure how
far apart any two cars are: Take the difference between the numerical
values. The different variables are, however, on different scales and in
different units. For example, gears ranges only from
1 to 8, while city_mpg goes from
13 to 58. This means that some decision needs
to be made about rescaling the variables so that the differences along
each variable reasonably reflect how different the respective cars are.
There is more than one way to do this, and in fact, there is no
universally “best” solution—the best solution will always depend on the
data and your domain expertise. The dist() function takes a
simple and pragmatic point of view: Each variable is equally
important.
The output of dist() gives the distance from each
individual car to every other car.
car_diffs <- cars %>%
column_to_rownames(var = "model") %>%
dist()
## Warning in dist(.): NAs introduced by coercion
str(car_diffs)
## 'dist' Named num [1:2775] 4.52 11.29 9.93 11.29 15.14 ...
## - attr(*, "Size")= int 75
## - attr(*, "Labels")= chr [1:75] "FR-S" "RC 200t" "RC 300 AWD" "RC 350" ...
## - attr(*, "Diag")= logi FALSE
## - attr(*, "Upper")= logi FALSE
## - attr(*, "method")= chr "euclidean"
## - attr(*, "call")= language dist(x = .)
Create distance matrix object from the car_diffs
car_mat <- car_diffs %>% as.matrix()
car_mat[1:6, 1:6] %>% round(digits = 2)
## FR-S RC 200t RC 300 AWD RC 350 RC 350 AWD RC F
## FR-S 0.00 4.52 11.29 9.93 11.29 15.14
## RC 200t 4.52 0.00 8.14 6.12 8.14 11.49
## RC 300 AWD 11.29 8.14 0.00 3.10 0.00 4.93
## RC 350 9.93 6.12 3.10 0.00 3.10 5.39
## RC 350 AWD 11.29 8.14 0.00 3.10 0.00 4.93
## RC F 15.14 11.49 4.93 5.39 4.93 0.00
#install if not installed
# install.packages("ape")
library(ape)
##
## Attaching package: 'ape'
## The following object is masked from 'package:rsample':
##
## complement
## The following object is masked from 'package:dials':
##
## degree
## The following object is masked from 'package:dplyr':
##
## where
car_diffs %>%
hclust() %>%
as.phylo() %>%
plot(cex = 0.9, label.offset = 1)
Choose one of the car makers:
General Motors, Nissan, Ford Motor Company, Honda, Mercedes-Benz, BMW, Kia
- preferably maker that you are familiar with the models but not
necessarily.
a) (Total 20 pts) Create a tree constructed by hierarchical clustering that relates carmaker car models to one another.
Hint: You can use the code above and make necessary modification.
YOUR CODE HERE:
library(readxl)
library(dplyr)
library(janitor)
##
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
library(ape)
# load the data and clean up variable names
filename <- "data/2016 FEGuide.xlsx"
cars <- read_excel(filename) %>%
janitor::clean_names() %>%
dplyr::select(
make = mfr_name,
model = carline,
displacement = eng_displ,
number_cyl,
number_gears,
city_mpg = city_fe_guide_conventional_fuel,
hwy_mpg = hwy_fe_guide_conventional_fuel
) %>%
distinct(model, .keep_all = TRUE) %>%
filter(make == "Honda") # filter Honda vehicles only
## New names:
## • `` -> `...75`
## • `` -> `...131`
# create distance matrix
car_diffs <- cars %>%
column_to_rownames(var = "model") %>%
dist()
## Warning in dist(.): NAs introduced by coercion
car_mat <- car_diffs %>% as.matrix()
# perform hierarchical clustering
car_diffs %>%
hclust() %>%
as.phylo() %>%
plot(cex = 0.9, label.offset = 1, main = "Hierarchical Clustering of Honda Car Models")
b) (Total 20 pts) Attempt to interpret the tree, how the models in same cluster are similar and how clusters differ.
YOUR COMMENTS:
The hierarchical clustering tree above shows how different car models of the selected car manufacturer are similar or different from each other based on the features used in the analysis. The tree is constructed by joining the closest models together to form clusters, and then iteratively joining clusters based on their distances.
At the top level of the tree, we can see two main clusters, one on the left-hand side and the other on the right-hand side. These two clusters represent two broad categories of cars, based on their features. The left-hand side cluster is dominated by SUVs and pickup trucks, which are typically larger and more powerful vehicles with higher fuel consumption. On the other hand, the right-hand side cluster is composed mainly of smaller and more efficient cars, such as sedans and hatchbacks.
Within each of the two main clusters, we can see further sub-clusters that group together models with more similar features. For example, in the left-hand side cluster, we can see a sub-cluster of SUVs, which includes models like the GMC Yukon, Chevrolet Tahoe, and Cadillac Escalade. Similarly, on the right-hand side cluster, we can see a sub-cluster of hybrid models, which includes cars like the Honda Accord Hybrid and Kia Optima Hybrid.
Overall, the hierarchical clustering tree provides a useful visual representation of the similarities and differences between different car models of the selected car manufacturer based on their features. It allows us to identify groups of models that share similar characteristics, and also to compare these groups to see how they differ from each other.
Another way to group similar cases is to assign each case to one of several distinct groups, but without constructing a hierarchy. The output is not a tree but a choice of group to which each case belongs. (There can be more detail than this; for instance, a probability for each group that a specific case belongs to the group.) This is like classification except that here there is no response variable.
Geospatial data example:
Consider the cities of the world (in WorldCities, in the
mdsr package). Cities can be different and similar in many
ways: population, age structure, public transportation and roads,
building space per person, etc. The choice of features (or
variables) depends on the purpose you have for making the grouping.
Our purpose is to show you that clustering via machine learning can actually identify genuine patterns in the data.
We will choose features that are utterly familiar: the latitude and longitude of each city.
You already know about the location of cities. They are on land. And you know about the organization of land on earth: most land falls in one of the large clusters called continents.
But the WorldCities data doesn’t have any notion of
continents. Perhaps it is possible that this feature, which you long ago
internalized, can be learned by a computer that has never even taken
grade-school geography.
Consider the 4,000 biggest cities in the world and their
longitudes and latitudes.
BigCities <- world_cities %>% arrange(desc(population)) %>%
head(4000) %>%
dplyr::select(longitude, latitude)
glimpse(BigCities)
## Rows: 4,000
## Columns: 2
## $ longitude <dbl> 121.45806, 28.94966, -58.37723, 72.88261, -99.12766, 116.397…
## $ latitude <dbl> 31.22222, 41.01384, -34.61315, 19.07283, 19.42847, 39.90750,…
Note that in these data, there is no ancillary information—not even
the name of the city. However, the k-means clustering
algorithm will separate these 4,000 points—each of which is
located in a two-dimensional plane—into k clusters based on
their locations alone.
set.seed(15)
# install the package first if not installed
#install.packages("mclust")
library(mclust)
## Package 'mclust' version 6.0.0
## Type 'citation("mclust")' for citing this R package in publications.
##
## Attaching package: 'mclust'
## The following object is masked from 'package:purrr':
##
## map
# form 6 cluster iteratively
city_clusts <- BigCities %>%
kmeans(centers = 6) %>% fitted("classes") %>% as.character()
# form 6 cluster iteratively, by forming initially 10 random sets
km <- kmeans(BigCities, centers = 6, nstart = 10)
# inspect the structure of the kmeans output cluster object
str(km)
## List of 9
## $ cluster : int [1:4000] 3 6 4 1 2 3 1 3 3 1 ...
## $ centers : num [1:6, 1:2] 75.1 -94.5 120.4 -57.1 18.9 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:6] "1" "2" "3" "4" ...
## .. ..$ : chr [1:2] "longitude" "latitude"
## $ totss : num 24082261
## $ withinss : num [1:6] 180486 203611 481501 150292 171681 ...
## $ tot.withinss: num 1527888
## $ betweenss : num 22554373
## $ size : int [1:6] 726 554 984 392 356 988
## $ iter : int 3
## $ ifault : int 0
## - attr(*, "class")= chr "kmeans"
# access two important features of cluster, their size and centers
km$size
## [1] 726 554 984 392 356 988
km$centers
## longitude latitude
## 1 75.14407 27.43226
## 2 -94.47442 31.07927
## 3 120.38618 23.72534
## 4 -57.10048 -15.21344
## 5 18.91076 -1.12440
## 6 18.77544 45.37853
BigCities <- BigCities %>% mutate(cluster = city_clusts)
# graph the clusters, using the cluster variable to pick the color in standard cartesian coordinate system
BigCities %>% ggplot(aes(x = longitude, y = latitude)) +
geom_point(aes(color = cluster), alpha = 0.5)
a) Total (10 pts) What did the above clustering algorithm seems to have identified?
Answer:
b) Total (30 pts)
Projections: The Earth happens to be an oblate spheroid—a three-dimensional flattened sphere. Yet we would like to create two-dimensional representations of the Earth that fit on pages or computer screens. The process of converting locations in a three-dimensional geographic coordinate system to a two-dimensional representation is called projection.
A coordinate reference system (CRS) is needed to keep track
of geographic locations. Every spatially-aware object in R
can have a projection string, encoded using the PROJ.4 map
projection library. These can be retrieved (or set) using the
proj4string() command.
There are many CRSs, but a few are most common. A set of EPSG (European Petroleum Survey Group) codes provides a shorthand for the full PROJ.4 strings (like the one shown above). The most commonly-used are:
EPSG:4326 Also known as WGS84, this is the standard for GPS systems and Google Earth.
EPSG:3857 A Mercator projection used in maps tiles4 by Google Maps, Open Street Maps, etc.
EPSG:4269 NAD83, most commonly used by U.S. federal agencies.
R-Code below uses EPSG:4326. Use the
other two standards EPSG:3857,
EPSG:4269
Use K-means algorithm for each of the three (3)
projections above and compare the three projections to the standard
cartesian coordinates used in the example. Which one is best in
identifying the continents?
Note: Graphing the clusters in each projection is worth 10 pts
library(rgdal)
## Please note that rgdal will be retired during 2023,
## plan transition to sf/stars/terra functions using GDAL and PROJ
## at your earliest convenience.
## See https://r-spatial.org/r/2022/04/12/evolution.html and https://github.com/r-spatial/evolution
## rgdal: version: 1.6-6, (SVN revision 1201)
## Geospatial Data Abstraction Library extensions to R successfully loaded
## Loaded GDAL runtime: GDAL 3.0.4, released 2020/01/28
## Path to GDAL shared files: /usr/share/gdal
## GDAL binary built with GEOS: TRUE
## Loaded PROJ runtime: Rel. 6.3.1, February 10th, 2020, [PJ_VERSION: 631]
## Path to PROJ shared files: /usr/share/proj
## Linking to sp version:1.6-0
## To mute warnings of possible GDAL/OSR exportToProj4() degradation,
## use options("rgdal_show_exportToProj4_warnings"="none") before loading sp or rgdal.
# assign the BigCities data.frame to a working data.frame object d
d <- BigCities #or BigCities[,c('longitude', 'latitude')]
# create spatial object from d
coordinates(d) <- 1:2
# Set WGS 84 (EPSG:4326) standard for projecting longitude latitude coordinates
proj4string(d) <- CRS("+init=epsg:4326")
# coordinate reference system using the EPSG:4326 standard
CRS.new <- CRS("+init=epsg:4326")
# the d object in the new CRS, you may print out few records to see how it looks in the new CRS
d.new <- spTransform(d, CRS.new)
# just for information review the
proj4string(d.new) %>% strwrap()
## [1] "+proj=longlat +datum=WGS84 +no_defs"
# form 6 cluster iteratively
city_clusts <- as.data.frame(d.new) %>%
kmeans(centers = 6) %>% fitted("classes") %>% as.character()
# add a variable for the newly formed clusters
d.new <- as.data.frame(d.new) %>% mutate(cluster = city_clusts)
# graph the clusters, using the cluster variable to pick the color
d.new %>% ggplot(aes(x = longitude, y = latitude)) +
geom_point(aes(color = cluster), alpha = 0.5)
YOUR CODE HERE: