# install.packages("tidymodels")  # run once in the console if needed; leave commented so the Rmd file knits
Note: If your Rmd file submission knits, you will receive a total of 5 points of extra credit.
Directions: Complete Task 1 and one of Tasks 2 or 3!
A marketing analyst might be interested in finding factors that can be used to predict whether a potential customer is a high-earner. The 1994 United States Census provides information that can inform such a model, with records from 32,561 adults that include a binary variable indicating whether each person makes greater or less than $50,000 (more than $80,000 today after accounting for inflation). This is our response variable.
library(tidyverse)
library(mdsr)
url <-
  "http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
census <- read_csv(
  url,
  col_names = c(
    "age", "workclass", "fnlwgt", "education",
    "education_1", "marital_status", "occupation", "relationship",
    "race", "sex", "capital_gain", "capital_loss", "hours_per_week",
    "native_country", "income"
  )
) %>%
  # create indicator variable income_ind (0 = low earner, 1 = high earner)
  mutate(income = factor(income), income_ind = as.numeric(income == ">50K"))
-- Column specification ----------------------------------------------------------------------------------------------------------------------------------
cols(
age = col_double(),
workclass = col_character(),
fnlwgt = col_double(),
education = col_character(),
education_1 = col_double(),
marital_status = col_character(),
occupation = col_character(),
relationship = col_character(),
race = col_character(),
sex = col_character(),
capital_gain = col_double(),
capital_loss = col_double(),
hours_per_week = col_double(),
native_country = col_character(),
income = col_character()
)
# look at the structure of the data
glimpse(census)
Rows: 32,561
Columns: 16
$ age <dbl> 39, 50, 38, 53, 28, 37, 49, 52, 31, 42, 37, 30, 23, 32, 40, 34, 25, 32, 38, 43, 40, 54, 35, 43, 59, 56, 19, 54, 39, 49, 23, 20, 4~
$ workclass <chr> "State-gov", "Self-emp-not-inc", "Private", "Private", "Private", "Private", "Private", "Self-emp-not-inc", "Private", "Private",~
$ fnlwgt <dbl> 77516, 83311, 215646, 234721, 338409, 284582, 160187, 209642, 45781, 159449, 280464, 141297, 122272, 205019, 121772, 245487, 1767~
$ education <chr> "Bachelors", "Bachelors", "HS-grad", "11th", "Bachelors", "Masters", "9th", "HS-grad", "Masters", "Bachelors", "Some-college", "B~
$ education_1 <dbl> 13, 13, 9, 7, 13, 14, 5, 9, 14, 13, 10, 13, 13, 12, 11, 4, 9, 9, 7, 14, 16, 9, 5, 7, 9, 13, 9, 10, 9, 9, 12, 10, 13, 10, 10, 7, 1~
$ marital_status <chr> "Never-married", "Married-civ-spouse", "Divorced", "Married-civ-spouse", "Married-civ-spouse", "Married-civ-spouse", "Married-spo~
$ occupation <chr> "Adm-clerical", "Exec-managerial", "Handlers-cleaners", "Handlers-cleaners", "Prof-specialty", "Exec-managerial", "Other-service"~
$ relationship <chr> "Not-in-family", "Husband", "Not-in-family", "Husband", "Wife", "Wife", "Not-in-family", "Husband", "Not-in-family", "Husband", "~
$ race <chr> "White", "White", "White", "Black", "Black", "White", "Black", "White", "White", "White", "Black", "Asian-Pac-Islander", "White",~
$ sex <chr> "Male", "Male", "Male", "Male", "Female", "Female", "Female", "Male", "Female", "Male", "Male", "Male", "Female", "Male", "Male",~
$ capital_gain <dbl> 2174, 0, 0, 0, 0, 0, 0, 0, 14084, 5178, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
$ capital_loss <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2042, 0, 0, 0, 0, 0, 0, 0, 0, 1408, 0, 0, 0, 0, 0, 0, 0, 0, ~
$ hours_per_week <dbl> 40, 13, 40, 40, 40, 40, 16, 45, 50, 40, 80, 40, 30, 50, 40, 45, 35, 40, 50, 45, 60, 20, 40, 40, 40, 40, 40, 60, 80, 40, 52, 44, 4~
$ native_country <chr> "United-States", "United-States", "United-States", "United-States", "Cuba", "United-States", "Jamaica", "United-States", "United-~
$ income <fct> <=50K, <=50K, <=50K, <=50K, <=50K, <=50K, <=50K, >50K, >50K, >50K, >50K, >50K, <=50K, <=50K, >50K, <=50K, <=50K, <=50K, <=50K, >5~
$ income_ind <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, ~
a) (10 pts) Split the data set into two pieces by separating the rows at random. A sample of 70% of the rows will become the training data set, with the remaining 30% set aside as the testing (or “hold-out”) data set. Use set.seed(364) at the beginning of your code. How many records are in the testing set?
Hint: One possible way to do this is to use the function initial_split(prop = 0.7) and call the functions training() and testing().
library(tidymodels)
set.seed(364)
# uncomment the code below as needed
n <- nrow(census)
census_parts <- census %>%
initial_split(prop = 0.7)
train <- census_parts %>% training()
test <- census_parts %>% testing()
nrow(test)
[1] 9768
Note: You should find that around 24% of those in the sample make more than $50k. Thus, the accuracy of the null model is about 76%, since we can get that many right simply by predicting that everyone makes less than $50k.
pi_bar <- train %>%
count(income) %>%
mutate(pct = n / sum(n)) %>%
filter(income == ">50K") %>%
pull(pct)
print(c("Percent >50K", pi_bar))
[1] "Percent >50K" "0.238143289606458"
Pro Tip: Always benchmark your predictive models against a reasonable null model.
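A minimal sketch of that benchmark, computed from the pi_bar value above (the exact number depends on your split):
# null model: predict "<=50K" for everyone; its accuracy is the share of low earners
null_accuracy <- 1 - pi_bar
null_accuracy  # about 0.76 on this training split, per the note above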
b) (10 pts) Use the KNN algorithm to classify earners (<=50K, >50K; low/high) in the 1994 United States Census data above. Select only the quantitative variables age, education_1, capital_gain, capital_loss, hours_per_week. Use mode = "classification" and k = 1 (use the closest neighbor) in the nearest_neighbor() function arguments. Print the confusion matrix. State the accuracy.
Hint: See Programming exercises Week 13 for details about KNN implementation.
library(kknn)
# the distance metric only works with quantitative variables; save them in the train_q set
# (drop fnlwgt, and drop income_ind because it is just a recoding of the response and would leak it)
train_q <- train %>%
  dplyr::select(income, where(is.numeric), -fnlwgt, -income_ind)
# define knn classifier
mK <- nearest_neighbor(neighbors = 5, mode = "classification") %>%
  set_engine("kknn", scale = TRUE) %>%
  fit(income ~ ., data = train_q)
# predict the income using the knn classifier saved in new column called `income_knn`
p <- train_q %>%
  bind_cols(predict(mK, new_data = train_q, type = "class")) %>%
  rename(income_knn = .pred_class)
# print the confusion matrix
p %>% conf_mat(income, income_knn)
Truth
Prediction <=50K >50K
<=50K 17364 0
>50K 1 5428
Accuracy is:
# Find the Accuracy = (true positive and true negative)/total or use the `accuracy()` function.
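A minimal sketch of that computation, using the p data frame built above: accuracy() is the yardstick helper loaded with tidymodels, and the second line is the by-hand version from the comment.
# accuracy via the yardstick helper
accuracy(p, income, income_knn)
# or by hand: (true positives + true negatives) / total
mean(p$income == p$income_knn)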
form <- as.formula(
  "income ~ age + workclass + education + marital_status +
    occupation + relationship + race + sex +
    capital_gain + capital_loss + hours_per_week"
)
form
income ~ age + workclass + education + marital_status + occupation +
relationship + race + sex + capital_gain + capital_loss +
hours_per_week
c) (10 pts) Build a naive Bayes classifier and compute its accuracy using the formula given above, form.
income ~ age + workclass + education + marital_status + occupation + relationship + race + sex + capital_gain + capital_loss + hours_per_week
Hint: See Programming exercises Week 13 for details about naive Bayes classifier implementation.
library(discrim)
# create naiveBayes classifier
mNB <- naive_Bayes(mode = "classification") %>%
  set_engine("klaR") %>%
  fit(form, data = train)
# use the predict method with the mNB model; save the predictions in a new column income_nb
p <- train %>%
  bind_cols(predict(mNB, new_data = train, type = "class")) %>%
  rename(income_nb = .pred_class)
# confusion matrix
p %>% conf_mat(income, income_nb)
Truth
Prediction <=50K >50K
<=50K 17095 3730
>50K 270 1698
# accuracy
accuracy(p, income, income_nb)
d) (10 pts) Use logistic regression to model the probability of being a high earner (income >50K) in the 1994 United States Census data above. As the response variable use income_ind (0/1), created in the data-processing step at the beginning. As predictors use age, education_1, sex, and optionally marital status and other variables if you want. To review the usefulness of the variables, inspect the plots below.
Hint: See Programming exercises Week 13 for details about logit model implementation.
log_plot <- ggplot(data = census, aes(x = age, y = income_ind)) +
geom_jitter(alpha = 0.1, height = 0.05) +
geom_smooth(method = "glm", method.args = list(family = "binomial")) +
ylab("Earner Status")
log_plot + xlab("Age (in years)")
log_plot + aes(x = education_1) +
xlab("Education level (1-16, 16 is the highest)")
log_plot + aes(x = sex) +
xlab("Gender")
log_plot + aes(x = marital_status) +
xlab("Marital Status")
Q: Which variables appear to be important?
Use a logistic regression model to model the probability of high income as a function of all chosen predictors.
Use the glm() function, setting family = binomial for dichotomous outcomes.
logreg <- glm(income_ind ~ age + sex + education_1, family = "binomial", data = train)
tidy(logreg)
FYI: The predicted probabilities and predicted values for income_ind can be found using the code below; uncomment the code lines to use them (highlight the chunk and press CTRL + SHIFT + C):
# predicted probabilities of being a high earner on the training set
logit_pred_prob <- predict(logreg, newdata = train, type = "response")

# assign 1/0 based on logit_pred_prob > 0.5; this is the predicted high-earner status.
# You can define a different cutoff value if preferred.
pred_y <- as.numeric(logit_pred_prob > 0.5)

# confusion matrix
confusion <- table(pred_y, train$income_ind)
confusion

# accuracy
mean(pred_y == train$income_ind, na.rm = TRUE)
[1] 0.797701
e) (10 pts) Assess the logit model from part d) using the test set saved in the test R object.
What is the accuracy of the model?
# predicted probabilities on the hold-out (test) set
logit_pred_prob <- predict(logreg, newdata = test, type = "response")

# assign 1/0 based on logit_pred_prob > 0.5; this is the predicted high-earner status
pred_y <- as.numeric(logit_pred_prob > 0.5)

# confusion matrix
confusion <- table(pred_y, test$income_ind)
confusion

# accuracy
mean(pred_y == test$income_ind, na.rm = TRUE)
f) (10 pts) Which one of the classification models achieved the highest accuracy?
Let us consider the unsupervised learning process of identifying different types of cars. The United States Department of Energy maintains automobile characteristics for thousands of cars: miles per gallon, engine size, number of cylinders, number of gears, etc.
Please see their guide for more information. Here, we download a ZIP file from their website that contains fuel economy ratings for the 2016 model year.
Note: (IMPORTANT!) Use the code below to download and unzip the data file. Make sure a folder named data exists in your working directory (the first line below creates it if needed).
# create the data folder if it does not already exist
dir.create("data", showWarnings = FALSE)
download.file("https://www.fueleconomy.gov/feg/epadata/16data.zip",
              destfile = "data/fueleconomy.zip")
unzip("data/fueleconomy.zip", exdir = "data/fueleconomy")
Next, we use the readxl package to read this file into R, clean up some of the resulting variable names, select a small subset of the variables, and filter for distinct models of Toyota vehicles. The resulting data set contains information about 75 different models that Toyota produces.
library(mdsr)
# load the readxl package to read the xlsx file in R
library(readxl)
# list the files in the data/fueleconomy and get the first file (the only one) in it.
filename <- list.files("data/fueleconomy", pattern = "public\\.xlsx")[1]
# use read_excel function to read the file by creating the full path to it, using paste0 function.
cars <- read_excel(paste0("data/fueleconomy/", filename)) %>% data.frame()
# rename the long complex technical names and select only few variables for the make Toyota
# use distinct from dplyr to retain only unique/distinct rows, .keep_all = TRUE - keeps all variables in .data
cars <- cars %>%
  rename(
    make = Mfr.Name, model = Carline, displacement = Eng.Displ,
    cylinders = X..Cyl, gears = X..Gears,
    city_mpg = City.FE..Guide....Conventional.Fuel,
    hwy_mpg = Hwy.FE..Guide....Conventional.Fuel
  ) %>%
  dplyr::select(make, model, displacement, cylinders, gears, city_mpg, hwy_mpg) %>%
  distinct(model, .keep_all = TRUE) %>%
  filter(make == "Toyota")
# give the observation names instead of numbers, use the model of the vehicle
rownames(cars) <- cars$model
# have a look at the data
glimpse(cars)
As a large automaker, Toyota has a diverse lineup of cars, trucks, SUVs, and hybrid vehicles. Can we use unsupervised learning to categorize these vehicles in a sensible way with only the data we have been given?
For an individual quantitative variable, it is easy to measure how far apart any two cars are: Take the difference between the numerical values. The different variables are, however, on different scales and in different units. For example, gears ranges only from 1 to 8, while city_mpg goes from 13 to 58. This means that some decision needs to be made about rescaling the variables so that the differences along each variable reasonably reflect how different the respective cars are. There is more than one way to do this, and in fact, there is no universally “best” solution—the best solution will always depend on the data and your domain expertise. The dist() function takes a simple and pragmatic point of view: Each variable is equally important.
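If you do want each variable to contribute on a comparable scale, one common option (not used in the example below, and purely optional) is to standardize the quantitative columns before computing distances. A minimal sketch, assuming the cars data frame built above:
# standardize the quantitative columns (mean 0, sd 1); dist() itself does no rescaling
cars_scaled <- cars %>%
  dplyr::select(where(is.numeric)) %>%
  scale()
rownames(cars_scaled) <- cars$model
car_diffs_scaled <- dist(cars_scaled)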
The output of dist() gives the distance from each individual car to every other car.
car_diffs <- dist(cars)
str(car_diffs)
Create a distance matrix object from car_diffs:
car_mat <- car_diffs %>% as.matrix()
car_mat[1:6, 1:6] %>% round(digits = 2)
#install if not installed
# install.packages("ape")
library(ape)
car_diffs %>%
hclust() %>%
as.phylo() %>%
plot(cex = 0.9, label.offset = 1)
Choose one of the car makers: General Motors, Nissan, Ford Motor Company, Honda, Mercedes-Benz, BMW, Kia - preferably a maker whose models you are familiar with, but not necessarily.
a) (Total 20 pts) Create a tree, constructed by hierarchical clustering, that relates the chosen maker's car models to one another.
Hint: You can use the code above and make necessary modification.
YOUR CODE HERE:
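A minimal sketch of one possible approach, assuming Honda is the chosen maker (the filter value is the only assumption; everything else mirrors the Toyota example above, so the maker name must match the Mfr.Name column in the spreadsheet exactly):
# re-read the fuel-economy file and keep only the chosen maker's models
# (Honda is assumed here purely for illustration; substitute your own choice)
my_cars <- read_excel(paste0("data/fueleconomy/", filename)) %>%
  data.frame() %>%
  rename(
    make = Mfr.Name, model = Carline, displacement = Eng.Displ,
    cylinders = X..Cyl, gears = X..Gears,
    city_mpg = City.FE..Guide....Conventional.Fuel,
    hwy_mpg = Hwy.FE..Guide....Conventional.Fuel
  ) %>%
  dplyr::select(make, model, displacement, cylinders, gears, city_mpg, hwy_mpg) %>%
  distinct(model, .keep_all = TRUE) %>%
  filter(make == "Honda")
rownames(my_cars) <- my_cars$model

# hierarchical clustering tree, exactly as in the Toyota example above
my_cars %>%
  dist() %>%
  hclust() %>%
  as.phylo() %>%
  plot(cex = 0.9, label.offset = 1)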
b) (Total 20 pts) Attempt to interpret the tree: how are the models in the same cluster similar, and how do the clusters differ?
YOUR COMMENTS:
Another way to group similar cases is to assign each case to one of several distinct groups, but without constructing a hierarchy. The output is not a tree but a choice of group to which each case belongs. (There can be more detail than this; for instance, a probability for each group that a specific case belongs to the group.) This is like classification except that here there is no response variable.
Geospatial data example:
Consider the cities of the world (in world_cities, in the mdsr package). Cities can be different and similar in many ways: population, age structure, public transportation and roads, building space per person, etc. The choice of features (or variables) depends on the purpose you have for making the grouping.
Our purpose is to show you that clustering via machine learning can actually identify genuine patterns in the data.
We will choose features that are utterly familiar: the latitude and longitude of each city.
You already know about the location of cities. They are on land. And you know about the organization of land on earth: most land falls in one of the large clusters called continents.
But the world_cities data doesn’t have any notion of continents. Perhaps it is possible that this feature, which you long ago internalized, can be learned by a computer that has never even taken grade-school geography.
Consider the 4,000 biggest cities in the world and their longitudes and latitudes.
BigCities <- world_cities %>% arrange(desc(population)) %>%
head(4000) %>%
dplyr::select(longitude, latitude)
glimpse(BigCities)
Rows: 4,000
Columns: 2
$ longitude <dbl> 121.45806, 28.94966, -58.37723, 72.88261, -99.12766, 116.39723, 67.01040, 117.17667, 113.25000, 77.23149, 37.61556, 114.06830, 90.4074~
$ latitude <dbl> 31.22222, 41.01384, -34.61315, 19.07283, 19.42847, 39.90750, 24.86080, 39.14222, 23.11667, 28.65195, 55.75222, 22.54554, 23.71040, 37.~
Note that in these data, there is no ancillary information—not even the name of the city. However, the k-means clustering algorithm will separate these 4,000 points—each of which is located in a two-dimensional plane—into k clusters based on their locations alone.
set.seed(15)
# install the package first if not installed (run once in the console; leave commented so the Rmd knits)
# install.packages("mclust")
library(mclust)
package 'mclust' was built under R version 4.0.5
mclust version 5.4.7
Type 'citation("mclust")' for citing this R package in publications.

Attaching package: 'mclust'

The following object is masked from 'package:purrr':

    map
# form 6 clusters iteratively
city_clusts <- BigCities %>%
kmeans(centers = 6) %>% fitted("classes") %>% as.character()
# form 6 clusters iteratively, starting from 10 random initial sets (nstart = 10)
km <- kmeans(BigCities, centers = 6, nstart = 10)
# inspect the structure of the kmeans output cluster object
str(km)
List of 9
$ cluster : int [1:4000] 3 6 4 1 2 3 1 3 3 1 ...
$ centers : num [1:6, 1:2] 75.1 -94.5 120.4 -57.1 18.9 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:6] "1" "2" "3" "4" ...
.. ..$ : chr [1:2] "longitude" "latitude"
$ totss : num 24082261
$ withinss : num [1:6] 180486 203611 481501 150292 171681 ...
$ tot.withinss: num 1527888
$ betweenss : num 22554373
$ size : int [1:6] 726 554 984 392 356 988
$ iter : int 3
$ ifault : int 0
- attr(*, "class")= chr "kmeans"
# access two important features of cluster, their size and centers
km$size
[1] 726 554 984 392 356 988
km$centers
longitude latitude
1 75.14407 27.43226
2 -94.47442 31.07927
3 120.38618 23.72534
4 -57.10048 -15.21344
5 18.91076 -1.12440
6 18.77544 45.37853
BigCities <- BigCities %>% mutate(cluster = city_clusts)
# graph the clusters, using the cluster variable to pick the color in standard cartesian coordinate system
BigCities %>% ggplot(aes(x = longitude, y = latitude)) +
geom_point(aes(color = cluster), alpha = 0.5)
a) Total (10 pts) What does the above clustering algorithm seem to have identified?
Answer: In my opinion, it identified that many cities are in Asia and Europe, that North America also has a large number of cities, and that in South America most cities lie along the edges of the continent. Perhaps most impressive, to me at least, is that this program, which as stated before has no geographical knowledge, has formed entire continents of cities from the lines of code above.
b) Total (30 pts)
Projections: The Earth happens to be an oblate spheroid—a three-dimensional flattened sphere. Yet we would like to create two-dimensional representations of the Earth that fit on pages or computer screens. The process of converting locations in a three-dimensional geographic coordinate system to a two-dimensional representation is called projection.
A coordinate reference system (CRS) is needed to keep track of geographic locations. Every spatially-aware object in R can have a projection string, encoded using the PROJ.4 map projection library. These can be retrieved (or set) using the proj4string() command.
There are many CRSs, but a few are most common. A set of EPSG (European Petroleum Survey Group) codes provides a shorthand for the full PROJ.4 strings (like the one shown above). The most commonly-used are:
EPSG:4326 Also known as WGS84, this is the standard for GPS systems and Google Earth.
EPSG:3857 A Mercator projection used in map tiles by Google Maps, Open Street Maps, etc.
EPSG:4269 NAD83, most commonly used by U.S. federal agencies.
The R code below uses EPSG:4326. Use the other two standards, EPSG:3857 and EPSG:4269.
Use the k-means algorithm for each of the three projections above and compare the three projections to the standard Cartesian coordinates used in the example. Which one is best at identifying the continents?
Note: Graphing the clusters in each projection is worth 10 pts
library(rgdal)
package 'rgdal' was built under R version 4.0.5
Loading required package: sp
rgdal: version: 1.5-23, (SVN revision 1121)
Geospatial Data Abstraction Library extensions to R successfully loaded
Loaded GDAL runtime: GDAL 3.2.1, released 2020/12/29
Path to GDAL shared files: C:/Program Files/R/R-4.0.3/library/rgdal/gdal
GDAL binary built with GEOS: TRUE
Loaded PROJ runtime: Rel. 7.2.1, January 1st, 2021, [PJ_VERSION: 721]
Path to PROJ shared files: C:/Program Files/R/R-4.0.3/library/rgdal/proj
PROJ CDN enabled: FALSE
Linking to sp version:1.4-5
To mute warnings of possible GDAL/OSR exportToProj4() degradation,
use options("rgdal_show_exportToProj4_warnings"="none") before loading rgdal.
Overwritten PROJ_LIB was C:/Program Files/R/R-4.0.3/library/rgdal/proj
# assign only the coordinate columns of BigCities to a working data.frame object d
# (the cluster column added earlier is dropped so that the later kmeans() call sees only numeric data)
d <- BigCities[, c("longitude", "latitude")]
# create spatial object from d
coordinates(d) <- 1:2
# Set WGS 84 (EPSG:4326) standard for projecting longitude latitude coordinates
proj4string(d) <- CRS("+init=epsg:4326")
# coordinate reference system using the EPSG:4326 standard
CRS.new <- CRS("+init=epsg:4326")
# the d object in the new CRS; you may print out a few records to see how it looks in the new CRS
d.new <- spTransform(d, CRS.new)
# just for information, review the projection string of the transformed object
proj4string(d.new) %>% strwrap()
CRS object has comment, which is lost in output
[1] "+proj=longlat +datum=WGS84 +no_defs"
# form 6 clusters iteratively
city_clusts <- as.data.frame(d.new) %>%
kmeans(centers = 6) %>% fitted("classes") %>% as.character()
# add a variable for the newly formed clusters
d.new <- as.data.frame(d.new) %>% mutate(cluster = city_clusts)
# graph the clusters, using the cluster variable to pick the color
d.new %>% ggplot(aes(x = longitude, y = latitude)) +
geom_point(aes(color = cluster), alpha = 0.5)
YOUR CODE HERE:
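A minimal sketch for one of the remaining standards, EPSG:3857 (the Web Mercator used for map tiles); EPSG:4269 works the same way with the code changed accordingly. It reuses the d spatial object and the k-means workflow from the example above, so the only substantive change is the CRS passed to spTransform():
# coordinate reference system using the EPSG:3857 (Web Mercator) standard
CRS.merc <- CRS("+init=epsg:3857")

# re-project the cities into the new CRS
d.merc <- spTransform(d, CRS.merc)

# form 6 clusters iteratively on the projected coordinates
city_clusts_merc <- as.data.frame(d.merc) %>%
  kmeans(centers = 6, nstart = 10) %>%
  fitted("classes") %>%
  as.character()

# graph the clusters; note the axes are now projected metres, not degrees
as.data.frame(d.merc) %>%
  mutate(cluster = city_clusts_merc) %>%
  ggplot(aes(x = longitude, y = latitude)) +
  geom_point(aes(color = cluster), alpha = 0.5)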