Question 0: Survey (2 points)
I would like to get an idea of your interests in Machine Learning and your background. Please tell me:
Your last degree program (B.A., M.A., E.D., Ph.D., etc) ? Your Major and year
Your proficiency with Calculus – Scale 1 (Beginner) -5 (Advanced):
Your proficiency with Linear Algebra – Scale 1 (Beginner) -5 (Advanced):
Your statistical proficiency (Probability, Statistical Distributions) – Scale 1 (Beginner) -5 (Advanced):
Your proficiency with Regression Models (Linear, Logistic, etc.) – Scale 1 (Beginner) -5 (Advanced):
Your proficiency with R programming (Plots, Functions, Rmarkdown), Scale 1 (Beginner) – 5 (Advanced):
The assignment includes a spreadsheet survey.csv. Please fill out the information and submit it along with your homework.
Question 1 (2 points)
We will be using Rstudio in this class to implement the algorithms we learn in class. The goal of this assignment is to get you proficient in the basics of R, such as writing scripts and plotting. If you are having trouble, you can find help by searching the internet (often searching for the specific error message is helpful), reading Data Mining with R by Luis Torgo or R in a Nutshell by Joseph Adler, asking your friends, and coming to office hours. The computing homework needs to be submitted with your name and Uni# with Rmarkdown file and a pdf with the code and explanation.
Install the R and RStudio programs on your computer. Then, inside of RStudio, use the install.packages function to install RMarkdown. Then, in the code chunk below, type version.
_
platform x86_64-apple-darwin15.6.0
arch x86_64
os darwin15.6.0
system x86_64, darwin15.6.0
status
major 3
minor 6.0
year 2019
month 04
day 26
svn rev 76424
language R
version.string R version 3.6.0 (2019-04-26)
nickname Planting of a Tree
Question 2 (10 points)
2a (5 points)
Write a function called even.or.odd. Its parameter x will be a numeric vector. Return a character vector that says “odd” for odd numbers and “even” for the even numbers. The results should correctly classify every value. To determine if a number is even or odd, you can use the modulus operator %% (e.g.: 5%%3 = 2). Note: Try to find a solution that uses vector logic instead of a for loop. In R, this is a good programming practice that will speed up your programs.
Display the results of this function on the vector 1:5.
x <- c ( 1 , 4 , 5 , 7 , 9)
even.or.odd <- function (x) {
ifelse ( x %% 2 == 0, 'even','odd')
}
even.or.odd(x)[1] "odd" "even" "odd" "odd" "odd"
2b (5 points)
Write a function my.sum that computes the sum of a numeric vector x. The user can also specify the.type as a character. If the.type is “even”, the function will compute the sum of only the even values in x. If the.type is “odd”, the function will compute the sum of only the odd values in x. If the.type is “all”, the function will compute the sum of the entire vector x. Within the function, you may use the built-in sum function. The function should omit missing values (NA) from the sum. This can be done using the na.rm argument within the sum function.
Display the results of this function for odd, even, and all values of the vector 1:5.
my.sum <- function(x, the.type) {
if(the.type == "even"){
sum(x[(x %%2 == 0)], na.rm = TRUE)
}
else if(the.type == "odd"){
sum(x[(x %%2 == 1)], na.rm = TRUE)
}
else if(the.type == "all"){
sum(x, na.rm = TRUE)
}
}
my.sum(1:5, the.type ="even")[1] 6
Question 3 (10 points)
Load package datasets and load the iris data set. We will try to predict the species of iris from the sepal’s length and width and the petal’s length and width using k−nearest neighbors.
3a (5 points)
Divide the data into training and testing sets. To do so, let’s create an assignment vector called training_row. Each row of the data set will be assigned to the training set (with training_row set to TRUE) with probability 0.8 or to the test set (with training_row set to FALSE) with probability 0.2. Use the sample function to create the training_row vector of TRUE and FALSE values. The vector should be as long as the number of rows in the iris data set.
Then, divide the iris data set into separate training and test sets according to the training_row assignments.
In order to obtain consistent results, we’ll need to set the seed of R’s pseudo-random number generator. To do so, use set.seed(41) in the code chunk labeled constants above.
library(datasets)
set.seed(41)
##create normalization function
##nor <-function(x) { (x - min(x)) / (max(x) - min(x)) }
#run normalization function on first 4 columns of df
##df_norm <- as.data.frame (lapply (df [ , c(1,2,3,4) ], nor))
#extract training and testing set
training_row <- sample(1:nrow (iris) , 0.8*nrow (iris))
train <- iris[training_row, ]
test <- iris[- training_row, ]3b (5 points)
Use the function knn from the package class with k = 2 to classify the data. What proportion of the values are misclassified on the testing set?
Note: In order to use knn, the train and test objects must only include the columns that are used to make the classification. The Species will need to be separated into the cl vector and removed from the train and test objects.
library(class)
#seperate variable of interest into cl vectors
train_labels <- train[, 5]
test_labels <- test[, 5]
#predict classication using train_lables for test set
pred <- knn (train[,-5], test[,-5], cl = train_labels, k = 2)
#calculate misclassification rate
mean(test_labels != pred)[1] 0.1
#alternativley use: sum(test_labels!=pred)/sum(nrow(test)) or use confusion matrix
conf <- table(test_labels,pred)
conf pred
test_labels setosa versicolor virginica
setosa 9 0 0
versicolor 0 7 2
virginica 0 1 11
[1] 0.1
Question 4 (8 points)
Now perform the knn classification for each k value from 1 to 50. For each value of k, compute the percentage of misclassified values on the testing set. Print out your results as a table showing the values of k and the misclassification rates. You can use the datatable function in the DT package to display an HTML-friendly table.
Note: It would help to write a function that performs the knn computation and computes the misclassification rates.
#define range of k_value and misclassification rate
k_values = 1 : 50
mis_df <- data.frame( k = rep(0 , length(k_values)), mis_rate = rep (0 , length(k_values)))
for ( i in k_values) {
k <- k_values[i]
# Make predictions using knn: pred
pred.i <- knn(train[,-5], test[,-5], train_labels, k = i)
# Calculate the misclassificatin rate and store it in mis.i
mis.i <- mean(test_labels != pred.i)
mis_df[i,'k']<- k
mis_df[i,'mis_rate']<- mis.i
}
mis_df k mis_rate
1 1 0.10000000
2 2 0.10000000
3 3 0.10000000
4 4 0.06666667
5 5 0.06666667
6 6 0.03333333
7 7 0.00000000
8 8 0.06666667
9 9 0.00000000
10 10 0.00000000
11 11 0.00000000
12 12 0.03333333
13 13 0.00000000
14 14 0.03333333
15 15 0.00000000
16 16 0.00000000
17 17 0.00000000
18 18 0.00000000
19 19 0.00000000
20 20 0.03333333
21 21 0.06666667
22 22 0.10000000
23 23 0.10000000
24 24 0.10000000
25 25 0.06666667
26 26 0.06666667
27 27 0.06666667
28 28 0.06666667
29 29 0.06666667
30 30 0.10000000
31 31 0.10000000
32 32 0.10000000
33 33 0.10000000
34 34 0.13333333
35 35 0.13333333
36 36 0.13333333
37 37 0.10000000
38 38 0.13333333
39 39 0.13333333
40 40 0.13333333
41 41 0.13333333
42 42 0.13333333
43 43 0.13333333
44 44 0.13333333
45 45 0.13333333
46 46 0.13333333
47 47 0.13333333
48 48 0.13333333
49 49 0.13333333
50 50 0.13333333
Question 5 (20 points)
Use your answers from Question 4 to display the results in the questions below.
5a (5 points)
Plot the misclassification rates on the testing set versus the value of k. Use the plot function. Try different values of the arguments (las, xlim, ylim, xlab, ylab, cex, main) to create a nicer display. Use type = “both” to display both the points and a line.
# Plot the misclassification
plot(mis_df$k, mis_df$mis_rate, main = "Misclassification Rates vs Neighbors", xlab = "k, number of neighbors", ylab = "Misclassification Rates", cex =0.5, type = "both")5b (5 points)
Now create the same plot placing k on a *logarithmic** scale. Make sure to change the label of the x axis to distinguish this.
5c (10 points)
Let’s examine how the results would change if we were to run the knn classifier multiple times. Perform the following steps:
Re-perform the previous work 3 more times. Each time, you should create a new training and test set, apply knn on each value of k from 1 to 50, and compute the misclassification rates on the testing set.
Plot the results of the earlier work along with the 3 new iterations on a single plot. Use the lines function to add additional lines to the earlier plot from 5a (using the linear scale). Use different colors, line types (lty), and point characters (pch) to distinguish the lines.
Use the legend command to place a legend in the top left corner (x = “topleft”) of the plot. Use the same colors and point characters to display which line is which. Label the iterations 1 through 4.
#create function knn_output(k_values)
knn_output <- function(k_values = 1:50){
training_row = sample(1:nrow (iris) , 0.8*nrow (iris))
train = iris[ training_row, ]
test = iris[- training_row, ]
train_labels <- train[, 5]
test_labels <- test[, 5]
k_values <- 1 : 50
mis_df <- data.frame( k = rep( 0 , length(k_values)), mis_rate = rep (0 , length(k_values)))
for (i in k_values){
k <- k_values[i]
pred.i <- knn(train[,-5], test[,-5], train_labels, k = i)
mis.i <- mean(test_labels != pred.i)
mis_df[i,'k']<- k
mis_df[i,'mis_rate']<- mis.i
}
mis_df
}
#run another 3 iterations
knn2 <-knn_output(k_values = 1:50)
knn3 <-knn_output(k_values = 1:50)
knn4 <-knn_output(k_values = 1:50)
#create plot
plot(mis_df$k, mis_df$mis_rate, main = "Misclassification Rates vs Neighbors", xlab = "k, number of neighbors", ylab = "Misclassification Rates", col = "black",lty = 1, pch = 1,type = "b",ylim = c(0,.3))
#add lines of knn2, knn3, knn4
lines(mis_rate ~ k, knn2, col="blue" , lty=2,pch = 2,type = "b",ylim=c(0,.3))
lines(mis_rate ~ k, knn3, col ="red", lty=3,pch = 3,type = "b",ylim=c(0,.3))
lines(mis_rate ~ k, knn4, col ="green", lty=4,pch = 4,type = "b",ylim=c(0,.3))
#add legend
legend("topleft",
legend = c('iteration 1','iteration 2','iteration 3','iteration 4'),
col = c("black","blue","red","green"),
bty = "n",
cex = 0.8,
text.col = c("black","blue","red","green"),
horiz = F,
lty=c(1,2,3,4),
pch = c(1,2,3,4))#My first try:
#knn_mis_rate <- function(k){
# training_row = sample(1:nrow (iris) , 0.8*nrow (iris))
# train = iris[ training_row, ]
# test = iris[- training_row, ]
# train_labels <- train[, 5]
# test_labels <- test[, 5]
# pred <- knn (train[,-5], test[,-5], train_labels, k)
# #calculate mis_rate
# mis_rate <- mean(pred != test_labels)
# return(mis_rate)
#}
#knn_output <- function(k_values = 1:50){
# k_values = 1:50
# mis_rate = c()
# for (k in k_values){
# mis_rate[k] = knn_mis_rate(k)
#}
# cbind.data.frame(k_values, mis_rate)
#}Question 6 (22 points)
Here we’ll work with the Hitters database from the ISLR library, which contains Major League Baseball Data from the 1986 and 1987 seasons (322 observations on 20 variables). For a description of the variables go to: https://rdrr.io/cran/ISLR/man/Hitters.html Install the ISLR package in R if you have not done so already
6d (2 points)
Compute the min, median, mean, and max of Hits, Home Runs, and Runs for a season (not career totals). Remove any missing values from the calculations. Round your results to 1 decimal place.
cal <- function(x) {
c(min = round(min(x, na.rm = TRUE),1),
median = round(median(x,na.rm = TRUE),1),
mean = round(mean(x,na.rm = TRUE),1),
max = round(max(x,na.rm = TRUE),1)
)
}
sapply(Hitters[,c("Hits","HmRun","Runs")], cal) Hits HmRun Runs
min 1 0.0 0.0
median 96 8.0 48.0
mean 101 10.8 50.9
max 238 40.0 130.0
6e (2 points)
What percentage of these players’s seasons had at least 100 hits and 20 home runs? Use the percent function in the scales package to convert a decimal proportion to a percentage.
[1] "16.8%"
6f (2 points)
What is the relationship between different pairs of variables? Let’s look at Salary, Hits, Runs, HmRun, Errors, and Assists. Use the pairs function to display scatterplots of each pair of these variables.
6g (2 points)
Based on these scatterplots, which variables appear to be correlated with Salary, and which ones appear to have little or no correlation with Salary? Provide a short explanation for your assessment.
Based on these scatterplots, Hits, Runs, HmRun appear to be positively correlated with Salary because the there’re clear positive associate for Salary-Hits, Salary-Runs, Salary-HmRun, which represent moderate linear relationships. Errors and Assists appear to have little or no correlation with Salary as the points are dispersed without clear cirection and form. To quantify the correlations, a correlation matrix can be applied using the function: cor(df[,c(“Salary”,“Hits”,“Runs”,“HmRun”,“Errors”,“Assists”)], use = “complete.obs”)
6h (2 points)
Create a new variable called HighRBI for those players with at least 75 RBI (TRUE). Players with less than 75 RBI should have the value FALSE.
6i (2 points)
What percentage of hitters qualified as HighRBI during these seasons?
[1] "18.6%"
6j (2 points)
What is the correlation of HighRBI, Home Runs, Hits, Runs, Assists, and Errors with Salary? Use only the cases in which both variables are measured. Round the answer to two decimal places.
##calculate correlation coefficient matrix for numeric variables, round to 2 decimal
round(cor(H[,c("Salary","Hits","Runs","HmRun","Assists","Errors","HighRBI")], use = "complete.obs") , 2) Salary Hits Runs HmRun Assists Errors HighRBI
Salary 1.00 0.44 0.42 0.34 0.03 -0.01 0.37
Hits 0.44 1.00 0.91 0.53 0.30 0.28 0.55
Runs 0.42 0.91 1.00 0.63 0.18 0.19 0.54
HmRun 0.34 0.53 0.63 1.00 -0.16 -0.01 0.70
Assists 0.03 0.30 0.18 -0.16 1.00 0.70 0.00
Errors -0.01 0.28 0.19 -0.01 0.70 1.00 0.06
HighRBI 0.37 0.55 0.54 0.70 0.00 0.06 1.00
6k (2 points)
How did the salaries differ for players with and without HighRBI? Use the boxplot function and split the salary data by HighRBI status. Do HighRBI players have a higher median salary?
6l (2 points)
Show a histogram of home runs using the hist function with breaks = 20 and freq = FALSE.
Question 7 (10 points)
7a (2 points)
What is the mean and standard deviation of Hits, Runs, Home Runs, RBI, Assists, Errors, and Salaries? Remove any missing values from the calculations. Round the answers to 1 decimal place.
# calculate mean for Hits, Runs, Home Runs, RBI, Assists, Errors, and Salaries
v <- c("Hits", "Runs", "HmRun", "RBI", "Assists","Errors","Salary")
multi.fun <- function(x) {
c(mean = round(mean(x, na.rm = TRUE),1), sd = round(sd(x,na.rm = TRUE),1))
}
sapply(H[, v], multi.fun) Hits Runs HmRun RBI Assists Errors Salary
mean 101.0 50.9 10.8 48.0 106.9 8.0 535.9
sd 46.5 26.0 8.7 26.2 136.9 6.4 451.1
7b (3 points)
Some players only get to play part-time. Show the mean and standard deviations for the same variables as in the previous question only for players with at least 300 AtBat.
Hits Runs HmRun RBI Assists Errors Salary
mean 126.3 64.3 13.8 60.3 133.4 9.4 617.1
sd 35.2 21.3 9.0 23.2 155.0 6.8 467.7
7c (3 points)
Show a scatter plot of Salary versus Home Runs for players with at least 300 AtBat.
7d (2 points)
There is a player with zero home runs and a salary over 2,000 (more than 2 million dollars). Who is this player? What does it look like happened during the season? Are these numbers accurate? Use the internet to search for this player’s results in 1986 and 1987.
AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun
-Mike Schmidt 20 1 0 0 0 0 2 41 9 2
CRuns CRBI CWalks League Division PutOuts Assists Errors
-Mike Schmidt 6 7 4 N E 78 220 6
Salary NewLeague
-Mike Schmidt 2127.333 N
Question 8 (14 points)
After exploring the Hitters data so extensively, you are asked to build a regression model to predict the hitter’s salary.
8a (7 points)
Build a linear regression model and explain how (or why) you choose certain predictors in your model. Use 70% of the valid data for training and the remaining 30% of the valid data for testing. Please report the Room Mean Squared Error of the model on both the training and testing sets. Note that, what data are considered as “valid” is up to you based on your data exploration. For example, you can exclude certain data because of either missing data or outliers. But please explain how you determine your validate dataset.
AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns
AtBat 1.00 0.96 0.56 0.90 0.80 0.62 0.01 0.21 0.23 0.21 0.24
Hits 0.96 1.00 0.53 0.91 0.79 0.59 0.02 0.21 0.24 0.19 0.24
HmRun 0.56 0.53 1.00 0.63 0.85 0.44 0.11 0.22 0.22 0.49 0.26
Runs 0.90 0.91 0.63 1.00 0.78 0.70 -0.01 0.17 0.19 0.23 0.24
RBI 0.80 0.79 0.85 0.78 1.00 0.57 0.13 0.28 0.29 0.44 0.31
Walks 0.62 0.59 0.44 0.70 0.57 1.00 0.13 0.27 0.27 0.35 0.33
Years 0.01 0.02 0.11 -0.01 0.13 0.13 1.00 0.92 0.90 0.72 0.88
CAtBat 0.21 0.21 0.22 0.17 0.28 0.27 0.92 1.00 1.00 0.80 0.98
CHits 0.23 0.24 0.22 0.19 0.29 0.27 0.90 1.00 1.00 0.79 0.98
CHmRun 0.21 0.19 0.49 0.23 0.44 0.35 0.72 0.80 0.79 1.00 0.83
CRuns 0.24 0.24 0.26 0.24 0.31 0.33 0.88 0.98 0.98 0.83 1.00
CRBI 0.22 0.22 0.35 0.20 0.39 0.31 0.86 0.95 0.95 0.93 0.95
CWalks 0.13 0.12 0.23 0.16 0.23 0.43 0.84 0.91 0.89 0.81 0.93
PutOuts 0.31 0.30 0.25 0.27 0.31 0.28 -0.02 0.05 0.07 0.09 0.06
Assists 0.34 0.30 -0.16 0.18 0.06 0.10 -0.09 -0.01 -0.01 -0.19 -0.04
Errors 0.33 0.28 -0.01 0.19 0.15 0.08 -0.16 -0.07 -0.07 -0.17 -0.09
Salary 0.39 0.44 0.34 0.42 0.45 0.44 0.40 0.53 0.55 0.52 0.56
CRBI CWalks PutOuts Assists Errors Salary
AtBat 0.22 0.13 0.31 0.34 0.33 0.39
Hits 0.22 0.12 0.30 0.30 0.28 0.44
HmRun 0.35 0.23 0.25 -0.16 -0.01 0.34
Runs 0.20 0.16 0.27 0.18 0.19 0.42
RBI 0.39 0.23 0.31 0.06 0.15 0.45
Walks 0.31 0.43 0.28 0.10 0.08 0.44
Years 0.86 0.84 -0.02 -0.09 -0.16 0.40
CAtBat 0.95 0.91 0.05 -0.01 -0.07 0.53
CHits 0.95 0.89 0.07 -0.01 -0.07 0.55
CHmRun 0.93 0.81 0.09 -0.19 -0.17 0.52
CRuns 0.95 0.93 0.06 -0.04 -0.09 0.56
CRBI 1.00 0.89 0.10 -0.10 -0.12 0.57
CWalks 0.89 1.00 0.06 -0.07 -0.13 0.49
PutOuts 0.10 0.06 1.00 -0.04 0.08 0.30
Assists -0.10 -0.07 -0.04 1.00 0.70 0.03
Errors -0.12 -0.13 0.08 0.70 1.00 -0.01
Salary 0.57 0.49 0.30 0.03 -0.01 1.00
#From the correlation matrix, CAtBat,CHits,CHmRun,CRuns,CRBI,CWalks are highly correalted with #Salary, with correlation coefficient all above 0.50. Besides, a player's salary is based on his #career records, not very related to stats for the current season. Therefore, I will use thoese #variables as predictors for my model.
# there are missing values in set, which will hamper the linear models. Remove NA values:
H <- na.omit(Hitters[,c("Salary","CAtBat","CHits","CHmRun","CRuns","CRBI","CWalks")])
sum(is.na(H))[1] 0
#find outliers using boxplot outcome
findOutliners <- function(x){
outliers <- boxplot(x, plot=FALSE)$out
which(x %in% outliers)
}
#remove all outliers
dim(H)[1] 263 7
H <- H[ -findOutliners(H$CHits), ]
H <- H[ -findOutliners(H$CRBI), ]
H <- H[ -findOutliners(H$CAtBat), ]
H <- H[ -findOutliners(H$CHmRun), ]
H <- H[ -findOutliners(H$CRuns), ]
H <- H[ -findOutliners(H$CWalks), ]
H <- H[ -findOutliners(H$Salary), ]
dim(H) #212 records left, 263-212 =51 observations removed[1] 212 7
#subset train and test set
set.seed(55)
train_index = sample(1:nrow (H) , 0.7*nrow (H))
H_train = H[train_index, ]
H_test = H[- train_index, ]
#build linear regression model
lm <- lm(Salary ~ CRBI + CAtBat + CHmRun + CRuns + CHits + CWalks,data = H_train)
#prediction on training set
pred_train <- predict(lm)
#calculate rmse
lm_train_rmse <- sqrt(mean((H_train$Salary - pred_train)^2, na.rm = TRUE))
#prediction on test set
pred_test <- predict(lm,H_test)
lm_test_rmse <- sqrt(mean((H_test$Salary - pred_test)^2, na.rm = TRUE))
cbind(Model = "LM", `Training RMSE` = lm_train_rmse, `Test RMSE` = lm_test_rmse) Model Training RMSE Test RMSE
[1,] "LM" "210.02047935331" "196.27687440876"
8b (7 points)
Repeat question 8a using KNN with 5 neighbors.
#subset data, assign cl labels
knn_train = H_train[ , -1]
knn_test = H_test[, -1]
knn_train_labels <- H_train[, 1]
knn_test_labels <- H_test[, 1]
#build knn model on train and test sets
knn_training_pred <- knn(knn_train, knn_train, knn_train_labels, k = 5)
knn_test_pred <- knn(knn_train, knn_test,knn_train_labels, k = 5)
#calculate rmse
knn_training_rmse <- sqrt(mean((as.numeric(knn_training_pred) - knn_train_labels)^2, na.rm = TRUE))
knn_test_rmse <- sqrt(mean((as.numeric(knn_test_pred) - knn_test_labels)^2, na.rm = TRUE))
cbind(Model = "KNN",`Training RMSE` = knn_training_rmse, `Test RMSE` = knn_test_rmse) Model Training RMSE Test RMSE
[1,] "KNN" "486.129796527746" "493.416545941693"