Question 0: Survey (2 points)
I would like to get an idea of your interests in Machine Learning and your background. Please tell me:
Your last degree program (B.A., M.A., Ed.D., Ph.D., etc.)? Your major and year?
Your proficiency with Calculus – Scale 1 (Beginner) to 5 (Advanced):
Your proficiency with Linear Algebra – Scale 1 (Beginner) to 5 (Advanced):
Your statistical proficiency (Probability, Statistical Distributions) – Scale 1 (Beginner) to 5 (Advanced):
Your proficiency with Regression Models (Linear, Logistic, etc.) – Scale 1 (Beginner) to 5 (Advanced):
Your proficiency with R programming (Plots, Functions, RMarkdown) – Scale 1 (Beginner) to 5 (Advanced):
The assignment includes a spreadsheet survey.csv. Please fill out the information and submit it along with your homework.
Question 1 (2 points)
We will be using RStudio in this class to implement the algorithms we learn in class. The goal of this assignment is to get you proficient in the basics of R, such as writing scripts and plotting. If you are having trouble, you can find help by searching the internet (often searching for the specific error message is helpful), reading Data Mining with R by Luis Torgo or R in a Nutshell by Joseph Adler, asking your friends, and coming to office hours. The computing homework must be submitted with your name and UNI, as an RMarkdown file plus a PDF containing the code and explanation.
Install the R and RStudio programs on your computer. Then, inside RStudio, use the install.packages function to install RMarkdown. Finally, in the code chunk below, type version.
_
platform x86_64-apple-darwin15.6.0
arch x86_64
os darwin15.6.0
system x86_64, darwin15.6.0
status
major 3
minor 6.0
year 2019
month 04
day 26
svn rev 76424
language R
version.string R version 3.6.0 (2019-04-26)
nickname Planting of a Tree
Question 2 (10 points)
2a (5 points)
Write a function called even.or.odd. Its parameter x will be a numeric vector. Return a character vector that says “odd” for odd numbers and “even” for the even numbers. The results should correctly classify every value. To determine if a number is even or odd, you can use the modulus operator %% (e.g.: 5%%3 = 2). Note: Try to find a solution that uses vector logic instead of a for loop. In R, this is a good programming practice that will speed up your programs.
Display the results of this function on the vector 1:5.
x <- c(1, 4, 5, 7, 9)
even.or.odd <- function(x) {
  ifelse(x %% 2 == 0, 'even', 'odd')
}
even.or.odd(x)
[1] "odd"  "even" "odd"  "odd"  "odd"
2b (5 points)
Write a function my.sum that computes the sum of a numeric vector x. The user can also specify the.type as a character. If the.type is “even”, the function will compute the sum of only the even values in x. If the.type is “odd”, the function will compute the sum of only the odd values in x. If the.type is “all”, the function will compute the sum of the entire vector x. Within the function, you may use the built-in sum function. The function should omit missing values (NA) from the sum. This can be done using the na.rm argument within the sum function.
Display the results of this function for odd, even, and all values of the vector 1:5.
my.sum <- function(x, the.type) {
  # sum only the even values
  if (the.type == "even") {
    sum(x[x %% 2 == 0], na.rm = TRUE)
  # sum only the odd values
  } else if (the.type == "odd") {
    sum(x[x %% 2 == 1], na.rm = TRUE)
  # sum the entire vector
  } else if (the.type == "all") {
    sum(x, na.rm = TRUE)
  }
}
my.sum(1:5, the.type = "even")
[1] 6
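The question also asks for the odd and all cases of the vector 1:5; those could be displayed the same way (the sums follow directly from the arithmetic):
my.sum(1:5, the.type = "odd")   # odd values 1, 3, 5 sum to 9
my.sum(1:5, the.type = "all")   # all values 1 through 5 sum to 15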
Question 3 (10 points)
Load the package datasets and load the iris data set. We will try to predict the species of iris from the sepal’s length and width and the petal’s length and width using k-nearest neighbors.
3a (5 points)
Divide the data into training and testing sets. To do so, let’s create an assignment vector called training_row. Each row of the data set will be assigned to the training set (with training_row set to TRUE) with probability 0.8 or to the test set (with training_row set to FALSE) with probability 0.2. Use the sample function to create the training_row vector of TRUE and FALSE values. The vector should be as long as the number of rows in the iris data set.
Then, divide the iris data set into separate training and test sets according to the training_row assignments.
In order to obtain consistent results, we’ll need to set the seed of R’s pseudo-random number generator. To do so, use set.seed(41) in the code chunk labeled constants above.
library(datasets)
set.seed(41)
##create normalization function
##nor <-function(x) { (x - min(x)) / (max(x) - min(x)) }
#run normalization function on first 4 columns of df
##df_norm <- as.data.frame (lapply (df [ , c(1,2,3,4) ], nor))
#extract training and testing set
training_row = sample(1:nrow(iris), 0.8 * nrow(iris))
train = iris[training_row, c(1:4)]
test = iris[-training_row, c(1:4)]
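Note that the question describes training_row as a logical (TRUE/FALSE) vector; the index-based split above is equivalent in effect. For reference, a sketch of the literal logical-vector version (not run here; the remainder of the solution continues from the index-based split) could look like:
# hypothetical sketch: logical assignment vector, TRUE with probability 0.8, FALSE with probability 0.2
training_row <- sample(c(TRUE, FALSE), size = nrow(iris), replace = TRUE, prob = c(0.8, 0.2))
train <- iris[training_row, 1:4]
test <- iris[!training_row, 1:4]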
3b (5 points)
Use the function knn from the package class with k = 2 to classify the data. What proportion of the values are misclassified on the testing set?
Note: In order to use knn, the train and test objects must only include the columns that are used to make the classification. The Species will need to be separated into the cl vector and removed from the train and test objects.
library(class)
# separate the variable of interest (Species) into the cl vectors
train_labels <- iris[training_row, 5]
test_labels <- iris[-training_row, 5]
# predict the classification of the test set using train_labels
pred <- knn(train, test, cl = train_labels, k = 2)
# calculate the misclassification rate
mean(test_labels != pred)
[1] 0.1
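To see where the errors occur, the predictions could also be cross-tabulated against the true species (a small optional sketch using the objects above):
# confusion table of predicted versus actual species on the test set
table(Predicted = pred, Actual = test_labels)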
Question 4 (8 points)
Now perform the knn classification for each k value from 1 to 50. For each value of k, compute the percentage of misclassified values on the testing set. Print out your results as a table showing the values of k and the misclassification rates. You can use the datatable function in the DT package to display an HTML-friendly table.
Note: It would help to write a function that performs the knn computation and computes the misclassification rates.
# define the range of k values and preallocate the misclassification rates
k_value = 1:50
mis_rate <- rep(0, length(k_value))
for (k in k_value) {
  # make predictions using knn
  pred <- knn(train, test, train_labels, k = k)
  # calculate the misclassification rate and store it in mis_rate[k]
  mis_rate[k] <- mean(pred != test_labels)
}
# load DT
library(DT)
#create dataframe
mis_df <- data.frame (k_value, mis_rate)
mis_df
k_value mis_rate
1 1 0.10000000
2 2 0.10000000
3 3 0.10000000
4 4 0.06666667
5 5 0.06666667
6 6 0.03333333
7 7 0.00000000
8 8 0.06666667
9 9 0.00000000
10 10 0.00000000
11 11 0.00000000
12 12 0.03333333
13 13 0.00000000
14 14 0.03333333
15 15 0.00000000
16 16 0.00000000
17 17 0.00000000
18 18 0.00000000
19 19 0.00000000
20 20 0.03333333
21 21 0.06666667
22 22 0.10000000
23 23 0.10000000
24 24 0.10000000
25 25 0.06666667
26 26 0.06666667
27 27 0.06666667
28 28 0.06666667
29 29 0.06666667
30 30 0.10000000
31 31 0.10000000
32 32 0.10000000
33 33 0.10000000
34 34 0.13333333
35 35 0.13333333
36 36 0.13333333
37 37 0.10000000
38 38 0.13333333
39 39 0.13333333
40 40 0.13333333
41 41 0.13333333
42 42 0.13333333
43 43 0.13333333
44 44 0.13333333
45 45 0.13333333
46 46 0.13333333
47 47 0.13333333
48 48 0.13333333
49 49 0.13333333
50 50 0.13333333
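Since the DT package is already loaded, the same table could also be rendered as the HTML widget the question mentions (a minimal sketch; the display only appears in HTML output of the R Markdown document):
# interactive HTML table of k values and misclassification rates
datatable(mis_df, rownames = FALSE, options = list(pageLength = 10))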
Question 5 (20 points)
Use your answers from Question 4 to display the results in the questions below.
5a (5 points)
Plot the misclassification rates on the testing set versus the value of k. Use the plot function. Try different values of the arguments (las, xlim, ylim, xlab, ylab, cex, main) to create a nicer display. Use type = “both” to display both the points and a line.
# plot the misclassification rates against k (type = "b" draws both the points and a line)
plot(k_value, mis_rate, main = "Misclassification Rates vs Neighbors",
     xlab = "k, number of neighbors", ylab = "Misclassification Rates",
     cex = 0.5, type = "b")
5b (5 points)
Now create the same plot, placing k on a logarithmic scale. Make sure to change the label of the x-axis to distinguish this.
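No code for this part appears in the document; one possible sketch, reusing k_value and mis_rate from Question 4 and plot's log argument, would be:
# same display with k on a logarithmic x-axis
plot(k_value, mis_rate, log = "x",
     main = "Misclassification Rates vs Neighbors (log scale of k)",
     xlab = "k, number of neighbors (log scale)",
     ylab = "Misclassification Rates", cex = 0.5, type = "b")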
5c (10 points)
Let’s examine how the results would change if we were to run the knn classifier multiple times. Perform the following steps:
Re-perform the previous work 3 more times. Each time, you should create a new training and test set, apply knn on each value of k from 1 to 50, and compute the misclassification rates on the testing set.
Plot the results of the earlier work along with the 3 new iterations on a single plot. Use the lines function to add additional lines to the earlier plot from 5a (using the linear scale). Use different colors, line types (lty), and point characters (pch) to distinguish the lines.
Use the legend command to place a legend in the top left corner (x = “topleft”) of the plot. Use the same colors and point characters to display which line is which. Label the iterations 1 through 4.
# now create a function to run KNN for a single value of k
# note: each call draws a new random train/test split of the iris data
knn_mis_rate <- function(k) {
  # create the train and test sets, train_labels and test_labels
  training_row = sample(1:nrow(iris), 0.8 * nrow(iris))
  train = iris[training_row, c(1:4)]
  test = iris[-training_row, c(1:4)]
  train_labels <- iris[training_row, 5]
  test_labels <- iris[-training_row, 5]
  # apply the knn function
  pred <- knn(train, test, train_labels, k)
  # calculate the misclassification rate
  mis_rate <- mean(pred != test_labels)
  return(mis_rate)
}
# compute the misclassification rate for each value of k and return a data frame
knn_result <- function(k_values = 1:50) {
  mis_rate <- numeric(length(k_values))
  for (i in seq_along(k_values)) {
    mis_rate[i] <- knn_mis_rate(k_values[i])
  }
  cbind.data.frame(k_values, mis_rate)
}
# run 3 more times
knn2 <- knn_result(k_values = 1:50)
knn3 <- knn_result(k_values = 1:50)
knn4 <- knn_result(k_values = 1:50)
# plot the results of group 1, then overlay the lines for the three new groups
plot(mis_df$k_value, mis_df$mis_rate, main = "Misclassification Rates vs Neighbors",
     xlab = "k, number of neighbors", ylab = "Misclassification Rates",
     col = "black", lty = 1, pch = 1, type = "b")
# add the lines of knn2, knn3, knn4
lines(mis_rate ~ k_values, knn2, col = "blue", lty = 2, pch = 2, type = "b")
lines(mis_rate ~ k_values, knn3, col = "red", lty = 3, pch = 3, type = "b")
lines(mis_rate ~ k_values, knn4, col = "green", lty = 4, pch = 4, type = "b")
# add a legend in the top left corner
legend("topleft",
       legend = c("Group 1", "Group 2", "Group 3", "Group 4"),
       col = c("black", "blue", "red", "green"),
       bty = "n",
       cex = 0.8,
       text.col = c("black", "blue", "red", "green"),
       horiz = FALSE,
       lty = c(1, 2, 3, 4),
       pch = c(1, 2, 3, 4),
       inset = c(0.1, 0.1))
Question 6 (22 points)
Here we’ll work with the Hitters data set from the ISLR library, which contains Major League Baseball data from the 1986 and 1987 seasons (322 observations on 20 variables). For a description of the variables, go to: https://rdrr.io/cran/ISLR/man/Hitters.html. Install the ISLR package in R if you have not done so already.
6d (2 points)
Compute the min, median, mean, and max of Hits, Home Runs, and Runs for a season (not career totals). Remove any missing values from the calculations. Round your results to 1 decimal place.
# create a helper around the summary function: it takes a vector and returns the Min, Median, Mean, and Max (along with the 1st and 3rd quartiles)
cal <- function(x) { summary(x, digits = 0, na.rm = T) }
cal(H_df$Hits)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1 60 100 100 100 200
cal(H_df$HmRun)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0 4 8 10 20 40
cal(H_df$Runs)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0 30 50 50 70 100
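The question asks for results rounded to 1 decimal place, while summary with digits = 0 rounds much more aggressively. A sketch of how the requested statistics could be computed directly (stats_1dp is a hypothetical helper name; H_df is assumed from the earlier parts of this question) is:
# min, median, mean, and max rounded to 1 decimal place, NAs removed
stats_1dp <- function(x) {
  round(c(min = min(x, na.rm = TRUE), median = median(x, na.rm = TRUE),
          mean = mean(x, na.rm = TRUE), max = max(x, na.rm = TRUE)), 1)
}
sapply(H_df[, c("Hits", "HmRun", "Runs")], stats_1dp)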
6e (2 points)
What percentage of these players’ seasons had at least 100 hits and 20 home runs? Use the percent function in the scales package to convert a decimal proportion to a percentage.
[1] "16.8%"
6f (2 points)
What is the relationship between different pairs of variables? Let’s look at Salary, Hits, Runs, HmRun, Errors, and Assists. Use the pairs function to display scatterplots of each pair of these variables.
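No code is shown for this part; a minimal sketch (assuming the working data frame H_df still contains these columns) would be:
# scatterplot matrix of each pair of the selected variables
pairs(H_df[, c("Salary", "Hits", "Runs", "HmRun", "Errors", "Assists")])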
6g (2 points)
Based on these scatterplots, which variables appear to be correlated with Salary, and which ones appear to have little or no correlation with Salary? Provide a short explanation for your assessment.
Based on these scatterplots, Hits, Runs, and HmRun appear to be positively correlated with Salary: the Salary-Hits, Salary-Runs, and Salary-HmRun panels show clear positive associations, suggesting moderate linear relationships. Errors and Assists appear to have little or no correlation with Salary, as their points are dispersed without a clear direction or form. To quantify the correlations, a correlation matrix can be computed with cor(df[, c("Salary", "Hits", "Runs", "HmRun", "Errors", "Assists")], use = "complete.obs").
6h (2 points)
Create a new variable called HighRBI for those players with at least 75 RBI (TRUE). Players with less than 75 RBI should have the value FALSE.
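The code creating HighRBI is not shown; a sketch (assuming the working data frame is H_df; the code in 6j below suggests the variable was in fact stored as a factor rather than a logical) could be:
# TRUE for players with at least 75 RBI in the season, FALSE otherwise
H_df$HighRBI <- H_df$RBI >= 75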
6i (2 points)
What percentage of hitters qualified as HighRBI during these seasons?
# subset the data frame to the hitters with HighRBI == TRUE and count the rows
n_HighRBI <- nrow(subset(H_df, HighRBI == 'TRUE'))
# calculate the percentage
percent(n_HighRBI / nrow(H_df))
[1] "18.6%"
6j (2 points)
What is the correlation of HighRBI, Home Runs, Hits, Runs, Assists, and Errors with Salary? Use only the cases in which both variables are measured. Round the answer to two decimal places.
# we can calculate Pearson's correlation coefficient for the pairs of numeric variables - Salary, Home Runs, Hits, Runs, Assists, and Errors - in order to quantify their associations.
# To summarize the relationship between a categorical variable (HighRBI) and a numeric variable (Salary), a common option is a graphical summary such as a box plot (shown in question 6k below). Here, however, we can convert HighRBI from a factor to a logical and treat it as numeric when calculating correlation coefficients.
# convert the data type from factor to logical
H_df$HighRBI <- as.logical.factor(H_df$HighRBI)
# calculate the correlation coefficient matrix for the numeric variables, rounded to 2 decimals
round(cor(H_df[, c("Salary", "Hits", "Runs", "HmRun", "Assists", "Errors", "HighRBI")], use = "complete.obs"), 2)
Salary Hits Runs HmRun Assists Errors HighRBI
Salary 1.00 0.44 0.42 0.34 0.03 -0.01 0.37
Hits 0.44 1.00 0.91 0.53 0.30 0.28 0.55
Runs 0.42 0.91 1.00 0.63 0.18 0.19 0.54
HmRun 0.34 0.53 0.63 1.00 -0.16 -0.01 0.70
Assists 0.03 0.30 0.18 -0.16 1.00 0.70 0.00
Errors -0.01 0.28 0.19 -0.01 0.70 1.00 0.06
HighRBI 0.37 0.55 0.54 0.70 0.00 0.06 1.00
6k (2 points)
How did the salaries differ for players with and without HighRBI? Use the boxplot function and split the salary data by HighRBI status. Do HighRBI players have a higher median salary?
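The boxplot code is not shown; a sketch (assuming H_df contains Salary and HighRBI, with Salary recorded in thousands of dollars as in the Hitters data) might be:
# salary distribution split by HighRBI status
boxplot(Salary ~ HighRBI, data = H_df,
        xlab = "HighRBI (at least 75 RBI)", ylab = "Salary (thousands of dollars)",
        main = "Salary by HighRBI status")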
6l (2 points)
Show a histogram of home runs using the hist function with breaks = 20 and freq = FALSE.
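No code appears here either; a minimal sketch (assuming the home run column HmRun in H_df) would be:
# density-scaled histogram of home runs (freq = FALSE shows densities rather than counts)
hist(H_df$HmRun, breaks = 20, freq = FALSE,
     xlab = "Home Runs", main = "Histogram of Home Runs")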
Question 7 (10 points)
7a (2 points)
What is the mean and standard deviation of Hits, Runs, Home Runs, RBI, Assists, Errors, and Salaries? Remove any missing values from the calculations. Round the answers to 1 decimal place.
# calculate the mean and standard deviation for Hits, Runs, Home Runs, RBI, Assists, Errors, and Salary
v <- c("Hits", "Runs", "HmRun", "RBI", "Assists", "Errors", "Salary")
multi.fun <- function(x) {
  c(mean = round(mean(x, na.rm = TRUE), 1), sd = round(sd(x, na.rm = TRUE), 1))
}
sapply(H_df[, v], multi.fun)
Hits Runs HmRun RBI Assists Errors Salary
mean 101.0 50.9 10.8 48.0 106.9 8.0 535.9
sd 46.5 26.0 8.7 26.2 136.9 6.4 451.1
7b (3 points)
Some players only get to play part-time. Show the mean and standard deviations for the same variables as in the previous question only for players with at least 300 AtBat.
Hits Runs HmRun RBI Assists Errors Salary
mean 126.3 64.3 13.8 60.3 133.4 9.4 617.1
sd 35.2 21.3 9.0 23.2 155.0 6.8 467.7
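The code producing this table is not shown; one sketch that would reproduce it, reusing v and multi.fun from 7a (and assuming H_df still contains the AtBat column), is:
# restrict to players with at least 300 at-bats, then apply the same summary function
sapply(H_df[H_df$AtBat >= 300, v], multi.fun)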
7c (3 points)
Show a scatter plot of Salary versus Home Runs for players with at least 300 AtBat.
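The plotting code is not shown; a minimal sketch (assuming H_df with AtBat, HmRun, and Salary) could be:
# scatter plot of Salary versus Home Runs for players with at least 300 at-bats
with(H_df[H_df$AtBat >= 300, ],
     plot(HmRun, Salary, xlab = "Home Runs", ylab = "Salary (thousands of dollars)",
          main = "Salary vs Home Runs (AtBat >= 300)"))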
7d (2 points)
There is a player with zero home runs and a salary over 2,000 (more than 2 million dollars). Who is this player? What does it look like happened during the season? Are these numbers accurate? Use the internet to search for this player’s results in 1986 and 1987.
AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun
-Mike Schmidt 20 1 0 0 0 0 2 41 9 2
CRuns CRBI CWalks League Division PutOuts Assists Errors
-Mike Schmidt 6 7 4 N E 78 220 6
Salary NewLeague
-Mike Schmidt 2127.333 N
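The lookup code is not shown; the row above could be located with something like the following sketch:
# find the player(s) with zero home runs and a salary above 2000 (i.e., over 2 million dollars)
H_df[which(H_df$HmRun == 0 & H_df$Salary > 2000), ]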
Question 8 (14 points)
After exploring the Hitters data so extensively, you are asked to build a regression model to predict the hitter’s salary.
8a (7 points)
Build a linear regression model and explain how (or why) you chose certain predictors in your model. Use 70% of the valid data for training and the remaining 30% of the valid data for testing. Please report the Root Mean Squared Error of the model on both the training and testing sets. Note that what data are considered "valid" is up to you, based on your data exploration. For example, you can exclude certain data because of missing values or outliers, but please explain how you determined your valid dataset.
# there are missing values in the data set, which will hamper the linear models, so they must be removed. Remove the NA values:
H_df <- na.omit(Hitters[, c("Salary", "CRBI", "CAtBat", "CHmRun", "CRuns", "CHits", "CWalks")])
dim(H_df)
[1] 263 7
sum(is.na(H_df))  # hypothetical: the code line producing the output below is missing; a check like this would confirm no NAs remain
[1] 0
# find outliers based on the boxplot rule
findOutliers <- function(x) {
  outliers <- boxplot(x, plot = FALSE)$out
  which(x %in% outliers)
}
# remove all outliers, one variable at a time
# (note: this assumes each variable has at least one outlier; an empty index would drop all rows)
H_df <- H_df[-findOutliers(H_df$CHits), ]
H_df <- H_df[-findOutliers(H_df$CRBI), ]
H_df <- H_df[-findOutliers(H_df$CAtBat), ]
H_df <- H_df[-findOutliers(H_df$CHmRun), ]
H_df <- H_df[-findOutliers(H_df$CRuns), ]
H_df <- H_df[-findOutliers(H_df$CWalks), ]
H_df <- H_df[-findOutliers(H_df$Salary), ]
dim(H_df)  # 212 records left
[1] 212 7
# subset the train and test sets
set.seed(55)
train_index = sample(1:nrow(H_df), 0.8 * nrow(H_df))
H_train = H_df[train_index, ]
H_test = H_df[-train_index, ]
# build the linear regression model
lm_fit <- lm(Salary ~ CRBI + CAtBat + CHmRun + CRuns + CHits + CWalks, data = H_train)
# prediction on the training set
pred_train <- predict(lm_fit)
# calculate the RMSE on the training set
lm_train_rmse <- sqrt(mean((H_train$Salary - pred_train)^2, na.rm = TRUE))
# prediction and RMSE on the test set
pred_test <- predict(lm_fit, H_test)
lm_test_rmse <- sqrt(mean((H_test$Salary - pred_test)^2, na.rm = TRUE))
cbind(Model = "LM", `Training RMSE` = lm_train_rmse, `Test RMSE` = lm_test_rmse)
     Model Training RMSE      Test RMSE
[1,] "LM"  "209.254474393527" "193.509114635474"
8b (7 points)
Repeat question 8a using KNN with 5 neighbors.
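The code that builds the feature-only objects for knn is not shown in the document. Based on the labels being taken from column 1 (Salary) below, it was presumably something like the following sketch; the names knn_train and knn_test are taken from the later calls, and str(knn_test) would produce the output shown next:
# hypothetical reconstruction: feature columns only (everything except Salary)
knn_train <- H_train[, -1]
knn_test <- H_test[, -1]
str(knn_test)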
'data.frame': 43 obs. of 6 variables:
$ CRBI : int 46 37 186 290 251 182 132 69 40 108 ...
$ CAtBat: int 396 509 1876 3231 1924 2308 1309 926 514 1064 ...
$ CHmRun: int 12 0 15 36 67 32 27 9 8 11 ...
$ CRuns : int 48 41 192 376 242 349 126 118 57 123 ...
$ CHits : int 101 108 467 825 489 633 308 210 120 290 ...
$ CWalks: int 33 12 161 238 240 308 66 114 39 55 ...
# separate the outcome (Salary, column 1) into label vectors
knn_train_labels <- H_train[, 1]
knn_test_labels <- H_test[, 1]
# build the knn model on the train and test sets
# note: class::knn is a classifier, so it treats each distinct Salary value as a class label
knn_training_pred <- knn(knn_train, knn_train, knn_train_labels, k = 5)
knn_test_pred <- knn(knn_train, knn_test, knn_train_labels, k = 5)
# calculate the RMSE
# caution: as.numeric() on a factor returns the level index, not the underlying salary value;
# as.numeric(as.character(...)) would recover the predicted salaries (see the regression sketch below)
knn_training_rmse <- sqrt(mean((as.numeric(knn_training_pred) - knn_train_labels)^2, na.rm = TRUE))
knn_test_rmse <- sqrt(mean((as.numeric(knn_test_pred) - knn_test_labels)^2, na.rm = TRUE))
cbind(Model = "KNN",`Training RMSE` = knn_training_rmse, `Test RMSE` = knn_test_rmse) Model Training RMSE Test RMSE
[1,] "KNN" "480.960064825178" "483.000987137948"