Question 0: Survey (2 points)
I would like to get an idea of your interests in Machine Learning and your background. Please tell me:
Your last degree program (B.A., M.A., Ed.D., Ph.D., etc.)? Your major and year?
Your proficiency with Calculus – Scale 1 (Beginner) to 5 (Advanced):
Your proficiency with Linear Algebra – Scale 1 (Beginner) to 5 (Advanced):
Your statistical proficiency (Probability, Statistical Distributions) – Scale 1 (Beginner) to 5 (Advanced):
Your proficiency with Regression Models (Linear, Logistic, etc.) – Scale 1 (Beginner) to 5 (Advanced):
Your proficiency with R programming (Plots, Functions, RMarkdown) – Scale 1 (Beginner) to 5 (Advanced):
The assignment includes a spreadsheet survey.csv. Please fill out the information and submit it along with your homework.
Question 1 (2 points)
We will be using RStudio in this class to implement the algorithms we learn in class. The goal of this assignment is to get you proficient in the basics of R, such as writing scripts and plotting. If you are having trouble, you can find help by searching the internet (often searching for the specific error message is helpful), reading Data Mining with R by Luis Torgo or R in a Nutshell by Joseph Adler, asking your friends, and coming to office hours. The computing homework must be submitted with your name and UNI, as an RMarkdown file plus a PDF containing the code and explanation.
Install the R and RStudio programs on your computer. Then, inside RStudio, use the install.packages function to install RMarkdown. Finally, in the code chunk below, type version.
_
platform x86_64-apple-darwin15.6.0
arch x86_64
os darwin15.6.0
system x86_64, darwin15.6.0
status
major 3
minor 6.0
year 2019
month 04
day 26
svn rev 76424
language R
version.string R version 3.6.0 (2019-04-26)
nickname Planting of a Tree
Question 2 (10 points)
2a (5 points)
Write a function called even.or.odd. Its parameter x will be a numeric vector. Return a character vector that says “odd” for odd numbers and “even” for the even numbers. The results should correctly classify every value. To determine if a number is even or odd, you can use the modulus operator %% (e.g.: 5%%3 = 2). Note: Try to find a solution that uses vector logic instead of a for loop. In R, this is a good programming practice that will speed up your programs.
Display the results of this function on the vector 1:5.
x <- c(1, 4, 5, 7, 9)
even.or.odd <- function(x) {
  ifelse(x %% 2 == 0, 'even', 'odd')
}
even.or.odd(x)
[1] "odd"  "even" "odd"  "odd"  "odd"
2b (5 points)
Write a function my.sum that computes the sum of a numeric vector x. The user can also specify the.type as a character. If the.type is “even”, the function will compute the sum of only the even values in x. If the.type is “odd”, the function will compute the sum of only the odd values in x. If the.type is “all”, the function will compute the sum of the entire vector x. Within the function, you may use the built-in sum function. The function should omit missing values (NA) from the sum. This can be done using the na.rm argument within the sum function.
Display the results of this function for odd, even, and all values of the vector 1:5.
my.sum <- function(x, the.type) {
  # sum only the even values
  if (the.type == "even") {
    sum(x[x %% 2 == 0], na.rm = TRUE)
  # sum only the odd values
  } else if (the.type == "odd") {
    sum(x[x %% 2 == 1], na.rm = TRUE)
  # sum the entire vector
  } else if (the.type == "all") {
    sum(x, na.rm = TRUE)
  }
}
my.sum(1:5, the.type = "even")
[1] 6
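The question also asks for the odd and all cases of the vector 1:5; those could be displayed the same way (the sums follow directly from the arithmetic):
my.sum(1:5, the.type = "odd")   # odd values 1, 3, 5 sum to 9
my.sum(1:5, the.type = "all")   # all values 1 through 5 sum to 15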
Question 3 (10 points)
Load the package datasets and load the iris data set. We will try to predict the species of iris from the sepal’s length and width and the petal’s length and width using k-nearest neighbors.
3a (5 points)
Divide the data into training and testing sets. To do so, let’s create an assignment vector called training_row. Each row of the data set will be assigned to the training set (with training_row set to TRUE) with probability 0.8 or to the test set (with training_row set to FALSE) with probability 0.2. Use the sample function to create the training_row vector of TRUE and FALSE values. The vector should be as long as the number of rows in the iris data set.
Then, divide the iris data set into separate training and test sets according to the training_row assignments.
In order to obtain consistent results, we’ll need to set the seed of R’s pseudo-random number generator. To do so, use set.seed(41) in the code chunk labeled constants above.
library(datasets)
set.seed(41)
##create normalization function
##nor <-function(x) { (x - min(x)) / (max(x) - min(x)) }
#run normalization function on first 4 columns of df
##df_norm <- as.data.frame (lapply (df [ , c(1,2,3,4) ], nor))
#extract training and testing set
training_row = sample(1:nrow(iris), 0.8 * nrow(iris))
train = iris[training_row, c(1:4)]
test = iris[-training_row, c(1:4)]
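Note that the question describes training_row as a logical (TRUE/FALSE) vector; the index-based split above is equivalent in effect. For reference, a sketch of the literal logical-vector version (not run here; the remainder of the solution continues from the index-based split) could look like:
# hypothetical sketch: logical assignment vector, TRUE with probability 0.8, FALSE with probability 0.2
training_row <- sample(c(TRUE, FALSE), size = nrow(iris), replace = TRUE, prob = c(0.8, 0.2))
train <- iris[training_row, 1:4]
test <- iris[!training_row, 1:4]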
3b (5 points)
Use the function knn from the package class with k = 2 to classify the data. What proportion of the values are misclassified on the testing set?
Note: In order to use knn, the train and test objects must only include the columns that are used to make the classification. The Species will need to be separated into the cl vector and removed from the train and test objects.
library(class)
# separate the variable of interest (Species) into the cl vectors
train_labels <- iris[training_row, 5]
test_labels <- iris[-training_row, 5]
# predict the classification of the test set using train_labels
pred <- knn(train, test, cl = train_labels, k = 2)
# calculate the misclassification rate
mean(test_labels != pred)
[1] 0.1
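To see where the errors occur, the predictions could also be cross-tabulated against the true species (a small optional sketch using the objects above):
# confusion table of predicted versus actual species on the test set
table(Predicted = pred, Actual = test_labels)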
Question 4 (8 points)
Now perform the knn classification for each k value from 1 to 50. For each value of k, compute the percentage of misclassified values on the testing set. Print out your results as a table showing the values of k and the misclassification rates. You can use the datatable function in the DT package to display an HTML-friendly table.
Note: It would help to write a function that performs the knn computation and computes the misclassification rates.
# define the range of k values and preallocate the misclassification rates
k_value = 1:50
mis_rate <- rep(0, length(k_value))
for (k in k_value) {
  # make predictions using knn
  pred <- knn(train, test, train_labels, k = k)
  # calculate the misclassification rate and store it in mis_rate[k]
  mis_rate[k] <- mean(pred != test_labels)
}
# load DT
library(DT)
#create dataframe
mis_df <- data.frame (k_value, mis_rate)
mis_df
k_value mis_rate
1 1 0.10000000
2 2 0.10000000
3 3 0.10000000
4 4 0.06666667
5 5 0.06666667
6 6 0.03333333
7 7 0.00000000
8 8 0.06666667
9 9 0.00000000
10 10 0.00000000
11 11 0.00000000
12 12 0.03333333
13 13 0.00000000
14 14 0.03333333
15 15 0.00000000
16 16 0.00000000
17 17 0.00000000
18 18 0.00000000
19 19 0.00000000
20 20 0.03333333
21 21 0.06666667
22 22 0.10000000
23 23 0.10000000
24 24 0.10000000
25 25 0.06666667
26 26 0.06666667
27 27 0.06666667
28 28 0.06666667
29 29 0.06666667
30 30 0.10000000
31 31 0.10000000
32 32 0.10000000
33 33 0.10000000
34 34 0.13333333
35 35 0.13333333
36 36 0.13333333
37 37 0.10000000
38 38 0.13333333
39 39 0.13333333
40 40 0.13333333
41 41 0.13333333
42 42 0.13333333
43 43 0.13333333
44 44 0.13333333
45 45 0.13333333
46 46 0.13333333
47 47 0.13333333
48 48 0.13333333
49 49 0.13333333
50 50 0.13333333
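Since the DT package is already loaded, the same table could also be rendered as the HTML widget the question mentions (a minimal sketch; the display only appears in HTML output of the R Markdown document):
# interactive HTML table of k values and misclassification rates
datatable(mis_df, rownames = FALSE, options = list(pageLength = 10))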
Question 5 (20 points)
Use your answers from Question 4 to display the results in the questions below.
5a (5 points)
Plot the misclassification rates on the testing set versus the value of k. Use the plot function. Try different values of the arguments (las, xlim, ylim, xlab, ylab, cex, main) to create a nicer display. Use type = “both” to display both the points and a line.
# plot the misclassification rates against k (type = "b" draws both the points and a line)
plot(k_value, mis_rate, main = "Misclassification Rates vs Neighbors",
     xlab = "k, number of neighbors", ylab = "Misclassification Rates",
     cex = 0.5, type = "b")
5b (5 points)
Now create the same plot, placing k on a logarithmic scale. Make sure to change the label of the x-axis to distinguish this.
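No code for this part appears in the document; one possible sketch, reusing k_value and mis_rate from Question 4 and plot's log argument, would be:
# same display with k on a logarithmic x-axis
plot(k_value, mis_rate, log = "x",
     main = "Misclassification Rates vs Neighbors (log scale of k)",
     xlab = "k, number of neighbors (log scale)",
     ylab = "Misclassification Rates", cex = 0.5, type = "b")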
5c (10 points)
Let’s examine how the results would change if we were to run the knn classifier multiple times. Perform the following steps:
Re-perform the previous work 3 more times. Each time, you should create a new training and test set, apply knn on each value of k from 1 to 50, and compute the misclassification rates on the testing set.
Plot the results of the earlier work along with the 3 new iterations on a single plot. Use the lines function to add additional lines to the earlier plot from 5a (using the linear scale). Use different colors, line types (lty), and point characters (pch) to distinguish the lines.
Use the legend command to place a legend in the top left corner (x = “topleft”) of the plot. Use the same colors and point characters to display which line is which. Label the iterations 1 through 4.
# now create a function to run KNN for a single value of k
# note: each call draws a new random train/test split of the iris data
knn_mis_rate <- function(k) {
  # create the train and test sets, train_labels and test_labels
  training_row = sample(1:nrow(iris), 0.8 * nrow(iris))
  train = iris[training_row, c(1:4)]
  test = iris[-training_row, c(1:4)]
  train_labels <- iris[training_row, 5]
  test_labels <- iris[-training_row, 5]
  # apply the knn function
  pred <- knn(train, test, train_labels, k)
  # calculate the misclassification rate
  mis_rate <- mean(pred != test_labels)
  return(mis_rate)
}
# compute the misclassification rate for each value of k and return a data frame
knn_result <- function(k_values = 1:50) {
  mis_rate <- numeric(length(k_values))
  for (i in seq_along(k_values)) {
    mis_rate[i] <- knn_mis_rate(k_values[i])
  }
  cbind.data.frame(k_values, mis_rate)
}
# run 3 more times
knn2 <- knn_result(k_values = 1:50)
knn3 <- knn_result(k_values = 1:50)
knn4 <- knn_result(k_values = 1:50)
# plot the results of group 1, then overlay the lines for the three new groups
plot(mis_df$k_value, mis_df$mis_rate, main = "Misclassification Rates vs Neighbors",
     xlab = "k, number of neighbors", ylab = "Misclassification Rates",
     col = "black", lty = 1, pch = 1, type = "b")
# add the lines of knn2, knn3, knn4
lines(mis_rate ~ k_values, knn2, col = "blue", lty = 2, pch = 2, type = "b")
lines(mis_rate ~ k_values, knn3, col = "red", lty = 3, pch = 3, type = "b")
lines(mis_rate ~ k_values, knn4, col = "green", lty = 4, pch = 4, type = "b")
# add a legend in the top left corner
legend("topleft",
       legend = c("Group 1", "Group 2", "Group 3", "Group 4"),
       col = c("black", "blue", "red", "green"),
       bty = "n",
       cex = 0.8,
       text.col = c("black", "blue", "red", "green"),
       horiz = FALSE,
       lty = c(1, 2, 3, 4),
       pch = c(1, 2, 3, 4),
       inset = c(0.1, 0.1))
Question 6 (22 points)
Here we’ll work with the Hitters data set from the ISLR library, which contains Major League Baseball data from the 1986 and 1987 seasons (322 observations on 20 variables). For a description of the variables, go to: https://rdrr.io/cran/ISLR/man/Hitters.html. Install the ISLR package in R if you have not done so already.
6d (2 points)
Compute the min, median, mean, and max of Hits, Home Runs, and Runs for a season (not career totals). Remove any missing values from the calculations. Round your results to 1 decimal place.
# create a helper around the summary function: it takes a vector and returns the Min, Median, Mean, and Max (along with the 1st and 3rd quartiles)
cal <- function(x) { summary(x, digits = 0, na.rm = T) }
cal(H_df$Hits)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1 60 100 100 100 200
cal(H_df$HmRun)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0 4 8 10 20 40
cal(H_df$Runs)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0 30 50 50 70 100
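The question asks for results rounded to 1 decimal place, while summary with digits = 0 rounds much more aggressively. A sketch of how the requested statistics could be computed directly (stats_1dp is a hypothetical helper name; H_df is assumed from the earlier parts of this question) is:
# min, median, mean, and max rounded to 1 decimal place, NAs removed
stats_1dp <- function(x) {
  round(c(min = min(x, na.rm = TRUE), median = median(x, na.rm = TRUE),
          mean = mean(x, na.rm = TRUE), max = max(x, na.rm = TRUE)), 1)
}
sapply(H_df[, c("Hits", "HmRun", "Runs")], stats_1dp)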
6e (2 points)
What percentage of these players’ seasons had at least 100 hits and 20 home runs? Use the percent function in the scales package to convert a decimal proportion to a percentage.
[1] "16.8%"
6f (2 points)
What is the relationship between different pairs of variables? Let’s look at Salary, Hits, Runs, HmRun, Errors, and Assists. Use the pairs function to display scatterplots of each pair of these variables.
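No code is shown for this part; a minimal sketch (assuming the working data frame H_df still contains these columns) would be:
# scatterplot matrix of each pair of the selected variables
pairs(H_df[, c("Salary", "Hits", "Runs", "HmRun", "Errors", "Assists")])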
6g (2 points)
Based on these scatterplots, which variables appear to be correlated with Salary, and which ones appear to have little or no correlation with Salary? Provide a short explanation for your assessment.
Based on these scatterplots, Hits, Runs, and HmRun appear to be positively correlated with Salary: the Salary-Hits, Salary-Runs, and Salary-HmRun panels show clear positive associations, suggesting moderate linear relationships. Errors and Assists appear to have little or no correlation with Salary, as their points are dispersed without a clear direction or form. To quantify the correlations, a correlation matrix can be computed with cor(df[, c("Salary", "Hits", "Runs", "HmRun", "Errors", "Assists")], use = "complete.obs").
6h (2 points)
Create a new variable called HighRBI for those players with at least 75 RBI (TRUE). Players with less than 75 RBI should have the value FALSE.
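The code creating HighRBI is not shown; a sketch (assuming the working data frame is H_df; the code in 6j below suggests the variable was in fact stored as a factor rather than a logical) could be:
# TRUE for players with at least 75 RBI in the season, FALSE otherwise
H_df$HighRBI <- H_df$RBI >= 75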
6i (2 points)
What percentage of hitters qualified as HighRBI during these seasons?
# subset the data frame to the hitters with HighRBI == TRUE and count the rows
n_HighRBI <- nrow(subset(H_df, HighRBI == 'TRUE'))
# calculate the percentage
percent(n_HighRBI / nrow(H_df))
[1] "18.6%"
6j (2 points)
What is the correlation of HighRBI, Home Runs, Hits, Runs, Assists, and Errors with Salary? Use only the cases in which both variables are measured. Round the answer to two decimal places.
# we can calculate Pearson's correlation coefficient for the pairs of numeric variables - Salary, Home Runs, Hits, Runs, Assists, and Errors - in order to quantify their associations.
# To summarize the relationship between a categorical variable (HighRBI) and a numeric variable (Salary), a common option is a graphical summary such as a box plot (shown in question 6k below). Here, however, we can convert HighRBI from a factor to a logical and treat it as numeric when calculating correlation coefficients.
# convert the data type from factor to logical
H_df$HighRBI <- as.logical.factor(H_df$HighRBI)
# calculate the correlation coefficient matrix for the numeric variables, rounded to 2 decimals
round(cor(H_df[, c("Salary", "Hits", "Runs", "HmRun", "Assists", "Errors", "HighRBI")], use = "complete.obs"), 2)
Salary Hits Runs HmRun Assists Errors HighRBI
Salary 1.00 0.44 0.42 0.34 0.03 -0.01 0.37
Hits 0.44 1.00 0.91 0.53 0.30 0.28 0.55
Runs 0.42 0.91 1.00 0.63 0.18 0.19 0.54
HmRun 0.34 0.53 0.63 1.00 -0.16 -0.01 0.70
Assists 0.03 0.30 0.18 -0.16 1.00 0.70 0.00
Errors -0.01 0.28 0.19 -0.01 0.70 1.00 0.06
HighRBI 0.37 0.55 0.54 0.70 0.00 0.06 1.00
6k (2 points)
How did the salaries differ for players with and without HighRBI? Use the boxplot function and split the salary data by HighRBI status. Do HighRBI players have a higher median salary?
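The boxplot code is not shown; a sketch (assuming H_df contains Salary and HighRBI, with Salary recorded in thousands of dollars as in the Hitters data) might be:
# salary distribution split by HighRBI status
boxplot(Salary ~ HighRBI, data = H_df,
        xlab = "HighRBI (at least 75 RBI)", ylab = "Salary (thousands of dollars)",
        main = "Salary by HighRBI status")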
6l (2 points)
Show a histogram of home runs using the hist function with breaks = 20 and freq = FALSE.
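No code appears here either; a minimal sketch (assuming the home run column HmRun in H_df) would be:
# density-scaled histogram of home runs (freq = FALSE shows densities rather than counts)
hist(H_df$HmRun, breaks = 20, freq = FALSE,
     xlab = "Home Runs", main = "Histogram of Home Runs")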
Question 7 (10 points)
7a (2 points)
What is the mean and standard deviation of Hits, Runs, Home Runs, RBI, Assists, Errors, and Salaries? Remove any missing values from the calculations. Round the answers to 1 decimal place.
# calculate the mean and standard deviation for Hits, Runs, Home Runs, RBI, Assists, Errors, and Salary
v <- c("Hits", "Runs", "HmRun", "RBI", "Assists", "Errors", "Salary")
multi.fun <- function(x) {
  c(mean = round(mean(x, na.rm = TRUE), 1), sd = round(sd(x, na.rm = TRUE), 1))
}
sapply(H_df[, v], multi.fun)
Hits Runs HmRun RBI Assists Errors Salary
mean 101.0 50.9 10.8 48.0 106.9 8.0 535.9
sd 46.5 26.0 8.7 26.2 136.9 6.4 451.1
7b (3 points)
Some players only get to play part-time. Show the mean and standard deviations for the same variables as in the previous question only for players with at least 300 AtBat.
Hits Runs HmRun RBI Assists Errors Salary
mean 126.3 64.3 13.8 60.3 133.4 9.4 617.1
sd 35.2 21.3 9.0 23.2 155.0 6.8 467.7
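The code producing this table is not shown; one sketch that would reproduce it, reusing v and multi.fun from 7a (and assuming H_df still contains the AtBat column), is:
# restrict to players with at least 300 at-bats, then apply the same summary function
sapply(H_df[H_df$AtBat >= 300, v], multi.fun)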
7c (3 points)
Show a scatter plot of Salary versus Home Runs for players with at least 300 AtBat.
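The plotting code is not shown; a minimal sketch (assuming H_df with AtBat, HmRun, and Salary) could be:
# scatter plot of Salary versus Home Runs for players with at least 300 at-bats
with(H_df[H_df$AtBat >= 300, ],
     plot(HmRun, Salary, xlab = "Home Runs", ylab = "Salary (thousands of dollars)",
          main = "Salary vs Home Runs (AtBat >= 300)"))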
7d (2 points)
There is a player with zero home runs and a salary over 2,000 (more than 2 million dollars). Who is this player? What does it look like happened during the season? Are these numbers accurate? Use the internet to search for this player’s results in 1986 and 1987.
AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun
-Mike Schmidt 20 1 0 0 0 0 2 41 9 2
CRuns CRBI CWalks League Division PutOuts Assists Errors
-Mike Schmidt 6 7 4 N E 78 220 6
Salary NewLeague
-Mike Schmidt 2127.333 N
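The lookup code is not shown; the row above could be located with something like the following sketch:
# find the player(s) with zero home runs and a salary above 2000 (i.e., over 2 million dollars)
H_df[which(H_df$HmRun == 0 & H_df$Salary > 2000), ]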
Question 8 (14 points)
After exploring the Hitters data so extensively, you are asked to build a regression model to predict the hitter’s salary.
8a (7 points)
Build a linear regression model and explain how (or why) you chose certain predictors in your model. Use 70% of the valid data for training and the remaining 30% of the valid data for testing. Please report the Root Mean Squared Error of the model on both the training and testing sets. Note that what data are considered "valid" is up to you, based on your data exploration. For example, you can exclude certain data because of missing values or outliers, but please explain how you determined your valid dataset.
# there are missing values in the data set, which will hamper the linear models, so they must be removed. Remove the NA values:
H_df <- na.omit(Hitters[, c("Salary", "CRBI", "CAtBat", "CHmRun", "CRuns", "CHits", "CWalks")])
dim(H_df)
[1] 263 7
sum(is.na(H_df))  # hypothetical: the code line producing the output below is missing; a check like this would confirm no NAs remain
[1] 0
# find outliers based on the boxplot rule
findOutliers <- function(x) {
  outliers <- boxplot(x, plot = FALSE)$out
  which(x %in% outliers)
}
# remove all outliers, one variable at a time
# (note: this assumes each variable has at least one outlier; an empty index would drop all rows)
H_df <- H_df[-findOutliers(H_df$CHits), ]
H_df <- H_df[-findOutliers(H_df$CRBI), ]
H_df <- H_df[-findOutliers(H_df$CAtBat), ]
H_df <- H_df[-findOutliers(H_df$CHmRun), ]
H_df <- H_df[-findOutliers(H_df$CRuns), ]
H_df <- H_df[-findOutliers(H_df$CWalks), ]
H_df <- H_df[-findOutliers(H_df$Salary), ]
dim(H_df)  # 212 records left
[1] 212 7
# subset the train and test sets
set.seed(55)
train_index = sample(1:nrow(H_df), 0.8 * nrow(H_df))
H_train = H_df[train_index, ]
H_test = H_df[-train_index, ]
# build the linear regression model
lm_fit <- lm(Salary ~ CRBI + CAtBat + CHmRun + CRuns + CHits + CWalks, data = H_train)
# prediction on the training set
pred_train <- predict(lm_fit)
# calculate the RMSE on the training set
lm_train_rmse <- sqrt(mean((H_train$Salary - pred_train)^2, na.rm = TRUE))
# prediction and RMSE on the test set
pred_test <- predict(lm_fit, H_test)
lm_test_rmse <- sqrt(mean((H_test$Salary - pred_test)^2, na.rm = TRUE))
cbind(Model = "LM", `Training RMSE` = lm_train_rmse, `Test RMSE` = lm_test_rmse)
     Model Training RMSE      Test RMSE
[1,] "LM"  "209.254474393527" "193.509114635474"
8b (7 points)
Repeat question 8a using KNN with 5 neighbors.
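The code that builds the feature-only objects for knn is not shown in the document. Based on the labels being taken from column 1 (Salary) below, it was presumably something like the following sketch; the names knn_train and knn_test are taken from the later calls, and str(knn_test) would produce the output shown next:
# hypothetical reconstruction: feature columns only (everything except Salary)
knn_train <- H_train[, -1]
knn_test <- H_test[, -1]
str(knn_test)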
'data.frame': 43 obs. of 6 variables:
$ CRBI : int 46 37 186 290 251 182 132 69 40 108 ...
$ CAtBat: int 396 509 1876 3231 1924 2308 1309 926 514 1064 ...
$ CHmRun: int 12 0 15 36 67 32 27 9 8 11 ...
$ CRuns : int 48 41 192 376 242 349 126 118 57 123 ...
$ CHits : int 101 108 467 825 489 633 308 210 120 290 ...
$ CWalks: int 33 12 161 238 240 308 66 114 39 55 ...
# separate the outcome (Salary, column 1) into label vectors
knn_train_labels <- H_train[, 1]
knn_test_labels <- H_test[, 1]
# build the knn model on the train and test sets
# note: class::knn is a classifier, so it treats each distinct Salary value as a class label
knn_training_pred <- knn(knn_train, knn_train, knn_train_labels, k = 5)
knn_test_pred <- knn(knn_train, knn_test, knn_train_labels, k = 5)
# calculate the RMSE
# caution: as.numeric() on a factor returns the level index, not the underlying salary value;
# as.numeric(as.character(...)) would recover the predicted salaries (see the regression sketch below)
knn_training_rmse <- sqrt(mean((as.numeric(knn_training_pred) - knn_train_labels)^2, na.rm = TRUE))
knn_test_rmse <- sqrt(mean((as.numeric(knn_test_pred) - knn_test_labels)^2, na.rm = TRUE))
cbind(Model = "KNN",`Training RMSE` = knn_training_rmse, `Test RMSE` = knn_test_rmse) Model Training RMSE Test RMSE
[1,] "KNN" "480.960064825178" "483.000987137948"