library(ISLR)
library(scales)
library(knitr)
library(rmarkdown)
library(prettydoc)
set.seed(41)

Question 0: Survey (2 points)

I would like to get an idea of your interests in Machine Learning and your background. Please tell me:

The assignment includes a spreadsheet survey.csv. Please fill out the information and submit it along with your homework.

Question 1 (2 points)

We will be using Rstudio in this class to implement the algorithms we learn in class. The goal of this assignment is to get you proficient in the basics of R, such as writing scripts and plotting. If you are having trouble, you can find help by searching the internet (often searching for the specific error message is helpful), reading Data Mining with R by Luis Torgo or R in a Nutshell by Joseph Adler, asking your friends, and coming to office hours. The computing homework needs to be submitted with your name and Uni# with Rmarkdown file and a pdf with the code and explanation.

Install the R and RStudio programs on your computer. Then, inside of RStudio, use the install.packages function to install RMarkdown. Then, in the code chunk below, type version.

               _                           
platform       x86_64-apple-darwin15.6.0   
arch           x86_64                      
os             darwin15.6.0                
system         x86_64, darwin15.6.0        
status                                     
major          3                           
minor          6.0                         
year           2019                        
month          04                          
day            26                          
svn rev        76424                       
language       R                           
version.string R version 3.6.0 (2019-04-26)
nickname       Planting of a Tree          

Question 2 (10 points)

2a (5 points)

Write a function called even.or.odd. Its parameter x will be a numeric vector. Return a character vector that says “odd” for odd numbers and “even” for the even numbers. The results should correctly classify every value. To determine if a number is even or odd, you can use the modulus operator %% (e.g.: 5%%3 = 2). Note: Try to find a solution that uses vector logic instead of a for loop. In R, this is a good programming practice that will speed up your programs.

Display the results of this function on the vector 1:5.

[1] "odd"  "even" "odd"  "odd"  "odd" 

2b (5 points)

Write a function my.sum that computes the sum of a numeric vector x. The user can also specify the.type as a character. If the.type is “even”, the function will compute the sum of only the even values in x. If the.type is “odd”, the function will compute the sum of only the odd values in x. If the.type is “all”, the function will compute the sum of the entire vector x. Within the function, you may use the built-in sum function. The function should omit missing values (NA) from the sum. This can be done using the na.rm argument within the sum function.

Display the results of this function for odd, even, and all values of the vector 1:5.

[1] 6

Question 3 (10 points)

Load package datasets and load the iris data set. We will try to predict the species of iris from the sepal’s length and width and the petal’s length and width using k−nearest neighbors.

3a (5 points)

Divide the data into training and testing sets. To do so, let’s create an assignment vector called training_row. Each row of the data set will be assigned to the training set (with training_row set to TRUE) with probability 0.8 or to the test set (with training_row set to FALSE) with probability 0.2. Use the sample function to create the training_row vector of TRUE and FALSE values. The vector should be as long as the number of rows in the iris data set.

Then, divide the iris data set into separate training and test sets according to the training_row assignments.

In order to obtain consistent results, we’ll need to set the seed of R’s pseudo-random number generator. To do so, use set.seed(41) in the code chunk labeled constants above.

3b (5 points)

Use the function knn from the package class with k = 2 to classify the data. What proportion of the values are misclassified on the testing set?

Note: In order to use knn, the train and test objects must only include the columns that are used to make the classification. The Species will need to be separated into the cl vector and removed from the train and test objects.

[1] 0.1

Question 4 (8 points)

Now perform the knn classification for each k value from 1 to 50. For each value of k, compute the percentage of misclassified values on the testing set. Print out your results as a table showing the values of k and the misclassification rates. You can use the datatable function in the DT package to display an HTML-friendly table.

Note: It would help to write a function that performs the knn computation and computes the misclassification rates.

   k_value   mis_rate
1        1 0.10000000
2        2 0.10000000
3        3 0.10000000
4        4 0.06666667
5        5 0.06666667
6        6 0.03333333
7        7 0.00000000
8        8 0.06666667
9        9 0.00000000
10      10 0.00000000
11      11 0.00000000
12      12 0.03333333
13      13 0.00000000
14      14 0.03333333
15      15 0.00000000
16      16 0.00000000
17      17 0.00000000
18      18 0.00000000
19      19 0.00000000
20      20 0.03333333
21      21 0.06666667
22      22 0.10000000
23      23 0.10000000
24      24 0.10000000
25      25 0.06666667
26      26 0.06666667
27      27 0.06666667
28      28 0.06666667
29      29 0.06666667
30      30 0.10000000
31      31 0.10000000
32      32 0.10000000
33      33 0.10000000
34      34 0.13333333
35      35 0.13333333
36      36 0.13333333
37      37 0.10000000
38      38 0.13333333
39      39 0.13333333
40      40 0.13333333
41      41 0.13333333
42      42 0.13333333
43      43 0.13333333
44      44 0.13333333
45      45 0.13333333
46      46 0.13333333
47      47 0.13333333
48      48 0.13333333
49      49 0.13333333
50      50 0.13333333

Question 5 (20 points)

Use your answers from Question 4 to display the results in the questions below.

5a (5 points)

Plot the misclassification rates on the testing set versus the value of k. Use the plot function. Try different values of the arguments (las, xlim, ylim, xlab, ylab, cex, main) to create a nicer display. Use type = “both” to display both the points and a line.

5b (5 points)

Now create the same plot placing k on a *logarithmic** scale. Make sure to change the label of the x axis to distinguish this.

5c (10 points)

Let’s examine how the results would change if we were to run the knn classifier multiple times. Perform the following steps:

  1. Re-perform the previous work 3 more times. Each time, you should create a new training and test set, apply knn on each value of k from 1 to 50, and compute the misclassification rates on the testing set.

  2. Plot the results of the earlier work along with the 3 new iterations on a single plot. Use the lines function to add additional lines to the earlier plot from 5a (using the linear scale). Use different colors, line types (lty), and point characters (pch) to distinguish the lines.

  3. Use the legend command to place a legend in the top left corner (x = “topleft”) of the plot. Use the same colors and point characters to display which line is which. Label the iterations 1 through 4.

#now create a function to run KNN 
knn_mis_rate <- function(k){
  #create train and test, train_labels and test_labels
  training_row = sample(1:nrow (iris) , 0.8*nrow (iris))
  train = iris[ training_row, c(1:4)]
  test = iris[- training_row, c(1:4)]
  train_labels <- iris [training_row, 5] 
  test_labels <- iris[-training_row, 5]  
  #apply knn function
  pred <- knn (train, test, train_labels, k)
  #calculate mis_rate
  mis_rate <- mean(pred != test_labels)
  return(mis_rate)
}

#knn model output
knn_result <- function(k_values = 1:50){
  k_values = 1:50
  mis_rate = c()
  for (k in 1:length(k_values)){ 
    mis_rate[k] = knn_mis_rate(k)
    }
  cbind.data.frame(k_values, mis_rate)
}

#run 3 more imes
knn2 <- knn_result(k_values = 1:50)
knn3 <- knn_result(k_values = 1:50)
knn4 <- knn_result(k_values = 1:50)

#plot multiple lines for each of three groups, on the plot of group 1
plot(mis_df$k_value, mis_df$mis_rate, main = "Misclassification Rates vs Neighbors", xlab = "k, number of neighbors", ylab = "Misclassification Rates", col = "black",lty =1, pch = 1,type = "b")
#add lines of knn2, knn3, knn4
lines( mis_rate ~ k_values, knn2, col="blue" , lty=2,pch = 2,type = "b")
lines( mis_rate ~ k_values, knn3, col ="red", lty=3,pch = 3,type = "b")
lines(mis_rate ~ k_values, knn4, col ="green", lty=4,pch = 4,type = "b")
#add legend
legend("topleft", 
  legend = c("Group 1", "Group 2","Group 3", "Group 4"), 
  col = c("black","blue","red","green"), 
  bty = "n", 
  cex = 0.8, 
  text.col = c("black","blue","red","green"), 
  horiz = F,
  lty=c(1,2,3,4),
  pch = c(1,2,3,4),
  inset = c(0.1, 0.1,0.1))

Question 6 (22 points)

Here we’ll work with the Hitters database from the ISLR library, which contains Major League Baseball Data from the 1986 and 1987 seasons (322 observations on 20 variables). For a description of the variables go to: https://rdrr.io/cran/ISLR/man/Hitters.html Install the ISLR package in R if you have not done so already

6a (2 points)

What are the dimensions of the data set?

6b (2 points)

How many salaries are missing (NA)?

[1] 59

6c (2 points)

What is the maximum number of career home runs?

[1] 548

6d (2 points)

Compute the min, median, mean, and max of Hits, Home Runs, and Runs for a season (not career totals). Remove any missing values from the calculations. Round your results to 1 decimal place.

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      1      60     100     100     100     200 
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      0       4       8      10      20      40 
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      0      30      50      50      70     100 

6e (2 points)

What percentage of these players’s seasons had at least 100 hits and 20 home runs? Use the percent function in the scales package to convert a decimal proportion to a percentage.

[1] "16.8%"

6f (2 points)

What is the relationship between different pairs of variables? Let’s look at Salary, Hits, Runs, HmRun, Errors, and Assists. Use the pairs function to display scatterplots of each pair of these variables.

6g (2 points)

Based on these scatterplots, which variables appear to be correlated with Salary, and which ones appear to have little or no correlation with Salary? Provide a short explanation for your assessment.

Based on these scatterplots, Hits, Runs, HmRun appear to be positively correlated with Salary because the there’re clear positive associate for Salary-Hits, Salary-Runs, Salary-HmRun, which represent moderate linear relationships. Errors and Assists appear to have little or no correlation with Salary as the points are dispersed without clear cirection and form. To quantify the correlations, a correlation matrix can be applied using the function: cor(df[,c(“Salary”,“Hits”,“Runs”,“HmRun”,“Errors”,“Assists”)], use = “complete.obs”)

6h (2 points)

Create a new variable called HighRBI for those players with at least 75 RBI (TRUE). Players with less than 75 RBI should have the value FALSE.

6j (2 points)

What is the correlation of HighRBI, Home Runs, Hits, Runs, Assists, and Errors with Salary? Use only the cases in which both variables are measured. Round the answer to two decimal places.

        Salary Hits Runs HmRun Assists Errors HighRBI
Salary    1.00 0.44 0.42  0.34    0.03  -0.01    0.37
Hits      0.44 1.00 0.91  0.53    0.30   0.28    0.55
Runs      0.42 0.91 1.00  0.63    0.18   0.19    0.54
HmRun     0.34 0.53 0.63  1.00   -0.16  -0.01    0.70
Assists   0.03 0.30 0.18 -0.16    1.00   0.70    0.00
Errors   -0.01 0.28 0.19 -0.01    0.70   1.00    0.06
HighRBI   0.37 0.55 0.54  0.70    0.00   0.06    1.00

6k (2 points)

How did the salaries differ for players with and without HighRBI? Use the boxplot function and split the salary data by HighRBI status. Do HighRBI players have a higher median salary?

6l (2 points)

Show a histogram of home runs using the hist function with breaks = 20 and freq = FALSE.

Question 7 (10 points)

7a (2 points)

What is the mean and standard deviation of Hits, Runs, Home Runs, RBI, Assists, Errors, and Salaries? Remove any missing values from the calculations. Round the answers to 1 decimal place.

      Hits Runs HmRun  RBI Assists Errors Salary
mean 101.0 50.9  10.8 48.0   106.9    8.0  535.9
sd    46.5 26.0   8.7 26.2   136.9    6.4  451.1

7b (3 points)

Some players only get to play part-time. Show the mean and standard deviations for the same variables as in the previous question only for players with at least 300 AtBat.

      Hits Runs HmRun  RBI Assists Errors Salary
mean 126.3 64.3  13.8 60.3   133.4    9.4  617.1
sd    35.2 21.3   9.0 23.2   155.0    6.8  467.7

7c (3 points)

Show a scatter plot of Salary versus Home Runs for players with at least 300 AtBat.

7d (2 points)

There is a player with zero home runs and a salary over 2,000 (more than 2 million dollars). Who is this player? What does it look like happened during the season? Are these numbers accurate? Use the internet to search for this player’s results in 1986 and 1987.

              AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun
-Mike Schmidt    20    1     0    0   0     0     2     41     9      2
              CRuns CRBI CWalks League Division PutOuts Assists Errors
-Mike Schmidt     6    7      4      N        E      78     220      6
                Salary NewLeague
-Mike Schmidt 2127.333         N

Question 8 (14 points)

After exploring the Hitters data so extensively, you are asked to build a regression model to predict the hitter’s salary.

8a (7 points)

Build a linear regression model and explain how (or why) you choose certain predictors in your model. Use 70% of the valid data for training and the remaining 30% of the valid data for testing. Please report the Room Mean Squared Error of the model on both the training and testing sets. Note that, what data are considered as “valid” is up to you based on your data exploration. For example, you can exclude certain data because of either missing data or outliers. But please explain how you determine your validate dataset.

[1] 263   7
[1] 0
[1] 212   7
     Model Training RMSE      Test RMSE         
[1,] "LM"  "209.254474393527" "193.509114635474"

8b (7 points)

Repeat question 8a using KNN with 5 neighbors.

'data.frame':   43 obs. of  6 variables:
 $ CRBI  : int  46 37 186 290 251 182 132 69 40 108 ...
 $ CAtBat: int  396 509 1876 3231 1924 2308 1309 926 514 1064 ...
 $ CHmRun: int  12 0 15 36 67 32 27 9 8 11 ...
 $ CRuns : int  48 41 192 376 242 349 126 118 57 123 ...
 $ CHits : int  101 108 467 825 489 633 308 210 120 290 ...
 $ CWalks: int  33 12 161 238 240 308 66 114 39 55 ...
     Model Training RMSE      Test RMSE         
[1,] "KNN" "480.960064825178" "483.000987137948"