Because we have observed quantitative responses (Y values), we can use them as a reference to judge the accuracy of our predicted Y values. In this case, ‘ii’ is a vector storing randomly selected row indices. We use it to split our dataset (cd) into two subsets: training data (cdtr) and testing data (cdte).
n = nrow(cd) # 1000
set.seed(80) # set the random seed for reproducibility (the same rows are drawn on every run)
pin = .80 # proportion of data we will use for training, 80% of data in this case
ii = sample(1:n, floor(pin*n)) # draws floor(0.80 * 1000) = 800 distinct row indices at random (sample() draws without replacement by default)
cdtr = cd[ii,] # Training data set: the rows of cd whose indices are in ii (e.g. 3, 7, 12, ..., 999)
cdte = cd[-ii,] # Test data set, select all rows of cd *not* in ii
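A quick sanity check on the split above can be sketched as follows. Since the real cd is not shown here, this sketch builds a hypothetical stand-in data frame with 1000 rows and made-up columns x and y (both are assumptions for illustration only), then repeats the split and confirms that the two subsets have the expected sizes and share no rows.

```r
# Hypothetical stand-in for cd: 1000 rows with made-up columns x and y.
# Replace this with the real dataset.
cd = data.frame(x = rnorm(1000), y = rnorm(1000))

n = nrow(cd)
set.seed(80)
pin = .80
ii = sample(1:n, floor(pin*n))
cdtr = cd[ii,]
cdte = cd[-ii,]

# The training set has floor(pin*n) rows and the test set has the rest
stopifnot(nrow(cdtr) == floor(pin*n))
stopifnot(nrow(cdtr) + nrow(cdte) == n)

# No index was drawn twice, and no row appears in both subsets
stopifnot(anyDuplicated(ii) == 0)
stopifnot(length(intersect(rownames(cdtr), rownames(cdte))) == 0)
```

If any of these checks fails, stopifnot() stops with an error naming the failed condition, so a silent pass means the split behaves as described.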