KDnuggets is an interesting site on Data Science, Machine Learning, AI and Analytics.
One accepted meaning of the word “nugget” is “a small compact portion or unit: nuggets of information” (as described in thefreedictionary.com).
So, we can assume that KDnuggets articles are not intended to be exhaustive in depth.
Portilla uses the College data set from the R package ISLR, one of several packages that bundle data sets for exploring and learning statistics and analytics concepts.
The data set is a collection of college attributes, also called features or “columns”. Which term is used depends on the analyst's background.
Moving on to the data preprocessing stage, the author says that it is important to normalize data before training a neural network. Here, normalizing means rescaling each numeric column to a common range. Computer science, however, also has a discipline called “database management systems”, and inside that discipline “normalization” is a very important concept with a different meaning: it is applied to minimize unnecessary duplication of data and to keep the overhead of maintaining the data state as low as possible throughout the lifespan of a record inside an electronic information system.
It is important not to confuse the two meanings of normalization.
“The less the data is duplicated, the better for maintaining changes on it.”
# Install ISLR only if it is missing, then load it in either case
if (!("ISLR" %in% rownames(installed.packages()))) {
  install.packages('ISLR')
}
library(ISLR)
## Warning: package 'ISLR' was built under R version 4.1.1
print(head(College,30))
## Private Apps Accept Enroll Top10perc
## Abilene Christian University Yes 1660 1232 721 23
## Adelphi University Yes 2186 1924 512 16
## Adrian College Yes 1428 1097 336 22
## Agnes Scott College Yes 417 349 137 60
## Alaska Pacific University Yes 193 146 55 16
## Albertson College Yes 587 479 158 38
## Albertus Magnus College Yes 353 340 103 17
## Albion College Yes 1899 1720 489 37
## Albright College Yes 1038 839 227 30
## Alderson-Broaddus College Yes 582 498 172 21
## Alfred University Yes 1732 1425 472 37
## Allegheny College Yes 2652 1900 484 44
## Allentown Coll. of St. Francis de Sales Yes 1179 780 290 38
## Alma College Yes 1267 1080 385 44
## Alverno College Yes 494 313 157 23
## American International College Yes 1420 1093 220 9
## Amherst College Yes 4302 992 418 83
## Anderson University Yes 1216 908 423 19
## Andrews University Yes 1130 704 322 14
## Angelo State University No 3540 2001 1016 24
## Antioch University Yes 713 661 252 25
## Appalachian State University No 7313 4664 1910 20
## Aquinas College Yes 619 516 219 20
## Arizona State University Main campus No 12809 10308 3761 24
## Arkansas College (Lyon College) Yes 708 334 166 46
## Arkansas Tech University No 1734 1729 951 12
## Assumption College Yes 2135 1700 491 23
## Auburn University-Main Campus No 7548 6791 3070 25
## Augsburg College Yes 662 513 257 12
## Augustana College IL Yes 1879 1658 497 36
## Top25perc F.Undergrad P.Undergrad
## Abilene Christian University 52 2885 537
## Adelphi University 29 2683 1227
## Adrian College 50 1036 99
## Agnes Scott College 89 510 63
## Alaska Pacific University 44 249 869
## Albertson College 62 678 41
## Albertus Magnus College 45 416 230
## Albion College 68 1594 32
## Albright College 63 973 306
## Alderson-Broaddus College 44 799 78
## Alfred University 75 1830 110
## Allegheny College 77 1707 44
## Allentown Coll. of St. Francis de Sales 64 1130 638
## Alma College 73 1306 28
## Alverno College 46 1317 1235
## American International College 22 1018 287
## Amherst College 96 1593 5
## Anderson University 40 1819 281
## Andrews University 23 1586 326
## Angelo State University 54 4190 1512
## Antioch University 44 712 23
## Appalachian State University 63 9940 1035
## Aquinas College 51 1251 767
## Arizona State University Main campus 49 22593 7585
## Arkansas College (Lyon College) 74 530 182
## Arkansas Tech University 52 3602 939
## Assumption College 59 1708 689
## Auburn University-Main Campus 57 16262 1716
## Augsburg College 30 2074 726
## Augustana College IL 69 1950 38
## Outstate Room.Board Books Personal PhD
## Abilene Christian University 7440 3300 450 2200 70
## Adelphi University 12280 6450 750 1500 29
## Adrian College 11250 3750 400 1165 53
## Agnes Scott College 12960 5450 450 875 92
## Alaska Pacific University 7560 4120 800 1500 76
## Albertson College 13500 3335 500 675 67
## Albertus Magnus College 13290 5720 500 1500 90
## Albion College 13868 4826 450 850 89
## Albright College 15595 4400 300 500 79
## Alderson-Broaddus College 10468 3380 660 1800 40
## Alfred University 16548 5406 500 600 82
## Allegheny College 17080 4440 400 600 73
## Allentown Coll. of St. Francis de Sales 9690 4785 600 1000 60
## Alma College 12572 4552 400 400 79
## Alverno College 8352 3640 650 2449 36
## American International College 8700 4780 450 1400 78
## Amherst College 19760 5300 660 1598 93
## Anderson University 10100 3520 550 1100 48
## Andrews University 9996 3090 900 1320 62
## Angelo State University 5130 3592 500 2000 60
## Antioch University 15476 3336 400 1100 69
## Appalachian State University 6806 2540 96 2000 83
## Aquinas College 11208 4124 350 1615 55
## Arizona State University Main campus 7434 4850 700 2100 88
## Arkansas College (Lyon College) 8644 3922 500 800 79
## Arkansas Tech University 3460 2650 450 1000 57
## Assumption College 12000 5920 500 500 93
## Auburn University-Main Campus 6300 3933 600 1908 85
## Augsburg College 11902 4372 540 950 65
## Augustana College IL 13353 4173 540 821 78
## Terminal S.F.Ratio perc.alumni Expend
## Abilene Christian University 78 18.1 12 7041
## Adelphi University 30 12.2 16 10527
## Adrian College 66 12.9 30 8735
## Agnes Scott College 97 7.7 37 19016
## Alaska Pacific University 72 11.9 2 10922
## Albertson College 73 9.4 11 9727
## Albertus Magnus College 93 11.5 26 8861
## Albion College 100 13.7 37 11487
## Albright College 84 11.3 23 11644
## Alderson-Broaddus College 41 11.5 15 8991
## Alfred University 88 11.3 31 10932
## Allegheny College 91 9.9 41 11711
## Allentown Coll. of St. Francis de Sales 84 13.3 21 7940
## Alma College 87 15.3 32 9305
## Alverno College 69 11.1 26 8127
## American International College 84 14.7 19 7355
## Amherst College 98 8.4 63 21424
## Anderson University 61 12.1 14 7994
## Andrews University 66 11.5 18 10908
## Angelo State University 62 23.1 5 4010
## Antioch University 82 11.3 35 42926
## Appalachian State University 96 18.3 14 5854
## Aquinas College 65 12.7 25 6584
## Arizona State University Main campus 93 18.9 5 4602
## Arkansas College (Lyon College) 88 12.6 24 14579
## Arkansas Tech University 60 19.6 5 4739
## Assumption College 93 13.8 30 7100
## Auburn University-Main Campus 91 16.7 18 6642
## Augsburg College 65 12.8 31 7836
## Augustana College IL 83 12.7 40 9220
## Grad.Rate
## Abilene Christian University 60
## Adelphi University 56
## Adrian College 54
## Agnes Scott College 59
## Alaska Pacific University 15
## Albertson College 55
## Albertus Magnus College 63
## Albion College 73
## Albright College 80
## Alderson-Broaddus College 52
## Alfred University 73
## Allegheny College 76
## Allentown Coll. of St. Francis de Sales 74
## Alma College 68
## Alverno College 55
## American International College 69
## Amherst College 100
## Anderson University 59
## Andrews University 46
## Angelo State University 34
## Antioch University 48
## Appalachian State University 70
## Aquinas College 65
## Arizona State University Main campus 48
## Arkansas College (Lyon College) 54
## Arkansas Tech University 48
## Assumption College 88
## Auburn University-Main Campus 69
## Augsburg College 58
## Augustana College IL 71
# Create Vector of Column Max and Min Values
maxs <- apply(College[,2:18], 2, max)
mins <- apply(College[,2:18], 2, min)
# Use scale() and convert the resulting matrix to a data frame
scaled.data <- as.data.frame(scale(College[,2:18],center = mins, scale = maxs - mins))
# Check out results
print(head(scaled.data,2))
## Apps Accept Enroll Top10perc
## Abilene Christian University 0.03288693 0.04417701 0.10791254 0.2315789
## Adelphi University 0.04384229 0.07053089 0.07503539 0.1578947
## Top25perc F.Undergrad P.Undergrad Outstate
## Abilene Christian University 0.4725275 0.08716353 0.02454774 0.2634298
## Adelphi University 0.2197802 0.08075165 0.05614839 0.5134298
## Room.Board Books Personal PhD
## Abilene Christian University 0.2395965 0.1577540 0.2977099 0.6526316
## Adelphi University 0.7361286 0.2914439 0.1908397 0.2210526
## Terminal S.F.Ratio perc.alumni Expend
## Abilene Christian University 0.71052632 0.4182306 0.1875 0.0726714
## Adelphi University 0.07894737 0.2600536 0.2500 0.1383867
## Grad.Rate
## Abilene Christian University 0.4629630
## Adelphi University 0.4259259
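The call to scale() above performs min-max scaling: each column has its minimum subtracted and is then divided by its range. The sketch below checks this on a small made-up data frame (the `toy` values are an assumption for illustration, not part of the College data) by comparing scale() with the formula written out by hand.

```r
# Toy data frame standing in for the numeric columns of College (values are made up)
toy <- data.frame(a = c(2, 4, 6, 10), b = c(100, 150, 125, 200))

# Column-wise minima and maxima, as in the article's code
mins <- apply(toy, 2, min)
maxs <- apply(toy, 2, max)

# Min-max scaling through scale(): subtract the minimum, divide by the range
scaled_with_scale <- as.data.frame(scale(toy, center = mins, scale = maxs - mins))

# The same computation written out by hand, column by column
scaled_by_hand <- as.data.frame(
  lapply(toy, function(x) (x - min(x)) / (max(x) - min(x)))
)

# Both approaches should agree, and every column should span [0, 1]
print(all.equal(as.matrix(scaled_with_scale), as.matrix(scaled_by_hand),
                check.attributes = FALSE))
print(sapply(scaled_by_hand, range))
```

Because every column ends up in the same range, no single feature dominates the others merely because of its units.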
The target column, Private, is categorical (“Yes” or “No”) and has to be translated to the numerical values 1 and 0. With only two categories a single 0/1 column is enough; this is a binary (dummy) encoding, closely related to One-hot encoding, which would create one indicator column per category.
Such a conversion is needed, since the linear algebra that will be applied later operates on numbers only.
# Convert Private column from Yes/No to 1/0
Private = as.numeric(College$Private)-1
data = cbind(Private,scaled.data)
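The conversion above relies on how R stores factors. The following sketch, using a made-up factor named `private` (an assumption, standing in for College$Private), shows why subtracting 1 yields 0/1 values and contrasts that with a full one-hot encoding built with model.matrix().

```r
# Hypothetical toy factor standing in for College$Private
private <- factor(c("Yes", "No", "Yes", "Yes", "No"))

# Factor levels are sorted alphabetically, so "No" is level 1 and "Yes" is level 2
print(levels(private))

# The article's approach: integer codes (1/2) shifted down to 0/1
binary <- as.numeric(private) - 1

# An equivalent, more explicit form that does not depend on level order
binary_explicit <- as.numeric(private == "Yes")

# A full one-hot encoding, for contrast: one indicator column per level
one_hot <- model.matrix(~ private - 1)

print(binary)
print(one_hot)
```

The explicit comparison is a little more robust, since it keeps working if the factor levels are ever reordered.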
# Install caTools only if it is missing, then load it in either case
if (!("caTools" %in% rownames(installed.packages()))) {
  install.packages('caTools')
}
library(caTools)
## Warning: package 'caTools' was built under R version 4.1.1
Next, split the data into a training set and a test set. Set a random seed first so that the split is reproducible.
set.seed(101)
# Create split; sample.split preserves the label ratios of the column it is given
split = sample.split(data$Private, SplitRatio = 0.70)
# Split based off of split Boolean Vector
train = subset(data, split == TRUE)
test = subset(data, split == FALSE)
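To illustrate what a split that preserves class ratios looks like, here is a rough base-R sketch. It is not the caTools implementation; the helper `stratified_split` and the label vector are hypothetical and exist only for this example.

```r
set.seed(101)

# Hypothetical label vector: 1 = private, 0 = not private (made-up proportions)
labels <- c(rep(1, 70), rep(0, 30))

# Hypothetical helper: draw ratio * (class size) indices from each class separately
stratified_split <- function(y, ratio) {
  in_train <- rep(FALSE, length(y))
  for (cls in unique(y)) {
    idx <- which(y == cls)
    n_train <- round(ratio * length(idx))
    chosen <- idx[sample.int(length(idx), n_train)]
    in_train[chosen] <- TRUE
  }
  in_train
}

split_flag <- stratified_split(labels, ratio = 0.70)

# Class proportions should be about the same in both subsets
print(prop.table(table(labels[split_flag])))
print(prop.table(table(labels[!split_flag])))
```

Sampling within each class keeps the share of private colleges similar in the training and test subsets, which matters when the classes are imbalanced.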
feats <- names(scaled.data)
# Concatenate strings
f <- paste(feats,collapse=' + ')
f <- paste('Private ~',f)
# Convert to formula
f <- as.formula(f)
f
## Private ~ Apps + Accept + Enroll + Top10perc + Top25perc + F.Undergrad +
## P.Undergrad + Outstate + Room.Board + Books + Personal +
## PhD + Terminal + S.F.Ratio + perc.alumni + Expend + Grad.Rate
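The formula above is assembled as a string and then converted. As a possible alternative, the stats function reformulate() builds the same kind of formula directly from a vector of term labels. The sketch below uses a shortened, hypothetical feature list to keep the output small.

```r
# Hypothetical subset of feature names, for illustration only
feats <- c("Apps", "Accept", "Enroll")

# The article's approach: build the formula as a string, then convert it
f_paste <- as.formula(paste("Private ~", paste(feats, collapse = " + ")))

# Alternative from the stats package: build the formula from term labels
f_reform <- reformulate(feats, response = "Private")

print(f_paste)
print(f_reform)
```

Both forms avoid typing seventeen column names by hand, which is the main point of building the formula programmatically.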
# Install neuralnet only if it is missing, then load it in either case
if (!("neuralnet" %in% rownames(installed.packages()))) {
  install.packages('neuralnet')
}
library(neuralnet)
## Warning: package 'neuralnet' was built under R version 4.1.1
It is in this step that we let the computer figure out the parameters (the weights) of the mathematical model. The hyperparameters, such as the three hidden layers of 10 neurons each, are chosen by the analyst in the function call.
The output of the “neuralnet” function call is a fitted model, which is stored in the variable nn for later use on data outside of the training subset.
nn <- neuralnet(f,train,hidden=c(10,10,10),linear.output=FALSE)
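To give a sense of how many weights the training step has to estimate, the sketch below counts them for this architecture. It assumes a fully connected network with one bias term per neuron; the variable names are illustrative and not part of the neuralnet package.

```r
# Layer sizes: 17 input features, three hidden layers of 10 units, 1 output unit
layer_sizes <- c(17, 10, 10, 10, 1)

# Each layer has (inputs + 1 bias) * outputs weights
n_in  <- head(layer_sizes, -1)
n_out <- tail(layer_sizes, -1)
weights_per_layer <- (n_in + 1) * n_out

print(weights_per_layer)
print(sum(weights_per_layer))
```

Under these assumptions the network has 411 weights, which is a lot of freedom for a data set of a few hundred rows.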
Predict values and show the results.
# Compute Predictions off Test Set
predicted.nn.values <- compute(nn,test[2:18])
# Check out net.result
print(head(predicted.nn.values$net.result))
## [,1]
## Adrian College 1
## Alfred University 1
## Allegheny College 1
## Allentown Coll. of St. Francis de Sales 1
## Alma College 1
## Amherst College 1
predicted.nn.values$net.result <- sapply(predicted.nn.values$net.result,round,digits=0)
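The confusion matrix below has the actual values in rows and the predicted values in columns. Taking Private = 1 as the positive class:
Main diagonal = 1,1 - true negative
Main diagonal = 2,2 - true positive
Secondary diagonal = 1,2 - false positive
Secondary diagonal = 2,1 - false negative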
table(test$Private,predicted.nn.values$net.result)
##
## 0 1
## 0 55 9
## 1 6 163
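The counts in the table above can be turned into summary metrics. The sketch below copies the four counts by hand and assumes that Private = 1 is the positive class; the variable names are illustrative.

```r
# Confusion matrix copied from the output above: rows = actual, columns = predicted
cm <- matrix(c(55, 6, 9, 163), nrow = 2,
             dimnames = list(actual = c("0", "1"), predicted = c("0", "1")))

tn <- cm["0", "0"]  # actual 0, predicted 0
fp <- cm["0", "1"]  # actual 0, predicted 1
fn <- cm["1", "0"]  # actual 1, predicted 0
tp <- cm["1", "1"]  # actual 1, predicted 1

accuracy  <- (tp + tn) / sum(cm)   # share of all test rows classified correctly
precision <- tp / (tp + fp)        # share of predicted private colleges that are private
recall    <- tp / (tp + fn)        # share of private colleges that were found

print(round(c(accuracy = accuracy, precision = precision, recall = recall), 3))
```

Accuracy alone can be misleading here, because private colleges make up most of the test set; precision and recall give a more balanced picture.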
plot(nn, rep="best")
With that, we have used a subset of the main data, the train subset, to train a neural network and obtain a model that can now be used to predict the categorical variable “Private” for colleges outside of the data set we have at hand.
This is known as a supervised learning technique.