Data mining is the process of extracting information from large data sets and organizing it into workable data frames that can be manipulated and analyzed to gain insight into what the data holds and what it can tell us. Many algorithms are used in data science and analytics; this section looks at a few of the techniques popular among data scientists. An implementation in RStudio is included to provide an example of each.
Classification itself contains a variety of methods. In classification, data is analyzed by its attributes, such as the variables used and the values of those variables, to identify classifiers that can be used to group similar records together based on those attributes. The k-nearest neighbors (KNN) algorithm is used for this example (EMC Education Services, 2015).
#### Pros and Cons of Classification

Classification is used widely in real-world applications, ranging from evaluating customer risk when approving loans to many other domains. Classification is simple to implement and has a low computational cost. The method is executed without any prior assumptions about the data set and treats each variable as independent (Gupta, 2020). The downside of classification's simplicity is that it becomes slower to execute on large data sets. The numerical values of the variables also need to be rescaled each time to run the algorithm successfully, and missing data and outliers distort results and accuracy (Gupta, 2020).

#### KNN algorithm

The KNN algorithm is a simple yet highly accurate algorithm that helps create different classes within the data. The algorithm classifies a new data point by looking at the data points nearest to it. How many neighboring points it considers is determined by the k value: it sorts the distances to the other points from least to greatest, picks the top k values, and assigns itself the majority classification among those points (Harrison, 2018).
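Before turning to the credit data, the neighbor-voting step can be illustrated with a minimal, self-contained sketch. The toy points, labels, and the choice of k = 3 below are hypothetical and are not part of the credit example that follows.

# Toy illustration of the KNN voting step (hypothetical points, not the credit data)
train.toy <- data.frame(x = c(1, 2, 3, 8, 9, 10), y = c(1, 2, 1, 8, 9, 8))
labels.toy <- c("A", "A", "A", "B", "B", "B")
new.point <- c(2, 2) # the point we want to classify
k <- 3
# Euclidean distance from the new point to every labeled point
dists <- sqrt((train.toy$x - new.point[1])^2 + (train.toy$y - new.point[2])^2)
# sort the distances, keep the k closest neighbors, take the majority label
nearest <- order(dists)[1:k]
names(which.max(table(labels.toy[nearest]))) # "A"

The credit example below applies the same idea using the knn function from the class package.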
gc <- read.csv("german_credit.csv")
## Taking back-up of the input file, in case the original data is required later
head (gc) # To check top 6 values of all the variables in data set.
## Creditability Account.Balance Duration.of.Credit..month.
## 1 1 1 18
## 2 1 1 9
## 3 1 2 12
## 4 1 1 12
## 5 1 1 12
## 6 1 1 10
## Payment.Status.of.Previous.Credit Purpose Credit.Amount Value.Savings.Stocks
## 1 4 2 1049 1
## 2 4 0 2799 1
## 3 2 9 841 2
## 4 4 0 2122 1
## 5 4 0 2171 1
## 6 4 0 2241 1
## Length.of.current.employment Instalment.per.cent Sex...Marital.Status
## 1 2 4 2
## 2 3 2 3
## 3 4 2 2
## 4 3 3 3
## 5 3 4 3
## 6 2 1 3
## Guarantors Duration.in.Current.address Most.valuable.available.asset
## 1 1 4 2
## 2 1 2 1
## 3 1 4 1
## 4 1 2 1
## 5 1 4 2
## 6 1 3 1
## Age..years. Concurrent.Credits Type.of.apartment No.of.Credits.at.this.Bank
## 1 21 3 1 1
## 2 36 3 1 2
## 3 23 3 1 1
## 4 39 3 1 2
## 5 38 1 2 2
## 6 48 3 1 2
## Occupation No.of.dependents Telephone Foreign.Worker
## 1 3 1 1 1
## 2 3 2 1 1
## 3 2 1 1 1
## 4 2 2 1 2
## 5 2 1 1 2
## 6 2 2 1 2
Looking at the data, we can see financial attributes relating to a bank loan application. Creditability can be used as the predictor (approved or not). Only the main attributes are kept, to reduce the number of variables used to test the algorithm.
gc.subset <- gc[c('Creditability','Age..years.','Sex...Marital.Status','Occupation','Account.Balance','Credit.Amount','Length.of.current.employment','Purpose')]
head(gc.subset)
## Creditability Age..years. Sex...Marital.Status Occupation Account.Balance
## 1 1 21 2 3 1
## 2 1 36 3 3 1
## 3 1 23 2 2 2
## 4 1 39 3 2 1
## 5 1 38 3 2 1
## 6 1 48 3 2 1
## Credit.Amount Length.of.current.employment Purpose
## 1 1049 2 2
## 2 2799 3 0
## 3 841 4 9
## 4 2122 3 0
## 5 2171 3 0
## 6 2241 2 0
Normalizing the values in this data is crucial to avoid errors when running KNN. To do this, the numeric values are rescaled to a common range so that variables with large magnitudes do not skew the model.
normalize <- function(x) {
return ((x - min(x)) / (max(x) - min(x))) } # creating a normalize function to keep x and y values within range
gc.subset.n <- as.data.frame(lapply(gc.subset[,2:8], normalize)) # normalization is applied to all variables except Creditability, since it is the predicted variable
head(gc.subset.n)
## Age..years. Sex...Marital.Status Occupation Account.Balance Credit.Amount
## 1 0.03571429 0.3333333 0.6666667 0.0000000 0.04396390
## 2 0.30357143 0.6666667 0.6666667 0.0000000 0.14025531
## 3 0.07142857 0.3333333 0.3333333 0.3333333 0.03251898
## 4 0.35714286 0.6666667 0.3333333 0.0000000 0.10300429
## 5 0.33928571 0.6666667 0.3333333 0.0000000 0.10570045
## 6 0.51785714 0.6666667 0.3333333 0.0000000 0.10955211
## Length.of.current.employment Purpose
## 1 0.25 0.2
## 2 0.50 0.0
## 3 0.75 0.9
## 4 0.50 0.0
## 5 0.50 0.0
## 6 0.25 0.0
The new data is then separated into a 70 percent training set and a 30 percent testing set. The Creditability column is removed and placed in its own data frame to compare the test results against.
set.seed(123) # To get the same random sample
dat.d <- sample(1:nrow(gc.subset.n),size=nrow(gc.subset.n)*0.7,replace = FALSE) #random selection of 70% data.
train.gc <- gc.subset[dat.d,] # 70% training data
test.gc <- gc.subset[-dat.d,] # remaining 30% test data
# separate Creditability into its own data frame to be used for comparison
train.gc_labels <- gc.subset[dat.d,1]
test.gc_labels <- gc.subset[-dat.d,1]
A common rule of thumb is to take the square root of the number of training observations to get a k value. There are 700 training observations and the square root is roughly 26.5, so we build two models, one with k=26 and the other with k=27. After running the tests we see that k=26 gives an accuracy of about 68.7 percent and k=27 gives 69 percent. Further tuning of k may sharpen the results; a sketch of such a search follows the accuracy output below.
library(class)
NROW(train.gc_labels) # return number of observations
## [1] 700
knn.26 <- knn(train=train.gc, test=test.gc, cl=train.gc_labels, k=26)
knn.27 <- knn(train=train.gc, test=test.gc, cl=train.gc_labels, k=27)
ACC.26 <- 100 * sum(test.gc_labels == knn.26)/NROW(test.gc_labels) # For knn = 26
ACC.27 <- 100 * sum(test.gc_labels == knn.27)/NROW(test.gc_labels) # For knn = 27
ACC.26 # Accuracy is 68.67%
## [1] 68.66667
ACC.27
## [1] 69
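As a rough sketch of the tuning mentioned above (an illustrative addition, not part of the original analysis; the range 1 to 40 is an arbitrary choice), test accuracy can be computed for a range of k values and the best-performing one kept:

# scan k from 1 to 40 and record the test accuracy for each value
k.values <- 1:40
acc.values <- numeric(length(k.values))
for (i in k.values) {
  knn.i <- knn(train = train.gc, test = test.gc, cl = train.gc_labels, k = i)
  acc.values[i] <- 100 * sum(test.gc_labels == knn.i) / NROW(test.gc_labels)
}
k.values[which.max(acc.values)] # k with the highest test accuracy
plot(k.values, acc.values, type = "b", xlab = "k", ylab = "Accuracy (%)")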
Prediction is a vital technique for making highly educated predictions about future data within the set. An example of a predictive model is a decision tree (EMC Education Services, 2015). The focus of implementing a decision tree is to build a training model that predicts what the next series of incoming data will look like, based on the patterns (rules) within past data. As with other techniques, the data is split into training and testing sets. The decision tree branches on conditions, outputting 1 or 0 depending on whether a path is taken or not. Future values or events are then estimated using maximum likelihood, based on the patterns and statistical properties of the data set such as the variance among the data points (EMC Education Services, 2015).
#### Pros and Cons of Predictive Analysis

Predictive analysis with decision trees takes less time to implement than many other techniques and provides a high rate of accuracy. With complete and accurate data, the prediction data mining technique is used in many real-world situations. The rules within decision trees are also dynamic, making it easy to add new nodes or variables to see how they affect the model and to use that for future business decisions (SeattleDataGuy, 2019). A drawback of this technique is that it requires a vast amount of data to achieve a high level of accuracy. The data, and the metrics used to measure it, must also be consistent across sources; merging separate data sets that are not formatted the same way will cause errors and weaken results (SeattleDataGuy, 2019). Prediction example:
library(rpart)
df<-read.csv("https://raw.githubusercontent.com/tuyenhavan/Statistics/Dataset/LungCapData.csv",sep=";")
head(df)
## LungCap Age Height Smoke Gender Caesarean
## 1 6.475 6 62.1 no male no
## 2 10.125 18 74.7 yes female no
## 3 9.550 16 69.7 no female yes
## 4 11.125 14 71.0 no male no
## 5 4.800 5 56.9 no male no
## 6 6.225 11 58.7 no female no
The data contains LungCap, Age, Height, Smoke, Gender, and Caesarean variables. For test purposes, only the LungCap and Height columns are kept in the data frame df. A regression tree is then fit with LungCap as the response and Height as the predictor, using a minsplit parameter of 3, which sets the minimum number of observations that must be in a node before a split is attempted. The algorithm finds relationships between the two variables and creates a model for the data. A new, separate data frame is created with only a height column to be used for testing:
df<-df[,c(1,3)]
# keep only the LungCap and Height columns (columns 1 and 3)
#use rpart library to create a decision tree model with lungcap and height variables.
dt <- rpart(LungCap ~ Height, data = df, control = rpart.control(minsplit = 3))
## create new height df to test
new <- data.frame(Height = df$Height)
To test our model, the predict function is called with the fitted tree and the data frame containing the lone height column. The model takes in the height data and predicts each person's lung capacity. The training data and the predicted outputs are plotted using ggplot2: the green points are the known lung capacities and the red line traces the predicted outcomes of the model. As seen, the predicted values fall within the trained data set's range.
library(ggplot2)
## run the predict function to test the new height data against the fitted tree model dt
pd<-predict(dt,newdata = new)
# add the predicted values to the original data frame as column pd
df$pd<-pd
ggplot(df, aes(x = Height)) +
  geom_point(aes(y = LungCap), color = 3) +
  geom_line(aes(y = pd), color = 2) +
  ggtitle("Decision Tree example") + xlab("Height") + ylab("Lung Cap") +
  theme_bw() + theme(plot.title = element_text(hjust = 0.5))
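As a quick sanity check (an optional addition, not in the original write-up), the fit of the tree's predictions can be summarized with the root mean squared error and the correlation between observed and predicted lung capacity:

# simple goodness-of-fit checks for the tree predictions
sqrt(mean((df$LungCap - df$pd)^2)) # root mean squared error, in LungCap units
cor(df$LungCap, df$pd) # how closely the predictions track the observed values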
Clustering is a machine learning technique similar to classification, except that the groups are not predefined by hand. Clustering is essential when exploring data because it creates clusters containing data values with attributes similar to one another. This provides an outline of the different groups within the set that can be used to decide what further action is needed to work with the data. The k-means algorithm is the example used for this technique. K-means divides the data into groups based on distance metrics and sums of squares; the relation of data points is then established using these metrics to place similar data points in the same cluster (EMC Education Services, 2015).
The built-in "USArrests" data set is used for this example.
data("USArrests")
rawdf <- na.omit(USArrests)
head(rawdf)
## Murder Assault UrbanPop Rape
## Alabama 13.2 236 58 21.2
## Alaska 10.0 263 48 44.5
## Arizona 8.1 294 80 31.0
## Arkansas 8.8 190 50 19.5
## California 9.0 276 91 40.6
## Colorado 7.9 204 78 38.7
The data set contains 50 observations, one for each U.S. state, with 4 variables each: Murder, Assault, UrbanPop, and Rape. The variables have very different scales and variances, so they must be standardized with the scale function. Printing the new table, df, shows how all the values are brought onto a common scale to avoid errors caused by high variance in the k-means computations.
desc_stats <- data.frame(
Min = apply(rawdf, 2, min), # minimum
Med = apply(rawdf, 2, median), # median
Mean = apply(rawdf, 2, mean), # mean
SD = apply(rawdf, 2, sd), # Standard deviation
Max = apply(rawdf, 2, max) # Maximum
)
desc_stats <- round(desc_stats, 1)
head(desc_stats)
## Min Med Mean SD Max
## Murder 0.8 7.2 7.8 4.4 17.4
## Assault 45.0 159.0 170.8 83.3 337.0
## UrbanPop 32.0 66.0 65.5 14.5 91.0
## Rape 7.3 20.1 21.2 9.4 46.0
df <- scale(USArrests)
head(df)
## Murder Assault UrbanPop Rape
## Alabama 1.24256408 0.7828393 -0.5209066 -0.003416473
## Alaska 0.50786248 1.1068225 -1.2117642 2.484202941
## Arizona 0.07163341 1.4788032 0.9989801 1.042878388
## Arkansas 0.23234938 0.2308680 -1.0735927 -0.184916602
## California 0.27826823 1.2628144 1.7589234 2.067820292
## Colorado 0.02571456 0.3988593 0.8608085 1.864967207
To apply the k-means algorithm, we pass in the scaled data (from USArrests), the number of clusters k = 4, and nstart = 25, which tells kmeans to try 25 random starting configurations for the centroids and keep the best result.
set.seed(123)
km.res <- kmeans(scale(USArrests), 4, nstart = 25)
km.res
## K-means clustering with 4 clusters of sizes 8, 13, 16, 13
##
## Cluster means:
## Murder Assault UrbanPop Rape
## 1 1.4118898 0.8743346 -0.8145211 0.01927104
## 2 -0.9615407 -1.1066010 -0.9301069 -0.96676331
## 3 -0.4894375 -0.3826001 0.5758298 -0.26165379
## 4 0.6950701 1.0394414 0.7226370 1.27693964
##
## Clustering vector:
## Alabama Alaska Arizona Arkansas California
## 1 4 4 1 4
## Colorado Connecticut Delaware Florida Georgia
## 4 3 3 4 1
## Hawaii Idaho Illinois Indiana Iowa
## 3 2 4 3 2
## Kansas Kentucky Louisiana Maine Maryland
## 3 2 1 2 4
## Massachusetts Michigan Minnesota Mississippi Missouri
## 3 4 2 1 4
## Montana Nebraska Nevada New Hampshire New Jersey
## 2 2 4 2 3
## New Mexico New York North Carolina North Dakota Ohio
## 4 4 1 2 3
## Oklahoma Oregon Pennsylvania Rhode Island South Carolina
## 3 3 3 3 1
## South Dakota Tennessee Texas Utah Vermont
## 2 1 4 3 2
## Virginia Washington West Virginia Wisconsin Wyoming
## 3 3 2 2 3
##
## Within cluster sum of squares by cluster:
## [1] 8.316061 11.952463 16.212213 19.922437
## (between_SS / total_SS = 71.2 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
To visualize the result, the factoextra library is used. The fviz_cluster function takes each data point's cluster assignment from the fitted k-means object and plots the points with colored boundaries for a clearer illustration of the clusters.
library("factoextra")
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
fviz_cluster(km.res, data = df,
palette = c("#00AFBB","#2E9FDF", "#E7B800", "#FC4E07"),
ggtheme = theme_minimal(),
main = "Partitioning Clustering Plot"
)
The k-means algorithm is highly favored among R users due to the simplicity with which it groups observations and presents them. The computational time is minimal and the results are reliable. K-means is also well suited to large data sets: the more observations, the better the clustering outcome tends to be. Its simplicity is also its weakness. K-means can only be applied to numerical data, and although it can provide reliable results, in-depth exploration of the data is needed to choose the right value for k. Another drawback is that k-means assumes the data is uniform and tries to make equal-sized clusters even when that is not ideal, which can distort the cluster boundaries.
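One common way to explore the choice of k mentioned above is the elbow method: run k-means over a range of k values and plot the total within-cluster sum of squares, looking for the point where the improvement levels off. A minimal sketch using factoextra's fviz_nbclust function (an illustrative addition; the upper bound k.max = 10 is an arbitrary choice):

# elbow-method sketch: total within-cluster sum of squares for k = 1 to 10
fviz_nbclust(df, kmeans, method = "wss", k.max = 10) +
  labs(title = "Elbow method for choosing k")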
The association algorithm is used to identify patterns among dependently linked variables. This method is used when analyzing features of the data to find correlations with other variables (EMC Education Services, 2015). The technique is heavily used in the retail and marketing sectors to identify what other services or goods are purchased along with a company's own products and to explore any correlations that may improve sales. The association algorithm ranks variables by frequency and by their relationships to other variables. Real-world applications include grocery stores analyzing sales to map item locations throughout the store, making it easier for customers to purchase related items together (EMC Education Services, 2015).

#### Pros and Cons of Association

The association technique works best with large data sets, because the sheer amount of information makes it easier to find commonality among variables compared with the limited correlations found in smaller data sets. The drawback of this technique is the complexity needed to implement it successfully: substantial features of the data are needed to draw conclusive results, and the computations running over all these variables and features are expensive, with long execution times (Bala Deshpande, 2019). Example of association technique implementation:
dataset = read.csv('Market_Basket_Optimisation.csv', header = FALSE)
library(arules)
## Loading required package: Matrix
##
## Attaching package: 'arules'
## The following objects are masked from 'package:base':
##
## abbreviate, write
## a subset of the data is made from the original, without duplicate transactions
dataset = read.transactions('Market_Basket_Optimisation.csv', sep = ',', rm.duplicates = TRUE)
## distribution of transactions with duplicates:
## 1
## 5
summary(dataset)
## transactions as itemMatrix in sparse format with
## 7501 rows (elements/itemsets/transactions) and
## 119 columns (items) and a density of 0.03288973
##
## most frequent items:
## mineral water eggs spaghetti french fries chocolate
## 1788 1348 1306 1282 1229
## (Other)
## 22405
##
## element (itemset/transaction) length distribution:
## sizes
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
## 1754 1358 1044 816 667 493 391 324 259 139 102 67 40 22 17 4
## 18 19 20
## 1 2 1
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 2.000 3.000 3.914 5.000 20.000
##
## includes extended item information - examples:
## labels
## 1 almonds
## 2 antioxydant juice
## 3 asparagus
The data set contains grocery goods purchased at a store. To apply the association algorithm, duplicate transactions are removed, since duplicates would be treated as separate values and increase the computational cost. Looking at the data summary, there are 7501 transactions (rows) and 119 items (columns).
## passing support value and confidence parameters into algorithm.
rules = apriori(data = dataset, parameter = list(support = 0.003, confidence = 0.4))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.4 0.1 1 none FALSE TRUE 5 0.003 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 22
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[119 item(s), 7501 transaction(s)] done [0.00s].
## sorting and recoding items ... [115 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 done [0.00s].
## writing ... [281 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
For the apriori algorithm, support and confidence parameters are needed. The support parameter sets the target frequency of an item and is derived by dividing the total frequency of an item by the total number of transactions. In this example, only items bought about 3 times a day are considered; over a one-week period that gives 21/7500 = 0.0028, or 0.003 rounded. The default minimum confidence in apriori is 0.8 (80 percent), but it is lowered to 0.4 here to fit the situation.
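As a quick check (an illustrative addition, not part of the original analysis; the three-purchases-a-day figure comes from the reasoning above), the support threshold can be reproduced directly and the observed support of individual items inspected with arules' itemFrequency function:

# reproduce the support threshold: about 3 purchases a day over 7 days, out of roughly 7,500 transactions
(3 * 7) / 7500 # 0.0028, rounded to 0.003
# observed support (relative frequency) of the five most frequent items
sort(itemFrequency(dataset, type = "relative"), decreasing = TRUE)[1:5]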
inspect(sort(rules, by = 'lift')[1:10])
## lhs rhs
## [1] {mineral water,whole wheat pasta} => {olive oil}
## [2] {spaghetti,tomato sauce} => {ground beef}
## [3] {french fries,herb & pepper} => {ground beef}
## [4] {cereals,spaghetti} => {ground beef}
## [5] {frozen vegetables,mineral water,soup} => {milk}
## [6] {chocolate,herb & pepper} => {ground beef}
## [7] {chocolate,mineral water,shrimp} => {frozen vegetables}
## [8] {frozen vegetables,mineral water,olive oil} => {milk}
## [9] {cereals,ground beef} => {spaghetti}
## [10] {frozen vegetables,soup} => {milk}
## support confidence coverage lift count
## [1] 0.003866151 0.4027778 0.009598720 6.115863 29
## [2] 0.003066258 0.4893617 0.006265831 4.980600 23
## [3] 0.003199573 0.4615385 0.006932409 4.697422 24
## [4] 0.003066258 0.4600000 0.006665778 4.681764 23
## [5] 0.003066258 0.6052632 0.005065991 4.670863 23
## [6] 0.003999467 0.4411765 0.009065458 4.490183 30
## [7] 0.003199573 0.4210526 0.007598987 4.417225 24
## [8] 0.003332889 0.5102041 0.006532462 3.937285 25
## [9] 0.003066258 0.6764706 0.004532729 3.885303 23
## [10] 0.003999467 0.5000000 0.007998933 3.858539 30
The apriori algorithm implements a bottom-up process that identifies which items were bought along with other items until there are no more connections left to analyze. A hash-tree structure is then used to identify grouped sets and prune the ones with infrequent patterns; what is left are the high-frequency associated groups (Bala Deshpande, 2019). The results above show the ten rules with the highest lift, along with their support and confidence values. For example, customers who bought mineral water and whole wheat pasta also tended to buy olive oil.
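To follow up on a specific pattern such as the olive oil rule, the rule set can be filtered by item. A short sketch using arules' subset method (an illustrative addition; the item "olive oil" is taken from rule [1] above):

# keep only the rules that predict olive oil purchases, then show the strongest by lift
oil.rules <- subset(rules, subset = rhs %in% "olive oil")
inspect(head(sort(oil.rules, by = "lift"), 5))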
Correlation (coefficient) analysis is used to measure and analyze the relationship between two variables. The correlation coefficient ranges between -1 and +1. A value of +1 is a perfect positive correlation and signifies that an increase in one variable is matched by an increase in the other; values of roughly 0.4 to 0.8 are considered high correlation, while anything below about 0.1 indicates essentially no correlation. A value of -1 means a perfect negative correlation, where an increase in one variable corresponds to a decrease in the other (EMC Education Services, 2015); a toy illustration of these endpoints follows the pros and cons below.

#### Pros and Cons of Correlation Analysis

Correlation analysis is applied to data before prediction techniques and prepares the data for regression, which is needed to accurately predict future data. It is an essential step of every data analysis or research study, providing data scientists with insight about the data that would otherwise go unnoticed and result in weak models. However, correlation analysis is only one step within any data mining workflow, so it does not reveal everything about the data; further exploration needs to be conducted on the correlation test results (Suarez, 2015).
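As promised above, the endpoints of the coefficient's range are easy to see with made-up vectors (an illustrative aside, not part of the state data analysis):

# perfectly correlated: y increases exactly in step with x
cor(c(1, 2, 3, 4), c(2, 4, 6, 8)) # 1
# perfectly negatively correlated: y decreases exactly as x increases
cor(c(1, 2, 3, 4), c(8, 6, 4, 2)) # -1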
library(psych)
##
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
## use the built-in state.x77 data set (keep the first six columns)
states<- state.x77[,1:6]
describe(states)[, c(1:5, 7:9)] #get statistical info on data set
## vars n mean sd median mad min max
## Population 1 50 4246.42 4464.49 2838.50 2890.33 365.00 21198.0
## Income 2 50 4435.80 614.47 4519.00 581.18 3098.00 6315.0
## Illiteracy 3 50 1.17 0.61 0.95 0.52 0.50 2.8
## Life Exp 4 50 70.88 1.34 70.67 1.54 67.96 73.6
## Murder 5 50 7.38 3.69 6.85 5.19 1.40 15.1
## HS Grad 6 50 53.11 8.08 53.25 8.60 37.80 67.3
The state.x77 data set built into R contains statistics on the 50 US states, such as population, income, and murder rate. The statistical summary for each variable can be seen above.
# store data set' covariances in matrix
matrix1 <- cov(states)
# round to two decimal places
round(matrix1, 2)
## Population Income Illiteracy Life Exp Murder HS Grad
## Population 19931683.76 571229.78 292.87 -407.84 5663.52 -3551.51
## Income 571229.78 377573.31 -163.70 280.66 -521.89 3076.77
## Illiteracy 292.87 -163.70 0.37 -0.48 1.58 -3.24
## Life Exp -407.84 280.66 -0.48 1.80 -3.87 6.31
## Murder 5663.52 -521.89 1.58 -3.87 13.63 -14.55
## HS Grad -3551.51 3076.77 -3.24 6.31 -14.55 65.24
# store correlation values in matrix
matrix2 <- cor(states)
# round to two decimal places
round(matrix2, 2)
## Population Income Illiteracy Life Exp Murder HS Grad
## Population 1.00 0.21 0.11 -0.07 0.34 -0.10
## Income 0.21 1.00 -0.44 0.34 -0.23 0.62
## Illiteracy 0.11 -0.44 1.00 -0.59 0.70 -0.66
## Life Exp -0.07 0.34 -0.59 1.00 -0.78 0.58
## Murder 0.34 -0.23 0.70 -0.78 1.00 -0.49
## HS Grad -0.10 0.62 -0.66 0.58 -0.49 1.00
# apply spearman method on correlation coefficients
matrix3 <- cor(states, method="spearman")
# round to two decimal places
round(matrix3, 2)
## Population Income Illiteracy Life Exp Murder HS Grad
## Population 1.00 0.12 0.31 -0.10 0.35 -0.38
## Income 0.12 1.00 -0.31 0.32 -0.22 0.51
## Illiteracy 0.31 -0.31 1.00 -0.56 0.67 -0.65
## Life Exp -0.10 0.32 -0.56 1.00 -0.78 0.52
## Murder 0.35 -0.22 0.67 -0.78 1.00 -0.44
## HS Grad -0.38 0.51 -0.65 0.52 -0.44 1.00
To proceed with the correlation analysis, the covariance and correlation coefficient values are stored in matrices. Looking at the Spearman matrix, income and high school graduation rate have a fairly strong positive correlation of 0.51, and illiteracy and life expectancy have a strong negative correlation of -0.56. With this information, further techniques can be applied to explore these correlations in more depth; one example follows below.
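One such follow-up (an illustrative addition; the variable pairs are chosen from the matrices above) is a significance test on an individual pair of variables with cor.test, which reports the coefficient together with a p-value and confidence interval. Note that the Spearman version may warn about ties in this data.

# test whether the Income / HS Grad correlation is statistically significant
cor.test(states[, "Income"], states[, "HS Grad"])
# the same idea with Spearman's rank correlation for Illiteracy and Life Exp
cor.test(states[, "Illiteracy"], states[, "Life Exp"], method = "spearman")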
Bala Deshpande, V. (2019). Association Analysis. Retrieved from: https://www.sciencedirect.com/topics/computer-science/association-analysis
EMC Education Services (Ed.). (2015). Data science and big data analytics: Discovering, analyzing, visualizing and presenting data. Indianapolis, IN: Wiley. ISBN-13: 9781118876138.
Gupta, S. (Feb. 28, 2020). Pros and cons of various Machine Learning algorithms. Retrieved from: https://towardsdatascience.com/pros-and-cons-of-various-classification-ml-algorithms-3b5bfb3c87d6
Harrison, O. (Sep 10, 2018). Machine Learning Basics with the K-Nearest Neighbors Algorithm. Retrieved from: https://towardsdatascience.com/machine-learning-basics-with-the-k-nearest-neighbors-algorithm-6a6e71d01761
SeattleDataGuy (Oct 17, 2019). What is Predictive Modeling. Retrieved from: https://medium.com/better-programming/what-is-predictive-modeling-d918a4cf178e
Suarez, H. (July 09, 2015). What is correlation? and Data analysis tools. Retrieved from: https://www.incibe-cert.es/en/blog/correlation-and-data-analysis-tools