Austin Dolaway, Evan Heimbach, and Megan Marchetti 4/30/2020
The website https://archive.ics.uci.edu/ml/datasets/Ionosphere contains data evaluating “good” and “bad” radar returns for evidence of structure in the ionosphere. There are 351 observations of 34 predictors and a binary response, g or b. (a) Load the data into R, delete any columns with zero variance, and convert the first column to type num and the response to type factor. Set the seed to 12345 and partition the data using createDataPartition with p=0.7.
library(caret)
library(ISLR)
library(data.table)
library(caretEnsemble)
iondata <- fread("https://archive.ics.uci.edu/ml/machine-learning-databases/ionosphere/ionosphere.data")
iondata$V1 <- as.numeric(iondata$V1)
iondata$V2 <- NULL
iondata$V35 <- as.factor(iondata$V35)
set.seed(12345)
ionindex <- createDataPartition(iondata$V35, p = 0.7, list = FALSE)
iontrain <- iondata[ionindex, ]
iontest <- iondata[-ionindex, ]
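The resample summary below compares rpart, rf, gbm, and svmLinear models, but the code that trained them does not appear above. A minimal sketch of how mods and output were presumably created, using caretList with 10-fold cross-validation (an assumption, mirroring the trainControl settings used in the later problems):
set.seed(12345)
# Train all four models on the training partition with one shared trControl
mods <- caretList(V35 ~ ., data = iontrain,
                  trControl = trainControl(method = "cv", number = 10),
                  methodList = c("rpart", "rf", "gbm", "svmLinear"))
# Collect the cross-validation results for comparison
output <- resamples(mods)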
summary(output)
Call:
summary.resamples(object = output)
Models: rpart, rf, gbm, svmLinear
Number of resamples: 10
Accuracy
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
rpart 0.7083333 0.8350000 0.8575000 0.8576667 0.91 0.96 0
rf 0.7500000 0.8862500 0.9391667 0.9223333 0.99 1.00 0
gbm 0.7916667 0.8891667 0.9183333 0.9185000 0.96 1.00 0
svmLinear 0.7500000 0.8800000 0.8800000 0.8781667 0.88 0.96 0
Kappa
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
rpart 0.3913043 0.6185152 0.6812276 0.6861784 0.8062323 0.911032 0
rf 0.4666667 0.7587274 0.8603720 0.8290128 0.9777580 1.000000 0
gbm 0.5652174 0.7393258 0.8198702 0.8180478 0.9110320 1.000000 0
svmLinear 0.4146341 0.7191011 0.7191011 0.7171872 0.7330961 0.911032 0
dotplot(output)
confusionMatrix(table(iontrain$V35, predict(mods$rpart, iontrain)))
Confusion Matrix and Statistics
b g
b 73 16
g 10 148
Accuracy : 0.8947
95% CI : (0.8496, 0.9301)
No Information Rate : 0.664
P-Value [Acc > NIR] : <2e-16
Kappa : 0.7682
Mcnemar's Test P-Value : 0.3268
Sensitivity : 0.8795
Specificity : 0.9024
Pos Pred Value : 0.8202
Neg Pred Value : 0.9367
Prevalence : 0.3360
Detection Rate : 0.2955
Detection Prevalence : 0.3603
Balanced Accuracy : 0.8910
'Positive' Class : b
confusionMatrix(table(iontrain$V35, predict(mods$rf, iontrain)))
Confusion Matrix and Statistics
b g
b 89 0
g 0 158
Accuracy : 1
95% CI : (0.9852, 1)
No Information Rate : 0.6397
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 1
Mcnemar's Test P-Value : NA
Sensitivity : 1.0000
Specificity : 1.0000
Pos Pred Value : 1.0000
Neg Pred Value : 1.0000
Prevalence : 0.3603
Detection Rate : 0.3603
Detection Prevalence : 0.3603
Balanced Accuracy : 1.0000
'Positive' Class : b
confusionMatrix(table(iontrain$V35, predict(mods$gbm, iontrain)))
Confusion Matrix and Statistics
b g
b 89 0
g 0 158
Accuracy : 1
95% CI : (0.9852, 1)
No Information Rate : 0.6397
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 1
Mcnemar's Test P-Value : NA
Sensitivity : 1.0000
Specificity : 1.0000
Pos Pred Value : 1.0000
Neg Pred Value : 1.0000
Prevalence : 0.3603
Detection Rate : 0.3603
Detection Prevalence : 0.3603
Balanced Accuracy : 1.0000
'Positive' Class : b
confusionMatrix(table(iontrain$V35, predict(mods$svmLinear, iontrain)))
Confusion Matrix and Statistics
b g
b 80 9
g 2 156
Accuracy : 0.9555
95% CI : (0.9217, 0.9776)
No Information Rate : 0.668
P-Value [Acc > NIR] : < 2e-16
Kappa : 0.9017
Mcnemar's Test P-Value : 0.07044
Sensitivity : 0.9756
Specificity : 0.9455
Pos Pred Value : 0.8989
Neg Pred Value : 0.9873
Prevalence : 0.3320
Detection Rate : 0.3239
Detection Prevalence : 0.3603
Balanced Accuracy : 0.9605
'Positive' Class : b
confusionMatrix(table(iontest$V35, predict(mods$rpart, iontest)))
Confusion Matrix and Statistics
b g
b 34 3
g 3 64
Accuracy : 0.9423
95% CI : (0.8787, 0.9785)
No Information Rate : 0.6442
P-Value [Acc > NIR] : 6.661e-13
Kappa : 0.8741
Mcnemar's Test P-Value : 1
Sensitivity : 0.9189
Specificity : 0.9552
Pos Pred Value : 0.9189
Neg Pred Value : 0.9552
Prevalence : 0.3558
Detection Rate : 0.3269
Detection Prevalence : 0.3558
Balanced Accuracy : 0.9371
'Positive' Class : b
confusionMatrix(table(iontest$V35, predict(mods$rf, iontest)))
Confusion Matrix and Statistics
b g
b 33 4
g 1 66
Accuracy : 0.9519
95% CI : (0.8914, 0.9842)
No Information Rate : 0.6731
P-Value [Acc > NIR] : 3.633e-12
Kappa : 0.8932
Mcnemar's Test P-Value : 0.3711
Sensitivity : 0.9706
Specificity : 0.9429
Pos Pred Value : 0.8919
Neg Pred Value : 0.9851
Prevalence : 0.3269
Detection Rate : 0.3173
Detection Prevalence : 0.3558
Balanced Accuracy : 0.9567
'Positive' Class : b
confusionMatrix(table(iontest$V35, predict(mods$gbm, iontest)))
Confusion Matrix and Statistics
b g
b 34 3
g 1 66
Accuracy : 0.9615
95% CI : (0.9044, 0.9894)
No Information Rate : 0.6635
P-Value [Acc > NIR] : 9.701e-14
Kappa : 0.9151
Mcnemar's Test P-Value : 0.6171
Sensitivity : 0.9714
Specificity : 0.9565
Pos Pred Value : 0.9189
Neg Pred Value : 0.9851
Prevalence : 0.3365
Detection Rate : 0.3269
Detection Prevalence : 0.3558
Balanced Accuracy : 0.9640
'Positive' Class : b
confusionMatrix(table(iontest$V35, predict(mods$svmLinear, iontest)))
Confusion Matrix and Statistics
b g
b 26 11
g 1 66
Accuracy : 0.8846
95% CI : (0.8071, 0.9389)
No Information Rate : 0.7404
P-Value [Acc > NIR] : 0.000244
Kappa : 0.7321
Mcnemar's Test P-Value : 0.009375
Sensitivity : 0.9630
Specificity : 0.8571
Pos Pred Value : 0.7027
Neg Pred Value : 0.9851
Prevalence : 0.2596
Detection Rate : 0.2500
Detection Prevalence : 0.3558
Balanced Accuracy : 0.9101
'Positive' Class : b
The ISLR package contains a dataset called Khan that consists of gene expression measurements indicating one of four types of small round blue cell tumours of childhood (SRBCT). (a) The data are already split into training and testing sets. Make dataframes for each and set the seed to 12345.
library(caret)
library(ISLR)
library(data.table)
library(caretEnsemble)
khandata <- Khan
khantrain <- data.frame(Khan$xtrain)
khantest <- data.frame(Khan$xtest)
khantrain$response <- as.factor(Khan$ytrain)
khantest$response <- as.factor(Khan$ytest)
set.seed(12345)
CART2 <- train(response~., data = khantrain, method = "rpart", trControl = trainControl(method = "cv", number = 10))
RF2 <- train(response~., data = khantrain, method = "rf", trControl = trainControl(method = "cv", number = 10))
GBM2 <- train(response~., data = khantrain, method = "gbm", trControl = trainControl(method = "cv", number = 10))
SVM2 <- train(response~., data = khantrain, method = "svmLinear", trControl = trainControl(method = "cv", number = 10))
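Before examining confusion matrices, the cross-validated performance of the four models can be compared directly, mirroring the resamples summary from the first problem (a sketch; note that the folds differ across models here because the seed was set only once):
khanResamples <- resamples(list(rpart = CART2, rf = RF2, gbm = GBM2, svmLinear = SVM2))
summary(khanResamples)
dotplot(khanResamples)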
CART2p <- predict(CART2, newdata = khantrain)
confusionMatrix(CART2p, khantrain$response)
Confusion Matrix and Statistics
Reference
Prediction 1 2 3 4
1 0 0 0 0
2 0 22 0 0
3 7 1 12 0
4 1 0 0 20
Overall Statistics
Accuracy : 0.8571
95% CI : (0.7461, 0.9325)
No Information Rate : 0.3651
P-Value [Acc > NIR] : 1.023e-15
Kappa : 0.7977
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: 1 Class: 2 Class: 3 Class: 4
Sensitivity 0.000 0.9565 1.0000 1.0000
Specificity 1.000 1.0000 0.8431 0.9767
Pos Pred Value NaN 1.0000 0.6000 0.9524
Neg Pred Value 0.873 0.9756 1.0000 1.0000
Prevalence 0.127 0.3651 0.1905 0.3175
Detection Rate 0.000 0.3492 0.1905 0.3175
Detection Prevalence 0.000 0.3492 0.3175 0.3333
Balanced Accuracy 0.500 0.9783 0.9216 0.9884
RF2p <- predict(RF2, newdata = khantrain)
confusionMatrix(RF2p, khantrain$response)
Confusion Matrix and Statistics
Reference
Prediction 1 2 3 4
1 8 0 0 0
2 0 23 0 0
3 0 0 12 0
4 0 0 0 20
Overall Statistics
Accuracy : 1
95% CI : (0.9431, 1)
No Information Rate : 0.3651
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 1
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: 1 Class: 2 Class: 3 Class: 4
Sensitivity 1.000 1.0000 1.0000 1.0000
Specificity 1.000 1.0000 1.0000 1.0000
Pos Pred Value 1.000 1.0000 1.0000 1.0000
Neg Pred Value 1.000 1.0000 1.0000 1.0000
Prevalence 0.127 0.3651 0.1905 0.3175
Detection Rate 0.127 0.3651 0.1905 0.3175
Detection Prevalence 0.127 0.3651 0.1905 0.3175
Balanced Accuracy 1.000 1.0000 1.0000 1.0000
GBM2p <- predict(GBM2, newdata = khantrain)
confusionMatrix(GBM2p, khantrain$response)
Confusion Matrix and Statistics
Reference
Prediction 1 2 3 4
1 8 0 0 0
2 0 23 0 0
3 0 0 12 0
4 0 0 0 20
Overall Statistics
Accuracy : 1
95% CI : (0.9431, 1)
No Information Rate : 0.3651
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 1
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: 1 Class: 2 Class: 3 Class: 4
Sensitivity 1.000 1.0000 1.0000 1.0000
Specificity 1.000 1.0000 1.0000 1.0000
Pos Pred Value 1.000 1.0000 1.0000 1.0000
Neg Pred Value 1.000 1.0000 1.0000 1.0000
Prevalence 0.127 0.3651 0.1905 0.3175
Detection Rate 0.127 0.3651 0.1905 0.3175
Detection Prevalence 0.127 0.3651 0.1905 0.3175
Balanced Accuracy 1.000 1.0000 1.0000 1.0000
SVM2p <- predict(SVM2, newdata = khantrain)
confusionMatrix(SVM2p, khantrain$response)
Confusion Matrix and Statistics
Reference
Prediction 1 2 3 4
1 8 0 0 0
2 0 23 0 0
3 0 0 12 0
4 0 0 0 20
Overall Statistics
Accuracy : 1
95% CI : (0.9431, 1)
No Information Rate : 0.3651
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 1
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: 1 Class: 2 Class: 3 Class: 4
Sensitivity 1.000 1.0000 1.0000 1.0000
Specificity 1.000 1.0000 1.0000 1.0000
Pos Pred Value 1.000 1.0000 1.0000 1.0000
Neg Pred Value 1.000 1.0000 1.0000 1.0000
Prevalence 0.127 0.3651 0.1905 0.3175
Detection Rate 0.127 0.3651 0.1905 0.3175
Detection Prevalence 0.127 0.3651 0.1905 0.3175
Balanced Accuracy 1.000 1.0000 1.0000 1.0000
The random forest and GBM models fit the training data perfectly, the SVM reached about 96% accuracy, and CART about 89%; however, accuracy may well be lower on the testing data.
CART2p2 <- predict(CART2, newdata = khantest)
confusionMatrix(CART2p2, khantest$response)
Confusion Matrix and Statistics
Reference
Prediction 1 2 3 4
1 0 0 0 0
2 0 4 0 1
3 3 1 5 1
4 0 1 1 3
Overall Statistics
Accuracy : 0.6
95% CI : (0.3605, 0.8088)
No Information Rate : 0.3
P-Value [Acc > NIR] : 0.005138
Kappa : 0.4386
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: 1 Class: 2 Class: 3 Class: 4
Sensitivity 0.00 0.6667 0.8333 0.6000
Specificity 1.00 0.9286 0.6429 0.8667
Pos Pred Value NaN 0.8000 0.5000 0.6000
Neg Pred Value 0.85 0.8667 0.9000 0.8667
Prevalence 0.15 0.3000 0.3000 0.2500
Detection Rate 0.00 0.2000 0.2500 0.1500
Detection Prevalence 0.00 0.2500 0.5000 0.2500
Balanced Accuracy 0.50 0.7976 0.7381 0.7333
RF2p2 <- predict(RF2, newdata = khantest)
confusionMatrix(RF2p2, khantest$response)
Confusion Matrix and Statistics
Reference
Prediction 1 2 3 4
1 3 0 0 0
2 0 6 0 0
3 0 0 5 0
4 0 0 1 5
Overall Statistics
Accuracy : 0.95
95% CI : (0.7513, 0.9987)
No Information Rate : 0.3
P-Value [Acc > NIR] : 1.662e-09
Kappa : 0.9322
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: 1 Class: 2 Class: 3 Class: 4
Sensitivity 1.00 1.0 0.8333 1.0000
Specificity 1.00 1.0 1.0000 0.9333
Pos Pred Value 1.00 1.0 1.0000 0.8333
Neg Pred Value 1.00 1.0 0.9333 1.0000
Prevalence 0.15 0.3 0.3000 0.2500
Detection Rate 0.15 0.3 0.2500 0.2500
Detection Prevalence 0.15 0.3 0.2500 0.3000
Balanced Accuracy 1.00 1.0 0.9167 0.9667
GBM2p2 <- predict(GBM2, newdata = khantest)
confusionMatrix(GBM2p2, khantest$response)
Confusion Matrix and Statistics
Reference
Prediction 1 2 3 4
1 3 0 0 0
2 0 5 0 1
3 0 0 6 0
4 0 1 0 4
Overall Statistics
Accuracy : 0.9
95% CI : (0.683, 0.9877)
No Information Rate : 0.3
P-Value [Acc > NIR] : 3.773e-08
Kappa : 0.8639
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: 1 Class: 2 Class: 3 Class: 4
Sensitivity 1.00 0.8333 1.0 0.8000
Specificity 1.00 0.9286 1.0 0.9333
Pos Pred Value 1.00 0.8333 1.0 0.8000
Neg Pred Value 1.00 0.9286 1.0 0.9333
Prevalence 0.15 0.3000 0.3 0.2500
Detection Rate 0.15 0.2500 0.3 0.2000
Detection Prevalence 0.15 0.3000 0.3 0.2500
Balanced Accuracy 1.00 0.8810 1.0 0.8667
SVM2p2 <- predict(SVM2, newdata = khantest)
confusionMatrix(SVM2p2, khantest$response)
Confusion Matrix and Statistics
Reference
Prediction 1 2 3 4
1 3 0 0 0
2 0 6 2 0
3 0 0 4 0
4 0 0 0 5
Overall Statistics
Accuracy : 0.9
95% CI : (0.683, 0.9877)
No Information Rate : 0.3
P-Value [Acc > NIR] : 3.773e-08
Kappa : 0.8639
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: 1 Class: 2 Class: 3 Class: 4
Sensitivity 1.00 1.0000 0.6667 1.00
Specificity 1.00 0.8571 1.0000 1.00
Pos Pred Value 1.00 0.7500 1.0000 1.00
Neg Pred Value 1.00 1.0000 0.8750 1.00
Prevalence 0.15 0.3000 0.3000 0.25
Detection Rate 0.15 0.3000 0.2000 0.25
Detection Prevalence 0.15 0.4000 0.2000 0.25
Balanced Accuracy 1.00 0.9286 0.8333 1.00
On the testing data, the CART model's accuracy (60%) was much lower than that of the SVM and GBM models (90%) and the random forest (95%).
The page https://archive.ics.uci.edu/ml/datasets/Energy+efficiency contains data on building characteristics and energy efficiency. What is new here is that there are two response variables: heating load (Y1) and cooling load (Y2).
library(readxl)
library(caret)
library(neuralnet)
enb <- read_excel("A:/Chrome Downloads/ENB2012_data.xlsx")
normal <- function(x){return((x-min(x))/(max(x)-min(x)))}
enb <- as.data.frame(lapply(enb, normal))
set.seed(12345)
trainindex <- createDataPartition(y = enb$Y1, p = 0.7, list = FALSE)
enbtrain <- enb[trainindex, ]
enbtest <- enb[-trainindex, ]
NN3b <- train(Y1~.-Y2, data = enbtrain, method = "nnet", trControl = trainControl(method = "cv", number = 10), trace = FALSE)
NN3bp <- predict(NN3b, enbtest)
cor(NN3bp, enbtest$Y1)^2
[1] 0.9888821
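Since there are two responses, the same caret approach presumably applies to the cooling load as well; a sketch (the name NN3c is hypothetical, not from the original):
NN3c <- train(Y2 ~ . - Y1, data = enbtrain, method = "nnet", trControl = trainControl(method = "cv", number = 10), trace = FALSE)
NN3cp <- predict(NN3c, enbtest)
cor(NN3cp, enbtest$Y2)^2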
NN3d <- neuralnet(Y1+Y2~X1+X2+X3+X4+X5+X6+X7+X8, data = enbtrain, hidden = 1)
plot(NN3d, rep = "best")
NN3dp <- predict(NN3d, enbtest)
cor(NN3dp, enbtest$Y1)^2
[,1]
[1,] 0.9243322
[2,] 0.9243322
cor(NN3dp, enbtest$Y2)^2
[,1]
[1,] 0.8963057
[2,] 0.8963057
NN3e <- neuralnet(Y1+Y2~X1+X2+X3+X4+X5+X6+X7+X8, data = enbtrain, hidden = c(2,1))
plot(NN3e, rep = "best")
NN3ep <- predict(NN3e, enbtest)
cor(NN3ep, enbtest$Y1)^2
[,1]
[1,] 0.9243322
[2,] 0.9243322
cor(NN3ep, enbtest$Y2)^2
[,1]
[1,] 0.8963057
[2,] 0.8963057
The relatively high \(R^2\) values of roughly 0.92 for Y1 and 0.90 for Y2 indicate that the fitted values track the observed values closely, making this a decent model for the data.
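As a complementary check (a sketch, not part of the original run), test-set RMSE on the normalized scale can be computed from the prediction matrix, whose columns follow the response order Y1, Y2:
rmse <- function(pred, obs) sqrt(mean((pred - obs)^2))  # root mean squared error
rmse(NN3dp[, 1], enbtest$Y1)
rmse(NN3dp[, 2], enbtest$Y2)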
In this problem, we investigate the importance of normalizing the data before constructing a neural network model. Consider the Boston data set in the MASS package with lstat predicting medv. (a) Set the seed to 12345 and construct a neural network model called NN4b with one hidden layer containing one hidden variable. Plot the data and superimpose the model over the data. Comment on the quality of the fit.
library(MASS)
library(caret)
library(neuralnet)
set.seed(12345)
NN4b <- neuralnet(medv ~ lstat, data = Boston, hidden = 1)
predicted <- predict(NN4b, Boston)
plot(Boston$lstat, Boston$medv, type = "p", pch = 20, xlab="lstat", ylab="medv", main="Boston Data")
lines(lowess(Boston$lstat, predicted), lwd=2, col="green")
The model does not represent the Boston data well: the fitted curve is essentially flat, while the data show a clear downward trend in medv as lstat increases.
#normalization function
normalize<-function(x)
{
return((x-min(x))/(max(x)-min(x)))
}
#normalizing data
NormBostonData <- as.data.frame(lapply(Boston, normalize))
#Neural network with normalized data and one hidden layer
NN4d <- neuralnet(medv ~ lstat, data = NormBostonData, hidden = 1)
predictedd <- predict(NN4d, NormBostonData)
plot(NormBostonData$lstat, NormBostonData$medv, type = "p", pch = 20, xlab="lstat", ylab="medv", main="Normalized Boston Data")
lines(lowess(NormBostonData$lstat, predictedd), lwd=2, col="red")
This model represents the Boston data much better than the first model. The trendline follows the trend of the data.
plot(NN4d, rep="best")
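The weights reported in the plot can be checked against predict() directly; a sketch, assuming neuralnet's standard weight layout (bias entry first in each matrix) and its default logistic activation:
S <- function(z) 1 / (1 + exp(-z))  # logistic activation used by neuralnet
w <- NN4d$weights[[1]]              # list of weight matrices for the fitted repetition
manual <- w[[2]][1] + w[[2]][2] * S(w[[1]][1] + w[[1]][2] * NormBostonData$lstat)
all.equal(as.numeric(predict(NN4d, NormBostonData)), manual)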
The equation of this model, writing \(S(z) = 1/(1+e^{-z})\) for the logistic activation, is \(medv = 2.70587 - 2.53167\,S(0.99799 + 6.03666\,lstat)\).
(e) Using the normalized data, construct a neural network model called NN4f with two hidden layers containing two hidden variables each. Plot the data and superimpose the model over the data and make the curve blue. Comment on the quality of the fit.
#Neural network with normalized data and two hidden layers
NN4f <- neuralnet(medv ~ lstat, data = NormBostonData, hidden = c(2,2))
predictedf <- predict(NN4f, NormBostonData)
plot(NormBostonData$lstat, NormBostonData$medv, type = "p", pch = 20, xlab="lstat", ylab="medv", main="Two Hidden Layers Normalized Boston Data")
lines(lowess(NormBostonData$lstat, predictedf), lwd=2, col="blue")
Similarly to the previous model, this model represents the Boston data well; the fitted curve follows the trend of the data.
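The one-node and two-layer fits can also be compared numerically rather than only visually; a brief sketch using training mean squared error:
mean((predictedd - NormBostonData$medv)^2)  # one hidden node
mean((predictedf - NormBostonData$medv)^2)  # two hidden layers of two nodes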
A crooked employee at a casino occasionally switches out a fair six-sided die for a weighted six-sided die, and observations of die rolls supervised by this employee are recorded in Casino.csv. (a) Since the employee only rarely switches the dice, initialize the transition matrix to be \(A = \begin{pmatrix} 0.99 & 0.01 \\ 0.02 & 0.98 \end{pmatrix}\). Set the seed to 6789 and initialize \(\pi\) and \(B\) with random positive entries, but be sure that the entries in \(\pi\) add to one and the rows of \(B\) add to one.
normalizeProbabilities <- function(x){x/sum(x)}
library(HMM)
casino <- read.csv("A:/Chrome Downloads/Casino.csv", header = TRUE, sep = ",")
set.seed(6789)
PIprobabilities <- normalizeProbabilities(runif(2))
Bprobabilities <- t(apply(matrix(runif(12), nrow = 2), 1, normalizeProbabilities)) # 2x6 matrix whose rows sum to one
transitionMatrix <- matrix(c(0.99, 0.01, 0.02, 0.98), nrow = 2, byrow = TRUE)
hmm <- initHMM(c("Fair", "Unfair"), 1:6, startProbs = PIprobabilities, transProbs = transitionMatrix, emissionProbs = Bprobabilities)
bw <- baumWelch(hmm, casino$Roll, maxIterations = 50)
bw$hmm$emissionProbs
symbols
states 1 2 3 4 5 6
Fair 0.1666314 0.16706238 0.16771742 0.1667201 0.1672893 0.1645794
Unfair 0.1009619 0.09892086 0.09879078 0.1005153 0.1015320 0.4992791
The fitted emission probabilities show that the Unfair state rolls a six about half the time, so the die appears to be weighted on the one side (the face opposite six), which makes six land face up far more often than on a fair die.
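With the fitted parameters, the HMM package's viterbi function can recover the most likely Fair/Unfair state sequence, showing roughly when the weighted die was in play (a sketch, not part of the original output):
states <- viterbi(bw$hmm, casino$Roll)  # most likely hidden state at each roll
table(states)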
The data in KaggleSurvey.csv are derived from the responses to the 2018 Kaggle Machine Learning and Data Science Survey. Respondents were asked “How do you perceive the quality of online learning platforms and MOOCs as compared to the quality of the education provided by traditional brick and mortar institutions?” and responses used the following scale.
kaggleSurvey <- read.csv("A:/Chrome Downloads/KaggleSurvey.csv")
kaggleSurvey$Salary <- NULL
kaggleSurvey$Country <- NULL
kaggleSurvey <- kaggleSurvey[!(is.na(kaggleSurvey[,4]) | kaggleSurvey[,4]==""), ]
kaggleSurvey <- kaggleSurvey[!(is.na(kaggleSurvey[,3]) | kaggleSurvey[,3]==""), ]
library(MASS)
kaggleSurvey$Response <- as.factor(kaggleSurvey$Response)
ORD <- polr(Response~(Gender + Age + Student), data = kaggleSurvey)
Use
testing <- data.frame(Student=c(0,1,0,1), Gender=c("Male","Male","Female","Female"), Age=c(25,25,25,25))
predict(ORD, newdata = testing, type="p")
to see probabilities for each response for 25-year-old people. Which group is most likely to respond "Much better" to the survey question?
testing <- data.frame(Student=c(0,1,0,1),Gender=c("Male","Male","Female","Female"),Age=c(25,25,25,25))
predict(ORD,newdata = testing, type="p")
1 2 3 4 5
1 0.03310606 0.10486481 0.3213671 0.2890439 0.2516181
2 0.02887626 0.09315773 0.3025289 0.2963388 0.2790983
3 0.03827625 0.11858475 0.3400057 0.2787800 0.2243532
4 0.03340870 0.10568552 0.3225823 0.2884737 0.2498498
Group 2 (25-year-old male students) is most likely to respond "Much better", with a probability of 27.9%.
ORD$coefficients
GenderFemale GenderMale Age Student
0.281548213 0.432022905 -0.009502285 0.141061847
Looking at the coefficients of ORD, Age has a coefficient of only -0.0095, far smaller in magnitude than the gender and student coefficients, so each additional year of age shifts the predicted response only slightly compared to those factors.
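One standard way to read proportional-odds coefficients is as odds ratios, which makes the per-unit effects easier to compare; a quick sketch:
exp(ORD$coefficients)  # odds ratios; exp(-0.0095) is about 0.99, i.e. nearly no per-year effect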
The file SouthAmerica.csv contains data on ten countries in South America. (a) Load the data, rename the rows with the names of the countries, and use scale to center and scale each column. Then use hclust to produce a cluster dendrogram that displays how similar countries are to one another. Use plot to display the clusters and be sure that the country names are used for the labels.
southAmerica <- read.csv("A:/Chrome Downloads/SouthAmerica.csv")
scaledSA <- scale(southAmerica[, c(2:8)], center = TRUE, scale = TRUE)
rownames(scaledSA) <- c("Argentina", "Bolivia", "Brazil", "Chile","Colombia","Ecuador", "Paraguay", "Peru","Uruguay", "Venezuela")
hClust <- hclust(dist(scaledSA))
plot(hClust, main = "South American Countries", xlab = "Clusters")
Peru and Ecuador are most like Colombia.
hClust$height[2]
[1] 1.43153
Cutting the dendrogram just above a height of 1.43153 (the second merge) produces two two-country clusters, Colombia with Peru and Argentina with Uruguay, while the remaining six countries are still in singleton clusters.
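cutree makes the memberships at a chosen height explicit; a sketch cutting just above the second merge height:
cutree(hClust, h = hClust$height[2] + 0.001)  # cluster labels for each country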
The file EducationLevel.csv contains data on education levels in all of the counties in the United States. (a) Load the data. There is no need to scale the columns since the numerical columns are all percentages. Carefully examine the data and do any necessary pre-processing.
educationLevel <- read.csv("A:/Chrome Downloads/EducationLevel.csv", header = TRUE, sep = ",")
educationLevel[1,] <- NA #Row 1 is the whole US
educationLevel <- educationLevel[!(is.na(educationLevel[,4]) | educationLevel[,4]==""), ] #Removes NA rows
set.seed(1234)
kMeans <- kmeans(educationLevel[,4:7], centers = 2)
fips <- educationLevel$FIPS.Code
cluster <- kMeans$cluster
codes <- data.frame(fips, cluster)
library(usmap)
library(ggplot2)
plot_usmap(regions = "counties", data = codes, values = "cluster", labels = TRUE, label_color = "white") + scale_fill_continuous(low = "red", high = "blue") + theme(legend.position = "none")
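The choice of two centers could be sanity-checked with an elbow plot of the total within-cluster sum of squares over several values of k (a hedged sketch, not part of the original analysis):
set.seed(1234)
wss <- sapply(1:6, function(k) kmeans(educationLevel[, 4:7], centers = k, nstart = 10)$tot.withinss)
plot(1:6, wss, type = "b", xlab = "Number of clusters k", ylab = "Total within-cluster SS")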