Problem 1

The website https://archive.ics.uci.edu/ml/datasets/Ionosphere contains data evaluating “good” and “bad” radar returns for evidence of structure in the ionosphere. There are 351 observations of 34 predictors and a binary response, g or b.

(a)

Load the data into R, delete any columns with zero variance, and convert the first column to type num and the response to type factor. Set the seed to 12345 and partition the data using createDataPartition with p=0.7.

ionosphere = read.csv("C:\\Users\\Sarah\\OneDrive\\DAT315\\Project 5\\ionosphere.data", header = FALSE)
ionosphere$V2 <- NULL
ionosphere$V1 <- as.numeric(ionosphere$V1)
ionosphere$V35 <- as.factor(ionosphere$V35)

set.seed(12345)
library(caret)
trainingindices <- createDataPartition(ionosphere$V35, p = 0.7, list = FALSE)
training <- ionosphere[trainingindices, ]
testing <- ionosphere[-trainingindices, ]

(b)

Build CART (method=“rpart”), random forest (method=“rf”), gradient boosting machine (method=“gbm”), and SVM (method=“svmLinear”) models for the response and call them CART1, RF1, GBM1, and SVM1, respectively. Be sure to suppress computational details that your reader doesn’t want to see.

cart

CART1 <- train(V35~., data = training, 
              method="rpart",
              trControl=trainControl(method="cv", 10),
              preProcess=c("center","scale"))
ionospheretrain1 <- predict(CART1, newdata = training)
confusionMatrix(ionospheretrain1, training$V35)$overall[1]

##  Accuracy 
## 0.8947368

ionospheretest1 <- predict(CART1, newdata = testing)
confusionMatrix(ionospheretest1, testing$V35)$overall[1]

##  Accuracy 
## 0.9423077

random forest

RF1 <- train(V35~., data = training, 
              method="rf",
              trControl=trainControl(method="cv", 10),
              preProcess=c("center","scale"))
ionospheretrain2 <- predict(RF1, newdata = training)
confusionMatrix(ionospheretrain2, training$V35)$overall[1]

## Accuracy 
##        1

ionospheretest2 <- predict(RF1, newdata = testing)
confusionMatrix(ionospheretest2, testing$V35)$overall[1]

##  Accuracy 
## 0.9519231

gradient boosting machine

GBM1 <- train(V35~., data = training, 
              method="gbm",
              trControl=trainControl(method="cv", 10),
              preProcess=c("center","scale"))

gradiant boosting machine (accracy)

ionospheretrain3 <- predict(GBM1, newdata = training)
confusionMatrix(ionospheretrain3, training$V35)$overall[1]

## Accuracy 
##        1

ionospheretest3 <- predict(GBM1, newdata = testing)
confusionMatrix(ionospheretest3, testing$V35)$overall[1]

##  Accuracy 
## 0.9519231

SVM

SVM1 <- train(V35~., data = training, 
              method="svmLinear",
              trControl=trainControl(method="cv", 10),
              preProcess=c("center","scale"))
ionospheretrain4 <- predict(SVM1, newdata = training)
confusionMatrix(ionospheretrain4, training$V35)$overall[1]

##  Accuracy 
## 0.9554656

ionospheretest4 <- predict(SVM1, newdata = testing)
confusionMatrix(ionospheretest4, testing$V35)$overall[1]

##  Accuracy 
## 0.8846154

(c)

Compare the accuracies of the four models on the training data.

The accuracies for the training data appear to be in range from 0.9 to 1. This indicates the models are well representative of the data.

(d)

Compare the accuracies of the four models on the testing data.

The accuracies for the testing data appear to range from 0.87 to 0.95. This is an indicator that the models are represenative of the data but not as good as the information from the training data.

Problem 2

The ISLR package contains a dataset called Khan that consists of gene expression measurements indicating one of four types of small round blue cell tumours of childhood (SRBCT).

(a)

The data are already split into training and testing sets. Make dataframes for each and set the seed to 12345.

library(ISLR)
data = Khan

training2 <- data.frame(response = factor(Khan$ytrain), Khan$xtrain)
testing2 <- data.frame(response = factor(Khan$ytest), Khan$xtest)
set.seed(12345)

(b)

Build CART (method=“rpart”), random forest (method=“rf”), gradient boosting machine (method=“gbm”), and SVM (method=“svmLinear”) models for the response and call them CART2, RF2, GBM2, and SVM2, respectively. Be sure to suppress computational details that your reader doesn’t want to see.

cart

library(caret)
CART2 <- train(response~., data = training2, 
              method="rpart",
              trControl=trainControl(method="cv", 10),
              preProcess=c("center","scale"))
datatrain1 <- predict(CART2, newdata = training2)
confusionMatrix(datatrain1, training2$response)$overall[1]

##  Accuracy 
## 0.8571429

datatest1 <- predict(CART2, newdata = testing2)
confusionMatrix(datatest1, testing2$response)$overall[1]

## Accuracy 
##      0.6

random forest

RF2 <- train(response~., data = training2,
             method="rf",
             trControl=trainControl(method="cv", 10),
             preProcess=c("center", "scale"))
datatrain2 <- predict(RF2, newdata = training2)
confusionMatrix(datatrain2, training2$response)$overall[1]

## Accuracy 
##        1

datatest2 <- predict(RF2, newdata = testing2)
confusionMatrix(datatest2, testing2$response)$overall[1]

## Accuracy 
##     0.95

gradient boosting machine

GBM2 <- train(response~., data = training2,
              method="gbm",
              trControl=trainControl(method="cv",10),
              preProcess=c("center", "scale"))

gradient boosting machine (accuracy)

datatrain3 <- predict(GBM2, newdata = training2)
confusionMatrix(datatrain3, training2$response)$overall[1]

## Accuracy 
##        1

datatest3 <- predict(GBM2, newdata = testing2)
confusionMatrix(datatest3, testing2$response)$overall[1]

## Accuracy 
##      0.9

SVM

SVM2 <- train(response~., data = training2,
              method="svmLinear",
              trControl=trainControl(method="cv",10),
              preProcess=c("center", "scale"))
datatrain4 <- predict(SVM2, newdata = training2)
confusionMatrix(datatrain4, training2$response)$overall[1]

## Accuracy 
##        1

datatest4 <- predict(SVM2, newdata = testing2)
confusionMatrix(datatest4, testing2$response)$overall[1]

## Accuracy 
##      0.9

(c)

Compare the accuracies of the four models on the training data.

For the training data, the accuracies for random forest, gradiant boosting machine, and SVM are all 1, indicating they are perfect models for the data. The accuracies for CART is in the 0.8-0.9 range, so this is a good model, but not as representative of the data as the other three.

(d)

Compare the accuracies of the four models on the testing data.

For the testing data, the accuracies for the random forest, gradiant boosting machine, and SVM are in the 0.9-1 range, indicating these models well represent the data. The accuracy for CART is low in the 0.6-0.7 range, indicating this model is not very well representative of the data.

Problem 3

The data https://archive.ics.uci.edu/ml/datasets/Energy+efficiency contains data on building characteristics and energy efficiency. What is new here is that there are two response variables, heating load (Y1) and cooling load (Y2).

(a)

Load the data into R and normalize it so it is suitable for analysis by a neural network. Set the seed to 12345 and use createDataPartition (using Y1 as the response) with p=0.7 to partition the data into training and testing sets.

energy = read.table("C:\\Users\\Sarah\\OneDrive\\DAT315\\Project 5\\ENB2012_data.txt", header = TRUE)
normalize <- function(x){
  return((x-min(x))/(max(x)-min(x)))
}
energy <- as.data.frame(lapply(energy, normalize))

library(caret)
set.seed(12345)
trainingindices <- createDataPartition(energy$Y1, p=0.7, list = FALSE)
training3 <- energy[trainingindices, ]
testing3 <- energy[-trainingindices, ]

(b)

Use train with method=“nnet” to build a neural network model for Y1 using X1, X2, …, X8 as predictors. (Don’t use Y2 as a predictor.) Call the model NN3b. Also, note that there is no need to center and scale the predictors.

NN3b <- train(Y1~X1+X2+X3+X4+X5+X6+X7+X8, data = training3,
              method="nnet",
              trControl=trainControl(method="cv", 10))

(c)

Compute R^2 for NN3b on the testing data.

predictNN3b=predict(NN3b, testing3)
cor(predictNN3b,testing3$Y1)^2

## [1] 0.9862628

(d)

To predict both Y1 and Y2, use the neuralnet function in the neuralnet package. Using the training data, build a model called NN3d with one hidden unit in one hidden layer. Plot NN3d using plot(NN3d, rep=“best”) and use the testing data to compute R2 for Y1 and Y2.

library(neuralnet)
NN3d = neuralnet(Y1+Y2~X1+X2+X3+X4+X5+X6+X7, data = training3, hidden = 1)
plot(NN3d, rep="best")

predictNN3d=predict(NN3d, testing3)
cor(predictNN3b,testing3$Y1)^2

## [1] 0.9862628

cor(predictNN3b,testing3$Y2)^2

## [1] 0.9438288

(e)

Make a new model, NN3e, for Y1 and Y2 with two hidden layers, the first having 2 nodes and the second having 1 node. Plot NN3e using plot(NN3e, rep=“best”) and use the testing data to compute R^2 for Y1 and Y2.

NN3e = neuralnet(Y1+Y2~X1+X2+X3+X4+X5+X6+X7+X8, data = training3, hidden=c(2,1))
plot(NN3e, rep="best")

predictNN3e=predict(NN3e,testing3)
cor(predictNN3e, testing3$Y1)^2

##           [,1]
## [1,] 0.9741762
## [2,] 0.9741762

cor(predictNN3e, testing3$Y2)^2

##           [,1]
## [1,] 0.9484193
## [2,] 0.9484193

Problem 4

In this problem, we investigate the importance of normalizing the data before constructing a neural network model. Consider the Boston data set in the MASS package with lstat predicting medv.

(a)

Set the seed to 12345 and construct a neural network model called NN4b with one hidden layer containing one hidden variable. Plot the data and superimpose the model over the data. Comment on the quality of the ﬁt.

library(MASS)
library(neuralnet)
library(caret)

NN4b <- neuralnet(medv~lstat, data=Boston, hidden=1)

plot(Boston$lstat, Boston$medv, pch=19, xlab="Observed Lstat", ylab="Observed Medv")

Boston_data1 <- seq(0,50,0.25)
Boston_data1 <- as.data.frame(Boston_data1)

line <- lines(Boston_data1$Boston_data1, compute(NN4b, Boston_data1)$net.result, lwd=2, col="Green")

This model is not a good fit. The points form a curve while the model predicts a horizontal line.

(b)

Construct a new dataframe containing a normalized version of the Boston data.

normalize <- function(x){return((x-min(x))/(max(x)-min(x)))}
Boston_normalized <- as.data.frame(lapply(Boston, normalize))

(c)

Using the normalized data, construct a neural network model called NN4d with one hidden layer containing one hidden variable. Plot the data and superimpose the model over the data and make the curve red. Comment on the quality of the ﬁt.

library(neuralnet)
library(caret)

NN4d <- neuralnet(medv~lstat, data=Boston_normalized, hidden=1)

plot(Boston_normalized$lstat, Boston_normalized$medv, pch=19, xlab="Normalized Lstat", ylab="Normalized Medv")

Boston_data2 <- seq(0, 1, 0.05)
Boston_data2 <- as.data.frame(Boston_data2)

line_2 <- lines(Boston_data2$Boston_data2, compute(NN4d, Boston_data2)$net.result, lwd=2, col="Red")

This is a better fit than part a. The model now predicts a curve that many of the points lie on.

(d)

Use plot(NN4d, rep=“best”) to visualize the model and write down the corresponding equation. Use S to indicate the activation function.

library(neuralnet)
library(caret)

plot(NN4d, rep="best")

y= 0.17385 + 3.27508S(-1.31578-5.88043)

(e)

Using the normalized data, construct a neural network model called NN4f with two hidden layers containing two hidden variables each. Plot the data and superimpose the model over the data and make the curve blue. Comment on the quality of the ﬁt.

library(neuralnet)
library(caret)

NN4f <- neuralnet(medv~lstat, data=Boston_normalized, hidden=c(2,2))

plot(Boston_normalized$lstat, Boston_normalized$medv, pch=19, xlab="Normalized Lstat", ylab="Normalized Medv")

Boston_data3 <- seq(0, 1, 0.05)
Boston_data3 <- as.data.frame(Boston_data3)

line_3 <- lines(Boston_data3$Boston_data3, compute(NN4f, Boston_data3)$net.result, lwd=2, col="Blue")

This one is the best fit so far. This one has a curve like the previous model but also introduces another curve. forming that “S” shape.

Problem 5

A crooked employee at a casino occasionally switches out a fair six-sided die for a weighted six-sided die, and observations of die rolls supervised by this employee are recorded in Casino.csv.

(a)

Set the seed to 6789 and and initialize π and B with random positive entries, but be sure that the entries in π add to one and the rows of B add to one.

library(HMM)
Casino = read.csv("C:\\Users\\Sarah\\OneDrive\\DAT315\\Project 5\\Casino.csv", header = TRUE)
normalizeProbabilities = function(x){x/sum(x)}
Aprobs = matrix(c(0.99,0.02,0.01,0.98), nrow = 2, ncol = 2)
set.seed(6789)
PIprobs = normalizeProbabilities(runif(2))
Bprobs = apply(matrix(runif(12),6),1, normalizeProbabilities)
hmm <- initHMM(c("1", "2"), 1:6, startProbs = PIprobs, transProbs = Aprobs, emissionProbs = Bprobs)

(b)

Use the Baum-Welch algorithm to build a hidden Markov model for the crooked employee’s behavior. What does the model predict for the weights of the unfair die?

model <- baumWelch(hmm, Casino$Roll, maxIterations = 50)

Based on the baunWelch model, state 1 is the fair die. State 2 is the unfair die. Its weights are 0.1009619 for a 1, 0.09892086 for a 2, 0.09879078 for a 3, 0.1005153 for a 4, 0.1015320 for a 5, and 0.4992791 for a 6.

Problem 6

The data in KaggleSurvey.csv are derived from the responses to the 2018 Kaggle Machine Learning and Data Science Survey. Respondents were asked “How do you perceive the quality of online learning platforms and MOOCs as compared to the quality of the education provided by traditional brick and mortar institutions?” and responses used the following scale. 1. Much worse 2. Slightly worse 3. Neither better nor worse/No opinion/I do not know 4. Slightly better 5. Much better

(a)

Since there are a lot of missing salaries, let’s remove the salary data, and since there are so many diﬀerent countries, let’s also remove the country data. From what remains, remove any incomplete cases.

KaggleSurveyData <- read.csv("C:\\Users\\Sarah\\OneDrive\\DAT315\\Project 5\\KaggleSurvey.csv")
KaggleSurveyData<-subset(KaggleSurveyData, select = -c(Salary, Country))
KaggleSurveyData<-KaggleSurveyData[complete.cases(KaggleSurveyData),]

(b)

Use the polr function in the MASS package to build an ordinal regression model called ORD using Gender, Age, and Student to predict responses to the survey.

library(MASS)
KaggleSurveyData$Response <- as.factor(KaggleSurveyData$Response)
ORD <- polr(Response~Gender+Age+Student, data =  KaggleSurveyData)

(c)

Which group is most likely to respond “Much better” to the survey question?

testing <- data.frame(Student=c(0,1,0,1), Gender=c("Male", "Male", "Female", "Female"), Age=c(25,25,25,25))
predict(ORD,newdata=testing,type="p")

##            1          2         3         4         5
## 1 0.03310606 0.10486481 0.3213671 0.2890439 0.2516181
## 2 0.02887626 0.09315773 0.3025289 0.2963388 0.2790983
## 3 0.03827625 0.11858475 0.3400057 0.2787800 0.2243532
## 4 0.03340870 0.10568552 0.3225823 0.2884737 0.2498498

Group 2 is most likely to respond “Much better” to the survey question. The probability of this is about 27.9%.

(d)

Carefully explain the aﬀect of Age on the model.

testing2 <- data.frame(Student=c(0,1,0,1), Gender=c("Male", "Male", "Female", "Female"), Age=c(65,65,65,65))
predict(ORD,newdata=testing2,type="p")

##            1         2         3         4         5
## 1 0.04768491 0.1419852 0.3643886 0.2590127 0.1869286
## 2 0.04167271 0.1272591 0.3500645 0.2716839 0.2093198
## 3 0.05500236 0.1588800 0.3769864 0.2440045 0.1651267
## 4 0.04811418 0.1430069 0.3652622 0.2581146 0.1855022

Looking at part c, it appears that age group strongly influences the probability of response in each group. In part c, all four groups were individuals aged 25 and each individual had roughly the same probability of choosing each repsonse(1-5). In other words, responding 1-Much worse had around the same probability in all four groups and the same follows for responses 2-5.

I also ran a similar prediction on 65 year olds. They also have roughly the same probability of choosing each response(1-5).

Problem 7

The ﬁle SouthAmerica.csv contains data on ten countries in South America.

(a)

Load the data, rename the rows with the names of the countries, and use scale to center and scale each column. Then use hclust to produce a cluster dendrogram that displays how similar countries are to one another. Use plot to display the clusters and be sure that the country names are used for the labels.

SouthAmerica_Data <- read.csv("C:\\Users\\Sarah\\OneDrive\\DAT315\\Project 5\\SouthAmerica.csv")
Scaled_SouthAmerica<- scale(SouthAmerica_Data[,c(2:8)], center = TRUE, scale = TRUE) #center and scale all numeric variables
row.names(Scaled_SouthAmerica)<- c("Argentina", "Bolivia", "Brazil", "Chile", "Colombia", "Ecuador", "Paraguay", "Peru","Uruguay","Venezuela") 
clusters7 <- hclust(dist(Scaled_SouthAmerica[1:10,1:7]))
plot(clusters7, xlab="Country")

(b)

According to the dendrogram, which two countries are most like Colombia?

Peru and Ecuador are most like Colombia.

(c)

Suppose that we choose a height so that there are only two clusters. List the countries in each cluster.

I don’t think there is any height where they are only two clusters.

Problem 8

The ﬁle EducationLevel.csv contains data on education levels in all of the counties in the United States.

(a)

Load the data. There is no need to scale the columns since the numerical columns are all percentages. Carefully examine the data and do any necessary pre-processing.

EducationLevel_Data <- read.csv("C:\\Users\\Sarah\\OneDrive\\DAT315\\Project 5\\EducationLevel.csv")
EducationLevel_Data1<-EducationLevel_Data[complete.cases(EducationLevel_Data),] #removes rows with any NAs
EducationLevel_Data1 <-  EducationLevel_Data1[EducationLevel_Data1$FIPS.Code %% 1000 != 0, ]
EducationLevel_Data1 <- EducationLevel_Data1[EducationLevel_Data1$State != "DC", ]
EducationLevel_Data1 <- EducationLevel_Data1[EducationLevel_Data1$State != "PR", ]

Kmeans clustering can not handle any NAs in a dataset so I removed them using complete.cases. I also used mod 1000 to remove the state records from the file. Washington DC and Puerto Rico were also removed from the data set.

(b)

Set the seed to 1234. Use K-means clustering (kmeans) with 2 clusters on the percentage data.

set.seed(1234)
clusters <-kmeans(EducationLevel_Data1[,4:7], 2)

(c)

Make a new data frame called codes that has two variables, fips and cluster. fips contains the numerical ID for each county and cluster identiﬁes the cluster to which each county belongs.

codes <- data.frame(fips=EducationLevel_Data1$FIPS.Code, cluster=clusters$cluster)

(d)

Load the usmaps package and run the following code to generate a color-coded map of the US where the color indicates the cluster membership for each county.

library(usmap)
library(ggplot2)# needed for scale_fill_continuous
plot_usmap(data=codes, labels=TRUE, value='cluster', label_color='white') + scale_fill_continuous(low="red", high="green") + theme(legend.position = "none")

Project 5

Sarah Brickman and Maxwell Billante

4/22/2020