Part 1: Association Analysis

  1. Load the bob_ross csv file into R. This data contains information about various paintings done by Bob Ross. Bob Ross was an American artist known for his TV show, The Joy of Painting. This dataset lists possible elements of each painting, such as a barn or tree, and whether they were present or absent.
bob <- read.csv("C:/Users/justt/Desktop/School/622/Homework/HW 1/bob_ross.csv")
  1. To perform association analysis on this dataset, we need to first remove all non-binary columns from the dataset, and then convert the dataset into a matrix of transactions. Do this in R.
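library(arules)
# Drop the first two (non-binary) columns, then coerce the 0/1 matrix to transactions
bob_remove <- bob[, -(1:2)]
bob_convert <- as(as.matrix(bob_remove), "transactions")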
  1. Perform association analysis to determine rules with a support of at least 30% and confidence of at least 90%. Bob Ross included trees in a large number of his drawings. Display all of the rules that you find in your output, and using these rules, describe some of the types of landscapes in which Bob Ross tended to draw tree(s).
bob_bask <- apriori(bob_convert, parameter = list(support = 0.3, confidence = 0.9, target = "rules"))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.9    0.1    1 none FALSE            TRUE       5     0.3      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 120 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[66 item(s), 403 transaction(s)] done [0.00s].
## sorting and recoding items ... [9 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [31 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

The analysis produced 31 rules with Tree(s) on the right-hand side. The landscape elements that appear together with tree(s) in these rules include River, Grass, Lake, Mountain, Deciduous, and Conifer.
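To make the two thresholds concrete, the sketch below computes support and confidence by hand for a single rule. It uses a small made-up 0/1 matrix, not the bob_ross data, and the column names `tree`, `lake`, and `mountain` are assumptions chosen only for illustration.

```r
# Toy 0/1 matrix (made-up values, NOT the bob_ross data);
# column names are hypothetical
toy <- matrix(
  c(1, 1, 1,
    1, 1, 0,
    1, 0, 1,
    0, 1, 0,
    1, 1, 1),
  ncol = 3, byrow = TRUE,
  dimnames = list(NULL, c("tree", "lake", "mountain"))
)

# Rule under inspection: {lake, mountain} => {tree}
lhs  <- toy[, "lake"] == 1 & toy[, "mountain"] == 1
both <- lhs & toy[, "tree"] == 1

# Support: share of all rows that contain both sides of the rule
support <- mean(both)

# Confidence: share of rows matching the left-hand side
# that also contain the right-hand side
confidence <- sum(both) / sum(lhs)

support
confidence
```

A rule passes the filter used above only if its support is at least 0.3 and its confidence is at least 0.9. On the real rule set, `inspect(bob_bask)` from arules prints each rule with these measures.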

Part 2: Predicting a Numeric Column of Data

  1. Load the nhanes_train csv file into R. This data contains information about the weight, age, height, and pulse of various individuals. Use this data in completing the next question.
n_train <- read.csv("C:/Users/justt/Desktop/School/622/Homework/HW 1/nhanes_train.csv")
  1. Develop analytical models to predict a person’s weight based on his/her age, height, and pulse (i.e., use the variable Weight as your target variable). In particular, create a regression model, a decision tree, a bagging model using 100 bootstrapped samples, a random forest with 200 trees, and a boosting model with 200 trees each having 5 splits and a shrinkage/weight of 0.03.
library(rpart)         # rpart()
library(ipred)         # bagging()
library(randomForest)  # randomForest()
library(gbm)           # gbm()
# regression model
rand_train <- lm(Weight ~ ., data = n_train)
# decision tree
dtree_train <- rpart(formula = Weight ~ ., data = n_train)
# bagging model using 100 bootstrapped samples
bagg_train <- bagging(formula = Weight ~ ., data = n_train, nbagg = 100)
# random forest with 200 trees
rf_train <- randomForest(Weight ~ ., data = n_train, importance = TRUE, ntree = 200)
# boosting model with 200 trees each having 5 splits and a shrinkage/weight of 0.03
boost_train <- gbm(formula = Weight ~ ., data = n_train, distribution = "gaussian", n.trees = 200, shrinkage = 0.03, interaction.depth = 5)
  1. Load the nhanes_test csv file into R, and use this data to calculate the mean squared error (MSE) for each of the models that you developed in the previous question. Based on your results, which model is best for predicting body weight?
n_test <- read.csv("C:/Users/justt/Desktop/School/622/Homework/HW 1/nhanes_test.csv") 
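The MSE calculation itself is not shown above. A minimal sketch of one way to do it follows: the helper `mse()` is a hypothetical name introduced here, the toy numbers are made up, and the commented lines assume the model objects and `n_test` defined earlier. No results are implied.

```r
# Hypothetical helper (not part of the original answer):
# mean of the squared differences between actual and predicted values
mse <- function(actual, predicted) {
  mean((actual - predicted)^2)
}

# Toy check with made-up numbers: errors are 0, 0 and 2
toy_mse <- mse(c(1, 2, 3), c(1, 2, 5))
toy_mse

# Applied to the models above (assumes those objects exist in the session).
# gbm needs the number of trees to be stated at prediction time.
# mse(n_test$Weight, predict(rand_train,  newdata = n_test))
# mse(n_test$Weight, predict(dtree_train, newdata = n_test))
# mse(n_test$Weight, predict(bagg_train,  newdata = n_test))
# mse(n_test$Weight, predict(rf_train,    newdata = n_test))
# mse(n_test$Weight, predict(boost_train, newdata = n_test, n.trees = 200))
```

The model with the smallest test MSE would be the preferred one for predicting body weight.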

Part 3: Clustering

  1. Use R to compute the mean of each variable in the nhanes_train csv file.
mean(n_train$Age)
## [1] 45.14699
mean(n_train$Weight)
## [1] 82.63375
mean(n_train$Height)
## [1] 169.5465
mean(n_train$Pulse)
## [1] 72.32045

  1. Perform k-means clustering on the data in the nhanes_train csv file. Create 4 distinct clusters.
# kmeans() is in the stats package, which R loads by default
fit <- kmeans(n_train, 4)

  1. Use R to display the mean of each variable in each cluster, and compare these means with those that you obtained in the previous question.

                    Age       Weight      Height     Pulse
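The code for this step is not shown. As a sketch of how per-cluster means can be displayed, the example below runs `kmeans()` on a small made-up data frame standing in for `n_train`; the values and the names `toy` and `toy_fit` are assumptions for illustration only.

```r
# Toy data standing in for n_train (values are made up)
toy <- data.frame(
  Age    = c(20, 22, 61, 65, 40, 42),
  Weight = c(60, 62, 80, 85, 70, 72)
)

set.seed(1)  # arbitrary seed so the toy run is repeatable
toy_fit <- kmeans(toy, centers = 2)

# Per-cluster means as stored in the fitted object
toy_fit$centers

# The same means computed directly from the cluster assignments
aggregate(toy, by = list(cluster = toy_fit$cluster), FUN = mean)

# Overall means, for comparison with the per-cluster means
colMeans(toy)
```

With the objects from this document, the corresponding calls would be `fit$centers` and `colMeans(n_train)`.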
  1. Determine which cluster the data in row 2000 was placed in.
fit$cluster[2000]
## [1] 1

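In the run shown above, the data for row 2000 was placed in cluster 1. Because k-means starts from randomly chosen centers, the cluster number can differ between runs unless a seed is set with set.seed().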
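Cluster numbers from `kmeans()` depend on the random starting centers, so a reported label is only repeatable if the random seed is fixed first. The sketch below shows this on made-up data; the seed value and the object names are arbitrary assumptions.

```r
# Toy data (made-up values) standing in for n_train
toy <- data.frame(
  x = c(1, 2, 10, 11, 20, 21),
  y = c(1, 1, 5, 5, 9, 9)
)

set.seed(622)  # arbitrary seed, chosen only for illustration
fit_a <- kmeans(toy, centers = 3)

set.seed(622)  # same seed again, so the same starting centers are drawn
fit_b <- kmeans(toy, centers = 3)

# Compare the cluster labels from the two runs
identical(fit_a$cluster, fit_b$cluster)
```

Calling `set.seed()` immediately before `kmeans(n_train, 4)` would make the label reported for row 2000 repeatable in the same way.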