Part 1: Association Analysis

Load the bob_ross csv file into R. This data contains information about various paintings done by Bob Ross. Bob Ross was an American artist known for his TV show, The Joy of Painting. This dataset lists possible elements of each painting, such as a barn or tree, and whether each was present or absent.

bob <- read.csv("C:/Users/justt/Desktop/School/622/Homework/HW 1/bob_ross.csv")

To perform association analysis on this dataset, we need to first remove all non-binary columns from the dataset, and then convert the dataset into a matrix of transactions. Do this in R.

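library(arules)  # provides the "transactions" class and apriori()
# drop the first two columns (the non-binary ones), then coerce the 0/1 matrix to transactions
bob_remove <- bob[, -(1:2)]
bob_convert <- as(as.matrix(bob_remove), "transactions")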

Perform association analysis to determine rules with a support of at least 30% and confidence of at least 90%. Bob Ross included trees in a large number of his drawings. Display all of the rules that you find in your output, and using these rules, describe some of the types of landscapes in which Bob Ross tended to draw tree(s).

bob_bask <- apriori(bob_convert, parameter = list(support = 0.3, confidence = 0.9, target = "rules"))

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.9    0.1    1 none FALSE            TRUE       5     0.3      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 120 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[66 item(s), 403 transaction(s)] done [0.00s].
## sorting and recoding items ... [9 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [31 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

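The analysis produced 31 rules, with Tree(s) on the right-hand side. The elements that appear alongside tree(s) in these rules are River, Grass, Lake, Mountain, Deciduous, Conifer, and Tree(s) itself, so Bob Ross tended to paint trees in landscapes that feature rivers, lakes, mountains, and grass.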
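To make the 30% support and 90% confidence thresholds concrete, the following is a minimal base R sketch of how both measures are computed for a single rule. The `toy` matrix and the names `support_lake_tree` and `confidence_lake_tree` are hypothetical and are not taken from bob_ross.csv.

```r
# Toy data (hypothetical): rows are paintings, columns are elements
toy <- matrix(
  c(1, 1, 0,
    1, 1, 1,
    0, 1, 0,
    1, 1, 1,
    1, 0, 0),
  ncol = 3, byrow = TRUE,
  dimnames = list(NULL, c("Lake", "Tree", "Mountain"))
)

# Support of {Lake, Tree}: share of paintings that contain both elements
support_lake_tree <- mean(toy[, "Lake"] == 1 & toy[, "Tree"] == 1)

# Confidence of Lake => Tree: support of {Lake, Tree} divided by support of {Lake}
confidence_lake_tree <- support_lake_tree / mean(toy[, "Lake"] == 1)

support_lake_tree     # 0.6
confidence_lake_tree  # 0.75
```

In this toy example the rule Lake => Tree would pass a 30% support threshold but not a 90% confidence threshold, so it would not be among the rules returned.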

Part 2: Predicting a Numeric Column of Data

Load the nhanes_train csv file into R. This data contains information about the weight, age, height, and pulse of various individuals. Use this data in completing the next question.

n_train <- read.csv("C:/Users/justt/Desktop/School/622/Homework/HW 1/nhanes_train.csv")

Develop analytical models to predict a person’s weight based on his/her age, height, and pulse (i.e., use the variable Weight as your target variable). In particular, create a regression model, a decision tree, a bagging model using 100 bootstrapped samples, a random forest with 200 trees, and a boosting model with 200 trees each having 5 splits and a shrinkage/weight of 0.03.

# packages that provide the model-fitting functions used below
library(rpart)         # rpart()
library(ipred)         # bagging(); assumed, since nbagg is ipred's argument name
library(randomForest)  # randomForest()
library(gbm)           # gbm()

# regression model
rand_train <- lm(Weight ~ ., data = n_train)

# decision tree
dtree_train <- rpart(formula = Weight ~ ., data = n_train)

# bagging model using 100 bootstrapped samples
bagg_train <- bagging(formula = Weight ~ ., data = n_train, nbagg = 100)

# random forest with 200 trees
rf_train <- randomForest(Weight ~ ., data = n_train, importance = TRUE, ntree = 200)

# boosting model with 200 trees each having 5 splits and a shrinkage/weight of 0.03
boost_train <- gbm(formula = Weight ~ ., data = n_train, distribution = "gaussian", n.trees = 200, shrinkage = 0.03, interaction.depth = 5)

Load the nhanes_test csv file into R, and use this data to calculate the mean squared error (MSE) for each of the models that you developed in the previous question. Based on your results, which model is best for predicting body weight?

n_test <- read.csv("C:/Users/justt/Desktop/School/622/Homework/HW 1/nhanes_test.csv")

Regression model MSE: 331.5604.
Decision tree MSE: 331.4537.
Bagging model, using 100 bootstrapped samples, MSE: 322.1741.
Random forest, with 200 trees, MSE: 93.95295.
Boosting model MSE: 288.4683.

The random forest has the lowest MSE of the five models, so by this measure it is the most accurate model for predicting body weight.
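The report does not show the code behind these MSE values, so the following is only a sketch of one common way to compute a test-set MSE in base R. The data frames `toy_train` and `toy_test`, the helper `mse()`, and the simulated values are all hypothetical stand-ins for n_train and n_test.

```r
# Hypothetical stand-in data; the real analysis would use n_train and n_test
set.seed(1)
toy_train <- data.frame(Height = rnorm(50, 170, 10))
toy_train$Weight <- 0.9 * toy_train$Height - 70 + rnorm(50, 0, 5)
toy_test <- data.frame(Height = rnorm(20, 170, 10))
toy_test$Weight <- 0.9 * toy_test$Height - 70 + rnorm(20, 0, 5)

# Helper (hypothetical name): mean squared error of predictions
mse <- function(actual, predicted) mean((actual - predicted)^2)

# Fit on the training data, predict on the held-out test data
toy_fit  <- lm(Weight ~ ., data = toy_train)
toy_pred <- predict(toy_fit, newdata = toy_test)

toy_mse <- mse(toy_test$Weight, toy_pred)
toy_mse
```

The same pattern would apply to each fitted model by swapping in its own `predict()` call with `newdata = n_test`; for a gbm fit, `predict()` generally also expects an `n.trees` argument. Predictions must come from the test set rather than the training set for the comparison to be fair.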

Part 3: Clustering

Use R to compute the mean of each variable in the nhanes_train csv file.

mean(n_train$Age)

## [1] 45.14699

mean(n_train$Weight)

## [1] 82.63375

mean(n_train$Height)

## [1] 169.5465

mean(n_train$Pulse)

## [1] 72.32045

Mean of Age is 45.14699.
Mean of Weight is 82.63375.
Mean of Height is 169.5465.
Mean of Pulse is 72.32045.
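As a compact alternative to calling `mean()` once per column, base R's `colMeans()` returns every column mean in a single call. The sketch below uses a hypothetical data frame `toy` that only borrows the column names of nhanes_train; the values are made up for illustration.

```r
# Hypothetical toy data frame with the same column names as nhanes_train
toy <- data.frame(
  Age    = c(30, 40, 50),
  Weight = c(70, 80, 90),
  Height = c(160, 170, 180),
  Pulse  = c(60, 70, 80)
)

# One call returns a named vector holding the mean of each numeric column
toy_means <- colMeans(toy)
toy_means
```

Applied to the real data, `colMeans(n_train)` should give the same four values as the separate `mean()` calls, assuming every column is numeric.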

Perform k-means clustering on the data in the nhanes_train csv file. Create 4 distinct clusters.

# kmeans() is part of the stats package, which R loads by default
fit <- kmeans(n_train, centers = 4)

Use R to display the mean of each variable in each cluster, and compare these means with those that you obtained in the previous question.

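                    Age        Weight      Height     Pulse
Cluster 1           37.09263   115.77789   175.9537   77.14105
Cluster 2           63.45441    67.71181   161.9109   69.48580
Cluster 3           54.84072    92.01787   174.3251   69.24931
Cluster 4           29.80299    70.16010   168.1681   74.13532
Overall mean        45.14699    82.63375   169.5465   72.32045

Compared with the overall means, cluster 1 is younger and much heavier, with above-average height and pulse. Cluster 2 is the oldest group and is below average on weight, height, and pulse. Cluster 3 is older, heavier, and taller than average, with a lower pulse. Cluster 4 is the youngest group, lighter than average and close to average height, with a slightly higher pulse.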
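The report does not show how the cluster means were produced. One possible approach, sketched here on hypothetical toy data, is to read them from the `centers` element of the k-means fit, or to recompute them from the cluster assignments with `aggregate()`. The names `toy`, `toy_fit`, `toy_cluster_means`, and `toy_overall` are illustrative only.

```r
set.seed(42)
# Hypothetical toy data: two well-separated groups
toy <- data.frame(
  Age    = c(rnorm(20, 30, 2), rnorm(20, 60, 2)),
  Weight = c(rnorm(20, 70, 3), rnorm(20, 90, 3))
)
toy_fit <- kmeans(toy, centers = 2)

# Per-cluster means, stored directly on the fit (one row per cluster)
toy_fit$centers

# The same values recomputed from the cluster assignments
toy_cluster_means <- aggregate(toy, by = list(cluster = toy_fit$cluster), FUN = mean)
toy_cluster_means

# Overall column means, for comparison against each cluster
toy_overall <- colMeans(toy)
toy_overall
```

For the analysis here, the equivalent calls would be `fit$centers` and `colMeans(n_train)`.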

Determine which cluster the data in row 2000 was placed in.

fit$cluster[2000]

## [1] 1

In the run shown above, the data for row 2000 was placed in cluster 1.
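Because `kmeans()` starts from randomly chosen centers, the cluster number assigned to a given row can change from one run to the next. A minimal sketch of making the labels reproducible with `set.seed()` follows; the seed value 123 and the data frame `toy` are arbitrary, hypothetical choices.

```r
# Hypothetical toy data; the point is the effect of set.seed(), not the values
toy <- data.frame(
  x = c(1, 2, 3, 10, 11, 12, 20, 21, 22),
  y = c(1, 2, 1, 10, 12, 11, 20, 22, 21)
)

set.seed(123)
run_a <- kmeans(toy, centers = 3)
set.seed(123)
run_b <- kmeans(toy, centers = 3)

# Same seed, same starting centers, so the labels match row for row
identical(run_a$cluster, run_b$cluster)  # TRUE

# Cluster label of a single row (row 5 here)
run_a$cluster[5]
```

Setting a seed immediately before the `kmeans()` call would keep a reported cluster number consistent with the output each time the document is knitted.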

Homework 1

Tammy Brockman

2023-02-01
