Homework 1: Data Preparation

1. Give an example of how a hospital research center, with access to all patient records, can use classification, clustering and association analysis to help physicians and/or healthcare administrators make better decisions. For each technique, specify the goal (business question), the input and the output. You may assume you have access to any relevant data you need. (9 pts)

Classification: Goal: Identify patients at risk of breast cancer. Classification can learn which characteristics patients diagnosed with the same disease have in common. Input: Characteristics of patients with breast cancer, such as previous diagnoses, family history, smoking, etc. Output: A prediction of whether someone is likely to have, or to be diagnosed with, breast cancer, based on how similar they are to patients who have already been diagnosed.
Clustering: Goal: Identify the specific care demands of elderly patients from their characteristics. Input: Attributes such as number of medications used, age, mobility, and cognitive stability. Output: Clusters of individuals with similar characteristics, giving the hospital a better idea of how much care each patient is likely to require. The clusters would help identify individuals with a higher potential need for time-consuming care.
Association: Goal: In healthcare, a diagnosis of one condition often leads to the discovery of other problems. Association analysis can find rules of the form: when a patient is diagnosed with one condition, they are likely to have, or should be checked for, another. Input: Historical records showing which diagnoses occurred together for the same patient. Output: Rules that associate diseases and diagnoses that typically occur together.

2. Classify the following attributes as binary, discrete, or continuous. Also classify them as qualitative (nominal or ordinal) or quantitative (interval or ratio) (2 pts)

- Field length: Discrete, ratio
- Pain level description on a scale of 0 to 5: Discrete, ordinal
- Stock closing price: Continuous, ratio
- Stock symbol: Discrete, nominal
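
As a rough illustration of how these attribute types could be represented in R, the sketch below stores an ordinal attribute as an ordered factor, a nominal attribute as an unordered factor, and the quantitative attributes as numbers. All values and variable names are made up for illustration and are not part of the assignment data.

```r
# Hypothetical example values; none of these come from the assignment.
field_length  <- c(10L, 25L, 255L)                # quantitative, whole numbers
pain_level    <- factor(c(0, 3, 5), levels = 0:5,
                        ordered = TRUE)           # ordinal: order matters, differences do not
closing_price <- c(101.25, 99.80, 102.10)         # continuous, ratio
stock_symbol  <- factor(c("AAPL", "MSFT", "AAPL")) # nominal: labels only, no order

# Order comparisons are meaningful for ordinal data ...
pain_level[1] < pain_level[3]

# ... but only equality makes sense for nominal data.
stock_symbol[1] == stock_symbol[3]
```

Using an ordered factor keeps R from treating the pain scale as a ratio quantity, for example by averaging it without an explicit conversion.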

3. Consider the following sample from a data set used for the purpose of predicting home values in Boston. Describe three issues with the data and explain how you would resolve them. (6 pts)

The data set is very small. With a data set this small, it is very unlikely that you could build a model that accurately predicts home value. To fix this you could obtain more data by collecting it yourself, scraping the web for additional data, or buying data from data providers.
The data set also has no units of measurement. What is the Total Value column measured in: tens of thousands or hundreds of thousands? There is a similar issue with identifying what z-value means. This would have to be fixed by consulting a reference and doing some research to make reasonable assumptions about the data.
The lot number could potentially identify the exact location of a house; however, Boston is a huge city with many different areas where house prices differ. To properly model this data set we would need more information on where these houses are located. Are they in the same neighborhood, or on different ends of the city? To fix this we would have to find out where the original data was collected, gather that location information, and then model the data set by location.

4. Consider the following pairs of records. In each case, specify an appropriate measure of similarity, explain your choice and compute the similarity between the two records. State any assumptions you make. (6 pts)

A

We can use the Jaccard coefficient because the attributes are asymmetric: matching absences are not counted, which makes sense when comparing items that were purchased. In this case it gives 3/6 = 0.5.

B

For this one we would use the simple matching coefficient (SMC), which counts matching presences and matching absences equally: 3/6 = 0.5.
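
To make the difference between the two measures concrete, here is a minimal sketch in R. The two binary vectors are hypothetical stand-ins, not the records from the assignment, and the helper names `jaccard` and `smc` are chosen here for illustration.

```r
# Hypothetical binary records (1 = present, 0 = absent); not the homework data.
x <- c(1, 1, 1, 0, 1, 0, 0, 0)
y <- c(1, 1, 1, 1, 0, 1, 0, 0)

# Jaccard: matching presences divided by all positions that are not
# matching absences (0-0 matches are ignored).
jaccard <- function(x, y) {
  f11 <- sum(x == 1 & y == 1)   # both present
  mismatches <- sum(x != y)     # present in exactly one record
  f11 / (f11 + mismatches)
}

# SMC: matching presences and matching absences both count.
smc <- function(x, y) mean(x == y)

jaccard(x, y)  # 3 / (3 + 3) = 0.5
smc(x, y)      # (3 + 2) / 8 = 0.625
```

The two measures differ only in whether the 0-0 matches are counted, which is why Jaccard suits asymmetric attributes such as purchases.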

C

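The attributes are on very different scales, so we need to normalize the data first. Each attribute is divided by its norm across the two cars:

NORM = ((Car1^2) + (Car2^2))^0.5

Engine Cap Norm = 4.72
HP = 330.19
Cylinder = 7.21
Torque = 352.46
Weight = 2109.5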

Labels = c('Engine Cap', 'HP', 'Cylinder', 'Torque', 'Weight')
Car1 = c(2.5, 175, 4, 185, 1100)
Car2 = c(4.0, 280, 6, 300, 1800)

# Per-attribute norm across the two cars: (Car1^2 + Car2^2)^0.5
NormalizedNumber = sqrt(Car1^2 + Car2^2)

NCar1 = Car1/NormalizedNumber
NCar2 = Car2/NormalizedNumber
NDAT = data.frame(Labels, NCar1, NCar2)

# Use Euclidean distance on the normalized values
Euclidean = sqrt(sum((NCar1 - NCar2)^2))
Similarity = 1 - Euclidean

Euclidean = .70409
Similarity = .29591

5.A

library(car)
library(sqldf)

## Loading required package: gsubfn

## Loading required package: proto

## Loading required package: RSQLite

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following object is masked from 'package:car':
## 
##     recode

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggfortify)

## Loading required package: ggplot2

data(Pottery)

5.B: Explore the dataset Pottery: view the structure, first/last few rows and summary statistics.

head(Pottery, 5)

##        Site   Al   Fe   Mg   Ca   Na
## 1 Llanedyrn 14.4 7.00 4.30 0.15 0.51
## 2 Llanedyrn 13.8 7.08 3.43 0.12 0.17
## 3 Llanedyrn 14.6 7.09 3.88 0.13 0.20
## 4 Llanedyrn 11.5 6.37 5.64 0.16 0.14
## 5 Llanedyrn 13.8 7.06 5.34 0.20 0.20

tail(Pottery, 5)

##           Site   Al   Fe   Mg   Ca   Na
## 22 AshleyRails 17.7 1.12 0.56 0.06 0.06
## 23 AshleyRails 18.3 1.14 0.67 0.06 0.05
## 24 AshleyRails 16.7 0.92 0.53 0.01 0.05
## 25 AshleyRails 14.8 2.74 0.67 0.03 0.05
## 26 AshleyRails 19.1 1.64 0.60 0.10 0.03

nrow(Pottery)

## [1] 26

ncol(Pottery)

## [1] 6

summary(Pottery)

##           Site          Al              Fe              Mg       
##  AshleyRails: 5   Min.   :10.10   Min.   :0.920   Min.   :0.530  
##  Caldicot   : 2   1st Qu.:11.95   1st Qu.:1.700   1st Qu.:0.670  
##  IsleThorns : 5   Median :13.80   Median :5.465   Median :3.825  
##  Llanedyrn  :14   Mean   :14.49   Mean   :4.468   Mean   :3.142  
##                   3rd Qu.:17.45   3rd Qu.:6.590   3rd Qu.:4.503  
##                   Max.   :20.80   Max.   :7.090   Max.   :7.230  
##        Ca               Na        
##  Min.   :0.0100   Min.   :0.0300  
##  1st Qu.:0.0600   1st Qu.:0.0500  
##  Median :0.1550   Median :0.1500  
##  Mean   :0.1465   Mean   :0.1585  
##  3rd Qu.:0.2150   3rd Qu.:0.2150  
##  Max.   :0.3100   Max.   :0.5400

5.C: List Site, Mg, and Ca of all rows with Al greater than 12.5

SubSet = sqldf("select Site, Mg, Ca from Pottery where Al > 12.5")

## Loading required package: tcltk

## Warning: Quoted identifiers should have class SQL, use DBI::SQL() if the
## caller performs the quoting.

head(SubSet, 5)

##        Site   Mg   Ca
## 1 Llanedyrn 4.30 0.15
## 2 Llanedyrn 3.43 0.12
## 3 Llanedyrn 3.88 0.13
## 4 Llanedyrn 5.34 0.20
## 5 Llanedyrn 7.23 0.28
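
For comparison, the same filter can be written in base R without sqldf. The sketch below is self-contained, so it rebuilds only the first five rows printed by head() in 5.B rather than loading the full data set; the names `PotteryHead`, `SubSetBase` and `SubSetIdx` are chosen here for illustration.

```r
# First five rows of Pottery, copied from the head() output in 5.B.
PotteryHead <- data.frame(
  Site = rep("Llanedyrn", 5),
  Al   = c(14.4, 13.8, 14.6, 11.5, 13.8),
  Fe   = c(7.00, 7.08, 7.09, 6.37, 7.06),
  Mg   = c(4.30, 3.43, 3.88, 5.64, 5.34),
  Ca   = c(0.15, 0.12, 0.13, 0.16, 0.20),
  Na   = c(0.51, 0.17, 0.20, 0.14, 0.20)
)

# Equivalent of: select Site, Mg, Ca from Pottery where Al > 12.5
SubSetBase <- subset(PotteryHead, Al > 12.5, select = c(Site, Mg, Ca))

# The same thing with logical row indexing and column names.
SubSetIdx <- PotteryHead[PotteryHead$Al > 12.5, c("Site", "Mg", "Ca")]

SubSetBase
```

Row 4 (Al = 11.5) is dropped, which matches the first four rows of the sqldf result above.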

5.D: Make a copy of the data set and store it in variable PotteryN. Remove all non-numeric columns from PotteryN.

PotteryN = Pottery[, sapply(Pottery, is.numeric)]

5.E: Find the covariance and correlation matrices of PotteryN. Which attributes have the strongest correlation?

(Cl <- cor(PotteryN))

##            Al         Fe         Mg         Ca         Na
## Al  1.0000000 -0.7888222 -0.7983975 -0.7635276 -0.4725948
## Fe -0.7888222  1.0000000  0.9006753  0.7652053  0.6616845
## Mg -0.7983975  0.9006753  1.0000000  0.8419589  0.6427235
## Ca -0.7635276  0.7652053  0.8419589  1.0000000  0.4815327
## Na -0.4725948  0.6616845  0.6427235  0.4815327  1.0000000

(CV <- cov(PotteryN))

##            Al         Fe         Mg           Ca           Na
## Al  8.9559385 -5.6886185 -5.2080677 -0.231307692 -0.191332308
## Fe -5.6886185  5.8068985  4.7308837  0.186663692  0.215708308
## Mg -5.2080677  4.7308837  4.7512055  0.185781538  0.189526462
## Ca -0.2313077  0.1866637  0.1857815  0.010247538  0.006594462
## Na -0.1913323  0.2157083  0.1895265  0.006594462  0.018301538

Mg and Fe have the strongest correlation, a positive correlation of 0.90.
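
The strongest pair can also be picked out programmatically instead of by eye. The sketch below re-enters the correlation matrix printed above so that it runs on its own; `strongest_pair` is a hypothetical helper name, and it ranks pairs by absolute correlation so that strong negative correlations are considered too.

```r
vars <- c("Al", "Fe", "Mg", "Ca", "Na")

# Correlation matrix copied from the cor(PotteryN) output above.
Cl <- matrix(c(
   1.0000000, -0.7888222, -0.7983975, -0.7635276, -0.4725948,
  -0.7888222,  1.0000000,  0.9006753,  0.7652053,  0.6616845,
  -0.7983975,  0.9006753,  1.0000000,  0.8419589,  0.6427235,
  -0.7635276,  0.7652053,  0.8419589,  1.0000000,  0.4815327,
  -0.4725948,  0.6616845,  0.6427235,  0.4815327,  1.0000000
), nrow = 5, byrow = TRUE, dimnames = list(vars, vars))

strongest_pair <- function(cm) {
  # Blank out the diagonal and lower triangle so each pair is seen once.
  cm[lower.tri(cm, diag = TRUE)] <- NA
  idx <- which(abs(cm) == max(abs(cm), na.rm = TRUE), arr.ind = TRUE)[1, ]
  list(
    vars = c(rownames(cm)[idx[["row"]]], colnames(cm)[idx[["col"]]]),
    r    = cm[idx[["row"]], idx[["col"]]]
  )
}

strongest_pair(Cl)  # Fe and Mg, r = 0.9006753
```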

5.F: Scatter plot the attributes with the strongest correlation

plot(PotteryN$Mg,PotteryN$Fe)

5.G: Find the principal components. Plot the first two components, add color to the graph based on the column ‘Site’.

PC = prcomp(PotteryN)
autoplot(PC, data = Pottery, colour = 'Site')
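
One thing to keep in mind when reading this plot: prcomp() centers the data but does not scale it by default, so the components come from the covariance matrix, where Al, Fe and Mg have far larger variances than Ca and Na. The sketch below is an illustration rather than part of the required answer. It uses the covariance matrix printed in 5.E to compare the share of variance captured by each component with and without scaling; the scaled case should correspond to prcomp(PotteryN, scale. = TRUE).

```r
vars <- c("Al", "Fe", "Mg", "Ca", "Na")

# Covariance matrix copied from the cov(PotteryN) output in 5.E.
CV <- matrix(c(
   8.9559385, -5.6886185, -5.2080677, -0.231307692, -0.191332308,
  -5.6886185,  5.8068985,  4.7308837,  0.186663692,  0.215708308,
  -5.2080677,  4.7308837,  4.7512055,  0.185781538,  0.189526462,
  -0.2313077,  0.1866637,  0.1857815,  0.010247538,  0.006594462,
  -0.1913323,  0.2157083,  0.1895265,  0.006594462,  0.018301538
), nrow = 5, byrow = TRUE, dimnames = list(vars, vars))

# Unscaled PCA: eigenvalues of the covariance matrix are the component variances.
ev_unscaled   <- eigen(CV, symmetric = TRUE)$values
prop_unscaled <- ev_unscaled / sum(ev_unscaled)

# Scaled PCA: the same calculation on the correlation matrix.
ev_scaled   <- eigen(cov2cor(CV), symmetric = TRUE)$values
prop_scaled <- ev_scaled / sum(ev_scaled)

round(rbind(unscaled = prop_unscaled, scaled = prop_scaled), 3)
```

If the first component dominates much more in the unscaled row, that is largely a reflection of the attributes' different variances rather than of structure shared across all five elements.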