Classification: Goal: Identify patients at risk of breast cancer. Classification methods can find similarities among patients who were diagnosed with the same disease. Input: Characteristics of patients with breast cancer, such as previous diagnoses, family history, smoking, etc. Output: A prediction of whether someone is likely to have, or to be diagnosed with, breast cancer, based on the similarities between potential patients and already diagnosed patients.
Clustering: Goal: Identify the specific care demands of elderly people from their individual characteristics. Input: Attributes such as number of medications used, age, mobility, and cognitive stability. Output: Clusters of individuals grouped by these characteristics, giving a hospital a better idea of how much care each patient is likely to require. The groups would help identify individuals with a higher potential need for time-consuming care.
Association: Goal: In healthcare, a diagnosis of one condition often leads to the discovery of other problems. Association analysis can express this as rules: when someone is diagnosed with one condition, they are likely to have, or should be checked for, another. Input: Historical records showing that patients diagnosed with a certain condition were also diagnosed with another condition. Output: Associations between diseases and diagnoses that typically occur together.
- Field length: Discrete, ratio
- Pain level description on a scale of 0 to 5: Discrete, ordinal
- Stock closing price: Continuous, ratio
- Stock symbol: Discrete, nominal
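As a rough illustration of how attribute types like these could be represented in R, the sketch below uses made-up values (the numbers, symbols, and variable names are assumptions for illustration, not data from the exercise). An ordered factor keeps the ranking of an ordinal attribute, a plain factor treats a nominal attribute as labels only, and a numeric vector holds a ratio attribute.

```r
# Hypothetical example values; none of these come from the exercise.
pain_level    <- factor(c(0, 3, 5), levels = 0:5, ordered = TRUE)  # ordinal
closing_price <- c(101.25, 99.80, 102.10)                          # continuous, ratio
stock_symbol  <- factor(c("AAA", "BBB", "AAA"))                    # nominal

# Order comparisons are meaningful for ordinal data
pain_level[1] < pain_level[3]

# Ratios are meaningful for ratio data
closing_price[3] / closing_price[2]

# For nominal data only equality or inequality is meaningful
stock_symbol[1] == stock_symbol[3]
```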
The data set is very small. With a data set this small, it is very unlikely that you could build a model that accurately predicts the home value. To fix this, you could obtain more data by collecting it yourself, scraping the web for additional data, or buying data from data providers.
This data set also has no units of measurement. What is the Total Value column measured in: tens of thousands or hundreds of thousands? There is a similar issue with identifying what z-value means. This would have to be fixed by consulting some type of reference and doing research to make reasonable assumptions about the data.
The lot number could potentially identify the exact location of the house; however, Boston is a huge city with many different areas where house prices differ. So, to properly model this data set we would need more information on where these houses are located. Are they in the same neighborhood or on different ends of the city? To fix this, we would have to find where the original data was collected, gather that information, and then model the data set by location.
We can use Jaccard because the attributes are asymmetric, which means we do not count matching absences. That makes sense when comparing items that were purchased. In this case it gives 3/6 = 0.5.
For this one we would use SMC, which counts matching presences and matching absences equally: 3/6 = 0.5.
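The difference between the two coefficients can be sketched in R as follows. The binary vectors `x` and `y` are hypothetical, chosen only to show how the counts enter each formula; they are not the vectors from the exercise.

```r
# Hypothetical binary vectors (assumed for illustration only)
x <- c(1, 1, 1, 1, 0, 0, 0, 0)
y <- c(1, 1, 1, 0, 1, 1, 0, 0)

f11 <- sum(x == 1 & y == 1)  # present in both
f00 <- sum(x == 0 & y == 0)  # absent in both
f10 <- sum(x == 1 & y == 0)  # present only in x
f01 <- sum(x == 0 & y == 1)  # present only in y

# Jaccard ignores matching absences (f00)
jaccard <- f11 / (f11 + f10 + f01)

# SMC counts matching presences and matching absences equally
smc <- (f11 + f00) / (f11 + f00 + f10 + f01)
```

With these made-up vectors, Jaccard is 3/6 and SMC is 5/8; the two measures differ only because of the two positions where both vectors are 0.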
Need to normalize the data first.
NORM = (Car1^2 + Car2^2)^0.5
Engine Cap Norm = 4.72
HP = 330.19
Cylinder = 7.21
Torque = 352.46
Weight = 2109.5
# Attribute labels and the per-attribute norms computed above
Labels = c('Engine Cap Norm', 'HP', 'Cylinder', 'Torque', 'Weight')
NormalizedNumber = c(4.72, 330.19, 7.21, 352.46, 2109.5)
Car1 = c(2.5, 175, 4, 185, 1100)
Car2 = c(4.0, 280, 6, 300, 1800)
# Divide each attribute by its norm
Ncar1 = Car1 / NormalizedNumber
NCar2 = Car2 / NormalizedNumber
NDAT = data.frame(Labels, Ncar1, NCar2)
# Use Euclidean distance on the normalized values
Euclidean = .70409
# Similarity taken as 1 - distance
Similarity = .29591
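The distance and similarity above are entered as constants. A minimal sketch of how they could be computed directly from the two cars is shown below, assuming the norm is the per-attribute square root of the sum of squares and that similarity is taken as 1 minus the distance, as the reported values suggest. The lower-case variable names are introduced here for illustration.

```r
Car1 <- c(2.5, 175, 4, 185, 1100)
Car2 <- c(4.0, 280, 6, 300, 1800)

# Per-attribute norm: sqrt(Car1^2 + Car2^2)
norms <- sqrt(Car1^2 + Car2^2)

# Normalize each car by the per-attribute norm
n1 <- Car1 / norms
n2 <- Car2 / norms

# Euclidean distance between the normalized vectors
euclidean <- sqrt(sum((n1 - n2)^2))

# Similarity as 1 - distance (assumption matching the values above)
similarity <- 1 - euclidean
```

Run as written, `euclidean` should come out close to the 0.70409 reported above, with small differences possible because the norms are not rounded to two decimals here.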
library(car)
library(sqldf)
## Loading required package: gsubfn
## Loading required package: proto
## Loading required package: RSQLite
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:car':
##
## recode
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggfortify)
## Loading required package: ggplot2
data(Pottery)
head(Pottery, 5)
## Site Al Fe Mg Ca Na
## 1 Llanedyrn 14.4 7.00 4.30 0.15 0.51
## 2 Llanedyrn 13.8 7.08 3.43 0.12 0.17
## 3 Llanedyrn 14.6 7.09 3.88 0.13 0.20
## 4 Llanedyrn 11.5 6.37 5.64 0.16 0.14
## 5 Llanedyrn 13.8 7.06 5.34 0.20 0.20
tail(Pottery, 5)
## Site Al Fe Mg Ca Na
## 22 AshleyRails 17.7 1.12 0.56 0.06 0.06
## 23 AshleyRails 18.3 1.14 0.67 0.06 0.05
## 24 AshleyRails 16.7 0.92 0.53 0.01 0.05
## 25 AshleyRails 14.8 2.74 0.67 0.03 0.05
## 26 AshleyRails 19.1 1.64 0.60 0.10 0.03
nrow(Pottery)
## [1] 26
ncol(Pottery)
## [1] 6
summary(Pottery)
##           Site          Al              Fe              Mg
##  AshleyRails: 5   Min.   :10.10   Min.   :0.920   Min.   :0.530
##  Caldicot   : 2   1st Qu.:11.95   1st Qu.:1.700   1st Qu.:0.670
##  IsleThorns : 5   Median :13.80   Median :5.465   Median :3.825
##  Llanedyrn  :14   Mean   :14.49   Mean   :4.468   Mean   :3.142
##                   3rd Qu.:17.45   3rd Qu.:6.590   3rd Qu.:4.503
##                   Max.   :20.80   Max.   :7.090   Max.   :7.230
##        Ca               Na
##  Min.   :0.0100   Min.   :0.0300
##  1st Qu.:0.0600   1st Qu.:0.0500
##  Median :0.1550   Median :0.1500
##  Mean   :0.1465   Mean   :0.1585
##  3rd Qu.:0.2150   3rd Qu.:0.2150
##  Max.   :0.3100   Max.   :0.5400
SubSet = sqldf("select Site, Mg, Ca from Pottery where Al > 12.5")
## Loading required package: tcltk
## Warning: Quoted identifiers should have class SQL, use DBI::SQL() if the
## caller performs the quoting.
head(SubSet, 5)
## Site Mg Ca
## 1 Llanedyrn 4.30 0.15
## 2 Llanedyrn 3.43 0.12
## 3 Llanedyrn 3.88 0.13
## 4 Llanedyrn 5.34 0.20
## 5 Llanedyrn 7.23 0.28
PotteryN = Pottery[, sapply(Pottery, is.numeric)]
(Cl <- cor(PotteryN))
## Al Fe Mg Ca Na
## Al 1.0000000 -0.7888222 -0.7983975 -0.7635276 -0.4725948
## Fe -0.7888222 1.0000000 0.9006753 0.7652053 0.6616845
## Mg -0.7983975 0.9006753 1.0000000 0.8419589 0.6427235
## Ca -0.7635276 0.7652053 0.8419589 1.0000000 0.4815327
## Na -0.4725948 0.6616845 0.6427235 0.4815327 1.0000000
(CV <- cov(PotteryN))
## Al Fe Mg Ca Na
## Al 8.9559385 -5.6886185 -5.2080677 -0.231307692 -0.191332308
## Fe -5.6886185 5.8068985 4.7308837 0.186663692 0.215708308
## Mg -5.2080677 4.7308837 4.7512055 0.185781538 0.189526462
## Ca -0.2313077 0.1866637 0.1857815 0.010247538 0.006594462
## Na -0.1913323 0.2157083 0.1895265 0.006594462 0.018301538
Mg and Fe have the strongest correlation, a positive correlation of about 0.90.
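Rather than reading the matrix by eye, the strongest pair can also be located programmatically. The sketch below retypes the correlation values from the `cor(PotteryN)` output so that it runs on its own; in the actual session the existing `Cl` object could be used instead. The names `cor_mat`, `off_diag`, `strongest_pair`, and `strongest_value` are introduced here for illustration.

```r
# Correlation matrix copied from the cor(PotteryN) output above
elements <- c("Al", "Fe", "Mg", "Ca", "Na")
cor_mat <- matrix(c(
   1.0000000, -0.7888222, -0.7983975, -0.7635276, -0.4725948,
  -0.7888222,  1.0000000,  0.9006753,  0.7652053,  0.6616845,
  -0.7983975,  0.9006753,  1.0000000,  0.8419589,  0.6427235,
  -0.7635276,  0.7652053,  0.8419589,  1.0000000,  0.4815327,
  -0.4725948,  0.6616845,  0.6427235,  0.4815327,  1.0000000
), nrow = 5, byrow = TRUE, dimnames = list(elements, elements))

# Blank out the diagonal and the duplicated lower triangle
off_diag <- cor_mat
off_diag[lower.tri(off_diag, diag = TRUE)] <- NA

# Locate the largest absolute correlation among the remaining pairs
idx <- which(abs(off_diag) == max(abs(off_diag), na.rm = TRUE), arr.ind = TRUE)
strongest_pair  <- c(rownames(cor_mat)[idx[1, "row"]], colnames(cor_mat)[idx[1, "col"]])
strongest_value <- off_diag[idx[1, "row"], idx[1, "col"]]
```

Using the absolute value means a strong negative correlation would also be picked up; here the winner is the positive Fe and Mg pair.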
plot(PotteryN$Mg,PotteryN$Fe)
PC = prcomp(PotteryN)
autoplot(PC, data = Pottery, colour = 'Site')
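To see how much of the total variance each principal component carries, one option is to eigen-decompose the covariance matrix, which gives the same component variances as `prcomp` without scaling. The sketch below retypes the values from the `cov(PotteryN)` output so that it runs on its own; in the actual session `PC$sdev^2` or `summary(PC)` would give the same information. The names `cov_mat`, `eig`, and `prop_var` are introduced here for illustration.

```r
# Covariance matrix copied from the cov(PotteryN) output above
elements <- c("Al", "Fe", "Mg", "Ca", "Na")
cov_mat <- matrix(c(
   8.955938500, -5.688618500, -5.208067700, -0.231307692, -0.191332308,
  -5.688618500,  5.806898500,  4.730883700,  0.186663692,  0.215708308,
  -5.208067700,  4.730883700,  4.751205500,  0.185781538,  0.189526462,
  -0.231307692,  0.186663692,  0.185781538,  0.010247538,  0.006594462,
  -0.191332308,  0.215708308,  0.189526462,  0.006594462,  0.018301538
), nrow = 5, byrow = TRUE, dimnames = list(elements, elements))

# Eigenvalues of the covariance matrix are the variances of the components
eig <- eigen(cov_mat, symmetric = TRUE)

# Proportion of total variance explained by each component, largest first
prop_var <- eig$values / sum(eig$values)
```

Because Al, Fe, and Mg have variances of roughly 5 to 9 while Ca and Na are below 0.02, the unscaled components are dominated by the first three elements. Calling `prcomp(PotteryN, scale. = TRUE)` is one option if all five elements should contribute on a comparable footing.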