The iris dataset has 5 variables: 4 numerical measurements and one categorical variable, recorded for 150 individual iris flowers.
Each flower is measured for sepal length, sepal width, petal length, and petal width, and each is categorized by species (setosa, versicolor, or virginica).
The data are not arranged randomly, but grouped by species.
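This structure can be confirmed directly once the dataset is loaded (a quick sketch; output omitted):
data(iris) # built-in dataset; also loaded below
str(iris) # 150 obs. of 5 variables: 4 numeric + 1 factor (Species)
iris$Species[c(1, 51, 101)] # rows come in blocks of 50: setosa, versicolor, virginica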
data(iris)
#fix(iris) # opens a pop-up "data editor" window
attach(iris) # attach the iris dataset so its columns can be accessed directly by name
plot(Petal.Length, Petal.Width) #generates scatter plot
A positive correlation is observed between petal length and petal width. Notably, the points fall into two distinct clusters, which may indicate different iris species.
plot(Petal.Length, Petal.Width, col=rainbow(3)[Species]) #generates colorful scatter plot
legend("topleft", levels(Species),fill=rainbow(3)) #add legend
Scatter Plot Matrix: To visually explore the correlations between multiple variables, we can generate a matrix of scatter plots.
plot(iris, col=rainbow(3)[Species]) #generate scatter plot matrix of the iris dataset
Part b) The "car" Package
A more convenient tool for generating a scatter plot matrix is the spm() function in the "car" package.
#install.packages("car")
library(car)
## Loading required package: carData
spm(~Sepal.Length + Sepal.Width + Petal.Length + Petal.Width | Species)
This produces a matrix of scatter plots, colored by species, for all pairwise combinations of the 4 measurements. It quickly generates every possible x,y combination and makes it easy to see which pair of variables displays the most variation or correlation.
x1 = rnorm(10000) # generate 10,000 standard normal random numbers as x-axis
x2 = rnorm(10000) # generate 10,000 standard normal random numbers as y-axis
# densCols() produces a vector of colors which encode the local density of points
colors <- densCols(x=x1, y=x2)
plot(x1,x2,col=colors,pch=20) #generate scatter plot
3D scatter plots can visualize 3 variables at once.
However, do not use 3D plots unless you absolutely have no other choice: they are too confusing.
#install.packages("Rcmdr")
#library(Rcmdr)
scatter3d(Sepal.Length,Sepal.Width,Petal.Length)
## Loading required namespace: rgl
## Loading required namespace: mgcv
## Loading required namespace: MASS
This generates a spinning 3D plot in a pop-out window when the code is run; it could not be displayed inline here.
boxplot(iris, main="Box Plot for all 5 Variables") # all 5 variables, even Species
Oddly, Species (a categorical variable) is included here: boxplot() apparently coerces the factor to its underlying integer codes (1-3).
boxplot(Sepal.Length~Species, ylab = "Sepal Length", main = "Sepal Length by Species", col="red")
Quantitative EDA technique for two variables
Pearson's correlation coefficient is the most widely used measure of correlation: r = cov(X, Y) / (SD(X) * SD(Y)).
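As a sanity check, the formula can be computed by hand and compared against R's built-in cor():
# Pearson's r from its definition: covariance divided by the product of the SDs
r_manual <- cov(Sepal.Length, Sepal.Width) / (sd(Sepal.Length) * sd(Sepal.Width))
all.equal(r_manual, cor(Sepal.Length, Sepal.Width)) # TRUE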
Pearson’s correlation coefficient R between SL and SW: Is it significantly different from zero?
cor(Sepal.Length, Sepal.Width) #Pearsons Correlation Coefficient
## [1] -0.1175698
This value measures the degree of linear correlation, and is independent of slope. But it can only detect linear correlation; it cannot detect nonlinear correlation.
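A quick illustration of that limitation on simulated data, where y is perfectly determined by x but not linearly:
x <- seq(-3, 3, by = 0.1)
y <- x^2 # a perfect, but nonlinear, relationship
cor(x, y) # essentially 0: Pearson's r detects no linear correlation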
Returning to the iris data: -0.12 does not appear to be significantly different from zero at first glance, so Sepal Length and Sepal Width are probably not correlated. But let's test the significance of the correlation coefficient and calculate a p-value.
Is this correlation coefficient statistically significant? To find out, we can use Pearson's test and Spearman's test.
Pearson’s Test
cor.test(Sepal.Length, Sepal.Width) #Pearson's product-moment correlation
##
## Pearson's product-moment correlation
##
## data: Sepal.Length and Sepal.Width
## t = -1.4403, df = 148, p-value = 0.1519
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.27269325 0.04351158
## sample estimates:
## cor
## -0.1175698
The p-value of 0.1519 > 0.05, therefore we fail to reject the null hypothesis that the correlation coefficient is zero. By Pearson's test, the correlation between sepal length and sepal width is not significant.
Spearman’s Test
cor(Sepal.Length, Sepal.Width, method="spearman")
## [1] -0.1667777
cor.test(Sepal.Length, Sepal.Width, method="spearman")
## Warning in cor.test.default(Sepal.Length, Sepal.Width, method = "spearman"):
## Cannot compute exact p-value with ties
##
## Spearman's rank correlation rho
##
## data: Sepal.Length and Sepal.Width
## S = 656283, p-value = 0.04137
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## -0.1667777
Spearman's test finds a significant (but weak) negative correlation between sepal length and sepal width, which is at odds with Pearson's test. Since Spearman's rank correlation is the more robust of the two, there probably is some very weak correlation.
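One possible explanation is that the overall correlation mixes three different species. A sketch for checking the correlation within each species separately (output not shown):
# Within-species Pearson correlations of sepal length vs. sepal width
by(iris, iris$Species, function(d) cor(d$Sepal.Length, d$Sepal.Width))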
Everything so far measured the correlation between two numerical variables. But what if we wanted to measure the association between a numerical variable and a categorical variable, for example Sepal Length and Species in the iris dataset?
Couldn't we assign a numerical value to each species, c(1, 2, 3), and then run the same analysis? No: that would impose an arbitrary order and spacing on the species, making the resulting coefficient meaningless.
Instead, the association between a continuous variable and a categorical variable can be assessed with ANOVA: Analysis of Variance.
The Null hypothesis is that there is no difference in sepal length across the three species of Iris.
aov(formula= Sepal.Length ~ Species)
## Call:
## aov(formula = Sepal.Length ~ Species)
##
## Terms:
## Species Residuals
## Sum of Squares 63.21213 38.95620
## Deg. of Freedom 2 147
##
## Residual standard error: 0.5147894
## Estimated effects may be unbalanced
summary(aov(Sepal.Length ~ Species))
## Df Sum Sq Mean Sq F value Pr(>F)
## Species 2 63.21 31.606 119.3 <2e-16 ***
## Residuals 147 38.96 0.265
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
p < 0.05, therefore we reject the null hypothesis that there is no difference in sepal length across the three species of iris. The difference is significant: Sepal Length and Species are associated.
Therefore, if we were to build a model to predict species, we must include Sepal Length as a predictor variable.
I tried to use k-nearest neighbors, but ran into trouble when I realized the iris data are sorted by species, not randomized. This corrupted the training/test split. The model should work once the data are randomized (a fix is sketched after the output below), but I need to move on.
irisml <- iris
table(iris$Species)
##
## setosa versicolor virginica
## 50 50 50
round(prop.table(table(irisml$Species))*100, digits = 1)
##
## setosa versicolor virginica
## 33.3 33.3 33.3
summary(irisml[c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")])
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
normalize <- function(x) {
  return((x - min(x)) / (max(x) - min(x))) # min-max scaling onto [0, 1]
}
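A quick sanity check of normalize() on a toy vector:
normalize(c(1, 2, 3, 4, 5)) # 0.00 0.25 0.50 0.75 1.00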
irisml_n <- as.data.frame(lapply(irisml[1:4], normalize))
summary(irisml_n$Sepal.Length)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.2222 0.4167 0.4287 0.5833 1.0000
summary(irisml_n$Sepal.Width)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.3333 0.4167 0.4406 0.5417 1.0000
irisml_train <- irisml_n[1:119, ] # NOTE: rows are still sorted by species, so this split is flawed
irisml_test <- irisml_n[120:150, ]
# create labels for training and test data
irisml_train_labels <- irisml[1:119, 5]
irisml_test_labels <- irisml[120:150, 5]
library(class)
irisml_test_pred <- knn(train = irisml_train, test = irisml_test, cl = irisml_train_labels, k = 3)
library(gmodels)
CrossTable(x= irisml_test_labels, y = irisml_test_pred, prop.chisq = FALSE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 31
##
##
## | irisml_test_pred
## irisml_test_labels | versicolor | virginica | Row Total |
## -------------------|------------|------------|------------|
## virginica | 5 | 26 | 31 |
## | 0.161 | 0.839 | |
## -------------------|------------|------------|------------|
## Column Total | 5 | 26 | 31 |
## -------------------|------------|------------|------------|
##
##
Okay, this is taking longer than I expected. The classifier misbehaved because the data were never randomized before splitting: rows 120-150 of the sorted dataset are all virginica, so the test set contains only one species. We learned how to randomize somewhere, but I have a lot of work to finish. I'm leaving this here to show the attempted connection to the ML class!
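A minimal sketch of the fix: shuffle the row order before splitting, so all three species appear in both the training and test sets (not run here; results will vary with the seed):
set.seed(123) # arbitrary seed, for reproducibility
idx <- sample(nrow(irisml_n)) # random permutation of the 150 row indices
irisml_train <- irisml_n[idx[1:119], ]
irisml_test <- irisml_n[idx[120:150], ]
irisml_train_labels <- irisml[idx[1:119], 5]
irisml_test_labels <- irisml[idx[120:150], 5]
irisml_test_pred <- knn(train = irisml_train, test = irisml_test, cl = irisml_train_labels, k = 3)
CrossTable(x = irisml_test_labels, y = irisml_test_pred, prop.chisq = FALSE)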
heartatk <- read.csv("heartatk4R.csv")
heartatk[1:20,]
## X Patient DIAGNOSIS SEX DRG DIED CHARGES LOS AGE
## 1 1 1 41041 F 122 0 4752.00 10 79
## 2 2 2 41041 F 122 0 3941.00 6 34
## 3 3 3 41091 F 122 0 3657.00 5 76
## 4 4 4 41081 F 122 0 1481.00 2 80
## 5 5 5 41091 M 122 0 1681.00 1 55
## 6 6 6 41091 M 121 0 6378.64 9 84
## 7 7 7 41091 F 121 0 10958.52 15 84
## 8 8 8 41091 F 121 0 16583.93 15 70
## 9 9 9 41041 M 121 0 4015.33 2 76
## 10 10 10 41041 F 123 1 1989.44 1 65
## 11 11 11 41041 F 121 0 7471.63 6 52
## 12 12 12 41091 M 121 0 3930.63 5 72
## 13 13 13 41091 F 122 0 NA 9 83
## 14 14 14 41091 F 122 0 4433.93 4 61
## 15 15 15 41041 M 122 0 3318.21 2 53
## 16 16 16 41041 M 122 0 4863.83 5 77
## 17 17 17 41041 M 121 0 5000.64 3 53
## 18 18 18 41091 M 121 0 4989.71 5 56
## 19 19 19 41041 F 122 0 8645.90 4 66
## 20 20 20 41091 M 122 0 2885.55 4 39
attach(heartatk)
table(SEX)
## SEX
## F M
## 5065 7779
# Pie Chart, Sex
pie(table(SEX),
main= "MI Patient Sex",
col = c("#fc0fc0", "#00b7eb"))
counts <- table(SEX, DIED)
counts
## DIED
## SEX 0 1
## F 4298 767
## M 7136 643
barplot(counts,
main= "MI Prognosis and Sex",
col= c("#fc0fc0", "#00b7eb"),
legend=rownames(counts),
ylab="Frequency",
xlab="0=Did not Die in Hospital, 1=Died in Hospital") #barplots for two factors
The bar plot suggests that proportionally more men than women survive hospitalization after a heart attack, i.e., that the post-MI mortality rate differs between the sexes.
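The observed mortality rates by sex, as row proportions of the counts above:
prop.table(counts, margin = 1) # roughly 15% of women died in hospital vs. 8% of men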
But is this a statistically significant difference? To find out, we'll use Pearson's chi-squared test with Yates' continuity correction.
chisq.test(table(SEX, DIED))
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: table(SEX, DIED)
## X-squared = 147.76, df = 1, p-value < 2.2e-16
The p-value is < 0.05. Therefore we reject the null hypothesis that the mortality rate is the same for men and women: the in-hospital mortality rate for men is different (lower) than the rate for women.
Pearson's chi-squared test is an approximation to the Fisher exact test. The chi-squared test comes with caveats (e.g., it requires sufficiently large expected cell counts), and modern computers can handle the Fisher exact test on a table this size anyway, so let's use that.
fisher.test(table(SEX,DIED)) #Fisher Exact Test
##
## Fisher's Exact Test for Count Data
##
## data: table(SEX, DIED)
## p-value < 2.2e-16
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 0.4509331 0.5653197
## sample estimates:
## odds ratio
## 0.5049435
p-value < 0.05. Therefore, the mortality rate for men is different than the mortality rate for women after being hospitalized for MI. The estimated odds ratio of about 0.50 says that women's odds of surviving (DIED = 0) are roughly half those of men.
Explore the correlation between variables in the Heart Attack dataset and give interpretation
1) AGE vs. LOS
# continuous var vs. continuous var ---therefore, use Spearman's
cor.test(AGE, LOS, method="spearman")
## Warning in cor.test.default(AGE, LOS, method = "spearman"): Cannot compute exact
## p-value with ties
##
## Spearman's rank correlation rho
##
## data: AGE and LOS
## S = 2.9448e+11, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.1661032
There is a significant, but weak, positive correlation between AGE and LOS (rho ≈ 0.17): older patients tend to stay somewhat longer.
2) LOS vs. CHARGES
#Continuous var vs. continuous var ---Therefore, use spearman's
cor.test(LOS, CHARGES, method = "spearman")
## Warning in cor.test.default(LOS, CHARGES, method = "spearman"): Cannot compute
## exact p-value with ties
##
## Spearman's rank correlation rho
##
## data: LOS and CHARGES
## S = 7.8102e+10, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.7384085
There is a significant, strong positive correlation between LOS and CHARGES (rho ≈ 0.74): longer stays incur higher charges.
3) AGE vs. DIED
#continuous vs. categorical---Therefore, use ANOVA
aov(AGE ~ DIED)
## Call:
## aov(formula = AGE ~ DIED)
##
## Terms:
## DIED Residuals
## Sum of Squares 137358.1 2257118.2
## Deg. of Freedom 1 12842
##
## Residual standard error: 13.25748
## Estimated effects may be unbalanced
summary(aov(AGE ~ DIED))
## Df Sum Sq Mean Sq F value Pr(>F)
## DIED 1 137358 137358 781.5 <2e-16 ***
## Residuals 12842 2257118 176
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
There is a significant difference in AGE between patients who died in hospital and those who survived: AGE and DIED are associated.
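A boxplot, mirroring the earlier sepal-length-by-species plot, would make the age difference visible (output omitted):
boxplot(AGE ~ DIED, ylab = "Age", xlab = "0=Did not Die in Hospital, 1=Died in Hospital", main = "Age by In-Hospital Mortality")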
4) DIAGNOSIS vs. DIED
#Categorical vs Categorical
#Tricky: the diagnosis is a diagnosis code. It's a categorical variable.
aov(DIAGNOSIS ~ DIED)
## Call:
## aov(formula = DIAGNOSIS ~ DIED)
##
## Terms:
## DIED Residuals
## Sum of Squares 4166 12934300
## Deg. of Freedom 1 12842
##
## Residual standard error: 31.73622
## Estimated effects may be unbalanced
summary(aov(DIAGNOSIS ~ DIED))
## Df Sum Sq Mean Sq F value Pr(>F)
## DIED 1 4166 4166 4.136 0.042 *
## Residuals 12842 12934300 1007
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
p = 0.042 < 0.05, so there is a weak but significant association between DIAGNOSIS and DIED. One caveat: ANOVA treats the diagnosis codes as numbers, which is questionable for a categorical code.
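Since both DIAGNOSIS and DIED are categorical, a chi-squared test of independence on their contingency table is arguably the cleaner analysis (a sketch; output not shown):
chisq.test(table(DIAGNOSIS, DIED)) # treats the diagnosis codes as categories, not numbers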