Part a) Iris Data

The iris dataset contains 150 individual iris flowers and 5 variables in total: 4 numerical measurements and one categorical variable.

Each flower is measured for sepal length, sepal width, petal length, and petal width. Each flower is also categorized by species (setosa, versicolor, or virginica).

The data are not arranged randomly, but grouped by species.
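
A quick check of this structure (iris ships with R's built-in datasets package, so it is available directly):

str(iris) # 150 obs. of 5 variables: 4 numeric measurements plus the Species factor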

data(iris)
#fix(iris) # pop up "data editor" appears. 
attach(iris) #attach the iris dataset, so that its columns can be accessed directly by name
plot(Petal.Length, Petal.Width)  #generates scatter plot

A positive correlation is observed between petal length and petal width. Interestingly, the points fall into two clusters, which may indicate different iris species.

plot(Petal.Length, Petal.Width, col=rainbow(3)[Species]) #generates colorful scatter plot
legend("topleft", levels(Species),fill=rainbow(3)) #add legend

Scatter Plot Matrix: To visually explore the correlation between multiple variables, we can generate a matrix of scatter plots.

plot(iris,col=rainbow(3)[Species]) #generate scatter plot matrix of the iris dataset

The “car” Library

A more convenient tool for generating a scatter plot matrix is the spm() function in the “car” library.

#install.packages("car")
library(car)
## Loading required package: carData
spm(~Sepal.Length + Sepal.Width + Petal.Length + Petal.Width | Species)

This produces a matrix of scatter plots for visualizing the relationships among the 4 measurements, colored by species. It quickly generates all possible x,y combinations and makes it easy to see which pair of variables displays the most variation or correlation.

x1 = rnorm(10000) # generate 10,000 standard normal random numbers as x-axis
x2 = rnorm(10000) # generate 10,000 standard normal random numbers as y-axis

#densCols() produces a vector of colors which encode the local density of points
colors <- densCols(x=x1, y=x2) 

plot(x1,x2,col=colors,pch=20) #generate scatter plot

3D scatter plots can visualize 3 variables.

Do not use 3D plots unless you absolutely have no other choice: they tend to be confusing.

#install.packages("Rcmdr") 
#library(Rcmdr)
scatter3d(Sepal.Length,Sepal.Width,Petal.Length) #interactive 3D scatter plot in an rgl window
## Loading required namespace: rgl
## Loading required namespace: mgcv
## Loading required namespace: MASS

This generates a spinning 3D plot in a pop-out window when the code is run; it cannot be displayed here.

boxplot(iris, main="Box Plot for all 5 Variables") #all 5 vars. even "species" which confuses me...

Oddly, Species (a categorical variable) is included here: boxplot() coerces the factor to its underlying integer codes (1–3), which is not meaningful.
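
One simple way to avoid this (a small tweak, not in the original) is to restrict the plot to the four numeric columns:

boxplot(iris[, 1:4], main = "Box Plot for the 4 Measurements") # drop the coded Species factor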

boxplot(Sepal.Length~Species, ylab = "Sepal Length", main = "Sepal Length by Species", col="red")

Quantitative EDA techniques for two variables

Pearson’s correlation coefficient is the most widely used measure of correlation: r = cov(X, Y) / (SDx * SDy).
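
As a sanity check, the definition can be computed directly with base R's cov() and sd() and compared with cor():

cov(Sepal.Length, Sepal.Width) / (sd(Sepal.Length) * sd(Sepal.Width)) # r from the definition
cor(Sepal.Length, Sepal.Width) # same value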

Pearson’s correlation coefficient r between sepal length and sepal width: is it significantly different from zero?

cor(Sepal.Length, Sepal.Width) #Pearsons Correlation Coefficient
## [1] -0.1175698

This value measures the degree of correlation and is independent of slope. However, it can only detect linear correlation; nonlinear relationships can go unnoticed.
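
A quick illustration of that limitation (a constructed example, not from the iris data): a perfect quadratic relationship yields a Pearson r of essentially zero.

x <- seq(-3, 3, by = 0.1)
cor(x, x^2) # ~0, although x^2 is completely determined by x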

At first glance this does not appear to be significantly different from zero. Sepal length and sepal width are probably not correlated, but let’s test the significance of the correlation coefficient and calculate a p-value.

Is this correlation coefficient statistically significant? We can use Pearson’s test and Spearman’s test.

Pearson’s Test

cor.test(Sepal.Length, Sepal.Width) #Pearson's product-moment correlation
## 
##  Pearson's product-moment correlation
## 
## data:  Sepal.Length and Sepal.Width
## t = -1.4403, df = 148, p-value = 0.1519
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.27269325  0.04351158
## sample estimates:
##        cor 
## -0.1175698

The p-value of 0.1519 > 0.05. Therefore, we fail to reject the null hypothesis that the correlation coefficient is zero. The correlation between sepal length and sepal width is not significant.


Spearman’s Test

cor(Sepal.Length, Sepal.Width, method="spearman")
## [1] -0.1667777
cor.test(Sepal.Length, Sepal.Width, method="spearman")
## Warning in cor.test.default(Sepal.Length, Sepal.Width, method = "spearman"):
## Cannot compute exact p-value with ties
## 
##  Spearman's rank correlation rho
## 
## data:  Sepal.Length and Sepal.Width
## S = 656283, p-value = 0.04137
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##        rho 
## -0.1667777

There is a significant (but weak) correlation between sepal length and sepal width. This is at odds with the Pearson’s test; Spearman’s test is rank-based and can pick up monotonic relationships that Pearson’s misses. Therefore, there probably is some very weak correlation.

Everything so far has measured correlation between two numerical variables. What if we want to measure the association between a numerical variable and a categorical variable, for example sepal length and species in the iris dataset?

Couldn’t we assign a numerical value to each species, c(1, 2, 3), and then run the analysis? No: that coding would impose an arbitrary order and spacing on the species, and the result would change if the labels were reassigned.

The association between a continuous variable and a categorical variable can instead be assessed with ANOVA: Analysis of Variance.

The Null hypothesis is that there is no difference in sepal length across the three species of Iris.

aov(formula= Sepal.Length ~ Species)
## Call:
##    aov(formula = Sepal.Length ~ Species)
## 
## Terms:
##                  Species Residuals
## Sum of Squares  63.21213  38.95620
## Deg. of Freedom        2       147
## 
## Residual standard error: 0.5147894
## Estimated effects may be unbalanced
summary(aov(Sepal.Length ~ Species))
##              Df Sum Sq Mean Sq F value Pr(>F)    
## Species       2  63.21  31.606   119.3 <2e-16 ***
## Residuals   147  38.96   0.265                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

p < 0.05, therefore we reject the null hypothesis that there is no difference in sepal length across the three species of iris. The difference is significant: sepal length and species are associated.
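
A natural follow-up (not part of the original assignment) is a post-hoc test to see which pairs of species differ; Tukey’s HSD is the standard companion to aov():

TukeyHSD(aov(Sepal.Length ~ Species, data = iris)) # pairwise species differences with adjusted p-values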

Therefore, if we were to build a model to predict species, we must include Sepal Length as a predictor variable.

I tried to use K-nearest neighbors, but ran into trouble when I realized the iris data are sorted by species, not random. This ruined the test/training split: the test set contains only virginica. The model should work once the data are randomized (see the sketch after the output below), but I need to move on.

irisml <- iris # work on a copy of iris
table(irisml$Species)
## 
##     setosa versicolor  virginica 
##         50         50         50
round(prop.table(table(irisml$Species))*100, digits = 1)
## 
##     setosa versicolor  virginica 
##       33.3       33.3       33.3
summary(irisml[c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")])
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500
normalize <- function(x) {
  return ((x - min(x)) / (max(x) - min(x)))}
irisml_n <- as.data.frame(lapply(irisml[1:4], normalize))
summary(irisml_n$Sepal.Length)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.2222  0.4167  0.4287  0.5833  1.0000
summary(irisml_n$Sepal.Width)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.3333  0.4167  0.4406  0.5417  1.0000
irisml_train <- irisml_n[1:119, ]
irisml_test <- irisml_n[120:150, ]
# create labels for training and test data
irisml_train_labels <- irisml[1:119, 5]
irisml_test_labels <- irisml[120:150, 5]
library(class)
irisml_test_pred <- knn(train = irisml_train, test = irisml_test, cl = irisml_train_labels, k = 3)
library(gmodels)
CrossTable(x= irisml_test_labels, y = irisml_test_pred, prop.chisq = FALSE)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  31 
## 
##  
##                    | irisml_test_pred 
## irisml_test_labels | versicolor |  virginica |  Row Total | 
## -------------------|------------|------------|------------|
##          virginica |          5 |         26 |         31 | 
##                    |      0.161 |      0.839 |            | 
## -------------------|------------|------------|------------|
##       Column Total |          5 |         26 |         31 | 
## -------------------|------------|------------|------------|
## 
## 

Okay, this is taking longer than I expected. It didn’t work and I need to move on: the problem is that the data are not randomized, so the test set is all virginica. We learned how to fix this somewhere, but I have a lot of work to finish. Leaving this here so you see I’m attempting to make connections to the ML class!
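
A minimal fix is to shuffle the rows with a fixed seed before splitting; this sketch reuses the normalize() function and the 119/31 split from above (the seed value is an arbitrary choice):

set.seed(42) # fixed seed so the shuffle is reproducible
shuffled <- sample(nrow(irisml)) # random permutation of the 150 row indices
irisml_n <- as.data.frame(lapply(irisml[shuffled, 1:4], normalize))
irisml_labels <- irisml[shuffled, 5]
irisml_train <- irisml_n[1:119, ]
irisml_test <- irisml_n[120:150, ]
irisml_train_labels <- irisml_labels[1:119]
irisml_test_labels <- irisml_labels[120:150]
irisml_test_pred <- knn(train = irisml_train, test = irisml_test, cl = irisml_train_labels, k = 3)
table(irisml_test_labels, irisml_test_pred) # confusion matrix: all three species now appear in the test set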

Part b) Heart Attack Dataset

heartatk <- read.csv("heartatk4R.csv")
heartatk[1:20,]
##     X Patient DIAGNOSIS SEX DRG DIED  CHARGES LOS AGE
## 1   1       1     41041   F 122    0  4752.00  10  79
## 2   2       2     41041   F 122    0  3941.00   6  34
## 3   3       3     41091   F 122    0  3657.00   5  76
## 4   4       4     41081   F 122    0  1481.00   2  80
## 5   5       5     41091   M 122    0  1681.00   1  55
## 6   6       6     41091   M 121    0  6378.64   9  84
## 7   7       7     41091   F 121    0 10958.52  15  84
## 8   8       8     41091   F 121    0 16583.93  15  70
## 9   9       9     41041   M 121    0  4015.33   2  76
## 10 10      10     41041   F 123    1  1989.44   1  65
## 11 11      11     41041   F 121    0  7471.63   6  52
## 12 12      12     41091   M 121    0  3930.63   5  72
## 13 13      13     41091   F 122    0       NA   9  83
## 14 14      14     41091   F 122    0  4433.93   4  61
## 15 15      15     41041   M 122    0  3318.21   2  53
## 16 16      16     41041   M 122    0  4863.83   5  77
## 17 17      17     41041   M 121    0  5000.64   3  53
## 18 18      18     41091   M 121    0  4989.71   5  56
## 19 19      19     41041   F 122    0  8645.90   4  66
## 20 20      20     41091   M 122    0  2885.55   4  39
attach(heartatk)
table(SEX)
## SEX
##    F    M 
## 5065 7779
# Pie Chart, Sex
pie(table(SEX),
    main= "MI Patient Sex",
    col = c("#fc0fc0", "#00b7eb"))

counts <- table(SEX, DIED)
counts
##    DIED
## SEX    0    1
##   F 4298  767
##   M 7136  643
barplot(counts, 
        main= "MI Prognosis and Sex",
        col= c("#fc0fc0", "#00b7eb"), 
        legend=rownames(counts), 
        ylab="Frequency",
        xlab="0=Did not Die in Hospital, 1=Died in Hospital") #barplots for two factors

The bar plot shows that men make up a larger share of the survivors than of the deaths, suggesting that the post-MI in-hospital mortality rate for men differs from that for women. But is this a statistically significant difference?
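
To quantify the visual impression before testing, row-wise proportions of the counts table give the mortality rate for each sex (a quick check, not part of the original analysis):

prop.table(counts, margin = 1) # per-sex death rates: about 15% for F vs. 8% for M, from the counts above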

To find out, we’ll use the Pearson’s Chi-Squared test with Yates’ continuity correction.

chisq.test(table(SEX, DIED))
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  table(SEX, DIED)
## X-squared = 147.76, df = 1, p-value < 2.2e-16

The p-value is < 0.05. Therefore we reject the null hypothesis that the mortality rate is the same for men and women. The mortality rate for men is different from (lower than) the mortality rate for women.

Pearson’s chi-squared test is an approximation to the Fisher exact test. The chi-squared test comes with some caveats (it is an asymptotic approximation), and modern computers can handle the Fisher exact test anyway, so let’s use that.

fisher.test(table(SEX,DIED)) #Fisher Exact Test
## 
##  Fisher's Exact Test for Count Data
## 
## data:  table(SEX, DIED)
## p-value < 2.2e-16
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  0.4509331 0.5653197
## sample estimates:
## odds ratio 
##  0.5049435

p-value < 0.05. Therefore, mortality rate for men is different than the mortality rate for women after being hospitalized for MI.

Explore the correlation between variables in the Heart Attack dataset and give interpretation

1) AGE vs. LOS

# continuous var vs. continuous var ---therefore, use Spearman's
cor.test(AGE, LOS, method="spearman")
## Warning in cor.test.default(AGE, LOS, method = "spearman"): Cannot compute exact
## p-value with ties
## 
##  Spearman's rank correlation rho
## 
## data:  AGE and LOS
## S = 2.9448e+11, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## 0.1661032

There is a significant but weak correlation between AGE and LOS (rho = 0.17): older patients tend to stay somewhat longer.

2) LOS vs. CHARGES

#Continuous var vs. continuous var ---Therefore, use spearman's
cor.test(LOS, CHARGES, method = "spearman")
## Warning in cor.test.default(LOS, CHARGES, method = "spearman"): Cannot compute
## exact p-value with ties
## 
##  Spearman's rank correlation rho
## 
## data:  LOS and CHARGES
## S = 7.8102e+10, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## 0.7384085

There is a significant, strong correlation between LOS and CHARGES (rho = 0.74): longer stays incur higher charges.

3) AGE vs. DIED

#continuous vs. categorical---Therefore, use ANOVA
aov(AGE ~ DIED)
## Call:
##    aov(formula = AGE ~ DIED)
## 
## Terms:
##                      DIED Residuals
## Sum of Squares   137358.1 2257118.2
## Deg. of Freedom         1     12842
## 
## Residual standard error: 13.25748
## Estimated effects may be unbalanced
summary(aov(AGE ~ DIED))
##                Df  Sum Sq Mean Sq F value Pr(>F)    
## DIED            1  137358  137358   781.5 <2e-16 ***
## Residuals   12842 2257118     176                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

There is a significant difference in AGE between patients who died and those who did not: AGE and DIED are significantly associated.
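
Since DIED has only two levels, this ANOVA is essentially a two-group comparison, so a t-test provides an equivalent cross-check (R's default is the Welch variant, which relaxes the equal-variance assumption):

t.test(AGE ~ DIED) # compare mean age of survivors (0) vs. patients who died (1)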

4) DIAGNOSIS vs. DIED

#Categorical vs. categorical
#Tricky: DIAGNOSIS is a diagnosis code, i.e., a categorical variable, but aov() treats it here as a number.
aov(DIAGNOSIS ~ DIED)
## Call:
##    aov(formula = DIAGNOSIS ~ DIED)
## 
## Terms:
##                     DIED Residuals
## Sum of Squares      4166  12934300
## Deg. of Freedom        1     12842
## 
## Residual standard error: 31.73622
## Estimated effects may be unbalanced
summary(aov(DIAGNOSIS ~ DIED))
##                Df   Sum Sq Mean Sq F value Pr(>F)  
## DIED            1     4166    4166   4.136  0.042 *
## Residuals   12842 12934300    1007                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The ANOVA reports a weak but significant association between DIAGNOSIS and DIED (p = 0.042). Since aov() treats the diagnosis code as a number, though, this result should be interpreted cautiously.
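
Because both variables are genuinely categorical, a chi-squared test of independence (as used above for SEX vs. DIED) is arguably the more appropriate check; a minimal sketch:

chisq.test(table(DIAGNOSIS, DIED)) # treats each diagnosis code as its own category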