library(rmarkdown); library(knitr); library(readxl)
set.seed(37)
stonks <- matrix(c(.00023,.00008,.00012,.00083,.00006,.00008,.00134,.00041,.00039,.00008,.00012,.00041,.00093,.00019,.00002,.00083,.00039,.00019,.00068,.00001,.00006,.00008,.00002,.00001,.00053), nrow = 5, ncol = 5)
rownames(stonks) <- c("SNDL", "WRN", "NGD", "UPH", "WISH")
colnames(stonks) <- c("SNDL", "WRN", "NGD", "UPH", "WISH")
stonks*(5*365-1)
## SNDL WRN NGD UPH WISH
## SNDL 0.41952 0.14592 0.21888 1.51392 0.10944
## WRN 0.14592 2.44416 0.74784 0.71136 0.14592
## NGD 0.21888 0.74784 1.69632 0.34656 0.03648
## UPH 1.51392 0.71136 0.34656 1.24032 0.01824
## WISH 0.10944 0.14592 0.03648 0.01824 0.96672
solve(sqrt(diag(diag(stonks)))) %*% stonks %*%
t(solve(sqrt(diag(diag(stonks)))))
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1.0000000 0.14410321 0.25946325 2.09874521 0.17184995
## [2,] 0.1441032 1.00000000 0.36727383 0.40856179 0.09492916
## [3,] 0.2594633 0.36727383 1.00000000 0.23892284 0.02848725
## [4,] 2.0987452 0.40856179 0.23892284 1.00000000 0.01665742
## [5,] 0.1718499 0.09492916 0.02848725 0.01665742 1.00000000
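Well, for one, I'd want to look at the scatter plot matrix to understand what a correlation of 2 looks like! A correlation can never exceed 1 in magnitude, so the 2.10 between SNDL and UPH suggests the covariance values given are not mutually consistent. Beyond that, I'd look at the scatter plots to spot patterns, such as whether the relationships are linear or nonlinear. Patterns like that are difficult to detect from the numbers alone.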
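As a rough check on that impossible correlation, the sketch below re-enters the same covariance values, uses base R's `cov2cor()` instead of the manual matrix product, and looks at the eigenvalues. The names `tickers`, `stonks_cor`, `eig`, `is_valid`, and `bad` are just illustrative, and the tolerance of `1e-12` is an arbitrary choice.

```r
# Same covariance values as above (the matrix is symmetric)
tickers <- c("SNDL", "WRN", "NGD", "UPH", "WISH")
stonks <- matrix(c(.00023, .00008, .00012, .00083, .00006,
                   .00008, .00134, .00041, .00039, .00008,
                   .00012, .00041, .00093, .00019, .00002,
                   .00083, .00039, .00019, .00068, .00001,
                   .00006, .00008, .00002, .00001, .00053),
                 nrow = 5, dimnames = list(tickers, tickers))

# cov2cor() rescales by the standard deviations and keeps the ticker names
stonks_cor <- cov2cor(stonks)

# A valid covariance matrix has no negative eigenvalues
eig <- eigen(stonks, symmetric = TRUE, only.values = TRUE)$values
is_valid <- all(eig >= -1e-12)   # tolerance is an assumption

# Pairs whose implied correlation falls outside [-1, 1]
bad <- which(abs(stonks_cor) > 1 & upper.tri(stonks_cor), arr.ind = TRUE)
data.frame(row = tickers[bad[, "row"]], col = tickers[bad[, "col"]])
```

If `is_valid` comes back `FALSE`, the numbers cannot all be covariances from one real data set, which would explain a "correlation" above 1.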
library(readxl)
DistressData <- read_excel("C:/Users/Sarah Chock/OneDrive - University of St. Thomas/Senior Year/STAT 360 Comp Stat and Data Analysis/Exploratory Data Analysis/DistressData.xlsx")
dd <- as.matrix(DistressData)
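# Swap the codes 3 and 5, using 6 as a temporary placeholder
# (this assumes 6 does not already appear in the data)
dd[which(dd==3)] = 6
dd[which(dd==5)] = 3
dd[which(dd==6)] = 5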
The variance-covariance matrix is very big (10 x 10):
SIGMA <- cov(dd)
SIGMA
## Hopelessness Overwhelmed Exhausted VeryLonely VerySad
## Hopelessness 2.0863030 0.6421632 0.72357995 1.2674437 1.3058702
## Overwhelmed 0.6421632 1.1939488 0.86243233 0.6372557 0.6550655
## Exhausted 0.7235799 0.8624323 1.35914207 0.7522959 0.7690344
## VeryLonely 1.2674437 0.6372557 0.75229589 1.9305097 1.3127949
## VerySad 1.3058702 0.6550655 0.76903441 1.3127949 1.8207522
## Depressed 1.5008373 0.5487872 0.67343467 1.2763880 1.3578893
## Anxiety 1.1733621 0.7048860 0.76737523 1.0877960 1.1820557
## SelfHarm 0.5122706 0.1519727 0.20162699 0.4044413 0.4103443
## SuicidalThoughts 0.6653555 0.1743306 0.24008929 0.5246328 0.5277585
## SuicidalAttempts 0.1893862 0.0286334 0.06102314 0.1561426 0.1601406
## Depressed Anxiety SelfHarm SuicidalThoughts
## Hopelessness 1.5008373 1.1733621 0.5122706 0.6653555
## Overwhelmed 0.5487872 0.7048860 0.1519727 0.1743306
## Exhausted 0.6734347 0.7673752 0.2016270 0.2400893
## VeryLonely 1.2763880 1.0877960 0.4044413 0.5246328
## VerySad 1.3578893 1.1820557 0.4103443 0.5277585
## Depressed 2.1677766 1.2979029 0.6265705 0.8124703
## Anxiety 1.2979029 2.0480802 0.4104353 0.5002924
## SelfHarm 0.6265705 0.4104353 1.0523067 0.6501951
## SuicidalThoughts 0.8124703 0.5002924 0.6501951 1.1410428
## SuicidalAttempts 0.2521766 0.1581258 0.3035305 0.3573149
## SuicidalAttempts
## Hopelessness 0.18938622
## Overwhelmed 0.02863340
## Exhausted 0.06102314
## VeryLonely 0.15614261
## VerySad 0.16014058
## Depressed 0.25217657
## Anxiety 0.15812576
## SelfHarm 0.30353049
## SuicidalThoughts 0.35731491
## SuicidalAttempts 0.35118705
CORRELATION <- solve(sqrt(diag(diag(SIGMA)))) %*% SIGMA %*%
t(solve(sqrt(diag(diag(SIGMA)))))
CORRELATION
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] 1.0000000 0.40687769 0.42970006 0.6315445 0.6700171 0.7057285 0.5676359
## [2,] 0.4068777 1.00000000 0.67701726 0.4197443 0.4442899 0.3411174 0.4507674
## [3,] 0.4297001 0.67701726 1.00000000 0.4644300 0.4888639 0.3923339 0.4599407
## [4,] 0.6315445 0.41974434 0.46443000 1.0000000 0.7002216 0.6239352 0.5470638
## [5,] 0.6700171 0.44428988 0.48886385 0.7002216 1.0000000 0.6834893 0.6121235
## [6,] 0.7057285 0.34111739 0.39233386 0.6239352 0.6834893 1.0000000 0.6159727
## [7,] 0.5676359 0.45076744 0.45994073 0.5470638 0.6121235 0.6159727 1.0000000
## [8,] 0.3457320 0.13558185 0.16859515 0.2837581 0.2964503 0.4148504 0.2795761
## [9,] 0.4312351 0.14935847 0.19279217 0.3534830 0.3661498 0.5165939 0.3272649
## [10,] 0.2212537 0.04421919 0.08832688 0.1896340 0.2002658 0.2890202 0.1864489
## [,8] [,9] [,10]
## [1,] 0.3457320 0.4312351 0.22125365
## [2,] 0.1355818 0.1493585 0.04421919
## [3,] 0.1685951 0.1927922 0.08832688
## [4,] 0.2837581 0.3534830 0.18963399
## [5,] 0.2964503 0.3661498 0.20026578
## [6,] 0.4148504 0.5165939 0.28902022
## [7,] 0.2795761 0.3272649 0.18644894
## [8,] 1.0000000 0.5933645 0.49930038
## [9,] 0.5933645 1.0000000 0.56445709
## [10,] 0.4993004 0.5644571 1.00000000
From the correlation matrix, there are several strong pairs to look at: (Depressed, Hopelessness) at 0.71, (VerySad, VeryLonely) at 0.70, and (Depressed, VerySad) at 0.68. There are also some weakly correlated pairs: (SuicidalAttempts, Overwhelmed) at 0.04, (SuicidalAttempts, Exhausted) at 0.09, and (SelfHarm, Overwhelmed) at 0.14.
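Picking pairs out of a 10 x 10 printout by eye is error-prone, so here is a minimal sketch of ranking the pairs in code. To keep it self-contained it only uses three of the variables, with covariances copied from the output above; `vars`, `S`, `R`, `idx`, and `pairs_ranked` are illustrative names, not part of the original analysis.

```r
# Three variables taken from the printed covariance matrix
vars <- c("Hopelessness", "Overwhelmed", "Depressed")
S <- matrix(c(2.0863030, 0.6421632, 1.5008373,
              0.6421632, 1.1939488, 0.5487872,
              1.5008373, 0.5487872, 2.1677766),
            nrow = 3, dimnames = list(vars, vars))

# cov2cor() keeps the variable names, unlike the manual matrix product
R <- cov2cor(S)

# List each pair once (upper triangle), strongest correlation first
idx <- which(upper.tri(R), arr.ind = TRUE)
pairs_ranked <- data.frame(var1 = rownames(R)[idx[, "row"]],
                           var2 = colnames(R)[idx[, "col"]],
                           r = R[upper.tri(R)],
                           stringsAsFactors = FALSE)
pairs_ranked <- pairs_ranked[order(-abs(pairs_ranked$r)), ]
pairs_ranked
```

The same steps should work on the full matrix by replacing `S` with `SIGMA`, since `cov()` carries the column names through.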
library(corrplot)
## corrplot 0.92 loaded
corrplot(CORRELATION, method = "number")
corrplot(CORRELATION, method = "pie")
pairs(dd, pch = 16, lower.panel = NULL)
There are a lot of dimensions in this data, so it can be really difficult to detect patterns in the covariability. With 10 variables, the panels in this plot are so small that you can barely see anything! This will only become more challenging as the number of dimensions increases.
ames <- read.csv("C:/Users/Sarah Chock/OneDrive - University of St. Thomas/Senior Year/STAT 360 Comp Stat and Data Analysis/Exploratory Data Analysis/Ames.csv")
#cor(ames)
Running cor(ames) gives me the error that x must be numeric, because some of the columns in ames are not numeric.
sapply(ames, is.numeric)
## Order PID area price MS.SubClass
## TRUE TRUE TRUE TRUE TRUE
## MS.Zoning Lot.Frontage Lot.Area Street Alley
## FALSE TRUE TRUE FALSE FALSE
## Lot.Shape Land.Contour Utilities Lot.Config Land.Slope
## FALSE FALSE FALSE FALSE FALSE
## Neighborhood Condition.1 Condition.2 Bldg.Type House.Style
## FALSE FALSE FALSE FALSE FALSE
## Overall.Qual Overall.Cond Year.Built Year.Remod.Add Roof.Style
## TRUE TRUE TRUE TRUE FALSE
## Roof.Matl Exterior.1st Exterior.2nd Mas.Vnr.Type Mas.Vnr.Area
## FALSE FALSE FALSE FALSE TRUE
## Exter.Qual Exter.Cond Foundation Bsmt.Qual Bsmt.Cond
## FALSE FALSE FALSE FALSE FALSE
## Bsmt.Exposure BsmtFin.Type.1 BsmtFin.SF.1 BsmtFin.Type.2 BsmtFin.SF.2
## FALSE FALSE TRUE FALSE TRUE
## Bsmt.Unf.SF Total.Bsmt.SF Heating Heating.QC Central.Air
## TRUE TRUE FALSE FALSE FALSE
## Electrical X1st.Flr.SF X2nd.Flr.SF Low.Qual.Fin.SF Bsmt.Full.Bath
## FALSE TRUE TRUE TRUE TRUE
## Bsmt.Half.Bath Full.Bath Half.Bath Bedroom.AbvGr Kitchen.AbvGr
## TRUE TRUE TRUE TRUE TRUE
## Kitchen.Qual TotRms.AbvGrd Functional Fireplaces Fireplace.Qu
## FALSE TRUE FALSE TRUE FALSE
## Garage.Type Garage.Yr.Blt Garage.Finish Garage.Cars Garage.Area
## FALSE TRUE FALSE TRUE TRUE
## Garage.Qual Garage.Cond Paved.Drive Wood.Deck.SF Open.Porch.SF
## FALSE FALSE FALSE TRUE TRUE
## Enclosed.Porch X3Ssn.Porch Screen.Porch Pool.Area Pool.QC
## TRUE TRUE TRUE TRUE FALSE
## Fence Misc.Feature Misc.Val Mo.Sold Yr.Sold
## FALSE FALSE TRUE TRUE TRUE
## Sale.Type Sale.Condition
## FALSE FALSE
numAmes <- ames[which(sapply(ames, is.numeric))]
CORR <- cor(numAmes)
corrplot(CORR, method = "color")
There are quite a few dimensions that only print question marks and don't contribute to the heat map (except on the diagonal, where each is correlated with itself). These include Lot.Frontage, Bsmt.Half.Bath, and Garage.Area. These are not really errors: the columns contain NA values, so their correlations come out as NA, and corrplot draws NA as a question mark.
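To see which columns are responsible without reading the heat map, one option is to count the NA values per column. The sketch below uses a small made-up data frame (`toy`) in place of `numAmes`, so the column names and values are assumptions for illustration only.

```r
# Hypothetical toy data standing in for numAmes
toy <- data.frame(a = c(1, 2, 3, 4, 5),
                  b = c(2, 1, 4, 3, 6),
                  c = c(5, NA, 2, 1, 3))

# Number of missing values in each column; keep only columns that have any
na_counts <- colSums(is.na(toy))
na_counts[na_counts > 0]

# With the default use = "everything", pairs involving column c come out as NA
default_cor <- cor(toy)
default_cor
```

Running `colSums(is.na(numAmes))` on the real data would give the same kind of count for the Ames columns.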
completeCORR <- cor(numAmes, use = "complete.obs")
corrplot(completeCORR, method = "color")
pairCORR <- cor(numAmes, use = "pairwise.complete.obs")
corrplot(pairCORR, method = "color")
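Truth be told, I cannot glean any differences with mine own eyes. But when I clicked back and forth really quickly between the two heat maps, the complete observations method seemed to have stronger magnitudes across the board than the pairwise method. The two can differ because complete.obs drops every house that has a missing value in any variable, while pairwise.complete.obs drops a house only from the pairs that involve its missing values. I think the gap could be due to the houses with missing values differing systematically from the rest, but that is just my hypothesis.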
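Rather than flipping between plots, the two matrices can be subtracted directly. This is only a sketch on simulated data (`toy`, with missing values placed by hand), so it shows the mechanics and says nothing about which method gives larger values on the Ames data.

```r
set.seed(1)

# Hypothetical toy data with missing values in different rows per column
toy <- data.frame(x = rnorm(50), y = rnorm(50), z = rnorm(50))
toy$y[1:10] <- NA
toy$z[41:50] <- NA

# complete.obs keeps only rows with no NA at all (rows 11 to 40 here)
complete_cor <- cor(toy, use = "complete.obs")
# pairwise.complete.obs keeps, for each pair, rows where both are present
pairwise_cor <- cor(toy, use = "pairwise.complete.obs")

# Element-wise difference and its largest magnitude
diff_mat <- pairwise_cor - complete_cor
max(abs(diff_mat))

# Number of rows each pairwise correlation is based on
n_used <- crossprod(!is.na(as.matrix(toy)))
n_used
```

On the real data, `max(abs(pairCORR - completeCORR))` would put a number on how different the two heat maps actually are.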