STAT 360: Computational Statistics and Data Analysis

Load R Libraries, Import and Attach Relevant Data, and Specify Seed

library(rmarkdown); library(knitr); library(readxl)
set.seed(37)

EXERCISE 01

Part (a)

stonks <- matrix(c(.00023,.00008,.00012,.00083,.00006,.00008,.00134,.00041,.00039,.00008,.00012,.00041,.00093,.00019,.00002,.00083,.00039,.00019,.00068,.00001,.00006,.00008,.00002,.00001,.00053), nrow = 5, ncol = 5)
rownames(stonks) <- c("SNDL", "WRN", "NGD", "UPH", "WISH")
colnames(stonks) <- c("SNDL", "WRN", "NGD", "UPH", "WISH")
stonks*(5*365-1)

##         SNDL     WRN     NGD     UPH    WISH
## SNDL 0.41952 0.14592 0.21888 1.51392 0.10944
## WRN  0.14592 2.44416 0.74784 0.71136 0.14592
## NGD  0.21888 0.74784 1.69632 0.34656 0.03648
## UPH  1.51392 0.71136 0.34656 1.24032 0.01824
## WISH 0.10944 0.14592 0.03648 0.01824 0.96672

Part (b)

solve(sqrt(diag(diag(stonks)))) %*% stonks %*%
  t(solve(sqrt(diag(diag(stonks)))))

##           [,1]       [,2]       [,3]       [,4]       [,5]
## [1,] 1.0000000 0.14410321 0.25946325 2.09874521 0.17184995
## [2,] 0.1441032 1.00000000 0.36727383 0.40856179 0.09492916
## [3,] 0.2594633 0.36727383 1.00000000 0.23892284 0.02848725
## [4,] 2.0987452 0.40856179 0.23892284 1.00000000 0.01665742
## [5,] 0.1718499 0.09492916 0.02848725 0.01665742 1.00000000
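
As a cross-check on the matrix product above, base R's cov2cor() performs the same rescaling and keeps the row and column names. The sketch below rebuilds stonks from Part (a) so that it runs on its own; the 1e-8 tolerance and the eigenvalue check are additions for illustration, not part of the exercise.

```r
# Rebuild the covariance matrix from Part (a)
tickers <- c("SNDL", "WRN", "NGD", "UPH", "WISH")
stonks <- matrix(c(.00023,.00008,.00012,.00083,.00006,
                   .00008,.00134,.00041,.00039,.00008,
                   .00012,.00041,.00093,.00019,.00002,
                   .00083,.00039,.00019,.00068,.00001,
                   .00006,.00008,.00002,.00001,.00053),
                 nrow = 5, ncol = 5, dimnames = list(tickers, tickers))

# Same rescaling as the explicit matrix product, with names preserved
R <- cov2cor(stonks)

# Flag upper-triangle entries outside [-1, 1] (small tolerance for rounding)
bad <- which(abs(R) > 1 + 1e-8 & upper.tri(R), arr.ind = TRUE)
data.frame(row = rownames(R)[bad[, "row"]],
           col = colnames(R)[bad[, "col"]],
           r   = R[bad])

# A valid covariance matrix has no negative eigenvalues
min(eigen(stonks, symmetric = TRUE)$values)
```

Any flagged pair, or a negative smallest eigenvalue, means the input cannot be the covariance matrix of real data.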

Part (c)

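For one, I want to look at the scatter plot matrix to understand what is going on with the SNDL and UPH pair, whose computed correlation of about 2.10 is impossible: correlations must lie between -1 and 1. The given covariance of 0.00083 exceeds the largest value consistent with the two variances, sqrt(0.00023 × 0.00068) ≈ 0.00040, so the matrix as given is not a valid covariance matrix. Besides that, I'd want the plots to check whether each relationship is linear or nonlinear, since patterns like that are difficult to detect from numbers alone.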

EXERCISE 02

Part (a)

library(readxl)
DistressData <- read_excel("C:/Users/Sarah Chock/OneDrive - University of St. Thomas/Senior Year/STAT 360 Comp Stat and Data Analysis/Exploratory Data Analysis/DistressData.xlsx")
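dd <- as.matrix(DistressData)
# Swap the codes 3 and 5, using 6 as a temporary placeholder
# (this assumes 6 is not already a value in the data)
dd[which(dd==3)] <- 6
dd[which(dd==5)] <- 3
dd[which(dd==6)] <- 5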
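
The placeholder value 6 is safe only if 6 never appears in the data. A minimal sketch of a swap that needs no placeholder, shown on a small hypothetical matrix toy rather than the real dd: record both masks from the original values before overwriting anything.

```r
# Hypothetical stand-in for dd
toy <- matrix(c(1, 3, 5, 2, 5, 3), nrow = 2)

# Compute both masks from the untouched values first
is3 <- toy == 3
is5 <- toy == 5

swapped <- toy
swapped[is3] <- 5   # old 3s become 5
swapped[is5] <- 3   # old 5s become 3
swapped
```

Because the masks are taken before any assignment, the second step cannot accidentally re-swap values written by the first.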

Part (b)

The variance-covariance matrix is large (10 × 10), so the printed output wraps across several blocks of columns.

SIGMA <- cov(dd)
SIGMA

##                  Hopelessness Overwhelmed  Exhausted VeryLonely   VerySad
## Hopelessness        2.0863030   0.6421632 0.72357995  1.2674437 1.3058702
## Overwhelmed         0.6421632   1.1939488 0.86243233  0.6372557 0.6550655
## Exhausted           0.7235799   0.8624323 1.35914207  0.7522959 0.7690344
## VeryLonely          1.2674437   0.6372557 0.75229589  1.9305097 1.3127949
## VerySad             1.3058702   0.6550655 0.76903441  1.3127949 1.8207522
## Depressed           1.5008373   0.5487872 0.67343467  1.2763880 1.3578893
## Anxiety             1.1733621   0.7048860 0.76737523  1.0877960 1.1820557
## SelfHarm            0.5122706   0.1519727 0.20162699  0.4044413 0.4103443
## SuicidalThoughts    0.6653555   0.1743306 0.24008929  0.5246328 0.5277585
## SuicidalAttempts    0.1893862   0.0286334 0.06102314  0.1561426 0.1601406
##                  Depressed   Anxiety  SelfHarm SuicidalThoughts
## Hopelessness     1.5008373 1.1733621 0.5122706        0.6653555
## Overwhelmed      0.5487872 0.7048860 0.1519727        0.1743306
## Exhausted        0.6734347 0.7673752 0.2016270        0.2400893
## VeryLonely       1.2763880 1.0877960 0.4044413        0.5246328
## VerySad          1.3578893 1.1820557 0.4103443        0.5277585
## Depressed        2.1677766 1.2979029 0.6265705        0.8124703
## Anxiety          1.2979029 2.0480802 0.4104353        0.5002924
## SelfHarm         0.6265705 0.4104353 1.0523067        0.6501951
## SuicidalThoughts 0.8124703 0.5002924 0.6501951        1.1410428
## SuicidalAttempts 0.2521766 0.1581258 0.3035305        0.3573149
##                  SuicidalAttempts
## Hopelessness           0.18938622
## Overwhelmed            0.02863340
## Exhausted              0.06102314
## VeryLonely             0.15614261
## VerySad                0.16014058
## Depressed              0.25217657
## Anxiety                0.15812576
## SelfHarm               0.30353049
## SuicidalThoughts       0.35731491
## SuicidalAttempts       0.35118705
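
For reference, cov() returns the sample covariance matrix: center each column, form the cross-product, and divide by n - 1. The sketch below checks this on simulated data; the matrix X and its column names are hypothetical and stand in for dd.

```r
set.seed(1)
# Hypothetical data: 20 observations of 3 variables
X <- matrix(rnorm(20 * 3), ncol = 3,
            dimnames = list(NULL, c("A", "B", "C")))

# Subtract each column's mean
Xc <- scale(X, center = TRUE, scale = FALSE)

# Sample covariance: t(Xc) %*% Xc / (n - 1)
S_manual <- crossprod(Xc) / (nrow(X) - 1)

# Compare with the built-in result
all.equal(S_manual, cov(X))
```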

Part (c)

CORRELATION <- solve(sqrt(diag(diag(SIGMA)))) %*% SIGMA %*%
  t(solve(sqrt(diag(diag(SIGMA)))))
CORRELATION

##            [,1]       [,2]       [,3]      [,4]      [,5]      [,6]      [,7]
##  [1,] 1.0000000 0.40687769 0.42970006 0.6315445 0.6700171 0.7057285 0.5676359
##  [2,] 0.4068777 1.00000000 0.67701726 0.4197443 0.4442899 0.3411174 0.4507674
##  [3,] 0.4297001 0.67701726 1.00000000 0.4644300 0.4888639 0.3923339 0.4599407
##  [4,] 0.6315445 0.41974434 0.46443000 1.0000000 0.7002216 0.6239352 0.5470638
##  [5,] 0.6700171 0.44428988 0.48886385 0.7002216 1.0000000 0.6834893 0.6121235
##  [6,] 0.7057285 0.34111739 0.39233386 0.6239352 0.6834893 1.0000000 0.6159727
##  [7,] 0.5676359 0.45076744 0.45994073 0.5470638 0.6121235 0.6159727 1.0000000
##  [8,] 0.3457320 0.13558185 0.16859515 0.2837581 0.2964503 0.4148504 0.2795761
##  [9,] 0.4312351 0.14935847 0.19279217 0.3534830 0.3661498 0.5165939 0.3272649
## [10,] 0.2212537 0.04421919 0.08832688 0.1896340 0.2002658 0.2890202 0.1864489
##            [,8]      [,9]      [,10]
##  [1,] 0.3457320 0.4312351 0.22125365
##  [2,] 0.1355818 0.1493585 0.04421919
##  [3,] 0.1685951 0.1927922 0.08832688
##  [4,] 0.2837581 0.3534830 0.18963399
##  [5,] 0.2964503 0.3661498 0.20026578
##  [6,] 0.4148504 0.5165939 0.28902022
##  [7,] 0.2795761 0.3272649 0.18644894
##  [8,] 1.0000000 0.5933645 0.49930038
##  [9,] 0.5933645 1.0000000 0.56445709
## [10,] 0.4993004 0.5644571 1.00000000

Part (d)

From my chart, several pairs are strongly correlated: (Depressed, Hopelessness) at 0.71, (VerySad, VeryLonely) at 0.70, and (Depressed, VerySad) at 0.68. Some pairs are only weakly correlated, such as (SuicidalAttempts, Overwhelmed) at 0.04, (SuicidalAttempts, Exhausted) at 0.09, and (SelfHarm, Overwhelmed) at 0.14.
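
Rather than scanning the matrix by eye, the pairs can be ranked programmatically. A minimal sketch, using three of the correlations printed in Part (c) so that it runs on its own; the names R and pairs_df are illustrative only.

```r
# Three variables, with correlations copied from the Part (c) output
vars <- c("Hopelessness", "Overwhelmed", "Depressed")
R <- matrix(c(1,         0.4068777, 0.7057285,
              0.4068777, 1,         0.3411174,
              0.7057285, 0.3411174, 1),
            nrow = 3, dimnames = list(vars, vars))

# Each unordered pair appears once in the upper triangle
idx <- which(upper.tri(R), arr.ind = TRUE)
pairs_df <- data.frame(var1 = rownames(R)[idx[, "row"]],
                       var2 = colnames(R)[idx[, "col"]],
                       r    = R[idx])

# Strongest pairs first, weakest last
pairs_df <- pairs_df[order(-abs(pairs_df$r)), ]
pairs_df
```

The same steps would apply to the full 10 × 10 matrix once it carries variable names, for example via cov2cor(SIGMA).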

library(corrplot)

## corrplot 0.92 loaded

corrplot(CORRELATION, method = "number")

corrplot(CORRELATION, method = "pie")

Part (e)

pairs(dd, pch = 16, lower.panel = NULL)

Part (f)

There are a lot of dimensions in this data: with 10 variables there are 45 distinct pairwise panels, so it is really difficult to detect patterns in the covariability. As you can tell from this plot, each panel is so small that you can barely see anything! This will only become more challenging as the number of dimensions increases.
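
When the responses are coded on a small set of values, many points also land on top of each other. One common remedy is to jitter the points and draw them semi-transparently. The sketch below assumes 1 to 5 coded responses and uses a simulated matrix toy as a hypothetical stand-in for dd; the jitter amount and transparency are arbitrary choices.

```r
set.seed(1)
# Hypothetical stand-in for dd: 200 responses to 4 items coded 1 to 5
toy <- matrix(sample(1:5, 200 * 4, replace = TRUE), ncol = 4,
              dimnames = list(NULL, paste0("Item", 1:4)))

# Number of distinct pairwise panels for p variables
n_panels <- choose(ncol(toy), 2)

# Add uniform noise of at most 0.3 to each column to separate tied points
toy_jit <- apply(toy, 2, jitter, amount = 0.3)

# Semi-transparent points make dense cells appear darker
pairs(toy_jit, pch = 16, col = rgb(0, 0, 0, alpha = 0.2), lower.panel = NULL)
```

Plotting a handful of variables at a time keeps each panel large enough to read.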

EXERCISE 03

Part (a)

ames <- read.csv("C:/Users/Sarah Chock/OneDrive - University of St. Thomas/Senior Year/STAT 360 Comp Stat and Data Analysis/Exploratory Data Analysis/Ames.csv")
#cor(ames)

Part (b)

Running cor(ames) produces the error 'x' must be numeric, because some columns of ames are not numeric.
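
The error can be reproduced and captured on a small data frame. The sketch below uses a hypothetical three-column data frame toy (not the Ames data) with one character column, and tryCatch() so the script keeps running.

```r
# Hypothetical data frame with one non-numeric column
toy <- data.frame(price  = c(100, 150, 200),
                  area   = c(10, 14, 21),
                  street = c("Pave", "Grvl", "Pave"))

# Capture the error message instead of stopping
msg <- tryCatch(cor(toy), error = function(e) conditionMessage(e))
msg

# Restricting to the numeric columns avoids the error
cor(toy[sapply(toy, is.numeric)])
```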

Part (c)

sapply(ames, is.numeric)

##           Order             PID            area           price     MS.SubClass 
##            TRUE            TRUE            TRUE            TRUE            TRUE 
##       MS.Zoning    Lot.Frontage        Lot.Area          Street           Alley 
##           FALSE            TRUE            TRUE           FALSE           FALSE 
##       Lot.Shape    Land.Contour       Utilities      Lot.Config      Land.Slope 
##           FALSE           FALSE           FALSE           FALSE           FALSE 
##    Neighborhood     Condition.1     Condition.2       Bldg.Type     House.Style 
##           FALSE           FALSE           FALSE           FALSE           FALSE 
##    Overall.Qual    Overall.Cond      Year.Built  Year.Remod.Add      Roof.Style 
##            TRUE            TRUE            TRUE            TRUE           FALSE 
##       Roof.Matl    Exterior.1st    Exterior.2nd    Mas.Vnr.Type    Mas.Vnr.Area 
##           FALSE           FALSE           FALSE           FALSE            TRUE 
##      Exter.Qual      Exter.Cond      Foundation       Bsmt.Qual       Bsmt.Cond 
##           FALSE           FALSE           FALSE           FALSE           FALSE 
##   Bsmt.Exposure  BsmtFin.Type.1    BsmtFin.SF.1  BsmtFin.Type.2    BsmtFin.SF.2 
##           FALSE           FALSE            TRUE           FALSE            TRUE 
##     Bsmt.Unf.SF   Total.Bsmt.SF         Heating      Heating.QC     Central.Air 
##            TRUE            TRUE           FALSE           FALSE           FALSE 
##      Electrical     X1st.Flr.SF     X2nd.Flr.SF Low.Qual.Fin.SF  Bsmt.Full.Bath 
##           FALSE            TRUE            TRUE            TRUE            TRUE 
##  Bsmt.Half.Bath       Full.Bath       Half.Bath   Bedroom.AbvGr   Kitchen.AbvGr 
##            TRUE            TRUE            TRUE            TRUE            TRUE 
##    Kitchen.Qual   TotRms.AbvGrd      Functional      Fireplaces    Fireplace.Qu 
##           FALSE            TRUE           FALSE            TRUE           FALSE 
##     Garage.Type   Garage.Yr.Blt   Garage.Finish     Garage.Cars     Garage.Area 
##           FALSE            TRUE           FALSE            TRUE            TRUE 
##     Garage.Qual     Garage.Cond     Paved.Drive    Wood.Deck.SF   Open.Porch.SF 
##           FALSE           FALSE           FALSE            TRUE            TRUE 
##  Enclosed.Porch     X3Ssn.Porch    Screen.Porch       Pool.Area         Pool.QC 
##            TRUE            TRUE            TRUE            TRUE           FALSE 
##           Fence    Misc.Feature        Misc.Val         Mo.Sold         Yr.Sold 
##           FALSE           FALSE            TRUE            TRUE            TRUE 
##       Sale.Type  Sale.Condition 
##           FALSE           FALSE

Part (d)

numAmes <- ames[which(sapply(ames, is.numeric))]

Part (e)

CORR <- cor(numAmes)
corrplot(CORR, method = "color")

Part (f)

There are quite a few dimensions that only show question marks and don't contribute to the heat map (except on the diagonal, where each variable is correlated with itself). These include Lot.Frontage, Bsmt.Half.Bath, and Garage.Area. This is not an error: these columns contain NA values, so by default cor() returns NA for every pair involving them, and corrplot draws NA as a question mark.
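
A quick way to confirm which columns are responsible is to count the missing values per column. The sketch below uses a small hypothetical data frame toy in place of numAmes.

```r
# Hypothetical data: column b has one missing value
toy <- data.frame(a = c(1, 2, 3, 4, 5),
                  b = c(2, 4, NA, 8, 10),
                  c = c(5, 3, 4, 1, 2))

# Number of NA values in each column
na_counts <- colSums(is.na(toy))
na_counts[na_counts > 0]

# With the default use = "everything", pairs involving b come back NA
R <- cor(toy)
R
```

Applying colSums(is.na(numAmes)) in the same way would list the Ames columns behind the question marks.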

Part (g)

completeCORR <- cor(numAmes, use = "complete.obs")
corrplot(completeCORR, method = "color")

Part (h)

pairCORR <- cor(numAmes, use = "pairwise.complete.obs")
corrplot(pairCORR, method = "color")

Part (i)

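Truth be told, I cannot see any differences between the two heat maps by eye. But when I clicked back and forth quickly between them, it looked like the complete-observations method had stronger magnitudes across the board than the pairwise method. The two methods use different rows: complete.obs drops every house that has an NA in any numeric column, while pairwise.complete.obs drops a house only from the pairs that involve its missing values. I think the difference could come from the houses with missing values differing systematically from the rest, but that is just my hypothesis.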
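
Instead of comparing the heat maps visually, the two matrices can be subtracted directly. A minimal sketch on simulated data with values removed by hand; toy and every name below are hypothetical and not part of the assignment.

```r
set.seed(1)
# Hypothetical data with missing values in two columns
toy <- data.frame(x = rnorm(50), y = rnorm(50), z = rnorm(50))
toy$y[1:10]  <- NA
toy$z[41:50] <- NA

# Rows kept by complete.obs (no NA in any column)
n_complete <- sum(complete.cases(toy))

# Rows available to each pair under pairwise.complete.obs
n_pair <- crossprod(ifelse(is.na(toy), 0, 1))

R_complete <- cor(toy, use = "complete.obs")
R_pairwise <- cor(toy, use = "pairwise.complete.obs")

# Largest absolute difference between the two correlation matrices
max(abs(R_complete - R_pairwise))
```

On the Ames data, max(abs(completeCORR - pairCORR)) and sum(complete.cases(numAmes)) would give the corresponding numbers.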