2 Correlation plots and Scatterplots (8 points)

## install.packages("HistData", repos = "http://cran.us.r-project.org", dependencies = TRUE)  (After the first compile, we may comment out this line.)
library("HistData")
data(GaltonFamilies)
Galton2 <- data.frame(GaltonFamilies)
names(Galton2)
## [1] "family"          "father"          "mother"          "midparentHeight"
## [5] "children"        "childNum"        "gender"          "childHeight"

Note: I will use “we” or “us” hereinafter to avoid first-person singular narrative, which in my opinion is not a very convincing means of communicating data analysis. In other words, the use of “we” or “us” does not indicate that another individual or entity assisted in preparing the seven Mid-Term Exam responses.

2.1. Obtain the correlation matrix of all the numeric and integer variables.

Step One: Clean up data:

Remove “family” and “childNum”, then drop the factor “gender” so that only numeric columns are passed to cor():

Galton2 <- subset(GaltonFamilies, select = -c(family, childNum))
Galton3 <- subset(Galton2, select=-gender)
(M <- cor(Galton3))
##                      father      mother midparentHeight    children childHeight
## father           1.00000000  0.06036612       0.7284393 -0.15133262   0.2660385
## mother           0.06036612  1.00000000       0.7278340 -0.03358248   0.2013219
## midparentHeight  0.72843929  0.72783397       1.0000000 -0.12701620   0.3209499
## children        -0.15133262 -0.03358248      -0.1270162  1.00000000  -0.1267196
## childHeight      0.26603854  0.20132195       0.3209499 -0.12671961   1.0000000

Response to Question No. 2.1: As shown above, the R function cor() computes the correlation matrix (Pearson correlations, by default). Doing so lets us investigate the dependence between multiple variables (“father,” “mother,” “midparentHeight,” “children,” “childHeight”) at the same time. The result is a table of correlation coefficients between all the numeric and integer variables.
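As a minimal sketch of the same cleaning step done programmatically, the numeric and integer columns can be selected with is.numeric() instead of being named by hand. The data frame `toy` and its values are hypothetical and are not the Galton data; only base R is assumed.

```r
# Toy data frame (hypothetical values, NOT the Galton data):
# one factor column, two numeric columns, one integer column.
toy <- data.frame(
  group = factor(c("a", "b", "a", "b", "a", "b")),
  x     = c(1.0, 2.0, 3.0, 4.0, 5.0, 6.0),
  y     = c(2.1, 3.9, 6.2, 7.8, 10.1, 12.2),
  n     = c(3L, 1L, 4L, 1L, 5L, 9L)
)

# is.numeric() is TRUE for both numeric and integer columns, FALSE for factors.
num_cols <- sapply(toy, is.numeric)

# Correlation matrix of the numeric/integer columns only.
M_toy <- cor(toy[, num_cols])
round(M_toy, 3)
```

Applied to Galton2, the same pattern would drop “gender” automatically, since it is stored as a factor.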

2.2 Obtain the correlation plot of all the numeric and integer variables.

#install.packages("corrplot")
corrplot::corrplot(M)

Response to Question No. 2.2: As shown above, with the corrplot package installed, we call the R function corrplot() to create a graphical display of the correlation matrix from the response to Question No. 2.1, which highlights the most correlated variables. In the above plot, each correlation coefficient is colored according to its value. We could also reorder the correlation matrix according to the degree of association between variables to learn more about our data (and the association between family members).
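The reordering mentioned above can be sketched in base R. The matrix below is retyped from the cor() output printed in 2.1, and the ordering uses hierarchical clustering on 1 - r as the distance. This is one possible way to group associated variables, offered as an illustration and not necessarily identical to what corrplot does internally; corrplot() also accepts an order argument (shown commented out because it needs the package).

```r
# Correlation matrix retyped from the printed output in 2.1.
vars <- c("father", "mother", "midparentHeight", "children", "childHeight")
M <- matrix(
  c( 1.00000000,  0.06036612,  0.72843929, -0.15133262,  0.26603854,
     0.06036612,  1.00000000,  0.72783397, -0.03358248,  0.20132195,
     0.72843929,  0.72783397,  1.00000000, -0.12701620,  0.32094990,
    -0.15133262, -0.03358248, -0.12701620,  1.00000000, -0.12671961,
     0.26603854,  0.20132195,  0.32094990, -0.12671961,  1.00000000),
  nrow = 5, byrow = TRUE, dimnames = list(vars, vars)
)

# Hierarchical clustering with 1 - r as distance: strongly positively
# correlated variables end up next to each other.
ord <- hclust(as.dist(1 - M))$order
M_reordered <- M[ord, ord]
round(M_reordered, 2)

# With the corrplot package installed, a reordered plot could be requested as:
# corrplot::corrplot(M, order = "hclust")
```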

2.3. Obtain the scatterplot matrix of all the variables in Galton2 with gender the first variable and childHeight variable as the output variable.

library(GGally)
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
ggpairs(Galton2, 
        columns = c("gender", 
                    "father", 
                    "mother", 
                    "midparentHeight", 
                    "children", 
                    "childHeight" )
        )
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Response to Question No. 2.3: For the above scatterplot matrix, we load the GGally library and call the ggpairs() function (instead of using the native plot()) to visualize the data. For each pair of numeric variables, the scatterplot is drawn in the lower-left part of the figure and the Pearson correlation is displayed in the upper-right part. The distribution of each variable is shown on the diagonal, with “gender” as the first variable and “childHeight”, the output variable, as the last.

2.4. Which variables look like potential predictors of childHeight?

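Response to Question No. 2.4: As observed in the data above, “gender,” “father,” “mother,” and “midparentHeight” look like potential predictors of childHeight. Among the numeric variables, midparentHeight has the strongest correlation with childHeight (0.32), followed by father (0.27) and mother (0.20); gender is a factor, so it does not appear in the correlation matrix, but the scatterplot matrix in 2.3 shows that childHeight differs by gender.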
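As a small illustration, the childHeight row of the correlation matrix printed in 2.1 can be ranked by absolute value. The numbers are retyped from that output; “gender” is absent because it is a factor and was removed before cor() was called.

```r
# Correlations with childHeight, retyped from the cor() output in 2.1.
r_child <- c(
  father          =  0.26603854,
  mother          =  0.20132195,
  midparentHeight =  0.32094990,
  children        = -0.12671961
)

# Rank the candidate predictors by strength of linear association.
ranked <- sort(abs(r_child), decreasing = TRUE)
round(ranked, 3)
```

The ranking only reflects marginal linear association; it says nothing about how the predictors behave jointly in a model.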

2.5. Which pairs of predictors look redundant?

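Response to Question No. 2.5: The “father and midparentHeight” and “mother and midparentHeight” pairs appear redundant: each has a correlation of about 0.73 in the matrix above, whereas father and mother are nearly uncorrelated with each other (0.06).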
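One way to see why these pairs look redundant is a small simulation. It assumes that midparentHeight is a fixed linear combination of the parents' heights, (father + 1.08 * mother) / 2, which is how we understand the HistData documentation to define it; treat that formula, and the simulated means and standard deviations, as assumptions. The values below are simulated, not the Galton data.

```r
set.seed(42)
n <- 500

# Simulated, independent parent heights in inches (hypothetical parameters).
father <- rnorm(n, mean = 69, sd = 2.5)
mother <- rnorm(n, mean = 64, sd = 2.3)

# Assumed definition of the mid-parent height.
midparent <- (father + 1.08 * mother) / 2

# Each parent correlates strongly with the derived variable even though
# the two parents are generated independently of each other.
r_father <- cor(father, midparent)
r_mother <- cor(mother, midparent)

# A regression of midparent on both parents reproduces it exactly,
# so it carries no information beyond father and mother.
fit <- lm(midparent ~ father + mother)
max_resid <- max(abs(resid(fit)))
coef(fit)
```

Under this assumption, using midparentHeight together with father and mother in one model would introduce exact collinearity.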

2.6. Obtain the scatterplot childHeight vs midparentHeight with color of points according to gender.

library(ggplot2)
ggplot(Galton2) +
  aes(x = midparentHeight, y = childHeight, color = gender) +
  geom_point()

Response to Question No. 2.6: According to the color-coded legend “gender” (to the right of the above figure), the points separate into two roughly parallel bands by gender, with male children generally above female children, and each band rises approximately linearly with midparentHeight.

2.7. Add to this plot, title = “Original Galton Data”, and subtitle = “Scatterplot”.

library(ggplot2)
ggplot(Galton2) +
  aes(x = midparentHeight, y = childHeight, color  = gender) +
  geom_point() +
  labs(title = "Original Galton Data",
       subtitle = "Scatterplot")

Response to Question No. 2.7: Observe that we added a title and subtitle in the above figure.

2.8. Add to this plot, loess regression lines for each gender group.

library(ggplot2)
ggplot(Galton2) +
  aes(x = midparentHeight, y = childHeight, color  = gender) +
  geom_point() +
  geom_smooth(method = "loess", se = FALSE) + 
  labs(title = "Original Galton Data",
       subtitle = "Scatterplot")
## `geom_smooth()` using formula 'y ~ x'

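Response to Question No. 2.8: In the above plot we add a loess (local regression) curve for each gender group. Loess is a non-parametric approach that fits a weighted low-degree polynomial regression within a local neighborhood of each x value; because color = gender is mapped in aes(), one smooth curve is fitted separately for male and for female children. This can be particularly useful when we know that both variables are bounded within a range. In particular, geom_smooth(method = "loess") applies loess() to each group to smooth childHeight as a function of midparentHeight; it predicts childHeight locally, not gender.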
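What the per-group smoothing amounts to can be sketched in base R by splitting the data by gender and fitting loess() to each part. The data frame `sim` is simulated with made-up coefficients purely for illustration; it is not the Galton data, and the default loess() settings are used rather than anything tuned.

```r
set.seed(1)
n <- 200

# Simulated data (hypothetical coefficients, NOT the Galton data).
sim <- data.frame(
  gender          = factor(rep(c("female", "male"), each = n / 2)),
  midparentHeight = runif(n, min = 64, max = 75)
)
sim$childHeight <- 20 + 0.65 * sim$midparentHeight +
  ifelse(sim$gender == "male", 5, 0) + rnorm(n, sd = 2)

# One loess fit per gender group, mirroring color = gender in aes().
fits <- lapply(split(sim, sim$gender), function(d) {
  loess(childHeight ~ midparentHeight, data = d)
})

# Smoothed childHeight at a few midparentHeight values inside the data range.
grid <- data.frame(midparentHeight = c(66, 69, 72))
pred <- sapply(fits, predict, newdata = grid)
round(pred, 1)
```

Each column of `pred` is one group's smooth evaluated on the grid, which is the quantity drawn as a colored curve in the plot above.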