## install.packages("HistData", repos = "http://cran.us.r-project.org", dependencies = TRUE)  (After the first compile, we may comment out this line.)
library("HistData")
data(GaltonFamilies)
Galton2 <- data.frame(GaltonFamilies)
names(Galton2)
## [1] "family" "father" "mother" "midparentHeight"
## [5] "children" "childNum" "gender" "childHeight"
Note: I will use “we” or “us” hereinafter to avoid first-person singular narrative, which in my opinion is not a very convincing means of communicating data analysis. In other words, the use of “we” or “us” does not indicate that another individual or entity assisted in preparing the seven Mid-Term Exam responses.
Step One: Clean up data:
Remove “family” and “childNum”:
Galton2 <- subset(GaltonFamilies, select = -c(family, childNum))
Galton3 <- subset(Galton2, select=-gender)
(M <- cor(Galton3))
##                      father      mother midparentHeight    children childHeight
## father           1.00000000  0.06036612       0.7284393 -0.15133262   0.2660385
## mother           0.06036612  1.00000000       0.7278340 -0.03358248   0.2013219
## midparentHeight  0.72843929  0.72783397       1.0000000 -0.12701620   0.3209499
## children        -0.15133262 -0.03358248      -0.1270162  1.00000000  -0.1267196
## childHeight      0.26603854  0.20132195       0.3209499 -0.12671961   1.0000000
Response to Question No. 2.1: As shown above, the R function cor() computes the correlation matrix, which lets us investigate the dependence between multiple variables at the same time. The result is a table of correlation coefficients between all of the numeric and integer variables (“father,” “mother,” “midparentHeight,” “children,” and “childHeight”). The factor variable “gender” is removed beforehand because cor() requires numeric input.
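To make the meaning of each entry concrete, the following minimal sketch checks one coefficient by hand. It uses a small toy data frame with hypothetical values (not the Galton data) and assumes the default Pearson method of cor(), under which each entry is the covariance of two columns divided by the product of their standard deviations.

```r
# Toy data: hypothetical heights in inches, NOT taken from GaltonFamilies
toy <- data.frame(
  father = c(70, 68, 72, 65, 74),
  mother = c(64, 62, 66, 63, 65),
  child  = c(69, 66, 71, 64, 72)
)

# Correlation matrix (Pearson is the default method of cor())
M_toy <- cor(toy)

# The same coefficient computed by hand: cov(x, y) / (sd(x) * sd(y))
manual <- cov(toy$father, toy$child) / (sd(toy$father) * sd(toy$child))

# TRUE when the matrix entry and the manual value agree
all.equal(M_toy["father", "child"], manual)
```

The matrix is symmetric with ones on the diagonal, so only the entries on one side of the diagonal carry distinct information.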
#install.packages("corrplot")
corrplot::corrplot(M)
Response to Question No. 2.2: As shown above, we first install the corrplot library and then execute the R function corrplot() to create a graphical display of the correlation matrix from the response to Question No. 2.1, which highlights the most strongly correlated variables. In the plot, each correlation coefficient is colored according to its value. We may also reorder the correlation matrix according to the degree of association between variables to learn more about our data (and the association between family members).
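One possible way to carry out the reordering mentioned above is hierarchical clustering on the absolute correlations. The sketch below is an illustration rather than the method used in this report: it re-enters the printed matrix from Question No. 2.1 under the hypothetical name M_printed so that it runs on its own, and it uses base R only. The corrplot() function also offers an order argument (for example order = "hclust") that reorders the plot directly.

```r
# Correlation matrix re-entered from the printed output of Question No. 2.1
vars <- c("father", "mother", "midparentHeight", "children", "childHeight")
M_printed <- matrix(c(
   1.00000000,  0.06036612,  0.7284393, -0.15133262,  0.2660385,
   0.06036612,  1.00000000,  0.7278340, -0.03358248,  0.2013219,
   0.72843929,  0.72783397,  1.0000000, -0.12701620,  0.3209499,
  -0.15133262, -0.03358248, -0.1270162,  1.00000000, -0.1267196,
   0.26603854,  0.20132195,  0.3209499, -0.12671961,  1.0000000
), nrow = 5, byrow = TRUE, dimnames = list(vars, vars))

# Treat strongly correlated variables (either sign) as "close" to each other
d <- as.dist(1 - abs(M_printed))

# Hierarchical clustering returns an ordering that groups similar variables
ord <- hclust(d)$order

# Apply the same permutation to rows and columns
M_ordered <- M_printed[ord, ord]
M_ordered
```

Applying one permutation to both rows and columns keeps the diagonal intact and only changes which variables sit next to each other.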
library(GGally)
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
ggpairs(Galton2,
        columns = c("gender",
                    "father",
                    "mother",
                    "midparentHeight",
                    "children",
                    "childHeight"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Response to Question No. 2.3: For the above scatterplot matrix, we load the GGally library and execute the ggpairs() function (instead of using the native plot()) to visualize the data. The scatterplot for each pair of numeric variables is drawn in the lower-left part of the figure, and the corresponding Pearson correlation is displayed in the upper-right part. The distribution of each variable is shown on the diagonal, with “gender” as the first variable and “childHeight” displayed last as the output variable.
Response to Question No. 2.4: As observed in the data above, “gender,” “father,” “mother,” and “midparentHeight.”
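Response to Question No. 2.5: The “father and midparentHeight” and “mother and midparentHeight” pairs appear redundant: each pair has a correlation of about 0.73 in the matrix above, because midparentHeight is itself computed from the two parents' heights.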
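The redundancy can be illustrated with a short sketch. It assumes the definition given in the HistData documentation, midparentHeight = (father + 1.08 * mother) / 2, and uses hypothetical toy heights rather than the Galton data. Under that assumption, a linear model of the mid-parent height on the two parental heights fits exactly, so the variable adds no information beyond “father” and “mother.”

```r
# Toy parental heights in inches (hypothetical values, NOT the Galton data)
father <- c(78.5, 75.5, 75.0, 69.0, 70.0, 65.0)
mother <- c(67.0, 66.5, 64.0, 66.0, 58.0, 63.0)

# Assumed definition of the mid-parent height (per the HistData documentation)
midparent <- (father + 1.08 * mother) / 2

# Regress the mid-parent height on both parents
fit <- lm(midparent ~ father + mother)

# Coefficients recover the assumed weights (0.5 and 0.54) with a zero intercept
round(coef(fit), 6)

# Residuals are zero up to floating-point error: the variable is fully determined
max(abs(residuals(fit)))
```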
library(ggplot2)
ggplot(Galton2) +
  aes(x = midparentHeight, y = childHeight, color = gender) +
  geom_point()
Response to Question No. 2.6: According to the color-coded “gender” legend (to the right of the above figure), the points split fairly evenly into two roughly linear bands along the slope, one for each gender.
library(ggplot2)
ggplot(Galton2) +
  aes(x = midparentHeight, y = childHeight, color = gender) +
  geom_point() +
  labs(title = "Original Galton Data",
       subtitle = "Scatterplot")
Response to Question No. 2.7: Observe that we added a title and subtitle in the above figure.
library(ggplot2)
ggplot(Galton2) +
  aes(x = midparentHeight, y = childHeight, color = gender) +
  geom_point() +
  geom_smooth(method = "loess", se = FALSE) +
  labs(title = "Original Galton Data",
       subtitle = "Scatterplot")
## `geom_smooth()` using formula 'y ~ x'
Response to Question No. 2.8: In the above plot we apply local regression (LOESS) to each gender group. Because color is mapped to gender, geom_smooth() fits a separate non-parametric curve for males and for females, and each curve is built from regressions fitted within local neighborhoods of midparentHeight. This can be particularly useful when we know that both variables are bounded within a range. In particular, geom_smooth(method = "loess") relies on the loess() function to smooth childHeight locally as a function of midparentHeight within each gender; it predicts height, not gender.
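The per-group smoothing can be reproduced outside of ggplot2 with base R. The sketch below is illustrative only: it simulates data under assumed parameters (a common slope of 0.7, a five-inch gap between genders, and noise with a standard deviation of 2) and uses the hypothetical names sim, grid, fits, and curves. It fits one loess() model per gender and evaluates both curves on a shared grid.

```r
set.seed(1)
n <- 100

# Simulated data (assumed parameters, NOT the Galton data)
sim <- data.frame(
  gender          = rep(c("female", "male"), each = n),
  midparentHeight = runif(2 * n, min = 64, max = 75)
)
offset <- ifelse(sim$gender == "male", 69, 64)
sim$childHeight <- offset + 0.7 * (sim$midparentHeight - 69) +
  rnorm(2 * n, sd = 2)

# Evaluation grid kept inside the observed range (loess does not extrapolate)
grid <- data.frame(midparentHeight = seq(66, 73, by = 0.5))

# One local regression per gender, mirroring the grouping by color
fits <- lapply(split(sim, sim$gender), function(d) {
  loess(childHeight ~ midparentHeight, data = d)
})

# Matrix of smoothed heights: one row per grid point, one column per gender
curves <- sapply(fits, predict, newdata = grid)
head(curves)
```

Each column of curves corresponds to one of the colored lines that geom_smooth() would draw for the same data.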