1 Data pre-processing (8 points)

**(1) Obtain the summary of the data GaltonFamilies. (Located in HistData package of historical data sets).**

library(HistData)  # provides the GaltonFamilies data frame
summary(GaltonFamilies)

##      family        father         mother      midparentHeight    children     
##  185    : 15   Min.   :62.0   Min.   :58.00   Min.   :64.40   Min.   : 1.000  
##  066    : 11   1st Qu.:68.0   1st Qu.:63.00   1st Qu.:68.14   1st Qu.: 4.000  
##  120    : 11   Median :69.0   Median :64.00   Median :69.25   Median : 6.000  
##  130    : 11   Mean   :69.2   Mean   :64.09   Mean   :69.21   Mean   : 6.171  
##  166    : 11   3rd Qu.:71.0   3rd Qu.:65.88   3rd Qu.:70.14   3rd Qu.: 8.000  
##  097    : 10   Max.   :78.5   Max.   :70.50   Max.   :75.43   Max.   :15.000  
##  (Other):865                                                                  
##     childNum         gender     childHeight   
##  Min.   : 1.000   female:453   Min.   :56.00  
##  1st Qu.: 2.000   male  :481   1st Qu.:64.00  
##  Median : 3.000                Median :66.50  
##  Mean   : 3.586                Mean   :66.75  
##  3rd Qu.: 5.000                3rd Qu.:69.70  
##  Max.   :15.000                Max.   :79.00  
##

General Observations of the Data:

[Note: I will use “we” or “us” hereinafter to avoid first-person singular narrative, which in my opinion is not a very convincing means of communicating data analysis. In other words, the use of “we” or “us” does not indicate that another individual or entity assisted with the seven Mid-Term Exam responses.]

Executing the summary() command above in RStudio gives us a quick overview of the GaltonFamilies data frame. Before beginning any analysis or making specific observations, we first use RStudio’s built-in help command (?GaltonFamilies) to learn about the data set. This data frame does not contain a large number of variables (only 8); if, on the other hand, the data frame were large, we would execute the head(GaltonFamilies) command to show the first few rows.
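As a minimal sketch of this first inspection step, the calls below show the dimensions, the structure, and the first rows of a data frame. The `toy` data frame is a hypothetical three-row stand-in that only borrows the GaltonFamilies column names, so the example runs without the HistData package installed.

```r
# Hypothetical stand-in with the same column names as GaltonFamilies.
toy <- data.frame(
  family          = factor(c("001", "001", "002")),
  father          = c(78.5, 78.5, 75.5),
  mother          = c(67.0, 67.0, 66.5),
  midparentHeight = c(75.43, 75.43, 73.66),
  children        = c(2L, 2L, 1L),
  childNum        = c(1L, 2L, 1L),
  gender          = factor(c("male", "female", "male")),
  childHeight     = c(73.2, 69.2, 73.5)
)

dims <- dim(toy)             # number of rows, then number of columns
str(toy)                     # one line per variable, including its type
first_rows <- head(toy, 2)   # first two rows only
print(first_rows)
```

With HistData loaded, the same three calls can be applied to GaltonFamilies itself.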

Regrettably, RStudio’s Help library contains inadequate information about how the data were collected (e.g., observational study, experimental design, survey), whether the data contain non-responses or missing values, and what the data represent (e.g., how the data are coded and represented, and what the units of measurement are).

Accordingly, a search of the Internet (as well as a review of our lecture notes) was required to better understand how the data were collected and the stated relationships among the variables. That is, we should first examine the data collection methodology and rule out any obvious selection bias (or at least be wary of possible bias when interpreting the data). Additionally, we should understand what the data purport to represent (e.g., industry data such as Toyota sales data, or societal data such as Galapagos), as well as any obvious intuitive observations about the variables in the data set (e.g., does one variable stand out as the intercept, and are one or more of the other variables subsets of that variable?).

General Observations of the GaltonFamilies Data Frame

GaltonFamilies is a well-known historical data set, typically used for instructional purposes, that was collected to support the hypothesis that height is hereditary. The underlying records are described as “. . .963 children in 205 families ranging from 1-15 adult children children.” We are told that 29 of these children had non-numeric heights recorded and are not included, which leaves the 934 observations seen in the summary above (453 female and 481 male). Further, according to RStudio Help and our research, the GaltonFamilies data frame contains eight variables and was developed from an article by Sir Francis Galton in the latter part of the 19th century. See Galton, F. (1886). Regression Towards Mediocrity in Hereditary Stature, Journal of the Anthropological Institute, 15, 246-263. Galton presented a study of the relationship between the heights of parents and children and observed that “. . . the heights of children of both tall and short parents appeared to “revert” or “regress” to the mean of the group.” Id. This observation supports Galton’s famed finding of a “. . . tendency to be a regression to ‘mediocrity’”. As mentioned in our class lecture, Galton is noted for developing a mathematical description of this regression tendency, which is the precursor of today’s regression models.

A further review of the eight variables shows that the family variable is said to be listed in descending order of the father’s and mother’s heights. In other words, the data frame contains variables that appear to have hierarchical relationships (e.g., father and mother are nested within family, and children within father and mother, with gender nested therein). Other relationships are also observed among childNum, childHeight, and midparentHeight.
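One way to probe the family and children relationship is to compare the number of rows each family contributes with the family size recorded in the children column. The sketch below does this on a hypothetical `toy` data frame (the values are made up for illustration). In the real data we would expect some families to have fewer rows than recorded children, because children with non-numeric heights were omitted.

```r
# Hypothetical stand-in: family "003" records 3 children but has only 1 row.
toy <- data.frame(
  family   = factor(c("001", "001", "002", "003")),
  children = c(2L, 2L, 1L, 3L),
  childNum = c(1L, 2L, 1L, 1L)
)

# Rows actually present for each family (ordered by factor level).
rows_per_family <- as.integer(table(toy$family))

# Family size as recorded in the children column (first row of each family).
recorded <- as.integer(tapply(toy$children, toy$family, function(x) x[1]))

check <- data.frame(
  family   = levels(toy$family),
  rows     = rows_per_family,
  recorded = recorded
)

# Families with fewer rows than recorded children.
short <- check[check$rows < check$recorded, ]
print(short)
```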

Specific Observations

(2) Are there any data that should be coded missing?

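Yes. We are told that 29 children had non-numeric heights recorded and are not included in the data frame; any such value that did remain should be coded as missing (NA). As a safeguard, we recode any height recorded as 0 to NA. Because the summary above shows minimum values of 64.40 for midparentHeight and 56.00 for childHeight, these commands do not change any rows.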

# Recode heights recorded as 0 to NA (missing)
GaltonFamilies$midparentHeight[GaltonFamilies$midparentHeight == 0] = NA
GaltonFamilies$childHeight[GaltonFamilies$childHeight == 0] = NA
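To confirm whether a recode of this kind changed anything, we can count the NA values per column afterwards. The sketch below uses a hypothetical `toy` data frame in which 0 stands for "not recorded" (an assumption made for illustration, not a property of GaltonFamilies).

```r
# Hypothetical values; 0 is assumed to mean "height not recorded".
toy <- data.frame(
  midparentHeight = c(75.43, 0, 73.66),
  childHeight     = c(73.2, 69.2, 0)
)

# Recode 0 to NA in each height column; which() ignores any existing NAs.
for (col in c("midparentHeight", "childHeight")) {
  toy[[col]][which(toy[[col]] == 0)] <- NA
}

# Number of missing values in each column after the recode.
na_counts <- colSums(is.na(toy))
print(na_counts)
```

Running colSums(is.na(GaltonFamilies)) in the same way would show how many values, if any, are missing in each column of the real data.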

(3) Which variables are numeric, integer, or factor?

In R, numeric is the most common data type encountered; a numeric variable holds numbers that may contain decimals. The father, mother, midparentHeight, and childHeight variables are numeric data types. An integer data type is a special case of numeric data that does not contain decimals. In the Galton data frame, children and childNum are integer data types. A factor is another special case: it stores categorical data as integer codes with text labels (levels). Factor variables are used when there are a limited number of unique character strings. In the Galton data frame, family and gender are factor data types (gender has the levels “female” and “male”).
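Rather than reading the types off the summary table, they can be checked directly. The sketch below applies class() to every column of a hypothetical `toy` data frame that mirrors the GaltonFamilies column names and types; the values themselves are illustrative only.

```r
# Hypothetical stand-in mirroring the GaltonFamilies column names and types.
toy <- data.frame(
  family          = factor(c("001", "001", "002")),
  father          = c(78.5, 78.5, 75.5),
  mother          = c(67.0, 67.0, 66.5),
  midparentHeight = c(75.43, 75.43, 73.66),
  children        = c(2L, 2L, 1L),
  childNum        = c(1L, 2L, 1L),
  gender          = factor(c("male", "female", "male")),
  childHeight     = c(73.2, 69.2, 73.5)
)

# Class of every column, returned as a named character vector.
types <- sapply(toy, class)
print(types)

# Column names grouped by type.
factor_cols  <- names(toy)[sapply(toy, is.factor)]
integer_cols <- names(toy)[sapply(toy, is.integer)]
```

On the real data, sapply(GaltonFamilies, class) or str(GaltonFamilies) gives the same kind of listing.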

(4) What is the R command for obtaining the levels of a factor?

levels()

(5) Use this command to determine the levels of gender.

See the solution below, using the levels(GaltonFamilies$gender) command:

levels(GaltonFamilies$gender)

## [1] "female" "male"

(6) Are the labels sufficiently informational?

No. We assume that an experienced R programmer would assign the GaltonFamilies object to a shorter identifier such as “df”, using either assignment operator, <- or =. For example, df = GaltonFamilies. (We are not yet experienced R programmers and are hesitant to do so on a graded assignment.) Another non-intuitive variable is midparentHeight, which is actually the result of the equation \[(\text{father} + 1.08 \times \text{mother})/2\]. Again, if the R programming required to respond to this question were extensive, we could rename the midparentHeight variable to something more intuitive for the programmer (and perhaps for a fellow coder, not included here), making the data analysis easier to follow, using the [new_name] <- [old_name] command.
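The midparentHeight formula and a column rename can be sketched as follows. The `toy` data frame and the replacement name `midparent_in` are hypothetical, chosen only for illustration; for a column inside a data frame, one base R way to rename is to assign into names().

```r
# Hypothetical parent heights in inches.
toy <- data.frame(
  father = c(78.5, 75.5),
  mother = c(67.0, 66.5)
)

# Mid-parent height: the mother's height is scaled by 1.08 before averaging.
toy$midparentHeight <- (toy$father + 1.08 * toy$mother) / 2

# Rename the column; "midparent_in" is an illustrative name, not from the data set.
names(toy)[names(toy) == "midparentHeight"] <- "midparent_in"
print(toy)
```

For the first row this gives (78.5 + 1.08 × 67.0) / 2 = 75.43.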

(7) Remove the family and childNum columns.

GaltonFamilies = subset(GaltonFamilies, select = -c(family, childNum))
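An equivalent way to drop columns, shown here as a sketch on a hypothetical `toy` data frame, is to index by column name. This form also works when the names to drop are held in a character vector.

```r
# Hypothetical stand-in with two columns to drop and two to keep.
toy <- data.frame(
  family      = factor(c("001", "002")),
  father      = c(78.5, 75.5),
  childNum    = c(1L, 1L),
  childHeight = c(73.2, 73.5)
)

drop_cols <- c("family", "childNum")

# Keep every column whose name is not in drop_cols;
# drop = FALSE keeps the result a data frame even if one column remains.
trimmed <- toy[, !(names(toy) %in% drop_cols), drop = FALSE]
print(names(trimmed))
```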

(8) Produce the summary table of the modified dataframe.

summary(GaltonFamilies)

##      father         mother      midparentHeight    children         gender   
##  Min.   :62.0   Min.   :58.00   Min.   :64.40   Min.   : 1.000   female:453  
##  1st Qu.:68.0   1st Qu.:63.00   1st Qu.:68.14   1st Qu.: 4.000   male  :481  
##  Median :69.0   Median :64.00   Median :69.25   Median : 6.000               
##  Mean   :69.2   Mean   :64.09   Mean   :69.21   Mean   : 6.171               
##  3rd Qu.:71.0   3rd Qu.:65.88   3rd Qu.:70.14   3rd Qu.: 8.000               
##  Max.   :78.5   Max.   :70.50   Max.   :75.43   Max.   :15.000               
##   childHeight   
##  Min.   :56.00  
##  1st Qu.:64.00  
##  Median :66.50  
##  Mean   :66.75  
##  3rd Qu.:69.70  
##  Max.   :79.00

DAMA_Mid_Term_Q_01

Dennis Duncan

4/14/2021