Apply all you have learnt thus far to answer the questions and learn more.
We will work with the file weight-height.csv.gz, a CSV file with three columns: gender, height (inches), and weight (pounds).
Data characterization
1) First, uncompress the file using "gunzip weight-height.csv.gz".
gunzip weight-height.csv.gz
2) Import data from this file into the data frame "wh".
a) wh <- read.csv("weight-height.csv", header=TRUE)
b) You can also read the compressed file directly, but you sometimes need to check the contents after importing, so it is best to keep an uncompressed version at hand to open in Excel and verify the import.
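As a small illustration of reading the compressed file directly: base R's read.csv() opens a file path through file(), which detects gzip compression, so a .gz path can be passed as is. The sketch below writes a tiny stand-in file to a temporary location and reads it back. The three rows, and the names toy, gz_path and wh_toy, are invented for the demonstration and are not taken from weight-height.csv.gz.

```r
# Toy stand-in for the real data; these rows are invented for the demo.
toy <- data.frame(Gender = c("Male", "Female", "Male"),
                  Height = c(70.1, 63.4, 68.9),
                  Weight = c(185.2, 130.7, 178.3))

# Write a gzip-compressed CSV to a temporary file.
gz_path <- tempfile(fileext = ".csv.gz")
con <- gzfile(gz_path, "w")
write.csv(toy, con, row.names = FALSE)
close(con)

# read.csv() reads the compressed file without a separate gunzip step.
wh_toy <- read.csv(gz_path, header = TRUE)
str(wh_toy)
```

With the real file, the equivalent call would be wh <- read.csv("weight-height.csv.gz", header=TRUE).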
The file contains 5000 males and 5000 females.
summary(wh)
##     Gender         Height          Weight
##  Female:5000   Min.   :54.26   Min.   : 64.7
##  Male  :5000   1st Qu.:63.51   1st Qu.:135.8
##                Median :66.32   Median :161.2
##                Mean   :66.37   Mean   :161.4
##                3rd Qu.:69.17   3rd Qu.:187.2
##                Max.   :79.00   Max.   :270.0
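The per-gender counts in the first column of this summary appear only when Gender is stored as a factor. In R 4.0.0 and later, read.csv() no longer converts strings to factors by default, so stringsAsFactors = TRUE may be needed to reproduce that column. A minimal sketch of getting the counts explicitly with table(), using an invented toy data frame in place of wh:

```r
# Toy data frame standing in for wh; the values are invented.
toy <- data.frame(Gender = c("Male", "Female", "Female", "Male", "Female"),
                  Height = c(70.1, 63.4, 64.8, 68.9, 62.7),
                  stringsAsFactors = TRUE)  # store Gender as a factor

# Count the rows in each Gender level.
counts <- table(toy$Gender)
print(counts)

# summary() on a factor column reports the same per-level counts.
summary(toy$Gender)
```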
Find the mean and standard deviation in height and weight of the whole group and the males and females separately.
The commands you need here are mean and sd. Use round(x,2) to round x to 2 decimal places. To get data only for the males, use male_wh <- wh[which(wh[,1]=="Male"),]; which(wh[,1]=="Male") returns the row indices of all the males, which are then used to extract all the data for males.
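To make the row-selection step concrete, here is a minimal sketch on an invented four-row data frame (toy is a stand-in, not the real wh). It shows that which() returns row indices and that logical indexing by column name selects the same rows.

```r
# Toy data frame standing in for wh; the values are invented.
toy <- data.frame(Gender = c("Male", "Female", "Female", "Male"),
                  Height = c(70.1, 63.4, 64.8, 68.9),
                  Weight = c(185.2, 130.7, 141.0, 178.3))

# which() turns the logical test on column 1 into row indices.
male_rows <- which(toy[, 1] == "Male")
male_rows

# Put the indices in the row position to keep every column.
male_toy <- toy[male_rows, ]

# Logical indexing by column name selects the same rows.
male_toy2 <- toy[toy$Gender == "Male", ]

# Mean height of the selected rows, rounded to 2 decimal places.
round(mean(male_toy$Height), 2)
```

Note that the comparison goes in the row position, wh[rows, ], with the column chosen inside which() as wh[,1]. Writing wh[1,] instead would test the first row rather than the first column.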
The summary command showed for the group data that the mean of the height data is 66.37 and the mean of the weight data is 161.4.
print("The summary command showed for the group data that the mean of the height data is 66.37 and the mean of the weight data is 161.4.")
## [1] "The summary command showed for the group data that the mean of the height data is 66.37 and the mean of the weight data is 161.4."
#group height std
print("The group height standard deviation is:")
## [1] "The group height standard deviation is:"
round(sd(wh$Height),2)
## [1] 3.85
#group weight std
print("The group weight standard deviation is:")
## [1] "The group weight standard deviation is:"
round(sd(wh$Weight),2)
## [1] 32.11
#male data
male_wh <- wh[which(wh[,1]=="Male"),]
#female data
female_wh <- wh[which(wh[,1]=="Female"),]
#male height mean, male weight mean
print("The male height mean is:")
## [1] "The male height mean is:"
round(mean(male_wh$Height),2)
## [1] 69.03
print("The male weight mean is:")
## [1] "The male weight mean is:"
round(mean(male_wh$Weight),2)
## [1] 187.02
#male height std, male weight std
print("The male height standard deviation is:")
## [1] "The male height standard deviation is:"
round(sd(male_wh$Height),2)
## [1] 2.86
print("The male weight standard deviation is:")
## [1] "The male weight standard deviation is:"
round(sd(male_wh$Weight),2)
## [1] 19.78
#female height mean, female weight mean
print("The female height mean is:")
## [1] "The female height mean is:"
round(mean(female_wh$Height),2)
## [1] 63.71
print("The female weight mean is:")
## [1] "The female weight mean is:"
round(mean(female_wh$Weight),2)
## [1] 135.86
#female height std, female weight std
print("The female height standard deviation is:")
## [1] "The female height standard deviation is:"
round(sd(female_wh$Height),2)
## [1] 2.7
print("The female weight standard deviation is:")
## [1] "The female weight standard deviation is:"
round(sd(female_wh$Weight),2)
## [1] 19.02
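The per-group means and standard deviations computed one at a time above could also be produced in a single step. The following is a sketch using base R's aggregate() on an invented toy data frame, not the real wh. The names toy, group_means and group_sds are assumptions made for the example.

```r
# Toy data frame standing in for wh; the values are invented.
toy <- data.frame(Gender = c("Male", "Female", "Female", "Male"),
                  Height = c(70, 63, 65, 68),
                  Weight = c(185, 130, 140, 175))

# Mean of Height and Weight within each Gender.
group_means <- aggregate(cbind(Height, Weight) ~ Gender, data = toy, FUN = mean)

# Standard deviation of Height and Weight within each Gender.
group_sds <- aggregate(cbind(Height, Weight) ~ Gender, data = toy, FUN = sd)

group_means
group_sds
```

Each result is a data frame with one row per gender, so the numbers for males and females sit side by side.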
Is the variability greater in height or in weight?
The absolute numbers cannot be compared directly, because height is measured in inches and weight in pounds. What can be compared is the standard deviation expressed as a fraction of the mean (the coefficient of variation), which is a dimensionless number. The std is about 6% of the mean for height and about 20% of the mean for weight, which means there is greater variability in weight for this dataset.
#fraction of the mean that is the std for height
std_meanH<- round(3.85/66.37,2)*100
print(paste("The std is",std_meanH,"% of the mean for height."))
## [1] "The std is 6 % of the mean for height."
#fraction of the mean that is the std for weight
std_meanW<- round(32.11/161.4,2)*100
print(paste("The std is",std_meanW,"% of the mean for weight."))
## [1] "The std is 20 % of the mean for weight."
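The same ratio can be written as a small reusable function. This is a sketch only: cv_percent is a hypothetical helper name, not a base R function, and the numeric inputs are the whole-group means and standard deviations reported earlier in this document.

```r
# Hypothetical helper: standard deviation as a percentage of the mean
# (the coefficient of variation).
cv_percent <- function(x) {
  100 * sd(x) / mean(x)
}

# Whole-group summary values reported earlier in this document.
height_cv <- 100 * 3.85 / 66.37
weight_cv <- 100 * 32.11 / 161.4

# Keep one decimal place instead of rounding to whole percent.
round(height_cv, 1)
round(weight_cv, 1)

# With the data frame loaded, the ratios would come from the raw columns:
# cv_percent(wh$Height); cv_percent(wh$Weight)
```

Rounding after multiplying by 100 keeps a little more precision than rounding the fraction to 2 decimal places first.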
The variability tells me that weight is more easily influenced by environmental factors such as diet and exercise, whereas height cannot easily be changed after maturity. Height can change because of injury or surgery, but such cases are rare. The high variability also indicates that weight differs widely across a population. From personal experience, a person's weight can also fluctuate from month to month, whereas height does not really change in one's lifetime after puberty.
If I were to compare height and weight numbers from 50 years ago with those of today, I would guess that height measurements would show the clearest change. Weight has the greater variability, both within the population of 50 years ago and within the population of today, and that spread would make it hard to see a difference between the two populations. I think the best metric for comparing the population today with that of 50 years ago is therefore height: if the population became taller or shorter, its overall weight would also increase or decrease accordingly, but height carries less variability than weight and would therefore allow the researcher to see a difference across time.
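One rough way to make this argument concrete is to express a hypothetical change in the mean in units of the standard deviation. In the sketch below, the 1% shift is an assumed figure chosen only for illustration, not an estimate of any real historical change. The means and standard deviations are the whole-group values computed earlier.

```r
# Whole-group summary values reported earlier in this document.
height_mean <- 66.37
height_sd <- 3.85
weight_mean <- 161.4
weight_sd <- 32.11

# Hypothetical shift: assume each mean changes by 1% (illustrative only).
shift_fraction <- 0.01

# The shift expressed in standard deviations (a standardized difference).
height_shift_sd <- shift_fraction * height_mean / height_sd
weight_shift_sd <- shift_fraction * weight_mean / weight_sd

round(c(height = height_shift_sd, weight = weight_shift_sd), 3)
```

Under this assumption, the same relative shift is a larger fraction of a standard deviation for height than for weight, which is the sense in which lower variability makes a change easier to see.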