During my stay at the Whiteley Center at the Friday Harbor Laboratories I organized an R user group meeting. The project directory containing all the files and data will be available here at least until October 2016. The HTML version of this document is available at RPubs.
Read some data
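# stringsAsFactors = TRUE keeps text columns as factors (the default before R 4.0), as the summary output further down assumes
data <- read.table("./data/Podatki2012.txt", sep = "\t", header = TRUE, stringsAsFactors = TRUE)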
Note: The data are also available here.
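It can help to check how R interpreted each column right after import. The following is a minimal sketch, assuming a three-row sample typed inline with three of the original column names; `txt` and `sample_data` are hypothetical names, not part of the project files.

```r
# Hypothetical three-row sample in the same tab-separated layout as the file
txt <- "starost\tspol\tmasa\n59\tM\t91\n21\tF\t60\n21\tF\t55"

# read.table() can read from a character string through the text argument
sample_data <- read.table(text = txt, sep = "\t", header = TRUE,
                          stringsAsFactors = TRUE)

# str() lists every column with its type and first values
str(sample_data)
```

If a numeric column shows up as a factor or character here, the file probably contains a stray non-numeric entry.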
Show the first few lines
head(data)
## starost mesec spol masa visina roke cevelj lasje oci mati oce majica
## 1 59 7 M 91 178 189 44 T S 155 180 L
## 2 21 1 F 60 173 176 43 T T 162 184 S
## 3 21 7 F 55 178 178 39 T T 170 180 S
## 4 21 8 F 70 167 165 39 S T 160 190 S
## 5 21 4 F 65 171 168 40 T S 169 176 M
## 6 21 3 M 88 171 173 41 T T 165 182 XL
Change the incomprehensible variable names
names(data)
## [1] "starost" "mesec" "spol" "masa" "visina" "roke" "cevelj"
## [8] "lasje" "oci" "mati" "oce" "majica"
newNames <- c(
"age",
"month",
"gender",
"weight",
"height",
"arms",
"shoe",
"hair",
"eyes",
"mother",
"father",
"shirt"
)
newNames
## [1] "age" "month" "gender" "weight" "height" "arms" "shoe"
## [8] "hair" "eyes" "mother" "father" "shirt"
names(data) <- newNames
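Assigning names by position works only while the column order stays the same. A possible alternative, sketched here on a hypothetical data frame `df` with three of the original columns in a different order, is to rename through a lookup vector (`lookup` is an assumed name):

```r
# Hypothetical data frame; columns deliberately not in the original order
df <- data.frame(masa = c(91, 60), starost = c(59, 21), spol = c("M", "F"))

# Lookup vector: old name -> new name
lookup <- c(starost = "age", spol = "gender", masa = "weight")

# Match each existing column name to its translation
names(df) <- unname(lookup[names(df)])
names(df)
```

Each column gets the right name regardless of where it sits in the data frame.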
Check the first few lines again
head(data)
## age month gender weight height arms shoe hair eyes mother father shirt
## 1 59 7 M 91 178 189 44 T S 155 180 L
## 2 21 1 F 60 173 176 43 T T 162 184 S
## 3 21 7 F 55 178 178 39 T T 170 180 S
## 4 21 8 F 70 167 165 39 S T 160 190 S
## 5 21 4 F 65 171 168 40 T S 169 176 M
## 6 21 3 M 88 171 173 41 T T 165 182 XL
Convert height (which is measured in centimeters) into meters.
data$height[1:5]
## [1] 178 173 178 167 171
data$height <- data$height/100
data$height[1:5]
## [1] 1.78 1.73 1.78 1.67 1.71
This is dangerous when you analyse interactively: each time the rescale-height chunk runs, height is divided by 100 again. Try running it several times, then use Run all previous chunks to rerun the analysis up to this point and reset the data.
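One way to avoid the problem is to leave the original column untouched and write the converted values to a new one. This is only a sketch: `height_cm`, `height_m` and `rescale()` are hypothetical names, and the values are the first five heights shown above.

```r
# Hypothetical data frame holding the raw measurements in centimeters
df <- data.frame(height_cm = c(178, 173, 178, 167, 171))

# Derive meters from the untouched raw column
rescale <- function(d) {
  d$height_m <- d$height_cm / 100
  d
}

# Running the step twice gives the same result as running it once
df <- rescale(df)
df <- rescale(df)
df$height_m
```

Because the source column never changes, the step can be rerun any number of times.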
Always plot your data! The fig.* options for knitr chunks control figure features (width, height, caption …).
hist(data$weight)
Histogram of weight
A few other plots
with(data,plot(height,weight))
abline( lm( weight ~ height , data= data) )
Refine with col, lwd and pch to make the plot easier to read.
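A possible refinement, sketched on the first six height and weight pairs shown above (typed in by hand here, so the vectors are stand-ins for the full data); the colors and sizes are arbitrary choices:

```r
# First six observations from the data, height already in meters
height <- c(1.78, 1.73, 1.78, 1.67, 1.71, 1.71)
weight <- c(91, 60, 55, 70, 65, 88)

# pch = 16 draws filled circles, col sets the point color
plot(height, weight, pch = 16, col = "darkgray",
     xlab = "Height [m]", ylab = "Weight [kg]")

# lwd = 2 doubles the default line width of the regression line
fit <- lm(weight ~ height)
abline(fit, col = "red", lwd = 2)
```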
corelation <- cor(data$weight,data$height)
The correlation between weight and height is 0.703.
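To see what cor() computes by default, the Pearson coefficient can be written out by hand. This is a sketch on six hand-typed observations, so the value differs from the one reported for the full data; `x`, `y` and `r_manual` are hypothetical names.

```r
# First six height (m) and weight (kg) values from the data
x <- c(1.78, 1.73, 1.78, 1.67, 1.71, 1.71)
y <- c(91, 60, 55, 70, 65, 88)

# Pearson r: sum of cross-products of deviations over the root of the product of the sums of squares
r_manual <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))

# Should agree with the built-in function
all.equal(r_manual, cor(x, y))
```

Note that cor() returns NA when either vector contains missing values, unless the use argument is set (for example use = "complete.obs").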
Boxplots of some variables
par(mfrow=c(1,3))
boxplot(data$age,main="Age")
boxplot(data$height,main="Height [m]")
boxplot(data$weight,main="Weight [kg]")
It is good to look at the summaries before the analysis.
summary(data)
## age month gender weight height
## Min. :20.00 Min. : 0.000 F:33 Min. :50.00 Min. :1.560
## 1st Qu.:21.00 1st Qu.: 5.000 M:10 1st Qu.:55.50 1st Qu.:1.640
## Median :21.00 Median : 7.000 Median :61.00 Median :1.700
## Mean :22.07 Mean : 6.814 Mean :63.42 Mean :1.699
## 3rd Qu.:22.00 3rd Qu.: 9.500 3rd Qu.:70.00 3rd Qu.:1.735
## Max. :59.00 Max. :11.000 Max. :91.00 Max. :1.890
##
## arms shoe hair eyes mother
## Min. :154.0 Min. :36.00 S:19 S:24 Min. :155.0
## 1st Qu.:163.2 1st Qu.:38.00 T:24 T:19 1st Qu.:160.0
## Median :167.8 Median :39.00 Median :165.0
## Mean :169.3 Mean :40.02 Mean :165.4
## 3rd Qu.:172.5 3rd Qu.:41.50 3rd Qu.:168.0
## Max. :193.0 Max. :48.00 Max. :180.0
## NA's :5 NA's :5
## father shirt
## Min. :170.0 L : 5
## 1st Qu.:174.2 M :19
## Median :179.5 S :16
## Mean :179.1 XL: 1
## 3rd Qu.:182.0 XS: 2
## Max. :190.0
## NA's :5
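The summary reports missing values column by column. A compact way to count them, and to see how many rows are complete, might look like this; the data frame `df` and the positions of its NA values are made up for illustration.

```r
# Hypothetical data frame with some missing measurements
df <- data.frame(arms   = c(189, NA, 178),
                 mother = c(155, 162, NA),
                 age    = c(59, 21, 21))

# Number of missing values in each column
colSums(is.na(df))

# Number of rows without any missing value
sum(complete.cases(df))
```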
One participant is much older than the rest of the group. Find and remove that case.
max(data$age)
## [1] 59
old <- data$age == max(data$age)
old
## [1] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [12] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [23] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [34] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
which(old)
## [1] 1
dim(data)
## [1] 43 12
data <- data[ !old , ]
dim(data)
## [1] 42 12
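Selecting on max() always removes the oldest case or cases, whatever their age. An explicit threshold states the rule more directly. The sketch below assumes a cutoff of 30 years, which is an arbitrary value chosen for illustration, and uses a hypothetical data frame `df` built from the first six rows.

```r
# First six rows of age and weight from the data
df <- data.frame(age    = c(59, 21, 21, 21, 21, 21),
                 weight = c(91, 60, 55, 70, 65, 88))

# Hypothetical cutoff, not taken from the original analysis
age_limit <- 30

# Keep only the cases at or below the cutoff
kept <- subset(df, age <= age_limit)

# Number of removed cases
nrow(df) - nrow(kept)
```

Keep in mind that subset() also drops rows where the condition evaluates to NA.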
par(mfrow=c(1,3))
boxplot(data$age,main="Age")
boxplot(data$height,main="Height [m]")
boxplot(data$weight,main="Weight [kg]")
The body mass index (BMI) might be of interest when analysing the relation between weight and height.
data$bmi <- data$weight / data$height^2
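As a quick check of the formula (weight in kilograms divided by the square of height in meters), here it is applied to three weight and height pairs taken from the rows shown earlier; the standalone vectors are typed in by hand for illustration.

```r
# Three observations from the data
weight <- c(60, 55, 70)        # kg
height <- c(1.73, 1.78, 1.67)  # m, already converted from centimeters

# BMI = weight / height^2; ^ binds tighter than /, so no parentheses are needed
bmi <- weight / height^2
round(bmi, 2)
```

The values come out at roughly 20, 17 and 25 kg/m^2. Had height still been in centimeters, every result would be 10,000 times smaller.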
If you are sure that the data are clean and complete, you can simplify references to your variables:
attach(data)
From now on you can refer to your variables without the data frame name:
head(data$weight)
## [1] 60 55 70 65 88 52
head(weight)
## [1] 60 55 70 65 88 52
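attach() places a copy of the data frame on the search path, so later changes to the data frame are not reflected in the attached variables. A possible alternative is with(), which evaluates a single expression inside the data frame. A sketch on a hypothetical three-row data frame `df`:

```r
# Hypothetical data frame with three observations from the data
df <- data.frame(weight = c(60, 55, 70),
                 height = c(1.73, 1.78, 1.67))

# Columns are visible by name only inside the with() call
r <- with(df, cor(weight, height))
r
```

Nothing is left on the search path afterwards, so there is no stale copy to worry about.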
par(mfrow=c(1,1))
plot(height,bmi,pch=16,col=gender)
abline(lm(bmi~height),col="blue",lwd=3)
cor(bmi,height)
## [1] 0.1673265
The correlation between BMI and height is much smaller: r = 0.167.