Introduction

During my stay at the Whiteley Center at the Friday Harbor Laboratories I organized an R user group meeting. The project directory containing all the files and data will be available here at least until October 2016. The HTML version of this document is available at RPubs.

Reading the data

Read some data

data <- read.table("./data/Podatki2012.txt",sep="\t",header=TRUE)

Note: The data are also available here.
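The summary further down lists gender, hair, eyes and shirt as factors, which was the default for text columns in read.table before R 4.0.0. From R 4.0.0 on, the default is stringsAsFactors = FALSE, so a minimal sketch of asking for factors explicitly may help. The two-row text is only a stand-in for the real file, and the object name d is hypothetical.

```r
# Toy stand-in for ./data/Podatki2012.txt (tab-separated, with a header row)
txt <- "starost\tmesec\tspol\n59\t7\tM\n21\t1\tF"

# stringsAsFactors = TRUE converts text columns to factors on every R version
d <- read.table(text = txt, sep = "\t", header = TRUE, stringsAsFactors = TRUE)

# Inspect column types right after reading
str(d)
```

With the real file, the same argument can be added to the read.table call above.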

Show first few lines

head(data)
##   starost mesec spol masa visina roke cevelj lasje oci mati oce majica
## 1      59     7    M   91    178  189     44     T   S  155 180      L
## 2      21     1    F   60    173  176     43     T   T  162 184      S
## 3      21     7    F   55    178  178     39     T   T  170 180      S
## 4      21     8    F   70    167  165     39     S   T  160 190      S
## 5      21     4    F   65    171  168     40     T   S  169 176      M
## 6      21     3    M   88    171  173     41     T   T  165 182     XL

Change incomprehensible variable names

names(data)
##  [1] "starost" "mesec"   "spol"    "masa"    "visina"  "roke"    "cevelj" 
##  [8] "lasje"   "oci"     "mati"    "oce"     "majica"
newNames <- c(
  "age",
  "month",
  "gender",
  "weight",
  "height",
  "arms",
  "shoe",
  "hair",
  "eyes",
  "mother",
  "father",
  "shirt"
)
newNames
##  [1] "age"    "month"  "gender" "weight" "height" "arms"   "shoe"  
##  [8] "hair"   "eyes"   "mother" "father" "shirt"
names(data) <- newNames
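Assigning a whole vector of names relies on the column order staying the same. A possible alternative, sketched here on a one-row toy data frame with hypothetical object names d and map, is to rename by lookup so that each old name is paired explicitly with its new one.

```r
# Toy data frame with three of the original column names
d <- data.frame(starost = 59, mesec = 7, spol = "M")

# Named vector: old name -> new name
map <- c(spol = "gender", starost = "age", mesec = "month")

# Look up each current name; the order of entries in map does not matter
names(d) <- unname(map[names(d)])
names(d)
```

A column missing from map would come back as NA, so it is worth checking the result with names() as done above.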

Check the first few lines again

head(data)
##   age month gender weight height arms shoe hair eyes mother father shirt
## 1  59     7      M     91    178  189   44    T    S    155    180     L
## 2  21     1      F     60    173  176   43    T    T    162    184     S
## 3  21     7      F     55    178  178   39    T    T    170    180     S
## 4  21     8      F     70    167  165   39    S    T    160    190     S
## 5  21     4      F     65    171  168   40    T    S    169    176     M
## 6  21     3      M     88    171  173   41    T    T    165    182    XL

Change of scale

Convert height (which is measured in centimeters) into meters.

data$height[1:5]
## [1] 178 173 178 167 171
data$height <- data$height/100
data$height[1:5]
## [1] 1.78 1.73 1.78 1.67 1.71

Overwriting a variable like this is dangerous if you analyse interactively: every time the chunk is run, height is divided by 100 again. Try running the rescale height chunk several times. Use Run all previous chunks to rerun the analysis up to this point and reset the data.
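One way to make the rescaling safe to rerun is to write the result to a new column and leave the original untouched. This is only a sketch: the column name height_m, the helper rescale and the object d are assumptions, not part of the original analysis.

```r
# First five heights in centimeters, as printed above
d <- data.frame(height = c(178, 173, 178, 167, 171))

# Derive meters from the untouched centimeter column
rescale <- function(df) {
  df$height_m <- df$height / 100
  df
}

# Running the step twice gives the same result as running it once
d <- rescale(d)
d <- rescale(d)
d$height_m
```

The rest of this document keeps the original approach and overwrites height.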

Plot

Always plot your data! The fig.* options for knitr chunks control figure features (width, height, caption …).

hist(data$weight)
Histogram of weight


A few other plots

with(data,plot(height,weight))
abline(lm(weight ~ height, data = data))

Refine the plot with col, lwd and pch to make the points and the line more visible.
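A sketch of such a refinement, using only the first six rows printed above (with height already in meters) so that it runs on its own. The colour choices, axis labels and the object names d and fit are assumptions.

```r
# First six rows of the data, height in meters
d <- data.frame(
  weight = c(91, 60, 55, 70, 65, 88),
  height = c(1.78, 1.73, 1.78, 1.67, 1.71, 1.71),
  gender = factor(c("M", "F", "F", "F", "F", "M"))
)

fit <- lm(weight ~ height, data = d)

# pch = 16 draws filled circles; col picks one colour per gender level (F, M)
with(d, plot(height, weight,
             pch = 16,
             col = c("darkorange", "steelblue")[as.integer(gender)],
             xlab = "Height [m]", ylab = "Weight [kg]"))

# lwd = 2 thickens the regression line
abline(fit, col = "grey30", lwd = 2)
```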

Use of results in text

corelation <- cor(data$weight,data$height)

The correlation between weight and height is 0.703.
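In R Markdown a number like this is usually produced by an inline expression rather than typed by hand, so the text stays in sync with the data. A minimal sketch, assuming the first six rows printed earlier (so the value will differ from the one reported for the full data set); r_text is a hypothetical name.

```r
# First six rows only, height in meters
weight <- c(91, 60, 55, 70, 65, 88)
height <- c(1.78, 1.73, 1.78, 1.67, 1.71, 1.71)

corelation <- cor(weight, height)

# Rounded value suitable for running text
r_text <- round(corelation, 3)
r_text

# In the .Rmd source the sentence would contain the inline expression
#   `r round(corelation, 3)`
# which knitr replaces with the computed number when the document is built.
```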

Boxplots of some variables

par(mfrow=c(1,3))
boxplot(data$age,main="Age")
boxplot(data$height,main="Height [m]")
boxplot(data$weight,main="Weight [kg]")

Summary

It is good to look at the summaries before the analysis.

summary(data)
##       age            month        gender     weight          height     
##  Min.   :20.00   Min.   : 0.000   F:33   Min.   :50.00   Min.   :1.560  
##  1st Qu.:21.00   1st Qu.: 5.000   M:10   1st Qu.:55.50   1st Qu.:1.640  
##  Median :21.00   Median : 7.000          Median :61.00   Median :1.700  
##  Mean   :22.07   Mean   : 6.814          Mean   :63.42   Mean   :1.699  
##  3rd Qu.:22.00   3rd Qu.: 9.500          3rd Qu.:70.00   3rd Qu.:1.735  
##  Max.   :59.00   Max.   :11.000          Max.   :91.00   Max.   :1.890  
##                                                                         
##       arms            shoe       hair   eyes       mother     
##  Min.   :154.0   Min.   :36.00   S:19   S:24   Min.   :155.0  
##  1st Qu.:163.2   1st Qu.:38.00   T:24   T:19   1st Qu.:160.0  
##  Median :167.8   Median :39.00                 Median :165.0  
##  Mean   :169.3   Mean   :40.02                 Mean   :165.4  
##  3rd Qu.:172.5   3rd Qu.:41.50                 3rd Qu.:168.0  
##  Max.   :193.0   Max.   :48.00                 Max.   :180.0  
##  NA's   :5                                     NA's   :5      
##      father      shirt  
##  Min.   :170.0   L : 5  
##  1st Qu.:174.2   M :19  
##  Median :179.5   S :16  
##  Mean   :179.1   XL: 1  
##  3rd Qu.:182.0   XS: 2  
##  Max.   :190.0          
##  NA's   :5
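The NA's rows in the summary show that some variables have missing values. A quick way to count them per column, and to count the rows without any missing value, might look like this. The three-row data frame is made up for illustration and does not reproduce the real pattern of missing values.

```r
# Made-up toy data with two missing values
d <- data.frame(
  age    = c(21, 21, 22),
  arms   = c(176, NA, 165),
  mother = c(162, 170, NA)
)

# Number of missing values in each column
colSums(is.na(d))

# Number of rows with no missing value at all
sum(complete.cases(d))
```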

One participant (age 59) is much older than the rest of the group. Find and remove that case.

max(data$age)
## [1] 59
old <- data$age == max(data$age)
old
##  [1]  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [12] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [23] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [34] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
which(old)
## [1] 1
dim(data)
## [1] 43 12
data <- data[ !old ,  ]
dim(data)
## [1] 42 12
par(mfrow=c(1,3))
boxplot(data$age,main="Age")
boxplot(data$height,main="Height [m]")
boxplot(data$weight,main="Weight [kg]")
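Selecting on max(data$age) removes whoever happens to be oldest, including ties, and would silently drop a different row if the chunk were run again. A possible alternative is an explicit cutoff. The cutoff of 30 years and the names d, keep and d_clean are assumptions for this sketch.

```r
# Ages and weights of the first six rows printed earlier
d <- data.frame(
  age    = c(59, 21, 21, 21, 21, 21),
  weight = c(91, 60, 55, 70, 65, 88)
)

# Hypothetical cutoff; rows with a missing age are dropped as well
cutoff <- 30
keep <- !is.na(d$age) & d$age <= cutoff

d_clean <- d[keep, ]
dim(d_clean)
```

Because the condition does not depend on the current maximum, rerunning this step leaves d_clean unchanged.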

Add variables

Body mass index (BMI), weight in kilograms divided by the square of height in meters, might be of interest when analysing the relation between weight and height.

data$bmi <- data$weight / data$height^2
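As a small worked example of the formula, here are three of the rows printed earlier, with height in meters. The vectors are standalone copies, not the columns of data.

```r
# Weight in kg and height in m for three participants
weight <- c(60, 55, 70)
height <- c(1.73, 1.78, 1.67)

# BMI = weight / height^2; ^ binds tighter than /, so no parentheses are needed
bmi <- weight / height^2
round(bmi, 1)
```

If height were still in centimeters, every value would come out 10,000 times too small, which is one reason the rescaling step matters.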

Simplifying variable calling

If you are sure that the data are clean and complete, you can simplify references to your variables:

attach(data)

From now on you can refer to your variables without the data frame name:

head(data$weight)
## [1] 60 55 70 65 88 52
head(weight)
## [1] 60 55 70 65 88 52
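The condition that the data are clean matters because attach() works on a copy: later changes to the data frame are not seen through the attached names. The sketch below illustrates this on a toy data frame and shows with() as an alternative that needs no attach(). The names d, r_full, r_with, stale and fresh are hypothetical.

```r
d <- data.frame(weight = c(60, 55, 70), height = c(1.73, 1.78, 1.67))

# with() evaluates an expression inside the data frame, no attach() needed
r_full <- cor(d$weight, d$height)
r_with <- with(d, cor(weight, height))

# attach() copies the columns onto the search path ...
attach(d)
d$weight <- d$weight + 1
stale <- weight[1]    # value from the attached copy, taken before the change
fresh <- d$weight[1]  # value from the modified data frame
detach(d)

c(stale = stale, fresh = fresh)
```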

Relation between BMI and height

par(mfrow=c(1,1))
plot(height,bmi,pch=16,col=gender)
abline(lm(bmi~height),col="blue",lwd=3)

cor(bmi,height)
## [1] 0.1673265

The correlation between BMI and height is much smaller: r = 0.167.