Markus, August 2013
Example code to import a csv file. Check the available options via ?read.csv, or use the Dataset import wizard from RStudio. Modify the code below to match your situation. Column headers in the csv file should start with a letter; otherwise read.csv changes them into valid R names by default (for example, a header that starts with a digit gets an “X” prefix).
Here we import an example file from my online storage, using read.csv to import the file. The character that separates the columns in the file has to be specified. Here, it is a file that uses “,” to separate columns (this is also the default for read.csv).
df <- read.csv("./R_seminar/example_csv_file.csv",
sep = ",")
# To import a file from your own hard drive, give the full pathname:
# e.g. "C:/Data/my_files/experiment01.csv"
# After the import, always check if the data was imported correctly
summary(df)
## ID Room Pot Species
## Min. : 4 Min. :2.00 Min. : 4 E. saligna :704
## 1st Qu.:125 1st Qu.:3.00 1st Qu.:125 E. sideroxylon:623
## Median :318 Median :5.00 Median :314
## Mean :302 Mean :4.51 Mean :300
## 3rd Qu.:463 3rd Qu.:6.00 3rd Qu.:459
## Max. :589 Max. :7.00 Max. :585
##
## ID1 Room.1 CO2_treatment Temperature
## Min. :0.00 Min. :2.00 Min. :280 ambient :660
## 1st Qu.:1.00 1st Qu.:3.00 1st Qu.:280 elevated:667
## Median :3.00 Median :5.00 Median :400
## Mean :2.51 Mean :4.51 Mean :440
## 3rd Qu.:4.00 3rd Qu.:6.00 3rd Qu.:640
## Max. :5.00 Max. :7.00 Max. :640
##
## measurement_temperature water_treatment ID.1 plant_no
## Min. :28 :1093 Min. : 1 Min. : 4
## 1st Qu.:28 dry: 66 1st Qu.: 332 1st Qu.:125
## Median :28 wet: 168 Median : 664 Median :314
## Mean :31 Mean : 664 Mean :300
## 3rd Qu.:34 3rd Qu.: 996 3rd Qu.:459
## Max. :34 Max. :1327 Max. :585
##
## room_no water_treatment.1 measurement_temperature.1 Obs
## Min. :2.00 :1093 Min. :28 Mode:logical
## 1st Qu.:3.00 dry: 66 1st Qu.:28 NA's:1327
## Median :5.00 wet: 168 Median :28
## Mean :4.51 Mean :31
## 3rd Qu.:6.00 3rd Qu.:34
## Max. :7.00 Max. :34
##
## Time FTime EBal Photo
## Min. : 0 Min. : 251 Min. :0 Min. :-7.74
## 1st Qu.: 1 1st Qu.: 4272 1st Qu.:0 1st Qu.: 2.71
## Median : 1 Median : 8464 Median :0 Median : 9.33
## Mean : 2190 Mean : 9505 Mean :0 Mean : 9.65
## 3rd Qu.: 1078 3rd Qu.:14166 3rd Qu.:0 3rd Qu.:15.70
## Max. :23619 Max. :26496 Max. :0 Max. :29.60
## NA's :349 NA's :991
## Cond Ci Trmmol VpdL
## Min. :0.010 Min. :125 Min. : 0.27 Min. :0.74
## 1st Qu.:0.190 1st Qu.:254 1st Qu.: 3.10 1st Qu.:1.25
## Median :0.290 Median :325 Median : 4.15 Median :1.60
## Mean :0.325 Mean :365 Mean : 4.52 Mean :1.65
## 3rd Qu.:0.410 3rd Qu.:482 3rd Qu.: 5.62 3rd Qu.:1.96
## Max. :1.130 Max. :756 Max. :12.70 Max. :3.40
##
## Column2 Area BLC StmRat BLCond
## Min. :23.4 Min. :0.90 Min. :1.4 Min. :1 Min. : 2.84
## 1st Qu.:25.3 1st Qu.:1.80 1st Qu.:1.4 1st Qu.:1 1st Qu.: 2.84
## Median :27.1 Median :6.00 Median :1.4 Median :1 Median : 2.84
## Mean :27.2 Mean :4.15 Mean :1.6 Mean :1 Mean : 5.12
## 3rd Qu.:28.5 3rd Qu.:6.00 3rd Qu.:1.9 3rd Qu.:1 3rd Qu.: 4.68
## Max. :32.7 Max. :6.00 Max. :2.3 Max. :1 Max. :12.00
## NA's :991 NA's :991
## Tair Tleaf TBlk CO2R CO2S
## Min. :26.6 Min. :23.1 Min. :27.9 Min. :278 Min. :251
## 1st Qu.:27.6 1st Qu.:26.1 1st Qu.:28.0 1st Qu.:280 1st Qu.:277
## Median :31.5 Median :27.8 Median :33.5 Median :400 Median :394
## Mean :30.3 Mean :28.0 Mean :31.0 Mean :440 Mean :431
## 3rd Qu.:32.7 3rd Qu.:29.7 3rd Qu.:34.0 3rd Qu.:640 3rd Qu.:625
## Max. :34.2 Max. :34.9 Max. :34.1 Max. :642 Max. :642
##
## H2OR H2OS RH_R RH_S Flow
## Min. :10.6 Min. :13.9 Min. :21.9 Min. :29.6 Min. :299
## 1st Qu.:15.4 1st Qu.:18.2 1st Qu.:33.4 1st Qu.:41.3 1st Qu.:500
## Median :16.6 Median :20.5 Median :42.0 Median :50.0 Median :500
## Mean :17.7 Mean :21.3 Mean :42.0 Mean :50.4 Mean :480
## 3rd Qu.:19.5 3rd Qu.:24.0 3rd Qu.:48.1 3rd Qu.:57.4 3rd Qu.:500
## Max. :30.8 Max. :35.2 Max. :70.7 Max. :80.2 Max. :503
##
## PARi PARo Press CsMch
## Min. : -1 Min. : 0 Min. :101 Min. :-17.000
## 1st Qu.: 61 1st Qu.: 33 1st Qu.:102 1st Qu.: -3.780
## Median : 300 Median : 137 Median :102 Median : -1.180
## Mean : 507 Mean : 233 Mean :102 Mean : -0.624
## 3rd Qu.: 900 3rd Qu.: 296 3rd Qu.:102 3rd Qu.: 2.500
## Max. :1508 Max. :1809 Max. :102 Max. : 13.000
##
## HsMch CsMch1 BLCslope BLCoffst f_parin
## Min. :-0.480 Min. :0.0 Min. :-0.2 Min. :2.7 Min. :1
## 1st Qu.:-0.030 1st Qu.:0.7 1st Qu.:-0.2 1st Qu.:2.7 1st Qu.:1
## Median : 0.200 Median :1.0 Median :-0.2 Median :2.7 Median :1
## Mean : 0.131 Mean :0.8 Mean :-0.2 Mean :2.7 Mean :1
## 3rd Qu.: 0.290 3rd Qu.:1.0 3rd Qu.:-0.2 3rd Qu.:2.7 3rd Qu.:1
## Max. : 0.670 Max. :1.0 Max. :-0.2 Max. :2.7 Max. :1
## NA's :349 NA's :991 NA's :991 NA's :991
## f_parout alphaK Status
## Min. :0 Min. :0.2 : 85
## 1st Qu.:0 1st Qu.:0.2 111115:1230
## Median :0 Median :0.2 Status: 12
## Mean :0 Mean :0.2
## 3rd Qu.:0 3rd Qu.:0.2
## Max. :0 Max. :0.2
## NA's :991 NA's :991
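Besides summary(), the structure and the first rows of the imported object can be inspected. The following is a minimal sketch that does not use the seminar file: it writes a small, made-up csv file to a temporary location (the file content and the object names `tmp`, `dat` and `dat_raw` are assumptions for illustration) and shows how read.csv treats a header that starts with a digit.

```r
# Hypothetical example data, written to a temporary file
tmp <- tempfile(fileext = ".csv")
writeLines(c("1st_run,plant,height",
             "1,A,10.5",
             "2,B,12.0"), tmp)

dat <- read.csv(tmp, sep = ",")
names(dat)   # the header "1st_run" has been changed into a valid R name
str(dat)     # compact overview: column types and first values
head(dat)    # first rows of the data frame

# check.names = FALSE keeps the headers exactly as they are in the file
dat_raw <- read.csv(tmp, sep = ",", check.names = FALSE)
names(dat_raw)
```

Headers that are not valid R names can still be used, but they have to be quoted with backticks in many situations, so letting read.csv adjust them is often more convenient.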
# In case your file uses "tab" to separate columns:
df_tab <- read.csv("./R_seminar/example_csv_file_separated_by_tab.csv",
sep = "\t")
summary(df_tab)
## ID Room Pot Species
## Min. : 4 Min. :2.00 Min. : 4 E. saligna :704
## 1st Qu.:125 1st Qu.:3.00 1st Qu.:125 E. sideroxylon:623
## Median :318 Median :5.00 Median :314
## Mean :302 Mean :4.51 Mean :300
## 3rd Qu.:463 3rd Qu.:6.00 3rd Qu.:459
## Max. :589 Max. :7.00 Max. :585
##
## ID1 Room.1 CO2_treatment Temperature
## Min. :0.00 Min. :2.00 Min. :280 ambient :660
## 1st Qu.:1.00 1st Qu.:3.00 1st Qu.:280 elevated:667
## Median :3.00 Median :5.00 Median :400
## Mean :2.51 Mean :4.51 Mean :440
## 3rd Qu.:4.00 3rd Qu.:6.00 3rd Qu.:640
## Max. :5.00 Max. :7.00 Max. :640
##
## measurement_temperature water_treatment ID.1 plant_no
## Min. :28 :1093 Min. : 1 Min. : 4
## 1st Qu.:28 dry: 66 1st Qu.: 332 1st Qu.:125
## Median :28 wet: 168 Median : 664 Median :314
## Mean :31 Mean : 664 Mean :300
## 3rd Qu.:34 3rd Qu.: 996 3rd Qu.:459
## Max. :34 Max. :1327 Max. :585
##
## room_no water_treatment.1 measurement_temperature.1 Obs
## Min. :2.00 :1093 Min. :28 Mode:logical
## 1st Qu.:3.00 dry: 66 1st Qu.:28 NA's:1327
## Median :5.00 wet: 168 Median :28
## Mean :4.51 Mean :31
## 3rd Qu.:6.00 3rd Qu.:34
## Max. :7.00 Max. :34
##
## Time FTime EBal Photo
## Min. : 0 Min. : 251 Min. :0 Min. :-7.74
## 1st Qu.: 1 1st Qu.: 4272 1st Qu.:0 1st Qu.: 2.71
## Median : 1 Median : 8464 Median :0 Median : 9.33
## Mean : 2190 Mean : 9505 Mean :0 Mean : 9.65
## 3rd Qu.: 1078 3rd Qu.:14166 3rd Qu.:0 3rd Qu.:15.70
## Max. :23619 Max. :26496 Max. :0 Max. :29.60
## NA's :349 NA's :991
## Cond Ci Trmmol VpdL
## Min. :0.010 Min. :125 Min. : 0.27 Min. :0.74
## 1st Qu.:0.190 1st Qu.:254 1st Qu.: 3.10 1st Qu.:1.25
## Median :0.290 Median :325 Median : 4.15 Median :1.60
## Mean :0.325 Mean :365 Mean : 4.52 Mean :1.65
## 3rd Qu.:0.410 3rd Qu.:482 3rd Qu.: 5.62 3rd Qu.:1.96
## Max. :1.130 Max. :756 Max. :12.70 Max. :3.40
##
## Column2 Area BLC StmRat BLCond
## Min. :23.4 Min. :0.90 Min. :1.4 Min. :1 Min. : 2.84
## 1st Qu.:25.3 1st Qu.:1.80 1st Qu.:1.4 1st Qu.:1 1st Qu.: 2.84
## Median :27.1 Median :6.00 Median :1.4 Median :1 Median : 2.84
## Mean :27.2 Mean :4.15 Mean :1.6 Mean :1 Mean : 5.12
## 3rd Qu.:28.5 3rd Qu.:6.00 3rd Qu.:1.9 3rd Qu.:1 3rd Qu.: 4.68
## Max. :32.7 Max. :6.00 Max. :2.3 Max. :1 Max. :12.00
## NA's :991 NA's :991
## Tair Tleaf TBlk CO2R CO2S
## Min. :26.6 Min. :23.1 Min. :27.9 Min. :278 Min. :251
## 1st Qu.:27.6 1st Qu.:26.1 1st Qu.:28.0 1st Qu.:280 1st Qu.:277
## Median :31.5 Median :27.8 Median :33.5 Median :400 Median :394
## Mean :30.3 Mean :28.0 Mean :31.0 Mean :440 Mean :431
## 3rd Qu.:32.7 3rd Qu.:29.7 3rd Qu.:34.0 3rd Qu.:640 3rd Qu.:625
## Max. :34.2 Max. :34.9 Max. :34.1 Max. :642 Max. :642
##
## H2OR H2OS RH_R RH_S Flow
## Min. :10.6 Min. :13.9 Min. :21.9 Min. :29.6 Min. :299
## 1st Qu.:15.4 1st Qu.:18.2 1st Qu.:33.4 1st Qu.:41.3 1st Qu.:500
## Median :16.6 Median :20.5 Median :42.0 Median :50.0 Median :500
## Mean :17.7 Mean :21.3 Mean :42.0 Mean :50.4 Mean :480
## 3rd Qu.:19.5 3rd Qu.:24.0 3rd Qu.:48.1 3rd Qu.:57.4 3rd Qu.:500
## Max. :30.8 Max. :35.2 Max. :70.7 Max. :80.2 Max. :503
##
## PARi PARo Press CsMch
## Min. : -1 Min. : 0 Min. :101 Min. :-17.000
## 1st Qu.: 61 1st Qu.: 33 1st Qu.:102 1st Qu.: -3.780
## Median : 300 Median : 137 Median :102 Median : -1.180
## Mean : 507 Mean : 233 Mean :102 Mean : -0.624
## 3rd Qu.: 900 3rd Qu.: 296 3rd Qu.:102 3rd Qu.: 2.500
## Max. :1508 Max. :1809 Max. :102 Max. : 13.000
##
## HsMch CsMch1 BLCslope BLCoffst f_parin
## Min. :-0.480 Min. :0.0 Min. :-0.2 Min. :2.7 Min. :1
## 1st Qu.:-0.030 1st Qu.:0.7 1st Qu.:-0.2 1st Qu.:2.7 1st Qu.:1
## Median : 0.200 Median :1.0 Median :-0.2 Median :2.7 Median :1
## Mean : 0.131 Mean :0.8 Mean :-0.2 Mean :2.7 Mean :1
## 3rd Qu.: 0.290 3rd Qu.:1.0 3rd Qu.:-0.2 3rd Qu.:2.7 3rd Qu.:1
## Max. : 0.670 Max. :1.0 Max. :-0.2 Max. :2.7 Max. :1
## NA's :349 NA's :991 NA's :991 NA's :991
## f_parout alphaK Status
## Min. :0 Min. :0.2 : 85
## 1st Qu.:0 1st Qu.:0.2 111115:1230
## Median :0 Median :0.2 Status: 12
## Mean :0 Mean :0.2
## 3rd Qu.:0 3rd Qu.:0.2
## Max. :0 Max. :0.2
## NA's :991 NA's :991
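For tab-separated files, base R also provides read.delim, which uses the same defaults as read.csv except that sep = "\t" is already set. A small sketch with a made-up temporary file (file content and object names are assumptions for illustration):

```r
# Hypothetical tab-separated example file in a temporary location
tmp_tab <- tempfile(fileext = ".txt")
writeLines(c("plant\theight",
             "A\t10.5",
             "B\t12.0"), tmp_tab)

via_csv   <- read.csv(tmp_tab, sep = "\t")
via_delim <- read.delim(tmp_tab)   # sep = "\t" is the default here

# both calls should result in the same data frame
all.equal(via_csv, via_delim)
```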
If the file does not have a header in the first line, use the option header = FALSE. R will then create a name for each column automatically (V1, V2, and so on).
df_tab_no_header <- read.csv("./R_seminar/example_csv_file_tab_no_header.csv",
sep = "\t",
header = FALSE)
summary(df_tab_no_header)
## V1 V2 V3 V4
## Min. : 4 Min. :2.00 Min. : 4 E. saligna :704
## 1st Qu.:125 1st Qu.:3.00 1st Qu.:125 E. sideroxylon:623
## Median :318 Median :5.00 Median :314
## Mean :302 Mean :4.51 Mean :300
## 3rd Qu.:463 3rd Qu.:6.00 3rd Qu.:459
## Max. :589 Max. :7.00 Max. :585
##
## V5 V6 V7 V8 V9
## Min. :0.00 Min. :2.00 Min. :280 ambient :660 Min. :28
## 1st Qu.:1.00 1st Qu.:3.00 1st Qu.:280 elevated:667 1st Qu.:28
## Median :3.00 Median :5.00 Median :400 Median :28
## Mean :2.51 Mean :4.51 Mean :440 Mean :31
## 3rd Qu.:4.00 3rd Qu.:6.00 3rd Qu.:640 3rd Qu.:34
## Max. :5.00 Max. :7.00 Max. :640 Max. :34
##
## V10 V11 V12 V13 V14
## :1093 Min. : 1 Min. : 4 Min. :2.00 :1093
## dry: 66 1st Qu.: 332 1st Qu.:125 1st Qu.:3.00 dry: 66
## wet: 168 Median : 664 Median :314 Median :5.00 wet: 168
## Mean : 664 Mean :300 Mean :4.51
## 3rd Qu.: 996 3rd Qu.:459 3rd Qu.:6.00
## Max. :1327 Max. :585 Max. :7.00
##
## V15 V16 V17 V18 V19
## Min. :28 Mode:logical Min. : 0 Min. : 251 Min. :0
## 1st Qu.:28 NA's:1327 1st Qu.: 1 1st Qu.: 4272 1st Qu.:0
## Median :28 Median : 1 Median : 8464 Median :0
## Mean :31 Mean : 2190 Mean : 9505 Mean :0
## 3rd Qu.:34 3rd Qu.: 1078 3rd Qu.:14166 3rd Qu.:0
## Max. :34 Max. :23619 Max. :26496 Max. :0
## NA's :349 NA's :991
## V20 V21 V22 V23
## Min. :-7.74 Min. :0.010 Min. :125 Min. : 0.27
## 1st Qu.: 2.71 1st Qu.:0.190 1st Qu.:254 1st Qu.: 3.10
## Median : 9.33 Median :0.290 Median :325 Median : 4.15
## Mean : 9.65 Mean :0.325 Mean :365 Mean : 4.52
## 3rd Qu.:15.70 3rd Qu.:0.410 3rd Qu.:482 3rd Qu.: 5.62
## Max. :29.60 Max. :1.130 Max. :756 Max. :12.70
##
## V24 V25 V26 V27 V28
## Min. :0.74 Min. :23.4 Min. :0.90 Min. :1.4 Min. :1
## 1st Qu.:1.25 1st Qu.:25.3 1st Qu.:1.80 1st Qu.:1.4 1st Qu.:1
## Median :1.60 Median :27.1 Median :6.00 Median :1.4 Median :1
## Mean :1.65 Mean :27.2 Mean :4.15 Mean :1.6 Mean :1
## 3rd Qu.:1.96 3rd Qu.:28.5 3rd Qu.:6.00 3rd Qu.:1.9 3rd Qu.:1
## Max. :3.40 Max. :32.7 Max. :6.00 Max. :2.3 Max. :1
## NA's :991 NA's :991
## V29 V30 V31 V32
## Min. : 2.84 Min. :26.6 Min. :23.1 Min. :27.9
## 1st Qu.: 2.84 1st Qu.:27.6 1st Qu.:26.1 1st Qu.:28.0
## Median : 2.84 Median :31.5 Median :27.8 Median :33.5
## Mean : 5.12 Mean :30.3 Mean :28.0 Mean :31.0
## 3rd Qu.: 4.68 3rd Qu.:32.7 3rd Qu.:29.7 3rd Qu.:34.0
## Max. :12.00 Max. :34.2 Max. :34.9 Max. :34.1
##
## V33 V34 V35 V36 V37
## Min. :278 Min. :251 Min. :10.6 Min. :13.9 Min. :21.9
## 1st Qu.:280 1st Qu.:277 1st Qu.:15.4 1st Qu.:18.2 1st Qu.:33.4
## Median :400 Median :394 Median :16.6 Median :20.5 Median :42.0
## Mean :440 Mean :431 Mean :17.7 Mean :21.3 Mean :42.0
## 3rd Qu.:640 3rd Qu.:625 3rd Qu.:19.5 3rd Qu.:24.0 3rd Qu.:48.1
## Max. :642 Max. :642 Max. :30.8 Max. :35.2 Max. :70.7
##
## V38 V39 V40 V41 V42
## Min. :29.6 Min. :299 Min. : -1 Min. : 0 Min. :101
## 1st Qu.:41.3 1st Qu.:500 1st Qu.: 61 1st Qu.: 33 1st Qu.:102
## Median :50.0 Median :500 Median : 300 Median : 137 Median :102
## Mean :50.4 Mean :480 Mean : 507 Mean : 233 Mean :102
## 3rd Qu.:57.4 3rd Qu.:500 3rd Qu.: 900 3rd Qu.: 296 3rd Qu.:102
## Max. :80.2 Max. :503 Max. :1508 Max. :1809 Max. :102
##
## V43 V44 V45 V46
## Min. :-17.000 Min. :-0.480 Min. :0.0 Min. :-0.2
## 1st Qu.: -3.780 1st Qu.:-0.030 1st Qu.:0.7 1st Qu.:-0.2
## Median : -1.180 Median : 0.200 Median :1.0 Median :-0.2
## Mean : -0.624 Mean : 0.131 Mean :0.8 Mean :-0.2
## 3rd Qu.: 2.500 3rd Qu.: 0.290 3rd Qu.:1.0 3rd Qu.:-0.2
## Max. : 13.000 Max. : 0.670 Max. :1.0 Max. :-0.2
## NA's :349 NA's :991
## V47 V48 V49 V50 V51
## Min. :2.7 Min. :1 Min. :0 Min. :0.2 : 85
## 1st Qu.:2.7 1st Qu.:1 1st Qu.:0 1st Qu.:0.2 111115:1230
## Median :2.7 Median :1 Median :0 Median :0.2 Status: 12
## Mean :2.7 Mean :1 Mean :0 Mean :0.2
## 3rd Qu.:2.7 3rd Qu.:1 3rd Qu.:0 3rd Qu.:0.2
## Max. :2.7 Max. :1 Max. :0 Max. :0.2
## NA's :991 NA's :991 NA's :991 NA's :991
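The automatic names V1, V2, ... are not very descriptive. As a sketch, names can be supplied during the import via the col.names argument of read.csv, or assigned afterwards with names(). The temporary file and the column names "plant" and "height" below are made up for illustration.

```r
# Hypothetical tab-separated file without a header line
tmp_nohead <- tempfile(fileext = ".txt")
writeLines(c("A\t10.5",
             "B\t12.0"), tmp_nohead)

# default: automatic names V1, V2
auto <- read.csv(tmp_nohead, sep = "\t", header = FALSE)
names(auto)

# supply the names during the import
named <- read.csv(tmp_nohead, sep = "\t", header = FALSE,
                  col.names = c("plant", "height"))
names(named)

# or rename the columns after the import
names(auto) <- c("plant", "height")
```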
To see the available options for importing data in csv format, see the help file:
?read.csv
For the exercise, we use the built-in “iris” data set that we used last time. Load the “iris” example data set:
data(iris)
To answer a question that came up during the previous session (in 2012): how to select columns of a data frame by name.
iris[, "Sepal.Width"]
## [1] 3.5 3.0 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 3.7 3.4 3.0 3.0 4.0 4.4 3.9
## [18] 3.5 3.8 3.8 3.4 3.7 3.6 3.3 3.4 3.0 3.4 3.5 3.4 3.2 3.1 3.4 4.1 4.2
## [35] 3.1 3.2 3.5 3.6 3.0 3.4 3.5 2.3 3.2 3.5 3.8 3.0 3.8 3.2 3.7 3.3 3.2
## [52] 3.2 3.1 2.3 2.8 2.8 3.3 2.4 2.9 2.7 2.0 3.0 2.2 2.9 2.9 3.1 3.0 2.7
## [69] 2.2 2.5 3.2 2.8 2.5 2.8 2.9 3.0 2.8 3.0 2.9 2.6 2.4 2.4 2.7 2.7 3.0
## [86] 3.4 3.1 2.3 3.0 2.5 2.6 3.0 2.6 2.3 2.7 3.0 2.9 2.9 2.5 2.8 3.3 2.7
## [103] 3.0 2.9 3.0 3.0 2.5 2.9 2.5 3.6 3.2 2.7 3.0 2.5 2.8 3.2 3.0 3.8 2.6
## [120] 2.2 3.2 2.8 2.8 2.7 3.3 3.2 2.8 3.0 2.8 3.0 2.8 3.8 2.8 2.8 2.6 3.0
## [137] 3.4 3.1 3.0 3.1 3.1 3.1 2.7 3.2 3.3 3.0 2.5 3.0 3.4 3.0
iris[, c("Sepal.Width", "Species")]
## Sepal.Width Species
## 1 3.5 setosa
## 2 3.0 setosa
## 3 3.2 setosa
## 4 3.1 setosa
## 5 3.6 setosa
## 6 3.9 setosa
## 7 3.4 setosa
## 8 3.4 setosa
## 9 2.9 setosa
## 10 3.1 setosa
## 11 3.7 setosa
## 12 3.4 setosa
## 13 3.0 setosa
## 14 3.0 setosa
## 15 4.0 setosa
## 16 4.4 setosa
## 17 3.9 setosa
## 18 3.5 setosa
## 19 3.8 setosa
## 20 3.8 setosa
## 21 3.4 setosa
## 22 3.7 setosa
## 23 3.6 setosa
## 24 3.3 setosa
## 25 3.4 setosa
## 26 3.0 setosa
## 27 3.4 setosa
## 28 3.5 setosa
## 29 3.4 setosa
## 30 3.2 setosa
## 31 3.1 setosa
## 32 3.4 setosa
## 33 4.1 setosa
## 34 4.2 setosa
## 35 3.1 setosa
## 36 3.2 setosa
## 37 3.5 setosa
## 38 3.6 setosa
## 39 3.0 setosa
## 40 3.4 setosa
## 41 3.5 setosa
## 42 2.3 setosa
## 43 3.2 setosa
## 44 3.5 setosa
## 45 3.8 setosa
## 46 3.0 setosa
## 47 3.8 setosa
## 48 3.2 setosa
## 49 3.7 setosa
## 50 3.3 setosa
## 51 3.2 versicolor
## 52 3.2 versicolor
## 53 3.1 versicolor
## 54 2.3 versicolor
## 55 2.8 versicolor
## 56 2.8 versicolor
## 57 3.3 versicolor
## 58 2.4 versicolor
## 59 2.9 versicolor
## 60 2.7 versicolor
## 61 2.0 versicolor
## 62 3.0 versicolor
## 63 2.2 versicolor
## 64 2.9 versicolor
## 65 2.9 versicolor
## 66 3.1 versicolor
## 67 3.0 versicolor
## 68 2.7 versicolor
## 69 2.2 versicolor
## 70 2.5 versicolor
## 71 3.2 versicolor
## 72 2.8 versicolor
## 73 2.5 versicolor
## 74 2.8 versicolor
## 75 2.9 versicolor
## 76 3.0 versicolor
## 77 2.8 versicolor
## 78 3.0 versicolor
## 79 2.9 versicolor
## 80 2.6 versicolor
## 81 2.4 versicolor
## 82 2.4 versicolor
## 83 2.7 versicolor
## 84 2.7 versicolor
## 85 3.0 versicolor
## 86 3.4 versicolor
## 87 3.1 versicolor
## 88 2.3 versicolor
## 89 3.0 versicolor
## 90 2.5 versicolor
## 91 2.6 versicolor
## 92 3.0 versicolor
## 93 2.6 versicolor
## 94 2.3 versicolor
## 95 2.7 versicolor
## 96 3.0 versicolor
## 97 2.9 versicolor
## 98 2.9 versicolor
## 99 2.5 versicolor
## 100 2.8 versicolor
## 101 3.3 virginica
## 102 2.7 virginica
## 103 3.0 virginica
## 104 2.9 virginica
## 105 3.0 virginica
## 106 3.0 virginica
## 107 2.5 virginica
## 108 2.9 virginica
## 109 2.5 virginica
## 110 3.6 virginica
## 111 3.2 virginica
## 112 2.7 virginica
## 113 3.0 virginica
## 114 2.5 virginica
## 115 2.8 virginica
## 116 3.2 virginica
## 117 3.0 virginica
## 118 3.8 virginica
## 119 2.6 virginica
## 120 2.2 virginica
## 121 3.2 virginica
## 122 2.8 virginica
## 123 2.8 virginica
## 124 2.7 virginica
## 125 3.3 virginica
## 126 3.2 virginica
## 127 2.8 virginica
## 128 3.0 virginica
## 129 2.8 virginica
## 130 3.0 virginica
## 131 2.8 virginica
## 132 3.8 virginica
## 133 2.8 virginica
## 134 2.8 virginica
## 135 2.6 virginica
## 136 3.0 virginica
## 137 3.4 virginica
## 138 3.1 virginica
## 139 3.0 virginica
## 140 3.1 virginica
## 141 3.1 virginica
## 142 3.1 virginica
## 143 2.7 virginica
## 144 3.2 virginica
## 145 3.3 virginica
## 146 3.0 virginica
## 147 2.5 virginica
## 148 3.0 virginica
## 149 3.4 virginica
## 150 3.0 virginica
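Selecting a single column with the bracket notation returns a plain vector, while selecting several columns returns a data frame. The sketch below shows two other common ways to select one column by name (`$` and `[[ ]]`), and the drop = FALSE option that keeps a single column as a data frame. The object names `v1`, `v2`, `v3` and `d1` are only used for this illustration.

```r
data(iris)

v1 <- iris[, "Sepal.Width"]    # bracket notation, returns a vector
v2 <- iris$Sepal.Width         # dollar notation, returns the same vector
v3 <- iris[["Sepal.Width"]]    # double brackets, also the same vector

identical(v1, v2)
identical(v1, v3)

# keep the result as a data frame with one column
d1 <- iris[, "Sepal.Width", drop = FALSE]
class(d1)
dim(d1)
```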
Some more complex subsetting of a data frame: select every third row.
# the following gets the job done, but there is lots of code within the brackets
# (the sequence starts at 3: there is no row 0, and an index of 0 is silently dropped)
iris[seq(from = 3, to = nrow(iris), by = 3), ]
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 3 4.7 3.2 1.3 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
## 9 4.4 2.9 1.4 0.2 setosa
## 12 4.8 3.4 1.6 0.2 setosa
## 15 5.8 4.0 1.2 0.2 setosa
## 18 5.1 3.5 1.4 0.3 setosa
## 21 5.4 3.4 1.7 0.2 setosa
## 24 5.1 3.3 1.7 0.5 setosa
## 27 5.0 3.4 1.6 0.4 setosa
## 30 4.7 3.2 1.6 0.2 setosa
## 33 5.2 4.1 1.5 0.1 setosa
## 36 5.0 3.2 1.2 0.2 setosa
## 39 4.4 3.0 1.3 0.2 setosa
## 42 4.5 2.3 1.3 0.3 setosa
## 45 5.1 3.8 1.9 0.4 setosa
## 48 4.6 3.2 1.4 0.2 setosa
## 51 7.0 3.2 4.7 1.4 versicolor
## 54 5.5 2.3 4.0 1.3 versicolor
## 57 6.3 3.3 4.7 1.6 versicolor
## 60 5.2 2.7 3.9 1.4 versicolor
## 63 6.0 2.2 4.0 1.0 versicolor
## 66 6.7 3.1 4.4 1.4 versicolor
## 69 6.2 2.2 4.5 1.5 versicolor
## 72 6.1 2.8 4.0 1.3 versicolor
## 75 6.4 2.9 4.3 1.3 versicolor
## 78 6.7 3.0 5.0 1.7 versicolor
## 81 5.5 2.4 3.8 1.1 versicolor
## 84 6.0 2.7 5.1 1.6 versicolor
## 87 6.7 3.1 4.7 1.5 versicolor
## 90 5.5 2.5 4.0 1.3 versicolor
## 93 5.8 2.6 4.0 1.2 versicolor
## 96 5.7 3.0 4.2 1.2 versicolor
## 99 5.1 2.5 3.0 1.1 versicolor
## 102 5.8 2.7 5.1 1.9 virginica
## 105 6.5 3.0 5.8 2.2 virginica
## 108 7.3 2.9 6.3 1.8 virginica
## 111 6.5 3.2 5.1 2.0 virginica
## 114 5.7 2.5 5.0 2.0 virginica
## 117 6.5 3.0 5.5 1.8 virginica
## 120 6.0 2.2 5.0 1.5 virginica
## 123 7.7 2.8 6.7 2.0 virginica
## 126 7.2 3.2 6.0 1.8 virginica
## 129 6.4 2.8 5.6 2.1 virginica
## 132 7.9 3.8 6.4 2.0 virginica
## 135 6.1 2.6 5.6 1.4 virginica
## 138 6.4 3.1 5.5 1.8 virginica
## 141 6.7 3.1 5.6 2.4 virginica
## 144 6.8 3.2 5.9 2.3 virginica
## 147 6.3 2.5 5.0 1.9 virginica
## 150 5.9 3.0 5.1 1.8 virginica
# create a vector with the row numbers that you want (we modularise the code)
my.index <- seq(from = 3, to = nrow(iris), by = 3)
# use "my.index" in the bracket statement
iris[my.index, ]
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 3 4.7 3.2 1.3 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
## 9 4.4 2.9 1.4 0.2 setosa
## 12 4.8 3.4 1.6 0.2 setosa
## 15 5.8 4.0 1.2 0.2 setosa
## 18 5.1 3.5 1.4 0.3 setosa
## 21 5.4 3.4 1.7 0.2 setosa
## 24 5.1 3.3 1.7 0.5 setosa
## 27 5.0 3.4 1.6 0.4 setosa
## 30 4.7 3.2 1.6 0.2 setosa
## 33 5.2 4.1 1.5 0.1 setosa
## 36 5.0 3.2 1.2 0.2 setosa
## 39 4.4 3.0 1.3 0.2 setosa
## 42 4.5 2.3 1.3 0.3 setosa
## 45 5.1 3.8 1.9 0.4 setosa
## 48 4.6 3.2 1.4 0.2 setosa
## 51 7.0 3.2 4.7 1.4 versicolor
## 54 5.5 2.3 4.0 1.3 versicolor
## 57 6.3 3.3 4.7 1.6 versicolor
## 60 5.2 2.7 3.9 1.4 versicolor
## 63 6.0 2.2 4.0 1.0 versicolor
## 66 6.7 3.1 4.4 1.4 versicolor
## 69 6.2 2.2 4.5 1.5 versicolor
## 72 6.1 2.8 4.0 1.3 versicolor
## 75 6.4 2.9 4.3 1.3 versicolor
## 78 6.7 3.0 5.0 1.7 versicolor
## 81 5.5 2.4 3.8 1.1 versicolor
## 84 6.0 2.7 5.1 1.6 versicolor
## 87 6.7 3.1 4.7 1.5 versicolor
## 90 5.5 2.5 4.0 1.3 versicolor
## 93 5.8 2.6 4.0 1.2 versicolor
## 96 5.7 3.0 4.2 1.2 versicolor
## 99 5.1 2.5 3.0 1.1 versicolor
## 102 5.8 2.7 5.1 1.9 virginica
## 105 6.5 3.0 5.8 2.2 virginica
## 108 7.3 2.9 6.3 1.8 virginica
## 111 6.5 3.2 5.1 2.0 virginica
## 114 5.7 2.5 5.0 2.0 virginica
## 117 6.5 3.0 5.5 1.8 virginica
## 120 6.0 2.2 5.0 1.5 virginica
## 123 7.7 2.8 6.7 2.0 virginica
## 126 7.2 3.2 6.0 1.8 virginica
## 129 6.4 2.8 5.6 2.1 virginica
## 132 7.9 3.8 6.4 2.0 virginica
## 135 6.1 2.6 5.6 1.4 virginica
## 138 6.4 3.1 5.5 1.8 virginica
## 141 6.7 3.1 5.6 2.4 virginica
## 144 6.8 3.2 5.9 2.3 virginica
## 147 6.3 2.5 5.0 1.9 virginica
## 150 5.9 3.0 5.1 1.8 virginica
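Rows can also be selected with a logical vector instead of row numbers. As a sketch of the same "every third row" selection, the modulo operator %% can be used to test whether a row number is divisible by 3. The object names `row.no`, `every.third` and `iris.third` are made up for this example.

```r
data(iris)

row.no <- seq_len(nrow(iris))       # 1, 2, 3, ..., number of rows
every.third <- row.no %% 3 == 0     # TRUE for rows 3, 6, 9, ...

iris.third <- iris[every.third, ]
nrow(iris.third)
head(rownames(iris.third))
```

A logical index is handy when the selection depends on a condition rather than on known positions.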
Take a random sample of 25 rows from iris
iris[sample(1:nrow(iris), 25), ]
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 119 7.7 2.6 6.9 2.3 virginica
## 112 6.4 2.7 5.3 1.9 virginica
## 8 5.0 3.4 1.5 0.2 setosa
## 150 5.9 3.0 5.1 1.8 virginica
## 142 6.9 3.1 5.1 2.3 virginica
## 44 5.0 3.5 1.6 0.6 setosa
## 16 5.7 4.4 1.5 0.4 setosa
## 115 5.8 2.8 5.1 2.4 virginica
## 140 6.9 3.1 5.4 2.1 virginica
## 40 5.1 3.4 1.5 0.2 setosa
## 135 6.1 2.6 5.6 1.4 virginica
## 64 6.1 2.9 4.7 1.4 versicolor
## 80 5.7 2.6 3.5 1.0 versicolor
## 110 7.2 3.6 6.1 2.5 virginica
## 70 5.6 2.5 3.9 1.1 versicolor
## 104 6.3 2.9 5.6 1.8 virginica
## 15 5.8 4.0 1.2 0.2 setosa
## 12 4.8 3.4 1.6 0.2 setosa
## 132 7.9 3.8 6.4 2.0 virginica
## 31 4.8 3.1 1.6 0.2 setosa
## 101 6.3 3.3 6.0 2.5 virginica
## 147 6.3 2.5 5.0 1.9 virginica
## 21 5.4 3.4 1.7 0.2 setosa
## 59 6.6 2.9 4.6 1.3 versicolor
## 92 6.1 3.0 4.6 1.4 versicolor
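A random sample differs every time the code is run. If the same sample is needed again, for example to reproduce a result, the random number generator can be initialised with set.seed() before sampling. A minimal sketch; the seed value 2013 is an arbitrary choice for this example.

```r
data(iris)

set.seed(2013)                     # arbitrary seed, any integer works
rows.a <- sample(nrow(iris), 25)   # 25 random row numbers

set.seed(2013)                     # same seed again
rows.b <- sample(nrow(iris), 25)

identical(rows.a, rows.b)          # the two samples are the same

iris.sample <- iris[rows.a, ]
nrow(iris.sample)
```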
Apply a function. The “apply” family of functions is large and very complex.
As iris is a data frame (which itself is just a list of vectors), lapply can be used to apply a function to each column, i.e. to each element of the list.
out <- lapply(iris[, 3:4], mean) # arithmetic mean
out
## $Petal.Length
## [1] 3.758
##
## $Petal.Width
## [1] 1.199
out.sd <- lapply(iris[, 3:4], sd) # standard deviation
out.sd
## $Petal.Length
## [1] 1.765
##
## $Petal.Width
## [1] 0.7622
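To illustrate what lapply does, the sketch below first confirms that a data frame is a list and then calculates the same means with an explicit for loop. The loop is only meant as an illustration of the principle; the object names `out.loop` and `out.lapply` are made up.

```r
data(iris)

is.list(iris)   # a data frame is a list of columns

# the same calculation written as a loop over column names
out.loop <- list()
for (col in names(iris)[3:4]) {
  out.loop[[col]] <- mean(iris[[col]])
}

out.lapply <- lapply(iris[, 3:4], mean)

# both approaches should give the same named list
all.equal(out.loop, out.lapply)
```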
Missing values can cause many headaches, but there are functions available to identify and handle them.
Copy the iris data to a new object
my.iris <- iris
# introduce a missing value ("NA") in the iris data
# missing values are represented by "NA".
# NA is not text; it is defined as an internal "logical" constant.
# see ?NA
my.iris[1, 3] <- NA
out <- lapply(my.iris[, 3:4], mean)
# as there is one "NA" value, the mean is "NA" as well:
out
## $Petal.Length
## [1] NA
##
## $Petal.Width
## [1] 1.199
Calculations with missing values: to calculate the mean for columns with missing values, the mean function needs to be told explicitly to remove NA from the sample.
When options are passed on to a function, the statement becomes more complex:
out <- lapply(my.iris[, 3:4], function(x) mean(x, na.rm = TRUE))
out
## $Petal.Length
## [1] 3.774
##
## $Petal.Width
## [1] 1.199
# there is a shorthand for this fortunately:
# in many cases, options for the functions can be passed through
out <- lapply(my.iris[, 3:4], mean, na.rm = TRUE)
out
## $Petal.Length
## [1] 3.774
##
## $Petal.Width
## [1] 1.199
# But the long form with the additional function(x) is useful on many occasions!
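One occasion where the long form with function(x) is needed: returning several statistics per column in one go. The following is a sketch under the assumption that the mean, the standard deviation and the number of non-missing values are of interest; the object name `out.stats` is made up.

```r
my.iris <- iris
my.iris[1, 3] <- NA   # same missing value as in the example above

out.stats <- lapply(my.iris[, 3:4],
                    function(x) c(mean = mean(x, na.rm = TRUE),
                                  sd   = sd(x, na.rm = TRUE),
                                  n    = sum(!is.na(x))))
out.stats
```

Each list element is now a named numeric vector with three values instead of a single number.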
Checking for missing values with “is.na”
Calculate the number of “NA” values in columns
# results in either TRUE or FALSE
is.na(iris[1,1])
## [1] FALSE
is.na(my.iris[1,3])
## [1] TRUE
sapply returns the result of an “lapply” call in a simplified format. sapply can be used in place of lapply; the only difference is the output format.
Compare the two statements below. See ?lapply
lapply(my.iris[, 1:5], function(x) sum(is.na(x))) # returns a list
## $Sepal.Length
## [1] 0
##
## $Sepal.Width
## [1] 0
##
## $Petal.Length
## [1] 1
##
## $Petal.Width
## [1] 0
##
## $Species
## [1] 0
sapply(my.iris[, 1:5], function(x) sum(is.na(x))) # returns named integers (vector)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 0 0 1 0 0
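Two alternatives to the sapply call are sketched below. vapply works like sapply, but the expected type of each result has to be declared, which guards against surprises in the output format. colSums combined with is.na counts the TRUE values per column directly. The object names are made up for this example.

```r
my.iris <- iris
my.iris[1, 3] <- NA   # same missing value as in the example above

na.sapply <- sapply(my.iris, function(x) sum(is.na(x)))

# vapply: each result must be a single integer
na.vapply <- vapply(my.iris, function(x) sum(is.na(x)),
                    FUN.VALUE = integer(1))

# colSums on the logical matrix returned by is.na
na.colsums <- colSums(is.na(my.iris))

identical(na.sapply, na.vapply)
na.colsums
```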
# get rid of rows that have any "NA"
my.iris.com <- na.omit(my.iris)
# check the number of rows
nrow(my.iris.com)
## [1] 149
How many rows did the original iris data have in comparison? Find a way to check!
Create a logical matrix that indicates “NA” for each individual element
is.na(my.iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 FALSE FALSE TRUE FALSE FALSE
## 2 FALSE FALSE FALSE FALSE FALSE
## 3 FALSE FALSE FALSE FALSE FALSE
## 4 FALSE FALSE FALSE FALSE FALSE
## 5 FALSE FALSE FALSE FALSE FALSE
## 6 FALSE FALSE FALSE FALSE FALSE
## 7 FALSE FALSE FALSE FALSE FALSE
## 8 FALSE FALSE FALSE FALSE FALSE
## 9 FALSE FALSE FALSE FALSE FALSE
## 10 FALSE FALSE FALSE FALSE FALSE
## 11 FALSE FALSE FALSE FALSE FALSE
## 12 FALSE FALSE FALSE FALSE FALSE
## 13 FALSE FALSE FALSE FALSE FALSE
## 14 FALSE FALSE FALSE FALSE FALSE
## 15 FALSE FALSE FALSE FALSE FALSE
## 16 FALSE FALSE FALSE FALSE FALSE
## 17 FALSE FALSE FALSE FALSE FALSE
## 18 FALSE FALSE FALSE FALSE FALSE
## 19 FALSE FALSE FALSE FALSE FALSE
## 20 FALSE FALSE FALSE FALSE FALSE
## 21 FALSE FALSE FALSE FALSE FALSE
## 22 FALSE FALSE FALSE FALSE FALSE
## 23 FALSE FALSE FALSE FALSE FALSE
## 24 FALSE FALSE FALSE FALSE FALSE
## 25 FALSE FALSE FALSE FALSE FALSE
## 26 FALSE FALSE FALSE FALSE FALSE
## 27 FALSE FALSE FALSE FALSE FALSE
## 28 FALSE FALSE FALSE FALSE FALSE
## 29 FALSE FALSE FALSE FALSE FALSE
## 30 FALSE FALSE FALSE FALSE FALSE
## 31 FALSE FALSE FALSE FALSE FALSE
## 32 FALSE FALSE FALSE FALSE FALSE
## 33 FALSE FALSE FALSE FALSE FALSE
## 34 FALSE FALSE FALSE FALSE FALSE
## 35 FALSE FALSE FALSE FALSE FALSE
## 36 FALSE FALSE FALSE FALSE FALSE
## 37 FALSE FALSE FALSE FALSE FALSE
## 38 FALSE FALSE FALSE FALSE FALSE
## 39 FALSE FALSE FALSE FALSE FALSE
## 40 FALSE FALSE FALSE FALSE FALSE
## 41 FALSE FALSE FALSE FALSE FALSE
## 42 FALSE FALSE FALSE FALSE FALSE
## 43 FALSE FALSE FALSE FALSE FALSE
## 44 FALSE FALSE FALSE FALSE FALSE
## 45 FALSE FALSE FALSE FALSE FALSE
## 46 FALSE FALSE FALSE FALSE FALSE
## 47 FALSE FALSE FALSE FALSE FALSE
## 48 FALSE FALSE FALSE FALSE FALSE
## 49 FALSE FALSE FALSE FALSE FALSE
## 50 FALSE FALSE FALSE FALSE FALSE
## 51 FALSE FALSE FALSE FALSE FALSE
## 52 FALSE FALSE FALSE FALSE FALSE
## 53 FALSE FALSE FALSE FALSE FALSE
## 54 FALSE FALSE FALSE FALSE FALSE
## 55 FALSE FALSE FALSE FALSE FALSE
## 56 FALSE FALSE FALSE FALSE FALSE
## 57 FALSE FALSE FALSE FALSE FALSE
## 58 FALSE FALSE FALSE FALSE FALSE
## 59 FALSE FALSE FALSE FALSE FALSE
## 60 FALSE FALSE FALSE FALSE FALSE
## 61 FALSE FALSE FALSE FALSE FALSE
## 62 FALSE FALSE FALSE FALSE FALSE
## 63 FALSE FALSE FALSE FALSE FALSE
## 64 FALSE FALSE FALSE FALSE FALSE
## 65 FALSE FALSE FALSE FALSE FALSE
## 66 FALSE FALSE FALSE FALSE FALSE
## 67 FALSE FALSE FALSE FALSE FALSE
## 68 FALSE FALSE FALSE FALSE FALSE
## 69 FALSE FALSE FALSE FALSE FALSE
## 70 FALSE FALSE FALSE FALSE FALSE
## 71 FALSE FALSE FALSE FALSE FALSE
## 72 FALSE FALSE FALSE FALSE FALSE
## 73 FALSE FALSE FALSE FALSE FALSE
## 74 FALSE FALSE FALSE FALSE FALSE
## 75 FALSE FALSE FALSE FALSE FALSE
## 76 FALSE FALSE FALSE FALSE FALSE
## 77 FALSE FALSE FALSE FALSE FALSE
## 78 FALSE FALSE FALSE FALSE FALSE
## 79 FALSE FALSE FALSE FALSE FALSE
## 80 FALSE FALSE FALSE FALSE FALSE
## 81 FALSE FALSE FALSE FALSE FALSE
## 82 FALSE FALSE FALSE FALSE FALSE
## 83 FALSE FALSE FALSE FALSE FALSE
## 84 FALSE FALSE FALSE FALSE FALSE
## 85 FALSE FALSE FALSE FALSE FALSE
## 86 FALSE FALSE FALSE FALSE FALSE
## 87 FALSE FALSE FALSE FALSE FALSE
## 88 FALSE FALSE FALSE FALSE FALSE
## 89 FALSE FALSE FALSE FALSE FALSE
## 90 FALSE FALSE FALSE FALSE FALSE
## 91 FALSE FALSE FALSE FALSE FALSE
## 92 FALSE FALSE FALSE FALSE FALSE
## 93 FALSE FALSE FALSE FALSE FALSE
## 94 FALSE FALSE FALSE FALSE FALSE
## 95 FALSE FALSE FALSE FALSE FALSE
## 96 FALSE FALSE FALSE FALSE FALSE
## 97 FALSE FALSE FALSE FALSE FALSE
## 98 FALSE FALSE FALSE FALSE FALSE
## 99 FALSE FALSE FALSE FALSE FALSE
## 100 FALSE FALSE FALSE FALSE FALSE
## 101 FALSE FALSE FALSE FALSE FALSE
## 102 FALSE FALSE FALSE FALSE FALSE
## 103 FALSE FALSE FALSE FALSE FALSE
## 104 FALSE FALSE FALSE FALSE FALSE
## 105 FALSE FALSE FALSE FALSE FALSE
## 106 FALSE FALSE FALSE FALSE FALSE
## 107 FALSE FALSE FALSE FALSE FALSE
## 108 FALSE FALSE FALSE FALSE FALSE
## 109 FALSE FALSE FALSE FALSE FALSE
## 110 FALSE FALSE FALSE FALSE FALSE
## 111 FALSE FALSE FALSE FALSE FALSE
## 112 FALSE FALSE FALSE FALSE FALSE
## 113 FALSE FALSE FALSE FALSE FALSE
## 114 FALSE FALSE FALSE FALSE FALSE
## 115 FALSE FALSE FALSE FALSE FALSE
## 116 FALSE FALSE FALSE FALSE FALSE
## 117 FALSE FALSE FALSE FALSE FALSE
## 118 FALSE FALSE FALSE FALSE FALSE
## 119 FALSE FALSE FALSE FALSE FALSE
## 120 FALSE FALSE FALSE FALSE FALSE
## 121 FALSE FALSE FALSE FALSE FALSE
## 122 FALSE FALSE FALSE FALSE FALSE
## 123 FALSE FALSE FALSE FALSE FALSE
## 124 FALSE FALSE FALSE FALSE FALSE
## 125 FALSE FALSE FALSE FALSE FALSE
## 126 FALSE FALSE FALSE FALSE FALSE
## 127 FALSE FALSE FALSE FALSE FALSE
## 128 FALSE FALSE FALSE FALSE FALSE
## 129 FALSE FALSE FALSE FALSE FALSE
## 130 FALSE FALSE FALSE FALSE FALSE
## 131 FALSE FALSE FALSE FALSE FALSE
## 132 FALSE FALSE FALSE FALSE FALSE
## 133 FALSE FALSE FALSE FALSE FALSE
## 134 FALSE FALSE FALSE FALSE FALSE
## 135 FALSE FALSE FALSE FALSE FALSE
## 136 FALSE FALSE FALSE FALSE FALSE
## 137 FALSE FALSE FALSE FALSE FALSE
## 138 FALSE FALSE FALSE FALSE FALSE
## 139 FALSE FALSE FALSE FALSE FALSE
## 140 FALSE FALSE FALSE FALSE FALSE
## 141 FALSE FALSE FALSE FALSE FALSE
## 142 FALSE FALSE FALSE FALSE FALSE
## 143 FALSE FALSE FALSE FALSE FALSE
## 144 FALSE FALSE FALSE FALSE FALSE
## 145 FALSE FALSE FALSE FALSE FALSE
## 146 FALSE FALSE FALSE FALSE FALSE
## 147 FALSE FALSE FALSE FALSE FALSE
## 148 FALSE FALSE FALSE FALSE FALSE
## 149 FALSE FALSE FALSE FALSE FALSE
## 150 FALSE FALSE FALSE FALSE FALSE
# to remove rows that have "NA" in a specific column
my.iris.com <- my.iris[!is.na(my.iris$Petal.Length), ]
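When rows should be removed based on missing values in several specific columns, complete.cases can be applied to a subset of the columns. A sketch, with a second made-up missing value added to show the difference to na.omit; the object names `keep` and `my.iris.petal` are assumptions for illustration.

```r
my.iris <- iris
my.iris[1, 3] <- NA   # missing Petal.Length in row 1
my.iris[2, 1] <- NA   # additional, made-up missing Sepal.Length in row 2

# TRUE for rows without NA in the two petal columns
keep <- complete.cases(my.iris[, c("Petal.Length", "Petal.Width")])
my.iris.petal <- my.iris[keep, ]

nrow(my.iris.petal)      # only row 1 is removed
nrow(na.omit(my.iris))   # na.omit removes rows 1 and 2
```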
aggregate(x = iris, by=list(iris$Species), FUN = mean)
## Warning: argument is not numeric or logical: returning NA
## Warning: argument is not numeric or logical: returning NA
## Warning: argument is not numeric or logical: returning NA
## Group.1 Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 setosa 5.006 3.428 1.462 0.246 NA
## 2 versicolor 5.936 2.770 4.260 1.326 NA
## 3 virginica 6.588 2.974 5.552 2.026 NA
# the above "aggregate" call gives warnings for non-numeric columns
# see the warnings "returning NA" for the elements in "Species"
# i.e. it is not possible to calculate a mean value from species names.
# Please note as well that missing values result in a missing mean
aggregate(x = my.iris, by=list(iris$Species), FUN = mean)
## Warning: argument is not numeric or logical: returning NA
## Warning: argument is not numeric or logical: returning NA
## Warning: argument is not numeric or logical: returning NA
## Group.1 Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 setosa 5.006 3.428 NA 0.246 NA
## 2 versicolor 5.936 2.770 4.260 1.326 NA
## 3 virginica 6.588 2.974 5.552 2.026 NA
But “aggregate” allows additional arguments to be passed on to the function. This is indicated in ?aggregate by the “...” argument: “further arguments passed to or used by methods”.
So, the following works for missing numeric values.
aggregate(x = my.iris, by=list(iris$Species), FUN = mean, na.rm = TRUE)
## Warning: argument is not numeric or logical: returning NA
## Warning: argument is not numeric or logical: returning NA
## Warning: argument is not numeric or logical: returning NA
## Group.1 Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 setosa 5.006 3.428 1.463 0.246 NA
## 2 versicolor 5.936 2.770 4.260 1.326 NA
## 3 virginica 6.588 2.974 5.552 2.026 NA
# of course, the complex statement used before works here as well:
aggregate(x = my.iris, by=list(iris$Species), FUN = function(x) mean(x, na.rm = TRUE))
## Warning: argument is not numeric or logical: returning NA
## Warning: argument is not numeric or logical: returning NA
## Warning: argument is not numeric or logical: returning NA
## Group.1 Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 setosa 5.006 3.428 1.463 0.246 NA
## 2 versicolor 5.936 2.770 4.260 1.326 NA
## 3 virginica 6.588 2.974 5.552 2.026 NA
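As a minimal sketch of how the “…” mechanism behaves, the following uses a small made-up data frame (the name `toy` and its values are assumptions for illustration only, they are not part of the seminar data). It compares a call without and with `na.rm = TRUE`.

```r
# Hypothetical toy data with one missing value in group "a"
toy <- data.frame(group = c("a", "a", "b", "b"),
                  value = c(1, NA, 3, 5))

# Without na.rm, the missing value propagates into the mean of group "a"
res.na <- aggregate(x = toy["value"],
                    by = list(group = toy$group),
                    FUN = mean)

# na.rm = TRUE is not an argument of aggregate itself;
# it is handed on to mean() through "..."
res.ok <- aggregate(x = toy["value"],
                    by = list(group = toy$group),
                    FUN = mean, na.rm = TRUE)

res.na
res.ok
```

Only numeric columns are passed in here, so no "returning NA" warnings appear.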
But there are still warnings for the Species names.
How to apply “aggregate” to numeric elements only? The simple way is to manually indicate the columns that are numeric, but is there a “better” way?
# First, figure out which elements of iris are numeric:
# check for each column whether it is a factor
col.num <- sapply(iris[, 1:ncol(iris)], is.factor)
col.num
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## FALSE FALSE FALSE FALSE TRUE
# keep only the numeric columns in this object
# "is.factor" results in FALSE for numeric elements, therefore "is.factor == FALSE" is the stuff we want to keep
# "which" returns the positions of the elements of an object for which a test returns TRUE
# return the elements of col.num that have a value of "FALSE"
col.num <- which(col.num == FALSE)
# how to use information from one object to select stuff from another:
# Here we use the names of the values in "col.num" in a selection statement
# See ?match or ?'%in%'
# Value matching via %in% "returns a logical vector indicating if there is a match or not for its left operand" (from ?'%in%').
aggregate(x = iris[, names(iris) %in% names(col.num)],
by = list(iris$Species),
FUN = mean)
## Group.1 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 setosa 5.006 3.428 1.462 0.246
## 2 versicolor 5.936 2.770 4.260 1.326
## 3 virginica 6.588 2.974 5.552 2.026
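A possible shortcut, shown here as a sketch rather than as the seminar's method: testing for `is.numeric` directly gives a logical vector that can be used inside the square brackets, so the `which` and `%in%` steps are not needed. The object names `num.cols` and `iris.means` are made up for this example.

```r
# is.numeric is TRUE for numeric columns and FALSE for factors and characters
num.cols <- sapply(iris, is.numeric)

# a logical vector can be used directly to select columns
iris.means <- aggregate(x = iris[, num.cols],
                        by = list(Species = iris$Species),
                        FUN = mean)
iris.means
```

This keeps working if further non-numeric columns are added to the data frame later.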
Recoding existing data into a new element in the dataframe
Example with an ifelse statement.
If the test-statement result is correct (true), the first option gets used, otherwise the second option is used. See the helpfile ?ifelse.
ifelse(5 > 8, "no, five is not larger than 8", "yes, 5 is smaller than eight")
## [1] "yes, 5 is smaller than eight"
ifelse(iris$Species == "virginica", "yes", "not virginica")
## [1] "not virginica" "not virginica" "not virginica" "not virginica"
## [5] "not virginica" "not virginica" "not virginica" "not virginica"
## [9] "not virginica" "not virginica" "not virginica" "not virginica"
## [13] "not virginica" "not virginica" "not virginica" "not virginica"
## [17] "not virginica" "not virginica" "not virginica" "not virginica"
## [21] "not virginica" "not virginica" "not virginica" "not virginica"
## [25] "not virginica" "not virginica" "not virginica" "not virginica"
## [29] "not virginica" "not virginica" "not virginica" "not virginica"
## [33] "not virginica" "not virginica" "not virginica" "not virginica"
## [37] "not virginica" "not virginica" "not virginica" "not virginica"
## [41] "not virginica" "not virginica" "not virginica" "not virginica"
## [45] "not virginica" "not virginica" "not virginica" "not virginica"
## [49] "not virginica" "not virginica" "not virginica" "not virginica"
## [53] "not virginica" "not virginica" "not virginica" "not virginica"
## [57] "not virginica" "not virginica" "not virginica" "not virginica"
## [61] "not virginica" "not virginica" "not virginica" "not virginica"
## [65] "not virginica" "not virginica" "not virginica" "not virginica"
## [69] "not virginica" "not virginica" "not virginica" "not virginica"
## [73] "not virginica" "not virginica" "not virginica" "not virginica"
## [77] "not virginica" "not virginica" "not virginica" "not virginica"
## [81] "not virginica" "not virginica" "not virginica" "not virginica"
## [85] "not virginica" "not virginica" "not virginica" "not virginica"
## [89] "not virginica" "not virginica" "not virginica" "not virginica"
## [93] "not virginica" "not virginica" "not virginica" "not virginica"
## [97] "not virginica" "not virginica" "not virginica" "not virginica"
## [101] "yes" "yes" "yes" "yes"
## [105] "yes" "yes" "yes" "yes"
## [109] "yes" "yes" "yes" "yes"
## [113] "yes" "yes" "yes" "yes"
## [117] "yes" "yes" "yes" "yes"
## [121] "yes" "yes" "yes" "yes"
## [125] "yes" "yes" "yes" "yes"
## [129] "yes" "yes" "yes" "yes"
## [133] "yes" "yes" "yes" "yes"
## [137] "yes" "yes" "yes" "yes"
## [141] "yes" "yes" "yes" "yes"
## [145] "yes" "yes" "yes" "yes"
## [149] "yes" "yes"
iris$my.factor <- ifelse(iris$Petal.Width > 1.5,
"Large Petal",
"Small Petal")
# check the modified dataframe
summary(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.30 Min. :2.00 Min. :1.00 Min. :0.1
## 1st Qu.:5.10 1st Qu.:2.80 1st Qu.:1.60 1st Qu.:0.3
## Median :5.80 Median :3.00 Median :4.35 Median :1.3
## Mean :5.84 Mean :3.06 Mean :3.76 Mean :1.2
## 3rd Qu.:6.40 3rd Qu.:3.30 3rd Qu.:5.10 3rd Qu.:1.8
## Max. :7.90 Max. :4.40 Max. :6.90 Max. :2.5
## Species my.factor
## setosa :50 Length:150
## versicolor:50 Class :character
## virginica :50 Mode :character
##
##
##
# recode the characters to factor
iris$my.factor <- as.factor(iris$my.factor)
# check again
summary(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.30 Min. :2.00 Min. :1.00 Min. :0.1
## 1st Qu.:5.10 1st Qu.:2.80 1st Qu.:1.60 1st Qu.:0.3
## Median :5.80 Median :3.00 Median :4.35 Median :1.3
## Mean :5.84 Mean :3.06 Mean :3.76 Mean :1.2
## 3rd Qu.:6.40 3rd Qu.:3.30 3rd Qu.:5.10 3rd Qu.:1.8
## Max. :7.90 Max. :4.40 Max. :6.90 Max. :2.5
## Species my.factor
## setosa :50 Large Petal:52
## versicolor:50 Small Petal:98
## virginica :50
##
##
##
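As an alternative sketch, `cut` can do the recoding and the conversion to a factor in one step. The object name `petal.class` is hypothetical, and the breakpoint of 1.5 is taken from the `ifelse` example above. With the default `right = TRUE`, a value of exactly 1.5 falls into the lower class, which matches the test `Petal.Width > 1.5`.

```r
# cut splits a numeric vector into intervals and returns a factor;
# intervals here are (-Inf, 1.5] and (1.5, Inf)
petal.class <- cut(iris$Petal.Width,
                   breaks = c(-Inf, 1.5, Inf),
                   labels = c("Small Petal", "Large Petal"))

# count the observations per class
table(petal.class)
```

Unlike the `ifelse` route, the order of the factor levels follows the order of the labels instead of the alphabet.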
# aggregate over two factors
aggregate(x = iris[, names(iris) %in% names(col.num)],
by = list(iris$Species, iris$my.factor),
FUN = mean)
## Group.1 Group.2 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 versicolor Large Petal 6.180 3.120 4.820 1.660
## 2 virginica Large Petal 6.617 3.002 5.572 2.062
## 3 setosa Small Petal 5.006 3.428 1.462 0.246
## 4 versicolor Small Petal 5.909 2.731 4.198 1.289
## 5 virginica Small Petal 6.133 2.533 5.233 1.467
# Using another function:
aggregate(x = iris[, names(iris) %in% names(col.num)],
by = list(iris$Species, iris$my.factor),
FUN = sd)
## Group.1 Group.2 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 versicolor Large Petal 0.3271 0.2775 0.2387 0.08944
## 2 virginica Large Petal 0.6445 0.3054 0.5594 0.24187
## 3 setosa Small Petal 0.3525 0.3791 0.1737 0.10539
## 4 versicolor Small Petal 0.5286 0.2953 0.4485 0.16952
## 5 virginica Small Petal 0.1528 0.3055 0.3215 0.05774
Wait - how to do standard error?
se = sd / sqrt(n)
from http://cran.r-project.org/doc/manuals/R-intro.html: “[…] Suppose further we needed to calculate the standard errors of the state income means. To do this we need to write an R function to calculate the standard error for any given vector. Since there is an builtin function var() to calculate the sample variance, such a function is a very simple one liner, specified by the assignment:”
stderr <- function(x) sqrt(var(x)/length(x))
# But have to account for potential missing values!
# btw var is square of sd (see ?var)
stderr <- function(x) {
sqrt(var(x[!is.na(x)]) / length(x[!is.na(x)]))
}
# the same utilising "sd". The previous function follows the one-liner from the R introduction manual quoted above.
stde.2 <- function(x) {
sd(x[!is.na(x)]) / sqrt(length(x[!is.na(x)]))
}
# use the self-defined functions
aggregate(x = iris[, names(iris) %in% names(col.num)],
by = list(iris$Species, iris$my.factor),
FUN = stderr)
## Group.1 Group.2 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 versicolor Large Petal 0.14629 0.12410 0.10677 0.04000
## 2 virginica Large Petal 0.09401 0.04454 0.08159 0.03528
## 3 setosa Small Petal 0.04985 0.05361 0.02456 0.01490
## 4 versicolor Small Petal 0.07881 0.04402 0.06685 0.02527
## 5 virginica Small Petal 0.08819 0.17638 0.18559 0.03333
aggregate(x = iris[, names(iris) %in% names(col.num)],
by = list(iris$Species, iris$my.factor),
FUN = stde.2)
## Group.1 Group.2 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 versicolor Large Petal 0.14629 0.12410 0.10677 0.04000
## 2 virginica Large Petal 0.09401 0.04454 0.08159 0.03528
## 3 setosa Small Petal 0.04985 0.05361 0.02456 0.01490
## 4 versicolor Small Petal 0.07881 0.04402 0.06685 0.02527
## 5 virginica Small Petal 0.08819 0.17638 0.18559 0.03333
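A compact variant of the two functions above, as a sketch: the missing values are removed once at the start, and the function gets the hypothetical name `std.err` because an object called `stderr` masks the base R function `stderr()`. The test vector `v` is made up for this example.

```r
# standard error of the mean, ignoring missing values
std.err <- function(x) {
  x <- x[!is.na(x)]          # drop missing values once
  sd(x) / sqrt(length(x))    # sd is the square root of var
}

# hypothetical test vector with one missing value
v <- c(2, 4, 4, 4, 5, 5, 7, 9, NA)

# both lines should print the same number
std.err(v)
sqrt(var(v, na.rm = TRUE) / sum(!is.na(v)))
```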
# Keep the output of the aggregate function in the new object "my.res"
my.res <- aggregate(x = iris[, names(iris) %in% names(col.num)],
by = list(iris$Species, iris$my.factor),
FUN = stderr)
my.res
## Group.1 Group.2 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 versicolor Large Petal 0.14629 0.12410 0.10677 0.04000
## 2 virginica Large Petal 0.09401 0.04454 0.08159 0.03528
## 3 setosa Small Petal 0.04985 0.05361 0.02456 0.01490
## 4 versicolor Small Petal 0.07881 0.04402 0.06685 0.02527
## 5 virginica Small Petal 0.08819 0.17638 0.18559 0.03333
# Question: Is it possible to round the results?
# See the helpfile ?round. There are several options for rounding.
# round only works with numeric data.
# here we use the square brackets to select the columns three to six only,
# then we round all data to three digits and put the result into the object "my.res.round"
my.res.round <- sapply(my.res[, 3:6],
function(x) round(x, digits = 3))
# "sapply" simplifies its result to a matrix here, because every column gives a result of the same length. If a dataframe is preferred, the conversion is easy:
my.res.round <- as.data.frame(my.res.round)
# how to get the information on groups and petal size into this data frame?
# we just copy the columns that we want and add them to the rounded data frame:
my.res.round$Species <- my.res$Group.1
my.res.round$Petal.size <- my.res$Group.2
head(my.res.round)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species Petal.size
## 1 0.146 0.124 0.107 0.040 versicolor Large Petal
## 2 0.094 0.045 0.082 0.035 virginica Large Petal
## 3 0.050 0.054 0.025 0.015 setosa Small Petal
## 4 0.079 0.044 0.067 0.025 versicolor Small Petal
## 5 0.088 0.176 0.186 0.033 virginica Small Petal
# how to re-order the columns in a data frame?
# this is done with square brackets again.
my.res.round[, c(5, 6, 1:4)]
## Species Petal.size Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 versicolor Large Petal 0.146 0.124 0.107 0.040
## 2 virginica Large Petal 0.094 0.045 0.082 0.035
## 3 setosa Small Petal 0.050 0.054 0.025 0.015
## 4 versicolor Small Petal 0.079 0.044 0.067 0.025
## 5 virginica Small Petal 0.088 0.176 0.186 0.033
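The rounding can also be done in place, which avoids copying the group columns back and re-ordering afterwards. This is only a sketch: the object name `se.tab` is made up, and it assumes that the first four columns of iris are the measurements.

```r
# standard error per species for the four measurement columns
se.tab <- aggregate(x = iris[, 1:4],
                    by = list(Species = iris$Species),
                    FUN = function(x) sd(x) / sqrt(length(x)))

# find the numeric columns and round only those;
# the Species column stays untouched and stays in first position
num <- sapply(se.tab, is.numeric)
se.tab[num] <- lapply(se.tab[num], round, digits = 3)
se.tab
```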
TO DO
Find out how to store the results of the stderr aggregation in a new object.
Find out how to rename the generic names “Group.1” etc to something more meaningful in the new object.
Find out how to do an anova on Species and Petal size category for the original iris data.
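One possible way to approach these open items, as a sketch and not a worked-out solution: naming the elements of the list given to `by` replaces the generic “Group.1” names, and `aov` fits an analysis of variance. The names `iris2`, `Petal.size`, `se.res` and `fit`, and the choice of Sepal.Length as the response, are assumptions made for this example.

```r
# work on a copy and recreate the petal size category from above
iris2 <- iris
iris2$Petal.size <- factor(ifelse(iris2$Petal.Width > 1.5,
                                  "Large Petal", "Small Petal"))

# named list elements in "by" become the column names of the result
se.res <- aggregate(x = iris2[, 1:4],
                    by = list(Species = iris2$Species,
                              Petal.size = iris2$Petal.size),
                    FUN = function(x) sd(x) / sqrt(length(x)))
names(se.res)

# column names can also be changed afterwards
names(se.res)[names(se.res) == "Petal.size"] <- "Petal.class"

# anova with Species and petal size category as additive factors
fit <- aov(Sepal.Length ~ Species + Petal.size, data = iris2)
summary(fit)
```

The groups are unbalanced (there are no setosa plants with large petals), and `aov` uses sequential sums of squares, so the order of the terms in the formula can change the table.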