Introduction to R. Session 01.

Markus, August 2013

importing csv files
selecting data
missing values
applying functions

Example code to import a csv file. Check the available options via ?read.csv, or use the Dataset import wizard in RStudio. Modify the code below to match your situation. Column headers in the csv file should start with a letter; otherwise read.csv alters them by default (for example, by prefixing an "X").
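What happens to a header that does not start with a letter can be sketched with a small inline example. The csv text and its column names are made up for illustration; `text` and `check.names` are existing arguments of read.csv.

```r
# Hypothetical two-column csv; the first header starts with a digit
csv_text <- "1st_col,second\n1,2\n3,4"

# Default behaviour (check.names = TRUE): invalid names are repaired
fixed <- read.csv(text = csv_text)
names(fixed)

# check.names = FALSE keeps the headers exactly as written in the file
raw <- read.csv(text = csv_text, check.names = FALSE)
names(raw)
```

Names kept with check.names = FALSE have to be quoted with backticks when used after `$`, which is one reason to prefer headers that start with a letter.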

Here we import an example file from my online storage, using read.csv. The sep argument specifies the character that separates columns in the file. This file uses “,” to separate columns, which is also the default for read.csv.

df <- read.csv("./R_seminar/example_csv_file.csv", 
                sep = ",")

# To import a file from your own hard drive, give the full pathname:
# e.g. "C:/Data/my_files/experiment01.csv"

# After the import, always check if the data was imported correctly
summary(df)

##        ID           Room           Pot                Species   
##  Min.   :  4   Min.   :2.00   Min.   :  4   E. saligna    :704  
##  1st Qu.:125   1st Qu.:3.00   1st Qu.:125   E. sideroxylon:623  
##  Median :318   Median :5.00   Median :314                       
##  Mean   :302   Mean   :4.51   Mean   :300                       
##  3rd Qu.:463   3rd Qu.:6.00   3rd Qu.:459                       
##  Max.   :589   Max.   :7.00   Max.   :585                       
##                                                                 
##       ID1           Room.1     CO2_treatment   Temperature 
##  Min.   :0.00   Min.   :2.00   Min.   :280   ambient :660  
##  1st Qu.:1.00   1st Qu.:3.00   1st Qu.:280   elevated:667  
##  Median :3.00   Median :5.00   Median :400                 
##  Mean   :2.51   Mean   :4.51   Mean   :440                 
##  3rd Qu.:4.00   3rd Qu.:6.00   3rd Qu.:640                 
##  Max.   :5.00   Max.   :7.00   Max.   :640                 
##                                                            
##  measurement_temperature water_treatment      ID.1         plant_no  
##  Min.   :28                 :1093        Min.   :   1   Min.   :  4  
##  1st Qu.:28              dry:  66        1st Qu.: 332   1st Qu.:125  
##  Median :28              wet: 168        Median : 664   Median :314  
##  Mean   :31                              Mean   : 664   Mean   :300  
##  3rd Qu.:34                              3rd Qu.: 996   3rd Qu.:459  
##  Max.   :34                              Max.   :1327   Max.   :585  
##                                                                      
##     room_no     water_treatment.1 measurement_temperature.1   Obs         
##  Min.   :2.00      :1093          Min.   :28                Mode:logical  
##  1st Qu.:3.00   dry:  66          1st Qu.:28                NA's:1327     
##  Median :5.00   wet: 168          Median :28                              
##  Mean   :4.51                     Mean   :31                              
##  3rd Qu.:6.00                     3rd Qu.:34                              
##  Max.   :7.00                     Max.   :34                              
##                                                                           
##       Time           FTime            EBal         Photo      
##  Min.   :    0   Min.   :  251   Min.   :0     Min.   :-7.74  
##  1st Qu.:    1   1st Qu.: 4272   1st Qu.:0     1st Qu.: 2.71  
##  Median :    1   Median : 8464   Median :0     Median : 9.33  
##  Mean   : 2190   Mean   : 9505   Mean   :0     Mean   : 9.65  
##  3rd Qu.: 1078   3rd Qu.:14166   3rd Qu.:0     3rd Qu.:15.70  
##  Max.   :23619   Max.   :26496   Max.   :0     Max.   :29.60  
##                  NA's   :349     NA's   :991                  
##       Cond             Ci          Trmmol           VpdL     
##  Min.   :0.010   Min.   :125   Min.   : 0.27   Min.   :0.74  
##  1st Qu.:0.190   1st Qu.:254   1st Qu.: 3.10   1st Qu.:1.25  
##  Median :0.290   Median :325   Median : 4.15   Median :1.60  
##  Mean   :0.325   Mean   :365   Mean   : 4.52   Mean   :1.65  
##  3rd Qu.:0.410   3rd Qu.:482   3rd Qu.: 5.62   3rd Qu.:1.96  
##  Max.   :1.130   Max.   :756   Max.   :12.70   Max.   :3.40  
##                                                              
##     Column2          Area           BLC          StmRat      BLCond     
##  Min.   :23.4   Min.   :0.90   Min.   :1.4   Min.   :1   Min.   : 2.84  
##  1st Qu.:25.3   1st Qu.:1.80   1st Qu.:1.4   1st Qu.:1   1st Qu.: 2.84  
##  Median :27.1   Median :6.00   Median :1.4   Median :1   Median : 2.84  
##  Mean   :27.2   Mean   :4.15   Mean   :1.6   Mean   :1   Mean   : 5.12  
##  3rd Qu.:28.5   3rd Qu.:6.00   3rd Qu.:1.9   3rd Qu.:1   3rd Qu.: 4.68  
##  Max.   :32.7   Max.   :6.00   Max.   :2.3   Max.   :1   Max.   :12.00  
##  NA's   :991                   NA's   :991                              
##       Tair          Tleaf           TBlk           CO2R          CO2S    
##  Min.   :26.6   Min.   :23.1   Min.   :27.9   Min.   :278   Min.   :251  
##  1st Qu.:27.6   1st Qu.:26.1   1st Qu.:28.0   1st Qu.:280   1st Qu.:277  
##  Median :31.5   Median :27.8   Median :33.5   Median :400   Median :394  
##  Mean   :30.3   Mean   :28.0   Mean   :31.0   Mean   :440   Mean   :431  
##  3rd Qu.:32.7   3rd Qu.:29.7   3rd Qu.:34.0   3rd Qu.:640   3rd Qu.:625  
##  Max.   :34.2   Max.   :34.9   Max.   :34.1   Max.   :642   Max.   :642  
##                                                                          
##       H2OR           H2OS           RH_R           RH_S           Flow    
##  Min.   :10.6   Min.   :13.9   Min.   :21.9   Min.   :29.6   Min.   :299  
##  1st Qu.:15.4   1st Qu.:18.2   1st Qu.:33.4   1st Qu.:41.3   1st Qu.:500  
##  Median :16.6   Median :20.5   Median :42.0   Median :50.0   Median :500  
##  Mean   :17.7   Mean   :21.3   Mean   :42.0   Mean   :50.4   Mean   :480  
##  3rd Qu.:19.5   3rd Qu.:24.0   3rd Qu.:48.1   3rd Qu.:57.4   3rd Qu.:500  
##  Max.   :30.8   Max.   :35.2   Max.   :70.7   Max.   :80.2   Max.   :503  
##                                                                           
##       PARi           PARo          Press         CsMch        
##  Min.   :  -1   Min.   :   0   Min.   :101   Min.   :-17.000  
##  1st Qu.:  61   1st Qu.:  33   1st Qu.:102   1st Qu.: -3.780  
##  Median : 300   Median : 137   Median :102   Median : -1.180  
##  Mean   : 507   Mean   : 233   Mean   :102   Mean   : -0.624  
##  3rd Qu.: 900   3rd Qu.: 296   3rd Qu.:102   3rd Qu.:  2.500  
##  Max.   :1508   Max.   :1809   Max.   :102   Max.   : 13.000  
##                                                               
##      HsMch            CsMch1       BLCslope       BLCoffst      f_parin   
##  Min.   :-0.480   Min.   :0.0   Min.   :-0.2   Min.   :2.7   Min.   :1    
##  1st Qu.:-0.030   1st Qu.:0.7   1st Qu.:-0.2   1st Qu.:2.7   1st Qu.:1    
##  Median : 0.200   Median :1.0   Median :-0.2   Median :2.7   Median :1    
##  Mean   : 0.131   Mean   :0.8   Mean   :-0.2   Mean   :2.7   Mean   :1    
##  3rd Qu.: 0.290   3rd Qu.:1.0   3rd Qu.:-0.2   3rd Qu.:2.7   3rd Qu.:1    
##  Max.   : 0.670   Max.   :1.0   Max.   :-0.2   Max.   :2.7   Max.   :1    
##                   NA's   :349   NA's   :991    NA's   :991   NA's   :991  
##     f_parout       alphaK       Status    
##  Min.   :0     Min.   :0.2         :  85  
##  1st Qu.:0     1st Qu.:0.2   111115:1230  
##  Median :0     Median :0.2   Status:  12  
##  Mean   :0     Mean   :0.2                
##  3rd Qu.:0     3rd Qu.:0.2                
##  Max.   :0     Max.   :0.2                
##  NA's   :991   NA's   :991
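Besides summary(), a few other base R functions help to check an import. The following is a minimal sketch on a small made-up data set written to a temporary file; the column names only imitate the example file and the values are invented. Note that since R 4.0.0 read.csv keeps text columns as character by default, so summary() no longer counts their levels as in the output above unless stringsAsFactors = TRUE is passed.

```r
# Small made-up data set written to a temporary file,
# so that the sketch does not depend on the seminar files
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(ID      = 1:3,
                     Species = c("E. saligna", "E. sideroxylon", "E. saligna"),
                     Photo   = c(9.3, NA, 15.7)),
          file = tmp, row.names = FALSE)

check <- read.csv(tmp, sep = ",")

dim(check)              # number of rows and columns
str(check)              # type of each column
head(check)             # first rows of the data frame
colSums(is.na(check))   # number of missing values per column
```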

# In case your file uses "tab" to separate columns:

df_tab <- read.csv("./R_seminar/example_csv_file_separated_by_tab.csv",
                   sep = "\t")

summary(df_tab)

##        ID           Room           Pot                Species   
##  Min.   :  4   Min.   :2.00   Min.   :  4   E. saligna    :704  
##  1st Qu.:125   1st Qu.:3.00   1st Qu.:125   E. sideroxylon:623  
##  Median :318   Median :5.00   Median :314                       
##  Mean   :302   Mean   :4.51   Mean   :300                       
##  3rd Qu.:463   3rd Qu.:6.00   3rd Qu.:459                       
##  Max.   :589   Max.   :7.00   Max.   :585                       
##                                                                 
##       ID1           Room.1     CO2_treatment   Temperature 
##  Min.   :0.00   Min.   :2.00   Min.   :280   ambient :660  
##  1st Qu.:1.00   1st Qu.:3.00   1st Qu.:280   elevated:667  
##  Median :3.00   Median :5.00   Median :400                 
##  Mean   :2.51   Mean   :4.51   Mean   :440                 
##  3rd Qu.:4.00   3rd Qu.:6.00   3rd Qu.:640                 
##  Max.   :5.00   Max.   :7.00   Max.   :640                 
##                                                            
##  measurement_temperature water_treatment      ID.1         plant_no  
##  Min.   :28                 :1093        Min.   :   1   Min.   :  4  
##  1st Qu.:28              dry:  66        1st Qu.: 332   1st Qu.:125  
##  Median :28              wet: 168        Median : 664   Median :314  
##  Mean   :31                              Mean   : 664   Mean   :300  
##  3rd Qu.:34                              3rd Qu.: 996   3rd Qu.:459  
##  Max.   :34                              Max.   :1327   Max.   :585  
##                                                                      
##     room_no     water_treatment.1 measurement_temperature.1   Obs         
##  Min.   :2.00      :1093          Min.   :28                Mode:logical  
##  1st Qu.:3.00   dry:  66          1st Qu.:28                NA's:1327     
##  Median :5.00   wet: 168          Median :28                              
##  Mean   :4.51                     Mean   :31                              
##  3rd Qu.:6.00                     3rd Qu.:34                              
##  Max.   :7.00                     Max.   :34                              
##                                                                           
##       Time           FTime            EBal         Photo      
##  Min.   :    0   Min.   :  251   Min.   :0     Min.   :-7.74  
##  1st Qu.:    1   1st Qu.: 4272   1st Qu.:0     1st Qu.: 2.71  
##  Median :    1   Median : 8464   Median :0     Median : 9.33  
##  Mean   : 2190   Mean   : 9505   Mean   :0     Mean   : 9.65  
##  3rd Qu.: 1078   3rd Qu.:14166   3rd Qu.:0     3rd Qu.:15.70  
##  Max.   :23619   Max.   :26496   Max.   :0     Max.   :29.60  
##                  NA's   :349     NA's   :991                  
##       Cond             Ci          Trmmol           VpdL     
##  Min.   :0.010   Min.   :125   Min.   : 0.27   Min.   :0.74  
##  1st Qu.:0.190   1st Qu.:254   1st Qu.: 3.10   1st Qu.:1.25  
##  Median :0.290   Median :325   Median : 4.15   Median :1.60  
##  Mean   :0.325   Mean   :365   Mean   : 4.52   Mean   :1.65  
##  3rd Qu.:0.410   3rd Qu.:482   3rd Qu.: 5.62   3rd Qu.:1.96  
##  Max.   :1.130   Max.   :756   Max.   :12.70   Max.   :3.40  
##                                                              
##     Column2          Area           BLC          StmRat      BLCond     
##  Min.   :23.4   Min.   :0.90   Min.   :1.4   Min.   :1   Min.   : 2.84  
##  1st Qu.:25.3   1st Qu.:1.80   1st Qu.:1.4   1st Qu.:1   1st Qu.: 2.84  
##  Median :27.1   Median :6.00   Median :1.4   Median :1   Median : 2.84  
##  Mean   :27.2   Mean   :4.15   Mean   :1.6   Mean   :1   Mean   : 5.12  
##  3rd Qu.:28.5   3rd Qu.:6.00   3rd Qu.:1.9   3rd Qu.:1   3rd Qu.: 4.68  
##  Max.   :32.7   Max.   :6.00   Max.   :2.3   Max.   :1   Max.   :12.00  
##  NA's   :991                   NA's   :991                              
##       Tair          Tleaf           TBlk           CO2R          CO2S    
##  Min.   :26.6   Min.   :23.1   Min.   :27.9   Min.   :278   Min.   :251  
##  1st Qu.:27.6   1st Qu.:26.1   1st Qu.:28.0   1st Qu.:280   1st Qu.:277  
##  Median :31.5   Median :27.8   Median :33.5   Median :400   Median :394  
##  Mean   :30.3   Mean   :28.0   Mean   :31.0   Mean   :440   Mean   :431  
##  3rd Qu.:32.7   3rd Qu.:29.7   3rd Qu.:34.0   3rd Qu.:640   3rd Qu.:625  
##  Max.   :34.2   Max.   :34.9   Max.   :34.1   Max.   :642   Max.   :642  
##                                                                          
##       H2OR           H2OS           RH_R           RH_S           Flow    
##  Min.   :10.6   Min.   :13.9   Min.   :21.9   Min.   :29.6   Min.   :299  
##  1st Qu.:15.4   1st Qu.:18.2   1st Qu.:33.4   1st Qu.:41.3   1st Qu.:500  
##  Median :16.6   Median :20.5   Median :42.0   Median :50.0   Median :500  
##  Mean   :17.7   Mean   :21.3   Mean   :42.0   Mean   :50.4   Mean   :480  
##  3rd Qu.:19.5   3rd Qu.:24.0   3rd Qu.:48.1   3rd Qu.:57.4   3rd Qu.:500  
##  Max.   :30.8   Max.   :35.2   Max.   :70.7   Max.   :80.2   Max.   :503  
##                                                                           
##       PARi           PARo          Press         CsMch        
##  Min.   :  -1   Min.   :   0   Min.   :101   Min.   :-17.000  
##  1st Qu.:  61   1st Qu.:  33   1st Qu.:102   1st Qu.: -3.780  
##  Median : 300   Median : 137   Median :102   Median : -1.180  
##  Mean   : 507   Mean   : 233   Mean   :102   Mean   : -0.624  
##  3rd Qu.: 900   3rd Qu.: 296   3rd Qu.:102   3rd Qu.:  2.500  
##  Max.   :1508   Max.   :1809   Max.   :102   Max.   : 13.000  
##                                                               
##      HsMch            CsMch1       BLCslope       BLCoffst      f_parin   
##  Min.   :-0.480   Min.   :0.0   Min.   :-0.2   Min.   :2.7   Min.   :1    
##  1st Qu.:-0.030   1st Qu.:0.7   1st Qu.:-0.2   1st Qu.:2.7   1st Qu.:1    
##  Median : 0.200   Median :1.0   Median :-0.2   Median :2.7   Median :1    
##  Mean   : 0.131   Mean   :0.8   Mean   :-0.2   Mean   :2.7   Mean   :1    
##  3rd Qu.: 0.290   3rd Qu.:1.0   3rd Qu.:-0.2   3rd Qu.:2.7   3rd Qu.:1    
##  Max.   : 0.670   Max.   :1.0   Max.   :-0.2   Max.   :2.7   Max.   :1    
##                   NA's   :349   NA's   :991    NA's   :991   NA's   :991  
##     f_parout       alphaK       Status    
##  Min.   :0     Min.   :0.2         :  85  
##  1st Qu.:0     1st Qu.:0.2   111115:1230  
##  Median :0     Median :0.2   Status:  12  
##  Mean   :0     Mean   :0.2                
##  3rd Qu.:0     3rd Qu.:0.2                
##  Max.   :0     Max.   :0.2                
##  NA's   :991   NA's   :991
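For tab-separated files, base R also provides read.delim, which uses the tab character as its default separator. A minimal sketch with a made-up temporary file (file content invented for illustration):

```r
# Made-up tab-separated file in a temporary location
tmp_tab <- tempfile(fileext = ".csv")
writeLines(c("ID\tSpecies",
             "1\tE. saligna",
             "2\tE. sideroxylon"),
           tmp_tab)

via_csv   <- read.csv(tmp_tab, sep = "\t")
via_delim <- read.delim(tmp_tab)   # sep = "\t" is the default here

# Both calls should give the same data frame
identical(via_csv, via_delim)
```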

If the file does not have a header in the first line, set the “header” option to FALSE. R will then create a name for each column automatically (V1, V2, ...).

df_tab_no_header <- read.csv("./R_seminar/example_csv_file_tab_no_header.csv",
                             sep = "\t",
                             header = FALSE)

summary(df_tab_no_header)

##        V1            V2             V3                   V4     
##  Min.   :  4   Min.   :2.00   Min.   :  4   E. saligna    :704  
##  1st Qu.:125   1st Qu.:3.00   1st Qu.:125   E. sideroxylon:623  
##  Median :318   Median :5.00   Median :314                       
##  Mean   :302   Mean   :4.51   Mean   :300                       
##  3rd Qu.:463   3rd Qu.:6.00   3rd Qu.:459                       
##  Max.   :589   Max.   :7.00   Max.   :585                       
##                                                                 
##        V5             V6             V7             V8            V9    
##  Min.   :0.00   Min.   :2.00   Min.   :280   ambient :660   Min.   :28  
##  1st Qu.:1.00   1st Qu.:3.00   1st Qu.:280   elevated:667   1st Qu.:28  
##  Median :3.00   Median :5.00   Median :400                  Median :28  
##  Mean   :2.51   Mean   :4.51   Mean   :440                  Mean   :31  
##  3rd Qu.:4.00   3rd Qu.:6.00   3rd Qu.:640                  3rd Qu.:34  
##  Max.   :5.00   Max.   :7.00   Max.   :640                  Max.   :34  
##                                                                         
##   V10            V11            V12           V13        V14      
##     :1093   Min.   :   1   Min.   :  4   Min.   :2.00      :1093  
##  dry:  66   1st Qu.: 332   1st Qu.:125   1st Qu.:3.00   dry:  66  
##  wet: 168   Median : 664   Median :314   Median :5.00   wet: 168  
##             Mean   : 664   Mean   :300   Mean   :4.51             
##             3rd Qu.: 996   3rd Qu.:459   3rd Qu.:6.00             
##             Max.   :1327   Max.   :585   Max.   :7.00             
##                                                                   
##       V15       V16               V17             V18             V19     
##  Min.   :28   Mode:logical   Min.   :    0   Min.   :  251   Min.   :0    
##  1st Qu.:28   NA's:1327      1st Qu.:    1   1st Qu.: 4272   1st Qu.:0    
##  Median :28                  Median :    1   Median : 8464   Median :0    
##  Mean   :31                  Mean   : 2190   Mean   : 9505   Mean   :0    
##  3rd Qu.:34                  3rd Qu.: 1078   3rd Qu.:14166   3rd Qu.:0    
##  Max.   :34                  Max.   :23619   Max.   :26496   Max.   :0    
##                                              NA's   :349     NA's   :991  
##       V20             V21             V22           V23       
##  Min.   :-7.74   Min.   :0.010   Min.   :125   Min.   : 0.27  
##  1st Qu.: 2.71   1st Qu.:0.190   1st Qu.:254   1st Qu.: 3.10  
##  Median : 9.33   Median :0.290   Median :325   Median : 4.15  
##  Mean   : 9.65   Mean   :0.325   Mean   :365   Mean   : 4.52  
##  3rd Qu.:15.70   3rd Qu.:0.410   3rd Qu.:482   3rd Qu.: 5.62  
##  Max.   :29.60   Max.   :1.130   Max.   :756   Max.   :12.70  
##                                                               
##       V24            V25            V26            V27           V28   
##  Min.   :0.74   Min.   :23.4   Min.   :0.90   Min.   :1.4   Min.   :1  
##  1st Qu.:1.25   1st Qu.:25.3   1st Qu.:1.80   1st Qu.:1.4   1st Qu.:1  
##  Median :1.60   Median :27.1   Median :6.00   Median :1.4   Median :1  
##  Mean   :1.65   Mean   :27.2   Mean   :4.15   Mean   :1.6   Mean   :1  
##  3rd Qu.:1.96   3rd Qu.:28.5   3rd Qu.:6.00   3rd Qu.:1.9   3rd Qu.:1  
##  Max.   :3.40   Max.   :32.7   Max.   :6.00   Max.   :2.3   Max.   :1  
##                 NA's   :991                   NA's   :991              
##       V29             V30            V31            V32      
##  Min.   : 2.84   Min.   :26.6   Min.   :23.1   Min.   :27.9  
##  1st Qu.: 2.84   1st Qu.:27.6   1st Qu.:26.1   1st Qu.:28.0  
##  Median : 2.84   Median :31.5   Median :27.8   Median :33.5  
##  Mean   : 5.12   Mean   :30.3   Mean   :28.0   Mean   :31.0  
##  3rd Qu.: 4.68   3rd Qu.:32.7   3rd Qu.:29.7   3rd Qu.:34.0  
##  Max.   :12.00   Max.   :34.2   Max.   :34.9   Max.   :34.1  
##                                                              
##       V33           V34           V35            V36            V37      
##  Min.   :278   Min.   :251   Min.   :10.6   Min.   :13.9   Min.   :21.9  
##  1st Qu.:280   1st Qu.:277   1st Qu.:15.4   1st Qu.:18.2   1st Qu.:33.4  
##  Median :400   Median :394   Median :16.6   Median :20.5   Median :42.0  
##  Mean   :440   Mean   :431   Mean   :17.7   Mean   :21.3   Mean   :42.0  
##  3rd Qu.:640   3rd Qu.:625   3rd Qu.:19.5   3rd Qu.:24.0   3rd Qu.:48.1  
##  Max.   :642   Max.   :642   Max.   :30.8   Max.   :35.2   Max.   :70.7  
##                                                                          
##       V38            V39           V40            V41            V42     
##  Min.   :29.6   Min.   :299   Min.   :  -1   Min.   :   0   Min.   :101  
##  1st Qu.:41.3   1st Qu.:500   1st Qu.:  61   1st Qu.:  33   1st Qu.:102  
##  Median :50.0   Median :500   Median : 300   Median : 137   Median :102  
##  Mean   :50.4   Mean   :480   Mean   : 507   Mean   : 233   Mean   :102  
##  3rd Qu.:57.4   3rd Qu.:500   3rd Qu.: 900   3rd Qu.: 296   3rd Qu.:102  
##  Max.   :80.2   Max.   :503   Max.   :1508   Max.   :1809   Max.   :102  
##                                                                          
##       V43               V44              V45           V46      
##  Min.   :-17.000   Min.   :-0.480   Min.   :0.0   Min.   :-0.2  
##  1st Qu.: -3.780   1st Qu.:-0.030   1st Qu.:0.7   1st Qu.:-0.2  
##  Median : -1.180   Median : 0.200   Median :1.0   Median :-0.2  
##  Mean   : -0.624   Mean   : 0.131   Mean   :0.8   Mean   :-0.2  
##  3rd Qu.:  2.500   3rd Qu.: 0.290   3rd Qu.:1.0   3rd Qu.:-0.2  
##  Max.   : 13.000   Max.   : 0.670   Max.   :1.0   Max.   :-0.2  
##                                     NA's   :349   NA's   :991   
##       V47           V48           V49           V50          V51      
##  Min.   :2.7   Min.   :1     Min.   :0     Min.   :0.2         :  85  
##  1st Qu.:2.7   1st Qu.:1     1st Qu.:0     1st Qu.:0.2   111115:1230  
##  Median :2.7   Median :1     Median :0     Median :0.2   Status:  12  
##  Mean   :2.7   Mean   :1     Mean   :0     Mean   :0.2                
##  3rd Qu.:2.7   3rd Qu.:1     3rd Qu.:0     3rd Qu.:0.2                
##  Max.   :2.7   Max.   :1     Max.   :0     Max.   :0.2                
##  NA's   :991   NA's   :991   NA's   :991   NA's   :991
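The automatic names V1, V2, ... can be replaced at import time with the col.names argument of read.csv. A minimal sketch, assuming a made-up header-less file; the names "ID", "Room" and "Species" are chosen for illustration only.

```r
# Made-up tab-separated file without a header line
tmp_nohead <- tempfile(fileext = ".csv")
writeLines(c("4\t2\tE. saligna",
             "125\t3\tE. sideroxylon"),
           tmp_nohead)

# Without further information, R falls back to V1, V2, V3
auto <- read.csv(tmp_nohead, sep = "\t", header = FALSE)
names(auto)

# col.names supplies the names during the import
named <- read.csv(tmp_nohead, sep = "\t", header = FALSE,
                  col.names = c("ID", "Room", "Species"))
names(named)
```

The same result can be reached after the import by assigning to names(auto).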

To see the available options for importing data in csv format, see the help file:

?read.csv
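One option from that help file that relates to missing values is na.strings. The sketch below uses made-up inline data; treating empty cells as missing is an assumption about what is wanted, not something the imports above do.

```r
# Made-up csv text with one empty cell in a text column
csv_text <- "plant,water_treatment\n1,dry\n2,\n3,wet"

# Default: the empty cell in a text column is read as "", not as NA
default_import <- read.csv(text = csv_text)
sum(is.na(default_import$water_treatment))

# na.strings: also treat empty cells as missing values
na_import <- read.csv(text = csv_text, na.strings = c("", "NA"))
sum(is.na(na_import$water_treatment))
```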

For the exercise, we use the built-in “iris” data set that we used last time. Load the “iris” example data set:

data(iris)

To answer a question that came up during the previous session (in 2012): how to select a column of a data frame by name.

iris[, "Sepal.Width"]

##   [1] 3.5 3.0 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 3.7 3.4 3.0 3.0 4.0 4.4 3.9
##  [18] 3.5 3.8 3.8 3.4 3.7 3.6 3.3 3.4 3.0 3.4 3.5 3.4 3.2 3.1 3.4 4.1 4.2
##  [35] 3.1 3.2 3.5 3.6 3.0 3.4 3.5 2.3 3.2 3.5 3.8 3.0 3.8 3.2 3.7 3.3 3.2
##  [52] 3.2 3.1 2.3 2.8 2.8 3.3 2.4 2.9 2.7 2.0 3.0 2.2 2.9 2.9 3.1 3.0 2.7
##  [69] 2.2 2.5 3.2 2.8 2.5 2.8 2.9 3.0 2.8 3.0 2.9 2.6 2.4 2.4 2.7 2.7 3.0
##  [86] 3.4 3.1 2.3 3.0 2.5 2.6 3.0 2.6 2.3 2.7 3.0 2.9 2.9 2.5 2.8 3.3 2.7
## [103] 3.0 2.9 3.0 3.0 2.5 2.9 2.5 3.6 3.2 2.7 3.0 2.5 2.8 3.2 3.0 3.8 2.6
## [120] 2.2 3.2 2.8 2.8 2.7 3.3 3.2 2.8 3.0 2.8 3.0 2.8 3.8 2.8 2.8 2.6 3.0
## [137] 3.4 3.1 3.0 3.1 3.1 3.1 2.7 3.2 3.3 3.0 2.5 3.0 3.4 3.0

iris[, c("Sepal.Width", "Species")]

##     Sepal.Width    Species
## 1           3.5     setosa
## 2           3.0     setosa
## 3           3.2     setosa
## 4           3.1     setosa
## 5           3.6     setosa
## 6           3.9     setosa
## 7           3.4     setosa
## 8           3.4     setosa
## 9           2.9     setosa
## 10          3.1     setosa
## 11          3.7     setosa
## 12          3.4     setosa
## 13          3.0     setosa
## 14          3.0     setosa
## 15          4.0     setosa
## 16          4.4     setosa
## 17          3.9     setosa
## 18          3.5     setosa
## 19          3.8     setosa
## 20          3.8     setosa
## 21          3.4     setosa
## 22          3.7     setosa
## 23          3.6     setosa
## 24          3.3     setosa
## 25          3.4     setosa
## 26          3.0     setosa
## 27          3.4     setosa
## 28          3.5     setosa
## 29          3.4     setosa
## 30          3.2     setosa
## 31          3.1     setosa
## 32          3.4     setosa
## 33          4.1     setosa
## 34          4.2     setosa
## 35          3.1     setosa
## 36          3.2     setosa
## 37          3.5     setosa
## 38          3.6     setosa
## 39          3.0     setosa
## 40          3.4     setosa
## 41          3.5     setosa
## 42          2.3     setosa
## 43          3.2     setosa
## 44          3.5     setosa
## 45          3.8     setosa
## 46          3.0     setosa
## 47          3.8     setosa
## 48          3.2     setosa
## 49          3.7     setosa
## 50          3.3     setosa
## 51          3.2 versicolor
## 52          3.2 versicolor
## 53          3.1 versicolor
## 54          2.3 versicolor
## 55          2.8 versicolor
## 56          2.8 versicolor
## 57          3.3 versicolor
## 58          2.4 versicolor
## 59          2.9 versicolor
## 60          2.7 versicolor
## 61          2.0 versicolor
## 62          3.0 versicolor
## 63          2.2 versicolor
## 64          2.9 versicolor
## 65          2.9 versicolor
## 66          3.1 versicolor
## 67          3.0 versicolor
## 68          2.7 versicolor
## 69          2.2 versicolor
## 70          2.5 versicolor
## 71          3.2 versicolor
## 72          2.8 versicolor
## 73          2.5 versicolor
## 74          2.8 versicolor
## 75          2.9 versicolor
## 76          3.0 versicolor
## 77          2.8 versicolor
## 78          3.0 versicolor
## 79          2.9 versicolor
## 80          2.6 versicolor
## 81          2.4 versicolor
## 82          2.4 versicolor
## 83          2.7 versicolor
## 84          2.7 versicolor
## 85          3.0 versicolor
## 86          3.4 versicolor
## 87          3.1 versicolor
## 88          2.3 versicolor
## 89          3.0 versicolor
## 90          2.5 versicolor
## 91          2.6 versicolor
## 92          3.0 versicolor
## 93          2.6 versicolor
## 94          2.3 versicolor
## 95          2.7 versicolor
## 96          3.0 versicolor
## 97          2.9 versicolor
## 98          2.9 versicolor
## 99          2.5 versicolor
## 100         2.8 versicolor
## 101         3.3  virginica
## 102         2.7  virginica
## 103         3.0  virginica
## 104         2.9  virginica
## 105         3.0  virginica
## 106         3.0  virginica
## 107         2.5  virginica
## 108         2.9  virginica
## 109         2.5  virginica
## 110         3.6  virginica
## 111         3.2  virginica
## 112         2.7  virginica
## 113         3.0  virginica
## 114         2.5  virginica
## 115         2.8  virginica
## 116         3.2  virginica
## 117         3.0  virginica
## 118         3.8  virginica
## 119         2.6  virginica
## 120         2.2  virginica
## 121         3.2  virginica
## 122         2.8  virginica
## 123         2.8  virginica
## 124         2.7  virginica
## 125         3.3  virginica
## 126         3.2  virginica
## 127         2.8  virginica
## 128         3.0  virginica
## 129         2.8  virginica
## 130         3.0  virginica
## 131         2.8  virginica
## 132         3.8  virginica
## 133         2.8  virginica
## 134         2.8  virginica
## 135         2.6  virginica
## 136         3.0  virginica
## 137         3.4  virginica
## 138         3.1  virginica
## 139         3.0  virginica
## 140         3.1  virginica
## 141         3.1  virginica
## 142         3.1  virginica
## 143         2.7  virginica
## 144         3.2  virginica
## 145         3.3  virginica
## 146         3.0  virginica
## 147         2.5  virginica
## 148         3.0  virginica
## 149         3.4  virginica
## 150         3.0  virginica
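Bracket notation with a name is one of several equivalent ways to pick a single column. A short sketch of the alternatives in base R, using the same iris data:

```r
data(iris)

by_bracket <- iris[, "Sepal.Width"]    # numeric vector
by_dollar  <- iris$Sepal.Width         # same vector via $
by_double  <- iris[["Sepal.Width"]]    # same vector via [[ ]]

identical(by_bracket, by_dollar)
identical(by_bracket, by_double)

# drop = FALSE keeps a one-column data frame instead of a vector
one_col <- iris[, "Sepal.Width", drop = FALSE]
class(one_col)
```

Selecting two or more names, as in the second call above, always returns a data frame.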

Some more complex subsetting of a data frame. Select every third row:

# the following gets the job done, but there is lots of code within the brackets
# (the sequence starts at 3: R counts rows from 1, and an index of 0 selects nothing)
iris[seq(from = 3, to = nrow(iris), by = 3), ]

##     Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
## 3            4.7         3.2          1.3         0.2     setosa
## 6            5.4         3.9          1.7         0.4     setosa
## 9            4.4         2.9          1.4         0.2     setosa
## 12           4.8         3.4          1.6         0.2     setosa
## 15           5.8         4.0          1.2         0.2     setosa
## 18           5.1         3.5          1.4         0.3     setosa
## 21           5.4         3.4          1.7         0.2     setosa
## 24           5.1         3.3          1.7         0.5     setosa
## 27           5.0         3.4          1.6         0.4     setosa
## 30           4.7         3.2          1.6         0.2     setosa
## 33           5.2         4.1          1.5         0.1     setosa
## 36           5.0         3.2          1.2         0.2     setosa
## 39           4.4         3.0          1.3         0.2     setosa
## 42           4.5         2.3          1.3         0.3     setosa
## 45           5.1         3.8          1.9         0.4     setosa
## 48           4.6         3.2          1.4         0.2     setosa
## 51           7.0         3.2          4.7         1.4 versicolor
## 54           5.5         2.3          4.0         1.3 versicolor
## 57           6.3         3.3          4.7         1.6 versicolor
## 60           5.2         2.7          3.9         1.4 versicolor
## 63           6.0         2.2          4.0         1.0 versicolor
## 66           6.7         3.1          4.4         1.4 versicolor
## 69           6.2         2.2          4.5         1.5 versicolor
## 72           6.1         2.8          4.0         1.3 versicolor
## 75           6.4         2.9          4.3         1.3 versicolor
## 78           6.7         3.0          5.0         1.7 versicolor
## 81           5.5         2.4          3.8         1.1 versicolor
## 84           6.0         2.7          5.1         1.6 versicolor
## 87           6.7         3.1          4.7         1.5 versicolor
## 90           5.5         2.5          4.0         1.3 versicolor
## 93           5.8         2.6          4.0         1.2 versicolor
## 96           5.7         3.0          4.2         1.2 versicolor
## 99           5.1         2.5          3.0         1.1 versicolor
## 102          5.8         2.7          5.1         1.9  virginica
## 105          6.5         3.0          5.8         2.2  virginica
## 108          7.3         2.9          6.3         1.8  virginica
## 111          6.5         3.2          5.1         2.0  virginica
## 114          5.7         2.5          5.0         2.0  virginica
## 117          6.5         3.0          5.5         1.8  virginica
## 120          6.0         2.2          5.0         1.5  virginica
## 123          7.7         2.8          6.7         2.0  virginica
## 126          7.2         3.2          6.0         1.8  virginica
## 129          6.4         2.8          5.6         2.1  virginica
## 132          7.9         3.8          6.4         2.0  virginica
## 135          6.1         2.6          5.6         1.4  virginica
## 138          6.4         3.1          5.5         1.8  virginica
## 141          6.7         3.1          5.6         2.4  virginica
## 144          6.8         3.2          5.9         2.3  virginica
## 147          6.3         2.5          5.0         1.9  virginica
## 150          5.9         3.0          5.1         1.8  virginica

# create a vector with the elements that you want (we modularise the code)
my.index <- seq(from = 3, to = nrow(iris), by = 3)

# use "my.index" in the bracket statement
iris[my.index, ]

##     Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
## 3            4.7         3.2          1.3         0.2     setosa
## 6            5.4         3.9          1.7         0.4     setosa
## 9            4.4         2.9          1.4         0.2     setosa
## 12           4.8         3.4          1.6         0.2     setosa
## 15           5.8         4.0          1.2         0.2     setosa
## 18           5.1         3.5          1.4         0.3     setosa
## 21           5.4         3.4          1.7         0.2     setosa
## 24           5.1         3.3          1.7         0.5     setosa
## 27           5.0         3.4          1.6         0.4     setosa
## 30           4.7         3.2          1.6         0.2     setosa
## 33           5.2         4.1          1.5         0.1     setosa
## 36           5.0         3.2          1.2         0.2     setosa
## 39           4.4         3.0          1.3         0.2     setosa
## 42           4.5         2.3          1.3         0.3     setosa
## 45           5.1         3.8          1.9         0.4     setosa
## 48           4.6         3.2          1.4         0.2     setosa
## 51           7.0         3.2          4.7         1.4 versicolor
## 54           5.5         2.3          4.0         1.3 versicolor
## 57           6.3         3.3          4.7         1.6 versicolor
## 60           5.2         2.7          3.9         1.4 versicolor
## 63           6.0         2.2          4.0         1.0 versicolor
## 66           6.7         3.1          4.4         1.4 versicolor
## 69           6.2         2.2          4.5         1.5 versicolor
## 72           6.1         2.8          4.0         1.3 versicolor
## 75           6.4         2.9          4.3         1.3 versicolor
## 78           6.7         3.0          5.0         1.7 versicolor
## 81           5.5         2.4          3.8         1.1 versicolor
## 84           6.0         2.7          5.1         1.6 versicolor
## 87           6.7         3.1          4.7         1.5 versicolor
## 90           5.5         2.5          4.0         1.3 versicolor
## 93           5.8         2.6          4.0         1.2 versicolor
## 96           5.7         3.0          4.2         1.2 versicolor
## 99           5.1         2.5          3.0         1.1 versicolor
## 102          5.8         2.7          5.1         1.9  virginica
## 105          6.5         3.0          5.8         2.2  virginica
## 108          7.3         2.9          6.3         1.8  virginica
## 111          6.5         3.2          5.1         2.0  virginica
## 114          5.7         2.5          5.0         2.0  virginica
## 117          6.5         3.0          5.5         1.8  virginica
## 120          6.0         2.2          5.0         1.5  virginica
## 123          7.7         2.8          6.7         2.0  virginica
## 126          7.2         3.2          6.0         1.8  virginica
## 129          6.4         2.8          5.6         2.1  virginica
## 132          7.9         3.8          6.4         2.0  virginica
## 135          6.1         2.6          5.6         1.4  virginica
## 138          6.4         3.1          5.5         1.8  virginica
## 141          6.7         3.1          5.6         2.4  virginica
## 144          6.8         3.2          5.9         2.3  virginica
## 147          6.3         2.5          5.0         1.9  virginica
## 150          5.9         3.0          5.1         1.8  virginica
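The index vector used for this selection can be checked in a small sketch. It relies on the fact that R silently drops an index of 0, so a sequence starting at 0 or at 3 selects the same rows. The object names (idx.from.zero, idx.from.three, every.third) are made up for this illustration.

```r
# An index of 0 selects nothing, so both sequences pick rows 3, 6, 9, ...
idx.from.zero  <- seq(from = 0, to = nrow(iris), by = 3)
idx.from.three <- seq(from = 3, to = nrow(iris), by = 3)
identical(rownames(iris[idx.from.zero, ]),
          rownames(iris[idx.from.three, ]))

# A logical index built with the modulo operator %% selects the same rows
every.third <- seq_len(nrow(iris)) %% 3 == 0
all.equal(iris[every.third, ], iris[idx.from.three, ])
```

The logical variant can be handy when the selection rule is a condition rather than a list of row numbers.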

Take a random sample of 25 rows from iris

iris[sample(1:nrow(iris), 25), ]

##     Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
## 119          7.7         2.6          6.9         2.3  virginica
## 112          6.4         2.7          5.3         1.9  virginica
## 8            5.0         3.4          1.5         0.2     setosa
## 150          5.9         3.0          5.1         1.8  virginica
## 142          6.9         3.1          5.1         2.3  virginica
## 44           5.0         3.5          1.6         0.6     setosa
## 16           5.7         4.4          1.5         0.4     setosa
## 115          5.8         2.8          5.1         2.4  virginica
## 140          6.9         3.1          5.4         2.1  virginica
## 40           5.1         3.4          1.5         0.2     setosa
## 135          6.1         2.6          5.6         1.4  virginica
## 64           6.1         2.9          4.7         1.4 versicolor
## 80           5.7         2.6          3.5         1.0 versicolor
## 110          7.2         3.6          6.1         2.5  virginica
## 70           5.6         2.5          3.9         1.1 versicolor
## 104          6.3         2.9          5.6         1.8  virginica
## 15           5.8         4.0          1.2         0.2     setosa
## 12           4.8         3.4          1.6         0.2     setosa
## 132          7.9         3.8          6.4         2.0  virginica
## 31           4.8         3.1          1.6         0.2     setosa
## 101          6.3         3.3          6.0         2.5  virginica
## 147          6.3         2.5          5.0         1.9  virginica
## 21           5.4         3.4          1.7         0.2     setosa
## 59           6.6         2.9          4.6         1.3 versicolor
## 92           6.1         3.0          4.6         1.4 versicolor
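A random sample differs on every run. If the same rows are needed again (for example to reproduce a result), one option is to fix the random seed with set.seed before sampling. The seed value 42 below is an arbitrary choice for this sketch.

```r
# Fixing the seed makes the "random" selection repeatable
set.seed(42)   # arbitrary seed, any integer works
first.draw <- sample(seq_len(nrow(iris)), 25)

set.seed(42)   # same seed again
second.draw <- sample(seq_len(nrow(iris)), 25)

# Both draws contain the same row numbers in the same order
identical(first.draw, second.draw)

iris[first.draw, ]
```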

Apply a function

The “apply” family of functions is large and very complex.

As iris is a dataframe (which itself is just a list of vectors), lapply can be used to apply a function to each element of that list, i.e. to each column.

out    <- lapply(iris[, 3:4], mean) # arithmetic mean

out

## $Petal.Length
## [1] 3.758
## 
## $Petal.Width
## [1] 1.199

out.sd <- lapply(iris[, 3:4], sd)   # standard deviation

out.sd

## $Petal.Length
## [1] 1.765
## 
## $Petal.Width
## [1] 0.7622
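That a dataframe is a list with one element per column can be verified directly. The short sketch below only uses base R functions; the object name col.classes is made up for the illustration.

```r
# A dataframe is a list: one element per column
is.list(iris)
length(iris)    # number of columns
names(iris)     # names of the list elements, i.e. the column names

# lapply therefore visits one column at a time,
# here to report the class of each column
col.classes <- lapply(iris, class)
col.classes
```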

Working with missing values

Missing values can cause many headaches, but there are functions available to identify and handle them.

Copy the iris data to a new object

my.iris <- iris

# introduce a missing value in the iris data
# missing values are represented by NA
# NA is not text; it is defined as an internal "logical" constant.
# see ?NA

my.iris[1, 3] <- NA
out <- lapply(my.iris[, 3:4], mean)

# as there is one "NA" value, the mean is "NA" as well:
out

## $Petal.Length
## [1] NA
## 
## $Petal.Width
## [1] 1.199

Calculations with missing values

To calculate the mean for columns with missing values, the mean function needs to be told explicitly to remove NA from the sample.

When options are passed on to a function, the statement becomes more complex:

out <- lapply(my.iris[, 3:4], function(x) mean(x, na.rm = TRUE))
out

## $Petal.Length
## [1] 3.774
## 
## $Petal.Width
## [1] 1.199

# there is a shorthand for this fortunately:
# in many cases, options for the functions can be passed through
out <- lapply(my.iris[, 3:4], mean, na.rm = TRUE)
out

## $Petal.Length
## [1] 3.774
## 
## $Petal.Width
## [1] 1.199

# But the long form with the additional function(x) is useful on many occasions!
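As an illustration of both forms, the sketch below repeats the idea with median instead of mean and then shows a case where the long form is needed because two functions are combined. It recreates my.iris so that it runs on its own; the names med.short, med.long and value.range are made up for this example.

```r
my.iris <- iris
my.iris[1, 3] <- NA

# short form: arguments after the function name are passed on to median
med.short <- lapply(my.iris[, 3:4], median, na.rm = TRUE)

# long form: same result, written with an explicit function(x)
med.long <- lapply(my.iris[, 3:4], function(x) median(x, na.rm = TRUE))
identical(med.short, med.long)

# the long form is needed when several steps are combined,
# e.g. the difference between largest and smallest value per column
value.range <- lapply(my.iris[, 3:4],
                      function(x) max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
value.range
```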

Checking for missing values with “is.na”

Count the number of “NA” in columns

# results in either TRUE or FALSE
is.na(iris[1,1])

## [1] FALSE

is.na(my.iris[1,3])

## [1] TRUE

sapply returns the result of an “lapply” call in a simplified format. sapply can be used in place of lapply; the only difference is the output format.

Compare the two statements below. See ?lapply

lapply(my.iris[, 1:5], function(x) sum(is.na(x))) # returns a list

## $Sepal.Length
## [1] 0
## 
## $Sepal.Width
## [1] 0
## 
## $Petal.Length
## [1] 1
## 
## $Petal.Width
## [1] 0
## 
## $Species
## [1] 0

sapply(my.iris[, 1:5], function(x) sum(is.na(x))) # returns named integers (vector)

## Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species 
##            0            0            1            0            0
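Two base R alternatives for the same count are sketched below: vapply, which works like sapply but requires the expected type of each result to be stated, and colSums applied to the logical matrix returned by is.na. The sketch recreates my.iris so that it runs on its own; the result names are made up for the illustration.

```r
my.iris <- iris
my.iris[1, 3] <- NA

# sapply: simplifies the result to a named vector
na.sapply <- sapply(my.iris, function(x) sum(is.na(x)))

# vapply: the same, but the result type is declared (one integer per column)
na.vapply <- vapply(my.iris, function(x) sum(is.na(x)), FUN.VALUE = integer(1))

# colSums: is.na() on a dataframe gives a logical matrix,
# and TRUE counts as 1 when summed
na.colsums <- colSums(is.na(my.iris))

na.sapply
na.vapply
na.colsums
```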

# get rid of rows that have any "NA"
my.iris.com <- na.omit(my.iris)

# check the number of rows
nrow(my.iris.com)

## [1] 149

How many rows did the original iris data have in comparison? Find a way to check!

Create a logical matrix that indicates “NA” for each individual element

is.na(my.iris[1:nrow(my.iris), ])

##     Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          FALSE       FALSE         TRUE       FALSE   FALSE
## 2          FALSE       FALSE        FALSE       FALSE   FALSE
## 3          FALSE       FALSE        FALSE       FALSE   FALSE
## 4          FALSE       FALSE        FALSE       FALSE   FALSE
## 5          FALSE       FALSE        FALSE       FALSE   FALSE
## 6          FALSE       FALSE        FALSE       FALSE   FALSE
## 7          FALSE       FALSE        FALSE       FALSE   FALSE
## 8          FALSE       FALSE        FALSE       FALSE   FALSE
## 9          FALSE       FALSE        FALSE       FALSE   FALSE
## 10         FALSE       FALSE        FALSE       FALSE   FALSE
## 11         FALSE       FALSE        FALSE       FALSE   FALSE
## 12         FALSE       FALSE        FALSE       FALSE   FALSE
## 13         FALSE       FALSE        FALSE       FALSE   FALSE
## 14         FALSE       FALSE        FALSE       FALSE   FALSE
## 15         FALSE       FALSE        FALSE       FALSE   FALSE
## 16         FALSE       FALSE        FALSE       FALSE   FALSE
## 17         FALSE       FALSE        FALSE       FALSE   FALSE
## 18         FALSE       FALSE        FALSE       FALSE   FALSE
## 19         FALSE       FALSE        FALSE       FALSE   FALSE
## 20         FALSE       FALSE        FALSE       FALSE   FALSE
## 21         FALSE       FALSE        FALSE       FALSE   FALSE
## 22         FALSE       FALSE        FALSE       FALSE   FALSE
## 23         FALSE       FALSE        FALSE       FALSE   FALSE
## 24         FALSE       FALSE        FALSE       FALSE   FALSE
## 25         FALSE       FALSE        FALSE       FALSE   FALSE
## 26         FALSE       FALSE        FALSE       FALSE   FALSE
## 27         FALSE       FALSE        FALSE       FALSE   FALSE
## 28         FALSE       FALSE        FALSE       FALSE   FALSE
## 29         FALSE       FALSE        FALSE       FALSE   FALSE
## 30         FALSE       FALSE        FALSE       FALSE   FALSE
## 31         FALSE       FALSE        FALSE       FALSE   FALSE
## 32         FALSE       FALSE        FALSE       FALSE   FALSE
## 33         FALSE       FALSE        FALSE       FALSE   FALSE
## 34         FALSE       FALSE        FALSE       FALSE   FALSE
## 35         FALSE       FALSE        FALSE       FALSE   FALSE
## 36         FALSE       FALSE        FALSE       FALSE   FALSE
## 37         FALSE       FALSE        FALSE       FALSE   FALSE
## 38         FALSE       FALSE        FALSE       FALSE   FALSE
## 39         FALSE       FALSE        FALSE       FALSE   FALSE
## 40         FALSE       FALSE        FALSE       FALSE   FALSE
## 41         FALSE       FALSE        FALSE       FALSE   FALSE
## 42         FALSE       FALSE        FALSE       FALSE   FALSE
## 43         FALSE       FALSE        FALSE       FALSE   FALSE
## 44         FALSE       FALSE        FALSE       FALSE   FALSE
## 45         FALSE       FALSE        FALSE       FALSE   FALSE
## 46         FALSE       FALSE        FALSE       FALSE   FALSE
## 47         FALSE       FALSE        FALSE       FALSE   FALSE
## 48         FALSE       FALSE        FALSE       FALSE   FALSE
## 49         FALSE       FALSE        FALSE       FALSE   FALSE
## 50         FALSE       FALSE        FALSE       FALSE   FALSE
## 51         FALSE       FALSE        FALSE       FALSE   FALSE
## 52         FALSE       FALSE        FALSE       FALSE   FALSE
## 53         FALSE       FALSE        FALSE       FALSE   FALSE
## 54         FALSE       FALSE        FALSE       FALSE   FALSE
## 55         FALSE       FALSE        FALSE       FALSE   FALSE
## 56         FALSE       FALSE        FALSE       FALSE   FALSE
## 57         FALSE       FALSE        FALSE       FALSE   FALSE
## 58         FALSE       FALSE        FALSE       FALSE   FALSE
## 59         FALSE       FALSE        FALSE       FALSE   FALSE
## 60         FALSE       FALSE        FALSE       FALSE   FALSE
## 61         FALSE       FALSE        FALSE       FALSE   FALSE
## 62         FALSE       FALSE        FALSE       FALSE   FALSE
## 63         FALSE       FALSE        FALSE       FALSE   FALSE
## 64         FALSE       FALSE        FALSE       FALSE   FALSE
## 65         FALSE       FALSE        FALSE       FALSE   FALSE
## 66         FALSE       FALSE        FALSE       FALSE   FALSE
## 67         FALSE       FALSE        FALSE       FALSE   FALSE
## 68         FALSE       FALSE        FALSE       FALSE   FALSE
## 69         FALSE       FALSE        FALSE       FALSE   FALSE
## 70         FALSE       FALSE        FALSE       FALSE   FALSE
## 71         FALSE       FALSE        FALSE       FALSE   FALSE
## 72         FALSE       FALSE        FALSE       FALSE   FALSE
## 73         FALSE       FALSE        FALSE       FALSE   FALSE
## 74         FALSE       FALSE        FALSE       FALSE   FALSE
## 75         FALSE       FALSE        FALSE       FALSE   FALSE
## 76         FALSE       FALSE        FALSE       FALSE   FALSE
## 77         FALSE       FALSE        FALSE       FALSE   FALSE
## 78         FALSE       FALSE        FALSE       FALSE   FALSE
## 79         FALSE       FALSE        FALSE       FALSE   FALSE
## 80         FALSE       FALSE        FALSE       FALSE   FALSE
## 81         FALSE       FALSE        FALSE       FALSE   FALSE
## 82         FALSE       FALSE        FALSE       FALSE   FALSE
## 83         FALSE       FALSE        FALSE       FALSE   FALSE
## 84         FALSE       FALSE        FALSE       FALSE   FALSE
## 85         FALSE       FALSE        FALSE       FALSE   FALSE
## 86         FALSE       FALSE        FALSE       FALSE   FALSE
## 87         FALSE       FALSE        FALSE       FALSE   FALSE
## 88         FALSE       FALSE        FALSE       FALSE   FALSE
## 89         FALSE       FALSE        FALSE       FALSE   FALSE
## 90         FALSE       FALSE        FALSE       FALSE   FALSE
## 91         FALSE       FALSE        FALSE       FALSE   FALSE
## 92         FALSE       FALSE        FALSE       FALSE   FALSE
## 93         FALSE       FALSE        FALSE       FALSE   FALSE
## 94         FALSE       FALSE        FALSE       FALSE   FALSE
## 95         FALSE       FALSE        FALSE       FALSE   FALSE
## 96         FALSE       FALSE        FALSE       FALSE   FALSE
## 97         FALSE       FALSE        FALSE       FALSE   FALSE
## 98         FALSE       FALSE        FALSE       FALSE   FALSE
## 99         FALSE       FALSE        FALSE       FALSE   FALSE
## 100        FALSE       FALSE        FALSE       FALSE   FALSE
## 101        FALSE       FALSE        FALSE       FALSE   FALSE
## 102        FALSE       FALSE        FALSE       FALSE   FALSE
## 103        FALSE       FALSE        FALSE       FALSE   FALSE
## 104        FALSE       FALSE        FALSE       FALSE   FALSE
## 105        FALSE       FALSE        FALSE       FALSE   FALSE
## 106        FALSE       FALSE        FALSE       FALSE   FALSE
## 107        FALSE       FALSE        FALSE       FALSE   FALSE
## 108        FALSE       FALSE        FALSE       FALSE   FALSE
## 109        FALSE       FALSE        FALSE       FALSE   FALSE
## 110        FALSE       FALSE        FALSE       FALSE   FALSE
## 111        FALSE       FALSE        FALSE       FALSE   FALSE
## 112        FALSE       FALSE        FALSE       FALSE   FALSE
## 113        FALSE       FALSE        FALSE       FALSE   FALSE
## 114        FALSE       FALSE        FALSE       FALSE   FALSE
## 115        FALSE       FALSE        FALSE       FALSE   FALSE
## 116        FALSE       FALSE        FALSE       FALSE   FALSE
## 117        FALSE       FALSE        FALSE       FALSE   FALSE
## 118        FALSE       FALSE        FALSE       FALSE   FALSE
## 119        FALSE       FALSE        FALSE       FALSE   FALSE
## 120        FALSE       FALSE        FALSE       FALSE   FALSE
## 121        FALSE       FALSE        FALSE       FALSE   FALSE
## 122        FALSE       FALSE        FALSE       FALSE   FALSE
## 123        FALSE       FALSE        FALSE       FALSE   FALSE
## 124        FALSE       FALSE        FALSE       FALSE   FALSE
## 125        FALSE       FALSE        FALSE       FALSE   FALSE
## 126        FALSE       FALSE        FALSE       FALSE   FALSE
## 127        FALSE       FALSE        FALSE       FALSE   FALSE
## 128        FALSE       FALSE        FALSE       FALSE   FALSE
## 129        FALSE       FALSE        FALSE       FALSE   FALSE
## 130        FALSE       FALSE        FALSE       FALSE   FALSE
## 131        FALSE       FALSE        FALSE       FALSE   FALSE
## 132        FALSE       FALSE        FALSE       FALSE   FALSE
## 133        FALSE       FALSE        FALSE       FALSE   FALSE
## 134        FALSE       FALSE        FALSE       FALSE   FALSE
## 135        FALSE       FALSE        FALSE       FALSE   FALSE
## 136        FALSE       FALSE        FALSE       FALSE   FALSE
## 137        FALSE       FALSE        FALSE       FALSE   FALSE
## 138        FALSE       FALSE        FALSE       FALSE   FALSE
## 139        FALSE       FALSE        FALSE       FALSE   FALSE
## 140        FALSE       FALSE        FALSE       FALSE   FALSE
## 141        FALSE       FALSE        FALSE       FALSE   FALSE
## 142        FALSE       FALSE        FALSE       FALSE   FALSE
## 143        FALSE       FALSE        FALSE       FALSE   FALSE
## 144        FALSE       FALSE        FALSE       FALSE   FALSE
## 145        FALSE       FALSE        FALSE       FALSE   FALSE
## 146        FALSE       FALSE        FALSE       FALSE   FALSE
## 147        FALSE       FALSE        FALSE       FALSE   FALSE
## 148        FALSE       FALSE        FALSE       FALSE   FALSE
## 149        FALSE       FALSE        FALSE       FALSE   FALSE
## 150        FALSE       FALSE        FALSE       FALSE   FALSE

# to remove rows that have "NA" in a specific column
my.iris.com <- my.iris[!is.na(my.iris$Petal.Length), ]
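Another base R function for this task is complete.cases, which returns TRUE for every row without missing values. The sketch below shows it for all columns and for a chosen subset of columns. It recreates my.iris so that it runs on its own; all.cols and some.cols are names made up for this example.

```r
my.iris <- iris
my.iris[1, 3] <- NA

# keep rows that are complete in every column (similar to na.omit)
all.cols <- my.iris[complete.cases(my.iris), ]

# keep rows that are complete in the two sepal columns only;
# the NA in Petal.Length does not matter for this check
some.cols <- my.iris[complete.cases(my.iris[, c("Sepal.Length", "Sepal.Width")]), ]

nrow(all.cols)
nrow(some.cols)
```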

Aggregate data

aggregate(x = iris, by=list(iris$Species), FUN = mean)

## Warning: argument is not numeric or logical: returning NA
## Warning: argument is not numeric or logical: returning NA
## Warning: argument is not numeric or logical: returning NA

##      Group.1 Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1     setosa        5.006       3.428        1.462       0.246      NA
## 2 versicolor        5.936       2.770        4.260       1.326      NA
## 3  virginica        6.588       2.974        5.552       2.026      NA

# the above "aggregate" gives warnings for non-numeric elements
# see the warnings "returning NA" for the elements in "Species"
# i.e. it is not possible to calculate a mean value from species names.

# Please note as well, that missing values result in a missing mean
aggregate(x = my.iris, by=list(iris$Species), FUN = mean)

## Warning: argument is not numeric or logical: returning NA
## Warning: argument is not numeric or logical: returning NA
## Warning: argument is not numeric or logical: returning NA

##      Group.1 Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1     setosa        5.006       3.428           NA       0.246      NA
## 2 versicolor        5.936       2.770        4.260       1.326      NA
## 3  virginica        6.588       2.974        5.552       2.026      NA

But “aggregate” allows additional arguments to be passed on to the function. This is indicated in ?aggregate by the “…” argument: “further arguments passed to or used by methods”.

So, the following works for missing numerics.

aggregate(x = my.iris, by=list(iris$Species), FUN = mean, na.rm = TRUE)

## Warning: argument is not numeric or logical: returning NA
## Warning: argument is not numeric or logical: returning NA
## Warning: argument is not numeric or logical: returning NA

##      Group.1 Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1     setosa        5.006       3.428        1.463       0.246      NA
## 2 versicolor        5.936       2.770        4.260       1.326      NA
## 3  virginica        6.588       2.974        5.552       2.026      NA

# of course, the complex statement used before works here as well:
aggregate(x = my.iris, by=list(iris$Species), FUN = function(x) mean(x, na.rm = TRUE))

## Warning: argument is not numeric or logical: returning NA
## Warning: argument is not numeric or logical: returning NA
## Warning: argument is not numeric or logical: returning NA

##      Group.1 Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1     setosa        5.006       3.428        1.463       0.246      NA
## 2 versicolor        5.936       2.770        4.260       1.326      NA
## 3  virginica        6.588       2.974        5.552       2.026      NA

But there are still warnings for the Species names.

How to apply “aggregate” to numeric elements only? The simple way is to manually indicate the columns that are numeric, but is there a “better” way?

# First, figure out which elements of iris are numeric:
# check for each column whether it is a factor
col.num <- sapply(iris[, 1:ncol(iris)], is.factor)
col.num

## Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species 
##        FALSE        FALSE        FALSE        FALSE         TRUE

# keep only the numeric columns in this object
# "is.factor" results in FALSE for numeric elements, therefore the elements with "is.factor == FALSE" are the ones we want to keep
# "which" returns the positions of the elements for which a test returns TRUE
# return the elements of col.num that have a value of "FALSE"
col.num <- which(col.num == FALSE)

# how to use information from one object to select stuff from another:
# Here we use the names of the values in "col.num" in a selection statement
# See ?match or ?'%in%'
# Value matching via %in% "returns a logical vector indicating if there is a match or not for its left operand" (from ?'%in%').
aggregate(x = iris[, names(iris) %in% names(col.num)],
           by = list(iris$Species),
           FUN = mean)

##      Group.1 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1     setosa        5.006       3.428        1.462       0.246
## 2 versicolor        5.936       2.770        4.260       1.326
## 3  virginica        6.588       2.974        5.552       2.026
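Two other ways to restrict the aggregation to numeric columns are sketched here: testing the columns with is.numeric directly, and the formula interface of aggregate. Naming the element of the by list (Species = ...) also replaces the generic "Group.1" header. The object names num.cols, agg.a and agg.b are made up for this illustration.

```r
# TRUE for every numeric column, FALSE otherwise
num.cols <- sapply(iris, is.numeric)

# variant a: logical column selection, named grouping variable
agg.a <- aggregate(x   = iris[, num.cols],
                   by  = list(Species = iris$Species),
                   FUN = mean)

# variant b: formula interface, "." stands for all other columns in data
# (note: the formula method drops rows with NA by default, see ?aggregate)
agg.b <- aggregate(. ~ Species,
                   data = iris[, c(names(iris)[num.cols], "Species")],
                   FUN  = mean)

agg.a
agg.b
```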

Adding a variable to the data

Recoding existing data into a new element in the dataframe

Example with an ifelse statement.

If the test evaluates to TRUE, the first option is used; otherwise the second option is used. See the help file ?ifelse.

ifelse(5 > 8, "no, five is not larger than 8", "yes, 5 is smaller than eight")

## [1] "yes, 5 is smaller than eight"

ifelse(iris$Species == "virginica", "yes", "not virginica")

##   [1] "not virginica" "not virginica" "not virginica" "not virginica"
##   [5] "not virginica" "not virginica" "not virginica" "not virginica"
##   [9] "not virginica" "not virginica" "not virginica" "not virginica"
##  [13] "not virginica" "not virginica" "not virginica" "not virginica"
##  [17] "not virginica" "not virginica" "not virginica" "not virginica"
##  [21] "not virginica" "not virginica" "not virginica" "not virginica"
##  [25] "not virginica" "not virginica" "not virginica" "not virginica"
##  [29] "not virginica" "not virginica" "not virginica" "not virginica"
##  [33] "not virginica" "not virginica" "not virginica" "not virginica"
##  [37] "not virginica" "not virginica" "not virginica" "not virginica"
##  [41] "not virginica" "not virginica" "not virginica" "not virginica"
##  [45] "not virginica" "not virginica" "not virginica" "not virginica"
##  [49] "not virginica" "not virginica" "not virginica" "not virginica"
##  [53] "not virginica" "not virginica" "not virginica" "not virginica"
##  [57] "not virginica" "not virginica" "not virginica" "not virginica"
##  [61] "not virginica" "not virginica" "not virginica" "not virginica"
##  [65] "not virginica" "not virginica" "not virginica" "not virginica"
##  [69] "not virginica" "not virginica" "not virginica" "not virginica"
##  [73] "not virginica" "not virginica" "not virginica" "not virginica"
##  [77] "not virginica" "not virginica" "not virginica" "not virginica"
##  [81] "not virginica" "not virginica" "not virginica" "not virginica"
##  [85] "not virginica" "not virginica" "not virginica" "not virginica"
##  [89] "not virginica" "not virginica" "not virginica" "not virginica"
##  [93] "not virginica" "not virginica" "not virginica" "not virginica"
##  [97] "not virginica" "not virginica" "not virginica" "not virginica"
## [101] "yes"           "yes"           "yes"           "yes"          
## [105] "yes"           "yes"           "yes"           "yes"          
## [109] "yes"           "yes"           "yes"           "yes"          
## [113] "yes"           "yes"           "yes"           "yes"          
## [117] "yes"           "yes"           "yes"           "yes"          
## [121] "yes"           "yes"           "yes"           "yes"          
## [125] "yes"           "yes"           "yes"           "yes"          
## [129] "yes"           "yes"           "yes"           "yes"          
## [133] "yes"           "yes"           "yes"           "yes"          
## [137] "yes"           "yes"           "yes"           "yes"          
## [141] "yes"           "yes"           "yes"           "yes"          
## [145] "yes"           "yes"           "yes"           "yes"          
## [149] "yes"           "yes"
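A long vector like this is easier to check when it is summarised. As a small sketch, table counts how often each value occurs, and a two-way table confirms that the recoding matches the species. The name is.virginica is made up for this example.

```r
# store the recoded values instead of printing all 150 of them
is.virginica <- ifelse(iris$Species == "virginica", "yes", "not virginica")

# count the occurrences of each value
table(is.virginica)

# cross-check against the original species names
table(iris$Species, is.virginica)
```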

iris$my.factor <- ifelse(iris$Petal.Width > 1.5,
                        "Large Petal",
                        "Small Petal")

# check the modified dataframe                      
summary(iris)

##   Sepal.Length   Sepal.Width    Petal.Length   Petal.Width 
##  Min.   :4.30   Min.   :2.00   Min.   :1.00   Min.   :0.1  
##  1st Qu.:5.10   1st Qu.:2.80   1st Qu.:1.60   1st Qu.:0.3  
##  Median :5.80   Median :3.00   Median :4.35   Median :1.3  
##  Mean   :5.84   Mean   :3.06   Mean   :3.76   Mean   :1.2  
##  3rd Qu.:6.40   3rd Qu.:3.30   3rd Qu.:5.10   3rd Qu.:1.8  
##  Max.   :7.90   Max.   :4.40   Max.   :6.90   Max.   :2.5  
##        Species    my.factor        
##  setosa    :50   Length:150        
##  versicolor:50   Class :character  
##  virginica :50   Mode  :character  
##                                    
##                                    
##

# recode the characters to factor
iris$my.factor <- as.factor(iris$my.factor)

# check again
summary(iris)

##   Sepal.Length   Sepal.Width    Petal.Length   Petal.Width 
##  Min.   :4.30   Min.   :2.00   Min.   :1.00   Min.   :0.1  
##  1st Qu.:5.10   1st Qu.:2.80   1st Qu.:1.60   1st Qu.:0.3  
##  Median :5.80   Median :3.00   Median :4.35   Median :1.3  
##  Mean   :5.84   Mean   :3.06   Mean   :3.76   Mean   :1.2  
##  3rd Qu.:6.40   3rd Qu.:3.30   3rd Qu.:5.10   3rd Qu.:1.8  
##  Max.   :7.90   Max.   :4.40   Max.   :6.90   Max.   :2.5  
##        Species         my.factor 
##  setosa    :50   Large Petal:52  
##  versicolor:50   Small Petal:98  
##  virginica :50                   
##                                  
##                                  
##

# aggregate over two factors
aggregate(x   = iris[, names(iris) %in% names(col.num)],
          by  = list(iris$Species, iris$my.factor),
          FUN = mean)

##      Group.1     Group.2 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 versicolor Large Petal        6.180       3.120        4.820       1.660
## 2  virginica Large Petal        6.617       3.002        5.572       2.062
## 3     setosa Small Petal        5.006       3.428        1.462       0.246
## 4 versicolor Small Petal        5.909       2.731        4.198       1.289
## 5  virginica Small Petal        6.133       2.533        5.233       1.467

# Using another function: 
aggregate(x   = iris[, names(iris) %in% names(col.num)],
          by  = list(iris$Species, iris$my.factor),
          FUN = sd)

##      Group.1     Group.2 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 versicolor Large Petal       0.3271      0.2775       0.2387     0.08944
## 2  virginica Large Petal       0.6445      0.3054       0.5594     0.24187
## 3     setosa Small Petal       0.3525      0.3791       0.1737     0.10539
## 4 versicolor Small Petal       0.5286      0.2953       0.4485     0.16952
## 5  virginica Small Petal       0.1528      0.3055       0.3215     0.05774

Wait, how to calculate the standard error?

se = sd / sqrt(n)

From http://cran.r-project.org/doc/manuals/R-intro.html: “[…] Suppose further we needed to calculate the standard errors of the state income means. To do this we need to write an R function to calculate the standard error for any given vector. Since there is an builtin function var() to calculate the sample variance, such a function is a very simple one liner, specified by the assignment:”

stderr <- function(x) sqrt(var(x)/length(x))

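# But we have to account for potential missing values!
# by the way, var is the square of sd (see ?var)
# note: defining "stderr" masks the base R function of the same name in this session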
stderr <- function(x) {
              sqrt(var(x[!is.na(x)]) / length(x[!is.na(x)]))
              }

# the same utilising "sd". The previous function is the one given in the R introduction manual quoted above.
stde.2 <-  function(x) {
              sd(x[!is.na(x)]) / sqrt(length(x[!is.na(x)]))
              }
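A quick way to convince yourself that the var-based and the sd-based versions agree is to run both on a small vector that contains a missing value. The sketch below uses the hypothetical names se.var, se.sd and se.short, which also avoids masking the base R function stderr.

```r
# standard error via the variance, ignoring NA
se.var <- function(x) sqrt(var(x[!is.na(x)]) / length(x[!is.na(x)]))

# standard error via the standard deviation, ignoring NA
se.sd <- function(x) sd(x[!is.na(x)]) / sqrt(length(x[!is.na(x)]))

# same calculation, removing the NA values only once
se.short <- function(x) {
  x <- x[!is.na(x)]
  sd(x) / sqrt(length(x))
}

# test vector: ten petal lengths plus one missing value
x <- c(iris$Petal.Length[1:10], NA)

all.equal(se.var(x), se.sd(x))
all.equal(se.sd(x), se.short(x))
```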

# use the self-defined functions
aggregate(x   = iris[, names(iris) %in% names(col.num)],
          by  = list(iris$Species, iris$my.factor),
          FUN = stderr)

##      Group.1     Group.2 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 versicolor Large Petal      0.14629     0.12410      0.10677     0.04000
## 2  virginica Large Petal      0.09401     0.04454      0.08159     0.03528
## 3     setosa Small Petal      0.04985     0.05361      0.02456     0.01490
## 4 versicolor Small Petal      0.07881     0.04402      0.06685     0.02527
## 5  virginica Small Petal      0.08819     0.17638      0.18559     0.03333

aggregate(x   = iris[, names(iris) %in% names(col.num)],
          by  = list(iris$Species, iris$my.factor),
          FUN = stde.2)

##      Group.1     Group.2 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 versicolor Large Petal      0.14629     0.12410      0.10677     0.04000
## 2  virginica Large Petal      0.09401     0.04454      0.08159     0.03528
## 3     setosa Small Petal      0.04985     0.05361      0.02456     0.01490
## 4 versicolor Small Petal      0.07881     0.04402      0.06685     0.02527
## 5  virginica Small Petal      0.08819     0.17638      0.18559     0.03333

# Keep the output of the aggregate function in the new object "my.res"
my.res <- aggregate(x   = iris[, names(iris) %in% names(col.num)],
            by  = list(iris$Species, iris$my.factor),
            FUN = stderr)

my.res

##      Group.1     Group.2 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 versicolor Large Petal      0.14629     0.12410      0.10677     0.04000
## 2  virginica Large Petal      0.09401     0.04454      0.08159     0.03528
## 3     setosa Small Petal      0.04985     0.05361      0.02456     0.01490
## 4 versicolor Small Petal      0.07881     0.04402      0.06685     0.02527
## 5  virginica Small Petal      0.08819     0.17638      0.18559     0.03333

# Question: Is it possible to round the results?
# See the helpfile ?round. There are several options for rounding.

# round only works with numeric data.
# here we use the square brackets to select the columns three to six only, 
# then we round all data to three digits and put the result into the object "my.res.round"
my.res.round <- sapply(my.res[, 3:6], 
                       function(x) round(x, digits = 3)) 

# here, "sapply" returns a matrix (all results have the same length). If a dataframe is preferred, the conversion is easy:
my.res.round <- as.data.frame(my.res.round)

# how to get the information on groups and petal size into this data frame?
# we just copy the columns that we want and add them to the rounded data frame:
my.res.round$Species <- my.res$Group.1
my.res.round$Petal.size <- my.res$Group.2

head(my.res.round)

##   Sepal.Length Sepal.Width Petal.Length Petal.Width    Species  Petal.size
## 1        0.146       0.124        0.107       0.040 versicolor Large Petal
## 2        0.094       0.045        0.082       0.035  virginica Large Petal
## 3        0.050       0.054        0.025       0.015     setosa Small Petal
## 4        0.079       0.044        0.067       0.025 versicolor Small Petal
## 5        0.088       0.176        0.186       0.033  virginica Small Petal

# how to re-order the columns in a data frame?
# this is done with square brackets again.
my.res.round[, c(5, 6, 1:4)]

##      Species  Petal.size Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 versicolor Large Petal        0.146       0.124        0.107       0.040
## 2  virginica Large Petal        0.094       0.045        0.082       0.035
## 3     setosa Small Petal        0.050       0.054        0.025       0.015
## 4 versicolor Small Petal        0.079       0.044        0.067       0.025
## 5  virginica Small Petal        0.088       0.176        0.186       0.033
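A possible shortcut, sketched below, is to round the numeric columns in place, so that the grouping columns never leave the dataframe and nothing has to be copied back or re-ordered. The example builds its own small result table; se, res and is.num are names made up for this illustration.

```r
# standard error that ignores missing values
se <- function(x) sd(x, na.rm = TRUE) / sqrt(sum(!is.na(x)))

# standard error per species for the four measurement columns
res <- aggregate(x = iris[, 1:4], by = list(iris$Species), FUN = se)

# find the numeric columns and overwrite them with their rounded values;
# round() accepts a dataframe as long as all its columns are numeric
is.num <- sapply(res, is.numeric)
res[, is.num] <- round(res[, is.num], digits = 3)

res
```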

TO DO

Find out how to store the results of the stderr aggregation in a new object.

Find out how to rename the generic names “Group.1” etc to something more meaningful in the new object.

Find out how to do an anova on Species and Petal size category for the original iris data.
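As a starting point for the renaming task, here is one possible approach, not the only one: the grouping variables can be named inside the by list, or the column names can be changed afterwards with names. The objects iris.cat, se.res and se.res2 are made up for this sketch, which works on a copy of iris so that the original data stay untouched.

```r
# work on a copy and add the petal size category
iris.cat <- iris
iris.cat$Petal.size <- ifelse(iris.cat$Petal.Width > 1.5,
                              "Large Petal", "Small Petal")

# option 1: name the elements of the "by" list
se.res <- aggregate(x   = iris.cat[, 1:4],
                    by  = list(Species    = iris.cat$Species,
                               Petal.size = iris.cat$Petal.size),
                    FUN = function(x) sd(x) / sqrt(length(x)))
names(se.res)

# option 2: aggregate with generic names, then rename the first two columns
se.res2 <- aggregate(x   = iris.cat[, 1:4],
                     by  = list(iris.cat$Species, iris.cat$Petal.size),
                     FUN = function(x) sd(x) / sqrt(length(x)))
names(se.res2)[1:2] <- c("Species", "Petal.size")
names(se.res2)
```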