Spring 2025

Data Processing

Processing Data

  • It’s rare to simply read in a data file and then start analyzing
  • Usually, we have to do something with the data
  • It’s very bad practice to modify the actual data files
  • Better to manipulate the data in code as a part of your pipeline
  • This is self-documenting and maintains the original data integrity

Common Manipulation Needs

  • Select only a certain subset of variables or observations
  • Cleaning, constraining, or filling in missing data
  • Sort, organize, or reshape data differently
  • Summarize, aggregate, or group data
  • Combine, merge, or join data from different sources
  • Reorder data sets based on values
  • Calculate alternative variables / statistics based on the original data

Processing with R

Subsettting Data

  • As we have seen, we can use subset() to:
    • Get a new data frame containing only observations that meet some criteria
    • Get a new data frame containing only a subset of variables
colnames(ChickWeight)
## [1] "weight" "Time"   "Chick"  "Diet"
subset(ChickWeight, weight > 330, select=c(weight,Time))
##     weight Time
## 232    331   21
## 388    341   21
## 398    332   18
## 399    361   20
## 400    373   21

The transform() and reorder() functions, p.1

  • We use transform() to produce a new data frame that has been changed in some way
head(transform(ChickWeight, weight=-weight))
##   weight Time Chick Diet
## 1    -42    0     1    1
## 2    -51    2     1    1
## 3    -59    4     1    1
## 4    -64    6     1    1
## 5    -76    8     1    1
## 6    -93   10     1    1
  • We use reorder to reorder a variable based on the value of another
head(reorder(ChickWeight$Time, ChickWeight$weight))
## [1] 0  2  4  6  8  10
## Levels: 0 2 4 6 8 10 12 14 16 18 20 21

The transform() and reorder() functions, p.2

  • We use transform() and reorder() to produce a new data frame sorted by a particular variable
head(transform(ChickWeight, reorder(Time, weight)))
##   weight Time Chick Diet
## 1     42    0     1    1
## 2     51    2     1    1
## 3     59    4     1    1
## 4     64    6     1    1
## 5     76    8     1    1
## 6     93   10     1    1

Or just use arrange()

  • The dplyr package contains the useful function arrange()
  • This allows you to sort a dataset given an ordered list of variables
library(dplyr,quietly=TRUE, warn.conflicts=FALSE)
head(arrange(ChickWeight, Diet, Chick, Time, weight))
##   weight Time Chick Diet
## 1     39    0    18    1
## 2     35    2    18    1
## 3     41    0    16    1
## 4     45    2    16    1
## 5     49    4    16    1
## 6     51    6    16    1

Filtering Data, Old School

  • Old-school R filtering of data:
ChickWeight[ ChickWeight$Chick==1 & ChickWeight$Time > 10, ]
##    weight Time Chick Diet
## 7     106   12     1    1
## 8     125   14     1    1
## 9     149   16     1    1
## 10    171   18     1    1
## 11    199   20     1    1
## 12    205   21     1    1

Filtering Data Using dplyr

  • Or use the dplyr filter function:
filter(ChickWeight, Chick==1, Time > 10)
##   weight Time Chick Diet
## 1    106   12     1    1
## 2    125   14     1    1
## 3    149   16     1    1
## 4    171   18     1    1
## 5    199   20     1    1
## 6    205   21     1    1

Slicing Data

  • You can also use the [] operators to get specific rows or columns
 ChickWeight[ c(3,5,7), 1:3 ]
##   weight Time Chick
## 3     59    4     1
## 5     76    8     1
## 7    106   12     1

Aggregating Data

  • Sometimes you need to collapse many rows together by taking some kind of summary statistic
  • We can use aggregate() to do this, giving it:
    • The data set
    • A list of the variables containing values to collapse
    • The function used to aggregate
subset(aggregate(state.x77,list(Region = state.region), mean),
       select=c(Region, Population, Income, Illiteracy))
##          Region Population   Income Illiteracy
## 1     Northeast   5495.111 4570.222   1.000000
## 2         South   4208.125 4011.938   1.737500
## 3 North Central   4803.000 4611.083   0.700000
## 4          West   2915.308 4702.615   1.023077

Joining Two Datasets Horizontally, p.1

  • The merge() function will take two data frames and attempt to join them
  • It adds columns and replicates values where necessary
A = data.frame(Var1 = c(1,2,3), Var2=factor(c("a","b","c")))
B = data.frame(Var3=c(-3, 2.7, 52))
merge(A,B)
##   Var1 Var2 Var3
## 1    1    a -3.0
## 2    2    b -3.0
## 3    3    c -3.0
## 4    1    a  2.7
## 5    2    b  2.7
## 6    3    c  2.7
## 7    1    a 52.0
## 8    2    b 52.0
## 9    3    c 52.0

Joining Two Datasets Horizontally, p.2

  • When data frames have the same variables, merge() includes only data where these variables share values
  • It’s like a union of the sets of row-tuples
A = data.frame(Var1 = c(1,2,3), Var2=factor(c("a","b","c")))
B = data.frame(Var2=factor(c("a","a","b")), Var3=c(-3, 2.7, 52))
merge(A,B)
##   Var2 Var1 Var3
## 1    a    1 -3.0
## 2    a    1  2.7
## 3    b    2 52.0

Joining Two Datasets Vertically

  • The rbind() function will take the rows of one data frame and attempt to bind them to the bottom of another
  • The column names must match
A = data.frame(Var1=c(1,2,3), Var2=factor(c("a","b","c")))
B = data.frame(Var1=c(9,10),   Var2=factor(c("b","c")))
rbind(A,B)
##   Var1 Var2
## 1    1    a
## 2    2    b
## 3    3    c
## 4    9    b
## 5   10    c

More Sophisticed SQL-Style Joins

  • inner join: Return only the rows in which the left table have matching keys in the right table
    merge(df1, df2, by="CustomerId")
  • outer join: Returns all rows from both tables, join records from the left which have matching keys in the right table
    merge(x=df1, y=df2, by="CustomerId", all=T)
  • left-outer join: Return all rows from the left table, and any rows with matching keys from the right table
    merge(x=df1, y=df2, by="CustomerId", all.x=T)
  • right-outer join: Return all rows from the right table, and any rows with matching keys from the left table
    merge(x=df1, y=df2, by="CustomerId", all.y=T)
  • cross join: Return the Cartesian product of the two dables
    merge(x=df1, y=df2, by=NULL)

More Information About Joins

Adding Columns, Old School

mydata <- ChickWeight
mydata$NewColumn <- mydata$weight^2
head(mydata, n=5)
##   weight Time Chick Diet NewColumn
## 1     42    0     1    1      1764
## 2     51    2     1    1      2601
## 3     59    4     1    1      3481
## 4     64    6     1    1      4096
## 5     76    8     1    1      5776

Adding Columns with dplyr mutate

  • The mutate() function from the dplyr package provides:
    • Simpler way to access variables
    • Ability to add multiple variables at a time
    • Ability to use new variables in calculations directly
mydata <- ChickWeight
mutate(mydata,
       NewColumn = weight^2,
       SecNewCol = NewColumn / 2)
##     weight Time Chick Diet NewColumn SecNewCol
## 1       42    0     1    1      1764     882.0
## 2       51    2     1    1      2601    1300.5
## 3       59    4     1    1      3481    1740.5
## 4       64    6     1    1      4096    2048.0
## 5       76    8     1    1      5776    2888.0
## 6       93   10     1    1      8649    4324.5
## 7      106   12     1    1     11236    5618.0
## 8      125   14     1    1     15625    7812.5
## 9      149   16     1    1     22201   11100.5
## 10     171   18     1    1     29241   14620.5
## 11     199   20     1    1     39601   19800.5
## 12     205   21     1    1     42025   21012.5
## 13      40    0     2    1      1600     800.0
## 14      49    2     2    1      2401    1200.5
## 15      58    4     2    1      3364    1682.0
## 16      72    6     2    1      5184    2592.0
## 17      84    8     2    1      7056    3528.0
## 18     103   10     2    1     10609    5304.5
## 19     122   12     2    1     14884    7442.0
## 20     138   14     2    1     19044    9522.0
## 21     162   16     2    1     26244   13122.0
## 22     187   18     2    1     34969   17484.5
## 23     209   20     2    1     43681   21840.5
## 24     215   21     2    1     46225   23112.5
## 25      43    0     3    1      1849     924.5
## 26      39    2     3    1      1521     760.5
## 27      55    4     3    1      3025    1512.5
## 28      67    6     3    1      4489    2244.5
## 29      84    8     3    1      7056    3528.0
## 30      99   10     3    1      9801    4900.5
## 31     115   12     3    1     13225    6612.5
## 32     138   14     3    1     19044    9522.0
## 33     163   16     3    1     26569   13284.5
## 34     187   18     3    1     34969   17484.5
## 35     198   20     3    1     39204   19602.0
## 36     202   21     3    1     40804   20402.0
## 37      42    0     4    1      1764     882.0
## 38      49    2     4    1      2401    1200.5
## 39      56    4     4    1      3136    1568.0
## 40      67    6     4    1      4489    2244.5
## 41      74    8     4    1      5476    2738.0
## 42      87   10     4    1      7569    3784.5
## 43     102   12     4    1     10404    5202.0
## 44     108   14     4    1     11664    5832.0
## 45     136   16     4    1     18496    9248.0
## 46     154   18     4    1     23716   11858.0
## 47     160   20     4    1     25600   12800.0
## 48     157   21     4    1     24649   12324.5
## 49      41    0     5    1      1681     840.5
## 50      42    2     5    1      1764     882.0
## 51      48    4     5    1      2304    1152.0
## 52      60    6     5    1      3600    1800.0
## 53      79    8     5    1      6241    3120.5
## 54     106   10     5    1     11236    5618.0
## 55     141   12     5    1     19881    9940.5
## 56     164   14     5    1     26896   13448.0
## 57     197   16     5    1     38809   19404.5
## 58     199   18     5    1     39601   19800.5
## 59     220   20     5    1     48400   24200.0
## 60     223   21     5    1     49729   24864.5
## 61      41    0     6    1      1681     840.5
## 62      49    2     6    1      2401    1200.5
## 63      59    4     6    1      3481    1740.5
## 64      74    6     6    1      5476    2738.0
## 65      97    8     6    1      9409    4704.5
## 66     124   10     6    1     15376    7688.0
## 67     141   12     6    1     19881    9940.5
## 68     148   14     6    1     21904   10952.0
## 69     155   16     6    1     24025   12012.5
## 70     160   18     6    1     25600   12800.0
## 71     160   20     6    1     25600   12800.0
## 72     157   21     6    1     24649   12324.5
## 73      41    0     7    1      1681     840.5
## 74      49    2     7    1      2401    1200.5
## 75      57    4     7    1      3249    1624.5
## 76      71    6     7    1      5041    2520.5
## 77      89    8     7    1      7921    3960.5
## 78     112   10     7    1     12544    6272.0
## 79     146   12     7    1     21316   10658.0
## 80     174   14     7    1     30276   15138.0
## 81     218   16     7    1     47524   23762.0
## 82     250   18     7    1     62500   31250.0
## 83     288   20     7    1     82944   41472.0
## 84     305   21     7    1     93025   46512.5
## 85      42    0     8    1      1764     882.0
## 86      50    2     8    1      2500    1250.0
## 87      61    4     8    1      3721    1860.5
## 88      71    6     8    1      5041    2520.5
## 89      84    8     8    1      7056    3528.0
## 90      93   10     8    1      8649    4324.5
## 91     110   12     8    1     12100    6050.0
## 92     116   14     8    1     13456    6728.0
## 93     126   16     8    1     15876    7938.0
## 94     134   18     8    1     17956    8978.0
## 95     125   20     8    1     15625    7812.5
## 96      42    0     9    1      1764     882.0
## 97      51    2     9    1      2601    1300.5
## 98      59    4     9    1      3481    1740.5
## 99      68    6     9    1      4624    2312.0
## 100     85    8     9    1      7225    3612.5
## 101     96   10     9    1      9216    4608.0
## 102     90   12     9    1      8100    4050.0
## 103     92   14     9    1      8464    4232.0
## 104     93   16     9    1      8649    4324.5
## 105    100   18     9    1     10000    5000.0
## 106    100   20     9    1     10000    5000.0
## 107     98   21     9    1      9604    4802.0
## 108     41    0    10    1      1681     840.5
## 109     44    2    10    1      1936     968.0
## 110     52    4    10    1      2704    1352.0
## 111     63    6    10    1      3969    1984.5
## 112     74    8    10    1      5476    2738.0
## 113     81   10    10    1      6561    3280.5
## 114     89   12    10    1      7921    3960.5
## 115     96   14    10    1      9216    4608.0
## 116    101   16    10    1     10201    5100.5
## 117    112   18    10    1     12544    6272.0
## 118    120   20    10    1     14400    7200.0
## 119    124   21    10    1     15376    7688.0
## 120     43    0    11    1      1849     924.5
## 121     51    2    11    1      2601    1300.5
## 122     63    4    11    1      3969    1984.5
## 123     84    6    11    1      7056    3528.0
## 124    112    8    11    1     12544    6272.0
## 125    139   10    11    1     19321    9660.5
## 126    168   12    11    1     28224   14112.0
## 127    177   14    11    1     31329   15664.5
## 128    182   16    11    1     33124   16562.0
## 129    184   18    11    1     33856   16928.0
## 130    181   20    11    1     32761   16380.5
## 131    175   21    11    1     30625   15312.5
## 132     41    0    12    1      1681     840.5
## 133     49    2    12    1      2401    1200.5
## 134     56    4    12    1      3136    1568.0
## 135     62    6    12    1      3844    1922.0
## 136     72    8    12    1      5184    2592.0
## 137     88   10    12    1      7744    3872.0
## 138    119   12    12    1     14161    7080.5
## 139    135   14    12    1     18225    9112.5
## 140    162   16    12    1     26244   13122.0
## 141    185   18    12    1     34225   17112.5
## 142    195   20    12    1     38025   19012.5
## 143    205   21    12    1     42025   21012.5
## 144     41    0    13    1      1681     840.5
## 145     48    2    13    1      2304    1152.0
## 146     53    4    13    1      2809    1404.5
## 147     60    6    13    1      3600    1800.0
## 148     65    8    13    1      4225    2112.5
## 149     67   10    13    1      4489    2244.5
## 150     71   12    13    1      5041    2520.5
## 151     70   14    13    1      4900    2450.0
## 152     71   16    13    1      5041    2520.5
## 153     81   18    13    1      6561    3280.5
## 154     91   20    13    1      8281    4140.5
## 155     96   21    13    1      9216    4608.0
## 156     41    0    14    1      1681     840.5
## 157     49    2    14    1      2401    1200.5
## 158     62    4    14    1      3844    1922.0
## 159     79    6    14    1      6241    3120.5
## 160    101    8    14    1     10201    5100.5
## 161    128   10    14    1     16384    8192.0
## 162    164   12    14    1     26896   13448.0
## 163    192   14    14    1     36864   18432.0
## 164    227   16    14    1     51529   25764.5
## 165    248   18    14    1     61504   30752.0
## 166    259   20    14    1     67081   33540.5
## 167    266   21    14    1     70756   35378.0
## 168     41    0    15    1      1681     840.5
## 169     49    2    15    1      2401    1200.5
## 170     56    4    15    1      3136    1568.0
## 171     64    6    15    1      4096    2048.0
## 172     68    8    15    1      4624    2312.0
## 173     68   10    15    1      4624    2312.0
## 174     67   12    15    1      4489    2244.5
## 175     68   14    15    1      4624    2312.0
## 176     41    0    16    1      1681     840.5
## 177     45    2    16    1      2025    1012.5
## 178     49    4    16    1      2401    1200.5
## 179     51    6    16    1      2601    1300.5
## 180     57    8    16    1      3249    1624.5
## 181     51   10    16    1      2601    1300.5
## 182     54   12    16    1      2916    1458.0
## 183     42    0    17    1      1764     882.0
## 184     51    2    17    1      2601    1300.5
## 185     61    4    17    1      3721    1860.5
## 186     72    6    17    1      5184    2592.0
## 187     83    8    17    1      6889    3444.5
## 188     89   10    17    1      7921    3960.5
## 189     98   12    17    1      9604    4802.0
## 190    103   14    17    1     10609    5304.5
## 191    113   16    17    1     12769    6384.5
## 192    123   18    17    1     15129    7564.5
## 193    133   20    17    1     17689    8844.5
## 194    142   21    17    1     20164   10082.0
## 195     39    0    18    1      1521     760.5
## 196     35    2    18    1      1225     612.5
## 197     43    0    19    1      1849     924.5
## 198     48    2    19    1      2304    1152.0
## 199     55    4    19    1      3025    1512.5
## 200     62    6    19    1      3844    1922.0
## 201     65    8    19    1      4225    2112.5
## 202     71   10    19    1      5041    2520.5
## 203     82   12    19    1      6724    3362.0
## 204     88   14    19    1      7744    3872.0
## 205    106   16    19    1     11236    5618.0
## 206    120   18    19    1     14400    7200.0
## 207    144   20    19    1     20736   10368.0
## 208    157   21    19    1     24649   12324.5
## 209     41    0    20    1      1681     840.5
## 210     47    2    20    1      2209    1104.5
## 211     54    4    20    1      2916    1458.0
## 212     58    6    20    1      3364    1682.0
## 213     65    8    20    1      4225    2112.5
## 214     73   10    20    1      5329    2664.5
## 215     77   12    20    1      5929    2964.5
## 216     89   14    20    1      7921    3960.5
## 217     98   16    20    1      9604    4802.0
## 218    107   18    20    1     11449    5724.5
## 219    115   20    20    1     13225    6612.5
## 220    117   21    20    1     13689    6844.5
## 221     40    0    21    2      1600     800.0
## 222     50    2    21    2      2500    1250.0
## 223     62    4    21    2      3844    1922.0
## 224     86    6    21    2      7396    3698.0
## 225    125    8    21    2     15625    7812.5
## 226    163   10    21    2     26569   13284.5
## 227    217   12    21    2     47089   23544.5
## 228    240   14    21    2     57600   28800.0
## 229    275   16    21    2     75625   37812.5
## 230    307   18    21    2     94249   47124.5
## 231    318   20    21    2    101124   50562.0
## 232    331   21    21    2    109561   54780.5
## 233     41    0    22    2      1681     840.5
## 234     55    2    22    2      3025    1512.5
## 235     64    4    22    2      4096    2048.0
## 236     77    6    22    2      5929    2964.5
## 237     90    8    22    2      8100    4050.0
## 238     95   10    22    2      9025    4512.5
## 239    108   12    22    2     11664    5832.0
## 240    111   14    22    2     12321    6160.5
## 241    131   16    22    2     17161    8580.5
## 242    148   18    22    2     21904   10952.0
## 243    164   20    22    2     26896   13448.0
## 244    167   21    22    2     27889   13944.5
## 245     43    0    23    2      1849     924.5
## 246     52    2    23    2      2704    1352.0
## 247     61    4    23    2      3721    1860.5
## 248     73    6    23    2      5329    2664.5
## 249     90    8    23    2      8100    4050.0
## 250    103   10    23    2     10609    5304.5
## 251    127   12    23    2     16129    8064.5
## 252    135   14    23    2     18225    9112.5
## 253    145   16    23    2     21025   10512.5
## 254    163   18    23    2     26569   13284.5
## 255    170   20    23    2     28900   14450.0
## 256    175   21    23    2     30625   15312.5
## 257     42    0    24    2      1764     882.0
## 258     52    2    24    2      2704    1352.0
## 259     58    4    24    2      3364    1682.0
## 260     74    6    24    2      5476    2738.0
## 261     66    8    24    2      4356    2178.0
## 262     68   10    24    2      4624    2312.0
## 263     70   12    24    2      4900    2450.0
## 264     71   14    24    2      5041    2520.5
## 265     72   16    24    2      5184    2592.0
## 266     72   18    24    2      5184    2592.0
## 267     76   20    24    2      5776    2888.0
## 268     74   21    24    2      5476    2738.0
## 269     40    0    25    2      1600     800.0
## 270     49    2    25    2      2401    1200.5
## 271     62    4    25    2      3844    1922.0
## 272     78    6    25    2      6084    3042.0
## 273    102    8    25    2     10404    5202.0
## 274    124   10    25    2     15376    7688.0
## 275    146   12    25    2     21316   10658.0
## 276    164   14    25    2     26896   13448.0
## 277    197   16    25    2     38809   19404.5
## 278    231   18    25    2     53361   26680.5
## 279    259   20    25    2     67081   33540.5
## 280    265   21    25    2     70225   35112.5
## 281     42    0    26    2      1764     882.0
## 282     48    2    26    2      2304    1152.0
## 283     57    4    26    2      3249    1624.5
## 284     74    6    26    2      5476    2738.0
## 285     93    8    26    2      8649    4324.5
## 286    114   10    26    2     12996    6498.0
## 287    136   12    26    2     18496    9248.0
## 288    147   14    26    2     21609   10804.5
## 289    169   16    26    2     28561   14280.5
## 290    205   18    26    2     42025   21012.5
## 291    236   20    26    2     55696   27848.0
## 292    251   21    26    2     63001   31500.5
## 293     39    0    27    2      1521     760.5
## 294     46    2    27    2      2116    1058.0
## 295     58    4    27    2      3364    1682.0
## 296     73    6    27    2      5329    2664.5
## 297     87    8    27    2      7569    3784.5
## 298    100   10    27    2     10000    5000.0
## 299    115   12    27    2     13225    6612.5
## 300    123   14    27    2     15129    7564.5
## 301    144   16    27    2     20736   10368.0
## 302    163   18    27    2     26569   13284.5
## 303    185   20    27    2     34225   17112.5
## 304    192   21    27    2     36864   18432.0
## 305     39    0    28    2      1521     760.5
## 306     46    2    28    2      2116    1058.0
## 307     58    4    28    2      3364    1682.0
## 308     73    6    28    2      5329    2664.5
## 309     92    8    28    2      8464    4232.0
## 310    114   10    28    2     12996    6498.0
## 311    145   12    28    2     21025   10512.5
## 312    156   14    28    2     24336   12168.0
## 313    184   16    28    2     33856   16928.0
## 314    207   18    28    2     42849   21424.5
## 315    212   20    28    2     44944   22472.0
## 316    233   21    28    2     54289   27144.5
## 317     39    0    29    2      1521     760.5
## 318     48    2    29    2      2304    1152.0
## 319     59    4    29    2      3481    1740.5
## 320     74    6    29    2      5476    2738.0
## 321     87    8    29    2      7569    3784.5
## 322    106   10    29    2     11236    5618.0
## 323    134   12    29    2     17956    8978.0
## 324    150   14    29    2     22500   11250.0
## 325    187   16    29    2     34969   17484.5
## 326    230   18    29    2     52900   26450.0
## 327    279   20    29    2     77841   38920.5
## 328    309   21    29    2     95481   47740.5
## 329     42    0    30    2      1764     882.0
## 330     48    2    30    2      2304    1152.0
## 331     59    4    30    2      3481    1740.5
## 332     72    6    30    2      5184    2592.0
## 333     85    8    30    2      7225    3612.5
## 334     98   10    30    2      9604    4802.0
## 335    115   12    30    2     13225    6612.5
## 336    122   14    30    2     14884    7442.0
## 337    143   16    30    2     20449   10224.5
## 338    151   18    30    2     22801   11400.5
## 339    157   20    30    2     24649   12324.5
## 340    150   21    30    2     22500   11250.0
## 341     42    0    31    3      1764     882.0
## 342     53    2    31    3      2809    1404.5
## 343     62    4    31    3      3844    1922.0
## 344     73    6    31    3      5329    2664.5
## 345     85    8    31    3      7225    3612.5
## 346    102   10    31    3     10404    5202.0
## 347    123   12    31    3     15129    7564.5
## 348    138   14    31    3     19044    9522.0
## 349    170   16    31    3     28900   14450.0
## 350    204   18    31    3     41616   20808.0
## 351    235   20    31    3     55225   27612.5
## 352    256   21    31    3     65536   32768.0
## 353     41    0    32    3      1681     840.5
## 354     49    2    32    3      2401    1200.5
## 355     65    4    32    3      4225    2112.5
## 356     82    6    32    3      6724    3362.0
## 357    107    8    32    3     11449    5724.5
## 358    129   10    32    3     16641    8320.5
## 359    159   12    32    3     25281   12640.5
## 360    179   14    32    3     32041   16020.5
## 361    221   16    32    3     48841   24420.5
## 362    263   18    32    3     69169   34584.5
## 363    291   20    32    3     84681   42340.5
## 364    305   21    32    3     93025   46512.5
## 365     39    0    33    3      1521     760.5
## 366     50    2    33    3      2500    1250.0
## 367     63    4    33    3      3969    1984.5
## 368     77    6    33    3      5929    2964.5
## 369     96    8    33    3      9216    4608.0
## 370    111   10    33    3     12321    6160.5
## 371    137   12    33    3     18769    9384.5
## 372    144   14    33    3     20736   10368.0
## 373    151   16    33    3     22801   11400.5
## 374    146   18    33    3     21316   10658.0
## 375    156   20    33    3     24336   12168.0
## 376    147   21    33    3     21609   10804.5
## 377     41    0    34    3      1681     840.5
## 378     49    2    34    3      2401    1200.5
## 379     63    4    34    3      3969    1984.5
## 380     85    6    34    3      7225    3612.5
## 381    107    8    34    3     11449    5724.5
## 382    134   10    34    3     17956    8978.0
## 383    164   12    34    3     26896   13448.0
## 384    186   14    34    3     34596   17298.0
## 385    235   16    34    3     55225   27612.5
## 386    294   18    34    3     86436   43218.0
## 387    327   20    34    3    106929   53464.5
## 388    341   21    34    3    116281   58140.5
## 389     41    0    35    3      1681     840.5
## 390     53    2    35    3      2809    1404.5
## 391     64    4    35    3      4096    2048.0
## 392     87    6    35    3      7569    3784.5
## 393    123    8    35    3     15129    7564.5
## 394    158   10    35    3     24964   12482.0
## 395    201   12    35    3     40401   20200.5
## 396    238   14    35    3     56644   28322.0
## 397    287   16    35    3     82369   41184.5
## 398    332   18    35    3    110224   55112.0
## 399    361   20    35    3    130321   65160.5
## 400    373   21    35    3    139129   69564.5
## 401     39    0    36    3      1521     760.5
## 402     48    2    36    3      2304    1152.0
## 403     61    4    36    3      3721    1860.5
## 404     76    6    36    3      5776    2888.0
## 405     98    8    36    3      9604    4802.0
## 406    116   10    36    3     13456    6728.0
## 407    145   12    36    3     21025   10512.5
## 408    166   14    36    3     27556   13778.0
## 409    198   16    36    3     39204   19602.0
## 410    227   18    36    3     51529   25764.5
## 411    225   20    36    3     50625   25312.5
## 412    220   21    36    3     48400   24200.0
## 413     41    0    37    3      1681     840.5
## 414     48    2    37    3      2304    1152.0
## 415     56    4    37    3      3136    1568.0
## 416     68    6    37    3      4624    2312.0
## 417     80    8    37    3      6400    3200.0
## 418     83   10    37    3      6889    3444.5
## 419    103   12    37    3     10609    5304.5
## 420    112   14    37    3     12544    6272.0
## 421    135   16    37    3     18225    9112.5
## 422    157   18    37    3     24649   12324.5
## 423    169   20    37    3     28561   14280.5
## 424    178   21    37    3     31684   15842.0
## 425     41    0    38    3      1681     840.5
## 426     49    2    38    3      2401    1200.5
## 427     61    4    38    3      3721    1860.5
## 428     74    6    38    3      5476    2738.0
## 429     98    8    38    3      9604    4802.0
## 430    109   10    38    3     11881    5940.5
## 431    128   12    38    3     16384    8192.0
## 432    154   14    38    3     23716   11858.0
## 433    192   16    38    3     36864   18432.0
## 434    232   18    38    3     53824   26912.0
## 435    280   20    38    3     78400   39200.0
## 436    290   21    38    3     84100   42050.0
## 437     42    0    39    3      1764     882.0
## 438     50    2    39    3      2500    1250.0
## 439     61    4    39    3      3721    1860.5
## 440     78    6    39    3      6084    3042.0
## 441     89    8    39    3      7921    3960.5
## 442    109   10    39    3     11881    5940.5
## 443    130   12    39    3     16900    8450.0
## 444    146   14    39    3     21316   10658.0
## 445    170   16    39    3     28900   14450.0
## 446    214   18    39    3     45796   22898.0
## 447    250   20    39    3     62500   31250.0
## 448    272   21    39    3     73984   36992.0
## 449     41    0    40    3      1681     840.5
## 450     55    2    40    3      3025    1512.5
## 451     66    4    40    3      4356    2178.0
## 452     79    6    40    3      6241    3120.5
## 453    101    8    40    3     10201    5100.5
## 454    120   10    40    3     14400    7200.0
## 455    154   12    40    3     23716   11858.0
## 456    182   14    40    3     33124   16562.0
## 457    215   16    40    3     46225   23112.5
## 458    262   18    40    3     68644   34322.0
## 459    295   20    40    3     87025   43512.5
## 460    321   21    40    3    103041   51520.5
## 461     42    0    41    4      1764     882.0
## 462     51    2    41    4      2601    1300.5
## 463     66    4    41    4      4356    2178.0
## 464     85    6    41    4      7225    3612.5
## 465    103    8    41    4     10609    5304.5
## 466    124   10    41    4     15376    7688.0
## 467    155   12    41    4     24025   12012.5
## 468    153   14    41    4     23409   11704.5
## 469    175   16    41    4     30625   15312.5
## 470    184   18    41    4     33856   16928.0
## 471    199   20    41    4     39601   19800.5
## 472    204   21    41    4     41616   20808.0
## 473     42    0    42    4      1764     882.0
## 474     49    2    42    4      2401    1200.5
## 475     63    4    42    4      3969    1984.5
## 476     84    6    42    4      7056    3528.0
## 477    103    8    42    4     10609    5304.5
## 478    126   10    42    4     15876    7938.0
## 479    160   12    42    4     25600   12800.0
## 480    174   14    42    4     30276   15138.0
## 481    204   16    42    4     41616   20808.0
## 482    234   18    42    4     54756   27378.0
## 483    269   20    42    4     72361   36180.5
## 484    281   21    42    4     78961   39480.5
## 485     42    0    43    4      1764     882.0
## 486     55    2    43    4      3025    1512.5
## 487     69    4    43    4      4761    2380.5
## 488     96    6    43    4      9216    4608.0
## 489    131    8    43    4     17161    8580.5
## 490    157   10    43    4     24649   12324.5
## 491    184   12    43    4     33856   16928.0
## 492    188   14    43    4     35344   17672.0
## 493    197   16    43    4     38809   19404.5
## 494    198   18    43    4     39204   19602.0
## 495    199   20    43    4     39601   19800.5
## 496    200   21    43    4     40000   20000.0
## 497     42    0    44    4      1764     882.0
## 498     51    2    44    4      2601    1300.5
## 499     65    4    44    4      4225    2112.5
## 500     86    6    44    4      7396    3698.0
## 501    103    8    44    4     10609    5304.5
## 502    118   10    44    4     13924    6962.0
## 503    127   12    44    4     16129    8064.5
## 504    138   14    44    4     19044    9522.0
## 505    145   16    44    4     21025   10512.5
## 506    146   18    44    4     21316   10658.0
## 507     41    0    45    4      1681     840.5
## 508     50    2    45    4      2500    1250.0
## 509     61    4    45    4      3721    1860.5
## 510     78    6    45    4      6084    3042.0
## 511     98    8    45    4      9604    4802.0
## 512    117   10    45    4     13689    6844.5
## 513    135   12    45    4     18225    9112.5
## 514    141   14    45    4     19881    9940.5
## 515    147   16    45    4     21609   10804.5
## 516    174   18    45    4     30276   15138.0
## 517    197   20    45    4     38809   19404.5
## 518    196   21    45    4     38416   19208.0
## 519     40    0    46    4      1600     800.0
## 520     52    2    46    4      2704    1352.0
## 521     62    4    46    4      3844    1922.0
## 522     82    6    46    4      6724    3362.0
## 523    101    8    46    4     10201    5100.5
## 524    120   10    46    4     14400    7200.0
## 525    144   12    46    4     20736   10368.0
## 526    156   14    46    4     24336   12168.0
## 527    173   16    46    4     29929   14964.5
## 528    210   18    46    4     44100   22050.0
## 529    231   20    46    4     53361   26680.5
## 530    238   21    46    4     56644   28322.0
## 531     41    0    47    4      1681     840.5
## 532     53    2    47    4      2809    1404.5
## 533     66    4    47    4      4356    2178.0
## 534     79    6    47    4      6241    3120.5
## 535    100    8    47    4     10000    5000.0
## 536    123   10    47    4     15129    7564.5
## 537    148   12    47    4     21904   10952.0
## 538    157   14    47    4     24649   12324.5
## 539    168   16    47    4     28224   14112.0
## 540    185   18    47    4     34225   17112.5
## 541    210   20    47    4     44100   22050.0
## 542    205   21    47    4     42025   21012.5
## 543     39    0    48    4      1521     760.5
## 544     50    2    48    4      2500    1250.0
## 545     62    4    48    4      3844    1922.0
## 546     80    6    48    4      6400    3200.0
## 547    104    8    48    4     10816    5408.0
## 548    125   10    48    4     15625    7812.5
## 549    154   12    48    4     23716   11858.0
## 550    170   14    48    4     28900   14450.0
## 551    222   16    48    4     49284   24642.0
## 552    261   18    48    4     68121   34060.5
## 553    303   20    48    4     91809   45904.5
## 554    322   21    48    4    103684   51842.0
## 555     40    0    49    4      1600     800.0
## 556     53    2    49    4      2809    1404.5
## 557     64    4    49    4      4096    2048.0
## 558     85    6    49    4      7225    3612.5
## 559    108    8    49    4     11664    5832.0
## 560    128   10    49    4     16384    8192.0
## 561    152   12    49    4     23104   11552.0
## 562    166   14    49    4     27556   13778.0
## 563    184   16    49    4     33856   16928.0
## 564    203   18    49    4     41209   20604.5
## 565    233   20    49    4     54289   27144.5
## 566    237   21    49    4     56169   28084.5
## 567     41    0    50    4      1681     840.5
## 568     54    2    50    4      2916    1458.0
## 569     67    4    50    4      4489    2244.5
## 570     84    6    50    4      7056    3528.0
## 571    105    8    50    4     11025    5512.5
## 572    122   10    50    4     14884    7442.0
## 573    155   12    50    4     24025   12012.5
## 574    175   14    50    4     30625   15312.5
## 575    205   16    50    4     42025   21012.5
## 576    234   18    50    4     54756   27378.0
## 577    264   20    50    4     69696   34848.0
## 578    264   21    50    4     69696   34848.0
head(mydata)
##   weight Time Chick Diet
## 1     42    0     1    1
## 2     51    2     1    1
## 3     59    4     1    1
## 4     64    6     1    1
## 5     76    8     1    1
## 6     93   10     1    1

The group_by() function

  • Many operations on data sets occur over certain groups of data
  • The dplyr package has a function group_by() that returns a group of tables based on some property
  • ggplot2 understands these groups
myStates = mutate(as.data.frame(state.x77), Region=state.region)
group_by(myStates,Region)
## # A tibble: 50 × 9
## # Groups:   Region [4]
##    Population Income Illiteracy `Life Exp` Murder `HS Grad` Frost   Area Region 
##         <dbl>  <dbl>      <dbl>      <dbl>  <dbl>     <dbl> <dbl>  <dbl> <fct>  
##  1       3615   3624        2.1       69.0   15.1      41.3    20  50708 South  
##  2        365   6315        1.5       69.3   11.3      66.7   152 566432 West   
##  3       2212   4530        1.8       70.6    7.8      58.1    15 113417 West   
##  4       2110   3378        1.9       70.7   10.1      39.9    65  51945 South  
##  5      21198   5114        1.1       71.7   10.3      62.6    20 156361 West   
##  6       2541   4884        0.7       72.1    6.8      63.9   166 103766 West   
##  7       3100   5348        1.1       72.5    3.1      56     139   4862 Northe…
##  8        579   4809        0.9       70.1    6.2      54.6   103   1982 South  
##  9       8277   4815        1.3       70.7   10.7      52.6    11  54090 South  
## 10       4931   4091        2         68.5   13.9      40.6    60  58073 South  
## # ℹ 40 more rows

The summarise() Function

  • The dplyr package gives a simpler way to do many things one would do with multiple calls to aggregate and reconstruction of a new table
  • The summarise() function understands groups
myStates = mutate(as.data.frame(state.x77), Region=state.region)
summarise(group_by(myStates,Region), 
          AvgPopulationSize=mean(Population),
          SDPopulationSize=sqrt(var(Population)))
## # A tibble: 4 × 3
##   Region        AvgPopulationSize SDPopulationSize
##   <fct>                     <dbl>            <dbl>
## 1 Northeast                 5495.            6080.
## 2 South                     4208.            2780.
## 3 North Central             4803             3703.
## 4 West                      2915.            5579.

Processing with Python

Pandas Query

import pandas as pd
mtcars = pd.read_csv('https://www.cmi.ac.in/~sourish/mtcars.csv')
print(mtcars.query('mpg > 22  &  hp<70'))
##      mpg  cyl   disp  hp  drat     wt   qsec  vs  am  gear  carb
## 7   24.4    4  146.7  62  3.69  3.190  20.00   1   0     4     2
## 17  32.4    4   78.7  66  4.08  2.200  19.47   1   1     4     1
## 18  30.4    4   75.7  52  4.93  1.615  18.52   1   1     4     2
## 19  33.9    4   71.1  65  4.22  1.835  19.90   1   1     4     1
## 25  27.3    4   79.0  66  4.08  1.935  18.90   1   1     4     1

Pandas Selecting Columns

mtcars[ ['mpg','hp', 'carb'] ].head()
##     mpg   hp  carb
## 0  21.0  110     4
## 1  21.0  110     4
## 2  22.8   93     1
## 3  21.4  110     1
## 4  18.7  175     2

Pandas Add a Column

mtcars['efficiency_ratio'] = mtcars['hp']/mtcars['mpg']
mtcars.head()
##     mpg  cyl   disp   hp  drat  ...  vs  am  gear  carb  efficiency_ratio
## 0  21.0    6  160.0  110  3.90  ...   0   1     4     4          5.238095
## 1  21.0    6  160.0  110  3.90  ...   0   1     4     4          5.238095
## 2  22.8    4  108.0   93  3.85  ...   1   1     4     1          4.078947
## 3  21.4    6  258.0  110  3.08  ...   1   0     3     1          5.140187
## 4  18.7    8  360.0  175  3.15  ...   0   0     3     2          9.358289
## 
## [5 rows x 12 columns]

Pandas Aggregation, High Level

  • The DataFrame class has the method agg()
  • It allows you to aggregate using built-in function:
    • mean/median
    • sum
    • min/max
    • standard deviation
    • variance
    • product
  • There’s also a way to pass a User Defined Function, though that’s beyond our discussion

Pandas Aggregation, Example

mtcars.agg(['mean','sum','min','max','median','std','var','prod'])
##                  mpg         cyl  ...          carb  efficiency_ratio
## mean    2.009063e+01    6.187500  ...  2.812500e+00      8.786765e+00
## sum     6.429000e+02  198.000000  ...  9.000000e+01      2.811765e+02
## min     1.040000e+01    4.000000  ...  1.000000e+00      1.710526e+00
## max     3.390000e+01    8.000000  ...  8.000000e+00      2.233333e+01
## median  1.920000e+01    6.000000  ...  2.000000e+00      6.658181e+00
## std     6.026948e+00    1.785922  ...  1.615200e+00      6.073858e+00
## var     3.632410e+01    3.189516  ...  2.608871e+00      3.689176e+01
## prod    1.264241e+41    0.000000  ...  1.391569e+12      5.549885e+26
## 
## [8 rows x 12 columns]

Concatenating DataFrames in Pandas

When you have two or more DataFrames with the same column variables, and you want to put them together:

df1 = pd.DataFrame({"Student":["Eric", "Susan", "Todd"],
                    "Grade":pd.Categorical(['B','A','C'])})

df2 = pd.DataFrame({"Student":["Anthony", "John", "Jodie"],
                    "Grade":pd.Categorical(['A','B','B'])})
              
result = pd.concat([df1, df2])
print(result)
##    Student Grade
## 0     Eric     B
## 1    Susan     A
## 2     Todd     C
## 0  Anthony     A
## 1     John     B
## 2    Jodie     B

Also, Pandas as a Merge

left = pd.DataFrame({ "key1": ["K0", "K1", "K2", "K3"],
                      "key2": ["K0", "K1", "K0", "K1"],
                      "A": ["A0", "A1", "A2", "A3"],
                      "B": ["B0", "B1", "B2", "B3"] })
   
right = pd.DataFrame({ "key1": ["K0", "K1", "K1", "K2"],
                       "key2": ["K0", "K0", "K0", "K0"],
                       "C": ["C0", "C1", "C2", "C3"],
                       "D": ["D0", "D1", "D2", "D3"] })
  
result = pd.merge(left, right, on="key1")

More Sophisticed SQL-Style Joins

  • inner join: Return only the rows in which the left table have matching keys in the right table
    pd.merge(left, right, how="inner",on=["key1","key2"])
  • outer join: Returns all rows from both tables, join records from the left which have matching keys in the right table
    pd.merge(left, right, how="outer",on=["key1""key2"])
  • left-outer join: Return all rows from the left table, and any rows with matching keys from the right table
    pd.merge(left, right, how="left",on=["key1","key2"])
  • right-outer join: Return all rows from the right table, and any rows with matching keys from the left table
    pd.merge(left, right, how="right",on=["key1","key2"])
  • cross join: Return the Cartesian product of the two dables
    pd.merge(left, right, how="cross")

More Information About Joins:

Pandas groupby

  • You can also group and aggregate, much like R’s group_by and summarise
  • Though it’s more difficult to create several new aggregation columns at once
mtcars.groupby(['cyl','gear']).mean()
##              mpg        disp          hp  ...    am      carb  efficiency_ratio
## cyl gear                                  ...                                  
## 4   3     21.500  120.100000   97.000000  ...  0.00  1.000000          4.511628
##     4     26.925  102.625000   76.000000  ...  0.75  1.500000          2.995326
##     5     28.200  107.700000  102.000000  ...  1.00  2.000000          3.608553
## 6   3     19.750  241.500000  107.500000  ...  0.00  1.000000          5.470646
##     4     19.750  163.800000  116.500000  ...  0.50  4.000000          5.948138
##     5     19.700  145.000000  175.000000  ...  1.00  6.000000          8.883249
## 8   3     15.050  357.616667  194.166667  ...  0.00  3.083333         13.568820
##     5     15.400  326.000000  299.500000  ...  1.00  6.000000         19.521097
## 
## [8 rows x 10 columns]