Stream Name: Texaschikkita
Stream URL: https://rpubs.com/Texaschikkita
Stream ID: 9962324179
Measurement Id: G-CV2648GQMK

Module 14: Data Parallelism and Graph Parallelism

1. Introduction to Data Parallelism

Big Data Overview

  • Mathematical Representation of Parallel Computation: Given a dataset \(D\) and a task \(T\), the task can be divided into \(n\) parts: \[ T(D) = \sum_{i=1}^n T(D_i) \] where \(D_i\) are subsets of the data \(D\), processed independently.

  • Throughput Analysis: The speedup \(S\) of parallel processing compared to serial processing: \[ S = \frac{T_{\text{serial}}}{T_{\text{parallel}}} \] where \(T_{\text{serial}}\) is the time for sequential execution, and \(T_{\text{parallel}}\) is the execution time with \(n\) parallel units.


2. Data Parallelism

MapReduce Framework

  • Mathematical Steps:
    • Mapping: Transform data into intermediate key-value pairs: \[ f(x) = (k, v) \] where \(k\) is the key and \(v\) is the value extracted from data \(x\).

    • Reducing: Aggregate values based on the keys: \[ g(k, [v_1, v_2, ..., v_m]) = \text{Result} \]

Grid Search Parameter Tuning

  • Given parameters \(\Theta_1, \Theta_2, \ldots, \Theta_k\), the Cartesian product of these parameters defines the grid: \[ G = \Theta_1 \times \Theta_2 \times \ldots \times \Theta_k \]
    • Example: For \(\text{gamma} = \{1000, 10000, 100000\}\) and \(\text{C} = \{1, 10, 100\}\): \[ G = \{(1000, 1), (1000, 10), \ldots, (100000, 100)\} \]
  • Cross-validation error for each parameter combination: \[ E(\theta) = \frac{1}{k} \sum_{i=1}^k \text{Loss}(\hat{y}_i, y_i) \] where \(k\) is the number of folds, \(\hat{y}_i\) is the predicted value, and \(y_i\) is the true value.

4. Randomized Search Optimization

Mathematical Optimization:

  • Instead of evaluating all grid points, sample \(n\) random points: \[ \Theta_{\text{sampled}} \subset \Theta \]
  • Choose the best parameters based on sampled evaluations: \[ \theta^* = \arg\min_{\theta \in \Theta_{\text{sampled}}} E(\theta) \]

5. Challenges and Insights

Challenges of Iterative Algorithms:

  • Many machine learning algorithms, like Gradient Descent, are difficult to parallelize due to iterative dependencies: \[ \theta_{t+1} = \theta_t - \eta \nabla f(\theta_t) \] where \(\theta_t\) is the parameter at iteration \(t\), \(\eta\) is the learning rate, and \(\nabla f\) is the gradient of the objective function.

Fault Tolerance in Distributed Systems:

  • Use techniques like checkpointing: \[ S = \frac{T_{\text{failure-recovery}}}{T_{\text{no-failure}}} \] where \(S\) is the system’s fault tolerance efficiency.

6. Summary

Mathematical insights enhance our understanding of parallel and distributed computing: 1. Data Parallelism: - Divide and conquer: \(T(D) = \sum_{i=1}^n T(D_i)\). - Effective for tasks like grid search and MapReduce. 2. Graph Parallelism: - Future modules will explore partitioning nodes and edges in complex tasks.

Tools like Scikit-learn and visualization techniques (e.g., heatmaps) make these concepts practical and accessible.


library(Rgraphviz)

# Example graph
g <- new("graphNEL", nodes = c("A", "B", "C"), edgemode = "directed")
g <- addEdge("A", "B", g)
g <- addEdge("B", "C", g)

# Render graph
plot(g)
library(DiagrammeR)

grViz("
  digraph dot {
    A -> B -> C;
    B -> D;
  }
")
LS0tDQp0aXRsZTogIlIgTm90ZWJvb2siDQpvdXRwdXQ6IGh0bWxfbm90ZWJvb2sNCi0tLQ0KDQo8IS0tIEdvb2dsZSB0YWcgKGd0YWcuanMpIC0tPiAgDQo8c2NyaXB0IGFzeW5jIHNyYz0iaHR0cHM6Ly93d3cuZ29vZ2xldGFnbWFuYWdlci5jb20vZ3RhZy9qcz9pZD1HLUNWMjY0OEdRTUsiPjwvc2NyaXB0PiAgDQo8c2NyaXB0PiAgDQogIHdpbmRvdy5kYXRhTGF5ZXIgPSB3aW5kb3cuZGF0YUxheWVyIHx8IFtdOyAgDQogIGZ1bmN0aW9uIGd0YWcoKXtkYXRhTGF5ZXIucHVzaChhcmd1bWVudHMpO30gIA0KICBndGFnKCdqcycsIG5ldyBEYXRlKCkpOyAgDQoNCiAgZ3RhZygnY29uZmlnJywgJ0ctQ1YyNjQ4R1FNSycpOyAgDQo8L3NjcmlwdD4gIA0KDQpTdHJlYW0gTmFtZTogVGV4YXNjaGlra2l0YSAgDQpTdHJlYW0gVVJMOiBodHRwczovL3JwdWJzLmNvbS9UZXhhc2NoaWtraXRhICANClN0cmVhbSBJRDogOTk2MjMyNDE3OSAgDQpNZWFzdXJlbWVudCBJZDogRy1DVjI2NDhHUU1LICANCi0tLQ0KDQoNCiMjIE1vZHVsZSAxNDogRGF0YSBQYXJhbGxlbGlzbSBhbmQgR3JhcGggUGFyYWxsZWxpc20NCg0KIyMjICoqMS4gSW50cm9kdWN0aW9uIHRvIERhdGEgUGFyYWxsZWxpc20qKg0KDQojIyMjICoqQmlnIERhdGEgT3ZlcnZpZXcqKg0KLSAqKk1hdGhlbWF0aWNhbCBSZXByZXNlbnRhdGlvbiBvZiBQYXJhbGxlbCBDb21wdXRhdGlvbioqOg0KICBHaXZlbiBhIGRhdGFzZXQgXCggRCBcKSBhbmQgYSB0YXNrIFwoIFQgXCksIHRoZSB0YXNrIGNhbiBiZSBkaXZpZGVkIGludG8gXCggbiBcKSBwYXJ0czoNCiAgXFsNCiAgVChEKSA9IFxzdW1fe2k9MX1ebiBUKERfaSkNCiAgXF0NCiAgd2hlcmUgXCggRF9pIFwpIGFyZSBzdWJzZXRzIG9mIHRoZSBkYXRhIFwoIEQgXCksIHByb2Nlc3NlZCBpbmRlcGVuZGVudGx5Lg0KDQotICoqVGhyb3VnaHB1dCBBbmFseXNpcyoqOg0KICBUaGUgc3BlZWR1cCBcKCBTIFwpIG9mIHBhcmFsbGVsIHByb2Nlc3NpbmcgY29tcGFyZWQgdG8gc2VyaWFsIHByb2Nlc3Npbmc6DQogIFxbDQogIFMgPSBcZnJhY3tUX3tcdGV4dHtzZXJpYWx9fX17VF97XHRleHR7cGFyYWxsZWx9fX0NCiAgXF0NCiAgd2hlcmUgXCggVF97XHRleHR7c2VyaWFsfX0gXCkgaXMgdGhlIHRpbWUgZm9yIHNlcXVlbnRpYWwgZXhlY3V0aW9uLCBhbmQgXCggVF97XHRleHR7cGFyYWxsZWx9fSBcKSBpcyB0aGUgZXhlY3V0aW9uIHRpbWUgd2l0aCBcKCBuIFwpIHBhcmFsbGVsIHVuaXRzLg0KDQotLS0NCg0KIyMjICoqMi4gRGF0YSBQYXJhbGxlbGlzbSoqDQoNCiMjIyMgKipNYXBSZWR1Y2UgRnJhbWV3b3JrKioNCi0gKipNYXRoZW1hdGljYWwgU3RlcHMqKjoNCiAgLSAqKk1hcHBpbmcqKjogVHJhbnNmb3JtIGRhdGEgaW50byBpbnRlcm1lZGlhdGUga2V5LXZhbHVlIHBhaXJzOg0KICAgIFxbDQogICAgZih4KSA9IChrLCB2KQ0KICAgIFxdDQogICAgd2hlcmUgXCggayBcKSBpcyB0aGUga2V5IGFuZCBcKCB2IFwpIGlzIHRoZSB2YWx1ZSBleHRyYWN0ZWQgZnJvbSBkYXRhIFwoIHggXCkuDQoNCiAgLSAqKlJlZHVjaW5nKio6IEFnZ3JlZ2F0ZSB2YWx1ZXMgYmFzZWQgb24gdGhlIGtleXM6DQogICAgXFsNCiAgICBnKGssIFt2XzEsIHZfMiwgLi4uLCB2X21dKSA9IFx0ZXh0e1Jlc3VsdH0NCiAgICBcXQ0KDQojIyMjICoqR3JpZCBTZWFyY2ggUGFyYW1ldGVyIFR1bmluZyoqDQotIEdpdmVuIHBhcmFtZXRlcnMgXCggXFRoZXRhXzEsIFxUaGV0YV8yLCBcbGRvdHMsIFxUaGV0YV9rIFwpLCB0aGUgQ2FydGVzaWFuIHByb2R1Y3Qgb2YgdGhlc2UgcGFyYW1ldGVycyBkZWZpbmVzIHRoZSBncmlkOg0KICBcWw0KICBHID0gXFRoZXRhXzEgXHRpbWVzIFxUaGV0YV8yIFx0aW1lcyBcbGRvdHMgXHRpbWVzIFxUaGV0YV9rDQogIFxdDQogIC0gRXhhbXBsZTogRm9yIFwoIFx0ZXh0e2dhbW1hfSA9IFx7MTAwMCwgMTAwMDAsIDEwMDAwMFx9IFwpIGFuZCBcKCBcdGV4dHtDfSA9IFx7MSwgMTAsIDEwMFx9XCk6DQogICAgXFsNCiAgICBHID0gXHsoMTAwMCwgMSksICgxMDAwLCAxMCksIFxsZG90cywgKDEwMDAwMCwgMTAwKVx9DQogICAgXF0NCg0KLSBDcm9zcy12YWxpZGF0aW9uIGVycm9yIGZvciBlYWNoIHBhcmFtZXRlciBjb21iaW5hdGlvbjoNCiAgXFsNCiAgRShcdGhldGEpID0gXGZyYWN7MX17a30gXHN1bV97aT0xfV5rIFx0ZXh0e0xvc3N9KFxoYXR7eX1faSwgeV9pKQ0KICBcXQ0KICB3aGVyZSBcKCBrIFwpIGlzIHRoZSBudW1iZXIgb2YgZm9sZHMsIFwoIFxoYXR7eX1faSBcKSBpcyB0aGUgcHJlZGljdGVkIHZhbHVlLCBhbmQgXCggeV9pIFwpIGlzIHRoZSB0cnVlIHZhbHVlLg0KDQotLS0NCg0KIyMjICoqMy4gVmlzdWFsaXphdGlvbiBvZiBQYXJhbWV0ZXIgU2VhcmNoKioNCi0gVXNlIGhlYXRtYXBzIHRvIHZpc3VhbGl6ZSBwZXJmb3JtYW5jZSBvdmVyIHRoZSBwYXJhbWV0ZXIgc3BhY2UuDQotICoqTWF0aGVtYXRpY2FsIFJlcHJlc2VudGF0aW9uKio6DQogIFRoZSBncmlkIHNlYXJjaCBwZXJmb3JtYW5jZSBtYXRyaXg6DQogIFxbDQogIE0oaSwgaikgPSBcdGV4dHttZWFuXF9zY29yZX0oQ19pLCBcZ2FtbWFfaikNCiAgXF0NCiAgd2hlcmUgXCggTShpLCBqKSBcKSBzdG9yZXMgdGhlIGNyb3NzLXZhbGlkYXRpb24gc2NvcmUgZm9yIHBhcmFtZXRlciBjb21iaW5hdGlvbiBcKCAoQ19pLCBcZ2FtbWFfaikgXCkuDQoNCmBgYHB5dGhvbg0KaW1wb3J0IG1hdHBsb3RsaWIucHlwbG90IGFzIHBsdA0KaW1wb3J0IG51bXB5IGFzIG5wDQoNCiMgU2ltdWxhdGVkIHJlc3VsdHMgZm9yIHZpc3VhbGl6YXRpb24NCnNjb3JlcyA9IG5wLmFycmF5KFtbMC44LCAwLjg1LCAwLjg3XSwgWzAuODMsIDAuODgsIDAuODldLCBbMC44NSwgMC44OSwgMC45XV0pDQpDX3ZhbHVlcyA9IFsxLCAxMCwgMTAwXQ0KZ2FtbWFfdmFsdWVzID0gWzEwMDAsIDEwMDAwLCAxMDAwMDBdDQoNCiMgSGVhdG1hcA0KcGx0Lmltc2hvdyhzY29yZXMsIGludGVycG9sYXRpb249J25lYXJlc3QnLCBjbWFwPSd2aXJpZGlzJykNCnBsdC50aXRsZSgnR3JpZCBTZWFyY2ggUGVyZm9ybWFuY2UnKQ0KcGx0LmNvbG9yYmFyKCkNCnBsdC54dGlja3MocmFuZ2UobGVuKGdhbW1hX3ZhbHVlcykpLCBnYW1tYV92YWx1ZXMpDQpwbHQueXRpY2tzKHJhbmdlKGxlbihDX3ZhbHVlcykpLCBDX3ZhbHVlcykNCnBsdC54bGFiZWwoJ0dhbW1hJykNCnBsdC55bGFiZWwoJ0MnKQ0KcGx0LnNob3coKQ0KYGBgDQoNCi0tLQ0KDQojIyMgKio0LiBSYW5kb21pemVkIFNlYXJjaCBPcHRpbWl6YXRpb24qKg0KDQojIyMjICoqTWF0aGVtYXRpY2FsIE9wdGltaXphdGlvbioqOg0KLSBJbnN0ZWFkIG9mIGV2YWx1YXRpbmcgYWxsIGdyaWQgcG9pbnRzLCBzYW1wbGUgXCggbiBcKSByYW5kb20gcG9pbnRzOg0KICBcWw0KICBcVGhldGFfe1x0ZXh0e3NhbXBsZWR9fSBcc3Vic2V0IFxUaGV0YQ0KICBcXQ0KLSBDaG9vc2UgdGhlIGJlc3QgcGFyYW1ldGVycyBiYXNlZCBvbiBzYW1wbGVkIGV2YWx1YXRpb25zOg0KICBcWw0KICBcdGhldGFeKiA9IFxhcmdcbWluX3tcdGhldGEgXGluIFxUaGV0YV97XHRleHR7c2FtcGxlZH19fSBFKFx0aGV0YSkNCiAgXF0NCg0KLS0tDQoNCiMjIyAqKjUuIENoYWxsZW5nZXMgYW5kIEluc2lnaHRzKioNCg0KIyMjIyAqKkNoYWxsZW5nZXMgb2YgSXRlcmF0aXZlIEFsZ29yaXRobXMqKjoNCi0gTWFueSBtYWNoaW5lIGxlYXJuaW5nIGFsZ29yaXRobXMsIGxpa2UgR3JhZGllbnQgRGVzY2VudCwgYXJlIGRpZmZpY3VsdCB0byBwYXJhbGxlbGl6ZSBkdWUgdG8gaXRlcmF0aXZlIGRlcGVuZGVuY2llczoNCiAgXFsNCiAgXHRoZXRhX3t0KzF9ID0gXHRoZXRhX3QgLSBcZXRhIFxuYWJsYSBmKFx0aGV0YV90KQ0KICBcXQ0KICB3aGVyZSBcKCBcdGhldGFfdCBcKSBpcyB0aGUgcGFyYW1ldGVyIGF0IGl0ZXJhdGlvbiBcKCB0IFwpLCBcKCBcZXRhIFwpIGlzIHRoZSBsZWFybmluZyByYXRlLCBhbmQgXCggXG5hYmxhIGYgXCkgaXMgdGhlIGdyYWRpZW50IG9mIHRoZSBvYmplY3RpdmUgZnVuY3Rpb24uDQoNCiMjIyMgKipGYXVsdCBUb2xlcmFuY2UgaW4gRGlzdHJpYnV0ZWQgU3lzdGVtcyoqOg0KLSBVc2UgdGVjaG5pcXVlcyBsaWtlIGNoZWNrcG9pbnRpbmc6DQogIFxbDQogIFMgPSBcZnJhY3tUX3tcdGV4dHtmYWlsdXJlLXJlY292ZXJ5fX19e1Rfe1x0ZXh0e25vLWZhaWx1cmV9fX0NCiAgXF0NCiAgd2hlcmUgXCggUyBcKSBpcyB0aGUgc3lzdGVtJ3MgZmF1bHQgdG9sZXJhbmNlIGVmZmljaWVuY3kuDQoNCi0tLQ0KDQojIyMgKio2LiBTdW1tYXJ5KioNCg0KTWF0aGVtYXRpY2FsIGluc2lnaHRzIGVuaGFuY2Ugb3VyIHVuZGVyc3RhbmRpbmcgb2YgcGFyYWxsZWwgYW5kIGRpc3RyaWJ1dGVkIGNvbXB1dGluZzoNCjEuICoqRGF0YSBQYXJhbGxlbGlzbSoqOg0KICAgLSBEaXZpZGUgYW5kIGNvbnF1ZXI6IFwoIFQoRCkgPSBcc3VtX3tpPTF9Xm4gVChEX2kpIFwpLg0KICAgLSBFZmZlY3RpdmUgZm9yIHRhc2tzIGxpa2UgZ3JpZCBzZWFyY2ggYW5kIE1hcFJlZHVjZS4NCjIuICoqR3JhcGggUGFyYWxsZWxpc20qKjoNCiAgIC0gRnV0dXJlIG1vZHVsZXMgd2lsbCBleHBsb3JlIHBhcnRpdGlvbmluZyBub2RlcyBhbmQgZWRnZXMgaW4gY29tcGxleCB0YXNrcy4NCg0KVG9vbHMgbGlrZSBTY2lraXQtbGVhcm4gYW5kIHZpc3VhbGl6YXRpb24gdGVjaG5pcXVlcyAoZS5nLiwgaGVhdG1hcHMpIG1ha2UgdGhlc2UgY29uY2VwdHMgcHJhY3RpY2FsIGFuZCBhY2Nlc3NpYmxlLg0KDQotLS0NCg0KYGBge3J9DQpsaWJyYXJ5KFJncmFwaHZpeikNCg0KIyBFeGFtcGxlIGdyYXBoDQpnIDwtIG5ldygiZ3JhcGhORUwiLCBub2RlcyA9IGMoIkEiLCAiQiIsICJDIiksIGVkZ2Vtb2RlID0gImRpcmVjdGVkIikNCmcgPC0gYWRkRWRnZSgiQSIsICJCIiwgZykNCmcgPC0gYWRkRWRnZSgiQiIsICJDIiwgZykNCg0KIyBSZW5kZXIgZ3JhcGgNCnBsb3QoZykNCmBgYA0KDQpgYGB7cn0NCmxpYnJhcnkoRGlhZ3JhbW1lUikNCg0KZ3JWaXooIg0KICBkaWdyYXBoIGRvdCB7DQogICAgQSAtPiBCIC0+IEM7DQogICAgQiAtPiBEOw0KICB9DQoiKQ0KYGBgDQo=