Stream Name: Texaschikkita
Stream URL: https://rpubs.com/Texaschikkita
Stream ID: 9962324179
Measurement Id: G-CV2648GQMK
—
Module 14: Data Parallelism and Graph Parallelism
1. Introduction to Data Parallelism
Big Data Overview
Mathematical Representation of Parallel
Computation: Given a dataset \(D\) and a task \(T\), the task can be divided into \(n\) parts: \[
T(D) = \sum_{i=1}^n T(D_i)
\] where \(D_i\) are subsets of
the data \(D\), processed
independently.
Throughput Analysis: The speedup \(S\) of parallel processing compared to
serial processing: \[
S = \frac{T_{\text{serial}}}{T_{\text{parallel}}}
\] where \(T_{\text{serial}}\)
is the time for sequential execution, and \(T_{\text{parallel}}\) is the execution time
with \(n\) parallel units.
2. Data Parallelism
MapReduce Framework
- Mathematical Steps:
Mapping: Transform data into intermediate
key-value pairs: \[
f(x) = (k, v)
\] where \(k\) is the key and
\(v\) is the value extracted from data
\(x\).
Reducing: Aggregate values based on the keys:
\[
g(k, [v_1, v_2, ..., v_m]) = \text{Result}
\]
Grid Search Parameter Tuning
- Given parameters \(\Theta_1, \Theta_2,
\ldots, \Theta_k\), the Cartesian product of these parameters
defines the grid: \[
G = \Theta_1 \times \Theta_2 \times \ldots \times \Theta_k
\]
- Example: For \(\text{gamma} = \{1000,
10000, 100000\}\) and \(\text{C} = \{1,
10, 100\}\): \[
G = \{(1000, 1), (1000, 10), \ldots, (100000, 100)\}
\]
- Cross-validation error for each parameter combination: \[
E(\theta) = \frac{1}{k} \sum_{i=1}^k \text{Loss}(\hat{y}_i, y_i)
\] where \(k\) is the number of
folds, \(\hat{y}_i\) is the predicted
value, and \(y_i\) is the true
value.
3. Visualization of Parameter Search
- Use heatmaps to visualize performance over the parameter space.
- Mathematical Representation: The grid search
performance matrix: \[
M(i, j) = \text{mean\_score}(C_i, \gamma_j)
\] where \(M(i, j)\) stores the
cross-validation score for parameter combination \((C_i, \gamma_j)\).
import matplotlib.pyplot as plt
import numpy as np
# Simulated results for visualization
scores = np.array([[0.8, 0.85, 0.87], [0.83, 0.88, 0.89], [0.85, 0.89, 0.9]])
C_values = [1, 10, 100]
gamma_values = [1000, 10000, 100000]
# Heatmap
plt.imshow(scores, interpolation='nearest', cmap='viridis')
plt.title('Grid Search Performance')
plt.colorbar()
plt.xticks(range(len(gamma_values)), gamma_values)
plt.yticks(range(len(C_values)), C_values)
plt.xlabel('Gamma')
plt.ylabel('C')
plt.show()
4. Randomized Search Optimization
Mathematical Optimization:
- Instead of evaluating all grid points, sample \(n\) random points: \[
\Theta_{\text{sampled}} \subset \Theta
\]
- Choose the best parameters based on sampled evaluations: \[
\theta^* = \arg\min_{\theta \in \Theta_{\text{sampled}}} E(\theta)
\]
5. Challenges and Insights
Challenges of Iterative Algorithms:
- Many machine learning algorithms, like Gradient Descent, are
difficult to parallelize due to iterative dependencies: \[
\theta_{t+1} = \theta_t - \eta \nabla f(\theta_t)
\] where \(\theta_t\) is the
parameter at iteration \(t\), \(\eta\) is the learning rate, and \(\nabla f\) is the gradient of the objective
function.
Fault Tolerance in Distributed Systems:
- Use techniques like checkpointing: \[
S = \frac{T_{\text{failure-recovery}}}{T_{\text{no-failure}}}
\] where \(S\) is the system’s
fault tolerance efficiency.
6. Summary
Mathematical insights enhance our understanding of parallel and
distributed computing: 1. Data Parallelism: - Divide
and conquer: \(T(D) = \sum_{i=1}^n
T(D_i)\). - Effective for tasks like grid search and MapReduce.
2. Graph Parallelism: - Future modules will explore
partitioning nodes and edges in complex tasks.
Tools like Scikit-learn and visualization techniques (e.g., heatmaps)
make these concepts practical and accessible.
library(Rgraphviz)
# Example graph
g <- new("graphNEL", nodes = c("A", "B", "C"), edgemode = "directed")
g <- addEdge("A", "B", g)
g <- addEdge("B", "C", g)
# Render graph
plot(g)
library(DiagrammeR)
grViz("
digraph dot {
A -> B -> C;
B -> D;
}
")
LS0tDQp0aXRsZTogIlIgTm90ZWJvb2siDQpvdXRwdXQ6IGh0bWxfbm90ZWJvb2sNCi0tLQ0KDQo8IS0tIEdvb2dsZSB0YWcgKGd0YWcuanMpIC0tPiAgDQo8c2NyaXB0IGFzeW5jIHNyYz0iaHR0cHM6Ly93d3cuZ29vZ2xldGFnbWFuYWdlci5jb20vZ3RhZy9qcz9pZD1HLUNWMjY0OEdRTUsiPjwvc2NyaXB0PiAgDQo8c2NyaXB0PiAgDQogIHdpbmRvdy5kYXRhTGF5ZXIgPSB3aW5kb3cuZGF0YUxheWVyIHx8IFtdOyAgDQogIGZ1bmN0aW9uIGd0YWcoKXtkYXRhTGF5ZXIucHVzaChhcmd1bWVudHMpO30gIA0KICBndGFnKCdqcycsIG5ldyBEYXRlKCkpOyAgDQoNCiAgZ3RhZygnY29uZmlnJywgJ0ctQ1YyNjQ4R1FNSycpOyAgDQo8L3NjcmlwdD4gIA0KDQpTdHJlYW0gTmFtZTogVGV4YXNjaGlra2l0YSAgDQpTdHJlYW0gVVJMOiBodHRwczovL3JwdWJzLmNvbS9UZXhhc2NoaWtraXRhICANClN0cmVhbSBJRDogOTk2MjMyNDE3OSAgDQpNZWFzdXJlbWVudCBJZDogRy1DVjI2NDhHUU1LICANCi0tLQ0KDQoNCiMjIE1vZHVsZSAxNDogRGF0YSBQYXJhbGxlbGlzbSBhbmQgR3JhcGggUGFyYWxsZWxpc20NCg0KIyMjICoqMS4gSW50cm9kdWN0aW9uIHRvIERhdGEgUGFyYWxsZWxpc20qKg0KDQojIyMjICoqQmlnIERhdGEgT3ZlcnZpZXcqKg0KLSAqKk1hdGhlbWF0aWNhbCBSZXByZXNlbnRhdGlvbiBvZiBQYXJhbGxlbCBDb21wdXRhdGlvbioqOg0KICBHaXZlbiBhIGRhdGFzZXQgXCggRCBcKSBhbmQgYSB0YXNrIFwoIFQgXCksIHRoZSB0YXNrIGNhbiBiZSBkaXZpZGVkIGludG8gXCggbiBcKSBwYXJ0czoNCiAgXFsNCiAgVChEKSA9IFxzdW1fe2k9MX1ebiBUKERfaSkNCiAgXF0NCiAgd2hlcmUgXCggRF9pIFwpIGFyZSBzdWJzZXRzIG9mIHRoZSBkYXRhIFwoIEQgXCksIHByb2Nlc3NlZCBpbmRlcGVuZGVudGx5Lg0KDQotICoqVGhyb3VnaHB1dCBBbmFseXNpcyoqOg0KICBUaGUgc3BlZWR1cCBcKCBTIFwpIG9mIHBhcmFsbGVsIHByb2Nlc3NpbmcgY29tcGFyZWQgdG8gc2VyaWFsIHByb2Nlc3Npbmc6DQogIFxbDQogIFMgPSBcZnJhY3tUX3tcdGV4dHtzZXJpYWx9fX17VF97XHRleHR7cGFyYWxsZWx9fX0NCiAgXF0NCiAgd2hlcmUgXCggVF97XHRleHR7c2VyaWFsfX0gXCkgaXMgdGhlIHRpbWUgZm9yIHNlcXVlbnRpYWwgZXhlY3V0aW9uLCBhbmQgXCggVF97XHRleHR7cGFyYWxsZWx9fSBcKSBpcyB0aGUgZXhlY3V0aW9uIHRpbWUgd2l0aCBcKCBuIFwpIHBhcmFsbGVsIHVuaXRzLg0KDQotLS0NCg0KIyMjICoqMi4gRGF0YSBQYXJhbGxlbGlzbSoqDQoNCiMjIyMgKipNYXBSZWR1Y2UgRnJhbWV3b3JrKioNCi0gKipNYXRoZW1hdGljYWwgU3RlcHMqKjoNCiAgLSAqKk1hcHBpbmcqKjogVHJhbnNmb3JtIGRhdGEgaW50byBpbnRlcm1lZGlhdGUga2V5LXZhbHVlIHBhaXJzOg0KICAgIFxbDQogICAgZih4KSA9IChrLCB2KQ0KICAgIFxdDQogICAgd2hlcmUgXCggayBcKSBpcyB0aGUga2V5IGFuZCBcKCB2IFwpIGlzIHRoZSB2YWx1ZSBleHRyYWN0ZWQgZnJvbSBkYXRhIFwoIHggXCkuDQoNCiAgLSAqKlJlZHVjaW5nKio6IEFnZ3JlZ2F0ZSB2YWx1ZXMgYmFzZWQgb24gdGhlIGtleXM6DQogICAgXFsNCiAgICBnKGssIFt2XzEsIHZfMiwgLi4uLCB2X21dKSA9IFx0ZXh0e1Jlc3VsdH0NCiAgICBcXQ0KDQojIyMjICoqR3JpZCBTZWFyY2ggUGFyYW1ldGVyIFR1bmluZyoqDQotIEdpdmVuIHBhcmFtZXRlcnMgXCggXFRoZXRhXzEsIFxUaGV0YV8yLCBcbGRvdHMsIFxUaGV0YV9rIFwpLCB0aGUgQ2FydGVzaWFuIHByb2R1Y3Qgb2YgdGhlc2UgcGFyYW1ldGVycyBkZWZpbmVzIHRoZSBncmlkOg0KICBcWw0KICBHID0gXFRoZXRhXzEgXHRpbWVzIFxUaGV0YV8yIFx0aW1lcyBcbGRvdHMgXHRpbWVzIFxUaGV0YV9rDQogIFxdDQogIC0gRXhhbXBsZTogRm9yIFwoIFx0ZXh0e2dhbW1hfSA9IFx7MTAwMCwgMTAwMDAsIDEwMDAwMFx9IFwpIGFuZCBcKCBcdGV4dHtDfSA9IFx7MSwgMTAsIDEwMFx9XCk6DQogICAgXFsNCiAgICBHID0gXHsoMTAwMCwgMSksICgxMDAwLCAxMCksIFxsZG90cywgKDEwMDAwMCwgMTAwKVx9DQogICAgXF0NCg0KLSBDcm9zcy12YWxpZGF0aW9uIGVycm9yIGZvciBlYWNoIHBhcmFtZXRlciBjb21iaW5hdGlvbjoNCiAgXFsNCiAgRShcdGhldGEpID0gXGZyYWN7MX17a30gXHN1bV97aT0xfV5rIFx0ZXh0e0xvc3N9KFxoYXR7eX1faSwgeV9pKQ0KICBcXQ0KICB3aGVyZSBcKCBrIFwpIGlzIHRoZSBudW1iZXIgb2YgZm9sZHMsIFwoIFxoYXR7eX1faSBcKSBpcyB0aGUgcHJlZGljdGVkIHZhbHVlLCBhbmQgXCggeV9pIFwpIGlzIHRoZSB0cnVlIHZhbHVlLg0KDQotLS0NCg0KIyMjICoqMy4gVmlzdWFsaXphdGlvbiBvZiBQYXJhbWV0ZXIgU2VhcmNoKioNCi0gVXNlIGhlYXRtYXBzIHRvIHZpc3VhbGl6ZSBwZXJmb3JtYW5jZSBvdmVyIHRoZSBwYXJhbWV0ZXIgc3BhY2UuDQotICoqTWF0aGVtYXRpY2FsIFJlcHJlc2VudGF0aW9uKio6DQogIFRoZSBncmlkIHNlYXJjaCBwZXJmb3JtYW5jZSBtYXRyaXg6DQogIFxbDQogIE0oaSwgaikgPSBcdGV4dHttZWFuXF9zY29yZX0oQ19pLCBcZ2FtbWFfaikNCiAgXF0NCiAgd2hlcmUgXCggTShpLCBqKSBcKSBzdG9yZXMgdGhlIGNyb3NzLXZhbGlkYXRpb24gc2NvcmUgZm9yIHBhcmFtZXRlciBjb21iaW5hdGlvbiBcKCAoQ19pLCBcZ2FtbWFfaikgXCkuDQoNCmBgYHB5dGhvbg0KaW1wb3J0IG1hdHBsb3RsaWIucHlwbG90IGFzIHBsdA0KaW1wb3J0IG51bXB5IGFzIG5wDQoNCiMgU2ltdWxhdGVkIHJlc3VsdHMgZm9yIHZpc3VhbGl6YXRpb24NCnNjb3JlcyA9IG5wLmFycmF5KFtbMC44LCAwLjg1LCAwLjg3XSwgWzAuODMsIDAuODgsIDAuODldLCBbMC44NSwgMC44OSwgMC45XV0pDQpDX3ZhbHVlcyA9IFsxLCAxMCwgMTAwXQ0KZ2FtbWFfdmFsdWVzID0gWzEwMDAsIDEwMDAwLCAxMDAwMDBdDQoNCiMgSGVhdG1hcA0KcGx0Lmltc2hvdyhzY29yZXMsIGludGVycG9sYXRpb249J25lYXJlc3QnLCBjbWFwPSd2aXJpZGlzJykNCnBsdC50aXRsZSgnR3JpZCBTZWFyY2ggUGVyZm9ybWFuY2UnKQ0KcGx0LmNvbG9yYmFyKCkNCnBsdC54dGlja3MocmFuZ2UobGVuKGdhbW1hX3ZhbHVlcykpLCBnYW1tYV92YWx1ZXMpDQpwbHQueXRpY2tzKHJhbmdlKGxlbihDX3ZhbHVlcykpLCBDX3ZhbHVlcykNCnBsdC54bGFiZWwoJ0dhbW1hJykNCnBsdC55bGFiZWwoJ0MnKQ0KcGx0LnNob3coKQ0KYGBgDQoNCi0tLQ0KDQojIyMgKio0LiBSYW5kb21pemVkIFNlYXJjaCBPcHRpbWl6YXRpb24qKg0KDQojIyMjICoqTWF0aGVtYXRpY2FsIE9wdGltaXphdGlvbioqOg0KLSBJbnN0ZWFkIG9mIGV2YWx1YXRpbmcgYWxsIGdyaWQgcG9pbnRzLCBzYW1wbGUgXCggbiBcKSByYW5kb20gcG9pbnRzOg0KICBcWw0KICBcVGhldGFfe1x0ZXh0e3NhbXBsZWR9fSBcc3Vic2V0IFxUaGV0YQ0KICBcXQ0KLSBDaG9vc2UgdGhlIGJlc3QgcGFyYW1ldGVycyBiYXNlZCBvbiBzYW1wbGVkIGV2YWx1YXRpb25zOg0KICBcWw0KICBcdGhldGFeKiA9IFxhcmdcbWluX3tcdGhldGEgXGluIFxUaGV0YV97XHRleHR7c2FtcGxlZH19fSBFKFx0aGV0YSkNCiAgXF0NCg0KLS0tDQoNCiMjIyAqKjUuIENoYWxsZW5nZXMgYW5kIEluc2lnaHRzKioNCg0KIyMjIyAqKkNoYWxsZW5nZXMgb2YgSXRlcmF0aXZlIEFsZ29yaXRobXMqKjoNCi0gTWFueSBtYWNoaW5lIGxlYXJuaW5nIGFsZ29yaXRobXMsIGxpa2UgR3JhZGllbnQgRGVzY2VudCwgYXJlIGRpZmZpY3VsdCB0byBwYXJhbGxlbGl6ZSBkdWUgdG8gaXRlcmF0aXZlIGRlcGVuZGVuY2llczoNCiAgXFsNCiAgXHRoZXRhX3t0KzF9ID0gXHRoZXRhX3QgLSBcZXRhIFxuYWJsYSBmKFx0aGV0YV90KQ0KICBcXQ0KICB3aGVyZSBcKCBcdGhldGFfdCBcKSBpcyB0aGUgcGFyYW1ldGVyIGF0IGl0ZXJhdGlvbiBcKCB0IFwpLCBcKCBcZXRhIFwpIGlzIHRoZSBsZWFybmluZyByYXRlLCBhbmQgXCggXG5hYmxhIGYgXCkgaXMgdGhlIGdyYWRpZW50IG9mIHRoZSBvYmplY3RpdmUgZnVuY3Rpb24uDQoNCiMjIyMgKipGYXVsdCBUb2xlcmFuY2UgaW4gRGlzdHJpYnV0ZWQgU3lzdGVtcyoqOg0KLSBVc2UgdGVjaG5pcXVlcyBsaWtlIGNoZWNrcG9pbnRpbmc6DQogIFxbDQogIFMgPSBcZnJhY3tUX3tcdGV4dHtmYWlsdXJlLXJlY292ZXJ5fX19e1Rfe1x0ZXh0e25vLWZhaWx1cmV9fX0NCiAgXF0NCiAgd2hlcmUgXCggUyBcKSBpcyB0aGUgc3lzdGVtJ3MgZmF1bHQgdG9sZXJhbmNlIGVmZmljaWVuY3kuDQoNCi0tLQ0KDQojIyMgKio2LiBTdW1tYXJ5KioNCg0KTWF0aGVtYXRpY2FsIGluc2lnaHRzIGVuaGFuY2Ugb3VyIHVuZGVyc3RhbmRpbmcgb2YgcGFyYWxsZWwgYW5kIGRpc3RyaWJ1dGVkIGNvbXB1dGluZzoNCjEuICoqRGF0YSBQYXJhbGxlbGlzbSoqOg0KICAgLSBEaXZpZGUgYW5kIGNvbnF1ZXI6IFwoIFQoRCkgPSBcc3VtX3tpPTF9Xm4gVChEX2kpIFwpLg0KICAgLSBFZmZlY3RpdmUgZm9yIHRhc2tzIGxpa2UgZ3JpZCBzZWFyY2ggYW5kIE1hcFJlZHVjZS4NCjIuICoqR3JhcGggUGFyYWxsZWxpc20qKjoNCiAgIC0gRnV0dXJlIG1vZHVsZXMgd2lsbCBleHBsb3JlIHBhcnRpdGlvbmluZyBub2RlcyBhbmQgZWRnZXMgaW4gY29tcGxleCB0YXNrcy4NCg0KVG9vbHMgbGlrZSBTY2lraXQtbGVhcm4gYW5kIHZpc3VhbGl6YXRpb24gdGVjaG5pcXVlcyAoZS5nLiwgaGVhdG1hcHMpIG1ha2UgdGhlc2UgY29uY2VwdHMgcHJhY3RpY2FsIGFuZCBhY2Nlc3NpYmxlLg0KDQotLS0NCg0KYGBge3J9DQpsaWJyYXJ5KFJncmFwaHZpeikNCg0KIyBFeGFtcGxlIGdyYXBoDQpnIDwtIG5ldygiZ3JhcGhORUwiLCBub2RlcyA9IGMoIkEiLCAiQiIsICJDIiksIGVkZ2Vtb2RlID0gImRpcmVjdGVkIikNCmcgPC0gYWRkRWRnZSgiQSIsICJCIiwgZykNCmcgPC0gYWRkRWRnZSgiQiIsICJDIiwgZykNCg0KIyBSZW5kZXIgZ3JhcGgNCnBsb3QoZykNCmBgYA0KDQpgYGB7cn0NCmxpYnJhcnkoRGlhZ3JhbW1lUikNCg0KZ3JWaXooIg0KICBkaWdyYXBoIGRvdCB7DQogICAgQSAtPiBCIC0+IEM7DQogICAgQiAtPiBEOw0KICB9DQoiKQ0KYGBgDQo=