
Performance Optimization

Groggy is designed for high performance with a Rust core, but optimal performance requires understanding the library's design patterns. This guide covers benchmarking, optimization techniques, and scaling strategies.


Performance Philosophy

Groggy's performance comes from:

  1. Rust core: Zero-cost abstractions, memory safety without garbage collection
  2. Columnar storage: Cache-friendly data layout for bulk operations
  3. Lazy evaluation: Views and deferred computation where beneficial
  4. Parallel algorithms: Multi-threaded execution for large graphs
  5. Attribute-first design: Fast filtering and queries on node/edge attributes
  6. Explicit trait-backed delegation: Direct PyO3 methods (100ns FFI budget) instead of dynamic lookups

Trait-Backed Delegation (v0.5.1+)

Starting in v0.5.1, Groggy uses explicit PyO3 methods backed by Rust traits rather than dynamic __getattr__ delegation. This provides:

  • 20x faster method calls: ~100ns per FFI call vs ~2000ns for dynamic lookup
  • Better IDE support: All methods discoverable via autocomplete
  • Clear stack traces: Direct method names in debugging

See Trait-Backed Delegation for architectural details.
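
The cost difference can be illustrated in plain Python. The sketch below is not Groggy's actual implementation (the real delegation lives in Rust/PyO3); it only shows why explicit methods are cheaper and more discoverable than a __getattr__ fallback.

# Conceptual sketch only -- not Groggy internals.
class DynamicDelegate:
    """Every call goes through __getattr__: an extra lookup and wrapper per call."""
    def __init__(self, core):
        self._core = core

    def __getattr__(self, name):
        # Invoked on every attribute miss; invisible to autocomplete and stack traces.
        return getattr(self._core, name)

class ExplicitDelegate:
    """Each method is defined explicitly: resolved directly, visible to IDEs."""
    def __init__(self, core):
        self._core = core

    def connected_components(self):
        return self._core.connected_components()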


Benchmarking

Basic Timing

Measure operation performance:

import groggy as gr
import time

g = gr.generators.erdos_renyi(n=10000, p=0.01)

# Time an operation
start = time.perf_counter()
components = g.connected_components()
elapsed = time.perf_counter() - start

print(f"Connected components: {elapsed*1000:.2f} ms")

Using timeit

For more accurate measurements:

import timeit

def benchmark():
    g = gr.generators.erdos_renyi(n=10000, p=0.01)
    return g.connected_components()

# Run the benchmark 10 times per repeat, for 3 repeats
times = timeit.repeat(benchmark, number=10, repeat=3)
best_per_call = min(times) / 10  # fastest repeat, divided by calls per repeat

print(f"Best per call: {best_per_call*1000:.2f} ms")

Memory Profiling

Track Python-side memory usage (tracemalloc only records allocations made through Python's allocator, so memory held inside the Rust core may not appear):

import tracemalloc

tracemalloc.start()

g = gr.generators.erdos_renyi(n=100000, p=0.001)
components = g.connected_components()

current, peak = tracemalloc.get_traced_memory()
print(f"Current: {current / 1024**2:.1f} MB")
print(f"Peak: {peak / 1024**2:.1f} MB")

tracemalloc.stop()

Optimization Patterns

Pattern 1: Bulk Operations

Avoid: Loops over individual operations
Prefer: Bulk operations

# ❌ Slow: Individual operations
g = gr.Graph()
for i in range(10000):
    g.add_node(id=i, value=i*2)

# ✅ Fast: Bulk operations
nodes_data = [{"id": i, "value": i*2} for i in range(10000)]
g.add_nodes(nodes_data)

Why faster:

  • Single FFI call vs 10,000 FFI calls
  • Rust can optimize batch processing
  • Better cache locality

Pattern 2: Attribute-First Filtering

Avoid: Fetching all data then filtering in Python
Prefer: Push filters to Rust

# ❌ Slow: Filter in Python
all_nodes = g.nodes.table().to_pandas()
young = all_nodes[all_nodes['age'] < 30]

# ✅ Fast: Filter in Rust
young = g.nodes[g.nodes['age'] < 30]

Why faster:

  • Filtering happens in Rust (compiled code)
  • Only matching data crosses the FFI boundary
  • Columnar storage is optimized for attribute queries

Pattern 3: Views vs Materialization

Understand when to materialize:

# Views are free
sub = g.nodes[g.nodes['age'] < 30]  # O(1) view creation

# But repeated access on views can be costly
for _ in range(1000):
    count = sub.node_count()  # Recomputes filter each time

# Materialize for repeated access
sub_graph = sub.to_graph()  # O(n) materialization
for _ in range(1000):
    count = sub_graph.node_count()  # O(1) cached result

Guidelines:

  • Use views for single-pass operations
  • Materialize for repeated access
  • Profile to find the crossover point (see the sketch below)
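
A minimal sketch for finding that crossover on your own workload, assuming the same g and 'age' attribute as the example above and an accesses count you vary:

import time

def time_view(g, accesses):
    sub = g.nodes[g.nodes['age'] < 30]               # O(1) view
    start = time.perf_counter()
    for _ in range(accesses):
        sub.node_count()                              # filter recomputed each access
    return time.perf_counter() - start

def time_materialized(g, accesses):
    start = time.perf_counter()
    sub_g = g.nodes[g.nodes['age'] < 30].to_graph()   # pay O(n) once up front
    for _ in range(accesses):
        sub_g.node_count()                            # cached result
    return time.perf_counter() - start

for accesses in (1, 10, 100, 1000):
    print(accesses, time_view(g, accesses), time_materialized(g, accesses))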

Pattern 4: Avoid Unnecessary Conversions

# ❌ Slow: Multiple conversions
df = g.nodes.table().to_pandas()
ages = df['age'].tolist()
mean_age = sum(ages) / len(ages)

# ✅ Fast: Use native operations
mean_age = g.nodes['age'].mean()

Why faster:

  • No DataFrame overhead
  • No Python list creation
  • Computation stays in Rust

Pattern 5: Sparse vs Dense Matrices

For large graphs, understand matrix representations:

# Large sparse graph
g = gr.generators.erdos_renyi(n=10000, p=0.001)

# Adjacency is sparse (fast)
A = g.adjacency_matrix()  # ~100K edges, not 100M

# Operations respect sparsity
A2 = A.matmul(A)  # Sparse matrix multiply

# Dense conversion only if needed
if A.is_sparse() and some_condition:
    A_dense = A.dense()  # Careful: 400MB for 10K nodes

Pattern 6: Parallel Algorithms

Groggy automatically parallelizes when beneficial:

# Automatically parallel for large graphs
g = gr.generators.erdos_renyi(n=100000, p=0.0001)
components = g.connected_components()  # Uses multiple threads

No configuration needed:

  • Algorithms choose parallelization based on graph size
  • Rust's rayon handles the thread pool
  • GIL released during computation (see the sketch below)
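
Because the GIL is released during these calls, independent graphs can also be processed concurrently from Python threads. A minimal sketch, assuming each worker operates on its own graph:

from concurrent.futures import ThreadPoolExecutor

import groggy as gr

graphs = [gr.generators.erdos_renyi(n=50000, p=0.0002) for _ in range(4)]

# Each call runs in Rust with the GIL released, so the threads overlap.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(lambda g: g.connected_components(), graphs))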

Pattern 7: Incremental Updates

For evolving graphs, use incremental operations:

# ❌ Slow: Rebuild entire graph
for new_edge in stream:
    g_old = g
    g = gr.Graph()
    # Copy all nodes/edges + new edge

# ✅ Fast: Incremental updates
for new_edge in stream:
    g.add_edge(*new_edge)  # O(1) append

Scaling Strategies

Small Graphs (< 1K nodes)

Characteristics:

  • FFI overhead dominates
  • Memory is not a concern
  • Single-threaded is often faster

Recommendations:

# Minimize FFI crossings
data = prepare_all_data()  # Python-side prep
g = gr.Graph()
g.add_nodes(data['nodes'])  # Single bulk call
g.add_edges(data['edges'])

Medium Graphs (1K - 100K nodes)

Characteristics:

  • Computation and memory balanced
  • Parallelization starts helping
  • Attribute queries very fast

Recommendations:

# Use bulk operations
g.connected_components(inplace=True, label='component')

# Leverage columnar storage
high_degree = g.nodes[g.nodes['degree'] > 10]

# Cache computed results
degrees = g.degree()  # Compute once
for analysis in analyses:
    analysis.use_degrees(degrees)

Large Graphs (100K - 10M nodes)

Characteristics:

  • Memory becomes critical
  • Parallel algorithms essential
  • Sparse representations required

Recommendations:

# Work with subgraphs
core = g.nodes[g.nodes['degree'] > 5]
core_analysis = core.connected_components()

# Stream processing for edges
# (if streaming API available)

# Use sparse matrices
A = g.adjacency_matrix()
assert A.is_sparse()  # Verify sparse

Very Large Graphs (> 10M nodes)

Characteristics:

  • Cannot fit everything in memory
  • Must use sampling/partitioning
  • Approximate algorithms needed

Recommendations:

# Sample for analysis
sample = g.nodes.sample(100000)
sample_graph = sample.to_graph()
metrics = sample_graph.degree()

# Partition by components
components = g.connected_components()
for comp in components[:10]:  # Process the first 10 components (sort by size first if you need the largest)
    analyze(comp.to_graph())

# Use approximate algorithms
# (when available)
# estimate = g.approximate_pagerank()


Common Performance Issues

Issue 1: Python Loops Over Graph Elements

Problem:

# Very slow for large graphs
total = 0
for node in g.nodes.ids():
    total += g.nodes[node]['value']

Solution:

# Much faster
total = g.nodes['value'].sum()

Why:

  • First version: N FFI calls, Python arithmetic
  • Second version: 1 FFI call, Rust arithmetic

Issue 2: Repeated DataFrame Conversions

Problem:

for attribute in ['age', 'score', 'rank']:
    df = g.nodes.table().to_pandas()  # Slow conversion each time
    print(df[attribute].mean())

Solution:

# Convert once
df = g.nodes.table().to_pandas()
for attribute in ['age', 'score', 'rank']:
    print(df[attribute].mean())

# Or better: stay in Groggy
for attribute in ['age', 'score', 'rank']:
    print(g.nodes[attribute].mean())

Issue 3: Unnecessary Materialization

Problem:

# Materializes even though we only need count
filtered = g.nodes[g.nodes['age'] < 30].to_graph()
count = filtered.node_count()

Solution:

# View is sufficient
filtered = g.nodes[g.nodes['age'] < 30]
count = len(filtered)  # or filtered.node_count()

Issue 4: Inefficient Attribute Access

Problem:

# Access pattern unfriendly to columnar storage
for node in g.nodes.ids():
    process(
        g.nodes[node]['age'],
        g.nodes[node]['score'],
        g.nodes[node]['rank']
    )

Solution:

# Columnar access
ages = g.nodes['age'].to_list()
scores = g.nodes['score'].to_list()
ranks = g.nodes['rank'].to_list()

for age, score, rank in zip(ages, scores, ranks):
    process(age, score, rank)

# Or even better: bulk operation
result = process_bulk(
    g.nodes['age'],
    g.nodes['score'],
    g.nodes['rank']
)


Profiling and Debugging

Finding Bottlenecks

Use Python profilers:

import cProfile
import pstats

def analysis():
    g = gr.generators.karate_club()
    g.connected_components(inplace=True)
    return g.nodes.table().to_pandas()

# Profile
profiler = cProfile.Profile()
profiler.enable()
result = analysis()
profiler.disable()

# View results
stats = pstats.Stats(profiler)
stats.sort_stats('cumulative')
stats.print_stats(10)

Rust-Level Profiling

For core Rust performance:

# Build with profiling
cargo build --release --features profiling

# Use system profilers
# macOS: Instruments
# Linux: perf, valgrind

Memory Leaks

Check for memory growth:

import gc
import tracemalloc

tracemalloc.start()

for i in range(100):
    g = gr.generators.erdos_renyi(n=10000, p=0.01)
    components = g.connected_components()

    if i % 10 == 0:
        gc.collect()
        current, peak = tracemalloc.get_traced_memory()
        print(f"Iteration {i}: {current / 1024**2:.1f} MB")

Performance Checklist

Before Optimization

  • [ ] Profile to find actual bottleneck
  • [ ] Measure baseline performance
  • [ ] Identify the slowest 20% of code
  • [ ] Understand data size and growth

Graph Construction

  • [ ] Use bulk add_nodes() and add_edges()
  • [ ] Prepare data in Python, send in batches
  • [ ] Avoid repeated small additions
  • [ ] Set attributes during creation when possible

Graph Queries

  • [ ] Use attribute-based filtering
  • [ ] Keep filters in Rust (not Python)
  • [ ] Materialize only when needed
  • [ ] Cache frequently accessed results

Algorithms

  • [ ] Let Groggy parallelize automatically
  • [ ] Work on subgraphs when appropriate
  • [ ] Use in-place operations when available
  • [ ] Batch algorithm runs when possible

Data Export

  • [ ] Convert to DataFrame once, not repeatedly
  • [ ] Use appropriate format (CSV, Parquet, Bundle)
  • [ ] Export only needed attributes
  • [ ] Stream large exports if available

Real-World Examples

Example 1: Social Network Analysis

# Load large social network
g = gr.GraphTable.load_bundle("social_network.bundle")

# ✅ Efficient analysis
start = time.time()

# Filter in Rust
active_users = g.nodes[g.nodes['last_active'] > cutoff_date]

# Compute in bulk
degrees = active_users.degree()

# Materialize for multi-use
active_graph = active_users.to_graph()
components = active_graph.connected_components()

# Export efficiently
results = {
    'degrees': degrees.to_list(),
    'num_components': len(components)
}

print(f"Analysis: {time.time() - start:.2f}s")

Example 2: Real-Time Updates

# Streaming graph updates
g = gr.Graph()

batch = []
for event in event_stream:
    batch.append(event)

    # Batch updates
    if len(batch) >= 1000:
        edges = [(e['src'], e['tgt'], e['attrs']) for e in batch]
        g.add_edges(edges)
        batch.clear()

        # Periodic analysis
        if g.edge_count() % 10000 == 0:
            metrics = compute_metrics(g)
            emit_metrics(metrics)

Example 3: Large Graph Sampling

# Too large to process fully
g = gr.GraphTable.load_bundle("huge_graph.bundle")

# Sample and analyze
sample_size = 10000
sample = g.nodes.sample(sample_size)
sample_graph = sample.to_graph()

# Fast analysis on sample
degrees = sample_graph.degree()
clustering = sample_graph.clustering_coefficient()

# Extrapolate to full graph
estimated_avg_degree = degrees.mean() * (g.node_count() / sample_size)

Performance Goals

Groggy targets these performance characteristics:

Operation                     Target Complexity    Notes
Add node                      O(1) amortized       Append to columnar storage
Add edge                      O(1) amortized       Append to edge list
Attribute query               O(n)                 Full column scan
Attribute filter              O(n)                 Vectorized in Rust
BFS/DFS                       O(V + E)             Parallel for large graphs
Connected components          O(V + E)             Union-find with path compression
Degree computation            O(E)                 Linear in edge count
Subgraph creation (view)      O(1)                 Lazy view
Subgraph materialization      O(V + E)             Copy induced subgraph
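
These targets can be spot-checked empirically. The sketch below times bulk node insertion at increasing sizes; if appends are amortized O(1), the per-node cost should stay roughly flat as n grows.

import time
import groggy as gr

for n in (10_000, 100_000, 1_000_000):
    nodes = [{"id": i, "value": i} for i in range(n)]
    g = gr.Graph()
    start = time.perf_counter()
    g.add_nodes(nodes)                     # single bulk FFI call
    per_node = (time.perf_counter() - start) / n
    print(f"n={n:>9,}: {per_node * 1e9:.0f} ns per node")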

Quick Reference

Fast Operations

# These are optimized
g.add_nodes(data)                    # Bulk insert
g.nodes[g.nodes['age'] > 30]        # Attribute filter
g.degree()                           # Degree computation
g.connected_components()             # Parallel algorithm
A = g.adjacency_matrix()             # Sparse matrix
g.nodes['age'].mean()                # Columnar stats

Slow Patterns to Avoid

# Avoid these
for node in g.nodes.ids():           # Python loop
    g.nodes[node]['value']           # Individual access

df = g.to_pandas()                   # Repeated conversions
for analysis in analyses:
    df = g.to_pandas()

sub_g = sub.to_graph()               # Unnecessary materialization
if sub_g.node_count() > 10:          # when len(sub) on the view would suffice
    ...

When to Materialize

# Keep as view
sub = g.nodes[filter]
count = len(sub)                     # One-time use

# Materialize
sub = g.nodes[filter]
sub_g = sub.to_graph()
for _ in range(100):                 # Repeated access
    analysis(sub_g)

See Also