
Appendix C: Performance Cookbook

Practical optimization patterns for Groggy

This cookbook provides actionable performance guidance based on Groggy's architecture. Each recipe includes the pattern, why it works, and when to use it.


Understanding Groggy's Performance Model

Core Performance Characteristics

Groggy's performance is based on three architectural decisions:

  1. Columnar Storage - Attributes stored as separate arrays
  2. Immutable Views - No copying unless explicitly requested
  3. Rust Core - High-performance algorithms and data structures

Key Insight:

Bulk operations on columns are 10-100x faster than iteration over individual items


Recipe 1: Batch Operations Over Loops

❌ Anti-Pattern: Item-by-Item Iteration

# SLOW: Python loop, FFI call per node
young_nodes = []
for node_id in g.node_ids():
    age = g.nodes[node_id]["age"]  # FFI call
    if age < 30:
        young_nodes.append(node_id)

Why it's slow:

  • Python loop overhead
  • FFI crossing per iteration
  • No SIMD optimization

✅ Best Practice: Columnar Operations

# FAST: Single columnar filter
young = g.nodes[g.nodes["age"] < 30]

Why it's fast:

  • Single FFI crossing
  • Bulk columnar scan (cache-friendly)
  • SIMD-optimized filtering
  • Returns a view (no copy)

Performance: ~100x faster for large graphs
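Speedups in this range are workload-dependent, so it is worth measuring on your own data. A minimal timing sketch, reusing the two patterns above (results vary with graph size, attribute types, and hardware):

import time

# Time the loop-based filter (one FFI call per node) against the
# columnar filter (a single FFI call).
start = time.perf_counter()
young_loop = [n for n in g.node_ids() if g.nodes[n]["age"] < 30]
loop_time = time.perf_counter() - start

start = time.perf_counter()
young_bulk = g.nodes[g.nodes["age"] < 30]
bulk_time = time.perf_counter() - start

print(f"loop: {loop_time:.3f}s  columnar: {bulk_time:.3f}s")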

When to Use

  • ✅ Filtering nodes/edges by attributes
  • ✅ Aggregating statistics
  • ✅ Transforming attribute values
  • ❌ Complex per-node logic that can't be vectorized
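Even when the per-node logic itself can't be vectorized, FFI overhead can still be kept out of the loop by pulling the needed columns once and iterating over plain Python values. A minimal sketch (to_list() is used the same way in Recipe 6; complex_rule is a hypothetical, non-vectorizable predicate):

# Bulk-fetch columns once, then run arbitrary Python logic per value.
ids = g.node_ids()
ages = g.nodes["age"].to_list()    # one FFI call for the whole column
names = g.nodes["name"].to_list()

special = [
    node_id for node_id, age, name in zip(ids, ages, names)
    if complex_rule(age, name)     # hypothetical predicate that can't be vectorized
]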

Recipe 2: Minimize FFI Crossings

Understanding FFI Overhead

Each call from Python to Rust has overhead (~50-100ns). For tight loops, this adds up.
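To put that in perspective: at ~75 ns per call, a loop that makes one FFI call per node on a 1,000,000-node graph spends roughly 75 ms on call overhead alone, before any useful work is done.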

❌ Anti-Pattern: Frequent Small Calls

# SLOW: Many small FFI calls
total = 0
for node_id in g.node_ids():  # FFI call
    degree = g.degree(node_id)  # FFI call per node
    total += degree

✅ Best Practice: Bulk Retrieval

# FAST: Single bulk operation
degrees = g.degrees()  # One FFI call, returns NumArray
total = degrees.sum()  # Computed in Rust

Performance: O(n) FFI calls → O(1) FFI call

Strategy

  1. Identify the operation you need
  2. Check if bulk version exists (often does)
  3. Retrieve all at once, process in Rust
  4. Return result to Python

Common Bulk Operations

# Instead of loops, use bulk operations:
g.degrees()              # All degrees at once
g.node_ids()             # All node IDs
g.nodes["attribute"]     # Entire attribute column
g.table()                # All data as table

Recipe 3: Use Views, Materialize Only When Needed

Understanding Views

Most Groggy operations return views (lightweight references), not copies.

✅ Views Are Cheap

# All O(1) - no data copying:
subgraph = g.nodes[g.nodes["age"] < 30]     # View
table = g.table()                            # View
array = g["age"]                             # View

⚠️ Materialization Is Expensive

# These create new data structures:
df = g.nodes.table().to_pandas()    # Materializes to DataFrame
g_copy = subgraph.to_graph()        # Copies to new Graph
np_array = matrix.to_numpy()        # Converts to NumPy array

Best Practice: Delay Materialization

# Good: Chain views, materialize at end
result = (g.nodes[g.nodes["age"] < 30]    # View
           .table()                        # View
           .select(["name", "age"])        # View
           .to_pandas())                   # Materialize once

# Bad: Materialize early
df = g.nodes.table().to_pandas()           # Materialize
young_df = df[df["age"] < 30]              # Python filtering
result_df = young_df[["name", "age"]]      # Python selection

Performance Impact:

  • Views: O(1) time, O(1) memory
  • Materialization: O(n) time, O(n) memory

When to Materialize

  • ✅ Final output for external libraries (pandas, numpy)
  • ✅ When you need a persistent copy
  • ✅ For complex Python-side processing
  • ❌ In the middle of a chain

Recipe 4: Leverage Columnar Filtering

How Columnar Filtering Works

Groggy scans attribute columns using SIMD-optimized operations.

✅ Fast Attribute Filters

# FAST: Columnar scan with SIMD
active_users = g.nodes[g.nodes["active"] == True]
heavy_edges = g.edges[g.edges["weight"] > 5.0]
young_adults = g.nodes[
    (g.nodes["age"] >= 18) & (g.nodes["age"] < 30)
]

Complexity: O(n) with SIMD optimization, cache-friendly

⚠️ Slower Topology Filters

# SLOWER: Topology traversal
two_hop_neighbors = g.neighbors(node, depth=2)

Complexity: O(degree^depth)
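For example, with an average degree of 50 and depth=2, that is on the order of 50² = 2,500 nodes examined per starting node.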

Best Practice: Filter Attributes First

# Good: Filter 100k nodes to 1k, then traverse
young = g.nodes[g.nodes["age"] < 25]  # Fast columnar filter
neighbors = young.neighbors(depth=2)   # Smaller graph to traverse

# Bad: Traverse first, then filter
all_neighbors = g.neighbors(some_node, depth=2)  # Large graph
young_neighbors = all_neighbors[all_neighbors["age"] < 25]

Why it's better:

  • Columnar filter is O(n) but cache-friendly
  • Reduces graph size before expensive traversal
  • SIMD optimization on attributes


Recipe 5: Choose the Right Data Structure

Graph vs Table vs Array vs Matrix

Each structure is optimized for different operations:

# Graph - topology operations:
g.neighbors(node)                # Fast: adjacency list lookup
g.add_edge(n1, n2)              # Fast: O(1) amortized

# Table - columnar queries:
g.nodes.table().select(["age", "name"])  # Fast: column selection
df = g.table().to_pandas()               # Fast: bulk conversion

# Array - numeric operations:
ages = g.nodes["age"]            # Fast: direct column access
mean_age = ages.mean()           # Fast: SIMD aggregation

# Matrix - linear algebra:
A = g.adjacency_matrix()         # Fast: CSR sparse matrix
eigvals = A.eigenvalues()        # Fast: optimized linalg

Performance by Operation

Operation            | Best Structure    | Complexity | Notes
---------------------|-------------------|------------|------------------
Get neighbors        | Graph             | O(degree)  | Adjacency list
Filter by attribute  | Graph → Subgraph  | O(n)       | SIMD columnar
Aggregate statistics | Array             | O(n)       | SIMD optimized
Column selection     | Table             | O(1)       | View creation
Matrix operations    | Matrix            | Varies     | Sparse-optimized
Export to pandas     | Table             | O(n)       | Bulk conversion

Recipe: Choose Based on Next Operation

# If filtering → use Graph/Subgraph:
subset = g.nodes[condition]

# If aggregating → use Array:
ages = g["age"]
stats = {"mean": ages.mean(), "std": ages.std()}

# If exporting → use Table:
df = g.table().to_pandas()

# If linear algebra → use Matrix:
L = g.laplacian_matrix()
spectrum = L.eigenvalues()[:10]  # first 10 Laplacian eigenvalues

Recipe 6: Attribute Sparsity Optimization

Understanding Sparse Attributes

Not all nodes have all attributes. Groggy handles this efficiently.

✅ Sparse Storage Is Free

# Only 10% of nodes have "email" attribute - that's fine:
g.add_node(id=0, name="Alice", email="alice@example.com")
g.add_node(id=1, name="Bob")  # No email
g.add_node(id=2, name="Carol")  # No email

# Sparse attributes don't waste memory:
emails = g["email"]  # Only stores 1 value, not 3

Memory: O(non-null values), not O(total nodes)

⚠️ Avoid Dense Computation on Sparse Data

# Bad: Materializes all values (including nulls):
df = g.nodes.table().to_pandas()
email_df = df[["email"]]  # Creates full column with NaN

# Good: Filter first:
has_email = g.nodes[g.nodes["email"].is_not_null()]
emails = has_email["email"].to_list()  # Only non-null values

Best Practice: Query Before Materializing

# Check attribute coverage first:
with_attr = g.nodes[g.nodes["attr"].is_not_null()]
coverage = with_attr.node_count() / g.node_count()  # fraction of nodes with the attribute

if coverage < 0.1:
    # Sparse attribute - work with the filtered subset:
    values = with_attr["attr"]
else:
    # Dense attribute - direct access:
    values = g["attr"]

Recipe 7: Efficient Subgraph Operations

Subgraphs Are Views

Creating a subgraph is O(1) - it's just a view.

✅ Chain Subgraph Operations

# Fast: All views, no copying
result = (g.nodes[g.nodes["age"] < 30]           # View
          .edges[g.edges["weight"] > 5.0]        # View on view
          .connected_components()                 # Algorithm
          .largest()                              # View
          .table())                               # View

Total copies: 0 (until to_pandas() or similar)

⚠️ Converting to Graph Copies Data

# Expensive: Creates new graph
subgraph = g.nodes[condition]
new_graph = subgraph.to_graph()  # O(n + m) copy

Only convert when:

  • Modifying the subgraph
  • Persisting it for later use
  • Passing it to an external library
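A minimal sketch of a justified conversion, reusing to_graph() and add_node() as shown elsewhere in this cookbook (the "synthetic_hub" node is purely illustrative):

# Pay the O(n + m) copy once so that edits stay isolated from g:
active = g.nodes[g.nodes["active"] == True]   # O(1) view
snapshot = active.to_graph()                  # one-time copy
snapshot.add_node(name="synthetic_hub")       # modifies the copy, not g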

Best Practice: Stay in Subgraph

# Good: Work with subgraph directly
subgraph = g.nodes[g.nodes["active"] == True]
components = subgraph.connected_components()
largest = components.largest()
stats = largest.table().agg({"degree": "mean"})

# Bad: Convert to graph unnecessarily
subgraph = g.nodes[g.nodes["active"] == True]
new_g = subgraph.to_graph()  # Expensive copy
components = new_g.connected_components()
# ... rest of operations

Recipe 8: Matrix Operation Optimization

Sparse vs Dense Matrices

Groggy automatically chooses the right format based on graph density.

Understanding Sparsity

# Check sparsity:
n = g.node_count()
m = g.edge_count()
density = m / (n * n)

# Sparse: density < 0.01 (1%)
# Dense: density > 0.1 (10%)

✅ Sparse Matrix Operations

# For sparse graphs (most real networks):
A = g.adjacency_matrix()  # Returns sparse CSR matrix

# These are optimized for sparse:
degrees = A.sum(axis=1)        # O(m) not O(n²)
product = A @ A                # Only non-zero entries
eigvals = A.eigenvalues(k=10)  # Sparse eigensolver

⚠️ When to Densify

# Only densify if:
# 1. Matrix is already dense (density > 0.1)
# 2. Algorithm requires dense format
# 3. Performance testing shows benefit

# Explicit densification:
A_dense = A.to_dense()  # Memory: O(n²) vs O(m)

Best Practice: Check Sparsity First

# Adaptive approach:
A = g.adjacency_matrix()

if A.density() < 0.01:
    # Sparse operations:
    result = A.sparse_operation()       # placeholder for your sparse routine
else:
    # Dense operations:
    A_dense = A.to_dense()
    result = A_dense.dense_operation()  # placeholder for your dense routine

Recipe 9: Memory Management

Understanding Memory Usage

Groggy's memory footprint:

Total Memory = Structure + Attributes + Views + History

Structure: O(n + m) - nodes and edges
Attributes: O(total non-null values)
Views: O(view metadata) - negligible
History: O(states × changes)

✅ Minimize History Overhead

# If you don't need history:
g = gr.Graph(track_history=False)  # Saves memory

# Or clear history periodically:
g.clear_history()  # Frees old states
g.commit()         # Start fresh

✅ Efficient Bulk Loading

# Good: Batch loading
nodes = [{"id": i, "name": f"Node{i}"} for i in range(10000)]
g.add_nodes(nodes)  # Single bulk operation

# Bad: One at a time
for i in range(10000):
    g.add_node(name=f"Node{i}")  # Repeated overhead

Performance: Bulk is ~10x faster

✅ Release Large Objects

# After expensive operation:
large_matrix = g.adjacency_matrix()
result = compute_something(large_matrix)

# Release when done:
del large_matrix  # Frees memory

Monitor Memory

import psutil
import os

# Check memory usage:
process = psutil.Process(os.getpid())
print(f"Memory: {process.memory_info().rss / 1024**2:.1f} MB")

# Profile operations:
mem_before = process.memory_info().rss
result = expensive_operation()
mem_after = process.memory_info().rss
print(f"Operation used: {(mem_after - mem_before) / 1024**2:.1f} MB")

Recipe 10: Algorithm Selection

Choose the Right Algorithm

Different algorithms have different complexity for the same task.

Shortest Paths

# Unweighted graph → BFS:
path = g.shortest_path(source, target)  # O(n + m)

# Weighted graph → Dijkstra:
path = g.shortest_path(source, target, weighted=True)  # O((n + m) log n)

# All pairs → Floyd-Warshall (small graphs only):
all_paths = g.all_pairs_shortest_paths()  # O(n³)

Connected Components

# Undirected → Union-Find:
components = g.connected_components()  # O(n + m)

# Directed → Tarjan's:
strong_components = g.strongly_connected_components()  # O(n + m)

Centrality Measures

# Exact centrality (small graphs):
bc = g.betweenness_centrality()  # O(nm)

# Approximate centrality (large graphs):
bc_approx = g.betweenness_centrality(approximate=True, samples=100)  # O(sm)

Rule of Thumb

Graph Size           | Algorithm Choice
---------------------|-------------------------------
n < 1,000            | Exact algorithms fine
1,000 < n < 100,000  | Efficient exact algorithms
n > 100,000          | Approximate or sampling-based
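As a rough illustration, these thresholds can be encoded in a small dispatcher. This is only a sketch: the cutoff and the approximate=True / samples parameters are taken from the examples above, and the right values depend on your workload.

def betweenness(g, exact_limit=100_000, samples=100):
    # Rule of thumb only: exact betweenness is O(nm), so switch to
    # sampling once the graph gets large.
    if g.node_count() <= exact_limit:
        return g.betweenness_centrality()
    return g.betweenness_centrality(approximate=True, samples=samples)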

Recipe 11: Parallel Processing

When Groggy Uses Parallelism

Groggy automatically parallelizes certain operations:

# These use multiple cores:
g.connected_components()  # Parallel component finding
A.eigenvalues()          # Parallel linear algebra
g.spectral_embedding()   # Parallel SVD

✅ Batch Independent Operations

# Process multiple subgraphs in parallel:
from concurrent.futures import ProcessPoolExecutor

def process_component(subgraph):
    return subgraph.table().agg({"weight": "mean"})

components = g.connected_components()

# Note: ProcessPoolExecutor pickles its arguments, so the component
# subgraphs must be picklable to be shipped to worker processes.
with ProcessPoolExecutor() as executor:
    results = list(executor.map(process_component, components))

⚠️ GIL Considerations

Python's GIL limits pure Python parallelism, but Groggy's Rust core releases it:

from concurrent.futures import ThreadPoolExecutor

# Good: Rust operations release the GIL, so threads can run them in parallel
with ThreadPoolExecutor() as executor:
    futures = [
        executor.submit(g.connected_components)
        for g in graphs  # graphs: any collection of Graph objects
    ]
    results = [f.result() for f in futures]

# Bad: Python operations hold GIL
# (use ProcessPoolExecutor instead)

Performance Checklist

Before Optimization

  • [ ] Profile first - measure, don't guess
  • [ ] Identify bottleneck - where is time spent?
  • [ ] Check if it matters - is this code on critical path?

Optimization Strategies

  1. Use bulk operations instead of loops
  2. Minimize FFI crossings - batch calls
  3. Delay materialization - work with views
  4. Filter attributes first - reduce data size
  5. Choose right structure - graph/table/array/matrix
  6. Leverage sparsity - don't materialize nulls
  7. Chain subgraph ops - avoid unnecessary copies
  8. Use sparse matrices - for low-density graphs
  9. Clear history - if not needed
  10. Pick right algorithm - complexity matters

Common Pitfalls

Anti-Pattern          | Better Approach        | Speedup
----------------------|------------------------|-------------
Python loops          | Bulk operations        | ~100x
Many small FFI calls  | Single bulk call       | ~10x
Early materialization | Delay to end of chain  | ~5x
Dense on sparse       | Keep sparse            | ~10x memory
Full graph traversal  | Filter first           | Varies

Profiling Tools

Timing Operations

import time

start = time.perf_counter()
result = expensive_operation()
elapsed = time.perf_counter() - start
print(f"Operation took: {elapsed:.3f}s")

Memory Profiling

import groggy as gr  # assuming Groggy's usual import alias
from memory_profiler import profile

@profile
def my_function():
    g = gr.Graph()
    # ... operations
    return g.table()

Function-Level Profiling

import cProfile
import pstats

# Profile code:
cProfile.run('expensive_operation()', 'profile_stats')

# Analyze:
stats = pstats.Stats('profile_stats')
stats.sort_stats('cumulative')
stats.print_stats(20)  # Top 20 functions

Performance By Graph Size

Small Graphs (n < 1,000)

  • ✅ All algorithms work fine
  • ✅ Dense operations acceptable
  • ✅ No special optimization needed
  • Focus on code clarity

Medium Graphs (1,000 < n < 100,000)

  • ✅ Use columnar operations
  • ✅ Sparse matrices for low density
  • ✅ Efficient algorithms (Dijkstra not Floyd-Warshall)
  • Watch memory usage

Large Graphs (n > 100,000)

  • ✅ Bulk operations essential
  • ✅ Sparse matrices mandatory
  • ✅ Approximate algorithms when possible
  • ✅ Consider sampling for analysis
  • Profile everything

Very Large Graphs (n > 1,000,000)

  • ✅ Streaming or batch processing
  • ✅ Sample-based statistics
  • ✅ Distributed processing if needed
  • Consider if Groggy is right tool

See Also