astral-sh/ruff

Consider a more data-oriented AST representation

Open

#15,657 opened on Jan 21, 2025

View on GitHub
 (2 comments) (2 reactions) (0 assignees)Rust (2,088 forks)batch import
corehelp wantedneeds-design

Repository metrics

Stars
 (47,527 stars)
PR merge metrics
 (Avg merge 5d 4h) (463 merged PRs in 30d)

Description

We spend a noticeable amount of time (in both Ruff and Red Knot) parsing, and a noticeable subset of that is in malloc. Our current AST representation is just a straight tree of nodes, which can put a lot of pressure on the memory allocator and isn't very cache-friendly when we go to consume the AST later.

flamegraph

In https://github.com/astral-sh/ruff/pull/12419, @MichaReiser explored using bumpalo. This would reduce malloc overhead, since all AST nodes would be allocated from the bump allocator. But the data structure itself was still directly storing the tree structure. In https://github.com/astral-sh/ruff/pull/12419#issuecomment-2262727929, Micha referenced this blog post about Zig's data-oriented "struct of arrays" representation. We should consider exploring a similar approach to see if we can achieve similar speedups and memory savings.

(This would be much easier with https://github.com/astral-sh/ruff/issues/15655, since we wouldn't have to copy/paste the data representation changes in each of the syntax nodes. But this will likely require extensive updates everywhere we consume ASTs, since all of that code currently assumes it can use direct field access. We might consider a separate first step that updates consumers to use accessor methods, which would just read the fields directly without changing the representation.)

Contributor guide