We’re incredibly excited to announce the release of JetStream 3, built in close collaboration with Apple, Mozilla, and other partners in the web ecosystem!
While we’ve covered the high-level details of this release in our shared announcement blog post, we wanted to take a moment here to dive a little deeper. In this post, we’ll pull back the curtain on the benchmark itself, explore the methodology behind our choices, and share the motivations driving these major updates.
Before we get into the "what," it helps to talk about the "why." Why do browser engineers care so much about benchmarks?
At its core, benchmarking serves as a critical safety net for catching performance regressions before they ever reach users. But beyond that, benchmarks act as a powerful motivation function—a sort of "gamification" for browser engineers. Having a clear target helps us prioritize our efforts and decide exactly which optimizations deserve our focus. It also drives healthy competitiveness between different browser engines, which ultimately lifts the entire web ecosystem.
Of course, the ultimate goal isn't just to make a number on a chart go up; it's to meaningfully improve user experience and real-world performance.
Just like Speedometer 3, JetStream 3 is the result of a massive collaborative effort across all major browser engines, including Apple, Mozilla, and Google.
We adopted a strict consensus model for this release. This means we only added new workloads when everyone agreed they were valuable and representative. This open governance model has led to an incredibly productive collaboration with buy-in from multiple parties, ensuring the benchmark serves the best interests of the overall Web ecosystem.
The last major release, JetStream 2, came out in 2019. In the technology space—and especially on the Web—six years is an eternity.
There's a well-known concept in economics called Goodhart's Law, which states that when a measure becomes a target, it ceases to be a good measure. Over time, engines naturally optimize for the specific patterns of a benchmark, and the metrics slowly lose their correlation with real-world performance. Speedometer recently received a massive update to account for this, and it only makes sense that JetStream is next in line.
You might be wondering: with the recent release of Speedometer 3, why do we need another benchmark?
While Speedometer is fantastic for measuring UI rendering and DOM manipulation, JetStream has a different focus: the computationally intensive parts of Web applications. We're talking about use cases like browser-based games, physics simulations, framework cores, cryptography, and complex algorithms.
There are also practical engineering considerations. JetStream is designed so that it can run in engine shells—like d8, the standalone shell for V8. For engine developers, this is a massive advantage. Building a shell is significantly quicker than compiling a full browser like Chrome, allowing engineers to iterate faster. Because d8 is single-process, it also produces far less background noise, leading to more stable testing. This shell-compatibility also makes JetStream highly valuable for hardware and device vendors running simulators. It is a trade-off—a shell is slightly further removed from a full, real-world browser environment—but the engineering velocity it unlocks is well worth it.
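To make that concrete, here is a minimal sketch of what shell-compatibility means in practice. This is not JetStream's actual driver code; `runWorkload`, the timer fallback, and the logging shim are all illustrative. The point is that a workload avoids browser-only APIs (like the DOM) so the same file runs unchanged in a page or in a bare shell like d8:

```javascript
// Illustrative sketch (not JetStream's real driver): a workload written to
// run both in a browser and in an engine shell such as d8.

// Fall back through whatever timer the host happens to provide.
const now =
  (typeof performance !== "undefined" && performance.now)
    ? () => performance.now()
    : () => Date.now();

// Stand-in for real benchmark work: pure computation, no DOM access.
function runWorkload(iterations) {
  let acc = 0;
  for (let i = 0; i < iterations; i++) {
    acc += Math.sqrt(i);
  }
  return acc;
}

const start = now();
const result = runWorkload(1e6);
const elapsed = now() - start;

// Browsers and Node expose console.log(); bare d8 builds expose print().
const log = typeof console !== "undefined" ? console.log : print;
log(`result=${result} elapsed=${elapsed.toFixed(2)}ms`);
```

Because nothing here touches `document`, `window`, or any other browser global, an engine developer can iterate on this in d8 without ever building a full browser.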

Building a benchmark requires a delicate balance between microbenchmarks and real applications.
Microbenchmarks are great engineering tools; they have a high signal-to-noise ratio and make it easy to see the effects of one specific optimization. While they make sense for early improvements of new features, they also often encourage overfitting in the long run. Engines might optimize heavily for a tiny loop that looks great on the benchmark but does absolutely nothing to help real users.
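To illustrate the distinction, here are two invented snippets (neither is an actual JetStream workload). The first is a classic microbenchmark hot loop that an engine could overfit; the second mixes parsing, allocation, filtering, and aggregation the way even a small end-to-end task does:

```javascript
// Microbenchmark: one loop, one operation. Great signal-to-noise ratio,
// but easy for an engine to overfit without helping real users.
function microSum(n) {
  let sum = 0;
  for (let i = 0; i < n; i++) sum += i;
  return sum;
}

// Closer to end-to-end: parse some CSV-like records, filter them,
// and aggregate the result. Exercises strings, arrays, closures,
// and allocation together, like real application code does.
function miniPipeline(csv) {
  return csv
    .trim()
    .split("\n")
    .map(line => line.split(","))
    .filter(([, score]) => Number(score) >= 50)
    .reduce((total, [, score]) => total + Number(score), 0);
}

microSum(1000);                    // → 499500
miniPipeline("a,40\nb,60\nc,75");  // → 135 (60 + 75)
```

An optimization that makes only `microSum` faster may never show up in user-facing workloads; one that helps `miniPipeline` probably touches code paths real applications hit.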
Because of this, a primary criterion for inclusion in JetStream 3 is that a workload should represent a real, end-to-end use case (or at least a highly abstracted form of one).
We also heavily prioritized diversity. We don’t want workloads that all exercise the exact same hot loop. We want coverage across different frameworks, varied libraries, diverse source languages, and distinct toolchains.
Finally, we had to lay down some practical ground rules:
One of the most significant shifts in JetStream 3 is a much stronger focus on WebAssembly (Wasm), along with a major update to how we benchmark it.
When JetStream 2 was created, Wasm was still in its infancy. Fast forward to today, and Wasm is significantly more widespread.
Because the language has evolved so rapidly, JetStream 2 became outdated quickly. It only tested the Wasm MVP (Minimum Viable Product). Today, the Wasm spec includes powerful features like SIMD (single instruction, multiple data), WasmGC, and Exception Handling—none of which were being properly benchmarked.
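As a rough sketch of how post-MVP feature coverage can be probed from JavaScript: `WebAssembly.validate()` returns `true` only if the host engine accepts the given module bytes. The empty module below (just the magic number and version) is valid everywhere; real feature probes, as in libraries like wasm-feature-detect, validate tiny modules that actually use a SIMD, GC, or exception-handling instruction:

```javascript
// Smallest possible Wasm module: "\0asm" magic + binary version 1.
// Any spec-compliant engine accepts it.
const emptyModule = new Uint8Array([
  0x00, 0x61, 0x73, 0x6d, // "\0asm" magic number
  0x01, 0x00, 0x00, 0x00, // binary format version 1
]);

console.log(WebAssembly.validate(emptyModule)); // true

// A feature probe would instead validate bytes containing, say, a SIMD
// opcode: engines without SIMD support reject the module, so validate()
// returns false there and true on engines that implement the proposal.
```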
The ecosystem of tools has also completely transformed. The old workloads relied almost entirely on ancient versions of Emscripten compiling C/C++, often via the deprecated asm.js backend and asm2wasm. Furthermore, some of the old microbenchmarks incentivized the wrong optimizations. For example, the old HashSet-wasm workload rewarded aggressive inlining that actually hurt performance in real-world user scenarios.
To fix this, we sought out entirely new Wasm workloads, introducing 12 in total.
We expanded our toolchain coverage from just C++ to include five new toolchains: J2CL, Dart2wasm, Kotlin/Wasm, Rust, and .NET. This means we are now actively benchmarking Wasm generated from Java, Dart, Kotlin, Rust, and C#!
These workloads represent actual end-to-end tasks, including:
These aren't tiny, kilobyte-sized modules anymore. These are multi-megabyte applications that produce diverse, complex flamegraphs, pushing engines to their limits. Reflecting its heightened importance on the modern web, Wasm now makes up 15-20% of the overall benchmark suite, up from just 7% in JetStream 2. Beyond new workloads, JetStream 3 also overhauls scoring to ensure that runtime performance—not just instantiation—is accurately reflected in the total score.
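As a hedged sketch of what that scoring change means: JetStream-style suites typically combine subscores with a geometric mean, so no single workload (or single phase of a workload) can dominate the total. The subscores below are made up for illustration; the actual JetStream 3 scoring lives in the benchmark's driver source.

```javascript
// Geometric mean: the n-th root of the product of n scores, computed in
// log space to avoid overflow. Doubling any one subscore has the same
// effect on the result regardless of that subscore's magnitude.
function geometricMean(scores) {
  const logSum = scores.reduce((acc, s) => acc + Math.log(s), 0);
  return Math.exp(logSum / scores.length);
}

// Hypothetical per-workload subscores, e.g. instantiation vs. runtime:
const workloadScore = geometricMean([120, 30]); // 60

// Hypothetical suite total over several workload scores:
const suiteScore = geometricMean([60, 60, 60]); // 60
```

Under this kind of scheme, a workload whose runtime subscore is weighted alongside instantiation can no longer score well on fast startup alone, which is the behavior the scoring overhaul is after.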
We have many new, larger JavaScript workloads that better represent how JS is used in the wild. In addition to measuring pure execution speed, we now include "startup" workloads that cover parsing and framework setup code – more closely matching what happens on initial page load.
With JetStream 3, the browser benchmarking space has taken another big step forward, giving browsers a new tool to improve performance for their valued users. Alongside Speedometer and MotionMark, these benchmarks give both browser vendors and users a clear view of each engine’s performance.
If you’d like to contribute to the benchmark with your own workloads or have suggestions for how we can make it better, feel free to join the repository on GitHub. We’re continually iterating on these benchmarks and will have more updates on each in the future as well.