Pyodide - Mozilla 实验项目在浏览器中实现满足数据科学需求的 Python 技术栈
Pyodide is an experimental project from Mozilla to create a full Python data science stack that runs entirely in the browser.
The impetus for Pyodide came from working on another Mozilla project, Iodide, which we presented in an earlier post. Iodide is a tool for data science experimentation and communication based on state-of-the-art web technologies. Notably, it’s designed to perform data science computation within the browser rather than on a remote kernel.
It’s also been argued more generally that Python not running in the browser represents an existential threat to the language—with so much user interaction happening on the web or on mobile devices, it needs to work there or be left behind. Therefore, while Pyodide tries to meet the needs of Iodide first, it is engineered to be useful on its own as well.
For another quick example, here’s a simple doodling script that lets you draw in the browser window:
from js import document, iodide canvas = iodide.output.element('canvas') canvas.setAttribute('width', 450) canvas.setAttribute('height', 300) context = canvas.getContext("2d") context.strokeStyle = "#df4b26" context.lineJoin = "round" context.lineWidth = 5 pen = False lastPoint = (0, 0) def onmousemove(e): global lastPoint if pen: newPoint = (e.offsetX, e.offsetY) context.beginPath() context.moveTo(lastPoint, lastPoint) context.lineTo(newPoint, newPoint) context.closePath() context.stroke() lastPoint = newPoint def onmousedown(e): global pen, lastPoint pen = True lastPoint = (e.offsetX, e.offsetY) def onmouseup(e): global pen pen = False canvas.addEventListener('mousemove', onmousemove) canvas.addEventListener('mousedown', onmousedown) canvas.addEventListener('mouseup', onmouseup)
And this is what it looks like:
The best way to learn more about what Pyodide can do is to just go and try it! There is a demo notebook (50MB download) that walks through the high-level features. The rest of this post will be more of a technical deep-dive into how it works.
There were already a number of impressive projects bringing Python to the browser when we started Pyodide. Unfortunately, none addressed our specific goal of supporting a full-featured mainstream data science stack, including NumPy, Pandas, Scipy, and Matplotlib.
PyPyJs is a build of the alternative just-in-time compiling Python implementation, PyPy, to the browser, using emscripten. It has the potential to run Python code really quickly, for the same reasons that PyPy does. Unfortunately, it has the same issues with performance with C extensions that PyPy does.
All of these approaches would have required us to rewrite the scientific computing tools to achieve adequate performance. As someone who used to work a lot on Matplotlib, I know how many untold person-hours that would take: other projects have tried and stalled, and it’s certainly a lot more work than our scrappy upstart team could handle. We therefore needed to build a tool that was based as closely as possible on the standard implementations of Python and the scientific stack that most data scientists already use.
After a discussion with some of Mozilla’s WebAssembly wizards, we saw that the key to building this was emscripten and WebAssembly: technologies to port existing code written in C to the browser. That led to the discovery of an existing but dormant build of Python for emscripten, cpython-emscripten, which was ultimately used as the basis for Pyodide.
emscripten and WebAssembly
There are many ways of describing what emscripten is, but most importantly for our purposes, it provides two things:
- A compiler from C/C++ to WebAssembly
- A compatibility layer that makes the browser feel like a native computing environment
Pyodide is put together by:
- Downloading the source code of the mainstream Python interpreter (CPython), and the scientific computing packages (NumPy, etc.)
- Applying a very small set of changes to make them work in the new environment
- Compiling them to WebAssembly using emscripten’s compiler
By emulating the file system and other features of a standard computing environment, emscripten makes moving existing projects to the web browser possible with surprisingly few changes. (Some day, we may move to using WASI as the system emulation layer, but for now emscripten is the more mature and complete option).
Putting it all together, to load Pyodide in your browser, you need to download:
- The compiled Python interpreter as WebAssembly.
- A packaged file system containing all the files the Python interpreter will need, most notably the Python standard library.
These files can be quite large: Python itself is 21MB, NumPy is 7MB, and so on. Fortunately, these packages only have to be downloaded once, after which they are stored in the browser’s cache.
Using all of these pieces in tandem, the Python interpreter can access the files in its standard library, start up, and then start running the user’s code.
What works and doesn’t work
We run CPython’s unit tests as part of Pyodide’s continuous testing to get a handle on what features of Python do and don’t work. Some things, like threading, don’t work now, but with the newly-available WebAssembly threads, we should be able to add support in the near future.
Other features, like low-level networking sockets, are unlikely to ever work because of the browser’s security sandbox. Sorry to break it to you, your hopes of running a Python minecraft server inside your web browser are probably still a long way off. Nevertheless, you can still fetch things over the network using the browser’s APIs (more details below).
How fast is it?
Notably, code that runs a lot of inner loops in Python tends to be slower by a larger factor than code that relies on NumPy to perform its inner loops. Below are the results of running various Pure Python and Numpy benchmarks in Firefox and Chrome compared to natively on the same hardware.
object instances as two distinct types.
dicts (dictionaries) are just mappings of keys to values. On the other hand,
objects generally have methods that “do something”
Object. (Yes, I’ve oversimplified here to make a point.)
Object, it’s impossible to efficiently guess whether it should be converted to a Python
object. Therefore, we have to use
a proxy and let “duck typing” resolve the situation.
Duck typing is the principle that rather than asking a variable “are you a duck?” you ask it “do you walk like a duck?” and “do you quack like a duck?” and infer from that that it’s probably a duck, or at least does duck-like
Object: it wraps it in a proxy and lets the Python code using it decide how to handle it. Of course, this doesn’t always work, the duck may actually be a rabbit.
Thus, Pyodide also provides ways to explicitly handle these conversions.
Accessing Web APIs and the DOM
Proxies also turn out to be the key to accessing the Web APIs, or the set of functions the browser provides that make it do things. For example, a large part of the Web API is on the
document object. You can get that from Python
from js import document
This imports the
All of this happens through proxies that look up what the
document object can do on-the-fly. Pyodide doesn’t need to include a comprehensive list of all of the Web APIs the browser has.
There are important data types that are specific to data science, and Pyodide has special support for these as well. Multidimensional arrays are collections of (usually numeric) values, all of the same type. They tend to be quite large, and
knowing that every element is the same type has real performance advantages over Python’s
Arrays that can hold elements of any type.
Since in practice these arrays can get quite large, we don’t want to copy them between language runtimes. Not only would that take a long time, but having two copies in memory simultaneously would tax the limited memory the browser has available.
Real-time interactive visualization
One of the advantages of doing the data science computation in the browser rather than in a remote kernel, as Jupyter does, is that interactive visualizations don’t have to communicate over a network to reprocess and redisplay their data. This greatly reduces the latency — the round trip time it takes from the time the user moves their mouse to the time an updated plot is displayed to the screen.
The Python scientific stack is not a monolith—it’s actually a collection of loosely-affiliated packages that work together to create a productive environment. Among the most popular are NumPy (for numerical arrays and basic computation), Scipy (for more sophisticated general-purpose computation, such as linear algebra), Matplotlib (for visualization) and Pandas (for tabular data or “data frames”). You can see the full and constantly updated list of the packages that Pyodide builds for the browser here.
Some of these packages were quite straightforward to bring into Pyodide. Generally, anything written in pure Python without any extensions in compiled languages is pretty easy. In the moderately difficult category are projects like Matplotlib, which required special code to display plots in an HTML canvas. On the extremely difficult end of the spectrum, Scipy has been and remains a considerable challenge.
Roman Yurchak worked on making the large amount of legacy Fortran in Scipy compile to WebAssembly. Kirill Smelkov improved emscripten so shared objects can be reused by other shared objects, bringing Scipy to a more manageable size. (The work of these outside contributors was supported by Nexedi). If you’re struggling porting a package to Pyodide, please reach out to us on Github: there’s a good chance we may have run into your problem before.
Since we can’t predict which of these packages the user will ultimately need to do their work, they are downloaded to the browser individually, on demand. For example, when you import NumPy:
import numpy as np
Pyodide fetches the NumPy library (and all of its dependencies) and loads them into the browser at that time. Again, these files only need to be downloaded once, and are stored in the browser’s cache from then on.
Adding new packages to Pyodide is currently a semi-manual process that involves adding files to the Pyodide build. We’d prefer, long term, to take a distributed approach to this so anyone could contribute packages to the ecosystem without going through a single project. The best-in-class example of this is conda-forge. It would be great to extend their tools to support WebAssembly as a platform target, rather than redoing a large amount of effort.
Additionally, Pyodide will soon have support to load packages directly from PyPI (the main community package repository for Python), if that package is pure Python and distributes its package in the wheel format. This gives Pyodide access to around 59,000 packages, as of today.
- Level 1: Just string output, so it’s useful as a basic console REPL (read-eval-print-loop).
We definitely want to encourage this brave new world, and are excited about the possibilities of having even more languages interoperating together. Let us know what you’re working on!
If you haven’t already tried Pyodide in action, go try it now! (50MB download)
It’s been really gratifying to see all of the cool things that have been created with Pyodide in the short time since its public launch. However, there’s still lots to do to turn this experimental proof-of-concept into a professional tool for everyday data science work. If you’re interested in helping us build that future, come find us on gitter, github and our mailing list.
About Michael Droettboom
Michael Droettboom is a Data Engineer at Mozilla, using data to improve the web while respecting the privacy of its users. He has built software tools to support many other disciplines, including the computational humanities, astronomy and medicine. He is a former lead developer of matplotlib and the original author of airspeed velocity.