Python 3.11 will gain performance at the cost of a little more memory, with speed gains expected to be between 10% and 60%

In an effort to improve the performance of the Python programming language, Microsoft has launched Faster CPython. It is a Microsoft-funded project whose members include Python inventor Guido van Rossum, senior Microsoft software engineer Eric Snow, and Mark Shannon, who is under contract with Microsoft as the project's technical lead. Python is widely regarded as slow. "While Python will never match the performance of low-level languages like C, Fortran, or even Java, we'd like it to be competitive with fast implementations of scripting languages, like V8 for JavaScript," says Mark Shannon.

To be efficient, virtual machines for dynamic languages must specialize the code they execute according to the types and values of the program being executed. This specialization is often associated with just-in-time (JIT) compilers, but it is beneficial even without machine code generation.

Note that specialization improves performance, while adaptation allows the interpreter to adjust quickly when a program's usage pattern changes, limiting the extra work caused by poor specialization.

A session scheduled as part of EuroPython, Europe's largest Python conference, to be held in Dublin in July, will focus on some of the changes that make Python faster. Shannon will describe the adaptive specializing interpreter of Python 3.11, which is PEP (Python Enhancement Proposal) 659. The underlying technique is called specialization and, as Shannon explains, is usually done in the context of a compiler, but research shows that specialization in an interpreter can increase performance significantly.

This PEP proposes using an adaptive interpreter that specializes code dynamically, but over very small regions, and that is able to recover from poor specialization quickly and at low cost.

Adding a specializing, adaptive interpreter to CPython will bring significant performance improvements. "It is difficult to give meaningful figures, because it depends a lot on the benchmarks and on work that has not yet been done. Extensive experiments suggest speedups of up to 50%. Even if the speed gain were only 25%, it would still be a nice improvement," says Shannon.

"Specifically, we want to achieve these performance goals with CPython to benefit all Python users, including those who cannot use PyPy or other alternative virtual machines," he adds. When Devclass spoke with Python Steering Council member and core developer Pablo Galindo about the new Memray memory profiler, he described how the Python team is using Microsoft's work in version 3.11.

"One of the things we're doing is making the interpreter faster," says Pablo Galindo. "But it's also going to use a little more memory, just a little bit, because most of these optimizations have some kind of memory cost, since we have to store things for later use, or because we have an optimized version but sometimes someone needs to request an unoptimized version for debugging, so we have to store both."

"Achieving these performance goals is a long way off and will require a lot of engineering effort, but we can take a significant step towards them by speeding up the interpreter. Academic research and practical implementations have shown that a fast interpreter is a key component of a fast virtual machine," Shannon said.

Virtual Machine Acceleration

Typical optimizations for virtual machines are expensive, so a long "warm-up" time is needed to gain confidence that the cost of optimization is justified. To achieve fast speedups without noticeable warm-up time, the VM must speculate that specialization is justified even after only a few executions of a function. To do this, the interpreter must be able to optimize and de-optimize continuously and cheaply. By using adaptive and speculative specialization at the granularity of individual virtual machine instructions, the Python team obtained a faster interpreter that also generates profiling information for more sophisticated optimizations in the future.

"There are many practical ways to speed up a virtual machine for a dynamic language. However, specialization is the most important, both in itself and as an enabler of further optimizations. So it makes sense to focus our efforts on specialization first, if we want to improve CPython's performance," says the Faster CPython project team. Specialization is usually done in the context of a JIT compiler, but research shows that specialization in an interpreter can significantly improve performance, and can even exceed that of a naive compiler.

Several methods have been proposed in the academic literature, but most attempt to optimize regions larger than a single bytecode instruction. Using regions larger than a single instruction requires code to handle deoptimization in the middle of a region. Specialization at the level of individual bytecodes makes deoptimization trivial, since it cannot occur in the middle of a region.

By speculatively specializing individual bytecodes, we can achieve significant performance improvements with nothing but the most local and trivial deoptimizations to implement. The closest approach to this PEP in the literature is “Inline Caching meets Quickening”. This PEP has the benefits of inline caching, but adds the ability to deoptimize quickly, making performance more robust in cases where specialization fails or is unstable.

The speedup from specialization is difficult to determine, since many specializations depend on other optimizations. The observed gains seem to be between 10% and 60%, and most of them come directly from specialization. The biggest contributors are speedups in attribute lookups, global variable loads, and calls.
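For readers who want to get a feel for these categories on their own machine, a rough micro-benchmark such as the sketch below can be run under both python3.10 and python3.11 and compared. The snippet is illustrative, not from the PEP; absolute timings depend entirely on hardware and build, so only the ratio between the two versions is meaningful.

```python
import timeit

class C:
    def __init__(self):
        self.x = 1

c = C()

def f():
    pass

# Three of the operations PEP 659 specializes: attribute lookup (LOAD_ATTR),
# global/builtin lookup (LOAD_GLOBAL), and plain function calls.
print("attribute lookup:", timeit.timeit("c.x", globals=globals()))
print("builtin lookup:  ", timeit.timeit("len", globals=globals()))
print("function call:   ", timeit.timeit("f()", globals=globals()))
```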

Implementation

Adaptive Instructions

Each instruction that would benefit from specialization is replaced by an adaptive version during quickening. For example, the LOAD_ATTR instruction is replaced by LOAD_ATTR_ADAPTIVE. Each adaptive instruction periodically attempts to specialize itself.
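This rewriting can be observed directly: in CPython 3.11 the dis module gained an adaptive parameter. Below is a small demonstration; the specialized opcode names and warm-up thresholds are internal details that vary between CPython versions, so the opcode mentioned in the comment is only what one would typically expect to see.

```python
import dis

class Point:
    def __init__(self):
        self.x = 1

def get_x(p):
    return p.x  # compiled to a generic LOAD_ATTR

p = Point()
for _ in range(10_000):  # warm the function up so the interpreter quickens it
    get_x(p)

# With adaptive=True, CPython 3.11 shows the rewritten, specialized opcodes
# (e.g. LOAD_ATTR_INSTANCE_VALUE instead of LOAD_ATTR) once warm-up occurred.
dis.dis(get_x, adaptive=True)
```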

Specialization

The CPython bytecode contains many instructions that represent high-level operations and would benefit from specialization. Examples include CALL, LOAD_ATTR, LOAD_GLOBAL, and BINARY_ADD.

Introducing a family of specialized instructions for each of these generic instructions allows effective specialization, since each new instruction is specialized for a single task. Each family will include an "adaptive" instruction, which maintains a counter and attempts to specialize itself when that counter reaches zero. Each family will also include one or more specialized instructions that perform the equivalent of the generic operation much faster, provided their inputs are as expected.

Each specialized instruction maintains a saturating counter, which is incremented when its inputs are as expected. If the inputs are not as expected, the counter is decremented and the generic operation is executed instead. If the counter reaches its minimum value, the instruction is deoptimized by simply replacing its opcode with the adaptive version.
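To make this life cycle concrete, here is a toy model of the counter machinery in pure Python. It is only a sketch of the mechanism described above, not CPython's actual C implementation; all opcode names and threshold values in it are made up.

```python
# Toy model of PEP 659's adaptive/specialized instruction life cycle.
# All opcode names and thresholds here are illustrative, not CPython's.

ADAPTIVE = "BINARY_ADD_ADAPTIVE"   # counts down, then tries to specialize
SPECIALIZED = "BINARY_ADD_INT"     # fast path, valid only for int inputs

class ToyInstruction:
    def __init__(self):
        self.opcode = ADAPTIVE
        self.counter = 8           # warm-up countdown (made-up value)

    def execute(self, left, right):
        if self.opcode == ADAPTIVE:
            self.counter -= 1
            if self.counter <= 0 and type(left) is int and type(right) is int:
                # Specialize on the types actually observed at this site.
                self.opcode = SPECIALIZED
                self.counter = 0
            return left + right    # generic operation
        # SPECIALIZED: guard the inputs before taking the fast path.
        if type(left) is int and type(right) is int:
            self.counter = min(self.counter + 1, 16)  # saturate on hits
            return left + right    # stands in for the optimized int-only path
        self.counter -= 1
        if self.counter <= -4:     # too many misses: deoptimize in place
            self.opcode = ADAPTIVE
            self.counter = 8
        return left + right        # a miss still computes the generic result

inst = ToyInstruction()
for _ in range(10):
    inst.execute(1, 2)             # hits: the instruction specializes to ints
print(inst.opcode)                 # -> BINARY_ADD_INT
for _ in range(20):
    inst.execute("a", "b")         # misses: the instruction deoptimizes
print(inst.opcode)                 # -> BINARY_ADD_ADAPTIVE
```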

Auxiliary data

Most specialized instruction families require more information than an 8-bit operand can hold. A number of 16-bit entries immediately following the instruction are used to store this data; this is a form of inline cache, an "inline data cache". Unspecialized, or adaptive, instructions use the first entry in this cache as a counter and simply skip over the others.
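These hidden cache entries can also be inspected in CPython 3.11, where dis gained a show_caches parameter; the number of CACHE entries per instruction is an internal detail that varies by opcode and version.

```python
import dis

def add(a, b):
    return a + b

# On CPython 3.11, show_caches=True displays the hidden CACHE entries that
# follow instructions such as BINARY_OP; they hold the counters and
# specialization data described above.
dis.dis(add, show_caches=True)
```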

Costs

Memory usage

An obvious concern with any system that does some sort of caching is: how much extra memory is it using?

Memory usage comparison with 3.10

CPython 3.10 used 2 bytes per instruction until the number of executions reached ~2000, at which point it allocated another byte per instruction, plus 32 bytes per instruction for instructions with a cache (LOAD_GLOBAL and LOAD_ATTR).

A table in PEP 659 shows the additional bytes per instruction required to support the 3.10 opcache or the proposed adaptive interpreter, on a 64-bit machine.

In that table, "3.10 cold" is before the code has reached the ~2000-execution threshold, and "3.10 hot" shows cache usage once the threshold has been reached. Relative memory usage depends on how much code is hot enough to trigger cache creation in 3.10; the break-even point, where the memory used by 3.10 equals that of 3.11, is at ~70%. It should also be noted that the actual bytecode is only part of a code object: code objects also include names, constants, and quite a lot of debugging information.
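That last point is easy to check for yourself. In the snippet below (exact sizes vary by Python version and function), the raw bytecode is only a fraction of the code object's footprint, and sys.getsizeof does not even count the referenced names and constants.

```python
import sys

def add(a, b):
    return a + b

code = add.__code__
print(len(code.co_code), "bytes of raw bytecode")
print(sys.getsizeof(code), "bytes for the code object itself")
print(code.co_names, code.co_consts)  # names and constants stored alongside
```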

In summary, for most applications, where many functions are relatively unused, version 3.11 will consume more memory than version 3.10, but not by much.

Source: Python

And you?

What is your opinion on the subject?

What do you think of the Faster CPython project?

Version 3.11 will consume more memory than version 3.10: what do you think about that?

In your opinion, is it worthwhile to gain performance at the cost of a little more memory?

See also:

Python 3.11 will improve the location of errors in tracebacks and bring new features

Version 3.2 of the Django framework is available, with automatic discovery of AppConfig, it brings new decorators for the administration module

Django 2.0 is available in stable version, what is new in this version of the Web framework written in Python?

JetBrains supports Django: get a 30% discount for the purchase of an individual PyCharm Professional license and all proceeds will be donated to the Django Foundation
