A week after the launch of the Dragonflight expansion, Blizzard reached out to its players. In the message, the WoW developers discuss the technical side of the launch and the challenges the development team has to overcome for a release of this magnitude.
You can read the message in full below.
Now that the Dragonflight release is behind us, we would like to look back with you at the past few days from a technical point of view. The goal is to better explain what a worldwide release of this kind involves: what can go well or badly, the problems that can arise, and the solutions at our disposal to solve them.
Internally, we call a day like last Monday a “content launch”, because releasing an expansion is a long-term job that is not limited to a single date. World of Warcraft is far from a rigid, static game: it has come a long way since its debut 18 years ago, and even over the past couple of years. As a result, we need to change how we deploy content as the game grows.
An expansion is now deployed in several stages: first the code is integrated into the existing content, then the pre-launch events add the new game systems, and finally, on the day of the “content launch”, the quests, zones and dungeons become accessible. Each step is targeted and lets us isolate problems more effectively. However, with systems this complex, we cannot always prepare for everything.
One of the novelties specific to this expansion was a content launch triggered by a timed event, that is, several changes to the game scheduled to be deployed simultaneously. Made manually, these changes would inevitably carry the possibility of human error, or the risk that an internal or external outage might disrupt the process. A timed, automated launch significantly reduces that risk.
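To make the idea of a timed launch concrete, here is a minimal Python sketch. It is not Blizzard's implementation: the `TimedChange` type, the change names and the launch instant are all hypothetical, chosen only to show several changes becoming active at the same moment without manual intervention.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class TimedChange:
    """One game change that becomes active at a fixed instant (hypothetical type)."""
    name: str
    unlock_at: datetime  # timezone-aware UTC in this sketch


def active_changes(changes, now):
    """Return the names of all changes whose unlock time has been reached."""
    return [change.name for change in changes if now >= change.unlock_at]


# Placeholder launch instant and change names, for illustration only.
LAUNCH = datetime(2030, 1, 1, 0, 0, tzinfo=timezone.utc)
CHANGES = [
    TimedChange("open_new_zones", LAUNCH),
    TimedChange("enable_new_quests", LAUNCH),
    TimedChange("open_new_dungeons", LAUNCH),
]

# One second before launch nothing is active; at launch everything is.
before_launch = active_changes(CHANGES, LAUNCH - timedelta(seconds=1))
at_launch = active_changes(CHANGES, LAUNCH)
```

Because every change shares a single unlock instant, nobody has to flip each switch by hand at launch time.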
Another change made with Dragonflight was the improved encryption of game data. This change lets us send crucial information to the game client so that cutscenes, dialogue and quest scripts work correctly and at the right time, without ever making the data vulnerable or accessible ahead of time. We know our community’s passion for WoW; when you are eager to discover what is new, it can be tempting to jump on the smallest bit of information and spoil the surprise. This new encryption system aims to avoid that kind of situation, and to ensure that the content becomes available at the right moment, and not before.
We have identified that the high latency and server instability of the past week were caused by the interaction between these two new systems. The simulation server (the one that manages your movements and actions in game) found itself recalculating what should or should not be accessible several hundred times per second, for every simulation. As a result, the system quickly became saturated with these calculations, the simulations began to bog down, and requests from other services were queued. For players, this showed up as latency and “World server down” error messages.
Fortunately, we were able to identify the problem. The data that stayed encrypted while waiting for a specific event to unlock it exposed a logic error in the code: a misplaced line instructed the server to recalculate which data to encrypt even when nothing had changed.
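The general shape of this kind of bug, and of its fix, can be sketched in Python. This is an illustration under assumptions, not the actual server code: the `LockedRecordCache` class, its record ids and its timestamps are hypothetical. The point is that the set of still-locked records only needs to be recomputed when an unlock time is actually reached, not on every tick.

```python
class LockedRecordCache:
    """Tracks which records are still locked, recomputing only on change.

    Assumes `now` never goes backwards between calls.
    """

    def __init__(self, unlock_times):
        # unlock_times: dict mapping record id -> unlock timestamp (seconds)
        self._unlock_times = dict(unlock_times)
        self._locked = None        # cached result of the last recompute
        self._next_unlock = None   # earliest unlock time still in the future
        self.recompute_count = 0   # exposed so the saving can be observed

    def _recompute(self, now):
        self.recompute_count += 1
        future = {r: t for r, t in self._unlock_times.items() if t > now}
        self._locked = frozenset(future)
        self._next_unlock = min(future.values()) if future else None

    def locked_records(self, now):
        """Return the locked set, recomputing only when something changed."""
        first_call = self._locked is None
        unlock_reached = self._next_unlock is not None and now >= self._next_unlock
        if first_call or unlock_reached:
            self._recompute(now)
        return self._locked


cache = LockedRecordCache({"cutscene_a": 10, "quest_b": 20})
for _ in range(100):           # 100 ticks during which nothing changes
    cache.locked_records(5)
count_after_idle_ticks = cache.recompute_count
```

The buggy behavior described above corresponds to calling `_recompute` on every tick; with the guard in place, a hundred idle ticks cost a single recomputation.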
Here’s how our investigation went. It all starts at midnight, Paris time. Thanks to our tests, we already knew that the Horde ship would arrive a few moments before the Alliance ship. Most of us are also in-game with our characters on the docks at the two departure points, with other windows open to view data, graphs, and various charts. We’re also in direct contact with our fellow support teams across Blizzard.
Before launch, we used the testing phases to prepare for several worst-case scenarios and to work out how to avoid or quickly fix them. For example, we had prepared portals to let players reach the Dragon Isles in case the boats were not working.
At 00:2, the Horde ship arrives as scheduled. Hooray! The players, including us, crowd on board. Some of us remain at the dock, in case the portals need to be activated. The boat leaves, but several passengers do not arrive at their destination, get disconnected or find themselves stranded.
We immediately review the data, graphs and tables. There are not many people on the Dragon Isles. Several colleagues are reporting issues with their character names and realms. Others report problems with processor load and with the NFS (Network File Storage, the file management protocol) used by our servers. Our teams continue to monitor the game and report issues.
After the Horde ships, the Alliance ships pull in to the dock. Most never reach their destination, and the Horde ships don’t return.
We’re starting to get the big picture: ships are stuck and the Dragon Isles servers are slower to respond than expected. Now is the time to roll up our sleeves and come up with solutions. This is not the first time that boats have been a problem. We activate the portals and continue to investigate. Our NFS is very clearly overloaded. Too many requests are pending, and our already overloaded simulation server coordination system begins to launch calculations for all the unsuccessful requests. The infrastructure is completely overwhelmed. Worse still, activating the portals only exacerbates the problem, as they can be clicked repeatedly, which generates even more requests. We disable the portals.
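The portal problem is a classic case of repeated client actions multiplying load on a struggling backend. One common mitigation, shown here purely as an illustration (the team's actual response was to disable the portals), is a per-player cooldown that drops repeated clicks before they become requests. The `ClickThrottle` class and its parameters are hypothetical.

```python
class ClickThrottle:
    """Accepts at most one action per player per cooldown window."""

    def __init__(self, cooldown_seconds):
        self._cooldown = cooldown_seconds
        self._last_accepted = {}  # player id -> time of last accepted click

    def allow(self, player_id, now):
        """Return True if this click should be forwarded to the backend."""
        last = self._last_accepted.get(player_id)
        if last is not None and now - last < self._cooldown:
            return False  # still cooling down: drop the click
        self._last_accepted[player_id] = now
        return True


throttle = ClickThrottle(cooldown_seconds=5.0)
first_click = throttle.allow("player_1", now=0.0)
spam_click = throttle.allow("player_1", now=1.0)
other_player = throttle.allow("player_2", now=1.0)
after_cooldown = throttle.allow("player_1", now=5.0)
```

With a guard like this, ten rapid clicks from one player produce a single request instead of ten, while other players are unaffected.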
We are doing everything we can so that as many players as possible can play, but the service does not behave at all as it did during the test phases. We proceed by elimination, thanks to the data collected during these test phases.
It’s getting late, but part of the team continues to work on solving the problems, while the others go to rest so they can return to take over first thing in the morning.
On Tuesday morning, we have a much better overview of the problem. We now know that game clients receive too much information for quest management. We will only understand later that this is not the root cause of the problem. Our API protocol puts too much load on the NFS. The code that handles new interactions with NPCs is abnormally slow. The service is taking too long to push all the data changes made in patches to game clients. Players who were able to reach the Dragon Isles are now suffering from huge latency.
On Wednesday morning, we catch a break: while digging through the code in question, we discover unusual interactions with the encryption system. We start to wonder whether the encryption system is at the root of all these problems. And indeed, it is. The slowness of the encryption system explains all the other problems with data transfer, NFS, CPU overhead and latency. Once the source of the problem is identified, the author of the system in question is able to quickly correct the errors.
However, deploying a fix to code used by so many different services is not as simple as pushing a button. All characters now have to be transferred to new simulations that run with the fixes applied. We also rush this a little too much, which generates additional load on another service. A server restart will be necessary, but we postpone it to off-peak hours so as not to add to players’ frustration. With the fix in place, performance and stability are greatly improved.
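Moving every character at once is what risks overloading a neighboring service. A paced, batched migration is one generic way to limit that risk. The sketch below is an assumption-laden illustration, not the procedure the team used: `migrate_in_batches`, the `move` callback and the batch parameters are all hypothetical.

```python
import time


def migrate_in_batches(character_ids, move, batch_size, pause_seconds=0.0,
                       sleep=time.sleep):
    """Move characters in fixed-size batches, pausing between batches.

    `move` is a caller-supplied function that transfers one character.
    `sleep` is injectable so the pacing can be tested without waiting.
    Returns the number of characters moved.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    moved = 0
    for start in range(0, len(character_ids), batch_size):
        for character_id in character_ids[start:start + batch_size]:
            move(character_id)
            moved += 1
        # Pause only if another batch follows, to let other services catch up.
        if start + batch_size < len(character_ids):
            sleep(pause_seconds)
    return moved


moved_log = []
pause_log = []
total = migrate_in_batches([1, 2, 3, 4, 5], moved_log.append,
                           batch_size=2, pause_seconds=1.0,
                           sleep=pause_log.append)
```

Here five characters are moved in batches of two, with a pause recorded between batches rather than after the last one.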
It was not easy to identify the problem and fix it, but our team was very responsive and diligent, and was able to deploy a fix as quickly as possible. In software engineering, the goal is not to never make mistakes, but to minimize the chances of them happening, to identify them quickly when they do, to have the right tools available…
…and an incredible team ready to overcome all obstacles.
– World of Warcraft Tech Team