ejabberd & Nintendo Switch NPNS
Last week, Taiyo Watanabe from Nintendo shared some insights about Nintendo Switch Push Notification infrastructure (NPNS) implementation. Here are some details from his presentation at ElixirFest conference.
The Nintendo Switch NPNS is a large-scale realtime messaging service based on ejabberd. The game consoles are constantly connected to the infrastructure, listening for push notifications. So far, Nintendo shipped more than 34 million units, thus the infrastructure needs to handle millions of concurrent connections.
The notifications include user-based (directed at active players) and system-based (automatic updates, parental controls). They also include the PubSub topic-based ones like the update notifications and important news, and are available when connected and after a successful login.
In terms of technology, XMPP was chosen due to the large feature set of that protocol. ejabberd was chosen because of its scalability and robustness. It is built in Erlang and inherits many of Erlang strengths: scalability, robustness, native clustering support, and more.
To ease maintenance, Nintendo ejabberd cluster is split into two areas: the inner ejabberd (cpu-bound) and outer ejabberd (memory-bound). All the nodes are connected by the distributed Erlang communication.
The job of the inner ejabberd is to retrieve messages from Amazon Simple Queue Service (SQS), while the outer ejabberd handles the distribution to the Switch game consoles. Business requirements include 1 million concurrent connections, and the recovery for bursting logins in max 30 minutes.
The first performance results of the load test with a simple stock ejabberd service was not sufficient for the production requirements, with capacity at 300k connections and 150 connections per second. Therefore, performance tuning and system redesign was conducted.
The first improvement to reduce memory footprint was to remove the XML parser except from the session memory when it was not active. The second change was to tune OpenSSL parameters to reduce the buffer size. Consequently, the maximum capacity increased to around 750k concurrent connections.
Another issue was the congestion on sockets during burst logins and slow item distribution for some topics, due to the process bottleneck. To solve this, a dedicated distribution worker process was created for each socket. As a result, login throughput increased from 150 per second to 300.
However, the mass delivery of contents created another congestion between the databases and Erlang cluster. To mitigate this, the action was to move content distribution processes from the inner to the outer cluster. Additionally, selective relaying from the inner to outer cluster was set up using a process dictionary to speed up access.
Another interesting problem appeared from the fact that XMPP accepts multiple logins of the same ID (for example, multiple devices). This resulted in too many session processes being created and too many controls being performed to apply standard XMPP business rules. As those rules were not needed in Nintendo use case, the quick solution was to just do nothing when a session is disconnected, and to make the session process die in 1 minute instead of 30.
The above issue was found using Erlang tracing. However, log processing time became a pain and was increasing exponentially, because of reductions (erlang:bumpreductions/1
). To be able to handle that, a “dam” process was created to perform rate limiting. Thanks to it, 500k log lines could be processed in 30 seconds instead of 2 hours!
After all the tuning, the infrastructure met business requirements and was ready for production deployment. But that wasn’t the end of tweaking.
Production service was up, but monitoring signalled Redis load was five times bigger than what was expected. A message loop was discovered under certain disconnection conditions. One possible cause might be unprocessed messages triggering an exchange loop. However, testing any hypothesis on the whole production system was too risky. Therefore, a manual hot code deployment was applied directly on only one of the nodes, handling just a small part of the traffic.
Luckily, the patch on that node was behaving properly, and it was extended to all other nodes. Another higher load was detected in relation to small messages, caused by “busy waiting” BEAM scheduler. The solution: switch +sbwt
from medium
to very_short
resulting in CPU use dropping from 30% to 21%.
The key lesson from this project is to always perform load-testing and profiling. Searching for bottlenecks should be continuous, keeping a lookout for C extensions (NIFs). A good approach is to keep increasing parallelism, reduce per-process load and set rate-limiting. Scaling a service is hard and directly depends on your use case. As you scale, you may discover new bottlenecks and parameters to tune and adjust. Scaling a service is an ongoing effort.
Today, the Nintendo Switch NPNS handles 10 000 000 (10 million!) simultaneous connections, 2 000 000 000 (2 billion!!) messages per day with 100-200k connections per node and 600 messages per second!
The greatest advantage of ejabberd is its incredible scalability and throughput. Erlang VM was also very useful, allowing troubleshooting a live system using Erlang remote shell. In the end, this project was implemented in just 6 months, with 1 month of prototyping. It’s been up and running ever since without major issues.
Here is the link to the slides of the presentation (in Japanese).