Matrix and XMPP: Thoughts on Improving Messaging Protocols – Part 1
For over two decades, ProcessOne has been developing large-scale messaging platforms, powering some of the largest services in the world. Our mission is to build the best messaging back-ends imaginable: an exciting yet complex challenge.
We began with XMPP (eXtensible Messaging and Presence Protocol), but the need for interoperability and support for a variety of use cases led us to implement additional protocols. Our stack now supports:
- XMPP (eXtensible Messaging and Presence Protocol): A robust, highly scalable, and flexible protocol for real-time messaging.
- MQTT (Message Queuing Telemetry Transport): The standard for IoT messaging, ideal for lightweight communication between devices.
- SIP (Session Initiation Protocol): A widely used standard for voice-over-IP (VoIP) communications.
- Matrix: A decentralized protocol for secure, real-time communication.
A Distributed Protocol That Replicates Data Across Federated Servers
This brings me to the topic of Matrix. Matrix is designed not just to be federated but also distributed. While it uses the term “decentralized,” I find this slightly misleading: a federated protocol is inherently decentralized, as it allows users across different domains to communicate; think of email, XMPP, and Matrix itself. What truly sets Matrix apart from XMPP is its data distribution model.
Matrix is distributed because it aims to ensure that all participating nodes in a conversation have a copy of that conversation, typically in end-to-end encrypted form. This ensures high availability: if the primary node hosting a conversation becomes unavailable, the conversation can continue on another node.
In Matrix, a conversation is represented as a graph: a replicated document containing all events related to a group discussion. You can think of each conversation as a mini-blockchain, except that instead of forming a single chain, the events form a directed acyclic graph.
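To make the idea concrete, here is a minimal sketch in Python of how such an event graph can fork and re-join. The structure is simplified for illustration and is not the actual Matrix event schema.

```python
# Illustrative sketch only: a simplified event structure, not the actual
# Matrix event schema or federation API.
from dataclasses import dataclass, field

@dataclass
class Event:
    event_id: str
    sender: str
    body: str
    prev_events: list = field(default_factory=list)  # parent events in the graph

room_events = {
    "$e1": Event("$e1", "@alice:a.example", "Hello"),
    "$e2": Event("$e2", "@bob:b.example", "Hi!", ["$e1"]),
    # While two servers are partitioned, each extends the graph from $e2:
    "$e3": Event("$e3", "@alice:a.example", "Still there?", ["$e2"]),
    "$e4": Event("$e4", "@carol:c.example", "Joining late", ["$e2"]),
    # A later event re-joins the branches by pointing to both as parents:
    "$e5": Event("$e5", "@bob:b.example", "All caught up", ["$e3", "$e4"]),
}
```

Every server joined to the room stores the whole graph, which is what allows the conversation to survive the loss of any single server.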
Resource Penalty: Computing and Storage
As with any design decision, there are trade-offs. In the case of Matrix, this comes with a performance penalty. Since conversations are replicated across nodes, the protocol performs merge operations to keep the replicated copies consistent. The higher the traffic, the greater the cost of these merge operations, which adds CPU load on both the Matrix node and its database. Additionally, there is a significant cost in terms of storage.
If the Matrix network were to scale massively, with many nodes and conversations, it would encounter the same growth challenges as blockchain protocols. Every node participating in a conversation must store a copy of it, so the total amount of replicated data grows with both the number of conversations and the number of nodes joined to each of them.
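As a rough, back-of-the-envelope illustration (all numbers below are invented for the example), the storage needed across the federation is roughly the single-copy total multiplied by the average number of servers joined to each room:

```python
# Back-of-the-envelope illustration with made-up numbers: how replicated
# storage grows with the number of servers participating in each room.
rooms = 10_000
avg_servers_per_room = 8      # each participating server keeps a full copy
avg_room_history_mb = 50      # event graph + state per room (assumed)

single_copy_gb = rooms * avg_room_history_mb / 1024
federated_gb = single_copy_gb * avg_servers_per_room

print(f"one copy per room:  {single_copy_gb:,.0f} GB")
print(f"federated replicas: {federated_gb:,.0f} GB")
# one copy per room:  488 GB
# federated replicas: 3,906 GB
```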
Comparison with XMPP
XMPP, on the other hand, is event-based rather than document-based. It processes and distributes events in the order they arrive without attempting to merge conversation histories. This simpler approach avoids the replication of group chat data across federated nodes, but it comes with some limitations.
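To contrast with the replicated event graph above, here is a minimal sketch of this event-based model, in the same illustrative spirit (it is not real XMPP server code): the room keeps a single local archive and relays each message in arrival order.

```python
# Minimal sketch, not real XMPP server code: the room relays each message to
# its occupants in arrival order and keeps a single local archive, with no
# merge step and no replicated history.
class ChatRoom:
    def __init__(self, jid):
        self.jid = jid
        self.occupants = set()
        self.archive = []  # (sender, body) tuples, in arrival order

    def handle_message(self, sender, body):
        self.archive.append((sender, body))      # one authoritative copy
        for occupant in self.occupants:
            self.deliver(occupant, sender, body)

    def deliver(self, to, sender, body):
        print(f"to={to} from={sender}: {body}")

room = ChatRoom("team@conference.a.example")
room.occupants.update({"alice@a.example", "bob@b.example"})
room.handle_message("alice@a.example", "Hello everyone")
```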
Here’s how XMPP mitigates the limitations of this simpler model:
- One-on-One Conversations: Each user’s messages are archived on their server, keeping the replication factor under control (usually limited to two copies).
- Group Chats: If the server hosting a chatroom goes down, the conversation becomes unavailable for both local and remote users. However, XMPP has strategies to reduce the need for data replication. Several servers implement clustering, making it possible to upgrade the service node by node. If one node is taken down (for maintenance, for instance), another node can take over the management of the chatroom.
- Hot Code Upgrades: Some servers, like ejabberd, allow hot code upgrades, which means minor updates can be applied without shutting down the node, minimizing downtime and the need for data replication.
- Message Archiving for MUC Rooms: Some servers offer message archiving for multi-user chat (MUC) rooms, and some also allow users to store recent chat history from selected MUC rooms on their local server for future reference.
- Cluster Replication (ejabberd Business Edition): Chatrooms can be replicated within a cluster, ensuring they remain available even if a node crashes.
Thanks to the typically high uptime of XMPP servers, especially for clustered services, intermittent availability of servers in the network hasn’t posed a significant issue, so large-scale data replication hasn’t been a necessity.
What’s Next for XMPP?
This comparison suggests potential improvements for XMPP. Could XMPP benefit from an optional feature to address the centralized nature of chatrooms? Possibly. What if there were a caching and resynchronization protocol for multi-user chatrooms across different servers? This could enhance the robustness of the federation without the storage burden of full content replication, offering the best of both worlds.
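To illustrate what such a mechanism could look like, here is a purely hypothetical sketch; it is not an existing XEP or server feature, and the RemoteRoomCache class and its fields are invented for the example. A remote server could keep a rolling cache of a federated room's recent history and resynchronize only the missing messages once the hosting server is reachable again.

```python
# Purely hypothetical sketch of the caching and resynchronization idea; not
# an existing XEP or ejabberd feature.
from collections import deque

class RemoteRoomCache:
    def __init__(self, room_jid, max_messages=500):
        self.room_jid = room_jid
        self.cache = deque(maxlen=max_messages)  # rolling window of recent messages
        self.last_seen_id = None                 # archive id of the newest cached message

    def on_groupchat_message(self, archive_id, sender, body):
        # Called for every message the local server relays to its own users.
        self.cache.append((archive_id, sender, body))
        self.last_seen_id = archive_id

    def resync_query(self):
        # When the hosting server is reachable again, request only messages
        # newer than the last one cached (a MAM-style "after" query).
        return {"room": self.room_jid, "after": self.last_seen_id}
```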
What’s Next for Matrix?
Matrix, by design, comes with trade-offs. One of its key goals is to resist censorship, which is vital in certain situations and countries. That’s why we believe it is worth trying to improve the Matrix protocol to address those use cases. There’s still room to optimize the protocol, specifically by reducing the cost of running a distributed data store. We plan to propose and implement improvements on how merge operations work to make Matrix more efficient.
I’ll share our proposals in the next article: Thoughts on Improving Messaging Protocols — Part 2, Matrix.
—
As always, this is an open discussion, and I’d be happy to dive deeper into these topics if you’re interested.
Feedback received:
- 2024-10-25: I had envisioned the concept of automatic caching for MUC messages. However, there is an existing XEP that defines the behavior for federated MUC in XMPP: XEP-0289: Federated MUC for Constrained Environments. I am not aware of any implementations of this yet, but it’s a topic we are interested in exploring.