Message Archiving Benchmark: How Many Letters Are in Messages?

ProcessOne

Let’s look at the distribution of the number of letters in a message body. Note that this is not the byte length but the number of Unicode characters. Cyrillic characters take 2 bytes each in UTF-8, so some messages can be up to twice as long in bytes. Also, as far as I know, English sentences are generally shorter than Russian ones, so the average message length should be lower on servers with English-speaking users.
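To make the difference between character count and byte length concrete, here is a minimal Python sketch. The helper name `char_and_byte_length` and the sample strings are made up for illustration and are not part of the benchmark code. It also assumes that "Unicode symbols" means code points, which is what Python's `len` counts for a `str`.

```python
def char_and_byte_length(body: str) -> tuple[int, int]:
    """Return (number of Unicode code points, number of UTF-8 bytes).

    Hypothetical helper for illustration only.
    """
    # len() on a str counts code points; encoding first gives the byte size.
    return len(body), len(body.encode("utf-8"))


if __name__ == "__main__":
    # Sample bodies (assumed, not taken from the archive).
    for body in ("Hello", "Привет", "Привет, world"):
        chars, size = char_and_byte_length(body)
        print(f"{body!r}: {chars} characters, {size} bytes in UTF-8")
```

For a purely Cyrillic body the byte length is exactly twice the character count, while ASCII punctuation, digits and spaces still take one byte each, so a mixed message lands somewhere in between.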

Here is the histogram of the message length distribution.

[Image: histogram of message lengths with the fitted log-normal PDF shown as a green line]

It is well known that the number of letters per word and the number of words per sentence are log-normally distributed, so it is no wonder that this distribution is also log-normal. The green line plots the probability density function (PDF) of Log-N(2.83, 1.15), and you can see that it fits the actual data pretty well.
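For readers who want to reproduce the fitted curve, here is a rough sketch of the log-normal PDF in plain Python. It assumes that 2.83 and 1.15 are the mean and standard deviation of the natural logarithm of the message length; the post does not say whether the second parameter is a standard deviation or a variance, so treat that as a guess. The names `MU`, `SIGMA` and `lognormal_pdf` are chosen for this example only.

```python
import math

# Assumed parameterisation: mean and standard deviation of ln(length).
MU = 2.83
SIGMA = 1.15


def lognormal_pdf(x: float, mu: float = MU, sigma: float = SIGMA) -> float:
    """Density of a log-normal distribution at x (zero for x <= 0)."""
    if x <= 0:
        return 0.0
    z = (math.log(x) - mu) / sigma
    return math.exp(-0.5 * z * z) / (x * sigma * math.sqrt(2.0 * math.pi))


# Closed-form summary statistics of a log-normal distribution.
median = math.exp(MU)
mean = math.exp(MU + SIGMA ** 2 / 2.0)
mode = math.exp(MU - SIGMA ** 2)

if __name__ == "__main__":
    print(f"mode={mode:.1f}, median={median:.1f}, mean={mean:.1f}")
    for length in (1, 5, 10, 20, 50, 100, 200):
        print(f"length={length:>3}: pdf={lognormal_pdf(length):.5f}")
```

Under these assumptions the median comes out at about 17 characters and the mean at about 33, with the mean pulled upward by the long right tail of the distribution.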