Voice conferences

Work-related voice conferences never seemed handy to me: they are strictly real-time (quite an issue if you don't maintain a sleep regimen or any of the participants has anything else scheduled), effectively half-duplex (only one person can talk at a time if the speech is to stay intelligible), there is no reliable and easy way to get greppable logs (transcriptions), and unless they are combined with a text chat, there is no way to copy and paste text or to share links or program output.

They are probably good for multiplayer games, when you are busy controlling a character but also need to coordinate actions in a timely manner. They may also be nice for those who are not used to reading and typing, for a casual chat, and while using less convenient devices (e.g., phones) to communicate.

Here are my notes on nicer protocols and software for those.

Requirements and concerns

An unpleasant thing about voice communication is speaker recognition: coupled with unencrypted or unknown protocols, surveillance, and data breaches, it can be quite uncomfortable to use. So my initial requirements are end-to-end encryption, an open protocol, at least one existing open source (preferably libre) client for GNU/Linux, and preferably a FLOSS server that is easy to deploy (if the protocol needs a server).

Apparently the requirements imposed by the majority of users, which should also be taken into account in order to actually use such a protocol, are that it should be extremely easy to set up and to use on various systems: no more than a few mouse clicks or touchscreen taps. Being well-known is perhaps also important to inexperienced users, since the lesser-known software they find tends to be malware, even by the relaxed, non-RMS definition of malware.

And the obvious requirement is for it to work well: acceptable sound/video quality (no perceivable noise, pauses, or delays) even over a poor connection, perhaps NAT traversal, etc.

Protocols

There is a comparison of VoIP software and a few more lists on Wikipedia, and in the YBTI map. Apparently casual users mostly think in terms of the client software that implements those protocols, so the clients are even more widely known.

XMPP + Jingle (with (S)RTP)
Like WebRTC, it uses ICE and RTP, with only the negotiation done over XMPP/Jingle. The quality of implementations varies, as with most other protocols. Media calls can work fine if clients and servers are set up properly. I am using those regularly with Prosody, coturn, and Conversations (after setting everything up myself: see the XMPP, private server setup, and Debian 11 workstation notes), and implemented audio calls in rexmpp. It is primarily for one-to-one calls, though, not conferencing.
SIP (also using (S)RTP)
SIP is fairly common for VoIP, and the software is available: the Android API for SIP; Linphone (though it is quite bloated and more or less unusable on a laptop, with windows being too large for its screen and not resizable); baresip for Android (rather nice, though it crashes on video calls and works poorly for voice calls on some phones); Twinkle for desktop GNU/Linux systems (though apparently it requires user names to be set and doesn't work with host-only SIP URIs); and Kamailio, a lightweight SIP router. I guess SIP is commonly used in LANs, but TLS and SRTP/ZRTP/etc are available, and it can be set up over a VPN (e.g., IPsec-based, with strongSwan). It is rather nice to use with static addresses and peer-to-peer connections over IPsec.
WebRTC (using (S)RTP, as others)

WebRTC looks like bloat in web browsers, but it is handy: it provides NAT traversal (ICE, STUN, TURN), end-to-end encryption (DTLS), and voice and video conferences, and it has been supported by common web browsers for a while now, which makes it relatively easy to use: a single mouse click to get into a conference. It is not perfect, but it is open and standardized, and reuses other standards.

I found it quite painful to use with public servers sometime around 2015: UDP hole punching did not seem to work well, random ports made it harder to fix manually, and there were no relevant IRC channels or XMPP conferences in sight. I ultimately failed to actually use it back then, but possibly things have improved since.

After 2020, I observed that Jitsi Meet uses WebRTC (and Jitsi Videobridge bridges it to Jitsi's regular SIP), and works fine. As of 2024, it is not in Debian repositories (because its many JVM-based dependencies complicate the packaging, and those would be quite heavy anyway), but there is Janus, a WebRTC server, and Jangouts to go with it (along with coturn, nginx, etc). They are relatively lightweight for WebRTC software, but there are awkward bugs (e.g., I noticed the Jangouts issue #439), Jangouts looks abandoned, and WebRTC in a web browser seems unnecessarily complicated for such a task.

Mumble
Gaming-oriented; the server depends on Qt5 libraries, and the project appears to focus on the software rather than on the protocol and its documentation. It uses a custom audio streaming container format complete with encryption, for unclear reasons, and the TLS requirement seems excessive for already secure channels. Nevertheless, it looks quite pragmatic and simple, works, and is very easy to set up (among other things, because it depends on manual certificate management): mumble-server, the regular Mumble client for desktop systems, Mumla for Android.
Experimental projects
There are various experimental (or planned, or research) projects, some of which attempt to build on distributed systems, yet they are rarely usable even by advanced users, if they work at all. Salsify looked interesting, though it only supported video streams, and does not look like it is going to support audio.
Audio over HTTP
HTTP is poorly suited for live media streaming, particularly because of the high latency, but since it is supported by commonly installed web browsers, HTTP connections are the least dodgy, and HTTP is even simple when compared to WebRTC, I decided to try it out, implementing bwchat. Apparently the latency added by TCP and HTTP is not that bad: most of the latency comes from the HTML "audio" element's buffering (though maybe it can be disabled), and its caching is awkward; generally it does not seem to be aimed at conferencing.
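
To make the idea of live audio over plain HTTP more concrete, here is a minimal sketch in Python. It is not how bwchat works: the names (`wav_stream_header`, `tone_chunk`, `StreamHandler`, `serve`), the 16 kHz sample rate, and the 20 ms chunk size are all assumptions made for illustration, and a generated sine tone stands in for microphone input. It relies on the common trick of writing a WAV header with "unknown" length fields and then sending PCM indefinitely, which players tend to accept but which is not guaranteed to work everywhere.

```python
import math
import struct
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

SAMPLE_RATE = 16000  # Hz; an arbitrary choice for this sketch
CHUNK_MS = 20        # milliseconds of audio written per iteration


def wav_stream_header(sample_rate=SAMPLE_RATE, channels=1, bits=16):
    """Build a 44-byte WAV header for a stream of unknown length."""
    block_align = channels * bits // 8
    byte_rate = sample_rate * block_align
    unknown = 0xFFFFFFFF  # a live stream has no known length
    return (b"RIFF" + struct.pack("<I", unknown) + b"WAVE"
            + b"fmt " + struct.pack("<IHHIIHH", 16, 1, channels,
                                    sample_rate, byte_rate, block_align, bits)
            + b"data" + struct.pack("<I", unknown))


def tone_chunk(first_sample, freq=440.0, ms=CHUNK_MS):
    """Return `ms` milliseconds of a sine tone as 16-bit little-endian PCM."""
    count = SAMPLE_RATE * ms // 1000
    samples = [
        int(12000 * math.sin(2 * math.pi * freq * (first_sample + i) / SAMPLE_RATE))
        for i in range(count)
    ]
    return struct.pack("<%dh" % count, *samples)


class StreamHandler(BaseHTTPRequestHandler):
    """Hypothetical handler: answers any GET with an endless WAV stream."""

    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "audio/wav")
        self.send_header("Cache-Control", "no-store")
        self.end_headers()
        self.wfile.write(wav_stream_header())
        position = 0
        try:
            while True:
                chunk = tone_chunk(position)
                self.wfile.write(chunk)
                position += len(chunk) // 2  # two bytes per mono sample
                time.sleep(CHUNK_MS / 1000)  # crude pacing, drifts over time
        except (BrokenPipeError, ConnectionResetError):
            pass  # the listener went away


def serve(port=8000):
    """Serve the stream on localhost until interrupted."""
    HTTPServer(("127.0.0.1", port), StreamHandler).serve_forever()
```

Calling `serve()` and pointing an audio player or a browser's audio element at `http://127.0.0.1:8000/` should play the tone; the delay between writing a chunk and hearing it is then mostly the client's buffering, which is the part this sketch cannot control.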

Building blocks

Opus is a good codec, though even PCM would work fine in many cases. RTP and Ogg are both fine for streaming, but normally require an external negotiation to pass metadata, for which SIP is sometimes employed. Just as with reading from files, though, the identification data can sometimes be retrieved from stream packets directly (such as Opus headers, magic signatures). RTP can be used with SRTP, while anything can be used over more general (and preferably UDP-based) encryption protocols: DTLS, IPsec, WireGuard. Custom container formats sometimes replace RTP, and those are fairly simple. Hacks for NAT traversal commonly have to be employed.

But much of that is not easily applicable, given that software choices are often limited by the users' systems, experience, time availability and willingness to experiment and to tinker with it.
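
As an illustration of how little is needed to read those building blocks directly, here is a small sketch that parses the fixed RTP header (RFC 3550) and the Opus identification header (the "OpusHead" packet from RFC 7845), and guesses a stream type from its first bytes. The function names and the returned dictionary layout are made up for this example, and the guess in `sniff` is only a heuristic: RTP has no magic signature, so any packet whose top two bits are `10` will look like RTP.

```python
import struct


def parse_rtp_header(packet):
    """Parse the 12-byte fixed RTP header (RFC 3550, version 2)."""
    if len(packet) < 12:
        raise ValueError("too short for an RTP header")
    first, second, sequence, timestamp, ssrc = struct.unpack(
        "!BBHII", packet[:12])
    if first >> 6 != 2:
        raise ValueError("not RTP version 2")
    csrc_count = first & 0x0F
    return {
        "padding": bool(first & 0x20),
        "extension": bool(first & 0x10),
        "marker": bool(second & 0x80),
        "payload_type": second & 0x7F,
        "sequence": sequence,
        "timestamp": timestamp,
        "ssrc": ssrc,
        # Offset past the CSRC list; a header extension, if flagged,
        # is not skipped here.
        "payload_offset": 12 + 4 * csrc_count,
    }


def parse_opus_head(data):
    """Parse the start of an Opus identification header (RFC 7845)."""
    if len(data) < 19 or data[:8] != b"OpusHead":
        raise ValueError("not an OpusHead packet")
    version, channels, pre_skip, rate, gain, family = struct.unpack(
        "<BBHIhB", data[8:19])
    return {
        "version": version,
        "channels": channels,
        "pre_skip": pre_skip,
        "input_sample_rate": rate,
        "output_gain": gain,
        "mapping_family": family,
    }


def sniff(data):
    """Guess what kind of data this is from its first bytes."""
    if data.startswith(b"OggS"):       # Ogg page capture pattern
        return "ogg"
    if data.startswith(b"OpusHead"):   # bare Opus identification header
        return "opus-head"
    if len(data) >= 12 and data[0] >> 6 == 2:
        return "maybe-rtp"             # only the version bits match
    return "unknown"
```

For example, `parse_rtp_header(struct.pack("!BBHII", 0x80, 111, 7, 960, 0x1234))` reports payload type 111 and sequence number 7; what payload type 111 actually means still has to come from the external negotiation, since dynamic payload types are assigned per session.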

Conferencing setup

In addition to the protocols (and related software) covered here, one should set up a microphone (see computer hardware notes), and noise and/or echo cancellation (see, for instance, the CentOS 7 workstation notes).

Conclusion

As of 2024, perhaps the only FLOSS option for voice conferencing I observed working fine in practice (smoothly, even with casual computer users) is Jitsi Meet with web browsers, which uses WebRTC, even though it is awkward to deploy. Mumble looks like it should work, too, but I had no opportunity to try it with others in a non-test setting.

The problem does not look hard, and its parts are somewhat solved, yet actually having a voice conference is still challenging. Perhaps even more so than file transfer between users.