Reverse engineering of protocols

Sometimes I have to implement protocols used by specialized hardware that is shipped without a protocol specification, but with an MS Windows GUI program (which is not suitable for automation or integration into other systems). Likewise with file formats. Here are assorted and general observations on the topic.

Tools

As the Wikipedia article on reverse engineering mentions, there are papers on (and tools for) automatic protocol and message format extraction. Those may be nice, but they do not make use of additional information that may be available, in some cases require more samples than it is viable to obtain, and often are not readily usable. Besides, it is usually not a good idea to rely on just a single tool, though they may be worth trying.

What I usually use is a programming language (with a REPL, for convenient experimentation); a text editor (Emacs in my case, with which it is handy to write and execute functions for processing the collected data in Elisp, and which includes a hex editor); tcpdump(8) along with tcpflow(1) and a few other common utilities; MS Windows (in a VM or on a dedicated machine; a specific version may be needed to run both the software and the RE tools); and various specialized tools as needed: e.g., PEiD for initial program analysis, decompilers for certain languages (mostly for C# and Java, though there is Ghidra too), R if statistics can help, and OllyDbg as the last resort (though apparently a modern and maintained alternative is x64dbg), which so far I have not needed for network protocols.

Process

Once the initial information (decompiled code if you are lucky, sources or specifications of other protocols by the same manufacturer, etc.) is gathered, the general process simply follows the scientific method: make observations, formulate hypotheses, experiment, repeat until satisfied (that is, somewhere between the protocol being usable and the purpose of every command and bit being clear). Much of it is inspection of collected packets.

While textual protocols (and file formats) are likely to be trivial to analyse, in my experience they are rare. But with binary ones it is still not hard to compare packets produced with different program input/output, identify the varying bits, and figure out how to decode/encode those. The packet structure can also become apparent after simply comparing different packets.
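
As a minimal sketch of that comparison step (in Haskell, which I use for implementations anyway; the sample packets and the command they supposedly encode are made up):

  import qualified Data.ByteString as BS
  import           Data.Word (Word8)

  -- Report the offsets at which two packets differ, with both values.
  diffOffsets :: BS.ByteString -> BS.ByteString -> [(Int, Word8, Word8)]
  diffOffsets a b =
    [ (i, x, y) | (i, (x, y)) <- zip [0 ..] (BS.zip a b), x /= y ]

  main :: IO ()
  main = mapM_ print (diffOffsets p1 p2)
    where
      -- Two hypothetical captures of the same command with different
      -- parameter values.
      p1 = BS.pack [0x02, 0x05, 0x20, 0x00, 0xE8, 0x03, 0xCF]
      p2 = BS.pack [0x02, 0x05, 0x20, 0x00, 0xD0, 0x07, 0xF4]

Here the varying offsets 4 and 5 would be candidates for the encoded parameter (0x03E8 = 1000 and 0x07D0 = 2000, read as little-endian), and the varying trailing byte for a checksum.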

Even when decompiled code or some specification is available, often it is still easier to focus on actual transmitted packets, since specifications are not always accurate or complete, and the (decompiled) code of those programs can be quite hard to read.

Below are observations on different aspects of packet/file formats:

Encryption
Encryption can be a major obstacle, but so far I have only come across one protocol that used a textbook example of bad encryption.
Checksums
While checksums are rarely useful on top of protocols and storage that all have their own integrity checks, they are often used, and have to be calculated. A mistake I keep making is to assume that they would be a CRC (see also: "A Painless Guide to CRC Error Detection Algorithms V3.00"), and to spend time picking parameters (see also: "Reverse-Engineering a CRC Algorithm"), only to notice that they are arithmetic sums, or all the values XORed together (see the first sketch below). Though occasionally a proper CRC is used.
Timestamps and numeric values
There are a lot of strange ways to encode timestamps. Sometimes different date and time components are packed together with a few bits for each, sometimes BCD is used (see the second sketch below), and I do not recall seeing regular UNIX timestamps in unspecified protocols (though there were a few in semi-specified ones). The situation is similar with numeric values (particularly signed and rational numbers), but it is usually easy to figure out the encoding if there are multiple encoded values to compare and their corresponding decoded values are known.
Compression
The closest thing I have seen to sensible compression in this context was ZIP (see also: "Zip - How not to design a file format"), but usually it is just about cramming values into a binary format and restricting the available values (including those for dates). A related common practice is to throw a part of the information away after receiving it, which makes working with secondary information sources even more awkward. An interesting example of compression-related awkwardness is BUFR, though it is open and specified, while small closed protocols tend to be simpler.
Underlying protocols
Sometimes only a low-level interface is known initially, but even when rather generic ones are used, basic and common protocols usually run on top of those: e.g., TCP/IP over Ethernet, or a virtual serial port over USB (with suitable serial adapter drivers included in Linux).

Such protocols may create TOCTOU issues, and tend to be awkward and strange, with silly solutions to be expected, but still relatively basic.
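
Back to checksums: a minimal sketch of checking the cheap candidates mentioned above, assuming the checksum is the last byte of the packet (a common layout, but by no means a given):

  import qualified Data.ByteString as BS
  import           Data.Bits (xor)
  import           Data.Word (Word8)

  -- Cheap checksum candidates worth trying before a CRC parameter search.
  candidates :: BS.ByteString -> [(String, Word8)]
  candidates body =
    [ ("sum", BS.foldl' (+) 0 body)    -- arithmetic sum, truncated to a byte
    , ("xor", BS.foldl' xor 0 body) ]  -- all the bytes XORed together

  -- Check each candidate against the last byte of a captured packet.
  checkLastByte :: BS.ByteString -> [(String, Bool)]
  checkLastByte pkt =
    [ (name, c == BS.last pkt) | (name, c) <- candidates (BS.init pkt) ]

Running checkLastByte over a handful of captured packets quickly confirms or rules out each candidate.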
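
Similarly for timestamps, a sketch of decoding a BCD-packed one; the YYMMDDhhmmss layout and the year base are assumptions for illustration:

  import qualified Data.ByteString as BS
  import           Data.Bits (shiftR, (.&.))
  import           Data.Word (Word8)

  -- Decode a single BCD byte: 0x59 -> 59.
  bcd :: Word8 -> Int
  bcd b = fromIntegral (b `shiftR` 4) * 10 + fromIntegral (b .&. 0x0F)

  -- Decode a 6-byte YYMMDDhhmmss timestamp (hypothetical layout), e.g.
  -- 0x25 0x01 0x31 0x23 0x59 0x07 -> Just (2025, 1, 31, 23, 59, 7).
  decodeTimestamp :: BS.ByteString -> Maybe (Int, Int, Int, Int, Int, Int)
  decodeTimestamp bs = case map bcd (BS.unpack bs) of
    [y, mo, d, h, mi, s] -> Just (2000 + y, mo, d, h, mi, s)
    _                    -> Nothing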

Documentation

Once a protocol is reverse-engineered, it is time to document it.

When there is official documentation for such a protocol, it is usually in MS Word, MS Windows CHM, or even MS Excel file format (though sometimes exported into PDF). Newer ones may use something like a web-based project management system with WYSIWYG editing to document protocols. Often it is very verbose, repeating the same information for every documented command, while still omitting important information. The awkwardness here usually matches that in the corresponding software and protocols.

To get unified and easily readable documentation (mostly for myself, since I then maintain it), I usually use Texinfo, with a few common sections:

  1. Brief description (describing the particular RE process in case further RE is needed, referencing the official documentation if available, referencing the implementation, etc.)
  2. Packet format (a table)
  3. Encoding of the data types used (descriptions and/or code samples)
  4. Messages (commands and their parameters)
  5. Protocol itself

I used to use LaTeX for that, but I find Info files more pleasant to work with, and Texinfo allows keeping all the system documentation in the same format, readable in different environments.
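
For illustration, a minimal Texinfo skeleton with those sections (the device name is made up):

  \input texinfo
  @settitle ACME-1000 Protocol

  @node Top
  @top ACME-1000 Protocol

  @menu
  * Description::
  * Packet format::
  * Data types::
  * Messages::
  * Protocol::
  @end menu

  @node Description
  @chapter Description
  @c RE process notes, references to the official documentation and to
  @c the implementation.

  @node Packet format
  @chapter Packet format
  @c A table describing the framing.

  @node Data types
  @chapter Data types

  @node Messages
  @chapter Messages

  @node Protocol
  @chapter Protocol

  @bye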

Implementation

When working with hundreds or thousands of potentially buggy devices, over unreliable channels, possibly implementing relatively complex algorithms to work with them (e.g., once there was a terminal emulator embedded into the client, with a curses-like interface controlled by the remote device, which had to be automated), while not being entirely certain about the protocols, and with many types of those, it is desirable to be certain that any issues that arise are not caused by your own implementation, and to be able to identify such issues quickly.

Views on the ways to achieve software reliability differ, and it is a rather large topic, but it is perhaps worth stressing its importance here as well. Haskell (particularly with attoparsec) and the UNIX philosophy (e.g., a program per protocol, text streams) seem to work well for me.
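
For instance, a minimal sketch of an attoparsec parser for a framed packet; the framing here (an STX byte, a length byte, a command byte, payload, XOR checksum) is made up, though typical:

  import           Data.Attoparsec.ByteString
  import           Data.Bits (xor)
  import qualified Data.ByteString as BS
  import           Data.Word (Word8)
  import           Prelude hiding (take)

  data Packet = Packet { command :: Word8, payload :: BS.ByteString }
    deriving Show

  -- Parse one frame: STX, length, command, payload, XOR checksum.
  packet :: Parser Packet
  packet = do
    _    <- word8 0x02                 -- hypothetical start-of-frame byte
    len  <- anyWord8
    cmd  <- anyWord8
    body <- take (fromIntegral len)
    csum <- anyWord8
    let expected = BS.foldl' xor (len `xor` cmd) body
    if csum == expected
      then pure (Packet cmd body)
      else fail "checksum mismatch"

  main :: IO ()
  main = print (parseOnly packet (BS.pack [0x02, 0x01, 0x10, 0xFF, 0xEE]))

A parse failure here is localized and explicit, which helps with being certain that an issue is on the device side rather than in your own code, and a small program per protocol keeps such failures easy to attribute.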