Sometimes I have to implement protocols used by specialized hardware that is shipped without a protocol specification, but with an MS Windows GUI program (which is not suitable for automation or integration into other systems). Likewise with file formats. Here are assorted and general observations on the topic.
As the Wikipedia article on reverse engineering mentions, there are papers on (and tools for) automatic protocol and message format extraction. These may be nice, but they do not make use of potentially available additional information, in some cases require more samples than it is viable to obtain, and often are not readily usable. Besides, it is usually not a good idea to rely on just a single tool, though it may be worthwhile to try them.
What I usually use is a programming language (with a REPL for convenience) for experimentation; a text editor (Emacs in my case, with which it is handy to write and execute Elisp functions for processing the collected data, and which includes a hex editor); tcpdump(8) along with tcpflow(1) and a few other common utilities; MS Windows (in a VM or on a dedicated machine; a specific version may be needed to run both the software and RE tools); and various specialized tools as needed: PEiD for initial program analysis, decompilers for certain languages (mostly C# and Java, though there is Ghidra too), R if statistics can help, and OllyDbg as the last resort (though apparently a modern/maintained alternative is x64dbg), which so far I have not needed for network protocols.
Once initial information is gathered (decompiled code if you are lucky, sources or specifications of other protocols by the same manufacturer, etc.), the general process simply follows the scientific method: make observations, formulate hypotheses, experiment, and repeat until satisfied (that is, somewhere between the protocol being usable and the purpose of every command and bit being clear). Much of it is inspection of collected packets.
While textual protocols (and file formats) are likely to be trivial to analyse, in my experience they are rare. But even with binary ones it is not hard to compare packets captured with different program input/output, identify the varying bits, and figure out how to decode/encode them, as sketched below. Packet structure can also become apparent simply from comparing different packets.
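For instance, here is a minimal Haskell sketch of such a comparison: take two captures of the same command issued with different arguments, and list the offsets at which their bytes differ. The packet bytes below are made-up placeholders; in practice they would come from tcpflow output.

    import qualified Data.ByteString as BS
    import Data.Word (Word8)
    import Text.Printf (printf)

    -- Offsets (with both values) at which two equal-length packets differ.
    diffPackets :: BS.ByteString -> BS.ByteString -> [(Int, Word8, Word8)]
    diffPackets a b =
      [ (i, x, y)
      | (i, x, y) <- zip3 [0 ..] (BS.unpack a) (BS.unpack b)
      , x /= y ]

    main :: IO ()
    main = do
      -- Two captures of the same command with different arguments.
      let p1 = BS.pack [0x02, 0x10, 0x00, 0x2a, 0x3c, 0x03]
          p2 = BS.pack [0x02, 0x10, 0x00, 0x55, 0x67, 0x03]
      mapM_ (\(i, x, y) -> printf "offset %d: %02x /= %02x\n" i x y)
            (diffPackets p1 p2)

On these placeholder packets, offsets 3 and 4 would stand out: one looks like the varying argument, and a byte that changes along with it hints at a checksum.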
Even when decompiled code or some specification is available, it is often still easier to focus on the actually transmitted packets, since specifications are not always accurate or complete, and the (decompiled) code of such programs can be quite hard to read.
Below are observations on different aspects of packet/file formats. Such protocols may create TOCTOU issues, and tend to be awkward and strange, with silly solutions to be expected, but they are still relatively basic.
Once a protocol is reverse-engineered, it is time to document it.
When there is official documentation for such a protocol, it is usually in MS Word, MS Windows CHM, or even MS Excel format (though sometimes exported to PDF). Newer ones may use something like a web-based project management system with WYSIWYG editing to document protocols. Such documentation is often very verbose, repeating the same information for every documented command while still omitting important details. The awkwardness here usually matches that of the corresponding software and protocols.
To get unified and easily readable documentation (mostly for myself, since I then maintain it), I usually use Texinfo, with a few common sections.
I used to use LaTeX for that, but found Info files more pleasant to work with, and Texinfo allows keeping all the system documentation in the same format, readable in different environments.
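For illustration, such a Texinfo file might be organized as follows; the section names here are hypothetical, not a fixed template:

    \input texinfo
    @settitle Example device protocol

    @node Top
    @top Example device protocol

    @menu
    * Overview::
    * Framing::
    * Commands::
    @end menu

    @node Overview
    @chapter Overview
    @c Transport, addressing, general behaviour.

    @node Framing
    @chapter Framing
    @c Packet structure, checksums, escaping.

    @node Commands
    @chapter Commands
    @c A section per command: request and response formats.

    @bye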
When working with hundreds or thousands of potentially buggy devices, over unreliable channels, possibly implementing relatively complex algorithms on top of them (e.g., once there was a terminal emulator embedded into the client, with a curses-like interface controlled by the remote device, which had to be automated), while not being entirely certain about the protocols, and with many types of them. In such a setting it is desirable to be certain that any issues that arise are not caused by your own implementation, and to be able to identify them quickly.
Views on the ways to achieve software reliability differ, and it is a rather large topic, but it is perhaps worth stressing its importance here as well. Haskell (particularly with attoparsec) and the UNIX philosophy (e.g., a program per protocol, text streams) seem to work well for me.