Source code archaeology

From C64-Wiki

Source code archaeology, also known as software archaeology, is the study of undocumented or poorly documented software through the application of reverse engineering techniques. The process is tedious, and it can take several months for a source code archaeologist to fully "grok" how everything works, even for a comparatively small executable. Several groups currently active in the C64 scene engage in source code archaeology as part of their releases. Perhaps the best known in recent memory is the group Reengine and Mod (ReM).

The end product of source code archaeology is typically:

  • Reconstructed source code that can be built with modern tools yet is functionally equivalent to the lost/unreleased original source code.
  • A binary file that, once assembled from that source, is a byte-exact copy of the original.
  • Extremely detailed documentation on the inner workings of analyzed software.
  • From this, new or heavily revised technical and end-user documentation.
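The byte-exact requirement can be checked mechanically after every rebuild. The following is a minimal sketch in Python, in the spirit of cmp; the function names `first_difference` and `compare_files` are illustrative, not part of any existing tool.

```python
from pathlib import Path


def first_difference(original: bytes, rebuilt: bytes):
    """Return the offset of the first differing byte, or None if identical."""
    for offset, (a, b) in enumerate(zip(original, rebuilt)):
        if a != b:
            return offset
    if len(original) != len(rebuilt):
        # One image is a prefix of the other; they diverge where the shorter ends.
        return min(len(original), len(rebuilt))
    return None


def compare_files(original_path: str, rebuilt_path: str):
    """Compare two files on disk (hypothetical paths supplied by the caller)."""
    return first_difference(
        Path(original_path).read_bytes(),
        Path(rebuilt_path).read_bytes(),
    )
```

A result of `None` means the reconstructed source assembles to the same bytes as the original; any other value is the file offset at which to start looking for a disassembly mistake.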

Difficulties

Reconstructing source code from a binary executable is not a simple process. Very often, it is difficult to tell instructions from data. The 6502, like most processors, does not internally mark some bytes as executable and others as data.
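One common way to separate instructions from data is to follow the control flow from a known entry point and treat only the bytes that are actually reached as code. The sketch below illustrates the idea in Python. It is an assumption-laden toy: the `OPCODES` table covers only a handful of 6502 opcodes, the name `trace_code` is hypothetical, and indirect jumps and self-modifying code are ignored.

```python
# opcode -> (mnemonic, length in bytes, control-flow kind); deliberately incomplete
OPCODES = {
    0xA9: ("LDA #imm", 2, "next"),
    0xA2: ("LDX #imm", 2, "next"),
    0x8D: ("STA abs", 3, "next"),
    0xCA: ("DEX", 1, "next"),
    0xEA: ("NOP", 1, "next"),
    0xD0: ("BNE rel", 2, "branch"),
    0xF0: ("BEQ rel", 2, "branch"),
    0x20: ("JSR abs", 3, "call"),
    0x4C: ("JMP abs", 3, "jump"),
    0x60: ("RTS", 1, "stop"),
}


def trace_code(image: bytes, base: int, entry: int):
    """Return the set of addresses reached as code when executing from entry."""
    end = base + len(image)
    starts = set()      # addresses where a decoded instruction begins
    code = set()        # every byte belonging to a decoded instruction
    pending = [entry]   # addresses still waiting to be followed
    while pending:
        pc = pending.pop()
        while base <= pc < end and pc not in starts:
            info = OPCODES.get(image[pc - base])
            if info is None:
                break   # opcode not in the toy table: give up on this path
            _, length, kind = info
            if pc + length > end:
                break   # instruction would run past the end of the image
            starts.add(pc)
            code.update(range(pc, pc + length))
            operand = image[pc - base + 1:pc - base + length]
            if kind == "branch":
                # relative branches use a signed 8-bit offset from the next instruction
                offset = operand[0] - 256 if operand[0] >= 128 else operand[0]
                pending.append(pc + 2 + offset)
            elif kind in ("call", "jump"):
                pending.append(operand[0] | (operand[1] << 8))  # little-endian
            if kind in ("jump", "stop"):
                break   # execution does not fall through to the next byte
            pc += length
    return code
```

Bytes that are never reached remain data candidates even when they happen to look like valid opcodes, which is exactly the case a byte-by-byte linear pass gets wrong.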

Tokenized BASIC at the start of a file is easy for an experienced programmer to spot. But programs that mix BASIC and machine code, usually in the form of DATA statements, can be difficult to deal with. Effectively, each BASIC line must be detokenized with a tool like petcat, and then the machine code must be extracted from the DATA statements before it can be disassembled. Obviously, this adds a further layer of complexity to an already complex process.
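As a rough illustration of the extraction step, the Python sketch below walks the tokenized lines of a BASIC V2 program file and collects the values found in DATA statements. It assumes the usual PRG layout (a two-byte load address, then lines made up of a two-byte link, a two-byte line number, the tokenized text, and a zero terminator) and assumes that each DATA line holds only decimal byte values. The function names are hypothetical.

```python
DATA_TOKEN = 0x83  # BASIC V2 token for the DATA keyword


def basic_lines(prg: bytes):
    """Yield (line_number, tokenized_text) for each BASIC line in a PRG image."""
    pos = 2  # skip the two-byte load address
    while pos + 4 <= len(prg):
        link = prg[pos] | (prg[pos + 1] << 8)
        if link == 0:
            return  # a zero link marks the end of the BASIC program
        number = prg[pos + 2] | (prg[pos + 3] << 8)
        end = prg.find(0, pos + 4)  # each line is terminated by a zero byte
        if end == -1:
            return  # malformed or truncated program
        yield number, prg[pos + 4:end]
        pos = end + 1


def data_bytes(prg: bytes) -> bytes:
    """Collect decimal values from lines that consist solely of a DATA statement."""
    out = bytearray()
    for _, text in basic_lines(prg):
        text = text.lstrip(b" ")
        if text[:1] != bytes([DATA_TOKEN]):
            continue
        for item in text[1:].split(b","):
            item = item.strip()
            if item.isdigit() and int(item) < 256:
                out.append(int(item))
    return bytes(out)
```

The returned bytes are only the raw machine code; where the program POKEs them, and therefore the address they must be disassembled at, still has to be worked out from the BASIC listing.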

Software obfuscation techniques implemented by the original authors can cause problems with disassembly as well. As with obfuscation, the use of hardware trickery (very often in the name of performance) can also make properly understanding the code exceedingly difficult.

The use of compressed data is very common, especially in the 8-bit era when every byte of RAM was precious. Most of these compression schemes have been documented by this point. However, non-standard, novel, or undocumented compression techniques can still be a real headache for the source code archaeologist.

Programs that make aggressive use of binary overlay techniques in order to fit in the C64's limited memory can be hard to understand. This is a problem on two fronts. First, parts of the executable are swapped in and out of memory dynamically, so the same memory address can, and often does, point to different things at different times. Second, because this technique is often used by large multi-disk programs, files on different disks may share the same name yet serve completely different functions.

Lastly, applications and games that make use of their own internal scripting language can be an especially difficult challenge. At that point, the source code archaeologist must "peel back the onion" and not only understand the code but also grasp the inner workings of the application's script interpreter. Very often, fuzzing techniques are required to work out the meaning of various parts of the scripting language in cases where the available scripts do not exercise every feature of the language.

Idealized High Level Process

Phase 1: Manual Disassembly

  • Dump the binary file(s) to hex.
  • Pick an assembler and a text editor.
  • Disassemble byte by byte, by hand.
  • Attempt to assemble the new source.
  • Fix disassembly mistakes and typos; rinse and repeat.
  • At this point, a (barely) human-readable .asm file has been reconstructed.
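The first step of this phase, dumping a binary to hex, can be sketched in a few lines of Python. The sketch assumes the file is a C64 PRG, whose first two bytes hold the little-endian load address, so the dump can be labelled with real memory addresses rather than file offsets. The names `split_prg` and `hex_dump` are illustrative only.

```python
def split_prg(prg: bytes):
    """Split a PRG image into (load_address, payload)."""
    load_address = prg[0] | (prg[1] << 8)  # little-endian, low byte first
    return load_address, prg[2:]


def hex_dump(data: bytes, base: int = 0, width: int = 16) -> str:
    """Format data as lines of 'address: hex bytes', starting at address base."""
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join(f"{b:02x}" for b in chunk)
        lines.append(f"{base + offset:04x}: {hex_part}")
    return "\n".join(lines)
```

Typical use would be `load, payload = split_prg(data)` followed by `print(hex_dump(payload, base=load))`, which gives a listing to work from while disassembling by hand.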

Phase 2: Deep Documentation

  • Map out application memory.
  • Analyze the application control flow.
  • Analyze data structures.
  • Note all jump locations, constant values, and branch locations.
  • Provide human-readable aliases and branch labels.
  • Begin drafting documentation and commenting the source code.
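Noting branch locations involves a small piece of arithmetic: a 6502 relative branch stores a signed 8-bit offset measured from the address of the instruction that follows it. The Python sketch below computes such a target and then assigns labels to a collection of target addresses. The `L1007`-style naming scheme and the function names are assumptions made for illustration; known addresses can be given meaningful aliases instead.

```python
def branch_target(pc: int, offset_byte: int) -> int:
    """Target of a relative branch at address pc with the given operand byte."""
    signed = offset_byte - 256 if offset_byte >= 128 else offset_byte
    # the offset is relative to the address after the two-byte branch instruction
    return (pc + 2 + signed) & 0xFFFF


def make_labels(targets, aliases=None):
    """Map each target address to an alias if one is known, else a generic label."""
    aliases = aliases or {}
    return {
        addr: aliases.get(addr, f"L{addr:04X}")
        for addr in sorted(set(targets))
    }
```

For example, `make_labels([0x1007, 0xFFD2], {0xFFD2: "CHROUT"})` keeps the well-known KERNAL name for $FFD2 and falls back to `L1007` for the address that has not been understood yet.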

Phase 3: Code Modularization

  • Create macros for commonly reused code.
  • Separate data into resource files.
  • Split the large code base into individual files.

Phase 4: Final Polish

  • Patch any bugs.
  • Add any new features desired.
  • Release.

Tools

  • Vim, Emacs, Notepad++ or some other powerful text editor.
  • VICE, especially the machine monitor and petcat.
  • xxd for converting binary files to hex dumps.
  • diff for finding differences in similar text files.
  • cmp for byte-by-byte file comparisons.
  • DirMaster, an MS Windows tool for manipulating disk images.
  • C64list, a BASIC tokenizer/detokenizer.

See Also