Atalasoft Imaging SDK Development Blog

Document Imaging and Developer Commentary

I Like the 80s with Struct Style


Posted by Steve Hawley
May 04, 2012



In this article, I’m going to talk about 80s-era file formats and ways you can support them in .NET while keeping your code sane, safe, and short. But to start, let’s talk about data file formats and why they are the way they are, or were the way they were.

First, let’s consider why we even have data files. Persistence of data is one reason, but historically you would frequently see data files built because you couldn’t keep the entire dataset in memory. In the dim dark ages, you were lucky if you had virtual memory at all (I’m looking at you, Macintosh System 6 and earlier), and when you did have it, it was often lousy (I’m looking at you, Windows 3.1 and earlier), so you would try to keep your memory footprint low. The obvious solution was to dump things to a temporary file, then read them back in later when you needed them again. The trick is that you wanted to incur as little overhead as possible, so your reading and writing code had to be simple. A typical data structure might look like this:

typedef struct t_person {
    char gender;
    char age;
    char firstname[14];
    char lastname[16];
} t_person;


OK – that’s a quick, badly-designed structure for representing a person.  So how would you write it out?

if (fwrite(somePerson, 1, sizeof(t_person), fp) != sizeof(t_person))
    return FALSE; /* fail */


We like fwrite, oh yes we do.  So much so that if we have an array of t_person, we might write this routine for dumping it out:

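int writePeople(t_person *people, int count, FILE *fp)
{
    size_t totalBytes = (size_t)count * sizeof(t_person);

    /* fwrite returns the number of items written; with an item size of 1
       that is the byte count, so it can be compared against totalBytes */
    return fwrite(people, 1, totalBytes, fp) == totalBytes;
}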


And then you think: oh dang, I didn’t write out how big the array was. Then you realize that you’re filling a temp file with many heterogeneous types, and wouldn’t it be better if the file were self-identifying? So then you would write code like this:

int writeTag(INT32 tag, FILE *fp)
{
    return fwrite(&tag, 1, sizeof(INT32), fp) == sizeof(INT32);
}

int writePeople(t_person *people, int count, FILE *fp)
{
    size_t totalBytes = (size_t)count * sizeof(t_person);

    /* identify the record type, then say how many records follow */
    if (!writeTag(kPersonTag, fp)) return FALSE;
    if (!writeTag(count, fp)) return FALSE;

    /* item size 1, so fwrite's return value is a byte count */
    return fwrite(people, 1, totalBytes, fp) == totalBytes;
}


And by solving a problem (forgot the array length and didn’t identify the type), I’ve created several new problems. First off, fwrite was fine to use if all I was writing was characters (this is a lie, by the way), but the moment I write a non-byte, I’m now writing the int with the implicit byte ordering of the processor running the code. That means that if the tag is 0x01020304, then on a little endian processor you will write the following bytes in sequence: 04, 03, 02, 01, but on a big endian processor (such as a 68K) you would write 01, 02, 03, 04. This means that when you inevitably port your code from, say, Windows to Macintosh, your reading code (if you use fread) won’t work. Ouch.
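To make the byte-order point concrete, here is a minimal sketch of a tag writer that always emits the most significant byte first, whatever the host processor does. It assumes a C99 compiler with <stdint.h>; writeTagBE is a hypothetical name and is not part of the code above.

```c
#include <stdint.h>
#include <stdio.h>

/* Write a 32-bit tag most-significant byte first (big endian),
   independent of the byte order of the machine running the code.
   Returns 1 on success, 0 on failure. */
int writeTagBE(uint32_t tag, FILE *fp)
{
    unsigned char bytes[4];

    bytes[0] = (unsigned char)(tag >> 24);
    bytes[1] = (unsigned char)(tag >> 16);
    bytes[2] = (unsigned char)(tag >> 8);
    bytes[3] = (unsigned char)tag;

    return fwrite(bytes, 1, sizeof bytes, fp) == sizeof bytes;
}
```

With this, the tag 0x01020304 lands in the file as 01, 02, 03, 04 on both a little endian and a big endian machine, at the cost of a few shifts per value.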

Why did we use fwrite again? Why did that seem like such a good idea? Well, we could write the entire data structure in one shot, in one line of code. Otherwise, we would have to write code to write each element out in a platform-neutral way. You’re going to have to do that anyway, because two things about C make your life worse. First, compilers are free to insert padding into your structure layout in order to make accessing the elements more efficient. So that lie back there about fwrite being fine with a struct of only characters: if your compiler inserts pad bytes, you’re writing more data than you suspect. Dang. Second, when C introduced enumerated types, it left it to the compiler to decide the data type best suited for holding all the enumerated values. The problem is that some compilers made different decisions than others, so your struct size would be different depending on the compiler.
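One way to sidestep padding is to write each member on its own, so that the size on disk is fixed by the code and not by the compiler. The following is a sketch under that assumption; writePerson and kPersonDiskSize are hypothetical names, and t_person is repeated so the snippet stands alone.

```c
#include <stdio.h>

typedef struct t_person {
    char gender;
    char age;
    char firstname[14];
    char lastname[16];
} t_person;

/* Size of one record in the file: the sum of the member sizes,
   which may be smaller than sizeof(t_person) if the compiler pads. */
enum { kPersonDiskSize = 1 + 1 + 14 + 16 };

/* Write the members one at a time so no pad bytes reach the file.
   Returns 1 on success, 0 on failure. */
int writePerson(const t_person *p, FILE *fp)
{
    return fwrite(&p->gender, 1, 1, fp) == 1
        && fwrite(&p->age, 1, 1, fp) == 1
        && fwrite(p->firstname, 1, sizeof p->firstname, fp) == sizeof p->firstname
        && fwrite(p->lastname, 1, sizeof p->lastname, fp) == sizeof p->lastname;
}
```

It is four calls where there used to be one, but every record now occupies exactly kPersonDiskSize bytes no matter which compiler built the program.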

At this point you realize that writing structs isn’t quite so easy, and maybe that job at the garden center is looking a whole lot nicer.  Still, this gives you a little historical perspective.

Flash forward to current technologies. How can we read these 80’s era data structures into C# and not suffer? You could try to set up a struct in C#, play games with struct layout attributes, and read the data in one fell swoop, but trust me, this is not the way to go. If that was bad in the 80’s, it’s just as bad now.

Let’s make some assumptions. First, let’s pretend that our data file is in big endian order. Second, let’s assume that we already have a class that has methods in it like this:

public class BigEndianReader {
    public static bool Read(Stream stm, out ulong ul) { /* ... */ }
    public static bool Read(Stream stm, out long ul) { /* ... */ }
    public static bool Read(Stream stm, out uint ul) { /* ... */ }
    public static bool Read(Stream stm, out int ul) { /* ... */ }
    public static bool Read(Stream stm, out short ul) { /* ... */ }
    public static bool Read(Stream stm, out ushort ul) { /* ... */ }

    public static bool ReadScalar(Stream stm, Type ft, out object o)
    {
        // an out parameter must be assigned on every return path,
        // so start with null and overwrite it on success
        o = null;

        if (ft == typeof(byte))
        {
            int b = stm.ReadByte(); // -1 signals end of stream

            if (b < 0) return false;
            o = (byte)b;
            return true;
        }

        if (ft == typeof(sbyte))
        {
            int b = stm.ReadByte();

            if (b < 0) return false;
            o = unchecked((sbyte)b); // 0x80..0xFF become negative values
            return true;
        }

        if (ft == typeof(int))
        {
            int i = 0;

            if (!Read(stm, out i)) return false;
            o = i;
            return true;
        }
        // ...
        return false;
    }
}

 

This is a set of methods that handle reading in scalar types and one general routine for switching based on the type.  This is very straightforward code – no surprises.
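The bodies of the Read overloads are elided above. As a rough idea of what such a routine has to do, here is the same byte-assembly logic sketched in C, to match the earlier examples. This is an assumption about the general technique, not the actual BigEndianReader code, and readBE, readU16BE and readU32BE are hypothetical names.

```c
#include <stdint.h>
#include <stdio.h>

/* Read count bytes (at most 8) and assemble them most-significant
   byte first. Returns 1 on success, 0 if the stream ends early. */
static int readBE(FILE *fp, int count, uint64_t *out)
{
    uint64_t acc = 0;
    int i;

    for (i = 0; i < count; i++) {
        int b = fgetc(fp);   /* EOF on end of stream or error */
        if (b == EOF) return 0;
        acc = (acc << 8) | (uint64_t)b;
    }
    *out = acc;
    return 1;
}

/* Read an unsigned 16-bit big endian value. */
int readU16BE(FILE *fp, uint16_t *value)
{
    uint64_t tmp;
    if (!readBE(fp, 2, &tmp)) return 0;
    *value = (uint16_t)tmp;
    return 1;
}

/* Read an unsigned 32-bit big endian value. */
int readU32BE(FILE *fp, uint32_t *value)
{
    uint64_t tmp;
    if (!readBE(fp, 4, &tmp)) return 0;
    *value = (uint32_t)tmp;
    return 1;
}
```

Like the C# methods, each routine reports a short read through its return value and leaves the caller to decide what to do about it.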

Now let’s figure out how to read in an array of scalars without knowing its element type:

private static bool ReadIntoArray(Stream stm, Type ft, FieldInfo fi, object o)
{
    Array arr = fi.GetValue(o) as Array;

    if (arr == null) return false;

    Type arrType = arr.GetType().GetElementType();

    if (!IsScalar(arrType)) return false;

    for (int i = 0; i < arr.GetLength(0); i++) {
        object val = null;
        if (!ReadScalar(stm, arrType, out val)) return false;
        arr.SetValue(val, i);
    }

    return true;
}


This will read in an array of scalars.  How do we know that the array type is a scalar?  We have a private predicate IsScalar() that tells us.  It essentially checks to see if the type is byte, sbyte, short, ushort, etc.

Now this is where the fun part comes in.  We write a routine to auto-populate a data structure using reflection:

public static bool ReadType(Stream stm, object o, params string[] names)
{
    Type t = o.GetType();

    foreach (string name in names)
    {
        FieldInfo fi = t.GetField(name);

        if (fi == null)
            throw new ArgumentException("unable to find field " + name);

        Type ft = fi.FieldType;

        if (IsScalar(ft))
        {
            object val = null;
            if (!ReadScalar(stm, ft, out val)) return false;
            fi.SetValue(o, val);
        }
        else if (IsArray(ft))
        {
            // report a short read instead of silently ignoring it
            if (!ReadIntoArray(stm, ft, fi, o)) return false;
        }
    }

    // every named field was read
    return true;
}


In this routine, we pass in a Stream, an object, and a list of names of fields to be populated.  We could also use properties if we wanted, but I’m sticking with fields right now.  With all this in place, let’s set our stage for parsing data from TrueType fonts.  Keep in mind that these structures are purely representational and are not part of a good API since the abstraction is wrong.  Here is a class that represents the TrueType font header structure:

class TTFontHeader {
    public uint TableVersion;
    public uint FontRevision;
    public uint CheckSumAdjustment;
    public uint MagicNumber;
    public ushort Flags;
    public ushort UnitsPerEm;
    public long Created;
    public long Modified;
    public short XMin;
    public short YMin;
    public short XMax;
    public short YMax;
    public ushort MacStyle;
    public ushort LowestRecPPEM;
    public short FontDirectionHint;
    public short IndexToLocFormat;
    public short GlyphDataFormat;

    public static TTFontHeader FromStream(Stream stm)
    {
        TTFontHeader fh = new TTFontHeader();

        if (!BigEndianReader.ReadType(stm, fh, "TableVersion", "FontRevision", "CheckSumAdjustment", "MagicNumber", "Flags",
                "UnitsPerEm", "Created", "Modified", "XMin", "YMin", "XMax", "YMax", "MacStyle", "LowestRecPPEM",
                "FontDirectionHint", "IndexToLocFormat", "GlyphDataFormat"))
            throw new Exception("unable to read TTFont header");

        return fh;
    }
}


Now I’ve got very simple code to read in all the fields.  Just to remind you of the hell of the 80’s, recall that for any given data structure you may have different versions.  For example, the OS/2 metrics table inside TrueType files may include more fields depending on the version.  We can work with that by writing code like this:

public static TTOS2WindowsMetrics FromStream(Stream stm)
{
    ushort version = 0;

    if (!BigEndianReader.Read(stm, out version))
        throw new Exception("failure reading OS/2 version");

    TTOS2WindowsMetrics met = new TTOS2WindowsMetrics(version);
    string[] names = null;

    switch (version) {
    case 0: names = _v0FieldNames; break;
    case 1: names = _v1FieldNames; break;
    case 2: case 3: case 4: names = _v2and3and4Fields; break;
    default: throw new Exception("unexpected OS/2 metrics version");
    }

    if (!BigEndianReader.ReadType(stm, met, names))
        throw new Exception("failure reading OS/2 metrics");

    return met;
}


One of the reasons why I like this is that it will fail fast if you rename fields (you unit tested that, right?) and should be highly round-trip testable, which should make it easy to catch missed values, typos in strings, and so on. Again, this example uses fields for all the data values. This is a decision to model the C structures as directly as possible. They could very easily have been properties, or the code could be extended to handle either. When you build appropriate small tools, you can build robust higher level tools for making the mundane easier to do.
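For comparison, the list of field names is really a small table describing the record layout, and the same table-driven idea can be approximated in C without reflection by describing each field with its offset and size. This is a hypothetical sketch, not code from this post: t_headSubset, t_fieldDesc, kHeadFields and readFields are invented for illustration, and only 2-byte and 4-byte unsigned fields are handled.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical subset of a header, for illustration only. */
typedef struct {
    uint32_t magicNumber;
    uint16_t flags;
    uint16_t unitsPerEm;
} t_headSubset;

/* One entry per field: where it lives in memory and how many
   bytes it occupies in the file. */
typedef struct {
    size_t offset;
    size_t size;
} t_fieldDesc;

static const t_fieldDesc kHeadFields[] = {
    { offsetof(t_headSubset, magicNumber), 4 },
    { offsetof(t_headSubset, flags),       2 },
    { offsetof(t_headSubset, unitsPerEm),  2 },
};

/* Read each described field as a big endian value and store it
   into record. Returns 1 on success, 0 on a short read or an
   unsupported field size. */
int readFields(FILE *fp, void *record, const t_fieldDesc *fields, size_t count)
{
    size_t i, j;

    for (i = 0; i < count; i++) {
        uint32_t acc = 0;
        unsigned char *dst = (unsigned char *)record + fields[i].offset;

        if (fields[i].size != 2 && fields[i].size != 4) return 0;

        for (j = 0; j < fields[i].size; j++) {
            int b = fgetc(fp);
            if (b == EOF) return 0;
            acc = (acc << 8) | (uint32_t)b;
        }

        if (fields[i].size == 2) {
            uint16_t v = (uint16_t)acc;
            memcpy(dst, &v, sizeof v);    /* store as a 16-bit field */
        } else {
            memcpy(dst, &acc, sizeof acc); /* store as a 32-bit field */
        }
    }
    return 1;
}
```

The trade-off is the same as in the C# version: the layout lives in one table that is easy to round-trip test, and the per-field reading code is written exactly once.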

Posted: 5/4/2012 2:06:48 PM by Steve Hawley | with 0 comments

