Looking back, it seems like complete lunacy.  For a while, my method of choice for creating blog entries was to write them in Word 2003 and then export HTML.  The almost unreadable text in those posts has been haunting me for a long time, and I've finally gone back and fixed them.  Along the way I built a small utility to do the cleanup, and I thought I would share it.


It almost makes sense to me now.  Our blog software features a very tempting "Paste from Word" button (see the paste icon with a little W?).

In order to write blog posts while still getting work done, the process needs to be as painless as possible.  Besides that button, Word 2003 had two features that really drew me in: the ability to easily change text size and the ability to retain the color in my code samples from Visual Studio.

The problem was always that it filled my posts with strange proprietary HTML tags.  Also, the formatting in Firefox always looked kind of off: the text would overlap just a bit.  The font was also pretty terrible looking... OK, it was just plain ugly.

I'm not the only one who suffered from these issues.  Even Jeff Atwood of Coding Horror has experienced this.  He wrote a little utility to clean up Word's HTML, but it didn't work well for me, probably because our CMS does some sort of voodoo when the "Paste from Word" button is pressed.

So in the end, with some help from Dave T, the regex ninja, I was able to throw together a conversion app in C#.  The goal of this app was to preserve three things while removing everything else:

  1. The Links
  2. The Coloring
  3. The Text Size

It's not perfect, but with 9 transforms I was able to get the documents into a state where they could easily be reformatted.
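
To give a flavor of the approach, here is a minimal sketch of a regex transform pipeline.  It is written in Python rather than the C# of the actual app, and the transforms shown are illustrative guesses, not the nine from the project.  The names `clean_word_html`, `TRANSFORMS`, and `KEEP_PROPS` are made up for this example.  It assumes the Word markup uses comments, `<o:p>`-style namespace tags, `class`/`lang` attributes, and inline `style` attributes, and it keeps links, `color`, and `font-size` while dropping the rest.

```python
import re

# Style properties worth keeping (assumption: color and text size live here).
KEEP_PROPS = {"color", "font-size"}

# Matches a style attribute in either double or single quotes.
STYLE_ATTR = re.compile(r"""\s+style=(?:"([^"]*)"|'([^']*)')""", re.IGNORECASE)


def _filter_style(match):
    """Rebuild a style attribute that keeps only the KEEP_PROPS declarations."""
    raw = match.group(1) if match.group(1) is not None else match.group(2)
    kept = []
    for decl in raw.split(";"):
        name, sep, value = decl.partition(":")
        if sep and name.strip().lower() in KEEP_PROPS:
            kept.append(name.strip().lower() + ": " + value.strip())
    if not kept:
        return ""  # nothing worth keeping, so drop the attribute entirely
    return ' style="' + "; ".join(kept) + '"'


# Hypothetical transforms, applied in order.  Each is (pattern, replacement).
TRANSFORMS = [
    # 1. Remove HTML comments, including Word's conditional comments.
    (re.compile(r"<!--.*?-->", re.DOTALL), ""),
    # 2. Remove Office namespace tags such as <o:p> and </o:p>.
    (re.compile(r"</?\w+:\w+[^>]*>"), ""),
    # 3. Remove class and lang attributes (quoted or unquoted values).
    (re.compile(r"""\s+(?:class|lang)=(?:"[^"]*"|'[^']*'|[^\s>]+)""",
                re.IGNORECASE), ""),
    # 4. Trim style attributes down to color and font-size.
    (STYLE_ATTR, _filter_style),
]


def clean_word_html(html):
    """Run every transform over the input and return the cleaned markup."""
    for pattern, replacement in TRANSFORMS:
        html = pattern.sub(replacement, html)
    return html


if __name__ == "__main__":
    sample = (
        '<p class="MsoNormal"><span lang="EN-US" '
        "style='font-size:10.0pt;font-family:\"Courier New\";color:blue'>"
        'int</span> x;<o:p></o:p> <a href="http://example.com">link</a></p>'
    )
    print(clean_word_html(sample))
```

Links survive simply because nothing in the list touches `href`.  Regexes are a blunt tool for HTML (for instance, transform 3 would also eat the text ` class=foo` if it appeared in a sentence), which is part of why the result only needs to be good enough to reformat by hand.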

You can grab the project here if you are interested.