Database Refactoring & Ruby

I don't do a lot of database work; I believe in rendering unto the DBA that which is the DBA's. But ya got's to pay the bills, so I've been doing some refactoring work on a database. While flirting with RedGate's SQL Dependency Tracker, I finally figured out how to replace my current Visio install with Enterprise Architect Edition, and that did a good-enough job for me in terms of reverse- and forward-engineering the database.

The next phase was formatting the initial data. This came in the form of about half a dozen text dumps (a few in XML, a few in CSV) from three different sources, each with its own names and structures. Several also involved some not-particularly-difficult data formatting (date formats, etc.).

I did this with Ruby, using the interactive shell a little but always dropping back to scripts, since the data sources might very well produce a "new" dump at some point and I wanted to automate the process as much as possible. The ability to create an iterator (the "yield" keyword) turned out to be a nice part of this: it was trivial to develop against just a few thousand records and then expand the iterator to cover the full dumps. Of course this would have been possible using normal imperative techniques, but language choice is all about what approaches the language facilitates, not what is "possible" in the language.
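
The shape of it, with invented column names and stand-in helpers (the real dumps and transforms differ), was roughly a method that walks a dump, yields normalized records, and can be capped during development:

    require 'csv'
    require 'date'

    # Illustrative only: walk one CSV dump, yield normalized records, and stop
    # early if a limit is given, so development runs can work against a few
    # thousand records and the full run just drops the limit.
    def each_normalized_record(path, limit = nil)
      count = 0
      CSV.foreach(path, headers: true) do |row|
        yield({
          customer_id: row['CustNo'],
          name:        row['CustName'].strip,
          signed_up:   Date.strptime(row['SignupDate'], '%m/%d/%Y').iso8601
        })
        count += 1
        return if limit && count >= limit
      end
    end

    # Development: a quick slice.
    each_normalized_record('source_a.csv', 2_000) { |rec| p rec }

    # The real run: the whole dump.
    # each_normalized_record('source_a.csv') { |rec| insert(rec) }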

So the development of the data-transform scripts was very positive. As the dataset grew, though, the performance became problematic: I was only generating about 1,000 records per minute. Since the complete dataset would be on the order of 1,000,000 records, I was looking at an evening-to-morning batch run (1,000,000 records at 1,000 per minute is roughly 16 hours).

That would be no big deal if I knew it was a one-time effort, but that's not a safe assumption in Our Agile Times, right? One must be sanguine about the possibility of the database being refactored and repopulated often.

Additionally, the text dumps were not crystalline. Every once in a while there would be some heretofore-unseen oddity that would, in the best case, raise an exception and stop the processing and, in the worst case, quietly corrupt the data stream. And, of course, when the destination is a normalized data structure, the corruption of one record can have far-reaching implications.
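
In hindsight, the cheapest insurance is to make the worst case impossible: validate the few invariants the normalized schema depends on and fail loudly with context. A sketch, with made-up field names:

    class BadRecordError < StandardError; end

    # Guard against the "quiet corruption" case: check the handful of
    # invariants the normalized schema depends on, and raise with enough
    # context to find the offending line in the source dump.
    def validate!(record, source_line)
      if record[:customer_id].to_s !~ /\A\d+\z/
        raise BadRecordError, "line #{source_line}: bad customer_id #{record[:customer_id].inspect}"
      end
      if record[:signed_up] !~ /\A\d{4}-\d{2}-\d{2}\z/
        raise BadRecordError, "line #{source_line}: unexpected date #{record[:signed_up].inspect}"
      end
      record
    end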

Further, it's difficult to know exactly why the performance is what it is. Am I IO-bound, CPU-bound, or memory-bound? Am I thrashing the garbage collector? Are there performance implications to those nice-to-use iterators? I like the Ruby language, but I am far from an expert in its internals.
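
The standard library does give a rough first cut at the question, e.g. timing the read / transform / write phases separately over a small slice (the phase methods below are stand-ins for the real script's steps):

    require 'benchmark'

    # Time each phase over a small slice to see where the minute-per-thousand
    # records actually goes. read_slice, transform, and insert are stand-ins.
    rows = recs = nil
    Benchmark.bm(12) do |bm|
      bm.report('read:')      { rows = read_slice(5_000) }
      bm.report('transform:') { recs = rows.map { |r| transform(r) } }
      bm.report('write:')     { recs.each { |rec| insert(rec) } }
    end

    # A crude look at garbage-collector churn (Ruby 1.9+).
    p GC.stat if GC.respond_to?(:stat)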

This raises two issues about choosing a non-mainstream language. One issue is that when you hit an abstraction leak, the resource advantages of a mainstream language / platform become very significant. If there were a book called "Ruby Performance Programming," I'm sure it would help. If I were running Ruby on either the JVM or the CLR, I would be able to leverage lots of existing tools and knowledge about memory behavior.

The other issue is that, had I put my credibility on the line selecting Ruby to script the process, I would be open to criticism: "It would have run faster had you programmed it in [preferred mainstream language]!" If I were vulnerable to office politics, the choice of Ruby would have "given them an opening" to criticize me. (I can't remember if there's a general name for the "No one ever got fired for choosing IBM" dynamic, but that's what is at play.) It is irrelevant, for the purposes of that kind of real-world issue, whether the mainstream, conservative language really would have led to significantly faster runtime performance (my hunch is that it almost certainly would) or whether any such runtime advantage would have outweighed the development-time advantages (my hunch is that it almost certainly would not).

Since I am a programmer and not an evangelist, my solution will disappoint theorists: I reworked the Ruby enough to create 5,000 records at a time and to take "restart" command-line parameters that can be used to resume the process midway through the data after a corruption is manually diagnosed. It's far from a perfect solution: it's easy to imagine an import taking three complete days (start on day 1 and hit a corruption that evening; diagnose and correct it on the morning of day 2, restart, and hit a second corruption that evening; finish sometime on day 3).
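
For what it's worth, the shape of that compromise looks something like this (option names and the helper methods each_normalized_record and insert_batch are illustrative, not the real script):

    require 'optparse'

    # Commit in batches of 5,000 and accept a --skip argument so a run can
    # resume past records that were already loaded before a corruption
    # stopped it.
    options = { skip: 0, batch: 5_000 }
    OptionParser.new do |opts|
      opts.on('--skip N',  Integer, 'records to skip before resuming') { |n| options[:skip]  = n }
      opts.on('--batch N', Integer, 'records per committed batch')     { |n| options[:batch] = n }
    end.parse!

    buffer = []
    count  = 0
    each_normalized_record('source_a.csv') do |rec|
      count += 1
      next if count <= options[:skip]          # fast-forward past already-loaded records
      buffer << rec
      if buffer.size >= options[:batch]
        insert_batch(buffer)                   # ideally one transaction per batch
        puts "loaded through record #{count}"  # the number to hand to --skip on a restart
        buffer.clear
      end
    end
    insert_batch(buffer) unless buffer.empty?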