Guess Who's Back?

February 16, 2005

The blog post hiatus has ended! Here's what's new in the world o' Pablotron. First of all, the main hard drive on vault, my file/database/LDAP/email server, bit the dust last Wednesday. Fortunately the drive just started to fail (instead of dying outright). I had ample room to do immediate backups, and I had an unused 160G drive lying around. I spent most of Sunday afternoon and all of Monday evening partitioning the new drive and copying stuff back to it. As far as I can tell, the only thing I actually lost was the words file for spamprobe. I don't really consider that much of a loss, since I save all my email (even the cursed spam), so I can easily toss the requisite good and bad corpora at spamprobe to get things going again. Even though I'm short a 100G drive now, the experience overall has been a positive one. Here are some thoughts I had; maybe they'll prevent a week of stress for someone else:

  • Regular backups are just something you do. The ad-hoc backups I've been doing are better than nothing, but they wouldn't have done me any good if my drive had died outright. Had the circumstances been different, I would have lost weeks, possibly even a month, of email. My solution is (rather, will be, once everything is up and running again) an NFS-mounted backup directory on every machine (obviously not for people who don't like NFS). Each machine will be responsible for its own daily and weekly backups, via cron. Depending on how large this data set is, I'll be burning DVDs of the backup directory contents on a weekly or bi-weekly basis. Aside: Richard (richlowe) has been advocating revision-controlled config files for quite a while (e.g. cvs -d pabs@cvs:/cvs co etc-files/vault); maybe I'll give that a spin, too.
  • Distribute services across machines. I've got 4 other machines sitting around twiddling their thumbs at the moment. Any of them could easily be an authentication, database, email, LDAP, or CVS server, but instead they're all doing nothing (to be fair, sumo is my IRC/PostgreSQL machine, but that hardly qualifies as a crippling load).
  • Keep extra hardware lying around. As a true geek you're already doing this, of course :). The drive in vault started failing at 1:30 on a Wednesday morning. I was able to start making backups and moving stuff around right then. If I hadn't had the extra hard drive, I would have been SOL for several platter-scraping hours.
  • Losing your spam filter settings means you get to say cool words like "corpora" on your web page.
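
For the curious, here's a minimal sketch of what the per-machine cron job from the first point could look like. It's an illustration rather than the actual setup: the mount point /mnt/backup, the example directories, the crontab lines, and the function names are all assumptions. It only prints the tar commands (a dry run), so it's safe to play with.

```ruby
# Hypothetical crontab entries that would drive a script like this:
#   0 2 * * *  ruby backup.rb daily
#   0 3 * * 0  ruby backup.rb weekly

# Assumed NFS mount point shared by every machine.
BACKUP_ROOT = '/mnt/backup'

# Build the archive path for one source directory, for example
# /mnt/backup/vault/daily/etc-2005-02-16.tar.gz
def archive_path(host, kind, source, date)
  name = source.sub(%r{\A/}, '').tr('/', '-')
  stamp = date.strftime('%Y-%m-%d')
  File.join(BACKUP_ROOT, host, kind, "#{name}-#{stamp}.tar.gz")
end

# Build the tar invocation as an argument array, which avoids
# shell quoting problems if it is later passed to Kernel#system.
def tar_command(dest, source)
  ['tar', '-czf', dest, source]
end

# Dry run: show what a daily backup of two example directories would do.
host = 'vault'
date = Time.utc(2005, 2, 16)
['/etc', '/var/mail'].each do |source|
  puts tar_command(archive_path(host, 'daily', source, date), source).join(' ')
end
```

Putting the host name and the daily/weekly kind in the path keeps each machine's archives separate inside the shared directory, so burning the whole tree to DVD picks up everything at once.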

On the non-catastrophic hardware failure front, I upgraded halcyon to the latest Xorg, then promptly downgraded to the latest stable release. Here's the approximate order of events:

  1. Spent an hour or two configuring, compiling, and installing the latest Xorg.
  2. Ran X, and found out that the proprietary NVidia driver isn't compatible with the latest CVS snapshot of Xorg.
  3. Discovered just how painful the composite extension is without hardware acceleration by foolishly attempting to run X using the nv driver. Hint: Imagine using Netscape Navigator 3.0 on your old Commodore 64 with Photoshop doing an RLE Gaussian Blur on a 100 meg image in the background.
  4. Promptly downgraded to the stable release, cursing both NVidia for their proprietary silliness, and the bastards at freedesktop.org for having the audacity to make source code changes that inconvenienced me. I spent plenty of time on this step, so go ahead and re-read that last paragraph a couple of times.

Since I spent the majority of a Sunday afternoon recompiling X no fewer than 3 times, I also took the opportunity to try out the latest Enlightenment DR16 from CVS (yes Kim, I'm one of the few people still using e16). It's got its own built-in, mostly (semi?) working composite manager, so neither the patch nor the xcompmgr hackery I describe in this post is necessary any more. The new default theme looks great, too!

Why use other people's broken software when you can write your own? Here's the latest on the Pablotron coding front:

  • I've converted the RSS feeds on pablotron.org, paulduncan.org, and raggle.org from steaming loads of standards-noncompliant crap to pedantically-correct RSS 2.0. If your RSS aggregator couldn't read my pages before, it probably can now (unless your aggregator is based on the RSS library built into Ruby 1.8, but I'll get to that part of the story in a few minutes...)
  • Lots and lots and lots of updates to the next version of Raggle. Some of the changes are even by me! Thomas Kirchner (redshift) has been doing an unbelievable amount of work on the CVS version of Raggle. So much so, in fact, that I feel kind of embarrassed calling this latest version mine at all. So I think when it's ready for release, we'll call it kirchneraggle or something more suitable ;).
  • This patch for Ruby which adds wcolor_set support to the built-in Curses interface. Ville suggested it eons ago, and that was the last thing stopping me from porting Raggle from Ncurses-Ruby.
  • A partially working Curses windowing library for Ruby. This isn't in CVS just yet, but don't worry, I've got some new stuff for you to play with. Keep reading...
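
As a rough idea of what "pedantically-correct RSS 2.0" involves, here's a minimal sketch in plain Ruby. It is not the code behind the feeds mentioned above; the function names and the example.org data are made up for illustration. It covers three things a strict reader tends to check: the required channel elements (title, link, description), escaped text, and RFC 822 style dates in pubDate.

```ruby
# The five characters that need escaping in XML text and attributes.
XML_ESCAPES = {
  '&' => '&amp;', '<' => '&lt;', '>' => '&gt;',
  '"' => '&quot;', "'" => '&apos;'
}.freeze

def xml_escape(text)
  text.gsub(/[&<>"']/, XML_ESCAPES)
end

# RSS 2.0 dates use the RFC 822 layout, e.g. "Wed, 16 Feb 2005 08:00:00 GMT".
# getutc returns a UTC copy instead of modifying the caller's Time.
def rss_date(time)
  time.getutc.strftime('%a, %d %b %Y %H:%M:%S GMT')
end

# channel: hash with the three required elements as string keys.
# items:   array of hashes with :title, :link and :date.
def rss_feed(channel, items)
  lines = ['<?xml version="1.0" encoding="UTF-8"?>', '<rss version="2.0">', '<channel>']
  %w[title link description].each do |tag|
    lines << "<#{tag}>#{xml_escape(channel.fetch(tag))}</#{tag}>"
  end
  items.each do |item|
    lines << '<item>'
    lines << "<title>#{xml_escape(item[:title])}</title>"
    lines << "<link>#{xml_escape(item[:link])}</link>"
    lines << "<pubDate>#{rss_date(item[:date])}</pubDate>"
    lines << '</item>'
  end
  lines << '</channel>' << '</rss>'
  lines.join("\n")
end

feed = rss_feed(
  { 'title' => 'Example Blog', 'link' => 'http://example.org/',
    'description' => 'Posts & notes' },
  [{ title: "Guess Who's Back?", link: 'http://example.org/posts/1',
     date: Time.utc(2005, 2, 16, 8, 0, 0) }]
)
puts feed
```

Using fetch for the channel fields means a missing required element raises a KeyError right away instead of quietly producing an incomplete feed.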

The big stuff I've been working on lately is the core of the future Raggle. Before I begin, here's a high-level overview of how the components interact with one another (yup, a diagram!):

[Diagram: next-gen Raggle]

I've mentioned Squaggle previously, but for those of you sleeping in the back of the class (you know who you are), here's a brief recap. Squaggle is the SQLite-Ruby-based engine for Raggle. It's cleaner, faster, it uses less memory, and it lets me do all sorts of cool things I can't really do with the current engine (fancy delicious-style tagging, fast cross-feed searching, smart/auto categorization, and more). The version of Squaggle in CVS is functional (it even includes a usable WEBrick-based interface).

So what's this new stuff on ye olde diagram? libptime is a C-based RFC822 datetime and W3C datetime parsing library. It's BSD licensed, so you can download version 0.1.0 (signature), and use it to your heart's content. The other new library on the diagram is libfeed, an Expat-based RSS (0.9x, 1.0, and 2.0)/Atom feed parser. Why bother writing an RSS parser in C? The existing Raggle engine is slow, partly from being DOM-based, and partly from being written in Ruby. Don't get me wrong, REXML is a great XML parser, but RSS aggregators deal in volume, and I want to be sure the volume isn't constrained by parsing. I also noticed there wasn't a nice C-based RSS/Atom parsing library. Now there is (well, almost!). If that doesn't convince you, then maybe this will:


pabs@halcyon:~/cvs/libfeed/test> du -sh data/big-pdo-wdom.rss 
15M     data/big-pdo-wdom.rss
pabs@halcyon:~/cvs/libfeed/test> time perl -mXML::RSS -e \
  '$rss = new XML::RSS; $rss->parsefile("data/big-pdo-wdom.rss");'
real    7m56.892s
user    4m31.578s
sys     0m19.939s
pabs@halcyon:~/cvs/libfeed/test> time perl -mXML::RSS -e \
  '$rss = new XML::RSS; $rss->parsefile("data/big-pdo-wdom.rss");'
real    5m57.838s
user    4m28.727s
sys     0m3.703s
pabs@halcyon:~/cvs/libfeed/test> time ruby -rrss/2.0 -e \
  'RSS::Parser::parse(File.read("data/big-pdo-wdom.rss"))'
real    2m30.950s
user    1m46.904s
sys     0m8.610s
pabs@halcyon:~/cvs/libfeed/test> time ./testfeed data/big-pdo-wdom.rss \
  >/dev/null 2>&1
real    0m2.195s
user    0m1.472s
sys     0m0.104s
pabs@halcyon:~/cvs/libfeed/test> time ./testfeed data/big-pdo-wdom.rss \
  >/dev/null 2>&1
real    0m2.010s
user    0m1.475s
sys     0m0.099s

The Perl times were so bad I had to run them twice to be sure. 60 times faster than Ruby and over 100 times faster than Perl; I'd say that's a pretty good start :).
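
To check the arithmetic, here's a small sketch that turns the "real" readings from the transcript into ratios. The helper name and the choice of comparison are assumptions for illustration: it pits each interpreter's fastest run against libfeed's slowest run, which is the least flattering comparison for libfeed.

```ruby
# Convert a `time` reading such as "2m30.950s" into seconds.
def seconds(reading)
  match = /\A(\d+)m(\d+(?:\.\d+)?)s\z/.match(reading)
  raise ArgumentError, "unrecognized time: #{reading}" unless match

  match[1].to_i * 60 + match[2].to_f
end

# "real" (wall-clock) readings copied from the transcript above.
REAL_TIMES = {
  'Perl XML::RSS' => %w[7m56.892s 5m57.838s],
  'Ruby rss/2.0'  => %w[2m30.950s],
  'libfeed'       => %w[0m2.195s 0m2.010s]
}.freeze

# Conservative baseline: libfeed's slowest run.
libfeed_worst = REAL_TIMES['libfeed'].map { |r| seconds(r) }.max

# Ratio of each interpreter's fastest run to that baseline.
speedups = {}
REAL_TIMES.each do |name, readings|
  next if name == 'libfeed'

  speedups[name] = readings.map { |r| seconds(r) }.min / libfeed_worst
end

speedups.each do |name, ratio|
  puts format('%-14s %.1fx slower than libfeed', name, ratio)
end
```

On these numbers that works out to roughly 69x for Ruby and 163x for Perl, so the round figures above are, if anything, conservative.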

Unfortunately, I have to be awake in three hours, so I'll have to save the rest of the next-gen Raggle description for another day...