I’ve been working on an update to my homegrown RSS aggregator for a while now. It’s been “nearly” ready for most of that time, but it took me a while to convince myself it was actually ready. It’s funny how quickly I become conservative about pushing out updates to a tool that’s in use, even if I’m the only one using it. After running the two versions side by side for a couple of days and seeing that the new code does roughly what I expect, I finally shut off the old version and now use the new code exclusively. It’s not a huge change. The basic idea is the same as the thing I deployed last October.
The fetcher I deployed last October ran as three processes:
- rssimap: Listened to a Spread channel for new articles. Appended them to an IMAP folder.
- rsssuck: Ran out of cron to fetch and process feeds. Broadcast new and updated articles to a Spread channel.
- rsslog: Listened to a Spread channel for log messages. Wrote them to a file. Both of the other two pieces logged to this Spread channel.
The new one:
- is a single process run from cron. The one process fetches, processes, and appends new articles to IMAP folders.
- has some rules for appending to different folders based on the feed category, so all the photo blogs can go into a separate folder
- includes more headers in the appended IMAP messages, including a Date header (oops) and some X-* extension headers to make it possible to map a folder message back to the article in the database
- has several bug fixes in the feed handling and uses the latest feedparser release
- uses a (slightly) adaptive scheduler. The old fetcher tried to update every feed every time it was run. The new one adjusts the update period based on how often it sees a feed getting updated.
- is slightly clever about detecting when an article is a dupe: it uses an article hash, the unique link, and (this is the new bit) the feed’s article “guid”, if available, to determine if it’s seen an article before. That caught all of the remaining cases where the old code misread an updated/changed article as a new one.
- uses apsw instead of pysqlite2; the segfault incident convinced me apsw was more robust.
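To make the adaptive scheduling idea concrete, here is a minimal sketch, not the aggregator's actual code. The rule it uses (halve the polling period when a fetch turns up something new, stretch it by 1.5x when it doesn't), the 30-minute and 1-day bounds, and the name `next_period` are all assumptions chosen for illustration.

```python
from datetime import timedelta

# Hypothetical bounds: never poll more often than every 30 minutes,
# and never less often than once a day.
MIN_PERIOD = timedelta(minutes=30)
MAX_PERIOD = timedelta(days=1)


def next_period(current, saw_new_articles):
    """Return the polling period to use for a feed after one fetch.

    current          -- the period used for the fetch that just ran
    saw_new_articles -- True if that fetch found new or updated articles
    """
    if saw_new_articles:
        # The feed is active: check back sooner.
        proposed = current / 2
    else:
        # Nothing changed: back off gradually.
        proposed = current * 1.5
    # Clamp the result to the allowed range.
    return max(MIN_PERIOD, min(MAX_PERIOD, proposed))
```

A cron-driven fetcher could store the period alongside each feed's last fetch time and skip any feed whose next fetch is not yet due.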
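The dupe check described above combines three signals: the guid, the link, and a content hash. The sketch below is one way that could work, assuming the guid is trusted first, then the link, with the hash deciding between "unchanged" and "updated". The class `SeenArticles`, the helper `article_hash`, and the in-memory dictionaries are hypothetical stand-ins; the real aggregator keeps its state in a database.

```python
import hashlib


def article_hash(title, body):
    """Hash the parts of an article that matter for change detection."""
    h = hashlib.sha1()
    h.update(title.encode("utf-8"))
    h.update(b"\0")  # separator so ("ab", "c") differs from ("a", "bc")
    h.update(body.encode("utf-8"))
    return h.hexdigest()


class SeenArticles:
    """In-memory record of articles already processed (illustrative only)."""

    def __init__(self):
        self.by_key = {}     # ("guid" | "link", feed_id, value) -> last hash
        self.hashes = set()  # (feed_id, hash) pairs seen so far

    def classify(self, feed_id, title, body, link=None, guid=None):
        """Return 'new', 'updated', or 'duplicate' and record the article."""
        digest = article_hash(title, body)
        keys = []
        if guid:
            keys.append(("guid", feed_id, guid))  # strongest identity
        if link:
            keys.append(("link", feed_id, link))  # fallback identity

        previous = None
        for key in keys:
            if key in self.by_key:
                previous = self.by_key[key]
                break

        if previous is not None:
            # Same identity: changed content means an update, not a new item.
            status = "duplicate" if previous == digest else "updated"
        elif (feed_id, digest) in self.hashes:
            # No matching identity, but identical content was seen before.
            status = "duplicate"
        else:
            status = "new"

        for key in keys:
            self.by_key[key] = digest
        self.hashes.add((feed_id, digest))
        return status
```

With this ordering, an article whose text and link both change but whose guid stays the same is reported as updated rather than new.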
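The folder rules and extra headers can be sketched with the standard library's `email` package. Everything specific here is an assumption: the folder names, the `FOLDER_RULES` mapping, and the `X-Article-Id` / `X-Feed-Id` header names are made up for illustration, since the post does not name the actual ones.

```python
from datetime import datetime, timezone
from email.message import EmailMessage
from email.utils import format_datetime

# Hypothetical category -> IMAP folder rules.
FOLDER_RULES = {"photos": "RSS/Photos"}
DEFAULT_FOLDER = "RSS/Inbox"


def choose_folder(category):
    """Pick the IMAP folder for a feed category, with a default fallback."""
    return FOLDER_RULES.get(category, DEFAULT_FOLDER)


def build_message(article_id, feed_id, title, body, published):
    """Build the message to append to IMAP for one article."""
    msg = EmailMessage()
    msg["Subject"] = title
    # A Date header lets mail clients sort articles sensibly.
    msg["Date"] = format_datetime(published)
    # Extension headers map the folder message back to database rows.
    msg["X-Article-Id"] = str(article_id)
    msg["X-Feed-Id"] = str(feed_id)
    msg.set_content(body)
    return msg
```

The resulting message's `as_bytes()` output is what would be handed to an IMAP append call, along with the folder returned by `choose_folder`.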
I’m pleased. Not yet pleased enough to publish all this experimental code for the world to see, though.