agg is a news aggregator that stays focused on its goal: reading a feed and creating a representation that the rest of the system can work with. The process is straightforward: store all items[1] that have a publication date newer than that of the latest item received previously.
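
The selection rule can be sketched in a few lines of shell. This is only an illustration, not agg's actual implementation: the function name newer_than and the input format (one item per line, publication time in epoch seconds followed by the title) are assumptions made for the example.

```shell
#!/bin/sh
# Hypothetical helper, not part of agg: keep only the items that are newer
# than the latest item stored by the previous run.
#   $1    - publication time (epoch seconds) of the latest stored item
#   stdin - one item per line: "<epoch seconds> <title>"
newer_than() {
    awk -v latest="$1" '$1 + 0 > latest + 0'
}

# The items at 100 and 200 were seen before; only the item at 300 is printed.
printf '%s\n' '100 Old news' '200 Latest stored item' '300 Fresh news' |
    newer_than 200
```

An item without a publication date has nothing to compare against, which is exactly the case discussed next.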

That works remarkably well in the common case. A less common case is feeds without publication dates for their items. In fact, the pubDate tag is not required according to the RSS 2.0 specification. What to do in this case?

The solution is simple: if the author ignored publication dates, the aggregator shall do so, too. This has the inconvenient effect that once you've read all items and probably deleted them afterwards, running agg again on that feed will lead to all these items being there again!

So we need some form of caching[2]. However, such a rare case, which doesn't have much to do with agg's job anyway, should not end up in the same piece of software. Here are my tools for the output format of the 0.2 versions:

agg_filter to remove an item iff it has already been cached:

ITEM_NAME="$1"
FEED_PATH="`dirname "$ITEM_NAME"`"
FEED_NAME="`basename "$FEED_PATH"`"
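# Remove the item if its hash is already in the feed's cache.
# -F matches the hash literally; a missing cache file counts as "not cached".
grep -qF "`agg_hash "$ITEM_NAME"`" ".$FEED_NAME.cache" 2>/dev/null \
    && nomtime "rm -rf" "$ITEM_NAME"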

agg_cache to cache an item:

ITEM_NAME="$1"
FEED_PATH="`dirname "$ITEM_NAME"`"
FEED_NAME="`basename "$FEED_PATH"`"
agg_hash "$ITEM_NAME" >> ".$FEED_NAME.cache"

agg_hash to create a hash of an item:

cat "$1/title" "$1/desc" "$1/link" 2>/dev/null | sha1sum | awk '{ print $1; }'
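
To see what this hash does and does not cover, here is a small demo on a throwaway item. item_hash is the same pipeline as agg_hash wrapped in a function; the temporary directory and the item contents are made up for the example.

```shell
#!/bin/sh
# Same pipeline as agg_hash, wrapped in a function so the demo is self-contained.
item_hash() {
    cat "$1/title" "$1/desc" "$1/link" 2>/dev/null | sha1sum | awk '{ print $1; }'
}

# Made-up item in the 0.2 layout: a directory with one file per property.
DEMO=$(mktemp -d)
mkdir "$DEMO/item"
echo "Just in: Bar happened"  > "$DEMO/item/title"
echo "http://example.org/bar" > "$DEMO/item/link"    # no desc file on purpose

BEFORE=$(item_hash "$DEMO/item")
echo "Some details" > "$DEMO/item/desc"
AFTER=$(item_hash "$DEMO/item")

echo "without desc: $BEFORE"
echo "with desc:    $AFTER"
rm -rf "$DEMO"
```

Missing property files are skipped silently, so an item without a desc still gets a hash. Any change to title, desc or link yields a different hash, which means an item edited by the feed's author shows up again as unread.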

With the agg_each script posted in "using agg 0.2", the scripts are usually used as follows:

> $fetch_all_feeds_and_pipe_to_agg
> agg_each agg_filter "feed without pubDates"
> agg_each agg_cache  "feed without pubDates"
> agg_each agg_read *
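
The order of the two steps matters: filtering has to run before caching, otherwise every new item would be cached first and then removed as "already seen". The following sketch simulates two consecutive runs on one undated item. All names in it (item_hash, cache_item, filter_item, fake_agg) are stand-ins invented for the demo; it uses plain rm instead of nomtime and takes the cache file as an explicit argument.

```shell
#!/bin/sh
# Stand-ins for agg_hash, agg_cache and agg_filter.
item_hash()   { cat "$1/title" "$1/desc" "$1/link" 2>/dev/null | sha1sum | awk '{ print $1 }'; }
cache_item()  { item_hash "$1" >> "$2"; }
filter_item() {
    if grep -qF "$(item_hash "$1")" "$2" 2>/dev/null; then
        rm -rf "$1"
    fi
}

# Simulates agg writing the same undated item again on every run.
fake_agg() { mkdir -p "$1" && echo "Bar happened" > "$1/title"; }

DEMO=$(mktemp -d)
ITEM="$DEMO/feed/item"
CACHE="$DEMO/.feed.cache"
RESULTS=""

for RUN in 1 2; do
    fake_agg "$ITEM"
    filter_item "$ITEM" "$CACHE"            # drop the item if it was seen before
    if [ -d "$ITEM" ]; then
        cache_item "$ITEM" "$CACHE"         # remember what is left
        STATE=kept
    else
        STATE=filtered
    fi
    echo "run $RUN: item $STATE"
    RESULTS="${RESULTS:+$RESULTS }$STATE"
    rm -rf "$ITEM"                          # the reader deletes the item after reading
done
rm -rf "$DEMO"
```

The first run keeps and caches the item; the second run removes it before it ever reaches agg_read.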

But why this hassle? Why not just integrate it into agg, or use a real newsreader for that matter?

To be fair, on operating systems with more powerful concepts, agg is unnecessary since it basically does nothing more than performing a deserialization of a news feed.

UNIX, however, is a rudimentary operating system and lacks powerful concepts. In this case the problem boils down to the completely rudimentary native objects and methods of communication.

When a set of people communicates openly with the requirement that everyone should be able to take part in the communication, every person in the set must speak in a way everyone can understand. Thus, the communication can only be as smart as the dumbest person in that set. Else, you'd have to introduce even more people into the set because translators between the smart and dumb ones are required. Not only is this process cumbersome, but also you'll be occasionally lost in translation.

Speaking in terms of this metaphor, UNIX is dumb. Processes are only expected to communicate via streams of lines of text, files and directories[3]; this is especially true for the whole base system.

So, in order for users to be able to actually use the system (as opposed to using yet another application), agg either has to be file-based or needs even more (and even more complex) auxiliary tools, which indirectly leads to exactly the problems mentioned here.

Yes, it's a hassle. But it's the only practical way to at least partially achieve something Alan Kay has explained in his talk "The Computer Revolution Hasn't Happened Yet" at OOPSLA 1997:

Well I had programmed Caesar Franck's heroic piece—and if you know this piece, it is made for the largest organs that have ever been made. The loudest organs that have ever been made, in the largest cathedrals that had ever been made, because it's a nineteenth century symphonic type organ work, and Biggs was asking my friend to play this on this dinky, little organ.—He said, But how can I play this, on this? Biggs, he said, Just play it grand. Just play it grand. To stay with the future as it moves, is to always play your systems more grand than they seem to be right now.


  1. Currently there's a bug, but this is the concept anyway.
  2. A cache of your brain, that is, so that the software doesn't try to make you read data that's already in there again.
  3. Yes, and *argv[], fooenv(), shmfoo() etc. They don't matter in our case since we have "complex" data structures and no sane person would ever imagine using a computer by writing C and following the ancient edit-compile-debug cycle. Also, this issue has already been covered in TraditionalApplicationsConfigurationInterfaces and DisconnectedMonoliths.
Posted 2011-05-13 09:15 Tags: agg

Version 0.2 of agg has just been released. It is mostly consistent with the 0.1 versions in terms of bugs, but the output format is completely different.

The previous version created (absolutely poorly formatted) HTML files to represent the news items. This was good enough for my use case. But when I briefly told a friend about this project, he proposed a different use case that was not possible in the current concept.

As always, my subconscious started working on this issue, and some time later something popped into my mind. I've been claiming that agg does the simplest thing that could possibly work, namely only dumping news feed items. But this was not entirely true. In fact, agg also knew a bit about HTML and formatted the output accordingly. This not only violated one of agg's goals (having a single responsibility) but also virtually discarded all meta information. Such meta information, however, is required for use cases like the one proposed by said friend of mine.

Starting with the 0.2 versions, agg represents news items as directories with all supported (and available) properties as single files (currently title, desc(ription) and link). For starters, here's a new (compacted) version of the CLI I posted for the previous versions.
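
Because every property is a plain file, other tools can use the meta information without parsing anything. As a sketch of that idea, the helper below prints the title and link of every item in a feed directory. list_items is a hypothetical name and not part of agg, and the feed contents are made up.

```shell
#!/bin/sh
# Hypothetical helper, not part of agg: print "title<TAB>link" for every
# item directory below the given feed directory.
list_items() {
    for ITEM in "$1"/*/; do
        [ -f "$ITEM/title" ] || continue    # not an item, or an empty feed
        printf '%s\t%s\n' "$(cat "$ITEM/title")" "$(cat "$ITEM/link" 2>/dev/null)"
    done
}

# Made-up feed with two items; the second one has no link file.
FEED=$(mktemp -d)
mkdir "$FEED/1" "$FEED/2"
echo "Bar happened"           > "$FEED/1/title"
echo "http://example.org/bar" > "$FEED/1/link"
echo "Baz announced"          > "$FEED/2/title"

LISTING=$(list_items "$FEED")
echo "$LISTING"
rm -rf "$FEED"
```

A property that is not available is simply a missing file, so the second item is listed with an empty link column.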

agg_each:

CMD="$1"
while [ $# -gt 1 ]; do
    shift
    find "$1" -mindepth 1 -maxdepth 1 -type d -exec "$CMD" '{}' \;
done

agg_read:

ITEM=$1
function delete() { nomtime "rm -rf" "$1"; }
function tui() { agg_htmlize "$1" | elinks; }
function gui() {
    TMPFILE="/tmp/`basename "$1"`.html"
    agg_htmlize "$1" > "$TMPFILE"
    opera "$TMPFILE" # asynchronous if already running
    sleep 3
    rm "$TMPFILE"
}
function TUI() { elinks "`cat "$1/link"`"; }
function GUI() { opera "`cat "$1/link"`"; }

function prompt()
{
    echo "$1"
    CMD=
    while [ "$CMD" != t -a "$CMD" != T -a "$CMD" != g -a "$CMD" != G -a "$CMD" != d -a "$CMD" != n ]; do
        echo -n "[t]ui, [g]ui, [T]UI, [G]UI, [d]elete, [n]ext: "
        read CMD
    done
}

while [ "$CMD" != n -a "$CMD" != d ]; do
    prompt "$ITEM"
    case "$CMD" in
        t)  tui "$ITEM";;
        g)  gui "$ITEM";;
        T)  TUI "$ITEM";;
        G)  GUI "$ITEM";;
        d)  delete "$ITEM";;
        n)  ;;
        *)  exit 1
    esac
done

Htmlizing can work as follows:

TITLE="`cat "$1"/title`"
BODY="`cat "$1"/desc`"
LINK="`cat "$1"/link`"

cat << EOF
<html>
<h1>$TITLE</h1>
<p>$BODY</p>
<p><a href="$LINK">Link: $LINK</a></p>
</html>
EOF

The workflow then looks as follows:

> cd ~/feeds
> $script_to_fetch_all_feeds
> agg_each agg_read *
./Foo Feed/Just in: Bar happened
[t]ui, [g]ui, [T]UI, [G]UI, [d]elete, [n]ext:
...

The items are presented one after another. Each item, its link and all the links it contains can be browsed using the selected browser (text/gui). By using the capitalized key, the respective browser directly opens the link specified by the news item (if any), which is useful for sites that crop their feed items excessively.

Additionally, browsers should not have problems with the file contents anymore, provided you perform proper htmlization that the browser recognizes as such.
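
One part of proper htmlization is escaping characters that have a special meaning in HTML. The following is a minimal sketch, assuming that title and link are plain text; the desc of many feeds already contains HTML and is probably best left as it is. html_escape is a hypothetical helper name, not something agg ships.

```shell
#!/bin/sh
# Hypothetical helper for an agg_htmlize-like script: escape the characters
# that have a special meaning in HTML. "&" must be handled first, otherwise
# the entities produced by the other rules would be escaped again.
html_escape() {
    sed -e 's/&/\&amp;/g' -e 's/</\&lt;/g' -e 's/>/\&gt;/g' -e 's/"/\&quot;/g'
}

TITLE=$(echo 'Foo & Bar: <b> considered "harmful"' | html_escape)
echo "<h1>$TITLE</h1>"
```

In the htmlizing script above, this would mean piping the contents of title and link through html_escape before they are placed into the here-document.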

Now the interface is truly flexible.

Posted 2011-05-11 21:18 Tags: agg

Today I released the first version of agg, a news aggregator following the UNIX philosophy.

As clearly stated, it has many bugs. However, they are predictable and agg is working fine for all of the other feeds I've subscribed to.

Since a news aggregator that "just dumps" the feed's contents into a directory hierarchy provides only rudimentary efficiency[1] (but much higher flexibility and freedom than those monolithic, unprogrammable, totalitarian systems usually used), I've written a small interface for it (or rather for the file system structure).

These scripts are nothing special. They provide only the necessary features and are simple enough to achieve high efficiency[2].

How agg can be used to subscribe to multiple feeds is already shown in the man page. For reading the news items, I'm using two scripts: agg_read to read a specific item of a specific feed, and agg_read_all that calls the former for every item.

agg_read_all is trivial:

find . -type f -exec agg_read '{}' \;

agg_read is still simple:

### CONFIG

TUI_READER=elinks
GUI_READER=opera

### END CONFIG

set -e

ITEM=$1

function delete()
{
    owd="`pwd`"
    cd "`dirname \"$1\"`"
    nomtime "rm '`basename \"$1\"`'"
    cd "$owd"
}

function prompt()
{
    echo "$1"
    CMD=
    while [ "$CMD" != t -a "$CMD" != g -a "$CMD" != d -a "$CMD" != n ]; do
        echo -n "[t]ui, [g]ui, [d]elete, [n]ext: "
        read CMD
    done
}

while [ "$CMD" != n -a "$CMD" != d ]; do
    prompt "$ITEM"
    case "$CMD" in
        t)  $TUI_READER "$ITEM";;
        g)  $GUI_READER "$ITEM";;
        d)  delete "$ITEM";;
        n)  ;;
        *)  exit 1
    esac
done

The workflow then looks as follows:

> cd ~/feeds
> agg_fetch_all
> agg_read_all
./Foo Feed/Just in: Bar happened
[t]ui, [g]ui, [d]elete, [n]ext: 
...

The items are presented one after another. Each item, its link and all the links it contains can be browsed using the selected browser (text/gui).

A simple solution, and only one example of the seemingly endless number of interfaces that could be written for the output of agg. None of them needs to give a damn about XML, RSS, Atom[3] or download logic.

Heck, even agg itself knows nothing of networking!


  1. human-time
  2. as usual, efficiency measured in human-time
  3. Atom not implemented as of yet
Posted 2011-04-08 21:46 Tags: agg
