agg is a news aggregator that stays focused on its goal, namely reading a feed and creating a representation that can be worked with in the whole system. The process is straightforward: store all items[1] that have a publication date newer than that of the latest item received previously.
That works remarkably well in the common case. A less common case is feeds without publication dates for their items. In fact, the pubDate tag is not required according to the RSS 2.0 specification. What to do in this case?
The solution is simple: if the author ignored publication dates, the aggregator shall do so, too. This has the inconvenient effect that once you've read all items and probably deleted them afterwards, running agg again on that feed will lead to all these items being there again!
So we need some form of caching[2]. However, such a rare case, which has little to do with agg's job anyway, should not end up in the same piece of software. Here are my tools for the output format of the 0.2 versions:
agg_filter to remove an item iff it has already been cached:
ITEM_NAME="$1"
FEED_PATH="`dirname "$ITEM_NAME"`"
FEED_NAME="`basename "$FEED_PATH"`"
# Remove the item if its hash is already in the feed's cache.
grep -q "`agg_hash "$ITEM_NAME"`" ".$FEED_NAME.cache" && nomtime "rm -rf" "$ITEM_NAME"
agg_cache to cache an item:
ITEM_NAME="$1"
FEED_PATH="`dirname "$ITEM_NAME"`"
FEED_NAME="`basename "$FEED_PATH"`"
agg_hash "$ITEM_NAME" >> ".$FEED_NAME.cache"
agg_hash to create a hash of an item:
cat "$1/title" "$1/desc" "$1/link" 2>/dev/null | sha1sum | awk '{ print $1; }'
With the agg_each script posted for agg 0.2, the scripts are usually used as follows:
> $fetch_all_feeds_and_pipe_to_agg
> agg_each agg_filter "feed without pubDates"
> agg_each agg_cache "feed without pubDates"
> agg_each agg_read *
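The filter/cache round trip can be tried in isolation. What follows is only a minimal sketch under a few assumptions: the function names (item_hash, is_cached, remember), the demo feed and the cache file location are made up for illustration, and sha1sum is assumed to be installed as in agg_hash above.

```shell
#!/bin/sh
# Hash an item directory the way agg_hash does: title, desc and link
# concatenated (missing files are skipped), then SHA-1.
item_hash() {       # item_hash ITEM_DIR
    cat "$1/title" "$1/desc" "$1/link" 2>/dev/null | sha1sum | awk '{ print $1 }'
}

# Succeed if the item's hash is already listed in the cache file.
is_cached() {       # is_cached ITEM_DIR CACHE_FILE
    [ -f "$2" ] && grep -qxF "$(item_hash "$1")" "$2"
}

# Append the item's hash to the cache file unless it is already there.
remember() {        # remember ITEM_DIR CACHE_FILE
    is_cached "$1" "$2" || item_hash "$1" >> "$2"
}

# Throwaway demo data (hypothetical feed and item names).
demo="$(mktemp -d)"
mkdir -p "$demo/feed/item"
echo "Bar happened" > "$demo/feed/item/title"
echo "http://example.org/bar" > "$demo/feed/item/link"

is_cached "$demo/feed/item" "$demo/.feed.cache" || echo "first run: new item"
remember  "$demo/feed/item" "$demo/.feed.cache"
is_cached "$demo/feed/item" "$demo/.feed.cache" && echo "second run: already seen"
rm -rf "$demo"
```

Unlike agg_cache, remember skips hashes that are already cached, so the cache file does not grow on every run.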
But why this hassle? Why not just integrate it into agg, or use a real newsreader for that matter?
To be fair, on operating systems with more powerful concepts, agg is unnecessary since it basically does nothing more than performing a deserialization of a news feed.
UNIX, however, is a rudimentary operating system and lacks powerful concepts. In this case the problem boils down to the completely rudimentary native objects and methods of communication.
When a set of people communicates openly with the requirement that everyone should be able to take part in the communication, every person in the set must speak in a way everyone can understand. Thus, the communication can only be as smart as the dumbest person in that set. Else, you'd have to introduce even more people into the set because translators between the smart and dumb ones are required. Not only is this process cumbersome, but also you'll be occasionally lost in translation.
Speaking in terms of this metaphor, UNIX is dumb. Processes are only expected to communicate via streams of lines of text, files and directories[3]; this is especially true for the whole base system.
So, in order for users to be able to actually use the system (as opposed to using yet another application), agg either has to be file-based or needs even more (and even more complex) auxiliary tools, which indirectly leads to exactly the problems mentioned here.
Yes, it's a hassle. But it's the only practical way to at least partially achieve something Alan Kay has explained in his talk The Computer Revolution Hasn't Happened Yet at OOPSLA 1997:
Well I had programmed Caesar Franck's heroic piece—and if you know this piece, it is made for the largest organs that have ever been made. The loudest organs that have ever been made, in the largest cathedrals that had ever been made, because it's a nineteenth century symphonic type organ work, and Biggs was asking my friend to play this on this dinky, little organ.—He said, But how can I play this, on this? Biggs, he said, Just play it grand. Just play it grand. To stay with the future as it moves, is to always play your systems more grand than they seem to be right now.
- Currently there's a bug, but this is the concept anyways.
- A cache of your brain, that is, so that the software doesn't try to make you read data that's already in there again.
- Yes, and *argv[], fooenv(), shmfoo() etc. They don't matter in our case since we have "complex" data structures and no sane person would ever imagine using a computer by writing C and following the ancient edit-compile-debug cycle. Also, this issue has already been covered in TraditionalApplicationsConfigurationInterfaces and DisconnectedMonoliths.
Version 0.2 of agg has just been released. It is mostly consistent with the 0.1 versions in terms of bugs, but the output format is completely different.
The previous version created (absolutely poorly formatted) HTML files to represent the news items. This was good enough for my use case. But when I briefly told a friend about this project, he proposed a different use case that was not possible in the current concept.
As always, my subconscious started working on this issue, and some time later something popped into my mind. I've been claiming that agg does the simplest thing that could possibly work, namely only dumping news feed items. But this was not entirely true. In fact, agg also knew a bit about HTML and formatted the output accordingly. This not only violated one of agg's goals (having a single responsibility) but also virtually discarded all meta information. Such meta information, however, is required for use cases like the one proposed by said friend of mine.
Starting with the 0.2 versions, agg represents news items as directories with all supported (and available) properties as single files (currently title, desc(ription) and link). For starters, here's a new (compacted) version of the CLI I posted for the previous versions.
agg_each:
CMD="$1"
while [ $# -gt 1 ]; do
shift
find "$1" -mindepth 1 -maxdepth 1 -type d -exec "$CMD" '{}' \;
done
agg_read:
ITEM="$1"
function delete() { nomtime "rm -rf" "$1"; }
function tui() { agg_htmlize "$1" | elinks; }
function gui() {
TMPFILE="/tmp/`basename "$1"`.html"
agg_htmlize "$1" > "$TMPFILE"
opera "$TMPFILE" # asynchronous if already running
sleep 3
rm "$TMPFILE"
}
function TUI() { elinks "`cat "$1/link"`"; }
function GUI() { opera "`cat "$1/link"`"; }
function prompt()
{
echo "$1"
CMD=
while [ "$CMD" != t -a "$CMD" != T -a "$CMD" != g -a "$CMD" != G -a "$CMD" != d -a "$CMD" != n ]; do
echo -n "[t]ui, [g]ui, [T]UI, [G]UI, [d]elete, [n]ext: "
read CMD
done
}
while [ "$CMD" != n -a "$CMD" != d ]; do
prompt "$ITEM"
case "$CMD" in
t) tui "$ITEM";;
g) gui "$ITEM";;
T) TUI "$ITEM";;
G) GUI "$ITEM";;
d) delete "$ITEM";;
n) ;;
*) exit 1
esac
done
Htmlizing can work as follows:
TITLE="`cat "$1"/title`"
BODY="`cat "$1"/desc`"
LINK="`cat "$1"/link`"
cat << EOF
<html>
<h1>$TITLE</h1>
<p>$BODY</p>
<p><a href="$LINK">Link: $LINK</a></p>
</html>
EOF
The workflow then looks as follows:
> cd ~/feeds
> $script_to_fetch_all_feeds
> agg_each agg_read *
./Foo Feed/Just in: Bar happened
[t]ui, [g]ui, [T]UI, [G]UI, [d]elete, [n]ext:
...
Each item, one after another, its link and all the links it contains can be browsed using the browser (text/gui) selected. By using the capitalized key, the respective browser directly opens the link specified by the news item (if any), which is useful for sites that crop their feed items excessively.
Additionally, browsers should not have problems with the file contents anymore, provided you perform proper htmlization that the browser recognizes as such.
Now the interface is truly flexible.
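Because every property is a plain file, other interfaces need only a few lines. As a rough sketch, assuming the layout described above (one directory per item below the feed directory, containing title, desc and link), the following prints a title/link index of a feed. The function names and the demo data are made up for illustration.

```shell
#!/bin/sh
# Print one property of an item; prints nothing if the property file is absent.
item_prop() {       # item_prop ITEM_DIR PROPERTY
    if [ -f "$1/$2" ]; then cat "$1/$2"; fi
}

# Print "title <TAB> link" for every item directory directly below a feed.
feed_index() {      # feed_index FEED_DIR
    for item in "$1"/*/; do
        [ -d "$item" ] || continue
        printf '%s\t%s\n' "$(item_prop "$item" title)" "$(item_prop "$item" link)"
    done
}

# Throwaway demo data (hypothetical feed and item names).
feeds="$(mktemp -d)"
mkdir -p "$feeds/Foo Feed/Just in: Bar happened"
echo "Just in: Bar happened" > "$feeds/Foo Feed/Just in: Bar happened/title"
echo "http://example.org/bar" > "$feeds/Foo Feed/Just in: Bar happened/link"

feed_index "$feeds/Foo Feed"
rm -rf "$feeds"
```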
Today, I wrote some code calculating the digital root of a (positive) integer, just for fun.
Again, I used a hypothetical language, here presented using the syntax of Smalltalk. The resulting code is pretty straightforward to read:
i := 4528.
[ :i | i := (i asString asArray collect: #asInteger) map: #+ onto: 0 ] until: [ i < 10 ].
Obviously, we cannot collect the single digits of a string as integers. The result would be a string... of integers?
map:onto: has been used as a hypothetical object-oriented implementation of the ideas behind inject:into:. The latter cannot be used in conjunction with the "method" +[1]. This may come as a surprise to someone who has used the Symbol>>value: hack too frequently. In fact, that hack creates the impression that certain methods of the Collection framework send messages to the individual elements. That is wrong; it's all about evaluating a block with them as parameter.
Additionally, I encountered a really ugly inconsistency in Integer>>asCharacter and Character>>asInteger. Consider the following snippet:
3 asCharacter asInteger.
What does it evaluate to? 3? Wrong. It's 51. Because that's the ASCII value of the character 3.
It is blatantly obvious that these methods should be inverses of each other and that no low-level detail like character encoding should be leaked on such a high level.
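Where the 51 comes from can be checked outside of Smalltalk as well. The following is just a small shell sketch: char_code and digit_value are hypothetical helper names, and it relies on the POSIX printf rule that a numeric argument starting with a quote yields the code of the character that follows.

```shell
#!/bin/sh
# Print the character code of the first character of the argument.
char_code() { printf '%d\n' "'$1"; }

# Map a digit character to its numeric value by subtracting the code of '0'.
digit_value() { echo $(( $(char_code "$1") - $(char_code 0) )); }

char_code 3     # prints 51, the ASCII code of the character 3
digit_value 3   # prints 3
```

The subtraction in digit_value is exactly the low-level detail that leaks when the two conversions are not inverses of each other.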
Having considered these problems, here's some real Smalltalk code that runs in Squeak:
i := 4528.
[ i < 10 ] whileFalse:
[ i := ((i asString asArray collect: #asString)
collect: #asInteger) inject: 0 into:
[:sum :each | sum + each]].
Actually, we don't need to collect: #asInteger, since converting characters or strings in arithmetical operations is done implicitly in Squeak. In other words, we're losing type safety.
Yeah, and that's how to calculate the digital root!
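For comparison, the same loop can be sketched in plain shell arithmetic: sum the decimal digits, and repeat until a single digit is left. This is only an illustration for non-negative integers; the function names are made up.

```shell
#!/bin/sh
# Sum of the decimal digits of a non-negative integer.
digit_sum() {
    n=$1
    sum=0
    while [ "$n" -gt 0 ]; do
        sum=$(( sum + n % 10 ))
        n=$(( n / 10 ))
    done
    echo "$sum"
}

# Digital root: replace the number by its digit sum until it is below 10.
digital_root() {
    r=$1
    while [ "$r" -ge 10 ]; do
        r=$(digit_sum "$r")
    done
    echo "$r"
}

digital_root 4528   # prints 1 (4+5+2+8 = 19, 1+9 = 10, 1+0 = 1)
```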
- Yeah, it's not a method, see the following link.
Today I released the first version of agg, a news aggregator following the UNIX philosophy.
As clearly stated, it has many bugs. However, they are predictable and agg is working fine for all of the other feeds I've subscribed to.
Since a news aggregator that "just dumps" the feed's contents into a directory hierarchy provides only rudimentary efficiency[1] (but much higher flexibility and freedom than those monolithic, unprogrammable, totalitarian systems usually used), I've written a small interface for it (or rather for the file system structure).
These scripts are nothing special. They provide only the necessary features and are simple enough to achieve high efficiency[2].
How agg can be used to subscribe to multiple feeds is already shown in the man page. For reading the news items, I'm using two scripts: agg_read to read a specific item of a specific feed, and agg_read_all that calls the former for every item.
agg_read_all is trivial:
find . -type f -exec agg_read '{}' \;
agg_read is still simple:
### CONFIG
TUI_READER=elinks
GUI_READER=opera
### END CONFIG
set -e
ITEM="$1"
function delete()
{
owd="`pwd`"
cd "`dirname \"$1\"`"
nomtime "rm '`basename \"$1\"`'"
cd "$owd"
}
function prompt()
{
echo "$1"
CMD=
while [ "$CMD" != t -a "$CMD" != g -a "$CMD" != d -a "$CMD" != n ]; do
echo -n "[t]ui, [g]ui, [d]elete, [n]ext: "
read CMD
done
}
while [ "$CMD" != n -a "$CMD" != d ]; do
prompt "$ITEM"
case "$CMD" in
t) $TUI_READER "$ITEM";;
g) $GUI_READER "$ITEM";;
d) delete "$ITEM";;
n) ;;
*) exit 1
esac
done
The workflow then looks as follows:
> cd ~/feeds
> agg_fetch_all
> agg_read_all
./Foo Feed/Just in: Bar happened
[t]ui, [g]ui, [d]elete, [n]ext:
...
And each item, one after another, its link and all the links it contains can be browsed using the browser (text/gui) selected.
A simple solution, and only an example of the seemingly endless amount of interfaces that could be written for the output of agg. None of them needs to give a damn about XML, RSS, Atom[3] or download logic.
Heck, even agg itself knows nothing of networking!
- human-time
- as usual, efficiency measured in human-time
- Atom not implemented as of yet
Generic Programming
Admittedly, I don't really know what to write about this topic. Programming is generic by nature: You specify a list of instructions to be applied to abstract input.
Well, "Generic Programming" is in vogue, which is not surprising since it allows for better abstraction. Better abstraction than the average mainstream language provides, that is. "Generic Programming" is nothing more than hacking statically typed languages with a more dynamic approach.
To state the obvious again, here's a tiny example.
// C++ std::max
template <class T> const T& max(const T& a, const T& b)
{
return b < a ? a : b;
}
"Same algorithm in Smalltalk"
maxOf: a and: b
b < a ifTrue: [ ^a ] ifFalse: [ ^b ].
"Or, using proper OOP:"
max: b
self < b ifTrue: [ ^self ] ifFalse: [ ^b ].
It is somewhat funny that major languages and/or their compilers have problems implementing features to allow for "Generic Programming" -- and monstrous standards and a complex and inconsistent syntax are problems, too. Languages like Smalltalk support this approach out of the box, and programmers use it intuitively.
Let's just let the term "Generic Programming" die -- it is the most natural form of programming and deserves special treatment only in flawed languages.
Coding Standards Are Misleading
The functional aspect of software is almost always disregarded in coding standards. At best, they focus on technical aspects, but most of them focus primarily on syntax. They explain how source code should be formatted to be readable. It has been proven that maintainability is important. However, consistent and clear syntax merely creates an illusion of clean and understandable code.
Syntax is just an arbitrary representation of software. In a sophisticated environment, you could redefine the syntax of the language easily. E.g. changing the characters for blocks from [] to {} would be one command (two commands if there were conflicts). Complex redefinitions would be possible, too, as well as displaying and modifying the same code in a graphical way (21st century, anyone?).
We have to think of the source code that we write as just one of several ways to design software. Software does not need to be directly bound to what we write. Software is rather a form of byte code -- semantically correct and independent from any representation. If we think of software as being objects and message sends that exist beyond textual (or any other) source, we have laid the foundation for a language that can evolve and adapt its representation to various display formats or users.
But what does this mean for coding standards?
Well, if the code you write does not have a syntax but will be displayed using the syntax rules of the one viewing it, most (parts of) coding standards have lost their right to exist.
However, software engineering has standards.
Few coding standards imply or mention that programmers will or should also regard the source-level design. But that topic is far too large to be discussed in a coding standard and could fill several books. And, in fact, it already has: There are, for example, works on design principles and smells[1], design patterns[2], refactoring[3][4] and code smells[5][6]. Developers should also be able to choose the right data structures and algorithms for the task at hand -- an issue that fills a lot of other books[7][8][9].
Yet, the most important point has not been mentioned: The reason why we write software. Usually, we program (directly or indirectly) for some kind of customer, even if it's only ourselves. Coding standards in no way specify any goals of or approaches to interaction design, usability engineering or whatever one might call it. There are, again, various books but since I don't know whether they're worth it, I'll spare you.
Coding standards seem to favor syntax over technical aspects over functionality. A sure way to create conceptually and technically broken software.
References
- Martin 2003: Agile Software Development. Principles, Patterns, Practices.
- The Gang of Four 1994: Design Patterns: Elements of Reusable Object-Oriented Software
- Fowler & Beck 1999: Refactoring. Improving the design of existing code.
- Kerievsky 2004: Refactoring to Patterns.
- Fowler & Beck 1999: Refactoring. Improving the design of existing code.
- Mäntylä 2003: Bad Smells in Software -- a Taxonomy and an Empirical Study
- Knuth 1968
- Knuth 1969
- Knuth 1973
Sanity ranking, take two
Suppose there is a directory that contains files which shall be sorted according to the second character of their name.
Windows
Launch a browser and search for a tool that can do this, if not already installed. If you don't find one, you're screwed.
Unix
ls | sed "s/\(.\)\(.*\)/\2 \1/g" | sort | sed "s/\(.*\) \(.\)/\2\1/g"
Sorting is done by manipulating the data you're querying so that it can be sorted. Afterwards the corrupted data is restored manually.
Irrational, potentially faulty and requires a little knowledge of regex.
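The same manipulate-sort-restore dance can be spelled out step by step. The sketch below is a variation, not the pipeline above: instead of moving the first character to the end, it prefixes each line with its second character as a sort key and cuts the key off afterwards. The function name is made up, sort -s (stable sort) is assumed to be available as in GNU and BSD sort, and names containing tabs or newlines will still break it.

```shell
#!/bin/sh
# Sort lines read from stdin by their second character:
# 1. decorate: prefix each line with its second character and a tab,
# 2. sort on that key only (-s keeps the input order of equal keys),
# 3. undecorate: cut the key off again.
sort_by_second_char() {
    awk '{ print substr($0, 2, 1) "\t" $0 }' |
        LC_ALL=C sort -s -t "$(printf '\t')" -k 1,1 |
        cut -f 2-
}

# Prints banana, cherry, apple (keys: a, h, p).
printf '%s\n' banana apple cherry | sort_by_second_char
```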
Smalltalk
filenames asSortedCollection: [ :a :b | (a at: 2) < (b at: 2) ].
Readable and understandable, logically correct and less tedious to type than the unixish solution.
Security Hole
It seems that the use of a single account (that is regularly in contact with the outside) as primary account is fairly widespread. This account seems to be often allowed to log into root by running su.
This is a very dumb idea. Typing the passphrase for root into a potentially compromised system environment is about as secure as working as root directly.
Let's just assume that an application had a security hole and someone tampered with your environment. Since traditional operating systems use the broken concept of access control lists instead of the capability security model, this is not hard to imagine. Run the following line, and then run su.
alias su='echo -n "Password: "; read -s PASS; echo; echo "Hello $YOUR_NAME,"; echo "You have been hacked."; echo "Password was \"$PASS\"."; echo'
The su you're calling might always be a trojan. You might not notice until it's too late.
If you want to do something as root, switch to a virtual console. Since the login screen might not be the login screen but a phishy process, press CTRL + ALT + SYSRQ + k to kill it and force a reload of the real login screen.
Splitting a mailbox (sanity ranking I)
Splitting up a mailbox depending on the recipient's addresses sounds like an easy task: Look at each mail and move it into the mailbox named after the recipient.
Simple things should be simple, complex things should be possible. -- Alan Kay
I'm stuck with GNU/Linux because of the lack of significantly better alternatives. I'll take this system as an example for unixoid systems, as it should mirror the Unix way of doing things quite well.
Many of the basic tools in Unix work on a stream of information, which is processed line by line. Information is typically stored in plaintext files which are organized in a hierarchy of directories. As email is plaintext, too, and maildir stores each email in a separate file, splitting such a mailbox should be an easy task for unixoid systems.
I don't want to go into detail about the script that I had to hack together -- it doesn't even work properly. Anyway, here's the code:
#!/bin/sh
ADDR="`grep '^To: ' "$1" | sed -e 's/To: //'`"
echo $ADDR | tr '[:space:]' '\n' | line::each line::noempty > /tmp/msplit
ADDR="`cat /tmp/msplit | \
egrep -i 'foo@example.org|bar@example.org' | \
tr -d '<' | tr -d '>' | tr -d '"' `"
if [ -n "$ADDR" ]; then
mkdir -p "$ADDR"
mv "$1" "$ADDR"
fi
The script had to be called for every email (see find and xargs).
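Calling a script once per mail might look roughly like the following sketch. The post does not give the script's file name, so the command is passed in as a parameter; for_each_mail and the demo mailbox are made up, and echo stands in for the split script.

```shell
#!/bin/sh
# Run a command once per mail file below a maildir-style directory.
# -type f skips the cur/new/tmp directories themselves.
for_each_mail() {   # for_each_mail MAILDIR COMMAND
    find "$1" -type f -exec "$2" '{}' \;
}

# Throwaway demo: two empty "mails", with echo standing in for the split script.
box="$(mktemp -d)"
mkdir -p "$box/cur" "$box/new"
: > "$box/cur/mail1"
: > "$box/new/mail2"
for_each_mail "$box" echo
rm -rf "$box"
```

In practice the second argument would be the path to the script above.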
The question is: How could one accomplish this task on a sane operating system?
newBoxes := Dictionary new.
mails collect: #recipient thenDo: [ :each | newBoxes at: each put: Mailbox new ].
mails do: [ :each | (newBoxes at: each recipient) add: each ].