Threads - an HTML News Repository Generator

Threads is a program for creating news repositories: collections of indexes over articles posted to USENET newsgroups that make it easy to retrieve postings by criteria such as date or subject. A news repository is a WWW home page with links to the various indexes. Threads is used, for example, to manage the news repository for comp.lang.beta maintained at Aarhus University, Denmark.

Threads is written in the object-oriented BETA programming language.

What Types of Indexes are Generated?

Threads generates four different types of indexes:
  1. Index sorted by active threads
  2. Index sorted by date
  3. Index sorted by poster
  4. Index sorted by subject
The date, poster, and subject indexes are listings of all postings sorted by the contents of the Date:, From:, and Subject: lines of the articles, respectively.
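
As an illustration only, the three full listings could be built as in the following Python sketch. This is not the BETA implementation used by Threads; the Posting record, the build_indexes name, the case-insensitive comparison, and the use of the date as a tie-breaker are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Posting:
    """Hypothetical record holding the three header fields used for sorting."""
    date: datetime   # from the Date: line
    poster: str      # from the From: line
    subject: str     # from the Subject: line


def build_indexes(postings):
    """Return the date, poster, and subject listings of all postings.

    Assumption: poster and subject are compared case-insensitively and
    ties are broken by date. The source does not specify either detail.
    """
    return {
        "date": sorted(postings, key=lambda p: p.date),
        "poster": sorted(postings, key=lambda p: (p.poster.lower(), p.date)),
        "subject": sorted(postings, key=lambda p: (p.subject.lower(), p.date)),
    }
```

Every posting appears in all three listings; only the ordering differs.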

The active threads index is similar to the subject index, but includes only those subject lists (or threads) that contain at least one posting less than 14 days old. It is a convenient way to catch up on recent discussions.
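
The 14-day rule can be sketched in Python as follows. The function name, the (subject, date) input format, and the choice to group by exact subject string are assumptions for illustration; Threads itself is written in BETA and normalises subject lines before grouping.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# The source states the limit as "less than 14 days old".
ACTIVE_WINDOW = timedelta(days=14)


def active_threads(postings, now):
    """Group (subject, date) pairs by subject and keep the active threads.

    A thread is kept, with all of its postings, if at least one posting
    is less than 14 days old relative to `now`.
    """
    threads = defaultdict(list)
    for subject, date in postings:
        threads[subject].append(date)
    return {
        subject: sorted(dates)
        for subject, dates in threads.items()
        if any(now - date < ACTIVE_WINDOW for date in dates)
    }
```

Note that an active thread keeps its older postings too, so the whole discussion can be read in context.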

What do I need to Generate News Repositories?

You need access to an NNTP server from which Threads can download news articles, and you need a directory to contain the news repository. The first time Threads is invoked, it will create the necessary directories if they do not already exist (more on the directory structure later). If you want the news repository to contain old articles (older than the ones available from the server), you will have to obtain these by other means (and install them manually - ask for details). Naturally, you also need the threads executable compiled for your platform.

If you're interested, you can obtain the source code for the threads news repository system through FTP (see Download Source).

How do I Create a News Repository?

A news repository is created with the command
   threads -g group -d dir -s server -m mail
where "group" is the name of the newsgroup for which you want a repository, "dir" is the path of the directory where the news repository should be located, "server" is the name of the NNTP server from which news articles can be obtained, and "mail" is the e-mail address of the person responsible for the news repository.

A more realistic example might thus be

   threads -g comp.lang.beta -d ~beta/public_html/News -s news.cs.au.dk -m jlk@cs.au.dk
which would create a news repository for the "comp.lang.beta" newsgroup. The news repository will be located in the directory "~beta/public_html/News", the articles will be fetched from the "news.cs.au.dk" NNTP server, and "jlk@cs.au.dk" will be listed as the person responsible for the news repository.

The result in this case is that the news repository will be accessible through the URL "https://beta.cs.au.dk/News/"

How do I Maintain the News Repository?

The easiest approach is to run the appropriate threads command once a day, probably via a crontab entry (the comp.lang.beta news repository is maintained by a cron job that runs every night).
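
A nightly job could call a small wrapper like the Python sketch below. Only the -g, -d, -s, and -m options come from the description above; the function names are hypothetical, and the sketch assumes that a threads executable is on the PATH of the user running the job.

```python
import subprocess


def build_threads_command(group, repo_dir, server, mail):
    """Build the argument list for one threads run.

    Only the four options documented above are used. No shell is involved,
    so a path such as "~beta/..." is passed on unexpanded; use an absolute
    path (or expand it first) when calling this from a cron job.
    """
    return ["threads", "-g", group, "-d", repo_dir, "-s", server, "-m", mail]


def run_threads(command):
    """Run the command and return its exit status (assumes threads is on PATH)."""
    return subprocess.run(command, check=False).returncode
```

A cron job would then call run_threads(build_threads_command(...)) once per night with the same four values used when the repository was created.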

The Directory Structure of a News Repository

The directory structure of a news repository is:
   "dir"/		(the path given as argument -d to threads)
      index.html
      volume1996/
         active.html	(only in the current year)
         date.html
         subject.html
         poster.html
         news/
            (contains one file for each article in 1996)
      volume1995/
         date.html
         subject.html
         poster.html
         news/
            (contains one file for each article in 1995)
      volume1994/
         date.html
         subject.html
         poster.html
         news/
            (contains one file for each article in 1994)
assuming that this news repository has been active in 1994-1996, and that 1996 is the current year.
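
The layout above can be summarised in a short Python sketch that computes the expected paths for a volume. The function names are hypothetical and the code only mirrors the tree shown here; it is not taken from the Threads source.

```python
from pathlib import PurePosixPath

# Index files present in every volume, in the order listed above.
INDEX_NAMES = ("date.html", "subject.html", "poster.html")


def volume_files(repo_dir, year, current_year):
    """Return the index file paths expected in the volume for `year`.

    active.html is only expected in the volume of the current year.
    """
    volume = PurePosixPath(repo_dir) / f"volume{year}"
    names = list(INDEX_NAMES)
    if year == current_year:
        names.insert(0, "active.html")
    return [volume / name for name in names]


def news_dir(repo_dir, year):
    """Return the directory holding one file per article posted in `year`."""
    return PurePosixPath(repo_dir) / f"volume{year}" / "news"
```

For example, volume_files("News", 1995, 1996) lists only the three per-year indexes, while the 1996 volume additionally gets active.html.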

The active.html, date.html, subject.html, and poster.html files for the current year (here 1996) will be regenerated each time Threads is invoked.

The index.html file is only created the first time Threads is invoked. You have to edit it manually to make it refer to new volumes each year.

The automatically generated index.html file is only meant to give you an idea of how it might look; you are free to edit it in any way you wish. Threads will never modify this file. It only creates a new index.html file if it finds that the file is missing.

How do I Regenerate the Index Files?

If you want to regenerate the indexes, the only thing you have to do is to remove the .html files you wish to have regenerated. Then Threads will automatically regenerate them next time it is run (also for old volumes).

One reason for doing this is that you are cleaning up a volume (removing irrelevant news entries from the news/ directory in that volume). You can then create clean indexes this way.

Another reason might be that the index files have been destroyed in one way or another.
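
Removing the index files of a volume can be scripted, for instance as in the Python sketch below. The function name is hypothetical; the sketch only deletes .html files directly inside the volume directory, so the articles under news/ and the top-level index.html are left alone.

```python
from pathlib import Path


def remove_indexes(repo_dir, year):
    """Delete the .html index files of one volume and return their names.

    The glob is not recursive, so nothing inside news/ is touched.
    Threads is then expected to regenerate the indexes on its next run.
    """
    volume = Path(repo_dir) / f"volume{year}"
    removed = []
    for index in sorted(volume.glob("*.html")):
        index.unlink()
        removed.append(index.name)
    return removed
```

To regenerate a single index, it is enough to delete just that one file by hand instead.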

I'm curious. How do you Parse the News Articles?

A certain amount of normalisation on the contents of the date, poster, and subject lines takes place before processing begins. For example,
   Date: 1 Jan 1995 12:00:00 GMT
   Date: Sun, 01 Jan 1995 12:00:00 +0000
are both acceptable ways to specify noon, January 1st, 1995. Similarly,
   From: jacobse@cs.au.dk (Jacob Seligmann)
   From: "Jacob Seligmann" <jacobse@cs.au.dk>
will both be classified as postings from "Jacob Seligmann".
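
To give a feel for this kind of normalisation, here is a minimal Python sketch using the standard email.utils module. It is not the heuristic code used by Threads (which is written in BETA); the function names are hypothetical, and treating a date without a time zone as UTC is an assumption.

```python
from datetime import datetime, timezone
from email.utils import parseaddr, parsedate_to_datetime


def header_value(line):
    """Strip the field name ("Date:", "From:") from a header line."""
    return line.split(":", 1)[1].strip()


def normalise_date(line) -> datetime:
    """Parse a Date: line and convert it to UTC for comparison."""
    parsed = parsedate_to_datetime(header_value(line))
    if parsed.tzinfo is None:
        # Assumption: a date without a time zone is treated as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc)


def normalise_poster(line) -> str:
    """Return the poster's real name, falling back to the e-mail address."""
    name, address = parseaddr(header_value(line))
    return name or address
```

With these helpers, the two Date: lines above compare equal, and both From: lines yield "Jacob Seligmann".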

The Threads news repository generator has been running for quite some time now, and has been tested on a large number of postings from a wide range of newsgroups. It correctly handles all articles I have yet encountered, but it is quite possible that the heuristics may fail if confronted with exotic field formats I haven't come across. If this happens, please send me a copy of the troublesome article so I can extend the normalisation mechanism accordingly.

Threads Development History

Threads was originally inspired by the news archive available through The Eiffel Page.

The first BETA versions were created by Jacob Seligmann, Aarhus University, jacobse@cs.au.dk. These versions used Perl scripts etc. for obtaining the news articles, and required direct access to the news spool directories on the NNTP server machine.

The current version (implemented entirely in BETA, and using the NNTP protocol to obtain the articles) was created by Jørgen Lindskov Knudsen, Aarhus University, jlknudsen@cs.au.dk.

Download Source

The full source can be downloaded for free from
   ftp://ftp.cs.au.dk/pub/beta/threads

To compile it, you will need The Mjølner System. Install this and simply compile the threads.bet file:
   beta threads

Jørgen Lindskov Knudsen / Aarhus University / jlknudsen@cs.au.dk