Path: news.daimi.aau.dk!jlk
From: jlk@daimi.aau.dk (J|rgen Lindskov Knudsen)
Newsgroups: daimi,daimi.beta
Subject: Off-line News Reading and News Archival
Date: 28 Mar 1996 07:58:08 GMT
Organization: DAIMI, Computer Science Dept. at Aarhus University
Lines: 221
Distribution: daimi
Message-ID: <4jdgqg$ki2@gjallar.daimi.aau.dk>
Reply-To: jlknudsen@daimi.aau.dk (Jorgen Lindskov Knudsen)
NNTP-Posting-Host: lithium.daimi.aau.dk
Xref: news.daimi.aau.dk daimi:13872 daimi.beta:10185

Off-line News Reading and News Archival
---------------------------------------

If you wish to read news articles off-line, or to maintain an archive
of news articles, have a look at the Threads News Repository Generator.
Threads is a public domain application that implements an NNTP client
capable of maintaining such archives.

Below you will find a more detailed description of the facilities of
Threads, including the FTP address from which you can obtain its source
code.

Threads is written in the object-oriented BETA programming language.
For more information on BETA, have a look at the BETA Home Page
(URL http://www.daimi.aau.dk/~beta/info/).

Regards,
Jorgen Lindskov Knudsen, Computer Science Department, Aarhus University
Ny Munkegade 116, DK-8000 Aarhus C, DENMARK
E-mail: jlknudsen@daimi.aau.dk, Phone: +45 89 42 32 33, Fax: +45 89 42 32 55

************ BETA information Sources ************************
* WWW:     http://www.mjolner.dk                             *
*          http://www.daimi.aau.dk/~beta/info                *
* News:    comp.lang.beta                                    *
* FAQ:     http://www.daimi.aau.dk/~beta/FAQ                 *
* E-mail:  info@mjolner.dk                                   *
* Address: Mjolner Informatics, Science Park Aarhus,         *
*          Gustav Wieds Vej 10, DK-8000, Aarhus C, DENMARK   *
* Tel.:    +45 86 20 20 00                                   *
* Fax.:    +45 86 20 12 22                                   *
**************************************************************

----------------------------------------------------------------------------

Threads - an HTML News Repository Generator

Threads is a program used to create news repositories containing
indexes of articles posted to USENET newsgroups, facilitating easy
retrieval of postings according to different criteria, such as date or
subject. A news repository is a WWW Home Page with links to the various
indexes.

It is used, for example, to manage the news repository for
comp.lang.beta maintained at Aarhus University, Denmark
(URL: http://www.daimi.aau.dk/~beta/News).

Contents

   * What Types of Indexes are Generated?
   * What do I need to Generate News Repositories?
   * How do I Create a News Repository?
   * How do I Maintain the News Repository?
   * The Directory Structure of a News Repository
   * How do I Regenerate the Index Files?
   * I'm curious. How do you Parse the News Articles?
   * Threads Development History

----------------------------------------------------------------------------

What Types of Indexes are Generated?

Threads generates four different types of indexes:

  1. Index sorted by active threads
  2. Index sorted by date
  3. Index sorted by poster
  4. Index sorted by subject

The date, poster, and subject indexes are listings of all postings
sorted by the contents of the Date:, From:, and Subject: lines of the
articles, respectively.
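
As a rough illustration of how the three sorted listings relate to the
header lines, the following Python sketch sorts a handful of made-up
articles. It is not taken from the Threads source (which is written in
BETA): the sample articles, the build_indexes name, and the choice of
case-insensitive string comparison for the From: and Subject: lines are
all assumptions made for the example.

```python
from email.utils import parsedate_to_datetime

# Hypothetical sample articles; only the three headers the indexes use.
ARTICLES = [
    {"Date": "Tue, 03 Jan 1995 09:30:00 +0000",
     "From": "bob@example.org (Bob)",
     "Subject": "Virtual patterns"},
    {"Date": "1 Jan 1995 12:00:00 GMT",
     "From": "carol@example.org (Carol)",
     "Subject": "Announcing Threads"},
    {"Date": "Mon, 02 Jan 1995 18:00:00 +0000",
     "From": "alice@example.org (Alice)",
     "Subject": "Compiler question"},
]


def build_indexes(articles):
    """Return the date, poster, and subject listings as sorted lists."""
    return {
        # Dates are parsed first so that differing formats compare correctly.
        "date": sorted(articles, key=lambda a: parsedate_to_datetime(a["Date"])),
        # Assumption: posters and subjects are compared case-insensitively.
        "poster": sorted(articles, key=lambda a: a["From"].lower()),
        "subject": sorted(articles, key=lambda a: a["Subject"].lower()),
    }


indexes = build_indexes(ARTICLES)
for name, listing in indexes.items():
    print(name, [article["Subject"] for article in listing])
```

Each listing contains every article; only the ordering differs.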

The active threads index is similar to the subject index, but includes
only those subject lists (or threads) that contain at least one posting
less than 14 days old. It is a convenient way to catch up on recent
discussions.

----------------------------------------------------------------------------

What do I need to Generate News Repositories?

You need access to an NNTP server from which Threads can download news
articles, and you need a directory to contain the news repository. The
first time Threads is invoked, it will create the necessary directories
if they do not already exist (more on the directory structure later).

If you want the news repository to contain old articles (older than the
ones available from the server), you will have to obtain these by other
means (and install them manually - ask for details).

Naturally, you also need the threads executable compiled for your
platform. If you're interested, you can obtain the source code for the
threads news repository system from
(URL: ftp://ftp.daimi.aau.dk/pub/beta/threads/).

----------------------------------------------------------------------------

How do I Create a News Repository?

Use the command

    threads -g group -d dir -s server -m mail

where "group" is the name of the newsgroup for which you want a
repository, "dir" is the path of the directory where the news
repository should be located, "server" is the name of the NNTP server
from which news articles can be obtained, and "mail" is the e-mail
address of the person responsible for the news repository.

A more realistic example might thus be

    threads -g comp.lang.beta -d ~beta/public_html/News -s news.daimi.aau.dk -m jlk@daimi.aau.dk

which would create a news repository for the "comp.lang.beta"
newsgroup. The news repository will be located in the directory
"~beta/public_html/News", will use the "news.daimi.aau.dk" NNTP server,
and will list "jlk@daimi.aau.dk" as the person responsible for the news
repository.
The result in this case is that the news repository will be accessible
through the URL "http://www.daimi.aau.dk/~beta/News/".

----------------------------------------------------------------------------

How do I Maintain the News Repository?

The easiest approach is to run the appropriate threads command once a
day, preferably from a crontab entry (the comp.lang.beta news
repository is maintained by a cron job that runs every night).

----------------------------------------------------------------------------

The Directory Structure of a News Repository

The directory structure of a news repository is:

    "dir"/                (the path given as argument -d to threads)
        index.html
        volume1996/
            active.html   (only in the current year)
            date.html
            subject.html
            poster.html
            news/         (contains one file for each article in 1996)
        volume1995/
            date.html
            subject.html
            poster.html
            news/         (contains one file for each article in 1995)
        volume1994/
            date.html
            subject.html
            poster.html
            news/         (contains one file for each article in 1994)

assuming that this news repository has been active in 1994-1996, and
that 1996 is the current year.

The active.html, date.html, subject.html, and poster.html files for the
current year (here 1996) will be regenerated each time Threads is
invoked.

The index.html file is only created the first time Threads is invoked.
You have to edit it manually to make it refer to new volumes each year.
The automatically generated index.html file is just meant to give you
an idea of how it might look; you are free to edit it in any way you
wish. Threads will never modify this file. It only creates a new
index.html file if it finds that the file is missing.

----------------------------------------------------------------------------

How do I Regenerate the Index Files?

If you want to regenerate the indexes, all you have to do is remove the
.html files you wish to have regenerated.
Threads will then automatically regenerate them the next time it is run
(also for old volumes).

One reason for doing this is that you are cleaning up a volume
(removing irrelevant news entries from the news/ directory in a
volume); you can then create clean indexes this way. Another reason
might be that the index files have been destroyed one way or another.

----------------------------------------------------------------------------

I'm curious. How do you Parse the News Articles?

A certain amount of normalisation of the contents of the date, poster,
and subject lines takes place before processing begins. For example,

    Date: 1 Jan 1995 12:00:00 GMT
    Date: Sun, 01 Jan 1995 12:00:00 +0000

are both acceptable ways to specify noon, January 1st, 1995. Similarly,

    From: jacobse@daimi.aau.dk (Jacob Seligmann)
    From: "Jacob Seligmann"

will both be classified as postings from "Jacob Seligmann".

The Threads news repository generator has been running for quite some
time now, and has been tested on a large number of postings from a wide
range of newsgroups. It correctly handles all articles I have
encountered so far, but it is quite possible that the heuristics will
fail if confronted with exotic field formats I haven't come across. If
this happens, please send me a copy of the troublesome article so I can
extend the normalisation mechanism accordingly.

----------------------------------------------------------------------------

Threads Development History

Threads was originally inspired by the news archive available through
The Eiffel Page.

The first BETA versions were created by Jacob Seligmann, Aarhus
University, jacobse@daimi.aau.dk. These versions used Perl scripts etc.
for obtaining the news articles, and required direct access to the news
spool directories on the NNTP server machine.

The current version (implemented entirely in BETA, and using the NNTP
protocol for obtaining the articles) was created by Jørgen Lindskov
Knudsen, Aarhus University, jlknudsen@daimi.aau.dk.
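
Appendix: a sketch of header normalisation

To illustrate the kind of normalisation described under "How do you
Parse the News Articles?" above, here is a small Python sketch. It is
not the BETA code used by Threads, whose actual heuristics are not
shown in this document. The function names and regular expressions are
assumptions, and the sketch only aims to handle the two Date: and two
From: layouts quoted in that section.

```python
import re
from email.utils import parsedate_to_datetime


def normalise_date(value):
    """Parse an RFC 822 style date string into a timezone-aware datetime."""
    return parsedate_to_datetime(value)


def normalise_poster(value):
    """Extract a display name from a From: value (hypothetical heuristic)."""
    value = value.strip()
    # Layout 1: address followed by the real name in parentheses.
    match = re.search(r"\(([^()]+)\)\s*$", value)
    if match:
        return match.group(1).strip()
    # Layout 2: optionally quoted real name, optionally followed by <address>.
    match = re.match(r'^"?([^"<]+?)"?\s*(?:<[^>]*>)?$', value)
    if match:
        return match.group(1).strip()
    # Anything else is returned unchanged.
    return value


dates = ["1 Jan 1995 12:00:00 GMT", "Sun, 01 Jan 1995 12:00:00 +0000"]
posters = ['jacobse@daimi.aau.dk (Jacob Seligmann)', '"Jacob Seligmann"']

print(normalise_date(dates[0]) == normalise_date(dates[1]))
print({normalise_poster(p) for p in posters})
```

A From: value that matches neither layout falls through unchanged, so
an unusual format would show up in the poster index as-is rather than
being dropped.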

----------------------------------------------------------------------------
Jørgen Lindskov Knudsen / Aarhus University / jlknudsen@daimi.aau.dk