Contents:

  • Executive Summary/Plain-English Explanation
  • ClariNet ClariNews Extraction
  • How To Extract clari.web.* Articles Into Web Pages
  • Keeping Track of the Names (Overview Files)
  • The Sample Programs and How to Install Them
  • Newsgroup lists and Customizing for different Editions
  • Presenting Articles to the User, Sample Icons & Graphics
  • Expiring the Named Articles
  • How to deal with separate web and news servers
  • Mailing list, example images, and further help


    ClariNet ClariNews Extraction

    ClariNet's web-formatted ClariNews articles have our full USENET headers and a body of HTML. Some news readers (such as the one that comes with Netscape Navigator) can display the HTML body of such news articles without problem. Other newsreaders display the articles as though they were plain text, which is not very readable with all the embedded tags. One alternative is to make it possible to read our news in a web browser.

    In order to use a web browser to display our ClariNews articles you must either employ a CGI script or extract the news articles into web pages. One interesting compromise is to just extract the xcache articles (mainly images, but also some useful HTML) and filter the other articles via a CGI.

    If the articles are extracted into pure HTML, we can take advantage of the ability to name each article. This lets us save a sequence of story updates with the same filename. Doing this makes programs such as web-indexers work much more easily and avoids having to deal with invalid article references (caused by superseded news articles).

    This document will detail how our articles are constructed to make extraction a simple process. You may also be interested in the format of the body of our ClariNews Articles.

    To News or not to News

    If you're extracting the articles into named files, you don't even need to process them with a news system. It is easy to add extraction onto an existing news system, but it is also possible to have the extraction programs do all the work of saving, remembering, and expiring the articles.

    If you choose to extract some or all articles in addition to your news processing, you just need to arrange for each article that arrives in the news system (or just the articles that arrive in the clari.web.xcache.* groups) to be processed by the extraction program. This allows the news system to create and maintain the news-overview files and depends on the news system's expire process to know when to remove the named files (though it does require an extra expire step to remove the named articles corresponding to the removed news articles). The only additional information that is needed is what filename each article is saved under, and this information is provided via the X-Fn header in each of our top-level stories. The default configuration saves this information in a "web overview" file that the webextract and webexpire scripts maintain.

    If you choose to forego the news processing, the extraction script can create web-overview files of its own. We suggest making them the same format as news-overview files, but you don't have to save them down deep in a directory structure like the news system does (unless you want to). Our sample extraction script saves them by name in a single directory. It also saves an expires value for each article so that a simple expire program can remove files that have outlived their usefulness. You will probably want to have the extraction program maintain a history file to weed out duplicates (since we feed most sites via multiple servers for redundancy), but it is not absolutely vital since duplicates get saved under the same name and proper handling of the overview files will weed out any duplicates that creep in.

    How To Extract clari.web.* Articles Into Web Pages

    To extract each article in clari.web.*, parse the following headers:

        Content-Type               (Standard MIME header)
        Content-Transfer-Encoding  (ditto)

    The following headers should also be parsed if you're building web-overview files:

        X-Fn       (eXtraction FileName)
        Xref       (newsgroup names with numbers)
        Subject
        From
        Date
        Expires

    Skip any article without a Content-Type header (such as the control messages). You will undoubtedly want to limit articles that the script processes to just those in clari.web.* or maybe even clari.web.xcache.*.

    All articles are extracted into the filename indicated by the x-fn subheader of the Content-Type line (if this is a top-level article this data is duplicated in the article's X-Fn header, but you shouldn't confuse the two). For example:

    Subject: Some Article
    X-Fn: wed/br/Rslug-word.xy_8.html
    Content-Type: text/html; x-fn="wed/br/Rslug-word.xy_8.html"

    <HTML>Story...
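
    As a rough illustration (not taken from the sample scripts), a header scan for the two MIME headers and the x-fn subheader might look like the following perl sketch; the ARTICLE filehandle and the variable names are assumptions, and header continuation lines are ignored for brevity:

        # Simplified header scan; real scripts also grab the overview headers.
        ($type, $xfn, $encoding) = ('', '', '');
        while (<ARTICLE>) {
            last if /^$/;                          # blank line ends the headers
            $type     = $1     if m{^Content-Type:\s*([^;\s]+)}i;
            $xfn      = $1     if /x-fn="([^"]+)"/i;
            $encoding = lc($1) if /^Content-Transfer-Encoding:\s*(\S+)/i;
        }
        exit 0 unless $type;    # no Content-Type (e.g. a control message): skip it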

    Always validate this name to ensure that someone isn't attempting to feed you bogus articles. The name should have the form topdir/hashdir/filename, the dir names and filename should always start with an alphanumeric character, and they should consist of the characters A-Z, a-z, 0-9, '.', '-', and '_'. Reject any article with a value that doesn't fall within these parameters.
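
    A sketch of that validation in perl (assuming the name has been captured into $xfn, as in the sketch above) could be:

        # Enforce the topdir/hashdir/filename form with only safe characters.
        unless ($xfn =~ m{^[A-Za-z0-9][\w.-]*/[A-Za-z0-9][\w.-]*/[A-Za-z0-9][\w.-]*$}) {
            warn "Rejecting article with bogus x-fn value: $xfn\n";
            exit 0;     # or skip to the next article, as appropriate
        }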

    The dir names can be any string of alphanumeric characters and should be created if they don't exist. The current scheme creates 2-letter hash directories inside several top-level directories: wed (web-extracted data), xwd (extra web data), wes (web-extracted samples), and xws (extra web samples). The "extra" files are support files such as images (as opposed to story files that would be presented as part of the available news groups).

    As a rule these top-level extra directories will always start with an 'x' so that the expire program can find the extra files efficiently. Note also that since the sample articles have their own separate directory structure, sample stories can have a different ".htaccess" file controlling their access. You'll have to make special arrangements with ClariNet if you want users outside your company to access such sample news, however.

    There are also a couple of special directories in the xws dir named news and sports. These directories store HTML snippets that can be directly included in a web page to show a recent photo/caption combination. You may wish to run a photo randomizer (clariphoto.pl) that filters a source web page into a destination web page, processing normal server-side includes plus a new include directive of its own.

    Since the includes are replaced with their content when the source file is processed, your web server doesn't even have to be set up to handle server-side includes, and it is more efficient as well. The program can even modify the caption file's data to output the HTML in the format of your choice. The photo/caption snippets should be put on a web page that is accessible only to your paid ClariNet subscribers unless you've made special arrangements with ClariNet to provide a wider distribution.

    If the Content-Type is not multipart/mixed, check if the body is encoded with base64 encoding (the only encoding style we use). If not, it will have a type of text/html and should be extracted by filtering the body of the article to convert any news-relative HTML links into web-relative links and make any other desired changes. All links that need to be converted have a CLARI-XFN='...' tag (attached to an IMG or Anchor tag) or a CLARI-XGN='...' tag (attached to an Anchor tag). This extra tag immediately follows the value that should be changed. For example:

    <IMG SRC='0815top.gifI01.web.clari.net' CLARI-XFN='xwd/cl/0815top.gif'>
    <A HREF='clari.web.usa.top' CLARI-XGN='clari.web.usa.top'>

    Here are a couple of perl substitution statements that convert the HTML links (note that we use single quotes, not double):

       s(\b(HREF|SRC)='([^'>]+)' CLARI-XFN='([^'>]+)')
        ($1='$NDIR_REF$3' CLARI-MID='$2');
       s(\bHREF='([^'>]+)' CLARI-XGN='([^'>]+)')
        (HREF='$GROUP_REF$2' CLARI-GID='$1');
    

    This looks for HREF or SRC after a word boundary with an associated ='news-relative-link' block that is immediately followed by a CLARI-XFN tag with the associated ='web-relative-link' block. If found, it swaps the two blocks, renaming the original one as either CLARI-MID or CLARI-GID. It also prefixes the web-block with the http reference to get to the information ($NDIR_REF for the named files and $GROUP_REF for the newsgroup files/CGI). Here's how the previous examples might look after processing:

    <IMG SRC='/webnews/xwd/cl/0815top.gif' CLARI-MID='0815top.gifI01.web.clari.net'>
    <A HREF='/cgi-bin/clariweb/clari.web.usa.top' CLARI-GID='clari.web.usa.top'>

    The sample scripts go one step further and trim off the "clari.web." portion of the string since it makes the strings shorter and more readable. The sample CGI script knows how to translate usa.top back into clari.web.usa.top, so this works just fine. If you don't like this trimming, remove the extra code from the sample extraction script (see below).

    Another thing that you will probably want to do is to uncomment the CLARI-ITEM XINFO block. The XINFO block contains data that is typically already shown when an article is in its newsgroup state (such as the newsgroups), so it is commented out using an HTML comment block. If you delete the line that contains the beginning and ending tags for the XINFO block you will uncomment the information so that it will be displayed in the story's extracted state. See the example article to see this in action.

    If the article was base64 encoded, you simply decode the body into the indicated file. The following perl subroutine will decode a line of base64 data and return the result:

    sub decode_base64 {
        local($_) = @_;
        tr:ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/::cd;
        # Odd order of translate string gets around some perl oddities
        tr|NABCDEFGHIJKLMOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz01234567+/98|- !""#$%&'()*+,./0123456789:\;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[^_]\\|;
        unpack('u*', pack('C', 32 + int(length($_) * 3 / 4)) . $_);
    }
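
    For instance, a non-multipart base64 article might be extracted with a loop like this sketch (ARTICLE and $xfn are the same assumed names as above, $web_dir is an assumed top of the extraction tree, and the hash directories are assumed to have been created already):

        # Decode the base64 body a line at a time into the named file.
        open(OUT, ">$web_dir/$xfn") || die "Can't create $web_dir/$xfn: $!\n";
        while (<ARTICLE>) {
            print OUT &decode_base64($_);
        }
        close(OUT);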

    The only other articles we send out are encoded with multipart MIME (and are never nested -- only the main article is ever multipart). When you encounter a Content-Type of multipart/mixed, note the boundary string and unpack each piece as though you were processing a separate article. When you are scanning for the boundary string, keep in mind that there is an implied "--" at the start of the line and that the last boundary in the file will also have an additional trailing "--".
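
    Here's a rough sketch of that boundary scanning, assuming $boundary was captured from the multipart Content-Type header; the process_piece() handler is hypothetical and just stands in for "unpack this piece like a separate article":

        # Split a multipart/mixed body on its boundary lines.
        $sep = "--$boundary";           # every boundary line starts with "--"
        @part = ();
        while (<ARTICLE>) {
            if (/^\Q$sep\E(--)?\s*$/) { # the final boundary has a trailing "--"
                &process_piece(@part) if @part;
                last if $1;             # "--" marks the end of the last piece
                @part = ();
            } else {
                push(@part, $_);
            }
        }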

    Keeping Track of the Names (Overview Files)

    To access the extracted files you will need a quick reference of all the filenames. You can either configure your news software to put the X-Fn header in the news-overview or web-overview files (see below) or allow webextract and webexpire to maintain web-overview files. We recommend using the web-overview versions.

    Note that both C News and INN require you to modify the source code to put the X-Fn header into the news overview files. Modifying the source is the only way to configure C News, and INN does not allow the configuration file to contain unknown headers, so you'll have to do some extra work if you want to avoid the web-overview files. Since most sites don't want the extra characters in a file that most newsreaders download, using the web-overview files is usually the best way to go.


    The X-Fn article header is only present when the article is a "top level" article -- a story or a feature binary. It duplicates the x-fn subheader on the Content-Type header to make it easy to add just the filename to a news-system's overview creation process. Any article without an X-Fn header is only referenced indirectly through the stories and doesn't require an overview entry.

    If you are creating web-overview files it is recommended that you construct them in the same format as the news-overview files and save them by name in their own subdirectory. Feel free to omit the contents of any fields you don't need. News-overview files have the following fields separated by tabs:

    Num Subject From Date [Message-ID] [References] [Lines] [Bytes] EXTRAS
    

    It doesn't matter what number you use as long as it increases for each new article. The easiest way to do this is to use the article-number in the Xref header for each newsgroup. Some news systems (such as INN) don't include an Xref header for articles that aren't cross-posted, however, so if the header is missing you should figure out the group name and article number from the name of the file you're processing (which is usually of the form "clari/group/1234"). If you are not processing the files via a news system, there will still be an Xref header in the data stream that comes from the sender's system. If you're getting a direct feed from us, you'll get an Xref header for every article (since we're running C News).
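
    A sketch of that logic (assuming $xref holds the whole Xref header line, if any, and $file holds the spool-relative pathname of the article) might be:

        # Build a list of "group:number" pairs for this article.
        @group_nums = ();
        if ($xref =~ /^Xref:\s*\S+\s+(.+)/i) {
            @group_nums = split(' ', $1);
        } elsif ($file =~ m{^(.+)/(\d+)$}) {
            ($group = $1) =~ s{/}{.}g;   # "clari/web/usa/top" -> "clari.web.usa.top"
            @group_nums = ("$group:$2");
        }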

    Items in []'s are omitted in the web-overview files created by the sample extraction code. Note that when you omit a header, you still need to include the separating tabs.

    The EXTRAS field is a field of multiple values, each with its preceding header. You should include the "Expires: value" header and the "X-Fn: value" header, separated by tabs. The sample code stores the expires header as a Unix date (which is the number of seconds since the Unix epoch). This number is either derived from the article's Expires header (if present) or from the Date header plus 2 weeks (the usual expire time).
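
    Using the str2time() routine from the TimeDate module mentioned below, that computation is roughly the following sketch (the header variables are assumptions):

        use Date::Parse;    # str2time() comes from the TimeDate distribution

        # Use the Expires header if present, else the Date header plus 2 weeks.
        $expire_time = $expires_hdr ne '' ? str2time($expires_hdr)
                                          : str2time($date_hdr) + 14 * 24 * 60 * 60;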

    You only need to write overview data for an article with an X-Fn header, and you never need to write such entries for the pieces in a multipart article (multipart is only used to pack xcache articles). You need to append a line for each of the newsgroups that the article was posted in. The sample code outputs all the overview entries to a single file prefixed by the newsgroup name and then bursts the entries into the appropriate files in clusters (using sort). This minimizes the opening of each overview file.
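
    As an illustration, the per-article step of that approach might look like this sketch, with one group-prefixed line per newsgroup; the OVERVIEW filehandle and the field variables are assumptions based on the layout above:

        # One line per newsgroup, prefixed with the group name so that a
        # later sort can burst the entries into the per-group overview files.
        foreach $gn (@group_nums) {
            ($group, $num) = split(/:/, $gn);
            print OVERVIEW join("\t", $group, $num, $subject, $from, $date,
                                '', '', '', '',              # omitted [] fields
                                "Expires: $expire_time", "X-Fn: $xfn"), "\n";
        }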

    Sample Programs

    There are a few sample programs available (see the index for a full list). All the sample perl programs use a single configuration script, custom_config.pl, which you'll need to grab and customize for your particular setup. Other optional custom_*.pl files exist to customize the output of various scripts -- see the aforementioned sample file index for the whole list, as well as a ChangeLog file that lists all the recent changes.

    You can also grab a tar file that includes all the perl scripts, plus the gif images that they reference.

    The web extraction script (webextract.pl) takes the articles and turns them into files. I've also created a C version of this script (webextract.c) with most of the same features and options, except that it only works in conjunction with a news system.

    The perl version is written to use the TimeDate module (grab it from CPAN) which means that you need to be running at least perl 5.003.

    If you plan to run without a news system, you may still wish to maintain a history file using a dbz database (nntp can use it to refuse to receive duplicates, for instance). If so, grab my DBZ_File perl module (DBZ_File.tar.gz). Alternately, just change the "DBZ" references to "DBM" and you won't need a non-standard module (it will just create a non-standard history database).

    The C version must be linked with a date-parsing function, such as the one provided in this freely-available yacc script (parsedate.y). Non-ANSI systems will need to either tweak the program or join the rest of us in the modern world (maybe install gcc?).


    To install the news version of the program with C News, I suggest applying a simple patch that logs the filename in the news log file. You would then run webextract as a daemon reading the news log file, like this:

    	webextract -l /usr/lib/news/log clari
    
    This presumes that you've created a sys file entry for "clari" like this:
    	clari:clari.web/all:F:/dev/null
    
    This -l (ell) mode runs as a daemon that reads the current version of the log file, notices when it changes each day, and unpacks everything that gets tagged with the indicated feed name ("clari" in this case). It writes out a log.pos file to keep track of where it left off when it exits.


    To install it in INN, you could make sure that the INN option NNTPLINK_LOG is enabled (it puts the filename in the news log file) and run it in a manner similar to the above:

    	webextract -l /var/log/news/news clari
    
    Please note: With INN, this NNTPLINK_LOG variable needs to be compiled into the binary. It's off by default in the source code. You just need to set that flag to DO (see below), then recompile, and INN will log the article name. You can then put the line into your newsfeeds file (below) to add the "clari" tag to the news logfile that the webextract.pl script scans for.

    From my INN build in /local/src/inn-1.5b2/config/config.data

    ##  Put nntplink info (filename) into the log?
    #### =()<NNTPLINK_LOG   @<NNTPLINK_LOG>@>()=
    # Enabled for ClariWeb Extraction - 02/28/97 - PGS
    #NNTPLINK_LOG           DONT
    NNTPLINK_LOG            DO
    
    (By default, it's set to "DONT" and I've left that line commented out for reference. You can just change "DONT" to "DO" and be done with it, if you like.)

    Then you'll have to put an entry in your newsfeeds file like this:

    	clari:!*,clari.web.*:Tl:
    
    As above, please note that the character just after the "T" is a lowercase "ell", not a numeral "one". This line tells your INN process to create a pseudo-feed to a system called "clari", which does NOT apply to all of your news articles (!*) but DOES apply to any clari.web.* groups, and is a type (T) of "Log entry only", meaning that it will just put a mark into the log file (such as /var/log/news/news) with the tagline of "clari". This tagline is what webextract looks for while running. All articles tagged with "clari" will be extracted into HTML files and put into the overview files. (Do a 'man newsfeeds' for full info on this.)


    You could alternately install it as a channel process by using this newsfeeds entry:

    	clari:!*,clari.web.*:Tc,Wn:/some/path/webextract -c
    
    If you do that, you'd also have to check periodically for batch files (which would get created if your disk overflowed) and run webextract from cron to process the batch files. The batch-run syntax is:
    	webextract -b /var/spool/news/out.going/clari
    
    If the program fully processes the batch file, it removes it.

    I don't recommend running it as a per-article process, but if you want to do it, call it like this (C News sys file syntax):

    	clari:clari.web/all::/some/path/webextract -p %s
    

    See also the discussion of how to deal with separate web and news servers.

    Newsgroup lists and Customizing for different Editions

    Depending on which edition of the ClariNews you subscribe to, you will have to add the appropriate newsgroups to your active file on your news server. These can be found at our FTP site and can be easily added to your active file by using a helper script to automate things. (You want to get the newsgroups.web.X files, where X is your edition. There are also newsgroups.X files, but those are not the groups with web-formatted news, so they aren't what you want for ClariWeb.)

    NOTE: Michael A. Grady <m-grady@uiuc.edu>, of the University of Illinois, has developed a nice guide to some of the customizations needed for his installation. Keep in mind that his document is pretty old now, and the problem he mentions with newsgroup references to non-existent groups in lower editions has since been fixed in the scripts here. If you want the most up-to-date scripts, grab them from our server.


    Presenting Articles to the User, Sample Icons & Graphics

    Once the extraction is done, you need to present the lists of articles to the user in some easy-to-use fashion. One way to do this is to write a CGI script that reads the news or web-overview files and displays the latest stories in each group. We've provided a sample perl script (clariweb.cgi) that can read either news-overview or web-overview files (you can't use the news overview files without adding the "x-fn:" header to them).

    A more efficient way (especially for larger sites) is to regularly update a set of files that contain a snapshot of the articles in each group. Since named articles update in place there's almost never a story that gets canceled, so the files will stay accurate between updates. Our sample script for this processing (grouplist.pl) outputs two files per group. One is a brief list for the first visit to the group, the other is the full list of articles that will be displayed if the user clicks on the "Show all articles" link. If you go this route you should put the grouplist.pl script into cron to be updated every 10 minutes or so. The script only updates the files if they've changed, so it is fairly efficient to run it on a frequent basis.
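
    For example, a crontab entry along these lines (with "/some/path" replaced by wherever you installed the script) would run it every 10 minutes:

    	0,10,20,30,40,50 * * * * /some/path/grouplist.pl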

    If you want to display a summary of top stories in various areas, such as "World, USA, Sports" (etc), this can be accomplished with something similar to the (gettop.pl) script. It looks for the latest top-story article in the "xwd/links" subdir (where useful HTML articles are put) and extracts just the story portion into a file. This file can then be included via a SSI (server-side include) directive. If you don't have SSI turned on at your web site, the clariphoto.pl script can do the include for you -- especially handy if you're putting photo/caption HTML onto the same web page.

    If you don't want to grab the gifs you need directly from our web server, you can get them as a part of the perl source tar file that includes all the perl scripts, plus the gif images that they reference. This includes the little 'camera.gif' (that denotes a photocaptioned story), and the logo (which must be displayed with the news).

    Expiring the Named Articles

    If you're processing the files with a news system, you can follow up the normal expire with a program that reads all the overview files and creates a hash of the remaining X-Fn filenames. Then, a single pass through the files in the named-article directories will let you find all the files that need to be removed. The exception is the xcache files, which are not directly referenced in the overview files. To make this easy to deal with, we've ensured that all xcache files get put into a top-level directory that starts with an 'x'. Thus, while processing the files in the subdirectories of "xwd" you'd remove each file once it is a day older than your longest web expire time (usually 2 weeks + 1 day). Since you're removing the files in the same directory you're reading, this is fairly efficient.
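
    Here's a rough sketch of that approach (the directory layout is as described earlier; the list of overview files and the $web_dir top of the extraction tree are assumptions):

        # Pass 1: remember every X-Fn filename still referenced in the overviews.
        %keep = ();
        foreach $ov (@overview_files) {
            open(OV, $ov) || next;
            while (<OV>) {
                $keep{$1} = 1 if /\tX-Fn:\s*(\S+)/;
            }
            close(OV);
        }

        # Pass 2: walk the named-article tree and unlink anything unreferenced.
        # The 'x*' extra/xcache dirs are skipped here and expired purely by age.
        chdir($web_dir) || die "Can't chdir to $web_dir: $!\n";
        foreach $hash (grep(!/^x/, glob("*/*"))) {
            opendir(D, $hash) || next;
            foreach $f (grep(!/^\./, readdir(D))) {
                unlink("$hash/$f") unless $keep{"$hash/$f"};
            }
            closedir(D);
        }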

    On the other hand, if you're creating web-overview files, then the expire script can simply read through them and remove all entries that have expired by using the expires header included for this purpose. The sample script makes the removal of the files more efficient by creating a list, sorting it, and eliminating the duplicates before calling "unlink". This optimizes the task by grouping the files into common directories. Since it is rewriting the source of the information as it goes (the web-overview files) it writes the removal list to a file so that any interruption in the task will not lose the list of files to remove. Also, be sure to use locking to avoid conflicts with the overview creation system. You then have to expire the xcache files in the same manner as previously detailed.
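
    A stripped-down sketch of that loop (ignoring the locking, the crash-recovery file, and other details the real webexpire script handles; $web_dir and the file list are assumptions) might be:

        # Drop entries whose Expires time (a Unix date) has passed, collecting
        # the doomed filenames for a sorted, de-duplicated unlink pass.
        $now = time;
        @doomed = ();
        foreach $ov (@web_overview_files) {
            open(OV, $ov) || next;
            open(NEW, ">$ov.new") || next;
            while (<OV>) {
                if (/\tExpires:\s*(\d+).*\tX-Fn:\s*(\S+)/ && $1 < $now) {
                    push(@doomed, $2);     # expired: drop it from the overview
                } else {
                    print NEW $_;          # still current: keep the entry
                }
            }
            close(OV);  close(NEW);
            rename("$ov.new", $ov);
        }
        $last = '';
        foreach $f (sort @doomed) {        # sorting groups files by directory
            unlink("$web_dir/$f") unless $f eq $last;
            $last = $f;
        }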

    The sample expire script (webexpire.pl) has configuration options in the user-configuration section that tell it whether you want to read web-overview files or news-overview files, and whether it needs to handle the aging of the history file (if you're running without a news system). It also has some command-line options: -v, which outputs some stats to stderr, and -n, which tells it to just output the list of removals to stdout.

    How to deal with separate web and news servers

    Many sites face the problem that they don't currently have both a news server and web server on the same system. Here are some ways to get around this problem.

    Which one you pick depends on load and familiarity with software.

    NFS mounting of directories

    The easiest thing to do is to allow the web extraction script to read the articles from the news server via NFS. This of course has a cost in NFS overhead, but since the files are only read once, the cost is pretty low.

    The best way to configure web extraction in this case is to read the news log file via NFS (to know what articles to extract) -- this is the only other file that needs to be available via NFS besides the article files themselves.

    An alternate approach is to NFS mount the ClariNews portion of the web tree on your news machine. You can then run web extraction any way you like on the news server and it will write the results to your web server. Since most installations choose to create web-overview files, this method probably has a slightly higher NFS overhead than mounting things the other way around.

    Whether you're running grouplist.pl or clariweb.cgi, you'd just run it on your web machine as usual as long as there are web-overview files to read. If you decide to use the news-overview files instead, this will cause a much larger NFS traffic load, and should probably be avoided.

    C News on Web Server

    Another way to get news to the web machine is to install a news server there with just a ClariNews feed. We recommend the use of C News, which does not put the huge memory load on a system that INN does. INN was primarily designed for sites with lots of incoming NNTP traffic. It runs a huge daemon to make this efficient, but you don't need that for a single incoming feed of low-volume ClariNews. Even at 9mb a day, a ClariNet feed does not have a serious impact on a machine.

    We recommend this because news software like C News was designed exactly for the purpose of mirroring a news tree to another machine, handling the expiring -- the works.

    Of course, if you are familiar with INN and don't know C News at all, you may not want to install an unfamiliar package. If you do use INN for this, you will want to tune it for minimum memory usage. (Or buy a bit more RAM -- at under $4 per megabyte, sometimes RAM problems are best solved with money these days rather than sysadmin time.) When you compile it, you can set the INN_DBZINCORE flag to 0 to keep it from holding a lot of RAM, at a cost in efficiency.

    The web script even has an option to delete the originating article after it is successfully extracted, so you would not have to sacrifice any large amount of disk space. Alternately, you could simply expire the news after one day to keep things small.

    Web server on the news server

    Another alternative is to put up a web server on the news server for no other purpose than to read news, and run our CGI and news unpacking programs. Again, such a web server would only serve requests for news stories, and as such the load on it would not be high. It doesn't need to be a full production web server.

    Which web server you use is up to you. Apache is free, of course, but you may have another choice.

    File Copying tools

    As a last resort, you can arrange to build the files on one system and copy them to the web server. This can be done with a tool like "rdist", or you can have the web extractor or other tools make a list of the files they changed and feed that list into a program like cpio to make a stream to send to the other system (via rsh, rcp, or even FTP) for unpacking.

    If your news server and web server are totally divided by firewalls that block everything but FTP, we strongly recommend you just install C News on your web server or a web server on your news system. These tools were meant to mirror news feeds through firewalls! But if you can't, you can:

    1. Modify the web unpacking scripts to generate a list of files they have changed, or
    2. Use a tool like "find" to build a list of files that have changed after an unpack, then:
    3. Take the list of files and put it into "cpio" or some similar program.
    4. Take the output from cpio and FTP it to the other system.
    5. On the other system, when the FTP archive arrives, unpack with CPIO into the right tree.
    6. Arrange some sort of cleanup that will get rid of old files after a while.

    Mailing list, example images, and further help

    There is a mailing list for the ClariWeb admin community: clariweb-admins@clari.net

    Here you'll find other admins who are in the process of or have implemented ClariWeb at their sites, with scripts, hints, and tips on how they've set up ClariWeb, as well as answers to questions.

    To join, send a mail to clariweb-admins-request@clari.net with the word "subscribe" in the Subject: line.

    There is also a Snapshot Archive of images showing what some other ClariWeb sites are doing, and the variety of ways that ClariWeb can be customized to fit in with the overall motif of your website. These should provide some good ideas as you plan how you want to present the ClariNews on your pages.


    Original Doc by: Wayne Davison
    Maintained by: Patrick Salsbury
    Last modified: Thu Jun 26 12:44:50 PDT