clari.web.* Articles Into Web Pages
In order to use a web browser to display our ClariNews articles you must either employ a CGI script or extract the news articles into web pages. One interesting compromise is to just extract the xcache articles (mainly images, but also some useful HTML) and filter the other articles via a CGI.
If the articles are extracted into pure HTML, we can take advantage of the ability to name each article. This lets us save a sequence of story updates with the same filename. Doing this makes programs such as web-indexers work much more easily and avoids having to deal with invalid article references (caused by superseded news articles).
This document will detail how our articles are constructed to make
extraction a simple process. You may also be interested in the
format of the body of our ClariNews Articles.
To News or not to News
If you're extracting the articles into named files, you don't even need to process them with a news system. It is easy to add extraction onto an existing news system, but it is also possible to have the extraction programs do all the work of saving, remembering, and expiring the articles.
If you choose to extract some or all articles in addition to your
news processing, you just need to pass each article that arrives
in the news system (or just the articles that arrive in the
clari.web.xcache.* groups) to the extraction program.
This allows the news system to create and maintain the news-overview
files and depends on the news system's expire process to know when to
remove the named files (though it does require an extra expire step
to remove the named articles corresponding to the removed news
articles). The only additional information that is needed is what
filename each article is saved under, and this information is
provided via the X-Fn
header in each of our top-level
stories. The default configuration saves this information in a
"web overview" file that the webextract and webexpire scripts
maintain.
If you choose to forego the news processing, the extraction script
can create web-overview files of its own. We suggest making them
the same format as news-overview files, but you don't have to save
them down deep in a directory structure like the news system does
(unless you want to). Our sample extraction script saves them by
name in a single directory. It also saves an expires value for each
article so that a simple expire program can remove files that have
outlived their usefulness. You will probably want to have the
extraction program maintain a history file to weed out duplicates
(since we feed most sites via multiple servers for redundancy), but
it is not absolutely vital since duplicates get saved under the same
name and proper handling of the overview files will weed out any
duplicates that creep in.
How To Extract clari.web.* Articles Into Web Pages
To extract each article in clari.web.*, parse the following headers:
Content-Type (standard MIME header)
Content-Transfer-Encoding (ditto)

The following headers should also be parsed if you're building web-overview files:

X-Fn (eXtraction FileName)
Xref (newsgroup names with numbers)
Subject
From
Date
Expires
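A minimal perl sketch of the header scan (illustrative only; the subroutine name and the lower-cased hash keys are assumptions, not taken from the sample scripts):

    # Read headers from the current article until the blank line that
    # separates them from the body, folding continuation lines as we go.
    sub read_headers {
        my ($fh) = @_;
        my (%hdr, $last);
        while (my $line = <$fh>) {
            chomp $line;
            last if $line eq '';                # blank line ends the headers
            if ($line =~ /^\s/ && defined $last) {
                $hdr{$last} .= ' ' . $line;     # continuation line
                $hdr{$last} =~ s/\s+/ /g;
            } elsif ($line =~ /^([\w-]+):\s*(.*)/) {
                $last = lc $1;                  # e.g. 'content-type', 'x-fn'
                $hdr{$last} = $2;
            }
        }
        return %hdr;
    }

With the headers in a hash like this, an article can be skipped as soon as it is seen to lack a Content-Type header, as described next.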
Skip any article without a Content-Type header (such as the control messages). You will undoubtedly want to limit articles that the script processes to just those in clari.web.*, or maybe even clari.web.xcache.*.
All articles are extracted into the filename indicated by the x-fn subheader of the Content-Type line (if this is a top-level article, this data is duplicated in the article's X-Fn header, but you shouldn't confuse the two). For example:
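(The exact values and parameter quoting below are hypothetical; they only illustrate how the two headers relate.)

    Content-Type: text/html; x-fn="wed/qs/Bexample-story.html"
    X-Fn: wed/qs/Bexample-story.html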
Always validate this name to ensure that someone isn't attempting to feed you bogus articles. The name should have the form topdir/hashdir/filename, the dir names and filename should always start with an alphanumeric character, and they should consist of the characters A-Z, a-z, 0-9, '.', '-', and '_'. Reject any article with a value that doesn't fall within these parameters.
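A perl check along these lines would do the job (this sketch is not taken from the sample scripts, and the exact pattern there may differ):

    # Return true only if an x-fn value looks like "topdir/hashdir/filename",
    # each component starting with an alphanumeric character and built only
    # from the characters A-Z, a-z, 0-9, '.', '-', and '_'.
    sub valid_xfn {
        my ($fn) = @_;
        return $fn =~ m{^ [A-Za-z0-9][A-Za-z0-9._-]*
                        / [A-Za-z0-9][A-Za-z0-9._-]*
                        / [A-Za-z0-9][A-Za-z0-9._-]* $}x;
    }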
The dir names can be any string of alphanumeric characters and should be created if they don't exist. The current scheme creates 2-letter hash directories inside several top-level directories: wed (web-extracted data), xwd (extra web data), wes (web-extracted samples), and xws (extra web samples). The "extra" files are support files such as images (as opposed to story files that would be presented as part of the available newsgroups).
As a rule these top-level extra directories will always start with an 'x' so that the expire program can find the extra files efficiently. Note also that since the sample articles have their own separate directory structure, it allows sample stories to have a different ".htaccess" file controlling their access. You'll have to make special arrangements with ClariNet if you want users outside your company to access such sample news, however.
There are also a couple of special directories in the xws dir named news and sports. These directories store HTML snippets that can be directly included in a web page to show a recent photo/caption combination. You may wish to run a photo randomizer (clariphoto.pl) that filters a source web page into a destination web page, processing server-side includes plus an additional include directive of its own.
If the Content-Type is not multipart/mixed, check if the body is encoded with base64 encoding (the only encoding style we use). If not, it will have a type of text/html and should be extracted by filtering the body of the article to convert any news-relative HTML links into web-relative links and make any other desired changes.
All links that need to be converted have a CLARI-XFN='...' tag (attached to an IMG or anchor tag) or a CLARI-XGN='...' tag (attached to an anchor tag). This extra tag immediately follows the value that should be changed.
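For example, links inside a story might look something like this before conversion (the message-id, group name, and filename here are purely hypothetical):

    <A HREF='news:example-id@clari.net' CLARI-XFN='wed/qs/Bexample-story.html'>Full story</A>
    <A HREF='news:clari.web.usa.top' CLARI-XGN='clari.web.usa.top'>More U.S. headlines</A>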
Here are a couple of perl substitution statements that convert the HTML links (note that we use single quotes, not double):
    s(\b(HREF|SRC)='([^'>]+)' CLARI-XFN='([^'>]+)') ($1='$NDIR_REF$3' CLARI-MID='$2');
    s(\bHREF='([^'>]+)' CLARI-XGN='([^'>]+)') (HREF='$GROUP_REF$2' CLARI-GID='$1');
This looks for HREF or SRC after a word boundary with an associated ='news-relative-link' block that is immediately followed by a CLARI-XFN tag with the associated ='web-relative-link' block. If found, it swaps the two blocks, renaming the original one as either CLARI-MID or CLARI-GID. It also prefixes the web block with the http reference to get to the information ($NDIR_REF for the named files and $GROUP_REF for the newsgroup files/CGI).
Here's how the previous examples might look after processing:
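(Continuing the hypothetical links above, and assuming purely for illustration that $NDIR_REF is '/clarinews/' and $GROUP_REF is '/cgi-bin/clariweb.cgi?group=':)

    <A HREF='/clarinews/wed/qs/Bexample-story.html' CLARI-MID='news:example-id@clari.net'>Full story</A>
    <A HREF='/cgi-bin/clariweb.cgi?group=clari.web.usa.top' CLARI-GID='news:clari.web.usa.top'>More U.S. headlines</A>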
The sample scripts go one step further and trim off the "clari.web." portion of the string since it makes the strings shorter and more readable. The sample CGI script knows how to translate usa.top back into clari.web.usa.top, so this works just fine. If you don't like this trimming, remove the extra code from the sample extraction script (see below).
Another thing that you will probably want to do is to uncomment the CLARI-ITEM XINFO block. The XINFO block contains data that is typically already shown when an article is in its newsgroup state (such as the newsgroups), so it is commented out using an HTML comment block. If you delete the line that contains the beginning and ending tags for the XINFO block you will uncomment the information so that it will be displayed in the story's extracted state. See the example article to see this in action.
If the article was base64 encoded, you simply decode the body into the indicated file. The following perl subroutine will decode a line of base64 data and return the result:
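(A minimal sketch using the classic tr///-plus-unpack trick; the sample scripts may implement this differently.)

    # Decode one line of base64 text and return the raw bytes.
    sub decode_base64_line {
        my ($line) = @_;
        $line =~ tr|A-Za-z0-9+/||cd;                    # strip padding and whitespace
        $line =~ tr|A-Za-z0-9+/| -_|;                   # map base64 chars to uuencode chars
        my $len = pack('c', 32 + 0.75 * length($line)); # uuencode length byte
        return unpack('u', $len . $line);
    }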
The only other articles we send out are encoded with multipart MIME (and are never nested -- only the main article is ever multipart). When you encounter a Content-Type of multipart/mixed, note the boundary string and unpack each piece as though you were processing a separate article. When you are scanning for the boundary string, keep in mind that there is an implied "--" at the start of the line and that the last boundary in the file will also have an additional trailing "--".
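A rough perl sketch of that scan (not taken from the sample scripts; $boundary is assumed to hold the boundary= value from the Content-Type header):

    # Split the lines of a multipart body into its pieces.  Each boundary
    # line starts with "--", and the final one has an extra trailing "--".
    sub split_multipart {
        my ($boundary, @lines) = @_;
        my (@parts, @current, $started);
        foreach my $line (@lines) {
            if ($line =~ /^--\Q$boundary\E(--)?\s*$/) {
                push @parts, [@current] if $started && @current;
                @current = ();
                $started = 1;
                last if $1;         # the trailing "--" marks the final boundary
            } elsif ($started) {
                push @current, $line;
            }
        }
        return @parts;              # each element holds one piece's lines
    }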
Keeping Track of the Names (Overview Files)
To access the extracted files you will need a quick reference of all the filenames. You can either configure your news software to put the X-Fn header in the news-overview files (see below) or allow webextract and webexpire to maintain web-overview files. We recommend using the web-overview versions.
Note that both C News and INN require you to modify the source code
to put the X-Fn
header into the news overview files.
Modifying the source is the only way to configure C News, and INN does not
allow the configuration file to contain unknown headers, so you'll
have to do some extra work if you want to avoid the web-overview
files. Since most sites don't want the extra characters in a file
that most newsreaders download, using the web-overview files is
usually the best way to go.
The X-Fn article header is only present when the article is a "top level" article -- a story or a feature binary. It duplicates the x-fn subheader on the Content-Type header to make it easy to add just the filename to a news system's overview creation process. Any article without an X-Fn header is only referenced indirectly through the stories and doesn't require an overview entry.
If you are creating web-overview files it is recommended that you construct them in the same format as the news-overview files and save them by name in their own subdirectory. Feel free to omit the contents of any fields you don't need. News-overview files have the following fields separated by tabs:
Num Subject From Date [Message-ID] [References] [Lines] [Bytes] EXTRAS
It doesn't matter what number you use as long as it increases for
each new article. The easiest way to do this is to use the
article-number in the Xref
header for each newsgroup.
Some news systems (such as INN) don't include an Xref header for
articles that aren't cross-posted, however, so if the header is
missing you should figure out the group name and article number from
the name of the file you're processing (which is usually of the form
"clari/group/1234"). If you are not processing the files via a news
system, there will still be an Xref header in the data stream that
comes from the sender's system. If you're getting a direct feed
from us, you'll get an Xref header for every article (since we're
running C News).
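(A hypothetical fallback for the missing-Xref case; neither the subroutine name nor the exact path layout is taken from the sample scripts:)

    # Derive the group name and article number from a spool-relative path
    # such as "clari/web/usa/top/1234" when the Xref header is missing.
    sub group_and_number_from_path {
        my ($path) = @_;
        my ($dirs, $num) = $path =~ m{^(.+)/(\d+)$} or return;
        (my $group = $dirs) =~ tr{/}{.};    # directory separators become dots
        return ($group, $num);
    }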
Items in []'s are omitted in the web-overview files created by the sample extraction code. Note that when you omit a header, you still need to include the separating tabs.
The EXTRAS field is a field of multiple values, each with its preceding header. You should include the "Expires: value" header and the "X-Fn: value" header, separated by tabs. The sample code stores the expires header as a Unix date (which is the number of seconds since the Unix epoch). This number is either derived from the article's Expires header (if present) or from the Date header plus 2 weeks (the usual expire time).
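For instance, a web-overview line for a hypothetical story might look like this (the values are illustrative, tabs are shown as <TAB>, and the bracketed fields are left empty):

    4321<TAB>Example headline<TAB>reporter@clari.net<TAB>Thu, 27 Feb 1997 12:00:00 GMT<TAB><TAB><TAB><TAB><TAB>Expires: 858254400<TAB>X-Fn: wed/qs/Bexample-story.html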
You only need to write overview data for an article with an X-Fn header, and you never need to write such entries for the pieces in a multipart article (which is only used to pack xcache articles). You need to append a line to the overview file of each newsgroup that the article is posted in. The sample code outputs all the overview entries to a single file prefixed by the newsgroup name and then bursts the entries into the appropriate files in clusters (using sort). This minimizes the number of times each overview file has to be opened.
Sample Programs
There are a few sample programs available (see the index for a full list). All the sample perl programs use a single configuration script, custom_config.pl, which you'll need to grab and customize for your particular setup. Other optional custom_*.pl files exist to customize the output of various scripts -- see the aforementioned sample file index for the whole list, as well as a ChangeLog file that lists all the recent changes.
You can also grab a tar file that includes all the perl scripts, plus the gif images that they reference.
The web extraction script (webextract.pl) takes the articles and turns them into files. I've also created a C version of this script (webextract.c) with most of the same features and options, except that it only works in conjunction with a news system.
The perl version is written to use the TimeDate module (grab it from CPAN) which means that you need to be running at least perl 5.003.
If you plan to run without a news system, you may still wish to maintain a history file using a dbz database (nntp can use it to refuse to receive duplicates, for instance). If so, grab my DBZ_File perl module (DBZ_File.tar.gz). Alternately, just change the "DBZ" references to "DBM" and you won't need a non-standard module (it will just create a non-standard history database).
The C version must be linked with a date-parsing function, such as the one provided in this freely-available yacc script (parsedate.y). Non-ANSI systems will need to either tweak the program or join the rest of us in the modern world (maybe install gcc?).
To install the news version of the program with C News, I suggest applying a simple patch that logs the filename in the news log file. You would then run webextract as a daemon reading the news log file, like this:
    webextract -l /usr/lib/news/log clari

This presumes that you've created a sys file entry for "clari" like this:

    clari:clari.web/all:F:/dev/null

This -l (ell) mode runs as a daemon that reads the current version of the log file, notices when it changes each day, and unpacks everything that gets tagged with the indicated feed name ("clari" in this case). It writes out a log.pos file to keep track of where it left off when it exits.
To install it in INN, you could make sure that the INN option NNTPLINK_LOG is enabled (it puts the filename in the news log file) and run it in a manner similar to the above:

    webextract -l /var/log/news/news clari

Please note: with INN, this NNTPLINK_LOG variable needs to be compiled into the binary. It's normally off by default in the source code. You just need to uncomment the line with that flag, then recompile, and it will log the article name.
You can then put the line
into your newsfeeds
file (below) to add the "clari" tag
to the news logfile that the webextract.pl script scans for.
From my INN build in /local/src/inn-1.5b2/config/config.data
    ##  Put nntplink info (filename) into the log?
    #### =()<NNTPLINK_LOG @<NNTPLINK_LOG>@>()=
    # Enabled for ClariWeb Extraction - 02/28/97 - PGS
    #NNTPLINK_LOG    DONT
    NNTPLINK_LOG     DO

(By default, it's set to "DONT" and I've left that line commented out for reference. You can just change "DONT" to "DO" and be done with it, if you like.)
Then you'll have to put an entry in your newsfeeds file like this:
    clari:!*,clari.web.*:Tl:

As above, please note that the character just after the "T" is a lowercase "ell", not a numeral "one". This line tells your INN process to create a pseudo-feed to a system called "clari", which does NOT apply to all of your news articles (!*) but DOES apply to any clari.web.* groups, and is a type (T) of "Log entry only", meaning that it will just put a mark into the log file (such as /var/log/news/news) with the tagline of "clari". This tagline is what webextract looks for while running. All articles tagged with "clari" will be extracted into HTML files and put into the overview files. (Do a 'man newsfeeds' for full info on this.)
You could alternately install it as a channel process by using this newsfeeds entry:
    clari:!*,clari.web.*:Tc,Wn:/some/path/webextract -c

If you do that, you'd also have to check periodically for batch files (which would get created if your disk overflowed) and run webextract from cron to process the batch files. The batch-run syntax is:
    webextract -b /var/spool/news/out.going/clari

If the program fully processes the batch file, it removes it.
I don't recommend running it as a per-article process, but if you want to do it, call it like this (C News sys file syntax):
clari:clari.web/all::/some/path/webextract -p %s
See also the discussion of
how to deal with separate web and news servers.
Newsgroup lists and Customizing for different Editions
Depending on which edition of the ClariNews you subscribe to, you will have to add the appropriate newsgroups to the active file on your news server. These can be found at our FTP site and can easily be added to your active file by using a helper script to automate things. (You want to get the newsgroups.web.X files, where X is your edition. There are also newsgroups.X files, but these are not the groups with web-formatted news, so they aren't what you want for ClariWeb.)
NOTE: Michael A. Grady <m-grady@uiuc.edu>, of the University of Illinois, has developed a nice guide to some of the customizations needed for his installation. Keep in mind that his document is pretty old now, and the problem he mentions with newsgroup references to non-existent groups in lower editions has been fixed in the scripts here for quite some time. If you want the most up-to-date scripts, grab them from our server.
Presenting Articles to the User, Sample Icons & Graphics
Once the extraction is done, you need to present the lists of articles to the user in some easy-to-use fashion. One way to do this is to write a CGI script that reads the news or web-overview files and displays the latest stories in each group. We've provided a sample perl script (clariweb.cgi) that can read either news-overview or web-overview files (you can't use the news overview files without adding the "x-fn:" header to them).
A more efficient way (especially for larger sites) is to regularly update a set of files that contain a snapshot of the articles in each group. Since named articles update in place there's almost never a story that gets canceled, so the files will stay accurate between updates. Our sample script for this processing (grouplist.pl) outputs two files per group. One is a brief list for the first visit to the group, the other is the full list of articles that will be displayed if the user clicks on the "Show all articles" link. If you go this route you should put the grouplist.pl script into cron to be updated every 10 minutes or so. The script only updates the files if they've changed, so it is fairly efficient to run it on a frequent basis.
If you want to display a summary of top stories in various areas, such as "World, USA, Sports" (etc.), this can be accomplished with something similar to the gettop.pl script. It looks for the latest top-story article in the "xwd/links" subdir (where useful HTML articles are put) and extracts just the story portion into a file. This file can then be included via an SSI (server-side include) directive. If you don't have SSI turned on at your web site, the clariphoto.pl script can do the include for you -- especially handy if you're putting photo/caption HTML onto the same web page.
If you don't want to grab the gifs you need directly from our web
server, you can get them as a part of the
perl source
tar file that includes all the perl scripts, plus the gif images
that they reference.
This includes the little 'camera.gif'
(that denotes a photocaptioned story), and the
logo (which must be displayed with
the news).
If you're processing the files with a news system, you can follow up the normal expire with a program that reads all the overview files and creates a hash of the remaining X-Fn filenames.
Then, a single pass through the files in the named-article
directories will let you find all the files that need to be removed.
The exception is the xcache files, which are not directly referenced
in the overview files.
To make this easy to deal with, we've ensured that all xcache
files get put into a top-level directory that starts with an 'x'.
Thus, while processing the files in the subdirectories of "xwd", you'd remove each file once it is older than your longest web expire time plus a day (usually 2 weeks + 1 day).
Since you're removing the files in the same directory you're
reading, this is fairly efficient.
On the other hand, if you're creating web-overview files, then the expire script can simply read through them and remove all entries that have expired by using the expires header included for this purpose. The sample script makes the removal of the files more efficient by creating a list, sorting it, and eliminating the duplicates before calling "unlink". This optimizes the task by grouping the files into common directories. Since it is rewriting the source of the information as it goes (the web-overview files) it writes the removal list to a file so that any interruption in the task will not lose the list of files to remove. Also, be sure to use locking to avoid conflicts with the overview creation system. You then have to expire the xcache files in the same manner as previously detailed.
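(A stripped-down perl sketch of that idea; it is not the actual webexpire.pl logic, $OVERVIEW_DIR is an assumed setting, and the locking and removal-list file are omitted for brevity:)

    # Assumes we have already chdir'd to the top of the extraction tree so
    # that the X-Fn paths in the overview entries are usable as filenames.
    my $now = time;
    my @doomed;
    foreach my $ov (glob("$OVERVIEW_DIR/*")) {
        open(my $in,  '<', $ov)       or next;
        open(my $out, '>', "$ov.new") or die "can't write $ov.new: $!";
        while (my $line = <$in>) {
            my ($expires) = $line =~ /\tExpires:\s*(\d+)/;
            my ($xfn)     = $line =~ /\tX-Fn:\s*(\S+)/;
            if (defined $expires && $expires < $now) {
                push @doomed, $xfn if defined $xfn;  # expired: remember its file
            } else {
                print $out $line;                    # still live: keep the entry
            }
        }
        close $in;
        close $out;
        rename("$ov.new", $ov);
    }
    # Sort and de-duplicate so the removals are grouped by directory.
    my %seen;
    unlink grep { !$seen{$_}++ } sort @doomed;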
The sample expire script (webexpire.pl) has configuration options in the user-configuration section that tell it whether to read web-overview files or news-overview files, and whether it needs to handle the aging of the history file (if you're running without a news system). It also has a couple of command-line options: -v outputs some stats to stderr, and -n tells it to just output the list of removals to stdout.
How to deal with separate web and news servers
Many sites face the problem that they don't currently have both a news server and web server on the same system. Here are some ways to get around this problem.
Which one you pick depends on load and familiarity with software.
The easiest thing to do is to allow the web extraction script to read the articles from the news server via NFS. This of course has a cost in NFS overhead, but since the files are only read once, the cost is pretty low.
The best way to configure web extraction in this case is to read the news log file via NFS (to know what articles to extract) -- this is the only other file that needs to be available via NFS besides the articles files themselves.
An alternate approach is to NFS mount the ClariNews portion of the web tree on your news machine. You can then run web extraction any way you like on the news server and it will write the results to your web server. Since most installations choose to create web-overview files, this method probably has a slightly higher NFS overhead than mounting things the other way around.
Whether you're running grouplist.pl or clariweb.cgi, you'd just run it on your web machine as usual as long as there are web-overview files to read. If you decide to use the news-overview files instead, this will cause a much larger NFS traffic load and should probably be avoided.
Another way to get news to the web machine is to install a news server there with just a ClariNews feed. We recommend the use of C News, which does not put the huge memory load on a system that INN does. INN was primarily designed for sites with lots of incoming NNTP traffic. It runs a huge daemon to make this efficient, but you don't need that for a single incoming feed of low-volume ClariNews. Even at 9mb a day, a ClariNet feed does not put a serious load on a machine.
We recommend this because news software like C News was designed exactly for the purpose of mirroring a news tree to another machine, handling its expiry -- the works.
Of course, if you are familiar with INN and don't know C News at all, you may not want to install an unfamiliar package. If you do use INN for this, you will want to tune it for minimum memory usage. (Or buy a bit more RAM -- at under $4 per megabyte, sometimes RAM problems are best solved with money these days rather than sysadmin time.) When you compile it, you can set the flag INN_DBZINCORE to 0 to keep it from holding a lot of data in RAM, at a cost in efficiency.
The web extraction script even has an option to delete the originating article after it is successfully extracted, so you would not have to sacrifice any large amount of disk space. Alternately, you could simply expire the news after one day to keep things small.
Another alternative is to put up a web server on the news server for no other purpose than to read news and run our CGI and news unpacking programs. Again, such a web server would only serve requests for news stories, so its load would not be high. It doesn't need to be a full production web server.
Which web server you use is up to you. Apache is free, of course, but you may prefer another one.
As a last resort, you can arrange to build the files on one system and copy them to the web server. This can be done with a tool like "rdist", or you can easily have the web extractor or other tools make a list of the files they changed and feed that list into a program like cpio to make a stream to send to the other system (via rsh, rcp, or even FTP) for unpacking.
If your news server and web server are totally divided by firewalls that block everything but FTP, we strongly recommend you just install C News on your web server or a web server on your news system. These tools were meant to mirror news feeds through firewalls! But if you can't, the copy-the-files approach described above (rdist, or a cpio stream sent via FTP) will still work.
Mailing list, example images, and further help
There is a mailing list for the ClariWeb admin community: clariweb-admins@clari.net
Here you'll find other admins who are implementing or have already implemented ClariWeb at their sites, with scripts, hints, and tips on how they've set up ClariWeb, as well as answers to questions.
To join, send a mail to clariweb-admins-request@clari.net with the word "subscribe" in the Subject: line.
There is also a Snapshot Archive of images showing what some other ClariWeb sites are doing, and the variety of ways that ClariWeb can be customized to fit in with the overall motif of your website. These should provide some good ideas as you plan how you want to present the ClariNews on your pages.