ClariNet PhotoGrab Perl Script

In order to liven up your web pages, ClariNet offers a special script, written in the Perl language, which extracts photos, captions, headlines and other material from ClariNet photo stories as they arrive at your site.

The script then writes the photos and other materials into the files of your web server, with HTML defined by a template that you write.

You can then arrange for a web page, such as your user home page, to include the HTML -- and the link to the photo -- within the page using the "server side include" mechanism found in most web servers.

You can see an example of this on the ClariNet home page.

In general, you can use this to provide the pictures and linked stories and newsgroups to your users, who are of course licensed to read ClariNet news. Before you provide these items on a page open to the general public, you should contact our sales department for permission. Only customers at certain levels of service can receive such permission.

Installing the Perl script takes a little system administration knowledge, and modifying it requires some knowledge of Perl. The administrator of USENET news should be involved. To modify the templates, however, you need not know any Perl. If you are comfortable writing HTML, you should have little difficulty working with the template file to customize how the pictures, captions and headlines look on your own web pages.

While we are happy to take bug reports at photograb@clari.net, PhotoGrab is offered on an as-is basis with no official tech support.

Please note that currently this script is meant for use with the Proletext standard ClariNet feed where photos come in multipart MIME articles, and not the new "clari.web" feed where articles are sent directly in HTML.

You can read our web page on working with the template file for more details.

What's below is for system admins and skilled webmasters.

Getting and Configuring PhotoGrab

You can pick up a TAR archive of the script and sample template files from our FTP site. We provide a number of different templates with some different picture styles for you to choose from or modify. You don't actually need a template file, because there is a default template built into the script.

You will need Perl on your system. We recommend Perl5, as we've seen some problems with Perl4.

Pick a place on your system for the script to reside. Modify the first line to point at your perl interpreter if need be. Also, you may need to modify the first variable set in the program, $http_root to point to the root of your web server's file tree.

This program needs to run on the news server, and it needs to write into a directory in the web server's file tree. If you don't have your web server on your news server, it can write into that tree via NFS, or you can modify the program to write into a local directory and copy the files over with some tool like rcp or rdist. It is important that this be done quickly so that the picture on the web page and the live news spool be kept in sync.

Installing it (1)

In your web page root directory -- usually /local/etc/httpd/htdocs -- create a directory for the program to operate in. The default name for this directory is claripic. Change the script if you change the name. This directory should be used for nothing else but photograb. If you put your own files in there, they are at risk.

This directory must be writable by the "news" user that operates your USENET news system, so that it can create files there. For testing, you might want to make it writable by yourself as well.

If you want, you can now install a template -- the default file is main.template in the claripic directory. However, you don't need this for testing.

The package comes with a shell script named "install" that will do all the installation work, but you should check it over to make sure it puts things in the places you want. It modifies the USENET news system configuration files if you have them in the standard place. (If you have both C news and INN or run them in non-standard places you may have to hand-edit the script.)

Testing

To test, run the shell script "testit". This will test the script on two sample photo articles in the directory, one small and one large. It will attempt to scale the large one to a smaller size -- it needs certain programs to do this.

It tells you how the test fared. Note that the test of a scaled version of the large picture may fail because your scaling programs are different from ours, and they produce a jpeg that looks OK but is not identical byte for byte with ours. Look at the jpeg manually.

You can also test the program yourself by grabbing a small photo article from the spool directory for clari.news.photos and putting it into a suitable testfile.

perl photograb <testfile

This will result in 3 files being created in claripic. One will be a "jpg" file -- a compressed picture -- with the photo from the article.

The second will be main.txt, which is the HTML code to give the picture a nice presentation on a web page. You will learn how to include it in your page with server side includes shortly.

The final file will be a file called last_expire. This file notes the last time the directory was cleaned out. Every 2 days or so, when photograb is run, it removes any pictures that are older than a week. This stops the directory from eating up disk space with all the pictures that get saved there.
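
The clean-out rule just described can be pictured with a small shell sketch. This is only an illustration, not photograb's own code (photograb does its clean-out internally, in Perl). The function name expire_old_pictures is made up for this example, and it assumes that pictures end in ".jpg" and that writing the current date into last_expire is an acceptable stand-in for however photograb really records the time.

```sh
# Sketch only: expire_old_pictures is a hypothetical helper, not part of photograb.
# It deletes pictures more than 7 days old from the given directory.
expire_old_pictures() {
    dir=$1
    # -mtime +7 matches files last modified more than 7 full days ago.
    find "$dir" -type f -name '*.jpg' -mtime +7 -exec rm -f {} +
    # Note when the clean-out ran (an assumed stand-in for last_expire).
    date > "$dir/last_expire"
}

# Demonstration in a scratch directory.
demo=$(mktemp -d)
touch "$demo/new.jpg"
touch -t 200001010000 "$demo/old.jpg"   # backdate this file to 1 Jan 2000
expire_old_pictures "$demo"
ls "$demo"
```

After the run, only the recent picture and the last_expire file should remain in the scratch directory.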

Modifying a template

If you want slightly different HTML, you should create a template file. The default template file, to be found in claripic, is main.template. The archive came with several examples of templates. Pick one you like, modify it as you wish, and copy it to the claripic directory.

The template file is actually a Perl script, but it's mostly just comments and one big Perl "print" statement that prints everything between two tags.

But it does more than just print the raw text. It also does variable interpolation -- the insertion of the contents of Perl variables and expressions -- into the template string. The program prepares many useful variables like $headline and $caption for use in the template.

If the template maintainer is a Perl programmer, they can do more. We have a web page on how to configure the template.

If you use ClariCGI you probably will want to use it to serve up the story for users who click on it, rather than a news: URL.

Live Install

As noted, the "install" shell script that comes with photograb will do all the install steps except of course for the modification of your chosen web page to include a reference to a picture, and the reconfiguration of your web server to support server side includes. However, it is a good idea for you to understand what the install script wants to do, and you may have to modify it yourself slightly.

Once you have tested the program you are ready to run it live. Your USENET news system includes the ability to run a program every time an article arrives in a newsgroup or set of newsgroups. You create a line much like that for an outgoing newsfeed, and specify a program to run on each article.

For example, the normal way to run this program is to run it on every article that arrives in clari.news.photos. It will discard any large photo items or non-photo items. Only a moderate number of items arrive in this newsgroup every day so the load is not too high.

You may of course pick another newsgroup, such as clari.sports.photos, but you need to pick one that has a photo reasonably often, at least once per week, otherwise you may put up a photo on your web page whose news article will expire before another photo arrives.

The first line of the perl script can either contain:

#!/local/bin/perl

(or wherever your perl interpreter is located) and the script can be made directly executable, or you can just run perl on it. For now we're going to assume that the script has been installed as /usr/lib/news/photograb and the perl interpreter is "/local/bin/perl".

C News

In C News, you might put a line in your sys (/usr/lib/news/sys) file of the form:

photofeed:clari.news.photos::/local/bin/perl /usr/lib/news/photograb <%s

This runs every article in that newsgroup through photograb, with the article on the standard input. The program is run as the "news" user, the same one that owns all the news files. Note that no distributions are listed -- you could list the valid distributions you get at your system.

INN

With INN, the line is very similar, but it goes in the "newsfeeds" file, usually /usr/lib/news/newsfeeds. In addition, unlike C News, you must tell INN you have changed the newsfeeds file with the command ctlinnd reload newsfeeds after you are done.

photofeed:clari.news.photos::/local/bin/perl /usr/lib/news/photograb

In this case, INN will run the program as the user that owns the "innd" directory (usually /usr/lib/news/innd).

Don't forget to do the "ctlinnd reload" to tell INN it has a new entry.

Options

You can grab more than one type of photo. The "p=name" option specifies the tag for the photo you are grabbing.

The default tag is "main" and the files you have seen above, such as main.txt and main.template and so on, are named using this tag. So for example if you want to grab the sports photo as well as the main news photo, you can say:

photofeed:clari.sports.photos::/local/bin/perl /usr/lib/news/photograb p=sport

Another option is "dir=directory" which will set the directory in which files are stored. This is good for testing.

See below for the special "size=" option.

Putting this into your home page

Right now the best way to do this is to use the "server side include" feature found in most web servers, including NCSA httpd, Apache, Netscape server and several others.

You can learn about server side includes in the NCSA HTTPD manual as most other servers copied their way of doing things.

Photograb also supports the direct rewriting of the target HTML page to insert the photo material if you don't support server side includes, or if you don't want to. This is slightly riskier for various reasons, but if your page is accessed a lot, it is more efficient than server side includes.

The server side include allows the insertion of a text file, with HTML, into a page of HTML. The HTML file output by this program, main.txt is suitable for server side include.

With most web servers, the syntax for a server side include is an HTML comment with special tags. Simply take the web page you wish to insert a photo into, pick a good space to insert the photo, and add a tag of the form:

<!--#include file="claripic/main.txt" -->

Note the pathname relative to the page inserting the text. If you need to refer to a file that is not at or below your web page (i.e. you are not at the root) you must use a virtual path, such as

<!--#include virtual="/claripic/main.txt" -->

Now the tough part. In most servers, your file must usually be stored with the extension "shtml" instead of "html" for server side includes to be interpreted. This has nothing to do with security (that's shttp). The server will only do the inclusion if the page has this extension. This may cause some complication if you wish to do this insertion on a home page, normally known as "index.html". You can configure all HTML pages to be interpreted for server-side-includes in many servers, though this has a cost in efficiency. You can also configure the system to look for "index.shtml" as well as "index.html" as an index page with a configuration line of (NCSA, Apache etc.):

DirectoryIndex index.shtml index.html

in the srm.conf file.

Your web server may not have server-side includes turned on at all, in which case you will need to turn them on. You will want to read about the security implications of this, and may only want to turn on the most basic abilities, and not the more risky "exec" feature. We recommend you set your directories as Options IncludesNoExec (in your httpd configuration file or in a per-directory configuration file) to turn on includes but not execution includes at this time.

To turn on the processing of includes at all you need to have a line like:

AddType text/x-server-parsed-html .shtml

or

AddType text/x-server-parsed-html .html

to turn it on for all HTML files, at least in the directory you are configuring. (This goes in srm.conf of NCSA/Apache style servers.)

Direct Insert instead of SSI

Photograb can also rewrite a web page for you in place, inserting the HTML that displays a picture.

This has some advantages and disadvantages. One advantage is that it is more efficient than server side includes, but this should only come into play if your web page is accessed extremely frequently, and even then a good cache will still make the includes reasonably efficient.

The other advantage is you don't have to turn on or even understand server side includes.

The disadvantage is that a program will be regularly rewriting one of your important pages. It does this by creating a new version of the page in the same directory, and then moving the new page on top of the old one. The risks involved here are:

The "news" user which runs photograb must have write permission on the directory that holds the web page, so that it can create a new file there and move that file on top of the original.
The web page file will end up owned by the "news" user, even if it starts off owned by somebody else.
The web page file must not have any hard links to it. The replacement process would break any such links, because it removes the old page and puts the new one in its place in a single step. (This is needed because if we just wrote to the file, a user might fetch the file from the web server at the same time as it was being written, and get a partial page.) Photograb complains, and won't alter a page that has links to it.
You must follow a special procedure to edit the page. If you edit it at the time a photo arrives, the arriving photo may erase your editing work or even damage the page. The procedure for editing is detailed below.
Currently the system can insert only one photo on a page, though it could be modified to build the pages a different way to insert more than one at a later date. For now use SSI to get multiple photos per web page.
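
If you want to check a page for extra hard links yourself before pointing photograb at it, something like the following shell sketch may help. It is an illustration only and says nothing about how photograb performs its own check. The helper name has_extra_links is invented for this example.

```sh
# Sketch only: has_extra_links is a hypothetical helper, not part of photograb.
# It succeeds (exit status 0) when the file has more than one hard link.
has_extra_links() {
    # find prints the name only if the link count is greater than 1.
    [ -n "$(find "$1" -prune -type f -links +1)" ]
}

# Demonstration in a scratch directory.
demo=$(mktemp -d)
echo '<html></html>' > "$demo/index.html"
if has_extra_links "$demo/index.html"; then
    echo "unsafe: page has extra links"
else
    echo "safe: no extra links"
fi

ln "$demo/index.html" "$demo/alias.html"    # add a second hard link
if has_extra_links "$demo/index.html"; then
    echo "unsafe: page has extra links"
fi
```

A symbolic link pointing at the page does not raise the link count, which is why a symbolic link is the safe choice.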

Direct Replacement Procedure

Designate the page you wish to insert a photo into, such as index.html. Create a copy of it, known as the "source", by taking that name and adding the suffix .grab to the end of it. As noted, this file must be in a directory the "news" user can write. If you want to do a page in the root of your web server, we recommend that you create a subdirectory for this page, and do a symbolic link from the web server root down to the actual page. A hard link must not be used.

Since the claripic directory probably already exists for this system and it can be written by the news user, it may make sense to just use this.

In the source file (index.html.grab for example) insert the following two lines where you want the picture to go. Put each on a line by itself. Any text between them will be removed when a picture is inserted.

<!-- PHOTOGRAB INSERTED PORTION START --> 
<!-- PHOTOGRAB INSERTED PORTION END -->

Insert these lines exactly as shown.

Do all the other setup as required for the server side include method, including the creation of the template file and any needed directories.

Now insert an invocation of photograb into the news "sys" or "newsfeeds" file as detailed elsewhere. Add to photograb a command line option of the form: "r=filename". The "filename" will be the name of the HTML file to be worked upon, without .grab on the end, i.e. the destination file.

When photograb runs with the "r=" option, it will read the .grab source file, and create a new file with the extension .new, replacing any lines between the two tags shown above with the text generated by your template. It then moves the .new file on top of the old file, retaining permissions but replacing ownership, and that deletes the old file.
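
To make the rewrite step concrete, here is a minimal shell sketch of the same idea: read the .grab source, write a .new file with fresh text between the two markers, then rename the .new file over the destination. It is not photograb's code. The helper name insert_photo_html, the file snippet.txt and the picture name photo.jpg are all invented for the example, and the sketch does not try to retain the old file's permissions.

```sh
# Sketch only: insert_photo_html is a hypothetical helper, not part of photograb.
# usage: insert_photo_html PAGE SNIPPET
# Reads PAGE.grab, writes PAGE.new with the contents of SNIPPET between the
# markers, then renames PAGE.new over PAGE.
insert_photo_html() {
    page=$1 snippet=$2
    awk -v snippet="$snippet" '
        /<!-- PHOTOGRAB INSERTED PORTION START -->/ {
            print                               # keep the start marker
            while ((getline line < snippet) > 0) print line
            close(snippet)
            skipping = 1                        # drop the old text that follows
            next
        }
        /<!-- PHOTOGRAB INSERTED PORTION END -->/ { skipping = 0 }
        !skipping { print }
    ' "$page.grab" > "$page.new" &&
    mv "$page.new" "$page"      # a rename within one filesystem is atomic
}

# Demonstration in a scratch directory.
demo=$(mktemp -d)
printf '%s\n' '<h1>Home</h1>' \
    '<!-- PHOTOGRAB INSERTED PORTION START -->' \
    'old picture text' \
    '<!-- PHOTOGRAB INSERTED PORTION END -->' \
    '<p>footer</p>' > "$demo/index.html.grab"
echo '<img src="claripic/photo.jpg">' > "$demo/snippet.txt"
insert_photo_html "$demo/index.html" "$demo/snippet.txt"
cat "$demo/index.html"
```

Because the new page is built under a different name and only renamed at the end, a web server reading index.html sees either the complete old page or the complete new one.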

Optionally, if no source file with .grab exists, the original web page will be used as the source. There are reasons not to do this, since it becomes difficult to safely edit the real file with both web fetching and photo updating going on. This is OK if you don't plan to edit the main file; otherwise you must edit it with care.

We recommend you set permissions on the master file to be 0444, or "r--r--r--" so that it is read-only. This will remind people not to write it, but to edit the source instead.

Handling multiple files

If you do more than one template with your invocation of photograb, you can also rewrite several files. To do this, the string #PROF# in the replace filename will be substituted with the name of the current template. (The default template is "main".)

Thus if you use "r=/local/etc/httpd/htdocs/claripic/page#PROF#.html" then the program will work on the files pagemain.html and pagesport.html if you run a "main" and "sport" template.
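
The substitution can be pictured with a short shell sketch. The helper expand_prof is hypothetical and only mimics the naming rule described here. Whether photograb replaces every occurrence of #PROF# or only the first is not stated, so the "g" flag below is an assumption.

```sh
# Sketch only: expand_prof is a hypothetical helper, not part of photograb.
# usage: expand_prof FILENAME_PATTERN TEMPLATE_NAME
expand_prof() {
    # Replace #PROF# with the template name (all occurrences: an assumption).
    printf '%s\n' "$1" | sed "s/#PROF#/$2/g"
}

for tag in main sport; do
    expand_prof 'page#PROF#.html' "$tag"
done
# prints:
#   pagemain.html
#   pagesport.html
```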

As noted you can't have more than one insertion per file.

Editing the Source

Editing the source file in place carries a very mild risk. You can usually edit it in place if you check after the fact that photograb didn't try to read it while you were writing it out. However, the safe way to edit it is to take a copy, change it, save it to a new file (anything but the file with a .new suffix) and then move (mv) the new file onto the .grab file. That is 100% safe.
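
The safe procedure can be wrapped in a few lines of shell. This is a sketch under stated assumptions, not something shipped with photograb: the helper name safe_edit and the ".edit" working suffix are invented here, and the working copy is kept in the same directory so that the final mv is a simple rename.

```sh
# Sketch only: safe_edit is a hypothetical helper, not part of photograb.
# usage: safe_edit SOURCE.grab EDIT_COMMAND [ARGS...]
# EDIT_COMMAND is run with the working copy's name as its last argument.
safe_edit() {
    src=$1; shift
    work="$src.edit"                 # any name except the ".new" file
    cp "$src" "$work" || return 1
    "$@" "$work" || { rm -f "$work"; return 1; }
    mv "$work" "$src"                # swap the edited copy in with one rename
}

# Demonstration in a scratch directory, with a stand-in "editor"
# that just appends a line.
demo=$(mktemp -d)
echo '<h1>Home</h1>' > "$demo/index.html.grab"
append_line() { echo '<p>added</p>' >> "$1"; }
safe_edit "$demo/index.html.grab" append_line
cat "$demo/index.html.grab"
```

In real use the edit command would be your editor, for example: safe_edit index.html.grab vi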

Doing only the picture

If you're totally averse to any insertion of text, you can simply inline a picture into a web page, but you can't get any text. So you would need to associate a link with it to say, "Click here to read news and get the story behind this picture." We don't recommend this but can provide more instructions on how to do it if desired. (There are problems with having the picture filename remain constant, since most web browsers will cache it and not show changes in it.)

Setting up for an HTML editor

If you are going to be editing the template files yourself, you probably understand what is needed to test the system. However, if you are going to be working with a user who knows HTML, but not system administration or web server configuration, you may want to take some extra steps.

In particular, consider making a script to run photograb with the "dir=directory" option to have it output into a test directory, or even "dir=." to output to the current directory. It's good to test new template files this way before installing them live. It will read the template file from that directory and write back out to that directory.

You can also extract a sample picture article for testing and save it, perhaps making a test script that runs photograb on that article in a test directory.

You will need to tell your HTML editor where to find the template to edit, and how to test, plus point them to our page on templates.

Synchronization

It is important, if your web template is going to provide links to stories by message-id, that photograb run on the same news server as the server where users will read articles by message-id. Otherwise the web page and news server may get out of sync.

Pictures at different sizes

While we feel the small photo size we use in our distribution is just about right as far as image size, quality and file size are concerned, you may want to render photos at a different size. You can do that by using the "size=" option to photograb. This will cause the system to interpret the full sized photo articles that come to your system (these are 640 by 480 in size and too large to be a teaser on a web page). Photograb will then take the large photo and send it through a pipe of special commands to scale it down to the size you specify.

You can say "size=100" to make a picture no more than 100 pixels wide, with whatever height is right at that width, or you can say "size=100,80" to make a picture that is no more than 100 wide or 80 high, though one of the dimensions will almost certainly be less than the maximum in the result. (Our pictures tend to come 640 by 480 or 480 by 640 but there is variance.)
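
The arithmetic behind these limits is ordinary aspect-ratio fitting, sketched below in shell. The helper fit_size is hypothetical; the real scaling is done by the external programs described further on, which may round differently (this sketch truncates to whole pixels) and may treat pictures already smaller than the limit differently.

```sh
# Sketch only: fit_size is a hypothetical helper, not part of photograb.
# usage: fit_size WIDTH HEIGHT MAX_W [MAX_H]   -> prints "NEW_W NEW_H"
fit_size() {
    w=$1 h=$2 max_w=$3 max_h=${4:-0}
    new_w=$max_w
    new_h=$(( h * max_w / w ))          # the height that goes with max_w
    if [ "$max_h" -gt 0 ] && [ "$new_h" -gt "$max_h" ]; then
        new_h=$max_h                    # height is the tighter limit
        new_w=$(( w * max_h / h ))
    fi
    echo "$new_w $new_h"
}

fit_size 640 480 100        # size=100 on a landscape photo: prints "100 75"
fit_size 480 640 100 80     # size=100,80 on a portrait photo: prints "60 80"
```

In the second call the height limit wins, so the width comes out well under the 100 pixel maximum.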

If this option is provided, the large photo is processed instead of the small photo. However, the $message_id and $large_message_id variables still refer to the small photo story and the large photo story. The variable $attributes and the $image_width and $image_height are recalculated after you scale the photo.

You can say "size=full" and the program will extract the large photo at its full size. Generally this will be the full width of a web page and you won't want to put text beside it.

In order to make all this work you must have 3 programs which are available free from the net:

djpeg -- which unpacks JPEGs into PBM+ (Portable Bitmap Plus) format
pnmscale -- the scaling program from the PBM+ package
cjpeg -- a free JPEG compressor that saves out the result

CJPEG and DJPEG come together from the Independent JPEG Group. You can FTP the source from the official repository on Netcom or UUNET. PBM can be found all over the net, including at wuarchive. A newer version, called NetPBM is at this FTP directory. These programs must be in the "path" of the news user or you can specify an explicit path by modifying the $scalebin variable in the source. You can also adjust the jpeg quality (default 55) in the source.
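
Before turning on "size=", it may be worth confirming that all three programs are reachable and trying the scaling chain by hand. The shell sketch below is an illustration only: both helper names are invented, and the exact command line photograb builds internally may differ. It assumes a pnmscale that accepts the -xysize option.

```sh
# Sketch only: both helpers are hypothetical, not part of photograb.

# Print the name of each listed program that is not on the PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# usage: scale_jpeg MAX_W MAX_H < large.jpg > small.jpg
# Unpack the JPEG, scale it to fit the box, and recompress at quality 55.
scale_jpeg() {
    djpeg | pnmscale -xysize "$1" "$2" | cjpeg -quality 55
}

missing=$(missing_tools djpeg pnmscale cjpeg)
if [ -n "$missing" ]; then
    echo "not found on PATH:" $missing
fi
```

Run the check as the "news" user, since that is the user whose path matters when photograb runs live.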

Summary

Get photograb and unpack in suitable place on your system. (cd wherever; tar xvf photograb.tar)
Get test picture article and test system. (testit)
Modify $html_root if need be in photograb.
(Modify "install" script and execute -- it does the following things)
Create claripic directory in Web page tree, writable by "news" user plus perhaps others. (cd $html_root; mkdir claripic; chown news claripic)
Copy suitable main.template to claripic directory. (cp main.template $html_root/claripic)
Install command to run photograb into news "sys" or "newsfeeds" file. If INN, do "ctlinnd reload newsfeeds" after installing. (edit /usr/lib/news/??;ctlinnd reload newsfeeds)
Make sure server-side includes are enabled on your web server, at least for page that is to include picture. (edit /local/etc/httpd/conf/srm.conf)
Insert include line into desired web page, rename to ".shtml" if this is how you have configured for server-side includes.
Make test script and directory for template editor, if need be.