The body of each article is HTML.
It contains an <HTML> tag, a <HEAD> section (including a <TITLE>) and a <BODY> section.
In the BODY are embedded tags to allow various bits of data to
be parsed and/or modified.
These are useful to have in addition to the article headers because the
headers can be stripped off to turn the articles into pure HTML.
These extra tags are simply ignored by a web browser without any ill
effects.
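As a rough illustration of stripping the headers to leave pure HTML, here is a minimal Python sketch. It assumes the headers end at the first blank line, as is usual for news articles; the text above does not state this, and the function name `strip_headers` is made up for the example.

```python
def strip_headers(article: str) -> str:
    """Return the HTML body of a news article, dropping the headers.

    Assumption: the headers are separated from the body by the first
    blank line. This is the usual news-article layout, not something
    the format description itself spells out.
    """
    # Normalise line endings so the blank-line search is uniform.
    text = article.replace("\r\n", "\n")
    _headers, separator, body = text.partition("\n\n")
    # With no blank line present, treat the whole input as the body.
    return body if separator else text
```

The embedded ClariNet tags stay in the result; as noted above, a browser simply ignores them.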
All the ClariNet data tags have the form <CLARI-ITEM TYPE> along with the corresponding closing tag of </CLARI-ITEM>.
The TYPE varies depending on what we're marking
(see below for a full list).
This tag is used in two different ways: a single-line form and a multi-line form. The single-line form looks like this:
The multi-line form looks like this:
The opening CLARI-ITEM tag is the last thing on the line prior to the data, though not necessarily the only thing.
The closing tag is always the first thing on the line following the data,
and is always marked with its TYPE.
Other CLARI-ITEMs may be contained within a multi-line item (such as a CAPTION being contained by a PHOTO-TABLE).
CLARI-ITEM tags respect the nested structure of HTML -- you'll never see a block element that is opened inside an item but not closed there, or closed inside an item without having been opened there.
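The nesting of CLARI-ITEM tags within one another can be checked with a small stack. The Python sketch below is only an illustration: the function name `is_well_nested` is hypothetical, the set of characters allowed in a TYPE is assumed, and because the closing tag is described both as </CLARI-ITEM> and as marked with its TYPE, the sketch accepts a closing tag with or without one.

```python
import re

# Matches opening and closing CLARI-ITEM tags. The TYPE is optional in
# the pattern so that a bare </CLARI-ITEM> is accepted too (assumption).
ITEM_TAG = re.compile(r"<(/?)CLARI-ITEM(?:\s+([A-Z0-9-]+))?\s*>", re.IGNORECASE)

def is_well_nested(html: str) -> bool:
    """Return True if every CLARI-ITEM tag is closed in nesting order."""
    open_types = []
    for match in ITEM_TAG.finditer(html):
        is_closing = match.group(1) == "/"
        item_type = (match.group(2) or "").upper()
        if not is_closing:
            open_types.append(item_type)
            continue
        if not open_types:
            return False  # closing tag with nothing open
        opened = open_types.pop()
        # Only compare types when the closing tag actually names one.
        if item_type and item_type != opened:
            return False
    return not open_types  # anything left open is an error
```

This looks only at the CLARI-ITEM tags themselves; it does not verify the surrounding HTML block elements.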
Not all of the defined CLARI-ITEM types are present in every story.
Just ignore any unknown tags (the normal thing to do when parsing HTML).
Also remember that it is easy to tell programmatically if you're dealing
with a multi-line tag because the opening tag will be the last item
prior to a newline.
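The end-of-line rule can be sketched in Python as follows. This is a minimal illustration, not ClariNet's own parser: the function name `classify_line` and the character set allowed in a TYPE are assumptions, and the sample lines in the usage note are invented around the CAPTION and PHOTO-TABLE types mentioned earlier.

```python
import re

# Opening tag of the form <CLARI-ITEM TYPE>. The TYPE is assumed to be
# made of letters, digits and hyphens (as in PHOTO-TABLE).
OPEN_TAG = re.compile(r"<CLARI-ITEM\s+([A-Z0-9-]+)\s*>", re.IGNORECASE)

def classify_line(line: str):
    """Return (TYPE, form) for the last opening CLARI-ITEM tag on a line.

    form is "multi" when the tag is the last thing on the line and
    "single" when data follows it on the same line. Returns None when
    the line holds no opening CLARI-ITEM tag.
    """
    matches = list(OPEN_TAG.finditer(line))
    if not matches:
        return None
    last = matches[-1]
    # Anything other than whitespace after the tag means data follows.
    trailing = line[last.end():].strip()
    form = "multi" if trailing == "" else "single"
    return last.group(1).upper(), form
```

For example, `classify_line("<TABLE><CLARI-ITEM PHOTO-TABLE>")` reports a multi-line item, while a line such as `<CLARI-ITEM CAPTION>Some text</CLARI-ITEM CAPTION>` is reported as single-line. A caller that meets a TYPE it does not know can simply skip it, in keeping with the advice to ignore unknown tags.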
There are also some special sub-tags for certain tags.
Note that if you are parsing extracted files, these tags will have already
been processed.
Here's an example article.
It contains one in-line photo (there can be 0, 1 at the top, or 1 at the
top and 1 at the bottom) and 3 icon images.
WASHINGTON, Aug 19 (AFP) - The United Parcel Service and the
Teamsters union have struck a tentative deal to end a two-week-old
strike against the package shipper that had snarled commerce
nationwide, the parties announced early Tuesday.

The agreement could mean a resumption in UPS deliveries as early
as Wednesday, said Teamster president Ron Carey, who appeared at a
news conference here with US Labor Secretary Alexis Herman.

"I congratulate both sides for their victory," Herman told
reporters at a downtown hotel, where officials from both sides --
along with a federal mediator -- had been locked in near-nonstop
talks since early Thursday.

Herman hailed both parties, saying the company and the union
held "shared values" for their workers.

"They also have a real commitment (to the workers) that I
believe is demonstrated in the historic settlement they have
reached," she said.

An estimated 185,000 Teamster drivers and handlers walked off
the job August 4 in a dispute over union demands for more full-time
jobs and rejection of a company plan to replace the Teamster pension
scheme with one of its own.

It was Herman who had coaxed the two sides back to the
bargaining table last week and who had spent the last several days
exhorting them to narrow their differences.

The accord could also be viewed as a victory for President Bill
Clinton, who held firm against anxious appeals from the company and
retailers for presidential intervention to end the walkout.

Clinton had all along insisted a settlement had to come from the
parties themselves rather than the government.

"Today's agreement represents their hard work and determination
to reconcile their differences for the good of the company, its
employees and the customers they serve," the president said in a
statement issued from the Massachusetts island of Martha's Vineyard,
where he is vacationing and -- on Tuesday -- celebrating his 51st
birthday.

UPS chief negotiator David Murray, who also appeared with
Herman, declined to comment on details of the accord until it had
been formally presented to the workers.

But he said the company would resume its services "very soon."

"We believe it is an agreement that we'll be able to remain
competitive with," he added, while acknowledging that "no one comes
away from the bargaining table with everything he wanted."

Carey by contrast was willing and eager to talk about the accord
and called his own news conference at Teamster headquarters here.

He said the agreement would meet the union demand for 10,000 new
full-time jobs over the life of a five-year contract, created
largely by combining part-time positions. The company had offered to
create 1,000 new full-time slots.

The deal also secured "very large pension increases," as much as
50 percent in some cases, Carey said.

"These plans are now under the Teamster pension plan just as
they've always been -- not a company controlled pension plan."

In addition, the accord would boost salaries, with the earnings
of a part-time worker increasing from 11 to 15 dollars an hour by
the end of the contract, according to the Teamster boss.

The package in addition replaces subcontractors with UPS workers
and establishes new safety guarantees for workers handling heavy
packages, Carey said.

With approval from several union committees, workers could soon
begin returning to their jobs, he added.

The rank and file would presumably then have to vote on the
package, probably by mail.

Earlier, Teamster officials had taken heart from opinion polls
showing public sympathy for the strikers.

UPS had warned that bowing to union demand would cripple the
company, noting that the strike had cost it between 200 and 300
million dollars a week as well as valuable regular customers who
were switching their business to competitors.

While no dollar value was ever put forward as to the effect of
the stoppage on the US economy, retailers and small businesses --
unable to afford prices charged by UPS competitors -- were
particularly hard hit.

You can view how it looks when it is formatted,
if you like.
The following CLARI-ITEM types are currently defined:
Date header.
From header.
<TITLE>, but may have some data stripped (such as a trailing date).
"story" or "feature").
Some IMG tags have CLARI-XFN after the "SRC", some Anchor tags have CLARI-XGN
after the "HREF", and various tags (such as HTML and BODY) can
have a CLARI-STYLE tag.
The purpose of these tags is to allow the news-hosted version
to be easily converted into a web-hosted version.
For instance, all external references to newsgroups and other
articles (including images) are in news-relative links.
If the articles are translated into a web-based format using our
extraction technique,
the conversion is done using these tags.
In that conversion, the values of the CLARI-XFN and CLARI-XGN tags are swapped with the
value of the preceding SRC or HREF sub-tag.
When this happens the previous news reference is placed in the
sub-tag CLARI-MID (for message IDs) and CLARI-GID (for group references).
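The swap can be illustrated with a short Python sketch. Several things here are assumptions rather than documented behaviour: that the sub-tags are written as NAME="value" attributes, that CLARI-XFN pairs with CLARI-MID and CLARI-XGN with CLARI-GID, and the helper names `_swap` and `convert_tag`. The sample values in the usage note are invented.

```python
import re

def _swap(tag: str, target: str, extra: str, saved: str) -> str:
    """Move the value of `extra` into `target`, keeping the old value under `saved`.

    Assumes the form TARGET="old" EXTRA="new", with EXTRA directly
    after TARGET, which matches the description but not a confirmed syntax.
    """
    pattern = re.compile(
        rf'({target})="([^"]*)"\s+{extra}="([^"]*)"', re.IGNORECASE
    )
    return pattern.sub(rf'\1="\3" {saved}="\2"', tag)

def convert_tag(tag: str) -> str:
    """Apply both swaps to a single HTML tag (hypothetical helper)."""
    # Assumed pairing: file references keep a message ID ...
    tag = _swap(tag, "SRC", "CLARI-XFN", "CLARI-MID")
    # ... and group references keep a group ID.
    tag = _swap(tag, "HREF", "CLARI-XGN", "CLARI-GID")
    return tag
```

Under these assumptions, `convert_tag('<IMG SRC="news:abc@example" CLARI-XFN="photo.jpg">')` would give `<IMG SRC="photo.jpg" CLARI-MID="news:abc@example">`. Tags without either sub-tag pass through unchanged.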