URFC 002    K NEWS

Keyword News System Proposal

By Brad Templeton
Looking Glass Software Limited
(brad@looking.uucp)

USENET Request For Comment 002

NOTE: This is essentially a document from around 1983 with a few minor updates. Since then I have given up on the idea of keywords as the only type of news classification, and instead believe in a multi-level hierarchy with newsgroups, topics and keywords. Nonetheless, most of what was said in this document 5 years ago still applies.

For some time people have been using a news system based on newsgroups. This is a short outline of my proposal for a news system based on a classification system I called keywords. The only essential difference between a newsgroup and a keyword is that the Keyword news system (or K news) is designed so that there is a very small overhead for each keyword. It is thus possible to have thousands and thousands of active keywords with little overhead.

It is my feeling that several problems have emerged with the old newsgroup style system. Many of them are solved by K news.

(1) Due to the limited number of groups, there is a great deal of traffic concerning what articles belong in which groups and whether certain groups should be created or destroyed. Under K news, there is no such discussion. If you want a new keyword, you create it. If you want to use a name that is long and descriptive, you can. If discussions go under several keywords, it is easy to add them to your list.

(2) The limited number of groups also creates groups like "misc.misc". K news eliminates the need for net.misc and allows easy renaming of net.general.

(3) Current systems only allow an "or"ing of groups when dealing with multiple groups. In K news, it is possible to request articles that deal with a set of keywords, i.e. one can ask to be shown only articles that contain both the "science fiction" keyword and the "movie" keyword.

(4) Current systems do not allow grouping all followups to a given article together, or sorting articles according to posting date. K news provides this because it uses sort(1) on the complete list of articles to be seen.

(5) Current news systems are slower than they should be because they must scan each newsgroup a user subscribes to, to see if there is news. K news does not have this problem. The K news design can be slower to start up, but will be instantaneous in operation.

(6) Current systems just don't allow users to be selective enough in filtering news efficiently. There's just too much volume, and secretary programs, kill files and the "n" key aren't enough. By providing keywords, we get an extra level of selectivity in reading news.

(7) Current news systems have difficulty in showing each article only once to a given user, particularly if two different news reading sessions are involved. The K news implementation scheme I suggest does not encounter this problem.

1. The Keyword Environment

K news can solve the B news problems by promoting a different environment with keywords. First of all, the distribution of an article is taken out of the keyword name. This means all keywords are valid over all distributions. The fact that there is an "auto" keyword means you can post an auto article to netwide, statewide or even local distribution. This should cut down on the number of people advertising their cars to "rec.auto" because the only auto group has netwide distribution.

An article will have several keywords. The K news system will probably insist that members of certain sets of keywords be present. For example, there should be a distribution keyword with any article that is not local. There might be a "followup" keyword on any followup, although these can be detected from their "References" string. Keywords like "spoiler" and "flame" can be put with articles so that people can request not to see them.
(Ridiculous groups like alt.flame go away.)

It seems that all articles fall into a certain set of classes. These classes are "query", "original information", "reprint", "opinion" and "followup". There are some sub-classes, such as "flame" (a type of opinion) and "source code"/"binary" (types of original information). It might be a good idea to insist that all posters provide one of these keywords, with the followup keyword being automatic. Thus a reader can shut off all queries or all opinion articles, or both.

Groups like "misc.misc" will no longer be needed. Any new discussion can easily rate a new keyword, from "big mac" to "socks in hyperspace". The group "net.general" is still a bit of a problem, but it can now be replaced with something like "announcement for all users", and there will be very little implicit cry to put the article in the netwide distribution. There will still be problems, but they will be reduced.

It's also possible that we will still get a lot of the "You posted that to the wrong keyword" type stuff. It is hoped that since adding and deleting keywords in a subscription list will be quite easy, people will not complain too much about this. Even so, it is still possible that some utility to help users select keywords will be required.

Each site will keep all known keywords in a DBM type file. (This will be the total overhead for each keyword.) The DBM file entry might contain who first used the keyword, a one line entry describing it, and its newsgroup mapping on a B to K interface system. A simple utility might scan a user's article for any of the keywords that occur in the text of the article and suggest them as possible entries. In addition, if the user suggests a new keyword when posting an article, a search for existing keywords that the new one could be a misspelling of would be in order.
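
The two helper utilities just described could look roughly like the following Python sketch. It is only an illustration: the function names are hypothetical, the known keywords are passed in as a plain set rather than read from a DBM file, and difflib's similarity ratio stands in for whatever spelling-correction method a real implementation would choose.

```python
import difflib
import re


def suggest_keywords(article_text, known_keywords):
    """Return the known keywords that occur as whole phrases in the text."""
    # Lower-case and collapse white space so "Space  Shuttle" still matches.
    text = " ".join(article_text.lower().split())
    found = []
    for keyword in sorted(known_keywords):
        if re.search(r"\b" + re.escape(keyword) + r"\b", text):
            found.append(keyword)
    return found


def possible_misspellings(new_keyword, known_keywords, cutoff=0.8):
    """Return existing keywords the new keyword may be a misspelling of.

    The cutoff of 0.8 is an arbitrary assumption, not a tuned value.
    """
    return difflib.get_close_matches(
        new_keyword.lower(), list(known_keywords), n=3, cutoff=cutoff
    )
```

A posting program might call suggest_keywords() on the article body to offer keywords, and call possible_misspellings() before accepting a keyword it has not seen before, asking the user to confirm if the result is not empty.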

Since the keywords are the important thing that gets copied over in a followup, the subject line will not remain the same. One current problem under B news is that you get discussions that wander under the same subject line. This subject soon becomes meaningless. Any followup generated with K news will have an entirely new subject line, since both the keywords and the References string will provide an indication of what is a followup to what.

1.1. Types of Keywords

Most keywords will be user generated. Stuff like "microcomputer", "trs-80", "space shuttle", "frank zappa" and "homosexuality". Others will be system generated. These are keywords that apply to distribution, sites and such things. These keywords will all have an "=" sign in them for matching purposes. "distribution=usa" would match articles with usa in the distribution field. "site=looking" would catch articles from Looking Glass Software.

All keywords when processed by the system will be set into lower case, and all sections of white space will be mapped to a single space. An "s" on the end of a keyword will not be important in comparison so we don't have worries about pluralization. Keywords will be sorted into alphabetical order inside the article so that the same set of keywords is always identical when compared.

2. The K news implementation

To develop a keyword based system, we need a different implementation scheme than the one used for B news. In particular, keywords must have minimal overhead associated with them. Things like an entire directory and a line in every .newsrc file for each keyword can't be used.

One of the facts that can be used in a new implementation is that the average news reader normally reads only the news that has arrived since news was last read. Thus, instead of scanning directories and keeping track of what has been read, K news scans a history file and keeps track of what has NOT been read.
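
The keyword normalisation rules of section 1.1 matter to any implementation, since history records and subscription patterns must agree on one spelling of each keyword. A minimal Python sketch of those rules follows; the function names are hypothetical, and dropping a single trailing "s" is a literal reading of the rule that will also shorten words such as "news".

```python
def normalize_keyword(keyword):
    """Lower-case a keyword and map each run of white space to one space."""
    return " ".join(keyword.lower().split())


def comparison_form(keyword):
    """Form used when comparing keywords: a trailing "s" is ignored."""
    normalized = normalize_keyword(keyword)
    return normalized[:-1] if normalized.endswith("s") else normalized


def canonical_keywords(keywords):
    """Normalise, drop duplicates and sort alphabetically, so that the
    same set of keywords always compares as identical."""
    return sorted({normalize_keyword(k) for k in keywords})
```

With these, "Movies" and "movie" compare as equal, and the keyword list stored with an article does not depend on the order in which the poster typed it.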

In a given session, the history file is scanned from the point in time when news was last read. In addition, a file of articles not read from the previous session is scanned. The user may request to see the old articles first, or to have them merged in with the newer ones. Finding out what to read is a simple matter of scanning a few files and should be quite fast.

I have set out some ideas for implementing the K news system. The idea breaks the news software into a series of simple, efficient modules. This scheme could also be applied to other news systems. I will present a brief summary of the modules with more details further on.

(1) The "inews" program takes articles, stores them in files and writes out history records describing each article. One history file is kept per day, or other appropriate time interval.

(2) The subscription filter program grabs a list of articles not yet seen from the history files, and matches them against the user's subscription file. It writes out a file containing a list of articles the user wants to see according to the subscriptions. With a typical 700 articles per day, this means processing and pattern matching a file with 700 long lines. It should normally not take longer than a grep on such a file (about 2 seconds on a VAX-like machine).

(3) Any standard sort program sorts the output of the filter program to provide a list of articles the user wants to see, in the order the user wants to see them. (Estimated time for Unix sort: 5 seconds.)

(4) A variety of user interface programs read in the list of articles to see, and present these articles in a way the user likes. Most of the work is already done, so there can be several of these.

(5) Various utilities for use by user interface programs will exist, including joke decryption, following up and subscription file management.

A special utility would exist to take all the articles to be seen and put them into a "batch" for sending to other systems. Thus other systems become just like users, with subscription files.

Here follows more detail.

2.1. Receiving Posted and Transmitted News

The "inews" equivalent of K news should be quite simple to implement. When an article comes in, all it need do is place the article in a file somewhere (it could even let B news do this for it) with possible header processing. Once the article has been placed, a header record must be written out to the K news history file for that date, describing various header attributes of the article and what file it was put in. It is not necessary that there be a transmission mechanism if batching of news is intended. It would still be possible to include one, however.

As noted above, the article can be put in a file by a special program that returns the name of the file. This puts the operating system related things in a simple program, and makes the system more portable. Whatever the program is that places the article in a file, the filename is passed to the K news pickup program. This program will take the article, and examine the header. Important information about the article will be written to a special history file. This will include the keywords associated with the article, the "References:" string of the article plus its message-id, the date of posting, the pathname of the file containing the article with optional seek address and length, and finally the subject line. Note, by the way, that in the case of a followup, any extra keywords that were not in the original article will have to be placed in an extra field so they are not involved in the sorting that groups articles with their followups.

History files will be maintained on a one per day basis, in a special history directory. Each history file's name will be formed from the date for that history file.
(Perhaps in days since the birthday of the net, or perhaps in the form yymmdd.) There may be a new history file each day, each week, or even every hour as the site requires. K inews will query the date and time from the system and decide which history file to append to.

2.2. News reading stage one - history filter

The first stage of any news reading is the history filter that is common to all news reading and collecting programs. This program first notes the last time the user read news and finds the appropriate spot in the history files. This list of articles in the history file is combined (if the user has requested it) with a special per-user list of articles that have already been processed, but which the user has decided to read later. (As this user file is already in the proper order, the merging may actually take place later to be more efficient.)

Now the system has a list of possible articles to read. It must decide which ones the user wants to see. To do this, we use a user created "subscription file". This file contains a list of keyword patterns describing the user's taste in articles.

The subscription file is read in and parsed into a tree. As will be described, the subscription file contains keywords and keyword patterns that will be matched against articles. Each pattern is given a "sort value" that indicates how important the associated keywords are. This sort value may be either explicit, or derived implicitly from the order of the subscription file. Articles will be shown in the order dictated by the sort value of their keywords, so users can direct the order that their news will be seen in.

Lines from the history file are read in, and matched against the subscription list. If they match, the appropriate line is written out onto a temporary file.
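
To make the history files concrete, here is a minimal Python sketch of one possible record layout and of the daily file names a reader would have to scan. Everything specific in it is an assumption made for illustration: the tab-separated layout, the field order, the class and function names, and the choice of the yymmdd naming option with one file per day. The extra-keyword field for followups and the optional seek address and length are left out to keep the sketch short.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class HistoryRecord:
    keywords: list    # base keywords, already normalised and sorted
    references: list  # References chain, the article's own message-id last
    posted: str       # date of posting, kept as text
    path: str         # pathname of the file containing the article
    subject: str


def format_record(record):
    """One history line. Commas separate keywords, since "," is one of
    the characters a keyword may not contain."""
    return "\t".join([
        ",".join(record.keywords),
        " ".join(record.references),
        record.posted,
        record.path,
        record.subject,
    ])


def parse_record(line):
    """Inverse of format_record()."""
    keywords, references, posted, path, subject = line.rstrip("\n").split("\t", 4)
    return HistoryRecord(keywords.split(","), references.split(), posted, path, subject)


def history_file_names(last_read, today):
    """Daily history file names (yymmdd) from last_read through today."""
    names = []
    day = last_read
    while day <= today:
        names.append(day.strftime("%y%m%d"))
        day += timedelta(days=1)
    return names
```

The filter would open each named file in turn, skip the part of the first file that was already processed in the previous session, and run parse_record() on each remaining line before matching it against the subscription list. How the position within the first file is remembered is left open here, as it is in the text.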

Matching can be done on keywords or other information, such as the article-ids in followup chains, the poster, the posting site, the distribution and anything else that is implemented. It is important to note that the ability to match on article-ids allows users to request or shut out discussion chains based on followups.

Instead of writing out the keywords to this file, we write instead the sort values given to each keyword. These sort values are themselves sorted on the line before being written. The old keywords are also output, but not for the purposes of sorting.

Once the new file is prepared it is sent off to sort(1), possibly with the file of previously skipped articles appended to it. The first sort key is the keyword sort values. Since followups all have the same base keywords, they will match as equal in the first sort key. Since we are sorting by the keyword sort values, the output file will have the articles sorted by keywords in the presentation order the user requested. The next sort key is the "References" chain, which includes the message-id of the actual article tacked on the end. Original articles will have just their own ID present. Thus all followups to a given article are sorted in a nice tree.

Other information output includes the date of posting. While we want to sort on this date for articles at the same "level" (on a followup basis), it is impossible for most sort programs to do this. This sorting must be done in a second pass (it's fairly simple) or right within the user interface program.

Any amount of additional information can be output to this file. In theory, most of the header information could be written (lines would start getting pretty long) so that the user interface program need not even open the article file for articles a user says "no" to. This is a trade-off to be worked out.
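
The two sort keys described above can be illustrated with a short Python sketch. The line layout (tab-separated fields) and the function names are assumptions, and Python's sorted() stands in for sort(1). The References chain is compared one message-id at a time, so an article always sorts immediately ahead of its own followups.

```python
def make_sort_line(sort_values, references, path):
    """Build one line of the file handed to the sort step.

    sort_values: the sort values of the article's base keywords
    references:  References chain with the article's own message-id last
    path:        pathname of the file holding the article
    """
    return "\t".join([" ".join(sorted(sort_values)), " ".join(references), path])


def sort_key(line):
    """First key: keyword sort values. Second key: the References chain."""
    values, references, _rest = line.split("\t", 2)
    return (values, references.split())


def presentation_order(lines):
    """Return the lines in the order they would be shown to the user."""
    return sorted(lines, key=sort_key)
```

For example, with one discussion under a keyword valued "AAA" and an unrelated article valued "zz", the "AAA" original comes first, each followup appears directly after the article it answers, and the "zz" article comes last. Ordering by posting date within one level of a discussion is not attempted here, for the reason given in the text.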

One important item that has to be there, of course, is the name of the file where the article actually resides.

Once sort is called we will have a file which has, in addition to a lot of extra information, a list of pathnames for the articles the user wishes to read. The keywords on these articles may also be present. This is passed to phase two.

2.3. Date & Discussion Sorting

A special pass may be used to sort by the date within a discussion, since many will want this. This is a simple task that can be left to the user interface phase, but it could also be done in general for anybody to use it. It would be slower this way, since a whole extra pass would be required.

2.4. User Interface

User interface programs will vary from being dumb to quite fancy. Since each gets passed a readymade list of articles, there is not much work to do. All a simple one need do is go through the list, doing what msgs or readnews currently does to each file. These programs will handle replies, followups etc. Special utilities will be provided for cancelling etc.

When a user skips an article for later review, the program can write the appropriate line to the unread article file noted above. It is hoped the average user will not let this file get too big. More sophisticated programs will keep track of a list of seek addresses in the sort output file that mark articles that have not been read, and output this at the end of a session. This allows programs to let users skip back and forth among the articles since the information is not written out until the end. In fact, it might be a useful utility to provide for writers of user interfaces.

User interfaces can get quite fancy, with screen systems like notesfiles and rn. It would be nice to provide a feature so that unrecognized commands are passed to the shell with a search path list including a special directory for news commands. (Perhaps an environment variable so the user can specify.)
In the news command directory you put simple commands like "decrypt" and "undigest" with appropriate short names. It is expected that several user interfaces will be written, including one just like rn and one just like notesfiles.

All interfaces to the subscription file by the user interface program should be through other programs that are part of phase one if possible. This keeps things apart.

2.5. B and K news Interface and Transition

In the design of K news, we can plan for three schemes of usage. One is to design K news without paying attention to any other news systems. This would require creation of a totally new net that won't talk to newsgroup based nets. This would be slow, but has the appeal that it would create a net that wasn't bogged down the way the current one is. This "let them stew in their mess" attitude is a bit snobby though, and could create a lot of problems in getting K news accepted. Another thing to consider is that there is a high probability somebody will put together some kind of interface between systems that is jury-rigged and far from the best. This happened with the Notes-B news interface, and created a royal mess that was worse than the problems that would have resulted from working together on things.

Another scheme is to make a system that can interface to B news, but doesn't plan to do so for long. The idea would be that if K news were good enough, everybody would eventually switch and we would have a new pure system. In the meantime they could co-exist. Aside from the technical problems involved, there is the question of when the switch would occur, and if the idea of newsgroups would ever get out of the system.

The compromise solution is to plan for permanent cooperation by incorporating the newsgroup idea into K news. A newsgroup becomes a special, high overhead keyword. In K news, it is used as a directory name for storing articles, and as the interface to B news.
In this system, we demand that K news users provide newsgroups as well as keywords on their articles. Although this has some problems in educating the users, I think it is no worse than sticking with newsgroups.

If newsgroups exist, and B news sites exist, a mechanism is required that maps newsgroups to appropriate keywords. One simple mechanism is just to include a Newsgroup=xxx keyword for each newsgroup an article belongs in. K news users can select that keyword in their subscription file. Slightly more sophisticated would be to create a mapping table at B to K interface sites so that articles in a group like "net.columbia" get keywords of the form "Newsgroup=net.columbia" and "space shuttle".

2.6. Shipping to other sites

With new modifications to uux possible, it is envisioned that each site receiving news from a K news site would essentially have a .newsrc like file on the forwarding site. This is to say that each site would be in the same position as a user, with a keyword subscription list and a list of unread articles. Forwarding could either be done by using the same process a user does to read news when a transfer is made, or by having the K inews check each article in the subscription files for known sites. The first way, of course, is much more efficient. With batching, the first stage readnews process could be run to collect the chosen files in a batch.

2.6.1. Distribution

In order to keep a site's subscription file simple, distribution keywords (required on all articles) will be matched by "distribution=xxx", where xxx is stuff like "local", "canada", "usa", and the dreaded "worldwide" (equal to "net"). The default distribution for posted articles will be set locally, but it should be encouraged to be as small as reasonable, such as the local state or province.

One problem with this sort of distribution scheme (and the current B system) is that sometimes a user really does want an article distributed netwide in the "auto" newsgroup but only locally in the "general" newsgroup. Consideration must thus be given to explicit distribution bindings on keywords. My suggestion is to have the "distribution" keywords (as we think of them now) apply to all keywords, except those with an explicit distribution. Thus, take an article with the headers:

    Subject: Toronto Space museum opens
    Distribution: local
    Keywords: events, space/north america

Such an article would go to "events" readers locally, and "space" readers both locally and all over the continent.

3. Subscription List

One of the most important facets of the K news implementation I propose is the use of a sophisticated subscription list. This list would be used by both users and sites to decide what articles are to be seen during a session. Fundamental to this scheme is the ability to define keyword patterns, so that selections can be done on not just single keywords (as B news works) but on arbitrary combinations.

The first reading program will maintain two files. The first of these is the subscription list. This tells which keywords and discussions the user is interested in. This will be a list of keywords subscribed to and boolean expressions built from them.

Keywords are actually text strings, but they may not contain a special set of characters which are used to delimit them. These characters are "=", ":", ",", "!", "[", "]", "&", "|", "*", "/", "(", and ")" to start with. Some, like "=", are used within meta-keywords to match special conditions known to the software like sites, article-ids and the like. No doubt more special characters should be reserved for future use, while some should be allowed within keywords. Each line in a subscription file consists of a keyword pattern to describe the user's interests.
In addition, some special lines in the subscription file will tell what the user wants done with articles from the previous session, and possibly special options.

A typical subscription line lists a keyword pattern. For example, the line:

    science fiction

asks for all articles with the keyword "science fiction". Quotes may be required, but this is a matter to be decided. It also makes sense that any blank fields in a keyword be compressed to one space so that typos do not cause problems.

The line "!star wars defence" would ask that no articles with the keyword "star wars defence" be shown. We can also use "Ronald Reagan & taxation" to ask for all articles with both of the keywords shown. Similarly "Ronald Reagan & !taxation" shows us all articles about old Ron that do not contain the taxation keyword. Or we could go for

    Ronald Reagan & !( taxation | star wars defence )

which shows us articles about Ron that have nothing to do with taxation or the star wars defence scheme.

The order in the file is important. When phase one tries to figure out if a user wants to see an article, it scans through the information in the subscription list, in order. It stops as soon as it finds some form of definite information. This means either positive information or negative information. If the first line in your subscription file is "Ronald Reagan", you will see all such articles, even if they contain other keywords that you hate. Likewise, if the first line in the file is "!Ronald Reagan", you will never see an article about him, even if it contains a keyword you subscribe to later on. (There is an alternate system described below to change this.)

The character "*" will match any keyword. It would be placed on the last line of a subscription file to indicate that any keyword not marked with an "!" is subscribed to. It is doubtful anybody would use this after the number of keywords grows.
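
The first-match rule can be sketched in Python as below. This is one possible reading of the scheme rather than a definitive one. It assumes that a line whose whole pattern is a negation (such as "!flame") rejects the article when the negated part matches, that any other line accepts the article when it matches, and that an article matching no line is not shown. All function names are hypothetical. Sort attributes in brackets, OPTIONS lines, blank lines and the rule that ignores a trailing "s" are not handled.

```python
import re

# A token is one of the operators ! & | ( ) or a run of other characters.
TOKEN = re.compile(r"\s*([!&|()]|[^!&|()]+)")


def tokenize(line):
    tokens = []
    for token in TOKEN.findall(line):
        token = " ".join(token.lower().split())  # normalise keyword text
        if token:
            tokens.append(token)
    return tokens


def parse(tokens):
    """Recursive descent parser: "|" binds loosest, then "&", then "!"."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        return token

    def expr():
        node = term()
        while peek() == "|":
            take()
            node = ("or", node, term())
        return node

    def term():
        node = factor()
        while peek() == "&":
            take()
            node = ("and", node, factor())
        return node

    def factor():
        token = take()
        if token == "!":
            return ("not", factor())
        if token == "(":
            node = expr()
            if take() != ")":
                raise ValueError("expected )")
            return node
        return ("kw", token)

    node = expr()
    if pos != len(tokens):
        raise ValueError("unexpected token: " + tokens[pos])
    return node


def matches(node, keywords):
    kind = node[0]
    if kind == "kw":
        # "*" matches any keyword at all.
        return bool(keywords) if node[1] == "*" else node[1] in keywords
    if kind == "not":
        return not matches(node[1], keywords)
    if kind == "and":
        return matches(node[1], keywords) and matches(node[2], keywords)
    return matches(node[1], keywords) or matches(node[2], keywords)


def wants(subscription_lines, keywords):
    """Scan the lines in order and stop at the first definite answer."""
    keywords = set(keywords)
    for line in subscription_lines:
        node = parse(tokenize(line))
        if node[0] == "not":
            if matches(node[1], keywords):
                return False  # negative information
        elif matches(node, keywords):
            return True  # positive information
    return False
```

A real filter would parse the subscription file once into a tree, as section 2.2 suggests, instead of re-parsing each line for every article as this sketch does.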

Keywords may have "sort attributes" on them to indicate which keywords you would like to see first in a session. These are essentially ASCII strings which will be passed to sort(1). If you want to see articles about "system shutdown" first, you give it a low value like "A". If you want to see articles about "big mac" last you give a priority of the form "zzzzzzzz". The nice thing about this is that when you have a new keyword, you can easily give it a priority between any two that exist, unless you have given something a priority like "^@", in which case it would be first for all time. We now see lines like:

    system shutdown [AAA]
    space [bb] & challenger [cc]

3.1. Sample file

Here are some sample subscription lines that you might have. The comments actually would not be in the file, although that could be a possible feature.

    OPTIONS: +newkeywords +oldnews  ; show me new keywords that have come in,
                                    ; and mix in my old news from before
    !flame                          ; show me no flame articles
    !query                          ; show me no "does anybody have" articles
    system news
    microcomputer & !trs-80         ; anything on micros that isn't on trs-80s
    unix & !(4bsd | version 7)
    sex & drugs                     ; anything about both
    rock & roll                     ; 8-)
    site=looking & poster=brad      ; anything from me - the default ;-)
    movies & distribution=ontario   ; movie articles from my own province only
    distribution=local              ; anything posted on my own machine
    art=123@looking                 ; that article and any followups
    !art=124@looking                ; none of that article or any followups
    !(!source code & #size>7K)      ; a possible feature, no file bigger than 7k
                                    ; that isn't a source file

4. Typical Session

The typical user interface program will first check to see what new keywords have come in since the last session. These will be recorded in a separate history file in which the last position read must be recorded.
The user, if it is requested by appropriate options, will then be given a list of new keywords that have appeared since the last time news was read. Some systems will query the user and allow him or her to place these new keywords in the subscription files.

The user interface must now call the phase one program, with appropriate options, and the name of a temporary file to put the sort output in. It may also request the sort output on a pipe if that is all it needs. (Most programs will want to be able to seek back in the output file.) Articles will then be shown in the order requested, grouped perhaps according to followup discussions or major keywords. At the end, a list of unread articles will be written out.

Articles will probably be grouped by discussions and higher priority keywords. Followups will insist on a change of subject and allow an addition of keywords and a change of the distribution.

5. Alternate Subscription Idea

It is possible users will require more control on which subscription lines get priority than the order in the file. Thus it is proposed that keywords get points based on how much a user wants to see a keyword. Keywords you want to see would get positive points and keywords you don't want to see would get negative points. For example: "Ronald Reagan : 5" would assign 5 points to any article containing that keyword. On the other hand "star wars defence : -4" and "taxation : -6" would assign negative points to those keywords. In this case, you would see articles with Reagan and star wars defence, but would not see articles with Reagan and taxation. Scores would apply to whole lines. For example:

    (Ronald Reagan [abc] & taxation [cde]) : 20

would give 20 points to any article with both keywords.

In this system, any article must scan the whole list. For every match we get, we add the points assigned for that match to our sum. If, at the end, the sum is >= 0, the user sees the article. If negative, it is not seen.
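
The point-summing rule can be shown with a few lines of Python. As a simplifying assumption, each rule here is only a set of keywords that must all be present, paired with its points; the full pattern language of section 3 is left out, and the function names are hypothetical.

```python
def score(rules, keywords):
    """Sum the points of every rule whose keywords are all present."""
    keywords = set(keywords)
    return sum(points for required, points in rules if set(required) <= keywords)


def shown(rules, keywords):
    """The article is shown when the total is zero or more."""
    return score(rules, keywords) >= 0


# The example scores from the text.
RULES = [
    ({"ronald reagan"}, 5),
    ({"star wars defence"}, -4),
    ({"taxation"}, -6),
]
```

With these rules an article on Reagan and star wars defence totals 1 and is shown, while one on Reagan and taxation totals -1 and is not. Note that, read literally, the ">= 0" rule also shows any article that matches no line at all, since its sum is 0. An implementation might want a different default.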

It should also be possible to assign scores of "oo" and "-oo" which would represent infinite scores and stop the scan right away.

In any system, by the way, the whole subscription file must be read into RAM. Since the phase one program has little to do but read this file, however, the K news system should be able to handle large subscription files. Since followup message-ids will also be placed in this file, a utility that deletes very old ones would be a good idea.

5.1. More Random Ideas

We can add subscription features as we like. It will have to be worked out what users want. Some ideas include the scheme above, plus:

(1) The ability to match a keyword only if it is alone on the line. For example, you might want to see articles about "microcomputers" but not if they are associated with other topics. Same with "abortion". This would be done with a numeric "keyword count" variable, so you might say "abortion & #keycount == 1".

(2) Real pattern matching on keywords, regular expression style. This might be too slow: if you don't allow it, the keyword programs can map the keywords seen to integers for easy matching. But it might be worth it.

(3) Pattern matching on the subject. This is something various news secretaries do. In theory, this should not be necessary as any important word you might search for would probably be a keyword.

(4) Pattern matching on the body. This could be done by means of those special hash formulae (such as csh uses) that tell if a given string is NOT within an article, with some reliability. Body pattern matching would only be applied on articles that need it, or things would get too slow.

(5) Timestamps on patterns added by programs to the subscription files. When you decide to shut off a discussion, the software will add a "!123@looking" to your file.
You don't want these to build up, so it might be good to have timestamps on them so that they can be removed later on once a discussion is dead.

(6) Piles more in the way of special keywords in the required group, so people can be more specific. Different types of classified ads.

(7) Facilities for moderators. Ability to pattern match on the moderator of choice.

6. Criticism and Answers

Of course no system is perfect and some have pointed out a few problems that may arise with K news. For most of these, I feel that the problem is even worse with newsgroups, or at least little better.

The main point is that some people feel that there are too many newsgroups now as it is. This is to say that there are too many to remember them all. Some feel that with the proliferation of keywords, users will be less certain what keyword to use, and post to the wrong keyword more often. Thus some important information that you might have seen could be lost.

It's my opinion that far more important information is lost today because of the noise that results from newsgroups being too general in scope. I, and many others, have unsubscribed to groups we are interested in because we can't handle all the garbage in the group to sort out the gems. I also use the "n" key a great deal - on over 70% of the articles in groups I do read. If the subject is too short, or "Orphaned response" or that sort of thing, I say "n" right away.

To keep this down, the answer is more software. As the need arises, we might see fancy programs to help people find the right keywords. Whenever somebody creates a keyword, it will be their duty to make a short description of it, including related words. Thus the creator of "ronald reagan" would add a line saying:

    president, arms race, abortion, economy, usa, government, politics

and an appropriate utility could take words from the user (perhaps even text of an article) and "grep" for words in the keyword list.
This would be a special utility called by the news posting utility, so it could be written and maintained at yet another location. This tool could also use standard spelling correction algorithms to suggest keywords.

Naturally, news administrators could update these keyword descriptions if the creator of the keyword didn't come up with a good one. A control message could even keep the file up to date.

Of course, keywords can be organized in hierarchies to make them easier to find.

7. Comments

This is just a draft proposal, and lots of little details are missing. Comments are welcome. Also welcome is somebody to implement the thing since many people are too busy to do so. The implementation could be done in spots, and much of the code can be taken from the existing B news since the same header formats etc. would be used.

I can be reached at watmath!looking!brad

Watmath is called by ihnp4, decvax, utzoo, uunet and many others.