Curbing junk USENET News without curbing server performance

An article on reducing the effect of spam on your USENET News servers without killing your servers' performance in the meantime.

Article by Bulent Yilmaz

USENET News usage has grown enormously over the past few years. Current News plants are expected to push DS3 levels of traffic and above. Despite increased demands upon the infrastructure, plants are often left without important hardware upgrades, and most News administrators are being forced to do more with less. Frequently, turning off spam filtering is one of the first choices administrators make in order to keep their hardware going.

An Introduction to MicroSieve

Today's large news plants typically have multiple relatively low-powered servers. These servers are configured to receive News from different peers, transport the News amongst themselves, and then push the News back out to the peers. This layout is not only effective at moving News but also offers an opportunity to distribute News filtering overhead.

The two most common spam filtering systems in use for USENET news are Cleanfeed and Spam Hippo. Both are effective at filtering junk News, but neither copes well with running on outdated hardware while processing the ever-increasing volume of News.

Cleanfeed is written in Perl, which forces trade-offs in the filtration process. Perl's regular expression processing is excellent for News articles, which are mostly text. However, Perl is an interpreted language, which means it runs far slower than code running natively on the system.

Spam Hippo circumvents the interpreted code problem by running natively on the system. It does not, however, seem to offer the necessary improvement over Cleanfeed.
Because its source code is not available for review, it is unclear why Spam Hippo is not faster, but it clearly performs a large amount of CPU-intensive filtering, such as its fuzzy article matching.

Microbrew MicroSieve is the proposed solution to the junk News problem. It is designed to address the problems with both Cleanfeed and Spam Hippo. It is written in C and thus does not run in an interpreted environment. MicroSieve contains a small but useful subset of the filters in Cleanfeed and thus is fast and effective.

The design goals of MicroSieve were specifically tuned for high-volume News platforms. It had to be able to handle News feeds at DS3 speeds and above. Furthermore, it had to be able to run for months on end without slowing down, crashing, or leaking memory while terabytes of data pass through it.

Features of MicroSieve

MicroSieve has many features in common with Cleanfeed, as well as new features designed to exploit the specific properties of a distributed News system. The following checks are supported and are applied in this order:

Maximum Article Size
This filter is designed to reduce overhead on a News server by rejecting articles above a set size. This keeps the server from processing news that would be kept on the news spool but never sent to peers because of the maximum article transmit size set in the News server software.

Blackhole Paths
This lets the News administrator filter news based on the path of the article. News systems that are known to produce a disproportionate ratio of spam to legitimate news can be black-holed to reduce overhead in junk processing.

Auto-accept Paths
This is the primary feature designed specifically for distributed News plants. By setting the auto-accept path to the tag that the local news system puts in the Path header, the system administrator can keep news that has already been filtered from being filtered again.
This is important because 40% to 60% of the news each server receives comes from another node in the same plant and thus has already been accepted as legitimate news.

Binaries in non-binary groups
This is a common junk news problem. There are many newsgroups specifically designated to receive binary data. Binaries posted outside of these groups are generally spam and are usually not welcomed by readers of those newsgroups. This filter is somewhat lenient so that small binaries (less than 4 KB), such as vCards, can be appended to news posts.

Maximum Cross Posts limit
This limits the number of newsgroups that a single article can be posted to. Generally, articles posted to a large number of groups are spam.

Spam Bots Checks
This filter checks for articles posted by spam bots. The list of bots is somewhat outdated and is included mostly for completeness. It generates around one to two hits per day.

User Supplied Regular Expressions
This feature was the most recently added and is designed to give administrators of lower-volume sites finer control over which articles are allowed to pass through the system. User-supplied regular expressions slow down MicroSieve by an order of magnitude and are not well suited to high-volume news platforms.

Duplicates Checking
When everything else checks out, all that is left to do is filter for duplicates. This filter proves to be the workhorse of MicroSieve. It checks for duplicate articles, which account for the largest share of junk news. The filter is heavily tuned to be as fast as possible while keeping the largest possible backlog of articles. The hashing algorithm, however, is designed to be fast rather than cryptographically secure, and thus is not 100% foolproof. Sites where completeness is of utmost importance should not use this filter.

Building and Using MicroSieve

MicroSieve is designed to be easy to build and use. It uses the familiar 'configure' script generated by GNU autoconf.
To increase speed, most of MicroSieve's features can only be enabled or disabled at compile time. It is distributed with sensible defaults which should work for most sites, but administrators are strongly encouraged to pick a set of options tailored to their deployment. The following configure-time options choose the set of filters that MicroSieve uses while running:

--disable-block-binaries
Turns off the blocking of binaries in non-binary newsgroups. Blocking of binaries in non-binary groups is on by default.

--enable-bot-checks
Turns on checks for spam bots. Turning this on will generally yield few hits, so it is off by default.

--enable-user-regex
Turns on support for user-supplied regular expressions. Simply enabling this option will slow down MicroSieve by an order of magnitude, so make sure it is really needed before turning it on. User-supplied regular expressions are off by default.

--disable-auto-accept
Turns off the automatic acceptance filter. Sites which will not be using MicroSieve in a distributed environment should use this option. Auto-accept is on by default.

--disable-auto-blackhole
Disables the automatic rejection filter. Administrators who do not wish to block specific sites should use this option. Auto-blackhole is on by default.

--disable-max-crossposts
Removes the filter which checks for a maximum number of cross-posts. Limiting the number of cross-posts is on by default.

--disable-history-check
Disables the duplicates checking filter. Checking for duplicates is on by default.

After the configure script is run, the program can be built and installed by running 'make; make install'. It will be installed in '/usr/local' by default. Once the program is installed, changes should be made to the configuration file, which is located in the 'etc' directory under the installation location (/usr/local/etc by default).
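
The build steps above can be sketched as a short shell script. This is only an illustration: the two filter flags are taken from the option list above, but the combination chosen (a single-server site that also wants the spam bot checks) is an assumption, as is the use of --prefix, which is the standard autoconf switch for the install location rather than something the MicroSieve documentation is known to describe.

```shell
#!/bin/sh
# Sketch: build MicroSieve for a hypothetical non-distributed site.
# PREFIX and the flag combination are illustrative assumptions.

PREFIX=/usr/local    # the default install location named in the article
CONFIGURE_FLAGS="--disable-auto-accept --enable-bot-checks"

if [ -x ./configure ]; then
    # $CONFIGURE_FLAGS is left unquoted on purpose so it splits into two flags.
    ./configure --prefix="$PREFIX" $CONFIGURE_FLAGS &&
        make &&
        make install
else
    echo "configure not found: run this from the unpacked source directory" >&2
fi

# The configuration file lives in the 'etc' directory under the prefix.
echo "configuration directory: $PREFIX/etc"
```

Chaining the steps with '&&' instead of ';' stops the script from installing anything if configure or make fails.
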
The following configuration file options are the most commonly modified:

< Insert: Default configuration file (usieve.conf) >

MaxArticleSize (default: 1048576)
Configures the maximum allowable size of each article. This option cannot be turned off. Sites that impose no limit on article size should still set this to a sensible number, no larger than about 10MB, because the value is used to allocate a buffer and setting it too high can cause excessive memory usage.

ArticleHistorySize (default: 1048575)
Configures the number of articles that are kept in the duplicate checking list. Generally, a bigger number will yield more hits. However, a larger number will cause the program to run slower and have a larger memory footprint. For maximum efficiency, select a value of the form 2^n - 1 (the default is 2^20 - 1).

StatsInterval (default: 300)
Specifies the interval, in seconds, at which statistics are written to disk.

MaxCrossPosts (default: 5)
Specifies the maximum number of newsgroups that one article can be posted to simultaneously.

AutoAcceptRegex (default: None)
Specifies the regular expression to use for Path header based auto-accept. If this feature was not turned off at compile time, it needs to be set to your site's header banner before MicroSieve is started.

BlackholeRegex (default: !FFFFFFFF!)
Specifies the regular expression to use for Path header based auto-reject. If this feature was not turned off at compile time, it needs to be set to a list of sites to block, in regular expression format.

A complete list of options can be found in the 'INSTALL' file that is distributed with the MicroSieve tarball.

After the configuration file has been modified to suit the needs of the site, MicroSieve needs to be integrated into the News server. Currently, only the Highwind series of News servers (Cyclone, Typhoon, et al.) is supported. The following line needs to be added to the 'start.conf' file which comes with the News server software.

PROGRAM="-program / |