Curbing junk USENET News without curbing server performance.

An article on reducing the effect of spam on your USENET News servers without sacrificing server performance in the process.

Article by Bulent Yilmaz

USENET News usage has grown enormously over the past few years.  Current
News plants are expected to push DS3 levels of traffic and above.  Despite
increased demands upon the infrastructure, plants are often left without
important hardware upgrades, and most News administrators are
being forced to do more with less.  Frequently, turning off spam filtering is
one of the first choices administrators make in order to keep their
hardware going.

An Introduction to MicroSieve

Today's large news plants typically have multiple relatively low-powered 
servers.  These servers are configured to receive News from 
different peers, transport the News amongst themselves, and then push the News 
back out to the peers.  This layout is not only effective at moving News, but
also offers an opportunity to distribute News filtering overhead.

The two most common spam filtering systems in use for USENET news are Cleanfeed
and Spam Hippo.  Both are effective at filtering junk News, but are unable to
cope with running on outdated hardware while processing the ever increasing 
volume of News.

Cleanfeed is written in Perl, which forces trade-offs in the filtration process.
Perl's regular expression processing capability is excellent for use on News
articles, which are mostly text.  However, Perl's shortcoming is that it is an
interpreted language, which means it runs far slower than code running natively
on the system.

Spam Hippo circumvents the interpreted code problem by running natively on the
system.  It does not, however, seem to offer the necessary improvement over
Cleanfeed.  Because the source code is not available for review, it is unclear
why Spam Hippo is not faster than it is, but it clearly performs a large amount
of CPU-intensive filtering, such as its fuzzy article matching.

Microbrew MicroSieve is the proposed solution to the junk News problem.  It is
designed to address the problems with both Cleanfeed and Spam Hippo.  It is 
written in C and thus does not run in an interpreted environment.  MicroSieve 
contains a small, but useful subset of the filters in Cleanfeed and thus 
is fast and effective.

The design goals of MicroSieve were specifically tuned for high volume News
platforms.  It had to be able to handle News feeds at DS3 speeds and above.
Furthermore, it had to be able to run for months on end without slowing down, 
crashing, or leaking memory when handling terabytes of data passing through
it.

Features of MicroSieve

MicroSieve has many features in common with Cleanfeed as well as new features
designed to exploit the specific properties of a distributed News system.  The
following checks are supported and run in the order listed:

Maximum Article Size
This filter is designed to reduce overhead on a News server by rejecting 
articles above a set size.  This keeps the server from processing news that
will be kept on the news spool but will never be sent to peers because of the
maximum article transmit size set in the News server software.

Blackhole Paths
This gives the News administrator the ability to filter news based on the
path of the article.  News systems which are known to produce a
disproportionate amount of spam relative to legitimate news can be
black-holed to reduce the overhead of junk processing.

Auto-accept Paths
This is the primary feature designed specifically for distributed News plants.
By setting the auto-accept path to the tag that their own news system puts in
the Path header, administrators can keep news that has already been filtered
from being filtered again.  This is important because 40% to 60% of the news
each of the servers receives is from another node in the same plant and thus
has already been accepted as legitimate news.
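
As a rough illustration of path-based matching, the sketch below tests a Path
header against a regular expression using the POSIX regex API.  It is not taken
from the MicroSieve source: the function name 'path_matches' and the example
host names are hypothetical, and a production filter would compile the pattern
once at startup rather than on every call.

```c
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper (not from MicroSieve): returns 1 if the Path header
 * matches the pattern, 0 if it does not, and -1 if the pattern is invalid.
 */
int path_matches(const char *pattern, const char *path_header)
{
    regex_t re;
    int rc;

    /* REG_NOSUB: only a yes/no answer is needed, not match offsets. */
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;

    rc = regexec(&re, path_header, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

An article whose Path matches the auto-accept pattern would skip the remaining
filters, while one whose Path matches the blackhole pattern would be rejected
outright.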

Binaries in non-binary groups
This is a common junk news problem.  There are many newsgroups specifically
slated to receive binary data.  Binaries posted outside of these groups are
generally spam and are usually not welcomed by readers of that newsgroup.  This
filter is a bit lenient so that small binaries (less than 4 KB) such as vCards
can be appended to news posts.

Maximum Cross Posts limit
This limits the number of newsgroups that a single article can be posted to.
Generally, articles posted to a large number of groups are spam.
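
A minimal sketch of such a check, assuming the filter is handed the value of
the Newsgroups header as a comma-separated string.  The function name
'count_newsgroups' is hypothetical and is not taken from the MicroSieve source.

```c
#include <stddef.h>

/*
 * Hypothetical helper: counts the non-empty, comma-separated entries in the
 * value of a Newsgroups header.  Whitespace around names is ignored.
 */
int count_newsgroups(const char *value)
{
    int count = 0;
    int in_name = 0;   /* 1 while scanning inside a group name */

    for (; *value != '\0'; value++) {
        if (*value == ',') {
            in_name = 0;                 /* separator ends the current name */
        } else if (*value != ' ' && *value != '\t' && !in_name) {
            in_name = 1;                 /* first character of a new name */
            count++;
        }
    }
    return count;
}
```

An article would then be rejected when the count exceeds the configured
maximum.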

Spam Bots Checks
This filter checks for articles posted by spam bots.  The list of bots is
somewhat outdated, so the filter is included mostly for completeness.  It
generates around one to two hits per day.

User Supplied Regular Expressions
This feature was the most recently added and is designed to allow
administrators of lower volume sites to have finer control of what articles
are allowed to pass through the system.  These slow down MicroSieve by an
order of magnitude and are not well suited to high volume news platforms.

Duplicates Checking
When everything checks out, all that is left to do is filter for duplicates.
This filter proves to be the workhorse of MicroSieve.  It checks for duplicate
articles, which account for the largest share of junk news.  The filter is
heavily tuned to be as fast as possible while tracking the largest possible
backlog of articles.  The hashing algorithm, however, is designed to be fast
rather than cryptographically secure, and thus is not 100% foolproof: two
different articles can occasionally hash to the same value.  Sites where
completeness is of utmost importance should not use this filter.
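
The sketch below shows one way a fast, fixed-size duplicate history could work.
It is an illustration under stated assumptions, not MicroSieve's actual
algorithm: it uses the public-domain FNV-1a hash as a stand-in for the
unspecified hashing algorithm, assumes the article body is the data being
hashed, and all names ('history_t', 'history_check', and so on) are
hypothetical.

```c
#include <stdint.h>
#include <stdlib.h>

typedef struct {
    uint64_t *slots;  /* one fingerprint per slot; 0 marks an empty slot */
    uint32_t  mask;   /* history size of the form 2^n - 1 */
} history_t;

/* FNV-1a, 64-bit: fast, but not cryptographically secure. */
static uint64_t fingerprint(const unsigned char *data, size_t len)
{
    uint64_t h = 14695981039346656037ULL;
    for (size_t i = 0; i < len; i++) {
        h ^= data[i];
        h *= 1099511628211ULL;
    }
    return h ? h : 1;  /* reserve 0 for "empty" */
}

/* Allocates size + 1 slots; returns 0 on success, -1 on failure. */
int history_init(history_t *h, uint32_t size)
{
    h->mask = size;
    h->slots = calloc((size_t)size + 1, sizeof *h->slots);
    return h->slots ? 0 : -1;
}

/* Returns 1 if the data was seen before; otherwise records it and returns 0. */
int history_check(history_t *h, const void *data, size_t len)
{
    uint64_t fp = fingerprint(data, len);
    uint64_t *slot = &h->slots[fp & h->mask];

    if (*slot == fp)
        return 1;
    *slot = fp;        /* overwrite whatever older entry lived here */
    return 0;
}

void history_free(history_t *h)
{
    free(h->slots);
    h->slots = NULL;
}
```

In this design a lookup costs one hash and one memory access, and old entries
are silently overwritten.  The trade-off is that two different articles with
the same fingerprint would be treated as duplicates, which is the kind of
inaccuracy the text warns about.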

Building and Using MicroSieve

MicroSieve is designed to be easy to build and use.  It uses the familiar
GNU autoconf generated 'configure' script.  Most of MicroSieve's features can
only be enabled at compile time in order to increase speed.  It is distributed
with sensible defaults which should work for most sites.  Even so,
administrators are strongly encouraged to pick a set of options tailored to
their deployment.

The following configure-time options choose the set of filters that MicroSieve
uses while running:

--disable-block-binaries
Turns off the blocking of binaries in non-binary Newsgroups.  Blocking of 
binaries in non-binary groups is on by default.

--enable-bot-checks
Turns on checks for spambots.  Turning this on will generally yield few hits 
and is thus off by default.

--enable-user-regex
Turns on the capability for user supplied regular expressions.  Simply using
this option will slow down MicroSieve by an order of magnitude, so make sure 
this is really needed before turning it on.  User supplied regular expressions 
are off by default.

--disable-auto-accept
Turns off the automatic acceptance filter.  Sites which will not be using
MicroSieve in a distributed environment should use this option.  Auto-accept
is on by default.

--disable-auto-blackhole
Disables the automatic rejection filter.  If administrators do not wish
to block specific sites, they should use this option.  Auto-blackhole is on by
default.

--disable-max-crossposts
Removes the filter which checks for a maximum number of cross-posts.  Limiting
the number of cross-posts is on by default.

--disable-history-check
Disables the duplicates checking filter.  Checking for duplicates is on by
default.
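
To illustrate why compile-time selection helps, the sketch below shows how
configure options of this kind could translate into preprocessor macros so
that a disabled filter costs nothing at run time.  The macro names, the
'article_t' structure, and the 'run_filters' function are all hypothetical and
are not taken from the MicroSieve source.  Only a few of the filters are
shown, in the same relative order as described earlier.

```c
#include <stddef.h>

/* Hypothetical feature macros; a configure script would normally write
 * these into a generated header such as config.h. */
#define ENABLE_AUTO_ACCEPT    1
#define ENABLE_MAX_CROSSPOSTS 1
/* ENABLE_USER_REGEX is left undefined: that filter is not compiled in. */

typedef enum { VERDICT_ACCEPT, VERDICT_REJECT } verdict_t;

typedef struct {
    size_t size;            /* article size in bytes */
    int    from_own_plant;  /* Path already carries the site's own tag */
    int    group_count;     /* number of newsgroups posted to */
} article_t;

verdict_t run_filters(const article_t *a, size_t max_size, int max_crossposts)
{
    /* The size check cannot be disabled. */
    if (a->size > max_size)
        return VERDICT_REJECT;

#ifdef ENABLE_AUTO_ACCEPT
    /* Already filtered by another node: skip every remaining check. */
    if (a->from_own_plant)
        return VERDICT_ACCEPT;
#endif

#ifdef ENABLE_MAX_CROSSPOSTS
    if (a->group_count > max_crossposts)
        return VERDICT_REJECT;
#else
    (void)max_crossposts;   /* unused when the filter is compiled out */
#endif

    return VERDICT_ACCEPT;
}
```

Because the disabled branches are removed by the preprocessor, no run-time
test is spent deciding whether a filter is switched on.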

After the configure script is run, the program can be built and installed by 
running 'make; make install'.  It will be installed in '/usr/local' by default.

Once the program is installed, changes should be made to the configuration file,
which is located in the 'etc' directory where it was installed (/usr/local/etc
by default).  The following options are the most commonly modified:

< Insert: Default configuration file (usieve.conf) >

MaxArticleSize (default: 1048576)
Configures the maximum allowable size of each article.  This is not an option
that can be turned off.  Sites where there is no limit imposed on the size of
the article should set this to a sensible number but no larger than about 10MB
since it is used to allocate a buffer and setting it too high can cause 
excessive memory usage.

ArticleHistorySize (default: 1048575)
This configures the number of articles that are kept in the duplicate checking
list.  Generally, a bigger number will yield a bigger number of hits.  However,
a larger number will cause the program to run slower and have a larger memory
footprint.  For maximum efficiency, a value should be selected that is
(2^n - 1).
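
The text does not say why values of the form (2^n - 1) are preferred.  One
plausible reason, offered here only as an assumption, is that such a value can
serve directly as a bit mask, replacing a comparatively slow modulo operation
with a single AND.  The helper names below are hypothetical.

```c
#include <stdint.h>

/* Returns 1 if n has the form 2^k - 1 (all low bits set), 0 otherwise. */
int is_mask_size(uint32_t n)
{
    return n != 0 && (n & (n + 1)) == 0;
}

/* With a size of the form 2^k - 1, hash & size equals hash % (size + 1). */
uint32_t slot_for(uint32_t hash, uint32_t history_size)
{
    return hash & history_size;
}
```

The default value of 1048575 is 2^20 - 1 and passes this check.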

StatsInterval (default: 300)
This specifies the interval, in seconds, at which stats are written to disk.

MaxCrossPosts (default: 5)
This specifies the maximum number of newsgroups that one article can be posted 
to simultaneously.

AutoAcceptRegex (default: None)
This specifies the regular expression to use for path header based auto accept.
If this feature was not turned off at compile-time, it needs to be set to your
site's header banner before MicroSieve is started.

BlackholeRegex (default: !FFFFFFFF!)
This specifies the regular expression to use for path header based auto reject.
If this feature was not turned off at compile-time, it needs to be set to a
list of sites to block in regular expression format.

A complete list of options can be found in the 'INSTALL' file that is 
distributed with the MicroSieve tarball.

After the configuration file has been modified to suit the needs of the site,
MicroSieve needs to be integrated into the News server.  Currently, only the
Highwind series of News servers (Cyclone, Typhoon, et al.) is supported.

The following line needs to be added to the 'start.conf' file which comes with
the News server software.

PROGRAM="-program //usieve -body"; export PROGRAM

At this point, the News server can be started as normal.  To check if MicroSieve
is running use 'ps' and look for 'usieve'.

Performance Tips and Tricks

If MicroSieve is not performing fast enough for your environment, a few changes
can be made to make it run a bit faster.

Regular Expression Library
On Linux/Glibc systems the standard regular expression library for C is very
slow.  To solve this problem, MicroSieve can work with the PCRE regular
expression library, which must be installed separately.  This option
can be enabled by using the option '--with-libpcre=' during the
'configure' process.  PCRE does not appear to offer a speed improvement on
Solaris systems, so for MicroSieve it is only useful on Linux systems.

Article History Logging Overhead
Some speed can be gained by reducing the article history size in the
configuration file.  It is also very important to set this to a size that is
(2^n - 1) for optimal performance.

Automatic Acceptance Settings
In sites where multiple News servers are running, this is by far the easiest
way to get major speed improvements.  This effectively distributes the filtering
load across all of the News servers.  This option can also be used with trusted
peers which are running any News filter program with a compatible set of 
filters (i.e. Cleanfeed).

Selecting the Most Useful Filters
Administrators whose sites carry so much News that MicroSieve is unable to
keep up must become more selective about the filters they enable.  High volume
sites should not use user supplied regular expressions.  The spambot checks can
generally be turned off without allowing too many extra articles through.
Turning off the maximum cross posts filter can also give a significant
performance boost since it is a fairly CPU intensive process.

Generating Some Stats

In order to measure how effective each filter is in the operation of MicroSieve
as a whole, some rudimentary statistics are available.  They consist of a log of
the number of articles and kilobytes that have been accepted or rejected, as
well as a breakdown of the filters and the number of articles each has rejected.

The statistics are written out to disk periodically as specified in the 
'StatsInterval' field in the configuration file.  They are logged to the file
specified in the 'StatsFile' option in the configuration file.
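
As a sketch of the kind of bookkeeping described above, the code below keeps
accept and reject counters, a per-filter breakdown, and a check for whether
the reporting interval has elapsed.  All type, field, and function names are
hypothetical, only three filters are listed for brevity, and the real
statistics file format is not reproduced here.

```c
#include <stddef.h>
#include <time.h>

/* Hypothetical filter identifiers for the per-filter breakdown. */
enum { FILTER_SIZE, FILTER_CROSSPOST, FILTER_DUPLICATE, FILTER_COUNT };

typedef struct {
    unsigned long      accepted;        /* articles accepted */
    unsigned long      rejected;        /* articles rejected */
    unsigned long long accepted_bytes;
    unsigned long long rejected_bytes;
    unsigned long      rejected_by[FILTER_COUNT];
    time_t             last_flush;      /* when stats were last written */
} stats_t;

void stats_accept(stats_t *s, size_t bytes)
{
    s->accepted++;
    s->accepted_bytes += bytes;
}

void stats_reject(stats_t *s, int filter, size_t bytes)
{
    s->rejected++;
    s->rejected_bytes += bytes;
    s->rejected_by[filter]++;
}

/* Returns 1 once at least 'interval' seconds have passed since last_flush. */
int stats_due(const stats_t *s, time_t now, long interval)
{
    return difftime(now, s->last_flush) >= (double)interval;
}
```

A main loop would call 'stats_due' after each article and, when it returns 1,
write the counters to the stats file and update 'last_flush'.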

< Insert: Sample Statistics Entry (statsample.txt) >

A pair of Perl scripts is available to parse the information in the statistics
file and generate graphs from it.  Both are also available from the
MicroSieve website.

The first of the two scripts is the statistics file parser.  It takes the
information in the human-readable statistics file and converts it to a pipe
delimited file containing all of the same information.  This file can be used 
to perform additional statistical analysis by importing it into a spreadsheet 
program.  The parsed file is used by the graphs generator.

The second Perl script is the graphs generator.  It requires GD 1.84 as well
as the Perl module for that version of GD.  It outputs the graphs in PNG format
and thus requires GD to be built with PNG support.  To accompany the graphs,
this script also generates the HTML file 'usieve.html' to make it easier to
display them on a web page.

To generate graphs, all that has to be done is to run the two scripts back
to back:

$ /usr/local/bin/usievestats.pl /usr/local/var/usieve.stats > stats.clean
$ /usr/local/bin/stats-graphs.pl stats.clean

The first command parses the standard MicroSieve stats file,
'/usr/local/var/usieve.stats', and writes the result to standard output, which
in the example is redirected to the file 'stats.clean'.  The second command
takes the parsed statistics file (stats.clean) and generates the HTML file
'usieve.html' and the graphs in PNG format in the current directory.

< Insert: Screenshot of web page with graphs (graphs.jpg) >

Copyright ©1997 - 2021, Bulent Yilmaz