Commons Feedparser - Overview - Jakarta FeedParser

Jakarta FeedParser

Jakarta FeedParser is a Java RSS/Atom parser designed to support all versions of RSS (0.9, 0.91, 0.92, 1.0, and 2.0) and Atom 0.5 (and future versions), along with easy ad hoc extension and RSS 1.0 modules.

FeedParser was the parser API designed by Kevin Burton for NewsMonster and has been donated to the ASF in order to continue development.

FeedParser differs from most other RSS/Atom parsers in that it is not DOM based but event based (similar to SAX). Instead of the low level startElement() API present in SAX, we provide higher level events based on feed parsing information.
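To make the contrast with raw SAX concrete, here is a minimal sketch that folds low-level startElement/endElement callbacks into one higher-level event per item. It uses only the JDK SAX parser; the ItemEventSketch class and its ItemListener interface are hypothetical names invented for this illustration and are not the FeedParser API.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical sketch: folds SAX callbacks into one event per RSS item. */
final class ItemEventSketch {

    /** Hypothetical high-level callback; not the FeedParser listener interface. */
    interface ItemListener {
        void onItem(String title, String link);
    }

    static void parse(String xml, ItemListener listener) {
        DefaultHandler handler = new DefaultHandler() {
            private final StringBuilder text = new StringBuilder();
            private boolean inItem;
            private String title;
            private String link;

            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                text.setLength(0); // start collecting text for the new element
                if (qName.equals("item")) {
                    inItem = true;
                    title = null;
                    link = null;
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                if (!inItem) {
                    return; // ignore channel-level elements in this sketch
                }
                if (qName.equals("title")) {
                    title = text.toString().trim();
                } else if (qName.equals("link")) {
                    link = text.toString().trim();
                } else if (qName.equals("item")) {
                    inItem = false;
                    listener.onItem(title, link); // one high-level event per item
                }
            }
        };
        try {
            byte[] bytes = xml.getBytes(StandardCharsets.UTF_8);
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new ByteArrayInputStream(bytes), handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse feed", e);
        }
    }
}
```

The caller only ever sees onItem; the element bookkeeping stays inside the handler, which is the general idea behind an event model layered on top of SAX.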

Events are also given to the caller independent of the underlying format. This is accomplished with a Feed Event Model that isolates your application from the underlying feed format. This enables transparent support for all RSS versions as well as Atom. We also hide format-specific details such as date formats (RFC 822 in RSS 2.0 and 0.9x, ISO 8601 in RSS 1.0 and Atom) and other metadata.
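As an illustration of hiding date formats, the following sketch normalizes both styles to a single java.time.Instant. It is not FeedParser's own date handling; FeedDateSketch and parseFeedDate are hypothetical names, and the sketch assumes four-digit years (RFC 822 proper also allows two-digit years, which the JDK's RFC 1123 formatter does not accept).

```java
import java.time.Instant;
import java.time.OffsetDateTime;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

/** Hypothetical sketch: one date type regardless of the feed format. */
final class FeedDateSketch {

    static Instant parseFeedDate(String raw) {
        String value = raw.trim();
        try {
            // ISO 8601 with an offset, as used by RSS 1.0 (dc:date) and Atom
            return OffsetDateTime.parse(value, DateTimeFormatter.ISO_OFFSET_DATE_TIME).toInstant();
        } catch (DateTimeParseException notIso) {
            // RFC 822 style with a four-digit year, as used by RSS 2.0 and 0.9x
            return ZonedDateTime.parse(value, DateTimeFormatter.RFC_1123_DATE_TIME).toInstant();
        }
    }
}
```

A caller can then compare or sort entries from mixed sources without knowing which format each feed used.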

The FeedParser distribution also includes:

An implementation of RSS and Atom autodiscovery.
Support for all content modules including xhtml:body, mod_content (RDF and inline), atom:content, and atom:summary.
Atom 1.0 link API as well as RSS 1.0 mod_link API.
An HTML link parser for finding all links in an HTML source file and expanding relative links into full URLs.
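The link expansion mentioned in the last item can be sketched with the JDK alone. This is a simplified illustration, not the FeedParser link parser: LinkExpansionSketch is a hypothetical name, the regular expression only handles quoted href attributes on anchor tags, and URI.resolve will throw on hrefs that are not valid URIs.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: collects anchor hrefs and resolves them against a base URL. */
final class LinkExpansionSketch {

    // Matches <a ... href="..."> with single or double quotes, case-insensitively.
    private static final Pattern HREF = Pattern.compile(
        "<a\\s[^>]*?href\\s*=\\s*[\"']([^\"']*)[\"']", Pattern.CASE_INSENSITIVE);

    static List<String> expandLinks(String baseUrl, String html) {
        URI base = URI.create(baseUrl);
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            // resolve() leaves absolute URLs untouched and expands relative ones
            links.add(base.resolve(m.group(1).trim()).toString());
        }
        return links;
    }
}
```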

Feed Location

The locate package provides an API for determining all valid feeds for a given weblog URL. We also attempt to profile popular blogging services including Movable Type, Blogger, Xanga, etc. Some of these services have subtly incorrect behavior, and we can correct for it to return feeds for sites that would otherwise fail.

Feed location within FeedParser is simple. Pass a URL to FeedLocator, which parses the HTML of your weblog and returns all references to feeds in a FeedList.
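For readers unfamiliar with autodiscovery, the sketch below shows the basic idea on an HTML string: find link elements with rel="alternate" and a feed MIME type, then resolve their href against the page URL. It does not use or reproduce FeedLocator; AutodiscoverySketch and findFeeds are hypothetical names, and the regex-based attribute scan is an assumption that only covers quoted attributes.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch of RSS/Atom autodiscovery over an HTML string. */
final class AutodiscoverySketch {

    private static final Pattern LINK_TAG =
        Pattern.compile("<link\\b[^>]*>", Pattern.CASE_INSENSITIVE);
    private static final Pattern ATTR =
        Pattern.compile("([a-zA-Z-]+)\\s*=\\s*[\"']([^\"']*)[\"']");

    static List<String> findFeeds(String baseUrl, String html) {
        URI base = URI.create(baseUrl);
        List<String> feeds = new ArrayList<>();
        Matcher tag = LINK_TAG.matcher(html);
        while (tag.find()) {
            // Collect the attributes of this <link> tag, in any order.
            Map<String, String> attrs = new HashMap<>();
            Matcher attr = ATTR.matcher(tag.group());
            while (attr.find()) {
                attrs.put(attr.group(1).toLowerCase(Locale.ROOT), attr.group(2).trim());
            }
            String rel = attrs.getOrDefault("rel", "");
            String type = attrs.getOrDefault("type", "").toLowerCase(Locale.ROOT);
            String href = attrs.get("href");
            boolean feedType = type.equals("application/rss+xml")
                || type.equals("application/atom+xml");
            if (href != null && rel.equalsIgnoreCase("alternate") && feedType) {
                feeds.add(base.resolve(href).toString());
            }
        }
        return feeds;
    }
}
```

Service-specific profiling, as described above, would sit on top of a scan like this to handle sites whose markup is missing or wrong.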

Liberal Parsing

We support so-called liberal parsing to accept feeds which, while not valid XML, would parse with just a few modifications. These modifications include removing stray text before the XML declaration, entity decoding, etc.
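A rough sketch of the two repairs named above follows. It is an assumption about how such a pre-processing step might look, not the FeedParser implementation: LiberalPrefixSketch and repair are hypothetical names, only three HTML entities are handled, and the replacement would also touch text inside CDATA sections.

```java
/** Hypothetical sketch: small textual repairs applied before XML parsing. */
final class LiberalPrefixSketch {

    static String repair(String raw) {
        String text = raw;
        // Drop a BOM, whitespace, or other junk before the XML declaration
        // (or before the first tag when there is no declaration).
        int start = text.indexOf("<?xml");
        if (start < 0) {
            start = text.indexOf('<');
        }
        if (start > 0) {
            text = text.substring(start);
        }
        // Rewrite a few HTML-only named entities, which are undefined in plain
        // XML, as numeric character references.
        return text.replace("&nbsp;", "&#160;")
                   .replace("&copy;", "&#169;")
                   .replace("&mdash;", "&#8212;");
    }
}
```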

Supported Feed Formats

Jakarta FeedParser supports the following syndication formats:

RSS 1.0
RSS 0.9
RSS 0.91
RSS 0.92
RSS 2.0
Atom 0.3 (deprecated)
Atom 0.4 (deprecated)
Atom 0.5
OPML
FOAF
Changes.xml
XFN

In addition, support for the following modules is available:

Dublin Core (mod_dc)
mod_content
mod_aggregation
mod_dcterms
xhtml:body: Provided for XHTML RSS bodies
mod_taxonomy: Helps enable tags within RSS feeds; RSS 2.0 enclosures
wfw:commentRSS: WFW commentRSS support for linking to additional RSS feeds for comments

API

Developers place all their logic in a FeedParserListener which then receives callbacks from the FeedParser which knows about specific XML formats. They then pass the FeedParser an InputStream and they are ready to get events:


//create a new FeedParser...
FeedParser parser = FeedParserFactory.newFeedParser();

//create a listener for handling our callbacks
FeedParserListener listener = new DefaultFeedParserListener() {

        public void onChannel( FeedParserState state,
                               String title,
                               String link,
                               String description ) throws FeedParserException {

            System.out.println( "Found a new channel: " + title );

        }

        public void onItem( FeedParserState state,
                            String title,
                            String link,
                            String description,
                            String permalink ) throws FeedParserException {

            System.out.println( "Found a new published article: " + permalink );
            
        }

        public void onCreated( FeedParserState state, Date date ) throws FeedParserException {
            System.out.println( "Which was created on: " + date );
        }

    };

//specify the feed we want to fetch

String resource = "http://peerfear.org/rss/index.rss";

if ( args.length == 1 )
    resource = args[0];

System.out.println( "Fetching resource:" + resource );

//use the FeedParser network IO package to fetch our resource URL
ResourceRequest request = ResourceRequestFactory.getResourceRequest( resource );

//grab our input stream
InputStream is = request.getInputStream();

//start parsing our feed and have the listener callbacks above called
parser.parse( listener, is, resource );

This is a trivial example from the HelloFeedParser demo distributed with FeedParser. Other events such as onImage and onLink can be used to obtain additional metadata.

Metadata is delivered through separate events to allow for future extension of the RSS specification as well as support for additional namespaces. For example, the RSS 1.0, RSS 2.0, and Atom specifications all use different date mechanisms. The FeedParser simply reports these through the onCreated and onIssued methods of the MetaFeedParserListener interface.

Content

Content is a generic name for a body of text within an RSS or Atom post. Due to various format differences, there are a number of ways to include content in a post, including HTML encoded content in the description element, RSS 1.0 mod_content, xhtml:body, atom:content, atom:summary, etc.

The FeedParser includes a generic ContentFeedParserListener which allows you to intercept all content markup from all RSS formats as well as Atom.

Strict Specification Conformance

Currently the FeedParser does NOT require that XML feeds meet the RSS and Atom specifications to the letter. While this is part of liberal feed parsing in general, there are sections of the Atom specification, for example, whose elements MUST have certain child elements.

For example: atom:entry elements MUST contain exactly one atom:id element.

We try to follow Postel's law here and allow feeds to pass in this situation. We may adopt a policy in the future for both strict XML parsing and strict format compliance, which would trigger an exception when a feed does not exactly match the specification.

In practice, if your application requires specific data from a feed, verify in your own code that the data is present and correct before moving forward.
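Such a check can be as small as the sketch below, which reports which required fields of an entry are absent. The choice of id, title, and link as required fields is an assumption made for illustration, and EntryCheckSketch is a hypothetical helper, not part of FeedParser.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: application-side check of the fields it depends on. */
final class EntryCheckSketch {

    /** Returns the names of required fields that are null or blank. */
    static List<String> missingFields(String id, String title, String link) {
        List<String> missing = new ArrayList<>();
        if (isBlank(id)) {
            missing.add("id");
        }
        if (isBlank(title)) {
            missing.add("title");
        }
        if (isBlank(link)) {
            missing.add("link");
        }
        return missing;
    }

    private static boolean isBlank(String s) {
        return s == null || s.trim().isEmpty();
    }
}
```

An application might call this from its item callback and skip or log any entry for which the returned list is not empty.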

Network IO

The FeedParser also includes an advanced networking layer that meets the requirements of providing XML aggregation services over HTTP. This includes support for If-None-Match (ETags) and If-Modified-Since conditional requests (answered with HTTP 304 Not Modified), gzip content encoding (compression), User-Agent modification, non-infinite timeouts, event callbacks for download progress, setting the HTTP Referer header, maximum content downloads (no files larger than N bytes), and custom HTTP methods (HEAD, GET, PUT, POST).
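Several of these features map directly onto plain HTTP headers. The sketch below shows a conditional, gzip-aware request with finite timeouts using only java.net.HttpURLConnection. It is not the FeedParser network package; ConditionalGetSketch, the timeout values, and the ExampleAggregator user agent are assumptions chosen for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.util.zip.GZIPInputStream;

/** Hypothetical sketch: conditional GET with gzip and finite timeouts. */
final class ConditionalGetSketch {

    /** Configures the request; no network traffic happens until it is used. */
    static HttpURLConnection prepare(String url, String etag, String lastModified) {
        try {
            HttpURLConnection conn =
                (HttpURLConnection) URI.create(url).toURL().openConnection();
            conn.setRequestMethod("GET");
            conn.setConnectTimeout(10_000); // milliseconds, never infinite
            conn.setReadTimeout(30_000);
            conn.setRequestProperty("Accept-Encoding", "gzip");
            conn.setRequestProperty("User-Agent", "ExampleAggregator/0.1");
            if (etag != null) {
                conn.setRequestProperty("If-None-Match", etag);
            }
            if (lastModified != null) {
                conn.setRequestProperty("If-Modified-Since", lastModified);
            }
            return conn;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Returns the decoded body, or null when the server answers 304 Not Modified. */
    static InputStream body(HttpURLConnection conn) throws IOException {
        if (conn.getResponseCode() == HttpURLConnection.HTTP_NOT_MODIFIED) {
            return null; // cached copy is still current
        }
        InputStream in = conn.getInputStream();
        return "gzip".equalsIgnoreCase(conn.getContentEncoding())
            ? new GZIPInputStream(in)
            : in;
    }
}
```

An aggregator would store the ETag and Last-Modified values from each successful response and pass them back on the next poll, so unchanged feeds cost only a 304 response.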

While various APIs already exist for providing HTTP support (java.net.URL and HttpClient) we're using a version of java.net.URL that meets all our requirements and is very reliable.

Future plans are to migrate to an HTTP implementation (probably HttpClient) which supports NIO based async event IO. This library still needs to be developed, but event IO needs to be used to provide a scalable system.

The Network IO sets a default user agent of:

Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.2.1; aggregator:FeedParser; http://commons.apache.org/sandbox/feedparser/) Gecko/20021130

Visualizing FeedParser Events

The FeedParser includes a sample console application which accepts a URL to a feed, parses it, receives events, and then outputs them to the console.

%shell% java org.apache.commons.feedparser.Main http://www.eakes.org/blog/atom.xml
debug: init()
onLocale: en
debug: onChannel
        title : Michael Eakes
        link : 'http://www.eakes.org/blog/'
        description : The Weblog of Michael Eakes
debug: onChannelEnd
debug: onItem
        title : Flickr and Good URI Design
        link : 'http://www.eakes.org/blog/archives/2005/01/flickr_uris.html'
        description : I noticed that Flickr had some pretty sweet URIs, but I wanted to find out exactly what it was that made them good. To brush up on URI design, I scoured this great list of resources compiled by Tanya Rabourn:...
onLink: 
        rel: alternate
        href: http://www.eakes.org/blog/archives/2005/01/flickr_uris.html
        type: text/html

Alternative RSS/Atom and Feed Parsers

If for some reason FeedParser doesn't meet your needs (and we'd love to find out why), there are other alternatives.

Rome: While Rome lacks autodiscovery and a networking layer, it does provide a nice DOM API (if this is what you require), and the developers of both projects are friendly and cooperate.
Universal FeedParser: A Python-based parser whose name happens to overlap somewhat with ours.

Dependencies

We try to keep the library dependencies of FeedParser down to a minimum. Right now a few are required that might be deprecated in FeedParser 2.0.

library	version	required
jaxen-full		yes
jdom		yes
log4j	1.2.6	yes
xercesImpl		yes
xml-apis		yes
commons-httpclient	3.0-beta1	no (experimental support for networking)
