Jakarta FeedParser is a Java RSS/Atom parser designed to support all versions of RSS (0.9, 0.91, 0.92, 1.0, and 2.0) and Atom 0.5 (and future versions), along with easy ad hoc extension and support for RSS 1.0 modules.
FeedParser was the parser API designed by Kevin Burton for NewsMonster and has been donated to the ASF in order to continue development.
FeedParser differs from most other RSS/Atom parsers in that it is not DOM-based but event-based (similar to SAX). Instead of the low-level startElement() API present in SAX, we provide higher-level events based on feed parsing information.
Events are also given to the caller independent of the underlying format. This is accomplished with a Feed Event Model that isolates your application from the underlying feed format, which enables transparent support for all RSS versions as well as Atom. We also hide format-specific details such as dates (RFC 822 in RSS 2.0 and 0.9x, ISO 8601 in RSS 1.0 and Atom) and other metadata.
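To illustrate what hiding the date formats means in practice, here is a minimal sketch of normalizing both styles into a single value using only the JDK. It is not FeedParser's own implementation: the class `FeedDates` and its `parse` method are hypothetical names, and the sketch assumes four-digit years (strict RFC 822 also permits two-digit years, which this code would reject).

```java
import java.time.Instant;
import java.time.OffsetDateTime;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

// Hypothetical helper for illustration only; not part of FeedParser.
class FeedDates {

    /** Parses an RFC 822-style or ISO 8601 timestamp into an Instant. */
    static Instant parse(String raw) {
        String text = raw.trim();
        try {
            // RSS 0.9x / 2.0 style, e.g. "Sat, 08 Jan 2005 14:30:00 GMT"
            return ZonedDateTime.parse(text, DateTimeFormatter.RFC_1123_DATE_TIME).toInstant();
        } catch (DateTimeParseException notRfc822) {
            // RSS 1.0 / Atom style, e.g. "2005-01-08T14:30:00Z"
            return OffsetDateTime.parse(text).toInstant();
        }
    }

    public static void main(String[] args) {
        // Both calls describe the same moment in two different notations.
        System.out.println(parse("Sat, 08 Jan 2005 14:30:00 GMT"));
        System.out.println(parse("2005-01-08T09:30:00-05:00"));
    }
}
```

An application that receives dates through a format-independent event never needs this kind of branching itself, which is the point of the event model.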
The FeedParser distribution also includes:
The locate package provides an API for determining all valid feeds for a given weblog URL. We also attempt to profile popular blogging services including Movable Type, Blogger, Xanga, etc. Some of these services have subtly incorrect behavior, and we can correct for it to return feeds for sites that would otherwise fail.
Feed location within FeedParser is simple: pass a URL to FeedLocator, which will parse the HTML of your weblog and return all references to feeds in a FeedList.
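The general idea behind feed location can be sketched without the library. The example below scans HTML for `<link>` elements that advertise an RSS or Atom feed. It is only an assumption about how such discovery might work, not FeedLocator's actual code: `FeedLinkScanner` and `findFeedLinks` are hypothetical names, and a regex-based scan is far less robust than the service profiling described above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of feed autodiscovery; not the FeedLocator implementation.
class FeedLinkScanner {

    private static final Pattern LINK_TAG =
        Pattern.compile("<link\\b[^>]*>", Pattern.CASE_INSENSITIVE);
    private static final Pattern FEED_TYPE =
        Pattern.compile("type\\s*=\\s*[\"']application/(rss|atom)\\+xml[\"']",
                        Pattern.CASE_INSENSITIVE);
    private static final Pattern HREF =
        Pattern.compile("href\\s*=\\s*[\"']([^\"']+)[\"']", Pattern.CASE_INSENSITIVE);

    /** Returns the href of every link element whose type is an RSS or Atom media type. */
    static List<String> findFeedLinks(String html) {
        List<String> feeds = new ArrayList<>();
        Matcher tag = LINK_TAG.matcher(html);
        while (tag.find()) {
            String element = tag.group();
            if (!FEED_TYPE.matcher(element).find()) {
                continue; // stylesheets, icons, and other unrelated links
            }
            Matcher href = HREF.matcher(element);
            if (href.find()) {
                feeds.add(href.group(1));
            }
        }
        return feeds;
    }

    public static void main(String[] args) {
        String html = "<head>"
            + "<link rel=\"stylesheet\" type=\"text/css\" href=\"/style.css\">"
            + "<link rel=\"alternate\" type=\"application/rss+xml\" href=\"/index.rss\">"
            + "</head>";
        System.out.println(findFeedLinks(html));
    }
}
```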
We support so-called liberal parsing to accept feeds which, while not valid XML, would parse with just a few modifications. These include subtle modifications to text before the XML declaration, entity decoding, etc.
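As a rough sketch of one such modification, the code below drops any stray text that precedes the XML declaration so that a standard JDK parser will accept the document. This is an assumption about the kind of cleanup involved, not FeedParser's own logic; `LiberalPrefix` and `stripLeadingJunk` are hypothetical names.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical illustration of one liberal-parsing repair; not FeedParser code.
class LiberalPrefix {

    /** Removes any characters that appear before the XML declaration (or first tag). */
    static String stripLeadingJunk(String raw) {
        int start = raw.indexOf("<?xml");
        if (start < 0) {
            // No declaration present: fall back to the first tag, if there is one.
            start = raw.indexOf('<');
        }
        return start > 0 ? raw.substring(start) : raw;
    }

    public static void main(String[] args) throws Exception {
        // A strict XML parser rejects this input because text precedes the declaration.
        String dirty = "junk\n<?xml version=\"1.0\"?><rss version=\"2.0\"/>";
        String cleaned = stripLeadingJunk(dirty);

        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new InputSource(new StringReader(cleaned)));
        System.out.println(doc.getDocumentElement().getNodeName());
    }
}
```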
Jakarta FeedParser supports the following syndication formats:
In addition, the following module support is available:
Developers place all their logic in a FeedParserListener which then receives callbacks from the FeedParser which knows about specific XML formats. They then pass the FeedParser an InputStream and they are ready to get events:
    //create a new FeedParser...
    FeedParser parser = FeedParserFactory.newFeedParser();

    //create a listener for handling our callbacks
    FeedParserListener listener = new DefaultFeedParserListener() {

        public void onChannel( FeedParserState state,
                               String title,
                               String link,
                               String description ) throws FeedParserException {

            System.out.println( "Found a new channel: " + title );

        }

        public void onItem( FeedParserState state,
                            String title,
                            String link,
                            String description,
                            String permalink ) throws FeedParserException {

            System.out.println( "Found a new published article: " + permalink );

        }

        public void onCreated( FeedParserState state, Date date ) throws FeedParserException {

            System.out.println( "Which was created on: " + date );

        }

    };

    //specify the feed we want to fetch
    String resource = "http://peerfear.org/rss/index.rss";

    if ( args.length == 1 )
        resource = args[0];

    System.out.println( "Fetching resource:" + resource );

    //use the FeedParser network IO package to fetch our resource URL
    ResourceRequest request = ResourceRequestFactory.getResourceRequest( resource );

    //grab our input stream
    InputStream is = request.getInputStream();

    //start parsing our feed and have the above onItem methods called
    parser.parse( listener, is, resource );
This is a trivial example from the HelloFeedParser demo distributed within FeedParser. Other events such as onChannel, onImage, onLink can be used to obtain additional metadata.
This is done to allow for extension of the RSS specification in the future as well as support for additional namespaces. For example, the RSS 1.0, RSS 2.0, and Atom specifications all use different date mechanisms. The FeedParser simply delivers these through the onCreated and onIssued methods of the MetaFeedParserListener interface.
Content is a generic name for a body of text within an RSS or Atom post. Due to differences between formats, there are a number of ways to include content in a post, including HTML-encoded content in the description element, RSS 1.0 mod_content, xhtml:body, atom:content, atom:summary, etc.
The FeedParser includes a generic ContentFeedParserListener which allows you to intercept all content markup from all RSS formats as well as Atom.
Currently the FeedParser does NOT require that XML feeds meet the RSS and Atom specifications to the letter. While this is part of liberal feed parsing in general, there are sections of the Atom specification, for example, which state that certain child elements MUST be present.
For example:
atom:entry elements MUST contain exactly one atom:id element.
We try to follow Postel's law here and allow feeds to pass in this situation. We may adopt a policy in the future for both strict XML parsing and strict format compliance, which would trigger an exception in the event of a feed not exactly matching the specification.
In practice, if your application requires data from a feed, you need to assert within your code that you have all the correct data before moving forward.
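A minimal sketch of such a check is shown below. It assumes the application has already collected the title, link, and permalink strings from an onItem callback and wants to know which of them are unusable before proceeding. `EntryCheck` and `missingFields` are hypothetical names, and which fields count as required is an application decision, not something FeedParser prescribes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical application-side validation; not part of FeedParser.
class EntryCheck {

    /** Returns the names of required values that are null or blank. */
    static List<String> missingFields(String title, String link, String permalink) {
        List<String> missing = new ArrayList<>();
        if (isBlank(title)) missing.add("title");
        if (isBlank(link)) missing.add("link");
        if (isBlank(permalink)) missing.add("permalink");
        return missing;
    }

    private static boolean isBlank(String value) {
        return value == null || value.trim().isEmpty();
    }

    public static void main(String[] args) {
        // Values as they might arrive from a liberal parse of an incomplete feed.
        List<String> missing = missingFields("Flickr and Good URI Design", null, "  ");
        if (!missing.isEmpty()) {
            System.out.println("Skipping item, missing: " + missing);
        }
    }
}
```

Returning the list of missing names, rather than throwing on the first problem, lets the caller decide whether to skip the item, log it, or fall back to another field.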
The FeedParser also includes an advanced networking layer which meets the requirements for providing XML aggregation services over HTTP. This includes support for If-None-Match (ETags), If-Modified-Since (HTTP 304 Not Modified), gzip content encoding (compression), User-Agent modification, non-infinite timeouts, event callbacks for download progress, setting the HTTP Referer header, maximum content downloads (no files larger than N bytes), the ability to use custom HTTP methods (HEAD, GET, PUT, POST), etc.
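For readers unfamiliar with these HTTP features, the sketch below shows how several of them map onto the JDK's own `HttpURLConnection`. It does not use FeedParser's network package and is not its implementation: `PoliteFetch`, `prepare`, the timeout values, and the `ExampleAggregator/0.1` user agent are all assumptions made for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URI;

// Hypothetical illustration using only the JDK; not FeedParser's network layer.
class PoliteFetch {

    /**
     * Builds (but does not send) a conditional GET request.
     * Pass a null etag or a lastModifiedMillis of 0 to omit that validator.
     */
    static HttpURLConnection prepare(String url, String etag, long lastModifiedMillis) {
        try {
            HttpURLConnection conn =
                (HttpURLConnection) URI.create(url).toURL().openConnection();
            conn.setRequestMethod("GET");
            conn.setConnectTimeout(10_000); // non-infinite timeouts, in milliseconds
            conn.setReadTimeout(30_000);
            conn.setRequestProperty("User-Agent", "ExampleAggregator/0.1");
            conn.setRequestProperty("Accept-Encoding", "gzip");
            if (etag != null) {
                conn.setRequestProperty("If-None-Match", etag);
            }
            if (lastModifiedMillis > 0) {
                conn.setIfModifiedSince(lastModifiedMillis);
            }
            return conn;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** True when the server reports that the cached copy is still current. */
    static boolean notModified(HttpURLConnection conn) throws IOException {
        return conn.getResponseCode() == HttpURLConnection.HTTP_NOT_MODIFIED;
    }
}
```

A caller would store the ETag and Last-Modified values from one successful response and pass them to `prepare` on the next poll. If `notModified` returns true, the feed body does not need to be downloaded or parsed again. Note that a response fetched with `Accept-Encoding: gzip` set by hand must be decompressed by the caller, for example with `java.util.zip.GZIPInputStream`.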
While various APIs already exist for providing HTTP support (java.net.URL and HttpClient) we're using a version of java.net.URL that meets all our requirements and is very reliable.
Future plans are to migrate to an HTTP implementation (probably HttpClient) which supports NIO-based asynchronous event IO. This library still needs to be developed, but event IO needs to be used to provide a scalable system.
The Network IO sets a default user agent of:
Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.2.1; aggregator:FeedParser; http://commons.apache.org/sandbox/feedparser/) Gecko/20021130
The FeedParser includes a sample console application which accepts a URL to a feed, parses it, receives events, and then outputs them to the console.
    %shell% java org.apache.commons.feedparser.Main http://www.eakes.org/blog/atom.xml
    debug: init()
    onLocale: en
    debug: onChannel
        title : Michael Eakes
        link : 'http://www.eakes.org/blog/'
        description : The Weblog of Michael Eakes
    debug: onChannelEnd
    debug: onItem
        title : Flickr and Good URI Design
        link : 'http://www.eakes.org/blog/archives/2005/01/flickr_uris.html'
        description : I noticed that Flickr had some pretty sweet URIs, but I wanted to find out exactly what it was that made them good. To brush up on URI design, I scoured this great list of resources compiled by Tanya Rabourn:...
    onLink:
        rel: alternate
        href: http://www.eakes.org/blog/archives/2005/01/flickr_uris.html
        type: text/html
If for some reason FeedParser doesn't meet your needs (and we'd love to find out why), there are alternatives.
We try to keep the library dependencies of FeedParser down to a minimum. Right now a few are required that might be deprecated in FeedParser 2.0.
library | version | required |
---|---|---|
jaxen-full | | yes |
jdom | | yes |
log4j | 1.2.6 | yes |
xercesImpl | | yes |
xml-apis | | yes |
commons-httpclient | 3.0-beta1 | no (experimental support for networking) |