A couple of years ago I wrote the buddycloud channel directory as my GSoC’12 project and as my first buddycloud project to go live in production.
The component is running fine in search.buddycloud.org, however the crawler wasn’t crawling as we required. The crawler was designed on top of XEP-0060, and uses smack with its pub-sub extensions to discover Buddycloud channels and items on the Buddycloud channel server.
But then comes a day when you get the guts to look at old code. Reason: the crawler was dumb dumb. It crawled every single item of every single node every time it ran. Stupid, eh? But there’s more: by using smack’s XEP-0060 implementation, the crawler missed an important feature that the Buddycloud XEP specifies - XEP-0059, a.k.a. RSM.
The Buddycloud XEP uses RSM in several occasions, eg.: disco#items and pubsub <items>.
<iq type=’result’ from=’channelserver.example.com’ to=’email@example.com/KillerApp’ id=’items1’>
<entry xmlns=”http://www.w3.org/2005/Atom” xmlns:thr=”http://purl.org/syndication/thread/1.0”>
<content type=”text”>A comment, wondering what all this testing does</content>
<!— [more items…] —>
Thus, we needed to somehow integrate RSM on smack to get the crawler properly working, so that it could page through all results that the channel server provides and avoid looking into the past when not needed.
Finally, we achieved that by creating a class that works as a packet extension called RSMSet and by overriding some of the PubSub classes offered by smack, such as the PubSub class itself. Now, when we need to do RSM, we ask the PubSubManager for a new PubSub packet and then inject the RSMSet extension into it. As in:
DiscoverItems discoRequest = new DiscoverItems(); discoRequest.setTo(discoverInfo.getFrom()); RSMSet rsmSet = new RSMSet(); rsmSet.setAfter(discoverInfo.getRsmSet().getLast()); discoRequest.setRsmSet(rsmSet);
And then, we store the last item crawled in every node inside the directory’s database, so the crawler can be smarter about what to crawl.
Next step on performance improvement list: crawl the /firehose when possible.