Failing the right way

Authors: Andre Kelpe

2018/05/30

Not the big player

Open source is everywhere. It is nearly impossible to find a company that does not use open source software to build its products. There are open source projects for everything, yet only a few big ones get all the attention. A good metric for whether something is big and popular is whether somebody has made laptop stickers for it. It is highly unlikely that somebody will print stickers for slf4j or apache-commons-lang, yet they are used everywhere. This post is about an open source project that falls into the no-stickers category, at least I believe it does: Failsafe (not to be confused with the failsafe-maven-plugin).

Just call a service, what could go wrong?

Failsafe describes itself as “Simple, sophisticated failure handling” and that is exactly what you get. Modern software development revolves more and more around calling something else via a socket and hoping that you get the correct result on time. While it is certainly great to have a plethora of services to build upon, handling the problems that come with the distributed nature of those services is not trivial. Everybody working in software has written retry logic for well-identified error cases. Maybe you added a back-off strategy with exponentially growing delays. Maybe you even made it asynchronous, managing everything in a background thread. It is all doable and even fun the first few times you write it, but after a while you realize that you spend more time on the “support code” than on the actual business logic. We did not come to the promised land of the cloud only to discover that there is now even more code to be written…
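To make that “support code” concrete, here is a rough sketch of what a hand-rolled retry loop with exponential back-off tends to look like in plain Java. It is not taken from any real codebase: the class name HandRolledRetry, the method callWithRetry, its parameters, and the choice of IllegalStateException as the "never retry this" error are all made up for illustration.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

final class HandRolledRetry {

    // Hypothetical helper: runs `action` until it succeeds or `maxRetries` retries are used up.
    static <T> T callWithRetry(Callable<T> action, int maxRetries, long initialDelayMillis)
            throws Exception {
        long delay = initialDelayMillis;
        for (int attempt = 0; ; attempt++) {
            try {
                return action.call();
            } catch (IllegalStateException e) {
                // Assumption: stands in for a catastrophic error that must not be retried.
                throw e;
            } catch (Exception e) {
                if (attempt >= maxRetries) {
                    throw e; // retries exhausted, give up with the last failure
                }
                System.err.println("attempt " + (attempt + 1) + " failed, retrying in "
                        + delay + " ms: " + e);
                Thread.sleep(delay);
                delay *= 2; // exponential back-off: double the wait after every failure
            }
        }
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Simulated flaky service: fails twice, then succeeds on the third call.
        String result = callWithRetry(() -> {
            if (++calls[0] < 3) {
                throw new IOException("simulated outage");
            }
            return "artifacts";
        }, 5, 10);
        System.out.println(result + " after " + calls[0] + " calls");
    }
}
```

Even this small version has to juggle attempt counting, delay bookkeeping, logging, and the distinction between retryable and fatal errors, and it is still synchronous. Every one of those details is a place where a subtle bug can hide.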

This is where Failsafe comes in. I think I once bookmarked Failsafe, but never did anything with it. When I started my current job, I could see it being used all over the code base and I believe more people should adopt it. The two code snippets below are redacted examples from a real codebase and by looking at them you will immediately see what is going on.

// somewhere in the constructor
retryPolicy = new RetryPolicy()
    .abortOn(SomeCatastrophicException.class)
    .withDelay(config.getRetryDelay(), TimeUnit.MILLISECONDS)
    .withMaxRetries(config.getRetryCount());

// somewhere in some important method
Failsafe.with(retryPolicy)
    .onRetry(th -> log.warn("download of artifacts failed", th))
    .get(() -> downloadService.fetchArtifacts(someArg, otherArg));

How often to retry, when to retry and when not to, and how long to wait between attempts is immediately clear to anybody reading the code above. You do not even have to be a Java programmer to understand it. It is expressive and easy to read. If the policy, the log message, or any other aspect ever has to change, the code stays readable. There is a lot more that Failsafe can do; after all, it is “Simple, sophisticated failure handling”.

This is not a tutorial for Failsafe, but a spotlight on this little library. The documentation is comprehensive and will get you started in no time. Give it a try, and if you liked this post, get in touch!