Sunday, 12 February 2017

Using Castle Windsor to implement a Circuit breaker

TL;DR

So that we don't send out vast numbers of SMSes or emails unintentionally, we have taken the 'Circuit Breaker' design pattern described in 'Release It!' (Michael Nygard) and implemented it using Castle Windsor's Interceptor functionality.

Details

Michael Nygard explains that a software circuit breaker works by "wrapping dangerous operations with a component that can circumvent calls when the system is not healthy". That is an excellent fit for a Castle Interceptor: it proxies method calls, and the interceptor's implementation decides whether or not to invoke the original operation, based on whatever logic makes sense. In our case that logic is the number of emails or SMSes sent that day, compared against historical data.

Even better, because the component that sends emails and SMSes subscribes to a message queue, "once the danger has passed, the circuit breaker can be reset to restore full function to the system", and we can simply replay any failed messages on the bus that should have been processed.

Implementation

It is very important to us that we do not send lots of SMSes or emails accidentally, perhaps because of a bug in our software or incorrect application configuration. A circuit breaker is conceptually a good fit for that requirement: it should stop the system from going haywire when it tries to do too much of something (in the same way that a real circuit breaker prevents your house from burning down because you've plugged in five electric fires in your living room).

As part of our distributed, bus-based system we have a single component that deals with sending SMSes and emails, and that's all it knows how to do. There is no domain knowledge in there; it simply receives requests to send messages and makes the relevant API request to various 3rd party providers. It listens out for various (very similar) requests for messages (generated by other components in the system) and calls our providers' APIs. That again makes it a very good fit for a circuit breaker as there is, effectively, only one place to insert it.

Using a very light wrapper around EasyNetQ we have a number of 'Processors' whose job is to pick messages up from a queue and process them. Since they all implement the same interface, it is very easy to write an interceptor that covers requests for the various types of email. It looks like this:

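using Castle.DynamicProxy;

public class EmailVolumeInterceptor : IInterceptor
{
    private readonly EmailVolumeBreaker circuitBreaker;

    public EmailVolumeInterceptor(EmailVolumeBreaker circuitBreaker)
    {
        this.circuitBreaker = circuitBreaker;
    }

    public void Intercept(IInvocation invocation)
    {
        // Ask the breaker whether processing this message would trip it.
        if (circuitBreaker.ProcessMessageTripsBreaker())
        {
            // Throwing stops the send; the failed message goes to the error queue.
            throw new CircuitBreakerException("Limit hit for emails to be sent today");
        }

        // Under the limit: hand over to the real processor.
        invocation.Proceed();
    }
}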

We wire these up directly in an installer class for Windsor:

Component.For<IDrDrEventProcessor<SendEmail>>().ImplementedBy<EmailSender>().Interceptors<EmailVolumeInterceptor>(),
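For context, here is a rough sketch of how that registration might sit inside a complete installer. The installer name (MessagingInstaller) and the lifestyle choices are assumptions for illustration, not our actual code; it also assumes EmailVolumeBreaker has no constructor dependencies beyond what the container already knows about. The one thing worth calling out is that Windsor resolves interceptors from the container, so the interceptor (and the breaker it depends on) must be registered as components too.

```csharp
using Castle.MicroKernel.Registration;
using Castle.MicroKernel.SubSystems.Configuration;
using Castle.Windsor;

// Hypothetical installer; relies on the types named in this post
// (EmailVolumeBreaker, EmailVolumeInterceptor, IDrDrEventProcessor<T>,
// SendEmail, EmailSender) being defined elsewhere in the project.
public class MessagingInstaller : IWindsorInstaller
{
    public void Install(IWindsorContainer container, IConfigurationStore store)
    {
        container.Register(
            // Assumption: one shared breaker, so every intercepted call
            // counts against the same running total.
            Component.For<EmailVolumeBreaker>().LifestyleSingleton(),

            // Interceptors are resolved from the container, so register it.
            Component.For<EmailVolumeInterceptor>().LifestyleTransient(),

            // The registration from the post: proxy the processor and
            // route its calls through the interceptor.
            Component.For<IDrDrEventProcessor<SendEmail>>()
                     .ImplementedBy<EmailSender>()
                     .Interceptors<EmailVolumeInterceptor>());
    }
}
```

Keeping the breaker as a singleton matters in a sketch like this: if each processor got its own instance, the counts would never add up to a meaningful total.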

In our case (for emails) we don't need to know anything about the request; we just need to increment a count of emails that have been sent. Our circuit breaker has some logic around working out what its limit is and then checking whether this request would exceed it, not to mention alerting us if there's a problem. It also has some logic around what to do when two-thirds of capacity has been reached, which extends the metaphor somewhat, but which we believe could be useful.

When the breaker trips, every request to send an SMS or email (depending on which breaker has tripped) causes an exception, because from then on the method always returns true. All of those requests are therefore put onto our error queue, so we can replay them once we have resolved the issue.

Resetting the breaker requires manual intervention, forcing someone to resolve the issue that caused the problem in the first place.
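The behaviour described above (count, compare against a limit, warn at two-thirds, stay tripped until someone resets it) can be sketched roughly as follows. This is not our real EmailVolumeBreaker: the class name VolumeBreakerSketch, the fixed dailyLimit passed in by the caller, and the alert callback are all assumptions made to keep the example self-contained. It also leaves out the daily rollover and the historical-data calculation of the limit.

```csharp
using System;

// Hypothetical, simplified breaker for illustration only.
public class VolumeBreakerSketch
{
    private readonly object gate = new object();
    private readonly int dailyLimit;       // assumed to be supplied by the caller
    private readonly Action<string> alert; // assumed alerting hook
    private int sentToday;
    private bool tripped;

    public VolumeBreakerSketch(int dailyLimit, Action<string> alert)
    {
        if (dailyLimit <= 0) throw new ArgumentOutOfRangeException(nameof(dailyLimit));
        this.dailyLimit = dailyLimit;
        this.alert = alert ?? throw new ArgumentNullException(nameof(alert));
    }

    // Returns true when the message must NOT be sent.
    public bool ProcessMessageTripsBreaker()
    {
        lock (gate)
        {
            // Once tripped, stay tripped until Reset() is called.
            if (tripped) return true;

            // Would this message take us over the limit?
            if (sentToday + 1 > dailyLimit)
            {
                tripped = true;
                alert("Breaker tripped: limit of " + dailyLimit + " reached");
                return true;
            }

            sentToday++;

            // Warn once, on the message that crosses two-thirds of the limit.
            bool nowOver = sentToday * 3 >= dailyLimit * 2;
            bool wasOver = (sentToday - 1) * 3 >= dailyLimit * 2;
            if (nowOver && !wasOver)
            {
                alert("Warning: two-thirds of the limit of " + dailyLimit + " used");
            }

            return false;
        }
    }

    // Manual intervention: called by a person once the cause is fixed.
    public void Reset()
    {
        lock (gate)
        {
            tripped = false;
            sentToday = 0;
        }
    }
}
```

With a limit of 3, for example, the first three calls return false (with a warning raised on the second), and every call after that returns true until Reset() is invoked. Whether a reset should also zero the count is a design choice; the sketch does so purely to keep the example short.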

This is our first implementation of a circuit breaker and it's definitely been a learning experience. Technically it's been reasonably straightforward, but the difficulties have arisen from business questions:
 - what is a reasonable test for the breaker to perform? i.e. what is the metric you choose in order to decide whether your call to send an SMS should be ignored or not?
 - are we prepared to deal with having to fix things after it trips (akin to having to reset your microwave clock after a real circuit breaker trips)?
 - wait, what happens if we send a load of junk but the volume of it doesn't trip the breaker? (a circuit breaker won't stop you getting an electric shock from something that's wired incorrectly ... although it might trip if you create a short to Earth ... but I digress).
 - are you sure it's working?  Should we test it in production?
