This post is the second of two parts; you can find part 1 of the post here.
Automation and Continuous Deployment (CD)
Getting our CD process up and running took some time and effort, but once we did, our productivity improved significantly. We created a pipeline for each of our microservices so we could deploy them independently. Each build runs quick unit tests, while Cucumber tests run after the artifact is deployed to an OpenStack instance. Any test failure causes the build to fail, so the onus is on the developers to fix the code (or update the test) immediately. While we can deploy continuously to production, our stage and production deploys are currently push-button deploys. We use the blue-green deployment model so that our services are never offline.
As developers, we are steeped in our code and intimate with all its inner workings. Of course, this is not so for anyone else needing to interact with our code. Writing the initial documentation is typically handled well, but keeping it up to date can be a seemingly Sisyphean task. We realized the need for documenting our code as our QA team started testing our endpoints. We initially documented the code and endpoints in a wiki page, but because it required manual updating, and we were still changing the code, it soon became obsolete. We needed a way to automatically update the documentation as we changed the code, so we decided to use Swagger to generate our API documentation. This involved making some configuration changes and adding annotations to our code. Now, not only is our documentation current, but we can also test our endpoints, view sample payloads, etc. right in the documentation. The documentation has become a tool for others to learn and exercise our code by making API calls. No more updating the wiki or explaining to other teams that the documentation is almost up to date, albeit with caveats.
Swagger documentation for the GET Coupons collection endpoint
Microservices by their nature can involve calls to multiple services, especially at the coordination layer (the platform service). Troubleshooting a single transaction can become difficult when multiple services are involved. We use Splunk to query across multiple services and applications. Ideally, we would prefer a correlation ID that is generated with the original request and passed along to each subsequent request in a header property; the correlation ID could then be logged and searched to trace a transaction across multiple services. Correlation IDs are not currently supported across all internal services, so instead we identified a property that is available to most of our services, and we log this property with every entry. The logging class we use comes from a library shared across multiple services, and we have modified this class to make it easier to log this information. In our case, this singular piece of information is the campaign ID. We also use a time interval along with the campaign ID to narrow our search.
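Our shared logging class is internal, but the idea can be sketched with a standard logging filter that stamps the searchable property onto every record. This is a Python sketch, not our actual library; the names `CampaignIdFilter` and the `campaign=` log format are illustrative assumptions.

```python
import logging

class CampaignIdFilter(logging.Filter):
    """Enrich every log record with a campaign_id so it can be searched in Splunk."""
    def __init__(self, campaign_id):
        super().__init__()
        self.campaign_id = campaign_id

    def filter(self, record):
        record.campaign_id = self.campaign_id
        return True  # never suppress the record, only enrich it

logger = logging.getLogger("coupon-service")
handler = logging.StreamHandler()
# Every line now carries "campaign=<id>", the property we query on in Splunk.
handler.setFormatter(logging.Formatter("%(asctime)s campaign=%(campaign_id)s %(message)s"))
logger.addHandler(handler)
logger.addFilter(CampaignIdFilter("CAMP-1234"))
logger.warning("coupon claim failed")
```

Because the filter lives in the shared logging setup, individual services do not have to remember to include the property in each log statement.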
Monitoring and Metrics
Because each service is small by definition, troubleshooting within a service is fairly easy. However, each service interacts with multiple services so a single request may cross several boundaries, which requires us to have a holistic view across our entire system. We achieve this by using the following tools:
- Splunk – allows us to search across multiple services. We have dashboards that span multiple services and provide an overview of errors across multiple services. We look for specific errors using preconfigured searches to notify us when they occur.
- New Relic – In-depth monitoring of each service. We also set up key transactions that notify us when error rates exceed a preset threshold. New Relic custom dashboards also give us response times and throughput for our service endpoints.
- StatsD – We use StatsD to track business related information, such as the number of coupons created and the number of coupons claimed. We visualize this information using Grafana.
A Grafana rendering of our coupon service GET response times
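StatsD itself is a very small protocol: plain-text metrics sent as fire-and-forget UDP datagrams. A minimal client for the kind of business counters and timers we emit might look like the sketch below (the metric names and the `StatsdClient` class are illustrative, not our actual code).

```python
import socket

class StatsdClient:
    """Minimal fire-and-forget StatsD client using the plain-text UDP protocol."""
    def __init__(self, host="localhost", port=8125):
        self.addr = (host, port)
        self.sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    @staticmethod
    def _format(metric, value, metric_type):
        # StatsD wire format: "<metric>:<value>|<type>", e.g. "coupons.created:1|c"
        return f"{metric}:{value}|{metric_type}"

    def incr(self, metric, count=1):
        # "c" = counter, e.g. number of coupons created or claimed
        self.sock.sendto(self._format(metric, count, "c").encode(), self.addr)

    def timing(self, metric, millis):
        # "ms" = timer, e.g. a GET endpoint's response time
        self.sock.sendto(self._format(metric, millis, "ms").encode(), self.addr)

statsd = StatsdClient()
statsd.incr("coupons.created")
statsd.timing("coupons.get.response_ms", 42)
```

Because the datagrams are unacknowledged, instrumenting a hot code path adds negligible latency, which is what makes StatsD a good fit for business metrics inside request handling.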
We consume our own services, so we experience the same goodness and pain (if any) as our consumers. We wrote API clients to make it easier to use our services. The challenge while implementing these clients was to shield consumers from the internal logic and data structures of our services. We have found a tendency for business logic and data structures to bleed into the clients when the team that wrote the service also writes the client. We had to consciously guard against this, and in fact we had to rewrite the original client to prevent it. We also strove to keep our clients as lean as possible, without imposing library versions on our consumers. A word of advice: you will have very unhappy consumers if your client library holds your consumer hostage to a specific version of a framework!
If Ben Franklin were a software developer he may very well have said, “In this world, nothing can be said to be certain, except death, taxes, and changes in business requirements”! Our services evolve as requirements change. Adding new endpoints is not a problem, but our clients break when the service contracts change for existing endpoints. We version our URIs to handle these breaking changes, and we support the previous version for a period of time so that consumers have time to switch over to the new version. We had a choice of storing the version as a header property or in the URI, and we chose the latter because it makes the version obvious to the consumer.
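The versioning scheme can be sketched with a tiny dispatcher: both URI versions stay routable during the deprecation window, each bound to its own contract. The handlers and payload shapes below are hypothetical, purely to illustrate a breaking contract change between versions.

```python
# Hypothetical handlers for two contract versions of the same endpoint.
def get_coupon_v1(coupon_id):
    # v1 returned a flat discount field
    return {"id": coupon_id, "discount": 10}

def get_coupon_v2(coupon_id):
    # v2 nests the discount so new fields (type, currency, ...) can be added
    # without breaking consumers again
    return {"id": coupon_id, "discount": {"amount": 10, "type": "percent"}}

ROUTES = {
    "/v1/coupons": get_coupon_v1,  # kept alive for a deprecation window
    "/v2/coupons": get_coupon_v2,  # current contract
}

def dispatch(path, coupon_id):
    try:
        return ROUTES[path](coupon_id)
    except KeyError:
        return {"error": "unknown or retired API version"}
```

Putting the version in the path means a consumer can tell at a glance, from the URI alone, which contract they are coding against, which a header property would hide.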
Testing and Improving Performance
We have a dedicated team that runs performance tests on our services. When we met with them, they asked about our expected throughput and Service Level Agreements (SLAs) for response times. We had throughput data from our previous campaigns; SLAs, however, were a different matter. Our individual services responded quickly, but the response times of our platform service were not as good. The platform service calls 5 downstream services, each of which guarantees at best an SLA of 200 ms, and some of those services are not under our control, so we needed to take a hard look at how to set that SLA. We reviewed our sequence of operations to see if there were any areas where we could improve efficiency. We found that we were creating the confirmation email template at coupon creation time; since the confirmation email is sent when the coupon is claimed, we moved the template creation to the point when the coupon becomes active. The product owner felt that it was acceptable for the creation process to take longer since multiple steps were involved. Caching resources that change infrequently was another efficiency that has helped. Another option that we will explore in the future is moving some of our synchronous request-response operations to asynchronous calls.
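The caching win is worth spelling out: if a downstream call costs up to its 200 ms SLA, serving an infrequently changing resource from a TTL cache removes that cost from most requests. A minimal sketch of the idea, with an illustrative `TtlCache` class that is not our actual implementation:

```python
import time

class TtlCache:
    """Cache for resources that change infrequently, so the platform service
    can skip a downstream call (and its ~200 ms budget) on most requests."""
    def __init__(self, ttl_seconds, fetch):
        self.ttl = ttl_seconds
        self.fetch = fetch   # fallback call to the downstream service
        self.entries = {}    # key -> (value, expiry timestamp)

    def get(self, key, now=None):
        now = time.monotonic() if now is None else now
        value, expiry = self.entries.get(key, (None, 0.0))
        if now < expiry:
            return value     # served from cache, no downstream call
        value = self.fetch(key)                   # cache miss or expired entry
        self.entries[key] = (value, now + self.ttl)
        return value
```

The TTL bounds the staleness: a resource updated upstream is observed within one TTL at worst, which is an acceptable trade for resources that rarely change.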
Our performance architect suggested that we look at the 95th percentile of our requests as opposed to the average response time. The average can be skewed by faster transactions, especially JSON schema validation errors, where the transactions fail quickly. Using the percentile gives a truer picture of the response times experienced by the majority of our users. We use SOASTA’s CloudTest for executing performance tests and visualizing the results, all triggered by our Jenkins jobs.
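A small worked example shows why the percentile matters. The sample data below is invented for illustration: a handful of fast schema-validation failures pulls the mean well below what a typical successful request actually experiences.

```python
def percentile(samples, pct):
    """Nearest-rank percentile: the value at or below which pct% of samples fall."""
    ranked = sorted(samples)
    index = max(0, round(pct / 100 * len(ranked)) - 1)
    return ranked[index]

# Illustrative response times in ms: three fast validation failures (5 ms),
# typical successes around 200 ms, and one slow outlier.
times_ms = [5, 5, 5, 180, 190, 200, 210, 220, 230, 900]

mean_ms = sum(times_ms) / len(times_ms)   # 214.5 ms, flattered by the fast failures
p95_ms = percentile(times_ms, 95)         # 900 ms, what the tail actually sees
```

An SLA stated against the mean here would look comfortably met while a slice of real users wait far longer; stating it against p95 forces the tail into view.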
Microservices and the Organization
At an organizational level there are going to be several microservices. These services may be exposed to external consumers in addition to being consumed by internal services. It is important that the organization develop standards that govern service interfaces. Having consistent patterns for URIs, pagination, error responses, use of HTTP status codes, and payloads for batch processing and responses makes service consumers’ jobs much easier. The services themselves can be written in different languages and use different frameworks, as long as they present a uniform interface. While anybody on our team can work on any service supported by our team, individual team members have gravitated toward becoming code stewards of individual services. They are most familiar with the workings of the service, and we include them in code reviews and pull requests.
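The kind of standards we mean can be made concrete with two conventions: a uniform error envelope and a uniform pagination envelope. The field names below are hypothetical; the point is that every service, whatever its language, would return the same shapes.

```python
# Hypothetical organization-wide response conventions.

def error_response(status, code, message):
    """Every error, from every service, uses the same envelope and an HTTP status."""
    return status, {"error": {"code": code, "message": message}}

def paginated_response(items, offset, limit, total):
    """Every collection endpoint pages the same way, with the same paging block."""
    return 200, {
        "items": items[offset:offset + limit],
        "paging": {"offset": offset, "limit": limit, "total": total},
    }
```

With shapes like these fixed by convention, a consumer who has integrated with one service can handle errors and paging from any other service without reading its documentation first.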
Conway’s law does come into the picture – “Any organization that designs a system will inevitably produce a design whose structure is a copy of the organization’s communication structure”. As an engineering organization, we are trying to move towards smaller teams that are more or less responsible for their applications from development to deployment. This autonomy results in more modular and loosely coupled code. There is a natural tension resulting from this, as teams are fairly autonomous and the microservices architecture lends itself to decentralized governance. However as mentioned earlier, having consistent standards (irrespective of implementation methodology) makes life a lot easier for developers within and outside the organization.
Our journey from a monolithic application to a microservices-based architecture has been an interesting one. It has been exciting to work on smaller services that we can rewrite fairly quickly if required:
- We are able to use different languages and frameworks.
- We can scale horizontally.
However, it has also required a shift in our mindset. Working with independent services has allowed us to move to a CD model, making us more responsible for our code deployments, as opposed to a “throw it over the wall” mode of operation.
- Coming from a monolithic architecture, we’ve had to resist the urge to create shared libraries that introduce coupling between services.
- Troubleshooting can be harder since there are a lot more moving parts, and this has prompted us to pay more attention to monitoring.
Topics such as security, scaling, etc. have not been addressed in this blog and they are worthy of their own posts. We have room for improvement and optimization and there are always changes around the corner, all of which serve to make this journey all the more exciting!