Over the last year and a half, the Buffer team has tripled in size, from 25 teammates to 80. And our technology stack has ballooned as well, from a simple setup with just the basic AWS components to a much more fully utilized set of AWS and other external services.
There are now so many pieces that make up the whole of the application: from how we store our data in MongoDB using Compose, to how we run the application on Elastic Beanstalk, to how we manage our data warehousing using Redshift. There’s also been so much that all of us have learned while growing.
One natural progression of this growth in new technology and new teammates has been the birth of the systems team within Buffer. This team emerged out of the necessity to support the stability and future innovations at Buffer.
Having this specialized team has given us a chance to take a little more time to reflect on the technology decisions we’ve made thus far. One of the goals we came up with was to reduce the cost of our infrastructure by making our resource usage more efficient.
The “spring cleaning” of the Buffer infrastructure has been an incredibly fun journey and one where we learned about lots of things for the first time. I’d love to share more of this story and what we’ve learned along the way. It’d be amazing if any of our wonderful community would be up for sharing their experiences as well and any advice that we could take up. 🙂
Beginning the cleanup
One of the fun parts as we started our cleanup and cost reductions was the realization that we weren’t entirely sure where to start. This gave us a great indication of the need to maintain some kind of record to see how our money was being split across multiple resources. Of course we weren’t entirely without a hunch, and so some of the areas we thought we’d look into were:
- Clearing out duplicate data stores that were no longer necessary
- Optimizing our data warehouses
- Finding out how we could improve our cost efficiency across other resources on AWS.
So with the introductions done, let’s dive in!
Working with redundancies
For the longest while we’ve used Fluentd to log events in our systems.
When we started we stored these logs in a MongoDB database on Compose. At the time this was a very reasonable decision given the document format of logs. In fact it seemed a perfect match for a NoSQL database.
Over time however we realized that from a perspective of scaling and data research, this wasn’t the best option.
After some evaluation we opted to move this data to Amazon’s data warehouse offering, Redshift. Since this itself was a new experience for us, we did it in true Buffer style where we took the quickest path to action. This meant doing things in stages. What we didn’t do (and reflecting on this, maybe we could have improved on) was set a checklist to say when all transitioning was complete.
Late in February, as part of the systems team’s OKRs to reduce infrastructure cost, we took a look at the database we were transitioning the data off. We weren’t entirely sure whether data was still being put into the DB. Some quick scripts and code analysis sessions later, we discovered that the last insert into the MongoDB database was done sometime in December, over two months prior! A few last verification scripts were written, and once we were confident, we shut down that database.
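For anyone wanting to run a similar check, here is a minimal sketch of one way to find out when a collection last received an insert. It is an illustration rather than the exact scripts from the cleanup, and it assumes the documents use MongoDB’s default ObjectId for `_id`: the first four bytes of an ObjectId hold its creation time in seconds since the Unix epoch. The function names are hypothetical.

```python
from datetime import datetime, timezone


def objectid_created_at(object_id_hex: str) -> datetime:
    """Return the creation time embedded in a 24-character ObjectId hex string."""
    if len(object_id_hex) != 24:
        raise ValueError("an ObjectId is 12 bytes, i.e. 24 hex characters")
    # The first 4 bytes (8 hex characters) are a big-endian Unix timestamp.
    seconds = int(object_id_hex[:8], 16)
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


def days_since_last_insert(newest_id_hex: str, now: datetime) -> int:
    """Whole days between the newest document's creation time and `now`."""
    return (now - objectid_created_at(newest_id_hex)).days
```

The newest `_id` can be found by sorting the collection on `_id` in descending order and taking the first document. If `_id` values are set by the application instead of generated by MongoDB, this approach does not apply.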
Total cost savings from this exercise: $5,000 USD per month.
This may seem like it should have been obvious, but as we did things progressively we never quite realized when we were done. Our hunch was that this isn’t too uncommon especially in an environment that embraces a fast pace of learning and execution. But this was just the first win from our spring cleaning.
We then turned our eye towards the data instances being used for our Elasticsearch clusters. While our utilization of the instances was nearly optimal (meaning that we were consistently running just below the threshold recommended by Amazon before an upgrade), we still wondered if we could save costs.
It turns out that apart from its general purpose instance types, Amazon offers highly specialized instances to suit various purposes. There are instance types for work that requires fast in-memory reads but fewer disk reads. There are instance types that do the opposite. There are even instance types equipped with graphics cards for compute-intensive tasks.
For the Elasticsearch cluster we turned towards one of these specialized instance types. We migrated the ES cluster from the “m” type (general purpose) to the “r” type instances that AWS offers, which AWS classes as memory optimized, and after we checked, we had savings worth roughly $1,200 USD a month in total.
Our final major wins came from analyzing the rest of our AWS infrastructure to make sure that things were running optimally. In our research on how to best perform this, we discovered Trusted Advisor, which very quickly became our best friend.
Trusted Advisor basically provides advice on how you can improve your usage of AWS resources. It provides four categories of advice:
- Cost optimization
- Performance
- Security
- Fault tolerance
Of these we were interested mostly in cost optimization for our current task. This alone showed us possible savings of about $8,000 USD per month! Whoa.
As we looked over the suggestions, the two big categories (which accounted for about 90% of the savings) were:
- Zombie instances. Instances that were incurring a cost but had extremely low or no usage would be perfect candidates for scaling down or completely terminating.
- High usage instances that are recommended for reserved instance types. AWS provides a few purchasing options for instances, and the most cost efficient for steady workloads is the reserved instance. Think of it as a wholesale purchase where you’d purchase server time for a few years, and in return, AWS gives you a pretty hefty discount. There are plenty of pros and cons around this, but basically if you’ve been having consistent traffic to some servers and you have a decent idea of what the future is going to look like for those servers, it’s a great idea to reserve your instances and cash in on that discount.
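To make the reserved instance trade-off concrete, here is a rough sketch of the break-even arithmetic. All prices are made-up placeholders, not real AWS rates, and the model assumes a one-year term in which the reserved hourly rate is billed for every hour of the term whether or not the server is running.

```python
HOURS_PER_YEAR = 24 * 365


def on_demand_cost(hourly_rate: float, hours_used: float) -> float:
    """On-demand: pay only for the hours actually used."""
    return hourly_rate * hours_used


def reserved_cost(upfront: float, hourly_rate: float, term_hours: float) -> float:
    """Reserved (assumed model): upfront fee plus the hourly rate for the whole term."""
    return upfront + hourly_rate * term_hours


def break_even_utilization(on_demand_rate: float, upfront: float, reserved_rate: float) -> float:
    """Fraction of a one-year term a server must run before reserving is cheaper."""
    full_term_reserved = reserved_cost(upfront, reserved_rate, HOURS_PER_YEAR)
    return full_term_reserved / (on_demand_rate * HOURS_PER_YEAR)


# Hypothetical prices: $0.10/hour on demand versus $300 upfront plus $0.03/hour reserved.
ratio = break_even_utilization(on_demand_rate=0.10, upfront=300.0, reserved_rate=0.03)
print(f"Reserving pays off above {ratio:.0%} utilization")
```

With these placeholder numbers a server has to run roughly two thirds of the year before the reservation wins, which is why consistent traffic matters so much for this decision.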
Converting high usage instances to reserved instances requires a certain amount of testing and analysis to make sure that we pick the right instance types, and to decide whether keeping them inside or outside of a Virtual Private Cloud is better for the long term. So for the time being we kept this on the back burner and moved to the faster win: shutting down or scaling down the zombie instances.
This process involved us checking on individual instances, investigating what their purpose was in some cases, checking on who started it up, whether it was still required or whether it could be scaled down. This was not straightforward but it was a fantastic experience to learn what requirements the team might have of AWS resources in the future.
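Part of this checking could be automated. The sketch below shows the general idea of flagging zombie candidates from daily usage figures. The data shape, the instance names, and the thresholds (10% CPU, 5 MB of network traffic, at least 4 quiet days) are illustrative assumptions, and a flagged instance is only a candidate: someone still needs to confirm with its owner.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class DailyUsage:
    avg_cpu_percent: float
    network_mb: float


def is_low_use_day(day: DailyUsage, cpu_threshold: float = 10.0,
                   network_threshold_mb: float = 5.0) -> bool:
    """A day counts as low use only if both CPU and network traffic are low."""
    return (day.avg_cpu_percent <= cpu_threshold
            and day.network_mb <= network_threshold_mb)


def find_zombie_candidates(usage_by_instance: Dict[str, List[DailyUsage]],
                           min_low_days: int = 4) -> List[str]:
    """Return instance IDs with at least `min_low_days` low-use days, sorted."""
    candidates = []
    for instance_id, days in usage_by_instance.items():
        low_days = sum(1 for day in days if is_low_use_day(day))
        if low_days >= min_low_days:
            candidates.append(instance_id)
    return sorted(candidates)
```

Requiring both signals to be low avoids flagging servers that use little CPU but still move a lot of data.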
We were really grateful that there’s so much to learn and that the resources are so readily available. We’ll soon be activating CloudTrail to have a better look at everything going on. This process has also given us a chance to think of better ways to help people set up AWS infrastructure in a more uniform manner that would encourage them to experiment.
Fun fact: We’ve even bounced around the idea of having a slash command for our Slack to spin up a new server, where a command like:
/awsbot new small server at <subdomain>.buffer.com
would spin up a new small instance with all the proper pieces: security groups, EBS-backed data volumes, HTTPS built in by setting up the Elastic IPs along with an ELB to manage HTTPS traffic, and finally the domains being set up by Route 53. It’s just an idea at this stage but we feel like it could be incredibly exciting to empower people to experiment with creating their own mini services for Buffer.
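Since this is only an idea, the following is a purely hypothetical sketch of the first step such a bot would need: turning the command text into a structured request. The size names, their mapping to instance types, and the `ServerRequest` shape are all assumptions made for illustration. Nothing here creates any AWS resources.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical mapping from friendly size names to EC2 instance types.
SIZE_TO_INSTANCE_TYPE = {
    "small": "t2.small",
    "medium": "t2.medium",
    "large": "m4.large",
}

# Matches e.g. "new small server at demo.buffer.com" (the text after /awsbot).
COMMAND_PATTERN = re.compile(
    r"^new (?P<size>\w+) server at "
    r"(?P<subdomain>[a-z0-9](?:[a-z0-9-]*[a-z0-9])?)\.buffer\.com$"
)


@dataclass(frozen=True)
class ServerRequest:
    instance_type: str
    hostname: str


def parse_command(text: str) -> Optional[ServerRequest]:
    """Parse the command text; return None if it is not understood."""
    match = COMMAND_PATTERN.match(text.strip().lower())
    if match is None:
        return None
    instance_type = SIZE_TO_INSTANCE_TYPE.get(match.group("size"))
    if instance_type is None:
        return None  # unknown size name
    return ServerRequest(instance_type, match.group("subdomain") + ".buffer.com")
```

Returning `None` for anything unrecognized keeps the bot from guessing, which matters when the next step would be spending money on a server.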
One thing to note is that we didn’t keep track of how many instances we were shutting down or the cost savings on each one (yikes!). We do know that we have saved over $3,800 USD a month from taking down instances, and we’ve saved roughly $1,000 USD a month by scaling down instances. These are actually huge wins and we were thrilled to see the numbers adding up.
The quick tally
|Item|Savings per month (USD)|
|---|---|
|MongoDB cleanup|5,000|
|Elasticsearch node change|1,200|
|EC2 scale down|1,000|
|Zombie instance removal|3,800|
|Total|11,000|
Together that is $11,000 USD per month. Given that these numbers are a little less than the actual amount we are saving, we are saving just over $132,000 USD per year! That’s huge, and incredibly exciting!
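For anyone who wants to check the arithmetic, the tally works out as follows, using the monthly figures quoted in this post:

```python
# Monthly savings in USD, as quoted in the tally above.
monthly_savings_usd = {
    "MongoDB cleanup": 5000,
    "Elasticsearch node change": 1200,
    "EC2 scale down": 1000,
    "Zombie instance removal": 3800,
}

monthly_total = sum(monthly_savings_usd.values())
annual_total = monthly_total * 12

print(f"Monthly: ${monthly_total:,} USD")  # Monthly: $11,000 USD
print(f"Annual:  ${annual_total:,} USD")   # Annual:  $132,000 USD
```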
In hindsight, would we have done anything differently?
In many ways this period of becoming more cost-efficient felt like us making time for reflection in a way we had not done previously. The outcomes of this have been wonderful. While we may have been tempted to avoid inefficiencies in the first place by putting in processes at the beginning, we don’t feel that would be representative of who we are.
We feel that putting in processes very early on can cause a few possible outcomes that we would prefer to avoid:
Introducing stop energy.
Putting in processes and checklists before we have a clear idea of what exactly needs checks and balances means that we could hinder team members’ willingness to jump in and learn things. We could also end up hindering our ability to even let people come aboard and learn.
Today we can see much more clearly which areas we wouldn’t want to put checks and balances on, and how we can support them better instead. A good example would be that right now, anyone with access to AWS infrastructure can create a few EC2 instances, no questions asked. If we had started off by introducing a checklist, one might have had to request permission so that we could track what’s happening.
With our experience now, we can use a combination of CloudTrail, some automated retrieval of data from Trusted Advisor, and a neat chatbot that pings Slack to check with the owners of the servers as to whether or not the instances can be terminated if they show low use.
Another example would be having a manual that everyone must follow to set up a server no matter what. With the experience we have now, we see it might be better to provide something like a slash command where, depending on the purpose, we can set up a simple EC2 instance or a larger cluster automatically. Basically, we could let people simply jump in and start doing what they need to do.
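As a sketch of what the chatbot idea mentioned above might look like, the function below groups low-use instances by owner and builds one message per owner. The owner mapping is a hypothetical input (it could come from instance tags, for example), and the payload uses the simple JSON shape with a `text` field that Slack incoming webhooks accept. Sending the payloads is left out.

```python
import json
from typing import Dict, List


def build_owner_pings(owner_by_instance: Dict[str, str]) -> List[str]:
    """Return one JSON payload per owner, listing that owner's low-use instances."""
    instances_by_owner: Dict[str, List[str]] = {}
    for instance_id, owner in owner_by_instance.items():
        instances_by_owner.setdefault(owner, []).append(instance_id)

    payloads = []
    for owner in sorted(instances_by_owner):
        ids = ", ".join(sorted(instances_by_owner[owner]))
        text = (f"Hi @{owner}, these instances show low use: {ids}. "
                "Can they be terminated?")
        payloads.append(json.dumps({"text": text}))
    return payloads
```

Batching the instances per owner means each person gets a single question instead of one ping per server.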
Reluctance to innovate.
Another possible outcome of checks and balances being introduced too early is reluctance to innovate.
With sufficient hindering processes in place, we could very well have introduced a sense that maybe it would be better to leave things as is. After all, checks and balances mean that any new ideas must also have processes introduced with them.
One of my favorite Buffer values is the bias towards action.
Having been able to experiment with ideas with minimal checks and oversight and maximum trust has been incredibly liberating. I can’t imagine a Buffer that’s not like that every day.
We feel like our view towards letting ourselves learn new pieces of the AWS stack and pick up things at will has yielded only benefits when we look at the long term.
We aren’t done, though, and we recognize there’s always room for more improvement and reflection. We are just scratching the surface of efficient resource usage and we haven’t even begun our dive into creating tools to empower teammates to follow the best practices themselves.
This is a journey that’s just beginning for us, and we can’t wait to share how we progress in the future.
Can you think of anything we should look into? Have you ever had a chance to make your processes more efficient in some way? I’d be keen to hear all your thoughts in the comments!