I was recently charged with fixing a setting on multiple Amazon S3 buckets, containing anywhere from a few thousand objects up to 50 million. As any professional DevOps person would do, I wrote a script to handle the work. The basic workflow was this: loop through all objects in the bucket, check each for the incorrect setting, and if the incorrect setting exists, update it. And that’s basically what I wrote. It passed code review, and the team and I were happy with the plan and execution.
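
A minimal sketch of that sequential loop, assuming boto3 and using an object’s storage class as a stand-in for the actual setting (which I won’t name here); needs_fix and fix_object are hypothetical helpers, not the real script:

```python
import boto3

s3 = boto3.client("s3")

def fix_bucket(bucket: str) -> None:
    # Paginate through every object in the bucket; list_objects_v2
    # returns at most 1,000 keys per call.
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket):
        for obj in page.get("Contents", []):
            if needs_fix(bucket, obj["Key"]):
                fix_object(bucket, obj["Key"])

def needs_fix(bucket: str, key: str) -> bool:
    # Placeholder check: storage class stands in for the real
    # per-object setting, purely for illustration.
    head = s3.head_object(Bucket=bucket, Key=key)
    return head.get("StorageClass", "STANDARD") != "STANDARD"

def fix_object(bucket: str, key: str) -> None:
    # Placeholder fix: an in-place copy is one common way to
    # rewrite an object's metadata or storage class.
    s3.copy_object(
        Bucket=bucket,
        Key=key,
        CopySource={"Bucket": bucket, "Key": key},
        StorageClass="STANDARD",
        MetadataDirective="COPY",
    )
```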

Running in Test

As with any change, I needed to test at different scales before running against our production buckets. The first bucket outside of raw testing had a few thousand objects, and the script functioned admirably. It was ready for the next level: a bucket with about 15 million objects.

And that’s where my initial scoping fell apart. You see, I never checked how many objects were in these buckets when I started writing the code. I just had a nebulous concept of “some” objects, never verifying the hard numbers. As one can then infer, as soon as we scaled up, the script slowed to a crawl. It was running a single process over a loop of 15 million objects, and some back-of-the-envelope math shows how bad that is: even at a brisk 10 ms per object check, the run takes almost two days from my laptop, and at a full second per object it balloons to nearly six months.
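
For the record, here is that back-of-the-envelope math. The per-object latencies are assumptions for illustration, not measurements:

```python
objects = 15_000_000

# Rough total runtime at a few assumed per-object latencies.
for per_object_s in (0.01, 0.1, 1.0):
    days = objects * per_object_s / 86_400  # 86,400 seconds per day
    print(f"{per_object_s}s per object -> {days:.1f} days")

# 0.01s per object -> 1.7 days
# 0.1s per object -> 17.4 days
# 1.0s per object -> 173.6 days
```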

At Scale

With a full understanding of the size of these buckets, I knew I needed to switch my execution method to multiprocessing: start as many processes as possible, have each do its piece of work, then die. Nothing but API rate limits and physical hardware would limit how fast this could churn through objects. Running on my laptop, my CPU (an i7-8750H) jumped to 100% utilization and the script was at least tenfold faster. Screaming fast, but alas my computer could not keep up, and thermal throttling kept kicking in to slow things down. Since I could no longer handle the task locally, I spun up an EC2 instance with 8 cores and let it rip. It still took most of a day to get through our largest buckets, but at least it did not take weeks to months.
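
A minimal sketch of the parallel version, under the same assumptions as the earlier snippet (the storage-class check and copy are hypothetical stand-ins for the real setting): the parent process paginates the listing while a pool of worker processes makes the heavy per-object calls.

```python
import os
from concurrent.futures import ProcessPoolExecutor

import boto3

_s3 = None  # per-worker S3 client, created once per process

def _init_worker():
    # boto3 clients should not be shared across processes, so each
    # worker builds its own in the pool initializer.
    global _s3
    _s3 = boto3.client("s3")

def process_key(args):
    bucket, key = args
    # Same hypothetical check/fix as the sequential sketch.
    head = _s3.head_object(Bucket=bucket, Key=key)
    if head.get("StorageClass", "STANDARD") != "STANDARD":
        _s3.copy_object(
            Bucket=bucket,
            Key=key,
            CopySource={"Bucket": bucket, "Key": key},
            StorageClass="STANDARD",
            MetadataDirective="COPY",
        )

def fix_bucket_parallel(bucket):
    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")

    def keys():
        # Listing stays sequential in the parent; it is cheap
        # relative to the per-object HEAD/COPY calls the workers do.
        for page in paginator.paginate(Bucket=bucket):
            for obj in page.get("Contents", []):
                yield (bucket, obj["Key"])

    with ProcessPoolExecutor(
        max_workers=os.cpu_count(), initializer=_init_worker
    ) as pool:
        # chunksize batches keys per task to cut pickling overhead.
        for _ in pool.map(process_key, keys(), chunksize=1000):
            pass

if __name__ == "__main__":
    fix_bucket_parallel("my-example-bucket")  # hypothetical bucket name
```

One worker per core is a natural starting point here; past that, S3 API throttling rather than the CPU tends to become the bottleneck.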

Here is video proof of the speed-up on my local machine.

Lessons Learned

I work for a fully “in the cloud” company, one that truly embraces cloud-native technologies. Because the company operates at that scale, anything I build on the backend, even something as simple as this Python script, needs to be able to work at that same scale. That has to be treated as a requirement from day one, not as an “oops” fix at the end of the testing rounds.