Amazon S3 is cloud object storage, and it is very cheap. Because it is cheap, I think most people do not get optimal usage out of it. S3 offers a lot of features and options for managing different workloads and optimizing usage for different scenarios.
If you only have a few GBs, you have probably not thought about optimal usage. But if you have tens or hundreds of GBs or more, you will be concerned about the cost, and you might already know what I am going to share.
Recently, when I was checking our AWS bill, I noticed a considerable charge for S3. So I dug into our S3 buckets and checked what was in there. What I found was a lot of temporary files and backups that were very old. Some of these are very rarely accessed and kept only as an archive, and some will never need to be accessed again except for audit purposes. So I looked at what S3 offers to reduce cost in cases like these.
Before I explain our use case and how we reduced the cost, I’ll explain a few core features of Amazon S3 that you should know:
- Storage classes
- Lifecycle rules
- Analytics — Storage class analysis
Storage Classes
Amazon S3 offers a range of storage classes for different use cases. They are organized by access pattern, that is, whether the objects are accessed frequently or not. And obviously the cost changes along with the performance. At the time of publishing this article, S3 provides the following storage classes:
- Storage classes for frequent access
- Storage classes for infrequent access
In the storage class names, IA stands for Infrequent Access.
You can read more on storage classes in the Amazon S3 documentation.
Lifecycle Rules
Lifecycle rules define a set of actions that S3 applies to an object based on its age (the time since its creation date). Two main types of action are supported at the time of publishing this article:
- Transition actions — Change the storage class
- Expiration actions — Deletes the object
You can get more details on lifecycle rules from the Amazon S3 documentation.
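To make the two action types a bit more concrete, here is a minimal Python sketch of what lifecycle rules look like as data. The dict layout follows the shape that boto3's `put_bucket_lifecycle_configuration` accepts. The helper name `lifecycle_rule`, the rule IDs, the prefixes and the day counts are all made up for illustration and are not our real configuration.

```python
def lifecycle_rule(rule_id, prefix, transition_days=None,
                   storage_class="GLACIER", expiration_days=None):
    """Build one lifecycle rule as a plain dict (hypothetical helper)."""
    if transition_days is None and expiration_days is None:
        raise ValueError("a rule needs at least one action")
    rule = {
        "ID": rule_id,
        "Filter": {"Prefix": prefix},  # only objects under this key prefix
        "Status": "Enabled",
    }
    if transition_days is not None:
        # Transition action: change the storage class at this age
        rule["Transitions"] = [
            {"Days": transition_days, "StorageClass": storage_class}
        ]
    if expiration_days is not None:
        # Expiration action: delete the object at this age
        rule["Expiration"] = {"Days": expiration_days}
    return rule


# Illustrative values only (assumptions, not taken from our buckets)
archive_then_delete = lifecycle_rule(
    "archive-old-backups", "backups/", transition_days=30, expiration_days=365
)
delete_only = lifecycle_rule("purge-temp-files", "tmp/", expiration_days=7)

lifecycle_configuration = {"Rules": [archive_then_delete, delete_only]}
```

A bucket has a single lifecycle configuration holding a list of rules, so both actions can live side by side, each scoped by its own filter.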
Analytics — Storage Class Analysis
If you want to understand the usage patterns of the objects in your bucket, and need help deciding at what age objects should move to another storage class or be deleted, this is a great feature that Amazon S3 provides. Storage class analysis observes the data access patterns and shows a nice graphical report, which you can use to determine the right storage class for the right data.
You can get more on storage class analysis from the Amazon S3 documentation.
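The kind of decision the report supports can be sketched in a few lines of Python. This is only an illustration of the reasoning, not the format S3 actually exports: the function name `last_accessed_age`, the input layout (object age in days mapped to bytes retrieved) and every number below are assumptions I made up for the example.

```python
def last_accessed_age(access_by_age):
    """Return the oldest object age (in days) that still saw any retrievals.

    access_by_age maps an object age in days to the bytes retrieved
    from objects of that age. Returns None if nothing was retrieved.
    """
    ages_with_access = [
        age for age, retrieved in access_by_age.items() if retrieved > 0
    ]
    return max(ages_with_access) if ages_with_access else None


# Made-up observations: retrievals fade out as objects get older
observed = {1: 5_000_000, 3: 1_200_000, 7: 300_000, 13: 40_000,
            20: 0, 45: 0, 90: 0}

cutoff = last_accessed_age(observed)
print(cutoff)  # prints 13
```

With data like this, anything older than the returned age is a candidate for a cheaper storage class or for deletion.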
I’m not sure whether you have noticed this already, but if you look closely at the storage classes, there is a class named GLACIER, which is also available as a separate service in the AWS console. Amazon Glacier is a dedicated service optimized for storing infrequently used data such as archives and backups. If your data doesn’t have a lifecycle and is treated as an archive or a backup from the moment it is created, you can use the dedicated Amazon Glacier service.
You can read more details on Amazon Glacier in its documentation.
Our Use Case
Okay, now that you know the important features that S3 provides, I’ll share more details about our use case. Well, actually we have multiple use cases, but I’ll share only one since they all have a similar configuration with different filter criteria.
We have multiple web applications whose static content is stored in S3. Each time we deploy a new version, we back up the current version to a different S3 bucket and keep it for a certain amount of time for rollback purposes, in case something goes wrong with the new deployment. Since we deploy very frequently, our backup bucket kept filling up with older files that were no longer needed, and we were spending money on them. We could manually check from time to time and delete the files that are no longer needed. However, keeping track manually won’t work in the long run.
Now I think the solution is pretty straightforward. We started using S3 lifecycle rules. To decide on the best age threshold for changing the storage class, we first ran a storage class analysis to check the data access patterns and found that data older than 14 days was not being accessed. This makes sense, since we have two-week release cycles most of the time. With the age threshold settled, we created a lifecycle rule that changes the storage class to Glacier 14 days after the data is uploaded to the bucket.
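For readers who prefer code to the console, a rule like this could be applied roughly as follows. This is a sketch under assumptions, not our actual setup: the bucket name, the rule ID and the function name `apply_backup_archive_rule` are hypothetical. The only real API it relies on is boto3's `put_bucket_lifecycle_configuration`. To keep the example runnable without AWS credentials, it uses a small stand-in client that just records the call; with real credentials you would pass `boto3.client("s3")` instead.

```python
def apply_backup_archive_rule(s3_client, bucket, days=14):
    """Move every object in the bucket to GLACIER once it is `days` old."""
    rule = {
        "ID": "archive-old-deployment-backups",  # hypothetical rule name
        "Filter": {"Prefix": ""},                # empty prefix: whole bucket
        "Status": "Enabled",
        "Transitions": [{"Days": days, "StorageClass": "GLACIER"}],
    }
    # Note: this call replaces any lifecycle configuration already on the bucket
    s3_client.put_bucket_lifecycle_configuration(
        Bucket=bucket,
        LifecycleConfiguration={"Rules": [rule]},
    )
    return rule


class RecordingClient:
    """Stand-in for boto3.client("s3") that only records what it is asked."""

    def __init__(self):
        self.calls = []

    def put_bucket_lifecycle_configuration(self, **kwargs):
        self.calls.append(kwargs)


client = RecordingClient()
apply_backup_archive_rule(client, "example-deploy-backups")
```

Because the call overwrites the bucket's existing lifecycle configuration, any rules already on the bucket would need to be included in the same `Rules` list.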
Apart from changing the storage class to Glacier, we have added object expiration on some other buckets that hold temporary files used by our data processors.
Finally, what I have to say is that Amazon has thought of most of the use cases we have and has added features to S3 to help us get optimal usage. All of this is covered in their documentation, so it’s really worth having a look at it before you design your solution.