Archiving Data in the Cloud: A Custom Solution for 100s of Terabytes of Raw Data

In the modern data-driven landscape, businesses often find themselves grappling with enormous volumes of data. Our journey at Big Cloud Country has been no different. We faced a significant challenge: efficiently managing and archiving our rapidly growing data stored in AWS S3. With our data growing at about 500 GB/day, sometimes peaking at 1 TB on heavy days, traditional data management strategies were proving inadequate.

The Challenge: Growing Data and Ineffective Lifecycle Policies

In our quest to manage and archive the rapidly expanding volume of data in AWS S3, we initially turned to AWS S3 lifecycle policies. Generally, these policies are a go-to solution for many organizations, as they automate the process of moving data to more cost-effective storage classes or archiving and deleting it based on its age and access patterns.

However, we encountered a significant roadblock: the lifecycle policies did not function as we had anticipated. Rather than efficiently transitioning data to cheaper storage classes or archiving it according to our set parameters, we found a considerable amount of data lingering in the standard storage tier. This was a critical issue because it meant incurring higher costs for storing data that didn't need to be readily accessible or was rarely accessed.

S3 costs ballooned despite having lifecycle policies in place

So what happened?

  1. Interaction with Data Lake Services: Our data lake extensively used services like AWS Glue crawlers and Athena queries. Each interaction with these services reset the “last accessed” time of our files in S3. This behavior made it impossible for the lifecycle policies to accurately determine which data was infrequently accessed and should be archived.

  2. Dynamic Data Patterns: Our data management challenge was further compounded by the dynamic and real-time nature of our data collection and analysis processes. The way we gathered and utilized data was constantly evolving, presenting unique challenges that conventional lifecycle policies were ill-equipped to handle.

    1. Continuous 24x7 Data Collection: Our systems were designed to collect data round-the-clock, resulting in a continuous influx of new information. This constant stream of data was both a benefit and a challenge. While it provided up-to-date insights crucial for our operations, it also meant that the volume of data in our S3 buckets was continually increasing. Managing this ever-growing dataset required a solution that could adapt in real-time to the changing data landscape.

    2. Varied Data Analysis Needs: The way we analyzed data depended on multiple, often changing, data points within our dataset. Different teams and operations within our organization required various slices of data for analysis, often overlapping in their requirements. This meant that certain data segments were accessed more frequently than others, but these patterns could shift rapidly based on evolving business needs and analytical focuses.

Given these challenges, we explored several alternatives to find a more fitting solution.

  1. S3 Batch Operations: We considered using S3 Batch Operations, a feature designed to perform large-scale batch operations on S3 objects. While powerful for executing operations across millions of objects, we found it lacked the granularity and flexibility we needed for our specific archiving rules and patterns.

  2. Open Source Tools: We also delved into the realm of open-source tools, searching for existing solutions that could be adapted to our needs. There were several promising tools available, but each fell short in one aspect or another — be it scalability, customization, or integration with our existing AWS ecosystem.

  3. Custom Scripting and Automation: The limitations of existing tools led us to the realization that we needed a more customized solution. This approach would allow us to script and automate the archival process precisely according to our requirements, taking into account the unique interactions with our data lake services and the specific access patterns of our data.

  4. Combining Multiple AWS Services: Another consideration was to architect a solution combining multiple AWS services such as Lambda, Step Functions, and S3 triggers. This approach could potentially provide the automation and flexibility we required but would involve a more complex setup and management.

After a thorough evaluation of these alternatives, we decided that developing a custom archiving tool, tailored to our specific needs, was the most viable solution. This decision was driven by our requirement for a highly customizable, scalable, and cost-effective approach to manage our rapidly growing data efficiently.

Time for a Custom Solution

The Core of Our Solution: The Archive Script

The backbone of our solution is a Python script run on AWS Batch. This script is tailored to meet our specific needs and includes several key features:

  1. AWS Integration: The script uses Boto3, the AWS SDK for Python, to interact directly with S3 and other AWS services.

  2. Efficient Processing: Utilizing ThreadPoolExecutor, the script processes multiple files concurrently, significantly speeding up the archiving process. A simplified sketch of this fan-out pattern follows the listing snippet below.

  3. Robust Logging and Error Handling: Detailed logging and error management are implemented, ensuring that any issues during the archiving process are recorded and can be addressed promptly.

Here’s a snippet from our archive script. It paginates through the objects under a given path in the source bucket:

# Imports and module-level setup assumed by the snippets throughout this post
import datetime
import gzip
import logging
import os
import sys

import boto3
import botocore

logger = logging.getLogger(__name__)
s3 = boto3.client("s3")


def list_s3_objects(source_bucket, path):
    logger.debug(
        f"Starting the listing of S3 objects for bucket: {source_bucket} and path: {path}"
    )
    s3_objects = []
    paginator = s3.get_paginator("list_objects_v2")
    try:
        for page in paginator.paginate(Bucket=source_bucket, Prefix=path):
            for obj in page.get("Contents", []):
                s3_objects.append(obj["Key"])
    except botocore.exceptions.NoCredentialsError:
        logger.critical("No AWS credentials found. Exiting...")
        sys.exit()
    except botocore.exceptions.PartialCredentialsError as e:
        logger.critical(f"Incomplete AWS credentials. Exiting. Details: {e}")
        sys.exit()
    except botocore.exceptions.ClientError as e:
        logger.error(f"Error listing S3 objects for path: {path}. Error: {e}")
    except Exception as e:
        logger.warning(
            f"An unexpected error occurred while listing objects for path: {path}. Error: {e}"
        )

    logger.debug(
        f"Found {len(s3_objects)} objects for bucket: {source_bucket} and path: {path}"
    )
    return s3_objects
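
The concurrent processing mentioned in point 2 above follows a common pattern: list the objects under a prefix, then fan the per-object work out to a pool of threads. The sketch below is a simplified illustration rather than our exact code; archive_object is a hypothetical placeholder for the per-object work (download, compress, upload, record metadata), and the worker count is an arbitrary example.

from concurrent.futures import ThreadPoolExecutor, as_completed


def archive_path_concurrently(source_bucket, path, max_workers=16):
    # List everything under the prefix, then archive the objects in parallel.
    keys = list_s3_objects(source_bucket, path)

    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # archive_object is a placeholder for the per-object work.
        future_to_key = {
            executor.submit(archive_object, source_bucket, key): key for key in keys
        }
        for future in as_completed(future_to_key):
            key = future_to_key[future]
            try:
                results.append(future.result())
            except Exception as e:
                logger.error(f"Archiving failed for {key}. Error: {e}")
    return results

Tuning max_workers is where the AWS API rate limits discussed later come into play: more threads means faster throughput, up to the point where S3 starts throttling.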

Deploying Infrastructure with AWS CDK

Alongside the archive script, we used the AWS Cloud Development Kit (CDK) to define the infrastructure required for the archiving process. The CDK code encapsulates the definition of resources like AWS Batch jobs, S3 buckets, and IAM roles. This approach ensures our infrastructure is consistently deployed, easily repeatable, and kept under version control.

Here’s an excerpt from our CDK setup. We used a Level 3 construct to create the AWS Batch job resources. Under the hood, this ManagedFargateJob construct uses CfnComputeEnvironment, CfnJobQueue, and CfnJobDefinition, along with IAM constructs.

ManagedFargateJob(
    self,
    "ArchiveUtilityBatchJobResource",
    vpc=processing_vpc,
    docker_directory=docker_directory,
    maxv_cpus=batch_job_maxv_cpus,
    comp_env_maxv_cpus=comp_env_maxv_cpus,
    memory=batch_job_memory,
    enable_monitoring=True,
)

Understanding the Data Growth and Archiving Needs

Our data growth rate was not just a challenge but also an opportunity to optimize. We had to consider not only the volume of data but also the types of data, access patterns, and future scalability.

Evaluating the Cost-Benefit Analysis

An essential part of our project was evaluating the trade-off between the cost of issuing many PUT requests ourselves and the cost of letting data lifecycle into an archive tier automatically. This required a detailed analysis of S3 pricing, data retrieval needs, and long-term storage strategies.
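
To make that trade-off concrete, here is a back-of-the-envelope sketch. The prices below are illustrative placeholders, not authoritative AWS list prices (check the current S3 pricing page), and the function and example figures are hypothetical; the point is the shape of the comparison: a one-time request charge for moving objects into an archive class versus the recurring monthly savings from vacating the standard tier.

# Illustrative, assumed prices -- not authoritative AWS pricing.
STANDARD_PER_GB_MONTH = 0.023      # S3 Standard storage
ARCHIVE_PER_GB_MONTH = 0.00099     # deep-archive class storage
PUT_PER_1000_REQUESTS = 0.05       # PUT requests into an archive class


def archiving_breakeven(total_gb, object_count):
    # One-time cost of re-uploading every object into the archive class.
    put_cost = (object_count / 1000) * PUT_PER_1000_REQUESTS
    # Recurring monthly savings from no longer storing the data in Standard.
    monthly_savings = total_gb * (STANDARD_PER_GB_MONTH - ARCHIVE_PER_GB_MONTH)
    return put_cost, monthly_savings, put_cost / monthly_savings


# Example: 100 TB spread across 5 million objects.
put, savings, months = archiving_breakeven(total_gb=100_000, object_count=5_000_000)
print(f"One-time PUT cost: ${put:,.2f}")
print(f"Monthly savings:   ${savings:,.2f}")
print(f"Break-even after:  {months:.2f} months")

Because request charges scale with object count rather than size, combining many small files into fewer compressed objects before upload, as our script does, is a big lever for keeping the one-time cost manageable.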

Building & Deploying the Archiving Tool

The implementation of our archiving strategy involved several critical steps:

  1. Identifying Data for Archival: A critical first step in our archiving process was identifying which data needed to be archived. This was not a matter of selecting files at random; it required a strategic approach that weighed multiple factors to keep our archiving system effective and efficient. Here's how we tackled it:

    1. Developing Predefined Rules: The foundation of our archiving process was a set of predefined rules that we developed. These rules were based on an analysis of our data's characteristics and organizational needs. They included parameters such as file type, age, last accessed date, and frequency of access. By setting these criteria, we could automate the selection of files for archiving, ensuring consistency and adherence to our data management policies.

    2. Analyzing Data Priority: Not all data is created equal in terms of business value. We had to analyze the priority of the data, which involved looking at the size of the files and their importance to our operations. Larger files that were seldom accessed were prime candidates for archiving, as they occupied significant storage space but contributed little to day-to-day operations. Similarly, data critical for business but infrequently used was flagged for archiving, with the provision for easy retrieval when needed.

    3. Data Retention Considerations: An essential aspect of our strategy was determining the data retention period — how long data should be kept active before being archived. This period varied based on the data type and legal or regulatory requirements. For instance, financial records might have a different retention period compared to general operational data. Our script took these retention policies into account, ensuring that data was archived only after it fulfilled its required active lifespan.

    4. Access Patterns and Frequency: We leveraged analytics to understand our data's access patterns and frequency. This analysis helped us identify data that was infrequently accessed or had become stagnant. For example, files that hadn't been accessed for a certain period, say six months, were flagged for archiving. This approach helped in decluttering active storage and reducing costs.

    5. Automating the Selection Process: The final step was to automate the data selection process for archiving. Our script, using the predefined rules and analysis above, automatically scanned our S3 buckets and identified files that met the archiving criteria. This automation was crucial in handling the vast amounts of data we dealt with daily, making the process efficient and scalable. A simplified sketch of such a rule check appears after the compression snippet below.

  2. Data Compression and Transfer: To optimize storage and cost, the data is compressed before being transferred to the archival storage.

def compress_file_with_gzip(input_filename, output_filename):
    logger.info(f"Starting compression for file: {input_filename}")
    try:
        with open(input_filename, "rb") as f_in:
            with gzip.open(output_filename, "wb") as f_out:
                f_out.writelines(f_in)
        logger.info(f"Successfully compressed {input_filename} to {output_filename}")
        logger.debug(f"Removing non-compressed file: {input_filename}")
        try:
            os.remove(input_filename)  # Remove the original combined file
        except Exception as e:
            logger.error(
                f"Error deleting non-compressed file: {input_filename}. Error: {e}"
            )
        compressed_file_size = os.path.getsize(
            output_filename
        )  # Get compressed file size in bytes
        return output_filename, compressed_file_size
    except FileNotFoundError:
        logger.error(f"File {input_filename} not found during compression.")
    except OSError as e:
        logger.error(
            f"Error during compression for file {input_filename}. OS Error: {e}"
        )
    except Exception as e:
        logger.error(
            f"Unexpected error during compression for file {input_filename}. Error: {e}"
        )
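
Before a file ever reaches the compression step above, it has to pass the selection rules from step 1. The sketch below is an illustration rather than our production logic: the thresholds and suffix set are made-up examples, and it operates on the raw entries returned by list_objects_v2 (which include Size and LastModified) rather than the bare keys kept by list_s3_objects.

# Illustrative thresholds -- the real rules were tuned to our data and retention policies.
ARCHIVE_AFTER_DAYS = 180                       # e.g. untouched for roughly six months
ARCHIVE_SUFFIXES = {".json", ".csv", ".parquet"}
MIN_SIZE_BYTES = 1024                          # skip tiny marker/placeholder files


def should_archive(obj, now=None):
    """Decide whether a single object entry from list_objects_v2 meets the archiving rules."""
    now = now or datetime.datetime.now(datetime.timezone.utc)
    key = obj["Key"]
    age_days = (now - obj["LastModified"]).days

    if not any(key.endswith(suffix) for suffix in ARCHIVE_SUFFIXES):
        return False
    if obj["Size"] < MIN_SIZE_BYTES:
        return False
    return age_days >= ARCHIVE_AFTER_DAYS

In practice the age check here uses LastModified as a simple proxy; access-frequency signals derived from analytics would plug into the same predicate.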

Ensuring Data Integrity: Maintaining the integrity of our data during and after the archiving process was a top priority. To achieve this, we implemented a robust system for managing metadata. This approach not only safeguarded the integrity of our data but also facilitated efficient analysis of archived data.

  1. Metadata Retrieval During Archiving: As part of the archiving process, we extracted and stored metadata for each file. This metadata included essential information such as:

    1. File Name and Path: To track the original location and identity of each file.

    2. File Size: To monitor changes in size before and after archiving, ensuring no data loss during compression or transfer.

    3. Timestamps: Including the date and time of the last modification and the archiving date.

  2. Storing Metadata in S3: Alongside the archived data, we stored these metadata files in a separate S3 location. This separation allowed us to maintain a clear distinction between active data and its corresponding metadata, improving organization and accessibility.

  3. Glue Crawlers for Metadata Cataloging: To efficiently manage and query the metadata, we utilized AWS Glue crawlers. These crawlers scanned the metadata files stored in S3, creating a metadata catalog that was queryable and easily accessible.

  4. Athena Queries on Metadata: By leveraging Amazon Athena, we were able to run SQL queries on the metadata catalog created by Glue. This capability enabled us to quickly and efficiently find information about the archived data, such as identifying specific files based on date, size, type, or custom tags. A sketch of such a query, issued through Boto3, follows the metadata snippet below.

def create_metadata(paths: list, archive_path: str):
    processing_day = datetime.datetime.utcnow()
    year, month, day = (
        str(processing_day.year),
        str(processing_day.month),
        str(processing_day.day),
    )
    try:
        metadata = {
            "archiveyear": year,
            "archivemonth": month,
            "archiveday": day,
            "sourceAccount": account,
            "sourceBucket": source_bucket_name,
            "sourcePrefix": s3_prefix_to_archive,
            "fileFormat": find_first_file_suffix(
                [list_s3_objects(source_bucket_name, paths[0])]
            ),
            "uncompressSize": MAX_FILE_SIZE,
            "s3Location": f"s3://{archive_bucket_name}/{archive_path}",
            "database": database,
            "archiveFilesDetails": "",
        }
        logger.debug(f"Generated metadata: {metadata}")
        return metadata
    except Exception as e:
        logger.error(f"Unexpected error while creating metadata. Error: {e}")

Results and Benefits

Our custom archiving solution resulted in several key benefits:

  1. Cost Efficiency: By optimizing the archival process, we significantly reduced our storage costs.

  2. Improved Data Management: With better control over our data, we could manage and retrieve it more efficiently.

  3. Scalability: The solution was designed to scale with our data growth, ensuring its long-term viability.

Challenges and Lessons

Throughout the development and implementation of our custom archiving solution, we encountered a range of challenges that provided valuable lessons. These challenges not only tested our technical skills but also our ability to adapt and innovate under complex and evolving requirements.

  1. Balancing Cost and Performance: Finding the optimal balance between cost-saving and performance was a continuous process that required careful analysis and fine-tuning. We needed to ensure that our solution was cost-effective without compromising on the efficiency and speed of the archiving process. This meant closely monitoring our AWS usage and costs, and making iterative adjustments to strike the right balance.

  2. Complex Data Patterns: Dealing with complex and varied data patterns posed a significant challenge. Understanding and predicting our data access and growth patterns required extensive data analysis. We had to consider factors like file sizes, types, access frequencies, and potential future data trends. This process involved a mix of data analytics and, at times, trial and error to establish the most effective archiving rules.

  3. Writing the Archive Script: Developing the archive script was a complex task, as it had to be efficient, reliable, and scalable. The script needed to handle large volumes of data while interacting seamlessly with AWS services. Writing and refining this script involved deep dives into Python programming, Boto3 library, and AWS SDKs, ensuring that the script was optimized for performance and reliability.

  4. Balancing Multithreading and AWS API Limits: One of the technical challenges we faced was finding the right balance in multithreading, especially considering AWS API rate limits. We had to ensure that our script did not overwhelm the AWS services with too many concurrent requests, which could lead to throttling or API limits being hit. This required careful tuning of the ThreadPoolExecutor in our Python script and sometimes implementing backoff strategies to gracefully handle rate limiting. A minimal sketch of such a backoff wrapper follows this list.

  5. Managing Compute Resources: Determining the right amount of memory and compute power for our archiving tasks was a critical aspect. We needed to ensure that the compute resources were sufficient to handle the data load and processing requirements without incurring unnecessary costs. This involved experimenting with different AWS EC2 instance types and configurations to find a setup that provided the necessary performance at an optimal cost.

  6. Recording and Tracking Metadata: Properly recording the metadata and keeping track of the archived data was essential for data retrieval and integrity. We developed a system to meticulously log metadata, including file names, sizes, archival dates, and locations. This aspect was crucial for maintaining data integrity and ensuring that we could efficiently locate and retrieve data when needed.

  7. Unarchiving Strategy: While our focus was on archiving data, we also had to consider how to retrieve or unarchive data if required. This meant implementing a retrieval process that was both efficient and reliable. We had to design a system that could quickly locate and restore data from the archives without disrupting the integrity of the data or the efficiency of the storage system.
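
As a concrete illustration of the throttling point in item 4, here is a minimal retry-with-backoff wrapper of the kind we mean; it is a sketch rather than the exact code from our script. Boto3 also ships built-in retry configuration (for example the "adaptive" retry mode in botocore's Config), which can reduce the need for hand-rolled backoff.

import random
import time


def with_backoff(call, max_attempts=5, base_delay=1.0):
    """Run a Boto3 call, retrying with exponential backoff and jitter when throttled."""
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except botocore.exceptions.ClientError as e:
            code = e.response.get("Error", {}).get("Code", "")
            # Only retry throttling-style errors; re-raise everything else.
            if code not in ("SlowDown", "Throttling", "ThrottlingException"):
                raise
            if attempt == max_attempts:
                raise
            delay = base_delay * (2 ** (attempt - 1)) + random.uniform(0, 1)
            logger.warning(f"Throttled (attempt {attempt}), retrying in {delay:.1f}s")
            time.sleep(delay)


# Example: wrap an S3 call that might be throttled under heavy concurrency.
# with_backoff(lambda: s3.head_object(Bucket=source_bucket_name, Key=key))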

Each of these challenges brought its own set of learnings and opportunities for improvement. By addressing these challenges head-on, we were able to develop a more robust, efficient, and scalable archiving solution. These experiences have not only enhanced our technical capabilities but also provided us with deeper insights into cloud resource management, cost optimization, and data handling at scale.

Conclusion

In conclusion, our journey to developing a custom archiving solution for our AWS S3 data was both challenging and rewarding. It not only solved our immediate problem of managing massive data volumes but also provided us with valuable insights into efficient data management practices.

Cullan Carey

My experience in Amazon Web Services spans various areas, including designing and implementing cloud-based data processing architectures, managing AWS resources, and optimizing infrastructure for cost and performance. I am well-versed in Python, GitHub Actions, and the AWS CDK, leveraging these tools to create and manage infrastructure-as-code solutions specifically for AWS.

https://github.com/cullancarey