Analyzing one billion records per day with Kinesis and Glue

Background: A small tech company scrambled to keep its Fortune 100 customer happy when trouble plagued its data pipeline during the holiday season. The product worked on the front end, but its back-end record keeping began to fray when the pipeline became overloaded. For every completed service, the pipeline received a couple of records per day, each carrying 30 unique attributes. These were originally handled on AWS Lambda, but as customer traffic grew, the data needs exceeded Lambda's capabilities. In a short time, the number of records flooding the system exceeded 1 billion per day, triggering the company's crisis.

Challenge: Days before Christmas, the entire data team quit, leaving behind a broken data pipeline and no one to hand it off to. The tech startup turned to Big Cloud Country and asked for a brand-new big data platform, ASAP, capable of processing billions of records per day.

Unscalable The original data pipeline was built for a much smaller workload on AWS Lambda, a service designed for running short-lived backend functions and automating tasks in a serverless environment, not for long-running data processing jobs. Under this data load, Lambda was destined to fail as records flooded in faster than the functions could process them.

Expensive Storing the data in memory with a Redis cache seemed like an understandable choice for fast, efficient retrieval. But that speed came at a premium, particularly at large data volumes. Within months, the massive volume of records, along with the high-capacity RAM and optimized hardware needed to hold them, pushed infrastructure costs beyond acceptable limits for the pipeline.

Dirty Data The data flowing into the pipeline was a mix of clean and "messy" records, with inaccuracies, missing values, and inconsistent formats that often caused critical information to be misplaced. Because of the high volume, decisions about data validation had to be made quickly, so the pipeline needed well-defined rules and automated mechanisms for validating and deleting data. Those rules were essential not only for maintaining accuracy but also for keeping the pipeline running effectively and making sound, data-driven decisions in real time.

Duplication Aside from being costly, the cache system had other drawbacks: limited storage space, performance problems, and poor ability to scale with growing data demands. As a result, around 5-10% of the data was duplicated, which made the information less reliable. The company chose to include all of the data in its reports but set a very low error tolerance, discarding only 1% of the data. That stringent tolerance required additional methods within the Amazon Web Services (AWS) framework to effectively manage and track the data.

Solution:

Big Cloud Country went straight to work building a brand-new pipeline from the ground up. The data architects kept the existing pipeline running while the new one was being built, and Version 2 rolled out 60 days later with the following custom tech stack: CloudFront, Application Load Balancer, Elastic Container Service, the Kinesis Producer Library, Kinesis Data Firehose, S3, Glue, Athena, Step Functions, EventBridge, Lambda, and CloudWatch. Each of these services was carefully chosen to match its processing strengths to the customer's needs.
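On the ingestion side, producers running in ECS push records through the Kinesis Producer Library into Kinesis Data Firehose, which lands them in S3 for Glue to process. As a rough illustration of that flow, the sketch below uses boto3 in place of the Java-based KPL; the stream name, record shape, and retry logic are assumptions, not the production implementation.

    import json
    import boto3

    # Hypothetical delivery stream name; the production pipeline publishes via
    # the Kinesis Producer Library running in ECS behind the load balancer.
    STREAM_NAME = "service-records-firehose"

    firehose = boto3.client("firehose")

    def send_batch(records):
        """Send up to 500 records to Kinesis Data Firehose in one API call."""
        entries = [{"Data": (json.dumps(r) + "\n").encode("utf-8")} for r in records]
        response = firehose.put_record_batch(
            DeliveryStreamName=STREAM_NAME,
            Records=entries,
        )
        # Firehose reports per-record failures; retry anything that did not land.
        if response["FailedPutCount"] > 0:
            failed = [
                records[i]
                for i, result in enumerate(response["RequestResponses"])
                if "ErrorCode" in result
            ]
            send_batch(failed)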

Scalable To match the pipeline's big data needs, the architects eliminated Lambda and replaced it with Apache Spark running on AWS Glue. Spark is a distributed data processing framework designed to handle large-scale processing efficiently: it scales horizontally across a cluster of machines, allowing it to handle massive datasets and distribute the load, and it is optimized for in-memory processing, which significantly speeds up data processing tasks.
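To make that concrete, here is a minimal sketch of a Glue PySpark job of the kind this pipeline relies on; the S3 paths, the service_id column, and the aggregation are illustrative assumptions rather than the customer's actual job.

    import sys
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glue_context = GlueContext(SparkContext())
    spark = glue_context.spark_session
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # Read the raw JSON records that Firehose landed in S3 (paths are assumptions).
    raw = spark.read.json("s3://example-raw-bucket/records/")

    # Spark distributes the work across the Glue cluster, so the same code keeps
    # up with a billion records per day by adding workers rather than rewriting.
    daily_counts = raw.groupBy("service_id").count()

    daily_counts.write.mode("overwrite").parquet("s3://example-curated-bucket/daily_counts/")

    job.commit()

Because Glue provisions and scales the Spark cluster, capacity can grow with data volume instead of running into the hard limits Lambda imposed.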

Cost Effective Employing Spark also eliminated the need for a Redis cache to retrieve data in memory. Spark can process data in memory as well, but at a much larger scale, and it could be used for tasks like batch-processing lookups to enrich the incoming records.
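As one hedged example of that pattern, a small reference table can be broadcast to every executor and joined against the incoming records, replacing per-record cache reads with a single in-memory batch lookup; the table names, paths, and join key below are assumptions.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("enrich-records").getOrCreate()

    # Incoming records plus a small reference table (paths and columns assumed).
    records = spark.read.parquet("s3://example-raw-bucket/records/")
    catalog = spark.read.parquet("s3://example-reference-bucket/service_catalog/")

    # Broadcasting keeps the lookup table in memory on every executor, so the
    # enrichment that once required Redis reads happens inside one batch join.
    enriched = records.join(F.broadcast(catalog), on="service_id", how="left")

    enriched.write.mode("append").parquet("s3://example-curated-bucket/enriched/")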

High quality data After consulting the domain experts, BCC's architects chose PySpark to solve for data quality. PySpark brought a dynamic dimension to data management by allowing real-time adjustments midstream: it identified discrepancies, inaccuracies, and irregularities as data flowed through the pipeline and corrected them whenever feasible. Where data integrity was at risk, the job discarded or excluded the problematic records, upholding the reliability and consistency of the overall dataset. This approach not only met the demand for clean data but also fortified the pipeline with agile, automated quality control.
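A minimal sketch of what such a PySpark quality step might look like appears below; the column names, the normalizations, and the allowed status values are assumptions used for illustration only.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("data-quality").getOrCreate()

    # Enriched records waiting to be validated (path and schema are assumptions).
    enriched = spark.read.parquet("s3://example-curated-bucket/enriched/")

    # Repair what can be repaired midstream: normalize casing and timestamps.
    cleaned = (
        enriched
        .withColumn("status", F.upper(F.trim(F.col("status"))))
        .withColumn("event_time", F.to_timestamp("event_time"))
    )

    # Discard records whose integrity cannot be restored.
    valid = cleaned.filter(
        F.col("record_id").isNotNull()
        & F.col("event_time").isNotNull()
        & F.col("status").isin("COMPLETED", "CANCELLED", "FAILED")
    )

    # Keep the rejects for auditing rather than deleting them silently.
    rejected = cleaned.subtract(valid)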

Deduplication Instead of using the cache, the designers compared duplicate records within a 24-hour window, accepting that some duplicates could slip through at the start and end of each window. This approach proved significantly more cost-effective, faster, and more accurate than the previous solution, which frequently failed to complete the deduplication process at all. As a result, the duplicate rate fell to a range of 1-5%.
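The sketch below shows one way such a windowed deduplication could look in PySpark, including a duplicate-rate check against the error budget; the paths, the record_id key, and the daily partition layout are assumptions.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("daily-dedup").getOrCreate()

    # One 24-hour batch of validated records (path and key column are assumptions).
    day = spark.read.parquet("s3://example-curated-bucket/valid/date=2023-12-20/")

    # Deduplicate on the record key within the daily batch. Duplicates that
    # straddle the window boundaries fall into different batches and are
    # deliberately tolerated; that trade-off is what keeps this cheaper and
    # faster than maintaining a global cache.
    deduped = day.dropDuplicates(["record_id"])

    # Track the duplicate rate so it can be reported against the error budget.
    total = day.count()
    rate = (total - deduped.count()) / total if total else 0.0
    print(f"duplicate rate: {rate:.2%}")

    deduped.write.mode("overwrite").parquet(
        "s3://example-curated-bucket/deduped/date=2023-12-20/"
    )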

Benefits: The new customized pipeline yielded remarkable results, significantly enhancing the data management process. Beyond running smoothly and efficiently, it delivered scalable data handling, deduplication, and the high threshold for accuracy the customer required. Perhaps its most remarkable achievement was the cost savings: operational expenses fell by 50 percent, more than $60,000 per month, improving the company's bottom line and allowing a more strategic allocation of resources.

Next Steps: Reach out for a complimentary consultation with Big Cloud Country. Our data architects can provide insight into whether your data tools are appropriate for your needs. Our team is pleased to help your company reduce operational costs, streamline data, capture data insights, transition to a serverless data pipeline, automate data quality monitoring, and meet any other goals your company may have related to AWS.
