Switching from Lambda to Glue for $60k in monthly savings

Background: A start-up landed a lucrative contract for a technology product designed to optimize advertising dollars. Ads are often cast like a wide net, capturing only a small fraction of interested viewers; this product, however, was sold with a guarantee that marketing messages would reach their intended audience.

In designing the product, the in-house engineers wrote code sized to the customer's data traffic at the time, but traffic grew exponentially within a short period. Months after launch, the in-house data engineers hit a brick wall: soaring operational costs and frequent crashes revealed that the pipeline was failing. The tools they had chosen could not handle the customer's growing data volume, and the ad-tech company reached out to Big Cloud Country for a brand-new pipeline.

Objective: The ad-tech company needed a big data pipeline that could efficiently and reliably process billions of data points each day while reducing processing costs.

Challenge: The original pipeline, built for proof-of-concept purposes, used AWS Lambda, Athena, and S3. These misaligned tools, faltering under billions of data points ingested each day, crashed frequently and ran up a gut-busting $100,000 in monthly AWS costs. In effect, the pipeline ran on a dirt road when it needed a superhighway.

  1. Lambda's pay-as-you-go model handled the early, lighter data traffic, but could not scale with demand that grew exponentially month after month. Costs climbed from $40,000 to $100,000 per month.

  2. Athena, designed for interactive queries, was an unconventional choice for production ETL on large files.

  3. The pipeline stored files in S3 as JSON, a verbose row-based format that inflated both the storage footprint and storage costs.

Solution (V2):

  1. Fixed pricing: Given the mismatch between needs and tools, Big Cloud Country rebuilt the data pipeline on AWS Glue, a serverless service. Because Glue is serverless, there was no infrastructure to manage, and the load on the system was no longer a source of failures. Its pricing was also fixed, so additional data processing did not inflate operational costs. Together with the changes below, Glue eliminated the need for Athena and shrank the S3 footprint, further reducing costs.

  2. Streamlined processing: PySpark replaced SQL for distributed processing because it can spread large datasets across the nodes of a cluster, or across multiple clusters, and process them in parallel. The new PySpark pipeline also ran alongside V1 before V1 was retired.

  3. Appropriate file storage: Parquet, a columnar storage format optimized for big data processing tools, replaced JSON as the file format, reducing the overall size of the data lake. (A minimal sketch of the combined approach follows this list.)
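
To make the approach concrete, here is a minimal sketch of what a Glue job along these lines can look like: a PySpark script that reads raw JSON from S3, applies a simple transformation, and writes partitioned Parquet back to the data lake. The bucket paths, column names, and filter below are hypothetical placeholders, not the customer's actual pipeline or schema.

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

# Hypothetical S3 locations -- not the customer's actual buckets.
RAW_JSON_PATH = "s3://example-raw-events/json/"
PARQUET_PATH = "s3://example-data-lake/events/"

# Standard Glue job setup: resolve the job name and create the contexts.
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the raw JSON events; Spark distributes the files across the cluster.
events = spark.read.json(RAW_JSON_PATH)

# Example transformation (hypothetical column name): keep delivered impressions only.
delivered = events.filter(events["status"] == "delivered")

# Write columnar Parquet, partitioned by event date so downstream jobs scan less data.
delivered.write.mode("append").partitionBy("event_date").parquet(PARQUET_PATH)

job.commit()
```

Running the old and new pipelines side by side, as described above, simply means scheduling a job like this against the same input while V1 keeps serving production, then comparing outputs before cutting over.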

Timeline: Building the replacement pipeline from the ground up took about two months. The build ran in parallel with the existing pipeline to allow for a seamless transition; once the new system was production-ready, it fully replaced the first version.

Benefits: Matching the needs of a big data lake to the appropriately sized tool, AWS Glue, was like paving a superhighway where a dirt road previously existed. This wasn't a minor upgrade; it was a complete redesign. The data became easier to manipulate and flowed more smoothly than ever before. But the real headline was the money saved: the switch delivered monthly savings of $60,000, slashing the previous costs by 60%. The customer reported that this operational change saved their company from operating in the red.

Next Steps: Reach out for a complimentary consultation with Big Cloud Country. Our data architect can provide insight into whether your data tools are appropriate for your needs. Our team is pleased to help your company reduce operational costs, streamline data, capture data insights, transition to a serverless data pipeline, automate data quality monitoring, and pursue any other AWS-related goals your company may have.
