
The 5-Second Data Delay That Almost Broke Us (and How We Got Back Down to Milliseconds)

Less than a month ago, our team took on a challenge that seemed simple enough. The assignment was to build a real-time transaction monitoring dashboard for a financial platform we maintained. On paper, the plan was clear – event data would land in DynamoDB, and OpenSearch would power the real-time analytics.

The expectation was simple: the moment an event was written to DynamoDB, it should be available for analysis in OpenSearch. No delay. No waiting.

This is where we were completely wrong.

When “real time” isn’t actually real time

I clearly remember our CTO’s reaction during the first demo. “Why is there such a delay?” he asked, pointing at a dashboard that was lagging by nearly 5 seconds.

We had promised sub-second latency. What we delivered was a system that, under traffic spikes, lagged by 3-5 seconds. Sometimes worse. In financial monitoring, that kind of delay might as well be hours. Proactive or reactive? Detecting a failure the moment it happens versus finding out from customer complaints? That is the real difference.

Something had to change, and it had to change quickly.

Our first attempt: the traditional batch approach

Like plenty of teams before us, we started from familiar territory (a rough sketch of this poller follows the list):

  1. Schedule an AWS Lambda function to run every 5 seconds.
  2. Have it fetch the new records written to DynamoDB.
  3. Collect those updates and assemble them into batches.
  4. Push the batches to OpenSearch for indexing.
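
For reference, here is a minimal sketch of what that scheduled poller might look like – the table, index, and attribute names match the later examples, but the cutoff logic and the OpenSearch endpoint are assumptions rather than our original code:

import json
import time

import boto3
import requests

dynamodb = boto3.client("dynamodb")

OPENSEARCH_URL = "https://<your-opensearch-endpoint>"  # placeholder endpoint
INDEX_NAME = "metrics"

def lambda_handler(event, context):
    # Look back one polling window (5 seconds) for newly written items.
    cutoff = int(time.time()) - 5
    response = dynamodb.scan(
        TableName="RealTimeMetrics",
        FilterExpression="#ts >= :cutoff",
        ExpressionAttributeNames={"#ts": "timestamp"},
        ExpressionAttributeValues={":cutoff": {"N": str(cutoff)}},
    )

    # Assemble a bulk payload and push everything to OpenSearch in one request.
    bulk_lines = []
    for item in response.get("Items", []):
        doc_id = item["id"]["S"]
        bulk_lines.append(json.dumps({"index": {"_index": INDEX_NAME, "_id": doc_id}}))
        bulk_lines.append(json.dumps({
            "id": doc_id,
            "timestamp": item["timestamp"]["N"],
            "metric": item["metric"]["S"],
            "value": float(item["value"]["N"]),
        }))

    if bulk_lines:
        requests.post(f"{OPENSEARCH_URL}/_bulk",
                      headers={"Content-Type": "application/x-ndjson"},
                      data="\n".join(bulk_lines) + "\n")

    return {"statusCode": 200}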

It worked. Sort of. Provided you treat “worked” and “reliable” as very loose terms.

Things broke down in impressively varied ways:

  • Built-in ingestion delay: for up to 5 seconds, new data simply sat in DynamoDB waiting for the next poll before it could be transferred.

  • Indexing latency: bulk indexing inside OpenSearch caused long query delays, making the dashboard increasingly sluggish.

  • Reliability issues: a failure in the middle of a batch meant that none of the updates in that batch made it through.

In one particularly unsettling incident, our system missed critical events because they were caught up in a failed batch. By the time the problem was diagnosed, thousands of transactions had already flowed through the system.

“For the love of God, this is not sustainable,” a teammate said after reading yet another incident report. “We need to fundamentally change the system and stream these updates live instead of batching them.”

They were absolutely right.

The solution: streaming updates as they happen

I was digging through AWS documentation late one night when the solution hit me: DynamoDB Streams.

What if, instead of pulling updates in scheduled batches, we could capture and process every change made to the DynamoDB table the moment it happened?

This completely changed our approach, for the better:

  • Enable DynamoDB Streams to capture every insert, modify, and remove on the table

  • Attach AWS Lambda functions that process those changes immediately

  • Apply some light processing to the data and push the updates to OpenSearch

In my initial tests, the results were incredible. Latency dropped from 3-5 seconds to under 500 milliseconds. I will never forget the Slack message I sent to the team at three in the morning: “I think we’ve cracked it.”

The implementation

This wasn’t just a software engineering task or a design exercise – we had to prove the concept pipeline actually worked. During one of many sleepless, coffee-fueled nights, we broke the problem down into three implementation steps. The first was getting notified about changes in DynamoDB.

Getting notified about changes in DynamoDB

The first challenge: how would we even know that something had changed in DynamoDB? After some Googling, I discovered we needed to enable DynamoDB Streams. It turned out to be a one-liner, although it caused me some pain at first.

aws dynamodb update-table \
    --table-name RealTimeMetrics \
    --stream-specification StreamEnabled=true,StreamViewType=NEW_IMAGE

I still remember how excited I was when I told a colleague at 11 pm that I had gotten this working: “It works! The table is streaming every change that goes through it!”
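
One detail the CLI command above doesn’t cover: enabling the stream doesn’t connect it to anything by itself. The Lambda function described in the next section still has to be subscribed to the stream through an event source mapping. Here is a minimal sketch of that wiring with boto3, where the function name and stream ARN are placeholders:

import boto3

lambda_client = boto3.client("lambda")

# Subscribe the listener Lambda to the table's stream.
# Both identifiers below are placeholders, not real resources.
lambda_client.create_event_source_mapping(
    EventSourceArn="arn:aws:dynamodb:<region>:<account-id>:table/RealTimeMetrics/stream/<label>",
    FunctionName="realtime-metrics-listener",
    StartingPosition="LATEST",
    BatchSize=100,
)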

Building the change listener

Now, with streams enabled, we needed something to catch those events. So we built a Lambda function to act as our change listener: it waits for DynamoDB to report a change, and the moment one arrives, it springs into action.

import json
import boto3
import requests

OPENSEARCH_URL = "https://<your-opensearch-endpoint>"  # replace with your OpenSearch domain endpoint
INDEX_NAME = "metrics"
HEADERS = {"Content-Type": "application/json"}

def lambda_handler(event, context):
    records = event.get("Records", [])

    for record in records:
        if record["eventName"] in ["INSERT", "MODIFY"]:
            new_data = record["dynamodb"]["NewImage"]
            doc_id = new_data["id"]["S"]  # Primary key
              
            # Convert DynamoDB format to JSON
            doc_body = {
                "id": doc_id,
                "timestamp": new_data["timestamp"]["N"],
                "metric": new_data["metric"]["S"],
                "value": float(new_data["value"]["N"]),
            }

            # Send update to OpenSearch
            response = requests.put(f"{OPENSEARCH_URL}/{INDEX_NAME}/_doc/{doc_id}", 
                                   headers=HEADERS, 
                                   json=doc_body)
            print(f"Indexed {doc_id}: {response.status_code}")

    return {"statusCode": 200, "body": json.dumps("Processed Successfully")}

It looks simple now, but the code was not simple to write – it took three attempts to get it running. Our first version kept crashing with timeouts because we were ignoring the attribute-value format DynamoDB uses in its stream records.
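
If you’d rather not unpack that format by hand, boto3 also ships a TypeDeserializer that converts a stream image into plain Python types. A small helper along these lines (not part of our original handler):

from boto3.dynamodb.types import TypeDeserializer

deserializer = TypeDeserializer()

def unmarshal(image):
    # Turn a stream image like {"value": {"N": "42.5"}} into {"value": Decimal("42.5")}.
    # Note that numbers come back as Decimal, so cast them before JSON-encoding.
    return {key: deserializer.deserialize(value) for key, value in image.items()}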

Teaching OpenSearch to keep up

This last issue was the trickiest to solve, and the most maddening. Even though we were pushing updates to OpenSearch immediately, they still weren’t searchable in real time. It turns out OpenSearch batches newly indexed documents on its own refresh cycle.

“This makes no sense,” my teammate complained. “We’re sending the data in real time and it’s not showing up in real time!”

curl -X PUT "https://<your-opensearch-endpoint>/metrics/_settings" \
    -H 'Content-Type: application/json' \
    -d '{ "index": { "refresh_interval": "500ms", "number_of_replicas": 1 } }'

After a little research and some trial and error, we pinned down the setting we needed to change. That one tweak made a huge difference: OpenSearch now made new data searchable within half a second instead of waiting for its default refresh cycle. I nearly jumped out of my seat the first time I watched an event appear on our dashboard moments after it was written to DynamoDB.

The results: from seconds to milliseconds

The first week with the new system in production spoke for itself. The dashboard was no longer a lagging view of the past – it was finally a service running at full capacity.

We have achieved:

  • Average end-to-end latency under 500 milliseconds (down from 3-5 seconds)

  • No more batching delay – changes propagated immediately

  • Lighter indexing load – smaller, more frequent updates were far more efficient

  • Better overall system resilience – no more “all or nothing” batch failures

When we shared the updated dashboard with leadership, they noticed the change immediately. The difference was obvious. Our CTO remarked, “This is what we needed from the beginning.”

Scaling further: handling traffic spikes

Although the new approach held up fine under regular traffic, it struggled during extreme usage spikes. During end-of-day reconciliation periods, the event rate would jump from hundreds to thousands per second.

To absorb this, we added Amazon Kinesis Data Firehose as a buffer. Instead of having Lambda send every update directly to OpenSearch, we changed it to stream the data to Firehose:

firehose_client = boto3.client("firehose")

firehose_client.put_record(
    DeliveryStreamName="MetricsStream",
    Record={"Data": json.dumps(doc_body)}
)

Firehose took care of delivery to OpenSearch, scaling automatically with throughput demands without sacrificing the real-time characteristics of the pipeline.
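
Since each Lambda invocation can carry many stream records during a spike, it also helps to hand them to Firehose in one call instead of one at a time. A sketch of that batched variant, reusing the stream name from the snippet above (the helper itself is illustrative, not our production code):

import json

import boto3

firehose_client = boto3.client("firehose")

def flush_to_firehose(doc_bodies):
    # PutRecordBatch accepts up to 500 records per call.
    for start in range(0, len(doc_bodies), 500):
        chunk = doc_bodies[start:start + 500]
        firehose_client.put_record_batch(
            DeliveryStreamName="MetricsStream",
            Records=[{"Data": json.dumps(body)} for body in chunk],
        )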

Lessons learned: the hunt for speed never ends

We learned that with real-time data systems, driving down latency is a never-ending battle. Now that we have reached 500 milliseconds, we are pushing even harder:

  • Exploring OpenSearch ingestion pipelines instead of Lambda

  • Using AWS ElastiCache for faster retrieval of frequent queries (a rough sketch follows this list)

  • Considering edge computing for users spread around the world
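
For the ElastiCache item above, the rough shape is a short-lived cache in front of the hottest dashboard queries. A sketch with redis-py, where both endpoints are placeholders and the 2-second TTL is only a starting guess:

import json

import redis
import requests

# Both endpoints below are placeholders.
cache = redis.Redis(host="<elasticache-endpoint>", port=6379)
OPENSEARCH_URL = "https://<your-opensearch-endpoint>"

def cached_search(query_body, ttl_seconds=2):
    # Serve hot dashboard queries from ElastiCache, falling back to OpenSearch.
    key = "metrics:" + json.dumps(query_body, sort_keys=True)
    hit = cache.get(key)
    if hit:
        return json.loads(hit)

    response = requests.post(f"{OPENSEARCH_URL}/metrics/_search",
                             headers={"Content-Type": "application/json"},
                             json=query_body)
    cache.setex(key, ttl_seconds, response.text)
    return response.json()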

In financial monitoring, every microsecond counts. As one of our senior engineers puts it, “Monitoring is not a forgiving place. You’re either ahead of the problem or behind it. There’s no in-between.”

What have you tried?

Have you tackled real-time data problems of your own? What ended up working for your team? I would love to hear about your adventures on the bleeding edge with AWS.
