Monitor AWS ECS agent and restart automatically on failure


Amazon Elastic Container Service (Amazon ECS) is a highly scalable, fast, container management service that makes it easy to run, stop, and manage Docker containers on a cluster. Amazon ECS lets you launch and stop container-based applications with simple API calls, allows you to get the state of your cluster from a centralized service, and gives you access to many familiar Amazon EC2 features.

You can use Amazon ECS to schedule the placement of containers across your cluster based on your resource needs, isolation policies, and availability requirements. Amazon ECS eliminates the need for you to operate your own cluster management and configuration management systems or worry about scaling your management infrastructure.

Amazon ECS can be used to create a consistent deployment and build experience, manage and scale batch and Extract-Transform-Load (ETL) workloads, and build sophisticated application architectures on a microservices model.

In AWS ECS, you create a task definition to define container configuration such as memory, CPU, environment variables, and mount points, and you create services to run and scale Docker containers.

It is sometimes necessary to monitor the ECS agent on each node. Consider what happens if the ECS service cannot communicate with the ECS agent on an instance: scheduling on that instance fails, and the service cannot get the status of the containers already running there.


Steps to set up the monitoring script on ECS nodes:

1. Set up an SNS topic

In the AWS console, create an SNS topic and add the notification email address as a subscriber. Then confirm the subscription using the confirmation email sent by the SNS service.
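
If you prefer the command line to the console, the same setup can be sketched with the AWS CLI. This is a minimal example under stated assumptions: the function name create_alert_topic, the topic name, and the email address are hypothetical, and the region defaults to ap-southeast-1 only because the policies in this article use it. The subscription still has to be confirmed from the email.

```shell
# Hypothetical helper: create the SNS topic and subscribe an email address.
# Prints the topic ARN on success.
create_alert_topic() {
  local topic_name="$1" email="$2" region="${3:-ap-southeast-1}" topic_arn

  # create-topic is idempotent: it returns the existing ARN if the topic exists.
  topic_arn="$(aws sns create-topic --region "$region" --name "$topic_name" \
    --query 'TopicArn' --output text)" || return 1

  # The subscription stays pending until the recipient confirms it by email.
  aws sns subscribe --region "$region" --topic-arn "$topic_arn" \
    --protocol email --notification-endpoint "$email" >/dev/null || return 1

  echo "$topic_arn"
}

# Usage (placeholder values):
#   create_alert_topic ecs-agent-alerts ops@example.com
```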

2. Install AWS CLI

Our script uses the AWS CLI to query ECS for the container instance ARNs and the agent status, through the aws ecs commands. On Debian and Ubuntu the package is named awscli:

# apt install -y awscli

3. Set up IAM policies for SNS and ECS

AWS SNS IAM Policy: The policy below allows the IAM instance role to publish messages to the SNS topic created earlier, so that we are notified when the agent fails.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "Stmt1460976768000",
            "Effect": "Allow",
            "Action": [
                "sns:GetEndpointAttributes",
                "sns:GetPlatformApplicationAttributes",
                "sns:GetSubscriptionAttributes",
                "sns:GetTopicAttributes",
                "sns:ListEndpointsByPlatformApplication",
                "sns:ListPlatformApplications",
                "sns:ListSubscriptions",
                "sns:ListSubscriptionsByTopic",
                "sns:ListTopics",
                "sns:Publish"
            ],
            "Resource": [
                "arn:aws:sns:ap-southeast-1:<aws-account-id>:<topic-name>"
            ]
        }
    ]
}

AWS ECS IAM Policy: The policy below allows the IAM instance role to query the AWS ECS API to list container instances and check the agent connectivity status.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "Stmt1460960788000",
            "Effect": "Allow",
            "Action": [
                "ecs:DescribeClusters",
                "ecs:DescribeContainerInstances",
                "ecs:DescribeServices",
                "ecs:DescribeTaskDefinition",
                "ecs:DescribeTasks",
                "ecs:DiscoverPollEndpoint",
                "ecs:ListClusters",
                "ecs:ListContainerInstances",
                "ecs:ListServices",
                "ecs:ListTaskDefinitionFamilies",
                "ecs:ListTaskDefinitions",
                "ecs:ListTasks",
                "ecs:Poll"
            ],
            "Resource": [
                "arn:aws:ecs:ap-southeast-1:<aws-account-id>:cluster/<cluster-name>"
            ]
        }
    ]
}
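
One way to attach both policies to the instance role is as inline policies through the AWS CLI. The sketch below assumes the two JSON documents above have been saved to local files. The function name, the role name, the file names, and the policy names ecs-agent-monitor-sns and ecs-agent-monitor-ecs are all hypothetical.

```shell
# Hypothetical helper: attach the SNS and ECS policies to an instance role
# as inline policies. Arguments: role name, SNS policy file, ECS policy file.
attach_monitor_policies() {
  local role_name="$1" sns_policy_file="$2" ecs_policy_file="$3"

  aws iam put-role-policy --role-name "$role_name" \
    --policy-name ecs-agent-monitor-sns \
    --policy-document "file://$sns_policy_file" || return 1

  aws iam put-role-policy --role-name "$role_name" \
    --policy-name ecs-agent-monitor-ecs \
    --policy-document "file://$ecs_policy_file"
}

# Usage (placeholder values):
#   attach_monitor_policies my-ecs-instance-role sns-policy.json ecs-policy.json
```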

4. Monitoring Script

The script checks the ECS agent's connectivity with the ECS service. It first collects the ARNs of all container instances in the cluster and reads the current instance ID from the instance metadata. It then looks up the container instance that corresponds to the current instance and checks its agent status. If the ECS agent on the current instance is disconnected, the script sends a notification and restarts the ecs service on the instance.
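
The original script is not reproduced here, so the following is only a minimal sketch of the steps just described, not the author's implementation. Assumptions: it is saved as /home/ec2-user/monitor_agent.sh and runs as root; CLUSTER and TOPIC_ARN are placeholders you must fill in; the instance ID is read through IMDSv2; the agent is restarted with systemctl restart ecs, which may differ on your AMI; and the cluster has at most 100 container instances, the limit for a single describe call.

```shell
#!/usr/bin/env bash
# monitor_agent.sh: illustrative sketch. Placeholders below are assumptions.
REGION="${REGION:-ap-southeast-1}"
CLUSTER="${CLUSTER:-<cluster-name>}"
TOPIC_ARN="${TOPIC_ARN:-arn:aws:sns:ap-southeast-1:<aws-account-id>:<topic-name>}"
IMDS="http://169.254.169.254/latest"

log() { echo "$(date '+%Y-%m-%d %H:%M:%S') $*"; }

# Print this node's EC2 instance ID, read from instance metadata (IMDSv2).
get_instance_id() {
  local token
  token="$(curl -sf --max-time 2 -X PUT "$IMDS/api/token" \
    -H 'X-aws-ec2-metadata-token-ttl-seconds: 60')" || return 1
  curl -sf --max-time 2 -H "X-aws-ec2-metadata-token: $token" \
    "$IMDS/meta-data/instance-id"
}

# Print "true" or "false" for the container instance backed by instance $1.
# Prints nothing if that instance is not registered in the cluster.
agent_connected() {
  local instance_id="$1" arns
  arns="$(aws ecs list-container-instances --region "$REGION" \
    --cluster "$CLUSTER" --query 'containerInstanceArns[]' --output text)" || return 1
  [ -n "$arns" ] || return 1
  # $arns is intentionally unquoted so that each ARN becomes one argument.
  aws ecs describe-container-instances --region "$REGION" --cluster "$CLUSTER" \
    --container-instances $arns \
    --query "containerInstances[?ec2InstanceId=='$instance_id'].agentConnected" \
    --output text | tr '[:upper:]' '[:lower:]'
}

# Publish a failure notification to the SNS topic.
notify() {
  aws sns publish --region "$REGION" --topic-arn "$TOPIC_ARN" \
    --subject "ECS agent disconnected" --message "$1" >/dev/null
}

# Assumed restart command; adjust for your init system or AMI.
restart_agent() { systemctl restart ecs; }

main() {
  local instance_id status
  instance_id="$(get_instance_id)"
  if [ -z "$instance_id" ]; then
    log "could not read instance ID from metadata"
    return 1
  fi
  status="$(agent_connected "$instance_id")"
  case "$status" in
    true)
      log "agent on $instance_id is connected" ;;
    false)
      log "agent on $instance_id is disconnected; notifying and restarting"
      notify "ECS agent on $instance_id (cluster $CLUSTER) is disconnected; restarting it."
      restart_agent ;;
    *)
      # Unknown status: do not restart on incomplete information.
      log "could not determine agent status for $instance_id"
      return 1 ;;
  esac
}

main "$@" || log "check did not complete"
```

The script restarts the agent only when the API explicitly reports it as disconnected. If the metadata or ECS call fails, it logs the problem and leaves the agent alone, which avoids restart loops caused by a transient API or permission error.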

5. Set up cron to run every 5 minutes

You can set up a cron job that runs the script every 5 minutes and appends its output and errors to /var/log/ecsmonitoring.log.

*/5 * * * * bash /home/ec2-user/monitor_agent.sh >> /var/log/ecsmonitoring.log 2>&1
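
If you provision nodes with a script, the cron entry can be installed without creating duplicates on repeated runs. The sketch below is one possible approach: the helper name with_monitor_entry is hypothetical, and the entry it adds redirects both output and errors to /var/log/ecsmonitoring.log.

```shell
CRON_LINE='*/5 * * * * bash /home/ec2-user/monitor_agent.sh >> /var/log/ecsmonitoring.log 2>&1'

# Print the given crontab text with CRON_LINE appended, unless an identical
# line is already present. Argument: the current crontab contents (may be empty).
with_monitor_entry() {
  local current="$1"
  if printf '%s\n' "$current" | grep -qxF -- "$CRON_LINE"; then
    printf '%s\n' "$current"
  elif [ -z "$current" ]; then
    printf '%s\n' "$CRON_LINE"
  else
    printf '%s\n%s\n' "$current" "$CRON_LINE"
  fi
}

# Usage, as the user whose crontab should hold the entry:
#   with_monitor_entry "$(crontab -l 2>/dev/null)" | crontab -
```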
