Configuring ECS Fargate and ECR with Private Subnets

January 29, 2025

Tags: Containers, DevOps, ECS, Networking, Security

Recently I found myself working with AWS ECS (Elastic Container Service) to host a simple application, using AWS Fargate as the underlying compute. This pairing bills itself as a simple means of deploying containers without the hassle of standing up and configuring your own servers.

That’s an attractive offer, but I quickly ran into some issues when trying to deploy to private subnets. Like so many other cloud systems, this one prefers to be public by default, which often isn’t suitable when working in secure, governed environments and certainly wasn’t an option for me.

In this post I’m going to look at how to deploy an ECS/Fargate service to truly private subnets and what extra infrastructure is needed to allow proper communication between services.

TL;DR. If you aren’t really interested in reading this and you just want your problem to go away…well that’s a shame, but I have written a Terraform module and stuck it in the module registry here which will fix your problems. But beware of blindly applying configs written by strangers!

Private Subnets Mean…What Exactly?

First of all let’s clear up some terminology. When we talk about “Private Subnets” in AWS, we’re usually talking about something that isn’t directly exposed to the internet but can still ACCESS the internet (so really, a subnet that doesn’t have a route to an Internet Gateway but does have a route to a NAT Gateway, i.e. you can get traffic out but it can’t get in).

This is the common wisdom, but that doesn’t feel very private, and if that is your setup you probably don’t really have a problem at all because this configuration will work with ECS Fargate and ECR just fine out of the box. What we’re talking about here is truly “private” subnets, which are isolated and have no access to the internet.
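To make the distinction concrete, here is a minimal Python sketch that labels a subnet from the routes in its route table. It assumes the routes are dicts shaped like the `Routes` entries that the EC2 DescribeRouteTables API returns (keys such as `GatewayId` and `NatGatewayId`); the function name, the labels and the example IDs are my own inventions for illustration.

```python
def classify_subnet(routes):
    """Label a subnet based on the routes in its associated route table.

    `routes` is assumed to be a list of dicts shaped like the `Routes`
    entries from EC2 DescribeRouteTables. Function name and labels are
    hypothetical, not AWS terminology.
    """
    has_igw = any((r.get("GatewayId") or "").startswith("igw-") for r in routes)
    has_nat = any((r.get("NatGatewayId") or "").startswith("nat-") for r in routes)
    if has_igw:
        return "public"
    if has_nat:
        return "private (outbound via NAT)"
    return "isolated"


# Example route tables (IDs are made up).
local_only = [{"DestinationCidrBlock": "10.0.0.0/16", "GatewayId": "local"}]
with_nat = local_only + [
    {"DestinationCidrBlock": "0.0.0.0/0", "NatGatewayId": "nat-0123456789abcdef0"}
]
with_igw = local_only + [
    {"DestinationCidrBlock": "0.0.0.0/0", "GatewayId": "igw-0123456789abcdef0"}
]

print(classify_subnet(local_only))
print(classify_subnet(with_nat))
print(classify_subnet(with_igw))
```

The rest of this post is about the "isolated" case, where the only route is the local one.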

Private ECR Repositories Mean…What Exactly?

ECR Repositories come in two flavours, Public and Private, but Private here doesn’t relate to IP networking in the way it does for our subnets; rather, it refers to authentication and the policies that can be attached to your Repository. The important thing to understand is that a Private ECR Repository still has a Public IP address. We just have to authenticate before we can push or pull any images.

Unhelpful Error Messages

So, with the preamble out of the way…

Chances are that if you’ve found your way here you are staring at this error message in your ECS event stream:

ResourceInitializationError: unable to pull secrets or registry auth: The task cannot pull registry auth from Amazon ECR: There is a connection issue between the task and Amazon ECR. Check your task network configuration. RequestError: send request failed caused by: Post "https://api.ecr.region.amazonaws.com/": dial tcp x.x.X.x:443: i/o timeout

This error message is useful in the sense that it tells us roughly what is going wrong: either we have no route between ECS and ECR, or we do have a route but we don’t have permission to read once we establish a connection. It’s unhelpful in that it doesn’t really tell you why this is the case or what to do about it, and these scenarios can be pretty difficult to diagnose in the opaque world of public cloud.

Understanding how traffic moves around inside a VPC is useful here. Since our ECS Cluster is isolated within one or more private subnets, with no routes to go anywhere beyond its own address space, and our ECR Repository is only accessible via a public IP address, the problem is pretty obvious: our subnet is really a little bit too private. We will need to introduce some method to get access to ECR that isn’t via the public internet.

The quick diagram below should help to break down exactly what we’re working with and where we’re going wrong:

Our *ECS Cluster* is private…but a bit too private. We have effectively put it in a bubble where it can’t even reach our *ECR Repository*!
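One quick way to see which side of the bubble you are on is to check what the ECR hostname resolves to from inside the VPC. The sketch below uses only the Python standard library; the helper names are mine, and the hostname in the usage comment follows the pattern from the error message above with a placeholder region.

```python
import ipaddress
import socket


def resolve(hostname, port=443):
    """Return the sorted, de-duplicated IP addresses that `hostname` resolves to."""
    infos = socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)
    return sorted({info[4][0] for info in infos})


def all_private(addresses):
    """True only if there is at least one address and every one is private."""
    parsed = [ipaddress.ip_address(a) for a in addresses]
    return bool(parsed) and all(ip.is_private for ip in parsed)


# Usage, run from a host inside the VPC (region is a placeholder):
#   addresses = resolve("api.ecr.eu-west-1.amazonaws.com")
#   print(addresses, all_private(addresses))
```

If the addresses come back public and the subnet has no route to the internet, the connection can only time out, which is exactly what the `i/o timeout` in the error is telling us.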

Missing Endpoints

It’s here that VPC Endpoints come into play. The AWS documentation will give you an in-depth breakdown of how endpoints and their underlying technology (AWS PrivateLink) work so there isn’t much value in me regurgitating it, but in a nutshell there are two types:

Interface Endpoints. These are a collection of network interfaces in your subnets, paired with private DNS records, so that DNS queries for some AWS service resolve to those interfaces and your traffic stays inside your VPC. The broad effect of using an Interface Endpoint is to apply a private IP to an otherwise public service (such as ECR).
Gateway Endpoints. These are used specifically for S3 or DynamoDB and manipulate a specific AWS Route Table by adding a route whose destination is a Prefix List (this is just a fancy name for a well known list of IP ranges that can be referenced as a single entry). The effect of using this method is to allow calls to S3 or DynamoDB from within a private subnet with no means of reaching the internet.

So to get those endpoints in place:

First we’ll need to create an appropriate Security Group to manage access to our endpoints. In the AWS console browse to VPC > Security Groups > Create security group. Apply the rules shown in the table below (the ingress rules must be applied for each of your Private Subnet CIDRs):

Type	Protocol	Port Range	Reason
HTTP	TCP	80	Image Pulls
HTTPS	TCP	443	Image Pulls
DNS Query	UDP	53	DNS Queries

Substitute your subnet CIDRs as appropriate!
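If you would rather script this than click through the console, the rules in the table above can be expanded per subnet CIDR. This is only a sketch: it builds dicts in the shape that the `IpPermissions` parameter of boto3's `authorize_security_group_ingress` expects, but it makes no AWS calls itself, and the function name and example CIDRs are hypothetical.

```python
import ipaddress

# (protocol, port, reason) rows taken from the table above.
RULES = [
    ("tcp", 80, "Image Pulls"),
    ("tcp", 443, "Image Pulls"),
    ("udp", 53, "DNS Queries"),
]


def build_ingress_permissions(subnet_cidrs):
    """Build one ingress permission per rule, covering every subnet CIDR."""
    # ip_network raises ValueError on a malformed CIDR, so typos fail early.
    cidrs = [str(ipaddress.ip_network(c)) for c in subnet_cidrs]
    return [
        {
            "IpProtocol": protocol,
            "FromPort": port,
            "ToPort": port,
            "IpRanges": [{"CidrIp": c, "Description": reason} for c in cidrs],
        }
        for protocol, port, reason in RULES
    ]


# Example CIDRs are made up; substitute your own private subnets.
permissions = build_ingress_permissions(["10.0.1.0/24", "10.0.2.0/24"])
for permission in permissions:
    print(permission)
```

The resulting list could then be passed as `IpPermissions` alongside the `GroupId` of the Security Group you created.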

To create our Interface Endpoints, browse to VPC > Endpoints > Create endpoint:

Select the following configuration options:

Name Tag: Name the Endpoint something suitable
Endpoint type: AWS Services
Endpoint Service Name: com.amazonaws.<YOUR_REGION>.ecr.dkr
Endpoint Service Type: Interface
VPC: Your VPC as appropriate
DNS name: Enable DNS resolution
DNS record IP type: IPv4
Subnets: Your private Subnets as appropriate
Subnet IP address type: IPv4
Security groups: The Security Group created in the previous step

Click Create Endpoint.

If you are using Fargate platform version 1.4.0 or later (and you almost certainly are), repeat this process and create a second endpoint for com.amazonaws.<YOUR_REGION>.ecr.api.
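For anyone scripting the two Interface Endpoints, here is a rough Python sketch of the same settings. It only builds keyword arguments in the shape that boto3's `create_vpc_endpoint` accepts (`VpcEndpointType`, `ServiceName`, `SubnetIds`, `SecurityGroupIds`, `PrivateDnsEnabled`); the function name and every ID in the example are placeholders of mine.

```python
def interface_endpoint_requests(region, vpc_id, subnet_ids, security_group_id):
    """Return one request dict per ECR Interface Endpoint (ecr.dkr and ecr.api)."""
    requests = []
    for suffix in ("ecr.dkr", "ecr.api"):
        requests.append(
            {
                "VpcEndpointType": "Interface",
                "VpcId": vpc_id,
                "ServiceName": f"com.amazonaws.{region}.{suffix}",
                "SubnetIds": list(subnet_ids),
                "SecurityGroupIds": [security_group_id],
                # Private DNS is what makes the normal ECR hostnames
                # resolve to the endpoint's private IPs.
                "PrivateDnsEnabled": True,
            }
        )
    return requests


# All IDs below are placeholders.
requests = interface_endpoint_requests(
    region="eu-west-1",
    vpc_id="vpc-0123456789abcdef0",
    subnet_ids=["subnet-0aaa", "subnet-0bbb"],
    security_group_id="sg-0123456789abcdef0",
)
for request in requests:
    print(request["ServiceName"])
```

Each dict could be unpacked into `create_vpc_endpoint(**request)` on a boto3 EC2 client.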

That should be the Interface Endpoints complete. We still need to create a final Gateway Endpoint for S3, so repeat the creation process with the configuration below:

Name Tag: Name the Endpoint something suitable
Endpoint type: AWS Services
Endpoint Service Name: com.amazonaws.<YOUR_REGION>.s3
Endpoint Service Type: Gateway
VPC: Your VPC as appropriate
Route Tables: Your Private Route Table(s) as appropriate
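The Gateway Endpoint settings above can be sketched the same way, along with a small check for the route it injects. I am assuming route dicts shaped like the `Routes` entries from EC2 DescribeRouteTables, where a Gateway Endpoint route carries a `DestinationPrefixListId` starting with `pl-` and a `GatewayId` starting with `vpce-`. The function names and example IDs are hypothetical.

```python
def s3_gateway_endpoint_request(region, vpc_id, route_table_ids):
    """Build keyword arguments for an S3 Gateway Endpoint request."""
    return {
        "VpcEndpointType": "Gateway",
        "VpcId": vpc_id,
        "ServiceName": f"com.amazonaws.{region}.s3",
        "RouteTableIds": list(route_table_ids),
    }


def has_prefix_list_route(routes):
    """True if any route targets a prefix list through a VPC endpoint."""
    return any(
        (r.get("DestinationPrefixListId") or "").startswith("pl-")
        and (r.get("GatewayId") or "").startswith("vpce-")
        for r in routes
    )


# Placeholder IDs throughout.
request = s3_gateway_endpoint_request(
    "eu-west-1", "vpc-0123456789abcdef0", ["rtb-0123456789abcdef0"]
)
before = [{"DestinationCidrBlock": "10.0.0.0/16", "GatewayId": "local"}]
after = before + [
    {"DestinationPrefixListId": "pl-0123abcd", "GatewayId": "vpce-0123456789abcdef0"}
]
print(request["ServiceName"])
print(has_prefix_list_route(before), has_prefix_list_route(after))
```

Note that a Gateway Endpoint takes Route Tables rather than Subnets or Security Groups, which is the main difference from the Interface Endpoints.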

With this in place our Endpoints are ready for use:

Injected *Prefix List* corresponding to *S3 Gateway Endpoint*

…and our infrastructure should function something like this:

Our ECS services can now run just fine and pull images from ECR without being exposed to the internet.

Wait…What Has S3 Got to Do With Anything?

Good question. This is actually detailed in a tiny paragraph of the configuration documentation here but you would be forgiven for not understanding it at all. ECR stores image layers in S3 buckets that AWS manages and that you have zero visibility of (in fact you have no visibility of this process at all). When ECS pulls an image it fetches the manifest from ECR and then downloads the layers themselves from S3, so if you are working in private subnets you will need to ensure that you also have a VPC Gateway Endpoint for S3, otherwise your images will never find their way down to your cluster!
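If you want to lock the S3 Gateway Endpoint down so that it is only good for image layers, an endpoint policy can restrict it to the layer bucket. The sketch below builds such a policy in Python. The bucket naming pattern (`prod-<region>-starport-layer-bucket`) is how I recall it from the AWS ECR VPC endpoint documentation, so treat it as an assumption and verify it against the docs for your region before relying on it; the function name and `Sid` are my own.

```python
import json


def ecr_layer_bucket_policy(region):
    """Build an S3 endpoint policy allowing reads from the ECR layer bucket only.

    The bucket ARN pattern is an assumption recalled from AWS documentation.
    """
    bucket_arn = f"arn:aws:s3:::prod-{region}-starport-layer-bucket/*"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowEcrLayerDownloads",
                "Effect": "Allow",
                "Principal": "*",
                "Action": ["s3:GetObject"],
                "Resource": [bucket_arn],
            }
        ],
    }


policy = ecr_layer_bucket_policy("eu-west-1")
print(json.dumps(policy, indent=2))
```

Bear in mind that a policy this tight will also block anything else in those subnets that legitimately needs S3, so it only suits subnets dedicated to this workload.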

Can’t I Just Use A NAT Gateway? A Final Thought On Hidden Costs

The short answer is yes, you can technically. It might seem to make your life easier in the short term when things just seem to start working and you don’t have to do so much configuration but this isn’t very forward thinking (plus it might not even be an option you have if you are in a highly governed environment).

Traffic that goes through a NAT Gateway and out into the internet costs money: on top of an hourly charge you pay for every gigabyte it processes. It’s not a lot of money per gigabyte but it can all add up pretty quickly and you don’t want to be on the wrong side of that.

A few minutes glancing around Stack Overflow etc. will show you a lot of horror stories where people’s bills suddenly went through the roof in the middle of the night when their application started crashing and ECS started pulling images from ECR over and over again without any rate limits. Such a scenario will lead to a huge amount of traffic flying through a NAT Gateway. Without proper network considerations this could easily be you and that wouldn’t be very nice for anyone!
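To put a rough number on that, here is some back-of-the-envelope arithmetic in Python. Every input is an assumption for illustration: a 1 GB image, a task that crashes and re-pulls every 30 seconds for 8 hours, and a data processing price of 0.045 per GB (check the current NAT Gateway pricing for your region rather than trusting my figure).

```python
def nat_data_cost(image_size_gb, pulls, price_per_gb):
    """Data processing cost of pulling an image `pulls` times through a NAT Gateway."""
    return image_size_gb * pulls * price_per_gb


# Hypothetical crash loop: one pull every 30 seconds for 8 hours.
pulls = (8 * 60 * 60) // 30  # 960 pulls

# Hypothetical image size and per-GB price.
cost = nat_data_cost(image_size_gb=1.0, pulls=pulls, price_per_gb=0.045)
print(f"{pulls} pulls, data processing cost: {cost:.2f}")
```

That is one task for one night; multiply it by the number of tasks in the service and the figure stops being small change.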