The creation of an Azure Site to Site VPN is (even by Software Defined Networking standards)… involved. This isn’t a problem unique to Azure, and it isn’t helped by vendors’ desire to give all of their components unusual names rather than use the terminology that already exists. Setup is a very manual and time-consuming process; however, Terraform can completely automate and codify it.
Example code for this post can be found on my GitHub.
Before Jumping In…
We need to define the usual components: the Local Gateway (usually an on-premise firewall), the VPN Gateway (Azure’s VPN Gateway) and the Connection (the VPN connection between the two). All three of these need to be defined in Azure, which can lead to some confusion: on the surface you might assume that the Local Gateway has no business being defined in Azure since it’s not a Cloud item (not to mention the various SKU oddities that crop up along the way).
Despite the Local Gateway being defined in Azure, this isn’t some kind of magic self-configuring and self-routing VPN. You will still need to configure your actual local device(s) to do their part. Microsoft have tried to provide a good chunk of assistance in the form of configuration guides for supported devices in their documentation (though I know from experience that “unsupported” devices will work with varying degrees of success as long as you can make the protocols and proposals match).
It is also critical to know that Azure requires a dedicated Transport Subnet named GatewaySubnet inside the Address Space your VNet has been created in (I’m using an entire /24 here; Microsoft’s documentation recommends /27 or larger). If this isn’t in place when you attempt to create your first VPN you’ll get nowhere.
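As a rough illustration of that prerequisite, a subnet definition along the following lines could live in whichever configuration owns the VNet. This is a sketch rather than part of the original example code: the resource label `gateway` is hypothetical, the names and prefix are placeholders matching the defaults used later in this post, and `address_prefix` is the argument as it exists in the 2.1.0 provider pinned here (newer provider versions take a list in `address_prefixes` instead).

```hcl
# Hypothetical sketch: belongs with the VNet definition, not with the VPN resources.
resource "azurerm_subnet" "gateway" {
  name                 = "GatewaySubnet"      # Azure requires this exact name
  resource_group_name  = "tinfoil_network_rg" # placeholder, assumed to exist already
  virtual_network_name = "tinfoil_vnet"       # placeholder, assumed to exist already
  address_prefix       = "10.0.3.0/24"        # must sit inside the VNet address space
}
```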
Finally, I’m assuming that authentication is going to be done with Pre-Shared Keys of a good length. Since the key needs to be pre-shared, I’m going to have it entered at run time rather than randomly generated using Terraform’s pseudorandom generation utilities.
How Does The VPN Look?
According to Microsoft, the VPN should look something like this:
…except that simplistic view of things isn’t exactly how anything works. How could it be? The Local Network Gateway isn’t a real device, it’s just a digital representation of a real network appliance. We’re also not seeing any mention of our transport subnet. It’s more reasonable to say that the real setup looks like:
Let’s Try and Make Something
With all of this in mind, let’s try and make something.
The code can get a little long to read for a simple blog entry, so let’s just look at automating the creation of a single VPN entry. Adding loops and counts is simple enough but is only going to confuse the matter right now.
Below is the standard providers.tf, simple enough, just a single Provider for AzureRM:
#--provider.tf
provider "azurerm" {
  version = "=2.1.0"
  features {}

  subscription_id = var.subscription_id
  tenant_id       = var.tenant_id
  client_id       = var.client_id
  client_secret   = var.client_secret
}
As usual, we want to define as much as possible in variables. This will aid with parameterisation and allow us to scale the routine if we want to add loops and counts later:
#--variables.tf

#--Primary Location
variable "location" {
  type        = string
  description = "Primary Location"
  default     = "uksouth"
}

#--Subscription
variable "subscription_id" {
  type        = string
  description = "Subscription id"
}

#--Tenant
variable "tenant_id" {
  type        = string
  description = "Tenant id"
}

##############################
#---Auth and Secret Params---#
##############################

#--Service Principal AppID
variable "client_id" {
  type        = string
  description = "Client id"
}

#--Service Principal Secret
variable "client_secret" {
  type        = string
  description = "Client secret"
}

#--VPN Pre-Shared Key
variable "vpn_psk" {
  type        = string
  description = "VPN PSK"
}

#####################
#---Deploy Params---#
#####################

#--Resource Groups
variable "resource_group" {
  description = "Resource Group"
  type        = string
  default     = "tinfoil_network_rg"
}

#--Base VNet
variable "vnet" {
  description = "Base vnet"
  type        = string
  default     = "tinfoil_vnet"
}

#--Subnet Address Spaces
variable "peer_subnet_address_spaces" {
  description = "All peer subnets"
  type        = list(string)
  default     = ["172.16.1.0/24"]
}

#--Transport Subnet Address Space
variable "transport_subnet_address_space" {
  description = "All subnets"
  type        = list(string)
  default     = ["10.0.3.0/24"]
}

#--VPN Gateway
variable "vpn_gateway" {
  description = "VPN Gateway"
  type        = string
  default     = "tinfoil_vpn_gateway"
}

#--Peer VPN Gateway
variable "peer_vpn_gateway" {
  description = "Peer VPN Gateway"
  type        = string
  default     = "madcaplaughs_vpn_gateway"
}

#--VPN Connection
variable "vpn_connection" {
  description = "VPN Connection"
  type        = string
  default     = "tinfoil_vpn_connection"
}

#--VPN Public IP
variable "vpn_public_ip" {
  description = "VPN Public IP"
  type        = string
  default     = "tinfoil_vpn_ip"
}
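Since the PSK is typed in at run time, it may be worth tightening up the `vpn_psk` declaration a little. The variant below is a sketch, not what the original example code does, and it would replace the plain `vpn_psk` declaration rather than sit beside it. It assumes a newer Terraform than this post was written against (`validation` blocks need 0.13 or later, `sensitive` on variables needs 0.14 or later), and the 32 character minimum is an arbitrary figure picked for illustration.

```hcl
#--VPN Pre-Shared Key (hardened variant; requires Terraform 0.14+)
variable "vpn_psk" {
  type        = string
  description = "VPN PSK"
  sensitive   = true # keeps the value out of plan and apply output

  validation {
    # 32 is an assumed minimum, adjust to your own policy
    condition     = length(var.vpn_psk) >= 32
    error_message = "The VPN pre-shared key must be at least 32 characters long."
  }
}
```

With no default set, Terraform will prompt for the value on apply, or it can be supplied through the `TF_VAR_vpn_psk` environment variable.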
With everything in place, we can now use our main.tf for the deployment of the Azure VPN components. There are a few things to be aware of, so I’ve added comments in-line:
data "azurerm_subnet" "tinfoilvpn" {
  #--We need to look this up to get the ID of the Transport Subnet
  #--The lookup is by name, and Azure requires the name GatewaySubnet
  name                 = "GatewaySubnet"
  resource_group_name  = var.resource_group
  virtual_network_name = var.vnet
}

resource "azurerm_local_network_gateway" "madcaplaughs" {
  name                = var.peer_vpn_gateway
  location            = var.location
  resource_group_name = var.resource_group
  gateway_address     = "xx.xx.xx.xx" #--Your local device public IP here
  address_space       = var.peer_subnet_address_spaces
}

resource "azurerm_public_ip" "tinfoilvpn" {
  name                = var.vpn_public_ip
  location            = var.location
  resource_group_name = var.resource_group
  allocation_method   = "Dynamic" #--Dynamic set means Azure will generate an IP for your Azure VPN Gateway
}

resource "azurerm_virtual_network_gateway" "tinfoilvpn" {
  name                = var.vpn_gateway
  location            = var.location
  resource_group_name = var.resource_group

  type          = "Vpn"        #--Other option is ExpressRoute, predictably for ExpressRoute gateways
  vpn_type      = "RouteBased" #--Policy based is also acceptable here, depending on your use case
  active_active = false
  enable_bgp    = false
  sku           = "Basic" #--A whole load of oddities occur around SKUs, see MS Docs for details

  ip_configuration {
    public_ip_address_id          = azurerm_public_ip.tinfoilvpn.id
    private_ip_address_allocation = "Dynamic"
    subnet_id                     = data.azurerm_subnet.tinfoilvpn.id #--There's that ID we needed, for the Transport Subnet
  }
}

resource "azurerm_virtual_network_gateway_connection" "tinfoilvpn" {
  name                = var.vpn_connection
  location            = var.location
  resource_group_name = var.resource_group

  type                       = "IPsec"
  virtual_network_gateway_id = azurerm_virtual_network_gateway.tinfoilvpn.id
  local_network_gateway_id   = azurerm_local_network_gateway.madcaplaughs.id
  shared_key                 = var.vpn_psk #--Provided at run time
}
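Because the public IP is allocated dynamically, the address you need to plug into your local device isn’t known until the gateway exists. One way to surface it, sketched here as an addition that isn’t in the original example code, is to read the IP back through a data source once the gateway has been built and expose it as an output. The output name `vpn_gateway_public_ip` is hypothetical.

```hcl
#--Read the public IP back after the gateway exists; a Dynamic IP has no address until then
data "azurerm_public_ip" "tinfoilvpn" {
  name                = azurerm_public_ip.tinfoilvpn.name
  resource_group_name = var.resource_group
  depends_on          = [azurerm_virtual_network_gateway.tinfoilvpn]
}

#--Hypothetical output name, shown at the end of terraform apply
output "vpn_gateway_public_ip" {
  description = "Public IP of the Azure VPN Gateway, for configuring the local device"
  value       = data.azurerm_public_ip.tinfoilvpn.ip_address
}
```

This is the address the on-premise firewall should use as its peer when you configure the other end of the tunnel.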
Now when we run terraform init, Terraform will download the AzureRM provider, and when we run terraform apply, get ready for a very long wait, as the provisioning of these resources takes a good long time. Seriously, expect up to 30 minutes for the provisioning of the Azure Virtual Network Gateway, and then around 15-30 minutes further before Azure Resource Manager starts to show any traffic in or out. This isn’t a Terraform limitation, this is the speed of Azure:
If we look in Azure Resource Manager now at our active VPN connections, we can see that the connection has been created, and our Remote and Local gateways are on either end of it (IP addresses redacted for privacy):
Future Considerations
I would also add that it’s ill-advised to link the creation of VNets, address spaces and subnets to the creation of the VPNs themselves. When you modify the configuration and reapply, the entire state will be modified and you will end up reprovisioning any and all VPNs defined by the configuration. At around an hour per VPN, that’s a tedious waste of time you could well do without.
After all, you don’t want to interrupt services or waste your time watching progress counters tick along forever!
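One possible extra safeguard, offered as a suggestion rather than something the original example code does, is Terraform’s `prevent_destroy` lifecycle setting on the slow-to-build gateway. The sketch below is the gateway resource from main.tf with a `lifecycle` block added; `var.gateway_subnet_id` is a hypothetical variable standing in for the subnet lookup, so it would need to be declared or swapped for the data source reference.

```hcl
resource "azurerm_virtual_network_gateway" "tinfoilvpn" {
  name                = var.vpn_gateway
  location            = var.location
  resource_group_name = var.resource_group

  type          = "Vpn"
  vpn_type      = "RouteBased"
  active_active = false
  enable_bgp    = false
  sku           = "Basic"

  ip_configuration {
    public_ip_address_id          = azurerm_public_ip.tinfoilvpn.id
    private_ip_address_allocation = "Dynamic"
    subnet_id                     = var.gateway_subnet_id #--Hypothetical variable, see note above
  }

  lifecycle {
    #--Any plan that would destroy (and so recreate) the gateway fails with an error instead
    prevent_destroy = true
  }
}
```

The trade-off is that a deliberate teardown then means removing the `lifecycle` block first, which is a small price next to an accidental hour-long rebuild.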