Starfish vLabs
Cisco ACI IPN Whitepaper
A Real-World Build
Filename: Cisco ACI IPN Whitepaper (A Real-World Build)
Date: 25/04/2022
Author: Adam Ratcliffe
Version: 1.0
Introduction
During a recent project we had a lot of fun building out a greenfield data centre based on a Cisco ACI Multi-Pod and Multi-Site architecture. We leaned heavily on two excellent Cisco White Papers – see below for the links.
Cisco ACI Multi-Pod White Paper
Cisco ACI Multi-Pod Configuration White Paper
While these White Papers provide comprehensive coverage of the topic, there were configuration gaps and a lack of information on testing and verification. As a result, during the implementation, we had to overcome several unforeseen challenges and research how best to verify our implementation.
In this White Paper, we provide a real-world example of how our build deals with several of the problems that we faced and outline the verification commands that we used. The White Paper is split into eight sections as outlined below.
1. Physical Topology
2. Interface Configuration
3. Inter-Pod Layer 3 Configuration
4. Routing and Multicast Configuration
5. IPN to Spine Connectivity
6. DHCP Relay Configuration
7. The IPN Wizard
8. Verification
Physical Topology
We started by connecting the two PODs together. For our IPN we used Cisco Nexus 93180YC-FX-24 devices; as these switches are purely for the IPN, we saw no point in wasting resources on 48-port models, so we opted for 24-port 10 Gb switches, two per POD, running in NX-OS mode. The intra-pod links were built as a port-channel using SFP-H10GB-CU3M cables, and the inter-pod links as single interfaces using 10GBASE-LR optics.
Interface Configuration
After the IPN devices were physically connected, we moved on to the physical interface configuration, which includes the Layer 2 MTU and allowed VLANs. It is recommended that all Layer 2 and Layer 3 MTUs match, and switching fabrics like ACI that utilise VXLAN require Layer 2 jumbo frames. We also came across an issue with OSPF where routes were not passing from one of our L3Outs; this was due to the MTU size of the DBD packets, so we were required to increase the Layer 2 MTU on the IPN interfaces. We set these to the maximum for ACI, which is 9216. The VLANs allowed on the trunks are for the intra- and inter-pod point-to-point networks (they also allow for the management and ERSPAN VLANs in our live IPN, not shown here; the use of VLANs and VRF-Lite also allows the links to be used for other purposes). A sketch of the trunk configuration follows the per-device configurations below.
POD1-SW01
POD2-SW01
POD1-SW02
POD2-SW02
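As a minimal sketch, the trunk configuration on one IPN switch looked broadly like the following; the interface numbers and VLAN range are illustrative only, so substitute your own cabling and point-to-point VLANs.

interface port-channel1
  description Intra-pod link to POD1-SW02 (numbering illustrative)
  switchport
  switchport mode trunk
  switchport trunk allowed vlan 2501-2510
  mtu 9216

interface Ethernet1/49
  description Inter-pod link to POD2-SW01 (numbering illustrative)
  switchport
  switchport mode trunk
  switchport trunk allowed vlan 2501-2510
  mtu 9216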
Inter-Pod Layer 3 Configuration
Once the physical interface configuration was in place, we proceeded with the Layer 3 Configuration for communication between the PODs and the ability to share the required routes dynamically using OSPF.
We used /30s on all intra- and inter-pod links, configured on SVIs inside OSPF process 2; this was to ensure separation between our IPN and the management network, which uses OSPF process 1. We also created the IPN VRF for use throughout the IPN and, in future, the ISN (Inter-Site Network). A sketch of these building blocks follows the per-device configurations below.
POD1-SW01
POD2-SW01
POD1-SW02
POD2-SW02
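Here is a minimal sketch of these Layer 3 building blocks on one IPN switch; the VRF name and OSPF process number match our build, while the VLAN ID, /30 addressing and OSPF area are illustrative assumptions.

feature ospf
feature interface-vlan
vrf context IPN
router ospf 2
  vrf IPN

interface Vlan2501
  description Intra-pod point-to-point (VLAN and addressing illustrative)
  no shutdown
  mtu 9216
  vrf member IPN
  ip address 172.17.254.1/30
  ip router ospf 2 area 0.0.0.0
  ip pim sparse-mode
! 'ip pim sparse-mode' requires 'feature pim', covered in the next section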
Routing and Multicast
At this point in the process, it is pertinent to confirm the OSPF configuration and that all OSPF neighbours are FULL:
show ip ospf neighbor vrf IPN
Ensure all interfaces are pingable and reachable from each of the IPN devices.
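For example, from any of the IPN devices (the far-end address here is illustrative; substitute the other side of each /30):
ping 172.17.254.2 vrf IPN
show ip interface brief vrf IPN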
In the SVI configuration you can also see that on each SVI we enabled:
ip pim sparse-mode
We must enable PIM globally to allow the use of PIM sparse mode:
feature pim
This is a key component of the IPN and enables the underlying PIM multicast communication between the PODs; in the ACI IPN a bidirectional (Bidir) phantom RP must be in use. This was the next step in our configuration and the most problematic. Bidir PIM for Broadcast, Unknown Unicast and Multicast (BUM) traffic between PODs must be configured, or there will be no inter-pod communication.
For the Bidir Phantom RP to work the following must be in place:
Rendezvous Points (RPs) are defined within the loopback interface's subnet. The IP address of the RP must be within the subnet configured on the loopback but must not be the IP address configured on the loopback. (If you use the actual IP address configured on the loopback, it becomes a 'real' RP and not a phantom RP.)
Each device's RP loopback uses the same IP address with a different-length subnet mask (most specific route wins). This setup ensures failover if the RP is unreachable: the active RP will always be the one with the most specific route available, so if the /30 disappears from the IPN routing table, the device with the /29 configured will take over.
The RP address must be advertised within the IGP, in our case this is OSPF.
Point to note: OSPF advertises all loopbacks as /32s by default, so we must use the 'ip ospf network point-to-point' command to advertise the whole subnet.
Here you can see we configured each loopback with the same IP address whilst using different subnet masks. You can also see below that the RP configuration uses 172.16.1.17; this IP address is part of loopback3's subnet but is not 172.16.1.18, which is the IP address configured on loopback3.
All IPN devices are configured with the same rp-address (a sketch is also provided after the per-device configurations below):
POD1-SW01
POD2-SW01
POD1-SW02
POD2-SW02
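As a minimal sketch, the phantom RP pieces on POD1-SW01 and POD2-SW01 would look broadly like the following. The loopback3 address, masks and RP address follow the description above; the OSPF area and the multicast group range are assumptions (225.0.0.0/15 is the ACI default GIPo pool – adjust to match your fabric).

POD1-SW01 (most specific mask – the active RP):
interface loopback3
  description Bidir phantom RP
  vrf member IPN
  ip address 172.16.1.18/30
  ip ospf network point-to-point
  ip router ospf 2 area 0.0.0.0
  ip pim sparse-mode
vrf context IPN
  ip pim rp-address 172.16.1.17 group-list 225.0.0.0/15 bidir

POD2-SW01 differs only in the loopback mask:
interface loopback3
  ip address 172.16.1.18/29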
In our configuration the active RP will be POD1-SW01. If POD1-SW01 goes down or the Lo3 subnet becomes unreachable via the IGP, then POD2-SW01 will take over as it holds the next most specific route, followed by POD1-SW02 and finally POD2-SW02. We shall cover confirmation of the Bidir phantom RP configuration once the Multi-Pod is in place.
At this point we have two fully functioning individual ACI PODs, complete with APICs, Leafs and Spines; however, the PODs cannot communicate until the IPN is positioned between both PODs, completing the Multi-Pod.
Spine Connectivity
The next step is to physically attach our IPN devices to the ACI Spines in both POD1 and POD2. The connection between the IPN and the ACI Spines is a key part of the ACI Multi-Pod configuration; without it we have two separate individual PODs.
Now that the IPN is connected to the ACI Spines, we can configure the IPN devices and the required L3Outs on ACI.
DHCP Relay Configuration
DHCP relay is used for the connections into the IPN from ACI. Each IPN device connecting to an ACI Spine must have DHCP relay enabled. The DHCP relay configuration added to the IPN interfaces facing the Spines ensures that when a Spine sends DHCP discoveries out of its POD into the IPN, the discoveries are relayed to the opposing POD. This ensures that the APICs in the opposing POD receive the Spine's information (including its serial number via TLV) and add it to their membership table, which aids in the discovery of the Multi-Pod network. The DHCP relay addresses must be the APIC IPs.
Here POD1 has 2x APICs and POD2 has 1x APIC. Our APICs used the following IPs:
- POD1-APIC01 : 172.18.0.1
- POD1-APIC02 : 172.18.0.2
- POD2-APIC03 : 172.18.128.3
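For the relay statements to take effect, DHCP relay must also be enabled globally on each IPN switch; a minimal sketch (exact enablement commands may vary by NX-OS release):
feature dhcp
ip dhcp relay
! the per-interface 'ip dhcp relay address' statements then point at the APIC IPs listed above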
Note the use of VLAN 4. This is key to the ACI Multi-Pod configuration. The IPN interfaces facing the Spines must be configured with sub-interfaces using:
encapsulation dot1q 4
Traffic originating from the Spines will always be tagged with 802.1Q VLAN 4; this is how ACI knows that the traffic is from a Spine. Therefore, IPN interfaces facing the Spines must allow traffic encapsulated with VLAN 4, and multiple Layer 3 sub-interfaces tagged with the same VLAN must be allowed on the IPN devices. Here we used the minimum recommended ACI MTU of 9150. A combined sketch of a Spine-facing interface follows the per-device configurations below.
POD1-SW01
POD2-SW01
POD1-SW02
POD2-SW02
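As a minimal sketch, a Spine-facing interface on POD1-SW01 would look broadly like the following. The /30 addressing and sub-interface number are taken from the verification section later in this paper and the relay addresses from the APIC list above; the OSPF area and network type are assumptions and must match the OSPF interface policy on the ACI L3Out.

interface Ethernet1/1
  description Link to POD1-SP01
  no switchport
  mtu 9216

interface Ethernet1/1.4
  description Link to POD1-SP01 (ACI Multi-Pod, VLAN 4)
  encapsulation dot1q 4
  vrf member IPN
  mtu 9150
  ip address 172.17.254.33/30
  ip ospf network point-to-point
  ip router ospf 2 area 0.0.0.0
  ip pim sparse-mode
  ip dhcp relay address 172.18.0.1
  ip dhcp relay address 172.18.0.2
  ip dhcp relay address 172.18.128.3
  no shutdown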
Next we continue with the ACI L3Out configuration that will connect the ACI PODs to the IPN and in turn create the ACI Multi-Pod network.
IPN Wizard
Multi-Pod L3Outs connect ACI to the IPN and should be configured within the infra Tenant. Here we have one L3Out configured with four Logical Node Profiles (one for each Spine), each with two OSPF interfaces that reference the physical Spine interfaces connecting to the IPN.
To create the L3Out we right-click L3Outs under the infra Tenant and select 'create new'. Note here that the VRF has already been chosen for us; the overlay-1 VRF deals with all the Multi-Pod and Multi-Site control-plane traffic.
When creating the L3Out we decide the following:
The name of the L3Out along with the L3 Domain. The L3 Domain is important as it must reference our Spines_EntityProfile AEP (the logical and the physical part). We also select the VLAN pool, which should be set to dynamically allocate VLAN 4. We are also asked to select 'Use for:'; in this case we should select MPod for Multi-Pod. Note that BGP is left ticked even though we are configuring an OSPF interface.
Once these are completed, we can then select Next and move on to the IP configuration of the Spine interfaces.
Here we select the Node ID. (There is a set list of Spine devices; at this point we should already be seeing the Spines from both PODs, which is the DHCP relay configuration on the IPN coming into play.)
We select our Spines, set our Router IDs (in line with the existing TEP pools created when we set up our existing ACI PODs), set our interfaces with a VLAN encapsulation of 4 and the default MTU of 9150 (unless we change it), and assign the /30 IP addresses that correspond with what we have already configured on the IPN.
The next steps in creating this L3Out are the OSPF protocol configuration (cost, etc.) and then the creation of an External EPG. This is required, but the default configuration will suffice. With the L3Out created, we can now confirm that the Logical Interface Profile is available.
If we have configured this correctly, we should begin to see POD2 devices start to populate under the inventory in ACI.
Verification
We can confirm the OSPF neighbours from the IPN or from the CLI of the Spines. Here we confirm the OSPF neighbourship between POD1-SP01 (172.17.254.34/30, Eth1/33.33) and POD1-SW01 (172.17.254.33/30, Eth1/1.4).
From the Spine:
From the IPN:
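The underlying commands are sketched here; overlay-1 is the infra VRF in which the Spines run their IPN-facing OSPF, while on the IPN we use our IPN VRF.
show ip ospf neighbors vrf overlay-1 (on the Spine)
show ip ospf neighbor vrf IPN (on the IPN)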
We have now confirmed that the OSPF neighbours are up between ACI and the IPN. If at this point we are not seeing both PODs in the ACI inventory, then we should fault-find.
We should also confirm that the MP-BGP EVPN adjacencies are up between the Spines in each POD. We have four Spines, two in each POD, so we should see two neighbours on each device; these are the Spines from the opposing POD. This is important for sharing prefix information East to West and for learning IP and MAC endpoint information across PODs.
The 172.28.255.x IPs were configured as RIDs for OSPF in the Multi-Pod L3Out configuration on ACI, and BGP remained ticked to ensure the VPNv4 peers were brought up.
show bgp l2vpn evpn summary vrf overlay-1
show coop internal info ip-db
show endpoint (Should see dynamic tunnels)
Some of the fundamental issues we had during our Multi-Pod deployment were OSPF neighbours flapping due to large DBDs, and no or intermittent communication between PODs.
When fault-finding the OSPF flaps, we found the database descriptor (DBD) packets to be too large. This pointed to the Layer 2 MTU, which was not set high enough on the physical IPN interfaces facing the Spines.
When fault-finding the intermittent connectivity issues, we found that the multicast configuration was incorrect: the mroutes were pointing towards the Spines and not the phantom RP. Inter-pod communication will not work if the mroutes are pointed at the Spines, because the Spine devices cannot be RPs.
To verify this, confirm the mroutes are pointing at the Bidir RP, not at the interfaces facing the Spines. Here we can see the route to the RP is pointing to the loopback3 interface, as it should be.
show ip mroute vrf IPN
It is also good to confirm that the OSPF route for the RP is correct and not pointing at the Spines. The route to the RP should be via the IPN devices, never via the Spines. If the route for the RP (in our case 172.16.1.17) is pointing towards a Spine, this may indicate an OSPF cost issue between ACI and the IPN devices. You can change the OSPF cost under the L3Out within the infra Tenant.
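A quick way to check this from any IPN switch (a sketch; the RP address is the one used throughout this paper):
show ip route 172.16.1.17 vrf IPN
The next hop should resolve via another IPN switch (or via loopback3 on the active RP itself), never via a VLAN 4 sub-interface facing a Spine.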
This was by far the most challenging aspect of the IPN build.