502 Bad Gateway error | Azure Application Gateway Troubleshooting

I was setting up an Azure Application Gateway for a project a couple of days back. The intended workload was Git hosted on NGINX. But when I tried to reach the Git URL, it failed with a 502 Bad Gateway error.

Initial Troubleshooting Steps

  • We tried to access the backend servers from the Application Gateway IP 10.62.124.6; the backend server IPs are 10.62.66.4 and 10.62.66.5. The Application Gateway is configured for SSL offload.
  • We were able to access the backend servers directly on port 80, but the 502 error occurred when they were accessed via the Application Gateway.
  • We rebooted the Application Gateway and the backend servers, and configured a custom probe as well. The issue appeared to be the request timeout value, which is configured for 30 seconds by default.
  • This means that when a user request is received on the Application Gateway, it is forwarded to the backend pool and the gateway waits up to 30 seconds for a response; if none arrives within that period, the user receives a 502 error.
  • The issue was temporarily resolved after the timeout on the backend HTTP settings was increased to 120 seconds (a sketch of that change follows this list).
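
A minimal sketch of that timeout change with the AzureRM PowerShell module (the gateway, resource group, and backend HTTP settings names are placeholders, not the actual values from this environment):

$gw = Get-AzureRmApplicationGateway -Name "<application gateway name>" -ResourceGroupName "<resource group name>"
# Update the backend HTTP settings with a 120-second request timeout
Set-AzureRmApplicationGatewayBackendHttpSettings -ApplicationGateway $gw -Name "<backend http settings name>" -Port 80 -Protocol Http -CookieBasedAffinity Disabled -RequestTimeout 120
# Commit the updated configuration back to Azure
Set-AzureRmApplicationGateway -ApplicationGateway $gw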

Real Deal

Increasing the timeout value was only a temporary workaround; we were unable to find a permanent fix on our own. I reached out to Microsoft Support, and they asked us to run the diagnostics below.

  • Execute the cmdlet below and share the results.

$getgw = Get-AzureRmApplicationGateway -Name <application gateway name> -ResourceGroupName <resource group name>
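
If useful when reviewing the results, the returned object exposes the relevant configuration directly (property names per the AzureRM module):

$getgw.BackendHttpSettingsCollection   # port, protocol, and the RequestTimeout value
$getgw.Probes                          # any custom probe configuration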

  • Collect simultaneous Network traces:
    1. Start network captures on the on-premises machine (source client machine) and the Azure VMs (backend servers)
      • Windows: netsh
        1. Command to start the trace: netsh trace start capture=yes report=yes maxsize=4096 tracefile=C:\Nettrace.etl
        2. Command to stop the trace: netsh trace stop
      • Linux: tcpdump
        1. tcpdump command (a filtering variant is noted after these steps): sudo tcpdump -i eth0 -s 0 -X -w vmtrace.cap
    2. Reproduce the behavior.
    3. Stop network captures.
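
As an optional refinement (not part of the support instructions), the Linux capture can be restricted to HTTP traffic to keep the file small, and read back with the same tool:

sudo tcpdump -i eth0 -s 0 port 80 -w vmtrace.cap   # capture only port 80 traffic
tcpdump -r vmtrace.cap                             # review the saved capture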

Analysis

The network traces collected on the client machine and the destination servers while the issue was being reproduced indicate that, during the capture window, the backend servers answered every HTTP GET request (the default probe) from the Application Gateway instances with an HTTP “Status: Forbidden” response.

This resulted in the Application Gateway marking the backend servers as unhealthy, since the expected probe response is HTTP 200 OK.
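
The unhealthy status can also be confirmed from the Azure side; a sketch using the AzureRM module (names are placeholders):

$health = Get-AzureRmApplicationGatewayBackendHealth -Name "<application gateway name>" -ResourceGroupName "<resource group name>"
# Drill into per-server health; the blocked servers should show up as Unhealthy here
$health.BackendAddressPools.BackendHttpSettingsCollection.Servers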
 
The Application Gateway “gitapp” is configured with two instances (internal instance IPs: 10.62.124.4 and 10.62.124.5).
 
Trace collected on backend server 10.62.66.4
 
12:50:18 PM 1/3/2017    10.62.124.4         10.62.66.4            HTTP      HTTP:Request, GET /
12:50:18 PM 1/3/2017    10.62.66.4            10.62.124.4         HTTP      HTTP:Response, HTTP/1.1, Status: Forbidden, URL: /
12:50:45 PM 1/3/2017    10.62.124.5         10.62.66.4            HTTP      HTTP:Request, GET /
12:50:45 PM 1/3/2017    10.62.66.4            10.62.124.5         HTTP      HTTP:Response, HTTP/1.1, Status: Forbidden, URL: /
 
Trace collected on backend server 10.62.66.5
 
12:50:48 PM 1/3/2017    10.62.124.4         10.62.66.5            HTTP      HTTP:Request, GET /
12:50:48 PM 1/3/2017    10.62.66.5            10.62.124.4         HTTP      HTTP:Response, HTTP/1.1, Status: Forbidden, URL: /
12:50:45 PM 1/3/2017    10.62.124.5         10.62.66.5            HTTP      HTTP:Request, GET /
12:50:45 PM 1/3/2017    10.62.66.5            10.62.124.5         HTTP      HTTP:Response, HTTP/1.1, Status: Forbidden, URL: /

Root Cause

The ‘rack_attack’ security feature (the Rack::Attack middleware) enabled on the backend servers had blacklisted the Application Gateway instance IPs, so the servers answered the gateway's probe requests with 403 Forbidden, causing the Application Gateway to mark the backend servers as unhealthy.

Fix

Once this feature was disabled on the backend web servers (NGINX), the issue was resolved and we could successfully access the web application through the Application Gateway.
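
As a sanity check after the change, the same backend health query from the analysis above should now report both servers as Healthy:

Get-AzureRmApplicationGatewayBackendHealth -Name "<application gateway name>" -ResourceGroupName "<resource group name>"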