Step-by-Step Guide to Monitoring AKS in Azure: Setting Up Diagnostics, Log Analytics, and Alerts for CrashLoopBackOff Pods

In this blog post, we’ll walk through creating an Azure Kubernetes Service (AKS) cluster, enabling diagnostic settings, sending events to a Log Analytics workspace, and writing a KQL query that finds pods in a CrashLoopBackOff state. We’ll then set up alerts that notify you via email and SMS if a pod remains in this state for longer than 10 minutes.

1. Create AKS in Azure

To create an AKS cluster, follow these steps:

  1. Navigate to Azure Portal: Go to the Azure Portal.
  2. Create a Resource: Click on “Create a resource” and search for “Kubernetes Service”.
  3. Configure Basics:
    • Subscription: Select your subscription.
    • Resource Group: Create a new resource group or select an existing one.
    • Cluster Details: Provide a name for your cluster and choose the region.
    • Kubernetes Version: Select the desired Kubernetes version.
  4. Node Pools:
    • Node Size: Choose the VM size for your nodes.
    • Node Count: Set the initial number of nodes.
  5. Review and Create: Review the settings and click “Create”.
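The portal steps above can also be scripted. The following is a minimal sketch, assuming the Azure CLI (`az`) is installed and you are already logged in. The resource group, cluster name, region, and VM size are placeholder values chosen for illustration, not values from this guide. The script only prints the commands; call the `run` helper if you actually want to execute them.

```python
import subprocess


def build_aks_create_commands(resource_group, cluster_name, location,
                              node_count=2, node_vm_size="Standard_DS2_v2",
                              kubernetes_version=None):
    """Build Azure CLI commands that mirror the portal steps above."""
    create_group = [
        "az", "group", "create",
        "--name", resource_group,
        "--location", location,
    ]
    create_cluster = [
        "az", "aks", "create",
        "--resource-group", resource_group,
        "--name", cluster_name,
        "--location", location,
        "--node-count", str(node_count),
        "--node-vm-size", node_vm_size,
        "--generate-ssh-keys",
    ]
    # Only pin a Kubernetes version when one is given; otherwise the
    # CLI picks its default version for the region.
    if kubernetes_version is not None:
        create_cluster += ["--kubernetes-version", kubernetes_version]
    return [create_group, create_cluster]


def run(commands):
    """Execute each command in order, stopping at the first failure."""
    for command in commands:
        subprocess.run(command, check=True)


if __name__ == "__main__":
    # Placeholder names, for illustration only.
    for command in build_aks_create_commands("rg-aks-demo", "aks-demo", "westeurope"):
        print(" ".join(command))
```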

2. Enable Diagnostic Settings

Once the AKS cluster is created, enable diagnostic settings to monitor the cluster:

  1. Go to AKS Cluster: Navigate to your AKS cluster in the Azure Portal.
  2. Diagnostic Settings: Under “Monitoring”, select “Diagnostic settings”.
  3. Add Diagnostic Setting:
    • Name: Provide a name for the diagnostic setting.
    • Log Analytics Workspace: Select the Log Analytics workspace where you want to send the logs.
    • Logs: Select the logs you want to collect (e.g., “kube-apiserver”, “kube-controller-manager”, “kube-scheduler”, etc.).
  4. Save: Click “Save” to apply the settings.
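If you prefer to script this step, the sketch below builds the equivalent `az monitor diagnostic-settings create` command. It assumes the Azure CLI is installed. The setting name and both resource IDs are placeholders that you would replace with your own cluster and workspace IDs.

```python
import json


def build_diagnostic_setting_command(setting_name, cluster_resource_id,
                                     workspace_resource_id, categories):
    """Build the CLI command that enables the given log categories."""
    # Each category becomes one entry in the JSON array passed to --logs.
    logs = [{"category": category, "enabled": True} for category in categories]
    return [
        "az", "monitor", "diagnostic-settings", "create",
        "--name", setting_name,
        "--resource", cluster_resource_id,
        "--workspace", workspace_resource_id,
        "--logs", json.dumps(logs),
    ]


if __name__ == "__main__":
    # Placeholder IDs, for illustration only.
    command = build_diagnostic_setting_command(
        setting_name="aks-diagnostics",
        cluster_resource_id="<aks-cluster-resource-id>",
        workspace_resource_id="<log-analytics-workspace-resource-id>",
        categories=["kube-apiserver", "kube-controller-manager", "kube-scheduler"],
    )
    print(command)
```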

3. Send Events to Log Analytics Workspace

Ensure that events from your AKS cluster are being sent to your Log Analytics workspace:

  1. Log Analytics Workspace: Navigate to the Log Analytics workspace linked to your AKS cluster.
  2. Logs: Verify that logs are being collected by clicking on “Logs” under the workspace and checking for incoming data.
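The same check can be done from a terminal. This is a rough sketch, assuming the Azure CLI with its `log-analytics` extension is available; the workspace GUID is a placeholder. The query simply counts the records received per table in the last hour, so it does not depend on any particular table name.

```python
# KQL that lists every table that received data, with a record count.
INGESTION_CHECK_QUERY = (
    "union withsource=SourceTable * "
    "| summarize Records = count() by SourceTable "
    "| order by Records desc"
)


def build_ingestion_check_command(workspace_guid, timespan="PT1H"):
    """Build a CLI command that runs the ingestion check query."""
    return [
        "az", "monitor", "log-analytics", "query",
        "--workspace", workspace_guid,
        "--analytics-query", INGESTION_CHECK_QUERY,
        "--timespan", timespan,  # ISO 8601 duration, here the last hour
    ]


if __name__ == "__main__":
    # Placeholder workspace GUID, for illustration only.
    print(build_ingestion_check_command("<workspace-guid>"))
```

An empty result after the cluster has been running for a while suggests that nothing is reaching the workspace yet.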

4. Create KQL Query to Look for Pods in CrashLoopBackOff

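Now, create a KQL query to identify pods in a CrashLoopBackOff state. The KubePodInventory table used below is populated by Container insights, so make sure it is enabled on the cluster; the control-plane logs collected through diagnostic settings do not contain pod inventory data.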

  1. Go to Logs: In your Log Analytics workspace, click on “Logs”.
  2. KQL Query: Enter the following KQL query to find pods in the CrashLoopBackOff state:
    • KubePodInventory | where ContainerRestartCount > 0 and ContainerStatusReason == "CrashLoopBackOff" | project TimeGenerated, ClusterName, Namespace, Name, ContainerID, ContainerName, ContainerRestartCount, ContainerStatusReason
  3. Run Query: Click “Run” to execute the query and verify it returns the expected results.
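To make the filter easier to reason about, here is a small sketch that applies the same two conditions to a few hand-written sample rows. The rows are invented for illustration, and the sketch assumes the status reason is stored in the `ContainerStatusReason` column of KubePodInventory.

```python
# Invented sample rows, shaped like a few KubePodInventory columns.
SAMPLE_ROWS = [
    {"Namespace": "default", "Name": "web-1", "ContainerName": "web",
     "ContainerRestartCount": 7, "ContainerStatusReason": "CrashLoopBackOff"},
    {"Namespace": "default", "Name": "api-1", "ContainerName": "api",
     "ContainerRestartCount": 2, "ContainerStatusReason": ""},
    {"Namespace": "jobs", "Name": "batch-1", "ContainerName": "batch",
     "ContainerRestartCount": 0, "ContainerStatusReason": "ContainerCreating"},
]


def find_crashloop_containers(rows):
    """Keep rows that restarted at least once and report CrashLoopBackOff."""
    return [
        row for row in rows
        if row["ContainerRestartCount"] > 0
        and row["ContainerStatusReason"] == "CrashLoopBackOff"
    ]


if __name__ == "__main__":
    for row in find_crashloop_containers(SAMPLE_ROWS):
        print(row["Namespace"], row["Name"], row["ContainerRestartCount"])
```

A container that has restarted but is currently running (`api-1` here) is not matched, because both conditions must hold.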

5. Create Alerts for CrashLoopBackOff State

Set up an alert to notify you if a pod is in the CrashLoopBackOff state for longer than 10 minutes:

  1. Go to Alerts: In the Azure Portal, go to “Monitor” and select “Alerts”.
  2. Create Alert Rule:
    • Scope: Select the resource (your AKS cluster).
    • Condition: Click on “Add condition” and choose “Custom log search”. Paste your KQL query.
    • Condition Settings: Set the evaluation frequency to “10 minutes” and set the threshold so that the alert fires when the number of results is greater than 0.
  3. Action Groups:
    • Add Action Group: Create a new action group.
      • Action Group Name: Provide a name.
      • Short Name: Provide a short name.
      • Notifications: Add your email address and your phone number for SMS.
  4. Alert Details: Provide a name for the alert rule and set the severity.
  5. Create Alert Rule: Review the settings and click “Create alert rule”.
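An alert that fires on "more than 0 results" only tells you that a matching record exists in the evaluation window. The sketch below illustrates one possible way to express the stricter "longer than 10 minutes" condition: a pod counts as stuck when its earliest and latest CrashLoopBackOff records are at least 10 minutes apart. This is a simplification for illustration (it does not check for gaps between records), the record fields are assumed to match the KubePodInventory columns, and the sample data is invented.

```python
from datetime import datetime, timedelta


def pods_stuck_in_crashloop(records, min_duration=timedelta(minutes=10)):
    """Return (namespace, pod) pairs seen in CrashLoopBackOff for min_duration."""
    first_seen = {}
    last_seen = {}
    for record in records:
        if record["ContainerStatusReason"] != "CrashLoopBackOff":
            continue
        pod = (record["Namespace"], record["Name"])
        timestamp = record["TimeGenerated"]
        # Track the earliest and latest CrashLoopBackOff record per pod.
        first_seen[pod] = min(first_seen.get(pod, timestamp), timestamp)
        last_seen[pod] = max(last_seen.get(pod, timestamp), timestamp)
    return sorted(
        pod for pod in first_seen
        if last_seen[pod] - first_seen[pod] >= min_duration
    )


if __name__ == "__main__":
    start = datetime(2024, 1, 1, 12, 0)
    # Invented records: "web-1" crashes for 12 minutes, "api-1" for 5.
    records = [
        {"Namespace": "default", "Name": "web-1", "TimeGenerated": start,
         "ContainerStatusReason": "CrashLoopBackOff"},
        {"Namespace": "default", "Name": "web-1",
         "TimeGenerated": start + timedelta(minutes=12),
         "ContainerStatusReason": "CrashLoopBackOff"},
        {"Namespace": "default", "Name": "api-1", "TimeGenerated": start,
         "ContainerStatusReason": "CrashLoopBackOff"},
        {"Namespace": "default", "Name": "api-1",
         "TimeGenerated": start + timedelta(minutes=5),
         "ContainerStatusReason": "CrashLoopBackOff"},
    ]
    print(pods_stuck_in_crashloop(records))
```

The same idea can be expressed in KQL by summarizing the minimum and maximum TimeGenerated per pod before applying the threshold.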

Summary Table

| Step | Description | Actions |
|------|-------------|---------|
| 1 | Create AKS in Azure | Navigate to Azure Portal, create a Kubernetes Service, configure basic settings and node pools, and create the cluster. |
| 2 | Enable Diagnostic Settings | Go to the AKS cluster, enable diagnostics, select the logs to collect, and save the settings. |
| 3 | Send Events to Log Analytics Workspace | Ensure events are sent to the Log Analytics workspace and verify log collection. |
| 4 | Create KQL Query for CrashLoopBackOff | Write and run a KQL query in Log Analytics to find CrashLoopBackOff pods. |
| 5 | Create Alerts for CrashLoopBackOff State | Set up an alert rule, configure the condition and action group, and specify notifications. |

By following these steps, you will have monitoring and alerting in place for your AKS cluster, with email and SMS notifications whenever a pod is stuck in the CrashLoopBackOff state.