Step-by-Step Guide to Monitoring AKS in Azure
In this blog post, we’ll walk you through creating an Azure Kubernetes Service (AKS) cluster, enabling diagnostic settings, and sending events to a Log Analytics workspace. We’ll then write a KQL query that looks for pods in a CrashLoopBackOff state and set up alerts that notify you via email and SMS if a pod remains in this state for longer than 10 minutes.
1. Create AKS in Azure
To create an AKS cluster, follow these steps:
- Navigate to Azure Portal: Go to the Azure Portal.
- Create a Resource: Click on “Create a resource” and search for “Kubernetes Service”.
- Configure Basics:
  - Subscription: Select your subscription.
  - Resource Group: Create a new resource group or select an existing one.
  - Cluster Details: Provide a name for your cluster and choose the region.
  - Kubernetes Version: Select the desired Kubernetes version.
- Node Pools:
  - Node Size: Choose the VM size for your nodes.
  - Node Count: Set the initial number of nodes.
- Review and Create: Review the settings and click “Create”.
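If you would rather script this step than click through the portal, the same choices map onto Azure CLI arguments. The sketch below only builds the argument lists for `az group create` and `az aks create`; it does not run them. The helper names are hypothetical, and the resource group, cluster name, region, VM size, and node count are placeholder assumptions rather than recommendations.

```python
import shlex
from typing import List, Optional


def build_group_create(resource_group: str, location: str) -> List[str]:
    """Argument list for creating the resource group that will hold the cluster."""
    return ["az", "group", "create", "--name", resource_group, "--location", location]


def build_aks_create(
    resource_group: str,
    cluster_name: str,
    location: str,
    node_count: int,
    node_vm_size: str,
    kubernetes_version: Optional[str] = None,
) -> List[str]:
    """Argument list mirroring the portal's Basics and Node Pools settings."""
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    args = [
        "az", "aks", "create",
        "--resource-group", resource_group,
        "--name", cluster_name,
        "--location", location,
        "--node-count", str(node_count),
        "--node-vm-size", node_vm_size,
        "--generate-ssh-keys",
    ]
    if kubernetes_version is not None:
        # Leaving the flag out lets Azure choose its default version.
        args += ["--kubernetes-version", kubernetes_version]
    return args


if __name__ == "__main__":
    # Every value here is a placeholder (assumption), not part of the guide.
    print(shlex.join(build_group_create("rg-aks-demo", "westeurope")))
    print(shlex.join(build_aks_create("rg-aks-demo", "aks-demo", "westeurope", 3, "Standard_DS2_v2")))
```

Printing the commands first makes it easy to review them before pasting into a shell; check `az aks create --help` for the options available in your CLI version.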
2. Enable Diagnostic Settings
Once the AKS cluster is created, enable diagnostic settings to monitor the cluster:
- Go to AKS Cluster: Navigate to your AKS cluster in the Azure Portal.
- Diagnostic Settings: Under “Monitoring”, select “Diagnostic settings”.
- Add Diagnostic Setting: Click “Add diagnostic setting” and fill in the following:
  - Name: Provide a name for the diagnostic setting.
  - Log Analytics Workspace: Select “Send to Log Analytics workspace” and choose the workspace where you want to send the logs.
  - Logs: Select the log categories you want to collect (e.g., “kube-apiserver”, “kube-controller-manager”, “kube-scheduler”, etc.).
- Save: Click “Save” to apply the settings.
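The portal form above corresponds roughly to one `az monitor diagnostic-settings create` call. As a minimal sketch, the helper below (a hypothetical name) assembles that call's arguments, including the JSON list of log categories, without executing anything. The resource IDs in the usage example are placeholders, and the category list simply repeats the three examples named in this step.

```python
import json
from typing import List, Sequence

# The three control-plane categories mentioned in this step; extend as needed.
CONTROL_PLANE_CATEGORIES = ("kube-apiserver", "kube-controller-manager", "kube-scheduler")


def build_diagnostic_setting_create(
    setting_name: str,
    cluster_resource_id: str,
    workspace_resource_id: str,
    categories: Sequence[str] = CONTROL_PLANE_CATEGORIES,
) -> List[str]:
    """Argument list that routes the chosen log categories to a workspace."""
    if not categories:
        raise ValueError("select at least one log category")
    logs = [{"category": category, "enabled": True} for category in categories]
    return [
        "az", "monitor", "diagnostic-settings", "create",
        "--name", setting_name,
        "--resource", cluster_resource_id,
        "--workspace", workspace_resource_id,
        "--logs", json.dumps(logs),
    ]


if __name__ == "__main__":
    # Placeholder IDs (assumptions); substitute your own subscription and names.
    cluster_id = (
        "/subscriptions/<subscription-id>/resourceGroups/rg-aks-demo"
        "/providers/Microsoft.ContainerService/managedClusters/aks-demo"
    )
    workspace_id = (
        "/subscriptions/<subscription-id>/resourceGroups/rg-aks-demo"
        "/providers/Microsoft.OperationalInsights/workspaces/law-aks-demo"
    )
    for part in build_diagnostic_setting_create("aks-control-plane", cluster_id, workspace_id):
        print(part)
```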
3. Send Events to Log Analytics Workspace
Ensure that events from your AKS cluster are being sent to your Log Analytics workspace:
- Log Analytics Workspace: Navigate to the Log Analytics workspace linked to your AKS cluster.
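- Logs: Verify that logs are being collected by clicking on “Logs” under the workspace and checking for incoming data.
- Container Insights: The KubePodInventory table used in the next step is populated by Container insights (the AKS monitoring add-on), not by diagnostic settings. If the table is missing or empty, enable Container insights for the cluster and point it at the same workspace.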
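One quick way to check for incoming data is to count recent rows per table. The sketch below generates such a KQL query as a string that you can paste into the “Logs” blade. The function name and the 30-minute lookback are assumptions; `AzureDiagnostics` is the table that diagnostic settings commonly write to when resource-specific tables are not selected, so adjust the table names to whatever your workspace actually contains.

```python
def build_ingestion_check(table: str, lookback_minutes: int = 30) -> str:
    """Return a KQL query that counts rows recently ingested into `table`."""
    if lookback_minutes <= 0:
        raise ValueError("lookback_minutes must be positive")
    return (
        f"{table}\n"
        f"| where TimeGenerated > ago({lookback_minutes}m)\n"
        f"| summarize Records = count(), Latest = max(TimeGenerated)"
    )


if __name__ == "__main__":
    # Table names are assumptions about a typical workspace; adjust as needed.
    for table in ("AzureDiagnostics", "KubePodInventory"):
        print(build_ingestion_check(table))
        print()
```

A `Records` value of 0 for a table usually means that source is not yet connected or that data has not arrived; ingestion can lag by several minutes after the setting is saved.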
4. Create KQL Query to Look for Pods in CrashLoopBackOff
Now, create a KQL query to identify pods in a CrashLoopBackOff state:
- Go to Logs: In your Log Analytics workspace, click on “Logs”.
- KQL Query: Enter the following KQL query to find pods in the CrashLoopBackOff state:
KubePodInventory | where ContainerRestartCount > 0 and ContainerStatusReason == "CrashLoopBackOff" | project TimeGenerated, ClusterName, Namespace, Name, ContainerID, ContainerName, ContainerRestartCount, ContainerStatusReason
- Run Query: Click “Run” to execute the query and verify it returns the expected results.
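The query returns every matching inventory record, so a pod that crashed once and recovered shows up alongside one that has been failing continuously. To make the “longer than 10 minutes” idea concrete, here is a small offline sketch that applies one possible interpretation to rows you have already exported: a pod qualifies only if its most recent unbroken run of CrashLoopBackOff samples spans at least 10 minutes. The `PodSample` type, the function name, and the sample data are all hypothetical, and this is not how the alert rule itself evaluates results.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class PodSample:
    """One exported inventory row, reduced to the fields this sketch needs."""
    timestamp: datetime
    namespace: str
    pod: str
    reason: str


def pods_crashlooping_for(
    samples: Iterable[PodSample],
    minimum: timedelta = timedelta(minutes=10),
) -> List[Tuple[str, str]]:
    """Return (namespace, pod) pairs whose latest unbroken CrashLoopBackOff run lasts >= minimum."""
    run_start: Dict[Tuple[str, str], datetime] = {}
    last_seen: Dict[Tuple[str, str], datetime] = {}
    for sample in sorted(samples, key=lambda s: s.timestamp):
        key = (sample.namespace, sample.pod)
        if sample.reason == "CrashLoopBackOff":
            run_start.setdefault(key, sample.timestamp)  # first sample of the run
            last_seen[key] = sample.timestamp
        else:
            # Any other status ends the current run for this pod.
            run_start.pop(key, None)
            last_seen.pop(key, None)
    return sorted(key for key, start in run_start.items() if last_seen[key] - start >= minimum)


if __name__ == "__main__":
    t0 = datetime(2024, 1, 1, 12, 0)  # made-up sample data
    rows = [
        PodSample(t0, "default", "api", "CrashLoopBackOff"),
        PodSample(t0 + timedelta(minutes=6), "default", "api", "CrashLoopBackOff"),
        PodSample(t0 + timedelta(minutes=12), "default", "api", "CrashLoopBackOff"),
        PodSample(t0, "default", "worker", "CrashLoopBackOff"),
        PodSample(t0 + timedelta(minutes=5), "default", "worker", ""),
    ]
    print(pods_crashlooping_for(rows))
```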
5. Create Alerts for CrashLoopBackOff State
Set up an alert to notify you if a pod is in the CrashLoopBackOff state for longer than 10 minutes:
- Go to Alerts: In the Azure Portal, go to “Monitor” and select “Alerts”.
- Create Alert Rule: Start a new alert rule and configure the following:
  - Scope: Select the resource (your AKS cluster).
  - Condition: Click on “Add condition” and choose “Custom log search”. Paste your KQL query.
  - Condition Settings: Set both the evaluation frequency and the time window the query looks back over to “10 minutes”, and set the threshold to trigger when the number of results is greater than 0.
  - Action Groups:
    - Add Action Group: Create a new action group.
    - Action Group Name: Provide a name.
    - Short Name: Provide a short name (12 characters or fewer).
    - Notifications: Add your EPAM email address and phone number for SMS.
  - Alert Details: Provide a name for the alert rule and set the severity.
- Create Alert Rule: Review the settings and click “Create alert rule”.
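For repeatable setups, the action group and the alert rule can also be expressed as Azure CLI calls. The sketch below builds the argument lists for `az monitor action-group create` and `az monitor scheduled-query create` (the latter ships in a CLI extension) without running them. The helper names, receiver names, placeholder label `CrashLoopQuery`, and all sample values are assumptions, and the flags reflect my understanding of the CLI reference, so confirm them with `--help` before relying on this.

```python
from typing import List


def build_action_group_create(
    resource_group: str,
    name: str,
    short_name: str,
    email: str,
    sms_country_code: str,
    sms_number: str,
) -> List[str]:
    """Argument list for an action group with one email and one SMS receiver."""
    if len(short_name) > 12:
        raise ValueError("action group short names are limited to 12 characters")
    return [
        "az", "monitor", "action-group", "create",
        "--resource-group", resource_group,
        "--name", name,
        "--short-name", short_name,
        # Receiver names ("oncall-email", "oncall-sms") are illustrative.
        "--action", "email", "oncall-email", email,
        "--action", "sms", "oncall-sms", sms_country_code, sms_number,
    ]


def build_crashloop_alert_create(
    resource_group: str,
    rule_name: str,
    scope_id: str,
    action_group_id: str,
    query: str,
    minutes: int = 10,
    severity: int = 2,
) -> List[str]:
    """Argument list for a log alert that fires when the query returns any rows."""
    return [
        "az", "monitor", "scheduled-query", "create",
        "--resource-group", resource_group,
        "--name", rule_name,
        "--scopes", scope_id,
        # 'CrashLoopQuery' is a placeholder label bound to the KQL text below.
        "--condition", "count 'CrashLoopQuery' > 0",
        "--condition-query", f"CrashLoopQuery={query}",
        "--evaluation-frequency", f"{minutes}m",
        "--window-size", f"{minutes}m",
        "--severity", str(severity),
        "--action-groups", action_group_id,
    ]


if __name__ == "__main__":
    # Placeholder values (assumptions); paste in the KQL query from step 4.
    print(build_action_group_create("rg-aks-demo", "aks-oncall", "aksoncall", "you@example.com", "1", "5550100"))
    print(build_crashloop_alert_create("rg-aks-demo", "crashloop-10m", "<cluster-resource-id>", "<action-group-id>", "<your KQL query>"))
```

Keeping the query as a parameter means the rule always uses exactly the text you validated in the “Logs” blade.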
Summary Table
| Step | Description | Actions |
|---|---|---|
| 1 | Create AKS in Azure | Navigate to Azure Portal, create a Kubernetes Service, configure basic settings, node pools, and create the cluster. |
| 2 | Enable Diagnostic Settings | Go to AKS cluster, enable diagnostics, select logs to collect, and save settings. |
| 3 | Send Events to Log Analytics Workspace | Ensure events are sent to Log Analytics workspace, verify logs collection. |
| 4 | Create KQL Query for CrashLoopBackOff | Write and run a KQL query in Log Analytics to find CrashLoopBackOff pods. |
| 5 | Create Alerts for CrashLoopBackOff State | Set up alert rule, configure condition and action group, specify notifications. |
By following these steps, you will have monitoring and alerting in place for your AKS cluster, with email and SMS notifications when pods enter the CrashLoopBackOff state.