Troubleshooting

Cluster API Troubleshooting

Verifying the pods

When a problem occurs with Cluster API, the first thing to do is to check that all the pods involved are running.

Check that these four pods have the STATUS Running:

kubectl get pods -A

NAMESPACE                           NAME                                                            READY   STATUS    RESTARTS   AGE
capi-kubeadm-bootstrap-system       capi-kubeadm-bootstrap-controller-manager-7dc44947-hrmvc        1/1     Running   0          36m
capi-kubeadm-control-plane-system   capi-kubeadm-control-plane-controller-manager-cb9d954f5-r8w54   1/1     Running   0          36m
capi-system                         capi-controller-manager-7594c7bc57-jr75r                        1/1     Running   0          37m
capvcd-system                       capvcd-controller-manager-89758d745-kw4sm                       1/1     Running   0          13s
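
On a cluster with many pods, the full listing can be long to read. As a minimal sketch, assuming the default column layout shown above (STATUS in the fourth column), the helper below prints only the pods whose status is not Running. The function name list_not_running is illustrative and not part of any tool. Note that pods from finished jobs (status Completed) would also be listed.

```shell
# Hypothetical helper: reads `kubectl get pods -A` output on stdin and prints
# "namespace/name status" for every pod whose STATUS column is not Running.
# Assumes the default columns: NAMESPACE NAME READY STATUS RESTARTS AGE.
list_not_running() {
  awk 'NR > 1 && $4 != "Running" { print $1 "/" $2, $4 }'
}

# Usage (requires access to the management cluster):
#   kubectl get pods -A | list_not_running
```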

If one of these pods does not have a Running status, two things can be checked to get more information about the issue:

Describe the pod in error
Get the logs of the pod in error

Example:
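
A minimal sketch of these two checks, using standard kubectl commands. The function name inspect_pod and the --tail=100 limit are illustrative choices, not part of the original procedure.

```shell
# Hypothetical helper: show the description, then the most recent log lines, of one pod.
# $1 = namespace, $2 = pod name (both as printed by `kubectl get pods -A`).
inspect_pod() {
  kubectl describe pod -n "$1" "$2"
  kubectl logs -n "$1" "$2" --tail=100
}

# Usage, with the CAPVCD pod from the listing above:
#   inspect_pod capvcd-system capvcd-controller-manager-89758d745-kw4sm
```

If the container keeps restarting, adding --previous to the kubectl logs command shows the logs of the previous, crashed instance.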

Check the VCD provider

If an issue occurs with actions performed on vCloud Director during a cluster creation, upgrade, or scaling operation, the problem may come from the Cluster API provider.

The corresponding pod is capvcd-controller-manager, in the capvcd-system namespace.

Errors can be found in its logs:

An option exists to display more logs about the communication between the provider and vCloud Director.

To enable it, run this command:
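
As a sketch, assuming the namespace and deployment name shown in the pod listing above, the provider logs can be filtered for lines that mention an error. The function name capvcd_errors, the 500-line window, and the search pattern are illustrative.

```shell
# Hypothetical helper: print the CAPVCD controller log lines that mention an error.
# `kubectl logs deployment/<name>` reads the logs of a pod belonging to the deployment.
capvcd_errors() {
  kubectl logs -n capvcd-system deployment/capvcd-controller-manager --tail=500 \
    | grep -i 'error'
}

# Usage:
#   capvcd_errors
```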

kubectl set env -n capvcd-system deployment/capvcd-controller-manager GOVCD_LOG_ON_SCREEN=true -oyaml

This option can be very verbose, so remember to disable it once your diagnosis is complete:

kubectl set env -n capvcd-system deployment/capvcd-controller-manager GOVCD_LOG_ON_SCREEN-
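
To confirm whether the variable is currently present, kubectl set env accepts a --list flag that prints the environment of the deployment without modifying it. The sketch below wraps that check; the function name govcd_logging_state is illustrative, and it assumes the variable appears as a KEY=value line in the listing.

```shell
# Hypothetical helper: report whether GOVCD_LOG_ON_SCREEN is set on the deployment.
# `kubectl set env --list` only displays the environment; it changes nothing.
govcd_logging_state() {
  if kubectl set env -n capvcd-system deployment/capvcd-controller-manager --list \
      | grep -q '^GOVCD_LOG_ON_SCREEN='; then
    echo "set"
  else
    echo "unset"
  fi
}

# Usage:
#   govcd_logging_state
```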

Check the Cluster API objects

Cluster API uses several types of objects to describe the Kubernetes cluster it manages.

The idea is to explore these objects step by step to find the one that reports an error, either in its status or in the output of a describe.

Depending on whether the issue concerns worker nodes, master nodes, or the overall cluster, use the diagram above to choose which objects to inspect.

1. Get the objects to find their exact names and check their status

2. Describe the object that has an issue

3. Repeat steps 1 and 2 for the other objects until you find the error.
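
The steps above can be sketched as a loop over the object kinds. This assumes the resource names installed by Cluster API (clusters, kubeadmcontrolplanes, machinedeployments, machinesets, machines) and by the VCD provider (vcdclusters, vcdmachines); adjust the list to what is actually installed on your management cluster. The function name list_capi_objects is illustrative.

```shell
# Hypothetical helper: list the Cluster API objects of one namespace, kind by kind.
# $1 = namespace that holds the cluster objects.
# Assumption: the kinds below match the CRDs installed on the management cluster.
list_capi_objects() {
  for kind in clusters vcdclusters kubeadmcontrolplanes machinedeployments machinesets machines vcdmachines; do
    echo "== $kind =="
    kubectl get "$kind" -n "$1"
  done
}

# Usage:
#   list_capi_objects <namespace>                          # step 1: names and statuses
#   kubectl describe machine <machine-name> -n <namespace> # step 2: inspect a suspect object
```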

Export logs script

VMware provides a script that exports the logs, along with some information about the cluster, into a tarball.

generate-k8s-log-bundle.sh

Node deployment troubleshooting

Check kubelet status

systemctl status kubelet

Journalctl

journalctl -xeu containerd

journalctl -xeu kubelet

Several log files can be inspected to troubleshoot an issue during deployment:

Cloud-init

/var/log/cloud-init-output.log

/var/log/capvcd/customization/status.log

/var/log/capvcd/customization/error.log

containerd

/var/log/containers/*
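
As a starting point, these files can be searched for lines that mention an error or a failure. The sketch below is a generic grep wrapper; the function name scan_logs and the search pattern are illustrative, and the pattern may need tuning to reduce noise.

```shell
# Hypothetical helper: print, prefixed with file name and line number, the lines
# that mention an error or failure in the given log files.
# Files that do not exist or are not readable are skipped.
scan_logs() {
  for f in "$@"; do
    if [ -r "$f" ]; then
      grep -inHE 'error|fail' "$f" || true
    fi
  done
}

# Usage on a node:
#   scan_logs /var/log/cloud-init-output.log /var/log/capvcd/customization/*.log
```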
