Install and upgrade issues on Kubernetes
Occasionally, you might encounter issues during installation and upgrade of YugabyteDB Anywhere on Kubernetes. You can troubleshoot most of these issues.
If you experience difficulties while troubleshooting, contact Yugabyte Support.
Pod scheduling failure
YugabyteDB Anywhere pod scheduling can fail for a variety of reasons, such as insufficient resource allocation, a mismatch in the node selector or affinity, an incorrect storage class configuration, or problems with Elastic Block Store (EBS). Typically, this manifests as pods remaining in the Pending state for a long time.
For additional information, see Node selection in kube-scheduler.
Insufficient resources
To start diagnostics, execute the following command to obtain the pod:
kubectl get pod -n <NAMESPACE>
If the issue you are experiencing is due to a pod scheduling failure, expect to see STATUS as Pending, as per the following output:
NAME READY STATUS RESTARTS AGE
yw-test-yugaware-0 0/4 Pending 0 2m30s
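When a namespace holds many pods, it can help to narrow the listing to the pods stuck in a given state. The following is a minimal sketch rather than part of the YugabyteDB Anywhere tooling: `pods_with_status` is a hypothetical helper name, and it assumes the default `kubectl get pod` column layout shown above (NAME, READY, STATUS, and so on).

```shell
# Hypothetical helper: print the names of pods whose STATUS column equals $1.
# Reads the default `kubectl get pod` table from stdin and skips the header row.
pods_with_status() {
  awk -v want="$1" 'NR > 1 && $3 == want { print $1 }'
}

# Usage (requires access to the cluster):
#   kubectl get pod -n <NAMESPACE> | pods_with_status Pending
```

As an alternative, `kubectl get pod -n <NAMESPACE> --field-selector=status.phase=Pending` filters on the pod phase instead of the displayed STATUS text.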
Execute the following command to obtain detailed information about the pod and failure:
kubectl describe pod <POD_NAME> -n <NAMESPACE>
Expect to see a Message similar to the following:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 56s default-scheduler 0/2 nodes are available: 2 Insufficient cpu.
For more information, see Kubernetes: Specify a CPU request that is too big for your nodes.
Resolution
- Ensure that you have enough resources in the Kubernetes cluster to schedule the YugabyteDB Anywhere pods. For more information, see Prerequisites - Kubernetes.
- Modify the YugabyteDB Anywhere pods resources configuration. For more information, see Modify resources.
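To judge whether the cluster can satisfy the CPU requests at all, you can add up the allocatable CPU reported by the nodes. The sketch below assumes its input comes from `kubectl get nodes -o custom-columns=NAME:.metadata.name,CPU:.status.allocatable.cpu`; the function name `total_allocatable_millicpu` is hypothetical. The result is an upper bound only, because pods that are already running consume part of the allocatable capacity.

```shell
# Hypothetical helper: sum the CPU column (whole cores such as "4" or
# millicores such as "3920m") and print the total in millicores.
# Reads a two-column table (NAME, CPU) with a header row from stdin.
total_allocatable_millicpu() {
  awk 'NR > 1 {
    v = $2
    if (v ~ /m$/) { sub(/m$/, "", v); total += v }   # already in millicores
    else          { total += v * 1000 }              # whole cores
  }
  END { print total + 0 }'
}

# Usage (requires access to the cluster):
#   kubectl get nodes -o custom-columns=NAME:.metadata.name,CPU:.status.allocatable.cpu \
#     | total_allocatable_millicpu
```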
Mismatch in node selector, affinity, taints, tolerations
To start diagnostics, execute the following command to obtain detailed information about the failure:
kubectl describe pod <POD_NAME> -n <NAMESPACE>
If the issue you are experiencing is due to the mismatched node selector or affinity, expect to see a Message similar to the following:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 75s (x40 over 55m) default-scheduler 0/55 nodes are available: 19 Insufficient cpu, 36 node(s) didn't match Pod's node affinity/selector
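A FailedScheduling message can combine several causes in one line, as in the output above. The following sketch splits such a message into one reason per line so that each cause can be addressed separately. The helper name `scheduling_reasons` is hypothetical, and the parsing assumes the "nodes are available: ..." wording shown above, with reasons separated by commas.

```shell
# Hypothetical helper: extract the per-node reasons from FailedScheduling
# events. Reads `kubectl describe pod` output from stdin.
scheduling_reasons() {
  sed -n 's/.*nodes are available: //p' \
    | tr ',' '\n' \
    | sed -e 's/^ *//' -e 's/\.$//'   # trim leading spaces and a trailing period
}

# Usage (requires access to the cluster):
#   kubectl describe pod <POD_NAME> -n <NAMESPACE> | scheduling_reasons
```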
Resolution
Ensure that the node labels and taints match the node selector, affinity, and tolerations configured for the YugabyteDB Anywhere pods. Otherwise, the scheduler cannot find an eligible node.
Storage class VolumeBindingMode is not set to WaitForFirstConsumer
During multi-zone deployment of YugabyteDB, start diagnostics by executing the following command to obtain detailed information about the failure:
kubectl describe pod <POD_NAME> -n <NAMESPACE>
If the issue you are experiencing is due to the incorrect setting for the storage class VolumeBindingMode, expect to see a Message similar to the following:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 75s (x40 over 55m) default-scheduler 0/55 nodes are available: 19 Insufficient cpu, 36 node(s) didn't match Pod's node affinity/selector
You can obtain information related to storage classes, as follows:
- Get the VolumeBindingMode setting information from all storage classes in the universe by executing the following command:
kubectl get storageclass
Expect an output similar to the following:
NAME PROVISIONER RECLAIMPOLICY VOLUMEBINDINGMODE ALLOWVOLUMEEXPANSION AGE
premium-rwo pd.csi.storage.gke.io Delete WaitForFirstConsumer true 418d
standard (default) kubernetes.io/gce-pd Delete Immediate true 418d
standard-rwo pd.csi.storage.gke.io Delete WaitForFirstConsumer true 418d
- If a storage class named standard was defined in the universe, you can obtain information about a specific storage class by executing the following command:
kubectl describe storageclass standard
Expect an output similar to the following:
Name: standard
IsDefaultClass: Yes
Annotations: storageclass.kubernetes.io/is-default-class=true
Provisioner: kubernetes.io/gce-pd
Parameters: type=pd-standard
AllowVolumeExpansion: True
MountOptions: <none>
ReclaimPolicy: Delete
VolumeBindingMode: Immediate
Events: <none>
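With many storage classes, scanning the table by eye is error-prone, partly because the "(default)" marker shifts the columns. The sketch below lists the classes whose binding mode is anything other than WaitForFirstConsumer. It assumes a two-column input produced with `-o custom-columns`, and the helper name `immediate_binding_classes` is hypothetical.

```shell
# Hypothetical helper: print storage classes whose volume binding mode is
# not WaitForFirstConsumer. Reads "NAME MODE" pairs (no header) from stdin.
immediate_binding_classes() {
  awk 'NF >= 2 && $2 != "WaitForFirstConsumer" { print $1 }'
}

# Usage (requires access to the cluster):
#   kubectl get storageclass --no-headers \
#     -o custom-columns=NAME:.metadata.name,MODE:.volumeBindingMode \
#     | immediate_binding_classes
```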
Resolution
Because a VolumeBindingMode other than WaitForFirstConsumer might result in the universe creating the volume in a different zone than the selected zone, ensure that the StorageClass used during YugabyteDB Anywhere deployment has its VolumeBindingMode set to WaitForFirstConsumer. You can use the following command:
kubectl get storageclass standard -ojson \
| jq '.volumeBindingMode="WaitForFirstConsumer" | del(.metadata.managedFields, .metadata.creationTimestamp, .metadata.resourceVersion, .metadata.uid)' \
| kubectl replace --force -f -
Elastic Block Store (EBS) controller is missing in the Elastic Kubernetes Service
Execute the following command to check events for the persistent volume claim (PVC):
kubectl describe pvc <PVC_NAME> -n <NAMESPACE>
If the issue you are experiencing is due to the missing EBS controller, expect an output similar to the following:
waiting for a volume to be created, either by external provisioner "ebs.csi.aws.com" or manually created by system administrator
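The message names the provisioner that the PVC is waiting on, which you can then look up among the CSI drivers registered in the cluster. The following sketch extracts that name. It assumes the double-quoted message format shown above (other Kubernetes versions may word or quote the message differently), and `awaited_provisioner` is a hypothetical helper name.

```shell
# Hypothetical helper: print the external provisioner a PVC is waiting for.
# Reads `kubectl describe pvc` output from stdin.
awaited_provisioner() {
  sed -n 's/.*external provisioner "\([^"]*\)".*/\1/p' | sort -u
}

# Usage (requires access to the cluster):
#   kubectl describe pvc <PVC_NAME> -n <NAMESPACE> | awaited_provisioner
#   kubectl get csidriver ebs.csi.aws.com   # fails if the driver is not installed
```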
Resolution
Follow the instructions provided in Troubleshoot AWS EBS volumes.
Pod failure to run
In some cases, a scheduled pod fails to run and errors are thrown.
ImagePullBackOff and ErrImagePull errors
A Kubernetes pod may encounter these errors when it fails to pull the container images from a private container registry and the pod enters the ImagePullBackOff state.
The following are some of the specific reasons for the errors:
- Incorrect image path.
- Network failure or limitation.
- The kubelet node agent cannot authenticate with the container registry.
To start diagnostics, execute the following command to obtain the pod:
kubectl get pod -n <NAMESPACE>
If the issue you are experiencing is due to the image pull error, expect to initially see the ErrImagePull error, and on subsequent attempts the ImagePullBackOff error listed under STATUS, as per the following output:
NAME READY STATUS RESTARTS AGE
yw-test-yugaware-0 0/4 Init:ErrImagePull 0 3
NAME READY STATUS RESTARTS AGE
yw-test-yugaware-0 0/4 Init:ImagePullBackOff 0 2m10s
Execute the following command to obtain detailed information about the pod and failure:
kubectl describe pod <POD_NAME> -n <NAMESPACE>
Expect an output similar to the following:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Pulling 25s (x3 over 75s) kubelet Pulling image "quay.io/yugabyte/yugaware:2.16.0.0-b90"
Warning Failed 22s (x3 over 72s) kubelet Failed to pull image "quay.io/yugabyte/yugaware:2.16.0.0-b90": rpc error: code = Unknown desc = failed to pull and unpack image "quay.io/yugabyte/yugaware:2.16.0.0-b90": failed to resolve reference "quay.io/yugabyte/yugaware:2.16.0.0-b90": pulling from host quay.io failed with status code [manifests 2.16.0.0-b90]: 401 UNAUTHORIZED
Warning Failed 22s (x3 over 72s) kubelet Error: ErrImagePull
Normal BackOff 7s (x3 over 72s) kubelet Back-off pulling image "quay.io/yugabyte/yugaware:2.16.0.0-b90"
Warning Failed 7s (x3 over 72s) kubelet Error: ImagePullBackOff
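The events above carry both the image that failed and a hint about the cause (here, 401 UNAUTHORIZED points at registry authentication). As a rough sketch, the two helpers below pull those details out of the describe output. Both names are hypothetical, and the patterns assume the kubelet message format shown above.

```shell
# Hypothetical helper: list the distinct images the kubelet failed to pull.
# Reads `kubectl describe pod` output from stdin.
failed_images() {
  sed -n 's/.*Failed to pull image "\([^"]*\)".*/\1/p' | sort -u
}

# Hypothetical helper: succeed if the registry rejected the credentials.
pull_was_unauthorized() {
  grep -q '401 UNAUTHORIZED'
}

# Usage (requires access to the cluster):
#   events=$(kubectl describe pod <POD_NAME> -n <NAMESPACE>)
#   printf '%s\n' "$events" | failed_images
#   printf '%s\n' "$events" | pull_was_unauthorized && echo "check the pull secret"
```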
Resolution
- To resolve the Bad pull secret, No pull secret, and Bad pull secret name errors, enable the pull secret to fetch the images from the YugabyteDB Quay.io registry and ensure that you have applied it in the namespace that will be used to install YugabyteDB Anywhere. By default, the Helm chart searches for a secret named yugabyte-k8s-pull-secret. For more information, see values.yaml.
- To resolve the Unable to pull image error, ensure that the Kubernetes nodes can connect to Quay.io or that you have the images in the local registry. For more information, see Pull and push yugabytedb docker images to private container registry.
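One quick way to confirm that the pull secret exists in the target namespace is to look for it in the secret listing. The sketch below is illustrative only: `has_pull_secret` is a hypothetical helper, it assumes the default `kubectl get secret` columns (NAME, TYPE, DATA, AGE), and it assumes the secret has the standard registry-credential type `kubernetes.io/dockerconfigjson`.

```shell
# Hypothetical helper: succeed if a registry pull secret with the given name
# (default: yugabyte-k8s-pull-secret) appears in `kubectl get secret` output.
has_pull_secret() {
  awk -v name="${1:-yugabyte-k8s-pull-secret}" '
    $1 == name && $2 == "kubernetes.io/dockerconfigjson" { found = 1 }
    END { exit found ? 0 : 1 }'
}

# Usage (requires access to the cluster):
#   kubectl get secret -n <NAMESPACE> | has_pull_secret && echo "pull secret present"
```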
CrashLoopBackOff error
There are a number of reasons for the CrashLoopBackOff error. It typically occurs when a YugabyteDB Anywhere pod crashes due to an internal application error.
To start diagnostics, execute the following command to obtain the pod:
kubectl get pod -n <NAMESPACE>
If the issue you are experiencing is due to the CrashLoopBackOff error, expect to see this error listed under STATUS, as per the following output:
NAME READY STATUS RESTARTS AGE
yugabyte-platform-1-yugaware-0 3/4 CrashLoopBackOff 2 4d14h
Resolution
- Execute the following command to obtain detailed information about the YugabyteDB Anywhere pod experiencing the CrashLoopBackOff error:
kubectl describe pods <POD_NAME> -n <NAMESPACE>
- Execute the following commands to check YugabyteDB Anywhere logs for a specific container and perform troubleshooting based on the information in the logs:
# YugabyteDB Anywhere
kubectl logs <POD_NAME> -n <NAMESPACE> -c yugaware
# PostgreSQL
kubectl logs <POD_NAME> -n <NAMESPACE> -c postgres
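The describe output also records how each crashed container last terminated, which can narrow the search before reading the logs. As a small sketch, the hypothetical helper `last_exit_codes` below prints the exit codes found in that output. It assumes the usual "Exit Code:" lines that `kubectl describe pod` prints under a container's Last State.

```shell
# Hypothetical helper: print the exit codes reported for terminated containers.
# Reads `kubectl describe pod` output from stdin.
last_exit_codes() {
  awk '/Exit Code:/ { print $NF }'
}

# Usage (requires access to the cluster):
#   kubectl describe pods <POD_NAME> -n <NAMESPACE> | last_exit_codes
```

An exit code of 137 (128 + 9) means the container was stopped by SIGKILL, which often indicates that it exceeded its memory limit. To read the logs of the crashed instance of a container instead of the restarted one, add the `--previous` flag to `kubectl logs`.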
Load balancer service is not ready
The load balancer might not be ready to provide services to a running YugabyteDB Anywhere instance.
Incompatible load balancer
The internet-facing load balancer may not perform as expected because the default AWS load balancer used in Amazon Elastic Kubernetes Service (EKS) by the YugabyteDB Anywhere Helm chart is not suitable for your configuration.
Resolution
Use the following settings to customize the AWS load balancer controller behavior:
- Set aws-load-balancer-scheme to the internal or internet-facing string value.
- Set aws-load-balancer-backend-protocol and aws-load-balancer-healthcheck-protocol to the http string value.
The following is a sample configuration:
service.beta.kubernetes.io/aws-load-balancer-type: "ip"
service.beta.kubernetes.io/aws-load-balancer-scheme: "internet-facing"
service.beta.kubernetes.io/aws-load-balancer-backend-protocol: "http"
service.beta.kubernetes.io/aws-load-balancer-healthcheck-protocol: "http"
Pending state of load balancer IP assignment
Due to a variety of reasons, such as the absence of the load balancer controller or exceeded public IP quota, the load balancer IP assignment might enter a continuous pending state.
To start diagnostics, execute the following command to obtain the Kubernetes service (svc) information:
kubectl get svc -n <NAMESPACE>
If the issue you are experiencing is due to IP assignment for the load balancer, expect to see an output similar to the following:
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/yw-test-yugaware-ui LoadBalancer 10.4.1.7 <pending> 80:30553/TCP,9090:32507/TCP 15s
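To spot this condition across several services at once, you can filter the listing for LoadBalancer services that have no external IP yet. This is a minimal sketch: `pending_load_balancers` is a hypothetical helper, and it assumes the default column layout shown above (NAME, TYPE, CLUSTER-IP, EXTERNAL-IP, ...).

```shell
# Hypothetical helper: print LoadBalancer services whose EXTERNAL-IP is still
# <pending>. Reads `kubectl get svc` output from stdin and skips the header.
pending_load_balancers() {
  awk 'NR > 1 && $2 == "LoadBalancer" && $4 == "<pending>" { print $1 }'
}

# Usage (requires access to the cluster):
#   kubectl get svc -n <NAMESPACE> | pending_load_balancers
```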
Resolution
Typically, cloud providers supply a load balancer controller that serves the load balancer type service. You need to verify whether or not the universe has the load balancer controller, as follows:
- If a load balancer controller is absent in Google Kubernetes Engine (GKE), follow instructions provided in GKE ingress overview.
- If the public IP is not associated with the load balancer, you might have exceeded the public IP quota limit.
- If you are using Minikube, run the minikube tunnel command.
Other issues
A number of other issues can occur while installing and upgrading YugabyteDB Anywhere on Kubernetes.
Cross-Origin Resource Sharing (CORS) error
You might encounter a CORS error while accessing YugabyteDB Anywhere through a load balancer. The condition can manifest as the initial setup or login attempts failing or resulting in a blank screen.
To start diagnostics, check the developer tools of your browser for any errors. In addition, check the logs of the YugabyteDB Anywhere pod by executing the following command:
kubectl logs <POD_NAME> -n <NAMESPACE> -c yugaware
If the issue you are experiencing is due to the load balancer access-related CORS error, expect to see an error message similar to the following:
2023-01-09T10:48:08.898Z [warn] 57fe083d-6ebb-49ab-bbaa-5e6576040d62 AbstractCORSPolicy.scala:311 [application-akka.actor.default-dispatcher-10275] play.filters.cors.CORSFilter Invalid CORS request;Origin=Some(https://localhost:8080);Method=POST;Access-Control-Request-Headers=None
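The Origin value in the log entry is the address the browser used, and it is the value that needs to be allowed. The sketch below collects the distinct rejected origins from the logs. The helper name `rejected_origins` is hypothetical, and the pattern assumes the `Origin=Some(...)` format shown above.

```shell
# Hypothetical helper: print the distinct origins rejected by the CORS filter.
# Reads the yugaware container logs from stdin.
rejected_origins() {
  sed -n 's/.*Origin=Some(\([^)]*\)).*/\1/p' | sort -u
}

# Usage (requires access to the cluster):
#   kubectl logs <POD_NAME> -n <NAMESPACE> -c yugaware | rejected_origins
```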
Resolution
Specify correct domain names during the Helm installation or upgrade, as per instructions provided in Set a DNS name.
PVC expansion error
This error manifests itself in an inability to expand the PVC via the helm upgrade command. The error message should look similar to the following:
Error: UPGRADE FAILED: cannot patch "yw-test-yugaware-storage" with kind PersistentVolumeClaim: persistentvolumeclaims "yw-test-yugaware-storage" is forbidden: only dynamically provisioned pvc can be resized and the storageclass that provisions the pvc must support resize
To start diagnostics, execute the following command to obtain information about the storage class:
kubectl describe sc <STORAGE_CLASS>
For example:
kubectl describe sc test-sc
The following output shows that the AllowVolumeExpansion parameter of the storage class is set to false:
Name: test-sc
IsDefaultClass: No
Provisioner: kubernetes.io/gce-pd
Parameters: type=pd-standard
AllowVolumeExpansion: False
MountOptions: <none>
ReclaimPolicy: Delete
VolumeBindingMode: Immediate
Events: <none>
Resolution
- Set the AllowVolumeExpansion parameter to true to expand the PVC, as follows:
kubectl get storageclass <STORAGE_CLASS> -o json \
| jq '.allowVolumeExpansion=true | del(.metadata.managedFields, .metadata.creationTimestamp, .metadata.resourceVersion, .metadata.uid)' \
| kubectl replace --force -f -
- Use the following command to verify that true is returned:
kubectl get storageclass <STORAGE_CLASS> -o json | jq '.allowVolumeExpansion'
- Increase the storage size using Helm upgrade and then execute the following command to obtain the persistent volume information:
kubectl describe pvc <PVC_NAME> -n <NAMESPACE>
Expect to see events for the PVC listed via an output similar to the following:
Normal ExternalExpanding 95s volume_expand CSI migration enabled for kubernetes.io/gce-pd; waiting for external resizer to expand the pvc
Warning VolumeResizeFailed 85s external-resizer pd.csi.storage.gke.io resize volume "pvc-71315a47-d93a-4751-b48e-c7bfc365ae19" by resizer "pd.csi.storage.gke.io" failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded
Normal Resizing 84s (x2 over 95s) external-resizer pd.csi.storage.gke.io External resizer is resizing volume pvc-71315a47-d93a-4751-b48e-c7bfc365ae19
Normal FileSystemResizeRequired 84s external-resizer pd.csi.storage.gke.io Require file system resize of volume on node
Normal FileSystemResizeSuccessful 44s kubelet MountVolume.NodeExpandVolume succeeded for volume "pvc-71315a47-d93a-4751-b48e-c7bfc365ae19"
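In the events above, an initial VolumeResizeFailed is followed by a successful retry, so a single failure event does not necessarily mean the expansion failed. The sketch below summarizes the events into one status. `resize_status` is a hypothetical helper that relies only on the event reasons shown above. It treats any FileSystemResizeSuccessful event as success regardless of order, so it can misreport when a later resize fails after an earlier one succeeded.

```shell
# Hypothetical helper: summarize PVC resize events as succeeded, failed, or
# pending. Reads `kubectl describe pvc` output from stdin.
resize_status() {
  awk '
    /FileSystemResizeSuccessful/ { ok = 1 }
    /VolumeResizeFailed/         { failed = 1 }
    END {
      if (ok)          print "succeeded"
      else if (failed) print "failed"
      else             print "pending"
    }'
}

# Usage (requires access to the cluster):
#   kubectl describe pvc <PVC_NAME> -n <NAMESPACE> | resize_status
```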