Kubernetes Notes - Deployment Resources and Pods + Troubleshooting

2024-01-08 15:27:33

Test: Deployment resources and Pods

Create a Deployment named ngx-dep with the image nginx:1.15.9-alpine. No namespace is specified, so it lands in the default namespace:

kubectl create deploy ngx-dep --image=nginx:1.15.9-alpine
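The Deployment manifest that this command generates can be previewed before anything is created; a quick sketch using kubectl's client-side dry run (standard kubectl usage, not part of the original session):

###Preview the generated manifest without creating anything
kubectl create deploy ngx-dep --image=nginx:1.15.9-alpine --dry-run=client -o yaml

Then check that the Pod it manages is up: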
[root@master ~]# kubectl get pod -n default -o wide
NAME                       READY   STATUS    RESTARTS   AGE     IP               NODE    NOMINATED NODE   READINESS GATES
ngx-dep-6dbbc68b59-kn9sf   1/1     Running   0          3d17h   10.244.166.134   node1   <none>           <none>
Access the Pod:
[root@master ~]# curl 10.244.166.134
<!DOCTYPE html>
<html>
<head>
<title>Welcome to nginx!</title>
<style>
    body {
        width: 35em;
        margin: 0 auto;
        font-family: Tahoma, Verdana, Arial, sans-serif;
    }
</style>
</head>
<body>
<h1>Welcome to nginx!</h1>
<p>If you see this page, the nginx web server is successfully installed and
working. Further configuration is required.</p>

<p>For online documentation and support please refer to
<a href="http://nginx.org/">nginx.org</a>.<br/>
Commercial support is available at
<a href="http://nginx.com/">nginx.com</a>.</p>

<p><em>Thank you for using nginx.</em></p>
</body>
</html>

Inspect the Pod's attributes and get familiar with the five basic top-level fields: apiVersion, kind, metadata, spec, and status:

[root@master ~]# kubectl get pod ngx-dep-6dbbc68b59-kn9sf -n default -o yaml
apiVersion: v1     //API version
kind: Pod          //resource type
metadata:
  annotations:
    cni.projectcalico.org/containerID: 095c8d9f6b86da61f02ac0c9da401ad2eb51c973e7ae88a3bd2707f11a4ff7d4
    cni.projectcalico.org/podIP: 10.244.166.134/32
    cni.projectcalico.org/podIPs: 10.244.166.134/32
  creationTimestamp: "2023-12-29T08:17:42Z"
  generateName: ngx-dep-6dbbc68b59-
  labels:  //list of labels attached to this resource
    app: ngx-dep
    pod-template-hash: 6dbbc68b59
  name: ngx-dep-6dbbc68b59-kn9sf
  namespace: default
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: ReplicaSet
    name: ngx-dep-6dbbc68b59
    uid: 579bb2fe-784a-48dc-903c-9411211982b4
  resourceVersion: "143782"
  uid: 4acc3231-5ad9-4b21-abc2-453097a75b28
spec:
  containers:
  - image: nginx:1.15.9-alpine
    imagePullPolicy: IfNotPresent   //image pull policy: defaults to Always when the tag is latest and to IfNotPresent when a specific tag is set
    name: nginx
    resources: {}
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File   //how the termination message is recorded: File means the contents of the file at terminationMessagePath are used
    volumeMounts:
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-hs4zt
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  nodeName: node1
  preemptionPolicy: PreemptLowerPriority
  priority: 0
  restartPolicy: Always   //restart policy: the container is restarted whenever it exits, regardless of the reason, so it keeps running
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: default
  serviceAccountName: default
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  volumes:
  - name: kube-api-access-hs4zt
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          expirationSeconds: 3607
          path: token
      - configMap:
          items:
          - key: ca.crt
            path: ca.crt
          name: kube-root-ca.crt
      - downwardAPI:
          items:
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
            path: namespace
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2023-12-29T08:17:42Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2023-12-29T08:17:53Z"
    status: "True"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2023-12-29T08:17:53Z"
    status: "True"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2023-12-29T08:17:42Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: docker://991f544ea0a49e966a7653ae0ae05fb6c09047fb1dfd207b83106d8b8a159676
    image: nginx:1.15.9-alpine  //image name
    imageID: docker-pullable://nginx@sha256:55390addbb1a2b82e6ffabafd72e0f5dfbc8f86c2e7d9f41fb914cca537bd500
    lastState: {}
    name: nginx
    ready: true
    restartCount: 0
    started: true
    state:
      running:
        startedAt: "2023-12-29T08:17:53Z"
  hostIP: 192.168.204.131
  phase: Running
  podIP: 10.244.166.134
  podIPs:
  - ip: 10.244.166.134
  qosClass: BestEffort
  startTime: "2023-12-29T08:17:42Z"
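
When only a single field is needed out of this structure, a jsonpath query is handier than scrolling through the full YAML; the values below simply repeat what is already shown above (standard kubectl usage, not part of the original session):

###Extract individual fields with jsonpath
[root@master ~]# kubectl get pod ngx-dep-6dbbc68b59-kn9sf -n default -o jsonpath='{.status.podIP}{"\n"}'
10.244.166.134
[root@master ~]# kubectl get pod ngx-dep-6dbbc68b59-kn9sf -n default -o jsonpath='{.spec.containers[0].image}{"\n"}'
nginx:1.15.9-alpine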

Troubleshooting workflow:

Below are some diagnostic ideas and commands for reference.

After creating the Pod, it could be reached with curl and ping. After restarting the Pod, it took a long time to reach Running, and afterwards the Pod could no longer be reached with curl or ping, while the nodes all reported a normal status.

[root@master ~]# curl 10.244.166.135

curl: (7) Failed connect to 10.244.166.135:80; No route to host

(1) Check iptables: the default policy of the FORWARD chain should be ACCEPT.

[root@master ~]# iptables -L -n
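
If the default policy turns out to be DROP, forwarded Pod traffic is silently dropped on that node. A minimal sketch of checking and, if necessary, temporarily fixing it (the change is not persistent across reboots):

###Show only the FORWARD chain policy
[root@master ~]# iptables -S FORWARD | head -1
-P FORWARD ACCEPT
###Only run this if the policy printed as "-P FORWARD DROP"
[root@master ~]# iptables -P FORWARD ACCEPT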

(2) Check the Pod's events for errors.

[root@master ~]# kubectl describe pod ngx-dep-6dbbc68b59-9xzfq -n default
Events:
  Type     Reason                  Age                  From               Message
  ----     ------                  ----                 ----               -------
  Normal   Scheduled               2m33s                default-scheduler  Successfully assigned default/ngx-dep-6dbbc68b59-9xzfq to node2
  Warning  FailedCreatePodSandBox  2m32s                kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "ef9d392596c7f3386eb249ea6e1bc5f0bbc1e4e808d52ab3250261c8f3cc01c4" network for pod "ngx-dep-6dbbc68b59-9xzfq": networkPlugin cni failed to set up pod "ngx-dep-6dbbc68b59-9xzfq_default" network: error getting ClusterInformation: connection is unauthorized: Unauthorized, failed to clean up sandbox container "ef9d392596c7f3386eb249ea6e1bc5f0bbc1e4e808d52ab3250261c8f3cc01c4" network for pod "ngx-dep-6dbbc68b59-9xzfq": networkPlugin cni failed to teardown pod "ngx-dep-6dbbc68b59-9xzfq_default" network: error getting ClusterInformation: connection is unauthorized: Unauthorized]
  Normal   SandboxChanged          7s (x12 over 2m31s)  kubelet            Pod sandbox changed, it will be killed and re-created.

Searching for the error message suggests it is related to Calico. The Calico CNI log contains many ERROR entries.

[root@master ~]# more /var/log/calico/cni/cni.log
/var/lib/calico/nodename: 
no such file or directory: check that the calico/node container is running and has mounted /var/lib/calico/
2023-12-28 06:34:15.434 [ERROR][120891] plugin.go 580: Final result of CNI DEL was an error. error=stat /var/lib/calico/nodename: 
no such file or directory: check that the calico/node container is running and has mounted /var/lib/calico/
2023-12-28 06:34:16.270 [ERROR][120939] plugin.go 580: Final result of CNI DEL was an error. error=stat /var/lib/calico/nodename:

I found a suggested fix (reference: https://blog.csdn.net/a5534789/article/details/112848404):

kubectl delete -f <yaml>
Delete all Calico-related configuration files under /etc/cni/net.d/, then reboot the machine.
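
For reference, on a default Calico install that directory usually contains entries like the ones below (listing only; nothing is deleted at this point):

###See what is under the CNI config directory before removing anything
[root@master ~]# ls /etc/cni/net.d/
10-calico.conflist  calico-kubeconfig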

(3) I did not uninstall Calico, though. First, check the status of the Calico Pods across all namespaces (output trimmed to the Calico entries); some of them are crashing.

[root@master ~]# kubectl get pod -A
NAMESPACE          NAME                                       READY   STATUS             RESTARTS   AGE
calico-apiserver   calico-apiserver-784bc4f6cf-lpwwl          1/1     Running            18         4d20h
calico-apiserver   calico-apiserver-784bc4f6cf-s8k52          0/1     CrashLoopBackOff   31         4d20h
calico-system      calico-kube-controllers-7bb7d7b46d-wkgtp   0/1     CrashLoopBackOff   24         4d20h

Restart the CrashLoopBackOff Pods by deleting them so their controller recreates them.

[root@master ~]# kubectl delete pod calico-kube-controllers-7bb7d7b46d-wkgtp -n calico-system
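
Deleting the Pod makes its controller recreate it. An equivalent, sometimes tidier option (standard kubectl, not the command I actually ran here) is to restart the whole Deployment:

###Alternative: restart all Pods of the Deployment in one go
[root@master ~]# kubectl rollout restart deployment calico-kube-controllers -n calico-system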

Check the Calico Pod status again. The crashed Pod is now Running, but the original problem is still not solved: the newly created Pod still cannot be reached with curl. Keep digging.

[root@master ~]# kubectl get pods -n calico-system
NAME                                       READY   STATUS    RESTARTS   AGE
calico-kube-controllers-7bb7d7b46d-cq86s   1/1     Running   0          3m22s
calico-node-hbx55                          0/1     Running   0          24s
calico-node-m669s                          0/1     Running   0          11s
calico-node-q4q8h                          0/1     Running   0          49s
calico-typha-b9dcc579b-9xxtn               1/1     Running   0          4d20h
calico-typha-b9dcc579b-lqh76               1/1     Running   0          4d20h
csi-node-driver-pm2cg                      2/2     Running   0          4d20h
csi-node-driver-s9rpc                      2/2     Running   0          4d20h
csi-node-driver-ssgd5                      2/2     Running   0          4d20h
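
Note that the calico-node Pods above are Running but 0/1 Ready right after being recreated; if they stay not Ready, the readiness-probe events and container logs usually explain why (Pod name taken from the listing above):

###Check why a calico-node Pod is not becoming Ready
[root@master ~]# kubectl describe pod calico-node-hbx55 -n calico-system | grep -A3 Readiness
[root@master ~]# kubectl logs calico-node-hbx55 -n calico-system -c calico-node --tail=50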

(4) When it rains, it pours: kubectl commands have become very slow.

[root@master ~]# time kubectl get node
NAME     STATUS   ROLES                  AGE     VERSION
master   Ready    control-plane,master   4d20h   v1.21.0
node1    Ready    <none>                 4d20h   v1.21.0
node2    Ready    <none>                 4d20h   v1.21.0

real    0m15.155s
user    0m0.089s
sys     0m0.089s 
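
Fifteen seconds for a simple "kubectl get node" usually points at the API server rather than the client; timing a raw health check served by the kube-apiserver narrows it down (standard kubectl usage, not from the original session):

###Time a health check answered directly by the kube-apiserver
[root@master ~]# time kubectl get --raw='/readyz?verbose' | tail -3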

Next, check the kubelet status and logs.


###Containers that have exited (docker ps -a output trimmed to the relevant lines)
[root@master ~]# docker ps -a
Exited (255) 5 hours ago             k8s_calico-apiserver_calico-apiserver-784bc4f6cf-lpwwl_calico-apiserver_d94c60d1-8f39-499d-981b-a350c4946e6c_17
Exited (0) 5 hours ago               k8s_kube-apiserver_kube-apiserver-master_kube-system_8ca919a1f4463b2b2917640d376690ca_17

###Check the kubelet logs (too verbose to paste here)
[root@master ~]# journalctl -xefu kubelet

###Check the kubelet status: active (running), but with error events in the journal
[root@master ~]# systemctl status kubelet
● kubelet.service - kubelet: The Kubernetes Node Agent
   Loaded: loaded (/usr/lib/systemd/system/kubelet.service; enabled; vendor preset: disabled)
  Drop-In: /usr/lib/systemd/system/kubelet.service.d
           └─10-kubeadm.conf
   Active: active (running) since Thu 2023-12-28 05:22:38 EST; 4 days ago
     Docs: https://kubernetes.io/docs/
 Main PID: 99393 (kubelet)
    Tasks: 20
   Memory: 121.5M
   CGroup: /system.slice/kubelet.service
           └─99393 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/...

Jan 01 20:51:36 master kubelet[99393]: I0101 20:51:36.052774   99393 scope.go:111] "RemoveContainer" containerID="0e4cf0...fff40d"
Jan 01 20:51:36 master kubelet[99393]: E0101 20:51:36.054004   99393 pod_workers.go:190] "Error syncing pod, skipping" err="fai...
Jan 01 20:51:47 master kubelet[99393]: I0101 20:51:47.052526   99393 scope.go:111] "RemoveContainer" containerID="0e4cf0...fff40d"
Jan 01 20:51:47 master kubelet[99393]: E0101 20:51:47.052815   99393 pod_workers.go:190] "Error syncing pod, skipping" err="fai...
Jan 01 20:52:02 master kubelet[99393]: I0101 20:52:02.053678   99393 scope.go:111] "RemoveContainer" containerID="0e4cf0...fff40d"
Jan 01 20:52:02 master kubelet[99393]: E0101 20:52:02.054666   99393 pod_workers.go:190] "Error syncing pod, skipping" err="fai...
Jan 01 20:52:15 master kubelet[99393]: I0101 20:52:15.053453   99393 scope.go:111] "RemoveContainer" containerID="0e4cf0...fff40d"
Jan 02 00:46:11 master kubelet[99393]: E0102 00:46:11.405420   99393 kubelet.go:1877] "Skipping pod synchronization" err...s down"
Jan 02 00:47:33 master kubelet[99393]: E0102 00:47:33.328566   99393 kubelet.go:1877] "Skipping pod synchronization" err...s down"
Jan 02 00:47:33 master kubelet[99393]: E0102 00:47:33.535001   99393 kubelet.go:1877] "Skipping pod synchronization" err...s down"
Hint: Some lines were ellipsized, use -l to show in full.

###Check the status of the Pods in kube-system
[root@master net.d]# kubectl get pod -n kube-system
NAME                             READY   STATUS             RESTARTS   AGE
coredns-545d6fc579-258pf         0/1     ImagePullBackOff   0          4d21h
coredns-545d6fc579-rr4vm         0/1     ImagePullBackOff   0          4d21h

First, fix the coredns ImagePullBackOff. A post on the same issue explains the cause: the image coredns:v1.8.0 cannot be downloaded.

(Reference: the CSDN post "k8s解决coredns 的ImagePullBackOff 和ErrImagePull 问题")

###Check exactly which image the Pod is trying to pull:
[root@master net.d]# kubectl get pods coredns-545d6fc579-258pf -n kube-system -o yaml | grep image:
    image: registry.aliyuncs.com/google_containers/coredns/coredns:v1.8.0
  - image: registry.aliyuncs.com/google_containers/coredns/coredns:v1.8.0

###Log in to the node hosting the coredns Pod; the following has to be done on that host.
[root@node1 root]# docker images
###The registry path above has no image tagged v1.8.0, so it cannot be downloaded.
###Try pulling it anyway; a different problem appears first: connection refused. Deal with that before anything else.
[root@node1 root]# docker pull registry.aliyuncs.com/google_containers/coredns/coredns:v1.8.0
Error response from daemon: Get "https://registry.aliyuncs.com/v2/": dial tcp: lookup registry.aliyuncs.com on [::1]:53: read udp [::1]:36245->[::1]:53: read: connection refused
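
The lookup is going to [::1]:53, which means the node has no usable DNS resolver configured; before editing the interface file it is worth confirming what resolver the host currently has (a basic check, not in the original post):

###Check the resolver configuration on the node
[root@node1 root]# cat /etc/resolv.conf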

###Try setting the node's DNS server to 8.8.8.8
[root@node1 root]# vim /etc/sysconfig/network-scripts/ifcfg-ens33
DNS1=8.8.8.8

###Restart the network service after the change
systemctl restart network
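
After the restart, confirm that the node can resolve the registry hostname before retrying the pull (getent is part of glibc; nslookup requires bind-utils to be installed):

###Verify DNS resolution from the node
[root@node1 root]# getent hosts registry.aliyuncs.com
[root@node1 root]# nslookup registry.aliyuncs.com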

###Now the expected error shows up: that registry path really has no v1.8.0 tag, so this attempt fails.
[root@node1 root]# docker pull registry.aliyuncs.com/google_containers/coredns/coredns:v1.8.0
Error response from daemon: pull access denied for registry.aliyuncs.com/google_containers/coredns/coredns, repository does not exist or may require 'docker login': denied: requested access to the resource is denied

###Next, pull coredns/coredns:1.8.0 from Docker Hub directly and retag it to the name kubelet expects.
[root@node1 root]# docker pull coredns/coredns:1.8.0
[root@node1 root]# docker images
coredns/coredns                                                   1.8.0      296a6d5035e2   3 years ago    42.5MB

[root@node1 root]# sudo docker tag 296a6d5035e2 registry.aliyuncs.com/google_containers/coredns/coredns:v1.8.0
[root@node1 root]# docker images
registry.aliyuncs.com/google_containers/coredns/coredns           v1.8.0     296a6d5035e2   3 years ago    42.5MB
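
With the image now available locally under the name kubelet expects, the ImagePullBackOff Pods can be deleted so they are recreated and start with it; the same pull-and-retag has to be done on every node that may run coredns. A sketch, assuming the standard k8s-app=kube-dns label that coredns Pods carry:

###Recreate the coredns Pods so they pick up the retagged image
[root@master ~]# kubectl delete pod -n kube-system -l k8s-app=kube-dns
[root@master ~]# kubectl get pod -n kube-system | grep coredns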

Everything is repaired; check curl again. curl and ping to the Pod finally work.

A summary of useful check commands:
kubectl get pod -A
kubectl get pod xxx -n xxx -o yaml
kubectl logs xxx -n xxx
kubectl describe pod xxx -n xxx
systemctl status kubelet
docker ps -a | grep Exited 
journalctl -xefu kubelet
systemctl restart network
docker images
kubectl get ep
kubectl get svc

Calico configuration directory: /etc/cni/net.d/
log_file_path: /var/log/calico/cni/cni.log

The Kubernetes credentials used by the Calico CNI plugin are stored in /etc/cni/net.d/calico-kubeconfig.
The generated CNI configuration is written to /etc/cni/net.d/10-calico.conflist.
Calico keeps polling to check whether the etcd credentials have been updated; when they change, the new certificates are stored under /etc/cni/net.d/calico-tls/.

These are day-to-day notes on problems hit during experiments and how they were solved. If I took detours anywhere, pointers from more experienced folks are very welcome! :)

Source: https://blog.csdn.net/tuboshu1234/article/details/135378358