Kubernetes Notes - Deployment Resources and Pods + Troubleshooting
Test: Deployment resources and Pods
Create a deployment named ngx-dep with image nginx:1.15.9-alpine; no namespace is specified, so it lands in the default namespace:
kubectl create deploy ngx-dep --image=nginx:1.15.9-alpine
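For reference, the same deployment can be written declaratively; this is a rough equivalent manifest inferred from the create command above (apply with kubectl apply -f ngx-dep.yaml):
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ngx-dep
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ngx-dep
  template:
    metadata:
      labels:
        app: ngx-dep
    spec:
      containers:
      - name: nginx
        image: nginx:1.15.9-alpine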
[root@master ~]# kubectl get pod -n default -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
ngx-dep-6dbbc68b59-kn9sf 1/1 Running 0 3d17h 10.244.166.134 node1 <none> <none>
Access the pod:
[root@master ~]# curl 10.244.166.134
<!DOCTYPE html>
<html>
<head>
<title>Welcome to nginx!</title>
<style>
body {
width: 35em;
margin: 0 auto;
font-family: Tahoma, Verdana, Arial, sans-serif;
}
</style>
</head>
<body>
<h1>Welcome to nginx!</h1>
<p>If you see this page, the nginx web server is successfully installed and
working. Further configuration is required.</p>
<p>For online documentation and support please refer to
<a href="http://nginx.org/">nginx.org</a>.<br/>
Commercial support is available at
<a href="http://nginx.com/">nginx.com</a>.</p>
<p><em>Thank you for using nginx.</em></p>
</body>
</html>
[root@master ~]# kubectl descript ngx-dep-6dbbc68b59-kn9sf -n default -o yaml
Error: unknown command "descript" for "kubectl"
Did you mean this?
describe
Run 'kubectl --help' for usage.
Inspect the pod's attributes to see the five basic top-level fields: apiVersion, kind, metadata, spec, and status:
[root@master ~]# kubectl get pod ngx-dep-6dbbc68b59-kn9sf -n default -o yaml
apiVersion: v1    # API version
kind: Pod         # resource type
metadata:
annotations:
cni.projectcalico.org/containerID: 095c8d9f6b86da61f02ac0c9da401ad2eb51c973e7ae88a3bd2707f11a4ff7d4
cni.projectcalico.org/podIP: 10.244.166.134/32
cni.projectcalico.org/podIPs: 10.244.166.134/32
creationTimestamp: "2023-12-29T08:17:42Z"
generateName: ngx-dep-6dbbc68b59-
labels:    # labels attached to this resource
app: ngx-dep
pod-template-hash: 6dbbc68b59
name: ngx-dep-6dbbc68b59-kn9sf
namespace: default
ownerReferences:
- apiVersion: apps/v1
blockOwnerDeletion: true
controller: true
kind: ReplicaSet
name: ngx-dep-6dbbc68b59
uid: 579bb2fe-784a-48dc-903c-9411211982b4
resourceVersion: "143782"
uid: 4acc3231-5ad9-4b21-abc2-453097a75b28
spec:
containers:
- image: nginx:1.15.9-alpine
imagePullPolicy: IfNotPresent    # image pull policy: defaults to Always when the tag is latest, to IfNotPresent when a specific tag is set
name: nginx
resources: {}
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File    # with File, the termination message is recorded at the path set by terminationMessagePath
volumeMounts:
- mountPath: /var/run/secrets/kubernetes.io/serviceaccount
name: kube-api-access-hs4zt
readOnly: true
dnsPolicy: ClusterFirst
enableServiceLinks: true
nodeName: node1
preemptionPolicy: PreemptLowerPriority
priority: 0
restartPolicy: Always    # restart policy: the container is restarted whenever it exits, regardless of the reason, so it keeps running
schedulerName: default-scheduler
securityContext: {}
serviceAccount: default
serviceAccountName: default
terminationGracePeriodSeconds: 30
tolerations:
- effect: NoExecute
key: node.kubernetes.io/not-ready
operator: Exists
tolerationSeconds: 300
- effect: NoExecute
key: node.kubernetes.io/unreachable
operator: Exists
tolerationSeconds: 300
volumes:
- name: kube-api-access-hs4zt
projected:
defaultMode: 420
sources:
- serviceAccountToken:
expirationSeconds: 3607
path: token
- configMap:
items:
- key: ca.crt
path: ca.crt
name: kube-root-ca.crt
- downwardAPI:
items:
- fieldRef:
apiVersion: v1
fieldPath: metadata.namespace
path: namespace
status:
conditions:
- lastProbeTime: null
lastTransitionTime: "2023-12-29T08:17:42Z"
status: "True"
type: Initialized
- lastProbeTime: null
lastTransitionTime: "2023-12-29T08:17:53Z"
status: "True"
type: Ready
- lastProbeTime: null
lastTransitionTime: "2023-12-29T08:17:53Z"
status: "True"
type: ContainersReady
- lastProbeTime: null
lastTransitionTime: "2023-12-29T08:17:42Z"
status: "True"
type: PodScheduled
containerStatuses:
- containerID: docker://991f544ea0a49e966a7653ae0ae05fb6c09047fb1dfd207b83106d8b8a159676
image: nginx:1.15.9-alpine    # image name
imageID: docker-pullable://nginx@sha256:55390addbb1a2b82e6ffabafd72e0f5dfbc8f86c2e7d9f41fb914cca537bd500
lastState: {}
name: nginx
ready: true
restartCount: 0
started: true
state:
running:
startedAt: "2023-12-29T08:17:53Z"
hostIP: 192.168.204.131
phase: Running
podIP: 10.244.166.134
podIPs:
- ip: 10.244.166.134
qosClass: BestEffort
startTime: "2023-12-29T08:17:42Z"
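To extract a single field from this object instead of dumping the full YAML, -o jsonpath works well; for example, pulling out the pod IP seen above:
[root@master ~]# kubectl get pod ngx-dep-6dbbc68b59-kn9sf -n default -o jsonpath='{.status.podIP}'
10.244.166.134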
Troubleshooting workflow:
Here are some diagnostic ideas and commands for reference.
Symptom: right after the pod was created, curl and ping to it worked. After the pod was restarted, it took a long time to reach Running, and afterwards curl and ping to the pod stopped working, while all nodes reported a normal status.
[root@master ~]# curl 10.244.166.135
curl: (7) Failed connect to 10.244.166.135:80; No route to host
(1) Check iptables: the default policy of the FORWARD chain should be ACCEPT.
[root@master ~]# iptables -L -n
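To check just the FORWARD chain, and to reset its policy if it turned out to be DROP (not the case here), something like:
[root@master ~]# iptables -L FORWARD -n | head -1
Chain FORWARD (policy ACCEPT)
[root@master ~]# iptables -P FORWARD ACCEPT    # only needed if the policy is DROP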
(2) Check the pod's events for errors:
[root@master ~]# kubectl describe pod ngx-dep-6dbbc68b59-9xzfq -n default
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 2m33s default-scheduler Successfully assigned default/ngx-dep-6dbbc68b59-9xzfq to node2
Warning FailedCreatePodSandBox 2m32s kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "ef9d392596c7f3386eb249ea6e1bc5f0bbc1e4e808d52ab3250261c8f3cc01c4" network for pod "ngx-dep-6dbbc68b59-9xzfq": networkPlugin cni failed to set up pod "ngx-dep-6dbbc68b59-9xzfq_default" network: error getting ClusterInformation: connection is unauthorized: Unauthorized, failed to clean up sandbox container "ef9d392596c7f3386eb249ea6e1bc5f0bbc1e4e808d52ab3250261c8f3cc01c4" network for pod "ngx-dep-6dbbc68b59-9xzfq": networkPlugin cni failed to teardown pod "ngx-dep-6dbbc68b59-9xzfq_default" network: error getting ClusterInformation: connection is unauthorized: Unauthorized]
Normal SandboxChanged 7s (x12 over 2m31s) kubelet Pod sandbox changed, it will be killed and re-created.
I looked up the error; it appears to be related to Calico. The Calico log contains many ERROR entries:
[root@master ~]# more /var/log/calico/cni/cni.log
/var/lib/calico/nodename: no such file or directory: check that the calico/node container is running and has mounted /var/lib/calico/
2023-12-28 06:34:15.434 [ERROR][120891] plugin.go 580: Final result of CNI DEL was an error. error=stat /var/lib/calico/nodename: no such file or directory: check that the calico/node container is running and has mounted /var/lib/calico/
2023-12-28 06:34:16.270 [ERROR][120939] plugin.go 580: Final result of CNI DEL was an error. error=stat /var/lib/calico/nodename:
A possible fix was found (reference: https://blog.csdn.net/a5534789/article/details/112848404):
kubectl delete -f <yaml>
then delete all Calico-related configuration files under /etc/cni/net.d/ and reboot the machine (a rough sketch of these steps follows).
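A sketch of what that reference describes (I did not run all of it; <calico-install-manifest>.yaml is a placeholder for whatever manifest Calico was originally installed from, and the file names under /etc/cni/net.d/ are the ones listed at the end of these notes):
[root@master ~]# kubectl delete -f <calico-install-manifest>.yaml   # remove the Calico resources
[root@master ~]# rm -f /etc/cni/net.d/10-calico.conflist /etc/cni/net.d/calico-kubeconfig
[root@master ~]# reboot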
(3) However, I did not uninstall Calico. First, check the status of the Calico pods; there are crashed pods (only the Calico-related rows are shown below):
[root@master ~]# kubectl get pod -A
NAMESPACE NAME READY STATUS RESTARTS AGE
calico-apiserver calico-apiserver-784bc4f6cf-lpwwl 1/1 Running 18 4d20h
calico-apiserver calico-apiserver-784bc4f6cf-s8k52 0/1 CrashLoopBackOff 31 4d20h
calico-system calico-kube-controllers-7bb7d7b46d-wkgtp 0/1 CrashLoopBackOff 24 4d20h
Restart the CrashLoopBackOff pods by deleting them (their controller recreates them):
[root@master ~]# kubectl delete pod calico-kube-controllers-7bb7d7b46d-wkgtp -n calico-system
Check the Calico pod status again: the crashed pod is now Running. But my problem is still not solved; newly created pods still cannot be reached with curl. Keep checking.
[root@master ~]# kubectl get pods -n calico-system
NAME READY STATUS RESTARTS AGE
calico-kube-controllers-7bb7d7b46d-cq86s 1/1 Running 0 3m22s
calico-node-hbx55 0/1 Running 0 24s
calico-node-m669s 0/1 Running 0 11s
calico-node-q4q8h 0/1 Running 0 49s
calico-typha-b9dcc579b-9xxtn 1/1 Running 0 4d20h
calico-typha-b9dcc579b-lqh76 1/1 Running 0 4d20h
csi-node-driver-pm2cg 2/2 Running 0 4d20h
csi-node-driver-s9rpc 2/2 Running 0 4d20h
csi-node-driver-ssgd5 2/2 Running 0 4d20h
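The calico-node pods show READY 0/1 while Running: the containers are up but the readiness probe has not passed yet (they had only just restarted). If they stay not-ready, the reason is usually visible in the pod events, e.g.:
[root@master ~]# kubectl describe pod calico-node-hbx55 -n calico-system | grep -A 10 Events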
(4) When it rains, it pours: running kubectl commands has become very slow.
[root@master ~]# time kubectl get node
NAME STATUS ROLES AGE VERSION
master Ready control-plane,master 4d20h v1.21.0
node1 Ready <none> 4d20h v1.21.0
node2 Ready <none> 4d20h v1.21.0
real 0m15.155s
user 0m0.089s
sys 0m0.089s
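To see where the time goes, kubectl's verbosity flag prints every API request it makes together with the round-trip time:
[root@master ~]# time kubectl get node -v=6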
Next, check the kubelet status and logs, and the container states on the node.
[root@master ~]# docker ps -a
Exited (255) 5 hours ago k8s_calico-apiserver_calico-apiserver-784bc4f6cf-lpwwl_calico-apiserver_d94c60d1-8f39-499d-981b-a350c4946e6c_17
Exited (0) 5 hours ago k8s_kube-apiserver_kube-apiserver-master_kube-system_8ca919a1f4463b2b2917640d376690ca_17
### View the kubelet logs (too many lines to paste here).
[root@master ~]# journalctl -xefu kubelet
### Check the kubelet status: active (running), but with some error events.
[root@master ~]# systemctl status kubelet
● kubelet.service - kubelet: The Kubernetes Node Agent
Loaded: loaded (/usr/lib/systemd/system/kubelet.service; enabled; vendor preset: disabled)
Drop-In: /usr/lib/systemd/system/kubelet.service.d
└─10-kubeadm.conf
Active: active (running) since Thu 2023-12-28 05:22:38 EST; 4 days ago
Docs: https://kubernetes.io/docs/
Main PID: 99393 (kubelet)
Tasks: 20
Memory: 121.5M
CGroup: /system.slice/kubelet.service
└─99393 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/...
Jan 01 20:51:36 master kubelet[99393]: I0101 20:51:36.052774 99393 scope.go:111] "RemoveContainer" containerID="0e4cf0...fff40d"
Jan 01 20:51:36 master kubelet[99393]: E0101 20:51:36.054004 99393 pod_workers.go:190] "Error syncing pod, skipping" err="fai...
Jan 01 20:51:47 master kubelet[99393]: I0101 20:51:47.052526 99393 scope.go:111] "RemoveContainer" containerID="0e4cf0...fff40d"
Jan 01 20:51:47 master kubelet[99393]: E0101 20:51:47.052815 99393 pod_workers.go:190] "Error syncing pod, skipping" err="fai...
Jan 01 20:52:02 master kubelet[99393]: I0101 20:52:02.053678 99393 scope.go:111] "RemoveContainer" containerID="0e4cf0...fff40d"
Jan 01 20:52:02 master kubelet[99393]: E0101 20:52:02.054666 99393 pod_workers.go:190] "Error syncing pod, skipping" err="fai...
Jan 01 20:52:15 master kubelet[99393]: I0101 20:52:15.053453 99393 scope.go:111] "RemoveContainer" containerID="0e4cf0...fff40d"
Jan 02 00:46:11 master kubelet[99393]: E0102 00:46:11.405420 99393 kubelet.go:1877] "Skipping pod synchronization" err...s down"
Jan 02 00:47:33 master kubelet[99393]: E0102 00:47:33.328566 99393 kubelet.go:1877] "Skipping pod synchronization" err...s down"
Jan 02 00:47:33 master kubelet[99393]: E0102 00:47:33.535001 99393 kubelet.go:1877] "Skipping pod synchronization" err...s down"
Hint: Some lines were ellipsized, use -l to show in full.
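The truncated "Skipping pod synchronization" errors, together with the Exited containers shown by docker ps -a above, point at the container runtime, so its status is worth checking as well:
[root@master ~]# systemctl status docker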
### Check the pod status in kube-system
[root@master net.d]# kubectl get pod -n kube-system
NAME READY STATUS RESTARTS AGE
coredns-545d6fc579-258pf 0/1 ImagePullBackOff 0 4d21h
coredns-545d6fc579-rr4vm 0/1 ImagePullBackOff 0 4d21h
First, fix the coredns image that fails to pull. I found a post with the fix: the cause is that the image coredns:v1.8.0 cannot be downloaded.
(Reference: "k8s解决coredns 的ImagePullBackOff 和ErrImagePull 问题", CSDN blog)
### Check the pod's YAML for the exact image in use:
[root@master net.d]# kubectl get pods coredns-545d6fc579-258pf -n kube-system -o yaml | grep image:
image: registry.aliyuncs.com/google_containers/coredns/coredns:v1.8.0
- image: registry.aliyuncs.com/google_containers/coredns/coredns:v1.8.0
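Before logging in to a node, confirm which node the pod was actually scheduled to:
[root@master net.d]# kubectl get pod coredns-545d6fc579-258pf -n kube-system -o wide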
### Log in to the node that hosts the coredns pod; the following must be done on the host node itself.
[root@node1 root]# docker images
### The registry has no image with the v1.8.0 tag, so it cannot be downloaded.
### Try pulling it anyway; another problem appears first: connection refused. Solve that before anything else.
[root@node1 root]# docker pull registry.aliyuncs.com/google_containers/coredns/coredns:v1.8.0
Error response from daemon: Get "https://registry.aliyuncs.com/v2/": dial tcp: lookup registry.aliyuncs.com on [::1]:53: read udp [::1]:36245->[::1]:53: read: connection refused
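The pull fails before it even reaches the registry: the DNS lookup for registry.aliyuncs.com is being sent to the local loopback resolver ([::1]:53), which refuses the connection. The node's resolver configuration can be confirmed with:
[root@node1 root]# cat /etc/resolv.conf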
### Try changing the node's DNS server to 8.8.8.8
[root@node1 root]# vim /etc/sysconfig/network-scripts/ifcfg-ens33
DNS1=8.8.8.8
### Restart the network service after the change
systemctl restart network
### Now the real error shows up: the registry does not have a v1.8.0 tag under this repository path, so this attempt fails.
[root@node1 root]# docker pull registry.aliyuncs.com/google_containers/coredns/coredns:v1.8.0
Error response from daemon: pull access denied for registry.aliyuncs.com/google_containers/coredns/coredns, repository does not exist or may require 'docker login': denied: requested access to the resource is denied
### Next, download 1.8.0 directly and retag it with the name the pod expects.
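The pull itself is not in my terminal capture; assuming the image is published on Docker Hub as coredns/coredns:1.8.0 (as the reference describes), it would be:
[root@node1 root]# docker pull coredns/coredns:1.8.0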
[root@node1 root]# sudo docker tag 296a6d5035e2 registry.aliyuncs.com/google_containers/coredns/coredns:v1.8.0
[root@node1 root]# docker images
coredns/coredns                                            1.8.0     296a6d5035e2   3 years ago   42.5MB
registry.aliyuncs.com/google_containers/coredns/coredns   v1.8.0    296a6d5035e2   3 years ago   42.5MB
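With the correctly named image present locally and a specific tag set (so imagePullPolicy defaults to IfNotPresent), the coredns pods should be able to start; this assumes both pods were scheduled to node1, otherwise the pull and retag have to be repeated on each node hosting a coredns pod. If they stay in ImagePullBackOff, deleting them forces an immediate retry:
[root@master net.d]# kubectl delete pod coredns-545d6fc579-258pf coredns-545d6fc579-rr4vm -n kube-system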
Everything is fixed. Check curl again; curl and ping to the pod finally work.
A summary of the check commands used:
kubectl get pod -A
kubectl get pod xxx -n xxx -o yaml
kubectl logs xxx -n xxx
kubectl describe pod xxx -n xxx
systemctl status kubelet
docker ps -a | grep Exited
journalctl -xefu kubelet
systemctl restart network
docker images
kubectl get ep
kubectl get svc
Calico configuration directory: /etc/cni/net.d/
log_file_path: /var/log/calico/cni/cni.log
Kubernetes credentials are stored in /etc/cni/net.d/calico-kubeconfig
The generated CNI configuration is written to /etc/cni/net.d/10-calico.conflist
Calico keeps checking in a loop whether the etcd credentials have been updated; updated credentials are stored under /etc/cni/net.d/calico-tls/
These are day-to-day notes on problems hit during experiments and how they were solved. Wherever I took a detour, pointers from the experts are very welcome! :)