etcd Backup Migration

# Back up etcd (this can be done inside the etcd container, or on the host after installing the etcdctl tool)
ETCDCTL_API=3 etcdctl snapshot save /root/temp/backup.db \
    --endpoints=https://172.18.8.147:2379 \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/peer.crt \
    --key=/etc/kubernetes/pki/etcd/peer.key
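# Optional: sanity-check the snapshot before copying it over (a quick check, nothing
# specific to this setup; etcdctl 3.5 still accepts `snapshot status`, newer releases
# move this to etcdutl)
ETCDCTL_API=3 etcdctl snapshot status /root/temp/backup.db -w table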

# Copy the snapshot to the new host
scp root@172.18.8.147:/root/temp/backup.db /root/temp/

# Confirm the cri plugin status is OK
sudo ctr plugins ls | grep cri
# Init the new host (no need to install the network plugin yet)
kubeadm reset -f
kubeadm init --config=./kubeadm-init-config.conf
# Confirm the single-node cluster is up and running
journalctl -xeu kubelet
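# A quick sanity check that the fresh single-node cluster is healthy (plain
# kubectl/crictl; kubectl here assumes admin.conf is already in place, e.g.
# export KUBECONFIG=/etc/kubernetes/admin.conf)
kubectl get node
kubectl get pod -n kube-system
crictl ps | grep -E 'kube-apiserver|etcd'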

# Move etcd.yaml out of the manifests directory to stop the etcd pod
mv /etc/kubernetes/manifests/etcd.yaml /tmp/
# Wait for etcd to stop
crictl ps
# Once etcd is no longer listed
rm -rf /var/lib/etcd
# Restore the etcd backup
ETCDCTL_API=3 etcdctl snapshot restore /root/temp/backup.db \
    --data-dir=/var/lib/etcd
# Move etcd.yaml back
mv /tmp/etcd.yaml /etc/kubernetes/manifests/
systemctl restart kubelet containerd

# Check pod status
kubectl get pod -A
# All pods from the original cluster show as Running, but this is stale data carried over from the snapshot, not live status
kubectl get node
```
NAME                    STATUS     ROLES           AGE    VERSION
localhost.cloudvoyage   NotReady   control-plane   53d    v1.28.2
master-node-test1       Ready      <none>          2d6h   v1.28.2
```
# Label the current node as a control-plane node
kubectl label node master-node-test1 node-role.kubernetes.io/control-plane=
# Delete the old node object
kubectl delete node localhost.cloudvoyage
# Change the IP in kube-proxy's "server" field
kubectl edit configmap -n kube-system kube-proxy
# Change the controlPlaneEndpoint IP in kubeadm-config
kubectl edit configmap kubeadm-config -n kube-system
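# A quick check that the old address is gone after both edits (172.18.8.147 is
# the old control-plane IP in this migration)
kubectl -n kube-system get configmap kube-proxy -o yaml | grep 'server:'
kubectl -n kube-system get configmap kubeadm-config -o yaml | grep controlPlaneEndpoint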
# Regenerate cluster-info
kubectl delete configmap cluster-info -n kube-public
kubeadm init phase bootstrap-token
# Check the result
kubectl -n kube-public get configmap cluster-info -o yaml

# Problems that were more specific to my setup are recorded in the sections below

Migration by Joining the Cluster

This approach is simple: join the original cluster as a control-plane node and then delete the old control-plane node, or add the following to etcd.yaml on the new node
    - --initial-cluster-state=new
    - --force-new-cluster=true
to force the current node to become the control-plane node of a brand-new cluster.

For joining as a control-plane node, see: Adding a Master Node to a K8s Cluster
Summary:
kubeadm token create --print-join-command
kubeadm init phase upload-certs --upload-certs
# Assemble the control-plane join command
kubeadm join 192.168.202.151:6443 --token 9e27gy.jov2cblxelz085ux --discovery-token-ca-cert-hash sha256:5ee09d78e7ffba7d1f757b71035182e4c69bd51f07082e31b37d472cb9d540b9  --control-plane --certificate-key 83cabc28964ffc107fa7e1069ead5b9c0c8a3f5075ca195abeaad41c40ab0d13
# This may fail; if the cluster has no controlPlaneEndpoint set, do the following
kubectl edit cm kubeadm-config -n kube-system
controlPlaneEndpoint: 192.168.202.151:6443

Issues Encountered

Certificate problems

# [failure loading certificate for CA: couldn't load the certificate file /etc/kubernetes/pki/ca.crt: open /etc/kubernetes/pki/ca.crt: no such file or directory, failure loading key for service account: couldn't load the private key file /etc/kubernetes/pki/sa.key: open /etc/kubernetes/pki/sa.key: no such file or directory, failure loading certificate for front-proxy CA: couldn't load the certificate file /etc/kubernetes/pki/front-proxy-ca.crt: open /etc/kubernetes/pki/front-proxy-ca.crt: no such file or directory, failure loading certificate for etcd CA: couldn't load the certificate file /etc/kubernetes/pki/etcd/ca.crt: open /etc/kubernetes/pki/etcd/ca.crt: no such file or directory]

# Migrate the certificates
scp -r root@172.18.8.147:/etc/kubernetes/pki /etc/kubernetes/pki

# error execution phase control-plane-prepare/certs: error creating PKI assets: failed to write or validate certificate "etcd-peer": certificate etcd/peer is invalid: x509: certificate is valid for localhost, localhost.cloudvoyage, not master-node-test1

# The certificate does not cover the new host's hostname
# Add the current hostname to the certificates on the original control-plane node
# Back up the kubernetes directory first
cp -r /etc/kubernetes{,-bak}

# Check the SANs inside the certificates
for i in $(find /etc/kubernetes/pki -type f -name "*.crt");do echo ${i} && openssl x509 -in ${i} -text | grep 'DNS:';done

# Adjust the kubeadm init config: append the following to kubeadm-init-config.conf
```
apiServer:
  extraArgs:
    authorization-mode: Node,RBAC
  certSANs:
    - master-node-test1   # new host
etcd:
  local:
    serverCertSANs:
      - master-node-test1  # new host
    peerCertSANs:
      - master-node-test1  # new host
```

# Full configuration
```
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
imageRepository: registry.cn-hangzhou.aliyuncs.com/google_containers
controlPlaneEndpoint: "172.18.8.147"
networking:
  podSubnet: "10.2.0.0/16"
apiServer:
  extraArgs:
    authorization-mode: Node,RBAC
  certSANs:
    - master-node-test1   # new host
etcd:
  local:
    serverCertSANs:
      - master-node-test1  # new host
    peerCertSANs:
      - master-node-test1  # new host
```

# Delete the old certificates
# Keep the ca, sa and front-proxy certificates
rm -rf /etc/kubernetes/pki/{apiserver*,front-proxy-client*}
rm -rf /etc/kubernetes/pki/etcd/{healthcheck*,peer*,server*}

# Generate new certificates
kubeadm init phase certs all --config /root/kubeadm-init-config.conf

# Check the SANs in the certificates again
for i in $(find /etc/kubernetes/pki -type f -name "*.crt");do echo ${i} && openssl x509 -in ${i} -text | grep 'DNS:';done


# If the kubeadm init config file was updated, also upload it to the cluster's ConfigMap so that later upgrades or re-initializations use the new configuration.
kubeadm init phase upload-config kubeadm --config /root/kubeadm-init-config.conf 

Interrupted join, etcd on the original control-plane node broken

# If a kubeadm join gets interrupted, the new server may already be registered in the cluster's etcd, leaving an unreachable member IP behind. If the cluster was originally a single node, etcd becomes unhealthy and the apiserver goes down.

# etcd error log:
# {"level":"warn","ts":"2025-01-23T13:27:41.897254Z","caller":"etcdserver/v3_server.go:897","msg":"waiting for ReadIndex response took too long, retrying","sent-request-id":9509795448636559634,"retry-timeout":"500ms"}
{"level":"warn","ts":"2025-01-23T13:27:42.235348Z","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE","remote-peer-id":"89e1acc3ebaf541c","rtt":"0s","error":"dial tcp 192.168.1.12:2380: i/o timeout"}

# etcdctl cannot connect to the etcd service either:
ETCDCTL_API=3 etcdctl  \
        --endpoints=https://172.22.8.15:2379   \
        --cacert=/etc/kubernetes/pki/etcd/ca.crt   \
        --cert=/etc/kubernetes/pki/apiserver-etcd-client.crt   \
        --key=/etc/kubernetes/pki/apiserver-etcd-client.key   \
        member list
# {"level":"warn","ts":"2025-01-23T21:32:47.523636+0800","logger":"etcd-client","caller":"v3@v3.5.9/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0000baa80/172.22.8.15:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}

# This is a deadlock: the bad member IP makes the etcd service unavailable, and because etcd is unavailable etcdctl cannot be used to remove that member.
# The way out is --force-new-cluster: forcibly abandon the old membership and turn the current node's etcd data into a new cluster. No configuration is lost, but the node you do this on leaves the original cluster; if there was only one control-plane node to begin with, it simply amounts to dropping the bad member IP.
# Go to the manifests directory
cd /etc/kubernetes/manifests/
vi etcd.yaml
command:
    - etcd
    - --name=<node-name>             # keep the original name
    - --initial-cluster=<node-name>=https://127.0.0.1:2380
    - --data-dir=/var/lib/etcd       # point at the existing data directory
    - --listen-client-urls=https://127.0.0.1:2379
    - --advertise-client-urls=https://127.0.0.1:2379
    - --listen-peer-urls=https://127.0.0.1:2380
    - --initial-advertise-peer-urls=https://127.0.0.1:2380

    # Note: the two most important changes are the following:
    - --initial-cluster-state=new
    - --force-new-cluster=true

    # TLS certificate flags (example; keep them the same as before)
    - --cert-file=/etc/kubernetes/pki/etcd/server.crt
    ....

# i.e. add: --initial-cluster-state=new and --force-new-cluster=true

# Restart kubelet
systemctl restart kubelet

# Check that etcd is running
crictl ps |grep etcd
crictl logs <etcd-id>
# Try the apiserver
kubectl get pod

PS: remember to remove the two added flags afterwards. Also, etcdctl is normally not present on the host and has to be installed manually.

etcd sync failure

# {"level":"warn","ts":"2025-01-24T02:17:01.571171Z","caller":"fileutil/fileutil.go:53","msg":"check file permission","error":"directory \"/var/lib/etcd\" exist, but the permission is \"drwxr-xr-x\". The recommended permission is \"-rwx------\" to prevent possible unprivileged access to the data"}

# Fix the permissions
chmod 700 /var/lib/etcd

# Delete the etcd container so it restarts
crictl rm <etcd-id>

# Another problem showed up
#{"level":"warn","ts":"2025-01-24T02:44:36.557466Z","caller":"etcdserver/cluster_util.go:79","msg":"failed to get cluster response","address":"https://172.18.8.147:2380/members","error":"Get \"https://172.18.8.147:2380/members\": EOF"}

# Test from the node
curl --cacert /etc/kubernetes/pki/etcd/ca.crt \
     --cert /etc/kubernetes/pki/etcd/peer.crt \
     --key /etc/kubernetes/pki/etcd/peer.key \
     -k https://172.18.8.147:2380/members
# Returned: curl: (92) HTTP/2 stream 1 was not closed cleanly before end of the underlying stream

# Result of the same request on 172.18.8.147 itself: [{"id":12527478125644897856,"peerURLs":["https://172.18.8.147:2380"],"name":"localhost.cloudvoyage","clientURLs":["https://172.18.8.147:2379"]}]

# With a single-node cluster, even a small hiccup during join can cause this; once it happens, the only fix is to modify the original cluster's etcd and force the node to go standalone.

About etcdctl

Helm's release data is stored in etcd as well
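Helm 3 stores each release as a Secret of type helm.sh/release.v1, so releases come back together with the etcd restore. A quick way to confirm (the owner=helm label is what Helm 3 puts on its release secrets):
kubectl get secret -A -l owner=helm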

Installation
sudo apt update
sudo apt install etcd-client -y
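# Quick check that the client works. Note the apt package may be older than the
# cluster's etcd; that is usually fine for member/snapshot commands, but worth
# keeping in mind.
etcdctl version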

Related commands for the record

apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
imageRepository: registry.cn-hangzhou.aliyuncs.com/google_containers
controlPlaneEndpoint: "172.22.8.15"
networking:
  podSubnet: "10.2.0.0/16"




ETCDCTL_API=3 etcdctl  \
	--endpoints=https://172.18.8.147:2379   \
	--cacert=/etc/kubernetes/pki/etcd/ca.crt   \
	--cert=/etc/kubernetes/pki/apiserver-etcd-client.crt   \
	--key=/etc/kubernetes/pki/apiserver-etcd-client.key   \
	member list


ETCDCTL_API=3 etcdctl  \
	--endpoints=https://172.22.8.15:2379   \
	--cacert=/etc/kubernetes/pki/etcd/ca.crt   \
	--cert=/etc/kubernetes/pki/apiserver-etcd-client.crt   \
	--key=/etc/kubernetes/pki/apiserver-etcd-client.key   \
	member list
	

# Syntax: etcdctl member add <name> --peer-urls=<new-member-peer-urls>
ETCDCTL_API=3 etcdctl  \
	--endpoints=https://172.22.8.15:2379   \
	--cacert=/etc/kubernetes/pki/etcd/ca.crt   \
	--cert=/etc/kubernetes/pki/apiserver-etcd-client.crt   \
	--key=/etc/kubernetes/pki/apiserver-etcd-client.key   \
	member add etcd3 --peer-urls=https://192.168.1.12:2380

# Syntax: etcdctl member remove <memberID>
ETCDCTL_API=3 etcdctl  \
	--endpoints=https://172.22.8.15:2379   \
	--cacert=/etc/kubernetes/pki/etcd/ca.crt   \
	--cert=/etc/kubernetes/pki/apiserver-etcd-client.crt   \
	--key=/etc/kubernetes/pki/apiserver-etcd-client.key   \
	member remove 89e1acc3ebaf541c






# Back up etcd
ETCDCTL_API=3 etcdctl snapshot save /root/temp/backup.db \
    --endpoints=https://172.18.8.147:2379 \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/peer.crt \
    --key=/etc/kubernetes/pki/etcd/peer.key

# Reset the node
kubeadm reset -f

# Restore the etcd backup
ETCDCTL_API=3 etcdctl snapshot restore /root/temp/backup.db \
    --data-dir=/var/lib/etcd









# Update a member's peer URL
ETCDCTL_API=3 etcdctl --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/peer.crt \
  --key=/etc/kubernetes/pki/etcd/peer.key \
  --endpoints=https://172.22.8.21:2379 \
  member update 8e9e05c52164694d --peer-urls=https://172.22.8.21:2380


Other Issues

The kube-flannel namespace cannot be deleted

kubectl get ns
# kube-flannel is stuck in the Terminating state

# Export it as JSON
kubectl get ns kube-flannel -ojson > kube-flannel.json

# Remove the entries under finalizers
vi kube-flannel.json

# Update the namespace (this usually takes quite a while to go through)
kubectl replace --raw "/api/v1/namespaces/kube-flannel/finalize" -f kube-flannel.json
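# An equivalent one-liner that skips the manual edit (a sketch assuming jq is installed)
kubectl get ns kube-flannel -o json \
  | jq '.spec.finalizers = []' \
  | kubectl replace --raw "/api/v1/namespaces/kube-flannel/finalize" -f -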

Proxy issue fix

# Exclude kube-flannel from the proxy
kubectl edit cm -n registry-proxy registry-proxy-config
# Add - kube-flannel under excludeNamespaces, then restart the registry-proxy pod (sketch below)
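# The change only takes effect after the registry-proxy pod restarts; a sketch,
# the deployment name here is an assumption, adjust it to whatever is actually deployed
kubectl -n registry-proxy rollout restart deployment registry-proxy
# or simply delete the pods and let them be recreated
kubectl -n registry-proxy delete pod --all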

Flannel network problems


# Network checks (when flannel fails to start or reports errors)
kubectl get events -A --sort-by=.metadata.creationTimestamp
# Look for flannel-related errors in the events
kubectl logs -n kube-flannel -l app=flannel
# A possible cause: the cni0 interface has an IP that does not match the current pod subnet
```
failed to setup network for sandbox "88a1ab6dc996d559fe15770b1bd12094950d14bb0f90e22f6e8a3b964ae89634": plugin type="flannel" failed (add): failed to delegate add: failed to set bridge addr: "cni0" already has an IP address different from 10.2.2.1/24
```
# Check the interface
ip addr show
# Delete the wrong IP from the interface
ip addr del 10.2.0.1/24 dev cni0
# The running flannel pod will usually add the correct IP by itself; otherwise add it manually
ip addr add 10.2.2.1/24 dev cni0
ip link set cni0 up
# Restart all pods
kubectl delete pod --all -A

PV/PVC repair


# Scale the pods down (this step can be skipped if nothing needs changing; I had to change the NFS IP and path. If the NFS address is a domain name and the new path matches the old one, the PVs can be reused directly.)
kubectl get deployment -A --no-headers | awk '$1!="kube-system" {print "kubectl scale deployment "$2" --replicas=0 -n "$1}' | sh
kubectl get statefulset -A --no-headers | awk '$1!="kube-system" {print "kubectl scale statefulset "$2" --replicas=0 -n "$1}' | sh
# Export the PVs and PVCs
kubectl get pv -oyaml >> allpv.yaml
kubectl get pvc -A -oyaml >> allpvc.yaml
# Delete the PVs and PVCs
kubectl delete pvc -A --all
kubectl delete pv --all
# When editing the exported PVs, remember to remove the claimRef block to drop the link to the old PVCs (see the sketch below), then re-apply the PVs and PVCs
kubectl apply -f allpv.yaml
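# The claimRef edit above can be scripted instead of done by hand; a sketch
# assuming the Go-based yq v4 is installed
yq -i 'del(.items[].spec.claimRef)' allpv.yaml
# if apply then complains about stale metadata, the exported resourceVersion/uid
# can be stripped the same way:
# yq -i 'del(.items[].metadata.resourceVersion) | del(.items[].metadata.uid)' allpv.yaml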

# Scale the pods back up
kubectl get deployment -A --no-headers | awk '$1!="kube-system" {print "kubectl scale deployment "$2" --replicas=1 -n "$1}' | sh
kubectl get statefulset -A --no-headers | awk '$1!="kube-system" {print "kubectl scale statefulset "$2" --replicas=1 -n "$1}' | sh

Proxy mode issue

# kube-proxy defaults to iptables mode (good compatibility) for its forwarding rules, but that mode performs poorly, and in my setup it was also tangled up with cni0 bridge IP conflicts
# ipvs performs much better (worse compatibility: old Linux kernels do not support it); change mode to ipvs in the config
kubectl edit cm -n kube-system kube-proxy
kubectl delete pod -n kube-system -l k8s-app=kube-proxy
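# To confirm the switch took effect (kube-proxy logs which proxier it is using at
# startup; ipvs mode also needs ipvsadm and the ip_vs kernel modules on the node)
kubectl -n kube-system logs -l k8s-app=kube-proxy --tail=100 | grep -i proxier
ipvsadm -Ln | head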

Summary

Get the base environment fully set up and verified before starting. This time I began the migration before the environment was ready, which led to a lot of problems that otherwise would not have come up.

Main problems, summarized:
containerd configuration: mainly SystemdCgroup not being set correctly, which wasted quite a bit of time

Flannel plugin issues:
This one was a bit involved. registry-proxy was installed, and by default it only excludes system namespaces such as kube-system; my flannel runs in the kube-flannel namespace, so it had to be excluded manually, and after changing the ConfigMap the registry-proxy pod also has to be restarted by hand.
Then the flannel pod could not reach the apiserver, because kube-proxy was broken: the server IP in its ConfigMap had not been changed to the new node.
On top of that, the cni0 interface IP did not match the flannel IP.

K8s networking:
   cni0 bridge (connects the pod network to flannel; may not exist on nodes that run no pods, e.g. a control-plane node) --> flannel network --> kube-proxy (cluster network 10.9.x.x, iptables forwarding) --> node's host NIC --> external network
If kube-proxy and coredns show no errors but the network is still broken, it is almost always the network plugin, and the list of possible root causes is long. Networking really is hard.
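A few checks that helped narrow this kind of problem down (standard tools only; the pod IP below is just a placeholder inside the 10.2.0.0/16 pod subnet):
ip route get 10.2.1.10                      # which route/interface the node would use for a pod IP
ip addr show cni0; ip addr show flannel.1   # bridge and vxlan interfaces and their subnets
iptables -t nat -L KUBE-SERVICES -n | head  # kube-proxy service rules present?
kubectl -n kube-system logs -l k8s-app=kube-proxy --tail=20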



A few more random notes:
1. Flannel logs errors about being unable to reach the apiserver
  The coredns and kube-proxy pod logs were also abnormal; removing the flannel plugin and fixing the kube-proxy configuration solved it
2. Network pods run fine, but application pods cannot reach cluster IPs and cannot resolve baidu.com
  Normally DNS failures point at coredns, and unreachable cluster IPs point at kube-proxy's iptables rules, but here it was again the flannel plugin; reinstalling flannel fixed it
Clean up the old configuration:
rm -rf /etc/cni/net.d/*
rm -rf /var/lib/cni/*
rm -rf /run/flannel
3. The control-plane node cannot reach cluster IPs, although it can reach the apiserver IP
  This one was bizarre. I don't know what was misconfigured on Ubuntu; I think something similar happened once on CentOS too, possibly from too many repeated re-initializations. The root cause I got to was that the control-plane node could not route traffic to pods on other nodes: ip route get showed the control-plane node sending that traffic out of its physical NIC, while worker nodes forwarded it correctly. Yet the routing table, the NIC configuration and the iptables rules all looked fine. Time was short and I only needed the ingress-controller entry point, so for now I simply forwarded nginx port 80 to a worker node port (see the sketch below). Later I may try converting a worker node into a control-plane node and see how that goes. In any case the migration finally succeeded and all the data is intact.
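A hypothetical sketch of that stopgap: a plain nginx forward from the control-plane node's port 80 to the ingress-controller's HTTP NodePort on a worker node (the IP and port below are placeholders, not values from this migration):
```
cat >/etc/nginx/conf.d/ingress-forward.conf <<'EOF'
server {
    listen 80;
    location / {
        proxy_pass http://<worker-node-ip>:<ingress-http-nodeport>;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }
}
EOF
nginx -t && nginx -s reload
```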
