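Background
The BR tool backs up and quickly restores KV data. When the test TiDB cluster is underpowered, BR makes migrating production data to the test environment more efficient. It can operate on a specified list of tables and supports rate limiting for backup and restore. A single run is more than 3 times as efficient as a logical migration with dumpling / lightning.
1. Preparation
First configure an NFS directory for TiKV on both the source cluster and the target (restore) cluster.
a. Taking the target cluster as an example, edit the tc (TidbCluster) and add an additionalVolumes definition under tikv that uses NFS.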
[root@x.x.x.163 ~]# kubectl edit tc uat-tidb -n tidb
spec:
  tikv:
    additionalVolumes:
    - name: baknfs
      nfs:
        server: x.x.x.180
        path: /data
    additionalVolumeMounts:
    - name: baknfs
      mountPath: /bk_data
[root@x.x.x.163 ~]# kubectl get pods -n tidb
NAME READY STATUS RESTARTS AGE
uat-tidb-discovery-8574b88df-zpfbg 1/1 Running 0 94d
uat-tidb-monitor-6bf8675f97-f2wpb 3/3 Running 0 95d
uat-tidb-pd-0 1/1 Running 0 49d
uat-tidb-pd-1 1/1 Running 0 94d
uat-tidb-pd-2 1/1 Running 7 94d
uat-tidb-tidb-0 2/2 Running 0 43m
uat-tidb-tikv-0 1/1 Running 1 29d
uat-tidb-tikv-1 1/1 Running 2 29d
uat-tidb-tikv-2 1/1 Running 1 29d
uat-tidb-tikv-3 1/1 Terminating 0 29d
b. After the rolling restart, the NFS mount is visible both inside the pod and on the worker node that hosts it.
[root@x.x.x.167 ~]# df -h |grep data
x.x.x.180:/data 992G 166G 826G 17% /var/lib/kubelet/pods/7ed60c2f-bd80-4017-9202-21f44b7e4eb3/volumes/kubernetes.io~nfs/baknfs
[root@x.x.x.167 ~]# cd /var/lib/kubelet/pods/7ed60c2f-bd80-4017-9202-21f44b7e4eb3/volumes/kubernetes.io~nfs/baknfs/bk_test
[root@x.x.x.167 /var/lib/kubelet/pods/7ed60c2f-bd80-4017-9202-21f44b7e4eb3/volumes/kubernetes.io~nfs/baknfs/bk_test]# ls
4_1542800_2634_8f570f541985f65eee0363cf943cd442addfecd0c33a578e32e007ed08f8a41f_1636698811852_write.sst
4_1542800_2634_aaf6e66052cb6b18a8d3629abaa4e1bb5009ea7c66b3af73030e88712c8aaaf8_1636698811858_write.sst
5_1542424_2562_0a18eb8d9a177225cadf7bf7a0ab11a8b58e5864ab1e066445f5f44e09f2de66_1636698811841_write.sst
5_1542424_2562_47baf93dda75bc96b63d86972390c0f34fcd930d720777ee8dfd7ba227095cb6_1636698811834_write.sst
5_1542424_2562_8797f837d2ae4942bc47238ec81ddf5f747c99446f814b1ac9fc17f6a6bd526d_1636698811826_write.sst
backup.lock
backupmeta
[root@x.x.x.163 ~]# kubectl exec -it uat-tidb-tikv-3 -n tidb -- /bin/sh
/ # df -h
Filesystem Size Used Available Use% Mounted on
x.x.x.180:/data 991.1G 333.5G 657.5G 34% /bk_data
/dev/sdb1 590.5G 185.2G 375.2G 33% /var/lib/tikv
... (omitted) ...
/ # ls /bk_data
backup-nfs backup-nfs-old bak20211112 bk_test
Upload the required images to Harbor in the test environment.
[root@x.x.x.163 ~]# docker login x.x.x.162
Authenticating with existing credentials...
WARNING! Your password will be stored unencrypted in /root/.docker/config.json.
Configure a credential helper to remove this warning. See
https://docs.docker.com/engine/reference/commandline/login/#credentials-store
Login Succeeded
[root@x.x.x.163 ~]# docker tag pingcap/br:v4.0.12 x.x.x.162/pingcap/br:v4.0.12
[root@x.x.x.163 ~]# docker push x.x.x.162/pingcap/br:v4.0.12
[root@x.x.x.163 ~]# docker tag pingcap/tidb-backup-manager:v1.1.9 x.x.x.162/pingcap/tidb-backup-manager:v1.1.9
[root@x.x.x.163 ~]# docker push x.x.x.162/pingcap/tidb-backup-manager:v1.1.9
The push refers to repository [x.x.x.162/pingcap/tidb-backup-manager]
Drop the tables that are going to be migrated, so that they can be restored from the backup.
mysql> use account_db;
mysql> drop table table_account_collect_entry;
mysql> drop table table_account_counts_cst;
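2. Configure the backup
In the production environment, define an ad-hoc backup filtered by table name. Set the concurrency and rateLimit parameters as needed to reduce the impact on business traffic.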
# vi br.yaml
apiVersion: pingcap.com/v1alpha1
kind: Backup
metadata:
  name: backup-111202
  namespace: tidb-cluster
spec:
  br:
    cluster: tidb-cluster
    clusterNamespace: tidb-cluster
    concurrency: 1
    rateLimit: 10
    checksum: true
  toolImage: 72.0.253.205/tidb/pingcap/br:v4.0.12
  cleanPolicy: Delete
  tableFilter:
  - "account_db.table_account_collect_entry"
  - "account_db.table_account_counts_cst"
  local:
    prefix: backup-nfs/20211112
    volume:
      name: nfs
      nfs:
        server: x.x.x.92
        path: /data/tidb_backup
    volumeMount:
      name: nfs
      mountPath: /tidb_backup
Run the backup on the production source cluster, and watch the log of the BR job pod.
# kubectl apply -f br.yaml
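To get a feel for what a given rateLimit means for the backup window, the following is a rough lower-bound estimate. It assumes that rateLimit is expressed in MiB/s and applies to each TiKV node separately, and that data is spread evenly across nodes. The function name and the figures in the usage line (120 GiB, 3 TiKV nodes) are hypothetical and not taken from this environment.

```python
def estimate_backup_seconds(data_mib: float, rate_limit_mib_s: float, tikv_nodes: int) -> float:
    """Lower bound on backup time if every TiKV node is throttled to rate_limit_mib_s.

    Assumption: the limit is per TiKV node and data is evenly distributed,
    so the aggregate throughput is rate_limit_mib_s * tikv_nodes.
    """
    if rate_limit_mib_s <= 0 or tikv_nodes <= 0:
        raise ValueError("rate limit and node count must be positive")
    return data_mib / (rate_limit_mib_s * tikv_nodes)


# Hypothetical figures: 120 GiB of table data, rateLimit: 10, 3 TiKV nodes.
seconds = estimate_backup_seconds(120 * 1024, 10, 3)
print(f"at least {seconds / 3600:.2f} h")
```

Real runs will take longer than this bound, since checksum and scheduling overhead are ignored; the point is only to check that the chosen limit fits the available maintenance window.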
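3. Configure the restore
Copy the backup folder /data/tidb_backup/backup-nfs/20211112 from the production backup storage x.x.x.92 into the /data directory on x.x.x.180. Because of the naming requirement on prefix (see section 4), rename it to bak20211112.
Define an ad-hoc restore that points to the correct directory, and run it on the target cluster. This example does not set the concurrency and rateLimit parameters.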
[root@x.x.x.163 ~]# vi restore.yaml
apiVersion: pingcap.com/v1alpha1
kind: Restore
metadata:
  name: restore-1113
  namespace: tidb
spec:
  br:
    cluster: uat-tidb
    clusterNamespace: tidb
  toolImage: x.x.x.162/pingcap/br:v4.0.12
  local:
    prefix: bak20211112
    volume:
      name: baknfs
      nfs:
        server: x.x.x.180
        path: /data
    volumeMount:
      name: baknfs
      mountPath: /bk_data
[root@x.x.x.163 ~]# kubectl apply -f restore.yaml
[root@x.x.x.163 ~]# kubectl get rt -n tidb
NAME STATUS STARTED COMPLETED COMMITTS AGE
restore-1112 Complete 2021-11-12T11:37:27Z 2021-11-12T11:37:31Z 429050773156921347 15h
restore-1113 Running <no value> <no value> 23m
[root@x.x.x.163 ~]# kubectl get pods -n tidb
NAME READY STATUS RESTARTS AGE
restore-restore-1113-wqwsh 1/1 Running 0 33m
... (omitted) ...
[root@x.x.x.163 ~]# kubectl logs restore-restore-1113-wqwsh -n tidb |more
Create rclone.conf file.
/tidb-backup-manager restore --namespace=tidb --restoreName=restore-1113 --tikvVersion=v4.0.12
I1113 10:15:13.411502 1 restore.go:73] start to process restore tidb/restore-1113
I1113 10:15:13.447383 1 restore_status_updater.go:64] Restore: [tidb/restore-1113] updated successfully
I1113 10:15:14.299951 1 restore.go:66] Running br command with args: [restore full --pd=uat-tidb-pd.tidb:2379 --storage=local:///bk_data/bak20211112]
I1113 10:15:14.440232 1 restore.go:89] [2021/11/13 10:15:14.439 +08:00] [INFO] [info.go:40] ["Welcome to Backup & Restore (BR)"] [release-version=v4.0.12] [git-hash=14d55a7a3696a37e6e7f199f75c5dc405383c547] [git-branch=release-4.0] [go-version=go1.13] [utc-build-time="2021-04-02 10:41:29"] [race-enabled=false]
I1113 10:15:14.440356 1 restore.go:89] [2021/11/13 10:15:14.440 +08:00] [INFO] [common.go:458] [arguments] [__command="br restore full"] [pd="[uat-tidb-pd.tidb:2379]"] [storage=local:///bk_data/bak20211112]
... (omitted) ...
The restore progress can be followed in the log of the BR job pod. The start and end of the task can be seen in the log of the tidb-controller-manager pod.
[root@x.x.x.163 ~]# kubectl logs restore-restore-1113-wqwsh -n tidb -f |grep progress.go
I1113 16:47:16.201855 1 restore.go:89] [2021/11/13 16:47:16.201 +08:00] [INFO] [progress.go:134] [progress] [step="Full restore"] [progress=78.54%] [count="45467 / 57887"] [speed="1 p/s"] [elapsed=6h32m0s] [remaining=3h16m8s]
[root@x.x.x.163 ~]# kubectl logs tidb-controller-manager-5457786d5c-c9gjq -n tidb-admin
I1113 10:14:56.709066 1 tidbcluster_control.go:66] TidbCluster: [tidb/uat-tidb] updated successfully
I1113 10:14:58.483964 1 tidbcluster_control.go:66] TidbCluster: [tidb/uat-tidb] updated successfully
I1113 10:15:09.008373 1 event.go:255] Event(v1.ObjectReference{Kind:"Restore", Namespace:"tidb", Name:"restore-1113", UID:"5deccc07-c66d-4efd-94c4-aa94f7757450", APIVersion:"pingcap.com/v1alpha1", ResourceVersion:"297617481", FieldPath:""}): type: 'Normal' reason: 'SuccessfulCreate' create job tidb/restore-restore-1113 for cluster restore-1113 restore successful
I1113 10:15:09.018953 1 restore_status_updater.go:64] Restore: [tidb/restore-1113] updated successfully
I1113 10:15:26.987474 1 tidbcluster_control.go:66] TidbCluster: [tidb/uat-tidb] updated successfully
I1113 10:15:28.697291 1 tidbcluster_control.go:66] TidbCluster: [tidb/uat-tidb] updated successfully
... (omitted) ...
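The progress.go lines shown above have a regular shape, so they can be turned into structured values for monitoring. The following is a minimal sketch that assumes the log format is exactly as in the sample line from this run; the function name parse_progress is illustrative, and other BR versions may format the line differently.

```python
import re

# Matches the bracketed fields of a BR progress line as seen in this article.
PROGRESS_RE = re.compile(
    r'\[step="(?P<step>[^"]+)"\] '
    r'\[progress=(?P<percent>[\d.]+)%\] '
    r'\[count="(?P<done>\d+) / (?P<total>\d+)"\]'
    r'.*?\[remaining=(?P<remaining>[^\]]+)\]'
)


def parse_progress(line):
    """Return the progress fields of one log line, or None if it is not a progress line."""
    m = PROGRESS_RE.search(line)
    if m is None:
        return None
    return {
        "step": m["step"],
        "percent": float(m["percent"]),
        "done": int(m["done"]),
        "total": int(m["total"]),
        "remaining": m["remaining"],
    }


# Sample copied from the restore log above.
sample = (
    'I1113 16:47:16.201855 1 restore.go:89] [2021/11/13 16:47:16.201 +08:00] [INFO] '
    '[progress.go:134] [progress] [step="Full restore"] [progress=78.54%] '
    '[count="45467 / 57887"] [speed="1 p/s"] [elapsed=6h32m0s] [remaining=3h16m8s]'
)
print(parse_progress(sample))
```

Piping the output of kubectl logs -f through a loop that calls parse_progress on each line would be one way to feed these values into an alert or a dashboard.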
4. Diagnostics
If the rt resource never shows a STATUS value and no corresponding BR job pod is created, check the tidb-controller-manager log.
[root@x.x.x.163 ~]# kubectl get rt -n tidb
NAME STATUS STARTED COMPLETED COMMITTS AGE
restore-1112 Complete 2021-11-12T11:37:27Z 2021-11-12T11:37:31Z 429050773156921347 14h
restore-1113 8m48s
[root@x.x.x.163 ~]# kubectl get pods -n tidb-admin
NAME READY STATUS RESTARTS AGE
tidb-controller-manager-5457786d5c-c9gjq 1/1 Running 80 94d
tidb-scheduler-6c9c6bd7f7-7bh2v 2/2 Running 0 24h
... (omitted) ...
[root@x.x.x.163 ~]# kubectl logs tidb-controller-manager-5457786d5c-c9gjq -n tidb-admin
E1113 10:02:30.890073 1 reflector.go:123] k8s.io/client-go@v0.0.0/tools/cache/reflector.go:96: Failed to list *v1alpha1.Restore: v1alpha1.RestoreList.Items: []v1alpha1.Restore: v1alpha1.Restore.Spec: v1alpha1.RestoreSpec.StorageProvider: Local: v1alpha1.LocalStorageProvider.Prefix: ReadString: expects " or n, but found 2, error found in #10 byte of ...|"prefix":20211112,"v|..., bigger context ...|160.3.162/pingcap/br:v4.0.12"},"local":{"prefix":20211112,"volume":{"name":"baknfs","nfs":{"path":"/|...
E1113 10:02:31.893830 1 reflector.go:123] k8s.io/client-go@v0.0.0/tools/cache/reflector.go:96: Failed to list *v1alpha1.Restore: v1alpha1.RestoreList.Items: []v1alpha1.Restore: v1alpha1.Restore.Spec: v1alpha1.RestoreSpec.StorageProvider: Local: v1alpha1.LocalStorageProvider.Prefix: ReadString: expects " or n, but found 2, error found in #10 byte of ...|"prefix":20211112,"v|..., bigger context ...|160.3.162/pingcap/br:v4.0.12"},"local":{"prefix":20211112,"volume":{"name":"baknfs","nfs":{"path":"/|...
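Does .spec.local.prefix of the BR object disallow a leading digit, that is, a first-level subdirectory whose name starts with a digit? The error message points to a more specific cause: the value reached the controller as the bare number 20211112 ("prefix":20211112), while the field is decoded as a string (ReadString expects a quote). An unquoted all-digit value in YAML is read as an integer, so quoting the value might also avoid the error. This article took the route of renaming the directory instead.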
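Because the same error is logged every second, it helps to reduce the controller-manager log to the distinct decoding failures and the field each one points at. The sketch below assumes the message layout seen in the two log lines above; the function name find_decode_errors is illustrative.

```python
import re

# Captures the resource kind, the last field named before "ReadString",
# and the character the decoder found instead of a quote.
DECODE_ERR_RE = re.compile(
    r'Failed to list \*(?P<kind>[\w.]+): '
    r'.*?(?P<field>[\w.]+): ReadString: expects " or n, but found (?P<found>\S)'
)


def find_decode_errors(log_text):
    """Return distinct (kind, field, found_char) tuples, in order of first appearance."""
    seen = []
    for line in log_text.splitlines():
        m = DECODE_ERR_RE.search(line)
        if m:
            item = (m["kind"], m["field"], m["found"])
            if item not in seen:
                seen.append(item)
    return seen


# Two shortened lines from the log above; only the timestamps differ.
tail = (
    'k8s.io/client-go@v0.0.0/tools/cache/reflector.go:96: Failed to list *v1alpha1.Restore: '
    'v1alpha1.RestoreList.Items: []v1alpha1.Restore: v1alpha1.Restore.Spec: '
    'v1alpha1.RestoreSpec.StorageProvider: Local: v1alpha1.LocalStorageProvider.Prefix: '
    'ReadString: expects " or n, but found 2, error found in #10 byte of ...'
)
log_text = (
    'E1113 10:02:30.890073 1 reflector.go:123] ' + tail + '\n'
    'E1113 10:02:31.893830 1 reflector.go:123] ' + tail + '\n'
)
print(find_decode_errors(log_text))
```

For this log the single result names v1alpha1.LocalStorageProvider.Prefix and the character 2, which is the first digit of the unquoted prefix value.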
Delete the failed rt (Restore) resource and rename the directory.
[root@x.x.x.163 ~]# kubectl delete rt restore-1113 -n tidb
[root@x.x.x.167 /var/lib/kubelet/pods/7ed60c2f-bd80-4017-9202-21f44b7e4eb3/volumes/kubernetes.io~nfs/baknfs]# mv 20211112 bak20211112
[root@x.x.x.167 /var/lib/kubelet/pods/7ed60c2f-bd80-4017-9202-21f44b7e4eb3/volumes/kubernetes.io~nfs/baknfs]# ls
backup-nfs backup-nfs-old bak20211112 bk_test
Update restore.yaml accordingly:
local:
prefix: bak20211112
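One way to rule out type surprises in hand-written YAML is to generate the manifest programmatically and emit JSON, which kubectl apply -f also accepts. The sketch below forces prefix to a string before serializing. The field layout follows restore.yaml from section 3 as read here (toolImage directly under spec is an assumption), the helper name restore_manifest is hypothetical, and whether a purely numeric directory name then works end to end was not tested in this article.

```python
import json


def restore_manifest(name, namespace, cluster, tool_image, prefix,
                     nfs_server, nfs_path, mount_path, volume_name="baknfs"):
    """Build a Restore manifest as a dict; prefix is always serialized as a string."""
    return {
        "apiVersion": "pingcap.com/v1alpha1",
        "kind": "Restore",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "br": {"cluster": cluster, "clusterNamespace": namespace},
            "toolImage": tool_image,  # assumed to sit directly under spec
            "local": {
                "prefix": str(prefix),  # 20211112 becomes "20211112"
                "volume": {
                    "name": volume_name,
                    "nfs": {"server": nfs_server, "path": nfs_path},
                },
                "volumeMount": {"name": volume_name, "mountPath": mount_path},
            },
        },
    }


manifest = restore_manifest(
    name="restore-1113", namespace="tidb", cluster="uat-tidb",
    tool_image="x.x.x.162/pingcap/br:v4.0.12", prefix=20211112,
    nfs_server="x.x.x.180", nfs_path="/data", mount_path="/bk_data",
)
print(json.dumps(manifest, indent=2))
```

The printed JSON can be saved as restore.json and applied in place of restore.yaml.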
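Notes
- Mind BR's own restrictions, for example the new collation setting must be the same on both clusters;
- Mind the naming of the backup folder;
- If a task fails to start, check the log of the tidb-controller-manager pod;
- Check the BR pod log at every step to make sure each operation succeeded;
- Mind the statistics: in some cases they have to be refreshed manually after the restore.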
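For the last note, the statements for a manual statistics refresh can be derived from the same table list used in tableFilter. This is a small sketch that only handles plain db.table entries (no wildcards or exclusion rules); the function name analyze_statements is illustrative, and the resulting SQL would be run in a mysql client against the target cluster.

```python
def analyze_statements(table_filters):
    """Turn plain 'db.table' entries into ANALYZE TABLE statements."""
    statements = []
    for entry in table_filters:
        db, sep, table = entry.partition(".")
        # Reject anything that is not a simple two-part name.
        if not sep or not db or not table or any(c in entry for c in "*?!`"):
            raise ValueError(f"expected a plain db.table name, got {entry!r}")
        statements.append(f"ANALYZE TABLE `{db}`.`{table}`;")
    return statements


# The two tables backed up and restored in this article.
tables = [
    "account_db.table_account_collect_entry",
    "account_db.table_account_counts_cst",
]
for stmt in analyze_statements(tables):
    print(stmt)
```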