Upgrading Andes to 5.2 so I can use vSphere 7.0

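I've been itching to try vSphere 7.0 lately. People often tell me it's better to wait until U1 is out, but then I can't help wondering what the point of GA is (lol).
I don't hesitate much about upgrading vSphere itself, but few IT infrastructures are built from VMware products alone. In our case, we can only commit to a vSphere upgrade once the backup system supports it! So I'm going to upgrade Rubrik CDM Andes to 5.2, which supports vSphere 7. I'll outline the steps here, but environments and requirements differ, so be sure to check the Rubrik CDM Install and Upgrade Guide.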


1. Preparing for the upgrade

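Obtain the upgrade module and its signature, and upload them to /upgrade on a node over sftp.
Do this as the adminstaging user.
If you have upgraded before, old modules may still be sitting in the /upgrade directory, so be sure to delete them.
Once everything is ready, run the precheck.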

YOUR_NODE >> cluster upgrade start --mode prechecks_only
Do you want to use --share rubrik-image-5.2.0-p2-9418.tar.gz [y/N] [N]: y
=======================================
Starting upgrade in prechecks_only mode
=======================================
Upgrade status: Started pre-checks successfully

Wait a little and then check the status. If it shows "Completed successfully" as below, you are good to go.

YOUR_NODE >> cluster upgrade status
Last upgrade mode: prechecks_only
Last upgrade pre-checks node: YOUR_NODE
Last upgrade pre-checks tarball name: --share rubrik-image-5.2.0-p2-9418.tar.gz
Last upgrade pre-checks status: Completed successfully
Last run ended at: 2020-07-07 03:31:14.913000 UTC+0000
Current state: IDLE
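
If you want to check this result from a script rather than by eye, the status text can be split into key/value pairs. The following is only a minimal sketch: the function names are my own (hypothetical), it assumes the output has exactly the "key: value" shape shown above, and it works on text you have already captured, not on a live cluster.

```python
def parse_upgrade_status(text):
    """Split each 'key: value' line of the status output into a dict entry."""
    fields = {}
    for line in text.splitlines():
        # Split on the first ': ' only, so timestamps keep their own colons.
        key, sep, value = line.partition(": ")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def prechecks_passed(text):
    """True when the last run was a precheck that completed and the cluster is idle."""
    fields = parse_upgrade_status(text)
    return (
        fields.get("Last upgrade mode") == "prechecks_only"
        and fields.get("Last upgrade pre-checks status") == "Completed successfully"
        and fields.get("Current state") == "IDLE"
    )


# Status output copied from the transcript above.
SAMPLE_STATUS = """\
Last upgrade mode: prechecks_only
Last upgrade pre-checks node: YOUR_NODE
Last upgrade pre-checks tarball name: --share rubrik-image-5.2.0-p2-9418.tar.gz
Last upgrade pre-checks status: Completed successfully
Last run ended at: 2020-07-07 03:31:14.913000 UTC+0000
Current state: IDLE
"""

if __name__ == "__main__":
    print(prechecks_passed(SAMPLE_STATUS))
```

Checking the mode as well as the status avoids mistaking the result of an older normal upgrade for a fresh precheck.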


2. Running the upgrade

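I had heard various rumors about failed upgrades to 5.2, so this time I was a bit more careful and looked through the KBs. There seem to have been assorted problems on 3rd-party hardware, but SMC hardware looked fine, so I decided to go ahead with the upgrade!
Still, to reduce the uncertainty as much as I could, I read through the Rubrik CDM Install and Upgrade Guide and prepared to do at least some basic troubleshooting.
If the upgrade should fail, it looks like a rollback is possible.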

YOUR_NODE >> cluster upgrade rollback

The upgrade logs can also be viewed with support log_view.
For example, to see the log from the precheck run, select [24: upgrade-service] in log_view and then choose the number for [current].

YOUR_NODE >> support log_view 
1: agent-server
2: backup-agent
3: cassandra
4: cdp-log-receiver
5: cdp-metadata-service
6: cloud-storage-service
7: cluster-config
8: cockroachdb
9: diamond
10: job-fetcher
11: key-wrapper
12: lambda
13: lambda-content-analyzer
14: lambda-parser-service
15: node-monitor
16: pyvmware
17: remote-cluster
18: replication
19: samba
20: sdfs
21: search
22: snapshot
23: spray-server
24: upgrade-service
25: vmware
26: firewall
27: kern
28: syslog

Type 'exit' to exit log_view.
Select service [1..28]: 24
1:    @400000005b66ee2521436374.u    2018-06-20 04:35:01.928485
2:    @400000005ba0fea021873e8c.u    2018-08-05 12:33:12.273457
3:    @400000005bd85432224dfd74.u    2018-09-18 13:34:53.815127
4:    @400000005e047a75155a1a44.u    2019-12-26 09:14:58.818369
5:    @400000005ee4402717f8a284.u    2020-06-13 02:53:46.233864
6:                        current    2020-07-07 03:32:00.921830

Type 'exit' to exit log_view.
Type 'back' to select a different service.
Select file [1..6]: 6
2020-07-07T03:24:48+0000 INFO <32663.Thread-2> [__main__] info:88 [rkcli] Begin request StartUpgradeRequest(tarball_path='/var/lib/rubrik/staging/local/share/upgrade/rubrik-image-5.2.0-p2-9418.tar.gz', signature_path='/var/lib/rubrik/staging/local/share/upgrade/rubrik-image-5.2.0-p2-9418.tar.gz.sig', context=CommonUpgradeRequestContext(current_client_version=1, common_context=RequestContext(timeout_duration_ms=0, ndc_tag='rkcli', verbose_level=None, hard_deadline_duration_ms=0, job_context=None), min_client_version=1), mode=3)
2020-07-07T03:24:48+0000 INFO <32663.Thread-2> [rk_syslog_only] set_upgrade_paths:39 tasks.DeployTask: setting tarball path to /var/lib/rubrik/staging/local/share/upgrade/rubrik-image-5.2.0-p2-9418.tar.gz, signature path to /var/lib/rubrik/staging/local/share/upgrade/rubrik-image-5.2.0-p2-9418.tar.gz.sig
2020-07-07T03:24:48+0000 INFO <32663.Thread-2> [rk_syslog_only] set_upgrade_paths:39 tasks.DeployTask: setting tarball path to /var/lib/rubrik/staging/local/share/upgrade/rubrik-image-5.2.0-p2-9418.tar.gz, signature path to /var/lib/rubrik/staging/local/share/upgrade/rubrik-image-5.2.0-p2-9418.tar.gz.sig
2020-07-07T03:24:48+0000 INFO <32663.Thread-2> [rk_syslog_only] set_upgrade_paths:39 tasks.DeployTask: setting tarball path to /var/lib/rubrik/staging/local/share/upgrade/rubrik-image-5.2.0-p2-9418.tar.gz, signature path to /var/lib/rubrik/staging/local/share/upgrade/rubrik-image-5.2.0-p2-9418.tar.gz.sig
2020-07-07T03:24:48+0000 INFO <32663.Thread-2> [rk_syslog_only] __update_state_machine:407 State machine updated (persist=True): UpgradeStateMachineInfo(upgrade_error_msg='', ondisk_checksum=None, next_task_index=0, current_task_name=None, signature_path='/var/lib/rubrik/staging/local/share/upgrade/rubrik-image-5.2.0-p2-9418.tar.gz.sig', upgrade_node='YOUR_NODE', tarball_path='/var/lib/rubrik/staging/local/share/upgrade/rubrik-image-5.2.0-p2-9418.tar.gz', current_state=None, upgrade_start_timestamp=1592015496984, mode=3, upgrade_status=1, next_task_start_timestamp=0, upgrade_end_timestamp=1592018549984, next_state=0)
...
(省略)
...
2020-07-07T03:32:00+0000 INFO <18198.Thread-6> [__main__] info:162 [rkcli] Begin request UpgradeStatusRequest(context=CommonUpgradeRequestContext(current_client_version=1, common_context=RequestContext(timeout_duration_ms=0, ndc_tag='rkcli', verbose_level=None, hard_deadline_duration_ms=0, job_context=None), min_client_version=1))
2020-07-07T03:32:00+0000 INFO <18198.Thread-6> [rk_syslog_only] __load_state_machine:500 Loaded state machine from file "/home/rkcluster/.upgrade/upgrade_state_machine.json": UpgradeStateMachineInfo(next_task_index=0, upgrade_node='YOUR_NODE', upgrade_start_timestamp=1594092288145, is_auto_rollback=None, tarball_path='/var/lib/rubrik/staging/local/share/upgrade/rubrik-image-5.2.0-p2-9418.tar.gz', next_state=0, upgrade_error_msg='Skip', ondisk_checksum='f3dc6f02935173ada98cb3f86f7be26e', current_task_name=None, signature_path='/var/lib/rubrik/staging/local/share/upgrade/rubrik-image-5.2.0-p2-9418.tar.gz.sig', is_resume=None, current_state=None, mode=3, context=None, upgrade_status=1, next_task_start_timestamp=0, upgrade_end_timestamp=1594092674913)
2020-07-07T03:32:00+0000 INFO <18198.Thread-6> [__main__] info:162 [rkcli] End request UpgradeStatusResponse(pending_states=[], current_task_progress_deprecated=None, current_state_name='IDLE', is_resume=None, failure_point="('Not available',)", current_state=0, is_auto_rollback=None, upgrade_time_left_secs=None, upgrade_timestamp=1594092674913, node_name='YOUR_NODE', tarball_name='/var/lib/rubrik/staging/local/share/upgrade/rubrik-image-5.2.0-p2-9418.tar.gz', upgrade_progress_percentage=None, current_state_progress=1.0, mode=3, context=CommonUpgradeResponseContext(status=Status(message='All good', code=0, excepshuns=[]), current_server_version=1, min_server_version=1), upgrade_status=Status(message='Completed successfully', code=0, excepshuns=[]), finished_states=[], progress=1.0, user_surfaced_task_name='Not available', current_task_name_deprecated=None)
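
The upgrade_start_timestamp and upgrade_end_timestamp values in the state machine lines look like Unix epoch times in milliseconds. That is my assumption, not something the log states, but the end value does line up with the "Last run ended at" time from the precheck status. A small sketch for converting them (the helper name is hypothetical):

```python
from datetime import datetime, timedelta, timezone


def epoch_ms_to_utc(ms):
    """Convert epoch milliseconds to an aware UTC datetime using integer math."""
    seconds, millis = divmod(ms, 1000)
    return datetime.fromtimestamp(seconds, tz=timezone.utc) + timedelta(milliseconds=millis)


# Values taken from the "Loaded state machine" log line above.
start = epoch_ms_to_utc(1594092288145)
end = epoch_ms_to_utc(1594092674913)

if __name__ == "__main__":
    print(start.isoformat())  # 2020-07-07T03:24:48.145000+00:00
    print(end.isoformat())    # 2020-07-07T03:31:14.913000+00:00
    print(end - start)        # 0:06:26.768000
```

Read this way, the precheck run took roughly six and a half minutes.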

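This log_view command is handy because it works a lot like vi. For example, if you only want to look at a particular process, you can search with "/" followed by the pid.
Below is the output for "/18198".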

2020-07-07T03:31:17+0000 DEBUG <18198.state_machine_thread> [rk_syslog_only] unlock:491 Removing lock from metadatastore
2020-07-07T03:31:17+0000 DEBUG <18198.state_machine_thread> [rk_syslog_only] __upgrade_lock_atomic_update:97 Executing update: UPDATE upgrade SET value = NULL WHERE namespace = 'upgrade' AND key = 'lock' IF value = '{"1":{"str":"YOUR_NODE"},"2":{"str":"44d215ae-cc87-4145-99da-e39e4870d1b9"},"3":{"i32":2}}'
2020-07-07T03:31:17+0000 DEBUG <18198.state_machine_thread> [rk_syslog_only] execute:138 CQL => UPDATE upgrade SET value = NULL WHERE namespace = 'upgrade' AND key = 'lock' IF value = '{"1":{"str":"YOUR_NODE"},"2":{"str":"44d215ae-cc87-4145-99da-e39e4870d1b9"},"3":{"i32":2}}'
2020-07-07T03:31:17+0000 DEBUG <18198.state_machine_thread> [rk_syslog_only] __upgrade_lock_atomic_update:99 Result: Row(applied=True)
2020-07-07T03:31:17+0000 INFO <18198.state_machine_thread> [rk_syslog_only] next_state:601 Waiting on start upgrade event
2020-07-07T03:32:00+0000 INFO <18198.Thread-6> [__main__] info:162 [rkcli] Begin request UpgradeStatusRequest(context=CommonUpgradeRequestContext(current_client_version=1, common_context=RequestContext(timeout_duration_ms=0, ndc_tag='rkcli', verbose_level=None, hard_deadline_duration_ms=0, job_context=None), min_client_version=1))
2020-07-07T03:32:00+0000 INFO <18198.Thread-6> [rk_syslog_only] __load_state_machine:500 Loaded state machine from file "/home/rkcluster/.upgrade/upgrade_state_machine.json": UpgradeStateMachineInfo(next_task_index=0, upgrade_node='YOUR_NODE', upgrade_start_timestamp=1594092288145, is_auto_rollback=None, tarball_path='/var/lib/rubrik/staging/local/share/upgrade/rubrik-image-5.2.0-p2-9418.tar.gz', next_state=0, upgrade_error_msg='Skip', ondisk_checksum='f3dc6f02935173ada98cb3f86f7be26e', current_task_name=None, signature_path='/var/lib/rubrik/staging/local/share/upgrade/rubrik-image-5.2.0-p2-9418.tar.gz.sig', is_resume=None, current_state=None, mode=3, context=None, upgrade_status=1, next_task_start_timestamp=0, upgrade_end_timestamp=1594092674913)
2020-07-07T03:32:00+0000 INFO <18198.Thread-6> [__main__] info:162 [rkcli] End request UpgradeStatusResponse(pending_states=[], current_task_progress_deprecated=None, current_state_name='IDLE', is_resume=None, failure_point="('Not available',)", current_state=0, is_auto_rollback=None, upgrade_time_left_secs=None, upgrade_timestamp=1594092674913, node_name='YOUR_NODE', tarball_name='/var/lib/rubrik/staging/local/share/upgrade/rubrik-image-5.2.0-p2-9418.tar.gz', upgrade_progress_percentage=None, current_state_progress=1.0, mode=3, context=CommonUpgradeResponseContext(status=Status(message='All good', code=0, excepshuns=[]), current_server_version=1, min_server_version=1), upgrade_status=Status(message='Completed successfully', code=0, excepshuns=[]), finished_states=[], progress=1.0, user_surfaced_task_name='Not available', current_task_name_deprecated=None)
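
If you have saved the log text somewhere, the same kind of pid lookup can be done offline. The sketch below assumes every line starts with "timestamp LEVEL <pid.thread>" as in the excerpts above. The regular expression and function name are my own. Unlike the "/" search, which matches the digits anywhere in a line, this filters on the pid field only.

```python
import re

# timestamp, log level, then <pid.thread-name>
LOG_PREFIX = re.compile(r"^(\S+) (\w+) <(\d+)\.([^>]+)>")


def lines_for_pid(lines, pid):
    """Return only the log lines whose <pid.thread> prefix carries the given pid."""
    selected = []
    for line in lines:
        match = LOG_PREFIX.match(line)
        if match and int(match.group(3)) == pid:
            selected.append(line)
    return selected


# Three lines copied from the log excerpts above (the first one wrapped for width).
SAMPLE_LINES = [
    "2020-07-07T03:24:48+0000 INFO <32663.Thread-2> [rk_syslog_only] "
    "set_upgrade_paths:39 tasks.DeployTask: setting tarball path to "
    "/var/lib/rubrik/staging/local/share/upgrade/rubrik-image-5.2.0-p2-9418.tar.gz, "
    "signature path to "
    "/var/lib/rubrik/staging/local/share/upgrade/rubrik-image-5.2.0-p2-9418.tar.gz.sig",
    "2020-07-07T03:31:17+0000 DEBUG <18198.state_machine_thread> [rk_syslog_only] "
    "unlock:491 Removing lock from metadatastore",
    "2020-07-07T03:31:17+0000 INFO <18198.state_machine_thread> [rk_syslog_only] "
    "next_state:601 Waiting on start upgrade event",
]

if __name__ == "__main__":
    for line in lines_for_pid(SAMPLE_LINES, 18198):
        print(line)
```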

Now let's run cluster upgrade start and perform the upgrade!

YOUR_NODE >> cluster upgrade start
Do you want to use --share rubrik-image-5.2.0-p2-9418.tar.gz [y/N] (Optional) [N]: y
===============================
Starting upgrade in normal mode
===============================
Upgrade status: Started upgrade successfully

While I was excitedly watching the status, the node rebooted automatically.

YOUR_NODE >> cluster upgrade status
Current upgrade mode: normal
Current upgrade node: YOUR_NODE
Current upgrade tarball name: --share rubrik-image-5.2.0-p2-9418.tar.gz
Current upgrade status: In progress
Current run started at: 2020-07-07 10:55:05.221000 UTC+0000

Current state (9/11): CONFIGURING
Current task: Reboot nodes to load new software
Current state progress: 30.0%
Finished states (8/11): ACQUIRING, COPYING, VERIFYING, UNTARING, DEPLOYING, PRECHECKING, PREPARING, IMAGING
Pending states (2/11): MIGRATING, RESTARTING

Time taken so far: 28 minutes and 27.76 seconds
Overall upgrade progress: 50.0%

Broadcast message from rkcluster@YOUR_NODE (somewhere) (Tue Jul  7 11:26:56

Ansible orchestrated reboot... 
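
The in-progress status is easy to summarize from a script as well. This is a rough sketch that assumes the "(n/total)" layout shown above stays the same. The regular expression, function name, and dictionary keys are all hypothetical names of mine.

```python
import re

STATE_LINE = re.compile(
    r"^(Current state|Finished states|Pending states) \((\d+)/(\d+)\): (.*)$"
)


def parse_progress(text):
    """Collect the state counters and the overall percentage from the status text."""
    progress = {}
    for raw in text.splitlines():
        line = raw.strip()
        match = STATE_LINE.match(line)
        if match:
            label, count, total, names = match.groups()
            progress[label] = {
                "count": int(count),
                "total": int(total),
                "names": [name.strip() for name in names.split(",")],
            }
        elif line.startswith("Overall upgrade progress:"):
            progress["percent"] = float(line.split(":", 1)[1].strip().rstrip("%"))
    return progress


# Relevant lines copied from the status output above.
SAMPLE_PROGRESS = """\
Current state (9/11): CONFIGURING
Current task: Reboot nodes to load new software
Current state progress: 30.0%
Finished states (8/11): ACQUIRING, COPYING, VERIFYING, UNTARING, DEPLOYING, PRECHECKING, PREPARING, IMAGING
Pending states (2/11): MIGRATING, RESTARTING

Time taken so far: 28 minutes and 27.76 seconds
Overall upgrade progress: 50.0%
"""

if __name__ == "__main__":
    info = parse_progress(SAMPLE_PROGRESS)
    print(info["Current state"]["names"][0], info["percent"])
```

In this output the 8 finished states, the 1 current state, and the 2 pending states add up to the total of 11, which makes a handy sanity check on the parsed result.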

Wait a while, and once you can connect over ssh again, check the status. If it shows "Completed successfully", the upgrade succeeded!

YOUR_NODE >> cluster upgrade status
Last upgrade mode: normal
Last upgrade node: YOUR_NODE
Last upgrade tarball name: --share rubrik-image-5.2.0-p2-9418.tar.gz
Last upgrade status: Completed successfully
Last run ended at: 2020-07-07 11:50:17.676000 UTC+0000
Current state: IDLE

You can check the version with the cluster version command.

YOUR_NODE >> cluster version  
5.2.0-p2-9418
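
Since the whole point here was to get to a vSphere 7 capable release, a script could compare the reported version against 5.2. This sketch only reads the leading major.minor.maintenance numbers. I am not sure how the "-p2-9418" suffix should be interpreted, so it is deliberately ignored. The function names and the "5.1.0" comparison input are made up for illustration.

```python
import re

# Leading "major.minor.maintenance" digits; anything after them is ignored.
VERSION_PREFIX = re.compile(r"^(\d+)\.(\d+)\.(\d+)")


def release_tuple(version):
    """Return (major, minor, maintenance) parsed from a string like '5.2.0-p2-9418'."""
    match = VERSION_PREFIX.match(version.strip())
    if match is None:
        raise ValueError("unexpected version string: %r" % version)
    return tuple(int(part) for part in match.groups())


def meets_minimum(version, minimum=(5, 2, 0)):
    """True when the release is at least the minimum (default: 5.2.0)."""
    return release_tuple(version) >= minimum


if __name__ == "__main__":
    print(release_tuple("5.2.0-p2-9418"))  # (5, 2, 0)
    print(meets_minimum("5.2.0-p2-9418"))  # True
    print(meets_minimum("5.1.0"))          # False (hypothetical older release)
```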


Our r334 upgraded to 5.2 without any problems at all. The "good-luck charm" I got from Chris must be doing its job (lol).