Background / what I want to achieve

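I am building servers in a Multi-AZ HA configuration on AWS.
The cluster software is Pacemaker and corosync.
I configured a VIP, but it does not work correctly.
Also, ever since I set up Pacemaker, pcs status shows only the local server as Online,
while the peer server stays Offline.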

I am following the article below.
https://qiita.com/kenzo0107/items/851c002b6ea62f9a07c4

OS: Red Hat Enterprise Linux 8 with High Availability
The AWS settings are almost unchanged from the article at the URL above.

Problem / error messages

#pcs status node[server01]
Pacemaker Nodes:
Online:ip-10-10-10-1.ap-northeast-1.compute.internal
Standby:
Standby with resource(s) running:
Maintenance:
Offline:ip-10-10-10-2.ap-northeast-1.compute.internal

#pcs status node[server02]
Pacemaker Nodes:
Online:ip-10-10-10-2.ap-northeast-1.compute.internal
Standby:
Standby with resource(s) running:
Maintenance:
Offline:ip-10-10-10-1.ap-northeast-1.compute.internal

#pcs config[server01]
Cluster Name: aws-cluster
Corosync Nodes:
ip-10-10-10-1.ap-northeast-1.compute.internal ip-10-10-10-2.ap-northeast-1.compute.internal
Pacemaker Nodes:
ip-10-10-10-1.ap-northeast-1.compute.internal ip-10-10-10-2.ap-northeast-1.compute.internal
Resources:
Resource: eip (class=ocf provider=heartbeat type=eip)
Attributes: elastic_ip=1.1.1.1 (example value)
Operations: start interval=0s timeout=60s on-fail=stop (eip-start-interval-0s)
monitor interval=10s timeout=60s on-fail=restart (eip-monitor-interval-10s)
stop interval=0s timeout=60s on-fail=block (eip-stop-interval-0s)

Stonith Devices:
Fencing Levels:
Location Constraints:
Ordering Constraints:
Colocation Constraints:
Resources Defaults:
resource-stickiness: INFINITY
migration-threshold: 1
Operations Defaults:
No defaults set
Cluster Properties:
cluster-infrastructure: corosync
cluster-name: aws-cluster
transition-delay: 0s
dc-version: 1.1.13-10.el7-44eb2dd
have-watchdog: false
no-quorum-policy: ignore
stonith-enabled: false

#pcs config[server02]
Cluster Name: aws-cluster
Corosync Nodes:
ip-10-10-10-2.ap-northeast-1.compute.internal ip-10-10-10-1.ap-northeast-1.compute.internal
Pacemaker Nodes:
ip-10-10-10-2.ap-northeast-1.compute.internal ip-10-10-10-1.ap-northeast-1.compute.internal
Resources:
Stonith Devices:
Fencing Levels:
Location Constraints:
Ordering Constraints:
Colocation Constraints:
Ticket Constraints:
Alerts:
No alerts defined
Resources Defaults:
No defaults set
Operations Defaults:
No defaults set
Cluster Properties:
cluster-infrastructure: corosync
cluster-name: aws-cluster
dc-version: 2.0.5-9.el8_4.1-ba59be7122
have-watchdog: false
Tags:
No tags defined
Quorum:
Options:

#pcs status[server01]
Cluster name: aws-cluster
Cluster Summary:
* Stack: corosync
* Current DC: ip-10-10-10-1.ap-northeast-1.compute.internal (version 2.0.5-9.el8_4.1-ba59be7122) - partition WITHOUT quorum
* Last updated: Fri Jun 4 14:20:06 2021
* Last change: Fri Jun 4 09:46:23 2021 by root via cibadmin on ip-10-10-10-1.ap-northeast-1.compute.internal
* 2 nodes configured
* 1 resource instance configured
Node List:
* Online: [ ip-10-10-10-1.ap-northeast-1.compute.internal ]
* OFFLINE: [ ip-10-10-10-2.ap-northeast-1.compute.internal ]

Full List of Resources:
* eip (ocf::heartbeat:eip): Stopped
Failed Resource Actions:
* eip_start_0 on ip-10-10-10-1.ap-northeast-1.compute.internal 'error' (1): call=6, status='complete', exitreason='', last-rc-change='2021-06-04 09:32:25 +09:00', queued=0ms, exec=12ms

Daemon Status:
corosync: active/disabled
pacemaker: active/disabled
pcsd: active/enabled

#pcs status[server02]
Cluster name: aws-cluster
WARNINGS:
No stonith devices and stonith-enabled is not false
Cluster Summary:
* Stack: corosync
* Current DC: ip-10-10-10-2.ap-northeast-1.compute.internal (version 2.0.5-9.el8_4.1-ba59be7122) - partition WITHOUT quorum
* Last updated: Fri Jun 4 14:28:01 2021
* Last change: Fri Jun 4 09:29:47 2021 by hacluster via crmd on ip-10-10-10-2.ap-northeast-1.compute.internal
* 2 nodes configured
* 0 resource instances configured
Node List:
* Node ip-10-10-10-1.ap-northeast-1.compute.internal: UNCLEAN (offline)
* Online: [ ip-10-10-10-2.ap-northeast-1.compute.internal ]

Full List of Resources:
* No resources
Daemon Status:
corosync: active/disabled
pacemaker: active/disabled
pcsd: active/enabled

corosync.conf

totem {
version: 2
transport:udpu
}

nodelist {
node {
ring0_addr: 10.10.10.1
name: ip-10-10-10-1.ap-northeast-1.compute.internal
nodeid: 1
}

node {
ring0_addr: 10.10.10.2
name: ip-10-10-10-2.ap-northeast-1.compute.internal
nodeid: 2
}
}

quorum {
provider: corosync_votequorum
two_node: 2
}

logging {
to_logfile: yes
logfile: /var/log/cluster/corosync.log
timestamp: on
}

#pcs status after the nodes came Online [server01]
Cluster name: aws-cluster
Cluster Summary:

Stack: corosync
Current DC: ip-10-10-10-1.ap-northeast-1.compute.internal (version 2.0.5-9.el8_4.1-ba59be7122) - partition with quorum
Last updated: Mon Jun 7 03:39:48 2021
Last change: Mon Jun 7 03:23:05 2021 by root via cibadmin on ip-10-10-10-2.ap-northeast-1.compute.internal
2 nodes configured
1 resource instance configured

Node List:

Online: [ ip-10-10-10-1.ap-northeast-1.compute.internal ip-10-10-10-2.ap-northeast-1.compute.internal ]

Full List of Resources:

eip (ocf::heartbeat:eip): Stopped

Failed Resource Actions:

eip_start_0 on ip-10-10-10-2.ap-northeast-1.compute.internal 'error' (1): call=244, status='complete', exitreason='', last-rc-change='2021-06-07 03:10:06 +09:00', queued=0ms, exec=14ms

Daemon Status:
corosync: active/enabled
pacemaker: active/enabled
pcsd: active/enabled

yukky1201

2021/06/04 05:17

The pcs status result currently in the question appears to be from server1 only. Please also post the pcs status result from server2.

kensan_p

2021/06/04 05:49

My apologies. What I had posted was actually the pcs config output, so I have added the status output as well. Thank you in advance for your help.

yukky1201

2021/06/04 05:57

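Could you edit the question so that server1 and server2 can be told apart? I assume they could be identified by IP address, but the addresses are masked, so I cannot tell which is which.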

kensan_p

2021/06/04 06:26

I am very sorry. I have fixed it.

TaichiYanagiya

2021/06/04 12:18

Is corosync.conf set to "transport: udpu"? Multicast does not pass in an AWS environment.

kensan_p

2021/06/04 12:57

I made the change above and checked again, but the state did not change. Is there anything else that needs to be configured? corosync.conf is posted above.

TaichiYanagiya

2021/06/04 13:45

Sorry. On RHEL 8, "transport: knet" is the default, so it seems you do not need to switch to udpu.

kensan_p

2021/06/06 19:15 (edited)

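Thank you. When I investigated the security group (SG), the outbound rules were wrong; after I corrected them, the nodes came Online. However, reassigning the EIP still did not work. I have pasted the pcs status output again above.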


2 answers

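Please check whether communication on UDP port 5405, which corosync uses, is possible in both directions between the nodes.
Looking at the referenced URL, the security group allows all TCP and ICMP, but there appears to be no rule allowing UDP.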
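
One way to test the UDP path between the two nodes is a small echo probe. The following is only a sketch, not part of the original answer: the function names `udp_echo_once` and `udp_probe` are made up for illustration, and it assumes corosync is stopped on the listening node so that port 5405 is free to bind (corosync itself will not echo an arbitrary datagram).

```python
import socket


def udp_echo_once(sock: socket.socket) -> None:
    """Receive one datagram on an already-bound socket and send it back."""
    data, addr = sock.recvfrom(1024)
    sock.sendto(data, addr)


def udp_probe(host: str, port: int, timeout: float = 2.0) -> bool:
    """Send one datagram to host:port and report whether the echo came back."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.settimeout(timeout)
        try:
            s.sendto(b"ping", (host, port))
            data, _ = s.recvfrom(1024)
        except OSError:  # covers timeouts and ICMP "port unreachable" errors
            return False
        return data == b"ping"


# Hypothetical usage, run by hand on the two nodes:
#   on 10.10.10.2:  sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
#                   sock.bind(("0.0.0.0", 5405)); udp_echo_once(sock)
#   on 10.10.10.1:  print(udp_probe("10.10.10.2", 5405))
```

Running the probe in both directions matters, because a wrong outbound rule on one side blocks only one direction. A False result does not say which layer dropped the packet (security group, network ACL, or a host firewall).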

(2021/06/09 11:52)

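(The question has changed, so I think it would be better to post this as a separate question.)

The details of why the eip resource failed to start should be written to the logs.
Search the following logs for "eip_start".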

/var/log/cluster/ha.log (old?)
/var/log/pacemaker/pacemaker.log
/var/log/messages (or the output of "journalctl -u pacemaker")

The exit code of the start script is recorded in the log, so trace through the script to find where it fails.
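
The log search can be scripted so that every candidate file is checked in one pass. This is a minimal sketch under assumptions: the helper names `find_action_lines` and `scan_logs` are hypothetical, the paths are the ones listed above, and files that are missing or unreadable are simply skipped.

```python
from pathlib import Path
from typing import Dict, Iterable, List

# Log files named in this answer; not all of them exist on every system.
LOG_PATHS = [
    "/var/log/cluster/ha.log",
    "/var/log/pacemaker/pacemaker.log",
    "/var/log/messages",
]


def find_action_lines(lines: Iterable[str], action: str = "eip_start") -> List[str]:
    """Return the lines that mention the given resource action."""
    return [line.rstrip("\n") for line in lines if action in line]


def scan_logs(paths: Iterable[str] = LOG_PATHS,
              action: str = "eip_start") -> Dict[str, List[str]]:
    """Map each readable log file to its matching lines, skipping the rest."""
    hits: Dict[str, List[str]] = {}
    for name in paths:
        path = Path(name)
        if not path.is_file():
            continue
        try:
            with path.open(errors="replace") as fh:
                found = find_action_lines(fh, action)
        except PermissionError:
            continue  # typically means the script was not run as root
        if found:
            hits[name] = found
    return hits
```

Calling `scan_logs()` as root and printing the result gives the same lines that grepping each file for "eip_start" would.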

With resource-agents-4.1.1-90.el8.x86_64 the start script is /usr/lib/ocf/resource.d/heartbeat/awseip, but judging from the question, is yours /usr/lib/ocf/resource.d/heartbeat/eip?

Posted 2021/06/04 13:45

Edited 2021/06/09 02:52

TaichiYanagiya

Total score 12173

kensan_p

2021/06/06 06:34

UDP is fully allowed in the security group.


pcs status node[server01]

Pacemaker Nodes:
Online:ip-10-10-10-1.ap-northeast-1.compute.internal
Standby:
Standby with resource(s) running:
Maintenance:
Offline:ip-10-10-10-2.ap-northeast-1.compute.internal

pcs status node[server02]

Pacemaker Nodes:
Online:ip-10-10-10-2.ap-northeast-1.compute.internal
Standby:
Standby with resource(s) running:
Maintenance:
Offline:ip-10-10-10-1.ap-northeast-1.compute.internal

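In both outputs the local node is online and the peer node is offline, so I suspect the two nodes cannot communicate with each other.
If a firewall (the firewalld process) is running, how about stopping it temporarily (systemctl stop firewalld) to isolate the cause?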
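
The reasoning about the two "pcs status node" outputs can be written down as a small check. This is only an illustrative sketch: the function names are hypothetical, and the parser assumes the simple "State:node node ..." line layout shown in the pasted output, not any stable pcs format.

```python
from typing import Dict, List

SERVER01 = """Pacemaker Nodes:
Online:ip-10-10-10-1.ap-northeast-1.compute.internal
Standby:
Standby with resource(s) running:
Maintenance:
Offline:ip-10-10-10-2.ap-northeast-1.compute.internal
"""

SERVER02 = """Pacemaker Nodes:
Online:ip-10-10-10-2.ap-northeast-1.compute.internal
Standby:
Standby with resource(s) running:
Maintenance:
Offline:ip-10-10-10-1.ap-northeast-1.compute.internal
"""


def parse_status_node(text: str) -> Dict[str, List[str]]:
    """Split each 'State:node node ...' line into a state and its node list."""
    states: Dict[str, List[str]] = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if sep:
            states[key.strip()] = rest.split()
    return states


def sees_only_itself(text: str, self_name: str) -> bool:
    """True when the node lists only itself Online and someone else Offline."""
    states = parse_status_node(text)
    return states.get("Online", []) == [self_name] and bool(states.get("Offline"))


node1 = "ip-10-10-10-1.ap-northeast-1.compute.internal"
node2 = "ip-10-10-10-2.ap-northeast-1.compute.internal"
# Both nodes claiming to be the only Online member points at a network split.
split_view = sees_only_itself(SERVER01, node1) and sees_only_itself(SERVER02, node2)
print(split_view)
```

If only one side reported the other as Offline, a stopped cluster service on the peer would be the more likely explanation; both sides doing so suggests the traffic between them is being dropped.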

Posted 2021/06/04 08:25

yukky1201

Total score 2751