高崖仙月: An Introduction to HACMP Heartbeating

Source: Baidu Wenku  Editor: 偶看新闻  Time: 2024/05/01 14:09:26

 

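Heartbeating monitors the availability of network interfaces, communication devices, and IP labels (service, non-service, persistent), as well as the availability of the nodes themselves.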

 

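Starting with HACMP 5.1, HA uses only heartbeating based on RSCT Topology Services. Classic heartbeating, which used Network Interface Modules (NIMs) monitored directly by clstrmgrES, is no longer used.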

 

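HA implements heartbeating by exchanging messages between the nodes over every communication interface and device. Each node sends heartbeats to the other nodes at a specified interval and expects to receive heartbeats from them within a specified interval. If the heartbeats stop arriving, RSCT concludes that a failure has occurred and reports it to HACMP, which then takes the appropriate recovery action.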
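The interval-and-timeout idea in the paragraph above can be sketched in a few lines of Python. This is only an illustration of the general technique, not RSCT's actual algorithm. The class name `HeartbeatMonitor` and its methods are hypothetical; the two parameters borrow the terms "HB Interval" and "Sensitivity" from the `lssrc -ls topsvcs` output shown in the test section of this article.

```python
from dataclasses import dataclass


@dataclass
class HeartbeatMonitor:
    """Tracks one neighbor on one heartbeat path (hypothetical helper)."""
    interval: float          # seconds between expected heartbeats
    sensitivity: int         # missed beats tolerated before declaring a failure
    last_seen: float = 0.0   # timestamp of the most recent heartbeat

    def record_heartbeat(self, now: float) -> None:
        """Remember when the latest heartbeat arrived."""
        self.last_seen = now

    def missed_beats(self, now: float) -> int:
        """Count whole intervals that elapsed without a heartbeat."""
        return max(0, int((now - self.last_seen) // self.interval))

    def is_failed(self, now: float) -> bool:
        """Report a failure once the missed-beat count reaches the limit."""
        return self.missed_beats(now) >= self.sensitivity


monitor = HeartbeatMonitor(interval=1.0, sensitivity=10)
monitor.record_heartbeat(now=100.0)
print(monitor.is_failed(now=105.0))  # 5 beats missed, still tolerated
print(monitor.is_failed(now=110.0))  # 10 beats missed, reported as failed
```

In a real cluster the failure report would go to the cluster manager rather than to `print`, and timestamps would come from a monotonic clock.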

 

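Heartbeat messages can be exchanged over two kinds of paths:

       IP networks

       Non-IP networks

Cluster islands (a partitioned cluster): if a TCP/IP network problem (switch, router, hub) prevents IP-based heartbeats from being sent and received, and no non-IP heartbeat path exists, every node assumes that the other nodes have failed and tries to acquire the resources itself. This endangers data consistency and integrity, so HACMP must be able to tell an IP network failure from a node failure in order to prevent islands from forming.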

 

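Non-IP heartbeating does not use the TCP/IP network to carry heartbeats, so it effectively prevents the cluster islands that a TCP/IP network failure would otherwise cause.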
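The reasoning above, that a second, non-IP path lets a node tell a network failure from a node failure, can be written as a small decision function. This is a minimal sketch of the logic only; the function name `diagnose_peer` and the returned strings are hypothetical and do not correspond to any HACMP or RSCT interface.

```python
def diagnose_peer(ip_heartbeat_ok: bool, non_ip_heartbeat_ok: bool) -> str:
    """Classify the state of a peer node from two independent heartbeat paths."""
    if ip_heartbeat_ok and non_ip_heartbeat_ok:
        return "peer alive, all paths healthy"
    if non_ip_heartbeat_ok:
        # IP heartbeats are lost, but the peer still answers over the
        # serial line or the disk: the network failed, not the node.
        return "IP network failure, peer still alive"
    if ip_heartbeat_ok:
        return "non-IP path failure, peer still alive"
    # No path delivers heartbeats. With only IP paths configured, a node
    # cannot distinguish this from a network outage, which is how
    # cluster islands form.
    return "peer node failure"


for ip_ok in (True, False):
    for non_ip_ok in (True, False):
        print(ip_ok, non_ip_ok, "->", diagnose_peer(ip_ok, non_ip_ok))
```

Only the case in which every path is silent leads to a takeover, which is why at least one non-IP path is worth configuring.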

 

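Disk-based heartbeating

       Supported only from HACMP 5.1 onward.

This type of heartbeat supports SSA, SCSI, and FC storage and uses a disk (diskhb) to exchange heartbeat messages. The disk must belong to an enhanced concurrent VG. The same disk can also be used to store other shared data.

1.  One disk belongs to one network (2 nodes), and the disk must have the same ID on both nodes

2.  Configure one network for each pair of nodes

3.  The disk must be part of an enhanced concurrent VG, but it is independent of any RG (resource group)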

 

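Heartbeating over IP aliases

       With heartbeating over IP aliases, HACMP adds an IP alias to every existing IP interface at startup and uses it to exchange heartbeat messages. The aliases must be on separate subnets and must not be part of any name resolution. RSCT uses the aliases to build a communication group (heartbeat ring) for each communication interface and exchanges heartbeats within it. This mode no longer monitors the base IP addresses; it monitors the communication interfaces and the service IP instead. The subnet mask of the IP aliases must be the same as the subnet mask of the service IP. To configure HACMP with heartbeating over IP aliases, you need to specify the starting IP address.

Example: a 2-node HA cluster in which each node has two network interfaces, en0 and en1, and the starting IP alias for heartbeating is 192.168.1.1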

Adapter/Node    Node1          Node2          Ring
en0             192.168.1.1    192.168.1.2    Ring1
en1             192.168.2.1    192.168.2.2    Ring2

Adapters that use IP aliases are stored in the HACMPadapter ODM class.
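The address layout in the example table can be reproduced with Python's standard `ipaddress` module. This is a sketch under stated assumptions, not HACMP's own allocation code: it assumes a netmask of 255.255.255.0 (the article does not state one), that each interface gets the next consecutive subnet, and that each node gets the next host address within it. The function name `heartbeat_aliases` is hypothetical.

```python
import ipaddress


def heartbeat_aliases(start, netmask, interfaces, nodes):
    """Return {interface: {node: alias}} starting from the given address.

    Assumption: one subnet per interface (one heartbeat ring), with
    consecutive host addresses assigned to the nodes in order.
    """
    first = ipaddress.IPv4Interface(f"{start}/{netmask}")
    subnet_size = first.network.num_addresses
    plan = {}
    for i, ifname in enumerate(interfaces):
        base = first.ip + i * subnet_size   # jump to the next subnet
        plan[ifname] = {node: str(base + j) for j, node in enumerate(nodes)}
    return plan


plan = heartbeat_aliases("192.168.1.1", "255.255.255.0",
                         ["en0", "en1"], ["Node1", "Node2"])
for ifname, addrs in plan.items():
    print(ifname, addrs)
```

With the inputs shown, en0 receives 192.168.1.1 and 192.168.1.2 and en1 receives 192.168.2.1 and 192.168.2.2, matching the table.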

 

Heartbeat communication test

 

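System environment: H80 (OS 5200-08, HA 5.3), F50 (OS 5200-08, HA 5.3), FAStT600

Heartbeat configuration: heartbeating over IP aliases, starting alias IP 10.0.3.1

          Serial heartbeat, connected to serial port 3 on each host --> tty1

             Disk heartbeat, enhanced concurrent VG hbvg --> hdisk4 (FAStT storage)

Test:

1.  Start HA on both nodes

2.  On the primary node, run lssrc -ls topsvcs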

# lssrc -ls topsvcs

Subsystem         Group            PID     Status

 topsvcs          topsvcs          491574  active

Network Name   Indx Defd  Mbrs  St   Adapter ID      Group ID

net_ether_01_0 [ 0] 2     2     S    10.0.4.2        10.0.4.2      

net_ether_01_0 [ 0] en1              0x4508653b      0x45086568

HB Interval = 1.000 secs. Sensitivity = 10 missed beats

Missed HBs: Total: 0 Current group: 0

Packets sent    : 249 ICMP 0 Errors: 0 No mbuf: 0

Packets received: 387 ICMP 0 Dropped: 0

NIM's PID: 454686

net_ether_01_1 [ 1] 2     2     S    10.0.3.2        10.0.3.2      

net_ether_01_1 [ 1] en0              0x4508653c      0x45086569

HB Interval = 1.000 secs. Sensitivity = 10 missed beats

Missed HBs: Total: 0 Current group: 0

Packets sent    : 249 ICMP 0 Errors: 0 No mbuf: 0

Packets received: 387 ICMP 0 Dropped: 0

NIM's PID: 503972

rs232_0        [ 2] 2     2     S    255.255.0.1     255.255.0.1   

rs232_0        [ 2] tty1             0x8508656b      0x8508656e

HB Interval = 2.000 secs. Sensitivity = 5 missed beats

Missed HBs: Total: 0 Current group: 0

Packets sent    : 186 ICMP 0 Errors: 0 No mbuf: 0

Packets received: 178 ICMP 0 Dropped: 0

NIM's PID: 532510

diskhb_0       [ 3] 2     2     S    255.255.10.1    255.255.10.1  

diskhb_0       [ 3] rhdisk4          0x8508653a      0x8508656c

HB Interval = 2.000 secs. Sensitivity = 4 missed beats

Missed HBs: Total: 0 Current group: 0

Packets sent    : 126 ICMP 0 Errors: 0 No mbuf: 0

Packets received: 125 ICMP 0 Dropped: 0

NIM's PID: 434228

  2 locally connected Clients with PIDs:

haemd(413824) hagsd(450668)

  Dead Man Switch Enabled:

     reset interval = 1 seconds

     trip  interval = 20 seconds

  Configuration Instance = 3

  Daemon employs no security

  Segments pinned: Text Data.

  Text segment size: 767 KB. Static data segment size: 957 KB.

  Dynamic data segment size: 4233. Number of outstanding malloc: 222

  User time 0 sec. System time 0 sec.

  Number of page faults: 263. Process swapped out 0 times.

  Number of nodes up: 2. Number of nodes down: 0.

 

Processes used for heartbeating

# ps -ef|grep nim

    root 434228 491574   0 15:08:23      -  0:00 /usr/sbin/rsct/bin/hats_diskhb_nim

    root 454686 491574   0 15:08:23      -  0:00 /usr/sbin/rsct/bin/hats_nim

    root 503972 491574   0 15:08:23      -  0:00 /usr/sbin/rsct/bin/hats_nim

    root 532510 491574   1 15:08:23      -  0:01 /usr/sbin/rsct/bin/hats_rs232_nim

 

3.  On the backup node, run lssrc -ls topsvcs

# lssrc -ls topsvcs

Subsystem         Group            PID     Status

 topsvcs          topsvcs          29482   active

Network Name   Indx Defd  Mbrs  St   Adapter ID      Group ID

net_ether_01_0 [ 0] 2     2     S    10.0.4.1        10.0.4.2      

net_ether_01_0 [ 0] en1              0x45087231      0x45086568

HB Interval = 1.000 secs. Sensitivity = 10 missed beats

Missed HBs: Total: 1 Current group: 1

Packets sent    : 1174 ICMP 0 Errors: 0 No mbuf: 0

Packets received: 1726 ICMP 0 Dropped: 0

NIM's PID: 28006

net_ether_01_1 [ 1] 2     2     S    10.0.3.1        10.0.3.2      

net_ether_01_1 [ 1] en0              0x45087232      0x45086569

HB Interval = 1.000 secs. Sensitivity = 10 missed beats

Missed HBs: Total: 0 Current group: 0

Packets sent    : 1175 ICMP 0 Errors: 0 No mbuf: 0

Packets received: 1724 ICMP 0 Dropped: 0

NIM's PID: 27640

rs232_0        [ 2] 2     2     S    255.255.0.0     255.255.0.1   

rs232_0        [ 2] tty1             0x85087233      0x8508656e

HB Interval = 2.000 secs. Sensitivity = 5 missed beats

Missed HBs: Total: 1 Current group: 1

Packets sent    : 13173 ICMP 0 Errors: 0 No mbuf: 0

Packets received: 810 ICMP 0 Dropped: 0

NIM's PID: 25294

diskhb_0       [ 3] 2     2     S    255.255.10.0    255.255.10.1  

diskhb_0       [ 3] rhdisk4          0x85087234      0x8508656c

HB Interval = 2.000 secs. Sensitivity = 4 missed beats

Missed HBs: Total: 0 Current group: 0

Packets sent    : 568 ICMP 0 Errors: 0 No mbuf: 0

Packets received: 570 ICMP 0 Dropped: 0

NIM's PID: 26892

  2 locally connected Clients with PIDs:

haemd( 26836) hagsd( 27390)

  Dead Man Switch Enabled:

     reset interval = 1 seconds

     trip  interval = 20 seconds

  Configuration Instance = 3

  Daemon employs no security

  Segments pinned: Text Data.

  Text segment size: 767 KB. Static data segment size: 957 KB.

  Dynamic data segment size: 4169. Number of outstanding malloc: 222

  User time 1 sec. System time 3 sec.

  Number of page faults: 423. Process swapped out 0 times.

  Number of nodes up: 2. Number of nodes down: 0.

 

The output above shows four heartbeat rings carrying heartbeat signals: two Ethernet, one serial, and one disk. Heartbeats travel within each ring.
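Counting the rings by eye works for four networks, but the same information can be pulled out of the output with a short parser. The sketch below is an assumption-laden illustration: the regular expressions were written against the primary-node output reproduced in this article, and the layout of `lssrc -ls topsvcs` may differ between RSCT levels. `SAMPLE` is an excerpt of that output, and `parse_topsvcs` is a hypothetical helper name.

```python
import re

SAMPLE = """\
net_ether_01_0 [ 0] 2     2     S    10.0.4.2        10.0.4.2
net_ether_01_0 [ 0] en1              0x4508653b      0x45086568
HB Interval = 1.000 secs. Sensitivity = 10 missed beats
net_ether_01_1 [ 1] 2     2     S    10.0.3.2        10.0.3.2
net_ether_01_1 [ 1] en0              0x4508653c      0x45086569
HB Interval = 1.000 secs. Sensitivity = 10 missed beats
rs232_0        [ 2] 2     2     S    255.255.0.1     255.255.0.1
rs232_0        [ 2] tty1             0x8508656b      0x8508656e
HB Interval = 2.000 secs. Sensitivity = 5 missed beats
diskhb_0       [ 3] 2     2     S    255.255.10.1    255.255.10.1
diskhb_0       [ 3] rhdisk4          0x8508653a      0x8508656c
HB Interval = 2.000 secs. Sensitivity = 4 missed beats
"""

# Name, index, defined count, member count, state, adapter ID, group ID.
STATE = re.compile(r"^(\S+)\s+\[\s*\d+\]\s+(\d+)\s+(\d+)\s+(\S+)\s+\S+\s+\S+$")
# Name, index, device name, followed by two hexadecimal IDs.
DEVICE = re.compile(r"^(\S+)\s+\[\s*\d+\]\s+(\S+)\s+0x[0-9a-fA-F]+\s+0x[0-9a-fA-F]+$")
HB = re.compile(r"HB Interval = ([\d.]+) secs\. Sensitivity = (\d+) missed beats")


def parse_topsvcs(text):
    """Return {network name: details} for each heartbeat ring in the text."""
    rings = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        state, device, hb = STATE.match(line), DEVICE.match(line), HB.search(line)
        if state:
            current = rings.setdefault(state.group(1), {})
            current.update(defined=int(state.group(2)),
                           members=int(state.group(3)),
                           state=state.group(4))
        elif device:
            current = rings.setdefault(device.group(1), {})
            current["device"] = device.group(2)
        elif hb and current is not None:
            current["interval"] = float(hb.group(1))
            current["sensitivity"] = int(hb.group(2))
    return rings


rings = parse_topsvcs(SAMPLE)
print(len(rings), "heartbeat rings")
for name, info in rings.items():
    print(name, info["device"], info["interval"], info["sensitivity"])
```

A ring whose member count is lower than its defined count would be the first thing to look for when a heartbeat path is suspected to be down.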

Heartbeat traffic can also be checked in the logs:

In the /var/ha/log directory: nim.topsvcs.en0.wh, nim.topsvcs.en1.wh, nim.topsvcs.tty1.wh, nim.topsvcs.rhdisk4.wh

 

3.  Bring down network adapter en0

   HA performed a normal adapter swap. Now look at the heartbeat log nim.topsvcs.en0.wh:

09/14 10:22:05.356: Error sending to 10.0.3.1: Bad file number.

09/14 10:22:05.356: Dispatching netmon request while another in progress.

09/14 10:22:05.356: Received a SEND MSG command. Dst: 10.0.3.1.

09/14 10:22:05.376: Error sending to 10.0.3.1: Network is down.

09/14 10:22:05.376: Error sending to 10.0.3.1: Network is down.

09/14 10:22:05.376: Error sending to 10.0.3.1: Network is down.

09/14 10:22:05.376: Error sending to 10.0.3.1: Network is down.

09/14 10:22:05.376: Error sending to 10.0.3.1: Network is down.

09/14 10:22:08.538: netmon response: Adapter is down

09/14 10:22:08.538: Adapter status successfully sent.

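   At this point en0 no longer sends heartbeats. On the backup node, en0 finds that heartbeats sent to 10.0.3.2 fail and receives a command to stop sending heartbeats; after that, the address it sends heartbeats to changes to 10.0.3.255.

   After en0 is brought back up, heartbeats resume normally.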

4.  The other heartbeat rings behave similarly.

5.  Changing heartbeat-related parameters

  Extended Configuration --> Extended Topology Configuration -->

Configure HACMP Network Modules -->

Change a Network Module using Predefined Values

Select ether, diskhb, and rs232 in turn

 

                   Change a Cluster Network Module using Pre-defined Values

 

 

                                                        

[Entry Fields]

* Network Module Name                               diskhb

  Description                                         Disk Heartbeat Serial protocol

  Failure Detection Rate                                Slow                                                                                          

 

 

  NOTE: Changes made to this panel must be

        propagated to the other nodes by

        Verifying and Synchronizing the cluster

 

The Slow field can be changed to Normal or Fast.

The Show a Network Module menu can also be used to view the current settings.
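To get a feel for what a change of Failure Detection Rate means in seconds, the HACMP documentation gives, as far as I recall, the rule of thumb heartbeat rate x failure cycle x 2. The sketch below applies that rule to the values in the `lssrc -ls topsvcs` output earlier in this article. Two assumptions are involved and should be checked against the documentation for your release: that the formula applies unchanged, and that "HB Interval" and "Sensitivity" in the output correspond to the heartbeat rate and the failure cycle. The function name is hypothetical.

```python
def failure_detection_time(hb_interval_secs: float, sensitivity: int) -> float:
    """Approximate detection time: heartbeat rate x failure cycle x 2."""
    return hb_interval_secs * sensitivity * 2


# (HB Interval, Sensitivity) pairs as reported in the lssrc output above.
observed = {
    "ether": (1.0, 10),
    "rs232": (2.0, 5),
    "diskhb": (2.0, 4),
}

for module, (interval, sensitivity) in observed.items():
    seconds = failure_detection_time(interval, sensitivity)
    print(f"{module}: about {seconds:.0f} seconds")
```

Under these assumptions the Ethernet and serial rings would each need about 20 seconds to report a failure and the disk ring about 16 seconds.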