Reliable DUCC - Design

Written and maintained by the Apache
UIMA™ Development Community

Copyright ©  2012 The Apache Software Foundation

Copyright ©  2012 International Business Machines Corporation

License and Disclaimer

The ASF licenses this documentation to you under the Apache License, Version 2.0 (the "License"); you may not use this documentation except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, this documentation and its contents are distributed under the License on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Trademarks

All terms mentioned in the text that are known to be trademarks or service marks have been appropriately capitalized. Use of such terms in this book should not be regarded as affecting the validity of the trademark or service mark.

Publication date: April 2019

Multiple DUCC head nodes

This first major section describes support for multiple active DUCC head nodes.

Introduction

DUCC can be configured to run reliably by having multiple head nodes, comprising one master and one or more backup head nodes. DUCC exploits Linux keepalived virtual IP addressing to enable this capability.

The advantage is that if the master DUCC host becomes unusable, the backup DUCC can take over seamlessly, so that active distributed Jobs, Reservations, Managed Reservations and Services continue uninterrupted. Takeover also allows new submissions to be accepted, and new and existing submissions to be monitored, without interruption.

Daemons

Each head node, whether master or backup, runs a Broker, Orchestrator, PM, RM, and SM.

The Cassandra database is expected to be located on one or more nodes separate from the head nodes.

Likewise, the JD node or nodes are separate from the head nodes.

The Agents are distributed, as before.

Configuring Host Machines

See "Configuring Simple Virtual IP Address Failover Using Keepalived" at https://docs.oracle.com/cd/E37670_01/E41138/html/section_uxg_lzh_nr.html

Sample MASTER /etc/keepalived/keepalived.conf

    ! Configuration File for keepalived

    vrrp_instance VI_1 {
        state MASTER
        interface eth0
        virtual_router_id 51
        priority 100
        advert_int 1
        authentication {
            auth_type PASS
            auth_pass 1111
        }
        virtual_ipaddress {
            192.168.6.253
        }
    }
   

Sample BACKUP /etc/keepalived/keepalived.conf

    ! Configuration File for keepalived

    vrrp_instance VI_1 {
        state BACKUP
        interface eth0
        virtual_router_id 51
        priority 100
        advert_int 1
        authentication {
            auth_type PASS
            auth_pass 1111
        }
        virtual_ipaddress {
            192.168.6.253
        }
    }
   

Linux Commands

Starting keepalived

    > sudo service keepalived start  
    Starting keepalived:                                       [  OK  ]  
   

Querying keepalived

    > /sbin/ip addr show dev eth0
    2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
        link/ether 00:21:5e:20:02:84 brd ff:ff:ff:ff:ff:ff
        inet 192.168.3.7/16 brd 192.168.255.255 scope global eth0
        inet 192.168.6.253/32 scope global eth0
        inet6 fe80::221:5eff:fe20:284/64 scope link
           valid_lft forever preferred_lft forever
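
A head node holds the virtual IP address when that address appears as an inet entry in the output above. The following is a minimal sketch of that check, not code taken from DUCC: the function names ipv4_addresses and holds_virtual_ip are hypothetical, and the sample text is the output shown above.

```python
import ipaddress


def ipv4_addresses(ip_addr_output):
    """Return the set of IPv4 addresses listed in `ip addr show` output."""
    found = set()
    for line in ip_addr_output.splitlines():
        fields = line.split()
        # IPv4 entries look like: inet 192.168.6.253/32 scope global eth0
        if len(fields) >= 2 and fields[0] == "inet":
            interface = ipaddress.ip_interface(fields[1])
            found.add(str(interface.ip))
    return found


def holds_virtual_ip(ip_addr_output, virtual_ip):
    """Return True if virtual_ip is assigned on the queried device."""
    return virtual_ip in ipv4_addresses(ip_addr_output)


# Sample output copied from the query above.
SAMPLE = """\
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
    link/ether 00:21:5e:20:02:84 brd ff:ff:ff:ff:ff:ff
    inet 192.168.3.7/16 brd 192.168.255.255 scope global eth0
    inet 192.168.6.253/32 scope global eth0
    inet6 fe80::221:5eff:fe20:284/64 scope link
       valid_lft forever preferred_lft forever
"""

if __name__ == "__main__":
    print(holds_virtual_ip(SAMPLE, "192.168.6.253"))
```

On a node that does not currently hold the virtual IP address, the second inet line is absent and the check returns False.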
   

Stopping keepalived

    > sudo service keepalived stop  
    Stopping keepalived:  
   

Configuring DUCC

To configure DUCC to run reliably, one property must be set in the site.ducc.properties file. Example:

ducc.head = 192.168.6.253  
   

Use the virtual IP address configured for keepalived on your host machines. A DNS name that resolves to this address may be used instead.

Webserver

Webserver for Master

The master DUCC Webserver displays all pages normally, with additional information in the upper left of the heading:

reliable: master

Webserver for Backup

The backup DUCC Webserver displays some pages normally, with additional information in the upper left of the heading:

reliable: backup

Hovering over reliable will yield the following information: Click to visit master

Several pages will display the following information (or similar):

no data - not master  
   

Database

Configure the database to be on a separate machine from the reliable DUCC head nodes. In site.ducc.properties specify:

    # Database location
    ducc.database.host = dbhost123
    ducc.database.jmx.host = dbhost123
    ducc.database.automanage = false
   

The existing administrator commands start_ducc and stop_ducc will honor the value specified for ducc.database.automanage in site.ducc.properties.
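
As an illustration of how a start or stop script might honor this setting, here is a minimal sketch. It is not the start_ducc implementation: the helper names are hypothetical, and it assumes that values in site.ducc.properties override defaults and that the default for ducc.database.automanage is true, as in the ducc.properties excerpt in this document.

```python
def parse_properties(text):
    """Parse simple `key = value` lines, skipping blanks and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def effective_properties(default_text, site_text):
    """Merge defaults with site overrides (assumption: site values win)."""
    merged = parse_properties(default_text)
    merged.update(parse_properties(site_text))
    return merged


def automanage_database(props):
    """Return True if the scripts should start and stop the database."""
    return props.get("ducc.database.automanage", "true").lower() == "true"


DEFAULTS = "ducc.database.automanage = true\n"
SITE = """\
# Database location
ducc.database.host = dbhost123
ducc.database.jmx.host = dbhost123
ducc.database.automanage = false
"""

if __name__ == "__main__":
    props = effective_properties(DEFAULTS, SITE)
    print(automanage_database(props))
```

With the site settings shown above the result is False, so the database on dbhost123 would be left alone when DUCC is started or stopped.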

Code changes

The key changes include a new script (see ducc_head_mode.py) to interact with Linux to determine virtual IP address status and corresponding Java code (see common.head.ADuccHead.java) that interprets the status to make transitions between master and backup states.
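
The idea of interpreting successive status values as transitions can be sketched as follows. This is an illustration only, written in Python rather than Java, and it does not reproduce ADuccHead: the class names, the transition names, and the choice of "unspecified" as the starting state are all assumptions. Only the three status values come from the script description.

```python
from enum import Enum


class Mode(Enum):
    # The three values the status script is documented to report.
    UNSPECIFIED = "unspecified"
    MASTER = "master"
    BACKUP = "backup"


class Transition(Enum):
    # Hypothetical names for the changes a daemon would react to.
    NONE = "none"
    MASTER_TO_BACKUP = "master_to_backup"
    BACKUP_TO_MASTER = "backup_to_master"


class HeadTracker:
    """Remember the last reported mode and classify each change."""

    def __init__(self):
        # Assumption: nothing is known until the first status arrives.
        self.mode = Mode.UNSPECIFIED

    def update(self, reported):
        """Record the newly reported status and return the transition."""
        new_mode = Mode(reported.strip())
        previous, self.mode = self.mode, new_mode
        if previous is Mode.MASTER and new_mode is Mode.BACKUP:
            return Transition.MASTER_TO_BACKUP
        if previous is Mode.BACKUP and new_mode is Mode.MASTER:
            return Transition.BACKUP_TO_MASTER
        return Transition.NONE


if __name__ == "__main__":
    tracker = HeadTracker()
    for status in ["backup", "backup", "master"]:
        print(status, tracker.update(status).value)
```

A daemon built this way would call update once per status check and act only when the result is something other than NONE, for example by resuming or suspending its work.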

new scripts

ducc_head_mode.py

This is a new script employed at runtime by the various daemons to determine the current mode of operation. Status is determined through invocation of this script upon receipt of each Orchestrator publication.

    # purpose:    determine reliable ducc status  
    # input:      none  
    # output:     one of { unspecified, master, backup }  
    # operation:  look in ducc.properties for relevant keywords  
    #             and employ linux commands to determine if system  
    #             has matching configured virtual IP address  
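
The operation described above can be sketched as follows. This is not the source of ducc_head_mode.py. The function name head_mode and its parameters are hypothetical, and the decision rule is an assumption: "unspecified" when ducc.head is missing or cannot be resolved, "master" when the ducc.head address is assigned to this host, and "backup" otherwise. The local addresses are passed in, so they could come from the ip addr show query shown under Querying keepalived.

```python
import socket


def head_mode(props, local_addresses, resolve=socket.gethostbyname):
    """Return 'unspecified', 'master' or 'backup' for this host.

    props           -- dict of properties, e.g. {"ducc.head": "192.168.6.253"}
    local_addresses -- set of IPv4 addresses assigned to this host
    resolve         -- maps a host name or IPv4 literal to an IPv4 address
    """
    head = props.get("ducc.head", "").strip()
    if not head:
        # Assumption: no ducc.head value means no status can be determined.
        return "unspecified"
    try:
        head_ip = resolve(head)
    except OSError:
        # A name that cannot be resolved is treated like a missing value.
        return "unspecified"
    # The host that currently holds the virtual IP address is the master.
    return "master" if head_ip in local_addresses else "backup"


if __name__ == "__main__":
    props = {"ducc.head": "192.168.6.253"}
    print(head_mode(props, {"192.168.3.7", "192.168.6.253"}))
    print(head_mode(props, {"192.168.3.8"}))
```

Passing resolve as a parameter keeps the sketch testable without DNS. The default, socket.gethostbyname, returns an IPv4 literal unchanged and looks up a DNS name otherwise.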
   

existing and new scripts

configuration files

ducc.properties

    # The name of the node where DUCC runs.  
    # This property declares the node where the DUCC administrative processes run (Orchestrator,  
    # Resource Manager, Process Manager, Service Manager).  This property is required and MUST be  
    # configured in new installation.  The installation script ducc_post_install initializes this  
    # property to the node the script is executed on.  
    # Reliable DUCC: if running reliably, then this value must be the same as that specified  
    # for the virtual_ipaddress in /etc/keepalived/keepalived.conf.  DUCC CLI and Agents employ  
    # this value to connect to the current reliable DUCC head node.  
    ducc.head = <head-node>  
   

Although it is not a strict requirement, the Orchestrator, RM, SM, PM, Webserver and Broker should all be configured on the head node. Reliable DUCC may work with other configurations, but none have been tested.

# If set to true, DUCC will start and stop the Cassandra database as part of its normal  
# start/stop scripting.  
ducc.database.automanage = true

log4j.xml

    Add DUCC_NODENAME to log file name for OR, RM, PM, SM, and system-events.
    This allows reliable DUCC head nodes to share the same ducc_runtime directory
    in the filesystem without collisions.
   

agent

cli

common

database

orchestrator

pm

sm

transport

webserver

examples

Installing and Cloning

This second major section describes support for installation of head node master and backup(s).

TBD

Autostart

This third major section describes support for autostart of head node and agent daemons.

TBD

Monitoring and Switching

This fourth major section describes support for monitoring multiple head nodes and switching to an alternate when the primary is dysfunctional.

TBD