PGPool for HA failover#

Intro#

PGPool-II is a service that creates a pool of PostgreSQL server nodes. In EclecticIQ, it can be installed during the initial installation of your platform to ensure continued operation of the PostgreSQL service in your deployment.

Watchdog#

Enabling PGPool allows you to set up Watchdog, a High Availability (HA) feature that provides a failover mechanism. Watchdog monitors the nodes in the pool: one is the primary node providing the PostgreSQL service to your deployment, while the others are synced secondary nodes. If Watchdog finds the primary node to be in a failed state, a secondary node is promoted to primary and the former primary node is sidelined. This minimizes the loss of service and allows the sidelined node to be recovered and returned to the pool, or to be completely reconfigured.
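
For reference, the Watchdog behaviour described above maps onto a handful of Pgpool-II settings. The snippet below is a hypothetical pgpool.conf excerpt, not the configuration the Ansible playbooks actually generate; the values and the failover script path are placeholders, and parameter names can differ between Pgpool-II versions.

    # Hypothetical pgpool.conf excerpt (placeholders only):
    use_watchdog = on               # enable the Watchdog subsystem
    delegate_ip = '192.0.2.100'     # the VIP held by the current leader node
    health_check_period = 10        # seconds between health checks of the backends
    failover_command = '/path/to/failover.sh %d %h %m %H'
                                    # %d/%h = failed node id/host, %m/%H = new primary id/host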

Virtual IP#

Watchdog’s operation depends on a Virtual IP (VIP). While the nodes in a pool all have individual IP addresses, the VIP is an additional IP address assigned to the primary node (which therefore has two IP addresses: one actual and one virtual). Outside connections to the pool, i.e. to the primary node providing the service, are made through the VIP. When Watchdog performs a failover, the VIP is reassigned to the new primary node. The practical upshot is that the pool is accessible through the same IP address before and after the failover, so outside systems do not have to adjust to it.
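
As an illustration, clients keep using one connection string regardless of which node is currently primary. A minimal sketch, assuming the VIP is 192.0.2.100, Pgpool-II listens on its default port 9999, and the database and user names are placeholders:

    # The same connection keeps working before and after a failover,
    # because the VIP always follows the current primary:
    psql -h 192.0.2.100 -p 9999 -U eclecticiq -d eclecticiq -c "SELECT 1;"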

Configure PGPool#

Requirements#

Contact support if necessary

Please contact EclecticIQ support if you need help implementing PGPool-II and Watchdog.

  • PGPool can only be initialized during new installations.

  • Make sure you are able and allowed to provision a dedicated IP address for the VIP.

  • Read and understand the documentation on PGPool-II, Watchdog, and the setup example.

Set up PGPool-II with Ansible deployment#

You can only set up PGPool during the initial installation of your EclecticIQ Intelligence Center, but the initialization can be handled by the Ansible playbooks provided.

Load balancing is not yet supported

Intelligence Center does not yet support using PGPool for a so-called active-active configuration, only HA failover.
Keep EIQ_POSTGRES_PGPOOL_LOAD_BALANCE_ENABLE set to false!

  1. When choosing a deployment architecture, select either prod-medium or prod-large.

  2. While setting up your .env, also set the following variables:

        export EIQ_POSTGRES_SUPERUSER_PASSWORD="<example_password>"
        export EIQ_POSTGRES_PGPOOL_ENABLE=true
        export EIQ_POSTGRES_PGPOOL_LOAD_BALANCE_ENABLE=false #Not supported yet, keep set to false
        export EIQ_POSTGRES_PGPOOL_USER_PASSWORD="<example_password>"
        export EIQ_POSTGRES_PGPOOL_PCP_PASSWORD="<example_password>"
        export EIQ_POSTGRES_PGPOOL_VIRTUAL_IP="192.0.2.100" # example value; must be an unused IP in the same L2 network as the pg-nodes
    
  3. Instead of running the playbooks in turn, run the deploy-from-inventory.sh script from the utils folder of the playbook package, passing your deployment architecture, inventory file, and Ansible configuration as arguments:

        Usage: ./utils/deploy-from-inventory.sh <DEPLOYMENT_TYPE> <INVENTORY_FILE> <ANSIBLE_CONFIG>
        Example: ./utils/deploy-from-inventory.sh prod-large inventories/ic-prod-large.yml ansible.cfg
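
Once the deployment finishes, you can verify that the VIP and Watchdog came up as expected. A quick sketch, assuming the example VIP above (192.0.2.100) and that the nodes use eth0 as their network interface:

    # On the node currently acting as leader, the VIP should be bound to the interface:
    ip addr show eth0 | grep 192.0.2.100

    # As root, confirm Watchdog sees all nodes and has elected a leader:
    pcp_watchdog_info -v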

Commands for managing PGPool#

Run as root

The Ansible playbooks provide the .pcppass file for the root user, so run these commands as root.
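
For reference, .pcppass uses the same format as PostgreSQL's .pgpass file and must not be world-readable. A hypothetical example for root (9898 is Pgpool-II's default PCP port; the username and password are placeholders, and the playbooks generate the real file):

    # /root/.pcppass -- one entry per line: hostname:port:username:password
    localhost:9898:pgpool:<example_password>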

  • To find node status, run pcp_node_info -v.

  • To find Watchdog status, run pcp_watchdog_info -v.

  • To recover a failed node, run pcp_recovery_node -n <node index>.

  • To manually switch the primary node, run pcp_promote_node -n <node index> -s -g.

Return to original primary after failure

Upgrading won’t work properly if the node that is primary at the time of the upgrade is not the original primary, so be sure to switch back before upgrading.
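
A minimal sketch of the switch-back, assuming the original primary has node index 0 (check the actual index in your deployment first):

    # As root: list the nodes and note the index of the original primary
    pcp_node_info -v

    # Promote the original primary (assumed here to be node 0) back to primary
    pcp_promote_node -n 0 -s -g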

In the event of failover#

The Watchdog will send out an alert through the platform UI. When this alert is received, you are advised to carry out the following steps; a command-level sketch follows the list.

  1. Run a health check on the new primary node.

  2. Recover the failed node to a functioning condition.

  3. Restore the secondary node.

  4. Reinitialize the syncing between nodes.

  5. Switch over to the old primary node.

  6. Run a health check on the system to verify normal operation.

  7. Rerun any (automated) tasks that ran between the last successful replication and the failover.
     Data produced by those tasks will be missing from the secondary node(s) at the point of failover.
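
The pcp commands from the previous section cover most of these steps. A hedged sketch, assuming the failed node has index 1 and the original primary has index 0 (run as root):

    # Steps 1 and 6: check the status of all nodes
    pcp_node_info -v

    # Steps 2-4: recover the failed node and reinitialize syncing
    # (assumption: the failed node has index 1)
    pcp_recovery_node -n 1

    # Step 5: switch back to the original primary
    # (assumption: the original primary has index 0)
    pcp_promote_node -n 0 -s -g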

FAQ#

In the event that the primary PostgreSQL server is not available, how should we assess the impact/data loss on the primary node?

The only way to estimate data loss is to check the replication lag at the last successful health check before the failover. Everything committed to the original primary between the last successful replication and the failover is not present on the new primary.
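
As an illustration, replication lag can be inspected on the primary through PostgreSQL's pg_stat_replication view. A sketch assuming PostgreSQL 10 or later; the user and database below are placeholders:

    # Per-standby replication lag in bytes, measured on the primary:
    psql -U <superuser> -d <database> -c "SELECT client_addr, state, pg_wal_lsn_diff(sent_lsn, replay_lsn) AS replay_lag_bytes FROM pg_stat_replication;"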

If the primary node’s database is corrupted, will this be replicated across? If so, are there any mitigation strategies? Or is restoration from a VM backup the only solution?

Every committed change, including invalid data from bugs or manual queries, will be replicated. Proper recovery solutions are regular backups via filesystem snapshots or other database backup options. Unfortunately, the Intelligence Center does not have a built-in backup solution yet.
See the PostgreSQL backup documentation for more information.
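
As an example of such an option, a periodic logical dump with PostgreSQL's pg_dump is a simple starting point; the host, user, database name, and output path below are placeholders, and restores should be tested regularly:

    # Logical backup of a single database in pg_dump's custom format:
    pg_dump -h <primary_node_or_vip> -U <superuser> -Fc -f /backups/eiq_$(date +%F).dump <database_name>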