Manage High Availability Errors
This section describes the concepts of High Availability error management. It also describes the following use cases:
- Disconnection issues between the appliances (short or long outage)
- Service issues
- View conflict warnings
Conflict Management
In a High Availability Dual Mode deployment, the two nodes are symmetric.
You can perform daily operations from either node. As long as the nodes are synchronized, each operation is replicated to the other node.
When the two nodes become desynchronized, all features remain available on both nodes, with two exceptions: the add domain and delete domain operations.
Therefore, a data update conflict might occur if the same data is updated with different values on the two nodes (during a short network outage, or simultaneously on both nodes).
For example, while the appliances are disconnected, you can update the same field with different values on both nodes, or create the same user name twice. This will result in a data conflict when the appliances are reconnected.
The following sections describe how such conflicts are managed and audited. Monitoring and Reporting also describes how they are monitored.
If you run into any conflicts, contact HID Global Technical Support for assistance in resolving them.
Data Conflict Management
The system can resolve potential database conflicts based on the following principles:
- Timing – when two conflicting updates are identified, the most recent one wins and the earlier one is discarded. This applies to conflicting data updates and conflicting data creations.
- Deletion wins – when a deletion conflicts with a data update, the deletion wins.
- Most usage wins – when a token has a higher total usage on one of the nodes (that is, a higher total number of failed and successful authentications), the information from that node is used to update the database, regardless of the last update time.
For example, during a short outage, one administrator updates a field for a user at T0 on the first appliance, and another operator updates the same field at T0+1 on the second appliance. Once the appliances are reconnected, the later update (made on the second appliance) is written to the database and the update made on the first appliance is discarded.
- If conflicts are detected between the two appliances when they are reconnected, you will receive a warning.
- If conflict resolution fails for some objects, you will receive both a warning and a short list of items to check, once the appliances are in a synchronized state. See Data Replication in Dual Mode.
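The three resolution principles above can be illustrated with a minimal sketch. This is not the product's implementation: the RowVersion type, its field names, the order in which the rules are checked, and the tie-breaking on equal timestamps are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RowVersion:
    """One node's version of a conflicting row (hypothetical structure)."""
    value: Optional[str]               # field content; None once the row is deleted
    updated_at: float                  # time of the last change on this node
    deleted: bool = False              # True if this node deleted the row
    usage_count: Optional[int] = None  # tokens only: failed + successful authentications


def resolve(a: RowVersion, b: RowVersion) -> RowVersion:
    """Pick the winning version; the rule order here is an assumption."""
    # Deletion wins: a deletion beats a concurrent update.
    if a.deleted != b.deleted:
        return a if a.deleted else b
    # Most usage wins: for tokens, the higher total usage beats the update time.
    if (a.usage_count is not None and b.usage_count is not None
            and a.usage_count != b.usage_count):
        return a if a.usage_count > b.usage_count else b
    # Timing: the most recent change wins (ties go to the first argument here).
    return a if a.updated_at >= b.updated_at else b


# The T0 / T0+1 example from the text: the later update is kept.
first = RowVersion(value="value set on first appliance", updated_at=0.0)
second = RowVersion(value="value set on second appliance", updated_at=1.0)
winner = resolve(first, second)
```

In this sketch, winner is the version from the second appliance, matching the example in which the update made at T0 is discarded.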
Conflict Resolution Auditing
Automatic changes to database content due to conflict resolution are audited in a dedicated table (that is, not using the ActivID Server audit feature). Changes are audited locally: each node audits only the changes made on that node.
The following is audited:
- Date and time of the conflict resolution
- Date and time of the conflict
- Security domain and database table name
- Change type - Insert/Update/Delete
- Row with conflicting data
- Row before change
- Row after conflict resolution
The conflict resolution audit cannot be accessed until it is archived. It is archived together with the ActivID Authentication Server audit archive, and purged together with the ActivID Authentication Server audit purge. The conflict resolution audit archive is uploaded to the same location as the ActivID Authentication Server audit archive. The archive file name is Audit_conflict_XXXXXXXX.tar.
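When processing archived conflict audits in a script, it can help to model the audited items as a record. The following is a minimal sketch only: the class, field names, and types are hypothetical, and the internal layout of the archive is not described in this section. The file name check relies only on the fixed prefix and extension given above and makes no assumption about the XXXXXXXX part.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum
from typing import Optional


class ChangeType(Enum):
    INSERT = "Insert"
    UPDATE = "Update"
    DELETE = "Delete"


@dataclass(frozen=True)
class ConflictAuditRecord:
    """One audited conflict resolution (hypothetical field names)."""
    resolved_at: datetime        # date and time of the conflict resolution
    conflict_at: datetime        # date and time of the conflict
    security_domain: str
    table_name: str
    change_type: ChangeType
    conflicting_row: dict        # row with conflicting data
    row_before: Optional[dict]   # assumption: may be absent, for example for an insert
    row_after: Optional[dict]    # assumption: may be absent, for example for a delete


def is_conflict_audit_archive(file_name: str) -> bool:
    """True if the name has the documented prefix and the .tar extension."""
    return file_name.startswith("Audit_conflict_") and file_name.endswith(".tar")
```

A script that scans the upload location could use is_conflict_audit_archive to separate conflict audit archives from the other audit archives stored there.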
Manage Short Outage
A short outage can occur for the following reasons:
- One of the appliances has a short power outage.
- The network connection between the two appliances is briefly lost.
- Hardware maintenance is required, and the appliance must be shut down for a short period of time.
Because the outage is short (that is, the other appliance can still store the synchronization data while waiting for the synchronization to resume), it does not impact the activity on the other appliance.
During the outage, the two databases are no longer synchronized, but the difference is recoverable. The active appliance continues to record database updates. These updates are sent to the other appliance when it is up and running again.
- There is no noticeable interruption of service for end users. Certain sessions on the ActivID Management Console or Self-Service Portal might be closed. However, you can log on to the applications again.
- There is no data loss.
- A RADIUS authentication could be aborted, but the next one might be successful on both nodes or only on one node.
The length of time that the two nodes can remain in the Out Of Sync. Recoverable state depends on the following factors:
- The activity of the service and the number of authentications per second
- The network bandwidth between the two nodes
- The lag time between the two nodes
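To get a feel for how these factors interact, the following back-of-the-envelope sketch may help. It is an illustrative model only, not the product's sizing rule: the storage capacity, the size of one update, and the simple linear arithmetic are all assumptions, and lag time is not modeled.

```python
def recoverable_window_seconds(capacity_bytes: float,
                               updates_per_second: float,
                               bytes_per_update: float) -> float:
    """Time until the stored synchronization data would fill the assumed capacity."""
    growth = updates_per_second * bytes_per_update  # backlog growth in bytes/second
    if growth <= 0:
        return float("inf")  # no activity: the backlog never grows
    return capacity_bytes / growth


def catch_up_seconds(backlog_bytes: float,
                     link_bytes_per_second: float,
                     new_bytes_per_second: float) -> float:
    """Time to drain the backlog after reconnection while new updates keep arriving."""
    spare = link_bytes_per_second - new_bytes_per_second
    if spare <= 0:
        return float("inf")  # the link cannot keep up with ongoing activity
    return backlog_bytes / spare


# Hypothetical figures: 1 MB of storage, 50 updates/s, 200 bytes per update.
window = recoverable_window_seconds(1_000_000, 50, 200)
# Hypothetical figures: 1 MB backlog, 125 kB/s link, 25 kB/s of ongoing updates.
drain = catch_up_seconds(1_000_000, 125_000, 25_000)
```

The point of the sketch is the direction of each effect: more authentication activity shortens the window, and less available bandwidth lengthens the catch-up time.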
When you restart Node B or the network connection is restored, the databases are automatically synchronized.
Because the recovery is automatic, no manual intervention is required, and the Synchronization Status switches to the Synchronized state.
If you do not want to trigger the automatic recovery process, you can click Cancel Synchronization when the node is in the Out of Synchronization state.
Manage Long Outage
When the outage of Node B is long, Node A automatically cancels the synchronization.
During the outage, the two databases are no longer synchronized, and the difference cannot be recovered automatically.
To perform a manual recovery, you have two options:
- Repair the connection issue between the appliances, and initiate the synchronization again.
- Once all the identified issues are fixed, set the appliances back to Single Mode, and reconfigure the Dual Mode deployment.
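The short and long outage behavior described above can be summarized as a small state machine. This is a sketch for reasoning about the transitions, not a representation of the product's internals: the SYNC_CANCELLED state name and all event names are hypothetical, and reconfiguring the Dual Mode deployment is not modeled.

```python
from enum import Enum, auto


class SyncState(Enum):
    SYNCHRONIZED = auto()
    OUT_OF_SYNC_RECOVERABLE = auto()
    SYNC_CANCELLED = auto()    # hypothetical name: difference not recoverable automatically


class Event(Enum):
    PEER_LOST = auto()         # power outage, network disconnection, or maintenance
    PEER_RESTORED = auto()     # node restarted or network connection restored
    CANCELLED = auto()         # long outage, or Cancel Synchronization was clicked
    SYNC_REINITIATED = auto()  # administrator repaired the link and initiated synchronization


TRANSITIONS = {
    (SyncState.SYNCHRONIZED, Event.PEER_LOST): SyncState.OUT_OF_SYNC_RECOVERABLE,
    (SyncState.OUT_OF_SYNC_RECOVERABLE, Event.PEER_RESTORED): SyncState.SYNCHRONIZED,
    (SyncState.OUT_OF_SYNC_RECOVERABLE, Event.CANCELLED): SyncState.SYNC_CANCELLED,
    (SyncState.SYNC_CANCELLED, Event.SYNC_REINITIATED): SyncState.SYNCHRONIZED,
}


def next_state(state: SyncState, event: Event) -> SyncState:
    """Return the new state; events with no listed transition leave the state unchanged."""
    return TRANSITIONS.get((state, event), state)


# Short outage: recovery is automatic once the peer is back.
short = next_state(next_state(SyncState.SYNCHRONIZED, Event.PEER_LOST), Event.PEER_RESTORED)

# Long outage: the peer coming back is not enough once synchronization is cancelled.
state = next_state(SyncState.SYNCHRONIZED, Event.PEER_LOST)
state = next_state(state, Event.CANCELLED)
after_restore = next_state(state, Event.PEER_RESTORED)
```

In the long outage path, the state stays at SYNC_CANCELLED after the peer returns, which mirrors the need for one of the manual recovery options.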