Oracle Data Guard Switchover & Failover: A Production Role-Transition Runbook
Having a standby database is not the same as knowing how to use it under pressure. The moment your primary is in trouble - or you need to move it for planned maintenance - is not the time to be looking up commands. This is the runbook: the clear difference between a planned switchover and an emergency failover, the pre-checks that stop a role transition going wrong, the exact DGMGRL steps for each, how to bring the old primary back afterwards, and how Fast-Start Failover automates the whole thing. It assumes you already have a working Data Guard configuration; if you need the concepts first, start with the complete Data Guard guide.
Key Takeaways
- A switchover is a planned, lossless role swap between primary and standby; a failover is an emergency promotion of the standby when the primary is gone.
- Always run pre-checks first - configuration status, transport and apply lag, and a validation of the target standby - before any transition.
- The Data Guard Broker (DGMGRL) makes role transitions a single command each and handles the many steps underneath safely.
- After a failover, the old primary must be reinstated (often via Flashback) to rejoin as a standby, not just restarted.
- A snapshot standby lets you open the standby read-write for testing, then discard the changes and resume applying redo.
- Fast-Start Failover with an observer automates failover within seconds of a primary loss - the gold standard for hands-off high availability.
1. Switchover vs Failover - Know Which You Are Doing
These two words get used interchangeably in a crisis, but they are very different operations, and confusing them causes mistakes.
A switchover is planned and lossless. The primary and standby cleanly swap roles - the old primary becomes a standby, the old standby becomes the primary - with no data loss. You use it for planned maintenance, hardware moves, or testing your DR readiness.
A failover is unplanned. The primary is gone - crashed, unreachable, destroyed - and you promote the standby to primary to restore service. Depending on your protection mode and how much redo reached the standby, a failover may involve a small amount of data loss. The old primary does not automatically come back as a standby; it must be reinstated.
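The distinction can be written down as a tiny decision helper for a runbook script. This is a minimal sketch, not part of Data Guard itself: `TransitionPlan` and `plan_transition` are hypothetical names, and the only input assumed is whether the primary is still available to take part in the transition.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TransitionPlan:
    """Hypothetical summary of what a role transition implies."""
    command: str                 # DGMGRL verb to use
    data_loss_possible: bool     # may committed transactions be lost?
    reinstate_old_primary: bool  # must the old primary be reinstated afterwards?


def plan_transition(primary_available: bool) -> TransitionPlan:
    """Pick the transition type from whether the primary can cooperate."""
    if primary_available:
        # Planned, lossless role swap: the old primary becomes a standby.
        return TransitionPlan("SWITCHOVER", False, False)
    # Primary is gone: promote the standby, then reinstate the old primary later.
    return TransitionPlan("FAILOVER", True, True)


if __name__ == "__main__":
    print(plan_transition(primary_available=True))
    print(plan_transition(primary_available=False))
```

The point of the sketch is that the choice is made by the state of the primary, not by how urgent the situation feels.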
2. Pre-Checks Before Any Role Transition
Whether planned or not, look before you leap. The Broker makes this quick.
-- Connect to the broker
dgmgrl sys/password@primary
-- Overall health - everything should be SUCCESS
SHOW CONFIGURATION;
-- Detail on the standby you intend to promote, including apply lag
SHOW DATABASE VERBOSE 'standby_db';
-- Validate that the standby is ready to take the primary role
VALIDATE DATABASE 'standby_db';
VALIDATE DATABASE is the command many DBAs miss. It reports whether the target is ready to become primary - redo received, apply status, flashback, and any warnings - before you commit. For a switchover you want zero lag; for a failover you accept whatever redo arrived.
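If you script the zero-lag check, the lag figures have to be turned into something comparable. The sketch below assumes the values arrive as day-to-second interval strings such as `+00 00:00:00`, the form in which `V$DATAGUARD_STATS` reports `transport lag` and `apply lag`. The function names are hypothetical, and how you fetch the strings is left out.

```python
import re
from datetime import timedelta

# Assumed input format: "+DD HH:MI:SS", e.g. "+00 00:01:30"
_LAG_RE = re.compile(r"^\+(\d+) (\d{2}):(\d{2}):(\d{2})$")


def parse_lag(value: str) -> timedelta:
    """Convert a '+DD HH:MI:SS' interval string into a timedelta."""
    match = _LAG_RE.match(value.strip())
    if match is None:
        raise ValueError(f"unrecognised lag value: {value!r}")
    days, hours, minutes, seconds = (int(group) for group in match.groups())
    return timedelta(days=days, hours=hours, minutes=minutes, seconds=seconds)


def ready_for_switchover(transport_lag: str, apply_lag: str) -> bool:
    """A switchover wants zero lag on both transport and apply."""
    zero = timedelta(0)
    return parse_lag(transport_lag) == zero and parse_lag(apply_lag) == zero


if __name__ == "__main__":
    print(ready_for_switchover("+00 00:00:00", "+00 00:00:00"))
    print(ready_for_switchover("+00 00:00:00", "+00 00:00:05"))
```

A check like this complements VALIDATE DATABASE rather than replacing it: it only looks at lag, not at flashback status or other warnings.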
3. Switchover Runbook (Planned)
With the Broker, a healthy switchover is essentially one command, but the surrounding steps matter.
- Confirm applications can tolerate the brief transition and that the standby lag is zero.
- Run VALIDATE DATABASE on the target and resolve any warnings.
- Execute the switchover; the Broker coordinates both databases.
- Verify the new roles and that redo transport has reversed.
-- One command; the broker does the rest
SWITCHOVER TO 'standby_db';
-- Afterwards, confirm the swap
SHOW CONFIGURATION; -- roles are now reversed, status SUCCESS
The old primary automatically becomes a standby and starts receiving redo from the new primary. No reinstatement is needed for a switchover - that is the whole point of it being planned and clean.
4. Failover Runbook (Unplanned)
When the primary is truly gone, you promote the standby. How much data you can lose is already decided by your protection mode: Maximum Protection guarantees zero data loss by not letting the primary commit unless redo has reached a synchronized standby; Maximum Availability gives zero data loss as long as the standby was synchronized when the primary failed; Maximum Performance ships redo asynchronously to favour primary speed and may lose the last few transactions on failover.
-- Connect the broker to the SURVIVING standby
dgmgrl sys/password@standby_db
-- Promote it to primary
FAILOVER TO 'standby_db';
-- Confirm it is now the primary and open
SHOW CONFIGURATION;
After a failover, restore application connectivity to the new primary. If you use a role-aware service and a properly configured client connect string, sessions reconnect to the new primary automatically. This is exactly the kind of event covered in what to do when the database fails at 3 AM - the runbook is what turns panic into procedure.
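For the incident report you will be asked whether data could have been lost. The protection-mode reasoning from this section can be sketched as below. The enum values mirror the Broker's mode names (MaxProtection, MaxAvailability, MaxPerformance). The function name and the `standby_was_synchronized` flag are assumptions for illustration, and the answer is an expectation, not a measurement of what was actually lost.

```python
from enum import Enum


class ProtectionMode(Enum):
    MAX_PROTECTION = "MaxProtection"
    MAX_AVAILABILITY = "MaxAvailability"
    MAX_PERFORMANCE = "MaxPerformance"


def data_loss_possible(mode: ProtectionMode, standby_was_synchronized: bool) -> bool:
    """Return True if a failover in this mode may lose committed transactions."""
    if mode is ProtectionMode.MAX_PROTECTION:
        # The primary does not commit without a synchronized standby.
        return False
    if mode is ProtectionMode.MAX_AVAILABILITY:
        # Zero loss only if the standby was in sync when the primary failed.
        return not standby_was_synchronized
    # Maximum Performance ships redo asynchronously; loss is always possible.
    return True


if __name__ == "__main__":
    for mode in ProtectionMode:
        print(mode.value, data_loss_possible(mode, standby_was_synchronized=True))
```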
5. Reinstating the Old Primary
After a failover, the failed database - once it is back on its feet - is behind the new primary and cannot simply rejoin. It must be reinstated as a standby. If you had Flashback Database enabled (and you should have), the Broker can flash it back to the correct point and turn it into a standby automatically.
-- Start the old primary in MOUNT, then from the broker:
REINSTATE DATABASE 'old_primary';
-- It flashes back and becomes a standby of the new primary
SHOW CONFIGURATION; -- both databases healthy again
This is a concrete reason to keep Flashback Database on: without it, reinstating the old primary usually means rebuilding the standby from scratch with a fresh copy, which is far slower.
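Before you attempt a reinstate, you can estimate whether Flashback can reach far enough back. The sketch assumes you have already read three values: FLASHBACK_ON from V$DATABASE and OLDEST_FLASHBACK_SCN from V$FLASHBACK_DATABASE_LOG on the old primary, and STANDBY_BECAME_PRIMARY_SCN from V$DATABASE on the new primary. The function name is hypothetical, and the Broker performs its own checks when you run REINSTATE DATABASE.

```python
def can_reinstate_with_flashback(
    flashback_on: str,
    oldest_flashback_scn: int,
    standby_became_primary_scn: int,
) -> bool:
    """Estimate whether the old primary can be flashed back to the failover point.

    The old primary has to be rewound to the SCN at which the standby
    became primary, so its flashback logs must reach back at least that far.
    """
    if flashback_on.strip().upper() != "YES":
        # Flashback Database is off (or restore-point only): plan a rebuild.
        return False
    return oldest_flashback_scn <= standby_became_primary_scn


if __name__ == "__main__":
    # Example SCNs are made up for illustration.
    print(can_reinstate_with_flashback("YES", 1_000_000, 1_250_000))
    print(can_reinstate_with_flashback("YES", 1_300_000, 1_250_000))
    print(can_reinstate_with_flashback("NO", 1_000_000, 1_250_000))
```

If the function returns False, plan for a rebuild from a fresh copy instead of spending the outage window on a reinstate that cannot succeed.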
6. Snapshot Standby - Test on Real Data, Then Rewind
Sometimes you want to open the standby read-write - to test a change against production-like data - without giving up your DR copy. A snapshot standby does exactly that. It converts the standby to read-write, keeps receiving redo (but does not apply it yet), and when you convert it back, it discards your test changes and catches up. Because the received redo is not applied while it is a snapshot standby, a failover to it takes longer the longer the test runs, so keep the test window short.
-- Open the standby for read-write testing
CONVERT DATABASE 'standby_db' TO SNAPSHOT STANDBY;
-- ... run your tests; changes here are temporary ...
-- Discard changes and resume being a standby
CONVERT DATABASE 'standby_db' TO PHYSICAL STANDBY;
This is invaluable for validating an application release or a risky data fix against realistic data without touching production and without building a separate clone.
7. Fast-Start Failover - Automating the Emergency
Manual failover depends on a human noticing and acting. Fast-Start Failover (FSFO) removes the human from the critical path. A lightweight process called the observer continuously watches both databases; if the primary becomes unreachable for longer than a set threshold, the observer triggers an automatic failover to the standby within seconds, and later reinstates the old primary automatically when it returns.
-- Set the threshold (seconds) at the configuration level, then enable FSFO
EDIT CONFIGURATION SET PROPERTY FastStartFailoverThreshold = 30;
ENABLE FAST_START FAILOVER;
-- In a separate DGMGRL session on a third host (START OBSERVER does not return)
START OBSERVER;
-- Back in the first session: confirm enabled + observer present
SHOW FAST_START FAILOVER;
The observer should run on a third machine, separate from both databases, so it can tell the difference between a dead primary and a network split. FSFO is the configuration to aim for when the business needs hands-off, seconds-level recovery.
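The threshold logic is easier to reason about as a small model. This is a deliberately simplified sketch of the decision, not the observer's real implementation: `ObserverView`, its fields, and `should_fail_over` are hypothetical, and the real observer applies further conditions. The model assumes that failover needs three things: the observer has lost the primary for longer than the threshold, the standby has lost it too, and the standby is fit to take over.

```python
from dataclasses import dataclass


@dataclass
class ObserverView:
    """Hypothetical snapshot of what the observer currently knows."""
    seconds_since_primary_contact: float  # how long the primary has been silent
    standby_sees_primary: bool            # can the standby still reach the primary?
    standby_ready: bool                   # standby is fit to become primary


def should_fail_over(view: ObserverView, threshold_seconds: int = 30) -> bool:
    """Decide whether an automatic failover should be initiated."""
    if view.seconds_since_primary_contact <= threshold_seconds:
        return False  # still inside the grace period
    if view.standby_sees_primary:
        return False  # only the observer is cut off: likely a network split
    return view.standby_ready


if __name__ == "__main__":
    print(should_fail_over(ObserverView(45.0, False, True)))  # primary really gone
    print(should_fail_over(ObserverView(45.0, True, True)))   # observer isolated
    print(should_fail_over(ObserverView(10.0, False, True)))  # too early
```

The second case is why the observer sits on a third host: when only the observer has lost contact, the safer reading is a network problem, not a dead primary.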
8. Common Issues and How to Handle Them
- Switchover refuses with a warning: almost always apply lag or an open blocking session. Clear the gap and re-run VALIDATE DATABASE; do not force it past a real warning.
- Redo gap on the standby: missing archived logs mean the standby is behind. Resolve the gap (Data Guard's automatic gap resolution usually fetches the missing logs) before a switchover.
- Old primary will not reinstate: Flashback was off or the flashback logs aged out - you will need to rebuild the standby from a fresh copy, for example with RMAN DUPLICATE.
- Applications do not follow the new primary: the connect string or database service was not role-aware. Use a Data Guard-aware service so connections move automatically on a transition.
9. A Real Transition: A Clean DR Drill on a Bank Standby
A banking client needed to prove, to auditors, that they could run on their DR site - without risking the live system. We scheduled a planned switchover during a quiet window: ran SHOW CONFIGURATION and VALIDATE DATABASE to confirm zero lag and a clean standby, executed SWITCHOVER TO the DR database through the Broker, and confirmed roles reversed and redo transport flowed back the other way. Applications, using a role-aware service, reconnected to the DR site automatically. The bank ran on DR for the agreed period, then switched back the same way. Because everything went through the Broker with pre-validation, the drill was uneventful - which, for a DR test, is exactly the goal. This operational discipline sits on top of the design covered in the complete Data Guard guide and a solid backup foundation.
Frequently Asked Questions
What is the difference between a Data Guard switchover and failover?
A switchover is planned and lossless - the primary and standby cleanly swap roles for maintenance or DR testing, and the old primary becomes a standby automatically. A failover is an emergency promotion of the standby when the primary is gone; depending on the protection mode it may involve minor data loss, and the old primary must be reinstated afterwards.
How do I perform a Data Guard switchover?
With the Data Guard Broker it is essentially one command. Connect with DGMGRL, run SHOW CONFIGURATION and VALIDATE DATABASE to confirm the standby is ready with zero lag, then run SWITCHOVER TO 'standby_db'. The broker coordinates both databases and reverses redo transport; confirm the new roles with SHOW CONFIGURATION.
How do I bring the old primary back after a failover?
You reinstate it rather than just restart it. Start the old primary in MOUNT and run REINSTATE DATABASE from the broker. If Flashback Database was enabled, the broker flashes it back to the right point and converts it to a standby automatically. Without Flashback you usually have to rebuild the standby from a fresh copy.
What is Fast-Start Failover?
Fast-Start Failover (FSFO) automates failover using an observer process that watches both databases. If the primary is unreachable beyond a set threshold, the observer triggers an automatic failover to the standby within seconds and later reinstates the old primary when it returns. The observer should run on a third host so it can distinguish a dead primary from a network split.
What is a snapshot standby used for?
A snapshot standby lets you open the standby read-write for testing against production-like data while it keeps receiving redo. You convert it to a snapshot standby, run your tests, then convert it back to a physical standby, which discards the test changes and resumes applying redo. It is ideal for validating a release without touching production or building a separate clone.
The procedures and case studies in this article are based on 18+ years of Oracle production database administration across manufacturing, banking, and pharmaceutical environments.
