Oracle Data Guard Switchover & Failover: A Production Role-Transition Runbook

Nasir Uddin Khan · OCP · Oracle DBA 18+ Years · July 2026 · 15 min read · Tags: Data Guard, Switchover, Failover, DGMGRL, FSFO

Having a standby database is not the same as knowing how to use it under pressure. The moment your primary is in trouble - or you need to move it for planned maintenance - is not the time to be looking up commands. This is the runbook: the clear difference between a planned switchover and an emergency failover, the pre-checks that stop a role transition going wrong, the exact DGMGRL steps for each, how to bring the old primary back afterwards, and how Fast-Start Failover automates the whole thing. It assumes you already have a working Data Guard configuration; if you need the concepts first, start with the complete Data Guard guide.

Key Takeaways

A switchover is a planned, lossless role swap between primary and standby; a failover is an emergency promotion of the standby when the primary is gone.
Always run pre-checks first - configuration status, transport and apply lag, and a validation of the target standby - before any transition.
The Data Guard Broker (DGMGRL) makes role transitions a single command each and handles the many steps underneath safely.
After a failover, the old primary must be reinstated (often via Flashback) to rejoin as a standby, not just restarted.
A snapshot standby lets you open the standby read-write for testing, then discard the changes and resume applying redo.
Fast-Start Failover with an observer automates failover within seconds of a primary loss - the gold standard for hands-off high availability.

1. Switchover vs Failover - Know Which You Are Doing

These two words get used interchangeably in a crisis, but they are very different operations, and confusing them causes mistakes.

A switchover is planned and lossless. The primary and standby cleanly swap roles - the old primary becomes a standby, the old standby becomes the primary - with no data loss. You use it for planned maintenance, hardware moves, or testing your DR readiness.

A failover is unplanned. The primary is gone - crashed, unreachable, destroyed - and you promote the standby to primary to restore service. Depending on your protection mode and how much redo reached the standby, a failover may involve a small amount of data loss. The old primary does not automatically come back as a standby; it must be reinstated.

2. Pre-Checks Before Any Role Transition

Whether planned or not, look before you leap. The Broker makes this quick.

-- Connect to the broker
dgmgrl sys/password@primary

-- Overall health - everything should be SUCCESS
SHOW CONFIGURATION;

-- Detail on the standby you intend to promote, including apply lag
SHOW DATABASE VERBOSE 'standby_db';

-- Validate that the standby is ready to take the primary role
VALIDATE DATABASE 'standby_db';

VALIDATE DATABASE (available from Oracle Database 12c onwards) is the command many DBAs miss. It reports whether the target is ready to become primary - redo received, apply status, flashback, and any warnings - before you commit. For a switchover you want zero lag; for a failover you accept whatever redo arrived.
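It can also be worth cross-checking the Broker's view from SQL*Plus. The query below is a minimal sketch, assuming a SYSDBA session on each database in turn; it reads only V$DATABASE and changes nothing.

```sql
-- Run on the primary and on the target standby, connected AS SYSDBA.
-- On the primary, PROTECTION_LEVEL is what is actually in force right now;
-- if it is lower than PROTECTION_MODE, the standby is not currently synchronised.
SELECT db_unique_name,
       database_role,      -- PRIMARY / PHYSICAL STANDBY / SNAPSHOT STANDBY
       open_mode,
       protection_mode,    -- the mode you configured
       protection_level,   -- the level currently being delivered
       flashback_on        -- YES on both sides keeps reinstatement possible
FROM   v$database;
```

This does not replace VALIDATE DATABASE; it is simply a second source of evidence if the Broker output looks odd.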

3. Switchover Runbook (Planned)

With the Broker, a healthy switchover is essentially one command, but the surrounding steps matter.

Confirm applications can tolerate the brief transition and that the standby lag is zero.
Run VALIDATE DATABASE on the target and resolve any warnings.
Execute the switchover; the Broker coordinates both databases.
Verify the new roles and that redo transport has reversed.

-- One command; the broker does the rest
SWITCHOVER TO 'standby_db';

-- Afterwards, confirm the swap
SHOW CONFIGURATION;      -- roles are now reversed, status SUCCESS

The old primary automatically becomes a standby and starts receiving redo from the new primary. No reinstatement is needed for a switchover - that is the whole point of it being planned and clean.
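To back the Broker's SUCCESS line with evidence from the database itself, something like the following can be run on the new primary. It is a sketch that assumes a SYSDBA session and uses only V$DATABASE and V$ARCHIVE_DEST_STATUS.

```sql
-- On the NEW primary: the role should read PRIMARY and the mode READ WRITE.
SELECT db_unique_name, database_role, open_mode
FROM   v$database;

-- Still on the new primary: each remote destination should be VALID with
-- no error text, which indicates redo is now shipping towards the old primary.
SELECT dest_id, db_unique_name, status, gap_status, error
FROM   v$archive_dest_status
WHERE  type   <> 'LOCAL'
AND    status <> 'INACTIVE';
```

A destination showing ERROR, or an unresolved gap in GAP_STATUS, is worth investigating before the maintenance window is closed.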

4. Failover Runbook (Unplanned)

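When the primary is truly gone, you promote the standby. How much data you can lose was decided in advance by your protection mode. Maximum Protection guarantees zero data loss, because the primary will not commit (and will shut itself down) unless redo reaches a synchronised standby. Maximum Availability gives zero data loss as long as the standby was synchronised at the moment of failure, but it lets the primary carry on unprotected if the standby becomes unreachable. Maximum Performance ships redo asynchronously, favours primary speed, and may lose the last few transactions on failover.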

-- Connect the broker to the SURVIVING standby
dgmgrl sys/password@standby_db

-- Promote it to primary
FAILOVER TO 'standby_db';

-- Confirm it is now the primary and open
SHOW CONFIGURATION;

After a failover, restore application connectivity to the new primary. If you use a role-aware service and a properly configured client connect string, sessions reconnect to the new primary automatically. This is exactly the kind of event covered in what to do when the database fails at 3 AM - the runbook is what turns panic into procedure.

5. Reinstating the Old Primary

After a failover, the failed database - once it is back on its feet - is behind the new primary and cannot simply rejoin. It must be reinstated as a standby. If you had Flashback Database enabled (and you should have), the Broker can flash it back to the correct point and turn it into a standby automatically.

-- Start the old primary in MOUNT, then connect the broker to the NEW primary and run:
REINSTATE DATABASE 'old_primary';

-- It flashes back and becomes a standby of the new primary
SHOW CONFIGURATION;   -- both databases healthy again

This is a concrete reason to keep Flashback Database on: without it, reinstating the old primary usually means rebuilding the standby from scratch with a fresh copy, which is far slower.
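Whether reinstatement can work comes down to whether the old primary can still flash back to a point before the failover. One hedged way to check, assuming SYSDBA sessions on both databases:

```sql
-- On the NEW primary: the SCN at which it took over the primary role.
SELECT standby_became_primary_scn
FROM   v$database;

-- On the OLD primary (mounted): is Flashback on, and how far back does it reach?
SELECT flashback_on
FROM   v$database;

SELECT oldest_flashback_scn,
       oldest_flashback_time,
       retention_target      -- minutes, from DB_FLASHBACK_RETENTION_TARGET
FROM   v$flashback_database_log;
```

If OLDEST_FLASHBACK_SCN on the old primary is higher than STANDBY_BECAME_PRIMARY_SCN on the new primary, the flashback logs no longer reach back far enough and a rebuild is the likely outcome.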

6. Snapshot Standby - Test on Real Data, Then Rewind

Sometimes you want to open the standby read-write - to test a change against production-like data - without giving up the redo that protects you. A snapshot standby does exactly that. It converts the standby to read-write, keeps receiving redo (but does not apply it yet), and when you convert it back, it discards your test changes and catches up. The trade-off is recovery time: if you had to fail over while it is in snapshot mode, it must first be converted back and apply the redo backlog.

-- Open the standby for read-write testing
CONVERT DATABASE 'standby_db' TO SNAPSHOT STANDBY;
-- ... run your tests; changes here are temporary ...

-- Discard changes and resume being a standby
CONVERT DATABASE 'standby_db' TO PHYSICAL STANDBY;

This is invaluable for validating an application release or a risky data fix against realistic data without touching production and without building a separate clone.
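While the standby is in snapshot mode, it can help to confirm its state and keep an eye on the restore point that makes the rewind possible. A minimal sketch, assuming a SYSDBA session on the standby:

```sql
-- On the standby while converted: expect SNAPSHOT STANDBY and READ WRITE.
SELECT database_role, open_mode
FROM   v$database;

-- The guaranteed restore point created by the conversion.
-- STORAGE_SIZE (bytes) grows as test changes accumulate.
SELECT name, scn, time, guarantee_flashback_database, storage_size
FROM   v$restore_point;
```

The fast recovery area has to hold both the flashback logs and the redo that keeps arriving from the primary, so a long test window needs space to match.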

7. Fast-Start Failover - Automating the Emergency

Manual failover depends on a human noticing and acting. Fast-Start Failover (FSFO) removes the human from the critical path. A lightweight process called the observer continuously watches both databases; if the primary becomes unreachable for longer than a set threshold, the observer triggers an automatic failover to the standby within seconds, and later reinstates the old primary automatically when it returns.

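-- Set the threshold (a configuration-level property, in seconds), then enable FSFO
EDIT CONFIGURATION SET PROPERTY FastStartFailoverThreshold = 30;
ENABLE FAST_START FAILOVER;

-- Start the observer from a separate DGMGRL session on a third host;
-- START OBSERVER runs in the foreground and holds that session
START OBSERVER;

SHOW FAST_START FAILOVER;   -- confirm enabled + observer present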

The observer should run on a third machine, separate from both databases, so it can tell the difference between a dead primary and a network split. FSFO is the configuration to aim for when the business needs hands-off, seconds-level recovery.
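The same state is visible from SQL, which is handy for monitoring tools that do not talk to the Broker. A sketch, assuming a session with access to V$DATABASE on the primary:

```sql
-- Fast-Start Failover state as the database itself sees it.
SELECT fs_failover_status,            -- e.g. SYNCHRONIZED, or DISABLED when FSFO is off
       fs_failover_current_target,    -- the standby that would be promoted
       fs_failover_threshold,         -- seconds
       fs_failover_observer_present,  -- YES when the observer is connected
       fs_failover_observer_host
FROM   v$database;
```

An alert on FS_FAILOVER_OBSERVER_PRESENT not being YES is a cheap way to notice that the observer has quietly died.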

8. Common Issues and How to Handle Them

Switchover refuses with a warning: almost always apply lag or an open blocking session. Clear the gap and re-run VALIDATE DATABASE; do not force it past a real warning.
Redo gap on the standby: missing archived logs mean the standby is behind. Resolve the gap (the Broker usually fetches it automatically) before a switchover.
Old primary will not reinstate: Flashback was off or the flashback logs aged out - you will need to rebuild the standby from a fresh copy, for example with RMAN DUPLICATE.
Applications do not follow the new primary: the connect string or database service was not role-aware. Use a Data Guard-aware service so connections move automatically on a transition.
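For the redo-gap case in particular, the standby can be asked directly what it is missing. A minimal sketch, assuming a SYSDBA session on the standby:

```sql
-- On the standby: any rows here are a gap in the archived logs received.
SELECT thread#, low_sequence#, high_sequence#
FROM   v$archive_gap;

-- On the standby: highest sequence received versus highest applied, per thread,
-- restricted to the current incarnation.
SELECT thread#,
       MAX(sequence#)                                    AS last_received,
       MAX(CASE WHEN applied = 'YES' THEN sequence# END) AS last_applied
FROM   v$archived_log
WHERE  resetlogs_change# = (SELECT resetlogs_change# FROM v$database)
GROUP  BY thread#;
```

With real-time apply the log currently being written will not appear here yet, so a difference of one sequence is not by itself a problem.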

9. A Real Transition: A Clean DR Drill on a Bank Standby

A banking client needed to prove, to auditors, that they could run on their DR site - without risking the live system. We scheduled a planned switchover during a quiet window: ran SHOW CONFIGURATION and VALIDATE DATABASE to confirm zero lag and a clean standby, executed SWITCHOVER TO the DR database through the Broker, and confirmed roles reversed and redo transport flowed back the other way. Applications, using a role-aware service, reconnected to the DR site automatically. The bank ran on DR for the agreed period, then switched back the same way. Because everything went through the Broker with pre-validation, the drill was uneventful - which, for a DR test, is exactly the goal. This operational discipline sits on top of the design covered in the complete Data Guard guide and a solid backup foundation.

10. My Switchover Night: What I Watch, Minute by Minute

On paper a switchover is one command. In my calendar it is a ninety-minute window, and the command itself takes about two minutes of it. This is the timeline I actually run.

T-60: I check lag one final time and freeze the batch schedule. A long-running batch job holding a big transaction is the most common reason a switchover stalls at the worst possible moment.

-- Run on the standby; the lag figures are only populated there
SELECT name, value FROM v$dataguard_stats
WHERE name IN ('transport lag','apply lag');
-- I want +00 00:00:00 on both before I go any further
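The batch-job risk mentioned at T-60 can be checked rather than assumed. The query below is a sketch, assuming a SYSDBA (or suitably privileged) session on the primary; it lists open transactions with the largest first.

```sql
-- On the primary: open transactions and the sessions that own them.
SELECT s.sid,
       s.serial#,
       s.username,
       s.program,
       t.start_time,   -- text, MM/DD/YY HH24:MI:SS
       t.used_ublk     -- undo blocks held; a rough indicator of transaction size
FROM   v$transaction t
JOIN   v$session     s ON s.taddr = t.addr
ORDER  BY t.used_ublk DESC;
```

Anything large or long-running here is a conversation with the job owner before the window, not during it.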

T-30: I run VALIDATE DATABASE VERBOSE and read it line by line, not just the summary. Four things get my attention: "Ready for Switchover: Yes", the flashback status on both databases, standby redo log counts matching the primary's threads, and the temporary tablespace section. A standby missing temp files opens happily - and then the first big sort after the transition falls on its face.
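Two of those four items, the standby redo logs and the temp files, can be confirmed directly from SQL. A minimal sketch, assuming SYSDBA sessions on the databases indicated in each comment:

```sql
-- On the primary: online redo log groups per thread.
SELECT thread#, COUNT(*) AS log_groups, MAX(bytes)/1024/1024 AS size_mb
FROM   v$log
GROUP  BY thread#;

-- On both databases: standby redo log groups per thread.
-- Unassigned standby logs can show up under thread 0.
SELECT thread#, COUNT(*) AS srl_groups, MAX(bytes)/1024/1024 AS size_mb
FROM   v$standby_log
GROUP  BY thread#;

-- On the standby: temp files it will have available once it is primary.
SELECT name, bytes/1024/1024 AS size_mb, status
FROM   v$tempfile;
```

Oracle's usual guidance is standby redo logs of the same size as the online logs, with one more group per thread than the online logs have.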

T-10: Application connections get drained at the connection-pool level, never by killing sessions. I watch v$session until only my own connections remain.
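For the drain itself, a query along these lines can be re-run until only the DBA's own sessions are left. It is a sketch, assuming access to V$SESSION on the primary:

```sql
-- On the primary: remaining user sessions, grouped by where they come from.
SELECT username, machine, program, COUNT(*) AS session_count
FROM   v$session
WHERE  type = 'USER'
AND    username IS NOT NULL
GROUP  BY username, machine, program
ORDER  BY session_count DESC;
```

Grouping by machine and program makes it obvious which connection pool has not been drained yet.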

T-0: SWITCHOVER TO, with both alert logs tailing in separate terminals. The broker's messages tell the story in real time - the primary converting, the standby mounting as the new primary, redo transport reversing direction.

T+10: Smoke tests before any announcement: a sequence NEXTVAL, one insert and commit, an application login, one real report. I declare success on evidence, not on the broker's SUCCESS line alone.
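The database-side part of that smoke test might look like the sketch below. The objects dr_smoke_seq and dr_smoke_log are hypothetical names introduced here for illustration; they are not part of any Oracle installation and would be created once, in advance, on the primary.

```sql
-- One-time setup (hypothetical objects, created ahead of the drill):
-- CREATE SEQUENCE dr_smoke_seq;
-- CREATE TABLE dr_smoke_log (id NUMBER PRIMARY KEY,
--                            checked_at TIMESTAMP,
--                            note VARCHAR2(100));

-- 1. The database reports itself as an open primary.
SELECT database_role, open_mode
FROM   v$database;

-- 2. A write and a commit succeed (this also exercises the sequence).
INSERT INTO dr_smoke_log (id, checked_at, note)
VALUES (dr_smoke_seq.NEXTVAL, SYSTIMESTAMP, 'post-switchover smoke test');
COMMIT;

-- 3. The row just written can be read back.
SELECT id, checked_at, note
FROM   dr_smoke_log
WHERE  id = (SELECT MAX(id) FROM dr_smoke_log);
```

The application login and the real report still have to be done through the application; this only covers the part a DBA can prove from SQL*Plus.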

11. Declaring the Primary Dead: The Failover Decision

The hardest part of a failover is not the command - it is deciding to run it. Once you fail over, any redo that never left the old primary is gone for good, yet every minute of hesitation extends the outage.

My discipline is three checks, time-boxed in advance. Can I reach the primary host at all - ping, SSH, the iLO console? If yes, is the instance recoverable in minutes - a crashed instance that will simply restart does not justify a failover. And is the standby still receiving redo - because if it is, the primary is alive somewhere and I am probably looking at a network problem, not a dead database.
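The third check, whether the standby is still receiving redo, can be answered from the standby alone. A minimal sketch, assuming a SYSDBA session on the standby:

```sql
-- On the standby: DATUM_TIME is when the standby last received the data
-- from the primary that this metric is computed from.
SELECT name, value, datum_time, time_computed
FROM   v$dataguard_stats
WHERE  name = 'transport lag';
```

Run it twice, a few seconds apart. If DATUM_TIME keeps advancing, the primary is still sending; if it stays frozen while TIME_COMPUTED moves on, nothing is arriving.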

The time-box matters more than the checks. I agree it with management in daylight: if the primary cannot be diagnosed within fifteen minutes, we fail over - no committee call at 3 AM. The worst failovers I have seen were not triggered too early; they happened two hours too late because nobody felt authorised to decide.

Client redirection is the other half of the decision. Both hosts belong in the connect string, and the application service should exist only where the database currently holds the primary role:

SALES =
 (DESCRIPTION=
  (ADDRESS_LIST=
   (ADDRESS=(PROTOCOL=TCP)(HOST=prod-host)(PORT=1521))
   (ADDRESS=(PROTOCOL=TCP)(HOST=dr-host)(PORT=1521)))
  (CONNECT_DATA=(SERVICE_NAME=sales_rw)
   (FAILOVER_MODE=(TYPE=SELECT)(METHOD=BASIC)(RETRIES=20)(DELAY=3))))

-- Role-aware service: only starts where the database is PRIMARY
-- (run from the OS shell; register it on both sites, each under its own db_unique_name)
srvctl add service -db proddb -service sales_rw -role PRIMARY

With that pairing, clients try both addresses but only find the service on the current primary, so a role transition moves them automatically. One audit I always run before a drill: hunt for applications with a hard-coded single hostname buried in their own config files. They bypass all of this, and in my experience there is always at least one.
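Part of that audit can be done from inside the database. The sketch below assumes the sales_rw service name from the example above and sessions with access to the V$ views.

```sql
-- On each database: the service should be listed only where the role is PRIMARY.
SELECT d.db_unique_name, d.database_role, s.name AS service_name
FROM   v$database d
LEFT   JOIN v$active_services s ON LOWER(s.name) = 'sales_rw';

-- On the primary: user sessions grouped by the service they connected through.
SELECT service_name, machine, program, COUNT(*) AS session_count
FROM   v$session
WHERE  type = 'USER'
GROUP  BY service_name, machine, program
ORDER  BY service_name, session_count DESC;
```

Application sessions arriving through anything other than the role-aware service are the first candidates to chase. This only catches clients using a different service name; a client that uses sales_rw but lists a single host still has to be found in its own configuration.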

Frequently Asked Questions

What is the difference between a Data Guard switchover and failover?

A switchover is planned and lossless - the primary and standby cleanly swap roles for maintenance or DR testing, and the old primary becomes a standby automatically. A failover is an emergency promotion of the standby when the primary is gone; depending on the protection mode it may involve minor data loss, and the old primary must be reinstated afterwards.

How do I perform a Data Guard switchover?

With the Data Guard Broker it is essentially one command. Connect with DGMGRL, run SHOW CONFIGURATION and VALIDATE DATABASE to confirm the standby is ready with zero lag, then run SWITCHOVER TO 'standby_db'. The broker coordinates both databases and reverses redo transport; confirm the new roles with SHOW CONFIGURATION.

How do I bring the old primary back after a failover?

You reinstate it rather than just restart it. Start the old primary in MOUNT and run REINSTATE DATABASE from the broker. If Flashback Database was enabled, the broker flashes it back to the right point and converts it to a standby automatically. Without Flashback you usually have to rebuild the standby from a fresh copy.

What is Fast-Start Failover?

Fast-Start Failover (FSFO) automates failover using an observer process that watches both databases. If the primary is unreachable beyond a set threshold, the observer triggers an automatic failover to the standby within seconds and later reinstates the old primary when it returns. The observer should run on a third host so it can distinguish a dead primary from a network split.

What is a snapshot standby used for?

A snapshot standby lets you open the standby read-write for testing against production-like data while it keeps receiving redo. You convert it to a snapshot standby, run your tests, then convert it back to a physical standby, which discards the test changes and resumes applying redo. It is ideal for validating a release without touching production or building a separate clone.

About the Author

Nasir Uddin Khan Senior IT Consultant · Oracle DBA · ERP & AI Specialist OCP · Red Hat Certified · MBA · CSV · 18+ Years Experience

Nasir is an Oracle Certified Professional and CSV-certified IT consultant based in Dhaka, Bangladesh. He has 18+ years of hands-on experience in Oracle database administration (RAC, Data Guard, RMAN), WebLogic middleware, ERP system design, and AI integration for manufacturing, pharmaceutical, banking, and healthcare organisations worldwide.

References & Further Reading

The procedures and case studies in this article are based on 18+ years of Oracle production database administration across manufacturing, banking, and pharmaceutical environments.