IT Infrastructure Design for Business Continuity: High Availability, Disaster Recovery & Enterprise Architecture Guide

Every organization eventually faces a moment when the infrastructure decision made three years ago — the one that seemed reasonable at the time — becomes the reason the business is offline at 2 AM on a Monday. A single server running the ERP. No standby database. Backups that were never tested. A network with one path to the internet. I have spent 18 years designing infrastructure that prevents those moments — and recovering systems for organizations that experienced them without adequate preparation. This guide covers the complete framework for designing enterprise IT infrastructure that keeps your business running regardless of what fails.

1. The Real Cost of Infrastructure Failure

Before discussing architecture, it is worth understanding what is actually at stake. Downtime costs vary by industry, but the impact is consistently significant:

  • Manufacturing: Production lines halt, batch records cannot be completed, shipments are delayed — every hour of ERP downtime translates directly to lost output and contractual penalties
  • Banking and financial services: Transactions cannot be processed, customer confidence erodes, and regulatory obligations may be breached. Bangladesh Bank has specific requirements for system availability
  • Pharmaceutical: Manufacturing cannot proceed without validated computer systems; a database outage during a production batch can result in that entire batch being rejected on quality grounds
  • Retail and e-commerce: Every minute offline during a sale campaign is measurable lost revenue

Beyond the direct cost, there is the recovery cost: emergency vendor support, potential data loss, staff overtime, and the reputational damage that follows any prolonged outage. Infrastructure design is not an IT expense — it is business risk management.

2. Three Concepts You Must Separate

High Availability, Disaster Recovery, and Business Continuity are related but distinct. Conflating them leads to architectures that address one while leaving the others exposed.

2.1 High Availability (HA)

Protection against component failure within a site. HA means the system keeps running when a server, disk, network card, or power supply fails. HA is achieved through redundancy: multiple components do the same job so that the failure of one does not bring down the service. Typical HA targets range from 99.9% uptime (about 8.8 hours of downtime per year) to 99.99% (about 53 minutes per year).
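
The downtime figures attached to an availability target are simple arithmetic. A minimal Python sketch, assuming a 365-day year (the function name is illustrative, not from any tool):

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def annual_downtime_hours(availability_percent: float) -> float:
    """Return the downtime per year that an availability target permits."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


for target in (99.9, 99.99):
    hours = annual_downtime_hours(target)
    print(f"{target}% -> {hours:.2f} hours/year ({hours * 60:.1f} minutes)")
```

Running the same function against a proposed SLA figure is a quick way to turn a percentage into a number the business can react to.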

2.2 Disaster Recovery (DR)

Protection against site-level failure. DR means the business can resume operations when an entire data center, building, or geographic location becomes unavailable — due to fire, flood, power failure, ransomware, or physical damage. DR involves a secondary site with a synchronized or near-synchronized copy of your systems. Key metrics: Recovery Time Objective (RTO — how long to recover) and Recovery Point Objective (RPO — how much data you can afford to lose).

2.3 Business Continuity (BC)

The broader plan that includes HA and DR. BC covers not just the technology but the people, processes, and communications needed to keep the business functioning through any disruption — including scenarios where the IT systems are fine but the building is inaccessible, or key personnel are unavailable.

A complete enterprise infrastructure design must address all three layers. Most organizations do HA reasonably well, handle DR inadequately, and rarely formalize BC at all.

3. The HA + DR Architecture Framework

The architecture framework I use for enterprise infrastructure design is built around five layers, each of which must be designed for redundancy and recovery:

Layer 1: Compute      — Servers, VMs, clusters
Layer 2: Storage      — NAS, SAN, ASM, replication
Layer 3: Network      — Switches, routers, firewalls, load balancers
Layer 4: Database     — Oracle RAC (HA), Data Guard (DR)
Layer 5: Application  — WebLogic clusters, middleware, app servers

Each layer must be designed independently — a failure at any layer should not cascade into a total outage. If compute is redundant but storage has a single point of failure, the architecture is only as resilient as its weakest layer.
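
The "weakest layer" point can be made concrete with a little probability. The sketch below assumes layer failures are independent and uses hypothetical availability figures, so treat it as an illustration of the reasoning rather than a measurement:

```python
from math import prod


def series_availability(layers: dict[str, float]) -> float:
    """Availability of a stack in which every layer must be up at once."""
    return prod(layers.values())


def redundant_availability(single: float, copies: int = 2) -> float:
    """Availability of independent redundant copies where one survivor suffices."""
    return 1 - (1 - single) ** copies


# Hypothetical figures: every layer is duplicated except storage.
layers = {
    "compute": redundant_availability(0.99),
    "storage": 0.99,  # single controller, no redundancy
    "network": redundant_availability(0.99),
    "database": redundant_availability(0.99),
    "application": redundant_availability(0.99),
}

print(f"End-to-end availability: {series_availability(layers):.4%}")
print(f"Weakest layer: {min(layers, key=layers.get)}")
```

Under these assumptions the end-to-end figure stays below the availability of the unprotected storage layer, no matter how much redundancy the other four layers have.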

4. Layer 1 — Compute Architecture

4.1 Physical Server vs Virtualization

For Oracle production databases, I generally recommend physical servers over virtual machines for performance-critical workloads. Oracle licensing is per-core, and virtualization can complicate licensing compliance while adding overhead. For Oracle RAC specifically, the recommendation is bare-metal for production — VMware is supported but adds latency to cluster interconnect communication.

For application servers (WebLogic, middleware, web tier), virtualization is appropriate and provides rapid recovery options through VM snapshot and live migration.

4.2 Server Sizing Methodology

Server sizing is not guesswork. The correct methodology starts with workload measurement:

  • Current peak CPU utilization — measured over 4 weeks, not a single snapshot. Target: peak usage should not exceed 60–70% of capacity under normal conditions, leaving headroom for growth and failure scenarios
  • Memory requirements — Oracle SGA + PGA + OS + application. A common error: sizing memory for current data volume without accounting for 3-year growth
  • I/O profile — IOPS (reads vs writes), latency requirements, sequential vs random I/O. This drives storage selection more than raw capacity
  • Network throughput — particularly for RAC cluster interconnect (private network), which must be low-latency (< 1ms) and high-bandwidth (10 GbE minimum, 25 GbE recommended for busy clusters)
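
The CPU and memory rules above reduce to two small calculations. This is a rough sketch with hypothetical measurements; the 65% target sits inside the 60–70% band mentioned above, and applying one growth rate to the whole memory footprint is a simplifying assumption:

```python
import math


def cores_needed(current_cores: int, peak_utilization: float,
                 target_utilization: float = 0.65) -> int:
    """Cores required so that the measured peak lands at the target utilization."""
    busy_cores = current_cores * peak_utilization
    return math.ceil(busy_cores / target_utilization)


def memory_needed_gb(sga: float, pga: float, os_gb: float, app: float,
                     annual_growth: float = 0.0, years: int = 3) -> float:
    """Sum the memory consumers, then compound the total over the planning horizon."""
    return (sga + pga + os_gb + app) * (1 + annual_growth) ** years


# Hypothetical values taken from a 4-week monitoring window.
print(cores_needed(current_cores=16, peak_utilization=0.85))
print(round(memory_needed_gb(sga=64, pga=16, os_gb=8, app=8, annual_growth=0.2)))
```

A server that peaks at 85% on 16 cores needs noticeably more cores to bring that same peak down into the target band, which is the headroom the first bullet asks for.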

4.3 Redundancy at Compute Layer

Every production server should have:

  • Dual power supplies connected to separate PDUs (and ideally separate UPS circuits)
  • Dual network interface cards (NIC bonding / teaming) for public network
  • Separate NICs for cluster interconnect (RAC), storage network (SAN), and management (IPMI/iDRAC)
  • Hardware RAID for local OS disks (RAID 1 minimum)
  • Out-of-band management (iDRAC, iLO) for remote power control and console access

5. Layer 2 — Storage Architecture

Storage is the most common single point of failure in enterprise environments. A well-designed storage architecture must survive disk failure, controller failure, and in the DR scenario, site failure.

5.1 Storage Options for Oracle

Storage Type      | Best For                                       | Oracle Use
SAN (FC/iSCSI)    | High-performance databases, RAC shared storage | Oracle RAC requires shared storage (SAN or NFS)
NAS (NFS)         | File sharing, Oracle ASM on NFS (Direct NFS)   | Oracle Direct NFS (dNFS) provides good performance
All-Flash (NVMe)  | Latency-sensitive OLTP workloads               | Ideal for high-transaction ERP databases
Hybrid Flash      | Mixed workloads with cost sensitivity          | Hot data on flash tier, cold data on spinning disk

5.2 Oracle ASM — Automatic Storage Management

For Oracle databases, ASM is the recommended storage management layer. ASM provides:

  • Automatic data distribution across disks — no manual striping
  • Built-in mirroring (Normal Redundancy = 2-way, High Redundancy = 3-way)
  • Online disk group rebalancing — add or remove disks without downtime
  • Shared storage for RAC clusters — multiple nodes access the same ASM disk group
  • Fast Mirror Resync — if a disk goes offline and comes back, only changed extents are resynced
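
-- Create ASM disk group with high redundancy (3-way mirror).
-- One failure group per storage controller, as the fg_ctrl names suggest.
-- Disks inside a failure group are comma-separated; the FAILGROUP clauses
-- themselves are not.
-- Device paths are illustrative; production systems normally use persistent
-- device names rather than raw /dev/sdX paths.
CREATE DISKGROUP DATA HIGH REDUNDANCY
  FAILGROUP fg_ctrl1 DISK '/dev/sdb' NAME DATA_0001,
                          '/dev/sdc' NAME DATA_0002
  FAILGROUP fg_ctrl2 DISK '/dev/sdd' NAME DATA_0003,
                          '/dev/sde' NAME DATA_0004
  FAILGROUP fg_ctrl3 DISK '/dev/sdf' NAME DATA_0005,
                          '/dev/sdg' NAME DATA_0006
ATTRIBUTE 'AU_SIZE'='4M', 'COMPATIBLE.ASM'='19.0';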
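
Mirroring has a direct capacity cost that is easy to overlook in a bill of materials. The Python sketch below is an approximation under stated assumptions: it divides raw capacity by the number of mirror copies and ignores ASM metadata and the free space needed to re-mirror after a disk failure. The disk sizes are hypothetical.

```python
# Number of copies ASM keeps of each extent for each redundancy level.
MIRROR_COPIES = {"EXTERNAL": 1, "NORMAL": 2, "HIGH": 3}


def usable_capacity_gb(disk_size_gb: float, disk_count: int, redundancy: str) -> float:
    """Approximate usable space: raw capacity divided by the mirror copy count."""
    raw_gb = disk_size_gb * disk_count
    return raw_gb / MIRROR_COPIES[redundancy.upper()]


# Six hypothetical 2 TB disks, matching the six-disk layout of the disk group above.
for level in ("NORMAL", "HIGH"):
    print(level, usable_capacity_gb(2000, 6, level), "GB usable")
```

Treat the result as an upper bound when sizing; real usable space will be somewhat lower.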

6. Layer 3 — Network Architecture

Network design is frequently underinvested in enterprise environments. A well-designed compute and storage layer can be completely undermined by a single-switch network.

6.1 Network Segmentation

Production enterprise networks should be segmented into separate VLANs/segments for:

  • Public / application network: User traffic, application-to-database connections
  • Oracle RAC cluster interconnect: Private, dedicated, 10/25 GbE, low-latency switches only
  • Storage network: iSCSI or dedicated FC fabric for SAN traffic
  • Management network: IPMI, iDRAC, switch management — isolated from production
  • Backup network: RMAN backup traffic — prevents backup jobs from saturating production network
  • DR replication network: Oracle Data Guard redo transport — dedicated bandwidth for log shipping
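
A segmentation plan like the one above usually ends up in the LLD as a subnet table. The following Python sketch shows one way to generate and sanity-check such a table; the 10.20.0.0/16 block, the /24 size, and the VLAN numbering are all hypothetical placeholders:

```python
import ipaddress

# Hypothetical site address block; substitute your own addressing plan.
SITE_BLOCK = ipaddress.ip_network("10.20.0.0/16")
SEGMENTS = ["public", "rac-interconnect", "storage",
            "management", "backup", "dr-replication"]

# Carve one /24 per segment out of the site block, in order.
subnets = SITE_BLOCK.subnets(new_prefix=24)
plan = {name: next(subnets) for name in SEGMENTS}

for vlan_id, (name, net) in enumerate(plan.items(), start=10):
    print(f"VLAN {vlan_id:<3} {name:<18} {net}")

# Sanity check: no two segments may share address space.
nets = list(plan.values())
assert not any(a.overlaps(b) for i, a in enumerate(nets) for b in nets[i + 1:])
```

Keeping the plan in a script rather than a spreadsheet makes the overlap check repeatable every time a segment is added.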

6.2 Network Redundancy

Every network path that matters must have a failover path:

  • Dual top-of-rack switches with cross-connects — no single switch failure takes down all server ports
  • NIC bonding (LACP / active-passive) on all server network interfaces
  • Redundant uplinks between access and distribution switches
  • Redundant internet connectivity — ideally two providers into two separate routers
  • Redundant firewall pair (active-standby or active-active)

6.3 Firewall and Security Zoning

The database tier should never be directly accessible from the internet or from user workstations. A proper DMZ architecture places:

  • Web servers and reverse proxies in the DMZ
  • Application servers in an internal application zone
  • Database servers in a restricted database zone — only application servers can connect, on specific Oracle listener ports
  • Management servers in a dedicated management zone with MFA-protected access
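
The zoning rules above amount to a default-deny allow-list between zones. A minimal Python model of that idea follows; the zone names and most ports are assumptions for illustration, while 1521 is the default Oracle listener port:

```python
# Hypothetical zone-to-zone allow-list. Anything not listed is denied.
ALLOWED_FLOWS = {
    ("internet", "dmz"): {443},
    ("dmz", "application"): {8001},       # assumed application server port
    ("application", "database"): {1521},  # default Oracle listener port
    ("management", "database"): {22},     # assumed SSH administration
}


def is_allowed(src_zone: str, dst_zone: str, port: int) -> bool:
    """Default deny: a flow passes only if it is explicitly listed."""
    return port in ALLOWED_FLOWS.get((src_zone, dst_zone), set())


print(is_allowed("application", "database", 1521))
print(is_allowed("internet", "database", 1521))
print(is_allowed("users", "database", 1521))
```

A table like this is also a useful review artifact: each entry should trace back to a documented business need, and the real firewall rule base can be audited against it.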

7. Layer 4 — Database HA and DR (Oracle RAC + Data Guard)

For Oracle-based enterprise systems, the combination of Oracle RAC and Oracle Data Guard provides the gold standard for database availability and disaster recovery.

7.1 Oracle RAC for High Availability

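Oracle RAC runs the same database across multiple server nodes simultaneously. All nodes share the same ASM storage. If any node fails, the surviving nodes continue serving the application, so the database itself stays available. Sessions that were connected to the failed node are dropped; TAF (Transparent Application Failover) reconnects them automatically to a surviving node and can resume in-flight queries, but uncommitted transactions are rolled back and must be retried by the application unless Application Continuity is configured.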

A minimum production RAC configuration for a mid-size organization:

  • 2 RAC nodes (minimum for HA) — 4 nodes for larger workloads
  • Oracle Grid Infrastructure on each node
  • Shared ASM disk group (SAN or NAS)
  • Private cluster interconnect: 2 × 10 GbE (bonded) per node
  • Public network: 2 × 1 GbE (bonded) per node
  • SCAN (Single Client Access Name) — 3 SCAN IPs, DNS round-robin

7.2 Oracle Data Guard for Disaster Recovery

Data Guard maintains a synchronized copy of the database at the DR site. In the event of a primary site failure, the standby database is activated — either automatically (using Fast-Start Failover) or manually — and becomes the new primary.

The key DR design decisions:

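  • Protection mode: Maximum Performance (asynchronous redo transport; RPO is typically seconds, depending on transport lag) vs Maximum Availability (synchronous; zero data loss as long as the standby is reachable). Choose based on RPO requirement and network latency between sites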
  • RTO target: How fast must the DR site be active? Manual failover = 15–30 minutes. Automatic failover (FSFO) = 30–60 seconds
  • Standby type: Physical standby (most common, lowest overhead), optionally with Active Data Guard, a separately licensed option that opens the physical standby read-only for reporting offload while redo apply continues
  • DR site infrastructure: Does the DR site need to run at full production capacity? Or is it acceptable to run at reduced capacity while primary is restored?
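
These decisions become easier once each candidate design is compared against the agreed RTO and RPO. The Python sketch below does that comparison; the failover times are taken from the ranges quoted above, while the restore time and every RPO value are assumptions that should be replaced with figures measured in a DR test:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DrOption:
    name: str
    rto_minutes: float  # worst-case time to resume service
    rpo_minutes: float  # worst-case window of lost data


# Illustrative worst-case figures, not benchmarks.
OPTIONS = [
    DrOption("Restore from nightly backup", rto_minutes=480, rpo_minutes=1440),
    DrOption("Data Guard, manual failover, async", rto_minutes=30, rpo_minutes=1),
    DrOption("Data Guard, FSFO, sync", rto_minutes=1, rpo_minutes=0),
]


def meets_targets(option: DrOption, rto_target: float, rpo_target: float) -> bool:
    """An option qualifies only if it satisfies both objectives."""
    return option.rto_minutes <= rto_target and option.rpo_minutes <= rpo_target


def viable_options(rto_target: float, rpo_target: float) -> list[str]:
    return [o.name for o in OPTIONS if meets_targets(o, rto_target, rpo_target)]


print(viable_options(rto_target=60, rpo_target=5))
```

The value of writing it down this way is that the business signs off on two numbers, and the architecture either meets them or it does not.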

7.3 Maximum Availability Architecture (MAA)

Oracle's Maximum Availability Architecture combines RAC + Data Guard + RMAN into a comprehensive HA/DR/backup solution. For a two-site MAA deployment:

PRIMARY SITE (Dhaka):
  ├── RAC Node 1 (Oracle 26ai)
  ├── RAC Node 2 (Oracle 26ai)
  ├── ASM Disk Group — SAN storage (High Redundancy)
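  └── Observer (for FSFO; ideally hosted at a third location so that a primary-site outage does not take the observer down with it)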

DR SITE (Chittagong / separate facility):
  ├── Physical Standby Node 1
  ├── Physical Standby Node 2 (optional — standby RAC)
  ├── ASM Disk Group — separate SAN storage
  └── Data Guard Broker managing the configuration

BACKUP:
  └── RMAN backups to separate media (tape or cloud) — both sites

8. Layer 5 — Application Tier HA

Database HA without application tier HA achieves little — if the application server is a single point of failure, users are offline regardless of the database. Oracle WebLogic provides the application-tier clustering needed to complete the HA picture.

  • WebLogic cluster: Multiple managed servers hosting the same application — requests are load-balanced across all available nodes
  • Session replication: HTTP session state is replicated between cluster members — if a managed server dies, users are transparently redirected to another node without losing their session
  • Load balancer: Hardware or software (e.g., F5, HAProxy, Oracle HTTP Server) distributes incoming requests and detects failed servers
  • Admin Server separation: WebLogic Administration Server runs on a dedicated, lightweight machine — never on the same server as managed servers handling production load
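
The load balancer's role in this picture is to rotate requests across managed servers and skip any that fail health checks. The Python sketch below is a generic illustration of that behavior with hypothetical server names; it is not how WebLogic, F5, or HAProxy are implemented:

```python
from itertools import cycle


class RoundRobinBalancer:
    """Rotate through servers in order, skipping those marked unhealthy."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.healthy = set(self.servers)
        self._ring = cycle(self.servers)

    def mark_down(self, server):
        self.healthy.discard(server)

    def mark_up(self, server):
        if server in self.servers:
            self.healthy.add(server)

    def next_server(self):
        if not self.healthy:
            raise RuntimeError("no healthy servers available")
        # One full lap of the ring is guaranteed to reach a healthy server.
        for _ in range(len(self.servers)):
            candidate = next(self._ring)
            if candidate in self.healthy:
                return candidate
        raise RuntimeError("no healthy servers available")


lb = RoundRobinBalancer(["ms1", "ms2", "ms3"])
lb.mark_down("ms2")
print([lb.next_server() for _ in range(4)])
```

With one managed server marked down, traffic simply alternates between the remaining two, which is the behavior users should experience during a node failure.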

9. Capacity Planning — Sizing for the Future

Infrastructure sized for today's workload becomes inadequate within 18–24 months without proper growth planning. My capacity planning methodology uses a 3-year horizon with quarterly checkpoints:

9.1 Data Volume Growth

Measure current database size and transaction growth rate. ERP databases in active organizations typically grow 15–30% per year. Plan storage for 3× current size minimum.
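
The 3× guideline can be checked against compound growth. A short Python sketch of the arithmetic, using the 15% and 30% annual rates quoted above (the function name is illustrative):

```python
def projected_size(current_gb: float, annual_growth: float, years: int = 3) -> float:
    """Compound a starting size by a fixed annual growth rate."""
    return current_gb * (1 + annual_growth) ** years


for rate in (0.15, 0.30):
    factor = projected_size(1.0, rate)
    print(f"{rate:.0%} per year -> {factor:.2f}x after 3 years")
```

Even at the upper growth rate the database itself stays under 3× after three years, so the 3× figure leaves margin for things this calculation does not model, such as archived redo, backups on disk, and growth that arrives faster than planned.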

9.2 User and Transaction Growth

Map business growth plans to IT workload. A planned 40% increase in production volume means a proportional increase in ERP transactions, batch processing time, and database I/O. Size compute to handle peak load at 3× current transaction volume without performance degradation.

9.3 The N+1 Rule

In a clustered environment, always design so that if one node fails, the remaining nodes can handle 100% of the normal workload — not just survive, but perform acceptably. If you need two RAC nodes to handle normal load, deploy three — so a single node failure still leaves you with two nodes at full capacity.
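
The N+1 rule can be expressed as two small functions. This Python sketch assumes load is spread evenly across nodes and redistributes evenly after a failure; the utilization figures are hypothetical:

```python
import math


def nodes_required(normal_load: float, node_capacity: float, spares: int = 1) -> int:
    """Nodes needed to carry the normal load, plus spares for failure tolerance."""
    return math.ceil(normal_load / node_capacity) + spares


def utilization_after_failure(nodes: int, utilization_per_node: float) -> float:
    """Per-node utilization once one node is lost and its load is redistributed."""
    if nodes < 2:
        raise ValueError("a single node has no survivors to absorb the load")
    return utilization_per_node * nodes / (nodes - 1)


# Load equal to two nodes' worth of capacity calls for three nodes under N+1.
print(nodes_required(normal_load=2.0, node_capacity=1.0))

# Two nodes each at 65% leave the survivor overloaded after a failure.
print(f"{utilization_after_failure(2, 0.65):.0%}")
print(f"{utilization_after_failure(2, 0.45):.0%}")
```

The second function is a useful check on a two-node RAC cluster: each node has to run at 50% or less in normal operation if the survivor is expected to carry the full load alone.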

10. Documentation Deliverables

A complete IT infrastructure design engagement produces a set of documents that serve as the authoritative reference for the environment:

  • High-Level Design (HLD): Architecture diagram, technology decisions, and design rationale — suitable for management review and approval
  • Low-Level Design (LLD): Detailed specifications — IP address scheme, VLAN assignments, storage layout, Oracle parameters, RAC configuration, Data Guard setup
  • Bill of Materials (BOM): Hardware, software licenses, and network equipment required — with vendor-neutral specifications so procurement can get competitive quotes
  • Implementation Runbook: Step-by-step installation and configuration procedures — reproducible, tested, signed off
  • DR Runbook: Failover and failback procedures — tested, timed, and ready for the moment they are needed
  • Capacity Review Schedule: Quarterly metrics to review against growth plan

11. Common Infrastructure Mistakes I See in Bangladesh Organizations

After working across multiple industries in Bangladesh and internationally, these are the infrastructure decisions I see causing the most problems:

  • Single server for ERP database: No redundancy at any level — one hardware failure takes everything offline
  • Backup to same server: The backup is useless if the server with the backup is the one that fails or burns
  • No DR site: "We have backups" is not a DR strategy — restoring from backup takes hours; failing over to a standby takes minutes
  • Undersized storage IOPS: Enough disk space but the disks are too slow for the transaction rate — performance degrades under load
  • Single internet connection: Internet uptime is critical for remote access, cloud services, and partner integrations — a second provider costs less than an hour of downtime
  • Untested DR: A DR plan that has never been tested is not a DR plan — it is a hope. Test failover at minimum annually, ideally every 6 months
  • No capacity monitoring: Storage fills up, CPU hits 95%, response times degrade — and nobody notices until it is already a problem

12. Where to Start: The Infrastructure Assessment

Before designing new infrastructure, every engagement I take on begins with an honest assessment of the current state:

  • What are the single points of failure in the current environment?
  • What are the RTO and RPO requirements — and do they match what the current infrastructure can actually deliver?
  • What growth is planned in the next 3 years — new sites, new users, new systems?
  • What are the compliance and regulatory requirements that drive specific architecture decisions?
  • What budget envelope exists — and how do we prioritize the highest-risk gaps first?

Good infrastructure design is not about implementing every best practice simultaneously. It is about understanding the risk profile of the organization and progressively closing the highest-risk gaps in order of business impact.

Final Thoughts

Infrastructure design is one of the highest-leverage investments an organization can make. A well-designed environment runs quietly in the background, invisibly supporting every business operation. A poorly designed one becomes a recurring source of crisis, cost, and reputational risk.

The organizations I have worked with that invest in proper HA + DR architecture rarely call me in an emergency. The ones that delay that investment — because it feels expensive until the moment it is desperately needed — end up spending far more on emergency recovery, lost production, and business impact than the design would have cost in the first place.

If your organization is running critical systems on infrastructure that has never been properly designed for availability and recovery, the best time to address that is before the failure happens.

Nasir Uddin Khan — Oracle DBA Consultant

About the Author

Nasir Uddin Khan
Senior IT Consultant · Oracle DBA · ERP & AI Specialist
OCP · Red Hat Certified · MBA · CSV · 18+ Years Experience

Nasir is an Oracle Certified Professional and CSV-certified IT consultant based in Dhaka, Bangladesh. He has 18+ years of hands-on experience in Oracle database administration (RAC, Data Guard, RMAN), WebLogic middleware, ERP system design, and AI integration for manufacturing, pharmaceutical, banking, and healthcare organisations worldwide.

References & Further Reading

This article is based on 18+ years of enterprise IT infrastructure design experience and Oracle's Maximum Availability Architecture documentation.
