Part 2 - MongoDB Replication, Security & User Management: Building Secure, Highly Available Systems

Introduction

A database that’s down is a business that’s down. In Part 1, we covered operations fundamentals: monitoring, performance tuning, and index optimization. But fast queries don’t matter if your database is offline, and great performance is meaningless if your data is compromised.

This is Part 2 of our MongoDB operations series, focusing on reliability and security: building highly available systems with replica sets, scaling horizontally with sharding, hardening security, and implementing role-based access control.

These aren’t optional “nice-to-haves” for production systems. They’re fundamental requirements. Let’s dive in.

Replication and High Availability

MongoDB replica sets provide automatic failover, data redundancy, and read scalability. A properly configured replica set keeps your application running even when hardware fails.

Replica Set Architecture

A replica set consists of multiple MongoDB instances maintaining copies of the same data:

  • Primary: Accepts all write operations and replicates them to secondaries
  • Secondaries: Maintain data copies and can serve reads (with proper read preference)
  • Arbiter (optional): Participates in elections but doesn’t hold data - generally not recommended for production

The typical production configuration is a 3-node replica set: one primary and two secondaries. This tolerates one node failure while maintaining quorum for elections.

Why not use arbiters? They seem cheaper - no data storage required. But they create operational complexity: you need separate deployment infrastructure for a non-data-bearing node, and they don’t improve read capacity or provide backup sources. Three data-bearing nodes are simpler and more useful.

Configuring a Replica Set

Initialize a replica set:

rs.initiate({
  _id: "rs0",
  members: [
    { _id: 0, host: "mongo1.example.com:27017", priority: 2 },
    { _id: 1, host: "mongo2.example.com:27017", priority: 1 },
    { _id: 2, host: "mongo3.example.com:27017", priority: 1 }
  ]
})

The priority field determines election preference. Higher-priority nodes are preferred as primary. Assign higher priority to your most capable hardware (fastest disks, most RAM).

Add a member dynamically:

rs.add("mongo4.example.com:27017")

// Add with specific configuration
rs.add({
  host: "mongo4.example.com:27017",
  priority: 0,
  hidden: true,
  buildIndexes: true
})

Remove a member:

rs.remove("mongo4.example.com:27017")

Special Member Types

Beyond standard secondaries, MongoDB supports special-purpose members:

Hidden members don’t appear to applications (they’re excluded from read preference queries). They’re perfect for analytics workloads or backup sources - processing that might slow down a node shouldn’t impact production reads.

cfg = rs.conf()
cfg.members[2].priority = 0
cfg.members[2].hidden = true
rs.reconfig(cfg)

Delayed members replicate with an intentional lag, providing protection against human error. If someone accidentally drops a collection, you have a time window to recover from the delayed member before it replicates the mistake.

cfg = rs.conf()
cfg.members[2].priority = 0
cfg.members[2].hidden = true
cfg.members[2].secondaryDelaySecs = 3600  // 1 hour delay (use slaveDelay on MongoDB 4.4 and earlier)
rs.reconfig(cfg)

The delayed member should be hidden (users shouldn’t read stale data) with priority 0 (it should never become primary).

Monitoring Replication Health

Check replica set status:

rs.status()

This shows each member’s state (PRIMARY, SECONDARY, RECOVERING), current optime, and replication lag. Run this first when troubleshooting replication issues.

Monitor replication lag:

rs.printReplicationInfo()        // Primary oplog info
rs.printSecondaryReplicationInfo() // Secondary lag

Replication lag is the time difference between the primary’s latest operation and what secondaries have replicated. Lag under 1 second is excellent; lag over 30 seconds indicates problems.
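
As a quick check, you can compute per-member lag directly from the rs.status() output; a minimal mongosh sketch:

// Print each secondary's lag relative to the current primary
const status = rs.status()
const primary = status.members.find(m => m.stateStr === "PRIMARY")
status.members
  .filter(m => m.stateStr === "SECONDARY")
  .forEach(m => {
    const lagSeconds = (primary.optimeDate - m.optimeDate) / 1000
    print(`${m.name}: ${lagSeconds}s behind the primary`)
  })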

Common causes of replication lag:

  1. Network issues: Slow or unreliable network between nodes
  2. Secondary overload: Insufficient CPU, disk I/O, or RAM on secondary
  3. Long-running operations: Large bulk writes blocking replication
  4. Oplog too small: Oplog fills before secondary catches up

Check oplog size and window:

use local
db.oplog.rs.stats()

// Check how much time the oplog covers
db.oplog.rs.find().sort({$natural: 1}).limit(1).pretty()  // Oldest entry
db.oplog.rs.find().sort({$natural: -1}).limit(1).pretty() // Newest entry

The oplog is a capped collection storing all write operations. If a secondary is offline longer than the oplog window, it can’t catch up through normal replication - you’ll need to resync from a snapshot.

A good rule: your oplog should cover at least 24 hours of normal write load. This gives you time to fix a failed secondary before it requires a full resync.
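
If the window is too short, the oplog can be resized online (MongoDB 4.0+), one member at a time; the size is specified in megabytes:

// Resize this member's oplog to 16 GB (value is in MB)
db.adminCommand({ replSetResizeOplog: 1, size: 16384 })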

Failover and Elections

When the primary fails, replica set members hold an election to choose a new primary. Elections are usually fast (seconds), but understanding the process helps you architect for reliability.

Election factors:

  • Priority: Higher priority members preferred
  • Data freshness: Member with most recent data preferred
  • Network connectivity: Member must reach a majority of voting members
  • Voting configuration: Max 7 voting members (you can have more than 7 total members)

Force a primary to step down:

rs.stepDown(60)  // Step down for 60 seconds

This is useful for planned maintenance. The primary steps down, triggering an election, and won’t try to become primary for the specified duration.
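
Relatedly, rs.freeze() makes a specific member ineligible for election for a set period - useful when you want the new primary to land on a particular node:

// Run on a secondary you do NOT want elected during maintenance
rs.freeze(120)  // ineligible to become primary for 120 seconds
rs.freeze(0)    // unfreeze early if plans change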

Handle a stuck election:

If members can’t elect a primary (usually due to network partitions), you may need manual intervention:

// Reconfigure with surviving members only
cfg = rs.conf()
cfg.members = [/* only reachable members */]
rs.reconfig(cfg, {force: true})

Force reconfig is dangerous - use only when you’re certain about network topology and data consistency.

Write Concerns and Read Preferences

These settings balance consistency, durability, and performance.

Write concerns determine when a write is acknowledged:

// Wait for majority of nodes
db.collection.insertOne(doc, {
  writeConcern: { w: "majority", j: true }
})

// Fast writes (only primary acknowledgment)
db.collection.insertOne(doc, {
  writeConcern: { w: 1, j: false }
})
  • w: "majority" ensures data survives node failures
  • j: true ensures data is written to on-disk journal
  • Most applications should use w: "majority", j: true for critical data

Read preferences route reads to appropriate members:

  • primary: All reads from primary (default, strongest consistency)
  • secondary: Read from secondaries only (reduce primary load)
  • nearest: Lowest network latency (good for geo-distributed apps)

// Route analytics queries to secondaries
db.collection.find().readPref("secondary")

Read concerns specify consistency level:

// Read majority-committed data
db.collection.find().readConcern("majority")

  • local: Latest data on node (default, may be rolled back)
  • majority: Data acknowledged by majority (survives rollbacks)
  • linearizable: Read-your-writes guarantee (strongest consistency)
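
If most operations share the same requirements, you can set cluster-wide defaults (MongoDB 4.4+) instead of repeating them on every call; a minimal sketch:

// Cluster-wide defaults; individual operations can still override them
db.adminCommand({
  setDefaultRWConcern: 1,
  defaultWriteConcern: { w: "majority" },
  defaultReadConcern: { level: "majority" }
})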

Sharding for Horizontal Scale

When vertical scaling hits limits (you can’t buy bigger servers), sharding distributes data across multiple machines. But sharding adds complexity - only shard when you need to.

When to Shard

Consider sharding when:

  • Working set exceeds RAM: Your indexes and frequently-accessed data don’t fit in memory
  • Throughput hits limits: Single server can’t handle request volume
  • Storage exceeds capacity: Single server can’t hold your data
  • Geographic distribution needed: You want data close to users globally

Don’t shard too early. A well-tuned single replica set can handle terabytes of data and tens of thousands of operations per second.

Shard Key Selection: Make or Break

The shard key determines how data distributes across shards. A poor shard key creates hot shards, uneven distribution, and terrible performance. A good shard key enables linear scalability.

Critical factors:

  1. High cardinality: Many unique values (millions, not dozens)
  2. Low frequency: Values evenly distributed (not 80% of data in one value)
  3. Non-monotonic: Not always-increasing (avoids hot shard for new data)
  4. Query isolation: Shard key appears in most queries (enables targeted queries)

Good shard keys:

// User ID (if queries are user-specific)
{ user_id: 1 }

// Compound key for geographic distribution
{ country: 1, user_id: 1 }

// Hashed key for even distribution
{ user_id: "hashed" }

Bad shard keys:

// Monotonically increasing - all writes to one shard
{ timestamp: 1 }
{ _id: 1 }

// Low cardinality - few unique values
{ status: 1 }

// High frequency - uneven distribution
{ country: 1 }  // If 80% of users in one country

The timestamp problem: New data always goes to the highest shard key value. With timestamps or ObjectIds, this means all writes hit one “hot” shard while others sit idle. Solution: use hashed shard keys or combine with another field.

Sharded Cluster Architecture

A sharded cluster has three components:

  • Config servers: 3-node replica set storing cluster metadata
  • Shard servers: Replica sets storing your data (each shard is a replica set)
  • mongos routers: Query routers directing requests to appropriate shards

Applications connect to mongos routers, not directly to shards. Mongos determines which shards hold requested data and routes queries accordingly.
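
In practice, the application’s connection string lists the mongos routers rather than individual shard members (hostnames below are placeholders):

mongodb://app_user:password@mongos1.example.com:27017,mongos2.example.com:27017/mydb?authSource=admin&tls=true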

Enable sharding:

// Connect to mongos
sh.enableSharding("mydb")

// Shard a collection
sh.shardCollection("mydb.users", { user_id: 1 })

// Shard with hashed key
sh.shardCollection("mydb.events", { _id: "hashed" })

Add shards:

sh.addShard("shard01/mongo1:27017,mongo2:27017,mongo3:27017")
sh.addShard("shard02/mongo4:27017,mongo5:27017,mongo6:27017")

The Balancer

The balancer moves chunks (ranges of shard key values) between shards to maintain even distribution. It runs automatically but you can control when it runs.

// Check balancer status
sh.getBalancerState()

// Schedule balancer to run only during maintenance window
use config
db.settings.updateOne(
  { _id: "balancer" },
  { $set: { activeWindow: { start: "23:00", stop: "06:00" } } },
  { upsert: true }
)

Balancing impacts performance (it moves data across the network), so schedule it during low-traffic periods.
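
You can also pause balancing entirely around backups or bulk data loads and resume it afterwards:

sh.stopBalancer()       // disables balancing; waits for any in-progress migration
sh.isBalancerRunning()  // confirm no migrations are active
sh.startBalancer()      // re-enable once maintenance is done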

Zone Sharding

Zone sharding lets you control data placement geographically or logically.

// Assign shards to zones
sh.addShardToZone("shard01", "US")
sh.addShardToZone("shard02", "EU")

// Route data by shard key range
sh.updateZoneKeyRange(
  "mydb.users",
  { country: "US" },
  { country: "US\uffff" },
  "US"
)

This ensures US users’ data stays on US-located shards, reducing latency and meeting data residency requirements. Note that zone ranges are defined on the shard key, so this example assumes mydb.users is sharded with country as the key’s prefix (for example { country: 1, user_id: 1 }) rather than on user_id alone.
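
You can confirm zone assignments and ranges from the cluster metadata:

sh.status()   // zones and their ranges appear in the sharding status output
use config
db.shards.find({}, { _id: 1, tags: 1 })  // which zones each shard belongs to
db.tags.find({ ns: "mydb.users" })       // zone key ranges per namespace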

Security Best Practices

An unsecured database is an incident waiting to happen. MongoDB security is layered: authentication, authorization, network security, and encryption.

Enable Authentication

By default, MongoDB accepts connections without authentication (for ease of initial setup). This is catastrophic in production.

Enable auth in mongod.conf:

security:
  authorization: enabled

Create the first admin user:

use admin
db.createUser({
  user: "admin",
  pwd: "secure_password",
  roles: [
    { role: "userAdminAnyDatabase", db: "admin" },
    { role: "readWriteAnyDatabase", db: "admin" }
  ]
})

Once authorization is enabled and mongod restarted, every connection must authenticate.
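
From then on, connect with explicit credentials; for example, from the command line (mongosh prompts for the password when it isn’t supplied):

mongosh "mongodb://mongo1.example.com:27017" --username admin --authenticationDatabase admin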

TLS/SSL Encryption

Always encrypt data in transit. MongoDB supports TLS for client connections and inter-node communication.

net:
  tls:
    mode: requireTLS
    certificateKeyFile: /path/to/mongodb.pem
    CAFile: /path/to/ca.pem

Obtain certificates from a trusted CA or use Let’s Encrypt. Self-signed certificates work for testing but complicate production operations.
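
Clients then connect over TLS; for example with mongosh (hostname and paths are placeholders):

mongosh "mongodb://mongo1.example.com:27017" --tls --tlsCAFile /path/to/ca.pem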

Network Security

Bind to specific IPs only:

net:
  bindIp: 127.0.0.1,10.0.0.5
  port: 27017

Never bind to 0.0.0.0 in production - this accepts connections from any network interface.

Firewall rules:

# Allow MongoDB port only from application servers
iptables -A INPUT -p tcp -s 10.0.0.0/24 --dport 27017 -j ACCEPT
iptables -A INPUT -p tcp --dport 27017 -j DROP

Security groups (AWS), firewall rules (GCP), or network policies (Kubernetes) should restrict MongoDB ports to only known clients.

Encryption at Rest

MongoDB Enterprise supports native encryption:

security:
  enableEncryption: true
  encryptionKeyFile: /path/to/keyfile

For MongoDB Community, use disk encryption:

# Linux LUKS example
cryptsetup luksFormat /dev/sdb
cryptsetup open /dev/sdb mongodb_encrypted
mkfs.ext4 /dev/mapper/mongodb_encrypted

Cloud providers offer encrypted volumes (AWS EBS encryption, GCP persistent disk encryption). Enable these for databases storing sensitive data.

Security Checklist

  • ✅ Authentication enabled
  • ✅ TLS for all connections
  • ✅ Network access restricted by firewall
  • ✅ Encryption at rest enabled
  • ✅ Audit logging configured (Enterprise)
  • ✅ Regular security updates applied
  • ✅ Strong passwords (15+ characters)
  • ✅ Credentials in secrets manager, not code
  • ✅ Regular security audits scheduled

User Management and RBAC

MongoDB uses role-based access control (RBAC) to govern who can do what. Good user management follows the principle of least privilege: users get only the permissions they need.

Understanding Roles

MongoDB has built-in roles for common patterns and supports custom roles for specific needs.

Database roles:

  • read: Query data
  • readWrite: Query and modify data
  • dbAdmin: Manage schema (indexes, collections)
  • userAdmin: Manage users and roles
  • dbOwner: All database privileges

Cluster roles:

  • clusterMonitor: View cluster statistics (for monitoring systems)
  • clusterManager: Manage replica sets and sharding
  • clusterAdmin: Full cluster administration
  • backup / restore: Two separate roles covering backup and restore operations

All-database roles:

  • readAnyDatabase: Read from all databases
  • readWriteAnyDatabase: Read and write to all databases
  • root: Complete access (avoid for regular operations)

Creating Application Users

Application user (read/write access to specific database):

use myDatabase
db.createUser({
  user: "app_user",
  pwd: passwordPrompt(),
  roles: [
    { role: "readWrite", db: "myDatabase" }
  ]
})

Use passwordPrompt() to avoid passwords in shell history or logs.

Read-only analytics user:

use admin
db.createUser({
  user: "analytics_user",
  pwd: passwordPrompt(),
  roles: [
    { role: "read", db: "sales" },
    { role: "read", db: "customers" }
  ]
})

Monitoring and Backup Users

Monitoring user (for Prometheus, Datadog, etc.):

use admin
db.createUser({
  user: "monitoring",
  pwd: passwordPrompt(),
  roles: [
    { role: "clusterMonitor", db: "admin" },
    { role: "read", db: "local" },
    { role: "read", db: "config" }
  ]
})

Backup user:

use admin
db.createUser({
  user: "backup_user",
  pwd: passwordPrompt(),
  roles: [
    { role: "backup", db: "admin" },
    { role: "restore", db: "admin" }
  ]
})

Custom Roles for Fine-Grained Control

When built-in roles are too broad, create custom roles:

use admin
db.createRole({
  role: "orderProcessor",
  privileges: [
    {
      resource: { db: "ecommerce", collection: "orders" },
      actions: [ "find", "insert", "update" ]
    },
    {
      resource: { db: "ecommerce", collection: "products" },
      actions: [ "find" ]
    }
  ],
  roles: []
})

This user can read/write orders and read products, but nothing else. Principle of least privilege in action.
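
To put the role to work, assign it to a service account (the user name below is illustrative):

use admin
db.createUser({
  user: "order_service",
  pwd: passwordPrompt(),
  roles: [
    { role: "orderProcessor", db: "admin" }
  ]
})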

Managing Users

Grant additional roles:

db.grantRolesToUser(
  "username",
  [ { role: "read", db: "newDatabase" } ]
)

Revoke roles:

db.revokeRolesFromUser(
  "username",
  [ { role: "readWrite", db: "oldDatabase" } ]
)

Change password:

db.changeUserPassword("username", passwordPrompt())

Delete user:

db.dropUser("username")

User Management Best Practices

Naming conventions help organization:

{service}_{environment}_{permission}

Examples:
- webapp_prod_rw
- analytics_prod_ro
- backup_prod
- monitoring_prod

Connection string format:

mongodb://username:password@host1:27017,host2:27017/database?replicaSet=rs0&authSource=admin&tls=true

Always specify authSource (the database where the user was created, typically admin).

Regular audits:

Monthly:

  • Review all users and their roles
  • Remove unused accounts
  • Verify least privilege
  • Check for shared credentials

Quarterly:

  • Rotate passwords
  • Review custom roles
  • Validate connection strings
  • Test backup user credentials
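
A quick way to pull every user and its roles for these reviews (run as a user administrator against the admin database):

use admin
db.runCommand({ usersInfo: { forAllDBs: true } })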

Conclusion and Next Steps

High availability and security aren’t bolt-on features - they’re fundamental architecture decisions. Replica sets provide failover and redundancy. Sharding enables horizontal scale. Authentication, authorization, and encryption protect your data. Role-based access control limits damage from compromised credentials.

Key takeaways:

  • Replica sets are mandatory: Three-node replica sets with proper write concerns provide reliability
  • Shard key matters: Spend time designing shard keys; a poor choice haunts you forever
  • Security is layered: Authentication + TLS + firewalls + encryption + RBAC
  • Least privilege works: Users get only necessary permissions, nothing more
  • Monitor everything: Replication lag, balancer activity, failed logins

In Part 3 (final article), we’ll bring everything together for production deployment on Kubernetes: StatefulSets, MongoDB operators, automated backups with CronJobs, and the cloud-native patterns that make MongoDB reliable at scale.

For now, review your replica set configuration, audit your user permissions, and verify your security posture. Your database’s reliability and security depend on these fundamentals.

Kevin Duane

Cloud architect and developer sharing practical solutions.