Introduction
A database that’s down is a business that’s down. In Part 1, we covered operations fundamentals: monitoring, performance tuning, and index optimization. But fast queries don’t matter if your database is offline, and great performance is meaningless if your data is compromised.
This is Part 2 of our MongoDB operations series, focusing on reliability and security: building highly available systems with replica sets, scaling horizontally with sharding, hardening security, and implementing role-based access control.
These aren’t optional “nice-to-haves” for production systems. They’re fundamental requirements. Let’s dive in.
Replication and High Availability
MongoDB replica sets provide automatic failover, data redundancy, and read scalability. A properly configured replica set keeps your application running even when hardware fails.
Replica Set Architecture
A replica set consists of multiple MongoDB instances maintaining copies of the same data:
- Primary: Accepts all write operations and replicates them to secondaries
- Secondaries: Maintain data copies and can serve reads (with proper read preference)
- Arbiter (optional): Participates in elections but doesn’t hold data - generally not recommended for production
The typical production configuration is a 3-node replica set: one primary and two secondaries. This tolerates one node failure while maintaining quorum for elections.
Why not use arbiters? They seem cheaper - no data storage required. But they create operational complexity: you need separate deployment infrastructure for a non-data-bearing node, and they don’t improve read capacity or provide backup sources. Three data-bearing nodes are simpler and more useful.
Configuring a Replica Set
Initialize a replica set:
rs.initiate({
_id: "rs0",
members: [
{ _id: 0, host: "mongo1.example.com:27017", priority: 2 },
{ _id: 1, host: "mongo2.example.com:27017", priority: 1 },
{ _id: 2, host: "mongo3.example.com:27017", priority: 1 }
]
})
The priority field determines election preference. Higher priority nodes are preferred as primary. Set your most capable hardware (fastest disks, most RAM) with higher priority.
Add a member dynamically:
rs.add("mongo4.example.com:27017")
// Add with specific configuration
rs.add({
host: "mongo4.example.com:27017",
priority: 0,
hidden: true,
buildIndexes: true
})
Remove a member:
rs.remove("mongo4.example.com:27017")
Special Member Types
Beyond standard secondaries, MongoDB supports special-purpose members:
Hidden members don’t appear to applications (they’re excluded from read preference queries). They’re perfect for analytics workloads or backup sources - processing that might slow down a node shouldn’t impact production reads.
cfg = rs.conf()
cfg.members[2].priority = 0
cfg.members[2].hidden = true
rs.reconfig(cfg)
Delayed members replicate with an intentional lag, providing protection against human error. If someone accidentally drops a collection, you have a time window to recover from the delayed member before it replicates the mistake.
cfg = rs.conf()
cfg.members[2].priority = 0
cfg.members[2].hidden = true
cfg.members[2].secondaryDelaySecs = 3600 // 1 hour delay (slaveDelay on MongoDB 4.x and earlier)
rs.reconfig(cfg)
The delayed member should be hidden (users shouldn’t read stale data) with priority 0 (it should never become primary).
Monitoring Replication Health
Check replica set status:
rs.status()
This shows each member’s state (PRIMARY, SECONDARY, RECOVERING), current optime, and replication lag. Run this first when troubleshooting replication issues.
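If you want a quick per-member view, you can derive the lag directly from the rs.status() output. A minimal mongosh sketch (field names as returned by rs.status() on recent versions):
// Print each member's state and its lag behind the primary
const status = rs.status()
const primary = status.members.find(m => m.stateStr === "PRIMARY")
status.members.forEach(m => {
  const lagSeconds = (primary.optimeDate - m.optimeDate) / 1000
  print(`${m.name}  ${m.stateStr}  lag: ${lagSeconds}s`)
})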
Monitor replication lag:
rs.printReplicationInfo() // Primary oplog info
rs.printSecondaryReplicationInfo() // Secondary lag
Replication lag is the time difference between the primary’s latest operation and what secondaries have replicated. Lag under one second is excellent; lag over 30 seconds indicates a problem.
Common causes of replication lag:
- Network issues: Slow or unreliable network between nodes
- Secondary overload: Insufficient CPU, disk I/O, or RAM on secondary
- Long-running operations: Large bulk writes blocking replication
- Oplog too small: Oplog fills before secondary catches up
Check oplog size and window:
use local
db.oplog.rs.stats()
// Check how much time the oplog covers
db.oplog.rs.find().sort({$natural: 1}).limit(1).pretty() // Oldest entry
db.oplog.rs.find().sort({$natural: -1}).limit(1).pretty() // Newest entry
The oplog is a capped collection storing all write operations. If a secondary is offline longer than the oplog window, it can’t catch up through normal replication - you’ll need to resync from a snapshot.
A good rule: your oplog should cover at least 24 hours of normal write load. This gives you time to fix a failed secondary before it requires a full resync.
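As a rough sketch, you can estimate the current window yourself from the oldest and newest oplog entries (this assumes the wall clock timestamp field present in modern oplog entries):
// Estimate the oplog window in hours from wall-clock timestamps
const oplog = db.getSiblingDB("local").getCollection("oplog.rs")
const oldest = oplog.find().sort({ $natural: 1 }).limit(1).next()
const newest = oplog.find().sort({ $natural: -1 }).limit(1).next()
const windowHours = (newest.wall - oldest.wall) / (1000 * 60 * 60)
print(`Oplog window: ~${windowHours.toFixed(1)} hours`)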
Failover and Elections
When the primary fails, replica set members hold an election to choose a new primary. Elections are usually fast (seconds), but understanding the process helps you architect for reliability.
Election factors:
- Priority: Higher priority members preferred
- Data freshness: Member with most recent data preferred
- Network connectivity: Member must reach a majority of voting members
- Voting configuration: Max 7 voting members (you can have more than 7 total members)
Force a primary to step down:
rs.stepDown(60) // Step down for 60 seconds
This is useful for planned maintenance. The primary steps down, triggering an election, and won’t try to become primary for the specified duration.
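If you also need to keep a specific secondary from being elected during maintenance, rs.freeze() makes it ineligible for a set period:
rs.freeze(120) // This member won't seek election for 120 seconds
rs.freeze(0)   // Unfreeze early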
Handle a stuck election:
If members can’t elect a primary (usually due to network partitions), you may need manual intervention:
// Reconfigure with surviving members only
cfg = rs.conf()
cfg.members = [/* only reachable members */]
rs.reconfig(cfg, {force: true})
Force reconfig is dangerous - use only when you’re certain about network topology and data consistency.
Write Concerns and Read Preferences
These settings balance consistency, durability, and performance.
Write concerns determine when a write is acknowledged:
// Wait for majority of nodes
db.collection.insertOne(doc, {
writeConcern: { w: "majority", j: true }
})
// Fast writes (only primary acknowledgment)
db.collection.insertOne(doc, {
writeConcern: { w: 1, j: false }
})
w: "majority"ensures data survives node failuresj: trueensures data is written to on-disk journal- Most applications should use
w: "majority", j: truefor critical data
Read preferences route reads to appropriate members:
- primary: All reads from primary (default, strongest consistency)
- secondary: Read from secondaries only (reduce primary load)
- nearest: Lowest network latency (good for geo-distributed apps)
// Route analytics queries to secondaries
db.collection.find().readPref("secondary")
Read concerns specify consistency level:
// Read majority-committed data
db.collection.find().readConcern("majority")
- local: Latest data on node (default, may be rolled back)
- majority: Data acknowledged by majority (survives rollbacks)
- linearizable: Read-your-writes guarantee (strongest consistency)
Sharding for Horizontal Scale
When vertical scaling hits limits (you can’t buy bigger servers), sharding distributes data across multiple machines. But sharding adds complexity - only shard when you need to.
When to Shard
Consider sharding when:
- Working set exceeds RAM: Your indexes and frequently-accessed data don’t fit in memory
- Throughput hits limits: Single server can’t handle request volume
- Storage exceeds capacity: Single server can’t hold your data
- Geographic distribution needed: You want data close to users globally
Don’t shard too early. A well-tuned single replica set can handle terabytes of data and tens of thousands of operations per second.
Shard Key Selection: Make or Break
The shard key determines how data distributes across shards. A poor shard key creates hot shards, uneven distribution, and terrible performance. A good shard key enables linear scalability.
Critical factors:
- High cardinality: Many unique values (millions, not dozens)
- Low frequency: Values evenly distributed (not 80% of data in one value)
- Non-monotonic: Not always-increasing (avoids hot shard for new data)
- Query isolation: Shard key appears in most queries (enables targeted queries)
Good shard keys:
// User ID (if queries are user-specific)
{ user_id: 1 }
// Compound key for geographic distribution
{ country: 1, user_id: 1 }
// Hashed key for even distribution
{ user_id: "hashed" }
Bad shard keys:
// Monotonically increasing - all writes to one shard
{ timestamp: 1 }
{ _id: 1 }
// Low cardinality - few unique values
{ status: 1 }
// High frequency - uneven distribution
{ country: 1 } // If 80% of users in one country
The timestamp problem: New data always goes to the highest shard key value. With timestamps or ObjectIds, this means all writes hit one “hot” shard while others sit idle. Solution: use hashed shard keys or combine with another field.
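For example, a compound key that leads with a well-distributed field spreads inserts across shards while keeping related documents together (the collection and field names below are hypothetical):
// Writes distribute across customers; each customer's events stay together
sh.shardCollection("mydb.clickstream", { customer_id: 1, timestamp: 1 })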
Sharded Cluster Architecture
A sharded cluster has three components:
- Config servers: 3-node replica set storing cluster metadata
- Shard servers: Replica sets storing your data (each shard is a replica set)
- mongos routers: Query routers directing requests to appropriate shards
Applications connect to mongos routers, not directly to shards. Mongos determines which shards hold requested data and routes queries accordingly.
Enable sharding:
// Connect to mongos
sh.enableSharding("mydb")
// Shard a collection
sh.shardCollection("mydb.users", { user_id: 1 })
// Shard with hashed key
sh.shardCollection("mydb.events", { _id: "hashed" })
Add shards:
sh.addShard("shard01/mongo1:27017,mongo2:27017,mongo3:27017")
sh.addShard("shard02/mongo4:27017,mongo5:27017,mongo6:27017")
The Balancer
The balancer moves chunks (ranges of shard key values) between shards to maintain even distribution. It runs automatically but you can control when it runs.
// Check balancer status
sh.getBalancerState()
// Schedule balancer to run only during maintenance window
use config
db.settings.updateOne(
  { _id: "balancer" },
  { $set: { activeWindow: { start: "23:00", stop: "06:00" } } },
  { upsert: true }
)
Balancing impacts performance (it moves data across the network), so schedule it during low-traffic periods.
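For one-off maintenance (say, taking a backup from a secondary), you can also pause and resume balancing entirely:
sh.stopBalancer()       // Disable balancing; waits for in-flight migrations
sh.isBalancerRunning()  // Verify no migration is currently in progress
sh.startBalancer()      // Re-enable when maintenance is done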
Zone Sharding
Zone sharding lets you control data placement geographically or logically.
// Assign shards to zones
sh.addShardToZone("shard01", "US")
sh.addShardToZone("shard02", "EU")
// Route data by shard key range
sh.updateZoneKeyRange(
"mydb.users",
{ country: "US" },
{ country: "US\uffff" },
"US"
)
This ensures US users’ data stays on US-located shards, reducing latency and meeting data residency requirements.
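To verify zone assignments, you can inspect the cluster metadata; a quick sketch against the config database:
// Zone ranges live in config.tags; shard-to-zone assignments in config.shards
db.getSiblingDB("config").tags.find().pretty()
db.getSiblingDB("config").shards.find({}, { _id: 1, tags: 1 })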
Security Best Practices
An unsecured database is an incident waiting to happen. MongoDB security is layered: authentication, authorization, network security, and encryption.
Enable Authentication
By default, MongoDB accepts connections without authentication (for ease of initial setup). This is catastrophic in production.
Enable auth in mongod.conf:
security:
authorization: enabled
Create the first admin user:
use admin
db.createUser({
user: "admin",
pwd: "secure_password",
roles: [
{ role: "userAdminAnyDatabase", db: "admin" },
{ role: "readWriteAnyDatabase", db: "admin" }
]
})
After enabling authentication and restarting MongoDB, all connections require credentials.
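Once authentication is enforced, authenticate in mongosh before running administrative commands:
use admin
db.auth("admin", passwordPrompt())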
TLS/SSL Encryption
Always encrypt data in transit. MongoDB supports TLS for client connections and inter-node communication.
net:
tls:
mode: requireTLS
certificateKeyFile: /path/to/mongodb.pem
CAFile: /path/to/ca.pem
Obtain certificates from a trusted CA or use Let’s Encrypt. Self-signed certificates work for testing but complicate production operations.
Network Security
Bind to specific IPs only:
net:
bindIp: 127.0.0.1,10.0.0.5
port: 27017
Never bind to 0.0.0.0 in production - this accepts connections from any network interface.
Firewall rules:
# Allow MongoDB port only from application servers
iptables -A INPUT -p tcp -s 10.0.0.0/24 --dport 27017 -j ACCEPT
iptables -A INPUT -p tcp --dport 27017 -j DROP
Security groups (AWS), firewall rules (GCP), or network policies (Kubernetes) should restrict MongoDB ports to only known clients.
Encryption at Rest
MongoDB Enterprise supports native encryption:
security:
enableEncryption: true
encryptionKeyFile: /path/to/keyfile
For MongoDB Community, use disk encryption:
# Linux LUKS example
cryptsetup luksFormat /dev/sdb
cryptsetup open /dev/sdb mongodb_encrypted
mkfs.ext4 /dev/mapper/mongodb_encrypted
Cloud providers offer encrypted volumes (AWS EBS encryption, GCP persistent disk encryption). Enable these for databases storing sensitive data.
Security Checklist
- ✅ Authentication enabled
- ✅ TLS for all connections
- ✅ Network access restricted by firewall
- ✅ Encryption at rest enabled
- ✅ Audit logging configured (Enterprise)
- ✅ Regular security updates applied
- ✅ Strong passwords (15+ characters)
- ✅ Credentials in secrets manager, not code
- ✅ Regular security audits scheduled
User Management and RBAC
MongoDB uses role-based access control (RBAC) to govern who can do what. Good user management follows the principle of least privilege: users get only the permissions they need.
Understanding Roles
MongoDB has built-in roles for common patterns and supports custom roles for specific needs.
Database roles:
- read: Query data
- readWrite: Query and modify data
- dbAdmin: Manage schema (indexes, collections)
- userAdmin: Manage users and roles
- dbOwner: All database privileges
Cluster roles:
- clusterMonitor: View cluster statistics (for monitoring systems)
- clusterManager: Manage replica sets and sharding
- clusterAdmin: Full cluster administration
- backup / restore: Backup and restore operations
All-database roles:
- readAnyDatabase: Read from all databases
- readWriteAnyDatabase: Read and write to all databases
- root: Complete access (avoid for regular operations)
Creating Application Users
Application user (read/write access to specific database):
use myDatabase
db.createUser({
user: "app_user",
pwd: passwordPrompt(),
roles: [
{ role: "readWrite", db: "myDatabase" }
]
})
Use passwordPrompt() to avoid passwords in shell history or logs.
Read-only analytics user:
use admin
db.createUser({
user: "analytics_user",
pwd: passwordPrompt(),
roles: [
{ role: "read", db: "sales" },
{ role: "read", db: "customers" }
]
})
Monitoring and Backup Users
Monitoring user (for Prometheus, Datadog, etc.):
use admin
db.createUser({
user: "monitoring",
pwd: passwordPrompt(),
roles: [
{ role: "clusterMonitor", db: "admin" },
{ role: "read", db: "local" },
{ role: "read", db: "config" }
]
})
Backup user:
use admin
db.createUser({
user: "backup_user",
pwd: passwordPrompt(),
roles: [
{ role: "backup", db: "admin" },
{ role: "restore", db: "admin" }
]
})
Custom Roles for Fine-Grained Control
When built-in roles are too broad, create custom roles:
use admin
db.createRole({
role: "orderProcessor",
privileges: [
{
resource: { db: "ecommerce", collection: "orders" },
actions: [ "find", "insert", "update" ]
},
{
resource: { db: "ecommerce", collection: "products" },
actions: [ "find" ]
}
],
roles: []
})
This user can read/write orders and read products, but nothing else. Principle of least privilege in action.
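To put the role to use, grant it to a user just as you would a built-in role (the username below is illustrative):
use admin
db.createUser({
  user: "order_service",
  pwd: passwordPrompt(),
  roles: [ { role: "orderProcessor", db: "admin" } ]
})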
Managing Users
Grant additional roles:
db.grantRolesToUser(
"username",
[ { role: "read", db: "newDatabase" } ]
)
Revoke roles:
db.revokeRolesFromUser(
"username",
[ { role: "readWrite", db: "oldDatabase" } ]
)
Change password:
db.changeUserPassword("username", passwordPrompt())
Delete user:
db.dropUser("username")
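During audits, it helps to see what a user can actually do. For example:
db.getUsers()                                     // All users in the current database
db.getUser("app_user", { showPrivileges: true })  // Roles and resolved privileges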
User Management Best Practices
Naming conventions help organization:
{service}_{environment}_{permission}
Examples:
- webapp_prod_rw
- analytics_prod_ro
- backup_prod
- monitoring_prod
Connection string format:
mongodb://username:password@host1:27017,host2:27017/database?replicaSet=rs0&authSource=admin&tls=true
Always specify authSource (the database where the user was created, typically admin).
Regular audits:
Monthly:
- Review all users and their roles
- Remove unused accounts
- Verify least privilege
- Check for shared credentials
Quarterly:
- Rotate passwords
- Review custom roles
- Validate connection strings
- Test backup user credentials
Conclusion and Next Steps
High availability and security aren’t bolt-on features - they’re fundamental architecture decisions. Replica sets provide failover and redundancy. Sharding enables horizontal scale. Authentication, authorization, and encryption protect your data. Role-based access control limits damage from compromised credentials.
Key takeaways:
- Replica sets are mandatory: Three-node replica sets with proper write concerns provide reliability
- Shard key matters: Spend time designing shard keys; a poor choice haunts you forever
- Security is layered: Authentication + TLS + firewalls + encryption + RBAC
- Least privilege works: Users get only necessary permissions, nothing more
- Monitor everything: Replication lag, balancer activity, failed logins
In Part 3 (final article), we’ll bring everything together for production deployment on Kubernetes: StatefulSets, MongoDB operators, automated backups with CronJobs, and the cloud-native patterns that make MongoDB reliable at scale.
For now, review your replica set configuration, audit your user permissions, and verify your security posture. Your database’s reliability and security depend on these fundamentals.