Skip to content
5 min read

MongoDB — Document Database Deep Dive

Why Netflix Chose MongoDB: Netflix migrated their content metadata from Oracle to MongoDB to handle 200M+ subscribers with unpredictable content schemas. Each title has different attributes (TV series have seasons/episodes; movies have directors/runtime). A rigid relational schema couldn't keep up with weekly schema changes. MongoDB's flexible document model solved this.


When to Reach for MongoDB

MongoDB excels when your data has variable structure, you need horizontal scaling, and relationships are mostly hierarchical (embedded documents). Think: content catalogs, user profiles, IoT event streams, real-time analytics.


Document Model vs Relational

%%{init: {'theme': 'base', 'themeVariables': {'fontSize': '13px', 'fontFamily': 'Inter, -apple-system, sans-serif'}, 'flowchart': {'nodeSpacing': 30, 'rankSpacing': 50, 'padding': 12, 'curve': 'basis'}, 'sequence': {'actorMargin': 60, 'messageMargin': 40}, 'class': {'padding': 12}}}%%
flowchart LR
    subgraph SQL["Relational (SQL)"]
        direction TB
        T1["orders table"]
        T2["order_items table"]
        T3["customers table"]
        T1 --- T2
        T1 --- T3
    end

    subgraph Mongo["Document (MongoDB)"]
        direction TB
        D1["orders collection<br/>{<br/>  customer: {...},<br/>  items: [...]<br/>}"]
    end

    style T1 fill:#DBEAFE,stroke:#93C5FD,color:#1E40AF
    style T2 fill:#DBEAFE,stroke:#93C5FD,color:#1E40AF
    style T3 fill:#DBEAFE,stroke:#93C5FD,color:#1E40AF
    style D1 fill:#D1FAE5,stroke:#6EE7B7,color:#065F46
Feature Relational (SQL) MongoDB
Data unit Row (fixed columns) Document (flexible JSON/BSON)
Schema Rigid, predefined Flexible, per-document
Relationships Foreign keys + JOINs Embedded documents or $lookup
Scaling Vertical (scale up) Horizontal (sharding)
Transactions Full ACID since day 1 Multi-document ACID since 4.0
Query language SQL MQL (MongoDB Query Language)
Best for Complex relationships, reporting Variable schemas, high write throughput
Storage Tables/rows/columns Collections/documents/fields

MongoDB Terminology Mapping

SQL Term MongoDB Term
Database Database
Table Collection
Row Document
Column Field
Primary Key _id field
Index Index
JOIN $lookup (aggregation) or embedding
Schema Validation rules (optional)

BSON Format & ObjectId

MongoDB stores data in BSON (Binary JSON) — a binary-encoded extension of JSON.

BSON vs JSON

Feature JSON BSON
Format Text Binary
Types 6 types (string, number, boolean, null, array, object) 20+ types (Date, Decimal128, Binary, ObjectId, etc.)
Size Larger (text encoding) Compact (binary)
Speed Parse from text Direct memory mapping
Max size N/A 16 MB per document

ObjectId Structure

%%{init: {'theme': 'base', 'themeVariables': {'fontSize': '13px', 'fontFamily': 'Inter, -apple-system, sans-serif'}, 'flowchart': {'nodeSpacing': 30, 'rankSpacing': 50, 'padding': 12, 'curve': 'basis'}, 'sequence': {'actorMargin': 60, 'messageMargin': 40}, 'class': {'padding': 12}}}%%
flowchart LR
    OID["ObjectId: 507f1f77bcf86cd799439011<br/>(24 hex chars = 12 bytes)"]
    OID --> TS["Bytes 0-3<br/>Timestamp<br/>(seconds since epoch)"]
    OID --> RND["Bytes 4-8<br/>Random Value<br/>(per machine/process)"]
    OID --> CNT["Bytes 9-11<br/>Counter<br/>(incrementing)"]

    style OID fill:#DBEAFE,stroke:#93C5FD,color:#1E40AF
    style TS fill:#D1FAE5,stroke:#6EE7B7,color:#065F46
    style RND fill:#FEF3C7,stroke:#FCD34D,color:#92400E
    style CNT fill:#EDE9FE,stroke:#C4B5FD,color:#5B21B6
JavaScript
// ObjectId properties
const id = ObjectId("507f1f77bcf86cd799439011");
id.getTimestamp(); // 2012-10-17T20:46:22Z (embedded creation time!)
id.toString();     // "507f1f77bcf86cd799439011"

Key facts about ObjectId:

  • 12 bytes, globally unique without coordination
  • Sortable by creation time (timestamp is first 4 bytes)
  • Auto-generated by driver if _id not specified
  • You can use any unique value as _id (UUID, natural key, etc.)

CRUD Operations

Create

JavaScript
// Insert one document
db.orders.insertOne({
  customerId: ObjectId("64a1b2c3d4e5f6a7b8c9d0e1"),
  items: [
    { product: "MacBook Pro", quantity: 1, price: 2499.99 },
    { product: "USB-C Hub", quantity: 2, price: 49.99 }
  ],
  total: 2599.97,
  status: "pending",
  createdAt: new Date(),
  shippingAddress: {
    street: "123 Main St",
    city: "San Francisco",
    state: "CA",
    zip: "94102"
  }
});

// Insert many
db.orders.insertMany([
  { /* order1 */ },
  { /* order2 */ },
  { /* order3 */ }
]);

Read (Find)

JavaScript
// Find one by ID
db.orders.findOne({ _id: ObjectId("507f1f77bcf86cd799439011") });

// Find with conditions
db.orders.find({
  status: "pending",
  total: { $gte: 100 },
  "shippingAddress.state": "CA"
});

// Projection (select specific fields)
db.orders.find(
  { status: "pending" },
  { customerId: 1, total: 1, status: 1, _id: 0 }
);

// Sorting, limiting, skipping
db.orders.find({ status: "shipped" })
  .sort({ createdAt: -1 })
  .skip(20)
  .limit(10);

Update

JavaScript
// Update one
db.orders.updateOne(
  { _id: ObjectId("507f1f77bcf86cd799439011") },
  {
    $set: { status: "shipped", shippedAt: new Date() },
    $push: { timeline: { event: "shipped", at: new Date() } }
  }
);

// Update many
db.orders.updateMany(
  { status: "pending", createdAt: { $lt: new Date("2024-01-01") } },
  { $set: { status: "cancelled" } }
);

// Upsert (insert if not exists)
db.metrics.updateOne(
  { date: "2024-03-15", metric: "page_views" },
  { $inc: { count: 1 } },
  { upsert: true }
);

Delete

JavaScript
// Delete one
db.orders.deleteOne({ _id: ObjectId("507f1f77bcf86cd799439011") });

// Delete many
db.orders.deleteMany({ status: "cancelled", createdAt: { $lt: new Date("2023-01-01") } });

Query Operators

Operator Meaning Example
$eq Equal { status: { $eq: "active" } }
$ne Not equal { status: { $ne: "deleted" } }
$gt, $gte Greater than (or equal) { total: { $gte: 100 } }
$lt, $lte Less than (or equal) { age: { $lt: 30 } }
$in In array { status: { $in: ["pending", "active"] } }
$nin Not in array { status: { $nin: ["deleted"] } }
$and Logical AND { $and: [{age: {$gte:18}}, {age: {$lte:65}}] }
$or Logical OR { $or: [{status: "a"}, {total: {$lt:10}}] }
$exists Field exists { email: { $exists: true } }
$regex Pattern match { name: { $regex: /^john/i } }
$elemMatch Array element match { items: { $elemMatch: { qty: {$gt:5} } } }

Indexing

Indexes are critical for MongoDB performance. Without indexes, MongoDB performs a collection scan (reads every document).

%%{init: {'theme': 'base', 'themeVariables': {'fontSize': '13px', 'fontFamily': 'Inter, -apple-system, sans-serif'}, 'flowchart': {'nodeSpacing': 30, 'rankSpacing': 50, 'padding': 12, 'curve': 'basis'}, 'sequence': {'actorMargin': 60, 'messageMargin': 40}, 'class': {'padding': 12}}}%%
flowchart TD
    Query["Query: find({status: 'active', total: {$gt: 100}})"]
    Query -->|"No Index"| Scan["Collection Scan<br/>O(n) — reads ALL documents"]
    Query -->|"With Index"| Index["Index Lookup<br/>O(log n) — B-tree traversal"]

    style Query fill:#DBEAFE,stroke:#93C5FD,color:#1E40AF
    style Scan fill:#FEE2E2,stroke:#FCA5A5,color:#991B1B
    style Index fill:#D1FAE5,stroke:#6EE7B7,color:#065F46

Index Types

Type Use Case Example
Single Field Query on one field db.orders.createIndex({ status: 1 })
Compound Query on multiple fields db.orders.createIndex({ status: 1, createdAt: -1 })
Multikey Index array fields db.products.createIndex({ tags: 1 })
Text Full-text search db.articles.createIndex({ title: "text", body: "text" })
Geospatial Location queries db.places.createIndex({ location: "2dsphere" })
Hashed Shard key distribution db.users.createIndex({ userId: "hashed" })
TTL Auto-expire documents db.sessions.createIndex({ expiresAt: 1 }, { expireAfterSeconds: 0 })
Unique Enforce uniqueness db.users.createIndex({ email: 1 }, { unique: true })
Partial Index subset of documents db.orders.createIndex({ total: 1 }, { partialFilterExpression: { status: "active" } })

Compound Index Order Matters (ESR Rule)

**E**quality -> **S**ort -> **R**ange

JavaScript
// Query: find active orders over $100, sorted by date
db.orders.find({ status: "active", total: { $gt: 100 } }).sort({ createdAt: -1 });

// Optimal index: Equality fields first, then Sort, then Range
db.orders.createIndex({ status: 1, createdAt: -1, total: 1 });
//                      ^Equality    ^Sort           ^Range

explain() for Query Analysis

JavaScript
db.orders.find({ status: "pending" }).explain("executionStats");
// Look for:
// - "stage": "IXSCAN" (good) vs "COLLSCAN" (bad)
// - "nReturned" vs "totalDocsExamined" ratio
// - "executionTimeMillis"

Aggregation Pipeline

The aggregation pipeline processes documents through a sequence of stages. Each stage transforms the document stream.

%%{init: {'theme': 'base', 'themeVariables': {'fontSize': '13px', 'fontFamily': 'Inter, -apple-system, sans-serif'}, 'flowchart': {'nodeSpacing': 30, 'rankSpacing': 50, 'padding': 12, 'curve': 'basis'}, 'sequence': {'actorMargin': 60, 'messageMargin': 40}, 'class': {'padding': 12}}}%%
flowchart LR
    Input["Input<br/>Documents"] --> Match["$match<br/>(filter)"]
    Match --> Group["$group<br/>(aggregate)"]
    Group --> Sort["$sort"]
    Sort --> Project["$project<br/>(reshape)"]
    Project --> Output["Output<br/>Documents"]

    style Input fill:#DBEAFE,stroke:#93C5FD,color:#1E40AF
    style Match fill:#D1FAE5,stroke:#6EE7B7,color:#065F46
    style Group fill:#FEF3C7,stroke:#FCD34D,color:#92400E
    style Sort fill:#EDE9FE,stroke:#C4B5FD,color:#5B21B6
    style Project fill:#D1FAE5,stroke:#6EE7B7,color:#065F46
    style Output fill:#DBEAFE,stroke:#93C5FD,color:#1E40AF

Common Pipeline Stages

Stage Purpose SQL Equivalent
$match Filter documents WHERE
$group Aggregate values GROUP BY
$project Reshape/select fields SELECT
$sort Order results ORDER BY
$limit Limit results LIMIT
$skip Skip results OFFSET
$unwind Flatten arrays Lateral join
$lookup Join collections JOIN
$addFields Add computed fields Computed column
$facet Multiple pipelines in parallel N/A
$bucket Group into ranges Range-based GROUP BY

Real-World Aggregation Examples

JavaScript
// Revenue by category for last 30 days
db.orders.aggregate([
  // Stage 1: Filter recent completed orders
  { $match: {
      status: "completed",
      createdAt: { $gte: new Date(Date.now() - 30*24*60*60*1000) }
  }},

  // Stage 2: Flatten items array
  { $unwind: "$items" },

  // Stage 3: Group by category
  { $group: {
      _id: "$items.category",
      totalRevenue: { $sum: { $multiply: ["$items.price", "$items.quantity"] } },
      orderCount: { $sum: 1 },
      avgOrderValue: { $avg: "$items.price" }
  }},

  // Stage 4: Sort by revenue
  { $sort: { totalRevenue: -1 } },

  // Stage 5: Reshape output
  { $project: {
      category: "$_id",
      totalRevenue: { $round: ["$totalRevenue", 2] },
      orderCount: 1,
      avgOrderValue: { $round: ["$avgOrderValue", 2] },
      _id: 0
  }}
]);

$lookup (JOIN equivalent)

JavaScript
// Join orders with customer data
db.orders.aggregate([
  { $lookup: {
      from: "customers",
      localField: "customerId",
      foreignField: "_id",
      as: "customer"
  }},
  { $unwind: "$customer" }, // Convert array to object
  { $project: {
      orderId: "$_id",
      total: 1,
      customerName: "$customer.name",
      customerEmail: "$customer.email"
  }}
]);

$facet (Multiple Aggregations in One Query)

JavaScript
// Dashboard data in a single query
db.orders.aggregate([
  { $facet: {
      "statusBreakdown": [
        { $group: { _id: "$status", count: { $sum: 1 } } }
      ],
      "revenueByMonth": [
        { $group: {
            _id: { $dateToString: { format: "%Y-%m", date: "$createdAt" } },
            revenue: { $sum: "$total" }
        }},
        { $sort: { _id: 1 } }
      ],
      "topCustomers": [
        { $group: { _id: "$customerId", totalSpent: { $sum: "$total" } } },
        { $sort: { totalSpent: -1 } },
        { $limit: 5 }
      ]
  }}
]);

Replica Sets and Sharding

Replica Sets (High Availability)

%%{init: {'theme': 'base', 'themeVariables': {'fontSize': '13px', 'fontFamily': 'Inter, -apple-system, sans-serif'}, 'flowchart': {'nodeSpacing': 30, 'rankSpacing': 50, 'padding': 12, 'curve': 'basis'}, 'sequence': {'actorMargin': 60, 'messageMargin': 40}, 'class': {'padding': 12}}}%%
flowchart TD
    App["Application"] --> Primary["Primary<br/>(reads + writes)"]
    Primary -->|"Replication"| S1["Secondary 1<br/>(reads)"]
    Primary -->|"Replication"| S2["Secondary 2<br/>(reads)"]
    Primary -.->|"Heartbeat"| S1
    Primary -.->|"Heartbeat"| S2

    S1 -.->|"Election on<br/>primary failure"| NewPrimary["New Primary"]

    style App fill:#DBEAFE,stroke:#93C5FD,color:#1E40AF
    style Primary fill:#D1FAE5,stroke:#6EE7B7,color:#065F46
    style S1 fill:#FEF3C7,stroke:#FCD34D,color:#92400E
    style S2 fill:#FEF3C7,stroke:#FCD34D,color:#92400E
    style NewPrimary fill:#EDE9FE,stroke:#C4B5FD,color:#5B21B6
Feature Description
Minimum nodes 3 (1 primary + 2 secondary, or 1 primary + 1 secondary + 1 arbiter)
Write concern w: "majority" — acknowledged after majority of nodes confirm
Read preference Primary (default), Secondary, PrimaryPreferred, Nearest
Failover Automatic election (10-12 seconds)
Oplog Capped collection that records all write operations for replication

Sharding (Horizontal Scaling)

%%{init: {'theme': 'base', 'themeVariables': {'fontSize': '13px', 'fontFamily': 'Inter, -apple-system, sans-serif'}, 'flowchart': {'nodeSpacing': 30, 'rankSpacing': 50, 'padding': 12, 'curve': 'basis'}, 'sequence': {'actorMargin': 60, 'messageMargin': 40}, 'class': {'padding': 12}}}%%
flowchart TD
    App["Application"] --> Router["mongos<br/>(Router)"]
    Router --> S1["Shard 1<br/>(userId: A-M)"]
    Router --> S2["Shard 2<br/>(userId: N-Z)"]
    Router --> Config["Config Servers<br/>(metadata)"]

    style App fill:#DBEAFE,stroke:#93C5FD,color:#1E40AF
    style Router fill:#EDE9FE,stroke:#C4B5FD,color:#5B21B6
    style S1 fill:#D1FAE5,stroke:#6EE7B7,color:#065F46
    style S2 fill:#D1FAE5,stroke:#6EE7B7,color:#065F46
    style Config fill:#FEF3C7,stroke:#FCD34D,color:#92400E
Shard Key Strategy Pros Cons
Hashed Even distribution No range queries on shard key
Range Efficient range queries Hot spots if sequential (timestamps)
Zone/Tag Geographic locality Manual management
JavaScript
// Enable sharding
sh.enableSharding("ecommerce");
sh.shardCollection("ecommerce.orders", { customerId: "hashed" });

Schema Design Patterns

Embedding vs Referencing

%%{init: {'theme': 'base', 'themeVariables': {'fontSize': '13px', 'fontFamily': 'Inter, -apple-system, sans-serif'}, 'flowchart': {'nodeSpacing': 30, 'rankSpacing': 50, 'padding': 12, 'curve': 'basis'}, 'sequence': {'actorMargin': 60, 'messageMargin': 40}, 'class': {'padding': 12}}}%%
flowchart LR
    subgraph Embed["Embedding (Denormalized)"]
        direction TB
        E1["{<br/>  order: {...},<br/>  items: [...],<br/>  customer: {...}<br/>}"]
    end

    subgraph Ref["Referencing (Normalized)"]
        direction TB
        R1["orders: { customerId: ObjectId }"]
        R2["customers: { _id: ObjectId }"]
        R1 -.-> R2
    end

    style E1 fill:#D1FAE5,stroke:#6EE7B7,color:#065F46
    style R1 fill:#DBEAFE,stroke:#93C5FD,color:#1E40AF
    style R2 fill:#FEF3C7,stroke:#FCD34D,color:#92400E
Factor Embed Reference
Read pattern Always read together Read independently
Data size Small/bounded subdocuments Large or unbounded
Update pattern Updated together Updated independently
Cardinality 1:few (embed many side) 1:many or many:many
Document size Under 16 MB Could exceed 16 MB
Consistency Atomic (single document) Requires transactions

Schema Design Patterns

Pattern When to Use Example
Embedded Tightly coupled data, read together Order with line items
Subset Only need recent/top-N of a large array Product with latest 10 reviews embedded
Computed Avoid expensive aggregations on read Pre-computed totals, averages
Bucket Time-series data, IoT measurements Group sensor readings by hour
Outlier 99% small, 1% huge Most users have 5 posts; some have 50K
Extended Ref Need some fields from related doc Embed customerName + customerId

Example: Blog Post Schema

JavaScript
// Embedded pattern (good for bounded data)
{
  _id: ObjectId("..."),
  title: "MongoDB Schema Design",
  author: {
    _id: ObjectId("..."),
    name: "Jane Smith",
    avatar: "/img/jane.jpg"  // Extended reference
  },
  content: "...",
  tags: ["mongodb", "nosql", "schema-design"],
  comments: [   // Embed if bounded (e.g., max 50)
    { user: "Bob", text: "Great post!", createdAt: new Date() },
    { user: "Alice", text: "Very helpful", createdAt: new Date() }
  ],
  stats: {      // Computed pattern
    views: 1524,
    likes: 47,
    commentCount: 2
  },
  createdAt: new Date(),
  updatedAt: new Date()
}

Transactions in MongoDB (4.0+)

Multi-document transactions provide ACID guarantees across multiple documents and collections.

Java
// Spring Data MongoDB transaction
@Transactional
public void transferFunds(String fromAccount, String toAccount, BigDecimal amount) {
    Account from = accountRepo.findById(fromAccount).orElseThrow();
    Account to = accountRepo.findById(toAccount).orElseThrow();

    if (from.getBalance().compareTo(amount) < 0) {
        throw new InsufficientFundsException();
    }

    from.debit(amount);
    to.credit(amount);

    accountRepo.save(from);
    accountRepo.save(to);
    // Both or neither — ACID guaranteed
}
JavaScript
// Native MongoDB transaction
const session = client.startSession();
try {
  session.startTransaction({
    readConcern: { level: "snapshot" },
    writeConcern: { w: "majority" }
  });

  db.accounts.updateOne(
    { _id: fromAccount },
    { $inc: { balance: -amount } },
    { session }
  );

  db.accounts.updateOne(
    { _id: toAccount },
    { $inc: { balance: amount } },
    { session }
  );

  session.commitTransaction();
} catch (error) {
  session.abortTransaction();
  throw error;
} finally {
  session.endSession();
}

Transaction Limitations

  • Max 60 seconds runtime (configurable)
  • Performance overhead (snapshot isolation)
  • Not designed for long-running operations
  • If you need transactions frequently, reconsider your schema (embedding may eliminate the need)

Spring Data MongoDB Integration

Configuration

YAML
# application.yml
spring:
  data:
    mongodb:
      uri: mongodb://localhost:27017/ecommerce
      # or individual properties:
      host: localhost
      port: 27017
      database: ecommerce
      authentication-database: admin
      username: app_user
      password: ${MONGO_PASSWORD}

Entity Mapping

Java
@Document(collection = "orders")
@CompoundIndex(name = "status_date_idx", def = "{'status': 1, 'createdAt': -1}")
public class Order {

    @Id
    private String id; // Maps to _id (ObjectId if String)

    @Indexed
    private String customerId;

    @Field("items") // Custom field name
    private List<OrderItem> lineItems;

    private BigDecimal total;

    @Indexed
    private OrderStatus status;

    @CreatedDate
    private Instant createdAt;

    @LastModifiedDate
    private Instant updatedAt;

    @Version
    private Long version; // Optimistic locking
}

// Embedded document (no @Document, no @Id)
public class OrderItem {
    private String productId;
    private String productName;
    private int quantity;
    private BigDecimal price;
}

Repository

Java
public interface OrderRepository extends MongoRepository<Order, String> {

    List<Order> findByStatusAndTotalGreaterThan(OrderStatus status, BigDecimal minTotal);

    @Query("{ 'status': ?0, 'createdAt': { $gte: ?1 } }")
    List<Order> findRecentByStatus(OrderStatus status, Instant since);

    @Query(value = "{ 'customerId': ?0 }", fields = "{ 'total': 1, 'status': 1 }")
    List<Order> findOrderSummariesByCustomer(String customerId);

    @Aggregation(pipeline = {
        "{ $match: { status: 'COMPLETED' } }",
        "{ $group: { _id: '$customerId', totalSpent: { $sum: '$total' } } }",
        "{ $sort: { totalSpent: -1 } }",
        "{ $limit: 10 }"
    })
    List<CustomerSpending> findTopSpenders();
}

MongoTemplate for Complex Operations

Java
@Service
public class OrderAnalyticsService {

    @Autowired
    private MongoTemplate mongoTemplate;

    public List<CategoryRevenue> getRevenueByCategory(Instant since) {
        Aggregation agg = Aggregation.newAggregation(
            match(Criteria.where("status").is("COMPLETED")
                          .and("createdAt").gte(since)),
            unwind("lineItems"),
            group("lineItems.category")
                .sum(ArithmeticOperators.Multiply.valueOf("lineItems.price")
                    .multiplyBy("lineItems.quantity")).as("revenue")
                .count().as("orderCount"),
            sort(Sort.Direction.DESC, "revenue"),
            project()
                .and("_id").as("category")
                .andInclude("revenue", "orderCount")
                .andExclude("_id")
        );

        return mongoTemplate.aggregate(agg, "orders", CategoryRevenue.class)
            .getMappedResults();
    }
}

When to Choose MongoDB vs SQL

%%{init: {'theme': 'base', 'themeVariables': {'fontSize': '13px', 'fontFamily': 'Inter, -apple-system, sans-serif'}, 'flowchart': {'nodeSpacing': 30, 'rankSpacing': 50, 'padding': 12, 'curve': 'basis'}, 'sequence': {'actorMargin': 60, 'messageMargin': 40}, 'class': {'padding': 12}}}%%
flowchart TD
    Q1{"Need complex JOINs<br/>across many tables?"}
    Q1 -->|Yes| SQL["Use SQL<br/>(PostgreSQL, MySQL)"]
    Q1 -->|No| Q2{"Schema changes<br/>frequently?"}
    Q2 -->|Yes| Mongo["Use MongoDB"]
    Q2 -->|No| Q3{"Need horizontal<br/>write scaling?"}
    Q3 -->|Yes| Mongo
    Q3 -->|No| Q4{"Strong consistency<br/>across related data?"}
    Q4 -->|Yes| SQL
    Q4 -->|No| Q5{"Hierarchical/<br/>nested data?"}
    Q5 -->|Yes| Mongo
    Q5 -->|No| SQL

    style Q1 fill:#DBEAFE,stroke:#93C5FD,color:#1E40AF
    style Q2 fill:#DBEAFE,stroke:#93C5FD,color:#1E40AF
    style Q3 fill:#DBEAFE,stroke:#93C5FD,color:#1E40AF
    style Q4 fill:#DBEAFE,stroke:#93C5FD,color:#1E40AF
    style Q5 fill:#DBEAFE,stroke:#93C5FD,color:#1E40AF
    style SQL fill:#FEF3C7,stroke:#FCD34D,color:#92400E
    style Mongo fill:#D1FAE5,stroke:#6EE7B7,color:#065F46
Choose MongoDB Choose SQL
Variable/evolving schema Fixed, well-known schema
High write throughput Complex transactions across tables
Hierarchical data (nested) Heavy reporting/analytics with JOINs
Horizontal scaling needed Strict referential integrity
Rapid prototyping Financial/banking (strong ACID)
Content management, catalogs ERP, inventory with relationships
IoT / time-series data Multi-table aggregations
Geospatial queries Mature tooling ecosystem

Interview Questions

Q: How does MongoDB handle schema flexibility, and what are the trade-offs?

MongoDB is schema-less at the database level — documents in the same collection can have different fields. This provides:

Benefits: (1) Faster development iterations. (2) Handles polymorphic data (different product types with different attributes). (3) No costly ALTER TABLE migrations. (4) Natural fit for semi-structured data.

Trade-offs: (1) Application must handle schema validation (or use MongoDB Schema Validation rules). (2) Inconsistent data possible if multiple services write differently. (3) No compile-time safety for field names. (4) Query planning harder without guaranteed schema.

Best practice: Use schema validation rules in production ($jsonSchema validator) to enforce structure while retaining flexibility for optional fields.

Q: Explain the difference between embedding and referencing in MongoDB schema design. When would you use each?

Embedding: Nest related data within the parent document. Use when: (1) Data is always read together. (2) Cardinality is bounded (1:few). (3) Child data doesn't need independent access. (4) Updates are atomic (single document).

Referencing: Store ObjectId references to other collections. Use when: (1) Data is accessed independently. (2) Cardinality is unbounded (1:millions). (3) Document would exceed 16 MB. (4) Many-to-many relationships. (5) Data is updated independently.

Rule of thumb: If you would use JOIN every time you read, embed. If you read the related data <20% of the time, reference.

Example: Order with items -> embed (always read together). User with orders -> reference (user has thousands of orders over time).

Q: What is the aggregation pipeline in MongoDB? How does it compare to SQL?

The aggregation pipeline processes documents through a sequence of stages, where each stage transforms the document stream. It's MongoDB's equivalent to complex SQL queries with GROUP BY, HAVING, and subqueries.

Key stages: $match (WHERE), $group (GROUP BY), $project (SELECT), $sort (ORDER BY), $lookup (JOIN), $unwind (flatten arrays), $facet (multiple parallel pipelines).

Advantages over SQL: (1) Pipeline stages are composable and reorderable. (2) Can process nested/array data natively. (3) $facet runs multiple aggregations in one pass. (4) Stages can be conditionally added programmatically.

Performance tip: Put $match early (uses indexes), then $project to reduce document size before expensive stages.

Q: How does MongoDB achieve high availability and horizontal scaling?

High Availability (Replica Sets): Minimum 3 nodes. One primary handles writes; secondaries replicate asynchronously. On primary failure, automatic election chooses new primary (10-12 sec failover). Write concern w: majority ensures durability.

Horizontal Scaling (Sharding): Data distributed across shards by shard key. mongos routers direct queries to correct shard(s). Two strategies: (1) Hashed sharding — even distribution, no range queries on key. (2) Range sharding — efficient ranges, possible hot spots.

Choosing shard key is critical: high cardinality, even distribution, matches query patterns. Bad shard key (e.g., timestamp) causes all writes to hit one shard.

Q: When would you choose MongoDB over PostgreSQL for a new project?

Choose MongoDB when: (1) Schema evolves frequently (startup, MVP). (2) Data is naturally hierarchical/nested (catalogs, CMS). (3) Need horizontal write scaling beyond single node. (4) Geospatial queries are primary use case. (5) Variable structure per document (polymorphic entities). (6) High write throughput with eventual consistency acceptable.

Choose PostgreSQL when: (1) Complex relationships requiring JOINs. (2) Need strong ACID across multiple tables. (3) Reporting/analytics with complex aggregations. (4) Team expertise is SQL-centric. (5) JSON flexibility needed within relational model (JSONB). (6) Extensions matter (PostGIS, full-text search, etc.).

Both support JSON, indexing, replication, and transactions. The deciding factor is usually your access patterns and relationship complexity.


Quick Recall

Concept Key Point
Document max size 16 MB (BSON)
ObjectId 12 bytes: timestamp + random + counter (sortable by time)
Embedding rule Embed if always read together and bounded cardinality
Referencing rule Reference if independent access or unbounded growth
ESR indexing Equality, Sort, Range — field order in compound indexes
Aggregation Pipeline of stages: $match -> $group -> $sort -> $project
$lookup LEFT OUTER JOIN equivalent (aggregation stage)
Replica Set 3+ nodes, automatic failover, write to primary only
Sharding Horizontal scaling by shard key, use hashed for even distribution
Transactions Multi-document ACID since 4.0, 60-second limit
Spring Data @Document, MongoRepository, MongoTemplate
Write concern w: "majority" for durability across replica set

See Also