5 min read By Vamsi Karuturi · Senior Backend Engineer at Salesforce

MongoDB — Document Database Deep Dive

Q: Q: Explain the difference between embedding and referencing in MongoDB schema design. When would you use each?

Embedding: Nest related data within the parent document. Use when: (1) Data is always read together. (2) Cardinality is bounded (1:few). (3) Child data doesn't need independent access. (4) Updates are atomic (single document). Referencing: Store ObjectId references to other collections. Use when: (1) Data is accessed independently. (2) Cardinality is unbounded (1:millions). (3) Document would exceed 16 MB. (4) Many-to-many relationships. (5) Data is updated independently. Rule of thumb: If you w

Q: Q: How does MongoDB achieve high availability and horizontal scaling?

High Availability (Replica Sets): Minimum 3 nodes. One primary handles writes; secondaries replicate asynchronously. On primary failure, automatic election chooses new primary (10-12 sec failover). Write concern w: majority ensures durability. Horizontal Scaling (Sharding): Data distributed across shards by shard key. mongos routers direct queries to correct shard(s). Two strategies: (1) Hashed sharding — even distribution, no range queries on key. (2) Range sharding — efficient ranges, possible

Why Netflix Chose MongoDB: Netflix migrated their content metadata from Oracle to MongoDB to handle 200M+ subscribers with unpredictable content schemas. Each title has different attributes (TV series have seasons/episodes; movies have directors/runtime). A rigid relational schema couldn't keep up with weekly schema changes. MongoDB's flexible document model solved this.

When to Reach for MongoDB

MongoDB excels when your data has variable structure, you need horizontal scaling, and relationships are mostly hierarchical (embedded documents). Think: content catalogs, user profiles, IoT event streams, real-time analytics.

Document Model vs Relational

%%{init: {'theme': 'base', 'themeVariables': {'fontSize': '13px', 'fontFamily': 'Inter, -apple-system, sans-serif'}, 'flowchart': {'nodeSpacing': 30, 'rankSpacing': 50, 'padding': 12, 'curve': 'basis'}, 'sequence': {'actorMargin': 60, 'messageMargin': 40}, 'class': {'padding': 12}}}%%
flowchart LR
    subgraph SQL["Relational (SQL)"]
        direction TB
        T1["orders table"]
        T2["order_items table"]
        T3["customers table"]
        T1 --- T2
        T1 --- T3
    end

    subgraph Mongo["Document (MongoDB)"]
        direction TB
        D1["orders collection<br/>{<br/>  customer: {...},<br/>  items: [...]<br/>}"]
    end

    style T1 fill:#DBEAFE,stroke:#93C5FD,color:#1E40AF
    style T2 fill:#DBEAFE,stroke:#93C5FD,color:#1E40AF
    style T3 fill:#DBEAFE,stroke:#93C5FD,color:#1E40AF
    style D1 fill:#D1FAE5,stroke:#6EE7B7,color:#065F46

Feature	Relational (SQL)	MongoDB
Data unit	Row (fixed columns)	Document (flexible JSON/BSON)
Schema	Rigid, predefined	Flexible, per-document
Relationships	Foreign keys + JOINs	Embedded documents or `$lookup`
Scaling	Vertical (scale up)	Horizontal (sharding)
Transactions	Full ACID since day 1	Multi-document ACID since 4.0
Query language	SQL	MQL (MongoDB Query Language)
Best for	Complex relationships, reporting	Variable schemas, high write throughput
Storage	Tables/rows/columns	Collections/documents/fields

MongoDB Terminology Mapping

SQL Term	MongoDB Term
Database	Database
Table	Collection
Row	Document
Column	Field
Primary Key	`_id` field
Index	Index
JOIN	`$lookup` (aggregation) or embedding
Schema	Validation rules (optional)

BSON Format & ObjectId

MongoDB stores data in BSON (Binary JSON) — a binary-encoded extension of JSON.

BSON vs JSON

Feature	JSON	BSON
Format	Text	Binary
Types	6 types (string, number, boolean, null, array, object)	20+ types (Date, Decimal128, Binary, ObjectId, etc.)
Size	Larger (text encoding)	Compact (binary)
Speed	Parse from text	Direct memory mapping
Max size	N/A	16 MB per document

ObjectId Structure

%%{init: {'theme': 'base', 'themeVariables': {'fontSize': '13px', 'fontFamily': 'Inter, -apple-system, sans-serif'}, 'flowchart': {'nodeSpacing': 30, 'rankSpacing': 50, 'padding': 12, 'curve': 'basis'}, 'sequence': {'actorMargin': 60, 'messageMargin': 40}, 'class': {'padding': 12}}}%%
flowchart LR
    OID["ObjectId: 507f1f77bcf86cd799439011<br/>(24 hex chars = 12 bytes)"]
    OID --> TS["Bytes 0-3<br/>Timestamp<br/>(seconds since epoch)"]
    OID --> RND["Bytes 4-8<br/>Random Value<br/>(per machine/process)"]
    OID --> CNT["Bytes 9-11<br/>Counter<br/>(incrementing)"]

    style OID fill:#DBEAFE,stroke:#93C5FD,color:#1E40AF
    style TS fill:#D1FAE5,stroke:#6EE7B7,color:#065F46
    style RND fill:#FEF3C7,stroke:#FCD34D,color:#92400E
    style CNT fill:#EDE9FE,stroke:#C4B5FD,color:#5B21B6

JavaScript

// ObjectId properties
const id = ObjectId("507f1f77bcf86cd799439011");
id.getTimestamp(); // 2012-10-17T20:46:22Z (embedded creation time!)
id.toString();     // "507f1f77bcf86cd799439011"

Key facts about ObjectId:

12 bytes, globally unique without coordination
Sortable by creation time (timestamp is first 4 bytes)
Auto-generated by driver if _id not specified
You can use any unique value as _id (UUID, natural key, etc.)

CRUD Operations

Create

JavaScript

// Insert one document
db.orders.insertOne({
  customerId: ObjectId("64a1b2c3d4e5f6a7b8c9d0e1"),
  items: [
    { product: "MacBook Pro", quantity: 1, price: 2499.99 },
    { product: "USB-C Hub", quantity: 2, price: 49.99 }
  ],
  total: 2599.97,
  status: "pending",
  createdAt: new Date(),
  shippingAddress: {
    street: "123 Main St",
    city: "San Francisco",
    state: "CA",
    zip: "94102"
  }
});

// Insert many
db.orders.insertMany([
  { /* order1 */ },
  { /* order2 */ },
  { /* order3 */ }
]);

Read (Find)

JavaScript

// Find one by ID
db.orders.findOne({ _id: ObjectId("507f1f77bcf86cd799439011") });

// Find with conditions
db.orders.find({
  status: "pending",
  total: { $gte: 100 },
  "shippingAddress.state": "CA"
});

// Projection (select specific fields)
db.orders.find(
  { status: "pending" },
  { customerId: 1, total: 1, status: 1, _id: 0 }
);

// Sorting, limiting, skipping
db.orders.find({ status: "shipped" })
  .sort({ createdAt: -1 })
  .skip(20)
  .limit(10);

Update

JavaScript

// Update one
db.orders.updateOne(
  { _id: ObjectId("507f1f77bcf86cd799439011") },
  {
    $set: { status: "shipped", shippedAt: new Date() },
    $push: { timeline: { event: "shipped", at: new Date() } }
  }
);

// Update many
db.orders.updateMany(
  { status: "pending", createdAt: { $lt: new Date("2024-01-01") } },
  { $set: { status: "cancelled" } }
);

// Upsert (insert if not exists)
db.metrics.updateOne(
  { date: "2024-03-15", metric: "page_views" },
  { $inc: { count: 1 } },
  { upsert: true }
);

Delete

JavaScript

// Delete one
db.orders.deleteOne({ _id: ObjectId("507f1f77bcf86cd799439011") });

// Delete many
db.orders.deleteMany({ status: "cancelled", createdAt: { $lt: new Date("2023-01-01") } });

Query Operators

Operator	Meaning	Example
`$eq`	Equal	`{ status: { $eq: "active" } }`
`$ne`	Not equal	`{ status: { $ne: "deleted" } }`
`$gt`, `$gte`	Greater than (or equal)	`{ total: { $gte: 100 } }`
`$lt`, `$lte`	Less than (or equal)	`{ age: { $lt: 30 } }`
`$in`	In array	`{ status: { $in: ["pending", "active"] } }`
`$nin`	Not in array	`{ status: { $nin: ["deleted"] } }`
`$and`	Logical AND	`{ $and: [{age: {$gte:18}}, {age: {$lte:65}}] }`
`$or`	Logical OR	`{ $or: [{status: "a"}, {total: {$lt:10}}] }`
`$exists`	Field exists	`{ email: { $exists: true } }`
`$regex`	Pattern match	`{ name: { $regex: /^john/i } }`
`$elemMatch`	Array element match	`{ items: { $elemMatch: { qty: {$gt:5} } } }`

Indexing

Indexes are critical for MongoDB performance. Without indexes, MongoDB performs a collection scan (reads every document).

%%{init: {'theme': 'base', 'themeVariables': {'fontSize': '13px', 'fontFamily': 'Inter, -apple-system, sans-serif'}, 'flowchart': {'nodeSpacing': 30, 'rankSpacing': 50, 'padding': 12, 'curve': 'basis'}, 'sequence': {'actorMargin': 60, 'messageMargin': 40}, 'class': {'padding': 12}}}%%
flowchart TD
    Query["Query: find({status: 'active', total: {$gt: 100}})"]
    Query -->|"No Index"| Scan["Collection Scan<br/>O(n) — reads ALL documents"]
    Query -->|"With Index"| Index["Index Lookup<br/>O(log n) — B-tree traversal"]

    style Query fill:#DBEAFE,stroke:#93C5FD,color:#1E40AF
    style Scan fill:#FEE2E2,stroke:#FCA5A5,color:#991B1B
    style Index fill:#D1FAE5,stroke:#6EE7B7,color:#065F46

Index Types

Type	Use Case	Example
Single Field	Query on one field	`db.orders.createIndex({ status: 1 })`
Compound	Query on multiple fields	`db.orders.createIndex({ status: 1, createdAt: -1 })`
Multikey	Index array fields	`db.products.createIndex({ tags: 1 })`
Text	Full-text search	`db.articles.createIndex({ title: "text", body: "text" })`
Geospatial	Location queries	`db.places.createIndex({ location: "2dsphere" })`
Hashed	Shard key distribution	`db.users.createIndex({ userId: "hashed" })`
TTL	Auto-expire documents	`db.sessions.createIndex({ expiresAt: 1 }, { expireAfterSeconds: 0 })`
Unique	Enforce uniqueness	`db.users.createIndex({ email: 1 }, { unique: true })`
Partial	Index subset of documents	`db.orders.createIndex({ total: 1 }, { partialFilterExpression: { status: "active" } })`

Compound Index Order Matters (ESR Rule)

**E**quality -> **S**ort -> **R**ange

JavaScript

// Query: find active orders over $100, sorted by date
db.orders.find({ status: "active", total: { $gt: 100 } }).sort({ createdAt: -1 });

// Optimal index: Equality fields first, then Sort, then Range
db.orders.createIndex({ status: 1, createdAt: -1, total: 1 });
//                      ^Equality    ^Sort           ^Range

explain() for Query Analysis

JavaScript

db.orders.find({ status: "pending" }).explain("executionStats");
// Look for:
// - "stage": "IXSCAN" (good) vs "COLLSCAN" (bad)
// - "nReturned" vs "totalDocsExamined" ratio
// - "executionTimeMillis"

Aggregation Pipeline

The aggregation pipeline processes documents through a sequence of stages. Each stage transforms the document stream.

%%{init: {'theme': 'base', 'themeVariables': {'fontSize': '13px', 'fontFamily': 'Inter, -apple-system, sans-serif'}, 'flowchart': {'nodeSpacing': 30, 'rankSpacing': 50, 'padding': 12, 'curve': 'basis'}, 'sequence': {'actorMargin': 60, 'messageMargin': 40}, 'class': {'padding': 12}}}%%
flowchart LR
    Input["Input<br/>Documents"] --> Match["$match<br/>(filter)"]
    Match --> Group["$group<br/>(aggregate)"]
    Group --> Sort["$sort"]
    Sort --> Project["$project<br/>(reshape)"]
    Project --> Output["Output<br/>Documents"]

    style Input fill:#DBEAFE,stroke:#93C5FD,color:#1E40AF
    style Match fill:#D1FAE5,stroke:#6EE7B7,color:#065F46
    style Group fill:#FEF3C7,stroke:#FCD34D,color:#92400E
    style Sort fill:#EDE9FE,stroke:#C4B5FD,color:#5B21B6
    style Project fill:#D1FAE5,stroke:#6EE7B7,color:#065F46
    style Output fill:#DBEAFE,stroke:#93C5FD,color:#1E40AF

Common Pipeline Stages

Stage	Purpose	SQL Equivalent
`$match`	Filter documents	`WHERE`
`$group`	Aggregate values	`GROUP BY`
`$project`	Reshape/select fields	`SELECT`
`$sort`	Order results	`ORDER BY`
`$limit`	Limit results	`LIMIT`
`$skip`	Skip results	`OFFSET`
`$unwind`	Flatten arrays	Lateral join
`$lookup`	Join collections	`JOIN`
`$addFields`	Add computed fields	Computed column
`$facet`	Multiple pipelines in parallel	N/A
`$bucket`	Group into ranges	Range-based GROUP BY

Real-World Aggregation Examples

JavaScript

// Revenue by category for last 30 days
db.orders.aggregate([
  // Stage 1: Filter recent completed orders
  { $match: {
      status: "completed",
      createdAt: { $gte: new Date(Date.now() - 30*24*60*60*1000) }
  }},

  // Stage 2: Flatten items array
  { $unwind: "$items" },

  // Stage 3: Group by category
  { $group: {
      _id: "$items.category",
      totalRevenue: { $sum: { $multiply: ["$items.price", "$items.quantity"] } },
      orderCount: { $sum: 1 },
      avgOrderValue: { $avg: "$items.price" }
  }},

  // Stage 4: Sort by revenue
  { $sort: { totalRevenue: -1 } },

  // Stage 5: Reshape output
  { $project: {
      category: "$_id",
      totalRevenue: { $round: ["$totalRevenue", 2] },
      orderCount: 1,
      avgOrderValue: { $round: ["$avgOrderValue", 2] },
      _id: 0
  }}
]);

$lookup (JOIN equivalent)

JavaScript

// Join orders with customer data
db.orders.aggregate([
  { $lookup: {
      from: "customers",
      localField: "customerId",
      foreignField: "_id",
      as: "customer"
  }},
  { $unwind: "$customer" }, // Convert array to object
  { $project: {
      orderId: "$_id",
      total: 1,
      customerName: "$customer.name",
      customerEmail: "$customer.email"
  }}
]);

JavaScript

// Dashboard data in a single query
db.orders.aggregate([
  { $facet: {
      "statusBreakdown": [
        { $group: { _id: "$status", count: { $sum: 1 } } }
      ],
      "revenueByMonth": [
        { $group: {
            _id: { $dateToString: { format: "%Y-%m", date: "$createdAt" } },
            revenue: { $sum: "$total" }
        }},
        { $sort: { _id: 1 } }
      ],
      "topCustomers": [
        { $group: { _id: "$customerId", totalSpent: { $sum: "$total" } } },
        { $sort: { totalSpent: -1 } },
        { $limit: 5 }
      ]
  }}
]);

Replica Sets and Sharding

Replica Sets (High Availability)

%%{init: {'theme': 'base', 'themeVariables': {'fontSize': '13px', 'fontFamily': 'Inter, -apple-system, sans-serif'}, 'flowchart': {'nodeSpacing': 30, 'rankSpacing': 50, 'padding': 12, 'curve': 'basis'}, 'sequence': {'actorMargin': 60, 'messageMargin': 40}, 'class': {'padding': 12}}}%%
flowchart TD
    App["Application"] --> Primary["Primary<br/>(reads + writes)"]
    Primary -->|"Replication"| S1["Secondary 1<br/>(reads)"]
    Primary -->|"Replication"| S2["Secondary 2<br/>(reads)"]
    Primary -.->|"Heartbeat"| S1
    Primary -.->|"Heartbeat"| S2

    S1 -.->|"Election on<br/>primary failure"| NewPrimary["New Primary"]

    style App fill:#DBEAFE,stroke:#93C5FD,color:#1E40AF
    style Primary fill:#D1FAE5,stroke:#6EE7B7,color:#065F46
    style S1 fill:#FEF3C7,stroke:#FCD34D,color:#92400E
    style S2 fill:#FEF3C7,stroke:#FCD34D,color:#92400E
    style NewPrimary fill:#EDE9FE,stroke:#C4B5FD,color:#5B21B6

Feature	Description
Minimum nodes	3 (1 primary + 2 secondary, or 1 primary + 1 secondary + 1 arbiter)
Write concern	`w: "majority"` — acknowledged after majority of nodes confirm
Read preference	Primary (default), Secondary, PrimaryPreferred, Nearest
Failover	Automatic election (10-12 seconds)
Oplog	Capped collection that records all write operations for replication

Sharding (Horizontal Scaling)

%%{init: {'theme': 'base', 'themeVariables': {'fontSize': '13px', 'fontFamily': 'Inter, -apple-system, sans-serif'}, 'flowchart': {'nodeSpacing': 30, 'rankSpacing': 50, 'padding': 12, 'curve': 'basis'}, 'sequence': {'actorMargin': 60, 'messageMargin': 40}, 'class': {'padding': 12}}}%%
flowchart TD
    App["Application"] --> Router["mongos<br/>(Router)"]
    Router --> S1["Shard 1<br/>(userId: A-M)"]
    Router --> S2["Shard 2<br/>(userId: N-Z)"]
    Router --> Config["Config Servers<br/>(metadata)"]

    style App fill:#DBEAFE,stroke:#93C5FD,color:#1E40AF
    style Router fill:#EDE9FE,stroke:#C4B5FD,color:#5B21B6
    style S1 fill:#D1FAE5,stroke:#6EE7B7,color:#065F46
    style S2 fill:#D1FAE5,stroke:#6EE7B7,color:#065F46
    style Config fill:#FEF3C7,stroke:#FCD34D,color:#92400E

Shard Key Strategy	Pros	Cons
Hashed	Even distribution	No range queries on shard key
Range	Efficient range queries	Hot spots if sequential (timestamps)
Zone/Tag	Geographic locality	Manual management

JavaScript

// Enable sharding
sh.enableSharding("ecommerce");
sh.shardCollection("ecommerce.orders", { customerId: "hashed" });

Schema Design Patterns

Embedding vs Referencing

%%{init: {'theme': 'base', 'themeVariables': {'fontSize': '13px', 'fontFamily': 'Inter, -apple-system, sans-serif'}, 'flowchart': {'nodeSpacing': 30, 'rankSpacing': 50, 'padding': 12, 'curve': 'basis'}, 'sequence': {'actorMargin': 60, 'messageMargin': 40}, 'class': {'padding': 12}}}%%
flowchart LR
    subgraph Embed["Embedding (Denormalized)"]
        direction TB
        E1["{<br/>  order: {...},<br/>  items: [...],<br/>  customer: {...}<br/>}"]
    end

    subgraph Ref["Referencing (Normalized)"]
        direction TB
        R1["orders: { customerId: ObjectId }"]
        R2["customers: { _id: ObjectId }"]
        R1 -.-> R2
    end

    style E1 fill:#D1FAE5,stroke:#6EE7B7,color:#065F46
    style R1 fill:#DBEAFE,stroke:#93C5FD,color:#1E40AF
    style R2 fill:#FEF3C7,stroke:#FCD34D,color:#92400E

Factor	Embed	Reference
Read pattern	Always read together	Read independently
Data size	Small/bounded subdocuments	Large or unbounded
Update pattern	Updated together	Updated independently
Cardinality	1:few (embed many side)	1:many or many:many
Document size	Under 16 MB	Could exceed 16 MB
Consistency	Atomic (single document)	Requires transactions

Schema Design Patterns

Pattern	When to Use	Example
Embedded	Tightly coupled data, read together	Order with line items
Subset	Only need recent/top-N of a large array	Product with latest 10 reviews embedded
Computed	Avoid expensive aggregations on read	Pre-computed totals, averages
Bucket	Time-series data, IoT measurements	Group sensor readings by hour
Outlier	99% small, 1% huge	Most users have 5 posts; some have 50K
Extended Ref	Need some fields from related doc	Embed `customerName` + `customerId`

Example: Blog Post Schema

JavaScript

// Embedded pattern (good for bounded data)
{
  _id: ObjectId("..."),
  title: "MongoDB Schema Design",
  author: {
    _id: ObjectId("..."),
    name: "Jane Smith",
    avatar: "/img/jane.jpg"  // Extended reference
  },
  content: "...",
  tags: ["mongodb", "nosql", "schema-design"],
  comments: [   // Embed if bounded (e.g., max 50)
    { user: "Bob", text: "Great post!", createdAt: new Date() },
    { user: "Alice", text: "Very helpful", createdAt: new Date() }
  ],
  stats: {      // Computed pattern
    views: 1524,
    likes: 47,
    commentCount: 2
  },
  createdAt: new Date(),
  updatedAt: new Date()
}

Transactions in MongoDB (4.0+)

Multi-document transactions provide ACID guarantees across multiple documents and collections.

Java

// Spring Data MongoDB transaction
@Transactional
public void transferFunds(String fromAccount, String toAccount, BigDecimal amount) {
    Account from = accountRepo.findById(fromAccount).orElseThrow();
    Account to = accountRepo.findById(toAccount).orElseThrow();

    if (from.getBalance().compareTo(amount) < 0) {
        throw new InsufficientFundsException();
    }

    from.debit(amount);
    to.credit(amount);

    accountRepo.save(from);
    accountRepo.save(to);
    // Both or neither — ACID guaranteed
}

JavaScript

// Native MongoDB transaction
const session = client.startSession();
try {
  session.startTransaction({
    readConcern: { level: "snapshot" },
    writeConcern: { w: "majority" }
  });

  db.accounts.updateOne(
    { _id: fromAccount },
    { $inc: { balance: -amount } },
    { session }
  );

  db.accounts.updateOne(
    { _id: toAccount },
    { $inc: { balance: amount } },
    { session }
  );

  session.commitTransaction();
} catch (error) {
  session.abortTransaction();
  throw error;
} finally {
  session.endSession();
}

Transaction Limitations

Max 60 seconds runtime (configurable)
Performance overhead (snapshot isolation)
Not designed for long-running operations
If you need transactions frequently, reconsider your schema (embedding may eliminate the need)

Spring Data MongoDB Integration

Configuration

YAML

# application.yml
spring:
  data:
    mongodb:
      uri: mongodb://localhost:27017/ecommerce
      # or individual properties:
      host: localhost
      port: 27017
      database: ecommerce
      authentication-database: admin
      username: app_user
      password: ${MONGO_PASSWORD}

Entity Mapping

Java

@Document(collection = "orders")
@CompoundIndex(name = "status_date_idx", def = "{'status': 1, 'createdAt': -1}")
public class Order {

    @Id
    private String id; // Maps to _id (ObjectId if String)

    @Indexed
    private String customerId;

    @Field("items") // Custom field name
    private List<OrderItem> lineItems;

    private BigDecimal total;

    @Indexed
    private OrderStatus status;

    @CreatedDate
    private Instant createdAt;

    @LastModifiedDate
    private Instant updatedAt;

    @Version
    private Long version; // Optimistic locking
}

// Embedded document (no @Document, no @Id)
public class OrderItem {
    private String productId;
    private String productName;
    private int quantity;
    private BigDecimal price;
}

Repository

Java

public interface OrderRepository extends MongoRepository<Order, String> {

    List<Order> findByStatusAndTotalGreaterThan(OrderStatus status, BigDecimal minTotal);

    @Query("{ 'status': ?0, 'createdAt': { $gte: ?1 } }")
    List<Order> findRecentByStatus(OrderStatus status, Instant since);

    @Query(value = "{ 'customerId': ?0 }", fields = "{ 'total': 1, 'status': 1 }")
    List<Order> findOrderSummariesByCustomer(String customerId);

    @Aggregation(pipeline = {
        "{ $match: { status: 'COMPLETED' } }",
        "{ $group: { _id: '$customerId', totalSpent: { $sum: '$total' } } }",
        "{ $sort: { totalSpent: -1 } }",
        "{ $limit: 10 }"
    })
    List<CustomerSpending> findTopSpenders();
}

MongoTemplate for Complex Operations

Java

@Service
public class OrderAnalyticsService {

    @Autowired
    private MongoTemplate mongoTemplate;

    public List<CategoryRevenue> getRevenueByCategory(Instant since) {
        Aggregation agg = Aggregation.newAggregation(
            match(Criteria.where("status").is("COMPLETED")
                          .and("createdAt").gte(since)),
            unwind("lineItems"),
            group("lineItems.category")
                .sum(ArithmeticOperators.Multiply.valueOf("lineItems.price")
                    .multiplyBy("lineItems.quantity")).as("revenue")
                .count().as("orderCount"),
            sort(Sort.Direction.DESC, "revenue"),
            project()
                .and("_id").as("category")
                .andInclude("revenue", "orderCount")
                .andExclude("_id")
        );

        return mongoTemplate.aggregate(agg, "orders", CategoryRevenue.class)
            .getMappedResults();
    }
}

When to Choose MongoDB vs SQL

%%{init: {'theme': 'base', 'themeVariables': {'fontSize': '13px', 'fontFamily': 'Inter, -apple-system, sans-serif'}, 'flowchart': {'nodeSpacing': 30, 'rankSpacing': 50, 'padding': 12, 'curve': 'basis'}, 'sequence': {'actorMargin': 60, 'messageMargin': 40}, 'class': {'padding': 12}}}%%
flowchart TD
    Q1{"Need complex JOINs<br/>across many tables?"}
    Q1 -->|Yes| SQL["Use SQL<br/>(PostgreSQL, MySQL)"]
    Q1 -->|No| Q2{"Schema changes<br/>frequently?"}
    Q2 -->|Yes| Mongo["Use MongoDB"]
    Q2 -->|No| Q3{"Need horizontal<br/>write scaling?"}
    Q3 -->|Yes| Mongo
    Q3 -->|No| Q4{"Strong consistency<br/>across related data?"}
    Q4 -->|Yes| SQL
    Q4 -->|No| Q5{"Hierarchical/<br/>nested data?"}
    Q5 -->|Yes| Mongo
    Q5 -->|No| SQL

    style Q1 fill:#DBEAFE,stroke:#93C5FD,color:#1E40AF
    style Q2 fill:#DBEAFE,stroke:#93C5FD,color:#1E40AF
    style Q3 fill:#DBEAFE,stroke:#93C5FD,color:#1E40AF
    style Q4 fill:#DBEAFE,stroke:#93C5FD,color:#1E40AF
    style Q5 fill:#DBEAFE,stroke:#93C5FD,color:#1E40AF
    style SQL fill:#FEF3C7,stroke:#FCD34D,color:#92400E
    style Mongo fill:#D1FAE5,stroke:#6EE7B7,color:#065F46

Choose MongoDB	Choose SQL
Variable/evolving schema	Fixed, well-known schema
High write throughput	Complex transactions across tables
Hierarchical data (nested)	Heavy reporting/analytics with JOINs
Horizontal scaling needed	Strict referential integrity
Rapid prototyping	Financial/banking (strong ACID)
Content management, catalogs	ERP, inventory with relationships
IoT / time-series data	Multi-table aggregations
Geospatial queries	Mature tooling ecosystem

Interview Questions

Q: How does MongoDB handle schema flexibility, and what are the trade-offs?

MongoDB is schema-less at the database level — documents in the same collection can have different fields. This provides:

Benefits: (1) Faster development iterations. (2) Handles polymorphic data (different product types with different attributes). (3) No costly ALTER TABLE migrations. (4) Natural fit for semi-structured data.

Trade-offs: (1) Application must handle schema validation (or use MongoDB Schema Validation rules). (2) Inconsistent data possible if multiple services write differently. (3) No compile-time safety for field names. (4) Query planning harder without guaranteed schema.

Best practice: Use schema validation rules in production ($jsonSchema validator) to enforce structure while retaining flexibility for optional fields.

Q: Explain the difference between embedding and referencing in MongoDB schema design. When would you use each?

Embedding: Nest related data within the parent document. Use when: (1) Data is always read together. (2) Cardinality is bounded (1:few). (3) Child data doesn't need independent access. (4) Updates are atomic (single document).

Referencing: Store ObjectId references to other collections. Use when: (1) Data is accessed independently. (2) Cardinality is unbounded (1:millions). (3) Document would exceed 16 MB. (4) Many-to-many relationships. (5) Data is updated independently.

Rule of thumb: If you would use JOIN every time you read, embed. If you read the related data <20% of the time, reference.

Example: Order with items -> embed (always read together). User with orders -> reference (user has thousands of orders over time).

Q: What is the aggregation pipeline in MongoDB? How does it compare to SQL?

The aggregation pipeline processes documents through a sequence of stages, where each stage transforms the document stream. It's MongoDB's equivalent to complex SQL queries with GROUP BY, HAVING, and subqueries.

Key stages: $match (WHERE), $group (GROUP BY), $project (SELECT), $sort (ORDER BY), $lookup (JOIN), $unwind (flatten arrays), $facet (multiple parallel pipelines).

Advantages over SQL: (1) Pipeline stages are composable and reorderable. (2) Can process nested/array data natively. (3) $facet runs multiple aggregations in one pass. (4) Stages can be conditionally added programmatically.

Performance tip: Put $match early (uses indexes), then $project to reduce document size before expensive stages.

Q: How does MongoDB achieve high availability and horizontal scaling?

High Availability (Replica Sets): Minimum 3 nodes. One primary handles writes; secondaries replicate asynchronously. On primary failure, automatic election chooses new primary (10-12 sec failover). Write concern w: majority ensures durability.

Horizontal Scaling (Sharding): Data distributed across shards by shard key. mongos routers direct queries to correct shard(s). Two strategies: (1) Hashed sharding — even distribution, no range queries on key. (2) Range sharding — efficient ranges, possible hot spots.

Choosing shard key is critical: high cardinality, even distribution, matches query patterns. Bad shard key (e.g., timestamp) causes all writes to hit one shard.

Q: When would you choose MongoDB over PostgreSQL for a new project?

Choose MongoDB when: (1) Schema evolves frequently (startup, MVP). (2) Data is naturally hierarchical/nested (catalogs, CMS). (3) Need horizontal write scaling beyond single node. (4) Geospatial queries are primary use case. (5) Variable structure per document (polymorphic entities). (6) High write throughput with eventual consistency acceptable.

Choose PostgreSQL when: (1) Complex relationships requiring JOINs. (2) Need strong ACID across multiple tables. (3) Reporting/analytics with complex aggregations. (4) Team expertise is SQL-centric. (5) JSON flexibility needed within relational model (JSONB). (6) Extensions matter (PostGIS, full-text search, etc.).

Both support JSON, indexing, replication, and transactions. The deciding factor is usually your access patterns and relationship complexity.

Quick Recall

Concept	Key Point
Document max size	16 MB (BSON)
ObjectId	12 bytes: timestamp + random + counter (sortable by time)
Embedding rule	Embed if always read together and bounded cardinality
Referencing rule	Reference if independent access or unbounded growth
ESR indexing	Equality, Sort, Range — field order in compound indexes
Aggregation	Pipeline of stages: $match -> $group -> $sort -> $project
$lookup	LEFT OUTER JOIN equivalent (aggregation stage)
Replica Set	3+ nodes, automatic failover, write to primary only
Sharding	Horizontal scaling by shard key, use hashed for even distribution
Transactions	Multi-document ACID since 4.0, 60-second limit
Spring Data	`@Document`, `MongoRepository`, `MongoTemplate`
Write concern	`w: "majority"` for durability across replica set

MongoDB — Document Database Deep Dive

Document Model vs Relational

MongoDB Terminology Mapping

BSON Format & ObjectId

BSON vs JSON

ObjectId Structure

CRUD Operations

Create

Read (Find)

Update

Delete

Query Operators

Indexing

Index Types

Compound Index Order Matters (ESR Rule)

explain() for Query Analysis

Aggregation Pipeline

Common Pipeline Stages

Real-World Aggregation Examples

$lookup (JOIN equivalent)

$facet (Multiple Aggregations in One Query)

Replica Sets and Sharding

Replica Sets (High Availability)

Sharding (Horizontal Scaling)

Schema Design Patterns

Embedding vs Referencing

Schema Design Patterns

Example: Blog Post Schema

Transactions in MongoDB (4.0+)

Spring Data MongoDB Integration

Configuration

Entity Mapping

Repository

MongoTemplate for Complex Operations

When to Choose MongoDB vs SQL

Interview Questions

Quick Recall

See Also

MongoDB — Document Database Deep Dive

Document Model vs Relational

MongoDB Terminology Mapping

BSON Format & ObjectId

BSON vs JSON

ObjectId Structure

CRUD Operations

Create

Read (Find)

Update

Delete

Query Operators

Indexing

Index Types

Compound Index Order Matters (ESR Rule)

explain() for Query Analysis

Aggregation Pipeline

Common Pipeline Stages

Real-World Aggregation Examples

$lookup (JOIN equivalent)

$facet (Multiple Aggregations in One Query)

Replica Sets and Sharding

Replica Sets (High Availability)

Sharding (Horizontal Scaling)

Schema Design Patterns

Embedding vs Referencing

Schema Design Patterns

Example: Blog Post Schema

Transactions in MongoDB (4.0+)

Spring Data MongoDB Integration

Configuration

Entity Mapping

Repository

MongoTemplate for Complex Operations

When to Choose MongoDB vs SQL

Interview Questions

Quick Recall

See Also

5-Minute System Design — Weekly