MongoDB Interview Questions

Q: Explain what is MongoDB?
A: MongoDB is a document database that provides high performance, high availability, and easy scalability.

Q: What is “Namespace” in MongoDB?
A: MongoDB stores BSON (Binary JSON, a binary-encoded serialization of JSON-like documents) objects in collections. The concatenation of the database name and the collection name, separated by a period, is called a namespace.

Q: What is sharding in MongoDB?
A: The procedure of storing data records across multiple machines is referred to as sharding. It is MongoDB's approach to meeting the demands of data growth: the horizontal partitioning of data in a database, where each partition is referred to as a shard or database shard.

Q: How can you see the connections used by mongos?
A: To see the connections used by mongos, run db._adminCommand("connPoolStats").

Q: Explain what is a replica set?
A: A replica set is a group of mongod instances that host the same data set. In a replica set, one node is the primary and the others are secondaries; all data replicates from the primary to the secondary nodes.

Q: How does replication work in MongoDB?
A: Replication is the process of synchronizing data across multiple servers. It provides redundancy and increases data availability by keeping multiple copies of data on different database servers, protecting the database from the loss of a single server.

Q: What points need to be taken into consideration while creating a schema in MongoDB?
A: Consider the following points:

• Design your schema according to user requirements
• Combine objects into one document if you use them together; otherwise, separate them
• Do joins on write, not on read
• Optimize your schema for the most frequent use cases
• Do complex aggregation in the schema

Q: What is the syntax to create a collection and to drop a collection in MongoDB?
A: • Syntax to create a collection in MongoDB: db.createCollection(name, options)
• Syntax to drop a collection in MongoDB: db.collection.drop()
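For example, a minimal shell session (the collection name mycol and the capped-collection options are hypothetical):

db.createCollection("mycol", { capped: true, size: 1048576 })
db.mycol.drop()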

Q: Explain what is the role of profiler in MongoDB?
A: The MongoDB database profiler shows performance characteristics of each operation against the database. Using the profiler, you can find queries that are slower than they should be.
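For example, you might enable the profiler for slow operations and then inspect its output; the 100 millisecond threshold below is only an illustrative value:

db.setProfilingLevel(1, 100)                          // profile operations slower than 100 ms
db.system.profile.find().sort({ ts: -1 }).limit(5)    // view the five most recent profile entries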

Q: Can you move old files in the moveChunk directory?
A: Yes, it is possible to move old files in the moveChunk directory. These files are created as backups during normal shard balancing operations and can be deleted once those operations are done.

Q: What feature in MongoDB can you use to do safe backups?
A: Journaling is the feature in MongoDB that you can use to do safe backups.

Q: Mention what is Objecld composed of?
A: Objectld is composed of
• Timestamp
• Client machine ID
• Client process ID
• 3 byte incremented counter

Q: Mention what is the command syntax for inserting a document?
A: The command syntax for inserting a document is db.collection.insert(document).

Q: Mention how you can inspect the source code of a function?
A: To inspect the source code of a function, invoke the function in the shell without parentheses; the shell prints the function's source.
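For example, typing a shell helper's name without parentheses prints its implementation (output abbreviated here):

> db.getCollectionNames
function () {
    ... the shell prints the function's source here ...
}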

Q: What command syntax tells you whether you are on the master server or not? And how many masters does MongoDB allow?
A: The command db.isMaster() will tell you whether you are on the master server or not. MongoDB allows only one master server, while CouchDB allows multiple masters.

Q: Mention the command syntax that is used to view the connections that mongos is using.
A: The command syntax is db._adminCommand("connPoolStats").

Q: Explain what are indexes in MongoDB?
A: Indexes are special structures in MongoDB that store a small portion of the data set in an easy-to-traverse form. The index stores the value of a specific field or set of fields, ordered by the value of the field as specified in the index.

Q: Mention what is the basic syntax to use index in MongoDB?
A: The basic syntax to create an index in MongoDB is db.COLLECTION_NAME.ensureIndex( { KEY: 1 } ). Here KEY is the name of the field on which you want to create the index; 1 creates the index in ascending order, while -1 creates it in descending order.
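For example, assuming a hypothetical users collection, the following creates an ascending index on name and a descending index on age:

db.users.ensureIndex( { name: 1 } )
db.users.ensureIndex( { age: -1 } )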

Q: Explain what is GridFS in MongoDB?
A: GridFS is used for storing and retrieving large files such as images, video files, and audio files. By default, it uses two collections, fs.files and fs.chunks, to store the file's metadata and its chunks.

Q: What are alternatives to MongoDB?
A: Cassandra, CouchDB, Redis, Riak, HBase are a few good alternatives.


MongoDB Fundamentals

Q: What kind of database is MongoDB?
A: MongoDB is a document-oriented DBMS. Think of MySQL but with JSON-like objects comprising the data model, rather than RDBMS tables. Significantly, MongoDB supports neither joins nor transactions. However, it features secondary indexes, an expressive query language, atomic writes on a per-document level, and fully-consistent reads.
Operationally, MongoDB features master-slave replication with automated failover and built-in horizontal scaling via automated range-based partitioning.

Note: MongoDB uses BSON, a binary object format similar to, but more expressive than JSON.

Q: Do MongoDB databases have tables?
A: Instead of tables, a MongoDB database stores its data in collections, which are the rough equivalent of RDBMS tables. A collection holds one or more documents, each of which corresponds to a record or a row in a relational database table, and each document has one or more fields, which correspond to the columns in a relational database table.
Collections have important differences from RDBMS tables. Documents in a single collection may have a unique combination and set of fields. Documents need not have identical fields. You can add a field to some documents in a collection without adding that field to all documents in the collection.

Q: Do MongoDB databases have schemas?
A: MongoDB uses dynamic schemas. You can create collections without defining the structure, i.e. the fields or the types of their values, of the documents in the collection. You can change the structure of documents simply by adding new fields or deleting existing ones. Documents in a collection need not have an identical set of fields.
In practice, it is common for the documents in a collection to have a largely homogeneous structure; however, this is not a requirement. MongoDB’s flexible schemas mean that schema migration and augmentation are very easy in practice, and you will rarely, if ever, need to write scripts that perform “alter table” type operations, which simplifies and facilitates iterative software development with MongoDB.

Q: What languages can I use to work with MongoDB?
A: MongoDB client drivers exist for all of the most popular programming languages, and many other ones. See the latest list of drivers for details.

Q: Does MongoDB support SQL?
A: No.
However, MongoDB does support a rich, ad-hoc query language of its own.

Q: What are typical uses for MongoDB?
A: MongoDB has a general-purpose design, making it appropriate for a large number of use cases. Examples include content management systems, mobile applications, gaming, e-commerce, analytics, archiving, and logging.
Do not use MongoDB for systems that require SQL, joins, and multi-object transactions.

Q: Does MongoDB support ACID transactions?
A: MongoDB does not support multi-document transactions.
However, MongoDB does provide atomic operations on a single document. Often these document-level atomic operations are sufficient to solve problems that would require ACID transactions in a relational database.

For example, in MongoDB, you can embed related data in nested arrays or nested documents within a single document and update the entire document in a single atomic operation. Relational databases might represent the same kind of data with multiple tables and rows, which would require transaction support to update the data atomically.
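As a sketch of this pattern (the orders collection and its fields are hypothetical), a single update can modify several parts of one document atomically:

db.orders.update(
    { _id: 1001 },
    { $inc: { total: 25 }, $push: { items: { sku: "A42", qty: 1 } } }   // both modifiers apply atomically to the one matched document
)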
MongoDB allows clients to read documents inserted or modified before it commits these modifications to disk, regardless of write concern level or journaling configuration. As a result, applications may observe two classes of behaviors:
For systems with multiple concurrent readers and writers, MongoDB will allow clients to read the results of a write operation before the write operation returns.
If the mongod terminates before the journal commits, even if a write returns successfully, queries may have read data that will not exist after the mongod restarts.

Other database systems refer to these isolation semantics as read uncommitted. For all inserts and updates, MongoDB modifies each document in isolation: clients never see documents in intermediate states. For multi-document operations, MongoDB does not provide any multi-document transactions or isolation.

When mongod returns a successful journaled write concern, the data is fully committed to disk and will be available after mongod restarts.

For replica sets, write operations are durable only after a write replicates and commits to the journal of a majority of the voting members of the set. [1] MongoDB regularly commits data to the journal regardless of journaled write concern: use the commitIntervalMs to control how often a mongod commits the journal.
[1]            For the purposes of write concern, w:majority refers to a majority of all the members in the set. As a result, arbiters, non-voting members, passive members, hidden members and delayed members are all included in the definition of majority write concern.

Q: Does MongoDB require a lot of RAM?
A: Not necessarily. It’s certainly possible to run MongoDB on a machine with a small amount of free RAM.
MongoDB automatically uses all free memory on the machine as its cache. System resource monitors show that MongoDB uses a lot of memory, but its usage is dynamic. If another process suddenly needs half the server’s RAM, MongoDB will yield cached memory to the other process.

Technically, the operating system’s virtual memory subsystem manages MongoDB’s memory. This means that MongoDB will use as much free memory as it can, swapping to disk as needed. Deployments with enough memory to fit the application’s working data set in RAM will achieve the best performance.


See FAQ: MongoDB Diagnostics for answers to additional questions about MongoDB and memory use.

Q: How do I configure the cache size?
A: MongoDB has no configurable cache. MongoDB uses all free memory on the system automatically by way of memory-mapped files. Operating systems use the same approach with their file system caches.

Q: Does MongoDB require a separate caching layer for application-level caching?
A: No. In MongoDB, a document’s representation in the database is similar to its representation in application memory. This means the database already stores the usable form of data, making the data usable in both the persistent store and in the application cache. This eliminates the need for a separate caching layer in the application.

This differs from relational databases, where caching data is more expensive. Relational databases must transform data into object representations that applications can read and must store the transformed data in a separate cache: if these transformations from data to application objects require joins, this process increases the overhead of using the database, which in turn increases the importance of the caching layer.

Q: Does MongoDB handle caching?
A: Yes. MongoDB keeps all of the most recently used data in RAM. If you have created indexes for your queries and your working data set fits in RAM, MongoDB serves all queries from memory.
MongoDB does not implement a query cache: MongoDB serves all queries directly from the indexes and/or data files.

Q: Are writes written to disk immediately, or lazily?
A: Writes are physically written to the journal within 100 milliseconds, by default. At that point, the write is “durable” in the sense that after a pull-plug-from-wall event, the data will still be recoverable after a hard restart. See commitIntervalMs for more information on the journal commit window.
While the journal commit is nearly instant, MongoDB writes to the data files lazily. MongoDB may wait to write data to the data files for as much as one minute by default. This does not affect durability, as the journal has enough information to ensure crash recovery. To change the interval for writing to the data files, see syncPeriodSecs.

Q: What language is MongoDB written in?
A: MongoDB is implemented in C++. Drivers and client libraries are typically written in their respective languages, although some drivers use C extensions for better performance.
Q: What are the limitations of 32-bit versions of MongoDB?
A: MongoDB uses memory-mapped files. When running a 32-bit build of MongoDB, the total storage size for the server, including data and indexes, is 2 gigabytes. For this reason, do not deploy MongoDB to production on 32-bit machines.
If you’re running a 64-bit build of MongoDB, there’s virtually no limit to storage size. For production deployments, 64-bit builds and operating systems are strongly recommended.
See the blog post “32-bit Limitations” for more information.

Note: 32-bit builds disable journaling by default because journaling further limits the maximum amount of data that the database can store.


FAQ: MongoDB for Application Developers.

Q: What is a namespace in MongoDB?
A: A “namespace” is the concatenation of the database name and the collection name [1] with a period character in between.
Collections are containers for documents that share one or more indexes. Databases are groups of collections stored on disk using a single set of data files. [2]
For example, in the acme.users namespace, acme is the database name and users is the collection name. Period characters can occur in collection names, so acme.user.history is also a valid namespace, with acme as the database name and user.history as the collection name.

While data models like this appear to support nested collections, the collection namespace is flat, and there is no difference from the perspective of MongoDB between acme, acme.users, and acme.records.
[1]            Each index also has its own namespace.
[2]            MongoDB databases have a configurable limit on the number of namespaces in a database.

Q: How do you copy all objects from one collection to another?
A: In the mongo shell, you can use the following operation to duplicate the entire collection:
db.source.copyTo(newCollection)
Warning: When using db.collection.copyTo() check field types to ensure that the operation does not remove type information from documents during the translation from BSON to JSON. Consider using cloneCollection() to maintain type fidelity.

The db.collection.copyTo() method uses the eval command internally. As a result, the db.collection.copyTo() operation takes a global lock that blocks all other read and write operations until the db.collection.copyTo() completes.
Also consider the cloneCollection command that may provide some of this functionality.

Q: If you remove a document, does MongoDB remove it from disk?
A: Yes.
When you use remove(), the object will no longer exist in MongoDB’s on-disk data storage.

Q: When does MongoDB write updates to disk?
A: MongoDB flushes writes to disk on a regular interval. In the default configuration, MongoDB writes data to the main data files on disk every 60 seconds and commits the journal roughly every 100 milliseconds. These values are configurable with the commitIntervalMs and syncPeriodSecs.
These values represent the maximum amount of time between the completion of a write operation and the point when the write is durable in the journal, if enabled, and when MongoDB flushes data to the disk. In many cases MongoDB and the operating system flush data to disk more frequently, so the above values represent a theoretical maximum.
However, by default, MongoDB uses a “lazy” strategy to write to disk. This is advantageous: in situations where the database receives a thousand increments to an object within one second, MongoDB only needs to flush this data to disk once. In addition to the aforementioned configuration options, you can also use fsync and the Write Concern Reference to modify this strategy.

Q: How do I do transactions and locking in MongoDB?
A: MongoDB does not have support for traditional locking or complex transactions with rollback. MongoDB aims to be lightweight, fast, and predictable in its performance. This is similar to the MySQL MyISAM autocommit model. By keeping transaction support extremely simple, MongoDB can provide greater performance especially for partitioned or replicated systems with a number of database server processes.
MongoDB does have support for atomic operations within a single document. Given the possibilities provided by nested documents, this feature provides support for a large number of use-cases.

Q: How do you aggregate data with MongoDB?
A: In version 2.1 and later, you can use the new aggregation framework, with the aggregate command.
MongoDB also supports map-reduce with the mapReduce command, as well as basic aggregation with the group, count, and distinct commands.
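For example, a minimal aggregation pipeline (the collection and field names are hypothetical) that filters documents and then groups them:

db.orders.aggregate( [
    { $match: { status: "shipped" } },
    { $group: { _id: "$cust_id", total: { $sum: "$amount" } } }
] )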

Q: Why does MongoDB log so many “Connection Accepted” events?
A: If you see a very large number of connection and re-connection messages in your MongoDB log, then clients are frequently connecting and disconnecting to the MongoDB server. This is normal behavior for applications that do not use request pooling, such as CGI. Consider using FastCGI, an Apache Module, or some other kind of persistent application server to decrease the connection overhead.
If these connections do not impact your performance, you can use the run-time quiet option or the command-line option --quiet to suppress these messages from the log.

Q: Does MongoDB run on Amazon EBS?
A: Yes.
MongoDB users of all sizes have had a great deal of success using MongoDB on the EC2 platform using EBS disks.

Q: Why are MongoDB’s data files so large?
A: MongoDB aggressively preallocates data files to reserve space and avoid file system fragmentation. You can use the storage.smallFiles setting to modify the file preallocation strategy.

Q: How do I optimize storage use for small documents?
A: Each MongoDB document contains a certain amount of overhead. This overhead is normally insignificant but becomes significant if all documents are just a few bytes, as might be the case if the documents in your collection only have one or two fields.

Consider the following suggestions and strategies for optimizing storage utilization for these collections:
    Use the _id field explicitly.
    MongoDB clients automatically add an _id field to each document and generate a unique 12-byte ObjectId for the _id field. Furthermore, MongoDB always indexes the _id field. For smaller documents this may account for a significant amount of space.

    To optimize storage use, users can specify a value for the _id field explicitly when inserting documents into the collection. This strategy allows applications to store a value in the _id field that would have occupied space in another portion of the document (see the example after this list).

    You can store any value in the _id field, but because this value serves as a primary key for documents in the collection, it must uniquely identify them. If the field’s value is not unique, then it cannot serve as a primary key as there would be collisions in the collection.

    Use shorter field names.
    MongoDB stores all field names in every document. For most documents, this represents a small fraction of the space used by a document; however, for small documents the field names may represent a proportionally large amount of space. Consider a collection of documents that resemble the following:
    { last_name : "Smith", best_score: 3.9 }

    If you shorten the field named last_name to lname and the field named best_score to score, as follows, you could save 9 bytes per document.
    { lname : "Smith", score : 3.9 }

    Shortening field names reduces expressiveness and does not provide considerable benefit for larger documents and where document overhead is not of significant concern. Shorter field names do not reduce the size of indexes, because indexes have a predefined structure.
    In general it is not necessary to use short field names.
    Embed documents.
    In some cases you may want to embed documents in other documents and save on the per-document overhead.
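To illustrate the first two suggestions together, the following hypothetical insert supplies a meaningful value for _id (one the document would otherwise store in a field of its own) and uses the shortened field names from the example above:

db.scores.insert( { _id: "smith@example.com", lname: "Smith", score: 3.9 } )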

Q: When should I use GridFS?
A: In a MongoDB database, always use GridFS for storing files larger than 16 MB.
In some situations, storing large files may be more efficient in a MongoDB database than on a system-level filesystem.

    If your filesystem limits the number of files in a directory, you can use GridFS to store as many files as needed.
    When you want to keep your files and metadata automatically synced and deployed across a number of systems and facilities. When using geographically distributed replica sets, MongoDB can distribute files and their metadata automatically to a number of mongod instances and facilities.
    When you want to access information from portions of large files without having to load whole files into memory, you can use GridFS to recall sections of files without reading the entire file into memory.

Do not use GridFS if you need to update the content of the entire file atomically. As an alternative you can store multiple versions of each file and specify the current version of the file in the metadata. You can update the metadata field that indicates “latest” status in an atomic update after uploading the new version of the file, and later remove previous versions if needed.

Furthermore, if your files are all smaller than the 16 MB BSON Document Size limit, consider storing each file manually within a single document. You may use the BinData data type to store the binary data. See your driver's documentation for details on using BinData.

Q: How does MongoDB address SQL or Query injection?
A: BSON
As a client program assembles a query in MongoDB, it builds a BSON object, not a string. Thus traditional SQL injection attacks are not a problem. More details and some nuances are covered below.
MongoDB represents queries as BSON objects. Typically client libraries provide a convenient, injection free, process to build these objects. Consider the following C++ example:

BSONObj my_query = BSON( "name" << a_name );
auto_ptr<DBClientCursor> cursor = c.query("tutorial.persons", my_query);

Here, my_query will then have a value such as { name : "Joe" }. If my_query contained special characters, for example comma, colon, and brace characters, the query simply wouldn't match any documents. For example, users cannot hijack a query and convert it to a delete.
JavaScript

Note: You can disable all server-side execution of JavaScript, by passing the --noscripting option on the command line or setting security.javascriptEnabled in a configuration file.
All of the following MongoDB operations permit you to run arbitrary JavaScript expressions directly on the server:

    $where
    db.eval()
    mapReduce
    group

You must exercise care in these cases to prevent users from submitting malicious JavaScript.
Fortunately, you can express most queries in MongoDB without JavaScript and for queries that require JavaScript, you can mix JavaScript and non-JavaScript in a single query. Place all the user-supplied fields directly in a BSON field and pass JavaScript code to the $where field.
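For example, in the following hypothetical query the user-supplied value ("A") stays in an ordinary BSON field while the fixed JavaScript lives in $where:

db.accounts.find( { status: "A", $where: "this.credits - this.debits < 0" } )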
If you need to pass user-supplied values in a $where clause, you may escape these values with the CodeWScope mechanism. When you set user-submitted values as variables in the scope document, you can avoid evaluating them on the database server.
If you need to use db.eval() with user supplied values, you can either use a CodeWScope or you can supply extra arguments to your function. For instance:

db.eval(function(userVal){...}, user_value);
This will ensure that your application sends user_value to the database server as data rather than code.

Dollar Sign Operator Escaping
Field names in MongoDB’s query language have semantic meaning. The dollar sign (i.e. $) is a reserved character used to represent operators (i.e. $inc). Thus, you should ensure that your application’s users cannot inject operators into their inputs.

In some cases, you may wish to build a BSON object with a user-provided key. In these situations, you will need to substitute the reserved $ and . characters in keys. Any character is sufficient, but consider using the Unicode full width equivalents: U+FF04 (i.e. “＄”) and U+FF0E (i.e. “．”).
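A minimal sketch of such a substitution, performed in shell-side JavaScript before building the object:

function escapeKey(key) {
    // replace $ (U+0024) with ＄ (U+FF04) and . (U+002E) with ． (U+FF0E)
    return key.replace(/\$/g, "\uFF04").replace(/\./g, "\uFF0E");
}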

Consider the following example:

BSONObj my_object = BSON( a_key << a_name );

The user may have supplied a $ value in the a_key value. At the same time, my_object might be { $where : "things" }. Consider the following cases:

    Insert. Inserting this into the database does no harm. The insert process does not evaluate the object as a query.

    Note:
    MongoDB client drivers, if properly implemented, check for reserved characters in keys on inserts.

    Update. The update() operation permits $ operators in the update argument but does not support the $where operator. Still, some users may be able to inject operators that can manipulate a single document only. Therefore your application should escape keys, as mentioned above, if reserved characters are possible.

    Query. Generally this is not a problem for queries that resemble { x : user_obj }: dollar signs are not top level and have no effect. Theoretically it may be possible for the user to build a query themselves, but checking the user-submitted content for $ characters in key names may help protect against this kind of injection.

Q: How does MongoDB provide concurrency?
A: MongoDB implements a readers-writer lock. This means that at any one time, only one client may be writing or any number of clients may be reading, but that reading and writing cannot occur simultaneously.
In standalone and replica sets the lock’s scope applies to a single mongod instance or primary instance. In a sharded cluster, locks apply to each individual shard, not to the whole cluster.

Q: What is the compare order for BSON types?
A: MongoDB permits documents within a single collection to have fields with different BSON types. For instance, the following documents may exist within a single collection.
{ x: "string" }
{ x: 42 }

When comparing values of different BSON types, MongoDB uses the following comparison order, from lowest to highest:
    MinKey (internal type)
    Null
    Numbers (ints, longs, doubles)
    Symbol, String
    Object
    Array
    BinData
    ObjectId
    Boolean
    Date, Timestamp
    Regular Expression
    MaxKey (internal type)

MongoDB treats some types as equivalent for comparison purposes. For instance, numeric types undergo conversion before comparison.

The comparison treats a non-existent field as it would an empty BSON Object. As such, a sort on the a field in documents { } and { a: null } would treat the documents as equivalent in sort order.

With arrays, a less-than comparison or an ascending sort compares the smallest element of arrays, and a greater-than comparison or a descending sort compares the largest element of the arrays. As such, when comparing a field whose value is a single-element array (e.g. [ 1 ]) with non-array fields (e.g. 2), the comparison is between 1 and 2. A comparison of an empty array (e.g. [ ]) treats the empty array as less than null or a missing field.

MongoDB sorts BinData in the following order:

    First, the length or size of the data.
    Then, by the BSON one-byte subtype.
    Finally, by the data, performing a byte-by-byte comparison.

Consider the following mongo example:

db.test.insert( {x : 3 } );
db.test.insert( {x : 2.9 } );
db.test.insert( {x : new Date() } );
db.test.insert( {x : true } );

db.test.find().sort({x:1});
{ "_id" : ObjectId("4b03155dce8de6586fb002c7"), "x" : 2.9 }
{ "_id" : ObjectId("4b03154cce8de6586fb002c6"), "x" : 3 }
{ "_id" : ObjectId("4b031566ce8de6586fb002c9"), "x" : true }
{ "_id" : ObjectId("4b031563ce8de6586fb002c8"), "x" : "Tue Nov 17 2009 16:28:03 GMT-0500 (EST)" }

The $type operator provides access to BSON type comparison in the MongoDB query syntax. See the documentation on BSON types and the $type operator for additional information.

Warning: Data models that associate a field name with different data types within a collection are strongly discouraged.
This lack of internal consistency complicates application code and can lead to unnecessary complexity for application developers.

Q: When multiplying values of mixed types, what type conversion rules apply?
A: The $mul operator multiplies the numeric value of a field by a number. For multiplication with values of mixed numeric types (32-bit integer, 64-bit integer, float), the following type conversion rules apply:
                  32-bit Integer              64-bit Integer    Float
32-bit Integer    32-bit or 64-bit Integer    64-bit Integer    Float
64-bit Integer    64-bit Integer              64-bit Integer    Float
Float             Float                       Float             Float

Note: If the product of two 32-bit integers exceeds the maximum value for a 32-bit integer, the result is a 64-bit integer.
Integer operations of any type that exceed the maximum value for a 64-bit integer produce an error.
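For example, the following hypothetical update multiplies a 32-bit integer price by a float, producing a float per the table above:

db.products.update(
    { _id: 1 },
    { $mul: { price: 1.25 } }    // NumberInt * float -> float
)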

Q: How do I query for fields that have null values?
A: Different query operators treat null values differently.
Consider the collection test with the following documents:
{ _id: 1, cancelDate: null }
{ _id: 2 }

Comparison with Null
The { cancelDate : null } query matches documents that either contain the cancelDate field whose value is null or that do not contain the cancelDate field. If the queried index is sparse, however, then the query will only match null values, not missing fields.
Changed in version 2.6: If using the sparse index results in an incomplete result, MongoDB will not use the index unless a hint() explicitly specifies the index. See Sparse Indexes for more information.

Given the following query:
db.test.find( { cancelDate: null } )

The query returns both documents:
{ "_id" : 1, "cancelDate" : null }
{ "_id" : 2 }

Type Check
The { cancelDate : { $type: 10 } } query matches only documents that contain the cancelDate field whose value is null; i.e. the value of the cancelDate field is of BSON Type Null (i.e. 10):
db.test.find( { cancelDate : { $type: 10 } } )

The query returns only the document that contains the null value:
{ "_id" : 1, "cancelDate" : null }

Existence Check
The { cancelDate : { $exists: false } } query matches documents that do not contain the cancelDate field:
db.test.find( { cancelDate : { $exists: false } } )
The query returns only the document that does not contain the cancelDate field:
{ "_id" : 2 }

Q: Are there any restrictions on the names of Collections?
A: Collection names can be any UTF-8 string with the following exceptions:
    A collection name should begin with a letter or an underscore.
    The empty string ("") is not a valid collection name.
    Collection names cannot contain the $ character. (version 2.2 only)
    Collection names cannot contain the null character: \0
    Do not name a collection using the system. prefix. MongoDB reserves system. for system collections, such as the system.indexes collection.
    The maximum size of a collection name is 128 characters, including the name of the database. However, for maximum flexibility, collections should have names less than 80 characters.

If your collection name includes special characters, such as the underscore character, then to access the collection use the db.getCollection() method or a similar method for your driver.

Example
To create a collection _foo and insert the { a : 1 } document, use the following operation:
db.getCollection("_foo").insert( { a : 1 } )
To perform a query, use the find() method, as in the following:
db.getCollection("_foo").find()

Q: How do I isolate cursors from intervening write operations?
A: MongoDB cursors can return the same document more than once in some situations. [3] You can use the snapshot() method on a cursor to isolate the operation for a very specific case.
snapshot() traverses the index on the _id field and guarantees that the query will return each document (with respect to the value of the _id field) no more than once. [4]

The snapshot() does not guarantee that the data returned by the query will reflect a single moment in time nor does it provide isolation from insert or delete operations.
Warning:
    You cannot use snapshot() with sharded collections.
    You cannot use snapshot() with sort() or hint() cursor methods.

As an alternative, if your collection has a field or fields that are never modified, you can use a unique index on this field or these fields to achieve a similar result as the snapshot(). Query with hint() to explicitly force the query to use that index.
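For example, assuming a collection with a unique index on a never-modified sku field, you could force the query to use that index as follows:

db.inventory.find().hint( { sku: 1 } )    // sku is uniquely indexed and never modified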
[3]            As a cursor returns documents, other operations may interleave with the query: if some of these operations are updates that cause the document to move (in the case of a table scan, caused by document growth) or that change the indexed field on the index used by the query, then the cursor will return the same document more than once.
[4]            MongoDB does not permit changes to the value of the _id field; it is not possible for a cursor that traverses this index to pass the same document more than once.

Q: When should I embed documents within other documents?
A: When modeling data in MongoDB, embedding is frequently the choice for:
“contains” relationships between entities.
one-to-many relationships when the “many” objects always appear with or are viewed in the context of their parents.

You should also consider embedding for performance reasons if you have a collection with a large number of small documents. Nevertheless, if small, separate documents represent the natural model for the data, then you should maintain that model.

If, however, you can group these small documents by some logical relationship and you frequently retrieve the documents by this grouping, you might consider “rolling-up” the small documents into larger documents that contain an array of subdocuments. Keep in mind that if you often only need to retrieve a subset of the documents within the group, then “rolling-up” the documents may not provide better performance.

“Rolling up” these small documents into logical groupings means that queries to retrieve a group of documents involve sequential reads and fewer random disk accesses.

Additionally, “rolling up” documents and moving common fields to the larger document benefit the index on these fields. There would be fewer copies of the common fields and there would be fewer associated key entries in the corresponding index. See Index Concepts for more information on indexes.

Q: Where can I learn more about data modeling in MongoDB?
A: Begin by reading the documents in the Data Models section. These documents contain a high level introduction to data modeling considerations in addition to practical examples of data models targeted at particular issues.
Additionally, consider the following external resources that provide additional examples:

    Schema Design by Example
    Dynamic Schema Blog Post
    MongoDB Data Modeling and Rails
    Ruby Example of Materialized Paths
    Sean Cribs Blog Post which was the source for much of the Model Tree Structures in MongoDB content.

Q: Can I manually pad documents to prevent moves during updates?
A: An update can cause a document to move on disk if the document grows in size. To minimize document movements, MongoDB uses padding.
You should not have to pad manually because MongoDB adds padding automatically and can adaptively adjust the amount of padding added to documents to prevent document relocations following updates. You can change the default paddingFactor calculation by using the collMod command with the usePowerOf2Sizes flag. The usePowerOf2Sizes flag ensures that MongoDB allocates document space in sizes that are powers of 2, which helps ensure that MongoDB can efficiently reuse free space created by document deletion or relocation.
However, if you must pad a document manually, you can add a temporary field to the document and then $unset the field, as in the following example.

Warning: Do not manually pad documents in a capped collection. Applying manual padding to a document in a capped collection can break replication. Also, the padding is not preserved if you re-sync the MongoDB instance.

var myTempPadding = [ "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa",
                      "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa",
                      "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa",
                      "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"];

db.myCollection.insert( { _id: 5, paddingField: myTempPadding } );

db.myCollection.update( { _id: 5 },
                        { $unset: { paddingField: "" } }
                      )

db.myCollection.update( { _id: 5 },
                        { $set: { realField: "Some text that I might have needed padding for" } }
                      )


FAQ: The mongo Shell

Q: How can I enter multi-line operations in the mongo shell?
A: If you end a line with an open parenthesis ('('), an open brace ('{'), or an open bracket ('['), then the subsequent lines start with ellipsis ("...") until you enter the corresponding closing parenthesis (')'), the closing brace ('}') or the closing bracket (']'). The mongo shell waits for the closing parenthesis, closing brace, or the closing bracket before evaluating the code, as in the following example:

> if ( x > 0 ) {
... count++;
... print (x);
... }

You can exit the line continuation mode if you enter two blank lines, as in the following example:

> if (x > 0
...
...

Q: How can I access different databases temporarily?
A: You can use the db.getSiblingDB() method to access another database without switching databases, as in the following example, which first switches to the test database and then accesses the sampleDB database from the test database:

use test
db.getSiblingDB('sampleDB').getCollectionNames();

Q: Does the mongo shell support tab completion and other keyboard shortcuts?
A: The mongo shell supports keyboard shortcuts. For example, use the up/down arrow keys to scroll through command history. See the .dbshell documentation for more information on the .dbshell file.
Use <Tab> to autocomplete or to list the completion possibilities, as in the following example which uses <Tab> to complete the method name starting with the letter 'c':

db.myCollection.c<Tab>

Because there are many collection methods starting with the letter 'c', the <Tab> will list the various methods that start with 'c'.
For a full list of the shortcuts, see Shell Keyboard Shortcuts

Q: How can I customize the mongo shell prompt?
A: You can change the mongo shell prompt by setting the prompt variable. This makes it possible to display additional information in the prompt.
Set prompt to any string or arbitrary JavaScript code that returns a string. Consider the following examples:

    Set the shell prompt to display the hostname and the database issued:
    var host = db.serverStatus().host;
    var prompt = function() { return db+"@"+host+"> "; }

    The mongo shell prompt should now reflect the new prompt:
    test@my-machine.local>
    Set the shell prompt to display the database statistics:

    var prompt = function() {
         return "Uptime:"+db.serverStatus().uptime+" Documents:"+db.stats().objects+" > ";
     }

    The mongo shell prompt should now reflect the new prompt: Uptime:1052 Documents:25024787 >
You can add the logic for the prompt in the .mongorc.js file to set the prompt each time you start up the mongo shell.

Q: Can I edit long shell operations with an external text editor?
A: You can use your own editor in the mongo shell by setting the EDITOR environment variable before starting the mongo shell. Once in the mongo shell, you can edit with the specified editor by typing edit <variable> or edit <function>, as in the following example:

    Set the EDITOR variable from the command line prompt: EDITOR=vim
    Start the mongo shell: mongo
    Define a function myFunction: function myFunction () { }
    Edit the function using your editor: edit myFunction
    The command should open the vim edit session. Remember to save your changes.
    Type myFunction to see the function definition:
    myFunction
    The result should be the changes from your saved edit:

    function myFunction() {
        print("This was edited");
    }


FAQ: Concurrency
MongoDB allows multiple clients to read and write a single corpus of data using a locking system to ensure that all clients receive the same view of the data and to prevent multiple applications from modifying the exact same pieces of data at the same time. Locks help guarantee that all writes to a single document occur either in full or not at all.

Q: What type of locking does MongoDB use?
A: MongoDB uses a readers-writer [1] lock that allows concurrent reads access to a database but gives exclusive access to a single write operation.
When a read lock exists, many read operations may use this lock. However, when a write lock exists, a single write operation holds the lock exclusively, and no other read or write operations may share the lock.
Locks are “writer greedy,” which means write locks have preference over reads. When both a read and write are waiting for a lock, MongoDB grants the lock to the write.
[1]            You may be familiar with a “readers-writer” lock as “multi-reader” or “shared exclusive” lock. See the Wikipedia page on Readers-Writer Locks for more information.

Q: How granular are locks in MongoDB?
A: Changed in version 2.2.
Beginning with version 2.2, MongoDB implements locks on a per-database basis for most read and write operations. Some global operations, typically short lived operations involving multiple databases, still require a global “instance” wide lock. Before 2.2, there is only one “global” lock per mongod instance.

For example, if you have six databases and one takes a database-level write lock, the other five are still available for read and write. A global lock makes all six databases unavailable during the operation.

Q: How do I see the status of locks on my mongod instances?
A: For reporting on lock utilization information, use any of the following methods:
    db.serverStatus(),
    db.currentOp(),
    mongotop,
    mongostat, and/or
    the MongoDB Management Service (MMS)

Specifically, the locks document in the output of serverStatus, or the locks field in the current operation reporting provides insight into the type of locks and amount of lock contention in your mongod instance.
To terminate an operation, use db.killOp().
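For example, you might locate a long-running operation and then terminate it (the opid shown is hypothetical):

db.currentOp()    // find the opid of the offending operation in the output
db.killOp(345)    // terminate the operation with opid 345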

Q: Does a read or write operation ever yield the lock?
A: In some situations, read and write operations can yield their locks.
Long running read and write operations, such as queries, updates, and deletes, yield under many conditions. MongoDB uses an adaptive algorithm to allow operations to yield locks based on predicted disk access patterns (i.e. page faults).
MongoDB operations can also yield locks between individual document modifications in write operations that affect multiple documents, like update() with the multi parameter.
MongoDB uses heuristics based on its access pattern to predict whether data is likely in physical memory before performing a read. If MongoDB predicts that the data is not in physical memory an operation will yield its lock while MongoDB loads the data to memory. Once data is available in memory, the operation will reacquire the lock to complete the operation.
Changed in version 2.6: MongoDB does not yield locks when scanning an index even if it predicts that the index is not in memory.
Q: Which operations lock the database?
A: Changed in version 2.2.

The following table lists common database operations and the types of locks they use.
Operation              Lock Type
Issue a query        Read lock
Get more data from a cursor             Read lock
Insert data             Write lock
Remove data        Write lock
Update data          Write lock
Map-reduce          Read lock and write lock, unless operations are specified as non-atomic. Portions of map-reduce jobs can run concurrently.
Create an index   Building an index in the foreground, which is the default, locks the database for extended periods of time.
db.eval()                Write lock. The db.eval() method takes a global write lock while evaluating the JavaScript function. To avoid taking this global write lock, you can use the eval command with nolock: true.
eval        Write lock. By default, eval command takes a global write lock while evaluating the JavaScript function. If used with nolock: true, the eval command does not take a global write lock while evaluating the JavaScript function. However, the logic within the JavaScript function may take write locks for write operations.
aggregate()           Read lock

Q: Which administrative commands lock the database?
A: Certain administrative commands can exclusively lock the database for extended periods of time. In some deployments, for large databases, you may consider taking the mongod instance offline so that clients are not affected. For example, if a mongod is part of a replica set, take the mongod offline and let other members of the set service load while maintenance is in progress.

The following administrative operations require an exclusive (i.e. write) lock on the database for extended periods:

    db.collection.ensureIndex(), when issued without setting background to true,
    reIndex,
    compact,
    db.repairDatabase(),
    db.createCollection(), when creating a very large (i.e. many gigabytes) capped collection,
    db.collection.validate(), and
    db.copyDatabase(). This operation may lock all databases.

The following administrative commands lock the database but only hold the lock for a very short time:
    db.collection.dropIndex(),
    db.getLastError(),
    db.isMaster(),
    rs.status() (i.e. replSetGetStatus),
    db.serverStatus(),
    db.auth(), and
    db.addUser().

Q: Does a MongoDB operation ever lock more than one database?
A: The following MongoDB operations lock multiple databases:
    db.copyDatabase() must lock the entire mongod instance at once.
    db.repairDatabase() obtains a global write lock and will block other operations until it finishes.

    Journaling, which is an internal operation, locks all databases for short intervals. All databases share a single journal.
    User authentication requires a read lock on the admin database for deployments using 2.6 user credentials. For deployments using the 2.4 schema for user credentials, authentication locks the admin database as well as the database the user is accessing.
    All writes to a replica set’s primary lock both the database receiving the writes and then the local database for a short time. The lock for the local database allows the mongod to write to the primary’s oplog and accounts for a small portion of the total time of the operation.

Q: How does sharding affect concurrency?
A: Sharding improves concurrency by distributing collections over multiple mongod instances, allowing shard servers (i.e. mongos processes) to perform any number of operations concurrently to the various downstream mongod instances.
Each mongod instance is independent of the others in the shard cluster and uses the MongoDB readers-writer lock. The operations on one mongod instance do not block the operations on any others.

Q: How does concurrency affect a replica set primary?
A: In replication, when MongoDB writes to a collection on the primary, MongoDB also writes to the primary’s oplog, which is a special collection in the local database. Therefore, MongoDB must lock both the collection’s database and the local database. The mongod must lock both databases at the same time to keep the database consistent and ensure that write operations, even with replication, are “all-or-nothing” operations.

Q: How does concurrency affect secondaries?
A: In replication, MongoDB does not apply writes serially to secondaries. Secondaries collect oplog entries in batches and then apply those batches in parallel. Secondaries do not allow reads while applying the write operations, and apply write operations in the order that they appear in the oplog.

MongoDB can apply several writes in parallel on replica set secondaries, in two phases:

    During the first prefetch phase, under a read lock, the mongod ensures that all documents affected by the operations are in memory. During this phase, other clients may execute queries against this member.
    A thread pool using write locks applies all write operations in the batch as part of a coordinated write phase.

Q: What kind of concurrency does MongoDB provide for JavaScript operations?
A: Changed in version 2.4: The V8 JavaScript engine added in 2.4 allows multiple JavaScript operations to run at the same time. Prior to 2.4, a single mongod could only run a single JavaScript operation at once.


FAQ: Sharding with MongoDB
This document answers common questions about horizontal scaling using MongoDB’s sharding.

Q: Is sharding appropriate for a new deployment?
A: Sometimes.
If your data set fits on a single server, you should begin with an unsharded deployment.
Converting an unsharded database to a sharded cluster is easy and seamless, so there is little advantage in configuring sharding while your data set is small.
Still, all production deployments should use replica sets to provide high availability and disaster recovery.
Q: How does sharding work with replication?
A: To use replication with sharding, deploy each shard as a replica set.

Q: Can I change the shard key after sharding a collection?
A: No.
There is no automatic support in MongoDB for changing a shard key after sharding a collection. This reality underscores the importance of choosing a good shard key. If you must change a shard key after sharding a collection, the best option is to:

    dump all data from MongoDB into an external format.
    drop the original sharded collection.
    configure sharding using a more ideal shard key.
    pre-split the shard key range to ensure initial even distribution.
    restore the dumped data into MongoDB.
See shardCollection, sh.shardCollection(), the Shard Key, Deploy a Sharded Cluster, and SERVER-4000 for more information.
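As a sketch of steps 3 and 4 in the mongo shell (the database, collection, shard key, and split point below are all hypothetical):

sh.shardCollection( "records.people", { email: 1 } )    // shard on the new key
sh.splitAt( "records.people", { email: "m" } )          // pre-split the key range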

Q: What happens to unsharded collections in sharded databases?
A: In the current implementation, all databases in a sharded cluster have a “primary shard.” All unsharded collections within that database will reside on the same shard.

Q: How does MongoDB distribute data across shards?
A: Sharding must be specifically enabled on a collection. After enabling sharding on the collection, MongoDB will assign various ranges of collection data to the different shards in the cluster. The cluster automatically corrects imbalances between shards by migrating ranges of data from one shard to another.

Q: What happens if a client updates a document in a chunk during a migration?
A: The mongos routes the operation to the “old” shard, where it will succeed immediately. Then the shard mongod instances will replicate the modification to the “new” shard before the sharded cluster updates that chunk’s “ownership,” which effectively finalizes the migration process.

Q: What happens to queries if a shard is inaccessible or slow?
A: If a shard is inaccessible or unavailable, queries will return with an error.
However, a client may set the partial query bit, which will then return results from all available shards, regardless of whether a given shard is unavailable.
If a shard is responding slowly, mongos will merely wait for the shard to return results.

Q: How does MongoDB distribute queries among shards?
A: Changed in version 2.0.
The exact method for distributing queries to shards in a cluster depends on the nature of the query and the configuration of the sharded cluster. Consider a sharded collection, using the shard key user_id, that has last_login and email attributes:
For a query that selects one or more values for the user_id key:
mongos determines which shard or shards contains the relevant data, based on the cluster metadata, and directs a query to the required shard or shards, and returns those results to the client.
For a query that selects user_id and also performs a sort:
mongos can make a straightforward translation of this operation into a number of queries against the relevant shards, ordered by user_id. When the sorted queries return from all shards, the mongos merges the sorted results and returns the complete result to the client.
For queries that select on last_login:
These queries must run on all shards: mongos must parallelize the query over the shards and perform a merge-sort on the email of the documents found.

Q: How does MongoDB sort queries in sharded environments?
A: If you call the cursor.sort() method on a query in a sharded environment, the mongod for each shard will sort its results, and the mongos merges each shard’s results before returning them to the client.

Q: How does MongoDB ensure unique _id field values when using a shard key other than _id?
A: If you do not use _id as the shard key, then your application/client layer must be responsible for keeping the _id field unique. It is problematic for collections to have duplicate _id values.
If you’re not sharding your collection by the _id field, then you should be sure to store a globally unique identifier in that field. The default BSON ObjectId works well in this case.

Q: I’ve enabled sharding and added a second shard, but all the data is still on one server. Why?
A: First, ensure that you’ve declared a shard key for your collection. Until you have configured the shard key, MongoDB will not create chunks, and sharding will not occur.
Next, keep in mind that the default chunk size is 64 MB. As a result, in most situations, the collection needs to have at least 64 MB of data before a migration will occur.

Additionally, the system which balances chunks among the servers attempts to avoid superfluous migrations. Depending on the number of shards, your shard key, and the amount of data, systems often require at least 10 chunks of data to trigger migrations.
You can run db.printShardingStatus() to see all the chunks present in your cluster.

Q: Is it safe to remove old files in the moveChunk directory?
A: Yes. mongod creates these files as backups during normal shard balancing operations. If some error occurs during a migration, these files may be helpful in recovering documents affected during the migration.
Once the migration has completed successfully and there is no need to recover documents from these files, you may safely delete these files. Or, if you have an existing backup of the database that you can use for recovery, you may also delete these files after migration.
To determine if all migrations are complete, run sh.isBalancerRunning() while connected to a mongos instance.

Q: How does mongos use connections?
A: Each client maintains a connection to a mongos instance. Each mongos instance maintains a pool of connections to the members of a replica set supporting the sharded cluster. Clients use connections between mongos and mongod instances one at a time. Requests are not multiplexed or pipelined. When client requests complete, the mongos returns the connection to the pool.

Q: Why does mongos hold connections open?
A: mongos uses a set of connection pools to communicate with each shard. These pools do not shrink when the number of clients decreases.
This can lead to an unused mongos with a large number of open connections. If the mongos is no longer in use, it is safe to restart the process to close existing connections.

Q: Where does MongoDB report on connections used by mongos?
A: Connect to the mongos with the mongo shell, and run the following command:
db._adminCommand("connPoolStats");

Q: What does writebacklisten in the log mean?
A: The writeback listener is a process that opens a long poll to relay writes back from a mongod or mongos after migrations to make sure they have not gone to the wrong server. The writeback listener sends writes back to the correct server if necessary.
These messages are a key part of the sharding infrastructure and should not cause concern.

Q: How should administrators deal with failed migrations?
A: Failed migrations require no administrative intervention. Chunk migrations always preserve a consistent state. If a migration fails to complete for some reason, the cluster retries the operation. When the migration completes successfully, the data resides only on the new shard.

Q: What is the process for moving, renaming, or changing the number of config servers?
A: See Sharded Cluster Tutorials for information on migrating and replacing config servers.

Q: When do the mongos servers detect config server changes?
A: mongos instances maintain a cache of the config database that holds the metadata for the sharded cluster. This metadata includes the mapping of chunks to shards.
mongos updates its cache lazily by issuing a request to a shard and discovering that its metadata is out of date. There is no way to control this behavior from the client, but you can run the flushRouterConfig command against any mongos to force it to refresh its cache.
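For example, while connected to a mongos:

use admin
db.adminCommand("flushRouterConfig")   // force this mongos to refresh its metadata cache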

Q: Is it possible to quickly update mongos servers after updating a replica set configuration?
A: The mongos instances will detect these changes without intervention over time. However, if you want to force the mongos to reload its configuration, run the flushRouterConfig command against each mongos directly.

Q: What does the maxConns setting on mongos do?
A: The maxIncomingConnections option limits the number of connections accepted by mongos.
If your client driver or application creates a large number of connections but allows them to time out rather than closing them explicitly, then it might make sense to limit the number of connections at the mongos layer.
Set maxIncomingConnections to a value slightly higher than the maximum number of connections that the client creates, or the maximum size of the connection pool. This setting prevents the mongos from causing connection spikes on the individual shards. Spikes like these may disrupt the operation and memory allocation of the sharded cluster.
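As a sketch (hostnames and the connection limit are illustrative), on MongoDB 2.x you could cap connections when starting mongos from the command line:

mongos --configdb cfg1.example.net:27019 --maxConns 600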

Q: How do indexes impact queries in sharded systems?
A: If the query does not include the shard key, the mongos must send the query to all shards as a “scatter/gather” operation. Each shard will, in turn, use either the shard key index or another more efficient index to fulfill the query.
If the query includes multiple sub-expressions that reference the fields indexed by the shard key and the secondary index, the mongos can route the queries to a specific shard, and the shard will use the index that allows it to fulfill the query most efficiently. See this presentation for more information.
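As an illustration (collection and shard key hypothetical), assuming people is sharded on { zipcode: 1 }:

db.people.find({ zipcode: "63109" })   // targeted: routed to the shard(s) owning that key range
db.people.find({ name: "alice" })      // no shard key: scatter/gather across all shards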

Q: Can shard keys be randomly generated?
A: Shard keys can be random. Random keys ensure optimal distribution of data across the cluster.
Sharded clusters attempt to route queries to specific shards when queries include the shard key as a parameter, because these directed queries are more efficient. In many cases, random keys can make it difficult to direct queries to specific shards.

Q: Can shard keys have a non-uniform distribution of values?
A: Yes. There is no requirement that documents be evenly distributed by the shard key.
However, documents that have the same shard key must reside in the same chunk and therefore on the same server. If your sharded data set has too many documents with the exact same shard key you will not be able to distribute those documents across your sharded cluster.

Q: Can you shard on the _id field?
A: You can use any field for the shard key. The _id field is a common shard key.
Be aware that ObjectId() values, which are the default value of the _id field, increment as a timestamp. As a result, when used as a shard key, all new documents inserted into the collection will initially belong to the same chunk on a single shard. Although the system will eventually divide this chunk and migrate its contents to distribute data more evenly, at any moment the cluster can only direct insert operations at a single shard. This can limit the throughput of inserts. If most of your write operations are updates, this limitation should not impact your performance. However, if you have a high insert volume, this may be a limitation.
To address this issue, MongoDB 2.4 provides hashed shard keys.
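For example, on MongoDB 2.4 or later (namespace hypothetical), a hashed shard key spreads monotonically increasing ObjectId values across shards:

sh.shardCollection("records.docs", { _id: "hashed" })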

Q: What do moveChunk commit failed errors mean?
A: At the end of a chunk migration, the shard must connect to the config database to update the chunk’s record in the cluster metadata. If the shard fails to connect to the config database, MongoDB reports the following error:

ERROR: moveChunk commit failed: version is at <n>|<nn> instead of <N>|<NN>
ERROR: TERMINATING

When this happens, the primary member of the shard’s replica set then terminates to protect data consistency. If a secondary member can access the config database, data on the shard becomes accessible again after an election.
You will need to resolve the chunk migration failure independently. If you encounter this issue, contact the MongoDB User Group or MongoDB Support.

Q: How does draining a shard affect the balancing of uneven chunk distribution?
A: The sharded cluster balancing process controls both migrating chunks from decommissioned shards (i.e. draining) and normal cluster balancing activities. Consider the following behaviors for different versions of MongoDB in situations where you remove a shard in a cluster with an uneven chunk distribution:

After MongoDB 2.2, the balancer first removes the chunks from the draining shard and then balances the remaining uneven chunk distribution.
Before MongoDB 2.2, the balancer handles the uneven chunk distribution and then removes the chunks from the draining shard.


FAQ: Replication and Replica Sets
This document answers common questions about database replication in MongoDB.

Q: What kinds of replication does MongoDB support?
A: MongoDB supports master-slave replication and a variation on master-slave replication known as replica sets. Replica sets are the recommended replication topology.

Q: What do the terms “primary” and “master” mean?
A: Primary and master nodes are the nodes that can accept writes. MongoDB’s replication is “single-master”: only one node can accept write operations at a time.
In a replica set, if the current “primary” node fails or becomes inaccessible, the other members can autonomously elect one of the other members of the set to be the new “primary”.
By default, clients send all reads to the primary; however, read preference is configurable at the client level on a per-connection basis, which makes it possible to send reads to secondary nodes instead.
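As a sketch of setting a read preference in the mongo shell (MongoDB 2.2+; collection name hypothetical):

db.getMongo().setReadPref("secondaryPreferred")  // route reads to secondaries when available
db.people.find({ zipcode: "63109" })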

Q: What do the terms “secondary” and “slave” mean?
A: Secondary and slave nodes are read-only nodes that replicate from the primary.
Replication operates by way of an oplog, from which secondary/slave members apply new operations to themselves. This replication process is asynchronous, so secondary/slave nodes may not always reflect the latest writes to the primary. But usually, the gap between the primary and secondary nodes is just a few milliseconds on a local network connection.

Q: How long does replica set failover take?
A: It varies, but a replica set will select a new primary within a minute.
It may take 10-30 seconds for the members of a replica set to declare a primary inaccessible. This triggers an election. During the election, the cluster is unavailable for writes.
The election itself may take another 10-30 seconds.
Note: Eventually consistent reads, like the ones that return from a replica set’s secondaries, are only possible with a read preference that permits reads from secondary members.

Q: Does replication work over the Internet and WAN connections?
A: Yes.
For example, a deployment may maintain a primary and secondary in an East-coast data center along with a secondary member for disaster recovery in a West-coast data center.

Q: Can MongoDB replicate over a “noisy” connection?
A: Yes, but not without connection failures and the obvious latency.
Members of the set will attempt to reconnect to the other members of the set in response to networking flaps. This does not require administrator intervention. However, if the network connections among the nodes in the replica set are very slow, it might not be possible for the members of the set to keep up with the replication.
If the TCP connection between the secondaries and the primary instance breaks, a replica set will automatically elect one of the secondary members of the set as primary.

Q: What is the preferred replication method: master/slave or replica sets?
A: New in version 1.8.
Replica sets are the preferred replication mechanism in MongoDB. However, if your deployment requires more than 12 nodes, you must use master/slave replication.

Q: What is the preferred replication method: replica sets or replica pairs?
A: Deprecated since version 1.6.
Replica sets replaced replica pairs in version 1.6. Replica sets are the preferred replication mechanism in MongoDB.

Q: Why use journaling if replication already provides data redundancy?
A: Journaling facilitates faster crash recovery. Prior to journaling, crashes often required database repairs or full data resync. Both were slow, and the first was unreliable.
Journaling is particularly useful for protection against power failures, especially if your replica set resides in a single data center or power circuit.
When a replica set runs with journaling, mongod instances can safely restart without any administrator intervention.

Note: Journaling requires some resource overhead for write operations. Journaling has no effect on read performance, however.
Journaling is enabled by default on all 64-bit builds of MongoDB v2.0 and greater.

Q: Are write operations durable if write concern does not acknowledge writes?
A: Yes.
However, if you want confirmation that a given write has arrived at the server, use write concern.
After the default write concern change, the default write concern acknowledges all write operations, and unacknowledged writes must be explicitly configured. See the MongoDB Drivers and Client Libraries documentation for your driver for more information.
Changed in version 2.6: The mongo shell now defaults to using safe writes.
A new protocol for write operations integrates write concerns with the write operations. Previous versions issued a getLastError command after a write to specify a write concern.
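As a sketch in the 2.6-era shell (collection and document contents illustrative), a write concern can be passed with the operation itself:

db.orders.insert(
    { item: "abc", qty: 1 },
    { writeConcern: { w: "majority", wtimeout: 5000 } }  // wait for replication to a majority of members
)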

Q: How many arbiters do replica sets need?
A: Some configurations do not require any arbiter instances. Arbiters vote in elections for primary but do not replicate the data like secondary members.
Replica sets require a majority of the remaining nodes present to elect a primary. Arbiters allow you to construct this majority without the overhead of adding replicating nodes to the system.
There are many possible replica set architectures.
A replica set with an odd number of voting nodes does not need an arbiter.
A common configuration consists of two replicating nodes that include a primary and a secondary, as well as an arbiter for the third node. This configuration makes it possible for the set to elect a primary in the event of failure, without requiring three replicating nodes.
You may also consider adding an arbiter to a set if it has an equal number of nodes in two facilities and network partitions between the facilities are possible. In these cases, the arbiter will break the tie between the two facilities and allow the set to elect a new primary.
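For example, to add an arbiter to an existing replica set (hostname hypothetical):

rs.addArb("arbiter.example.net:27017")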

Q: What information do arbiters exchange with the rest of the replica set?
A: Arbiters never receive the contents of a collection but do exchange the following data with the rest of the replica set:

Credentials used to authenticate the arbiter with the replica set. All MongoDB processes within a replica set use keyfiles. These exchanges are encrypted.
Replica set configuration data and voting data. This information is not encrypted. Only credential exchanges are encrypted.

If your MongoDB deployment uses SSL, then all communications between arbiters and the other members of the replica set are secure. See the documentation for Configure mongod and mongos for SSL for more information. Run all arbiters on secure networks, as with all MongoDB components.

Q: Which members of a replica set vote in elections?
A: All members of a replica set, unless the value of votes is equal to 0, vote in elections. This includes all delayed, hidden and secondary-only members, as well as the arbiters.
Additionally, the state of a voting member also determines whether it can vote. Only voting members in the following states are eligible to vote:
    PRIMARY
    SECONDARY
    RECOVERING
    ARBITER
    ROLLBACK

Q: Do hidden members vote in replica set elections?
A: Hidden members of replica sets do vote in elections. To exclude a member from voting in an election, change the value of the member’s votes configuration to 0.
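A minimal sketch of stripping a member’s vote (the member index is hypothetical):

cfg = rs.conf()
cfg.members[2].votes = 0   // this member no longer votes in elections
rs.reconfig(cfg)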

Q: Is it normal for replica set members to use different amounts of disk space?
A: Yes.
Factors including different oplog sizes, different levels of storage fragmentation, and MongoDB’s data file pre-allocation can lead to some variation in storage utilization between nodes. Storage use disparities will be most pronounced when you add members at different times.


FAQ: MongoDB Storage
This document addresses common questions regarding MongoDB’s storage system.

Q: What are memory mapped files?
A: A memory-mapped file is a file with data that the operating system places in memory by way of the mmap() system call. mmap() thus maps the file to a region of virtual memory. Memory-mapped files are the critical piece of the storage engine in MongoDB. By using memory-mapped files, MongoDB can treat the contents of its data files as if they were in memory. This provides MongoDB with an extremely fast and simple method for accessing and manipulating data.

Q: How do memory mapped files work?
A: Memory mapping assigns files to a block of virtual memory with a direct byte-for-byte correlation. Once mapped, the relationship between file and memory allows MongoDB to interact with the data in the file as if it were memory.

Q: How does MongoDB work with memory mapped files?
A: MongoDB uses memory mapped files for managing and interacting with all data. MongoDB memory maps data files to memory as it accesses documents. Data that isn’t accessed is not mapped to memory.

Q: What are page faults?
A: Page faults can occur as MongoDB reads from or writes data to parts of its data files that are not currently located in physical memory. In contrast, operating system page faults happen when physical memory is exhausted and pages of physical memory are swapped to disk.

If there is free memory, then the operating system can find the page on disk and load it into memory directly. However, if there is no free memory, the operating system must:
    1. find a page in memory that is stale or no longer needed, and write that page to disk, then
    2. read the requested page from disk and load it into memory.

This process, particularly on an active system, can take a long time compared to reading a page that is already in memory.

Q: What is the difference between soft and hard page faults?
A: Page faults occur when MongoDB needs access to data that isn’t currently in active memory. A “hard” page fault refers to situations when MongoDB must access a disk to access the data. A “soft” page fault, by contrast, merely moves memory pages from one list to another, such as from an operating system file cache. In production, MongoDB will rarely encounter soft page faults.

Q: What tools can I use to investigate storage use in MongoDB?
A: The db.stats() method in the mongo shell returns the current state of the “active” database. The dbStats command document describes the fields in the db.stats() output.
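For example (database name hypothetical; the optional scale argument is supported by db.stats()):

use records
db.stats(1024 * 1024)   // report sizes in megabytes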

Q: What is the working set?
A: The working set represents the total body of data that the application uses in the course of normal operation. Often this is a subset of the total data size, but the specific size of the working set depends on actual moment-to-moment use of the database.

If you run a query that requires MongoDB to scan every document in a collection, the working set will expand to include every document. Depending on physical memory size, this may cause documents in the working set to “page out,” or to be removed from physical memory by the operating system. The next time MongoDB needs to access these documents, MongoDB may incur a hard page fault.
For best performance, the majority of your working set should fit in RAM.

Q: Why are the files in my data directory larger than the data in my database?
A: The data files in your data directory, which is the /data/db directory in default configurations, might be larger than the data set inserted into the database. Consider the following possible causes:

    Preallocated data files.
    In the data directory, MongoDB preallocates data files to a particular size, in part to prevent file system fragmentation. MongoDB names the first data file <databasename>.0, the next <databasename>.1, etc. The first file mongod allocates is 64 megabytes, the next 128 megabytes, and so on, up to 2 gigabytes, at which point all subsequent files are 2 gigabytes. The data files include files with allocated space that hold no data; mongod may allocate a 1 gigabyte data file that is 90% empty. For most larger databases, unused allocated space is small compared to the database.

 The oplog.
 If this mongod is a member of a replica set, the data directory includes the oplog.rs file, which is a preallocated capped collection in the local database. The default allocation is approximately 5% of disk space on 64-bit installations, see Oplog Sizing for more information. In most cases, you should not need to resize the oplog. However, if you do, see Change the Size of the Oplog.

The journal.
The data directory contains the journal files, which store write operations on disk prior to MongoDB applying them to databases. See Journaling Mechanics.
   
Empty records.
MongoDB maintains lists of empty records in data files when deleting documents and collections. MongoDB can reuse this space, but will never return this space to the operating system.
To de-fragment allocated storage, use compact. By de-fragmenting storage, MongoDB can use the allocated space more effectively. compact requires up to 2 gigabytes of extra disk space to run. Do not use compact if you are critically low on disk space.

Important: compact only removes fragmentation from MongoDB data files and does not return any disk space to the operating system.

To reclaim deleted space, use repairDatabase, which rebuilds the database, de-fragments the storage, and may release space to the operating system. repairDatabase requires up to 2 gigabytes of extra disk space to run. Do not use repairDatabase if you are critically low on disk space.

Warning: repairDatabase requires enough free disk space to hold both the old and new database files while the repair is running. Be aware that repairDatabase will block all other operations and may take a long time to complete.

Q: How can I check the size of a collection?
A: To view the size of a collection and other information, use the db.collection.stats() method from the mongo shell. The following example issues db.collection.stats() for the orders collection:
db.orders.stats();

To view specific measures of size, use these methods:
    db.collection.dataSize(): data size in bytes for the collection.
    db.collection.storageSize(): allocation size in bytes, including unused space.
    db.collection.totalSize(): the data size plus the index size in bytes.
    db.collection.totalIndexSize(): the index size in bytes.

Also, the following scripts print the statistics for each database and collection:
db._adminCommand("listDatabases").databases.forEach(function (d) {mdb = db.getSiblingDB(d.name); printjson(mdb.stats())})
db._adminCommand("listDatabases").databases.forEach(function (d) {mdb = db.getSiblingDB(d.name); mdb.getCollectionNames().forEach(function(c) {s = mdb[c].stats(); printjson(s)})})

Q: How can I check the size of indexes?
A: To view the size of the data allocated for an index, use one of the following procedures in the mongo shell:
    Use the db.collection.stats() method using the index namespace. To retrieve a list of namespaces, issue the following command:
    db.system.namespaces.find()
    Check the value of indexSizes in the output of the db.collection.stats() command.

Example
Issue the following command to retrieve index namespaces:
db.system.namespaces.find()

The command returns a list similar to the following:
{"name" : "test.orders"}
{"name" : "test.system.indexes"}
{"name" : "test.orders.$_id_"}

View the size of the data allocated for the orders.$_id_ index with the following sequence of operations:
use test
db.orders.$_id_.stats().indexSizes

Q: How do I know when the server runs out of disk space?
A: If your server runs out of disk space for data files, you will see something like this in the log:

Thu Aug 11 13:06:09 [FileAllocator] allocating new data file dbms/test.13, filling with zeroes...
Thu Aug 11 13:06:09 [FileAllocator] error failed to allocate new file: dbms/test.13 size: 2146435072 errno:28 No space left on device
Thu Aug 11 13:06:09 [FileAllocator]     will try again in 10 seconds
Thu Aug 11 13:06:19 [FileAllocator] allocating new data file dbms/test.13, filling with zeroes...
Thu Aug 11 13:06:19 [FileAllocator] error failed to allocate new file: dbms/test.13 size: 2146435072 errno:28 No space left on device
Thu Aug 11 13:06:19 [FileAllocator]     will try again in 10 seconds

The server remains in this state forever, blocking all writes, including deletes. However, reads still work. Before you can delete data and compact the database with the compact command, you must restart the server.
If your server runs out of disk space for journal files, the server process will exit. By default, mongod creates journal files in a sub-directory of dbPath named journal. You may elect to put the journal files on another storage device using a filesystem mount or a symlink.

Note: If you place the journal files on a separate storage device you will not be able to use a file system snapshot tool to capture a valid snapshot of your data files and journal files.


FAQ: Indexes
This document addresses common questions regarding MongoDB indexes.

Q: Should you run ensureIndex() after every insert?
A: No. You only need to create an index once for a single collection. After initial creation, MongoDB automatically updates the index as data changes.
While running ensureIndex() is usually ok, if an index doesn’t exist because of ongoing administrative work, a call to ensureIndex() may disrupt database availability. Running ensureIndex() can render a replica set inaccessible as the index creation is happening. See Build Indexes on Replica Sets.

Q: How do you know what indexes exist in a collection?
A: To list a collection’s indexes, use the db.collection.getIndexes() method or a similar method for your driver.

Q: How do you determine the size of an index?
A: To check the sizes of the indexes on a collection, use db.collection.stats().

Q: What happens if an index does not fit into RAM?
A: When an index is too large to fit into RAM, MongoDB must read the index from disk, which is a much slower operation than reading from RAM. Keep in mind an index fits into RAM when your server has RAM available for the index combined with the rest of the working set.
In certain cases, an index does not need to fit entirely into RAM. For details, see Indexes that Hold Only Recent Values in RAM.

Q: How do you know what index a query used?
A: To inspect how MongoDB processes a query, use the explain() method in the mongo shell, or in your application driver.
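For example (collection and query hypothetical):

db.people.find({ zipcode: "63109" }).explain()   // shows the cursor type and the index used, if any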

Q: How do you determine what fields to index?
A: A number of factors determine what fields to index, including selectivity, fitting indexes into RAM, reusing indexes in multiple queries when possible, and creating indexes that can support all the fields in a given query. For detailed documentation on choosing which fields to index, see Indexing Tutorials.

Q: How do write operations affect indexes?
A: Any write operation that alters an indexed field requires an update to the index in addition to the document itself. If you update a document that causes the document to grow beyond the allotted record size, then MongoDB must update all indexes that include this document as part of the update operation.
Therefore, if your application is write-heavy, creating too many indexes might affect performance.

Q: Will building a large index affect database performance?
A: Building an index can be an IO-intensive operation, especially if you have a large collection. This is true on any database system that supports secondary indexes, including MySQL. If you need to build an index on a large collection, consider building the index in the background. See Index Creation.
If you build a large index without the background option, and if doing so causes the database to stop responding, do one of the following:

    Wait for the index to finish building.
    Kill the current operation (see db.killOp()). The partial index will be deleted.
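As a sketch of the background build suggested above (collection and field hypothetical):

db.people.ensureIndex({ zipcode: 1 }, { background: true })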

Q: Can I use index keys to constrain query matches?
A: You can use the min() and max() methods to constrain the results of the cursor returned from find() by using index keys.
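For instance (collection, index, and bounds hypothetical), constraining a cursor to one index key range; hint() makes the index explicit, which min() and max() require:

db.products.find().min({ price: 10 }).max({ price: 100 }).hint({ price: 1 })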

Q: Using $ne and $nin in a query is slow. Why?
A: The $ne and $nin operators are not selective. See Create Queries that Ensure Selectivity. If you need to use these, it is often best to make sure that an additional, more selective criterion is part of the query.
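For example (fields hypothetical), pairing $ne with a selective equality match lets an index narrow the scan before the non-selective operator is applied:

db.orders.find({ customerId: 12345, status: { $ne: "complete" } })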

Q: Can I use a multi-key index to support a query for a whole array?
A: Not entirely. The index can partially support these queries because it can speed the selection of the first element of the array; however, comparing all subsequent items in the array cannot use the index and must scan the documents individually.

Q: How can I use indexes effectively for attribute lookups?
A: For simple attribute lookups that don’t require sorted result sets or range queries, consider storing the attributes in an array field (e.g. attrib) of embedded documents, where each embedded document holds one attribute as a key/value pair. You can then index the keys and values of this attrib field.

For example, the attrib field in the following document allows you to add an unlimited number of attribute types:

{ _id : ObjectId(...),
  attrib : [
            { k: "color", v: "red" },
            { k: "shape", v: "rectangle" },
            { k: "color", v: "blue" },
            { k: "avail", v: true }
           ]
}

Both of the following queries could use the same { "attrib.k": 1, "attrib.v": 1 } index:
db.mycollection.find( { attrib: { $elemMatch : { k: "color", v: "blue" } } } )
db.mycollection.find( { attrib: { $elemMatch : { k: "avail", v: true } } } )


FAQ: MongoDB Diagnostics
This document provides answers to common diagnostic questions and issues.

Q: Where can I find information about a mongod process that stopped running unexpectedly?
A: If mongod shuts down unexpectedly on a UNIX or UNIX-based platform, and if mongod fails to log a shutdown or error message, then check your system logs for messages pertaining to MongoDB. For example, for logs located in /var/log/messages, use the following commands:
sudo grep mongod /var/log/messages
sudo grep score /var/log/messages

Q: Does TCP keepalive time affect sharded clusters and replica sets?
A: If you experience socket errors between members of a sharded cluster or replica set that do not have other reasonable causes, check the TCP keepalive value, which Linux systems store as the tcp_keepalive_time value. A common keepalive period is 7200 seconds (2 hours); however, different distributions and OS X may have different settings. For MongoDB, you will have better experiences with shorter keepalive periods, on the order of 300 seconds (five minutes).

On Linux systems you can use the following operation to check the value of tcp_keepalive_time:
cat /proc/sys/net/ipv4/tcp_keepalive_time

You can change the tcp_keepalive_time value with the following operation:
echo 300 > /proc/sys/net/ipv4/tcp_keepalive_time

The new tcp_keepalive_time value takes effect without requiring you to restart the mongod or mongos servers. When you reboot or restart your system, you will need to set the new tcp_keepalive_time value again, or see your operating system’s documentation for setting the TCP keepalive value persistently.

For OS X systems, issue the following command to view the keep alive setting:
sysctl net.inet.tcp.keepinit
To set a shorter keep alive period use the following invocation:
sysctl -w net.inet.tcp.keepinit=300

If your replica set or sharded cluster experiences keepalive-related issues, you must alter the tcp_keepalive_time value on all machines hosting MongoDB processes. This includes all machines hosting mongos or mongod servers.
Windows users should consider the Windows Server Technet Article on KeepAliveTime configuration for more information on setting keep alive for MongoDB deployments on Windows systems.

Q: What tools are available for monitoring MongoDB?
A: The MongoDB Management Service includes monitoring functionality, which collects data from running MongoDB deployments and provides visualization and alerts based on that data.
A full list of third-party tools is available as part of the Monitoring for MongoDB documentation. Also consider the MMS Documentation.
Memory Diagnostics

Q: Do I need to configure swap space?
A: Always configure systems to have swap space. Without swap, your system may not be reliable in some situations with extreme memory constraints, memory leaks, or multiple programs using the same memory. Think of the swap space as something like a steam release valve that allows the system to release extra pressure without affecting the overall functioning of the system.

Nevertheless, systems running MongoDB do not need swap for routine operation. Database files are memory-mapped and should constitute most of your MongoDB memory use. Therefore, it is unlikely that mongod will ever use any swap space in normal operation. The operating system will release memory from the memory mapped files without needing swap and MongoDB can write data to the data files without needing the swap system.

Q: What is “working set” and how can I estimate its size?
A: The working set for a MongoDB database is the portion of your data that clients access most often. You can estimate the size of the working set using the workingSet document in the output of serverStatus. To return serverStatus with the workingSet document, issue a command in the following form:
db.runCommand( { serverStatus: 1, workingSet: 1 } )

Q: Must my working set size fit RAM?
A: Your working set should stay in memory to achieve good performance. Otherwise many random disk I/Os will occur, and unless you are using SSDs, this can be quite slow.
One area to watch specifically in managing the size of your working set is index access patterns. If you are inserting into indexes at random locations (as would happen with ids that are randomly generated by hashes), you will continually be updating the whole index. If instead you are able to create your ids in approximately ascending order (for example, day concatenated with a random id), all the updates will occur at the right side of the b-tree and the working set size for index pages will be much smaller.
It is fine if databases and thus virtual size are much larger than RAM.

Q: How do I calculate how much RAM I need for my application?
A: The amount of RAM you need depends on several factors, including but not limited to:
    The relationship between database storage and working set.
    The operating system’s cache strategy for LRU (Least Recently Used)
    The impact of journaling
    The number or rate of page faults and other MMS gauges to detect when you need more RAM
    Each database connection thread will need up to 1 MB of RAM.

MongoDB defers to the operating system when loading data into memory from disk. It simply memory-maps all its data files and relies on the operating system to cache data. The OS typically evicts the least-recently-used data from RAM when it runs low on memory. For example, if clients access indexes more frequently than documents, then indexes will more likely stay in RAM, but it depends on your particular usage.
To calculate how much RAM you need, you must calculate your working set size, or the portion of your data that clients use most often. This depends on your access patterns, what indexes you have, and the size of your documents. Because MongoDB uses a thread per connection model, each database connection also will need up to 1MB of RAM, whether active or idle.
If page faults are infrequent, your working set fits in RAM. If the fault rate rises, you risk performance degradation. This is less critical with SSD drives than with spinning disks.

Q: How do I read memory statistics in the UNIX top command?
A: Because mongod uses memory-mapped files, the memory statistics in top require interpretation in a special way. On a large database, VSIZE (virtual bytes) tends to be the size of the entire database. If the server doesn’t have other processes running, RSIZE (resident bytes) is the total memory of the machine, as this counts file system cache contents.

For Linux systems, use the vmstat command to help determine how the system uses memory. On OS X systems use vm_stat.
Sharded Cluster Diagnostics
The two most important factors in maintaining a successful sharded cluster are:
    choosing an appropriate shard key and
    sufficient capacity to support current and future operations.

You can prevent most issues encountered with sharding by ensuring that you choose the best possible shard key for your deployment and ensure that you are always adding additional capacity to your cluster well before the current resources become saturated. Continue reading for specific issues you may encounter in a production environment.

Q: In a new sharded cluster, why does all data remain on one shard?
A: Your cluster must have sufficient data for sharding to make sense. Sharding works by migrating chunks between the shards until each shard has roughly the same number of chunks.
The default chunk size is 64 megabytes. MongoDB will not begin migrations until the imbalance of chunks in the cluster exceeds the migration threshold. While the default chunk size is configurable with the chunkSize setting, these behaviors help prevent unnecessary chunk migrations, which can degrade the performance of your cluster as a whole.

If you have just deployed a sharded cluster, make sure that you have enough data to make sharding effective. If you do not have sufficient data to create more than eight 64 megabyte chunks, then all data will remain on one shard. Either lower the chunk size setting, or add more data to the cluster.

As a related problem, the system will split chunks only on inserts or updates, which means that if you configure sharding and do not continue to issue insert and update operations, the database will not create any chunks. You can either wait until your application inserts data or split chunks manually.
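To split chunks manually, a sketch (namespace and key value hypothetical):

sh.splitFind("records.people", { zipcode: "63109" })   // split the chunk containing this key at its median point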

Finally, if your shard key has a low cardinality, MongoDB may not be able to create sufficient splits among the data.

Q: Why would one shard receive a disproportionate amount of traffic in a sharded cluster?
A: In some situations, a single shard or a subset of the cluster will receive a disproportionate portion of the traffic and workload. In almost all cases this is the result of a shard key that does not effectively allow write scaling.
It’s also possible that you have “hot chunks.” In this case, you may be able to solve the problem by splitting and then migrating parts of these chunks.
In the worst case, you may have to consider re-sharding your data and choosing a different shard key to correct this pattern.

Q: What can prevent a sharded cluster from balancing?
A: If you have just deployed your sharded cluster, you may want to consider the troubleshooting suggestions for a new cluster where data remains on a single shard.
If the cluster was initially balanced, but later developed an uneven distribution of data, consider the following possible causes:

    You have deleted or removed a significant amount of data from the cluster. If you have added additional data, it may have a different distribution with regards to its shard key.
    Your shard key has low cardinality and MongoDB cannot split the chunks any further.
    Your data set is growing faster than the balancer can distribute data around the cluster. This is uncommon and typically is the result of:
        a balancing window that is too short, given the rate of data growth;
        an uneven distribution of write operations that requires more data migration (you may have to choose a different shard key to resolve this issue); or
        poor network connectivity between shards, which may lead to chunk migrations that take too long to complete. Investigate your network configuration and interconnections between shards.

Q: Why do chunk migrations affect sharded cluster performance?
A: If migrations impact your cluster or application’s performance, consider the following options, depending on the nature of the impact:
    If migrations only interrupt your cluster sporadically, you can limit the balancing window to prevent balancing activity during peak hours (see the sketch after this list). Ensure that there is enough time remaining to keep the data from becoming out of balance again.
    If the balancer is always migrating chunks to the detriment of overall cluster performance:
        You may want to attempt decreasing the chunk size to limit the size of the migration.
        Your cluster may be over capacity, and you may want to attempt to add one or two shards to the cluster to distribute load.
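As a sketch of restricting the balancing window mentioned above (the times are illustrative), run against the config database via a mongos:

use config
db.settings.update(
    { _id: "balancer" },
    { $set: { activeWindow: { start: "23:00", stop: "6:00" } } },
    true   // upsert the balancer settings document if it does not exist
)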

It’s also possible that your shard key causes your application to direct all writes to a single shard. This kind of activity pattern can require the balancer to migrate most data soon after writing it. Consider redeploying your cluster with a shard key that provides better write scaling.

