What is GridFS in MongoDB

[What is MongoDB?]

MongoDB is a database based on distributed file storage. It is written in C++ and was developed to provide scalable, high-performance data storage solutions for web applications.

 

MongoDB sits between relational and non-relational databases: among non-relational databases, it is the most feature-rich and the most similar to a relational database. The data structures it supports are very loose; data is stored in BSON, a JSON-like format, so more complex types of data can be stored. MongoDB's biggest strength is its powerful query language. Its syntax resembles an object-oriented query language, it can accomplish almost everything a single-table query in a relational database can, and it also supports indexing of data.

(Note: the description above is adapted from the Baidu Encyclopedia.)

 

In fact, MongoDB has an important feature that many people are not aware of: GridFS. This article explains how to make good use of GridFS and shares practical experience based on real cases.

 

[An important module of MongoDB: GridFS]

GridFS is a sub-module of MongoDB. With GridFS, files can be stored in MongoDB, and distributed applications (distributed storage and reading of files) are supported. As MongoDB's solution for storing binary data in the database, it is usually used to handle large files. Documents stored in MongoDB's BSON format have a size limit of at most 16 MB, but in real systems the uploaded images or files can be much larger than that, so GridFS is used to help manage these files.

 

GridFS is not a feature of the MongoDB server itself but a specification for storing large files in MongoDB. All officially supported drivers implement the GridFS specification. The specification defines how large files are handled in the database; the drivers implement it for their respective languages and expose API interfaces through which large files are stored and accessed.
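As a rough illustration, here is a minimal sketch using the Python driver (PyMongo); the database name mydb and the file name big_video.mp4 are placeholders, not anything prescribed by GridFS:

```python
# Minimal sketch: storing and reading a file with GridFS via PyMongo.
# Assumptions: a local mongod, a database named "mydb", a file "big_video.mp4".
import gridfs
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["mydb"]
fs = gridfs.GridFS(db)  # uses the default "fs" bucket (fs.files / fs.chunks)

# put() splits the file into chunks and returns the _id of the fs.files document
with open("big_video.mp4", "rb") as f:
    file_id = fs.put(f, filename="big_video.mp4")

# get() reassembles the chunks into a file-like object
grid_out = fs.get(file_id)
data = grid_out.read()
print(len(data), grid_out.filename)
```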

 

<Usage scenarios>

▲ If your file system limits the number of files per directory, you can use GridFS to store as many files as you need.

▲ If you want to access part of a large file without loading the entire file into memory, you can store the file in GridFS and read only the portion you need (see the sketch after this list).

▲ If you want your files and metadata to be automatically synchronized and served across multiple systems and data centers, you can implement distributed file storage with GridFS.
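For the partial-read scenario above, a sketch along the same lines (again assuming the placeholder database mydb and a file already stored as big_video.mp4):

```python
# Sketch: read only a slice of a stored file without loading it all into memory.
import gridfs
from pymongo import MongoClient

db = MongoClient()["mydb"]
fs = gridfs.GridFS(db)

grid_out = fs.get_last_version("big_video.mp4")  # latest file with this name
grid_out.seek(1024 * 1024)           # jump to the 1 MB offset
fragment = grid_out.read(64 * 1024)  # read just 64 KB from that position
print(len(fragment))
```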

 

[GridFS storage principle]

GridFS uses two collections to store a file: a chunks collection that stores the binary data of the file content, and a files collection that stores the file's metadata.

 

GridFS places the two collections in a common bucket, and both collections use the bucket name as a prefix. By default, MongoDB's GridFS uses a bucket named fs to store files, so the two collections are named fs.files and fs.chunks, respectively.

 

Of course, you can choose a different bucket name, or even use several buckets in one database, but the names of all collections must stay within MongoDB's namespace limit.
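For example, in a sketch with the Python driver (the bucket name photos and the file avatar.png are invented), a custom bucket simply changes the collection prefix:

```python
# Sketch: a custom GridFS bucket named "photos" uses the collections
# photos.files and photos.chunks instead of fs.files and fs.chunks.
import gridfs
from pymongo import MongoClient

db = MongoClient()["mydb"]
photos = gridfs.GridFS(db, collection="photos")

with open("avatar.png", "rb") as f:
    photos.put(f, filename="avatar.png")

print(db.list_collection_names())  # expect photos.files and photos.chunks
```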

 

A MongoDB collection is named by its namespace, which consists of the database name and the collection name separated by a "." (e.g. <database>.<collection>). The full namespace must not exceed 120 bytes.

 

When a file is saved in GridFS, if the file is larger than the chunk size (each chunk is 256 KB), the file is split into several chunks according to the chunk size, and the chunks are saved as multiple documents in the fs.chunks collection. The file's metadata is then saved in a single document in the fs.files collection. The files_id field of the documents in fs.chunks corresponds to the _id field of the document in fs.files.

 

When a file is read, the corresponding document is first found in the files collection according to the query conditions and its _id is retrieved. Then all documents in the chunks collection whose files_id equals that _id are queried. Finally, the data field of each chunk is read in the order given by the n field to reassemble the file.
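The read path described above can be reproduced by querying the two collections directly; a sketch with the Python driver (big_video.mp4 is a placeholder file name):

```python
# Sketch: reassemble a file manually from fs.files and fs.chunks,
# mirroring what the driver does internally.
from pymongo import MongoClient

db = MongoClient()["mydb"]

files_doc = db.fs.files.find_one({"filename": "big_video.mp4"})  # metadata
chunks = db.fs.chunks.find({"files_id": files_doc["_id"]}).sort("n", 1)

# Concatenate the binary "data" field of each chunk in order of "n"
content = b"".join(bytes(chunk["data"]) for chunk in chunks)
assert len(content) == files_doc["length"]
```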

 

The storage process is illustrated below:

 

The fs.files collection stores file metadata in the form of JSON-like (BSON) documents. Every time a file is saved to GridFS, a corresponding document is created in the fs.files collection.

▲ A document in the fs.files collection contains the following:
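A representative document has roughly this shape (a sketch; all values are invented, and the exact field set, such as md5 or contentType, varies with driver version):

```python
# Illustrative shape of an fs.files document as returned by PyMongo.
import datetime
from bson import ObjectId

files_doc = {
    "_id": ObjectId("5f1e6a2b9d1c4b0012345678"),  # referenced by files_id in fs.chunks
    "length": 10485760,                           # total file size in bytes
    "chunkSize": 262144,                          # chunk size used for this file (256 KB)
    "uploadDate": datetime.datetime(2020, 7, 27, 8, 0, 0),
    "md5": "d41d8cd98f00b204e9800998ecf8427e",    # older drivers compute this hash
    "filename": "big_video.mp4",
    "metadata": {"owner": "demo"},                # optional, application-defined
}
```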

 

The fs.chunks collection stores the binary data of the file content, also in the form of BSON documents. Every time a file is saved to GridFS, GridFS splits the file content into chunks according to the chunk size (256 KB per chunk) and stores each chunk as a document in the fs.chunks collection. One stored file therefore corresponds to one or more chunk documents.

▲ A document in the fs.chunks collection contains the following:
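A representative chunk document has roughly this shape (again a sketch with invented values):

```python
# Illustrative shape of an fs.chunks document.
from bson import Binary, ObjectId

chunk_doc = {
    "_id": ObjectId("5f1e6a2b9d1c4b0012345679"),
    "files_id": ObjectId("5f1e6a2b9d1c4b0012345678"),  # _id of the owning fs.files document
    "n": 0,                                            # chunk sequence number, starting at 0
    "data": Binary(b"\x00\x01\x02"),                   # up to chunkSize bytes of raw content
}
```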

To speed up retrieval, MongoDB creates indexes on the two GridFS collections: fs.files is indexed on the filename and uploadDate fields, and fs.chunks has a unique compound index on the files_id and n fields.
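Drivers normally create these indexes automatically the first time a file is stored; if they ever need to be created by hand, a sketch with the Python driver would be:

```python
# Sketch: (re)create the GridFS indexes described above.
from pymongo import ASCENDING, MongoClient

db = MongoClient()["mydb"]

db.fs.files.create_index([("filename", ASCENDING), ("uploadDate", ASCENDING)])
db.fs.chunks.create_index([("files_id", ASCENDING), ("n", ASCENDING)], unique=True)
```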

 

[How do you use GridFS?]

<Using shell commands>

MongoDB ships with the mongofiles tool, which lets you operate on GridFS from the command line. It has four main commands:

put - store a file

get - retrieve a file

list - list stored files

delete - delete a file

These commands identify the files stored in GridFS by their file names.

 

 

<Using the API>

MongoDB provides drivers for many programming languages, such as C, Java, C#, Node.js, and others, so you can use the API of the driver for your language to operate on GridFS and extend it.
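As one example, a sketch using the Python driver's GridFSBucket interface (the file names are placeholders):

```python
# Sketch: the GridFSBucket API offered by the Python driver.
import gridfs
from pymongo import MongoClient

db = MongoClient()["mydb"]
bucket = gridfs.GridFSBucket(db)  # default bucket name "fs"

# Upload from a stream; returns the _id of the new fs.files document
with open("report.pdf", "rb") as f:
    file_id = bucket.upload_from_stream("report.pdf", f)

# Download back into another stream
with open("report_copy.pdf", "wb") as out:
    bucket.download_to_stream(file_id, out)

# Delete the file (removes the fs.files document and all of its chunks)
bucket.delete(file_id)
```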

 

[Exchange of experience]

▲ GridFS does not automatically deduplicate files with the same MD5 value. That is, two put commands for the same file result in two separate copies in GridFS, which wastes storage. If you want only one copy per unique MD5 hash, you have to implement that yourself through the driver API (see the sketch below).
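One way to add such deduplication yourself is sketched below with the Python driver: the application computes the MD5 hash and stores it under metadata.md5 (newer drivers no longer write a top-level md5 field automatically), then checks for an existing copy before calling put. The helper name put_unique is made up for illustration.

```python
# Sketch: only store a file if no file with the same MD5 is already in GridFS.
import hashlib

import gridfs
from pymongo import MongoClient

db = MongoClient()["mydb"]
fs = gridfs.GridFS(db)

def put_unique(path):
    """Store the file unless a copy with the same MD5 already exists."""
    with open(path, "rb") as f:
        data = f.read()
    digest = hashlib.md5(data).hexdigest()

    existing = db.fs.files.find_one({"metadata.md5": digest})
    if existing is not None:
        return existing["_id"]  # reuse the copy that is already stored
    return fs.put(data, filename=path, metadata={"md5": digest})

file_id = put_unique("big_video.mp4")
```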

 

▲ MongoDB does not release used disk space back to the operating system. Even if you delete a collection, MongoDB does not free the space. Likewise, if you use GridFS to store files and delete useless junk files from GridFS, MongoDB still does not free the space, so disk usage keeps growing and cannot be reclaimed automatically.

 

So how can we free up space?

(1) Disk space can be reclaimed by repairing the database, i.e. running db.repairDatabase() or db.runCommand({ repairDatabase: 1 }) in the mongo shell. (This command is slow to execute.)

When reclaiming disk space by repairing the database, note that the free space remaining on the disk must be at least the space occupied by the stored data plus 2 GB; otherwise, the repair cannot complete. Therefore, when GridFS is used to store a large number of files, a disk-reclamation plan must be designed in advance to deal with MongoDB's disk-recovery problem.

 

(2) Use the dump & restore method: first delete the data that needs to be removed from the MongoDB database, then back up the database with mongodump. After the backup completes, drop the database and restore the backup data with mongorestore.

 

If there is not enough free disk space for db.repairDatabase(), the dump & restore method can be used to reclaim disk resources. When MongoDB runs as a replica set, dump & restore can reclaim disk space while the service remains continuously available, without affecting normal use of MongoDB.

 

In our case, using MongoDB replica sets together with the dump & restore method to reclaim disk resources, 70 GB of data could be cleaned up and the disk space reclaimed within 2 hours. The whole process did not affect MongoDB's external service, and the integrity of the database's incremental data written during the operation was preserved.


Collected from the internet for personal learning.