Elasticsearch Bulk Insert

This article shows how to set up an Elasticsearch index with an alias and bulk insert a large number of documents. When bulk inserting many documents, performance improves if the refresh interval is turned off (RefreshInterval = "-1") and replication is turned off (NumberOfReplicas = 0). When the insert is finished, these settings are set back to the values your application requires.

Code: https://github.com/damienbod/ElasticsearchBulkInsert

Other Tutorials:

Part 1: ElasticsearchCRUD introduction
Part 2: MVC application search with simple documents using autocomplete, jQuery and jTable
Part 3: MVC Elasticsearch CRUD with nested documents
Part 4: Data Transfer from MS SQL Server using Entity Framework to Elasticsearch
Part 5: MVC Elasticsearch with child, parent documents
Part 6: MVC application with Entity Framework and Elasticsearch
Part 7: Live Reindex in Elasticsearch
Part 8: CSV export using Elasticsearch and Web API
Part 9: Elasticsearch Parent, Child, Grandchild Documents and Routing
Part 10: Elasticsearch Type mappings with ElasticsearchCRUD
Part 11: Elasticsearch Synonym Analyzer using ElasticsearchCRUD
Part 12: Using Elasticsearch German Analyzer
Part 13: MVC google maps search using Elasticsearch
Part 14: Search Queries and Filters with ElasticsearchCRUD
Part 15: Elasticsearch Bulk Insert
Part 16: Elasticsearch Aggregations With ElasticsearchCRUD
Part 17: Searching Multiple Indices and Types in Elasticsearch
Part 18: MVC searching with Elasticsearch Highlighting
Part 19: Index Warmers with ElasticsearchCRUD

To create the index, a TestDto class is used. This is mapped to a "testdtos_v1" index and a "testdto" type. An alias "testdtos" is then added to the index. This is very useful if a live reindex is required or if the data types change, because the search client only uses the alias. When the index is created, the RefreshInterval and NumberOfReplicas properties are set so that both the refresh and the replication are turned off. These recommendations come from the Elasticsearch documentation; see the links at the end of this post.

public void CreateIndexWithAlias()
{		
	IElasticsearchMappingResolver elasticsearchMappingResolver = new ElasticsearchMappingResolver();
	elasticsearchMappingResolver.AddElasticSearchMappingForEntityType(typeof(TestDto), new ElasticsearchMappingTestDto());
	using (var context = new ElasticsearchContext( ConnectionString, new ElasticsearchSerializerConfiguration(elasticsearchMappingResolver, true, true)))
	{
		context.TraceProvider = new ConsoleTraceProvider();

		context.IndexCreate<TestDto>(
			new IndexDefinition
			{
				IndexAliases = new IndexAliases
				{
					Aliases = new List<IndexAlias>
					{
						// alias maps to the default index name
						new IndexAlias("testdtos")
					}
				}, 
				IndexSettings = new IndexSettings{RefreshInterval="-1", NumberOfReplicas = 0}
			}
		);
	}
}
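The reason the alias helps with a later live reindex is that Elasticsearch can move an alias from one index to another in a single atomic request to its `_aliases` endpoint. The following is only a minimal sketch of that request body, built with plain string concatenation. It talks about the raw REST API rather than ElasticsearchCRUD, and the target index name "testdtos_v2" and the helper name BuildAliasSwapBody are assumptions made up for the illustration.

```csharp
using System;

public static class AliasSwapSketch
{
    // Builds the JSON body for POST /_aliases.
    // Both actions in the body are applied atomically by Elasticsearch,
    // so searches on the alias never see a state without an index behind it.
    public static string BuildAliasSwapBody(string alias, string oldIndex, string newIndex)
    {
        return "{\"actions\":["
            + "{\"remove\":{\"index\":\"" + oldIndex + "\",\"alias\":\"" + alias + "\"}},"
            + "{\"add\":{\"index\":\"" + newIndex + "\",\"alias\":\"" + alias + "\"}}"
            + "]}";
    }

    public static void Main()
    {
        // "testdtos_v2" is a hypothetical index created for a reindex.
        Console.WriteLine(BuildAliasSwapBody("testdtos", "testdtos_v1", "testdtos_v2"));
    }
}
```

Because the search client only knows the alias "testdtos", it would not need to change when the alias is moved to a new index.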

The mapping for the index can be changed from the default by implementing the ElasticsearchMapping class. Any index or type can be set here for any DTO class, and the mapping is then added to the context. This mapping is used for the class as long as the context exists. The ElasticsearchMappingTestDto mapping is only required to create the index; everywhere else the alias is used, which matches the default index name in ElasticsearchCRUD.

public class ElasticsearchMappingTestDto : ElasticsearchMapping
{
	public override string GetIndexForType(Type type)
	{
		return "testdtos_v1";
	}
}

The settings can be viewed in Elasticsearch using the _settings API:
http://localhost:9200/_settings

The "refresh_interval" and "number_of_replicas" settings now have the values "-1" and "0", as required.

{
	"testdtos_v1": {
		"settings": {
			"index": {
				"creation_date": "1422473302831",
				"uuid": "luOwcuQiRyqxX3IvTTovWg",
				"number_of_replicas": "0",
				"number_of_shards": "5",
				"refresh_interval": "-1",
				"version": {
					"created": "1040299"
				}
			}
		}
	}
}

The TestDto class has the following mapping:

{
	"testdtos_v1": {
		"mappings": {
			"testdto": {
				"properties": {
					"description": {
						"type": "string"
					},
					"id": {
						"type": "long"
					},
					"info": {
						"type": "string"
					}
				}
			}
		}
	}
}

Now that the index is created, a million documents are added in 100 bulk HTTP requests of 10,000 documents each. The optimal size of a bulk request, and so the optimal number of documents in it, depends on the size of each document and on the Elasticsearch installation. The SaveChanges method sends the bulk request.

public void DoBulkInsert()
{
	// Add a million records
	long id = 1;
	for (int i = 0; i < 100; i++)
	{
		for (int t = 0; t < 10000; t++)
		{
			var item = new TestDto
			{
				Id = id,
				Description = "this is cool",
				Info = "info"
			};
			_elasticsearchContext.AddUpdateDocument(item, item.Id);
			id++;
		}
		// send the 10,000 pending documents as one bulk request
		_elasticsearchContext.SaveChanges();
		Console.WriteLine("Saved:" + (i + 1) * 10000 + " items");
	}
}
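To make it clearer what a bulk request is, the sketch below builds the kind of newline-delimited body that the Elasticsearch `_bulk` endpoint accepts: one action line followed by one source line per document, each terminated by a newline. This is an illustration only. The helper name BuildBulkBody is invented here, the "_type" field applies to the old Elasticsearch versions used in this post, and the exact action lines that ElasticsearchCRUD sends may differ.

```csharp
using System;
using System.Text;

public static class BulkBodySketch
{
    // Builds a body for POST /_bulk containing `count` index actions,
    // with ids starting at `firstId`. Every line ends with '\n',
    // including the last one, as the bulk API requires.
    public static string BuildBulkBody(string index, string type, long firstId, int count)
    {
        var body = new StringBuilder();
        for (long id = firstId; id < firstId + count; id++)
        {
            // Action line: which index, type and id the next line belongs to.
            body.Append("{\"index\":{\"_index\":\"").Append(index)
                .Append("\",\"_type\":\"").Append(type)
                .Append("\",\"_id\":\"").Append(id).Append("\"}}\n");
            // Source line: the document itself, using the TestDto field names.
            body.Append("{\"id\":").Append(id)
                .Append(",\"description\":\"this is cool\",\"info\":\"info\"}\n");
        }
        return body.ToString();
    }

    public static void Main()
    {
        // Two documents produce four lines: action, source, action, source.
        Console.Write(BuildBulkBody("testdtos_v1", "testdto", 1, 2));
    }
}
```

Seen this way, the batch size of 10,000 simply decides how many of these line pairs travel in one HTTP request, which is why the best value depends on the document size.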

Once the bulk insert is finished, the refresh is turned back on and the number of replicas is set to the required value.

public void UpdateIndexRefreshIntervalTo1S()
{
	_elasticsearchContext.IndexUpdateSettings
	(
		new IndexUpdateSettings
		{
			RefreshInterval = "1s",
			NumberOfReplicas = 1
		}
	);
}
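For comparison, the same change can be made against the Elasticsearch REST API directly with a PUT to the index's `_settings` endpoint. The following is a rough sketch using HttpClient, not what ElasticsearchCRUD does internally. The class and method names are invented for the example, and the base URL assumes a local node on port 9200 as in the rest of this post.

```csharp
using System;
using System.Net;
using System.Net.Http;
using System.Text;

public static class IndexSettingsSketch
{
    // Builds the body for PUT /{index}/_settings.
    public static string BuildSettingsBody(string refreshInterval, int numberOfReplicas)
    {
        return "{\"index\":{\"refresh_interval\":\"" + refreshInterval
            + "\",\"number_of_replicas\":" + numberOfReplicas + "}}";
    }

    // Sends the settings update and returns the HTTP status code.
    // Blocking on .Result keeps the sketch short; real code would use await.
    public static HttpStatusCode Apply(string baseUrl, string index, string body)
    {
        using (var client = new HttpClient())
        using (var content = new StringContent(body, Encoding.UTF8, "application/json"))
        {
            var response = client.PutAsync(baseUrl + "/" + index + "/_settings", content).Result;
            return response.StatusCode;
        }
    }

    public static void Main()
    {
        string body = BuildSettingsBody("1s", 1);
        Console.WriteLine(body);
        // Requires a running node, so the call is left commented out:
        // Console.WriteLine(Apply("http://localhost:9200", "testdtos_v1", body));
    }
}
```

Passing "-1" and 0 to BuildSettingsBody would give the body that turns refresh and replication off again before another large import.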

Viewing the settings again shows that the two property values have been updated:
http://localhost:9200/_settings

{
	"testdtos_v1": {
		"settings": {
			"index": {
				"creation_date": "1422473302831",
				"uuid": "luOwcuQiRyqxX3IvTTovWg",
				"number_of_replicas": "1",
				"number_of_shards": "5",
				"refresh_interval": "1s",
				"version": {
					"created": "1040299"
				}
			}
		}
	}
}

The index is now ready to be searched through its alias:

http://localhost:9200/testdtos/testdto/_search

Links:

http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/bulk.html

http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/indexing-performance.html

http://gibrown.com/2014/02/06/scaling-elasticsearch-part-2-indexing/
