Importing Data¶
Overview¶
SolveBio specializes in harmonizing disparate datasets through its robust import system. Importing data is the process of converting a flat file into a SolveBio dataset that can be queried in real time. SolveBio supports data in many formats, including: JSONL, VCF, CSV, TSV, XML, GTF, and GFF3. SolveBio also provides a service to assist you in importing any other format (including custom, proprietary formats and unstructured data). Contact SolveBio Support for import assistance.
Pro Tip
Try our Quickstart Tutorial to get started right away with an end-to-end example.
SolveBio's import system automates the traditional ETL (Extract, Transform, Load) process. The process typically starts by uploading files into a SolveBio vault. An import task can then be configured and launched. The import system automatically handles data extraction (file parsing), data transformation, data validation, and finally data loading.
Supported Formats¶
The following file formats and extensions are supported:
| Reader | Extensions | Reference |
| --- | --- | --- |
| CSV | `.csv` | Docs |
| GFF3 | `.gff3` | Docs |
| GTF | `.gtf` | — |
| JSONL | `.json` | Docs |
| TSV | `.tsv`, `.txt`, `.maf` | Docs |
| VCF | `.vcf` | Docs |
| XML | `.xml` | Docs |
| Nirvana | `.json` | Docs |
SolveBio automatically detects the file format based on the extension (except for Nirvana JSON files) and parses the file using a specialized "reader". It is possible to manually specify a reader and modify reader parameters using the `reader_params` attribute of the `DatasetImport` resource. Learn more about reader parameters.
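As an illustration, reader parameters can be passed when the import is launched. The specific keys below (forcing the TSV reader and overriding its delimiter) and the file path are assumptions made for this sketch; consult the reader parameters reference for the options each reader actually accepts.

```python
from solvebio import Dataset, DatasetImport, Object

dataset = Dataset.get_or_create_by_full_path('~/python_examples/test_dataset')
uploaded_file = Object.get_by_full_path('~/file.txt.gz')  # illustrative path

# Force a specific reader and tweak its behavior via reader_params.
# The "reader" and "delimiter" keys are illustrative assumptions only;
# see the reader parameters reference for the supported options.
imp = DatasetImport.create(
    dataset_id=dataset.id,
    object_id=uploaded_file.id,
    reader_params={
        'reader': 'tsv',
        'delimiter': '\t'
    }
)
```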
SolveBio supports GZip compression for all file types. Gzipped files must have the `.gz` extension in addition to their format extension (e.g. `file.vcf.gz`).
Use Compression Whenever Possible
We recommend compressing files with GZip for faster uploads and imports.
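If a file is not already compressed, it can be gzipped before uploading. A minimal sketch using only the Python standard library (the file paths are placeholders):

```python
import gzip
import shutil

# Compress an uncompressed VCF before uploading (paths are hypothetical)
with open('local/path/file.vcf', 'rb') as src:
    with gzip.open('local/path/file.vcf.gz', 'wb') as dst:
        shutil.copyfileobj(src, dst)
```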
Importing from Files¶
The first step to getting your data onto SolveBio is uploading files into a vault. Learn more about vaults.
```python
from solvebio import Vault
from solvebio import Object

# Upload a local file to the root of your personal Vault
vault = Vault.get_personal_vault()
uploaded_file = vault.upload_file('local/path/file.vcf.gz', '/')

# You can now retrieve the file by its full path:
uploaded_file = Object.get_by_full_path('~/file.vcf.gz')
```
```r
library(solvebio)

vault <- Vault.get_personal_vault()
uploaded_file <- Object.upload_file('local/path/file.vcf.gz', vault$id, '/')

# You can now retrieve the file by its full path:
uploaded_file <- Object.get_by_full_path('~/file.vcf.gz')
```
Once your files have been uploaded, you can import them into any new or existing dataset (learn how to create a dataset). To launch an import, use the `DatasetImport` resource. You'll need to provide your uploaded file and target dataset as input. Once the import has been launched, you can track its progress through the API or on the web.
```python
from solvebio import Dataset
from solvebio import DatasetImport

dataset = Dataset.get_or_create_by_full_path('~/python_examples/test_dataset')

# Launch the import
imp = DatasetImport.create(
    dataset_id=dataset.id,
    object_id=uploaded_file.id,
    commit_mode='append'
)

# Follow the import status
dataset.activity(follow=True)
```
```r
library(solvebio)

vault <- Vault.get_personal_vault()
dataset_full_path <- paste(vault$full_path, "/r_examples/test_dataset", sep=":")
dataset <- Dataset.get_or_create_by_full_path(dataset_full_path)

# Launch the import
imp <- DatasetImport.create(dataset_id = dataset$id,
                            object_id = uploaded_file$id,
                            commit_mode = 'append')

# Wait for the import to complete
Dataset.activity(dataset$id)
```
Importing from URLs¶
If your files are on a remote server and accessible by URL, you can import them using a manifest. A manifest is simply a list of files (URLs and other attributes) to import:
```python
from solvebio import Manifest

source_url = "https://s3.amazonaws.com/downloads.solvebio.com/demo/interesting-variants.json.gz"

manifest = Manifest()
manifest.add_url(source_url)
```
```r
source_url <- "https://s3.amazonaws.com/downloads.solvebio.com/demo/interesting-variants.json.gz"

manifest <- list(
    files = list(
        list(url = source_url)
    )
)
```
Once your manifest has been created, you can import it into any new or existing dataset (learn how to create a dataset). To launch an import, use the `DatasetImport` resource, providing your manifest and target dataset as input. Once the import has been launched, you can track its progress through the API or on the web. For more details, see the manifest reference.
```python
from solvebio import Dataset
from solvebio import DatasetImport

dataset = Dataset.get_or_create_by_full_path('~/python_examples/manifest_dataset')

# Launch the import
imp = DatasetImport.create(
    dataset_id=dataset.id,
    manifest=manifest.manifest,
    commit_mode='append'
)

# Follow the import status
dataset.activity(follow=True)
```
```r
library(solvebio)

vault <- Vault.get_personal_vault()
dataset_full_path <- paste(vault$full_path, "/r_examples/manifest_dataset", sep=":")
dataset <- Dataset.get_or_create_by_full_path(dataset_full_path)

# Launch the import
imp <- DatasetImport.create(dataset_id = dataset$id,
                            manifest = manifest,
                            commit_mode = 'append')

# Wait for the import to complete
Dataset.activity(dataset$id)
```
SolveBio can also pull data from DNAnexus, SevenBridges, or any other pipeline. Contact SolveBio for more information.
Importing from Records¶
You may only import up to 5,000 records at a time using this method.
This import method works well for small datasets and for making quick edits to existing datasets. For larger imports and transforms, we recommend importing from compressed JSONL files.
SolveBio can also import a list of records directly, i.e. a list of Python dictionaries or a list of R lists.
```python
from solvebio import Dataset
from solvebio import DatasetImport

records = [
    {'gene': 'CFTR', 'importance': 1, 'sample_count': 2104},
    {'gene': 'BRCA1', 'importance': 1, 'sample_count': 1391},
    {'gene': 'CLIC2', 'importance': 5, 'sample_count': 14},
]

dataset = Dataset.get_or_create_by_full_path('~/python_examples/records_dataset')

imp = DatasetImport.create(
    dataset_id=dataset.id,
    data_records=records
)
```
```r
library(solvebio)

vault <- Vault.get_personal_vault()
dataset_full_path <- paste(vault$full_path, "/r_examples/records_dataset", sep=":")
dataset <- Dataset.get_or_create_by_full_path(dataset_full_path)

records <- list(
    list(gene='CFTR', importance=1, sample_count=2104),
    list(gene='BRCA2', importance=1, sample_count=1391),
    list(gene='CLIC2', importance=5, sample_count=14)
)

imp <- DatasetImport.create(dataset_id = dataset$id, data_records = records)
```
Command Line Tools¶
Requires SolveBio's Python Client
The Command Line Interface (CLI) is a component of SolveBio's Python client.
Using the `solvebio import` command, you can import files from the command line:
```bash
# Import a file (create the dataset if necessary):
solvebio import --create-dataset --follow ~/test-dataset data.vcf.gz

# Import files in upsert mode (create the dataset from a template if necessary):
solvebio import --create-dataset --template-file template.json --commit-mode=upsert --follow ~/test-dataset data.vcf.gz
```
Here is an example dataset template file (i.e. `template.json`):
{ "name": "Example SolveBio Dataset template", "fields": [ { "name": "gene_symbol", "description": "HUGO gene symbol", "entity_type": "gene", "data_type": "string" }, { "name": "variant", "entity_type": "variant", "data_type": "string" } ] }
If you simply want to upload files into a vault in batch, you can use the `solvebio upload` command. If multiple files are uploaded, the command cross-checks the files and uploads only the missing files and folders. Please note that the comparison is performed by filename, not by file contents.
```bash
# Upload local_folder to the root of your personal vault:
solvebio upload ./local_folder
```
Transforming Imported Data¶
You can transform your imported data (add or edit fields) by providing a list of fields to the `target_fields` parameter. You can use expressions to dynamically modify your data as it is being imported, making it possible to:
- Modify data types (numbers to strings or vice versa)
- Add new fields with static or dynamic content
- Reformat strings and dates to clean up the data
- Merge in data from other datasets
- And a lot more!
The following example imports a list of records and transforms the contents in a single step:
```python
from solvebio import Dataset
from solvebio import DatasetImport

dataset = Dataset.get_or_create_by_full_path('~/python_examples/transform_import')

# The original records
records = [
    {'name': 'Francis Crick'},
    {'name': 'James Watson'},
    {'name': 'Rosalind Franklin'}
]

# The transforms to apply through "target_fields":
# compute the first and last names.
target_fields = [
    {
        "name": "first_name",
        "description": "Adds a first name column based on name column",
        "data_type": "string",
        "expression": "record.name.split(' ')[0]"
    },
    {
        "name": "last_name",
        "description": "Adds a last name column based on name column",
        "data_type": "string",
        "expression": "record.name.split(' ')[-1]"
    }
]

imp = DatasetImport.create(
    dataset_id=dataset.id,
    data_records=records,
    target_fields=target_fields
)

# Wait until the import finishes
dataset.activity(follow=True)

for record in dataset.query(exclude_fields=['_id', '_commit']):
    print(record)

# Output:
# {'first_name': 'Francis', 'last_name': 'Crick', 'name': 'Francis Crick'}
# {'first_name': 'James', 'last_name': 'Watson', 'name': 'James Watson'}
# {'first_name': 'Rosalind', 'last_name': 'Franklin', 'name': 'Rosalind Franklin'}
```
```r
library(solvebio)

vault <- Vault.get_personal_vault()
dataset_full_path <- paste(vault$full_path, "/r_examples/transform_dataset", sep=":")
dataset <- Dataset.get_or_create_by_full_path(dataset_full_path)

# The original records
records <- list(
    list(name='Francis Crick'),
    list(name='James Watson'),
    list(name='Rosalind Franklin')
)

# The transforms to apply through "target_fields":
# compute the first and last names.
target_fields <- list(
    list(
        name="first_name",
        description="Adds a first name column based on name column",
        data_type="string",
        expression="record.name.split(' ')[0]"
    ),
    list(
        name="last_name",
        description="Adds a last name column based on name column",
        data_type="string",
        expression="record.name.split(' ')[-1]"
    )
)

# Import and transform the records
imp <- DatasetImport.create(dataset_id = dataset$id,
                            data_records = records,
                            target_fields = target_fields)

# Wait until the import is completed
Dataset.activity(dataset$id)

Dataset.query(id = dataset$id, exclude_fields=list('_id', '_commit'))

# Output:
#   first_name last_name              name
# 1    Francis     Crick     Francis Crick
# 2      James    Watson      James Watson
# 3   Rosalind  Franklin Rosalind Franklin
```
Existing imported data can also be modified using migrations, which allow a user to add a column, modify the data within a column, or remove a column.
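For instance, a migration can add a derived column to records that are already in a dataset. The sketch below assumes the `DatasetMigration` resource accepts source and target dataset IDs plus a `target_fields` list, mirroring the import examples above; treat the argument names as assumptions and see the transform documentation for the authoritative interface.

```python
from solvebio import Dataset, DatasetMigration

dataset = Dataset.get_by_full_path('~/python_examples/transform_import')

# A hedged sketch: add an "initials" column to existing records in place.
# The argument names below are assumptions modeled on the import examples.
migration = DatasetMigration.create(
    source_id=dataset.id,
    target_id=dataset.id,
    target_fields=[{
        'name': 'initials',
        'data_type': 'string',
        'expression': "record.name.split(' ')[0][0] + record.name.split(' ')[-1][0]"
    }],
    commit_mode='overwrite'
)

# Follow the migration's progress as with an import
dataset.activity(follow=True)
```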
Learn more about transforming data →
Validating Imported Data¶
When importing data, every record is validated to ensure it can be committed into a SolveBio Dataset. Validation compares the schema of existing Dataset fields with the values of incoming data and issues validation errors if the Dataset field schema does not match the incoming value. Validation can also issue warnings.
During validation, each field's `data_type` and `is_list` values are checked. All records are evaluated, although you may override this to fail fast on the first error. A commit will not be created if there are any validation errors.
The following settings can be passed to the `validation_params` field:

- `disable` (boolean, default False): disables validation completely.
- `raise_on_errors` (boolean, default False): fails the import on the first validation error encountered.
- `strict_validation` (boolean, default False): upgrades all validation warnings to errors.
- `allow_new_fields` (boolean, default False): if strict validation is True, still allows new fields to be added.
See the full validation reference →
The following example fails an import as soon as invalid data is detected:
```python
imp = DatasetImport.create(
    dataset_id=dataset.id,
    object_id=uploaded_file.id,
    validation_params={
        'raise_on_errors': True
    }
)
```
```r
imp <- DatasetImport.create(
    dataset_id = dataset$id,
    object_id = uploaded_file$id,
    validation_params = list(
        raise_on_errors = TRUE
    )
)
```
The following example disables validation entirely, which can improve import performance.
Disable validation at your own risk
Validation checks field data types and formatting and prevents downstream errors that can occur during the commit stage.
```python
imp = DatasetImport.create(
    dataset_id=dataset.id,
    object_id=uploaded_file.id,
    validation_params={
        'disable': True
    }
)
```
```r
imp <- DatasetImport.create(
    dataset_id = dataset$id,
    object_id = uploaded_file$id,
    validation_params = list(
        disable = TRUE
    )
)
```
Committing Imported Data¶
Once data has been extracted from files, transformed, and validated, it is automatically indexed ("committed") into SolveBio's datastore. Dataset commits represent all changes made to the target dataset by the import process. There are four commit modes that can be selected depending on the scenario: `append` (default), `overwrite`, `upsert`, and `delete`. The commit mode can be specified when creating the `DatasetImport` using the `commit_mode` parameter.
`append` (default)

Append mode always adds records to the dataset. Imported record IDs (the `_id` field) will be overwritten with unique values. Only append commits can be rolled back at this time.
`overwrite`

Overwrite mode requires that each record have a value in the `_id` field. Existing records with the same `_id` are overwritten completely.
`upsert`

Upsert mode merges imported records with existing records based on the value of their `_id` field. Object fields are merged together, scalar fields (such as integers and strings) are overwritten, and new fields are added. List fields are completely overwritten regardless of the data type.
`delete`

Delete mode is a special case that deletes existing dataset records based on their `_id` field.
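For example, to update existing records in place rather than appending new ones, you can launch an import in upsert mode. This sketch reuses the records dataset from above; the `_id` value is a placeholder, since real IDs come from querying the dataset.

```python
from solvebio import Dataset, DatasetImport

dataset = Dataset.get_or_create_by_full_path('~/python_examples/records_dataset')

# Each record must carry the _id of the record it should update.
# The _id below is a placeholder; real values come from querying the dataset.
updates = [
    {'_id': 'EXISTING_RECORD_ID', 'importance': 2}
]

imp = DatasetImport.create(
    dataset_id=dataset.id,
    data_records=updates,
    commit_mode='upsert'
)

# Follow the import status
dataset.activity(follow=True)
```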
Performance Tips¶
Below are some tips for improving the performance of your dataset imports.
Disable Data Validation

Data validation is enabled by default when running imports or migrations, and performs data type checking on each record that is processed. Disabling it provides a per-record performance improvement, which translates to substantial time savings for large datasets.
Dataset Capacity

For many simultaneous imports, use a larger dataset capacity. Simultaneous imports have a high upper limit (50+), but simultaneous commits are throttled. Every import spawns a commit that does the actual indexing of the data. `small` capacity datasets allow a single running commit per dataset at a time, `medium` datasets allow 2 simultaneous commits, and `large` datasets allow 3. Additional commits remain queued until the running ones are completed.
Indexing and query operations are also faster for larger capacity datasets. If you expect a dataset to be queried at high frequency, we recommend using a larger capacity. If the dataset already exists, copy it into a `medium` or `large` dataset.
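For example, if you know in advance that a dataset will receive many simultaneous imports or heavy query traffic, you may be able to request a larger capacity when creating it. The `capacity` argument below is an assumption based on the capacities named above; verify it against the dataset creation reference for your client version.

```python
from solvebio import Dataset

# A hedged sketch: request a "medium" capacity dataset up front.
# The capacity keyword is an assumption; check the dataset creation
# reference for the parameters your client version supports.
dataset = Dataset.get_or_create_by_full_path(
    '~/python_examples/high_traffic_dataset',
    capacity='medium'
)
```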
Optimize "Expensive" Expressions

Some dataset field expressions are more expensive than others. Dataset query expressions can be sped up by applying exact filters, using `fields` to pull back only the fields that are needed, or using `dataset_count()` if only the number of matching records is needed.
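As a sketch of the idea, the two expression strings below contrast pulling back full records just to count them with asking for the count directly. The dataset path, the `filters`/`fields` arguments, and the exact `dataset_query()`/`dataset_count()` signatures are assumptions made for illustration; check the expressions reference before using them.

```python
# Hypothetical expression strings for a "target_fields" entry.
# The dataset path and argument names are assumptions, not the
# authoritative expression function signatures.

# Slower: retrieves matching records (even when trimmed with `fields`)
# just to measure their length.
slow_expression = (
    "len(dataset_query('solvebio:public:/example/GeneCounts', "
    "filters=[['gene', record.gene]], fields=['gene']))"
)

# Faster: exact filters plus dataset_count() return only the number of hits.
fast_expression = (
    "dataset_count('solvebio:public:/example/GeneCounts', "
    "filters=[['gene', record.gene]])"
)
```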