Importing Data¶
Overview¶
SolveBio specializes in harmonizing disparate datasets through its robust import system. Importing data is the process of converting a flat file into a SolveBio dataset that can be queried in real time. SolveBio supports data in many formats, including: JSONL, VCF, CSV, TSV, XML, GTF, and GFF3. SolveBio also provides a service to assist you in importing any other format (including custom, proprietary formats and unstructured data). Contact SolveBio Support for import assistance.
Pro Tip
Try our Quickstart Tutorial to get started right away with an end-to-end example.
SolveBio's import system automates the traditional ETL (Extract, Transform, Load) process. The process typically starts by uploading files into a SolveBio vault. An import task can then be configured and launched. The import system automatically handles data extraction (file parsing), data transformation, data validation, and finally data loading.
Supported Formats¶
The following file formats and extensions are supported:
| Reader | Extensions | Reference |
| --- | --- | --- |
| CSV | `.csv` | Docs |
| GFF3 | `.gff3` | Docs |
| GTF | `.gtf` | — |
| JSONL | `.json` | Docs |
| TSV | `.tsv`, `.txt`, `.maf` | Docs |
| VCF | `.vcf` | Docs |
| XML | `.xml` | Docs |
| Nirvana | `.json` | Docs |
SolveBio automatically detects the file format based on the extension (except for Nirvana JSON files) and parses the file using a specialized "reader". It is possible to manually specify a reader and modify reader parameters using the `reader_params` attribute of the `DatasetImport` resource. Learn more about reader parameters.
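As an illustration, reader parameters can be passed when the import is launched. The specific keys below (forcing the TSV reader and overriding its delimiter) and the file path are assumptions made for this sketch; consult the reader parameters reference for the options each reader actually accepts.

```python
from solvebio import Dataset, DatasetImport, Object

dataset = Dataset.get_or_create_by_full_path('~/python_examples/test_dataset')
uploaded_file = Object.get_by_full_path('~/file.txt.gz')  # illustrative path

# Force a specific reader and tweak its behavior via reader_params.
# The "reader" and "delimiter" keys are illustrative assumptions only;
# see the reader parameters reference for the supported options.
imp = DatasetImport.create(
    dataset_id=dataset.id,
    object_id=uploaded_file.id,
    reader_params={
        'reader': 'tsv',
        'delimiter': '\t'
    }
)
```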
SolveBio supports GZip compression for all file types. Gzipped files must have the `.gz` extension in addition to their format extension (e.g. `file.vcf.gz`).
Use Compression Whenever Possible
We recommend compressing files with GZip for faster uploads and imports.
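If a file is not already compressed, it can be gzipped before uploading. A minimal sketch using only the Python standard library (the file paths are placeholders):

```python
import gzip
import shutil

# Compress an uncompressed VCF before uploading (paths are hypothetical)
with open('local/path/file.vcf', 'rb') as src:
    with gzip.open('local/path/file.vcf.gz', 'wb') as dst:
        shutil.copyfileobj(src, dst)
```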
Importing from Files¶
The first step to getting your data onto SolveBio is uploading files into a vault. Learn more about vaults.
```python
from solvebio import Vault
from solvebio import Object

# Upload a local file to the root of your personal Vault
vault = Vault.get_personal_vault()
uploaded_file = vault.upload_file('local/path/file.vcf.gz', '/')

# You can now retrieve the file by its full path:
uploaded_file = Object.get_by_full_path('~/file.vcf.gz')
```
```r
library(solvebio)

vault <- Vault.get_personal_vault()
uploaded_file <- Object.upload_file('local/path/file.vcf.gz', vault$id, '/')

# You can now retrieve the file by its full path:
uploaded_file <- Object.get_by_full_path('~/file.vcf.gz')
```
Once your files have been uploaded, you can import them into any new or existing dataset (learn how to create a dataset). To launch an import, use the `DatasetImport` resource. You'll need to provide your uploaded file and target dataset as input. Once the import has been launched, you can track its progress through the API or on the web.
```python
from solvebio import Dataset
from solvebio import DatasetImport

dataset = Dataset.get_or_create_by_full_path('~/python_examples/test_dataset')

# Launch the import
imp = DatasetImport.create(
    dataset_id=dataset.id,
    object_id=uploaded_file.id,
    commit_mode='append'
)

# Follow the import status
dataset.activity(follow=True)
```
```r
library(solvebio)

vault <- Vault.get_personal_vault()
dataset_full_path <- paste(vault$full_path, "/r_examples/test_dataset", sep=":")
dataset <- Dataset.get_or_create_by_full_path(dataset_full_path)

# Launch the import
imp <- DatasetImport.create(dataset_id = dataset$id,
                            object_id = uploaded_file$id,
                            commit_mode = 'append')

# Wait for the import to complete
Dataset.activity(dataset$id)
```
Importing from URLs¶
If your files are on a remote server and accessible by URL, you can import them using a manifest. A manifest is simply a list of files (URLs and other attributes) to import:
```python
from solvebio import Manifest

source_url = "https://s3.amazonaws.com/downloads.solvebio.com/demo/interesting-variants.json.gz"

manifest = Manifest()
manifest.add_url(source_url)
```
```r
source_url <- "https://s3.amazonaws.com/downloads.solvebio.com/demo/interesting-variants.json.gz"

manifest <- list(
    files = list(
        list(url = source_url)
    )
)
```
Once your manifest has been created, you can import it into any new or existing dataset (learn how to create a dataset). To launch an import, use the `DatasetImport` resource, providing your manifest and target dataset as input. Once the import has been launched, you can track its progress through the API or on the web. For more details, see the manifest reference.
```python
from solvebio import Dataset
from solvebio import DatasetImport

dataset = Dataset.get_or_create_by_full_path('~/python_examples/manifest_dataset')

# Launch the import
imp = DatasetImport.create(
    dataset_id=dataset.id,
    manifest=manifest.manifest,
    commit_mode='append'
)

# Follow the import status
dataset.activity(follow=True)
```
```r
library(solvebio)

vault <- Vault.get_personal_vault()
dataset_full_path <- paste(vault$full_path, "/r_examples/manifest_dataset", sep=":")
dataset <- Dataset.get_or_create_by_full_path(dataset_full_path)

# Launch the import
imp <- DatasetImport.create(dataset_id = dataset$id,
                            manifest = manifest,
                            commit_mode = 'append')

# Wait for the import to complete
Dataset.activity(dataset$id)
```
SolveBio can also pull data from DNAnexus, SevenBridges, or any other pipeline. Contact SolveBio for more information.
Importing from Records¶
You may only import up to 5,000 records at a time using this method.
This import method works well for small datasets and for making quick edits to existing datasets. For larger imports and transforms, we recommend importing from compressed JSONL files.
SolveBio can also import a list of records directly, i.e. a list of Python dictionaries or a list of R lists.
```python
from solvebio import Dataset
from solvebio import DatasetImport

records = [
    {'gene': 'CFTR', 'importance': 1, 'sample_count': 2104},
    {'gene': 'BRCA1', 'importance': 1, 'sample_count': 1391},
    {'gene': 'CLIC2', 'importance': 5, 'sample_count': 14},
]

dataset = Dataset.get_or_create_by_full_path('~/python_examples/records_dataset')

imp = DatasetImport.create(
    dataset_id=dataset.id,
    data_records=records
)
```
```r
library(solvebio)

vault <- Vault.get_personal_vault()
dataset_full_path <- paste(vault$full_path, "/r_examples/records_dataset", sep=":")
dataset <- Dataset.get_or_create_by_full_path(dataset_full_path)

records <- list(
    list(gene='CFTR', importance=1, sample_count=2104),
    list(gene='BRCA2', importance=1, sample_count=1391),
    list(gene='CLIC2', importance=5, sample_count=14)
)

imp <- DatasetImport.create(dataset_id = dataset$id, data_records = records)
```
Command Line Tools¶
Requires SolveBio's Python Client
The Command Line Interface (CLI) is a component of SolveBio's Python client.
Using the `solvebio import` command, you can import files from the command line:
```bash
# Import a file (create the dataset if necessary):
solvebio import --create-dataset --follow ~/test-dataset data.vcf.gz

# Import files in upsert mode (create the dataset from a template if necessary):
solvebio import --create-dataset --template-file template.json --commit-mode=upsert --follow ~/test-dataset data.vcf.gz
```
Here is an example dataset template file (i.e. `template.json`):
{ "name": "Example SolveBio Dataset template", "fields": [ { "name": "gene_symbol", "description": "HUGO gene symbol", "entity_type": "gene", "data_type": "string" }, { "name": "variant", "entity_type": "variant", "data_type": "string" } ] }
If you simply want to upload files into a vault in batch, you can use the `solvebio upload` command. If multiple files are uploaded, the command cross-checks the files and uploads only the missing files and folders. Please note that the comparison is performed by filename, not by file contents.
```bash
# Upload local_folder to the root of your personal vault:
solvebio upload ./local_folder
```
Transforming Imported Data¶
You can transform your imported data (add or edit fields) by providing a list of fields to the `target_fields` parameter. You can use expressions to dynamically modify your data as it is being imported, making it possible to:
- Modify data types (numbers to strings or vice versa)
- Add new fields with static or dynamic content
- Reformat strings and dates to clean up the data
- Merge in data from other datasets
- And a lot more!
The following example imports a list of records and transforms the contents in a single step:
```python
from solvebio import Dataset
from solvebio import DatasetImport

dataset = Dataset.get_or_create_by_full_path('~/python_examples/transform_import')

# The original records
records = [
    {'name': 'Francis Crick'},
    {'name': 'James Watson'},
    {'name': 'Rosalind Franklin'}
]

# The transforms to apply through "target_fields":
# compute the first and last names.
target_fields = [
    {
        "name": "first_name",
        "description": "Adds a first name column based on name column",
        "data_type": "string",
        "expression": "record.name.split(' ')[0]"
    },
    {
        "name": "last_name",
        "description": "Adds a last name column based on name column",
        "data_type": "string",
        "expression": "record.name.split(' ')[-1]"
    }
]

imp = DatasetImport.create(
    dataset_id=dataset.id,
    data_records=records,
    target_fields=target_fields
)

# Wait until the import finishes
dataset.activity(follow=True)

for record in dataset.query(exclude_fields=['_id', '_commit']):
    print(record)

# Output:
# {'first_name': 'Francis', 'last_name': 'Crick', 'name': 'Francis Crick'}
# {'first_name': 'James', 'last_name': 'Watson', 'name': 'James Watson'}
# {'first_name': 'Rosalind', 'last_name': 'Franklin', 'name': 'Rosalind Franklin'}
```
```r
library(solvebio)

vault <- Vault.get_personal_vault()
dataset_full_path <- paste(vault$full_path, "/r_examples/transform_dataset", sep=":")
dataset <- Dataset.get_or_create_by_full_path(dataset_full_path)

# The original records
records <- list(
    list(name='Francis Crick'),
    list(name='James Watson'),
    list(name='Rosalind Franklin')
)

# The transforms to apply through "target_fields":
# compute the first and last names.
target_fields <- list(
    list(
        name="first_name",
        description="Adds a first name column based on name column",
        data_type="string",
        expression="record.name.split(' ')[0]"
    ),
    list(
        name="last_name",
        description="Adds a last name column based on name column",
        data_type="string",
        expression="record.name.split(' ')[-1]"
    )
)

# Import and transform the records
imp <- DatasetImport.create(dataset_id = dataset$id,
                            data_records = records,
                            target_fields = target_fields)

# Wait until the import is completed
Dataset.activity(dataset$id)

Dataset.query(id = dataset$id, exclude_fields=list('_id', '_commit'))

# Output:
#   first_name last_name              name
# 1    Francis     Crick     Francis Crick
# 2      James    Watson      James Watson
# 3   Rosalind  Franklin Rosalind Franklin
```
Existing imported data can also be modified using migrations, which allow a user to add a column, modify the data within a column, or remove a column.
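For instance, a migration can add a derived column to records that are already in a dataset. The sketch below assumes the `DatasetMigration` resource accepts source and target dataset IDs plus a `target_fields` list, mirroring the import examples above; treat the argument names as assumptions and see the transform documentation for the authoritative interface.

```python
from solvebio import Dataset, DatasetMigration

dataset = Dataset.get_by_full_path('~/python_examples/transform_import')

# A hedged sketch: add an "initials" column to existing records in place.
# The argument names below are assumptions modeled on the import examples.
migration = DatasetMigration.create(
    source_id=dataset.id,
    target_id=dataset.id,
    target_fields=[{
        'name': 'initials',
        'data_type': 'string',
        'expression': "record.name.split(' ')[0][0] + record.name.split(' ')[-1][0]"
    }],
    commit_mode='overwrite'
)

# Follow the migration's progress as with an import
dataset.activity(follow=True)
```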
Learn more about transforming data →
Validating Imported Data¶
When importing data, every record is validated to ensure it can be committed into a SolveBio Dataset. Validation compares the schema of existing Dataset fields with the values of incoming data and issues validation errors if the Dataset field schema does not match the incoming value. Validation can also issue warnings.
During validation, each field's `data_type` and `is_list` values are checked. All records are evaluated, although you may override this to fail fast on the first error. A commit will not be created if there are any validation errors.
The following settings can be passed to the `validation_params` field:

- `disable` (boolean, default False): disables validation completely.
- `raise_on_errors` (boolean, default False): fails the import on the first validation error encountered.
- `strict_validation` (boolean, default False): upgrades all validation warnings to errors.
- `allow_new_fields` (boolean, default False): if strict validation is True, still allows new fields to be added.
See the full validation reference →
The following example fails an import as soon as invalid data is detected:
```python
imp = DatasetImport.create(
    dataset_id=dataset.id,
    object_id=uploaded_file.id,
    validation_params={
        'raise_on_errors': True
    }
)
```
```r
imp <- DatasetImport.create(
    dataset_id = dataset$id,
    object_id = uploaded_file$id,
    validation_params = list(
        raise_on_errors = TRUE
    )
)
```
The following example disables validation entirely, which can improve import performance.
Disable validation at your own risk
Validation checks field data types and formatting and prevents downstream errors that can occur during the commit stage.
```python
imp = DatasetImport.create(
    dataset_id=dataset.id,
    object_id=uploaded_file.id,
    validation_params={
        'disable': True
    }
)
```
```r
imp <- DatasetImport.create(
    dataset_id = dataset$id,
    object_id = uploaded_file$id,
    validation_params = list(
        disable = TRUE
    )
)
```
Committing Imported Data¶
Once data has been extracted from files, transformed, and validated, it is automatically indexed ("committed") into SolveBio's datastore. Dataset commits represent all changes made to the target dataset by the import process. There are four commit modes that can be selected depending on the scenario: `append` (default), `overwrite`, `upsert`, and `delete`. The commit mode can be specified when creating the `DatasetImport` using the `commit_mode` parameter.
`append` (default)

Append mode always adds records to the dataset. Imported record IDs (the `_id` field) will be overwritten with unique values. Only append commits can be rolled back at this time.
`overwrite`

Overwrite mode requires that each record have a value in the `_id` field. Existing records with the same `_id` are overwritten completely.
`upsert`

Upsert mode merges imported records with existing records based on the value of their `_id` field. Object fields are merged together, scalar fields (such as integers and strings) are overwritten, and new fields are added. List fields are completely overwritten regardless of the data type.
`delete`

Delete mode is a special case that deletes existing dataset records based on their `_id` field.
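For example, to update existing records in place rather than appending new ones, you can launch an import in upsert mode. This sketch reuses the records dataset from above; the `_id` value is a placeholder, since real IDs come from querying the dataset.

```python
from solvebio import Dataset, DatasetImport

dataset = Dataset.get_or_create_by_full_path('~/python_examples/records_dataset')

# Each record must carry the _id of the record it should update.
# The _id below is a placeholder; real values come from querying the dataset.
updates = [
    {'_id': 'EXISTING_RECORD_ID', 'importance': 2}
]

imp = DatasetImport.create(
    dataset_id=dataset.id,
    data_records=updates,
    commit_mode='upsert'
)

# Follow the import status
dataset.activity(follow=True)
```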
Performance Tips¶
Below are some tips for improving the performance of your dataset imports.
Disable Data Validation

Data validation is enabled by default when running imports or migrations, and performs data type checking on each record that is processed. Disabling it provides a per-record performance improvement, which translates to substantial time savings for large datasets.
Dataset Capacity

For many simultaneous imports, use a larger dataset capacity. Simultaneous imports have a high upper limit (50+), but simultaneous commits are throttled. Every import spawns a commit that does the actual indexing of the data. `small` capacity datasets allow a single running commit per dataset at a time, `medium` datasets allow 2 simultaneous commits, and `large` datasets allow 3. Additional commits remain queued until the running ones are completed.
Indexing and query operations are also faster for larger capacity datasets. If you expect a dataset to be queried at high frequency, we recommend using a larger capacity. If the dataset already exists, copy it into a `medium` or `large` dataset.
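For example, if you know in advance that a dataset will receive many simultaneous imports or heavy query traffic, you may be able to request a larger capacity when creating it. The `capacity` argument below is an assumption based on the capacities named above; verify it against the dataset creation reference for your client version.

```python
from solvebio import Dataset

# A hedged sketch: request a "medium" capacity dataset up front.
# The capacity keyword is an assumption; check the dataset creation
# reference for the parameters your client version supports.
dataset = Dataset.get_or_create_by_full_path(
    '~/python_examples/high_traffic_dataset',
    capacity='medium'
)
```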
Optimize "Expensive" Expressions

Some dataset field expressions are more expensive than others. Dataset query expressions can be sped up by applying exact filters, using `fields` to pull back only the fields that are needed, or using `dataset_count()` if only the number of matching records is needed.
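As a sketch of the idea, the two expression strings below contrast pulling back full records just to count them with asking for the count directly. The dataset path, the `filters`/`fields` arguments, and the exact `dataset_query()`/`dataset_count()` signatures are assumptions made for illustration; check the expressions reference before using them.

```python
# Hypothetical expression strings for a "target_fields" entry.
# The dataset path and argument names are assumptions, not the
# authoritative expression function signatures.

# Slower: retrieves matching records (even when trimmed with `fields`)
# just to measure their length.
slow_expression = (
    "len(dataset_query('solvebio:public:/example/GeneCounts', "
    "filters=[['gene', record.gene]], fields=['gene']))"
)

# Faster: exact filters plus dataset_count() return only the number of hits.
fast_expression = (
    "dataset_count('solvebio:public:/example/GeneCounts', "
    "filters=[['gene', record.gene]])"
)
```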