Dataset Imports¶

create¶

HTTP Request

POST https://api.solvebio.com/v2/dataset_imports

Parameters

This request does not accept URL parameters.

Authorization

This request requires an authorized user with write permission on the dataset.

Request Body

In the request body, provide an object with the following properties:

Property	Value	Description
commit_mode	string	A valid commit mode.
dataset_id	integer	The target dataset to import into.
object_id	integer	(optional) The ID of an existing object on SolveBio.
manifest	object	(optional) A file manifest (see below).
data_records	objects	(optional) A list of records to import synchronously.
description	string	(optional) A description of this import.
entity_params	object	(optional) Configuration parameters for entity detection.
reader_params	object	(optional) Configuration parameters for readers.
validation_params	object	(optional) Configuration parameters for validation.
annotator_params	object	(optional) Configuration parameters for the Annotator.
include_errors	boolean	If True, a new field (`_errors`) will be added to each record containing expression evaluation errors (default: True).
target_fields	objects	A list of valid dataset fields to create or override in the import.
priority	integer	A priority to assign to this task.

When creating a new import, one of manifest, object_id, or data_records must be provided. A manifest lets you import a remote file that is accessible over HTTP(S), for example:

# Example Manifest
{
    "files": [{
        "url": "https://example.com/file.json.gz",
        "name": "file.json.gz",
        "format": "json",
        "size": 100,
        "md5": "",
        "base64_md5": ""
    }]
}
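A manifest entry like the one above could be assembled programmatically. The following is a minimal sketch, not part of the SolveBio client: `build_manifest_entry` is a hypothetical helper, and it assumes that `size` is in bytes, that `md5` is a hex digest, and that `base64_md5` is the Base64 form of the same digest (the example above leaves both checksum values empty).

```python
import base64
import hashlib
import json


def build_manifest_entry(url, content, file_format="json"):
    """Build one entry of the manifest's "files" list for a remote file.

    `content` is the file's bytes, used only to compute size and checksums.
    Assumption: "md5" is a hex digest and "base64_md5" is the Base64 form
    of the same digest; the documentation does not spell this out.
    """
    digest = hashlib.md5(content)
    return {
        "url": url,
        "name": url.rsplit("/", 1)[-1],  # file name taken from the URL path
        "format": file_format,
        "size": len(content),  # assumed to be a byte count
        "md5": digest.hexdigest(),
        "base64_md5": base64.b64encode(digest.digest()).decode("ascii"),
    }


entry = build_manifest_entry("https://example.com/file.json.gz", b"hello")
manifest = {"files": [entry]}
print(json.dumps(manifest, indent=4))
```

The resulting `manifest` dictionary has the same shape as the example and could be passed as the manifest property of the request body.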

Response

The response returns "HTTP 201 Created", along with the DatasetImport resource when successful.
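Before sending the request, a client can check that at least one data source (manifest, object_id or data_records) is present. The sketch below only builds the JSON body and does not call the API. `build_import_body` is a hypothetical helper, and the commit mode value `"append"` is an assumption, since the valid commit modes are not listed in this section.

```python
import json


def build_import_body(dataset_id, commit_mode, manifest=None, object_id=None,
                      data_records=None, **optional):
    """Assemble the JSON body for POST /v2/dataset_imports.

    Raises ValueError unless at least one data source is given, mirroring
    the rule that manifest, object_id or data_records must be provided.
    """
    sources = {
        "manifest": manifest,
        "object_id": object_id,
        "data_records": data_records,
    }
    provided = {key: value for key, value in sources.items() if value is not None}
    if not provided:
        raise ValueError("provide manifest, object_id or data_records")
    body = {"dataset_id": dataset_id, "commit_mode": commit_mode}
    body.update(provided)
    body.update(optional)  # e.g. description, reader_params, priority
    return body


# "append" is an assumed commit mode value, used for illustration only.
body = build_import_body(
    dataset_id=123,
    commit_mode="append",
    data_records=[{"field": 1}, {"field": 2}],
    description="Two inline records",
)
payload = json.dumps(body)
```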

Reader Parameters¶

Reader	Reader name	Extension
VCF	vcf	.vcf
JSONL	json	.json
CSV	csv	.csv
TSV	tsv	.tsv, .txt, .maf
XML	xml	.xml
GTF	gtf	.gtf
GFF3	gff3	.gff3
Nirvana JSON	nirvana	.json

SolveBio automatically selects a reader based on the imported file's extension. The exception is Nirvana JSON: because it shares the .json extension with JSONL, the reader attribute must be set manually to nirvana.

If the extension is not recognized, you can select a reader manually by setting the reader attribute of reader_params to the associated reader name:

from solvebio import DatasetImport

# Force the JSONL reader
reader_params = {
    'reader': 'json'
}

imp = DatasetImport.create(
    reader_params=reader_params,
    # ... (other import arguments)
)
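The extension-based selection described above can be approximated locally, for example to decide whether reader_params needs an explicit reader before creating an import. This is a sketch built from the reader table, not SolveBio's own implementation. `guess_reader` is a hypothetical helper, and ignoring a trailing .gz for every format is an assumption (the document only mentions .json.gz).

```python
import os

# Extension-to-reader mapping copied from the reader table.
# .json maps to the JSONL reader; Nirvana JSON must be requested explicitly.
READER_BY_EXTENSION = {
    ".vcf": "vcf",
    ".json": "json",
    ".csv": "csv",
    ".tsv": "tsv",
    ".txt": "tsv",
    ".maf": "tsv",
    ".xml": "xml",
    ".gtf": "gtf",
    ".gff3": "gff3",
}


def guess_reader(filename):
    """Return the reader name for a file name, or None if unrecognized."""
    base, extension = os.path.splitext(filename)
    if extension == ".gz":  # assumption: a gzip suffix is ignored
        base, extension = os.path.splitext(base)
    return READER_BY_EXTENSION.get(extension)


for name in ["variants.vcf", "file.json.gz", "calls.maf", "notes.dat"]:
    print(name, "->", guess_reader(name))
```

A result of `None` indicates a file for which the reader attribute would have to be set by hand.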

JSON (JSONL)¶

The JSONL format supported by SolveBio has four requirements (adapted from jsonlines.org):

1. UTF-8 encoding

JSON allows Unicode strings to be encoded using only ASCII escape sequences; however, those escapes are hard to read in a text editor. The author of a JSON Lines file may still choose to escape characters so that the file works as plain ASCII.

Non-ASCII content may be corrupted during import if the file is not UTF-8 encoded.

2. Each line must be a complete JSON object

Specifically, each line must be a JSON object without any internal line-breaks. For example, here are three records:

{"field": 1}
{"field": 2}
{"field": 3}

3. Lines are separated by '\n'

This means '\r\n' is also supported because trailing white space is ignored when parsing JSON values.

The last character in the file may be a line separator, and it will be treated the same as if there was no line separator present.

4. The file extension must be .json or .json.gz

JSON Lines files for SolveBio must be saved with the .json extension. Files may be gzipped, resulting in the .json.gz extension.
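The four requirements above can be met with Python's standard library. The sketch below is one way to produce a compliant file; the helper names `to_jsonl` and `write_jsonl_gz` are hypothetical and not part of the SolveBio client.

```python
import gzip
import json


def to_jsonl(records):
    """Serialize records as JSON Lines: one JSON object per line."""
    lines = []
    for record in records:
        if not isinstance(record, dict):
            raise TypeError("each record must be a JSON object (dict)")
        # json.dumps escapes line breaks inside strings, so each record
        # stays on one line; allow_nan=False rejects NaN and Infinity.
        lines.append(json.dumps(record, ensure_ascii=False, allow_nan=False))
    return "\n".join(lines) + "\n"


def write_jsonl_gz(path, records):
    """Write records to a gzipped, UTF-8 encoded .json.gz file."""
    with gzip.open(path, "wt", encoding="utf-8", newline="\n") as handle:
        handle.write(to_jsonl(records))


text = to_jsonl([{"field": 1}, {"field": 2}, {"field": 3}])
print(text, end="")
```

Passing `newline="\n"` keeps the line separator fixed regardless of platform, and `ensure_ascii=False` writes non-ASCII characters directly as UTF-8 rather than as escape sequences.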

CSV/TSV¶

The following parameters can be passed for files that end with extension .csv, .tsv, and .txt.

delimiter: A one-character string used to separate fields. Defaults to ',' for CSVs and '\t' for TSVs and TXT files.
quotechar: A one-character string used to quote fields containing special characters, such as the delimiter or quotechar, or which contain new-line characters. It defaults to '"'.
header: The row number (starting from 0) containing the header, None (to indicate no header), or infer (default). By default, column names are inferred from the first row unless headercols are provided, in which case the file is assumed to have no header row. Set header = 0 and provide headercols to replace existing headers.
headercols: A list of field names that represent column headers. By default, providing headercols assumes the file has no header. To replace existing headers, set header = 0. The order of the columns matters, and the number of names must match the number of delimited columns in each line.
comment: A string used to determine which lines in the files are comments. Lines that begin with this string will be ignored. Default is '#'.
skiprows: A list of integers that define the line numbers of the file that should be skipped. The first line is line 0. Default is [].
skipcols: A list of integers that define the columns of the file that should be ignored. The first column is column 0. Default is [].
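As an illustration of the header and headercols interaction described above, the following sketch builds reader_params that replace a file's existing header row. `header_override_params` is a hypothetical helper, and the uniqueness check is a local sanity check rather than a documented SolveBio rule.

```python
def header_override_params(column_names, delimiter=None):
    """Build reader_params that replace a file's existing header row.

    header=0 marks the first row as the header to be replaced, and
    headercols supplies the new names in column order.
    """
    if len(set(column_names)) != len(column_names):
        raise ValueError("column names must be unique")
    params = {"header": 0, "headercols": list(column_names)}
    if delimiter is not None:
        params["delimiter"] = delimiter
    return params


reader_params = header_override_params(["gene", "chrom", "pos"], delimiter="\t")
print(reader_params)
```

Leaving out `header` while still passing `headercols` would instead describe a file with no header row at all.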

Column Ordering¶

This reader will preserve the column order of the original file, unless otherwise overridden with an import template.

Numeric Fields¶

This reader will cast all numeric fields to doubles.

The following example modifies the default CSV reader to handle a pipe-delimited file with 5 header rows:

from solvebio import DatasetImport

# Custom reader params for a pipe-delimited "CSV" file with 5 header rows:
csv_reader_settings = {
    'delimiter': '|',
    'skiprows': [0, 1, 2, 3, 4]
}

imp = DatasetImport.create(
    reader_params=csv_reader_settings,
    # ... (other import arguments)
)
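It can help to preview how such settings split a file before importing it. The sketch below approximates the delimiter, quotechar, comment and skiprows parameters with Python's csv module. It is a local approximation and not the SolveBio reader: `preview_rows` is a hypothetical helper, it assumes skiprows counts lines of the original file, and it does not support quoted fields that span several lines.

```python
import csv


def preview_rows(text, delimiter=",", quotechar='"', comment="#", skiprows=()):
    """Split text into rows after dropping skipped and comment lines."""
    skipped = set(skiprows)
    kept = [
        line
        for number, line in enumerate(text.splitlines())
        if number not in skipped and not (comment and line.startswith(comment))
    ]
    return list(csv.reader(kept, delimiter=delimiter, quotechar=quotechar))


sample = 'meta 1\nmeta 2\n# a comment\nid|name\n1|"a|b"\n'
rows = preview_rows(sample, delimiter="|", skiprows=[0, 1])
for row in rows:
    print(row)
```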

VCF¶

The following parameters can be passed for files that end with extension .vcf.

genome_build: The string 'GRCh37' or 'GRCh38'. If no genome_build is passed, the reader attempts to guess the build from the file headers and falls back to GRCh37 if nothing is found.
explode_annotations: If True, explodes the annotations column of the VCF by creating one new record per annotation (default: False). By default, annotations are read from the ANN key within the info object (info.ANN). This key can be configured with the annotations_key parameter.
annotations_key: The field name that contains the VCF annotations. For use with the explode_annotations parameter. The default key is ANN.
sample_key: The field name that the VCF parser will output the VCF samples to. The default key is sample.
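To make the effect of explode_annotations concrete, the sketch below mimics "one new record per annotation" on a plain dictionary. It is an illustration only: the function is hypothetical, and the exact shape of the records SolveBio emits is an assumption (here each output record keeps a single annotation under the same key).

```python
import copy


def explode_annotations(record, annotations_key="ANN"):
    """Return one copy of `record` per entry in record["info"][annotations_key]."""
    annotations = record.get("info", {}).get(annotations_key)
    if not isinstance(annotations, list) or not annotations:
        return [record]  # nothing to explode
    exploded = []
    for annotation in annotations:
        clone = copy.deepcopy(record)  # leave the input record untouched
        clone["info"][annotations_key] = annotation
        exploded.append(clone)
    return exploded


variant = {"chrom": "1", "info": {"ANN": ["annotation A", "annotation B"]}}
records = explode_annotations(variant)
print(len(records))
```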

XML¶

The following parameters can be passed for files that end with extension .xml.

item_depth: An integer that defines at which XML element to begin record enumeration. Default is 1. A depth of 0 would be the XML document root element and would return a single record.
required_keys: A list of strings that represent items that must exist in the XML element. Otherwise the record will be ignored.
cdata_key: A string that identifies the text value of a node element. Default is text.
attr_prefix: A string representing the default prefix for node attributes. Default is @.

Example XML Document

<xml>
    <library>
        <shelf>
            <book lang="eng">
                <title>SolveBio Docs, 3rd Edition</title>
                <publish_date></publish_date>
                <summary></summary>
            </book>
            <book lang="eng">
                <title></title>
                <publish_date></publish_date>
                <summary></summary>
            </book>
        </shelf>
        <shelf>
            <book>
                <title></title>
            </book>
        </shelf>
    </library>
</xml>

item_depth=0 would parse a single library record with nested shelves and books. item_depth=1 would parse two shelf records with nested books. item_depth=2 would parse three book records.

required_keys=[] would return 3 book records.
required_keys=['title'] would also return 3 book records.
required_keys=['summary'] would return only 2 book records.

cdata_key="value" would return the field name book.title.value

attr_prefix="" would return the field name book.lang.
attr_prefix="_" would return the field name book._lang.

Example XML Document

<xml>
    <library>
        <shelf>
            <book></book>
        </shelf>
        <shelf>
            <dust></dust>
        </shelf>
    </library>
</xml>

item_depth=1 and required_keys=['dust'] would parse 1 shelf record.
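The record counts above can be reproduced with a short sketch using Python's xml.etree.ElementTree. This is not the SolveBio XML reader. `items_at_depth` is a hypothetical helper, and it treats the library element as depth 0 so that the counts match the examples (the outer wrapper element is left out of the sample).

```python
import xml.etree.ElementTree as ET

LIBRARY_XML = """
<library>
    <shelf>
        <book lang="eng"><title>SolveBio Docs, 3rd Edition</title><publish_date/><summary/></book>
        <book lang="eng"><title/><publish_date/><summary/></book>
    </shelf>
    <shelf>
        <book><title/></book>
    </shelf>
</library>
"""


def items_at_depth(root, item_depth, required_keys=()):
    """Return elements item_depth levels below root that contain every required key."""
    level = [root]
    for _ in range(item_depth):
        level = [child for node in level for child in node]
    # Keep an element only if each required key exists as a direct child.
    return [
        node for node in level
        if all(node.find(key) is not None for key in required_keys)
    ]


library = ET.fromstring(LIBRARY_XML.strip())
print(len(items_at_depth(library, 1)))               # shelf records
print(len(items_at_depth(library, 2, ["summary"])))  # books with a summary
```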

GFF3¶

The following parameters can be passed for files that end with extension .gff3.

comment: A string used to determine which lines in the files are comments. Lines that begin with this string will be ignored. Default is '##'.

Nirvana JSON¶

To be parsed properly, a Nirvana JSON file must conform to Illumina's official Nirvana JSON layout.

Entity Detection Parameters¶

When importing data, every field is sampled to determine whether it is a SolveBio entity. This detection can be customized by setting entity_params on the import object.

Genes and variants are detected by default. The example below overrides this and attempts to detect only genes and literature entities:

imp = DatasetImport.create(
    dataset_id=dataset.id,
    object_id=object.id,
    entity_params={
        'entity_types': ['gene', 'literature']
    }
)

To completely disable entity detection, use the disable attribute:

imp = DatasetImport.create(
    dataset_id=dataset.id,
    object_id=object.id,
    entity_params={
        'disable': True
    }
)

Validation Parameters¶

The following settings can be passed to the validation_params field.

disable - (boolean) default False - Disables validation completely
raise_on_errors - (boolean) default False - Will fail the import on first validation error encountered.
strict_validation - (boolean) default False - Will upgrade all validation warnings to errors.
allow_new_fields - (boolean) default False - If strict_validation is True, still allows new fields to be added.

Validation can raise the following warnings and errors. Each entry below is listed in the format: [Code] Name: Description

Warnings¶

[202] Column Name Warning: Column name uses characters that do not comply with strict column name validation. (upgraded to an Error if strict_validation=True)
[203] New Column added: A new column was added to the Dataset (upgraded to an Error if strict_validation=True and allow_new_fields=False)
[302] List Expected violation: A column expected a list of values but didn't receive them. For example, a field has is_list=True but received a single string. (upgraded to an Error if strict_validation=True)
[303] Unexpected List violation: A column expected a single value but received a list of values. For example, a field has is_list=False but received a list of strings. (upgraded to an Error if strict_validation=True)
[400] Too Many Columns in record: Warns if 150 or more columns are found. Errors if 400 or more.
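The upgrade rules in the list above can be summarized as a small function. This sketch only restates the documented behavior for codes 202, 203, 302 and 303; `effective_level` is a hypothetical name, and code 400 is left out because its thresholds do not depend on strict_validation.

```python
UPGRADED_WHEN_STRICT = {202, 302, 303}
NEW_COLUMN = 203


def effective_level(code, strict_validation=False, allow_new_fields=False):
    """Return "error" or "warning" for warning codes 202, 203, 302 and 303."""
    if code in UPGRADED_WHEN_STRICT:
        return "error" if strict_validation else "warning"
    if code == NEW_COLUMN:
        # A new column stays a warning when allow_new_fields is set.
        upgraded = strict_validation and not allow_new_fields
        return "error" if upgraded else "warning"
    raise ValueError(f"not a strict-upgradable warning code: {code}")


print(effective_level(203, strict_validation=True, allow_new_fields=True))
```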

Errors¶

[301] Invalid Value for Field: Value is not a valid type for the field (e.g. an integer passed to a field whose data_type is date).
[304] NaN Value for Field: Value is a JSON "NaN" value, which cannot be indexed by SolveBio.
[305] Infinity Value for Field: Value is a JSON "Infinity" value, which cannot be indexed by SolveBio.
[306] Max String Length for Field: The max value for the string data_type is 32,766 bytes. Anything larger must be a text data_type.
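Some of these checks can be run locally before an import. The sketch below flags codes 304, 305, 306 and 400 for a single record using the thresholds listed above. It is an approximation: `precheck_record` is a hypothetical helper, it counts only top-level keys as columns, and it assumes every Python string maps to the string data_type rather than text.

```python
import math

MAX_STRING_BYTES = 32766
COLUMN_WARNING, COLUMN_ERROR = 150, 400


def precheck_record(record):
    """Return (code, level, field) tuples for problems found in one record."""
    problems = []
    if len(record) >= COLUMN_ERROR:
        problems.append((400, "error", None))
    elif len(record) >= COLUMN_WARNING:
        problems.append((400, "warning", None))
    for field, value in record.items():
        if isinstance(value, float) and math.isnan(value):
            problems.append((304, "error", field))
        elif isinstance(value, float) and math.isinf(value):
            problems.append((305, "error", field))
        elif isinstance(value, str) and len(value.encode("utf-8")) > MAX_STRING_BYTES:
            problems.append((306, "error", field))
    return problems


print(precheck_record({"score": float("nan"), "name": "ok"}))
```

The string check measures UTF-8 bytes rather than characters, since the limit is stated in bytes.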

Annotator Parameters¶

The following settings can be used to customize the annotator that is used during transformation.

annotator - (string) Choose from simple (default), serial or parallel

delete¶

Not recommended

Deleting dataset imports is not recommended as data provenance will be lost.

HTTP Request

DELETE https://api.solvebio.com/v2/dataset_imports/{ID}

Parameters

This request does not accept URL parameters.

Authorization

This request requires an authorized user with write permissions on the dataset.

Request Body

Do not supply a request body with this method.

Response

The response returns "HTTP 200 OK" when successful.

get¶

HTTP Request

GET https://api.solvebio.com/v2/dataset_imports/{ID}

Parameters

This request does not accept URL parameters.

Authorization

This request requires an authorized user with read permission on the dataset.

Request Body

Do not supply a request body with this method.

Response

The response contains a DatasetImport resource.

list¶

HTTP Request

GET https://api.solvebio.com/v2/datasets/{DATASET_ID}/imports

Parameters

This request accepts the following parameters:

Parameter	Value	Description
limit	integer	The number of objects to return per page.
offset	integer	The offset within the list of available objects.

Authorization

This request requires an authorized user with read permission on the dataset.

Response

The response contains a list of DatasetImport resources.
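The limit and offset parameters support paging through a long list of imports. The following sketch shows the paging loop only. `iter_imports` and `fake_fetch_page` are hypothetical, the HTTP call is replaced by a local stand-in, and the assumption is that each page can be reduced to a plain list of items (the response envelope is not described in this section).

```python
def iter_imports(fetch_page, limit=100):
    """Yield every item by requesting pages until one comes back short."""
    offset = 0
    while True:
        page = fetch_page(limit=limit, offset=offset)
        yield from page
        if len(page) < limit:
            break
        offset += limit


# Stand-in for the HTTP call: slices a local list instead of calling the API.
FAKE_IMPORTS = [{"id": number} for number in range(7)]


def fake_fetch_page(limit, offset):
    return FAKE_IMPORTS[offset:offset + limit]


all_imports = list(iter_imports(fake_fetch_page, limit=3))
print(len(all_imports))
```

In real use, `fetch_page` would issue the GET request with the given limit and offset and return the items from the response.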

Last updated 2022-12-07.
