
Exporting Data

Overview

SolveBio is committed to data access and data portability. Exporting data to downstream tools is a key part of molecular data analytics, and our goal is to make that process seamless.

Export limits

By default, you may export up to 100 million records at a time. To increase your limit, please contact SolveBio support. We recommend using filters to export in batches if possible.
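As a rough illustration of exporting in batches, the sketch below generates non-overlapping range filters over a numeric field. The helper batch_filters and the field name "position" are hypothetical, and the `__lt` filter suffix is assumed to work like the `__gte` suffix used elsewhere on this page; adapt both to the fields in your dataset.

```python
from typing import Dict, Iterator


def batch_filters(field: str, start: int, stop: int, step: int) -> Iterator[Dict[str, int]]:
    """Yield filter keyword arguments covering [start, stop) in half-open chunks."""
    if step <= 0:
        raise ValueError("step must be positive")
    for lower in range(start, stop, step):
        upper = min(lower + step, stop)
        # Half-open ranges ensure no record falls into two batches.
        yield {f"{field}__gte": lower, f"{field}__lt": upper}


# Hypothetical usage with the Python client (needs credentials and network access):
#
#   for kwargs in batch_filters("position", 0, 2_500_000, 1_000_000):
#       export = dataset.query().filter(**kwargs).export(format="json", follow=True)
#       export.download("./")

batches = list(batch_filters("position", 0, 2_500_000, 1_000_000))
```

Each batch can then be exported and downloaded separately, which keeps every individual export well under the record limit.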

Datasets can be exported in the following formats:

  • JSON: JSON Lines format (gzipped).
  • CSV: Comma Separated Value format (flattened, gzipped).
  • TSV: Tab Separated Value format (flattened, gzipped).
  • Excel (XLSX): Microsoft Excel format (flattened).

An exported JSON file can be re-imported into SolveBio without any modification.
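For downstream analysis outside SolveBio, a gzipped JSON Lines export can be read with the Python standard library alone. The following is a minimal sketch; the function name read_jsonl_gz and the file name in the usage note are illustrative, not part of the SolveBio client.

```python
import gzip
import json
from typing import Any, Dict, Iterator


def read_jsonl_gz(path: str) -> Iterator[Dict[str, Any]]:
    """Yield one record per non-empty line of a gzipped JSON Lines file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)
```

For example, `records = list(read_jsonl_gz("my_export.json.gz"))` would load every record into memory; iterating over the generator directly avoids that for large exports.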

Exporting data can take anywhere from a few seconds to tens of minutes, depending on the number of records and selected format. Exports are processed server-side, and the output is a downloadable file.

Flattened Fields (CSV/TSV/XLSX only)

CSV, TSV, and XLSX exports are flattened during export. Flattening expands each list field into one column per list index (for example, b.0 and b.1), because Excel and other tabular readers do not handle list values well. The following example illustrates the effect:

The following dataset records:

{"a": "a", "b": ["x"]}
{"a": "a", "b": ["x", "y"]}
{"a": "a", "b": ["x", "y", "z"]}

will be exported to the following CSV:

a,b.0,b.1,b.2
a,x,,
a,x,y,
a,x,y,z
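The flattening shown above can be approximated in a few lines of Python. This is a sketch that reproduces the example, not SolveBio's server-side implementation: it only handles top-level list fields, and the column order (first appearance across records) is an assumption.

```python
import csv
import io
from typing import Any, Dict, List


def flatten_record(record: Dict[str, Any]) -> Dict[str, Any]:
    """Expand each list field into one key per index, e.g. b -> b.0, b.1."""
    flat: Dict[str, Any] = {}
    for key, value in record.items():
        if isinstance(value, list):
            for index, item in enumerate(value):
                flat[f"{key}.{index}"] = item
        else:
            flat[key] = value
    return flat


def records_to_csv(records: List[Dict[str, Any]]) -> str:
    """Flatten records and render them as CSV, leaving missing cells empty."""
    flat_rows = [flatten_record(record) for record in records]
    # Collect column names in order of first appearance across all rows.
    columns: List[str] = []
    for row in flat_rows:
        for name in row:
            if name not in columns:
                columns.append(name)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, restval="", lineterminator="\n")
    writer.writeheader()
    writer.writerows(flat_rows)
    return buffer.getvalue()


records = [
    {"a": "a", "b": ["x"]},
    {"a": "a", "b": ["x", "y"]},
    {"a": "a", "b": ["x", "y", "z"]},
]
csv_text = records_to_csv(records)
```

Records with shorter lists leave the trailing columns empty, which is why the first two data rows in the example end with commas.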

Export a Dataset

To export a dataset, retrieve it by name or ID, and initiate the export. Exports can take a few minutes for large datasets. You can always start a large export and check back when it finishes on the Activity tab of the SolveBio website.

from solvebio import Dataset

dataset = Dataset.get_by_full_path('solvebio:public:/HGNC/3.1.0-2017-06-29/HGNC')

# Export the entire dataset (~40k records), this may take a minute...
# NOTE: `format` can be: json, tsv, csv, or excel
export = dataset.export(format='json', follow=True)

# Save the exported file to the current directory
export.download('./')

library(solvebio)

dataset <- Dataset.get_by_full_path('solvebio:public:/HGNC/3.1.0-2017-06-29/HGNC')

# Export the entire dataset (~40k records), this may take a minute...
# NOTE: `format` can be: json, tsv, csv, or excel
export <- DatasetExport.create(
    dataset$id,
    format = 'csv',
    params = NULL
)

# Wait for the export to complete
Dataset.activity(dataset$id)

# Download
url <- DatasetExport.get_download_url(export$id)
download.file(url, 'data.csv.gz')

Export a Filtered Dataset

In this example we will export a slice of a dataset. This leverages the dataset filtering system.

from solvebio import Dataset

dataset = Dataset.get_by_full_path('solvebio:public:/ClinVar/3.7.4-2017-01-30/Variants-GRCh37')

# Filter the dataset by field values or limit the number of results
query = dataset.query(limit=100).filter(review_status_star__gte=3)

# Export the query (100 records, filtered on a field)
# NOTE: `format` can be: json, tsv, csv, or excel
export = query.export(format='json', follow=True)

# Save the exported file to a specific location (optionally with a specific name)
export.download(path='./my_variants.json.gz')

library(solvebio)

dataset <- Dataset.get_by_full_path('solvebio:public:/ClinVar/3.7.4-2017-01-30/Variants-GRCh37')

# Filter the dataset by field values and limit the number of results
# NOTE: `format` can be: json, tsv, csv, or excel
filters <- list(list("review_status_star__gte", 3))
export <- DatasetExport.create(
    dataset$id,
    format = 'json',
    params=list(filters=filters, limit=100),
    follow = TRUE
)

# Download to the current working directory
url <- DatasetExport.get_download_url(export$id)
download.file(url, 'my_variants.json.gz')