Exporting Data¶
Overview¶
SolveBio is committed to data access and data portability. Exporting data to downstream tools is a key part of molecular data analytics and our goal is to make that a seamless process.
Dataset can be exported in multiple formats:
- JSON: JSON Lines format (gzipped).
- CSV: Comma Separated Value format (flattened, gzipped).
- TSV: Tab Separated Value format (flattened, gzipped).
- Excel (XLSX): Microsoft Excel format (flattened).
An exported JSON file can be re-imported into SolveBio without any modification.
Exporting data can take anywhere from a few seconds to tens of minutes, depending on the number of records and selected format. Exports are processed server-side, and the output is a downloadable file.
Export Limits¶
Different export formats have different limits.
Format | Max Records |
---|---|
Excel | 1,048,576 records |
JSON | 500,000,000 records |
TSV | 500,000,000 records |
CSV | 500,000,000 records |
Flattened Fields (CSV/XLSX only)¶
CSV and XLSX exports are processed by a flattening algorithm during export. The reason for this is to handle list fields, which are not well supported by Excel and other CSV readers. The following example illustrates the effects of the flattening algorithm:
The following dataset records:
1 2 3 | {"a": "a", "b": ["x"]} {"a": "a", "b": ["x", "y"]} {"a": "a", "b": ["x", "y", "z"]} |
will be exported to the following CSV:
1 2 3 4 | a,b.0,b.1,b.2 a,x,, a,x,y, a,x,y,z |
Export a Dataset¶
To export a dataset, retrieve it by name or ID, and initiate the export. Exports can take a few minutes for large datasets. You can always start a large export and check back when it finishes on the Activity tab of the SolveBio website. Exports can also be saved directly into a vault (with target_full_path
keyword argument) and accessed from there.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 | from solvebio import Dataset dataset = Dataset.get_by_full_path('solvebio:public:/HGNC/3.1.0-2017-06-29/HGNC') # Export the entire dataset (~40k records), this may take a minute... # NOTE: `format` can be: json, tsv, csv, or excel # `send_email_on_completion`: enable/disable sending an email when the export is ready export = dataset.export(format='json', follow=True, send_email_on_completion=True) # Save the exported file to the current directory export.download('./') # Exports can also be saved to a path in a vault dataset.export(target_full_path='my_vault:/path/to/json_files_folder/my_export') |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 | library(solvebio) dataset <- Dataset.get_or_create_by_full_path('solvebio:public:/HGNC/3.1.0-2017-06-29/HGNC') # Export the entire dataset (~40k records), this may take a minute... # NOTE: `format` can be: json, tsv, csv, or excel export <- DatasetExport.create( dataset$id, format = 'csv', params = NULL, send_email_on_completion = TRUE ) # Wait for the export to complete Dataset.activity(dataset$id) # Download url <- DatasetExport.get_download_url(export$id) download.file(url, 'data.csv.gz') # An exports can also be saved to a path in a vault export <- DatasetExport.create( dataset$id, format = 'csv', params = NULL, send_email_on_completion = TRUE, target_full_path='my_vault:/path/to/csv_files_folder/my_export' ) |
Exporting Large Amounts of Data¶
An example file size for a CSV file with 150M rows and 50 columns populated with floats and relatively short strings is about 50GB. In general we recommend not working with files this size.
Shrink the export by applying filters or selecting only specific columns. If necessary, export in batches (e.g. export by chromosome or sample).
Export a Filtered Dataset¶
In this example we will export a slice of a dataset. This leverages the dataset filtering system.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 | from solvebio import Dataset dataset = Dataset.get_by_full_path('solvebio:public:/ClinVar/3.7.4-2017-01-30/Variants-GRCh37') # Filter the dataset by field values, limit the number of results, select a subset of fields query = dataset.query(limit=100, fields=["variant_sbid", "review_status_star", "gene_symbol"]).filter(review_status_star__gte=3) # Export the query (100 records, filtered on a field) # NOTE: `format` can be: json, tsv, csv, or excel # `send_email_on_completion`: enable/disable sending an email when the export is ready export = query.export(format='json', follow=True, send_email_on_completion=True) # Save the exported file to a specific location (optionally with a specific name) export.download(path='./my_variants.json.gz') |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 | library(solvebio) dataset <- Dataset.get_by_full_path('solvebio:public:/ClinVar/3.7.4-2017-01-30/Variants-GRCh37') # Filter the dataset by field values and limit the number of results # NOTE: `format` can be: json, tsv, csv, or excel filters <- list(list("review_status_star__gte", 3)) fields <- list("variant_sbid", "review_status_star", "gene_symbol") export <- DatasetExport.create( dataset$id, format = 'json', params=list(filters=filters, fields=fields, limit=100), follow = TRUE, ) # Download to your home directory url <- DatasetExport.get_download_url(export$id) download.file(url, 'my_variants.json.gz') |
Export in Batches¶
In this example we will export the data by chromosome.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 | from solvebio import Dataset dataset = Dataset.get_by_full_path('solvebio:public:/ClinVar/3.7.4-2017-01-30/Variants-GRCh37') # Get available chromosomes facets = dataset.query().facets(**{'genomic_coordinates.chromosome': {'facet_type': 'terms', 'limit': 100}}) print("Found {} chromosomes".format(len(facets['genomic_coordinates.chromosome']))) for chromosome, records_count in facets['genomic_coordinates.chromosome']: # Defines a location on SolveBio to export to (this is a shortcut to the user's personal vault) # Appends chromosome to the filename target_full_path = "~/clinvar_{}.csv.gz".format(chromosome) # Filter the query by chromosome filtered_query = query.filter(**{'genomic_coordinates.chromosome': chromosome}) # Export filtered_query.export(format='csv-gz', follow=False, send_email_on_completion=True, target_full_path=target_full_path) |