Creating Datasets

Overview

Datasets are like documents in a word processor: to start working with data on SolveBio, you first create a new, empty dataset. Datasets have both a schema (dataset fields) and contents (dataset records). Unlike many database systems, SolveBio does not require you to know your schema in advance, because fields are detected automatically from imported data. In many cases, however, crafting a schema up front (i.e. setting dataset fields) helps avoid issues with data types and field names.

Create a Dataset

To create a dataset, supply a full path in the following format: <domain>:<vault>:<path> (e.g. myDomain:MyVault:/folder/with/dataset or ~/folder/with/dataset to use your personal vault). The path must be within a vault where you have write-level access (such as your personal vault).

Python

from solvebio import Dataset

# Create a new, empty dataset in your personal vault (represented by "~/")
dataset = Dataset.get_or_create_by_full_path('~/my_dataset')

R

library(solvebio)

# Specify where you want the new dataset
vault <- Vault.get_personal_vault()
dataset_full_path <- paste(vault$full_path, "/r_examples/my_dataset", sep=":")

# Create a new, empty dataset
dataset <- Dataset.get_or_create_by_full_path(dataset_full_path)
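
The <domain>:<vault>:<path> format described above can be illustrated with a small helper. This is only a sketch: split_full_path and join_full_path are hypothetical names, not part of the SolveBio client, and the code assumes that domain and vault names contain no colons. It does not handle the "~/" shortcut, which the client resolves to your personal vault.

```python
def split_full_path(full_path):
    """Split '<domain>:<vault>:<path>' into its three parts.

    Hypothetical helper for illustration; not part of the solvebio package.
    """
    parts = full_path.split(":", 2)
    if len(parts) != 3 or not parts[2].startswith("/"):
        raise ValueError("expected <domain>:<vault>:<path>, got %r" % full_path)
    domain, vault, path = parts
    return domain, vault, path


def join_full_path(domain, vault, path):
    """Build a full path, making sure the path component starts with '/'."""
    if not path.startswith("/"):
        path = "/" + path
    return ":".join([domain, vault, path])


print(split_full_path("myDomain:MyVault:/folder/with/dataset"))
print(join_full_path("myDomain", "MyVault", "folder/with/dataset"))
```

A helper like this can be handy for checking a path before sending it to the API, since a malformed path fails locally instead of in a request.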

When creating the dataset you can supply a number of optional parameters:

description: A description (text) of the dataset.
fields: A list of field objects (see below).
capacity: A performance optimization for datasets that will have tens or hundreds of millions of records. Default is "small" but can be set to "medium" or "large". This cannot be changed once it has been set.
metadata: A dictionary of key/value pairs that can be associated with the dataset.
tags: A list of strings (tags) that can be associated with the dataset.

Once created, the dataset will be empty (containing no records). It will have the default SolveBio _id field plus any fields supplied through the fields parameter.
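
As a rough sketch of how the optional parameters listed above might be assembled, the example below collects them into keyword arguments. build_dataset_options and create_dataset are hypothetical helpers, and the sketch assumes that every optional parameter is passed to Dataset.get_or_create_by_full_path as a keyword argument, in the same way that fields and capacity are passed elsewhere on this page.

```python
VALID_CAPACITIES = ("small", "medium", "large")


def build_dataset_options(description=None, fields=None, capacity="small",
                          metadata=None, tags=None):
    """Collect the optional creation parameters, skipping any left unset."""
    if capacity not in VALID_CAPACITIES:
        raise ValueError("capacity must be one of %s" % (VALID_CAPACITIES,))
    options = {"capacity": capacity}
    if description is not None:
        options["description"] = description
    if fields is not None:
        options["fields"] = list(fields)
    if metadata is not None:
        options["metadata"] = dict(metadata)
    if tags is not None:
        options["tags"] = list(tags)
    return options


def create_dataset(full_path, **kwargs):
    """Create (or fetch) a dataset using the collected options."""
    # Imported here so the helper above works without the client installed.
    from solvebio import Dataset
    return Dataset.get_or_create_by_full_path(
        full_path, **build_dataset_options(**kwargs))


options = build_dataset_options(
    description="Example dataset",
    metadata={"project": "demo"},
    tags=["example"],
)
print(options)
```

Because capacity cannot be changed after creation, validating it locally before the dataset exists is the one check here that is hard to undo later.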

Dataset Fields

By default, new fields are automatically detected by the import system. You can also provide a list of fields (i.e. a template) using the fields parameter. This lets you explicitly set field names and titles, data types, ordering, descriptions, and entity types for each field. See dataset field reference for more info.

The following example creates a new dataset using a template with two fields:

Python

from solvebio import Dataset
from solvebio import DatasetField

fields = [
    {
        "name": "my_string_field",
        "description": "Just a string",
        "data_type": "string",
        "is_list": False,
        "is_hidden": False,
        "ordering": 0
    },
    {
        "name": "gene_symbol",
        "description": "HUGO gene symbol",
        "data_type": "string",
        "entity_type": "gene"
    }
]

dataset_full_path = '~/python_examples/my_fields_dataset'

# Fields, capacity, and other optional parameters can be set during dataset creation
dataset = Dataset.get_or_create_by_full_path(
    dataset_full_path,
    fields=fields,
    capacity='small'
)

# If the dataset already exists, you can add additional fields:
DatasetField.create(
    dataset_id=dataset.id,
    name="my_new_field",
    data_type="string")

# Fields can also be edited
field = dataset.fields("my_string_field")
field.description = "A new description"
field.save()

R

library(solvebio)

fields <- list(
    list(
        name="my_string_field",
        description="Just a string",
        data_type="string",
        is_list=FALSE,
        is_hidden=FALSE,
        ordering=0
    ),
    list(
        name="gene_symbol",
        description="HUGO Gene Symbol",
        data_type="string",
        entity_type="gene"
    )
)

vault <- Vault.get_personal_vault()
dataset_full_path <- paste(vault$full_path, "/r_examples/my_fields_dataset", sep=":")

# Fields, capacity, and other optional parameters can be set during dataset creation
dataset <- Dataset.get_or_create_by_full_path(dataset_full_path, fields=fields, capacity="small")

# If the dataset already exists, you can add additional fields:
DatasetField.create(
    dataset_id=dataset$id,
    name="my_new_field",
    data_type="string")

Field Properties

Dataset fields have the following properties:

Property	Value	Description
name (required)	string	The "low-level" field name, used in JSON formatted records.
data_type	string	A valid data type.
description	string	Free text that describes the contents of the field.
entity_type	string	A valid SolveBio entity type.
expression	string	A valid SolveBio expression.
is_hidden	boolean	Set to True if the field should be excluded by default from the UI. Default is False.
is_list	boolean	Set to True if multiple values are stored as a list. Default is False.
ordering	integer	The order in which this column appears in the UI and in tabular exports.
title	string	The field's display name, shown in the UI and in tabular exports. Default is set automatically from the name.
is_transient	boolean	Set to True if the field is a temporary field used to simplify data and expression manipulation during imports and migrations. Default is False. See the example under Renaming Fields for usage.
depends_on	list of strings	List of fields whose expressions must be evaluated before this field's expression, i.e. the fields that this field depends on. Default is an empty list.
url_template	string	A URL template with one or more "{value}" sections that will be interpolated with the field value and displayed as a link in the dataset table.
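
To illustrate what depends_on expresses, the sketch below orders a list of field definitions so that every field comes after the fields it depends on. It uses Python's standard graphlib module (Python 3.9+); evaluation_order is a hypothetical helper, the gene_label field is invented for the example, and how SolveBio itself schedules expressions is not described here.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def evaluation_order(fields):
    """Return field names ordered so dependencies come before dependents."""
    # Map each field name to the set of fields it depends on.
    graph = {f["name"]: set(f.get("depends_on", [])) for f in fields}
    return list(TopologicalSorter(graph).static_order())


fields = [
    {
        # Hypothetical field that relies on the computed "gene" field.
        "name": "gene_label",
        "data_type": "string",
        "expression": "record.gene",
        "depends_on": ["gene"],
    },
    {
        "name": "gene",
        "data_type": "string",
        "expression": "record.gene_symbol",
    },
]

order = evaluation_order(fields)
print(order)
```

If two fields depend on each other, TopologicalSorter raises graphlib.CycleError, which is a convenient way to catch circular depends_on declarations in a template before using it.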

Adding Links to the Dataset Table

If you add a url_template value to a dataset field, the dataset table in the SolveBio UI will show the field's value as a link. This is useful for linking out to other sources or websites. For example, a dataset can link its gene and variant fields to the corresponding gene and variant pages on SolveBio.
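
The interpolation of "{value}" can be pictured with the following sketch. render_url is a hypothetical helper and the example.com URL is a placeholder; URL-quoting the value is an assumption made here for safety, since the exact escaping applied by the SolveBio UI is not documented on this page.

```python
from urllib.parse import quote


def render_url(url_template, value):
    """Replace every "{value}" placeholder with the URL-quoted field value."""
    return url_template.replace("{value}", quote(str(value), safe=""))


field = {
    "name": "gene_symbol",
    "data_type": "string",
    # Placeholder URL for illustration only.
    "url_template": "https://example.com/genes/{value}",
}

print(render_url(field["url_template"], "BRCA1"))
```

A template may contain more than one "{value}" section, and each of them is replaced with the same field value.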

Modifying Fields

Fields can be modified via the API or from several places in the SolveBio UI. Look for the pencil icon next to a dataset field on the dataset's About page or in any of the filtering or facets panels.

Data types and field names cannot be changed. Once a field is added to a dataset (either manually or as a result of an import), the field's data type (data_type) cannot be altered in-place. To change the data type of a field, you can perform a dataset migration.

The title, description, url_template, ordering, and entity_type can be modified at any time.
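
One way to apply the rule above before sending an update is to check a proposed set of changes for properties that cannot be edited in place. The helper name rejected_changes is hypothetical, and the sketch only encodes the two immutable properties named in this section (name and data_type).

```python
# Properties that cannot be altered in place, per the rule above.
IMMUTABLE_PROPERTIES = {"name", "data_type"}


def rejected_changes(changes):
    """Return the keys in a proposed field update that cannot be edited in place."""
    return sorted(IMMUTABLE_PROPERTIES.intersection(changes))


update = {
    "title": "Gene",
    "description": "HUGO gene symbol",
    "data_type": "text",
}

blocked = rejected_changes(update)
if blocked:
    print("Needs a dataset migration:", ", ".join(blocked))
```

Anything the helper reports would have to be handled through a dataset migration instead of a field edit.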

Dataset Caveats

Reserved Fields

SolveBio never alters input data, except in the case of reserved fields. Fields beginning with an underscore such as _id and _commit are considered reserved and may be modified during an import.

The _commit field is always reset during the commit process, and makes it possible to track and log all changes made to a Dataset. The _id field represents the unique ID for each record, which can be used to edit or delete individual records. The value of _id cannot be edited once a record is saved.
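
Because reserved fields may be modified during an import, it can help to check records for underscore-prefixed keys before importing them. The following is a minimal sketch; reserved_keys is a hypothetical helper and the _score key in the sample record is invented for illustration.

```python
def reserved_keys(record):
    """List the keys of a record that start with an underscore."""
    return sorted(key for key in record if key.startswith("_"))


record = {
    "_id": "abc123",
    "gene_symbol": "BRCA1",
    "_commit": 42,
    "_score": 0.9,  # invented key, used only to show the check
}

found = reserved_keys(record)
if found:
    print("Reserved fields that may be modified on import:", ", ".join(found))
```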

Removing Fields

Fields cannot be removed in-place.

Fields can easily be hidden from end users, either programmatically or through SolveBio's web interface, but hiding a field does not remove the underlying data. To remove a field, the dataset must be cloned to a new dataset that does not have the field you intend to delete.
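
When cloning a dataset without a field, the first step is deciding which fields to carry over. The sketch below computes that list; fields_to_keep is a hypothetical helper. Passing the result as source_params={"fields": keep} to a migration is an assumption based on the source_params argument used in the R migration example on this page.

```python
def fields_to_keep(field_names, drop):
    """Return field_names without the dropped ones, preserving order."""
    drop = set(drop)
    missing = drop.difference(field_names)
    if missing:
        raise KeyError("unknown fields: %s" % sorted(missing))
    return [name for name in field_names if name not in drop]


keep = fields_to_keep(
    ["gene_symbol", "clinical_significance", "review_status"],
    drop=["review_status"],
)

# Assumed usage: restrict the migration source to the remaining fields,
# e.g. source_params={"fields": keep}.
print(keep)
```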

Renaming Fields

Field names cannot be changed; only titles can.

While field titles can change (as they are for display purposes only), field names are static and cannot be changed. To change the name of a field, use a dataset migration to create a new field with the desired name.

The following example renames a field in a dataset:

Python

import solvebio as sb

# Replace with the full path of the dataset whose field you want to rename
dataset = sb.Dataset.get_or_create_by_full_path("~/my_dataset")

# We want to transform the data by passing target_fields.
# Because there is no "rename" functionality, the workaround is to create
# new fields with the values from the old fields. Set is_transient=True for
# the old fields so that they are only used temporarily during the data
# transform and are not included in the output.
# The new fields just take the value of the old fields.
target_fields = [
    {
        "name": "old_field_name",
        "is_transient": True
    },
    {
        "name": "new_field_name",
        "data_type": "string",
        "expression": "record.old_field_name"
    }
]

# Create migration
migration = dataset.migrate(target=dataset, target_fields=target_fields)

R

library(solvebio)

# Retrieve the source dataset
source_dataset <- Dataset.get_by_full_path('solvebio:public:/ClinVar/3.7.4-2017-01-30/Variants-GRCh37')

# Create your new target dataset
vault <- Vault.get_personal_vault()
dataset_full_path <- paste(vault$full_path, "/r_examples/clinvar_renamed", sep=":")
target_dataset <- Dataset.get_or_create_by_full_path(dataset_full_path)

# We only want data from these fields
fields <- list('gene_symbol', 'clinical_significance', 'review_status')

# We want to transform the data by passing target_fields.
# Because there is no "rename" functionality, the workaround is to create
# new fields with the values from the old fields. Set is_transient=TRUE for
# the old fields so that they are only used temporarily during the data
# transform and are not included in the output.
# The new fields just take the value of the old fields.
target_fields <- list(
    list(
        name='gene_symbol',
        is_transient=TRUE
    ),
    list(
        name='clinical_significance',
        is_transient=TRUE
    ),
    list(
        name='review_status',
        is_transient=TRUE
    ),
    list(
        name='gene',
        data_type='string',
        expression='record.gene_symbol'
    ),
    list(
        name='clin_sig',
        data_type='string',
        expression='record.clinical_significance'
    ),
    list(
        name='rev_stat',
        data_type='string',
        expression='record.review_status'
    )
)

# Create migration
#  source_params
#   fields kwarg returns only the fields defined
#   limit kwarg pulls only 10 records, remove for all records
#  target_fields
#   defines the field transform template
migration <- DatasetMigration.create(
    source_id=source_dataset$id,
    target_id=target_dataset$id,
    source_params=list(
        fields=fields,
        limit=10
    ),
    target_fields=target_fields
)
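
When several fields are renamed at once, the target_fields template follows a regular pattern and can be generated from a mapping of old names to new names. The sketch below is one way to do that in Python; rename_target_fields is a hypothetical helper, and it assumes that all renamed fields share a single data type.

```python
def rename_target_fields(renames, data_type="string"):
    """Build target_fields that copy each old field into a new name.

    `renames` maps old field names to new field names.
    """
    target_fields = []
    # Old fields are marked transient so they are left out of the output.
    for old_name in renames:
        target_fields.append({"name": old_name, "is_transient": True})
    # New fields take their value from the corresponding old field.
    for old_name, new_name in renames.items():
        target_fields.append({
            "name": new_name,
            "data_type": data_type,
            "expression": "record.%s" % old_name,
        })
    return target_fields


target_fields = rename_target_fields({
    "gene_symbol": "gene",
    "clinical_significance": "clin_sig",
    "review_status": "rev_stat",
})

for field in target_fields:
    print(field)
```

The resulting list has the same shape as the hand-written templates above and could be passed as target_fields when creating the migration.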

Dataset Capacity

The "capacity" of the dataset determines its import, migration and query performance. For most datasets, a "small" capacity will be sufficient. There are three available capacities:

small - the default value and one that is good for most datasets up to 1 million records.
medium - for medium sized datasets up to 100 million records.
large - for large datasets that will have more than 100 million records.

The record limit is not a "hard" limit: a small dataset could hold millions of records, but it will become less performant as more records are added.

Once a dataset is created, its capacity cannot be changed. However, you can always copy the data into a new higher capacity dataset.
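
Since capacity is fixed at creation time, it is worth estimating the record count first. The following sketch turns the thresholds listed above into a small function; suggest_capacity is a hypothetical helper, and the thresholds are guidelines, not hard limits.

```python
def suggest_capacity(expected_records):
    """Suggest a capacity from the expected number of records."""
    if expected_records < 0:
        raise ValueError("expected_records must be non-negative")
    if expected_records <= 1_000_000:
        return "small"
    if expected_records <= 100_000_000:
        return "medium"
    return "large"


for count in (500_000, 50_000_000, 250_000_000):
    print(count, suggest_capacity(count))
```

When in doubt between two capacities, keep in mind that moving up later means copying the data into a new dataset.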

Last updated 2022-12-07.
