Querying Datasets¶
Overview¶
SolveBio is designed for easy access to molecular information. It provides an easy-to-use, real time API for querying any dataset on the platform.
Dataset query results are returned in pages, similar to a search engine. To narrow down search results, datasets can be filtered on one or more fields. The easiest way to query datasets is by using one of the client libraries. You can either build queries using a programming language (or even writing raw JSON), or you can build them right on any dataset page in the SolveBio web application.
A basic query returns a page of results from the specified dataset:
1 2 3 4 5 6 7 8 9 10 11 12 13 | Dataset.get_by_full_path('solvebio:public:/ClinVar/3.7.4-2017-01-30/Variants-GRCh37').query() # You can order query results using the ordering argument # Order the query results by clinical_significance ascending Dataset.get_by_full_path('solvebio:public:/ClinVar/3.7.4-2017-01-30/Variants-GRCh37').query(ordering='clinical_significance') # Order the query results by clinical_significance descending Dataset.get_by_full_path('solvebio:public:/ClinVar/3.7.4-2017-01-30/Variants-GRCh37').query(ordering='-clinical_significance') # Query results can be ordered by multiple columns # Order the query results by clinical_significance descending and gene_symbol ascending Dataset.get_by_full_path('solvebio:public:/ClinVar/3.7.4-2017-01-30/Variants-GRCh37').query(ordering=['-clinical_significance', 'gene_symbol']) |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 | clinvar <- Dataset.get_by_full_path("solvebio:public:/ClinVar/3.7.4-2017-01-30/Variants-GRCh37") records <- Dataset.query(id = clinvar$id, limit = 1000, paginate = TRUE) # You can order query results using the ordering argument # Order the query results by clinical_significance ascending records <- Dataset.query(id = clinvar$id, limit = 1000, paginate = TRUE, ordering='clinical_significance') # Order the query results by clinical_significance descending records <- Dataset.query(id = clinvar$id, limit=1000, ordering='-clinical_significance') # Query results can be ordered by multiple columns # Order the query results by clinical_significance descending and gene_symbol ascending records <- Dataset.query(id = clinvar$id, limit=1000, ordering=list('-clinical_significance', 'gene_symbol')) |
1 | curl https://$SOLVEBIO_API_KEY@api.solvebio.com/v2/datasets/510110510237166190/data
|
Filters¶
You can filter on any field in a dataset.
1 2 | dataset = Dataset.get_by_full_path('solvebio:public:/ClinVar/3.7.4-2017-01-30/Variants-GRCh37') dataset.query().filter(clinical_significance='Pathogenic') |
1 2 3 | clinvar <- Dataset.get_by_full_path("solvebio:public:/ClinVar/3.7.4-2017-01-30/Variants-GRCh37") filters <- '[["clinical_significance", "pathogenic"]]' records <- Dataset.query(id = clinvar$id, filters=filters) |
1 2 3 4 5 6 7 | curl https://$SOLVEBIO_API_KEY@api.solvebio.com/v2/datasets/510110510237166190/data \ -H "Content-Type: application/json" \ -d '{ "filters": [ ["clinical_significance", "Pathogenic"] ] }' |
Filters can be modified using "filter actions", which let you adjust the condition of a filter. To specify a filter action, append it to the field name when building a filter: <field>__<action>
(for example, date__gte
for filtering by dates greater or equal to the input).
Action | Description |
---|---|
iexact (default) | Field is equal to value (case-insensitive). If the field's value is a list, this will match values within the list, not the list as a whole. |
exact | Field is an exact match to value. Useful for case sensitive string field queries. |
in | Field is "one of" a list of values. (similar to Python's in operator). |
range | Field is a number within two values. Ranges are inclusive (fully-closed). |
gt | Field is a number greater than value. |
lt | Field is a number less than value. |
gte | Field is a number greater than or equal to value. |
lte | Field is a number less than or equal to value. |
contains | Field contains this string value. |
regexp | Field value matches this regular expression. (Note: The action is only compatible with datasets) |
Some filter actions (range
, gt
, lt
, gte
, lte
) may only be used on numeric and date fields.
Full-text fields use the contains
filter action by default and act like a typical search would. Results are ordered by relevance based on your search terms. When using the contains
action on string fields, the system converts the filter in a regular expression: .*{VALUE}.*
.
String Filters¶
You may filter string fields using exact (or case-insensitive) match, regular expression match (regexp
), or prefix match (prefix
). You can also match against multiple strings at once (a boolean or) using the in
filter. By default, filters on string fields use the "equals" match.
1 2 3 4 5 6 7 8 9 10 11 12 13 | q = solvebio.Dataset.get_by_full_path('solvebio:public:/ClinVar/3.7.4-2017-01-30/ClinVar-GRCh37').query() # Equals match q.filter(gene_symbol='BRCA1') # Equals match (in list) q.filter(gene_symbol__in=['BRCA1', 'BRCA2']) # Regular expression match q.filter(gene_symbol__regexp='BRCA[12]') # Prefix match q.filter(gene_symbol__prefix='BRCA') |
1 2 3 4 5 6 7 8 9 10 11 12 13 | # Equals match clinvar <- Dataset.get_by_full_path("solvebio:public:/ClinVar/3.7.4-2017-01-30/Combined-GRCh37") Dataset.query(id = clinvar$id, filters='[["gene_symbol", "BRCA1"]]') # Equals match (in list) Dataset.query(id = clinvar$id, filters='[["gene_symbol__in", ["BRCA1", "BRCA2"]]]') # Regular expression match Dataset.query(id = clinvar$id, filters='[["gene_symbol__regexp", "BRCA[12]"]]') # Prefix match Dataset.query(id = clinvar$id, filters='[["gene_symbol__prefix", "BRCA"]]') |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 | # Equals match curl https://$SOLVEBIO_API_KEY@api.solvebio.com/v2/datasets/510110510237166190/data \ -H "Content-Type: application/json" \ -d '{ "filters": [ ["gene_symbol", "BRCA1"] ] }' # Exact match (in list) curl https://$SOLVEBIO_API_KEY@api.solvebio.com/v2/datasets/510110510237166190/data \ -H "Content-Type: application/json" \ -d '{ "filters": [ ["gene_symbol__in", ["BRCA1", "BRCA2"]] ] }' # Regular expression match curl https://$SOLVEBIO_API_KEY@api.solvebio.com/v2/datasets/510110510237166190/data \ -H "Content-Type: application/json" \ -d '{ "filters": [ ["gene_symbol__regexp", "BRCA[12]"] ] }' # Prefix match curl https://$SOLVEBIO_API_KEY@api.solvebio.com/v2/datasets/510110510237166190/data \ -H "Content-Type: application/json" \ -d '{ "filters": [ ["gene_symbol__prefix", "BRCA"] ] }' |
Text Filters¶
Long (paragraph-length) fields typically use the text
data type. The contains
filter in text fields works more like a search than a filter. Results that match the search term are brought back in the order of relevance.
1 2 3 4 | q = Dataset.get_by_full_path('solvebio:public:/MEDLINE/2.3.1-2017/MEDLINE').query() # Contains match for text fields q.filter(abstract__contains='diabetes') |
1 2 3 | # Contains match for text fields MedLine <- Dataset.get_by_full_path("solvebio:public:/MEDLINE/2.3.1-2017/MEDLINE") Dataset.query(id = MedLine$id, filters='[["abstract__contains", "diabetes"]]') |
1 2 3 4 5 6 7 8 | # Contains match for text fields curl https://$SOLVEBIO_API_KEY@api.solvebio.com/v2/datasets/510113043291943359/data \ -H "Content-Type: application/json" \ -d '{ "filters": [ ["abstract__contains", "diabetes"] ] }' |
Numeric & Date Filters¶
Numeric and date fields can be filtered by exact match, exact match in list (in
), half-open range match (range
), and standard operators (gt
, lt
, gte
, lte
). Dates are in the format YYYY-MM-DD
.
1 2 3 4 5 6 7 8 9 10 11 12 13 | q = Dataset.get_by_full_path('solvebio:public:/ClinVar/3.7.4-2017-01-30/Variants-GRCh37').query() # Equals match q.filter(rcv_accession_version=4) # Equals match (in list) q.filter(rcv_accession_version__in=[1, 2, 3]) # Range query q.filter(rcv_accession_version__range=[1, 3]) # Operator query (gt/gte/lt/lte) q.filter(rcv_accession_version__gt=4) |
1 2 3 4 5 6 7 8 9 10 11 12 13 | # Equals match clinvar <- Dataset.get_by_full_path("solvebio:public:/ClinVar/3.7.4-2017-01-30/Variants-GRCh37") Dataset.query(id = clinvar$id, filters='[["rcv_accession_version", 4]]') # Equals match (in list) Dataset.query(id = clinvar$id, filters='[["rcv_accession_version__in", [1, 2, 3]]]') # Range query Dataset.query(id = clinvar$id, filters='[["rcv_accession_version__range", [1, 3]]]') # Operator query (gt/gte/lt/lte) Dataset.query(id = clinvar$id, filters='[["rcv_accession_version__gt", 4]]') |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 | # Equals match curl https://$SOLVEBIO_API_KEY@api.solvebio.com/v2/datasets/510110510237166190/data \ -H "Content-Type: application/json" \ -d '{ "filters": [ ["rcv_accession_version", 4] ] }' # Equals match (in list) curl https://$SOLVEBIO_API_KEY@api.solvebio.com/v2/datasets/510110510237166190/data \ -H "Content-Type: application/json" \ -d '{ "filters": [ ["rcv_accession_version__in", [1, 2, 3]] ] }' # Range query curl https://$SOLVEBIO_API_KEY@api.solvebio.com/v2/datasets/510110510237166190/data \ -H "Content-Type: application/json" \ -d '{ "filters": [ ["rcv_accession_version__range", [1, 3]] ] }' # Operator query (gt/gte/lt/lte) curl https://$SOLVEBIO_API_KEY@api.solvebio.com/v2/datasets/510110510237166190/data \ -H "Content-Type: application/json" \ -d '{ "filters": [ ["rcv_accession_version__gt", 4] ] }' |
Entity Filters¶
Compatibility
The entity filters are only compatible with datasets.
SolveBio-supported Entities can be used for filtering, without requiring the exact field name that the Entity resides in.
1 2 3 4 5 6 7 8 9 10 | clinvar = Dataset.get_by_full_path('solvebio:public:/ClinVar/3.7.4-2017-01-30/Combined-GRCh37') # Gene entity query clinvar.query(entities=[['gene', 'BRCA2']]) # Variant entity query clinvar.query(entities=[['variant', 'GRCH37-13-32890599-32890599-C']]) # Literature entity query clinvar.query(entities=[['literature', '26096185']]) |
1 2 3 4 5 6 7 8 9 10 | clinvar <- Dataset.get_by_full_path("solvebio:public:/ClinVar/3.7.4-2017-01-30/Combined-GRCh37") # Gene entity query Dataset.query(id = clinvar$id, entities=list(c("gene", "BRCA2"))) # Variant entity query Dataset.query(id = clinvar$id, entities=list(c("variant", "GRCH37-13-32890599-32890599-C"))) # Literature entity query Dataset.query(id = clinvar$id, entities=list(c("literature", "26096185"))) |
Genomic Coordinate Filters¶
Compatibility
The genomic coordinate filters are only compatible with datasets.
A dataset's genomic build is indicated by the suffix of the dataset's full_path.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 | # GRCh37 q = Dataset.get_by_full_path('solvebio:public:/ClinVar/3.7.4-2017-01-30/Variants-GRCh37').query() # GRCh38 q = Dataset.get_by_full_path('solvebio:public:/ClinVar/3.7.4-2017-01-30/Variants-GRCh38').query() # Position (all overlapping features) - these two queries are equivalent q.position('chr11', 18313400) q.position(chromosome='11', position=18313400) # Exact position only q.position(chromosome='11', position=17552955, exact=True) # Range (all overlapping features) - these two queries are equivalent q.range('chr11', 18313300, 18315000) q.range(chromosome='11', start=18313300, stop=18315000) # Exact range only q.range(chromosome='11', start=18313399, stop=18313403, exact=True) |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 | # Position (all overlapping features) curl https://$SOLVEBIO_API_KEY@api.solvebio.com/v2/datasets/510110510237166190/data \ -H "Content-Type: application/json" \ -d '{ "genome_build": "GRCh37", "filters": [ { "and": [ ["genomic_coordinates.start__lte", 18313400], ["genomic_coordinates.stop__gte", 18313400], ["genomic_coordinates.chromosome", "11"] ] } ] }' # Exact position only curl https://$SOLVEBIO_API_KEY@api.solvebio.com/v2/datasets/510110510237166190/data \ -H "Content-Type: application/json" \ -d '{ "genome_build": "GRCh37", "filters": [ { "and": [ ["genomic_coordinates.stop", 17552955], ["genomic_coordinates.start", 17552955], ["genomic_coordinates.chromosome", "11"] ] } ] }' # Range (all overlapping features) curl https://$SOLVEBIO_API_KEY@api.solvebio.com/v2/datasets/510110510237166190/data \ -H "Content-Type: application/json" \ -d '{ "genome_build": "GRCh37", "filters": [ { "and": [ { "or": [ { "and": [ ["genomic_coordinates.start__lte", 18313300], ["genomic_coordinates.stop__gte", 18315000] ] }, ["genomic_coordinates.start__range", [18313300, 18315000]], ["genomic_coordinates.stop__range", [18313300, 18315000]] ] }, ["genomic_coordinates.chromosome", "11"] ] } ] }' # Exact range only curl https://$SOLVEBIO_API_KEY@api.solvebio.com/v2/datasets/510110510237166190/data \ -H "Content-Type: application/json" \ -d '{ "genome_build": "GRCh37", "filters": [ { "and": [ ["genomic_coordinates.stop", 18313403], ["genomic_coordinates.start", 18313399], ["genomic_coordinates.chromosome", "11"] ] } ] }' |
Combining Filters¶
The examples below show you how to filter a dataset on one or two fields. In many cases, you will probably need to combine many filters into a single query.
When manually writing queries in JSON, you can combine and nest filters using a boolean operator ('and', 'or', 'not'). In the Python client, you can combine filters using the Filter
and GenomicFilter
classes ("&
" for "and", "|
" for "or", and "~
" for "not").
1 2 3 4 5 6 7 8 9 10 11 12 13 | q = Dataset.get_by_full_path('solvebio:public:/ClinVar/3.7.4-2017-01-30/Variants-GRCh37').query() # AND f = solvebio.Filter(gene_symbol='BRCA1') & solvebio.Filter(clinical_significance='Pathogenic') q.filter(f) # OR f = solvebio.Filter(gene_symbol='BRCA1') | solvebio.Filter(gene_symbol='BRCA2') q.filter(f) # NOT f = ~ solvebio.Filter(gene_symbol='BRCA1') q.filter(f) |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 | # AND curl https://$SOLVEBIO_API_KEY@api.solvebio.com/v2/datasets/510110510237166190/data \ -H "Content-Type: application/json" \ -d '{ "filters": [ { "and": [ ["gene_symbol", "BRCA1"], ["clinical_significance", "Pathogenic"] ] } ] }' # OR curl https://$SOLVEBIO_API_KEY@api.solvebio.com/v2/datasets/510110510237166190/data \ -H "Content-Type: application/json" \ -d '{ "filters": [ { "or": [ ["gene_symbol", "BRCA1"], ["gene_symbol", "BRCA2"] ] } ] }' # NOT curl https://$SOLVEBIO_API_KEY@api.solvebio.com/v2/datasets/510110510237166190/data \ -H "Content-Type: application/json" \ -d '{ "filters": [ { "not": ["gene_symbol", "BRCA1"] } ] }' |
Query Strings¶
Compatibility
The query strings are only compatible with datasets.
Query strings are parsed into a series of terms and operators. A query string can be provided as part of a SolveBio dataset query in combination with filters, or as an alternative to filters. Terms in a query string can be single words - "quick" or "brown" - or a phrase surrounded by double quotes - "quick brown" - which will search for all the words in the phrase, in the same order. The query syntax is based on the Lucene query syntax.
Queries and Filters
Queries are great to find records that best match a word or phrase, relative to others. Filters are designed to reduce the potential result-set by asking yes/no questions on every record in a dataset.
Query string operators allow you to customize a search. The available options are explained below:
Field Names
By default, when no field names are specified, all string or text fields are searched for each term. You can provide an explicit field name if you know the field in question:
status:active
You can do an exact match on a specific field, for example:
author:"John Smith"
To search for one-or-more terms in a field, combine them with OR
(default) or AND
:
title:(quick brown) title:(quick OR brown) title:(quick AND brown)
You can also find records with missing fields:
_missing_:title
Or, records where the field has a value (i.e. "not missing"):
_exists_:title
Wildcards
Wildcard searches can be run on individual terms, using ?
to replace a single character, and *
to replace zero or more characters:
qu?ck bro*
Ranges
Ranges can be specified for almost any field data type. They are most useful for dates and numeric fields. Inclusive ranges are specified with square brackets [min TO max]
and exclusive ranges with curly brackets {min TO max}
.
For example, this query will retrieve records for all days in 2012:
date:[2012-01-01 TO 2012-12-31]
Similarly, you can also use ranges on numeric fields:
count:[1 TO 5]
And use infinite ranges:
count:[10 TO *]
Standard numeric comparison operators can also be used:
age:>10 age:>=10 age:<10 age:<=10
To combine an upper and lower bound with the simplified syntax, you would need to join two clauses with an AND
operator:
age:(>=10 AND <20) age:(+>=10 +<20)
Always try range filters first
Processing ranges from a query string is much slower and less reliable than using an explicit range filter.
Grouping
Multiple terms or clauses can be grouped together with parentheses to form sub-queries:
(quick OR brown) AND fox
Groups can be used to target a particular field, or to boost the result of a sub-query:
status:(active OR pending) title:(full text search)^2
Reserved characters
The following characters are reserved in query strings and must be escaped with a leading backslash when used as part of a query term:
+ - = && || > < ! ( ) { } [ ] ^ " ~ * ? : \ /
For example, to search for the string (1+1)=2
, your query should be \(1\+1\)\=2
.
Advanced Filters¶
You can also compose filters in JSON and apply these filters via R, Python, or the SolveBio UI:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 | # A single filter may contains a single tuple ["{FIELD}", "{VALUE}"] # Or it may be composed of several filters under a boolean operator (and/or) { "{OPERATOR}": [ ["{FIELD}", "{VALUE}"], ["{FIELD}__{ACTION}", "{VALUE}"] ] } # Providing a list of filters assumes the "and" boolean operator: [ ["{FIELD}", "{VALUE}"], ["{FIELD}__{ACTION}", "{VALUE}"] ] # You can also exclude one or more filters using the "not" operator: { "not": ["{FIELD}", "{VALUE1}"] } { "not": { "and": [ ["{FIELD}", "{VALUE1}"], ["{FIELD}", "{VALUE2}"] ] } } # Advanced filters may also be nested in complex ways: { "and": [ ["{FIELD}", "{VALUE}"], { "or": [ {"not": ["{FIELD}", "{VALUE}"]}, ["{FIELD}", "{VALUE}"] ] } ] } |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 | dataset = Dataset.get_by_full_path('solvebio:public:/MEDLINE/2.3.1-2017/MEDLINE') # Include all abstracts with "diabetes" filters = [ ["abstract__contains", "diabetes"] ] dataset.query(filters=filters) # Exclude all abstracts with "diabetes" filters = [{ "not": ["abstract__contains", "diabetes"] }] dataset.query(filters=filters) # Find abstracts without "diabetes" from 1977 filters = [{ "and": [ {"not": ["abstract__contains", "diabetes"]}, { "or": [ ["date_published__regex", "*1977*"], ["date_created__range", ["1977-01-01", "1977-12-31"]] ] } ] }] dataset.query(filters=filters) |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 | # Contains match for text fields dataset <- Dataset.get_by_full_path("solvebio:public:/MEDLINE/2.3.1-2017/MEDLINE") # Include all abstracts with "diabetes" filters <- '[ ["abstract__contains", "diabetes"] ]' Dataset.query(id = dataset$id, filters = filters) # Exclude all abstracts with "diabetes" filters = '[{ "not": ["abstract__contains", "diabetes"] }]' Dataset.query(id = dataset$id, filters = filters) # Find abstracts without "diabetes" from 1977 filters = '[{ "and": [ {"not": ["abstract__contains", "diabetes"]}, { "or": [ ["date_published__regex", "*1977*"], ["date_created__range", ["1977-01-01", "1977-12-31"]] ] } ] }]' Dataset.query(id = dataset$id, filters = filters) |
Advanced filter syntax is composed of the following elements:
{OPERATOR}
is one ofand
,or
, ornot
.{FIELD}
is a documented field name in the dataset.{ACTION}
is a valid field action in the format{FIELD}__{ACTION}
(see below).{VALUE}
can be a string, numeric, or list value.
By default, a {FIELD}
with no attached {ACTION}
implies the "case-insensitive equals" (iexact
) operator.
String and text actions include:
Action | Description |
---|---|
iexact | (default for string/text) Field is equal to value (case-insensitive). If the field's value is a list, matches a value within the list, not the list as a whole. |
exact | Field is an exact match to value. Useful for longer string and text fields. |
in | Field is "one of" a list of values. |
contains | Field contains this string value. |
regexp | Field value matches this regular expression. (Note: The action is only compatible with datasets) |
Numeric and date field actions include:
Action | Description |
---|---|
exact | (default for numeric/date) Field is an exact match to value. |
range | Field is a number within two within two values (inclusive/fully-closed). |
gt | Field is a number greater than value. |
lt | Field is a number less than value. |
gte | Field is a number greater than or equal to value. |
lte | Field is a number less than or equal to value. |
Full-text (text
data type) fields automatically use the contains
filter action instead.