Skip to content

Aggregating Datasets

Overview

Aggregations are a powerful tool for building complex summaries of data. Aggregation queries can be run on SolveBio datasets with the help of facets. Facets can be used to generate aggregated summaries of string (and date) fields as well as numeric fields, and they automatically work on top of queries and filters. Facets can also be nested, which provide an incredibly efficient mechanism to summarize binned or rolled-up data (i.e. data summarized by term or by date).

String and Date Aggregations

For string fields (i.e. categorical fields) and date fields, you can use facets to find the total number of unique values as well as a list of the most common values that occur in the dataset. When used with a filtered dataset, the results will represent the filtered subset of data.

The following facet types are supported:

  • terms (default): Returns a list of the top terms and the number of times they occur (in order of this value). The default number of terms returned at once is 10. You can set a limit up to 1 million (1,000,000) terms returned.
  • count: Returns the number of unique values in the field. For very large datasets, this is an approximate number.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
from solvebio import Dataset

query = Dataset.get_by_full_path('solvebio:public:/ClinVar/3.7.4-2017-01-30/Combined-GRCh37').query()

# Find the most common genes in ClinVar
query.facets('gene_symbol')

# Retrieve the number of unique genes in ClinVar
query.facets(gene_symbol={'facet_type': 'count'})

# Filter ClinVar for only variants that relate to drug response.
# Which are the most common genes now?
filters = solvebio.Filter(clinical_significance='Drug response')
query.filter(f).facets('gene_symbol')

# How many genes are in this filtered query?
query.filter(f).facets(gene_symbol={'facet_type': 'count'})

# Now, get the top 100 most common values
query.filter(f).facets(gene_symbol={'limit': 100}
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
library(solvebio)

# Load ClinVar
clinvar <- Dataset.get_by_full_path("solvebio:public:/ClinVar/3.7.4-2017-01-30/Combined-GRCh37")

# Find the most common genes in ClinVar
facets <- Dataset.facets(clinvar$id, list("gene_symbol"))
# Convert the facet results to a matrix
topGenes <- do.call(rbind, facets$gene_symbol)

# Retrieve the number of unique genes in ClinVar
count <- Dataset.facets(clinvar$id, '{"gene_symbol": {"facet_type": "count"}}')
# Convert the facet result to a number
count <- as.numeric(count)

# Filter ClinVar for only variants that relate to drug response.
# Which are the most common genes now?
filters <- '[["clinical_significance", "Drug response"]]'
facets <- Dataset.facets(clinvar$id, list("gene_symbol"), filters=filters)
# Convert the facet results to a matrix
topDrugResponseGenes <- do.call(rbind, facets$gene_symbol)

# How many genes are in this filtered query?
count <- Dataset.facets(clinvar$id, '{"gene_symbol": {"facet_type": "count"}}', filters=filters)
# Convert the facet result to a number
countDrugResponseGenes <- as.numeric(count)

# Now, get the top 100 most common values
facets <- Dataset.facets(clinvar$id, list("gene_symbol"=list("limit" = 1000)))
# Convert the facet results to a matrix
top1000DrugResponseGenes <- do.call(rbind, facets$gene_symbol)

Term facets do not work with text fields

Please keep in mind that facets will not work for text fields which are indexed (and tokenized) for full-text search. Terms facets are also disabled for _id fields.

Numeric Aggregations

A few options are available for numerical fields (such as float/double, integer/long, and date). Instead of returning "common terms", numerical facets can calculate summary statistics, histograms, and percentiles. The following facet types are supported:

  • stats: Default stats return average, count, maximum, minimum, and sum. Extended stats also include standard deviation, standard deviation lower and upper bounds, sum of squares, and variance.
  • histogram: values are binned according to a provided interval. For numerical fields, the default interval is 100. For dates, the default interval is 'month'. Histogram intervals must be integers, and will therefore not work for fields with values between 0 and 1 (such as allele frequencies).
  • percentiles: calculates estimated percentiles for a field. By default, returns the following percentiles: 1, 5, 25, 50, 75, 95, 99. Percentiles are approximated and have an 1-5% error for very large datasets.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
from solvebio import Dataset

query = Dataset.get_by_full_path('solvebio:public:/ClinVar/3.7.4-2017-01-30/Combined-GRCh37').query()

# Get summary statistics for a date field
query.facets('date_last_evaluated')

# Get extended statistics for a numerical field.
query.facets(
    **{'review_status_star': {
        'facet_type': 'stats', 'extended': True}})

# Calculate a histogram for genomic position in chromosome 12
# NOTE: We use **-style notation since "chromosome" and "start" are nested fields
query.filter(**{'genomic_coordinates.chromosome': 12}).facets(
    **{'genomic_coordinates.start': {
        'facet_type': 'histogram',
        'interval': 1000000}})
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
library(solvebio)

# Get summary statistics for a date field
clinvar <- Dataset.get_by_full_path("solvebio:public:/ClinVar/3.7.4-2017-01-30/Combined-GRCh37")
Dataset.facets(id = clinvar$id, list("date_last_evaluated"))

# Get extended statistics for a numerical field.
stats <- Dataset.facets(
    clinvar$id,
    '{"review_status_star": {
        "facet_type": "stats", "extended": true}}')
# Get the min and max values
stats$review_status_star$min
stats$review_status_star$max

# Calculate a histogram for genomic position in chromosome 12
facets <- Dataset.facets(
    clinvar$id,
    '{"genomic_coordinates.start": {"facet_type": "histogram"}}',
    filters='[["genomic_coordinates.chromosome", 12]]')
# Convert the result to a matrix
genomicCoordinates <- do.call(rbind, facets$genomic_coordinates.start)

Nested Aggregations

Nested aggregations can be used to apply an aggregation query to the result of another aggregation query. For example, if you have a dataset with patients and want to determine the most common diagnosis age for each cancer type. You could iterate through each cancer type and run a facets query on the age field, but that would require a number of expensive API calls. Using nested aggregations, you can simply construct a facets query within an existing facets query, as in the example below.

At this time, you may only nest term and histogram facets under terms facets. Nesting within histogram facets is not currently supported.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
from solvebio import Dataset

# Retrieve the TCGA Patient Information dataset
tcga = Dataset.get_by_full_path('solvebio:public:/TCGA/1.2.0-2015-02-11/PatientInformation')

# Retrieve each cancer type (terms facets)
# and the diagnosis ages for each (nested terms facet)
facets = {
    'cancer_abbreviation': {
        'limit': 100,
        'facets': {
            'age_at_initial_pathologic_diagnosis': {
                'limit': 10
            }
        }
    }
}

results = tcga.query().facets(**facets)

# Process the nested results
for cancer_type, patient_count, subfacets in results['cancer_abbreviation']:
    print cancer_type, patient_count, subfacets['age_at_initial_pathologic_diagnosis']
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
library(solvebio)

# Retrieve the TCGA Patient Information dataset
TCGA <- Dataset.get_by_full_path("solvebio:public:/TCGA/1.2.0-2015-02-11/PatientInformation")

# Retrieve each cancer type (terms facets)
# and the diagnosis ages for each (nested terms facet)
facets <- list(
    "cancer_abbreviation" = list(
        "limit" = 100,
        "facets" = list(
            "age_at_initial_pathologic_diagnosis" = list("limit" = 10)
        )
    )
)

results <- Dataset.facets(TCGA$id, facets = facets)

# Process the nested results
for (row in results$cancer_abbreviation) {
    cancer_type <- row[[1]]
    patient_count <- row[[2]]
    diagnosis_ages <- row[[3]]$age_at_initial_pathologic_diagnosis
}