Python Template
This notebook is a template for SolveBio Python examples. Download the original notebook here.
Install Packages
```python
!pip install solvebio
!pip install plotly
```
Load and Initialize Modules
```python
import solvebio
import numpy as np
from plotly.graph_objs import Data, Layout, XAxis, YAxis, Figure, Box
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot

# Initialize Plot.ly offline mode
init_notebook_mode(connected=True)
```
Log in to SolveBio
You'll need your SolveBio Personal Access Token to run this notebook. Create a Personal Access Token here.
```python
solvebio.login(access_token='TOKEN')
```
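Hard-coding a token in a shared notebook is easy to leak. As a minimal sketch, the token could instead be read from an environment variable before calling the login shown above. The helper `read_access_token` and the variable name `SOLVEBIO_ACCESS_TOKEN` are assumptions made for this illustration, not part of the original notebook.

```python
import os

# Hypothetical variable name, chosen for this sketch only.
TOKEN_ENV_VAR = "SOLVEBIO_ACCESS_TOKEN"


def read_access_token(env=os.environ, name=TOKEN_ENV_VAR):
    """Return the token stored under `name`, or raise if it is missing."""
    token = env.get(name, "").strip()
    if not token:
        raise RuntimeError("Set %s before running this notebook." % name)
    return token


# In the notebook, the result would be passed to the same login call as above:
# solvebio.login(access_token=read_access_token())
```

Passing the environment as an argument keeps the helper easy to check with a plain dictionary.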
Example: Average Age of Diagnosis in TCGA
In this demo, we will use SolveBio's Python package combined with Plot.ly and numpy to quickly analyze and visualize patients and their characteristics from The Cancer Genome Atlas (TCGA) project.
We will use the TCGA Patient Information dataset on SolveBio.
Since we're conducting this analysis by cancer type, we first need to pull out all the possible values for cancer type (the cancer_abbreviation field) in this dataset. Then, we want to retrieve all ages at initial diagnosis (the age_at_initial_pathologic_diagnosis field) for each cancer type. Below, we use "nested facets" to do both in a single SolveBio query:
```python
# Retrieve the TCGA Patient Information dataset
tcga = solvebio.Dataset.get_by_full_path('solvebio:public:/TCGA/1.2.0-2015-02-11/PatientInformation')

# Filter out values where the age is not available
include_ages = ~ solvebio.Filter(age_at_initial_pathologic_diagnosis='[Not Available]')

# Retrieve each cancer type (via terms facets)
# and the list of ages for each type (through a nested terms facet).
facets = {
    'cancer_abbreviation': {
        'limit': 1000,  # Use a large number to get all available cancer types
        'facets': {
            # Add a nested facet to get the ages for each cancer type
            'age_at_initial_pathologic_diagnosis': {
                'limit': 1000
            }
        }
    }
}

results = tcga.query(filters=include_ages).facets(**facets)

# Convert the results into a format usable by Plot.ly
# (a list of ages for each cancer type).
cancer_and_age = []
for cancer_type, count, sub_facets in results['cancer_abbreviation']:
    # The ages are represented by tuples (age, count). To get a nice
    # box plot below, expand out the ages for each occurrence.
    ages = []
    for age, age_count in sub_facets['age_at_initial_pathologic_diagnosis']:
        ages += [int(age)] * age_count

    cancer_and_age.append({'cancer_type': cancer_type, 'ages': ages})
```
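To make the expansion step concrete, here is a small sketch that runs without a SolveBio connection. The `mock_results` dictionary is invented sample data; its shape, a list of `(value, count, sub_facets)` tuples, is assumed from the way the loop above unpacks the facet results and has not been checked against the SolveBio API.

```python
# Invented sample data, shaped the way the loop above unpacks the facet results.
mock_results = {
    'cancer_abbreviation': [
        ('TGCT', 3, {'age_at_initial_pathologic_diagnosis': [('25', 2), ('31', 1)]}),
        ('BRCA', 4, {'age_at_initial_pathologic_diagnosis': [('58', 3), ('62', 1)]}),
    ]
}


def expand_ages(results):
    """Turn (age, count) pairs into one age entry per patient."""
    expanded = []
    for cancer_type, _total, sub_facets in results['cancer_abbreviation']:
        ages = []
        for age, age_count in sub_facets['age_at_initial_pathologic_diagnosis']:
            # Repeat each age once per patient so a box plot sees every patient.
            ages.extend([int(age)] * age_count)
        expanded.append({'cancer_type': cancer_type, 'ages': ages})
    return expanded


cancer_and_age_demo = expand_ages(mock_results)
print(cancer_and_age_demo)
```

Each cancer type ends up with one list entry per patient, which is the input a box plot needs to compute quartiles.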
Now that we have the age of diagnosis for every patient in TCGA, by cancer type, let's sort the data by median age for each cancer with numpy and visualize the data with Plot.ly.
```python
cancer_and_age = sorted(cancer_and_age, key=lambda x: np.median(x['ages']))

data = Data([
    Box(y=cancer['ages'], name=cancer['cancer_type'])
    for cancer in cancer_and_age
])

layout = Layout(
    title='Age of Diagnosis for TCGA Patients by Cancer Type',
    xaxis=XAxis(title='Cancer Type'),
    yaxis=YAxis(title='Age of Diagnosis')
)

fig = Figure(data=data, layout=layout)
iplot(fig)
```
The results are as we expect, based on the unique epidemiology of each cancer. For example, we know that testicular germ cell tumors are most common in men between the ages of 15 and 35. This is a pretty simple analysis, but there's a lot of data in SolveBio's TCGA datasets that is ripe for analysis.
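The median-based ordering applied before plotting can be checked on a few invented records. This sketch uses `statistics.median` from the standard library as a stand-in for `np.median` so that it runs without extra packages; the cancer type labels and ages are made up for illustration.

```python
from statistics import median

# Invented records in the same shape as cancer_and_age.
demo = [
    {'cancer_type': 'A', 'ages': [60, 65, 70]},
    {'cancer_type': 'B', 'ages': [20, 30, 40]},
    {'cancer_type': 'C', 'ages': [45, 50]},
]

# Sort ascending by median age, so the youngest-diagnosed group comes first.
demo_sorted = sorted(demo, key=lambda record: median(record['ages']))

order = [record['cancer_type'] for record in demo_sorted]
medians = [median(record['ages']) for record in demo_sorted]
print(order, medians)
```

Sorting this way places the box plots left to right from the youngest to the oldest median age at diagnosis, which makes differences between cancer types easier to read.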