
Import Reference Data (CGD)

In this tutorial, we will import a commonly used reference dataset the same way SolveBio does: with the pipeline we use to maintain the public reference datasets that we support.

The Clinical Genomic Database (CGD) is a manually curated database of conditions with known genetic causes. CGD is maintained and published by the NHGRI and is available at https://research.nhgri.nih.gov/CGD/.

We will add CGD to the SolveBio data platform. Once imported, the dataset can be used to annotate existing data, in the gene-specific beacons, and more.

Install Packages

!pip install solvebio
Requirement already satisfied: solvebio in /Users/dandan/anaconda2/lib/python2.7/site-packages
Requirement already satisfied: pyprind in /Users/dandan/anaconda2/lib/python2.7/site-packages (from solvebio)
Requirement already satisfied: pycurl>=7.0.0 in /Users/dandan/anaconda2/lib/python2.7/site-packages (from solvebio)
Requirement already satisfied: requests>=2.0.0 in /Users/dandan/anaconda2/lib/python2.7/site-packages (from solvebio)
Requirement already satisfied: six in /Users/dandan/anaconda2/lib/python2.7/site-packages (from solvebio)
Requirement already satisfied: urllib3<1.23,>=1.21.1 in /Users/dandan/anaconda2/lib/python2.7/site-packages (from requests>=2.0.0->solvebio)
Requirement already satisfied: idna<2.7,>=2.5 in /Users/dandan/anaconda2/lib/python2.7/site-packages (from requests>=2.0.0->solvebio)
Requirement already satisfied: chardet<3.1.0,>=3.0.2 in /Users/dandan/anaconda2/lib/python2.7/site-packages (from requests>=2.0.0->solvebio)
Requirement already satisfied: certifi>=2017.4.17 in /Users/dandan/anaconda2/lib/python2.7/site-packages (from requests>=2.0.0->solvebio)

Load and Initialize Modules

import solvebio

Log in to SolveBio

You'll need your SolveBio Personal Access Token to run this notebook. Create a Personal Access Token here.

solvebio.login(access_token='TOKEN')

Walkthrough: Importing Reference Data into SolveBio

First we're going to create a new dataset in our private vault.

dataset = solvebio.Dataset.get_or_create_by_full_path('~/Examples/CGD')

Next, we're going to grab the latest data dump of CGD from the original source at the NHGRI (available at https://research.nhgri.nih.gov/CGD/download/) and use it as our source data via a manifest. We can add any number of files to a manifest.

manifest = solvebio.Manifest()
manifest.add_url('https://research.nhgri.nih.gov/CGD/download/txt/CGD.txt.gz')

Now we can import this manifest into a dataset on SolveBio. During the import, we can use SolveBio expressions to transform the data before it gets indexed into the final dataset. We can set field ordering, add descriptions and new fields, bring in annotations from other datasets, and so on.

fields = [
    {
      'name': 'gene',
      'entity_type': 'gene',
      'description': 'Official HGNC gene symbol for this gene.'
    },
    {
      'name': 'references',
      'entity_type': 'literature',
      'description': 'PubMed IDs of published articles relevant to this record.'
    },
    {
      'name': 'inheritance',
      'description': """
            The inheritance pattern for this condition.
            Acronyms include AD (Autosomal dominant), AR (Autosomal recessive),
            BG (Blood group), XL (X-linked), YL (Y-linked).
        """
    },
    {
      'name': 'inheritance_transformed',
      'data_type': 'string',
      'is_list': True,
      'expression': "[x.strip() for x in value.split('/')]",
      'description': 'A transformed version of the previous field to make this dataset more usable.'
    }
]
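To see what the expression on inheritance_transformed does, here is a minimal standalone sketch in plain Python. It applies the same list comprehension to a few made-up inheritance strings; the sample values are illustrative assumptions, not rows taken from CGD, and the helper name split_inheritance is hypothetical.

```python
def split_inheritance(value):
    """Apply the same logic as the 'inheritance_transformed' expression."""
    # Split on '/' and trim whitespace around each acronym.
    return [x.strip() for x in value.split('/')]


# Illustrative inputs only (assumed, not copied from the CGD file).
samples = ['AD', 'AD/AR', 'AR / XL']

for value in samples:
    print(repr(value), '->', split_inheritance(value))
```

Storing the result in a list field (is_list set to True) is what lets each acronym be treated as its own value instead of one combined string.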

Finally, with the new fields defined, we set the reader parameters and import the data. This particular file has a header line starting with #, so the reader parameters are set to allow that line to be used as the header.
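Before running the import, it can help to see what such a header looks like. The following is a rough local sketch, using only the Python standard library, of reading a gzipped, tab-delimited file whose header line begins with #. The sample content, its column names, and the helper read_hash_header are all assumptions made for illustration; this is not how SolveBio's reader is implemented.

```python
import csv
import gzip
import io

# Hypothetical stand-in for a small tab-delimited file with a '#' header.
SAMPLE = (
    "#GENE\tCONDITION\tINHERITANCE\n"
    "GENE1\tExample condition A\tAD\n"
    "GENE2\tExample condition B\tAD/AR\n"
)


def read_hash_header(raw_bytes):
    """Parse gzipped tab-delimited bytes whose header line starts with '#'."""
    with gzip.open(io.BytesIO(raw_bytes), mode='rt', newline='') as handle:
        reader = csv.reader(handle, delimiter='\t')
        header = next(reader)
        # Keep the first line as the header, dropping only the leading '#'.
        header[0] = header[0].lstrip('#')
        return [dict(zip(header, row)) for row in reader]


compressed = gzip.compress(SAMPLE.encode('utf-8'))
records = read_hash_header(compressed)
print(records[0])
```

A reader that treats every line starting with # as a comment would discard the column names, which is why the header line needs special handling. The import call itself follows.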

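# Queue an import of the manifest into the dataset created earlier.
solvebio.DatasetImport.create(
    dataset_id=dataset.id,
    manifest=manifest.manifest,
    commit_mode='append',
    target_fields=fields,  # field templates defined above
    reader_params={'comment': '-'}  # reader settings for the '#' header line
)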
|            Fields | Data                                                     |
|-------------------+----------------------------------------------------------|
|        class_name | DatasetImport                                            |
|       commit_mode | append                                                   |
|        created_at | 2018-01-12T22:04:25.491Z                                 |
|           dataset | {  "beacon_url": "https://api.solvebio.com/v2/datasets/63|
|   dataset_commits | []                                                       |
|        dataset_id | 631938550719741150                                       |
|       description |                                                          |
|     entity_params | {}                                                       |
|     error_message |                                                          |
|      genome_build |                                                          |
|                id | 631981498896319230                                       |
|   import_messages | {}                                                       |
|    include_errors | False                                                    |
|    logical_object |                                                          |
|          manifest | {  "files": [    {      "base64_md5": null,       "format|
|          metadata | {}                                                       |
|     reader_params | {  "skiprows": [    0  ]}                                |
|            source | manifest                                                 |
|            status | queued                                                   |
|              tags | []                                                       |
|     target_fields | [<SolveObject at 0x107475ab8> JSON: {  "data_type":  ... |
|           task_id | 631981499236441456                                       |
|        timestamps | [[u'new', u'queued', u'2018-01-12T22:04:25.511829Z']]    |
|             title |                                                          |
|        updated_at | 2018-01-12T22:04:25.514Z                                 |
|            upload |                                                          |
|              user | {  "class_name": "User",   "email": "dandan@solvebio.com"|
| validation_params | {}                                                       |
|          vault_id | 650                                                      |
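The status of queued in the output above suggests that the import runs as a background task. One way to wait for such a task is a simple polling loop, sketched below. This is a generic pattern and not part of the SolveBio client: the helper wait_for_status is hypothetical, the terminal status names 'completed' and 'failed' are assumptions, and the status source here is a stub so the sketch runs without network access.

```python
import time

# Assumed names for the states in which polling should stop.
TERMINAL_STATUSES = {'completed', 'failed'}


def wait_for_status(get_status, interval=0.0, max_checks=10, sleep=time.sleep):
    """Call get_status() until it returns a terminal status.

    Returns the final status and the list of every status observed.
    """
    seen = []
    for _ in range(max_checks):
        status = get_status()
        seen.append(status)
        if status in TERMINAL_STATUSES:
            return status, seen
        sleep(interval)
    raise TimeoutError('no terminal status after %d checks' % max_checks)


# Stub status source standing in for a real re-fetch of the import object.
fake_statuses = iter(['queued', 'running', 'completed'])
final, history = wait_for_status(lambda: next(fake_statuses))
print(final, history)
```

In real use, get_status would be a callable that re-fetches the import and returns its status field, with a longer interval between checks.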