Biological use cases

You can find a couple of use cases with a biological value that demonstrates the use of the pdbecif parser along with some of its features.

Reading PDB macromolecular files in the mmCIF format

wwPDB distributes macromolecular data in the mmCIF format. For our use case, we will use PDB entry: 7cgo. This large molecular machine contains some 330K atoms and drives the rotation of the flagellum for bacterial mobility.

Let us first download the updated mmCIF file from Protein Data Bank in Europe. This file contains more annotations to place this biomacromolecular structure in its biological context to the standard wwPDB file load the whole file in memory:

from pdbecif.mmcif_io import CifFileReader

data = CifFileReader().read('7cgo_updated.cif')

now we can traverse the tree-like structure in any way we like. See the tree hierarchy below highlighting some of the important categories

7CGO
├── _entry
│   ├── id
├── _entity
│   ├── id
│   ├── type
│   ├── src_method
│   ├── pdbx_description
│   ├── formula_weight
│   ├── pdbx_ec
├── _atom_site
│   ├── id
│   └── type_symbol

So accessing content of the _entity category is as simple as:

data['7CGO']['_entity']

We can have a look on how many atoms are there in the entry:

len(pdb['_atom_site']['id'])

as well as e.g. loop through all the categories that contain information related to the EM experiment:

for k, v in data['7CGO'].items():
    if k[0:3] == '_em':
        print(k)
        print(v)

Whenever you are unsure on what certain data category contains you can always refer back to the reference and browse through the dictionary.

Speed optimization for large entires

In the previous example, we parsed and processed the entire file including the large _atom_site category. This is often not needed. Particularly, if we are interested in data annotations only and not in coordinates.

this can be achieved by discarding the _atom_site category such as:

data = CifFileReader().read('7cgo_updated.cif', ignore=['_atom_site'])

Noticed the speed improvement? Further optimization can be achieved by specifying only the categories one is interested in. Let’s assume we are interested in primary sequences of all the polymeric entities found in the PDB entry. Such information is included in _entity_poly category so let’s extract that one and print those out along with the polymeric type:

data = CifFileReader().read('7cgo_updated.cif', only=['_entity_poly'])

entity_poly = data['7CGO']['_entity_poly']
poly_types = entity_poly['type']
sequences = entity_poly['pdbx_seq_one_letter_code']

for p_type, sequence in zip(poly_types, sequences):
    print(f'{p_type}: {sequences}')

Reading wwPDB Biologically Interesting molecule Reference Dictionary

The Biologically Interesting molecule Reference Dictionary (BIRD) contains information about biologically interesting peptide-like antibiotic and inhibitor molecules in the PDB archive.

Let us first download the reference dictionary from the FTP area, load the whole file in memory:

from pdbecif.mmcif_io import CifFileReader

data = CifFileReader().read('prdcc-all.cif.gz')

and check how many such molecules are there:

len(data)

Next we can extract e.g. InChIKeys of all the molecules specified in the dictionary including molecular type:

for k, v in data.items():
    if '_pdbx_chem_comp_descriptor' not in v:
        continue

    descriptor = v['_pdbx_chem_comp_descriptor']

    if 'InChIKey' not in descriptor['type']:
        continue

    inchikey_index = descriptor['type'].index('InChIKey')
    inchikey = descriptor['descriptor'][inchikey_index]

    mol_type = v['_chem_comp']['type']

    print(f'{k} | {mol_type}: {inchikey}')