pdbecif

This exercise demonstrates the use of PDBeCIF, a lightweight, pure-Python (2/3) mmCIF/CIF/STAR parser. The mmCIF format replaced the older PDB format as the master format of the wwPDB in 2014 and has been used for depositions of X-ray structures since 2019. The description of the mmCIF format is available.

There are no external dependencies, but internally the Global Phasing tokenizer is used. The package supports I/O operations on mmCIF files distributed by wwPDB partners.

You can install the package from PyPI or from the repository using one of the following commands:

pip install pdbecif
pip install git+https://github.com/PDBeurope/pdbecif.git@master#egg=pdbecif

Full API documentation is available.

Before we start we need to download a structure of any PDB entry. We are going to use 1cbs for this purpose.

[1]:
import requests
from pprint import pprint # for later pretty printing only

response = requests.get('http://www.ebi.ac.uk/pdbe/entry-files/download/1cbs_updated.cif')
response.raise_for_status()  # stop early if the download failed
cif_path = '1cbs.cif'

with open(cif_path, 'wb') as fp:
    fp.write(response.content)
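
A quick sanity check on the downloaded content can catch a failed or redirected download before any parsing is attempted. The sketch below is only an illustration: first_data_block_id is a hypothetical helper, not part of PDBeCIF, and it relies solely on the fact that an mmCIF data block opens with a data_<id> line.

```python
def first_data_block_id(cif_text):
    """Return the id of the first `data_` block, or None if there is none.

    Hypothetical helper for illustration; not part of PDBeCIF.
    """
    for line in cif_text.splitlines():
        stripped = line.strip()
        if stripped.lower().startswith("data_"):
            return stripped[len("data_"):]
    return None


# A tiny mmCIF fragment and something that is clearly not mmCIF
sample = "data_1CBS\n#\n_entry.id 1CBS\n#\n"
not_cif = "<html>404</html>"

print(first_data_block_id(sample))
print(first_data_block_id(not_cif))
```

With the download above, one could pass response.text to the helper and compare the result with the expected entry id.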

Structure reading

The PDBeCIF package can return three main data structures when reading mmCIF files. The following sections go through each of them and highlight different use cases.

Dictionary output

One of the outputs is a plain Python dictionary, which you get by setting the parameter output='cif_dictionary'.

Use this option:

  • if you want a direct way of modifying the data

  • if you want the fastest option

[2]:
from pdbecif.mmcif_io import CifFileReader

reader = CifFileReader()

cif_dict = reader.read(cif_path, output='cif_dictionary')

The result is the mmCIF file represented as a plain Python dictionary with a tree-like structure. The key at the first level is the data block id. Its value is another dictionary with category names as keys, and each category is in turn a dictionary of item names and values. The schema looks like this:

{
  "DATABLOCK_ID":
    {
      "CATEGORY_NAME":
        {
          "CATEGORY_ITEM": "VALUE"
        }
    }
}

See example for details:

{"1CBS": {
    "_entry": {"id": "1CBS"},
    "_symmetry": {"Int_Tables_number": "19",
                  "cell_setting": "?",
                  "entry_id": "1CBS",
                  "pdbx_full_space_group_name_H-M": "?",
                  "space_group_name_H-M": "P 21 21 21"}}}

This corresponds to the following part of the mmCIF file:

data_1CBS
#
_entry.id 1CBS
#
_symmetry.entry_id 1CBS
_symmetry.space_group_name_H-M 'P 21 21 21'
_symmetry.pdbx_full_space_group_name_H-M ?
_symmetry.cell_setting ?
_symmetry.Int_Tables_number 19
#
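
Because the result is an ordinary nested dictionary, it can be explored with plain Python. The following sketch rebuilds the excerpt above by hand (so it runs without reading any file), looks up one value by walking data block, category and item, and then reassembles the full mmCIF item names. The variable names are illustrative only.

```python
# Hand-written copy of the excerpt shown above; no file is read here
excerpt = {
    "1CBS": {
        "_entry": {"id": "1CBS"},
        "_symmetry": {
            "Int_Tables_number": "19",
            "cell_setting": "?",
            "entry_id": "1CBS",
            "pdbx_full_space_group_name_H-M": "?",
            "space_group_name_H-M": "P 21 21 21",
        },
    }
}

# data block -> category -> item
space_group = excerpt["1CBS"]["_symmetry"]["space_group_name_H-M"]
print(space_group)

# Rebuild the full item names as they appear in the mmCIF file
full_names = [
    f"{category}.{item}"
    for categories in excerpt.values()
    for category, items in categories.items()
    for item in items
]
print(full_names)
```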
[3]:
# you can also limit data categories by listing the names you are interested in
short_dict = reader.read(cif_path, output='cif_dictionary', only=['_entry', '_symmetry'])

print(f"Number of categories in the CIF file: {len(cif_dict['1CBS'].keys())}")
print(f"Limited number of categories in the CIF file: {len(short_dict['1CBS'].keys())}")

Number of categories in the CIF file: 46
Limited number of categories in the CIF file: 2

Alternatively, you can provide a list of category names that should be discarded from the CIF file. This is particularly useful when you do not want to deal with huge categories, e.g. coordinates (_atom_site).

[4]:
ignored_categories = reader.read(cif_path, output='cif_dictionary', ignore=['_atom_site'])

print(f"_atom_site is present in cif_dict: {bool('_atom_site' in cif_dict['1CBS'])}")
print(f"_atom_site is present in ignored_categories: {bool('_atom_site' in ignored_categories['1CBS'])}")
_atom_site is present in cif_dict: True
_atom_site is present in ignored_categories: False
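
If a file has already been read in full, a similar selection can be made afterwards on the dictionary itself. The sketch below is not how PDBeCIF implements only and ignore; keep_categories is a hypothetical helper working on an already loaded dictionary, and the _atom_site values are placeholders rather than real coordinates.

```python
def keep_categories(cif_dict, only=None, ignore=None):
    """Return a filtered copy of a cif_dictionary-style dict.

    Hypothetical helper: keeps categories listed in `only` (all if None)
    and drops those listed in `ignore`.
    """
    ignore = set(ignore or [])
    filtered = {}
    for block_id, categories in cif_dict.items():
        filtered[block_id] = {
            name: items
            for name, items in categories.items()
            if (only is None or name in only) and name not in ignore
        }
    return filtered


# Toy dictionary; the _atom_site content is a placeholder
loaded = {
    "1CBS": {
        "_entry": {"id": "1CBS"},
        "_symmetry": {"Int_Tables_number": "19"},
        "_atom_site": {"id": ["1", "2"]},
    }
}

only_entry = keep_categories(loaded, only=["_entry"])
no_atoms = keep_categories(loaded, ignore=["_atom_site"])
print(list(only_entry["1CBS"]))
print(list(no_atoms["1CBS"]))
```

Filtering while reading, as in the cells above, avoids holding the large categories in memory at all, so this post hoc variant is mainly useful when the full dictionary is needed elsewhere.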

CIFWrapper output

When you pass output='cif_wrapper' to the read() function, you get a wrapper object that offers some extra functionality:

  • access data items and their values using python dot (.) notation

  • searching in the data item values

[5]:
cif_wrapper_result = reader.read(cif_path, output='cif_wrapper')
cif_wrapper_result
[5]:
{'1CBS': <pdbecif.mmcif.CIFWrapper at 0x7f87f8ffbd90>}
[6]:
cif_wrapper = list(cif_wrapper_result.values())[0]

# access data objects using dot notation
print(cif_wrapper._entity.pdbx_description)

# or by indexing, the result is the same
print(cif_wrapper['_entity']['pdbx_description'])

['CELLULAR RETINOIC ACID BINDING PROTEIN TYPE II', 'RETINOIC ACID', 'water']
['CELLULAR RETINOIC ACID BINDING PROTEIN TYPE II', 'RETINOIC ACID', 'water']

The CIFWrapper object wraps mmCIF categories and provides access to them. Each category is represented by a CIFWrapperTable, which exposes convenience functions for searching through the data.

Let’s have a look at the following example and extract all the chemical components of ‘non-polymer’ type:

[7]:
components = cif_wrapper._chem_comp
non_polymer_components = components.search('type', 'non-polymer')
non_polymer_components
[7]:
{8: {'id': 'HOH',
  'type': 'non-polymer',
  'mon_nstd_flag': '.',
  'name': 'WATER',
  'pdbx_synonyms': '?',
  'formula': 'H2 O',
  'formula_weight': '18.015'},
 15: {'id': 'REA',
  'type': 'non-polymer',
  'mon_nstd_flag': '.',
  'name': 'RETINOIC ACID',
  'pdbx_synonyms': '?',
  'formula': 'C20 H28 O2',
  'formula_weight': '300.435'}}

Regular expressions are also supported.

[8]:
import re

# all chemical components with ID starting with A
reg_exp = re.compile(r'^A')
components.search('id', reg_exp)
[8]:
{0: {'id': 'ALA',
  'type': 'L-peptide linking',
  'mon_nstd_flag': 'y',
  'name': 'ALANINE',
  'pdbx_synonyms': '?',
  'formula': 'C3 H7 N O2',
  'formula_weight': '89.093'},
 1: {'id': 'ARG',
  'type': 'L-peptide linking',
  'mon_nstd_flag': 'y',
  'name': 'ARGININE',
  'pdbx_synonyms': '?',
  'formula': 'C6 H15 N4 O2 1',
  'formula_weight': '175.209'},
 2: {'id': 'ASN',
  'type': 'L-peptide linking',
  'mon_nstd_flag': 'y',
  'name': 'ASPARAGINE',
  'pdbx_synonyms': '?',
  'formula': 'C4 H8 N2 O3',
  'formula_weight': '132.118'},
 3: {'id': 'ASP',
  'type': 'L-peptide linking',
  'mon_nstd_flag': 'y',
  'name': 'ASPARTIC ACID',
  'pdbx_synonyms': '?',
  'formula': 'C4 H7 N O4',
  'formula_weight': '133.103'}}
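
To make the shape of the search results easier to reason about, here is a minimal sketch of the same idea on a plain dictionary. It is not the CIFWrapperTable implementation: search_rows is a hypothetical helper, it assumes a category stored column-wise as lists of equal length, and the table is a small hand-picked subset of the _chem_comp values shown above, so the row indices differ from the real output.

```python
import re


def search_rows(category, item, pattern):
    """Return {row_index: row_dict} for rows whose `item` matches `pattern`.

    Hypothetical helper: `pattern` is either a plain string (exact match)
    or a compiled regular expression (searched within the value).
    """
    names = list(category)
    rows = zip(*(category[name] for name in names))
    hits = {}
    for index, values in enumerate(rows):
        row = dict(zip(names, values))
        value = row[item]
        if isinstance(pattern, re.Pattern):
            matched = pattern.search(value) is not None
        else:
            matched = value == pattern
        if matched:
            hits[index] = row
    return hits


# Subset of the _chem_comp values listed above, stored column-wise
chem_comp = {
    "id": ["ALA", "ARG", "HOH", "REA"],
    "type": ["L-peptide linking", "L-peptide linking", "non-polymer", "non-polymer"],
    "name": ["ALANINE", "ARGININE", "WATER", "RETINOIC ACID"],
}

non_polymers = search_rows(chem_comp, "type", "non-polymer")
starts_with_a = search_rows(chem_comp, "id", re.compile(r"^A"))
print(sorted(non_polymers))
print(sorted(starts_with_a))
```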

CifFile output

When you pass output='cif_file' to the read() function, you get a CifFile object that offers some extra functionality. Use this option:

  • when you want to add or remove categories from your CIF file using convenience functions, as well as modify existing data.

[9]:
cif_file = reader.read(cif_path, output='cif_file')
cif_file
[9]:
<CifFile "1cbs.cif">

There are convenience functions to retrieve the list of data block ids as well as the data blocks themselves. The following example demonstrates the hierarchy of the CifFile object:

[10]:
# cif file is organized in data blocks. 1cbs.cif contains a single one.
block_id = cif_file.getDataBlockIds()[0]
data_block = cif_file.getDataBlock(block_id)

# each data block contains a list of categories that include data items
all_category_names = data_block.getCategoryIds()
entity_like_categories = [x for x in all_category_names if 'entity' in x]
print(f'Entity-like categories: {entity_like_categories}')

# extract entity_poly category
entity_poly = data_block.getCategory('entity_poly')

# list all the data items
item_names = entity_poly.getItemNames()
print(f"These are the item names for 'entity_poly' category {item_names}")

# extract PDB sequence stored in the data item: `pdbx_seq_one_letter_code_can`
item = entity_poly.getItem('pdbx_seq_one_letter_code_can')

formated_val = item.getFormattedValue()
plain_val = item.getRawValue()

Entity-like categories: ['entity', 'entity_poly', 'entity_poly_seq', 'entity_src_gen', 'pdbx_entity_nonpoly']
These are the item names for 'entity_poly' category ['entity_id', 'type', 'nstd_linkage', 'nstd_monomer', 'pdbx_seq_one_letter_code', 'pdbx_seq_one_letter_code_can', 'pdbx_strand_id', 'pdbx_target_identifier']

Compare values of the pdbx_seq_one_letter_code_can item:

[11]:
print(formated_val)
print(plain_val)

;PNFSGNWKIIRSENFEELLKVLGVNVMLRKIAVAAASKPAVEIKQEGDTFYIKTSTTVRTTEINFKVGEEFEEQTVDGRP
CKSLVKWESENKMVCEQKLLKGEGPKTSWTRELTNDGELILTMTADDVVCTRVYVRE
;

PNFSGNWKIIRSENFEELLKVLGVNVMLRKIAVAAASKPAVEIKQEGDTFYIKTSTTVRTTEINFKVGEEFEEQTVDGRP
CKSLVKWESENKMVCEQKLLKGEGPKTSWTRELTNDGELILTMTADDVVCTRVYVRE

These differ because getFormattedValue() retrieves the value exactly as it is formatted in the input mmCIF file: it does not remove leading or trailing whitespace characters or delimiters (here the ; lines). Removing them is the purpose of the getRawValue() method. However, the raw value is still returned as a string including all remaining characters. Note that there are still whitespace characters in the data item value, \n in this case!
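
When a contiguous sequence is needed, the remaining whitespace can be removed with plain string handling. The sketch below copies the two sequence lines printed above into a string by hand, so it does not depend on the exact value returned by getRawValue(); the variable names are illustrative.

```python
# The two sequence lines printed above, joined by the newline that separates them
raw_value = (
    "PNFSGNWKIIRSENFEELLKVLGVNVMLRKIAVAAASKPAVEIKQEGDTFYIKTSTTVRTTEINFKVGEEFEEQTVDGRP\n"
    "CKSLVKWESENKMVCEQKLLKGEGPKTSWTRELTNDGELILTMTADDVVCTRVYVRE"
)

# str.split() with no arguments splits on any whitespace, including "\n",
# so joining the pieces yields one uninterrupted sequence
sequence = "".join(raw_value.split())

print(sequence)
print(len(sequence))
```

Splitting on any whitespace, rather than replacing "\n" only, also copes with leading or trailing blanks should the value contain them.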

You can also create or remove categories as you wish:

[12]:
# add category
category = data_block.setCategory("new_category")
new_item = category.setItem('new_item')
new_item.setValue('some value')

data_block.getCategory('new_category')
[12]:
<Category "_new_category" with items ['new_item']>
[13]:
# remove categories
data_block.removeChild('_atom_site')
Warning: '_atom_site' removed from categories
[13]:
True

Structure writing

Structures can be exported using the CifFileWriter object and its write() method.

[14]:
from pdbecif.mmcif_io import CifFileWriter

writer = CifFileWriter('dict_output.cif')
writer
[14]:
<pdbecif.mmcif_io.CifFileWriter at 0x7f87f91ad850>

The write() method accepts any of the data objects returned by the read() function discussed above.

A dictionary

[15]:
custom_dict = {
    "root": {
        "category1": {
            "subcatA": "val1",
            "subcatB": "val2"
        },
        "category2": {
            "subcat1": [0,1,2],
            "subcat2": ["a", "b", "c"]
        }
    }
}
writer.write(custom_dict)

This results in a properly formatted and valid mmCIF file:

data_root
#
_category1.subcatA       val1
_category1.subcatB       val2

#
loop_
_category2.subcat1
_category2.subcat2
. a
1 b
2 c
#
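
List-valued items of one category end up as the columns of a single loop_, so they need to have the same number of values. A check along the following lines can be run on a dictionary before writing it. This is a sketch under that assumption only: check_loop_lengths is a hypothetical helper, not a PDBeCIF function, and it says nothing about how the writer itself reacts to uneven lists.

```python
def check_loop_lengths(cif_dict):
    """Return (block_id, category) pairs whose list-valued items differ in length.

    Hypothetical helper; scalar items are ignored.
    """
    problems = []
    for block_id, categories in cif_dict.items():
        for name, items in categories.items():
            lengths = {len(v) for v in items.values() if isinstance(v, list)}
            if len(lengths) > 1:
                problems.append((block_id, name))
    return problems


# Same structure as the dictionary written above
good = {
    "root": {
        "category1": {"subcatA": "val1", "subcatB": "val2"},
        "category2": {"subcat1": [0, 1, 2], "subcat2": ["a", "b", "c"]},
    }
}

# A deliberately broken variant: the second column is one value short
bad = {"root": {"category2": {"subcat1": [0, 1, 2], "subcat2": ["a", "b"]}}}

print(check_loop_lengths(good))
print(check_loop_lengths(bad))
```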

A CIFWrapper

[16]:
writer = CifFileWriter('wrapper_output.cif')
writer.write(cif_wrapper)

A CifFile

[17]:
writer = CifFileWriter('cif_file_output.cif')
writer.write(cif_file)