arpeggio

arpeggio calculates interatomic contacts based on the rules defined in CREDO.

Install and activate conda environment with openbabel (<3.0) dependency using conda environment such as:

conda create conda -n arpeggio-env python=3.7
conda install -c openbabel openbabel
conda activate arpeggio-env

and install arpeggio package as follows:

pip install git+https://github.com/PDBeurope/arpeggio.git@master#egg=arpeggio

Arpeggio output is very rich in terms of observed contacts. The static nature of protein structure models makes very challenging to correctly estimate whether or not e.g. a hydrogen bond is formed exactly between atom A and atom B and not between atom A and atom C that is 0.01Å farther. Hence arpeggio tries, in accordance with the CREDO nomenclature, to identify all the plausible contacts. The reasoning behing identification of arpeggio contacts, and their possible mutual (non)exclusivity is detaily described in the SI of arpeggio paper.

Arpeggio relies on PDB structures with added hydrogens and works best with mmCIF files for inferring molecular interactions.

Precomputed data

In the release process of PDBe we are using possible quaternary structures generated by the ModelServer and protonated using ChimeraX software. You can obtain these precomputed structures using the following link using plain PDB id substitution: https://www.ebi.ac.uk/pdbe/model-server/v1/1cbs/full?encoding=cif&data_source=pdb-h (PDB id: 1cbs) in this case.

You can also access precomputed interactions for vast majority of PDB entries using our aggregated API using one of the following API calls: GetBoundMolecules; GetBoundMoleculeInteractions; GetBoundLigandInteractions.

Please note that chain id information (field: _atom.site.auth_asym_id) is different for cryo em and x-ray structures in the files provided by the ModelServer. The reason being is that these are quaternary structures that are indicated to play possible biologicall role. The chain id is in the form X_Y (e.g. A_1), where X stands for _atom.site.auth_asym_id of a PDB entry asymetric unit and Y is a symmetry operator id that defines a collection of crystal symmetry operations applied on the original chain to generate position of the chain.

Example

Potassium channel found in the pdbe entry 1k4c contains just a single chain in the assymetric unit, however, the pore is formed by a tetrameric quarternary structure.

arpeggio script

arpeggio exposes a single script that you can use for inferring interatomic contacts. The only required parameters are protonated PDB entry and molecular selection.

You can get the idea on how to use the script by running:

arpeggio -h

Nevertheless the basic usage is as follows:

arpeggio -s /A/200/ -o arpeggio_result 1cbs.cif

INFO//14:59:04.545//Program begin.
INFO//14:59:04.545//Selection perceived: ['/A/200/']
DEBUG//14:59:04.605//Loaded PDB structure (BioPython)
DEBUG//14:59:04.667//Loaded MMCIF structure (OpenBabel)
DEBUG//14:59:04.674//Mapped OB to BioPython atoms and vice-versa.
DEBUG//14:59:04.674//Detected that the input structure contains hydrogens. Hydrogen addition will be skipped.
DEBUG//14:59:04.787//Determined atom explicit and implicit valences, bond orders, atomic numbers, formal charge and number of bound hydrogens.
DEBUG//14:59:04.810//Initialised SIFts.
DEBUG//14:59:04.812//Determined polypeptide residues, chain breaks, termini
DEBUG//14:59:04.858//Percieved and stored rings.
DEBUG//14:59:04.869//Perceived and stored amide groups.
DEBUG//14:59:04.882//Added hydrogens to BioPython atoms.
DEBUG//14:59:04.887//Added VdW radii.
DEBUG//14:59:04.892//Added covalent radii.
DEBUG//14:59:04.910//Completed NeighborSearch.
DEBUG//14:59:04.912//Assigned rings to residues.
DEBUG//14:59:04.918//Made selection.
DEBUG//14:59:05.112//Expanded to binding site.
DEBUG//14:59:05.113//Flagged selection rings.
DEBUG//14:59:05.114//Completed new NeighbourSearch.
INFO//14:59:05.438//Program End. Maximum memory usage was 77.32 MB.

This calculates all the interatomic contacts between retinoic acid (REA 200 A) in the PDB structure of 1cbs.

arpeggio API

You can achieve the very same behaviour by using arpeggio API.

[1]:
import requests
import pandas as pd # for example purposes only

def download_protonated_structure(pdb_id):
    response = requests.get(f'https://www.ebi.ac.uk/pdbe/model-server/v1/{pdb_id}/full?encoding=cif&data_source=pdb-h')
    cif_path = f'{pdb_id}.cif'

    with open(cif_path, 'wb') as fp:
        fp.write(response.content)

    return cif_path

# lets download a protonated quaternary structure first
download_protonated_structure('1cbs')

[1]:
'1cbs.cif'
[2]:
from arpeggio.core import InteractionComplex

selection = ['/A_1/200/']

# run structure checks and create internal representation of the molecule
complex = InteractionComplex('1cbs.cif')
complex.structure_checks()
complex.address_ambiguities()
complex.initialize()

# calculate interactions to our selection
complex.run_arpeggio(selection, interacting_cutoff=5, # cutoff for 'proximal' interactions
                                vdw_comp=0.1, # 'compensation' factor to address structural inconsistencies
                                include_sequence_adjacent=False)
contacts = complex.get_contacts()

[3]:
len(contacts)
[3]:
709
[4]:
contacts[0]
[4]:
{'bgn': {'label_comp_id': 'LEU',
  'auth_seq_id': 19,
  'auth_asym_id': 'A_1',
  'auth_atom_id': 'CA',
  'pdbx_PDB_ins_code': ' '},
 'end': {'label_comp_id': 'VAL',
  'auth_seq_id': 24,
  'auth_asym_id': 'A_1',
  'auth_atom_id': 'CB',
  'pdbx_PDB_ins_code': ' '},
 'type': 'atom-atom',
 'distance': 4.69,
 'contact': ['proximal'],
 'interacting_entities': 'INTRA_NON_SELECTION'}
[5]:

# we can filter out all the contacts that are not just 'proximal' only
non_proximal = [x for x in contacts if x['contact'] == ['proximal']]

# we can list all the possible hydrogen bonds that can be found in the binding site
hbonds = [x for x in contacts if 'hbond' in x['contact']]

# extract the distances
hbond_distances = [x['distance'] for x in hbonds]

df = pd.DataFrame(hbond_distances, columns=['Hbond distances'])
df.describe()

[5]:
Hbond distances
count 38.000000
mean 3.041579
std 0.238911
min 2.570000
25% 2.872500
50% 3.015000
75% 3.167500
max 3.620000
[ ]:
There is a lot one can do in terms of statistics for molecular interactions both ligand-wise and PDB-wise. I encourage you to try answering following questions:

Excercise

  • How many residues are forming this binding site?

  • What are all the interaction types that can be found in this active site?

  • Are there any atomic clashes?