Pipelines
pdbeccdutils
can be used to process standard ligand definitions defined by CCD or BIRD dictionaries, identify boundmolecules from PDB model files and infer Covalently Linked Components (CLC). The packages comes with two pipelines, one to process CCD/PRDs and an another one to process boundmolecules from PDB model files
Processing of CCD/PRD
process_components_cif
entry point can be used to run the pipeline to process a CCD or PRD. The pipeline enriches (data-enrichment-process) the information in standard wwPDB CCD/PRD files, generates 2D images and exports to multiple file formats.
To find all the input options, use:
process_components_cif -h
For example, following files are generated for ATP
ATP.cif
- Standard wwPDB CCD file with the data enrichments |mmCIF
.ATP_ideal.pdb
- ideal cooordinates |PDB
.ATP_ideal_alt.pdb
- ideal cooordinates with atom alternate names |PDB
.ATP_model.pdb
- model cooordinates |PDB
.ATP_model_alt.pdb
- model cooordinates with atom alternate names |PDB
.ATP_N.svg
- 2D depiction in N x N resolution. Where N is (100,200,300,400,500) pixels.ATP_N_names.svg
- 2D depiction in N x N resolution with atom names. Where N in (100,200,300,400,500) pixels.ATP_model.sdf
- model coordinates |MOL
ATP_ideal.sdf
- ideal cooordinates |MOL
.ATP.cml
- model coordinates |CML
.ATP_annotation.json
- 2D depiction in JSON format with some additional annotation.
Data enrichment process
Aside from the standard content of the wwPDB CCD files, process_components_cif
pipeline adds the following information in the mmCIF
files.
2D coordinations of the ligand.
RDKit regenerated 3D conformer.
Information about scaffold and fragments found in the entry.
Processing of boundmolecules from PDB model files
Most of the structures in PDB has atleast one small molecule bound to it. Many of these small molecules are ligands of biological significance, such as cofactors, metabolites, carbohydrates, or lipids. read_boundmolecule
entry point can be used to run the pipeline to identify these boundmolecules from PDB entries and assign them as Covalently Linked Components (CLC) if they are composed of individual components represented by individual CCDs.
To find all the input options, use:
read_boundmolecule -h
For example, the pipeline generates the following files in PDB entry 1d83
1d83_processed.cif
- The input structure of 1d83 after removing alternate conformersbound_molecules.json
- Details of boundmolecules found in the input structureCLC_1
- Folder containing details of CLC identified from input structureCLC_2
- Folder containing details of CLC identified from input structure
In each folder of CLCs identified from an input structure, the following files are stored:
CLC_K.cif
- CIF files of CLC after data enrichments |mmCIF
.CLC_K_model.pdb
- model cooordinates |PDB
.CLC_K_N.svg
- 2D depiction in N x N resolution. Where N is (100,200,300,400,500) pixels.CLC_K_N_names.svg
- 2D depiction in N x N resolution with atom names. Where N in (100,200,300,400,500) pixels.CLC_K_model.sdf
- model coordinates |MOL
CLC_K.cml
- component representation |CML
.CLC_K_annotation.json
- 2D depiction in ‘natural format’ (i.e. 50px per 1Å) with some additional annotation.
Where K is a positive interger representing the Kth CLC identified from an input structure