
PDB sequence search

This interactive Python notebook will guide you through various ways of programmatically accessing Protein Data Bank in Europe (PDBe) data using the REST API.

The REST API is a programmatic way to obtain information from the PDB and EMDB. You can access details about:

  • sample

  • experiment

  • models

  • compounds

  • cross-references

  • publications

  • quality

  • assemblies and more…

For more information, visit https://www.ebi.ac.uk/pdbe/pdbe-rest-api. A minimal sketch of calling one of these endpoints is shown below.
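
As a minimal sketch (not part of the original notebook), a single REST endpoint can be called directly with the requests package. Here we assume the entry summary endpoint listed in the API documentation and use the PDB entry 1cbs:

import requests  # used to get data from a URL

# Hypothetical example: fetch the summary for PDB entry 1cbs from the
# PDBe REST API (endpoint assumed from the API documentation linked above).
summary_url = "https://www.ebi.ac.uk/pdbe/api/pdb/entry/summary/"
response = requests.get(summary_url + "1cbs")

if response.status_code == 200:
    data = response.json()           # the JSON is keyed by the PDB id
    print(data["1cbs"][0]["title"])  # print the entry title
else:
    print("Request failed with status code", response.status_code)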

This notebook is part of the training material series and focuses on searching for sequences with the PDBe search API. You can retrieve this material, and many more notebooks, from GitHub.

1) Making imports and setting variables

First, we import some packages that we will use, and set some variables.

Note: The full list of valid URLs is available from https://www.ebi.ac.uk/pdbe/api/doc/

[1]:
from pprint import pprint # used for pretty printing
import requests # used to get data from a URL
import pandas as pd # used to analyse the results
search_url = "https://www.ebi.ac.uk/pdbe/search/pdb/select?" # the base URL for PDBe's search API

We will now define some functions which we will use to get data from PDBe's search API.

[9]:
def make_request_post(search_dict, number_of_rows=10):
    """
    makes a POST request to the PDBe search API
    :param dict search_dict: the terms used to search
    :param int number_of_rows: number of rows to return - initially limited to 10
    :return dict: response JSON
    """
    # make sure we get the number of rows we need
    if 'rows' not in search_dict:
        search_dict['rows'] = number_of_rows
    # set the return type to JSON
    search_dict['wt'] = 'json'

    # do the query
    response = requests.post(search_url, data=search_dict)

    if response.status_code == 200:
        return response.json()
    else:
        print("[No data retrieved - %s] %s" % (response.status_code, response.text))

    return {}

def format_sequence_search_terms(sequence, filter_terms=None):
    """
    Format parameters for a sequence search
    :param str sequence: one letter sequence
    :param list filter_terms: terms to filter the results by
    :return dict: parameters to pass to the search API
    """
    # first we set the parameters which we will pass to PDBe's search
    params = {
        'json.nl': 'map',
        'start': '0',
        'sort': 'fasta(e_value) asc',
        'xjoin_fasta': 'true',
        'bf': 'fasta(percentIdentity)',
        'xjoin_fasta.external.expupperlim': '0.1',
        'xjoin_fasta.external.sequence': sequence,
        'q': '*:*',
        'fq': '{!xjoin}xjoin_fasta'
    }
    # we make sure that we add required filter terms if they aren't present
    if filter_terms:
        for term in ['pdb_id', 'entity_id', 'entry_entity', 'chain_id']:
            filter_terms.append(term)
        filter_terms = list(set(filter_terms))
        params['fl'] = ','.join(filter_terms)

    # returns the parameter dictionary
    return params


def run_sequence_search(sequence, filter_terms=None, number_of_rows=10):
    """
    Runs a sequence search and returns the results
    :param str sequence: sequence in one letter code
    :param list filter_terms: terms to filter the results by
    :param int number_of_rows: number of results to return
    :return list: list of results
    """
    search_dict = format_sequence_search_terms(sequence=sequence, filter_terms=filter_terms)
    response = make_request_post(search_dict=search_dict, number_of_rows=number_of_rows)
    results = response.get('response', {}).get('docs', [])
    print('Number of results {}'.format(len(results)))

    # we now have to go through the FASTA results and join them with the main results

    raw_fasta_results = response.get('xjoin_fasta', {}).get('external', [])
    fasta_results = {} # results from FASTA will be stored here - keyed by PDB ID and chain ID

    # go through each FASTA result and get the E value, percentage identity and sequence from the result

    for fasta_row in raw_fasta_results:
        # join_id = fasta_row.get('joinId')
        fasta_doc = fasta_row.get('doc', {})
        percent_identity = fasta_doc.get('percent_identity')
        e_value = fasta_doc.get('e_value')
        return_sequence = fasta_row.get('return_sequence_string')
        pdb_id_chain = fasta_doc.get('pdb_id_chain').split('_')
        pdb_id = pdb_id_chain[0].lower()
        chain_id = pdb_id_chain[-1]
        join_id = '{}_{}'.format(pdb_id, chain_id)
        fasta_results[join_id] = {'e_value': e_value,
                                  'percentage_identity': percent_identity,
                                  'return_sequence': return_sequence}

    # now we go through the main results and add the FASTA results
    ret = [] # final results will be stored here.
    for row in results:
        pdb_id = row.get('pdb_id').lower()
        chain_ids = row.get('chain_id')
        for chain_id in chain_ids:
            search_id = '{}_{}'.format(pdb_id, chain_id)
            entry_fasta_results = fasta_results.get(search_id, {})
            # we will only keep results that match the search ID
            if entry_fasta_results:
                row['e_value'] = entry_fasta_results.get('e_value')
                row['percentage_identity'] = entry_fasta_results.get('percentage_identity')
                row['result_sequence'] = entry_fasta_results.get('return_sequence')  # key matches the dict built above

                ret.append(row)
    return ret
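
A quick optional check (not in the original notebook) is to inspect the Solr parameter dictionary that format_sequence_search_terms builds before anything is sent to the server:

# Hypothetical sanity check: build the parameters for a short dummy sequence
# and print them - nothing is sent to the search API at this point.
example_params = format_sequence_search_terms(
    "MENFQKVEKIGEGTYGVVYK",                   # any one-letter sequence will do
    filter_terms=["pdb_id", "molecule_name"]  # fields we want returned
)
pprint(example_params)  # shows the q, fq, xjoin_fasta.* and fl keys used in the POST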

We will now run a sequence search using an example sequence from UniProt P24941 - Cyclin-dependent kinase 2.

[10]:
sequence_to_search = """
MENFQKVEKIGEGTYGVVYKARNKLTGEVVALKKIRLDTETEGVPSTAIREISLLKELNH
PNIVKLLDVIHTENKLYLVFEFLHQDLKKFMDASALTGIPLPLIKSYLFQLLQGLAFCHS
HRVLHRDLKPQNLLINTEGAIKLADFGLARAFGVPVRTYTHEVVTLWYRAPEILLGCKYY
STAVDIWSLGCIFAEMVTRRALFPGDSEIDQLFRIFRTLGTPDEVVWPGVTSMPDYKPSF
PKWARQDFSKVVPPLDEDGRSLLSQMLHYDPNKRISAKAALAHPFFQDVTKPVPHLRL"""

filter_list = ['pfam_accession', 'pdb_id', 'molecule_name', 'ec_number',
               'uniprot_accession_best', 'tax_id']

first_results = run_sequence_search(sequence_to_search, filter_terms=filter_list)
Number of results 10

Print the first result to see what we have

[11]:
pprint(first_results[0])
{'chain_id': ['A', 'C'],
 'e_value': 2.9e-76,
 'ec_number': ['2.7.11.22'],
 'entity_id': 1,
 'entry_entity': '1jst_1',
 'molecule_name': ['Cyclin-dependent kinase 2'],
 'pdb_id': '1jst',
 'percentage_identity': 100.0,
 'pfam_accession': ['PF00069'],
 'result_sequence': None,
 'tax_id': [9606],
 'uniprot_accession_best': ['P24941']}

Notice that some of the fields in the result (for example chain_id and pfam_accession) are lists.

Before we do any further analysis we should get a few more results so we can see some patterns. We are going to increase the number of results to 1000

[34]:
first_results = run_sequence_search(sequence_to_search,
                                    filter_terms=filter_list,
                                    number_of_rows=1000
                                    )

Number of results 1000

Load the results into a Pandas DataFrame so we can query them.

Before we do this we have to do a bit of housekeeping: we are going to change the lists (values with [] around them) into comma-separated strings.

[28]:
def change_lists_to_strings(results):
    """
    updates lists to strings for loading into Pandas
    :param dict results: dictionary of results to process
    :return dict: dictionary of results
    """
    for row in results:
        for data in row:
            if isinstance(row[data], list):
                # if there are any numbers in the list change them into strings
                row[data] = [str(a) for a in row[data]]
                # unique and sort the list and then change the list into a string
                row[data] = ','.join(sorted(list(set(row[data]))))

    return results


def pandas_dataset(list_of_results):
    results = change_lists_to_strings(list_of_results)  # we have added our function to change lists to strings
    df = pd.DataFrame(results)

    return df
[35]:
df = pandas_dataset(first_results)

Let's see what we have - you'll see it looks a bit like a spreadsheet or a database table.

[36]:
print(df.head())

  chain_id  ec_number  entity_id entry_entity              molecule_name  \
0      A,C  2.7.11.22          1       1jst_1  Cyclin-dependent kinase 2
1        A  2.7.11.22          1       3tiy_1  Cyclin-dependent kinase 2
2        A  2.7.11.22          1       3lfn_1  Cyclin-dependent kinase 2
3        A  2.7.11.22          1       6q4k_1  Cyclin-dependent kinase 2
4      A,C  2.7.11.22          1       4bcq_1  Cyclin-dependent kinase 2

  pdb_id pfam_accession tax_id uniprot_accession_best       e_value  \
0   1jst        PF00069   9606                 P24941  1.900000e-76
1   3tiy        PF00069   9606                 P24941  1.900000e-76
2   3lfn        PF00069   9606                 P24941  1.900000e-76
3   6q4k        PF00069   9606                 P24941  1.900000e-76
4   4bcq        PF00069   9606                 P24941  1.900000e-76

   percentage_identity result_sequence
0                100.0            None
1                100.0            None
2                100.0            None
3                100.0            None
4                100.0            None

We can save the results to a CSV file, which we can load into Excel.

[17]:
df.to_csv("search_results.csv")

There isn't a cut-off on E value or percentage identity in our search, so we should check what range of values we got back.

We can select a column and find its minimum value with .min() or its maximum value with .max()

[18]:
df['percentage_identity'].max()
[18]:
100.0
[19]:
df['percentage_identity'].min()
[19]:
36.1

The same for E value - here we want both the min and the max

[20]:
df['e_value'].min()
[20]:
1.8e-77
[21]:
df['e_value'].max()
[21]:
2.4e-20

We can see that percentage identity drops to as low as 36%. Let's say we want to restrict it to at least 50%.

[31]:
df2 = df.query('percentage_identity > 50')

We stored the results in a new DataFrame called "df2"

[32]:
df2.head()
[32]:
  chain_id  ec_number  entity_id entry_entity              molecule_name  \
0      A,C  2.7.11.22          1       1jst_1  Cyclin-dependent kinase 2
1        A  2.7.11.22          1       3tiy_1  Cyclin-dependent kinase 2
2        A  2.7.11.22          1       3lfn_1  Cyclin-dependent kinase 2
3        A  2.7.11.22          1       6q4k_1  Cyclin-dependent kinase 2
4      A,C  2.7.11.22          1       4bcq_1  Cyclin-dependent kinase 2

  pdb_id pfam_accession tax_id uniprot_accession_best       e_value  \
0   1jst        PF00069   9606                 P24941  1.800000e-77
1   3tiy        PF00069   9606                 P24941  1.800000e-77
2   3lfn        PF00069   9606                 P24941  1.800000e-77
3   6q4k        PF00069   9606                 P24941  1.800000e-77
4   4bcq        PF00069   9606                 P24941  1.800000e-77

   percentage_identity result_sequence
0                100.0            None
1                100.0            None
2                100.0            None
3                100.0            None
4                100.0            None

Number of entries in the DataFrame

[24]:
len(df2)
[24]:
446

Max value of percentage identity

[25]:
df2['percentage_identity'].max()
[25]:
100.0

Min value of percentage identity

[26]:
df2['percentage_identity'].min()
[26]:
54.2

How many unique Pfam domains or UniProts did we get back?

We can group the results by Pfam accession using "groupby" and then count the results

[37]:
df.groupby('pfam_accession').count()
[37]:
                chain_id  ec_number  entity_id  entry_entity  molecule_name  \
pfam_accession
PF00069              731        729        731           731            731

                pdb_id  tax_id  uniprot_accession_best  e_value  \
pfam_accession
PF00069            731     731                     731      731

                percentage_identity  result_sequence
pfam_accession
PF00069                         731                0

The same for UniProt accession. This time we will sort the values by the number of PDB entries ("pdb_id"s) they appear in.

[38]:
group_by_uniprot = df.groupby('uniprot_accession_best').count().sort_values('pdb_id', ascending=False)
group_by_uniprot
[38]:
chain_id ec_number entity_id entry_entity molecule_name pdb_id pfam_accession tax_id e_value percentage_identity result_sequence
uniprot_accession_best
P24941 416 416 416 416 416 416 416 416 416 416 0
P28482 109 109 109 109 109 109 109 109 109 109 0
P63086 58 58 58 58 58 58 58 58 58 58 0
P47811 43 43 43 43 43 43 43 43 43 43 0
Q00534 18 18 18 18 18 18 18 18 18 18 0
Q13164 11 11 11 11 11 11 11 11 11 11 0
P06493 10 10 10 10 10 10 10 10 10 10 0
P11802-2 8 8 8 8 8 8 8 8 8 8 0
O15264 7 7 7 7 7 7 7 7 7 7 0
Q16539 6 6 6 6 6 6 6 6 6 6 0
Q00535 6 6 6 6 6 6 6 6 6 6 0
P11802 5 5 5 5 5 5 5 5 5 5 0
Q07785 4 4 4 4 4 4 4 4 4 4 0
P17157 4 4 4 4 4 4 4 4 4 4 0
P27361 3 3 3 3 3 3 3 3 3 3 0
Q16539-4 2 2 2 2 2 2 2 2 2 2 0
Q39026 2 2 2 2 2 2 2 2 2 2 0
Q5CRJ8 2 1 2 2 2 2 2 2 2 2 0
Q00536 2 2 2 2 2 2 2 2 2 2 0
A8BZ95 2 1 2 2 2 2 2 2 2 2 0
P53778 2 2 2 2 2 2 2 2 2 2 0
P50613 2 2 2 2 2 2 2 2 2 2 0
Q92772 2 2 2 2 2 2 2 2 2 2 0
Q00532 1 1 1 1 1 1 1 1 1 1 0
P63086,Q16539 1 1 1 1 1 1 1 1 1 1 0
A9UJZ9 1 1 1 1 1 1 1 1 1 1 0
P47811-4 1 1 1 1 1 1 1 1 1 1 0
O76039 1 1 1 1 1 1 1 1 1 1 0
G4N374 1 1 1 1 1 1 1 1 1 1 0
Q8IVW4 1 1 1 1 1 1 1 1 1 1 0

In this case the most common UniProt accession is P24941. How many UniProt accessions were there?

[39]:
len(group_by_uniprot)
[39]:
30

How many are enzymes? We can use "ec_number" to see how many have E.C. numbers. Because "count" only counts non-null values, a UniProt accession with no E.C. number at all would show 0 in the "ec_number" column, so we can filter on that.

[40]:
uniprot_with_ec = group_by_uniprot.query('ec_number != 0')
[41]:
len(uniprot_with_ec)
[41]:
30
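
Since this equals the total of 30 accessions, every UniProt accession in our results has at least one entry with an E.C. number. As a hypothetical follow-up (not in the original notebook), the query can be inverted to list any accessions that lack an E.C. number:

# Hypothetical follow-up: UniProt accessions with no E.C. number at all
# (their "ec_number" count is 0 in the grouped DataFrame).
group_by_uniprot.query('ec_number == 0')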