PDB sequence search¶

This interactive Python notebook will guide you through various ways of programmatically accessing Protein Data Bank in Europe (PDBe) data using REST API

The REST API is a programmatic way to obtain information from the PDB and EMDB. You can access details about:

sample
experiment
models
compounds
cross-references
publications
quality
assemblies and more… For more information, visit https://www.ebi.ac.uk/pdbe/pdbe-rest-api

This notebook is a part of the training material series, and focuses on searching for sequences in the PDBe search API. Retrieve this material and many more from GitHub

1) Making imports and setting variables¶

First, we import some packages that we will use, and set some variables.

Note: Full list of valid URLs is available from https://www.ebi.ac.uk/pdbe/api/doc/

[1]:

from pprint import pprint # used for pretty printing
import requests # used to get data from the a URL
import pandas as pd # used to analyse the results
search_url = "https://www.ebi.ac.uk/pdbe/search/pdb/select?" # the rest of the URL used for PDBe's search API.

We will now define some functions which will use to get the data from PDBe’s search API

[9]:

def make_request_post(search_dict, number_of_rows=10):
    """
    makes a post request to the PDBe API
    :param dict search_dict: the terms used to search
    :param number_of_rows: number or rows to return - initially limited to 10
    :return dict: response JSON
    """
    # make sure we get the number of rows we need
    if 'rows' not in search_dict:
        search_dict['rows'] = number_of_rows
    # set the return type to JSON
    search_dict['wt'] = 'json'

    # do the query
    response = requests.post(search_url, data=search_dict)

    if response.status_code == 200:
        return response.json()
    else:
        print("[No data retrieved - %s] %s" % (response.status_code, response.text))

    return {}

def format_sequence_search_terms(sequence, filter_terms=None):
    """
    Format parameters for a sequence search
    :param str sequence: one letter sequence
    :param lst filter_terms: Terms to filter the results by
    :return str: search string
    """
    # first we set the parameters which we will pass to PDBe's search
    params = {
        'json.nl': 'map',
        'start': '0',
        'sort': 'fasta(e_value) asc',
        'xjoin_fasta': 'true',
        'bf': 'fasta(percentIdentity)',
        'xjoin_fasta.external.expupperlim': '0.1',
        'xjoin_fasta.external.sequence': sequence,
        'q': '*:*',
        'fq': '{!xjoin}xjoin_fasta'
    }
    # we make sure that we add required filter terms if they aren't present
    if filter_terms:
        for term in ['pdb_id', 'entity_id', 'entry_entity', 'chain_id']:
            filter_terms.append(term)
        filter_terms = list(set(filter_terms))
        params['fl'] = ','.join(filter_terms)

    # returns the parameter dictionary
    return params


def run_sequence_search(sequence, filter_terms=None, number_of_rows=10):
    """
    Runs a sequence search and results the results
    :param str sequence: sequence in one letter code
    :param lst filter_terms: terms to filter the results by
    :param int number_of_rows: number of results to return
    :return lst: List of results
    """
    search_dict = format_sequence_search_terms(sequence=sequence, filter_terms=filter_terms)
    response = make_request_post(search_dict=search_dict, number_of_rows=number_of_rows)
    results = response.get('response', {}).get('docs', [])
    print('Number of results {}'.format(len(results)))

    # we now have to go through the FASTA results and join them with the main results

    raw_fasta_results = response.get('xjoin_fasta').get('external')
    fasta_results = {} # results from FASTA will be stored here - key'd by PDB ID and Chain ID

    # go through each FASTA result and get the E value, percentage identity and sequence from the result

    for fasta_row in raw_fasta_results:
        # join_id = fasta_row.get('joinId')
        fasta_doc = fasta_row.get('doc', {})
        percent_identity = fasta_doc.get('percent_identity')
        e_value = fasta_doc.get('e_value')
        return_sequence = fasta_row.get('return_sequence_string')
        pdb_id_chain = fasta_doc.get('pdb_id_chain').split('_')
        pdb_id = pdb_id_chain[0].lower()
        chain_id = pdb_id_chain[-1]
        join_id = '{}_{}'.format(pdb_id, chain_id)
        fasta_results[join_id] = {'e_value': e_value,
                                  'percentage_identity': percent_identity,
                                  'return_sequence': return_sequence}

    # now we go through the main results and add the FASTA results
    ret = [] # final results will be stored here.
    for row in results:
        pdb_id = row.get('pdb_id').lower()
        chain_ids = row.get('chain_id')
        for chain_id in chain_ids:
            search_id = '{}_{}'.format(pdb_id, chain_id)
            entry_fasta_results = fasta_results.get(search_id, {})
            # we will only keep results that match the search ID
            if entry_fasta_results:
                row['e_value'] = entry_fasta_results.get('e_value')
                row['percentage_identity'] = entry_fasta_results.get('percentage_identity')
                row['result_sequence'] = entry_fasta_results.get('return_sequence_string')

                ret.append(row)
    return ret

We will search for a sequence with an example sequence from UniProt P24941 - Cyclin-dependent kinase 2

[10]:

sequence_to_search = """
MENFQKVEKIGEGTYGVVYKARNKLTGEVVALKKIRLDTETEGVPSTAIREISLLKELNH
PNIVKLLDVIHTENKLYLVFEFLHQDLKKFMDASALTGIPLPLIKSYLFQLLQGLAFCHS
HRVLHRDLKPQNLLINTEGAIKLADFGLARAFGVPVRTYTHEVVTLWYRAPEILLGCKYY
STAVDIWSLGCIFAEMVTRRALFPGDSEIDQLFRIFRTLGTPDEVVWPGVTSMPDYKPSF
PKWARQDFSKVVPPLDEDGRSLLSQMLHYDPNKRISAKAALAHPFFQDVTKPVPHLRL"""

filter_list = ['pfam_accession', 'pdb_id', 'molecule_name', 'ec_number',
               'uniprot_accession_best', 'tax_id']

first_results = run_sequence_search(sequence_to_search, filter_terms=filter_list)

Number of results 10

Print the first result to see what we have

[11]:

pprint(first_results[0])

{'chain_id': ['A', 'C'],
 'e_value': 2.9e-76,
 'ec_number': ['2.7.11.22'],
 'entity_id': 1,
 'entry_entity': '1jst_1',
 'molecule_name': ['Cyclin-dependent kinase 2'],
 'pdb_id': '1jst',
 'percentage_identity': 100.0,
 'pfam_accession': ['PF00069'],
 'result_sequence': None,
 'tax_id': [9606],
 'uniprot_accession_best': ['P24941']}

Notice that some of the results are lists

Before we do any further analysis we should get a few more results so we can see some patterns. We are going to increase the number of results to 1000

[34]:

first_results = run_sequence_search(sequence_to_search,
                                    filter_terms=filter_list,
                                    number_of_rows=1000
                                    )

Number of results 1000

Load the results into a Pandas Dataframe so we can query them

Before we do this we have to do a bit of house keeping. We are going to change the lists (results with [] around them) into comma separated values

[28]:

def change_lists_to_strings(results):
    """
    updates lists to strings for loading into Pandas
    :param dict results: dictionary of results to process
    :return dict: dictionary of results
    """
    for row in results:
        for data in row:
            if type(row[data]) == list:
                # if there are any numbers in the list change them into strings
                row[data] = [str(a) for a in row[data]]
                # unique and sort the list and then change the list into a string
                row[data] = ','.join(sorted(list(set(row[data]))))

    return results


def pandas_dataset(list_of_results):
    results = change_lists_to_strings(list_of_results)  # we have added our function to change lists to strings
    df = pd.DataFrame(results)

    return df

[35]:

df = pandas_dataset(first_results)

Lets see what we have - you’ll see it looks a bit like a spreadsheet or a database

[36]:

print(df.head())

  chain_id  ec_number  entity_id entry_entity              molecule_name  \
0      A,C  2.7.11.22          1       1jst_1  Cyclin-dependent kinase 2
1        A  2.7.11.22          1       3tiy_1  Cyclin-dependent kinase 2
2        A  2.7.11.22          1       3lfn_1  Cyclin-dependent kinase 2
3        A  2.7.11.22          1       6q4k_1  Cyclin-dependent kinase 2
4      A,C  2.7.11.22          1       4bcq_1  Cyclin-dependent kinase 2

  pdb_id pfam_accession tax_id uniprot_accession_best       e_value  \
0   1jst        PF00069   9606                 P24941  1.900000e-76
1   3tiy        PF00069   9606                 P24941  1.900000e-76
2   3lfn        PF00069   9606                 P24941  1.900000e-76
3   6q4k        PF00069   9606                 P24941  1.900000e-76
4   4bcq        PF00069   9606                 P24941  1.900000e-76

   percentage_identity result_sequence
0                100.0            None
1                100.0            None
2                100.0            None
3                100.0            None
4                100.0            None

We can save the results to a CSV file which we can load into excel

[17]:

df.to_csv("search_results.csv")

There isn’t a cut off of eValue or percentage identity in our search so we should look what the values go to

we can select the column and find the minimum value with .min() or maximum value with .max()

[18]:

df['percentage_identity'].max()

[18]:

100.0

[19]:

df['percentage_identity'].min()

[19]:

36.1

same for e value - here we want the min and max

[20]:

df['e_value'].min()

[20]:

1.8e-77

[21]:

df['e_value'].max()

[21]:

2.4e-20

We can see that percentage identity drops to as low as 36% Lets say we want to restrict it to 50%

[31]:

df2 = df.query('percentage_identity > 50')

We stored the results in a new Dataframe called “df2”

[32]:

df2.head()

[32]:

	chain_id	ec_number	entity_id	entry_entity	molecule_name	pdb_id	pfam_accession	tax_id	uniprot_accession_best	e_value	percentage_identity	result_sequence
0	A,C	2.7.11.22	1	1jst_1	Cyclin-dependent kinase 2	1jst	PF00069	9606	P24941	1.800000e-77	100.0	None
1	A	2.7.11.22	1	3tiy_1	Cyclin-dependent kinase 2	3tiy	PF00069	9606	P24941	1.800000e-77	100.0	None
2	A	2.7.11.22	1	3lfn_1	Cyclin-dependent kinase 2	3lfn	PF00069	9606	P24941	1.800000e-77	100.0	None
3	A	2.7.11.22	1	6q4k_1	Cyclin-dependent kinase 2	6q4k	PF00069	9606	P24941	1.800000e-77	100.0	None
4	A,C	2.7.11.22	1	4bcq_1	Cyclin-dependent kinase 2	4bcq	PF00069	9606	P24941	1.800000e-77	100.0	None

Number of entries in the Dataframe

[24]:

len(df2)

[24]:

Max value of percentage identity

[25]:

df2['percentage_identity'].max()

[25]:

100.0

Min value of percentage identity

[26]:

df2['percentage_identity'].min()

[26]:

54.2

How many unique Pfam domains or UniProts did we get back?

We can group the results by Pfam using “groupby” and then counting the results

[37]:

df.groupby('pfam_accession').count()

[37]:

	chain_id	ec_number	entity_id	entry_entity	molecule_name	pdb_id	tax_id	uniprot_accession_best	e_value	percentage_identity	result_sequence
pfam_accession
PF00069	731	729	731	731	731	731	731	731	731	731	0

same for uniprot accession This time we will sort the values by the number of PDB entries (“pdb_id”’s) they appear in.

[38]:

group_by_uniprot = df.groupby('uniprot_accession_best').count().sort_values('pdb_id', ascending=False)
group_by_uniprot

[38]:

	chain_id	ec_number	entity_id	entry_entity	molecule_name	pdb_id	pfam_accession	tax_id	e_value	percentage_identity	result_sequence
uniprot_accession_best
P24941	416	416	416	416	416	416	416	416	416	416	0
P28482	109	109	109	109	109	109	109	109	109	109	0
P63086	58	58	58	58	58	58	58	58	58	58	0
P47811	43	43	43	43	43	43	43	43	43	43	0
Q00534	18	18	18	18	18	18	18	18	18	18	0
Q13164	11	11	11	11	11	11	11	11	11	11	0
P06493	10	10	10	10	10	10	10	10	10	10	0
P11802-2	8	8	8	8	8	8	8	8	8	8	0
O15264	7	7	7	7	7	7	7	7	7	7	0
Q16539	6	6	6	6	6	6	6	6	6	6	0
Q00535	6	6	6	6	6	6	6	6	6	6	0
P11802	5	5	5	5	5	5	5	5	5	5	0
Q07785	4	4	4	4	4	4	4	4	4	4	0
P17157	4	4	4	4	4	4	4	4	4	4	0
P27361	3	3	3	3	3	3	3	3	3	3	0
Q16539-4	2	2	2	2	2	2	2	2	2	2	0
Q39026	2	2	2	2	2	2	2	2	2	2	0
Q5CRJ8	2	1	2	2	2	2	2	2	2	2	0
Q00536	2	2	2	2	2	2	2	2	2	2	0
A8BZ95	2	1	2	2	2	2	2	2	2	2	0
P53778	2	2	2	2	2	2	2	2	2	2	0
P50613	2	2	2	2	2	2	2	2	2	2	0
Q92772	2	2	2	2	2	2	2	2	2	2	0
Q00532	1	1	1	1	1	1	1	1	1	1	0
P63086,Q16539	1	1	1	1	1	1	1	1	1	1	0
A9UJZ9	1	1	1	1	1	1	1	1	1	1	0
P47811-4	1	1	1	1	1	1	1	1	1	1	0
O76039	1	1	1	1	1	1	1	1	1	1	0
G4N374	1	1	1	1	1	1	1	1	1	1	0
Q8IVW4	1	1	1	1	1	1	1	1	1	1	0

In this case the most common UniProt accession is P24941. How many UniProt accessions were there?

[39]:

len(group_by_uniprot)

[39]:

How many are enzymes? We can use “ec_number” to work see how many have E.C. numbers

[40]:

uniprot_with_ec = group_by_uniprot.query('ec_number != 0')

[41]:

len(uniprot_with_ec)

[41]: