PDB sequence search¶
This interactive Python notebook will guide you through various ways of programmatically accessing Protein Data Bank in Europe (PDBe) data using REST API
The REST API is a programmatic way to obtain information from the PDB and EMDB. You can access details about:
sample
experiment
models
compounds
cross-references
publications
quality
assemblies and more… For more information, visit https://www.ebi.ac.uk/pdbe/pdbe-rest-api
This notebook is a part of the training material series, and focuses on searching for sequences in the PDBe search API. Retrieve this material and many more from GitHub
1) Making imports and setting variables¶
First, we import some packages that we will use, and set some variables.
Note: Full list of valid URLs is available from https://www.ebi.ac.uk/pdbe/api/doc/
[1]:
from pprint import pprint # used for pretty printing
import requests # used to get data from the a URL
import pandas as pd # used to analyse the results
search_url = "https://www.ebi.ac.uk/pdbe/search/pdb/select?" # the rest of the URL used for PDBe's search API.
We will now define some functions which will use to get the data from PDBe’s search API
[9]:
def make_request_post(search_dict, number_of_rows=10):
"""
makes a post request to the PDBe API
:param dict search_dict: the terms used to search
:param number_of_rows: number or rows to return - initially limited to 10
:return dict: response JSON
"""
# make sure we get the number of rows we need
if 'rows' not in search_dict:
search_dict['rows'] = number_of_rows
# set the return type to JSON
search_dict['wt'] = 'json'
# do the query
response = requests.post(search_url, data=search_dict)
if response.status_code == 200:
return response.json()
else:
print("[No data retrieved - %s] %s" % (response.status_code, response.text))
return {}
def format_sequence_search_terms(sequence, filter_terms=None):
"""
Format parameters for a sequence search
:param str sequence: one letter sequence
:param lst filter_terms: Terms to filter the results by
:return str: search string
"""
# first we set the parameters which we will pass to PDBe's search
params = {
'json.nl': 'map',
'start': '0',
'sort': 'fasta(e_value) asc',
'xjoin_fasta': 'true',
'bf': 'fasta(percentIdentity)',
'xjoin_fasta.external.expupperlim': '0.1',
'xjoin_fasta.external.sequence': sequence,
'q': '*:*',
'fq': '{!xjoin}xjoin_fasta'
}
# we make sure that we add required filter terms if they aren't present
if filter_terms:
for term in ['pdb_id', 'entity_id', 'entry_entity', 'chain_id']:
filter_terms.append(term)
filter_terms = list(set(filter_terms))
params['fl'] = ','.join(filter_terms)
# returns the parameter dictionary
return params
def run_sequence_search(sequence, filter_terms=None, number_of_rows=10):
"""
Runs a sequence search and results the results
:param str sequence: sequence in one letter code
:param lst filter_terms: terms to filter the results by
:param int number_of_rows: number of results to return
:return lst: List of results
"""
search_dict = format_sequence_search_terms(sequence=sequence, filter_terms=filter_terms)
response = make_request_post(search_dict=search_dict, number_of_rows=number_of_rows)
results = response.get('response', {}).get('docs', [])
print('Number of results {}'.format(len(results)))
# we now have to go through the FASTA results and join them with the main results
raw_fasta_results = response.get('xjoin_fasta').get('external')
fasta_results = {} # results from FASTA will be stored here - key'd by PDB ID and Chain ID
# go through each FASTA result and get the E value, percentage identity and sequence from the result
for fasta_row in raw_fasta_results:
# join_id = fasta_row.get('joinId')
fasta_doc = fasta_row.get('doc', {})
percent_identity = fasta_doc.get('percent_identity')
e_value = fasta_doc.get('e_value')
return_sequence = fasta_row.get('return_sequence_string')
pdb_id_chain = fasta_doc.get('pdb_id_chain').split('_')
pdb_id = pdb_id_chain[0].lower()
chain_id = pdb_id_chain[-1]
join_id = '{}_{}'.format(pdb_id, chain_id)
fasta_results[join_id] = {'e_value': e_value,
'percentage_identity': percent_identity,
'return_sequence': return_sequence}
# now we go through the main results and add the FASTA results
ret = [] # final results will be stored here.
for row in results:
pdb_id = row.get('pdb_id').lower()
chain_ids = row.get('chain_id')
for chain_id in chain_ids:
search_id = '{}_{}'.format(pdb_id, chain_id)
entry_fasta_results = fasta_results.get(search_id, {})
# we will only keep results that match the search ID
if entry_fasta_results:
row['e_value'] = entry_fasta_results.get('e_value')
row['percentage_identity'] = entry_fasta_results.get('percentage_identity')
row['result_sequence'] = entry_fasta_results.get('return_sequence_string')
ret.append(row)
return ret
We will search for a sequence with an example sequence from UniProt P24941 - Cyclin-dependent kinase 2
[10]:
sequence_to_search = """
MENFQKVEKIGEGTYGVVYKARNKLTGEVVALKKIRLDTETEGVPSTAIREISLLKELNH
PNIVKLLDVIHTENKLYLVFEFLHQDLKKFMDASALTGIPLPLIKSYLFQLLQGLAFCHS
HRVLHRDLKPQNLLINTEGAIKLADFGLARAFGVPVRTYTHEVVTLWYRAPEILLGCKYY
STAVDIWSLGCIFAEMVTRRALFPGDSEIDQLFRIFRTLGTPDEVVWPGVTSMPDYKPSF
PKWARQDFSKVVPPLDEDGRSLLSQMLHYDPNKRISAKAALAHPFFQDVTKPVPHLRL"""
filter_list = ['pfam_accession', 'pdb_id', 'molecule_name', 'ec_number',
'uniprot_accession_best', 'tax_id']
first_results = run_sequence_search(sequence_to_search, filter_terms=filter_list)
Number of results 10
Print the first result to see what we have
[11]:
pprint(first_results[0])
{'chain_id': ['A', 'C'],
'e_value': 2.9e-76,
'ec_number': ['2.7.11.22'],
'entity_id': 1,
'entry_entity': '1jst_1',
'molecule_name': ['Cyclin-dependent kinase 2'],
'pdb_id': '1jst',
'percentage_identity': 100.0,
'pfam_accession': ['PF00069'],
'result_sequence': None,
'tax_id': [9606],
'uniprot_accession_best': ['P24941']}
Notice that some of the results are lists
Before we do any further analysis we should get a few more results so we can see some patterns. We are going to increase the number of results to 1000
[34]:
first_results = run_sequence_search(sequence_to_search,
filter_terms=filter_list,
number_of_rows=1000
)
Number of results 1000
Load the results into a Pandas Dataframe so we can query them
Before we do this we have to do a bit of house keeping. We are going to change the lists (results with [] around them) into comma separated values
[28]:
def change_lists_to_strings(results):
"""
updates lists to strings for loading into Pandas
:param dict results: dictionary of results to process
:return dict: dictionary of results
"""
for row in results:
for data in row:
if type(row[data]) == list:
# if there are any numbers in the list change them into strings
row[data] = [str(a) for a in row[data]]
# unique and sort the list and then change the list into a string
row[data] = ','.join(sorted(list(set(row[data]))))
return results
def pandas_dataset(list_of_results):
results = change_lists_to_strings(list_of_results) # we have added our function to change lists to strings
df = pd.DataFrame(results)
return df
[35]:
df = pandas_dataset(first_results)
Lets see what we have - you’ll see it looks a bit like a spreadsheet or a database
[36]:
print(df.head())
chain_id ec_number entity_id entry_entity molecule_name \
0 A,C 2.7.11.22 1 1jst_1 Cyclin-dependent kinase 2
1 A 2.7.11.22 1 3tiy_1 Cyclin-dependent kinase 2
2 A 2.7.11.22 1 3lfn_1 Cyclin-dependent kinase 2
3 A 2.7.11.22 1 6q4k_1 Cyclin-dependent kinase 2
4 A,C 2.7.11.22 1 4bcq_1 Cyclin-dependent kinase 2
pdb_id pfam_accession tax_id uniprot_accession_best e_value \
0 1jst PF00069 9606 P24941 1.900000e-76
1 3tiy PF00069 9606 P24941 1.900000e-76
2 3lfn PF00069 9606 P24941 1.900000e-76
3 6q4k PF00069 9606 P24941 1.900000e-76
4 4bcq PF00069 9606 P24941 1.900000e-76
percentage_identity result_sequence
0 100.0 None
1 100.0 None
2 100.0 None
3 100.0 None
4 100.0 None
We can save the results to a CSV file which we can load into excel
[17]:
df.to_csv("search_results.csv")
There isn’t a cut off of eValue or percentage identity in our search so we should look what the values go to
we can select the column and find the minimum value with .min() or maximum value with .max()
[18]:
df['percentage_identity'].max()
[18]:
100.0
[19]:
df['percentage_identity'].min()
[19]:
36.1
same for e value - here we want the min and max
[20]:
df['e_value'].min()
[20]:
1.8e-77
[21]:
df['e_value'].max()
[21]:
2.4e-20
We can see that percentage identity drops to as low as 36% Lets say we want to restrict it to 50%
[31]:
df2 = df.query('percentage_identity > 50')
We stored the results in a new Dataframe called “df2”
[32]:
df2.head()
[32]:
chain_id | ec_number | entity_id | entry_entity | molecule_name | pdb_id | pfam_accession | tax_id | uniprot_accession_best | e_value | percentage_identity | result_sequence | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | A,C | 2.7.11.22 | 1 | 1jst_1 | Cyclin-dependent kinase 2 | 1jst | PF00069 | 9606 | P24941 | 1.800000e-77 | 100.0 | None |
1 | A | 2.7.11.22 | 1 | 3tiy_1 | Cyclin-dependent kinase 2 | 3tiy | PF00069 | 9606 | P24941 | 1.800000e-77 | 100.0 | None |
2 | A | 2.7.11.22 | 1 | 3lfn_1 | Cyclin-dependent kinase 2 | 3lfn | PF00069 | 9606 | P24941 | 1.800000e-77 | 100.0 | None |
3 | A | 2.7.11.22 | 1 | 6q4k_1 | Cyclin-dependent kinase 2 | 6q4k | PF00069 | 9606 | P24941 | 1.800000e-77 | 100.0 | None |
4 | A,C | 2.7.11.22 | 1 | 4bcq_1 | Cyclin-dependent kinase 2 | 4bcq | PF00069 | 9606 | P24941 | 1.800000e-77 | 100.0 | None |
Number of entries in the Dataframe
[24]:
len(df2)
[24]:
446
Max value of percentage identity
[25]:
df2['percentage_identity'].max()
[25]:
100.0
Min value of percentage identity
[26]:
df2['percentage_identity'].min()
[26]:
54.2
How many unique Pfam domains or UniProts did we get back?
We can group the results by Pfam using “groupby” and then counting the results
[37]:
df.groupby('pfam_accession').count()
[37]:
chain_id | ec_number | entity_id | entry_entity | molecule_name | pdb_id | tax_id | uniprot_accession_best | e_value | percentage_identity | result_sequence | |
---|---|---|---|---|---|---|---|---|---|---|---|
pfam_accession | |||||||||||
PF00069 | 731 | 729 | 731 | 731 | 731 | 731 | 731 | 731 | 731 | 731 | 0 |
same for uniprot accession This time we will sort the values by the number of PDB entries (“pdb_id”’s) they appear in.
[38]:
group_by_uniprot = df.groupby('uniprot_accession_best').count().sort_values('pdb_id', ascending=False)
group_by_uniprot
[38]:
chain_id | ec_number | entity_id | entry_entity | molecule_name | pdb_id | pfam_accession | tax_id | e_value | percentage_identity | result_sequence | |
---|---|---|---|---|---|---|---|---|---|---|---|
uniprot_accession_best | |||||||||||
P24941 | 416 | 416 | 416 | 416 | 416 | 416 | 416 | 416 | 416 | 416 | 0 |
P28482 | 109 | 109 | 109 | 109 | 109 | 109 | 109 | 109 | 109 | 109 | 0 |
P63086 | 58 | 58 | 58 | 58 | 58 | 58 | 58 | 58 | 58 | 58 | 0 |
P47811 | 43 | 43 | 43 | 43 | 43 | 43 | 43 | 43 | 43 | 43 | 0 |
Q00534 | 18 | 18 | 18 | 18 | 18 | 18 | 18 | 18 | 18 | 18 | 0 |
Q13164 | 11 | 11 | 11 | 11 | 11 | 11 | 11 | 11 | 11 | 11 | 0 |
P06493 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 0 |
P11802-2 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 0 |
O15264 | 7 | 7 | 7 | 7 | 7 | 7 | 7 | 7 | 7 | 7 | 0 |
Q16539 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 0 |
Q00535 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 0 |
P11802 | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 0 |
Q07785 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 0 |
P17157 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 0 |
P27361 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 0 |
Q16539-4 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 0 |
Q39026 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 0 |
Q5CRJ8 | 2 | 1 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 0 |
Q00536 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 0 |
A8BZ95 | 2 | 1 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 0 |
P53778 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 0 |
P50613 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 0 |
Q92772 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 0 |
Q00532 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 |
P63086,Q16539 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 |
A9UJZ9 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 |
P47811-4 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 |
O76039 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 |
G4N374 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 |
Q8IVW4 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 |
In this case the most common UniProt accession is P24941. How many UniProt accessions were there?
[39]:
len(group_by_uniprot)
[39]:
30
How many are enzymes? We can use “ec_number” to work see how many have E.C. numbers
[40]:
uniprot_with_ec = group_by_uniprot.query('ec_number != 0')
[41]:
len(uniprot_with_ec)
[41]:
30