{ "cells": [ { "source": [
"# PDB sequence search\n",
"\n",
"This interactive Python notebook will guide you through various ways of programmatically accessing Protein Data Bank in Europe (PDBe) data using the REST API.\n",
"\n",
"The REST API is a programmatic way to obtain information from the PDB and EMDB. You can access details about:\n",
"\n",
"* sample\n",
"* experiment\n",
"* models\n",
"* compounds\n",
"* cross-references\n",
"* publications\n",
"* quality\n",
"* assemblies\n",
"\n",
"and more...\n",
"\n",
"For more information, visit https://www.ebi.ac.uk/pdbe/pdbe-rest-api\n",
"\n",
"This notebook is part of the training material series and focuses on searching for sequences with the PDBe search API. Retrieve this material and much more from [GitHub](https://github.com/PDBeurope/pdbe-api-training)\n",
"\n",
"## 1) Making imports and setting variables\n",
"First, we import some packages that we will use, and set some variables.\n",
"\n",
"Note: The full list of valid URLs is available from https://www.ebi.ac.uk/pdbe/api/doc/"
], "cell_type": "markdown", "metadata": { "collapsed": false } },
{ "cell_type": "code", "execution_count": 1, "outputs": [], "source": [
"from pprint import pprint  # used for pretty printing\n",
"import requests  # used to get data from a URL\n",
"import pandas as pd  # used to analyse the results\n",
"search_url = \"https://www.ebi.ac.uk/pdbe/search/pdb/select?\"  # the base URL of PDBe's search API\n"
], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } },
{ "cell_type": "markdown", "source": [
"We will now define some functions which we will use to get the data from PDBe's search API"
], "metadata": { "collapsed": false } },
{ "cell_type": "code", "execution_count": 9, "outputs": [], "source": [
"def make_request_post(search_dict, number_of_rows=10):\n",
"    \"\"\"\n",
"    makes a POST request to the PDBe API\n",
"    :param dict search_dict: the terms used to search\n",
"    :param int number_of_rows: number of rows to return - initially limited to 10\n",
"    :return dict: response JSON\n",
"    \"\"\"\n",
"    # make sure we get the number of rows we need\n",
"    if 'rows' not in search_dict:\n",
"        search_dict['rows'] = number_of_rows\n",
"    # set the return type to JSON\n",
"    search_dict['wt'] = 'json'\n",
"\n",
"    # do the query\n",
"    response = requests.post(search_url, data=search_dict)\n",
"\n",
"    if response.status_code == 200:\n",
"        return response.json()\n",
"    else:\n",
"        print(\"[No data retrieved - %s] %s\" % (response.status_code, response.text))\n",
"\n",
"    return {}\n",
"\n",
"def format_sequence_search_terms(sequence, filter_terms=None):\n",
"    \"\"\"\n",
"    Format parameters for a sequence search\n",
"    :param str sequence: one letter sequence\n",
"    :param lst filter_terms: fields to return for each result (sent as the 'fl' parameter)\n",
"    :return dict: search parameters\n",
"    \"\"\"\n",
"    # first we set the parameters which we will pass to PDBe's search\n",
"    params = {\n",
"        'json.nl': 'map',\n",
"        'start': '0',\n",
"        'sort': 'fasta(e_value) asc',\n",
"        'xjoin_fasta': 'true',\n",
"        'bf': 'fasta(percentIdentity)',\n",
"        'xjoin_fasta.external.expupperlim': '0.1',\n",
"        'xjoin_fasta.external.sequence': sequence,\n",
"        'q': '*:*',\n",
"        'fq': '{!xjoin}xjoin_fasta'\n",
"    }\n",
"    # we make sure that we add required filter terms if they aren't present\n",
"    if filter_terms:\n",
"        # build a new set so that the caller's list is not modified\n",
"        all_terms = set(filter_terms) | {'pdb_id', 'entity_id', 'entry_entity', 'chain_id'}\n",
"        params['fl'] = ','.join(sorted(all_terms))\n",
"\n",
"    # returns the parameter dictionary\n",
"    return params\n",
"\n",
"\n",
"def run_sequence_search(sequence, filter_terms=None, number_of_rows=10):\n",
"    \"\"\"\n",
"    Runs a sequence search and returns the results\n",
"    :param str sequence: sequence in one letter code\n",
"    :param lst filter_terms: fields to return for each result\n",
"    :param int number_of_rows: number of results to return\n",
"    :return lst: List of results\n",
"    \"\"\"\n",
"    search_dict = format_sequence_search_terms(sequence=sequence, filter_terms=filter_terms)\n",
"    response = make_request_post(search_dict=search_dict, number_of_rows=number_of_rows)\n",
"    results = response.get('response', {}).get('docs', [])\n",
"    print('Number of results {}'.format(len(results)))\n",
"\n",
"    # we now have to go through the FASTA results and join them with the main results\n",
"    # (the defaults keep this from failing when the request returned no data)\n",
"    raw_fasta_results = response.get('xjoin_fasta', {}).get('external', [])\n",
"    fasta_results = {}  # results from FASTA will be stored here - keyed by PDB ID and chain ID\n",
"\n",
"    # go through each FASTA result and get the E value, percentage identity and sequence from the result\n",
"    for fasta_row in raw_fasta_results:\n",
"        fasta_doc = fasta_row.get('doc', {})\n",
"        percent_identity = fasta_doc.get('percent_identity')\n",
"        e_value = fasta_doc.get('e_value')\n",
"        return_sequence = fasta_row.get('return_sequence_string')\n",
"        pdb_id_chain = fasta_doc.get('pdb_id_chain').split('_')\n",
"        pdb_id = pdb_id_chain[0].lower()\n",
"        chain_id = pdb_id_chain[-1]\n",
"        join_id = '{}_{}'.format(pdb_id, chain_id)\n",
"        fasta_results[join_id] = {'e_value': e_value,\n",
"                                  'percentage_identity': percent_identity,\n",
"                                  'return_sequence': return_sequence}\n",
"\n",
"    # now we go through the main results and add the FASTA results\n",
"    ret = []  # final results will be stored here\n",
"    for row in results:\n",
"        pdb_id = row.get('pdb_id').lower()\n",
"        chain_ids = row.get('chain_id')\n",
"        for chain_id in chain_ids:\n",
"            search_id = '{}_{}'.format(pdb_id, chain_id)\n",
"            entry_fasta_results = fasta_results.get(search_id, {})\n",
"            # we will only keep results that match the search ID\n",
"            if entry_fasta_results:\n",
"                row['e_value'] = entry_fasta_results.get('e_value')\n",
"                row['percentage_identity'] = entry_fasta_results.get('percentage_identity')\n",
"                # use the same key that was stored in fasta_results above\n",
"                row['result_sequence'] = entry_fasta_results.get('return_sequence')\n",
"\n",
"                ret.append(row)\n",
"    return ret"
], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } },
{ "cell_type": "markdown", "source": [
"We will run a search using an example sequence from UniProt P24941 -\n",
"Cyclin-dependent kinase 2"
], "metadata": { "collapsed": false } },
{ "cell_type": "code", "execution_count": 10, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of results 10\n" ] } ], "source": [
"sequence_to_search = \"\"\"\n",
"MENFQKVEKIGEGTYGVVYKARNKLTGEVVALKKIRLDTETEGVPSTAIREISLLKELNH\n",
"PNIVKLLDVIHTENKLYLVFEFLHQDLKKFMDASALTGIPLPLIKSYLFQLLQGLAFCHS\n",
"HRVLHRDLKPQNLLINTEGAIKLADFGLARAFGVPVRTYTHEVVTLWYRAPEILLGCKYY\n",
"STAVDIWSLGCIFAEMVTRRALFPGDSEIDQLFRIFRTLGTPDEVVWPGVTSMPDYKPSF\n",
"PKWARQDFSKVVPPLDEDGRSLLSQMLHYDPNKRISAKAALAHPFFQDVTKPVPHLRL\"\"\"\n",
"\n",
"filter_list = ['pfam_accession', 'pdb_id', 'molecule_name', 'ec_number',\n",
"               'uniprot_accession_best', 'tax_id']\n",
"\n",
"first_results = run_sequence_search(sequence_to_search, filter_terms=filter_list)"
], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } },
{ "cell_type": "markdown", "source": [
"Print the first result to see what we have"
], "metadata": { "collapsed": false } },
{ "cell_type": "code", "execution_count": 11, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [
"{'chain_id': ['A', 'C'],\n",
" 'e_value': 2.9e-76,\n",
" 'ec_number': ['2.7.11.22'],\n",
" 'entity_id': 1,\n",
" 'entry_entity': '1jst_1',\n",
" 'molecule_name': ['Cyclin-dependent kinase 2'],\n",
" 'pdb_id': '1jst',\n",
" 'percentage_identity': 100.0,\n",
" 'pfam_accession': ['PF00069'],\n",
" 'result_sequence': None,\n",
" 'tax_id': [9606],\n",
" 'uniprot_accession_best': ['P24941']}\n"
] } ], "source": [
"pprint(first_results[0])"
], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } },
{ "cell_type": "markdown", "source": [
"Notice that some of the values are lists"
], "metadata": { "collapsed": false, "pycharm": { "name": "#%% md\n" } } },
{ "cell_type": "markdown", "source": [
"Before we do any further analysis, we should get a few more results so we can see some patterns.\n",
"We are going to increase the number of results to 1000."
], "metadata": { "collapsed": false } },
{ "cell_type": "code", "execution_count": 34, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of results 1000\n" ] } ], "source": [
"first_results = run_sequence_search(sequence_to_search,\n",
"                                    filter_terms=filter_list,\n",
"                                    number_of_rows=1000\n",
"                                    )\n"
], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } },
{ "cell_type": "markdown", "source": [
"Load the results into a pandas DataFrame so we can query them.\n",
"\n",
"Before we do this we have to do a bit of housekeeping. We are going to change the lists (results with [] around them)\n",
"into comma-separated strings."
], "metadata": { "collapsed": false } },
{ "cell_type": "code", "execution_count": 28, "outputs": [], "source": [
"def change_lists_to_strings(results):\n",
"    \"\"\"\n",
"    updates lists to strings for loading into Pandas\n",
"    :param list results: list of result dictionaries to process\n",
"    :return list: the same list, with list values joined into strings\n",
"    \"\"\"\n",
"    for row in results:\n",
"        for data in row:\n",
"            if isinstance(row[data], list):\n",
"                # if there are any numbers in the list change them into strings\n",
"                row[data] = [str(a) for a in row[data]]\n",
"                # unique and sort the list and then change the list into a string\n",
"                row[data] = ','.join(sorted(set(row[data])))\n",
"\n",
"    return results\n",
"\n",
"\n",
"def pandas_dataset(list_of_results):\n",
"    results = change_lists_to_strings(list_of_results)  # we have added our function to change lists to strings\n",
"    df = pd.DataFrame(results)\n",
"\n",
"    return df"
], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } },
{ "cell_type": "code", "execution_count": 35, "outputs": [], "source": [
"df = pandas_dataset(first_results)"
], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } },
{ "cell_type": "markdown", "source": [
"Let's see what we have - you'll see it looks a bit like a spreadsheet or a database"
], "metadata": { "collapsed": false } },
{ "cell_type": "code", "execution_count": 36, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " chain_id ec_number entity_id entry_entity molecule_name \\\n", "0 A,C 2.7.11.22 1 1jst_1 Cyclin-dependent kinase 2 \n", "1 A 2.7.11.22 1 3tiy_1 Cyclin-dependent kinase 2 \n", "2 A 2.7.11.22 1 3lfn_1 Cyclin-dependent kinase 2 \n", "3 A 2.7.11.22 1 6q4k_1 Cyclin-dependent kinase 2 \n", "4 A,C 2.7.11.22 1 4bcq_1 Cyclin-dependent kinase 2 \n", "\n", " pdb_id pfam_accession tax_id uniprot_accession_best e_value \\\n", "0 1jst PF00069 9606 P24941 1.900000e-76 \n", "1 3tiy PF00069 9606 P24941 1.900000e-76 \n", "2 3lfn PF00069 9606 P24941 1.900000e-76 \n", "3 6q4k PF00069 9606 P24941 1.900000e-76 \n", "4 4bcq PF00069 9606 P24941 1.900000e-76 \n", "\n", " percentage_identity result_sequence \n", "0 100.0 None \n", "1 100.0 None \n", "2 100.0 None \n", "3 100.0 None \n", "4 100.0 None \n" ] } ], "source": [
"print(df.head())\n"
], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } },
{ "cell_type": "markdown", "source": [
"We can save the results to a CSV file which we can load into Excel"
], "metadata": { "collapsed": false } },
{ "cell_type": "code", "execution_count": 17, "outputs": [], "source": [
"df.to_csv(\"search_results.csv\")"
], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } },
{ "cell_type": "markdown", "source": [
"We did not filter the results by E-value or percentage identity,\n",
"so we should check the range of values they cover.\n",
"\n",
"We can select a column and find its minimum value with .min() or its maximum value with .max()"
], "metadata": { "collapsed": false } },
{ "cell_type": "code", "execution_count": 18, "outputs": [ { "data": { "text/plain": "100.0" }, "execution_count": 
18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['percentage_identity'].max()" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } },
{ "cell_type": "code", "execution_count": 19, "outputs": [ { "data": { "text/plain": "36.1" }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['percentage_identity'].min()" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } },
{ "cell_type": "markdown", "source": [ "The same applies to the E-value; here we want both the minimum and the maximum.\n" ], "metadata": { "collapsed": false } },
{ "cell_type": "code", "execution_count": 20, "outputs": [ { "data": { "text/plain": "1.8e-77" }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['e_value'].min()" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } },
{ "cell_type": "code", "execution_count": 21, "outputs": [ { "data": { "text/plain": "2.4e-20" }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['e_value'].max()" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } },
{ "cell_type": "markdown", "source": [ "We can see that percentage identity drops to as low as 36%.\n", "Let's say we want to keep only results with more than 50% identity." ], "metadata": { "collapsed": false } },
{ "cell_type": "code", "execution_count": 31, "outputs": [], "source": [ "df2 = df.query('percentage_identity > 50')" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } },
{ "cell_type": "markdown", "source": [ "We stored the results in a new DataFrame called \"df2\"" ], "metadata": { "collapsed": false } },
{ "cell_type": "code", "execution_count": 32, "outputs": [ { "data": { "text/plain": " chain_id ec_number entity_id entry_entity molecule_name \\\n0 A,C 2.7.11.22 1 1jst_1 Cyclin-dependent kinase 2 \n1 A 2.7.11.22 1 3tiy_1 Cyclin-dependent kinase 2 \n2 A 2.7.11.22 1 3lfn_1 Cyclin-dependent kinase 2 \n3 A 2.7.11.22 1 6q4k_1 
Cyclin-dependent kinase 2 \n4 A,C 2.7.11.22 1 4bcq_1 Cyclin-dependent kinase 2 \n\n pdb_id pfam_accession tax_id uniprot_accession_best e_value \\\n0 1jst PF00069 9606 P24941 1.800000e-77 \n1 3tiy PF00069 9606 P24941 1.800000e-77 \n2 3lfn PF00069 9606 P24941 1.800000e-77 \n3 6q4k PF00069 9606 P24941 1.800000e-77 \n4 4bcq PF00069 9606 P24941 1.800000e-77 \n\n percentage_identity result_sequence \n0 100.0 None \n1 100.0 None \n2 100.0 None \n3 100.0 None \n4 100.0 None ", "text/html": "
<div>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\"><th></th><th>chain_id</th><th>ec_number</th><th>entity_id</th><th>entry_entity</th><th>molecule_name</th><th>pdb_id</th><th>pfam_accession</th><th>tax_id</th><th>uniprot_accession_best</th><th>e_value</th><th>percentage_identity</th><th>result_sequence</th></tr>\n  </thead>\n  <tbody>\n    <tr><th>0</th><td>A,C</td><td>2.7.11.22</td><td>1</td><td>1jst_1</td><td>Cyclin-dependent kinase 2</td><td>1jst</td><td>PF00069</td><td>9606</td><td>P24941</td><td>1.800000e-77</td><td>100.0</td><td>None</td></tr>\n    <tr><th>1</th><td>A</td><td>2.7.11.22</td><td>1</td><td>3tiy_1</td><td>Cyclin-dependent kinase 2</td><td>3tiy</td><td>PF00069</td><td>9606</td><td>P24941</td><td>1.800000e-77</td><td>100.0</td><td>None</td></tr>\n    <tr><th>2</th><td>A</td><td>2.7.11.22</td><td>1</td><td>3lfn_1</td><td>Cyclin-dependent kinase 2</td><td>3lfn</td><td>PF00069</td><td>9606</td><td>P24941</td><td>1.800000e-77</td><td>100.0</td><td>None</td></tr>\n    <tr><th>3</th><td>A</td><td>2.7.11.22</td><td>1</td><td>6q4k_1</td><td>Cyclin-dependent kinase 2</td><td>6q4k</td><td>PF00069</td><td>9606</td><td>P24941</td><td>1.800000e-77</td><td>100.0</td><td>None</td></tr>\n    <tr><th>4</th><td>A,C</td><td>2.7.11.22</td><td>1</td><td>4bcq_1</td><td>Cyclin-dependent kinase 2</td><td>4bcq</td><td>PF00069</td><td>9606</td><td>P24941</td><td>1.800000e-77</td><td>100.0</td><td>None</td></tr>\n  </tbody>\n</table>\n</div>" }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df2.head()" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } },
{ "cell_type": "markdown", "source": [ "Number of entries in the DataFrame" ], "metadata": { "collapsed": false } },
{ "cell_type": "code", "execution_count": 24, "outputs": [ { "data": { "text/plain": "446" }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(df2)" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } },
{ "cell_type": "markdown", "source": [ "Max value of percentage identity" ], "metadata": { "collapsed": false } },
{ "cell_type": "code", "execution_count": 25, "outputs": [ { "data": { "text/plain": "100.0" }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df2['percentage_identity'].max()" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } },
{ "cell_type": "markdown", "source": [ "Min value of percentage identity" ], "metadata": { "collapsed": false } },
{ "cell_type": "code", "execution_count": 26, "outputs": [ { "data": { "text/plain": "54.2" }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df2['percentage_identity'].min()" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } },
{ "cell_type": "markdown", "source": [ "How many unique Pfam domains or UniProt accessions did we get back?\n", "\n", "We can group the results by Pfam accession using \"groupby\" and then count the results in each group" ], "metadata": { "collapsed": false } },
{ "cell_type": "code", "execution_count": 37, "outputs": [ { "data": { "text/plain": " chain_id ec_number entity_id entry_entity molecule_name \\\npfam_accession \nPF00069 731 729 731 731 731 \n\n pdb_id tax_id uniprot_accession_best e_value \\\npfam_accession \nPF00069 731 731 731 731 \n\n percentage_identity result_sequence \npfam_accession \nPF00069 731 0 ", "text/html": "
<div>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\"><th></th><th>chain_id</th><th>ec_number</th><th>entity_id</th><th>entry_entity</th><th>molecule_name</th><th>pdb_id</th><th>tax_id</th><th>uniprot_accession_best</th><th>e_value</th><th>percentage_identity</th><th>result_sequence</th></tr>\n    <tr><th>pfam_accession</th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th></tr>\n  </thead>\n  <tbody>\n    <tr><th>PF00069</th><td>731</td><td>729</td><td>731</td><td>731</td><td>731</td><td>731</td><td>731</td><td>731</td><td>731</td><td>731</td><td>0</td></tr>\n  </tbody>\n</table>\n</div>" }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.groupby('pfam_accession').count()" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } },
{ "cell_type": "markdown", "source": [ "We can do the same for the UniProt accession.\n", "This time we will sort the values by the number of PDB entries (\"pdb_id\" values) they appear in." ], "metadata": { "collapsed": false } },
{ "cell_type": "code", "execution_count": 38, "outputs": [ { "data": { "text/plain": " chain_id ec_number entity_id entry_entity \\\nuniprot_accession_best \nP24941 416 416 416 416 \nP28482 109 109 109 109 \nP63086 58 58 58 58 \nP47811 43 43 43 43 \nQ00534 18 18 18 18 \nQ13164 11 11 11 11 \nP06493 10 10 10 10 \nP11802-2 8 8 8 8 \nO15264 7 7 7 7 \nQ16539 6 6 6 6 \nQ00535 6 6 6 6 \nP11802 5 5 5 5 \nQ07785 4 4 4 4 \nP17157 4 4 4 4 \nP27361 3 3 3 3 \nQ16539-4 2 2 2 2 \nQ39026 2 2 2 2 \nQ5CRJ8 2 1 2 2 \nQ00536 2 2 2 2 \nA8BZ95 2 1 2 2 \nP53778 2 2 2 2 \nP50613 2 2 2 2 \nQ92772 2 2 2 2 \nQ00532 1 1 1 1 \nP63086,Q16539 1 1 1 1 \nA9UJZ9 1 1 1 1 \nP47811-4 1 1 1 1 \nO76039 1 1 1 1 \nG4N374 1 1 1 1 \nQ8IVW4 1 1 1 1 \n\n molecule_name pdb_id pfam_accession tax_id \\\nuniprot_accession_best \nP24941 416 416 416 416 \nP28482 109 109 109 109 \nP63086 58 58 58 58 \nP47811 43 43 43 43 \nQ00534 18 18 18 18 \nQ13164 11 11 11 11 \nP06493 10 10 10 10 \nP11802-2 8 8 8 8 \nO15264 7 7 7 7 \nQ16539 6 6 6 6 \nQ00535 6 6 6 6 \nP11802 5 5 5 5 \nQ07785 4 4 4 4 \nP17157 4 4 4 4 \nP27361 3 3 3 3 \nQ16539-4 2 2 2 2 \nQ39026 2 2 2 2 \nQ5CRJ8 2 2 2 2 \nQ00536 2 2 2 2 \nA8BZ95 2 2 2 2 \nP53778 2 2 2 2 \nP50613 2 2 2 2 \nQ92772 2 2 2 2 \nQ00532 1 1 1 1 \nP63086,Q16539 1 1 1 1 \nA9UJZ9 1 1 1 1 \nP47811-4 1 1 1 1 \nO76039 1 1 1 1 \nG4N374 1 1 1 1 \nQ8IVW4 1 1 1 1 \n\n e_value percentage_identity result_sequence \nuniprot_accession_best \nP24941 416 416 0 \nP28482 109 109 0 \nP63086 58 58 0 \nP47811 43 43 0 \nQ00534 18 18 0 \nQ13164 11 11 0 \nP06493 10 10 0 \nP11802-2 8 8 0 \nO15264 7 7 0 \nQ16539 
6 6 0 \nQ00535 6 6 0 \nP11802 5 5 0 \nQ07785 4 4 0 \nP17157 4 4 0 \nP27361 3 3 0 \nQ16539-4 2 2 0 \nQ39026 2 2 0 \nQ5CRJ8 2 2 0 \nQ00536 2 2 0 \nA8BZ95 2 2 0 \nP53778 2 2 0 \nP50613 2 2 0 \nQ92772 2 2 0 \nQ00532 1 1 0 \nP63086,Q16539 1 1 0 \nA9UJZ9 1 1 0 \nP47811-4 1 1 0 \nO76039 1 1 0 \nG4N374 1 1 0 \nQ8IVW4 1 1 0 ", "text/html": "
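<div>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\"><th></th><th>chain_id</th><th>ec_number</th><th>entity_id</th><th>entry_entity</th><th>molecule_name</th><th>pdb_id</th><th>pfam_accession</th><th>tax_id</th><th>e_value</th><th>percentage_identity</th><th>result_sequence</th></tr>\n    <tr><th>uniprot_accession_best</th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th></tr>\n  </thead>\n  <tbody>\n    <tr><th>P24941</th><td>416</td><td>416</td><td>416</td><td>416</td><td>416</td><td>416</td><td>416</td><td>416</td><td>416</td><td>416</td><td>0</td></tr>\n    <tr><th>P28482</th><td>109</td><td>109</td><td>109</td><td>109</td><td>109</td><td>109</td><td>109</td><td>109</td><td>109</td><td>109</td><td>0</td></tr>\n    <tr><th>P63086</th><td>58</td><td>58</td><td>58</td><td>58</td><td>58</td><td>58</td><td>58</td><td>58</td><td>58</td><td>58</td><td>0</td></tr>\n    <tr><th>P47811</th><td>43</td><td>43</td><td>43</td><td>43</td><td>43</td><td>43</td><td>43</td><td>43</td><td>43</td><td>43</td><td>0</td></tr>\n    <tr><th>Q00534</th><td>18</td><td>18</td><td>18</td><td>18</td><td>18</td><td>18</td><td>18</td><td>18</td><td>18</td><td>18</td><td>0</td></tr>\n    <tr><th>Q13164</th><td>11</td><td>11</td><td>11</td><td>11</td><td>11</td><td>11</td><td>11</td><td>11</td><td>11</td><td>11</td><td>0</td></tr>\n    <tr><th>P06493</th><td>10</td><td>10</td><td>10</td><td>10</td><td>10</td><td>10</td><td>10</td><td>10</td><td>10</td><td>10</td><td>0</td></tr>\n    <tr><th>P11802-2</th><td>8</td><td>8</td><td>8</td><td>8</td><td>8</td><td>8</td><td>8</td><td>8</td><td>8</td><td>8</td><td>0</td></tr>\n    <tr><th>O15264</th><td>7</td><td>7</td><td>7</td><td>7</td><td>7</td><td>7</td><td>7</td><td>7</td><td>7</td><td>7</td><td>0</td></tr>\n    <tr><th>Q16539</th><td>6</td><td>6</td><td>6</td><td>6</td><td>6</td><td>6</td><td>6</td><td>6</td><td>6</td><td>6</td><td>0</td></tr>\n    <tr><th>Q00535</th><td>6</td><td>6</td><td>6</td><td>6</td><td>6</td><td>6</td><td>6</td><td>6</td><td>6</td><td>6</td><td>0</td></tr>\n    <tr><th>P11802</th><td>5</td><td>5</td><td>5</td><td>5</td><td>5</td><td>5</td><td>5</td><td>5</td><td>5</td><td>5</td><td>0</td></tr>\n    <tr><th>Q07785</th><td>4</td><td>4</td><td>4</td><td>4</td><td>4</td><td>4</td><td>4</td><td>4</td><td>4</td><td>4</td><td>0</td></tr>\n    <tr><th>P17157</th><td>4</td><td>4</td><td>4</td><td>4</td><td>4</td><td>4</td><td>4</td><td>4</td><td>4</td><td>4</td><td>0</td></tr>\n    <tr><th>P27361</th><td>3</td><td>3</td><td>3</td><td>3</td><td>3</td><td>3</td><td>3</td><td>3</td><td>3</td><td>3</td><td>0</td></tr>\n    <tr><th>Q16539-4</th><td>2</td><td>2</td><td>2</td><td>2</td><td>2</td><td>2</td><td>2</td><td>2</td><td>2</td><td>2</td><td>0</td></tr>\n    <tr><th>Q39026</th><td>2</td><td>2</td><td>2</td><td>2</td><td>2</td><td>2</td><td>2</td><td>2</td><td>2</td><td>2</td><td>0</td></tr>\n    <tr><th>Q5CRJ8</th><td>2</td><td>1</td><td>2</td><td>2</td><td>2</td><td>2</td><td>2</td><td>2</td><td>2</td><td>2</td><td>0</td></tr>\n    <tr><th>Q00536</th><td>2</td><td>2</td><td>2</td><td>2</td><td>2</td><td>2</td><td>2</td><td>2</td><td>2</td><td>2</td><td>0</td></tr>\n    <tr><th>A8BZ95</th><td>2</td><td>1</td><td>2</td><td>2</td><td>2</td><td>2</td><td>2</td><td>2</td><td>2</td><td>2</td><td>0</td></tr>\n    <tr><th>P53778</th><td>2</td><td>2</td><td>2</td><td>2</td><td>2</td><td>2</td><td>2</td><td>2</td><td>2</td><td>2</td><td>0</td></tr>\n    <tr><th>P50613</th><td>2</td><td>2</td><td>2</td><td>2</td><td>2</td><td>2</td><td>2</td><td>2</td><td>2</td><td>2</td><td>0</td></tr>\n    <tr><th>Q92772</th><td>2</td><td>2</td><td>2</td><td>2</td><td>2</td><td>2</td><td>2</td><td>2</td><td>2</td><td>2</td><td>0</td></tr>\n    <tr><th>Q00532</th><td>1</td><td>1</td><td>1</td><td>1</td><td>1</td><td>1</td><td>1</td><td>1</td><td>1</td><td>1</td><td>0</td></tr>\n    <tr><th>P63086,Q16539</th><td>1</td><td>1</td><td>1</td><td>1</td><td>1</td><td>1</td><td>1</td><td>1</td><td>1</td><td>1</td><td>0</td></tr>\n    <tr><th>A9UJZ9</th><td>1</td><td>1</td><td>1</td><td>1</td><td>1</td><td>1</td><td>1</td><td>1</td><td>1</td><td>1</td><td>0</td></tr>\n    <tr><th>P47811-4</th><td>1</td><td>1</td><td>1</td><td>1</td><td>1</td><td>1</td><td>1</td><td>1</td><td>1</td><td>1</td><td>0</td></tr>\n    <tr><th>O76039</th><td>1</td><td>1</td><td>1</td><td>1</td><td>1</td><td>1</td><td>1</td><td>1</td><td>1</td><td>1</td><td>0</td></tr>\n    <tr><th>G4N374</th><td>1</td><td>1</td><td>1</td><td>1</td><td>1</td><td>1</td><td>1</td><td>1</td><td>1</td><td>1</td><td>0</td></tr>\n    <tr><th>Q8IVW4</th><td>1</td><td>1</td><td>1</td><td>1</td><td>1</td><td>1</td><td>1</td><td>1</td><td>1</td><td>1</td><td>0</td></tr>\n  </tbody>\n</table>\n</div>" }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "group_by_uniprot = df.groupby('uniprot_accession_best').count().sort_values('pdb_id', ascending=False)\n", "group_by_uniprot" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } },
{ "cell_type": "markdown", "source": [ "In this case the most common UniProt accession is P24941.\n", "How many UniProt accessions were there?" ], "metadata": { "collapsed": false } },
{ "cell_type": "code", "execution_count": 39, "outputs": [ { "data": { "text/plain": "30" }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(group_by_uniprot)" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } },
{ "cell_type": "markdown", "source": [ "How many are enzymes? We can use \"ec_number\" to work out how many have E.C. numbers.\n", "After count(), the \"ec_number\" column holds the number of results with an E.C. number for each accession, so we keep the accessions where it is not zero." ], "metadata": { "collapsed": false } },
{ "cell_type": "code", "execution_count": 40, "outputs": [], "source": [ "uniprot_with_ec = group_by_uniprot.query('ec_number != 0')" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } },
{ "cell_type": "code", "execution_count": 41, "outputs": [ { "data": { "text/plain": "30" }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(uniprot_with_ec)" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } } ],
"metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "3.7.8-final" } }, "nbformat": 4, "nbformat_minor": 0 }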