{
 "cells": [
  {
   "source": [
    "# PDB sequence search\n",
    "\n",
    "This interactive Python notebook will guide you through various ways of programmatically accessing Protein Data Bank in Europe (PDBe) data using REST API\n",
    "\n",
    "The REST API is a programmatic way to obtain information from the PDB and EMDB. You can access details about:\n",
    "\n",
    "* sample\n",
    "* experiment\n",
    "* models\n",
    "* compounds\n",
    "* cross-references\n",
    "* publications\n",
    "* quality\n",
    "* assemblies\n",
    "and more...\n",
    "For more information, visit https://www.ebi.ac.uk/pdbe/pdbe-rest-api\n",
    "\n",
    "This notebook is a part of the training material series, and focuses on searching for sequences in the PDBe search API. Retrieve this material and many more from [GitHub](https://github.com/PDBeurope/pdbe-api-training)\n",
    "\n",
    "## 1) Making imports and setting variables\n",
    "First, we import some packages that we will use, and set some variables.\n",
    "\n",
    "Note: Full list of valid URLs is available from https://www.ebi.ac.uk/pdbe/api/doc/"
   ],
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "outputs": [],
   "source": [
    "from pprint import pprint # used for pretty printing\n",
    "import requests # used to get data from the a URL\n",
    "import pandas as pd # used to analyse the results\n",
    "search_url = \"https://www.ebi.ac.uk/pdbe/search/pdb/select?\" # the rest of the URL used for PDBe's search API.\n"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "We will now define some functions which will use to get the data from PDBe's search API"
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "outputs": [],
   "source": [
    "def make_request_post(search_dict, number_of_rows=10):\n",
    "    \"\"\"\n",
    "    makes a post request to the PDBe API\n",
    "    :param dict search_dict: the terms used to search\n",
    "    :param number_of_rows: number or rows to return - initially limited to 10\n",
    "    :return dict: response JSON\n",
    "    \"\"\"\n",
    "    # make sure we get the number of rows we need\n",
    "    if 'rows' not in search_dict:\n",
    "        search_dict['rows'] = number_of_rows\n",
    "    # set the return type to JSON\n",
    "    search_dict['wt'] = 'json'\n",
    "\n",
    "    # do the query\n",
    "    response = requests.post(search_url, data=search_dict)\n",
    "\n",
    "    if response.status_code == 200:\n",
    "        return response.json()\n",
    "    else:\n",
    "        print(\"[No data retrieved - %s] %s\" % (response.status_code, response.text))\n",
    "\n",
    "    return {}\n",
    "\n",
    "def format_sequence_search_terms(sequence, filter_terms=None):\n",
    "    \"\"\"\n",
    "    Format parameters for a sequence search\n",
    "    :param str sequence: one letter sequence\n",
    "    :param lst filter_terms: Terms to filter the results by\n",
    "    :return str: search string\n",
    "    \"\"\"\n",
    "    # first we set the parameters which we will pass to PDBe's search\n",
    "    params = {\n",
    "        'json.nl': 'map',\n",
    "        'start': '0',\n",
    "        'sort': 'fasta(e_value) asc',\n",
    "        'xjoin_fasta': 'true',\n",
    "        'bf': 'fasta(percentIdentity)',\n",
    "        'xjoin_fasta.external.expupperlim': '0.1',\n",
    "        'xjoin_fasta.external.sequence': sequence,\n",
    "        'q': '*:*',\n",
    "        'fq': '{!xjoin}xjoin_fasta'\n",
    "    }\n",
    "    # we make sure that we add required filter terms if they aren't present\n",
    "    if filter_terms:\n",
    "        for term in ['pdb_id', 'entity_id', 'entry_entity', 'chain_id']:\n",
    "            filter_terms.append(term)\n",
    "        filter_terms = list(set(filter_terms))\n",
    "        params['fl'] = ','.join(filter_terms)\n",
    "\n",
    "    # returns the parameter dictionary\n",
    "    return params\n",
    "\n",
    "\n",
    "def run_sequence_search(sequence, filter_terms=None, number_of_rows=10):\n",
    "    \"\"\"\n",
    "    Runs a sequence search and results the results\n",
    "    :param str sequence: sequence in one letter code\n",
    "    :param lst filter_terms: terms to filter the results by\n",
    "    :param int number_of_rows: number of results to return\n",
    "    :return lst: List of results\n",
    "    \"\"\"\n",
    "    search_dict = format_sequence_search_terms(sequence=sequence, filter_terms=filter_terms)\n",
    "    response = make_request_post(search_dict=search_dict, number_of_rows=number_of_rows)\n",
    "    results = response.get('response', {}).get('docs', [])\n",
    "    print('Number of results {}'.format(len(results)))\n",
    "\n",
    "    # we now have to go through the FASTA results and join them with the main results\n",
    "\n",
    "    raw_fasta_results = response.get('xjoin_fasta').get('external')\n",
    "    fasta_results = {} # results from FASTA will be stored here - key'd by PDB ID and Chain ID\n",
    "\n",
    "    # go through each FASTA result and get the E value, percentage identity and sequence from the result\n",
    "\n",
    "    for fasta_row in raw_fasta_results:\n",
    "        # join_id = fasta_row.get('joinId')\n",
    "        fasta_doc = fasta_row.get('doc', {})\n",
    "        percent_identity = fasta_doc.get('percent_identity')\n",
    "        e_value = fasta_doc.get('e_value')\n",
    "        return_sequence = fasta_row.get('return_sequence_string')\n",
    "        pdb_id_chain = fasta_doc.get('pdb_id_chain').split('_')\n",
    "        pdb_id = pdb_id_chain[0].lower()\n",
    "        chain_id = pdb_id_chain[-1]\n",
    "        join_id = '{}_{}'.format(pdb_id, chain_id)\n",
    "        fasta_results[join_id] = {'e_value': e_value,\n",
    "                                  'percentage_identity': percent_identity,\n",
    "                                  'return_sequence': return_sequence}\n",
    "\n",
    "    # now we go through the main results and add the FASTA results\n",
    "    ret = [] # final results will be stored here.\n",
    "    for row in results:\n",
    "        pdb_id = row.get('pdb_id').lower()\n",
    "        chain_ids = row.get('chain_id')\n",
    "        for chain_id in chain_ids:\n",
    "            search_id = '{}_{}'.format(pdb_id, chain_id)\n",
    "            entry_fasta_results = fasta_results.get(search_id, {})\n",
    "            # we will only keep results that match the search ID\n",
    "            if entry_fasta_results:\n",
    "                row['e_value'] = entry_fasta_results.get('e_value')\n",
    "                row['percentage_identity'] = entry_fasta_results.get('percentage_identity')\n",
    "                row['result_sequence'] = entry_fasta_results.get('return_sequence_string')\n",
    "\n",
    "                ret.append(row)\n",
    "    return ret"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "We will search for a sequence with an example sequence from UniProt P24941 -\n",
    "Cyclin-dependent kinase 2"
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Number of results 10\n"
     ]
    }
   ],
   "source": [
    "sequence_to_search = \"\"\"\n",
    "MENFQKVEKIGEGTYGVVYKARNKLTGEVVALKKIRLDTETEGVPSTAIREISLLKELNH\n",
    "PNIVKLLDVIHTENKLYLVFEFLHQDLKKFMDASALTGIPLPLIKSYLFQLLQGLAFCHS\n",
    "HRVLHRDLKPQNLLINTEGAIKLADFGLARAFGVPVRTYTHEVVTLWYRAPEILLGCKYY\n",
    "STAVDIWSLGCIFAEMVTRRALFPGDSEIDQLFRIFRTLGTPDEVVWPGVTSMPDYKPSF\n",
    "PKWARQDFSKVVPPLDEDGRSLLSQMLHYDPNKRISAKAALAHPFFQDVTKPVPHLRL\"\"\"\n",
    "\n",
    "filter_list = ['pfam_accession', 'pdb_id', 'molecule_name', 'ec_number',\n",
    "               'uniprot_accession_best', 'tax_id']\n",
    "\n",
    "first_results = run_sequence_search(sequence_to_search, filter_terms=filter_list)"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "Print the first result to see what we have"
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'chain_id': ['A', 'C'],\n",
      " 'e_value': 2.9e-76,\n",
      " 'ec_number': ['2.7.11.22'],\n",
      " 'entity_id': 1,\n",
      " 'entry_entity': '1jst_1',\n",
      " 'molecule_name': ['Cyclin-dependent kinase 2'],\n",
      " 'pdb_id': '1jst',\n",
      " 'percentage_identity': 100.0,\n",
      " 'pfam_accession': ['PF00069'],\n",
      " 'result_sequence': None,\n",
      " 'tax_id': [9606],\n",
      " 'uniprot_accession_best': ['P24941']}\n"
     ]
    }
   ],
   "source": [
    "pprint(first_results[0])"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "Notice that some of the results are lists"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "Before we do any further analysis we should get a few more results so we can see some patterns.\n",
    "We are going to increase the number of results to 1000"
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Number of results 1000\n"
     ]
    }
   ],
   "source": [
    "first_results = run_sequence_search(sequence_to_search,\n",
    "                                    filter_terms=filter_list,\n",
    "                                    number_of_rows=1000\n",
    "                                    )\n"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "Load the results into a Pandas Dataframe so we can query them\n",
    "\n",
    "Before we do this we have to do a bit of house keeping. We are going to change the lists (results with [] around them)\n",
    "into comma separated values"
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "outputs": [],
   "source": [
    "def change_lists_to_strings(results):\n",
    "    \"\"\"\n",
    "    updates lists to strings for loading into Pandas\n",
    "    :param dict results: dictionary of results to process\n",
    "    :return dict: dictionary of results\n",
    "    \"\"\"\n",
    "    for row in results:\n",
    "        for data in row:\n",
    "            if type(row[data]) == list:\n",
    "                # if there are any numbers in the list change them into strings\n",
    "                row[data] = [str(a) for a in row[data]]\n",
    "                # unique and sort the list and then change the list into a string\n",
    "                row[data] = ','.join(sorted(list(set(row[data]))))\n",
    "\n",
    "    return results\n",
    "\n",
    "\n",
    "def pandas_dataset(list_of_results):\n",
    "    results = change_lists_to_strings(list_of_results)  # we have added our function to change lists to strings\n",
    "    df = pd.DataFrame(results)\n",
    "\n",
    "    return df"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "outputs": [],
   "source": [
    "df = pandas_dataset(first_results)"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "Lets see what we have - you'll see it looks a bit like a spreadsheet or a database"
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  chain_id  ec_number  entity_id entry_entity              molecule_name  \\\n",
      "0      A,C  2.7.11.22          1       1jst_1  Cyclin-dependent kinase 2   \n",
      "1        A  2.7.11.22          1       3tiy_1  Cyclin-dependent kinase 2   \n",
      "2        A  2.7.11.22          1       3lfn_1  Cyclin-dependent kinase 2   \n",
      "3        A  2.7.11.22          1       6q4k_1  Cyclin-dependent kinase 2   \n",
      "4      A,C  2.7.11.22          1       4bcq_1  Cyclin-dependent kinase 2   \n",
      "\n",
      "  pdb_id pfam_accession tax_id uniprot_accession_best       e_value  \\\n",
      "0   1jst        PF00069   9606                 P24941  1.900000e-76   \n",
      "1   3tiy        PF00069   9606                 P24941  1.900000e-76   \n",
      "2   3lfn        PF00069   9606                 P24941  1.900000e-76   \n",
      "3   6q4k        PF00069   9606                 P24941  1.900000e-76   \n",
      "4   4bcq        PF00069   9606                 P24941  1.900000e-76   \n",
      "\n",
      "   percentage_identity result_sequence  \n",
      "0                100.0            None  \n",
      "1                100.0            None  \n",
      "2                100.0            None  \n",
      "3                100.0            None  \n",
      "4                100.0            None  \n"
     ]
    }
   ],
   "source": [
    "print(df.head())\n"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "We can save the results to a CSV file which we can load into excel"
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "outputs": [],
   "source": [
    "df.to_csv(\"search_results.csv\")"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "There isn't a cut off of eValue or percentage identity in our search\n",
    "so we should look what the values go to\n",
    "\n",
    "we can select the column and find the minimum value with .min() or maximum value with .max()"
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "outputs": [
    {
     "data": {
      "text/plain": "100.0"
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df['percentage_identity'].max()"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "outputs": [
    {
     "data": {
      "text/plain": "36.1"
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df['percentage_identity'].min()"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "same for e value - here we want the min and max\n"
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "outputs": [
    {
     "data": {
      "text/plain": "1.8e-77"
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df['e_value'].min()"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "outputs": [
    {
     "data": {
      "text/plain": "2.4e-20"
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df['e_value'].max()"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "We can see that percentage identity drops to as low as 36%\n",
    "Lets say we want to restrict it to 50%"
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "outputs": [],
   "source": [
    "df2 = df.query('percentage_identity > 50')"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "We stored the results in a new Dataframe called \"df2\""
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "outputs": [
    {
     "data": {
      "text/plain": "  chain_id  ec_number  entity_id entry_entity              molecule_name  \\\n0      A,C  2.7.11.22          1       1jst_1  Cyclin-dependent kinase 2   \n1        A  2.7.11.22          1       3tiy_1  Cyclin-dependent kinase 2   \n2        A  2.7.11.22          1       3lfn_1  Cyclin-dependent kinase 2   \n3        A  2.7.11.22          1       6q4k_1  Cyclin-dependent kinase 2   \n4      A,C  2.7.11.22          1       4bcq_1  Cyclin-dependent kinase 2   \n\n  pdb_id pfam_accession tax_id uniprot_accession_best       e_value  \\\n0   1jst        PF00069   9606                 P24941  1.800000e-77   \n1   3tiy        PF00069   9606                 P24941  1.800000e-77   \n2   3lfn        PF00069   9606                 P24941  1.800000e-77   \n3   6q4k        PF00069   9606                 P24941  1.800000e-77   \n4   4bcq        PF00069   9606                 P24941  1.800000e-77   \n\n   percentage_identity result_sequence  \n0                100.0            None  \n1                100.0            None  \n2                100.0            None  \n3                100.0            None  \n4                100.0            None  ",
      "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>chain_id</th>\n      <th>ec_number</th>\n      <th>entity_id</th>\n      <th>entry_entity</th>\n      <th>molecule_name</th>\n      <th>pdb_id</th>\n      <th>pfam_accession</th>\n      <th>tax_id</th>\n      <th>uniprot_accession_best</th>\n      <th>e_value</th>\n      <th>percentage_identity</th>\n      <th>result_sequence</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>0</th>\n      <td>A,C</td>\n      <td>2.7.11.22</td>\n      <td>1</td>\n      <td>1jst_1</td>\n      <td>Cyclin-dependent kinase 2</td>\n      <td>1jst</td>\n      <td>PF00069</td>\n      <td>9606</td>\n      <td>P24941</td>\n      <td>1.800000e-77</td>\n      <td>100.0</td>\n      <td>None</td>\n    </tr>\n    <tr>\n      <th>1</th>\n      <td>A</td>\n      <td>2.7.11.22</td>\n      <td>1</td>\n      <td>3tiy_1</td>\n      <td>Cyclin-dependent kinase 2</td>\n      <td>3tiy</td>\n      <td>PF00069</td>\n      <td>9606</td>\n      <td>P24941</td>\n      <td>1.800000e-77</td>\n      <td>100.0</td>\n      <td>None</td>\n    </tr>\n    <tr>\n      <th>2</th>\n      <td>A</td>\n      <td>2.7.11.22</td>\n      <td>1</td>\n      <td>3lfn_1</td>\n      <td>Cyclin-dependent kinase 2</td>\n      <td>3lfn</td>\n      <td>PF00069</td>\n      <td>9606</td>\n      <td>P24941</td>\n      <td>1.800000e-77</td>\n      <td>100.0</td>\n      <td>None</td>\n    </tr>\n    <tr>\n      <th>3</th>\n      <td>A</td>\n      <td>2.7.11.22</td>\n      <td>1</td>\n      <td>6q4k_1</td>\n      <td>Cyclin-dependent kinase 2</td>\n      <td>6q4k</td>\n      <td>PF00069</td>\n      <td>9606</td>\n      <td>P24941</td>\n      
<td>1.800000e-77</td>\n      <td>100.0</td>\n      <td>None</td>\n    </tr>\n    <tr>\n      <th>4</th>\n      <td>A,C</td>\n      <td>2.7.11.22</td>\n      <td>1</td>\n      <td>4bcq_1</td>\n      <td>Cyclin-dependent kinase 2</td>\n      <td>4bcq</td>\n      <td>PF00069</td>\n      <td>9606</td>\n      <td>P24941</td>\n      <td>1.800000e-77</td>\n      <td>100.0</td>\n      <td>None</td>\n    </tr>\n  </tbody>\n</table>\n</div>"
     },
     "execution_count": 32,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df2.head()"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "Number of entries in the Dataframe"
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "outputs": [
    {
     "data": {
      "text/plain": "446"
     },
     "execution_count": 24,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(df2)"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "Max value of percentage identity"
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "outputs": [
    {
     "data": {
      "text/plain": "100.0"
     },
     "execution_count": 25,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df2['percentage_identity'].max()"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "Min value of percentage identity"
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "outputs": [
    {
     "data": {
      "text/plain": "54.2"
     },
     "execution_count": 26,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df2['percentage_identity'].min()"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "How many unique Pfam domains or UniProts did we get back?\n",
    "\n",
    "We can group the results by Pfam using \"groupby\" and then counting the results"
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "outputs": [
    {
     "data": {
      "text/plain": "                chain_id  ec_number  entity_id  entry_entity  molecule_name  \\\npfam_accession                                                                \nPF00069              731        729        731           731            731   \n\n                pdb_id  tax_id  uniprot_accession_best  e_value  \\\npfam_accession                                                    \nPF00069            731     731                     731      731   \n\n                percentage_identity  result_sequence  \npfam_accession                                        \nPF00069                         731                0  ",
      "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>chain_id</th>\n      <th>ec_number</th>\n      <th>entity_id</th>\n      <th>entry_entity</th>\n      <th>molecule_name</th>\n      <th>pdb_id</th>\n      <th>tax_id</th>\n      <th>uniprot_accession_best</th>\n      <th>e_value</th>\n      <th>percentage_identity</th>\n      <th>result_sequence</th>\n    </tr>\n    <tr>\n      <th>pfam_accession</th>\n      <th></th>\n      <th></th>\n      <th></th>\n      <th></th>\n      <th></th>\n      <th></th>\n      <th></th>\n      <th></th>\n      <th></th>\n      <th></th>\n      <th></th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>PF00069</th>\n      <td>731</td>\n      <td>729</td>\n      <td>731</td>\n      <td>731</td>\n      <td>731</td>\n      <td>731</td>\n      <td>731</td>\n      <td>731</td>\n      <td>731</td>\n      <td>731</td>\n      <td>0</td>\n    </tr>\n  </tbody>\n</table>\n</div>"
     },
     "execution_count": 37,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.groupby('pfam_accession').count()"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "same for uniprot accession\n",
    "This time we will sort the values by the number of PDB entries (\"pdb_id\"'s) they appear in."
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "outputs": [
    {
     "data": {
      "text/plain": "                        chain_id  ec_number  entity_id  entry_entity  \\\nuniprot_accession_best                                                 \nP24941                       416        416        416           416   \nP28482                       109        109        109           109   \nP63086                        58         58         58            58   \nP47811                        43         43         43            43   \nQ00534                        18         18         18            18   \nQ13164                        11         11         11            11   \nP06493                        10         10         10            10   \nP11802-2                       8          8          8             8   \nO15264                         7          7          7             7   \nQ16539                         6          6          6             6   \nQ00535                         6          6          6             6   \nP11802                         5          5          5             5   \nQ07785                         4          4          4             4   \nP17157                         4          4          4             4   \nP27361                         3          3          3             3   \nQ16539-4                       2          2          2             2   \nQ39026                         2          2          2             2   \nQ5CRJ8                         2          1          2             2   \nQ00536                         2          2          2             2   \nA8BZ95                         2          1          2             2   \nP53778                         2          2          2             2   \nP50613                         2          2          2             2   \nQ92772                         2          2          2             2   \nQ00532                         1          1          1             1   \nP63086,Q16539                  1          1          1             1   \nA9UJZ9 
                        1          1          1             1   \nP47811-4                       1          1          1             1   \nO76039                         1          1          1             1   \nG4N374                         1          1          1             1   \nQ8IVW4                         1          1          1             1   \n\n                        molecule_name  pdb_id  pfam_accession  tax_id  \\\nuniprot_accession_best                                                  \nP24941                            416     416             416     416   \nP28482                            109     109             109     109   \nP63086                             58      58              58      58   \nP47811                             43      43              43      43   \nQ00534                             18      18              18      18   \nQ13164                             11      11              11      11   \nP06493                             10      10              10      10   \nP11802-2                            8       8               8       8   \nO15264                              7       7               7       7   \nQ16539                              6       6               6       6   \nQ00535                              6       6               6       6   \nP11802                              5       5               5       5   \nQ07785                              4       4               4       4   \nP17157                              4       4               4       4   \nP27361                              3       3               3       3   \nQ16539-4                            2       2               2       2   \nQ39026                              2       2               2       2   \nQ5CRJ8                              2       2               2       2   \nQ00536                              2       2               2       2   \nA8BZ95                              2       2               2       2   \nP53778     
                         2       2               2       2   \nP50613                              2       2               2       2   \nQ92772                              2       2               2       2   \nQ00532                              1       1               1       1   \nP63086,Q16539                       1       1               1       1   \nA9UJZ9                              1       1               1       1   \nP47811-4                            1       1               1       1   \nO76039                              1       1               1       1   \nG4N374                              1       1               1       1   \nQ8IVW4                              1       1               1       1   \n\n                        e_value  percentage_identity  result_sequence  \nuniprot_accession_best                                                 \nP24941                      416                  416                0  \nP28482                      109                  109                0  \nP63086                       58                   58                0  \nP47811                       43                   43                0  \nQ00534                       18                   18                0  \nQ13164                       11                   11                0  \nP06493                       10                   10                0  \nP11802-2                      8                    8                0  \nO15264                        7                    7                0  \nQ16539                        6                    6                0  \nQ00535                        6                    6                0  \nP11802                        5                    5                0  \nQ07785                        4                    4                0  \nP17157                        4                    4                0  \nP27361                        3                    3                0  \nQ16539-4                    
  2                    2                0  \nQ39026                        2                    2                0  \nQ5CRJ8                        2                    2                0  \nQ00536                        2                    2                0  \nA8BZ95                        2                    2                0  \nP53778                        2                    2                0  \nP50613                        2                    2                0  \nQ92772                        2                    2                0  \nQ00532                        1                    1                0  \nP63086,Q16539                 1                    1                0  \nA9UJZ9                        1                    1                0  \nP47811-4                      1                    1                0  \nO76039                        1                    1                0  \nG4N374                        1                    1                0  \nQ8IVW4                        1                    1                0  ",
      "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>chain_id</th>\n      <th>ec_number</th>\n      <th>entity_id</th>\n      <th>entry_entity</th>\n      <th>molecule_name</th>\n      <th>pdb_id</th>\n      <th>pfam_accession</th>\n      <th>tax_id</th>\n      <th>e_value</th>\n      <th>percentage_identity</th>\n      <th>result_sequence</th>\n    </tr>\n    <tr>\n      <th>uniprot_accession_best</th>\n      <th></th>\n      <th></th>\n      <th></th>\n      <th></th>\n      <th></th>\n      <th></th>\n      <th></th>\n      <th></th>\n      <th></th>\n      <th></th>\n      <th></th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>P24941</th>\n      <td>416</td>\n      <td>416</td>\n      <td>416</td>\n      <td>416</td>\n      <td>416</td>\n      <td>416</td>\n      <td>416</td>\n      <td>416</td>\n      <td>416</td>\n      <td>416</td>\n      <td>0</td>\n    </tr>\n    <tr>\n      <th>P28482</th>\n      <td>109</td>\n      <td>109</td>\n      <td>109</td>\n      <td>109</td>\n      <td>109</td>\n      <td>109</td>\n      <td>109</td>\n      <td>109</td>\n      <td>109</td>\n      <td>109</td>\n      <td>0</td>\n    </tr>\n    <tr>\n      <th>P63086</th>\n      <td>58</td>\n      <td>58</td>\n      <td>58</td>\n      <td>58</td>\n      <td>58</td>\n      <td>58</td>\n      <td>58</td>\n      <td>58</td>\n      <td>58</td>\n      <td>58</td>\n      <td>0</td>\n    </tr>\n    <tr>\n      <th>P47811</th>\n      <td>43</td>\n      <td>43</td>\n      <td>43</td>\n      <td>43</td>\n      <td>43</td>\n      <td>43</td>\n      <td>43</td>\n      <td>43</td>\n      <td>43</td>\n      <td>43</td>\n      <td>0</td>\n    </tr>\n    
<tr>\n      <th>Q00534</th>\n      <td>18</td>\n      <td>18</td>\n      <td>18</td>\n      <td>18</td>\n      <td>18</td>\n      <td>18</td>\n      <td>18</td>\n      <td>18</td>\n      <td>18</td>\n      <td>18</td>\n      <td>0</td>\n    </tr>\n    <tr>\n      <th>Q13164</th>\n      <td>11</td>\n      <td>11</td>\n      <td>11</td>\n      <td>11</td>\n      <td>11</td>\n      <td>11</td>\n      <td>11</td>\n      <td>11</td>\n      <td>11</td>\n      <td>11</td>\n      <td>0</td>\n    </tr>\n    <tr>\n      <th>P06493</th>\n      <td>10</td>\n      <td>10</td>\n      <td>10</td>\n      <td>10</td>\n      <td>10</td>\n      <td>10</td>\n      <td>10</td>\n      <td>10</td>\n      <td>10</td>\n      <td>10</td>\n      <td>0</td>\n    </tr>\n    <tr>\n      <th>P11802-2</th>\n      <td>8</td>\n      <td>8</td>\n      <td>8</td>\n      <td>8</td>\n      <td>8</td>\n      <td>8</td>\n      <td>8</td>\n      <td>8</td>\n      <td>8</td>\n      <td>8</td>\n      <td>0</td>\n    </tr>\n    <tr>\n      <th>O15264</th>\n      <td>7</td>\n      <td>7</td>\n      <td>7</td>\n      <td>7</td>\n      <td>7</td>\n      <td>7</td>\n      <td>7</td>\n      <td>7</td>\n      <td>7</td>\n      <td>7</td>\n      <td>0</td>\n    </tr>\n    <tr>\n      <th>Q16539</th>\n      <td>6</td>\n      <td>6</td>\n      <td>6</td>\n      <td>6</td>\n      <td>6</td>\n      <td>6</td>\n      <td>6</td>\n      <td>6</td>\n      <td>6</td>\n      <td>6</td>\n      <td>0</td>\n    </tr>\n    <tr>\n      <th>Q00535</th>\n      <td>6</td>\n      <td>6</td>\n      <td>6</td>\n      <td>6</td>\n      <td>6</td>\n      <td>6</td>\n      <td>6</td>\n      <td>6</td>\n      <td>6</td>\n      <td>6</td>\n      <td>0</td>\n    </tr>\n    <tr>\n      <th>P11802</th>\n      <td>5</td>\n      <td>5</td>\n      <td>5</td>\n      <td>5</td>\n      <td>5</td>\n      <td>5</td>\n      <td>5</td>\n      <td>5</td>\n      <td>5</td>\n      <td>5</td>\n      <td>0</td>\n    </tr>\n    <tr>\n      <th>Q07785</th>\n   
   <td>4</td>\n      <td>4</td>\n      <td>4</td>\n      <td>4</td>\n      <td>4</td>\n      <td>4</td>\n      <td>4</td>\n      <td>4</td>\n      <td>4</td>\n      <td>4</td>\n      <td>0</td>\n    </tr>\n    <tr>\n      <th>P17157</th>\n      <td>4</td>\n      <td>4</td>\n      <td>4</td>\n      <td>4</td>\n      <td>4</td>\n      <td>4</td>\n      <td>4</td>\n      <td>4</td>\n      <td>4</td>\n      <td>4</td>\n      <td>0</td>\n    </tr>\n    <tr>\n      <th>P27361</th>\n      <td>3</td>\n      <td>3</td>\n      <td>3</td>\n      <td>3</td>\n      <td>3</td>\n      <td>3</td>\n      <td>3</td>\n      <td>3</td>\n      <td>3</td>\n      <td>3</td>\n      <td>0</td>\n    </tr>\n    <tr>\n      <th>Q16539-4</th>\n      <td>2</td>\n      <td>2</td>\n      <td>2</td>\n      <td>2</td>\n      <td>2</td>\n      <td>2</td>\n      <td>2</td>\n      <td>2</td>\n      <td>2</td>\n      <td>2</td>\n      <td>0</td>\n    </tr>\n    <tr>\n      <th>Q39026</th>\n      <td>2</td>\n      <td>2</td>\n      <td>2</td>\n      <td>2</td>\n      <td>2</td>\n      <td>2</td>\n      <td>2</td>\n      <td>2</td>\n      <td>2</td>\n      <td>2</td>\n      <td>0</td>\n    </tr>\n    <tr>\n      <th>Q5CRJ8</th>\n      <td>2</td>\n      <td>1</td>\n      <td>2</td>\n      <td>2</td>\n      <td>2</td>\n      <td>2</td>\n      <td>2</td>\n      <td>2</td>\n      <td>2</td>\n      <td>2</td>\n      <td>0</td>\n    </tr>\n    <tr>\n      <th>Q00536</th>\n      <td>2</td>\n      <td>2</td>\n      <td>2</td>\n      <td>2</td>\n      <td>2</td>\n      <td>2</td>\n      <td>2</td>\n      <td>2</td>\n      <td>2</td>\n      <td>2</td>\n      <td>0</td>\n    </tr>\n    <tr>\n      <th>A8BZ95</th>\n      <td>2</td>\n      <td>1</td>\n      <td>2</td>\n      <td>2</td>\n      <td>2</td>\n      <td>2</td>\n      <td>2</td>\n      <td>2</td>\n      <td>2</td>\n      <td>2</td>\n      <td>0</td>\n    </tr>\n    <tr>\n      <th>P53778</th>\n      <td>2</td>\n      <td>2</td>\n      <td>2</td>\n      
<td>2</td>\n      <td>2</td>\n      <td>2</td>\n      <td>2</td>\n      <td>2</td>\n      <td>2</td>\n      <td>2</td>\n      <td>0</td>\n    </tr>\n    <tr>\n      <th>P50613</th>\n      <td>2</td>\n      <td>2</td>\n      <td>2</td>\n      <td>2</td>\n      <td>2</td>\n      <td>2</td>\n      <td>2</td>\n      <td>2</td>\n      <td>2</td>\n      <td>2</td>\n      <td>0</td>\n    </tr>\n    <tr>\n      <th>Q92772</th>\n      <td>2</td>\n      <td>2</td>\n      <td>2</td>\n      <td>2</td>\n      <td>2</td>\n      <td>2</td>\n      <td>2</td>\n      <td>2</td>\n      <td>2</td>\n      <td>2</td>\n      <td>0</td>\n    </tr>\n    <tr>\n      <th>Q00532</th>\n      <td>1</td>\n      <td>1</td>\n      <td>1</td>\n      <td>1</td>\n      <td>1</td>\n      <td>1</td>\n      <td>1</td>\n      <td>1</td>\n      <td>1</td>\n      <td>1</td>\n      <td>0</td>\n    </tr>\n    <tr>\n      <th>P63086,Q16539</th>\n      <td>1</td>\n      <td>1</td>\n      <td>1</td>\n      <td>1</td>\n      <td>1</td>\n      <td>1</td>\n      <td>1</td>\n      <td>1</td>\n      <td>1</td>\n      <td>1</td>\n      <td>0</td>\n    </tr>\n    <tr>\n      <th>A9UJZ9</th>\n      <td>1</td>\n      <td>1</td>\n      <td>1</td>\n      <td>1</td>\n      <td>1</td>\n      <td>1</td>\n      <td>1</td>\n      <td>1</td>\n      <td>1</td>\n      <td>1</td>\n      <td>0</td>\n    </tr>\n    <tr>\n      <th>P47811-4</th>\n      <td>1</td>\n      <td>1</td>\n      <td>1</td>\n      <td>1</td>\n      <td>1</td>\n      <td>1</td>\n      <td>1</td>\n      <td>1</td>\n      <td>1</td>\n      <td>1</td>\n      <td>0</td>\n    </tr>\n    <tr>\n      <th>O76039</th>\n      <td>1</td>\n      <td>1</td>\n      <td>1</td>\n      <td>1</td>\n      <td>1</td>\n      <td>1</td>\n      <td>1</td>\n      <td>1</td>\n      <td>1</td>\n      <td>1</td>\n      <td>0</td>\n    </tr>\n    <tr>\n      <th>G4N374</th>\n      <td>1</td>\n      <td>1</td>\n      <td>1</td>\n      <td>1</td>\n      <td>1</td>\n      <td>1</td>\n      
<td>1</td>\n      <td>1</td>\n      <td>1</td>\n      <td>1</td>\n      <td>0</td>\n    </tr>\n    <tr>\n      <th>Q8IVW4</th>\n      <td>1</td>\n      <td>1</td>\n      <td>1</td>\n      <td>1</td>\n      <td>1</td>\n      <td>1</td>\n      <td>1</td>\n      <td>1</td>\n      <td>1</td>\n      <td>1</td>\n      <td>0</td>\n    </tr>\n  </tbody>\n</table>\n</div>"
     },
     "execution_count": 38,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "group_by_uniprot = df.groupby('uniprot_accession_best').count().sort_values('pdb_id', ascending=False)\n",
    "group_by_uniprot"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "In this case, the most common UniProt accession is P24941, which accounts for 416 of the results.\n",
    "How many distinct UniProt accessions are there?"
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "outputs": [
    {
     "data": {
      "text/plain": "30"
     },
     "execution_count": 39,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(group_by_uniprot)"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "How many are enzymes? We can use the \"ec_number\" column to see how many have E.C. numbers."
   ],
   "metadata": {
    "collapsed": false
   }
  },
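  {
   "cell_type": "code",
   "execution_count": 40,
   "outputs": [],
   "source": [
    "# count() tallies non-null values, so a non-zero 'ec_number' count means at least one result has an E.C. number\n",
    "uniprot_with_ec = group_by_uniprot.query('ec_number != 0')"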
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "outputs": [
    {
     "data": {
      "text/plain": "30"
     },
     "execution_count": 41,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(uniprot_with_ec)"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "3.7.8-final"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}