{ "cells": [ { "source": [ "# PDB search\n", "\n", "This interactive Python notebook will guide you through various ways of programmatically accessing Protein Data Bank in Europe (PDBe) data using the REST API.\n", "\n", "The REST API is a programmatic way to obtain information from the PDB and EMDB. You can access details about:\n", "\n", "* sample\n", "* experiment\n", "* models\n", "* compounds\n", "* cross-references\n", "* publications\n", "* quality\n", "* assemblies\n", "\n", "and more...\n", "\n", "For more information, visit https://www.ebi.ac.uk/pdbe/pdbe-rest-api\n", "\n", "This notebook is part of the training material series and focuses on getting information from the PDBe search API. You can find this material, and much more, on [GitHub](https://github.com/PDBeurope/pdbe-api-training).\n", "\n", "## 1) Making imports and setting variables\n", "First, we import some packages that we will use, and set some variables.\n", "\n", "Note: a full list of valid URLs is available from https://www.ebi.ac.uk/pdbe/api/doc/" ], "cell_type": "markdown", "metadata": {} }, { "cell_type": "code", "execution_count": 162, "metadata": {}, "outputs": [ { "output_type": "display_data", "data": { "text/html": " \n " }, "metadata": {} }, { "output_type": "display_data", "data": { "text/html": " \n " }, "metadata": {} } ], "source": [ "import requests # used for getting data from a URL\n", "from pprint import pprint # pretty print\n", "import pandas as pd # used for turning results into mini databases\n", "from solrq import Q # used to build search queries in the right format\n", "\n", "import cufflinks as cf\n", "import plotly.offline as py\n", "\n", "cf.go_offline() # required to use plotly offline (no account required).\n", "py.init_notebook_mode() # renders charts inline (IPython).\n", "\n", "search_url = \"https://www.ebi.ac.uk/pdbe/search/pdb/select?\" # the base URL of PDBe's search API."
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2) A function to get data from the search API\n", "Let's start by defining a function that can be used to retrieve data from the PDBe search API.\n" ] }, { "cell_type": "code", "execution_count": 163, "metadata": {}, "outputs": [], "source": [ "def make_request(search_dict, number_of_rows=10):\n", "    \"\"\"\n", "    makes a POST request to the PDBe search API\n", "    :param dict search_dict: the terms used to search\n", "    :param number_of_rows: number of rows to return (default 10)\n", "    :return dict: response JSON, or an empty dict if the request failed\n", "    \"\"\"\n", "    if 'rows' not in search_dict:\n", "        search_dict['rows'] = number_of_rows\n", "    search_dict['wt'] = 'json'\n", "    # pprint(search_dict)\n", "    response = requests.post(search_url, data=search_dict)\n", "\n", "    if response.status_code == 200:\n", "        return response.json()\n", "    else:\n", "        print(\"[No data retrieved - %s] %s\" % (response.status_code, response.text))\n", "\n", "    return {}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3) Formatting the search terms\n", "The next function lets us write human-readable search terms and turns them into the parameters that the search API can handle."
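, "\n",
"\n",
"As a rough illustration of how such search parameters end up encoded for a request, the sketch below URL-encodes a hand-written query using only the standard library. The query string is written by hand (an assumption about roughly what solrq produces; solrq may escape characters differently), and `example_params` is a hypothetical name used only for this illustration. `requests` performs a similar encoding when it sends a dictionary as form data.\n",
"\n",
"```python\n",
"from urllib.parse import urlencode\n",
"\n",
"# Hypothetical, hand-written query; in this notebook solrq's Q builds the string.\n",
"example_params = {\n",
"    'q': 'molecule_name:\"Dihydrofolate reductase\"',\n",
"    'rows': 10,\n",
"    'wt': 'json',\n",
"}\n",
"\n",
"# urlencode percent-encodes special characters and joins the pairs with '&'.\n",
"print(urlencode(example_params))\n",
"```"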
] }, { "cell_type": "code", "execution_count": 164, "metadata": {}, "outputs": [], "source": [ "def format_search_terms(search_terms, filter_terms=None):\n", "    # 'q' holds the query; 'fl' (optional) is a comma-separated list of fields to return\n", "    ret = {'q': str(search_terms)}\n", "    if filter_terms:\n", "        ret['fl'] = ','.join(filter_terms)\n", "    return ret" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4) Getting useful data out of the search\n", "\n", "This function will run the search and return a list of the results." ] }, { "cell_type": "code", "execution_count": 165, "metadata": {}, "outputs": [], "source": [ "def run_search(search_terms, filter_terms=None, number_of_rows=100):\n", "    search_term = format_search_terms(search_terms, filter_terms)\n", "\n", "    response = make_request(search_term, number_of_rows)\n", "    results = response.get('response', {}).get('docs', [])\n", "    print('Number of results: {}'.format(len(results)))\n", "    return results" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5) Running a search\n", "\n", "Now we are ready to run a search against the PDBe search API for entries containing human Dihydrofolate reductase. `run_search` returns a list of results, at most 100 by default.\n", "\n", "A list of search terms is available at:\n", "https://www.ebi.ac.uk/pdbe/api/doc/search.html\n", "\n", "The searches below return details of Dihydrofolate reductase structures in the PDB.\n", "\n", "We are going to use the Python package \"solrq\" to make sure we get the search in the right format.\n", "\n", "Q(molecule_name=\"Dihydrofolate reductase\")\n", "\n", "Here we are searching for molecules named Dihydrofolate reductase.\n", "If we search for two terms, e.g. molecule_name and organism_scientific_name, then we will get molecules that match both search terms.\n", "\n", "We will print the number of results for two searches.\n", "\n", "The first one will hit the limit of 100. There are more than 100 Dihydrofolate reductase structures. 
\n", "We have to set the argument \"number_of_rows\" to a higher number, say 1000, to retrieve all of them. " ] }, { "cell_type": "code", "execution_count": 166, "metadata": { "tags": [] }, "outputs": [ { "output_type": "stream", "name": "stdout", "text": "1st search\nNumber of results: 100\n" } ], "source": [ "print('1st search')\n", "search_terms = Q(molecule_name=\"Dihydrofolate reductase\")\n", "\n", "first_results = run_search(search_terms)\n" ] }, { "cell_type": "code", "execution_count": 167, "metadata": { "tags": [] }, "outputs": [ { "output_type": "stream", "name": "stdout", "text": "1st search - more rows\nNumber of results: 405\n" } ], "source": [ "print('1st search - more rows')\n", "search_terms = Q(molecule_name=\"Dihydrofolate reductase\")\n", "\n", "first_results_more_rows = run_search(search_terms, number_of_rows=1000)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will add organism_name=\"Human\" to the query to limit the results to structures of human Dihydrofolate reductase." ] }, { "cell_type": "code", "execution_count": 168, "metadata": { "tags": [] }, "outputs": [ { "output_type": "stream", "name": "stdout", "text": "2nd search\nNumber of results: 79\n" } ], "source": [ "print('2nd search')\n", "search_terms = Q(molecule_name=\"Dihydrofolate reductase\", organism_name=\"Human\")\n", "second_results = run_search(search_terms)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "How did we know which search terms to use?\n", "\n", "To find out, we will print all the data we have for the first result.\n", "\n", "This will be the first item of the list \"second_results\"\n", "i.e. second_results[0]\n", "\n", "We are using \"pprint\" (pretty print) rather than \"print\" to make the result easier to read.\n", "\n", "All of the \"keys\" on the left side of the results can be used as a search term."
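, "\n",
"\n",
"As a small sketch of that idea, the snippet below takes one key and value from a hand-made result and turns them into a `field:value` query string. `example_doc` is a hypothetical stand-in for a real result (its values are copied from the printed result below), and the plain `field:value` form is an assumption about the query syntax; in this notebook solrq's `Q` builds and escapes the query for us.\n",
"\n",
"```python\n",
"# Hand-made stand-in for one search result.\n",
"example_doc = {\n",
"    'pdb_id': '3nxv',\n",
"    'uniprot_accession': ['P00374', 'P00374-2'],\n",
"}\n",
"\n",
"# Pick a key from the result and reuse it as a search field.\n",
"field = 'uniprot_accession'\n",
"value = example_doc[field][0]\n",
"\n",
"example_query = '{}:{}'.format(field, value)\n",
"print(example_query)\n",
"```"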
] }, { "cell_type": "code", "execution_count": 169, "metadata": { "tags": [ "outputPrepend" ] }, "outputs": [ { "output_type": "stream", "name": "stdout", "text": "L',\n 'q_struct_asym_id': ['A'],\n 'q_structure_determination_method': ['MOLECULAR REPLACEMENT'],\n 'q_structure_solution_software': ['MOLREP'],\n 'q_superkingdom': ['Eukaryota'],\n 'q_tax_id': [9606],\n 'q_tax_query': [9606],\n 'q_title': 'Preferential Selection of Isomer Binding from Chiral Mixtures: '\n 'Alternate Binding Modes Observed for the E- and Z-isomers of a '\n 'Series of 5-Substituted 2,4-Diaminofuro[2,3-d]pyrimidines as '\n 'Ternary Complexes with NADPH and Human Dihydrofolate Reductase',\n 'q_uniprot': ['P00374 : DYR_HUMAN'],\n 'q_uniprot_accession': ['P00374', 'P00374-2'],\n 'q_uniprot_accession_best': ['P00374'],\n 'q_uniprot_best': ['P00374 : DYR_HUMAN'],\n 'q_uniprot_coverage': [0.99],\n 'q_uniprot_features': ['Protein has possible alternate isoforms',\n 'DHFR',\n 'Protein has possible natural variant ',\n 'Nucleotide binding - NADP'],\n 'q_uniprot_id': ['DYR_HUMAN', 'DYR_HUMAN'],\n 'q_uniprot_id_best': ['DYR_HUMAN'],\n 'q_uniprot_non_canonical': ['P00374-2 : DYR_HUMAN'],\n 'q_unp_count': 1,\n 'q_unp_nf90_accession': ['B0YJ76',\n 'A0A2K6C3Y8',\n 'S5WD14',\n 'S5VM81',\n 'P00374',\n 'A0A024RAQ3'],\n 'q_unp_nf90_id': ['B0YJ76_HUMAN',\n 'A0A2K6C3Y8_MACNE',\n 'S5WD14_SHISS',\n 'S5VM81_ECO57',\n 'DYR_HUMAN',\n 'A0A024RAQ3_HUMAN'],\n 'q_unp_nf90_organism': ['Homo sapiens (Human)',\n 'Macaca nemestrina (Pig-tailed macaque)',\n 'Shigella sonnei (strain Ss046)',\n 'Escherichia coli O157:H7',\n 'Homo sapiens (Human)',\n 'Homo sapiens (Human)'],\n 'q_unp_nf90_protein_name': ['Dihydrofolate reductase',\n 'DHFR domain-containing protein',\n 'Dihydrofolate reductase',\n 'Dihydrofolate reductase',\n 'Dihydrofolate reductase',\n 'Dihydrofolate reductase'],\n 'q_unp_nf90_tax_id': ['9606', '9545', '300269', '83334', '9606', '9606'],\n 'r_factor': 0.19536,\n 'r_free': 0.2456,\n 'r_work': [0.19278],\n 
'rank': ['species',\n 'genus',\n 'subfamily',\n 'family',\n 'order',\n 'class',\n 'phylum',\n 'kingdom',\n 'superkingdom',\n 'species',\n 'genus',\n 'family',\n 'order',\n 'class',\n 'phylum',\n 'superkingdom'],\n 'reactant_id': ['NDP'],\n 'refinement_software': ['REFMAC'],\n 'release_date': '2010-12-01T01:00:00Z',\n 'release_year': 2010,\n 'resolution': 1.9,\n 'revision_date': '2017-11-08T01:00:00Z',\n 'revision_year': 2017,\n 'sample_preparation_method': ['Engineered'],\n 'seq_100_cluster_number': '24820',\n 'seq_100_cluster_rank': 39,\n 'seq_30_cluster_number': '162',\n 'seq_30_cluster_rank': 65,\n 'seq_40_cluster_number': '16985',\n 'seq_40_cluster_rank': 65,\n 'seq_50_cluster_number': '28892',\n 'seq_50_cluster_rank': 65,\n 'seq_70_cluster_number': '25856',\n 'seq_70_cluster_rank': 65,\n 'seq_90_cluster_number': '15408',\n 'seq_90_cluster_rank': 57,\n 'seq_95_cluster_number': '4071',\n 'seq_95_cluster_rank': 51,\n 'serial_crystal_experiment': ['N'],\n 'spacegroup': 'H 3',\n 'status': 'REL',\n 'struct_asym_id': ['A'],\n 'structure_determination_method': ['MOLECULAR REPLACEMENT'],\n 'structure_solution_software': ['MOLREP'],\n 'superkingdom': ['Eukaryota'],\n 't_abstracttext_unassigned': ['The crystal structures of six human '\n 'dihydrofolate reductase (hDHFR) ternary '\n 'complexes with NADPH and a series of mixed E/Z '\n 'isomers of 5-substituted '\n '5-[2-(2-methoxyphenyl)-prop-1-en-1-yl]furo[2,3-d]pyrimidine-2,4-diamines '\n 'substituted at the C9 position with propyl, '\n 'isopropyl, cyclopropyl, butyl, isobutyl and '\n 'sec-butyl (E2-E7, Z3) were determined and the '\n 'results were compared with the resolved E and '\n 'Z isomers of the C9-methyl parent compound. '\n 'The configuration of all of the inhibitors, '\n 'save one, was observed as the E isomer, in '\n 'which the binding of the furopyrimidine ring '\n 'is flipped such that the 4-amino group binds '\n 'in the 4-oxo site of folate. 
The Z3\\xa0isomer '\n 'of the C9-isopropyl analog has the normal '\n '2,4-diaminopyrimidine ring binding geometry, '\n 'with the furo oxygen near Glu30 and the '\n '4-amino group interacting near the cofactor '\n 'nicotinamide ring. Electron-density maps for '\n 'these structures revealed the binding of only '\n 'one isomer to hDHFR, despite the fact that '\n 'chiral mixtures (E:Z ratios of 2:1, 3:1 and '\n '3:2) of the inhibitors were incubated with '\n 'hDHFR prior to crystallization. Superposition '\n 'of the hDHFR complexes with E2 and Z3 shows '\n "that the 2'-methoxyphenyl ring of E2 is "\n 'perpendicular to that of Z3. The most potent '\n 'inhibitor in this series is the isopropyl '\n 'analog Z3 and the least potent is the isobutyl '\n 'analog E6, consistent with data that show that '\n 'the Z isomer makes the most favorable '\n 'interactions with the active-site residues. '\n 'The isobutyl moiety of E6 is observed in two '\n 'orientations and the resultant steric crowding '\n 'of the E6 analog is consistent with its weaker '\n 'activity. 
The alternative binding modes '\n 'observed for the furopyrimidine ring in these '\n 'E/Z isomers suggest that new templates can be '\n 'designed to probe these binding regions of the '\n 'DHFR active site.'],\n 't_all_compound_names': ['Nicotinamide-adenine dinucleotide',\n 'NDP',\n 'D2F : '\n '5-[(1E)-2-(2-methoxyphenyl)hex-1-en-1-yl]furo[2,3-d]pyrimidine-2,4-diamine',\n 'NDP : NADPH '\n 'DIHYDRO-NICOTINAMIDE-ADENINE-DINUCLEOTIDE PHOSPHATE',\n 'SO4 : SULFATE ION',\n 'NDP : Dihydronicotinamide-adenine dinucleotide '\n 'phosphate',\n 'NDP : Reduced nicotinamide adenine dinucleotide '\n 'phosphate',\n 'NDP : TPNH',\n 'SO4 : Sulfate',\n 'SO4 : Sulfate dianion',\n 'SO4 : Sulfate(2-)',\n 'SO4 : Sulfuric acid ion(2-)',\n 'D2F : '\n '5-[(1E)-2-(2-methoxyphenyl)hex-1-en-1-yl]furo[2,3-d]pyrimidine-2,4-diamine',\n 'SO4 : sulfate',\n 'D2F : '\n '5-[(E)-2-(2-methoxyphenyl)hex-1-enyl]furo[2,3-d]pyrimidine-2,4-diamine',\n 'NDP : '\n '[[(2R,3S,4R,5R)-5-(3-aminocarbonyl-4H-pyridin-1-yl)-3,4-dihydroxy-oxolan-2-yl]methoxy-hydroxy-phosphoryl] '\n '[(2R,3R,4R,5R)-5-(6-aminopurin-9-yl)-3-hydroxy-4-phosphonooxy-oxolan-2-yl]methyl '\n 'hydrogen phosphate'],\n 't_all_enzyme_names': ['Oxidoreductases',\n 'Acting on the CH-NH group of donors',\n 'With NAD(+) or NADP(+) as acceptor',\n 'Dihydrofolate reductase',\n '1.5.1.3 : Dihydrofolate reductase',\n '5,6,7,8-tetrahydrofolate:NADP(+) oxidoreductase'],\n 't_all_go_terms': ['cytoplasm',\n 'mitochondrion',\n 'cytosol',\n 'folic acid binding',\n 'oxidoreductase activity',\n 'NADPH binding',\n 'RNA binding',\n 'sequence-specific mRNA binding',\n 'mRNA binding',\n 'dihydrofolate reductase activity',\n 'NADP binding',\n 'methotrexate binding',\n 'translation repressor activity, mRNA regulatory element '\n 'binding',\n 'drug binding',\n 'one-carbon metabolic process',\n 'negative regulation of translation',\n 'folic acid metabolic process',\n 'response to methotrexate',\n 'regulation of removal of superoxide radicals',\n 'tetrahydrobiopterin 
biosynthetic process',\n 'tetrahydrofolate metabolic process',\n 'tetrahydrofolate biosynthetic process',\n 'oxidation-reduction process',\n 'regulation of transcription involved in G1/S transition '\n 'of mitotic cell cycle',\n 'positive regulation of nitric-oxide synthase activity',\n 'axon regeneration',\n 'dihydrofolate metabolic process'],\n 't_all_sequence_family': ['IPR001796 : Dihydrofolate reductase domain',\n 'IPR024072 : Dihydrofolate reductase-like domain '\n 'superfamily',\n 'PF00186 : DHFR_1',\n 'CL0387 : DHFred'],\n 't_all_structure_family': ['3-Layer(aba) Sandwich',\n 'Alpha Beta',\n '3.40.430.10',\n 'Dihydrofolate Reductase, subunit A',\n 'Dihydrofolate Reductase, subunit A'],\n 't_citation_authors': ['Cody V', 'Piraino J', 'Pace J', 'Li W', 'Gangjee A'],\n 't_citation_title': ['Preferential selection of isomer binding from chiral '\n 'mixtures: alternate binding modes observed for the E '\n 'and Z isomers of a series of 5-substituted '\n '2,4-diaminofuro[2,3-d]pyrimidines as ternary complexes '\n 'with NADPH and human dihydrofolate reductase.'],\n 't_entry_authors': ['Cody V'],\n 't_entry_info': ['POTASSIUM PHOSPHATE',\n 'AMMONIUM SULFATE',\n 'ETHANOL',\n 'MOSFLM',\n 'SCALA',\n 'Image plate',\n 'RIGAKU RAXIS IV',\n 'X-ray diffraction',\n 'REFMAC',\n 'MOLECULAR REPLACEMENT',\n 'MOLREP'],\n 't_entry_title': ['Preferential Selection of Isomer Binding from Chiral '\n 'Mixtures: Alternate Binding Modes Observed for the E- and '\n 'Z-isomers of a Series of 5-Substituted '\n '2,4-Diaminofuro[2,3-d]pyrimidines as Ternary Complexes '\n 'with NADPH and Human Dihydrofolate Reductase'],\n 't_expression_organism_name': ['Escherichia coli',\n 'Bacterium Coli',\n 'Enterococcus Coli',\n 'Bacterium 10a',\n 'Escherichia Coli',\n 'Escherichia Sp. 3_2_53faa',\n 'Escherichia/Shigella Coli',\n 'Ecolx',\n 'Bacterium E3',\n 'Bacterium Coli Commune',\n 'E. Coli',\n 'Bacillus Coli',\n 'Escherichia Sp. 
Mar',\n 'Escherichia coli',\n 'Escherichia',\n 'Enterobacteriaceae',\n 'Enterobacterales',\n 'Gammaproteobacteria',\n 'Proteobacteria',\n 'Bacteria'],\n 't_journal': ['Acta Crystallogr. D Biol. Crystallogr.'],\n 't_mesh_terms': ['Crystallography, X-Ray,Humans,Hydrophobic and Hydrophilic '\n 'Interactions,Models, Molecular,NADP,Protein Structure, '\n 'Tertiary,Pyrimidines,Stereoisomerism,Tetrahydrofolate '\n 'Dehydrogenase'],\n 't_molecule_info': ['protein structure',\n 'homo',\n 'monomer',\n 'Dihydrofolate reductase',\n 'Dihydrofolate reductase',\n 'protein structure',\n 'homo',\n 'monomer',\n 'DHFR',\n 'P00374',\n 'P00374-2',\n 'Protein has possible alternate isoforms',\n 'DHFR',\n 'Protein has possible natural variant ',\n 'Nucleotide binding - NADP',\n 'DYR_HUMAN',\n 'DYR_HUMAN'],\n 't_molecule_sequence': 'VGSLNCIVAVSQNMGIGKNGDLPWPPLRNEFRYFQRMTTTSSVEGKQNLVIMGKKTWFSIPEKNRPLKGRINLVLSRELKEPPQGAHFLSRSLDDALKLTEQPELANKVDMVWIVGGSSVYKEAMNHPGHLKLFVTRIMQDFESDTFFPEIDLEKYKLLPEYPGVLSDVQEEKGIKYKFEVYEKND',\n 't_organism_name': ['Homo sapiens',\n 'Man',\n 'Homo Sapiens (Human)',\n 'Human',\n 'Homo Sapiens',\n 'Homo sapiens',\n 'Homo',\n 'Homininae',\n 'Hominidae',\n 'Primates',\n 'Mammalia',\n 'Chordata',\n 'Metazoa',\n 'Eukaryota'],\n 'tax_id': [9606],\n 'tax_query': [9606],\n 'title': 'Preferential Selection of Isomer Binding from Chiral Mixtures: '\n 'Alternate Binding Modes Observed for the E- and Z-isomers of a '\n 'Series of 5-Substituted 2,4-Diaminofuro[2,3-d]pyrimidines as '\n 'Ternary Complexes with NADPH and Human Dihydrofolate Reductase',\n 'uniprot': ['P00374 : DYR_HUMAN'],\n 'uniprot_accession': ['P00374', 'P00374-2'],\n 'uniprot_accession_best': ['P00374'],\n 'uniprot_best': ['P00374 : DYR_HUMAN'],\n 'uniprot_coverage': [0.99],\n 'uniprot_features': ['Protein has possible alternate isoforms',\n 'DHFR',\n 'Protein has possible natural variant ',\n 'Nucleotide binding - NADP'],\n 'uniprot_id': ['DYR_HUMAN', 'DYR_HUMAN'],\n 'uniprot_id_best': ['DYR_HUMAN'],\n 
'uniprot_non_canonical': ['P00374-2 : DYR_HUMAN'],\n 'unp_count': 1,\n 'unp_nf90_accession': ['B0YJ76',\n 'A0A2K6C3Y8',\n 'S5WD14',\n 'S5VM81',\n 'P00374',\n 'A0A024RAQ3'],\n 'unp_nf90_id': ['B0YJ76_HUMAN',\n 'A0A2K6C3Y8_MACNE',\n 'S5WD14_SHISS',\n 'S5VM81_ECO57',\n 'DYR_HUMAN',\n 'A0A024RAQ3_HUMAN'],\n 'unp_nf90_organism': ['Homo sapiens (Human)',\n 'Macaca nemestrina (Pig-tailed macaque)',\n 'Shigella sonnei (strain Ss046)',\n 'Escherichia coli O157:H7',\n 'Homo sapiens (Human)',\n 'Homo sapiens (Human)'],\n 'unp_nf90_protein_name': ['Dihydrofolate reductase',\n 'DHFR domain-containing protein',\n 'Dihydrofolate reductase',\n 'Dihydrofolate reductase',\n 'Dihydrofolate reductase',\n 'Dihydrofolate reductase'],\n 'unp_nf90_tax_id': ['9606', '9545', '300269', '83334', '9606', '9606']}\n" } ], "source": [ "pprint(second_results[0])" ] }, { "cell_type": "markdown", "source": [ "A full list of available search terms is available using the following command." ], "metadata": { "collapsed": false, "pycharm": { "name": "#%% md\n" } } }, { "cell_type": "code", "execution_count": 170, "outputs": [ { "output_type": "stream", "name": "stdout", "text": "There are 448 available search terms\ndict_keys(['abstracttext_unassigned', 't_abstracttext_unassigned', 'q_abstracttext_unassigned', 'all_assembly_composition', 't_molecule_info', 'q_all_assembly_composition', 'all_assembly_form', 'q_all_assembly_form', 'all_assembly_id', 'q_all_assembly_id', 'all_assembly_mol_wt', 'q_all_assembly_mol_wt', 'all_assembly_type', 'q_all_assembly_type', 'all_authors', 'q_all_authors', 'all_molecule_names', 'q_all_molecule_names', 'all_num_interacting_entity_id', 'q_all_num_interacting_entity_id', 'assembly_composition', 'q_assembly_composition', 'assembly_form', 'q_assembly_form', 'assembly_id', 'q_assembly_id', 'assembly_mol_wt', 'q_assembly_mol_wt', 'assembly_num_component', 'q_assembly_num_component', 'assembly_type', 'q_assembly_type', 'beam_source_name', 'q_beam_source_name', 
'biological_cell_component', 'all_go_terms', 'q_all_go_terms', 't_all_go_terms', 'q_biological_cell_component', 'biological_function', 'q_biological_function', 'biological_process', 'q_biological_process', 'bound_compound_id', 'q_bound_compound_id', 'bound_compound_name', 'q_bound_compound_name', 'bound_compound_synonym', 'q_bound_compound_synonym', 'bound_compound_systematic_name', 'q_bound_compound_systematic_name', 'bound_compound_weight', 'q_bound_compound_weight', 'cath_architecture', 'all_structure_family', 'q_all_structure_family', 't_all_structure_family', 'q_cath_architecture', 'cath_class', 'q_cath_class', 'cath_code', 'q_cath_code', 'cath_homologous_superfamily', 'q_cath_homologous_superfamily', 'cath_topology', 'q_cath_topology', 'cell_a', 'q_cell_a', 'cell_alpha', 'q_cell_alpha', 'cell_b', 'q_cell_b', 'cell_beta', 'q_cell_beta', 'cell_c', 'q_cell_c', 'cell_gamma', 'q_cell_gamma', 'chain_id', 'q_chain_id', 'citation_authors', 't_citation_authors', 'q_citation_authors', 'citation_doi', 'q_citation_doi', 'citation_title', 't_citation_title', 'q_citation_title', 'citation_year', 'q_citation_year', 'cofactor_class', 't_all_compound_names', 'q_cofactor_class', 'cofactor_id', 'q_cofactor_id', 'compound_id', 'q_compound_id', 'compound_name', 'all_compound_names', 'q_all_compound_names', 'q_compound_name', 'compound_synonym', 'q_compound_synonym', 'compound_systematic_name', 'q_compound_systematic_name', 'compound_weight', 'q_compound_weight', 'crystallisation_cond', 'q_crystallisation_cond', 'crystallisation_method', 'q_crystallisation_method', 'crystallisation_ph', 'q_crystallisation_ph', 'crystallisation_reservoir', 't_entry_info', 'q_crystallisation_reservoir', 'crystallisation_temperature', 'q_crystallisation_temperature', 'data_collection_date', 'q_data_collection_date', 'data_collection_year', 'q_data_collection_year', 'data_quality', 'q_data_quality', 'data_reduction_software', 'q_data_reduction_software', 'data_scaling_software', 
'q_data_scaling_software', 'deposition_date', 'q_deposition_date', 'deposition_site', 'q_deposition_site', 'deposition_year', 'q_deposition_year', 'detector', 'q_detector', 'detector_type', 'q_detector_type', 'diffraction_protocol', 'q_diffraction_protocol', 'diffraction_source_type', 'q_diffraction_source_type', 'diffraction_wavelengths', 'q_diffraction_wavelengths', 'ec_hierarchy_name', 'all_enzyme_names', 'q_all_enzyme_names', 't_all_enzyme_names', 'q_ec_hierarchy_name', 'ec_number', 'q_ec_number', 'entity_id', 'q_entity_id', 'entity_weight', 'q_entity_weight', 'entry_author_list', 'q_entry_author_list', 'entry_authors', 't_entry_authors', 'q_entry_authors', 'entry_entity', 'entry_lig_entity', 'q_entry_lig_entity', 'entry_organism_scientific_name', 'entry_uniprot', 'q_entry_uniprot', 'entry_uniprot_accession', 'q_entry_uniprot_accession', 'entry_uniprot_id', 'q_entry_uniprot_id', 'enzyme_name', 'q_enzyme_name', 'enzyme_num_name', 'q_enzyme_num_name', 'enzyme_systematic_name', 'q_enzyme_systematic_name', 'experiment_data_available', 'q_experiment_data_available', 'experimental_method', 'q_experimental_method', 'expression_host_sci_name', 'expression_organism_name', 'q_expression_organism_name', 't_expression_organism_name', 'q_expression_host_sci_name', 'expression_host_synonyms', 'q_expression_host_synonyms', 'expression_host_tax_id', 'q_expression_host_tax_id', 'gene_name', 'q_gene_name', 'genus', 'q_genus', 'go_id', 'q_go_id', 'go_mapping', 'q_go_mapping', 'has_bound_molecule', 'q_has_bound_molecule', 'has_carb_polymer', 'q_has_carb_polymer', 'has_cofactor', 'q_has_cofactor', 'has_modified_residues', 'q_has_modified_residues', 'has_reactant', 'q_has_reactant', 'homologus_pdb_entity_id', 'q_homologus_pdb_entity_id', 'int_lig_entity', 'q_int_lig_entity', 'interacting_ligands', 'q_interacting_ligands', 'interpro', 'all_sequence_family', 'q_all_sequence_family', 't_all_sequence_family', 'q_interpro', 'interpro_accession', 'q_interpro_accession', 'interpro_dom', 
'q_interpro_dom', 'interpro_dom_acc', 'q_interpro_dom_acc', 'interpro_dom_name', 'q_interpro_dom_name', 'interpro_name', 'q_interpro_name', 'interpro_supfam', 'q_interpro_supfam', 'interpro_supfam_acc', 'q_interpro_supfam_acc', 'interpro_supfam_name', 'q_interpro_supfam_name', 'inv_overall_quality', 'q_inv_overall_quality', 'journal', 't_journal', 'q_journal', 'journal_page', 'q_journal_page', 'journal_volume', 'q_journal_volume', 'matthews_coefficient', 'q_matthews_coefficient', 'max_observed_residues', 'q_max_observed_residues', 'mesh_terms', 't_mesh_terms', 'q_mesh_terms', 'model_quality', 'q_model_quality', 'modified_residue_flag', 'q_modified_residue_flag', 'molecule_name', 'q_molecule_name', 'molecule_sequence', 'q_molecule_sequence', 't_molecule_sequence', 'molecule_synonym', 'q_molecule_synonym', 'molecule_type', 'q_molecule_type', 'mutation', 'q_mutation', 'nigli_cell_a', 'q_nigli_cell_a', 'nigli_cell_alpha', 'q_nigli_cell_alpha', 'nigli_cell_b', 'q_nigli_cell_b', 'nigli_cell_beta', 'q_nigli_cell_beta', 'nigli_cell_c', 'q_nigli_cell_c', 'nigli_cell_gamma', 'q_nigli_cell_gamma', 'nigli_cell_symmetry', 'q_nigli_cell_symmetry', 'num_interacting_entity_id', 'q_num_interacting_entity_id', 'num_r_free_reflections', 'q_num_r_free_reflections', 'number_of_bound_entities', 'q_number_of_bound_entities', 'number_of_bound_molecules', 'q_number_of_bound_molecules', 'number_of_copies', 'q_number_of_copies', 'number_of_models', 'q_number_of_models', 'number_of_polymer_entities', 'q_number_of_polymer_entities', 'number_of_polymer_residues', 'q_number_of_polymer_residues', 'number_of_polymers', 'q_number_of_polymers', 'number_of_protein_chains', 'q_number_of_protein_chains', 'organism_scientific_name', 'organism_name', 'q_organism_name', 't_organism_name', 'q_organism_scientific_name', 'organism_synonyms', 'q_organism_synonyms', 'overall_quality', 'q_overall_quality', 'pdb_accession', 'q_pdb_accession', 'pdb_format_compatible', 'q_pdb_format_compatible', 'pdb_id', 
'q_pdb_id', 'percent_solvent', 'q_percent_solvent', 'pfam', 'q_pfam', 'pfam_accession', 'q_pfam_accession', 'pfam_clan', 'q_pfam_clan', 'pfam_clan_name', 'q_pfam_clan_name', 'pfam_description', 'q_pfam_description', 'pfam_name', 'q_pfam_name', 'pivot_resolution', 'q_pivot_resolution', 'polymer_length', 'q_polymer_length', 'prefered_assembly_id', 'q_prefered_assembly_id', 'primary_wavelength', 'q_primary_wavelength', 'processing_site', 'q_processing_site', 'pubmed_author_list', 'q_pubmed_author_list', 'pubmed_authors', 'q_pubmed_authors', 'pubmed_id', 'q_pubmed_id', 'r_factor', 'q_r_factor', 'r_free', 'q_r_free', 'r_work', 'q_r_work', 'rank', 'q_rank', 'reactant_id', 'q_reactant_id', 'refinement_software', 'q_refinement_software', 'release_date', 'q_release_date', 'release_year', 'q_release_year', 'resolution', 'q_resolution', 'revision_date', 'q_revision_date', 'revision_year', 'q_revision_year', 'sample_preparation_method', 'q_sample_preparation_method', 'seq_100_cluster_number', 'q_seq_100_cluster_number', 'seq_100_cluster_rank', 'q_seq_100_cluster_rank', 'seq_30_cluster_number', 'q_seq_30_cluster_number', 'seq_30_cluster_rank', 'q_seq_30_cluster_rank', 'seq_40_cluster_number', 'q_seq_40_cluster_number', 'seq_40_cluster_rank', 'q_seq_40_cluster_rank', 'seq_50_cluster_number', 'q_seq_50_cluster_number', 'seq_50_cluster_rank', 'q_seq_50_cluster_rank', 'seq_70_cluster_number', 'q_seq_70_cluster_number', 'seq_70_cluster_rank', 'q_seq_70_cluster_rank', 'seq_90_cluster_number', 'q_seq_90_cluster_number', 'seq_90_cluster_rank', 'q_seq_90_cluster_rank', 'seq_95_cluster_number', 'q_seq_95_cluster_number', 'seq_95_cluster_rank', 'q_seq_95_cluster_rank', 'serial_crystal_experiment', 'q_serial_crystal_experiment', 'spacegroup', 'q_spacegroup', 'status', 'q_status', 'struct_asym_id', 'q_struct_asym_id', 'structure_determination_method', 'q_structure_determination_method', 'structure_solution_software', 'q_structure_solution_software', 'superkingdom', 'q_superkingdom', 
'tax_id', 'q_tax_id', 'tax_query', 'q_tax_query', 'title', 't_entry_title', 'q_title', 'uniprot', 'q_uniprot', 'uniprot_accession', 'q_uniprot_accession', 'uniprot_accession_best', 'q_uniprot_accession_best', 'uniprot_best', 'q_uniprot_best', 'uniprot_coverage', 'q_uniprot_coverage', 'uniprot_features', 'q_uniprot_features', 'uniprot_id', 'q_uniprot_id', 'uniprot_id_best', 'q_uniprot_id_best', 'uniprot_non_canonical', 'q_uniprot_non_canonical', 'unp_nf90_accession', 'q_unp_nf90_accession', 'unp_nf90_id', 'q_unp_nf90_id', 'unp_nf90_organism', 'q_unp_nf90_organism', 'unp_nf90_protein_name', 'q_unp_nf90_protein_name', 'unp_nf90_tax_id', 'q_unp_nf90_tax_id', '_version_', 'q_unp_count', 'unp_count'])\n217\n" } ], "source": [ "print('There are {} available search terms'.format(len(second_results[0].keys())))\n", "print(second_results[0].keys())\n", "# ignore the duplicated keys that carry a 'q_' or 't_' prefix\n", "keys_without_q = [key for key in second_results[0].keys() if not key.startswith(('q_', 't_'))]\n", "print(len(keys_without_q))" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" }, "tags": [] } }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you can see, we get lots of data back about the individual molecule we searched for and the PDB entries that contain it.\n", "\n", "We can get the PDB ID and experimental method for this first row as follows.\n", "Note that experimental_method is a list." ] }, { "cell_type": "code", "execution_count": 171, "metadata": { "tags": [] }, "outputs": [ { "output_type": "stream", "name": "stdout", "text": "3nxv\n['X-ray diffraction']\n" } ], "source": [ "print(second_results[0].get('pdb_id'))\n", "print(second_results[0].get('experimental_method'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can restrict the results to only the fields we want by using a filter, so that they are easier to read."
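, "\n",
"\n",
"A minimal sketch of what the filter amounts to, assuming the `fl` parameter takes a comma-separated list of field names (this mirrors `format_search_terms` above). `example_filter_terms` and `example_doc` are hypothetical names; the document is a small hand-made stand-in for a real result, and the local filtering only imitates the effect of `fl`, since the real filtering happens on the server.\n",
"\n",
"```python\n",
"example_filter_terms = ['pdb_id', 'experimental_method']\n",
"\n",
"# The value sent as the 'fl' (field list) parameter.\n",
"fl = ','.join(example_filter_terms)\n",
"print(fl)\n",
"\n",
"# Hand-made stand-in for one search result (values copied from the output above).\n",
"example_doc = {\n",
"    'pdb_id': '3nxv',\n",
"    'experimental_method': ['X-ray diffraction'],\n",
"    'resolution': 1.9,\n",
"}\n",
"\n",
"# Keeping only the requested fields imitates what the server returns.\n",
"filtered = {key: example_doc[key] for key in example_filter_terms if key in example_doc}\n",
"print(filtered)\n",
"```"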
] }, { "cell_type": "code", "execution_count": 172, "metadata": { "tags": [] }, "outputs": [ { "output_type": "stream", "name": "stdout", "text": "3rd search\nNumber of results: 79\n[{'experimental_method': ['X-ray diffraction'], 'pdb_id': '3nxv'},\n {'experimental_method': ['X-ray diffraction'], 'pdb_id': '5hsu'},\n {'experimental_method': ['X-ray diffraction'], 'pdb_id': '5ht4'},\n {'experimental_method': ['X-ray diffraction'], 'pdb_id': '5ht5'},\n {'experimental_method': ['X-ray diffraction'], 'pdb_id': '5hqz'},\n {'experimental_method': ['X-ray diffraction'], 'pdb_id': '5hvb'},\n {'experimental_method': ['X-ray diffraction'], 'pdb_id': '5hui'},\n {'experimental_method': ['X-ray diffraction'], 'pdb_id': '5hpb'},\n {'experimental_method': ['X-ray diffraction'], 'pdb_id': '4m6k'},\n {'experimental_method': ['X-ray diffraction'], 'pdb_id': '4qjc'},\n {'experimental_method': ['X-ray diffraction'], 'pdb_id': '4keb'},\n {'experimental_method': ['X-ray diffraction'], 'pdb_id': '4m6j'},\n {'experimental_method': ['X-ray diffraction'], 'pdb_id': '4g95'},\n {'experimental_method': ['X-ray diffraction'], 'pdb_id': '4qhv'},\n {'experimental_method': ['X-ray diffraction'], 'pdb_id': '4kfj'},\n {'experimental_method': ['X-ray diffraction'], 'pdb_id': '4m6l'},\n {'experimental_method': ['X-ray diffraction'], 'pdb_id': '4kak'},\n {'experimental_method': ['X-ray diffraction'], 'pdb_id': '4kbn'},\n {'experimental_method': ['X-ray diffraction'], 'pdb_id': '4kd7'},\n {'experimental_method': ['X-ray diffraction'], 'pdb_id': '5hsr'},\n {'experimental_method': ['X-ray diffraction'], 'pdb_id': '1kms'},\n {'experimental_method': ['X-ray diffraction'], 'pdb_id': '1kmv'},\n {'experimental_method': ['X-ray diffraction'], 'pdb_id': '1drf'},\n {'experimental_method': ['X-ray diffraction'], 'pdb_id': '1pd9'},\n {'experimental_method': ['X-ray diffraction'], 'pdb_id': '1dhf'},\n {'experimental_method': ['X-ray diffraction'], 'pdb_id': '1hfr'},\n {'experimental_method': ['X-ray diffraction'], 
'pdb_id': '1hfq'},\n {'experimental_method': ['X-ray diffraction'], 'pdb_id': '4ddr'},\n {'experimental_method': ['X-ray diffraction'], 'pdb_id': '5hve'},\n {'experimental_method': ['X-ray diffraction'], 'pdb_id': '5hqy'},\n {'experimental_method': ['X-ray diffraction'], 'pdb_id': '1dls'},\n {'experimental_method': ['X-ray diffraction'], 'pdb_id': '1dlr'},\n {'experimental_method': ['X-ray diffraction'], 'pdb_id': '1hfp'},\n {'experimental_method': ['X-ray diffraction'], 'pdb_id': '1ohk'},\n {'experimental_method': ['X-ray diffraction'], 'pdb_id': '1pd8'},\n {'experimental_method': ['X-ray diffraction'], 'pdb_id': '1mvs'},\n {'experimental_method': ['X-ray diffraction'], 'pdb_id': '1ohj'},\n {'experimental_method': ['X-ray diffraction'], 'pdb_id': '1s3w'},\n {'experimental_method': ['X-ray diffraction'], 'pdb_id': '3f8y'},\n {'experimental_method': ['X-ray diffraction'], 'pdb_id': '3ghc'},\n {'experimental_method': ['X-ray diffraction'], 'pdb_id': '2w3a'},\n {'experimental_method': ['X-ray diffraction'], 'pdb_id': '2w3m'},\n {'experimental_method': ['X-ray diffraction'], 'pdb_id': '3s7a'},\n {'experimental_method': ['X-ray diffraction'], 'pdb_id': '3f8z'},\n {'experimental_method': ['X-ray diffraction'], 'pdb_id': '3s3v'},\n {'experimental_method': ['X-ray diffraction'], 'pdb_id': '3nzd'},\n {'experimental_method': ['X-ray diffraction'], 'pdb_id': '3nxo'},\n {'experimental_method': ['X-ray diffraction'], 'pdb_id': '3n0h'},\n {'experimental_method': ['X-ray diffraction'], 'pdb_id': '1u71'},\n {'experimental_method': ['Solution NMR'], 'pdb_id': '1yho'},\n {'experimental_method': ['X-ray diffraction'], 'pdb_id': '1mvt'},\n {'experimental_method': ['X-ray diffraction'], 'pdb_id': '1boz'},\n {'experimental_method': ['X-ray diffraction'], 'pdb_id': '1s3v'},\n {'experimental_method': ['X-ray diffraction'], 'pdb_id': '1pdb'},\n {'experimental_method': ['X-ray diffraction'], 'pdb_id': '1u72'},\n {'experimental_method': ['X-ray diffraction'], 'pdb_id': '1s3u'},\n 
{'experimental_method': ['X-ray diffraction'], 'pdb_id': '2dhf'},\n {'experimental_method': ['X-ray diffraction'], 'pdb_id': '2c2t'},\n {'experimental_method': ['X-ray diffraction'], 'pdb_id': '2c2s'},\n {'experimental_method': ['X-ray diffraction'], 'pdb_id': '2w3b'},\n {'experimental_method': ['X-ray diffraction'], 'pdb_id': '3ghw'},\n {'experimental_method': ['X-ray diffraction'], 'pdb_id': '3eig'},\n {'experimental_method': ['X-ray diffraction'], 'pdb_id': '3ghv'},\n {'experimental_method': ['X-ray diffraction'], 'pdb_id': '3gi2'},\n {'experimental_method': ['X-ray diffraction'], 'pdb_id': '3nxy'},\n {'experimental_method': ['X-ray diffraction'], 'pdb_id': '3oaf'},\n {'experimental_method': ['X-ray diffraction'], 'pdb_id': '3fs6'},\n {'experimental_method': ['X-ray diffraction'], 'pdb_id': '3nxt'},\n {'experimental_method': ['X-ray diffraction'], 'pdb_id': '3ntz'},\n {'experimental_method': ['X-ray diffraction'], 'pdb_id': '3nxr'},\n {'experimental_method': ['X-ray diffraction'], 'pdb_id': '3nxx'},\n {'experimental_method': ['X-ray diffraction'], 'pdb_id': '3l3r'},\n {'experimental_method': ['X-ray diffraction'], 'pdb_id': '3gyf'},\n {'experimental_method': ['X-ray diffraction'], 'pdb_id': '3f91'},\n {'experimental_method': ['X-ray diffraction'], 'pdb_id': '3nu0'},\n {'experimental_method': ['X-ray diffraction'], 'pdb_id': '6de4'},\n {'experimental_method': ['X-ray diffraction'], 'pdb_id': '6dav'},\n {'experimental_method': ['X-ray diffraction'], 'pdb_id': '6a7e'},\n {'experimental_method': ['X-ray diffraction'], 'pdb_id': '6a7c'}]\n" } ], "source": [ "print('3rd search')\n", "search_terms = Q(molecule_name=\"Dihydrofolate reductase\",organism_name=\"Human\")\n", "filter_terms = ['pdb_id', 'experimental_method']\n", "third_results = run_search(search_terms, filter_terms)\n", "pprint(third_results)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 6) Analysing and plotting the results\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ 
"We are going to use a Python package called Pandas to help us sort and visualise the results.\n", "\n", "\n", "First we have to do a bit of housekeeping. Some of the results are lists (for example, a PDB entry can have more than one experimental method or organism), so we need to change them into strings before we can use them in a graph." ] }, { "cell_type": "code", "execution_count": 173, "metadata": {}, "outputs": [], "source": [ "def change_lists_to_strings(results):\n", "    \"\"\"\n", "    input - list of results from search\n", "    output - list of results with lists changed into strings\n", "    \"\"\"\n", "    for row in results:\n", "        for data in row:\n", "            if isinstance(row[data], list):\n", "                # if there are any numbers in the list change them into strings\n", "                row[data] = [str(a) for a in row[data]]\n", "                # unique and sort the list and then change the list into a string\n", "                row[data] = ','.join(sorted(set(row[data])))\n", "\n", "    return results" ] }, { "cell_type": "code", "execution_count": 174, "metadata": { "tags": [] }, "outputs": [ { "output_type": "stream", "name": "stdout", "text": "[{'experimental_method': 'X-ray diffraction', 'pdb_id': '3nxv'},\n {'experimental_method': 'X-ray diffraction', 'pdb_id': '5hsu'},\n {'experimental_method': 'X-ray diffraction', 'pdb_id': '5ht4'},\n {'experimental_method': 'X-ray diffraction', 'pdb_id': '5ht5'},\n {'experimental_method': 'X-ray diffraction', 'pdb_id': '5hqz'},\n {'experimental_method': 'X-ray diffraction', 'pdb_id': '5hvb'},\n {'experimental_method': 'X-ray diffraction', 'pdb_id': '5hui'},\n {'experimental_method': 'X-ray diffraction', 'pdb_id': '5hpb'},\n {'experimental_method': 'X-ray diffraction', 'pdb_id': '4m6k'},\n {'experimental_method': 'X-ray diffraction', 'pdb_id': '4qjc'},\n {'experimental_method': 'X-ray diffraction', 'pdb_id': '4keb'},\n {'experimental_method': 'X-ray diffraction', 'pdb_id': '4m6j'},\n {'experimental_method': 'X-ray diffraction', 'pdb_id': '4g95'},\n {'experimental_method': 'X-ray 
diffraction', 'pdb_id': '4qhv'},\n {'experimental_method': 'X-ray diffraction', 'pdb_id': '4kfj'},\n {'experimental_method': 'X-ray diffraction', 'pdb_id': '4m6l'},\n {'experimental_method': 'X-ray diffraction', 'pdb_id': '4kak'},\n {'experimental_method': 'X-ray diffraction', 'pdb_id': '4kbn'},\n {'experimental_method': 'X-ray diffraction', 'pdb_id': '4kd7'},\n {'experimental_method': 'X-ray diffraction', 'pdb_id': '5hsr'},\n {'experimental_method': 'X-ray diffraction', 'pdb_id': '1kms'},\n {'experimental_method': 'X-ray diffraction', 'pdb_id': '1kmv'},\n {'experimental_method': 'X-ray diffraction', 'pdb_id': '1drf'},\n {'experimental_method': 'X-ray diffraction', 'pdb_id': '1pd9'},\n {'experimental_method': 'X-ray diffraction', 'pdb_id': '1dhf'},\n {'experimental_method': 'X-ray diffraction', 'pdb_id': '1hfr'},\n {'experimental_method': 'X-ray diffraction', 'pdb_id': '1hfq'},\n {'experimental_method': 'X-ray diffraction', 'pdb_id': '4ddr'},\n {'experimental_method': 'X-ray diffraction', 'pdb_id': '5hve'},\n {'experimental_method': 'X-ray diffraction', 'pdb_id': '5hqy'},\n {'experimental_method': 'X-ray diffraction', 'pdb_id': '1dls'},\n {'experimental_method': 'X-ray diffraction', 'pdb_id': '1dlr'},\n {'experimental_method': 'X-ray diffraction', 'pdb_id': '1hfp'},\n {'experimental_method': 'X-ray diffraction', 'pdb_id': '1ohk'},\n {'experimental_method': 'X-ray diffraction', 'pdb_id': '1pd8'},\n {'experimental_method': 'X-ray diffraction', 'pdb_id': '1mvs'},\n {'experimental_method': 'X-ray diffraction', 'pdb_id': '1ohj'},\n {'experimental_method': 'X-ray diffraction', 'pdb_id': '1s3w'},\n {'experimental_method': 'X-ray diffraction', 'pdb_id': '3f8y'},\n {'experimental_method': 'X-ray diffraction', 'pdb_id': '3ghc'},\n {'experimental_method': 'X-ray diffraction', 'pdb_id': '2w3a'},\n {'experimental_method': 'X-ray diffraction', 'pdb_id': '2w3m'},\n {'experimental_method': 'X-ray diffraction', 'pdb_id': '3s7a'},\n {'experimental_method': 'X-ray diffraction', 
'pdb_id': '3f8z'},\n {'experimental_method': 'X-ray diffraction', 'pdb_id': '3s3v'},\n {'experimental_method': 'X-ray diffraction', 'pdb_id': '3nzd'},\n {'experimental_method': 'X-ray diffraction', 'pdb_id': '3nxo'},\n {'experimental_method': 'X-ray diffraction', 'pdb_id': '3n0h'},\n {'experimental_method': 'X-ray diffraction', 'pdb_id': '1u71'},\n {'experimental_method': 'Solution NMR', 'pdb_id': '1yho'},\n {'experimental_method': 'X-ray diffraction', 'pdb_id': '1mvt'},\n {'experimental_method': 'X-ray diffraction', 'pdb_id': '1boz'},\n {'experimental_method': 'X-ray diffraction', 'pdb_id': '1s3v'},\n {'experimental_method': 'X-ray diffraction', 'pdb_id': '1pdb'},\n {'experimental_method': 'X-ray diffraction', 'pdb_id': '1u72'},\n {'experimental_method': 'X-ray diffraction', 'pdb_id': '1s3u'},\n {'experimental_method': 'X-ray diffraction', 'pdb_id': '2dhf'},\n {'experimental_method': 'X-ray diffraction', 'pdb_id': '2c2t'},\n {'experimental_method': 'X-ray diffraction', 'pdb_id': '2c2s'},\n {'experimental_method': 'X-ray diffraction', 'pdb_id': '2w3b'},\n {'experimental_method': 'X-ray diffraction', 'pdb_id': '3ghw'},\n {'experimental_method': 'X-ray diffraction', 'pdb_id': '3eig'},\n {'experimental_method': 'X-ray diffraction', 'pdb_id': '3ghv'},\n {'experimental_method': 'X-ray diffraction', 'pdb_id': '3gi2'},\n {'experimental_method': 'X-ray diffraction', 'pdb_id': '3nxy'},\n {'experimental_method': 'X-ray diffraction', 'pdb_id': '3oaf'},\n {'experimental_method': 'X-ray diffraction', 'pdb_id': '3fs6'},\n {'experimental_method': 'X-ray diffraction', 'pdb_id': '3nxt'},\n {'experimental_method': 'X-ray diffraction', 'pdb_id': '3ntz'},\n {'experimental_method': 'X-ray diffraction', 'pdb_id': '3nxr'},\n {'experimental_method': 'X-ray diffraction', 'pdb_id': '3nxx'},\n {'experimental_method': 'X-ray diffraction', 'pdb_id': '3l3r'},\n {'experimental_method': 'X-ray diffraction', 'pdb_id': '3gyf'},\n {'experimental_method': 'X-ray diffraction', 'pdb_id': '3f91'},\n 
 {'experimental_method': 'X-ray diffraction', 'pdb_id': '3nu0'},\n {'experimental_method': 'X-ray diffraction', 'pdb_id': '6de4'},\n {'experimental_method': 'X-ray diffraction', 'pdb_id': '6dav'},\n {'experimental_method': 'X-ray diffraction', 'pdb_id': '6a7e'},\n {'experimental_method': 'X-ray diffraction', 'pdb_id': '6a7c'}]\n" } ], "source": [ "results = change_lists_to_strings(third_results)\n", "pprint(results)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that the only change is that the list ['X-ray diffraction'] is now the string 'X-ray diffraction'." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we wanted to know the experimental methods used to determine structures of Human Dihydrofolate reductase, we could loop through the results and count how many entries use each experimental method. \n", "\n", "We can use a Python package called Pandas to do this for us. \n", "It changes the results into a mini database, called a DataFrame. \n" ] }, { "cell_type": "code", "execution_count": 175, "metadata": { "tags": [] }, "outputs": [ { "output_type": "stream", "name": "stdout", "text": " experimental_method pdb_id\n0 X-ray diffraction 3nxv\n1 X-ray diffraction 5hsu\n2 X-ray diffraction 5ht4\n3 X-ray diffraction 5ht5\n4 X-ray diffraction 5hqz\n.. ... 
...\n74 X-ray diffraction 3nu0\n75 X-ray diffraction 6de4\n76 X-ray diffraction 6dav\n77 X-ray diffraction 6a7e\n78 X-ray diffraction 6a7c\n\n[79 rows x 2 columns]\n" } ], "source": [ "def pandas_dataset(list_of_results):\n", "    results = change_lists_to_strings(list_of_results) # we have added our function to change lists to strings\n", "    df = pd.DataFrame(results)\n", "\n", "    return df\n", "\n", "df = pandas_dataset(list_of_results=results)\n", "print(df)\n", " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can use this DataFrame to count how many PDB codes there are for each experimental method.\n", "The `groupby` call groups PDB IDs by experimental method and then counts the number of unique PDB IDs per method." ] }, { "cell_type": "code", "execution_count": 176, "metadata": { "tags": [] }, "outputs": [ { "output_type": "stream", "name": "stdout", "text": "experimental_method\nSolution NMR 1\nX-ray diffraction 78\nName: pdb_id, dtype: int64\n" } ], "source": [ "ds = df.groupby('experimental_method')['pdb_id'].nunique()\n", "print(ds)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can find the greatest (max) and lowest (min) number of entries for any experimental method." 
] }, { "cell_type": "code", "execution_count": 177, "metadata": { "tags": [] }, "outputs": [ { "output_type": "stream", "name": "stdout", "text": "78\n1\n" } ], "source": [ "dt = ds.max()\n", "print(dt)\n", "dt = ds.min()\n", "print(dt)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can sort the results in descending order, so that the first value is the experimental method with the highest number of results." ] }, { "cell_type": "code", "execution_count": 178, "metadata": {}, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": "'X-ray diffraction'" }, "metadata": {}, "execution_count": 178 } ], "source": [ "ds.sort_values(ascending=False).index[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Or we can sort in ascending order, so that the first value is the experimental method with the lowest number of results." ] }, { "cell_type": "code", "execution_count": 179, "metadata": {}, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": "'Solution NMR'" }, "metadata": {}, "execution_count": 179 } ], "source": [ "ds.sort_values(ascending=True).index[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also plot these results as a bar chart." ] }, { "cell_type": "code", "execution_count": 180, "metadata": {}, "outputs": [ { "output_type": "display_data", "data": { "application/vnd.plotly.v1+json": { "config": { "linkText": "Export to plot.ly", "plotlyServerURL": "https://plot.ly", "showLink": true }, "data": [ { "marker": { "color": "rgba(255, 153, 51, 0.6)", "line": { "color": "rgba(255, 153, 51, 1.0)", "width": 1 } }, "name": "pdb_id", "orientation": "v", "text": "", "type": "bar", "x": [ "Solution NMR", "X-ray diffraction" ], "y": [ 1, 78 ] } ], "layout": { "legend": { "bgcolor": "#F5F6F9", "font": { "color": "#4D5663" } }, "paper_bgcolor": "#F5F6F9", "plot_bgcolor": "#F5F6F9", "template": { "data": { "bar": [ { "error_x": { "color": "#2a3f5f" }, "error_y": { "color": "#2a3f5f" }, "marker": { 
"line": { "color": "#E5ECF6", "width": 0.5 } }, "type": "bar" } ], "barpolar": [ { "marker": { "line": { "color": "#E5ECF6", "width": 0.5 } }, "type": "barpolar" } ], "carpet": [ { "aaxis": { "endlinecolor": "#2a3f5f", "gridcolor": "white", "linecolor": "white", "minorgridcolor": "white", "startlinecolor": "#2a3f5f" }, "baxis": { "endlinecolor": "#2a3f5f", "gridcolor": "white", "linecolor": "white", "minorgridcolor": "white", "startlinecolor": "#2a3f5f" }, "type": "carpet" } ], "choropleth": [ { "colorbar": { "outlinewidth": 0, "ticks": "" }, "type": "choropleth" } ], "contour": [ { "colorbar": { "outlinewidth": 0, "ticks": "" }, "colorscale": [ [ 0, "#0d0887" ], [ 0.1111111111111111, "#46039f" ], [ 0.2222222222222222, "#7201a8" ], [ 0.3333333333333333, "#9c179e" ], [ 0.4444444444444444, "#bd3786" ], [ 0.5555555555555556, "#d8576b" ], [ 0.6666666666666666, "#ed7953" ], [ 0.7777777777777778, "#fb9f3a" ], [ 0.8888888888888888, "#fdca26" ], [ 1, "#f0f921" ] ], "type": "contour" } ], "contourcarpet": [ { "colorbar": { "outlinewidth": 0, "ticks": "" }, "type": "contourcarpet" } ], "heatmap": [ { "colorbar": { "outlinewidth": 0, "ticks": "" }, "colorscale": [ [ 0, "#0d0887" ], [ 0.1111111111111111, "#46039f" ], [ 0.2222222222222222, "#7201a8" ], [ 0.3333333333333333, "#9c179e" ], [ 0.4444444444444444, "#bd3786" ], [ 0.5555555555555556, "#d8576b" ], [ 0.6666666666666666, "#ed7953" ], [ 0.7777777777777778, "#fb9f3a" ], [ 0.8888888888888888, "#fdca26" ], [ 1, "#f0f921" ] ], "type": "heatmap" } ], "heatmapgl": [ { "colorbar": { "outlinewidth": 0, "ticks": "" }, "colorscale": [ [ 0, "#0d0887" ], [ 0.1111111111111111, "#46039f" ], [ 0.2222222222222222, "#7201a8" ], [ 0.3333333333333333, "#9c179e" ], [ 0.4444444444444444, "#bd3786" ], [ 0.5555555555555556, "#d8576b" ], [ 0.6666666666666666, "#ed7953" ], [ 0.7777777777777778, "#fb9f3a" ], [ 0.8888888888888888, "#fdca26" ], [ 1, "#f0f921" ] ], "type": "heatmapgl" } ], "histogram": [ { "marker": { "colorbar": { "outlinewidth": 0, 
"ticks": "" } }, "type": "histogram" } ], "histogram2d": [ { "colorbar": { "outlinewidth": 0, "ticks": "" }, "colorscale": [ [ 0, "#0d0887" ], [ 0.1111111111111111, "#46039f" ], [ 0.2222222222222222, "#7201a8" ], [ 0.3333333333333333, "#9c179e" ], [ 0.4444444444444444, "#bd3786" ], [ 0.5555555555555556, "#d8576b" ], [ 0.6666666666666666, "#ed7953" ], [ 0.7777777777777778, "#fb9f3a" ], [ 0.8888888888888888, "#fdca26" ], [ 1, "#f0f921" ] ], "type": "histogram2d" } ], "histogram2dcontour": [ { "colorbar": { "outlinewidth": 0, "ticks": "" }, "colorscale": [ [ 0, "#0d0887" ], [ 0.1111111111111111, "#46039f" ], [ 0.2222222222222222, "#7201a8" ], [ 0.3333333333333333, "#9c179e" ], [ 0.4444444444444444, "#bd3786" ], [ 0.5555555555555556, "#d8576b" ], [ 0.6666666666666666, "#ed7953" ], [ 0.7777777777777778, "#fb9f3a" ], [ 0.8888888888888888, "#fdca26" ], [ 1, "#f0f921" ] ], "type": "histogram2dcontour" } ], "mesh3d": [ { "colorbar": { "outlinewidth": 0, "ticks": "" }, "type": "mesh3d" } ], "parcoords": [ { "line": { "colorbar": { "outlinewidth": 0, "ticks": "" } }, "type": "parcoords" } ], "pie": [ { "automargin": true, "type": "pie" } ], "scatter": [ { "marker": { "colorbar": { "outlinewidth": 0, "ticks": "" } }, "type": "scatter" } ], "scatter3d": [ { "line": { "colorbar": { "outlinewidth": 0, "ticks": "" } }, "marker": { "colorbar": { "outlinewidth": 0, "ticks": "" } }, "type": "scatter3d" } ], "scattercarpet": [ { "marker": { "colorbar": { "outlinewidth": 0, "ticks": "" } }, "type": "scattercarpet" } ], "scattergeo": [ { "marker": { "colorbar": { "outlinewidth": 0, "ticks": "" } }, "type": "scattergeo" } ], "scattergl": [ { "marker": { "colorbar": { "outlinewidth": 0, "ticks": "" } }, "type": "scattergl" } ], "scattermapbox": [ { "marker": { "colorbar": { "outlinewidth": 0, "ticks": "" } }, "type": "scattermapbox" } ], "scatterpolar": [ { "marker": { "colorbar": { "outlinewidth": 0, "ticks": "" } }, "type": "scatterpolar" } ], "scatterpolargl": [ { "marker": { 
"colorbar": { "outlinewidth": 0, "ticks": "" } }, "type": "scatterpolargl" } ], "scatterternary": [ { "marker": { "colorbar": { "outlinewidth": 0, "ticks": "" } }, "type": "scatterternary" } ], "surface": [ { "colorbar": { "outlinewidth": 0, "ticks": "" }, "colorscale": [ [ 0, "#0d0887" ], [ 0.1111111111111111, "#46039f" ], [ 0.2222222222222222, "#7201a8" ], [ 0.3333333333333333, "#9c179e" ], [ 0.4444444444444444, "#bd3786" ], [ 0.5555555555555556, "#d8576b" ], [ 0.6666666666666666, "#ed7953" ], [ 0.7777777777777778, "#fb9f3a" ], [ 0.8888888888888888, "#fdca26" ], [ 1, "#f0f921" ] ], "type": "surface" } ], "table": [ { "cells": { "fill": { "color": "#EBF0F8" }, "line": { "color": "white" } }, "header": { "fill": { "color": "#C8D4E3" }, "line": { "color": "white" } }, "type": "table" } ] }, "layout": { "annotationdefaults": { "arrowcolor": "#2a3f5f", "arrowhead": 0, "arrowwidth": 1 }, "coloraxis": { "colorbar": { "outlinewidth": 0, "ticks": "" } }, "colorscale": { "diverging": [ [ 0, "#8e0152" ], [ 0.1, "#c51b7d" ], [ 0.2, "#de77ae" ], [ 0.3, "#f1b6da" ], [ 0.4, "#fde0ef" ], [ 0.5, "#f7f7f7" ], [ 0.6, "#e6f5d0" ], [ 0.7, "#b8e186" ], [ 0.8, "#7fbc41" ], [ 0.9, "#4d9221" ], [ 1, "#276419" ] ], "sequential": [ [ 0, "#0d0887" ], [ 0.1111111111111111, "#46039f" ], [ 0.2222222222222222, "#7201a8" ], [ 0.3333333333333333, "#9c179e" ], [ 0.4444444444444444, "#bd3786" ], [ 0.5555555555555556, "#d8576b" ], [ 0.6666666666666666, "#ed7953" ], [ 0.7777777777777778, "#fb9f3a" ], [ 0.8888888888888888, "#fdca26" ], [ 1, "#f0f921" ] ], "sequentialminus": [ [ 0, "#0d0887" ], [ 0.1111111111111111, "#46039f" ], [ 0.2222222222222222, "#7201a8" ], [ 0.3333333333333333, "#9c179e" ], [ 0.4444444444444444, "#bd3786" ], [ 0.5555555555555556, "#d8576b" ], [ 0.6666666666666666, "#ed7953" ], [ 0.7777777777777778, "#fb9f3a" ], [ 0.8888888888888888, "#fdca26" ], [ 1, "#f0f921" ] ] }, "colorway": [ "#636efa", "#EF553B", "#00cc96", "#ab63fa", "#FFA15A", "#19d3f3", "#FF6692", "#B6E880", "#FF97FF", 
"#FECB52" ], "font": { "color": "#2a3f5f" }, "geo": { "bgcolor": "white", "lakecolor": "white", "landcolor": "#E5ECF6", "showlakes": true, "showland": true, "subunitcolor": "white" }, "hoverlabel": { "align": "left" }, "hovermode": "closest", "mapbox": { "style": "light" }, "paper_bgcolor": "white", "plot_bgcolor": "#E5ECF6", "polar": { "angularaxis": { "gridcolor": "white", "linecolor": "white", "ticks": "" }, "bgcolor": "#E5ECF6", "radialaxis": { "gridcolor": "white", "linecolor": "white", "ticks": "" } }, "scene": { "xaxis": { "backgroundcolor": "#E5ECF6", "gridcolor": "white", "gridwidth": 2, "linecolor": "white", "showbackground": true, "ticks": "", "zerolinecolor": "white" }, "yaxis": { "backgroundcolor": "#E5ECF6", "gridcolor": "white", "gridwidth": 2, "linecolor": "white", "showbackground": true, "ticks": "", "zerolinecolor": "white" }, "zaxis": { "backgroundcolor": "#E5ECF6", "gridcolor": "white", "gridwidth": 2, "linecolor": "white", "showbackground": true, "ticks": "", "zerolinecolor": "white" } }, "shapedefaults": { "line": { "color": "#2a3f5f" } }, "ternary": { "aaxis": { "gridcolor": "white", "linecolor": "white", "ticks": "" }, "baxis": { "gridcolor": "white", "linecolor": "white", "ticks": "" }, "bgcolor": "#E5ECF6", "caxis": { "gridcolor": "white", "linecolor": "white", "ticks": "" } }, "title": { "x": 0.05 }, "xaxis": { "automargin": true, "gridcolor": "white", "linecolor": "white", "ticks": "", "title": { "standoff": 15 }, "zerolinecolor": "white", "zerolinewidth": 2 }, "yaxis": { "automargin": true, "gridcolor": "white", "linecolor": "white", "ticks": "", "title": { "standoff": 15 }, "zerolinecolor": "white", "zerolinewidth": 2 } } }, "title": { "font": { "color": "#4D5663" } }, "xaxis": { "gridcolor": "#E1E5ED", "showgrid": true, "tickfont": { "color": "#4D5663" }, "title": { "font": { "color": "#4D5663" }, "text": "" }, "zerolinecolor": "#E1E5ED" }, "yaxis": { "gridcolor": "#E1E5ED", "showgrid": true, "tickfont": { "color": "#4D5663" }, 
"title": { "font": { "color": "#4D5663" }, "text": "" }, "zerolinecolor": "#E1E5ED" } } }, "text/html": "