ID mapping using mygene

Last updated on 2023-04-27 | Edit this page

Overview

Questions

  • How can I get gene annotations in Python?

Objectives

  • Explore the mygene library.
  • Access and merge gene annotations using mygene.

Key Points

  • mygene is Python module which allows access to a gene annotation database.
  • You can query mygene with multiple identifiers using querymany.

Installing mygene from bioconda


ID mapping is a very common, often not fun, task for every bioinformatician. Different datasets and tools often use different sets of biological identifiers. Our goal is to take some list of gene symbols other gene identifiers and convert them to some other set of identifiers to facilitate some analysis.

mygene is a convenient Python module to access MyGene.info gene query web services.

Mygene is a part of a family of biothings API services which also has modules for chemical, diseases, genetic variances, and taxonomy. All of these modules use a simlar Python interface.

Mygene can be installed using conda, but we it exists on a separate set of biomedical packages called bioconda. We access this using the -c flag which tells conda the channel we wish to search.

BASH

$ conda install -c bioconda mygene

Once installed, we can import mygene and create a MyGeneInfo object, which is used for accessing and downloading genetic annotations.

PYTHON

import mygene

mg = mygene.MyGeneInfo()

Mapping gene symbols

Suppose xli is a list of gene symbols we want to convert to entrez gene ids:

PYTHON

xli = ['DDX26B',
'CCDC83',
'MAST3',
'FLOT1',
'RPL11',
'ZDHHC20',
'LUC7L3',
'SNORD49A',
'CTSH',
'ACOT8']

We can then use the querymany method to get entrez gene ids for this list, telling it your input is “symbol” with the scopes argument, and you want “entrezgene” (Entrez gene ids) back with the fields argument.

PYTHON

out = mg.querymany(xli, scopes='symbol', fields='entrezgene', species='human')
print(out)

OUTPUT

querying 1-10...done.
Finished.
1 input query terms found no hit:
	['DDX26B']
Pass "returnall=True" to return complete lists of duplicate or missing query terms.
[{'query': 'DDX26B', 'notfound': True}, {'query': 'CCDC83', '_id': '220047', '_score': 18.026102, 'entrezgene': '220047'}, {'query': 'MAST3', '_id': '23031', '_score': 18.120222, 'entrezgene': '23031'}, {'query': 'FLOT1', '_id': '10211', '_score': 18.43404, 'entrezgene': '10211'}, {'query': 'RPL11', '_id': '6135', '_score': 16.711576, 'entrezgene': '6135'}, {'query': 'ZDHHC20', '_id': '253832', '_score': 18.165747, 'entrezgene': '253832'}, {'query': 'LUC7L3', '_id': '51747', '_score': 17.696375, 'entrezgene': '51747'}, {'query': 'SNORD49A', '_id': '26800', '_score': 22.741982, 'entrezgene': '26800'}, {'query': 'CTSH', '_id': '1512', '_score': 17.737492, 'entrezgene': '1512'}, {'query': 'ACOT8', '_id': '10005', '_score': 17.711779, 'entrezgene': '10005'}]

The mapping result is returned as a list of dictionaries. Each dictionary contains the fields we asked to return, in this case, “entrezgene” field. Each dictionary also returns the matching query term, “query”, and an internal id, “**_id“, which is the same as”entrezgene**” most of time (will be an ensembl gene id if a gene is available from Ensembl only). Mygene could not find a mapping for one of the genes.

We could also get Ensembl gene ids. This time, we will use the as.dataframe argument to get the result back as a pandas dataframe.

PYTHON

ensembl_res = mg.querymany(xli, scopes='symbol', fields='ensembl.gene', species='human', as_dataframe=True)
print(ensembl_res)

OUTPUT

querying 1-10...done.
Finished.
1 input query terms found no hit:
	['DDX26B']
Pass "returnall=True" to return complete lists of duplicate or missing query terms.
         notfound     _id     _score     ensembl.gene  \
query                                                   
DDX26B       True     NaN        NaN              NaN   
CCDC83        NaN  220047  18.026253  ENSG00000150676   
MAST3         NaN   23031  18.120222  ENSG00000099308   
FLOT1         NaN   10211  18.416580              NaN   
RPL11         NaN    6135  16.711576  ENSG00000142676   
ZDHHC20       NaN  253832  18.165747  ENSG00000180776   
LUC7L3        NaN   51747  17.701532  ENSG00000108848   
SNORD49A      NaN   26800  22.728123  ENSG00000277370   
CTSH          NaN    1512  17.725145  ENSG00000103811   
ACOT8         NaN   10005  17.717607  ENSG00000101473   

                                                    ensembl  
query                                                        
DDX26B                                                  NaN  
CCDC83                                                  NaN  
MAST3                                                   NaN  
FLOT1     [{'gene': 'ENSG00000206480'}, {'gene': 'ENSG00...  
RPL11                                                   NaN  
ZDHHC20                                                 NaN  
LUC7L3                                                  NaN  
SNORD49A                                                NaN  
CTSH                                                    NaN  
ACOT8                                                   NaN 

Mixed symbols and multiple fields

PYTHON

xli = ['ENSG00000150676',
 'MAST3',
 'FLOT1',
 'RPL11',
 '1007_s_at',
 'AK125780']

The above id list contains multiple types of identifiers. Let’s try taking this list and trying to get back both Entrez gene ids and uniprot ids. Parameters scopes, fields, species are all flexible enough to support multiple values, either a list or a comma-separated string.

PYTHON

query_res = mg.querymany(xli, scopes='symbol,reporter,accession,ensembl.gene', fields='entrezgene,uniprot', species='human', as_dataframe=True)
print(query_res)

OUTPUT

querying 1-6...done.
Finished.
1 input query terms found dup hits:
	[('1007_s_at', 2)]
Pass "returnall=True" to return complete lists of duplicate or missing query terms.
                       _id     _score entrezgene uniprot.Swiss-Prot  \
query                                                                 
ENSG00000150676     220047  24.495731     220047             Q8IWF9   
MAST3                23031  18.115461      23031             O60307   
FLOT1                10211  18.416580      10211             O75955   
RPL11                 6135  16.715294       6135             P62913   
1007_s_at        100616237  18.266214  100616237                NaN   
1007_s_at              780  18.266214        780             Q08345   
AK125780         118142757  21.342941  118142757             P43080   

                                                    uniprot.TrEMBL  
query                                                               
ENSG00000150676                                             H0YDV3  
MAST3            [V9GYV0, A0A8V8TLL8, A0A8I5KST9, A0A8V8TMW0, A...  
FLOT1            [A2AB11, Q5ST80, A2AB13, A2AB10, A2ABJ5, A2AB1...  
RPL11                                 [Q5VVC8, Q5VVD0, A0A2R8Y447]  
1007_s_at                                                      NaN  
1007_s_at        [A0A024RCJ0, A0A024RCL1, A0A0A0MSX3, A0A024RCQ...  
AK125780                                      [A0A7I2V6E2, B2R9P6]  

Finally, we can take a look at all of the information mygene has to offer by setting fields to “all”.

PYTHON

query_res = mg.querymany(xli, scopes='symbol,reporter,accession,ensembl.gene', fields='all', species='human', as_dataframe=True)
print(list(query_res.columns))

OUTPUT

querying 1-6...done.
Finished.
1 input query terms found dup hits:
	[('1007_s_at', 2)]
Pass "returnall=True" to return complete lists of duplicate or missing query terms.
['AllianceGenome', 'HGNC', '_id', '_score', 'alias', 'entrezgene', 'exons', 'exons_hg19', 'generif', 'ipi', 'map_location', 'name', 'other_names', 'pharmgkb', 'symbol', 'taxid', 'type_of_gene', 'unigene', 'accession.genomic', 'accession.protein', 'accession.rna', 'accession.translation', 'ensembl.gene', 'ensembl.protein', 'ensembl.transcript', 'ensembl.translation', 'ensembl.type_of_gene', 'exac._license', 'exac.all.exp_lof', 'exac.all.exp_mis', 'exac.all.exp_syn', 'exac.all.lof_z', 'exac.all.mis_z', 'exac.all.mu_lof', 'exac.all.mu_mis', 'exac.all.mu_syn', 'exac.all.n_lof', 'exac.all.n_mis', 'exac.all.n_syn', 'exac.all.p_li', 'exac.all.p_null', 'exac.all.p_rec', 'exac.all.syn_z', 'exac.bp', 'exac.cds_end', 'exac.cds_start', 'exac.n_exons', 'exac.nonpsych.exp_lof', 'exac.nonpsych.exp_mis', 'exac.nonpsych.exp_syn', 'exac.nonpsych.lof_z', 'exac.nonpsych.mis_z', 'exac.nonpsych.mu_lof', 'exac.nonpsych.mu_mis', 'exac.nonpsych.mu_syn', 'exac.nonpsych.n_lof', 'exac.nonpsych.n_mis', 'exac.nonpsych.n_syn', 'exac.nonpsych.p_li', 'exac.nonpsych.p_null', 'exac.nonpsych.p_rec', 'exac.nonpsych.syn_z', 'exac.nontcga.exp_lof', 'exac.nontcga.exp_mis', 'exac.nontcga.exp_syn', 'exac.nontcga.lof_z', 'exac.nontcga.mis_z', 'exac.nontcga.mu_lof', 'exac.nontcga.mu_mis', 'exac.nontcga.mu_syn', 'exac.nontcga.n_lof', 'exac.nontcga.n_mis', 'exac.nontcga.n_syn', 'exac.nontcga.p_li', 'exac.nontcga.p_null', 'exac.nontcga.p_rec', 'exac.nontcga.syn_z', 'exac.transcript', 'genomic_pos.chr', 'genomic_pos.end', 'genomic_pos.ensemblgene', 'genomic_pos.start', 'genomic_pos.strand', 'genomic_pos_hg19.chr', 'genomic_pos_hg19.end', 'genomic_pos_hg19.start', 'genomic_pos_hg19.strand', 'go.MF.category', 'go.MF.evidence', 'go.MF.id', 'go.MF.pubmed', 'go.MF.qualifier', 'go.MF.term', 'homologene.genes', 'homologene.id', 'interpro.desc', 'interpro.id', 'interpro.short_desc', 'pantherdb.HGNC', 'pantherdb._license', 'pantherdb.ortholog', 'pantherdb.uniprot_kb', 'pharos.target_id', 'reagent.GNF_Qia_hs-genome_v1_siRNA', 'reagent.GNF_hs-ORFeome1_1_reads.id', 'reagent.GNF_hs-ORFeome1_1_reads.relationship', 'reagent.GNF_mm+hs-MGC.id', 'reagent.GNF_mm+hs-MGC.relationship', 'reagent.GNF_mm+hs_RetroCDNA.id', 'reagent.GNF_mm+hs_RetroCDNA.relationship', 'reagent.NOVART_hs-genome_siRNA', 'refseq.genomic', 'refseq.protein', 'refseq.rna', 'refseq.translation', 'reporter.GNF1H', 'reporter.HG-U133_Plus_2', 'reporter.HTA-2_0', 'reporter.HuEx-1_0', 'reporter.HuGene-1_1', 'reporter.HuGene-2_1', 'umls.cui', 'uniprot.Swiss-Prot', 'uniprot.TrEMBL', 'MIM', 'ec', 'interpro', 'pdb', 'pfam', 'prosite', 'summary', 'go.BP', 'go.CC.evidence', 'go.CC.gocategory', 'go.CC.id', 'go.CC.qualifier', 'go.CC.term', 'go.MF', 'reagent.GNF_hs-Origene.id', 'reagent.GNF_hs-Origene.relationship', 'reagent.GNF_hs-druggable_lenti-shRNA', 'reagent.GNF_hs-druggable_plasmid-shRNA', 'reagent.GNF_hs-druggable_siRNA', 'reagent.GNF_hs-pkinase_IDT-siRNA', 'reagent.Invitrogen_IVTHSSIPKv2', 'reagent.NIBRI_hs-Secretome_pDEST.id', 'reagent.NIBRI_hs-Secretome_pDEST.relationship', 'reporter.HG-U95Av2', 'ensembl', 'genomic_pos', 'genomic_pos_hg19', 'go.CC', 'pathway.kegg.id', 'pathway.kegg.name', 'pathway.netpath.id', 'pathway.netpath.name', 'pathway.reactome', 'pathway.wikipathways', 'reagent.GNF_hs-Origene', 'wikipedia.url_stub', 'pir', 'pathway.kegg', 'pathway.pid', 'pathway.wikipathways.id', 'pathway.wikipathways.name', 'miRBase', 'retired', 'reagent.GNF_mm+hs-MGC'

You can find a neater version of all available fields here.

Challenge

Load in the rnaseq dataset as a pandas dataframe. Add columns to the dataframe with entrez gene ids and at least one other field from mygene.

Other bioinformatics resources


Biopython is a library with a wide variety of bioinformatics applications. You can find its tutorial/cookbook here


This lesson was adapted from the official mygene python tutorial