ID mapping using mygene
Last updated on 2023-04-27 | Edit this page
Overview
Questions
- How can I get gene annotations in Python?
Objectives
- Explore the
mygene
library. - Access and merge gene annotations using
mygene
.
Key Points
-
mygene
is Python module which allows access to a gene annotation database. - You can query
mygene
with multiple identifiers usingquerymany
.
Installing mygene from bioconda
ID mapping is a very common, often not fun, task for every bioinformatician. Different datasets and tools often use different sets of biological identifiers. Our goal is to take some list of gene symbols other gene identifiers and convert them to some other set of identifiers to facilitate some analysis.
mygene is a convenient Python module to access MyGene.info gene query web services.
Mygene is a part of a family of biothings API services which also has modules for chemical, diseases, genetic variances, and taxonomy. All of these modules use a simlar Python interface.
Mygene can be installed using conda, but we it exists on a separate
set of biomedical packages called bioconda. We access this using
the -c
flag which tells conda the channel we wish
to search.
Once installed, we can import mygene
and create a
MyGeneInfo
object, which is used for accessing and
downloading genetic annotations.
Mapping gene symbols
Suppose xli is a list of gene symbols we want to convert to entrez gene ids:
PYTHON
xli = ['DDX26B',
'CCDC83',
'MAST3',
'FLOT1',
'RPL11',
'ZDHHC20',
'LUC7L3',
'SNORD49A',
'CTSH',
'ACOT8']
We can then use the querymany method to get entrez
gene ids for this list, telling it your input is “symbol” with the
scopes
argument, and you want “entrezgene” (Entrez gene
ids) back with the fields
argument.
OUTPUT
querying 1-10...done.
Finished.
1 input query terms found no hit:
['DDX26B']
Pass "returnall=True" to return complete lists of duplicate or missing query terms.
[{'query': 'DDX26B', 'notfound': True}, {'query': 'CCDC83', '_id': '220047', '_score': 18.026102, 'entrezgene': '220047'}, {'query': 'MAST3', '_id': '23031', '_score': 18.120222, 'entrezgene': '23031'}, {'query': 'FLOT1', '_id': '10211', '_score': 18.43404, 'entrezgene': '10211'}, {'query': 'RPL11', '_id': '6135', '_score': 16.711576, 'entrezgene': '6135'}, {'query': 'ZDHHC20', '_id': '253832', '_score': 18.165747, 'entrezgene': '253832'}, {'query': 'LUC7L3', '_id': '51747', '_score': 17.696375, 'entrezgene': '51747'}, {'query': 'SNORD49A', '_id': '26800', '_score': 22.741982, 'entrezgene': '26800'}, {'query': 'CTSH', '_id': '1512', '_score': 17.737492, 'entrezgene': '1512'}, {'query': 'ACOT8', '_id': '10005', '_score': 17.711779, 'entrezgene': '10005'}]
The mapping result is returned as a list of dictionaries. Each dictionary contains the fields we asked to return, in this case, “entrezgene” field. Each dictionary also returns the matching query term, “query”, and an internal id, “**_id“, which is the same as”entrezgene**” most of time (will be an ensembl gene id if a gene is available from Ensembl only). Mygene could not find a mapping for one of the genes.
We could also get Ensembl gene ids. This time, we will use the
as.dataframe
argument to get the result back as a pandas
dataframe.
PYTHON
ensembl_res = mg.querymany(xli, scopes='symbol', fields='ensembl.gene', species='human', as_dataframe=True)
print(ensembl_res)
OUTPUT
querying 1-10...done.
Finished.
1 input query terms found no hit:
['DDX26B']
Pass "returnall=True" to return complete lists of duplicate or missing query terms.
notfound _id _score ensembl.gene \
query
DDX26B True NaN NaN NaN
CCDC83 NaN 220047 18.026253 ENSG00000150676
MAST3 NaN 23031 18.120222 ENSG00000099308
FLOT1 NaN 10211 18.416580 NaN
RPL11 NaN 6135 16.711576 ENSG00000142676
ZDHHC20 NaN 253832 18.165747 ENSG00000180776
LUC7L3 NaN 51747 17.701532 ENSG00000108848
SNORD49A NaN 26800 22.728123 ENSG00000277370
CTSH NaN 1512 17.725145 ENSG00000103811
ACOT8 NaN 10005 17.717607 ENSG00000101473
ensembl
query
DDX26B NaN
CCDC83 NaN
MAST3 NaN
FLOT1 [{'gene': 'ENSG00000206480'}, {'gene': 'ENSG00...
RPL11 NaN
ZDHHC20 NaN
LUC7L3 NaN
SNORD49A NaN
CTSH NaN
ACOT8 NaN
Mixed symbols and multiple fields
The above id list contains multiple types of identifiers. Let’s try taking this list and trying to get back both Entrez gene ids and uniprot ids. Parameters scopes, fields, species are all flexible enough to support multiple values, either a list or a comma-separated string.
PYTHON
query_res = mg.querymany(xli, scopes='symbol,reporter,accession,ensembl.gene', fields='entrezgene,uniprot', species='human', as_dataframe=True)
print(query_res)
OUTPUT
querying 1-6...done.
Finished.
1 input query terms found dup hits:
[('1007_s_at', 2)]
Pass "returnall=True" to return complete lists of duplicate or missing query terms.
_id _score entrezgene uniprot.Swiss-Prot \
query
ENSG00000150676 220047 24.495731 220047 Q8IWF9
MAST3 23031 18.115461 23031 O60307
FLOT1 10211 18.416580 10211 O75955
RPL11 6135 16.715294 6135 P62913
1007_s_at 100616237 18.266214 100616237 NaN
1007_s_at 780 18.266214 780 Q08345
AK125780 118142757 21.342941 118142757 P43080
uniprot.TrEMBL
query
ENSG00000150676 H0YDV3
MAST3 [V9GYV0, A0A8V8TLL8, A0A8I5KST9, A0A8V8TMW0, A...
FLOT1 [A2AB11, Q5ST80, A2AB13, A2AB10, A2ABJ5, A2AB1...
RPL11 [Q5VVC8, Q5VVD0, A0A2R8Y447]
1007_s_at NaN
1007_s_at [A0A024RCJ0, A0A024RCL1, A0A0A0MSX3, A0A024RCQ...
AK125780 [A0A7I2V6E2, B2R9P6]
Finally, we can take a look at all of the information mygene has to
offer by setting fields
to “all”.
PYTHON
query_res = mg.querymany(xli, scopes='symbol,reporter,accession,ensembl.gene', fields='all', species='human', as_dataframe=True)
print(list(query_res.columns))
OUTPUT
querying 1-6...done.
Finished.
1 input query terms found dup hits:
[('1007_s_at', 2)]
Pass "returnall=True" to return complete lists of duplicate or missing query terms.
['AllianceGenome', 'HGNC', '_id', '_score', 'alias', 'entrezgene', 'exons', 'exons_hg19', 'generif', 'ipi', 'map_location', 'name', 'other_names', 'pharmgkb', 'symbol', 'taxid', 'type_of_gene', 'unigene', 'accession.genomic', 'accession.protein', 'accession.rna', 'accession.translation', 'ensembl.gene', 'ensembl.protein', 'ensembl.transcript', 'ensembl.translation', 'ensembl.type_of_gene', 'exac._license', 'exac.all.exp_lof', 'exac.all.exp_mis', 'exac.all.exp_syn', 'exac.all.lof_z', 'exac.all.mis_z', 'exac.all.mu_lof', 'exac.all.mu_mis', 'exac.all.mu_syn', 'exac.all.n_lof', 'exac.all.n_mis', 'exac.all.n_syn', 'exac.all.p_li', 'exac.all.p_null', 'exac.all.p_rec', 'exac.all.syn_z', 'exac.bp', 'exac.cds_end', 'exac.cds_start', 'exac.n_exons', 'exac.nonpsych.exp_lof', 'exac.nonpsych.exp_mis', 'exac.nonpsych.exp_syn', 'exac.nonpsych.lof_z', 'exac.nonpsych.mis_z', 'exac.nonpsych.mu_lof', 'exac.nonpsych.mu_mis', 'exac.nonpsych.mu_syn', 'exac.nonpsych.n_lof', 'exac.nonpsych.n_mis', 'exac.nonpsych.n_syn', 'exac.nonpsych.p_li', 'exac.nonpsych.p_null', 'exac.nonpsych.p_rec', 'exac.nonpsych.syn_z', 'exac.nontcga.exp_lof', 'exac.nontcga.exp_mis', 'exac.nontcga.exp_syn', 'exac.nontcga.lof_z', 'exac.nontcga.mis_z', 'exac.nontcga.mu_lof', 'exac.nontcga.mu_mis', 'exac.nontcga.mu_syn', 'exac.nontcga.n_lof', 'exac.nontcga.n_mis', 'exac.nontcga.n_syn', 'exac.nontcga.p_li', 'exac.nontcga.p_null', 'exac.nontcga.p_rec', 'exac.nontcga.syn_z', 'exac.transcript', 'genomic_pos.chr', 'genomic_pos.end', 'genomic_pos.ensemblgene', 'genomic_pos.start', 'genomic_pos.strand', 'genomic_pos_hg19.chr', 'genomic_pos_hg19.end', 'genomic_pos_hg19.start', 'genomic_pos_hg19.strand', 'go.MF.category', 'go.MF.evidence', 'go.MF.id', 'go.MF.pubmed', 'go.MF.qualifier', 'go.MF.term', 'homologene.genes', 'homologene.id', 'interpro.desc', 'interpro.id', 'interpro.short_desc', 'pantherdb.HGNC', 'pantherdb._license', 'pantherdb.ortholog', 'pantherdb.uniprot_kb', 'pharos.target_id', 'reagent.GNF_Qia_hs-genome_v1_siRNA', 'reagent.GNF_hs-ORFeome1_1_reads.id', 'reagent.GNF_hs-ORFeome1_1_reads.relationship', 'reagent.GNF_mm+hs-MGC.id', 'reagent.GNF_mm+hs-MGC.relationship', 'reagent.GNF_mm+hs_RetroCDNA.id', 'reagent.GNF_mm+hs_RetroCDNA.relationship', 'reagent.NOVART_hs-genome_siRNA', 'refseq.genomic', 'refseq.protein', 'refseq.rna', 'refseq.translation', 'reporter.GNF1H', 'reporter.HG-U133_Plus_2', 'reporter.HTA-2_0', 'reporter.HuEx-1_0', 'reporter.HuGene-1_1', 'reporter.HuGene-2_1', 'umls.cui', 'uniprot.Swiss-Prot', 'uniprot.TrEMBL', 'MIM', 'ec', 'interpro', 'pdb', 'pfam', 'prosite', 'summary', 'go.BP', 'go.CC.evidence', 'go.CC.gocategory', 'go.CC.id', 'go.CC.qualifier', 'go.CC.term', 'go.MF', 'reagent.GNF_hs-Origene.id', 'reagent.GNF_hs-Origene.relationship', 'reagent.GNF_hs-druggable_lenti-shRNA', 'reagent.GNF_hs-druggable_plasmid-shRNA', 'reagent.GNF_hs-druggable_siRNA', 'reagent.GNF_hs-pkinase_IDT-siRNA', 'reagent.Invitrogen_IVTHSSIPKv2', 'reagent.NIBRI_hs-Secretome_pDEST.id', 'reagent.NIBRI_hs-Secretome_pDEST.relationship', 'reporter.HG-U95Av2', 'ensembl', 'genomic_pos', 'genomic_pos_hg19', 'go.CC', 'pathway.kegg.id', 'pathway.kegg.name', 'pathway.netpath.id', 'pathway.netpath.name', 'pathway.reactome', 'pathway.wikipathways', 'reagent.GNF_hs-Origene', 'wikipedia.url_stub', 'pir', 'pathway.kegg', 'pathway.pid', 'pathway.wikipathways.id', 'pathway.wikipathways.name', 'miRBase', 'retired', 'reagent.GNF_mm+hs-MGC'
You can find a neater version of all available fields here.
Challenge
Load in the rnaseq dataset as a pandas dataframe. Add columns to the dataframe with entrez gene ids and at least one other field from mygene.
Other bioinformatics resources
Biopython is a library with a wide variety of bioinformatics applications. You can find its tutorial/cookbook here
This lesson was adapted from the official mygene python tutorial