Team:Paris Bettencourt/Project/Bioinformatics


Meta-Analysis of Odor-Related Genes


The National Institute of Health's (NIH) Human Microbiome Project (HMP) attempted to "characterize microbial communities found at multiple human body sites and to look for correlations between changes in the microbiome and human health". There were several studies that sprouted from the data produced by the HMP. One such study was done by the Huttenhower lab called HUMAnN: The HMP Unified Metabolic Analysis Network, a pipeline for efficient and accurate determination of the presence or absence and abundance of microbial pathways in a community using metagenomic data (Abubucker, 2012).

The abundances for each orthologous gene family (or groups of genes that perform approximately the same biological role) was reported in the units of read hits. This particular analysis used the KEGG Orthology (KO) database. Read hits refer to a read that maps to a gene sequence within a particular KO. These hits are weighted using two ways: 1. If a read hits multiple sequences, its weight is distributed among them in proportion to the strength of each mapping and 2. hits to longer sequences are down-weighted, since longer sequences contribute more reads to a metagenome due to the random sampling process of metagenomic studies (Abubucker, 2012).

The goal of this sub-project was to find genes related to odor from the HUMAnN analysis and see how the odor profile of various body sites (ear, nose, mouth, vagina, stool) and genders varies.


There is no large difference in odor profiles between males and females at the body sites sampled. In general, the abundance of odor related genes was slightly lower for women than for men; however, the general trend remained the same with both genders (Fig. 1).

The difference in odor profile between the five body sites was more stark. Fig. 2 shows the abundance of the seven odor related genes found in the HUMAnN analysis at the varying body sites. It is clear from this figure that acetate kinase (ackA), involved in fermentation pathways which can generate lactic acid and glycerol which can lead to the formation of carboxylic acids that contribute to acidic odor, is most abundant in vaginas. Leucine dehydrogenase (leuD), on the other hand, which leads to the formation of isovaleric acid (a compound with a characteristic cheese smell), is found to be most abundant in the ear.

Outer membrane lipoprotein Blc (apoD) is found to be most abundant in stool samples. This is interesting because this particular protein is most expressed in the apocrine glands, which are found in some parts of the external genitalia. It may be that some of the proteins expressed in the apocrine glands were transported into the stool samples. Finally, fatty acid dehydrogenases (the fad genes) were only abundant in the nose (with the exception of fadD which was found in large abundance in almost all the body sites). These genes are involved in fatty acid metabolism, which can generate volatile fatty acids typically associated with odor.

Figure 1. Log-scale abundance profiles of body-odor related genes at the five different body sites in males and females.

Figure 2. Abundance of body odor related genes (ackA, leuD, apoD, fadA, fadB, fadD, and fadE) in five different body sites: ear, mouth, nose, stool, and vagina.The data was derived from the HUMAnN analysis of the Human Microbiome Project database. The data was collected for fifteen different body sites; for this analysis, however, the fifteen body sites were combined into the five main ones listed above.

Pipeline for Characterization of Odor-Related Genes


There have been several deep sequencing studies performed on genes known to be related to body odor. A pre-defined bioinformatics pipeline was created in order to characterize odor related genes in order to analyze some of the large amount of whole genome sequencing (WGS) data that already exists in databases such as the Human Microbiome Project (HMP), the National Center for Biotechnology Information (NCBI), the DNA Databank of Japan (DDBJ), and the Sanger Center. Furthermore, these studies were used to supplement laboratory research, such as determining targets for CRISPRs on odor-related genes in the "Don't Sweat It" project.


Overall Pipeline:
1. Find whole genome shotgun sequences through various databases (Sequence Read Archive (SRA), HMP, etc.) for the organism in question.
2. Find nucleotide sequence on NCBI for gene in question.
3. Run a whole sequence alignment (paired-end alignment mode) with the gene as the reference using Bowtie2, a memory-efficient tool for aligning sequencing reads to long reference sequences (Langmead, 2012).
4. Run BLAST on consensus sequence from alignment from Integrative Genomics Viewer (Robinson, 2012).
5. Determine mutation rate versus nucleotide from consensus sequence data.
6. Use homology modeling, protein family domains data, and other structural information to determine the likelihood of mutations.

Mutation rate was determined by the following approach:
1. Determine the matrix for determining the consensus sequence, which contains information about the number of times each nucleotide (A, T, G, C, or unknown) is found at each position on the gene sequence.
2. For each position, determine a mutation rate using the following metric:

Percent Correct Nucleotide (PCN) = Max Nucleotide Value at Position / Total Number of Nucleotides at Position
Mutation Rate = 1/PCN


As an example, a case study with ackA, or acetate kinase, in Staphylococcus aureus is highlighted. This gene is a involved in the catabolic formation of ATP and known to be responsible for body odor in the human axilla (Tauch, 2013). Fig. 3 shows a screenshot of the sequence alignment of the WGS reads with the reference gene (ackA). Fig. 4a shows a graph of the mutation rates vs. nucleotide position. Fig 4b highlights the most likely nucleotide positions for these mutations and the corresponding translated amino acid residues. Furthermore, it also showcases whether the residues are solvent exposed or not or if they are structurally or functionally important. This data was determined using the Consurf server for protein structure prediction (Celniker, 2013). As seen from Fig. 4a, there may be an edge effect that is not accounted for in the metric of calculation of the mutation rate. This is a limitation of this pipeline and the metric needs to be further optimized. Finally, a 3D model was found using a consensus solution from 3D structure prediction tools such as Phyre2 and Consurf (Fig. 5) (Kelley, 2009). The residues corresponding to the mutated residues are highlighted on the structure in order to determine whether these residues correspond to a structurally or functionally relevant location, which would mark as a potentially important CRISPR target. A similar analysis was performed for other odor related enzymes, including leucine dehydrogenase (LeuDH), lactate dehydrogenase (Ldh), and C-S lyase (AecD).

Figure 5. Mutated amino acid residues highlighted on predicted 3D structure of AckA. Model predicted by Phyre2 3D prediction web-server (Kelley, 2009).

Figure 3. Alignment between WGS reads for ackA from the Sanger Institute.

Figure 4. Mutated nucleotide positions and amino acid residues determined by WGS alignment with reference gene (ackA). A) Graph of mutation rate versus nucleotide position as calculated using the metric presented in the Methods section. B) Table of predicted nucleotide positions on ackA and the corresponding amino acid residue. Includes information on solvent exposure, and structural and functional importance of each residue.

Centre for Research and Interdisciplinarity (CRI)
Faculty of Medicine Cochin Port-Royal, South wing, 2nd floor
Paris Descartes University
24, rue du Faubourg Saint Jacques
75014 Paris, France
+33 1 44 41 25 22/25
Copyright (c) 2014 All rights reserved.