Team:Paris Bettencourt/Project/Bioinformatics

From 2014.igem.org

(Difference between revisions)
 
(48 intermediate revisions not shown)
Line 60: Line 60:
vertical-align : middle;
vertical-align : middle;
position : relative;
position : relative;
-
width : 44%;
+
width : 43%;
-
margin-left : 5%;
+
margin-left : 4%;
                 font-size : 15px;
                 font-size : 15px;
 +
                float : left;
}
}
.project .text2 {
.project .text2 {
-
position : absolute;
+
float : right;
-
width : 44%;
+
width : 43%;
-
height : 250px;
+
                 margin-right : 4%;
-
                left : 51%;
+
-
                 margin-top : 60px;
+
                 font-size : 15px;
                 font-size : 15px;
}
}
Line 115: Line 114:
         </div>
         </div>
<table id=tablelien>
<table id=tablelien>
 +
<tr>
 +
</tr>
</table>
</table>
<div id=part1 class=project>
<div id=part1 class=project>
-
<p class=text2> <img src="https://static.igem.org/mediawiki/2014/5/57/Body_odor_abundance_pretty_pb.png"></br><span class=legende><b>Figure 1. Log-scale abundance profiles of body-odor related genes at the five different body sites in males and females. </b></span></br></br>
+
<h6>Meta-Analysis of Odor-Related Genes </h6><br><br>
-
 
+
-
<img src="https://static.igem.org/mediawiki/2014/8/82/Body_odor_abundance_pb.png"></br><span class=legende><b>Figure 2. Abundance of body odor related genes (ackA, leuD, apoD, fadA, fadB, fadD, and fadE) in five different body sites: ear, mouth, nose, stool, and vagina.</b>The data was derived from the HUMAnN analysis of the Human Microbiome Project database. The data was collected for fifteen different body sites; for this analysis, however, the fifteen body sites were combined into the five main ones listed above.</span></br></br>
+
-
</p>
+
-
<h6>Meta-Analysis of Odor-Related Genes </h6><br>
+
<p class=text1>
<p class=text1>
-
<strong style="font-size: 125%;">Introduction</strong><br>
+
<strong style="font-size: 125%;">Introduction</strong><br></br>
-
The National Institute of Health's (NIH) Human Microbiome Project (HMP) attempted to "characterize microbial communities found at multiple human body sites and to look for correlations between changes in the microbiome and human health." There were several studies that sprouted from the data produced by the HMP, and one such study was done by the Huttenhower lab called HUMAnN: The HMP Unified Metabolic Analysis Network, a pipeline for efficient and accurate determination of the presence or absence and abundance of microbial pathways in a community using metagenomic data. <br><br>
+
The National Institute of Health's (NIH) Human Microbiome Project (HMP) attempted to "characterize microbial communities found at multiple human body sites and to look for correlations between changes in the microbiome and human health". There were several studies that sprouted from the data produced by the HMP. One such study was done by the Huttenhower lab called HUMAnN: The HMP Unified Metabolic Analysis Network, a pipeline for efficient and accurate determination of the presence or absence and abundance of microbial pathways in a community using metagenomic data (<a href="https://2014.igem.org/Team:Paris_Bettencourt/Bibliograpy">Abubucker, 2012</a>). <br><br>
-
The abundances for each orthologous gene family (or groups of genes that perform approximately the same biological role) was reported in the units of read hits. This particular analysis used the KEGG Orthology (KO) database. Read hits refer to a read that maps to a gene sequence within a particular KO. These hits are weighted using two ways: 1. If a read hits multiple sequences, its weight is distributed among them in proportion to the strength of each mapping and 2. hits to longer sequences are down-weighted, since longer sequences contribute more reads to a metagenome due to the random sampling process of metagenomic studies. <br><br>
+
The abundances for each orthologous gene family (or groups of genes that perform approximately the same biological role) was reported in the units of read hits. This particular analysis used the KEGG Orthology (KO) database. Read hits refer to a read that maps to a gene sequence within a particular KO. These hits are weighted using two ways: 1. If a read hits multiple sequences, its weight is distributed among them in proportion to the strength of each mapping and 2. hits to longer sequences are down-weighted, since longer sequences contribute more reads to a metagenome due to the random sampling process of metagenomic studies (<a href="https://2014.igem.org/Team:Paris_Bettencourt/Bibliograpy">Abubucker, 2012</a>). <br><br>
The <strong>goal</strong> of this sub-project was to find genes related to odor from the HUMAnN analysis and see how the odor profile of various body sites (ear, nose, mouth, vagina, stool) and genders varies.  <br><br>  
The <strong>goal</strong> of this sub-project was to find genes related to odor from the HUMAnN analysis and see how the odor profile of various body sites (ear, nose, mouth, vagina, stool) and genders varies.  <br><br>  
-
<strong style="font-size: 125%;">Discussion</strong><br>
+
<br><br><strong style="font-size: 125%;">Discussion</strong><br></br>
-
There is no large difference in odor profiles between males and females at the body sites sampled. As shown in Fig. 1, in general, the abundance of odor related genes was slightly lower for women than for men; however, the general trend remained the same with both genders. <br><br>
+
There is no large difference in odor profiles between males and females at the body sites sampled. In general, the abundance of odor related genes was slightly lower for women than for men; however, the general trend remained the same with both genders (Fig. 1). <br><br>
-
The difference in odor profile between the five body sites was more stark. Fig. 2 shows the abundance of the seven odor related genes found in the HUMAnN analysis at the varying body sites. It is clear from this figure that acetate kinase (ackA), involved in fermentation pathways which can generate lactic acid and glycerol which can lead to the formation of carboxylic acids that contribute to acidic odor, is most abundant in vaginas. Leucine dehydrogenase (leuD), on the other hand, which leads to the formation of isovaleric acid (a compound with a characteristic cheese smell), is found to be most abundant in the ear.<br><br>
+
The difference in odor profile between the five body sites was more stark. Fig. 2 shows the abundance of the seven odor related genes found in the HUMAnN analysis at the varying body sites. It is clear from this figure that acetate kinase (<i>ackA</i>), involved in fermentation pathways which can generate lactic acid and glycerol which can lead to the formation of carboxylic acids that contribute to acidic odor, is most abundant in vaginas. Leucine dehydrogenase (<i>leuD</i>), on the other hand, which leads to the formation of isovaleric acid (a compound with a characteristic cheese smell), is found to be most abundant in the ear.<br><br>
-
Outer membrane lipoprotein Blc (apoD) is found to be most abundant in stool samples, which is interesting since this particular protein is most expressed in the apocrine glands, which are found in some parts of the external genitalia. It may be that some of the proteins expressed in the apocrine glands were transported into the stool samples. Finally, fatty acid dehydrogenases (the fad genes) were very abundant in the nose (which the exception of fadD which was found in large abundance in almost all the body sites). These genes are involved in fatty acid metabolism, which can generate volatile fatty acids, typically associated with odor. <br><br>  </p>
+
Outer membrane lipoprotein Blc (<i>apoD</i>) is found to be most abundant in stool samples. This is interesting because this particular protein is most expressed in the apocrine glands, which are found in some parts of the external genitalia. It may be that some of the proteins expressed in the apocrine glands were transported into the stool samples. Finally, fatty acid dehydrogenases (the <i>fad</i> genes) were only abundant in the nose (with the exception of <i>fadD</i> which was found in large abundance in almost all the body sites). These genes are involved in fatty acid metabolism, which can generate volatile fatty acids typically associated with odor. <br><br>  </p>
-
</div>
+
<p class=text2> <img src="https://static.igem.org/mediawiki/2014/5/57/Body_odor_abundance_pretty_pb.png"></br><span class=legende><b>Figure 1. Log-scale abundance profiles of body-odor related genes at the five different body sites in males and females. </b></span></br></br>
-
<div id=part2 class=project>
+
 
-
<p class=text2></p>
+
<img src="https://static.igem.org/mediawiki/2014/8/82/Body_odor_abundance_pb.png"></br><span class=legende><b>Figure 2. Abundance of body odor related genes (<i>ackA, leuD, apoD, fadA, fadB, fadD</i>, and <i>fadE</i>) in five different body sites: ear, mouth, nose, stool, and vagina.</b>The data was derived from the HUMAnN analysis of the Human Microbiome Project database. The data was collected for fifteen different body sites; for this analysis, however, the fifteen body sites were combined into the five main ones listed above.</span></br></br>
-
<h6>Pipeline for Characterization of Odor-Related Genes</h6><br>
+
</p><br><br>
 +
<h6>Pipeline for Characterization of Odor-Related Genes</h6><br>
<p class=text1>
<p class=text1>
-
<strong style="font-size: 125%;">Introduction</strong><br>
+
<strong style="font-size: 125%;">Introduction</strong><br></br>
-
There have been several deep sequencing studies performed on genes known to be related to body odor. A pre-defined bioinformatics pipeline was created in order to characterize odor related genes in order to analyze some of the large amount of whole genome sequencing (WGS) data that already exists in databases such as the Human Microbiome Project (HMP), the National Center for Biotechnology Information (NCBI), the DNA Databank of Japan (DDBJ), and the Sanger Center. Furthermore, these studies were used to supplement laboratory research, such as determining targets for CRISPRs on odor-related genes in the "Don't Sweat It" project. <br><br>
+
There have been several deep sequencing studies performed on genes known to be related to body odor. A pre-defined bioinformatics pipeline was created in order to characterize odor related genes in order to analyze some of the large amount of whole genome sequencing (WGS) data that already exists in databases such as the Human Microbiome Project (HMP), the National Center for Biotechnology Information (NCBI), the DNA Databank of Japan (DDBJ), and the Sanger Center. Furthermore, these studies were used to supplement laboratory research, such as determining targets for CRISPRs on odor-related genes in the <a href="https://2014.igem.org/Team:Paris_Bettencourt/Project/Eliminate_Smell">"Don't Sweat It"</a> project. <br><br>
-
<strong style="font-size: 125%;"> Methods </strong><br>
+
<br><br><strong style="font-size: 125%;"> Methods </strong><br></br>
<b>Overall Pipeline:</b><br>
<b>Overall Pipeline:</b><br>
1. Find whole genome shotgun sequences through various databases (Sequence Read Archive (SRA), HMP, etc.) for the organism in question. <br>
1. Find whole genome shotgun sequences through various databases (Sequence Read Archive (SRA), HMP, etc.) for the organism in question. <br>
-
2. Find nucleotide sequence on NCBI for gene in question.<br>
+
2. Find nucleotide sequence on NCBI for gene in question. <br>
-
3. Run a whole sequence alignment (paired-end alignment mode) with the gene as the reference using Bowtie2, a memory-efficient tool for aligning sequencing reads to long reference sequences (REF). <br>
+
3. Run a whole sequence alignment (paired-end alignment mode) with the gene as the reference using Bowtie2, a memory-efficient tool for aligning sequencing reads to long reference sequences (<a href="https://2014.igem.org/Team:Paris_Bettencourt/Bibliograpy">Langmead, 2012</a>). <br>
-
4. Run BLAST on consensus sequence from alignment from Integrative Genomics Viewer (REF).<br>
+
4. Run BLAST on consensus sequence from alignment from Integrative Genomics Viewer (<a href="https://2014.igem.org/Team:Paris_Bettencourt/Bibliograpy">Robinson, 2012</a>). <br>
-
5. Determine mutation rate versus nucleotide from consensus sequence data.<br>
+
5. Determine mutation rate versus nucleotide from consensus sequence data. <br>
-
6. Use homology modeling, protein family domains data, and other structural information to determine the likelihood of mutations.<br><br>
+
6. Use homology modeling, protein family domains data, and other structural information to determine the likelihood of mutations. <br><br>
<b>Mutation rate was determined by the following approach:</b><br>
<b>Mutation rate was determined by the following approach:</b><br>
Line 160: Line 158:
<b>Mutation Rate</b> = 1/PCN<br><br>
<b>Mutation Rate</b> = 1/PCN<br><br>
-
<strong style="font-size: 125%;"> Results </strong><br>
+
<br><br><strong style="font-size: 125%;"> Results </strong><br></br>
-
As an example, a case study with <i>ackA</i>, or acetate kinase, in <i>Staphylococcus aureus</i> is highlighted. This gene is a involved in the catabolic formation of ATP and known to be responsible for body odor in the human axilla (Tauch, 2013). Fig. 3 shows a screenshot of the sequence alignment of the WGS reads with the reference gene (<i>ackA</i>). Fig. 4a shows a graph of the mutation rates vs. nucleotide position. Fig 4b highlights the most likely nucleotide positions for these mutations and the corresponding translated amino acid residues. Furthermore, it also showcases whether the residues are solvent exposed or not or if they are structurally or functionally important. This data was determined using the Consurf server for protein structure prediction (REF). As seen from Fig. 4a, there may be an edge effect that is not accounted for in the metric of calculation of the mutation rate. This is a limitation of this pipeline and the metric needs to be further optimized. Finally, a 3D model was found using a consensus solution from 3D structure prediction tools such as Phyre2 and Consurf (REF). The residues corresponding to the mutated residues are highlighted on the structure in order to determine whether these residues correspond to a structurally or functionally relevant location, which would mark as a potentially important CRISPR target. A similar analysis was performed for other odor related genes, including leucine dehydrogenase (LeuDH), lactate dehydrogenase (Ldh), and C-S lyase (AecD).
+
As an example, a case study with <i>ackA</i>, or acetate kinase, in <i>Staphylococcus aureus</i> is highlighted. This gene is a involved in the catabolic formation of ATP and known to be responsible for body odor in the human axilla (<a href="https://2014.igem.org/Team:Paris_Bettencourt/Bibliograpy">Tauch, 2013</a>). Fig. 3 shows a screenshot of the sequence alignment of the WGS reads with the reference gene (<i>ackA</i>). Fig. 4a shows a graph of the mutation rates vs. nucleotide position. Fig 4b highlights the most likely nucleotide positions for these mutations and the corresponding translated amino acid residues. Furthermore, it also showcases whether the residues are solvent exposed or not or if they are structurally or functionally important. This data was determined using the Consurf server for protein structure prediction (<a href="https://2014.igem.org/Team:Paris_Bettencourt/Bibliograpy">Celniker, 2013</a>). As seen from Fig. 4a, there may be an edge effect that is not accounted for in the metric of calculation of the mutation rate. This is a limitation of this pipeline and the metric needs to be further optimized. Finally, a 3D model was found using a consensus solution from 3D structure prediction tools such as Phyre2 and Consurf (Fig. 5) (<a href="https://2014.igem.org/Team:Paris_Bettencourt/Bibliograpy">Kelley, 2009</a>). The residues corresponding to the mutated residues are highlighted on the structure in order to determine whether these residues correspond to a structurally or functionally relevant location, which would mark as a potentially important CRISPR target. A similar analysis was performed for other odor related enzymes, including leucine dehydrogenase (LeuDH), lactate dehydrogenase (Ldh), and C-S lyase (AecD). <br><br>
 +
 
 +
<img src="https://static.igem.org/mediawiki/2014/0/00/AckA_S_aureus_highlighted_mutations.jpg"></br><span class=legende><b>Figure 5. Mutated amino acid residues highlighted on predicted 3D structure of AckA. </b> Model predicted by Phyre2 3D prediction web-server (Kelley, 2009). </span></br></br>
 +
</p>
 +
 +
<p class=text2>
 +
<img src="https://static.igem.org/mediawiki/2014/b/bc/Igv_ackA_pb.png"></br><span class=legende><b>Figure 3. Alignment between WGS reads for <i>ackA</i> from the Sanger Institute.</b> </span></br></br>
 +
 
 +
<img src="https://static.igem.org/mediawiki/2014/b/bd/Bioinformatics_figure_4_pb.png"></br><span class=legende><b>Figure 4. Mutated nucleotide positions and amino acid residues determined by WGS alignment with reference gene (<i>ackA</i>). </b>A) Graph of mutation rate versus nucleotide position as calculated using the metric presented in the Methods section. B) Table of predicted nucleotide positions on <i>ackA</i> and the corresponding amino acid residue. Includes information on solvent exposure, and structural and functional importance of each residue.  </span></br></br>
 +
 
 +
 
 +
 
</p>
</p>
</div>
</div>

Latest revision as of 02:58, 18 October 2014

Meta-Analysis of Odor-Related Genes


Introduction

The National Institutes of Health's (NIH) Human Microbiome Project (HMP) attempted to "characterize microbial communities found at multiple human body sites and to look for correlations between changes in the microbiome and human health". Several studies grew out of the data produced by the HMP. One of these, from the Huttenhower lab, is HUMAnN: The HMP Unified Metabolic Analysis Network, a pipeline for efficient and accurate determination of the presence or absence and abundance of microbial pathways in a community using metagenomic data (Abubucker, 2012).

The abundance of each orthologous gene family (a group of genes that perform approximately the same biological role) was reported in units of read hits. This particular analysis used the KEGG Orthology (KO) database. A read hit is a read that maps to a gene sequence within a particular KO. These hits are weighted in two ways: (1) if a read hits multiple sequences, its weight is distributed among them in proportion to the strength of each mapping, and (2) hits to longer sequences are down-weighted, since longer sequences contribute more reads to a metagenome due to the random sampling process of metagenomic studies (Abubucker, 2012).
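
The two weighting rules can be illustrated with a small Python sketch. This is not HUMAnN's implementation: the function name, the input layout, the use of a raw alignment score as the "strength" of a mapping, and the per-base length normalization are all assumptions made for illustration.

```python
from collections import defaultdict


def weighted_ko_abundance(alignments, seq_lengths, seq_to_ko):
    """Toy version of the two weighting rules described above.

    alignments  : list of (read_id, seq_id, score) tuples; score > 0 is a
                  hypothetical stand-in for mapping strength.
    seq_lengths : dict seq_id -> sequence length in bases.
    seq_to_ko   : dict seq_id -> KO identifier (hypothetical labels).
    """
    # Total score per read, needed to split one read among several hits.
    total_score = defaultdict(float)
    for read_id, _seq_id, score in alignments:
        total_score[read_id] += score

    abundance = defaultdict(float)
    for read_id, seq_id, score in alignments:
        # Rule 1: a read's unit weight is shared in proportion to score.
        share = score / total_score[read_id]
        # Rule 2: hits to longer sequences are down-weighted
        # (assumed here to be a simple division by length).
        abundance[seq_to_ko[seq_id]] += share / seq_lengths[seq_id]
    return dict(abundance)


if __name__ == "__main__":
    hits = [("r1", "seqA", 30.0), ("r1", "seqB", 10.0), ("r2", "seqA", 20.0)]
    lengths = {"seqA": 1000, "seqB": 500}
    kos = {"seqA": "KO_A", "seqB": "KO_B"}
    print(weighted_ko_abundance(hits, lengths, kos))
```

In the toy input, read r1 is split 0.75/0.25 between the two sequences, read r2 counts fully toward seqA, and each contribution is then divided by the length of the sequence it hit.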

The goal of this sub-project was to find odor-related genes in the HUMAnN analysis and to see how the odor profile varies across body sites (ear, nose, mouth, vagina, stool) and between genders.



Discussion

There was no large difference in odor profiles between males and females at the body sites sampled. In general, the abundance of odor-related genes was slightly lower for women than for men; however, the overall trend was the same for both genders (Fig. 1).

The difference in odor profile between the five body sites was more stark. Fig. 2 shows the abundance of the seven odor-related genes found in the HUMAnN analysis at each body site. Acetate kinase (ackA) is most abundant in the vagina. This enzyme is involved in fermentation pathways that can generate lactic acid and glycerol, which can in turn lead to the formation of carboxylic acids that contribute to acidic odor. Leucine dehydrogenase (leuD), on the other hand, which leads to the formation of isovaleric acid (a compound with a characteristic cheese smell), is most abundant in the ear.

Outer membrane lipoprotein Blc (apoD) is most abundant in stool samples. This is interesting because this particular protein is most expressed in the apocrine glands, which are found in some parts of the external genitalia. It may be that some of the proteins expressed in the apocrine glands were transported into the stool samples. Finally, the fatty acid dehydrogenases (the fad genes) were abundant only in the nose, with the exception of fadD, which was found in large abundance at almost all of the body sites. These genes are involved in fatty acid metabolism, which can generate volatile fatty acids typically associated with odor.


Figure 1. Log-scale abundance profiles of body-odor related genes at the five different body sites in males and females.


Figure 2. Abundance of body-odor related genes (ackA, leuD, apoD, fadA, fadB, fadD, and fadE) in five different body sites: ear, mouth, nose, stool, and vagina. The data were derived from the HUMAnN analysis of the Human Microbiome Project database. The data were collected for fifteen different body sites; for this analysis, however, the fifteen body sites were combined into the five main ones listed above.



Pipeline for Characterization of Odor-Related Genes

Introduction

Several deep sequencing studies have been performed on genes known to be related to body odor. A pre-defined bioinformatics pipeline was created to characterize odor-related genes using some of the large amount of whole genome sequencing (WGS) data that already exists in databases such as the Human Microbiome Project (HMP), the National Center for Biotechnology Information (NCBI), the DNA Databank of Japan (DDBJ), and the Sanger Institute. Furthermore, these analyses were used to supplement laboratory research, such as determining targets for CRISPRs on odor-related genes in the "Don't Sweat It" project.



Methods

Overall Pipeline:
1. Find whole genome shotgun sequences through various databases (Sequence Read Archive (SRA), HMP, etc.) for the organism in question.
2. Find the nucleotide sequence of the gene in question on NCBI.
3. Run a whole sequence alignment (paired-end alignment mode) with the gene as the reference using Bowtie2, a memory-efficient tool for aligning sequencing reads to long reference sequences (Langmead, 2012).
4. Run BLAST on the consensus sequence of the alignment, obtained from the Integrative Genomics Viewer (Robinson, 2012).
5. Determine the mutation rate versus nucleotide position from the consensus sequence data.
6. Use homology modeling, protein family domains data, and other structural information to determine the likelihood of mutations.
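
Step 3 of the pipeline can be sketched as two Bowtie2 invocations, assembled here in Python without being executed. The file names and the helper function are hypothetical, and the team's actual command lines and options are not given in the text; only the standard `bowtie2-build` index step and the `-x`, `-1`, `-2`, and `-S` options of `bowtie2` are used.

```python
import shlex


def bowtie2_commands(gene_fasta, reads_1, reads_2,
                     index_base="gene_index", sam_out="aligned.sam"):
    """Return (build_cmd, align_cmd) as argument lists.

    All paths are placeholders. The commands are only constructed here;
    running them requires Bowtie2 to be installed.
    """
    # Index the reference gene sequence (the gene is the reference).
    build_cmd = ["bowtie2-build", gene_fasta, index_base]
    # Paired-end alignment: -1/-2 are the two mate files, -S the SAM output.
    align_cmd = ["bowtie2", "-x", index_base,
                 "-1", reads_1, "-2", reads_2,
                 "-S", sam_out]
    return build_cmd, align_cmd


if __name__ == "__main__":
    build, align = bowtie2_commands("ackA.fasta", "reads_1.fastq", "reads_2.fastq")
    print(shlex.join(build))
    print(shlex.join(align))
```

The printed strings can be pasted into a shell, or the lists can be passed to `subprocess.run` once the input files exist.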

Mutation rate was determined by the following approach:
1. Build the count matrix used to determine the consensus sequence, which records the number of times each nucleotide (A, T, G, C, or unknown) is found at each position on the gene sequence.
2. For each position, determine a mutation rate using the following metric:

Percent Correct Nucleotide (PCN) = Max Nucleotide Value at Position / Total Number of Nucleotides at Position
Mutation Rate = 1/PCN
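
A minimal Python sketch of this metric, applied to a small made-up count matrix, is given below. The dictionary layout, the "N" key for unknown bases, and the choice to include unknown bases in both the maximum and the total are assumptions; the text does not specify how unknowns or uncovered positions were handled. As defined, the metric equals 1 when every read agrees at a position and increases as agreement drops.

```python
def mutation_rates(count_matrix):
    """Compute 1/PCN for each position of a per-position count matrix.

    count_matrix : list of dicts, one per position, mapping a base
                   ("A", "T", "G", "C", or "N" for unknown) to its count.
    Returns a list of floats, with None for positions with no coverage
    (an assumption; the source does not say how these are treated).
    """
    rates = []
    for counts in count_matrix:
        total = sum(counts.values())
        if total == 0:
            rates.append(None)  # no reads cover this position
            continue
        # Percent Correct Nucleotide: share of the most frequent base.
        pcn = max(counts.values()) / total
        rates.append(1 / pcn)
    return rates


if __name__ == "__main__":
    # Illustrative counts only, not data from the project.
    matrix = [
        {"A": 10, "T": 0, "G": 0, "C": 0, "N": 0},  # all reads agree
        {"A": 8, "T": 0, "G": 2, "C": 0, "N": 0},   # some disagreement
        {"A": 0, "T": 0, "G": 0, "C": 0, "N": 0},   # no coverage
    ]
    print(mutation_rates(matrix))
```

For the three toy positions, the first gives a value of 1, the second a value of about 1.25, and the third is reported as missing.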



Results

As an example, a case study of ackA, or acetate kinase, in Staphylococcus aureus is highlighted. This gene is involved in the catabolic formation of ATP and is known to be responsible for body odor in the human axilla (Tauch, 2013). Fig. 3 shows a screenshot of the sequence alignment of the WGS reads with the reference gene (ackA). Fig. 4a shows a graph of the mutation rate versus nucleotide position. Fig. 4b lists the most likely nucleotide positions for these mutations and the corresponding translated amino acid residues. It also indicates whether each residue is solvent exposed and whether it is structurally or functionally important. These data were determined using the Consurf server for protein structure prediction (Celniker, 2013). As seen in Fig. 4a, there may be an edge effect that the mutation rate metric does not account for. This is a limitation of the pipeline, and the metric needs to be further optimized. Finally, a 3D model was obtained as a consensus solution from 3D structure prediction tools such as Phyre2 and Consurf (Fig. 5) (Kelley, 2009). The mutated residues are highlighted on the structure in order to determine whether they fall in structurally or functionally relevant locations, which would mark them as potentially important CRISPR targets. A similar analysis was performed for other odor-related enzymes, including leucine dehydrogenase (LeuDH), lactate dehydrogenase (Ldh), and C-S lyase (AecD).


Figure 5. Mutated amino acid residues highlighted on predicted 3D structure of AckA. Model predicted by Phyre2 3D prediction web-server (Kelley, 2009).


Figure 3. Alignment between WGS reads for ackA from the Sanger Institute.


Figure 4. Mutated nucleotide positions and amino acid residues determined by WGS alignment with reference gene (ackA). A) Graph of mutation rate versus nucleotide position as calculated using the metric presented in the Methods section. B) Table of predicted nucleotide positions on ackA and the corresponding amino acid residues. The table includes information on solvent exposure and on the structural and functional importance of each residue.

Centre for Research and Interdisciplinarity (CRI)
Faculty of Medicine Cochin Port-Royal, South wing, 2nd floor
Paris Descartes University
24, rue du Faubourg Saint Jacques
75014 Paris, France
+33 1 44 41 25 22/25
paris-bettencourt-igem@googlegroups.com