Team:UESTC-Software/Modeling.html

From 2014.igem.org

Revision as of 12:12, 16 October 2014 by Tangying0608 (Talk | contribs)

UESTC-Software

Models and Algorithms

1.Overview

Modeling is a powerful tool in synthetic biology and engineering. In our project, we aim to design a bioinformatics tool “CRISPR-X”, which is a software developed for design of CRISPR sgRNA with minimized off-target effects and high cutting rate.

The CRISPR-associated (Cas)9 can be programmed with a single guide RNA (sgRNA) to generate site-specific DNA breaks, but there are few known rules governing on-target efficacy of this system[1,2]. Related reports suggest gRNAs are most effective with a GC-content between 40 and 80%. [1] In addition, a guanine at position 20 in the target site, which appears to improve cutting rate. [1] Therefore, we use efficacy score to characterize the activity of the sgRNA.

For sgRNA sequences can be 17-20 nt in length to achieve similar levels of on-target gene editing,and up to 10,000 fold improvement in target specificity when truncated (17 or 18 base pair) sgRNA is used. [3] We design the length of sgRNA sequences vary from 17nt to 20nt.

For sgRNA sequences can be 17-20 nt in length to achieve similar levels of on-target gene editing,and up to 10,000 fold improvement in target specificity when truncated (17 or 18 base pair) sgRNA is used. [3] We design the length of sgRNA sequences vary from 17nt to 20nt.

First of all, we find the protospacer-adjacent motif (PAM) based on user-specified gene region. Then, we find sgRNA corresponding to the PAM. Next, we find that whether there is a potential off-target binding site for the sgRNA over the entire gene region, and evaluate the specificity and efficacy of the sgRNA. Finally, we provide a secondary structure and the restriction enzyme cutting sites for the sgRNA.

2.Parameters
Parameters Description Range Unit Remark
d 0 Average distance for all the mismatch nucleotides to the PAM of any off-target site 0-19 nt
i Continuous variables for the number of mismatch nucleotide 1-Nmm 1
j Continuous variables for the total number of off-target sites exclude the perfect-hit off-target sites 1-(Nfg-Nph) 1
M Weight matrix [0,0,0.014,0,0,0.395,0.317,0,0.389,0.079,0.445,0.508,0.613,0.851,0.732,0.828,0.615,0.804,0.685,0.583] Reference: DNA targeting specificity of RNA-guided Cas9 nucleases, Hsu et al, 2013
n Mismatch position 1-20
Nfg The total number of off-target sites ≥0 1
Nmm The number of mismatch nucleotide for the not perfect-hit off-target sites 1-4 1
Nph Perfect-hit off-target sites ≥0 1 In our scoring algorithm, we allow the maximum value of Nph is 4, when Nph≥4, Sguide=0
r 1 The proportion of specificity score in the total score 0-1 1 In our scoring algorithm, it’s default value is 0.65
r 2 The proportion of efficacy score in the total score 0-1 1 In our scoring algorithm, it’s default value is 0.35
S1 The score of the first step ≥0 1
S20 The subtracted score for the 20th nucleotide is not a guanine =35 1
Seff The efficacy score 0-100 1 Represent the level of efficacy for the sgRNA
Sgc The subtracted score for different GC ratio 0,35,65 1
Sguide The total score of the sgRNA 0-100 1 Composed of Seff and Sspe, marking the overall properties(specificity and efficacy) of the sgRNA
Smm The subtracted score of the mismatch nucleotide for the not perfect-hit off-target sites ≥0 1
Sph The subtracted score of the perfect-hit off-target sites ≥0 1
Sspe The specificity score 0-100 1 Represent the level of specificity for the sgRNA
3.Scoring algorithm

    Judged conditions:
  • ①Bad GC ratio (< 40% or > 80%) : Sgc = 65;
    Not so good GC ratio (40% - 50% or 70% - 80%): Sgc = 35;
    Good GC ratio (51%-69%): Sgc = 0.[1]
    
  • ②The 20th nucleotide is not G: S20 = 35;[1]
  • ③If the sgRNA designed perfectly hit another sites, the penalty Sph = 25;if perfectly hit more than or equal to 4 loci, the total score Sguide is 0.

    Steps:
  • (1) Firstly, find out the number of off – target sequence Nfg, if Nfg = 0, output Sspe= r1*100; Otherwise, detect the third condition. If there is a sgRNA designed perfectly hit another site, regard the number of it as Nph, and then the score of the first step: S1 = Sph * Nph (Nph is 4 or less). When S1 is equal to or less than 75, perform step (2), otherwise the output Sguide = 0;
    If there is no sgRAN designed perfectly hit other sites, the score of the first step S1 = 0. This illustrates that there is no nucleotide which are matched between sgRNA and the place missed. Then perform step(2)。
  • (2) When performing step (2), remove the Nph which is perfectly hit first. For Nfg-Nph which does not perfectly hit, please combine the weight ratio which obtained in the literature: M=[ 0, 0, 0.014, 0, 0, 0.395, 0.317, 0, 0.389, 0.079, 0.445, 0.508, 0.613, 0.851, 0.732, 0.828, 0.615, 0.804, 0.685, 0.583];[4] Using the formula:

Assuming that specific score: efficacy score = r1: r2 (the default is r1:r2 = 0.65:0.35), and then use formula specificity scores Sspe =(when 100-, Sspe=0), efficacy score Seff = r2 * (100 - (Sgc + S20)), the total score: ; Finally according to Sguide score, arranging the sgRNA from high to low, outputting sgRNA, total score Sguide, specificity scores Sspe, efficacy score Seff, the chromosome and its site connected to sgRNA, the GC ratio.

4.Algorithm illustration

In literature [4], The algorithm used to score single off-targets is:

This algorithm is adopted by CRISPR-P, the inadequacies of this algorithm are: (a) Despite the presence of off-target sites, but sometimes it's subtracted score will still be 0 (which seems unreasonable under certain circumstances, and it will confuse the scoring of those sgRNA that don’t exist off-target sites). (b) Using W function, which cannot be expressed by elementary functions, it will take some additional time in calculation.

However, our algorithm can avoid these two shortcomings. Our algorithm is:

First, we use the summation of exponential replaced W function; Secondly, when there exist off-target sites, our running results will be with the subtracted score, and we use the rounding to further ensure this situation. The following table can show the score contrast.

Off-target sequence Mismatches CRISPR-P Score OUR Software Score
GTTTCTCCGTAATCGCGTCA 4 0.8 0.989
GTTCTTCCACAATTCCGTTA 4 0 0.391
TTTCTTCCAGAATCGTGACT 4 0 0.426
GAAAAATTCCTCTTATTTCA 2 3.9 2.177
GAACAACTCCTCTTATTACA 2 2.4 1.187
GAAGAACTACGCTTATGACA 4 0 0.402
5.Algorithm validation

In order to confirm our algorithm is consistent with the experimental results, we use the experimental data on the MLE Cleavage with the different mismatches, and we compare our scoring results to corresponding to the experimental data , in addition find the correlation coefficient of them.

First, we use aggregate data from single-mismatch guide RNAs for 15 EMX1 targets in literature [4](it’s relation figure is figure 2C, heatmap for relative SpCas9 cleavage efficiency for each possible RNA:DNA base pair).

Aggregate data from single-mismatch guide RNAs for 15 EMX1 targets [4]

Heatmap for relative SpCas9 cleavage efficiency for each possible RNA:DNA base pair[4]. (a)We use this set of data to determine the relationship between our software score with the MLE Cleavage for single mismatch position. MATLAB program is shown below:

	data=load('E:\matlab\work\igem_data.mat');
	[m n]=size(data.dataigem);
	for i=1:n
	    ave(i)=mean(data.dataigem(:,i));
	end
	M=[0,0,0.014,0,0,0.395,0.317,0,0.389,0.079,0.445,0.508,0.613,0.851,0.732,0.828,0.615,0.804,0.685,0.583];
	for j=2:20
	S(j-1)=(4*exp(1-M(j)))/((4*j+19)/19);
	end
	x=19:-1:1;
	subplot(1,2,1);
	stem(x,ave);
	title('The relaition between the single mismatch location and cleavage activity ');
	xlabel('location/nt');ylabel('cleavage activity');
	subplot(1,2,2);
	stem(x,S,'g');
	title('The figure of the single mismatch location and mismatch score  ');
	xlabel('location/nt');ylabel('score');
	figure(2);
	plot(x,ave,'b',x,S,'r');
	legend('mismatch cleavage activity','mismatch score');
	title('The contrast figure of the single mismatch cleavage activity and mismatch score  ');
	xlabel('location/nt');ylabel('amplitude');
	B=corrcoef(ave,S)%find the correlation coefficient

The result and figures are:

(b) We use the similar way to determine the relationship between our software score with the MLE Cleavage for two concatenated mismatches. MATLAB program is shown below:

	data=load('E:\matlab\work\data2misc.mat');
	[m n]=size(data.data2misc);
	for i=1:n
	    ave(i)=mean(data.data2misc(:,i));
	end
	N=[19 20;17 18;15 16;13 14;11 12;9 10;7 8;5 6;3 4];%Mismatch position
	M=[0,0,0.014,0,0,0.395,0.317,0,0.389,0.079,0.445,0.508,0.613,0.851,0.732,0.828,0.615,0.804,0.685,0.583];
	for j=1:n
	    d0(j)=mean(N(j,:));
	    S(j)=(exp(1-M(N(j,1)))+exp(1-M(N(j,2))))/((4*d0(j)+19)/19);
	end
	x=1:n;
	subplot(1,2,1);
	stem(x,ave);
	title('The two concatenated mismatches cleavage activity ');
	xlabel('The serial number');ylabel('cleavage activity');
	subplot(1,2,2);
	stem(x,S,'g');
	title('The two concatenated mismatches score  ');
	xlabel('The serial number');ylabel('score');
	B=corrcoef(ave,S) %find the correlation coefficient

The result and figures are:

(c) We use the similar way like (b)(just change N in MATLAB program) to determine the relationship between our software score with the MLE Cleavage for two interspaced mismatches. The MATLAB program result and figures are:

(d) We use the similar way like (b) (just change N and S(j) (according to the Smm formula)in MATLAB program) to determine the relationship between our software score with the MLE Cleavage for three concatenated mismatches. The MATLAB program result and figures are:

(e) We use the similar way like (b) (just change N and S(j) (according to the Smm formula)in MATLAB program) to determine the relationship between our software score with the MLE Cleavage for three interspaced mismatches. The MATLAB program result and figures are:

In summary, the correlation coefficient of the above-mentioned five different conditions (single mismatch, two concatenated mismatches, two interspaced mismatches, three concatenated mismatches and three interspaced mismatches) respectively are: 0.8840, 0.8902, 0.7688, 0.8566, 0.6092. The correlation coefficients are all over 0.6, and three correlation coefficients are over 0.85. In some extent, this result demonstrated the validity and availability of our scoring algorithm.

    Reference:
  • [1] Genetic Screens in Human Cells Using the CRISPR-Cas9 System, Wang et al., 2014
  • [2] Efficient mutagenesis by Cas9 protein-mediated oligonucleotide insertion and large-scale assessment of single-guide RNAs. Gagnon, J.A.et al,2014.
  • [3] Improving CRISPR-Cas nuclease specificity using truncated guide RNAs. Fu Y. et al.,2014)
  • [4] DNA targeting specificity of RNA-guided Cas9 nucleases, Hsu et al, 2013

1.Overview
2.Parameters
3.Scoring algorithm
4.Algorithm illustration
5.Algorithm validation