From 2014.igem.org

Validation | UESTC Software 2014

Validation

Complexity Analysis

1 Notation

n Number of candidate sgRNAs

n_h Number of cache hits (candidates whose scores have already been calculated)

n_m Number of cache misses (candidates whose scores must be calculated)

n_t Number of results the system outputs

m Number of possible off-target sgRNAs

l Length of an sgRNA

t_cd Time cost of connecting to the database

t_rd Approximate time cost of one database operation

2 Analysis

1. The system connects to the database, which costs t_cd.

2. It retrieves data from the database, which costs t_rd* (n + m).

3. The system calculates a score for each of the n candidate sgRNAs. For the n_h sgRNAs whose scores have already been calculated (cache hits), this costs only O(n_h). For the other n_m sgRNAs (cache misses), the scores must be calculated. According to our algorithm, each of these sgRNAs is compared with all m possible off-target sgRNAs. Each comparison checks all l nucleobases and performs a few additions and multiplications; the result is then packed into a string and saved to the database. The n_m calculations therefore cost n_m* (m*l + t_rd) in total: m*l for the comparisons plus one database operation per sgRNA.

4. The system sorts all n results, which costs O(n log₂ n).

5. Outputting the results costs O(n_t).

3 Run Time

t_cd + t_rd* (n + m) + O(n_h) + n_m* (m* l + t_rd) + O(n log₂ n) + O(n_t)
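
To get a feel for which term dominates, the run-time expression above can be evaluated for concrete values. The Python sketch below is only an illustration and is not taken from CRISPR-X: the function name `estimated_run_time` and the parameter `c` (an assumed time per elementary operation, used to turn each O(·) term and the m*l comparisons into a time) are hypothetical, and the sketch assumes n = n_h + n_m.

```python
import math


def estimated_run_time(n_h, n_m, m, l, n_t, t_cd, t_rd, c=1.0):
    """Evaluate t_cd + t_rd*(n+m) + O(n_h) + n_m*(m*l + t_rd) + O(n log2 n) + O(n_t).

    c is a hypothetical time per elementary operation (assumption of this sketch).
    """
    n = n_h + n_m  # assumption: every candidate is either a cache hit or a miss
    connect = t_cd                        # step 1: connect to the database
    retrieve = t_rd * (n + m)             # step 2: retrieve data
    cached = c * n_h                      # step 3: cache hits
    scoring = n_m * (c * m * l + t_rd)    # step 3: cache misses
    sorting = c * n * math.log2(n) if n > 1 else 0.0  # step 4: sort
    output = c * n_t                      # step 5: output
    return connect + retrieve + cached + scoring + sorting + output
```

With a slow database (large t_rd) the retrieval term tends to dominate; with many cache misses and a large m, the n_m*m*l comparison term does.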

Software Validation

To check that our software works correctly, we ran two existing CRISPR design tools, CRISPR-P and Optimized CRISPR Design, on the same input sequence and compared their results with ours.

All three tools received the following input sequence:

TTTGAGGTCAATACAAATCCTATTTCTTGTGGTTTTCTTTCCTTCACTTAGCTATGGATGGTTTATCTTC ATTTGTTATATTGGATACAAGCTTTGCTACGATCTACATTTGGGAATGTGAGTCTCTTATTGTAACCTTA GGGTTGGTTTATCTCAAGAATCTTATTAATTGTTTGGACTGTTTATGTTTGGACATTTATTGTCATTCTT
(NC_003070.9, Arabidopsis thaliana, chromosome 1, 211-420)

The outputs of CRISPR-P, Optimized CRISPR Design and CRISPR-X are, respectively:

CRISPR-P

Optimized CRISPR Design

CRISPR-X

Comparison:

(a) The number of sgRNA results

All three tools return 14 sgRNAs, and the 14 sequences are identical across the tools; only their ranks differ. This indicates that our software's approach to finding sgRNAs is correct.
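
One simple way to enumerate candidates is to scan both strands for a 20-nt protospacer followed by an NGG PAM. The Python sketch below shows that idea only; it is not the CRISPR-X source code, the function names are hypothetical, and real tools may apply additional filters that this naive scan does not.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(COMPLEMENT)[::-1]


def find_candidates(region):
    """Return (23-nt site, strand) for every 20-nt protospacer followed by NGG."""
    region = region.upper().replace(" ", "")
    hits = []
    for strand, seq in (("+", region), ("-", reverse_complement(region))):
        for i in range(len(seq) - 22):
            site = seq[i:i + 23]       # 20-nt protospacer + 3-nt PAM
            if site.endswith("GG"):    # PAM is NGG: any base, then GG
                hits.append((site, strand))
    return hits


# First 70 nt of the input sequence given above.
FIRST_70_NT = "TTTGAGGTCAATACAAATCCTATTTCTTGTGGTTTTCTTTCCTTCACTTAGCTATGGATGGTTTATCTTC"
for site, strand in find_candidates(FIRST_70_NT):
    print(site, strand)
```

On these first 70 nt the scan finds five sites (three on the "+" strand and two on the "-" strand), all of which appear in the tables of this section.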

(b) The ranking of sgRNA results

Because the three tools use different scoring algorithms, their rankings are not exactly the same. In addition, our total score includes an efficacy score, so the total-score ranking of CRISPR-X differs considerably from the rankings of the other two tools. When the sgRNAs are ranked by specificity score alone, however, the ranking of CRISPR-X follows a trend similar to that of the other two tools.

Species: Arabidopsis thaliana; location: chromosome 1, 211-420
Sequence	Specificity score	Efficacy score	Total score	Rank (CRISPR-P)	Rank (CRISPR-X specificity)	Rank (Optimized CRISPR Design)
TTCCTTCACTTAGCTATGGATGG	49	15	64	1	1(3)	5
AGTCTCTTATTGTAACCTTAGGG	48	0	48	2	3(8)	2
TCTTATTGTAACCTTAGGGTTGG	48	0	48	3	4(7)	1
CTTGAGATAAACCAACCCTAAGG	49	15	64	4	2(2)	3
GCTTTGCTACGATCTACATTTGG	48	32	80	5	5(1)	4
TTCTTTCCTTCACTTAGCTATGG	47	0	47	6	7(9)	7
AACCATCCATAGCTAAGTGAAGG	48	15	63	7	6(5)	6
AATACAAATCCTATTTCTTGTGG	45	0	45	8	8(10)	9
GAGTCTCTTATTGTAACCTTAGG	45	17	63	9	9(4)	8
CAAGAATCTTATTAATTGTTTGG	43	0	43	10	10(11)	11
CTTTGCTACGATCTACATTTGGG	42	0	42	11	11(12)	10
TTGTTTGGACTGTTTATGTTTGG	36	0	36	12	12(13)	12
TTTATCTTCATTTGTTATATTGG	29	0	29	13	14(14)	13
GAAAGAAAACCACAAGAAATAGG	35	17	53	14	13(6)	14
Values in parentheses are the CRISPR-X rank by total score.
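
The similarity of two rankings can be quantified with Spearman's rank correlation. The Python sketch below applies the standard tie-free formula to the rank columns of the table above; this check is an addition for illustration and was not part of the original comparison, and the function name `spearman_rho` is hypothetical.

```python
def spearman_rho(rank_a, rank_b):
    """Spearman rank correlation for two tie-free rankings of the same items."""
    n = len(rank_a)
    d2 = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * d2 / (n * (n * n - 1))


# Rank columns copied from the table above, in row order.
crispr_p = list(range(1, 15))
crispr_x_specificity = [1, 3, 4, 2, 5, 7, 6, 8, 9, 10, 11, 12, 14, 13]
optimized_design = [5, 2, 1, 3, 4, 7, 6, 9, 8, 11, 10, 12, 13, 14]

print(spearman_rho(crispr_p, crispr_x_specificity))
print(spearman_rho(optimized_design, crispr_x_specificity))
```

For these columns the coefficient evaluates to about 0.98 against CRISPR-P and about 0.93 against Optimized CRISPR Design.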

(c) The position and strand of sgRNA results

A careful comparison shows that the position and strand that CRISPR-X reports for each sgRNA are exactly the same as those reported by CRISPR-P (the Optimized CRISPR Design results page does not show the exact location). This further supports the correctness of our software.

Species: Arabidopsis thaliana; location: chromosome 1, 211-420
sgRNA sequence	Position, strand (CRISPR-P)	Position, strand (CRISPR-X)	Position, strand (Optimized CRISPR Design)
TTCCTTCACTTAGCTATGGATGG	249,"+"	249,"+"	/
AGTCTCTTATTGTAACCTTAGGG	331,"+"	331,"+"	/
TCTTATTGTAACCTTAGGGTTGG	335,"+"	335,"+"	/
CTTGAGATAAACCAACCCTAAGG	346,"-"	346,"-"	/
GCTTTGCTACGATCTACATTTGG	301,"+"	301,"+"	/
TTCTTTCCTTCACTTAGCTATGG	245,"+"	245,"+"	/
AACCATCCATAGCTAAGTGAAGG	251,"-"	251,"-"	/
AATACAAATCCTATTTCTTGTGG	220,"+"	220,"+"	/
GAGTCTCTTATTGTAACCTTAGG	330,"+"	330,"+"	/
CAAGAATCTTATTAATTGTTTGG	365,"+"	365,"+"	/
CTTTGCTACGATCTACATTTGGG	302,"+"	302,"+"	/
TTGTTTGGACTGTTTATGTTTGG	380,"+"	380,"+"	/
TTTATCTTCATTTGTTATATTGG	272,"+"	272,"+"	/
GAAAGAAAACCACAAGAAATAGG	229,"-"	229,"-"	/
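
The positions in this table can be reproduced from the input region. The Python sketch below assumes a coordinate convention inferred from the table values rather than stated by the tools: the reported position is the leftmost chromosome coordinate of the 23-nt site on the "+" strand, also for "-" strand sgRNAs. The function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(COMPLEMENT)[::-1]


def locate(region, region_start, sgrna):
    """Return (position, strand) of a 23-nt sgRNA site, or None if absent.

    region_start is the chromosome coordinate of region[0] (211 here).
    Assumed convention: position = leftmost coordinate on the "+" strand.
    """
    region = region.upper().replace(" ", "")
    i = region.find(sgrna)
    if i >= 0:
        return region_start + i, "+"
    i = region.find(reverse_complement(sgrna))
    if i >= 0:
        return region_start + i, "-"
    return None
```

Applied to the first 70 nt of the input sequence with region_start = 211, `locate` returns (249, "+") for TTCCTTCACTTAGCTATGGATGG and (229, "-") for GAAAGAAAACCACAAGAAATAGG, in agreement with the table.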

For example, the following figure compares the sgRNAs ranked 3rd and 12th by CRISPR-P:

(d) The number of off-target sites (PAM is NGG) of sgRNA results

Whether a CRISPR tool identifies the correct number of potential off-target sites is a very important indicator of whether it works. The table below lists the number of off-target sites for the 14 sgRNAs. We count only off-target sites whose PAM is NGG: the other two tools report totals for NGG and NAG PAMs together, and those totals are inconsistent with each other, whereas the NGG sites can be counted separately.

Species: Arabidopsis thaliana; location: chromosome 1, 211-420
sgRNA sequence	Specificity score	Efficacy score	Total score	NGG off-target sites (CRISPR-X)	NGG off-target sites (CRISPR-P)	NGG off-target sites (Optimized CRISPR Design)
TTCCTTCACTTAGCTATGGATGG	49	15	64	3	3(4)	3(12)
AGTCTCTTATTGTAACCTTAGGG	48	0	48	4	4(4)	4(10)
TCTTATTGTAACCTTAGGGTTGG	48	0	48	3	3(5)	3(7)
CTTGAGATAAACCAACCCTAAGG	49	15	64	3	3(7)	3(10)
GCTTTGCTACGATCTACATTTGG	48	32	80	5	5(6)	5(11)
TTCTTTCCTTCACTTAGCTATGG	47	0	47	9	9(16)	9(31)
AACCATCCATAGCTAAGTGAAGG	48	15	63	5	5(11)	5(14)
AATACAAATCCTATTTCTTGTGG	45	0	45	13	13(17)	13(33)
GAGTCTCTTATTGTAACCTTAGG	45	17	63	8	8(11)	8(17)
CAAGAATCTTATTAATTGTTTGG	43	0	43	21	/(35)	21(57)
CTTTGCTACGATCTACATTTGGG	42	0	42	17	17(18)	17(22)
TTGTTTGGACTGTTTATGTTTGG	36	0	36	37	/(48)	/(70)
TTTATCTTCATTTGTTATATTGG	29	0	29	57	/(96)	/(157)
GAAAGAAAACCACAAGAAATAGG	35	17	53	44	/(94)	/(158)
/ means unknown; values in parentheses are the total number of off-target sites (NAG and NGG PAMs together).

CRISPR-P's results page displays only 20 off-target sites and the Optimized CRISPR Design results page displays only 50, so some of their NGG counts are unknown. For the first 11 sgRNAs in the table, every NGG off-target count that the other two tools report is exactly the same as ours. This is strong evidence that our software works correctly.
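
Counting off-target sites by PAM type can be sketched as follows. This Python example is a simplification under stated assumptions, not the CRISPR-X implementation: the names `count_off_targets` and `max_mismatches`, the mismatch limit of 4, and the idea that candidate 23-nt genomic sites arrive as a plain list (with the on-target site already excluded) are all hypothetical.

```python
def mismatches(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def count_off_targets(sgrna, sites, max_mismatches=4):
    """Count candidate off-target sites separately for NGG and NAG PAMs.

    sgrna and every site are 23-nt strings: 20-nt protospacer + 3-nt PAM.
    max_mismatches is an assumed cutoff, not a value taken from the source.
    """
    guide = sgrna[:20]
    counts = {"NGG": 0, "NAG": 0}
    for site in sites:
        if len(site) != 23 or mismatches(guide, site[:20]) > max_mismatches:
            continue
        pam = site[21:]  # last two bases of the PAM; the first one is N
        if pam == "GG":
            counts["NGG"] += 1
        elif pam == "AG":
            counts["NAG"] += 1
    return counts
```

Comparing only the "NGG" entry between tools corresponds to the comparison made in the table above.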

For example, the following figures compare the three tools for the sgRNA ranked 5th by CRISPR-P:

CRISPR-P: 5 NGG off-target sites

CRISPR-X: 5 NGG off-target sites

Optimized CRISPR Design: 5 NGG off-target sites

The same comparison for the sgRNA ranked 9th by CRISPR-P:

CRISPR-P: 8 NGG off-target sites

CRISPR-X: 8 NGG off-target sites

Optimized CRISPR Design: 8 NGG off-target sites

In summary, for the same input sequence, our software outputs the same number of sgRNAs, the same sequences, the same positions and strands, and the same numbers of NGG off-target sites as the other two CRISPR design tools, CRISPR-P and Optimized CRISPR Design, wherever those tools report them. Therefore our software, CRISPR-X, is valid and usable.

Algorithm Validation

To confirm that our scoring is consistent with experimental results, we took experimental data on MLE cleavage for different mismatches, compared our scores with the corresponding experimental values, and calculated the correlation coefficient between them.

(A) Single mismatch

First, we use the aggregate data from single-mismatch guide RNAs for 15 EMX1 targets in [1] (shown there in Figure 2C, a heat map of relative SpCas9 cleavage efficiency for each possible RNA:DNA base pair).

Heat map for relative SpCas9 cleavage efficiency for each possible RNA:DNA base pair

We use this data set to determine the relationship between our software's score and the MLE cleavage at each single-mismatch position. The MATLAB program is shown below:

	data=load('E:\matlab\work\igem_data.mat');
	[m,n]=size(data.dataigem);
	ave=mean(data.dataigem,1);% mean cleavage activity of each column
	M=[0,0,0.014,0,0,0.395,0.317,0,0.389,0.079,0.445,0.508,0.613,0.851,0.732,0.828,0.615,0.804,0.685,0.583];% weight of each position
	S=zeros(1,19);% one score per mismatch position j=2..20
	for j=2:20
	    S(j-1)=(4*exp(1-M(j)))/((4*j+19)/19);% mismatch score for a single mismatch at position j
	end
	x=19:-1:1;
	figure(1);%title('Single mismatch,correlation coefficient=0.8840');

	subplot(1,2,1);
	stem(x,ave);
	title('The relation between the single mismatch location and cleavage activity ');
	xlabel('location/nt');ylabel('cleavage activity');
	subplot(1,2,2);
	stem(x,S,'g');
	title('The figure of the single mismatch location and mismatch score  ');
	xlabel('location/nt');ylabel('score');

	figure(2);
	plot(x,ave,'b',x,S,'r');
	legend('mismatch cleavage activity','mismatch score');
	title('The contrast figure of the single mismatch cleavage activity and mismatch score  ');
	xlabel('location/nt');ylabel('amplitude');
	p=polyfit(S,ave,1)% fit activity against score, so that X is the mismatch score
	y=p(1)*S+p(2);%Fitting equation of the straight line
	figure(3);
	plot(S,ave,'b',S,y,'r');
	title('The relation of the single mismatch cleavage activity and mismatch score  ');
	xlabel('Mismatch score');ylabel('Mismatch cleavage activity');
	legend('Relation curve','Fitting straight line');

	R=corrcoef(ave,S)%find the correlation coefficient

The result and figures are:

The correlation coefficient between the single-mismatch cleavage activity and the CRISPR-X mismatch score is therefore 0.8840, and the fitted straight line is Y=0.1872*X+0.0987 (X stands for the mismatch score).
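
For readers without MATLAB, the two quantities reported here (the correlation coefficient from `corrcoef` and the straight line from `polyfit` with degree 1) reduce to the standard formulas below. This is a Python sketch using only the standard library; the function names are hypothetical and the numbers in the usage lines are toy values, not the experimental data.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def fit_line(xs, ys):
    """Least-squares line y = slope*x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx


# Toy values only: x plays the role of the mismatch score, y of the activity.
toy_score = [1.0, 2.0, 3.0, 4.0]
toy_activity = [3.0, 5.0, 7.0, 9.0]
print(pearson_r(toy_score, toy_activity), fit_line(toy_score, toy_activity))
```

The correlation coefficient is symmetric in its two arguments, whereas the fitted line depends on which variable is treated as X.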

(B) Multiple mismatches

We next explored the relationship between the effect of multiple base mismatches on SpCas9 target activity and our mismatch score. We use data from sets of guide RNAs that contain varying combinations of mismatches, designed to investigate the effect of mismatch number, position and spacing on SpCas9 target cleavage activity for four targets within the EMX1 gene [1].

SpCas9 target cleavage activity for multiple mismatches [1]

(a) Two concatenated mismatches

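The MATLAB program is shown below:

	data=load('E:\matlab\work\data2misc.mat');
	[m,n]=size(data.data2misc);
	ave=mean(data.data2misc,1);% mean cleavage activity of each column
	N=[19 20;17 18;15 16;13 14;11 12;9 10;7 8;5 6;3 4];% positions of the two adjacent mismatches, one pair per row
	M=[0,0,0.014,0,0,0.395,0.317,0,0.389,0.079,0.445,0.508,0.613,0.851,0.732,0.828,0.615,0.804,0.685,0.583];% weight of each position
	d0=zeros(1,n);S=zeros(1,n);% reset so that values from an earlier script do not linger
	for j=1:n
	    d0(j)=mean(N(j,:));% mean position of the mismatches
	    S(j)=(exp(1-M(N(j,1)))+exp(1-M(N(j,2))))/((4*d0(j)+19)/19);% mismatch score
	end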
	x=1:n;
	subplot(1,2,1);
	stem(x,ave);
	title('The two concatenated mismatches cleavage activity ');
	xlabel('The serial number');ylabel('cleavage activity');
	subplot(1,2,2);
	stem(x,S,'g');
	title('The two concatenated mismatches score  ');
	xlabel('The serial number');ylabel('score');
	p=polyfit(S,ave,1)
	y=p(1)*S+p(2);%Fitting equation of the straight line
	figure(2);
	plot(S,ave,'b',S,y,'r');
	title('The relation of the two concatenated mismatches cleavage activity and mismatch score  ');
	xlabel('Mismatch score');ylabel('Mismatch cleavage activity');
	legend('Relation curve','Fitting straight line');
	R=corrcoef(ave,S)

The result and figures are:

The correlation coefficient between the cleavage activity for two concatenated mismatches and the CRISPR-X mismatch score is therefore 0.8902, and the fitted straight line is Y=0.0445*X-0.0103 (X stands for the mismatch score).

(b) Two interspaced mismatches

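The MATLAB program is shown below:

	data=load('E:\matlab\work\data2misi.mat');
	[m,n]=size(data.data2misi);
	ave=mean(data.data2misi,1);% mean cleavage activity of each column
	N=[18 20;15 20;11 20;6 20;1 20;16 18;13 18;9 18;4 18;14 16;11 16;7 16;2 16];% positions of the two interspaced mismatches, one pair per row
	M=[0,0,0.014,0,0,0.395,0.317,0,0.389,0.079,0.445,0.508,0.613,0.851,0.732,0.828,0.615,0.804,0.685,0.583];% weight of each position
	d0=zeros(1,n);S=zeros(1,n);% reset so that values from an earlier script do not linger
	for j=1:n
	    d0(j)=mean(N(j,:));% mean position of the mismatches
	    S(j)=(exp(1-M(N(j,1)))+exp(1-M(N(j,2))))/((4*d0(j)+19)/19);% mismatch score
	end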
	x=1:n;
	subplot(1,2,1);
	stem(x,ave);
	title('The two interspaced mismatches cleavage activity ');
	xlabel('The serial number');ylabel('cleavage activity');
	subplot(1,2,2);
	stem(x,S,'g');
	title('The two interspaced mismatches score  ');
	xlabel('The serial number');ylabel('score');
	p=polyfit(S,ave,1)
	y=p(1)*S+p(2);%Fitting equation of the straight line
	figure(2);
	plot(S,ave,'b',S,y,'r');
	title('The relation of the two interspaced mismatches cleavage activity and mismatch score  ');
	xlabel('Mismatch score');ylabel('Mismatch cleavage activity');
	legend('Relation curve','Fitting straight line');
	R=corrcoef(ave,S)

The result and figures are:

The correlation coefficient between the cleavage activity for two interspaced mismatches and the CRISPR-X mismatch score is therefore 0.7688, and the fitted straight line is Y=0.0866*X-0.0353 (X stands for the mismatch score).

(c) Three concatenated mismatches

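The MATLAB program is shown below:

	data=load('E:\matlab\work\data3misc.mat');
	[m,n]=size(data.data3misc);
	ave=mean(data.data3misc,1);% mean cleavage activity of each column
	N=[17 18 19;14 15 16;11 12 13;8 9 10;5 6 7;2 3 4];% positions of the three adjacent mismatches, one triple per row
	M=[0,0,0.014,0,0,0.395,0.317,0,0.389,0.079,0.445,0.508,0.613,0.851,0.732,0.828,0.615,0.804,0.685,0.583];% weight of each position
	d0=zeros(1,n);S=zeros(1,n);% reset so that values from an earlier script do not linger
	for j=1:n
	    d0(j)=mean(N(j,:));% mean position of the mismatches
	    S(j)=4/9*(exp(1-M(N(j,1)))+exp(1-M(N(j,2)))+exp(1-M(N(j,3))))/((4*d0(j)+19)/19);% mismatch score
	end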
	x=1:n;
	subplot(1,2,1);
	stem(x,ave);
	title('The three concatenated mismatches cleavage activity ');
	xlabel('The serial number');ylabel('cleavage activity');
	subplot(1,2,2);
	stem(x,S,'g');
	title('The three concatenated mismatches score  ');
	xlabel('The serial number');ylabel('score');
	p=polyfit(S,ave,1)
	y=p(1)*S+p(2);%Fitting equation of the straight line
	figure(2);
	plot(S,ave,'b',S,y,'r');
	title('The relation of the three concatenated mismatches cleavage activity and mismatch score  ');
	xlabel('Mismatch score');ylabel('Mismatch cleavage activity');
	legend('Relation curve','Fitting straight line');
	R=corrcoef(ave,S)

The result and figures are:

The correlation coefficient between the cleavage activity for three concatenated mismatches and the CRISPR-X mismatch score is therefore 0.8560, and the fitted straight line is Y=0.0175*X-0.0082 (X stands for the mismatch score).

(d) Three interspaced mismatches

The MATLAB program is shown below:

	data=load('E:\matlab\work\data3misi.mat');
	[m,n]=size(data.data3misi);
	ave=mean(data.data3misi,1);% mean cleavage activity of each column
	N=[16 18 20;14 17 20;12 16 20;10 15 20;14 16 18;12 15 18;10 14 18;8 13 18];% each row lists the three mismatch positions within the sgRNA
	M=[0,0,0.014,0,0,0.395,0.317,0,0.389,0.079,0.445,0.508,0.613,0.851,0.732,0.828,0.615,0.804,0.685,0.583];% weight of each position
	d0=zeros(1,n);S=zeros(1,n);% reset so that values from an earlier script do not linger
	for j=1:n
	    d0(j)=mean(N(j,:));% mean position of the mismatches
	    S(j)=4/9*(exp(1-M(N(j,1)))+exp(1-M(N(j,2)))+exp(1-M(N(j,3))))/((4*d0(j)+19)/19);% mismatch score Smm
	end
	x=1:n;
	subplot(1,2,1);
	stem(x,ave);
	title('The three interspaced mismatches cleavage activity ');
	xlabel('The serial number');ylabel('cleavage activity');
	subplot(1,2,2);
	stem(x,S,'g');
	title('The three interspaced mismatches score  ');
	xlabel('The serial number');ylabel('score');
	p=polyfit(S,ave,1)
	y=p(1)*S+p(2);%Fitting equation of the straight line
	figure(2);
	plot(S,ave,'b',S,y,'r');
	title('The relation of the three interspaced mismatches cleavage activity and mismatch score  ');
	xlabel('Mismatch score');ylabel('Mismatch cleavage activity');
	legend('Relation curve','Fitting straight line');
	R=corrcoef(ave,S)%Find the correlation coefficient

The result and figures are:

The correlation coefficient between the cleavage activity for three interspaced mismatches and the CRISPR-X mismatch score is therefore 0.6092, and the fitted straight line is Y=0.0065*X-0.0018 (X stands for the mismatch score).

In summary, the correlation coefficients for the five conditions above (single mismatch, two concatenated mismatches, two interspaced mismatches, three concatenated mismatches and three interspaced mismatches) are 0.8840, 0.8902, 0.7688, 0.8566 and 0.6092, respectively. All of them are above 0.6, and three are above 0.85. To some extent, this result demonstrates the validity and usability of our scoring algorithm.
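
The five MATLAB scripts above use the same scoring expression with a different prefactor: 4 for one mismatch, 1 for two and 4/9 for three, which equals 4/k² for k mismatches. The Python sketch below writes that pattern as a single function. The generalisation to 4/k² is inferred from the three cases shown and is an assumption for any other k; the function name `mismatch_score` is hypothetical.

```python
import math

# Position weights M(1..20), copied from the MATLAB scripts above.
M = [0, 0, 0.014, 0, 0, 0.395, 0.317, 0, 0.389, 0.079,
     0.445, 0.508, 0.613, 0.851, 0.732, 0.828, 0.615, 0.804, 0.685, 0.583]


def mismatch_score(positions):
    """Mismatch score for mismatches at the given 1-based positions.

    Reproduces the MATLAB expressions for k = 1, 2, 3 mismatches;
    the 4/k**2 prefactor for other k is an assumption of this sketch.
    """
    k = len(positions)
    d0 = sum(positions) / k  # mean mismatch position
    total = sum(math.exp(1 - M[p - 1]) for p in positions)
    return (4 / k ** 2) * total / ((4 * d0 + 19) / 19)


print(mismatch_score([2]))           # single mismatch at position 2
print(mismatch_score([19, 20]))      # two concatenated mismatches
print(mismatch_score([16, 18, 20]))  # three interspaced mismatches
```

Writing the score once makes it easier to check that all five conditions are scored consistently before the correlation coefficients are compared.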

Reference:

[1] Hsu, P. D., Scott, D. A., Weinstein, J. A., Ran, F. A., Konermann, S., Agarwala, V., ... & Zhang, F. (2013). DNA targeting specificity of RNA-guided Cas9 nucleases. Nature Biotechnology, 31(9), 827-832.


Team:UESTC-Software/Validation.html
