Monday, September 13, 2010

Understanding transcriptome assembly in newbler

What is the difference between an isotig, an isogroup and a contig?
  • In short: isogroups are made of isotigs, and isotigs are made of contigs.

  • Normally one gene would give one transcript, but because of splice variation one gene can give many transcripts. Suppose a gene is made of 3 exons: exon 1, exon 2 and exon 3. Newbler will generate 3 contigs for it: contig 1, contig 2 and contig 3. Because of splice variation, the final transcript can consist of exon 1 + exon 2 + exon 3, or exon 1 + exon 3, and so on. Here we get 2 variants, which are called isotig 1 and isotig 2: isotig 1 consists of contig 1 + contig 2 + contig 3, and isotig 2 consists of contig 1 + contig 3. These two isotigs are variants of the transcript of one gene, so together they fall into one isogroup.

  • This means that an isogroup roughly corresponds to a gene, isotigs to the splice variants (transcripts) of that gene, and contigs to separate exons.

  • Sometimes primary transcripts contain introns or UTR regions, which also show up as contigs.

  • Sometimes similar isotigs are found because they come from different sister chromosomes; these can be identified by clustering with CD-HIT.
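
The hierarchy in the bullets above can be written down as a small toy model. This is only a sketch of the three-exon example, not Newbler's actual output format; the names `contigs`, `isotigs`, `isogroups` and `exons_of` are made up for illustration.

```python
# Toy model of the example above: one gene with three exons.
# All names here are hypothetical, not Newbler identifiers.
contigs = {"contig1": "exon1", "contig2": "exon2", "contig3": "exon3"}

# Each isotig is an ordered path through contigs (one splice variant).
isotigs = {
    "isotig1": ["contig1", "contig2", "contig3"],
    "isotig2": ["contig1", "contig3"],
}

# An isogroup collects the isotigs that belong to the same gene.
isogroups = {"isogroup1": ["isotig1", "isotig2"]}


def exons_of(isotig):
    """Return the exons of a splice variant, in order."""
    return [contigs[c] for c in isotigs[isotig]]


for group, members in isogroups.items():
    for name in members:
        print(group, name, "+".join(exons_of(name)))
```

Reading it bottom-up gives the same picture as the first bullet: contigs are chained into isotigs, and isotigs are collected into isogroups.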

Friday, June 25, 2010

odd number of sequences in paired end file

sed '$d' filename.txt
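
Note that `sed '$d'` prints the file without its last line to standard output and does not change the file itself. For use inside a script, a rough Python equivalent might look like this; `drop_last_line` is just an illustrative name.

```python
def drop_last_line(text):
    """Return text without its final line, like sed '$d'."""
    lines = text.splitlines(keepends=True)
    return "".join(lines[:-1])


# Example: three lines in, the first two out.
print(drop_last_line("read1\nread2\nread3\n"), end="")
```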

Monday, June 21, 2010

A very important blog

http://contig.wordpress.com/2010/04/13/newbler-output-iii-the-454contiggraph-txt-file/

About paired-end sequencing


Hi all, here are some questions about paired-end sequencing for NGS:

  • What are the main differences between mate-paired sequencing and paired-end sequencing? Should I care when I use tools like 'samtools', maq, etc.? Should one, and only one, short read be paired with another one (1-1)?
  • What is removing duplicates? Does it mean that a pair of short reads has been mapped at two distinct positions on the genome, or does it mean that a pair matched too many times at one position?
  • Knowing that bwa sampe "Generates alignments in the SAM format given paired-end reads. Repetitive read pairs will be placed randomly", is there any need to "remove the duplicates" ?
  • How does picard MarkDuplicates work ? How can I find the reads that have been 'tagged' ? will it remove the reads from the BAM file ?

Thanks

Pierre

Wednesday, May 26, 2010

walking through assembler 1

Ultrasound or a high-pressure air stream randomly shatters the DNA into pieces.

These libraries provide a “clone coverage” of more than 20-fold, meaning that, on average, 20 clones span each of the genome’s bases, thus offering the theoretical guarantee that each base is contained in at least one of the clones. This guarantee assumes uniformly random-sampled clones from the genome. In practice, this requirement is seldom perfectly satisfied. Cloning biases lead to a nonrandom clone distribution, causing areas of the genome to remain unsequenced regardless of the amount of sequencing performed.
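
To get a feeling for why 20-fold coverage is called a "theoretical guarantee", here is a small sketch. It assumes the usual Poisson approximation for uniformly random clones, under which the chance that a given base is covered by no clone is e^(-c) for coverage c. The function name `uncovered_fraction` is my own.

```python
import math


def uncovered_fraction(coverage):
    """Expected fraction of bases covered by no clone.

    Assumes clones are sampled uniformly at random, so the number of
    clones over a base is approximately Poisson with mean `coverage`.
    """
    return math.exp(-coverage)


for c in (1, 5, 20):
    print(c, uncovered_fraction(c))
```

At 20-fold coverage the expected uncovered fraction is on the order of 2e-9, so under the random model essentially nothing is missed. The gaps that do remain come from the cloning biases mentioned above, which this model ignores.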

The gaps between contigs belonging to the same
scaffold are called sequence gaps. Although they
represent genuine gaps in the sequence, researchers
can retrieve the original clone inserts spanning the
gap and use a straightforward “walking” technique
to fill in the sequence.

The gaps between scaffolds are called physical
gaps because the physical DNA that would span
them is either not present in the clone inserts or
indeterminable due to misassemblies. Filling these
gaps involves a large amount of manual labor and
complex laboratory techniques.

These limitations spurred the development of
new algorithms. Two approaches exploit techniques
developed in the field of graph theory: one
that represents the sequence reads as graph nodes
and another that represents them as edges.
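
A minimal sketch of the two graph views, as I understand them. In the first, each read is a node and an edge means a suffix of one read matches a prefix of another. In the second, reads are cut into k-mers and each k-mer becomes an edge between its (k-1)-mer prefix and suffix. The function names and the toy reads are assumptions for illustration, not taken from any assembler.

```python
from collections import defaultdict


def overlap_graph(reads, min_overlap):
    """Reads are nodes; edge a -> b if a suffix of a equals a prefix of b."""
    edges = defaultdict(list)
    for a in reads:
        for b in reads:
            if a == b:
                continue
            # Try the longest proper overlap first, stop at the first hit.
            for n in range(min(len(a), len(b)) - 1, min_overlap - 1, -1):
                if a[-n:] == b[:n]:
                    edges[a].append(b)
                    break
    return dict(edges)


def de_bruijn_graph(reads, k):
    """k-mers from the reads are edges between (k-1)-mer nodes."""
    edges = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            edges[kmer[:-1]].append(kmer[1:])
    return dict(edges)


reads = ["ATGGC", "GGCTA"]
print(overlap_graph(reads, 3))
print(de_bruijn_graph(reads, 3))
```

The overlap version compares every read with every other read, which is why it gets expensive for large data sets; the k-mer version only needs one pass over each read.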

Euler detects repeats by finding complex areas, or
tangles, in the graph constructed during assembly.

Wednesday, May 19, 2010

Illumina paired end data to use in newbler

Preparation of Illumina paired-end data for use in gsAssembler
  • convert the Illumina .txt file to a standard .fastq file with the maq script fqall2std.pl --- perl fqall2std.pl sol2std test.txt test.fastq
  • convert the Illumina fastq to Sanger fastq --- maq sol2sanger test.fastq test1.fastq
  • convert the Sanger fastq to fasta and qual --- perl fqall2std.pl std2qual out.prefix test1.fastq
  • convert the headers of the fasta and qual files so that newbler can recognize them as paired end --- perl replacefasta.pl test1.fasta F lib > test1.fna --- perl replacefasta.pl test1.qual F lib > test11.qual
  • rename test11.qual to test1.qual
  • be sure your fna and qual files are in the same folder
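
For reference, here is a sketch of what the Illumina-to-Sanger quality conversion step does, as far as I understand it. It assumes the input uses Solexa-scaled scores with ASCII offset 64 (old Illumina pipeline output) and that the target is Phred-scaled with offset 33. This is not maq's code, and the function names are made up.

```python
import math


def solexa_to_phred(q_solexa):
    """Convert a Solexa-scaled quality score to a Phred-scaled one."""
    return 10 * math.log10(10 ** (q_solexa / 10) + 1)


def sol2sanger_line(qual_line):
    """Re-encode one quality string.

    Assumption: input is Solexa-scaled with ASCII offset 64,
    output is Phred-scaled with ASCII offset 33 (Sanger).
    """
    out = []
    for ch in qual_line:
        q = solexa_to_phred(ord(ch) - 64)
        out.append(chr(int(q + 0.5) + 33))  # round to nearest integer
    return "".join(out)


print(sol2sanger_line("hhh@"))
```

Only the quality line of each fastq record changes in this step; the read names and sequences stay the same.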