Cephalotaxus_harringtonia-20120822
----------------------------------

stats.txt: Summary statistics associated with contigs.fa.
Includes the total number of sequences and bases in the contig
set, N50, etc. Q1, Q2, Q3 are the quartiles of the reported
contig lengths. B1000 and B2000 indicate the percentage of bases
in contigs of at least 1000 bp and 2000 bp, respectively.

--

contigs.fa: Contigs from the assembly, minimum 100 bp. Possibly
includes UTRs. Sequences contain IUPAC ambiguity codes
representing ambiguous bases; see
http://www.bioinformatics.org/sms/iupac.html.

--

cds.fa: Coding regions associated with contigs, as predicted by
ESTScan, minimum 100 bp. Sequence identifiers for these predicted
CDS are given the suffixes _1, _2, etc., to accommodate multiple
predictions and to indicate association with predicted protein
products. Sequences contain IUPAC ambiguity codes representing
ambiguous bases; see
http://www.bioinformatics.org/sms/iupac.html. Note that the total
number of predicted CDS might be higher or lower than the number
of contigs. This can be due to the reporting threshold of 100 nt
or to multiple predictions per contig.

--

peptides.fa: Protein products associated with contigs, as
predicted by ESTScan, minimum 30 aa. Sequence identifiers for
these predicted products correspond to the associated nucleotide
sequence in contigs.fa, and are given the suffixes _1, _2, etc.,
to accommodate multiple predictions. Note that the total number
of predicted peptides might be higher or lower than the number of
contigs. This can be due to the reporting threshold of 30 aa or
to multiple predictions per contig.

--

readcounts/contigs/all_alignments.dat: Read counts obtained by
post hoc alignment of reads to the reported contigs using BWA
with default parameters.
Tab-delimited columns with the format

    id  all_aligned  all_aligned_fraction  unique_aligned  paired_aligned  len

where

  id is the contig identifier, for example,
  Cephalotaxus_harringtonia-20120822|1234;

  all_aligned is the number of reads aligned to this contig,
  including multimapped reads;

  all_aligned_fraction is the number of reads aligned to this
  contig, with each multimapped read assigned fractionally to
  the contigs it hits. This has the advantage that the sum of
  the all_aligned_fraction counts equals the total number of
  reads that aligned;

  unique_aligned is the number of reads that aligned uniquely to
  this contig;

  paired_aligned is the number of read pairs aligned to this
  contig; and

  len is the length of the contig in bp.

* * * NOTE * * *

While this information is sufficient to compute common normalized
values such as RPKM (reads per kilobase of transcript per million
mapped reads) and FPKM (fragments per kilobase of transcript per
million mapped reads), these read counts are provided for quality
assessment of the contig set only. For differential expression
analyses, it is recommended that more sophisticated estimators of
relative expression level be employed. See for example: Salzman
J, Jiang H, Wong WH. Statistical modeling of RNA-Seq data.
Statistical Science 26 (2011).

--

annot/pfam.gff3, ...: Models matching predicted protein products
(peptides.fa) reported in GFF3 format; based on HMMER3 searches
against the Pfam-A, Superfamily, and TIGRFAMs model sets. These
are restricted to full-sequence-evalue <= 1.0e-5, with the top
five hits reported. Association with InterPro terms is indicated
in the Ontology_term attribute, and is based on the assertions
(InterPro -> model, InterPro -> protein accession) made by
InterPro. Currently, InterPro associations from Superfamily hits
are not computed.

------------------------------------
National Center for Genome Resources
http://www.ncgr.org
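
Appendix: As a rough illustration of the RPKM calculation
mentioned in the note on readcounts/contigs/all_alignments.dat,
the Python sketch below parses rows in the six-column layout and
computes RPKM from all_aligned_fraction and len. It is a minimal
sketch under stated assumptions, not the procedure used to
produce these files: the function names read_counts and rpkm are
hypothetical, the file is assumed to have no header row (a row
starting with "id" is skipped if present), and the sum of
all_aligned_fraction is taken as the total number of mapped
reads. The sample values are invented for the demonstration. As
the note says, such values are suitable for quality assessment
only.

```python
import csv
import io

# Column order as given in this README for all_alignments.dat.
COLUMNS = ("id", "all_aligned", "all_aligned_fraction",
           "unique_aligned", "paired_aligned", "len")


def read_counts(handle):
    """Parse tab-delimited read-count rows into a list of dicts."""
    rows = []
    for fields in csv.reader(handle, delimiter="\t"):
        # Assumption: skip blank lines and a header row, if one exists.
        if not fields or fields[0] == "id":
            continue
        if len(fields) != len(COLUMNS):
            raise ValueError(
                f"expected {len(COLUMNS)} columns, got {len(fields)}")
        row = dict(zip(COLUMNS, fields))
        # Every column except the identifier is numeric.
        for name in COLUMNS[1:]:
            row[name] = float(row[name])
        rows.append(row)
    return rows


def rpkm(rows):
    """Return {contig id: RPKM} using fractional read counts.

    RPKM = reads * 1e9 / (contig length in bp * total mapped reads).
    The total is the sum of all_aligned_fraction, which this README
    states equals the total number of reads that aligned.
    """
    total = sum(r["all_aligned_fraction"] for r in rows)
    if total <= 0:
        raise ValueError("no aligned reads in input")
    return {
        r["id"]: r["all_aligned_fraction"] * 1e9 / (r["len"] * total)
        for r in rows
    }


if __name__ == "__main__":
    # Invented example rows; real input would come from
    # readcounts/contigs/all_alignments.dat.
    sample = (
        "Cephalotaxus_harringtonia-20120822|1\t35\t30.0\t25\t12\t1000\n"
        "Cephalotaxus_harringtonia-20120822|2\t75\t70.0\t65\t30\t2000\n"
    )
    for contig, value in rpkm(read_counts(io.StringIO(sample))).items():
        print(contig, value)
```

Using all_aligned_fraction rather than all_aligned keeps
multimapped reads from being counted more than once in the
denominator.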