README (Last Updated 01/13/2020)
Edited by: Elizabeth Tseng (etseng@pacb.com)

IMPORTANT: Please note that this release of Iso-Seq data maps to the Hifiasm v12 version of the genome at
https://downloads.pacbcloud.com/public/dataset/redwood2020/hifiasm/v12/

********************
INTRODUCTION
********************

This README file describes the contents of this directory. This dataset contains intermediate and processed files for a Redwood Iso-Seq® dataset. The library was sequenced on the Sequel® II System and processed using SMRT Link v10.1, followed by community tool analysis.

For more information on Iso-Seq® methods [1] and bioinformatics analysis, see the PacBio® Iso-Seq GitHub [2] and the additional references below.

********************
SAMPLE
********************

Needles were collected from the same redwood tree that was used for the genome sequencing, flash frozen, and stored at -80C.

********************
METHODS
********************

Library Preparation: Iso-Seq® Express Template Preparation for Sequel® and Sequel® II Systems

Sequencing: Sequel II System with Sequel II Binding Kit 2.1 and Sequel II Sequencing Kit 2.0

Analysis: SMRT Link v10.1 "IsoSeq" protocol, followed by mapping to PacBio's redwood reference genome of the same species (https://downloads.pacbcloud.com/public/dataset/redwood2020/hifiasm/v12/redwood_v12.p_ctg.fa.gz) and collapsing into a non-redundant transcript set using Cupcake [3].

The mapping was done using gmapl, which can handle large, complex genomes:

```
gmapl -n 0 -t 60 --max-intronlength-ends 200000 --max-intronlength-middle 200000 --cross-species -z sense_force
```

After mapping, the Cupcake collapse script was run using a cutoff of 95% alignment coverage and 90% alignment identity. The final filtered Iso-Seq HQ transcripts above this threshold are in the "Final-MappedTranscripts" subdirectory of the data release.
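As a rough illustration of what a 95% coverage / 90% identity cutoff means, the Python sketch below splits alignments into kept and rejected sets. It is not the Cupcake implementation: the `Alignment` record, its field names, and the exact definitions of coverage and identity used here are assumptions made for illustration only. The actual parameters are in `run.sh` and the Cupcake repository [3].

```python
from dataclasses import dataclass

MIN_COVERAGE = 0.95  # from this README: 95% alignment coverage
MIN_IDENTITY = 0.90  # from this README: 90% alignment identity


@dataclass
class Alignment:
    """Hypothetical minimal alignment record (not a Cupcake data structure)."""
    query_id: str
    query_length: int  # full transcript length in bases
    query_start: int   # 0-based start of the aligned region on the transcript
    query_end: int     # 0-based, exclusive end of the aligned region
    matches: int       # aligned bases identical to the reference
    mismatches: int    # aligned bases that differ from the reference
    indels: int        # inserted plus deleted bases


def coverage(aln: Alignment) -> float:
    """Fraction of the transcript that is spanned by the alignment (assumed definition)."""
    return (aln.query_end - aln.query_start) / aln.query_length


def identity(aln: Alignment) -> float:
    """Matches divided by all aligned positions, indels included (assumed definition)."""
    aligned = aln.matches + aln.mismatches + aln.indels
    return aln.matches / aligned if aligned else 0.0


def passes_cutoffs(aln: Alignment,
                   min_cov: float = MIN_COVERAGE,
                   min_id: float = MIN_IDENTITY) -> bool:
    """True when the alignment meets both thresholds."""
    return coverage(aln) >= min_cov and identity(aln) >= min_id


def split_by_cutoffs(alignments):
    """Return (kept, rejected) lists of alignments."""
    kept, rejected = [], []
    for aln in alignments:
        (kept if passes_cutoffs(aln) else rejected).append(aln)
    return kept, rejected
```

In this sketch, a transcript failing either threshold lands in the rejected list, which corresponds to the unmapped or poorly mapped category in this release.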
Iso-Seq transcripts that were unmapped or poorly mapped were additionally run through Cogent [4] analysis and put in the "Final-UnmappedTranscripts" subdirectory. Additionally, BLASTN was run for both mapped and unmapped transcripts against the NT database using an E-value cutoff of 0.1.

Mapped transcripts were further phased using IsoPhase as described in [5] and [6].

The `run.sh` scripts in the subdirectories contain detailed run parameters.

********************
FILE DESCRIPTION
********************

===========================================================
Final Mapped Transcripts: Mapped, Collapsed, and Phased
===========================================================

Users wishing to immediately make use of the processed, mapped results should use the following files:

Final-MappedTranscripts
├── redwood_isoseq.mapped.collapsed.blastn_evidence.txt
├── redwood_isoseq.mapped.collapsed.exon_stats.txt
├── redwood_isoseq.mapped.collapsed.fasta
├── redwood_isoseq.mapped.collapsed.faa
├── redwood_isoseq.mapped.collapsed.fl_count.txt
├── redwood_isoseq.mapped.collapsed.gff
├── redwood_isoseq.mapped.collapsed.withCDS.gff
├── redwood_isoseq.mapped.collapsed.read_stat.txt
├── redwood_isoseq.mapped.collapsed.simple_stats.txt
├── run.sh
└── IsoPhase
    ├── redwood_isoseq.IsoPhase.by_loci_info.txt
    ├── redwood_isoseq.IsoPhase.by_loci_results.tar.gz
    ├── redwood_isoseq.IsoPhase.summarized_loci_results.txt
    ├── redwood_isoseq.IsoPhase.use_partial.cleaned.vcf
    ├── redwood_isoseq.IsoPhase.use_partial.pre_cleaned.vcf
    ├── run_phasing_in_dir.sh
    └── run.sh

The companion reference genome is:
https://downloads.pacbcloud.com/public/dataset/redwood2020/hifiasm/v12/redwood_v12.p_ctg.fa.gz
(Internal version number: redwood_v12)

Phasing was run using IsoPhase as described in [5] and [6], with a minimum required coverage per locus of 40 FLNC reads. Only SNPs are called (no indel calls).
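For readers who want a quick look at the SNP calls, the Python sketch below counts SNP records per contig in a VCF file. It assumes only the standard tab-separated VCF column order (CHROM, POS, ID, REF, ALT, ...); the function names are hypothetical and this is not part of IsoPhase.

```python
from collections import Counter


def is_snp(ref: str, alt: str) -> bool:
    """True if REF and every comma-separated ALT allele is a single base."""
    bases = set("ACGTN")
    alleles = [ref] + alt.split(",")
    return all(len(a) == 1 and a.upper() in bases for a in alleles)


def count_snps_per_contig(vcf_lines) -> Counter:
    """Count SNP records per CHROM value, skipping header and blank lines."""
    counts = Counter()
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, ref, alt = fields[0], fields[3], fields[4]
        if is_snp(ref, alt):
            counts[chrom] += 1
    return counts
```

To use it, pass an open file handle (for example, one opened on redwood_isoseq.IsoPhase.use_partial.cleaned.vcf) to `count_snps_per_contig`. Running it on both VCF files in the IsoPhase directory gives a simple per-contig comparison of the two call sets.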
The "cleaning" (or error correction) step of IsoPhase can potentially reduce the number of true alleles called, given the high ploidy (n=6) and uneven allelic isoform expression of this species. For this reason, we have provided both the pre-cleaned and cleaned VCF files. A lack of SNPs called for a locus does not necessarily indicate a lack of variants; it could also be due to a lack of transcript coverage.

===========================================================
Final Unmapped Transcripts: Additional Cogent Analysis
For potentially missing or poorly assembled genes
===========================================================

Users wishing to look at unmapped (or poorly mapped) transcripts that were excluded from the "Final-MappedTranscripts" bin can use the files in the directory below.

The file "redwood_isoseq.unmapped_or_badmapped.fasta" contains the individual Iso-Seq HQ transcripts.

The file "redwood_isoseq.unmapped_or_badmapped.cogent_reconstructed_contigs.fasta" contains the result of further Cogent analysis [4], in which gene family membership among the HQ transcripts was inferred from k-mer similarity, and transcripts thought to be from the same gene family were reconstructed into contigs using Cogent.

Not all unmapped HQ transcripts have a Cogent output. For those that do, the reconstructed contigs can be used to visualize alternative splicing of the isoforms. Further, these reconstructed contigs can be used to supplement the reference genome for missing or mis-assembled genes. For an example of Cogent applied to reference genome QC, see Warr et al.
[7]

Final-UnmappedTranscripts
├── redwood_isoseq.unmapped_or_badmapped.cogent_reconstructed_contigs.fasta
├── redwood_isoseq.unmapped_or_badmapped.cogent_stats.txt
├── redwood_isoseq.unmapped_or_badmapped.fasta
└── run.sh

===========================================================
Intermediate: Full-Length, Non-Concatemer (FLNC) Reads
===========================================================

We DO NOT recommend that most users re-analyze from the intermediate (FLNC) data.

Intermediate-FullLengthReads
├── flnc.bam
└── flnc.report.csv

********************
REFERENCES
********************

[1] PacBio Iso-Seq Landing Page: https://www.pacb.com/applications/rna-sequencing/
[2] PacBio Iso-Seq GitHub Wiki: https://github.com/PacificBiosciences/IsoSeq_SA3nUP
[3] Community Tool Cupcake: https://github.com/Magdoll/cDNA_Cupcake
[4] Community Tool Cogent: https://github.com/Magdoll/Cogent
[5] Community Tool IsoPhase: https://github.com/Magdoll/cDNA_Cupcake/wiki/IsoPhase:-Haplotyping-using-Iso-Seq-data
[6] Wang et al. "Variant phasing and haplotypic expression from long-read sequencing in maize", Communications Biology (2020) https://www.nature.com/articles/s42003-020-0805-8
[7] Warr et al. "An improved pig reference genome sequence to enable pig genetics and genomics research", GigaScience (2020) https://academic.oup.com/gigascience/article/9/6/giaa051/5858065

For Research Use Only. Not for use in diagnostic procedures. Copyright 2021, Pacific Biosciences of California, Inc. All rights reserved. The data provided in these files is subject to change without notice and Pacific Biosciences assumes no responsibility for any errors or omissions. Certain notices, terms, conditions and/or use restrictions may pertain to your use of Pacific Biosciences data, products and/or third party products. Please refer to the applicable Pacific Biosciences Terms and Conditions of Sale and to the applicable license terms at http://www.pacificbiosciences.com/licenses.html.