README  (Last Updated 01/13/2020)

Edited by: Elizabeth Tseng (etseng@pacb.com) 


IMPORTANT: Please note that this release of Iso-Seq data maps to the
Hifiasm v12 version of the genome at 
https://downloads.pacbcloud.com/public/dataset/redwood2020/hifiasm/v12/


********************
INTRODUCTION
********************

   This README file describes the contents of this directory.

   This dataset contains intermediate and processed files for a
Redwood Iso-Seq® dataset. The library was sequenced on the Sequel® II System 
and processed using SMRT Link v10.1, followed by community tool analysis.

For more information on Iso-Seq® methods[1] and bioinformatics analysis,
see the PacBio® Iso-Seq GitHub[2] and additional references below.


********************
SAMPLE
********************

Needles were collected from the same redwood tree used for the genome sequencing,
flash-frozen, and stored at -80°C.

********************
METHODS
********************

Library Preparation: 
Iso-Seq® Express Template Preparation for Sequel® and Sequel® II Systems 

Sequencing: 
Sequel II System with Sequel II Binding Kit 2.1 and Sequel II Sequencing Kit 2.0

Analysis: 
SMRT Link v10.1 "IsoSeq" protocol, followed by mapping to PacBio's redwood
reference genome of the same species
(https://downloads.pacbcloud.com/public/dataset/redwood2020/hifiasm/v12/redwood_v12.p_ctg.fa.gz)
and collapsing into a non-redundant transcript set using Cupcake[3].

The mapping was done using gmapl, which can handle large, complex genomes:

```
gmapl -n 0 -t 60 --max-intronlength-ends 200000 --max-intronlength-middle 200000 --cross-species -z sense_force 
```

After mapping, the Cupcake collapse script was run using cutoffs of
95% alignment coverage and 90% alignment identity. The final filtered Iso-Seq
HQ transcripts that meet these thresholds are in the "Final-MappedTranscripts"
subdirectory of the data release.
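
The cutoff logic can be sketched in Python as below. This is an illustrative
sketch only, not the Cupcake implementation: the `Alignment` record, its field
names, and the helper functions are hypothetical, and treating the cutoffs as
inclusive (>=) is an assumption. See the `run.sh` scripts for the actual
parameters used.

```python
from dataclasses import dataclass

# Cutoffs stated in this README, expressed as fractions.
MIN_COVERAGE = 0.95
MIN_IDENTITY = 0.90


@dataclass
class Alignment:
    """Hypothetical record for one transcript alignment (not a Cupcake type)."""
    transcript_id: str
    coverage: float  # fraction of the transcript covered by the alignment
    identity: float  # fraction of aligned bases that match the reference


def passes_cutoffs(aln, min_coverage=MIN_COVERAGE, min_identity=MIN_IDENTITY):
    # Assumption: both comparisons are inclusive.
    return aln.coverage >= min_coverage and aln.identity >= min_identity


def split_by_cutoffs(alignments):
    """Return (kept, rejected) lists; rejected ones are the 'poorly mapped' set."""
    kept, rejected = [], []
    for aln in alignments:
        (kept if passes_cutoffs(aln) else rejected).append(aln)
    return kept, rejected


# Made-up example values, for illustration only.
example = [
    Alignment("transcript_a", coverage=0.99, identity=0.97),
    Alignment("transcript_b", coverage=0.80, identity=0.99),
    Alignment("transcript_c", coverage=0.96, identity=0.85),
]
kept, rejected = split_by_cutoffs(example)
print([a.transcript_id for a in kept])
print([a.transcript_id for a in rejected])
```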

Iso-Seq transcripts that were unmapped or poorly mapped were additionally run
through Cogent[4] analysis and placed in the "Final-UnmappedTranscripts" subdirectory.

Additionally, BLASTN was run for both mapped and unmapped transcripts
against the NT database using an E-value cutoff of 0.1.


Mapped transcripts were further phased using IsoPhase as described in [5] and [6].


The `run.sh` scripts in the subdirectories contain detailed run parameters.

   
********************
FILE DESCRIPTION
********************

===========================================================
Final Mapped Transcripts: Mapped, Collapsed, and Phased
===========================================================

Users wishing to make immediate use of the processed, mapped
results should use the following files:

Final-MappedTranscripts
├── redwood_isoseq.mapped.collapsed.blastn_evidence.txt
├── redwood_isoseq.mapped.collapsed.exon_stats.txt
├── redwood_isoseq.mapped.collapsed.fasta
├── redwood_isoseq.mapped.collapsed.faa
├── redwood_isoseq.mapped.collapsed.fl_count.txt
├── redwood_isoseq.mapped.collapsed.gff
├── redwood_isoseq.mapped.collapsed.withCDS.gff
├── redwood_isoseq.mapped.collapsed.read_stat.txt
├── redwood_isoseq.mapped.collapsed.simple_stats.txt
├── run.sh
└── IsoPhase
    ├── redwood_isoseq.IsoPhase.by_loci_info.txt
    ├── redwood_isoseq.IsoPhase.by_loci_results.tar.gz
    ├── redwood_isoseq.IsoPhase.summarized_loci_results.txt
    ├── redwood_isoseq.IsoPhase.use_partial.cleaned.vcf
    ├── redwood_isoseq.IsoPhase.use_partial.pre_cleaned.vcf
    ├── run_phasing_in_dir.sh
    └── run.sh


Where the companion reference genome is: 
https://downloads.pacbcloud.com/public/dataset/redwood2020/hifiasm/v12/redwood_v12.p_ctg.fa.gz
(Internal version number: redwood_v12)

Phasing is run using IsoPhase as described in [5] and [6], with a minimum required
coverage per locus of 40 FLNC reads. Only SNPs are called (no indel calls).
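
As an illustration of the per-locus coverage requirement, the sketch below sums
full-length read counts per locus and keeps loci with at least 40 reads. It is
not IsoPhase code: the function names are hypothetical, the counts are made-up
example values, and it assumes that isoform IDs follow the Cupcake-style
`PB.<locus>.<isoform>` pattern and that coverage means the total over all
isoforms of a locus.

```python
from collections import defaultdict

MIN_FLNC_PER_LOCUS = 40  # minimum coverage per locus stated in this README


def locus_of(isoform_id):
    """Assumed ID pattern: 'PB.12.3' belongs to locus 'PB.12'."""
    return isoform_id.rsplit(".", 1)[0]


def loci_with_enough_coverage(fl_counts, min_reads=MIN_FLNC_PER_LOCUS):
    """Map each locus to its total read count, keeping loci at or above min_reads.

    fl_counts: dict of isoform ID -> full-length read count.
    """
    totals = defaultdict(int)
    for isoform_id, count in fl_counts.items():
        totals[locus_of(isoform_id)] += count
    return {locus: total for locus, total in totals.items() if total >= min_reads}


# Made-up counts, for illustration only.
example_counts = {"PB.1.1": 25, "PB.1.2": 20, "PB.2.1": 12}
print(loci_with_enough_coverage(example_counts))
```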

The "cleaning" (or error correction) step of IsoPhase can potentially reduce the number of true
alleles called, given the high ploidy (n=6) and uneven allelic isoform expression of this species.
As such, we have provided both the pre-cleaned and cleaned VCF files. A lack of SNPs called
for a locus does not necessarily indicate a lack of variants; it could also be due to a lack of
transcript coverage.
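
One way to see which SNP calls are absent after cleaning is to compare the two
VCF files site by site. The sketch below is a minimal example under stated
assumptions, not part of IsoPhase: the function names are hypothetical, the
example records are made up, and it relies only on the standard VCF column
layout (CHROM, POS, ID, REF, ALT as the first five tab-separated fields).

```python
def read_snp_keys(vcf_lines):
    """Collect (CHROM, POS, REF, ALT) keys from an iterable of VCF lines."""
    keys = set()
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines, meta-information, and the header row
        fields = line.split("\t")
        if len(fields) < 5:
            continue  # not a complete VCF record
        chrom, pos, _record_id, ref, alt = fields[:5]
        keys.add((chrom, int(pos), ref, alt))
    return keys


def only_in_pre_cleaned(pre_cleaned_lines, cleaned_lines):
    """Return sorted SNP keys found in the pre-cleaned set but not the cleaned set."""
    return sorted(read_snp_keys(pre_cleaned_lines) - read_snp_keys(cleaned_lines))


# Made-up records, for illustration only.
header = ["##fileformat=VCFv4.2", "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO"]
pre_cleaned = header + ["ctg1\t100\t.\tA\tG\t.\tPASS\t.", "ctg1\t250\t.\tC\tT\t.\tPASS\t."]
cleaned = header + ["ctg1\t100\t.\tA\tG\t.\tPASS\t."]
print(only_in_pre_cleaned(pre_cleaned, cleaned))
```

To apply this to the released files, pass open file handles for
redwood_isoseq.IsoPhase.use_partial.pre_cleaned.vcf and
redwood_isoseq.IsoPhase.use_partial.cleaned.vcf in place of the example lists.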


===========================================================
Final Unmapped Transcripts: Additional Cogent Analysis 
 For potentially missing or poorly assembled genes
===========================================================

Users wishing to look at unmapped (or poorly mapped) transcripts
that were excluded from the "Final-MappedTranscripts" bin can use
the files in the directory below.

The file "redwood_isoseq.unmapped_or_badmapped.fasta" contains the
individual Iso-Seq HQ transcripts.

The file "redwood_isoseq.unmapped_or_badmapped.cogent_reconstructed_contigs.fasta"
is the result of further running Cogent analysis [4], in which gene family relationships
among the HQ transcripts were inferred based on k-mer similarity, and transcripts
thought to be from the same gene family were reconstructed into contigs using Cogent. Not all
unmapped HQ transcripts have a Cogent output. For those that do, the reconstructed
contigs can be used to visualize alternative splicing of the isoforms. Further, these
reconstructed contigs can be used to supplement the reference genome for missing
or mis-assembled genes.

For an example of Cogent applied to reference genome QC, see Warr et al. [7].

 
Final-UnmappedTranscripts
├── redwood_isoseq.unmapped_or_badmapped.cogent_reconstructed_contigs.fasta 
├── redwood_isoseq.unmapped_or_badmapped.cogent_stats.txt 
├── redwood_isoseq.unmapped_or_badmapped.fasta 
└── run.sh


===========================================================
Intermediate: Full-Length, Non-Concatemer (FLNC) Reads
===========================================================

We DO NOT recommend that most users re-analyze from the intermediate (FLNC) data.

Intermediate-FullLengthReads
├── flnc.bam 
└── flnc.report.csv 

  
********************
REFERENCES
********************

[1] PacBio Iso-Seq Landing Page: https://www.pacb.com/applications/rna-sequencing/
[2] PacBio Iso-Seq GitHub Wiki: https://github.com/PacificBiosciences/IsoSeq_SA3nUP
[3] Community Tool Cupcake: https://github.com/Magdoll/cDNA_Cupcake
[4] Community Tool Cogent: https://github.com/Magdoll/Cogent
[5] Community Tool IsoPhase: https://github.com/Magdoll/cDNA_Cupcake/wiki/IsoPhase:-Haplotyping-using-Iso-Seq-data
[6] Wang et al. "Variant phasing and haplotypic expression from long-read sequencing in maize", Communications Biology (2020)
    https://www.nature.com/articles/s42003-020-0805-8
[7] Warr et al. "An improved pig reference genome sequence to enable pig genetics and genomics research", GigaScience (2020)
    https://academic.oup.com/gigascience/article/9/6/giaa051/5858065


For Research Use Only. Not for use in diagnostic procedures.  Copyright 2021, 
Pacific Biosciences of California, Inc. All rights reserved. The data provided in 
these files is subject to change without notice and Pacific Biosciences assumes no 
responsibility for any errors or omissions. Certain notices, terms, conditions and/or 
use restrictions may pertain to your use of Pacific Biosciences data, products and/or 
third party products. Please refer to the applicable Pacific Biosciences Terms and 
Conditions of Sale and to the applicable license terms at 
http://www.pacificbiosciences.com/licenses.html.