Index of /public/dataset/Alzheimer_IsoSeq_2016
Name Last modified Size Description
final_confident/ 2016-11-14 15:32 -
final_promiscuous/ 2016-08-23 10:21 -
intermediate_files/ 2016-08-22 13:37 -
movies.list.txt 2016-08-24 16:10 2.2K
compare.Confident_vs_Promiscuous.pdf 2016-08-24 16:10 73K
README.txt 2016-08-24 13:08 6.5K
README (Last Updated 08/24/2016)
1. INTRODUCTION
This README file describes the contents of this directory.
The dataset released in this directory contains polished transcriptome sequencing results for an Alzheimer's disease human brain sample, generated using PacBio(R) SMRT(R) Sequencing.
2. LIBRARY PREPARATION AND SEQUENCING
This is an Alzheimer's Disease Brain total RNA sample purchased from BioChain (http://www.biochain.com/biospecimen-rna-purification-analysis/total-rna/human-disease.html). The Lot Number is A703252.
A first-strand cDNA library was generated using the Clontech SMARTer cDNA synthesis kit, followed by size selection on the SageELF(TM) device by Sage Science. Lanes were combined to create 5 size libraries that roughly correspond to 1-2 kb, 2-3 kb, 3-5 kb, 5-7 kb, and > 7 kb [1]. Sequencing used P6-C4 chemistry, with 3-hr movies for the 1-2 kb fraction and 4-hr movies for the remaining fractions. Sequencing was completed in 2015.
3. BIOINFORMATICS PROCESSING
The standard Iso-Seq pipeline (ToFU version 2.2.3, equivalent to SMRTAnalysis 3.1) was used to process the data. Iso-Seq classify generated 1,107,889 full-length non-chimeric (FLNC) reads and 1,929,319 non-full-length (nFL) reads. The reads were then used to generate high-quality, full-length isoforms using ICE followed by Quiver polishing (HQ Quiver isoform consensus). By definition, an HQ Quiver isoform must have at least 2 supporting FL reads and a predicted accuracy of >= 99%.
The HQ Quiver consensus sequences were aligned to hg38 using GMAP with the parameters:
GMAP version 2015-12-31 called with args: -d hg38 -n 0 -z sense_force --cross-species
The following filter criteria were then applied:
<a> minimum alignment coverage >= 99%
<b> minimum alignment identity >= 95%
<c> mapped isoforms that have a degraded 5' end compared to a longer isoform are removed
(EX: isoform A has exons 1, 2, 3, 5; isoform B has exons 2, 3, 5; isoform C has exons 2, 3 ==> B is removed as a 5'-degraded version of A, so only isoforms A and C are kept)
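The three filters above can be sketched in Python as follows. This is a minimal illustration, not the pipeline's actual code: the `Isoform` record, its field names, and the representation of each isoform as a list of exon numbers are assumptions made for the example. A 5'-degraded isoform is approximated here as one whose exon list is a proper suffix of a longer isoform's exon list.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Isoform:
    # Hypothetical record; field names are illustrative only.
    name: str
    coverage: float   # alignment coverage as a fraction (0.99 == 99%)
    identity: float   # alignment identity as a fraction
    exons: List[int]  # exon numbers in 5'->3' order


def is_5p_degraded(short: Isoform, long: Isoform) -> bool:
    """True if `short` matches the 3' portion of `long` but lacks 5' exons."""
    n = len(short.exons)
    return n < len(long.exons) and long.exons[-n:] == short.exons


def filter_isoforms(isoforms: List[Isoform]) -> List[Isoform]:
    # Filters <a> and <b>: alignment coverage and identity thresholds.
    passed = [i for i in isoforms if i.coverage >= 0.99 and i.identity >= 0.95]
    # Filter <c>: drop isoforms that are 5'-degraded versions of another one.
    return [
        i for i in passed
        if not any(is_5p_degraded(i, other) for other in passed)
    ]


# The A/B/C example from the text, with made-up alignment scores.
example = [
    Isoform("A", 0.995, 0.99, [1, 2, 3, 5]),
    Isoform("B", 0.995, 0.99, [2, 3, 5]),
    Isoform("C", 0.995, 0.99, [2, 3]),
]
kept = [i.name for i in filter_isoforms(example)]
print(kept)
```

Isoform C survives because it lacks exon 5 and is therefore not simply a 5'-truncated copy of A or B.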
Fusion candidates were additionally identified based on GMAP alignment to hg38 and the following criteria:
<a> aligned to 2 or more loci
<b> each locus covers at least 5% (or 100 bp) of the transcript
<c> the loci are at least 100 kb apart
<d> supported by at least 5 FL reads
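As a rough sketch, the fusion criteria could be checked as shown below. The `Locus` record and the function names are hypothetical, and two readings are assumptions rather than statements from the pipeline: criterion <b> is interpreted as "at least 5% of the transcript or at least 100 bp", and loci on different chromosomes are treated as satisfying criterion <c>.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Locus:
    # Hypothetical record for one aligned segment of a transcript.
    chrom: str
    start: int       # genomic start (bp)
    end: int         # genomic end (bp)
    aligned_bp: int  # transcript bases aligned at this locus


def far_apart(a: Locus, b: Locus, min_gap: int = 100_000) -> bool:
    # Assumption: loci on different chromosomes always count as far apart.
    if a.chrom != b.chrom:
        return True
    gap = max(a.start, b.start) - min(a.end, b.end)
    return gap >= min_gap


def is_fusion_candidate(transcript_len: int, loci: List[Locus], fl_reads: int) -> bool:
    # <a> aligned to 2 or more loci
    if len(loci) < 2:
        return False
    # <b> each locus covers at least 5% of the transcript or at least 100 bp
    #     (this reading of "5% (or 100 bp)" is an assumption)
    for locus in loci:
        if locus.aligned_bp < 0.05 * transcript_len and locus.aligned_bp < 100:
            return False
    # <c> every pair of loci is at least 100 kb apart
    for i in range(len(loci)):
        for j in range(i + 1, len(loci)):
            if not far_apart(loci[i], loci[j]):
                return False
    # <d> supported by at least 5 FL reads
    return fl_reads >= 5


# Made-up example: two loci on chr1 roughly 498 kb apart.
loci = [Locus("chr1", 1_000, 2_000, 1_200), Locus("chr1", 500_000, 501_000, 800)]
print(is_fusion_candidate(2_000, loci, fl_reads=6))
```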
For a schematic of the bioinformatics process, please refer to the Methods section in Gordon et al. [2]
A public UCSC browser track containing the GFF files described below is available at:
https://genome.ucsc.edu/cgi-bin/hgTracks?hgS_doOtherUser=submit&hgS_otherUserName=Magdoll&hgS_otherUserSessionName=2015_Ting_AlzBreast
4. DIFFERENCE BETWEEN "CONFIDENT" and "PROMISCUOUS" FINAL DATASET
Two final datasets are provided. "Confident" uses the traditional Iso-Seq criterion: the sequences are high-quality (HQ) Quiver consensus sequences with at least 2 supporting FL reads and >= 99% predicted accuracy. They were then aligned to the genome with the further filtering and collapsing steps outlined in Section 3.
"Promiscuous" is the relaxed version that takes both HQ Quiver consensus sequences (FL>=2, predicted accuracy >= 99%) and good LQ Quiver consensus sequences (FL=1, predicted accuracy >= 99%). These were then aligned to the genome with the same filtering and collapsing steps. Because this dataset contains low-abundance sequences (many supported by only 1 FL read), it is expected to contain more artifacts.
Users are HIGHLY ENCOURAGED to use only the "Confident" set unless there's reason to use the less reliable "Promiscuous" dataset.
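The read-support and accuracy thresholds that separate the two datasets can be summarized in a short Python sketch. The function name and return format are hypothetical, and the result only indicates which dataset a consensus sequence is eligible for before the alignment filtering and collapsing steps of Section 3.

```python
from typing import List


def eligible_datasets(fl_reads: int, predicted_accuracy: float) -> List[str]:
    """Return the datasets a Quiver consensus could enter, before alignment filtering.

    `predicted_accuracy` is a fraction (0.99 == 99%).
    """
    if predicted_accuracy < 0.99:
        return []                             # fails the accuracy cutoff for both sets
    if fl_reads >= 2:
        return ["confident", "promiscuous"]   # HQ consensus
    if fl_reads == 1:
        return ["promiscuous"]                # good LQ consensus
    return []


print(eligible_datasets(3, 0.995))
print(eligible_datasets(1, 0.995))
```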
5. DESCRIPTION OF FILES
IsoSeq_Alzheimer_2016edition_polished.confident.unimapped.rep.fastq - Polished fastq sequences, non-chimeric only.
IsoSeq_Alzheimer_2016edition_polished.confident.unimapped.hg38.gff - GFF alignment of the above to hg38.
IsoSeq_Alzheimer_2016edition_polished.confident.unimapped.hg38.bam - BAM alignment of the above to hg38. An hg19 version is also provided.
IsoSeq_Alzheimer_2016edition_polished.confident.unimapped.abundance.txt - Supporting number of FL and nFL reads for each unique isoform.
IsoSeq_Alzheimer_2016edition_polished.confident.unimapped.read_stat.txt - Detailed list of which FL/nFL reads belong to which isoforms.
IsoSeq_Alzheimer_2016edition_polished.confident.unimapped.rep.fastq.sorted.sam.matchAnnot_gencode25_all.txt - matchAnnot comparison result.
IsoSeq_Alzheimer_2016edition_polished.confident.unimapped.ANGEL_ORF.pep - Predicted ORF using ANGEL [3]
IsoSeq_Alzheimer_2016edition_polished.confident.fusion.rep.fastq - Polished fastq sequences, fusion candidates only.
IsoSeq_Alzheimer_2016edition_polished.confident.fusion.gff - Alignment of the above to hg38. Each fusion candidate is named using the format <gene1>+<gene2> followed by the suffix .1, .2, to allow proper loading as a UCSC browser track.
IsoSeq_Alzheimer_2016edition_polished.confident.fusion.abundance.txt - Supporting number of FL and nFL reads for each fusion candidate.
The descriptions for the "promiscuous" dataset are the same, except that the filenames contain the label "promiscuous".
6. REFERENCES
[1] Isoform Sequencing using the Clontech SMARTer cDNA Synthesis Kit and SageELF Size-selection System: http://www.pacb.com/wp-content/uploads/Procedure-Checklist-Isoform-Sequencing-Iso-Seq-Analysis-using-the-Clontech-SMARTer-PCR-cDNA-Synthesis-Kit-and-SageELF-Size-Selection-System.pdf
[2] Gordon, S. P. et al. Widespread Polycistronic Transcripts in Fungi Revealed by Single-Molecule mRNA Sequencing. PLoS ONE 10, e0132628 (2015).
[3] ANGEL: Robust ORF prediction https://github.com/PacificBiosciences/ANGEL
For Research Use Only. Not for use in diagnostic procedures. Copyright 2016, Pacific Biosciences of California, Inc. All rights reserved. The data provided in these files is subject to change without notice and Pacific Biosciences assumes no responsibility for any errors or omissions. Certain notices, terms, conditions and/or use restrictions may pertain to your use of Pacific Biosciences data, products and/or third party products. Please refer to the applicable Pacific Biosciences Terms and Conditions of Sale and to the applicable license terms at http://www.pacificbiosciences.com/licenses.html. Pacific Biosciences, the Pacific Biosciences logo, PacBio, SMRT, SMRTbell and Iso-Seq are trademarks of Pacific Biosciences. BluePippin and SageELF are trademarks of Sage Science, Inc. NGS-go and NGSengine are trademarks of GenDx. All other trademarks are the sole property of their respective owners.