README  (Last Updated 06/08/2020)

********************
INTRODUCTION
********************

   This README file describes the contents of this directory.

   This dataset contains raw, intermediate, and processed files of targeted 
sequence data for a set of 7 samples with repeat-expansion genotypes and 1
control sample with no repeat expansions at the targeted sites. Targeted
sites of interest are HTT and FMR1. The library was sequenced on the
Sequel II System and analyzed with community tools hosted on GitHub.
For more information on the No-Amp method[1] and the bioinformatics
analysis[2], see the references below.


********************
SAMPLE
********************

Seven genomic DNA samples from the Coriell Institute and one DNA sample
from the HEK293 cell line:

Samples with HTT CAG repeat expansions
    NA13505 with sequencing barcode BC1015
    NA13509 with sequencing barcode BC1016
    NA20253 with sequencing barcode BC1017
    NA14044 with sequencing barcode BC1018

Samples with FMR1 CGG repeat expansions
    NA13664 with sequencing barcode BC1020
    NA06896 with sequencing barcode BC1021
    NA07537 with sequencing barcode BC1022

Sample without known repeat expansions
    HEK293 with sequencing barcode BC1019
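
For scripted use, the sample sheet above can be written as a small lookup
table. This is a minimal sketch: the sample IDs, barcodes, and loci are
transcribed from the list above, while the dictionary layout and the
describe() helper are illustrative assumptions and not part of the dataset.

```python
# Sample sheet transcribed from the SAMPLE section of this README.
# Keys are sequencing barcodes; values are (sample ID, expanded locus).
SAMPLES = {
    "BC1015": ("NA13505", "HTT"),
    "BC1016": ("NA13509", "HTT"),
    "BC1017": ("NA20253", "HTT"),
    "BC1018": ("NA14044", "HTT"),
    "BC1019": ("HEK293", None),  # control, no known repeat expansion
    "BC1020": ("NA13664", "FMR1"),
    "BC1021": ("NA06896", "FMR1"),
    "BC1022": ("NA07537", "FMR1"),
}

# Repeat motif at each targeted locus, as stated in the sample list.
MOTIFS = {"HTT": "CAG", "FMR1": "CGG"}


def describe(barcode):
    """Return a one-line description for a barcode (hypothetical helper)."""
    sample, locus = SAMPLES[barcode.upper()]
    if locus is None:
        return f"{sample}: control, no known repeat expansion"
    return f"{sample}: {locus} {MOTIFS[locus]} repeat expansion"
```

File names in this dataset use the lowercase form of the barcode (for
example bc1015), so the helper accepts either case.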

********************
METHODS
********************

Library Preparation: 
Procedure & Checklist – No-Amp Targeted Sequencing Utilizing the CRISPR-Cas9 System (PN 101-801-500) 

Sequencing: 
Sequel II System with Sequel II Binding Kit 2.0 (PN 101-842-900) and 
Sequel II Sequencing Kit 2.0 (4 rxn) (PN 101-820-200)

Run time: 
30hr movie + 0.5hr pre-extension 

Analysis: 
PacBio GitHub Repeat Analysis Tools pipeline[2], using the following executable versions for
data preparation:
    ccs 4.2.0 (commit v4.2.0-1-g450908e4) (available from pbbioconda[3])
    lima 1.11.0 (commit v1.11.0-1-gec618c9)
    pbmm2 1.2.0 (commit v1.2.0-1-g31b4be0)
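
The three tools listed above run in sequence: ccs builds consensus reads
from the subreads, lima demultiplexes them by barcode, and pbmm2 aligns them
to the reference. The sketch below only assembles the command lines and does
not run them. The intermediate file names are hypothetical, and the options
shown (--same and --ccs for lima, --preset CCS and --sort for pbmm2) are
common choices assumed here, not the parameters used to produce this
dataset; see the pipeline documentation[2] for those.

```python
import shlex


def build_commands(movie, reference, barcodes):
    """Return ccs -> lima -> pbmm2 command lines as argument lists.

    Output file names are placeholders chosen for this sketch.
    """
    subreads = f"rawMovie/{movie}.subreads.bam"
    ccs_bam = f"{movie}.ccs.bam"          # hypothetical name
    demux_bam = f"{movie}.demux.bam"      # hypothetical name
    aligned_bam = f"{movie}.aligned.bam"  # hypothetical name
    return [
        # Consensus calling from raw subreads.
        ["ccs", subreads, ccs_bam],
        # Demultiplexing; --same assumes the same barcode on both ends.
        ["lima", ccs_bam, barcodes, demux_bam, "--same", "--ccs"],
        # Alignment to the reference with the CCS preset, sorted output.
        ["pbmm2", "align", reference, demux_bam, aligned_bam,
         "--preset", "CCS", "--sort"],
    ]


if __name__ == "__main__":
    commands = build_commands(
        "m64012_191221_044659",
        "hs37d5.fa",
        "auxiliary/Barcoded_Adapter_8B.fasta",
    )
    for command in commands:
        print(shlex.join(command))
```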

Post-mapping repeat analysis was performed using GitHub scripts[2].
   
********************
FILE DESCRIPTION
********************

========================
WHAT FILES SHOULD I USE? 
========================
Users wishing to immediately make use of the processed, demultiplexed, and mapped
results in third-party tools should use the following BAM files:

analysis/align/
├── m64012_191221_044659.ccsset.bc1015--bc1015.bam
├── m64012_191221_044659.ccsset.bc1016--bc1016.bam
├── m64012_191221_044659.ccsset.bc1017--bc1017.bam
├── m64012_191221_044659.ccsset.bc1018--bc1018.bam
├── m64012_191221_044659.ccsset.bc1019--bc1019.bam
├── m64012_191221_044659.ccsset.bc1020--bc1020.bam
├── m64012_191221_044659.ccsset.bc1021--bc1021.bam
└── m64012_191221_044659.ccsset.bc1022--bc1022.bam
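
The aligned BAM names follow a fixed pattern: the movie name, "ccsset", and
the barcode pair. A minimal sketch for building these paths and checking
that a download is complete is given below. The function names and the
root argument are assumptions made for illustration.

```python
from pathlib import Path

MOVIE = "m64012_191221_044659"
# Barcodes bc1015 through bc1022, as listed in the SAMPLE section.
BARCODES = [f"bc{number}" for number in range(1015, 1023)]


def aligned_bam(barcode, root="."):
    """Path of the aligned BAM for one barcode under a dataset root."""
    name = f"{MOVIE}.ccsset.{barcode}--{barcode}.bam"
    return Path(root) / "analysis" / "align" / name


def missing_bams(root="."):
    """Barcodes whose aligned BAM is not present under root."""
    return [bc for bc in BARCODES if not aligned_bam(bc, root).is_file()]


if __name__ == "__main__":
    absent = missing_bams(".")
    print("all aligned BAMs present" if not absent else f"missing: {absent}")
```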

Additionally, users who wish to use the extracted repeat expansion regions, as defined
in the target BED file (auxiliary/human_hs37d5.targets_repeatonly.bed), should use the
following FASTQ files:

analysis/fastq/
├── m64012_191221_044659.ccsset.bc1015--bc1015.extracted_FMR1.fastq
├── m64012_191221_044659.ccsset.bc1015--bc1015.extracted_HTT.fastq
├── m64012_191221_044659.ccsset.bc1016--bc1016.extracted_FMR1.fastq
├── m64012_191221_044659.ccsset.bc1016--bc1016.extracted_HTT.fastq
├── m64012_191221_044659.ccsset.bc1017--bc1017.extracted_FMR1.fastq
├── m64012_191221_044659.ccsset.bc1017--bc1017.extracted_HTT.fastq
├── m64012_191221_044659.ccsset.bc1018--bc1018.extracted_FMR1.fastq
├── m64012_191221_044659.ccsset.bc1018--bc1018.extracted_HTT.fastq
├── m64012_191221_044659.ccsset.bc1019--bc1019.extracted_FMR1.fastq
├── m64012_191221_044659.ccsset.bc1019--bc1019.extracted_HTT.fastq
├── m64012_191221_044659.ccsset.bc1020--bc1020.extracted_FMR1.fastq
├── m64012_191221_044659.ccsset.bc1020--bc1020.extracted_HTT.fastq
├── m64012_191221_044659.ccsset.bc1021--bc1021.extracted_FMR1.fastq
├── m64012_191221_044659.ccsset.bc1021--bc1021.extracted_HTT.fastq
├── m64012_191221_044659.ccsset.bc1022--bc1022.extracted_FMR1.fastq
└── m64012_191221_044659.ccsset.bc1022--bc1022.extracted_HTT.fastq
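
A quick way to inspect an extracted FASTQ file is to summarize the lengths
of its reads. The sketch below uses only the Python standard library and
assumes the common four-lines-per-record FASTQ layout (no wrapped
sequences); the function names are illustrative. Read length is only a
rough proxy for repeat size, since the extracted region may include
flanking sequence depending on how the targets were defined.

```python
from statistics import median


def read_lengths(handle):
    """Yield the sequence length of each record in a 4-line FASTQ stream."""
    for header in handle:
        if not header.strip():
            continue  # tolerate blank lines between or after records
        if not header.startswith("@"):
            raise ValueError(f"unexpected FASTQ header: {header!r}")
        sequence = next(handle, "").strip()
        next(handle, "")  # '+' separator line
        next(handle, "")  # quality string
        yield len(sequence)


def summarize(handle):
    """Return read count and min/median/max length, or None if empty."""
    lengths = sorted(read_lengths(handle))
    if not lengths:
        return None
    return {
        "reads": len(lengths),
        "min": lengths[0],
        "median": median(lengths),
        "max": lengths[-1],
    }
```

Usage: open one of the files listed above, for example
with open(path) as handle: print(summarize(handle)).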

Plots of all on-target reads, including waterfall plots and expansion size distributions,
as well as per-read motif counts, can be found in the reports directory:

analysis/reports/
├── m64012_191221_044659.ccsset.bc1015--bc1015.extracted_FMR1.counts.csv
├── m64012_191221_044659.ccsset.bc1015--bc1015.extracted_FMR1.insertSize.png
├── m64012_191221_044659.ccsset.bc1015--bc1015.extracted_FMR1.motifCount.png
├── m64012_191221_044659.ccsset.bc1015--bc1015.extracted_FMR1.waterfall.pdf
├── m64012_191221_044659.ccsset.bc1015--bc1015.extracted_HTT.counts.csv
├── m64012_191221_044659.ccsset.bc1015--bc1015.extracted_HTT.insertSize.png
├── m64012_191221_044659.ccsset.bc1015--bc1015.extracted_HTT.motifCount.png
├── m64012_191221_044659.ccsset.bc1015--bc1015.extracted_HTT.waterfall.pdf
... (truncated)

Clustered per-allele results with confidence intervals on repeat expansions 
and colorized BAMs (for viewing in IGV) can be found in the cluster directory:

analysis/cluster/
├── m64012_191221_044659.ccsset.bc1015--bc1015.FMR1.hptagged.bam
├── m64012_191221_044659.ccsset.bc1015--bc1015.FMR1.hptagged.bam.bai
├── m64012_191221_044659.ccsset.bc1015--bc1015.FMR1.readnames.txt
├── m64012_191221_044659.ccsset.bc1015--bc1015.FMR1.summary.csv
├── m64012_191221_044659.ccsset.bc1015--bc1015.HTT.hptagged.bam
├── m64012_191221_044659.ccsset.bc1015--bc1015.HTT.hptagged.bam.bai
├── m64012_191221_044659.ccsset.bc1015--bc1015.HTT.readnames.txt
├── m64012_191221_044659.ccsset.bc1015--bc1015.HTT.summary.csv
... (truncated)

========================
Raw Subreads
========================
The rawMovie/ folder contains the movie's subreads BAM file and its accompanying files.

rawMovie/
├── m64012_191221_044659.adapters.fasta
├── m64012_191221_044659.sts.xml
├── m64012_191221_044659.subreads.bam
├── m64012_191221_044659.subreads.bam.pbi
└── md5sums.txt
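
After downloading, the raw files can be checked against md5sums.txt. The
sketch below assumes the manifest uses the usual md5sum output format (one
"checksum  filename" pair per line, with file names relative to the
manifest's directory); that format is an assumption, as is the verify()
helper name.

```python
import hashlib
from pathlib import Path


def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(manifest):
    """Map each file named in the manifest to True (match) or False."""
    manifest = Path(manifest)
    results = {}
    for line in manifest.read_text().splitlines():
        if not line.strip():
            continue
        expected, name = line.split(maxsplit=1)
        name = name.strip().lstrip("*")  # md5sum marks binary mode with '*'
        results[name] = md5_of(manifest.parent / name) == expected.lower()
    return results


if __name__ == "__main__":
    for name, ok in verify("rawMovie/md5sums.txt").items():
        print(f"{name}: {'OK' if ok else 'FAILED'}")
```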

========================
Auxiliary files
========================
See the auxiliary/ directory for the barcode sequences and the target BED file.

auxiliary/
├── Barcoded_Adapter_8B.fasta
└── human_hs37d5.targets_repeatonly.bed
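
The target BED file can be read with a few lines of Python. The sketch
below relies only on the standard BED layout (chromosome, 0-based start,
end, and an optional name in the fourth column); whether this particular
file carries a name column is an assumption, and the Target type and
parse_bed() function are illustrative.

```python
from collections import namedtuple

Target = namedtuple("Target", "chrom start end name")


def parse_bed(lines):
    """Parse BED lines into Target tuples, skipping headers and blanks."""
    targets = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split()
        name = fields[3] if len(fields) > 3 else None
        targets.append(Target(fields[0], int(fields[1]), int(fields[2]), name))
    return targets


def span(target):
    """Length of a target in bases (BED intervals are half-open)."""
    return target.end - target.start
```

Usage: with open("auxiliary/human_hs37d5.targets_repeatonly.bed") as handle,
call parse_bed(handle) and print each target with its span.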

The hs37d5 (hg19) reference can be downloaded here:
ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz

========================
Intermediate Results
========================

The directories analysis/ccs and analysis/demux contain BAM files with unaligned CCS
and unaligned demultiplexed reads, respectively.

********************
REFERENCES
********************

[1] PacBio No-Amp landing page: https://www.pacb.com/applications/targeted-sequencing/no-amp-targeted-sequencing/
[2] Community tool RepeatAnalysis: https://github.com/PacificBiosciences/apps-scripts/tree/master/RepeatAnalysisTools
[3] PacBio pbbioconda landing page: https://github.com/PacificBiosciences/pbbioconda