README (Last Updated 06/08/2020)

********************
INTRODUCTION
********************

This README file describes the contents of this directory. This dataset contains raw, intermediate, and processed files of targeted sequence data for a set of 7 samples with repeat-expansion genotypes and 1 control sample with no repeat expansions at the targeted sites. The targeted sites of interest are HTT and FMR1. The library was sequenced on the Sequel II System and processed using community analysis tools available on GitHub. For more information on No-Amp methods[1] and bioinformatics analysis, see the PacBio GitHub[2] and the additional references below.

********************
SAMPLE
********************

Seven genomic DNA samples from the Coriell Institute and one DNA sample from the HEK293 cell line:

Samples with HTT CAG repeat expansions
  NA13505 with sequencing barcode BC1015
  NA13509 with sequencing barcode BC1016
  NA20253 with sequencing barcode BC1017
  NA14044 with sequencing barcode BC1018

Samples with FMR1 CGG repeat expansions
  NA13664 with sequencing barcode BC1020
  NA06896 with sequencing barcode BC1021
  NA07537 with sequencing barcode BC1022

Sample without known repeat expansions – HEK293 with sequencing barcode BC1019

********************
METHODS
********************

Library Preparation: Procedure & Checklist – No-Amp Targeted Sequencing Utilizing the CRISPR-Cas9 System (PN 101-801-500)

Sequencing: Sequel II System with Sequel II Binding Kit 2.0 (PN 101-842-900) and Sequel II Sequencing Kit 2.0 (4 rxn) (PN 101-820-200)

Run time: 30hr movie + 0.5hr pre-extension

Analysis: PacBio GitHub Repeat Analysis Tools pipeline[2] using the following executable versions for data preparation:
  ccs 4.2.0 (commit v4.2.0-1-g450908e4) (available from pbbioconda[3])
  lima 1.11.0 (commit v1.11.0-1-gec618c9)
  pbmm2 1.2.0 (commit v1.2.0-1-g31b4be0)

Post-mapping repeat analysis was performed using GitHub scripts[2].

********************
FILE DESCRIPTION
********************

========================
WHAT FILES SHOULD I USE?
========================

Users wishing to immediately make use of the processed, demultiplexed, and mapped results in third-party tools should use the following BAM files:

analysis/align/
├── m64012_191221_044659.ccsset.bc1015--bc1015.bam
├── m64012_191221_044659.ccsset.bc1016--bc1016.bam
├── m64012_191221_044659.ccsset.bc1017--bc1017.bam
├── m64012_191221_044659.ccsset.bc1018--bc1018.bam
├── m64012_191221_044659.ccsset.bc1019--bc1019.bam
├── m64012_191221_044659.ccsset.bc1020--bc1020.bam
├── m64012_191221_044659.ccsset.bc1021--bc1021.bam
└── m64012_191221_044659.ccsset.bc1022--bc1022.bam

Additionally, users who wish to use extracted repeat expansion regions as defined in the target BED file should use the following FASTQ files:

analysis/fastq/
├── m64012_191221_044659.ccsset.bc1015--bc1015.extracted_FMR1.fastq
├── m64012_191221_044659.ccsset.bc1015--bc1015.extracted_HTT.fastq
├── m64012_191221_044659.ccsset.bc1016--bc1016.extracted_FMR1.fastq
├── m64012_191221_044659.ccsset.bc1016--bc1016.extracted_HTT.fastq
├── m64012_191221_044659.ccsset.bc1017--bc1017.extracted_FMR1.fastq
├── m64012_191221_044659.ccsset.bc1017--bc1017.extracted_HTT.fastq
├── m64012_191221_044659.ccsset.bc1018--bc1018.extracted_FMR1.fastq
├── m64012_191221_044659.ccsset.bc1018--bc1018.extracted_HTT.fastq
├── m64012_191221_044659.ccsset.bc1019--bc1019.extracted_FMR1.fastq
├── m64012_191221_044659.ccsset.bc1019--bc1019.extracted_HTT.fastq
├── m64012_191221_044659.ccsset.bc1020--bc1020.extracted_FMR1.fastq
├── m64012_191221_044659.ccsset.bc1020--bc1020.extracted_HTT.fastq
├── m64012_191221_044659.ccsset.bc1021--bc1021.extracted_FMR1.fastq
├── m64012_191221_044659.ccsset.bc1021--bc1021.extracted_HTT.fastq
├── m64012_191221_044659.ccsset.bc1022--bc1022.extracted_FMR1.fastq
└── m64012_191221_044659.ccsset.bc1022--bc1022.extracted_HTT.fastq

Visual graphs of all on-target reads, including waterfall plots and expansion size distributions, as well as per-read motif counts, can be found in the reports directory:
analysis/reports/
├── m64012_191221_044659.ccsset.bc1015--bc1015.extracted_FMR1.counts.csv
├── m64012_191221_044659.ccsset.bc1015--bc1015.extracted_FMR1.insertSize.png
├── m64012_191221_044659.ccsset.bc1015--bc1015.extracted_FMR1.motifCount.png
├── m64012_191221_044659.ccsset.bc1015--bc1015.extracted_FMR1.waterfall.pdf
├── m64012_191221_044659.ccsset.bc1015--bc1015.extracted_HTT.counts.csv
├── m64012_191221_044659.ccsset.bc1015--bc1015.extracted_HTT.insertSize.png
├── m64012_191221_044659.ccsset.bc1015--bc1015.extracted_HTT.motifCount.png
├── m64012_191221_044659.ccsset.bc1015--bc1015.extracted_HTT.waterfall.pdf
... (truncated)

Clustered per-allele results with confidence intervals on repeat expansions and colorized BAMs (for viewing in IGV) can be found in the cluster directory:

analysis/cluster/
├── m64012_191221_044659.ccsset.bc1015--bc1015.FMR1.hptagged.bam
├── m64012_191221_044659.ccsset.bc1015--bc1015.FMR1.hptagged.bam.bai
├── m64012_191221_044659.ccsset.bc1015--bc1015.FMR1.readnames.txt
├── m64012_191221_044659.ccsset.bc1015--bc1015.FMR1.summary.csv
├── m64012_191221_044659.ccsset.bc1015--bc1015.HTT.hptagged.bam
├── m64012_191221_044659.ccsset.bc1015--bc1015.HTT.hptagged.bam.bai
├── m64012_191221_044659.ccsset.bc1015--bc1015.HTT.readnames.txt
├── m64012_191221_044659.ccsset.bc1015--bc1015.HTT.summary.csv
... (truncated)

========================
Raw Subreads
========================

The rawMovie/ folder contains the movie BAM file.

rawMovie/
|---- m64012_191221_044659.adapters.fasta
|---- m64012_191221_044659.sts.xml
|---- m64012_191221_044659.subreads.bam
|---- m64012_191221_044659.subreads.bam.pbi
|---- md5sums.txt

========================
Auxiliary files
========================

See the auxiliary/ directory for barcodes and the target BED file.
auxiliary/
├── Barcoded_Adapter_8B.fasta
└── human_hs37d5.targets_repeatonly.bed

hs37d5 (hg19) reference can be downloaded here:
ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz

========================
Intermediate Results
========================

The directories analysis/ccs and analysis/demux contain BAM files with unaligned CCS and unaligned demultiplexed reads, respectively.

********************
REFERENCES
********************

[1] PacBio No-Amp landing page: https://www.pacb.com/applications/targeted-sequencing/no-amp-targeted-sequencing/
[2] Community tool RepeatAnalysis: https://github.com/PacificBiosciences/apps-scripts/tree/master/RepeatAnalysisTools
[3] PacBio pbbioconda landing page: https://github.com/PacificBiosciences/pbbioconda
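
********************
EXAMPLE: LOCATING PER-SAMPLE FILES
********************

The file names above follow a regular pattern (movie name, barcode pair, target), so the per-sample BAM and FASTQ paths can be derived from the barcode assignments in the SAMPLE section. The following is a minimal Python sketch of that mapping, not part of the original analysis pipeline. The helper names (aligned_bam, extracted_fastq, count_fastq_records) are hypothetical, and the read counter assumes plain four-line FASTQ records, which has not been verified against these files.

```python
from pathlib import Path

# Movie name shared by all files in this dataset.
MOVIE = "m64012_191221_044659"

# Barcode -> (sample, expanded locus), as listed in the SAMPLE section.
# HEK293 is the control with no known expansion, hence None.
SAMPLES = {
    "bc1015": ("NA13505", "HTT"),
    "bc1016": ("NA13509", "HTT"),
    "bc1017": ("NA20253", "HTT"),
    "bc1018": ("NA14044", "HTT"),
    "bc1019": ("HEK293", None),
    "bc1020": ("NA13664", "FMR1"),
    "bc1021": ("NA06896", "FMR1"),
    "bc1022": ("NA07537", "FMR1"),
}

# Every barcode has an extracted FASTQ for both targeted sites.
TARGETS = ("FMR1", "HTT")


def aligned_bam(barcode, root="analysis"):
    """Path of the demultiplexed, mapped BAM for one barcode (hypothetical helper)."""
    return Path(root) / "align" / f"{MOVIE}.ccsset.{barcode}--{barcode}.bam"


def extracted_fastq(barcode, target, root="analysis"):
    """Path of the extracted repeat-region FASTQ for one barcode and target."""
    name = f"{MOVIE}.ccsset.{barcode}--{barcode}.extracted_{target}.fastq"
    return Path(root) / "fastq" / name


def count_fastq_records(path):
    """Count reads, ASSUMING each record is exactly four lines (unwrapped FASTQ)."""
    with open(path) as handle:
        return sum(1 for _ in handle) // 4


if __name__ == "__main__":
    # Print one row per sample and target; files that are absent are reported
    # as "missing" rather than raising an error.
    for barcode, (sample, expansion) in sorted(SAMPLES.items()):
        for target in TARGETS:
            fastq = extracted_fastq(barcode, target)
            reads = count_fastq_records(fastq) if fastq.exists() else "missing"
            print(sample, barcode, expansion or "none", target, reads, sep="\t")
```

Run from the top-level dataset directory, this lists each sample with its barcode, expected expansion, and the number of extracted reads per target. For the BAM files, a BAM-aware tool is needed; the paths returned by aligned_bam can be passed to it directly.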