README – PACBIO ALZHEIMER’S DISEASE PANEL DATA RELEASE

Last Updated: September 21, 2017

BACKGROUND

This README file describes the SMRT Sequencing data from an Alzheimer’s Disease (AD) targeted enrichment experiment. A custom 35-gene panel of candidate AD genes was designed in collaboration with IDT using xGen Lockdown Probes. A range of variants, including SNPs, insertions and deletions, was detected with long-read sequencing of a ~7 kb capture of genomic DNA isolated from the brain and skeletal muscle of two individuals with suspected AD. Furthermore, phased alleles could be distinguished by leveraging contiguous multi-kilobase reads and heterozygous SNPs.

This release contains only the genomic capture sequencing data presented in the 2017 AGBT poster “A method for the identification of variants in Alzheimer’s disease candidate genes and transcripts using hybridization capture combined with long-read sequencing” (http://www.pacb.com/proceedings/a-method-for-the-identification-of-variants-in-alzheimers-disease-candidate-genes-and-transcripts-using-hybridization-capture-combined-with-long-read-sequencing/).

Two subjects with AD were used in this study. Genomic DNA (gDNA) from brain tissue was obtained from a male subject, while skeletal muscle gDNA was obtained from a female subject.

This release contains the sequencing data from the Sequel System (sequencing chemistry SP1.0/SC1.2) and the following analysis data: circular consensus sequences (“CCS Mapping”) and subreads (“Resequencing”) aligned to the GRCh38 human reference genome. CCS analysis was done with SMRT Link v5.0.1. The CCS and ReSeq datasets are used as input for the targeted phasing consensus analysis available on GitHub, which outputs the phased reads. Phased genomic regions from this capture study are also made available for viewing in IGV 2.4 beta.

DATASETS and supporting files:

PacBioCapture_AD_S1_Brain_CCS.tar.gz – CCS reads of genomic capture data from brain tissue of Subject 1, aligned to the GRCh38 reference.
PacBioCapture_AD_S1_Brain_CCS_12plex.tar.gz – Brain CCS BAM containing randomly downsampled reads, as an example of the lower coverage that may be expected in a 12-plex sample experimental design. Aligned to the GRCh38 reference.
PacBioCapture_AD_S1_Brain_12plex_Phased.tar.gz – Phased genomic regions of captured brain gDNA, using coverage representative of a 12-plex sample capture experiment. Output of the targeted phasing consensus analysis workflow.
PacBioCapture_AD_S1_Brain_ReSeq.tar.gz – Subreads of brain gDNA aligned to the GRCh38 reference.
PacBioCapture_AD_S1_Brain_Subreads.tar.gz – Sequel data of captured brain tissue gDNA for importing into SMRT Link.
PacBioCapture_AD_S2_Skeletal_CCS.tar.gz – CCS reads of genomic capture data from skeletal muscle of Subject 2, aligned to the GRCh38 reference.
PacBioCapture_AD_S2_Skeletal_CCS_12plex.tar.gz – Skeletal muscle CCS BAM containing randomly downsampled reads, as an example of the lower coverage that may be expected in a 12-plex sample experimental design. Aligned to the GRCh38 reference.
PacBioCapture_AD_S2_Skeletal_12plex_Phased.tar.gz – Phased genomic regions of captured skeletal muscle gDNA, using coverage representative of a 12-plex sample capture experiment. Output of the targeted phasing consensus analysis workflow.
PacBioCapture_AD_S2_Skeletal_ReSeq.tar.gz – Subreads of skeletal muscle gDNA aligned to the GRCh38 reference.
PacBioCapture_AD_S2_Skeletal_Subreads.tar.gz – Sequel data of captured skeletal muscle gDNA for importing into SMRT Link.
PacBioCapture_AD_probes.hg38.bed – BED file for the custom AD panel capture probes.
PacBioCapture_AD_DATA_RELEASE.md5 – md5 checksum file to ensure proper data transfer.
GRCh38_REFERENCE.tar.gz – Contains the .fa and .fai files for the GRCh38 reference genome.

DATA ANALYSIS – GENERATING PHASED CONSENSUS SEQUENCES

GitHub Home: https://github.com/PacificBiosciences/targeted-phasing-consensus

Phasing Consensus Analysis for Targeted Sequencing Data (targeted-phasing-consensus.sh) is available on GitHub. This set of tools and its analysis workflow generate phased consensus sequences from PacBio sequencing data for probe-based hybridization enrichment studies. Please refer to the GitHub repository as the primary resource; it includes a tutorial detailing the analysis workflow.

The following is a walkthrough of the analysis steps in the Phasing Consensus Analysis for Targeted Sequencing Data workflow. Please note that this example uses the downsampled brain sequencing data to represent the coverage expected for a 12-plex sample capture experiment.

## HOW TO DOWNLOAD AND INSTALL THE SOFTWARE PACKAGE AND THE DATASET

1. Download and install targeted-phasing-consensus from GitHub.
2. Unzip and decompress the tar file into a new script folder ‘targeted-phasing-consensus’.
3. Download the datasets provided in this release:
   - PacBioCapture_AD_Brain_CCS_12plex.tar.gz
   - PacBioCapture_AD_Brain_ReSeq.tar.gz
   - PacBioCapture_AD_probes.hg38.bed
   - GRCh38_REFERENCE.tar.gz
4. Unzip and decompress the tar files into a new folder ‘DATASETS’.
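
The checksum and unpacking steps can be combined in a small shell helper. The following is a minimal sketch and not part of the release: the function name verify_and_extract is hypothetical, it assumes GNU md5sum, awk and tar are installed, and it assumes that PacBioCapture_AD_DATA_RELEASE.md5 uses the usual "<hash>  <file name>" layout, which this README does not show.

```shell
#!/usr/bin/env bash
# Sketch only: check each downloaded file against the release checksum list,
# then unpack it into a destination folder.
# ASSUMPTION: the .md5 file holds "<md5 hash>  <file name>" lines, as written
# by GNU md5sum; the real layout of the checksum file is not shown here.

# Usage: verify_and_extract <md5 file> <destination folder> <file>...
verify_and_extract() {
    local md5_file=$1 dest=$2
    shift 2
    mkdir -p "$dest" || return 1

    local file name expected actual
    for file in "$@"; do
        name=$(basename "$file")

        # Expected hash: first field of the line whose file name matches
        # (a leading "*" or directory prefix on the name is ignored).
        expected=$(awk -v f="$name" '
            { n = $2; sub(/^\*/, "", n); sub(/.*\//, "", n) }
            n == f { print $1; exit }' "$md5_file")

        # Actual hash of the downloaded file.
        actual=$(md5sum "$file" | awk '{ print $1 }')

        if [ -z "$expected" ] || [ "$expected" != "$actual" ]; then
            echo "checksum check failed for $name" >&2
            return 1
        fi

        case "$name" in
            *.tar.gz) tar -xzf "$file" -C "$dest" || return 1 ;;  # unzip and untar
            *)        cp "$file" "$dest"/ || return 1 ;;          # e.g. the probes BED file
        esac
    done
}
```

Calling verify_and_extract with the checksum file, the folder name DATASETS, and the four files from step 3 stops at the first file whose checksum is missing or does not match, so a damaged transfer is caught before any analysis starts.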

## SETTING THE ANALYSIS ENVIRONMENT

    # add the script folder ‘targeted-phasing-consensus’ to your path
    cd targeted-phasing-consensus
    export PATH=$PWD:$PATH

    # ensure that samtools, bedtools, and arrow are in your path
    samtools --version   # should report >= 1.3.1
    bedtools --version   # should report >= 2.25
    arrow --version      # should report >= 2.2.0

    # define some variables to make this easier to read
    CCSBAM=/path/to/your/DATASETS/PacBioCapture_AD_Brain_CCS_12plex/subset.12.ccs.bam   # CCS reads aligned to the reference
    SUBREADSBAM=/path/to/your/DATASETS/PacBioCapture_AD_Brain_ReSeq/alignmentset.bam    # subreads aligned to the reference
    REFERENCE=/path/to/your/DATASETS/GRCh38_REFERENCE/GRCh38_reference.fasta            # with indices
    PROBES_BED=/path/to/your/DATASETS/PacBioCapture_AD_probes.hg38.bed
    FRAG_SIZE=6000   # the sheared fragment size observed during sample preparation

## PHASING MULTIPLE TARGETED REGIONS

    # create a working directory and change into it, e.g.
    mkdir ~/phased_data
    cd ~/phased_data

    # copy the probes BED file to your working directory
    cp "$PROBES_BED" ./

    # produce a BED file named PacBioCapture_AD_probes.hg38.bed.targets with the target regions of interest
    capture2target.py PacBioCapture_AD_probes.hg38.bed $FRAG_SIZE

    # generate shell scripts to phase each region of interest
    generate_jobs.py ./PacBioCapture_AD_probes.hg38.bed.targets $CCSBAM $SUBREADSBAM $REFERENCE

    # if running locally without a cluster, you can launch all of the jobs with GNU parallel;
    # parallel manages the number of jobs running concurrently if you provide the '-j NUMBER' argument
    NUM_CORES=8   # set this to the number of concurrent jobs
    parallel -j $NUM_CORES 'bash {} > {}.out' ::: phase_*.sh

The targeted-phasing-consensus analysis workflow is now complete. Please refer to the targeted-phasing-consensus GitHub repository for guidance on the output files and on visualizing the data.

For Research Use Only. Not for use in diagnostic procedures. Copyright 2017, Pacific Biosciences of California, Inc. All rights reserved.
The data provided in these files is subject to change without notice and Pacific Biosciences assumes no responsibility for any errors or omissions. Certain notices, terms, conditions and/or use restrictions may pertain to your use of Pacific Biosciences data, products and/or third party products. Please refer to the applicable Pacific Biosciences Terms and Conditions of Sale and to the applicable license terms at http://www.pacificbiosciences.com/licenses.html. Pacific Biosciences, the Pacific Biosciences logo, PacBio, SMRT, SMRTbell and Iso-Seq are trademarks of Pacific Biosciences. BluePippin and SageELF are trademarks of Sage Science, Inc. NGS-go and NGSengine are trademarks of GenDx. All other trademarks are the sole property of their respective owners.