README (Last Updated 05/09/2020)

******************** INTRODUCTION ********************

This README file describes the contents of this directory.

This dataset contains processed data from SARS-CoV-2 sequencing on PacBio systems [1] using the Eden primer set [2] on ATCC full-length controls [3]. Bioinformatics processing is described in the CoSA tutorial [4], using the 2020-05-01 version of the workflow.

For issues or questions regarding this dataset, file a "bug" at https://github.com/Magdoll/CoSA/issues.

******************** SAMPLE ********************

ATCC VR-1986D Lot# 70034826 (https://www.atcc.org/en/Global/Products/VR-1986D.aspx)

******************** METHODS ********************

Library Preparation & Sequencing:
The library was constructed using SMRTbell Express Template Prep Kit 2.0. Sequencing was done on one SMRT Cell 8M on the Sequel II system for 15hr with 0.6hr pre-extension time using Sequel II Binding Kit 2.0.

Analysis:
Detailed bioinformatics processing is described in the CoSA tutorial [4], using the 2020-05-01 version of the workflow. Briefly, CCS reads were generated using SMRT Link, then demultiplexed by M13 barcode. A second round of demultiplexing (using lima) was performed to identify the Eden primers, allowing only adjacent pairs (ex: A3F--A3R) and filtering out invalid pairs (ex: A1F--A3R). The demultiplexed, trimmed, and filtered CCS reads were then pooled and downsampled to 1000, 100, and 20 reads per amplicon using the CoSA script `subsample_amplicons.py`. Mapping to the reference genome was done using pbmm2 (a minimap2 wrapper), followed by variant calling using juliet (minorseq) with a --min-perc 10 frequency cutoff.

Analysis tool versions:
ccs v5.0.0 (using SMRT Link v9.1.0.94448)
lima v1.11.0
pbmm2 v1.2.1
juliet v1.12.0

******************** FILE DESCRIPTION ********************

NC_045512.2.fasta - the reference genome FASTA file. Note that the sequence ID is "NC_045512v2", to be consistent with the UCSC Genome Browser convention.
eden.primers.fasta - the Eden primers
eden.primers.plus_M13constant.fasta - the Eden primers, with the M13 constant sequence added
sarscov2.json - the SARS-CoV-2 config file used by juliet (minorseq) for variant calling
run_juliet_per_sample.sh - template command file for mapping and variant calling
subsampled.ccs.Q20.fastq - CCS (HiFi) amplicon reads. Barcodes and Eden primers have been trimmed.
subsampled.mapped.bam - mapping of "subsampled.ccs.Q20.fastq" to the reference genome
subsampled.minperc10.juliet.* - variant calling output from juliet (minorseq)

******************** FILE LIST ********************

├── NC_045512.2.fasta
├── run_juliet_per_sample.sh
├── sarscov2.json
├── eden.primers.fasta
├── eden.primers.plus_M13constant.fasta
├── subsample_1000
│   ├── subsampled.ccs.Q20.fastq
│   ├── subsampled.mapped.bam
│   ├── subsampled.mapped.bam.bai
│   ├── subsampled.minperc10.juliet.html
│   ├── subsampled.minperc10.juliet.json
│   └── subsampled.minperc10.juliet.vcf
├── subsample_100
│   ├── subsampled.ccs.Q20.fastq
│   ├── subsampled.mapped.bam
│   ├── subsampled.mapped.bam.bai
│   ├── subsampled.minperc10.juliet.html
│   ├── subsampled.minperc10.juliet.json
│   └── subsampled.minperc10.juliet.vcf
└── subsample_20
    ├── subsampled.ccs.Q20.fastq
    ├── subsampled.mapped.bam
    ├── subsampled.mapped.bam.bai
    ├── subsampled.minperc10.juliet.html
    ├── subsampled.minperc10.juliet.json
    └── subsampled.minperc10.juliet.vcf

******************** REFERENCES ********************

[1] https://www.pacb.com/covid-19
[2] https://www.pacb.com/wp-content/uploads/Customer-Collaboration-PacBio-Compatible-Eden-Protocol-for-SARS-CoV-2-Sequencing.pdf
[3] https://www.atcc.org/en/Global/Products/VR-1986HK.aspx
[4] https://github.com/Magdoll/CoSA
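
******************** EXAMPLE (ILLUSTRATIVE) ********************

The primer-pair filter and per-amplicon downsampling described under METHODS can be sketched as follows. This is a minimal illustration only, not the code of the CoSA script `subsample_amplicons.py`. It assumes that a valid pair is one whose forward and reverse primers name the same amplicon (as in the A3F--A3R example) and that each read carries a pair label in that form. The function names, the input format, and the seed parameter are all hypothetical.

```python
import random
import re
from collections import defaultdict

# Assumed label format, modeled on the README examples ("A3F--A3R").
PAIR_RE = re.compile(r"^(?P<fwd>\w+)F--(?P<rev>\w+)R$")


def amplicon_of(pair_label):
    """Return the amplicon name (e.g. "A3") for a valid pair, else None."""
    m = PAIR_RE.match(pair_label)
    if m is None or m.group("fwd") != m.group("rev"):
        return None  # malformed label or mismatched pair such as A1F--A3R
    return m.group("fwd")


def subsample_by_amplicon(reads, n, seed=0):
    """Keep at most n read IDs per amplicon.

    reads: iterable of (read_id, pair_label) tuples (hypothetical input format).
    Reads with invalid primer pairs are dropped before sampling.
    """
    groups = defaultdict(list)
    for read_id, label in reads:
        amp = amplicon_of(label)
        if amp is not None:
            groups[amp].append(read_id)
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return {
        amp: sorted(rng.sample(ids, min(n, len(ids))))
        for amp, ids in groups.items()
    }


if __name__ == "__main__":
    demo = [("r%d" % i, "A3F--A3R") for i in range(5)]
    demo += [("x1", "A1F--A3R"), ("y1", "A1F--A1R"), ("y2", "A1F--A1R")]
    print(subsample_by_amplicon(demo, n=3))
```

Amplicons with fewer than n reads keep all of their reads, so coverage in the lower-depth subsamples can be uneven across amplicons.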