# Illumina NovaSeq Sequencing of SeraCare Seraseq(R) ctDNA Complete Reference Material target enrichment library with Agilent SureSelect XTHS2 library prep and Comprehensive Cancer Panel (CCP) probes

## Legal disclaimer

All trademarks, trade names, or logos mentioned or used are the property of their respective owners.

## Data

The "NovaSeq_SeraCare_Agilent_ctDNA" folder contains raw (untrimmed) reads from target-enriched libraries sequenced on an Illumina NovaSeq instrument. The read data contain Agilent UMIs and Illumina P5/P7 adapters that should be trimmed prior to analysis (see Data Processing below). The libraries were sequenced with paired-end 2x150 bp chemistry and demultiplexed by the sequencing provider.

### Samples

Below is a brief description of the four samples and two controls:

- SureSelectXTHS2-SCv3-0p05P -- VAFs at 0.05%
- SureSelectXTHS2-SCv3-0p1P  -- VAFs at 0.1%
- SureSelectXTHS2-SCv3-0p25P -- VAFs at 0.25%
- SureSelectXTHS2-SCv3-0p5P  -- VAFs at 0.5%
- SureSelectXTHS2-SCv3-wt    -- The WT control provided by SeraCare
- SureSelectXTHS2-GM24385    -- GM24385 control prepared by PacBio

### Data Processing

Reads require trimming prior to alignment. An in-house tool was used to trim the Agilent UMIs and move them into the read headers. Briefly, the tool takes the first 3 bp of each read (R1 and R2) and adds the two sequences to each read's header in the FASTQ file. The same step can be performed with Agilent's AGeNT Trimmer tool. More information can be found here: https://www.agilent.com/en/product/next-generation-sequencing/hybridization-based-next-generation-sequencing-ngs/ngs-software/agent-232879.
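The in-house tool is not distributed with this dataset, so the following is only a minimal sketch of the UMI step described above. The function name `extract_umis`, the record layout, and the header format (both UMIs joined by a hyphen and appended to the read ID after a colon) are assumptions made for illustration; the actual tool may format headers differently.

```python
from typing import Tuple

# Hypothetical record layout: (header without '@', sequence, qualities).
FastqRecord = Tuple[str, str, str]


def extract_umis(
    r1: FastqRecord, r2: FastqRecord, umi_len: int = 3
) -> Tuple[FastqRecord, FastqRecord]:
    """Move the first `umi_len` bases of R1 and R2 into both read headers."""
    name1, seq1, qual1 = r1
    name2, seq2, qual2 = r2

    # Assumed format: R1 UMI and R2 UMI joined with a hyphen.
    umi = f"{seq1[:umi_len]}-{seq2[:umi_len]}"

    def tag(name: str) -> str:
        # Append the UMI to the read ID (text before the first space) so it
        # survives aligners that drop the FASTQ comment field.
        read_id, _, comment = name.partition(" ")
        tagged = f"{read_id}:{umi}"
        return f"{tagged} {comment}" if comment else tagged

    # Remove the UMI bases and their qualities from the reads themselves.
    new_r1 = (tag(name1), seq1[umi_len:], qual1[umi_len:])
    new_r2 = (tag(name2), seq2[umi_len:], qual2[umi_len:])
    return new_r1, new_r2
```

Placing the UMI at the end of the read ID, separated by a colon, is chosen here because that is the layout `fgbio CopyUmiFromReadName` reads from in the alignment step.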

We then use `cutadapt` to trim the Illumina P5/P7 adapters from the 3' end of each read, and then clip additional bases from the 5' and 3' ends to remove any lingering adapter sequence or artifacts from A-tailing.

### Trimming

Adapter trimming was completed with the following (generic) command:

```bash
cutadapt \
    -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCA \
    -A AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT \
    --overlap 3 \
    -j 10 \
    -m 10 \
    -o {output.fastq1} \
    -p {output.fastq2} \
    {input_filt.fastq1} \
    {input_filt.fastq2}
```
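To illustrate what `-a`/`-A` and `--overlap 3` ask for, here is a simplified sketch of 3' adapter removal. It uses exact matching only, whereas `cutadapt` itself aligns the adapter and tolerates mismatches, so this is an illustration of the idea and not a reimplementation. The function name `trim_3prime_adapter` is hypothetical.

```python
def trim_3prime_adapter(seq: str, adapter: str, min_overlap: int = 3) -> str:
    """Remove a 3' adapter and everything after it (exact matches only)."""
    # Case 1: the full adapter occurs somewhere inside the read.
    idx = seq.find(adapter)
    if idx != -1:
        return seq[:idx]

    # Case 2: the read ends partway through the adapter. Accept the match
    # only if at least `min_overlap` adapter bases are present, trying the
    # longest possible overlap first.
    first_start = max(0, len(seq) - len(adapter) + 1)
    last_start = len(seq) - min_overlap
    for start in range(first_start, last_start + 1):
        if adapter.startswith(seq[start:]):
            return seq[:start]

    # No adapter found: return the read unchanged.
    return seq
```

With a minimum overlap of 3, a read ending in only `AG` is left untouched, which avoids trimming on very short chance matches.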

Additional bases (2 from the 5' end and 5 from the 3' end of each read) were removed with the following commands:

```bash
cutadapt \
    --cut 2 \
    --cut -5 \
    -j 12 \
    -o {output.fastq1} \
    {input.fastq1} > {params.log1}

cutadapt \
    --cut 2 \
    --cut -5 \
    -j 12 \
    -o {output.fastq2} \
    {input.fastq2} > {params.log2}
```
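As a quick check on which end each `--cut` value affects, the clipping above is equivalent to the slice in this small sketch: a positive value removes bases from the 5' end and a negative value removes them from the 3' end. The function name `clip_read` is hypothetical.

```python
from typing import Tuple


def clip_read(
    seq: str, qual: str, five_prime: int = 2, three_prime: int = 5
) -> Tuple[str, str]:
    """Mimic `--cut 2 --cut -5`: drop 2 bases at the 5' end, 5 at the 3' end."""
    end = len(seq) - three_prime
    if end <= five_prime:
        # The read is too short to survive clipping; nothing remains.
        return "", ""
    # Qualities are clipped with the same coordinates as the bases.
    return seq[five_prime:end], qual[five_prime:end]
```

Each read therefore loses 7 bases in this step, in addition to the UMI and any adapter sequence removed earlier.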


### Alignment

Reads were aligned to the GRCh38 reference without alt contigs (hg38_no_alt). Alignment was performed with the following commands, which also use `fgbio` to copy the UMIs from the FASTQ read headers into the SAM RX tag:

```bash
bwa mem -t48 -R "{params.rg_tag}" {REF} {input.fastq1} {input.fastq2} | \
    samtools sort -n - | \
    fgbio SetMateInformation | \
    fgbio CopyUmiFromReadName -i /dev/stdin -o {params.tmpbam}
        
samtools sort -@24 {params.tmpbam} -o {output.bam_out}
```
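For readers unfamiliar with `fgbio CopyUmiFromReadName`, the sketch below shows the read-name convention it relies on, as we understand it from the fgbio documentation: the read name is split on `:`, the last field is taken as the UMI, and `+` delimiters between multiple UMIs become `-` in the RX tag. The function `umi_from_read_name` is a hypothetical illustration, not part of fgbio.

```python
import re

# One or more UMIs made of ACGTN, separated by hyphens.
_UMI_PATTERN = re.compile(r"[ACGTN]+(-[ACGTN]+)*")


def umi_from_read_name(read_name: str) -> str:
    """Return the value that would be stored in the RX tag for this read name."""
    # The UMI is assumed to be the last colon-delimited field of the name.
    umi = read_name.rsplit(":", 1)[-1]
    # Normalize '+' delimiters to '-' between multiple UMIs.
    umi = umi.replace("+", "-")
    if not _UMI_PATTERN.fullmatch(umi):
        raise ValueError(f"read name does not end in a valid UMI: {read_name!r}")
    return umi
```

If a different header layout is used during UMI extraction, this step would need to be adapted accordingly.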

### Variant detection

Variants were detected using the `pysam` Python module with no filtering, in order to see which variants were present at each of the 19 SeraCare variant sites. Two variants appeared consistently in the controls of Onso runs as well as in sequencing runs from an orthogonal technology:

- **ALK.COSM28055**
- **BRCA1.COSM1383519**

We masked these variants from accuracy tabulations, as they are plausibly contaminants introduced into the control by the manufacturer.
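The detection scripts are not included here, so the following is only a sketch of the two ideas above: computing an unfiltered allele fraction at a site, and excluding the two masked variants before tabulating accuracy. The function names are hypothetical, and the sketch covers single-base substitutions only. In a `pysam` workflow the observed bases would typically come from a pileup column (for example via `PileupColumn.get_query_sequences()`), which is not shown.

```python
from collections import Counter
from typing import Iterable, List

# The two variants masked from accuracy tabulations in this README.
MASKED_VARIANTS = {"ALK.COSM28055", "BRCA1.COSM1383519"}


def allele_fraction(bases: Iterable[str], alt: str) -> float:
    """Fraction of observed bases at one site that match the alt allele."""
    counts = Counter(b.upper() for b in bases)
    depth = sum(counts.values())
    # With no coverage there is nothing to report; return 0.0 by convention.
    return counts[alt.upper()] / depth if depth else 0.0


def unmasked(variant_ids: Iterable[str]) -> List[str]:
    """Drop masked variants so they do not count toward accuracy."""
    return [v for v in variant_ids if v not in MASKED_VARIANTS]
```

For example, `unmasked(["ALK.COSM28055", "SITE_A"])` keeps only the placeholder ID `"SITE_A"`, which stands in for any of the remaining SeraCare sites.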

*Rev 2023-09-07*