Index of /public/dataset/Ecoli/egs
 Name                                Last modified      Size  Description
 README.html                         2021-02-09 07:34  2.6K  
 README.txt                          2021-02-09 07:34  2.6K  
 ecoli_pbi_Jan2021_majorStrain.fasta 2021-01-12 07:04  4.4M  
 m64004_200618_002500.hifi_reads.bam 2020-09-24 17:32   20G  
 m64004_200618_002500.subreads.bam   2020-06-18 18:07  433G  
================================
# E.coli Gold Standard (EGS)
# Goal
- Share gold-standard E.coli sample data:
  - vetted reference sequence
  - PacBio sequencing data
  - documentation of biologically irreducible minor-variant
  contaminants to be filtered out
================================
# Data Locations: Reference and Sequencing Reads
- Gold-standard E.coli sample, PacBio (PACB) sequencing
data type     | size            | link / file
------------- | --------------- | -----------------------------------
Reference     |       4,639,002 | ecoli_pbi_Jan2021_majorStrain.fasta
CCS HiFi Reads|  21,591,167,182 | m64004_200618_002500.hifi_reads.bam
Raw Subreads  | 464,943,830,201 | m64004_200618_002500.subreads.bam
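
The size column gives no unit. A minimal sketch of a sanity check, assuming the values are byte counts and that the directory listing uses 1024-based suffixes (both are assumptions, not stated in this README): converting the table values should reproduce the 4.4M, 20G and 433G shown in the listing. The helper name `human_size` is hypothetical.

```python
# Sizes copied from the table above; ASSUMED to be byte counts.
SIZES = {
    "ecoli_pbi_Jan2021_majorStrain.fasta": 4_639_002,
    "m64004_200618_002500.hifi_reads.bam": 21_591_167_182,
    "m64004_200618_002500.subreads.bam": 464_943_830_201,
}


def human_size(n_bytes: int) -> str:
    """Format a byte count with 1024-based K/M/G/T suffixes.

    One decimal below 10 units, whole numbers otherwise. This mimics the
    look of the directory listing and is an assumption, not a specification.
    """
    value = float(n_bytes)
    suffix = ""
    for suffix in ("", "K", "M", "G", "T"):
        if value < 1024 or suffix == "T":
            break
        value /= 1024
    if suffix == "":
        return str(n_bytes)
    return f"{value:.1f}{suffix}" if value < 10 else f"{value:.0f}{suffix}"


if __name__ == "__main__":
    for name, size in SIZES.items():
        print(f"{name}: {size:,} bytes = {human_size(size)}")
```

After downloading, the same function can be applied to `os.path.getsize(path)` to compare a local copy against the table.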
================================
# Data Methods
- PacBio Data Methods
PACB key         | value
---------------- | ----------------------------------------------------------------------
Sample           | ATCC E.coli K12 MG1655 prep WL_052920
Shearing         | 15 kb Megaruptor3, Shear Speed 37, 10 ug/column, 500 uL per column
Size selection   | BluePippin U1 12 kb to 18 kb
Library prep     | TPK-1, Express V2, 4 EM, WL_061420b
Sequencing       | Sequel II System with BINDINGKIT=101-820-500 SEQUENCINGKIT=101-826-100
Run time         | 2.9-hour pre-extension; 15-hour movie
CCS              | SMRT Link 10.0.0, Circular Consensus Sequence Analysis (ccs v5.0.0)
Alignment        | pbmm2 1.5.0 (commit v1.5.0-2-g464414e)
Alignment Params | --min-concordance-perc 70.0 --min-length 50 --preset CCS
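
The alignment parameters above can be assembled into a command line. This is a minimal sketch, not the exact command that was run: it assumes the usual `pbmm2 align <reference> <reads> <output>` argument order, the output file name is hypothetical, and any other options used at the time (threads, sorting) are not recorded here and are left out.

```python
import shlex


def pbmm2_align_command(reference: str, reads: str, output: str) -> list:
    """Build an argv list for pbmm2 using the parameters from the table."""
    return [
        "pbmm2", "align",
        "--preset", "CCS",
        "--min-concordance-perc", "70.0",
        "--min-length", "50",
        reference,
        reads,
        output,  # hypothetical output name, not part of this dataset
    ]


if __name__ == "__main__":
    cmd = pbmm2_align_command(
        "ecoli_pbi_Jan2021_majorStrain.fasta",
        "m64004_200618_002500.hifi_reads.bam",
        "hifi_reads.aligned.bam",
    )
    # Print the command for inspection; pass `cmd` to subprocess.run to execute.
    print(shlex.join(cmd))
```

Option names can differ between pbmm2 releases, so check `pbmm2 align --help` for the installed version before running.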
- PacBio CCS Stats
value     | statistic
--------- | -----------------------------
1,519,099 | HiFi reads
13,956    | mean read length (bases)
Q29       | median predicted read quality
8         | mean number of passes
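
As a rough cross-check, the read count and mean read length imply a total HiFi yield, and dividing by the genome length gives an expected depth. This sketch assumes the 4,639,002 figure from the data table approximates the reference length in bases (the table does not state its unit). The result, about 21.2 Gb and roughly 4,570-fold, is close to the mean coverage reported in this README; small differences are expected because the mean read length is rounded and not every base necessarily aligns.

```python
HIFI_READS = 1_519_099
MEAN_READ_LENGTH = 13_956      # bases, from the CCS stats table
REFERENCE_LENGTH = 4_639_002   # ASSUMED to approximate the genome length in bases


def estimated_yield(n_reads: int, mean_length: int) -> int:
    """Total bases implied by read count times mean read length."""
    return n_reads * mean_length


def estimated_depth(n_reads: int, mean_length: int, genome_length: int) -> float:
    """Average depth if every base aligned uniformly across the genome."""
    return estimated_yield(n_reads, mean_length) / genome_length


if __name__ == "__main__":
    total = estimated_yield(HIFI_READS, MEAN_READ_LENGTH)
    depth = estimated_depth(HIFI_READS, MEAN_READ_LENGTH, REFERENCE_LENGTH)
    print(f"estimated HiFi yield: {total:,} bases")
    print(f"estimated mean depth: {depth:,.0f}x")
```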
- Overall error rates and coverages
median error | mean error | median coverage | mean coverage
------------ | ---------- | --------------- | -------------
QV31.7       | QV26.0     | 4561            | 4557
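
The QV figures can be read as error probabilities, assuming the usual Phred definition QV = -10 * log10(p). A small sketch of the conversion in both directions (function names are illustrative):

```python
import math


def qv_to_error_rate(qv: float) -> float:
    """Convert a Phred-scaled quality value to an error probability."""
    return 10 ** (-qv / 10)


def error_rate_to_qv(p: float) -> float:
    """Convert an error probability to a Phred-scaled quality value."""
    return -10 * math.log10(p)


if __name__ == "__main__":
    for label, qv in [("median error", 31.7), ("mean error", 26.0),
                      ("median predicted read quality", 29.0)]:
        p = qv_to_error_rate(qv)
        print(f"{label}: QV{qv} = {p:.2e} errors per base (about 1 in {1 / p:,.0f})")
```

Under this definition QV31.7 is about 6.8e-4 errors per base and QV26.0 about 2.5e-3.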
================================
# Biological Minor Subspecies
- There are biological minor subspecies present in our sample.
  - Our best practices aimed for as little biological variation as
  possible. These variants appear to be biologically irreducible given
  our methods.
  - The subspecies were indicated by large error events.
  - Reads that map to these minor subspecies at these locations should
  be filtered out, as they differ from the reference
  sequence.
location (bp)   | size (bp) | abundance | cause
--------------- | --------- | --------- | -------------------------
1035756-1037553 | ~2k       | 20%       | prophage inversion
2343429-2343725 | ~300      | 2%        | fimB regulatory inversion
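
A minimal sketch of how the table above might be used to flag alignments for filtering. It assumes the locations are 1-based inclusive reference coordinates (not stated here) and only tests for overlap; deciding whether a given read actually carries the minor variant needs further inspection. The names `MINOR_REGIONS` and `overlaps_minor_region` are illustrative.

```python
# Regions from the table above; ASSUMED 1-based, inclusive coordinates.
MINOR_REGIONS = [
    (1_035_756, 1_037_553),  # prophage inversion, ~20% abundance
    (2_343_429, 2_343_725),  # fimB regulatory inversion, ~2% abundance
]


def overlaps_minor_region(start: int, end: int, regions=MINOR_REGIONS) -> bool:
    """Return True if the 1-based inclusive interval [start, end] overlaps any region."""
    return any(start <= r_end and end >= r_start for r_start, r_end in regions)


if __name__ == "__main__":
    # Hypothetical alignment intervals (start, end) on the reference.
    examples = [(1_030_000, 1_036_000), (2_000_000, 2_014_000), (2_343_700, 2_350_000)]
    for start, end in examples:
        flag = "candidate for filtering" if overlaps_minor_region(start, end) else "keep"
        print(f"{start}-{end}: {flag}")
```

When reading alignments with a BAM library that reports 0-based, end-exclusive positions (as pysam does for `reference_start` and `reference_end`), add 1 to the start before calling the function and pass the end unchanged.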
================================